The Computer Journal Advance Access originally published online on October 14, 2008
The Computer Journal 2009 52(8):890-901; doi:10.1093/comjnl/bxn049
This article appears in the following The Computer Journal issue: Incorporating Systems, communications and services in smart homes and Software engineering for e-business Special Issues
GA-Based Keyword Selection for the Design of an Intelligent Web Document Search System
Department of Computer Science and Information Engineering, Chung Hua University, No. 707, Section 2, WuFu Road, Hsinchu, 300 Taiwan, Republic of China
* Corresponding author: chc{at}chu.edu.tw
Received 31 January 2008; revised 31 July 2008
The main steps in designing an automatic document classification system are feature extraction and classification. This article proposes a method to improve feature extraction. In this method, a genetic algorithm is applied to determine the threshold values of four criteria used to extract the representative keywords for each class. The purpose of these four threshold values is to extract as few representative keywords as possible. The keyword extraction method is combined with two classification algorithms, the vector space model and the support vector machine, to examine the performance of the proposed classification system under various extraction conditions.
Key Words: web document classification; keyword extraction; genetic algorithm; vector space method; support vector machine
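
The threshold search summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption. The abstract does not name the four criteria, so each candidate keyword simply carries four hypothetical scores in [0, 1]. The data are synthetic. The fitness function is a stand-in that rewards recovering the known-good keywords and penalizes the number of keywords kept, where the article would presumably use classifier performance. The genetic operators (tournament selection, uniform crossover, Gaussian mutation, elitism) are generic choices.

```python
import random

N_CRITERIA = 4  # the abstract mentions four criteria; their definitions are not given


def make_keywords(rng, n=60, n_good=10):
    """Synthetic candidates: (four hypothetical criterion scores, is_representative)."""
    keywords = []
    for i in range(n):
        good = i < n_good
        mean = 0.7 if good else 0.2
        scores = tuple(min(1.0, max(0.0, rng.gauss(mean, 0.15)))
                       for _ in range(N_CRITERIA))
        keywords.append((scores, good))
    return keywords


def select(keywords, thresholds):
    """Keep a keyword only if every criterion score reaches its threshold."""
    return [kw for kw in keywords
            if all(s >= t for s, t in zip(kw[0], thresholds))]


def fitness(keywords, thresholds, size_penalty=0.01):
    """Stand-in objective (assumption): recall of good keywords minus a size penalty."""
    chosen = select(keywords, thresholds)
    total_good = sum(1 for _, good in keywords if good)
    hits = sum(1 for _, good in chosen if good)
    recall = hits / total_good if total_good else 0.0
    return recall - size_penalty * len(chosen)


def evolve(keywords, rng, pop_size=30, generations=40, mutation_rate=0.2):
    """Generic GA over threshold vectors; returns (best thresholds, best score per generation)."""
    def score(ind):
        return fitness(keywords, ind)

    population = [[rng.random() for _ in range(N_CRITERIA)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        # Elitism: carry the two best individuals over unchanged.
        next_gen = [population[0][:], population[1][:]]
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            # Uniform crossover followed by clipped Gaussian mutation.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.1)))
                     if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=score)
    history.append(score(best))
    return best, history


if __name__ == "__main__":
    rng = random.Random(0)
    kws = make_keywords(rng)
    best, history = evolve(kws, rng)
    print("thresholds:", [round(t, 3) for t in best])
    print("keywords kept:", len(select(kws, best)))
```

The size penalty is what pushes the search toward small keyword sets. Setting `size_penalty` to zero in this sketch removes that pressure, and the search then tends to favor low thresholds that keep nearly every candidate.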