Data preprocessing is the first step in web mining, data mining, information retrieval, and pattern recognition [1]. When the data is well prepared, the mined results are more accurate and reliable. The principal aim of data preparation [2], [3], [6] is to provide quality data for the later steps of web mining and information retrieval.
Web page classification [4] is one of the essential techniques for Web mining, because identifying the Web pages that belong to a class of interest is often the first step of mining the Web. However, constructing a classifier for such a class requires laborious preprocessing. Although Yahoo and similar Web directory services use human readers to classify Web documents, the lower cost and higher speed of automatic classification make it highly desirable. Typical classification methods [4] use positive and negative examples as training sets and then assign each document a class label from a set of predefined topic categories. The proposed research will identify such preprocessing problems and develop efficient algorithms, based on machine learning and AI techniques, to overcome them and improve the web mining process.
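As an illustration of training on labeled examples and then assigning a class label, here is a minimal sketch of a multinomial naive Bayes text classifier. Naive Bayes is chosen here only because it is compact; it is not claimed to be the method of [4] or of the proposed work, and the function names `tokenize`, `train`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization on lowercased text is enough for a sketch.
    return text.lower().split()


def train(examples):
    """Build per-class word counts from (text, label) training pairs."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest log-probability under naive Bayes."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set: "sports" pages are the positive class, "other" the negative.
training = [
    ("football match score goal", "sports"),
    ("team wins league match", "sports"),
    ("stock market shares price", "other"),
    ("bank interest rate price", "other"),
]
model = train(training)
print(classify(model, "goal in the match"))
```

A real Web page classifier would first need the preprocessing discussed here (markup removal, tokenization, noise filtering) before counts like these become meaningful.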
Web pages [3], especially dynamically generated ones, contain several items that cannot be classified as the “primary content,” e.g., navigation sidebars, advertisements, and copyright notices. Most clients and end-users search for the primary content [5], [7] and largely do not seek the non-informative content. A tool that helps an end-user or application search and process information from Web pages automatically [8] must separate the “primary content sections” from the other content sections. The proposed system should process Web pages in such a way that search engines and web miners concentrate only on the primary content and hence retrieve the best information for users.
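One simple way to approximate this separation is a link-density heuristic: blocks that are short or consist mostly of anchor text are treated as navigation or boilerplate. The sketch below uses only Python's standard `html.parser`; the class `BlockCollector`, the function `primary_blocks`, the tag set, and both thresholds are assumptions made for illustration, not the techniques of [5], [7], or [8]. It also ignores `<script>` and `<style>` content, which a real tool would have to drop.

```python
from html.parser import HTMLParser

# Assumption: these tags mark block boundaries well enough for a sketch.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article", "nav", "footer"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the anchor-text characters in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks as (text, link_chars)
        self._parts = []
        self._link_chars = 0
        self._link_depth = 0

    def _flush(self):
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        self._parts.append(stripped)
        if self._link_depth:
            self._link_chars += len(stripped)

    def close(self):
        super().close()
        self._flush()


def primary_blocks(html, max_link_density=0.3, min_chars=40):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, link_chars in collector.blocks:
        link_density = link_chars / len(text)
        if len(text) >= min_chars and link_density <= max_link_density:
            kept.append(text)
    return kept


page = """
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<p>Data preprocessing prepares raw web pages so that later mining steps work on clean input.</p>
<div>Copyright 2010</div>
"""
for block in primary_blocks(page):
    print(block)
```

On this toy page, the navigation bar is rejected for its high link density and the copyright line for its length, leaving only the paragraph.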
Modeling user preference [10] is one of the challenging issues in intelligent information systems. Extensive research has been devoted to analyzing user preference automatically and putting it to use. This, too, is a preprocessing step that must be completed before web mining and search are applied. It helps web miners present personalized, rich information to users.
We believe data mining should be integrated with the Web search engine service to enhance the quality of Web searches. To do so, we can start by enlarging the set of search keywords to include a set of keyword synonyms for web search [9]. This preprocessing step lets search engines match on semantics rather than on keywords alone.
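The keyword-enlargement idea can be sketched as a simple query-expansion step. The synonym table `SYNONYMS` below is a hand-written, hypothetical stand-in for whatever thesaurus or mined synonym set a real system would use, and the functions `expand_query` and `to_boolean_query` are illustrative names, not part of [9].

```python
# Hypothetical synonym table; a real system might mine this from data or a thesaurus.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return one group per query term: the term followed by its synonyms."""
    expanded = []
    for term in query.lower().split():
        group = [term] + [s for s in synonyms.get(term, []) if s != term]
        expanded.append(group)
    return expanded


def to_boolean_query(groups):
    """Render groups as a boolean query: OR within a group, AND across groups."""
    clauses = []
    for group in groups:
        if len(group) == 1:
            clauses.append(group[0])
        else:
            clauses.append("(" + " OR ".join(group) + ")")
    return " AND ".join(clauses)


print(to_boolean_query(expand_query("cheap car rental")))
# prints: (cheap OR inexpensive OR affordable) AND (car OR automobile OR vehicle) AND rental
```

Using OR within each group widens recall for a term, while AND across groups keeps the original intent of the query.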
Hence, processing web pages, user profiles, user behaviors, and access patterns, and bringing them into the format required by web miners and search engines, is an important research area for web intelligence.