
Monday, January 21, 2008

Intelligent Preprocessor for Search Engines and Web Miners

Data preprocessing is the first step in web mining, data mining, information retrieval and pattern recognition [1]. Once the data is well prepared, the mined results are more accurate and reliable. The principal aim of data preparation [2], [3], [6] is to provide quality data for the subsequent steps of web mining and information retrieval.

Web page classification [4] is one of the essential techniques for Web mining, because classifying Web pages into a class of interest is often the first step of mining the Web. However, constructing a classifier for a class of interest requires laborious preprocessing. Although Yahoo and similar Web directory services use human readers to classify Web documents, reduced cost and increased speed make automatic classification highly desirable. Typical classification methods [4] use positive and negative examples as training sets, and then assign each document a class label from a set of predefined topic categories. The proposed research work will identify such preprocessing problems and develop efficient algorithms, based on machine learning and AI techniques, to overcome these difficulties and improve the web mining process.
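
As a minimal illustrative sketch of this classification step (assuming Python with scikit-learn is available; the sample documents and labels are invented for illustration only), positive and negative example pages can train a classifier that assigns a label to unseen pages:

    # Sketch: train a text classifier from positive (1) and negative (0) example pages.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical training data: page text already stripped of HTML markup.
    train_texts = [
        "tutorial on data mining algorithms and web page classification",
        "machine learning techniques for information retrieval",
        "buy cheap flight tickets and hotel deals online",
        "celebrity gossip and entertainment news of the week",
    ]
    train_labels = [1, 1, 0, 0]   # 1 = class of interest, 0 = everything else

    # Bag-of-words features weighted by TF-IDF feed a Naive Bayes classifier.
    classifier = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
    classifier.fit(train_texts, train_labels)

    # Assign a class label to an unseen page.
    new_page = "an introduction to pattern recognition and text mining"
    print(classifier.predict([new_page])[0])   # expected: 1 (class of interest)
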

Web pages [3]—especially dynamically generated ones—contain several items that cannot be classified as the “primary content,” e.g., navigation sidebars, advertisements, copyright notices, etc. Most clients and end-users search for the primary content [5], [7], and largely do not seek the non-informative content. A tool that assists an end-user or application in searching and processing information from Web pages automatically [8] must separate the “primary content sections” from the other content sections. The proposed system should process the web pages in such a way that the search engines and web miners concentrate only on the primary content and hence retrieve the best information for the users.
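
One common heuristic for this separation scores each block of a page by its link density and discards blocks dominated by anchor text, such as navigation panels. The sketch below, using only the Python standard library, illustrates the idea; the tag set and the 0.5 threshold are assumptions, not a prescribed design:

    # Sketch: keep "primary content" blocks, drop link-dominated boilerplate blocks.
    from html.parser import HTMLParser

    class BlockExtractor(HTMLParser):
        """Split a page into text blocks and record each block's link density."""
        BLOCK_TAGS = {"p", "div", "td", "li"}

        def __init__(self):
            super().__init__()
            self.blocks = []              # list of (text, link_density)
            self._text, self._anchor = [], []
            self._in_anchor = False

        def handle_starttag(self, tag, attrs):
            if tag in self.BLOCK_TAGS:
                self.flush_block()
            elif tag == "a":
                self._in_anchor = True

        def handle_endtag(self, tag):
            if tag == "a":
                self._in_anchor = False
            elif tag in self.BLOCK_TAGS:
                self.flush_block()

        def handle_data(self, data):
            self._text.append(data)
            if self._in_anchor:
                self._anchor.append(data)

        def flush_block(self):
            words = " ".join(self._text).split()
            anchor_words = " ".join(self._anchor).split()
            if words:
                self.blocks.append((" ".join(words), len(anchor_words) / len(words)))
            self._text, self._anchor = [], []

    def primary_content(html, max_link_density=0.5):
        """Keep only blocks whose text is not dominated by link anchors."""
        parser = BlockExtractor()
        parser.feed(html)
        parser.flush_block()
        return [text for text, density in parser.blocks if density < max_link_density]

    page = """<div><a href="/home">Home</a> <a href="/news">News</a></div>
              <p>Data preprocessing removes noise from web pages so that the
                 mined results are more accurate and reliable.</p>"""
    print(primary_content(page))   # keeps the paragraph, drops the navigation bar
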

Modeling user preference [10] is one of the challenging issues in intelligent information systems. Extensive research has been performed to automatically analyze user preference and to utilize it. This is again a preprocessing step to be performed before the web mining and searching process is applied. It will help web miners present personalized and rich information to users.
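
As one illustrative way such a preference model could be built (the sample history and the simple term-frequency scoring are assumptions for the sketch, not a claimed method), a user profile can be aggregated from the text of previously visited pages and used to score candidate pages:

    # Sketch: build a term-frequency user profile and score pages against it.
    from collections import Counter

    def build_profile(visited_pages):
        """Aggregate term frequencies over the text of pages a user has accessed."""
        profile = Counter()
        for text in visited_pages:
            profile.update(text.lower().split())
        return profile

    def preference_score(profile, page_text):
        """Score a candidate page by how familiar its terms are to the profile."""
        terms = page_text.lower().split()
        return sum(profile[t] for t in terms) / max(len(terms), 1)

    history = ["python tutorial on web mining", "web crawler design in python"]
    profile = build_profile(history)
    print(preference_score(profile, "mining the web with python"))   # higher score
    print(preference_score(profile, "holiday travel packages"))      # near zero
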

We believe data mining should be integrated with the Web search engine service to enhance the quality of Web searches. To do so, we can start by enlarging the set of search keywords to include a set of keyword synonyms for web search [9]. This preprocessing step will make the search engines search based on semantics instead of just keywords.
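
A minimal sketch of this keyword-expansion step is shown below; the hand-written synonym table is a stand-in (in practice it might come from a thesaurus such as WordNet), and the function name is illustrative:

    # Sketch: enlarge each query term with known synonyms before searching.
    SYNONYMS = {
        "car":   {"automobile", "vehicle"},
        "cheap": {"inexpensive", "affordable"},
        "buy":   {"purchase"},
    }

    def expand_query(query):
        """Return the original query terms plus any known synonyms."""
        expanded = []
        for term in query.lower().split():
            expanded.append(term)
            expanded.extend(sorted(SYNONYMS.get(term, ())))
        return " ".join(expanded)

    print(expand_query("buy cheap car"))
    # buy purchase cheap affordable inexpensive car automobile vehicle
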

Hence, processing web pages, user profiles, user behaviors and access patterns, and bringing them into the format desired by web miners and search engines, is an important research area for web intelligence.

Efficient System for Information Extraction from the Web Using Soft Computing Approaches

Information extraction is the most frequently performed activity on the web. Searching, comprehending and using the semi-structured information [2] stored on the web poses a significant challenge, because this data is more sophisticated and dynamic than the information that commercial database systems store. To supplement keyword-based indexing, which forms the cornerstone of web search engines, researchers have applied data mining to web-page ranking [1]. In this context, data mining helps web search engines find high-quality web pages for users.

Defining how to design an efficient information extraction system [3] for the web presents a major research challenge. Achieving this requires overcoming two fundamental problems. First, the traditional schemes for accessing the immense amounts of data that reside on the web fundamentally assume a text-oriented, keyword-based view [6] of web pages. Second, we must replace the current primitive access schemes with more sophisticated versions that can exploit the web fully.

Discovering and extracting novel and useful knowledge from web sources call for innovative approaches that draw from a wide range of fields spanning data mining, machine learning, soft computing, statistics, databases, information retrieval, artificial intelligence, and natural language processing [8].

The proposed research work aims at developing algorithms for building an efficient system for information extraction from the web, using suitable soft computing approaches. The preprocessing of web pages requires the content of the web pages to be stored as fragments, which will facilitate identifying the primary content to be delivered. The user access information can be represented using fuzzy sets [5], which will help in making decisions faster [7] and ranking the pages accurately. The system has to learn from the various types of queries it receives and should be capable of performing better and faster for all similar queries. This learning capability [4] can be implemented using neural network techniques. Evolutionary computation techniques such as genetic algorithms can be used to support the searching mechanism for retrieving the required information faster. The system will also support a wide range of queries using Natural Language Processing techniques [8].

Thus, the use of soft computing approaches for information extraction systems is an important research thrust in Web Intelligence. These techniques will make it possible to fully use the immense information available on the web and make the web a richer, friendlier, and more intelligent resource that we can all share and explore.
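
As a minimal sketch of the fuzzy-set idea mentioned above (the membership shapes, thresholds and the min-combination rule are illustrative assumptions only), user access information such as visit counts and recency can be mapped to fuzzy membership degrees and used to rank pages:

    # Sketch: fuzzy membership over access data, combined to rank pages.
    def frequent(visits, full_at=50):
        """Degree to which a page counts as 'frequently visited' (0..1)."""
        return min(visits / full_at, 1.0)

    def recent(days_since_visit, fades_by=30):
        """Degree to which a visit counts as 'recent' (1 today, 0 after a month)."""
        return max(1.0 - days_since_visit / fades_by, 0.0)

    def fuzzy_rank(access_log):
        """Rank pages by the conjunction (min) of 'frequent' and 'recent'."""
        scored = [(min(frequent(v), recent(d)), url) for url, v, d in access_log]
        return [url for score, url in sorted(scored, reverse=True)]

    log = [("/docs/mining", 40, 2), ("/old/archive", 60, 45), ("/news/today", 5, 1)]
    print(fuzzy_rank(log))   # '/docs/mining' ranks first: both frequent and recent
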

Monday, January 14, 2008

Building an Intelligent Information System by Exploiting Web Usage Regularities and Information Structures in Web Pages


The Web is an immense and dynamic collection of pages that includes countless hyperlinks and huge volumes of access and usage information, which provides a rich and unprecedented source for data mining. However, the Web also poses several challenges to effective resource and knowledge discovery. First, the complexity of web pages far exceeds that of any traditional text document collection. Second, the web constitutes a highly dynamic information source: not only does the web continue to grow rapidly, but the information it holds also receives constant updates. Linkage information and access records also undergo frequent updates.

The Internet’s rapidly expanding user community connects millions of workstations. These users have markedly different backgrounds, interests, and usage purposes. Many lack good knowledge of the information network’s structure and are unaware of a particular search’s heavy cost, and hence face lengthy waits to retrieve search results.

The proposed research work aims at exploiting web usage regularities and information structures in web pages to build intelligent information systems. The system should be able to collect and segregate user access information and mine useful information from it. It should also build complete concept models of web users’ information needs based on surfers’ access history.

The system uses information structures such as the incoming and outgoing links of a web page in mining the information. The incoming links of a page can be used to classify the page in a concise manner, which enhances the browsing and querying of web pages. To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intra-site redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. The system should be capable of handling these intra-page informative structures and eliminating the redundant information.
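
One illustrative way the incoming-link idea could work (the category keyword sets and sample anchors below are assumptions for the sketch) is to pool the anchor texts of links pointing at a page and let them vote for a category, without reading the page itself:

    # Sketch: classify a page from the anchor text of its incoming links.
    from collections import Counter

    CATEGORY_KEYWORDS = {
        "sports":    {"football", "cricket", "scores", "league"},
        "computing": {"python", "algorithm", "mining", "database"},
    }

    def classify_by_inlinks(anchor_texts):
        """Vote for the category whose keywords appear most often in anchor text."""
        votes = Counter()
        for anchor in anchor_texts:
            terms = set(anchor.lower().split())
            for category, keywords in CATEGORY_KEYWORDS.items():
                votes[category] += len(terms & keywords)
        top = votes.most_common(1)
        return top[0][0] if top and top[0][1] > 0 else "unknown"

    inlinks = ["web mining algorithm notes", "python database tips", "lecture page"]
    print(classify_by_inlinks(inlinks))   # computing
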

The context of a hyperlink, or link context, is defined as the terms that appear in the text around a hyperlink within a web page. The system should be able to apply link contexts to a variety of web information retrieval and categorization tasks. Thus, the use of these approaches in web mining will improve information extraction and make the web friendlier to users.
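
The sketch below illustrates link-context extraction as defined above: for each hyperlink, the words surrounding its anchor in the page text are collected within a fixed window. The five-word window and the simplified regular-expression parsing are assumptions for illustration:

    # Sketch: collect the words around each hyperlink as its "link context".
    import re

    def link_contexts(html, window=5):
        """Map each href to the anchor text plus the words around it."""
        contexts = {}
        anchors = re.findall(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', html, re.S)
        # Replace each anchor with a placeholder so word positions survive tag removal.
        marked = re.sub(r'<a\s+href="[^"]+"[^>]*>.*?</a>', ' __LINK__ ', html, flags=re.S)
        words = re.sub(r"<[^>]+>", " ", marked).split()
        link_positions = [i for i, w in enumerate(words) if w == "__LINK__"]
        for (href, anchor_text), pos in zip(anchors, link_positions):
            around = words[max(pos - window, 0):pos] + words[pos + 1:pos + 1 + window]
            contexts[href] = anchor_text.strip() + " | " + " ".join(around)
        return contexts

    page = ('<p>Lecture notes and a short <a href="/mining.html">web mining</a> '
            'tutorial for beginners are available here.</p>')
    print(link_contexts(page))
    # {'/mining.html': 'web mining | Lecture notes and a short tutorial for beginners are available'}
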


C. Rajesh Kumar, Lecturer, Sathyabama University