Monday, January 21, 2008

Efficient System for Information Extraction from the Web Using Soft Computing Approaches

Information extraction is among the most frequently performed activities on the web. Searching, comprehending and using the semi-structured information [2] stored on the web poses a significant challenge because this data is more sophisticated and dynamic than the information that commercial database systems store. To supplement keyword-based indexing, which forms the cornerstone of web search engines, researchers have applied data mining to web-page ranking [1]. In this context, data mining helps web search engines find high-quality web pages for users.

Designing an efficient information extraction system [3] for the web presents a major research challenge. Achieving this requires overcoming two fundamental problems. First, the traditional schemes for accessing the immense amounts of data that reside on the web fundamentally assume a text-oriented, keyword-based view [6] of web pages. Second, we must replace the current primitive access schemes with more sophisticated versions that can exploit the web fully.

Discovering and extracting novel and useful knowledge from web sources calls for innovative approaches that draw from a wide range of fields spanning data mining, machine learning, soft computing, statistics, databases, information retrieval, artificial intelligence, and natural language processing [8].

The proposed research work aims at developing algorithms for building an efficient system for information extraction from the web, using suitable soft computing approaches. Preprocessing requires the content of each web page to be stored as fragments, which makes it easier to identify the primary content to be delivered. User access information can be represented using fuzzy sets [5], which will help in making decisions faster [7] and ranking pages accurately. The system has to learn from the various types of queries it receives and should be able to answer similar queries better and faster over time. This learning capability [4] can be implemented using neural network techniques. Evolutionary computation techniques such as genetic algorithms can support the search mechanism so that the required information is retrieved faster. The system will also support a wide range of queries using natural language processing techniques [8].
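
As an illustration of the fuzzy-set idea, the following is a minimal sketch of how user access information might be turned into a ranking score. Everything in it is an assumption made for illustration and is not taken from the proposal: the two access features (visit count and mean dwell time), the thresholds of the membership functions, the names `ramp`, `PageAccess`, `relevance` and `rank`, and the choice of the minimum as the fuzzy AND.

```python
from dataclasses import dataclass
from typing import List


def ramp(x: float, low: float, high: float) -> float:
    """Piecewise-linear membership: 0 at or below `low`, 1 at or above `high`."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (x - low) / (high - low)))


@dataclass(frozen=True)
class PageAccess:
    url: str
    visits: int            # hypothetical feature: visits in the log window
    dwell_seconds: float   # hypothetical feature: mean time spent on the page


def relevance(page: PageAccess) -> float:
    """Degree to which a page is 'frequently visited' AND 'read at length'."""
    # Thresholds below are illustrative assumptions, not tuned values.
    frequently_visited = ramp(page.visits, low=10, high=100)
    read_at_length = ramp(page.dwell_seconds, low=5.0, high=60.0)
    # Fuzzy AND taken as the minimum of the two membership degrees.
    return min(frequently_visited, read_at_length)


def rank(pages: List[PageAccess]) -> List[PageAccess]:
    """Order pages from highest to lowest fuzzy relevance."""
    return sorted(pages, key=relevance, reverse=True)


pages = [
    PageAccess("a.example/index", visits=120, dwell_seconds=8.0),
    PageAccess("b.example/guide", visits=55, dwell_seconds=45.0),
    PageAccess("c.example/stub", visits=5, dwell_seconds=90.0),
]
for page in rank(pages):
    print(f"{page.url}: {relevance(page):.2f}")
```

With these made-up numbers, the page that is moderately strong on both features ranks above pages that are strong on only one, which is the behaviour a graded membership gives and a hard threshold would not.
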
Thus, the use of soft computing approaches in information extraction systems is an important research thrust in web intelligence. These techniques will make it possible to fully use the immense information available on the web and make the web a richer, friendlier, and more intelligent resource that we can all share and explore.
