Web Mining Research: January 2008

Monday, January 21, 2008

Topics of Interest

Information Extraction systems
Web Structure Mining
Domain Specific Web Search
Personalized Web Search
Web Page Classification methods
Web caching methods
Automatic Fragment Detection
Intelligent Data Preparation
Automatic Identification of Informative sections in web pages
Adaptive Information Retrieval
Pattern Discovery on Web

PhD Four-Year Chart - Rajesh Kumar

The chart describing the four-year action plan for the PhD course is attached. The chart was prepared by Rajesh Kumar.
Download Chart

Intelligent Preprocessor for Search Engines and Web miners

Download this Abstract

Data preprocessing is the first step of web mining, data mining, information retrieval and pattern recognition [1]. When the data is well prepared, the mined results are more accurate and reliable. The principal aim of data preparation [2],[3],[6] is to provide quality data for the other steps in web mining and information retrieval.

Web page classification [4] is one of the essential techniques for Web mining because classifying Web pages of an interesting class is often the first step of mining the Web. However, constructing a classifier for an interesting class requires laborious preprocessing. Although Yahoo and similar Web directory services use human readers to classify Web documents, reduced cost and increased speed make automatic classification highly desirable. Typical classification methods [4] use positive and negative examples as training sets, and then assign each document a class label from a set of predefined topic categories. The proposed research work will identify such preprocessing problems and develop efficient algorithms, based on machine learning and AI techniques, to overcome these difficulties and improve the web mining process.

Web pages [3], especially dynamically generated ones, contain several items that cannot be classified as the “primary content,” e.g., navigation sidebars, advertisements, copyright notices, etc. Most clients and end-users search for the primary content [5], [7], and largely do not seek the non-informative content. A tool that helps an end-user or application to search and process information from Web pages automatically [8] must separate the “primary content sections” from the other content sections. The proposed system should process web pages in such a way that search engines and web miners concentrate only on the primary content and hence retrieve the best information for the users.
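The separation of primary content from the other sections can be illustrated with a minimal Python sketch. It is not the proposed system: it simply assumes that non-informative sections are wrapped in a small, hypothetical set of tags (SKIP_TAGS) and drops any text found inside them. Real pages rarely mark their sections this cleanly, which is why the abstract calls for learning-based methods.

```python
from html.parser import HTMLParser

# Assumption: these tags mark non-informative sections; real pages vary widely.
SKIP_TAGS = {"nav", "footer", "aside", "header", "script", "style"}


class PrimaryContentParser(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # how many skipped tags are currently open
        self.chunks = []      # text pieces kept as primary content

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_primary_text(html):
    """Return the text of a page with the skipped sections removed."""
    parser = PrimaryContentParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, calling extract_primary_text on a page whose body holds a nav element, one paragraph and a footer element returns only the paragraph text.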

Modeling user preference [10] is one of the challenging issues in intelligent information systems. Extensive research has been performed to analyze user preference automatically and to utilize it. This is another preprocessing step to be performed before applying web mining and search. It will help web miners present personalized and rich information to the users.

We believe data mining should be integrated with the Web search engine service to enhance the quality of Web searches. To do so, we can start by enlarging the set of search keywords to include a set of keyword synonyms for web search [9]. This preprocessing step will let search engines search based on semantics instead of just the keywords.
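The keyword-enlargement step can be sketched in a few lines of Python. The synonym table below is hypothetical and hand-written purely for illustration; an actual system would need a thesaurus or synonyms mined from data, which the abstract does not specify.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        # Keep the original term first, then any known synonyms.
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

With this table, expand_query("Buy car") returns ["buy", "purchase", "car", "automobile", "vehicle"], and the enlarged list can then be passed to an ordinary keyword search.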

Hence, processing web pages, user profiles, user behaviors and access patterns, and bringing them into the format required by web miners and search engines, is an important research area for web intelligence.

Efficient System for Information Extraction from Web Using Soft Computing approaches

Download this Abstract

Information extraction is the most frequently performed activity on the web. Searching, comprehending and using the semi-structured information [2] stored on the web poses a significant challenge because this data is more sophisticated and dynamic than the information that commercial database systems store. To supplement keyword-based indexing, which forms the cornerstone of web search engines, researchers have applied data mining to web-page ranking [1]. In this context, data mining helps web search engines find high-quality web pages for users.

Defining how to design an efficient information extraction system [3] for the web presents a major research challenge. Achieving this requires overcoming two fundamental problems. First, the traditional schemes for accessing the immense amounts of data that reside on the web fundamentally assume a text-oriented, keyword-based view [6] of web pages. Second, we must replace the current primitive access schemes with more sophisticated versions that can exploit the web fully.

Discovering and extracting novel and useful knowledge from web sources call for innovative approaches that draw from a wide range of fields spanning data mining, machine learning, soft computing, statistics, databases, information retrieval, artificial intelligence, and natural language processing [8].

The proposed research work aims at developing algorithms for building an efficient system for information extraction from the web, using suitable soft computing approaches. Preprocessing of web pages requires their content to be stored as fragments, which makes it easier to identify the primary content to be delivered. The user access information can be represented using fuzzy sets [5], which will help in making decisions faster [7] and ranking pages accurately. The system has to learn from the various types of queries it receives and should be capable of performing better and faster on all similar queries. This learning capability [4] can be implemented using neural network techniques. Evolutionary computation techniques such as genetic algorithms can be used to support the search mechanism and retrieve the required information faster. The system will also support a wide range of queries using Natural Language Processing techniques [8].

Thus the use of Soft Computing approaches in Information Extraction systems is an important research thrust in Web Intelligence. These techniques will make it possible to fully use the immense information available on the web and make the web a richer, friendlier, and more intelligent resource that we can all share and explore.
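As an illustration of how user access information might be represented with fuzzy sets, the Python sketch below maps the time a user spends on a page to degrees of membership in three interest levels. The choice of dwell time as the input, the three labels and every breakpoint are assumptions made for this example; the abstract does not define them.

```python
def triangular(x, a, b, c):
    """Membership in a triangular fuzzy set rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def ramp_up(x, a, b):
    """Membership that is 0 up to a, rises linearly, and is 1 from b onwards."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def interest_memberships(dwell_seconds):
    """Fuzzy interest levels for one page visit (all breakpoints are hypothetical)."""
    return {
        "low": 1.0 - ramp_up(dwell_seconds, 5, 30),
        "medium": triangular(dwell_seconds, 10, 45, 90),
        "high": ramp_up(dwell_seconds, 60, 120),
    }
```

A visit of 75 seconds, for instance, belongs partly to "medium" and partly to "high" at the same time, which is the kind of soft decision a crisp threshold cannot express.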

Friday, January 18, 2008

Web Mining based Intelligent Search Engines

Introduction

• The information explosion has made it difficult to find required information
• Search Engine – an important tool for getting information on the internet
• Current search engines lack accuracy and personalization
• Due to the rapid development of the internet, an effective and accurate Intelligent Search Engine based on web mining technology has become the most important research issue

Evolution of Search Engines

• First Search Engine – 1994 → World Wide Web Worm
• As the number of websites increased, new techniques were required to get accurate search results
• Most of the available search engines return several thousand pages, most of which are irrelevant
• The quality and relevance of search results can be improved using the following three technologies:
1. Clustering the web documents
2. Analyzing the web hyperlink structure
3. Analyzing the web usage logs
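The second technology, hyperlink structure analysis, can be illustrated with a PageRank-style power iteration in Python. This is one well-known link-analysis technique, shown here as an assumption about what such analysis might look like; the slides do not say which method is used. The link graph is passed as a dictionary and the damping factor of 0.85 is a conventional default.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from a dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of its in-links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

Pages that collect links from many well-ranked pages end up with higher scores, so the scores can be used to order search results that match the same keywords.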

More in the presentation. Download the Presentation

Monday, January 14, 2008

Screening PPT - Sanjay Sugumar

Download this Presentation

Presentation prepared for PhD Screening

Screening PPT - Saravanan

Presentation prepared for PhD Screening.

Download this Presentation

Progress Review Meeting - Jan 2008, Rajesh Kumar

Download this Report

The proposed research work aims at exploiting the web usage regularities and information structures in web pages to build intelligent information systems.

Problems Identified:

Need for improving Precision and Speed in Structure Mining
Query based Classification using Links
Statistical Approaches to link mining with meta-data discovery and mapping
Applying Link Mining in Semantic Web
Combine Information extraction with techniques from link mining to construct Semantic Web
To make use of Semantic and Ontological information in Link Mining Endeavors
Filtering sequential patterns and clusters
Need to develop tools, which incorporate statistical methods, visualization, and human factors to help better understand the mined knowledge.

Course work completed:

Advanced Internet Technologies.

Paper Published:

Saravanan. P, Dr. D. Sridharan, Rajesh Kumar. C, “The Current Approaches in Automatic Identification of Fragments and Informative Sections”, International Conference on Trendz in Information Sciences and Computing, Sathyabama University, Chennai, 2007

Screening PPT - Rajesh Kumar

Presentation prepared for PhD Screening

Download Presentation

Progress Review Meeting - Jan 2008, Sanjay Sugumar

Download this Report

The proposed research work aims at developing algorithms for building an efficient system for information extraction from web, using suitable soft computing approaches.

Problems Identified:

Simple blind keyword based query processing in search engines
No deductive capability in information extraction
No soft decision in classifying the web content
Lack of personalization based on the user’s history and nature
Web services are ignored when extracting information from web
Simple host based clustering algorithms are used
Present techniques don’t mine linguistic association rules

Course work completed:

Advanced Databases (Instead of Database Technology)

Progress Review Meeting - July 2007, Sanjay Sugumar

Download this Report

The proposed research work aims at developing algorithms for building an efficient system for information extraction from web, using suitable soft computing approaches.

Problems Identified:

Simple blind keyword based query processing in search engines
No deductive capability in information extraction
No soft decision in classifying the web content
Lack of personalization based on the user’s history and nature
Web services are ignored when extracting information from web
Simple host based clustering algorithms are used
Present techniques don’t mine linguistic association rules

Course work completed:

Data Warehousing and Data Mining
Advanced Internet Technologies

Chart for Carrying out research Work - Prepared by Sanjay Sugumar

The chart outlines the work to be done over four years. It was prepared by Sanjay Sugumar.

Download this chart

Author Guidelines for 8.5x11-inch Proceedings Manuscripts

Download this Guidelines Document

Courtesy: IEEE

Abstract

The abstract is to be in fully-justified italicized text, at the top of the left-hand column as it is here, below the author information. Use the word “Abstract” as the title, in 12-point Times, boldface type, centered relative to the column, initially capitalized. The abstract is to be in 10-point, single-spaced type, and may be up to 3 in. (7.62 cm) long. Leave two blank lines after the abstract, then begin the main text. All manuscripts must be in English.

1. Introduction

These guidelines include complete descriptions of the fonts, spacing, and related information for producing your proceedings manuscripts.
A zip-file of this sample manuscript is also available (http://mecha.ee.boun.edu.tr/word2.zip), which you can use as a template to prepare your paper.
Please note that your paper should normally be limited to six pages. A maximum of two additional pages can be used subject to a charge of $100/page.

2. Formatting your paper

All printed material, including text, illustrations, and charts, must be kept within a print area of 6-7/8 inches (17.5 cm) wide by 8-7/8 inches (22.54 cm) high. Do not write or print anything outside the print area. All text must be in a two-column format. Columns are to be 3-1/4 inches (8.25 cm) wide, with a 5/16 inch (0.8 cm) space between them. Text must be fully justified.

3. Main title

The main title (on the first page) should begin 1-3/8 inches (3.49 cm) from the top edge of the page, centered, and in Times 14-point, boldface type. Capitalize the first letter of nouns, pronouns, verbs, adjectives, and adverbs; do not capitalize articles, coordinate conjunctions, or prepositions (unless the title begins with such a word). Leave two blank lines after the title.

4. Author name(s) and affiliation(s)

Author names and affiliations are to be centered beneath the title and printed in Times 12-point, non-boldface type. Multiple authors may be shown in a two- or three-column format, with their affiliations below their respective names. Affiliations are centered below each author name, italicized, not bold. Include e-mail addresses if possible. Follow the author information by two blank lines before main text.

5. Second and following pages

The second and following pages should begin 1.0 inch (2.54 cm) from the top edge. On all pages, the bottom margin should be 1-1/8 inches (2.86 cm) from the bottom edge of the page for 8.5 x 11-inch paper; for A4 paper, approximately 1-5/8 inches (4.13 cm) from the bottom edge of the page.

6. Type-style and fonts

Wherever Times is specified, Times Roman, or New Times Roman may be used. If neither is available on your word processor, please use the font closest in appearance to Times that you have access to. Please avoid using bit-mapped fonts if possible. True-Type 1 fonts are preferred.

7. Main text

Type your main text in 10-point Times, single-spaced. Do not use double-spacing. All paragraphs should be indented 1 pica (approximately 1/6- or 0.17-inch or 0.422 cm). Be sure your text is fully justified—that is, flush left and flush right. Please do not place any additional blank lines between paragraphs.
Figure and table captions should be 10-point Helvetica (or a similar sans-serif font), boldface. Callouts should be 9-point Helvetica, non-boldface. Initially capitalize only the first word of each figure caption and table title. Figures and tables must be numbered separately. For example: “Figure 1. Database contexts”, “Table 1. Input data”. Figure captions are to be below the figures. Table titles are to be centered above the tables.

8. First-order headings

For example, “1. Introduction”, should be Times 12-point boldface, initially capitalized, flush left, with one blank line before, and one blank line after. Use a period (“.”) after the heading number, not a colon.

8.1. Second-order headings

As in this heading, they should be Times 11-point boldface, initially capitalized, flush left, with one blank line before, and one after.

8.1.1. Third-order headings. Third-order headings, as in this paragraph, are discouraged. However, if you must use them, use 10-point Times, boldface, initially capitalized, flush left, preceded by one blank line, followed by a period and your text on the same line.

9. Printing your paper

Print your properly-formatted text on high-quality, 8.5 x 11-inch white printer paper. A4 paper is also acceptable, but please leave the extra 0.5 inch (1.27 cm) at the BOTTOM of the page. If the last page of your paper is only partially filled, arrange the columns so that they are evenly balanced if possible, rather than having one long column.

10. Page numbering

Number your pages lightly, in pencil, on the upper right-hand corners of the BACKS of the pages (for example, 1/6, 2/6; or 1 of 6, 2 of 6; and so forth). Please do NOT write on the fronts of the pages, nor on the lower halves of the backs of the pages. Do not automatically paginate your pages. Note that unnumbered pages that get out of order can be very difficult to put back in order!

11. Illustrations, graphs, and photographs

All graphics should be centered. Your artwork must be in place in the article (preferably printed as part of the text rather than pasted up). If you are using photographs and are able to have halftones made at a print shop, use a 100- or 110-line screen. If you must use photos, they must be pasted onto your manuscript. Use rubber cement to affix the halftones or photos in place. Black and white, clear, glossy-finish photos are preferable to color. Supply the best quality photographs and illustrations possible. Penciled lines and very fine lines do not reproduce well. Remember, the quality of the book cannot be better than the originals provided. Do not use tape on your pages!

11.1. Color images in proceedings

The use of color on interior pages (that is, pages other than the cover of the proceedings) is prohibitively expensive. Interior pages may be published in color only when it is specifically requested and budgeted for by the authors. DO NOT SUBMIT COLOR IMAGES IN YOUR PAPER UNLESS SPECIFICALLY INSTRUCTED TO DO SO.

11.2. Symbols

If your word processor or typewriter cannot produce Greek letters, mathematical symbols, or other graphical elements, please use pressure-sensitive (self-adhesive) rub-on symbols or letters (available in most stationery stores, art stores, or graphics shops).

11.3. Footnotes

Use footnotes sparingly (or not at all!) and place them at the bottom of the column on the page on which they are referenced. Use Times 8-point type, single-spaced. To help your readers, avoid using footnotes altogether and include necessary peripheral observations in the text (within parentheses, if you prefer, as in this sentence).

12. References

List and number all bibliographical references in 9-point Times, single-spaced, at the end of your paper. When referenced in the text, enclose the citation number in square brackets, for example [1]. Where appropriate, include the name(s) of editors of referenced books.

[1] A.B. Smith, C.D. Jones, and E.F. Roberts, “Article Title”, Journal, Publisher, Location, Date, pp. 1-10.

[2] Jones, C.D., A.B. Smith, and E.F. Roberts, Book Title, Publisher, Location, Date.

13. Copyright forms and reprint orders

You must include your signed copyright release form that will be available in Author's Package when you submit your finished paper. We MUST have this form before your paper can be published in the proceedings.

The Current Approaches in Automatic Identification of Fragments and Informative Sections

Download Full Paper

Data preprocessing (also called data preparation) is the first step of data mining, web mining, data warehousing and other data processing applications. In web mining especially, data preparation plays a vital role for two reasons: (1) web content is highly dynamic, and (2) web content is huge in volume. Some of the most essential techniques for web mining are web page fragmentation (also called segmentation), fragment detection, and informative fragment detection. Manual fragmentation and manual identification of the primary content of web pages are error-prone, expensive and unscalable. This paper presents various such preprocessing techniques. For segmentation, the web page is formally defined and partitioned into web page blocks. This paper presents the Vision-based Page Segmentation (VIPS) algorithm [1] and Support Vector Machines (SVM) for partitioning an HTML page into web page blocks. For fragment identification, this paper presents various techniques based on sharing behavior, personalization characteristics, and change patterns. These techniques help to identify the cost-effective units in dynamic web pages [2]. Within the identified fragments, clients and end users are interested only in the primary content (informative sections) and not in the secondary content (non-informative sections). Identifying the primary content in the segmented fragments improves the speed and quality of mining [3]. This paper presents various approaches to extracting the primary content automatically, based on the similarity between fragments of different web pages, redundancy, and features.

P. Saravanan, Sathyabama University

Building Intelligent Information System by exploiting web usage regularities and information structures in web pages

Download this Abstract

The Web is an immense and dynamic collection of pages that includes countless hyperlinks and huge volumes of access and usage information, which provide a rich and unprecedented data mining source. However, the Web also poses several challenges to effective resource and knowledge discovery. First, the complexity of web pages far exceeds that of any traditional text document collection. Second, the web constitutes a highly dynamic information source. Not only does the web continue to grow rapidly, the information it holds also receives constant updates. Linkage information and access records also undergo frequent updates.

The Internet’s rapidly expanding user community connects millions of workstations. These users have markedly different backgrounds, interests, and usage purposes. Many lack good knowledge of the information network’s structure and are unaware of a particular search’s heavy cost. Hence lengthy waits are required to retrieve search results.

The proposed research work aims at exploiting the web usage regularities and information structures in web pages to build intelligent information systems. The system should be able to collect and segregate user access information and mine useful information from it. It should also build complete concept models of web users’ information needs based on the surfers’ access history.

The system uses information structures such as the incoming and outgoing links of a web page in mining the information. The incoming links of a page can be used to classify the page in a concise manner. This enhances the browsing and querying of web pages. To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intra-site redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. The system should be capable of handling these intra-page informative structures and eliminating the redundant information.
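One simple way to picture the elimination of intra-site redundant information is to treat any text block that recurs on most pages of a site as redundant. The Python sketch below does exactly that; the exact-match comparison, the input layout (a dictionary from URL to a list of text blocks) and the 0.5 threshold are assumptions chosen for illustration, not the method of the proposed system.

```python
from collections import Counter


def remove_redundant_blocks(pages, threshold=0.5):
    """Drop blocks that occur on more than `threshold` of a site's pages.

    pages: dict mapping a URL to the list of text blocks found on that page.
    """
    counts = Counter()
    for blocks in pages.values():
        # Count each block once per page, however often it repeats there.
        counts.update(set(blocks))
    limit = threshold * len(pages)
    return {
        url: [block for block in blocks if counts[block] <= limit]
        for url, blocks in pages.items()
    }
```

Navigation panels and copyright notices repeat verbatim across a site and are filtered out, while article text that appears on a single page is kept for indexing.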

The context of a hyperlink, or link context, is defined as the terms that appear in the text around a hyperlink within a web page. The system should be able to apply link contexts to a variety of web information retrieval and categorization tasks. Thus the use of these approaches in web mining will improve information extraction and make the web friendlier to its users.
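The definition of link context can be made concrete with a short Python sketch that collects the anchor text of each hyperlink together with a few terms on either side of it. The fixed window of three terms is an assumption for this example; the abstract does not say how much surrounding text counts as context.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Records the page's terms and the term positions covered by each link."""

    def __init__(self):
        super().__init__()
        self.tokens = []   # every whitespace-separated term on the page
        self.links = []    # (href, start index, end index) of each anchor
        self._open = None  # (href, start index) of the anchor being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (dict(attrs).get("href"), len(self.tokens))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.tokens)))
            self._open = None

    def handle_data(self, data):
        self.tokens.extend(data.split())


def link_contexts(html, window=3):
    """Map each href to its anchor terms plus `window` terms on each side."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    contexts = {}
    for href, start, end in parser.links:
        before = parser.tokens[max(0, start - window):start]
        after = parser.tokens[end:end + window]
        contexts[href] = before + parser.tokens[start:end] + after
    return contexts
```

The resulting term lists can then serve as features when classifying or retrieving the page that a link points to.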

C. Rajesh Kumar, Lecturer, Sathyabama University