Data preprocessing (also called data preparation) is the first step of data mining, web mining, data warehousing, and other data processing applications. In web mining especially, data preparation techniques play a vital role for two reasons: (1) web content is highly dynamic, and (2) web content is huge in volume. Some of the most essential techniques for web mining are web page fragmentation (also called segmentation), fragment detection, and informative fragment detection. Manual fragmentation and manual identification of the primary content of web pages are error-prone, expensive, and unscalable. This paper presents several such preprocessing techniques. For segmentation, the web page is formally defined and partitioned into web page blocks. This paper presents the Vision-based Page Segmentation (VIPS) algorithm [1] and Support Vector Machines (SVM) for partitioning an HTML page into web page blocks. For fragment identification, this paper presents techniques based on sharing behavior, personalization characteristics, and change patterns. These techniques help to identify the cost-effective units in dynamic web pages [2]. Within the identified fragments, clients and end users are interested only in the primary content (informative sections) and not in the secondary content (non-informative sections). Identifying the primary content in the segmented fragments improves the speed and quality of mining [3]. This paper presents approaches that extract the primary content automatically based on the similarity between the fragments of different web pages, on redundancy, and on features.
P. Saravanan, Sathyabama University
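
As a rough illustration of the redundancy idea mentioned in the abstract, the sketch below splits a few pages into blocks and keeps only the blocks that do not repeat across pages. It is not the VIPS algorithm or an SVM classifier, and it is not taken from the paper: a simple split on block-level HTML tags stands in for segmentation, and exact text matching stands in for fragment similarity. The names BLOCK_TAGS, segment, informative_blocks, and the max_share threshold are all assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags are treated as block boundaries for the example.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article", "nav", "header", "footer"}


class BlockSplitter(HTMLParser):
    """Collects page text into blocks, cutting at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Join buffered text, collapse whitespace, and store non-empty blocks.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text after the last tag


def segment(html):
    """Return the list of text blocks found in one HTML page."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.blocks


def informative_blocks(pages, max_share=0.5):
    """Keep blocks that appear in at most max_share of the pages.

    Blocks repeated on most pages (menus, footers) are treated as
    secondary content; the threshold is an assumption for this sketch.
    """
    segmented = [segment(page) for page in pages]
    # Count in how many pages each block text occurs.
    page_count = Counter(b for blocks in segmented for b in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if page_count[b] <= limit] for blocks in segmented]


if __name__ == "__main__":
    sample_pages = [
        "<div>Home | About</div><p>Article one text</p><div>Copyright</div>",
        "<div>Home | About</div><p>Article two text</p><div>Copyright</div>",
        "<div>Home | About</div><p>Article three text</p><div>Copyright</div>",
    ]
    for kept in informative_blocks(sample_pages):
        print(kept)
```

In this toy input the navigation and copyright blocks occur on every page and are dropped, while the article text, which is unique to each page, is kept. A realistic system would need approximate matching and visual or structural cues rather than exact string equality.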