TR02-0001 (March)

Title
Mining tables and lists on the Web for desired relation

Author(s)
Yangyang Wu and Haruo Yokota

Contact person
Haruo Yokota (yokota@cs.titech.ac.jp)

Abstract
Following human writing conventions, the Web contains all kinds of tables and lists. These tables and lists carry a lot of useful information, but they are not easy to find with search engines. In this paper, we propose a novel method to recognize and extract relations from the Web, based on semantic and formal characteristics. We define models to represent a desired relation and a "repeated structure", such as a table or list, on Web pages, and introduce a set of functions that measure repeated structures to determine whether they contain a desired relation. We develop algorithms for training the machine and for mining the Web for desired relations. Finally, we present our experimental results and discuss future work.


TR02-0002 (March)

Title
A special data structure for web page analysis

Author(s)
Yangyang Wu and Haruo Yokota

Contact person
Haruo Yokota (yokota@cs.titech.ac.jp)

Abstract
Mining the Web for desired information has been a hot topic in recent years. Following human writing conventions, the Web contains all kinds of tables and lists, and these contain a lot of useful information. Analyzing and recognizing them is an important task in Web content mining. In this paper, we present a special data structure, called the WPS-tree, for Web page analysis. The WPS-tree is based on visible objects and captures the logical structure of pages more exactly. We give its definition and an algorithm for constructing it, and discuss how to use it to recognize the nested relationships among data and the relationships between HTML tags and text on Web pages. In particular, we describe how to use it to recognize repeated structures, such as tables and lists, in our relation recognition system, and discuss the results of our experiments.