Extracting formatted batch data from web by JTidy and XML

Original title: 利用JTidy和XML实现Web数据信息的批量提取
Journal: Computer Engineering and Design (计算机工程与设计)
Authors: LIU Zhao-xia (刘钊夏); HE Ming-xin (何明昕)
Affiliation: Department of Computer Science, Jinan University, Guangzhou 510632, China
Keywords: web content extraction; XML; JTidy; Dom4j; label path; frequent path
Publication date: 2010-03-28
Year: 2010
Issue: 06
Publisher: Computer Engineering and Design

Abstract

To extract data from the Web effectively and to support the cleaning and integration of Web data, this paper proposes a method that uses XML and JTidy to extract data automatically and in batches from web pages that publish formatted batch data. Based on the characteristics of such pages and the goal of building a general-purpose program, the tag structure of the pages is analyzed and classified. The main difficulties in the extraction process, namely identifying data elements and grouping them, are discussed, and on that basis algorithms for overall scanning and extraction are developed. Experimental results from a case study on a web page demonstrate the feasibility and effectiveness of the batch extraction method.
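
The abstract names the ideas of a label path and a frequent path but does not give the algorithm itself. The following Java sketch is only one plausible reading of those terms, not the authors' method. It assumes the page has already been cleaned into well-formed XHTML (the role JTidy plays in the paper), and it uses the JDK's built-in DOM parser instead of JTidy or Dom4j so that it runs without extra libraries. The class name `FrequentPathSketch`, its method names, and the sample markup are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class FrequentPathSketch {

    /** Parses well-formed XHTML, such as the output of an HTML cleaning step. */
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Groups the text of every leaf element under its label path, e.g. /html/body/table/tr/td. */
    public static Map<String, List<String>> collect(Document doc) {
        Map<String, List<String>> byPath = new HashMap<>();
        walk(doc.getDocumentElement(), "", byPath);
        return byPath;
    }

    private static void walk(Element element, String parentPath, Map<String, List<String>> byPath) {
        String path = parentPath + "/" + element.getTagName();
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                walk((Element) child, path, byPath);
            }
        }
        // Only leaf elements with non-empty text are treated as candidate data elements.
        String text = element.getTextContent().trim();
        if (!hasChildElement && !text.isEmpty()) {
            byPath.computeIfAbsent(path, key -> new ArrayList<>()).add(text);
        }
    }

    /** Returns the label path with the most data elements, or null if none were found. */
    public static String mostFrequentPath(Map<String, List<String>> byPath) {
        String best = null;
        for (Map.Entry<String, List<String>> entry : byPath.entrySet()) {
            if (best == null || entry.getValue().size() > byPath.get(best).size()) {
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical sample page: one heading plus a table of repeated records.
        String page = "<html><body><h1>Report</h1><table>"
                + "<tr><td>A</td><td>1</td></tr>"
                + "<tr><td>B</td><td>2</td></tr>"
                + "</table></body></html>";
        Map<String, List<String>> byPath = collect(parse(page));
        String frequent = mostFrequentPath(byPath);
        System.out.println(frequent + " -> " + byPath.get(frequent));
    }
}
```

In this reading, repeated records on a batch-data page share one label path, so the most frequent path points at the data cells while one-off elements such as the heading are ignored. Grouping the cells back into records (for instance by their parent `tr`) is the grouping step the abstract mentions and is left out of the sketch.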

References

[1] Jussi Myllymaki. Effective web data extraction with standard XML technologies [EB/OL]. http://www10.org/cdrom/papers/102/index.html.
[2] Jussi Myllymaki, Jared Jackson. Automatically extract information with HTML, XML, and Java [EB/OL]. http://www.ibm.com/developerworks/library/wa-wbdm/.
[3] 盖磊, 王海军, 刘俊民. An implementation of XML-based Web seismic information extraction [J]. Computer Applications and Software (计算机应用与软件), 2007, 24(8): 103-105.
[4] 梅东霞, 张晓明. Data mining based on the structure of a single XML document [J]. Journal of Petrochemical Universities (石油化工高等学校学报), 2007, 20(1): 94-98.
[5] JTidy website [EB/OL]. http://jtidy.sourceforge.net/.
[6] Dom4j website [EB/OL]. http://www.dom4j.org/.
[7] 唐红光, 周铁军. XML-based Web data mining technology [J]. 信息科学, 2007(1): 14.
[8] 杨彬. Web content mining using XML technology [J]. Computer and Modernization (计算机与现代化), 2005(11): 48-50.