Batch Extraction of Web Data Using JTidy and XML
Abstract
To extract data from the Web effectively and to support the cleaning and integration of Web data, this paper proposes a method that uses XML and JTidy to automatically extract data in batches from Web pages that publish bulk formatted data. Based on the characteristics of such pages, and with the aim of developing a general-purpose program, the tag structure of the pages is analyzed and classified. The main difficulties in the extraction process, namely identifying data elements and grouping them, are discussed, and on this basis an overall scanning and extraction algorithm is constructed. A case study of extracting data from a Web page demonstrates the feasibility and effectiveness of the method.
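The flow outlined in the abstract (obtain well-formed XML from the page, identify the data elements, group them into records) can be illustrated with a minimal sketch. It is not the paper's algorithm, which the abstract does not give. It assumes the page has already been cleaned into well-formed markup (for example by JTidy), that the batch data sits in the first table, and that the header row names the fields. The class name `BatchTableExtractor`, the method `extract`, and the sample page are all hypothetical, and only standard JDK XML APIs are used so the example runs without extra libraries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BatchTableExtractor {
    // Hypothetical sample page, already well-formed as a tidy step would leave it.
    static final String SAMPLE =
          "<html><body><table>"
        + "<tr><th>Station</th><th>Magnitude</th></tr>"
        + "<tr><td>A01</td><td>4.2</td></tr>"
        + "<tr><td>B07</td><td>5.1</td></tr>"
        + "</table></body></html>";

    // Scans the first table and groups each data row into one record,
    // keyed by the field names found in the header row.
    static List<Map<String, String>> extract(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Step 1: identify the data elements (field names) from the header cells.
        NodeList headerCells = (NodeList) xpath.evaluate(
            "(//table)[1]/tr[1]/th", doc, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < headerCells.getLength(); i++) {
            names.add(headerCells.item(i).getTextContent().trim());
        }

        // Step 2: group the cells of every data row into one record.
        NodeList rows = (NodeList) xpath.evaluate(
            "(//table)[1]/tr[td]", doc, XPathConstants.NODESET);
        List<Map<String, String>> records = new ArrayList<>();
        for (int r = 0; r < rows.getLength(); r++) {
            NodeList cells = (NodeList) xpath.evaluate(
                "td", rows.item(r), XPathConstants.NODESET);
            Map<String, String> record = new LinkedHashMap<>();
            for (int c = 0; c < cells.getLength() && c < names.size(); c++) {
                record.put(names.get(c), cells.item(c).getTextContent().trim());
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        for (Map<String, String> record : extract(SAMPLE)) {
            System.out.println(record);
        }
    }
}
```

For a real page, the string passed to `extract` would be replaced by the output of an HTML cleaning step. Pages that use other tag structures, such as lists or nested tables, would need different XPath expressions, which is the kind of structural classification the abstract refers to.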