Design and Implementation of a Metadata-Based Search Engine
Abstract
The growth of the Internet has turned it into a huge repository of information, yet the quality of information retrieval has not kept pace. Most traditional search engines rely on mechanical keyword matching, so they cannot understand document content, and their precision is generally low. This paper therefore proposes a new search engine model based on metadata and RDF. Metadata is data that describes data, and RDF is a well-suited vehicle for describing and carrying it. Because a computer can interpret the meaning of the metadata carried by RDF, the engine can perform precise, content-based retrieval. The model consists of four modules: vocabulary design, an RDF description generation tool, a server-side program that collects and parses RDF descriptions, and a browser/server (B/S) query program based on the vocabulary. The vocabulary defines the aspects from which a resource is described. The RDF generation tool helps users describe network resources in one of two ways: by embedding the RDF description in a web page as an XML data island, or by sending the RDF description document directly to the RDF document buffer on the search engine server. The collection and parsing module searches the network for web pages that carry RDF descriptions and stores those descriptions as text documents in the server's RDF document buffer. The RDF parser then parses the documents in the buffer, and the resulting triples are stored in the index database. Finally, the query module provides an interface that accepts user searches and displays the results as metadata.
    In addition, this work examines the construction of a controlled vocabulary and the automatic generation of metadata, both of which refine and complement the proposed mechanism. The controlled vocabulary supports better description and querying, and automatic metadata generation raises the system's degree of automation.
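The parse, index, and query steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the sample RDF description, the choice of Dublin Core elements as the vocabulary, the example.org URL, and the helper names parse_triples, build_index, and search are all assumptions made for illustration. It handles only literal-valued properties inside rdf:Description elements and uses an in-memory SQLite table as a stand-in for the index database.

```python
import sqlite3
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"  # assumed vocabulary (Dublin Core)

# Hypothetical RDF description of one web page, as it might sit in the
# server's RDF document buffer.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/page.html">
    <dc:title>Metadata-based search</dc:title>
    <dc:creator>Alice</dc:creator>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
"""


def parse_triples(rdf_xml):
    """Yield (subject, predicate, object) for each literal-valued property."""
    root = ET.fromstring(rdf_xml)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; joining the two
            # parts gives the full predicate URI.
            namespace, local = prop.tag[1:].split("}", 1)
            yield subject, namespace + local, (prop.text or "").strip()


def build_index(triples):
    """Store triples in an in-memory table standing in for the index database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return conn


def search(conn, predicate, value):
    """Return the resources whose given metadata field equals the value."""
    rows = conn.execute(
        "SELECT subject FROM triples WHERE predicate = ? AND object = ?",
        (predicate, value),
    )
    return [row[0] for row in rows]


index = build_index(parse_triples(SAMPLE_RDF))
print(search(index, DC + "creator", "Alice"))
# prints ['http://example.org/page.html']
```

Querying on a named field such as creator, rather than on raw page text, is what distinguishes this kind of lookup from keyword matching: the word "Alice" appearing elsewhere on a page would not produce a match.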
