A Study of the Role of Modifiers in Text Information Retrieval

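Author: Ma Huinan (马晖男)
Degree level: Master's
Discipline: Systems Engineering
Keywords: text information retrieval; modifiers; vector space model (VSM); precision; recall
Degree year: 2004
Advisors: Dang Yanzhong (党延忠); Wu Jiangning (吴江宁)
Discipline code: 081103
Degree-granting institution: Dalian University of Technology
Thesis submission date: 2004-04-01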

Abstract

With the arrival of the Internet era, information changes daily and grows exponentially, producing an "information explosion". During retrieval, information that matches the user's need is often missing from the results, while a large share of the results consists of "information garbage" that the user does not need. Improving the performance and result quality of text information retrieval (IR) systems has therefore become an urgent problem.
    The main objective of this thesis is to study the role in text information retrieval of modifiers, an easily overlooked factor that may affect retrieval effectiveness. To this end, a Modified Vector Space Model (MVSM) was developed and tested on English documents to demonstrate the effect of modifiers.
    The main results of this study are as follows:
    (1) In traditional keyword-based models (such as the Boolean retrieval model), the keywords of queries and documents are only independent content words (nouns, verbs, adjectives, adverbs). The traditional Vector Space Model (VSM) was modified, and an IR model (MVSM) that serves the research objective was designed and implemented. The new model differs from the traditional VSM, and improves on it, in two ways. First, it treats a traditional retrieval keyword (mainly a noun in this thesis) and the modifier that modifies it (mainly an adjective in this thesis) as a single combined keyword, which to some extent pins down the intended sense of an ambiguous word. Second, it expands the modifier and the headword it modifies with their synonyms and recombines them, so that documents that meet the user's need but would otherwise be missed because the query uses rare wording can be retrieved.
    (2) Using a standard benchmark corpus (TREC) and the MVSM, 150 queries in total were run in a range of modifier-focused experiments, and the results were compared with those of ordinary retrieval with the traditional VSM. The comparison demonstrates the value of a model that takes modifiers into account.
    (3) The IR system was evaluated mainly on two measures, precision and recall, and the experimental results were charted in Excel. The charts show that the precision and recall of MVSM are higher, to a certain degree, than those of ordinary retrieval. The experimental results indicate that the role of modifiers in text information retrieval cannot be ignored.
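
The three points above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the word lists, the toy thesaurus `SYNONYMS`, the underscore-joined term format, the raw term-frequency weighting, and all function names are assumptions made for illustration only. The abstract does not say which part-of-speech tagger, thesaurus, or term-weighting scheme MVSM used.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy resources (assumptions, not taken from the thesis).
ADJECTIVES = {"big", "large", "fast", "quick"}
NOUNS = {"car", "automobile", "apple", "race"}
SYNONYMS = {"big": {"large"}, "car": {"automobile"}, "fast": {"quick"}}

def combined_terms(text):
    """Tokenize text, fusing an adjective and the noun right after it into one term."""
    tokens = text.lower().split()
    terms, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] in ADJECTIVES and tokens[i + 1] in NOUNS:
            terms.append(tokens[i] + "_" + tokens[i + 1])  # combined keyword
            i += 2
        else:
            terms.append(tokens[i])
            i += 1
    return terms

def expand(term):
    """Expand modifier and headword with their synonyms, then recombine them."""
    if "_" not in term:
        return {term}  # single words are left unexpanded in this sketch
    modifier, head = term.split("_", 1)
    modifiers = {modifier} | SYNONYMS.get(modifier, set())
    heads = {head} | SYNONYMS.get(head, set())
    return {m + "_" + h for m, h in product(modifiers, heads)}

def query_vector(query):
    """Build a term-frequency vector over all expanded variants of the query terms."""
    vec = Counter()
    for term in combined_terms(query):
        for variant in expand(term):
            vec[variant] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy collection: only d1 is about a big car, but it uses different words.
docs = {
    "d1": "a large automobile parked outside",
    "d2": "the big apple fell",
    "d3": "a fast car race",
}
query = query_vector("big car")
retrieved = [d for d, text in docs.items()
             if cosine(query, Counter(combined_terms(text))) > 0]
print(retrieved, precision_recall(retrieved, {"d1"}))
```

In this toy collection, matching on the separate words "big" and "car" would hit d2 and d3 and miss d1, whereas the combined and expanded term `large_automobile` matches only d1. The example is constructed to show the mechanism and says nothing about the size of the effect reported on TREC.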
