Research on the Application of Text Topic Segmentation and the Rocchio Model in Information Retrieval



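Author: 吴曾
Degree level: Master's
Discipline: Computer Software and Theory
Keywords (translated from Chinese): information retrieval; vector space model; topic segmentation; text filtering; Rocchio model; gradient descent algorithm
English keywords: Information Retrieval ; Text Segment ; Text Filter ; Rocchio Model ; Gradient Decrease
Degree year: 2004
Advisor: 杜林
Discipline code: 081202
Degree-granting institution: Graduate School of the Chinese Academy of Sciences (Institute of Software)
Thesis submission date: 2004-05-01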

Abstract

We live in an age of information explosion, and finding the information one needs within a huge volume of data has become one of its central problems. As information retrieval (IR) research has deepened and its applications have spread, user requirements have become more fine-grained, and researchers have subdivided full-text retrieval into question answering, Web retrieval, interactive retrieval, text filtering, and so on. To improve the precision of retrieval systems and user satisfaction with them, researchers have begun to combine related theories and techniques from current natural language processing (NLP) and text processing to better meet user needs.
The background of this thesis is the High Accuracy Retrieval from Documents (HARD) track of the 12th Text Retrieval Conference (TREC 12). The thesis first analyzes the vector space model, its strengths and weaknesses, and the widely used SMART system built on it, followed by the probabilistic model, its strengths and weaknesses, and the INQUERY platform built on it. Although modern IR is no longer limited to plain-text or full-text retrieval, and both models were proposed many years ago, these models and their refinements are still widely used as the first step of many retrieval methods, and their underlying ideas are widely borrowed. The thesis then introduces techniques that segment a text into subtopics according to different cues, including semantic networks and per-paragraph keyword frequency statistics, and points out the strengths and weaknesses of each. Next, it introduces text filtering and analyzes the characteristics of the commonly used Rocchio model. It then introduces the shallow NLP techniques used in this work. Finally, it introduces the elements needed to capture user needs accurately.
To meet the requirements of this track, in which retrieval is paragraph-based and the user may supply one relevant document along with the query, the thesis first combines the Rocchio model with the vector space algorithm to capture the user's need and to compute the relevance between documents and the query. It then uses gradient descent to train the parameters of the model. Finally, based on the paragraph-level relevance to the query, it uses a paragraph-segmentation-based method to return the documents most relevant to the user's query.
Building on these techniques, the thesis implements the experiments described above and analyzes their results.
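The combination of the Rocchio model with a vector space representation and paragraph-level scoring can be illustrated with a minimal sketch. It is not the thesis's implementation. The whitespace tokenizer, raw term-frequency weights, coefficient values (`alpha`, `beta`, `gamma`), and the rule of scoring a document by its best-matching paragraph are all assumptions made for illustration, and the function names are hypothetical. The thesis trains the model parameters with gradient descent, which is not shown here.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio update: q' = alpha*q + beta*mean(rel) - gamma*mean(nonrel).

    The default coefficients are illustrative, not values from the thesis.
    """
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for docs, coef in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                updated[term] += coef * weight / len(docs)
    # Drop non-positive weights, a common convention for Rocchio queries.
    return {t: w for t, w in updated.items() if w > 0}


def rank_by_best_paragraph(query_vec, documents):
    """Rank documents by the score of their best-matching paragraph (assumed rule)."""
    scored = []
    for doc_id, paragraphs in documents.items():
        best = max((cosine(query_vec, tf_vector(p)) for p in paragraphs), default=0.0)
        scored.append((doc_id, best))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# The user supplies a query plus one relevant document, as in the HARD setting.
query = tf_vector("rocchio feedback")
relevant_docs = [tf_vector("rocchio relevance feedback vector model")]
expanded = rocchio(query, relevant_docs, [])

documents = {
    "d1": ["cooking recipes for pasta", "rocchio relevance feedback"],
    "d2": ["weather report today"],
}
ranking = rank_by_best_paragraph(expanded, documents)
```

Here `expanded` gives extra weight to terms shared by the query and the supplied relevant document, and `ranking` orders documents by their single most relevant paragraph, so a long document with one on-topic paragraph is not penalized for its off-topic ones.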

