Research and Implementation of the Tibetan Full Text Retrieval System Based on LUCENE (基于LUCENE的藏文全文检索系统研究与实现)

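English title: Research and Implementation of the Tibetan Full Text Retrieval System Based on LUCENE
Author: 巴桑杰布
Thesis level: Master's
Discipline: Chinese Ethnic Minority Languages and Literature
Chinese keywords: 藏文全文检索系统 (Tibetan full-text retrieval system); LUCENE; 分词器 (word segmenter)
English keywords: Tibetan full-text search; LUCENE; the parser/word breaker
Degree year: 2012
Advisor: 欧珠
Discipline code: 050107
Degree-granting institution: Tibet University (西藏大学)
Thesis submission date: 2012-04-01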

Abstract

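In recent years, a number of national special projects have driven considerable progress in Tibetan information processing research and development. Breakthroughs have been achieved in areas ranging from the unification of standards to the development of key Tibetan base software, establishing the prerequisites for further research and development. Nevertheless, Tibetan information processing technology is still at an early stage, and the lack of application systems such as a Tibetan full-text retrieval system is conspicuous. Because full-text retrieval is an indispensable tool for obtaining information in the information society, this thesis sets out to research and implement a Tibetan full-text retrieval system.
     Research on a Tibetan full-text retrieval system draws on traditional grammatical knowledge of characters, words, sentences, paragraphs, and articles, as well as on information-processing knowledge such as information retrieval principles, word segmentation techniques, query methods, and document relevance ranking algorithms. It must also address the problems of Internet information: high redundancy, uneven quality, a multitude of formats, scattered locations, complex interrelations, and the difficulty users have in expressing their needs. LUCENE is an open-source full-text retrieval toolkit; extending the relevant functions within its framework conventions to implement full-text retrieval for the target system offers a shortcut to solving these problems.
     Building on a study of full-text retrieval theory and of LUCENE-based full-text retrieval systems, this thesis achieves the following results:
     First, a LUCENE-based Tibetan word segmenter is designed and implemented; it supports bigram segmentation of Tibetan, Chinese, and English.
     Second, drawing on a characteristic of Tibetan sentences, namely that the main constituents of a sentence are linked by case particles to express semantic relations, an optimization strategy for the segmenter is proposed, together with a method for splitting off contracted case particles and a method for restoring the Tibetan character after the contracted particle has been split off, in order to improve segmentation accuracy.
     Third, using this segmenter, a LUCENE-based Tibetan full-text retrieval system is designed and implemented; it supports full-text retrieval in Tibetan, Chinese, and English.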
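
The bigram (binary) segmentation named in the first result can be illustrated with a minimal sketch. It is written in plain Python rather than against the LUCENE API, and it is not the thesis's implementation. It assumes that the Tibetan unit is the syllable delimited by the tsheg mark (U+0F0B), that the Chinese unit is a single ideograph, and that English words are kept whole and lowercased; the record does not say how the thesis treats English. The names `TOKEN_RE` and `bigram_tokens` are hypothetical.

```python
import re

TSHEG = "\u0f0b"  # Tibetan syllable delimiter

# One alternative per kind of unit. The tsheg matches nothing here, so the
# scanner skips it and adjacent syllables stay in the same run.
TOKEN_RE = re.compile(
    r"(?P<bo>[\u0f40-\u0fbc]+)"   # a Tibetan syllable: letters, vowel signs, subjoined letters
    r"|(?P<zh>[\u4e00-\u9fff])"   # a single CJK ideograph
    r"|(?P<en>[A-Za-z0-9]+)"      # a Latin-script word or number
    r"|(?P<brk>[^\u0f0b])"        # any other character ends the current run
)


def bigram_tokens(text):
    """Return overlapping bigrams of Tibetan syllables and of CJK characters,
    plus lowercased Latin words as single tokens."""
    tokens, run, run_kind = [], [], None

    def flush():
        # Emit the pending run: a lone unit as is, otherwise overlapping pairs.
        if len(run) == 1:
            tokens.append(run[0])
        else:
            joiner = TSHEG if run_kind == "bo" else ""
            tokens.extend(joiner.join(pair) for pair in zip(run, run[1:]))
        run.clear()

    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "en":
            flush()
            run_kind = None
            tokens.append(match.group().lower())
        elif kind == "brk":
            flush()
            run_kind = None
        else:
            if kind != run_kind:   # script changed: close the previous run
                flush()
            run_kind = kind
            run.append(match.group())
    flush()
    return tokens


print(bigram_tokens("全文检索 Lucene"))  # ['全文', '文检', '检索', 'lucene']
print(bigram_tokens("བོད་ཡིག་གི།"))       # two bigrams of adjacent syllables
```

Overlapping bigrams need no dictionary, which is why this style of segmentation is a common starting point for scripts without spaces between words; the cost is a larger index and some false matches across word boundaries.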
In recent years, through the implementation of national special projects, Tibetan information processing research and development has made great strides. From the unification of standards to the development of key Tibetan-language base software, breakthrough results have been achieved, providing the prerequisites for further research and development. However, Tibetan language information processing technology is still in its infancy, and the gap in application systems such as Tibetan full-text retrieval is prominent. Since full-text retrieval is an indispensable tool for accessing information in the information society, researching and implementing a Tibetan full-text retrieval system is the focus of this thesis.
     Research on a Tibetan full-text retrieval system covers the traditional grammar of characters, words, sentences, paragraphs, and articles, together with knowledge from the field of information processing: information retrieval principles, word segmentation, query methods, document relevance ranking algorithms, and so on. At the same time, it must deal with the redundancy of Internet information, its widely varying quality, the range of formats, scattered locations, complex associations, and the difficulty users have in expressing their needs. LUCENE is an open-source full-text search toolkit; extending its functions within the specification of its framework to give the target system full-text search is a shortcut to resolving these problems.
     Based on a study of full-text search theory and of LUCENE-based full-text search systems, this thesis obtains the following results:
     First, it designs and implements a LUCENE-based Tibetan word segmenter, which supports bigram (binary) segmentation of three languages: Tibetan, Chinese, and English.
     Second, building on a characteristic of Tibetan sentences, in which the main components of a sentence are connected by case particles to express semantic relations, it proposes an optimization strategy for the segmenter. It also proposes a method for splitting contracted case particles and a method for restoring Tibetan characters after the split, so as to improve segmentation accuracy.
     Third, using this segmenter, it designs and implements a LUCENE-based Tibetan full-text retrieval system that supports full-text search in three languages: Tibetan, Chinese, and English.
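
The contracted-particle handling described in the second result can be illustrated with a small sketch. This is not the thesis's algorithm, which the record does not give. It covers only one contracted particle, the genitive འི (U+0F60 U+0F72), and its restoration rule is an assumption based on standard orthography: an open syllable written as two bare consonants takes a final འ when it stands alone, so that letter is put back after the split. The names `split_contracted_genitive` and `split_phrase` are hypothetical.

```python
TSHEG = "\u0f0b"            # Tibetan syllable delimiter
A_CHUNG = "\u0f60"          # the letter 'a-chung
GENITIVE = "\u0f60\u0f72"   # contracted genitive: 'a-chung + vowel sign i


def is_base_consonant(ch):
    # Base (non-subjoined) Tibetan letters occupy U+0F40..U+0F6C.
    return "\u0f40" <= ch <= "\u0f6c"


def split_contracted_genitive(syllable):
    """Return (stem, particle); particle is None when nothing is split off."""
    if syllable == GENITIVE or not syllable.endswith(GENITIVE):
        return syllable, None
    stem = syllable[: -len(GENITIVE)]
    # Assumed restoration rule: a stem of exactly two bare consonants
    # (no vowel sign, no subjoined letter) gets its final 'a-chung back.
    if len(stem) == 2 and all(is_base_consonant(ch) for ch in stem):
        stem += A_CHUNG
    return stem, GENITIVE


def split_phrase(phrase):
    """Split a tsheg-delimited phrase into stems and detached particles."""
    pieces = []
    for syllable in filter(None, phrase.split(TSHEG)):
        stem, particle = split_contracted_genitive(syllable)
        pieces.append(stem)
        if particle is not None:
            pieces.append(particle)
    return pieces


print(split_contracted_genitive("ངའི"))   # stem ང, particle འི
print(split_contracted_genitive("མཁའི"))  # stem restored to མཁའ, particle འི
```

Other contracted particles, such as those written with a final ས or ར, are ambiguous with ordinary suffix letters and would need dictionary or context checks that this sketch does not attempt.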
