CCRLT: A Retrieval Tool for Large-Scale Raw Chinese Corpora, Designed for Linguistic Research


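English title: Chinese Rough Corpus Retrieval Tools for Linguistic Research
Author: Yue Bingci (岳炳词)
Degree level: Master's
Discipline: Computer Applications
Keywords (translated from the Chinese record): index base; corpus; retrieval
Keywords (English record): Index Library; Text Corpus; Retrieval
Degree year: 2001
Advisor: Song Rou (宋柔)
Discipline code: 081203
Degree-granting institution: Beijing University of Technology (北京工业大学)
Thesis submission date: 2001-06-01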

Abstract

Collecting statistics over large-scale corpora, in order to discover linguistic phenomena and to build statistical language models, is a common method in linguistics and computational linguistics. We have developed CCRLT, a retrieval tool for large-scale raw (unannotated) Chinese corpora, to support research in both fields. Compared with traditional full-text retrieval systems, CCRLT has several distinctive features: it operates on raw corpora; its index base is organized as a PAT_ARRAY data structure; it has efficient algorithms for both retrieval and index-base construction; and it supports character-based, word-based, part-of-speech-based, and mixed retrieval. Retrieval results are kept in a defined order. Word-based retrieval supports the dynamic merging of unregistered (out-of-vocabulary) words, so unregistered words can be retrieved, as can proper names such as person names, place names, and enterprise names. CCRLT is independent of any particular lexicon or part-of-speech tag set, which makes it general-purpose and able to meet the needs of different users. Experimental results show that CCRLT builds its index base and answers queries efficiently, making it a valuable retrieval tool for research in linguistics and computational linguistics.
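
The character-based retrieval described in the abstract can be illustrated with a minimal sketch. It assumes that PAT_ARRAY refers to a PAT array (suffix array): a list of all text positions sorted by the suffix that starts at each position. The record does not describe CCRLT's actual algorithms, so the function names `build_pat_array` and `find_occurrences` are hypothetical and the construction shown is the naive one, not the efficient one the thesis reports.

```python
def build_pat_array(text: str) -> list[int]:
    """Return all suffix start positions, sorted by the suffix text.

    Assumption: a naive construction that sorts whole suffixes. It is
    adequate for a demonstration but far too slow for a large corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, pat_array: list[int], query: str) -> list[int]:
    """Return every position where `query` occurs, in suffix order."""
    m = len(query)

    def prefix(k: int) -> str:
        # First m characters of the k-th suffix in sorted order.
        start = pat_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= query.
    lo, hi = 0, len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > query.
    hi = len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid

    # All matches are contiguous in the sorted array.
    return pat_array[first:lo]


if __name__ == "__main__":
    corpus = "研究语言学和计算语言学的研究方法"
    index = build_pat_array(corpus)
    for pos in find_occurrences(corpus, index, "语言学"):
        # Print each hit with a few characters of right-hand context.
        print(pos, corpus[pos:pos + 6])
```

Because the matching suffixes sit next to each other in the sorted array, the hits come back already ordered by their right-hand context. This is one plausible reading of the abstract's remark that retrieval results are kept in order. Word-based and part-of-speech-based retrieval would need segmentation and tagging on top of this, which the sketch does not attempt.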

