Research on and Application of Text Mining Methods in Scientific Research Project Management
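Abstract
In the management of basic-research projects, similarity analysis is a fundamental task: similarity can be used to classify projects, to avoid duplicate applications and duplicate funding, and to select suitable peer experts to review similar projects. Similarity is usually judged from the title, abstract, and keywords of the project proposal, combined with the experience of project managers. However, the number of projects grows rapidly every year, and basic research is characterized by innovation, uncertainty, the crossing and fusion of disciplines, and a constant stream of new viewpoints, new concepts, and new knowledge points. Managers therefore find it hard to analyze similarity according to the real substance of basic-research projects, which poses a great challenge to management work. Analyzing similarity from the knowledge content of projects has thus become a practical need, which requires mining knowledge from projects and examining project management from the perspective of knowledge management.
     Research project proposals are texts written in natural language, and the great majority of basic-research proposals in China are Chinese texts. Knowledge mining of projects therefore becomes text mining of project proposals. This paper studies basic text mining methods in light of the characteristics of basic-research project proposals. The main work is as follows:
     1. A longest-first, dictionary-free segmentation approach suited to Chinese research project texts is proposed. Unlike in English, text segmentation is a basic problem in Chinese text mining. Research project texts contain many technical terms that cannot be split semantically, and new terms keep emerging; this is especially true of Chinese texts of basic-research projects. Existing segmentation methods are not suited to segmenting basic-research project texts, so this paper proposes a longest-first, dictionary-free segmentation approach.
     2. Segmentation methods for Chinese research project texts are proposed. Following the approach above, three methods are proposed: forward string-frequency maximum matching, reverse string-frequency maximum matching, and bidirectional string-frequency maximum matching. Experimental results show that the bidirectional method achieves better segmentation precision. By combining statistical learning with rule-based filtering, these methods can segment out special semantic strings, phrases, and words. A definition of the special semantic string is given, and the soundness of using it to represent the content of research project proposals is analyzed from the perspectives of system wholeness and semantic priority. These methods solve the segmentation problem for Chinese basic-research project proposals and can also be applied to segmenting general text.
     3. A method for acquiring and modeling hierarchical feature terms of research project texts is proposed. Because the feature terms of research projects are semantically hierarchical, a method based on iterative learning is proposed for acquiring hierarchical feature terms from the segmentation results. Iterative learning obtains not only the hierarchical feature terms contained in the segmentation results but also those not contained in them, so the text is represented more completely. On the basis of the hierarchical feature terms, a network is used to represent the semantic hierarchical relations, which yields a model of research project text. Compared with the usual vector space model, this model represents both feature-term information and the semantic relations among feature terms. The model is significant for representing single texts, representing domain texts, and automatically building ontologies and similar resources.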
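The abstract does not spell out how the string-frequency maximum matching methods work, so the following is only a rough illustration of the general idea of longest-first, dictionary-free segmentation, not the author's algorithm. It assumes that a candidate string is accepted when it occurs at least `min_freq` times in the text itself; the function names and the parameters `max_len` and `min_freq` are hypothetical, and the rule-based filtering and the bidirectional combination mentioned above are omitted.

```python
from collections import Counter


def count_substrings(text, max_len):
    """Count every substring of length 2..max_len (overlaps included)."""
    counts = Counter()
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            if i + k > n:
                break
            counts[text[i:i + k]] += 1
    return counts


def forward_segment(text, max_len=6, min_freq=2):
    """Scan left to right, trying the longest candidate first.

    Assumption for this sketch: a candidate is kept if it occurs at
    least min_freq times in the text; otherwise it is shortened by one
    character. If no candidate qualifies, one character is emitted.
    """
    counts = count_substrings(text, max_len)
    tokens = []
    i, n = 0, len(text)
    while i < n:
        for k in range(min(max_len, n - i), 1, -1):
            cand = text[i:i + k]
            if counts[cand] >= min_freq:
                tokens.append(cand)
                i += k
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def reverse_segment(text, max_len=6, min_freq=2):
    """Same idea, scanning right to left."""
    counts = count_substrings(text, max_len)
    tokens = []
    j = len(text)
    while j > 0:
        for k in range(min(max_len, j), 1, -1):
            cand = text[j - k:j]
            if counts[cand] >= min_freq:
                tokens.append(cand)
                j -= k
                break
        else:
            tokens.append(text[j - 1])
            j -= 1
    tokens.reverse()
    return tokens


if __name__ == "__main__":
    sample = "abcxabcyabc"
    print(forward_segment(sample, max_len=3, min_freq=2))
    print(reverse_segment(sample, max_len=3, min_freq=2))
```

Because the only evidence used is the frequency of strings in the text, no word list is needed, which is why an approach of this kind can pick up newly coined terms. A real system would need further filtering, since frequent strings are not always meaningful units.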
Similarity analysis of projects is a basic problem in the management of fundamental research projects. On the basis of similarity, projects can be classified to avoid duplication, and appropriate experts can be selected to evaluate them. Managers usually analyze similarity from the title, abstract, and keywords of the project proposal, together with their own experience. Fundamental research is characterized by innovation, uncertainty, the crossing and fusion of disciplines, and the continual appearance of new viewpoints and concepts. With the rapid increase in the number of projects, it is difficult for project managers to analyze similarity on the basis of what the projects actually mean. This is a great challenge for project management, so analyzing similarity from the knowledge content of projects is a practical requirement. It calls for discovering knowledge from projects and examining research management from the point of view of knowledge management.
    Research project proposals are texts written in natural language, and most fundamental research proposals in China are Chinese texts, so knowledge discovery from projects becomes text mining of proposals. The basic methods of text mining are studied in light of the characteristics of fundamental research project proposals. The main work of this paper is as follows:
    1. A new segmentation idea is proposed that takes longer strings first and needs no dictionary. Unlike in English, segmentation is a basic problem in Chinese text mining. Research project texts contain many technical terms that are semantically indivisible, and new domain-specific terms keep appearing, especially in Chinese texts of fundamental research. Current segmentation methods do not suit such texts, so this paper puts forward a longest-first, dictionary-free idea.
    2. Segmentation methods for Chinese research project texts are proposed. Based on the idea above, three dictionary-free methods are proposed: maximum matching and frequency statistics (MMFS), reverse maximum matching and frequency statistics (RMMFS), and bidirectional maximum matching and frequency statistics (BMMFS). The segmentation results indicate that BMMFS achieves better precision. By combining statistics and rules, these methods can extract special semantic strings, phrases, and words. The
