Topic Discovery and Technology Evolution Based on an LDA Model with Text Preprocessing

English title (as published): Research of Topics Discovery and Tech Evolution Based on Text Preprocessed LDA Model
Authors: WANG Li; SHEN Xiang
Affiliations: National Science Library, Chinese Academy of Sciences; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences
Keywords: LDA model; technology evolution; text preprocessing; visualization; automatic technical term identification
English keywords (as published): LDA model; tech evolution; preprocessed text; visualization; automatic term identification
Journal code: LYTS
Journal: Agricultural Library and Information (农业图书情报)
Publication date: 2019-05-22 15:12
Publisher: Agricultural Library and Information (农业图书情报)
Year: 2019
Issue: v.31; No.274
Funding: NSTL fund project "Specialized Intelligence Services for the National Key R&D Program" (project no. 2018XM06)
Language: Chinese
Page: LYTS201904004
Number of pages: 10
CN: 04
ISSN: 10-1554/G2
Classification number: 21-30

Abstract

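[Purpose] As scientific and technical information resources grow rapidly, quickly discovering research topics through analysis of large text data, and then mining how technologies develop and change under each topic, is important for information work that must respond both comprehensively and quickly. [Methods] For large text data, topic discovery and technology evolution analysis based on an LDA model with text preprocessing were implemented in Python. First, a generalized text preprocessing model was built to identify technical terms automatically. Then an LDA model was built on the technical terms and visualized in order to identify research topics. Finally, a computational model of technology evolution was built on the technical terms to further mine the development and change of technologies. [Results] The method was applied to 43,621 patents in the SiC technology field, covering the full workflow of text preprocessing, topic discovery and visualization, and analysis of technology development and change under a selected topic. Processing ran smoothly and quickly (the whole case took about 10 minutes). [Limitations] In the proposed model of technology evolution under each LDA topic, each document is associated only with its most relevant topic. The evolution results under multi-topic document association have not yet been compared, so further optimization and validation are needed. [Conclusions] The proposed method helps in grasping a technology field quickly and comprehensively. By identifying topics and the technology changes under each topic, a field can be studied at different levels of granularity, which provides valuable clues for subsequent investigation and analysis.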
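The limitation stated in the abstract, that each document is associated only with its most relevant topic, can be illustrated with a minimal sketch. The function names and the small document-topic matrix below are hypothetical and are not taken from the article; the sketch only assumes that an LDA model yields one probability per topic for each document.

```python
from collections import defaultdict


def dominant_topic(topic_probs):
    """Return the index of the highest-probability topic for one document.

    Ties are resolved in favour of the lowest topic index.
    """
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


def group_by_dominant_topic(doc_topic):
    """Map each topic index to the ids of documents whose dominant topic it is."""
    groups = defaultdict(list)
    for doc_id, probs in enumerate(doc_topic):
        groups[dominant_topic(probs)].append(doc_id)
    return dict(groups)


# Hypothetical document-topic matrix: 4 documents, 3 topics (each row sums to 1).
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.10, 0.15, 0.75],
    [0.55, 0.40, 0.05],
    [0.05, 0.90, 0.05],
]

groups = group_by_dominant_topic(doc_topic)
print(groups)
```

Under this single-assignment scheme, document 2 contributes only to topic 0 even though 40% of its probability mass lies on topic 1. A multi-topic variant, which the abstract leaves to future work, would keep every topic above some threshold instead.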
[Objective] Computational science and data science are driving intelligent analysis and information services today, and machine learning methods for text analysis are changing traditional analysis methods. This article discusses the benefits of unsupervised learning approaches in patent text mining. [Methods] Patent data from the SiC industry were preprocessed with a filter model based on the NLTK toolkit to identify technical terms, then clustered with a Latent Dirichlet Allocation model to find latent topics, which were visualized. Using group operations, the top terms ranked by tf-idf in each year were used to reveal the evolution of R&D focus. [Results] The research demonstrates the proposed method on 43,621 SiC patents. The results show 28 research and development topics with their technical terms in the SiC industry, and present an evolution of R&D focus based on the newly emerging terms of each year, which provides clues for more detailed analysis later. Finally, we discuss the clues for the R&D focus in the SiC industry. [Limitations] Multi-topic assignment of documents was not compared for the R&D focus evolution in this article; this will be addressed in future work. [Conclusions] The results show an efficient way to find the evolution of technology focus from large-scale text data.
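The per-year ranking of terms by tf-idf and the detection of newly emerging terms can be sketched as follows. The abstract does not give the exact scoring formula, so this sketch assumes that a term's yearly score is the sum of its tf-idf values over that year's documents, with idf computed over the whole corpus. The function names and sample terms are hypothetical, and only the standard library is used.

```python
import math
from collections import Counter, defaultdict


def yearly_top_terms(docs, top_n=2):
    """docs: list of (year, [terms]). Return {year: top_n terms by summed tf-idf}."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for _, terms in docs for term in set(terms))
    scores = defaultdict(Counter)
    for year, terms in docs:
        for term, count in Counter(terms).items():
            tf = count / len(terms)
            # Assumed smoothing: +1 keeps terms found in every document above zero.
            idf = math.log(n_docs / df[term]) + 1.0
            scores[year][term] += tf * idf
    return {
        year: [t for t, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for year, c in sorted(scores.items())
    }


def emerging_terms(top_by_year):
    """Terms that enter a year's top list without appearing in any earlier year's list."""
    seen, emerging = set(), {}
    for year in sorted(top_by_year):
        emerging[year] = [t for t in top_by_year[year] if t not in seen]
        seen.update(top_by_year[year])
    return emerging


# Hypothetical patents, already reduced to technical terms and tagged with a year.
docs = [
    (2016, ["sic substrate", "epitaxial layer", "sic substrate"]),
    (2016, ["sic substrate", "gate oxide"]),
    (2017, ["trench mosfet", "gate oxide", "trench mosfet"]),
    (2017, ["trench mosfet", "sic substrate"]),
]

top_by_year = yearly_top_terms(docs, top_n=2)
new_by_year = emerging_terms(top_by_year)
print(top_by_year)
print(new_by_year)
```

In practice the documents passed in would be those assigned to one topic, so that the yearly lists describe the evolution within that topic rather than across the whole corpus.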

