Entity Relation Extraction in Chinese Domain Based on Distant Supervision with Multi-Feature Fusion
  • Authors: WANG Bin; GUO Jianyi; XIAN Yantuan; WANG Hongbin; YU Zhengtao
  • Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology
  • Keywords: Distant Supervision; Entity Relation Extraction; Domain Knowledge Base; Feature Fusion; Latent Dirichlet Allocation Topic Model
  • Journal: Pattern Recognition and Artificial Intelligence (模式识别与人工智能; CNKI journal code MSSB)
  • Publication Date: 2019-02-15
  • Year: 2019
  • Volume/Issue: Vol.32, No.188 (2019, Issue 02)
  • Fund: National Natural Science Foundation of China (No.61562052, 61363044, 61462054)
  • Language: Chinese
  • Pages: 39-49 (11 pages)
  • CN: 34-1089/TP
  • CNKI Record Number: MSSB201902005
Abstract
To extract Chinese domain entity relations from unlabeled text, a hybrid method for domain entity-attribute relation extraction based on distant supervision is proposed. Structured relation triples already stored in a domain knowledge base are used to automatically obtain training corpora from natural-language text. Since the data labeled by distant supervision contain a large amount of noise, the latent Dirichlet allocation (LDA) topic model is used to extract topic keywords, and denoising is then performed by computing the similarity between the keywords and the relation types and by keyword pattern matching. Finally, part-of-speech features, dependency features and phrase syntax-tree features are extracted and fused to train the relation extraction model. Experiments show that fusing the three features yields a higher F-score and better extraction performance.
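To make the sentence-labelling step concrete, the following is a minimal Python sketch, not the authors' implementation: it applies the distant-supervision assumption that any sentence mentioning both entities of a knowledge-base triple can be treated as a training instance for that triple's relation type. The triples, relation names and sentences below are hypothetical examples.

    # Minimal distant-supervision labelling sketch (illustrative only).
    from collections import defaultdict

    def distant_label(triples, sentences):
        """Map each relation type to the sentences that contain both entities of a triple."""
        labelled = defaultdict(list)
        for head, relation, tail in triples:
            for sentence in sentences:
                if head in sentence and tail in sentence:
                    labelled[relation].append((head, tail, sentence))
        return labelled

    # Hypothetical domain triples and unlabeled sentences.
    triples = [("普洱茶", "产地", "云南"), ("普洱茶", "类别", "黑茶")]
    sentences = ["普洱茶主要产于云南等地。", "普洱茶在分类上属于黑茶。"]
    print(dict(distant_label(triples, sentences)))

Instances collected this way are what the LDA-based keyword similarity and pattern matching described above would then filter, before the part-of-speech, dependency and syntax-tree features are extracted for model training.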
References
[1] CRAVEN M, KUMLIEN J. Constructing Biological Knowledge Bases by Extracting Information from Text Sources[C/OL]. [2018-08-25]. http://www.aaai.org/Papers/ISMB/1999/ISMB99-010.pdf.
    [2] MINTZ M, BILLS S, SNOW R, et al. Distant Supervision for Relation Extraction without Labeled Data // Proc of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Stroudsburg, USA: ACL, 2009: 1003-1011.
    [3] 欧阳丹彤,瞿剑峰,叶育鑫.关系抽取中基于本体的远监督样本扩充.软件学报, 2014, 25(9): 2088-2101.(OUYANG D T, QU J F, YE Y X. Extending Training Set in Distant Supervision by Ontology for Relation Extraction. Journal of Software, 2014, 25(9): 2088-2101.)
    [4] 贾真,何大可,杨燕,等.基于弱监督学习的中文网络百科关系抽取.智能系统学报, 2015, 10(1): 113-119.(JIA Z, HE D K, YANG Y, et al. Relation Extraction from Chinese Online Encyclopedia Based on Weakly Supervised Learning. CAAI Transactions on Intelligent Systems,2015,10(1): 113-119.)
    [5] RIEDEL S, YAO L M, MCCALLUM A. Modeling Relations and Their Mentions without Labeled Text // Proc of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Berlin, Germany: Springer-Verlag, 2010: 148-163.
    [6] FAN M, ZHAO D L, ZHOU Q, et al. Errata: Distant Supervision for Relation Extraction with Matrix Completion[C/OL]. [2018-08-25]. https://arxiv.org/pdf/1411.4455.pdf.
    [7] TAKAMATSU S, SATO I, NAKAGAWA H. Reducing Wrong Labels in Distant Supervision for Relation Extraction // Proc of the 50th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, USA: ACL, 2012: 721-729.
    [8] QU J F, OUYANG D T, HUA W, et al. Distant Supervision for Neural Relation Extraction Integrated with Word Attention and Property Features. Neural Networks, 2018, 100: 59-69.
    [9] JI G L, LIU K, HE S Z, et al. Distant Supervision for Relation Extraction with Sentence-Level Attention and Entity Descriptions // Proc of the 31st AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2017: 3060-3066.
    [10] 刘剑,许洪波,唐慧丰,等.面向中文网络百科的语义知识库构建.系统仿真学报, 2016, 28(3): 542-548.(LIU J, XU H B, TANG H F, et al. Semantic Knowledge Base Constructed from Chinese Online Encyclopedia. Journal of System Simulation, 2016, 28(3): 542-548.)
    [11] XU B, XU Y, LIANG J Q, et al. CN-DBpedia: A Never-Ending Chinese Knowledge Extraction System // Proc of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Berlin, Germany: Springer, 2017: 428-438.
    [12] 张巧燕,林民,张树钧.基于维基百科的领域概念语义知识库的自动构建方法.计算机应用研究, 2018, 35(1): 130-134.(ZHANG Q Y, LIN M, ZHANG S J. Research on Automatic Construction of Domain Concepts on Wikipedia Semantic Knowledge Base. Application Research of Computers, 2018, 35(1): 130-134.)
    [13] 王磊,董玮,董少林,等.基于在线百科的知识库构建方法研究.信息系统工程, 2018(1): 110-111.(WANG L, DONG W, DONG S L, et al. Research on the Construction Method of Knowledge Base Based on Online Encyclopedia. Information Systems Engineering, 2018(1): 110-111.)
    [14] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient Estimation of Word Representations in Vector Space[C/OL]. [2018-08-25]. https://arxiv.org/pdf/1301.3781.pdf.
    [15] GOLDBERG Y, LEVY O. Word2vec Explained: Deriving Mikolov et al.′s Negative-Sampling Word-Embedding Method[C/OL]. [2018-08-25]. https://arxiv.org/pdf/1402.3722.pdf.
    [16] BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003, 3: 993-1022.
    [17] CHEN W H, ZHANG X. Research on Text Categorization Model Based on LDA-KNN // Proc of the 2nd IEEE Advanced Information Technology, Electronic and Automation Control Conference. Washington, USA: IEEE, 2017: 2719-2726.
    [18] ZHU J R, WANG Q L, LIU Y, et al. A Method of Optimizing LDA Result Purity Based on Semantic Similarity // Proc of the 32nd Youth Academic Annual Conference of Chinese Association of Automation. Washington, USA: IEEE, 2017: 361-365.
    [19] KIM Y. Convolutional Neural Networks for Sentence Classification[C/OL]. [2018-08-25]. https://arxiv.org/pdf/1408.5882.pdf.
    [20] VAPNIK V N. The Nature of Statistical Learning Theory. New York, USA: Springer-Verlag, 1995.
    [21] HOCHREITER S, SCHMIDHUBER J. Long Short-Term Memory. Neural Computation, 1997, 9(8): 1735-1780.
    [22] LUONG M T, PHAM H, MANNING C D. Effective Approaches to Attention-Based Neural Machine Translation[C/OL]. [2018-08-25]. https://nlp.stanford.edu/pubs/emnlp15_attn.pdf.
