Identifying Coordinate Text Blocks in Discourses


Title: Identifying Coordinate Text Blocks in Discourses (original Chinese title: 篇章级并列关系文本块识别方法研究)
Authors: Pei Jingjing; Le Xiaoqiu
Affiliations: National Science Library, Chinese Academy of Sciences; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences
Keywords: Coordinate Relationship; Text Representation; Text Block; Deep Learning
Journal code: XDTQ
Journal: Data Analysis and Knowledge Discovery (数据分析与知识发现)
Publication date: 2019-05-25
Publisher: Data Analysis and Knowledge Discovery
Year: 2019
Issue: v.3; No.29
Language: Chinese
Page: XDTQ201905006
Page count: 6
CN: 05
ISSN: 10-1478/G2
Classification number: 55-60

Abstract

[Objective] This paper identifies text blocks in scientific papers that are distributed across different paragraphs yet stand in a coordinate relationship both semantically and in page layout, captures the textual features of that relationship, and provides a pre-trained model for recognizing coordinate knowledge objects. [Methods] Taking the paragraph as the processing unit, we add visual layout features to character and word vectors to build a multi-dimensional representation of coordinate text at different levels, then train a convolutional neural network (CNN) text classifier on annotated data to obtain a recognition model for coordinate text blocks. [Results] In experiments on a manually annotated dataset of scientific papers, the model classified coordinate text blocks with 96% accuracy, about 3% higher than the baseline model, and its recall was about 2% higher. [Limitations] The method applies only to HTML web page text; other text formats require further research and experiments. [Conclusions] With the paragraph as the processing unit and multiple features combined, a CNN model can effectively identify discourse-level coordinate text blocks and can serve as a pre-trained model for recognizing coordinate knowledge objects.
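
The pipeline described in the abstract (per-paragraph character and word vectors, extended with layout features, fed to a CNN classifier) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. The function names, the vector dimensions, the two layout features, the choice to append the layout features to every token row, and the random untrained weights are all assumptions made for the example.

```python
import math
import random


def build_paragraph_matrix(char_vecs, word_vecs, layout_feats):
    """Return one row per token: character vector + word vector + layout features.

    Appending the same paragraph-level layout features to every row is an
    assumption of this sketch; the abstract does not say how they are attached.
    """
    if len(char_vecs) != len(word_vecs):
        raise ValueError("need one character vector and one word vector per token")
    return [list(c) + list(w) + list(layout_feats)
            for c, w in zip(char_vecs, word_vecs)]


def conv_relu_max_pool(matrix, filters, width):
    """Apply each filter to every window of `width` rows, then ReLU and max-pool.

    Starting the running maximum at 0.0 is equivalent to ReLU followed by
    max-over-time pooling.
    """
    pooled = []
    for filt in filters:
        best = 0.0
        for start in range(len(matrix) - width + 1):
            window = [x for row in matrix[start:start + width] for x in row]
            best = max(best, sum(a * b for a, b in zip(window, filt)))
        pooled.append(best)
    return pooled


def coordinate_probability(pooled, weights, bias):
    """Logistic output layer: probability that the paragraph is a coordinate block."""
    z = sum(p * w for p, w in zip(pooled, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy usage with random, untrained parameters (hypothetical sizes).
rng = random.Random(0)
n_tokens, char_dim, word_dim = 4, 2, 3
char_vecs = [[rng.uniform(-1, 1) for _ in range(char_dim)] for _ in range(n_tokens)]
word_vecs = [[rng.uniform(-1, 1) for _ in range(word_dim)] for _ in range(n_tokens)]
layout_feats = [1.0, 0.0]  # hypothetical features: [is_list_item, is_heading]

matrix = build_paragraph_matrix(char_vecs, word_vecs, layout_feats)
width, n_filters = 2, 3
row_dim = len(matrix[0])
filters = [[rng.uniform(-0.5, 0.5) for _ in range(width * row_dim)]
           for _ in range(n_filters)]
pooled = conv_relu_max_pool(matrix, filters, width)
out_weights = [rng.uniform(-1, 1) for _ in range(n_filters)]
prob = coordinate_probability(pooled, out_weights, 0.0)
```

In a real system the filters and output weights would be learned from the annotated paragraphs, and a deep learning framework would replace these hand-written loops.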

