Language Rhythm Extraction and Its Application in Text Analysis
Abstract
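Language rhythm is an important and pervasive feature of language, and it has been widely studied and applied in fields such as speech recognition and literary aesthetics. It is a complex, composite concept: each field emphasizes different aspects, so definitions and analytical methods differ, and no concrete quantitative definition or analysis of the rhythm information in language has yet been given. Drawing on results from several disciplines and analyzing language from multiple perspectives, this thesis proposes a method for defining and constructing language rhythm, extracts and analyzes features of the resulting rhythm, and applies them with good results in text analysis. The main work is as follows: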
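     (1) A multi-level definition of language rhythm is established by studying the essence and origin of language. Language is composite and complex; linguistic expression involves physiological characteristics, grammatical phenomena, emotional content, logical relations, and more. Complex as it is, language is rule-governed: all of this content exhibits rhythmic characteristics, so the rhythmic features observable in language can be used to analyze the deeper content it carries. Language rhythm is therefore an important feature of language. This thesis divides the rhythm in language into natural rhythm, grammatical rhythm, logical rhythm, and emotional rhythm. Each of the four is defined according to its distinct rhythmic properties, and rhythm markers are identified and defined for each according to how it manifests in language. The characteristics of each kind of rhythm are analyzed in depth, and its properties are examined and argued in detail.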
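     (2) Methods for extracting language rhythm are analyzed and designed. Building on the definition of the content and properties of language rhythm, and on the rhythmic characteristics of language, extraction methods are studied and designed. Based on the respective characteristics of natural, grammatical, logical, and emotional rhythm, together with the rhythm markers present in text, character distance, and related information, concepts such as the language rhythm unit and the rhythm sequence are defined, and the extraction method for each kind of rhythm is described and defined. The extraction steps are set out in detail, and the properties of the constructed language rhythm are analyzed.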
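     (3) Methods for extracting features from language rhythm are studied and designed. Two methods are chosen based on the rhythmic characteristics of language rhythm. The first builds a state transition matrix: as a text unfolds, its rhythm keeps changing, that is, it can be viewed as moving continually between states, so a state transition matrix can capture the features of these state changes. The second exploits the adjacency relations between rhythm units and proposes a language rhythm network for feature extraction; the network is defined and its construction method is described in detail.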
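     (4) Language rhythm feature analysis is applied to several text analysis tasks, and its effectiveness is verified experimentally. For text classification, author identification, literary style identification, same-author verification, and topic identification, Bayesian classification and K-means clustering are used to analyze the rhythm features of the experimental documents, with good results. The experiments show that language rhythm feature analysis can handle multiple text analysis tasks well.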
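     (5) Some essential phenomena of language are explored by analyzing the properties of language rhythm networks. Analysis of the rhythm networks of the experimental corpus shows that the language rhythm network has "small-world" properties. Complex-network analysis of the language rhythm of classic works shows a short average distance and a high clustering coefficient, both marked complex-network features, along with a large product of average distance and clustering coefficient; complex-network analysis is thus used to identify relatively salient features of classic works. The rhythm networks of works by renowned authors likewise show a short average distance, a high clustering coefficient, and a high product of the two, so an author's command of language can be analyzed from the perspective of complex-network analysis.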
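The extraction and state-transition steps in items (2) and (3) can be illustrated with a minimal sketch. It is not the thesis's method: it assumes, purely for illustration, that punctuation marks serve as rhythm markers and that a rhythm unit is the character distance between consecutive markers. The names `MARKERS`, `rhythm_sequence`, and `transition_matrix` are hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: punctuation marks stand in for rhythm markers.
# The thesis defines its own markers for each kind of rhythm.
MARKERS = set(",。!?;:、,.!?;:")


def rhythm_sequence(text):
    """Return the character distances between consecutive markers."""
    seq, run = [], 0
    for ch in text:
        if ch in MARKERS:
            if run:
                seq.append(run)
            run = 0
        elif not ch.isspace():
            run += 1
    if run:  # trailing characters after the last marker
        seq.append(run)
    return seq


def transition_matrix(seq):
    """Row-normalised first-order transition probabilities between states."""
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return {
        a: {b: n / sum(row.values()) for b, n in row.items()}
        for a, row in counts.items()
    }


seq = rhythm_sequence("ab,cde.ab,cde!ab")
print(seq)                     # [2, 3, 2, 3, 2]
print(transition_matrix(seq))  # {2: {3: 1.0}, 3: {2: 1.0}}
```

The matrix is stored as a nested dictionary so that only observed states take up space; the rows could then be flattened into a feature vector for a classifier.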
Language rhythm is an important feature of language. It is widespread and is applied in speech recognition, literary aesthetics, and other fields. It is a complex, composite concept, and no unified, widely recognized definition exists. Each research field has its own view of it, which makes language rhythm difficult to study quantitatively. Synthesizing results across these fields, this dissertation addresses how to define and build language rhythm. It then extracts rhythm features for analysis, with good results, showing that language rhythm features are suitable for text analysis. The main contents of this dissertation are as follows:
     (1) By studying the essence and nature of language, a hierarchical definition of language rhythm is established. Language is very complex: it involves natural, grammatical, logical, emotional, and other content. Yet there is a rule in it: rhythm. Rhythm is a salient characteristic of language. Language rhythm is divided into four kinds: natural rhythm, grammatical rhythm, logical rhythm, and emotional rhythm. Each is defined in this dissertation, and its properties are discussed in full.
     (2) Methods for extracting language rhythm are analyzed and designed. Studying the content of language reveals the rhythm within it. Taking the characteristics of each kind into account, methods for extracting natural, grammatical, logical, and emotional rhythm are described. The language rhythm unit, the rhythm sequence, and related concepts are also defined.
     (3) Methods for extracting features from language rhythm are developed. Two methods are proposed: building a state transition matrix of language rhythm, and constructing a language rhythm network. Because language unfolds sequentially, it can be described as moving from one state to another, so a state transition matrix can represent language rhythm. Because each rhythm unit is adjacent to another, a relation exists between neighbors; the language rhythm network takes rhythm units as nodes and the adjacency between two neighboring units as edges.
     (4) Language rhythm features are applied to text analysis, and experiments are designed to verify them. Tasks such as text classification, author identification, style identification, and topic identification are carried out successfully using language rhythm features. A Bayesian classifier and k-means are used to analyze the features. The experiments show that language rhythm features are suitable for text analysis.
     (5) By analyzing the features of the language rhythm network, some aspects of the nature of language are discussed. The analysis shows that it is a complex network with a short average shortest-path distance, a high clustering coefficient, and scale-free behavior. The language rhythm network of masterpieces has the salient features of a "small-world" network, and the product of its average shortest-path distance and clustering coefficient is markedly high. The same holds for works by renowned authors. It is concluded that an author's command of language can be analyzed through the language rhythm network.
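The network construction in item (3) and the two statistics named in item (5) can be sketched as follows. This is an illustration under assumptions, not the dissertation's construction: nodes are taken to be distinct rhythm units (here plain integers), an undirected edge joins two units that occur next to each other in the rhythm sequence, and the function names are hypothetical.

```python
from collections import deque


def build_network(seq):
    """Undirected graph: one node per distinct unit, an edge between adjacent units."""
    adj = {u: set() for u in seq}
    for a, b in zip(seq, seq[1:]):
        if a != b:  # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_distance(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def clustering_coefficient(adj):
    """Average local clustering coefficient; nodes of degree < 2 count as 0."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        # Count edges among the neighbours (each unordered pair once).
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


net = build_network([1, 2, 3, 1, 4])
L = average_distance(net)        # 16 / 12
C = clustering_coefficient(net)  # 7 / 12
print(L, C, L * C)
```

A short average distance together with a high clustering coefficient is the usual signature of a small-world network; the product `L * C` is printed only because the abstract refers to the product of the two quantities.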