Research on Topic Tracking and Evolution Analysis Techniques
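Abstract
Topic tracking and evolution analysis techniques aim to present the topics a user follows in the most intuitive way, so that the user can readily gain a full picture of how a topic arose and developed. They have significant theoretical value and practical importance in both military and civilian applications. This thesis studies topic tracking, topic evolution analysis, and event discovery and relation analysis within topics, and makes the following four contributions:
     (1) Existing topic tracking and evolution analysis algorithms treat a topic as a flat collection of news stories and ignore its internal structure. By analyzing the relations among the elements inside a topic, and taking into account the temporal nature of the tracking and evolution analysis tasks, this thesis builds a topic structure model, which provides the modeling foundation for topic tracking and evolution analysis.
     (2) To address topic drift, a topic tracking algorithm based on subtopic feedback is proposed. The algorithm uses the idea of new event detection to divide the news stream into time slices and revises the topic vector promptly as the topic drifts. Experiments show that the algorithm adapts effectively to topic drift and improves tracking recall compared with traditional algorithms.
     (3) Topic tracking alone cannot analyze or represent how a topic evolves. To address this, and drawing on ideas from blog community evolution analysis, a topic evolution analysis algorithm based on subtopic similarity is proposed. Experiments show that the algorithm accurately presents the development and evolution of a topic.
     (4) Based on the topic structure model, and drawing on ideas from temporal text mining, an event discovery algorithm based on subtopic integration is proposed, and the event evolution analysis algorithm is improved on that basis. The algorithm takes full account of the internal structural features of a topic, and experiments demonstrate its effectiveness.
     Finally, the thesis gives the design and implementation details of a prototype system for topic tracking and evolution analysis, summarizes the work, and outlines directions for future work.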
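As an illustration of contribution (2), the following is a minimal sketch of tracking with a topic vector that is revised after each time slice. It is not the thesis's algorithm: the sparse dictionary vectors, the cosine measure, the `threshold` and `alpha` parameters, and the use of the centroid of on-topic stories as a stand-in for the detected subtopic are all assumptions made for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def track(topic_vec, time_slices, threshold=0.2, alpha=0.5):
    """Track a topic over time slices, revising the topic vector per slice.

    time_slices: list of slices, each a list of story vectors (dicts).
    threshold:   hypothetical on-topic cutoff for cosine similarity.
    alpha:       hypothetical feedback weight; 0 freezes the topic vector.
    Returns (stories tracked per slice, final topic vector).
    """
    topic = dict(topic_vec)  # copy so the caller's vector is not modified
    tracked = []
    for stories in time_slices:
        hits = [s for s in stories if cosine(topic, s) >= threshold]
        tracked.append(hits)
        if hits:
            # Centroid of this slice's on-topic stories (assumed subtopic proxy).
            centroid = Counter()
            for s in hits:
                for term, w in s.items():
                    centroid[term] += w / len(hits)
            # Feedback step: interpolate the old topic vector with the centroid.
            terms = set(topic) | set(centroid)
            topic = {t: (1 - alpha) * topic.get(t, 0.0) + alpha * centroid.get(t, 0.0)
                     for t in terms}
    return tracked, topic
```

With `alpha=0` the topic vector never changes, so a later story that shares only newly introduced terms with the topic is missed; with feedback enabled, those terms enter the topic vector and such a story can be tracked.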
Topic tracking and evolution analysis is an event-based information organization task that tracks the reappearance of news topics a user is interested in and analyzes how those topics evolve. Its purpose is to organize information efficiently and help people find what they want intuitively. It has theoretical and practical value in both military and civilian fields. This dissertation studies topic tracking algorithms, topic evolution analysis methods, and event discovery and evolution analysis techniques. Its major contributions are as follows:
     (1) After analyzing the structure of a topic and the related concepts, we propose a Topic Structure Model consisting of a news layer, a subtopic layer and an event layer. Viewing a topic through this structure, rather than as the traditional flat news layer, allows more factors to be taken into account. The model lays the foundation for topic tracking and evolution analysis.
     (2) Topic evolution is caused by shifts in the various aspects of a topic. Based on this observation, this dissertation proposes a Subtopic Feedback based Topic Tracking Algorithm, which uses the idea of New Event Detection to organize the news temporally. The experiments in this dissertation show that the algorithm adapts well to the evolution of a topic.
     (3) After analyzing topic evolution, this dissertation proposes a Subtopic Similarity based Topic Evolution Analysis Algorithm. The algorithm analyzes the evolution of a topic through the relations between subtopics and can show the process of topic evolution at the granularity of time slices. The experiments in this dissertation demonstrate the effectiveness of the algorithm.
     (4) Based on the Topic Structure Model, this dissertation proposes an Event Discovery and Evolution Analysis Algorithm. The algorithm takes time, location, person and organization attributes into account. The experiments in this dissertation show that the algorithm is effective.
     Finally, the design and implementation of a prototype system for topic tracking and evolution analysis are presented, the work of this dissertation is summarized, and directions for future development of topic tracking and evolution analysis technologies are indicated.
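Contribution (3) describes analyzing evolution through the relations between subtopics at time-slice granularity. A minimal sketch of that idea is given below, assuming each time slice has already been clustered into subtopic vectors. The dictionary representation, the cosine measure, and the `threshold` value are assumptions for illustration and are not taken from the dissertation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def link_subtopics(slices, threshold=0.4):
    """Link similar subtopics in adjacent time slices.

    slices:    list of time slices, each a list of subtopic vectors (dicts).
    threshold: hypothetical minimum similarity for an evolution link.
    Returns edges (k, i, j, sim): subtopic i of slice k evolves into
    subtopic j of slice k + 1 with cosine similarity sim.
    """
    edges = []
    for k in range(len(slices) - 1):
        for i, prev in enumerate(slices[k]):
            for j, nxt in enumerate(slices[k + 1]):
                sim = cosine(prev, nxt)
                if sim >= threshold:
                    edges.append((k, i, j, sim))
    return edges
```

The resulting edge list can be read as a graph over time slices: a subtopic with no incoming edge marks a new aspect of the topic, and one with no outgoing edge marks an aspect that has faded.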