Topic Detection and Tracking Based on Dirichlet Process Mixture Model



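English title: Topic Detection and Tracking Based on Dirichlet Process Mixture Model
Author: Wang Chan (王婵)
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords (translated from the Chinese): topic detection and tracking; topic detection; topic tracking; Dirichlet process mixture model; Gibbs sampling; prior knowledge of topics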
English keywords: topic detection and tracking; topic detection; topic tracking; Dirichlet process mixture model; Gibbs sampling; prior knowledge of known topic
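Degree year: 2013
Supervisor: Wang Xiaojie (王小捷)
Discipline code: 0812
Degree-granting institution: Beijing University of Posts and Telecommunications
Thesis submission date: 2013-05-10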

Abstract

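The Internet has become an important channel through which people obtain news. Grouping existing news stories by topic, and then tracking new stories on specific topics and returning them to users, not only saves users time in finding relevant news but also provides an effective topic-based way to organize online news data, and there is broad practical demand for it. Achieving this requires solving two key problems: first, how to automatically group the news stories initially presented to the user according to whether they concern the same or different topics; second, how to automatically decide whether a newly arriving story belongs to a known topic or to a new one. These two problems are topic detection and topic tracking, respectively.
     Topic detection and tracking has been studied for nearly twenty years and has made considerable progress, but some problems remain: for example, how to determine the number of topics in the topic detection task, and the data sparseness, topic drift, and topic deviation problems faced by the topic tracking task.
     To address these problems, this thesis studies topic detection and topic tracking techniques in turn and proposes a series of effective solutions within the unified framework of the Dirichlet process mixture model (DPMM). Finally, it combines these solutions into a system scheme that meets application needs such as saving users time in obtaining news and organizing Internet news data by topic. The main work and results of the thesis are as follows: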
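     (1) To address the difficulty of fixing the number of topics in advance when prior knowledge is lacking, DPMM is introduced into topic detection research and a DPMM-based topic detection model is proposed. The model does not require the number of topics to be given in advance; instead, the number is determined automatically from the input news stories. The model assumes that every story corresponds to a distribution over topics and takes the topic with the highest probability as the story's topic label. Experiments show that the DPMM-based topic detection model achieves better detection performance than existing methods: its lowest detection cost is only 0.0981, more than 50% lower than that of topic detection models based on traditional clustering algorithms.
     (2) A Gibbs sampling method that takes contextual information into account (C_Gibbs) is proposed. When computing the sampling probability for a word, the method also considers the other words in its context, so as to model the correlations between words in the same story. Experiments show that, compared with standard Gibbs sampling, parameter inference based on C_Gibbs substantially improves the performance of the detection system.
     (3) A DPMM that effectively incorporates information about the target topics is proposed for static topic tracking. During Gibbs-sampling-based parameter inference, the model incorporates target-topic information to obtain the relevance between each story and each target topic. The final tracking result is then determined by voting over the results of multiple Gibbs sampling runs. Experimental results show that the model needs only a small number of seed stories to improve tracking performance significantly: its lowest tracking cost is only 0.0723, 45% lower than that of a topic tracking model based on a unigram language model. The voting method also keeps performance stable.
     (4) To address topic drift in the topic tracking task and the topic deviation observed in existing adaptive methods, a new adaptive topic tracking method is proposed on the basis of the DPMM-based static tracking model. Its basic idea is to take tracking feedback into account during tracking and, when computing topic-story relevance, to assign the feedback a parameter, M_reli, that controls the error introduced by feedback from irrelevant stories. Experimental results show that the method not only alleviates topic drift to some extent but also effectively suppresses the topic deviation seen in existing adaptive algorithms. The model's lowest tracking cost is only 0.0677, 6% lower than that of the static tracking model.
     (5) Combining the topic detection and tracking techniques proposed in the thesis, a topic detection and tracking system scheme is designed to meet the application needs described above. The system first uses topic detection and tracking to organize the news story stream into story clusters, each corresponding to one topic; it also extracts tags describing the topic content from the stream and presents the related stories and tags together to the user, thereby saving users time in obtaining news and organizing Internet news data by topic.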
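
Item (1) says the model determines the number of topics from the data rather than taking it as an input. A minimal sketch of the mechanism that makes this possible in a Dirichlet process mixture is the Chinese restaurant process prior: each new item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The function name `crp_assign`, the parameter `alpha`, and the seed are illustrative assumptions and do not come from the thesis. The thesis model would also weigh each cluster by how well it explains the story's words, which this sketch omits.

```python
import random


def crp_assign(n_items, alpha, seed=0):
    """Assign n_items to clusters under a Chinese restaurant process prior.

    alpha (> 0) is the concentration parameter: larger values open
    new clusters more readily. Returns (labels, counts).
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of items currently in cluster k
    labels = []   # labels[i] = cluster index chosen for item i
    for _ in range(n_items):
        # Existing cluster k has weight counts[k]; a new cluster has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assign(200, alpha=1.0, seed=1)
    print("clusters opened:", len(counts))
```

The number of clusters is never passed in; it is simply `len(counts)` after all items have been seated, and it tends to grow slowly as more items arrive.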
The Internet has become an important way of obtaining news. Grouping large volumes of news stories by their latent topics and tracking news on a specific topic not only reduces the time users spend finding news of interest but also offers an efficient topic-oriented way to organize information. Two key problems must be solved to implement topic-oriented information organization: how to automatically group the initial news stories according to the latent topics they discuss; and how to automatically associate incoming stories with topics that are known in advance, or cluster them into new topics. These two problems correspond to topic detection and topic tracking.
     Much progress has been made in research on topic detection and tracking, but some shortcomings remain: for instance, how to decide the number of topics precisely in the topic detection task, and how to deal with serious data sparseness, topic drift, and topic deviation in the topic tracking task.
     To overcome these problems, this thesis investigates a Bayesian nonparametric approach, the Dirichlet Process Mixture Model (DPMM). First, DPMM is applied to topic detection and topic tracking separately. Then DPMM is refined to resolve the two tasks simultaneously and is verified to be effective under various data settings. Finally, by integrating topic detection and tracking, a system scheme is designed to reduce the time users spend finding news of interest and to meet the application requirement of topic-oriented organization of Internet information. The main research work and achievements are as follows:
     (1) To overcome the subjectivity in determining the number of topics when prior knowledge of the topics is lacking, a topic detection model based on DPMM is proposed. The model does not fix the number of topics but determines it automatically while processing the news stories. DPMM assumes that every story corresponds to a topic distribution and assigns the story the topic with the maximum probability. The experimental results indicate that the DPMM-based topic detection model performs better than several existing methods. The lowest detection error cost is 0.0981, more than 50% lower than that of traditional clustering-based topic detection models.
     (2) To relax the word independence assumption in DPMM, contextual information is introduced into Gibbs sampling during parameter inference. The improved sampling method takes contextual words into account when computing the sampling probability of a word, which reflects the real word correlations in natural language. The experimental results show that the improved parameter inference method yields better topic detection performance.
     (3) To alleviate the shortage of on-topic stories in the static topic tracking task, prior knowledge of the known topics is exploited in the Gibbs sampling procedure. The tracking results are then obtained by voting over the Gibbs sampling results. The experiments indicate that the prior knowledge improves tracking performance significantly even with only a few on-topic stories. The lowest tracking error cost is 0.0723, 45% lower than that of the topic tracking method based on a unigram model. Moreover, the voting method ensures stable performance.
     (4) To overcome topic drift and the topic deviation introduced by existing adaptive learning mechanisms, the thesis presents a new adaptive tracking method based on DPMM. Its basic idea is to attach a metric, M_reli, to tracking feedback in order to control the errors introduced by feedback from off-topic stories. The experimental results show that the adaptive DPMM model, without a large amount of in-domain data, significantly mitigates both topic drift in the tracking task and the topic deviation introduced by existing adaptive learning mechanisms. The lowest tracking error cost is 0.0677, 6% lower than that of the static topic tracking model.
     (5) Based on the above topic detection and topic tracking techniques, a topic detection and tracking system is designed to meet practical application requirements. The system scheme first organizes news story streams into story clusters, with each cluster corresponding to one topic, and extracts tags describing each topic from the streams. Finally, the story clusters and topic tags are presented to users. The system scheme thus reduces the time users spend finding news of interest and organizes Internet news stories according to their latent topics.
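
Item (3) describes obtaining the tracking result by voting over several Gibbs sampling results. The abstract does not state the voting rule, so the sketch below assumes a strict majority: a story is accepted as on-topic only if more than half of the runs marked it on-topic. The function name `vote_on_topic` and the boolean input layout are hypothetical and chosen only for illustration.

```python
def vote_on_topic(run_decisions):
    """Combine per-run on-topic decisions by strict majority vote.

    run_decisions: list of runs; each run is a list of booleans with one
    entry per story (True = the run judged the story on-topic).
    Returns one boolean per story. Ties count as off-topic (an assumption).
    """
    if not run_decisions:
        return []
    n_runs = len(run_decisions)
    n_stories = len(run_decisions[0])
    final = []
    for s in range(n_stories):
        yes_votes = sum(1 for run in run_decisions if run[s])
        final.append(yes_votes * 2 > n_runs)
    return final


if __name__ == "__main__":
    runs = [
        [True, False, True],
        [True, False, False],
        [False, False, True],
    ]
    print(vote_on_topic(runs))
```

Because each Gibbs run is stochastic, a single run can mislabel a story; aggregating several runs is one plausible way to get the stability the abstract attributes to voting.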

