Research on Key Problems in Web Text Mining
Abstract
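With the rapid development of the Internet and communication networks, web text has become the main carrier of information and an indispensable source of information in people's lives, and the research significance and practical value of text mining have become increasingly prominent. At the same time, with the arrival of the Web 2.0 era, more and more online digital content is created by users. The mass production and dissemination of user-generated content has made short text computing, Web text information extraction, and text sentiment analysis hot topics in Web text mining research. To address these problems, this dissertation carries out the following research: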
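     (1) Short text computing based on statistical language models. Short texts contain few characters, use irregular language, and occur in huge numbers. To handle these characteristics, this dissertation proposes a short text clustering algorithm based on N-gram feature extraction and RPCL (Rival Penalized Competitive Learning). First, character-level N-gram features are extracted, that is, Chinese chunks are extracted from the unsegmented corpus. A Chinese chunk can be a single character, a word, or a character string, so chunks not only express the semantic information of a short text but also preserve word order and the dependencies between characters. Next, statistical substring reduction and mutual information filtering yield the set of candidate Chinese chunks. Finally, RPCL, a neural network clustering algorithm, is used to cluster the short texts. Experimental results show that this algorithm clusters short texts effectively and substantially reduces the feature dimensionality.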
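The chunk-extraction step in (1) can be illustrated with a minimal sketch. It assumes the usual reading of statistical substring reduction (an n-gram is dropped when a longer n-gram containing it has the same frequency); the function names, the `max_n` and `min_freq` parameters, and the toy texts are illustrative and not taken from the dissertation. Mutual information filtering is not shown.

```python
from collections import Counter

def char_ngrams(texts, max_n=4):
    """Count every character n-gram (1 <= n <= max_n) in unsegmented texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts

def substring_reduction(counts, min_freq=2):
    """Drop rare n-grams, then drop any n-gram that is a substring of a
    longer kept n-gram with the same frequency (assumed reduction rule)."""
    kept = {g: c for g, c in counts.items() if c >= min_freq}
    return {
        g: c for g, c in kept.items()
        if not any(len(h) > len(g) and g in h and kept[h] == c for h in kept)
    }

# Toy unsegmented short texts (hypothetical data).
texts = ["北京大学", "北京大学", "去北京玩"]
chunks = substring_reduction(char_ngrams(texts))
print(chunks)  # {'北京': 3, '北京大学': 2}
```

In this toy run the ten frequent n-grams collapse to two chunks, which is the kind of dimensionality reduction the paragraph describes. The quadratic scan is for readability only and would not scale to a large corpus.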
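     (2) Web text information extraction for advertisement recommendation and sentiment analysis. For the compound word extraction problem in advertisement recommendation, this dissertation proposes a semi-supervised Chinese compound word extraction algorithm based on the hidden Markov model (HMM). Starting from a small number of seed compounds and a BEMI (Begin, End, Middle, Independent) template, the HMM identifies compounds that carry the same or similar information as the seed compounds. The algorithm uses bootstrapping, enlarging the compound list step by step through self-learning. Experimental results show that the algorithm meets the information extraction needs of keyword recommendation in advertising systems, with high precision and acceptable recall.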
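To make the BEMI template concrete, the sketch below converts a segmented sentence and a list of known compounds into B/E/M/I tags and recovers compounds from a tag sequence. It assumes that "Independent" marks a word outside any compound; the helper names and the example sentence are hypothetical, and the HMM that would predict the tags is not shown.

```python
def bemi_tags(words, compounds):
    """Tag each word B/M/E if it lies inside a known compound, else I.
    `compounds` is a collection of word tuples; the longest match wins."""
    tags = ["I"] * len(words)
    i = 0
    while i < len(words):
        match = 0
        for comp in compounds:
            n = len(comp)
            if n >= 2 and n > match and tuple(words[i:i + n]) == tuple(comp):
                match = n
        if match:
            tags[i] = "B"
            for j in range(i + 1, i + match - 1):
                tags[j] = "M"
            tags[i + match - 1] = "E"
            i += match
        else:
            i += 1
    return tags

def compounds_from_tags(words, tags):
    """Collect every word span tagged B (M)* E as a compound."""
    found, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":
            current = [word]
        elif tag == "M" and current:
            current.append(word)
        elif tag == "E" and current:
            found.append(tuple(current + [word]))
            current = []
        else:
            current = []
    return found

words = ["我", "喜欢", "数码", "相机", "和", "笔记本", "电脑", "电池"]
seeds = {("数码", "相机"), ("笔记本", "电脑", "电池")}
tags = bemi_tags(words, seeds)
print(tags)  # ['I', 'I', 'B', 'E', 'I', 'B', 'M', 'E']
print(compounds_from_tags(words, tags))
```

Tagged sentences of this form are what a sequence model such as an HMM would be trained on; decoding its predicted tags with `compounds_from_tags` then yields new candidate compounds.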
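     For sentiment word extraction in text analysis, this dissertation proposes a Chinese sentiment word extraction algorithm based on maximum entropy and an LMR (Left, Middle, Right) template. A sliding window is set over the text, the LMR template marks the position of each word, and words, their relative order, and part-of-speech information serve as features for identifying and extracting sentiment words. Experimental results show that the algorithm achieves high recall and precision, and that sentiment word extraction is robust under certain feature combinations.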
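A sliding-window feature extractor in the spirit of the LMR template might look like the following sketch. The feature naming scheme, the padding symbol, the window width, and the part-of-speech tags in the example are assumptions made for illustration; the dissertation's exact feature set is not given in the abstract.

```python
def lmr_features(tokens, index, width=1):
    """Build features for the word at `index` (the M slot) from a window of
    `width` words on each side. `tokens` is a list of (word, pos) pairs."""
    feats = {}
    for offset in range(-width, width + 1):
        j = index + offset
        # Positions outside the sentence are filled with a padding symbol.
        word, pos = tokens[j] if 0 <= j < len(tokens) else ("<PAD>", "<PAD>")
        if offset == 0:
            slot = "M"
        elif offset < 0:
            slot = f"L{-offset}"
        else:
            slot = f"R{offset}"
        feats[f"{slot}_word"] = word
        feats[f"{slot}_pos"] = pos
    return feats

# Hypothetical segmented and POS-tagged sentence.
tokens = [("这", "r"), ("电影", "n"), ("很", "d"), ("精彩", "a")]
features = lmr_features(tokens, 3)
print(features)
```

Each window produces one feature dictionary, and a maximum entropy classifier would then decide whether the word in the M slot is a sentiment word.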
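     (3) Text sentiment classification based on supervised and semi-supervised learning. For the large volume of pop music and user-created or user-adapted music on the Internet, this dissertation proposes a sentiment classification method for song lyrics. First, word statistics on the lyrics corpus show that the distribution basically follows Zipf's law, although it differs slightly from the word distribution in a general-purpose Chinese classification corpus (the 863 Program text classification test data). Because classifying the emotion expressed by lyrics differs from classifying ordinary text by topic, more features that carry emotional color must be extracted. Within the N-gram model framework, this dissertation applies three preprocessing methods (different N-gram templates, stop word removal, and part-of-speech filtering) to extract more emotional semantic features from lyrics, and proposes a classification algorithm that models these features with a maximum entropy model with Gaussian and exponential priors. Experimental results show that the maximum entropy model with Gaussian and exponential priors is well suited to lyric sentiment analysis.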
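One simple way to check whether a word distribution roughly follows Zipf's law is to fit a line to log frequency against log rank; a slope near -1 indicates Zipf-like behavior. The sketch below is an illustration of that check under this assumption, not the analysis performed in the dissertation, and the synthetic word list is made up.

```python
import math
from collections import Counter

def zipf_slope(words):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic corpus whose rank-r word occurs exactly 1200 / r times.
words = [f"w{rank}" for rank in range(1, 7) for _ in range(1200 // rank)]
slope = zipf_slope(words)
print(round(slope, 6))  # -1.0
```

On a real lyrics corpus the slope would only be approximately -1, and comparing it with the slope for a general-purpose corpus is one way to quantify the slight difference the paragraph mentions.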
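     To handle the shortage of labeled data in practical sentiment classification, this dissertation proposes a text sentiment classification algorithm based on semi-supervised learning. It assumes that a sentiment manifold exists in the space and treats the texts to be classified as points sampled from this manifold. First, a graph is built from the neighborhood information of these points, with the weights of the edges between each point and its neighbors given by the linear weighting with which those neighbors represent it; then the graph is treated as a probability transition matrix, over which the class labels diffuse to complete the classification. Experiments on movie review and Chinese lyrics corpora show that the algorithm performs well on text sentiment classification.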
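The graph-based procedure can be sketched as label propagation over a k-nearest-neighbor graph. This sketch makes a deliberate simplification: every neighbor gets the uniform weight 1/k, whereas the paragraph describes weights obtained by linearly reconstructing each point from its neighbors. The update rule F <- alpha * W * F + (1 - alpha) * Y, the parameter values, and the toy points are likewise assumptions chosen for illustration.

```python
import math

def knn_weights(points, k):
    """Row-stochastic matrix: each point spreads weight 1/k over its k
    nearest neighbors (uniform weights, a simplification)."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            weights[i][j] = 1.0 / k
    return weights

def propagate(weights, labels, n_classes, alpha=0.9, iters=200):
    """Spread known labels over the graph; `labels[i]` is None if unlabeled."""
    n = len(weights)
    seed = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
            for i in range(n)]
    scores = [row[:] for row in seed]
    for _ in range(iters):
        scores = [
            [alpha * sum(weights[i][j] * scores[j][c] for j in range(n))
             + (1 - alpha) * seed[i][c]
             for c in range(n_classes)]
            for i in range(n)
        ]
    return [max(range(n_classes), key=scores[i].__getitem__) for i in range(n)]

# Two well-separated groups with one labeled point each (toy data).
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
labels = [0, None, None, None, None, 1]
predicted = propagate(knn_weights(points, k=2), labels, n_classes=2)
print(predicted)  # [0, 0, 0, 1, 1, 1]
```

In practice the points would be document feature vectors, and the matrix acts as the transition matrix along which the two seed labels diffuse to the unlabeled documents.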
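     (4) Text opinion retrieval. Taking as its main thread the topic-oriented Chinese text opinion retrieval task of COAE2008, in which the author participated in 2008, this dissertation presents the submitted system, PRIS-SAS. The system works in two stages. After preprocessing such as encoding conversion and word segmentation, PRIS-SAS first builds an index over the corpus with the Indri retrieval system and performs ad hoc retrieval with the topic words of the task; it then uses the text sentiment classification algorithms of this dissertation to build an orientation model and a polarity model, judges the orientation of the retrieved relevant texts, and reranks the retrieval results. Evaluation metrics on the COAE2008 dataset show that the opinion retrieval system reaches a high level of performance.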
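The second stage of such a pipeline can be pictured as a reranking function. The abstract does not say how retrieval and classification scores are combined, so the linear interpolation below, the `weight` parameter, and all scores are assumptions for illustration only; no Indri API is called.

```python
def rerank(hits, opinion_prob, weight=0.5):
    """Rerank retrieved documents by mixing relevance with opinion scores.
    `hits` is a list of (doc_id, relevance) pairs with relevance in [0, 1];
    `opinion_prob` maps doc_id to a classifier probability in [0, 1]."""
    scored = [
        (doc_id, (1 - weight) * relevance + weight * opinion_prob.get(doc_id, 0.0))
        for doc_id, relevance in hits
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored]

# Hypothetical first-stage results and classifier outputs.
hits = [("d1", 0.9), ("d2", 0.8), ("d3", 0.4)]
opinion_prob = {"d1": 0.1, "d2": 0.9, "d3": 0.8}
ranking = rerank(hits, opinion_prob)
print(ranking)  # ['d2', 'd3', 'd1']
```

Here the most relevant but least opinionated document drops to the bottom, which is the intended effect of reranking by classification results.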
With the rapid development of the Internet and communication networks, web documents have become one of the major modern information media as well as an indispensable information source in people's lives, and text mining has become a technology of great research and practical significance. With the arrival of Web 2.0, more and more users are involved in generating information, and the Internet is increasingly filled with personal, opinionated content. Such content is meaningful and valuable for many applications, such as e-commerce, online communities, network information security, and web search engines. However, processing these texts with traditional text mining techniques poses enormous challenges.
     In this dissertation, three problems are investigated: short text computing, web text information extraction, and text sentiment analysis. The main contributions of this dissertation are summarized as follows:
     (1) Short text computing based on statistical language models. We introduce an algorithm that clusters Chinese short texts based on N-gram feature extraction. To suit the characteristics of Chinese short texts, the algorithm uses N-gram feature extraction, statistical substring reduction, and mutual information filtering to capture Chinese chunks, which reflect the semantic structure of the text and the dependencies between characters. The RPCL algorithm is then applied to cluster the texts with high precision, without needing to know the exact number of clusters in advance. Experimental results show that, compared with traditional methods, this approach markedly reduces the feature dimensionality and improves the clustering of Chinese short texts.
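The core RPCL update can be sketched in a few lines: for each sample the nearest unit (the winner) moves toward it, while the second-nearest unit (the rival) is pushed slightly away. This is a simplified illustration; some formulations also weight distances by how often each unit has won, which is omitted here, and the learning rates, initial units, and toy data are assumptions.

```python
import math

def rpcl(points, units, lr_win=0.05, lr_rival=0.002, epochs=50):
    """Simplified rival penalized competitive learning on 2-D points."""
    units = [list(u) for u in units]
    for _ in range(epochs):
        for x in points:
            dist2 = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in units]
            order = sorted(range(len(units)), key=dist2.__getitem__)
            win, rival = order[0], order[1]
            # The winner learns: move toward the sample.
            units[win] = [wi + lr_win * (xi - wi) for xi, wi in zip(x, units[win])]
            # The rival is penalized: move slightly away from the sample.
            units[rival] = [wi - lr_rival * (xi - wi) for xi, wi in zip(x, units[rival])]
    return units

# Two toy clusters around (0, 0) and (10, 0), presented alternately.
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]
cluster_a = offsets
cluster_b = [(10.0 + dx, dy) for dx, dy in offsets]
points = [p for pair in zip(cluster_a, cluster_b) for p in pair]

# Three units for two clusters: one unit is surplus.
final = rpcl(points, units=[(1.0, 1.0), (9.0, -1.0), (5.0, 6.0)])
for unit in final:
    print([round(v, 2) for v in unit])
```

In this run two units settle near the cluster centers while the third, which never wins a sample, is pushed away from the data. That behavior is why the number of clusters does not have to be fixed exactly in advance.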
     (2) Web text information extraction for keyword recommendation and sentiment analysis. For keyword recommendation in advertising, we propose a semi-supervised approach to Chinese compound extraction based on an HMM with bootstrapping. First, we define a tag set BEMI {beginning, end, middle, independent} that marks the position of each word within a compound. We then use the HMM to extract compounds automatically with the BEMI tagging algorithm. The compounds extracted from the corpus are ranked in descending order of frequency and length, and the top N are added to the seed compound list. Through bootstrapping, the algorithm learns more Chinese compounds from the corpus. Experimental results show that this approach performs much better than an unsupervised one. Unlike compounds extracted by traditional methods, these Chinese compounds carry category information, which can serve as features in text classification or clustering. Because it is extensible and versatile, the approach can also be applied to keyword recommendation for different kinds of advertisers.
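The bootstrapping loop described above can be outlined as follows. The `extract` argument stands in for the trained HMM tagger and is purely a placeholder; `toy_extract`, the toy corpus, and the use of the number of words as "length" are assumptions made so that the sketch runs on its own.

```python
from collections import Counter

def bootstrap(seeds, corpus, extract, top_n=1, rounds=5):
    """Grow the seed list: extract candidates, rank them by frequency and
    length (both descending), add the top N, and repeat."""
    seeds = list(seeds)
    for _ in range(rounds):
        counts = Counter(extract(seeds, corpus))
        new = [c for c in counts if c not in seeds]
        new.sort(key=lambda c: (counts[c], len(c)), reverse=True)
        if not new:
            break  # nothing left to learn
        seeds.extend(new[:top_n])
    return seeds

def toy_extract(seeds, corpus):
    """Stand-in for the HMM: return compounds sharing a word with any seed."""
    vocab = {word for comp in seeds for word in comp}
    return [comp for comp in corpus if vocab & set(comp)]

corpus = [("数码", "相机"), ("数码", "相框"), ("数码", "相框"),
          ("相框", "挂钩"), ("儿童", "玩具")]
learned = bootstrap([("数码", "相机")], corpus, toy_extract)
print(learned)
```

Starting from one seed, the loop picks up two related compounds and never reaches the unrelated one, which mirrors how the learned list inherits the category of its seeds.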
     For word-level sentiment analysis, we propose an algorithm based on the Maximum Entropy (ME) model and an LMR template. The LMR template is used to tag word positions, and words, word positions, and POS tags are used as features in ME. A window slides over the text, and the sentiment of the word in the M position is labeled. Experimental results show that this algorithm performs well in sentiment word extraction and is robust under some feature combinations.
     (3) Text sentiment classification based on supervised and semi-supervised learning. Most pop songs have lyrics, which play an essential role in understanding the songs semantically. Lyric analysis is therefore a natural complement to acoustic methods for music retrieval. One basic aspect of music retrieval is music emotion classification by learning from lyrics. This problem differs from traditional text classification in that more linguistic or semantic information is required for better emotion analysis. We examine the lyrics corpus against Zipf's law, using the word as the unit, and find that the distribution roughly obeys it. We then study three kinds of preprocessing methods (different N-grams, stop word removal, and POS-based filtering) and a series of N-grams under the well-known N-gram language model framework to extract more semantic features. In addition, we improve the Maximum Entropy model with Gaussian and exponential priors to model features for music emotion classification. Experimental results show that the feature extraction methods improve music emotion classification accuracy, and that ME with priors obtains the best results.
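For reference, maximum entropy training with priors is commonly written as maximum a posteriori estimation. The formulation below is that standard form, stated here as an assumption, since the abstract does not give the objective; the symbols $f_i$ (feature functions), $\lambda_i$ (weights), $\sigma_i^2$ (prior variances), and $\alpha_i$ (exponential rate parameters) are introduced for illustration.

```latex
\begin{align}
p_\lambda(y \mid x) &= \frac{\exp\bigl(\sum_i \lambda_i f_i(x, y)\bigr)}
                            {\sum_{y'} \exp\bigl(\sum_i \lambda_i f_i(x, y')\bigr)} \\
\text{Gaussian prior:}\quad
\hat{\lambda} &= \arg\max_{\lambda} \sum_{(x, y)} \log p_\lambda(y \mid x)
                 - \sum_i \frac{\lambda_i^2}{2\sigma_i^2} \\
\text{Exponential prior:}\quad
\hat{\lambda} &= \arg\max_{\lambda \ge 0} \sum_{(x, y)} \log p_\lambda(y \mid x)
                 - \sum_i \alpha_i \lambda_i
\end{align}
```

Under this reading the Gaussian prior penalizes large weights quadratically, while the exponential prior applies a linear penalty to non-negative weights and tends to drive many of them to zero.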
     Labeled data for sentiment classification is often scarce. We introduce a novel semi-supervised learning algorithm for this setting. We assume that there is a sentiment manifold structure and that documents are sampled from this manifold. We build a graph over both labeled and unlabeled data, constructed linearly from the neighborhood information of the data points. Labels are then spread through the graph, which is treated as a probabilistic transition matrix during propagation. This algorithm is capable of learning sentiment manifold structures within texts. Promising experimental results are obtained on lyrics and movie review data.
     (4) Opinion retrieval. Following the Chinese Opinion Analysis Evaluation (COAE2008), we discuss text opinion retrieval. Our sentiment analysis system, PRIS-SAS, employs a two-stage approach. After preprocessing, the corpus provided by COAE2008 is indexed by the Indri retrieval system, which is used for ad hoc retrieval. A sentiment model and a polarity model trained by ME with priors are then used to classify the texts returned by Indri, and the retrieval results are reranked according to the classification results. Experiments on the COAE2008 datasets show that the system proposed in this dissertation is a state-of-the-art opinion retrieval system.
引文
[1]第23次中国互联网络发展状况统计报告,中国互联网络信息中心(CNNIC),2009年1月.
    [2]2008年第二次手机短信息状况调查报告,12321网络不良与垃圾信息举报受理中心,2009年2月.
    [3]杨震.文本分类和聚类中若干问题的研究[博士学位论文].北京邮电大学,2007.
    [4]陈晓云.文本挖掘若干关键技术研究[博士学位论文].复旦大学,2005.
    [5]Sebastiani F.Machine learning in automated text categorization.ACM Computing Surveys(CSUR),Vol34,Issue 1,2002,1-47.
    [6]刘远超,王晓龙,徐志明等.文档聚类综述,中文信息学报,20(3),2006,55-62.
    [7]Baeza-Yates R,Ribeiro-Neto B.Modern information retrieval.ACM Press,1999.
    [8]Manning C D,Schutze H.Foundations of statistical natural language processing.The MIT Press,1999.
    [9]Crimmins F,Smeaton A F,Dkaki T,et al.TetraFusion:information discovery on the Interact.IEEE Intelligent Systems and their Applications,Volume 14,Issue 4,2002,55-62.
    [10]Assis F,Yerazunis W,Siefkes C,et al.CRM114 versus Mr.X:CRM114 notes for the TREC 2005 spam track.In Proceedings of 14th Text Retrieval Conference,2005.
    [11]Lewis D D.Naive(Bayes) at Forty:The Independence Assumption in Information Retrieval.In Proceedings of the 10th European Conference on Machine Learning New York,1998,4-15.
    [12]Eyheramendy S,Lewis D D,Madigan D.On the naive bayes model for text categorization.Artificial Intelligence & Statistics 2003.
    [13]Peng F,Schuurmans D.Combining naive bayes and n-gram language models for text classification.Proceedings of the 25th European Conference on Information Retrieval Researeh(ECIR03).April,2003,Pisa,Italy,335-350.
    [14]Yang Y.An evaluation of statistical approaches to text categorization.Information Retrieval,1999,1(1),76-88.
    [15]Cohen W,Singer Y.Context-sensitive learning methods for text categorization.In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,1996,307-315.
    [16]Lewis D D,Schapire R E,Callan J P,et al.Training algorithms for linear text classifiers.In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,1996,298-306.
    [17]Yang Y,Chute C G.A linear least squares fit mapping method for information retrieval from natural language texts.In Proceedings of the 14th Conference on Computational Linguistics(COLING92),1992.
    [18]Nigam K,Lafferty J,McCallum A.Using maximum entropy for text classification.In Proc.of the Int.Joint Conf.on Artificial Intelligence IJCAI-99 Workshop on Machine Learning for Information Filtering,1999,61-67.
    [19]Chen B,He H,Guo J.Constructing maximum entropy language models for movie review subjectivity analysis.Journal of Computer Science and Technology(JCST),23(2),2008,231-239.
    [20]Joachims T.Text categorization with support vector machines:learning with many relevant features.In Proceedings of 10th European Conference on Machine Learning,1998,137-142.
    [21]Hsu C,Lin C.A comparison on method for multi-class support vector machines.IEEE Transactions on Neural Networks,2002(13),415-425.
    [22]Jain A,Dobes R.Algorithms for clustering data.Engle-wood Cliffs,NJ:Prentice Hall,1998.
    [23]Chu S,Roddick J,Chen T,Pan J.Efficient search approaches for k-medoids-based algorithms,In Proc.of TENCON'02,1,2002,721a-715a.
    [24]Zhang B,Li H,Liu Y,et al.Improving web search results using affinity graph.SIGIR'05,2005,15-19.
    [25]Xue G,Lin C,Yang Q,et al.Scalable collaborative filtering using cluser-based smoothing.SIGIR'05,2005,114-121.
    [26]Cilibrasi R,Vitanyi P.The Google similarity distance.IEEE Transactions on Knowledge and Data Engineering,19(3),2007,370-383.
    [27]Xu R,Wunsch D.Survey of clustering algorithms.IEEE Transactions on Neural Networks,16(3),2005,645-678.
    [28]钟义信.自然语言理解的全信息方法论.北京邮电大学学报,27(4),2004,1-12.
    [29]王灿辉,张敏,马少平.自然语言处理在信息检索中的应用综述.中文信息学报,21(2),2007,35-45.
    [30]王枞,钟义信.网络信息内容安全.计算机工程与应用,2003,153-154.
    [31]钟义信.关于“信息-知识-智能转换规律”的研究.电子学报,32(4),2004,601-605.
    [32]Sager N.Natural Language Information Processing.Reading,Massachusetts:Addison Wesley,1981.
    [33]Dejong G.An Overview of the FRUMP System.In:LEHNERT W,RINGLE M H eds.Strategies for Natural Language Processing,Lawrence Erlba(?)m,1982:142-176.
    [34]Grishman R,Sundheim B.Message Understanding Conference-6:A Brief History.In Proceedings of the 16th International Conference on Computational Linguistics (COING-96),1996,08.
    [35]Automatic Content Extraction(ACE),http://www.nist.gov/speech/tests/ace/
    [36]Freitag D.Information extraction from html:Application of a general learning approach In Proceedings of the 15th Conference on Artificial Intelligence(AAAI-98),1998:pp.517-523.
    [37]Muslea I,Minton S,Knoblock C.A hierarchical approach to wrapper induction.In Proceedings of third International Conference on Autonomous agents(AA-1998),1998.
    [38]Kim J,MoNovan D.Acquisition of Semantic Patterns for information Extraction from corpora.In Proceedings of the ninth lEE Conference on Artificial Intelligence for Applications,Los Alamitos,CA,IEEE Computer Society Press,1993:pp.171-176.
    [39]Chen H H,Ding Y W,Tsai Sc et al.Description of the NTU system Used for MET2.In Proceedings of the Seventh Message Understanding Conference,1998.
    [40]Zhang Y M,Zhou J F.A Trainable Method for Extracting Chinese Entity Names and Their Relations.In Proceedings of the Second Chinese Language Processing Workshop,Hong Kong,2000-10.
    [41]杨文柱,李智玲等.基于信息抽取的Web查询系统的设计与实现.计算机应用,Vol.23(2),2003,pp.97-99.
    [42]李效东,顾毓清.基于DOM的Web信息提取.计算机学报,Vol.25(5),2002, pp.526-532.
    [43]胡睿,张冬茉,杜蓬.基于结点语义关系的信息抽取技术.计算机工程,Vol.27(4),2001,pp.26-28.
    [44]朱明,王军,王俊普.基于多层模式的多记录网页信息抽取方法.计算机工程,Vol27(9),2001,pp.41-42.
    [45]陆科进,李新颖.基于Ontology的文本信息抽取.计算机应用研究,2003,7,pp.46-48.
    [46]王放,顾宁,吴国文.基于本体的WEB表格信息抽取.小型微型计算机系统,Vol.24(12),2003,pp.2142-2146.
    [47]Wong S K M,Ziarko W,Raghavan V V,et al.Generalized vector spaces model in information retrieval.In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,1985,18-25.
    [48]Raghavan V V,Wong S K M.A critical analysis of vector space model in information retrieval.Journal of the American Society for Information Science,37(5),1986,279-287.
    [49]Van gijsbergen C J.A new theoretical framework for information retrieval.In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval,1986,194-200.
    [50]邢永康,马少平.信息检索的概率模型,计算机科学,2003,30(08),13-17.
    [51]Song F,Croft W B.A general language model for information retrieval.In Proceedings of the 22nd Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval,1999,279-280.
    [52]张俊林.基于语言模型的信息检索系统研究[博士学位论文].中国科学院软件研究所,2004.
    [53]丁国栋,白硕,王斌.文本检索的统计语言建模方法综述,计算机研究与发展,43(5),2006,769-776.
    [54]2008年中国搜索引擎用户行为研究报告,中国互联网络信息中心(CNNIC),2009年2月.
    [55]Text REtrieval Conference(TREC).http://trec.nist.gov/
    [56]NII-NACSIS Text Collection for IR Systems(NTCIR).http://research.nii.ac.jp/ntcir/
    [57]龚才春.短文本语言计算的关键技术研究[博士学位论文].中国科学院计算技术研究所,2008.
    [58]黄永光,刘挺,车万翔等.面向变异短文本的快速聚类算法.中文信息学报,2007,21(2),63-68.
    [59]陈博.Web文本情感分类中关键问题的研究[博士学位论文].北京邮电大学,2008.
    [1]胡吉祥.基于频繁模式的消息文本聚类研究.[硕士学位论文].北京,中科院研究生院,2006.
    [2]吴薇.大规模短文本的分类过滤方法研究.[硕士学位论文].北京,北京邮电大学,2007.
    [3]龚春才,张华平,许洪波,程学旗,白硕.中文短文本流的快速编码识别算法.In Proceeding of the 7~(th) International Conference of Chinese Computing,2007,772-776.
    [4]Nie J,Gao J.On the use of words and N-grams for Chinese information retrieval.The 5~(th) International Workshop on Information Retrieval with Asian Languages,2000.
    [5]Baeza-Yates R,Ribeiro-Neto B.Modern information retrieval.ACM Press,1999.
    [6]Maron M.E.On relevance,probabilistic indexing and information retrieval.Journal of ACM,7(3):216-244,1960.
    [7]宗成庆.统计自然语言处理,清华大学出版社,2008.
    [8]Sebastiani E Machine learning in automated text categorization:a survey.Tech.Rep.IEI-B4-31-1999,Istituto di Elaborazione dell'Informazione,Consiglio Nazionale delle Ricerche,Pisa,IT,1999.
    [9]Yang Y,Pedersen J.O.A comparative study on feature selection in text categorization.The 14~(th) International Conference on Machine Learning,1997:412-420.
    [10]Nigam K,Lafferty J,McCallum A.Using maximum entropy for text classification.IJCAI-99 Workshop on Machine Learning for Information Filtering,1999:61-67.
    [11]Church K.W,Hanks P.Words association norms,mutual information and lexicography.Computational Linguistics,1989,16(1):22-29.
    [12]Dunning T.E.Accurate methods for the statistics of surprise and coincidence.Computional Linguistics,1993,19(1):61-74.
    [13]Mladenic D,Grobelnik M.Feature selection for classification based on text hierarchy.Workshop on Learning from Text and the Web,1998.
    [14]Ruizetal M.E.Automatic text categorization using neural networks.The 8~(th) ASIS SIG/CR Classification Research,1997,8:59-72.
    [15]Tokunaga T,Iwayama M.Text categorization based on weighted inverse document frequency.SIG-IPS Japan,1994,100(5).
    [16]Kolcz A,Prabakarmurthi V,Kalita J.Summarization as feature selection for text categorization.The 10~(th) International Conference on Information and Knowledge Management,2001:365-370.
    [17]Dash M,Liu H.Feature selection for clustering.PAKDD'00,2000,110-121.
    [18]Rogati M,Yang Y.High-performing feature selection for text classification.CIKM'02,2002:659-661.
    [19]Dy J.G,Brodley C.E.Feature subset selection and order identification for unsupervised learning.ICML'00,2000:247-254.
    [20]Talavera L.Dependency-based feature selection for clustering symbolic data.
    [21]龚才春.短文本语言计算的关键技术研究[博士学位论文].中国科学院计算技术研究所,2008.
    [22]黄永光,刘挺,车万翔等.面向变异短文本的快速聚类算法.中文信息学报,2007,21(2),63-68.
    [23]胡佳妮,郭军,邓伟洪等.基于短文本的独立语义特征抽取算法.通信学报,2007,28(12),121-124.
    [24]杨震.文本分类和聚类中若干问题的研究[博士学位论文].北京邮电大学,2007.
    [25]Nagao M,Mori S.A new method of N-gram statistics for large number of N and automatic extraction of words and phrases from large text data of Japanese.COLING-94,Kyoto,1994,611-615.
    [26]Fung P,Wu D.Statistical augmentation of a Chinese machine-readable dictionary.COLING-94,Kyoto,1994,69-85.
    [27]Zhang L,Lu X Q,Shen Y N,Yao T S.A statistical approach to extract Chinese chunk candidates from large corpora.ICCPOL2003,2003,109-117.
    [28]L(u|¨) X Q,Zhang L,Hu J F.Statistical substring reduction in linear time.IJCNLP2004,2004,320-327.
    [29]Han J,Kamber M.Data Mining:Concepts and Techniques.Morgan Kaufmann Publishers,San Francisco,2001.
    [30]Xu L,Krzyzak A,Oja E.Rival penalized competitive learning for clustering analysis, RBF Net,and Curve Detection.IEEE Transactions on Neural Networks,1993,636-649.
    [31]Xu L,Krzyzak A,Oja E.Unsupervised and supervised classification by rival penalized competitive learning.In the 11~(th) Proceeding of International Conference on Pattern Recognition,1992,492-496.
    [32]Ma J W,Wang T J,Xu L.Convergence analysis of rival penalized competitive learning(RPCL) algorithm.In:Proceedings of the 2002 International Joint Conference on Neural Network,2002,1596-1601.
    [33]Law L T,Cheung Y M.Color image segmentation using rival penalized controlled competitive learning.In:Proceedings of the 2003 International Joint Conference on Neural Networks,2003,108-112.
    [34]李桂芝,安成万,张永谦等.基于模糊熵和RPCL的彩色图像聚类分割.中国图象图形学报,2005,10(10),1264-1269.
    [35]http://www.sogou.com/labs/
    [36]Chen B,He H,Xu W R,Guo J.POC-NLW template based tagging method for Chinese word segmentation.Proceeding of the 2006 International Conference on Computational Intelligence and Security,Guangzhou,China,2006,1423-1428.
    [1].张素香.信息抽取中关键技术的研究.[博士学位论文].北京,北京邮电大学,2007.
    [2]Bahl L,Jelinek F,Mercer R.A maximum likelihood approach to continuous speech recognition.IEEE Trans.on Pattern Analysis and Machine Intelligence,5(2),1983,179-190.
    [3]刘颖.计算语言学.清华大学出版社,2002.
    [4]Rabiner LR.A tutorial on hidden Markov models and selected applications in speech recognition.In Proc.of the IEEE,77(2),1989,257-286.
    [5]Miller D R H,Leek T,Schwartz R M.A hidden Markov model information retrieval system.Proceedings of the 22~(nd) annual international ACM SIGIR conference on Research and Development in Information Retrieval,1999,214-221.
    [6]McCallum A,Freitag D,Pereira F.Maximum Entropy Markov Models for information extraction and segmentation.In Proceedings of International Conference on Machine Learning(ICML00),2000,591-598.
    [7]Brants T.TnT:a statistical part-of-speech tagger.In Proc.of the 6th Conf.on Applied Natural Language Processing,2000,224-231.
    [8]Ray S,Craven M.Representing sentence structure in Hidden Markov Models for information extraction.Proceedings of the 17~(th) International Joint Conference on Artificial Intelligence(IJCAI01),2001.
    [9]王小捷,常宝宝.自然语言处理技术基础.北京邮电大学出版社,2002.
    [10]Della Pietra S,Delta Pietra V,Mercer R L,et al.Adaptive language modeling using minimum discriminant estimation.In Proceedings of the Speech and Natural Language DARPA Workshop,1992.
    [11]Berger A L,Della Pietra S A,Della Pietra V J.A maximum entropy approach to natural language processing.Computational Linguistics,1996,22(1),39-71.
    [12]Ratnaparkhi A.A maximum entropy Part-Of-Speech tagger.In Proceedings of the Conference on Empirical Methods in Natural Language Processing,1996,17-18.
    [13]Reynar J C,Ratnaparkhi A.A maximum entropy approach to identifying sentence boundaries.In Proceedings of the Fifth Conference on Applied Natural Language Processing,1997,16-19.
    [14]Koeling R.Chunking with maximum entropy models.In Proceedings of CoNLL-2000 and LLL-2000,2000,139-141.
    [15]Luo X Q,Ittycheriah A,Jing H Y,et al.A mention-synchronous coreference resolution algorithm based on the Bell Tree.In Proceedings of ACL 2004.
    [16]Nigam K,Lafferty L,McCallum A.Using maximum entropy for text classification.In IJCAI-99 Workshop on Machine Learning for Information Filtering,1999.
    [17]Ittycheriah A,Roukos S.IBM's statistical question answering system for Trec-11.In Proceedings of the TREC-11 conference,NIST,2002,394-401.
    [18]Della Pietra S,Della Pietra V,Lafferty J.Inducing features of random fields.IEEE Transactions on Pattern Analysis and Machine Intelligence,19(4),1997,380-393.
    [19]www.trec.nist.gov
    [20]Zhang J,Gao J F,Zhou M.Extraction of Chinese compound words:an experimental study on a very large corpus.Proceedings of the 2~(nd) Workshop on Chinese Language Processing:Held in Conjunction with the 38~(th) Annual Meeting of the Association for Computational Linguistics,2000,Vol.12,132-139.
    [21]http://morph.ldc.upenn.edu/ctb/
    [22]周雅倩,郭以昆,黄萱菁等.基于最大熵方法的中英文基本名词短语识别.计算机研究与发展,2003.3,Vol.40,No.3,440-446.
    [23]冯冲,陈肇雄,黄河燕等.基于条件随机域的复杂最长名词短语识别.小型微型计算机系统,2006.6,Vol27,No.6,1134-1139.
    [24]Xue N.Chinese Word Segmentation as Character Tagging.In International Journal of Computational Linguistics and Chinese Language Procession,2003,8(1),29-48.
    [25]陈文良,朱慕华,朱靖波,等.基于Bootstrapping的文本分类模型.中文信息学报,2005,Vol.19,No.2,86-92.
    [26]Chen B,He H,Xu W R,et al.POC-NLW Template Based Tagging Method for Chinese Word Segmentation,Proceeding of the 2006 International Conference on Computational Intelligence and Security,Guangzhou,China,2006,pp.1423-1428.
    [27]赵军,黄昌宁.基于转换的汉语基本名词短语识别模型.中文信息学报,1999,Vol.13,No.2,pp.1—7,39.
    [28]Horrigan J A.Online shopping.Pew Internet & American Life Project Report,2008.
    [29]Kelsey group.Online consumer-generated reviews have significant impact on offline purchase behavior.Press Release,November 2007.
    [30]Hatzivassiloglou V,McKeown R.Predicting the semantic orientation of adjectives.Proceedings of the 35~(th) Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics,Marid,1997,174-181.
    [31]Turney P D.Thumbs up or thumbs down?:Semantic orientation applied to unsupervised classification of reviews.Proceedings of the 40~(th) Annual Meeting on Association for Computational Linguistics,Philadelphia,2002,417-424.
    [32]Riloff E,Wiebe J.Learning extraction patterns for subjective expressions.Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing,2003,70-77.
    [33]Kanayama H,Nasukawa T.Fully automatic lexicon expansion for domain-oriented sentiment analysis.Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing(EMNLP06),2006,355-363.
    [34]KajiN,Kitsuregawa M.Building lexicon for sentiment analysis from massive collection of HTML documents.Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning,2007,1075-1083.
    [35]朱嫣岚,闵锦,周雅倩,黄萱菁,吴立德.基于HowNet的词汇语义倾向计算.中文信息学报,2006,20(1),14-20.
    [36]Yao T F,Lou D C.Research on semantic orientation distinction for Chinese sentiment words.The 7~(th) International Conference on Chinese Computing,Wuhan,2007.
    [37]乔春庚,孙丽华,吴韶等.基于模式的中文倾向性分析研究.第一届中文倾向性分析评测研讨会,2008,21-31.
    [38]http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/
    [1]http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG/
    [2]http://research.nii.ac.jp/ntcir/
    [3]Wiener E.A neural network approach to topic spotting.In Proceedings of the 4~(th)Annual Symposium on Document Analysis and Information Retrieval(SDAIR95),1995.
    [4]Apte C,Damerau P,Weiss S.Text mining with decision rules and decision trees.In Proceedings of the Conference on Automated Learning and Discovery Workshop 6:Learning from Text and the Web,1998.
    [5]Lent B,Swami A,Widom J.Clustering association rules.In Proceedings of the 13~(th)International Conference on Data Engineering(ICDE97),1997.
    [6]Lewis D D.Na(i|¨)ve Bayes at forty:the independence assumption in information retrieval.In Proceedings of the 10~(th) European Conference on Machine Learning,1998,4-15.
    [7]Eyheramendy S,Lewis D D,Madigan D.On the Na(i|¨)ve Bayes model for text categorization.Artificial Intelligence & Statistics,2003.
    [8]Peng F,Schuurmans D.Combining Na(i|¨)ve Bayes and N-gram language models for text classification.In Proceedings of the 25~(th) European Conference on Information Retrieval Research(ECIR03),2003,14-16.
    [9]Yang Y.An evaluation of statistical approaches to text categorization.Information Retrieval,1999,1(1),76-88.
    [10]李荣陆,胡运发.基于密度的KNN文本分类器训练样本裁剪方法.计算机研究与发展,2004,41(4),539-545.
    [11]Joachims T.Text categorization with support vector machines:learning with many relevant features.In Proceedings of the 10~(th) European Conference on Machines Learning,1998,137-142.
    [12]Hsu C,Lin C.A comparison on methods for multi-class support vector machines,IEEE Transactions on Neural Networks,2002,13,415-425.
    [13]Nigam K,Lafferty L,McCallum A.Using maximum entropy for text classification.In IJCAI-99 Workshop on Machine Learning for Information Filtering,1999.
    [14]张学工.关于统计学习理论与支撑向量机.自动化学报,2000,26(1),32-42.
    [15]Berger A.Error-correcting output coding for text classification.In Proceedings of International Joint Conference on Artificial Intelligence:Workshop on Machine Learning for Information Filtering,1999.
    [16]Ghani R.Using error-correcting codes for text classification.In Proceedings of the 17~(th) International Conference on Machine Learning,2000.
    [17]Platt J,Cristianini N,Shawe-Taylor J.Large margin DAGs for multiclass classification.Advances in Neural Information Processing Systems,2000,12,547-553.
    [18]Zhu X J.Semi-supervised learning literature survey.Computer Sciences,University of Wisconsin-Madison,Tech.Rep.,2007
    [19]孙广玲,唐降龙.基于分层高斯混合模型的半监督学习算法.计算机研究与发展,41(1),2004,156-161.
    [20]李和平,胡占义,吴毅红等.基于半监督学习的行为建模与异常检测.软件学报,18(3),2007,527-537.
    [21]Nigam K,McCallum,A K,Thrun S,et al.Text classification from labeled and unlabeled documents using EM.Machine Learning,2000,39,103-134.
    [22]Nigam K.Using unlabeled data to improve text classification(Technical Report CMU-CS-01-126).Carnegie Mellon University,Doctoral Dissertation,2001.
    [23]Baluja S.Probabilistic modeling for face orientation discrimination:learning from labeled and unlabeled data.Neural Information Processing Systems,1998.
    [24]Fujino A,Ueda N,Saito K.A hybrid generative/discriminative approach to semi-supervised classifier design.The Twentieth National Conference on Artificial Intelligence(AAAI05),2005.
    [25]Yarowsky D.Unsupervised word sense disambiguation rivaling supervised methods.Proceedings of the 33~(rd) Annual Meeting of the Association for Computational Linguistics,1995,189-196.
    [26]Riloff E,Wiebe J,Wilson T.Learning subjective nouns using extraction pattern bootstrapping.Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003),2003.
    [27]Blum A,Mitchell T.Combining labeled and unlabeled data with co-training.COLT: Proceedings of the Workshop on Computational Learning Theory,1998.
    [28]Maeireizo B,Litman D,Hwa R.Co-training for predicting emotions with spoken dialogue data.The Companion Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics(ACL),2004.
    [29]Joachims T.Transductive inference for text classification using support vector machines.In Proceedings of the 16~(th) International Conference on Machine Learning,1999,200-209.
    [30]Lawrence N D,Jordan M I.Semi-supervised learning via Gaussian processes.Advances in Neural Information Processing Systems,2005,17.
    [31]Szummer M,Jaakkola T.Information regularization with partially labeled data.Advances in Neural Information Processing Systems,2002,15.
    [32]Zhu X.Semi-supervised learning with graphs.Doctoral dissertation,Carnegie Mellon University,CMU-LTI-05-192,2005.
    [33]Blum A,Lafferty J,Rwebangira M,et al.Semi-supervised learning using randomized mincuts.In Proceedings of the 21~(st) International Conference on Machine Learning (ICML04),2004.
    [34]Zhu X J,Ghahramani Z,Lafferty J.Semi-supervised learning using Gaussian fields and harmonic functions.In Proceedings of the 20~(th) International Conference on Machine Learning(ICML03),2003.
    [35]Zhou D,Bousquet O,Lal T,et al.Learning with local and global consistency.Advances in Neural Information Processing System 16,2004.
    [36]Belkin M,Matveeva I,Niyogi P.Regularization and semi-supervised learning on large graphs.COLT,20O4.
    [37]Belkin M,Niyogi P,Sindhwani V.Manifold regularization:a geometric framework for learning from examples.Technical Report TR-2004-06,University of Chicago,2004.
    [38]Wang F,Zhang C.Label propagation through linear neighborhoods.IEEE Transactions on Knowledge and Data Engineering,2008,Vol20,Issue 1,55-67.
    [39]Hearst MA.Direction-based text interpretation as an information access refinement.In Text-based intelligent systems:current research and practice in information extraction and retrieval,Lawrence Erlbaum Associates,Inc.,Mahwah,NJ,1992.
    [40]Sack W.On the computation of point of view.In Proc.of the 12th National Conf.on Artificial Intelligence,vol.2,1994.
    [41]Finn A,Kushmerick N,Smyth B.Genre classification and domain transfer for information filtering.In Proc.of the 24th BCS-IRSG European Colloquium on IR Research:Advances in Information Retrieval,2002,353-362.
    [42]Wiebe J,Bruce R,Bell M,et al.A corpus study of evaluative and speculative language.In Proc.of the 2nd SIGdial Workshop on Discourse and Dialogue,Vol.16,2001,1-10.
    [43]Bruce R,Wiebe J.Recognizing subjectivity:a case study in manual tagging.Natural Language Engineering,5(2),1999,1-16.
    [44]Wiebe J,Riloff E.Creating subjective and objective sentence classifiers from unannotated texts.In Proc.of the 6th Int.Conf.on Computational Linguistics and Intelligent Text Processing,2005,486-497.
    [45]Subasic P,Huettner A.Affect analysis of text using fuzzy semantic typing.IEEE Trans.on Fuzzy Systems,9(4),2001,483-496.
    [46]Das S R,Chen M.Yahoo! for Amazon:sentiment extraction from small talk on the web.In Proc.of the 8th Asia Pacific Finance Association Annual Conf.,2001.Available at http://scumis.scu.edu/srdas/chat.pdf
    [47]Turney P D.Thumbes up or thumbs down? Semantic orientation applied to unsupervised classification of reviews.In Proc.of the 40th Annual Meeting of the Association for Computational Linguistics,2002,417-424.
    [48]Turney P D,Littman ML.Measuring praise and criticism:inference of semantic orientation from association.ACM Transactions on Information Systems,21(4),2003,315-346.
    [49]Liu H,Lieberman H,Selker T.A model of textual affect sensing using real-world knowledge.In Proc.of the 11th Int.Conf.on Intelligent User Interface,2003,125-132.
    [50]http://openmind.media.mit.edu/
    [51]Pang B,Lee L,Vaithyanathan S.Thumbs up? Sentiment classification using machine learning techniques.In Proc.Conf.on Empirical Methods in Natural Language Processing,2002,79-86.
    [52]Pang B,Lee L.A sentimental education:sentiment analysis using subjectivity summarization based on minimum cuts.In Proc.of the 42nd Meeting of the Association for Computational Linguistics,2004,271-278.
    [53]Pang B,Lee L.Seeing stars:exploiting class relationships for sentiment categorization with respect to rating scales.In Proc.of the 43rd Annual Meeting on Association for Computational Linguistics,2005,115-124.
    [54]Liu B,Hu M,Cheng J.Opinion observer:analyzing and comparing opinions on the web.In Proc.of the 14th Int.Conf.on World Wide Web,2005,342-351.
    [55]Hu M,Liu B.Mining and summarizing customer reviews.In Proc.of the 10th ACM SIGKDD Int.Conf.on Knowledge Discovery and Data Mining,2004,168-177.
    [56]Hu M,Liu B.Mining opinion features in customer reviews.In Proc.of the 19th National Conf.on Artificial Intelligence(AAAI-2004),2004,755-760.
    [57]Lin WH,Wilson T,Wiebe J,et al.Which side are you on? Identifying perspectives at the document and sentence levels.In Proc.of the 10th Conf.on Computational Natural Language Learning,2006,109-116.
    [58]Whitelaw C,Garg N,Argamon S.Using appraisal groups for sentiment analysis.In Proc.of the 14th ACM Int.Conf.on Information and Knowledge Management,2005,625-631.
    [59]Yi J,Nasukawa T,Bunescu R,et al.Sentiment analyzer:extracting sentiments about a given topic using natural language processing techniques.In Proc.of the 3rd IEEE Int.Conf.on Data Mining,2003,427-434.
    [60]Goldberg A B,Zhu X.Seeing stars when there aren't many stars:Graph-based semi-supervised learning for sentiment categorization.In Proc.of HLT-NAACL 2006 Workshop on Textgraphs:Graph-based Algorithms for Natural Language Processing,2006,45-52.
    [61]Mei Q,Ling X,Wondra M,et al.Topic sentiment mixture:modeling facets and opinions in Weblogs.In Proc.of the 16th Int.Conf.on World Wide Web,2007,171-180.
    [62]Ni X,Xue G,Ling X,et al.Exploring in the Weblog space by detecting informative and affective articles.In Proc.of the 16th Int.Conf.on World Wide Web,2007,281-290.
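    [63]唐慧丰,谭松波,程学旗.A comparative study of supervised-learning techniques for Chinese sentiment classification.Journal of Chinese Information Processing,2007,21(6),88-94,108.
    [64]徐军,丁宇新,王晓龙.Automatic sentiment classification of news using machine learning methods.Journal of Chinese Information Processing,2007,21(6),95-100.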
    [65]Chen B,He H,Guo J.Constructing maximum entropy language models for movie review subjectivity analysis.Journal of Computer Science and Technology(JCST),23(2),2008,231-239.
    [66]Huron D.Perceptual and cognitive applications in music information retrieval.In Proc.Int.Symp.Music Information Retrieval,2000.
    [67]Li T,Ogihara M.Toward intelligent music information retrieval,IEEE Transactions on Multimedia,Vol.8,No.3,June 2006,564-574.
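    [68]Lu L,Liu D,Zhang H.Automatic mood detection and tracking of music audio signals.IEEE Transactions on Audio,Speech,and Language Processing,Vol.14(1),2006,5-18.
    [69]郑亚斌,刘知远,孙茂松.Statistical features of Chinese song lyrics and their application to retrieval.Journal of Chinese Information Processing,2007,21(5),61-67.
    [70]胡熠,陆占汝,李学宁,et al.Research on text sentiment classification based on language modeling.Journal of Computer Research and Development,2007,44(9),1469-1475.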
    [71]Chen S F,Rosenfeld R.A Gaussian prior for smoothing maximum entropy models.Tech.Rep.CMU-CS-99-108,Carnegie Mellon University,1999.
    [72]Ney H,Essen U,Kneser R.On structuring probabilistic dependences in stochastic language modeling.Computer Speech and Language,8,1994,1-38.
    [73]Kazama J,Tsujii J.Evaluation and extension of maximum entropy models with inequality constraints.In Proc.EMNLP 2003,2003,137-144.
    [74]http://bbs.langtech.org.cn/
    [75]http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/
    [76]http://www.csie.ntu.edu.tw/~cjlin/libsvm/
    [77]http://trec.nist.gov/tracks.html
    [78]Roweis S T,Saul L K.Nonlinear dimensionality reduction by locally linear embedding.Science,vol.290,2000,2323-2326.
    [79]http://www.cs.cornell.edu/People/pabo/movie-review-data/
    [80]Deerwester S,Dumais S T,Furnas G W,et al.Indexing by latent semantic analysis.Journal of the American Society for Information Science,1990,41,391-407.
    [1]http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG/
    [2]http://research.nii.ac.jp/ntcir/
    [3]赵军,许洪波,黄萱菁,et al.Technical report on the Chinese opinion analysis evaluation.In Proc.of the 1st Chinese Opinion Analysis Evaluation Workshop,2008,1-20.
    [4]Eguchi K,Lavrenko V.Sentiment retrieval using generative models.In Proc.of the 2006 Conf.on Empirical Methods in Natural Language Processing,2006,345-354.
    [5]Skomorowski J.Topical opinion retrieval.[Master's Thesis].University of Waterloo,2006.
    [6]Osman D J,Yearwood J L.Opinion search in web logs.In Proc.of the 18th Conf.on Australasian Database,2007,133-139.
    [7]Zhang W,Yu C,Meng W.Opinion retrieval from blogs.In Proc.of the 16th ACM Conf.on Information and Knowledge Management,2007,831-840.
    [8]http://trec.nist.gov/
    [9]Ounis I,Rijke M,Macdonald C,et al.Overview of the TREC-2006 Blog Track.In Proc.of the 15th Text REtrieval Conf.,2006.
    [10]Macdonald C,Ounis I,Soboroff I.Overview of the TREC 2007 Blog Track.In Proc.of the 16th Text REtrieval Conf.,2007.
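    [11]张猛,彭一凡,樊扬.Research on Chinese opinion analysis.In Proc.of the 1st Chinese Opinion Analysis Evaluation Workshop,2008,38-45.
    [12]何慧,李思,肖芬,et al.PRIS technical report on Chinese sentiment orientation analysis.In Proc.of the 1st Chinese Opinion Analysis Evaluation Workshop,2008,46-55.
    [13]廖祥文,关峰,王宇,et al.ICTNET:opinion retrieval report for the Chinese Opinion Analysis Evaluation 2008.In Proc.of the 1st Chinese Opinion Analysis Evaluation Workshop,2008,94-98.
    [14]吉阳生,戴新宇,黄书剑,et al.Research on Chinese sentiment analysis and opinion retrieval.In Proc.of the 1st Chinese Opinion Analysis Evaluation Workshop,2008,99-104.
    [15]刘康,赵军.NLPR-OR:a new opinion retrieval system.In Proc.of the 1st Chinese Opinion Analysis Evaluation Workshop,2008,115-124.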
    [16]http://www.lemurproject.org/indri/
    [17]http://www.searchforum.org.cn/tansongbo/corpus-senti.htm
    [18]Buckley C,Voorhees E M.Retrieval Evaluation with Incomplete Information.In Proc.of the 27th Annual Int.ACM SIGIR Conf.on Research and Development in Information Retrieval,2004,25-32.
    [19]Chen B,He H,Xu W,et al.POC-NLW template based tagging method for Chinese word segmentation.In Proc.of the 2006 Int.Conf.on Computational Intelligence and Security,2006,1423-1428.
    [20]Ponte J M,Croft W B.A language modeling approach to information retrieval.In Proc.of the 21st Annual Int.ACM SIGIR Conf.on Research and Development in Information Retrieval,1998,275-281.
    [21]Turtle H,Croft W B.Evaluation of an inference network-based retrieval model.ACM Trans.on Information Systems,9(3),1991,187-222.
    [22]Strohman T,Metzler D,Turtle H,et al.Indri:A language model-based search engine for complex queries(extended version).CIIR Technical Report,2005.
    [23]Metzler D,Croft WB.Combining the language model and inference network approaches to retrieval.Information Processing and Management Special Issue on Bayesian Networks and Information Retrieval,40(5),2004,735-750.
    [24]He H,Chen B,Guo J.Emotion recognition of pop music based on maximum entropy with priors.In Proc.of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining(PAKDD'09),2009,788-795.
