Research on Blog Viewpoint Extraction Technology Based on Comment Analysis (基于评论分析的Blog观点提取技术研究)


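English title: Blog Viewpoint Extraction Based on Comment
Author: Sun Feng (孙峰)
Degree level: Master's
Discipline: Computer Science and Technology
Chinese keywords: 垃圾评论识别 (spam comment detection); Blog观点 (blog viewpoint); 余弦相似性 (cosine similarity)
English keywords: spam detection; blog viewpoint; cosine similarity
Degree year: 2008
Advisors: Huang Zhexue (黄哲学); Ye Yunming (叶允明)
Discipline code: 081201
Degree-granting institution: Harbin Institute of Technology (哈尔滨工业大学)
Thesis submission date: 2008-12-01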

Abstract

A blog is an information exchange platform built on RSS technology. It mediates between authors and readers in a journal-like style and represents a new way of disseminating and exchanging information. Compared with traditional web content, blogs are dynamic, interactive, and shareable, which makes it convenient for users to publish information and interact on the Internet.
     As blogs have grown rapidly, the swelling volume of information and the unbounded increase in information sources have made it hard for Internet users to find high-quality blogs. At the same time, blog sources contain a large number of spam blogs, and even a highly rated blog can contain many spam comments, which hampers reading and discussion. How to analyze blog information and assess blog quality has therefore become an urgent and highly significant problem.
     This thesis studies blog viewpoint extraction based on comment analysis. The goal is to evaluate blog information sources and obtain the degree of reader support for a blog. Because blog viewpoints are analyzed from the perspective of comments, and blogs turn out to contain a large number of spam comments, the research covers both the identification and filtering of spam comments and the extraction of blog viewpoints.
     An in-depth study of comment data shows that spam comments have four characteristics: highly repetitive content, a clustered set of spam commenters, a clustered set of spam links, and locally dense posting times. Based on these characteristics, this thesis scores each comment from the content, link, and posting-time perspectives and identifies spam comments by comparing the score with a specified threshold.
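The scoring idea in this paragraph can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything below is an assumption made for illustration: the whitespace tokenizer (Chinese text would need a word segmenter), the three helper functions, the weights, the 60-second window, and the threshold of 0.5. Only the use of cosine similarity for repeated content is suggested by the thesis keywords.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between whitespace-tokenized bag-of-words vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def content_score(index, comments):
    """Highest similarity between one comment and any other comment (hypothetical)."""
    text = comments[index]["text"]
    others = (c["text"] for i, c in enumerate(comments) if i != index)
    return max((cosine_similarity(text, other) for other in others), default=0.0)


def link_score(comment, spam_links):
    """Fraction of the comment's links found in a known spam-link set (hypothetical)."""
    links = comment["links"]
    if not links:
        return 0.0
    return sum(link in spam_links for link in links) / len(links)


def time_score(index, comments, window_seconds=60):
    """Fraction of the other comments posted within the window around this one (hypothetical)."""
    if len(comments) < 2:
        return 0.0
    posted = comments[index]["time"]
    near = sum(
        1
        for i, c in enumerate(comments)
        if i != index and abs(c["time"] - posted) <= window_seconds
    )
    return near / (len(comments) - 1)


def is_spam(index, comments, spam_links, weights=(0.5, 0.3, 0.2), threshold=0.5):
    """Combine the three scores with assumed weights and compare with an assumed threshold."""
    scores = (
        content_score(index, comments),
        link_score(comments[index], spam_links),
        time_score(index, comments),
    )
    total = sum(w * s for w, s in zip(weights, scores))
    return total >= threshold


# Made-up sample data: two near-simultaneous duplicate comments and one genuine comment.
sample_comments = [
    {"text": "cheap watches buy now", "links": ["http://spam.example/a"], "time": 0},
    {"text": "cheap watches buy now", "links": ["http://spam.example/a"], "time": 10},
    {"text": "I agree with the point about RSS readers", "links": [], "time": 5000},
]
sample_spam_links = {"http://spam.example/a"}
flags = [is_spam(i, sample_comments, sample_spam_links) for i in range(len(sample_comments))]
```

With this sample data, the two duplicated comments are flagged and the third is not. The commenter-set characteristic is left out here because the abstract names only content, links, and posting time as scoring perspectives.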
     An in-depth study of blog structure shows that a blog can be analyzed by the number of comments, the content of the comments, and the sentiment words the comments contain. Building on spam comment filtering, this thesis scores a blog from each of these three perspectives and combines the scores through balance factors to obtain the blog's support degree.
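One way to read the three-perspective scoring is as a weighted combination. The sketch below assumes this, since the abstract does not state how the scores are actually combined. The sentiment lexicons, the log scaling of the comment count, the Jaccard overlap used as a stand-in for content analysis, and the factor values `alpha`, `beta`, and `gamma` are all hypothetical choices, not the thesis's method.

```python
import math

# Hypothetical sentiment lexicons; the thesis's actual lexicon is not given in the abstract.
POSITIVE_WORDS = {"great", "agree", "helpful", "excellent"}
NEGATIVE_WORDS = {"wrong", "disagree", "useless", "bad"}


def count_score(num_comments, saturation=50):
    """Map the number of non-spam comments into [0, 1] on a log scale (assumed)."""
    return min(1.0, math.log1p(num_comments) / math.log1p(saturation))


def relevance_score(post_text, comments):
    """Mean Jaccard word overlap between the post and each comment (assumed stand-in)."""
    if not comments:
        return 0.0
    post_words = set(post_text.lower().split())
    total = 0.0
    for comment in comments:
        words = set(comment.lower().split())
        union = post_words | words
        total += len(post_words & words) / len(union) if union else 0.0
    return total / len(comments)


def sentiment_score(comments):
    """Share of positive among all sentiment words, mapped to [0, 1]; 0.5 means neutral."""
    words = [w for comment in comments for w in comment.lower().split()]
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    if positive + negative == 0:
        return 0.5
    return 0.5 + 0.5 * (positive - negative) / (positive + negative)


def support_degree(post_text, comments, alpha=0.2, beta=0.3, gamma=0.5):
    """Weighted sum of the three scores; alpha, beta, gamma play the role of balance factors."""
    return (
        alpha * count_score(len(comments))
        + beta * relevance_score(post_text, comments)
        + gamma * sentiment_score(comments)
    )


# Made-up example: comments are assumed to have passed spam filtering already.
post = "rss readers make blogs easy to follow"
filtered_comments = ["great post i agree rss readers are helpful", "this is wrong"]
degree = support_degree(post, filtered_comments)
```

Because the three factors sum to 1 and each score lies in [0, 1], the resulting support degree also lies in [0, 1], which makes blogs with different comment volumes comparable.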
     Based on these results, this thesis designs and implements an experimental prototype system for blog viewpoint extraction, comprising modules for data parsing, spam comment filtering, sentiment word extraction, and viewpoint extraction. It provides a base platform for related algorithm experiments and research.

