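Research and Implementation of an Instant Messaging Monitoring System Based on Tendency Text Filtering (基于倾向性文本过滤的IM监控系统的研究与实现)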

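English title: Research and Implementation of Instant Messaging Monitoring System Based on Tendency Text Filtering
Author: 于海燕 (Yu Haiyan)
Degree level: Master's
Discipline: Computer Application Technology
Keywords (Chinese): Netfilter框架 ; 倾向性 ; 过滤模板 ; 旁路监控
Keywords (English): Netfilter framework ; tendency ; filtering profile ; bypass monitoring
Degree year: 2007
Advisor: 房鼎益 (Fang Dingyi)
Discipline code: 081203
Degree-granting institution: Northwest University (西北大学)
Thesis submission date: 2007-06-01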

Abstract

Instant Messaging (IM) is a form of real-time communication over the Internet used by millions of people. As networks have become more open and have grown in scale, IM has become a convenient way to exchange information freely and has greatly changed how people keep in touch. Its wide adoption also brings significant negative effects, such as the spread of harmful or illegal information, leakage of confidential information, reduced work efficiency, and higher network cost. There is therefore strong market demand for a system that can monitor IM software effectively. However, most IM filtering products currently available in China rely on topic-based filtering, which limits their filtering precision.
     To address the shortcomings of existing IM monitoring software, this thesis implements a prototype system with the goal of efficient and accurate monitoring. The main research work is as follows:
     1. The design philosophy and working principles of the Netfilter framework, the implementation platform of the IM monitoring system, are studied, with emphasis on its extension mechanism and applications. Based on the filtering requirements of the IM monitoring system, an appropriate Netfilter hook point is selected and the framework is extended to support application-layer IM protocols.
     2. An implementation scheme for the IM monitoring system is proposed, and the key techniques involved are analyzed in depth: IM protocol parsing, Chinese word segmentation, tendency text filtering, TCP connection blocking, Loadable Kernel Modules (LKM), and communication between kernel space and user space. To meet the system's requirements for filtering accuracy and real-time operation, and based on an analysis of the characteristics of IM text messages and of practical usage, tendency document filtering based on semantic analysis is studied in particular, and a tendency text filtering method suited to real-time filtering of IM messages (IMTTF) is given.
     3. A prototype IM monitoring system based on tendency text filtering, TFIMM (Instant Messaging Monitoring System based on Tendency Text Filtering), is designed and implemented. The system applies the IMTTF method and bypass monitoring, which improves the precision of IM text filtering while avoiding a negative impact on network speed.
     4. An experimental environment is set up. The IMTTF method is evaluated using recall, precision, and related metrics, and system performance is analyzed in terms of throughput (responses per second) and response delay. The experimental results show that the prototype system achieves the expected effect.
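The abstract reports that the filtering method was evaluated by recall and precision but does not give the formulas or the data. As a minimal sketch, assuming the standard definitions (precision = blocked messages that truly should be blocked / all blocked messages; recall = blocked messages that truly should be blocked / all messages that should be blocked), the two metrics can be computed as follows. The function name and the sample labels are hypothetical and are not taken from the thesis.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a binary message filter.

    predicted[i] is True when the filter blocked message i;
    actual[i] is True when message i really should have been blocked.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # correctly blocked
    fp = sum(1 for p, a in pairs if p and not a)    # blocked by mistake
    fn = sum(1 for p, a in pairs if not p and a)    # missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for six messages (illustrative only, not thesis data).
predicted = [True, True, False, True, False, False]
actual = [True, False, False, True, True, False]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Returning 0.0 when a denominator is zero is a design choice made for this sketch; an evaluation script could equally report the metric as undefined in that case.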
