Research and Application of Web Mining in Detecting Online Advertising Fraud
Abstract
With the growth of the Internet, online advertising has become a new means of marketing. Marketers in every industry promote their products and brands through a wide variety of online ads and pay advertising fees to do so. Pay-per-click (cost-per-click) advertising is currently a simple and popular billing model on the Internet: the advertiser is charged each time an ad on a web page is clicked and the user is taken to the related website or detail page. Click fraud arises in this pay-per-click model. It occurs when someone who has no interest in the ad itself clicks on it purely for some benefit, either manually or with a computer program that imitates a normal user. The emergence and spread of click fraud seriously harm the healthy development of the Internet.
     This thesis studies the application of Web mining to click fraud in online advertising. After an in-depth study of click fraud detection methods in China and abroad, it combines Web mining algorithms such as outlier mining, multivariate linear analysis, and time-series analysis to design a Web-mining-based model for detecting fraudulent ad clicks, and it systematically describes the model's detection system. The detection system has two main steps: preliminary assessment and assessment correction. Preliminary assessment analyzes the current click stream together with the click stream over a short recent period, assigns the click a preliminary score, and reports that score to the front end. Assessment correction uses Web mining techniques to correct the preliminary assessment and to make predictions. For data processing, the data are first preprocessed. Because the attributes of the collected data are clearly labeled, the required operations are data cleaning, session identification, attribute selection, format conversion, and normalization. However, because the collected data set consists of two parts, server logs and script-captured click streams, data integration is also required, along with data supplementation and cross-checking. On the algorithm side, outliers are first separated out and analyzed on their own. Newly arriving data are analyzed by multiple linear regression together with the historical data set to predict which records are likely to be click fraud, and the prediction is reported to the front end by correcting the preliminary score. "Front end" here is relative to the server and comprises website owners, advertisers, and ad networks.
     According to the thesis, the detection model can effectively detect or block various kinds of click fraud, effectively screen out unintentional invalid clicks, and significantly improve the efficiency of click fraud detection without slowing down ad display. The model was tested in several groups of experiments, and the results were compared and analyzed. The experimental results indicate that the proposed solution can effectively detect click fraud carried out manually or with automated clicking programs that imitate normal users, which supports the feasibility of the model and the effectiveness of the approach.
     Finally, the thesis briefly summarizes its content, looks ahead to the trends and directions of fraudulent click detection, and discusses the shortcomings of this work in its detection scripts, user identification, mining algorithms, and follow-up analysis, all of which will be the focus of further research.
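The two-step scheme in the abstract (normalize, separate outliers, then correct a preliminary score with multiple linear regression over historical data) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not say which outlier-mining algorithm or scoring formula is used, so the z-score rule, the equal-weight blending of scores, and every function and parameter name below are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def min_max_normalize(column):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def split_outliers(values, z_threshold=2.0):
    """Return (inlier_indices, outlier_indices) using a z-score rule.

    Assumption: a simple z-score test stands in for the outlier-mining
    step, which the abstract does not specify.
    """
    mu, sigma = mean(values), pstdev(values)
    inliers, outliers = [], []
    for i, v in enumerate(values):
        z = 0.0 if sigma == 0 else abs(v - mu) / sigma
        (outliers if z > z_threshold else inliers).append(i)
    return inliers, outliers


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w1, w2, ...].
    """
    x = [[1.0] + list(row) for row in features]  # prepend intercept term
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, row):
    """Evaluate the fitted linear model on one feature row."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


def corrected_score(preliminary, predicted, blend=0.5):
    """Blend the preliminary and regression scores, clamped to [0, 1].

    Assumption: the blending rule is hypothetical; the abstract only says
    the preliminary score is corrected using the prediction.
    """
    score = (1 - blend) * preliminary + blend * predicted
    return max(0.0, min(1.0, score))
```

In this sketch the outlier split and the regression are deliberately independent functions, mirroring the abstract's statement that outliers are analyzed separately while new data are regressed against the historical set. In practice the feature columns (for example, clicks per minute from one address) would be normalized with `min_max_normalize` before fitting.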
