Focused Crawler Strategy for Rainstorm Disasters Based on a Web Space Evolutionary Algorithm (基于网页空间进化算法的暴雨灾害主题爬虫策略)
  • English title: Focused Crawler for Rainstorm Disaster Strategy Based on Web Space Evolutionary Algorithm
  • Authors: LIU Jingfa (刘景发); LI Xin (李新); JIANG Shengyi (蒋盛益)
  • Keywords: multi-objective optimization; focused crawler; Web Space Evolutionary (WSE) algorithm; Pareto optimal; rainstorm disaster
  • Chinese journal title (code): JSJC
  • English journal title: Computer Engineering
  • Institutions: College of Computer and Software, Nanjing University of Information Science and Technology; College of Information Science and Technology, Guangdong University of Foreign Studies
  • Publication date: 2019-02-15
  • Publisher: Computer Engineering (计算机工程)
  • Year: 2019
  • Volume and issue: v.45; No.497
  • Funding: National Natural Science Foundation of China (61373016); Major Bidding Project of the National Social Science Fund of China (16ZDA047); Natural Science Foundation of Jiangsu Province (BK20171458, BK20181409)
  • Language: Chinese
  • Article ID: JSJC201902031
  • Page count: 7
  • Issue: 02
  • CN: 31-1289/TP
  • Pages: 190-196
Abstract
Single-objective optimization algorithms applied to the crawling problem struggle to find optimal weighting factors and tend to become trapped in local optima. To address these shortcomings, multi-objective optimization is introduced into focused crawling, and a Web Space Evolutionary (WSE) algorithm based on multi-objective optimization is proposed. The seed link library is updated by computing the shortest distance between a test link and the links in the seed link library and comparing it with the average distance between all links in the library. For the problem of selecting a Pareto-optimal solution in multi-objective optimization, a nearest-farthest candidate solution method is given. Experimental results show that, compared with algorithms such as breadth-first search, the proposed algorithm achieves higher crawling precision and stability.
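The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how links are represented, which distance is used, or in which direction the comparison goes, so the feature vectors, the `euclidean` distance, and the rule "admit a test link when its shortest distance to the library does not exceed the library's average pairwise distance" are all assumptions made for illustration. The `pareto_front` helper shows only standard non-dominated filtering under maximization; the nearest-farthest selection step is not specified in the abstract and is not reproduced here.

```python
from itertools import combinations
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Hypothetical distance between two link feature vectors (assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_pairwise_distance(seeds: List[Vector]) -> float:
    """Mean distance over all unordered pairs of links in the seed library."""
    pairs = list(combinations(seeds, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def maybe_add_seed(seeds: List[Vector], candidate: Vector) -> bool:
    """Assumed update rule: admit the candidate when its shortest distance
    to the library is no larger than the library's average pairwise distance."""
    if not seeds:
        seeds.append(candidate)
        return True
    shortest = min(euclidean(candidate, s) for s in seeds)
    if shortest <= average_pairwise_distance(seeds):
        seeds.append(candidate)
        return True
    return False


def dominates(a: Vector, b: Vector) -> bool:
    """a dominates b: no worse in every objective, strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Vector]) -> List[Vector]:
    """Keep only the non-dominated objective vectors (maximization)."""
    return [p for p in scores if not any(dominates(q, p) for q in scores)]


if __name__ == "__main__":
    library = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
    print(maybe_add_seed(library, (1.0, 1.0)))    # close to the library
    print(maybe_add_seed(library, (10.0, 10.0)))  # far from the library
    print(pareto_front([(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]))
```

In this sketch each objective vector could stand for several relevance scores of one candidate link, so that no weighting factor has to be chosen in advance; that reading is also an assumption.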
References
[1]BRIN S,PAGE L.The anatomy of a large-scale hypertextual web search engine[J].Computer Networks and ISDN Systems,1998,30(1-7):107-117.
[2]KLEINBERG J M.Authoritative sources in a hyperlinked environment[J].Journal of the ACM,1999,46(5):604-632.
[3]BRA P D,HOUBEN G J,KORNATZKY Y,et al.Information retrieval in distributed hypertexts[EB/OL].[2017-06-05].http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.27.7720.
[4]HERSOVICI M,JACOVI M,MAAREK Y S,et al.The shark-search algorithm.An application:tailored Web site mapping[J].Computer Networks and ISDN Systems,1998,30(1-7):317-326.
[5]DU Y J,LIU W J,LV X J,et al.An improved focused crawler based on semantic similarity vector space model[J].Applied Soft Computing,2015,36(C):392-407.
[6]WANG H.Design and implementation of a focused crawler based on breadth-first search[D].Shanghai:Fudan University,2011.(in Chinese)
[7]RAWAT S,PATIL D R.Efficient focused crawling based on best first search[C]//Proceedings of the 3rd IEEE International Advance Computing Conference.Washington D.C.,USA:IEEE Press,2013:908-911.
[8]WEI Y,LI P.Designing focused crawler based on improved genetic algorithm[C]//Proceedings of the 10th International Conference on Advanced Computational Intelligence.Washington D.C.,USA:IEEE Press,2018:319-323.
[9]NING H,WU H,HE Z Z,et al.Focused crawler URL analysis model based on improved genetic algorithm[C]//Proceedings of IEEE International Conference on Mechatronics and Automation.Washington D.C.,USA:IEEE Press,2011:2159-2164.
[10]ZHANG J,YANG X L.Self-planning of mobile communication networks based on simulated annealing algorithm[J].Computer Engineering,2017,43(5):83-87.(in Chinese)
[11]DEB K,PRATAP A,AGARWAL S,et al.A fast and elitist multi-objective genetic algorithm:NSGA-II[J].IEEE Transactions on Evolutionary Computation,2002,6(2):182-197.
[12]JIANG C J,PENG H,CHENG J C,et al.Automatic summarization based on topic word weights and sentence features[J].Journal of South China University of Technology(Natural Science Edition),2010,38(7):50-55.(in Chinese)
[13]GAO F,LIU Z,GAO H.A general vertical crawler method combined with a supervised breadth-first search strategy[J].Computer Engineering,2018,44(11):289-299.(in Chinese)
[14]WANG Z,MENG B.A comparison of approaches to Chinese word segmentation in hadoop[C]//Proceedings of IEEE International Conference on Data Mining Workshop.Washington D.C.,USA:IEEE Press,2015:844-850.
[15]WU Y L,ZHAO S L,LI C J,et al.Text classification method based on TF-IDF and cosine similarity[J].Journal of Chinese Information Processing,2017,31(5):138-145.(in Chinese)
[16]VELDHUIZEN D A V,LAMONT G B.Multi-objective evolutionary algorithms:analyzing the state-of-the-art[J].Evolutionary Computation,2000,8(2):125-147.
[17]LIU W J,DU Y J.A novel focused crawler based on cell-like membrane computing optimization algorithm[J].Neurocomputing,2014,123:266-280.
