Analysis of Internet Access Patterns Based on DNS Query Logs

English title: Analysis of Internet Access Pattern Based on the DNS Log
Author: 季成 (Ji Cheng)
Degree level: Master's
Discipline: Information and Communication Engineering
Keywords: DNS server; log analysis; access pattern; clustering; .CN ccTLD
Degree year: 2009
Advisor: 袁坚 (Yuan Jian)
Discipline code: 081001
Degree-granting institution: Tsinghua University
Submission date: 2009-05-01

Abstract

The Domain Name System (DNS), which translates between IP addresses and domain names, is critical Internet infrastructure and the basis of many other applications. Almost all IP-based information and communication services locate network resources through domain names. DNS logs therefore record which domain names users query and contain rich information about Internet access, which makes them a new avenue for studying Internet access patterns. However, because DNS logs are hard to obtain and very large, existing analyses of DNS data focus mainly on the performance and configuration of the DNS servers themselves, and comparatively little work has examined the user behavior information that the logs contain.
     This thesis analyzes Internet access patterns using DNS server logs collected from several CN nodes. The data were made available through the national domain name system resources managed by the China Internet Network Information Center. The main contributions are as follows. First, a domain name reduction method is proposed to compress the data, substantially reducing its volume while retaining the valid records. Because the logs are huge (about 200 GB per day), such preprocessing is necessary before any analysis.
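The abstract does not spell out the reduction rule. As a minimal sketch, assuming that reduction means collapsing each queried name to its registered .CN domain and counting queries per (client IP, domain) pair, the preprocessing could look like the following. The record layout, the label set `CATEGORY_LABELS`, and the function names `reduce_name` and `compress` are illustrative assumptions, not the thesis's method.

```python
from collections import Counter

# Assumed subset of .CN second-level category labels (not taken from the thesis).
CATEGORY_LABELS = {"com", "net", "org", "gov", "edu", "ac"}


def reduce_name(qname):
    """Collapse a queried name to its registered .CN domain (assumed rule).

    Returns None for names that are not under .cn.
    """
    labels = qname.lower().rstrip(".").split(".")
    if len(labels) < 2 or labels[-1] != "cn":
        return None
    # Keep three labels for names such as example.com.cn, otherwise two.
    keep = 3 if labels[-2] in CATEGORY_LABELS and len(labels) >= 3 else 2
    return ".".join(labels[-keep:])


def compress(records):
    """Aggregate (client_ip, qname) records into counts per (ip, domain)."""
    counts = Counter()
    for ip, qname in records:
        domain = reduce_name(qname)
        if domain is not None:
            counts[(ip, domain)] += 1
    return counts
```

Under these assumptions, `www.a.cn` and `mail.a.cn` queried by the same client collapse into a single `a.cn` entry with a count of 2, which is where the reduction in volume would come from.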
     Second, statistical regularities are analyzed on the preprocessed data. Query counts per domain name follow a Zipf-like distribution: about 5% of sites account for more than 90% of users' domain name queries. The distribution of query volume per user follows a stretched exponential distribution, which lies between a power law and an exponential function. This reflects the coexistence of determinism and randomness in how users choose DNS recursive servers to issue CN domain name queries.
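The two domain-level statistics mentioned above can be illustrated with a short sketch. It assumes the input is a plain list of per-domain query counts; the helper names `top_share` and `zipf_exponent` are hypothetical, and the least-squares fit on log-log axes is only one simple way to estimate a Zipf-like exponent, not necessarily the procedure used in the thesis.

```python
import math


def top_share(counts, fraction):
    """Share of all queries that go to the top `fraction` of domains."""
    ranked = sorted(counts, reverse=True)
    k = max(1, math.ceil(fraction * len(ranked)))
    return sum(ranked[:k]) / sum(ranked)


def zipf_exponent(counts):
    """Estimate s in count ~ rank**(-s) by least squares on log-log axes.

    Needs at least two domains with a positive count.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope
```

Calling `top_share(counts, 0.05)` on real per-domain counts would give the quantity behind the "5% of sites, 90% of queries" observation, and an exponent near 1 from `zipf_exponent` would indicate classic Zipf behavior.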
     Finally, clustering analysis is performed on the DNS data. A feature extraction scheme for the IP addresses and domain names in the logs is proposed, and the resulting feature vectors are clustered with the K-means and BIRCH algorithms. The results show that the query patterns of IP addresses differ greatly and fall into three main patterns. A similar analysis of how .CN domain names are queried identifies the domain names that truly reflect the access needs of the vast majority of users. These results can support hierarchical management of domain names and resolution requests, prioritized processing of DNS queries, and better allocation of network and computing resources.
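The clustering step can be sketched in a few lines. The abstract does not describe the actual features, so this example assumes a hypothetical one: each IP address is represented by its normalized 24-bin hourly query profile. Only K-means (Lloyd's algorithm) is shown; BIRCH is omitted. The names `hourly_profile` and `kmeans` and all parameters are illustrative.

```python
import random


def hourly_profile(query_hours):
    """Normalized 24-bin histogram of the hours (0-23) at which an IP queried."""
    counts = [0] * 24
    for hour in query_hours:
        counts[hour] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

With `k=3`, the returned labels would split IP addresses into three groups, in the spirit of the three main query patterns reported above. K-means depends on its initial centroids, so in practice one would run it with several seeds and compare the results.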
