用户名: 密码: 验证码:
INTERNET中的信息网络提取分析及Rank相关研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
Internet使人们在信息资源共享和沟通交流的范围和方式上得到了极大的拓展。当传播手段不再成为瓶颈的时候,信息交互者对信息的理解和需求差异成为信息沟通的根本障碍。这种资源规模海量化与信息需求差异性造成的信息获取的困难,是Internet面临的越来越显著的问题。
     信息网络的研究,针对Internet中的各种形式的信息载体之间的相互联系进行研究,发现其中的内在联系和基本规律,发挥计算机的计算与信息处理的优势,为用户提供多种方式的有价值的评估与参考。搜索引擎等挖掘服务可以高效定位信息资源,Rank相关技术对于其挖掘质量有较大影响。人们研究发现,Internet中网络的度分布具有无尺度特性,这种结构的异质性可以为Rank相关问题提供有价值的内在信息,其有效性已经被搜索引擎等一些应用所证实。
     本论文介绍了我们在信息网络提取分析及Rank相关问题方面的一些研究工作。我们的工作主要针对WWW页面发布系统与新闻组讨论系统进行。在WWW系统方面,我们建立了Spider系统,并利用其获取了校园网有关数据,并从中提取了页面和站点层次的链接关系网络。利用链接关系网络,我们对网络的分布特性和连通特性等方面进行了研究分析;为解释网络中的幂率特性,建立了链接倾向性与内在适应性模型分析这两种因素对于幂率形成的影响;基于链接关系网络,本文提出了一种对Web页面进行快速Rank的ExpRank算法。
     在新闻组方面,我们编制了bot程序获取数据,并从中提取了参与者与线索关系网络以及邮帖回复关系网络,分析了数据集中的幂率分布特性。基于新闻组讨论系统的特点,提出了利用参与者和线索关系网络获得参与者和线索的IFP和IFT的方法;提出了基于邮贴回复关系网络的邮帖PostRank算法,对提出的算法含义及特点进行了讨论分析,并利用所获数据集进行了信息挖掘的相关试验。本论文研究工作对于信息网络的理解和Rank相关应用具有较好的参考价值和实用意义。
Internet breaks the bottleneck caused by the limitation in range and means of communication and resource sharing. Under this circumstance, gap based on the difference of comprehension and the requirement of information resources comes to be the essential obstacle for communication, and this becomes more and more significant.
     Research on information network focuses on the relationship among different kinds of information units. It can find the intrinsic principles and relationships on Internet, takes the advantage of data process of computers, and provides valuable information for the users. Mining services such as search engines provide efficient way for the resources locating on Internet. The Rank technology can highly impact the quality of mining results. Power-law features were discovered from the degree distribution of the Internet. The heterogeneity of network structure can provide valuable intrinsic information for the Rank related problems, and its effectiveness has been proved by some applications such as search engines.
     In this thesis, our research work on the extraction and analysis of information network and Rank related problems was introduced. We mainly focus on WWW publishing system and Usenet discussion system. On WWW system, we created Spider program, collected the dataset within campus network with the Spider and extracted the hyperlink network structure on page and site level. Features of the distribution and connection of hyperlink network were studied. To understand the emergence of power law distribution, a network model with preference attachment and intrinsic fitness was created and the simulation results were given. A quick rank algorithm based on web structure named ExpRank was proposed.
     On Usenet system, Bot program was created and used to collect the dataset from NNTP server. From the dataset, the relationship network of participators and threads and respondent was extracted. Based on the characteristic of Usenet Discussion System, a method to evaluation the IFP and IFT of participators and threads using the relationship network was proposed. The algorithm called PostRank to calculate the rank of postings based on the respondent networks was proposed. Some analysis of the meanings and features our algorithms were discussed, and the experimental results on collected datasets were given. Our work can provide valuable and practical information for the understanding and Rank application of information networks.
引文
[1] Information Networks Courses. Internet(WWW). http://www.cs.cornell.edu/home/kleinber/
    [2] Information Networks. Internet(WWW). http://www.cs.helsinki.fl/u/tsaparas
    
    [3] Albert R, Jeong H, Barabasi A L. Diameter of the World-Wide Web. Nature, 1999, 401:130
    
    [4] Broder A, Kumar R, et al. Graph structure in the web. Comput. Networks, 2000, 33:309- 320
    
    [5] Huberman B A, Adamic L A. Growth dynamics of the World-Wide Web. Nature, 1999, 399:130
    
    [6] Adamic L A, Huberman B A, Barabasi A L, et al. Power-law distribution of the world wide web. Science, 2000,287:2115a
    
    [7] Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. Computer Communication Review, 1999,29:252-262
    
    [8] Albert R, Barabasi A L. Statistical mechanics of complex networks. Reviews of Modern Physics, 2002, 74:47-97
    
    [9] Barabasi A L, Albert R. Emergence of scaling in Random Networks. Science, 1999, 286:509-512
    
    [10] Barabasi A L, Albert R, Jeong H. Mean-field theory for scale-free random networks. Physica A, 1999,272:173-187
    
    [11] Dorogovtsev S N, Mendes J F F. Effect of the accelerating growth of communications networks on their structure. Physical Review E, 2001, 63(025101)
    
    [12] Albert R, Barabasi A L. Topology of complex networks: Local events and universality. Phys. Rev, 2000, 85(24):5234-5237
    
    [13] Caldarelli G, Capocci A, Rios P D L, et al. Scale-Free Networks from Varying Vertex Intrinsic Fitness. Phys. Rev. Lett., 2002, 89(258702)
    
    [14] Servedio V D P, Caldarelli G. Vetex intrinsic fitness: How to produce arbitrary scale-free networks. Physical Review E, 2004, 70(056126)
    
    [15] Albert R, Jeong H, Barabasi A L. Attack and error tolerance of complex networks. Nature, 2000, 406:378-382
    
    [16] Pastor-Satorras R, Vespignani A. Epidemic spreading in scale-free networks. Phys. Rev. Lett., 2001, 86:3200-3203
    
    [17] Adamic L A, Lukose R M, et al. Search in power-law networks. Physical Review, 2001, 64(046135)
    [18] Brin S, Page L, Motwanl R, et al. The PageRank citation ranking: Bring order to the web. Technical report, Stanford University, 1999. Available at http://dbpubs.stanford.edu:8090/pub/1999-66
    [19] Kleinberg J. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 1999, 46(5):604-632
    [20] Borodin A, Rosenthal J S, et al. Finding Authorities and Hubs From Link Structures on the World Wide Web. Proceedings of 10th International World Wide Web Conference, 2001
    [21] Chakrabarti S, Dom B, et al. Automatic resource compilation by analyzing hyperlink structure and associated text. Proceedings of the 7th ACM-WWW International Conference. Brisbane: ACM Press, 1998. 65-74
    [22] Lempel R, Moran S. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. ACM Trans. on Information Systems, 2001, 19(2): 131-160
    [23] Conn D, Chang H. Learning to probabilistically identify authoritative documents. Proceedings of the 17th International Conference on Machine Learning. San Francisco: Morgan Kaufmann,2000. 167-174
    [24] Borodin A, Roberts G O, et al. Link analysis ranking: algorithms, theory, and experiments. ACM Transactions on Internet Technology, 2005, 5(1):231-297
    [25] Wasserman S, Faust K. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994
    [26] Granovetter M. The strength of weak ties. American Journal of Sociology, 78(6): 1360-1380
    [27] Watts D J, Strogatz S H. Collective dynamics of 'small world' networks. Nature, 1998, 393:440-442
    [28] Heaton J. Programming Spiders, Bots, and Aggregators in Java. SYBEX Inc.
    [29] Rozenfeld H D, et al. Statistics of cycles: how loopy is your network? J. Phys. A: Math. Gen., 2005, 38:4589-4595
    [30] Medard M, Lumetta S S. Network Reliability and Fault Tolerance. In: Proakis J, editor, Proceedings of Wiley Encyclopedia of Engineering
    [31] Mateti P, Deo N. On algorithms for enumerating all circuits of a graph. SIAM J. Comput., 1976, 5:90-99
    [32] Rao V V B, Muti V G K. Enumeration of all circuits of a graph. Proc. IEEE, 1969, 57:700-701
    [33] Johnson D B. Find all the elementary circuits of a directed graph. J. SIAM, 1975, 4:77-84
    [34] Tarjan R. Enumeration of the elementary circuits of a directed graph. J. SIAM, 1973, 2:211-216
    [35] Tiernan J C. An efficient search algorithm to find the elementary circuits of a graph. Comm. ACM, 1970, 13:722-726
    [36] Newman M E J. The Structure and Function of Complex networks. SIAM Review, 2003, 45:167-256
    [37] Floyd S, Paxson V. Difficulties in simulating the internet. IEEE/ACM Trans. Networks., 2001,9:392-403
    [38] Adamic L A, Lukose R M, Huberman B A. Local search in unstructured networks. Physical Review E, 2001, 64(046135)
    [39] Erdos P, Renyi A. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 1960,5:17-60
    [40] Jeh G, Wedom J. Scaling Personalized Web Search. Technical report, Stanford University, 2002. Available at http://dbpubs.stanford.edu:8090/pub/2002-12
    [41] Haveliwala T H. Topic-sensitiv PageRank. Proceedings of the Eleventh International World Wide Web Conference. New York: ACM Press, 2002. 517-526
    
    [42] Kamvar S D, Haveliwala T H, Manning C D, et al. Extrapolation Methods for Accelerating PageRank Computations. Proceedings of the Twelfth International World Wide Web Conference. New York: ACM Press, 2003. 261-270
    
    [43] Langville A N, Meyer C D. Deeper Inside PageRank. Internet Mathmatics, 2003, 1:335- 380
    
    [44] Moler C, Loan C V. Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later. SIAM Review, 2003, 45(1):3-49
    
    [45] Hochbruck M, Lubich C. On Krylov subspace approximations to the matrix exponential operator. SIAM J. Numer. Anal., 1997, 34(5):1911-1925
    
    [46] History of Usenet Newsgroups. Internet(WWW). http://www.usenetmonster.com/infocenter/articles/usenet_history.asp
    [47] Mozilla Wiki Mork. Internet(WWW). http://wiki.mozilla.org/Mork
    [48] w3.org. Network News Transfer Protocol. Internet(WWW). http://www.w3.org/Protocols/rfc977/rfc977
    [49] cpan.org. Comprehensive Perl Archive Network. Internet(WWW). http://www.cpan.org
    
    [50] Baeza-Yates R A, Ribeiro-Neto B A. Modern Information Retrieval. ACM Press / Addison- Wesley, 1999
    
    [51] Domingos P, Richardson M. Mining the Network Value of Customers. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 2001
    
    [52] Schwartz M F, Wood D C M. Discovering Shared Interests Using Graph Analysis. Communications of the ACM, 1993, 36(8):78-89
    [53] Xi W, Lind J, Brill E. Learning effective ranking functions for newsgroup search. Proceedings of the 27st Annual International ACM SIGIR Conference on Research and Develop- ment in Information Retrieval. New York: ACM Press, 2004
    
    [54] Borgs C, Chayes J T, Mahdian M, et al. Exploring the community structure of newsgroups. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004. 783-787
    
    [55] Agrawal S D, Rajagopaian S, Srikant R, et al. Mining Newsgroups Using Networks Arising from Social Behavior. Proceedings of the Twelfth International World Wide Web Conference. New York: ACM Press, 2003
    
    [56] Holland J H. Hidden Order : How Adaptation Builds Complexity. 1995
    
    [57] Badii R, Politi A. Complexity: Hierarchical structures and scaling in physics. Cambridge University Press, 1997
    [58] Bianconi G, Barabasi A L. Bose-Einstein Condensation in Complex Networks. Phy. Rev, 2001,56:6532
    [59] Bianconi G, Barabasi A L. Competition and multiscaling in evolving networks. Europhysics Letters, 2001, 54:436-442
    
    [60] Briggs J, Peat F D. TURBULENT MIRROR: An Illustrated Guide to Chaos Theroy and the Science of Wholeness. New York, 1989
    
    [61] Casbai I. 1/f noise in computer network traffic. Journal of Physics A: Mathematical and General, 1994,27:L417-L421
    
    [62] Dorogovtsev S N, Mendes J F F. Structure of Growing Networks with Preferential Linking. Phys. Rev, 2000, 85:4633
    
    [63] Dorogovtsev S N, Mendes J F F. Evolution in networks. Advances in Physics, 2002, 51:1079-1187
    
    [64] Govindan R, Reddy A. An analysis of Internet inter-domain topology and route stability. Proceedings of IEEE INFOCOM, 1997
    
    [65] Haken H. Principles of Brain Functioning: A Synergetic Approach to Brain Activity, Be- havior and Cognition. 1996
    
    [66] Jeong H, Tombor B, Albert R, et al. The large-scale organizaiton of metabolic networks. Nature, 2000, 407:651-654
    
    [67] Leland W E, Taqqu M S, et al. On the Self-Similar Nature of Ethernet Traffic(Extended Version). IEEE/ACM Transactions on Networking, 1994.
    
    [68] Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science, 2002,296:910-913
    [69] Newman M E J. Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E, 2001, 64(016131)
    [70] Newman M E J. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E, 2001, 64(016132)
    [71] Pansiot J J, Grad D. On routes and multicast trees in the Internet. ACM Computer Communication Review, 1998,28:41-50
    [72] Paxson V, Floyd S. Wide area traffic: the failure of Poisson modeling. IEEE/ACM Transactions on Networking, 1995. 226-244
    [73] Paxon V, Floyd S. Why We Don' t Know How To Simulate The Internet. Proceedings of the Winter Simulation Conference, 1997
    [74] Redner S. How popular is your paper? An empirical study of the citation distribution. European Physicial Journal B, 1998, 4:131-134
    [75] Sabater J, Sierra C. Reputation and Social Network Analysis in Multi-Agent Systems. Proceedings of AAMAS'02, Bologna, Italy, 2002. 475-482
    [76] Sole R V, Pastor-Satorras R, Smith E, et al. A model of large-scale proteome evolution. Adv. in Complex Systems, 2002, 5:43-54
    [77] Takayasu M, Takayasu H, et al. Dynamic phase transition observed in the internet traffic flow. Physica A, 2000, 277:248-255
    [78] Tangmunarunkit H, Govindan R, et al. Network Topology Generators: DegreeBased vs. structural. Proceedings of SIGCOMM' 02, Pittsburgh, Pennsylvania, USA, 2002
    [79] Turcotte D L, Rundle J B. Self-organized complexity in the physical, biological,and social sciences. PNAS, 2002, 99
    [80] Tuulos V H, Tirri H. Combining Topic Models and Social Networks for Chat Data Mining. Proceedings of 2004 IEEE/WIC/ACM International Conference on Web Intelligence, 2004. 206-213
    [81] Waxman B M. Routing of Multipoint Connections. IEEE Journal of Selected Areas in Communication, 1988, 6(9): 1617-1622
    [82] Willinger W, Taqqu M, et al. Self-similarity through high variability: statistical analysis of Ethernet LAN traffic at the source level. IEEE/ACM Transactions on Networking, 1997, 5:71-86
    [83] Willinger W, Govindan R, et al. Scaling phenomena in the Internet: Critically examining criticality. Proceedings of Natl. Acad. Sci, 2002. 2573-2580
    [84] Achlioptas D, Fiat A, et al. Web Search via Hub Synthesis. Proceedings of 42nd IEEE Symposium on Foundations of Computer Science, 2001. 611-618
    [85] Albert R, Barabasi A L. Topology of Evolving Networks: Local Events and Universality. Phys. Rev. Lett., 2000, 85(24):5234-5237
    [86] Alon N. Eigenvalues and Expanders. Combinatorica, 1986,6:83-96
    [87] Arasu A, Cho J, et al. Searching the Web. ACM Transactions on Internet Technology, 2001, 1(1):2-43
    [88] Bharat K, Henzinger M R. Improved Algorithms of Topic Distillation in a Hyperlinked Environment. Proceedings of 21st ACM SIGIR Conferenceon Research and Development in Information Retrieval, 1998
    [89] Bharat K, Broder A. A technique for measuring the relative size and overlap of public Web search engines. Proceedings of 7th International World Wide Web Conference, 1998
    [90] Bharat K, Broder A. The connectivity server: fast access to linkage information on the web. Proceedings of 7th WWW, 1998
    [91] Bollobas B, Borgs C, Chayes J. Directed scale-free graphs. Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms, 2003
    [92] Brin S, Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of 7th International World Wide Web Conference, 1998
    [93] Chakrabarti S, Dom B, et al. Mining the link structure of the World Wide Web. IEEE Computer, 1999.
    [94] Chen Q, Chang H, et al. The Origin of Power Laws in Internet Topologies Revisited. Proceedings of IEEE Infocom, 2002
    [95] Cherkassky B V, Goldberg A V. Negative-cycle detection algorithm. 1999.
    [96] Chung F, Lu L. The average distance in random graphs with given expected degrees. Proceedings of the National Academy of Sciences, 2002
    [97] Domingos P, Richardson M. The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. Advances in Neural Information Processing Systems, 2002,14
    [98] Dwork C, Kumar R, et al. Rank Aggregation Methods for the Web. Proceedings of 10th International World Wide Web Conference, 2001
    [99] Fagin R, Kumar R, Sivakumar D. Efficient similarity search and classification via rank aggregation. Proceedings of 2003 ACM SIGMOD Conference, 2003
    [100] Floyd R W Nondeterministic Algorithms. J. Assoc. Comput. Mach, 1967, 4:636-644
    [101] Garfield E. Citation analysis as a tool in journal evaluation. Science, 1972, 178(4060):471-479
    [102] Granovetter M. Threshold models of collective behavior. American Journal of Sociology, 1978, 83(6): 1420-1443
    [103] Henzinger M, Heydon A, et al. On Near-Uniform URL Sampling. Proceedings of 9th International World Wide Web Conference, 2000
    [104] Kempe D, Kleinberg J, Tardos E. Maximizing the Spread of Influence through a Social Network. Proceedings of 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2003
    [105] Kumar R, Raghavan P, et al. Stochastic models for the Web graph. Proceedings of 41th IEEE Symp. on Foundations of Computer Science, 2000
    [106] Mendelzon A, Rafiei D. What do the Neighbours Think? Computing Web Page Reputations. IEEE Data Engineering Bulletin, 2000, 23(3):9-16
    [107] Newman M E J, Strogatz S H, Watts D J. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E, 2001, 64(026118)
    [108] Ng A Y, Zheng A X, Jordan M I. Link analysis, eigenvectors, and stability. Proceedings of International Joint Conference on Artificial Intelligence, 2001
    [109] Plaxton C G, Rajaraman R, Andrea W Accessing Nearby Copies of Replicated Objects in a Distributed Environment. Proceedings of ACM Symposium on Parallel Algorithms and Architectures, 1997
    [110] Rafiei D, Mendelzon A. What is this Page Known for? Computing Web Page Reputation. Proceedings of the 9th ACM-WWW International Conference, 2000
    [111] Randall K H, Stata R, et al. The LINK database: Fast access to graphs of the Web. Technical report, Compaq Systems Research Center, 2001
    [112] Ratnasamy S, Francis P, et al. A Scalable Content-Addressable Network. Proceedings of ACM SIGCOMM, 2001
    [113] Rowstron A, Druschel P. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. Proceedings of 18th IFIP/ACM International Conference on Distributed Systems Platforms, 2001
    [114] White S, Smyth P. Algorithms for Estimating Relative Importance in Networks. SIGKDD'03, ACM, 2003
    [115] Sporns O. Network analysis, complexity, and brain function. Complexity, 2002, 8:56-60
    [116] Stoica I, Morris R. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. Proceedings of ACM SIGCOMM, 2001
    [117] Tsaparas P. Link Analysis Ranking: [PhD Thesis]. University of Toronto, 2004
    [118] Tsaparas P. Using Non-Linear Dynamical Systems for Web Searching and Ranking. Proceedings of Principles of Database Systems, 2004

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700