Research on Neural-Network-Based Deep Web Data Merging Techniques
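Abstract
With the rapid growth of the World Wide Web (WWW), the amount of useful information on the Internet is also increasing sharply. To use this information effectively, data from different websites must be identified and merged under a unified schema. However, the heterogeneity and autonomy of Web databases make this task highly challenging.
     Building on an analysis of existing Web data merging methods in China and abroad, and drawing on the self-organizing, self-learning, and self-adaptive properties of neural networks, this thesis proposes a neural-network-based Deep Web data merging technique.
     The main work includes:
     (1) A survey of the state of research on Deep Web data merging in China and abroad, and a study of the existing classic Web data merging techniques.
     (2) To address the shortcomings of existing Web data merging techniques, the thesis proposes a neural-network-based method. The method merges Web data by considering both the features of the data tables and the features of their attributes, offering an effective route to solving the Web data merging problem.
     (3) The neural-network-based Deep Web data merging method requires that data obtained from the Web first be converted. The thesis therefore also studies Web data conversion, which maps Web data into a structured relational database for further processing.
     (4) A neural network toolbox is used for network training and simulation to confirm that the merging method is practical to apply, and worked examples are used to verify the accuracy and advantages of the neural-network-based Web data merging technique.
     Finally, the thesis summarizes the work done and discusses directions for future research.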
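The data conversion step in item (3) can be illustrated with a minimal sketch. The abstract does not say which tools or schema the thesis used, so everything here is an assumption made for illustration: the sample XML (standing in for a result page already converted from HTML), the `book` table name, the hypothetical helper `xml_to_table`, and the choice of Python's standard `xml.etree.ElementTree` and `sqlite3` modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML, standing in for a Web result page already converted from HTML.
SAMPLE_XML = """<results>
  <book><title>Neural Networks</title><author>A. Author</author><price>35.50</price></book>
  <book><title>XML Basics</title><author>B. Writer</author><price>20.00</price></book>
</results>"""


def xml_to_table(xml_text, conn, table="book"):
    """Map every <table> element to one row of a relational table (illustrative only)."""
    root = ET.fromstring(xml_text)
    records = root.findall(table)
    if not records:
        raise ValueError(f"no <{table}> records found")

    # The column set is the union of child tags, in first-seen order.
    columns = []
    for record in records:
        for child in record:
            if child.tag not in columns:
                columns.append(child.tag)

    column_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_defs})')

    placeholders = ", ".join("?" for _ in columns)
    for record in records:
        # findtext returns None for a missing child, which is stored as NULL.
        row = [record.findtext(c) for c in columns]
        conn.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', row)
    return columns


conn = sqlite3.connect(":memory:")
columns = xml_to_table(SAMPLE_XML, conn)
rows = conn.execute("SELECT title, price FROM book ORDER BY title").fetchall()
print(columns)
print(rows)
```

Once the records sit in a relational table, each column can be treated as an attribute whose values are available for feature extraction in the later merging step.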
With the rapid development of the World Wide Web, the amount of useful information on the Internet has been increasing steadily. To make effective use of this information, data from different websites need to be recognized and merged under a uniform schema. However, the heterogeneity and autonomy of Web databases make this a very challenging problem.
     This thesis reviews and analyzes the literature, from China and abroad, on Deep Web data merging techniques, and proposes a Deep Web data merging method based on neural networks, drawing on their self-organizing, self-learning, and self-adapting properties. The main research points of the study are as follows:
     (1) It surveys the state of research on Deep Web data merging techniques in China and abroad, and studies several typical techniques in detail.
     (2) In view of the defects of previous Web data merging techniques, the study proposes a method based on neural networks. The method merges Web data on the basis of data table features and attribute features, providing a new approach to Web data merging.
     (3) Data conversion is a prerequisite for neural-network-based Deep Web data merging. The research therefore also studies Web data conversion techniques, which map Web data into structured relational databases for later analysis.
     (4) A neural network toolbox is used for network training and simulation, which confirms that the method is practical to apply. Several real cases are used to validate the data merging approach.
     Finally, the thesis summarizes the research work and proposes several suggestions for further study of Deep Web data merging.
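The attribute-matching idea in item (2) can be sketched as follows. The abstract does not specify the network architecture or the features used, so this is only an illustration under stated assumptions: each attribute (column) is described by three hypothetical features, and a winner-take-all competitive layer (a self-organizing map reduced to no neighbourhood function) is seeded from the attributes of one source and then used to label the attributes of another. The sample data and the names `attribute_features`, `winner`, and `train` are invented for this example.

```python
import math


def attribute_features(values):
    """Describe one column by three illustrative features, each in [0, 1]."""
    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    n = len(values)
    numeric_ratio = sum(is_number(v) for v in values) / n
    mean_length = min(sum(len(v) for v in values) / n / 50.0, 1.0)
    digit_ratio = sum(
        sum(ch.isdigit() for ch in v) / max(len(v), 1) for v in values
    ) / n
    return [numeric_ratio, mean_length, digit_ratio]


def winner(units, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(units)), key=lambda i: math.dist(units[i], x))


def train(units, samples, epochs=20, rate=0.1):
    """Winner-take-all learning: only the closest unit moves toward each sample."""
    for _ in range(epochs):
        for x in samples:
            w = units[winner(units, x)]
            for j in range(len(w)):
                w[j] += rate * (x[j] - w[j])
    return units


# Hypothetical columns from two Web sources describing the same kind of entity.
source_a = {
    "title": ["Neural Networks", "XML Basics", "Deep Web Data"],
    "price": ["35.50", "20.00", "18.90"],
}
source_b = {
    "book_name": ["Self-Organizing Maps", "Learning XML"],
    "cost": ["42.00", "15.75"],
}

# One unit per attribute of source A, seeded with that attribute's features.
names_a = list(source_a)
units = [attribute_features(source_a[name]) for name in names_a]
samples = [
    attribute_features(col)
    for col in list(source_a.values()) + list(source_b.values())
]
train(units, samples)

# Each attribute of source B is matched to the attribute whose unit wins.
matches = {
    name_b: names_a[winner(units, attribute_features(col))]
    for name_b, col in source_b.items()
}
print(matches)
```

Matching on value statistics rather than on attribute names is what lets differently named columns such as `cost` and `price` be paired; a realistic system would need far richer features and much more data than this toy example.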