Distributed Image Search Engine Design and Implementation


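English title: Distributed Image Search Engine Design and Implementation
Author: Zhan Hengfei (詹恒飞)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: image search; focused crawler; text extraction; Hadoop; Lucene
Degree year: 2010
Advisor: Yang Yuexiang (杨岳湘)
Discipline code: 081201
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2010-11-01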

Abstract

With the rapid growth of the Internet and the spread of imaging devices, image resources online are increasing at a remarkable rate. They are rich in content and carry a large amount of information. To extract and use this information effectively, this thesis analyzes existing image search techniques, selects text-based image search (which offers comparatively high accuracy) together with the Hadoop distributed platform, and builds a text-based distributed image search engine. The work completed is as follows:
     (1) Based on an analysis of web resource collection techniques and on the practical requirements of an image search engine, an image focused-crawler algorithm based on page authority and text completeness is proposed. The crawler gives priority to collecting pages and images from sites that are highly credible and dense in images, while still maintaining crawl breadth, so that collection per unit time is efficient.
     (2) Traditional text classification and information extraction techniques are studied. On that basis, an improved keyword extraction algorithm is proposed that builds on TF-IDF and adds further factors, including sentence-component recognition and weighting by the importance of the text's position on the page. The algorithm can extract, with reasonable accuracy, suitable sentences from a large body of text to describe the images on a page, and the extracted descriptions serve as the images' keywords. The open-source Lucene toolkit is used to build a customized inverted text index and search interface suited to image search.
     (3) To keep search fast for users even on large data sets, the thesis studies the Hadoop distributed platform and uses Hadoop's Map/Reduce programming model to distribute the computation. It designs an image search engine system that integrates distributed data collection, indexing, and search by combining the single-node collection and index generation techniques described above.
     (4) The distributed image search engine is implemented with the Eclipse development environment and performance-tested. The results show that it performs well in data collection capacity, speed of building large indexes, and search response.
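The crawler described in item (1) orders its work by page authority and text completeness. The abstract does not give the scoring formula, so the following is only a minimal sketch under stated assumptions: both signals are taken to be numbers in [0, 1], they are combined linearly with a hypothetical weight `alpha`, and the names `priority` and `Frontier` are illustrative rather than taken from the thesis. The breadth-preserving behaviour mentioned in the abstract is not modelled here.

```python
import heapq
from dataclasses import dataclass, field


def priority(authority: float, text_completeness: float, alpha: float = 0.6) -> float:
    """Linear mix of the two signals (assumed formula, not from the thesis)."""
    return alpha * authority + (1.0 - alpha) * text_completeness


@dataclass(order=True)
class _Entry:
    neg_score: float                 # negated so the min-heap pops the best URL first
    url: str = field(compare=False)


class Frontier:
    """URL frontier that always hands out the highest-priority unseen URL."""

    def __init__(self) -> None:
        self._heap: list[_Entry] = []
        self._seen: set[str] = set()

    def push(self, url: str, authority: float, text_completeness: float) -> None:
        if url in self._seen:        # never enqueue the same URL twice
            return
        self._seen.add(url)
        score = priority(authority, text_completeness)
        heapq.heappush(self._heap, _Entry(-score, url))

    def pop(self) -> str:
        return heapq.heappop(self._heap).url

    def __len__(self) -> int:
        return len(self._heap)
```

A crawl loop would call `pop()`, fetch the page, estimate the two signals for each outgoing link, and `push()` the links back. A real system would also need politeness delays and some mechanism to keep low-scoring sites from starving.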
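Item (2) combines TF-IDF with a weight for where the text sits on the page. As a rough illustration of that idea only, the sketch below multiplies each term occurrence by a position weight before applying a smoothed IDF. The position labels, the weight values in `POSITION_WEIGHT`, and the function names are all assumptions made for this example. The thesis's sentence-component recognition and sentence-level extraction are not reproduced, and this version scores individual terms.

```python
import math
from collections import Counter

# Hypothetical weights: text near the image or in the title counts for more.
POSITION_WEIGHT = {"title": 3.0, "alt": 2.5, "caption": 2.0, "body": 1.0}


def idf(term: str, corpus: list[set[str]]) -> float:
    """Smoothed inverse document frequency over a corpus of token sets."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0


def score_terms(page: list[tuple[str, list[str]]], corpus: list[set[str]]) -> dict[str, float]:
    """page is a list of (position, tokens) pairs taken from one web page."""
    weighted_tf: Counter = Counter()
    for position, tokens in page:
        weight = POSITION_WEIGHT.get(position, 1.0)
        for token in tokens:
            weighted_tf[token] += weight   # position-weighted term frequency
    return {t: tf * idf(t, corpus) for t, tf in weighted_tf.items()}


def top_keywords(page, corpus, k: int = 3) -> list[str]:
    """Highest-scoring terms, ties broken alphabetically for determinism."""
    scores = score_terms(page, corpus)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

With this weighting, a term that appears once in the title can outrank a term repeated in the body, which is the effect the position factor is meant to have.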
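Item (3) relies on the Map/Reduce model to build the index in a distributed way. The sketch below imitates the three phases (map, shuffle, reduce) in plain single-process Python to show how an inverted index from description terms to image IDs can be expressed in that model. It does not use the Hadoop or Lucene APIs, and the record layout (`image_id`, `description`) and function names are assumptions for illustration, not the thesis's design.

```python
from collections import defaultdict


def map_phase(image_id: str, description: str):
    """Emit one (term, image_id) pair per distinct term in the description."""
    for term in set(description.lower().split()):
        yield term, image_id


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term: str, image_ids: list[str]):
    """Collapse the grouped IDs into a sorted, duplicate-free posting list."""
    return term, sorted(set(image_ids))


def build_index(records):
    """records is an iterable of (image_id, description) pairs."""
    pairs = [kv for image_id, desc in records for kv in map_phase(image_id, desc)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


def search(index, query: str) -> list[str]:
    """Return the images whose descriptions contain every query term."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []
```

Because each map call depends only on its own record and each reduce call only on one term's group, both phases can be spread across machines, which is what makes the model a fit for large image collections.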

