Research on Key Technologies of Text Information Organization for Information Retrieval
Abstract
Information retrieval systems are indispensable tools for obtaining information, but as the Internet keeps growing, information resources are expanding explosively, which poses enormous challenges to these systems. How to organize, process, and manage this information efficiently, and how to obtain the information users need from it quickly, accurately, and comprehensively, are problems that urgently need to be solved. Numerous studies show that sound information organization is the key to solving them.
     This dissertation is devoted to the combined use of text classification/clustering and text indexing techniques to improve the performance and degree of automation of text information organization, and to build a text information organization system for massive data. At present these key techniques and methods still have many shortcomings in practical use, mainly the following: (1) existing research on text clustering algorithms concentrates on improving accuracy and efficiency while neglecting effectiveness, for example parameters that are hard to determine and algorithms that work only for particular data distributions, so the algorithms cannot meet the needs of text topic mining; (2) text classification needs a large number of labeled samples for training, yet labeled samples are hard to obtain in practice, which leaves the classifier with poor generalization ability and classification accuracy that falls short of what automatic document categorization requires; (3) representing documents with the vector space model yields high-dimensional, sparse text vectors, which seriously harms the efficiency and accuracy of text classification; (4) existing index models were designed for Western languages, and because of the large differences between Chinese and Western languages none of them can build an ideal index for Chinese text.
     To address these problems, this dissertation uses theoretical analysis and experimental research to study the algorithms and models behind these key techniques and proposes corresponding solutions. The main research results are as follows:
     (1) To address the effectiveness problem of clustering algorithms in mining the topic structure of document collections, a parameter-free local-density clustering algorithm based on a dynamic threshold selection model, DTSLD, is proposed. Inspired by the layered filtering idea of wavelet denoising, the algorithm first builds a dynamic threshold selection model with hierarchical threshold selection, so that its parameters are chosen automatically. Second, it improves on the RDBKNN algorithm: to avoid the influence of global parameters and improve the correctness of parameter selection, the globally uniform nearest-neighbor parameter k is abandoned, and the dynamic threshold selection model instead picks a suitable set of neighbors for each data point, forming more natural neighborhoods; the relative-density threshold parameter δ is also chosen by the dynamic threshold selection model, but with a different strategy. Finally, for topic mining over document collections, the document similarity computation is improved with a polynomial kernel function, making it better suited to clustering high-dimensional text data. Experiments show that the algorithm is easy to use, adapts well to all kinds of cloud-shaped and manifold data distributions, and fully satisfies the effectiveness requirements of topic-structure mining for document collections.
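The abstract does not give DTSLD's formulas, so the following Python sketch only illustrates the general idea under stated assumptions: a polynomial kernel maps sparse text vectors to similarities, a layered thresholding step (loosely echoing the hierarchical thresholding of wavelet denoising) picks a per-point neighborhood instead of a global k, and points whose relative density clears a threshold δ are linked into clusters. All function names, the similarity-to-distance mapping, and the default value of delta are invented for illustration; in the thesis's DTSLD the δ threshold is itself chosen by the dynamic threshold selection model.

```python
import numpy as np

def polynomial_similarity(X, degree=2, coef0=1.0):
    """Polynomial-kernel similarity between row vectors (suited to sparse, high-dimensional text features)."""
    return (X @ X.T + coef0) ** degree

def natural_neighbor_cutoff(dist_row, levels=3):
    """Per-point distance cut-off chosen by repeatedly keeping only the distances
    below the mean of what remains (a layered, wavelet-style thresholding stand-in)."""
    d = np.sort(dist_row[dist_row > 0])
    if d.size == 0:
        return 0.0
    thr = d.mean()
    for _ in range(levels):
        kept = d[d <= thr]
        if kept.size < 2:
            break
        d = kept
        thr = d.mean()
    return thr

def relative_density_clustering(X, delta=0.8, degree=2):
    """Toy relative-density clustering with per-point neighborhoods (illustrative only)."""
    sim = polynomial_similarity(np.asarray(X, dtype=float), degree)
    dist = 1.0 / (1.0 + sim)                      # monotone map from similarity to distance
    np.fill_diagonal(dist, 0.0)
    n = dist.shape[0]
    cut = np.array([natural_neighbor_cutoff(dist[i]) for i in range(n)])
    nbrs = [np.where((dist[i] > 0) & (dist[i] <= cut[i]))[0] for i in range(n)]
    dens = np.array([1.0 / (dist[i, nbrs[i]].mean() + 1e-12) if nbrs[i].size else 0.0
                     for i in range(n)])
    rel = np.array([dens[i] / (dens[nbrs[i]].mean() + 1e-12) if nbrs[i].size else 0.0
                    for i in range(n)])

    # Link mutually close points whose relative density clears delta (union-find).
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(n):
        if rel[i] < delta:
            continue                              # treat low relative density as noise
        for j in nbrs[i]:
            if rel[j] >= delta and dist[i, j] <= min(cut[i], cut[j]):
                parent[find(i)] = find(j)
    return np.array([find(i) for i in range(n)])  # cluster label = representative index
```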
     (2) To address the small-sample problem that text classification faces in automatic document categorization, a transductive text classification algorithm based on semi-supervised learning and data editing, Tri-ed-training-Tsvm, is proposed. The design idea is to combine a semi-supervised learning algorithm with TSVM: when the initial training samples are insufficient, the semi-supervised algorithm's ability to learn from unlabeled samples is used to gradually enlarge the training set; the enlarged training set then trains the TSVM to obtain a relatively accurate decision boundary, which removes the need for TSVM's parameter N and avoids the difficulty and error of setting it by hand. The TSVM strategy of maximizing the margin between the two classes is then applied, exchanging the labels of boundary samples in pairs to reach the best classification accuracy. In addition, because the semi-supervised algorithm inevitably introduces many mislabeled and noisy samples while enlarging a small initial training set, a data editing technique based on the nearest-neighbor consistency constraint is introduced to correct mislabeled samples and remove noisy data during learning, improving the quality of the enlarged training set.
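As a rough picture of how Tri-training-style expansion and nearest-neighbor data editing could fit together, here is a compressed sketch assuming scikit-learn and NumPy arrays. The TSVM label-exchange step is not reproduced, a plain linear SVM stands in for the final transductive classifier, and the unanimous-agreement rule, the editing rule, and all names are simplifications rather than the dissertation's actual Tri-ed-training-Tsvm.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

def tri_train_with_editing(X_l, y_l, X_u, n_rounds=5, k_edit=5):
    """Sketch: grow the labeled set with Tri-training-style agreement on unlabeled data,
    clean it with a nearest-neighbor consistency edit, then train a final SVM."""
    def three_views(X, y):
        # Three base classifiers trained on bootstrap resamples of the current labeled pool.
        return [SVC(kernel="linear").fit(*resample(X, y, random_state=s)) for s in range(3)]

    X_train, y_train = np.asarray(X_l), np.asarray(y_l)
    X_u = np.asarray(X_u)
    clfs = three_views(X_train, y_train)
    for _ in range(n_rounds):
        if len(X_u) == 0:
            break
        votes = np.stack([c.predict(X_u) for c in clfs])
        agree = (votes[0] == votes[1]) & (votes[1] == votes[2])
        if not agree.any():
            break
        # Move unanimously labeled samples into the training pool as pseudo-labeled data.
        X_train = np.vstack([X_train, X_u[agree]])
        y_train = np.concatenate([y_train, votes[0][agree]])
        X_u = X_u[~agree]
        # Data editing: drop samples whose label conflicts with their k nearest neighbors.
        knn = KNeighborsClassifier(n_neighbors=min(k_edit, len(X_train))).fit(X_train, y_train)
        keep = knn.predict(X_train) == y_train
        X_train, y_train = X_train[keep], y_train[keep]
        clfs = three_views(X_train, y_train)
    return SVC(kernel="linear").fit(X_train, y_train)
```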
     (3) To address the drop in classification performance caused by the high-dimensional, sparse vectors produced by the vector space model, feature selection methods for text dimensionality reduction are studied. First, starting from the Fisher linear discriminant model and approaching the problem from the viewpoint of feature selection, an FS feature selection algorithm with high stability and strong selection ability is obtained through theoretical derivation and proofs of the relevant theorems. Second, based on experiments and theoretical analysis, the mutual information method is improved: instead of taking a term's highest contribution to any single class as its final score, the differences in its contributions across classes are used as the evaluation criterion, which greatly strengthens its feature selection ability. Finally, experiments verify the accuracy and time efficiency of these algorithms for document dimensionality reduction and show that the conclusion in the classic feature selection paper by Yang et al., that a feature's document frequency is correlated with its classification ability, is erroneous.
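The improved mutual-information criterion is described only qualitatively, so the sketch below illustrates just the stated idea: score each term by how much its per-class MI values differ (variance is used here as one possible spread measure) instead of by its single best class. The MI approximation follows the common log(P(t|c)/P(t)) form; the Fisher-based FS algorithm is not reproduced, and all names and the smoothing constant are illustrative.

```python
import numpy as np

def mutual_information_matrix(X_bin, y, classes, eps=1e-12):
    """MI(t, c) for a binary document-term occurrence matrix, using the common
    approximation MI(t, c) ~= log(P(t|c) / P(t))."""
    n_docs, _ = X_bin.shape
    mi = np.zeros((len(classes), X_bin.shape[1]))
    for ci, c in enumerate(classes):
        in_c = (y == c)
        docs_with_t_in_c = X_bin[in_c].sum(axis=0) + eps
        docs_with_t = X_bin.sum(axis=0) + eps
        p_t_given_c = docs_with_t_in_c / in_c.sum()
        p_t = docs_with_t / n_docs
        mi[ci] = np.log(p_t_given_c / p_t)
    return mi

def select_terms_by_mi_spread(X_bin, y, n_keep=2000):
    """Score each term by the spread of its per-class MI values (variance here)
    instead of its maximum over classes, then keep the top-scoring terms."""
    classes = np.unique(y)
    mi = mutual_information_matrix(np.asarray(X_bin), np.asarray(y), classes)
    spread = mi.var(axis=0)               # replaces max_c MI(t, c) as the final score
    return np.argsort(spread)[::-1][:n_keep]
```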
     (4) To address the inability of existing index models to build efficient indexes for Chinese text, a unified character-word hybrid index model suited to the characteristics of Chinese is proposed for the first time. The model is built on the inter-relevant successive trees (IRST) index model and exploits its property of preserving the order of characters; by extending the leaf-node structure it adds word-level index information and realizes a unified hybrid index of Chinese characters and words. At the same time, to remedy the low retrieval efficiency of the IRST model, the tree hierarchy of the original model is extended from a "root node - leaf node" structure to a "root node - branch node - leaf node" structure, overcoming the original model's limitation of sequential search only and enabling fast-lookup techniques, which greatly increases retrieval speed. Experiments show that the proposed unified hybrid index model successfully combines the advantages of character-based and word-based index models, with fast index construction and efficient querying: compared with character-based index models, retrieval speed and precision improve considerably, and compared with word-based index models, recall improves markedly.
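The abstract only outlines the extended structure (root - branch - leaf, with word postings attached to the extended nodes), so the toy index below is an assumption-laden illustration rather than the thesis's IRST-based model: the root level is keyed by a character, the branch level by its successor character (hash lookup instead of a sequential scan), leaves hold position postings, and a separate word table supplies word-level postings with a character-pair fallback. Class and method names are invented, and word offsets are assumed to come from an external segmenter.

```python
from collections import defaultdict

class MixedCharWordIndex:
    """Toy character/word mixed index: root = character, branch = successor character
    (hash-keyed, so lookup needs no sequential scan), leaf = (doc_id, position) postings;
    word-level postings are kept alongside the character-pair postings."""

    def __init__(self):
        self.char_pairs = defaultdict(lambda: defaultdict(list))  # root char -> next char -> postings
        self.words = defaultdict(list)                            # word -> postings

    def add_document(self, doc_id, text, segmented_words):
        """segmented_words is a list of (offset, word) pairs from an external word segmenter."""
        for i in range(len(text) - 1):
            self.char_pairs[text[i]][text[i + 1]].append((doc_id, i))
        for offset, word in segmented_words:
            self.words[word].append((doc_id, offset))

    def search(self, query):
        """Prefer word-level postings; otherwise intersect successive character-pair postings."""
        if query in self.words:
            return self.words[query]
        if len(query) < 2:
            return []
        hits = set(self.char_pairs.get(query[0], {}).get(query[1], []))
        for i in range(1, len(query) - 1):
            shifted = {(d, p - i) for d, p in self.char_pairs.get(query[i], {}).get(query[i + 1], [])}
            hits &= shifted                      # align every character pair to the same start position
        return sorted(hits)

# Example: the word "检索" is answered from the word table, "索系" falls back to character pairs.
idx = MixedCharWordIndex()
idx.add_document(1, "信息检索系统", [(0, "信息"), (2, "检索"), (4, "系统")])
print(idx.search("检索"))   # [(1, 2)]
print(idx.search("索系"))   # [(1, 3)]
```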
     (5) Finally, using the key techniques proposed above and following a process-oriented, component-based, layered design, a text information organization platform for information retrieval was implemented; the platform has already been applied in several research tasks and engineering projects.
With the development of computers and the Internet, information resources are growing rapidly, and the volume of text in particular is growing explosively. Faced with such a large and rapidly expanding ocean of information, how to handle and manage it efficiently, and how to obtain the information users need accurately and comprehensively, is a major challenge for the information science field. As an important means of solving these problems, text information organization has very broad application prospects.
     This dissertation is committed to the integrated use of document classification/clustering and document indexing technology to improve the performance and degree of automation of text organization systems. However, these key technologies and methods still have many deficiencies in practical applications, mainly the following: (1) existing document clustering algorithms focus on improving accuracy and efficiency while their effectiveness is often neglected (parameters are difficult to determine, the algorithms work only on specific data distributions, etc.), so they cannot meet the demands of document topic mining; (2) document classification requires a large number of labeled samples for training, but labeled samples are difficult to obtain, which lowers the classifier's generalization ability and leaves classification accuracy below what is needed; (3) the VSM produces high-dimensional document vectors, which seriously affects the efficiency and accuracy of document classification; (4) existing indexing models are designed for Western languages and cannot build an ideal index for Chinese documents because of the differences between Chinese and Western languages. To solve these problems, this dissertation focuses on the models and algorithms behind these key technologies and gives the corresponding solutions through theoretical analysis and experimental research. The research results are as follows:
     (1) For the effectiveness problem of clustering algorithms, a parameter-free local-density clustering algorithm based on a dynamic threshold selection model, DTSLD, is proposed. Inspired by the layered filtering idea of wavelet denoising, the algorithm first establishes a tiered dynamic threshold selection model that selects its parameters automatically. Second, building on the RDBKNN algorithm, it uses the dynamic threshold selection model to choose a suitable set of neighbors for each data point instead of a unified global neighbor parameter k, avoiding the influence of global parameters. The relative-density threshold parameter δ is also chosen with the dynamic threshold selection model, but under a different strategy. Finally, for document topic mining, the document similarity computation is improved with a polynomial kernel function. Experiments show that the algorithm is easy to use, is effective on a variety of cloud-like and manifold data distributions, and fully meets the demands of document topic mining.
     (2) For the small-sample issue, a new transductive classification algorithm is proposed. The algorithm first uses the Tri-training algorithm to exploit the unlabeled samples, gradually expanding the training set by transferring the knowledge of the data distribution implied by the unlabeled samples to the classifier. It then uses paired label exchanging, an idea from transductive support vector machines, to maximize the margin between samples, which improves the handling of boundary samples. In addition, since the initial training set is small, the Tri-training algorithm inevitably introduces a lot of noisy data and mislabeled samples while expanding the training set; to counter this negative impact, a data editing technique based on the nearest-neighbor consistency constraint is introduced to purify the learning process by correcting mislabeled samples and removing noise, which improves the quality of the expanded training set.
     (3) For the "curse of dimensionality" problem, feature selection algorithms for text classification are studied. A new feature selection algorithm is proposed based on the Fisher linear discriminant model; it converts the solution process into a feature optimization problem and avoids complex matrix operations. At the same time, the poorly performing MI method is improved by using the variance of a term's contributions across categories, instead of its highest contribution, as the final score. Experiments also show that there is an error in the conclusion of Yang's paper that a feature's document frequency and its classification capability are correlated.
     (4) A unified mixed character-word full-text indexing model is proposed. The model is based on the IRST indexing model and uses its property of preserving the order relations between characters; word information is added to the nodes by extending their structure. To address the low search efficiency of the IRST model, the "root node - leaf node" structure is extended to a "root node - branch node - leaf node" structure, overcoming the shortcoming that the original model could only be searched sequentially and could not use fast-lookup techniques such as hash tables, which greatly improves retrieval speed. Experiments show that the new model successfully combines character and word indexing: it has higher recall than the word-based indexing model, and higher precision and faster retrieval than the character-based indexing model.
References
[1]中国互联网络信息中心.第16次中国互联网络发展状况统计报告, 2005 http://www.cnnic.net.cn/uploadflles/pdf/2005/7/20/210342.pdf
    [2]张刚,谭建龙.分布式信息检索中文档集合划分问题的评价.软件学报, 2008, 19(1): 136-143.
    [3]郎皓,王斌,李锦涛,丁凡.文本检索的查询性能预测.软件学报, 2008, 19(2): 291-300.
    [4]王永恒.海量短语信息挖掘技术的研究与实现.博士学位论文,国防科学技术大学, 2006.
    [5]陈晓云.文本挖掘若干关键技术研究.博士学位论文,复旦大学, 2005.
    [6]王浩畅,赵铁军.生物医学文本挖掘技术的研究与进展.中文信息学报, 2008, 22(3).
    [7]苏金树,张博锋,徐昕.基于机器学习的文本分类技术研究进展.软件学报, 2006, 17(9): 1848-1859.
    [8] Debole F, Sebastiani F. Supervised term weighting for automated text categorization. In: Haddad H, George AP, eds. Proc. of the 18th ACM Symp. on Applied Computing (SAC-03). Melbourne: ACM Press, 2003: 784-788.
    [9] Xue D, Sun M. Chinese text categorization based on the binary weighting model with non-binary smoothing. In: Sebastiani F, ed. Proc. of the 25th European Conf. on Information Retrieval (ECIR-03). Pisa: Springer-Verlag, 2003: 408-419.
    [10] Lertnattee V, Theeramunkong T. Effect of term distributions on centroid-based text categorization. Information Sciences, 2004, 158(1): 89-115.
    [11]吴志峰,田学东.人名、机构名在基于概念的文本分类中的应用研究.河北大学学报, 2004, 24(6): 657-661
    [12] Moschitti A, Basili R. Complex linguistic features for text classification: A comprehensive study. In: McDonald S, Tait J, eds. Proc. of the 26th European Conf. on Information Retrieval Research (ECIR-04). Sunderland: Springer-Verlag, 2004: 181-196.
    [13] Kehagias A, Petridis V, Kaburlasos VG, Fragkou P. A comparison of word- and sense-based text categorization using several classification algorithms. Journal of Intelligent Information Systems, 2003, 21(3): 227-247.
    [14] Bigi B. Using Kullback-Leibler distance for text categorization. In: Sebastiani F, ed. Proc. of the 25th European Conf. on Information Retrieval (ECIR-03). Pisa: Springer-Verlag, 2003: 305-319.
    [15] Zhai C, J Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference. 2001:334-342.
    [16]丁国栋,白硕,王斌.文本检索的统计语言建模方法综述.计算机研究与发展, 2006. 43(5): 769-776.
    [17] K.Collins-Thompson, P.Ogilvie, Y.Zhang, J.Callan. Information filtering: novelty detection and named-page finding. In Proceedings of the 11th Text Retrieval Conference.National Institute of Standards and Technology, 2002.
    [18] S.Eyheramendy, D.D.Lewis and D.Madigan. On the naive bayes model for text categorization. Artificial Intelligence & Statistics, 2003.
    [19]李荣陆,胡运发.基于密度的KNN文本分类器训练样本裁剪方法.计算机研究与发展, 2004, 41(4): 539-545.
    [20] F.Peng and D.Schuurmans. Combining naive bayes and n-gram language models for text classification. Proceedings of the 25th European Conference on Information Retrieval Research (ECIR03). April 14 -16, 2003, Pisa, Italy.
    [21]李荣陆,王建会,陈晓云,陶晓鹏,胡运发.使用最大熵模型进行中文文本分类.计算机研究与发展, 2005, 42(1): 94-101.
    [22] Yang Y, Zhang J, Kisiel B. A scalability analysis of classifiers in text categorization. In: Callan J, Cormack G, Clarke C, Hawking D, Smeaton A, eds. Proc. of the 26th ACM Int’l Conf. on Research and Development in Information Retrieval (SIGIR-03). Toronto: ACM Press, 2003: 96-103.
    [23] Liu TY, Yang Y, Wan H, Zhou Q, Gao B, Zeng HJ, Chen Z, Ma WY. An experimental study on large-scale web categorization. In: Ellis A, Hagino T, eds. Proc. of the 14th Int’l World Wide Web Conf (WWW-05). Chiba: ACM Press, 2005: 1106-1107.
    [24] Chakrabarti S, Roy S, Soundalgekar M. Fast and accurate text classification via multiple linear discriminant projections. Int’l Journal on Very Large Data Bases, 2003, 12(2): 170-185.
    [25] Wang J, Wang H, Zhang S, Hu Y. A simple and efficient algorithm to classify a large scale of text. Journal of Computer Research and Development, 2005, 42(1): 85-93 (in Chinese with English abstract).
    [26] Tan S, Cheng X, Wang B, Xu H, Ghanem MM, Guo Y. Using dragpushing to refine centroid text classifiers. In: Ricardo ABY, Nivio Z, Gary M, Alistair M, John T, eds. Proc. of the ACM SIGIR-05. Salvador: ACM Press, 2005: 653-654.
    [27] Debole F, Sebastiani F. An analysis of the relative hardness of reuters-21578 subsets. Journal of the American Society for Information Science and Technology, 2004, 56(6): 584-596.
    [28] Lewis DD, Li F, Rose T, Yang Y. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004, 5(3): 361-397.
    [29] Forman G, Cohen I. Learning from little: Comparison of classifiers given little training. In: Jean FB, Floriana E, Fosca G, Dino P, eds. Proc. of the 8th European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD-04). Pisa: Springer-Verlag, 2004: 161-172.
    [30] Tsay JJ, Wang JD. Improving linear classifier for Chinese text categorization. Information Processing and Management, 2004, 40(2): 223-237.
    [31] Lam W, Lai KY. Automatic textual document categorization based on generalized instance sets and a metamodel. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2003, 25(5): 628-633.
    [32] Tan S, Cheng X, Ghanem MM, Wang B, Xu H. A novel refinement approach for text categorization. In: Otthein H, Hans JS, Norbert F, Abdur C, Wilfried T, eds. Proc. of the 14th ACM Conf. on Information and Knowledge Management (CIKM-05). Bremen: ACM Press, 2005: 469-476.
    [33] Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. ICML 1997: 412-420.
    [34] Liu H, Motoda H. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic, Norwell, MA, USA, 1998.
    [35]尚文倩,黄厚宽,刘玉玲,等.文本分类中基于基尼指数的特征选择算法研究.计算机研究与发展, 2006, 43(10): 1688-1694.
    [36]徐燕,李锦涛,王斌,孙春明,张森,文本分类中特征选择的约束研究.计算机研究与发展, 2008, 45(4): 596-602.
    [37]刁力力,胡可云,陆玉昌,石纯一.用Boosting方法组合增强Stumps进行文本分类.软件学报, 2002, 13(8):1363-1367.
    [38]唐春生,金以慧.基于全信息矩阵的多分类器集成方法.软件学报, 2003, 14(6): 1103-1109.
    [39] Bennett PN, Dumais ST, Horvitz E. The combination of text classifiers using reliability indicators. Information Retrieval, 2005, 8(1): 67-100.
    [40] Xu X, Zhang B, Zhong Q. Text categorization using SVMs with Rocchio ensemble for internet information classification. In: Lu X, Zhao W, eds. Proc of the 3rd Int’l Conf on Networking and Mobile Computing (ICCNMC-05). Springer-Verlag, 2005: 1022-1031.
    [41] K. Nigam, A. McCallum, S. Thrun, T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 2000,39(2/3): 103-134
    [42]袁时金,李荣陆,周水庚,胡运发.层次化中文文档分类.通信学报, 2004, 25(11): 55-63
    [43] Lijuan Cai,Thomas Hofmann.Hierarchical document categorization with support vector machines.CIKM 2004: 78-87
    [44] Shankar Ranganathan. Text Classification Combining Clustering and Hierarchical Approaches. Master Thesis, University of Madras, Chennai, India.
    [45] O Dekel, J Keshet, Y Singer. Large margin hierarchical classification. Proceedings of the twenty-first international conference on Machine learning ICML 2004
    [46] T. SU AND J. G. DY. A deterministic method for initializing k-means clustering. In proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004): 784–786.
    [47] W. SHENG AND X. LIU. A hybrid algorithm for k-medoid clustering of large data sets. Congress on Evolutionary Computation, 2004. CEC2004. 2004: 77–82.
    [48]孙吉贵,刘杰,赵连宇.聚类算法研究.软件学报, 2008, 19(1): 48-61.
    [49] J. LIU, J. P. LEE, L. LI, Z.-Q. LUO, AND K. M. WONG. Online clustering algorithms for radar emitter classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005: 1185–1196.
    [50] S. Bandyopadhyay, E. Coyle. An energy-efficient hierarchical clustering algorithm for wireless sensor networks. In: Proc. of the IEEE INFOCOM. 2003.
    [51] H. Van Dyke Parunak, Richard Rohwer, Theodore C. Belding, Sven Brueckner. Dynamic Decentralized Any-Time Hierarchical Clustering, in 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval, Seattle, USA, August 6-11, 2006.
    [52] Alexander Kraskov, Harald Stögbauer, Ralph G. Andrzejak, Peter Grassberger. Hierarchical Clustering Using Mutual Information. EUROPHYSICS LETTERS, 2005, 70(2): 278-284.
    [53] S. Chung, J. Jun and D. McLeod. Mining Gene Expression Datasets Using Density-based Clustering. In proceedings of ACM Conference on Information and Knowledge Management, Washington DC, November 2004.
    [54] Hans-Peter Kriegel, Martin Pfeifle. Hierarchical Density-Based Clustering of Uncertain Data. In proceedings of the Fifth IEEE International Conference on Data Mining(ICDM 2005). 2005: 689-692.
    [55] Anne Denton. Density-based clustering of time series subsequences. In Proceedings of the Third Workshop on Mining Temporal and Sequential Data (TDM 04) in conjunction with the tenth ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, Seattle, WA, 2004.
    [56] Mauro Falanga, Wei-keng Liao, Ying Liu, and Alok Choudhary. A Grid-based Clustering Algorithm using Adaptive Mesh Refinement. In Proceedings of the 7th Workshop on Mining Scientific and Engineering Datasets, April 2004.
    [57] Mauro Falanga, Wei-keng Liao, Ying Liu, and Alok Choudhary. A Grid-based Clustering Algorithm using Adaptive Mesh Refinement. In Proceedings of the 7th Workshop on Mining Scientific and Engineering Datasets, April 2004.
    [58] Shi Zhong, Joydeep Ghosh. A Unified Framework for Model-based Clustering. Journal of Machine Learning Research 4, 2003: 1001-1037.
    [59] M. Law, Alexander Topchy, Anil K. Jain. Model-based Clustering With Probabilistic Constraints. In proceedings of the Fifth SIAM International Conference on Data Mining (SDM 2005). 2005.
    [60] Gelbard R, Goldman O, Spiegler I. Investigating diversity of clustering methods: An empirical comparison. Data & Knowledge Engineering, 2007, 63(1): 155-166.
    [61] Kumar P, Krishna PR, Bapi RS, De SK. Rough clustering of sequential data. Data & Knowledge Engineering, 2007, 3(2): 183-199.
    [62] Cai WL, Chen SC, Zhang DQ. Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recognition, 2007, 40(3): 825-833.
    [63] Breunig MM, Kriegel H, Ng RT, Sander J. LOF: Identifying density-based local outliers. In: Chen WD, Naughton JF, Bernstein PA, eds. Proc. of the 2000 ACM SIGMOD Int’l Conf. on Management of Data. Dallas: ACM Press, 2000: 93-104.
    [64] Papadimitriou S, Kitagawa H, Gibbons PB, Faloutsos C. LOCI: Fast outlier detection using the local correlation integral. In: Dayal U, Ramamritham K, Vijayaraman TM, eds. Proc. of the 19th Int’l Conf. on Data Engineering. IEEE Computer Society Press, 2003: 315-326.
    [65] Aggarwal C, Yu P. An effective and efficient algorithm for high-dimensional outlier detection. The VLDB Journal, 2005, 14(2): 211-221.
    [66] Wei L, Gong XQ, Qian WN, Zhou AY. Finding outliers in high-dimensional space. Journal of Software, 2002, 13(2): 280-290 (in Chinese with English abstract).
    [67] He ZY, Xu XF, Huang JZ, Deng SC. FP-Outlier: Frequent pattern based outlier detection. Computer Science and Information System, 2005, 2(1): 103-118.
    [68] Dwi H. Widyantoro, Thomas R. Ioerger, John Yen. An Incremental Approach to Building a Cluster Hierarchy. ICDM 2002: 705-708.
    [69] Hans-Peter Kriegel, Peer Kroger, and Irina Gotlibovich. Incremental OPTICS:Efficient Computation of Updates in a Hierarchical Cluster Ordering. Proc. 5th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'03), Prague, Czech Rep. 2003.
    [70]冯中慧,鲍军鹏,沈钧毅.一种增量式文本软聚类算法.西安交通大学学报, 2007, 41(4): 398-411.
    [71] Yongheng Wang, Yan Jia and Shuqiang Yang. Ontology-based Short Conversations Clustering in Very Large Text Database. In Proceeding of the VLDB Workshop on Ontologies-based techniques for DataBases and Information Systems (ODBIS’06). 2006.
    [72] Yongheng Wang, Yan Jia and Shuqiang Yang. Short Documents Clustering in Very Large Text Databases. In Proceeding of the WISE Workshop on Web-Based Massive Data Processing (WMDP2006). 2006
    [73]黄永光,刘挺,车万翔,胡晓光.面向变异短文本的快速聚类算法.中文信息学报, 2007. 21(2): 63-68.
    [74]蔡颖琨,谢昆青,马修军.屏蔽了输入参数敏感性的DBSCAN改进算法.北京大学学报(自然科学版), 2004, 40(3): 480-486.
    [75]周水庚,周傲英,曹晶.基于数据分区的DBSCAN算法.计算机研究与发展, 2000 , 37 (10) : 1153-1159.
    [76] E. Achtert, H. P. Kriegel, A. Pryakhin, M. Schubert. Hierarchical Density-Based Clustering for Multi-Represented Objects. In proceedings of the Workshop on Mining Complex Data (MCD 2005), in conjunction with the 5th International Conference on Data Mining, Houston, TX, USA, 2005.
    [77]骆吉洲,李建中.一种索引结构的压缩存储及其查询处理技术.计算机工程与应用, 2007, 43(8).
    [78]朱虹,吴林.倒排索引压缩及在RDBMS全文检索中的实现.华中科技大学学报(自然科学版), 2005, 33(4).
    [79] Christos Faloutsos. Access methods for text. Computing Survey, 1985. 17(1).
    [80] Gonnet, Gaston, Baeza-Yates, Ricardo, and Tim Snider. New indices for text: Pat trees and Pat Arrays. in Frakes, W.B., Ricardo Baeza-Yates(Eds), Information Retrieval Data Structure & Algorithms, Prentice Hall, NewJersey, 1992: 66-81.
    [81] U. Manber, E. Myers. Suffix arrays: A new method for on-line string searches. Proc. of the First Ann. ACM-SIAM Symp. on Discrete Algorithms, 1990: 319-327.
    [82] Manber. U, R. Baeza-Yates. An Algorithm for String Matching with a Sequence of Don't Cares. Information Processing Letters, 1991, 37(1): 33-36.
    [83] A.Amir, M.Farach, R.Giancarlo, Z.Galil, and K.Park. Dynamic dictionarymatching. Journal of Computer and System Sciences, 1993.
    [84] A.Amir, M.Farach, R.M.Idury, H.La Poutre, and A.A.Schafer. Improved dictionary matching. Proc. of the Fourth Ann. ACM-SIAM Symp.on Discrete Algorithms, 1993.
    [85] Lester N, Zobel J, Williams H E. In place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. Proc of ACSC 2004. Darlinghurst, Australia: Australian Computer Society, 2004: 15-22.
    [86] Lester N, Zobel J, Williams H E. Efficient online index maintenance for text retrieval systems. Information Processing & Management, 2006, 42(4): 916-933.
    [87] Buttcher S, Clarke C, Lushman B. Hybrid index maintenance for growing text collections. Proc of SIGIR 2006. New York: ACM, 2006.
    [88] Buttcher S, Clarke C. Indexing time vs. query time tradeoffs in dynamic information retrieval systems. Proc of CIKM 2005. New York: ACM, 2005.
    [89] Lester N, Moffat A, Zobel J. Fast on line index construction by geometric partitioning. Proc of CIKM 2005. New York, ACM, 2005: 776-783.
    [90] Chiueh T, Huang L. Efficient real-time index updates in text retrieval systems [R]. New York: Stony Brook, 1998.
    [91]郭瑞杰,程学旗,许洪波,王斌,丁国栋.一种基于动态平衡树的在线索引快速构建方法.计算机研究与发展, 2008, 45(10): 1769-1775.
    [92]胡争光,池天河,毕建涛.基于Lucene和GML/SVG的地图搜索引擎模型研究与实现.计算机应用研究, 2008, 25(4).
    [93]朱学昊,王儒敬,余锋林,唐昱.基于Lucene的站内搜索设计与实现.计算机应用与软件, 2008, 25(10).
    [94]宋懿,国德峰.基于压缩倒排文件的中文全文检索仿真系统.计算机工程, 2008, 34(9).
    [95]杨笑天,陶晓鹏.后缀数组创建算法的分析和比较.计算机工程, 2007, 33(3): 186-188.
    [96]彭波.搜索引擎的混合索引技术.计算机工程与应用, 2004, 40(2).
    [97]李仁勇.基于查询扩展的主题搜索引擎系统的设计与实现.东南大学,硕士学位论文, 2007.
    [98]杜慧平,侯汉清.网络环境中汉语叙词表的自动构建研究.情报学报, 2008, 27(6).
    [99]关佶红、胡运发、周水庚.基于邻接矩阵的全文索引模型.软件学报, 2002, 13(10): 1933-1942.
    [100]王政华,胡运发.基于后继区间的互关联后继树搜索算法.计算机工程, 2007, 35(9).
    [101]张猛,王大玲,于戈.一种基于自动阈值发现的文本聚类方法.计算机研究与发展, 2004, 41(10): 1748-1753.
    [102]丁军娣,马儒宁,陈松灿.一个基于多项式核的结构化有向树数据聚类算法.软件学报, 2008, 19(12): 3147-3160.
    [103]行小帅,潘进,焦李成.基于免疫规划的K-means聚类算法.计算机学报, 2003, 26(5): 605-610.
    [104] Chen SC, Zhang DQ. Robust image segmentation using FCM with spatial constraints based on new Kernel-Induced distance measure. IEEE Trans. on Systems, Man, and Cybernetics—Part B: Cybernetics, 2004, 34(4): 1907-1916.
    [105] Li J, Gao XB, Jiao LC. A new feature weighted fuzzy clustering algorithm. ACTA Electronica Sinica, 2006, 34(1): 412-420 (in Chinese with English abstract).
    [106] Ding C, He X. K-Nearest-Neighbor in data clustering: Incorporating local information into global optimization. In: Proc. of the ACM Symp. on Applied Computing. Nicosia: ACM Press, 2004: 584-589.
    [107] Nanni M, Pedreschi D. Time-Focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems, 2006, 27(3): 267-289.
    [108] Birant D, Kut A. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 2007, 60(1): 208-221.
    [109] P.H.Sneath and R.R.Sokal. Numerical Taxonomy. Freeman, London, UK, 1973.
    [110] S. Guha, R. Rastogi, K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. Proceedings of the ACM SIGMOD Conference, Jun 1998: 73-84.
    [111] Marques JP. Pattern Recognition Concepts, Methods and Applications. Wu YF, Trans. 2nd ed, Beijing: Tsinghua University Press, 2002: 51-74 (in Chinese).
    [112] Liu Q B, Deng Su, Lu C H, et al. Relative Density Based K-nearest Neighbors Clustering Algorithm. In: Proc. 2003 Int. Conf. on Machine Learning and Cybernetics, Xi’an, China, 2003: 133-137.
    [113]金阳,左万利.一种基于动态近邻选择模型的聚类算法.计算机学报, 2007, 30(5): 756-762.
    [114]刘健,陶玉静,张维明.模极大值与阈值决策融合的小波语音数据去噪方法.计算机应用研究, 2008, 25(10): 3134-3138.
    [115] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. When is nearest neighbor meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT-1999), Jerusalem, Israel, 1999: 217-235.
    [116]赵晖.支持向量机分类方法及其在文本分类中的应用研究.大连理工大学,博士学位论文, 2005.
    [117] Ding J, Ma R, Chen S. A Scale-based coherence connected tree algorithm for image segmentation. IEEE Trans. on Image Proc., 2008, 17(2): 204-216.
    [118] Wang Z, Chen S, Liu J, Zhang D. Pattern representation in feature extraction and classifier design: Matrix versus vector. IEEE Trans. on Neural Networks, 2008, 19(4): 758-76.
    [119] Iris. 1998. http://www.ics.uci.edu/-mlearn/MLSummary.html
    [120] reuters21578. 2004. http://www.daviddlewis.com/resources/testcollections/ruters21578/
    [121]谭松波,王月粉.中文文本分类语料库-TanCorpV1.0.
    [122] LIBSVM- A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
    [123] D. J. Miller, H. S. Uyar. A mixture of experts classifier with learning based on both labelled and unlabelled data. In: M. Mozer, M. I. Jordan, T. Petsche, eds. Advances in Neural Information Processing Systems 9, Cambridge, MA: MIT Press, 1997: 571-577.
    [124] T. Zhang, F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. In: Proceedings of the 17th International Conference on Machine Learning (ICML’00), San Francisco, CA, 2000: 1191-1198.
    [125] Z.-H. Zhou. Learning with unlabeled data and its application to image retrieval. In: Proceedings of the 9th Pacific Rim International Conference on Artificial Intelligence (PRICAI'06), Guilin, China, LNAI 4099, 2006, 5(10).
    [126] O. Chapelle, B. Schölkopf, A. Zien, eds. Semi-Supervised Learning, Cambridge, MA: MIT Press, 2006.
    [127] Z.-H. Zhou and M. Li. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(11): 1529–1541.
    [128] Goldman S, Zhou Y. Enhancing supervised learning with unlabeled data. In: Pat L, ed. Proc. of the 17th Int’l Conf. on Machine Learning (ICML 2000). San Francisco: Morgan Kaufmann Publishers, 2000: 327-334.
    [129] V. N. Vapnik. Statistical Learning Theory, New York: Wiley, 1998.
    [130] A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training, In Annual Conference on Computational Learning Theory (COLT-98), 1998.
    [131] K.Nigam, A.McCallum and T.Mitchell. Learning to classify text from labeled and unlabeled documents, in Proceedings of AAAI-1998.
    [132] T. Joachims. Transductive inference for text classification using support vector machines. In: Proceedings of the 16th International Conference on MachineLearning (ICML’99), Bled, Slovenia, 1999: 200-209.
    [133] N.Abe, H. Mamitsuka. Query learning strategies using boosting and bagging. In: Proceedings of the 15th International Conference on Machine Learning (ICML’98), Madison, WI, 1998: 1-9.
    [134] Angluin D, Laird P. Learning from noisy examples. Machine Learning, 1988, 2(4): 343-370.
    [135]邓超,郭茂祖.基于自适应数据剪辑策略的Tri-training算法.计算机学报, 2008, 30(8) : 1213-1225.
    [136] Sánchez JS, Barandela R, Marqués AI, Alejo R, Badenas J. Analysis of new techniques to obtain quality training sets. Pattern Recognition Letters, 2003, 24(7): 1015-1022.
    [137] W. Wang, Z.-H. Zhou. Analyzing co-training style algorithms. In: Proceedings of the 18th European Conference on Machine Learning (ECML’07), Warsaw, Poland, 2007.
    [138] Zhong S. Semi-Supervised model-based document clustering: A comparative study. Machine Learning, 2006, 65(1): 3-29.
    [139] Bouchachia A, Pedrycz W. Data clustering with partial supervision. Data Mining and Knowledge Discovery, 2006, 12(1): 47-78.
    [140]姜文瀚,周晓飞,杨静宇.子空间样本选择及其支持向量机人脸识别应用. 2007, 43(10).
    [141]陈毅松,汪国平,董士海.基于支持向量机的渐进直推式分类学习算法.软件学报, 2003, 14(3): 451 - 460.
    [142]廖东平,姜斌,魏玺章,黎湘,庄钊文.一种快速的渐进直推式支持向量机分类学习算法.系统工程与电子技术, 2007, 29(1): 87-91.
    [143] Chen L, Tokuda N, Nagai A. A new differential LSI space-based probabilistic document classifier. Information Processing Letters, 2003, 88(5): 203-212.
    [144] Kim H, Howland P, Park H. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, 2005, 6(1): 37-53.
    [145] Forman G. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 2003, 3(1): 1533-7928.
    [146] Chen W, Chang X, Wang H, Zhu J, Tianshun Y. Automatic word clustering for text categorization using global information. In: Myaeng SH, Zhou M, Wong KF, Zhang H, eds. Proc. of the Information Retrieval Technology, Asia Information Retrieval Symp. (AIRS 2004). Beijing: Springer-Verlag, 2004: 1-11.
    [147] Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. In: Fisher DH, ed. Proc. of the 14th Int’l Conf. on Machine Learning (ICML-97). Nashville: Morgan Kaufmann Publishers, 1997: 412-420.
    [148] Mladenic D, Grobelnik M. Feature selection for unbalanced class distribution and Naive Bayes. In: Proc of the 16th Int'l Conf Machine Learning (ICML'99), San Francisco: Morgan Kaufmann Publishers, 1999: 258-267.
    [149]余俊英,王明文,盛俊.文本分类中的类别信息特征选择方法.山东大学学报(理学版), 2006, 41(3): 10-13
    [150]谭松波.高性能文本分类算法研究.博士学位论文,中国科学院计算技术研究所, 2006.
    [151] OHSUMED. ftp://medir.ohsu.edu/pub/ohsumed
    [152]封举富,时建新.基因选择的快速Fisher优化模型.北京大学学报(自然科学版), 2005, 41(1).
    [153]胡运发.互关联后继树:一种新型全文数据库数学模型.复旦大学计算机与信息技术系,技术报告, 2002.
    [154]李江波,周强,陈祖舜.汉语词典的快速查询算法研究.中文信息学报, 2006, 20(5): 31-39.
    [155]曾海泉,刘永丹,宋扬,胡运发.基于互关联后继树的多时间序列关联模式挖掘.计算机研究与发展, 2003, 40(7): 935-940.
    [156] R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of 20th International Conference on Very Large Databases, Santiago, Chile, 1994: 487-499.
    [157] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In Proceedings of AAAI'94 Workshop Knowledge Discovery in Databases (KDD'94), Seattle, WA, July 1994: 181-192.
