Chinese Text Data Mining
Abstract
With the rapid development of information technology and computer networks, every industry produces and accumulates large amounts of data each day. Data mining, which discovers valuable information in massive data, has become an urgent and challenging research topic. Motivated by the practical needs of the mayor's public telephone line, this thesis studies the following aspects.
     Concentrated complaints from many citizens about one issue within a short period of time are called hot issues. Such issues arise quickly and in large numbers; if they are not handled promptly, they inevitably cause serious negative consequences and may even lead to collective petitions, blocked traffic, strikes and other severe incidents. How can hot issues be mined from massive text data? Extracting hot issues directly by document clustering performs poorly because the dimension of the document vector space is too high. This thesis therefore transforms hot issue extraction into first finding hot words, then clustering these words as variables so that hot words reflecting the same theme gather together, drawing the clustering tree, and finally extracting hot issues from the tree. The thesis describes the concrete implementation of this method, analyses its application to the mayor's public telephone data, and compares the results with manual extraction. The results show that the proposed method works very well: it is accurate and saves a great deal of manual labour.
     Based on the seasonal characteristics of the mayor's public telephone data, a naive Bayes classifier for time-series data is designed. Independence between the class labels and time is first tested; for classes that are not independent of time, kernel regression is used to estimate their prior probabilities over different time periods, which yields a naive Bayes classifier built on time-series data. Considering the impact of the naive Bayes conditional independence assumption on classification performance, a weighted naive Bayes classifier is also proposed. Its weighting parameters act on the class node: the posterior probability of each class is first computed by the naive Bayes classifier, then adjusted by a second weighting step before the final classification; the adjustment coefficients are determined by how complaint samples of different classes are distributed over time.
     To handle massive data, this thesis proposes a task-driven parallel algorithm and applies it successfully to decision tree learning and Bayesian multi-net learning. The mayor's public telephone data are split by month into 12 sub-sets, and a support vector machine is built on each sub-set; in practice the training time is reduced significantly, the requirement of classifying during the day and learning at night is met, and the accuracy is also greatly improved. In addition, feature word extraction based on the binomial test and feature phrase extraction based on word frequency are designed, and a rule-based scoring text classifier is proposed using the information obtained from the decision trees and Bayesian multi-nets.
With the rapid development of computer networks and information technology, the rate at which information is generated has grown enormously, and governments, companies and other institutions have accumulated large volumes of text data. Finding methods that extract valuable information from these data has therefore become an urgent and challenging task. Text data mining takes text as its mining object and combines quantitative computation with qualitative analysis to discover information structures and models that carry potentially valuable new information. The importance of text mining grows with the amount of text data, and its two basic tasks, classification and clustering, appear in virtually every application area.
     This work arises from specific practical problems in the mayor's public access line project, and the study is based on the real data sets. The thesis uncovers regularities hidden in massive text data and provides information for government decision making, which gives the study great practical significance. As the public telephone service develops and the data accumulate further, it becomes possible to mine more periodicities and trends. Hao Lizhu carried out basic research on the mayor's public telephone in his doctoral thesis; I build on his platform system, make new improvements, and explore hot spot mining, classifier accuracy improvement, Bayesian network construction and large-scale parallel computing.
     The first chapter discusses the development and current directions of Chinese word segmentation, feature dimensionality reduction, classification algorithms and performance evaluation. It also describes the mayor's public telephone data sets and presents work on thesaurus construction and frequency curve fitting.
     In thesaurus construction, the speed and quality of segmentation were unsatisfactory because the original thesaurus was too large. According to the actual needs of the text messages, I removed words that had long fallen out of use and designed a purification algorithm for the thesaurus, forming a specialized thesaurus. After purification the number of words was still large and word ambiguity remained considerable, so automatic segmentation was combined with manual processing to refine the thesaurus further. The specialized thesaurus greatly reduces the dimension of the vector space model and improves segmentation speed and accuracy.
     The second chapter focuses on finding hot issues of common concern in massive data. Although an individual complaint is a random event, statistical regularity emerges when hundreds of thousands of citizens complain about the same issues. Hot issues are complaints concentrated on one problem within a short period of time, such as noise during the college entrance examination, extra classes during the winter and summer vacations, the "three rural issues" in spring, or garbage problems.
     Such problems arise quickly and in large numbers; if not dealt with they cause serious negative effects, for example collective petitions, blocked traffic, or strikes. Extracting hot issues effectively helps the government make policies that address the difficulties the public encounters. Analysing hot issues manually is time-consuming and inefficient, and it is difficult to find them objectively among tens of thousands of complaints; often the right moment is missed or the problem is not found at all.
     Facing tens of thousands of complaints and grouping those with the same theme is essentially a clustering problem. How can hot issues be found in massive text data? Extracting them directly with traditional document clustering performs poorly because the dimension of the document vector space is too high. This thesis therefore transforms hot issue extraction into first finding hot words, clustering these words as variables so that words reflecting the same theme gather together, drawing the clustering tree, and then extracting hot issues from the tree. The chapter describes the extraction procedure, analyses its practical application, and compares the results with manual extraction; the results are good, being more accurate while saving considerable human and material resources. Finally, a method for determining the number of clusters is given.
     The events that reflect hot issues share some important key words, and these key words have strong statistical correlation with particular months. We call them hot words. To extract hot words with statistical methods we need a distinguishing criterion:
     1. The characteristic word has a strong statistical correlation with the current month.
     2. The word's dispersion within the current month is low; that is, the documents containing the word are concentrated in a few consecutive days.
     Stop words must be removed before hot words are extracted. To prevent stop words from disturbing hot word extraction, we identify them by the following conditions:
     1. The word has a high document frequency.
     2. The word has low statistical correlation with every month.
     To test the statistical correlation between a key word and the months we use a hypothesis test: H_0: w_r is independent of m_i for every i = 1, ..., 12; H_1: there is at least one k such that w_r is not independent of m_k. Table 1 shows the 2 × p contingency table of w_r.
     The chi-square statistic of the contingency table and the weighted chi-square statistic are then computed, where n_r denotes the document frequency of the word w_r and the sum over all n_r is the normalizing factor of the weighting.
     Ordering the words by the weighted chi-square statistic in ascending order, I drop the lowest 800 as stop words.
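     As an illustration of this step, the following Python sketch (SciPy assumed) builds the 2 × 12 month contingency table of a word, computes its chi-square statistic, and weights it by the word's document frequency; the particular weighting n_r / Σ n_r is an assumption based on the description above, not a formula stated in the text.

        import numpy as np
        from scipy.stats import chi2_contingency

        def month_table(doc_months, has_word):
            # 2 x 12 contingency table: row 0 = documents containing the word,
            # row 1 = documents without it; columns = months 1..12
            table = np.zeros((2, 12))
            for m, h in zip(doc_months, has_word):
                table[0 if h else 1, m - 1] += 1
            return table

        def weighted_chi2(table, total_doc_freq):
            # Pearson chi-square of the table, weighted by n_r / sum(n_r)
            # (assumed form of the weighting factor)
            chi2 = chi2_contingency(table)[0]
            n_r = table[0].sum()
            return chi2 * n_r / total_doc_freq

        # Words are then sorted by the weighted statistic in ascending order
        # and the lowest 800 are dropped as stop words, as described above.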
     Before picking out the hot words, we need the degree of correlation between a given word and the current month. We test the hypothesis on a 2 × 2 contingency table: H_0: w_r is independent of m_k; H_1: w_r is not independent of m_k.
     Table 2 The 2 × 2 contingency table of the month m_k.
     The chi-square statistic between a characteristic word and its correlated month m_k is then computed from this table.
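     For a 2 × 2 table with cell counts a, b, c, d and total n = a + b + c + d, the standard Pearson chi-square statistic (presumably the one intended here) is x^2 = n (ad - bc)^2 / [(a + b)(c + d)(a + c)(b + d)].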
     The thesis then examines the dispersion of the word W_j. Let the random variable T denote the date of a document, with sample space Ω = {1, 2, ..., k}, 1 ≤ k ≤ 31; the probability distribution, expectation and variance of T are as follows.
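     Under the natural reading, with n_t the number of the word's documents dated on day t and N their total number, these are P(T = t) = n_t / N, E[T] = Σ_t t · P(T = t), Var(T) = Σ_t (t - E[T])^2 · P(T = t), and σ = sqrt(Var(T)).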
     The variance reflects the degree of dispersion of the data. The hot word score is then computed from these quantities: x^2 is the chi-square value of the word W_j, σ is the standard deviation, 1 is a smoothing factor, and N is the document frequency of the word in the month; log(N/10) serves as the weighting coefficient, with 10 chosen from experience.
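     A plausible form combining these quantities, given here only as an assumption, is x_σ = [x^2 / (σ + 1)] · log(N / 10).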
     The hot word list of each month is obtained by ordering the words by x_σ in descending order; the word list of May is given as an example.
     Figure 1 The hot word list of May 2008
     On 12 May a devastating earthquake struck Wenchuan, drawing great attention and response from the citizens of Changchun, who expressed their wish to join the relief effort through the hotline "12345". The May data of the mayor's public telephone show that, although only about 100 documents contained these words, most of them concerned unexpected incidents or emergencies, and their number kept growing from 13 May, the day after the earthquake, to the end of the month, which shows that the event reflected by these words was indeed a hot issue. The monthly report of the mayor's public telephone office recorded it among the complaints showing obvious changes, as follows:
     (1) "there is great increase in civil administration and the reflection related to the earthquake relief stands out."
     This event is listed first among all the complaints with obvious changes, so its importance is clear. At the same time, the importance assigned to the event manually agrees with that found automatically by our method.
     The hot word table shows that several words sometimes relate to the same event. If all documents related to each hot word were analysed separately, the massive number of documents would lower the efficiency of the analysis. We therefore cluster the hot words so that those reflecting the same theme gather together, and then extract the hot issue by looking up the related documents.
     Cluster analysis in multivariate statistics has two major directions: clustering the samples, called Q-type clustering, and clustering the variables, called R-type clustering. Since the purpose here is to cluster the hot words in the hot word table, R-type clustering is the more suitable, with each hot word regarded as a binary 0-1 variable.
     The hot word table is W = {w_1, ..., w_n}, where n is the size of the table and w_i is the i-th hot word; the variable set corresponding to W is X = {x_1, ..., x_n}, where x_i = 1 if the hot word w_i appears in a document and x_i = 0 otherwise. We then cluster the variables in X.
     For the similarity coefficient between two variables we use the Jaccard coefficient, which does not count joint absences as matches. In the variable clustering, d_ij denotes the similarity coefficient between x_i and x_j, and G_1, G_2, ... denote the clusters. The similarity between two clusters is defined by their most similar pair of variables; writing D_pq for the similarity coefficient between G_p and G_q, D_pq = max {d_ij : x_i in G_p, x_j in G_q}.
     Using hierarchical clustering we obtain the clustering tree of the hot words of March 2008, shown in Figure 2.
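     A minimal sketch of this variable clustering step follows (Python with SciPy; single linkage is used because it corresponds to defining cluster similarity by the most similar pair of variables, but the exact implementation is an assumption):

        import matplotlib.pyplot as plt
        from scipy.spatial.distance import pdist
        from scipy.cluster.hierarchy import linkage, dendrogram

        def hot_word_tree(X, hot_words):
            # X: documents x hot-words binary matrix (x_i = 1 if word i occurs);
            # the columns (variables) are clustered, not the rows (documents)
            d = pdist(X.T.astype(bool), metric="jaccard")   # 1 - Jaccard similarity
            Z = linkage(d, method="single")                 # most-similar-pair rule
            dendrogram(Z, labels=hot_words)                 # the clustering tree
            plt.show()
            return Z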
     Figure 2 shows that "bean oil", "fertilizers", "prices" and "too fast" gather together, indicating that they describe the same theme; their corresponding hot sentences, shown in Figure 3, say that "oil and fertilizer prices rise too fast". Likewise "poor", "color TV" and "helping the poor" form one cluster, namely "the colour-TV poverty-relief programme has not been put into place".
     The analysis above shows that the variable clustering works well and avoids the unnecessary duplication of effort caused by treating each hot word on its own.
     Figure 3 The hot sentences containing "bean oil" and "fertilizers"
     The third chapter studies the design of Chinese text classifiers for the mayor's public telephone. The naive Bayes classifier is a simple, effective and practically successful classification algorithm. For text classification, where the characteristic words are numerous and their relations intricate, introducing new information can improve the results. After introducing time information, I propose two improved Bayesian classifiers.
     The first is a weighted naive Bayes classifier whose weighting parameters act on the class node. After the posterior probability of each class is computed by the naive Bayes classifier, a second weighting step adjusts the posteriors before the final classification; the adjustment coefficients are determined by how complaint samples of the different classes are distributed over time. The comparison of results is given in Table 3.
     Table 3 Comparison of results between naive Bayes and weighted naive Bayes
     As Table 3 shows, the weighted method improves on plain naive Bayes: its accuracy is 1.32% higher.
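     A minimal sketch of the second weighting step, assuming the adjustment multiplies the naive Bayes posteriors by month-dependent class coefficients and renormalizes (the text states only that the coefficients come from how complaints of each class are distributed over time):

        import numpy as np

        def weighted_nb_predict(nb_posteriors, class_month_weight, month):
            # nb_posteriors: posterior P(c | d) from a plain naive Bayes classifier
            # class_month_weight: (n_classes, 12) coefficients estimated from the
            # distribution of complaint samples of each class over the months
            adjusted = nb_posteriors * class_month_weight[:, month - 1]
            adjusted /= adjusted.sum()        # renormalize to probabilities
            return int(np.argmax(adjusted))   # final class decision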
     The other improvement exploits the seasonal characteristics of the data. Testing the independence between the categories and time shows that they are not independent in most cases. Using this important prior information we build the classification model as a naive Bayes classifier based on time-series data, whose time-varying prior probabilities are estimated by kernel regression. Compared with the ordinary naive Bayes classifier on the mayor's public telephone data, the classification accuracy is greatly improved.
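     One way to realize the time-varying prior with kernel regression is sketched below; the Gaussian kernel and the 15-day bandwidth are assumptions, since the text only states that kernel regression is used:

        import numpy as np

        def time_prior(train_days, train_labels, n_classes, day, bandwidth=15.0):
            # Nadaraya-Watson estimate of the class prior P(c | t) at day `day`
            days = np.asarray(train_days, dtype=float)
            labels = np.asarray(train_labels)
            w = np.exp(-0.5 * ((days - day) / bandwidth) ** 2)
            prior = np.array([w[labels == c].sum() for c in range(n_classes)])
            return prior / prior.sum()

     These smoothed priors would replace the global priors only for the classes found to depend on time.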
     I also carried out experimental tests and performance analysis of text classification with support vector machines on the mayor's public telephone data. The results show a higher accuracy, especially for the smaller classes, clearly better than naive Bayes. However, the large amount of data makes training and parameter optimization too long to fit into overnight learning, so I propose a parallel processing strategy for the support vector machines. Exploiting their good generalization ability on small sets, I divide the mayor's public telephone data into 12 monthly sub-sets and build a support vector machine on each. In actual operation a document is classified by the support vector machine of its month. In practice the accuracy dropped slightly on small data sets, but the overnight learning time was reduced dramatically, so the requirement of classifying by day and learning by night can be met. In addition, I designed a method for extracting characteristic words with a binomial test and characteristic phrases based on word frequency; using the acquired decision tree and Bayesian multi-net information, a rule-based scoring text classifier is proposed.
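     A sketch of the month-wise partitioning (scikit-learn and a linear kernel are assumptions; the text says only that a support vector machine is built on each of the 12 sub-sets):

        from sklearn.svm import LinearSVC

        def train_monthly_svms(X_by_month, y_by_month):
            # one SVM per calendar month; X_by_month[m] holds the feature vectors
            # of the complaints received in month m, y_by_month[m] their classes
            return {m: LinearSVC().fit(X_by_month[m], y_by_month[m])
                    for m in X_by_month}

        def classify(models, month, x):
            # at run time a new complaint (x: 1 x n feature row) is routed
            # to the SVM of its month
            return models[month].predict(x)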
     The fourth chapter addresses large-scale classification and learning problems and, using a high-performance computer, presents an effective task-driven parallel learning algorithm. For large-scale data, the time and space complexity caused by the size of the sample set limits the applicability of learning algorithms, so efficient parallel versions of machine learning algorithms are increasingly pursued. The growth of data and the development of high-performance computing also make such parallel algorithms practical.
     Existing parallel algorithms mainly partition the data; their advantage is that many nodes work in parallel in less time, which suits large-scale databases. The most representative examples are the SLIQ and SPRINT algorithms. In real environments, however, people often share the same cluster and many processes run on a machine at the same time, so running speed and CPU utilization differ between nodes; fast processes are forced to wait for slow ones, which seriously hurts efficiency. Furthermore, the total sample must be divided into small parts, and differences in the distributions of the parts can degrade the classifiers; the strategy for combining the sub-classifiers can also seriously affect the classification accuracy. For these reasons I propose a parallel design based on task driving.
     The idea of task-driven parallel design is to cut a large task into small ones and assign them to the nodes; whichever process finishes first receives the next task. A master process divides the tasks and performs the message passing and data exchange in a task-driven manner, so multiple machines can be coordinated and the load effectively balanced in a complex environment. Based on this idea I present a task-driven parallel decision tree learning algorithm and a Bayesian network learning algorithm.
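     The load-balancing idea can be sketched with a simple task pool; Python multiprocessing stands in here for the message-passing implementation on the cluster, and the task granularity is only illustrative:

        from multiprocessing import Pool

        def score_task(task):
            # one small unit of work, e.g. evaluating one candidate split of a
            # decision-tree node or one candidate parent set in K2 (placeholder)
            node_id, values = task
            return node_id, sum(values)

        def run_task_driven(tasks, n_workers=8):
            # workers pull tasks one at a time as soon as they are free,
            # so a fast node never has to wait for a slow one
            # (call this from a main guard when using the spawn start method)
            with Pool(n_workers) as pool:
                return dict(pool.imap_unordered(score_task, tasks, chunksize=1))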
     As the mayor's public telephone data grow, the traditional decision tree algorithm, which scans and sorts the data many times while building the tree, becomes inefficient. To handle large data sets I developed a task-driven decision tree. Keeping the accuracy of the original serial algorithm, tests on the mayor's public telephone data show a significant speed-up: with 210 processes started on the high-performance computer, each on one CPU core, a decision tree with 50,000 nodes was completed in only 50 minutes.
     For Bayesian network structure learning, this thesis gives a parallel K2 algorithm based on ordering the nodes by their chi-square values: the chi-square value of each characteristic word within a class is computed first, the words are ordered by these values, and the ordering is used as the prior node order required by K2. The parallel K2 algorithm keeps the serial flow but lets a master process assign tasks; the computation processes handle the tasks sent by the master, including scoring each node and scoring nodes with one or two parents, and return the results. With 277 processes, each using one CPU core of the high-performance computer, learning the Bayesian multi-net with 99 sub-networks took 15 days.
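     The ordering step that feeds K2 might look like the following sketch; the descending order and the use of SciPy are assumptions, only the use of chi-square values as the prior node order being stated above:

        import numpy as np
        from scipy.stats import chi2_contingency

        def k2_order(X, y):
            # X: binary document-term matrix of the characteristic words of one
            # class, y: class labels; nodes are ordered by chi-square with y
            scores = []
            for j in range(X.shape[1]):
                table = np.array([[np.sum((X[:, j] == v) & (y == c))
                                   for c in np.unique(y)] for v in (0, 1)])
                scores.append(chi2_contingency(table + 1)[0])  # +1 avoids empty cells
            return list(np.argsort(scores)[::-1])              # descending chi-square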
     Furthermore, decision tree and support vector machine algorithms were tested on data from Harbin and Changchun; the results show that the decision tree, which suffers more from noise, does not perform as well as the support vector machine. On this basis I suggest using the acquired decision tree information, the Bayesian multi-nets, and the characteristic and distinguishing words extracted with the binomial test to establish rules for classifying incoming documents. These rules have been applied in Changchun and Harbin and lay a good foundation for improving the precision of text classification. In the research on neural networks, to simulate weight regulation in biological neurons, the thesis proposes a new idea in which the weights of biological neurons are measured from the consequences observed during transmission, and carries out computer simulation experiments, which may serve as a reference for text classification once the simulation of the human brain becomes possible.
