Research on Frequent Item Mining and Clustering Analysis in Data Streams
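Abstract
The rapid development of science and technology and the wide use of information technology have given rise to a new class of applications, including computer network traffic control, network security monitoring, financial applications, environmental monitoring, and log analysis. In these applications data is produced as a stream: it arrives in real time, continuously, and in order. Such a sequence of continuous, ordered data is called a data stream. Unlike a traditional database, a data stream has the following characteristics: it is unbounded; it cannot be replayed; data arrives at a very high rate; and the arrival order is not under the application's control. Analyzing and mining data streams has become an active research topic.
     Data stream mining is the process of extracting from streaming data the implicit, previously unknown, but potentially useful information and knowledge it contains. It works mainly through knowledge-discovery activities such as frequent item (itemset) mining, classification, clustering, and anomaly analysis, in order to find association rules, classification rules, cluster patterns, anomaly patterns, and other kinds of knowledge in the stream. For example, mining frequent items in a data stream can be applied to traffic-based network billing, network congestion control, and network security monitoring. Clustering a data stream can be applied to network intrusion detection, online newsgroup filtering, topic detection and tracking, real-time monitoring of traffic congestion and the geographic environment, segmenting the customers of large companies, and detecting financial fraud.
     Because storage is finite and the stream is unbounded, storing every element of the stream in order to give exact mining results is impractical. In the data stream processing model, therefore, an algorithm stores only a synopsis of the stream, updates that synopsis as new data arrives, and uses it to give approximate answers to user queries.
     Because frequent item mining and clustering are important in network data stream analysis, this work studies methods for both in the data stream setting. In network data streams and many other application areas, data comes in many types, including non-numeric as well as numeric data, and often has dozens or even hundreds of attributes, so clustering of mixed-attribute data streams and of high-dimensional data streams both have theoretical value and practical significance. Designing algorithms and systems based on the mechanisms of biological systems has been an active research area in recent years and has produced notable results. The artificial immune system (AIS) combines some of the advantages of classifier systems, neural networks, and machine reasoning, and has the potential to offer novel ways of solving problems. AIS has also seen preliminary study in data stream clustering; this thesis addresses the shortcomings of existing AIS-based data stream clustering and studies a new AIS-based data stream clustering algorithm. The research content and contributions fall into the following four areas.
     (1) Frequent item mining algorithms for data streams
     Building on the Bloom Filter, this thesis proposes the extensible Bloom Filter, a data structure that is space efficient, can represent very large data sets, and supports efficient lookup. Based on it, the thesis proposes FI-ESBFL, a frequent item mining algorithm for data streams under the landmark window model, and proves that it needs fewer counters than comparable algorithms to meet the same precision and confidence requirements. FI-ESBFL dynamically adjusts the memory it uses according to the data distribution and the number of distinct items in the stream, which greatly reduces wasted memory. Experiments show that FI-ESBFL has higher space efficiency and good time efficiency. Building on FI-ESBFL, the thesis also proposes FI-ESBFD, a frequent item mining algorithm under the damped window model, and FIS-EBFS, one under the sliding window model. In most cases FI-ESBFD is more time and space efficient than comparable algorithms. FIS-EBFS has efficient time performance.
     (2) Clustering algorithms for mixed-attribute data streams
     This thesis proposes two different entropy-based similarity measures for data objects with mixed attributes and, on that basis, two mixed-attribute data stream clustering algorithms, CNCE-Stream and CNCDE-Stream. CNCDE-Stream defines the similarity between mixed-attribute objects using both Euclidean distance and entropy. CNCE-Stream uses a single quantity, entropy, to measure that similarity; for it the thesis proposes the S-kernel method, a way to estimate the probability density function in the data stream setting, and a method for computing the expected entropy of a cluster with mixed attributes. Experimental results show that both CNCDE-Stream and CNCE-Stream achieve high clustering quality, and that CNCDE-Stream is highly time efficient.
     (3) Subspace clustering algorithms for high-dimensional data streams
     Most existing data stream clustering algorithms suit only low-dimensional data, and existing subspace clustering algorithms for data streams have shortcomings of their own. To address both, the thesis proposes SOStream, a grid- and density-based subspace clustering algorithm for high-dimensional data streams. SOStream maintains online a superset of all dense grid cells, and uses delayed insertion of potentially dense cells together with periodic pruning of non-dense (sparse) cells to improve time and space efficiency. On user request, it generates the final cluster structure from the dense cells maintained online. Experiments demonstrate the effectiveness of the algorithm.
     (4) Data stream clustering based on artificial immune principles
     Because an artificial immune system can adapt dynamically to changes in its external environment, the thesis proposes AIN-Stream, a new data stream clustering algorithm based on an artificial immune network. AIN-Stream defines the stimulation level of a B-cell from the stimulating effect of external antigens (the stream data) on that B-cell. By creating a feature vector for each B-cell and applying statistical analysis, it automatically determines the B-cell recognition region, the key parameter of artificial immune clustering algorithms, which keeps the clustering result stable. AIN-Stream also uses the statistics in the B-cell feature vectors to remove redundant B-cells more effectively, further improving efficiency. When generating the clustering result, AIN-Stream does not need the number of clusters to be specified, so it performs truly unsupervised clustering. Experiments show that AIN-Stream adapts dynamically to changes in the data stream, achieves high clustering quality, and has higher space efficiency and a clear improvement in time efficiency.
     The algorithms proposed in this thesis supplement and improve existing techniques for frequent item mining and clustering on data streams; theoretical analysis and experimental results show that they solve the corresponding problems reasonably effectively.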
A growing number of applications have appeared with the rapid development of science and technology and the wide use of information technology. They include network-traffic monitoring, computer-network security, financial applications, environmental monitoring, log analysis, and many more. In these applications data takes the form of continuous data streams: it arrives continuously, in order, and in real time. Such a sequence of elements is called a data stream. Unlike a traditional database, a data stream has the following distinguishing characteristics: an unbounded volume of data; each element can be processed only once unless it is stored; a rapid arrival rate; and an arrival order that cannot be controlled. The analysis and mining of data streams have become active research subjects.
     Data stream mining extracts useful information and knowledge that is embedded in streaming data and previously unknown to users. Its main function is to find association rules, classification rules, cluster structures, anomalies, and so on, by mining frequent items (itemsets), classifying, clustering, and detecting anomalies. For example, mining frequent items from a data stream supports usage-based accounting, network-traffic monitoring, and computer-network security monitoring. Clustering data streams supports network intrusion detection, topic detection and tracking in newswire, traffic and environmental monitoring, dividing the customers of large companies into groups, and detecting financial fraud.
     It is impossible to store all the data in a data stream, because memory is limited and the stream is unbounded. Algorithms that process data streams therefore store only a synopsis of the stream, which is updated as data arrives. In the data stream computation model, an algorithm can give only approximate results on request, computed from the stored synopsis.
     Our work focuses on frequent item mining and clustering in data streams, since both are important in network flow analysis. Data from network flows and many other application fields contains categorical and other types of attributes in addition to numeric ones. The data in many applications is also high dimensional, with several dozen or even several hundred dimensions. Research on clustering algorithms for high-dimensional data streams and for data streams with mixed attributes is therefore important both in theory and in practice. Developing algorithms and systems based on biological systems has also been an active research subject and has made notable progress. One such approach, the artificial immune system (AIS), combines advantages of classifier systems, neural networks, and machine reasoning, and can provide novel ways of solving problems. AIS has been applied to data stream clustering in a few studies, but problems remain, so we also propose a new AIS-based clustering algorithm aimed at the problems in current AIS-based stream clustering algorithms. Our primary contributions are as follows:
     (1) Research on Algorithms for Mining Frequent Items in Data Streams
     We introduce a memory-efficient data structure based on the Bloom Filter, which we call ESBF (Extensible and Scalable Bloom Filter). ESBF can store information about large data sets and answer queries efficiently. Based on ESBF, we propose FI-ESBFL, an algorithm for mining frequent items in data streams under the landmark window model, and prove that it achieves the same precision and probability as other Bloom Filter based algorithms for stream frequent item mining while using fewer counters. The memory used by FI-ESBFL is adjusted dynamically according to the data distribution and the number of distinct items in the stream, which saves memory. Experimental results demonstrate the efficiency and effectiveness of FI-ESBFL. We also propose two further frequent item mining algorithms, FI-ESBFD for the damped window model and FIS-EBFS for the sliding window model, both based on FI-ESBFL. In most cases FI-ESBFD uses less time and memory than the compared algorithms based on the Bloom Filter and the damped window model. FIS-EBFS uses much less time than the compared algorithm based on the sliding window model.
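     The abstract does not give the internals of ESBF or FI-ESBFL, so the following is only a minimal sketch of the general idea they build on: a plain fixed-size counting Bloom filter used to report frequent items over a landmark window (everything seen since the start). The class and function names, the SHA-256 based hashing, and the parameters `num_counters`, `num_hashes`, and `support` are assumptions made for illustration; the extensible, self-resizing behaviour described above is not modelled.

```python
import hashlib


class CountingBloomFilter:
    """Fixed-size array of counters addressed by several hash functions."""

    def __init__(self, num_counters: int, num_hashes: int) -> None:
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, item: str):
        # Derive one position per hash function by salting SHA-256 with
        # the function index (an illustrative choice, not from the thesis).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counters)

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def estimate(self, item: str) -> int:
        # Collisions can only inflate counters, so the minimum over the
        # item's positions never underestimates its true count.
        return min(self.counters[pos] for pos in self._positions(item))


def frequent_items(stream, support: float, num_counters=1024, num_hashes=4):
    """Return items whose estimated count reaches support * N."""
    cbf = CountingBloomFilter(num_counters, num_hashes)
    seen = set()  # kept only to enumerate candidates in this toy example;
                  # a real stream algorithm would not store every item
    n = 0
    for item in stream:
        cbf.add(item)
        seen.add(item)
        n += 1
    return {x for x in seen if cbf.estimate(x) >= support * n}
```

     Because estimates can only be too high, a filter like this may report false positives but never misses a truly frequent item; using fewer counters raises the false-positive rate, which is the trade-off the counter-count result above is about.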
     (2) Research on Clustering Algorithms for Data Streams with Mixed Attributes
     We introduce two entropy-based similarity measures for data objects with mixed attributes and, based on them, propose two algorithms for clustering data streams with mixed attributes: CNCE-Stream and CNCDE-Stream. CNCDE-Stream defines the similarity between data objects using both Euclidean distance and entropy, while CNCE-Stream uses entropy alone. For CNCE-Stream we first introduce a new way of estimating the probability density function (pdf) in the stream scenario, then a way of calculating the expected entropy of a cluster with mixed attributes; incremental entropy and merging entropy are introduced to make CNCE-Stream work. Experimental results demonstrate that both CNCDE-Stream and CNCE-Stream achieve high precision, and that CNCDE-Stream is very time efficient.
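     To make the notion of incremental entropy concrete, here is a small sketch under strong simplifying assumptions: it handles categorical attributes only, treats attributes as independent, and assigns each record to the cluster whose size-weighted entropy grows the least. The names `CategoricalCluster`, `entropy_if_added`, and `assign` are hypothetical; the abstract does not specify how CNCE-Stream or CNCDE-Stream define their measures or handle numeric attributes, so this should not be read as either algorithm.

```python
import math
from collections import Counter


class CategoricalCluster:
    """Per-attribute value counts: a small synopsis of one cluster."""

    def __init__(self, num_attrs: int) -> None:
        self.size = 0
        self.counts = [Counter() for _ in range(num_attrs)]

    def add(self, record) -> None:
        self.size += 1
        for counter, value in zip(self.counts, record):
            counter[value] += 1

    def entropy(self) -> float:
        """Sum of the Shannon entropies (in bits) of all attributes."""
        if self.size == 0:
            return 0.0
        total = 0.0
        for counter in self.counts:
            for c in counter.values():
                p = c / self.size
                total -= p * math.log2(p)
        return total

    def entropy_if_added(self, record) -> float:
        # Evaluate the entropy on a copy so the cluster is left unchanged.
        trial = CategoricalCluster(len(self.counts))
        trial.size = self.size
        trial.counts = [Counter(c) for c in self.counts]
        trial.add(record)
        return trial.entropy()


def assign(clusters, record):
    """Add the record to the cluster whose weighted entropy grows least."""
    def cost(c):
        return (c.size + 1) * c.entropy_if_added(record) - c.size * c.entropy()

    best = min(clusters, key=cost)
    best.add(record)
    return best
```

     Only the counts are stored, never the records themselves, which is what lets the entropy be updated as the stream arrives.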
     (3) Research on a Subspace Clustering Algorithm for High-Dimensional Data Streams
     We introduce SOStream, a new subspace clustering algorithm for high-dimensional data streams. It addresses two issues: most data stream clustering algorithms work well only on low-dimensional data, and the existing subspace stream clustering algorithms have disadvantages of their own. SOStream is based on grids and density, and maintains a superset of all dense units online. Delayed insertion of potentially dense units and pruning of sparse units make the algorithm more efficient. When required, SOStream generates the final clusters from the maintained dense units. Experiments show the effectiveness of the proposed algorithm.
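     The grid-and-density idea can be illustrated with a short sketch, again under assumptions not taken from the thesis: the grid is built over the full space rather than over subspaces, pruning simply drops cells whose count is at most one, and clusters are connected groups of dense cells that share a face. `GridDensitySketch` and its parameters `cell_width`, `dense_threshold`, and `prune_every` are hypothetical names, and the delayed-insertion strategy is not modelled.

```python
from collections import defaultdict


class GridDensitySketch:
    """Counts points per grid cell and periodically prunes sparse cells."""

    def __init__(self, cell_width: float, dense_threshold: int, prune_every: int):
        self.cell_width = cell_width
        self.dense_threshold = dense_threshold
        self.prune_every = prune_every
        self.counts = defaultdict(int)
        self.seen = 0

    def _cell(self, point):
        # Map each coordinate to the index of the cell that contains it.
        return tuple(int(x // self.cell_width) for x in point)

    def insert(self, point) -> None:
        self.counts[self._cell(point)] += 1
        self.seen += 1
        if self.seen % self.prune_every == 0:
            self._prune()

    def _prune(self) -> None:
        # Crude heuristic: forget cells that have at most one point so far.
        for cell in [c for c, n in self.counts.items() if n <= 1]:
            del self.counts[cell]

    def dense_cells(self):
        return {c for c, n in self.counts.items() if n >= self.dense_threshold}

    def clusters(self):
        """Group dense cells that share a face into connected components."""
        unvisited = self.dense_cells()
        groups = []
        while unvisited:
            start = unvisited.pop()
            group, stack = {start}, [start]
            while stack:
                cell = stack.pop()
                for dim in range(len(cell)):
                    for step in (-1, 1):
                        nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                        if nb in unvisited:
                            unvisited.remove(nb)
                            group.add(nb)
                            stack.append(nb)
            groups.append(group)
        return groups
```

     Memory is bounded by the number of retained cells rather than by the length of the stream, and the cluster structure is computed only when `clusters()` is called, mirroring the on-request generation described above.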
     (4) Research on a Clustering Algorithm for Data Streams Based on an Artificial Immune Network (AIN)
     Learning based on immune principles adapts easily to a dynamic environment, so we propose AIN-Stream, a new AIS-based data stream clustering algorithm. During clustering, the stimulation level of a B-cell is determined by the antigens (the data in the stream). By creating and maintaining B-cell feature vectors and applying statistical analysis, AIN-Stream automatically determines the most important parameter of AIS-based clustering algorithms, the recognition zone of B-cells. The B-cell feature vectors also make it possible to eliminate redundant B-cells more effectively, which improves the efficiency of the algorithm. AIN-Stream determines the number of final clusters automatically, which makes the clustering truly unsupervised. Experimental results demonstrate that AIN-Stream can track the evolution of the data stream and achieves high clustering quality. They also show that, for similar clustering results, AIN-Stream uses less memory and processing time than other clustering algorithms based on immune principles.
     In conclusion, our methods complement and improve existing methods for frequent item mining and clustering in data streams. Theoretical analysis and experimental results show that they solve the corresponding problems effectively and efficiently.
引文
[1] J.Han and M.Kamber. Data Mining:Concepts and Techniques[M]. Morgan Kaufmann, 2001.
    [2] B.Babcock, S.Babu, M.Datar, R.Motwani, J.Widom. Models and Issues in Data Stream Systems[C]. In Proc. of ACM PODS, 2002.1-16.
    [3] Y.Zhu and D.Shasha. Statstream:Statistical Monitoring of Thousands of Data Streams in Real Time[C]. In Proc. of V LDB, 2002.358-369.
    [4] A.K.Jain and R.C.Dubes. Algorithms for Clustering Data[M]. Prentice Hall, Englewood Cliffs, NJ., 1988.
    [5] H.Ralambondrainy. A Conceptual Version of the K-means Algorithm[J]. Pattern Recognition Letter, 1995.1147-1157.
    [6] M.Garofalakis, J.Gehrke and R.Rastogi. Querying and Mining Data Streams: You Only Get One Look[C]. In the tutorial notes of VLDB, 2002.
    [7] M.Fang, N.Shivakumar, H.Garcia-Molina, R.Motwani, and J.D.Ullman. Computing Iceberg Queries Efficiently[C]. In Proc. of VLDB,1998.299-310.
    [8] P.Domingos and G. Hulten. Mining High-Speed Data Streams[C]. In Proc. of KDD, 2000. 71-80.
    [9] G.Hulten, L.Spencer, and P.Domingos. Mining Time-Changing Data Streams [C]. In Proc. of KDD, 2001.97-106.
    [10] S.Guha, N.Mishra, R.Motwani and L.O'Callaghan. Clustering Data Streams [C]. In Proc. of the Annual Symposium on Foundations of Computer Science, 2000.359-366.
    
    [11] M.Charikar, K.Chen, and M.Farach-Colton. Finding Frequent Items in Data Streams[C]. In Proc. of the 29th International Colloquium on Automata,Languages and Programming, Springer-Verlag, 2002.693-703.
    
    [12] J.Gehrke. Special Issue on Data Stream Processing[J]. IEEE Comput. Soc. Bull. Tech,. 2003..
    [13] A.C.Gilbert, Y.Kotidis, S.Muthukrishnan and MJ.Strauss. QuickSAND: Quick Summary and Analysis of Network Data[R]. Technical Report, Dec. 2001.
    [14] M.Sullivan, A.Heybey. Tribeca: A System for Managing Large Databases of Network Traffic[C]. In Proc. of USENIX Annual Technical Conf., 1998.
    [15] G.cormode and S.Muthukrishnan. What's hot and what's not: Tracking most frequent items dynamically[C]. In Proc. of the 22nd ACM PODS, 2003. 296-306.
    [16] J.Chen, D.DeWitt, F.Tian, Y.Wang. NiagaraCQ:A Scalable Continuous Query System for Internet Databases[C]. In Proc. of ACM Int. Conf. on Management of Data, 2000. 379-390.
    [17]C.cortes,K.Fisher,D.Pregibon,A.Rogers and F.Smith.Hancock:A Language for Extracting Signatures from Data Streams[C].In Proc.of ACM Int.Conf.on Knowledge Discovery and Data Mining,2000.9-17.
    [18]P.Bonnet,J.Gehrke and P.Seshadri.Towards Sensor Database Systems[C].In Proc.of the 2nd IEEE MDM International Conf.on Mobile Data Management,2001.3-14.
    [19]C.Estan and G.Varghese.New Directions in Traffic Measurement and Accounting[C].In proc.of SIGCOMM,2002.323-336.
    [20]C.Estan and G.Varghese.New Directions in Traffic Measurement and Accounting:Focusing on the Elephants,lgnoring the Mice[J].ACM Transactions on Computer Systems,2003.270-313.
    [21]K.Lan and J.Heidemann.Rapid Model Parameterization from Traffic Measurements[J].ACM Transactions on Modeling and Computer Simulation,2002.201-229.
    [22]T.Zhang,R.Ramakrishnan and M.Livny.BIRCH:An Efficient Data Clustering Method for Very Large[C].In Proc.of SIGMOD,1996.103-116.
    [23]S.Guha,R.Rastogi and K.Shim.CURE:An Efficient Clustering Algorithm for Large Databases[C].In Proc.of SIGMOD,1998.73-84.
    [24]M.Ester,H.-P.Kriegel,J.Sander,and X.Xu.A Density-based Algorithm for Discovering Custers in Large Spatial Databases with Noise[C].In Proc.of KDD,1996.226-231.
    [25]T.N.Raymond and J.Han.CLARANS:A Method for Clustering Objects for Spatial Data Mining[J].IEEE Transactions on Knowledge and Data Engineering,2002.1003-1016.
    [26]R.Agrawal,J.Gehrke,D.Gunopulos,P.Raghavan.Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications[C].In Proc.of SIGMOD,1998.94-105.
    [27]G.Sheikholeslami,S.Chatterjee,A.Zhang.WaveCluster:a wavelet-based clustering approach for spatial data in very large databases[J].The VLDB Journal,2000.289-304.
    [28]C.-H.Cheng,A.W.Fu,and Y.Zhang.Entropy-based Subspace Clustering for Mining Numerical Data[C].In Proc.of ACM SIGKDD,1999.84-93.
    [29]S.Goil,H.Nagesh,and A.Choudhary.Mafia:Effcient and Scalable Subspace Clustering for Very Large Data Sets[R].Technical Report CPDC-TR-9906-010,Northwestern University,June 1999.
    [30]C.C.Aggatwal,J.L.Wolf,P.S.Yu,C.Procopiuc and J.S.Park.Fast Algorithms for Projected Clustering[C].In Proc.of the ACM SIGMOD,1999.61-72.
    [31]C.C.Aggarwal and P.S.Yu.Finding Generalized Projected Clusters in High Dimensional Spaces[C].In Proc.of ACM SIGMOD,2000.70-81.
    [32]G.Karypis,E-H.Han and V.Kumar.CHAMELEON:A Hierarchical Clustering Algorithm Using Dynamic Modeling[R].University of Minnesota,Technical Report #99-007,1999.
    [33]S.Guha,R.Rastogi and K.Shim.ROCK:A Robust Clustering Algorithm for Categorical Attributes[C].In Proc.of ICDE,1999.512-521.
    [34]W.Wang,J.Yang and R.Muntz.STING:A Statistical Information Grid Approach to Spatial Data MiningC[C].In Proc.of VLDB,1997.186-195.
    [35]P.S.Bradley,U.Fayyad and C.Reina.Scaling Clustering Algorithms to Large Databases[C].In Proc.of KDD,1998.9-15.
    [36]M.Ankerst,M.M.Breunig,H-P.Kriegel,J.Sander.OPTICS:Ordering Points To Identify the Clustering Structure[C].In Proc.of ACM SIGMOD,1999.49-60.
    [37]C.Cranor,Y.Gao,T Johnson,V.Shkapenyuk and O.Spatscheck.GigaScope:High Performance Network Monitoring with an SQL InterfaceC[C].In Proc.of ACM SIGMOD,2002.623-623.
    [38]V.Katos.Network intrusion detection:Evaluating Cluster,Ciscriminant,and Logit Analysis[J].Information Sciences,Vol.177,Issue 15,2007.3060-3073.
    [39]A.Jain,Z.Zhang,E.Y.Chang.Adaptive Nonlinear Clustering in Data Streams[C].In Proc.of CIKM,2006.122-131.
    [40]M.M.Gaber and P.S.Yu.A Framework for Resource-aware Knowledge Discovery in Data Streams:A Holistic Approach with Its Application to Clustering[C].In Porc.of SAC,2006.649-656.
    [41]B-R.Dai,J-W.Huang,M-Y.Yeh and M-S.Chen.Adaptive Clustering for Multiple Evolving Streams[J].IEEE Transactions on Knowledge and Data Engineering.Vol.18,NO.9,2006.1166-1180.
    [42]MY.Yeh,B-R.Dai and M-S.Chen.Clustering over Multiple Evolving Streams by Events and Correlations[J].IEEE Transactions on Knowledge and Data Engineering,Vol.19,NO.10,2007.1349-1362.
    [43]L.Golab and M.T.Ozsu.Issues in Data Stream Management[C].In Proc.of ACM SIGMOD,2003.5-14.
    [44]P.B.Gibbons and Y.Matias.Synopsis Data Structures[C].In Proc.of SODA,1999.909-910.
    [45]R.Motwani,J.Widom and A.Atasu and B.Babcock et al.Query Processing,Approximation,and Resource Management in A Data Stream Management System[C].In Proc.of CIDR,2003.245-256.
    [46]A.Arasu,B.Babcock,S.Babu,et al.STREAM:The Stanford Data Stream Management System[J].IEEE Data Engineering Bulletin,Vol.26 No.1,March,2003.19-26.
    [47]D.Carney,U.Cetintemel,M.Cherniack,C.Convey,et al.Monitoring Streams:A New Class of Data Management Applications[C].In Proc.of VLDB,2002.215-226.
    [48]D.J.Abadi,D.Carney,U.Cetintemel,et al.Aurora:A Data Stream Management System[J].The VLDB Journal,Vol.12,Number 2.2003.120-139.
    [49]The Aurora Project.http://www.cs.brown.edu/research/aurora
    [50]R.Avnur and J.Hellerstein.Eddies:Continuously Adaptive Query Processing[C].In Proc.of ACM SIGMOD,2000.261-272.
    [51]J.M.Hellerstein,M.J.Franklin,S.Chandrasekaran,A.Deshpande,et al.Adaptive Query Processing:Technology in Evolution[J].IEEE Data Engineering Bulletin,June 2000.23(2).7-18.
    [52]S.Madden,M.Shah,J.M.Hellerstein,V.Raman.Continuously Adaptive Continuous Queries over Streams[C].In Proc.of ACM SIGMOD,2002.49-60.
    [53]S.Chandrasekaran,O.Cooper,A.Deshpande,M.J.Franldin.et.al.TelegraphCQ:Continuous Dataflow Processing for an Uncertain World[C].In Proc.of CIDR,2003.269-280.
    [54]J.Chen,D.De Witt,F.Tian,Y.Wang.NiagaraCQ:A Scalable Continuous Query System for Internet Databases[C].In Proc.of ACM SIGMOD,2000.379-390
    [55]J.Chen,D.J.De Witt and J.F.Naughton.Design and Evaluation of Alternative Selection Placement Strategies in Optimizing Continuous Queries[C].http://www.cs.wisc.edu/niagara/papers/Icde02.pdf.
    [56]J.Chen,D.J DeWitt.Dynamic Re-grouping of Continuous Queries[R].Technical Report,Univ.of Wisconsin-Madison,2002.
    [57]S.Krishnamurthy,S.Chandrasekaran,O.Cooper,A.Deshpande,et al.TelegraphCQ:An Architectural Status Report[J].Bulletin of the IEEE Computer Society,2003,26(1).11-18.
    [58]M.Cherniack,H.Balakrishnan,M.Balazinska,D.Carney,et al.Scalable Distributed Stream Processing[C].In Proc.of CIDR,2003.257-268.
    [59]M.Balazinska,H.Balakrishnan and M.Stonebraker.Contract-Based Load Management in Federated Distributed Systems[C].In Proc.of 1st Symposium on Networked Systems Design and Implementation(NSDI),2004.197-210.
    [60]S.Zdonik,M.Stonebraker,M.Cherniack,U.Cetintemel,et al.The Aurora and Medusa Projects[J].IEEE Data Engineering Bulletin,26(1),2003.
    [61]D.J.Abadi,Y.Ahmad,M.Balazinska,U.Cetintemel,et al.The Design of the Borealis Stream Processing Engine[C].In Proc.of CIDR,2005.
    [62]Y.Ahmad,B.Berg,U.Cetintemel,M.Humphrey et al.Distributed Operation in the Borealis Stream Processing Engine[C].In Proc.of the ACM SIGMOD,2005.882-884.
    [63]Y.Zhu,D.Shasha.StatStream:Statistical Monitoring of Thousands of Data Streams in Real Time[C].In Proc.of VLDB,2002.358-369.
    [64]M.Sullivan.Tribeca:A Stream Database Manager For Network Traffic Analysis[C].In Proc.of VLDB,1996.594-594.
    [65]D.Barbara.Requirements for clustering data streams[J].In SIGKDD,Vol.3,Issue 2,2002.23-27.
    [66]L.Golab,M.T.Ozsu.Data Stream Management Issues-A Survey[R].Technical Report,CS 2003-08,University of Waterloo.
    [67]M.Henzinger,P.Raghavan and S.Rajagopalan.Computing on Data Streams[R].Technichal note 1998-011,1998.
    [68]M.M.Gaber,A.Zaslavsky and S.Krishnaswamy.Mining Data Streams:A Review[J].SIGMOD Record,Vol.34,No.2,2005.18-26.
    [69]J.S.Vitter.Random Sampling with A Reservoir[J].ACM Transactions on Mathematical Software,Vol.11,No.1,March 1985.37-57.
    [70]P.B.Gibbons and Y.Matias.New Sampling-based Summary Statistics for Improving Approximate Query Answers[C].In Proc.of AMC SIGMOD,1998.331-342.
    [71]J.Greenwald and F.Khanna.Space Efficient On-Line Computation of Quantile Summaries[C].In Proc.ACM SIGMOD,2001.58-66.
    [72]V.Poosala,Y.loannidis,P.Haas and E.Shekita.Improved Histograms for Selectivity Estimation of Range Predicates[C].In Proc.of ACM SIGMOD,1996.294-305.
    [73]Y.Ioannidis and V.Poosala.Balancing Histogram Optimality and Practicality for Query Result Size Estimation[C].In Proc.of ACM SIGMOD,1995.233-244.
    [74]S.Muthukrishnan.Data Streams:Algorithms and Applications[C].In Proc.of the SODA,2003.413-413.
    [75]B.Babcock,M.Datar and R.Motwani.Load Shedding Techniques for Data Stream Systems(short paper)[C].In Proe.of the Workshop on Management and Processing of Data Streams,2003.
    [76]N.Tatbul,U.Cetintemel,S.Zdonik,M.Cherniack and M.Stonebraker.Load Shedding on Data Streams[C].In Proc.of the Workshop on Management and Processing of Data Streams,2003.
    [77]M.Balazinska,H.Balakrishnan and M.Stonebraker.Contract-Based Load Management in Federated Distributed Systems[C].In Proc.of Symposium on Networked Systems Design and Implementation(NSDI),2004.197-210.
    [78]A.Gibert,Y.Kotidis,S.Muthukdshnan and M.Strauss.Surfing Wavelets on Streams:One Pass Summaries for Approximate Aggregate Queries[J].VLDB Journal,79(8),2001.79-88.
    [79]B.Jawerth and W.Sweldens.An Overview of Wavelet Based Multiresolution Analysis[J].SIAM Rev.,36(3),1994.377-412.
    [80]J.S.Vitter,M.Wang and B.Iyer.Data Cube Approximation and Histograms Via Wavelets[C].In Proc.of CIKM,1998.
    [81]J.S.Vitter and M.Wang.Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets[C].In Proc.of ACM SIGMOD,1999.
    [82]Y.Matias,J.S.Vitter and M.Wang.Wavelet-based Histograms Form Selectivity Estimation[C].In Proc.of ACM SIGMOD,1998.
    [83]E.Demaine,A.Lopez-Ortiz,J.I.Munro.Frequency Estimation of Internet Packet Streams with Limited Space[C].In Proc.of European Symp.on Algorithms,2002.348-360.
    [84]G.S.Manku and R.Motwani.Approximate Frequency Counts over Data Streams[C].In Proc.of VLDB,2002.346-357.
    [85]P.Gibbons,S.Tirthapura.Estimating Simple Functions on the Union of Data Streams[C].In Proc.of ACM Symp.on Parallel Algorithms an Architectures,2001.281-291.
    [86]P.Flajolet,G.N.Martin.Probabilistic Counting[C].In Proc.of Syrup.on Foundations of Computer Science,1983.76-82.
    [87]G.S.Manku,S.Rajagopalan,B.G.Lindsay.Random sampling techniques for space effcient online computation of order statistics of large datasets[C].In Proc.of ACM SIGMOD,1999.251-262.
    [88]N.Alon,Y.Matias,M.Szegedy.The Space Complexity of Approximating the Frequency Moments[C].In Proc.of ACM Symp.on Theory of Computing,1996.20-29.
    [89]G.Cormode,M.Datar,P.Indyk,S.Muthukrishnan.Comparing Data Streams Using Hamming Norms(How to Zero In)[C].In Proc.of VLDB,2002.335-345.
    [90]A.obra,M.arofalakis,J.ehrke,R.astogi.Processing Complex Aggregate Queries over Data Streams[C].In Proc.of ACM SIGMOD,2002.61-72.
    [91]B.H.Bloom.Space/time Tradeoffs in Hash Coding with Allowable Errors[J].Communications of the ACM,1970,13(7).422-426
    [92]L.Fan,P.Cao,J.Almeida and A.Z.Broder.Summary Cache:A Scalable Wide-Area Web Cache Sharing Protocol[J].IEEE/ACM Transactions on Networking,2000,8(3).281-293.
    [93]J.Saborit,P.Trancoso,V.Mulero,J.Larriba-Pey.Dynamic Count Filters[J].SIGMOD Record,2006,35(1).26-32.
    [94]S.Cohen and Y.Matias.Spectral Bloom Filters[C].In Proc.of ACM SIGMOD,2003.241-252.
    [95]J.K.Mulun.A Second Look at Bloom Filters[J].Communications of the ACM,Vol.26,NO.8,1983.570-571.
    [96]B.Chazelle,J.Kilian,R.Rubinfeld and A.Tal.The Bloomier Filter:An Efficient Data Structure for Static Support Lookup Tables[C].In Proc.of 15th Annual ACM-SIAM Symposium on Discrete Algorithms,2004.30-39.
    [97]E.Safi,A.Moshovos and A.Veneris.L-CBF:A Low-Power,Fast Counting Bloom Filter Architecture[C].In Proc.of int'l symposium on Low Power Electronics and Design,2006.250-255.
    [98]H.Song,S.Dharmapurikar.Fast Hash Table Lookup Using Extended Bloom Filter:An Aid to Network Processing[C].In Proc.of ACM SIGCOMM,2005.181-192.
    [99]J.Bruck,J.Gao and A.Jiang.Weighted Bloom Filter[J].IEEE int'l symposium on information theory,2006.2304-2308.
    [100]J.Feigenbaum,S.Kannan,M.Strauss,M.Viswanathan.An Approximate L1-Difference Algorithm for Massive Data Streams[C].In Proc.of Syrup.on Foundations of Computer Science,1999.501-511.
    [101]J.Gehrke,F.Korn,D.Srivastava.On Computing Correlated Aggregates Over Continual Data treams[C].In Proc.of ACM SIGMOD,2001.13-24.
    [102]A.C.Gilbert,Y.Kotidis,S.Muthukrishnan,M.J.Strauss.Surffng Wavelets on Streams:One-Pass Summaries for Approximate Aggregate Queries[C].In Proc.of VLDB,2001.79-88.
    [103]M.Garofalakis,P.Gibbons.Wavelet Synopses with Error Guarantees[C].In Proc.of ACM SIGMOD,2002.476-487.
    [104]L.K.Lee,H.F.Ting.A Simpler and More Efficient Deterministic Scheme for Finding Frequent Items over Sliding Windows[C].In Proc.of PODS,2006.290-297.
    [105]A.Metwally,D.Agrawal and A.Abbadi.An Integrated Efficient Solution for Computing Frequent and Top-k Elements in Data Streams[J].In ACM TODS,Vol.31,No.3,2006.1095-1133.
    [106]A.Manjhi,V.Shkapenyuk,K.Dhamdhere and C.Olston.Finding(Recently) Frequent Items in Distributed Data Streams[C].In Proc.of ICDE,2005.
    [107]A.Metwally,D.Agrawal and A.Abbadi.Duplicate Detection in Click Streams[C].In Proc.of WWW,2005.12-21.
    [108]B.Babcock and C.Olston.Distributed Top K Monitoring[C].In Proc.of ACM SIGMOD,2003.28-39.
    [109]LGolab,D.DeHaan,A.Lopez-Ortiz and E.D.Demaine.Finding Frequent Items in Sliding Windows with Multinomially-Distributed Item Frequencies[R].Tech.Report CS-2004-06,Univ.of Waterloo,2004.
    [110]A.Metwally,D.Agrawal,and A.Abbadi.Efficient Computation of Frequent and Top-k Elements in Data Streams[R].Technical Report 2005-23,Univ.of California,2005.
    [111]C.Jin,W.Qian,C.Sha,J.X.Yu and A.Zhou.Dynamically Maintaining FrequentItems over A Data Stream[C].In Proc.of CIKM,2003.
    [112]R.Karp,S.Shenker,and C.Papadimitriou.A Simple Algorithm for Finding Frequent Elements in Streams and Bags[J].ACM Transactions on Database Systems,2005,28(1). 51-55.
    [113]N.Tatbul,U.Cetintemel,S.Zdonik,M.Cherniack et al.Load Sheding in A Data Stream Manager[C].In Proc.of VLDB,2003.
    [114]S.Guha,N.Mishra,R.Motwani and L.O'Callaghan.Clustering Data Streams[C].In Proc.of ACM Symp.Foundations of Computer Science,2000.359-366.
    [115]S.Guha,A.Meyerson,N.Mishra,R.Motwani and L.O'Callaghan.Clustering Data Streams:Theory and Practice[J].IEEE TKDE,Vol.15,NO.3,,2003.
    [116]J.Yang.Dynamic Clustering of Evolving Streams with a Single Pass[C].In Proc.of ICDE,2003.695-697.
    [117]M.Kontaki,A.N.Papadopoulos,Y.Manolopoulos.Efficient Incremental Subspace Clustering in Data Streams[C].In Proc.of 10th International Database Engineering and Applications Symposium(IDEAS),2006.
    [118]L.O'Callaghan,N.Mishra,A.Meyerson,S.Guha and R.Motwani.Streaming-Data Algorithms for High-Quality Clustering[C].In Proc.of ICDE,2002.685-694.
    [119]D.Chakrabarti,R.Kumar and A.Tomkins.Evolutionary Clustering[C].In Proc.of KDD,2006.554-560.
    [120]M.Charikar,L.O'Callaghan and R.Panigrahy.Better Streaming Algorithms for Clustering Problems[C].In Proc.of STOC,2003.30-39.
    [121]C.Ordonez.Clustering Binary Data Streams with Kmeans[C].In Proc.of DMKD,2003.12-19.
    [122]K.Cho,S.Jo,H.Jang,S.M.Kim and J.Song.DCF:An Efficient Data Stream Clustering Framework for Streaming Applications[J].LNCS 4080,2006.114-122.
    [123]A.Jain,Z.Zhang and EY.Chang.Adaptive Nonlinear Clustering in Data Streams[C].In Proc.of CIKM,2006.122-131.
    [124]O.Nasraoui,C.Cardona,C.Rojas and F.Gonzlez.TECNO-STREAMS:Tracking Evolving Clusters in Noisy Data Streams with A Scalable Immune System Learning Model[C].In Proc.of ICDM,2003.235-242.
    [125]C.C.Aggarwal,J.Han,J.Wang and P.S.Yu.A Framework for Clustering Evolving Data Streams[C].In Proc.of VLDB,2003.81-92.
    [126]F.Cao,M.Estery,W.Qian,A.Zhou.Density-Based Clustering over an Evolving Data Stream with Noise[C].In Proc.of the SIAM Conference on Data Mining(SDM),2006.326-337.
    [127]B.-R.Dai,J.-W.Huang,M.-Y.Yeh and M.-S.Chen.Clustering on Demand for Multiple Data Streams[C].In Proc.of ICDM,2004.367-370.
    [128]N.H.Park and W.S.Lee.Statistical Grid-based Clustering over Data Streams[J].SIGMOD Record,Vol.33,No.1,2004.32-37.
    [129]M.-Y.Yeh,B.-R.Dai and M.-S.Chen.Clustering over Multiple Evolving Streams by Events and Correlations[J].TKDE,2007.1349-1362.
    [130]C.C.Aggarwal,J.Han,J.Wang and P.S.Yu.A Framework for Projected Clustering of High Dimensional Data Streams[C].In Prec.of VLDB,2004.852-863.
    [131]K.Chen and L.Liu.Detecting the Change of Clustering Structure in Categorical Data Streams[C].In Prec.of SIAM Data Mining Conference(SDM),2006.502-506
    [132]C.Hidber.Online Association Rule Mining[C].In Prec.of ACM SIGMOD,1999.145-156.
    [133]R.M.Karp and S.Shenker.A Simple Algorithm for Finding Frequent Elements in Streams and Bags[C].In ACM TODS,2003.51-55.
    [134]L.Yang and M.Sanver.Mining Short Association Rules with One Database Scan[C].In Prec.of Int'l Conf.on Information and Knowledge Engineering,2004.392-398.
    [135]Y.Chi,H.Wang,P.Yu and R.Richard.Moment:Maintaining Closed Frequent Itemsets over a Stream Sliding Window[C].In Prec.of ICDE,2004.59-66.
    [136]G.Mao,X.Wu,C.Liu,X.Zhu et al.Online Mining of Maximal Frequent Itemsequences from Data Streams[R].Technical Report CS-05-07,University of Vermont,Computer Science,2005.
    [137]C.Giannella,J.Han,J.Pei,X.Yan et al.Mining Frequent Patterns in Ddata Streams at Multiple Time Granularities[C].In Proc.of VLDB,2002.191-212
    [138]J.X.Yu,Z.Chong,H.Lu and A.Zhou.False Positive or False Negative:Mining Frequent Item,sets from High Speed Transactional Data Streams[C].In Prec.of VLDB,2004.204-215.
    [139]J.Chang and W.Lee.Finding Recent Frequent Itemsets Adaptively over Online Data Streams[C].In Proc.of ACM SIGKDD,2003.226-235.
    [140]G.Hulten,L.Spencer and P.Domingos.Mining Time-changing Data Streams[C].In Proc.of ACM SIGKDD,2001.97-106.
    [141]H.Wang,W.Wang,P.Yu and J.Han.Mining Concept-driftingdata Streams Using Ensemble Classifiers[C].In Proc.of ACM SIGKDD,2003.
    [142]W.Fan.Systematic Data Selection to Mine Concept-drifting Data Streams[C].In Proc.of KDD,2004.128-137.
    [143]C.Aggarwal,J.Han,J.Wang,and P.S.Yu.On Demand Classification of Data Stream[C].In Proc.of KDD,2004.503-508.
    [145]M.Burl,C.Fowlkes,J.Roden,A.Stechert and S.Mukhtar.Diamond Eye:A Distributed Architecture for Image Data Mining[C].In Proc.of SPIE DMKD,vol.3695,1999.197-206.
    [146]H.Kargupta,B-H.Park,S.Pittie,L.Liu et al.MobiMine:Monitoring the Stock Market from a PDA[J]. In ACM SIGKDD Explorations, Vol.3, Issue 2,2002.37-46.
    [147] H.Kargupta, R.Bhargava, K.Xiu, M.Powers, et al. VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring[C]. In Proc. of SIAM Int'l Conf. on Data Mining, 2004.300-311.
    [148] S.Tanner, MAlshayeb, E.Criswell, M.Iyer, et al. EVE: On-Board Process Planning and Execution[C]. Earth Science Technology Conference, 2002.11-14.
    [149] A.Srivastava and J.Stroeve. Onboard Detection of Snow, Ice, Clouds and Other Geophysical Processes Using Kernel Methods[C]. In Proc. of the ICML workshop on Machine Learning Technologies for Autonomous Space Applications, 2003.
    [150] B.Babcock, M.Datar, R.Motwani and L.O'Callaghan. Maintaining Variance and k-Medians over Data Stream Windows[C]. In Proc.of ACM PODS, 2003.234-243.
    [151] Z.Huang. Clustering Large Data Sets with Mixed Numeric and Categorical Values. In Knowledge discovery and data mining:techniques and applications[J]. World Scientific, 1997.
    [152] J.-W.Chang and D.-S.Jin. A New Cell-based Clustering Method for Large, High-dimensional Data in Data Mining Applications[C]. In Proc. of ACM symposium on Applied computing, 2002.503-507.
    [153] C.M.Procopiuc, M.Jones, P.K.Agarwal and T.M.Murali. A Monte-carlo Algorithm for Fast Projective Clustering[C]. In Proc. of the ACM SIGMOD, 2002.418-427.
    [154] D.J.Abadi, W.Lindner, S.Madden, J.Schuler. An Integration Framework for Sensor Networks and Data Stream Management Systems[C]. In Proc. of VLDB, 2004.1361-1364.
    [155] J.Yang, W.Wang, H.Wang, and P.Yu. δ-clusters:Capturing Subspace Correlation in A Large Data Set[C]. In Proc. of Int'l Conf. on Data Engineering, 2002.517-528.
    [156] J.H.Friedman and J.J.Meulman. Clustering Objects on Subsets of Attributes. http://citeseer.nj.nec.com/friedman02clustering.html, 2002.
    [157] B.Liu, Y.Xia and P.S.Yu. Clustering Through Decision Tree Construction[C]. In Proc.of int'l conf. on Information and knowledge management, 2000.20-29.
    [158] L.N.Castro and F.J.Zuben. An Evolutionary Immune Network for Data Clustering[C]. In Proc. of SBRN, 2000. 84-89.
    [159] L.N.Castro and F.J.Zuben. aiNet: An Artificial Immune Network for Data Analysis. ftp://ftp.dca.fee.unicamp.br/pub/docs/vonzuben/lnunes/DMHA.pdf
    [160] J.Timmis and M.Neal. A Resource Limited Artificial Immune System for Data Analysis[J]. Knowledge-Based Systems, 2001.121-130.
    [161] L.Xu, H.Mo, K.Wang and N.Tang. Document Clustering Based on Modified Artificial Immune Network[C]. In Proc. of RSKT, LNAI 4062,2006.516-521.
    [162]X.Hang and H.Dai.An Immune Network Approach for Web Document Clustering[C].In Proc.of IEEE/WIC/ACM Int'l Conf.on Web Intelligence,2004.278-284.
    [163]K.Ciesielski,S.T.Wierzchon and M.A.Klopotek.An Immune Network for Contextual Text Data Clustering[C].In Proc.of ICARIS,LNCS 4163,2006.432-445.
    [164]O.Nasraoui,C.Cardona and C.Rojas.Mining Evolving User Profiles in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm[C].In Proc.of WEBKDD,2003.71-81.
    [165]G.B.Bezerra and L.N.Castro.Bioinformatics Data Analysis Using an Artificial Immune Network[C].In Proc.of ICARIS,LNCS 2787,2003.22-33.
    [166]D.N.Sotiropoulos,G.A.Tsihrintzis,A.Savvopoulos and M.Virvou.Artificial Immune System-Based Customer Data Clustering in an e-Shopping Application[C].In Proc.of KES,Part Ⅰ,LNAI 4251,2006.960-967.
    [167]X.Liu,N.Zhang.Incremental Immune-Inspired Clustering Approach to Behavior-Based Anti-Spam Technology[J].Int'l.Journal of Information Technology,Vol.12,No.3,2006.111-120
    [168]Z.Huang.Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values[J].Data Mining and Knowledge Discovery,1998,2(3).283-304.
    [169]J.Timmis,M.Neal and J.Hunt.An Artificial Immune System for Data Analysis[J].Biosystems,Vol.55,No.1,2000.143-150.
    [170]L.N.Castro and J.Timmis.Artificial Immune Systems:A New Computational Intelligence Approach[M].Springer,2002.
    [171]K.Cheng,L.Xiang,M.Iwaihara,H.Xu,et al.Time-Decaying Bloom Filters for Data Streams with Skewed Distributions[C].In Proc.of RIDE,2005.63-69.
    [172]E.Gokcay and J.C.Principe.Information Theoretic Clustering[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,Vol.24,No.2,2002.158-171.
    [173]K.Chen and L.Liu.The "Best K" for Entropy-based Categorical Data Clustering[C].In Proc.of SSDBM,2005.
    [174]J.Yao,M.Dash,S.T.Tan,H.Liu.Entropy-based Fuzzy Clustering and Fuzzy modeling[J].Fuzzy Sets and Systems 113(2000).381-388.
    [175]T.Zhang,R.Ramakrishnan and M.Livny.Fast Density Estimation Using CF-kernel for Very Large Databases[C].In Proc.of KDD,1999.312-316.
    [176]D.Barbara,Julia Couto,Yi Li.COOLCAT:An Entropy-based Algorithm For Categorical Clustering[C].In Proc.of CIKM,2002.582-589.
    [177]R.Jenssen,K.E.Hild,D.Erdogmus,J.C.Principe et al.Clustering using Renyi's Entropy[C].In Proc.Int'l Joint Conf.Neural Networks,2003.523-528.
    [178]D.F.Specht.Series Estimation of a Probability Density Function[J].Technometrics,Vol. 13,No.2,1971.409-424.
    [179]B.W.Silverman.On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method[J].The Annals of Statistics,Vol.10,No.3,1982.795-810.
    [180]E.F.Schuster.Estimation of a Probability Density Function and Its Derivatives[J].The Annals of Mathematical Statistics,Vol.40,No.4,1969.1187-1195.
    [181]L.Parsons,E.Haque,H.Liu.Subspace clustering for high dimensional data:A review[J].Sigkdd Explorations.Vol.6,Issue 1,2004.90-105.
    [182]S.Brin,R.Motwani,J.D.Ullman and S.Tsur.Dynamic Itemset Counting and Implication Rules for Market Basket Data[C].In Proc.of ACM SIGMOD,1997.255-264.
    [183]R.C.Agarwal,C.C.Aggarwal and V.V.V.Prasad.Depth First Generation of Long Patterns[C].In Proc.of ACM SIGKDD,2000.108-118.
    [184]K.K.Kumar,J.Neidhoefer.Immunized Neuron Control[J].Expert Systems With Applications,1997,13(3).201-214.
    [185]F.Mizessyn,Y.Ishida.Immune Networks for Cement Plants[C].In Int'l Symposium on Autonomous Decentralized Systems,1993.282-288.
    [186]L.N.Castro and F.J.Zuben.The Clonal Selection Algorithm with Engineering Applications[C].In Proc.of Genetic and Evolutionary Computation Conference,2000.36-37.
    [187]Z.Tang,T.Yamaguchi,K.Tashima.A Multiple Valued Immune Network and Its Applications[J].In IEICE Transaction Fundamentals,1999,E82-A(6).1102-1108.
    [188]J.Timmis.On Parameter Adjustment of the Immune Inspired Machine Learning Algorithm[R].Technical Report,Univ.of Kent,2000.
    [189]M.C.Jones,J.S.Marron,S.J.Sheather.A Brief Survey of Bandwidth Selection for Density Estimation[J].Journal of the American Statistical Association,Vol.91,No.433,1996.401-407.
    [190]P.Hall,S.N.Lahiri,Y.K.Truong.On Bandwidth Choice for Density Estimation with Dependent Data[J].The Annals of Statistics,Vol.23,No.6,1995.2241-2263.
    [191]S.-T.Chiu.Bandwidth Selection for Kernel Density Estimation[J].The Annals of Statistics,Vol.19,No.4,1991.1883-1905.
    [192]周晓云,孙志挥,张柏礼,杨宜东.高维数据流子空间聚类发现及维护算法[J].计算机研究与发展,2006,43(5):834-840
    [193]姜丹.信息论与编码(第二版).中国科学技术大学出版社.2004.
    [194]邵学广,陈宗海,林祥钦.一种新型的信号拟合方法—免疫算法[J].分析化学,2000,28(2):152-155.
    [195]丁永生,任立红.人工免疫系统:理论与应用[J].模式识别与人工智能,2000,13(1):52-59.
    [196]金澈清,钱卫宁,周傲英.流数据分析与管理综述[J].软件学报,2004,15(08):1172-1181.
    [197]宋国杰,唐世渭,杨冬青,王腾蛟.数据流中异常模式的提取与趋势监测[J].计算机研究与发展,2004,41(10):1754-1759.
    [198]郭龙江,李建中,王伟平,张冬冬.数据流上的连续预测聚集查询[J].计算机研究与发展,2004,41(10):1690-1695.
    [199]王涛,李舟军,颜跃进,陈火旺.数据流挖掘分类技术综述[J].计算机研究与发展,2007,44(11):1809-1815.
    [200]张冬冬,李建中,王伟平,郭龙江.数据流历史数据的存储与聚集查询处理算法[J].软件学报,2005,16(12):2089-2098.
    [201]郑军,胡铭曾,云晓春,郑仲.基于数据流方法的大规模网络异常发现[J].通信学报,2006,2(2):1-8.
    [202]陈希孺,李国英.非参数统计[M].上海科学技术出版社,1989.
