A Clustering-Based Hotspot Detection Algorithm for Multi-dimensional Data (基于聚类的多维数据热点发现算法)
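  • English title: Detecting Hotspot in Multi-dimensional Data Through Clustering
  • Authors: ZOU Lei (邹磊); ZHU Jing (朱晶); NIE Xiao-hui (聂晓辉); SU Ya (苏亚); PEI Dan (裴丹); SUN Yu (孙宇)
  • Author affiliations (English): Department of Computer Science and Technology, Tsinghua University; Beijing Didi Chuxing Company Limited
  • Keywords: hotspot detection; clustering; data mining; decision tree; multi-dimensional data analysis
  • English keywords: Hotspot detection; clustering; data mining; unsupervised decision tree; multi-dimensional data analysis
  • Chinese journal name: XXWX
  • English journal name: Journal of Chinese Computer Systems
  • Institutions: Department of Computer Science, Tsinghua University (清华大学计算机系); Beijing Xiaoju Technology (Didi Chuxing) Co., Ltd. (北京小桔科技(滴滴出行)有限公司)
  • Publication date: 2019-03-15
  • Publisher: 小型微型计算机系统 (Journal of Chinese Computer Systems)
  • Year: 2019
  • Issue: v.40
  • Language: Chinese
  • Pages: XXWX201903001
  • Page count: 7
  • CN: 03
  • ISSN: 21-1106/TP
  • Classification number: 3-9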
Abstract
Hotspot detection aims to find the regions where data are concentrated and to present them in a form that people can readily interpret. This paper designs a hotspot detection algorithm for multi-dimensional data that contain both numerical and categorical features. Its core is CLTree+, a clustering algorithm that improves on CLTree. CLTree+ can directly cluster data with mixed numerical and categorical features, and it produces better clusters for numerical features that are periodic. CLTree+ is also substantially more computationally efficient than CLTree, which makes it usable on large-scale data. Applied to the business data of a large Internet company, CLTree+ found several hotspots and presented them as easy-to-interpret combinations of feature values.
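As background for the abstract, the decision-tree clustering idea that CLTree builds on (Liu, Xia and Yu, 2000) can be sketched roughly as follows: real points are treated as one class, an equal number of uniformly spread "virtual" points as a second class, and a threshold is chosen that best separates dense regions from sparse ones. This is a minimal single-feature illustration under those assumptions, not the CLTree+ algorithm of the paper; the function names `entropy` and `best_cut`, the choice of midpoints as candidate cuts, and the sample values are all hypothetical.

```python
import math


def entropy(real, virtual):
    """Binary entropy of a region holding `real` and `virtual` points."""
    total = real + virtual
    if total == 0:
        return 0.0
    h = 0.0
    for count in (real, virtual):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def best_cut(values):
    """Return (cut, gain) for the best single threshold on one numeric feature.

    Virtual points are never materialised: spreading them uniformly over
    [lo, hi] gives each side of a cut a share proportional to its width.
    """
    xs = sorted(values)
    lo, hi = xs[0], xs[-1]
    n_real = len(xs)
    n_virtual = float(n_real)  # assumption: as many virtual as real points
    total = n_real + n_virtual
    parent = entropy(n_real, n_virtual)
    best = (None, 0.0)
    for i in range(1, n_real):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between identical values
        cut = (xs[i] + xs[i - 1]) / 2
        real_left, real_right = i, n_real - i
        virt_left = n_virtual * (cut - lo) / (hi - lo)
        virt_right = n_virtual - virt_left
        child = (
            (real_left + virt_left) * entropy(real_left, virt_left)
            + (real_right + virt_right) * entropy(real_right, virt_right)
        ) / total
        gain = parent - child
        if gain > best[1]:
            best = (cut, gain)
    return best


# Six points packed near 0 and two stragglers: the chosen cut should fall
# on the low side of the range, isolating the dense region.
cut, gain = best_cut([0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 5.0, 10.0])
print(cut, gain)
```

Applying such a cut recursively, and across several features, yields a tree whose leaves describe dense regions as ranges of feature values, which is the kind of readable output the abstract refers to.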
References
[1] Ding Jian-li, Yang Bo, Lei Xiong. Algorithm of airline QoS hot topic detection based on MapReduce[J]. Computer Engineering and Science, 2013, 35(4): 130-135.
[2] Wei De-zhi, Chen Fu-ji, Lin Li-na. Microblog hotspot detection method based on MFIHC and TOPSIS[J]. Application Research of Computers, 2018, 35(4): 1014-1017+1041.
[3] Ma Bao-jun, Zhang Nan, Liu Guan-nan, et al. Semantic search for public opinions on urban affairs: a probabilistic topic modeling-based approach[J]. Information Processing & Management, 2016, 52(3): 430-445.
[4] Bing Liu, Yiyuan Xia, Philip S Yu. Clustering through decision tree construction[C]. Proceedings of the Ninth International Conference on Information and Knowledge Management, McLean, Virginia, USA, 2000: 20-29.
[5] Li Rui, Qiu Yu-hui. Study of ants-clustering algorithm based on outlier[J]. Computer Science, 2005, 32(6): 111-113.
[6] Wazavkar S V, Manjrekar A A. Text clustering using HFREC-CA and rough K-Means cluster algorithm[J]. Discovery, 2014, 15(40): 44-47.
[7] Guha S, Rastogi R, Shim K. ROCK: a robust clustering algorithm for categorical attributes[C]. Proceedings of the IEEE Conference on Data Engineering, 1999.
[8] Trikha P, Vijendra S. Fast density based clustering algorithm[J]. International Journal of Machine Learning and Computing, 2013, 3(1): 10-12.
[9] Fraley C, Raftery A E. Model-based clustering, discriminant analysis, and density estimation[J]. Journal of the American Statistical Association, 2002, 97(458): 611-631.
[10] Sun Hao-jun, Wang Sheng-rui, Jiang Qing-shan. FCM-based model selection algorithms for determining the number of clusters[J]. Pattern Recognition, 2004, 37(10): 2027-2037.
[11] Sharan R, Shamir R. CLICK: A clustering algorithm with applications to gene expression analysis[C]. Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, 2000: 307-316.
[12] Barbara B, Chen Ping. Using the fractal dimension to cluster datasets[C]. Proc. of the 6th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD2000), ACM Press, 2000: 260-264.
[13] Wei-Yin Loh. Fifty years of classification and regression trees[J]. International Statistical Review, 2014, 82(3): 329-348, doi:10.1111/insr.12016.
[14] Tan P N, Steinbach M, Kumar V. Classification: basic concepts, decision trees, and model evaluation[M]. Pearson, 2013.
[15] University of Ljubljana. Orange documentations[EB/OL]. http://docs.orange.biolab.si/reference/rst/Orange.classification.tree.html, 2018.
[16] Matthieu Boussard, Cloderic Mars, Remi Des, Caroline Chopinaud. Periodic split method: learning more readable decision trees for human activities[C]. Conference Nationale sur les Applications Pratiques de l'Intelligence Artificielle, Caen, France, 2017.
[17] Liu Da-peng, Zhao You-jian, Sui Kai-xin, et al. FOCUS: shedding light on the high search response time in the wild[C]. The 35th Annual IEEE International Conference on Computer Communications, 2016.
[18] Li Nan, Wu De-sheng. Using text mining and sentiment analysis for online forums hotspot detection and forecast[J]. Decision Support Systems, 2010, 48(2): 354-368.
[19] Ding D, Wu X, Ghosh J, et al. Machine learning based lithographic hotspot detection with critical-feature extraction and classification[C]. Proc. Int. Conf. for Integrated Circuit Design Technology, 2009: 219-222.
[20] Scikit-learn developers. Scikit-learn documentation[EB/OL]. http://scikit-learn.org/stable/modules/tree.html, 2018.
[21] Wikipedia contributor. Wikipedia[EB/OL]. https://en.wikipedia.org/wiki/Continuous_or_discrete_variable, 2018.
