Abstract
To support the sound use of real-world data, this paper explores clustering methods suited to the increasingly common mixed-type medical data. It analyzes and compares two typical mixed-data clustering methods, K-prototypes and ClustMD, improves the procedure for selecting their key parameters, and proposes a clustering stability index. Case analysis results indicate that both methods are highly effective and stable, and that each has its own advantages and disadvantages. K-prototypes is recommended for mixed data when the variables are strongly correlated, when missingness is severe, or when non-continuous variables are relatively numerous.
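For readers unfamiliar with K-prototypes, the following is a minimal sketch of the mixed dissimilarity measure from Huang's original formulation: squared Euclidean distance on the numeric variables plus a weight gamma times the number of mismatched categorical variables. It does not reproduce this paper's improved parameter selection or its stability index. The function names, the default value of gamma, and the example record are hypothetical and chosen only for illustration.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Dissimilarity between one record and one cluster prototype.

    x_num / proto_num: numeric values (same length).
    x_cat / proto_cat: categorical values (same length).
    gamma: weight balancing the categorical part against the numeric part
           (assumed default; in practice it must be tuned to the data).
    """
    # Numeric part: squared Euclidean distance.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Categorical part: simple matching (count of differing categories).
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def nearest_prototype(x_num, x_cat, prototypes, gamma=1.0):
    """Index of the prototype with the smallest mixed dissimilarity.

    prototypes: list of (proto_num, proto_cat) pairs.
    """
    distances = [
        mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return distances.index(min(distances))


# Hypothetical record: two standardized numeric variables, two categorical ones.
record_num, record_cat = [1.0, 2.0], ["a", "x"]
prototypes = [
    ([0.0, 2.0], ["a", "y"]),   # close in numeric space, one mismatch
    ([5.0, 5.0], ["a", "x"]),   # identical categories, far in numeric space
]
cluster = nearest_prototype(record_num, record_cat, prototypes, gamma=0.5)
```

A larger gamma makes the categorical variables dominate the assignment, while a smaller gamma makes the numeric variables dominate, which is why the choice of this parameter matters in practice.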