A Survey on Data Sampling Methods in Imbalance Classification


Title: A Survey on Data Sampling Methods in Imbalance Classification (original Chinese title: 不平衡分类的数据采样方法综述)
Authors: LIU Dingxiang (刘定祥); QIAO Shaojie (乔少杰); ZHANG Yongqing (张永清); HAN Nan (韩楠); WEI Junlin (魏军林); ZHANG Rongke (张榕珂); HUANG Ping (黄萍)
Affiliations: School of Cybersecurity, Chengdu University of Information Technology; School of Software Engineering, Chengdu University of Information Technology; School of Computer Science, Chengdu University of Information Technology; School of Management, Chengdu University of Information Technology; Western General Hospital (西部战区总医院)
Keywords: machine learning; imbalanced data; over-sampling; under-sampling; hybrid sampling
Journal code: CGGL
Journal: Journal of Chongqing University of Technology (Natural Science)
Publication date: 2019-07-15
Publisher: Journal of Chongqing University of Technology (Natural Science) (重庆理工大学学报(自然科学))
Year: 2019
Volume/No.: v.33; No.408
Funding: National Natural Science Foundation of China (61772091, 61802035, 61702058); Guangxi Natural Science Foundation (2018GXNSFDA138005); Sichuan Science and Technology Program (2018JY0448); Innovative Research Team Construction Plan for Universities in Sichuan Province (18TD0027); Chengdu Soft Science Research Project (2017-RK00-00053-ZF); Scientific Research Foundation for Young and Middle-aged Academic Leaders of Chengdu University of Information Technology (J201701); Scientific Research Foundation of Chengdu University of Information Technology (KYTZ201715, KYTZ201750)
Language: Chinese
Record ID: CGGL201907014
Page count: 11
Issue: 07
CN: 50-1205/T
Pages: 108-118

Abstract

Achieving more accurate classification has long been a central research topic in machine learning, and most existing classifiers are designed for balanced datasets. A model trained on balanced data can classify both positive and negative samples with good accuracy, but real-world data are often imbalanced. Imbalance sharply reduces classification accuracy on positive samples, so the resulting models fail to meet the accuracy that machine learning applications require. To address this, the paper surveys the current mainstream data sampling methods for imbalanced classification. First, it describes under-sampling methods, including clustering-based and integration-based approaches. Second, it summarizes over-sampling methods, including approaches based on k-nearest neighbors, clustering, semi-supervised learning, deep neural networks, and evolutionary algorithms. Third, it summarizes hybrid sampling methods. Finally, it outlines development trends in research on imbalanced classification.
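For readers unfamiliar with the two basic families named in the abstract, the following is a minimal illustrative sketch, not taken from the paper. It assumes numeric feature vectors stored as Python lists; the function names `random_under_sample` and `smote_like_over_sample` are hypothetical. The first function shows plain random under-sampling; the second shows a simplified k-nearest-neighbor interpolation in the spirit of SMOTE, not the published algorithm in full.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(list(features))
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for features in rng.sample(rows, target):
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def smote_like_over_sample(X, y, k=3, seed=0):
    """Add synthetic samples to smaller classes until all classes match the largest.

    Each synthetic point lies on the segment between a randomly chosen sample
    and one of its k nearest same-class neighbours (squared Euclidean distance).
    """
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(list(features))
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [list(f) for f in X], list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            if not others:
                # A single-sample class cannot be interpolated; duplicate it.
                X_out.append(list(base))
                y_out.append(label)
                continue
            neighbours = sorted(
                others,
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> mate
            X_out.append([a + gap * (b - a) for a, b in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


# Toy data (made up for illustration): six majority samples, two minority samples.
X_demo = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
          [0.4, 0.1], [0.2, 0.4], [5.0, 5.0], [5.2, 4.9]]
y_demo = [0, 0, 0, 0, 0, 0, 1, 1]

print(Counter(random_under_sample(X_demo, y_demo)[1]))
print(Counter(smote_like_over_sample(X_demo, y_demo)[1]))
```

Under-sampling discards majority samples and may lose information, while over-sampling keeps all original samples and adds synthetic ones; hybrid sampling, as the abstract notes, combines the two.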

