面向不均衡医学数据集的疾病预测模型研究 (Research on Disease Prediction Models Based on Imbalanced Medical Data Sets)
  • Title (English): Research on Disease Prediction Models Based on Imbalanced Medical Data Sets
  • Authors: 陈旭; 刘鹏鹤; 孙毓忠; 沈曦; 张磊; 王晓青; 孙晓平; 程伟
  • Authors (English): CHEN Xu; LIU Peng-He; SUN Yu-Zhong; SHEN Xi; ZHANG Lei; WANG Xiao-Qing; SUN Xiao-Ping; CHENG Wei (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences; Beijing Chao-Yang Hospital Affiliated to Capital University of Medical Sciences, Attending Pediatrician; Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; Xiyuan Hospital of China Academy of Chinese Medical Sciences)
  • Keywords (Chinese): 疾病预测; 不均衡数据集; 欠采样; 二分类; 多标签分类
  • Keywords (English): disease prediction; imbalanced data set; under-sampling; binary classification; multi-label classification
  • Journal code: JSJX
  • Journal title (English): Chinese Journal of Computers
  • Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences; Beijing Chao-Yang Hospital Affiliated to Capital University of Medical Sciences; Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; Xiyuan Hospital of China Academy of Chinese Medical Sciences
  • Publication date: 2017-11-15 22:23
  • Journal (Chinese): 计算机学报
  • Year: 2019
  • Volume/Issue: Vol. 42 (cumulative No. 435), Issue 03
  • Funding: Supported by the “面向云计算的网络化操作系统” project (Cloud-Computing-Oriented Networked Operating System, 2016YFB1000505) and by the NSFC-Guangdong Provincial Government Joint Fund, Special Program for Research on Supercomputing Science Applications (Phase II)
  • Language: Chinese
  • Record number: JSJX201903009
  • Pages: 146-159 (14 pages)
  • CN: 11-1826/TP
Abstract
Disease prediction models based on clinical manifestations are an important research topic in clinical decision support systems (CDSS). Existing clinical decision support systems typically take clinical case records as training data, use the textual descriptions of clinical manifestations as features, and build disease prediction models with statistical machine learning methods. In the medical domain, however, the sample data sets are often imbalanced, which degrades the models' predictive performance. Under-sampling is currently a common way to handle this imbalance: a subset of the majority-class samples is drawn and combined with the minority-class samples to form a balanced data set before the model is trained. Existing under-sampling methods can markedly improve recall on the minority class, but they usually also reduce the model's precision, which limits the overall improvement of the prediction model. To address this, this paper proposes a new ensemble classification method based on iterative boosting with under-sampling (Under-Sampling with Iteratively Boosting, USIB). The method iteratively under-samples the majority class, builds multiple groups of weak classifiers, and combines these weak classifiers into a strong classifier through weighted combination, thereby improving single-disease prediction under imbalanced data. In addition, medical case data sets are usually multi-class and multi-label, so this paper combines multiple single-disease prediction models into a multi-label disease prediction model to support the clinically meaningful diagnosis of multiple diseases and complications. To further improve the multi-label model, this paper designs a label selection method based on a maximum mutual information spanning tree over labels (Labels Selection method based on Maximum Mutual Information Spanning Tree, LS-MMIST). The method builds a maximum mutual information spanning tree among the labels from the distribution of the original data set and, at each prediction step, uses the relations between disease labels in the tree to determine the final set of predicted labels. In the experiments, the proposed iterative boosting under-sampling method is first evaluated on three public imbalanced binary-classification data sets and four private rare-disease data sets. The proposed multi-label prediction model is then compared with existing multi-label prediction techniques on a traditional Chinese medicine data set and a Western medicine data set. The results show that, compared with eight mainstream under-sampling methods and two ensemble sampling methods, the proposed iterative boosting under-sampling method improves the F1 score on the imbalanced binary data sets by 22.58% on average; compared with existing multi-label prediction techniques, the proposed multi-label method improves precision by 6.30% and 12.43%, recall by 4.33% and 5.86%, and F1 by 5.48% and 11.16% on the Western medicine and traditional Chinese medicine data sets, respectively.
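To make the USIB idea above concrete, the following Python sketch shows one minimal way such an iterative under-sampling ensemble could look. It is an illustration only, not the authors' implementation: the class name, the use of shallow decision trees as weak learners, and the AdaBoost-style classifier weights are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class UnderSamplingBoostSketch:
    """Illustrative only: iterative under-sampling plus a weighted ensemble."""

    def __init__(self, n_rounds=10, random_state=0):
        self.n_rounds = n_rounds
        self.rng = np.random.default_rng(random_state)
        self.models, self.alphas = [], []

    def fit(self, X, y):
        # y is binary: 1 = minority (rare disease), 0 = majority class.
        minority = np.flatnonzero(y == 1)
        majority = np.flatnonzero(y == 0)
        for _ in range(self.n_rounds):
            # Draw a balanced subset: all minority samples plus an equally
            # sized random sample of the majority class (under-sampling).
            drawn = self.rng.choice(majority, size=len(minority), replace=False)
            idx = np.concatenate([minority, drawn])
            clf = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
            # Weight the weak classifier by its error on the full training set
            # (a plain AdaBoost-style weight; the paper's scheme may differ).
            err = np.clip(np.mean(clf.predict(X) != y), 1e-6, 1 - 1e-6)
            self.models.append(clf)
            self.alphas.append(0.5 * np.log((1 - err) / err))
        return self

    def predict(self, X):
        # Weighted vote of the weak classifiers, with labels mapped to +/-1.
        votes = sum(a * (2 * m.predict(X) - 1)
                    for m, a in zip(self.models, self.alphas))
        return (votes > 0).astype(int)
```

Under these assumptions, `UnderSamplingBoostSketch(n_rounds=20).fit(X_train, y_train).predict(X_test)` would return the weighted-vote predictions for the minority disease.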
        The prediction of diseases based on clinical records is an important research topic in clinical decision support systems. Existing clinical decision support systems often apply statistical machine learning methods to construct disease prediction models, taking the collected clinical records as training data sets and the clinical manifestation texts as feature spaces, which can help diagnose patients' diseases. In the medical field, common human diseases usually have far more clinical records than rare diseases, which have only a few recorded samples. This imbalance among disease samples often degrades a model's predictive performance. Under-sampling is a common method that combines a subset of samples drawn from the majority class with the minority-class samples to form a balanced data set. Existing under-sampling methods can significantly improve the recall of the model, but they usually also decrease its precision, which limits the overall improvement of the prediction model. To address this, we propose a new ensemble classification method based on under-sampling with iterative boosting (USIB). The method uses boosting to iteratively build a set of weak classifiers by under-sampling the majority-class samples and ensembles these weak classifiers into a strong classifier, in order to improve single-disease prediction under an imbalanced data set. Besides, medical data sets usually contain multiple classes, and a sample may carry several labels. Thus, to support the diagnosis of multiple diseases and complications in a clinically meaningful way, we combine the single-disease prediction models into a multi-label disease prediction model. To further improve the multi-label prediction model, we design a label selection method based on a label maximum mutual information spanning tree (LS-MMIST). The method builds the maximum mutual information spanning tree between labels according to the distribution of the original data set and determines the final labels from the relations between the disease labels in the tree when predicting an unknown multi-label clinical sample. In the experiments, we first evaluate the predictive performance of the proposed ensemble method on three public imbalanced binary data sets and four private rare-disease data sets; the proposed method improves the F1 value by 22.58% on average across these imbalanced binary classification data sets. Secondly, we compare the performance of the proposed multi-label prediction model with existing multi-label prediction techniques on traditional Chinese medicine and Western medicine data sets. The experimental results show that our method significantly outperforms eight mainstream under-sampling methods and two kinds of ensemble sampling methods on imbalanced medical data sets, especially for classes with small sample sizes. Compared with existing multi-label prediction techniques, our multi-label disease prediction model increases precision by 6.30% and 12.43%, recall by 4.33% and 5.86%, and the F1 value by 5.48% and 11.16%, respectively. In addition, compared with an LSTM model, our multi-label disease prediction model is more applicable to small sample sets.
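The LS-MMIST step described in both abstracts starts from a maximum mutual information spanning tree over the disease labels. The sketch below shows only that tree-construction step, assuming a binary label indicator matrix as input and using a plain Prim's algorithm; the function name is hypothetical, and the paper's actual rules for deriving the final label set from the tree are not reproduced here.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mmi_spanning_tree(Y):
    """Y: (n_samples, n_labels) binary indicator matrix of disease labels.
    Returns a list of (parent, child) edges of a maximum-MI spanning tree."""
    n_labels = Y.shape[1]
    # Pairwise mutual information between every two label columns.
    mi = np.zeros((n_labels, n_labels))
    for i in range(n_labels):
        for j in range(i + 1, n_labels):
            mi[i, j] = mi[j, i] = mutual_info_score(Y[:, i], Y[:, j])
    # Prim's algorithm, keeping the edge with MAXIMUM mutual information
    # at each step, so the resulting tree is a maximum spanning tree.
    in_tree = {0}
    edges = []
    while len(in_tree) < n_labels:
        best = None
        for u in in_tree:
            for v in range(n_labels):
                if v not in in_tree and (best is None or mi[u, v] > best[2]):
                    best = (u, v, mi[u, v])
        edges.append((best[0], best[1]))
        in_tree.add(best[1])
    return edges
```

At prediction time, the edges returned here would then be used to relate the outputs of the per-disease classifiers when deciding the final label set, as the abstract describes.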
