Probability Prediction Model of Congenital Heart Disease Based on Cost Sensitivity and Probability Calibration
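  • English title: Probability Prediction Model of Congenital Heart Disease Based on Cost Sensitivity and Probability Calibration
  • Authors: 罗艳; 李治; 余红梅; 郭虎生; 曹红艳; 王蕾; 宋春英; 郭兴萍; 张岩波
  • Authors (English, as published): Luo Yanhong; Li Zhi; Yu Hongmei; Department of Health Statistics, Shanxi Medical University
  • Keywords (translated from Chinese): congenital heart disease; prediction; unbalanced data; cost sensitivity; probability calibration
  • Keywords (English, as published): Congenital heart disease; Probability prediction; Unbalanced data; Cost sensitivity; Probability calibration
  • Chinese journal title: ZGWT
  • English journal title: Chinese Journal of Health Statistics
  • Affiliations: Department of Health Statistics, Shanxi Medical University; School of Physical Education, North University of China; School of Computer and Information Technology, Shanxi University; Research Institute of the Shanxi Provincial Population and Family Planning Commission
  • Publication date: 2019-02-25
  • Publisher: Chinese Journal of Health Statistics (中国卫生统计)
  • Year: 2019
  • Issue: v.36
  • Funding: National Natural Science Foundation of China (81502897); Shanxi Medical University Doctoral Startup Fund (BS2017029)
  • Language: Chinese
  • Page: ZGWT201901008
  • Page count: 4
  • CN: 01
  • ISSN: 21-1153/R
  • Classification number: 38-41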
Abstract
Objective: Congenital heart disease (CHD) data are class-imbalanced, which biases CHD prediction. To address this, we build CHD probability prediction models based on cost sensitivity and probability calibration, aiming to improve probability prediction and to provide a reference for screening populations at high risk of CHD. Methods: We built a weighted support vector machine (WSVM) and a weighted random forest (WRF) with adjusted penalty weights, calibrated each with Platt scaling and isotonic regression (Iso) to obtain WSVM-Platt, WRF-Platt, WSVM-Iso, and WRF-Iso, and compared them with a logistic regression model. Results: Six models (WSVM-Platt, WSVM-Iso, WRF, WRF-Platt, WRF-Iso, and logistic regression) were compared on AUC (area under the curve), RMSE (root mean squared error), and SAR; all six performed well. WSVM-Platt performed best, followed by logistic regression. For WRF, both WRF-Platt and WRF-Iso outperformed the uncalibrated WRF. For both WRF and WSVM, Platt calibration gave slightly better probability predictions than Iso calibration. Conclusion: On extremely imbalanced data, the proposed models predict well. Calibrated models outperform uncalibrated ones, and Platt calibration is slightly better than Iso calibration, so the models built here can serve as a reference for effectively screening populations at high risk of CHD.
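The calibration and evaluation steps named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the scores and labels below are made-up toy values rather than data from the study, the Platt fit uses plain gradient descent instead of Platt's original optimizer and target smoothing, isotonic regression is done with a basic pool-adjacent-violators pass, and SAR is assumed to follow the commonly used definition (accuracy + AUC + (1 − RMSE)) / 3. All function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, epochs=5000):
    """Fit p = sigmoid(a*s + b) by gradient descent on mean log loss (simplified)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        errors = [sigmoid(a * s + b) - y for s, y in zip(scores, labels)]
        a -= lr * sum(e * s for e, s in zip(errors, scores)) / n
        b -= lr * sum(errors) / n
    return a, b

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: returns (upper_score, fitted_value) steps."""
    blocks = []  # each block holds [sum_of_labels, count, max_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the new one.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(b[2], b[0] / b[1]) for b in blocks]

def predict_isotonic(steps, score):
    """Step-function lookup; scores above the last step get the last value."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]

def rmse(probs, labels):
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels))

def auc(probs, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sar(probs, labels, threshold=0.5):
    """Assumed definition: (accuracy + AUC + (1 - RMSE)) / 3."""
    acc = sum((p >= threshold) == (y == 1) for p, y in zip(probs, labels)) / len(labels)
    return (acc + auc(probs, labels) + (1.0 - rmse(probs, labels))) / 3.0

# Toy, imbalanced data: uncalibrated classifier scores and true labels (illustrative only).
scores = [-2.0, -1.5, -1.2, -0.8, -0.5, -0.3, 0.1, 0.4, 0.9, 1.6]
labels = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

a, b = fit_platt(scores, labels)
platt_probs = [sigmoid(a * s + b) for s in scores]

steps = fit_isotonic(scores, labels)
iso_probs = [predict_isotonic(steps, s) for s in scores]

for name, probs in (("Platt", platt_probs), ("Iso", iso_probs)):
    print(name, "AUC=%.3f RMSE=%.3f SAR=%.3f"
          % (auc(probs, labels), rmse(probs, labels), sar(probs, labels)))
```

In practice the calibrator would be fitted on held-out scores rather than on the same data used for evaluation; the sketch reuses one toy set only to stay short.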
