Are Bigger Data Sets Better for Machine Learning? Fusing Single-Point and Dual-Event Dose Response Data for Mycobacterium tuberculosis
  • Authors: Sean Ekins ; Joel S. Freundlich ; Robert C. Reynolds
  • Journal: Journal of Chemical Information and Modeling
  • Publication year: 2014
  • Publication date: July 28, 2014
  • Volume: 54
  • Issue: 7
  • Pages: 2157-2165
  • Full-text size: 402K
  • ISSN: 1549-960X
Abstract
Tuberculosis is a major, neglected disease for which the quest to find new treatments continues. There is an abundance of data from large phenotypic screens in the public domain against Mycobacterium tuberculosis (Mtb). Since machine learning methods can learn from past data, we were interested in addressing whether more data builds better models. We now describe using Bayesian machine learning to assess whether we can improve our models by combining the large quantities of single-point data with the much smaller (higher quality) dual-event data sets, which use both dose–response data for both whole-cell antitubercular activity and Vero cell cytotoxicity. We have evaluated 12 models ranging from different single-point, dual-event dose–response, single-point and dual-event dose–response as well as combined data sets for three distinct data sets from the same laboratory. We used a fourth data set of active and inactive compounds from the same group as well as a smaller set of 177 active compounds from GlaxoSmithKline as test sets. Our data suggest combining single-point with dual-event dose–response data does not diminish the internal or external predictive ability of the models based on the receiver operator curve (ROC) for these models (internal ROC range 0.83–0.91, external ROC range 0.62–0.83) compared to the orders of magnitude smaller dual-event models (internal ROC range 0.6–0.83 and external ROC 0.54–0.83). In conclusion, models developed with 1200–5000 compounds appear to be as predictive as those generated with 25 000–350 000 molecules. Our results have implications for justifying further high-throughput screening versus focused testing based on model predictions.
