
Biological studies are data-intensive by nature, and the past decade has seen a rapid accumulation of many types of biological data. Because biology is complex, it is challenging to select the most relevant features and to build mechanism-based models from this flood of data. In this thesis, we applied machine learning to two problems: predicting the kinetic constants of proteins from features generated by Rosetta, and predicting mutations in the genome of Escherichia coli (E. coli) under a given culture condition. Building machine learning models requires high-quality, standardized data around a biological problem, so a mutation database was curated from the literature for mutation prediction. Because research is ongoing by nature, it is common to design new experiments to fill gaps or resolve ambiguity in the data already collected. Given a limited budget, it is imperative to select the most valuable experiments to run. We applied an active learning (optimal experimental design) technique that uses a Gaussian process (GP) to quantify the uncertainty and representativeness of each candidate experiment. The most uncertain and representative candidates were selected, and the corresponding data were collected in a wet lab. On a transcriptomic profiling problem, in which the transcriptomic profile of E. coli was predicted by GP models trained on transcriptomic profiles from other culture conditions, our approach reduced the number of data points needed to reach the same prediction accuracy by 44%. The optimal experimental design framework consists of two modules: a predictive model and a utility score that quantifies the information content of a candidate experiment. The framework can be applied to other scenarios by replacing the predictive model with one suited to the scenario at hand.
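The two-module design described above (a predictive model plus a utility score) can be illustrated with a minimal sketch. It is not the thesis implementation: the one-dimensional culture-condition encoding, the squared-exponential kernel, the function names, and the choice to combine uncertainty and representativeness by multiplication are all assumptions made for illustration. Uncertainty is taken as the GP posterior variance, and representativeness as the mean kernel similarity between a candidate and the rest of the candidate pool.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two scalar conditions (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def posterior_variance(x_train, x_new, noise=1e-6, length_scale=1.0):
    """GP posterior variance at x_new: k(x,x) - k*^T K^{-1} k*.

    The variance depends only on where measurements were taken,
    not on the measured values.
    """
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    k_star = [rbf(a, x_new, length_scale) for a in x_train]
    v = forward_solve(cholesky(K), k_star)
    return max(rbf(x_new, x_new, length_scale) - sum(t * t for t in v), 0.0)

def representativeness(x, pool, length_scale=1.0):
    """Mean kernel similarity of x to the candidate pool (assumed definition)."""
    return sum(rbf(x, p, length_scale) for p in pool) / len(pool)

def utility(x_train, x, pool, length_scale=1.0):
    """Hypothetical utility score: uncertainty times representativeness."""
    return (posterior_variance(x_train, x, length_scale=length_scale)
            * representativeness(x, pool, length_scale))

def select_next(x_train, pool, length_scale=1.0):
    """Return the candidate experiment with the highest utility."""
    return max(pool, key=lambda x: utility(x_train, x, pool, length_scale))

# Two conditions already measured; five candidate conditions remain.
measured = [0.0, 1.0]
candidates = [0.1, 2.9, 3.0, 3.1, 6.0]
print(select_next(measured, candidates))
```

In this toy setting the candidate at 0.1 scores low because it sits next to conditions that were already measured, and the isolated candidate at 6.0 scores low because it resembles few other candidates. The selection therefore falls in the cluster around 3.0, which is both uncertain and representative. Swapping in a different predictive model only requires replacing `posterior_variance` with that model's own uncertainty estimate.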
Publication Date:
2020-01-01
ISBN-13:
9798664726923