Our research lies at the intersection of machine learning, biology and medicine. We are developing machine learning algorithms that will enable the use of an individual’s comprehensive biological information to predict or diagnose diseases, and to find or develop the best therapy for that individual.

It has recently become possible to retrieve molecular-level information from an individual, such as DNA sequence, gene expression levels in various tissues, epigenomic profile and other information.  While such data is increasingly available, we are still unable to understand the genetic and molecular mechanisms that cause diseases.  The challenge is due to the multifactorial nature of disease.  The same disease can be caused by mutations in different genes or different pathogenic pathways.  Unfortunately, current data analysis approaches fail to capture the complex relationship between disease and the vast amount of information in the molecular data.

The aim of our research is to address this challenge by developing machine learning algorithms that jointly model sophisticated interactions among many variables, such as genetic variation, genes, pathways and disease, and that learn robustly from vast amounts of data in order to better understand and treat disease. An approach that can robustly infer the pathways underlying disease processes would substantially improve our understanding of disease and advance personalized treatment. We aim to realize this goal by using modern machine learning techniques.

Our research focuses on the following areas:

Systems biology of human diseases

Most human diseases are heterogeneous and multifactorial in origin.  The same disease can be caused by (a combination of) mutations in different genes or multiple pathogenic pathways. Because of that, two individuals with seemingly similar tumors sometimes have very different responses to chemotherapy and other treatments, as well as drastically different survival outcomes. In order to better understand this phenomenon, biologists have developed ways to obtain a molecular “snapshot” of an individual.  This snapshot includes the individual’s DNA sequence, gene expression levels in various tissues and other detailed information. 

We aim to apply or develop advanced machine learning algorithms that can convert the snapshots of many individuals into useful knowledge. The main difference from other methods is that our method gives a more detailed explanation. Other approaches try to explain the cause of disease by indicating that "a mutation on gene X is associated with disease Y," whereas our approach might indicate that "a mutation on gene X turns on pathways A, B and C (making ~100 genes interact in certain ways), which leads to disease Y with 85% chance." Because this kind of explanation describes how the biological system works, it can facilitate novel discoveries.


Drawing this kind of explanation from vast amounts of data is a statistically daunting task, because there are many possible explanations involving mutations, genes, pathways and disease. To address this challenge, we are developing innovative approaches that combine theoretically founded probabilistic models with biological knowledge.

Our research will provide an innovative approach to personalized medicine. Doctors currently make treatment decisions based on experience with past patients, clinical measurements and other information, but not on the molecular snapshot of an individual patient. Our research will enable them to use this information to make a more informed decision for that particular patient. Our approach aims to transform the way health conditions are diagnosed and the way treatments are selected, so that each individual receives the treatment that works best for them.

We are applying our algorithms to better understand and treat cancer, Alzheimer's disease, and cardiovascular diseases. We have close collaborations with biologists at UW Medicine, UW Genome Sciences, UW Medical Center, Fred Hutchinson Cancer Research Center, UW Cardiovascular Health Research Unit, Institute for Systems Biology, UCLA Medicine, and Stanford. The local collaborations also facilitate experimental or clinical validation of the hypotheses generated by our computational models, which will amplify the impact of our approaches.

Publications

  • Maxim Grechkin and Su-In Lee (2013). Identifying Perturbed Genes in the Regulatory Networks from Gene Expression Data. NIPS Workshop on Machine Learning in Computational Biology. Oral presentation (acceptance rate: 20%)
  • Safiye Celik, Benjamin A. Logsdon, and Su-In Lee (2013). Sparse Estimation of Module Gaussian Graphical Models with Applications to Cancer Systems Biology. NIPS Workshop on Machine Learning in Computational Biology. Oral presentation (acceptance rate: 20%).
  • K. Mohan, M. Chung, S. Han, D. Witten, S.-I. Lee, M. Fazel (2012). Structured Sparse Learning of Multiple Gaussian Graphical Models. To appear in Neural Information Processing Systems (NIPS).
  • S.M. Schwartz, H.T. Schwartz, S. Horvath, E. Schadt, S.-I. Lee (2012). A Systematic Approach to Multifactorial Cardiovascular Disease: Causal Analysis. To appear in Arteriosclerosis, Thrombosis, and Vascular Biology.
  • R.P. Patwardhan, J.B. Hiatt, D.M. Witten, M.J. Kim, R.P. Smith, D. May, C. Lee, J.M. Andrie, S.-I. Lee, G.M. Cooper, N. Ahituv, L.A. Pennacchio, J. Shendure (2012). Massively parallel functional dissection of mammalian enhancers in vivo. Nature Biotechnology, 30(3), 265-70.
  • I.M. Dykes, L. Tempest, S.-I. Lee, E. Turner (2011). Brn3a and Islet1 act epistatically to regulate the gene expression program of sensory differentiation. Journal of Neuroscience, 31(27), 9789-99.
  • A.J. Gentles, A.A. Alizadeh, S.-I. Lee, J.H. Myklebust, B. Shahbaba, C.M. Shachaf, R. Levy, D. Koller, S.K. Plevritis (2009). A pluripotency signature predicts histologic transformation and influences survival in follicular lymphoma patients. Blood, 114(15), 3133-4.

Machine learning research

Biology is one of the most challenging application areas for machine learning. Biological problems often come with far fewer training samples than other applications, due to the cost of acquiring data and various restrictions on doing so. Biological systems involve sophisticated interactions among many biomolecules, which requires the use of complex models. The field of biology therefore provides tremendous opportunities to identify new, challenging machine learning problems. We propose novel machine learning approaches motivated by biological problems, which will contribute to advancing the field of machine learning in addition to improving our insights into complex diseases.
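The tension described above, few samples but many candidate variables, is one reason sparsity-inducing (L1) regularization is a common tool in this setting. The following is a minimal sketch, not the group's actual method: it fits an L1-regularized logistic regression by proximal gradient descent on synthetic data in which only the first two of 30 features carry signal. The data, the function names, and the values of `lam`, `step` and `iters` are all assumptions chosen for illustration, and the model omits an intercept for brevity.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.5, iters=300):
    """Minimize mean logistic loss + lam * ||w||_1 by proximal gradient."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # Gradient step followed by soft-thresholding, which produces exact zeros.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w


# Synthetic data (assumption): 60 samples, 30 features, only features 0 and 1 matter.
rng = random.Random(0)
n_samples, n_features = 60, 30
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [1.0 if row[0] - row[1] > 0 else 0.0 for row in X]

weights = fit_l1_logistic(X, y, lam=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("features with nonzero weight:", selected)
```

The L1 penalty drives the weights of uninformative features to exactly zero, so the fitted model doubles as a feature selector, which is useful when samples are too scarce to estimate a dense model reliably.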

Publications

  • Safiye Celik, Benjamin A. Logsdon, and Su-In Lee (2013). Sparse Estimation of Module Gaussian Graphical Models with Applications to Cancer Systems Biology. NIPS Workshop on Machine Learning in Computational Biology. Oral presentation (acceptance rate: 20%).
  • K. Mohan, M. Chung, S. Han, D. Witten, S.-I. Lee, M. Fazel (2012). Structured Sparse Learning of Multiple Gaussian Graphical Models. To appear in Neural Information Processing Systems (NIPS).
  • S. Yang, L. Shapiro, M. Cunningham, M. Speltz, C. Birgfeld, I. Atmosukarto, S.-I. Lee (2012). Skull Retrieval for Craniosynostosis Using Sparse Logistic Regression Models. Proceedings of the 15th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). - best paper award!
  • S. Balakrishnan, H. Kamisetty, J.C. Carbonell, S.-I. Lee, C.J. Langmead (2011). Learning Generative Models for Protein Fold Families. PROTEINS: Structure, Function, and Bioinformatics, 79(4), 1061-78.
  • S. Balakrishnan, H. Kamisetty, J.C. Carbonell, S.-I. Lee, C.J. Langmead (2010). Learning Networks of Statistical Couplings in Protein Fold Families using L1-regularization. Proceedings of 3DSIG Structural Bioinformatics and Computational Biophysics.
  • S.-I. Lee, V. Chatalbashev, D. Vickrey, D. Koller (2007). Learning a Meta-Level Prior for Feature Relevance from Multiple Related Tasks. Proceedings of International Conference on Machine Learning (ICML).
  • S.-I. Lee, V. Ganapathi, D. Koller (2007). Efficient Structure Learning of Markov Networks using L1-Regularization. Proceedings of Neural Information Processing Systems (NIPS).
  • S.-I. Lee, H. Lee, P. Abbeel, A.Y. Ng (2006). Efficient L1 Regularized Logistic Regression. Proceedings of the 21st National Conference on Artificial Intelligence (AAAI).
  • S.-I. Lee, S. Batzoglou (2004). ICA-based Clustering of Genes from Microarray Expression Data. Proceedings of Neural Information Processing Systems (NIPS).

Predictive medicine

We are developing machine learning algorithms for data-driven predictive modeling of various disease conditions. For instance, we aim to develop a computational model for early detection of pneumonia in patients in the intensive care unit (ICU). Our prediction model will take various measurements made in real time in the ICU as input and will predict the probability that pneumonia develops within the next couple of hours, so that clinicians can intervene earlier for patients in critical condition.
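As a rough illustration of the input-to-probability mapping described above, the sketch below summarizes a short window of recent readings into a few features and passes them through a logistic function. Everything specific here is hypothetical: the choice of measurements (temperature and respiratory rate), the feature set, and above all the weights and bias, which are made-up numbers with no clinical meaning. A real model would learn its parameters from patient data.

```python
import math

# Hypothetical parameters for illustration only; they are NOT clinically derived.
WEIGHTS = {"temp_mean": 0.8, "temp_slope": 2.0, "resp_mean": 0.15}
BIAS = -34.0


def slope(values):
    """Least-squares slope of the readings against their index (needs >= 2 readings)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def risk_probability(temps_c, resp_rates):
    """Map a window of recent readings to a probability via a logistic model."""
    features = {
        "temp_mean": sum(temps_c) / len(temps_c),
        "temp_slope": slope(temps_c),
        "resp_mean": sum(resp_rates) / len(resp_rates),
    }
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


# Two made-up windows of four consecutive readings each.
stable = risk_probability([37.0, 37.1, 37.0, 37.0], [16, 16, 17, 16])
rising = risk_probability([37.2, 37.8, 38.4, 39.0], [20, 22, 24, 26])
print(f"stable window: {stable:.2f}, rising window: {rising:.2f}")
```

Including a trend feature such as the slope lets the score react to a deteriorating trajectory and not only to the current level, which is the point of predicting a few hours ahead.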

We are also interested in developing a medical diagnosis system that can classify a patient based on the information in his or her 3D craniofacial image data.  The main challenge is in the complexity of the 3D image data.  We are using advanced machine learning techniques that can effectively retrieve relevant information from 3D image data for classification of diseases.

Publications

  • S. Yang, L. Shapiro, M. Cunningham, M. Speltz, C. Birgfeld, I. Atmosukarto, S.-I. Lee (2012). Skull Retrieval for Craniosynostosis Using Sparse Logistic Regression Models. Proceedings of the 15th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). (acceptance rate: 30%)  - best paper award!
  • S. Bilge, J.-N. Hwang, S.-I. Lee, L. Shapiro (2012). Tremor Detection Using Motion Filter and SVM. Proceedings of the 21st International Conference on Pattern Recognition. 
  • B. Soran, Z. Xie, R. Tungaraza, S.-I. Lee, L. Shapiro, T. Grabowski (2012). Parcellation of Human Inferior Parietal Lobule Based On Diffusion MRI. Proceedings of the 34th Annual International Conference of the IEEE Engineering in Medicine & Biology Society, Engineering Innovation in Global Health. - selected for oral presentation.
  • S. Yang, L. Shapiro, M.L. Cunningham, M. Speltz, S.-I. Lee (2011). Classification and Interest Region Localization on Craniosynostosis Skulls. Proceedings of ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM BCB).

Systems genetics

Biological organisms differ in numerous observable characteristics, termed phenotypes. There is substantial evidence that many phenotypes are affected to varying degrees by an individual’s specific genotype stored in its DNA. Identifying the complex relationship between genotype and phenotype is one of the fundamental goals in biology. The central dogma of molecular biology describes how DNA sequence information flows to cellular processes: a specific region of the genome, called a gene, is transcribed into RNA, which is then translated into protein (a process called gene expression). Proteins provide the basic building blocks of all cellular activities. Cellular activities are regulated by a complex web of interactions among proteins and genes called a gene regulatory network. Thus, variations in the DNA sequence, such as a single nucleotide polymorphism (SNP), a difference of a single base, can affect the gene regulatory network, which in turn leads to phenotypic changes.

We consider the problem of inferring such a causal pathway from genotype to phenotype from biological data. Suppose that we obtain genotype, intermediate phenotype (e.g., gene expression) and phenotype data from a population of individuals. Given these data, we aim to infer the interaction network among genotype variations, gene expression, and phenotypes. Our approach explicitly models sophisticated interactions among many molecules by using probabilistic graphical models, together with machine learning techniques for incorporating prior knowledge and robustly learning the structure of the models. We aim to gain a systems-level understanding of how genetic variation leads to phenotypic changes in complex molecular networks.
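The conditional-independence reasoning that underlies graphical models of such a pathway can be sketched on a toy example. The code below simulates a hypothetical chain genotype → expression → phenotype (a linear-Gaussian model with made-up effect sizes, not real data or the group's model) and compares the marginal correlation between genotype and phenotype with their partial correlation given expression.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing the effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Simulated chain (assumption): genotype affects phenotype only through expression.
rng = random.Random(0)
n = 2000
genotype = [rng.choice([0, 1, 2]) for _ in range(n)]       # e.g. minor-allele count
expression = [g + rng.gauss(0, 1) for g in genotype]        # intermediate phenotype
phenotype = [e + rng.gauss(0, 1) for e in expression]       # downstream trait

marginal = pearson(genotype, phenotype)
conditional = partial_corr(genotype, phenotype, expression)
print(f"marginal: {marginal:.2f}, given expression: {conditional:.2f}")
```

In this simulation genotype and phenotype are clearly correlated, yet the association largely vanishes once expression is accounted for, which is the signature of an edge structure in which expression mediates the effect. Structure learning in graphical models generalizes this idea to many variables at once.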

Publications

  • S.-I. Lee, A.M. Dudley, D. Drubin, P.A. Silver, N.J. Krogan, D. Pe’er, D. Koller (2009). Learning a Prior on Regulatory Potential from eQTL Data. PLoS Genetics, 5(1), e1000358.
  • S.-I. Lee, V. Chatalbashev, D. Vickrey, D. Koller (2007). Learning a Meta-Level Prior for Feature Relevance from Multiple Related Tasks. Proceedings of International Conference on Machine Learning (ICML).
  • S.-I. Lee, D. Pe’er, A.M. Dudley, G.M. Church, D. Koller (2006). Identifying Regulatory Mechanisms using Individual Variation Reveals Key Role for Chromatin Modification. Proceedings of the National Academy of Sciences (PNAS), 103, 14062-14067.