Interpretable Biomarker Discovery in Bio-Chemoinformatics

Research Areas - Reproducible Artificial Intelligence (A.I.) Applications in Bio-Chemoinformatics


The research portfolio of Soufan Lab encompasses cheminformatics and bioinformatics with applications in life sciences.


Soufan Lab aims to advance life sciences by developing innovative methods, systems and resources for targeted knowledge discovery from biological data.


  • Develop new A.I. approaches that automatically gather, integrate, and analyze large or small amounts of bio-chemical data so as to reveal existing information and produce new knowledge, each with an associated degree of confidence.
  • Integrate experimental design, data generation and data analysis.
  • Train highly skilled specialists in our multi- and inter-disciplinary environment.
  • Engage with industrial & government partners in technology development and transfer.


More than ever, biological and chemical data are becoming available and accessible, driven by advances in next-generation sequencing, mass spectrometry, array-based methods and other technologies. Bio-cheminformatics covers a range of computational methods that can be used to predict interactions between biomolecules (e.g., proteins) and chemicals (e.g., ligands) at large scale. For example, using gene expression studies to scale up screening of the biological activities of chemicals is a frontier in environmental and health studies. In the context of drug discovery, bio-cheminformatics makes it possible to tackle the problem of predicting off-target proteins that lead to side effects, which in turn limit the efficacy of many existing medicines. Associating compounds with proteins is necessary to process structural alerts in environmental toxicology and to detect patterns in chemicals that can cause particular adverse effects in organs. Other complex interactions between chemicals and omics layers (transcriptomics, proteomics, metabolomics, etc.) shape domains such as nutrigenomics, which studies the relationship between the human genome, nutrition and health. All of this is key to understanding molecular mechanisms and revealing detailed interactions in living systems, which will eventually help in tackling interrelated questions about treatments, long-term effects and the impact of the environment. The three main components of the proposed research program are listed next.

1) Developing an AI platform for biomarker discovery

With the expansion of omics data (in both volume and dimensionality), there is a need for faster, more reliable and more cost-effective AI models to find the most relevant variables. Biomarker discovery aims at finding the top indicators that explain connections to treatment conditions (time, dose, exposure) and target metadata (age, tissue, histology). In complex biological systems, biomarker analysis facilitates understanding of the underlying mechanisms and assists in capturing the states and changing signatures of genes, proteins, metabolites and chemicals.
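
As a rough illustration of what "finding the most relevant variables" can mean in its simplest form, the sketch below ranks features by the absolute Welch t-statistic between a control group and a treated group. This is only a baseline under stated assumptions, not the lab's method: the function names (welch_t, rank_features), the synthetic data and the planted signal in feature 7 are all hypothetical and chosen for illustration.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


def rank_features(control, treated, top_k=3):
    """Return the indices of the top_k features ranked by |t| between groups.

    control, treated: lists of samples; each sample is a list of feature values.
    """
    n_features = len(control[0])
    scores = []
    for j in range(n_features):
        a = [sample[j] for sample in control]
        b = [sample[j] for sample in treated]
        scores.append((abs(welch_t(a, b)), j))
    scores.sort(reverse=True)  # largest |t| first
    return [j for _, j in scores[:top_k]]


# Synthetic data (not real measurements): 20 features, 30 samples per group.
rng = random.Random(0)
control = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(30)]
treated = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(30)]

# Plant a signal: feature 7 is shifted upward in the treated group.
for sample in treated:
    sample[7] += 3.0

top = rank_features(control, treated)
print(top)  # the planted feature should be ranked first
```

A univariate ranking like this ignores interactions between variables and does not correct for multiple testing, which is part of why more capable models are needed for high-dimensional omics data.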

2) Implementing an omics-based library for de novo synthesis of chemicals

The influence of chemicals on biological targets varies not only with genetic factors but also, to a large extent, with environmental ones. As the environmental conditions that shape the intake of chemicals change rapidly, there is a growing need to discover novel chemical structures. Domain applications will range from finding new cures (i.e., health) to characterizing unknown mixtures with reported toxic effects (i.e., toxicology). Given the sheer magnitude of the chemical search space (10^200 molecules could exist) and the limited functional reference libraries, the goal is to develop a solution to learn, predict and characterize the functions of novel chemical structures.

3) Reproducibility, validation and translational research collaborations

Key challenges in bio/cheminformatics data analysis include, but are not limited to, access to standardized analysis workflows, interactive analytics for decision making, sharing and reproducibility. Reproducibility means sharing not only the source code but also the training data, parameters, processing steps and every other detail needed to reduce the effects of randomness (e.g., the random numbers generated to kick off model training).
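
One way to picture the point about randomness is a run "manifest" that stores the seed, the parameters and a checksum of the data alongside the code. The sketch below is a minimal illustration using only the Python standard library; run_experiment is a toy stand-in for model training, and the names make_manifest, train_fraction and the manifest fields are assumptions made here, not part of any existing workflow.

```python
import hashlib
import json
import random


def run_experiment(seed, params, data):
    """Toy stand-in for training: the only randomness comes from a seeded RNG."""
    rng = random.Random(seed)  # local generator, independent of global state
    shuffled = data[:]
    rng.shuffle(shuffled)
    split = int(len(shuffled) * params["train_fraction"])
    return {"train": shuffled[:split], "test": shuffled[split:]}


def make_manifest(seed, params, data):
    """Record what is needed to repeat the run: seed, parameters, data checksum."""
    data_bytes = json.dumps(data, sort_keys=True).encode("utf-8")
    return {
        "seed": seed,
        "params": params,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }


data = list(range(10))
params = {"train_fraction": 0.8}
manifest = make_manifest(42, params, data)

# Re-running from the manifest gives an identical train/test split.
first = run_experiment(manifest["seed"], manifest["params"], data)
second = run_experiment(manifest["seed"], manifest["params"], data)
print(first == second)
print(json.dumps(manifest, indent=2))
```

In a real analysis the manifest would also need to capture software versions and each preprocessing step, since a fixed seed alone does not guarantee identical results across library versions or hardware.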