28th Conference on Intelligent Systems for Molecular Biology

ISMB 2020

Welcome to the ISMB 2020 Virtual Conference Platform!

Welcome to ISMB 2020, the flagship meeting of ISCB!

The International Society for Computational Biology (ISCB) is grateful for your support and participation as we navigate these unprecedented times. We, like you, were devastated when we had to cancel our face-to-face conference, but we knew we needed to find a way to continue advancing computational research and to give our members a platform to disseminate research findings and network.

Over the course of the next four days, you will have the opportunity to hear and interact live with speakers presenting over 400 talks in 22 Communities of Special Interest (COSI) tracks, Special Sessions, Technology Tracks, and a Workshop on Bioinformatics Education (WEB). You will also be able to browse more than 700 posters in the virtual Poster Hall, where authors will be standing by to answer your questions via the chat feature. And don’t forget to stop by our sponsorship and exhibitor section to learn more about publishing opportunities, services, tools, and job openings.


Schedule

All times are Eastern Daylight Time (EDT)
09:30 AM - 10:30 AM (EDT)
Keynotes - Computational Analysis in Pediatric Cancer Precision Medicine
Computational analysis provides a critical component of cancer NGS data evaluation: the identification of variants and their interpretation, as well as databasing and data sharing. Refinements in the context of immunogenomic analyses further expand the toolkit for precision-medicine evaluation. My lecture will illustrate how the interplay of data and analytics produces a high yield of medically meaningful data.
10:35 AM - 10:40 AM (EDT)
SysMod - Introduction to SysMod 2020
This talk introduces the community of special interest for systems modeling and the first virtual meeting in 2020.
10:40 AM - 10:50 AM (EDT)
BIOINFO-CORE - BioNet Alberta: A network-based approach to bioinformatics capacity building in Alberta
BioNet Alberta is a unique approach to capacity building in Bioinformatics and Computational Biology (B/CB) across the province of Alberta. This network-based approach focuses on building community engagement, encouraging collaboration between local researchers, and providing administrative support for any activity that would further propel B/CB in the province. One of the key deliverables of BioNet is a pan-Alberta bioinformatics service platform, currently under development, capable of analyzing large amounts of bulk experimental data. Located on campus at the University of Lethbridge and operated in collaboration with the Southern Alberta Genome Sciences Centre, this service platform provides a local option for B/CB analysis, with a special focus on combining both Illumina and Nanopore sequencing data.
10:40 AM - 11:00 AM (EDT)
SysMod - Maintenance energy is essential for accurate predictions of intracellular fluxes in CHO
Chinese Hamster Ovary (CHO) cells are the most valuable mammalian cell hosts for the production of complex protein biopharmaceuticals. Currently, cell line and bioprocess development must be done individually for each new product and relies mostly on trial-and-error approaches and the screening of thousands of clones. Systems biology approaches, including metabolic modeling, could elucidate bottlenecks in protein production and enable more targeted process and cell line engineering. The reconstruction of the CHO genome-scale metabolic model (GSMM) was an important step towards applying these methods to CHO. The model can reliably predict growth rates using flux balance analysis (FBA). However, the quality of the predicted internal fluxes has not yet been validated. In this work, we compared the fluxes predicted by parsimonious FBA (pFBA), using the CHO GSMM, with published 13C flux data. Our analysis revealed that most fluxes of the central carbon metabolism are grossly underestimated if cellular maintenance energy (mATP) is not accounted for (R^2=0.44). Because of the lack of experimental data for CHO, we estimated mATP computationally. Adding the estimated mATP as a constraint significantly improved the pFBA predictions of internal fluxes for all data sets (R^2=0.96). Furthermore, we validated the mATP experimentally for the CHO-K1 cell line.
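To make the role of the maintenance-energy constraint concrete, here is a minimal FBA sketch on a hypothetical toy network (not the CHO GSMM; the stoichiometry and numbers are illustrative only): fixing a maintenance-ATP flux forces extra carbon into energy metabolism and changes the predicted internal fluxes, mirroring the underestimation described in the abstract.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network: metabolites G, P, ATP; columns are reactions
# v1 uptake, v2 glycolysis, v3 respiration, v4 biomass, v5 ATP maintenance.
S = np.array([
    [1, -1,  0,  0,  0],   # G:   uptake - glycolysis
    [0,  1, -1, -1,  0],   # P:   glycolysis - respiration - biomass
    [0,  2, 10, -5, -1],   # ATP: production - biomass/maintenance demand
])

def pfba_growth(matp):
    """Maximize biomass flux at steady state (S v = 0) with a fixed
    maintenance-ATP flux mATP; returns the full flux vector."""
    res = linprog(c=[0, 0, 0, -1, 0],            # maximize v4 (biomass)
                  A_eq=S, b_eq=np.zeros(3),
                  bounds=[(0, 10), (0, None), (0, None),
                          (0, None), (matp, matp)])
    return res.x

v_no_matp = pfba_growth(0.0)
v_with_matp = pfba_growth(30.0)
# Without the maintenance constraint the respiratory flux v3 is much lower:
print("respiratory flux without mATP:", round(v_no_matp[2], 3))
print("respiratory flux with mATP:  ", round(v_with_matp[2], 3))
```

In this toy model the respiratory flux doubles (from 2 to 4) once the maintenance demand is imposed, illustrating how omitting mATP grossly underestimates central carbon fluxes.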
10:40 AM - 11:00 AM (EDT)
TechTrack - The GenePattern Notebook Environment
The GenePattern Notebook environment, notebook.genepattern.org, integrates the Jupyter Notebook system with the GenePattern platform for integrative genomics analysis, making hundreds of analyses available within notebooks without the need for programming. Additional features allow publication and collaboration. A library of useful analysis notebook templates can be adapted to a scientist's needs.
10:40 AM - 11:20 AM (EDT)
Text Mining - Text Mining Keynote: Collaborative Community Text Mining and Semantic Computing for Biomedical Knowledge Discovery
In this talk, Dr. Wu will cover research in integrative literature mining, data mining and semantic computing for knowledge discovery, as well as the underlying cyberinfrastructure for collaborative research. To realize the value of genomic-scale data, her team has developed a semantic computing framework connecting text mining, data mining and biomedical ontologies. Natural language processing and machine learning approaches are employed for information extraction from the literature, along with an automated workflow for large-scale text analytics across documents. The ontological framework allows computational reasoning, and through federated SPARQL queries, it connects complex entities and relations such as gene variants, protein post-translational modifications and diseases mined from heterogeneous knowledge sources. Scientific use cases demonstrate data-driven discovery of gene-disease-drug relationships that may facilitate disease understanding and drug target identification for diseases ranging from Alzheimer's to COVID-19.
10:40 AM - 11:20 AM (EDT)
iRNA - iRNA Keynote: The Functional Iso-Transcriptomics toolset to leverage long-read sequencing for unraveling isoform transcriptional networks from single cells
Long-read sequencing is revolutionizing our ability to study transcriptome diversity and dynamics by empowering the detection of full-length transcripts. However, long reads pose novel bioinformatics challenges, both for removing technology biases and for leveraging the discovery potential of the new data. In this talk I will present the Functional Iso-Transcriptomics analysis toolset, composed of SQANTI, IsoAnnot and tappAS, which covers quality control, annotation and statistical analysis of long-read transcriptome data. I will also illustrate how to use these tools to unravel isoform transcriptional networks by combining long- and short-read single-cell RNA-seq data.
10:40 AM - 11:40 AM (EDT)
Function - Function COSI Keynote: DNA methylases – computation, experiments and new biology
10:40 AM - 11:40 AM (EDT)
MLCSB: Machine Learning - Emergent pathogens, vaccines and therapeutics: how can computation accelerate discovery?
10:45 AM - 11:40 AM (EDT)
CAMDA - CAMDA KEYNOTE: Climate, Oceans, and Human Health: Cholera as a paradigm for predicting infectious diseases
Climate and the oceans historically have been closely intertwined with human health. Today, significant advances in information technology have brought new discoveries - from the outer reaches of space, where remote-sensing monitors on satellites circle the earth, to the ultramicroscopic, through application of next-generation sequencing and bioinformatics. Vibrio cholerae provides a useful example of the fundamental link between human health and the oceans. This bacterium is the causative agent of cholera and is associated with major pandemics, yet it is a marine bacterium with versatile genetics that is distributed globally in estuaries, notably the Bay of Bengal, but also in coastal regions and aquatic systems throughout the world. Vibrio species, both nonpathogenic and those pathogenic for humans, marine animals, or marine vegetation, play a fundamental role in nutrient cycling. They have also been shown to respond to warming of the surface waters of the North Atlantic, with increases in their numbers correlated with increased incidence of Vibrio disease in humans. The models we have developed for understanding and predicting outbreaks of cholera are based on work done in the Chesapeake Bay and the Bay of Bengal, and these models are now used by UNICEF and aid agencies to predict cholera in Yemen and other countries of the African continent. With the onset of COVID-19, these models are being modified to predict the incidence of SARS-CoV-2 and COVID-19, the current coronavirus pandemic. In summary, molecular microbial ecology coupled with computational science can provide a critical indicator and prediction of human health and wellness. How this is being accomplished, and how we are beginning to understand environmental aspects of COVID-19, will be discussed in this talk.
11:00 AM - 11:20 AM (EDT)
SysMod - Using genome-scale model of metabolism and macromolecular expression (ME-model) to study biofilm development in Pseudomonas aeruginosa PAO1
In this work, we seek to use genome-scale models (GEMs) to gain mechanistic insights into biofilm formation and development in Pseudomonas aeruginosa PAO1. P. aeruginosa is an opportunistic human pathogen which is the main cause of mortality in cystic fibrosis (CF) patients and one of the leading nosocomial pathogens affecting hospitalized patients. P. aeruginosa possesses a wide array of mechanisms for antibiotic resistance, including biofilms, which primarily consist of extracellular polymeric substances (EPS) such as DNA, proteins and polysaccharides. Within EPS, multiple interactions can occur that make biofilms a robust protective barrier against multiple antibiotics. We will apply GEMs to study biofilm development in P. aeruginosa. GEMs rely on mathematical optimization principles using flux balance analysis (FBA) and constraint-based modelling approaches. Within the GEM framework, ME (metabolism and macromolecular expression) models can predict the interplay between the expression of macromolecules and the metabolic state of an organism under a given genetic and environmental condition. Using a ME-model of P. aeruginosa, we intend to determine the genotype-phenotype relationship involved in the expression of EPS and biofilm development. Such knowledge can then be used to predict possible mechanisms to disrupt biofilms and treat infections caused by P. aeruginosa.
11:00 AM - 11:20 AM (EDT)
TechTrack - Expression Atlas: A platform for integrating and displaying expression data from tissues to single cells
Expression Atlas at EMBL-EBI provides free resources that facilitate the submission, archival, reprocessing and visualisation of functional genomics data. This gives researchers a powerful way to identify where their favourite gene is expressed and how its expression changes in disease. Our tools are continuously updated to incorporate datasets using new technologies.
11:00 AM - 11:10 AM (EDT)
BIOINFO-CORE - Chickens in Space: our experiences with spatial transcriptomics on the 10x Visium and slide-seq platforms
There have been many recent developments in the area of spatial transcriptomics. Here we will discuss our experiences with our initial projects on the 10x Visium and slide-seq platforms on slices of developing chicken embryo trunk. We also have corresponding scRNA-seq data for comparison and integration. This talk will be geared towards our fellow bioinformaticians in hopes of giving them more information about this type of data and some of the potential analysis that can be used, as well as potential pitfalls.
11:10 AM - 11:20 AM (EDT)
BIOINFO-CORE - Multi-sample, multi-condition analysis in scRNAseq data sets
Single-cell RNA sequencing (scRNAseq) is a novel technology revolutionizing our ability to study tissues. By measuring transcriptomes at the single-cell level, scRNAseq enables identification of cellular heterogeneity within a tissue sample in far greater detail than conventional (bulk) RNA sequencing, where the diversity of the genetic signal is averaged over the whole sample. However, this additional granularity brings an extra layer of data complexity, which in turn makes data interpretation more difficult. In this talk we will discuss the main aspects and problems of scRNAseq data analysis, as well as the numerous state-of-the-art algorithms and techniques developed to extract information from such data, in the context of multiple samples and multiple conditions.
11:10 AM - 11:40 AM (EDT)
SST01 - The impact of cancer associated CTCF mutations on chromatin architecture and gene regulation
11:20 AM - 11:40 AM (EDT)
Breakout session 1
We will break into small groups for discussion of topics TBA.
11:20 AM - 11:30 AM (EDT)
iRNA - Swan: a Python library for the analysis and visualization of long-read transcriptomes
Long-read RNA-sequencing platforms such as PacBio and Oxford Nanopore have led to an explosion in the discovery of transcript isoforms that were impossible to assemble with short reads. Current transcript-model visualization tools are difficult to interpret at genomic scale and make it hard to distinguish similar isoforms. We introduce the Swan Python library, which is designed for the analysis and visualization of transcript models. Swan offers a robust visualization suite for easily differentiating splicing events. Using a graphical model approach, Swan provides a platform to visually discriminate between transcript models and to identify novel exon-skipping as well as intron-retention events that are commonly missed in short-read transcriptomics. Furthermore, Swan is integrated with flexible differential gene and transcript expression statistical tools that enable the analysis of full-length transcript models in different biological settings. We demonstrate the utility of this software by applying Swan to the HepG2 and HFFc6 human cell lines, which have full-length PacBio transcriptome data available on the ENCODE portal. Swan found 4,503 differentially expressed transcripts, including 280 transcripts that are differentially expressed even though the parent gene is not. Swan provides a comprehensive environment to analyze long-read transcriptomes and produce high-quality publication-ready figures.
11:20 AM - 11:40 AM (EDT)
TechTrack - HiSCiAp: User-friendly, scalable tools and workflows for single-cell analysis
HiSCiAp is a user-friendly Galaxy setup offering ~80 single-cell analysis modules from tools such as Seurat, Scanpy, Monocle3, SCMap and others, covering QC, clustering, trajectory analysis and cell mapping. The tools are available at humancellatlas.usegalaxy.eu, can be installed on any Galaxy instance, or can be used on the command line.
11:20 AM - 11:40 AM (EDT)
Text Mining - CellMeSH: Probabilistic Cell-Type Identification Using Indexed Literature
Single-cell RNA sequencing (scRNA-Seq) is becoming widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-Seq experiments aim to identify and quantify all cell-types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are then mapped to cell-types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious effort that requires expert knowledge. We introduce CellMeSH - a new approach to identify cell-types based on gene-set queries directly from literature. CellMeSH combines a database of gene cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and scales automatically. The probabilistic query method enables reliable information retrieval even though the gene cell-type associations extracted from the literature are necessarily noisy. CellMeSH achieves 60% top-1 accuracy and 90% top-3 accuracy in annotating the cell-types on a human dataset, and up to 58.8% top-1 accuracy and 88.2% top-3 accuracy on three mouse datasets, which is consistently better than existing approaches.
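As an illustration of the general idea of a probabilistic gene-set query against a literature-derived association table, here is a minimal sketch; the counts, the add-one smoothing, and the log-likelihood scoring below are hypothetical stand-ins, not CellMeSH's actual database or method.

```python
import math

# Hypothetical gene-to-cell-type co-occurrence counts, as might be mined
# from indexed literature: counts[cell_type][gene] = co-mentioning papers.
COUNTS = {
    "T cell": {"CD3E": 50, "CD8A": 30, "MS4A1": 1},
    "B cell": {"MS4A1": 40, "CD79A": 35, "CD3E": 2},
}

def score(cell_type, genes, vocab_size=1000):
    """Log-likelihood of the query gene set under a cell type, with
    add-one smoothing so noisy/missing associations never zero it out."""
    counts = COUNTS[cell_type]
    total = sum(counts.values())
    return sum(math.log((counts.get(g, 0) + 1) / (total + vocab_size))
               for g in genes)

def rank(genes):
    """Rank candidate cell types for a cluster's marker-gene query."""
    return sorted(COUNTS, key=lambda ct: score(ct, genes), reverse=True)

print(rank(["CD3E", "CD8A"]))  # T-cell markers rank "T cell" first
```

Smoothing is the key point: literature-derived associations are noisy, so a query method must tolerate spurious or missing gene-to-cell-type links rather than treating the table as ground truth.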
11:20 AM - 11:40 AM (EDT)
SysMod - Genome-scale metabolic modelling reveals key features of a minimal gene set
Minimal organisms represent a stepping stone for the rational design of entire genomes. Mesoplasma florum, a fast-growing near-minimal organism for which genetic engineering techniques have been developed, is an interesting model for this task. Using sequence and structural homology, the set of metabolic functions encoded in its genome was identified, allowing the reconstruction of a metabolic network covering ~30% of its proteome. Experimental biomass composition, defined media compositions, substrate uptake and secretion rates were integrated as species-specific constraints to produce a functional model. Sensitivity analysis revealed oxygen dependency for the secretion of acetate, consistent with M. florum’s known facultative anaerobe phenotype. The model was validated and refined using genome-wide expression and essentiality datasets as well as growth data on varying carbohydrates. Discrepancies between model predictions and observations were mechanistically explained using protein structures and network analysis. The validated model, along with essentiality data and the complete transcription units architecture were used for the design of a reduced genome, thereby targeting 167 genes for removal.
11:30 AM - 11:40 AM (EDT)
iRNA - IsoQuant: isoform analysis and quantification with long error-prone transcriptomic reads
Third-generation (PacBio/Oxford Nanopore) transcriptomics generates long reads, which, in contrast to short-read sequencing, have the potential to analyze and quantify complex alternative isoforms. Because long-read RNA sequencing is relatively new, few computational methods analyze such data; the SQANTI pipeline is an exception, though it is designed primarily for PacBio data. Here, we present a software tool called IsoQuant for reference-based analysis of long error-prone reads. IsoQuant assigns reads to annotated isoforms based on their intron and exon structure, and further performs gene and isoform quantification. For high-error-rate data, the algorithm uses inexact intron and exon matching, which accurately resolves various error-induced alignment artifacts, such as skipped short exons or shifted splice sites. To estimate the accuracy of IsoQuant, we simulated several Nanopore and PacBio datasets based on mouse and human transcriptomes. For low-error reads (e.g. PacBio CCS), both IsoQuant and SQANTI2 show near-perfect accuracy, but for high-error data with complex artifacts (such as Oxford Nanopore, for which SQANTI2 was not designed), IsoQuant’s inexact intron/exon matching yields a strong improvement. IsoQuant is an open-source software implemented in Python and is available at https://github.com/ablab/IsoQuant.
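The inexact intron-matching idea can be sketched as follows. This is an illustrative simplification, not IsoQuant's actual algorithm: a read's intron chain is compared against each annotated isoform's introns, and small error-induced shifts of splice sites (up to a hypothetical tolerance `delta`) are absorbed rather than causing a mismatch.

```python
def introns_match(read_introns, isoform_introns, delta=6):
    """True if every read intron matches the corresponding annotated
    intron with both boundaries shifted by at most `delta` bases."""
    if len(read_introns) != len(isoform_introns):
        return False
    return all(abs(rs - xs) <= delta and abs(re - xe) <= delta
               for (rs, re), (xs, xe) in zip(read_introns, isoform_introns))

def assign_read(read_introns, annotation, delta=6):
    """Assign a read to the first annotated isoform whose intron chain
    matches inexactly; returns the isoform name or None."""
    for name, isoform_introns in annotation.items():
        if introns_match(read_introns, isoform_introns, delta):
            return name
    return None

# Hypothetical annotation: two isoforms with different intron chains.
annotation = {"iso1": [(100, 200), (300, 400)],
              "iso2": [(100, 200)]}
# A noisy read whose splice sites are shifted by a few bases still matches:
print(assign_read([(103, 198), (301, 404)], annotation))  # → iso1
```

With exact matching, the same noisy read would be rejected by every isoform, which is precisely the failure mode on high-error Nanopore alignments that inexact matching is meant to avoid.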
12:00 PM - 12:20 PM (EDT)
Breakout session 2
We will break into small groups for discussion of topics TBA.
12:00 PM - 12:10 PM (EDT)
Text Mining - BERTMeSH: Deep Contextual Representation Learning for Large-scale High-performance MeSH Indexing with Full Text
Large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: it 1) uses Learning To Rank (LTR), which is time-consuming, 2) can capture only some pre-defined sections in full text, and 3) ignores the whole of MEDLINE. We propose a deep learning based method, BERTMeSH, which is flexible with respect to section organization in full text. BERTMeSH combines two technologies: 1) the state-of-the-art pre-trained deep contextual representation, BERT (Bidirectional Encoder Representations from Transformers), which lets BERTMeSH capture the deep semantics of full text, and 2) a transfer learning strategy for using both full text in PubMed Central (PMC) and titles and abstracts in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on approximately 1.5 million full-text articles in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20K test articles from PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. Also, predicting the 20K test articles took 5 minutes with BERTMeSH, versus more than 10 hours with FullMeSH, demonstrating BERTMeSH's computational efficiency.
12:00 PM - 12:10 PM (EDT)
iRNA - Full-length transcriptome reconstruction reveals a large diversity of RNA and protein isoforms in rat hippocampus
Gene annotation is a critical resource in genomics research. Many computational approaches have been developed to assemble transcriptomes from high-throughput short-read sequencing, however, only with limited accuracy. Here, we combine next-generation and third-generation sequencing to reconstruct a full-length transcriptome in the rat hippocampus, which is further validated using independent 5′- and 3′-end profiling approaches. In total, we detect 28,268 full-length transcripts (FLTs), covering 6,380 RefSeq genes and 849 unannotated loci. Based on these FLTs, we discover co-occurring alternative RNA processing events. Integrating polysome profiling and ribosome footprinting data, we predict isoform-specific translational status and reconstruct an open reading frame (ORF)-eome. Notably, a high proportion of the predicted ORFs are validated by mass spectrometry-based proteomics. Moreover, we identify isoforms with distinct subcellular localization patterns in neurons. Collectively, our data advance our knowledge of RNA and protein isoform diversity in the rat brain and provide a rich resource for functional studies.
12:00 PM - 12:20 PM (EDT)
Function - Update on Protein Functional Annotation in UniProt in 2020
With the increasing amount of data generated by sequencing projects, researchers need reliable systems to provide the functional annotation of proteins. UniProtKB uses two functional annotation systems, UniRule and ARBA (Association-Rule-Based Annotator), to automatically annotate UniProtKB/TrEMBL in an efficient and scalable manner with a high degree of accuracy. These systems use protein signatures and taxonomy classifications to infer the biochemical features and biological functions of proteins. This knowledge is expressed in the form of rules: sets of IF-THEN statements coming from expert curation (UniRule [1]) or generated by machine learning (ARBA [2]). Rules are applied at each release, keeping the propagated annotations up to date. On the UniProtKB website, information added by annotation rules is clearly highlighted as such using evidence tags. These tags can also be used as keywords to search for, or filter out, annotation added by a rule. The protein function community could also benefit from these rules, as some sequences may not yet be available in public databases or could be present in highly redundant proteomes absent from UniProtKB. This has been made possible via UniFIRE (the UniProt Functional annotation Inference Rule Engine), an engine that executes rules in the URML (UniProt Rule Markup Language) format.
12:00 PM - 12:20 PM (EDT)
MLCSB: Machine Learning - Cancer mutational signatures representation by large-scale context embedding
The accumulation of somatic mutations plays critical roles in cancer development and progression. However, the global patterns of somatic mutations, especially non-coding mutations, and their roles in defining molecular subtypes of cancer have not been well characterized due to the computational challenges in analyzing the complex mutational patterns. Here we develop a new algorithm, called MutSpace, to effectively extract patient-specific mutational features using an embedding framework for larger sequence context. Our method is motivated by the observation that the mutational rate at megabase scale and the local mutational patterns jointly contribute to distinguishing cancer subtypes, both of which can be simultaneously captured by MutSpace. Simulation evaluations demonstrate that MutSpace can effectively characterize mutational features from bona fide patient subgroups and achieve superior performance compared with previous methods. As a proof-of-principle, we apply MutSpace to 560 breast cancer patient samples and demonstrate that our method achieves high accuracy in subtype identification. In addition, the learned embeddings from MutSpace reflect intrinsic patterns of breast cancer subtypes and other features of genome structure and function. MutSpace is a promising new framework to better understand cancer heterogeneity based on somatic mutations.
12:00 PM - 12:40 PM (EDT)
CAMDA - Metagenomic Geolocation using Read Signatures
We present a novel approach to the Metagenomic Geolocation Challenge based on random projection of the sample reads for each location. Individual signatures are computed for all reads, and each location is characterised by a hierarchical vector-space representation of the resulting clusters. Classification is then treated as a problem in ranked retrieval of locations, where similar signatures are taken as a proxy for underlying microbial similarity. We evaluate our approach on the 2016 and 2020 Challenge datasets and obtain promising results using nearest-neighbour classification.
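The read-signature idea can be illustrated with a minimal random-projection sketch. The k-mer representation, the number of hyperplanes, and all parameters here are hypothetical choices for illustration, not the authors' implementation: each read becomes a k-mer count vector, projected onto random directions, and the resulting sign pattern is a compact binary signature under which similar reads tend to collide.

```python
import numpy as np

K = 3
rng = np.random.default_rng(0)
PROJ = rng.standard_normal((16, 4 ** K))   # 16 random hyperplanes

def kmer_vector(read, k=K):
    """Count vector over all 4^k DNA k-mers of a read."""
    idx = {b: i for i, b in enumerate("ACGT")}
    v = np.zeros(4 ** k)
    for i in range(len(read) - k + 1):
        code = 0
        for b in read[i:i + k]:
            code = code * 4 + idx[b]
        v[code] += 1
    return v

def signature(read):
    """Binary random-projection signature: sign of each projection."""
    return tuple(bool(x) for x in (PROJ @ kmer_vector(read)) > 0)

# Similar reads yield identical or near-identical sign patterns:
a = signature("ACGTACGTACGTACGT")
b = signature("ACGTACGTACGTACGA")   # one-base difference
print(sum(x == y for x, y in zip(a, b)), "of 16 bits agree")
```

Because the signatures are short bit patterns rather than full count vectors, clustering and ranked retrieval over millions of reads become tractable, with signature similarity standing in for microbial similarity.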
12:00 PM - 12:40 PM (EDT)
SysMod - Robotic mapping and generative modelling of cytokine response
We have developed a robotic platform allowing us to monitor the cytokine dynamics (including IL-2, IFN-gamma, TNF-alpha and IL-6) of immune cells in vitro with unprecedented resolution. To understand the complex emerging dynamics, we use interpretable machine learning techniques to build a generative model of the cytokine response. We discover that, surprisingly, immune activity is encoded in one global parameter, capturing ligand antigenic properties and, to a lesser extent, ligand quantity. Based on this we build a simple interpretable model which can fully explain the broad variability of cytokine dynamics. We validate our approach using different cell lines and different ligands. Two processes are identified, connected to the timing and intensity of the cytokine response, which we successfully modulate using drugs or by changing conditions such as initial T cell numbers. Our work reveals a simple "cytokine code", which can be used to better understand the immune response in different contexts, including immunotherapy. More generally, it shows how robotic platforms and machine learning can be leveraged to build and validate systems biology models. Work in collaboration with Grégoire Altan-Bonnet, National Cancer Institute.
12:00 PM - 12:40 PM (EDT)
SST01 - Probing the Immune Response to Vaccination with Systems Biology
12:10 PM - 12:20 PM (EDT)
iRNA - Freddie: Annotation-free Isoform Discovery Using Long-Read Sequencing
Background: Alternative splicing (AS) events are essential to understanding the development of cancer and may play a role as targets of personalized cancer therapeutics. However, detecting novel AS events remains a challenging task: existing reference transcriptome annotation databases are far from comprehensive, and traditional sequencing technologies are limited by their short read lengths. Given these challenges, transcriptomic long-read sequencing (LRS) presents promising potential for novel AS discovery. Methods: We present Freddie, a computational annotation-free isoform discovery and detection tool. Freddie takes genome alignments of transcriptomic LRS reads as input and generates isoform clusters of these reads for a given gene of interest. Freddie first segments the gene interval into canonical exon segments using the alignments. It then clusters the reads into isoform clusters that satisfy a set of expected transcriptomic LRS constraints. We formulate this clustering as an optimization problem that we name the Minimum Error Clustering into Isoforms (MErCi) problem and solve it as an Integer Linear Program. Results: We compare the performance of Freddie on both simulated and real datasets with state-of-the-art isoform detection tools with varying dependence on reference annotations. We show that both the segmentation and clustering steps of Freddie are highly accurate and computationally efficient.
12:10 PM - 12:20 PM (EDT)
Text Mining - Integrating Image Caption Information into Biomedical Document Classification in Support of Biocuration
Biological databases provide precise, well-organized information for supporting biomedical research. Developing such databases typically relies on manual curation. However, the large and rapidly increasing publication rate makes it impractical for curators to quickly identify all and only those documents of interest. As such, automated biomedical document classification has attracted much attention. Images convey significant information for supporting biomedical document classification. Accordingly, using image captions, which provide a simple and accessible summary of the actual image content, has the potential to enhance the performance of classifiers. We present a classification scheme incorporating features gathered from captions in addition to titles and abstracts. We trained/tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD), consisting of ~60,000 documents, where each document is represented by the text of its title, abstract and captions. Our classifier attains a precision of 0.698, recall of 0.784, f-measure of 0.738 and Matthews correlation coefficient of 0.711, exceeding the performance of current classifiers employed by biological databases and showing that our framework effectively addresses the imbalanced classification task faced by GXD. Moreover, our classifier’s performance is significantly improved by utilizing information from captions compared to using titles and abstracts alone, which demonstrates that captions provide substantial information for supporting biomedical document classification.
12:20 PM - 12:30 PM (EDT)
iRNA - Prioritizing genes likely to have functionally distinct splice isoforms using long read RNA-seq data
RNA sequencing continues to detect an increasing number of splice variants, yet few genes have evidence of functionally distinct splice isoforms (FDSIs). One challenge is that splice variants are often reconstructed computationally from short sequence fragments, an error-prone process. Here we identify likely candidate genes with FDSIs using long-read RNA-sequencing data, which removes much of the present ambiguity around transcript structures. We developed a computational pipeline for prioritizing splice isoforms in MinION long-read RNA-seq data. We designed the prioritization approach using splice-variant-specific conservation (PhastCons, PhyloP and BLAST homology searches), expression, coding potential, and protein domain annotations. Based on these annotations, the pipeline outputs a prioritized list of genes likely to have FDSIs. We then applied this pipeline to publicly available and novel mouse brain and liver transcriptomes. For 6,799 genes with multiple splice variants, our approach prioritized a set of 44 putative genes with FDSIs. This candidate set includes genes with published evidence of FDSIs (Cdc42) and genes with promising literature evidence of FDSIs (Tpd52l1, Gstz1). The limited amount of long-read data and low sequencing depth hinder our prioritization. Nevertheless, our work aids in establishing guidelines for high-throughput prioritization of genes with FDSIs.
12:20 PM - 12:40 PM (EDT)
Function - Pruning the Protein Jungle: recent developments in the CATH-Gardener function analysis and prediction pipeline.
The CATH-Gene3D database includes evolutionary relationships between protein domains, and classifies domains into superfamilies. Functional Families (FunFams) subdivide superfamilies further to provide clusters and alignments of protein domains that all perform closely related functions. FunFams have performed well in function prediction assessments (CAFA) and provide insights into both functional sites and the effects of variants on structure and disease. FunFams are created with Gardener: a novel pipeline for clustering massive protein datasets. Domains are first partitioned by Multi-Domain Architecture (MDA), then tree-building/tree-cutting algorithms (GeMMA/FunFHMMER) are applied iteratively. To deal with the huge numbers of sequences from metagenome projects, we have employed a novel random sampling approach that reduces the search space whilst retaining the functional diversity required to build informative alignments. For CAFA4, we implemented a generalised Fast and Frugal Tree algorithm to filter and combine search results from a variety of sources. Sources contributing to these results included: FunFams built from CATH and Pfam domains, a predictor based on autoencoding of network information, and a Machine Learning (ML) approach trained to recognise patterns in FunFam sequence matches. Benchmarks suggest a significant improvement in accuracy for the most recent FunFams, and that network data considerably enhances function prediction in in silico validation on fission yeast.
12:20 PM - 12:30 PM (EDT)
Text Mining - Bridging the gap: enlisting authors’ help addressing the remaining difficulties in automated concept extraction
Automated text analysis has proven very effective for retrieving relevant articles from the biomedical literature. Yet quantitative biomedical research requires both the specific knowledge embedded within individual articles and more comprehensive results across the literature. Text mining helps provide this data by automating the conversion of unstructured text such as scientific publications into structured, computable formats. Recent advances allow automated text mining systems to provide results that are predominantly of very high quality. However, some cases remain difficult for current text mining tools. In this work we propose ten writing tips to enlist the help of authors. We also propose a web-based tool, PubReCheck, to help authors visualize the results of automated concept extraction on their text and automatically identify many types of issues prior to publication. Articles that follow these guidelines will typically be processed more accurately, enabling the content to be found more readily and used more widely. Following these guidelines at a large scale will improve the ability of millions to find articles that meet their information needs and the ability of text mining tools to provide structured, computable data that enable larger scale and more rapid analyses. Availability: https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/PubReCheck/
12:20 PM - 12:40 PM (EDT)
Breakout session 3
Groups report their findings back to the main group.
12:20 PM - 12:40 PM (EDT)
MLCSB: Machine Learning - MHCAttnNet: Predicting MHC-Peptide Bindings for MHC Alleles Classes I & II Using An Attention-Based Deep Neural Model
Motivation: Accurate prediction of binding between an MHC allele and a peptide plays a major role in the synthesis of personalized cancer vaccines. The immune system struggles to distinguish between a cancerous and a healthy cell. In a patient suffering from cancer who has a particular MHC allele, only those peptides that bind with the MHC allele with high affinity help the immune system recognize the cancerous cells. Results: MHCAttnNet is a deep neural model that uses an attention mechanism to capture the relevant subsequences of the amino acid sequences of peptides and MHC alleles. It then uses these to accurately predict MHC-peptide binding. MHCAttnNet achieves an AUC-PRC score of 94.18% with 161 class I MHC alleles, which outperforms the state-of-the-art models for this task. MHCAttnNet also achieves a better AUC-ROC score in comparison to the state-of-the-art models while covering a greater number of class II MHC alleles. The attention mechanism used by MHCAttnNet provides a heatmap over the amino acids, thus indicating the important subsequences present in the amino acid sequence. This approach also allows us to focus on a much smaller number of relevant trigrams corresponding to the amino acid sequence of an MHC allele, from 9,251 possible trigrams to about 258. This significantly reduces the number of amino acid subsequences that need to be clinically tested.
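The trigrams over which MHCAttnNet reports attention weights are simply overlapping three-residue windows of an amino acid sequence; a minimal sketch (the sequence and helper name below are illustrative, not from the paper) is:

```python
def overlapping_trigrams(sequence):
    """Return the overlapping amino-acid trigrams of a sequence --
    the units over which attention weights can be reported."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]

# Hypothetical 9-mer peptide, for illustration only.
trigrams = overlapping_trigrams("SYFPEITHI")
```

A length-n sequence yields n - 2 such trigrams, so attention over trigrams scales linearly with sequence length.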
12:30 PM - 12:40 PM (EDT)
Text Mining - Promoting Reproducible Research for Characterizing Nonmedical use of Medications through Data Annotation: A Description of a Twitter Corpus and Guidelines
Due to the contribution of prescription medications, such as stimulants or opioids, to the broader drug abuse crisis, several studies have explored the use of social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a lack of clear descriptions of how abuse information is presented in public social media, as well as few thorough annotation guidelines that may serve as the groundwork for long-term future research or annotated datasets usable for automatic characterization of social media chatter associated with abuse-prone medications. We employed an iterative annotation strategy to create an annotated corpus of 16,433 tweets, each mentioning one of 20 abuse-prone medications, labeled as potential abuse or misuse, non-abuse consumption, drug mention only, or unrelated. We experimented with several machine learning algorithms to illustrate the utility of the corpus and generate baseline performance metrics for automatic classification on these data. Among the machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271). We expect that our annotation strategy, guidelines, and dataset will provide a significant boost to community-driven data-centric approaches for the task of monitoring prescription medication misuse or abuse from Twitter.
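The abstract does not state how the 95% CI was computed, but a standard normal-approximation binomial interval for the reported accuracy (0.73 over n = 3271) reproduces bounds very close to the quoted ones; as a sketch:

```python
from math import sqrt

def normal_approx_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion.
    (One common choice; the authors' exact method is not stated.)"""
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

lo, hi = normal_approx_ci(0.73, 3271)  # roughly (0.715, 0.745)
```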
12:30 PM - 01:30 PM (EDT)
iRNA - iRNA Panel Discussion: Long-read RNA-seq
Long-read RNA-seq discussion going into the lunch break.
01:00 PM - 01:45 PM (EDT)
ISCB Town Hall
02:00 PM - 02:10 PM (EDT)
SST02 - Introduction
This special session aims to introduce the audience to the motivations, principles and opportunities offered by crowdsourcing technologies. We will emphasize the methodological contributions and describe upcoming technical challenges. This talk will present the speakers and organization of the session.
02:00 PM - 02:20 PM (EDT)
iRNA - Inferring snoRNA characteristics from their abundance profile in healthy human tissues
Small nucleolar RNAs (snoRNAs) are noncoding RNAs known to regulate ribosome biogenesis and splicing. SnoRNAs have a highly stable structure, which impairs their quantification by high-throughput sequencing (RNA-Seq). The use of thermostable group II intron reverse transcriptase in RNA-seq (TGIRT-Seq) was recently shown to accurately quantify their abundance. To faithfully characterize the determinants of snoRNA abundance, we carried out TGIRT-Seq of seven healthy human tissues and conducted subsequent bioinformatic analyses. We found that snoRNAs can be categorized into two abundance profiles with distinct characteristics: uniformly expressed (UE) and tissue-specific (TS) snoRNAs. UE snoRNAs are encoded in protein-coding host genes (HGs), are highly conserved and target rRNA, whereas TS snoRNAs are encoded in noncoding HGs, are poorly conserved and are orphan snoRNAs. We found that, with regard to abundance, UE snoRNAs can be anticorrelated with their HG (in which case the HG is mostly involved in ribosome biogenesis and translation) or correlated (in which case the HG mostly codes for a ribosomal protein). Conversely, TS snoRNAs are well correlated with their HG. These results suggest a model in which, based upon its abundance profile across tissues, a snoRNA’s target, conservation and functional relationship with its HG can be clearly inferred.
02:00 PM - 02:20 PM (EDT)
Function - Benchmarking Gene Ontology Function Predictions Using Negative Annotations
With the ever-increasing number and diversity of sequenced species, the challenge of characterising genes with functional information is ever more important. In most species, this characterisation relies almost entirely on automated electronic methods. As such, it is critical to benchmark the various methods. The CAFA series of community experiments provides the most comprehensive benchmark, with a time-delayed analysis leveraging newly curated experimentally supported annotations. However, the definition of a false positive in CAFA has not fully accounted for the Open World Assumption (OWA), leading to a systematic underestimation of precision. The main reason for this limitation is the relative paucity of negative experimental annotations. This paper introduces a new, OWA-compliant benchmark based on a balanced test set of positive and negative annotations. The negative annotations are derived from expert-curated annotations of protein families on phylogenetic trees. This approach results in a large increase in the average information content (IC) of negative annotations. The benchmark has been tested using the naïve and BLAST baseline methods, as well as two orthology-based methods. This new benchmark could complement existing ones in future CAFA experiments.
02:00 PM - 02:20 PM (EDT)
CAMDA - Unraveling city-specific microbial signature and identifying sample origin for the data from CAMDA 2020 Metagenomic Geolocation Challenge
As the composition of microbial communities can be location-specific, investigating the microbial communities of different cities could unravel city-specific microbial signatures and further identify the origin of future samples. In this study, whole genome shotgun (WGS) metagenomics data from over 20 cities in 17 countries, together with city-specific information including location, climate and biome data, were provided as part of the CAMDA 2020 “Metagenomic Geolocation Challenge”. We applied feature selection, normalization, two popular machine learning methods (Random Forest (RF) and Support Vector Machine (SVM)), one deep learning method (Multilayer Perceptron (MLP)), and Principal Coordinates Analysis (PCoA).
02:00 PM - 02:30 PM (EDT)
SST01 - Chemistry between genes: a proposition for ImmunoMetabolomics
02:00 PM - 02:40 PM (EDT)
SysMod - Cross-Species Translation of Biological Information via Computational Systems Modeling Frameworks
A vital challenge that the vast majority of biological research must address is how to translate observations from one physiological context to another—most commonly from experimental animals (e.g., rodents, primates) or technological constructs (e.g., organ-on-chip platforms) to human subjects. This is typically required for understanding human biology because of the strong constraints on measurements and perturbations in human in vivo contexts. Direct translation of observations from experimental animals to human subjects is generally unsatisfactory because of significant differences among organisms at all levels of molecular properties from genome to transcriptome to proteome and so forth. Accordingly, addressing inter-species translation requires an integrated experimental/computational approach for mapping comparable but not identical molecule-to-phenotype relationships. This presentation will describe methods we have developed for a variety of cross-species translation examples, demonstrated on applications in inflammatory pathologies and cancer.
02:00 PM - 02:40 PM (EDT)
Text Mining - Text Mining Keynote: Text mining to understand drug action: from PubMed to Reddit
02:00 PM - 03:00 PM (EDT)
MLCSB: Machine Learning - Deep learning at base-resolution reveals motif syntax of the cis-regulatory code
Genes are regulated by cis-regulatory elements, which contain transcription factor (TF) binding motifs in specific arrangements. To understand the syntax of these motif arrangements and its influence on TF binding, we developed a new convolutional neural network called BPNet that models the relationship between regulatory DNA sequence and base-resolution binding profiles from ChIP-exo/nexus experiments targeting four pluripotency TFs Oct4, Sox2, Nanog, and Klf4 in mouse embryonic stem cells. BPNet is able to predict base-resolution binding profiles and footprints at unprecedented accuracy on par with replicate experiments. We developed a suite of model interpretation methods to learn novel motif representations, accurately map predictive motif instances in the genome and identify higher-order rules by which combinatorial motif syntax influences binding of these TFs. We discovered several novel motifs bound by these TFs supported by distinct footprints. We further found that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences TF binding at protein or nucleosome range in a directional manner. Most strikingly, Nanog binding is driven by motifs with a strong preference for ~10.5 bp spacings corresponding to helical periodicity. We then validated our model's predictions using CRISPR-induced point mutations of motif instances followed by profiling TF binding. The sequence representations learned by the binding models also generalized to accurately predict differential chromatin accessibility after TF depletion as well as massively parallel reporter experiments. BPNet easily adapts to other types of profiling experiments (e.g. ChIP-seq, DNase-seq, ATAC-seq, PRO-seq) in mammals as well as other species such as yeast, thus paving the way to decipher the cis-regulatory code from diverse regulatory profiling experiments.
02:00 PM - 02:04 PM (EDT)
Web - Introduction to WEB 2020
02:04 PM - 02:05 PM (EDT)
Web - RECENT ONLINE TRAINING EXPERIENCES - THE UNIVERSITY EXPERIENCE
02:05 PM - 02:25 PM (EDT)
Web - Teaching the Virus: Lessons from the Online Age
We are five faculty members who teach in Frontiers of Science, the Columbia College science core course. Like most of our colleagues around the world, we turned to online teaching in mid-March. To stay true to our name, Frontiers of Science, we modified our curriculum to make room for teaching about the evolution of SARS-CoV-2 using genomic data. After an introduction to the evolution of coronaviruses and genomics studies, we conducted an online activity where students worked in teams to investigate the evolution of SARS-CoV-2 using the Nextstrain.org website. First, students thoroughly examined phylogenetic graphs and discussed how these trees are created. Next, students calculated the virus’ substitution rates per site and compared them to those of other pathogens, such as influenza. The class ended with discussions revolving around infectious diseases and the role of genomics in treatment and vaccine development. We hope that other faculty may be able to use or adapt the activity that we have developed on SARS-CoV-2 and some of the ideas that have guided our modified online curriculum. If there ever was a topic that all students would deem relevant to their lives, this is certainly one.
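The substitution-rate exercise described above reduces to a simple per-site, per-year calculation; a sketch with illustrative numbers (not the class's actual data) is:

```python
def substitutions_per_site_per_year(n_substitutions, genome_length, years):
    """Estimate the per-site, per-year substitution rate from the number
    of substitutions observed over a known time span."""
    return n_substitutions / (genome_length * years)

# Illustrative values: ~25 substitutions across a ~30,000 nt genome in one year.
rate = substitutions_per_site_per_year(25, 30_000, 1.0)
```

Comparing such rates across pathogens (e.g., against influenza's faster-evolving genome) is the kind of discussion the activity aims to prompt.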
02:10 PM - 02:30 PM (EDT)
SST02 - Foldit and the origin of scientific discovery games in molecular biology
This talk will discuss the genesis of the Foldit project and early results that pioneered the field of scientific discovery games for molecular biology. It will provide an overview of the foundation of this field of research.
02:20 PM - 02:40 PM (EDT)
Function - The Ortholog Conjecture Revisited: the Value of Orthologs and Paralogs in Function Prediction
The computational prediction of gene function is a key step in making full use of newly sequenced genomes. Function is generally predicted by transferring annotations from homologous genes or proteins for which experimental evidence exists. The ''ortholog conjecture'' proposes that orthologous genes should be preferred when making such predictions, as they evolve functions more slowly than paralogous genes. Previous research has provided little support for the ortholog conjecture, though the incomplete nature of the data cast doubt on the conclusions. Here we use experimental annotations from over 40,000 proteins, drawn from over 80,000 publications, to revisit the ortholog conjecture in two pairs of species: (i) Homo sapiens and Mus musculus and (ii) Saccharomyces cerevisiae and Schizosaccharomyces pombe. By making a distinction between questions about the evolution of function versus questions about the prediction of function, we find strong evidence against the ortholog conjecture in the context of function prediction, though questions about the evolution of function remain difficult to address. In both pairs of species, we quantify the amount of data that must be ignored if paralogs are discarded, as well as the resulting loss in prediction accuracy. Taken as a whole, our results support the view that the types of homologs used for function transfer are largely irrelevant to the task of function prediction. Aiming to maximize the amount of data used for this task, regardless of whether it comes from orthologs or paralogs, is most likely to lead to higher prediction accuracy.
02:20 PM - 02:40 PM (EDT)
CAMDA - Spatial models for assessment of bacterial classification relevant to AMR
The work and development of the MetaSUB international consortium project raise important questions about the antimicrobial resistance (AMR) of the collected samples. Different bioinformatics and statistical approaches have been developed for bacterial classification, AMR taxa discovery and characterization of their geographical distribution across multiple cities around the world. In this work we use methods from epidemiology to estimate the relative risk of antimicrobial resistance by modeling the spatial correlation in the data. The novelty of our approach is that we apply a convolution model, specifically the BYM (Besag, York, Mollié) model, to incorporate explicitly the spatial structure in the data as determined by the longitude and latitude of the samples. We use the Bayesian implementation in the R package CARBayes, where inference is based on Markov chain Monte Carlo (MCMC) simulation. We adapt the epidemiological concept of relative risk (RR) to find regions in the cities with elevated AMR risk. The model can also handle data with excessive zeros by modeling the response as a Zero-Inflated Poisson (ZIP) process. In addition to spatial modeling, we used several machine learning approaches, such as GBM, Random Forest and neural networks, to predict the geographical origin of the samples.
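Outside the full CARBayes spatial model, the underlying notion of relative risk can be sketched as the ratio of observed to expected counts per region (a standardized incidence ratio); the per-region counts below are hypothetical:

```python
def relative_risk(observed, expected):
    """Standardized incidence ratio per region: observed AMR counts
    divided by the counts expected if risk were uniform."""
    return [o / e for o, e in zip(observed, expected)]

# Hypothetical per-region counts of AMR-positive samples.
observed = [12, 3, 9]
total = sum(observed)
# Expected counts proportional to samples collected per region.
samples = [100, 100, 50]
expected = [total * s / sum(samples) for s in samples]
rr = relative_risk(observed, expected)  # RR > 1 flags elevated-risk regions
```

The BYM model refines this raw ratio by borrowing strength from spatially adjacent regions, which the sketch above does not attempt.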
02:20 PM - 02:40 PM (EDT)
iRNA - Dissecting the role of SINE non-coding RNAs in amyloid pathology: An integrative RNA genomics approach.
As the human life span increases, the number of people in Canada suffering from aging-associated cognitive impairment and Alzheimer’s disease (AD) is expected to rise dramatically. In previous work we have shown that learning impairment is connected with epigenetic changes in stress response genes (SRGs) (Peleg*, Sanabenesi*, Zovoilis* et al., Science 2010). However, the molecular mechanisms associated with this epigenetic deregulation remain unknown. Among the mechanisms that have recently attracted attention are those involving non-protein-coding RNAs (non-coding RNAs), including RNAs derived from repetitive DNA (Zovoilis et al., Cell 2016). Repetitive DNA accounts for ~50% of the noncoding sequences, with Short Interspersed Nuclear Elements (SINEs) being among the most frequent repeats. Here, we applied an integrative RNA genomics and bioinformatics approach to dissect the connection of SINE non-coding RNAs with amyloid pathology. Using short- and long-RNA sequencing, we demonstrate that SINE RNAs are associated with amyloid pathology in the brain, revealing a potential biomarker and a novel molecular mechanism associated with this condition.
02:30 PM - 02:35 PM (EDT)
Web - Overview 2
02:30 PM - 02:45 PM (EDT)
SST02 - Applying Citizen Science to Biomedical Literature Curation and Beyond
Biomedical literature is growing at a rate that outpaces our ability to harness the knowledge contained therein. In order to mine valuable inferences from the large volume of literature, many researchers have turned to information extraction algorithms to harvest information in biomedical texts. Advances in computational methods usually depend on the generation of gold standards by a limited number of expert curators. This process can be time-consuming and represents an area of biomedical research that is ripe for exploration with citizen science. With our web-based application Mark2Cure (https://mark2cure.org), we demonstrate that citizen scientists can perform biocuration tasks on biomedical abstracts such as Named Entity Recognition (NER) and Relationship Extraction (RE). We discuss examples and challenges in applying citizen science to biomedical literature mining and explore additional biocuration and biomedical knowledge organization opportunities beyond the biomedical literature, such as engaging citizen scientists to make biomedical research resources more findable, accessible, interoperable, and reusable.
02:30 PM - 03:00 PM (EDT)
SST01 - The role of cis-regulatory elements in cell fate and immune disorders
02:35 PM - 02:40 PM (EDT)
Web - Overview 3
02:40 PM - 03:00 PM (EDT)
SysMod - Towards a Human Whole-Cell Model: A Prototype Model of Human Embryonic Stem Cells
Whole-cell (WC) models that integrate diverse data into a mechanistic understanding of every gene function could transform medicine and bioengineering. Towards this goal, we are prototyping a WC model of human embryonic stem cells, which have great potential for regenerative medicine. First, we combined multi-omics, biochemical, and physiological data into a structured knowledge base. Second, we used the knowledge base to generate submodels of multiple pathways, including metabolism, transcription, translation, macromolecular complexation, and RNA and protein degradation. Next, we integrated the submodels into a single model, and used the knowledge base to estimate the parameters and initial conditions of the model. We are using a multi-algorithmic approach to simulate the wide range of concentrations and fluxes involved in the model. We use stochastic simulation to simulate slow pathways such as transcription, we use flux balance analysis to simulate fast pathways such as metabolism, and we synchronize the shared variables of these sub-simulations throughout each simulation. To enable this work, we are developing several new databases, methods, and tools for building and simulating large models. Going forward, we aim to incorporate additional submodels of signal transduction and cell cycle regulation, and use the model to gain insights into stem cell self-renewal.
02:40 PM - 03:00 PM (EDT)
Text Mining - Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision
Motivation: To facilitate biomedical text mining research on COVID-19, we have developed CORD-NER, an automated, comprehensive named entity annotation and typing system. The system generates an annotated CORD-NER dataset based on the COVID-19 Open Research Dataset, covering 75 fine-grained types with high quality. Both the system and the annotated literature datasets will be updated regularly. Results: The distinctive features of the CORD-NER dataset include: (1) it covers 75 fine-grained entity types: in addition to the common biomedical entity types (e.g., genes, chemicals and diseases), it covers many new entity types specifically related to COVID-19 studies (e.g., coronaviruses, viral proteins, evolution, materials, substrates and immune responses), which may benefit research on COVID-19-related viruses, spreading mechanisms, and potential vaccines; (2) it relies on distantly- and weakly-supervised NER methods, with no need for expensive human annotation of any articles or sub-corpora; and (3) its entity annotation quality surpasses that of SciSpacy (over 10% higher F1 score on a sample set of documents), a fully supervised BioNER tool. Our NER system supports incrementally adding new documents, as well as adding new entity types when needed, with only dozens of seeds required as input examples.
02:40 PM - 03:00 PM (EDT)
iRNA - Inferring competing endogenous RNA (ceRNA) interactions in cancer
To understand the biological factors driving cancer, the regulatory circuitry of genes needs to be discovered. Recently, a new gene regulation mechanism called competing endogenous RNA (ceRNA) interaction was discovered. Certain RNAs targeted by common microRNAs (miRNAs) “compete” for these miRNAs, thereby regulating each other by freeing one another from miRNA regulation. Several computational tools have been published to infer ceRNA networks. In most existing tools, however, the expression abundance and groupwise effect of ceRNAs are not considered. In this study, we developed a computational pipeline named Crinet to infer cancer-associated ceRNA networks while addressing these critical drawbacks. Crinet considers lncRNAs, pseudogenes and mRNAs as potential ceRNAs and incorporates a network deconvolution method to exclude the amplifying effect of ceRNA pairs. We tested Crinet on breast cancer data from TCGA. Crinet inferred reproducible ceRNA interactions and groups, which were significantly enriched in cancer-related genes and biological processes. We validated our ceRNA interactions using protein expression data. Crinet outperformed existing tools in predicting gene expression change in knockdown assays. The top high-degree genes in the inferred network included known suppressor/oncogene lncRNAs of breast cancer, showing the importance of including noncoding RNAs in ceRNA inference.
02:40 PM - 03:00 PM (EDT)
Function - Discovery of multi-operon colinear syntenic blocks in microbial genomes
Motivation: An important task in comparative genomics is to detect functional units by analyzing gene-context patterns. Colinear syntenic blocks (CSBs) are groups of genes that are consistently encoded in the same neighborhood and in the same order across a wide range of taxa. Such colinear syntenic blocks are likely essential for the regulation of gene expression in prokaryotes. Recent results indicate that colinearity can be conserved across multiple operons, thus motivating the discovery of multi-operon CSBs. This computational task raises scalability challenges in large datasets. Results: We propose an efficient algorithm for the discovery of cross-strand multi-operon CSBs in large genomic datasets. The proposed algorithm uses match-point arithmetic, which is scalable to large datasets of microbial genomes in terms of both running time and space requirements. The algorithm is implemented and incorporated into a tool with a graphical user interface, denoted CSBFinder-S. We applied CSBFinder-S to mine 1,485 prokaryotic genomes and analyzed the identified cross-strand CSBs. Our results indicate that most of the syntenic blocks are exclusively colinear. Additional results indicate that transcriptional regulation by overlapping transcriptional genes is abundant in bacteria. We demonstrate the utility of CSBFinder-S in identifying the common function of the gene pair PulEF in multiple contexts, including the Type 2 Secretion System, the Type 4 Pilus System, and DNA uptake.
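To make the notion of a colinear syntenic block concrete, here is a deliberately naive sketch that finds maximal runs of genes appearing contiguously and in the same order in two genomes; it is not the match-point algorithm of CSBFinder-S (which scales to thousands of genomes), and the gene names are illustrative:

```python
def is_sub(small, big):
    """True if tuple `small` occurs contiguously inside tuple `big`."""
    n, m = len(small), len(big)
    return any(big[i:i + n] == small for i in range(m - n + 1))

def common_colinear_blocks(genome_a, genome_b, min_len=2):
    """Naively find maximal gene runs shared, in order, by two genomes.
    Illustrates the CSB concept only; CSBFinder-S uses a far more
    scalable match-point approach."""
    blocks = set()
    for i in range(len(genome_a)):
        for j in range(len(genome_b)):
            k = 0
            while (i + k < len(genome_a) and j + k < len(genome_b)
                   and genome_a[i + k] == genome_b[j + k]):
                k += 1
            if k >= min_len:
                blocks.add(tuple(genome_a[i:i + k]))
    # keep only maximal blocks (not contained in a longer one)
    return {b for b in blocks
            if not any(b != c and is_sub(b, c) for c in blocks)}

# Hypothetical gene orders; "pulE"/"pulF" echo the PulEF pair above.
genes_a = ["pulE", "pulF", "x", "a", "b", "c"]
genes_b = ["y", "pulE", "pulF", "a", "b", "c"]
blocks = common_colinear_blocks(genes_a, genes_b)
```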
02:40 PM - 03:00 PM (EDT)
CAMDA - Metagenomic Geolocation Prediction Using an Adaptive Ensemble Classifier
Several standard classifiers have been employed for the prediction of the origin of a given microbial sample. The application of these individual classification algorithms yields varying degrees of classification accuracy, and the performance of such classifiers is also dependent on the structure of the available data. Rather than employing several individual classifiers, we adopt the adaptive ensemble classification algorithm proposed by Datta (2010). The ensemble classifier, which is constructed by bagging and rank aggregation, comprises a set of standard classification algorithms that are combined flexibly to yield classification performance at least as good as that of the best classification algorithm in the ensemble. For our analysis, we trained and tested fourteen standard classifiers, including the ensemble classifier, and in different instances we also applied class weighting and an optimal oversampling technique to overcome the problem of class imbalance in the primary data. These analyses were conducted both on the primary dataset of relative abundances and on data with a reduced feature space. In each instance, we found that the standard algorithms performed differently, whereas the ensemble classifier consistently showed optimal performance. Lastly, we predict the source cities of the mystery samples provided by the CAMDA organizers.
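The rank-aggregation step of such an ensemble can be illustrated with a toy Borda-count scheme; note that Datta (2010) uses weighted rank aggregation, so this simpler variant and the classifier names are only illustrative:

```python
def borda_aggregate(rankings):
    """Aggregate per-resample classifier rankings (best first) into one
    consensus ordering by summing rank positions (Borda count).
    A simplified stand-in for the weighted rank aggregation of
    Datta (2010)."""
    scores = {}
    for ranking in rankings:
        for position, clf in enumerate(ranking):
            scores[clf] = scores.get(clf, 0) + position
    return sorted(scores, key=scores.get)

# Hypothetical rankings of three classifiers over three bootstrap resamples.
rankings = [["svm", "rf", "knn"],
            ["rf", "svm", "knn"],
            ["svm", "knn", "rf"]]
best_first = borda_aggregate(rankings)
```

Bagging supplies the per-resample rankings; aggregating them rewards classifiers that perform well consistently rather than on a single lucky split.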
02:45 PM - 03:00 PM (EDT)
SST02 - Phylo: How to turn scientific tasks into casual games
Launched in 2010, Phylo is a casual tile-matching game aiming to improve large-scale multiple sequence alignments. In this talk, we will discuss the origins and evolution of this scientific game during the last decade. We will discuss its motivations, its impact and various lessons we learned through this project.
03:19 PM - 03:20 PM (EDT)
Web - RECENT ONLINE TRAINING EXPERIENCES - SHORT-COURSE EXPERIENCE
03:20 PM - 03:30 PM (EDT)
SST01 - Uncovering B-ALL TF-gene regulatory interactions associated with CRLF2-overexpression
Although genetic alterations are initial drivers of disease, aberrantly activated transcriptional regulatory programs are often responsible for maintenance and progression in cancer. Alterations leading to CRLF2-overexpression in B-ALL patients are associated with poor outcome and activate JAK-STAT, PI3K and ERK/MAPK signaling pathways. Although inhibitors of these pathways are available, there still remains the issue of treatment-associated toxicities and poorly studied regulatory structures controlling leukemogenesis. Comparing RNA-seq from CRLF2-High and other B-ALL patients (CRLF2-Low), we defined a CRLF2-High gene signature. Patient-specific chromatin accessibility was interrogated to identify altered putative regulatory elements that could be linked to transcriptional changes. To delineate these regulatory interactions, a B-ALL cancer-specific regulatory network was inferred using 868 B-ALL patient samples from the NCI TARGET database coupled with priors generated from ATAC-seq peak TF-motif analysis. Analysis of CRISPRi, siRNA knockdown and ChIP-seq of nine TFs involved in the inferred network were analyzed to validate a cohort of predicted TF-gene regulatory interactions. Inferred interactions were used to identify differential patient-specific transcription factor activities predicted to control CRLF2-High deregulated genes, thereby enabling identification of robust gene targets.
03:20 PM - 03:40 PM (EDT)
SysMod - Whole-body regeneration and size-dependent fission controlled by a self-regulated Turing system in planaria
Planarian worms have the extraordinary ability to regenerate any body part after an amputation. This ability allows them to reproduce asexually by fission, cutting themselves to produce two separate pieces, each repatterning and regenerating a complete animal. The induction of this process is known to depend on the size of the worm as well as on environmental factors such as population density, temperature, and light intensity. Models based on Turing systems can explain the self-regulation of many biological mechanisms, from skin patterns to digit formation. Here, we combine experimental evidence with a modeling approach to show how a cross-inhibited Turing system can explain at once both the signaling mechanism of regeneration and fission in planaria. In a growing domain, the model explains the precise signals that control the regeneration of the different body parts after amputation, as well as when and where planaria undergo fission and how this depends on worm length. We provide molecular implementations of the proposed model, which also explains the effects of environmental factors on the signaling of fission. In summary, the proposed controlled cross-inhibited Turing system represents a completely self-regulated model of whole-body regeneration and fission signaling in planaria.
03:20 PM - 03:25 PM (EDT)
Web - Overview 4
03:20 PM - 03:35 PM (EDT)
SST02 - Improved RNA secondary structure modeling through crowdsourced RNA design initiatives
Although algorithms for predicting RNA structure have been in development for decades, little is known about their accuracy in many of their use-cases. We created a database, "EternaBench", comprising the diverse high-throughput structural data gathered through the crowdsourced RNA design project Eterna, to evaluate the performance of a wide set of structure algorithms. Surprisingly, we found that lesser-used algorithms, which were developed through statistical learning, perform notably better than widely-used algorithms that are derived from experimental data. Motivated by this finding, we developed a multitask-learning-based model, "EternaFold", trained on the EternaBench data, which demonstrates improved predictions both on molecules from Eterna, as well as completely independent datasets of viral genomes and mRNAs. Experimental data from the project Eterna allows us to establish the first large-scale, independent benchmarks of RNA thermodynamic predictions, as well as leverage the diversity of player designs to improve structure prediction through statistical learning.
03:20 PM - 03:40 PM (EDT)
iRNA - Single Cell Chromatin Accessibility Delineates Cellular Identities of the Neonatal Organ of Corti
The organ of Corti, the receptor organ for hearing, is formed by a variety of sensory hair cells (HCs) and supporting cells (SCs) within the cochlea. However, the gene regulatory mechanisms of cochlear development are not fully understood. The aim of this study is to identify regulatory elements controlling the differentiation and maturation of the organ of Corti. To achieve this goal, we generated scATAC-seq and scRNA-seq libraries from postnatal day 2 organ of Corti preparations divided into apical and basal compartments. By integrating scRNA-seq data, we identified scATAC-seq cell types by calculating a Jaccard similarity matrix, identified cell type-specific transcription factors (TFs), classified them as activators or repressors based on function, and further validated them by footprinting. Focusing on HCs, we reconstructed the organ's one-dimensional architecture from both scRNA-seq and scATAC-seq data. We identified novel differentially expressed genes along the tonotopic axis and validated them by RNAscope. Additionally, we identified TFs that drive HC differentiation and maturation by reconstructing developmental trajectories. The results of this study enable us to understand how the epigenomic landscape delineates cellular identities and functions within the organ of Corti. Further studies will investigate regulatory elements driving SC maturation, which will contribute to regenerative strategies.
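The cell-type transfer step described in this abstract hinges on a Jaccard similarity matrix between cluster marker-gene sets; a minimal sketch of that computation (the marker sets and cluster names below are hypothetical):

```python
def jaccard(a, b):
    """Jaccard index between two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def jaccard_matrix(rna_markers, atac_markers):
    """Similarity between scRNA-seq cell-type marker sets (rows)
    and scATAC-seq cluster marker sets (columns)."""
    return {
        (r, c): jaccard(genes_r, genes_c)
        for r, genes_r in rna_markers.items()
        for c, genes_c in atac_markers.items()
    }

# hypothetical marker sets for illustration only
rna = {"HC": {"Atoh1", "Myo7a", "Pou4f3"}, "SC": {"Sox2", "Prox1"}}
atac = {"cluster0": {"Atoh1", "Myo7a"}, "cluster1": {"Sox2", "Jag1"}}
sim = jaccard_matrix(rna, atac)
# assign each scATAC cluster to the best-matching scRNA cell type
best = {c: max(rna, key=lambda r: sim[(r, c)]) for c in atac}
```

Each scATAC-seq cluster is then labeled with the cell type whose markers it shares most strongly.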
03:20 PM - 03:40 PM (EDT)
MLCSB: Machine Learning - Fourier-transform-based attribution priors improve the stability and interpretability of deep learning models for regulatory genomics
Deep learning models of regulatory DNA can accurately predict transcription factor (TF) binding and chromatin accessibility profiles. Base-resolution importance (i.e. "attribution") scores learned by the models can highlight predictive motifs and syntax. Unfortunately, these models are prone to overfitting and are sensitive to random initializations, often resulting in noisy and irreproducible attributions that obfuscate underlying motifs. To address these shortcomings, we propose a novel attribution prior, in which the Fourier transform of input-level attribution scores is computed at training time, and high-frequency components of the Fourier spectrum are penalized. We evaluate different model architectures with and without attribution priors to predict binary or continuous profiles of TF binding or chromatin accessibility. The prior is agnostic to the model architecture or predicted experimental assay, yet provides similar gains across all experiments. We show that our attribution prior dramatically improves the models' stability, interpretability, and performance on held-out data, including when training data is severely limited. Our attribution prior also allows models to identify motifs more sensitively and precisely within individual regulatory elements. This work represents an important advancement in improving the reliability of deep learning models for deciphering the cis-regulatory code from regulatory profiling experiments.
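A minimal sketch of the penalty described above, computed on a toy attribution track (the spectral cutoff is an illustrative choice, not the published hyperparameter, and in training a weighted version of this quantity would be added to the loss):

```python
import cmath
import math
import random

def spectrum_magnitudes(x):
    """Magnitudes of the non-redundant half of the DFT of a real
    signal (a plain O(n^2) DFT; an FFT would be used in practice)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def fourier_prior_penalty(attributions, freq_cutoff=0.2):
    """Fraction of spectral magnitude above a low-frequency cutoff."""
    mags = spectrum_magnitudes(attributions)
    total = sum(mags)
    if total == 0:
        return 0.0
    k = int(len(mags) * freq_cutoff)
    return sum(mags[k:]) / total

# a smooth attribution track is penalized less than a noisy one
n = 200
smooth = [math.sin(2 * math.pi * t / n) for t in range(n)]
rng = random.Random(0)
noisy = [s + rng.gauss(0, 0.5) for s in smooth]
```

Penalizing this quantity pushes the model toward smooth, low-frequency attribution tracks, which is the stated goal of the prior.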
03:20 PM - 03:40 PM (EDT)
CAMDA - Separation of Mystery-Samples using mi-faser and forest embedding
Here we propose a new approach for identifying and clustering city profiles based on the metagenomic fingerprints of related subway systems. We generate functional fingerprints of shotgun-sequenced metagenomes and use forest embeddings in combination with a density-based clustering method to identify samples of common provenance. We applied this approach to the CAMDA 2020 Metagenomic Forensics Challenge dataset and found that the set of 121 mystery samples may come from eight different cities. We also tried to predict metadata of the mystery samples, i.e. associated subway network length and usage, in order to assign each sample to a city.
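The forest-embedding idea rests on a simple similarity notion: two samples are close if they often land in the same leaf across the trees of a fitted forest. A minimal sketch with hypothetical leaf assignments (in practice these come from a fitted random forest, and a density-based method such as DBSCAN would then cluster on distances 1 - similarity):

```python
def forest_similarity(leaf_ids):
    """Pairwise similarity between samples as the fraction of trees
    in which two samples land in the same leaf.

    leaf_ids: one dict per tree mapping sample -> leaf index
              (hypothetical values here; in practice produced by
              a fitted random forest).
    """
    samples = sorted(leaf_ids[0])
    sim = {}
    for a in samples:
        for b in samples:
            shared = sum(1 for tree in leaf_ids if tree[a] == tree[b])
            sim[(a, b)] = shared / len(leaf_ids)
    return sim

# three trees, four samples: s0/s1 co-occur, s2/s3 co-occur
trees = [
    {"s0": 0, "s1": 0, "s2": 1, "s3": 1},
    {"s0": 2, "s1": 2, "s2": 3, "s3": 3},
    {"s0": 0, "s1": 1, "s2": 2, "s3": 2},
]
sim = forest_similarity(trees)
```

Samples of common provenance accumulate high similarity and fall into the same density cluster.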
03:20 PM - 04:00 PM (EDT)
Function - Function COSI Keynote: Saving Time at the Bench and in the Field: Predicting Gene Function and Phenotype in Crops
03:25 PM - 03:30 PM (EDT)
Web - Overview 5
03:30 PM - 03:35 PM (EDT)
Web - Overview 6
03:30 PM - 03:40 PM (EDT)
Text Mining - COVID-19 systems demo panel (part 2)
03:30 PM - 04:00 PM (EDT)
SST01 - Examining waning immunity of B. pertussis vaccination
03:35 PM - 03:50 PM (EDT)
SST02 - Stall Catchers: Engineering Obsolescence
It might seem ironic that the goal of a citizen science project would be to render itself obsolete, but it turns out this could be the most fruitful approach for science and the most ethical approach for online participants. Stall Catchers is an online game created to accelerate Alzheimer's research at Cornell University. Today, over 25,000 registered Stall Catchers volunteers are analyzing datasets in one or two months that formerly required up to a year to analyze in the lab. Continuing to expand the participant community could further accelerate this analysis; however, a more efficient path might entail using the human-generated data to improve machine-based contributions. This talk describes how the evolution of the Stall Catchers project may suggest a life cycle for future generations of citizen science projects that respects their contributors while maximizing societal impacts.
03:35 PM - 03:53 PM (EDT)
Web - Group discussion on short-course specific aspects of online training: Technologies used, hands-on facilitation, numbers taught
03:40 PM - 03:50 PM (EDT)
Text Mining - COVID-19 systems demo panel (part 3)
03:40 PM - 04:00 PM (EDT)
MLCSB: Machine Learning - Dissecting the grammar of chromatin architecture using megabase scale DNA sequence with deep neural networks and transfer learning
Mammalian genome architecture is characterized by an intricate framework of hierarchically folded domains that dictate essential gene regulatory functions such as enhancer-promoter interactions. Understanding which determinants drive 3D genome formation and how this architecture is altered through sequence and structural variation requires high-throughput, genome-wide approaches. Computational models sophisticated enough to grasp the determinants of chromatin folding allow us to perform these large-scale experiments in silico. We have developed a deep neural network (deepC) that uses transfer learning to predict chromatin interactions from DNA sequence at the megabase scale. Our model predicts Hi-C interactions at high resolution, captures intricate, hierarchical chromatin structures, and can be used to fine-map Hi-C data. DeepC allows us to predict the impact of single base pair variants as well as structural variation in the same end-to-end framework, bridging the different levels of resolution from base pairs to TADs. DeepC enables large-scale computational screens that empower us to dissect the functional elements and sequence determinants that regulate chromatin architecture at base pair resolution and genome-wide scale. We demonstrate how we employ deepC to stratify the contribution of distinct classes of regulatory elements, study the grammar of domain boundaries and predict the effect of SNPs.
03:40 PM - 04:00 PM (EDT)
SysMod - Clb3-centered regulations are pivotal for autonomous cell cycle oscillator designs in yeast
Some biological networks exhibit oscillations in their components to convert stimuli to time-dependent responses. The eukaryotic cell cycle is such a case, being governed by waves of cyclin-dependent kinase (cyclin/Cdk) activities that rise and fall with specific timing and guarantee its timely occurrence. Disruption of cyclin/Cdk oscillations could result in dysfunction through reduced cell division. Therefore, it is of interest to capture properties of network designs that exhibit robust oscillations. Here we show that a minimal cell cycle network in budding yeast is able to oscillate autonomously, and that cyclin/Cdk-mediated positive feedback loops (PFLs) and Clb3-centered regulations sustain cyclin/Cdk oscillations, in known and hypothetical network designs. An integrative, computational and experimental approach pinpoints how robustness of cell cycle control is realized by revealing a novel and conserved principle of design that ensures a timely interlock of transcriptional and cyclin/Cdk oscillations. Given the evolutionary conservation of the cell cycle network across eukaryotes, the cyclin/Cdk network can be used as a core building block of multi-scale models that integrate regulatory modules to address cellular physiology.
03:40 PM - 04:00 PM (EDT)
iRNA - A Bayesian framework for inter-cellular information sharing improves dscRNA-seq quantification
Motivation: Droplet-based single cell RNA-seq (dscRNA-seq) data are being generated at an unprecedented pace, and the accurate estimation of gene-level abundances for each cell is a crucial first step in most dscRNA-seq analyses. When preprocessing the raw dscRNA-seq data to generate a count matrix, care must be taken to account for the potentially large number of multi-mapping locations per read. The sparsity of dscRNA-seq data and the strong 3’ sampling bias make it difficult to disambiguate cases where there is no uniquely mapping read to any of the candidate target genes. Results: We introduce a Bayesian framework for information sharing across cells within a sample, or across multiple modalities of data from the same sample, to improve gene quantification estimates for dscRNA-seq data. We use an anchor-based approach to connect cells with similar gene expression patterns, and learn informative, empirical priors which we provide to alevin’s gene multi-mapping resolution algorithm. This improves the quantification estimates for genes with no uniquely mapping reads (i.e. when there is no unique intra-cellular information). We show our new model improves the per-cell gene-level estimates and provides a principled framework for information sharing across multiple modalities. We test our method on a combination of simulated and real datasets under various setups. Availability: The information sharing model is included in alevin and is implemented in C++14. It is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/salmon as of version 1.1.0.
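The multi-mapping resolution step can be illustrated with a toy EM-style reassignment in which a per-gene prior stands in for the empirical prior learned from anchor cells (toy counts and gene names; not alevin's actual implementation):

```python
def em_with_prior(reads, prior, n_iter=50):
    """Distribute multi-mapping reads among candidate genes.

    reads: list of candidate-gene lists, one per read
    prior: dict gene -> pseudo-count (standing in for the
           information shared from similar cells; toy values)
    """
    genes = sorted(prior)
    est = dict(prior)  # initialize abundances at the prior
    for _ in range(n_iter):
        counts = dict(prior)  # prior acts as pseudo-counts each round
        for cands in reads:
            total = sum(est[g] for g in cands)
            for g in cands:
                counts[g] += est[g] / total  # fractional assignment
        est = counts
    total = sum(est.values())
    return {g: est[g] / total for g in genes}

# two genes, all reads multi-mapping: only the prior can break the tie
reads = [["geneA", "geneB"]] * 10
est = em_with_prior(reads, prior={"geneA": 3.0, "geneB": 1.0})
```

With no uniquely mapping reads, a uniform EM would split the reads evenly; the prior is what resolves the ambiguity, which is exactly the case the abstract targets.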
03:40 PM - 04:00 PM (EDT)
CAMDA - Metagenomic Data Analysis with Probability-Based Reduced Dataset Representation
Metagenomics has become of increasing interest to researchers exploring human health and disease. Higher sequencing throughput and lower associated costs, in turn, have made whole genome sequencing of complex environmental samples a realistic possibility for researchers. The massive datasets produced by these approaches, however, call for new approaches to their processing and interpretation. MetaSUB and CAMDA have partnered to confront this challenge, offering a yearly contest in which participants are asked to predict the origin of unlabeled metagenomic samples by applying machine learning methods to a set of city-labeled whole genome sequencing data. A remaining roadblock, however, is the enormous computational power required to process these massive datasets. Previous research has studied mystery sample prediction using a "reduced-representation subset" containing only 50% of the sample data in the training set, finding relatively little loss of predictive power with the subset. In this paper, we extend the previous work and explore the relationship between mystery sample prediction and reduced-representation subsets at 20% intervals.
03:50 PM - 04:05 PM (EDT)
SST02 - Accelerating Microbiome Science by Empowering Individuals
The human microbiome plays a fundamental role in human health. These microbes, which live on and inside of us, perform a range of critical functions, including vitamin synthesis and the production of short-chain fatty acids. Research over the past decade has highlighted relationships between the microbiome and diseases like Inflammatory Bowel Disease, Nonalcoholic Fatty Liver Disease, Parkinson’s, cancer and more. Concurrently, research has shown a remarkable degree of variability in the microbiome among individuals, with substantially more variation than in people’s genomes. The factors underlying the dynamic ranges of microbes observed are not well understood, nor is it known whether certain configurations are healthier or may predispose individuals to disease. Two major limitations concern data: only a small portion of the human population has been sampled, and the collective set of microbiome studies performed have incompatible standards. Within the American Gut Project, and its global extension The Microsetta Initiative, we are combating these limitations through infrastructure that allows virtually anyone to participate. To date, over 20,000 people have submitted microbiome samples. The data produced, including a detailed voluntary questionnaire, are released de-identified into the public domain for anyone to reuse. In this talk, we discuss the importance of citizen science for the microbiome, and some of the challenges encountered since the project’s inception in the fall of 2012.
03:50 PM - 04:00 PM (EDT)
Text Mining - COVID-19 systems demo panel (part 4)
03:53 PM - 03:55 PM (EDT)
Web - Best practices in online training
03:55 PM - 04:10 PM (EDT)
Web - Best practices from the experience: the Carpentries
04:00 PM - 04:40 PM (EDT)
CAMDA - Towards a metagenomics interpretable model for understanding the transition from adenoma to colorectal cancer
04:00 PM - 04:10 PM (EDT)
Text Mining - COVID-19 systems demo panel (part 5)
04:00 PM - 04:20 PM (EDT)
MLCSB: Machine Learning - CoRE-ATAC: A Deep Learning model for the Classification of Regulatory Elements from single cell and bulk ATAC-seq data
Cis-Regulatory elements (cis-REs) include promoters, enhancers, and insulators that regulate gene expression programs via binding of transcription factors. ATAC-seq technology effectively identifies active cis-REs in a given cell type (including from single cells) by mapping the accessible chromatin at base-pair resolution. However, these maps are not immediately useful for inferring specific functions of cis-REs. For this purpose, we developed a deep learning framework (CoRE-ATAC) with novel data encoders that integrate DNA sequence (reference or personal genotypes) and ATAC-seq read pileups. CoRE-ATAC was trained on 4 cell types (n=6 samples/replicates) and accurately predicted known cis-RE functions from 7 cell types (n=40 samples) that were not used in model training (average precision=0.80). CoRE-ATAC enhancer predictions from 19 human islets coincided with genetically modulated gain/loss of enhancer activity, which was confirmed by massively parallel reporter assays (MPRAs). Finally, CoRE-ATAC effectively inferred functionality of cis-REs from single nucleus ATAC-seq data from human blood-derived immune cells that overlapped well with known functional annotations in sorted immune cells. ATAC-seq maps from primary human cells reveal individual- and cell-specific variation in cis-RE activity. CoRE-ATAC increases the functional resolution of these maps, a critical step for studying regulatory disruptions behind diseases.
04:00 PM - 04:20 PM (EDT)
Function - DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier
Predicting the phenotypes resulting from molecular perturbations is one of the key challenges in genetics. Both forward and reverse genetic screens are employed to identify the molecular mechanisms underlying phenotypes and disease, and these have resulted in a large number of genotype-phenotype associations being available for humans and model organisms. Combined with recent advances in machine learning, it may now be possible to predict human phenotypes resulting from particular molecular aberrations. We developed DeepPheno, a neural-network-based hierarchical multi-class, multi-label classification method for predicting the phenotypes resulting from complete loss-of-function in single genes. DeepPheno uses the functional annotations of gene products to predict the phenotypes resulting from a loss-of-function; additionally, we employ a two-step procedure in which we predict these functions first and then predict phenotypes. Prediction of phenotypes is ontology-based, and we propose a novel ontology-based classifier suitable for very large hierarchical classification tasks. These methods allow us to predict phenotypes associated with any known protein-coding gene. We evaluate our approach using evaluation metrics established by the CAFA challenge and compare with top-performing CAFA2 methods as well as several state-of-the-art phenotype prediction approaches, demonstrating the improvement of DeepPheno over the state of the art.
04:00 PM - 04:20 PM (EDT)
SysMod - Robust Inference of Kinase Activity Using Functional Networks
Recent developments in mass spectrometry (MS) enable high-throughput screening of phospho-proteins across a broad range of biological contexts. Phospho-proteomic data complemented by computational algorithms enable the inference of kinase activity, facilitating the identification of dysregulated kinases in various diseases, including cancer, Alzheimer’s disease and Parkinson’s disease, among others. However, the inadequacy of known kinase-substrate associations and the incompleteness of MS-based phosphorylation data pose important limitations on the inference of kinase activity. With a view to enhancing the reliability of kinase activity inference, we present a network-based framework named RoKAI that integrates various sources of functional information: structural distance, co-evolution evidence, shared kinase associations, and protein-protein interaction networks. By propagating phosphorylation data across these networks, RoKAI obtains representative phosphorylation profiles capturing coordinated changes in signaling. The resulting phosphorylation profiles can be used in conjunction with any existing or future inference method to predict kinase activity. The results of our computational experiments show that RoKAI consistently improves the accuracy of commonly used kinase activity inference methods and makes them more robust to missing kinase-substrate annotations. To provide an easy-to-use interface, RoKAI is available as a web-based tool at http://rokai.ngrok.io.
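The propagation step can be illustrated by smoothing phospho-site quantifications over a functional network, letting unmeasured sites acquire values from their neighborhood (toy network, site names, and weights; RoKAI's actual propagation combines several evidence networks):

```python
def propagate(values, edges, alpha=0.5, n_iter=20):
    """Smooth node values over an undirected network.

    Each node moves toward the mean of its neighbors while
    retaining a fraction alpha of its own measured value
    (0 for unmeasured nodes).
    """
    nodes = set(values)
    for a, b in edges:
        nodes |= {a, b}
    neighbors = {n: [] for n in nodes}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    state = {n: values.get(n, 0.0) for n in nodes}
    for _ in range(n_iter):
        new = {}
        for n in nodes:
            nb = neighbors[n]
            nb_mean = sum(state[m] for m in nb) / len(nb) if nb else 0.0
            new[n] = alpha * values.get(n, 0.0) + (1 - alpha) * nb_mean
        state = new
    return state

# toy network: siteC is unmeasured but linked to two up-regulated sites
values = {"siteA": 1.0, "siteB": 1.0}
edges = [("siteA", "siteC"), ("siteB", "siteC")]
smoothed = propagate(values, edges)
```

The unmeasured site inherits part of its neighbors' signal, which is the mechanism that makes downstream activity inference robust to missing data.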
04:00 PM - 04:20 PM (EDT)
SST01 - Sexual dimorphism in human immune system aging
04:00 PM - 04:40 PM (EDT)
iRNA - iRNA Keynote: The race to the 3’ end: tracking mRNA cleavage diversity in real time
Alternative mRNA processing is increasingly appreciated to play a large role in driving transcriptome variability, disease etiology, cellular identity, and the molecular response to diverse environmental stresses. Although we have extensive insights into the regulatory factors and sequence elements that influence alternative isoform usage, less is known about the temporal dynamics and co-regulation of RNA processing decisions. Intermediate processing events—splicing and 3’ end cleavage—often occur co-transcriptionally, with the interplay between transcriptional elongation rates and rates of each processing event often impacting choices that lead to alternative isoform production. Consequently, measuring the kinetics of these processes may shed light on how early gene regulatory decisions are made. Recent development of high-throughput sequencing techniques that capture nascent RNA over defined temporal intervals has made genome-wide kinetic profiling of RNA maturation possible. Though rates of mRNA splicing have been estimated globally, the rate at which an mRNA is cleaved and polyadenylated to complete the maturation process has never been investigated. Here, we present a novel computational method to estimate genome-wide kinetic parameters for mRNA cleavage rates. This method capitalizes on short-read sequencing data from nascent mRNAs isolated after a time-course of 4sU metabolic labeling to model the rate of mRNA maturation over time. To specifically measure cleavage rates, we first use patterns of read coverage from our sequencing data to approximate the position at which cleavage occurs. We then estimate the fraction of reads derived from cleaved or uncleaved molecules at that site across time to model the rate of 3’ end cleavage over time. We applied this method to nascent RNA-seq data from Drosophila melanogaster S2 cells to estimate polyadenylation-site (PAS) specific rates of mRNA cleavage. 
Our findings shed light on the timing of decisions involved in alternative PAS usage within genes and the variable efficiency of 3’ end cleavage and polyadenylation across genes.
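The rate estimation described can be illustrated by fitting an exponential decay to the uncleaved fraction across the labeling time course (toy, noiseless time points; not the authors' estimator):

```python
import math

def cleavage_rate(times, uncleaved_frac):
    """Estimate the cleavage rate k assuming the uncleaved fraction
    decays as exp(-k * t), via least squares on -log(fraction)
    constrained through the origin."""
    num = sum(t * -math.log(f) for t, f in zip(times, uncleaved_frac))
    den = sum(t * t for t in times)
    return num / den

# toy 4sU time course (minutes) simulated with k = 0.2 per minute
times = [5, 10, 20, 40]
fracs = [math.exp(-0.2 * t) for t in times]
k_hat = cleavage_rate(times, fracs)
```

With real read-count data, the uncleaved fraction at each time point would be estimated from the proportion of reads spanning the cleavage site, and noise would call for a weighted or maximum-likelihood fit.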
04:05 PM - 04:20 PM (EDT)
SST02 - Gamers and experimentalists collaborate on COVID-19
Scientific games allow users to process raw experimental data and submit hypotheses that scientists can validate in the lab. This theme will explore opportunities arising from the creation of new bridges between citizens and scientists, highlighting recent successes with the Foldit project. Citizen scientists have collaborated with scientists in an attempt to design antivirals and anti-inflammatories against COVID-19.
04:10 PM - 04:25 PM (EDT)
Web - Best practices from the experience: Software Sustainability
04:20 PM - 04:30 PM (EDT)
SST01 - Integrated transcriptomic analysis of SLE reveals IFN-driven cross-talk between immune cells
Systemic lupus erythematosus (SLE) is an incurable autoimmune disease that disproportionately affects women and may lead to damage in multiple organs. The marked heterogeneity in its clinical manifestations is a major obstacle to finding targeted treatments, and the involvement of multiple immune cell types further increases this complexity. Thus, identifying molecular sub-types that best correlate with disease heterogeneity and severity, as well as deducing the molecular cross-talk among major immune cell types that leads to disease progression, are critical steps in the development of more informed therapies for SLE. Here we profile and analyze gene expression of six major circulating immune cell types from patients with well-characterized SLE and from healthy subjects. Our results show that the interferon (IFN) response signature was the major molecular feature that classified SLE patients into two distinct groups. We show that the gene expression signature of the IFN response was consistent among different datasets and cell types. For a better understanding of the molecular differences between IFNpos and IFNneg patients, we combined differential gene expression analysis with differential Weighted Gene Co-expression Network Analysis (WGCNA), which revealed a relatively small list of genes. We have also performed integrated WGCNA (iWGCNA) to understand the cross-talk between different immune cell types.
04:20 PM - 04:35 PM (EDT)
SST02 - Crowdsourced design of stabilized COVID-19 mRNA vaccines with Eterna OpenVaccine
While more rapidly deployable than other kinds of vaccines, COVID-19 mRNA vaccines have a problem: their poor chemical stability at refrigerator temperatures precludes worldwide delivery in prefilled syringes, as would be needed for immediate mass immunization. Recent studies have implicated mRNA structure as a critical factor in stability and expression, but our ability to rationally design these desired properties remains primitive. In March 2020, we launched the OpenVaccine challenge on Eterna (https://eternagame.org) to accelerate the design of stabilized, high-expression mRNA vaccines through a worldwide, open competition. The project involves an internet-scale videogame with mRNA design challenges; novel methods for rapid synthesis of 100s to 1000s of long mRNAs per round and for assessment of mRNA degradation and protein expression in human cells; and an IP framework ensuring open use of results while allowing companies to license exclusive redesign of their mRNA vaccine candidates.
04:20 PM - 04:40 PM (EDT)
MLCSB: Machine Learning - Zero-shot imputations across species are enabled through joint modeling of human and mouse epigenomics
Recent large-scale efforts to characterize biochemical activity along the human genome have produced thousands of genome-wide experiments that quantify various forms of biological activity, such as histone modifications, protein binding, and chromatin accessibility. Although these experiments represent a small fraction of the possible experiments that could be performed, the human genome remains the most characterized of any species. We propose an extension to the imputation approach Avocado that enables the model to leverage the large number of human genomic data sets when making imputations in other species. This extension takes advantage of shared synteny and the similar role that many types of biochemical activity have in evolutionarily related species. We found that not only does this extension result in improved imputations of mouse genomics experiments, but that the extended model is able to make accurate imputations for assays that have been performed in humans but not in mice. This ability to make "zero-shot" imputations greatly increases the utility of such imputation approaches, and enables comprehensive imputations to be made for species even when experimental data are sparse.
04:20 PM - 04:40 PM (EDT)
Function - HPOLabeler: Improving Prediction of Human Protein-Phenotype Associations by Learning to Rank
Annotating human proteins with abnormal phenotypes has become an important topic. As of November 2019, fewer than 4,000 proteins had been annotated with the Human Phenotype Ontology (HPO). A computational approach for accurately predicting protein-HPO associations would therefore be valuable, yet no method outperformed a simple naive approach in CAFA2 (the second Critical Assessment of Functional Annotation, 2013-14). We present HPOLabeler, which can use a wide variety of evidence, such as protein-protein interaction (PPI) networks, Gene Ontology (GO), InterPro, trigram frequency and HPO term frequency, in the framework of learning to rank (LTR). Given an input protein, LTR outputs a ranked list of HPO terms from a series of scores assigned to the candidate HPO terms by component learning models (logistic regression, nearest neighbor and a naive method), which are trained on the multiple sources of evidence. We empirically evaluate HPOLabeler extensively through two main experiments, cross-validation and temporal validation, in which HPOLabeler significantly outperformed all component models and competing methods, including the current state-of-the-art method. We further found that 1) PPI is the most informative of the diverse data sources for prediction, and 2) the low prediction performance in temporal validation might be caused by incomplete annotation of new proteins.
04:20 PM - 04:40 PM (EDT)
SysMod - Probabilistic Factor Graph Modeling and Analysis of Biological Networks
Reverse engineering of molecular networks from biological data is one of the most challenging tasks in systems biology. Numerous inference and computational methodologies have been formalized to enhance the deduction of reliable and testable predictions in today’s biology. However, little research has aimed to quantify how well the existing state-of-the-art molecular networks correspond to measured gene expression. We present a computational framework that combines the formulation of a probabilistic graphical model, standard statistical estimation, and integration of high-throughput gene expression data. The model is represented as a probabilistic bipartite graph, which accommodates partial information on diverse biological entities to study and analyze the global behavior of a biological system. This result provides a building block for performing simulations on the consistency between inferred gene regulatory networks and corresponding biological data. We test the applicability of our model by exploring the allowable stable states in two experimentally verified regulatory pathways in Escherichia coli using real microarray expression data from the M3D database. Furthermore, the model is employed to quantify how well the pathways are explained by the extant microarray data. Results show a surprisingly high correlation between the observed states of the experimental data under various conditions and the inferred system’s behavior.
04:25 PM - 04:40 PM (EDT)
Web - Education Summit report
04:30 PM - 04:40 PM (EDT)
SST01 - Semi-connected multilayer perceptron for single-cell profiling
04:40 PM - 04:50 PM (EDT)
MLCSB: Machine Learning - A Combined Species Model using Branched Multitask Routing Networks
Recent development in deep learning has shown unprecedented accuracy in predicting chromatin features using DNA sequences alone. The high prediction accuracy partly stems from the millions of labeled sequences available across whole organs and cell-lines in culture, but for specific biological systems, the number of labeled sequences is often an order of magnitude lower. Acquiring additional labeled sequences from the same species adds little variability. A promising strategy is to combine datasets across species, since flanks around conserved regulatory subsequences tend to vary. Common approaches for combining datasets include transfer learning and multitask learning. However, these approaches do not facilitate common and species-specific attribute extraction from the models. To improve both prediction accuracy and model interpretability, we propose a branched multitask routing network (BMTRN) for cross species chromatin feature prediction. The idea is to split a network into common and species-specific layers via task routing so that shared signals between datasets can be exploited without assuming the species have the same regulatory attributes. We apply BMTRN to ATAC-seq datasets from mouse and human, and show that BMTRN improves prediction accuracy, enhances filter reproducibility, facilitates easier interpretation of cross-species differences, and increases sensitivity in detecting effects of functional variants in silico.
05:00 PM - 05:20 PM (EDT)
Function - Enzymes, Moonlighting Enzymes, Pseudoenzymes: Similar in Sequence, Different in Function
The function of a newly sequenced protein is often estimated by sequence alignment with the sequences of proteins with known functions. However, members of a protein superfamily can share significant amino acid sequence identity but vary in the reaction catalyzed and/or the substrate used. In addition, a protein superfamily can include moonlighting proteins, which have two or more functions, and pseudoenzymes, which have a three-dimensional fold that resembles a conventional catalytically active enzyme, but do not have catalytic activity. I will discuss several examples of protein families that contain enzymes with noncanonical catalytic functions, pseudoenzymes, and/or moonlighting proteins. Pseudoenzymes and moonlighting proteins are widespread in the evolutionary tree and are found in many protein families, and they are often very similar in sequence and structure to their monofunctional and catalytically active counterparts. A greater understanding is needed to clarify when similarities and differences in amino acid sequences and structures correspond to similarities and differences in biochemical functions and cellular roles. This information can help improve programs that identify protein functions from sequence or structure and assist in more accurate annotation of sequence and structural databases, as well as in our understanding of the broad diversity of protein functions.
05:00 PM - 05:20 PM (EDT)
iRNA - RBP-Pokedex: Prediction of RBP knockdown effect via DNN experiment modeling
The genomes of eukaryotes code for hundreds of RNA-binding proteins (RBPs), which regulate the fate of RNA from synthesis to degradation. These RBPs form an extensive network of condition-specific regulation, including direct and indirect regulation of other RBPs. This, combined with technological limitations, makes systematic experimental characterization of their effect on gene expression infeasible. Here, we propose to leverage the available expression measurements in diverse conditions to instead build predictive models for the effect of RBP knockdown on the expression of other RBPs, capturing a condition-specific “RBP state”. We develop a model for prediction of knockdown effects via DNN experiments (Pokedex). We use an unsupervised learning approach in which we construct a variational autoencoder for the expression of 531 RBPs. Training it on the naturally occurring variations of RBP expression across 53 GTEx tissues, we test its ability to predict knockdown effects in ENCODE experiments. Using the expression changes of RBPs compared to control as the test statistic, we show this DNN performs significantly better than a standard PCA-based linear model of variation. Finally, we make the learned model available to the RNA community through a web-tool, RBP-Pokedex (https://tools.biociphers.org/rbp-pokedex), to predict the effect of pan-tissue RBP knockdowns.
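As a rough illustration of the PCA-based linear baseline the VAE is compared against, one can project a perturbed expression profile onto a low-dimensional "RBP state" and reconstruct the remaining RBPs (toy data; function and variable names below are hypothetical, not from the RBP-Pokedex code):

```python
import numpy as np

def pca_knockdown_baseline(X, knock_idx, n_components=2):
    """Toy PCA-based linear baseline for predicting how silencing one
    RBP shifts the expression of the others. All names are illustrative."""
    mu = X.mean(axis=0)
    # principal axes from training expression profiles
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V = Vt[:n_components]                      # (components, genes)
    # simulate a knockdown: set one RBP to its minimum observed level
    x = X[0].copy()
    x[knock_idx] = X[:, knock_idx].min()
    # project onto the low-dimensional "RBP state" and reconstruct
    z = V @ (x - mu)
    return mu + V.T @ z                        # predicted expression of all RBPs

rng = np.random.default_rng(0)
X = rng.normal(size=(53, 10))                  # 53 tissues x 10 RBPs (toy)
pred = pca_knockdown_baseline(X, knock_idx=3)
print(pred.shape)                              # (10,)
```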
05:00 PM - 05:20 PM (EDT)
SysMod - Executable models of pathways built using single-cell RNAseq data reveal immune cell heterogeneity in people living with HIV and atherosclerosis
Single-cell RNA-sequencing (scRNA-seq) enables the profiling of mixed tissues at unprecedented resolution. Clustering of these sequenced cells into cell types is a major challenge, compounded by cell heterogeneity in terms of the signaling pathways that cause observed phenotypes. Knowledge-driven methods utilize cell-type-specific marker genes for clustering. However, these methods do not account for the pathway-level transcriptional differences that define cell types. We propose a knowledge-driven clustering method using Boolean modeling of transcriptional networks. We have previously developed an algorithm called BONITA that infers Boolean rules characterizing pre-defined topologies from bulk transcriptomic data [1]. We demonstrate the utility of our approach on an scRNA-seq dataset of peripheral blood mononuclear cells from HIV+ individuals with and without atherosclerosis. Specifically, we (i) simulated networks and identified reachable attractors, (ii) identified subpopulation-specific pathways dysregulated in HIV-associated atherosclerosis, and (iii) clustered cells in multi-attractor space. We propose that cell clusters such as CD14+ CD16+ monocytes, which are implicated in HIV-associated atherosclerosis, can be characterized based on pathway-level similarities obtained using Boolean pathway models. References: 1. Palli R, Palshikar MG, Thakar J (2019) Executable pathway analysis using ensemble discrete-state modeling for large-scale data. PLoS Comput Biol 15(9): e1007317.
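The notion of reachable attractors can be illustrated on a toy synchronous Boolean network, enumerated exhaustively (BONITA infers the update rules from data and pathway topologies; the two-gene rule below is purely illustrative):

```python
from itertools import product

def attractors(update, n):
    """Enumerate attractors of a small synchronous Boolean network by
    simulating from every state until a previously seen state recurs.
    A toy sketch of the 'reachable attractor' idea, not BONITA itself."""
    found = set()
    for state in product([0, 1], repeat=n):
        trace, s = {}, state
        while s not in trace:
            trace[s] = len(trace)
            s = update(s)
        # states from the first repeat onwards form the attractor cycle
        cycle_start = trace[s]
        cycle = tuple(sorted(t for t, i in trace.items() if i >= cycle_start))
        found.add(cycle)
    return found

# toy 2-gene network: gene A is switched on when B is off; B copies A
update = lambda s: (1 - s[1], s[0])
print(len(attractors(update, 2)))  # 1: all four states fall into one cycle
```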
05:00 PM - 05:20 PM (EDT)
Text Mining - PEDL: Extracting protein-protein associations using deep language models and distant supervision.
Motivation: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on scarce manually labelled data, which severely limits their performance. Results: We propose PEDL, a method for predicting PPAs from text that combines deep language models and distant supervision. Due to its reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods relying solely on manually labelled annotations. We introduce three different data sets for PPA prediction and evaluate PEDL on the two subtasks of predicting PPAs between two proteins and identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three data sets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. Availability: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used data sets and to reproduce the experiments from this paper. Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
05:00 PM - 05:20 PM (EDT)
CAMDA - IDMIL: An alignment-free interpretable deep multiple instance learning (MIL) for predicting disease from whole-metagenomic data
The human body hosts more microbial organisms than human cells. Analysis of this microbial diversity provides key insight into the role these microorganisms play in human health. Metagenomics is the collective DNA sequencing of coexisting microbial organisms in a host, with applications in precision medicine, agriculture, environmental science, and forensics. State-of-the-art predictive models for phenotype prediction from the microbiome rely on alignment, assembly, extensive pruning, taxonomic profiling, and expert-curated reference databases. These processes are time-consuming and discard the majority of DNA sequences from downstream analysis, limiting the potential of whole-metagenome data. We formulate the problem of predicting human disease from whole-metagenomic data using Multiple Instance Learning (MIL), a popular supervised learning paradigm. Our proposed alignment-free approach provides higher prediction accuracy by harnessing the capability of deep convolutional neural networks (CNNs) within a MIL framework, and provides interpretability via a neural attention mechanism. The MIL formulation combined with the hierarchical feature extraction capability of deep CNNs provides significantly better predictive performance than popular existing approaches. The attention mechanism allows for the identification of groups of sequences that are likely to be correlated with diseases, providing much-needed interpretation. Our proposed approach does not rely on alignment, assembly, or manually curated databases, making it fast and scalable for large-scale metagenomic data. We evaluate our method on well-known large-scale metagenomic studies and show that it outperforms comparative state-of-the-art methods for disease prediction.
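The attention mechanism described above can be sketched as softmax-weighted pooling over a bag of instance embeddings (a generic attention-MIL sketch with toy shapes, not the IDMIL implementation):

```python
import numpy as np

def attention_mil_pool(H, w, V):
    """Attention pooling over a bag of instance embeddings: score each
    instance, normalise scores with softmax, and return the weighted bag
    embedding plus the per-instance weights used for interpretation.
    Weight shapes here are illustrative."""
    scores = np.tanh(H @ V) @ w          # one scalar score per instance
    a = np.exp(scores - scores.max())
    a = a / a.sum()                      # attention weights sum to 1
    return a @ H, a                      # bag embedding, instance weights

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))              # 5 sequence instances, 8-dim each
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)
bag, a = attention_mil_pool(H, w, V)
print(bag.shape, round(a.sum(), 6))      # (8,) 1.0
```

The weights `a` are what lets the model point at the sequence groups most associated with the disease label.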
05:00 PM - 05:20 PM (EDT)
MLCSB: Machine Learning - Towards Heterogeneous Information Fusion: Bipartite Graph Convolutional Networks for In Silico Drug Repurposing
Motivation: Mining disease and drug associations and their interactions is essential for developing computational models in drug repurposing and for understanding underlying biological mechanisms. Recently, large-scale biological databases have become increasingly available for pharmaceutical research, allowing deep characterization for molecular informatics and drug discovery. In this study, we propose a bipartite graph convolution network model that integrates multi-scale pharmaceutical information. In particular, the introduction of protein nodes serves as a bridge for message passing, which provides insights into the protein-protein interaction (PPI) network for improved drug repositioning assessment. Results: Our approach combines insights from multi-scale pharmaceutical information by constructing a multi-relational graph of protein–protein, drug–protein and disease–protein interactions. Specifically, our model offers a novel avenue for message passing among diverse domains: we learn useful feature representations for all graph nodes by fusing biological information across interaction edges. The high-level representations of drug–disease pairs are then fed into a multi-layer perceptron decoder to predict therapeutic indications. Unlike conventional graph convolution networks that assume the same node attributes in a global graph, our model is domain-consistent, modeling inter-domain information fusion with a bipartite graph convolution operation. We offer an exploratory analysis for finding novel drug–disease associations. Extensive experiments showed that our approach achieves improved performance over multiple baseline approaches.
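The bipartite convolution at the heart of such a model can be sketched as mean aggregation over interaction edges followed by a learned projection (a generic sketch with illustrative shapes, not the paper's exact propagation rule):

```python
import numpy as np

def bipartite_conv(A, H_src, W):
    """One bipartite graph-convolution step: target nodes (e.g. drugs)
    aggregate mean messages from their source neighbours (e.g. proteins)
    through interaction edges A, then apply a learned projection W."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                    # isolated nodes pass zeros
    msg = (A @ H_src) / deg                # mean over neighbours
    return np.maximum(msg @ W, 0.0)        # ReLU nonlinearity

rng = np.random.default_rng(2)
A = np.array([[1.0, 1.0, 0.0],             # 2 drugs x 3 proteins (toy edges)
              [0.0, 1.0, 1.0]])
H_protein = rng.normal(size=(3, 4))        # protein node features
W = rng.normal(size=(4, 4))
H_drug = bipartite_conv(A, H_protein, W)
print(H_drug.shape)                        # (2, 4)
```

Because source and target nodes live in different domains, each direction of message passing keeps its own weights, which is the "domain-consistent" property the abstract refers to.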
05:00 PM - 05:20 PM (EDT)
SST01 - Interpreting genetic variants through chromatin interaction maps in primary human immune cells
05:00 PM - 05:55 PM (EDT)
FILLING THE GAPS IN ONLINE LEARNING
Group Think Tank - discussion: What is Missing from Online Tools in Support of Online Learning
05:00 PM - 05:10 PM (EDT)
SST02 - Citizen Science and Videogames - An unlikely marriage
Over the last five years at Massively Multiplayer Online Science, we have dedicated ourselves to bringing citizen science activities to major videogames, resulting in some of the most active citizen science projects to date. We'll share the major milestones of this journey, from Project Discovery Proteins to Borderlands Science and beyond.
05:10 PM - 05:20 PM (EDT)
SST02 - Challenges in designing scientific discovery games in AAA games
How do we make a complex scientific task suitable to the universe of a sci-fi fast-paced role-playing first person shooter video game? In this talk, we will describe the challenges and choices that were made when designing the Borderlands Science puzzle game. We will also discuss how scientists and video game designers can efficiently work together to make a citizen science game scientifically relevant and fun!
05:20 PM - 05:40 PM (EDT)
CAMDA - DNA Based Methods in Intelligence - Moving Towards Metagenomics
05:20 PM - 05:30 PM (EDT)
SST02: TBA
05:20 PM - 05:50 PM (EDT)
SST01 - Exploring the functional consequences of macrophage heterogeneity
05:20 PM - 05:40 PM (EDT)
iRNA - Finding the Direct Optimal RNA Barrier Energy and Improving Pathways with an Arbitrary Energy Model
RNA folding kinetics plays an important role in the biological functions of RNA molecules. An important goal in the investigation of the kinetic behavior of RNAs is to find the folding pathway with the lowest energy barrier. For this purpose, most of the existing methods employ heuristics, because the number of possible pathways is huge even if only the shortest (direct) folding pathways are considered. In this study, we propose a new method using a best-first search strategy to efficiently compute the exact minimum barrier energy of direct pathways. Using our method, we can find the exact direct pathways within a Hamming distance of 20, whereas previous methods miss even the exact short pathways. Moreover, our method can be used to improve the pathways found by existing methods for exploring indirect pathways. The source code and datasets created and used in this research are available at https://github.com/eukaryo/czno.
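The best-first strategy for the minimum barrier can be sketched as a Dijkstra-style search in which a path's cost is the maximum energy along it (toy landscape; a sketch of the search idea, not the authors' implementation):

```python
import heapq

def min_barrier(graph, energy, start, goal):
    """Best-first search for the path whose highest-energy intermediate
    state is as low as possible. `graph` maps each structure to its
    neighbours; `energy` gives each structure's energy."""
    best = {start: energy[start]}
    heap = [(energy[start], start)]
    while heap:
        barrier, s = heapq.heappop(heap)
        if s == goal:
            return barrier               # first pop of the goal is optimal
        if barrier > best.get(s, float("inf")):
            continue
        for t in graph[s]:
            b = max(barrier, energy[t])  # barrier = max energy on the path
            if b < best.get(t, float("inf")):
                best[t] = b
                heapq.heappush(heap, (b, t))
    return None

# toy landscape: two routes from A to D, one crossing a lower barrier
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
energy = {"A": 0.0, "B": 5.0, "C": 2.0, "D": 1.0}
print(min_barrier(graph, energy, "A", "D"))  # 2.0 (via C, avoiding B at 5.0)
```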
05:20 PM - 05:40 PM (EDT)
SysMod - Modeling Sorting, Intercalation, and Involution Tissue Behaviors due to Regulated Cell Adhesion
Cell-cell adhesion can dictate tissue growth and multicellular pattern formation, and it is crucial for the cellular dynamics during embryogenesis and cancer progression. While it is known that these adhesive forces are generated by cell adhesion molecules (CAMs), the regulation of CAMs is not well understood due to complex nonlinear interactions that span multiple levels of biological organization, from genetic regulation to whole-organism shape formation. We present a novel continuous model that can explain the dynamic relationships between genetic regulation, CAM expression, and differential adhesion. This approach can demonstrate the mechanisms responsible for cell-sorting behaviors, cell intercalation in proliferating populations, and the involution of zebrafish germ layer cells during gastrulation. The model can predict the physical parameters controlling the amplitude and wavelength of a cellular intercalation interface, as shown in vitro. We demonstrate the crucial role of N-cadherin regulation for the involution and migration of cells beyond the gradient of the morphogen Nodal during zebrafish gastrulation. Integrating the emergent spatial tissue behaviors with the regulation of genes responsible for essential cellular properties such as adhesion will pave the way toward understanding the genetic regulation of large-scale complex pattern and shape formation in developmental, regenerative, and cancer biology.
05:20 PM - 05:40 PM (EDT)
Text Mining - Deep Semi-supervised Ensemble Method for Classifying Co-mentions of Human Proteins and Phenotypes
Identifying human protein-phenotype relations is of paramount importance for uncovering rare and complex diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary for describing disease-related phenotypic abnormalities in humans. Since the experimental determination of HPO categories for human proteins is a highly resource-consuming task, developing automated tools that can accurately predict HPO categories has gained interest recently. In this work, we develop a novel method for classifying sentence co-mentions of human proteins and HPO names using semi-supervised learning and deep neural networks. Our model is a combination of BERT, CNN, and RNN models with a self-learning module that uses a large collection of unlabeled co-mentions available from ProPheno, an online database developed in a previous study. Using a gold-standard dataset composed of curated sentence-level co-mentions, we demonstrate that our proposed model provides state-of-the-art performance in the task of classifying human protein-phenotype co-mentions, outperforming two supervised and semi-supervised support vector machine counterparts. The findings and insights of this work have implications for biocurators, researchers, and bio text mining tool developers.
05:20 PM - 05:30 PM (EDT)
MLCSB: Machine Learning - Learning Context-aware Structural Representations to Predict Antigen and Antibody Binding Interfaces
Understanding how antibodies specifically interact with their antigens can enable better drug and vaccine design, as well as provide insights into natural immunity. Experimental structural characterization can detail the “ground truth” of antibody-antigen interactions, but computational methods are required to efficiently scale to large-scale studies. In order to increase prediction accuracy as well as to provide a means to gain new biological insights into these interactions, we have developed PECAN, a unified deep learning-based framework to predict binding interfaces on both antibodies and antigens. PECAN leverages three key aspects of antibody-antigen interactions to learn predictive structural representations: (1) since interfaces are formed from multiple residues in spatial proximity, we employ graph convolutions to aggregate properties across local regions in a protein; (2) since interactions are specific between antibody-antigen pairs, we employ an attention layer to explicitly encode the context of the partner; (3) since more data is available for general protein-protein interactions, we employ transfer learning to leverage this data as a prior for the specific case of antibody-antigen interactions. We show that PECAN achieves state-of-the-art performance at predicting binding interfaces on both antibodies and antigens, and that each of its three aspects drives additional improvement in the performance.
05:20 PM - 05:40 PM (EDT)
Function - eCAMI: simultaneous classification and motif identification for enzyme annotation
Carbohydrate-active enzymes (CAZymes) are extremely important to bioenergy, human gut microbiome, and plant pathogen research and industry. We developed a new amino acid k-mer based CAZyme classification, motif identification, and genome annotation tool using a bipartite network algorithm. Using this tool, we classified 390 CAZyme families into thousands of subfamilies, each with distinguishing k-mer peptides. These k-mers represent the characteristic motifs of each subfamily, and thus were further used to annotate new genomes for CAZymes. This idea was also generalized to extract characteristic k-mer peptides for all the Swiss-Prot enzymes classified by EC (Enzyme Commission) numbers and applied to enzyme EC prediction. This new tool was implemented as a Python package named eCAMI. Benchmark analysis of eCAMI against the state-of-the-art tools on CAZyme and enzyme EC datasets found that: (i) eCAMI has the best performance in terms of accuracy and memory use for CAZyme and enzyme EC classification and annotation; (ii) the k-mer based tools (including PPR-Hotpep, CUPP, eCAMI) perform better than homology-based tools and deep-learning tools in enzyme EC prediction. Lastly, we confirmed that the k-mer based tools have the unique ability to identify the characteristic k-mer peptides in the predicted enzymes. The paper is published at https://doi.org/10.1093/bioinformatics/btz908.
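The k-mer annotation step can be sketched as assigning a query protein to the subfamily with which it shares the most characteristic k-mer peptides (toy motifs below; the real eCAMI derives subfamilies and motifs via a bipartite network algorithm, which is not reproduced here):

```python
from collections import Counter

def kmers(seq, k=3):
    """All length-k peptides of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def annotate(query, subfamily_motifs, k=3):
    """Assign a query protein to the subfamily sharing the most
    characteristic k-mer peptides with it, or None if nothing matches."""
    q = kmers(query, k)
    hits = Counter({fam: len(q & motifs) for fam, motifs in subfamily_motifs.items()})
    fam, n = hits.most_common(1)[0]
    return fam if n > 0 else None

motifs = {
    "GH5_subfam1": {"WGM", "GMN", "MNE"},   # illustrative motif k-mers
    "GH5_subfam2": {"KDL", "DLP", "LPA"},
}
print(annotate("AAWGMNEAA", motifs))        # GH5_subfam1
```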
05:30 PM - 06:00 PM (EDT)
SST02: Panel Discussion
TBD
05:30 PM - 05:40 PM (EDT)
MLCSB: Machine Learning - PaccMann^RL: Designing anticancer drugs from transcriptomic data via reinforcement learning
While state-of-the-art deep learning approaches have shown potential in generating compounds with desired chemical properties, they disregard the cellular biomolecular properties of the target disease. We introduce a novel framework for de-novo molecular design that systematically leverages systems biology information into the drug discovery process. Embodied through two separate VAEs, the drug generation is driven by a disease context (transcriptomic profiles of cancer cells) deemed to represent the target environment of the drug. Showcased at the task of anticancer drug discovery, our conditional generative model is demonstrated to tailor anticancer compounds to target desired biomolecular profiles. Specifically, we reveal how the molecule generation can be biased towards compounds with high predicted inhibitory effect against individual cell-lines or cell-lines from specific cancer sites. We verify our approach by investigating candidate drugs generated against specific cancer types and find highest structural similarity to existing compounds with known efficacy against these types. Despite no direct optimization of other pharmacological properties, we report good agreement with cancer drugs in metrics like drug-likeness, synthesizability and solubility. We envision our approach to be a step towards increasing success rates in lead compound discovery and finding more targeted medicines by leveraging the cellular environment of the disease.
05:40 PM - 06:00 PM (EDT)
iRNA - The locality dilemma of Sankoff-like RNA alignments
Motivation: Elucidating the functions of non-coding RNAs by homology has been strongly limited due to fundamental computational and modeling issues. While existing simultaneous alignment and folding (SA&F) algorithms successfully align homologous RNAs with precisely known boundaries (global SA&F), the more pressing problem of finding homologous RNAs in the genome (local SA&F) is intrinsically more difficult and much less understood. Typically, the length of local alignments is strongly overestimated and alignment boundaries are dramatically mispredicted. We hypothesize that local SA&F approaches are compromised this way due to a score bias, which is caused by the contribution of RNA structure similarity to their overall alignment score. Results: In the light of this hypothesis, we study local SA&F for the first time systematically, based on a novel local RNA alignment benchmark set and quality measure. First, we vary the relative influence of structure similarity compared to sequence similarity. Putting more emphasis on the structure component leads to overestimating the length of local alignments. This clearly shows the bias of current scores and strongly hints at the structure component as its origin. Second, we study the interplay of several important scoring parameters by learning parameters for local and global SA&F. The divergence of these optimized parameter sets underlines the fundamental obstacles for local SA&F. Third, by introducing a position-wise correction term in local SA&F, we constructively solve its principal issues.
05:40 PM - 06:00 PM (EDT)
SysMod - Closing remarks of the first SysMod day
A brief recap of the first day of the meeting.
05:40 PM - 06:00 PM (EDT)
CAMDA - Day summary
05:40 PM - 06:00 PM (EDT)
Text Mining - A Hybrid Method for Phenotype Concept Recognition using the Human Phenotype Ontology
Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research, despite a few attempts in the past. Previous work has applied both dictionary-based and machine learning-based methods to phenotype concept recognition. Dictionary-based methods can achieve high precision but suffer from low recall. Machine learning-based methods can recognize more phenotype concept variants through automatic feature learning; however, most require large corpora of manually annotated data for model training, which are difficult to obtain due to the high cost of human annotation. To address these problems, we propose a hybrid method combining dictionary-based and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary. We then employ the dictionary-based method and HPO to automatically build a “weak” training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase into a corresponding concept ID. Finally, machine learning-based prediction results are incorporated into the dictionary-based results for improved performance. Experimental results show that our method achieves state-of-the-art performance without any manually labeled training data.
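The high-precision dictionary stage of such a hybrid pipeline can be sketched as longest-match lookup against an HPO term dictionary (toy dictionary with two terms; real systems also normalise case, punctuation, and the full set of ontology synonyms):

```python
def dictionary_match(text, hpo_dict):
    """Longest-match dictionary lookup of HPO terms in a sentence:
    at each token, try the longest candidate span first, emit a match
    and jump past it, otherwise advance one token."""
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):      # prefer the longest span
            phrase = " ".join(tokens[i:j])
            if phrase in hpo_dict:
                found.append((phrase, hpo_dict[phrase]))
                i = j
                break
        else:
            i += 1
    return found

hpo = {"muscle weakness": "HP:0001324", "seizure": "HP:0001250"}
print(dictionary_match("patient shows muscle weakness and seizure", hpo))
# [('muscle weakness', 'HP:0001324'), ('seizure', 'HP:0001250')]
```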
05:40 PM - 05:50 PM (EDT)
MLCSB: Machine Learning - Tissue-guided LASSO for prediction of clinical drug response using preclinical samples
Predicting the clinical drug response (CDR) of cancer patients, based on their clinical parameters and their tumours' molecular profiles, can play an important role in precision medicine. While machine learning (ML) models have the potential to address this issue, their training requires data from a large number of patients treated with each drug, limiting their feasibility for many drugs. One alternative is training ML models on large databases containing molecular profiles of hundreds of preclinical cell lines and their response to hundreds of drugs. Here, we developed a novel algorithm (TG-LASSO) that explicitly incorporates information on samples' tissue of origin with gene expression profiles to predict the CDR of patients using preclinical samples. Using two large databases, we showed that TG-LASSO can accurately distinguish between resistant and sensitive patients for 7 out of 12 drugs, outperforming various other methods. Moreover, TG-LASSO identified genes associated with the drug response, including known targets and pathways involved in the drugs' mechanism of action. Additionally, genes identified by this method for multiple drugs in a tissue are associated with patient survival and can be used to predict their outcome. In summary, TG-LASSO can predict patients’ CDR and identify biomarkers of drug sensitivity and survival.
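One simple way to make a LASSO "tissue-guided" is to remove each sample's tissue-of-origin mean from its expression profile before fitting; the sketch below pairs that with a minimal iterative soft-thresholding (ISTA) solver (an illustration of the idea only, not the actual TG-LASSO algorithm):

```python
import numpy as np

def tissue_center(X, tissues):
    """Subtract each tissue's mean expression from its samples, one
    simple way to inject tissue identity into a linear model."""
    Xc = X.copy()
    for t in set(tissues):
        idx = [i for i, ti in enumerate(tissues) if ti == t]
        Xc[idx] -= X[idx].mean(axis=0)
    return Xc

def lasso_ista(X, y, lam=0.1, lr=0.01, iters=500):
    """Minimal LASSO fit by gradient step plus soft-thresholding."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * X.T @ (X @ w - y) / len(y)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))                         # 60 cell lines x 5 genes
tissues = ["lung"] * 30 + ["breast"] * 30
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)   # only gene 0 drives response
w = lasso_ista(tissue_center(X, tissues), y)
print(np.argmax(np.abs(w)))                          # 0: the causal gene
```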
05:40 PM - 06:00 PM (EDT)
Function - Learning sequence, structure and network features for protein function prediction
Recently, self-supervised and unsupervised representation learning approaches have shown tremendous potential in exploring the huge volume of protein data in many databases and learning features indicative of protein function. Building upon our previous works on sequence, structure, and network data, we propose to systematically investigate the contribution of these features on classification performance on individual GO terms, study their complementarity, and build an integrative model. We compute: sequence-based features using our LSTM language model pre-trained on ~10 million protein sequences; structure-based features from contact maps using a Graph Autoencoder pretrained on ~30k domain structures from CATH and network-based features using our deepNF model pre-trained on 6 different networks from STRING. Then we train a separate neural network (NN) for predicting GO term probabilities on each individual feature set. We will show results from an experiment we conducted on ~18,000 human proteins using their sequences, 3D structures retrieved from PDB and SWISS-MODEL and PPI networks. We will also present the performance results obtained by training a multi-modal NN with all three feature sets for multiple organisms. We show that different modalities contribute to different GO terms, and show that the model integrating information from all sources outperforms the individual models.
05:50 PM - 06:00 PM (EDT)
SST01 - Dissecting the heterogeneity of protein and transcriptional responses in human blood derived immune cells after T- and monocyte-specific activation
Single-cell profiling of activated immune cells is a powerful way to study immune function disruptions in diverse diseases and conditions. However, generating and analyzing single-cell data from activated cells poses unique challenges. To address these challenges and derive reference response genes for frequently studied immune cell activation conditions, we generated CITE-seq data from 10 healthy adults (5 men, 5 women) before and after stimulating their peripheral blood mononuclear cells (PBMCs) via (i) anti-CD3/CD28 and (ii) LPS. Using a comprehensive antibody panel (n=39) including cell type (e.g., CD16, CD14) and cell state (e.g., CD69, CD25) markers, we uncovered how each immune cell type responds to these conditions. Two levels of multiplexing (cell hashing and individuals’ genotypes) were instrumental in avoiding batch effects and eliminating multiplet cells, which increased in number upon activation. All PBMCs responded to stimulation via anti-CD3/CD28, whereas LPS specifically activated monocytes and induced pro-inflammatory genes and pathways. Pseudo-temporal ordering of single cells before and after activation revealed cell- and condition-specific heterogeneity in cellular responses independent of genetic variation. Together, these data are shared within an interactive web application (https://czi-pbmc-cite-seq.jax.org/) and will serve as a resource to guide future studies of immune cell responses.
05:50 PM - 06:00 PM (EDT)
MLCSB: Machine Learning - Quantifying gene selection in cancer through protein functional alteration bias
Compiling the catalogue of genes actively involved in cancer is an ongoing endeavor, with profound implications for the understanding and treatment of the disease. An abundance of computational methods has been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the selection of genes in cancer by assessing the effects of de-novo somatic mutations on protein-coding genes. Using a machine-learning model, we quantified the functional effects of ∼3M somatic mutations extracted from over 10,000 human cancer samples, and compared them against the effects of all possible single-nucleotide mutations in the coding human genome. We detected 593 protein-coding genes showing a statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with known cancer genes, but also include many overlooked genes. FABRIC is designed to avoid false discoveries by comparing each gene to its own background model using rigorous statistics, making minimal assumptions about the distribution of random somatic mutations. The framework is an open-source project with a simple command-line interface.
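The per-gene bias test can be sketched as comparing the mean effect score of observed mutations against the gene's own background of all possible single-nucleotide changes (toy numbers and a plain z-score below; FABRIC's actual statistics are more rigorous):

```python
import math
import statistics

def selection_zscore(observed_effects, background_effects):
    """Per-gene bias test in the spirit of comparing a gene against its
    own background model: z-score of the mean observed effect against
    the background mean, scaled by the standard error of the mean."""
    mu = statistics.mean(background_effects)
    sigma = statistics.pstdev(background_effects)
    n = len(observed_effects)
    return (statistics.mean(observed_effects) - mu) / (sigma / math.sqrt(n))

# toy gene: background of all possible mutations vs. observed mutations
background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
observed = [0.8, 0.9, 0.85, 0.9]      # observed mutations skew harmful
z = selection_zscore(observed, background)
print(round(z, 2))                    # positive z: bias towards harmful mutations
```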
05:55 PM - 06:00 PM (EDT)
Web - Closing of WEB 2020
06:00 PM - 06:10 PM (EDT)
SST01 - Single-cell transcriptomic analysis of SARS-CoV-2 reactive CD4+ T cells
The contribution of CD4+ T cells to protective or pathogenic immune responses to SARS-CoV-2 infection remains unknown. Here, we present large-scale single-cell transcriptomic analysis of viral antigen-reactive CD4+ T cells from 32 COVID-19 patients. In patients with severe disease compared to mild disease, we found increased proportions of cytotoxic follicular helper (TFH) cells and cytotoxic T helper cells (CD4-CTLs) responding to SARS-CoV-2, and a reduced proportion of SARS-CoV-2-reactive regulatory T cells. Importantly, the CD4-CTLs were highly enriched for the expression of transcripts encoding chemokines involved in the recruitment of myeloid cells and dendritic cells to sites of viral infection. Polyfunctional T helper (TH)1 cells and TH17 cell subsets were underrepresented in the repertoire of SARS-CoV-2-reactive CD4+ T cells compared to influenza-reactive CD4+ T cells. Together, our analyses provide unprecedented insights into the gene expression patterns of SARS-CoV-2-reactive CD4+ T cells across distinct disease severities.
06:10 PM - 06:20 PM (EDT)
SST01 - Single-cell transcriptomic analysis of allergen-specific T cells in allergy and asthma
CD4+ helper and regulatory T cells that respond to common allergens play an important role in driving and dampening airway inflammation in patients with asthma. Until recently, direct, unbiased molecular analysis of allergen-reactive T cells has not been possible. To better understand the diversity of these T cells in allergy and asthma, we analyzed the single-cell transcriptome of ~50,000 house dust mite (HDM) allergen-reactive T cells from asthmatics with HDM allergy and from three control groups: asthmatics without HDM allergy and non-asthmatics with and without HDM allergy. Our analyses show that HDM allergen-reactive T cells are highly heterogeneous, and certain subsets are quantitatively and qualitatively different in subjects with HDM-reactive asthma. The number of interleukin-9-expressing HDM-reactive TH cells is greater in asthmatics than in non-asthmatics with HDM allergy, and these cells display enhanced pathogenic properties. More HDM-reactive TH and Treg cells expressing the interferon-response signature are present in asthmatics without HDM allergy. In cells from these subsets, expression of TNFSF10 was enriched; its product, TRAIL, dampens activation of TH cells. These findings suggest that these subsets may dampen allergic responses, which may help explain why only some people develop TH2 responses to nearly ubiquitous allergens.
09:30 AM - 10:30 AM (EDT)
Keynotes - Computational modeling of tumor immunity
10:40 AM - 11:20 AM (EDT)
Function - Keynote: Gene function prediction using unsupervised biological network integration

Biological networks have the power to map cellular function, but only when unified to overcome their individual limitations such as bias and noise. Unsupervised network integration addresses this, automatically weighting input information to obtain an accurate, unified result. However, existing unsupervised network integration methods do not adequately scale to the number of nodes and networks present in genome-scale data and do not handle frequently encountered data characteristics (e.g. partial network overlap). To address this, we have developed an unsupervised deep learning-based network integration algorithm that incorporates recent advances in reasoning over unstructured data – namely the Graph Convolutional Network (GCN) – that can effectively learn dependencies between physical, co-expression and genetic interaction network topologies. Our method, BIONIC (Biological Network Integration using Convolutions), produces high quality gene and protein features which capture and unify information across many diverse functional interaction networks. BIONIC learns features which contain substantially more functional information compared to existing approaches, linking genes and proteins that share co-complex, pathway and bioprocess relationships.

10:40 AM - 11:00 AM (EDT)
iRNA - Detection of differential RNA modifications from direct RNA sequencing of human cell lines
Differences in RNA expression can provide insights into the molecular identity of a cell and pathways involved in human diseases. RNA modifications such as m6A have been found to contribute to molecular functions of RNAs. However, quantification of differences in RNA modifications has been challenging. Here, we present a computational method, xPore, to identify differential RNA modifications from direct RNA sequencing data. Based solely on the current intensity profiles, we extend a standard two-Gaussian mixture model to accommodate multi-sample comparisons. For each site, the model learns two distributions, corresponding to unmodified and modified RNA, with signal properties that are shared across samples, while allowing the probability of each read being modified to be inferred specifically for each sample. Having incorporated prior knowledge into the model, we are able to determine the signal distributions of the modified k-mers and quantitatively estimate the modification rates accordingly. We evaluate our method on transcriptome-wide m6A profiling, demonstrating that we can accurately prioritize differentially modified sites. Together, we demonstrate that RNA modifications can be quantitatively identified from direct RNA-sequencing data with high accuracy, opening many new opportunities for large scale applications in precision medicine. xPore is available at https://github.com/GoekeLab/xpore.
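The extended two-Gaussian mixture can be illustrated with a minimal single-sample EM fit. The function below is a hypothetical sketch, not xPore's implementation (which shares signal parameters across samples and incorporates k-mer priors):

```python
import numpy as np

def fit_two_gaussians(x, n_iter=200):
    """Fit a 1D two-component Gaussian mixture by EM (illustrative only).

    Returns (means, stds, weights, responsibilities); the weight of the
    higher-mean component plays the role of a modification rate.
    """
    x = np.asarray(x, dtype=float)
    # Initialise the two components at the 25th / 75th percentiles.
    mu = np.percentile(x, [25, 75]).astype(float)
    sigma = np.array([x.std(), x.std()]) + 1e-6
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        dens = (w / (sigma * np.sqrt(2 * np.pi))
                * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and standard deviations.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return mu, sigma, w, resp
```

Applied to simulated current intensities drawn from two well-separated components, the estimated mixing weights recover the per-sample proportion of "modified" reads.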
10:40 AM - 11:00 AM (EDT)
SysMod - A stochastic hybrid model for DNA replication incorporating protein mobility dynamics
DNA replication is a complex process that ensures genetic information maintenance. As recently observed, DNA replication timing is highly correlated with chromatin folding and global nuclear architecture. Here, we present a stochastic hybrid model of DNA replication that incorporates protein mobility and three-dimensional chromatin structure. Our model provides a framework to realistically simulate DNA replication for a complete eukaryotic genome and investigate the relationship between three-dimensional chromatin conformation and replication timing. Performing simulations for three model variants and a broad range of parameter values, we collected about 300,000 in silico replication profiles for fission yeast. We find that the number of firing factors initiating replication is rate-limiting and dominates the DNA replication completion time. We also find that explicitly modeling the recruitment of firing factors by the spindle pole body best reproduces experimental data, and we provide an independent validation of these findings in vivo. Further investigation of replication kinetics confirmed earlier observations of a rate-limiting number of firing factors in conjunction with their recycling upon replication fork collision. While the model faithfully reproduces global patterns of replication initiation, additional analysis of firing concurrence suggests that a uniform binding probability is too simplistic to capture local neighborhood effects in origin firing.
10:40 AM - 11:00 AM (EDT)
Evolution and Comparative Genomics - FastMulRFS: Fast and accurate species tree estimation under generic gene duplication and loss models
Motivation: Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed. Results: We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods.
10:40 AM - 11:00 AM (EDT)
SST03 - Tackling the challenge of elucidating reef-building coral holobiont function
Reef-building corals form massive calcium carbonate structures that are visible from space and support invaluable reef goods and services. These corals are metaorganisms whose functionality comes from the holobiont integration of the cnidarian host, its primary dinoflagellate endosymbionts (family Symbiodiniaceae), and a multitude of bacteria, archaea, viruses, and fungi that are surface, tissue, and skeletal dwelling symbionts. The study of this symbiotic system as a fascinating functional puzzle has intensified with the rapid decline in coral cover and growing threat of reef extinction due to climate change factors, such as increasing seawater temperatures and ocean acidification. In particular, the high frequency of mass bleaching events, where thermal stress causes dysbiosis and the loss or expulsion of the Symbiodiniaceae and mass coral mortality, has sharpened our focus on these threatened organisms. Technological advances have allowed a rapid expansion of coral and symbiont sequence data for studies of coral response to stress, but bioinformatic challenges remain in analyzing and interpreting this wealth of metaorganism data.
10:40 AM - 11:20 AM (EDT)
Education - COSI Education Keynote Talk: Empowering, usable, and comprehensive bioinformatics training
With the explosion of biological data, the primary challenge is not how to store the data, nor what computational resources to use to process them, but researchers' general lack of the skills needed to manipulate and analyse these data. This problem could be solved with comprehensive and up-to-date bioinformatics training. How can we build such an infrastructure? Over the last few years, I have had the opportunity to lead and contribute to several training programs and communities in bioinformatics. In this talk, I will share some lessons learned and ideas for building a usable, comprehensive, and empowering bioinformatics training infrastructure. I will talk about community-driven development of free, FAIR and reusable material, online and hybrid training, and mentoring, and further illustrate these concepts with concrete examples from my work.
10:40 AM - 11:40 AM (EDT)
MLCSB: Machine Learning - Finding concise descriptors of genomic data
Genome-scale technologies can measure thousands or even millions of molecules, but the individual measurements are often noisy readouts of the internal biological state. For many computational approaches the ultimate goal is to infer the true biological state parameters from their effects on a much larger set of experimental readouts. Indeed, much of genomic data analysis can be viewed as a special case of dimensionality reduction. Unlike a generic dimensionality reduction problem, however, in the biological case we have extensive prior knowledge about the data-generating process. In this talk we will discuss several methods that exploit this prior knowledge to create interpretable low-dimensional representations for genome-scale datasets.
10:45 AM - 11:40 AM (EDT)
CAMDA - CAMDA KEYNOTE: Challenges and tools for integrative analysis of single cell 'omics data
Single-cell multi-'omics data have the potential to provide unprecedented insights into the molecular, spatial and cellular organization of tissue during health and disease. Whilst approaches for integrating single-cell 'omics data are emerging, the sparsity, high dimensionality, heterogeneity and scale of these data types present unique challenges. Due to the scale of single-cell data, dimension reduction is an essential preliminary step in analysis pipelines, data integration, cell clustering and trajectory analysis. Principal component analysis (PCA) is widely used for this because it is relatively fast and can easily scale to large datasets when used with sparse-matrix representations. I will review different forms of PCA, the impact of scaling and log-transforming, and how to recognize problems such as the horseshoe or arch effect. I will describe an alternative to PCA which we have recently implemented in a new Bioconductor package, Corral, and show how simple replacement of PCA with corral can improve integrative analysis of single-cell data. I will describe challenges we have encountered and how we are re-envisioning tools we developed for bulk RNA-seq data for emerging single-cell data types.
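The preprocessing steps mentioned here (log-transform, then PCA) can be sketched in a few lines. This is a dense, illustrative version only; real single-cell pipelines keep the counts matrix sparse and use truncated or randomized SVD at scale:

```python
import numpy as np

def log_pca(counts, n_components=2):
    """PCA of a cells-by-genes count matrix after a log1p transform.

    Dense sketch for illustration: centre each gene, take the SVD,
    and return the leading cell embeddings.
    """
    x = np.log1p(np.asarray(counts, dtype=float))
    x = x - x.mean(axis=0)                          # centre each gene
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]   # cell embeddings
```

Because the singular values are returned in decreasing order, the first embedding column always captures the most variance.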
11:00 AM - 11:20 AM (EDT)
iRNA - RNA editing landscapes: a new model for biomarkers discovery in neurological disease
Current studies look at genomic variants or differential gene expression to predict genetic predisposition and find biomarkers for many neurodevelopmental, psychiatric and degenerative disorders. Here we explore a novel approach to elucidate potential diagnostic, prognostic and/or therapeutic biomarkers for major depressive disorder and suicide, focusing on a more nuanced aspect of transcriptome diversity: RNA editing, more specifically adenosine deaminase acting on RNA (ADAR) editing, which contributes to transcriptome diversity by dynamically altering the ratios of differentially functioning proteins underpinning the “fine-tuning” of neural signaling and synaptic plasticity. We apply a novel approach utilizing an item response theory model, the Guttman scale, to create ADAR editing landscape profiles, which are then used to map differential editing in major depressive disorder and suicide. We were able to find a handful of genes of interest for further investigation into their direct contribution to neurological symptoms. We also highlight pathways, including ion homeostasis, that are altered in depression and suicide and warrant further investigation for their role in synaptic plasticity. Furthermore, we provide evidence that this model can be used in biomarker discovery for many other neurological disorders in which transcriptome diversity plays an important role.
11:00 AM - 11:40 AM (EDT)
SysMod - Quantitative and Systems Pharmacology (QSP) and Model-Informed Drug Development (MIDD) of a “Smart” Insulin
One of the many challenges of drug development programs is the strictly limited volume of pharmacological data for investigational drug candidates. To help manage the risk of exposing volunteers to drug candidates of unknown safety and efficacy while evaluating hypothesized benefits, programs seek to integrate all relevant data, including molecular and cellular profiling data and clinical trial electronic data capture. The integration of relevant data has been used successfully to help guide decisions at each step of the development process. The success of such efforts across the industry is evidenced by the FDA’s launch of a MIDD pilot program to facilitate the use of quantitative methods to ultimately help improve clinical trial efficiency and optimize therapeutic individualization in the absence of dedicated trials. The integration of quantitative methods was used to guide validation of putative glucose dysregulation targets and to guide the design of a complex clinical trial of glucose control. Although every program has individual challenges, these case studies will be discussed to highlight open questions in the application of quantitative systems biology in drug development.
11:00 AM - 11:20 AM (EDT)
SST03 - Dynamic genome evolution of coral symbionts: a bioinformatics challenge
Coral reefs are hotspots of marine biodiversity. The ecological success of corals in nutrient-poor waters relies on photosynthetic algal symbionts (Symbiodiniaceae) for the supply of fixed carbon as energy, and nutrients. However, the evolution of these algae, and its implications for coral evolution, is poorly understood. Genomes of Symbiodiniaceae present a bioinformatics challenge because of their immense sizes (1-3 Gbp) and highly idiosyncratic genome features. In this talk, I will present our recent effort to generate 12 de novo genome assemblies from diverse Symbiodiniaceae species and their free-living relative, and how we developed a customised computational workflow for predicting genes from these genomes. Comparative analysis reveals unexpectedly high sequence and structural divergence, and conserved lineage-specific gene families of unknown function. Many genes are encoded in unidirectional clusters, and some critical functions are encoded in tandemly repeated single-exon genes. Our results underscore the rapid evolution of coral symbionts that comprise a massive, undiscovered phylogenetic diversity, and elucidate how selection acts within the context of a complex genome structure to facilitate local adaptation, e.g. by enhancing transcriptional responses. These outcomes provide an important reference for research on coral holobionts and their resilience in changing environments.
11:00 AM - 11:20 AM (EDT)
10 lessons we learned in 10 days about clinical data interoperability in the COVID crisis
There is an urgent need to take what we have learned in our new data-driven era of medicine, and use it to create a new system of precision medicine, delivering the best, safest, cost-effective preventative or therapeutic intervention at the right time, for the right patients. Dr. Butte's lab at the University of California, San Francisco builds and applies tools that convert trillions of points of molecular, clinical, and epidemiological data -- measured by researchers and clinicians over the past decade and now commonly termed “big data” -- into diagnostics, therapeutics, and new insights into disease. Dr. Butte, a computer scientist and pediatrician, will highlight his center’s recent work on integrating electronic health records data across the entire University of California, and how analytics on this “real world data” can lead to new evidence for drug efficacy, new savings from better medication choices, and new methods to teach intelligence – real and artificial – to more precisely practice medicine.
11:00 AM - 11:10 AM (EDT)
Evolution and Comparative Genomics - Species tree estimation with selection
Species trees form a basis for addressing many biological questions and in recent years have been crucial to our understanding of the divergence process and speciation. Although modeling the species tree is challenging, the last two decades have seen an explosion of sophisticated models for it. The multispecies coalescent model has arisen as a leading framework, but alternative ones have been proposed. The polymorphism-aware phylogenetic models (PoMos), in particular, offer the advantage of naturally accounting for incomplete lineage sorting and being flexible modeling-wise. PoMo extends any DNA substitution model to account for polymorphisms by expanding the state space to include polymorphic states. A Moran process is used to model genetic drift. PoMo performs well for species tree estimation as it can accurately and time-efficiently estimate the parameters describing evolutionary patterns for phylogenetic trees of any shape (species trees, population trees, or any combination of those). Here, we extend PoMo to incorporate allelic directional selection. Our results show that one might overestimate the divergence among evolving species when directional selection is unaccounted for. We developed a Bayesian framework for PoMo with allelic selection, which permitted us to infer pervasive signatures of nucleotide usage biases in great ape and fruit fly genomes.
11:10 AM - 11:20 AM (EDT)
Evolution and Comparative Genomics - Graph Splitting: A Graph-Based Approach for Superfamily-Scale Phylogenetic Tree Reconstruction
A protein superfamily contains distantly related proteins that have acquired diverse biological functions through a long evolutionary history. Phylogenetic analysis of the early evolution of protein superfamilies is a key challenge because existing phylogenetic methods show poor performance when protein sequences are too diverged to construct an informative multiple sequence alignment (MSA). Here, we propose the Graph Splitting (GS) method, which rapidly reconstructs a protein superfamily-scale phylogenetic tree using a graph-based approach. Evolutionary simulation showed that the GS method can accurately reconstruct phylogenetic trees and be robust to major problems in phylogenetic estimation, such as biased taxon sampling, heterogeneous evolutionary rates, and long-branch attraction when sequences are substantially diverged. Its application to an empirical data set of the triosephosphate isomerase (TIM)-barrel superfamily suggests rapid evolution of protein-mediated pyrimidine biosynthesis, likely taking place after the RNA world. Furthermore, the GS method can also substantially improve performance of widely used MSA methods by providing accurate guide trees. The GS method is freely available at our website: http://gs.bs.s.u-tokyo.ac.jp/ Reference: Motomu Matsui and Wataru Iwasaki. Systematic Biology, 69, 265-279 (2020). https://doi.org/10.1093/sysbio/syz049
11:20 AM - 11:40 AM (EDT)
Function - Cross-species functional prediction by global network alignment
We report the first successful cross-species prediction of protein function based solely on topology-driven global network alignment. Using SANA (the Simulated Annealing Network Aligner), we pair the proteins of one species with those of another solely by maximizing the number of aligned edges between the networks. We find that SANA’s confidence, called NAF, in each individual pair of aligned proteins correlates with their functional similarity. We then apply SANA to BioGRID 3.0 networks from April 2010, and use GO data from the same month to transfer GO annotations from better-annotated proteins to lesser-annotated ones. We validate the predictions on a recent GO release and find an AUPR of up to 0.4 depending on the predicting GO evidence code, even when restricting predictions to proteins that have no observed sequence or homology relationship. Finally, we apply the same method to recent BioGRID PPI networks of mouse and human, and predict novel cilia-related GO terms in human proteins based on their confident alignment with cilia-annotated mouse proteins; the most confident predictions have literature validation rates above 80%. We propose topology-based alignment of PPI networks as a novel source for prediction of protein function that is independent of sequence or structural information.
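The topology-only objective described here — the number of aligned edges under a candidate node mapping — can be made concrete with a toy evaluation function. The names below are illustrative, not SANA's API; the annealer's job is to search over mappings that maximize this count:

```python
def aligned_edges(edges_a, edges_b, mapping):
    """Count edges of network A whose endpoints map onto an edge of B.

    This is the topology-only quantity a global aligner maximizes;
    here we merely evaluate it for one fixed node mapping.
    """
    b_edges = {frozenset(e) for e in edges_b}
    return sum(frozenset((mapping[u], mapping[v])) in b_edges
               for u, v in edges_a)
```

For example, mapping a two-edge path onto a network that shares only one of those edges scores 1.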
11:20 AM - 11:40 AM (EDT)
SST03 - Pipeline for discovery of membrane receptors in non-model organisms: the case of Pocillopora damicornis
Mining of the large coral and symbiont sequence data is a challenge but also provides an opportunity for understanding the functionality and evolution of these and other non-model organisms. Far more information is available for human than for any other eukaryotic species, especially related to signal transduction and disease. However, the coral cnidarian host and human diverged over 700 million years ago, and homologies between proteins are therefore often in the gray zone or undetectable with traditional BLAST searches. Through remote homology detection using hidden Markov models, we can identify putative candidate human homologues in the cnidarian genome. However, for many proteins, the human genome alone contains multiple family members with similar or greater sequence divergence. To illustrate and address this challenge, we developed a pipeline for mapping membrane receptors in human to membrane receptors in corals. We will discuss functional implications for the presence and absence of membrane receptor homologues for communication across different cell types of the cnidarian host and across the microbiome. This pipeline may prove generally useful, e.g. for other protein families or types of functions important to both corals and humans, such as wound healing and biomineralization, and for other non-model organisms including plants, where discovery of human homologues can open novel avenues for agricultural applications.
11:20 AM - 11:40 AM (EDT)
Education - Online learning from EMBL-EBI with the new and improved Train online
EMBL-EBI’s online learning platform, Train online, has a new look. Since its development in 2011, it has grown into a resource accessed by nearly 600,000 people per year globally. Containing over 80 courses, it provides training on EMBL-EBI data resources and tools, and basic concepts in bioinformatics and data analysis. Our aim in re-development was to provide increased interactivity and improved performance for our learners, along with an updated style. This enables online learning from EMBL-EBI to be more engaging, user friendly and ultimately have greater learner impact. Prior to the redesign, we worked with trainees and authors to identify challenges to taking and writing a course. A number of platforms were considered, and we ultimately decided on WordPress. This provides a clean, modern look for trainees and simple course development for trainers. Increased interactivity has been added using H5P, enabling quick creation of games, quizzes and other interactive content to support and assess learning. Feedback on the first updated course has been extremely positive, indicating improved engagement and retention of trainees. It was also pivotal to our design revisions before finalising the new look. Further development of Train online is planned, focusing on personalisation and collaboration in online learning.
11:20 AM - 11:30 AM (EDT)
Evolution and Comparative Genomics - Gaps and runs in syntenic alignments
Gene loss is the obverse of novel gene acquisition by a genome through a variety of evolutionary processes. It serves a number of functional and structural roles, compensating for the energy and material costs of gene complement expansion. A type of gene loss widespread in the lineages of plant genomes is “fractionation” after whole genome doubling or tripling, where one of a pair or triplet of paralogous genes in parallel syntenic contexts is discarded. The detailed syntenic mechanisms of gene loss, especially in fractionation, remain controversial. We focus on the frequency distribution of gap lengths (number of deleted genes – not nucleotides) within syntenic blocks calculated during the comparison of chromosomes from two genomes. We mathematically characterize a simple model in some detail and show that it adequately describes neither the Coffea arabica subgenomes nor its two progenitor genomes. We find that a mixture of two models fits well: a random, one-gene-at-a-time model and an excision model that removes a geometrically distributed number of genes.
11:20 AM - 11:30 AM (EDT)
iRNA - BORE - Detecting RNA Editing Events comfortably
RNA editing and its regulation are increasingly being recognized as important mechanisms in disease pathogenesis. With the growing availability of deep RNA-Seq datasets, global detection of RNA editing events has become achievable. However, despite advances, reliably detecting RNA editing events in the transcriptome remains difficult. Here we introduce Balanced Output of RNA Editing (BORE), a cloud-native application that provides researchers with the ability to process their own high-throughput sequencing data and find potential RNA editing sites. The application is written in Python and Go, and is hosted on Amazon Web Services (AWS). It uses AWS Simple Storage Service (S3) to store the raw FASTQ files, processes them through the BORE pipeline to generate candidate RNA editing sites, and makes the results available for the researcher to download. The internal workflow of BORE is: download a FASTQ file, align it to a reference genome with HISAT2, and then preprocess into a filtered high-quality BAM file. The main processing workflow then takes over. We filter out known Single Nucleotide Variants (SNVs), generate an internal representation of the putative editing sites, and represent them as machine-readable files like VCF and as a summary report.
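The SNV-filtering step in this workflow can be sketched as follows; the function, data shapes, and VCF columns below are illustrative assumptions, not BORE's actual interfaces:

```python
def filter_editing_candidates(candidates, known_snvs):
    """Drop candidate sites that coincide with known SNVs.

    `candidates` maps (chrom, pos) -> (ref, alt) for putative editing
    sites; `known_snvs` is a set of (chrom, pos). Survivors are emitted
    as minimal VCF-style lines, one per site.
    """
    lines = []
    for (chrom, pos), (ref, alt) in sorted(candidates.items()):
        if (chrom, pos) in known_snvs:
            continue  # likely a genomic variant, not an RNA editing event
        lines.append(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\tPASS\t.")
    return lines
```

Sites that overlap a known variant are excluded, since an A-to-G mismatch at a known SNV position is more plausibly genomic than an ADAR editing event.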
11:20 AM - 11:40 AM (EDT)
Network Medicine Framework for Drug Repurposing
Repurposing approved drugs for novel uses is evolving as a significant approach for identifying new, effective treatments for known diseases and for new diseases. We have applied the principles of network medicine to the identification of drug targets that may be effectively repurposed, and done so by utilizing the comprehensive protein-protein interaction network (interactome) as the basis for the analysis. To do so, we first recognize that unique (clustered) subnetworks or disease modules exist within this interactome for each disease. Next, we create a bipartite graph based on the assumption that the proximity of a drug target to the disease module of interest suggests the potential utility of the target’s drug for the disease of interest. We then use network proximity, network diffusion, and AI-based network metrics to rank all approved drugs with respect to their likely efficacy in the disease of interest; aggregate all predictions to create an integrated rank order; test its statistical predictive accuracy with ground truth examples; and arrive at lead candidate drugs for the disease of interest. Lastly, we test select members of this candidate list in relevant in vitro experiments as a proof-of-concept. We believe this unique approach to repurposing approved drugs holds promise as an effective, rapid, relatively inexpensive strategy for deploying ‘new’ treatments safely.
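The simplest "closest-distance" form of the network proximity mentioned above can be sketched with plain BFS over an adjacency dict. This is a toy with hypothetical names; published proximity measures typically also z-normalize against degree-matched random target sets:

```python
from collections import deque

def proximity(adj, targets, disease_genes):
    """Mean distance from each drug target to its closest disease-module gene.

    `adj` is an adjacency dict of the interactome; a target with no
    reachable disease gene contributes infinity.
    """
    def bfs(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    total = 0.0
    for t in targets:
        d = bfs(t)
        total += min((d[g] for g in disease_genes if g in d),
                     default=float("inf"))
    return total / len(targets)
```

A drug whose targets sit close to the disease module scores low, which is the signal used to rank candidates for repurposing.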
11:30 AM - 11:40 AM (EDT)
Evolution and Comparative Genomics - A Solution to the Labeled Robinson-Foulds Distance Problem
Gene trees are extensively used, notably for inferring the most plausible scenario of evolutionary events leading to the observed gene family from a single ancestral gene copy. This has important implications for elucidating the functional relationship between gene copies. For this purpose, reconciliation enables the labeling of internal nodes in the gene tree with the type of event at the origin of each gene tree bifurcation. The variety of phylogenetic inference methods, leading to different and potentially inconsistent trees for the same dataset, warrants the design of appropriate tools for comparing them. While comparing reconciled gene trees remains a largely unexplored field, a large variety of measures has been developed for comparing unlabeled evolutionary trees. Among them, despite its limitations, the Robinson-Foulds (RF) distance remains the most widely used, mostly due to its computational efficiency. In this paper, we report on a Labeled Robinson-Foulds edit distance, which maintains desirable properties such as being computable exactly in linear time. Further, we show that this new distance is computable for an arbitrary number of label types, thus making it useful for applications involving more label types than speciations and duplications.
11:30 AM - 02:00 PM (EDT)
iRNA - iRNA: Poster Session
12:00 PM - 12:20 PM (EDT)
SST03 - Metabolic complementarity in phylogenetically divergent coral holobionts
Coral holobiont members – which include the coral host, the algal symbiont and a poorly characterized consortium of diverse microbial taxa – have different generation times and effective population sizes. These differences can have pronounced evolutionary and ecological consequences for holobiont members, especially under a rapidly changing climate. Corals tend to have slow generation times relative to their microbes, on the order of decades. Algal symbionts with different physiological adaptations to temperature are known to shift in abundance within a host in response to perturbations (e.g., global warming) over months to years. Coral-associated prokaryotic microbial assemblages not only have very fast generation times, but their community composition can shift rapidly under changing environmental conditions. Metabolic complementarity analysis has revealed a keystone role for the coral microbiome in holobiont homeostasis. We performed temperature stress experiments on three Caribbean coral holobionts and examined their response via metatranscriptome sequencing. Our data have uncovered genes with lineage-specific expression level adaptation in both host and photosymbiont taxa as well as the potential roles for bacterial associates. We have only scratched the surface of complex holobiont dynamics, yet these studies highlight the need for additional data and a comprehensive theoretical framework that integrates ecological and evolutionary time scales.
12:00 PM - 12:20 PM (EDT)
ODSS - Machine Reading for Precision Medicine
The advent of big data promises to revolutionize medicine by making it more personalized and effective, but big data also presents a grand challenge of information overload. For example, tumor sequencing has become routine in cancer treatment, yet interpreting the genomic data requires painstakingly curating knowledge from a vast biomedical literature, which grows by thousands of papers every day. Electronic medical records contain valuable information to speed up clinical trial recruitment and drug development, but curating such real-world evidence from clinical notes can take hours for a single patient. Natural language processing (NLP) can play a key role in interpreting big data for precision medicine. In particular, machine reading can help unlock knowledge from text by substantially improving curation efficiency. However, standard supervised methods require labeled examples, which are expensive and time-consuming to produce at scale. In this talk, I'll present Project Hanover, where we overcome the annotation bottleneck by combining deep learning with probabilistic logic, and by exploiting self-supervision from readily available resources such as ontologies and databases. This enables us to extract knowledge from millions of publications, reason efficiently with the resulting knowledge graph by learning neural embeddings of biomedical entities and relations, and apply the extracted knowledge and learned embeddings to supporting precision oncology.
12:00 PM - 12:20 PM (EDT)
CAMDA - Effect of Tumor Purity on The Analysis of Gene Expression Data
The tumor microenvironment comprises tumor cells, stroma cells, immune cells, blood vessels, and other associated non-cancerous cells. Gene expression measurements on tumor samples are an average over cells in the microenvironment. However, research questions often seek answers about tumor cells rather than the surrounding non-tumor tissue. In this study, we investigate the effect of tumor purity – the proportion of tumor cells in a solid tumor sample – on the performance of two statistical methods used for analyzing gene expression data: differential network (DN) analysis and differential gene expression analysis (DGEA). We perform the DN analysis on a breast invasive carcinoma dataset. The results reveal cancer-progression-related pathways when analyzing the high-purity samples, whereas many non-cancer-related pathways are obtained when analyzing the complete dataset. In addition, a simulation study is conducted to assess the effect of replacing low-tumor-purity samples, compared to random sampling variability, with two additional datasets: head and neck squamous cell carcinoma (HNSC) and lung squamous cell carcinoma (LUSC). The approach described in this study provides a general strategy for assessing the effect of tumor purity on any gene expression data analysis and can be applied to other types of cancers.
12:00 PM - 12:20 PM (EDT)
Education - HTrainDB: H3Africa Training Database
One of the main aims of the H3Africa Consortium is to improve research capacity in genomics, bioinformatics and health in Africa. A number of training initiatives in the form of workshops, seminars and related activities are organized by the Consortium with a view to strengthening its research capacity and training world-class trainers. Given the large number of training activities and participants, it is imperative to maintain a centralized record of all pertinent information. The database was created to monitor the training efforts and evaluate the effectiveness of this research capacity building initiative. HTrainDB has several functionalities, including capturing several aspects of the consortium membership, registration for Consortium meetings, advertising and recording training activities, creating meeting polls, and creating survey questionnaires and webforms for various purposes. It also lists the publications of all members and highlights their career timelines. It provides a centralized system to keep records related to the consortium membership, projects and activities. To this end it can capture aggregate data and showcase capacity building in genomics research in Africa. HTrainDB has more than 600 members and can serve as an example for other research projects or consortia with large memberships of how to manage and archive research-related administrative data.
12:00 PM - 12:20 PM (EDT)
MLCSB: Machine Learning - Causal network learning using a semi-supervised approach
Causal interplay between molecular components is central to the regulation of cellular behaviour. Data-driven learning of molecular network topology continues to be an active area of research, including the development of methods with a particular focus on learning causal relationships. These methods often use statistical models with explicit causal assumptions (for example, causal directed acyclic graphs). Here, we take an alternative approach and view causal network inference from a machine learning point of view. The idea is to allow direct learning of patterns of causal influence between nodes. Binary indicators of causal influence between pairs of variables are treated as "labels". Available data for the variables of interest are used to provide edge features for the labelling task. Background knowledge or any available interventional data provide labels on a subset of variable pairs. The task is to learn the remaining labels that are unobserved. Using a specific approach rooted in manifold semi-supervised learning, we present empirical results on three different biological datasets, including data where causal effects can be verified experimentally. Our results demonstrate the efficacy and highly general nature of the approach, as well as its simplicity from a user's point of view.
12:00 PM - 12:20 PM (EDT)
SysMod - Cellular robustness is not a byproduct of environmental flexibility
Cells show remarkable resilience against perturbations, yet the evolutionary origin of this robustness remains obscure. To leverage the methods of systems biology to examine cellular robustness, a computationally accessible way of quantifying it is needed. Here, we present an unbiased metric of structural robustness in (genome-scale) metabolic models based on concepts prevalent in reliability engineering. The probability of failure (PoF) is defined as the (weighted) portion of all combinations of loss-of-function mutations that disable network functionality. It can be determined exactly if all essential reactions, all synthetic lethal pairs of reactions, all synthetic lethal triplets of reactions, etc., are known. In theory, these minimal cut sets (MCSs) can be calculated for any network, but for large models the problem remains computationally intractable. We demonstrate that even at the genome scale, only the lowest-cardinality MCSs are required to efficiently approximate the PoF. We analyzed the robustness of 459 E. coli, Shigella, and Salmonella strains. In contrast to the congruence theory, which explains the origin of genetic robustness as a byproduct of selection for environmental flexibility, we found no correlation between robustness and the diversity of growth-supporting environments. On the contrary, our analysis indicates that the core reactome, i.e. the set of reactions shared across strains, dominates robustness.
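The PoF definition lends itself to a small combinatorial sketch. Assuming a toy network whose minimal cut sets are already known, the PoF at knockout cardinality k is the fraction of all k-reaction knockout combinations that contain at least one MCS. This is a generic illustration of the definition, not the authors' genome-scale implementation (which approximates the PoF from only the lowest-cardinality MCSs); the 5-reaction network is invented.

```python
from itertools import combinations
from math import comb

def prob_of_failure(n_reactions, min_cut_sets, k):
    # PoF at cardinality k: fraction of all k-reaction knockout
    # combinations that contain at least one minimal cut set.
    fails = sum(
        any(mcs <= set(combo) for mcs in min_cut_sets)
        for combo in combinations(range(n_reactions), k)
    )
    return fails / comb(n_reactions, k)

# toy 5-reaction network: reaction 0 is essential, {1, 2} is a synthetic lethal pair
mcs = [frozenset({0}), frozenset({1, 2})]
print(prob_of_failure(5, mcs, 1))  # 0.2 (only knocking out reaction 0 is lethal)
print(prob_of_failure(5, mcs, 2))  # 0.5
```

Exhaustive enumeration like this is exactly what becomes intractable at genome scale, motivating the low-cardinality approximation described in the abstract.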
12:20 PM - 12:40 PM (EDT)
Internet and Remote Sensing Data for Public Health Surveillance
In this talk, I will present opportunities and challenges of using data from Internet sources and remote sensing for public health surveillance. If properly mined and filtered, these data can be used to study the dynamics of disease propagation, risk factors and the relationship between human behavior and disease spread. This talk will include examples on the use of satellite images, social media postings and product reviews for surveillance of infectious diseases, food products and obesity prevalence.
12:20 PM - 12:30 PM (EDT)
Evolution and Comparative Genomics - A Probabilistic Framework for Cell Lineage Tree Reconstruction
Single-cell DNA sequencing (scDNA-seq) technology enables a higher-resolution look at cells and has the potential to uncover the relationships between individual cells. scDNA-seq is a fundamental tool for evolutionary studies; however, it also introduces technological artefacts such as uneven coverage, allelic dropout, and amplification and sequencing errors. In this study, we focus on human cells without copy number alterations. We present a Bayesian model based on scDNA-seq that uncovers the differences between cells with the help of germline single nucleotide variations (gSNVs). The use of gSNVs as reference points enables us to accurately differentiate between somatic point mutations and technological artefacts, especially amplification errors. The model outputs a cell-to-cell distance matrix for each analysed pair of loci, from which we reconstruct the cell lineage tree with bootstrapping and neighbour joining. We evaluate the reconstructed tree with transfer bootstrap expectation scores of branches and the Robinson-Foulds distance to the underlying tree structure. The model is embarrassingly parallel and, with the use of dynamic programming and neighbour joining, we can analyse tens of thousands of positions in the genome. The experiments showed high accuracy in tree reconstruction and the identification of subclones.
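The final tree-building step (distance matrix → neighbour joining) is a standard algorithm; the textbook sketch below runs it on a toy four-leaf additive distance matrix. It is not the authors' code, only an illustration of the NJ step the abstract mentions, and it returns topology only (no branch lengths).

```python
def neighbour_joining(labels, dist):
    # Classic neighbour joining on a symmetric distance matrix given as
    # a dict of dicts; returns the tree topology as nested tuples.
    labels = list(labels)
    D = {a: dict(dist[a]) for a in labels}
    while len(labels) > 2:
        n = len(labels)
        r = {a: sum(D[a][b] for b in labels if b != a) for a in labels}
        # pick the pair minimising the Q-criterion (n-2)*d(a,b) - r(a) - r(b)
        a, b = min(
            ((x, y) for i, x in enumerate(labels) for y in labels[i + 1:]),
            key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        new = (a, b)
        D[new] = {}
        for c in labels:
            if c not in (a, b):
                # standard distance update for the merged node
                D[new][c] = D[c][new] = 0.5 * (D[a][c] + D[b][c] - D[a][b])
        labels = [c for c in labels if c not in (a, b)] + [new]
    return (labels[0], labels[1])

# toy additive matrix consistent with the tree ((A,B),(C,D))
dist = {
    "A": {"B": 3, "C": 9, "D": 10},
    "B": {"A": 3, "C": 10, "D": 11},
    "C": {"A": 9, "B": 10, "D": 7},
    "D": {"A": 10, "B": 11, "C": 7},
}
print(neighbour_joining(["A", "B", "C", "D"], dist))  # (('A', 'B'), ('C', 'D'))
```

In the bootstrapping scheme described above, the same routine would simply be re-run on each resampled distance matrix.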
12:20 PM - 12:40 PM (EDT)
Education - Staff Education to Accelerate Cloud Adoption
The European Bioinformatics Institute (EMBL-EBI) is part of EMBL. We have a large number of staff developing databases and tools to host, analyse and share data and analytic results openly in the life sciences. Training staff to adopt new technologies such as cloud computing is very challenging. We have defined a long-term strategy to adopt cloud infrastructure to give scientists in Europe and around the world better access, accelerating their research with the diverse data stored, verified and visualised by EBI. ==SHORTENED TO SATISFY 200 WORD LIMIT. PLEASE SEE PDF FOR THE COMPLETE ABSTRACT== As a result, many research and service teams have launched their projects with the new skills in the clouds. This team has been invited to deliver cloud workshops for EU projects, in person for EOSC-Life and remotely for BioExcel, with additional projects in the near future. The Cloud Roundtable forum has now been extended to include participants from the Wellcome Sanger Institute to foster closer cooperation in using cloud technologies, with a new joint project proposed. With this systematic staff education, we have taken our first step in leading EBI and EU projects to adopt cloud technologies.
12:20 PM - 12:40 PM (EDT)
MLCSB: Machine Learning - Factorized embeddings learns rich and biologically meaningful embedding spaces using factorized tensor decomposition
The recent development of sequencing technologies has revolutionised our understanding of the inner workings of the cell, as well as the way disease is treated. A single RNA sequencing (RNA-Seq) experiment, however, measures tens of thousands of parameters simultaneously. While the results are information-rich, data analysis poses a challenge. Dimensionality reduction methods help with this task by extracting patterns from the data and compressing it into compact vector representations. We present the factorized embeddings (FE) model, a self-supervised deep learning algorithm that learns gene and sample representation spaces simultaneously by tensor factorization. We ran the model on RNA-Seq data from two large-scale cohorts and observed that the sample representation captures information on single-gene and global gene expression patterns. Moreover, we found that the gene representation space was organized such that tissue-specific genes, highly correlated genes, and genes participating in the same GO terms were grouped together. Finally, we compared the vector representations of samples learned by the FE model to those of other similar models on 49 regression tasks. We report that the FE-trained representations rank first or second in all of the tasks, surpassing the other representations, sometimes by a considerable margin.
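The core idea of learning sample and gene embeddings jointly, so that their inner product reconstructs the expression value, can be sketched with a plain rank-2 matrix factorization trained by SGD. This is a deliberately simplified stand-in (no deep network, tiny made-up data), not the FE model itself.

```python
import random

def factorize(X, dim=2, lr=0.05, epochs=500, seed=0):
    # Learn sample embeddings S and gene embeddings G such that
    # dot(S[i], G[j]) reconstructs the expression value X[i][j].
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    S = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n)]
    G = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(m)]
    for _ in range(epochs):
        for i in range(n):
            for j in range(m):
                err = X[i][j] - sum(S[i][d] * G[j][d] for d in range(dim))
                for d in range(dim):
                    s, g = S[i][d], G[j][d]
                    S[i][d] += lr * err * g
                    G[j][d] += lr * err * s
    return S, G

# made-up data: two samples expressing opposite halves of four genes
X = [[5, 5, 0, 0],
     [0, 0, 5, 5]]
S, G = factorize(X)
# after training, co-expressed genes (columns 0 and 1) have near-identical embeddings
```

Even at this toy scale the effect described in the abstract appears: genes with similar expression profiles end up close together in the learned gene space.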
12:20 PM - 12:40 PM (EDT)
SST03 - Metabolomic analysis of coral holobionts reveals markers of thermal stress
Coral reef systems are under threat globally due to anthropogenic climate change. Understanding the response of the coral holobiont to abiotic and biotic stress is therefore crucial to aid conservation efforts. The most pressing problem is “coral bleaching”, usually precipitated by prolonged thermal stress that disrupts the algal symbiosis sustaining the holobiont. Here we used metabolomics to address the following questions: how does the coral holobiont metabolome respond to heat stress, and can we find diagnostic markers of that stress prior to the onset of bleaching? We studied the heat-tolerant Montipora capitata and heat-sensitive Pocillopora acuta coral species from the Hawaiian reef system. Untargeted LC-MS analysis uncovered a suite of known and novel metabolites that accumulate under heat stress, including a variety of dipeptides, some of which occupy hub positions in metabolite networks and are present in both coral species. The structures of four of these were determined (Arginine-Glutamine, Alanine-Arginine, Valine-Arginine, Lysine-Glutamine). These dipeptides show differential accumulation in symbiotic and aposymbiotic (alga-free) individuals of the sea anemone Aiptasia, suggesting their animal derivation and an algal-symbiont-related function. We have identified a suite of metabolites associated with thermal stress in corals that provide tools to diagnose coral health in the wild.
12:20 PM - 12:40 PM (EDT)
SysMod - Closing remarks of the SysMod meeting 2020
A brief recap of the two meeting days and the SysMod community
12:20 PM - 12:40 PM (EDT)
CAMDA - Mechanistic models of CMap drug perturbation functional profiles
We have used mechanistic models of pathway activities to generate a complete catalogue of CMap functional profiles that can be further used to detect reverse functional profiles in diseases, thus providing a functional basis and a potential biological interpretation for drug-disease inverse matching.
12:30 PM - 12:40 PM (EDT)
Evolution and Comparative Genomics - Splicing-structure-based selection of protein isoforms improves the accuracy of gene tree reconstruction
Constructing accurate gene trees is important, as gene trees play a key role in several biological studies, such as species tree reconstruction, gene functional analysis and gene family evolution studies. Although several methods have provided large improvements in the construction and correction of gene trees by making use of the relationship with a species tree in addition to multiple sequence alignments, there is still room for improvement in the accuracy of gene trees and the computing time. In particular, accounting for alternative splicing, which allows eukaryote genes to produce multiple transcripts and proteins per gene, is a way to improve the quality of the multiple sequence alignments used to reconstruct gene trees. Current methods for gene tree reconstruction usually make use of a set of transcripts composed of one representative transcript per gene to generate multiple sequence alignments, which are then used to estimate gene trees. In this work, we present two new splicing-structure-based methods that estimate gene trees by carefully selecting, based on their splicing structure, an accurate set of homologous transcripts to represent the genes of a gene family. The results show that the new methods compare favorably with the most widely used gene tree construction methods.
12:40 PM - 01:00 PM (EDT)
Evolution and Comparative Genomics - Inference of Population Admixture Network from Local Gene Genealogies: a Coalescent-based Maximum Likelihood Approach
Population admixture is an important subject in population genetics. Inferring population demographic history with admixture, under the so-called admixture network model, from population genetic data is an established problem in genetics. Existing admixture network inference approaches work with single genetic polymorphisms. While these methods are usually very fast, they do not fully utilize the information (e.g., linkage disequilibrium, or LD) contained in population genetic data. In this paper, we develop a new admixture network inference method called GTmix. Unlike existing methods, GTmix works with local gene genealogies that can be inferred from population haplotypes. Local gene genealogies represent the evolutionary history of sampled alleles and contain LD information. GTmix performs coalescent-based maximum likelihood inference of admixture networks from inferred genealogies, based on the well-known multispecies coalescent (MSC) model. GTmix uses various techniques to speed up likelihood computation on the MSC model and the optimal network search. Our simulations show that GTmix can infer more accurate admixture networks from much smaller data than existing methods, even when those methods are given much larger data. GTmix is reasonably efficient and can analyze genetic datasets of current interest.
01:15 PM - 02:00 PM (EDT)
Lunch & Learn - F1000Research: Innovative Open Access publishing for software tools and beyond
02:00 PM - 02:20 PM (EDT)
SCANGEN - Deep Representation Learning for Problems in Biology
02:00 PM - 02:20 PM (EDT)
Education - Implementation of Advanced Bioinformatics Education Across Africa
With more microbiome studies conducted by Africa-based research groups, there is an increasing demand for knowledge in the design and analysis of microbiome studies and data, but high-quality bioinformatics courses are often hampered by factors such as a lack of computational infrastructure and local expertise. To address this need, H3ABioNet developed an intermediate 16S rRNA analysis course alongside experienced microbiome researchers, who identified key topics ranging from the design of microbiome studies to more practical topics such as introductory high-performance computing, microbiome analysis pipelines and downstream analyses conducted in R. Tools used in the course were packaged in Singularity containers to remove the overhead of installing individual tools, versions and libraries, and a separate container was created for downstream analysis of results using RStudio. The pulling, running and testing of the containers, software and analysis on various clusters was performed by all hosting classrooms prior to the start of the course. The pilot ran successfully in 2019 across 23 registered sites in 11 African countries, with more than 200 participants formally enrolled. It provides a model for delivering topic-specific bioinformatics courses across Africa that overcomes barriers such as unequal infrastructure, geographical distance, and access to expertise and educational materials.
02:00 PM - 03:00 PM (EDT)
CAMDA - Panel Presentations and Discussion

Join us for Panel Presentations and a Discussion session in the CAMDA Cafe room by clicking the CAMDA logo under "Cafe Connect". We will discuss the latest insights and shape future CAMDA contest challenges!

  • CAMDA 2000-2020 - historical overview of previous CAMDA conferences: Joaquin Dopazo, Clinical Bioinformatics Area, Fundación Progreso y Salud, Spain
  • Perspectives 2021 - next year's CAMDA challenges: Wenzhong Xiao, Stanford University & Massachusetts General Hospital, U.S.A.

02:00 PM - 02:40 PM (EDT)
SST05 - HUBMAP Session Keynote: HuBMAP Data Collection Plans
02:00 PM - 02:20 PM (EDT)
Evolution and Comparative Genomics - EvoLSTM: Context-dependent models of sequence evolution using a sequence-to-sequence LSTM
Motivation: Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate sequence evolution is also at the core of many benchmarking strategies. Yet mutational processes have complex context dependencies that remain poorly modeled and understood. Results: We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence LSTM model trained to predict mutation probabilities at each position of a given ancestral sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate primate DNA sequences and reveals unexpectedly strong long-range context dependencies in mutation rates. Conclusion: EvoLSTM brings modern machine learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes.
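A drastically simplified analogue of context-dependent mutation simulation can be written with a lookup table keyed on a single flanking nucleotide on each side, instead of EvoLSTM's learned 14-nucleotide context. The rate table below is hypothetical, for illustration only.

```python
import random

def mutate(seq, context_rates, default=0.01, seed=1):
    # Mutate each base with a probability chosen by its (left, base, right)
    # context; "-" marks a missing neighbour at the sequence ends.
    rng = random.Random(seed)
    out = []
    for i, b in enumerate(seq):
        left = seq[i - 1] if i > 0 else "-"
        right = seq[i + 1] if i < len(seq) - 1 else "-"
        p = context_rates.get((left, b, right), default)
        out.append(rng.choice([x for x in "ACGT" if x != b]) if rng.random() < p else b)
    return "".join(out)

# hypothetical CpG-like hypermutability: a C followed by G mutates far more often
rates = {("A", "C", "G"): 0.9, ("T", "C", "G"): 0.9}
child = mutate("ACGTCGACGT", rates)
```

EvoLSTM replaces the fixed lookup table with an LSTM that outputs a mutation distribution per position, which is what lets it capture the long-range dependencies the abstract describes.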
02:00 PM - 02:20 PM (EDT)
MLCSB: Machine Learning - Adversarial Deconfounding Autoencoder for Learning Robust Gene Expression Embeddings
The increasing number of gene expression profiles has enabled the use of complex models, such as deep unsupervised neural networks, to extract a latent space from these profiles. However, expression profiles, especially when collected in large numbers, inherently contain variations introduced by technical artifacts (e.g., batch effects) and uninteresting biological variables (e.g., age) in addition to the true signals of interest. These sources of variation, called confounders, produce embeddings that fail to capture the biological variables of interest and do not transfer to different domains. To remedy this problem, we attempt to disentangle confounders from true signals to generate biologically informative embeddings by introducing the AD-AE (Adversarial Deconfounding AutoEncoder) approach. The AD-AE model consists of an autoencoder and an adversary network that we jointly train to generate embeddings that encode as much information as possible without encoding any confounding signal. By applying AD-AE to two distinct gene expression datasets, we show that our model can (1) generate embeddings that do not encode confounder information, (2) conserve the biological signals present in the original space, and (3) generalize successfully across different confounder domains. We believe that this adversarial deconfounding approach can be the key to discovering robust expression patterns.
02:00 PM - 02:40 PM (EDT)
iRNA - iRNA Keynote: Non-canonical base pair interactions improve the scalability and accuracy of the prediction and analysis of RNA 3D structures
A vast and complex network of base interactions stabilizes the 3D architecture of RNAs. Beyond the canonical Watson-Crick and wobble base pairs, each pair of nucleotides can interact in up to 12 different ways. The frequency of each type of base pair varies widely, but in most cases their occurrence is essential to shape the local or global geometry of structured RNAs. Non-canonical base pairs are primarily found within or between unpaired regions of the secondary structure, commonly referred to as loops. Their concentration in loops creates sophisticated base interaction patterns that are representative of the 3D structure supported by this network. In this talk, we show that working with a graphical model based on the set of canonical and non-canonical base pairs offers novel opportunities to develop efficient algorithms for predicting and analyzing physical and biochemical properties of RNA 3D structures. We present applications of this framework to (i) the automated discovery of structural motifs in databases, (ii) the prediction of small molecules binding RNAs, and (iii) the prediction of the 3D structure of large RNAs. These results suggest novel avenues to study the evolution of RNAs or accelerate RNA drug discovery.
02:00 PM - 02:20 PM (EDT)
Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. We argue that a new specialization should be formed within machine learning that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest-standing communal effort to gather human information, and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection, such as consent, power, inclusivity, transparency, and ethics & privacy. We discuss these five key approaches in document collection practices in archives that can inform data collection in sociocultural machine learning.
02:00 PM - 02:20 PM (EDT)
CAFA4 Talk 1
02:20 PM - 02:40 PM (EDT)
CAFA4 Talk 2
02:20 PM - 02:40 PM (EDT)
SCANGEN - Gradient of developmental and injury-response transcriptional states define roots of glioblastoma heterogeneity
Glioblastomas (GBM) harbour diverse populations of cells, including a rare subpopulation of glioblastoma stem cells (GSCs) that drive tumourigenesis. To characterize functional diversity within the tumour-initiating fraction of GBM, we performed single-cell RNA-sequencing on >69,000 GSCs cultured from the tumours of 26 patients. We identified a high degree of transcriptional heterogeneity within GSCs, with 2-6 transcriptional clonotypes per sample. Inference of copy number variations (CNVs) from scRNA-seq data revealed that subclonal somatic CNVs drive a portion of the transcriptional diversity within GSCs but do not fully explain inter-sample heterogeneity. Instead, we found that GSCs mapped along a transcriptional gradient spanning two cellular states reminiscent of normal neural development and inflammatory wound response processes. Further single-cell RNA-sequencing of patient tumours found that the majority of cancer cells organize along an astrocyte maturation gradient orthogonal to the GSC gradient, yet retain expression of founder transcriptional programs found in GSCs. Our results support a model whereby GBM heterogeneity can be explained by a fundamental GSC-based gradient between neural tissue regeneration and wound response transcriptional programs. This new paradigm has implications for understanding GBM origins, with clinical implications for designing targeted therapies against the heterogeneous tumour-initiating fraction of GBM.
02:20 PM - 02:40 PM (EDT)
Education - Diagnosing the Health of Bioinformatics Education in Latin America
Bioinformatics education is essential for supporting R&D in Latin America; consequently, it is imperative to understand its status and to support challenged audiences. The number of bioinformatics programs in LatAm is low compared with developed countries; ~16% of non-biotechnological undergraduate programs in life sciences offer bioinformatics. A project aimed at analysing the current status of bioinformatics education in LatAm and supporting training for undergraduate programs in life sciences is presented herein. It is a secondment project supported by CABANA (www.cabana.online) that comprises: 1) undergraduate training modules in Spanish and 2) status analysis and trainer support. Modules: LatAm trainers assigned high-level competency requirements to undergraduates in life sciences for basic and applied biology and scientific computing skills, using the ISCB Competency Framework. Accordingly, face-to-face and online training modules are being created; these will be peer-reviewed by CABANA partners and evaluated in test groups. Status and trainer support: A status analysis of bioinformatics programs is being performed jointly with LatAm researchers. The status analysis of undergraduate programs in life sciences samples one country, Mexico. A train-the-lecturer program is being created for support and knowledge exchange, and to promote the creation of a community of practice for bioinformatics trainers and educators.
02:20 PM - 02:30 PM (EDT)
Evolution and Comparative Genomics - Gene annotation refinement software using synteny based mapping
High-throughput next-generation sequencing (NGS) substantially reduces the cost of generating genome data. To apply NGS data to various genetics studies, the sequencing data is assembled into a genome sequence and annotated into genes. Gene annotation, one of the fundamental steps in understanding the function of each gene, determines the locations of genes and their coding regions. Several gene annotation tools exist, but they still have limitations in terms of ambiguity issues in the annotation steps. Most annotation tools are also quite difficult for novice users to install and apply to their studies. We propose a user-friendly and practically usable gene annotation software pipeline. To this end, the ambiguity problems are resolved using synteny mapping information. The performance of our software tool is evaluated using benchmark datasets such as the sequence of the Saccharomyces cerevisiae S288C strain as well as data from other strains. We believe our tool improves the accuracy of gene annotations and can thus substantially reduce the effort and time required for manual curation in genome annotation. Our software package is released as an installation script and a Docker image so that users can easily install it and apply it to their own sequence data.
02:20 PM - 02:40 PM (EDT)
MLCSB: Machine Learning - Latent periodic process inference from single-cell RNA-seq data
The development of a phenotype in a multicellular organism often involves multiple, simultaneously occurring biological processes. Advances in single-cell RNA sequencing make it possible to infer latent developmental processes from the transcriptomic profiles of cells at various developmental stages. Accurate characterization is challenging, however, particularly for periodic processes such as the cell cycle. To address this, we developed Cyclum, an autoencoder approach that identifies circular trajectories in the gene expression space. Experiments using scRNA-seq data from a set of proliferating cell lines and mouse embryonic stem cells show that Cyclum reconstructed experimentally labeled cell-cycle stages and rediscovered known cell-cycle genes more accurately than Cyclone, ccRemover, Seurat, and reCAT. Applying Cyclum to removing cell-cycle effects substantially improves the delineation of cell subpopulations, which is useful for establishing various cell atlases and studying tumor heterogeneity. Comparing circular patterns in each gene between nicotine-treated human embryonic cells and a control sample suggests both known and new target genes of nicotine. Thus, Cyclum can be applied as a generic tool for characterizing periodic processes underlying cellular development/differentiation and cellular architecture in scRNA-seq data. These features make it useful for constructing the Human Cell Atlas, the Human Tumor Atlas, and other cell ontologies.
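The notion of a circular latent trajectory can be loosely illustrated with the simplest possible circular coordinate: the angle of each (already 2-D) cell profile around the data centroid. This is only a conceptual sketch of what "placing cells on a circle" means, not Cyclum's autoencoder, which learns the circular embedding from high-dimensional expression data.

```python
import math

def circular_pseudotime(points):
    # Assign each 2-D point an angle in [0, 2*pi) around the data centroid,
    # a crude circular coordinate standing in for a learned latent circle.
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return [math.atan2(p[1] - cy, p[0] - cx) % (2 * math.pi) for p in points]

# four "cells" placed evenly around a cycle come back in order
cells = [(1, 0), (0, 1), (-1, 0), (0, -1)]
print([round(a, 3) for a in circular_pseudotime(cells)])  # [0.0, 1.571, 3.142, 4.712]
```

A periodic process such as the cell cycle is then summarised by this angle, which is what makes regressing out (or analysing) cycle effects tractable.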
02:20 PM - 02:40 PM (EDT)
Data Sharing Challenges for Biomedical AI
The amount of person-specific data generated in the clinical and research domains continues to grow at an unprecedented rate. The breadth and depth of this data enables better care and hypothesis-driven discovery. At the same time, there is a belief that sharing and reusing it, as well as combining it with other resources, will lead to new and more robust statistical investigations. However, sharing such data has the potential to infringe upon the rights (or expectations) of the individuals to whom the records correspond and the organizations making the resources available. In this presentation, I will review various ways in which well-meaning biomedical data science investigations have led to misuse, as well as how data-driven methodologies can help us mitigate such threats while still enabling scientific utility. This presentation will draw upon experiences in data collection and sharing in the context of several large genomic and medical records data sharing initiatives, including the NIH-sponsored Electronic Medical Records and Genomics (eMERGE) Network and the All of Us Research Program, as well as the European Medicines Agency clinical trials data sharing program.
02:30 PM - 02:40 PM (EDT)
Evolution and Comparative Genomics - A Comprehensive Analysis of the Phylogenetic Signal in Ramp Sequences in 211 Vertebrates
Background: Ramp sequences increase translational speed and accuracy when rare, slowly-translated codons are found at the beginnings of genes. Here, the results of the first analysis of ramp sequences in a phylogenetic construct are presented. Methods: Ramp sequences were compared across 211 vertebrates (110 mammalian and 101 non-mammalian). The presence or absence of ramp sequences was analyzed as a binary character in parsimony and maximum likelihood frameworks. Additionally, ramp sequences were mapped to the Open Tree of Life taxonomy to determine the number of parallelisms and reversals that occurred. Results: Parsimony and maximum likelihood analyses of the presence/absence of ramp sequences recovered phylogenies that are highly congruent with established phylogenies. Additionally, the retention index of ramp sequences is significantly higher than would be expected by random chance (p-value = 0). A chi-square analysis of completely orthologous ramp sequences resulted in a p-value of approximately zero as compared to random chance. Discussion: Ramp sequences recover phylogenies comparable to those from other phylogenomic methods. Although not all ramp sequences appear to carry a phylogenetic signal, more ramp sequences track speciation than expected by random chance. Therefore, ramp sequences may be used in conjunction with other phylogenomic approaches.
02:40 PM - 03:00 PM (EDT)
Ethics, Bias, and the Adoption of AI in Biomedicine
Artificial intelligence (AI) has vast potential to revolutionize biomedicine across the translational science spectrum, from basic science to clinical research to practice and policymaking. However, recent high-profile cases – lawsuits alleging that patients’ privacy rights were violated when health data were shared with a technology company, errant treatment recommendations, and racially biased algorithms – suggest that addressing the ethical questions AI raises will be critical to its success. What does privacy require? Who owns data? Where is the line between informing choice through analytics, and manipulating it? Should some domains of the human experience be “off limits” for AI? How should the potential for biases and stigmatization as a result of AI applications be managed? Using a highly publicized real-world case of a biased resource allocation algorithm as an example, this presentation will describe three considerations that are critical for understanding how to manage ethical questions surrounding AI: context (i.e., how we manage bias and other ethical issues depends upon the data type and the circumstances under which data are collected and used); upstream action (i.e., bias and other ethical issues must be addressed at the earliest stages of AI development, not after the fact); and engagement (i.e., meaningfully engaging diverse stakeholders is essential to managing ethical issues). For AI to be most effective it must be ethical. Moving forward, there is a great need for empirical research into how ethical issues and their management affect the diffusion of AI-based innovations.
02:40 PM - 03:00 PM (EDT)
CAFA4 Talk 3
02:40 PM - 02:50 PM (EDT)
Evolution and Comparative Genomics - Highly-regulated and diverse NTP-based biological conflict systems with implications for emergence of multicellularity
Multicellular organizations are prone to infections even if only a single cell is infected. We reveal novel highly-regulated chaperone-based systems that are likely used as a survival tactic by prokaryotes with complex lifecycles. These architecturally-analogous systems have constant core modules coupled with highly-variable effector modules, which is reminiscent of known biological conflict systems and co-evolutionary arms races. The constant component is either an ATPase/GTPase and/or a peptidase that is activated in response to an invasive entity and causes effector deployment, which is additionally regulated by proteolytic processing or binding of a nucleotide-derived signal. A third component senses invasive entities and transmits the signal. Effectors either target invasive nucleic acids or proteins; are inactive counterparts of host proteins that mediate decoy interactions with invasive molecules; or form macromolecular assemblages that cause host cell death or containment of the invasive entity. The apoptotic and immunity properties displayed by these systems in phylogenetically-disparate multicellular prokaryotes are suggestive of evolutionary convergence for kin viability in multicellular organizations. Comparable protein domains appear to have organized into systems based on common principles in eukaryotic apoptosis. Thus, a similar operational “grammar” and shared “vocabulary” of protein domains is seen in sensing and limiting infections during the multiple emergences of multicellularity across the tree of life.
02:40 PM - 02:50 PM (EDT)
iRNA - RNA Secondary Structure Prediction By Learning Unrolled Algorithms
RNA secondary structure prediction is one of the oldest computational problems in bioinformatics, having been studied for more than 40 years. Most existing approaches rely on dynamic programming, which can be relatively slow, achieves F1 scores of only around 0.6, and has difficulty handling pseudoknots. In this paper, we address the problem from an entirely new angle, viewing it as a constrained translation problem. We propose a novel end-to-end deep learning model, called E2Efold, which has the problem-specific constraints embedded in the network architecture. The core idea of E2Efold is to predict the RNA base-pairing matrix directly, and to use an unrolled algorithm for constrained programming as the template for the deep architecture to enforce those constraints. With comprehensive experiments on benchmark datasets, we demonstrate the superior performance of E2Efold: it predicts significantly better structures compared to the previous state-of-the-art methods (especially for pseudoknotted structures), with an F1 score around 0.8, while being as efficient as the fastest algorithm in terms of inference time. The original paper was published as an oral paper at ICLR 2020. The code of E2Efold is available at https://github.com/ml4bio/e2efold.
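The hard constraints E2Efold encodes (a symmetric pairing matrix, a minimum hairpin-loop size, and at most one partner per base) can be illustrated with a simple greedy rounding of a score matrix. This is only a sketch of the constraint set, not the paper's differentiable unrolled optimizer; the threshold and loop size below are illustrative assumptions.

```python
import numpy as np

def project_to_structure(scores, min_loop=4, threshold=0.5):
    """Greedily round a base-pair score matrix onto hard structural
    constraints: symmetry, a minimum hairpin-loop size, and at most one
    partner per base. (E2Efold itself uses a differentiable unrolled
    optimizer; this rounding only illustrates the constraint set.)"""
    n = scores.shape[0]
    s = (scores + scores.T) / 2.0                  # enforce symmetry
    order = [(s[i, j], i, j)                       # candidate pairs i < j
             for i in range(n) for j in range(i + min_loop, n)]
    pairs, used = [], set()
    for score, i, j in sorted(order, reverse=True):
        if score < threshold:
            break                                  # remaining scores too low
        if i not in used and j not in used:        # one partner per base
            pairs.append((i, j))
            used.update((i, j))
    return sorted(pairs)
```

Conflicting candidate pairs are resolved in favor of the higher score, so a base proposed in two pairs keeps only its best-scoring partner.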
02:40 PM - 03:00 PM (EDT)
Education - Applying best practices to enhance bioinformatics training in Switzerland
SIB Swiss Institute of Bioinformatics training courses are created by following a cycle of best practices in training, as defined by the trainers’ communities in ISCB, GOBLET, ELIXIR and the SIB Training group. Trainers from the SIB Training group are encouraged to attend Train-the-Trainer courses. As a consequence, SIB courses are clearly defined, with specific learning objectives, target audiences, prerequisites, and active learning activities. Course pages are annotated with Bioschemas specifications and metadata, which provide valuable information enabling trainees to assess whether the courses meet their needs and background knowledge, a step towards FAIR training. Quality and impact metrics, together with training needs, are continuously collected, providing indications as to whether courses and the annual program need to be adapted. The integration of these best practices, together with the extensive bioinformatics expertise of the SIB scientists, has resulted in courses that are consistently well evaluated and appreciated by PhD students, postdocs, and their PIs.
02:40 PM - 03:00 PM (EDT)
MLCSB: Machine Learning - SCIM: Universal Single-Cell Matching with Unpaired Feature Sets
Multi-modal molecular profiling of samples at the single-cell level can yield deeper insights into the tissue microenvironment and disease dynamics. However, profiling technologies like scRNA-seq often consume the analyzed cells, so cellular correspondences between data modalities are lost. To exploit single-cell omics technologies jointly, we propose Single-Cell data Integration via Matching (SCIM), a scalable and accurate approach to recover such correspondences across two or more technologies, even in the absence of overlapping feature sets. SCIM assumes that cells share a common underlying structure and reconstructs a technology-invariant latent space using an auto-encoder framework with an adversarial objective. Cell pairs across technologies are then identified using a customized bipartite matching scheme operating on the latent representations. We evaluate SCIM on a simulated branching process designed for scRNA-seq data (192,000 cells in total) and show that the cell-to-cell matches reflect the same pseudotime (Pearson’s coefficient: 0.86). Moreover, we apply our method to a real-world melanoma tumor sample and achieve 93% cell-matching accuracy with respect to cell-type labels when aligning scRNA-seq and CyTOF datasets. SCIM is a scalable and flexible algorithm that bridges the gap between the generation and integrative interpretation of diverse multi-modal data.
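The matching step can be sketched as minimum-cost bipartite matching on latent-space distances. SCIM itself uses a customized, scalable matching scheme; the brute-force solver below is only a small-scale stand-in under the assumption of equal-sized, already-embedded datasets.

```python
import numpy as np
from itertools import permutations

def match_cells(latent_a, latent_b):
    """Minimum-cost bipartite matching of cells from two technologies by
    Euclidean distance in a shared latent space. SCIM uses a customized,
    scalable matching scheme; this brute-force solver only works for
    small, equal-sized inputs."""
    # pairwise Euclidean distances between all cells of the two datasets
    cost = np.linalg.norm(latent_a[:, None, :] - latent_b[None, :, :], axis=-1)
    n = cost.shape[0]
    # exhaustively search assignments (factorial time -- illustration only)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return [(i, best[i]) for i in range(n)]
```

For realistic dataset sizes a polynomial-time solver (e.g. the Hungarian algorithm) would replace the exhaustive search.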
02:40 PM - 03:00 PM (EDT)
SCANGEN - Inferring the origins of pediatric brain tumors by single-cell analysis of the normal developing brain
02:40 PM - 03:05 PM (EDT)
SST05 - Embryo-scale, single-cell spatial transcriptomics
Spatial patterns of gene expression span many scales, and are shaped by both local (e.g. cell-cell interactions) and global (e.g. tissue, organ) context. However, most in situ methods for profiling gene expression either average local contexts or are restricted to limited fields of view. Here we introduce sci-Space, a scale-flexible method for spatial transcriptomics that retains single cell resolution while simultaneously capturing heterogeneity at larger scales. As a proof-of-concept, we apply sci-Space to the developing mouse embryo, capturing the approximate spatial coordinates of profiled cells from whole embryo serial sections. We identify genes including Hox-family transcription factors expressed in an anatomically patterned manner across excitatory neurons and other cell types. We also show that sci-Space can resolve the differential contribution of cell types to signalling molecules exhibiting spatially heterogeneous expression. Finally, we develop and apply a new statistical approach for quantifying the contribution of spatial context to variation in gene expression within cell types.
02:50 PM - 03:00 PM (EDT)
Evolution and Comparative Genomics - Integrated synteny- and similarity-based inference on the polyploidization-fractionation cycle
Two orthogonal approaches to the study of fractionation (duplicate gene loss after polyploidization) focus on the decrease over time of the number of surviving duplicate pairs, on the one hand, and on the pattern of syntenically consecutive pairs lost at a deletion event, on the other. Here we explore a synergy between the two approaches that greatly enlarges the scope of both. In the branching process approach to accounting for the distribution of gene pair similarities, the inference possibilities are minimal, since there is only one degree of freedom for each replication event. It is only by going beyond the distribution of gene pair similarities and bringing other data to bear that we can increase the number of parameters of the branching process that can be estimated. We greatly enlarge the possibilities of estimating parameters of this model of the polyploidization-fractionation cycle by considering the singletons within synteny blocks, by deriving theoretical constraints among the retention rates, and by correcting for the erosion of synteny blocks over time.
02:50 PM - 03:00 PM (EDT)
iRNA - Elucidating the Automatically Detected Features Used By Deep Neural Networks for RFAM family classification
In this study, we show that deep feed-forward neural nets are able to accurately classify RNA families from RNA sequences alone. We demonstrate to what degree those models use length, nucleotide and dinucleotide composition, and higher-order subsequences present in the RNA to make accurate predictions, by selectively obfuscating combinations of these features. We report the area under the receiver-operator characteristic curve (ROC-AUC) for the classification task on a diverse selection of RNA families, showing how randomizing various implicit sequence features affects the performance of these models and suggesting which features they are able to detect. We hope these findings will encourage the use of artificial neural network models for reliable data-driven detection of RNA families directly from primary structure, and their integration into various other sequence-based bioinformatics tasks, such as de novo genome annotation.
03:20 PM - 03:40 PM (EDT)
Training at the Intersection: Bringing Together Computation and Biomedicine
The confluence of data from electronic health records (EHRs) and new types of observations (omic, imaging, mHealth), combined with novel computational methods holds a tantalizing promise to transform biomedical discovery and patient care. New machine and reinforcement learning (ML/RL) techniques, for example, are uncovering scientific insights and helping optimize clinical decision making. However, the next generation of scientists must appreciate the proper use of these methods in addition to the underlying translational challenges inherent to making such techniques usable in real-world environments. Moreover, such training needs to span an interdisciplinary spectrum, particularly if artificial intelligence (AI) is to advance in healthcare: for the computer scientists and engineers engaged in methodological development and the theoretical underpinnings of these algorithms, there is a need to appreciate the nuanced nature of biomedical data and how the tools they create are ultimately used; and for the biological/clinical scientist interested in using these tools, there must be an understanding of the appropriate application of data-driven analyses, including their evaluation. Unique opportunities exist to bring these individuals together by employing team science approaches on real-world use cases posed by healthcare systems, providing a pragmatic testbed for learning and implementation. Although short-term effort is required to make such teams successful (e.g., forging a common language, understanding different information needs), the diversity of technical experience and viewpoints enables innovative, effective solutions and the group learns together, often in a self-sustaining manner.
03:20 PM - 03:40 PM (EDT)
iRNA - LinearPartition: Linear-Time Approximation of RNA Folding Partition Function and Base Pairing Probabilities
RNA secondary structure prediction is widely used to understand RNA function. Recently, there has been a shift away from the classical minimum free energy (MFE) methods to partition function-based methods that account for folding ensembles and can therefore estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and is therefore a slow calculation for long sequences. This slowness is even more severe than cubic-time MFE-based methods due to a larger constant factor in runtime. Inspired by the success of our recently proposed LinearFold algorithm that predicts the approximate MFE structure in linear time, we design a similar linear-time heuristic algorithm, LinearPartition, to approximate the partition function and base pairing probabilities, which is shown to be orders of magnitude faster than Vienna RNAfold and CONTRAfold (e.g., 2.5 days vs. 1.3 minutes on a sequence with length 32,753 nt). More interestingly, the resulting base pairing probabilities are even better correlated with the ground truth structures. LinearPartition also leads to a small accuracy improvement when used for downstream structure prediction on families with the longest length sequences (16S and 23S rRNA), as well as a substantial improvement on long-distance base pairs (500+ nt apart).
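The inside (partition function) recursion being approximated can be sketched with a toy Nussinov-style model in which every base pair contributes a fixed Boltzmann weight; the real McCaskill and LinearPartition algorithms use the full nearest-neighbor energy model, and LinearPartition additionally applies left-to-right beam pruning to reach linear time. The pair weight here is an arbitrary illustrative value.

```python
from functools import lru_cache

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
         ("C", "G"), ("G", "U"), ("U", "G")}

def partition_function(seq, pair_weight=2.0, min_loop=3):
    """Toy O(n^3) inside algorithm: sum Boltzmann weights over all
    Nussinov-style secondary structures, with a fixed weight per base
    pair (real tools use the nearest-neighbor energy model instead)."""
    n = len(seq)

    @lru_cache(maxsize=None)
    def Z(i, j):
        if j - i < 1:                 # empty or single-base region
            return 1.0
        total = Z(i, j - 1)           # case 1: position j is unpaired
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:   # case 2: j pairs with k
                left = Z(i, k - 1) if k > i else 1.0
                total += left * pair_weight * Z(k + 1, j - 1)
        return total

    return Z(0, n - 1)
```

On "GAAAC" the only structure besides the open chain is the single G-C pair, so the partition function is 1 + pair_weight.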
03:20 PM - 03:40 PM (EDT)
Education - Data Science Training for Experimental Biology Graduate Students
The life sciences education community has long recognized the need to provide the more rigorous quantitative and computational training essential to current research practice. Here we describe an effort to meet this need for a cohort of experimental life sciences graduate students assumed to have no prior biostatistics or bioinformatics training. We sought to provide a practical and accessible introduction via a course organized around analyzing real primary experimental data in a series of modules, each covering a distinct biological domain and data type. The course uses lecture material on biological problems and data sources as well as on cross-cutting topics in bioinformatics and biostatistics. These are brought into practice with hands-on analysis of primary data in R, both in class and in homework. Assessment suggests that the course produces modest gains in reasoning correctly about problems in biological data analysis and experimental design, particularly by increasing students' ability to draw on quantitative knowledge. Future work will aim to extend students' ability to handle more complex coding tasks, incorporate new modules, and better develop course materials for export.
03:20 PM - 03:40 PM (EDT)
MLCSB: Machine Learning - TinGa: fast and flexible trajectory inference with Growing Neural Gas
Motivation: During the last decade, trajectory inference methods have emerged as a novel framework to model cell developmental dynamics, most notably in the area of single-cell transcriptomics. At present, more than 70 trajectory inference methods have been published, and recent benchmarks showed that, while some methods perform well for certain trajectory types, overall there is still a lot of room for improvement. Results: In this work we present TinGa, a new trajectory inference method that is fast and flexible, based on Growing Neural Gas. This allows TinGa to model both the simplest and the most complex trajectory types. We performed an extensive comparison of TinGa to the five best existing trajectory inference methods on a set of 250 datasets, including both synthetic and real data. Overall, TinGa obtained better results than all other methods across the full range of data complexity, from the simplest linear datasets to the most complex disconnected graphs. In addition, TinGa obtained the fastest run times, showing that our method is one of the most versatile methods to date. Availability: R scripts for running TinGa, comparing it to the top existing methods, and generating the figures of this paper are available at https://github.com/Helena-todd/researchgng
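A bare-bones Growing Neural Gas loop (after Fritzke, 1995) illustrates the graph-growing idea that TinGa builds on: a small graph of prototype nodes adapts toward the data and grows where quantization error accumulates. All hyperparameters below are illustrative, and TinGa's trajectory-specific extensions are omitted.

```python
import numpy as np

def grow_neural_gas(data, n_nodes=10, n_iter=500, eps_b=0.1, eps_n=0.01,
                    max_age=20, insert_every=50, seed=0):
    """Minimal Growing Neural Gas: a graph of prototype nodes grows to
    follow the topology of the data (sketch only; TinGa adds
    trajectory-inference machinery on top of such a graph)."""
    rng = np.random.default_rng(seed)
    nodes = [data[rng.integers(len(data))].astype(float) for _ in range(2)]
    errors = [0.0, 0.0]
    edges = {}                                    # (i, j) -> age, i < j
    for t in range(n_iter):
        x = data[rng.integers(len(data))]         # sample a data point
        d = [np.linalg.norm(x - w) for w in nodes]
        s1, s2 = np.argsort(d)[:2]                # two nearest prototypes
        errors[s1] += d[s1] ** 2
        nodes[s1] += eps_b * (x - nodes[s1])      # move winner toward x
        for (i, j) in list(edges):                # age edges at the winner,
            if s1 in (i, j):                      # drag its neighbors along
                edges[(i, j)] += 1
                other = j if i == s1 else i
                nodes[other] += eps_n * (x - nodes[other])
        edges[tuple(sorted((int(s1), int(s2))))] = 0   # refresh/create edge
        edges = {e: a for e, a in edges.items() if a <= max_age}
        if (t + 1) % insert_every == 0 and len(nodes) < n_nodes:
            q = int(np.argmax(errors))            # split highest-error node
            nbrs = [j if i == q else i for (i, j) in edges if q in (i, j)]
            f = max(nbrs, key=lambda m: errors[m]) if nbrs else q
            nodes.append((nodes[q] + nodes[f]) / 2)
            errors[q] *= 0.5
            errors[f] *= 0.5
            errors.append(errors[q])
        errors = [e * 0.99 for e in errors]       # decay accumulated error
    return np.array(nodes), set(edges)
```

The resulting node graph, rather than a fixed tree topology, is what allows both simple linear and complex disconnected trajectories to be represented.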
03:20 PM - 03:40 PM (EDT)
Evolution and Comparative Genomics - Copy Number Evolution with Weighted Aberrations in Cancer
Motivation: Copy number aberrations (CNAs), which delete or amplify large contiguous segments of the genome, are a common type of somatic mutation in cancer. Copy number profiles, representing the number of copies of each region of a genome, are readily obtained from whole-genome sequencing or microarrays. However, modeling copy number evolution is a substantial challenge, since CNAs alter contiguous segments of the genome and different CNAs may overlap with one another. A recent popular model for copy number evolution is the Copy Number Distance (CND), defined as the length of a shortest sequence of deletions and amplifications of contiguous segments that transforms one profile into the other. All events contribute equally to the CND; however, CNAs are observed to occur at different rates according to their length or genomic position and also vary across cancer type. Results: We introduce a weighted copy number distance that allows events to have varying weights, or probabilities, based on their length, position and type. We derive an efficient algorithm to compute the weighted copy number distance as well as the associated transformation, based on the observation that the constraint matrix of the underlying optimization problem is totally unimodular. We demonstrate the utility of the weighted copy number distance by showing that the weighted CND: improves phylogenetic reconstruction on simulated data where copy number aberrations occur with varying probabilities; aids in the derivation of phylogenies from ultra low-coverage single-cell DNA sequencing data; helps estimate CNA rates in a large pan-cancer dataset.
03:20 PM - 03:40 PM (EDT)
SCANGEN - Copy number aberrations from single-cell sequencing
03:20 PM - 03:45 PM (EDT)
SST05 - Infrastructure for Storing Massive Biological Data Sets
03:20 PM - 04:00 PM (EDT)
CAMDA - Prediction of Drug Induced Liver Injury with different data sets and different end points
Motivation: Drug-induced liver injury (DILI) is one of the primary problems in drug development. Early prediction of DILI would bring a significant reduction in the cost of clinical trials and faster development of drugs. The current study aims at building predictive models of the DILI potential of chemical compounds. Methods: We build predictive models for several alternative splits of compounds between DILI and non-DILI classes, using supervised machine learning algorithms. To this end we use the chemical properties of the compounds under scrutiny, their effects on gene expression levels in six human cell lines treated with them, and their toxicological profiles. We first identify the most informative variables and then use them to build ML models. Individual models built using gene expression in single cell lines, chemical properties of compounds, and their toxicology profiles are then combined using the Super Learner approach. Results: We obtained a weakly predictive model using molecular descriptors and DILI statistics, with AUC exceeding 0.7 for some DILI definitions. With one exception, gene expression profiles of human cell lines were non-informative and resulted in random models. Gene expression profiles of the HEPG2 cell line led to statistically significant models (AUC = 0.67) for only one definition of DILI.
03:20 PM - 03:40 PM (EDT)
Function - Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function
Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. In this work we applied an existing deep sequence model that had been pre-trained in an unsupervised setting on the supervised task of protein function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for deep prediction models, as a two-layer perceptron was enough to achieve state-of-the-art performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that three-dimensional structure is also potentially learned during the unsupervised pre-training.
03:40 PM - 04:00 PM (EDT)
Function - Embeddings allow GO annotation transfer beyond homology
Understanding protein function is crucial for molecular and medical biology; nevertheless, Gene Ontology (GO) annotations have been manually confirmed for fewer than 0.5% of all known protein sequences. Computational methods bridge this sequence-function gap, but the best prediction methods need evolutionary information to predict function. Here, we propose a new method that predicts GO terms through annotation transfer without using sequence similarity. Instead, the method uses SeqVec embeddings to transfer annotations between proteins through proximity in embedding space. SeqVec’s data-driven feature extraction transfers knowledge from large unlabeled databases to smaller but labelled datasets (transfer learning). Replicating the conditions of CAFA3, our method reached an Fmax of 50%, 59%, and 65% for BPO, MFO, and CCO, respectively. This was numerically higher than all methods that had actually participated in CAFA3 for BPO and CCO, and second-highest for MFO. When restricting the lookup dataset to proteins with less than 20% pairwise sequence identity to the targets, performance dropped clearly (Fmax BPO 38%, MFO 46%, CCO 56%) but continued to clearly outperform simple homology-based inference. Thereby, the new method may help in annotating novel proteins not belonging to large families.
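The core annotation-transfer idea, copying GO terms from the nearest neighbors in embedding space rather than from sequence-similar hits, can be sketched in a few lines. The distance metric and k below are illustrative assumptions, not the published settings.

```python
import numpy as np

def transfer_annotations(query_emb, lookup_emb, lookup_terms, k=1):
    """Transfer GO terms from the k nearest lookup proteins in embedding
    space (Euclidean distance here; the published method scores terms by
    embedding proximity and its exact metric/weighting may differ)."""
    dists = np.linalg.norm(lookup_emb - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]     # indices of the k closest proteins
    terms = set()
    for idx in nearest:
        terms |= set(lookup_terms[idx]) # union of the neighbors' GO terms
    return terms
```

Because the lookup is done in embedding space, annotations can be transferred between proteins that share no detectable sequence similarity.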
03:40 PM - 04:00 PM (EDT)
"Moneyball" for Team Science? Fostering a Collaborative Workforce
03:40 PM - 04:00 PM (EDT)
iRNA - Graph neural representational learning of RNA secondary structures for predicting RNA-protein interactions
Motivation: RNA-protein interactions are key effectors of post-transcriptional regulation. Significant experimental and bioinformatics efforts have been expended on characterizing protein binding mechanisms on the molecular level, and on highlighting the sequence and structural traits of RNA that impact the binding specificity for different proteins. Yet our ability to predict these interactions in silico remains relatively poor. Results: In this study, we introduce RPI-Net, a graph neural network approach for RNA-protein interaction prediction. RPI-Net learns and exploits a graph representation of RNA molecules, yielding significant performance gains over existing state-of-the-art approaches. We also introduce an approach to rectify a particular type of sequence bias present in many CLIP-Seq data sets, and we show that correcting this bias is essential in order to learn meaningful predictors and properly evaluate their accuracy. Finally, we provide new approaches to interpret the trained models and extract simple, biologically interpretable representations of the learned sequence and structural motifs.
03:40 PM - 04:00 PM (EDT)
MLCSB: Machine Learning - Unsupervised Topological Alignment for Single-Cell Multi-Omics Integration
Motivation: Single-cell multi-omics data provide a comprehensive molecular view of cells. However, single-cell multi-omics datasets consist of unpaired cells measured with distinct, unmatched features across modalities, making data integration challenging. Results: In this study, we present a novel algorithm, termed UnionCom, for the unsupervised topological alignment of single-cell multi-omics data. UnionCom does not require any correspondence information, either among cells or among features. It first embeds the intrinsic low-dimensional structure of each single-cell dataset into a distance matrix of cells within the same dataset, and then aligns the cells across single-cell multi-omics datasets by matching the distance matrices via a matrix optimization method. Finally, it projects the distinct, unmatched features across single-cell datasets into a common embedding space for feature comparability of the aligned cells. To match the complex, nonlinearly distorted low-dimensional structures across datasets, UnionCom introduces and adjusts a global scaling parameter on the distance matrices for aligning similar topological structures. It does not require one-to-one correspondence among cells across datasets, and it can accommodate samples with dataset-specific cell types. UnionCom outperforms state-of-the-art methods on both simulated and real single-cell multi-omics datasets. UnionCom is robust to parameter choices, as well as to subsampling of features. Availability: UnionCom software is available at https://github.com/caokai1073/UnionCom.
03:40 PM - 04:00 PM (EDT)
SCANGEN - Joint Inference of Clonal Structure using Single-cell DNA-Seq and RNA-Seq data
The latest high-throughput single-cell RNA-sequencing (scRNA-seq) and DNA-sequencing (scDNA-seq) technologies have enabled cell-resolved investigation of pathological tissue clones. However, it is still technically challenging to simultaneously measure the genome and transcriptome content of a single cell. In this work, we developed CCNMF, a new computational tool that uses Coupled-Clone Non-negative Matrix Factorization to jointly infer clonal structures in single-cell genomics and transcriptomics data. We benchmarked CCNMF using both simulated and real cell-mixture-derived datasets and demonstrated its robustness and accuracy. We also applied CCNMF to paired scRNA and scDNA data from a triple-negative breast cancer xenograft, resolved its underlying clonal structures, and identified differential genes between cell clusters. In summary, CCNMF presents a joint and coherent approach to resolving clonal genome and transcriptome structures, which will facilitate a better understanding of the cellular and tissue changes associated with disease development.
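At the heart of CCNMF is non-negative matrix factorization; the classic single-matrix multiplicative updates (Lee & Seung) are sketched below. CCNMF couples two such factorizations, one per modality, through shared clone structure, which this minimal single-matrix version does not attempt.

```python
import numpy as np

def nmf(X, rank=2, n_iter=200, seed=0):
    """Plain multiplicative-update NMF minimizing the Frobenius error
    ||X - WH||. CCNMF couples two such factorizations (scDNA copy
    numbers and scRNA expression) through shared clone signatures;
    only the single-matrix core is shown here."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    eps = 1e-10                       # avoid division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update clone loadings
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update cell factors
    return W, H
```

The multiplicative form keeps W and H non-negative throughout, which is what makes the factors interpretable as clone signatures and memberships.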
03:40 PM - 04:00 PM (EDT)
Education - Embedding skills for a new profession by teaching programming in an immersive and authentic environment
Clinical Bioinformatics combines computer science with genomics in clinical practice. Trained clinical bioinformaticians are in short supply, necessitating creative and flexibly delivered education to fill the skills gap. Our Introduction to Programming unit launched in 2019 as part of a PG-Cert teaching the fundamentals of Clinical Bioinformatics to a diverse cohort of distance learners. The unit simulated real-world experiences by building a situated learning environment that used agile project methods and authentic problem-solving activities to emulate clinical programming best practice. Clear instructions, signposting and support ensured comfort with course materials delivered using Jupyter Notebooks via GitHub. This use of industry-standard platforms and downloadable content also encouraged post-course lifelong learning. The unit followed a social constructivist model geared to help students learn individually and as a team. Synchronous online support from experienced facilitators encouraged group-based peer-to-peer support, which afforded more time for educators to help struggling students. By prioritising pedagogy over technology, the learning design resulted in incremental coding activities supporting a variety of learners. Methods such as sprints provided real-world problem-based learning using real user stories. The students directly contributed to the clinical bioinformatics toolkit by developing resources for personal practice or for co-development of the VariantValidator software, used in clinical practice worldwide.
03:40 PM - 03:50 PM (EDT)
Evolution and Comparative Genomics - Reconstructing Tumor Evolutionary Histories and Clone Trees in Polynomial-time with SubMARine
Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods infer the ancestral relationships among the subpopulations by constructing a clone tree. However, often multiple clone trees are consistent with the data. Current methods do not effectively characterize this uncertainty, and cannot scale to cancers with many subclonal populations. In this work we introduce the partial clone tree, which defines a subset of the pairwise ancestral relationships in a clone tree and thereby implicitly represents the set of all clone trees that share these defined relationships. We also define a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. We describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR with specific guarantees. We also extend SubMARine to work with subclonal copy number aberrations. We show, on both simulated data and a real lung cancer dataset, that SubMARine runs in less than 70 seconds, and that the subMAR equals the MAR in > 99.9% of cases where only a single tree exists. SubMARine is available at https://github.com/morrislab/submarine.
03:45 PM - 04:10 PM (EDT)
SST05 - Tools and Pipelines for the Analysis and Integration of HuBMAP data
To reconstruct a 3D human map, HuBMAP intends to profile large amounts of high-throughput biological data spanning several different organs, modalities and technologies. In this talk I will discuss the guiding principles underlying the decisions for selecting specific pipelines. I will mention some challenges related to implementing these pipelines in the context of a large and diverse consortium, and will show some preliminary results from using these methods to analyze genomics, proteomics and imaging data.
03:50 PM - 04:00 PM (EDT)
Evolution and Comparative Genomics - Structural and transcriptional variation linked to protracted human frontal cortex development
The human frontal cortex is unusually large compared with that of many other species. The expansion of the human frontal cortex is accompanied by both connectivity and transcriptional changes. Yet the developmental origins generating variation in frontal cortex circuitry across species remain unresolved. Nineteen genes, which encode filament, synaptic, and voltage-gated channel proteins, are especially enriched in the supragranular layers of the human cerebral cortex, suggesting enhanced cortico-cortical projections emerging from layer III. We use diffusion MR tractography, together with gene expression data from adulthood and development, to identify species differences in connections and the developmental mechanisms generating variation in frontal cortical circuitry. We demonstrate that increased expression of supragranular-enriched genes in frontal cortex layer III is concomitant with an expansion of cortico-cortical pathways projecting within the frontal cortex in humans relative to mice. We also demonstrate that the growth of frontal cortex white matter and the transcriptional profiles of supragranular-enriched genes are protracted in humans relative to mice. The expansion of projections emerging from the human frontal cortex thus arises by extending the development of frontal cortical circuitry. Integrating gene expression with neuroimaging-level phenotypes is an effective strategy for assessing deviations in developmental programs that lead to species differences in connections.
04:00 PM - 04:20 PM (EDT)
ODSS - Reflections and Lessons from 15 Years of Training Computational Biologists
It is clear that there is an increasing need for computational biologists in both academia and industry. We launched a joint doctoral program to train computational biologists between Carnegie Mellon University and the University of Pittsburgh in 2005, and this has been a continually evolving process since then, in tandem with the rapid developments in the field and the increasing demand for its workforce. I will present the challenges we faced in our interdisciplinary, inter-institutional program, and the solutions we came up with. As a program that admits students from a wide variety of backgrounds and aims at providing training in a rapidly evolving field, it has been critically important to monitor student progress and success as a function of their background and to offer customized training when needed, in line with career opportunities. A major lesson learned is the importance of familiarizing students with the field early in their education, before graduate studies; another is to gauge the career landscape and proactively adapt the training program to current and future needs.
04:00 PM - 04:10 PM (EDT)
iRNA - Practical Guidance for Genome-Wide RNA:DNA Triple Helix Prediction
Long noncoding RNAs (lncRNAs) play a key role in many cellular processes, including chromatin regulation. To modify chromatin, lncRNAs often interact with DNA in a sequence-specific manner, forming RNA:DNA triple helices. Computational tools for triple helix search do not always provide genome-wide predictions of sufficient quality. Here, we used four human lncRNAs (MEG3, DACOR1, TERC and HOTAIR) and their experimentally determined binding regions to evaluate the triple helix parameters used by the Triplexator software. We find that a minimum length of 10 nt, a maximum error rate of 20%, and a minimum G-content of 70% or 40% provide the highest accuracy of triple helix prediction in terms of area under the curve (AUC). Additionally, we combined triple helix prediction with lncRNA secondary structure and demonstrated that considering only the single-stranded fragments of the lncRNA, predicted by RNAplfold with a 0.95 or 0.5 threshold on the probability of pairing, can further improve the quality of RNA:DNA triplex prediction, especially in the MEG3 case. This improvement can be explained by the number and characteristics of DBDs, the regions of lncRNA that form the majority of the triplexes, as detected by the TDF software.
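Parameter settings are compared via AUC; a small rank-based (Mann-Whitney) implementation of ROC-AUC, as could be used to score predictions against experimentally determined binding regions, is sketched below using only the standard library.

```python
def roc_auc(labels, scores):
    """Rank-based ROC-AUC (Mann-Whitney U statistic): the probability
    that a randomly chosen positive outranks a randomly chosen negative.
    labels: 0/1 ground truth; scores: predicted scores."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:                      # walk over groups of tied scores
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1..j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
        i = j
    n_pos = sum(labels)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC of 0.5 corresponds to random predictions, so settings are preferred by how far they push the AUC above that baseline.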
04:00 PM - 04:10 PM (EDT)
Evolution and Comparative Genomics - CoCoCoNet: Conserved and Comparative Co-expression Across a Diverse Set of Species
Co-expression analysis has provided insight into gene function in many organisms from Arabidopsis to Zebrafish. Comparison across species has the potential to enrich these results, for example, by prioritizing among candidate human disease genes based on their network properties, or by finding alternative model systems where their co-expression is conserved. Here, we present CoCoCoNet as a tool for identifying conserved gene modules and comparing co-expression networks. CoCoCoNet is a resource for both data and methods, providing gold-standard networks and sophisticated tools for on-the-fly comparative analyses across 14 species. We show how CoCoCoNet can be used in two use cases. In the first, we demonstrate deep conservation of a nucleolus gene module across very divergent organisms, and in the second, we show how the heterogeneity of autism mechanisms in humans can be broken down by functional group, and translated to model organisms. CoCoCoNet is free to use and available at https://milton.cshl.edu/CoCoCoNet/, providing users with convenient access to both data and methods for cross-species analyses, opening up a range of potential research questions relevant to evolution and comparative genomics.
04:00 PM - 04:20 PM (EDT)
SCANGEN - Integrative analysis of breast cancer survival based on spatial features
Complex cancer progression, such as tumour growth and invasion, is known to be impacted by the immune system and its spatial interaction with tumours. Two recent studies of breast cancer using multiplexed ion beam imaging by time-of-flight (MIBI-TOF) and imaging mass cytometry (IMC) have revealed that single-cell heterogeneity within the spatial tumour-microenvironment context is closely associated with patient subtypes (Keren et al. 2018, Jackson et al. 2020). However, our understanding of how well the different features generated by these state-of-the-art technologies, such as cell type composition and tumour spatial patterns, predict patient clinical outcomes is still limited. To this end, we will introduce a procedure to identify informative spatial and non-spatial features and develop a corresponding model to predict patient survival outcomes. We investigate integration methods such as optimal transport to reconstruct the spatial pattern of additional antibodies not measured in the imaging data using information from CyTOF. This allows us to expand the feature space by characterising the spatial patterns of existing and imputed protein expression, individual cell types, and cell type interactions using spatial statistics. Finally, we generalize the predictive features from different experiments and evaluate them via patient survival prediction performance.
04:00 PM - 04:20 PM (EDT)
CAMDA - Improving Deep Learning Performance on Prediction of Drug-Induced Liver Injury
Drug-induced liver injury (DILI) is an important safety issue in drug development. The ability to accurately predict DILI from both in vivo and in vitro studies would be a substantial advantage in assessing a drug's potential to cause DILI, because hepatotoxicity might not be evident in the early stages of development. The international conference on Critical Assessment of Massive Data Analysis (CAMDA) releases challenges each year to tackle big-data problems in the life sciences; the CMap Drug Safety challenge focuses on the prediction of DILI. In previous years, deep learning has been applied to this challenge with lackluster performance. This year (2020), we seek to improve the performance of deep learning on this challenge by investigating different methods of preprocessing the data and different network architectures. Our current leading model predicts severe DILI more accurately than previous deep learning results.
04:00 PM - 04:20 PM (EDT)
MLCSB: Machine Learning - scNym: Semi-supervised neural networks for single cell identity classification
Single cell genomics experiments can reveal the keystone cellular actors in complex tissues. However, annotating cell type and state identities for each molecular profile in these experiments remains an analytical bottleneck. Here, we present scNym, a semi-supervised neural network that learns to transfer cell identity annotations from one experiment to another. scNym uses consistency regularization and entropy minimization techniques in the semi-supervised MixMatch framework to take advantage of information in both the labeled and unlabeled datasets. In benchmark experiments, we show that scNym offers superior performance to baseline approaches in transferring cell identity annotations across experiments performed with different technologies or in distinct biological conditions. We show with ablation experiments that semi-supervision techniques improved both the performance and calibration of scNym models. We also show that scNym models are well-calibrated and interpretable with saliency methods, allowing for review of model decisions by domain experts.
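Entropy minimization, one of the semi-supervision techniques mentioned above, penalizes uncertain predictions on unlabeled cells. The unlabeled-loss term can be sketched as follows (an illustrative sketch, not scNym's implementation):

```python
import math

def prediction_entropy(probs):
    """Shannon entropy of one predicted class distribution.

    Adding the mean entropy over unlabeled cells to the training loss pushes
    the classifier toward confident (low-entropy) class assignments on the
    unlabeled target dataset.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A uniform prediction over two classes has entropy log(2), while a one-hot prediction has entropy 0, so minimizing this term sharpens the model's cell identity calls.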
04:00 PM - 04:20 PM (EDT)
Education - An Introduction to Modern Computational Biology through Microbiome Research for High School Students
The Pre-College Program in Computational Biology (http://www.cbd.cmu.edu/education/pre-college-program-in-computational-biology/) is a three-week annual summer educational program in computational biology for high school students that launched in July 2019. The curriculum was oriented around the problem of understanding the microbiomes present in Pittsburgh’s three rivers. Students in this program collected water samples from multiple locations in the three rivers, performed wet-lab experiments to capture data from their samples, and then wrote algorithms to analyze the data that they generated, comparing the results against those from software used by current scientists. Students implemented algorithms for sequence alignment, genome assembly, gene prediction and annotation, image analysis, and machine learning-based microbiome analysis. The program demonstrated the vital interplay between experimental and computational biology to an audience of students who would likely not have had exposure to either subject. We reflect on our experience in the first year of this program and briefly discuss the results of our students' work, which led to two research manuscripts.
04:00 PM - 04:20 PM (EDT)
Function - ContactPFP: Protein Function Prediction Using Predicted Contact Information
Protein function prediction is an important task in bioinformatics. Although many current function prediction methods rely heavily on sequence-based information, the three-dimensional (3D) structure of proteins is very useful for identifying evolutionary relationships between proteins, from which functional similarity can be inferred. Here, we developed a novel protein function prediction method called ContactPFP, which uses predicted residue-residue contact maps to identify the structural similarity of proteins. ContactPFP showed comparable performance to existing sequence-based methods.
04:10 PM - 04:20 PM (EDT)
Evolution and Comparative Genomics - Modeling gene expression evolution with EvoGeneX uncovers differences in evolution of species, organs and sexes
While DNA sequence evolution is well studied, an equally important factor, the evolution of gene expression, is yet to be fully understood. The availability of recent tissue/organ-specific expression datasets spanning several organisms across the tree of life, including our new data from Drosophila, has enabled detailed studies of expression evolution. We introduce EvoGeneX, a computational method that complements existing models of expression evolution across species, using stochastic processes, maximum likelihood estimation and hypothesis testing to differentiate three modes of evolution: 1) neutral: Brownian Motion; 2) constrained: expression evolved toward an optimum (Ornstein-Uhlenbeck process); and 3) adaptive: expression in different branches of the species tree evolved toward different optima. Additionally, EvoGeneX incorporates biological replicates to account for within-species variation. We also introduce a novel comparative analysis of evolution across tissues and sexes using Michaelis-Menten (MM) curves. In our simulations, EvoGeneX significantly outperformed the currently available method in terms of false discovery rate. On expression data across organs, species, and sexes of Drosophila, our method revealed a large fraction of constrained genes, including genes constrained in all organs and sexes. Our MM-based approach revealed striking differences in evolutionary dynamics in gonads. Finally, EvoGeneX revealed compelling examples of adaptive evolution, including odor-binding proteins, ribosomal proteins, and amino acid metabolism genes.
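The neutral and constrained modes can be illustrated with a small simulation: an Ornstein-Uhlenbeck update with selection strength alpha reduces to Brownian motion at alpha = 0 and is pulled toward the optimum theta otherwise. This is a toy Euler-Maruyama sketch with assumed parameter names, not EvoGeneX itself:

```python
import random

def simulate_ou(x0, theta, alpha, sigma, n_steps, dt=0.01, seed=0):
    """Euler-Maruyama simulation of an Ornstein-Uhlenbeck process.

    alpha = 0 reduces to Brownian motion (the neutral model);
    alpha > 0 pulls expression toward the optimum theta (constrained model).
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        x += alpha * (theta - x) * dt + sigma * (dt ** 0.5) * rng.gauss(0, 1)
        path.append(x)
    return path
```

With a strong pull (large alpha), trajectories settle near theta regardless of the starting value, which is the signature the constrained-mode test looks for.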
04:10 PM - 04:35 PM (EDT)
SST05 - Visualization and Exploration of Heterogeneous Human Tissue Data Sets
The HuBMAP consortium generates, integrates, and disseminates multi-modal, single-cell data from human tissues. The heterogeneity and scale of these data sets pose new challenges for data visualization, such as integrating diverse data types and scaling to enormous dataset sizes. To address these issues, we have designed and implemented Vitessce, a visualization tool for exploring spatial single-cell experiments. Vitessce can be used both as a standalone tool as well as a component for portal user interfaces. In my presentation, I will introduce the features and architecture of Vitessce and discuss our strategies for tight integration between the HuBMAP Data Portal and Vitessce visualizations.
04:10 PM - 04:20 PM (EDT)
iRNA - Splicing variations contribute to the functional dysregulation of genes in acute myeloid leukemia.
Altered pre-mRNA splicing may result in aberrations that phenocopy classical somatic mutations. Despite the importance of RNA splicing, most studies of acute myeloid leukemia (AML) have not broadly explored the means by which altered splicing may functionally disrupt genes associated with AML. To address this gap, we investigated the splicing variability of 70 AML-associated genes within RNA-Seq data from 29 in-house AML patient samples (PENN cohort). In brief, using the MAJIQ splicing quantification algorithm, we detected 40 highly variable splicing events across the patients of the PENN cohort, many of which are novel and reduce protein expression without changing overall transcript abundance. Splicing variability occurred independently of known cis-mutations, highlighting pathogenic mechanisms overlooked by standard genetic analyses. We also find that these 40 splicing events are significantly more variable within the ~400-patient BEAT-AML cohort than in normal CD34+ cells. Furthermore, hierarchical clustering revealed a high degree of correlation between 23 of these 40 splicing events in both the PENN and BEAT-AML cohorts, suggesting a pathogenic co-regulation that is not observed in normal CD34+ cells. Overall, our findings highlight underlying transcriptomic complexity across AML populations and demonstrate how previously unreported splicing variations contribute to protein dysregulation in AML.
04:20 PM - 04:30 PM (EDT)
MLCSB: Machine Learning - Gromov-Wasserstein based optimal transport to align single-cell multi-omics data
Data integration of single-cell measurements is critical for our understanding of cell development and disease, but the lack of correspondence between different types of single-cell measurements makes such efforts challenging. Several unsupervised algorithms are capable of aligning heterogeneous types of single-cell measurements in a shared space, enabling the creation of mappings between single cells in different data modalities. We present Single-Cell alignment using Optimal Transport (SCOT), an unsupervised learning algorithm that uses Gromov-Wasserstein-based optimal transport to align single-cell multi-omics datasets. SCOT calculates a probabilistic coupling matrix that matches cells across two datasets. The optimization uses k-nearest neighbor graphs, thus preserving the local geometry of the data. We use the resulting coupling matrix to project one single-cell dataset onto another via barycentric projection. We compare the alignment performance of SCOT with state-of-the-art algorithms on three simulated and two real datasets. Our results demonstrate that SCOT yields results that are comparable in quality to those of competing methods, but SCOT is significantly faster and requires tuning fewer hyperparameters.
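Once a coupling matrix is available (e.g. from a Gromov-Wasserstein solver such as the one in POT's ot.gromov module), the barycentric projection step can be sketched in a few lines. This is an illustrative sketch, not SCOT's code:

```python
import numpy as np

def barycentric_projection(coupling, target):
    """Project source cells into the target domain using an OT coupling.

    coupling: (n_source, n_target) probabilistic matching between cells;
    target: (n_target, d) feature matrix of the second dataset. Each source
    cell is mapped to the coupling-weighted average of its matched target
    cells, placing both modalities in one shared space.
    """
    row_sums = coupling.sum(axis=1, keepdims=True)
    return (coupling @ target) / row_sums
```

When the coupling is close to a one-to-one matching, the projection simply transports each source cell onto its matched target cell.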
04:20 PM - 04:40 PM (EDT)
Function - Fine-tuning of Language Model-Based Representation for Protein Functional Annotation
We present our approach to fine-tuning language model-based representations for the prediction of (i) gene ontology (GO), (ii) human phenotype ontology (HPO), and (iii) disorder ontology (DO) terms, as subtasks of CAFA 4. Recently, transfer learning has shown significant improvements in many machine learning problems, particularly where annotated data are scarce. The combination of being self-supervised and being sufficiently general makes neural language modeling an ideal candidate for transfer learning on sequential data. The trained language modeling network can subsequently be fine-tuned for any particular task, even when only a limited number of annotations are available. In the CAFA 4 subtasks, we use language model-based transfer learning in the following steps: (i) we train a language model-based representation of protein sequences on a large collection of protein sequences (UniRef50); (ii) we fine-tune the obtained model for the supervised task of neural GO prediction, which has relatively more training instances than HPO and DO; (iii) we fine-tune the GO-tuned model a second time, now for the prediction of HPO and DO terms. To improve the predictions, we use an ensemble of different fine-tuning paths from language modeling to the supervised annotation prediction of interest.
04:20 PM - 04:40 PM (EDT)
Education - Introducing genome assembly to the general public through interactive word games
Reconstructing genomes from DNA sequencing reads - genome assembly - is a fundamental task in genomics that is the foundation for many downstream analyses. Genome assembly also reveals the power provided by the combination of biology and computer science - innovations in assembly algorithms were critical to the genomic revolution. To introduce these concepts to the general public and to illustrate computational thinking paradigms related to assembly algorithms, we developed a simple word game similar to magnetic poetry kits. The game involves reconstructing repetitive phrases from fragments printed on magnets. The size of the fragments and complexity of the phrases can be varied to adjust the level of difficulty. Using a metal whiteboard as a backing for the game also creates the opportunity for introducing graph-based solutions to the genome assembly problem, while collaborative team teaching within a classroom setting also enables a discussion of parallel algorithms. In our presentation, we will describe lesson plans built around this game and highlight our experiences in deploying them at Maryland Day (an open house event organized at the University of Maryland each spring) and within a summer camp aimed at introducing K-12 students to computer science.
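The word game maps directly onto the classic overlap-based formulation of assembly; a toy greedy assembler over word fragments (a hypothetical sketch for classroom illustration, not production assembly software) might look like:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b (word lists)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap,
    mirroring the overlap-graph view of genome assembly."""
    frags = [list(f) for f in fragments]
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, i, j)
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    k = overlap(frags[i], frags[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0]
```

With repetitive phrases, the greedy strategy can collapse repeats into a single copy, which is exactly the ambiguity the magnetic-poetry exercise is designed to expose.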
04:20 PM - 04:40 PM (EDT)
iRNA - PEGASAS: A pathway-guided approach for analyzing pre-mRNA alternative splicing during cancer progression
Aberrant pre-mRNA alternative splicing (AS) is widespread in cancer, but the causes and consequences of AS dysregulation during cancer progression are not well understood. We developed a novel computational framework, PEGASAS, a pathway-guided approach for examining the effects of oncogenic signaling on exon incorporation. PEGASAS was designed to study the interplay among oncogenic signaling, AS, and affected biological processes. In this study, we applied PEGASAS to define the AS landscape across prostate cancer disease states and the relationship between splicing and known driver alterations. We compiled a meta-dataset of RNA-seq data from 876 tissue samples from publicly available sources, covering a range of disease states, from normal tissues to aggressive metastatic tumors. PEGASAS analysis revealed a correlation between Myc signaling and splicing changes in RNA binding proteins (RBPs), suggestive of a previously undescribed auto-regulatory phenomenon. We experimentally verified this result in a human prostate cell transformation assay. Our findings establish a role for Myc in regulating RNA processing by controlling the incorporation of nonsense-mediated decay-determinant exons in RBP-encoding genes. In conclusion, PEGASAS can mine large-scale transcriptomic data to connect changes in pre-mRNA AS with oncogenic alterations that are common to many cancer types.
04:20 PM - 04:40 PM (EDT)
Evolution and Comparative Genomics - Phylogenetic double placement of mixed samples
Motivation: Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction. Results: We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between the k-mer sets representing each sample. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that deconvolutes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MISA, on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice. Availability: The software and data are available at https://github.com/balabanmetin/misa.
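The Jaccard index between k-mer sets, on which the model is built, is straightforward to compute. This is a minimal sketch; real genome skims would use sketching tools over far larger k and much longer sequences:

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(seq_a, seq_b, k=4):
    """Jaccard index |A ∩ B| / |A ∪ B| between the k-mer sets of two sequences."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)
```

For a mixed sample, the observed Jaccard index against each reference is a blend of the two constituents' indices, which is the quantity the deconvolution model works backward from.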
04:20 PM - 04:40 PM (EDT)
CAMDA - Gene expression signature-based machine learning classifier of drug-induced liver injury
Drug-induced liver injury (DILI) is considered a primary factor in regulatory clearance during drug development, and there is a pressing need to develop and evaluate new prediction models for DILI. The CAMDA 2020 CMap Drug Safety Challenge included 422 drugs for training and 195 drugs with blinded labels for testing, to predict four types of DILI classes (DILI1, DILI3, DILI5, and DILI6). We took a machine learning (ML) approach focusing on drug-perturbation gene expression signatures from six human cell lines (PHH, HEPG2, HA1E, A375, MCF7, and PC3). We created representative expression signatures - the 250 most up-regulated and 250 most down-regulated genes - for each drug using Kruskal-Borda merging of ranked z-score profiles. Various ML algorithms, including random forest (RF), recursive partitioning and regression trees (RPART), support vector machines (SVM), generalized linear models (GLM), and naïve Bayes classifiers, were built and evaluated using 100 repeats of 5-fold cross-validation. The initial models' performance ranges from an AUROC of 0.818 (RF, DILI3) to 0.491 (GLM, DILI6). These models are still a work in progress; data from Tox21, FAERS, and Mold2 will need to be incorporated alongside the gene expression responses garnered from CMap.
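The rank-merging step can be illustrated with a simple Borda count over per-cell-line gene rankings (an illustrative sketch; the actual Kruskal-Borda procedure may differ in its scoring details):

```python
def borda_merge(rankings, top_n=250):
    """Borda-style rank aggregation across cell-line rankings.

    Each ranking awards len(ranking) - position points to a gene; genes are
    sorted by total points (ties broken alphabetically). Assumes each ranking
    lists genes from most to least up-regulated.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, gene in enumerate(ranking):
            scores[gene] = scores.get(gene, 0) + (n - pos)
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return ordered[:top_n]
```

Running the same aggregation on rankings from least to most expressed would yield the down-regulated half of the signature.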
04:20 PM - 04:40 PM (EDT)
ODSS - Integrating Biomedical Informatics and Data Science to Prepare the Precision Medicine Workforce
As biomedicine, and in particular translational research, has entered the era of big data and artificial intelligence, the Washington University School of Medicine has developed a precision medicine roadmap intended to respond to such trends. At the core of this roadmap is the promise of precise and data-driven approaches to enabling the delivery of the right treatment to the right patient at the right time, all with the objective of saving and improving lives. As part of our roadmap, we believe that the biomedical research teams of the future will increasingly need to utilize multi-disciplinary approaches, notably incorporating the use of biomedical informatics and data science theories and methods. Based upon this belief, we have launched a comprehensive portfolio of practical, in-career, and scientific training programs, all of which transcend traditional disciplinary boundaries, and that ultimately seek to create a workforce capable of delivering on the promise of precision medicine. In this presentation, we will review the structure, curriculum, and lessons learned as a result of establishing such education and workforce development programs. In particular, we will focus on the critical need to combine both biomedical informatics and data science methodologies with driving biological and clinical problems, so as to engage and support learners in a highly contextualized and experiential learning environment.
04:30 PM - 04:40 PM (EDT)
MLCSB: Machine Learning - A Non-Parametric Bayesian Framework for Detecting Coregulated Splicing Signals in Heterogeneous RNA Datasets with Applications to Acute Myeloid Leukemia
Analysis of RNASeq data from large patient cohorts can reveal transcriptomic perturbations that are associated with disease. This is typically framed as an unsupervised learning task to discover latent structure in a data matrix. However, the heterogeneity of these datasets makes such analysis challenging. For example, in acute myeloid leukemia, mutations in splice factor genes occurring in a subset of the patients may only result in alteration of a subset of splicing events. Thus, there is a need to identify “tiles”, defined by a subset of samples and splicing events with abnormal signals. Although algorithms exist for this task, they fail to model splicing data. To address these challenges, we propose CHESSBOARD, a non-parametric Bayesian model for unsupervised discovery of tiles. Our algorithm does not require a priori knowledge of the number of tiles and uses a unique missing-value model for cancer data. First, we apply our model to synthetic datasets and show it outperforms several baseline approaches. Next, we show that it recovers tiles characterized by splicing aberrations which are reproducible in multiple AML patient cohorts. Finally, we show that the tiles we discover are correlated with drug response to therapeutics, pointing to the translational potential of our findings.
04:35 PM - 05:00 PM (EDT)
SST05 - Common coordinates for registering human scale data at multiple scales
The Common Coordinate Framework (CCF) consists of ontologies, reference object libraries, and computer software (e.g., user interfaces) that enable biomedical experts to semantically annotate tissue samples and to precisely describe their locations in the human body (“registration”), to align multi-modal tissue data extracted from different individuals to a reference coordinate system (“mapping”), and to provide tools for searching and browsing HuBMAP data at multiple levels, from the whole body down to single cells (“exploration”).
04:45 PM - 06:00 PM (EDT)
Panel Discussion
05:00 PM - 05:10 PM (EDT)
MLCSB: Machine Learning - Knowledge-primed neural networks enable biologically interpretable deep learning on single-cell sequencing data
Deep learning algorithms are powerful predictors but generally provide little insight into the functions that underlie a prediction. We hypothesized that deep learning on biological networks instead of artificial networks assigns meaningful weights that can be readily interpreted, and we therefore developed knowledge-primed neural networks (KPNNs). In KPNNs, every node has a molecular equivalent (protein activity) and every edge has a mechanistic interpretation (a regulatory interaction). KPNNs thereby connect gene expression to cellular phenotypes through large regulatory networks. After training on large single-cell transcriptomics datasets, KPNNs reveal important regulatory proteins. We validate KPNN interpretability on simulated data and on biological systems with known ground truth, and reveal regulatory proteins in underexplored systems. We show that KPNN interpretations largely differ from standard interpretation approaches that are focused on features, and that interpretations are lost in shuffled networks. These results are enabled by an optimized learning method that stabilizes node weights after random initialization and controls for connectivity biases in biological networks. In summary, KPNNs combine the predictive power of deep learning with the interpretability of biological networks. While demonstrated on single-cell sequencing data and regulatory networks, our method is broadly relevant to all research areas where domain knowledge can be represented as networks.
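The core idea - restricting each layer's weights to edges present in a regulatory network - can be sketched with a masked forward pass (a toy numpy illustration, not the authors' implementation):

```python
import numpy as np

def kpnn_forward(x, weights, masks):
    """Forward pass through a knowledge-primed network.

    Each layer's weight matrix W is elementwise-masked by a 0/1 adjacency
    matrix M derived from a regulatory network, so only biologically
    plausible edges carry signal; masked edges contribute nothing.
    """
    a = x
    for W, M in zip(weights, masks):
        a = np.tanh(a @ (W * M))
    return a
```

During training, gradients flow only through unmasked entries, so each learned weight retains its mechanistic interpretation as the strength of one regulatory interaction.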
05:00 PM - 05:20 PM (EDT)
Education - Guidelines for curriculum and course development in higher education and training
Background: Curriculum and instructional development should follow a formal process. Although formal curriculum theory focuses on long-term programs of study, the process is also applicable to shorter-form Learning Experiences (LEs) such as single courses, lessons, or training sessions. Successful curricula and instruction support learners as they develop from entry-level performance to the minimum qualification for completing a program or course, articulated in terms of Learning Outcomes (LOs). These considerations have been encapsulated in an iterative model of curriculum and instructional design, with guidelines for its use. Output and conclusion: The starting point is the articulation of target LOs: everything follows from these, including the selection of LEs and content, the development of assessments, and the evaluation of the resulting curriculum/instruction. The iterative process can be used in curriculum and instructional development, and provides a set of practical guidelines for curriculum and course preparation. The essential features of effective curriculum and instruction (i.e., that which achieves its stated LOs for the majority of learners) are presented here, to offer practical guidance and support for devising and evaluating both short- and long-form teaching.
05:00 PM - 05:20 PM (EDT)
Function - INGA protein function prediction for the dark proteome
Our current knowledge of complex biological systems is stored in a computable form through the Gene Ontology (GO), which provides a comprehensive description of gene function. Prediction of GO terms from sequence remains, however, a challenging task, which is particularly critical for novel genomes. Here we present INGA version 3.0, a new version of the INGA software for protein function prediction. INGA exploits homology, domain architecture and information from the ‘dark proteome’, i.e. all those features that are difficult to detect by looking at sequence conservation. In the new version, implemented for CAFA4, we also include low-complexity, coiled-coil and homorepeat information. Dark features are used to enrich domain architectures, increasing the specificity of the associated functions; they also allow the characterization of proteins that do not match any known domain. INGA ranked in the top ten methods in CAFA2 and second for the majority of CAFA3 challenges. The new algorithm can process entire genomes in a few hours, or even less when additional input files are provided. The INGA web server, databases and benchmarking are available at https://inga.bio.unipd.it/.
05:00 PM - 05:20 PM (EDT)
CAMDA - AITL: Adversarial Inductive Transfer Learning with input and output space adaptation for pharmacogenomics
Motivation: The goal of pharmacogenomics is to predict drug response in patients using their single- or multi-omics data. A major challenge is that clinical data (i.e. patients) with drug response outcomes are very limited, creating a need for transfer learning to bridge the gap between large pre-clinical pharmacogenomics datasets (e.g. cancer cell lines), as a source domain, and clinical datasets, as a target domain. Two major discrepancies exist between pre-clinical and clinical datasets: 1) in the input space, gene expression data differ due to differences in the underlying biology, and 2) in the output space, drug response is measured differently. Therefore, training a computational model on cell lines and testing it on patients violates the i.i.d. assumption that train and test data come from the same distribution. Results: We propose Adversarial Inductive Transfer Learning (AITL), a deep neural network method for addressing the discrepancies in input and output space between pre-clinical and clinical datasets. AITL takes gene expression of patients and cell lines as input, employs adversarial domain adaptation and multi-task learning to address these discrepancies, and predicts the drug response as output. To the best of our knowledge, AITL is the first adversarial inductive transfer learning method to address both input and output discrepancies. Experimental results indicate that AITL outperforms state-of-the-art pharmacogenomics and transfer learning baselines and may guide precision oncology more accurately.
05:00 PM - 05:20 PM (EDT)
Evolution and Comparative Genomics - Sampling and Summarizing Transmission Trees with Multi-strain Infections
Motivation: The combination of genomic and epidemiological data holds the potential to enable accurate inference of pathogen transmission histories. However, the inference of outbreak transmission histories remains challenging due to factors such as within-host pathogen diversity and multi-strain infections. Current computational methods ignore within-host diversity and/or multi-strain infections, and often fail to accurately infer the transmission history. Thus, there is a need for efficient computational methods for transmission tree inference that accommodate the complexities of real data. Results: We formulate the Direct Transmission Inference (DTI) problem for inferring transmission trees that support multi-strain infections given a timed phylogeny and additional epidemiological data. We establish hardness for the decision and counting versions of the DTI problem. We introduce TiTUS, a method that uses SATISFIABILITY to sample almost uniformly from the space of transmission trees. We introduce criteria that prioritize parsimonious transmission trees, which we subsequently summarize using a novel consensus tree approach. We demonstrate TiTUS’s ability to accurately reconstruct transmission trees on simulated data as well as on a documented HIV transmission chain. Availability: https://github.com/elkebir-group/TiTUS
05:00 PM - 05:20 PM (EDT)
SCANGEN - Single cell whole genome sequencing for studying cancer evolution
05:00 PM - 05:40 PM (EDT)
iRNA - iRNA Keynote: Systematic approaches to study the subcellular localization properties of RNAs and RNA Binding Proteins
Our laboratory seeks to understand the biological functions and regulatory mechanisms of RNA intracellular localization. Our projects aim to elucidate the normal functions of RNA trafficking in the maintenance of genome stability and cell polarity, and how disruption of these pathways can contribute to the aetiology of diseases such as cancer and neuromuscular disorders. For this work, we combine the versatility of Drosophila genetics with high-throughput molecular imaging and functional genomics approaches in fly and human cellular models. During this presentation, I will provide an update in our efforts i) to assess the global cellular distribution properties of the human and fly transcriptomes using fractionation-sequencing methodologies, ii) to systematically map the cytotopic distribution properties of human RNA binding proteins, and iii) to characterize perturbations in post-transcriptional regulatory pathways that may be linked to disease aetiology.
05:10 PM - 05:20 PM (EDT)
MLCSB: Machine Learning - DeepPLIER: a deep learning approach to pathway-level representation of gene expression data
Extracting latent representations of gene expression data can provide insights into components and activations of gene pathways and networks. Inspired by the Pathway-level information extractor (PLIER) method, we propose a deep learning model, DeepPLIER, that incorporates pathway information in its architecture as a prior for extracting latent variables (LVs) from gene expression data. DeepPLIER constructs LVs as combinations of known pathways by using partially connected nodes, but also includes fully connected nodes to correct for pathway misspecifications and to learn potentially unknown pathways from data. By incorporating pathway information, extraction of biologically relevant LVs is encouraged. Using simulation, we show that DeepPLIER achieves higher LV estimation accuracy than PLIER, especially in scenarios where prior pathways are partially missing. Using two large gene expression datasets from bulk brain tissues (ROSMAP and CommonMind), we show that DeepPLIER attains higher LV replication than PLIER. We also show that some of the identified LVs represent cell type proportions, and that these LVs align more accurately with experimental estimates from immunohistochemistry than PLIER for a number of brain cell types.
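The pathway-prior idea described above can be sketched as a masked linear encoder, in which a gene contributes to a latent variable only where a binary pathway prior allows the connection. The mask, gene counts, and pathway assignments below are invented for illustration; this is not the DeepPLIER implementation.

```python
import random

# Sketch of a pathway-masked linear encoder. LV0 may only use genes 0-2 and
# LV1 only genes 3-5 (a hypothetical prior); an all-ones column would act as
# a fully connected "free" LV for unknown pathways.
random.seed(0)
n_genes, n_lvs = 6, 2
mask = [[1 if (lv == 0 and g < 3) or (lv == 1 and g >= 3) else 0
         for lv in range(n_lvs)] for g in range(n_genes)]
# weights are zeroed wherever the prior forbids a gene-to-LV connection
weights = [[random.gauss(0, 1) * mask[g][lv] for lv in range(n_lvs)]
           for g in range(n_genes)]

def encode(expression):
    return [sum(expression[g] * weights[g][lv] for g in range(n_genes))
            for lv in range(n_lvs)]

x = [random.gauss(0, 1) for _ in range(n_genes)]
lv_before = encode(x)[0]
x[3] += 10.0                        # perturb a gene outside LV0's pathway
assert encode(x)[0] == lv_before    # LV0 is unaffected, by construction
```

The assertion illustrates the point of the prior: a latent variable tied to a pathway is insensitive to genes outside that pathway.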
05:20 PM - 05:40 PM (EDT)
SCANGEN - Normalisr: inferring single-cell differential and co-expression with linear association testing
Single-cell RNA sequencing (scRNA-seq) may provide unprecedented technical and statistical power to study gene expression and regulation within and across cell types. However, due to its sparsity and technical variations, developing a superior single-cell computational method for differential expression (DE) and co-expression remains challenging. Here we present Normalisr, a parameter-free normalization-association two-step inferential framework for scRNA-seq that solves case-control DE, co-expression, and pooled CRISPRi scRNA-seq screens under one umbrella of linear association testing. Normalisr addresses these challenges with posterior mRNA abundances, nonlinear cellular summary covariates, and mean and variance normalization. Together, these enable linear association testing to achieve optimal sensitivity, specificity, and speed in all of the above scenarios. Normalisr recovers high-quality transcriptome-wide co-expression networks from conventional scRNA-seq of T cells in human melanoma and robust gene regulations from pooled CRISPRi scRNA-seq screens. Normalisr provides a unified framework for optimal, scalable hypothesis testing in scRNA-seq.
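As a minimal illustration of the linear association testing that underpins such a framework, the sketch below computes a Pearson correlation between a (already normalized) gene vector and a condition vector, and converts it to a t statistic with n - 2 degrees of freedom. The data are made up, and Normalisr's actual normalization steps are not shown.

```python
from math import sqrt

# Sketch: a basic linear association test between two vectors.
def linear_assoc(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    r = cov / sqrt(vx * vy)                  # Pearson correlation
    t = r * sqrt((n - 2) / (1 - r * r))      # t statistic, df = n - 2
    return r, t

# toy case-control example: expression tracks the binary condition
r, t = linear_assoc([0, 0, 1, 1], [0.1, 0.2, 0.9, 1.0])
assert r > 0.99 and t > 10
```

The appeal of reducing DE, co-expression, and screen analysis to this one test is speed: it is a handful of vector operations per gene pair.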
05:20 PM - 05:30 PM (EDT)
MLCSB: Machine Learning - Inferring Signaling Pathways with Probabilistic Programming
Cells regulate themselves via complex biochemical processes called signaling pathways. These are usually depicted as networks, where nodes represent proteins and edges indicate their influence relationships. To understand diseases and therapies at the cellular level, it is crucial to understand the signaling pathways at work. Because signaling pathways can be rewired by disease, inferring signaling pathways from context-specific data is highly valuable. We formulate signaling pathway inference as a Dynamic Bayesian Network (DBN) structure learning problem on phosphoproteomic time-course data. We take a Bayesian approach, using MCMC to sample DBN structures. We use a novel proposal distribution that efficiently samples large, sparse graphs. We also relax some modeling assumptions made in past works. We call the resulting method Sparse Signaling Pathway Sampling (SSPS). We implement SSPS in Julia, using the Gen probabilistic programming language. We evaluate SSPS on simulated and real data. SSPS attains superior scalability for large problems in comparison to other DBN techniques. We also find that it competes well against established methods on the HPN-DREAM breast cancer network reconstruction challenge. SSPS significantly improves Bayesian techniques for network inference and gives a proof of concept for probabilistic programming in this setting.
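A toy sketch of Metropolis-Hastings sampling over sparse edge sets conveys the flavor of (but is far simpler than) the structure sampling described above. The score function, the rewarded "true" edge, and all constants below are invented for illustration; SSPS itself is implemented in Julia/Gen with a more sophisticated proposal.

```python
import math
import random

# Invented score: reward one hypothetical "true" edge, penalize every edge
# to encourage sparsity (a stand-in for data likelihood plus a sparse prior).
def log_score(edges):
    fit = 5.0 * sum(1 for e in edges if e == (0, 1))  # data-fit surrogate
    return fit - 2.0 * len(edges)                     # sparsity penalty

def sample_edges(n_nodes=3, iters=2000, seed=1):
    rng = random.Random(seed)
    edges, counts = set(), {}
    for _ in range(iters):
        u, v = rng.sample(range(n_nodes), 2)          # symmetric proposal
        proposal = set(edges) ^ {(u, v)}              # toggle one edge
        # Metropolis acceptance rule for a symmetric proposal
        if math.log(rng.random() + 1e-300) < log_score(proposal) - log_score(edges):
            edges = proposal
        for e in edges:                               # tally posterior samples
            counts[e] = counts.get(e, 0) + 1
    return counts

counts = sample_edges()
# the rewarded edge should be present in the large majority of samples
assert counts.get((0, 1), 0) > 1000
```

Edge frequencies across samples then serve as posterior confidence estimates for each candidate influence relationship.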
05:20 PM - 05:40 PM (EDT)
Function - CaoLab: Protein function and disorder prediction from sequence based on RNN
Much progress has been made in the machine learning and natural language processing fields. Here we introduce the CaoLab server, which participated in the latest CAFA4 experiment. We used natural language processing and machine learning techniques to tackle the protein function prediction and disorder prediction problems. ProLanGO2 predicts protein function from protein sequence, and ProLanDO predicts protein disorder from protein sequence. The latest version of the UniProt database (as of 12/12/2019) was used to extract the top 2000 most frequent k-mers (k from 3 to 7) to build a fragment sequence database, FSD. The ProLanDO method uses the DO database provided by CAFA4 (https://www.disprot.org/), with each sequence filtered against FSD, and trains a character-level RNN model to classify DO terms. The ProLanGO2 method is an updated version of ProLanGO, published in 2017, and uses the latest version of the UniProt database filtered by FSD. The Encoder-Decoder network, a model consisting of two RNNs (encoder and decoder), is used to train models on the dataset, and the top 100 best-performing models are used to select ensemble models as the final model for protein function prediction.
05:20 PM - 05:40 PM (EDT)
CAMDA - Improved survival analysis by learning shared genomic information from pan-cancer data
Motivation: Recent advances in deep learning have offered solutions to many biomedical tasks. However, there remains a challenge in applying deep learning to survival analysis using human cancer transcriptome data. Since the number of genes, the input variables of the survival model, is larger than the number of available cancer patient samples, deep learning models are prone to overfitting. To address the issue, we introduce a new deep learning architecture called VAECox. VAECox employs transfer learning and fine-tuning. Results: We pre-trained a variational autoencoder on all RNA-seq data in 20 TCGA datasets and transferred the trained weights to our survival prediction model. We then fine-tuned the transferred weights while training the survival model on each dataset. Results show that our model outperformed previous models such as Cox-PH with LASSO and ridge penalties and Cox-nnet on 7 of 10 TCGA datasets in terms of C-index. These results signify that the transferred information obtained from the entire cancer transcriptome helped our survival prediction model reduce overfitting and show robust performance on unseen cancer patient samples. Availability: Our implementation of VAECox is available at https://github.com/SunkyuKim/VAECox
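The C-index used for evaluation above can be computed as follows. This is a minimal sketch of Harrell's concordance index with toy inputs, not the authors' evaluation code.

```python
# Harrell's concordance index: the fraction of comparable patient pairs in
# which the model assigns higher risk to the patient who fails earlier.
def c_index(times, events, risks):
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            # a pair is comparable if subject i had an observed event
            # (events[i] == 1) strictly before subject j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:      # higher risk, earlier event
                    concordant += 1
                elif risks[i] == risks[j]:   # ties count half
                    concordant += 0.5
    return concordant / comparable

# perfect ranking: predicted risk decreases as survival time increases
assert c_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0]) == 1.0
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect concordance, which is why it is the standard yardstick for comparing Cox-style models.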
05:20 PM - 06:00 PM (EDT)
Education - COSI Education Keynote Talk: Online Data Science Education and its effect on my classroom teaching
Educational institutions across the world are responding to the unprecedented demand for training in statistics and data science by creating new courses, curricula, and degrees in applied statistics and data science. We have participated in two data science courses taught at Harvard and in the creation of an online course on data analysis for the life sciences, as well as a Data Science series composed of 9 courses. In this presentation, I will first try to define data science and explain what aspects of it I teach. Then I will discuss our approach to developing a MOOC based almost exclusively on real-world examples and how our lectures revolved around dozens of exercises that required R programming to answer.
05:30 PM - 05:40 PM (EDT)
MLCSB: Machine Learning - Can We Trust Convolutional Neural Networks for Genomics?
Convolutional neural networks (CNNs) are powerful methods to predict transcription factor binding sites from DNA sequence. Although CNNs are largely considered a "black box", attribution-based interpretability methods can be employed to identify single nucleotide variants that are important for model predictions. However, there is no guarantee that attribution methods will recover meaningful features even for state-of-the-art CNNs. Here we train CNNs with different architectures and training procedures on synthetic sequences embedded with known motifs and then quantitatively measure how well attribution methods recover ground truth. We find that deep CNNs tend to recover less interpretable motifs, despite yielding superior performance on held-out test data. This suggests that good model performance does not necessarily imply good model interpretability. Strikingly, we find that adversarial training, a method to promote robustness to small perturbations of the input data, can significantly improve the efficacy of attribution methods. We also find that CNNs specially designed with an inductive bias to learn strong motif representations consistently improve interpretability. We then show that these results generalize to in vivo ChIP-seq data. This work highlights the importance of moving beyond performance on benchmark datasets when considering whether to trust a CNN’s prediction in genomics.
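One simple attribution method in the family discussed above is in silico mutagenesis: score every single-nucleotide substitution and record the change relative to the reference. In the sketch below an assumed toy motif-counting function stands in for a trained CNN; the sequence and motif are invented.

```python
ALPHABET = "ACGT"

# toy stand-in for a trained model: count occurrences of a hypothetical motif
def score(seq, motif="TGACGT"):
    return sum(seq[i:i + len(motif)] == motif for i in range(len(seq)))

# in silico mutagenesis: attribution[i][j] is the score change when position
# i is substituted with ALPHABET[j]
def ism(seq):
    ref = score(seq)
    return [[score(seq[:i] + b + seq[i + 1:]) - ref for b in ALPHABET]
            for i in range(len(seq))]

attr = ism("AATGACGTAA")
assert attr[2][ALPHABET.index("A")] == -1  # T->A at the motif start breaks it
assert min(attr[0]) == 0                   # first base lies outside the motif
```

Positions inside the embedded motif get large negative attributions when mutated, which is exactly the "ground truth recovery" the benchmark above quantifies.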
05:30 PM - 05:40 PM (EDT)
Evolution and Comparative Genomics - Reading the book of life: the language of proteins
Background Genomes are remarkably similar to natural language texts. From an information theory perspective, we can think of amino acid residues as letters, protein domains as words, and proteins as sentences consisting of ordered arrangements of protein domains (domain architectures). This work describes our recent efforts towards understanding the linguistic properties of genomes. Results Our recent work showed that the complexity of “grammars” in all major branches of life is close to a universal constant of ~1.2 bits. This is remarkably similar to natural languages; such a--yet unexplained--universal information gain has been observed and is generally used to determine whether a series of symbols represents a language. In this work, we describe the implications of this work and its extension in various areas, with a particular emphasis on measuring proteome complexities in human tissues. Conclusion Our work established the similarity between natural languages and genomes and showed, for the first time, that there exists a “quasi-universal grammar” of protein domains; it also measured the minimal complexity of the proteome required for a functional cell. We also describe the proteome complexities in human tissues and their functional significance.
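The per-symbol information quantities behind such analyses can be illustrated with made-up domain architectures: compare the entropy of "words" (domains) against "word pairs" (adjacent domains). The architectures and the domain names below are invented, and this toy gain is not the ~1.2-bit constant itself.

```python
from collections import Counter
from math import log2

# invented "sentences" of protein domains
architectures = [["kinase", "SH2"], ["kinase", "SH3"], ["SH2", "SH3"],
                 ["kinase", "SH2"], ["kinase", "SH2"]]

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

unigrams = Counter(d for arch in architectures for d in arch)
bigrams = Counter(tuple(arch[i:i + 2]) for arch in architectures
                  for i in range(len(arch) - 1))

h1 = entropy(unigrams)   # per-domain entropy
h2 = entropy(bigrams)    # per-bigram entropy
gain = 2 * h1 - h2       # information shared between adjacent domains
assert gain > 0          # adjacent domains are not independent here
```

A positive gain means domain order carries information, the "grammar" signal that this line of work measures at genome scale.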
05:40 PM - 06:00 PM (EDT)
CAMDA - Closing remarks and awards
05:40 PM - 06:00 PM (EDT)
iRNA - iRNA: concluding Remarks
05:40 PM - 06:00 PM (EDT)
Function - Protein function inference via curated aggregate co-expression networks.
Background We explore the specific utility of co-expression data for the CAFA challenge. Our lab has assembled a set of highly curated aggregate co-expression networks derived from 895 datasets (~39,517 samples) from 14 species, providing a well-powered source for inference. Methods We tested performance via the standard CAFA approach of calculating Fmax per protein. Our “direct expression” method simply finds the top co-expressed genes of benchmark genes and transfers functions where available. Our “orthoexpression” method extends species coverage by first looking up the top homologs of the benchmark genes, finding co-expressed genes from those, and then transferring functions. Results/Conclusions We found evidence that expression data contains information that can be used to improve functional annotation. Both our methods performed well relative to the pure sequence-based (phmmer) baseline, particularly for BP. To quantify potential performance improvements, we asked how well we could do if we knew in advance which method is better. Plots of the predictions against one another show substantial room for improvement. We find that while aggregated co-expression data can successfully be used to improve protein function prediction, integration with sequence-based methods is a major area for potential progress in the field.
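The "direct expression" idea can be sketched with toy data: rank partners by co-expression with a query gene and transfer annotations from the top hits. The gene names, correlations, and GO terms below are invented for illustration.

```python
# hypothetical correlations of candidate partners with a query gene
coexpression = {"geneA": 0.9, "geneB": 0.8, "geneC": 0.1}
# hypothetical GO annotations of those partners
annotations = {
    "geneA": {"GO:0006412"},
    "geneB": {"GO:0006412", "GO:0008152"},
    "geneC": {"GO:0099999"},
}

def transfer(partners, k=2):
    # take the k most strongly co-expressed partners of the query
    top = sorted(partners, key=partners.get, reverse=True)[:k]
    predicted = {}
    for g in top:
        for term in annotations.get(g, ()):
            # score each term by the fraction of top partners voting for it
            predicted[term] = predicted.get(term, 0) + 1 / k
    return predicted

pred = transfer(coexpression)
assert pred["GO:0006412"] == 1.0     # both top partners vote for it
assert "GO:0099999" not in pred      # weakly co-expressed gene is ignored
```

The orthoexpression variant would insert one extra step: map the query to its top homolog in a better-covered species before looking up co-expression partners.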
05:40 PM - 05:50 PM (EDT)
Evolution and Comparative Genomics - What is the structure of the ‘evolutionary model space’ for proteins?
Estimates of amino acid exchangeabilities are central to models of protein evolution; there have been many efforts to estimate those parameters from large numbers of proteins. Although models trained in this way can be useful for phylogenetic analyses, they provide limited information about the process of protein evolution. Recent studies have revealed that patterns of protein evolution (assessed using amino acid exchangeability parameters) vary across the tree of life; in other words, the processes underlying protein evolution are non-homogeneous. However, optimizing a 20-state non-homogeneous model requires the estimation of many free parameters. Thus, it represents a challenging computational problem. There are two straightforward ways to simplify this problem: 1) estimate parameters for the best-approximating time-reversible model using restricted taxon sets; or 2) reduce the number of free parameters by constraining the model using biochemical information. Different protein structural environments are also associated with distinct amino acid substitution patterns; this results in a mixture of underlying models that also differ among taxa. Efforts to use the approaches described above, in combination with structural partitioning, to understand the space of protein evolutionary models will be described.
05:50 PM - 06:00 PM (EDT)
Evolution and Comparative Genomics - On quantifying evolutionary importance of protein sites: A tale of two measures
A key challenge in evolutionary biology is the quantification of selective pressure on proteins and other biological macromolecules at single-site resolution. The evolutionary importance of a protein site under purifying selection is typically measured by the degree of conservation of the protein site itself. A possible alternative measure is the strength of the site-induced conservation gradient in the rest of the protein structure. Here, we show that despite major differences, there is a linear relationship between the two measures such that more conserved protein sites also induce stronger conservation gradient in the protein structure. This linear relationship is universal as it holds for different types of proteins and functional sites. Our results show that generally, the selective pressure acting on a functional site percolates through the rest of the protein via residue-residue contacts. Surprisingly however, catalytic sites in enzymes are the principal exception to this rule. Catalytic sites induce significantly stronger conservation gradients in the rest of the protein than expected from the degree of conservation of the site alone. The uniquely stringent requirement for the active site to selectively stabilize the transition state of the catalyzed chemical reaction imposes additional selective constraints on the rest of the enzyme.
09:30 AM - 10:30 AM (EDT)
Keynotes - Machine learning for structural and functional genomics
Recent advances in network and functional genomics have enabled large-scale measurements of molecular interactions, functional activities, and the impact of genetic perturbations. Identifying connections, patterns, and deeper functional annotations from heterogeneous measurements will enhance our capability to predict protein function, discover their roles in biological processes underlying diseases, and develop novel therapeutics. In this talk, I will cover a few machine learning algorithms that interrogate molecular interactions, perturbation screens, structural data, and evolutionary information to understand protein functions. First, I will present a network-based representation learning algorithm that integrates multiple molecular interactomes into compact topological embeddings for protein function inference. I will describe its application to discovering new disease factors and subnetworks from genetic perturbations and variations. Finally, I will present a few deep learning algorithms for protein structure prediction and sequence-to-function mapping. Applications to protein engineering and disease-associated mutation prediction will be introduced.
10:40 AM - 11:00 AM (EDT)
CompMS - Community standard for reporting the experimental design in proteomics experiments: From samples to data files
Sharing of proteomics data within the research community has been greatly facilitated by the development of open data standards such as mzIdentML, mzML, and mzTab. In addition, the ProteomeXchange consortium of proteomics resources has enabled the submission and dissemination of public MS proteomics data worldwide. However, several authors have pointed out the lack of suitable experiment-related metadata in public proteomics datasets, making data reuse by the community difficult. In particular, datasets contain limited sample information about the disease, tissue, cell type, and treatment, among others, and the relationship between the samples and the mass spectrometry (RAW) files is missing in most cases. Here, we present a community standard data model and file format to represent the experimental design of proteomics experiments. The proposed data model is based on the Sample and Data Relationship Format (SDRF), which originates from the metadata file format developed by the transcriptomics community called MAGE-TAB. The aim is to define a set of guidelines to support the annotation of the experimental design in proteomics datasets in the public domain, enabling future integration among multi-omics experiments.
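An SDRF-style table is a tab-separated file in the MAGE-TAB convention, with one row per sample linking its metadata to a raw data file. The minimal sketch below shows the general shape; treat the particular columns and values as an illustrative subset, not the full specification.

```python
import csv
import io

# Illustrative SDRF-style rows: sample metadata on the left, the link to the
# mass spectrometry RAW file on the right (values are invented).
sdrf = (
    "source name\tcharacteristics[organism]\t"
    "characteristics[disease]\tcomment[data file]\n"
    "sample 1\tHomo sapiens\tnormal\trun01.raw\n"
    "sample 2\tHomo sapiens\tmelanoma\trun02.raw\n"
)

rows = list(csv.DictReader(io.StringIO(sdrf), delimiter="\t"))
# each row ties one sample's annotations to one data file
assert rows[1]["characteristics[disease]"] == "melanoma"
assert rows[1]["comment[data file]"] == "run02.raw"
```

Because the format is plain TSV, any standard table reader recovers the sample-to-file mapping that the abstract notes is missing from most public datasets.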
10:40 AM - 11:40 AM (EDT)
Hitseq Keynote: Genetic and epigenetic maps of human centromeric regions
10:40 AM - 10:50 AM (EDT)
BioVis - BioVis Opening
10:40 AM - 11:00 AM (EDT)
3DSIG - Deep Learning of Protein Structural Classes: Any Evidence for an 'Urfold'?
Recent advances in protein structure determination and prediction offer new opportunities to decipher relationships amongst proteins--a task that entails 3D structure comparison and classification. Historically, protein domain classification has been somewhat manual and heuristic. While CATH and related resources represent significant steps towards a more systematic (and automatable) approach, more scalable and objective classification methods, e.g., grounded in machine learning, could be informative. Indeed, comparative analyses of protein structures via Deep Learning (DL), though it may entail large-scale restructuring of classification schemes, could uncover distant relationships. We have developed new DL models for domain structures (including physicochemical properties), focused initially at CATH's homologous superfamily (SF) level. Adopting DL approaches to image classification and segmentation, we have devised and applied a hybrid convolutional autoencoder architecture that allows SF-specific models to learn features that, in a sense, 'define' the various homologous SFs. We quantitatively evaluate pairwise 'distances' between SFs by building one model per SF and comparing the loss functions of the models. Clustering on these distance matrices provides a new view of protein interrelationships--a view that extends beyond simple structural/geometric similarity, towards the realm of structure/function properties, and that is consistent with a recently proposed 'Urfold' concept.
10:40 AM - 11:20 AM (EDT)
NetBio: Network Biology - NetBio Keynote: Network Medicine: From Cellular Networks to the Human Diseasome
Given the functional interdependencies between the molecular components in a human cell, a disease is rarely a consequence of an abnormality in a single gene, but reflects the perturbations of the complex intracellular network. The emerging tools of network medicine offer a platform to explore systematically not only the molecular complexity of a particular disease, leading to the identification of disease modules and pathways, but also the molecular relationships between apparently distinct (patho) phenotypes. Advances in this direction are essential to identify new disease genes, to uncover the biological significance of disease-associated mutations identified by genome-wide association studies and full genome sequencing, and to identify drug targets and biomarkers for complex diseases.
10:40 AM - 11:20 AM (EDT)
TransMed - Transmed Keynote: 20 Challenges of AI in Medicine
The opportunities for using artificial intelligence (AI) and machine learning to improve healthcare are endless. Recent successes include the use of deep learning to identify clinically important patterns in radiographic images. However, there are numerous important challenges which must be addressed before AI can become widely adopted and incorporated into clinical workflows. Some of these challenges come from healthcare data. For example, clinical data from electronic health records (EHR) are notoriously noisy and incomplete. Some of these challenges come from the limitations of AI algorithms. For example, each algorithm looks at data in a different manner. How do you know which are the right methods to employ for a given data set? We will review 10 important clinical data challenges and 10 AI challenges which can impede progress in this area. A number of specific examples and some possible solutions will be provided.
10:40 AM - 11:20 AM (EDT)
MICROBIOME - Microbiome Keynote: Assembling and modelling complex microbiomes mediating host-pathogen interactions
Human and environmental microbial communities mediating host-pathogen interactions often have complex genetic architectures and dynamics. Unravelling these needs new approaches for metagenome assembly at the strain level and microbiome modelling from limited relative abundance profiles. We propose a hybrid assembly framework that leverages long read sequencing to generate high-quality, near-complete strain genomes from complex metagenomes (OPERA-MS [1]). Applying this approach to human and environmental communities enabled recovery of 100s of novel genomes, plasmid and phage sequences, direct analysis of transmission patterns and investigation of antibiotic resistance gene combinations [1, 2]. Furthermore, we show how microbial community dynamics can be modelled accurately from sparse relative abundance data (BEEM [3]), providing insights into pathogen-commensal interactions in skin dermotypes. Data from several studies tracking the transmission of multi-drug resistant pathogens across environmental and human microbiomes will be used to illustrate the utility of these methods. [1] Bertrand et al. “Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes." 2019 Nature Biotechnology 37 (8), 937-944 [2] Chng et al. ”Cartography of opportunistic pathogens and antibiotic resistance genes in a tertiary hospital environment." 2019 BioRxiv, 644740 [3] Li et al. “An expectation-maximization algorithm enables accurate ecological modeling using longitudinal microbiome sequencing data." 2019 Microbiome 7 (1), 1-14
10:40 AM - 11:20 AM (EDT)
Vari - New techniques for tracing mutation spectrum evolution: from linear inverse modeling to yeast reporter assays
Population genetic models provide a unified theoretical framework for interpreting the function and significance of natural genetic variants, which arise as random mutations and persist in populations as a result of natural selection and random genetic drift. Although these models recognize that drift and selection can vary widely between populations with different histories, they often characterize mutagenesis as a simple, uniform process, overlooking the fact that mutagenesis is a trait that can evolve and vary between populations. We can show that the germline mutagenic process appears to vary between populations as well as functional genomic compartments by statistically analyzing the "mutation spectrum," meaning the relative abundance of genetic variation in different contexts such as three-base-pair motifs. Mutation spectra systematically vary between great ape species and even closely related human populations, and we have developed a novel method, MuSHi, to deconvolute this variation into the footprints of mutational processes that have changed in rate over the course of population history. Our computational methodology indicates the presence of mutation spectrum variation within a database of natural yeast strains; for example, a clade of strains known as "mosaic beer yeast" appear to have been accumulating higher rates of C>A mutations than standard laboratory strains. We show that this mutation spectrum difference can be recapitulated in de novo mutations accumulated in a controlled laboratory setting, confirming that this component of natural yeast mutation spectrum variation is likely caused by a genetically encoded mutator allele.
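Tabulating a mutation spectrum from variant calls can be sketched as follows: count substitutions keyed by their three-base context. The ancestral sequence and the (0-based position, derived allele) calls below are invented for illustration.

```python
from collections import Counter

# counts of substitutions keyed by "3-mer context > derived allele"
def spectrum(ancestral, variants):
    counts = Counter()
    for pos, derived in variants:             # pos must be an interior site
        context = ancestral[pos - 1:pos + 2]  # 3-mer centred on the site
        counts[f"{context}>{derived}"] += 1
    return counts

anc = "ACGTACGTAC"
spec = spectrum(anc, [(2, "A"), (5, "A"), (1, "T")])
assert spec["CGT>A"] == 1                     # G->A in a CGT context
assert spec["ACG>A"] + spec["ACG>T"] == 2
```

Comparing such context-keyed count vectors between populations (or between yeast clades, as above for the C>A enrichment) is what reveals mutation spectrum differences.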
10:40 AM - 11:40 AM (EDT)
Bio-Ontologies - Bio-Ontologies Keynote: The crisis of content
Although a plethora of ontologies have been developed in a wide variety of domains, there is often a sense in which it is difficult to measure progress in the field of applied ontology. In some domains there is a mindset that treats ontologies as being as arbitrary as software code, so there is no point in evaluating them, and there cannot possibly be any consensus on which ontologies to use. In other domains, there is an abundance of ontologies but no understanding of their relationships, leading to a perception of continually reinventing the wheel. Far too often, the only criteria for selecting ontologies are political, not technical. If we proceed further down this road, we ultimately risk irrelevance. Against this viewpoint, I would offer an approach to ontology design that focuses on formalizing the intended semantics of an ontology, so that shareability and reusability are guaranteed.
10:45 AM - 11:20 AM (EDT)
RegSys - RegSys KEYNOTE: Detection of functional cis-regulatory variations causal for rare genetic disorders
Despite great expectations and widespread access to whole genome sequencing, definitively causal cis-regulatory sequence variations have remained elusive. Three case studies of cis-regulatory alterations detected using computational approaches will be presented that cause strabismus, glutaminase deficiency, and a novel neurodevelopmental disorder. Each features a distinct mechanism, including a short 4 base-pair deletion, a trinucleotide repeat expansion, and a large structural inversion. A semi-quantitative framework for scoring the evidence supporting candidate cis-regulatory variants is presented, based on a reference collection of cis-regulatory alterations in genes encoding DNA-binding transcription factors. The scoring framework features two sections: one pertaining to the evidence linking the genotype to the phenotype, and one pertaining to the evidence that the sequence variation functionally alters gene regulation.
10:50 AM - 11:40 AM (EDT)
BioVis Keynote: Visualization and Human-AI collaboration for biomedical tasks
11:00 AM - 11:20 AM (EDT)
CompMS - MassIVE.quant: a community resource of curated quantitative mass spectrometry-based proteomics datasets
We present MassIVE.quant (http://massive.ucsd.edu/ProteoSAFe/static/massive-quant.jsp), a tool-independent repository infrastructure and data resource for reproducible quantitative mass spectrometry-based biomedical research. MassIVE.quant is an extension of MassIVE that provides the opportunity for large-scale deposition of heterogeneous experimental datasets and facilitates a community-wide conversation about the necessary extent of experiment documentation and the benefits of its use. It supports various reproducibility scopes, providing the infrastructure to fully automate the workflow and to store and browse intermediate results. First, MassIVE.quant supplements the raw experimental data with detailed annotations of the experimental design, analysis scripts, and results, which enable the quantitative interpretation of MS-based experiments and the online interactive exploration of the results. A branch structure enables viewing and even comparing reanalyses of each experiment with various combinations of methods and tools. Second, the curated alternative workflows can be used for offline and online reanalyses of the data, starting from an intermediate output in MassIVE.quant. As an example, we present the first compilation of more than 40 proteomic datasets from benchmark controlled mixtures and biological investigations, interpreted with various data processing tools and analysis options. Extensive documentation for workflow and data submission, including a video tutorial, is available.
11:00 AM - 11:10 AM (EDT)
3DSIG - EM Map Segmentation and De Novo Protein Structure Modeling for Multiple Chain Complexes with MAINMAST
The significant progress of cryo-electron microscopy (cryo-EM) poses a pressing need for software for the structural interpretation of EM maps. Methods for map segmentation are particularly needed because most modeling methods are designed for building a single protein structure. Here, we developed new software, MAINMASTseg, for segmenting maps with symmetry. Unlike existing segmentation methods that merely consider densities in an input EM map, MAINMASTseg captures underlying molecular structures by constructing a skeleton that connects local dense points in the map. MAINMASTseg performed significantly better than other popular existing methods.
11:10 AM - 11:20 AM (EDT)
3DSIG - Protein Contact Map De-noising Using Generative Adversarial Networks
Protein residue-residue contact prediction from protein sequence information has made substantial progress in the past years and has been a driving force in the protein structure prediction field. In this work, we propose a novel contact map denoising method, ContactGAN, which uses generative adversarial networks (GANs). ContactGAN takes a predicted protein contact map as input and outputs a refined, more accurate contact map. On a test set of 43 protein domains from CASP13, ContactGAN showed an average improvement of 24% in precision values of L/1 long contacts.
11:20 AM - 11:40 AM (EDT)
NetBio: Network Biology - Chromatin network markers of leukemia
Motivation: The structure of chromatin impacts gene expression. Its alteration has been shown to coincide with the occurrence of cancer. A key challenge is in understanding the role of chromatin structure in cellular processes and its implications in diseases. Results: We propose a comparative pipeline to analyze chromatin structures and apply it to study chronic lymphocytic leukemia (CLL). We model the chromatin of the affected and control cells as networks and analyze the network topology by state-of-the-art methods. Our results show that chromatin structures are a rich source of new biological and functional information about DNA elements and cells that can complement protein-protein and co-expression data. Importantly, we show the existence of structural markers of cancer-related DNA elements in the chromatin. Surprisingly, CLL driver genes are characterized by specific local wiring patterns not only in the chromatin structure network of CLL cells, but also of healthy cells. This allows us to successfully predict new CLL-related DNA elements. Importantly, this shows that we can identify cancer-related DNA elements in other cancer types by investigating the chromatin structure network of the healthy cell of origin, a key new insight paving the road to new therapeutic strategies. This gives us an opportunity to exploit chromosome conformation data in healthy cells to predict new drivers.
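As a minimal example of the "local wiring" statistics such a comparative pipeline might compute, the sketch below derives degree and clustering coefficient for nodes of a toy contact network. The edges are invented, and the method's actual graphlet-based topology analysis is richer than these two descriptors.

```python
# toy chromatin-contact network as an undirected edge set
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# clustering coefficient: fraction of a node's neighbour pairs that are
# themselves connected (triangle density around the node)
def clustering(node):
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for x in nbrs for y in nbrs if x < y and y in adj[x])
    return 2 * links / (k * (k - 1))

assert len(adj["c"]) == 3        # degree of hub node "c"
assert clustering("a") == 1.0    # a's neighbours b and c are connected
assert clustering("d") == 0.0    # a degree-1 node has no triangles
```

Feature vectors of such per-node statistics, computed for patient and control networks, are the kind of input from which cancer-related DNA elements can be flagged.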
11:20 AM - 11:30 AM (EDT)
CompMS - CoronaMassKB: an open-access platform for sharing of mass spectrometry data and reanalyses from SARS-CoV-2 and related species
CoronaMassKB is an open-data community resource for sharing mass spectrometry data and (re)analysis results for all experiments pertinent to the global SARS-CoV-2 pandemic. The resource is designed for the rapid exchange of data and results among the global community of scientists working towards understanding the biology of SARS-CoV-2/COVID-19, and thus aims to accelerate the emergence of effective responses to this global pandemic. CoronaMassKB is currently based on reanalysis of >7 million spectra from SARS-CoV-2 datasets and >20 million spectra from related viruses (including SARS-CoV, MERS and H1N1) in public mass spectrometry data available at MassIVE. Searching for modified peptides nearly doubled the number of identifications derived from the same data and revealed >550 SARS-CoV-2 hypermodified peptides with 10+ distinct combinations of modifications, up to over 100 unique modification variants for a single peptide sequence. Overall, thousands of distinctly modified and unmodified peptide variants were identified for 25 SARS-CoV-2 protein products, with thousands of additional variants also identified for the various related viral species. Over 7,000 host proteins were also detected by tens of thousands of peptide variants across several types of experiments, including 339 variants mapped to regulatory regions of 92 drug-target proteins assessed to be interactors of SARS-CoV-2 proteins.
11:20 AM - 11:40 AM (EDT)
Vari - Population-specific VNTR Alleles in the Human Genome
Variable number of tandem repeats (VNTRs) are polymorphic DNA tandem repeat loci in which the number of pattern copies varies across a population. Human minisatellite VNTR loci (with pattern lengths from seven to hundreds of base pairs) have a variety of functional effects (transcription factor binding, RNA splicing) and are associated with disease (neurodegenerative disorders, cancers, Alzheimer’s disease). Despite their importance, relatively few minisatellite VNTRs have been identified and studied in detail. As part of a large survey of VNTR occurrence in over 2,500 human whole genome sequencing samples from the 1000 Genomes Project, we sought to identify population-specific VNTR alleles. We found 5,541 “common” VNTR loci (occurring in ≥ 5% of the samples) and used their alleles to develop a decision tree classification model to predict super-population membership, with 97.81% accuracy. We then identified 1,283 top population-predictive alleles. Finally, we developed a novel ‘Virtual Gel’ illustration showing how alleles differ across populations at population-specific loci. This is the first large-scale study of population-specific VNTRs, and the information obtained could be useful for haplotype inference, studies of human migration and evolution, and accurate use of VNTRs in GWAS.
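The classification step described above, predicting super-population membership from VNTR allele features with a decision tree, can be sketched on fabricated toy data; the loci, copy numbers, and population labels below are invented for illustration and are not the paper's model or data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix: rows are samples, columns are repeat copy numbers
# at three made-up VNTR loci; labels are two fictional super-populations.
X = np.array([[2, 5, 7], [2, 6, 7], [3, 5, 8],
              [9, 1, 2], [8, 1, 3], [9, 2, 2]])
y = ["POP_A", "POP_A", "POP_A", "POP_B", "POP_B", "POP_B"]

# An unconstrained tree separates these cleanly; the paper reports
# 97.81% accuracy on real 1000 Genomes data with cross-validated models.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = float((clf.predict(X) == np.array(y)).mean())
```

In practice one would evaluate on held-out samples rather than the training set, as a tree trivially fits its own training data.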
11:20 AM - 11:40 AM (EDT)
TransMed - Privacy-preserving Construction of Generalized Linear Mixed Model for Biomedical Computation
The Generalized Linear Mixed Model (GLMM) is an extension of the generalized linear model (GLM) in which the linear predictor takes into account random effects. Given its power to precisely model the mixed effects from multiple sources of random variation, the method has been widely used in biomedical computation, for instance in genome-wide association studies (GWAS) that aim to detect genetic variants significantly associated with phenotypes such as human diseases. Collaborative GWAS on large cohorts of patients across multiple institutions is often impeded by the privacy concerns of sharing personal genomic and other health data. To address such concerns, we present in this paper a privacy-preserving Expectation-Maximization (EM) algorithm to build a GLMM collaboratively when input data are distributed across multiple participating parties and cannot be transferred to a central server. We assume that the data are horizontally partitioned among participating parties: i.e., each party holds a subset of records (including observational values of fixed effect variables and their corresponding outcome), and for all records, the outcome is regulated by the same set of known fixed effects and random effects. Our collaborative EM algorithm is mathematically equivalent to the original EM algorithm commonly used in GLMM construction. The algorithm also runs efficiently when tested on simulated and real human genomic data, and thus can be practically used for privacy-preserving GLMM construction.
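The paper's EM algorithm for GLMMs is considerably more involved, but the horizontally partitioned setting it targets can be illustrated with a far simpler model: for ordinary least squares, each party can share only the aggregate statistics XᵀX and Xᵀy, and the server's solution is mathematically identical to fitting on the pooled records. A minimal sketch under that assumption (function names are hypothetical, and this is not the authors' algorithm):

```python
import numpy as np

def local_stats(X, y):
    # Each party computes aggregate sufficient statistics on its own
    # records; individual rows are never transmitted.
    return X.T @ X, X.T @ y

def federated_fit(party_stats):
    # The server sums the statistics and solves the pooled normal
    # equations, recovering exactly the fit on the combined data.
    XtX = sum(s[0] for s in party_stats)
    Xty = sum(s[1] for s in party_stats)
    return np.linalg.solve(XtX, Xty)
```

Real privacy-preserving pipelines add protections (e.g. secure aggregation) even for such summaries, since sufficient statistics can still leak information.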
11:20 AM - 11:40 AM (EDT)
RegSys - Hox binding specificity is directed by DNA sequence preferences and differential abilities to engage inaccessible chromatin
The Hox transcription factors (TFs) bind similar target sequence motifs in vitro, yet they specify distinct cellular fates in the developing embryo. How the Hox TFs achieve the specificity required to drive differential cell fates remains unclear. Here, we induce the expression of several Hox TFs in the uniform cellular context of embryonic stem cell-derived progenitor motor neurons (pMNs). We ask whether Hox TFs bind differentially in pMNs, and whether differential Hox TF binding is determined by sequence preferences or by the ability of Hox TFs to differentially interact with preexisting pMN chromatin. We find that Hox TFs bind both “shared” and “unique” sites in the genome, and sequence preferences are unable to explain the observed diversity in Hox TF binding. To examine the degree to which preexisting chromatin predicts induced Hox binding, we develop a novel bimodal neural network framework that estimates the degree to which sequence and pMN chromatin features predict induced Hox TF binding. Results from our network suggest that the preexisting chromatin environment determines the binding of different Hox TFs to varying degrees. For Hox TFs that bind similar sequences, differential abilities to bind inaccessible chromatin drive differential Hox TF binding.
11:20 AM - 11:30 AM (EDT)
Bio-Ontologies - CORAL: A platform for FAIR, rigorous, self-validated data modeling and integrative, reproducible data analysis
Many organizations face challenges in managing and analyzing data, especially when such data is obtained from multiple sources, created using diverse methods or protocols. Analyzing heterogeneous, structured datasets requires rigorous tracking of their interrelationships and provenance. This task has long been a Grand Challenge of data science, and has more recently been formalized in the FAIR principles: that all data be Findable, Accessible, Interoperable and Reusable, both for machines and for people. Adherence to these principles is necessary for proper stewardship of information, for testing regulatory compliance, for measuring efficiency, and for effectively being able to reuse data analytical frameworks. Interoperability and reusability are especially challenging to implement in practice, to the extent that scientists acknowledge a “reproducibility crisis” across many fields of study. We developed CORAL, a framework for organizing the large diversity of datasets that are generated and used by moderately complex organizations. CORAL features a web interface for bench scientists to upload and explore data, as well as a Jupyter notebook interface for data analysts, both backed by a common API. We describe the CORAL data model and associated tools, and discuss how they greatly facilitate adherence to all four of the FAIR principles.
11:20 AM - 11:30 AM (EDT)
3DSIG - Deep Learning Protein Contacts and Real-valued Distances Using PDNET
As deep learning algorithms drive the progress in protein structure prediction, a lot remains to be studied at this emerging crossroads of deep learning and protein structure prediction. Recent findings show that inter-residue distance prediction, a more granular version of the well-known contact prediction problem, is a key to predicting accurate models. We believe that deep learning methods that predict these distances are still in their infancy. To advance these methods and develop other novel methods, we need a small and representative dataset packaged for fast development and testing. In this work, we introduce Protein Distance Net (PDNET), a dataset derived from the widely used DeepCov dataset consisting of 3456 representative protein chains. It is packaged with all the scripts that were used to curate the dataset and generate the input features and distance maps, along with scripts containing deep learning models for training, validation, and testing. Deep learning models can also be trained and tested in a web browser using free platforms such as Google Colab. We discuss how our framework can be used to predict contacts, distance intervals, and real-valued distances. PDNET is available at https://github.com/ba-lab/pdnet/.
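PDNET's own feature and map generation lives in the linked repository; as a generic illustration of the targets it describes, a real-valued distance map and its binarized contact map can be derived from residue coordinates as follows. The function names are hypothetical; the 8 Å cutoff is the common contact-definition convention, not a detail confirmed by this abstract.

```python
import numpy as np

def distance_map(coords):
    # Pairwise Euclidean distances between residue coordinates (n x 3),
    # e.g. C-alpha or C-beta atoms -> an n x n real-valued distance map.
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def contact_map(coords, thresh=8.0):
    # Binarize: residue pairs closer than the threshold are "in contact".
    return (distance_map(coords) < thresh).astype(int)
```

Distance prediction asks a model to regress the full real-valued map; contact prediction only asks for the thresholded version.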
11:20 AM - 11:30 AM (EDT)
MICROBIOME - Charting the secondary metabolic diversity of 209,211 microbial genomes and metagenome-assembled genomes
Microbial secondary metabolism plays a central role in the community dynamics of the microbiome. The wide arsenal of unique chemical compounds produced by these pathways is used by microbes to gain survival advantages and to interact with their environment. To investigate this metabolism, genome mining of Biosynthetic Gene Clusters (BGCs) acts as a bridge, linking gene sequences to the chemistry of the compounds they produce. With the large, ever-increasing number of genomes and metagenomes being sequenced, a map of biosynthetic diversity across taxa will help us chart our course in natural product discovery and microbial ecology. Here, we introduce BiG-SLiCE, a highly scalable tool for large-scale clustering analyses of BGC data. Using this new tool, we performed a global homology analysis of 1,225,071 BGCs identified from 188,623 microbial isolate genomes and 20,588 previously published metagenome-assembled genomes in roughly 100 hours of wall time on a 36-core CPU. The analysis reveals the true extent of microbial product diversity, showing a high degree of potential novelty, especially from environmental microbes. Furthermore, the collection of GCF models it produced may be used in combination with long-read sequencing technology to perform BGC-based functional metagenomics.
11:30 AM - 11:40 AM (EDT)
CompMS - Phosphopedia 2.0, a modern targeted phosphoproteomics resource
Global mass spectrometry methods are the workhorse of phosphopeptide identification and quantification. Yet, targeted mass spectrometry approaches achieve much higher quantification sensitivity and precision. To facilitate the development of targeted assays, we previously built Phosphopedia, a resource that compiled phosphopeptide identifications from nearly 1000 DDA mass spectrometry runs. However, this implementation was difficult to expand, missing out on the wealth of continuously generated public phosphoproteomic data. Here, we present Phosphopedia 2.0, which extends our database and allows the generation of targeted assays for any phosphosite, regardless of previous observation. This update was achieved through two important developments. First, to enable dynamic database expansion, we have automated data ingestion via the Snakemake workflow management system, enhancing the database with thousands of new DDA mass spectrometry runs. Second, we employ machine learning to model fundamental properties of phosphopeptides necessary for targeted assay development, enabling us to access phosphosites that have not been observed previously. Through straightforward access to public data and phosphopeptide property prediction, we provide the ability to greatly enhance the development of targeted phosphoproteomic assays. In addition, our database and prediction tools can be used to build spectral libraries for the analysis of DIA phosphoproteomic experiments.
11:30 AM - 11:40 AM (EDT)
3DSIG - Redundancy-Weighting the PDB for Detailed Secondary Structure Prediction
The Protein Data Bank (PDB), the ultimate source for data in structural biology, is inherently imbalanced. To alleviate biases, virtually all structural biology studies use non-redundant subsets of the PDB, which include only a fraction of the available data. An alternative approach, dubbed redundancy-weighting, down-weights redundant entries rather than discarding them. This approach may be particularly helpful for Machine Learning (ML) methods that use the PDB as their source for data. Current state-of-the-art methods for Secondary Structure Prediction of proteins (SSP) use non-redundant datasets to predict either 3-letter or 8-letter secondary structure annotations. The current study challenges both choices: the dataset and the alphabet size. Non-redundant datasets are presumably unbiased, but are also inherently small, which limits machine learning performance. On the other hand, the utility of both 3- and 8-letter alphabets is limited by the aggregation of parallel, anti-parallel, and mixed beta-sheets in a single class. Each of these subclasses imposes different structural constraints, which makes the distinction between them desirable. In this study we show improvement in prediction accuracy by training on a redundancy-weighted dataset. Further, we show that the information content is improved by extending the alphabet to consider beta subclasses while hardly affecting SSP accuracy.
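The redundancy-weighting idea, down-weighting rather than discarding, can be sketched generically: if entries are grouped into sequence-similarity clusters, weighting each entry by the inverse of its cluster size makes every cluster contribute equally to a training loss. This is a toy sketch of that principle, not the authors' specific weighting scheme.

```python
from collections import Counter

def redundancy_weights(cluster_ids):
    # One weight per entry: 1 / (size of its sequence-similarity cluster).
    # Summed over a cluster, the weights of its members total 1.0, so a
    # cluster of 50 near-identical chains counts as much as a singleton.
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]
```

These weights would typically be passed as per-sample weights to the training loss, in place of subsetting the PDB to cluster representatives.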
11:30 AM - 11:40 AM (EDT)
MICROBIOME - PIRATE: Phage Identification fRom Assembly-graph varianT Elements
Bacteriophages are viruses that infect and destroy bacteria. As bacteria rapidly evolve to counter the effect of antibiotic drugs, bacteriophages are being explored as complements and alternatives to antibiotics. Identification and characterization of novel phage from sequencing data is critical to achieving this goal, but presents many computational challenges. We developed MetaCarvel (https://github.com/marbl/MetaCarvel), a scaffolding tool that detects assembly graph motifs representative of biologically relevant variants. Some bubble and repeat motifs detected by MetaCarvel represent phage integration events, providing the opportunity for detecting novel phage within microbial communities. Bubbles, indicating genomic insertion/deletion events or strain variants, may contain specialist phage, while repeat elements may capture generalist phage common to multiple closely related bacterial hosts. Our assembly graph based methods were able to detect crAssphage (the first computationally identified phage) within variants in 208 human gut microbiome samples. To identify novel phage in metagenomes, we extracted repeat and bubble contigs (unitigs) that did not share sufficient similarity with known organisms. We clustered contigs with similar genomic content and searched predicted genes from each cluster against the UniProt phage database using BLAST. Multiple clusters contained sequences rich in integrase genes, tail proteins, and tape measure proteins, suggesting these sequences represent genomic fragments from previously uncharacterized phage.
12:00 PM - 12:20 PM (EDT)
HiTSeq: High Throughput Sequencing - The String Decomposition Problem and its Applications to Centromere Analysis and Assembly
Motivation: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. The accurate translation of reads into the monomer alphabet turns centromere assembly into a more tractable problem than the notoriously difficult problem of assembling centromeres in the nucleotide alphabet. Results: We describe a StringDecomposer algorithm for solving this problem, benchmark it on the set of long error-prone reads generated by the Telomere-to-Telomere consortium, and identify a novel (rare) monomer that extends the set of known X-chromosome-specific monomers. Our identification of a novel monomer emphasizes the importance of identifying all (even rare) monomers for future centromere assembly efforts and evolutionary studies. To further analyze novel monomers, we applied StringDecomposer to the set of recently generated long accurate Pacific Biosciences HiFi reads. This analysis revealed that the set of known human monomers and HORs remains incomplete. StringDecomposer opens the possibility of generating a complete set of human monomers and HORs for use in the ongoing efforts to generate the complete assembly of the human genome. Availability: StringDecomposer is available at https://github.com/ablab/stringdecomposer.
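A much-simplified illustration of the String Decomposition Problem (not the StringDecomposer implementation, which uses alignment-based scoring at scale): dynamic programming over read prefixes, where each candidate block is scored by its edit distance to a monomer and the best-cost segmentation is recorded.

```python
def edit(a, b):
    # Classic Levenshtein distance, row-by-row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def decompose(read, monomers, slack=2):
    # dp[i] = (cost, monomer sequence) for the best decomposition of
    # read[:i]; each block's length may deviate from its monomer's
    # length by at most `slack` to absorb indel errors.
    n = len(read)
    dp = [(0, [])] + [(float("inf"), None)] * n
    for i in range(1, n + 1):
        for m in monomers:
            for L in range(max(1, len(m) - slack), len(m) + slack + 1):
                j = i - L
                if j < 0 or dp[j][0] == float("inf"):
                    continue
                c = dp[j][0] + edit(read[j:i], m)
                if c < dp[i][0]:
                    dp[i] = (c, dp[j][1] + [m])
    return dp[n]
```

On real centromeric reads the monomers are ~171 bp alpha-satellite units and the reads are error-prone, so the production tool must be far more efficient than this quadratic toy.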
12:00 PM - 12:20 PM (EDT)
MICROBIOME - MetaBCC-LR: Metagenomics Binning by Coverage and Composition for Long Reads
Motivation: Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyze metagenomic data, binning is considered a crucial step to characterize the different species of microorganisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this paper, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition. Results: We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ~13% improvement in F1-score and ~30% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read based metagenomics analyses to support a wide range of applications. Availability: The source code is freely available at https://github.com/anuradhawick/MetaBCC-LR.
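As a generic illustration of the composition signal such binners use (alongside k-mer coverage histograms), an oligonucleotide frequency vector for a read can be computed as below. This is a standard construction, not MetaBCC-LR's code; the function name is hypothetical.

```python
from itertools import product

def tnf(seq, k=4):
    # Normalized k-mer (by default tetranucleotide) frequency vector:
    # 4**k counts over a fixed k-mer ordering, divided by the number of
    # counted windows. Windows containing non-ACGT characters are skipped.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]
```

Reads from the same genome tend to have similar composition vectors, which is what makes clustering them (together with coverage information) a plausible reference-free binning signal.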
12:00 PM - 12:20 PM (EDT)
Bio-Ontologies - Ontology-based collection and analysis of natural and lab animal hosts of human coronaviruses
SARS-CoV-2 is the pathogen of the COVID-19 disease. It is commonly agreed that SARS-CoV-2 originated from some animal host; however, the exact origin of SARS-CoV-2 remains unclear. The origins of other human coronaviruses, including SARS-CoV and MERS-CoV, are also unclear. This study focuses on the collection, ontological modeling and representation, and analysis of the hosts of various human coronaviruses, with a focus on SARS-CoV-2. Over 20 natural and laboratory animal hosts were found to be capable of hosting human coronaviruses. All the viruses and hosts were classified using the NCBITaxon ontology. The related terms were also imported into the Coronavirus Infectious Disease Ontology (CIDO), and the relations between human coronaviruses and their hosts were linked using an axiom in CIDO. Our ontological classification of all the hosts also allowed us to hypothesize that human coronaviruses only use mammals as their hosts.
12:00 PM - 12:20 PM (EDT)
RegSys - Deep learning models of 422 C2H2 Zinc Finger transcription factor binding profiles reveal alternate combinatorial DNA binding sequence preferences
The C2H2-Zinc Finger family of transcription factors (TFs) constitutes half of all human TFs. These TFs contain multiple zinc-finger (ZF) domains that could combinatorially bind distinct alternate motifs. Their in vivo DNA binding preferences remain largely unexplored due to challenges in profiling and motif discovery. To comprehensively characterize the binding preferences of these TFs, we performed ChIP-seq on 208 C2H2-ZF transcription factors using eGFP-TF fusion constructs in HEK293 cells to complement previous ChIP-seq and ChIP-exo studies of C2H2-ZFs. We trained neural networks (NNs) to learn accurate sequence models of base-resolution binding profiles of 422 C2H2-ZF TFs. We distilled robust motif representations from the NN models. Several TFs exhibited novel alternate DNA binding preferences involving distinct motifs that aligned to different combinations of ZF domains based on the C2H2-ZF B1H “Recognition Code”. We performed Spec-seq experiments to validate variably spaced alternate motifs and novel motifs that differ from B1H-predicted motifs. Our motifs were significantly longer than previously discovered motifs and identified far more ZF domains engaging in binding DNA than previously reported. Our results significantly expand the cis-regulatory lexicon of the human genome.
12:00 PM - 12:20 PM (EDT)
Vari - Multi-omic strategies for transcriptome-wide prediction and association studies
Traditional models for transcriptome-wide association studies (TWAS) consider only single nucleotide polymorphisms (SNPs) local to genes of interest and perform parameter shrinkage with a regularization process. These approaches ignore the effects of distal SNPs and possible mechanisms underlying the SNP-gene association. Here, we outline multi-omic strategies for transcriptome imputation from germline genetics for testing gene-trait associations by prioritizing distal SNPs to the gene of interest. In one extension, we identify mediating biomarkers (CpG sites, microRNAs, and transcription factors) highly associated with gene expression and train predictive models for these mediators using their local SNPs. Imputed values for mediators are then incorporated into the final model as fixed effects, with SNPs local to the gene included as regularized effects. In the second extension, we assess distal-eSNPs for their mediation effect through mediators local to these distal-eSNPs. Highly mediated distal-eSNPs are included in the transcriptomic prediction model. We show considerable gains in prediction of gene expression and TWAS power using simulation analysis and real data applications with TCGA breast cancer and ROS/MAP brain data. This integrative approach to transcriptome-wide imputation and association studies aids in understanding the complex interactions underlying genetic regulation within a tissue and in identifying important risk genes for various traits.
12:00 PM - 12:20 PM (EDT)
TransMed - Longitudinal multi-omics profiling reveals two biological seasonal patterns in California
The influence of seasons on biological processes, particularly at a molecular level, is poorly understood. Moreover, seasons are arbitrarily defined based on four equal segments in the calendar year. In order to identify biological seasonal patterns in humans based on diverse molecular data, rather than calendar dates, we leveraged the power of longitudinal multi-omics data from a deep-profiling cohort of 105 individuals. These individuals underwent intensive clinical measures and emerging omics profiling technologies, including transcriptome, proteome, metabolome, and cytokinome as well as gut and nasal microbiome monitoring, for up to four years. We identified more than 1000 seasonal variations in omics analytes and clinical measures, including molecular and microbial markers with known seasonal changes as well as new molecular and microbial markers with seasonal fluctuations. The different molecules grouped into two major seasonal patterns which correlate with peaks in late spring and late fall/early winter in the San Francisco Bay Area. Lastly, we used our recently developed omics longitudinal differential analysis method, OmicsLonDA, to identify molecules and microbes that demonstrated different seasonal patterns in insulin-sensitive and insulin-resistant individuals. These insights have important implications for human health, and our methodological framework can be applied to any geographical location.
12:00 PM - 12:10 PM (EDT)
BioVis - bio_embeddings: python pipeline for fast visualization of protein features extracted by language models
With high-throughput sequencing, quick insight into data separability for custom sequence datasets is desired to focus experiments on promising candidates. Recently, language models (LMs) have been adapted from use in natural language to work with protein sequences instead. Protein LMs show enormous potential in generating descriptive features for proteins from just their sequences at a fraction of the time of previous approaches. Protein LMs convert amino acid sequences into embeddings that can be used in combination with dimensionality reduction techniques (e.g. UMAP) to quickly span and visualize protein spaces (e.g. via scatter plots). On 3D scatter plots, proteins can be annotated with known properties to visually gain an intuition about the separability of data, even prior to training supervised models. Additionally, conclusions can be drawn about proteins without annotation by putting them into the context of labelled proteins. The bio_embeddings pipeline offers an interface to simply and quickly embed large protein sets using protein LMs, to project the embeddings into lower-dimensional spaces, and to visualize proteins in these spaces on interactive 3D scatter plots. The pipeline is accompanied by a web server that visualizes small protein datasets without the need to install software.
12:00 PM - 12:20 PM (EDT)
CompMS - Deep multiple instance learning classifies subtissue locations in mass spectrometry images from tissue-level annotations
Motivation: Mass spectrometry imaging (MSI) characterizes the molecular composition of tissues at spatial resolution, and has strong potential for distinguishing tissue types or disease states. This can be achieved by supervised classification, which takes MSI spectra as input and assigns class labels to subtissue locations. Unfortunately, developing such classifiers is hindered by the limited availability of training sets with subtissue labels as the ground truth. Subtissue labeling is prohibitively expensive, and only rough annotations of the entire tissues are typically available. Classifiers trained on data with approximate labels have sub-optimal performance. Results: To alleviate this challenge, we contribute a semi-supervised approach, mi-CNN. mi-CNN implements multiple instance learning with a convolutional neural network (CNN). The multiple-instance aspect enables weak supervision from tissue-level annotations when classifying subtissue locations. The convolutional architecture of the CNN captures contextual dependencies between the spectral features. Evaluations on simulated and experimental datasets demonstrated that mi-CNN improved the subtissue classification as compared to traditional classifiers. We propose mi-CNN as an important step towards accurate subtissue classification in MSI, enabling rapid distinction between tissue types and disease states.
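The multiple-instance assumption underlying this kind of weak supervision, namely that a tissue (bag) is positive if at least one of its subtissue spectra (instances) is positive, can be sketched with a max-pooled instance scorer. mi-CNN uses a convolutional network; the toy linear scorer below only illustrates the bag-level pooling, and all names are hypothetical.

```python
import numpy as np

def instance_scores(X, w, b):
    # Per-spectrum linear scorer squashed to a probability (a stand-in
    # for the CNN's per-location output).
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def bag_probability(X, w, b):
    # Multiple-instance pooling: the bag is as positive as its most
    # positive instance, so only tissue-level labels are needed to train.
    return float(instance_scores(X, w, b).max())
```

During training, the loss is computed on `bag_probability` against the tissue label, yet at prediction time `instance_scores` yields per-location (subtissue) classifications.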
12:00 PM - 12:40 PM (EDT)
3DSIG - 3DSIG Keynote: Thinking Deeply About Protein Structure Prediction
In this talk I will give an overview of the astonishing recent progress in protein structure prediction that has arisen from better modelling of amino acid covariation effects and, most recently, the application of deep neural networks to the problem, e.g. with DeepMind's AlphaFold. Bringing things up to the present day, I will discuss some of the recent method developments we have been experimenting with in my own lab, and discuss how close we are to being able to model the structure of the entire complement of proteins encoded by a bacterial genome, and the interactions between them. I will finish by discussing the future prospects for these techniques and some of the current limitations, along with some intriguing new methods that are coming down the line that might be able to take us further.
12:00 PM - 12:20 PM (EDT)
NetBio: Network Biology - Prediction of cancer driver genes through network-based moment propagation of mutation scores
Motivation: Gaining a comprehensive understanding of the genetics underlying cancer development and progression is a central goal of biomedical research. Its accomplishment promises key mechanistic, diagnostic and therapeutic insights. One major step in this direction is the identification of genes that drive the emergence of tumors upon mutation. Recent advances in the field of computational biology have shown the potential of combining genetic summary statistics that represent the mutational burden in genes with biological networks, such as protein-protein interaction networks, to identify cancer driver genes. Those approaches superimpose the summary statistics on the nodes in the network, followed by an unsupervised propagation of the node scores through the network. However, this unsupervised setting does not leverage any knowledge of well-established cancer genes, a potentially valuable resource for improving the identification of novel cancer drivers. Results: We develop a novel node embedding that enables classification of cancer driver genes in a supervised setting. The embedding combines a representation of the mutation score distribution in a node’s local neighborhood with network propagation. We leverage the knowledge of well-established cancer driver genes to define a positive class, resulting in a partially labeled data set, and develop a cross-validation scheme to enable supervised prediction. The proposed node embedding followed by a supervised classification improves the predictive performance compared to baseline methods, and yields a set of promising genes that constitute candidates for further biological validation.
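The core of the embedding described above, summarizing the mutation-score distribution in a node's neighborhood by its moments, can be sketched generically. The adjacency representation, feature choice (own score plus neighbor mean and variance), and function name below are illustrative, not the paper's code.

```python
def moment_embedding(adj, scores):
    # adj: dict mapping each gene to a list of its network neighbors.
    # scores: dict mapping each gene to its mutation score.
    # Per-node feature: (own score, mean of neighbor scores, variance of
    # neighbor scores) -- the first two moments of the local distribution.
    emb = {}
    for node, nbrs in adj.items():
        vals = [scores[n] for n in nbrs]
        if not vals:
            emb[node] = (scores[node], 0.0, 0.0)
            continue
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        emb[node] = (scores[node], mean, var)
    return emb
```

Such per-node vectors (extended, e.g., to multi-hop neighborhoods or propagated scores) can then feed a standard supervised classifier with known driver genes as the positive class.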
12:10 PM - 12:20 PM (EDT)
BioVis - Grid-Constrained Dimensionality-Reduction for Single-Cell RNA-Seq Summarization
Single-cell transcriptomics has become an increasingly common technique for understanding complex biological systems. Because such experiments generate high-dimensional data, dimensionality-reduction algorithms such as principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) are used to project the high-dimensional findings into a lower-dimensional space. As the outputs of these methods are commonly presented as scatterplots, they often suffer from overplotting, making nearby points difficult to distinguish. Since the spacing between points can be non-uniform, estimation of the relative proportions of cells in each cluster is not practical. Lack of consistent boundaries between points can also make it challenging to interpret cluster membership along with additional parameters, such as gene expression, in the same plot. Here we present a complementary method to these clustering techniques for single-cell data. By using linear assignment to map these projections to an appropriately sized grid, it becomes possible to preserve overall cell-cell relationships in space, while condensing the space used for the figure, avoiding overplotting, and allowing for easy boundary annotation that permits overlay of additional data.
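Mapping a 2-D embedding onto a grid via linear assignment can be sketched with SciPy's assignment solver; this is a generic illustration of the technique named in the abstract, not the authors' tool, and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def snap_to_grid(points, rows, cols):
    # Assign each 2-D embedding point to a unique grid cell so that the
    # total squared displacement is minimized (a linear assignment
    # problem). Requires len(points) <= rows * cols.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    cost = ((points[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    _, cells = linear_sum_assignment(cost)
    return grid[cells]
```

Because every point lands on its own cell, overplotting disappears and each cluster's cell count directly reflects its relative size, at the price of distorting exact inter-point distances.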
12:20 PM - 12:40 PM (EDT)
Bio-Ontologies - CIDO Diagnosis: COVID-19 diagnosis modeling, representation and analysis using the Coronavirus Infectious Disease Ontology
Diagnosis of COVID-19 is critical to the control of the COVID-19 pandemic. Common diagnostic methods include symptom identification, chest imaging, serological testing, and RT-PCR. However, the sensitivity and specificity of different diagnostic methods differ. In this study, we ontologically represent different aspects of COVID-19 diagnosis using the community-based Coronavirus Infectious Disease Ontology (CIDO), an OBO Foundry library ontology. CIDO includes many new terms and also imports many relevant terms from existing ontologies. The high-level hierarchy and design pattern of CIDO are introduced to support COVID-19 diagnosis. Knowledge reported in the literature and in reliable resources such as the FDA website is ontologically represented. We modeled and compared over 20 SARS-CoV-2 RT-PCR assays, which target different gene markers in SARS-CoV-2. The sensitivity and specificity of different methods are discussed.
12:20 PM - 12:30 PM (EDT)
TransMed - PRECISE+ predicts drug response in patients by non-linear subspace based transfer from cell lines and PDX models
Pre-clinical models have been used extensively to understand the molecular underpinnings of cancer. Cell lines and Patient-Derived Xenografts (PDX) are amenable to screening for a wide range of anti-cancer therapeutics. These screens offer a direct measure of drug response for many drugs – data that cannot be collected for human tumors. Pre-clinical models do, however, show behavioral discrepancies with respect to human tumors that impede the transfer of biomarkers of drug response from pre-clinical models to patients. We present a novel framework for integrating omics data derived from pre-clinical models and tumors. Our approach employs non-linear dimensionality reduction to capture complex genetic interaction patterns that are common to pre-clinical models and humans. These patterns are then used to train a drug response predictor on human tumor data. This work extends PRECISE to allow incorporation of non-linear similarity measures between samples while retaining equivalence to the linear setting.
12:20 PM - 12:40 PM (EDT)
NetBio: Network Biology - Network-principled deep generative models for designing drug combinations as graph sets
Motivation: Combination therapy has been shown to improve therapeutic efficacy while reducing side effects. Importantly, it has become an indispensable strategy to overcome resistance to antibiotics, anti-microbials, and anti-cancer drugs. Facing an enormous chemical space and unclear design principles for small-molecule combinations, computational drug-combination design has yet to see generative models that meet its potential to accelerate resistance-overcoming drug combination discovery. Results: We have developed the first deep generative model for drug combination design, by jointly embedding graph-structured domain knowledge and iteratively training a reinforcement learning-based chemical graph-set designer. First, we developed Hierarchical Variational Graph Auto-Encoders (HVGAE) trained end-to-end to jointly embed gene-gene, disease-disease and gene-disease networks. Novel attentional pooling is introduced for learning disease representations from the representations of associated genes. Second, targeting diseases in the learned representations, we recast the drug-combination design problem as graph-set generation and developed a deep learning-based model with novel rewards. Specifically, besides chemical validity rewards, we introduced a novel generative adversarial reward, based on the generalized sliced Wasserstein distance, to generate molecules that are chemically diverse yet distributionally similar to known drug-like compounds or drugs. We also designed a network-principle-based reward for drug combinations. Numerical results indicate that, compared to state-of-the-art graph embedding methods, HVGAE learns more generalizable and informative disease representations in disease-disease graph reconstruction. Results also show that the deep generative model generates drug combinations following the network-based principle across diseases.
A case study on melanoma shows that generated drug combinations collectively cover the disease module similarly to FDA-approved drug combinations and could also suggest promising novel systems-pharmacology strategies. Our method allows for examining and following a network-based principle or hypothesis to efficiently generate disease-specific drug combinations in a vast chemical combinatorial space.
12:20 PM - 12:40 PM (EDT)
RegSys - Fully Interpretable Deep Learning Model of Transcriptional Control
The universal expressibility assumption of Deep Neural Networks (DNNs) is the key motivation behind recent work in the systems biology community to employ DNNs to solve important problems in functional genomics and molecular genetics. Typically, such investigations have taken a "black box" approach in which the internal structure of the model is set purely by machine learning considerations, with little attempt to represent the internal structure of the biological system by the mathematical structure of the DNN. DNNs have not yet been applied to the detailed modeling of transcriptional control, in which mRNA production is controlled by the binding of specific transcription factors to DNA, in part because such models are formulated in terms of specific chemical equations that appear different in form from those used in neural networks. In this paper, we give an example of a DNN which can model the detailed control of transcription in a precise and predictive manner. Its internal structure is fully interpretable and is faithful to the underlying chemistry of transcription factor binding to DNA. We derive our DNN from a systems biology model that was not previously recognized as having a DNN structure. Although we apply our DNN to data from the early embryo of the fruit fly Drosophila, this system serves as a testbed for analysis of much larger data sets obtained by systems biology studies on a genomic scale.
12:20 PM - 12:40 PM (EDT)
HiTSeq: High Throughput Sequencing - Weighted minimizer sampling improves long read mapping
In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they are guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g., Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome in order to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions. We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while taking into account a weight for each k-mer; i.e., the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches, and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long read mapper, Minimap2. Our results demonstrate a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.
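The down-weighting idea can be illustrated with a toy sketch: within each window of w consecutive k-mers, pick the k-mer with the smallest hash, but inflate the hashes of flagged high-frequency k-mers so they are only chosen when no alternative exists. The hash and weighting scheme below are simplifications for illustration, not Winnowmap's actual algorithm.

```python
# Toy weighted minimizer sampling: frequent ("heavy") k-mers get a large
# hash offset, making them lose the per-window minimum to any ordinary k-mer.
import hashlib

HEAVY_OFFSET = 1 << 62

def khash(kmer, heavy):
    """Hash a k-mer; k-mers flagged as frequent get a large offset."""
    h = int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
    h %= HEAVY_OFFSET
    return h + HEAVY_OFFSET if kmer in heavy else h

def weighted_minimizers(seq, k, w, heavy=frozenset()):
    """Return positions of the window minimizers of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picks = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        picks.add(min(window, key=lambda i: khash(kmers[i], heavy)))
    return sorted(picks)

seq = "ACGTACGTTTGGACGTAACC"
picks = weighted_minimizers(seq, k=4, w=5, heavy={"ACGT"})
print(picks)
```

Because every window here contains at least one non-repetitive k-mer, the repetitive "ACGT" is never sampled, while each window still contributes a minimizer, which is the pair of objectives the abstract states.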
12:20 PM - 12:30 PM (EDT)
CompMS - Mass spectrometry imaging in the age of reproducible medical science
Mass spectrometry imaging (MSI) has great potential for a variety of clinical research areas including pharmacology, diagnostics and personalized medicine. MSI data analysis remains challenging due to the large and complex data generated by the measurement of hundreds of analytes in thousands of tissue locations. Reproducibility of published research is limited due to extensive use of proprietary software and in-house scripts. Existing open-source software that paves the way for reproducible data analysis necessitates steep learning curves for scientists without programming knowledge. Therefore, we have integrated 18 MSI tools into the Galaxy framework (https://usegalaxy.eu) to allow easily accessible data analysis with high levels of reproducibility and transparency. The tools are based on Cardinal, MALDIquant and scikit-image, enabling all major MSI analysis steps from quality control to image co-registration, preprocessing and statistical analysis. We successfully applied the MSI tools in combination with other proteomics and metabolomics Galaxy tools to analyze a publicly available N-linked glycan imaging dataset, as well as in-house peptide imaging cancer datasets. Furthermore, we created hands-on training material for use cases in proteomics and metabolomics and provide a Docker container for a fully functional analysis platform in a closed network situation, such as in clinical settings.
12:20 PM - 12:30 PM (EDT)
Vari - Mining next-generation genome sequencing data for genetic diversity assessment of eastern Africa finger millet blast fungus
Finger millet (Eleusine coracana) is a key staple crop in eastern Africa cultivated mainly by small-holder farmers. It can withstand high temperatures, salinity, drought stress and low soil fertility. Typically, it yields only one-third of its genetic potential of 6 tons per hectare due to the use of unimproved varieties that are regularly affected by finger millet blast disease along with other stresses. Blast disease is caused by Magnaporthe oryzae, a host-specific complex species that affects different grasses including rice and wheat. While many efforts have been directed towards characterizing rice blast, the genetic diversity, specificity and virulence of finger millet blast remain poorly understood. To address this, we sequenced 224 blast isolates from Kenya, Tanzania, Uganda and Ethiopia using Illumina sequencing. One blast isolate, E2, was sequenced using a combination of PacBio and Illumina technologies. A reference genome assembly was generated for E2, and the reads of the resequenced isolates were mapped to it. Variant calling identified 195,705 SNPs. Cluster analysis for diversity assessment was then conducted using STRUCTURE, PCA and phylogenetic analysis, and the findings are presented here. This information will enhance the existing knowledge of the genetic diversity of the blast fungus.
12:20 PM - 12:30 PM (EDT)
BioVis - RNA-Scoop: interactive visualization of isoforms in single-cell transcriptomes
Isoform detection and discovery at single-cell resolution are central to improving our understanding of heterogeneity in organs and tissues, and visualization tools would be instrumental in exploring this heterogeneity. However, current interactive transcriptome visualization tools are designed for bulk RNA-Seq data and have limited utility in analyzing single-cell RNA-seq data. Here, we introduce RNA-Scoop, a visualization tool for single-cell transcriptomics. The input of RNA-Scoop is a single JSON file, which specifies the paths to a GTF file containing the isoforms of interest, a matrix file containing their expression levels in each cell, and files containing labels for the matrix rows and columns. Users can select genes for the isoform view, where all isoforms of selected genes are displayed. A t-SNE plot allows users to zoom in and out of different areas and select cells via lasso selection. Upon selection, displayed isoforms are colored according to their average level of expression in the selected cells. Expression per cluster is visualized through a dot plot. Additionally, isoforms are selectable, enabling users to highlight the cells in which isoforms of interest are expressed. Through these easy-to-use features, RNA-Scoop simplifies the interrogation of isoforms and cell types in thousands of cells.
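The single-JSON input the abstract describes might look roughly like the following. The key names here are hypothetical placeholders, since the abstract specifies only which files the JSON must reference, not the actual schema.

```python
# Sketch of a plausible RNA-Scoop-style input file. All key names are
# illustrative guesses; consult the tool's documentation for the real schema.
import json

config = {
    "gtf": "isoforms.gtf",              # GTF with the isoforms of interest
    "matrix": "expression_matrix.tsv",  # isoform-by-cell expression levels
    "isoform_labels": "rows.txt",       # labels for the matrix rows
    "cell_labels": "cols.txt",          # labels for the matrix columns
}

text = json.dumps(config, indent=2)
print(text)
```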
12:20 PM - 12:30 PM (EDT)
MICROBIOME - Assembly graph-based variant discovery reveals novel dynamics in the human microbiome
Sequence variation within metagenomes reveals important information about the structure, function, and evolution of microbial communities. However, most existing methods for variant detection are reference-dependent and are limited to identifying single nucleotide polymorphisms (SNPs), missing more complex structural changes. We developed MetaCarvel (​https://github.com/marbl/MetaCarvel​), a reference-independent tool that incorporates paired-end read information to link together contigs into confident scaffolds and detects a rich set of graph signatures indicative of biologically-relevant variants. We applied MetaCarvel to almost 1,000 metagenomes from the Human Microbiome Project and identified over nine million variants representing insertion/deletion events, complex strain differences, plasmids, and repeats. The majority of identified variants were repeats, some corresponding to mobile genetic elements. Our analysis revealed striking differences in the rate of variation across body sites, highlighting niche-specific mechanisms of bacterial adaptation. We identified more indels and strain variants in the oral cavity than in the comparatively nutrient-rich gut. In particular, we highlight a ​Streptococcus​ variant from neighboring sites in the oral cavity suggesting that, despite their close proximity, bacteria within each microenvironment utilize unique approaches for effective colonization. This work highlights the utility of using graph-based variant detection to capture biologically significant signals in microbial populations.
12:30 PM - 12:40 PM (EDT)
MICROBIOME - Meta-NanoSim: metagenome simulator for nanopore reads
As a long-read sequencing technique, Oxford Nanopore Technology (ONT) has shown unprecedented potential in metagenomic studies. However, the challenges associated with ONT reads, such as high error rates and non-uniform error distributions, necessitate analytical tools designed specifically for long reads. To facilitate the development and benchmarking of such tools, simulated datasets with known ground truth are desirable. Here, we present Meta-NanoSim, a fast and lightweight ONT read simulator that characterizes and simulates the unique properties of ONT metagenomes, including abundance levels, chimeric reads, and reads that span both ends of a circular genome. Provided with error and abundance profiles learned from an experimental dataset, Meta-NanoSim generates multi-sample, multi-replicate metagenome datasets that simulate microbial communities with both circular and linear genomes. To demonstrate its performance, we train Meta-NanoSim with two mock microbial community standards and compare the simulation results against state-of-the-art tools. Further, we showcase the application of Meta-NanoSim by benchmarking ONT metagenome assemblers on our simulated datasets. Gold standards provided by Meta-NanoSim will facilitate the development of algorithms and pipelines in metagenomics, including functional gene prediction, species detection, comparative metagenomics, and clinical diagnosis. As such, we expect Meta-NanoSim to have an enabling role in the field.
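The abundance-driven sampling step, including wrap-around reads for circular genomes, can be sketched as below. This is an illustrative toy only; Meta-NanoSim's actual models (error profiles, chimeras, length distributions) are far richer.

```python
# Toy metagenome read sampler: pick a source genome per read according to
# the abundance profile, then draw a read start position, wrapping around
# the origin for circular genomes.
import random

def simulate_reads(genomes, abundances, circular, n_reads, read_len, seed=1):
    rng = random.Random(seed)
    names = list(genomes)
    weights = [abundances[n] for n in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights)[0]
        g = genomes[name]
        if circular[name]:
            start = rng.randrange(len(g))
            doubled = g + g                       # wrap across the origin
            reads.append((name, doubled[start:start + read_len]))
        else:
            start = rng.randrange(max(1, len(g) - read_len + 1))
            reads.append((name, g[start:start + read_len]))
    return reads

genomes = {"plasmid": "ACGT" * 50, "chrom": "TTGGCCAA" * 100}
reads = simulate_reads(genomes, {"plasmid": 0.2, "chrom": 0.8},
                       {"plasmid": True, "chrom": False},
                       n_reads=200, read_len=30)
```

Doubling the circular sequence is a standard trick so that a read starting near the end of the string still spans the origin, mirroring the "reads that span both ends of a circular genome" property the abstract mentions.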
12:30 PM - 12:40 PM (EDT)
Vari - Unified inference of missense variant effects and gene constraints in the human genome
A challenge in genomics is to identify variants and genes associated with severe genetic disorders. Several statistical methods have been developed to predict pathogenic variants or constrained genes based on the signatures of negative selection in human populations. However, we currently lack a statistical framework to jointly predict deleterious variants and constrained genes from both variant-level features and gene-level selective constraints. Here we present such a unified approach, UNEECON, based on deep learning and population genetics. UNEECON treats the contributions of variant-level features and gene-level constraints as a variant-level fixed effect and a gene-level random effect, respectively. The sum of the fixed and random effects is then combined with an evolutionary model to infer the strength of negative selection at both variant and gene levels. Compared with previously published methods, UNEECON shows unmatched performance in predicting missense variants and protein-coding genes associated with autosomal dominant disorders. Furthermore, based on UNEECON, we observe an unexpectedly low correlation between gene-level intolerance to missense mutations and that to loss-of-function mutations. Finally, we show that genes intolerant to both missense and loss-of-function mutations play key roles in autism. Overall, UNEECON is a promising framework for both variant and gene prioritization.
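The mixed-effects structure the abstract describes — a variant-level fixed effect plus a gene-level random effect — can be illustrated schematically. The tiny linear model below is purely for illustration; UNEECON's actual fixed effect is a deep network, and its evolutionary model is more involved than a sigmoid.

```python
# Schematic mixed-effects score: each variant's predicted strength of
# negative selection combines a fixed effect from variant-level features
# with a random effect shared by all variants in the same gene.
import numpy as np

rng = np.random.default_rng(0)
n_variants, n_features, n_genes = 8, 4, 3

X = rng.normal(size=(n_variants, n_features))   # variant-level features
gene_of = rng.integers(0, n_genes, n_variants)  # gene each variant falls in
beta = rng.normal(size=n_features)              # fixed-effect weights
u = rng.normal(scale=0.5, size=n_genes)         # gene-level random effects

eta = X @ beta + u[gene_of]                     # fixed effect + random effect
selection = 1.0 / (1.0 + np.exp(-eta))          # squash to (0, 1)
print(selection.round(2))
```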
12:30 PM - 12:40 PM (EDT)
TransMed: Q&A
12:30 PM - 12:40 PM (EDT)
CompMS - Combining Information from Crosslinks and Monolinks in the Modelling of Protein Structures
We have developed a computational approach for getting the most out of an XLMS dataset in the context of modelling protein structures by using both crosslink and monolink data. Monolinks are a by-product of a Chemical Crosslinking Mass Spectrometry (XLMS) experiment. They convey residue exposure information and are more abundant in an XLMS dataset than crosslinks, which makes them a useful source of structural information. However, they are rarely used in structural modelling. We have devised the Monolink Depth Score (MoDS), a scoring function for ranking protein structure models from monolink information. Using simulated and reprocessed experimental data from the Proteomic Identification Database, we compare the performance of MoDS to the Matched and Non-Accessible Crosslink (MNXL) score, which we previously devised to score data from crosslinks. Our results show that MNXL only marginally outperforms MoDS, and that MoDS is an effective tool for scoring model structures. Furthermore, combining MoDS and MNXL into the Crosslink Monolink (XLMO) score improves performance above that of both MoDS and MNXL. To make our software easily accessible to the community, we created Crosslink Modelling Tools (XLM-Tools), a Python program for scoring protein structure models using crosslinks and monolinks.
12:30 PM - 12:40 PM (EDT)
BioVis - ImaCytE: Visual Exploration of Cellular Micro-environments for Imaging Mass Cytometry Data
Tissue functionality is determined by the characteristics of tissue-resident cells and their interactions within their microenvironment. Imaging Mass Cytometry offers the opportunity to distinguish cell types with high precision and link them to their spatial location in intact tissues at sub-cellular resolution. This technology produces large amounts of spatially-resolved high-dimensional data, which constitutes a serious challenge for data analysis. We present an interactive visual analysis workflow for the end-to-end analysis of Imaging Mass Cytometry data that was developed in close collaboration with domain expert partners. We implemented the presented workflow in an interactive visual analysis tool, ImaCytE. Our workflow is designed to allow the user to discriminate cell types according to their protein expression profiles and analyze their cellular microenvironments, aiding in the formulation or verification of hypotheses on tissue architecture and function. Finally, we show the effectiveness of our workflow and ImaCytE through a case study performed by a collaborating specialist. ImaCytE is open source and the code and binaries are available at https://github.com/biovault/ImaCytE.
12:40 PM - 01:00 PM (EDT)
HiTSeq: High Throughput Sequencing - TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats
Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. To address these problems, we developed the TandemTools package, which includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and assessing their quality. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres.
01:00 PM - 01:30 PM (EDT)
Lunch & Learn - The Black Women in Computational Biology Network
02:00 PM - 03:00 PM (EDT)
BioVis Keynote: Machine Learning for Drug Repurposing
02:00 PM - 02:10 PM (EDT)
3DSIG - De Novo Protein Design for Novel Folds with Guided & Conditional Wasserstein GAN
Facing rapidly accumulating data on protein sequence and structure, this study addresses the following question: to what extent can current data alone reveal deep insights into the sequence-structure relationship, such that new sequences can be designed accordingly for novel structure folds? We have developed novel deep generative models, constructed a low-dimensional representation of fold space, exploited sequence data with and without paired structures, and developed an ultra-fast fold predictor (oracle) as feedback. The resulting semi-supervised gcWGAN (guided & conditional Wasserstein Generative Adversarial Network), assessed by the oracle over 100 novel folds not in the training set, achieves higher yields and covers 3.5 times more target folds compared to a competing data-driven method (cVAE). gcWGAN designs are predicted to be physically and biologically sound. Targeting representative novel folds, including one not even part of the basis folds, gcWGAN designs are predicted by Rosetta to have comparable or better fold accuracy, yet much greater sequence diversity and sometimes novelty. The ultra-fast data-driven model is shown to boost the success of principle-driven Rosetta for de novo design by generating design seeds and tailoring the design space. In conclusion, gcWGAN explores uncharted sequence space to design proteins by learning from current sequence-structure data.
02:00 PM - 02:20 PM (EDT)
HiTSeq: High Throughput Sequencing - PopDel detects large deletions jointly in tens of thousands of genomes
Catalogs of genetic variation for large numbers of individuals are a foundation for modern research on human diversity and disease. Creating such catalogs for small variants from whole-genome sequencing (WGS) data is now commonly done for thousands of individuals collectively. We have transferred this joint calling idea from SNPs and indels to larger deletions and developed the first joint calling tool, PopDel, that can detect and genotype deletions in WGS data of tens of thousands of individuals simultaneously as demonstrated by our evaluation on data of up to 49,962 human genomes. Good sensitivity, precision and the correctness of genotypes are demonstrated by extensive tests on simulated and real data and comparison to other state-of-the-art SV-callers. PopDel detects deletions in HG002 and NA12878 with high sensitivity while maintaining a low false positive rate as shown by our comparison with different high-confidence reference sets. On data of up to 6,794 trios, inheritance patterns are in concordance with Mendelian inheritance rules and exhibit a close to ideal transmission rate. PopDel reliably reports common, rare and de novo deletions. Therefore, PopDel enables routine scans for deletions in large-scale sequencing studies and we are currently in the process of implementing the detection of other SV-types.
02:00 PM - 02:20 PM (EDT)
Bio-Ontologies - Modeling quantitative traits for COVID-19 case reports
Medical practitioners record the condition status of a patient through qualitative and quantitative observations. The measurement of vital signs and molecular parameters in the clinic gives a complementary description of abnormal phenotypes associated with the progression of a disease. The Clinical Measurement Ontology (CMO) is used to standardize annotations of these measurable traits. However, researchers have no way to describe how these quantitative traits relate to phenotype concepts in a machine-readable manner. Using the WHO clinical case report form standard for the COVID-19 pandemic, we modeled quantitative traits and developed OWL axioms to formally relate clinical measurement terms with anatomical and biomolecular entities and phenotypes annotated with the Uber-anatomy ontology (Uberon), Chemical Entities of Biological Interest (ChEBI) and the Phenotype and Trait Ontology (PATO) biomedical ontologies. The formal description of these relations allows interoperability between clinical and biological descriptions, and facilitates automated reasoning for analysis of patterns over quantitative and qualitative biomedical observations.
02:00 PM - 02:20 PM (EDT)
TransMed - Robust and accurate deconvolution of tumor populations uncovers evolutionary mechanisms of breast cancer metastasis
Motivation: Cancer develops and progresses through a clonal evolutionary process. Understanding progression to metastasis is of particular clinical importance, but is not easily analyzed by recent methods because it generally requires studying samples gathered years apart, for which modern single-cell genomics is rarely an option. Understanding clonal evolution in the metastatic transition thus still depends on unmixing tumor subpopulations from bulk genomic data. Methods: We develop a method for progression inference from bulk transcriptomic data of paired primary and metastatic samples. We develop a novel toolkit, the Robust and Accurate Deconvolution (RAD) method, to deconvolve biologically meaningful tumor populations from multiple transcriptomic samples spanning distinct progression states. RAD employs a hybrid optimizer to achieve an accurate solution, and a gene module representation to mitigate considerable noise in RNA data. Finally, we apply phylogenetic methods to infer how associated cell populations adapt across the metastatic transition via changes in expression programs and cell-type composition. Results: We demonstrated the superior robustness and accuracy of RAD over other algorithms on a real dataset, and validated the effectiveness of gene module compression on both simulated and real bulk RNA data. We further applied the methods to a breast cancer metastasis dataset and discovered common early events that promote tumor progression and migration to different metastatic sites, such as dysregulation of the ECM-receptor, focal adhesion, and PI3K-Akt pathways.
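The bulk-deconvolution setting this abstract works in can be illustrated generically: model the bulk expression matrix B (genes x samples) as a product of population expression signatures and per-sample population fractions. The sketch below uses plain non-negative matrix factorization from scikit-learn as a stand-in; RAD itself uses a hybrid optimizer and gene-module compression, which this toy omits.

```python
# Generic bulk deconvolution: B ~ C @ F, with C the population expression
# signatures (genes x populations) and F the per-sample population
# fractions (populations x samples). NMF here is a stand-in, not RAD.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_genes, n_samples, k = 60, 12, 3
C_true = rng.gamma(2.0, 1.0, size=(n_genes, k))       # population signatures
F_true = rng.dirichlet(np.ones(k), size=n_samples).T  # fractions sum to 1
noise = rng.normal(0, 0.01, size=(n_genes, n_samples))
B = np.clip(C_true @ F_true + noise, 0, None)         # observed bulk data

model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
C_hat = model.fit_transform(B)       # inferred signatures
F_hat = model.components_            # inferred (unnormalized) loadings
F_hat = F_hat / F_hat.sum(axis=0)    # renormalize columns to proportions
print("reconstruction error:", model.reconstruction_err_)
```

Comparing inferred fractions F_hat between paired primary and metastatic samples is the kind of downstream progression analysis the abstract then builds on.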
02:00 PM - 02:40 PM (EDT)
NetBio: Network Biology - NetBio Keynote: Genome-wide phenotypic screens: the total is greater than the sum of the parts
Connecting genotypes to phenotypes is critical for uncovering gene functions, mapping biological networks and understanding the causes of rare and common diseases. Scalable experimental approaches, including gene editing, silencing and knockout, make it possible to systematically examine how genetic variation affects an organism, and open the door to new ideas in integrative modeling. Here, I describe the assembly and analysis of the largest set of systematic genetic perturbations to date -- the Yeast Phenome. This dataset combines ~11,000 phenotypic screens of the genome-wide collection of knock-out mutants in the budding yeast Saccharomyces cerevisiae, and integrates the work of 280 laboratories and 380 publications. The Yeast Phenome currently provides the largest, richest and most systematic phenotypic description of an organism, and enables a multitude of inquiries into the nature of gene-gene, phenotype-phenotype and gene-phenotype networks.
02:00 PM - 02:40 PM (EDT)
MICROBIOME - Microbiome Keynote: The analysis of microbiome data from biased high-throughput sequencing
The composition of a microbiome is an important parameter to estimate given the critical role that microbiomes play in human and environmental health. However, profiling the composition of a microbial community using high throughput sequencing distorts the true composition of the community. Sequencing mock communities -- artificially constructed microbiomes of known composition -- clearly illustrates that observed composition is a biased estimate of true composition, with certain taxa consistently overobserved or underobserved compared to their true relative abundance. We propose a statistical model for bias in compositional data, illustrating its performance on data from the Vaginal Microbiome Consortium, and illustrate the effect of compositional bias on the replicability of human microbiome studies using data from the Microbiome Quality Control Project. We conclude with recommendations for the design and analysis of microbiome studies.
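A small numeric illustration of the multiplicative-bias picture discussed in this talk: observed relative abundances are the true composition multiplied by taxon-specific detection efficiencies and renormalized, so mock communities of known composition let us estimate the efficiencies and invert the bias. The numbers below are made up for illustration.

```python
# Multiplicative bias in compositional data: observed proportions are the
# true proportions times per-taxon efficiencies, renormalized to sum to 1.
import numpy as np

true_comp = np.array([0.50, 0.30, 0.20])   # true relative abundances
efficiency = np.array([1.0, 4.0, 0.5])     # taxon detection efficiencies

observed = true_comp * efficiency
observed /= observed.sum()                 # sequencing yields proportions
print(observed.round(3))                   # taxon 2 now looks dominant

# With efficiencies estimated from a mock community, the bias inverts:
estimated_true = observed / efficiency
estimated_true /= estimated_true.sum()
assert np.allclose(estimated_true, true_comp)
```

This is why a taxon with high detection efficiency is consistently over-observed across studies, and why the correction operates on ratios rather than raw proportions.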
02:00 PM - 02:40 PM (EDT)
CompMS - CompMS Keynote: Proximity interactome for SARS-CoV-2. Why knowing your neighbor is key in pandemic times.
Compartmentalization is essential for all complex forms of life. In eukaryotic cells, membrane-bound organelles, as well as a multitude of protein- and nucleic acid-rich subcellular structures, maintain boundaries and serve as enrichment zones to promote and regulate protein function, including signaling events. Consistent with the critical importance of these boundaries, alterations in the machinery that mediates protein transport between these compartments have been implicated in a number of diverse diseases, and this machinery is harnessed by pathogens including viruses. Prompted by the implementation of in vivo biotinylation approaches such as BioID, we report here the systematic mapping of the composition of various subcellular structures, using as baits proteins (or protein fragments) that are well-characterized markers for a specified location. We defined how relationships between "prey" proteins detected through this approach can help in understanding protein organization inside a cell, which is further facilitated by newly developed computational tools. We will discuss our map of a human cell containing major organelles and non-membrane bound structures, and illustrate how this map can be leveraged to devise "compartment sensors" to explore biology. We next address the use of this type of strategy to reveal new insights into the biology of pathogens, focusing on the first proximity map for each of the SARS-CoV-2 proteins.
02:00 PM - 02:40 PM (EDT)
RegSys - RegSys KEYNOTE: Deep Learning of Immune Differentiation
The mammalian genome contains several million cis-regulatory elements, whose differential activity marked by open chromatin determines cellular differentiation. While the growing availability of functional genomics assays allows us to systematically identify cis-regulatory elements across varied cell types, how the DNA sequence of cis-regulatory elements is decoded and orchestrated on the genome scale to determine cellular differentiation is beyond our grasp. In this talk, I’ll present recent work using machine learning as a tool to derive an understanding of the relationship between regulatory sequence and cellular function in the context of immune cell differentiation. In particular, I’ll present our deep learning approach (AI-TAC) to combining a large and granular compendium of epigenomic data and will describe approaches to robustly interpreting complex models in order to uncover mechanistic insights into immune gene regulation (Yoshida et al., Cell 2019; Maslova et al., bioRxiv 2019). Our work shows that a deep learning approach to genome-wide chromatin accessibility can uncover patterns of immune transcriptional regulators that are directly coded in the DNA sequence, thus providing a powerful in-silico framework (an in-silico assay of sorts) to mechanistically probe the relationship between regulatory sequence and its function.
02:00 PM - 02:40 PM (EDT)
Vari - Somatic variant calling and interpretation in the Pan-Cancer Analysis of Whole Genomes project
The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes project explored the role of coding and non-coding variation among a cohort of >2,600 cancer whole genomes and matching normal tissues. The consortium had to overcome multiple challenges to generate a consistent and high quality set of variant calls across the cohort, including large differences in the accuracy of different somatic mutation calling algorithms, the lack of uniform benchmarking, and the technical challenges of uniformly processing a geographically scattered data set of 800 TB. In this talk, I will walk through how the consortium addressed these challenges, what we learned about the significance of coding and non-coding cancer driver mutations, and discuss insights gained from the distribution of non-driver "passenger" mutations.
02:10 PM - 02:20 PM (EDT)
3DSIG - The vestibule role of membrane-water interface as the intermediate stage in a new three-stage model for helical membrane protein folding
Transmembrane alpha-helical (TMH) proteins play critical roles in cellular signaling. They display a diversity of structural folds featuring almost-parallel orientation of TM helices packing into helical bundles. The membrane environment enormously reduces the accessible conformational landscape for folding, but also makes folding experiments challenging. The contribution of helix insertion energies to the folding energy landscape was computed using structural-bioinformatics-based hydropathy analysis for most of the polytopic helical membrane proteome (from 1-TMH to 24-TMH proteins with structures). The magnitudes of TM helix insertion energies from water to the membrane-water interface (WAT→INT energies) are on average half of those from water to the trans-membrane-helix orientation (WAT→TMH energies), suggesting a potential vestibule role of the membrane-water interface for TM helices after translocon exit. This is confirmed by showing the stability of very hydrophobic TM helices at the membrane-water interface through multiple microsecond-long molecular dynamics simulations of a stop-transfer helix, a re-integration helix, and a pre-folded helical hairpin from the ribosomal exit vestibule. Thus, a three-stage folding model is proposed to extend Popot-Engelman’s original two-stage model, in which the membrane-water interface acts as the intermediate-stage holding vestibule for translated TM helices, reconciling the interface’s critical role seen in many previous studies.
02:20 PM - 02:40 PM (EDT)
3DSIG - Detecting Symmetry in Membrane Proteins
Available membrane protein structures have revealed an abundance of symmetry and pseudo-symmetry, which arose not only by the formation of multi-subunit assemblies, but also by repetition of internal structural elements. In many cases, these symmetry relationships play a crucial role in defining the functional properties of the proteins. Therefore, a systematic study of symmetry should provide a framework for a broader understanding of the mechanistic principles and evolutionary development of membrane proteins. However, available symmetry detection methods have not been tested systematically on this class of proteins because of the lack of an appropriate benchmark set. Hence, we collected membrane protein structures with unique architectures and manually curated their symmetries to create the MemSTATS dataset. Using MemSTATS, we compared the performance of four widely used symmetry detection algorithms and pinpointed areas for improvement. To address the identified shortcomings, we developed a robust symmetry detection methodology called MSSD, which takes into consideration the restrictions that the lipid bilayer places on protein structures. MSSD detected symmetries with higher accuracy and lower false positive rate compared to any other tested method. Consequently, we used MSSD to analyze all available membrane protein structures and presented the resultant symmetries in a database called EncoMPASS (encompass.ninds.nih.gov).
02:20 PM - 02:30 PM (EDT)
TransMed - Deep Hidden Physics Modeling of Cell Signaling Networks
Signaling systems in multicellular organisms are vital for cell-cell communication, tissue organization and disease. Cancer genomics has unraveled a surprisingly large set of novel gene lesions from tumors. Our previous studies have globally explored the rewiring of cell signaling networks underlying malignant transformation caused by kinases and other signaling proteins. By generating quantitative time-/state-series data and subsequently using these as input for deep-learning-based computational modeling, our lab works to identify the principal changes in the genome, cell signaling and phenotypes of cells harboring genetic mutations; we validate these models by forward prediction of experimentally observed phenotypic responses to drug and genetic perturbations. We are currently deploying such forecasting models on data collected from PDX tumors to describe how cell signaling networks are mechanistically, dynamically and differentially utilized in cancers. Finally, we are working to combine deep learning with causal/mechanistic models to predict novel treatment and diagnostic strategies for tumors harboring different genetic lesions. In conclusion, our studies aim to unravel the fundamental rewiring of cell signaling networks in cancer and to serve as a major breakthrough in our basic understanding of their impact on the disease, paving the way for future clinical applications and tumor-specific cancer therapy.
02:20 PM - 02:40 PM (EDT)
HiTSeq: High Throughput Sequencing - Metalign: Efficient alignment-based metagenomic profiling via containment min hash
Whole-genome shotgun sequencing enables the analysis of microbial communities in unprecedented detail, with important implications in medicine and ecology. Predicting the presence and relative abundances of microbes in a sample, known as “metagenomic profiling”, is a critical step in microbiome analysis. Existing profiling methods have been shown to suffer from poor false positive or false negative rates, while alignment-based approaches are often considered accurate but computationally infeasible. Here we present a novel method, Metalign, that addresses these concerns by performing efficient alignment-based metagenomic profiling. Metalign employs a high-speed, high-recall pre-filtering method based on the mathematical concept of Containment Min Hash to reduce the reference database size dramatically before alignment, followed by a method to estimate organism relative abundances in the sample by handling reads aligned to multiple genomes. We show that Metalign achieves significantly improved results over existing methods on simulated datasets (Figure 1) from a large benchmarking study, CAMI, and performs well on in vitro mock community data and environmental data from the Tara Oceans project. Metalign is freely available at https://github.com/nlapier2/Metalign, and via bioconda.
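The containment min hash idea behind Metalign's pre-filter can be illustrated in a few lines. The following is a minimal stdlib-only sketch, not Metalign's implementation: it keeps the smallest k-mer hashes of a reference genome as a sketch, then estimates what fraction of the reference's k-mers are contained in a sample. (The real tool tests membership against a Bloom filter of the sample rather than an exact hash set; all function names here are illustrative.)

```python
import hashlib

def kmers(seq, k=21):
    """Yield every k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def h(kmer):
    """Stable 64-bit hash of a k-mer."""
    return int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def minhash_sketch(seq, k=21, size=100):
    """Bottom-k sketch: the `size` smallest distinct k-mer hashes."""
    return sorted({h(km) for km in kmers(seq, k)})[:size]

def containment(ref_sketch, sample_hashes):
    """Estimate |ref ∩ sample| / |ref| as the fraction of sketch
    hashes found in the sample (exact set stands in for a Bloom filter)."""
    hits = sum(1 for x in ref_sketch if x in sample_hashes)
    return hits / len(ref_sketch)
```

References whose sketch shows near-zero containment in the sample can be dropped before the expensive alignment step, which is what makes the database reduction cheap.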
02:20 PM - 02:40 PM (EDT)
Bio-Ontologies - Using ontologies to extract disease-phenotype associations from literature
With the advances in Next Generation Sequencing (NGS) technologies, a huge volume of clinical genomic data has become available. Efficient exploitation of such data requires linkage to a patient's complete phenotype profile. Current resources providing disease-phenotype associations are not comprehensive, and they often do not cover all of the diseases from OMIM and particularly from ICD10, which are the primary terminologies used in clinical settings. Here, we propose a text-mining system which utilizes semantic relations in the phenotype ontologies and statistical methods to extract disease-phenotype associations from the literature. We compare our findings against established disease-phenotype associations and also demonstrate the system's utility in covering mouse gene-disease associations from Mouse Genome Informatics (MGI). Such associations serve as necessary information blocks for understanding underlying disease mechanisms and developing or repurposing drugs.
02:30 PM - 02:40 PM (EDT)
TransMed - A deep transfer learning model for extending in vitro CRISPR-Cas9 viability screens to tumors
The Cancer Dependency Map (DepMap) projects recently employed genome-scale CRISPR-Cas9 loss-of-function screens to identify genes essential for cancer cell proliferation and survival across cancer cell lines. However, it remains very challenging to translate these in vitro results to impractical-to-screen tumors. To address the challenge, we devised a deep learning model with a unique transfer-learning framework to predict gene dependencies of tumors. The model has a 3-stage design that enables representation learning of unlabeled tumor genomic data, the prediction of gene dependencies in labeled cell-line screening data, and the application to predict tumor dependencies. The prediction performance was verified using cell-line data. Applying our model to ~8,000 tumors of The Cancer Genome Atlas, we constructed a pan-cancer dependency map of tumors. The results were confirmed by several biomarkers and by responses to targeted therapies in TCGA clinical records. Further investigations revealed gene dependencies associated with specific genomic patterns, such as higher tumor mutation burdens and unique expression/methylation signatures. We also identified highly selective gene dependencies for which inhibitor drugs have been approved to treat cancers. We expect the model to evolve with rapidly developing in vitro CRISPR-Cas9 viability screens and facilitate the translation to identifying therapeutic targets of tumors.
02:40 PM - 02:50 PM (EDT)
TransMed - The evolution of homologous repair deficiency in high grade serous ovarian carcinoma
Exploiting a large collection of whole genome sequencing (WGS) data from high grade serous ovarian carcinoma (HGSOC) samples (N=207), we have comprehensively characterised mutation and expression at the BRCA1/2 loci. In addition to the known spectrum of short somatic variants (SSVs), we discover that multi-megabase structural variants (SVs) are a frequent but unappreciated source of BRCA1/2 disruption in these tumours. These SVs independently affect a substantial proportion of patients (16%) in addition to those affected by SSVs (25%) to cause homologous recombination repair deficiency (HRD). We also detail compound deficiencies involving SSVs and SVs at both loci, demonstrating that the strongest risk of HRD emerges from combined SVs at both BRCA1 and BRCA2 in the absence of SSVs. Overall, we show that HRD is a complex phenotype in HGSOC, affected by the patterns of short somatic and germline variants, SVs, as well as methylation and expression at multiple loci, and we construct a successful (ROC AUC = 0.62) predictive model of HRD using such variables. These results extend our understanding of the mutational landscape at the BRCA1/2 loci in highly rearranged tumours, and also increase the number of patients predicted to benefit from therapies exploiting HRD in tumours.
02:40 PM - 02:50 PM (EDT)
NetBio: Network Biology - BiCoN: Network-constrained biclustering of patients and omics data
Unsupervised learning approaches are frequently employed to stratify patients into clinically relevant subgroups and to identify biomarkers such as disease-associated genes. However, clustering and biclustering techniques are not suitable to unravel molecular mechanisms along with patient subgroups. We developed the network-constrained biclustering approach BiCoN (Biclustering Constrained by Networks) which (i) restricts biclusters to functionally related genes connected in molecular interaction networks and (ii) maximizes the difference in gene expression between two subgroups of patients. This allows BiCoN to simultaneously pinpoint molecular mechanisms responsible for the patient grouping. Network-constrained clustering of genes makes BiCoN more robust to noise and batch effects than typical clustering and biclustering methods. BiCoN can faithfully reproduce known disease subtypes as well as novel, clinically relevant patient subgroups, as we could demonstrate using breast and lung cancer datasets. In summary, BiCoN is a novel systems medicine tool that combines several heuristic optimization strategies for robust disease mechanism extraction. BiCoN is well-documented and freely available as a python package or a web-interface. Availability: PyPI package: https://pypi.org/project/bicon Web interface: https://exbio.wzw.tum.de/bicon Preprint: https://doi.org/10.1101/2020.01.31.926345
02:40 PM - 02:50 PM (EDT)
Vari - Seak marries regulatory genomics deep learning with rare-variant association tests
Sequencing-based genotyping methods are on the rise, yet leveraging the predominantly rare genetic variants they measure remains challenging. Large rare-variant association studies have mainly focused on protein-altering variation while little attention has been given to variants acting on the RNA level or other non-coding regulatory mechanisms. For these mechanisms, deep learning has recently been successful at predicting the effects of genetic variants. Here we introduce seak (sequence annotations in kernel-based tests), a Python package that flexibly integrates variant effect predictions into set-based association tests while controlling for relatedness and population structure using linear mixed models. We first show that using functional variant effect predictions can increase statistical power in simulation studies and shed light on potentially causal mechanisms. Then we apply seak to the UK Biobank exome-sequencing dataset. We perform association tests for three biomarkers of cardiovascular disease and cancer, incorporating deep-learning-derived variant effects for disease-related RNA-binding proteins. With this novel approach we find two significant associations for each biomarker, which include both novel and known associations. Our results demonstrate that, by incorporating regulatory variant effects, seak can identify novel biologically interpretable associations, thereby unlocking the potential of whole-exome and whole-genome sequencing studies.
02:40 PM - 02:50 PM (EDT)
CompMS - MSnbase, efficient and elegant R-based processing and visualisation of raw mass spectrometry data
We present version 2 of the MSnbase R/Bioconductor package. MSnbase provides infrastructure for the manipulation, processing and visualisation of mass spectrometry data. We focus on the new on-disk infrastructure, which allows the handling of large raw mass spectrometry experiments on commodity hardware, and illustrate how the package is used for elegant data processing, method development and visualisation.
02:40 PM - 02:50 PM (EDT)
MICROBIOME - PLoT-ME: Pre-classification of Long-reads for Memory Efficient Taxonomic assignment
With increasing feasibility, long-read metagenomics can enable high-resolution taxonomic analysis in a range of applications from diagnostics to forensics. The ease of access via portable long-read platforms (e.g. MinION) is in contrast to the need for significant memory resources when classifiers try to provide precise read assignments (to strain or sub-strain level) or identify a wider set of organisms (e.g. large eukaryotes). To address this, memory-efficient taxonomic classifiers are an active area of research, with methods based on compact indexes providing various tradeoffs between memory usage and speed. Here we present a general-purpose strategy (PLoT-ME) that leverages the information in k-mer frequency (3-5 bp) spectra of long reads to pre-classify them, allowing existing classifiers to further assign them against subsets of the reference database. Evaluation on mock communities (real reads) shows that PLoT-ME’s fast K-means classifier provides a scalable, compact approach to rapidly pre-classify long error-prone reads (PacBio, Oxford Nanopore) without loss in classification performance. PLoT-ME was found to be robust to a range of read lengths (500bp-10kbp) and provides up to an order-of-magnitude reduction in memory requirements. We envisage that with further improvements in long-read metagenomic classifiers, this approach will enable a general-purpose strategy for high-resolution, low-memory microbiome analysis.
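The pre-classification step the abstract describes — a k-mer frequency spectrum per read, then a K-means assignment to a database subset — can be sketched with the standard library alone. This is not PLoT-ME's code: the centroids are assumed to come from an earlier (omitted) K-means training pass over reference sequences, and only the assignment step (one K-means E-step) is shown; names are illustrative.

```python
from itertools import product
import math

K = 3  # the tool uses 3-5 bp k-mers; 3 keeps the vector small (64 dims)
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
IDX = {km: i for i, km in enumerate(ALL_KMERS)}

def kmer_profile(read):
    """Normalized k-mer frequency vector (the read's 'spectrum')."""
    counts = [0] * len(ALL_KMERS)
    for i in range(len(read) - K + 1):
        j = IDX.get(read[i:i + K])
        if j is not None:          # skip k-mers with ambiguous bases (e.g. N)
            counts[j] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

def assign_bin(read, centroids):
    """Nearest-centroid assignment: index of the reference-database
    subset whose centroid is closest to the read's spectrum."""
    prof = kmer_profile(read)
    dists = [math.dist(prof, c) for c in centroids]
    return dists.index(min(dists))
```

Because the profile is a fixed 4^K-dimensional vector regardless of read length, the assignment is cheap and tolerant of per-base errors, which is why it works on error-prone long reads.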
02:40 PM - 03:00 PM (EDT)
3DSIG - Evolutionary pathways of repeat protein topology in bacterial outer membrane proteins
Outer membrane proteins (OMPs) are proteins on the surface of Gram-negative bacteria. These proteins have diverse functions but a single topology: the β-barrel. Sequence analysis has suggested that this common fold is a β-hairpin repeat protein, and that amplification of the β-hairpin has resulted in 8–26-stranded barrels. Using an integrated approach that combines sequence and structural analyses, we find events in which non-amplification diversification also increases barrel strand number. Our network-based analysis reveals strand-number-based evolutionary pathways, including one that progresses from a primordial 8-stranded barrel to 16 strands and further, to 18 strands. Among these pathways are mechanisms of strand number accretion without domain duplication, like a loop-to-hairpin transition. These mechanisms illustrate perpetuation of repeat protein topology without genetic duplication, likely induced by the hydrophobic membrane. Finally, we find that the evolutionary trace is particularly prominent in the C-terminal half of OMPs, implicating this region in the nucleation of OMP folding.
02:40 PM - 03:00 PM (EDT)
RegSys - A guide to predicting activity of enhancer orthologs in hundreds of species
Many phenotypes, including vocal learning, longevity, and brain size, have evolved through changes in gene expression, meaning that their differences across species are caused by differences in genome sequence at enhancers. While some of the genes involved in these phenotypes have been identified, in most cases it remains unknown which enhancers are responsible and how genome sequence differences in those enhancers have led to differences in gene expression. We developed a machine learning model that predicts species-specific brain enhancer activity from genome sequences at orthologs of open chromatin regions. We trained our models using brain ATAC-seq data from mouse, rat, Rhesus macaque, and human. Our model achieved AUPRC = 0.88 on the entire validation set and AUPRC = 0.70 on species-specific enhancers and non-enhancers. We used our models to make brain enhancer activity predictions across hundreds of mammals. We demonstrated that similarity in predictions between species is negatively correlated with evolutionary distance. We then used our predictions to identify clade-specific enhancers and showed that our predictions were consistent with our data. Our approach to predicting enhancer orthologs’ activity and our metrics for evaluating such predictions can be applied to any tissue or cell type with open chromatin data available from multiple species.
02:40 PM - 03:00 PM (EDT)
Bio-Ontologies - Representing Physician Suicide Claims as Nanopublications
In the poorly studied field of physician suicide, various factors can contribute to misinformation or information distortion, which in turn can influence evidence-based policies and the prevention of suicide in this unique population. Here, we report on the use of nanopublications as a scientific publishing approach to establish a citation network of claims drawn from a variety of media concerning the rate of suicide of US physicians. Our work integrates these various claims and enables the verification of non-authoritative assertions, thereby better equipping researchers to advance evidence-based knowledge and make informed statements in the advocacy of physician suicide prevention.
02:40 PM - 03:00 PM (EDT)
HiTSeq: High Throughput Sequencing - META^2: Memory-efficient taxonomic classification and abundance estimation for metagenomics with deep learning
Taxonomic classification is an important step in the analysis of samples found in metagenomic studies. Conventional mapping-based methods trade off between high memory and low recall, with recent deep learning methods suffering from very large model sizes. We aim to develop a more memory-efficient technique for taxonomic classification. A task of particular interest is abundance estimation. Current methods initially classify reads independently and are agnostic to the co-occurrence patterns between taxa. In this work, we also attempt to take these patterns into account. We develop a novel memory-efficient read classification technique, combining deep learning and locality-sensitive hashing. We show that this approach outperforms conventional mapping-based and other deep learning methods for taxonomic classification when restricting all methods to a fixed memory footprint. Moreover, we formulate the task of abundance estimation as a Multiple Instance Learning problem and we extend current deep learning architectures with two types of permutation-invariant MIL pooling layers: a) deepsets and b) attention-based pooling. We illustrate that our architectures can exploit the co-occurrence of species in metagenomic read sets and outperform the single-read architectures in predicting the distribution over taxa at higher taxonomic ranks.
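The permutation-invariant MIL pooling the abstract mentions can be shown in miniature. The sketch below is a generic attention-based pooling over per-read embedding vectors, not the paper's architecture: the scalar scoring function is a stand-in for a small learned network, and a plain mean or sum of the embeddings would be the "deepsets" variant. All names are illustrative.

```python
import math

def attention_pool(instance_embeddings, score_weights):
    """Attention-based MIL pooling: softmax over per-instance scores
    gives weights, and the bag embedding is the weighted sum.
    The result is invariant to the order of the instances (reads)."""
    # per-instance scalar score (stand-in for a learned scoring network)
    scores = [sum(w * x for w, x in zip(score_weights, emb))
              for emb in instance_embeddings]
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]         # softmax weights, sum to 1
    dim = len(instance_embeddings[0])
    return [sum(a * emb[d] for a, emb in zip(alphas, instance_embeddings))
            for d in range(dim)]
```

Because the weights come from a softmax over all instances and the output is a sum, shuffling the reads in a bag leaves the pooled embedding unchanged, which is the property that makes the bag-level abundance prediction well-defined.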
02:50 PM - 03:00 PM (EDT)
TransMed: Q&A
02:50 PM - 03:00 PM (EDT)
CompMS - Democratizing DIA analysis on public cloud infrastructures via Galaxy
Data independent acquisition (DIA) has become one of the most important approaches in global proteomic studies. DIA data provides detailed and in-depth insights into the molecular variety of biological systems. However, due to the high complexity and large data size, the data analysis remains challenging. Available open-source software requires different operating systems, programming skills, and large compute infrastructures. Thus, current open-source DIA data analysis is mainly accessible to bioinformatics-competent researchers with access to large computational resources, and often lacks reproducibility and usability. Here we present a straightforward workflow containing all essential DIA analysis steps based on OpenSwath, pyprophet, diapysef and swath2stats, which can be applied and adapted by a large user community without the need for tool installations, special computing resources or programming skills. The all-in-one DIA workflow in Galaxy drastically increases the robustness, reproducibility and speed of DIA data analysis due to parallel processing of multiple inputs using Galaxy's HPC and cloud infrastructure. Each tool is available as a Conda package and BioContainer. However, a few steps in this workflow require up to 1 TB of memory, hence we recommend using the workflow on the European Galaxy server (https://usegalaxy.eu), which can utilize worldwide HPC and cloud resources.
02:50 PM - 03:00 PM (EDT)
NetBio: Network Biology - PhenoGeneRanker: A Tool for Gene Prioritization Using Complete Multiplex Heterogeneous Networks
Identification of specific complex-trait genes is a challenging process, as the etiology of those traits involves multiple genes, multiple layers of molecular interactions and environmental factors. Gene prioritization is an important step to produce a manageable short list of highly likely complex-trait genes. Integration of biological datasets through networks is a promising approach to identify complex-trait genes, as it provides a natural way of integrating different, complementary genotypic and phenotypic datasets. Integration of different datasets alleviates the effects of missing data and the low-signal, noisy nature of biomedical datasets. In this study, we present PhenoGeneRanker, a gene prioritization tool which utilizes multi-layer gene and phenotype networks by combining them into a heterogeneous biological network. PhenoGeneRanker enables integration of weighted/unweighted and undirected gene and phenotype networks for holistic and comprehensive prioritization of genes. It calculates empirical p-values of gene rankings using random stratified sampling of genes based on their degree centrality in the network, to address potential bias toward high-degree nodes. To assess PhenoGeneRanker, we applied it to a rice dataset to rank cold tolerance-related genes. Our results showed that PhenoGeneRanker successfully ranked genes such that the top-ranked genes were enriched in cold tolerance-related GO terms.
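The degree-stratified empirical p-value the abstract describes can be illustrated generically. This is a minimal sketch under stated assumptions, not PhenoGeneRanker's code: the stratum cut-offs, the scoring dictionary, and the function names are all invented for illustration; the point is that the null distribution for a gene's score is drawn only from genes of comparable degree, so hub genes are not rewarded merely for being hubs.

```python
import random

def degree_stratum(degree, bounds=(5, 50)):
    """Bin a node's degree into low/mid/high strata (illustrative cut-offs)."""
    lo, hi = bounds
    return 0 if degree < lo else (1 if degree < hi else 2)

def empirical_pvalue(gene, scores, degrees, n_samples=1000, rng=None):
    """P(random same-stratum gene scores >= observed). Sampling within
    the gene's degree stratum controls the bias toward high-degree nodes."""
    rng = rng or random.Random(0)
    stratum = degree_stratum(degrees[gene])
    pool = [g for g in scores
            if g != gene and degree_stratum(degrees[g]) == stratum]
    obs = scores[gene]
    hits = sum(1 for _ in range(n_samples)
               if scores[rng.choice(pool)] >= obs)
    return (hits + 1) / (n_samples + 1)   # add-one keeps p strictly > 0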
02:50 PM - 03:00 PM (EDT)
MICROBIOME - pepFunk: an R shiny app and workflow for peptide-centric functional analysis of metaproteomic microbiome data
Researchers can use metaproteomics to study the composition and functional contributions of the gut microbiome to human health. These metaproteomic data are acquired by a multistep process, starting with enzymatic digestion of microbial proteins into smaller and more easily detectable peptides. These peptides are then processed by a mass spectrometer, and the obtained mass spectra are matched back to a peptide database. Typically, the metaproteomic data analysis pipeline involves assigning each peptide to a potential parent protein. Challenges to unambiguous protein identification arise due to the nature of enzymatic digestion, where peptides can match back to multiple parent proteins. We developed pepFunk, a peptide-centric functional analysis methodology and tool for metaproteomic data, to circumvent this challenge. We created a gut microbiome peptide-to-KEGG database and developed a functional enrichment strategy for peptide-level data. Our peptide-centric approach gives users an enhanced ability to observe the biological processes taking place in the microbiome. Our tool is open source and is available as a Shiny web application at https://shiny.imetalab.ca/pepFunk.
02:50 PM - 03:05 PM (EDT)
Vari - Go Big Or Go Home: PCR-free WGS Long And Short Read Orthogonal Test Eliminates The Need For Multiple Platform Genetic Tests
Our diagnostic testing lab uses PCR-free whole genome sequencing (WGS) as the platform for a comprehensive genetic evaluation that detects single nucleotide variants and small indels (like traditional exomes), but in addition gives us the ability to detect structural variants and short tandem repeats (STRs). Of the clinical cases processed this year with reported pathogenic or likely pathogenic variants, just 65% involved SNVs/indels only. By starting with the most comprehensive testing, we capture the vast majority of genetic variants with comparable or improved sensitivity and specificity relative to standard-of-care/best-practice testing, essentially eliminating or drastically reducing the need for all other tests for variant classes that are currently only detectable using platforms such as Southern blots for long STRs and large deletions, PCR/capillary electrophoresis for small STRs, qPCR/MLPA for exon-level deletions and duplications, and microarrays, FISH and karyotyping for gross chromosomal deletions. This type of genetic diagnostic test has the potential to flip the current costly and time-consuming paradigm of starting small with single-gene tests and moving on to large panels or exomes/genomes. With short-read WGS analysis we currently see sensitivity values of >99% for SNVs, >96% for indels and >85% for structural variants. With the addition of long-read WGS analysis we can address many of the weaknesses of short-read analysis, including, but not limited to, accurately identifying hard-to-detect deletions and insertions, covering non-uniquely mappable regions, detecting the exact count of repeat units in expansions, and identifying balanced translocations. With simultaneous orthogonal confirmation using both short- and long-read WGS analysis, the sensitivity for indels, structural variants and STR detection increases to >95% (based on preliminary validation data). We will present a breakdown of this data.
03:20 PM - 03:40 PM (EDT)
HiTSeq: High Throughput Sequencing - Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis
Oxford Nanopore (ONT) is a leading long-read technology which has been revolutionizing transcriptome analysis through its capacity to sequence the majority of transcripts from end to end. This has greatly increased our ability to study the diversity of transcription mechanisms such as transcription initiation, termination, and alternative splicing. However, ONT still suffers from high error rates, which have thus far limited its scope to reference-based analyses. When a reference is not available or is not a viable option due to reference bias, error correction is a crucial step towards the reconstruction of the sequenced transcripts and downstream sequence analysis of transcripts. In this paper, we present a novel computational method to error-correct ONT cDNA sequencing data, called isONcorrect. IsONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths. We are able to obtain an accuracy of 98.9-99.6%, demonstrating the feasibility of applying cost-effective full-length cDNA sequencing for reference-free transcriptome analysis.
03:20 PM - 03:30 PM (EDT)
3DSIG - BIO-GATS: A tool for automated GPCR template selection through a biophysical approach for homology modeling.
G protein-coupled receptors (GPCRs) are the largest family of membrane proteins, comprising more than 800 members, each with seven transmembrane (TM) domains. GPCRs are involved in numerous physiological functions within the human body and are the target of more than 30% of US Food and Drug Administration approved drugs. At present, 64 unique receptors have known experimental structures. The absence of experimental structures for the majority of GPCRs makes homology models necessary for structure-based drug discovery workflows. Homology modeling requires appropriate templates. Common methods for template selection consider sequence identity; however, sequence identity among the TM domains of GPCRs is low. Sequences with similar patterns of hydrophobic residues are often structural homologues even when sharing low sequence identity. We have proposed a novel biophysical approach for template selection based on hydrophobicity correspondence between the target and the template. The approach also takes other parameters into consideration, including sequence identity, resolution, and query coverage. The proposed approach has been implemented in the form of a graphical user interface. We have applied the approach to an olfactory receptor and present a comprehensive comparison between the templates for the ORs based on our template selection criteria.
03:20 PM - 03:40 PM (EDT)
CompMS - Isolation forests improve the capability to detect quality problems in mass spectrometry-based proteomics
Quality control (QC) of mass spectrometry-based proteomic experiments involves quantifying a standard mixture containing a set of analytes, generating multiple metrics. The metrics are then used to evaluate the effects of technical variability on the quantification of the actual biological samples. Next, a reliable baseline data set is used to train statistical models or traditional statistical quality control methods to classify outlying runs. Although current technological improvements help conduct the initial steps of most QC workflows, many practitioners still lack baseline data and misclassify QC runs in real-time implementations. Here, we present an unsupervised machine learning extension of MSstatsQC to detect deviations from optimal performance across multiple metrics. MSstatsQC implements unsupervised isolation-based trees to address the outlier detection problem. Our results show how tree-based methods help differentiate optimal and sub-optimal experiments when limited information is available about the optimal/suboptimal performance of the instrument. We also provide supporting information on the root causes of anomalous behavior per peptide, which can be used to design preventive actions. Our method is available in the MSstatsQC R/Bioconductor package and through the web-based graphical user interface MSstatsQCgui.
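The isolation-tree intuition — anomalous runs are separated from the rest by very few random splits — can be sketched compactly. This is a from-scratch toy version of the generic isolation forest idea, not MSstatsQC's implementation (which lives in R/Bioconductor); depth limits, tree count, and names are illustrative.

```python
import math
import random

def _tree_depth(points, point, depth=0, limit=10, rng=None):
    """Depth at which `point` is isolated by random axis-aligned splits."""
    if depth >= limit or len(points) <= 1:
        return depth
    dim = rng.randrange(len(point))
    vals = [p[dim] for p in points]
    lo, hi = min(vals), max(vals)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # recurse into whichever side of the split `point` falls on
    side = [p for p in points if (p[dim] < split) == (point[dim] < split)]
    return _tree_depth(side, point, depth + 1, limit, rng)

def anomaly_score(data, point, n_trees=100, seed=0):
    """Isolation-forest score in (0, 1]; values near 1 mark outliers.
    Short average isolation depth => high score."""
    rng = random.Random(seed)
    avg = sum(_tree_depth(data, point, rng=rng) for _ in range(n_trees)) / n_trees
    n = len(data)
    # expected path length of an unsuccessful BST search (normalizer)
    c = 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
    return 2 ** (-avg / c)
```

Applied to QC, each point would be a run's vector of metrics; a run far from the baseline cloud in any metric dimension isolates quickly and scores high, with no labeled sub-optimal runs required.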
03:20 PM - 03:40 PM (EDT)
NetBio: Network Biology - Understanding tissue-specific gene regulation by miRNAs
Conventional methods to analyze genomic data do not make use of the interplay between multiple factors, such as between microRNAs (miRNAs) and the mRNA transcripts they regulate, and thereby often fail to identify the cellular processes that are unique to specific tissues. We developed PUMA (PANDA Using MicroRNA Associations), a computational tool that uses message passing to integrate a prior network of miRNA target predictions with protein-protein interaction and target gene co-expression information to model genome-wide gene regulation by miRNAs. We applied PUMA to 38 tissues from the Genotype-Tissue Expression project, integrating RNA-Seq data with two different miRNA target predictions priors, built on predictions from TargetScan and miRanda, respectively. We found that while target predictions obtained from these two different resources are considerably different, PUMA captures similar tissue-specific miRNA-target gene regulatory interactions in the different network models. Furthermore, tissue-specific functions of miRNAs, which we identified by analyzing their regulatory profiles, are highly similar between networks modeled on the two target prediction resources. This indicates that PUMA consistently captures important tissue-specific regulatory processes of miRNAs. In addition, using PUMA we identified miRNAs regulating important tissue-specific processes that, when mutated, may result in disease development in the same tissue.
03:20 PM - 03:40 PM (EDT)
Bio-Ontologies - Metadata standards for the FAIR sharing of vector embeddings in Biomedicine
Motivation: Today, we have an enormous amount of biomedical data, and both its size and complexity have been increasing over time. Implementation of standards represents one of the key drivers in life sciences research as well as in technology transfer. More specifically, standards enable data accessibility, sharing and integration, and therefore facilitate data harnessing and accelerate research and innovation transfer. The life sciences community has widely developed and used Semantic Web technology standards for data representation and sharing. However, given the success of unsupervised machine learning methods such as Word2Vec and BERT, there is a need to develop new standards for sharing the (pre-trained) vector space embeddings of entities to facilitate reusability of data and method development. Motivated by this, we propose data and metadata standards for the FAIR distribution of vector embeddings and demonstrate utilization of these standards in Bio2Vec, a platform providing flexible, reliable and standard-compliant data representation, sharing, integration and analysis. Availability: The proposed metadata standard and an example are available in the ShEx format at Zenodo.
03:20 PM - 03:40 PM (EDT)
TransMed - Identifying diagnosis-specific genotype-phenotype associations via joint multi-task sparse canonical correlation analysis and classification
Brain imaging genetics provides a new opportunity to understand the pathophysiology of brain disorders. It studies the complex association between genotypic data, such as single nucleotide polymorphisms (SNPs), and imaging quantitative traits (QTs). Neurodegenerative disorders usually exhibit diversity and heterogeneity; as a result, different diagnostic groups might carry distinct imaging QTs, SNPs, and interactions between them. Sparse canonical correlation analysis (SCCA) is widely used to identify bi-multivariate genotype-phenotype associations. However, most existing SCCA methods are unsupervised, leading to an inability to identify diagnosis-specific genotype-phenotype associations. In this paper, we propose a new joint multi-task learning method, named MT-SCCALR, which absorbs the merits of both SCCA and logistic regression. MT-SCCALR learns genotype-phenotype associations of multiple tasks jointly, with each task focusing on identifying the task-specific genotype-phenotype pattern. To ensure interpretability and stability, we endow the proposed model with the ability to select SNPs and imaging QTs for each diagnostic group alone, while also allowing selections shared by multiple diagnostic groups. We derive an efficient optimization algorithm whose convergence to a local optimum is guaranteed. Compared with two state-of-the-art methods, the results show that MT-SCCALR yields better or similar canonical correlation coefficients (CCCs) and classification performance. In addition, it produces much more discriminative canonical weight patterns of great interest than its competitors. This demonstrates the power and capability of MT-SCCALR in identifying diagnostically heterogeneous genotype-phenotype patterns, which would be helpful for understanding the pathophysiology of brain disorders.
03:20 PM - 03:40 PM (EDT)
MICROBIOME - ganon: precise metagenomics classification against large and up-to-date sets of reference sequences
The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification often use outdated and almost never truly up-to-date indices. Motivated by those limitations, we created ganon, a k-mer based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references and keeping them updated. It requires less than 55 minutes to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets, and it therefore classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon
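The core indexing idea the abstract describes — a Bloom filter over reference k-mers — can be sketched in a few lines. This is a toy Python illustration only (ganon itself uses Interleaved Bloom Filters in C++ over many taxonomically clustered reference bins); the sequences and filter sizes below are invented:

```python
import hashlib

class BloomFilter:
    """Approximate set membership: false positives possible, no false negatives."""

    def __init__(self, size=10_000, n_hashes=3):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive n_hashes positions from salted SHA-256 digests.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def kmers(seq, k=5):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

# Index a toy reference sequence.
reference = "ACGTACGTGGCTAGCTAACG"
index = BloomFilter()
for kmer in kmers(reference):
    index.add(kmer)

# Classify a read by the fraction of its k-mers found in the index.
read = "ACGTACGTGG"
hits = sum(kmer in index for kmer in kmers(read))
total = sum(1 for _ in kmers(read))
print(f"{hits}/{total} k-mers matched")
```

The interleaved variant extends this so one lookup simultaneously tests a k-mer against every reference bin, which is what makes querying large reference sets tractable.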
03:20 PM - 03:40 PM (EDT)
BioVis - Interactive visualization and analysis of morphological skeletons of brain vasculature networks with VessMorphoVis
Motivation: Accurate morphological models of brain vasculature are key to modeling and simulating cerebral blood flow (CBF) in realistic vascular networks. This in silico approach is fundamental to revealing the principles of neurovascular coupling (NVC). Validating those vascular morphologies entails performing certain visual analysis tasks that cannot be accomplished with generic visualization frameworks. This limitation has a substantial impact on the accuracy of the vascular models employed in the simulation. Results: We present VessMorphoVis, an integrated suite of toolboxes for interactive visualization and analysis of vast brain vascular networks represented by morphological graphs segmented originally from imaging or microscopy stacks. Our workflow leverages the capabilities of Blender, aiming to establish an integrated, extensible and domain-specific framework capable of interactive visualization, analysis, repair, high-fidelity meshing and high-quality rendering of vascular morphologies. Based on initial user feedback, we anticipate that our framework will become an essential component of vascular modeling and simulation, filling a gap that is at present largely unfilled. Availability and implementation: VessMorphoVis is freely available under the GNU public license on GitHub at https://github.com/BlueBrain/VessMorphoVis. The morphology analysis, visualization, meshing and rendering modules are implemented as an add-on for Blender 2.8 based on its Python API (Application Programming Interface). The add-on functionality is made available to users through an intuitive graphical user interface (GUI), as well as through exhaustive configuration files calling the API via a feature-rich CLI (command-line interface) running Blender in background mode.
03:20 PM - 03:40 PM (EDT)
Vari - Combinatorial and statistical prediction of gene expression from haplotype sequence
Motivation: Genome-wide association studies have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, eQTL studies have interpreted these variants by their regulatory effects on gene expression. However, there remains a considerable gap between genotype-to-gene expression association and genotype-to-gene expression prediction. Accurate prediction of gene expression enables gene-based association studies to be performed post-hoc for existing GWAS, reduces multiple testing burden, and can prioritize genes for subsequent experiments. Results: In this work, we develop gene expression prediction methods that relax the independence and additivity assumptions between genetic markers. First, we consider gene expression prediction from a conventional regression perspective and develop the HAPLEXR algorithm which combines haplotype clusterings with allelic dosages. Second, we introduce the new gene expression classification problem, which focuses on identifying expression groups rather than continuous measurements; we formalize the selection of an appropriate number of expression groups using the principle of maximum entropy. Third, we develop the HAPLEXD algorithm that incorporates suffix tree based haplotype sharing with spectral clustering to identify expression classes from haplotype sequences. In both models, we penalize model complexity by prioritizing genetic clusters that indicate significant effects on expression. We compare HAPLEXR and HAPLEXD on five GTEx v8 tissues with three state-of-the-art expression prediction methods. HAPLEXD exhibits significantly higher classification accuracy overall and HAPLEXR shows higher prediction accuracy on a significant subset of genes. These results demonstrate the importance of explicitly modelling non-dosage dependent and intragenic epistatic effects when predicting expression.
03:20 PM - 03:40 PM (EDT)
RegSys - Comparison of chromatin contacts maps from GAM and Hi-C reveals method specific interactions linked with active and inactive chromatin
Gene expression is functionally coupled with 3D genome configuration. Genome Architecture Mapping (GAM) is a ligation-free, genome-wide method that maps chromatin contacts in 3D, based on measuring the frequency of locus co-segregation from an ensemble of ultra-thin nuclear slices of random orientation. To compare GAM and Hi-C, we used our new high-throughput, multiplexed GAM pipeline to produce a deep dataset from mouse embryonic stem cells, and we devised a procedure to extract contacts preferentially detected by either GAM or Hi-C. Strong contacts enriched in the GAM data contain a 2-fold amplification of feature pairs associated with TF binding (including CTCF), histone marks, and enhancers. In contrast, feature pairs enriched in Hi-C-specific contacts are characterized by heterochromatin marks (H3K9me3 and H4K20me3). In general, genomic regions with increased transcriptional activity often form GAM-detected contacts that are underestimated by Hi-C. We are currently investigating whether the differences can be explained by increased contact multiplicity, which could limit ligation-dependent detection. Our findings expand our current understanding of 3D genome folding and highlight the importance of orthogonal approaches.
03:30 PM - 03:40 PM (EDT)
3DSIG - Modeling of G protein-coupled receptor structures: Improving the prediction of loop conformations and the usability of models for structure-based drug design
G protein-coupled receptors (GPCRs) form the largest group of potential drug targets, and therefore knowledge of their three-dimensional structure is important for rational drug design. Homology modeling serves as a common approach for modeling the transmembrane helical cores of GPCRs; however, these models have varying degrees of inaccuracy that result from the quality of the template used. We have explored the extent to which inaccuracies inherent in homology models of the transmembrane helical cores of GPCRs can impact loop prediction. We found that loop prediction in GPCR models is much more difficult than loop reconstruction in crystal structures owing to the imprecise positioning of loop anchors. Therefore, minimizing the errors in loop anchors is likely to be critical for optimal GPCR structure prediction. To address this, we have developed a Ligand-Directed Modeling (LDM) method comprising geometric protein sampling and ligand docking. The method was evaluated for its capacity to refine GPCR models built across a range of templates with varying degrees of sequence similarity to the target. LDM reduced the errors in loop anchor positions and improved the prediction of ligand binding poses, resulting in much better performance of these models in virtual ligand screening.
03:40 PM - 04:00 PM (EDT)
HiTSeq: High Throughput Sequencing - Reference-guided transcript discovery and quantification for long read RNA-Seq data
Transcriptome profiling is one of the most frequently used technologies and is key to interpreting the function of the genome in human disease. However, quantification of transcript expression with short-read RNA sequencing remains challenging, as different transcripts from the same gene are often highly similar. Nanopore RNA sequencing reduces the complexity of transcriptome profiling with ultra-long reads that can cover the full length of isoforms. However, the technology has a high sequencing error rate and often generates shorter, fragmented reads due to RNA degradation, and currently no transcript quantification method specific to such data exists. Here, we present bambu, a long-read isoform discovery and quantification method. Bambu performs probabilistic assignment of reads to annotated and novel transcripts across samples to improve the accuracy of transcript expression estimates. We apply our method to cancer cell line data with spike-in controls, and compare the results with estimates obtained from short-read data. Bambu recovered annotated isoforms from spike-ins and showed consistency in gene expression estimation with existing methods for short-read RNA sequencing data, but improved accuracy in transcript expression estimation. The method is implemented in R (https://github.com/GoekeLab/bambu), enabling simple, fast, and accurate analysis of long-read transcriptome profiling data.
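Probabilistic assignment of ambiguous reads to transcripts, as described above, can be illustrated with a tiny expectation-maximization loop. This is a generic sketch of the idea, not bambu's actual model (bambu is in R and handles novel isoforms); the compatibility matrix is invented:

```python
import numpy as np

# compatibility[r, t] = 1 if read r is consistent with transcript t.
compatibility = np.array([
    [1, 1],  # ambiguous read, consistent with both isoforms
    [1, 0],  # unique to transcript 0
    [1, 1],  # ambiguous
    [0, 1],  # unique to transcript 1
    [1, 0],  # unique to transcript 0
])

theta = np.full(2, 0.5)  # initial relative transcript abundances
for _ in range(50):
    # E-step: split each read across compatible transcripts
    # in proportion to the current abundance estimates.
    weights = compatibility * theta
    weights /= weights.sum(axis=1, keepdims=True)
    # M-step: abundances are the normalized sums of fractional assignments.
    theta = weights.sum(axis=0) / weights.sum()

print("estimated abundances:", np.round(theta, 2))
```

With two reads unique to transcript 0, one unique to transcript 1, and two ambiguous, the loop converges to abundances of 2/3 and 1/3, distributing the ambiguous reads proportionally.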
03:40 PM - 04:00 PM (EDT)
Bio-Ontologies - Applying GWAS on UK Biobank by using enhanced phenotype information based on Ontology-Wide Association Study
Genome-wide association studies (GWAS) have been widely used to identify potentially causative variants of a genetic disease or trait given patient phenotypes. However, GWAS generally cannot present the complete picture, particularly of how the studied trait relates to other similar traits, because often not all of the available phenotype information is exploited in the analyses. Here, we propose to use Ontology-Wide Association Studies (OWAS) to complete the phenotype profiles of diseases and perform GWAS on the UK Biobank. More specifically, with OWAS we utilize the phenotype information that exists in the literature as well as in semantic resources to expand the GWAS to cases that are not explicitly associated with the phenotypes. Our initial results show that our approach has the potential to increase the statistical power of GWAS as well as identify associations for phenotypes that have not been explicitly observed.
03:40 PM - 04:00 PM (EDT)
TransMed - POCOVID-Net: Automatic Detection of COVID-19 From a New Lung Ultrasound Imaging Dataset (POCUS)
With the rapid development of COVID-19 into a global pandemic, there is an urgent need for cheap, fast and reliable tools that assist physicians in diagnosing COVID-19. Medical imaging can play a key role in complementing conventional diagnostic tools. Using CT or X-ray scans, several deep learning models have demonstrated promising performance. Here, we present the first framework for COVID-19 detection from ultrasound. Ultrasound is cheap, portable, non-invasive and ubiquitous in medical facilities. Our contribution is threefold. First, we gather a lung ultrasound dataset consisting of 1103 images (654 COVID-19, 277 bacterial pneumonia and 172 healthy controls). This dataset is by no means exhaustive, but we processed it to feed deep learning models and make it publicly available, thus delivering a starting point for an open-access initiative of lung ultrasound data. Second, we train a deep convolutional neural network (POCOVID-Net) in 5-fold cross-validation on this data and achieve an accuracy of 89% and, for COVID-19, a sensitivity of 0.96 (specificity 0.79). Third, we provide an open-access web service at: https://pocovidscreen.org. The website deploys not only the predictive model but also offers a data-sharing interface, simplifying data contribution for researchers and physicians. Dataset and code are available from: https://github.com/jannisborn/covid19_pocus_ultrasound
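The evaluation protocol mentioned above — per-class sensitivity and specificity under 5-fold cross-validation — can be sketched as follows. This is not POCOVID-Net: a logistic regression on synthetic features stands in for the ultrasound CNN, purely to show how the metrics are pooled across folds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(3)
# Synthetic two-class data standing in for image features; class 1 plays
# the role of the "COVID-19" class in the sensitivity/specificity report.
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

tp = fn = tn = fp = 0
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in cv.split(X, y):
    model = LogisticRegression().fit(X[train], y[train])
    pred = model.predict(X[test])
    # Pool the confusion-matrix counts over all five held-out folds.
    tp += np.sum((pred == 1) & (y[test] == 1))
    fn += np.sum((pred == 0) & (y[test] == 1))
    tn += np.sum((pred == 0) & (y[test] == 0))
    fp += np.sum((pred == 1) & (y[test] == 0))

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Stratified folds keep the class balance stable across splits, which matters when one class (here, the positive one) drives the headline sensitivity figure.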
03:40 PM - 04:00 PM (EDT)
CompMS - Focus on the spectra that matter by clustering of quantification data in shotgun proteomics
We propose a quantification-first approach for peptides in shotgun proteomics experiments that reverses the classical identification-first workflow. This prevents valuable information from being discarded prematurely in the identification stage and allows us to spend more effort on the identification process. We demonstrate that combining this with Bayesian protein quantification dramatically increases sensitivity on multiple engineered and clinical datasets. Our method, Quandenser, applies unsupervised clustering on both MS1 and MS2-level, summarizing all analytes of interest without assigning identities. This eliminates the need for redoing quantification for new search parameters/engines and reduces search time due to the data reduction. For one of the investigated datasets, using an open modification search with MODa, we assigned identities to 47% more consensus spectra while reducing search time from a week to well below a day. Furthermore, de novo searches on large MS2 spectrum clusters unveiled peptides and proteins not present in the database. Importantly, Quandenser addresses the false transfer problem by providing feature-feature match error rates using decoy features and a novel automated weighting scheme. We integrated these into our probabilistic protein quantification method, Triqler, that propagates error probabilities from feature to protein level and reduces the noise from false positives and missing values.
03:40 PM - 03:50 PM (EDT)
BioVis - Multi-Scale Procedural Animations of Microtubule Dynamics Based on Measured Data
Biologists often use computer graphics to visualize structures which, due to physical limitations, are not possible to image with a microscope. One example of such structures is microtubules, which are present in every eukaryotic cell. They are part of the cytoskeleton, maintaining the shape of the cell and playing a key role in cell division. In this paper, we propose a scientifically accurate multi-scale procedural model of microtubule dynamics as a novel application scenario for procedural animation, which can generate visualizations of their overall shape and molecular structure, as well as animations of the dynamic behaviour of their growth and disassembly. The model spans from tens of micrometers down to atomic resolution. All aspects of the model are driven by scientific data. The advantage over a traditional, manual animation approach is that when the underlying data change, for instance due to new evidence, the model can be recreated immediately. The procedural animation concept is presented in its generic form, with several novel extensions, facilitating an easy translation to other domains with emergent multi-scale behavior.
03:40 PM - 04:00 PM (EDT)
RegSys - Mustache: Multi-scale Detection of Chromatin Loops from Hi-C and Micro-C Maps using Scale-Space Representation
We present Mustache, a new method for multi-scale detection of chromatin loops from Hi-C and Micro-C contact maps using a technical advance in computer vision called scale-space theory. When applied to high-resolution Hi-C and Micro-C data, Mustache detects loops at a wide range of genomic distances, identifying structural and regulatory interactions that are supported by independent conformation capture experiments as well as by known correlates of loop formation such as CTCF binding, enhancers and promoters. Unlike the commonly used HiCCUPS tool, Mustache runs on general-purpose CPUs and it is very time efficient with a runtime of only a few minutes per chromosome for 5kb-resolution human genome contact maps. Extensive experimental results show that Mustache reports two to three times the number of HiCCUPS loops, which are reproducible across replicates. It also recovers a larger proportion of published ChIA-PET and HiChIP loops than HiCCUPS. A comparative analysis of Mustache’s experimental results on Hi-C and Micro-C data confirms strong agreement between the two datasets with Micro-C providing better power for loop detection. Overall, our experimental results show that Mustache enables a more efficient and comprehensive analysis of the chromatin looping from high-resolution Hi-C and Micro-C datasets. Mustache is freely available at https://github.com/ay-lab/mustache.
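The scale-space idea behind the abstract above can be illustrated with a toy contact map (this is not Mustache's implementation; the matrix, blob position, and scales are invented). A loop appears as a local enrichment whose difference-of-Gaussians response — a standard approximation to the scale-normalized Laplacian used in scale-space blob detection — peaks at some smoothing scale:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(1)
# Toy contact map: Poisson background plus one enriched 3x3 "loop" patch.
contact_map = rng.poisson(lam=2.0, size=(60, 60)).astype(float)
contact_map[40:43, 10:13] += 25.0  # simulated loop enrichment near (41, 11)

# Difference of Gaussians at increasing scales; a bright blob gives a
# positive response that is strongest near its characteristic scale.
scales = [1.0, 2.0, 4.0]
responses = []
for s in scales:
    dog = gaussian_filter(contact_map, s) - gaussian_filter(contact_map, 1.6 * s)
    responses.append(dog)

# The maximum over position AND scale marks the candidate loop.
stack = np.stack(responses)
scale_idx, row, col = np.unravel_index(np.argmax(stack), stack.shape)
print("detected blob near", (row, col), "at scale", scales[scale_idx])
```

Searching over scales jointly is what makes the detection "multi-scale": loops of different sizes maximize the response at different smoothing levels.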
03:40 PM - 03:50 PM (EDT)
MICROBIOME - Studying the dynamics of the gut microbiota using metabolically stable isotopic labeling and metaproteomics
The gut microbiome and its metabolic processes are dynamic systems. Surprisingly, our understanding of gut microbiome dynamics is limited. Here we report a metaproteomic workflow that involves protein stable isotope probing (protein-SIP) and identification/quantification of partially labeled peptides. We also developed a package, which we call MetaProfiler, that corrects for false identifications and performs phylogenetic and time series analysis for the study of microbiome dynamics. From the stool sample of five mice that were fed with 15N hydrolysate from Ralstonia eutropha, we identified 15,297 non-redundant unlabeled peptides of which 10,839 of their heavy counterparts were quantified. These results revealed that i) isotope incorporation in proteins differed between taxa, ii) the rate of protein synthesis was lower in the microbiota than in mice, and iii) differences in protein synthesis appeared across protein functions. Interestingly, the phylum Verrucomicrobia and the genera Akkermansia, Lactobacillus, and Ruminococcus had not reached a plateau of isotopic incorporation 43 days after the continuous introduction of the isotope. Altogether, our study provides an efficient workflow for the study of dynamics of gut microbiota, and our findings helped better understand the complex host-microbiome interactions.
03:40 PM - 03:50 PM (EDT)
NetBio: Network Biology - The Reactome Pathway Knowledgebase: Variants, Dark Proteins and Functional Interactions
Reactome is an open-access, open-source pathway knowledgebase. Its holdings now comprise 12,986 human reactions organized into 2,362 pathways involving 10,908 proteins, 1,865 small molecules, 237 drugs, and 12,206 complexes. 31,237 literature references support these annotations. The roles of variant forms of some proteins, both germline and somatically arising, have been annotated into disease-variant types of reactions, along with additional reactions that capture the effects of small-molecule drugs on these disease processes. To support different visualization and analysis approaches, we implemented several new features through our website, tools, and ReactomeFIViz Cytoscape app, such as gene set analysis (GSA), an R interface, a Python client, and an intuitive genome-wide results overview based on Voronoi maps. Furthermore, to increase Reactome adoption within the research community, we developed portals and web services for specific user communities. As part of the Illuminating the Druggable Genome (IDG) program, we have undertaken to place understudied (Tdark) proteins in the Reactome pathway context, providing useful contextual information that helps experimental biologists design experiments to understand these proteins’ functions. Reactome thus provides powerful pathway- and network-based tools for analyzing multiple data sets and types.
03:40 PM - 03:50 PM (EDT)
3DSIG - Nanocapsule Designs for Antimicrobial Resistance
Antimicrobial resistance and drug delivery have been major focuses of recent medical research. Recently, engineered virus-like nanocapsules derived from synthetic multi-branched peptides have been shown to promote bacterial membrane poration while also being suitable for gene delivery [1]. The atomistic details of nanocapsule assembly, necessary for the antimicrobial and gene delivery activities, are not accessible to experimental techniques. Therefore, the nanocapsule's stability in water and its interaction with a model membrane were studied through Molecular Dynamics simulations, comparing the results with the available experimental data [2]. Integrated results from simulations at different resolutions highlighted the role of the amphiphilic structure of capzip as the main driver of assembly stability. Moreover, the simulations revealed a strong affinity for a bacterial model membrane and a weaker one for a mammalian membrane. This results in bacterial membrane poration in the presence of an electric field, a process triggered by the insertion of arginine residues, which are abundant in the structure. This investigation shows the essential role of computational techniques in rationalizing experimental results and suggests how to manipulate the capzip composition in order to trigger particular functions. 1. Chem. Sci., 7(3):1707–1711, 2016. 2. ACS Nano, 14(2):1609-1622, 2020.
03:40 PM - 03:50 PM (EDT)
Vari - Gene family information facilitates classification of disease-causing variants and identification of pathogenic variant enriched regions
Classifying the pathogenicity of missense variants represents a major challenge in clinical practice. While orthologous gene conservation is commonly employed in variant annotation, approximately 80% of known disease-associated genes belong to gene families. We empirically evaluated whether paralog-conserved or non-conserved sites in human gene families are important in neurodevelopmental diseases (NDDs) and demonstrated that disease-associated missense variants are enriched at paralog-conserved sites across all disease groups and inheritance models tested. We developed a gene family de novo enrichment framework that identified 43 exome-wide enriched gene families, including 98 de novo variant-carrying genes in NDD (Lal et al., 2020). Essential regions for protein function are conserved among gene-family members, and genetic variants within these regions are potentially more likely to confer disease risk. We explored whether gene family information could support the identification of novel disease-related regions within proteins. We compared 2,219,811 variants from the general population to 76,153 variants from patients. With this gene-family approach, we identified 465 regions enriched for patient variants in 1252 genes. We found that missense variants inside the identified regions are 106-fold more likely to be classified as pathogenic than benign (Pérez-Palma et al., 2020).
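The region-enrichment comparison described above — patient variants versus general-population variants inside a candidate protein region — can be sketched as a contingency-table test. The counts below are hypothetical, and a Fisher exact test stands in for whatever statistic the authors actually used:

```python
from scipy.stats import fisher_exact

# Hypothetical counts of missense variants inside/outside one candidate region.
patient_in, patient_out = 30, 70          # patient cohort
population_in, population_out = 40, 960   # general population

# One-sided test: are patient variants over-represented inside the region?
odds_ratio, p_value = fisher_exact(
    [[patient_in, patient_out], [population_in, population_out]],
    alternative="greater",
)
print(f"odds ratio={odds_ratio:.1f}, p={p_value:.2g}")
```

Repeating such a test across sliding or paralog-aligned regions, with multiple-testing correction, is the general pattern for calling "patient-variant-enriched" regions.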
03:50 PM - 04:00 PM (EDT)
MICROBIOME - Phenotypic characterization of complex microbial communities
Metabolic capabilities (phenotypes) of each microbial species are defined by the presence or absence of pathways encoded in their respective genomes. We reconstructed >70 metabolic pathways in >2,600 reference genomes of bacteria representing the human gut microbiome (HGM) and assigned metabolic phenotypes for (i) utilization of primary sources of energy/carbon (sugars, amino acids); (ii) synthesis of essential nutrients (vitamins/cofactors, amino acids); and (iii) excretion of fermentation end-products (short-chain fatty acids). Capturing these phenotypes in a simple binary (1/0) phenotype matrix (BPM) facilitates comparative analysis of the cumulative metabolic potential of microbial communities. To enable metabolic phenotype profiling of microbiomes, we established a computational pipeline converting 16S metagenomic profiles into Community Phenotype Profiles comprising a Community Phenotype Index (CPI) that represents the fractional representation of each “1” phenotype (vitamin prototrophs, sugar utilizers, etc.). We applied this approach to assess the distribution of metabolic capabilities in several large HGM datasets from healthy and sick subjects. We also introduce the concept of phenotypic diversity as the diversity of the subcommunity of organisms possessing a particular metabolic phenotype. The obtained functional diversity metrics (alpha and beta diversity of phenotypes) reflect phenotype distribution in microbiome samples and allow training machine learning models for sample classification.
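The binary phenotype matrix and Community Phenotype Index concepts can be sketched numerically. Everything here is illustrative — the taxa, phenotypes, matrix entries and abundances are invented, not taken from the study — but it shows how a CPI falls out of a BPM weighted by a 16S abundance profile:

```python
import numpy as np

taxa = ["Bacteroides", "Akkermansia", "Lactobacillus"]
phenotypes = ["butyrate_production", "B12_synthesis"]

# Binary phenotype matrix (taxa x phenotypes):
# 1 = pathway present in the reference genome, 0 = absent.
bpm = np.array([
    [1, 0],  # Bacteroides
    [0, 1],  # Akkermansia
    [1, 1],  # Lactobacillus
])

# Relative abundances from a hypothetical 16S profile (sum to 1).
abundance = np.array([0.5, 0.3, 0.2])

# CPI: abundance-weighted fraction of the community carrying each
# "1" phenotype.
cpi = abundance @ bpm
for name, value in zip(phenotypes, cpi):
    print(f"CPI[{name}] = {value:.2f}")
```

Comparing such CPI vectors across samples is what turns a taxonomic profile into a functional, phenotype-level description of a community.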
03:50 PM - 04:00 PM (EDT)
BioVis - Towards an Immersive Analytics Application for Anatomical Fish Brain Data
Here we discuss our first prototype of an application for the analysis and visualisation of structural brain image data. The application is part of a processing pipeline to help researchers gain new insights by turning raw anatomical brain data into quantitative 3D representations. The pipeline has been optimised to process anatomical fish brain data obtained at a synchrotron imaging facility. The data is phase-retrieved, reconstructed and segmented before a three-dimensional mesh model is created. After being imported into the application, a model can be visualised in an immersive Virtual Reality environment on the HTC Vive headset. Each model can then be analysed by conducting a series of calibrated distance and volume measurements. The application is accompanied by an ImageJ plug-in, which supports users with image segmentation and model pre-processing.
03:50 PM - 04:00 PM (EDT)
NetBio: Network Biology - Multiscale Co-expression in the Brain
The BRAIN Initiative Cell Census Network (BICCN) single-cell RNA-sequencing datasets provide an unparalleled opportunity to understand how gene-gene relationships shape cell identity. We study gene-gene relationships by measuring the co-variation of gene expression across samples. Because shared expression patterns are thought to reflect shared function, co-expression networks describe the functional relationships between all genes. The heterogeneity of cell types in bulk RNA-seq samples creates connections in co-expression networks that potentially obscure the identification of co-regulatory modules. Comparison of a bulk RNA-seq network built from over 2,000 mouse brain samples from 52 studies to aggregate scRNA-seq co-expression networks, made from the 500,000 cells/nuclei across the 7 BICCN datasets, shows consistent topology and co-regulatory signal of reference gene sets and marker gene sets. Differential signals between broad cell classes persist in driving variation at finer levels, indicating that convergent regulatory processes affect cell phenotype at multiple scales.
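The aggregate co-expression approach the abstract alludes to can be sketched as follows. This is a generic illustration inferred from the description (not BICCN code): build a rank-standardized correlation network per dataset, then average the networks so that reproducible gene-gene relationships dominate dataset-specific noise:

```python
import numpy as np
from scipy.stats import rankdata

def coexpression_network(expr):
    """expr: samples x genes; returns a rank-standardized gene x gene network."""
    corr = np.corrcoef(expr, rowvar=False)
    ranks = rankdata(corr).reshape(corr.shape)  # ranks make datasets comparable
    return ranks / ranks.max()

rng = np.random.default_rng(2)
n_samples, n_genes = 200, 5
aggregate = np.zeros((n_genes, n_genes))
for _ in range(3):  # three toy "datasets" sharing one co-regulated gene pair
    expr = rng.normal(size=(n_samples, n_genes))
    expr[:, 1] = expr[:, 0] + 0.1 * rng.normal(size=n_samples)  # genes 0,1 co-vary
    aggregate += coexpression_network(expr)
aggregate /= 3

np.fill_diagonal(aggregate, 0)  # ignore trivial self-edges
i, j = np.unravel_index(np.argmax(aggregate), aggregate.shape)
print("strongest aggregate co-expression:", sorted((int(i), int(j))))
```

Rank-standardizing before averaging keeps any single dataset's correlation scale from dominating the aggregate network.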
03:50 PM - 04:00 PM (EDT)
3DSIG - GeoMine: A Web-Based Tool for Chemical Three-Dimensional Searching of the PDB
The relative arrangement of functional groups and the shape of protein binding sites are the key elements for comprehending a protein’s function. Interactive searching for these three-dimensional patterns is an important tool in life science research, but highly challenging from a computational point of view. This problem is addressed by only a few tools, which are limited in terms of query variability, adjustable search sets, retrieval speed and user-friendliness. Here, we present GeoMine, a computational approach enabling spatial geometric queries with full chemical awareness on a regularly updated database containing protein-ligand interfaces of the entire PDB. Due to the use of modern algorithms and database technologies, reasonable queries can be answered in at most a few minutes. With a GeoMine query, almost any relative atom arrangement can be searched. GeoMine is implemented as a publicly available web service within ProteinsPlus (https://proteins.plus). The user interface provides an interactive 3D panel that allows easy design of queries, either from scratch or based on a 3D representation of an existing protein-ligand complex. GeoMine opens a plethora of data analytics opportunities for protein structures, a few of which are showcased in this presentation.
03:50 PM - 04:00 PM (EDT)
Vari - A novel framework to evaluate deleteriousness of synonymous variants
Due to the lack of experimental evaluations of variant effects, existing predictors for synonymous single nucleotide variants (sSNVs) often rely on databases of variant-disease associations, which are limited for establishing functional effects. Large-scale genome-sequencing efforts have observed roughly 4 million sSNVs (real variants), whereas 58 million possible sSNVs have not been observed. We expect that a large fraction of the latter will be found with more sequencing (not-yet-seen variants), while others are either purified due to extreme deleteriousness or unobservable due to physicochemical or sequencing constraints. We further highlight the fact that the not-yet-seen variants are likely similar to the real variants, representing a wide range of functional effects, although the former are probably enriched in more deleterious consequences. We first built a model to identify the not-yet-seen variants among all the unobserved ones. We then trained a second model to differentiate the not-yet-seen and real sets of variants. We further trained the final model using the real and not-yet-seen variants, subset by cutoffs determined by common variants and pathogenic variants, respectively. Our final model outperforms currently available sSNV predictors in differentiating experimentally verified pathogenic sSNVs from real variants in the testing set.
04:00 PM - 04:20 PM (EDT)
Bio-Ontologies - hPSCreg-CLO: ontological representation of human pluripotent stem cell lines from the hPSCreg
Human pluripotent stem cells (PSCs) are immortal, represent the genotype of the donor and can differentiate into all cell types of the human body. These features establish their enormous potential for modelling diseases and tissues in vitro, drug and toxicity testing, and regenerative medicine. To translate this potential into reality, large numbers of PSC lines are being generated from a wide spectrum of donors to make them available for diverse applications. For users to identify suitable PSC lines, information about the donors, cell generation, characterization and quality is essential. The human pluripotent stem cell registry hPSCreg contains more than 3,000 cell lines that are richly annotated. To make the hPSCreg resource more accessible and interoperable, we developed hPSCreg-CLO, a new CLO branch that represents various hPSC lines from hPSCreg. hPSC-specific design patterns were generated and used to support computer-assisted ontology development. The hPSCreg-CLO includes over 2,400 hPSC lines and their related information such as cell donors, anatomical entities, and original cell types. DL queries were performed to demonstrate the query capability of hPSCreg-CLO. hPSCreg-CLO will further be integrated with the hPSCreg project and support database data integration and advanced analyses.
04:00 PM - 04:20 PM (EDT)
BioVis - ClonArch: Visualizing the Spatial Clonal Architecture of Tumors
Motivation: Cancer is caused by the accumulation of somatic mutations that lead to the formation of distinct populations of cells, called clones. The resulting clonal architecture is the main cause of relapse and resistance to treatment. With decreasing costs in DNA sequencing technology, rich cancer genomics datasets with many spatial sequencing samples are becoming increasingly available, enabling the inference of high-resolution tumor clones and prevalences across different spatial coordinates. While temporal and phylogenetic aspects of tumor evolution, such as clonal evolution over time and clonal response to treatment, are commonly visualized in various clonal evolution diagrams, visual analytics methods that reveal the spatial clonal architecture are missing. Results: This paper introduces ClonArch, a web-based tool to interactively visualize the phylogenetic tree and spatial distribution of clones in a single tumor mass. ClonArch uses the marching squares algorithm to draw closed boundaries representing the presence of clones in a real or simulated tumor. ClonArch enables researchers to examine the spatial clonal architecture of a subset of relevant mutations at different prevalence thresholds and across multiple phylogenetic trees. In addition to simulated tumors with varying numbers of biopsies, we demonstrate the use of ClonArch on a hepatocellular carcinoma tumor with 280 sequencing biopsies. ClonArch provides an automated way to interactively examine the spatial clonal architecture of a tumor, facilitating clinical and biological interpretations of the spatial aspects of intra-tumor heterogeneity. Availability: https://github.com/elkebir-group/ClonArch
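The marching squares step ClonArch relies on classifies each 2x2 cell of a scalar grid by which corners lie above a threshold; cells with mixed corners carry the boundary segments. A minimal sketch of that case-index computation (the prevalence grid and threshold below are hypothetical, not ClonArch data):

```python
def marching_squares_cases(grid, threshold):
    """For each 2x2 cell of a scalar grid, compute the 4-bit marching
    squares case index (0 = all corners below threshold, 15 = all above).
    Cells with index 1..14 lie on the clone-presence boundary."""
    rows, cols = len(grid), len(grid[0])
    cases = {}
    for r in range(rows - 1):
        for c in range(cols - 1):
            idx = 0
            # corner bit order: top-left, top-right, bottom-right, bottom-left
            if grid[r][c] >= threshold:
                idx |= 8
            if grid[r][c + 1] >= threshold:
                idx |= 4
            if grid[r + 1][c + 1] >= threshold:
                idx |= 2
            if grid[r + 1][c] >= threshold:
                idx |= 1
            cases[(r, c)] = idx
    return cases

# Hypothetical clone-prevalence grid (fraction of cells carrying a clone)
prev = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.3],
    [0.0, 0.4, 0.1],
]
cases = marching_squares_cases(prev, threshold=0.5)
boundary = [cell for cell, idx in cases.items() if 0 < idx < 15]
print(boundary)
```

In the full algorithm, each nonzero, non-15 case index maps to one or two line segments per cell; chaining those segments yields the closed clone boundaries the tool draws.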
04:00 PM - 04:20 PM (EDT)
RegSys - Deciphering the role of 3D genome organization in breast cancer susceptibility
Cancer risk from environmental exposure is modulated by an individual’s genetics and age at exposure. This age-specific period of susceptibility is referred to as a “Window of Susceptibility” (WOS). Radiation exposure poses a high breast cancer risk for women exposed between early childhood and young adulthood, and the risk is reduced by the mid-30s. Rats have a similar WOS for developing mammary cancer. Previous studies have identified a looping interaction between a genomic region and a known cancer gene, PAPPA. However, the global role of three-dimensional genome organization in the WOS is not known. Therefore, we generated Hi-C and RNA-seq data in rat mammary epithelial cells within and outside the WOS. We compared the temporal changes in chromosomal looping to those in expression and find that interactions with significantly higher counts within the WOS are significantly enriched for differentially expressed genes that are higher in the WOS. To systematically identify differential domains of interactions, we leveraged symmetric non-negative matrix factorization. This revealed clusters of dynamic regions that change their chromosome conformation between the two time points. Our results suggest that WOS-specific changes in 3D genome organization are linked to transcriptional changes that may increase susceptibility to breast cancer at an early age.
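Symmetric non-negative matrix factorization, the clustering tool mentioned above, factors a nonnegative similarity (or contact) matrix A as A ≈ HHᵀ with H ≥ 0. A minimal sketch using the damped multiplicative update of Kuang et al.; the toy similarity matrix and cluster count are invented, not the authors' Hi-C data:

```python
import numpy as np

def symnmf(A, k, iters=500, beta=0.5, seed=0):
    """Symmetric NMF: factor a nonnegative n x n similarity matrix A
    as A ~= H @ H.T with H >= 0 of shape (n, k). Rows of H give soft
    cluster memberships; argmax over columns assigns hard clusters."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    H = rng.random((n, k))
    for _ in range(iters):
        num = A @ H                      # gradient numerator
        den = H @ (H.T @ H) + 1e-9       # gradient denominator
        H *= 1.0 - beta + beta * num / den  # damped multiplicative step
    return H

# Hypothetical contact-similarity matrix with two obvious blocks
A = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
])
H = symnmf(A, k=2)
print(H.argmax(axis=1))
```

Applied to a difference of within- and outside-WOS contact maps, the resulting memberships would group genomic regions whose conformation changes together between the two time points.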
04:00 PM - 04:20 PM (EDT)
CompMS - Selection of features with consistent profiles improves relative protein quantification in mass spectrometry experiments
In bottom-up mass spectrometry-based proteomics, relative protein quantification is often achieved with data-dependent acquisition (DDA), data-independent acquisition (DIA), or selected reaction monitoring (SRM). These workflows quantify proteins by summarizing the abundances of all the spectral features of the protein (e.g., precursor ions, transitions or fragments) in a single value per protein per run. When abundances of some features are inconsistent with the overall protein profile (for technological reasons such as interferences, or for biological reasons such as post-translational modifications), the protein-level summaries and the downstream conclusions are undermined. We propose a statistical approach that automatically detects spectral features with such inconsistent patterns. The detected features can be separately investigated, and if necessary removed from the dataset. We evaluated the proposed approach on a series of benchmark controlled mixtures and biological investigations with DDA, DIA and SRM data acquisitions. The results demonstrated that it can facilitate and complement manual curation of the data. Moreover, it can improve the estimation accuracy, sensitivity and specificity of detecting differentially abundant proteins, and reproducibility of conclusions across different data processing tools. The approach is implemented as an option in the open-source R-based software MSstats. This work was accepted for publication in Molecular & Cellular Proteomics.
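The core idea — a feature is suspect when its run-to-run profile disagrees with the protein's consensus — can be illustrated with a much simpler stand-in than the statistical model in MSstats: score each feature's correlation against the per-run median of the other features. The fragment names, abundances, and cutoff are all hypothetical:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two run profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x)) or 1e-12
    sy = sqrt(sum((b - my) ** 2 for b in y)) or 1e-12
    return cov / (sx * sy)

def flag_inconsistent_features(features, min_r=0.5):
    """Flag spectral features whose across-run profile disagrees with
    the consensus (per-run median) profile of the other features."""
    names = sorted(features)
    flagged = []
    for name in names:
        others = [features[f] for f in names if f != name]
        consensus = [sorted(vals)[len(vals) // 2]
                     for vals in zip(*others)]  # per-run median
        if pearson(features[name], consensus) < min_r:
            flagged.append(name)
    return flagged

# Hypothetical log-abundances of three fragments across five runs
features = {
    "frag1": [10.0, 10.5, 11.2, 12.0, 12.4],
    "frag2": [9.8, 10.4, 11.0, 11.9, 12.5],
    "frag3": [11.0, 10.9, 8.0, 11.1, 7.9],  # interference in runs 3 and 5
}
print(flag_inconsistent_features(features))
```

Removing a flagged feature before protein-level summarization keeps an interfered fragment from dragging the summary away from the true protein profile.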
04:00 PM - 04:20 PM (EDT)
HiTSeq: High Throughput Sequencing - Improving RNA-seq mapping and haplotype-specific transcript inference using variation graphs
Current methods for analyzing RNA-seq data are generally based on first mapping the reads to a reference genome or a known set of reference transcripts. However, this approach can bias read mappings toward the reference, which negatively affects downstream analyses such as haplotype-specific expression quantification. One way to mitigate this reference bias is to use variation graphs, which contain both the primary reference and known genetic variants. For RNA-seq data specifically, variation graphs can also be augmented with splice junctions and haplotype-specific transcripts can be embedded as paths. In this work, we introduce a pipeline based on the variation graph (vg) toolkit for both mapping RNA-seq data to spliced variation graphs and inferring the expression of known haplotype-specific transcripts from the mapped reads. We demonstrate that spliced variation graphs reduce reference bias and show that vg improves mapping of RNA-seq data compared to other mapping algorithms. We also demonstrate that our novel method, rpvg, can accurately estimate expression among millions of haplotype-specific transcripts derived from the GENCODE transcript annotation and the haplotypes from the 1000 Genomes Project.
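A variation graph in miniature: nodes hold sequence, edges connect compatible alleles, and haplotype-specific transcripts are embedded as node-ID paths. The toy graph, haplotype names, and substring-matching "mapper" below are invented for illustration and bear no relation to vg's actual data structures or alignment algorithm:

```python
# Tiny variation graph: nodes hold sequence; nodes 2 and 3 are the two
# alleles of a SNP, so both haplotypes share the flanking nodes 1 and 4.
graph = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Haplotype-specific transcripts embedded as paths through the graph
haplotypes = {
    "hap_ref": [1, 2, 4],
    "hap_alt": [1, 3, 4],
}

def path_seq(path):
    """Spell out the sequence of a haplotype path through the graph."""
    assert all((a, b) in edges for a, b in zip(path, path[1:]))
    return "".join(graph[n] for n in path)

def supports(read, path):
    """A read supports a haplotype if it matches along its path
    (a crude stand-in for the scored alignment a real mapper does)."""
    return read in path_seq(path)

read = "GTGT"  # spans the SNP and matches the alt allele G
print({name: supports(read, path) for name, path in haplotypes.items()})
```

Because the alt allele is a first-class node rather than a mismatch against the reference, reads carrying it align cleanly, which is precisely how the graph reduces reference bias.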
04:00 PM - 04:20 PM (EDT)
3DSIG - Generating Property-Matched Decoy Molecules Using Deep Learning
An essential component in the development of structure-based virtual screening methods is the datasets or benchmarks used for training and testing. These typically consist of experimentally verified active molecules together with assumed inactive molecules, known as decoys. However, the decoy molecules used in such sets have been shown to exhibit substantial bias in basic chemical properties. In some cases, there is evidence to suggest that some structure-based methods are simply exploiting this bias, rather than learning how to perform molecular recognition. The use of biased decoy molecules is therefore preventing generalisation and hindering the development of structure-based virtual screening methods. We have developed a deep learning method to generate property-matched decoy molecules, called DeepCoy. This eliminates the need to use a database to search for molecules and allows decoys to be generated for the requirements of a particular active molecule. Using DeepCoy-generated molecules reduced the bias in basic physicochemical properties of such decoy molecules by 78% and 65% in the DUD-E and DEKOIS 2.0 databases, respectively. We believe that this substantial reduction in bias will benefit the development and improve generalisation of structure-based virtual screening methods.
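DeepCoy generates decoys rather than retrieving them, but the property-matching criterion itself is easy to see in the database-search baseline it replaces: rank candidates by distance in basic physicochemical properties and keep the closest. All molecule names, property values, and scale factors below are made up:

```python
def pick_decoys(active, candidates, k=2):
    """Rank candidate molecules by closeness of basic physicochemical
    properties (molecular weight, logP, H-bond donors/acceptors) to the
    active molecule, and keep the k best-matched as decoys."""
    props = ["mw", "logp", "hbd", "hba"]
    scales = {"mw": 100.0, "logp": 1.0, "hbd": 1.0, "hba": 1.0}

    def dist(cand):
        # Weighted L1 distance over the property vector
        return sum(abs(active[p] - cand[p]) / scales[p] for p in props)

    return sorted(candidates, key=dist)[:k]

# Hypothetical property tables
active = {"name": "actv1", "mw": 320.0, "logp": 2.5, "hbd": 2, "hba": 4}
candidates = [
    {"name": "c1", "mw": 310.0, "logp": 2.3, "hbd": 2, "hba": 4},
    {"name": "c2", "mw": 550.0, "logp": 5.0, "hbd": 5, "hba": 9},
    {"name": "c3", "mw": 330.0, "logp": 2.8, "hbd": 2, "hba": 5},
    {"name": "c4", "mw": 150.0, "logp": -1.0, "hbd": 0, "hba": 1},
]
print([c["name"] for c in pick_decoys(active, candidates)])
```

The bias the abstract measures arises exactly when such distances are systematically large; a generative model can drive them toward zero without being limited by what the candidate pool happens to contain.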
04:00 PM - 04:20 PM (EDT)
TransMed - Drug repurposing to improve health and lifespan in humans
Model organism studies have demonstrated the possibility of lifespan extension of up to 10-fold through genetic interventions. Although the effect sizes are smaller, several drugs have also been shown to modulate lifespan and health during ageing in model organisms. Translating this information to humans, however, is challenging and requires further investigation. In this study, we perform a comparative and integrative analysis of different drug repurposing studies for human ageing together with the known lifespan modulators in model organisms. We use two different approaches we developed: i) targeting genes whose expression changes during ageing in humans, and ii) targeting genes associated with an increased risk of multiple late-onset diseases. The first set included a significant number of known lifespan modulators, which also improve health in model organisms. However, drugs targeting multiple diseases did not overlap with the known pro-longevity drugs. This offers new avenues to explore experimentally. Through a systems-level analysis of the targeted pathways and their regulators, we aim to elucidate the mechanisms of lifespan modulation that can also improve health in the elderly.
04:00 PM - 04:10 PM (EDT)
MICROBIOME - Introduction to CAMI
04:00 PM - 04:10 PM (EDT)
Vari - Missense variants in health and disease affect distinct functional pathways and proteomics features
Missense variants are present amongst the healthy population, but some are causative of human diseases. A deeper understanding of the nature of missense variants in health and disease and their underlying biophysical features is essential to better distinguish pathogenic from population variants. Here we quantify variant enrichment across full-length proteins, domains and 3D-structure defined regions, and integrate this with available transcriptomic and proteomic (half-life, thermal stability, abundance) data. We have mined a rich set of molecular features which separate pathogenic and population variants: pathogenic variants mainly affect proteins involved in cell proliferation and nucleotide processing, localise to protein cores and interaction interfaces, and are enriched in abundant proteins. Contrary to other studies, we find that rare population variants display molecular features which are closer to common than pathogenic variants. We validate these molecular features indicative of variant pathogenicity by comparing against existing in silico impact annotations. This study reveals molecular principles of the sensitivity of proteins towards missense variants. This could be useful in predicting variant deleteriousness, and prioritising protein domains for therapeutic development. The ZoomVar (http://fraternalilab.kcl.ac.uk/ZoomVar) database has been created for large-scale annotation of variants onto protein structures and calculation of variant enrichment across protein structural regions.
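Variant enrichment in a structural region is commonly quantified as an odds ratio against a background variant set. A minimal sketch with a pseudocount; the counts are invented and this may differ from the exact statistic the authors compute:

```python
from math import log

def enrichment_odds_ratio(in_region, out_region, in_bg, out_bg):
    """Odds ratio for variant enrichment in a structural region
    (e.g., the protein core) relative to a background variant set,
    with a 0.5 pseudocount to avoid division by zero."""
    a, b = in_region + 0.5, out_region + 0.5
    c, d = in_bg + 0.5, out_bg + 0.5
    return (a / b) / (c / d)

# Hypothetical counts: pathogenic variants inside vs outside the protein
# core, with population variants as the background set
odds = enrichment_odds_ratio(in_region=40, out_region=60,
                             in_bg=15, out_bg=85)
print(round(odds, 2), round(log(odds, 2), 2))  # OR > 1 => core enrichment
```

Computed per domain or per region type, log2 odds ratios of this kind are what allow the "localise to protein cores and interaction interfaces" comparison between pathogenic and population variants.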
04:00 PM - 04:10 PM (EDT)
NetBio: Network Biology - A comparison of normalization and transformation techniques for constructing gene co-expression networks from RNA-seq data
Constructing gene co-expression networks is a powerful approach for analyzing high-throughput gene expression data towards module finding, gene function prediction, and disease-gene prioritization. While optimal workflows for constructing co-expression networks – including good choices for data pre-processing, normalization, and network transformation – have been developed for microarray-based expression data, such well-tested choices do not exist for RNA-seq data. Almost all studies that compare data processing/normalization methods for RNA-seq focus on the end goal of determining differential gene expression. Here, we present a comprehensive benchmarking of 30 different workflows, each with a unique set of normalization and network transformation methods, for constructing co-expression networks from RNA-seq datasets. We test all these workflows on both large, homogenous data (Genotype-Tissue Expression project) and small, heterogeneous datasets from various labs (submitted to the Sequence Read Archive). Our results demonstrate that choosing the between-sample normalization method has the biggest impact, with trimmed mean of M-values or upper quartile normalization producing networks that most accurately recapitulate known tissue-naive and tissue-specific gene functional relationships. Furthermore, we provide insights as to when other methods should be used and which experimental factors, including sample size, noticeably affect network accuracy.
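Upper-quartile normalization, one of the two best-performing between-sample choices reported, rescales each sample so the 75th percentile of its nonzero gene counts is equal across samples. A minimal sketch (the count matrix is hypothetical):

```python
def upper_quartile_normalize(counts):
    """Scale each sample's counts so that the upper quartile (75th
    percentile) of its nonzero genes is equal across samples."""
    def uq(sample):
        nonzero = sorted(v for v in sample if v > 0)
        return nonzero[int(0.75 * (len(nonzero) - 1))]

    factors = [uq(s) for s in counts]
    target = sum(factors) / len(factors)  # common scale for all samples
    return [[v * target / f for v in s] for s, f in zip(counts, factors)]

# Hypothetical counts: 2 samples x 5 genes; sample 2 sequenced twice as deep
counts = [
    [10, 0, 20, 30, 40],
    [20, 0, 40, 60, 80],
]
norm = upper_quartile_normalize(counts)
print(norm)
```

After normalization the two samples' profiles coincide, so the sequencing-depth difference no longer masquerades as co-expression signal when correlations are computed downstream.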
04:10 PM - 04:20 PM (EDT)
Vari - Calibrating variant-scoring methods for clinical decision making
Identifying pathogenic variants and annotating them is a major challenge in human genetics, especially for the non-coding ones. Several tools have been developed and used to predict the functional effect of genetic variants. However, the calibration assessment of the predictions has received little attention. Calibration refers to the idea that if a model predicts a group of variants to be pathogenic with a probability P, it is expected that the same fraction P of true positives is found in the observed set. This problem is relevant since poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making. We evaluated the calibration and the prediction performance of four predictors that directly furnish a probability as output (DANN, DeepSea, FATHMM-MKL and PhD-SNPg). Despite the fact that they show similar performances in terms of AUC, most of them are not well calibrated. We also tested two methods that provide only raw scores, applying calibration transformations to them (CADD and Eigen). We showed that CADD can be well-calibrated and can be used, after this transformation, as a probability score. Among the predictors tested, PhD-SNPg provides the best calibration without any transformation of the scores, while CADD is the best predictor after calibration.
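The calibration notion defined in the abstract can be checked empirically with a binned reliability computation: within each probability bin, compare the mean predicted probability to the observed fraction of positives. A minimal sketch (the scores and labels are made up, and the bin count is arbitrary):

```python
def calibration_curve(probs, labels, n_bins=5):
    """Bin predicted pathogenicity probabilities and pair each bin's
    mean prediction with the observed fraction of positives. A well
    calibrated predictor yields pairs lying near the diagonal."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 to last bin
        bins[i].append((p, y))
    curve = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            curve.append((round(mean_p, 2), round(frac_pos, 2)))
    return curve

# Hypothetical scores from a reasonably calibrated predictor
probs  = [0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9]
labels = [0,   0,   0,   0,   1,   1,   1,   1,   1,   0]
print(calibration_curve(probs, labels))
```

A raw-score method like CADD has no probabilistic output to plot this way; the calibration transformation the abstract applies maps its scores onto the probability scale first, after which the same diagnostic applies.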
04:10 PM - 04:20 PM (EDT)
NetBio: Network Biology - Supervised prediction of aging-related genes from a dynamic context-specific protein interaction subnetwork
Human aging is linked to many diseases. Because the aging process is influenced by genetic factors, it is important to identify human aging-related genes. We focus on supervised prediction of such genes. Gene expression-based methods for this purpose study genes in isolation from each other. While protein-protein interaction (PPI) network-based methods account for interactions between genes' protein products, current PPI network data are context-unspecific, spanning different biological conditions. Instead, we analyze an aging-specific subnetwork of the entire context-unspecific PPI network, obtained by integrating aging-specific gene expression data and PPI network data. We are the first to propose a supervised learning method for predicting aging-related genes from an aging-specific PPI subnetwork. In a comprehensive evaluation, we find that: (i) using an aging-specific subnetwork yields more accurate aging-related gene predictions than using the entire context-unspecific network, (ii) using a dynamic aging-specific subnetwork is superior to using all static aging-specific subnetworks, and (iii) predictive methods that we propose outperform existing methods for the same purpose. Our best method achieves impressively high accuracy of 90%-95% (depending on the measure), compared to 72%-80% by the next best method. Our method could guide with high confidence the discovery of novel aging-related genes for wet lab validation.
04:10 PM - 04:20 PM (EDT)
MICROBIOME - Assembly results for second round of CAMI challenges
04:20 PM - 04:40 PM (EDT)
HiTSeq: High Throughput Sequencing - Terminus enables the discovery of data-driven, robust transcript groups from RNA-seq data
Motivation: Advances in sequencing technology, inference algorithms and differential testing methodology have enabled transcript-level analysis of RNA-seq data. Yet, the inherent inferential uncertainty in transcript-level abundance estimation, even among the most accurate approaches, means that robust transcript level analysis often remains a challenge. Conversely, gene-level analysis remains a common and robust approach for understanding RNA-seq data, but it coarsens the resulting analysis to the level of genes, even if the data strongly support specific transcript-level effects. Results: We introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously-mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet, the total transcriptional output of that group of transcripts will have greatly-reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. Our approach, implemented i