
## Deep learning to study the fundamental biological processes underlying human disease

The study of cellular structure and core biological processes---transcription, translation, signaling, metabolism, etc.---in humans and model organisms will greatly impact our understanding of human disease over the long horizon [@tag:Nih_curiosity]. Predicting how cellular systems respond to environmental perturbations and are altered by genetic variation remains a daunting task. Deep learning offers new approaches for modeling biological processes and integrating multiple types of omic data [@doi:10.1038/ncomms13090], which could eventually help predict how these processes are disrupted in disease. Recent work has already advanced our ability to identify and interpret genetic variants, study microbial communities, and predict protein structures, which also relates to the problems discussed in the drug development section. In addition, unsupervised deep learning has enormous potential for discovering novel cellular states from gene expression, fluorescence microscopy, and other types of data that may ultimately prove to be clinically relevant.

Progress has been rapid in genomics and imaging, fields where important tasks are readily adapted to well-established deep learning paradigms. One-dimensional convolutional and recurrent neural networks are well-suited for tasks related to DNA- and RNA-binding proteins, epigenomics, and RNA splicing. Two-dimensional CNNs are ideal for segmentation, feature extraction, and classification in fluorescence microscopy images [@doi:10.3109/10409238.2015.1135868]. Other areas, such as cellular signaling, are biologically important but studied less frequently to date, with some exceptions [@tag:Chen2015_trans_species]. This may be a consequence of data limitations or greater challenges in adapting neural network architectures to the available data. Here, we highlight several areas of investigation and assess how deep learning might move these fields forward.

### Gene expression

Gene expression technologies characterize the abundance of many thousands of RNA transcripts within a given organism, tissue, or cell. This characterization can represent the underlying state of the given system and can be used to study heterogeneity across samples as well as how the system reacts to perturbation. While gene expression measurements were traditionally made by quantitative polymerase chain reaction (qPCR), low-throughput fluorescence-based methods, and microarray technologies, the field has shifted in recent years to primarily performing RNA sequencing (RNA-seq) to catalog whole transcriptomes. As RNA-seq continues to fall in price and rise in throughput, sample sizes will increase and training deep models to study gene expression will become even more useful.

Several deep learning approaches have already been applied to gene expression data with varying aims. For instance, many researchers have applied unsupervised deep learning models to extract meaningful representations of gene modules or sample clusters. Denoising autoencoders have been used to cluster yeast expression microarrays into known modules representing cell cycle processes [@tag:Gupta2015_exprs_yeast] and to stratify yeast strains based on chemical and mutational perturbations [@tag:Chen2016_exprs_yeast]. Shallow (one hidden layer) denoising autoencoders have also been fruitful in extracting biological insight from thousands of *Pseudomonas aeruginosa* experiments [@tag:Tan2015_adage; @tag:Tan2016_eadage] and in aggregating features relevant to specific breast cancer subtypes [@tag:Tan2014_psb]. These unsupervised approaches applied to gene expression data are powerful methods for identifying gene signatures that may otherwise be overlooked. An additional benefit of unsupervised approaches is that ground truth labels, which are often difficult to acquire or are incorrect, are nonessential. However, the genes that have been aggregated into features must be interpreted carefully. Attributing each node to a single specific biological function risks over-interpreting models. Batch effects could cause models to discover non-biological features, and downstream analyses should take this into consideration.
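
To make the denoising autoencoder setup concrete, the following minimal sketch (our illustration, not the architecture of any study cited above; the layer sizes and corruption level are arbitrary assumptions) corrupts expression profiles scaled to [0, 1] and trains the network to reconstruct the clean values:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """One-hidden-layer denoising autoencoder for expression profiles."""
    def __init__(self, n_genes, n_hidden=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_genes), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_step(model, optimizer, x, corruption=0.1):
    # Corrupt inputs by zeroing a random subset of genes, then
    # reconstruct the uncorrupted profile (the "denoising" objective).
    mask = (torch.rand_like(x) > corruption).float()
    recon = model(x * mask)
    loss = nn.functional.mse_loss(recon, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random stand-in data (expression scaled to [0, 1]):
model = DenoisingAutoencoder(n_genes=5000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 5000)  # batch of 64 samples
loss = train_step(model, opt, x)
```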

Deep learning approaches are also being applied to gene expression prediction tasks. For example, a deep neural network with three hidden layers outperformed linear regression in inferring the expression of over 20,000 target genes based on a representative, well-connected set of about 1,000 landmark genes [@tag:Chen2016_gene_expr]. However, while the deep learning model outperformed existing algorithms in nearly every scenario, its absolute performance remained modest. The study was also limited by computational bottlenecks that required the target genes to be split randomly between two distinct models, each trained separately. It is unclear how much performance would have improved without these computational restrictions.
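
A rough sketch of this landmark-gene regression setup is shown below; the hidden layer sizes and activations are illustrative assumptions rather than the published configuration:

```python
import torch
import torch.nn as nn

# MLP mapping ~1,000 landmark-gene expression values to ~20,000 targets.
# Hidden sizes and activations are illustrative, not the published model.
landmark_to_target = nn.Sequential(
    nn.Linear(1000, 3000), nn.Tanh(),
    nn.Linear(3000, 3000), nn.Tanh(),
    nn.Linear(3000, 3000), nn.Tanh(),
    nn.Linear(3000, 20000),  # linear output units for regression
)

landmarks = torch.randn(8, 1000)             # batch of 8 expression profiles
targets_hat = landmark_to_target(landmarks)  # (8, 20000) predicted targets
# Trained by minimizing e.g. nn.functional.mse_loss(targets_hat, targets).
```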

Epigenomic data, combined with deep learning, may have sufficient explanatory power to infer gene expression. For instance, the DeepChrome CNN [@tag:Singh2016_deepchrome] improved prediction accuracy of high or low gene expression from histone modifications over existing methods. AttentiveChrome [@tag:Singh2017_attentivechrome] added a deep attention model to further enhance DeepChrome. Deep learning can also integrate different data types. For example, Liang et al. combined RBMs to integrate gene expression, DNA methylation, and miRNA data to define ovarian cancer subtypes [@tag:Liang2015_exprs_cancer]. While these approaches are promising, many convert gene expression measurements to categorical or binary variables, thus discarding the complex gene expression signatures carried by intermediate expression levels and relative differences.

Deep learning applied to gene expression data is still in its infancy, but the future is bright. Many previously untestable hypotheses can now be interrogated as deep learning enables analysis of increasing amounts of data generated by new technologies. For example, the effects of cellular heterogeneity on basic biology and disease etiology can now be explored by single-cell RNA-seq and high-throughput fluorescence-based imaging, techniques we discuss below that will benefit immensely from deep learning approaches.

### DNA methylation

DNA methylation is the process of adding a methyl group to a cytosine in the context of a CpG dinucleotide. This DNA-level epigenetic modification regulates gene transcription and is critical in development. Alterations to DNA methylation are well-established as contributing to the pathophysiology of many diseases, including cancers [@tag:Robertson2005; @tag:Feinberg2018]. Studies of DNA methylation have demonstrated its fundamental role in cell lineage specification starting with stem cell differentiation [@tag:Meissner2008; @tag:Nazor2012] as well as a strong relationship with aging phenotypes [@tag:Kwabi-Addo2007; @tag:Fraga2005] and pathogenesis in response to environmental exposures [@tag:Christensen2009; @tag:Relton2010].

Traditional analytic approaches to DNA methylation data often focus on estimating differential DNA methylation between groups, or methylation associated with an outcome, using linear mixed effects models in so-called epigenome-wide association studies [@tag:Laird2010; @tag:Wilhelm-Benartzi2013; @tag:Liu2013; @tag:Teschendorff2017]. In addition, a growing application of DNA methylation measures is to infer cellular or subject phenotypes from samples and either examine the relation of these phenotypes with outcomes or disease states directly or include them in models as covariates [@tag:Titus2017; @tag:Salas2018_GR; @tag:Zhang2019; @tag:Horvath2014; @tag:Quach2017]. For example, inference of subject age using DNA methylation clock approaches is established [@tag:Horvath2013] and is starting to be applied to test the relation of biological age with disease risk and outcomes [@tag:Kresovich2019]. Different cell types have different DNA methylation profiles. A novel approach to immunophenotyping combines measurements with reference DNA methylation profiles of leukocytes to infer immune cell type proportions [@tag:Houseman2012; @tag:Salas2018]. This strategy is particularly helpful when only DNA is available from a sample, and cell type inference is important for adjusting for cell-type composition in epigenome-wide association studies [@tag:Teschendorff2017]. While reference-based libraries have strong predictive value for immune cell type estimation and broad utility, methods that incorporate mixture estimates raise important considerations for interpreting the biology underlying disease manifestations and phenotypes. When a reference library is not available, reference-free deconvolution methods [@tag:Houseman2016] can decompose signal purported to be contributed by cell types. However, using reference-free cell type proportion estimates as potential confounders in adjusted models can be overly conservative, since outcome-associated variation in DNA methylation may be absorbed into the putative cell type estimates. Additional validated reference-based libraries for other tissue types, advancements in reference-free deconvolution methods, and application of deep learning methods are expected to provide new opportunities to understand and interpret DNA methylation in human health and disease.
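
The reference-based deconvolution strategy can be summarized as a constrained regression: each sample's methylation profile is modeled as a mixture of reference leukocyte profiles, and the mixing weights are the inferred cell type proportions. Below is a minimal sketch of this general idea; the cited methods add feature selection and additional constraints:

```python
import numpy as np
from scipy.optimize import nnls

def estimate_cell_proportions(sample_betas, reference_betas):
    """Infer cell type proportions from a methylation profile.

    sample_betas: (n_cpgs,) beta values for one sample.
    reference_betas: (n_cpgs, n_cell_types) mean beta values per cell type.
    Returns non-negative weights normalized to sum to 1.
    """
    weights, _ = nnls(reference_betas, sample_betas)
    return weights / weights.sum()

# Toy usage: 3 cell types, 100 discriminating CpGs.
rng = np.random.default_rng(0)
ref = rng.uniform(0, 1, size=(100, 3))
true_props = np.array([0.6, 0.3, 0.1])
sample = ref @ true_props + rng.normal(0, 0.01, size=100)
print(estimate_cell_proportions(sample, ref))  # approximately recovers 0.6, 0.3, 0.1
```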

Deep learning approaches have numerous potential applications for DNA methylation data. Imputation methods that capture complex interactions between different regions of DNA can expand the number of CpG sites whose DNA methylation state can be studied. Ideally these methods can derive their own informative, biologically-relevant features. The primary deep learning methods developed to date focus on: 1) estimating regions of methylation status and imputing missing methylation values, 2) performing classification and regression tasks, and 3) using the latent embeddings of methylation states to derive biologically meaningful features, infer interpolated disease states, and uncover CpG sites that aid the above prediction tasks.

#### Inference, imputation, and prediction

Deep learning approaches are beginning to help address some of the current limitations of feature-by-feature analysis approaches to DNA methylation data and may help uncover additional important features necessary to understand the biological underpinnings behind different pathological states. One of the more popular applications is imputing the degree of methylation at CpG sites that are within a few thousand base pairs of measured sites or present in similar samples. DeepSignal employs a CNN to construct features from raw electrical Nanopore signals from sites near a methylated base. It uses a bidirectional RNN on DNA sequences of the aligned signals to detect methylation [@tag:Ni2018]. DeepCpG applies a similar method using scBS-Seq, DNA sequence, and a bidirectional gated recurrent network [@tag:Angermueller2017]. Methods like MRCNN and DeepMethyl incorporate both sequence and topological structure [@tag:Tian2019; @tag:Khwaja2017; @tag:Wang2016_methyl; @tag:Fu2019]. In addition, gene expression has been used to infer and impute methylation states [@tag:Peng2019; @tag:Levy-Jurgenson2018], methylation of genes can be predicted from promoter methylation [@tag:Pan2018], and convolutional models have been able to predict methylation status from images [@tag:Momeni2018; @tag:Korfiatis2017]. While these examples of methylation imputation and inference methods have value, it is imperative to recognize limitations of imputing cytosine modifications. Imputing DNA methylation has complexities above and beyond genotype imputation. Correlation of DNA methylation marks can depend on cell types and other factors that vary by sample. As the number of tissue types and cell types with whole-genome bisulfite sequencing and oxidative bisulfite sequencing grows, the accuracy of DNA methylation imputation is expected to increase. While methods such as the autoencoder-based DAPL [@tag:Qiu2018] reduce computational overhead while performing comparably to popular methylation imputation approaches such as k-nearest neighbors, random forests, singular value decomposition, and multiple imputation by chained equations, the software implementations will need to become more user-friendly to gain widespread adoption.
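
For reference, the k-nearest neighbors baseline mentioned above can be implemented in a few lines with scikit-learn. This toy sketch imputes missing beta values from similar samples, whereas the cited deep methods also exploit sequence context and neighboring sites:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows are samples, columns are CpG sites; NaN marks unmeasured sites.
betas = np.array([
    [0.90, 0.10, np.nan, 0.80],
    [0.85, 0.15, 0.40,   0.75],
    [0.88, np.nan, 0.42, 0.78],
])

# Impute each missing value from the 2 most similar samples.
imputer = KNNImputer(n_neighbors=2)
betas_complete = imputer.fit_transform(betas)
```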

Once DNA methylation is measured, deep learning approaches can also be used to perform classification and regression tasks. For instance, deep neural networks have been employed on DNA methylation data to predict triglyceride concentrations pre- and post-treatment [@tag:Islam2018; @tag:Darst2018] and differentiate cancer subtypes [@tag:Chatterjee2018; @tag:Khwaja2018] better than other methods such as support vector machines (SVMs). Modular approaches to methylation prediction, such as MethylNet, have been able to predict age, cellular proportions, and cancer subtypes, outperforming SVM and elastic net models while remaining concordant with expected biology [@tag:Levy2019]. These approaches aim to make embedding, hyperparameter selection, regression, classification, and model interpretation tasks more tractable for epigenetics researchers and machine learning scientists.

#### Latent space construction

Unsupervised discovery of biologically-significant features is another major area of interest for researchers using DNA methylation data. A consistent theme of these methods is that they construct a low-dimensional space that semantically encodes biologically important features from methylation profiles. As with other applications, these low-dimensional representations are thought to capture a set of important, unmeasured sources of biological variability in the data. Projection into these spaces results in biologically-similar examples being close together. For this reason, they are often termed latent spaces. One method used several stacked binary RBMs to learn a low-dimensional subspace representation of the methylation profiles of 5,000 CpG sites with the highest variance across 136 breast tissue samples (113 breast cancer and 23 non-cancerous samples). Samples in the latent space were clustered via self-organizing maps to show that the latent space could differentiate breast cancer samples from non-neoplastic samples. Furthermore, the latent space was visualized using t-Distributed Stochastic Neighbor Embedding (t-SNE) [@tag:Maaten2008_tsne; @arxiv:1808.01359]. Titus et al. [@doi:10.5220/0006636401400145] adapted a VAE strategy developed by Way et al. [@doi:10.1142/9789813235533_0008] to methylation data. The VAE was modified to perform dimensionality reduction on 300,000 PAM50-assigned CpG features to 100 latent features in 862 samples. The authors performed t-SNE visualization, clustering, and tumor subtype classification from a TCGA breast cancer dataset. In a subsequent extension [@doi:10.1101/433763], the authors constructed a 100-dimensional latent space of 100,000 CpG sites across approximately 1,200 samples. They selected the latent space dimensions most highly associated with the differentiation between estrogen receptor (ER) positive and negative tumor samples in breast cancer patients to determine the extent to which the latent space could predict responses to endocrine therapy. Certain latent space dimensions differentiated tumors based on their ER status and provided biologically-plausible hypotheses, which suggests that VAE-derived models may have a place in summarizing DNA methylation profiles into composite features that can aid in predicting treatment response. Another study used VAEs to extract latent features of lung cancer methylation profiles. After constructing a latent space representation of TCGA lung cancer samples, the authors used a logistic regression classifier on the latent dimensions to accurately classify cancer subtypes [@doi:10.1109/BIBM.2018.8621365]. These studies, along with the growing body of work using VAEs and other latent representations of genomic and epigenomic data, demonstrate a suite of tools to explore the unmeasured aspects of biology. Techniques that produce these representations provide the opportunity to discover important biological features that were previously missed. The power of unsupervised deep learning models for this task comes from their ability to learn complex, non-linear relationships in high-dimensional data.
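
A minimal VAE of the kind used in these studies can be sketched as follows; the dimensions echo the CpG-to-latent-feature setups described above, while all other choices are generic assumptions rather than the published models:

```python
import torch
import torch.nn as nn

class MethylationVAE(nn.Module):
    """Minimal VAE compressing CpG beta values into a small latent space."""
    def __init__(self, n_cpgs, n_latent=100):
        super().__init__()
        self.enc_mu = nn.Linear(n_cpgs, n_latent)
        self.enc_logvar = nn.Linear(n_cpgs, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, n_cpgs), nn.Sigmoid())

    def forward(self, x):
        mu, logvar = self.enc_mu(x), self.enc_logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to a standard normal prior.
    bce = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld

# Toy usage with shrunken dimensions (the studies above compressed
# 100,000-300,000 CpG features into ~100 latent features):
model = MethylationVAE(n_cpgs=5000)
x = torch.rand(16, 5000)  # batch of 16 methylation profiles (beta values)
recon, mu, logvar = model(x)
loss = vae_loss(recon, x, mu, logvar)
```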

Important applications in the future include predicting methylation and pathological states based on methylation profiles uncovered from datasets with more noise, such as solid tissue samples. Unsupervised deep learning approaches such as VAEs may provide a more complete understanding of the biological processes underlying cell types, transitions in cell dynamics, and subject phenotypes. In addition, latent representations may assist with biological hypothesis generation and have the ability to stratify patients by predicted risk. While neural network embeddings can outperform traditional embeddings, it is important to be aware that many of these methods can be highly sensitive to hyperparameter tuning, and evaluations of these methods should report the impact of hyperparameter choices [@doi:10.1101/385534].

### Splicing

Pre-mRNA transcripts can be spliced into different isoforms by retaining or skipping subsets of exons or including parts of introns, creating enormous spatiotemporal flexibility to generate multiple distinct proteins from a single gene. This remarkable complexity can lend itself to defects that underlie many diseases. For instance, splicing mutations in the lamin A (LMNA) gene can lead to specific variants of dilated cardiomyopathy and limb girdle muscular dystrophy [@tag:Scotti2016_missplicing]. A recent study found that quantitative trait loci that affect splicing in lymphoblastoid cell lines are enriched within risk loci for schizophrenia, multiple sclerosis, and other immune diseases, implicating mis-splicing as a more widespread feature of human pathologies than previously thought [@tag:Li2016_variation]. Therapeutic strategies that aim to modulate splicing are also currently being considered for disorders such as Duchenne muscular dystrophy and spinal muscular atrophy [@tag:Scotti2016_missplicing].

Sequencing studies routinely return thousands of unannotated variants, but which of these variants cause functional changes in splicing, and how are those changes manifested? Prediction of a "splicing code" has been a goal of the field for the past decade. Initial machine learning approaches used a naïve Bayes model and a 2-layer Bayesian neural network with thousands of hand-derived sequence-based features to predict the probability of exon skipping [@tag:Barash2010_splicing_code; @tag:Xiong2011_bayesian]. With the advent of deep learning, more complex models provided better predictive accuracy [@tag:Xiong2015_splicing_code; @tag:Jha2017_integrative_models]. Importantly, these new approaches can take in multiple kinds of epigenomic measurements as well as tissue identity and RNA binding partners of splicing factors. Deep learning is critical in furthering these kinds of integrative studies where different data types and inputs interact in unpredictable (often nonlinear) ways to create higher-order features. Moreover, as in gene expression network analysis, interrogating the hidden nodes within neural networks could potentially illuminate important aspects of splicing behavior. For instance, tissue-specific splicing mechanisms could be inferred by training networks on splicing data from different tissues, then searching for common versus distinctive hidden nodes, a technique employed by Qin et al. for tissue-specific transcription factor (TF) binding predictions [@tag:Qin2017_onehot].

A parallel effort has been to use more data with simpler models. An exhaustive study using readouts of splicing for millions of synthetic intronic sequences uncovered motifs that influence the strength of alternative splice sites [@tag:Rosenberg2015_synthetic_seqs]. The authors built a simple linear model using hexamer motif frequencies that successfully generalized to exon skipping. In a limited analysis using single nucleotide polymorphisms (SNPs) from three genes, it predicted exon skipping with three times the accuracy of an existing deep learning-based framework [@tag:Xiong2015_splicing_code]. This case is instructive in that clever sources of data, not just more descriptive models, are still critical.
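
The hexamer approach reduces to counting 6-mer frequencies and fitting a linear model; the toy sketch below illustrates the idea and only loosely follows the cited study's feature construction:

```python
from itertools import product
import numpy as np
from sklearn.linear_model import LinearRegression

HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]  # all 4096 motifs
INDEX = {h: i for i, h in enumerate(HEXAMERS)}

def hexamer_frequencies(seq):
    """Sliding-window counts of every 6-mer, normalized by window count."""
    counts = np.zeros(len(HEXAMERS))
    for i in range(len(seq) - 5):
        counts[INDEX[seq[i:i + 6]]] += 1
    return counts / max(len(seq) - 5, 1)

# X: hexamer frequencies of intronic sequences; y: measured splice site usage.
X = np.stack([hexamer_frequencies(s) for s in ["ACGTACGTAGGT", "TTTTGGGTAAGT"]])
y = np.array([0.7, 0.2])  # toy splicing readouts
model = LinearRegression().fit(X, y)
```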

We already understand how mis-splicing of a single gene can cause diseases such as limb girdle muscular dystrophy. The challenge now is to uncover how genome-wide alternative splicing underlies complex, non-Mendelian diseases such as autism, schizophrenia, Type 1 diabetes, and multiple sclerosis [@tag:JuanMateu2016_t1d]. As a proof of concept, Xiong et al. [@tag:Xiong2015_splicing_code] sequenced five autism spectrum disorder and 12 control samples, each with an average of 42,000 rare variants, and identified mis-splicing in 19 genes with neural functions. Such methods may one day enable scientists and clinicians to rapidly profile thousands of unannotated variants for functional effects on splicing and nominate candidates for further investigation. Moreover, these nonlinear algorithms can deconvolve the effects of multiple variants on a single splice event without the need to perform combinatorial in vitro experiments. The ultimate goal is to predict an individual’s tissue-specific, exon-specific splicing patterns from their genome sequence and other measurements to enable a new branch of precision diagnostics that also stratifies patients and suggests targeted therapies to correct splicing defects. However, to achieve this we expect that methods to interpret the "black box" of deep neural networks and integrate diverse data sources will be required.

### Transcription factors

Transcription factors are proteins that bind regulatory DNA in a sequence-specific manner to modulate the activation and repression of gene transcription. High-throughput in vitro experimental assays that quantitatively measure the binding specificity of a TF to a large library of short oligonucleotides [@doi:10.1016/j.tibs.2014.07.002] provide rich datasets to model the naked DNA sequence affinity of individual TFs in isolation. However, in vivo TF binding is affected by a variety of other factors beyond sequence affinity, such as competition and cooperation with other TFs, TF concentration, and chromatin state (chemical modifications to DNA and other packaging proteins that DNA is wrapped around) [@doi:10.1016/j.tibs.2014.07.002]. TFs can thus exhibit highly variable binding landscapes across the same genomic DNA sequence across diverse cell types and states. Several experimental approaches such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) have been developed to profile in vivo binding maps of TFs [@doi:10.1016/j.tibs.2014.07.002]. Large reference compendia of ChIP-seq data are now freely available for a large collection of TFs in a small number of reference cell states in humans and a few other model organisms [@tag:Consortium2012_encode]. Due to fundamental material and cost constraints, it is infeasible to perform these experiments for all TFs in every possible cellular state and species. Hence, predictive computational models of TF binding are essential to understand gene regulation in diverse cellular contexts.

Several machine learning approaches have been developed to learn generative and discriminative models of TF binding from in vitro and in vivo TF binding datasets that associate collections of synthetic DNA sequences or genomic DNA sequences to binary labels (bound/unbound) or continuous measures of binding. The most common class of TF binding models in the literature are those that only model the DNA sequence affinity of TFs from in vitro and in vivo binding data. The earliest models were based on deriving simple, compact, interpretable sequence motif representations such as position weight matrices (PWMs) and other biophysically inspired models [@tag:Stormo2000_dna; @doi:10.1093/nar/gkp335; @doi:10.1038/nbt.2486]. These models were outperformed by general k-mer based models including SVMs with string kernels [@doi:10.1371/journal.pcbi.1000916; @tag:Ghandi2014_enhanced].
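For readers unfamiliar with PWMs, scoring a sequence against one is a sliding-window sum of position-specific log-odds. The following is a generic sketch with made-up motif probabilities, not a model from any cited work:

```python
import numpy as np

# Toy PWM for a length-4 motif: rows are positions, columns are A, C, G, T,
# entries are log-odds of each base relative to a uniform background.
PWM = np.log2(np.array([
    [0.80, 0.05, 0.10, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.80, 0.10, 0.05],
    [0.10, 0.10, 0.10, 0.70],
]) / 0.25)
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_pwm_score(seq, pwm=PWM):
    """Return the highest-scoring window of the sequence under the PWM."""
    w = pwm.shape[0]
    scores = [
        sum(pwm[j, BASE[seq[i + j]]] for j in range(w))
        for i in range(len(seq) - w + 1)
    ]
    return max(scores)

print(best_pwm_score("TTAGCTAA"))  # scans all length-4 windows
```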

In 2015, Alipanahi et al. developed DeepBind, the first CNN to classify DNA sequences bound in in vitro and in vivo assays against random background sequences matched for dinucleotide composition [@tag:Alipanahi2015_predicting]. The convolutional layers learn pattern detectors reminiscent of PWMs from a one-hot encoding of the raw input DNA sequences. DeepBind outperformed several state-of-the-art methods from the DREAM5 in vitro TF-DNA motif recognition challenge [@doi:10.1038/nbt.2486]. Although DeepBind was also applied to RNA-binding proteins, in general RNA binding is a separate problem [@doi:10.1186/s12859-017-1561-8] and accurate models will need to account for RNA secondary structure. Following DeepBind, several optimized convolutional and recurrent neural network architectures as well as novel hybrid approaches that combine kernel methods with neural networks have been proposed that further improve performance [@tag:Zeng2016_convolutional; @tag:Lanchantin2016_motif; @arxiv:1706.00125; @doi:10.1101/217257]. Specialized layers and regularizers have also been proposed to reduce parameters and learn more robust models by taking advantage of specific properties of DNA sequences such as their reverse complement equivalence [@tag:Shrikumar2017_reversecomplement; @doi:10.1101/146431].
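
The basic recipe shared by DeepBind and its successors is a one-hot encoding of the DNA sequence followed by convolution, pooling, and a classifier. The sketch below illustrates this pattern; the filter count, width, and pooling choices are arbitrary assumptions rather than DeepBind's published hyperparameters:

```python
import torch
import torch.nn as nn

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (4, length) tensor of 0/1 indicators."""
    x = torch.zeros(4, len(seq))
    for i, b in enumerate(seq):
        x[BASE[b], i] = 1.0
    return x

# Convolution filters play the role of learned, PWM-like motif detectors.
tf_binding_cnn = nn.Sequential(
    nn.Conv1d(in_channels=4, out_channels=16, kernel_size=12),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),  # max over positions: "was the motif found anywhere?"
    nn.Flatten(),
    nn.Linear(16, 1),  # logit for bound vs. unbound
)

logit = tf_binding_cnn(one_hot("ACGT" * 25).unsqueeze(0))  # one 100-bp sequence
```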

While most of these methods learn independent models for different TFs, in vivo multiple TFs compete or cooperate to occupy DNA binding sites, resulting in complex combinatorial co-binding landscapes. To take advantage of this shared structure in in vivo TF binding data, multi-task neural network architectures have been developed that explicitly share parameters across models for multiple TFs [@tag:Zhou2015_deep_sea; @doi:10.1093/nar/gkw226; @doi:10.1101/217257]. Some of these multi-task models train and evaluate classification performance relative to an unbound background set of regulatory DNA sequences sampled from the genome rather than using synthetic background sequences with matched dinucleotide composition.

The above-mentioned TF binding prediction models that use only DNA sequences as inputs have a fundamental limitation. Because the DNA sequence of a genome is the same across different cell types and states, a sequence-only model of TF binding cannot predict different in vivo TF binding landscapes in new cell types not used during training. One approach for generalizing TF binding predictions to new cell types is to learn models that integrate DNA sequence inputs with other cell-type-specific data modalities that modulate in vivo TF binding such as surrogate measures of TF concentration (e.g. TF gene expression) and chromatin state. Arvey et al. showed that combining the predictions of SVMs trained on DNA sequence inputs and cell-type specific DNase-seq data, which measures genome-wide chromatin accessibility, improved in vivo TF binding prediction within and across cell types [@doi:10.1101/gr.127712.111]. Several "footprinting" based methods have also been developed that learn to discriminate bound from unbound instances of known canonical motifs of a target TF based on high-resolution footprint patterns of chromatin accessibility that are specific to the target TF [@doi:10.1038/nmeth.3772]. However, the genome-wide predictive performance of these methods in new cell types and states has not been evaluated.

Recently, a community challenge known as the "ENCODE-DREAM in vivo TF Binding Site Prediction Challenge" was introduced to systematically evaluate genome-wide performance of methods that can predict TF binding across cell states by integrating DNA sequence and in vitro DNA shape with cell-type-specific chromatin accessibility and gene expression [@tag:Dream_tf_binding]. A deep learning model called FactorNet was amongst the top three performing methods in the challenge [@tag:Quang2017_factor]. FactorNet uses a multi-modal hybrid convolutional and recurrent architecture that integrates DNA sequence with chromatin accessibility profiles, gene expression, and evolutionary conservation of sequence. It is worth noting that FactorNet was slightly outperformed by an approach that does not use neural networks [@doi:10.1101/230011]. This top-ranking approach uses an extensive set of curated features in a weighted variant of a discriminative maximum conditional likelihood model in combination with a novel iterative training strategy and model stacking. There appears to be significant room for improvement because none of the current approaches for cross-cell-type prediction explicitly account for the fact that TFs can co-bind with distinct co-factors in different cell states. In such cases, sequence features that are predictive of TF binding in one cell state may be detrimental to predicting binding in another.

Singh et al. developed transfer string kernels for SVMs for cross-context TF binding [@tag:Singh2016_tsk]. Domain adaptation methods that allow training neural networks which are transferable between differing training and test set distributions of sequence features could be a promising avenue going forward [@arxiv:1502.02791; @arxiv:1505.07818]. These approaches may also be useful for transferring TF binding models across species.

Another class of imputation-based methods for cross-cell-type in vivo TF binding prediction leverages the strong correlation between combinatorial binding landscapes of multiple TFs. Given a partially complete panel of binding profiles of multiple TFs in multiple cell types, a deep learning method called TFImpute learns to predict the missing binding profile of a target TF in some target cell type in the panel, based on the binding profiles of other TFs in the target cell type and the binding profile of the target TF in other cell types in the panel [@tag:Qin2017_onehot]. However, TFImpute cannot generalize predictions beyond the training panel of cell types and requires TF binding profiles of related TFs.

It is worth noting that TF binding prediction methods in the literature based on neural networks and other machine learning approaches choose to sample the set of bound and unbound sequences in a variety of different ways. These choices and the choice of performance evaluation measures significantly confound systematic comparison of model performance (see Discussion).

Several methods have also been developed to interpret neural network models of TF binding. Alipanahi et al. visualize convolutional filters to obtain insights into the sequence preferences of TFs [@tag:Alipanahi2015_predicting]. They also introduced in silico mutation maps for identifying important predictive nucleotides in input DNA sequences by exhaustively forward propagating perturbations to individual nucleotides to record the corresponding change in output prediction. Shrikumar et al. [@tag:Shrikumar2017_learning] proposed efficient backpropagation based approaches to simultaneously score the contribution of all nucleotides in an input DNA sequence to an output prediction. Lanchantin et al. [@tag:Lanchantin2016_motif] developed tools to visualize TF motifs learned from TF binding site classification tasks. These and other general interpretation techniques (see Discussion) will be critical to improve our understanding of the biologically meaningful patterns learned by deep learning models of TF binding.
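
The in silico mutation map procedure is straightforward to express in code: substitute every base at every position, rerun the model, and record the change in the output. This sketch of the general idea reuses the hypothetical `one_hot` encoder and CNN from the earlier sketch:

```python
import torch

def mutation_map(model, seq, one_hot):
    """Score every possible single-nucleotide substitution.

    Returns a (4, len(seq)) tensor of prediction changes, where entry
    (b, i) is model(seq with base b at position i) - model(seq).
    """
    model.eval()
    with torch.no_grad():
        ref = model(one_hot(seq).unsqueeze(0)).item()
        deltas = torch.zeros(4, len(seq))
        for i in range(len(seq)):
            for b, base in enumerate("ACGT"):
                mutant = seq[:i] + base + seq[i + 1:]
                deltas[b, i] = model(one_hot(mutant).unsqueeze(0)).item() - ref
    return deltas
```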

### Promoters and enhancers

#### From TF binding to promoters and enhancers

Multiple TFs act in concert to coordinate changes in gene regulation at the genomic regions known as promoters and enhancers. Each gene has an upstream promoter, essential for initiating that gene's transcription. The gene may also interact with multiple enhancers, which can amplify transcription in particular cellular contexts. These contexts include different cell types in development or environmental stresses.

Promoters and enhancers provide a nexus where clusters of TFs and binding sites mediate downstream gene regulation, starting with transcription. The gold standard to identify an active promoter or enhancer requires demonstrating its ability to affect transcription or other downstream gene products. Even extensive biochemical TF binding data has thus far proven insufficient on its own to accurately and comprehensively locate promoters and enhancers. We lack sufficient understanding of these elements to derive a mechanistic "promoter code" or "enhancer code". But extensive labeled data on promoters and enhancers lends itself to probabilistic classification. The complex interplay of TFs and chromatin leading to the emergent properties of promoter and enhancer activity seems particularly apt for representation by deep neural networks.

#### Promoters

Despite decades of work, computational identification of promoters remains a stubborn problem [@doi:10.1093/bib/4.1.22]. Researchers have used neural networks for promoter recognition as early as 1996 [@tag:matis]. Recently, a CNN recognized promoter sequences with sensitivity and specificity exceeding 90% [@doi:10.1371/journal.pone.0171410]. Most activity in computational prediction of regulatory regions, however, has moved to enhancer identification. Because one can identify promoters with straightforward biochemical assays [@doi:10.1073/pnas.2136655100; @doi:10.1101/gr.110254.110], the direct rewards of promoter prediction alone have decreased. But the reliable ground truth provided by these assays makes promoter identification an appealing test bed for deep learning approaches that can also identify enhancers.

#### Enhancers

Recognizing enhancers presents additional challenges. Enhancers may be up to 1,000,000 bp away from the affected promoter, and even within introns of other genes [@doi:10.1038/nrg3458]. Enhancers do not necessarily operate on the nearest gene and may affect multiple genes. Their activity is frequently tissue- or context-specific. No biochemical assay can reliably identify all enhancers. Distinguishing them from other regulatory elements remains difficult, and some believe the distinction somewhat artificial [@doi:10.1016/j.tig.2015.05.007]. While these factors make the enhancer identification problem more difficult, they also make a solution more valuable.

Several neural network approaches have yielded promising results in enhancer prediction. Both Basset [@doi:10.1101/gr.200535.115] and DeepEnhancer [@tag:Min2016_deepenhancer] used CNNs to predict enhancers. DECRES used a feed-forward neural network [@doi:10.1101/041616] to distinguish between different kinds of regulatory elements, such as active enhancers and promoters, though it had difficulty distinguishing between inactive enhancers and promoters. The DECRES authors also investigated the power of sequence features to drive classification, finding that few were useful beyond CpG islands.

Comparing the performance of enhancer prediction methods illustrates the problems in using metrics created with different benchmarking procedures. Both the Basset and DeepEnhancer studies include comparisons to a baseline SVM approach, gkm-SVM [@doi:10.1371/journal.pcbi.1003711]. The Basset study reports that gkm-SVM attains a mean area under the precision-recall curve (AUPR) of 0.322 over 164 cell types [@doi:10.1101/gr.200535.115]. The DeepEnhancer study reports a dramatically different AUPR of 0.899 for gkm-SVM on nine cell types [@tag:Min2016_deepenhancer]. This large difference makes it impossible to directly compare the performance of Basset and DeepEnhancer from their reported metrics alone. DECRES used a different set of metrics altogether. To drive further progress in enhancer identification, we must develop a common and comparable benchmarking procedure (see Discussion).

#### Promoter-enhancer interactions

In addition to the location of enhancers, identifying enhancer-promoter interactions in three-dimensional space will provide critical knowledge for understanding transcriptional regulation. SPEID used a CNN to predict these interactions with only sequence and the location of putative enhancers and promoters along a one-dimensional chromosome [@doi:10.1101/085241]. It compared well to other methods using a full complement of biochemical data from ChIP-seq and other epigenomic methods. Of course, the putative enhancers and promoters used were themselves derived from epigenomic methods. But one could easily replace them with the output of one of the enhancer or promoter prediction methods above.

### Micro-RNA binding

Prediction of microRNAs (miRNAs) and miRNA targets is of great interest, as they are critical components of gene regulatory networks and are often conserved across great evolutionary distance [@tag:Bracken2016_mirna; @tag:Berezikov2011_mirna]. While many machine learning algorithms have been applied to these tasks, they currently require extensive feature selection and optimization. For instance, one of the most widely adopted tools for miRNA target prediction, TargetScan, trained multiple linear regression models on 14 hand-curated features including structural accessibility of the target site on the mRNA, the degree of site conservation, and predicted thermodynamic stability of the miRNA-mRNA complex [@tag:Agarwal2015_targetscan]. Some of these features, including structural accessibility, are imperfect or empirically derived. In addition, current algorithms suffer from low specificity [@tag:Lee2016_deeptarget].

As in other applications, deep learning promises to achieve equal or better performance in predictive tasks by automatically engineering complex features to minimize an objective function. Two recently published tools use different recurrent neural network-based architectures to perform miRNA and target prediction with solely sequence data as input [@tag:Park2016_deepmirgene; @tag:Lee2016_deeptarget]. Though the results are preliminary and still based on a validation set rather than a completely independent test set, these tools were able to predict miRNA target sites with higher specificity and sensitivity than TargetScan. Excitingly, they seem to show that RNNs can accurately align sequences and predict bulges, mismatches, and wobble base pairing without requiring the user to input secondary structure predictions or thermodynamic calculations. Further incremental advances in deep learning for miRNA and target prediction will likely be sufficient to meet the current needs of systems biologists and other researchers who use prediction tools mainly to nominate candidates that are then tested experimentally.

### Protein secondary and tertiary structure

Proteins play fundamental roles in almost all biological processes, and understanding their structure is critical for basic biology and drug development. UniProt currently has about 94 million protein sequences, yet fewer than 100,000 proteins across all species have experimentally-solved structures in the Protein Data Bank (PDB). As a result, computational structure prediction is essential for a majority of proteins. However, this is very challenging, especially when similar solved structures, called templates, are not available in the PDB. Over the past several decades, many computational methods have been developed to predict aspects of protein structure such as secondary structure, torsion angles, solvent accessibility, inter-residue contact maps, disorder regions, and side-chain packing. In recent years, multiple deep learning architectures have been applied, including deep belief networks, LSTMs, CNNs, and deep convolutional neural fields (DeepCNFs) [@doi:10.1007/978-3-319-46227-1_1; @doi:10.1038/srep18962].

Here we focus on deep learning methods for two representative sub-problems: secondary structure prediction and contact map prediction. Secondary structure refers to local conformation of a sequence segment, while a contact map contains information on all residue-residue contacts. Secondary structure prediction is a basic problem and an almost essential module of any protein structure prediction package. Contact prediction is much more challenging than secondary structure prediction, but it has a much larger impact on tertiary structure prediction. In recent years, the accuracy of contact prediction has greatly improved [@doi:10.1371/journal.pcbi.1005324; @doi:10.1093/bioinformatics/btu791; @doi:10.1073/pnas.0805923106; @doi:10.1371/journal.pone.0028766].

One can represent protein secondary structure with three different states (alpha helix, beta strand, and loop regions) or eight finer-grained states. Accuracy of a three-state prediction is called Q3, and accuracy of an eight-state prediction is called Q8. Several groups [@doi:10.1371/journal.pone.0032235; @doi:10.1109/TCBB.2014.2343960; @doi:10.1038/srep11476] applied deep learning to protein secondary structure prediction but were unable to achieve significant improvement over the de facto standard method PSIPRED [@doi:10.1006/jmbi.1999.3091], which uses two shallow feedforward neural networks. In 2014, Zhou and Troyanskaya demonstrated that they could improve Q8 accuracy by using a deep supervised and convolutional generative stochastic network [@arxiv:1403.1347]. In 2016 Wang et al. developed a DeepCNF model that improved Q3 and Q8 accuracy as well as prediction of solvent accessibility and disorder regions [@doi:10.1038/srep18962; @doi:10.1007/978-3-319-46227-1_1]. DeepCNF achieved a higher Q3 accuracy than the benchmark PSIPRED had maintained for more than a decade. This improvement may be mainly due to the ability of convolutional neural fields to capture long-range sequential information, which is important for beta strand prediction. Nevertheless, the improvements in secondary structure prediction from DeepCNF are unlikely to result in a commensurate improvement in tertiary structure prediction, since secondary structure mainly reflects coarse-grained local conformation of a protein structure.
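
Q3 and Q8 are simply per-residue accuracies over the corresponding label alphabets, as the trivial snippet below illustrates:

```python
def q_accuracy(predicted, true):
    """Per-residue accuracy (Q3 for {H, E, C} labels, Q8 for the 8-state alphabet)."""
    assert len(predicted) == len(true)
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

print(q_accuracy("HHHECCCC", "HHHEECCC"))  # 7 of 8 residues correct -> 0.875
```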

Protein contact prediction and contact-assisted folding (i.e. folding proteins using predicted contacts as restraints) represents a promising new direction for ab initio folding of proteins without good templates in PDB. Co-evolution analysis is effective for proteins with a very large number (>1000) of sequence homologs [@doi:10.1371/journal.pone.0028766], but fares poorly for proteins without many sequence homologs. By combining co-evolution information with a few other protein features, shallow neural network methods such as MetaPSICOV [@doi:10.1093/bioinformatics/btu791] and CoinDCA-NN [@doi:10.1093/bioinformatics/btv472] have shown some advantage over pure co-evolution analysis for proteins with few sequence homologs, but their accuracy is still far from satisfactory. In recent years, deeper architectures have been explored for contact prediction, such as CMAPpro [@doi:10.1093/bioinformatics/bts475], DNCON [@doi:10.1093/bioinformatics/bts598] and PConsC [@doi:10.1371/journal.pcbi.1003889]. However, blindly tested in the well-known CASP competitions, these methods did not show any advantage over MetaPSICOV [@doi:10.1093/bioinformatics/btu791].

Recently, Wang et al. proposed the deep learning method RaptorX-Contact [@doi:10.1371/journal.pcbi.1005324], which significantly improves contact prediction over MetaPSICOV and pure co-evolution methods, especially for proteins without many sequence homologs. It employs a network architecture formed by one 1D residual neural network and one 2D residual neural network. Blindly tested in the latest CASP competition (i.e. CASP12 [@url:http://www.predictioncenter.org/casp12/rrc_avrg_results.cgi]), RaptorX-Contact ranked first in F₁ score on free-modeling targets as well as the whole set of targets. In CAMEO (which can be interpreted as a fully-automated CASP) [@url:https://www.cameo3d.org], its predicted contacts were also able to fold proteins with a novel fold and only 65--330 sequence homologs. This technique also worked well on membrane proteins even when trained on non-membrane proteins [@arxiv:1704.07207]. RaptorX-Contact performed better mainly due to introduction of residual neural networks and exploitation of contact occurrence patterns by simultaneously predicting all the contacts in a single protein.

Taken together, ab initio folding is becoming much easier with the advent of direct evolutionary coupling analysis and deep learning techniques. We expect further improvements in contact prediction for proteins with fewer than 1000 homologs by studying new deep network architectures. The deep learning methods summarized above also apply to interfacial contact prediction for protein complexes but may be less effective since on average protein complexes have fewer sequence homologs. Beyond secondary structure and contact maps, we anticipate increased attention to predicting 3D protein structure directly from amino acid sequence and single residue evolutionary information [@doi:10.1101/265231].

### Structure determination and cryo-electron microscopy

Complementing computational prediction approaches, cryo-electron microscopy (cryo-EM) allows near-atomic resolution determination of protein models by comparing individual electron micrographs [@doi:10.1016/j.cell.2015.03.049]. Detailed structures require tens of thousands of protein images [@doi:10.1016/j.cell.2015.03.050]. Technological development has increased the throughput of image capture. New hardware, such as direct electron detectors, has made large-scale image production practical, while new software has focused on rapid, automated image processing.

Some components of cryo-EM image processing remain difficult to automate. For instance, in particle picking, micrographs are scanned to identify individual molecular images that will be used in structure refinement. In typical applications, hundreds of thousands of particles are necessary to determine a structure to near atomic resolution, making manual selection impractical [@doi:10.1016/j.cell.2015.03.050]. Typical selection approaches are semi-supervised; a user will select several particles manually, and these selections will be used to train a classifier [@doi:10.1016/j.jsb.2006.04.006; @doi:10.1016/j.jsb.2014.11.010]. Now CNNs are being used to select particles in tools like DeepPicker [@doi:10.1016/j.jsb.2016.07.006] and DeepEM [@doi:10.1186/s12859-017-1757-y]. In addition to addressing shortcomings from manual selection, such as selection bias and poor discrimination of low-contrast images, these approaches also provide a means of full automation. DeepPicker can be trained by reference particles from other experiments with structurally unrelated macromolecules, allowing for fully automated application to new samples.

Downstream of particle picking, deep learning is being applied to other aspects of cryo-EM image processing. Statistical manifold learning has been implemented in the software package ROME to classify selected particles and elucidate the different conformations of the subject molecule necessary for accurate 3D structures [@doi:10.1371/journal.pone.0182130]. These recent tools highlight the general applicability of deep learning approaches for image processing to increase the throughput of high-resolution cryo-EM.

### Protein-protein interactions

Protein-protein interactions (PPIs) are highly specific and non-accidental physical contacts between proteins, which occur for purposes other than generic protein production or degradation [@doi:10.1371/journal.pcbi.1000807]. Abundant interaction data have been generated in part thanks to advances in high-throughput screening methods, such as yeast two-hybrid and affinity-purification with mass spectrometry. However, because many PPIs are transient or dependent on biological context, high-throughput methods can fail to capture a number of interactions. The imperfections and costs associated with many experimental PPI screening methods have motivated an interest in high-throughput computational prediction.

Many machine learning approaches to PPI have focused on text mining the literature [@doi:10.1016/j.jbi.2007.11.008; @arxiv:1706.01556v2], but these approaches can fail to capture context-specific interactions, motivating de novo PPI prediction. Early de novo prediction approaches used a variety of statistical and machine learning tools on structural and sequential data, sometimes with reference to the existing body of protein structure knowledge. In the context of PPIs---as in other domains---deep learning shows promise both for exceeding current predictive performance and for circumventing limitations from which other approaches suffer.

One of the key difficulties in applying deep learning techniques to binding prediction is the task of representing peptide and protein sequences in a meaningful way. DeepPPI [@doi:10.1021/acs.jcim.7b00028] made PPI predictions from a set of sequence and composition protein descriptors using a two-stage deep neural network that trained a separate subnetwork for each protein before combining them into a single network. Sun et al. [@doi:10.1186/s12859-017-1700-2] applied autocovariances, a coding scheme that returns uniform-size vectors describing the covariance between physicochemical properties of the protein sequence at various positions. Wang et al. [@doi:10.1039/C7MB00188F] used deep learning as an intermediate step in PPI prediction. They examined 70-amino-acid protein sequences, extracting 1260 features from each. A stacked sparse autoencoder with two hidden layers was then used to reduce feature dimensions and noisiness before a novel type of classification vector machine made PPI predictions.
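
Autocovariance encoding can be sketched as follows: residues are mapped to numeric physicochemical property values, and covariances are computed at a range of sequence lags, yielding a fixed-length vector regardless of protein length. The property scale and lag range below are placeholder assumptions, not the published configuration:

```python
import numpy as np

# Toy physicochemical scale (e.g. hydrophobicity); real encodings use
# several standardized property scales. Values here are placeholders.
HYDRO = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "G": -0.4,
         "L": 3.8, "K": -3.9, "S": -0.8, "T": -0.7, "V": 4.2, "E": -3.5}

def autocovariance(seq, scale=HYDRO, max_lag=4):
    """Fixed-length features: covariance of a property at lags 1..max_lag."""
    x = np.array([scale[a] for a in seq])
    x = x - x.mean()
    n = len(x)
    return np.array([
        np.dot(x[:n - lag], x[lag:]) / (n - lag)
        for lag in range(1, max_lag + 1)
    ])

features = autocovariance("ARNDCGLKSTVE")  # always length max_lag, any protein length
```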

Beyond predicting whether or not two proteins interact, Du et al. [@doi:10.1016/j.ymeth.2016.06.001] employed a deep learning approach to predict the residue contacts between two interacting proteins. Using features that describe how similar a protein's residue is relative to similar proteins at the same position, the authors extracted uniform-length features for each residue in the protein sequence. A stacked autoencoder took two such vectors as input for the prediction of contact between two residues. The authors evaluated the performance of this method with several classifiers and showed that a deep neural network classifier paired with the stacked autoencoder significantly exceeded classical machine learning accuracy.

Because many studies used predefined higher-level features, one of the benefits of deep learning---automatic feature extraction---is not fully leveraged. More work is needed to determine the best ways to represent raw protein sequence information so that the full benefits of deep learning as an automatic feature extractor can be realized.

### MHC-peptide binding

An important type of PPI involves the immune system's ability to recognize the body's own cells. The major histocompatibility complex (MHC) plays a key role in regulating this process by binding antigens and displaying them on the cell surface to be recognized by T cells. Due to its importance in immunity and immune response, peptide-MHC binding prediction is an important problem in computational biology, and one that must account for the allelic diversity of the MHC-encoding gene region.

Shallow, feed-forward neural networks are competitive methods and have made progress toward pan-allele and pan-length peptide representations. Sequence alignment techniques are useful for representing variable-length peptides as uniform-length features [@doi:10.1110/ps.0239403; @doi:10.1093/bioinformatics/btv639]. For pan-allelic prediction, NetMHCpan [@doi:10.1007/s00251-008-0341-z; @doi:10.1186/s13073-016-0288-x] used a pseudo-sequence representation of the MHC class I molecule, which included only polymorphic peptide contact residues. The sequences of the peptide and MHC were then represented using both sparse vector encoding and Blosum encoding, in which amino acids are encoded by matrix score vectors. A comparable method to the NetMHC tools is MHCflurry [@doi:10.1101/174243], a method which shows superior performance on peptides of lengths other than nine. MHCflurry adds placeholder amino acids to transform variable-length peptides to length 15 peptides. In training the MHCflurry feed-forward neural network [@doi:10.1101/054775], the authors imputed missing MHC-peptide binding affinities using a Gibbs sampling method, showing that imputation improves performance for datasets with roughly 100 or fewer training examples. MHCflurry's imputation method increases its performance on poorly characterized alleles, making it competitive with NetMHCpan for this task. Kuksa et al. [@doi:10.1093/bioinformatics/btv371] developed a shallow, higher-order neural network (HONN) comprised of both mean and covariance hidden units to capture some of the higher-order dependencies between amino acid locations. Pre-training this HONN with a semi-restricted Boltzmann machine, the authors found that the performance of the HONN exceeded that of a simple deep neural network, as well as that of NetMHC.
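
The fixed-length trick can be sketched as padding each peptide with a placeholder symbol and encoding every position as its vector of substitution matrix scores. This is a loose sketch: it assumes Biopython's `Bio.Align.substitution_matrices` for the BLOSUM62 lookup, and end-padding is a simplification of the published placeholder scheme:

```python
import numpy as np
from Bio.Align import substitution_matrices

BLOSUM62 = substitution_matrices.load("BLOSUM62")
ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # 20 standard amino acids

def encode_peptide(peptide, target_len=15):
    """Pad to a fixed length with 'X', then encode each residue as its
    vector of BLOSUM62 scores against the 20 standard amino acids."""
    padded = peptide + "X" * (target_len - len(peptide))
    return np.array([[BLOSUM62[aa, other] for other in ALPHABET]
                     for aa in padded])

x = encode_peptide("SIINFEKL")  # shape (15, 20), ready for a dense network
```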

Deep learning's unique flexibility was recently leveraged by Bhattacharya et al. [@doi:10.1101/154757], who used a gated RNN method called MHCnuggets to overcome the difficulty of multiple peptide lengths. Under this framework, they used smoothed sparse encoding to represent amino acids individually. Because MHCnuggets had to be trained for every MHC allele, performance was far better for alleles with abundant, balanced training data. Vang et al. [@doi:10.1093/bioinformatics/btx264] developed HLA-CNN, a method which maps amino acids onto a 15-dimensional vector space based on their context relation to other amino acids before making predictions with a CNN. In a comparison of several current methods, Bhattacharya et al. found that the top methods---NetMHC, NetMHCpan, MHCflurry, and MHCnuggets---showed comparable performance, but large differences in speed. Convolutional neural networks (in this case, HLA-CNN) showed comparatively poor performance, while shallow and recurrent neural networks performed the best. They found that MHCnuggets---the recurrent neural network---was by far the fastest-training among the top performing methods.

### PPI networks and graph analysis

Because interacting proteins are more likely to share a similar function, the connectivity of a PPI network itself can be a valuable information source for the prediction of protein function [@doi:10.1038/msb4100129]. To incorporate higher-order network information, it is necessary to find a lower-level embedding of network structure that preserves this higher-order structure. Rather than use hand-crafted network features, deep learning shows promise for the automatic discovery of predictive features within networks. For example, Navlakha [@doi:10.1162/NECO_a_00924] showed that a deep autoencoder was able to compress a graph to 40% of its original size, while being able to reconstruct 93% of the original graph's edges, improving upon standard dimension reduction methods. To achieve this, each graph was represented as an adjacency matrix with rows sorted in descending node degree order, then flattened into a vector and given as input to the autoencoder. While the activity of some hidden layers correlated with several popular hand-crafted network features such as k-core size and graph density, this work showed that deep learning can effectively reduce graph dimensionality while retaining much of its structural information.
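
The input representation used in that work is easy to reproduce: order the adjacency matrix by node degree and flatten it into a vector for the autoencoder. The sketch below covers only this preprocessing; reordering columns along with rows is our assumption:

```python
import numpy as np

def graph_to_vector(adjacency):
    """Sort rows (and columns) by descending node degree, then flatten.

    The canonical ordering gives the autoencoder a consistent view of
    graphs regardless of arbitrary node numbering.
    """
    degrees = adjacency.sum(axis=1)
    order = np.argsort(-degrees)
    canonical = adjacency[np.ix_(order, order)]
    return canonical.flatten().astype(np.float32)

# Toy 4-node graph: node 2 has the highest degree and is moved first.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
x = graph_to_vector(A)  # length-16 input vector for an autoencoder
```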

An important challenge in PPI network prediction is the task of combining different networks and types of networks. Gligorijevic et al. [@doi:10.1101/223339] developed a multimodal deep autoencoder, deepNF, to find a feature representation common among several different PPI networks. This common lower-level representation allows for the combination of various PPI data sources towards a single predictive task. An SVM classifier trained on the compressed features from the middle layer of the autoencoder outperformed previous methods in predicting protein function.

Hamilton et al. addressed the issue of large, heterogeneous, and changing networks with an inductive approach called GraphSAGE [@arxiv:1706.02216v2]. By finding node embeddings through learned aggregator functions that describe the node and its neighbors in the network, the GraphSAGE approach allows for the generalization of the model to new graphs. In a classification task for the prediction of protein function, Chen and Zhu [@arxiv:1710.10568v1] optimized this approach and enhanced the graph convolutional network with a preprocessing step that uses an approximation to the dropout operation. This preprocessing effectively reduces the number of graph convolutional layers and significantly improves both training time and prediction accuracy.
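
The core of the GraphSAGE idea is the learned aggregation step. The following is a minimal rendering of a mean-aggregator layer with arbitrary dimensions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class MeanAggregatorLayer(nn.Module):
    """One GraphSAGE-style layer with a mean aggregator."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, neighbors):
        # h: (n_nodes, in_dim) node features.
        # neighbors: list of neighbor index tensors, one per node.
        agg = torch.stack([h[idx].mean(dim=0) for idx in neighbors])
        combined = torch.cat([h, agg], dim=1)  # self features + neighborhood summary
        out = torch.relu(self.linear(combined))
        return out / out.norm(dim=1, keepdim=True).clamp_min(1e-8)  # L2 normalize

# Toy usage: 3 nodes with 8-dimensional features.
h = torch.randn(3, 8)
nbrs = [torch.tensor([1, 2]), torch.tensor([0]), torch.tensor([0, 1])]
layer = MeanAggregatorLayer(8, 16)
h_next = layer(h, nbrs)  # (3, 16) embeddings; the aggregator generalizes to unseen graphs
```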

Morphological phenotypes

A field poised for dramatic revolution by deep learning is bioimage analysis. Thus far, the primary use of deep learning for biological images has been segmentation---that is, the identification of biologically relevant structures in images, such as nuclei, infected cells, or vasculature, in fluorescence or even brightfield channels [@doi:10.1371/journal.pcbi.1005177]. Once these so-called regions of interest have been identified, it is often straightforward to measure biological properties of interest, such as fluorescence intensities, textures, and sizes. Given the dramatic successes of deep learning in biological imaging, we simply refer readers to articles that review recent advancements [@doi:10.3109/10409238.2015.1135868; @doi:10.1371/journal.pcbi.1005177; @doi:10.1007/978-3-319-24574-4_28]. However, user-friendly tools must still be developed for deep learning to become commonplace in biological image segmentation.

We anticipate an additional paradigm shift in bioimaging that will be brought about by deep learning: what if images of biological samples, from simple cell cultures to three-dimensional organoids and tissue samples, could be mined for much more extensive biologically meaningful information than is currently standard? For example, a recent study demonstrated the ability to predict lineage fate in hematopoietic cells up to three generations in advance of differentiation [@doi:10.1038/nmeth.4182]. In biomedical research, biologists most often decide in advance which features to measure in images from their assay system. Although classical methods of segmentation and feature extraction can produce hundreds of metrics per cell in an image, deep learning is unconstrained by human intuition and can in theory extract more subtle features through its hidden nodes. There is already evidence that deep learning can surpass the efficacy of classical methods [@doi:10.1101/081364], even when using generic deep convolutional networks trained on natural images [@doi:10.1101/085118], an approach known as transfer learning. Recent work by Johnson et al. [@tag:Johnson2017_integ_cell] demonstrated how a conditional adversarial autoencoder allows for a probabilistic interpretation of cell and nuclear morphology and structure localization from fluorescence images. The proposed model generalizes well to a wide range of subcellular localizations. Its generative nature allows it to produce high-quality synthetic images that predict the localization of subcellular structures by directly modeling the localization of fluorescent labels. Notably, this approach reduces modeling time by omitting the subcellular structure segmentation step.
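
The transfer-learning recipe mentioned above is illustrated in the sketch below (a generic example, not any cited study's pipeline, assuming torchvision >= 0.13 for the `weights` argument): a CNN pretrained on natural images serves as a fixed feature extractor, and its output vectors become morphology descriptors for downstream comparison or classification.

```python
# Generic transfer-learning sketch: use an ImageNet-pretrained CNN as a
# fixed feature extractor for cropped cell images.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights="IMAGENET1K_V1")  # torchvision >= 0.13
backbone.fc = nn.Identity()   # drop the ImageNet classifier head
backbone.eval()

cells = torch.rand(8, 3, 224, 224)   # stand-in for 8 cropped cell images
with torch.no_grad():
    features = backbone(cells)       # (8, 512) per-cell feature vectors
print(features.shape)
```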

The impact of further improvements on biomedicine could be enormous. Comparing cell population morphologies using conventional methods of segmentation and feature extraction has already proven useful for functionally annotating genes and alleles, identifying the cellular target of small molecules, and identifying disease-specific phenotypes suitable for drug screening [@doi:10.1016/j.copbio.2016.04.003; @doi:10.1002/cyto.a.22909; @doi:10.1083/jcb.201610026]. Deep learning could bring a higher degree of accuracy to these new kinds of experiments---known as image-based profiling or morphological profiling---because it is freed from human-tuned feature extraction strategies.

Single-cell data

Single-cell methods are generating excitement as biologists characterize the vast heterogeneity within unicellular species and between cells of the same tissue type in the same organism [@tag:Gawad2016_singlecell]. For instance, tumor cells and neurons can both harbor extensive somatic variation [@tag:Lodato2015_neurons]. Understanding single-cell diversity in all its dimensions---genetic, epigenomic, transcriptomic, proteomic, morphologic, and metabolic---is key if treatments are to be targeted not only to a specific individual, but also to specific pathological subsets of cells. Single-cell methods also promise to uncover a wealth of new biological knowledge. A sufficiently large population of single cells will have enough representative "snapshots" to recreate timelines of dynamic biological processes. If tracking processes over time is not the limiting factor, single-cell techniques can provide maximal resolution compared to averaging across all cells in bulk tissue, enabling the study of transcriptional bursting with single-cell fluorescence in situ hybridization or the heterogeneity of epigenomic patterns with single-cell Hi-C or ATAC-seq [@tag:Liu2016_sc_transcriptome; @tag:Vera2016_sc_analysis]. Joint profiling of single-cell epigenomic and transcriptional states provides unprecedented views of regulatory processes [@doi:10.1101/138685].

However, large challenges exist in studying single cells. Relatively few cells can be assayed at once using current droplet, imaging, or microwell technologies, and low-abundance molecules or modifications may go undetected by chance due to a sampling phenomenon known as dropout (not to be confused with the dropout layer of deep learning). To address this problem, Angermueller et al. [@tag:Angermueller2016_single_methyl] trained a neural network to predict the presence or absence of methylation of a specific CpG site in single cells based on surrounding methylation signal and underlying DNA sequence, achieving several percentage points of improvement compared to random forests or deep networks trained only on CpG or sequence information. Similar deep learning methods have been applied to impute low-resolution ChIP-seq signal from bulk tissue with great success, and they could easily be adapted to single-cell data [@tag:Qin2017_onehot; @tag:Koh2016_denoising]. Deep learning has also been useful for dealing with batch effects [@tag:Shaham2016_batch_effects].
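
The two-branch design described above can be sketched as follows (our illustration, inspired by the general approach but not the published code; the class name, window size, and layer widths are assumptions): a 1D CNN reads the DNA sequence around the CpG site while an MLP reads methylation at neighboring sites, and their features are fused to predict methylation probability.

```python
# Illustrative two-branch CpG methylation predictor: sequence CNN plus
# neighbor-methylation MLP, fused into a single probability.
import torch
import torch.nn as nn

class CpGNet(nn.Module):
    def __init__(self, n_neighbors=10):
        super().__init__()
        self.seq_branch = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=9), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten())
        self.state_branch = nn.Sequential(nn.Linear(n_neighbors, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, seq_onehot, neighbor_meth):
        h = torch.cat([self.seq_branch(seq_onehot),
                       self.state_branch(neighbor_meth)], dim=1)
        return self.head(h)

seq = torch.rand(2, 4, 101)   # one-hot DNA around 2 CpG sites, 101 bp windows
neigh = torch.rand(2, 10)     # methylation signal at 10 nearby CpGs
print(CpGNet()(seq, neigh))   # predicted methylation probabilities
```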

Examining populations of single cells can reveal biologically meaningful subsets of cells as well as their underlying gene regulatory networks [@tag:Gaublomme2015_th17]. Unfortunately, machine learning methods generally struggle with imbalanced data---when there are many more examples of class 1 than class 2---because prediction accuracy is usually evaluated over the entire dataset. To tackle this challenge, Arvaniti et al. [@tag:Arvaniti2016_rare_subsets] classified healthy and cancer cells, each profiled with 25 markers, by using the most discriminative filters from a CNN trained on the data as a linear classifier. They achieved impressive performance even for cell subsets comprising only 0.1 to 1% of the data, significantly outperforming logistic regression and distance-based outlier detection methods. However, they did not benchmark against random forests, which tend to work better for imbalanced data, and their data were relatively low dimensional.
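
The multi-cell convolutional idea behind this approach is sketched below (our illustration, not the authors' code): filters score every cell in a sample from its markers, pooling summarizes the filter responses per sample, and a strongly discriminative filter can then be reused as a linear classifier for individual cells. All names and sizes are invented for the example.

```python
# Sketch of a multi-cell CNN whose filters double as linear cell classifiers.
import torch
import torch.nn as nn

n_markers, n_filters = 25, 3
conv = nn.Conv1d(n_markers, n_filters, kernel_size=1)  # one score per cell
head = nn.Linear(n_filters, 1)

sample = torch.rand(1, n_markers, 1000)   # one sample of 1,000 cells
cell_scores = torch.relu(conv(sample))    # (1, 3, 1000) filter responses
pooled = cell_scores.mean(dim=2)          # per-sample summary of each filter
logit = head(pooled)                      # e.g. healthy vs. disease sample

# Reuse filter 0 as a linear classifier for a single cell:
w = conv.weight[0, :, 0]                  # (25,) marker weights
cell = torch.rand(n_markers)
print((cell @ w + conv.bias[0]).item())   # linear score for one cell
```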

Neural networks can also learn low-dimensional representations of single-cell gene expression data for visualization, clustering, and other tasks. Both scvis [@doi:10.1101/178624] and scVI [@arxiv:1709.02082] are unsupervised approaches based on variational autoencoders (VAEs). Whereas scvis primarily focuses on single-cell visualization as a replacement for t-SNE [@tag:Maaten2008_tsne], the scVI model accounts for zero-inflated expression distributions and can impute zero values that are due to technical effects. Beyond VAEs, Lin et al. developed a supervised model to predict cell type [@doi:10.1093/nar/gkx681]. Similar to transfer learning approaches for microscopy images [@doi:10.1101/085118], they demonstrated that the hidden layer representations were informative in general and could be used to identify cellular subpopulations or match new cells to known cell types. The supervised neural network's representation was better overall at retrieving cell types than alternatives, but all methods struggled to recover certain cell types such as hematopoietic stem cells and inner cell mass cells. As the Human Cell Atlas [@doi:10.7554/eLife.27041] and related efforts generate more single-cell expression data, there will be opportunities to assess how well these low-dimensional representations generalize to new cell types, as well as abundant training data to learn broadly applicable representations.
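
A minimal VAE for expression profiles is sketched below (an illustration of the shared underlying technique, not the scvis or scVI model; it omits, for instance, scVI's zero-inflated likelihood). The encoder maps each cell's expression vector to a low-dimensional latent code usable for visualization or clustering, and the decoder reconstructs expression from that code.

```python
# Minimal expression VAE: encode cells to a 2-D latent space, decode back,
# and train on reconstruction loss plus the KL term.
import torch
import torch.nn as nn

class ExpressionVAE(nn.Module):
    def __init__(self, n_genes=2000, latent=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent)
        self.logvar = nn.Linear(128, latent)
        self.decoder = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                     nn.Linear(128, n_genes))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(z)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, z, kl   # train on reconstruction loss + kl

cells = torch.rand(64, 2000)          # 64 cells x 2,000 genes (toy data)
recon, z, kl = ExpressionVAE()(cells)
print(z.shape)                        # (64, 2) coordinates for plotting
```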

The sheer quantity of omic information that can be obtained from each cell, as well as the number of cells in each dataset, uniquely position single-cell data to benefit from deep learning. In the future, lineage tracing could be revolutionized by using autoencoders to reduce the feature space of transcriptomic or variant data followed by algorithms to learn optimal cell differentiation trajectories [@tag:Qiu2017_graph_embedding] or by feeding cell morphology and movement into neural networks [@tag:Buggenthin2017_imaged_lineage]. Reinforcement learning algorithms [@tag:Silver2016_alphago] could be trained on the evolutionary dynamics of cancer cells or bacterial cells undergoing selection pressure and reveal whether patterns of adaptation are random or deterministic, allowing us to develop therapeutic strategies that forestall resistance. We are excited to see the creative applications of deep learning to single-cell biology that emerge over the next few years.

Metagenomics

Metagenomics, which refers to the study of genetic material---16S rRNA or whole-genome shotgun DNA---from microbial communities, has revolutionized the study of micro-scale ecosystems within and around us. In recent years, machine learning has proved to be a powerful tool for metagenomic analysis. 16S rRNA has long been used to deconvolve mixtures of microbial genomes, yet 16S alone ignores more than 99% of the genomic content. Subsequent tools aimed to classify reads of 300--3000 bp from complex mixtures of microbial genomes based on tetranucleotide frequencies, which differ across organisms [@tag:Karlin], using supervised [@tag:McHardy; @tag:nbc] or unsupervised methods [@tag:Abe]. Then, researchers began to use techniques that could estimate relative abundances from an entire sample faster than classifying individual reads [@tag:Metaphlan; @tag:wgsquikr; @tag:lmat; @tag:Vervier]. There is also great interest in identifying and annotating sequence reads [@tag:yok; @tag:Soueidan]. However, the focus on taxonomic and functional annotation is just the first step. Several groups have proposed methods to determine host or environment phenotypes from the organisms that are identified [@tag:Guetterman; @tag:Knights; @tag:Stratnikov; @tag:Segata] or overall sequence composition [@tag:Ding]. Also, researchers have looked into how feature selection can improve classification [@tag:Liu; @tag:Segata], and techniques have been proposed that are classifier-independent [@tag:Ditzler; @tag:Ditzler2].
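
The tetranucleotide features used by these read classifiers are straightforward to compute, as the generic sketch below shows: each read becomes a 256-dimensional vector of 4-mer frequencies that any supervised or unsupervised model can consume.

```python
# Tetranucleotide (4-mer) frequency features for a sequencing read.
from collections import Counter
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 tetranucleotides
KMER_INDEX = {k: i for i, k in enumerate(KMERS)}

def tetranucleotide_freqs(read):
    counts = Counter(read[i:i + 4] for i in range(len(read) - 3))
    # k-mers containing ambiguous bases (e.g. N) are ignored by construction
    total = max(sum(counts[k] for k in KMER_INDEX), 1)
    return [counts[k] / total for k in KMERS]

read = "ACGTACGTGGCCAATTACGTACGGTTAACC"
vec = tetranucleotide_freqs(read)
print(len(vec), max(vec))  # 256 features per read
```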

To date, most neural networks in metagenomics have been used for phylogenetic classification or functional annotation from sequence data, where ample training data exist. Neural networks have been applied successfully to gene annotation (e.g. Orphelia [@tag:Hoff] and FragGeneScan [@doi:10.1093/nar/gkq747]). Distributed representations of protein sequences (similar to word2vec [@tag:word2vec] in natural language processing) have been learned with a skip-gram neural network and used for protein family classification [@tag:Asgari]. Recurrent neural networks also show good performance for homology and protein family identification [@tag:Hochreiter; @tag:Sonderby].
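
The word2vec-style idea can be sketched as follows (our illustration of the general technique, not the cited method; it assumes gensim >= 4.0, where the `size` parameter became `vector_size`): overlapping k-mers are treated as "words" and each protein as a "sentence" for a skip-gram model.

```python
# Skip-gram embeddings of protein k-mers, averaged into per-protein vectors.
import numpy as np
from gensim.models import Word2Vec

def to_kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

proteins = ["MKTAYIAKQRQISFVKSHFSRQ", "MKVLAAGIAKQRQISFVK"]  # toy sequences
sentences = [to_kmers(p) for p in proteins]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

# Represent a whole protein as the average of its k-mer vectors.
embedding = np.mean([model.wv[k] for k in to_kmers(proteins[0])], axis=0)
print(embedding.shape)  # (50,) feature vector for a downstream classifier
```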

One of the first techniques for de novo genome binning used self-organizing maps, a type of neural network [@tag:Abe]. Essinger et al. [@tag:Essinger2010_taxonomic] used Adaptive Resonance Theory to cluster similar genomic fragments and showed that it outperformed k-means. However, other methods based on interpolated Markov models [@tag:Salzberg] have outperformed these early genome binners. Neural networks can be slow to train and have therefore had limited use for reference-based taxonomic classification, with TAC-ELM [@tag:TAC-ELM] being the only neural network-based algorithm to taxonomically classify massive amounts of metagenomic data. An initial study successfully applied neural networks to taxonomic classification of 16S rRNA genes, with convolutional networks providing an approximately 10% improvement in genus-level accuracy over RNNs and random forests [@tag:Mrzelj]. However, this study evaluated only 3000 sequences.

Applications of neural networks that classify phenotype from microbial composition are just beginning to appear. A simple multi-layer perceptron (MLP) was able to classify wound severity from the microbial species present in the wound [@doi:10.1016/j.bjid.2015.08.013]. Recently, Ditzler et al. associated soil samples with pH level using MLPs, DBNs, and RNNs [@tag:Ditzler3]. Besides classifying samples appropriately, internal nodes of the phylogenetic tree inferred by the networks represented features associated with low and high pH. Thus, hidden nodes might provide biological insight as well as new features for future metagenomic sample comparison. Also, an initial study has shown the promise of these networks for diagnosing disease [@tag:Faruqi].
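
A minimal phenotype-from-composition classifier of this kind is sketched below (a generic example with synthetic data, not any cited study's pipeline): an MLP maps per-sample species abundances to a binary phenotype.

```python
# Toy MLP mapping taxon abundances to a phenotype label.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
abundances = rng.dirichlet(np.ones(200), size=60)   # 60 samples x 200 taxa
# Synthetic phenotype driven by the first taxon, for illustration only.
phenotype = (abundances[:, 0] > np.median(abundances[:, 0])).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(abundances[:40], phenotype[:40])
print("held-out accuracy:", clf.score(abundances[40:], phenotype[40:]))
```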

Challenges remain in applying deep neural networks to metagenomics problems. They are not yet ideal for phenotype classification because most studies contain tens of samples but hundreds or thousands of features (species). Such underdetermined, or ill-conditioned, problems remain a challenge for deep neural networks, which require many training examples. Also, due to convergence issues [@arxiv:1212.0901v2], taxonomic classification of reads from whole genome sequencing seems out of reach at the moment for deep neural networks; there are only thousands of fully sequenced genomes, compared with the hundreds of thousands of 16S rRNA sequences available for training.

However, because RNNs have been applied with some success to base calling for the Oxford Nanopore long-read sequencer [@tag:Boza] (discussed below), one day the entire pipeline, from denoising to functional classification, may be combined into a single step using powerful LSTMs [@tag:Sutskever]. For example, metagenomic assembly usually requires binning followed by assembly---could deep neural nets accomplish both tasks in one network? We believe the greatest potential of deep learning for metagenomics lies in learning the complete characteristics of a sample with one complex network.

Sequencing and variant calling

While we have so far primarily discussed the role of deep learning in analyzing genomic data, deep learning can also substantially improve our ability to obtain the genomic data itself. We discuss two specific challenges: calling SNPs and indels (insertions and deletions) with high specificity and sensitivity, and improving the accuracy of new types of data such as nanopore sequencing. These two tasks are critical for studying rare variation, allele-specific transcription and translation, and splice site mutations. In the clinical realm, detecting rare tumor clones and diagnosing genetic diseases will require accurate calling of SNPs and indels.

Current methods achieve relatively high (>99%) precision at 90% recall for SNP and indel calls from Illumina short-read data [@tag:Poplin2016_deepvariant], yet this still leaves a large number of potentially clinically-important false positives and false negatives. These methods have so far relied on experts to build probabilistic models that reliably separate signal from noise. However, this process is time consuming and fundamentally limited by how well we understand and can model the factors that contribute to noise. Recently, two groups have applied deep learning to construct data-driven, unbiased noise models. One of these models, DeepVariant, leverages Inception, a neural network trained for image classification by Google Brain, by encoding reads around a candidate SNP as a 221x100 bitmap image, where each column is a nucleotide and each row is a read from the sample library [@tag:Poplin2016_deepvariant]. The top 5 rows represent the reference, and the bottom 95 rows represent randomly sampled reads that overlap the candidate variant. Each RGBA (red/green/blue/alpha) image pixel encodes the base (A, C, G, T) as a different red value, quality score as a green value, strand as a blue value, and variation from the reference as the alpha value. The neural network outputs genotype probabilities for each candidate variant. They were able to achieve better performance than GATK [@doi:10.1038/ng.806], a leading genotype caller, even when GATK was given information about population variation for each candidate variant. Another method, still in its infancy, relied on 62 hand-developed features for each candidate variant, feeding these vectors into a fully connected deep neural network [@tag:Torracinta2016_deep_snp]. Unfortunately, this feature set required at least 15 iterations of software development to fine-tune, which suggests that these models may not generalize.
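
A toy version of this pileup-to-image encoding is sketched below (our simplified illustration, not DeepVariant's exact scheme; it uses three channels rather than RGBA and invented value mappings): each read becomes a row, and the channels carry base identity, base quality, and strand.

```python
# Simplified pileup-to-image encoding for a variant-calling CNN.
import numpy as np

BASE_VALUE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def encode_pileup(reads, quals, strands, width=221):
    # reads: list of strings; quals: per-base scores in [0, 60]; strands: +/-
    img = np.zeros((len(reads), width, 3), dtype=np.float32)
    for r, (seq, q, s) in enumerate(zip(reads, quals, strands)):
        for c, base in enumerate(seq[:width]):
            img[r, c, 0] = BASE_VALUE.get(base, 0.0)  # channel 0: base identity
            img[r, c, 1] = q[c] / 60.0                # channel 1: base quality
            img[r, c, 2] = 1.0 if s == "+" else 0.0   # channel 2: strand
    return img  # feed to an image CNN that outputs genotype probabilities

reads = ["ACGTT", "ACATT"]
quals = [[30, 40, 35, 38, 39], [28, 33, 31, 37, 36]]
img = encode_pileup(reads, quals, strands=["+", "-"])
print(img.shape)  # (2 reads, 221 columns, 3 channels)
```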

Variant calling will benefit more from optimizing neural network architectures than from developing features by hand. An interesting and informative next step would be to rigorously test whether encoding raw sequence and quality data as an image, tensor, or some other mixed format produces the best variant calls. Because many of the latest neural network architectures (ResNet, Inception, Xception, and others) are already optimized for and pre-trained on generic, large-scale image datasets [@tag:Chollet2016_xception], encoding genomic data as images could prove to be a generally effective and efficient strategy.

In limited experiments, DeepVariant was robust to sequencing depth, read length, and even species [@tag:Poplin2016_deepvariant]. However, a model built on Illumina data, for instance, may not be optimal for Pacific Biosciences long-read data or MinION nanopore data, which have vastly different specificity and sensitivity profiles and signal-to-noise characteristics. Recently, Boza et al. used bidirectional recurrent neural networks to infer the E. coli sequence from MinION nanopore electric current data with higher per-base accuracy than the proprietary hidden Markov model-based algorithm Metrichor [@tag:Boza]. Unfortunately, training any neural network requires a large amount of data, which is often not available for new sequencing technologies. To circumvent this, one very preliminary study simulated mutations and spiked them into somatic and germline RNA-seq data, then trained and tested a neural network on simulated paired RNA-seq and exome sequencing data [@tag:Torracinta2016_sim]. Despite subsequent evaluation [@doi:10.1101/093534] on real somatic mutation data from the International Cancer Genome Consortium [@doi:10.1038/ncomms10001], further assessments are required to determine whether simulation can produce sufficiently realistic data to train reliable models.
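
A schematic bidirectional RNN base caller is sketched below (an illustration of the general architecture, not the cited model; the class name, GRU cells, and output alphabet with a blank label are our own assumptions): raw current measurements go in, and per-step base probabilities come out.

```python
# Schematic bidirectional RNN base caller for raw nanopore current.
import torch
import torch.nn as nn

class CurrentToBase(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(1, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 5)  # A, C, G, T, or blank/no-call

    def forward(self, current):
        h, _ = self.rnn(current)        # (batch, time, 2 * hidden)
        return self.out(h).log_softmax(dim=-1)

signal = torch.randn(1, 500, 1)         # 500 raw current samples
log_probs = CurrentToBase()(signal)
print(log_probs.shape)                  # (1, 500, 5) per-step base calls
```

In practice such per-step outputs would be collapsed into a base sequence with an alignment-free decoding scheme (for example, CTC-style decoding over the blank label assumed here).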

Method development for interpreting new types of sequencing data has historically taken two steps: first, easily implemented hard cutoffs that prioritize specificity over sensitivity, then expert development of probabilistic models with hand-developed inputs [@tag:Torracinta2016_sim]. We anticipate that these steps will be replaced by deep learning, which will infer features simply by its ability to optimize a complex model against data.

Neuroscience

Artificial neural networks were originally conceived as a model for computation in the brain [@doi:10.1007/BF02478259]. Although deep neural networks have evolved to become a workhorse across many fields, there is still a strong connection between deep networks and the study of the brain. The rich parallel history of artificial neural networks in computer science and neuroscience is reviewed in [@doi:10.3389/fncom.2016.00094; @doi:10.1101/133504; @doi:10.1016/j.neuron.2017.06.011].

Convolutional neural networks were originally conceived as faithful models of visual information processing in the primate visual system, and are still considered so [@doi:10.1038/nn.4244]. The activations of hidden units in consecutive layers of deep convolutional networks have been found to parallel the activity of neurons in consecutive brain regions involved in processing visual scenes. Such models of neural computation are called "encoding" models, as they predict how the nervous system might encode sensory information in the world.

Even when they are not directly modeling biological neurons, deep networks have been a useful computational tool in neuroscience. They have been developed as statistical time series models of neural activity in the brain. In contrast to the encoding models described above, these models are used to decode neural activity, for instance in brain-machine interfaces [@doi:10.1101/152884]. Deep networks have also been crucial to the field of connectomics, which seeks to map the connectivity of biological neural networks in the brain. In connectomics, deep networks are used to segment the shapes of individual neurons and to infer their connectivity from 3D electron microscopy images [@doi:10.1016/j.conb.2010.07.004], and they have also been used to infer causal connectivity from optical measurement and perturbation of neural activity [@tag:Aitchison2017].

It is an exciting time for neuroscience. Rapid recent progress in deep networks continues to inspire new machine learning-based models of brain computation [@doi:10.3389/fncom.2016.00094], and neuroscience continues to inspire new models of artificial intelligence [@doi:10.1016/j.neuron.2017.06.011].