add discussion in interpretability section and updates to molecular design section and discussion sections (issues with previous PR fixed) (#988)

* add discussion in interpretability section and update section on molecular design

* remove build files

* rehash/update my previous commit - single lines and other fixes

* Update content/06.discussion.md

Co-Authored-By: Anthony Gitter <[email protected]>

Co-authored-by: Anthony Gitter <[email protected]>

* Delete citation tags

Converting to Markdown format

* Convert tags to Markdown format

Adds tags from ebb27b1

* Apply suggestions from code review

commit agitter's changes

Co-authored-by: Anthony Gitter <[email protected]>

* Update 05.treat.md


New references: 
[@doi:10.1038/s41587-020-0418-2]
[@arXiv:1802.04364]
[@arXiv:1705.10843]
[@arXiv:1806.02473]
[@10.1021/acsmedchemlett.0c00088]
[@doi:10.1021/acs.jcim.0c00174]

note on references - I could not get DOIs for these so had to go with arXiv: 
[@arXiv:1802.04364] is published in ICML, see https://dblp.org/rec/bibtex/conf/icml/JinBJ18
[@arXiv:1806.02473] is published in NeurIPS, see https://dblp.uni-trier.de/rec/bibtex/conf/nips/YouLYPL18 

Note: another work which led to a synthesized and tested drug molecule is this, from 2018: https://doi.org/10.1021/acs.molpharmaceut.8b00839.
However, the 2019 work (Zhavoronkov et al.) we discuss got much more attention. The review is already getting a bit "in the weeds" so I left it out.

This is a recent review (July this year) which people may be interested in. I cited it.
https://pubs.acs.org/doi/pdf/10.1021/acsmedchemlett.0c00088

Sorry for any typos that crept in - the spellchecker isn't working in GitHub for some reason.

* Remove interpretability changes from this pull request

* Citation fixes
Some tags missed during conversion to Markdown link format

* Apply suggestions from code review

agitter's copyedits

Co-authored-by: Anthony Gitter <[email protected]>

* Update content/05.treat.md

Co-authored-by: Anthony Gitter <[email protected]>
delton137 and agitter authored Aug 9, 2020
1 parent 081fb46 commit 1173b9b
Showing 3 changed files with 88 additions and 16 deletions.
51 changes: 35 additions & 16 deletions content/05.treat.md
@@ -180,28 +180,47 @@ However, in the long term, atomic convolutions may ultimately overtake grid-based

#### *De novo* drug design

*De novo* drug design attempts to model the typical design-synthesize-test cycle of drug discovery [@doi:10.1002/wcms.49; @doi:10.1021/acs.jmedchem.5b01849].
*De novo* drug design attempts to model the typical design-synthesize-test cycle of drug discovery *in silico* [@doi:10.1002/wcms.49; @doi:10.1021/acs.jmedchem.5b01849].
It explores an estimated 10<sup>60</sup> synthesizable organic molecules with drug-like properties without explicit enumeration [@doi:10.1002/wcms.1104].
To test or score structures, algorithms like those discussed earlier are used.
To score molecules after generation or during optimization, physics-based simulation could be used [@tag:Sumita2018], but machine learning models based on techniques discussed earlier may be preferable [@tag:Gomezb2016_automatic], as they are much more computationally expedient.
Computational efficiency is particularly important during optimization as the "scoring function" may need to be called thousands of times.
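Because the scoring function may be evaluated thousands of times, often on repeated candidates, memoizing it is a cheap optimization. A minimal Python sketch, assuming molecules are keyed by canonical SMILES strings; the scoring heuristic itself is a placeholder, not any published model:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def score_molecule(smiles: str) -> float:
    """Placeholder scoring function keyed on a SMILES string.

    In practice this would call a trained property-prediction model or a
    physics-based simulation; a cheap string heuristic stands in here.
    """
    # Toy heuristic: reward longer strings and ring-closure digits.
    return len(smiles) * 0.1 + smiles.count("1") * 0.5

score_molecule("c1ccccc1")            # first call: computed
score_molecule("c1ccccc1")            # second call: served from cache
print(score_molecule.cache_info())    # hits=1, misses=1
```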

To "design" and "synthesize", traditional *de novo* design software relied on classical optimizers such as genetic algorithms.
Unfortunately, this often leads to overfit, "weird" molecules, which are difficult to synthesize in the lab.
Current programs have settled on rule-based virtual chemical reactions to generate molecular structures [@doi:10.1021/acs.jmedchem.5b01849].
Deep learning models that generate realistic, synthesizable molecules have been proposed as an alternative.
In contrast to the classical, symbolic approaches, generative models learned from data would not depend on laboriously encoded expert knowledge.
The challenge of generating molecules has parallels to the generation of syntactically and semantically correct text [@arxiv:1308.0850].

As deep learning models that directly output (molecular) graphs remain under-explored, generative neural networks for drug design typically represent chemicals with the simplified molecular-input line-entry system (SMILES), a standard string-based representation with characters that represent atoms, bonds, and rings [@tag:Segler2017_drug_design].
This allows treating molecules as sequences and leveraging recent progress in recurrent neural networks.
Gómez-Bombarelli et al. designed a SMILES-to-SMILES autoencoder to learn a continuous latent feature space for chemicals [@tag:Gomezb2016_automatic].
In this learned continuous space it was possible to interpolate between continuous representations of chemicals in a manner that is not possible with discrete
(e.g. bit vector or string) features or in symbolic, molecular graph space.
Even more interesting is the prospect of performing gradient-based or Bayesian optimization of molecules within this latent space.
These algorithms use a list of hard-coded rules to perform virtual chemical reactions on molecular structures during each iteration, leading to physically stable and synthesizable molecules [@doi:10.1021/acs.jmedchem.5b01849].
Deep learning models have been proposed as an alternative.
In contrast to the classical approaches, in theory generative models learned from big data would not require laboriously encoded expert knowledge to generate realistic, synthesizable molecules.

In the past few years, a large number of techniques for the generative modeling and optimization of molecules with deep learning have been explored, including RNNs, VAEs, GANs, and reinforcement learning---for a review see Elton et al. [@tag:Elton_molecular_design_review] or Vamathevan et al. [@tag:Vamathevan2019].

Building off the large amount of work that has already gone into text generation [@arxiv:1308.0850], many generative neural networks for drug design initially represented chemicals with the simplified molecular-input line-entry system (SMILES), a standard string-based representation with characters that represent atoms, bonds, and rings [@tag:Segler2017_drug_design].

The first successful demonstration of a deep learning based approach for molecular optimization occurred in 2016 with the development of a SMILES-to-SMILES autoencoder capable of learning a continuous latent feature space for molecules [@tag:Gomezb2016_automatic].
In this learned continuous space it is possible to interpolate between molecular structures in a manner that is not possible with discrete (e.g. bit vector or string) features or in symbolic, molecular graph space.
Even more interesting is that one can perform gradient-based or Bayesian optimization of molecules within this latent space.
The strategy of constructing simple, continuous features before applying supervised learning techniques is reminiscent of autoencoders trained on high-dimensional EHR data [@tag:BeaulieuJones2016_ehr_encode].
A drawback of the SMILES-to-SMILES autoencoder is that not all SMILES strings produced by the autoencoder's decoder correspond to valid chemical structures.
Recently, the Grammar Variational Autoencoder, which takes the SMILES grammar into account and is guaranteed to produce syntactically valid SMILES, has been proposed to alleviate this issue [@arxiv:1703.01925].
The Grammar Variational Autoencoder, which takes the SMILES grammar into account and is guaranteed to produce syntactically valid SMILES, helps alleviate this issue to some extent [@arxiv:1703.01925].
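The latent-space interpolation described above can be sketched without any trained model: intermediate points between two latent codes are simple convex combinations, each of which would then be passed through the autoencoder's decoder to propose a molecule. The 4-dimensional vectors below are hypothetical; real latent spaces typically have hundreds of dimensions:

```python
def interpolate(z_start, z_end, steps):
    """Linear interpolation between two latent vectors.

    With a trained SMILES autoencoder, each intermediate point would be
    decoded back into a candidate molecule; the decoder is out of scope
    here, so only the latent-space path is shown.
    """
    path = []
    for i in range(steps + 1):
        t = i / steps
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

# Hypothetical 4-dimensional latent codes for two molecules.
z_a = [0.0, 1.0, -0.5, 2.0]
z_b = [1.0, 0.0, 0.5, 0.0]
for z in interpolate(z_a, z_b, 4):
    print([round(v, 2) for v in z])
```

Gradient-based or Bayesian optimization would replace this fixed path with steps chosen to maximize a property predictor defined on the same latent space.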

Another approach to *de novo* design is to train character-based RNNs on large collections of molecules, for example, ChEMBL [@doi:10.1093/nar/gkr777], to first obtain a generic generative model for drug-like compounds [@tag:Segler2017_drug_design].
These generative models successfully learn the grammar of compound representations, with 94% [@tag:Olivecrona2017_drug_design] or nearly 98% [@tag:Segler2017_drug_design] of generated SMILES corresponding to valid molecular structures.
The initial RNN is then fine-tuned to generate molecules that are likely to be active against a specific target by either continuing training on a small set of positive examples [@tag:Segler2017_drug_design] or adopting reinforcement learning strategies [@tag:Olivecrona2017_drug_design; @arxiv:1611.02796].
Both the fine-tuning and reinforcement learning approaches can rediscover known, held-out active molecules.
The great flexibility of neural networks and progress in generative models offer many opportunities for deep architectures in *de novo* design (e.g. the adaptation of GANs for molecules).
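As a toy stand-in for these character-based RNNs, a bigram (Markov) model over SMILES characters illustrates the sequence-modeling setup: learn character transitions from a corpus, then sample new strings character by character. The four-molecule corpus is illustrative only, and a bigram model captures far less syntax than an RNN, so most samples will be chemically invalid:

```python
import random
from collections import defaultdict

def train_char_model(smiles_list):
    """Collect bigram transitions over SMILES characters, with '^' and
    '$' as start/end sentinels (a toy stand-in for a character RNN)."""
    transitions = defaultdict(list)
    for s in smiles_list:
        padded = "^" + s + "$"
        for a, b in zip(padded, padded[1:]):
            transitions[a].append(b)
    return transitions

def sample(transitions, rng, max_len=40):
    """Sample one string character by character until the end sentinel."""
    out, ch = [], "^"
    for _ in range(max_len):
        ch = rng.choice(transitions[ch])
        if ch == "$":
            break
        out.append(ch)
    return "".join(out)

corpus = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]  # tiny illustrative corpus
model = train_char_model(corpus)
rng = random.Random(0)
print([sample(model, rng) for _ in range(3)])
```

The cited models reach 94-98% valid SMILES precisely because an RNN conditions on the whole prefix rather than a single preceding character.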

Reinforcement learning approaches where operations are performed directly on the molecular graph bypass the need to learn the details of SMILES syntax, allowing the model to focus purely on chemistry.
Additionally, they seem to require less training data and generate more valid molecules, since they are constrained by design to graph operations which satisfy chemical valence rules [@tag:Elton_molecular_design_review].
A reinforcement learning agent developed by Zhou et al. [@doi:10.1038/s41598-019-47148-x] demonstrated superior performance at optimizing the quantitative estimate of drug-likeness (QED) metric and the "penalized logP" metric (logP minus the synthetic accessibility score) when compared with other deep learning based approaches such as the Junction Tree VAE [@arxiv:1802.04364], Objective-Reinforced Generative Adversarial Network [@arxiv:1705.10843], and Graph Convolutional Policy Network [@arxiv:1806.02473].
As another example, Zhavoronkov et al. used generative tensorial reinforcement learning to discover inhibitors of discoidin domain receptor 1 (DDR1) [@tag:Zhavoronkov2019_drugs].
In contrast to most previous work, six lead candidates discovered using their approach were synthesized and tested in the lab, with 4/6 achieving some degree of binding to DDR1.
One of the molecules was chosen for further testing and showed promising results in a cancer cell line and mouse model [@tag:Zhavoronkov2019_drugs].
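For concreteness, the "penalized logP" objective used in these comparisons is just logP minus the synthetic accessibility score (some benchmarks also subtract a large-ring penalty, omitted here). A minimal sketch with made-up property values; in practice both numbers would come from predictive models or a cheminformatics toolkit such as RDKit:

```python
def penalized_logp(logp: float, sa: float) -> float:
    """'Penalized logP': predicted octanol-water partition coefficient
    minus the synthetic accessibility (SA) score. Some benchmarks also
    subtract a large-ring penalty; that term is omitted in this sketch."""
    return logp - sa

# Hypothetical (logP, SA) predictions for three candidate molecules.
candidates = {"mol_a": (2.5, 3.1), "mol_b": (1.9, 1.2), "mol_c": (3.0, 2.8)}
best = max(candidates, key=lambda m: penalized_logp(*candidates[m]))
print(best)  # mol_b has the highest reward
```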

In concluding this section, we want to highlight two areas where work is still needed before AI can bring added value to the existing drug discovery process---novelty and synthesizability.
The work of Zhavoronkov et al. is arguably an important milestone and received much fanfare in the popular press, but Walters and Murcko have presented a more sober assessment, noting that the generated molecule they chose to test in the lab is very similar to an existing drug that was present in their training data [@doi:10.1038/s41587-020-0418-2].
Small variations on existing molecules are unlikely to be much better and may not be patentable.
One way to tackle this problem is to add novelty and diversity metrics to the reward function of reinforcement learning based algorithms.
Novelty should also be taken into account when comparing different models, and thus is included in the proposed GuacaMol benchmark (2019) for assessing generative models for molecular design [@doi:10.1021/acs.jcim.8b00839].
The other area which has been pointed to as a key limitation of current approaches is synthesizability [@doi:10.1021/acs.jcim.0c00174; @doi:10.1021/acsmedchemlett.0c00088].
Current heuristics of synthesizability, such as the synthetic accessibility score, are based on a relatively limited domain of chemical data and are too restrictive, so better models of synthesizability should help in this area [@doi:10.1021/acs.jcim.0c00174].
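One possible form for the novelty term mentioned above is a penalty based on the maximum Tanimoto similarity between a candidate's fingerprint and the training set. The sketch below represents fingerprints as plain Python sets of "on" bit indices standing in for real Morgan/ECFP fingerprints; the 0.4 threshold and linear penalty are arbitrary illustrative choices:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def novelty_penalty(candidate_fp, training_fps, threshold=0.4):
    """Penalty that grows once a candidate's nearest training-set
    neighbor exceeds the similarity threshold."""
    nearest = max((tanimoto(candidate_fp, fp) for fp in training_fps),
                  default=0.0)
    return max(0.0, nearest - threshold)

# Toy fingerprints: sets of "on" bit indices (real ones would be e.g.
# 2048-bit Morgan fingerprints from a cheminformatics toolkit).
training = [{1, 2, 3, 4}, {10, 11, 12}]
print(novelty_penalty({1, 2, 3, 5}, training))  # near-duplicate: penalized
print(novelty_penalty({20, 21, 22}, training))  # novel: zero penalty
```

Subtracting such a penalty from the property reward discourages the agent from rediscovering molecules it was trained on.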

As noted before, genetic algorithms use hard-coded rules based on possible chemical reactions to generate molecular structures and therefore may have less trouble generating synthesizable molecules [@doi:10.1021/acs.jmedchem.5b01849].
We note in passing that Jensen (2018) [@doi:10.1039/C8SC05372C] and Yoshikawa et al. (2019) [@doi:10.1246/cl.180665] have both demonstrated genetic algorithms that are competitive with deep learning approaches.
Progress on overcoming both of these shortcomings is proceeding on many fronts, and we believe the future of deep learning for molecular design is quite bright.
12 changes: 12 additions & 0 deletions content/90.back-matter.md
@@ -24,6 +24,7 @@
[@tag:Baskin2015_drug_disc]: doi:10.1080/17460441.2016.1201262
[@tag:Baxt1991_myocardial]: doi:10.7326/0003-4819-115-11-843
[@tag:BeaulieuJones2016_ehr_encode]: doi:10.1016/j.jbi.2016.10.007
[@tag:Belkin2019_PNAS]: doi:10.1073/pnas.1903070116
[@tag:Bengio2015_prec]: arxiv:1412.7024
[@tag:Berezikov2011_mirna]: doi:10.1038/nrg3079
[@tag:Bergstra2011_hyper]: url:https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
@@ -70,6 +71,8 @@
[@tag:Edwards2015_growing_pains]: doi:10.1145/2771283
[@tag:Ehran2009_visualizing]: url:http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/247
[@tag:Elephas]: url:https://github.com/maxpumperla/elephas
[@tag:Elton_molecular_design_review]: doi:10.1039/C9ME00039A
[@tag:Elton2020]: arxiv:2002.05149
[@tag:Errington2014_reproducibility]: doi:10.7554/eLife.04333
[@tag:Eser2016_fiddle]: doi:10.1101/081380
[@tag:Esfahani2016_melanoma]: doi:10.1109/EMBC.2016.7590963
@@ -80,6 +83,7 @@
[@tag:Finnegan2017_maximum]: doi:10.1101/105957
[@tag:Fong2017_perturb]: doi:10.1109/ICCV.2017.371
[@tag:Fraga2005]: doi:10.1073/pnas.0500398102
[@tag:Frosst2017_distilling]: arxiv:1711.09784
[@tag:Fu2019]: doi:10.1109/TCBB.2019.2909237
[@tag:Gal2015_dropout]: arxiv:1506.02142
[@tag:Gargeya2017_dr]: doi:10.1016/j.ophtha.2017.02.008
@@ -188,10 +192,12 @@
[@tag:Metaphlan]: doi:10.1038/nmeth.2066
[@tag:Min2016_deepenhancer]: doi:10.1109/BIBM.2016.7822593
[@tag:Momeni2018]: doi:10.1101/438341
[@tag:Montavon2018_visualization]: doi:10.1016/j.dsp.2017.10.011
[@tag:Mordvintsev2015_inceptionism]: url:http://googleresearch.blogspot.co.uk/2015/06/inceptionism-going-deeper-into-neural.html
[@tag:Moritz2015_sparknet]: arxiv:1511.06051
[@tag:Mrzelj]: url:https://repozitorij.uni-lj.si/IzpisGradiva.php?id=85515
[@tag:Murdoch2017_automatic]: arxiv:1702.02540
[@tag:Murdoch2019]: doi:10.1073/pnas.1900654116
[@tag:NIH2016_genome_cost]: url:https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/
[@tag:Nazor2012]: doi:10.1016/j.stem.2012.02.013
[@tag:Nemati2016_rl]: doi:10.1109/EMBC.2016.7591355
@@ -233,6 +239,7 @@
[@tag:Romero2017_diet]: url:https://openreview.net/pdf?id=Sk-oDY9ge
[@tag:Rosenberg2015_synthetic_seqs]: doi:10.1016/j.cell.2015.09.054
[@tag:Roth2015_view_agg_cad]: doi:10.1109/TMI.2015.2482920
[@tag:Rudin2019]: doi:10.1038/s42256-019-0048-x
[@tag:Russakovsky2015_imagenet]: doi:10.1007/s11263-015-0816-y
[@tag:Sa2015_buckwild]: pmcid:PMC4907892
[@tag:Salas2018]: doi:10.1186/s13059-018-1448-7
@@ -241,6 +248,7 @@
[@tag:Schatz2010_dna_cloud]: doi:10.1038/nbt0710-691
[@tag:Schmidhuber2014_dnn_overview]: doi:10.1016/j.neunet.2014.09.003
[@tag:Scotti2016_missplicing]: doi:10.1038/nrg.2015.3
[@tag:Sculley2018]: url:https://openreview.net/pdf?id=rJWF0Fywf
[@tag:Segata]: doi:10.1371/journal.pcbi.1004977
[@tag:Segler2017_drug_design]: arxiv:1701.01329
[@tag:Seide2014_parallel]: doi:10.1109/ICASSP.2014.6853593
@@ -250,6 +258,7 @@
[@tag:Shaham2016_batch_effects]: doi:10.1093/bioinformatics/btx196
[@tag:Shapely]: doi:10.1515/9781400881970-018
[@tag:Shen2017_medimg_review]: doi:10.1146/annurev-bioeng-071516-044442
[@tag:Shen2019]: doi:10.1016/j.eswa.2019.01.048
[@tag:Shin2016_cad_tl]: doi:10.1109/TMI.2016.2528162
[@tag:Shrikumar2017_learning]: arxiv:1704.02685
[@tag:Shrikumar2017_reversecomplement]: doi:10.1101/103663
@@ -270,6 +279,7 @@
[@tag:Strobelt2016_visual]: arxiv:1606.07461
[@tag:Su2015_gpu]: arxiv:1507.01239
[@tag:Subramanian2016_bace1]: doi:10.1021/acs.jcim.6b00290
[@tag:Sumita2018]: doi:10.1021/acscentsci.8b00213
[@tag:Sun2016_ensemble]: arxiv:1606.00575
[@tag:Sundararajan2017_axiomatic]: arxiv:1703.01365
[@tag:Sutskever]: arxiv:1409.3215
@@ -286,6 +296,7 @@
[@tag:Torracinta2016_sim]: doi:10.1101/079087
[@tag:Tu1996_anns]: doi:10.1016/S0895-4356(96)00002-9
[@tag:Unterthiner2014_screening]: url:http://www.bioinf.at/publications/2014/NIPS2014a.pdf
[@tag:Vamathevan2019]: doi:10.1038/s41573-019-0024-5
[@tag:Vanhoucke2011_cpu]: url:https://research.google.com/pubs/pub37631.html
[@tag:Vera2016_sc_analysis]: doi:10.1146/annurev-genet-120215-034854
[@tag:Vervier]: doi:10.1093/bioinformatics/btv683
@@ -313,6 +324,7 @@
[@tag:Zhang2015_multitask_tl]: doi:10.1145/2783258.2783304
[@tag:Zhang2017_generalization]: arxiv:1611.03530v2
[@tag:Zhang2019]: doi:10.1186/s12885-019-5932-6
[@tag:Zhavoronkov2019_drugs]: doi:10.1038/s41587-019-0224-x
[@tag:Zhou2015_deep_sea]: doi:10.1038/nmeth.3547
[@tag:Zhu2016_advers_mamm]: doi:10.1101/095786
[@tag:Zhu2016_mult_inst_mamm]: doi:10.1101/095794
41 changes: 41 additions & 0 deletions content/manual-references.json
@@ -52,6 +52,47 @@
]
}
},
{
"id": "url:https://openreview.net/pdf?id=rJWF0Fywf",
"type": "article-journal",
"title": "Winner's Curse? On Pace, Progress, and Empirical Rigor ...",
"container-title": "International Conference on Learning Representations 2018",
"URL": "https://openreview.net/pdf?id=rJWF0Fywf",
"author": [
{
"family": "Sculley",
"given": "D."
},
{
"family": "Snoek",
"given": "Jasper"
},
{
"family": "Rahimi",
"given": "Ali"
},
{
"family": "Wiltschko",
"given": "Alex"
}
],
"issued": {
"date-parts": [
[
"2018"
]
]
},
"accessed": {
"date-parts": [
[
"2020",
2,
14
]
]
}
},
{
"id": "url:https://repozitorij.uni-lj.si/IzpisGradiva.php?id=85515",
"type": "report",
