first version of new tutorial

Andreas Prlic · Andreas Prlic · commit ba746f4895b7 · 2013-09-18T14:47:46.000-07:00
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,3 @@
+.DS_Store
+.profile
+.settings
diff --git a/structure/README.md b/structure/README.md
@@ -0,0 +1,16 @@
+The Protein Structure Modules of BioJava
+=====================================================
+
+A tutorial for the protein structure modules of BioJava
+
+## Index
+
+
+This tutorial is split into several chapters.
+
+Chapter 1 - The [BioJava data model](structure-data-model.md) for the representation of macromolecular structures.
+
+Chapter 2 - The [Chemical Component Dictionary](chemcomp.md)
+
+Chapter X - How to [work with mmCIF/PDBx files](mmcif.md).
+
diff --git a/structure/chemcomp.md b/structure/chemcomp.md
@@ -0,0 +1,80 @@
+The Chemical Component Dictionary
+=================================
+
+The [Chemical Components Dictionary](http://www.wwpdb.org/ccd.html) is an external reference file describing all residue and small molecule components found in PDB entries. This dictionary contains detailed chemical descriptions for standard and modified amino acids/nucleotides, small molecule ligands, and solvent molecules. 
+
+### How does BioJava decide what groups are amino acids?
+
+<table>
+    <tr><td>
+
+![Selenomethionine is a naturally occurring amino acid containing selenium](img/143px-Selenomethionine-from-xtal-3D-balls.png "Selenomethionine is a naturally occurring amino acid containing selenium source: wikipedia")
+
+</td>
+    <td>Selenomethionine is a naturally occurring amino acid containing selenium (source: Wikipedia)
+        </td>
+    </tr>
+</table>
+BioJava uses the Chemical Component Dictionary to achieve a chemically correct representation of each group. To see how this works, let's look at how [Selenomethionine](http://en.wikipedia.org/wiki/Selenomethionine) and water are handled:
+
+
+
+
+<pre>
+            Structure structure = StructureIO.getStructure("1A62");
+                    
+            for (Chain chain : structure.getChains()){
+                for (Group group : chain.getAtomGroups()){
+                    if ( group.getPDBName().equals("MSE") || group.getPDBName().equals("HOH")){
+                        System.out.println(group.getPDBName() + " is a group of type " + group.getType());
+                    }
+                }
+            }
+</pre>
+
+This should print the following output:
+
+<pre>
+MSE is a group of type amino
+MSE is a group of type amino
+MSE is a group of type amino
+HOH is a group of type hetatm
+HOH is a group of type hetatm
+HOH is a group of type hetatm
+...
+</pre>
+
+As you can see, although MSE is flagged as HETATM in the PDB file, BioJava still represents it correctly as an amino acid. The key is that the [definition file for MSE](http://www.rcsb.org/pdb/files/ligand/MSE.cif) flags it as "L-PEPTIDE LINKING", and BioJava uses that flag to assign the group type.
+
+
+### How to access Chemical Component definitions
+By default BioJava ships with a minimal representation of standard amino acids. However, if you want to parse the whole PDB archive, it is a good idea to tell the library to either
+
+1. fetch missing Chemical Component definitions on the fly (small download and parsing delays every time a new chemical compound is found), or
+2. load all definitions at startup (slow startup and higher memory use, but no further delays later on).
+
+You can enable the first behaviour by using the [FileParsingParameters](http://www.biojava.org/docs/api/org/biojava/bio/structure/io/FileParsingParameters.html) class:
+
+<pre>
+            AtomCache cache = new AtomCache();
+            
+            // By default all files are stored at a temporary location.
+            // You can set the location either at startup with -DPDB_DIR=/path/to/files/
+            // or hard-code it this way:
+            cache.setPath("/tmp/");
+            
+            FileParsingParameters params = new FileParsingParameters();
+            
+            params.setLoadChemCompInfo(true);
+            cache.setFileParsingParams(params);
+            
+            StructureIO.setAtomCache(cache);
+            
+            Structure structure = StructureIO.getStructure(...);
+</pre>
+
+To enable the second behaviour (slow loading of all chemical components at startup, but no delays later on), use the same code but switch the [ChemCompProvider](http://www.biojava.org/docs/api/org/biojava/bio/structure/io/mmcif/ChemCompProvider.html) implementation in the [ChemCompGroupFactory](http://www.biojava.org/docs/api/org/biojava/bio/structure/io/mmcif/ChemCompGroupFactory.html):
+
+<pre>    
+     ChemCompGroupFactory.setChemCompProvider(new AllChemCompProvider());
+</pre>
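
Note on chemcomp.md: the idea that a group's type follows from the "type" string in its Chemical Component definition can be illustrated with a small stand-alone sketch. This is not BioJava's implementation. The class `GroupTypeSketch` and its `classify` method are hypothetical, and the mapping rules are an assumption based on the common dictionary type strings ("L-PEPTIDE LINKING", "DNA LINKING", "RNA LINKING", "NON-POLYMER").

```java
import java.util.Locale;

/** Hypothetical sketch: map a Chemical Component "type" string to a group category. */
final class GroupTypeSketch {

    // The returned names mirror the group types printed in the tutorial output.
    static String classify(String chemCompType) {
        String type = chemCompType.trim().toUpperCase(Locale.ROOT);
        if (type.contains("PEPTIDE LINKING")) {
            return "amino";        // e.g. "L-PEPTIDE LINKING", as stated for MSE
        }
        if (type.contains("DNA LINKING") || type.contains("RNA LINKING")) {
            return "nucleotide";
        }
        return "hetatm";           // assumption: everything else, e.g. "NON-POLYMER"
    }

    public static void main(String[] args) {
        System.out.println("MSE is a group of type " + classify("L-PEPTIDE LINKING"));
        System.out.println("HOH is a group of type " + classify("NON-POLYMER"));
    }
}
```

The point of the sketch is that the decision depends on the dictionary entry and not on whether the PDB file lists the residue under ATOM or HETATM.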
diff --git a/structure/img/143px-Selenomethionine-from-xtal-3D-balls.png b/structure/img/143px-Selenomethionine-from-xtal-3D-balls.png
diff --git a/structure/mmcif.md b/structure/mmcif.md
@@ -0,0 +1,147 @@
+# How to parse mmCIF files using BioJava
+
+A quick tutorial on how to work with mmCIF files.
+
+## What is mmCIF?
+
+The Protein Data Bank (PDB) has been distributing its archival files as PDB files for a long time. The PDB file format is based on "punchcard"-style rules for storing data in a flat file. With the increasing complexity of the macromolecules that are being resolved experimentally, this file format can no longer represent some of the more complex structures. As such, the wwPDB recently announced the transition from PDB to mmCIF/PDBx as the principal deposition and dissemination file format (see 
+[here](http://www.wwpdb.org/news/news_2013.html#22-May-2013) and 
+[here](http://wwpdb.org/workshop/wgroup.html)). 
+
+The mmCIF file format has been around for some time (see [Westbrook 2000][] and [Westbrook 2003][]), and [BioJava](http://www.biojava.org) has supported it for several years. This tutorial provides a quick introduction to parsing mmCIF files with BioJava.
+
+## The basics
+
+BioJava provides both an mmCIF parser and a data model that reads PDB and mmCIF files into a biologically and chemically meaningful representation (BioJava supports the [Chemical Components Dictionary](http://www.wwpdb.org/ccd.html)). If you don't want to use that data model, you can still use BioJava's file parsers; more on that later. Let's start with the most basic way of loading a protein structure.
+
+## Quick Installation
+
+Before we start, just one quick paragraph on how to get access to BioJava.
+
+BioJava is open source and you can get the code from [GitHub](https://github.com/biojava/biojava). However, it might be easier to use Maven:
+
+BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide.
+
+We provide a BioJava-specific Maven repository at <http://biojava.org/download/maven/>.
+
+You can add the BioJava repository by adding the following XML to your project pom.xml file:
+```xml
+        <repositories>
+            ...
+            <repository>
+                <id>biojava-maven-repo</id>
+                <name>BioJava repository</name>
+                <url>http://www.biojava.org/download/maven/</url>           
+            </repository>
+        </repositories>
+        <dependencies>
+                ...
+                <dependency>
+                        <!-- This imports the latest SNAPSHOT builds from the protein structure modules of BioJava
+                        -->                        
+                        <groupId>org.biojava</groupId>
+                        <artifactId>biojava3-structure</artifactId>
+                        <version>3.0.7-SNAPSHOT</version>
+                </dependency>
+                <!-- other biojava jars as needed -->
+        </dependencies> 
+```
+
+If you run 'mvn package' on your project, the BioJava dependencies will be automatically downloaded and installed for you.
+
+## First steps
+
+The simplest way to load a PDB file is by using the [StructureIO](http://www.biojava.org/docs/api/org/biojava3/structure/StructureIO.html) class.
+
+<pre>
+    Structure structure = StructureIO.getStructure("4HHB");
+    // and let's print out how many atoms are in this structure
+    System.out.println(StructureTools.getNrAtoms(structure));
+</pre>
+
+
+
+BioJava automatically downloaded the PDB file for hemoglobin [4HHB](http://www.rcsb.org/pdb/explore.do?structureId=4HHB) and copied it into a temporary location. This demonstrates two things:
+
++ BioJava can automatically download and install files locally.
++ BioJava by default writes those files into a temporary location (the system temp directory, given by the Java system property "java.io.tmpdir").
+
+If you already have a local PDB installation, you can configure where BioJava should read the files from by setting the PDB_DIR system property:
+
+<pre>
+    -DPDB_DIR=/wherever/you/want/
+</pre>
+
+## From PDB to mmCIF
+
+By default BioJava uses the PDB file format for parsing data. To switch to mmCIF, we can take control of the underlying [AtomCache](http://www.biojava.org/docs/api/org/biojava/bio/structure/align/util/AtomCache.html), which manages your PDB (and also SCOP and CATH) installations.
+
+<pre>
+        AtomCache cache = new AtomCache();
+            
+        cache.setUseMmCif(true);
+            
+        // if you struggled to set the PDB_DIR property correctly in the previous step, 
+        // you could set it manually like this:
+        cache.setPath("/tmp/");
+            
+        StructureIO.setAtomCache(cache);
+            
+        Structure structure = StructureIO.getStructure("4HHB");
+                    
+        // and let's count how many chains are in this structure.
+        System.out.println(structure.getChains().size());
+</pre>
+
+As you can see, the AtomCache will again download the missing mmCIF file for 4HHB in the background. 
+
+## Low level access
+
+If you want to learn how to use the BioJava mmCIF parser to populate your own data structure, let's first take a look at this lower-level code:
+
+<pre>
+        InputStream inStream =  new FileInputStream(fileName);
+ 
+        MMcifParser parser = new SimpleMMcifParser();
+ 
+        SimpleMMcifConsumer consumer = new SimpleMMcifConsumer();
+ 
+        // The consumer builds up the BioJava structure object.
+        // You could also hook in your own consumer and build up your own data model.
+        parser.addMMcifConsumer(consumer);
+ 
+        try {
+            parser.parse(new BufferedReader(new InputStreamReader(inStream)));
+        } catch (IOException e){
+            e.printStackTrace();
+        }
+ 
+        // now get the protein structure.
+        Structure cifStructure = consumer.getStructure();
+</pre>
+
+The parser operates similarly to an XML parser by triggering "events". The [SimpleMMcifConsumer](http://www.biojava.org/docs/api/org/biojava/bio/structure/io/mmcif/SimpleMMcifConsumer.html) listens to new categories being read from the file and then builds up the BioJava data model.
+
+To re-use the parser for your own data model, just implement the [MMcifConsumer](http://www.biojava.org/docs/api/org/biojava/bio/structure/io/mmcif/MMcifConsumer.html) interface and add it to the [SimpleMMcifParser](http://www.biojava.org/docs/api/org/biojava/bio/structure/io/mmcif/SimpleMMcifParser.html).
+<pre>
+        parser.addMMcifConsumer(myOwnConsumerImplementation);
+</pre>
+
+## I loaded a Structure object, what now?
+
+BioJava provides a number of algorithms and visualisation tools that you can use to analyse the structure further or to look at it. Here are a couple of suggestions for further reading:
+
++ [The BioJava Cookbook for protein structures](http://biojava.org/wiki/BioJava:CookBook#Protein_Structure)
++ How does BioJava [represent the content](structure-data-model.md) of a PDB/mmCIF file?
++ [How to calculate a protein structure alignment using BioJava](http://biojava.org/wiki/BioJava:CookBook:PDB:align)
++ [How to work with Groups (AminoAcid, Nucleotide, Hetatom)](http://biojava.org/wiki/BioJava:CookBook:PDB:groups)
+
+
+
+<!-- References -->
+
+
+[Westbrook 2000]: http://www.ncbi.nlm.nih.gov/pubmed/10842738 "Westbrook JD and Bourne PE. STAR/mmCIF: an ontology for macromolecular structure. Bioinformatics 2000 Feb; 16(2) 159-68. pmid:10842738." 
+
+[Westbrook 2003]: http://www.ncbi.nlm.nih.gov/pubmed/12647386 "Westbrook JD and Fitzgerald PM. The PDB format, mmCIF, and other data formats. Methods Biochem Anal 2003; 44 161-79. pmid:12647386."
+
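
Note on mmcif.md: the event-driven design described under "Low level access" (a parser that reads the file and notifies registered consumers, which build up their own data model) can be sketched without BioJava. Everything below is hypothetical: `CategoryConsumerSketch` and `EventParserSketch` are illustrative names, not the real `MMcifConsumer` or `SimpleMMcifParser`, and the toy parser only understands single-line `_category.field value` items. The input values are made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface; NOT the real MMcifConsumer. */
interface CategoryConsumerSketch {
    void newItem(String category, String field, String value);
}

/** Toy event parser: handles only single-line "_category.field value" items. */
final class EventParserSketch {
    private final List<CategoryConsumerSketch> consumers = new ArrayList<>();

    void addConsumer(CategoryConsumerSketch consumer) {
        consumers.add(consumer);
    }

    void parse(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (!line.startsWith("_")) {
                continue; // skip data_ headers, comments, and anything unsupported
            }
            String[] keyAndValue = line.split("\\s+", 2);
            int dot = keyAndValue[0].indexOf('.');
            if (keyAndValue.length < 2 || dot < 0) {
                continue; // no value on this line, or no category separator
            }
            String category = keyAndValue[0].substring(1, dot);
            String field = keyAndValue[0].substring(dot + 1);
            // Fire one "event" per item to every registered consumer.
            for (CategoryConsumerSketch consumer : consumers) {
                consumer.newItem(category, field, keyAndValue[1].trim());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String cif = "data_DEMO\n_cell.length_a 10.0\n_cell.length_b 20.0\n";
        EventParserSketch parser = new EventParserSketch();
        parser.addConsumer((category, field, value) ->
                System.out.println(category + " / " + field + " = " + value));
        parser.parse(new BufferedReader(new StringReader(cif)));
    }
}
```

Because the parser only knows the callback interface, swapping in a different consumer changes the resulting data model without touching the parsing code.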
diff --git a/structure/structure-data-model.md b/structure/structure-data-model.md
@@ -0,0 +1,102 @@
+# The BioJava-structure data model
+
+A biologically and chemically meaningful data representation of PDB/mmCIF.
+
+## The basics   
+
+BioJava at its core is a collection of file parsers and (in some cases) data models to represent frequently used biological data. The protein-structure modules represent macromolecular data in a way that should make it easy to work with. The representation is essentially independent of the underlying file format: the user can choose to work with either PDB or mmCIF files and still get an almost identical data representation.
+
+## The main hierarchy
+
+BioJava provides a flexible data structure for managing protein structural data. The 
+[Structure](http://www.biojava.org/docs/api/org/biojava/bio/structure/Structure.html) class is the main container. 
+
+A Structure has a hierarchy of sub-objects:
+
+<pre>
+Structure
+   |
+   Model(s)
+        |
+        Chain(s)
+            |
+             Group(s) -> Chemical Component Definition
+                 |
+                 Atom(s)
+</pre>
+
+All structure objects contain one or more "models". This means that X-ray structures also contain a "virtual" model, which serves as a container for the chains. The most common way to access chains is via
+
+<pre>
+        List<Chain> chains = structure.getChains();
+</pre>
+
+This works for both NMR and X-ray based structures; by default the first model is accessed.
+
+
+## Working with atoms
+
+There are several ways to access the data contained in a [Structure](http://www.biojava.org/docs/api/org/biojava/bio/structure/Structure.html).
+If you want to directly access an array of [Atoms](http://www.biojava.org/docs/api/org/biojava/bio/structure/Atom.html), you can use the utility class called [StructureTools](http://www.biojava.org/docs/api/org/biojava/bio/structure/StructureTools.html):
+
+<pre>
+
+    // get all C-alpha atoms in the structure
+    Atom[] caAtoms = StructureTools.getAtomCAArray(structure);
+</pre>
+
+Alternatively, you can also access atoms via their parent group.
+
+## Working with groups
+
+The [Group](http://www.biojava.org/docs/api/org/biojava/bio/structure/Group.html) interface defines all methods common to a group of atoms. There are 3 types of Groups:
+
+* [AminoAcid](http://www.biojava.org/docs/api/org/biojava/bio/structure/AminoAcid.html)
+* [Nucleotide](http://www.biojava.org/docs/api/org/biojava/bio/structure/NucleotideImpl.html) 
+* [Hetatom](http://www.biojava.org/docs/api/org/biojava/bio/structure/HetatomImpl.html) 
+
+In order to get all amino acids that have been observed in a PDB chain, you can use the following utility method:
+
+<pre>
+            Chain chain = structure.getChainByPDB("A");
+            List<Group> groups = chain.getAtomGroups("amino");
+            for (Group group : groups) {
+                AminoAcid aa = (AminoAcid) group;
+
+                // do something amino acid specific, e.g. print the secondary structure assignment
+                System.out.println(aa + " " + aa.getSecStruc());
+            }
+</pre>
+
+
+In a similar way you can access all nucleotide groups by
+<pre>
+            chain.getAtomGroups("nucleotide");
+</pre>
+
+The Hetatom groups are accessed in a similar fashion:
+<pre>
+            chain.getAtomGroups("hetatm");
+</pre>
+
+
+Since all 3 types of groups implement the Group interface, you can also iterate over all groups and check for the instance type:
+
+<pre>
+            List<Group> allgroups = chain.getAtomGroups();
+            for (Group group : allgroups) {
+                if ( group instanceof AminoAcid) {
+                    AminoAcid aa = (AminoAcid) group;
+                    System.out.println(aa.getSecStruc());
+                }
+            }
+</pre>
+
+
+
+
+
+
+
+
+
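
Note on structure-data-model.md: the statement that every structure holds one or more models, and that `getChains()` without arguments reads the first model, can be sketched with plain Java collections. `StructureSketch` is a hypothetical stand-in that stores only chain IDs. It is not BioJava's `Structure` interface, and the method names are chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a structure: a list of models, each a list of chain IDs. */
final class StructureSketch {
    private final List<List<String>> models = new ArrayList<>();

    void addModel(List<String> chainIds) {
        models.add(new ArrayList<>(chainIds));
    }

    int nrModels() {
        return models.size();
    }

    // Chains of one model, addressed by zero-based index.
    List<String> getChains(int modelIndex) {
        return models.get(modelIndex);
    }

    // Convenience accessor: defaults to the first model.
    List<String> getChains() {
        return getChains(0);
    }

    public static void main(String[] args) {
        StructureSketch xray = new StructureSketch();
        xray.addModel(Arrays.asList("A", "B", "C", "D")); // X-ray: a single "virtual" model
        System.out.println(xray.getChains());

        StructureSketch nmr = new StructureSketch();
        nmr.addModel(Arrays.asList("A"));                 // NMR: several models
        nmr.addModel(Arrays.asList("A"));
        System.out.println(nmr.nrModels() + " models, first model has chains " + nmr.getChains());
    }
}
```

With this layout the same `getChains()` call works whether the structure has one model or many, which is the convenience the tutorial describes.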
