
<p><strong>Not current</strong></p>

<p>The content on this page was used during the development of BioJava 3.
BioJava 3 was released on December 28th, 2010. The latest release
is available from <biojava:download></biojava:download>.</p>

<h2 id="implementation">Implementation</h2>

<p>For information on the current status of the BioJava 3 implementation, go
to <a href="BioJava3_project" title="wikilink">BioJava3_project</a>.</p>

<h2 id="references">References</h2>

<p>This document was based on comments made on the following pages:</p>

<ul>

<li><a href="http://biojava.org/wiki/BioJava3_Proposal">http://biojava.org/wiki/BioJava3_Proposal</a></li>

<li><a href="http://biojava.org/wiki/Talk:BioJava3_Proposal">http://biojava.org/wiki/Talk:BioJava3_Proposal</a></li>

<li><a href="http://biojava.org/wiki/UsageAnalysis">http://biojava.org/wiki/UsageAnalysis</a></li>

<li><a href="http://www.derkholm.net/svn/repos/bjv2/website/docs/index.html">http://www.derkholm.net/svn/repos/bjv2/website/docs/index.html</a></li>

</ul>

<h2 id="basic-principles">Basic principles</h2>

<ul>

<li>BioJava3 (BJ3) will freely incorporate features from Java 6.</li>

<li>Maven will be used to build the project.</li>

<li>Full unit testing for every aspect from the ground up using JUnit.</li>

<li>Modular design without any cyclic dependencies, with separate JARs

for key components (IO, databases, genetic algorithms, sequence

manipulation, etc.)</li>

<li>Separation of APIs from implementation code by means of packages.</li>

<li>Base package name: org.biojava3 (to prevent clashes with org.biojava

and org.biojavax, both of which will have backwards-compatibility

extensions to BJ3 in order to make old code reusable).</li>

<li>Use of JavaBeans concepts wherever possible, e.g. getters/setters.
This would enhance Java EE compliance and make BJ3 easier to integrate
into larger applications. DON’T do this where immutability is key to
efficiency though, as it is with Strings.</li>

<li>Fully commented code in LOTS of detail INCLUDING package-level docs

AND wiki-docs such as the cookbook.</li>

<li>Use of annotations for things like database mappings.</li>

<li>A consistent coding style to be developed and applied.</li>

<li>No Swing code to be included, but graphics code is OK for obviously

useful things such as protein structures or sequence traces. Swing

code is impossible to write in a way that will integrate fully with

each different individual’s own program requirements.</li>

<li>Keep It Simple Stupid (KISS) - don’t object-ify things unless

absolutely necessary. Sequences are perfectly happy as Strings

unless you want to do complex things like store base quality

information, and only at that point should you want to convert them

into more complex object models.</li>

<li>Separation of functionality - don’t make sequences load features,

and don’t make features load their sequence by default. This saves

memory and allows work to be done independently on the specific

parts of interest.</li>

<li>Always ALWAYS correctly implement equals, compareTo, hashCode, and

Serializable wherever possible.</li>

<li>Any general-use methods to be exposed via SPI (e.g.

getTopBlastHit()).</li>

<li>The source code license will be the GNU Lesser General Public

License (LGPL) “version 2.1 or any later version”.</li>

<li>In general BJ3 exceptions should be unchecked, i.e. RuntimeExceptions.
They should also be well documented and give useful messages. It
should be up to the developer to decide what to catch and what not
to. In the current BioJava there are way too many exceptions that
can’t really happen under any normal circumstances. We should only
need to think about exceptions in exceptional circumstances.</li>

<li>The default Java logging API (java.util.logging) should be used
extensively. This will allow developers to fine-tune debugging output.
The core module should have a logging helper with static convenience
methods, so that logging calls can be used liberally via static
imports.</li>

</ul>
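<p>Several of the principles above (value semantics via equals, hashCode,
compareTo and Serializable; unchecked exceptions with useful messages; a
static logging helper) can be sketched together. This is only a minimal
illustration under stated assumptions, not BJ3 code: the names Log,
InvalidRangeException and SequenceRange are hypothetical, and the only real
APIs used are java.util.logging, Comparable and Serializable.</p>

```java
import java.io.Serializable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical logging helper with a static convenience method. */
final class Log {
    private static final Logger LOGGER = Logger.getLogger("org.biojava3.sketch");

    private Log() { }

    /** Logs at FINE level, which is hidden unless the logger is configured to show it. */
    static void debug(String message) {
        LOGGER.log(Level.FINE, message);
    }
}

/** Hypothetical unchecked exception that carries a useful message. */
class InvalidRangeException extends RuntimeException {
    private static final long serialVersionUID = 1L;

    InvalidRangeException(String message) {
        super(message);
    }
}

/** Hypothetical immutable value object: a 1-based, inclusive range on a sequence. */
final class SequenceRange implements Comparable<SequenceRange>, Serializable {
    private static final long serialVersionUID = 1L;

    private final int start;
    private final int end;

    SequenceRange(int start, int end) {
        if (start < 1 || end < start) {
            throw new InvalidRangeException(
                "Expected 1 <= start <= end but got start=" + start + ", end=" + end);
        }
        this.start = start;
        this.end = end;
        Log.debug("Created range " + this);
    }

    int getStart() { return start; }

    int getEnd() { return end; }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof SequenceRange)) {
            return false;
        }
        SequenceRange that = (SequenceRange) other;
        return start == that.start && end == that.end;
    }

    /** Uses the same fields as equals, so equal ranges share a hash code. */
    @Override
    public int hashCode() {
        return 31 * start + end;
    }

    /** Orders by start, then end, which keeps compareTo consistent with equals. */
    @Override
    public int compareTo(SequenceRange that) {
        int byStart = Integer.compare(start, that.start);
        return byStart != 0 ? byStart : Integer.compare(end, that.end);
    }

    @Override
    public String toString() {
        return start + ".." + end;
    }
}
```

<p>Because InvalidRangeException is unchecked, callers of
<code>new SequenceRange(3, 10)</code> are not forced to write a try/catch
block, yet a bad call such as <code>new SequenceRange(5, 2)</code> still
fails with a message that names the offending values. In a named package,
callers could static-import the helper's debug method.</p>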

<h2 id="compromises-and-unfinished-bits">Compromises and Unfinished bits</h2>

<ul>

<li>TestNG was suggested instead of JUnit, but knowledge of this tool is
not so widespread, which may affect the quality of testing.</li>

<li>A tool for analysing comment coverage and coding style was

suggested, but none have been identified. Please amend this document

with the names of any good ones you know.</li>

</ul>

<p>[Jalopy <a href="http://jalopy.sourceforge.net/">http://jalopy.sourceforge.net/</a>] - can be used as an Eclipse
plugin or Ant task.<br />
[Cobertura <a href="http://cobertura.sf.net">http://cobertura.sf.net</a>] - can be used to assess JUnit test
coverage.<br />
[FindBugs <a href="http://findbugs.sf.net">http://findbugs.sf.net</a>] - does static analysis of code (also
runnable as an Eclipse plugin or Ant task).</p>

<h2 id="priorities">Priorities</h2>

<p>Andreas’ very useful Usage Analysis page shows the most frequently

requested documentation. In the absence of any real usage statistics, we

must assume that the things people most often want to read about are the

things that people most often use. (It could also be said that the

things that people most read about are the things that work least well

in the present code… but we shall ignore that for now…).</p>

<p>Here are the priorities based on Andreas’ work:</p>

<ul>

<li>How to get an Alphabet</li>

<li>How to make a Sequence Object from a String or make a Sequence

Object back into a String</li>

<li>How to parse a Blast output</li>

<li>How to read sequences from a Fasta file</li>

<li>How to read a GenBank, SwissProt or EMBL file</li>

<li>How to generate a global or local alignment with the
Needleman-Wunsch or the Smith-Waterman algorithm</li>

<li>How to read a protein structure - PDB file</li>

<li>How to export a sequence to FASTA</li>
<li>How to view a sequence in a GUI</li>

<li>How to parse a Fasta database search output file</li>

</ul>

<p>These can be broken down into the following modules:</p>

<ul>

<li>Plain sequence &lt;-&gt; Enriched sequence</li>

<li>Sequence similarity -> Sequence similarity IO (Blast, Fasta, etc.)</li>

<li>Plain sequence -> Plain sequence IO (Genbank, FASTA, etc.)</li>

<li>Enriched sequence -> Sequence alignments</li>

<li>Enriched sequence -> Protein structures</li>

</ul>

<h2 id="module-structure">Module structure</h2>

<ul>

<li>BioJava3 module

<ul>

<li>API module contains object builder signature (builder builds

objects from events, much like a SAX parser does).</li>

<li>Listeners can choose to cache data in memory, on disk, keep a

pointer to the source and read it back later, or whatever. Up to

them. Optimisation becomes easier this way as listeners can

choose exactly what to keep in memory and what not to.</li>

</ul>

</li>

</ul>

<ul>

<li>Sequence module

<ul>

<li>API module defines entire BioJava sequence object model (similar

to current one but allowing for non-symbol based sequences and

separation of sequences from features).</li>

<li>API has subclasses of object builders for sequences. Builder can

specify it is only interested in certain events, and parsers can

query this to optimise parsing by skipping irrelevant sections.</li>

<li>Conversion to symbol-based sequences on demand to/from strings.</li>

<li>Simplified alphabet concept, made easier by avoiding the use of XML
files to configure alphabets.</li>

<li>WATCH OUT for localised strings when manipulating sequences.</li>

<li>WATCH OUT for singletons and multi-processor environments.

Consider using JNDI if they are absolutely necessary.</li>

</ul>

</li>

</ul>
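<p>The warning about localised strings can be made concrete with case
conversion. The sketch below assumes sequences are held as plain Strings, as
suggested under Basic principles; the class and method names are
hypothetical. It relies on documented behaviour of
<code>String.toUpperCase(Locale)</code>: under Turkish rules a lower-case
"i" becomes a dotted capital I (U+0130) rather than "I", which would corrupt
an isoleucine code in a protein sequence.</p>

```java
import java.util.Locale;

/** Hypothetical helper showing locale-independent case conversion. */
final class LocaleSafeCase {
    private LocaleSafeCase() { }

    /** Upper-cases sequence text with fixed rules, whatever the default locale is. */
    static String toUpperSequence(String sequence) {
        return sequence.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String protein = "mki";

        // Turkish rules map 'i' to U+0130, so the result is not "MKI".
        System.out.println(protein.toUpperCase(turkish).equals("MKI")); // false

        // Locale.ROOT gives the expected plain ASCII result.
        System.out.println(toUpperSequence(protein).equals("MKI"));     // true
    }
}
```

<p>The no-argument <code>toUpperCase()</code> uses the JVM's default locale,
so the same code can behave differently on different users' machines unless a
locale is passed explicitly.</p>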

<ul>

<li>Feature module

<ul>

<li>API module defines entire BioJava feature object model (similar

to current one but allowing for separation of sequences from

features).</li>

<li>API has subclasses of object builders for features. Builder can

specify it is only interested in certain events, and parsers can

query this to optimise parsing by skipping irrelevant sections.</li>

<li>Allow feature naming using any of the standard ontologies.</li>

</ul>

</li>

</ul>
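<p>The idea that a builder declares which events it wants, and that a parser
queries this to skip irrelevant sections, might look roughly like the sketch
below. Everything here is an assumption made for illustration: the
RecordEvent enum, the RecordBuilder interface, and the made-up
"TYPE: value" line format are not taken from any BioJava API.</p>

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical event types; not part of any BioJava API. */
enum RecordEvent { NAME, SEQUENCE, FEATURE }

/** Hypothetical builder contract: declare interest, then receive events. */
interface RecordBuilder {
    Set<RecordEvent> interestedIn();

    void onEvent(RecordEvent type, String value);
}

/** Collects feature events only. */
final class FeatureOnlyBuilder implements RecordBuilder {
    final List<String> features = new ArrayList<>();

    @Override
    public Set<RecordEvent> interestedIn() {
        return EnumSet.of(RecordEvent.FEATURE);
    }

    @Override
    public void onEvent(RecordEvent type, String value) {
        features.add(value);
    }
}

/** Parses a made-up "TYPE: value" line format, one event per line. */
final class ToyRecordParser {
    /** Counts how many values were actually extracted and delivered. */
    int valuesExtracted = 0;

    void parse(List<String> lines, RecordBuilder builder) {
        // Ask the builder once, up front, which events it cares about.
        Set<RecordEvent> wanted = builder.interestedIn();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in line: " + line);
            }
            RecordEvent type = RecordEvent.valueOf(line.substring(0, colon));
            if (!wanted.contains(type)) {
                continue; // skip extracting a value nobody asked for
            }
            valuesExtracted++;
            builder.onEvent(type, line.substring(colon + 1).trim());
        }
    }
}
```

<p>In this toy format the saving is trivial, but in a real flat-file format
the skipped step could be the expensive part, such as assembling a long
sequence block when the caller only wants the feature table.</p>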

<ul>

<li>IO module

<ul>

<li>API module contains basic read() and write() function

signatures.</li>

<li>API has concept of RecordSource which is either a file, a group

of files (e.g. directory), a database, a web service, etc. - all

of which implement some kind of RecordProvider interface for

iterating over objects. Those objects can be sequences,

features, etc.</li>

<li>Implementation module - one per sequence format - e.g. Genbank,

FASTA, etc.</li>

<li>Use of event listeners to fire events at an object builder.</li>

<li>Each implementation has a default object model and builder that
exactly matches that format, along with a converter that will
‘read’ the object model and fire events as if it were being read
again (to allow for conversion to other formats via the listener
framework).</li>

<li>BioSQL is an IO module. So are other dbs, e.g. Entrez, ebEye.</li>

<li>A RecordSearch API to be implemented to search for matching

records in any RecordSource.</li>

<li>LazyLoading where possible.</li>

<li>Input AND Output achieved by SAX-like event firing. Reading a
file fires events, containing bits of data as they are read, at an
object builder. Writing a file causes an object parser to
parse an object and fire events at a file writer. Any listener
can listen to any source of events, so you can
short-circuit file conversion by reading GenBank with a FASTA
writer-listener registered as the reader’s listener.</li>

<li>RecordSources to be versioned to cope with changing formats over

time.</li>

<li>Each IO module to be entirely independent and agnostic of the

way it is used. This allows modules to optimise themselves for

random access etc., if they see fit. By using the methods on the

API to check what the listener is interested in receiving, they

can also cut out the work of parsing uninteresting stuff.</li>

</ul>

</li>

</ul>
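<p>The SAX-like design described for the IO module, where a reader fires
events at whatever listener it is given, can be sketched as follows. This is
an illustration under assumptions rather than the planned API: the
SequenceListener interface and all class names are hypothetical, and a
made-up "name&lt;TAB&gt;residues" line format stands in for a real input
format such as GenBank to keep the example short.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical listener contract for sequence events. */
interface SequenceListener {
    void startRecord(String name);

    void sequenceChunk(String residues);

    void endRecord();
}

/** Reads the made-up "name<TAB>residues" format and fires events. */
final class TabDelimitedReader {
    void read(String text, SequenceListener listener) {
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            int tab = line.indexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("Expected name<TAB>residues: " + line);
            }
            listener.startRecord(line.substring(0, tab));
            listener.sequenceChunk(line.substring(tab + 1));
            listener.endRecord();
        }
    }
}

/** Listener that builds in-memory objects (here simply a name-to-residues map). */
final class MapBuilderListener implements SequenceListener {
    final Map<String, String> sequences = new LinkedHashMap<>();
    private String currentName;
    private final StringBuilder residues = new StringBuilder();

    @Override
    public void startRecord(String name) {
        currentName = name;
        residues.setLength(0);
    }

    @Override
    public void sequenceChunk(String chunk) {
        residues.append(chunk);
    }

    @Override
    public void endRecord() {
        sequences.put(currentName, residues.toString());
    }
}

/** Listener that writes FASTA text directly, without building any objects. */
final class FastaWriterListener implements SequenceListener {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startRecord(String name) {
        out.append('>').append(name).append('\n');
    }

    @Override
    public void sequenceChunk(String chunk) {
        out.append(chunk).append('\n');
    }

    @Override
    public void endRecord() { }

    String result() {
        return out.toString();
    }
}

final class EventPipelineDemo {
    public static void main(String[] args) {
        FastaWriterListener writer = new FastaWriterListener();
        // Format conversion: the reader's events go straight to the writer.
        new TabDelimitedReader().read("seq1\tACGT\nseq2\tGGCC\n", writer);
        System.out.print(writer.result());
    }
}
```

<p>The reader never knows whether its events end up in an object model or
in another file format, which is what makes the short-circuit conversion
possible: swapping MapBuilderListener for FastaWriterListener changes the
outcome without touching the reader.</p>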

<ul>

<li>Other modules

<ul>

<li>Ontology handling</li>

<li>Protein structure</li>

<li>Microarray analysis</li>

<li>Phylogenetics</li>

</ul>

</li>

</ul>

<h2 id="use-cases">Use cases</h2>

<p>It is planned to document BioJava in parallel with development. To do
this, we want to drive development from a set of <a href="BioJava 3 Use Cases" title="wikilink">use
cases</a>.</p>
