Commit c463c47

dicknetherlands authored and andreasprlic committed
Change to wiki page
1 parent 1db8441 commit c463c47

2 files changed

Lines changed: 144 additions & 3 deletions

_wikis/BioJava3_Design.md

Lines changed: 99 additions & 1 deletion
@@ -27,7 +27,8 @@ Basic principles
   extensions to BJ3 in order to make old code reusable).
 - Use of JavaBeans concepts wherever possible, e.g. getters/setters.
   This would enhance Java EE compliance and improve integration into
-  larger things.
+  larger things. DON'T do this where immutability is key to efficiency
+  though, like with Strings.
 - Fully commented code in LOTS of detail INCLUDING package-level docs
   AND wiki-docs such as the cookbook.
 - Use of annotations for things like database mappings.
@@ -36,6 +37,19 @@ Basic principles
   useful things such as protein structures or sequence traces. Swing
   code is impossible to write in a way that will integrate fully with
   each different individual's own program requirements.
+- Keep It Simple Stupid (KISS) - don't object-ify things unless
+  absolutely necessary. Sequences are perfectly happy as Strings
+  unless you want to do complex things like store base quality
+  information, and only at that point should you want to convert them
+  into more complex object models.
+- Separation of functionality - don't make sequences load features,
+  and don't make features load their sequence by default. This saves
+  memory and allows work to be done independently on the specific
+  parts of interest.
+- Always ALWAYS correctly implement equals, compareTo, hashCode, and
+  Serializable wherever possible.
+- Any general-use methods to be exposed via SPI (e.g.
+  getTopBlastHit()).
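
Note on the added principles: the immutability caveat and the equals/compareTo/hashCode/Serializable rule can be illustrated together with a small value class. This is only a sketch under assumptions: `SequenceId`, its two fields, and the accession strings in the usage note are hypothetical and are not part of any BioJava API.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical immutable value class (illustrative name and fields only).
final class SequenceId implements Comparable<SequenceId>, Serializable {
    private static final long serialVersionUID = 1L;

    // Final fields and no setters: immutability is kept deliberately here.
    private final String namespace;
    private final String accession;

    SequenceId(String namespace, String accession) {
        this.namespace = Objects.requireNonNull(namespace);
        this.accession = Objects.requireNonNull(accession);
    }

    String getNamespace() { return namespace; }
    String getAccession() { return accession; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SequenceId)) return false;
        SequenceId other = (SequenceId) o;
        return namespace.equals(other.namespace) && accession.equals(other.accession);
    }

    // Uses the same fields as equals, so equal objects share a hash code.
    @Override
    public int hashCode() {
        return Objects.hash(namespace, accession);
    }

    // Ordering is consistent with equals: zero only when both fields match.
    @Override
    public int compareTo(SequenceId other) {
        int byNamespace = namespace.compareTo(other.namespace);
        return byNamespace != 0 ? byNamespace : accession.compareTo(other.accession);
    }
}
```

With this sketch, two instances built from the same made-up values (for example `new SequenceId("genbank", "X00001")`) are equal, share a hash code, and compare as zero, so they behave predictably in hash-based and sorted collections.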
 
 Compromises and Unfinished bits
 -------------------------------
@@ -79,3 +93,87 @@ These can be broken down into the following modules:
 - Enriched sequence -\> Sequence alignments
 - Enriched sequence -\> Protein structures
 
+Module structure
+----------------
+
+- BioJava3 module
+    - API module contains object builder signature (builder builds
+      objects from events, much like a SAX parser does).
+    - Listeners can choose to cache data in memory, on disk, keep a
+      pointer to the source and read it back later, or whatever. Up to
+      them. Optimisation becomes easier this way as listeners can
+      choose exactly what to keep in memory and what not to.
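
Note on the BioJava3 module bullets: one possible shape for an event-driven object builder is sketched below. The interface name `ObjectBuilder`, its three methods, and `InMemoryRecordBuilder` are assumptions made for illustration; the commit does not define any signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder contract: a parser fires these events in order.
interface ObjectBuilder<T> {
    void startRecord();
    void field(String key, String value);
    T endRecord();
}

// One caching strategy: keep every field in memory. A different listener
// could instead remember only an offset into the source and re-read later.
final class InMemoryRecordBuilder implements ObjectBuilder<List<String>> {
    private List<String> current;

    @Override
    public void startRecord() {
        current = new ArrayList<>();
    }

    @Override
    public void field(String key, String value) {
        current.add(key + "=" + value);
    }

    @Override
    public List<String> endRecord() {
        List<String> done = current;
        current = null; // release the reference once the record is handed over
        return done;
    }
}
```

Because the parser only sees the `ObjectBuilder` interface in this sketch, the choice of what to keep in memory stays entirely with the listener implementation.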
+
+<!-- -->
+
+- Sequence module
+    - API module defines entire BioJava sequence object model (similar
+      to current one but allowing for non-symbol based sequences and
+      separation of sequences from features).
+    - API has subclasses of object builders for sequences. Builder can
+      specify it is only interested in certain events, and parsers can
+      query this to optimise parsing by skipping irrelevant sections.
+    - Conversion to symbol-based sequences on demand to/from strings.
+    - Simplified alphabet concept, made easier by avoiding use of XMLs
+      to configure them.
+    - WATCH OUT for localised strings when manipulating sequences.
+    - WATCH OUT for singletons and multi-processor environments.
+      Consider using JNDI if they are absolutely necessary.
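
Note on the "localised strings" warning: one concrete way this can bite String-based sequences is locale-sensitive case conversion. The helper below is a minimal sketch; the class name `LocaleSafeCase` is hypothetical, while `String.toUpperCase(Locale)` and `Locale.ROOT` are standard JDK API.

```java
import java.util.Locale;

// Hypothetical helper for case-normalising sequence strings.
final class LocaleSafeCase {
    private LocaleSafeCase() { }

    // Upper-cases residue letters with locale-neutral rules, so the result
    // does not depend on the default locale of the machine running the code.
    static String toUpper(String sequence) {
        return sequence.toUpperCase(Locale.ROOT);
    }
}
```

For example, under a Turkish locale a lower-case `i` (isoleucine in a protein string) upper-cases to the dotted capital `İ` (U+0130) rather than `I`, so `"mki".toUpperCase(Locale.forLanguageTag("tr"))` no longer matches a plain ASCII alphabet, whereas `LocaleSafeCase.toUpper("mki")` yields `MKI` everywhere.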
+
+<!-- -->
+
+- Feature module
+    - API module defines entire BioJava feature object model (similar
+      to current one but allowing for separation of sequences from
+      features).
+    - API has subclasses of object builders for features. Builder can
+      specify it is only interested in certain events, and parsers can
+      query this to optimise parsing by skipping irrelevant sections.
+    - Allow feature naming using any of the standard ontologies.
+
+<!-- -->
+
+- IO module
+    - API module contains basic read() and write() function
+      signatures.
+    - API has concept of RecordSource which is either a file, a group
+      of files (e.g. directory), a database, a web service, etc. - all
+      of which implement some kind of RecordProvider interface for
+      iterating over objects. Those objects can be sequences,
+      features, etc.
+    - Implementation module - one per sequence format - e.g. Genbank,
+      FASTA, etc.
+    - Use of event listeners to fire events at an object builder.
+    - Each implementation has default object model and builder that
+      exactly matches that format, along with a converter that will
+      'read' the object model and fire events as if it was being read
+      again (to allow for conversion to other formats via the listener
+      framework).
+    - BioSQL is an IO module. So are other dbs, e.g. Entrez, ebEye.
+    - A RecordSearch API to be implemented to search for matching
+      records in any RecordSource.
+    - LazyLoading where possible.
+    - Input AND Output achieved by SAX-like event firing. Reading a
+      file fires events at an object builder containing bits of data
+      as they are read. Writing a file causes an object parser to
+      parse an object and fire events at a file writer. Any listener
+      can listen to any other source of events, so you can
+      short-circuit file conversion by reading GenBank and specifying
+      the reader-listener as an instance of a FASTA writer-listener.
+    - RecordSources to be versioned to cope with changing formats over
+      time.
+    - Each IO module to be entirely independent and agnostic of the
+      way it is used. This allows modules to optimise themselves for
+      random access etc., if they see fit. By using the methods on the
+      API to check what the listener is interested in receiving, they
+      can also cut out the work of parsing uninteresting stuff.
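
Note on the IO module bullets: the GenBank-to-FASTA short-circuit can be sketched by plugging a writer-side listener directly into a reader, with no intermediate object model. Everything named here (`SequenceListener`, `FastaWriterListener`, `ToyGenBankReader`) is hypothetical, and the reader handles only a toy subset of the GenBank flat-file layout (the `LOCUS` line, the `ORIGIN` block and the `//` terminator).

```java
// Hypothetical event contract shared by readers and writers.
interface SequenceListener {
    void startSequence(String name);
    void residues(String chunk);
    void endSequence();
}

// Writer-side listener: turns incoming events straight into FASTA text.
final class FastaWriterListener implements SequenceListener {
    private final StringBuilder out = new StringBuilder();

    @Override public void startSequence(String name) { out.append('>').append(name).append('\n'); }
    @Override public void residues(String chunk) { out.append(chunk); }
    @Override public void endSequence() { out.append('\n'); }

    String result() { return out.toString(); }
}

// Toy reader: fires events while scanning the text line by line.
final class ToyGenBankReader {
    private ToyGenBankReader() { }

    static void read(String text, SequenceListener listener) {
        boolean inOrigin = false;
        for (String line : text.split("\n")) {
            if (line.startsWith("LOCUS")) {
                // The locus name is the second whitespace-separated token.
                listener.startSequence(line.trim().split("\\s+")[1]);
            } else if (line.startsWith("ORIGIN")) {
                inOrigin = true;
            } else if (line.startsWith("//")) {
                inOrigin = false;
                listener.endSequence();
            } else if (inOrigin) {
                // Drop position numbers and spacing, keep residues only.
                listener.residues(line.replaceAll("[\\d\\s]", ""));
            }
            // All other lines are ignored in this sketch.
        }
    }
}
```

Calling `ToyGenBankReader.read(text, new FastaWriterListener())` converts the record as it is read. Swapping in a different `SequenceListener`, such as an object builder, would reuse the same reader unchanged.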
+
+<!-- -->
+
+- Other modules
+    - Ontology handling.
+    - Protein structure
+    - Microarray analysis
+    - Phylogenetics
+    - etc. etc. etc.
+

_wikis/BioJava3_Design.mediawiki

Lines changed: 45 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,11 +12,15 @@ This document was based on comments made on the following pages:
@@ -12,11 +12,15 @@ This document was based on comments made on the following pages:
 * Modular design without any cyclic dependencies, with separate JARs for key components (IO, databases, genetic algorithms, sequence manipulation, etc.)
 * Separation of APIs from implementation code by means of packages.
 * Base package name: org.biojava3 (to prevent clashes with org.biojava and org.biojavax, both of which will have backwards-compatibility extensions to BJ3 in order to make old code reusable).
-* Use of JavaBeans concepts wherever possible, e.g. getters/setters. This would enhance Java EE compliance and improve integration into larger things.
+* Use of JavaBeans concepts wherever possible, e.g. getters/setters. This would enhance Java EE compliance and improve integration into larger things. DON'T do this where immutability is key to efficiency though, like with Strings.
 * Fully commented code in LOTS of detail INCLUDING package-level docs AND wiki-docs such as the cookbook.
 * Use of annotations for things like database mappings.
 * A consistent coding style to be developed and applied.
 * No Swing code to be included, but graphics code is OK for obviously useful things such as protein structures or sequence traces. Swing code is impossible to write in a way that will integrate fully with each different individual's own program requirements.
+* Keep It Simple Stupid (KISS) - don't object-ify things unless absolutely necessary. Sequences are perfectly happy as Strings unless you want to do complex things like store base quality information, and only at that point should you want to convert them into more complex object models.
+* Separation of functionality - don't make sequences load features, and don't make features load their sequence by default. This saves memory and allows work to be done independently on the specific parts of interest.
+* Always ALWAYS correctly implement equals, compareTo, hashCode, and Serializable wherever possible.
+* Any general-use methods to be exposed via SPI (e.g. getTopBlastHit()).
 
 ==Compromises and Unfinished bits==
 * TestNG was suggested instead of JUnit, but knowledge of this tool is not so widespread and this may impact on quality of testing.
@@ -44,4 +48,43 @@ These can be broken down into the following modules:
 * Sequence similarity -> Sequence similarity IO (Blast, Fasta, etc.)
 * Plain sequence -> Plain sequence IO (Genbank, FASTA, etc.)
 * Enriched sequence -> Sequence alignments
-* Enriched sequence -> Protein structures
+* Enriched sequence -> Protein structures
+
+==Module structure==
+
+* BioJava3 module
+** API module contains object builder signature (builder builds objects from events, much like a SAX parser does).
+** Listeners can choose to cache data in memory, on disk, keep a pointer to the source and read it back later, or whatever. Up to them. Optimisation becomes easier this way as listeners can choose exactly what to keep in memory and what not to.
+
+* Sequence module
+** API module defines entire BioJava sequence object model (similar to current one but allowing for non-symbol based sequences and separation of sequences from features).
+** API has subclasses of object builders for sequences. Builder can specify it is only interested in certain events, and parsers can query this to optimise parsing by skipping irrelevant sections.
+** Conversion to symbol-based sequences on demand to/from strings.
+** Simplified alphabet concept, made easier by avoiding use of XMLs to configure them.
+** WATCH OUT for localised strings when manipulating sequences.
+** WATCH OUT for singletons and multi-processor environments. Consider using JNDI if they are absolutely necessary.
+
+* Feature module
+** API module defines entire BioJava feature object model (similar to current one but allowing for separation of sequences from features).
+** API has subclasses of object builders for features. Builder can specify it is only interested in certain events, and parsers can query this to optimise parsing by skipping irrelevant sections.
+** Allow feature naming using any of the standard ontologies.
+
+* IO module
+** API module contains basic read() and write() function signatures.
+** API has concept of RecordSource which is either a file, a group of files (e.g. directory), a database, a web service, etc. - all of which implement some kind of RecordProvider interface for iterating over objects. Those objects can be sequences, features, etc.
+** Implementation module - one per sequence format - e.g. Genbank, FASTA, etc.
+** Use of event listeners to fire events at an object builder.
+** Each implementation has default object model and builder that exactly matches that format, along with a converter that will 'read' the object model and fire events as if it was being read again (to allow for conversion to other formats via the listener framework).
+** BioSQL is an IO module. So are other dbs, e.g. Entrez, ebEye.
+** A RecordSearch API to be implemented to search for matching records in any RecordSource.
+** LazyLoading where possible.
+** Input AND Output achieved by SAX-like event firing. Reading a file fires events at an object builder containing bits of data as they are read. Writing a file causes an object parser to parse an object and fire events at a file writer. Any listener can listen to any other source of events, so you can short-circuit file conversion by reading GenBank and specifying the reader-listener as an instance of a FASTA writer-listener.
+** RecordSources to be versioned to cope with changing formats over time.
+** Each IO module to be entirely independent and agnostic of the way it is used. This allows modules to optimise themselves for random access etc., if they see fit. By using the methods on the API to check what the listener is interested in receiving, they can also cut out the work of parsing uninteresting stuff.
+
+* Other modules
+** Ontology handling.
+** Protein structure
+** Microarray analysis
+** Phylogenetics
+** etc. etc. etc.
