<h2 id="biojavax-is-not-biojava-3-is-not-biojavax">BioJavaX is not BioJava 3 is not BioJavaX.</h2>
<p>BioJavaX is an extension to the existing BioJava 1 or BioJava Legacy
project. Anything written with BioJava will work with BioJavaX, and vice
versa.</p>
<p>org.biojavax is to org.biojava as javax is to java.</p>
<p>The BioJava3 project is a completely new project which intends to
rewrite everything in BioJava from scratch, based around a new set of
object designs and concepts. It is entirely incompatible with the
existing BioJava project.</p>
<p>Therefore BioJavaX is not BioJava 3, and has nothing to do with it.
Please don’t get them confused!</p>
<h2 id="what-didnt-change">What didn’t change?</h2>
<h3 id="existing-interfaces">Existing interfaces.</h3>
<p>Backwards-compatibility is always an issue when a major new version of a
piece of software is released.</p>
<p>BioJavaX addresses this by keeping all the new classes and interfaces
tucked away inside their own special package, org.biojavax. None of the
existing interfaces were modified in any way, so any code which depends
on them will not see any difference.</p>
<p>Apart from ongoing bugfixes, the way in which the existing classes work
also has not changed.</p>
<p>The new interfaces introduced in BioJavaX extend those present in the
existing BioJava packages. This allows new BioJavaX-derived objects to
be passed to legacy code and still be understood.</p>
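<p>The compatibility argument above can be illustrated with a minimal,
self-contained sketch. The names PlainRecord, RichRecord and
CompatibilitySketch are hypothetical stand-ins invented for this example;
they are not BioJava or BioJavaX types.</p>

```java
// Hypothetical stand-ins for a legacy interface and its enriched extension.
interface PlainRecord {
    String getName();
}

interface RichRecord extends PlainRecord {
    int getVersion(); // extra information that only rich objects carry
}

class CompatibilitySketch {
    // "Legacy" code: written against the old interface only.
    static String describe(PlainRecord r) {
        return "record " + r.getName();
    }

    public static void main(String[] args) {
        RichRecord rich = new RichRecord() {
            public String getName() { return "seqA"; }
            public int getVersion() { return 2; }
        };
        // A rich object is accepted wherever a plain one is expected.
        System.out.println(describe(rich)); // prints "record seqA"
    }
}
```

<p>Because the rich interface extends the plain one, the legacy method needs
no changes; it simply never sees the extra information.</p>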
<h3 id="change-listeners">Change listeners.</h3>
<p>BioJava’s change listener model is intact and unchanged. The new
BioJavaX classes define a set of extra change types which they fire in
addition to the ones generated by existing BioJava classes.</p>
<p>This means that existing change listeners can be attached to
BioJavaX-derived objects and still receive all the information they
would normally receive.</p>
<h3 id="event-based-file-parsing">Event-based file parsing.</h3>
<p>BioJavaX still uses event-based file parsing to read and write files, in
exactly the same way as the old BioJava classes did.</p>
<p>However, you cannot use existing event listeners with the new BioJavaX
file parsers. You must alter the listeners to extend the new
org.biojavax.bio.seq.io.RichSeqIOListener interface instead.</p>
<h2 id="what-did-change">What did change?</h2>
<h3 id="system-requirements">System requirements.</h3>
<p>Java 1.4 is required for all BioJavaX packages.</p>
<h3 id="rich-interfaces">Rich interfaces.</h3>
<p>BioJavaX defines a new set of interfaces for working with sequence
objects. These interfaces are closely modelled on the BioSQL 1.0 schema.</p>
<p>The new interfaces extend existing interfaces wherever possible, in
order to allow backwards-compatibility with legacy code. These
interfaces are known as rich interfaces, as they could be said to be
‘enriched’ versions of the interfaces that they extend.</p>
<p>Instances of implementing classes are known as rich objects, while
legacy instances are known as plain ones.</p>
<p>Here is a list of the new rich interfaces:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>ComparableOntology (extends Ontology)
ComparableTerm (extends Term)
ComparableTriple (extends Triple)
RichSequenceIterator (extends SequenceIterator)
RichSequence (extends Sequence)
RichLocation (extends Location)
RichFeature (extends StrandedFeature)
RichFeatureHolder (extends FeatureHolder)
RichAnnotatable (extends Annotatable)
RichAnnotation (extends Annotation)
BioSQLFeatureFilter (extends FeatureFilter)
RichSequenceDB (extends SequenceDB)
</code></pre>
</div>
<p>Wherever possible in BioJavaX, conversions are attempted if a method
expecting a rich object receives a plain one. You can perform these
conversions yourself by using the nested Tools class of the appropriate
rich interface. For example, to convert an old Sequence object into a new
RichSequence object, you can do this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>Sequence s = ...; // get an old Sequence object from somewhere
RichSequence rs = RichSequence.Tools.enrich(s);
</code></pre>
</div>
<p>The conversion process does its best, but it is not perfect. Much of the
way information is stored in the new BioJavaX object model is
fundamentally incompatible with the old object model. So it's always best
to deal with RichSequence objects from the word go and try to avoid
instantiating older Sequence objects as far as possible.</p>
<p>Other new interfaces define new concepts, or replace old interfaces
entirely due to a fundamental clash in the way they see the world. Here
is a list:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>NCBITaxon
BioEntry
RichObjectBuilder
RichSequenceHandler
Comment
CrossRef
CrossReferenceResolver
DocRef
DocRefAuthor
Namespace
Note
RankedCrossRef
RankedCrossRefable
RankedDocRef
BioEntryRelationship
Position
PositionResolver
RichFeatureRelationship
BioEntryDB
</code></pre>
</div>
<h3 id="biosql-persistence">BioSQL persistence.</h3>
<p>BioJavaX introduces a whole new way of working with BioSQL databases.</p>
<p>Instead of attempting to re-invent the wheel with yet another new
object-relational mapping system, BioJavaX uses the services of
Hibernate to do all the dirty work for it. In fact, there is not a
single SQL statement anywhere in the BioJavaX code.</p>
<p>The use of Hibernate allows users to have as much or as little control
as they like over transactions and query optimisation. The Hibernate
query language, HQL, is simple to learn and easy to use.</p>
<p>You can find out more about the Hibernate project at their website:
<a href="http://www.hibernate.org">www.hibernate.org/</a></p>
<h3 id="better-file-parsers">Better file parsers.</h3>
<p>The old BioJava file parsers worked in that they loaded all information
into memory, but they made little attempt to understand the contents of
the files, and they often failed miserably when trying to convert between
formats.</p>
<p>The new parsers supplied with BioJavaX put a lot of effort into trying
to fit data from the myriad of file formats out there into a form
representable by BioSQL, and hence by the new BioJavaX object model. Of
course this isn’t always possible, but they do a much better job than
the old ones.</p>
<p>By parsing data into a fixed object model instead of storing everything
as annotations (as was the case, for instance, with the old SwissProt
parsers), conversion between file formats becomes much easier.</p>
<p>The new file parsers also allow you to skip uninteresting parts of the
file altogether, greatly speeding up simple tasks such as counting the
number of sequences in a file.</p>
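<p>To make the idea of skipping uninteresting parts concrete, here is a
minimal, self-contained sketch of event-based parsing in which a listener
counts records and ignores the sequence data. It reads a FASTA-like text
purely for illustration. The names SeqEventListener, CountingListener and
EventParserSketch are hypothetical and are not part of BioJavaX; the real
parsers and the RichSeqIOListener interface are considerably richer.</p>

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical listener interface, invented for this sketch.
interface SeqEventListener {
    void startSequence(String header);
    void addSymbols(String line);
    void endSequence();
}

// Counts records and deliberately ignores the sequence data.
class CountingListener implements SeqEventListener {
    int count = 0;
    public void startSequence(String header) { count++; }
    public void addSymbols(String line) { /* skipped on purpose */ }
    public void endSequence() { }
}

class EventParserSketch {
    // Reads FASTA-like text and fires one event per structural element.
    static void parse(BufferedReader in, SeqEventListener listener) {
        try {
            boolean inRecord = false;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (inRecord) listener.endSequence();
                    listener.startSequence(line.substring(1).trim());
                    inRecord = true;
                } else if (inRecord && !line.trim().isEmpty()) {
                    listener.addSymbols(line.trim());
                }
            }
            if (inRecord) listener.endSequence();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nACGT\nACGT\n>seq2\nGGCC\n";
        CountingListener counter = new CountingListener();
        parse(new BufferedReader(new StringReader(fasta)), counter);
        System.out.println(counter.count); // prints 2
    }
}
```

<p>The parser does the same work regardless of the listener. The saving
comes from the listener, which never builds sequence objects it does not
need.</p>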
<h3 id="ncbi-taxonomy-loader">NCBI Taxonomy loader.</h3>
<p>A parser is provided for loading the NCBI Taxonomy database into a set
of BioJavaX NCBITaxon objects. This parser reads the nodes.dmp and
names.dmp files supplied by NCBI and constructs the appropriate
hierarchy of objects. If you are using BioSQL, it can persist this
hierarchy to the database as it goes.</p>
<h3 id="namespaces">Namespaces.</h3>
<p>All sequences in BioJavaX must belong to a namespace.</p>
<h3 id="singletons">Singletons.</h3>
<p>BioJavaX tries to use singletons as far as possible. This is:</p>
<ul>
<li>to reduce memory usage.</li>
<li>to prevent problems with duplicate keys when persisting to BioSQL.</li>
</ul>
<p>The singletons are kept in an LRU cache managed by a RichObjectFactory.
See the chapter on this subject later in this book.</p>
<h3 id="genetic-algorithms">Genetic algorithms.</h3>
<p>BioJavaX introduces a new package for working with genetic algorithms.</p>
<h2 id="future-plans">Future plans.</h2>
<h3 id="bioperl-and-bioperl-db-compatibility">BioPerl and BioPerl-DB compatibility.</h3>
<p>We tried our best to store sequence data into BioSQL in the same way as
BioPerl-DB does. We also tried to parse files in such a way that data
from files would end up in the same place in BioSQL as if it had been
parsed using the BioPerl file parsers then persisted using BioPerl-DB.</p>
<p>However, we may not have been entirely successful, particularly with
regard to the naming conventions of annotations and feature qualifiers,
and the use of the document and publication cross-reference tables.
Likewise, our definition of fuzzy locations may differ.</p>
<p>So, we intend in the future to try and consolidate our efforts with
those of the BioPerl and BioPerl-DB projects, along with any of the
other Bio* projects who provide BioSQL persistence functionality, so
that we can all read and write data to and from BioSQL in the same way.</p>
<p>The goal is to be able to read a file with any of the Bio* projects,
persist it to the database, then read it back from the database using
any of the other Bio* projects and write it out to file. The input and
output files should be logically identical (give or take some minor
layout or formatting issues).</p>
<p>Help is needed!</p>
<h3 id="efficient-parsing">Efficient parsing.</h3>
<p>The event-based parser model works great, but our implementations of
actual file parsing code may leave a lot to be desired in terms of
efficient use of memory or minimising the number of uses of markers in
the input stream.</p>
<p>If you are an IO, parsing, or code optimisation guru, you would be most
welcome to come have a look and speed things up a bit.</p>
<h3 id="more-file-formats-supported">More file formats supported.</h3>
<p>We’ve provided parsers (and writers) for all the major formats we
thought would be necessary. But there are only two of us, and it takes a
while to trawl through the documentation for each format and try to
shoehorn it all into the BioSQL model, even before the actual coding
begins.</p>
<p>If there’s a format you like and use daily and you think would be of use
to others, but you can’t find it in BioJavaX, then please do write a
parser for it and contribute it to the project.</p>
<h3 id="persistence-to-non-biosql-databases">Persistence to non-BioSQL databases.</h3>
<p>Basically, right now, you can’t. We have only provided Hibernate
mappings for BioSQL.</p>
<p>There is no reason though why you can’t write a new set of Hibernate XML
mapping files that map the BioJavaX objects into tables in some other
database format. Because of the way Hibernate works, you wouldn’t have
to change any of the BioJavaX code at all, only the mapping files that
tell Hibernate how to translate between objects and tables.</p>
<p>If you do, and you think someone else could benefit from your work,
please consider contributing the mapping files to the BioJava project for
everyone to enjoy.</p>
<h3 id="java-15-and-generics">Java 1.5 and Generics.</h3>
<p>Much discussion has occurred recently about upgrading BioJava to use
features only available since version 1.5 of Java (also known as Java
5). Mostly we are considering the use of generics.</p>
<p>A lot of this started after some Java 1.5 features accidentally slipped
into the biojava-live CVS branch one day and suddenly nobody using older
JVMs could compile it any more. These were quickly removed, and it was
agreed to wait a while before a decision was made about the ultimate use
of such features.</p>
<p>Java 1.5 offers a lot of features that would be very useful in BioJava,
and has the potential to greatly reduce the size of the project’s
codebase. However, 1.5 compilers and runtime environments are not
available for some platforms yet, and in other situations companies are
reluctant to upgrade when they have already settled on 1.4 as their
tested and accepted Java environment.</p>
<p>So, we won’t do it yet, but we would definitely like to change in
future.</p>
<h2 id="singletons-and-the-richobjectfactory">Singletons and the <code class="highlighter-rouge">RichObjectFactory</code>.</h2>
<h3 id="using-richobjectfactory">Using <code class="highlighter-rouge">RichObjectFactory</code>.</h3>
<p>BioJavaX revolves around the use of singleton instances. This is
important to keep memory usage down, and becomes even more important
when working with BioSQL databases via Hibernate to prevent duplicate
records in tables. Singletons are generated in a singleton factory.</p>
<p>RichObjectFactory is a caching singleton factory. If you request lots of
instances of the same class, the oldest ones are forgotten about and you
will get a new instance next time you ask for it. This is to prevent
memory blowouts. The default size of this LRU cache is 20 instances of
each class.</p>
<p>Singletons are only important when dealing with certain classes:</p>
<div class="highlighter-rouge"><pre class="highlight"><code> SimpleNamespace
SimpleComparableOntology
SimpleNCBITaxon
SimpleCrossRef
SimpleDocRef
</code></pre>
</div>
<p>In all other cases, you don’t need to worry about singletons. In fact,
the singleton factory may complain if you try to ask it to make a
singleton of any class not listed above.</p>
<p>To generate a new instance of any of the above, you must use the
RichObjectFactory. This tool checks an LRU cache to see if you have
requested an identical instance recently. If you have, it returns that
instance (a singleton). If you haven’t, then it creates the instance,
adds it to the LRU cache, then returns it.</p>
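<p>The lookup-or-create behaviour described above can be sketched with the
standard library alone. The following is a minimal illustration of a
caching factory with LRU eviction, built on java.util.LinkedHashMap. The
class name LruSingletonFactory and its constructor arguments are
assumptions made for this example; it is not the RichObjectFactory
implementation, which keys its cache on the class and constructor
parameters.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a tiny caching factory with LRU eviction.
class LruSingletonFactory<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> builder;

    LruSingletonFactory(final int maxSize, Function<K, V> builder) {
        this.builder = builder;
        // accessOrder = true keeps entries ordered from least to most recently used
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize; // drop the least recently used entry
            }
        };
    }

    // Returns the cached instance for the key, or builds, caches and returns a new one.
    V getObject(K key) {
        V value = cache.get(key); // a hit also marks the entry as recently used
        if (value == null) {
            value = builder.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    public static void main(String[] args) {
        LruSingletonFactory<String, Object> f =
                new LruSingletonFactory<>(2, k -> new Object());
        Object a1 = f.getObject("a");
        System.out.println(a1 == f.getObject("a")); // true: same instance
        f.getObject("b");
        f.getObject("c"); // evicts "a", the least recently used key
        System.out.println(a1 == f.getObject("a")); // false: rebuilt
    }
}
```

<p>The second output shows why a small cache can hand back duplicates: once
an instance has been evicted, the next request creates a fresh one.</p>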
<p>The parameters you supply to the RichObjectFactory are a class name, and
an array of parameters which you would normally have passed directly to
that class’ constructor. Here is a list of the parameters required, and
an example, for each of the classes accepted by the current factory:</p>
<p>Table 5.1. RichObjectFactory singleton examples.</p>
<table>
<thead>
<tr>
<th>Objects</th>
<th>Parameters</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">SimpleNamespace</code></td>
<td>[name (String)]</td>
<td><code class="highlighter-rouge">Namespace ns = (Namespace)RichObjectFactory.getObject(SimpleNamespace.class,new Object[]{"myNamespace"});</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">SimpleComparableOntology</code></td>
<td>[name (String)]</td>
<td><code class="highlighter-rouge">ComparableOntology ont = (ComparableOntology)RichObjectFactory.getObject(SimpleComparableOntology.class,new Object[]{"myOntology"});</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">SimpleNCBITaxon</code></td>
<td>[taxID (Integer)]</td>
<td><code class="highlighter-rouge">Integer taxID = new Integer(12345);</code> <br />
<code class="highlighter-rouge">NCBITaxon tax = (NCBITaxon)RichObjectFactory.getObject(SimpleNCBITaxon.class,new Object[]{taxID});</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">SimpleCrossRef</code></td>
<td>[databaseName (String), accession (String), version (Integer)]</td>
<td><code class="highlighter-rouge">Integer version = new Integer(0);</code> <br />
<code class="highlighter-rouge">CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class,new Object[]{"PUBMED","56789",version});</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">SimpleDocRef</code></td>
<td>[authors (List of DocRefAuthor), location (String)]</td>
<td><code class="highlighter-rouge">DocRefAuthor author = new SimpleDocRefAuthor("Bloggs,J.");</code> <br />
<code class="highlighter-rouge">List authors = new ArrayList();</code> <br />
<code class="highlighter-rouge">authors.add(author);</code> <br />
<code class="highlighter-rouge">DocRef dr = (DocRef)RichObjectFactory.getObject(SimpleDocRef.class,new Object[]{authors,"Journal of Voodoo Virology, 2005, 23:55-57"});</code></td>
</tr>
</tbody>
</table>
<h3 id="where-the-singletons-come-from">Where the singletons come from.</h3>
<p>The actual instances of the classes requested are generated using a
RichObjectBuilder. The default RichObjectBuilder,
SimpleRichObjectBuilder, uses introspection to call the constructors on
the classes and create new instances. You do not need to do anything to
set this up.</p>
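<p>As a rough illustration of what building instances by introspection
involves, the sketch below picks a public constructor whose parameter
types accept the supplied arguments and invokes it through
java.lang.reflect. The class ReflectiveBuilder and its build method are
hypothetical names for this example. The actual SimpleRichObjectBuilder
may select constructors differently, and this sketch does not handle
primitive parameter types.</p>

```java
import java.lang.reflect.Constructor;

// Illustrative only: constructs objects from a class and an argument array.
class ReflectiveBuilder {
    // Finds a public constructor that accepts the arguments and calls it.
    static <T> T build(Class<T> clazz, Object[] args) {
        for (Constructor<?> c : clazz.getConstructors()) {
            Class<?>[] types = c.getParameterTypes();
            if (types.length != args.length) continue;
            boolean match = true;
            for (int i = 0; i < types.length; i++) {
                // primitives are skipped; null matches any reference type
                if (types[i].isPrimitive()
                        || (args[i] != null && !types[i].isInstance(args[i]))) {
                    match = false;
                    break;
                }
            }
            if (match) {
                try {
                    return clazz.cast(c.newInstance(args));
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(
                            "Could not construct " + clazz.getName(), e);
                }
            }
        }
        throw new IllegalArgumentException(
                "No matching constructor on " + clazz.getName());
    }

    public static void main(String[] args) {
        // Equivalent to: new StringBuilder("abc")
        StringBuilder sb = build(StringBuilder.class, new Object[] {"abc"});
        System.out.println(sb); // prints "abc"
    }
}
```

<p>This mirrors the calling convention shown in the table above: a class
plus an Object[] of the arguments you would otherwise pass to the
constructor.</p>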
<p>If you do decide to write your own RichObjectBuilder for whatever
reason, you can set it to be used by RichObjectFactory like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>RichObjectBuilder builder = ...;                 // create your own one here
RichObjectFactory.setRichObjectBuilder(builder); // make the factory use it from now on
</code></pre>
</div>
<p>If you change the default RichObjectBuilder to a different one, you must
do so at the very beginning of your program before any call to the
RichObjectFactory has been made. This is because when the builder is
changed, existing singletons or default instances are not removed. If
you do not follow this guideline, you will end up with a mix of objects
in the cache created by two different builders, which could lead to
interesting situations.</p>
<h3 id="hibernate-singletons">Hibernate singletons.</h3>
<p>When working with Hibernate, you must connect BioJavaX to Hibernate by
calling RichObjectFactory.connectToBioSQL(session) and passing it your
session object. Once connected, the factory first looks up requested
instances in the underlying BioSQL database. If they exist, they are
loaded and returned. If not, they are created, then returned.</p>
<p>The instances returned by RichObjectFactory when connected to Hibernate
are guaranteed true singletons and will never be duplicated even if you
fill up the LRU cache several times between requests.</p>
<p>You can replicate the behaviour of
RichObjectFactory.connectToBioSQL(session) by instantiating
BioSQLRichObjectBuilder and BioSQLCrossReferenceResolver objects and
passing these to the appropriate methods in RichObjectFactory.</p>
<p>See the section on BioSQL and Hibernate later in this document for more
details.</p>
<h3 id="managing-the-lru-cache">Managing the LRU cache.</h3>
<p>By default, the LRU cache keeps the 20 most recently requested instances
of any given class in memory. If more than 20 objects are requested, the
oldest ones are removed from the cache before the new ones are added.
This keeps memory usage at a minimum.</p>
<p>If you are experiencing problems with duplicate instances when you
expected singletons, or believe that a larger or smaller cache may help
the performance of your application, then you can change the size of the
LRU cache. There are two ways of doing this.</p>
<p>Changes to the LRU cache size are not instantaneous. The size of the
cache only changes physically next time an instance is requested from
it. Even then, only the cache of instances of the class requested will
actually change.</p>
<h4 id="global-lru-cache-size">Global LRU cache size.</h4>
<p>Changing the global LRU cache size will change the cache size for all
classes. It applies the new cache size to every single class. Next time
any of those classes are accessed via the RichObjectFactory, the LRU
cache for that class will adjust to the new size.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>RichObjectFactory.setLRUCacheSize(50); // increases the global LRU cache size to 50 instances per class
</code></pre>
</div>
<h4 id="class-specific-lru-cache-size">Class-specific LRU cache size.</h4>
<p>Changing the LRU cache size for a specific class will only affect that
class. Your class-specific settings will be lost if you later change the
global LRU cache size.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>RichObjectFactory.setLRUCacheSize(SimpleNamespace.class, 50); // increases the LRU cache for SimpleNamespace instances to 50
</code></pre>
</div>
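<p>The interaction between the two settings can be modelled in a few lines.
The sketch below is a stand-in, assuming only the rule stated above: a
class-specific size overrides the global one until the global size is
changed again. CacheSizeSettings and sizeFor are hypothetical names, and
the sketch does not model the delayed resizing of the caches
themselves.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of global versus class-specific cache sizes.
class CacheSizeSettings {
    private int globalSize = 20; // the default size stated in the text
    private final Map<Class<?>, Integer> perClass = new HashMap<>();

    // Global change: applies to every class and discards overrides.
    void setLRUCacheSize(int size) {
        globalSize = size;
        perClass.clear();
    }

    // Class-specific change: affects only the given class.
    void setLRUCacheSize(Class<?> clazz, int size) {
        perClass.put(clazz, size);
    }

    // Effective size for a class: its override if present, else the global size.
    int sizeFor(Class<?> clazz) {
        Integer override = perClass.get(clazz);
        return override != null ? override : globalSize;
    }

    public static void main(String[] args) {
        CacheSizeSettings s = new CacheSizeSettings();
        s.setLRUCacheSize(String.class, 50);
        System.out.println(s.sizeFor(String.class));  // prints 50
        System.out.println(s.sizeFor(Integer.class)); // prints 20
        s.setLRUCacheSize(30);                        // override for String is lost
        System.out.println(s.sizeFor(String.class));  // prints 30
    }
}
```

<p>In practice this means any class-specific sizes should be set after the
global size, not before.</p>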
<h3 id="convenience-methods">Convenience methods</h3>
<p>A number of convenience methods are provided by the RichObjectFactory to
allow easy access to some useful default singletons:</p>
<p>RichObjectFactory convenience methods.</p>
<table>
<thead>
<tr>
<th>Name of method</th>
<th>Use</th>
</tr>
</thead>
<tbody>
<tr>
<td>void setDefaultNamespaceName(String name)</td>
<td>Sets the name of the default namespace. This namespace is used when loading files which have no namespace information of their own, and when no namespace has been passed to the file loading routines. It can also be used when creating temporary RichSequence or BioEntry objects, as the namespace parameter is compulsory on these objects.</td>
</tr>
<tr>
<td>Namespace getDefaultNamespace();</td>
<td>Returns the default namespace singleton instance (delegates to getObject()).</td>
</tr>
<tr>
<td>void setDefaultOntologyName(String name);</td>
<td>Sets the name of the default ontology. When parsing files, new terms are often created. If the file format does not have an ontology of its own, then it will use the default ontology to store these terms. Terms commonly used throughout BioJavaX, including those common to all file formats, are also stored in the default ontology.</td>
</tr>
<tr>
<td>ComparableOntology getDefaultOntology();</td>
<td>Returns the default ontology singleton instance (delegates to getObject()).</td>
</tr>
<tr>
<td>void setDefaultPositionResolver(PositionResolver pr);</td>
<td>When converting fuzzy locations into actual physical locations, a PositionResolver instance is used. The default one is AveragePositionResolver, which averages out the range of fuzziness to provide a value somewhere in the middle. You can override this setting using this function. All locations that are resolved without explicitly specifying a PositionResolver to use will then use this resolver to do the work.</td>
</tr>
<tr>
<td>PositionResolver getDefaultPositionResolver();</td>
<td>Returns the default position resolver.</td>
</tr>
<tr>
<td>void setDefaultCrossReferenceResolver(CrossReferenceResolver cr);</td>
<td>CrossRef instances are links to other databases. When a CrossRef is used in a RichLocation instance, it means that to obtain the symbols (sequence) for that location, it must first retrieve the remote sequence object. The CrossReferenceResolver object specified using this method is used to carry this out. The default implementation of this interface is DummyCrossReferenceResolver, which always returns infinitely ambiguous symbol lists and cannot look up any remote sequence objects. If you are using Hibernate, use BioSQLCrossReferenceResolver instead (or call RichObjectFactory.connectToBioSQL(session)), which is able to actually look up the sequences (if they exist in your database).</td>
</tr>
<tr>
<td>CrossReferenceResolver getDefaultCrossReferenceResolver();</td>
<td>Returns the default cross reference resolver.</td>
</tr>
<tr>
<td>void setDefaultRichSequenceHandler(RichSequenceHandler rh);</td>
<td>Calls to RichSequence methods which reference sequence data will delegate to this handler to carry the requests out. The default implementation is a DummyRichSequenceHandler, which just uses the internal SymbolList of the RichSequence to look up the data. When this is set to a BioSQLRichSequenceHandler, the handler will go to the database to look up the information instead of keeping an in-memory copy of it.</td>
</tr>
<tr>
<td>RichSequenceHandler getDefaultRichSequenceHandler();</td>
<td>Returns the default rich sequence handler.</td>
</tr>
<tr>
<td>void connectToBioSQL(Object session);</td>
<td>Instantiates BioSQLCrossReferenceResolver, BioSQLRichObjectBuilder and BioSQLRichSequenceHandler using the Hibernate session object provided, and sets these objects as the default instances. After this call, the factory will try to look up all object requests in the underlying database first.</td>
</tr>
</tbody>
</table>
<h3 id="default-settings">Default settings.</h3>
<p>The default namespace name is lcl.</p>
<p>The default ontology name is biojavax.</p>
<p>The default LRU cache size is 20.</p>
<p>The default position resolver is AveragePositionResolver.</p>
<p>The default cross reference resolver is DummyCrossReferenceResolver.</p>
<p>The default rich sequence handler is DummyRichSequenceHandler.</p>
<h2 id="working-with-sequences">Working with sequences.</h2>
<h3 id="creating-sequences">Creating sequences.</h3>
<p>BioJavaX has a two-tier definition of sequence data.</p>
<p>BioEntry objects correspond to the bioentry table in BioSQL. They do not
have any sequence information, and neither do they have any features.
They can, however, be annotated, commented, and put into relationships
with each other. They can also have cross-references to publications and
other databases associated with them.</p>
<p>RichSequence objects extend BioEntry objects by adding in sequence data
and a feature table.</p>
<p>So, when to use them?</p>
<ul>
<li>BioEntry objects are most useful when performing simple operations
such as counting sequences, checking taxonomy data, looking up
accessions, or finding out things like which objects refer to a
particular PUBMED entry.</li>
<li>RichSequence objects are useful only when you need access to the
sequence data itself, or to the sequence feature table.</li>
<li>RichSequence objects must be used whenever you wish to pass objects
to legacy code that is expecting Sequence objects, as only
RichSequence objects implement the Sequence interface. BioEntry
objects do not.</li>
</ul>
<p>Throughout the rest of this document, both BioEntry and RichSequence
objects will be referred to interchangeably as sequence objects.</p>
<p>To create a BioEntry object, you need to have at least the following
information:</p>
<ul>
<li>a Namespace instance to associate the sequence with (use
RichObjectFactory.getDefaultNamespace() for an easy way out)</li>
<li>a name for the sequence</li>
<li>an accession for the sequence</li>
<li>a version for the sequence (use 0 if you don’t want to bother with
versions)</li>
</ul>
<p>To create a RichSequence object, you need to have all the above plus:</p>
<ul>
<li>a SymbolList containing the sequence data</li>
<li>a version for the sequence data (this is separate from the version
of the sequence object)</li>
</ul>
<h3 id="multiple-accessions">Multiple accessions</h3>
<p>If you wish to assign multiple accessions to a sequence, you must do so
using the special term provided, like this:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>ComparableTerm accTerm = RichSequence.Terms.getAdditionalAccessionTerm();
Note accession1 = new SimpleNote(accTerm,"A12345",1); // this note has an arbitrary rank of 1
Note accession2 = new SimpleNote(accTerm,"Z56789",2); // this note has an arbitrary rank of 2
...
RichSequence rs = ...; // get a rich sequence from somewhere
rs.getNoteSet().add(accession1); // annotate the rich sequence with the first additional accession
rs.getNoteSet().add(accession2); // annotate the rich sequence with the second additional accession
...
// you can annotate bioentry objects in exactly the same way
BioEntry be = ...; // get a bioentry from somewhere
be.getNoteSet().add(accession1);
be.getNoteSet().add(accession2);
</code></pre>
</div>
<p>See later in this document for more information on how to annotate and
comment on sequences.</p>
<h3 id="circular-sequences">Circular sequences</h3>
<p>BioJavaX can flag sequences as being circular, using the setCircular()
and getCircular() methods on RichSequence instances. However, as this
information is not part of BioSQL, it will be lost when the sequence is
persisted to a BioSQL database. Use with care.</p>
<p>Note that only circular sequences can have features with circular
locations associated with them.</p>
<h2 id="relationships-between-sequences">Relationships between sequences.</h2>
<h3 id="relating-two-sequences">Relating two sequences</h3>
<p>Two sequences can be related to each other by using a
BioEntryRelationship object to construct the link.</p>
<p>Relationships are optionally ranked. If you don’t want to rank the
relationship, use null in the constructor.</p>
<p>The following code snippet defines a new term “contains” in the default
ontology, then creates a relationship that states that sequence A (the
parent) contains sequence B (the child):</p>
<div class="highlighter-rouge"><pre class="highlight"><code>ComparableTerm contains = RichObjectFactory.getDefaultOntology().getOrCreateTerm("contains");
...
RichSequence parent = ...; // get sequence A from somewhere
RichSequence child = ...;  // get sequence B from somewhere
BioEntryRelationship relationship = new SimpleBioEntryRelationship(parent,child,contains,null);
parent.addRelationship(relationship); // add the relationship to the parent
...
parent.removeRelationship(relationship); // you can always take it away again later
</code></pre>
</div>
<h3 id="querying-the-relationship">Querying the relationship</h3>
<p>Sequences are only aware of relationships in which they are the parent
sequence. A child sequence cannot find out which parent sequences it is
related to.</p>
<p>The following code snippet prints out all the relationships a sequence
has with child sequences:</p>
<java>
RichSequence rs = ...; // get a rich sequence from somewhere
for (Iterator i = rs.getRelationships().iterator(); i.hasNext(); ) {
    BioEntryRelationship br = (BioEntryRelationship)i.next();
    BioEntry parent = br.getObject(); // parent == rs
    BioEntry child = br.getSubject();
    ComparableTerm relationship = br.getTerm();
    // print out the relationship (eg. "A contains B")
    System.out.println(parent.getName()+" "+relationship.getName()+" "+child.getName());
}
</java>
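<p>Because a child sequence cannot look up its parents, an application that
needs that direction has to keep its own index. The following is only a
minimal sketch of that idea in plain Java: the Rel and ParentIndex classes
are hypothetical stand-ins, not BioJavaX classes, and sequences are
represented by their names alone.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a relationship: parent (object), child (subject), term. */
final class Rel {
    final String parent;
    final String child;
    final String term;

    Rel(String parent, String child, String term) {
        this.parent = parent;
        this.child = child;
        this.term = term;
    }
}

/** Hypothetical helper that indexes relationships by child name. */
final class ParentIndex {
    private final Map<String, List<Rel>> byChild = new HashMap<String, List<Rel>>();

    /** Records the relationships collected from the parent sequences. */
    void addAll(List<Rel> relationships) {
        for (Rel r : relationships) {
            List<Rel> list = byChild.get(r.child);
            if (list == null) {
                list = new ArrayList<Rel>();
                byChild.put(r.child, list);
            }
            list.add(r);
        }
    }

    /** Returns the relationships in which the named sequence is the child (empty if none). */
    List<Rel> parentsOf(String child) {
        List<Rel> list = byChild.get(child);
        return list == null
                ? Collections.<Rel>emptyList()
                : Collections.unmodifiableList(list);
    }
}
```

<p>In real code the index would presumably be filled while iterating over
getRelationships() on each parent sequence, as in the loop above.</p>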
<h2 id="reading-and-writing-files">Reading and writing files.</h2>
<h3 id="tools-for-readingwriting-files">Tools for reading/writing files</h3>
<p>BioJavaX provides a replacement set of tools for working with files.
This is necessary because the new file parsers must work with the new
RichSeqIOListener in order to preserve all the information from the file
correctly.</p>
<p>The tools can all be found in RichSequence.IOTools, a nested class of the
RichSequence interface. For each file format there are a number of
utility methods in this class for reading a variety of sequence types,
and writing them out again. See later sections of this chapter for
details on individual formats.</p>
<p>Here is an example of using the RichSequence.IOTools methods. The
example reads a file in GenBank format containing some DNA sequences,
then writes them to standard output (the screen) in EMBL format:</p>
<java>
// an input GenBank file
BufferedReader br = new BufferedReader(new FileReader("myGenbank.gbk"));
// a namespace to override that in the file
Namespace ns = RichObjectFactory.getDefaultNamespace();
// we are reading DNA sequences
RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(br,ns);
while (seqs.hasNext()) {
    RichSequence rs = seqs.nextRichSequence();
    // write it in EMBL format to standard out
    RichSequence.IOTools.writeEMBL(System.out, rs, ns);
}
</java>
<p>If you wish to output a number of sequences in one of the XML formats,
you have to pass a RichSequenceIterator over your collection of
sequences in order for the XML format to group them together into a
single file with the correct headers:</p>
<java>
// an input GenBank file
BufferedReader br = new BufferedReader(new FileReader("myGenbank.gbk"));
// a namespace to override that in the file
Namespace ns = RichObjectFactory.getDefaultNamespace();
// we are reading DNA sequences
RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(br,ns);
// write the whole lot in EMBLxml format to standard out
RichSequence.IOTools.writeEMBLxml(System.out, seqs, ns);
</java>
<p>If you don’t know what format your input file is in, but know it could
be one of a fixed set of acceptable formats, then you can use BioJavaX’s
format-guessing routine to attempt to read it:</p>
<java>
// Not sure if your input is EMBL or Genbank? Load them both here.
Class.forName("org.biojavax.bio.seq.io.EMBLFormat");
Class.forName("org.biojavax.bio.seq.io.GenbankFormat");
// Now let BioJavaX guess which format you actually should use (using the default namespace)
Namespace ns = RichObjectFactory.getDefaultNamespace();
RichSequenceIterator seqs = RichSequence.IOTools.readFile(new File("myfile.seq"),ns);
</java>
<p>For those who like to do things the hard way, reading and writing by
directly using the RichStreamReader and RichStreamWriter interfaces is
described below.</p>
<h4 id="reading-using-richstreamreader">Reading using RichStreamReader</h4>
<p>File reading is based around the concept of a RichStreamReader. This
object returns a RichSequenceIterator which iterates over every sequence
in the file on demand.</p>
<p>To construct a RichStreamReader, you will need five things:</p>
<ol>
<li>a BufferedReader instance which is connected to the file you wish to
parse;</li>
<li>a RichSequenceFormat instance which understands the format of the
file (eg. FastaFormat, GenbankFormat, etc.);</li>
<li>a SymbolTokenization which understands how to translate the sequence
data in the file into a BioJava SymbolList;</li>
<li>a RichSequenceBuilderFactory instance which generates instances of
RichSequenceBuilder;</li>
<li>a Namespace instance to associate the sequences with.</li>
</ol>
<p>The RichSequenceBuilderFactory is best set to one of the predefined
constants in the RichSequenceBuilderFactory interface. These constants
are defined as:</p>
<p>Table 8.1. RichSequenceBuilderFactory predefined constants.</p>
<table>
<thead>
<tr>
<th>Name of constant</th>
<th>What it will do</th>
</tr>
</thead>
<tbody>
<tr>
<td>RichSequenceBuilderFactory.FACTORY</td>
<td>Does not attempt any compression on sequence data.</td>
</tr>
<tr>
<td>RichSequenceBuilderFactory.PACKED</td>
<td>Will compress all sequence data using PackedSymbolLists.</td>
</tr>
<tr>
<td>RichSequenceBuilderFactory.THRESHOLD</td>
<td>Will compress sequence data using a PackedSymbolList only when the sequence exceeds 5000 bases in length. Otherwise, data is not compressed.</td>
</tr>
</tbody>
</table>
<p>If you set the namespace to null, then the namespace used will depend on
the format you are reading. For formats which specify namespaces, the
namespace from the file will be used. For formats which do not specify
namespaces, the default namespace provided by
RichObjectFactory.getDefaultNamespace() will be used.</p>
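<p>The namespace rule for reading can be summarised as a simple order of
precedence. The sketch below is plain Java written for illustration only:
NamespaceRule and resolve are hypothetical names, and namespaces are
represented as strings rather than Namespace objects.</p>

```java
/** Hypothetical illustration of the namespace precedence used when reading. */
final class NamespaceRule {
    private NamespaceRule() {}

    /**
     * @param requested namespace passed to the reader, or null for "no override"
     * @param fromFile  namespace named in the file, or null if the format has none
     * @param dflt      stand-in for RichObjectFactory.getDefaultNamespace()
     * @return the namespace the sequences end up in
     */
    static String resolve(String requested, String fromFile, String dflt) {
        if (requested != null) {
            return requested; // an explicit namespace overrides the file
        }
        if (fromFile != null) {
            return fromFile;  // the format specifies a namespace
        }
        return dflt;          // the format has no namespace information
    }
}
```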
<p>The SymbolTokenization should be obtained from the Alphabet that
represents the sequence data you are expecting from the file. If you are
reading DNA sequences, you should use
DNATools.getDNA().getTokenization("token"). Other alphabets with tools
classes will have similar methods.</p>
<p>For an alphabet which does not have a tools class, you can do this:</p>
<java>
Alphabet a = ...; // get an alphabet instance from somewhere
SymbolTokenization st = a.getTokenization("token");
</java>
<h4 id="writing-using-richstreamwriter">Writing using RichStreamWriter</h4>
<p>File output is done using RichStreamWriter. This requires:</p>
<ol>
<li>An OutputStream to write sequences to.</li>
<li>A RichSequenceFormat instance for the format you wish to write.</li>
<li>A Namespace to use for the sequences.</li>
<li>A RichSequenceIterator that provides the sequences to write.</li>
</ol>
<p>The namespace should only be specified when the file format includes
namespace information and you wish to override the information
associated with the actual sequences. If you do not wish to do this,
just set it to null, and the namespace from each individual sequence
will be used instead.</p>
<p>The RichSequenceIterator is an iterator over a set of sequences, exactly
the same as the one returned by the RichStreamReader. It is therefore
possible to plug a RichStreamReader directly into a RichStreamWriter and
convert data from one file format to another with no intermediate steps.</p>
<p>If you only have one sequence to write, you can wrap it in a temporary
RichSequenceIterator by using a call like this:</p>
<java>
RichSequence rs = ...; // get sequence from somewhere
RichSequenceIterator it = new SingleRichSeqIterator(rs); // wrap it in an iterator
</java>
<h4 id="example">Example</h4>
<p>The following is an example that will read some DNA sequences from a
GenBank file and write them out to standard output (screen) as FASTA
using the methods outlined above:</p>
<java>
// sequences will be DNA sequences
SymbolTokenization dna = DNATools.getDNA().getTokenization("token");
// read Genbank
RichSequenceFormat genbank = new GenbankFormat();
// write FASTA
RichSequenceFormat fasta = new FastaFormat();
// compress only longer sequences
RichSequenceBuilderFactory factory = RichSequenceBuilderFactory.THRESHOLD;
// read/write everything using the 'bloggs' namespace
// (getObject() returns Object, so the result must be cast)
Namespace bloggsNS = (Namespace)RichObjectFactory.getObject(
    SimpleNamespace.class,
    new Object[]{"bloggs"}
);
// read seqs from "mygenbank.file"
BufferedReader input = new BufferedReader(new FileReader("mygenbank.file"));
// write seqs to STDOUT
OutputStream output = System.out;
RichStreamReader seqsIn = new RichStreamReader(input,genbank,dna,factory,bloggsNS);
RichStreamWriter seqsOut = new RichStreamWriter(output,fasta);
// one-step Genbank to Fasta conversion!
seqsOut.writeStream(seqsIn,bloggsNS);
</java>
<h4 id="line-widths-and-eliding-information">Line widths and eliding information</h4>
<p>When working at this level, extra methods can be used when direct access
to the RichSequenceFormat object is available. These methods are:</p>
<p>Table 8.2. RichSequenceFormat extra options.</p>
<table>
<thead>
<tr>
<th>Name of method</th>
<th>What it will do</th>
</tr>
</thead>
<tbody>
<tr>
<td>get/setLineWidth()</td>
<td>Sets the line width for output. Any lines longer than this will be wrapped. The default for most formats is 80.</td>
</tr>
<tr>
<td>get/setElideSymbols()</td>
<td>When set to true, this will skip the sequence data (ie. the addSymbols() method of the RichSeqIOListener will never be called).</td>
</tr>
<tr>
<td>get/setElideFeatures()</td>
<td>When set to true, this will skip the feature tables in the file.</td>
</tr>
<tr>
<td>get/setElideComments()</td>
<td>When set to true, this will skip all comments in the file.</td>
</tr>
<tr>
<td>get/setElideReferences()</td>
<td>When set to true, this will skip all publication cross-references in the file.</td>
</tr>
</tbody>
</table>
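<p>The line-width option in Table 8.2 amounts to a wrapping rule. As a rough
illustration only, the plain-Java sketch below hard-wraps a string of
sequence symbols at a given width. LineWrap and wrap are hypothetical names
and are not part of BioJavaX; the real formats may apply their own layout on
top of the width limit.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of wrapping output at a fixed line width. */
final class LineWrap {
    private LineWrap() {}

    /** Splits text into consecutive chunks of at most width characters. */
    static List<String> wrap(String text, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        List<String> lines = new ArrayList<String>();
        for (int start = 0; start < text.length(); start += width) {
            int end = Math.min(text.length(), start + width);
            lines.add(text.substring(start, end));
        }
        return lines;
    }
}
```

<p>With a width of 80, a 200-symbol sequence comes out as two full lines and
one line of 40 symbols; anything shorter than the width is left on a single
line.</p>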
<p>Finer control is available when you go even deeper and write your own
RichSeqIOListener objects. See later in this document for information on
that subject.</p>
<h4 id="how-parsed-data-becomes-a-sequence">How parsed data becomes a sequence.</h4>
<p>All fields read from a file, regardless of the format, are passed to an
instance of RichSequenceBuilder. In the case of the tools provided in
RichSequence.IOTools, or any RichStreamReader using one of the
RichSequenceBuilderFactory constants or
SimpleRichSequenceBuilderFactory, this is an instance of
SimpleRichSequenceBuilder.</p>
<p>SimpleRichSequenceBuilder constructs sequences as follows:</p>
<p>Table 8.3. SimpleRichSequenceBuilder sequence construction.</p>
<table>
<thead>
<tr>
<th>Name of method</th>
<th>What it will do</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td>startSequence</td>
<td>Resets all the values in the builder to their defaults, ready to parse a whole new sequence.</td>
<td> </td>
</tr>
<tr>
<td>addSequenceProperty</td>
<td>Assumes that both the key and the value of the property are strings. It uses the key to look up a term with the same name (case-sensitive) in the ontology provided by RichObjectFactory.getDefaultOntology(). If it finds no such term, it creates one. It then adds an annotation to the sequence with that term as the key, using the value provided. The first annotation receives the rank of 0, the second 1, and so on. The annotations are attached to the sequence using setNoteSet() and the accumulated set of notes.</td>
<td> </td>
</tr>
<tr>
<td>setVersion</td>
<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setVersion method.</td>
<td> </td>
</tr>
<tr>
<td>setURI</td>
<td>Not implemented, throws an exception.</td>
<td> </td>
</tr>
<tr>
<td>setSeqVersion</td>
<td>Only accepts a single call per sequence. Value is parsed into a double and passed to the resulting sequence’s setSeqVersion method. If the value is null, then 0.0 is used.</td>
<td> </td>
</tr>
<tr>
<td>setAccession</td>
<td>Value is passed directly to the sequence’s setAccession method. Multiple calls will replace the accession, not add extra ones. The accession cannot be null.</td>
<td> </td>
</tr>
<tr>
<td>setDescription</td>
<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setDescription method.</td>
<td> </td>
</tr>
<tr>
<td>setDivision</td>
<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setDivision method. The division cannot be null.</td>
<td> </td>
</tr>
<tr>
<td>setIdentifier</td>
<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setIdentifier method.</td>
<td> </td>
</tr>
<tr>
<td>setName</td>
<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setName method.</td>
<td> </td>
</tr>
<tr>
<td>setNamespace</td>
<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setNamespace method. The namespace cannot be null.</td>
<td> </td>
</tr>
<tr>
<td>setComment</td>
<td>Adds the text supplied (which must not be null) as a comment to the sequence using addComment(). Multiple calls will result in multiple comments being added. The first comment is ranked 1, the second comment ranked 2, and so on.</td>
<td> </td>
</tr>
<tr>
<td>setTaxon</td>
<td>Value is passed to the sequence’s setTaxon method. It must not be null. If this method is called repeatedly, only the first call will be accepted. Subsequent calls will result in warnings being printed to standard error. These extra calls will not cause the builder to fail. The value from the initial call will be the one that is used.</td>
<td> </td>
</tr>
<tr>
<td>startFeature</td>
<td>Tells the builder to start a new feature on this sequence. If the current feature has not yet been ended, then this feature will be a sub-feature of the current feature and associated with it via a RichFeatureRelationship, where the current feature is the parent and this new feature is the child. The relationship will be defined with the term “contains” from RichObjectFactory.getDefaultOntology(). Each feature will be attached to the resulting sequence by calling setParent() on the feature once the sequence has been created.</td>
<td> </td>
</tr>
<tr>
<td>getCurrentFeature</td>
<td>Returns the current feature, if one has been started. If there is no current feature (eg. it has already ended, or one was never started) then an exception is thrown.</td>
<td> </td>
</tr>
<tr>
<td>addFeatureProperty</td>
<td>Assumes that both the key and the value of the property are strings. It uses the key to look up a term with the same name (case-sensitive) in the ontology provided by RichObjectFactory.getDefaultOntology(). If it finds no such term, it creates one. It then adds an annotation to the current feature with that term as the key, using the value provided. The first annotation receives the rank of 0, the second 1, and so on. The annotations are attached to the feature using getAnnotation().addNote().</td>
<td> </td>
</tr>
<tr>
<td>endFeature</td>
<td>Ends the current feature. If there is no current feature, an exception is thrown.</td>
<td> </td>
</tr>
<tr>
<td>setRankedDocRef</td>
<td>Adds the given RankedDocRef to the set of publication cross-references which the sequence being built refers to. The value cannot be null. If the same value is provided multiple times, it will only be saved once. Each value is stored by calling addRankedDocRef() on the resulting sequence.</td>
<td> </td>
</tr>
<tr>
<td>setRankedCrossRef</td>
<td>Adds the given RankedCrossRef to the set of database cross-references which the sequence being built refers to. The value cannot be null. If the same value is provided multiple times, it will only be saved once. Each value is stored by calling addRankedCrossRef() on the resulting sequence.</td>
<td> </td>
</tr>
<tr>
<td>setRelationship</td>
<td>Adds the given BioEntryRelationship to the set of relationships in which the sequence being built is the parent. The relationship cannot be null. If the same relationship is provided multiple times, it will only be saved once. Each relationship is stored by calling addRelationship() on the resulting sequence.</td>
<td> </td>
</tr>
<tr>
<td>setCircular</td>
<td>You can call this as many times as you like. Each call will override the value provided by the previous call. The value is passed to the sequence’s setCircular method.</td>
<td> </td>
</tr>
<tr>
<td>addSymbols</td>
<td>Adds symbols to this sequence. You can call it multiple times to set symbols at different locations in the sequence. If any of the symbols found are not in the alphabet accepted by this builder, or if the locations provided to place the symbols at are unacceptable, an exception is thrown. The resulting SymbolList will be the basis upon which the final RichSequence object is built.</td>
<td> </td>
</tr>
<tr>
<td>endSequence</td>
<td>Tells the builder that we have provided all the information we know. If at this point the name, namespace, or accession have not been provided, or if any of them are null, an exception is thrown.</td>
<td> </td>
</tr>
<tr>
<td>makeSequence</td>
<td>Constructs a RichSequence object from the information provided, following the rules laid out in this table, and returns it. The RichSequence object does not actually exist until this method has been called.</td>
<td> </td>
</tr>
<tr>
<td>makeRichSequence</td>
<td>Wrapper for makeSequence.</td>
<td> </td>
</tr>
</tbody>
</table>
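<p>The startFeature, getCurrentFeature and endFeature rows of Table 8.3
describe a nesting rule that can be modelled with a stack. The sketch below
is a plain-Java illustration and not the SimpleRichSequenceBuilder source:
FeatureNesting is a hypothetical class, features are represented by their
names, the exception type is an assumption, and it assumes that ending a
sub-feature makes its parent the current feature again.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of feature nesting during parsing. */
final class FeatureNesting {
    // features that have been started but not yet ended, innermost first
    private final Deque<String> open = new ArrayDeque<String>();
    // recorded "parent contains child" relationships, in the order they arise
    private final List<String> relations = new ArrayList<String>();

    /** Starts a feature; if another feature is still open, it becomes the parent. */
    void startFeature(String name) {
        if (!open.isEmpty()) {
            relations.add(open.peek() + " contains " + name);
        }
        open.push(name);
    }

    /** Returns the innermost open feature, or fails if there is none. */
    String getCurrentFeature() {
        if (open.isEmpty()) {
            throw new IllegalStateException("no current feature");
        }
        return open.peek();
    }

    /** Ends the innermost open feature, or fails if there is none. */
    void endFeature() {
        if (open.isEmpty()) {
            throw new IllegalStateException("no current feature");
        }
        open.pop();
    }

    /** Returns the relationships recorded so far. */
    List<String> relations() {
        return relations;
    }
}
```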
<p>If you want fine-grained control over every aspect of a file whilst it
is being parsed, you must write your own implementation of the
RichSeqIOListener interface (which RichSequenceBuilder extends). This is
detailed later in this document.</p>
<h3 id="fasta">FASTA</h3>
<p>FastaFormat reads and writes FASTA files, and is able to parse the