<h2 id="biojavax-is-not-biojava-3-is-not-biojavax">BioJavaX is not BioJava 3 is not BioJavaX.</h2>

<p>BioJavaX is an extension to the existing BioJava 1 or BioJava Legacy

project. Anything written with BioJava will work with BioJavaX, and vice

versa.</p>

<p>org.biojavax is to org.biojava as javax is to java.</p>

<p>The BioJava3 project is a completely new project which intends to

rewrite everything in BioJava from scratch, based around a new set of

object designs and concepts. It is entirely incompatible with the

existing BioJava project.</p>

<p>Therefore BioJavaX is not BioJava 3, and has nothing to do with it.

Please don’t get them confused!</p>

<h2 id="what-didnt-change">What didn’t change?</h2>

<h3 id="existing-interfaces">Existing interfaces.</h3>

<p>Backwards-compatibility is always an issue when a major new version of a

piece of software is released.</p>

<p>BioJavaX addresses this by keeping all the new classes and interfaces

tucked away inside their own special package, org.biojavax. None of the

existing interfaces were modified in any way, so any code which depends

on them will not see any difference.</p>

<p>Apart from ongoing bugfixes, the way in which the existing classes work

also has not changed.</p>

<p>The new interfaces introduced in BioJavaX extend those present in the

existing BioJava packages. This allows new BioJavaX-derived objects to

be passed to legacy code and still be understood.</p>
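<p>As a rough illustration of why extending the old interfaces keeps legacy
code working, the sketch below uses two hypothetical interfaces,
<code class="highlighter-rouge">PlainSeq</code> and <code class="highlighter-rouge">RichSeq</code>. They are stand-ins invented for this example, not
BioJava or BioJavaX types, and the code assumes a current JDK.</p>

```java
// Hypothetical stand-ins for a legacy interface and its rich extension.
// These are not the real BioJava or BioJavaX types.
interface PlainSeq {
    String getName();
}

interface RichSeq extends PlainSeq {
    String getAccession();
}

final class RichCompatibilityDemo {
    // "Legacy" code: written against the old interface only.
    static String describe(PlainSeq seq) {
        return "sequence " + seq.getName();
    }

    // Builds a rich object; callers may also treat it as a PlainSeq.
    static RichSeq newRichSeq(final String name, final String accession) {
        return new RichSeq() {
            public String getName() { return name; }
            public String getAccession() { return accession; }
        };
    }
}
```

<p>Because <code class="highlighter-rouge">RichSeq</code> extends <code class="highlighter-rouge">PlainSeq</code>, a rich object can be handed to
<code class="highlighter-rouge">describe()</code> without any change to that method.</p>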

<h3 id="change-listeners">Change listeners.</h3>

<p>BioJava’s change listener model is intact and unchanged. The new

BioJavaX classes define a set of extra change types which they fire in

addition to the ones generated by existing BioJava classes.</p>

<p>This means that existing change listeners can be attached to

BioJavaX-derived objects and still receive all the information they

would normally receive.</p>

<h3 id="event-based-file-parsing">Event-based file parsing.</h3>

<p>BioJavaX still uses event-based file parsing to read and write files, in

exactly the same way as the old BioJava classes did.</p>

<p>However, you cannot use existing event listeners with the new BioJavaX
file parsers. You must alter the listeners to implement the new
org.biojavax.bio.seq.io.RichSeqIOListener interface instead.</p>

<h2 id="what-did-change">What did change?</h2>

<h3 id="system-requirements">System requirements.</h3>

<p>Java 1.4 is required for all BioJavaX packages.</p>

<h3 id="rich-interfaces">Rich interfaces.</h3>

<p>BioJavaX defines a new set of interfaces for working with sequence

objects. These interfaces are closely modelled on the BioSQL 1.0 schema.</p>

<p>The new interfaces extend existing interfaces wherever possible, in

order to allow backwards-compatibility with legacy code. These

interfaces are known as rich interfaces, as they could be said to be

‘enriched’ versions of the interfaces that they extend.</p>

<p>Instances of implementing classes are known as rich objects, while
legacy instances are known as plain ones.</p>

<p>Here is a list of the new rich interfaces:</p>

<div class="highlighter-rouge"><pre class="highlight"><code>ComparableOntology (extends Ontology)
ComparableTerm (extends Term)
ComparableTriple (extends Triple)
RichSequenceIterator (extends SequenceIterator)
RichSequence (extends Sequence)
RichLocation (extends Location)
RichFeature (extends StrandedFeature)
RichFeatureHolder (extends FeatureHolder)
RichAnnotatable (extends Annotatable)
RichAnnotation (extends Annotation)
BioSQLFeatureFilter (extends FeatureFilter)
RichSequenceDB (extends SequenceDB)
</code></pre>
</div>

<p>Wherever possible in BioJavaX, conversions are attempted if a method
expecting a rich object receives a plain one. You can perform these
conversions yourself by using the nested Tools class of the appropriate
rich interface. For example, to convert an old Sequence object into a new
RichSequence object, you can do this:</p>

<div class="highlighter-rouge"><pre class="highlight"><code>Sequence s = ...; // get an old Sequence object from somewhere
RichSequence rs = RichSequence.Tools.enrich(s);
</code></pre>
</div>

<p>The conversion process does its best, but it is not perfect. Much of the

way information is stored in the new BioJavaX object model is

fundamentally incompatible with the old object model. So it’s always best

to deal with RichSequence objects from the word go and try to avoid

instantiating older Sequence objects as far as possible.</p>

<p>Other new interfaces define new concepts, or replace old interfaces

entirely due to a fundamental clash in the way they see the world. Here

is a list:</p>

<div class="highlighter-rouge"><pre class="highlight"><code>NCBITaxon
BioEntry
RichObjectBuilder
RichSequenceHandler
Comment
CrossRef
CrossReferenceResolver
DocRef
DocRefAuthor
Namespace
Note
RankedCrossRef
RankedCrossRefable
RankedDocRef
BioEntryRelationship
Position
PositionResolver
RichFeatureRelationship
BioEntryDB
</code></pre>
</div>

<h3 id="biosql-persistence">BioSQL persistence.</h3>

<p>BioJavaX introduces a whole new way of working with BioSQL databases.</p>

<p>Instead of attempting to re-invent the wheel with yet another new

object-relational mapping system, BioJavaX uses the services of

Hibernate to do all the dirty work for it. In fact, there is not a

single SQL statement anywhere in the BioJavaX code.</p>

<p>The use of Hibernate allows users to have as much or as little control

as they like over transactions and query optimisation. The Hibernate

query language, HQL, is simple to learn and easy to use.</p>

<p>You can find out more about the Hibernate project at their website:

<a href="http://www.hibernate.org">www.hibernate.org/</a></p>

<h3 id="better-file-parsers">Better file parsers.</h3>

<p>The old BioJava file parsers worked, in that they loaded all the
information into memory, but they made little attempt to understand the
contents of the files, and they often failed miserably when trying to
convert between formats.</p>

<p>The new parsers supplied with BioJavaX put a lot of effort into trying

to fit data from the myriad of file formats out there into a form

representable by BioSQL, and hence by the new BioJavaX object model. Of

course this isn’t always possible, but they do a much better job than
the old ones.</p>

<p>By parsing data into a fixed object model instead of storing everything

as annotations (as was the case, for instance, with the old SwissProt

parsers), conversion between file formats becomes much easier.</p>

<p>The new file parsers also allow you to skip uninteresting parts of the

file altogether, greatly speeding up simple tasks such as counting the

number of sequences in a file.</p>
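<p>To make the idea of skipping uninteresting data concrete, here is a
minimal sketch of an event-based parser whose listener can decline
sequence data. The <code class="highlighter-rouge">SeqEventListener</code> interface, <code class="highlighter-rouge">TinyFastaParser</code> and
<code class="highlighter-rouge">CountingListener</code> are hypothetical names made up for this example; they
are not the BioJavaX parser classes, and the real mechanism may differ.</p>

```java
// Hypothetical listener, loosely modelled on event-based parsing.
// It is not the real RichSeqIOListener interface.
interface SeqEventListener {
    void startSequence(String header);
    // Return false to tell the parser not to deliver residue data.
    boolean wantsResidues();
    void residues(String line);
}

final class TinyFastaParser {
    // Walks FASTA-style text line by line and fires events at the listener.
    static void parse(String text, SeqEventListener listener) {
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                listener.startSequence(line.substring(1).trim());
            } else if (listener.wantsResidues()) {
                listener.residues(line);
            }
        }
    }
}

// A listener that only counts records and never receives sequence data.
final class CountingListener implements SeqEventListener {
    private int count = 0;

    public void startSequence(String header) { count++; }
    public boolean wantsResidues() { return false; }
    public void residues(String line) {
        throw new IllegalStateException("residues were not requested");
    }

    int getCount() { return count; }
}
```

<p>A counting task then only pays for the header events, which is the kind
of saving the paragraph above describes.</p>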

<h3 id="ncbi-taxonomy-loader">NCBI Taxonomy loader.</h3>

<p>A parser is provided for loading the NCBI Taxonomy database into a set

of BioJavaX NCBITaxon objects. This parser reads the nodes.dmp and

names.dmp files supplied by NCBI and constructs the appropriate

hierarchy of objects. If you are using BioSQL, it can persist this

hierarchy to the database as it goes.</p>
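<p>For orientation, the NCBI dump files separate fields with a tab, a pipe
and a tab, and end each line with a tab and a pipe. The sketch below shows
one way to read the parent links and scientific names from such lines. It
is a simplified illustration with a hypothetical class name,
<code class="highlighter-rouge">TaxonomyDumpSketch</code>, and is not the loader shipped with BioJavaX.</p>

```java
import java.util.HashMap;
import java.util.Map;

final class TaxonomyDumpSketch {
    // Splits one dump line into its fields, dropping the trailing "\t|".
    static String[] fields(String line) {
        String body = line.endsWith("\t|")
                ? line.substring(0, line.length() - 2)
                : line;
        return body.split("\t\\|\t", -1);
    }

    // Maps each taxon ID to its parent ID (first two columns of nodes.dmp).
    static Map<Integer, Integer> parents(String[] nodeLines) {
        Map<Integer, Integer> parentOf = new HashMap<Integer, Integer>();
        for (String line : nodeLines) {
            String[] f = fields(line);
            parentOf.put(Integer.valueOf(f[0].trim()), Integer.valueOf(f[1].trim()));
        }
        return parentOf;
    }

    // Maps each taxon ID to its scientific name (names.dmp, fourth column
    // holds the name class).
    static Map<Integer, String> scientificNames(String[] nameLines) {
        Map<Integer, String> nameOf = new HashMap<Integer, String>();
        for (String line : nameLines) {
            String[] f = fields(line);
            if (f.length >= 4 && "scientific name".equals(f[3].trim())) {
                nameOf.put(Integer.valueOf(f[0].trim()), f[1].trim());
            }
        }
        return nameOf;
    }
}
```

<p>Joining the two maps on the taxon ID gives a named parent/child
hierarchy, which is roughly what a set of taxon objects has to represent.</p>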

<h3 id="namespaces">Namespaces.</h3>

<p>All sequences in BioJavaX must belong to a namespace.</p>

<h3 id="singletons">Singletons.</h3>

<p>BioJavaX tries to use singletons as far as possible. This is:</p>

<ul>

<li>to reduce memory usage.</li>

<li>to prevent problems with duplicate keys when persisting to BioSQL.</li>

</ul>

<p>The singletons are kept in an LRU cache managed by a RichObjectFactory.

See the chapter on this subject later in this book.</p>

<h3 id="genetic-algorithms">Genetic algorithms.</h3>

<p>BioJavaX introduces a new package for working with genetic algorithms.</p>

<h2 id="future-plans">Future plans.</h2>

<h3 id="bioperl-and-bioperl-db-compatibility">BioPerl and BioPerl-DB compatibility.</h3>

<p>We tried our best to store sequence data into BioSQL in the same way as

BioPerl-DB does. We also tried to parse files in such a way that data

from files would end up in the same place in BioSQL as if it had been

parsed using the BioPerl file parsers then persisted using BioPerl-DB.</p>

<p>However, we may not have been entirely successful, particularly with

regard to the naming conventions of annotations and feature qualifiers,

and the use of the document and publication cross-reference tables.

Likewise, our definition of fuzzy locations may differ.</p>

<p>So, we intend in the future to try and consolidate our efforts with

those of the BioPerl and BioPerl-DB projects, along with any of the

other Bio* projects who provide BioSQL persistence functionality, so

that we can all read and write data to and from BioSQL in the same way.</p>

<p>The goal is to be able to read a file with any of the Bio* projects,

persist it to the database, then read it back from the database using

any of the other Bio* projects and write it out to file. The input and

output files should be logically identical (give or take some minor

layout or formatting issues).</p>

<p>Help is needed!</p>

<h3 id="efficient-parsing">Efficient parsing.</h3>

<p>The event-based parser model works great, but our implementations of

actual file parsing code may leave a lot to be desired in terms of

efficient use of memory or minimising the number of uses of markers in

the input stream.</p>

<p>If you are an IO, parsing, or code optimisation guru, you would be most

welcome to come have a look and speed things up a bit.</p>

<h3 id="more-file-formats-supported">More file formats supported.</h3>

<p>We’ve provided parsers (and writers) for all the major formats we

thought would be necessary. But there are only two of us, and it takes a

while to trawl through the documentation for each format and try to

shoehorn it all into the BioSQL model, even before the actual coding

begins.</p>

<p>If there’s a format you like and use daily and you think would be of use

to others, but you can’t find it in BioJavaX, then please do write a

parser for it and contribute it to the project.</p>

<h3 id="persistence-to-non-biosql-databases">Persistence to non-BioSQL databases.</h3>

<p>Basically, right now, you can’t. We have only provided Hibernate

mappings for BioSQL.</p>

<p>There is no reason though why you can’t write a new set of Hibernate XML

mapping files that map the BioJavaX objects into tables in some other

database format. Because of the way Hibernate works, you wouldn’t have

to change any of the BioJavaX code at all, only the mapping files that

tell Hibernate how to translate between objects and tables.</p>

<p>If you do, and you think someone else could benefit from your work,
please consider contributing your mapping files to the BioJava project for
everyone to enjoy.</p>

<h3 id="java-15-and-generics">Java 1.5 and Generics.</h3>

<p>Much discussion has occurred recently about upgrading BioJava to use

features only available since version 1.5 of Java (also known as Java

5). Mostly we are considering the use of generics.</p>

<p>A lot of this started after some Java 1.5 features accidentally slipped

into the biojava-live CVS branch one day and suddenly nobody using older

JVMs could compile it any more. These were quickly removed, and it was

agreed to wait a while before a decision was made about the ultimate use

of such features.</p>

<p>Java 1.5 offers a lot of features that would be very useful in BioJava,

and has the potential to greatly reduce the size of the project’s

codebase. However, 1.5 compilers and runtime environments are not

available for some platforms yet, and in other situations companies are

reluctant to upgrade when they have already settled on 1.4 as their

tested and accepted Java environment.</p>

<p>So, we won’t do it yet, but we would definitely like to change in

future.</p>

<h2 id="singletons-and-the-richobjectfactory">Singletons and the <code class="highlighter-rouge">RichObjectFactory</code>.</h2>

<h3 id="using-richobjectfactory">Using <code class="highlighter-rouge">RichObjectFactory</code>.</h3>

<p>BioJavaX revolves around the use of singleton instances. This is

important to keep memory usage down, and becomes even more important

when working with BioSQL databases via Hibernate to prevent duplicate

records in tables. Singletons are generated in a singleton factory.</p>

<p>RichObjectFactory is a caching singleton factory. If you request lots of

instances of the same class, the oldest ones are forgotten about and you

will get a new instance next time you ask for it. This is to prevent

memory blowouts. The default size of this LRU cache is 20 instances of

each class.</p>

<p>Singletons are only important when dealing with certain classes:</p>

<div class="highlighter-rouge"><pre class="highlight"><code> SimpleNamespace

SimpleComparableOntology

SimpleNCBITaxon

SimpleCrossRef

SimpleDocRef

</code></pre>

</div>

<p>In all other cases, you don’t need to worry about singletons. In fact,

the singleton factory may complain if you try to ask it to make a

singleton of any class not listed above.</p>

<p>To generate a new instance of any of the above, you must use the

RichObjectFactory. This tool checks an LRU cache to see if you have

requested an identical instance recently. If you have, it returns that

instance (a singleton). If you haven’t, then it creates the instance,

adds it to the LRU cache, then returns it.</p>
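<p>The check-then-create behaviour described above can be sketched with a
<code class="highlighter-rouge">LinkedHashMap</code> in access order. This is only an illustration of the
general technique under the assumption of a single cache keyed by one
value; <code class="highlighter-rouge">LruSingletonCache</code> is a hypothetical class, not the
RichObjectFactory implementation.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// A minimal LRU cache of singletons, keyed by whatever identifies an instance.
final class LruSingletonCache<K, V> {
    private final Map<K, V> cache;

    LruSingletonCache(final int maxSize) {
        // accessOrder=true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    // Returns the cached instance for this key, or builds and caches a new one.
    synchronized V getObject(K key, Function<K, V> builder) {
        V value = cache.get(key);
        if (value == null) {
            value = builder.apply(key);
            cache.put(key, value);
        }
        return value;
    }
}
```

<p>With a maximum size of 2, requesting keys a, b and c in turn evicts the
instance for a, so a later request for a builds a fresh object. That is the
situation in which a cached "singleton" can end up duplicated.</p>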

<p>The parameters you supply to the RichObjectFactory are a class name, and

an array of parameters which you would normally have passed directly to

that class’ constructor. Here is a list of the parameters required, and

an example, for each of the classes accepted by the current factory:</p>

<p>Table 5.1. RichObjectFactory singleton examples.</p>

<table>
<thead>
<tr>
<th>Objects</th>
<th>Parameters</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">SimpleNamespace</code></td>
<td>[name (String)]</td>
<td><code class="highlighter-rouge">Namespace ns = (Namespace)RichObjectFactory.getObject(SimpleNamespace.class, new Object[]{"myNamespace"});</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">SimpleComparableOntology</code></td>
<td>[name (String)]</td>
<td><code class="highlighter-rouge">ComparableOntology ont = (ComparableOntology)RichObjectFactory.getObject(SimpleComparableOntology.class, new Object[]{"myOntology"});</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">SimpleNCBITaxon</code></td>
<td>[taxID (Integer)]</td>
<td><code class="highlighter-rouge">Integer taxID = new Integer(12345);</code> <br />
<code class="highlighter-rouge">NCBITaxon tax = (NCBITaxon)RichObjectFactory.getObject(SimpleNCBITaxon.class, new Object[]{taxID});</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">SimpleCrossRef</code></td>
<td>[databaseName (String), accession (String), version (Integer)]</td>
<td><code class="highlighter-rouge">Integer version = new Integer(0);</code> <br />
<code class="highlighter-rouge">CrossRef cr = (CrossRef)RichObjectFactory.getObject(SimpleCrossRef.class, new Object[]{"PUBMED","56789",version});</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">SimpleDocRef</code></td>
<td>[authors (List of DocRefAuthor), location (String)]</td>
<td><code class="highlighter-rouge">DocRefAuthor author = new SimpleDocRefAuthor("Bloggs,J.");</code> <br />
<code class="highlighter-rouge">List authors = new ArrayList();</code> <br />
<code class="highlighter-rouge">authors.add(author);</code> <br />
<code class="highlighter-rouge">DocRef dr = (DocRef)RichObjectFactory.getObject(SimpleDocRef.class, new Object[]{authors,"Journal of Voodoo Virology, 2005, 23:55-57"});</code></td>
</tr>
</tbody>
</table>

<h3 id="where-the-singletons-come-from">Where the singletons come from.</h3>

<p>The actual instances of the classes requested are generated using a

RichObjectBuilder. The default RichObjectBuilder,

SimpleRichObjectBuilder, uses introspection to call the constructors on

the classes and create new instances. You do not need to do anything to

set this up.</p>
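<p>As an illustration of what building by introspection can look like, the
sketch below picks a public constructor whose parameter types accept the
supplied arguments and invokes it reflectively. The class name
<code class="highlighter-rouge">ReflectiveBuilderSketch</code> is hypothetical, and this is a simplified
stand-in rather than the code in SimpleRichObjectBuilder.</p>

```java
import java.lang.reflect.Constructor;

final class ReflectiveBuilderSketch {
    // Finds a public constructor whose parameter types accept the given
    // arguments and invokes it. Primitive parameter types are not matched.
    static Object buildObject(Class<?> clazz, Object[] params) {
        for (Constructor<?> ctor : clazz.getConstructors()) {
            Class<?>[] types = ctor.getParameterTypes();
            if (types.length != params.length) {
                continue;
            }
            boolean matches = true;
            for (int i = 0; i < types.length; i++) {
                if (params[i] != null && !types[i].isInstance(params[i])) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                try {
                    return ctor.newInstance(params);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalArgumentException(
                            "Could not construct " + clazz.getName(), e);
                }
            }
        }
        throw new IllegalArgumentException(
                "No matching constructor on " + clazz.getName());
    }
}
```

<p>For example, calling <code class="highlighter-rouge">buildObject(StringBuilder.class, new Object[]{"abc"})</code>
returns a StringBuilder holding the text abc, in the same class-plus-array
style that the factory's getObject() call uses.</p>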

<p>If you do decide to write your own RichObjectBuilder for whatever

reason, you can set it to be used by RichObjectFactory like this:</p>

<div class="highlighter-rouge"><pre class="highlight"><code>RichObjectBuilder builder = ...; // create your own one here
RichObjectFactory.setRichObjectBuilder(builder); // make the factory use it from now on
</code></pre>
</div>

<p>If you change the default RichObjectBuilder to a different one, you must

do so at the very beginning of your program before any call to the

RichObjectFactory has been made. This is because when the builder is

changed, existing singletons or default instances are not removed. If

you do not follow this guideline, you will end up with a mix of objects

in the cache created by two different builders, which could lead to

interesting situations.</p>

<h3 id="hibernate-singletons">Hibernate singletons.</h3>

<p>When working with Hibernate, you must connect BioJavaX to Hibernate by

calling RichObjectFactory.connectToBioSQL(session) and passing it your

session object. When using this, instances are looked up in the

underlying BioSQL database first to see if they exist. If they do, they

are loaded and returned. If not, they are created, then returned.</p>

<p>The instances returned by RichObjectFactory when connected to Hibernate

are guaranteed true singletons and will never be duplicated even if you

fill up the LRU cache several times between requests.</p>

<p>You can replicate the behaviour of

RichObjectFactory.connectToBioSQL(session) by instantiating

BioSQLRichObjectBuilder and BioSQLCrossReferenceResolver objects and

passing these to the appropriate methods in RichObjectFactory.</p>

<p>See the section on BioSQL and Hibernate later in this document for more

details.</p>

<h3 id="managing-the-lru-cache">Managing the LRU cache.</h3>

<p>By default, the LRU cache keeps the 20 most recently requested instances

of any given class in memory. If more than 20 objects are requested, the

oldest ones are removed from the cache before the new ones are added.

This keeps memory usage at a minimum.</p>

<p>If you are experiencing problems with duplicate instances when you

expected singletons, or believe that a larger or smaller cache may help

the performance of your application, then you can change the size of the

LRU cache. There are two ways of doing this.</p>

<p>Changes to the LRU cache size are not instantaneous. The size of the

cache only changes physically next time an instance is requested from

it. Even then, only the cache of instances of the class requested will

actually change.</p>

<h4 id="global-lru-cache-size">Global LRU cache size.</h4>

<p>Changing the global LRU cache size will change the cache size for all

classes. It applies the new cache size to every single class. Next time

any of those classes are accessed via the RichObjectFactory, the LRU

cache for that class will adjust to the new size.</p>

<div class="highlighter-rouge"><pre class="highlight"><code>RichObjectFactory.setLRUCacheSize(50); // increases the global LRU cache size to 50 instances per class
</code></pre>
</div>

<h4 id="class-specific-lru-cache-size">Class-specific LRU cache size.</h4>

<p>Changing the LRU cache size for a specific class will only affect that

class. Your class-specific settings will be lost if you later change the

global LRU cache size.</p>

<div class="highlighter-rouge"><pre class="highlight"><code>RichObjectFactory.setLRUCacheSize(SimpleNamespace.class, 50); // increases the LRU cache for SimpleNamespace instances to 50
</code></pre>
</div>
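<p>The interaction between the two settings (a later global change wipes
out class-specific sizes) can be modelled in a few lines. The class below
is a hypothetical bookkeeping sketch that only mirrors the rules stated in
this section; it says nothing about how RichObjectFactory stores its
settings internally.</p>

```java
import java.util.HashMap;
import java.util.Map;

final class CacheSizeSettings {
    private int globalSize = 20; // default size stated in the text
    private final Map<Class<?>, Integer> perClass = new HashMap<Class<?>, Integer>();

    // Setting the global size discards every class-specific override.
    void setLRUCacheSize(int size) {
        globalSize = size;
        perClass.clear();
    }

    // Overrides the size for one class only.
    void setLRUCacheSize(Class<?> clazz, int size) {
        perClass.put(clazz, Integer.valueOf(size));
    }

    // The size that applies the next time this class is requested.
    int sizeFor(Class<?> clazz) {
        Integer specific = perClass.get(clazz);
        return specific != null ? specific.intValue() : globalSize;
    }
}
```

<p>In practice this means any global change should come first, followed by
the class-specific adjustments you want to keep.</p>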

<h3 id="convenience-methods">Convenience methods</h3>

<p>A number of convenience methods are provided by the RichObjectFactory to

allow easy access to some useful default singletons:</p>

<p>RichObjectFactory convenience methods.</p>

<table>

<thead>

<tr>

<th>Name of method</th>
<th>Description</th>

</tr>

</thead>

<tbody>

<tr>

<td>void setDefaultNamespaceName(String name);</td>

<td>Sets the name of the default namespace. This namespace is used when loading files which have no namespace information of their own, and when no namespace has been passed to the file loading routines. It can also be used when creating temporary RichSequence or BioEntry objects, as the namespace parameter is compulsory on these objects.</td>

</tr>

<tr>

<td>Namespace getDefaultNamespace();</td>

<td>Returns the default namespace singleton instance (delegates to getObject()).</td>

</tr>

<tr>

<td>void setDefaultOntologyName(String name);</td>

<td>Sets the name of the default ontology. When parsing files, new terms are often created. If the file format does not have an ontology of its own, then it will use the default ontology to store these terms. Terms commonly used throughout BioJavaX, including those common to all file formats, are also stored in the default ontology.</td>

</tr>

<tr>

<td>ComparableOntology getDefaultOntology();</td>

<td>Returns the default ontology singleton instance (delegates to getObject()).</td>

</tr>

<tr>

<td>void setDefaultPositionResolver(PositionResolver pr);</td>

<td>When converting fuzzy locations into actual physical locations, a PositionResolver instance is used. The default one is AveragePositionResolver, which averages out the range of fuzziness to provide a value somewhere in the middle. You can override this setting using this function. All locations that are resolved without explicitly specifying a PositionResolver to use will then use this resolver to do the work.</td>

</tr>

<tr>

<td>PositionResolver getDefaultPositionResolver();</td>

<td>Returns the default position resolver.</td>

</tr>

<tr>

<td>void setDefaultCrossReferenceResolver(CrossReferenceResolver cr);</td>

<td>CrossRef instances are links to other databases. When a CrossRef is used in a RichLocation instance, it means that to obtain the symbols (sequence) for that location, the remote sequence object must first be retrieved. The CrossReferenceResolver object specified using this method is used to carry this out. The default implementation of this interface is DummyCrossReferenceResolver, which always returns infinitely ambiguous symbol lists and cannot look up any remote sequence objects. If you are using Hibernate, use BioSQLCrossReferenceResolver instead (or call RichObjectFactory.connectToBioSQL(session)), which can actually look up the sequences if they exist in your database.</td>

</tr>

<tr>

<td>CrossReferenceResolver getDefaultCrossReferenceResolver();</td>

<td>Returns the default cross reference resolver.</td>

</tr>

<tr>

<td>void setDefaultRichSequenceHandler(RichSequenceHandler rh);</td>

<td>Calls to RichSequence methods which reference sequence data will delegate to this handler to carry the requests out. The default implementation is a DummyRichSequenceHandler, which just uses the internal SymbolList of the RichSequence to look up the data. When this is set to a BioSQLRichSequenceHandler, the handler will go to the database to look up the information instead of keeping an in-memory copy of it.</td>

</tr>

<tr>

<td>RichSequenceHandler getDefaultRichSequenceHandler();</td>

<td>Returns the default rich sequence handler.</td>

</tr>

<tr>

<td>void connectToBioSQL(Object session);</td>

<td>Instantiates BioSQLCrossReferenceResolver, BioSQLRichObjectBuilder and BioSQLRichSequenceHandler using the Hibernate session object provided, and sets these objects as the default instances. After this call, the factory will try to look up all object requests in the underlying database first.</td>

</tr>

</tbody>

</table>
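<p>The table describes AveragePositionResolver as averaging out the range
of fuzziness. A minimal sketch of that idea, assuming a fuzzy position is
just an inclusive start and end, is shown below. The class name
<code class="highlighter-rouge">AverageResolverSketch</code> is hypothetical, and the real resolver may treat
edge cases differently.</p>

```java
final class AverageResolverSketch {
    // Resolves a fuzzy position spanning [start, end] to its midpoint.
    // Integer division rounds down when the midpoint is not a whole number.
    static int resolve(int start, int end) {
        if (start > end) {
            throw new IllegalArgumentException("start must not exceed end");
        }
        return start + (end - start) / 2;
    }
}
```

<p>Under these assumptions a position that lies somewhere between 10 and 20
resolves to 15.</p>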

<h3 id="default-settings">Default settings.</h3>

<p>The default namespace name is lcl.</p>

<p>The default ontology name is biojavax.</p>

<p>The default LRU cache size is 20.</p>

<p>The default position resolver is AveragePositionResolver.</p>

<p>The default cross reference resolver is DummyCrossReferenceResolver.</p>

<p>The default rich sequence handler is DummyRichSequenceHandler.</p>

<h2 id="working-with-sequences">Working with sequences.</h2>

<h3 id="creating-sequences">Creating sequences.</h3>

<p>BioJavaX has a two-tier definition of sequence data.</p>

<p>BioEntry objects correspond to the bioentry table in BioSQL. They do not

have any sequence information, and neither do they have any features.

They can, however, be annotated, commented, and put into relationships

with each other. They can also have cross-references to publications and

other databases associated with them.</p>

<p>RichSequence objects extend BioEntry objects by adding in sequence data

and a feature table.</p>

<ul>

<li>BioEntry objects are most useful when performing simple operations

such as counting sequences, checking taxonomy data, looking up

accessions, or finding out things like which objects refer to a

particular PUBMED entry.</li>

<li>RichSequence objects are useful only when you need access to the

sequence data itself, or to the sequence feature table.</li>

<li>RichSequence objects must be used whenever you wish to pass objects

to legacy code that is expecting Sequence objects, as only

RichSequence objects implement the Sequence interface. BioEntry

objects do not.</li>

</ul>

<p>Throughout the rest of this document, both BioEntry and RichSequence

objects will be referred to interchangeably as sequence objects.</p>

<p>To create a BioEntry object, you need to have at least the following

information:</p>

<ul>

<li>a Namespace instance to associate the sequence with (use

RichObjectFactory.getDefaultNamespace() for an easy way out)</li>

<li>a name for the sequence</li>

<li>an accession for the sequence</li>

<li>a version for the sequence (use 0 if you don’t want to bother with

versions)</li>

</ul>

<p>To create a RichSequence object, you need to have all the above plus:</p>

<ul>

<li>a SymbolList containing the sequence data</li>

<li>a version for the sequence data (this is separate from the version

of the sequence object)</li>

</ul>
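<p>The two tiers and their separate version numbers can be pictured with a
pair of plain classes. <code class="highlighter-rouge">EntrySketch</code> and <code class="highlighter-rouge">SequenceSketch</code> are hypothetical
models written for this example, with a String standing in for a
SymbolList; they are not the BioJavaX BioEntry or RichSequence types.</p>

```java
// Hypothetical classes mirroring the two tiers described above.
// They are not the BioJavaX BioEntry or RichSequence types.
class EntrySketch {
    final String namespace;
    final String name;
    final String accession;
    final int version;        // version of the entry itself

    EntrySketch(String namespace, String name, String accession, int version) {
        this.namespace = namespace;
        this.name = name;
        this.accession = accession;
        this.version = version;
    }
}

final class SequenceSketch extends EntrySketch {
    final String symbols;     // stands in for a SymbolList
    final double seqVersion;  // version of the sequence data only

    SequenceSketch(String namespace, String name, String accession, int version,
                   String symbols, double seqVersion) {
        super(namespace, name, accession, version);
        this.symbols = symbols;
        this.seqVersion = seqVersion;
    }

    int length() {
        return symbols.length();
    }
}
```

<p>Code that only needs names, accessions or versions can work with the
lighter entry type, while anything that touches the residues needs the
sequence tier.</p>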

<h3 id="multiple-accessions">Multiple accessions</h3>

<p>If you wish to assign multiple accessions to a sequence, you must do so

using the special term provided, like this:</p>

<div class="highlighter-rouge"><pre class="highlight"><code>ComparableTerm accTerm = RichSequence.Terms.getAdditionalAccessionTerm();
Note accession1 = new SimpleNote(accTerm,"A12345",1); // this note has an arbitrary rank of 1
Note accession2 = new SimpleNote(accTerm,"Z56789",2); // this note has an arbitrary rank of 2
...
RichSequence rs = ...; // get a rich sequence from somewhere
rs.getNoteSet().add(accession1); // annotate the rich sequence with the first additional accession
rs.getNoteSet().add(accession2); // annotate the rich sequence with the second additional accession
...
// you can annotate bioentry objects in exactly the same way
BioEntry be = ...; // get a bioentry from somewhere
be.getNoteSet().add(accession1);
be.getNoteSet().add(accession2);
</code></pre>
</div>

<p>See later in this document for more information on how to annotate and

comment on sequences.</p>

<h3 id="circular-sequences">Circular sequences</h3>

<p>BioJavaX can flag sequences as being circular, using the setCircular()

and getCircular() methods on RichSequence instances. However, as this

information is not part of BioSQL, it will be lost when the sequence is

persisted to a BioSQL database. Use with care.</p>

<p>Note that only circular sequences can have features with circular

locations associated with them.</p>

<h2 id="relationships-between-sequences">Relationships between sequences.</h2>

<h3 id="relating-two-sequences">Relating two sequences</h3>

<p>Two sequences can be related to each other by using a

BioEntryRelationship object to construct the link.</p>

<p>Relationships are optionally ranked. If you don’t want to rank the

relationship, use null in the constructor.</p>

<p>The following code snippet defines a new term “contains” in the default

ontology, then creates a relationship that states that sequence A (the

parent) contains sequence B (the child):</p>

<java>
ComparableTerm contains = RichObjectFactory.getDefaultOntology().getOrCreateTerm("contains");
...
RichSequence parent = ...; // get sequence A from somewhere
RichSequence child = ...; // get sequence B from somewhere
BioEntryRelationship relationship = new SimpleBioEntryRelationship(parent,child,contains,null);
parent.addRelationship(relationship); // add the relationship to the parent
...
parent.removeRelationship(relationship); // you can always take it away again later
</java>

<h3 id="querying-the-relationship">Querying the relationship</h3>

<p>Sequences are only aware of relationships in which they are the parent

sequence. A child sequence cannot find out which parent sequences it is

related to.</p>

<p>The following code snippet prints out all the relationships a sequence

has with child sequences:</p>

<java>
RichSequence rs = ...; // get a rich sequence from somewhere
for (Iterator i = rs.getRelationships().iterator(); i.hasNext(); ) {
    BioEntryRelationship br = (BioEntryRelationship)i.next();
    BioEntry parent = br.getObject(); // parent == rs
    BioEntry child = br.getSubject();
    ComparableTerm relationship = br.getTerm();
    // print out the relationship (eg. "A contains B");
    System.out.println(parent.getName()+" "+relationship.getName()+" "+child.getName());
}
</java>
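<p>Because only the parent knows about a relationship, finding the parents of a given child means scanning the parent sequences you already hold and building your own reverse index. The following is a minimal plain-Java sketch of that idea. The Rel class and the parentsByChild() helper are hypothetical stand-ins that identify sequences by name; they are not BioJavaX classes, and with real data you would fill the index from the getRelationships() call shown above.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parent-side relationship (not a BioJavaX class). */
class Rel {
    final String parent; // name of the parent (object) sequence
    final String term;   // name of the relationship term, eg. "contains"
    final String child;  // name of the child (subject) sequence

    Rel(String parent, String term, String child) {
        this.parent = parent;
        this.term = term;
        this.child = child;
    }
}

class ReverseIndexSketch {
    /** Maps each child name to the names of the parents that refer to it. */
    static Map<String, List<String>> parentsByChild(List<Rel> rels) {
        Map<String, List<String>> index = new LinkedHashMap<String, List<String>>();
        for (Rel r : rels) {
            List<String> parents = index.get(r.child);
            if (parents == null) {
                parents = new ArrayList<String>();
                index.put(r.child, parents);
            }
            parents.add(r.parent);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Rel> rels = new ArrayList<Rel>();
        rels.add(new Rel("A", "contains", "B"));
        rels.add(new Rel("C", "contains", "B"));
        // B is a child of both A and C
        System.out.println(parentsByChild(rels));
    }
}
```

<p>The index is only as complete as the set of parents you scanned, so it has to be rebuilt whenever relationships are added or removed.</p>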

<h2 id="reading-and-writing-files">Reading and writing files.</h2>

<h3 id="tools-for-readingwriting-files">Tools for reading/writing files</h3>

<p>BioJavaX provides a replacement set of tools for working with files.

This is necessary because the new file parsers must work with the new

RichSeqIOListener in order to preserve all the information from the file

correctly.</p>

<p>The tools can all be found in RichSequence.IOTools, a nested class of the

RichSequence interface. For each file format there are a number of

utility methods in this class for reading a variety of sequence types,

and writing them out again. See later sections of this chapter for

details on individual formats.</p>

<p>Here is an example of using the RichSequence.IOTools methods. The

example reads a file in Genbank format containing some DNA sequences,

then prints them out to standard out (the screen) in EMBL format:</p>

<java>
// an input GenBank file
BufferedReader br = new BufferedReader(new FileReader("myGenbank.gbk"));
// a namespace to override that in the file
Namespace ns = RichObjectFactory.getDefaultNamespace();
// we are reading DNA sequences
RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(br,ns);
while (seqs.hasNext()) {
    RichSequence rs = seqs.nextRichSequence();
    // write it in EMBL format to standard out
    RichSequence.IOTools.writeEMBL(System.out, rs, ns);
}
</java>

<p>If you wish to output a number of sequences in one of the XML formats,

you have to pass a RichSequenceIterator over your collection of

sequences in order for the XML format to group them together into a

single file with the correct headers:</p>

<java>
// an input GenBank file
BufferedReader br = new BufferedReader(new FileReader("myGenbank.gbk"));
// a namespace to override that in the file
Namespace ns = RichObjectFactory.getDefaultNamespace();
// we are reading DNA sequences
RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(br,ns);
// write the whole lot in EMBLxml format to standard out
RichSequence.IOTools.writeEMBLxml(System.out, seqs, ns);
</java>

<p>If you don’t know what format your input file is in, but know it could

be one of a fixed set of acceptable formats, then you can use BioJavaX’s

format-guessing routine to attempt to read it:</p>

<java>
// Not sure if your input is EMBL or Genbank? Load them both here.
Class.forName("org.biojavax.bio.seq.io.EMBLFormat");
Class.forName("org.biojavax.bio.seq.io.GenbankFormat");
// Now let BioJavaX guess which format you actually should use (using the default namespace)
Namespace ns = RichObjectFactory.getDefaultNamespace();
RichSequenceIterator seqs = RichSequence.IOTools.readFile(new File("myfile.seq"),ns);
</java>

<p>For those who like to do things the hard way, reading and writing by

directly using the RichStreamReader and RichStreamWriter interfaces is

described below.</p>

<h4 id="reading-using-richstreamreader">Reading using RichStreamReader</h4>

<p>File reading is based around the concept of a RichStreamReader. This

object returns a RichSequenceIterator which iterates over every sequence

in the file on demand.</p>

<p>To construct a RichStreamReader, you will need five things.</p>

<ol>

<li>a BufferedReader instance which is connected to the file you wish to

parse;</li>

<li>a RichSequenceFormat instance which understands the format of the

file (eg. FastaFormat, GenbankFormat, etc.);</li>

<li>a SymbolTokenization which understands how to translate the sequence

data in the file into a BioJava SymbolList;</li>

<li>a RichSequenceBuilderFactory instance which generates instances of

RichSequenceBuilder;</li>

<li>a Namespace instance to associate the sequences with.</li>

</ol>

<p>The RichSequenceBuilderFactory is best set to one of the predefined

constants in the RichSequenceBuilderFactory interface. These constants

are defined as:</p>

<p>Table 8.1. RichSequenceBuilderFactory predefined constants.</p>

<table>

<thead>

<tr>

<th>Name of constant</th>

<th>Description</th>

</tr>

</thead>

<tbody>

<tr>

<td>RichSequenceBuilderFactory.FACTORY</td>

<td>Does not attempt any compression on sequence data.</td>

</tr>

<tr>

<td>RichSequenceBuilderFactory.PACKED</td>

<td>Will compress all sequence data using PackedSymbolLists.</td>

</tr>

<tr>

<td>RichSequenceBuilderFactory.THRESHOLD</td>

<td>Will compress sequence data using a PackedSymbolList only when the sequence exceeds 5000 bases in length. Otherwise, data is not compressed.</td>

</tr>

</tbody>

</table>
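<p>The THRESHOLD behaviour in the table amounts to a length check before a storage strategy is chosen. The sketch below illustrates that decision, together with one way DNA can be packed at two bits per base. It is plain Java written for illustration only: the class, the method names and the 2-bit layout are assumptions and do not describe how PackedSymbolList is actually implemented. Only the 5000-base threshold comes from the table above.</p>

```java
class PackingSketch {
    static final int THRESHOLD = 5000; // threshold quoted in Table 8.1
    private static final String BASES = "ACGT";

    /** True when a sequence exceeds the threshold and would be packed. */
    static boolean shouldPack(int length) {
        return length > THRESHOLD;
    }

    /** Packs unambiguous DNA (A, C, G, T only) at four bases per byte. */
    static byte[] pack(String dna) {
        byte[] packed = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int code = BASES.indexOf(dna.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("Not A, C, G or T: " + dna.charAt(i));
            }
            // each base occupies two bits within its byte
            packed[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
        return packed;
    }

    /** Reverses pack(); the length is needed because the last byte may be partly used. */
    static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> ((i % 4) * 2)) & 3;
            sb.append(BASES.charAt(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dna = "GATTACA";
        byte[] packed = pack(dna);
        System.out.println(dna.length() + " bases stored in " + packed.length + " bytes");
        System.out.println(unpack(packed, dna.length()));
    }
}
```

<p>Packing trades a little CPU time on every access for a smaller memory footprint, which is why it is reasonable to apply it only to longer sequences.</p>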

<p>If you set the namespace to null, then the namespace used will depend on

the format you are reading. For formats which specify namespaces, the

namespace from the file will be used. For formats which do not specify

namespaces, the default namespace provided by

RichObjectFactory.getDefaultNamespace() will be used.</p>
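<p>The fallback rules above can be summarised as a three-way choice. This is a plain-Java sketch that uses strings in place of Namespace objects; the resolve() method and its parameter names are hypothetical, and in BioJavaX the decision is made while the file is being parsed.</p>

```java
class NamespaceChoiceSketch {
    /**
     * @param requested the namespace passed to the reader, or null
     * @param fromFile  the namespace named in the file, or null if the format has none
     * @param fallback  the default namespace
     */
    static String resolve(String requested, String fromFile, String fallback) {
        if (requested != null) {
            return requested; // an explicit namespace overrides the one in the file
        }
        if (fromFile != null) {
            return fromFile;  // the format specifies its own namespace
        }
        return fallback;      // nothing else is known, so use the default
    }

    public static void main(String[] args) {
        System.out.println(resolve("bloggs", "fileNS", "defaultNS"));
        System.out.println(resolve(null, "fileNS", "defaultNS"));
        System.out.println(resolve(null, null, "defaultNS"));
    }
}
```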

<p>The SymbolTokenization should be obtained from the Alphabet that

represents the sequence data you are expecting from the file. If you are

reading DNA sequences, you should use

DNATools.getDNA().getTokenization("token"). Other alphabets with tools

classes will have similar methods.</p>

<p>For an alphabet which does not have a tools class, you can do this:</p>

<java>
Alphabet a = ...; // get an alphabet instance from somewhere
SymbolTokenization st = a.getTokenization("token");
</java>

<h4 id="writing-using-richstreamwriter">Writing using RichStreamWriter</h4>

<p>File output is done using RichStreamWriter. This requires:</p>

<ol>

<li>An OutputStream to write sequences to.</li>

<li>A RichSequenceFormat that defines the format to write (eg. FastaFormat).</li>

<li>A Namespace to use for the sequences.</li>

<li>A RichSequenceIterator that provides the sequences to write.</li>

</ol>

<p>The namespace should only be specified when the file format includes

namespace information and you wish to override the information

associated with the actual sequences. If you do not wish to do this,

just set it to null, and the namespace from each individual sequence

will be used instead.</p>

<p>The RichSequenceIterator is an iterator over a set of sequences, exactly

the same as the one returned by the RichStreamReader. It is therefore

possible to plug a RichStreamReader directly into a RichStreamWriter and

convert data from one file format to another with no intermediate steps.</p>

<p>If you only have one sequence to write, you can wrap it in a temporary

RichSequenceIterator by using a call like this:</p>

<java>
RichSequence rs = ...; // get sequence from somewhere
RichSequenceIterator it = new SingleRichSeqIterator(rs); // wrap it in an iterator
</java>

<h4 id="example">Example</h4>

<p>The following is an example that will read some DNA sequences from a

GenBank file and write them out to standard output (screen) as FASTA

using the methods outlined above:</p>

<java>
// sequences will be DNA sequences
SymbolTokenization dna = DNATools.getDNA().getTokenization("token");
// read Genbank
RichSequenceFormat genbank = new GenbankFormat();
// write FASTA
RichSequenceFormat fasta = new FastaFormat();
// compress only longer sequences
RichSequenceBuilderFactory factory = RichSequenceBuilderFactory.THRESHOLD;
// read/write everything using the 'bloggs' namespace
// (the cast is needed where getObject() is declared to return Object)
Namespace bloggsNS = (Namespace)RichObjectFactory.getObject(
    SimpleNamespace.class,
    new Object[]{"bloggs"}
);
// read seqs from "mygenbank.file"
BufferedReader input = new BufferedReader(new FileReader("mygenbank.file"));
// write seqs to STDOUT
OutputStream output = System.out;

RichStreamReader seqsIn = new RichStreamReader(input,genbank,dna,factory,bloggsNS);
RichStreamWriter seqsOut = new RichStreamWriter(output,fasta);
// one-step Genbank to Fasta conversion!
seqsOut.writeStream(seqsIn,bloggsNS);
</java>

<h4 id="line-widths-and-eliding-information">Line widths and eliding information</h4>

<p>When working at this level, extra methods can be used when direct access

to the RichSequenceFormat object is available. These methods are:</p>

<p>Table 8.2. RichSequenceFormat extra options.</p>

<table>

<thead>

<tr>

<th>Name of method</th>

<th>Description</th>

</tr>

</thead>

<tbody>

<tr>

<td>get/setLineWidth()</td>

<td>Sets the line width for output. Any lines longer than this will be wrapped. The default for most formats is 80.</td>

</tr>

<tr>

<td>get/setElideSymbols()</td>

<td>When set to true, this will skip the sequence data (ie. the addSymbols() method of the RichSeqIOListener will never be called).</td>

</tr>

<tr>

<td>get/setElideFeatures()</td>

<td>When set to true, this will skip the feature tables in the file.</td>

</tr>

<tr>

<td>get/setElideComments()</td>

<td>When set to true, this will skip all comments in the file.</td>

</tr>

<tr>

<td>get/setElideReferences()</td>

<td>When set to true, this will skip all publication cross-references in the file.</td>

</tr>

</tbody>

</table>
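<p>To make the line width option in Table 8.2 concrete, the sketch below breaks a run of sequence text into lines of a fixed maximum width, which is the effect a line width setting has on long output lines. It is a plain-Java illustration with a hypothetical wrap() helper and it ignores any format-specific line prefixes or indentation.</p>

```java
import java.util.ArrayList;
import java.util.List;

class LineWidthSketch {
    /** Splits text into consecutive chunks of at most width characters. */
    static List<String> wrap(String text, int width) {
        if (width < 1) {
            throw new IllegalArgumentException("width must be at least 1");
        }
        List<String> lines = new ArrayList<String>();
        for (int start = 0; start < text.length(); start += width) {
            int end = Math.min(start + width, text.length());
            lines.add(text.substring(start, end));
        }
        return lines;
    }

    public static void main(String[] args) {
        // wrap a ten-base sequence at a width of four
        for (String line : wrap("ACGTACGTAC", 4)) {
            System.out.println(line);
        }
    }
}
```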

<p>Finer control is available when you go even deeper and write your own

RichSeqIOListener objects. See later in this document for information on

that subject.</p>

<h4 id="how-parsed-data-becomes-a-sequence">How parsed data becomes a sequence</h4>

<p>All fields read from a file, regardless of the format, are passed to an

instance of RichSequenceBuilder. In the case of the tools provided in

RichSequence.IOTools, or any RichStreamReader using one of the

RichSequenceBuilderFactory constants or

SimpleRichSequenceBuilderFactory, this is an instance of

SimpleRichSequenceBuilder.</p>

<p>SimpleRichSequenceBuilder constructs sequences as follows:</p>

<p>Table 8.3. SimpleRichSequenceBuilder sequence construction.</p>

<table>

<thead>

<tr>

<th>Name of method</th>

<th>Description</th>

</tr>

</thead>

<tbody>

<tr>

<td>startSequence</td>

<td>Resets all the values in the builder to their defaults, ready to parse a whole new sequence.</td>

</tr>

<tr>

<td>addSequenceProperty</td>

<td>Assumes that both the key and the value of the property are strings. It uses the key to look up a term with the same name (case-sensitive) in the ontology provided by RichObjectFactory.getDefaultOntology(). If it finds no such term, it creates one. It then adds an annotation to the sequence with that term as the key, using the value provided. The first annotation receives the rank of 0, the second 1, and so on. The annotations are attached to the sequence using setNoteSet() and the accumulated set of notes.</td>

</tr>

<tr>

<td>setVersion</td>

<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setVersion method.</td>

</tr>

<tr>

<td>setURI</td>

<td>Not implemented, throws an exception.</td>

</tr>

<tr>

<td>setSeqVersion</td>

<td>Only accepts a single call per sequence. Value is parsed into a double and passed to the resulting sequence’s setSeqVersion method. If the value is null, then 0.0 is used.</td>

</tr>

<tr>

<td>setAccession</td>

<td>Value is passed directly to the sequence’s setAccession method. Multiple calls will replace the accession, not add extra ones. The accession cannot be null.</td>

</tr>

<tr>

<td>setDescription</td>

<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setDescription method.</td>

</tr>

<tr>

<td>setDivision</td>

<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setDivision method. The division cannot be null.</td>

</tr>

<tr>

<td>setIdentifier</td>

<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setIdentifier method.</td>

</tr>

<tr>

<td>setName</td>

<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setName method.</td>

</tr>

<tr>

<td>setNamespace</td>

<td>Only accepts a single call per sequence. Value is passed directly to the resulting sequence’s setNamespace method. The namespace cannot be null.</td>

</tr>

<tr>

<td>setComment</td>

<td>Adds the text supplied (which must not be null) as a comment to the sequence using addComment(). Multiple calls will result in multiple comments being added. The first comment is ranked 1, the second comment ranked 2, and so on.</td>

</tr>

<tr>

<td>setTaxon</td>

<td>Value is passed to the sequence’s setTaxon method. It must not be null. If this method is called repeatedly, only the first call will be accepted. Subsequent calls will result in warnings being printed to standard error. These extra calls will not cause the builder to fail. The value from the initial call will be the one that is used.</td>

</tr>

<tr>

<td>startFeature</td>

<td>Tells the builder to start a new feature on this sequence. If the current feature has not yet been ended, then this feature will be a sub-feature of the current feature and associated with it via a RichFeatureRelationship, where the current feature is the parent and this new feature is the child. The relationship will be defined with the term “contains” from RichObjectFactory.getDefaultOntology(). Each feature will be attached to the resulting sequence by calling setParent() on the feature once the sequence has been created.</td>

</tr>

<tr>

<td>getCurrentFeature</td>

<td>Returns the current feature, if one has been started. If there is no current feature (eg. it has already ended, or one was never started) then an exception is thrown.</td>

</tr>

<tr>

<td>addFeatureProperty</td>

<td>Assumes that both the key and the value of the property are strings. It uses the key to look up a term with the same name (case-sensitive) in the ontology provided by RichObjectFactory.getDefaultOntology(). If it finds no such term, it creates one. It then adds an annotation to the current feature with that term as the key, using the value provided. The first annotation receives the rank of 0, the second 1, and so on. The annotations are attached to the feature using getAnnotation().addNote().</td>

</tr>

<tr>

<td>endFeature</td>

<td>Ends the current feature. If there is no current feature, an exception is thrown.</td>

</tr>

<tr>

<td>setRankedDocRef</td>

<td>Adds the given RankedDocRef to the set of publication cross-references which the sequence being built refers to. The value cannot be null. If the same value is provided multiple times, it will only be saved once. Each value is stored by calling addRankedDocRef() on the resulting sequence.</td>

</tr>

<tr>

<td>setRankedCrossRef</td>

<td>Adds the given RankedCrossRef to the set of database cross-references which the sequence being built refers to. The value cannot be null. If the same value is provided multiple times, it will only be saved once. Each value is stored by calling addRankedCrossRef() on the resulting sequence.</td>

</tr>

<tr>

<td>setRelationship</td>

<td>Adds the given BioEntryRelationship to the set of relationships in which the sequence being built is the parent. The relationship cannot be null. If the same relationship is provided multiple times, it will only be saved once. Each relationship is stored by calling addRelationship() on the resulting sequence.</td>

</tr>

<tr>

<td>setCircular</td>

<td>You can call this as many times as you like. Each call will override the value provided by the previous call. The value is passed to the sequence’s setCircular method.</td>

</tr>

<tr>

<td>addSymbols</td>

<td>Adds symbols to this sequence. You can call it multiple times to set symbols at different locations in the sequence. If any of the symbols found are not in the alphabet accepted by this builder, or if the locations provided to place the symbols at are unacceptable, an exception is thrown. The resulting SymbolList will be the basis upon which the final RichSequence object is built.</td>

</tr>

<tr>

<td>endSequence</td>

<td>Tells the builder that we have provided all the information we know. If at this point the name, namespace, or accession have not been provided, or if any of them are null, an exception is thrown.</td>

</tr>

<tr>

<td>makeSequence</td>

<td>Constructs a RichSequence object from the information provided, following the rules laid out in this table, and returns it. The RichSequence object does not actually exist until this method has been called.</td>

</tr>

<tr>

<td>makeRichSequence</td>

<td>Wrapper for makeSequence.</td>

</tr>

</tbody>

</table>
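<p>Two of the rules in Table 8.3 are easy to overlook: sequence properties are ranked from 0 while comments are ranked from 1, and endSequence fails unless the name, namespace and accession have all been set. The following plain-Java sketch mimics just those rules. MiniBuilder is a hypothetical stand-in, not SimpleRichSequenceBuilder; its methods return the assigned rank purely so the behaviour is visible, and the use of IllegalStateException is an assumption, since the table does not say which exception is thrown.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that mimics a few rules from Table 8.3. */
class MiniBuilder {
    private String name;
    private String namespace;
    private String accession;
    private final List<String> notes = new ArrayList<String>();
    private final List<String> comments = new ArrayList<String>();

    void setName(String name) { this.name = name; }
    void setNamespace(String namespace) { this.namespace = namespace; }
    void setAccession(String accession) { this.accession = accession; } // later calls replace

    /** Returns the rank given to the property: 0 for the first, 1 for the second, and so on. */
    int addSequenceProperty(String key, String value) {
        notes.add(key + "=" + value);
        return notes.size() - 1;
    }

    /** Returns the rank given to the comment: 1 for the first, 2 for the second, and so on. */
    int setComment(String text) {
        comments.add(text);
        return comments.size();
    }

    /** Fails if any of the three required fields is still missing. */
    void endSequence() {
        if (name == null || namespace == null || accession == null) {
            throw new IllegalStateException("name, namespace and accession are all required");
        }
    }

    public static void main(String[] args) {
        MiniBuilder b = new MiniBuilder();
        System.out.println("first property rank: " + b.addSequenceProperty("keyword", "kinase"));
        System.out.println("first comment rank: " + b.setComment("draft entry"));
        b.setName("seqA");
        b.setNamespace("bloggs");
        b.setAccession("A12345");
        b.endSequence(); // succeeds now that all three required fields are set
    }
}
```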

<p>If you want fine-grained control over every aspect of a file whilst it

is being parsed, you must write your own implementation of the

RichSeqIOListener interface (which RichSequenceBuilder extends). This is

detailed later in this document.</p>

<h3 id="fasta">FASTA</h3>

<p>FastaFormat reads and writes FASTA files, and is able to parse the
