Commit a5d6976

dicknetherlands authored and andreasprlic committed
/* Reading and writing files. */
1 parent 8779444 commit a5d6976

2 files changed

Lines changed: 69 additions & 0 deletions

_wikis/BioJava:BioJavaXDocs.md

Lines changed: 43 additions & 0 deletions
@@ -1230,6 +1230,49 @@ going missing if you have an LRU cache in RichObjectFactory that is too
small. This issue is avoided altogether when using the
BioSQLRichObjectFactory.

+### When File Parsers Go Wrong
+
+Sometimes you'll come across a file that is not strictly in the correct
+format, or you may even uncover a bug in one of the parsers. We always
+appreciate feedback in these cases, including the input file in question
+and a full stack trace. However, sometimes you may want to find the
+problem yourself, or even attempt to fix it! So we have produced the
+DebuggingRichSeqIOListener for this purpose.
+
+The DebuggingRichSeqIOListener is a class that acts both as a
+BufferedInputStream, so it can be passed to a RichSequenceFormat for
+reading data, and as a RichSeqIOListener, so that it can be passed to
+the same RichSequenceFormat to listen to the sequence generation events.
+It dumps all input out to STDOUT as it reads it, and notifies every
+sequence generation event to STDOUT as it is received. This way you can
+see exactly at which points in the file the events are being generated,
+the data the format was working on at the time the event was generated,
+and if an exception happens, it will appear immediately after the
+section of the file that was in error.
+
+The idea is that you do something like this (the example debugs the
+parsing of a FASTA file):
+
+<java> Namespace ns = RichObjectFactory.getDefaultNamespace();
+InputStream is = new FileInputStream("myFastaFile.fasta"); FASTAFormat
+format = new FASTAFormat();
+
+DebuggingRichSeqIOListener debug = new DebuggingRichSeqIOListener(is);
+BufferedReader br = new BufferedReader(new InputStreamReader(debug));
+
+SymbolTokenization symParser = format.guessSymbolTokenization(debug);
+
+format.readRichSequence(br, symParser, debug, ns); </java>
+
+Note that you will often get bits of file repeated in the output, as the
+format runs backwards and forwards through the file between markers it
+has set. This is perfectly normal although it may look a little strange.
+
+When reporting problems with file parsing, it would be very useful if
+you could run the above code on your chosen input file and chosen
+RichSequenceFormat, and send us a copy of the output along with the
+stacktrace and input file.
+
Creative file parsing with RichSeqIOListener.
---------------------------------------------

_wikis/BioJava:BioJavaXDocs.mediawiki

Lines changed: 26 additions & 0 deletions
@@ -1381,6 +1381,32 @@ Note that this is most effective when using BioJavaX with Hibernate to persist d

Note that you may have trouble with duplicate NCBITaxon objects or names going missing if you have an LRU cache in RichObjectFactory that is too small. This issue is avoided altogether when using the BioSQLRichObjectFactory.

+
+=== When File Parsers Go Wrong ===
+
+Sometimes you'll come across a file that is not strictly in the correct format, or you may even uncover a bug in one of the parsers. We always appreciate feedback in these cases, including the input file in question and a full stack trace. However, sometimes you may want to find the problem yourself, or even attempt to fix it! So we have produced the DebuggingRichSeqIOListener for this purpose.
+
+The DebuggingRichSeqIOListener is a class that acts both as a BufferedInputStream, so it can be passed to a RichSequenceFormat for reading data, and as a RichSeqIOListener, so that it can be passed to the same RichSequenceFormat to listen to the sequence generation events. It dumps all input out to STDOUT as it reads it, and notifies every sequence generation event to STDOUT as it is received. This way you can see exactly at which points in the file the events are being generated, the data the format was working on at the time the event was generated, and if an exception happens, it will appear immediately after the section of the file that was in error.
+
+The idea is that you do something like this (the example debugs the parsing of a FASTA file):
+
+<java>
+Namespace ns = RichObjectFactory.getDefaultNamespace();
+InputStream is = new FileInputStream("myFastaFile.fasta");
+FASTAFormat format = new FASTAFormat();
+
+DebuggingRichSeqIOListener debug = new DebuggingRichSeqIOListener(is);
+BufferedReader br = new BufferedReader(new InputStreamReader(debug));
+
+SymbolTokenization symParser = format.guessSymbolTokenization(debug);
+
+format.readRichSequence(br, symParser, debug, ns);
+</java>
+
+Note that you will often get bits of file repeated in the output, as the format runs backwards and forwards through the file between markers it has set. This is perfectly normal although it may look a little strange.
+
+When reporting problems with file parsing, it would be very useful if you could run the above code on your chosen input file and chosen RichSequenceFormat, and send us a copy of the output along with the stacktrace and input file.
+
== Creative file parsing with RichSeqIOListener. ==

=== Using RichSeqIOListeners directly ===
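
Note on the added section: the documentation describes a class that is both a BufferedInputStream and an event listener, and warns that parts of the file may appear twice in the output because the format moves back and forth between markers. The following is a minimal, JDK-only sketch of that idea, not the BioJava implementation. The names `ParseListener`, `EchoingListenerStream` and `EchoDemo` are hypothetical, the listener interface is a three-method stand-in rather than `RichSeqIOListener`, and the output is captured in a `PrintStream` supplied by the caller instead of being hard-wired to STDOUT.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a parser event listener (NOT BioJava's RichSeqIOListener).
interface ParseListener {
    void startSequence();
    void setName(String name);
    void endSequence();
}

// One object that is both the input stream and the listener:
// it echoes every byte it hands out and logs every event it receives.
class EchoingListenerStream extends BufferedInputStream implements ParseListener {
    private final PrintStream out;

    EchoingListenerStream(InputStream in, PrintStream out) {
        super(in);
        this.out = out;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) out.write(b);          // echo the byte just read
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) out.write(buf, off, n); // echo the chunk just read
        return n;
    }

    @Override public void startSequence()      { out.print("[event] startSequence\n"); }
    @Override public void setName(String name) { out.print("[event] setName: " + name + "\n"); }
    @Override public void endSequence()        { out.print("[event] endSequence\n"); }
}

class EchoDemo {
    // Runs a toy two-pass read and returns everything the debugging stream logged.
    static String run(String fasta) throws IOException {
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        PrintStream log = new PrintStream(captured, true, "UTF-8");
        EchoingListenerStream debug = new EchoingListenerStream(
                new ByteArrayInputStream(fasta.getBytes(StandardCharsets.UTF_8)), log);

        // Pass 1: peek at the header line, then rewind to the mark.
        debug.mark(1024);
        readLine(debug);
        debug.reset();

        // Pass 2: the "real" parse, which re-reads (and re-echoes) the header.
        debug.startSequence();
        String header = readLine(debug);
        debug.setName(header.substring(1)); // drop the leading '>'
        while (readLine(debug) != null) { /* consume sequence lines */ }
        debug.endSequence();

        log.flush();
        return captured.toString("UTF-8");
    }

    // Reads up to the next '\n'; returns null once the stream is exhausted.
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) >= 0 && b != '\n') sb.append((char) b);
        return (b < 0 && sb.length() == 0) ? null : sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(run(">seq1\nACGT\n"));
    }
}
```

In this sketch the header line `>seq1` is echoed twice, once for the peek before `reset()` and once for the actual parse, which is the same kind of repetition the documentation describes as normal. Each event line appears directly after the input that had been consumed when the event was raised.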

0 commit comments
