
Commit 7e024fc

Merge branch 'master' of github.com:tabulapdf/tabula-java into batch-processing

Author: Alex Wilson
2 parents a747549 + adb7738, commit 7e024fc

93 files changed

Lines changed: 6255 additions & 5543 deletions


.github/dependabot.yml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+version: 2
+updates:
+- package-ecosystem: maven
+  directory: "/"
+  schedule:
+    interval: daily
+  open-pull-requests-limit: 10

.travis.yml

Lines changed: 4 additions & 6 deletions
@@ -2,10 +2,8 @@ language: java
 install: mvn install -DskipTests=true -Dmaven.javadoc.skip=true -Dgpg.skip=true -B -V
 script: mvn test -Dgpg.skip=true
 jdk:
-- oraclejdk7
-- openjdk7
-- oraclejdk8
+- openjdk8
+- openjdk9
+- openjdk10
+- openjdk11
 sudo: false
-
-
-

README.md

Lines changed: 69 additions & 30 deletions
@@ -1,11 +1,9 @@
-tabula-java [![Build Status](https://travis-ci.org/tabulapdf/tabula-java.svg?branch=master)](https://travis-ci.org/tabulapdf/tabula-java) [![Join the chat at https://gitter.im/tabulapdf/tabula-java](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/tabulapdf/tabula-java?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
+tabula-java [![Build Status](https://travis-ci.org/tabulapdf/tabula-java.svg?branch=master)](https://travis-ci.org/tabulapdf/tabula-java) [![Build status](https://ci.appveyor.com/api/projects/status/l5gym1mjhrd2v8yn?svg=true)](https://ci.appveyor.com/project/jazzido/tabula-java)
 ===========
 
-`tabula-java` is a library for extracting tables from PDF files — it is the table extraction engine that used to power [Tabula](http://tabula.technology/) ([repo](http://github.com/tabulapdf/tabula)). You can use `tabula-java` as a command-line tool to programmatically extract tables from PDFs.
+`tabula-java` is a library for extracting tables from PDF files — it is the table extraction engine that powers [Tabula](http://tabula.technology/) ([repo](http://github.com/tabulapdf/tabula)). You can use `tabula-java` as a command-line tool to programmatically extract tables from PDFs.
 
-(This is the new version of the extraction engine; the previous code can be found at [`tabula-extractor`](http://github.com/tabulapdf/tabula-extractor).)
-
-© 2014-2016 Manuel Aristarán. Available under MIT License. See [`LICENSE`](LICENSE).
+© 2014-2020 Manuel Aristarán. Available under MIT License. See [`LICENSE`](LICENSE).
 
 ## Download
 
@@ -16,58 +14,72 @@ Download a version of the tabula-java's jar, with all dependencies included, tha
 `tabula-java` provides a command line application:
 
 ```
-$ java -jar ./target/tabula-0.9.1-jar-with-dependencies.jar --help
-
-usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f <FORMAT>] [-g] [-h] [-i]
-       [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s <PASSWORD>] [-u] [-v]
+$ java -jar target/tabula-1.0.5-jar-with-dependencies.jar --help
+usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-f <FORMAT>]
+       [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s
+       <PASSWORD>] [-t] [-u] [-v]
 
 Tabula helps you extract tables from PDFs
- -a,--area <AREA>           Portion of the page to analyze
-                            (top,left,bottom,right). Example: --area
-                            269.875,12.75,790.5,561. Default is entire
-                            page
- -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
-                            --columns 10.1,20.2,30.3
- -d,--debug                 Print detected table areas instead of
-                            processing.
- -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory
 
+ -a,--area <AREA>           -a/--area = Portion of the page to analyze.
+                            Example: --area 269.875,12.75,790.5,561.
+                            Accepts top,left,bottom,right i.e. y1,x1,y2,x2
+                            where all values are in points relative to the
+                            top left corner. If all values are between
+                            0-100 (inclusive) and preceded by '%', input
+                            will be taken as % of actual height or width
+                            of the page. Example: --area %0,0,100,50. To
+                            specify multiple areas, -a option should be
+                            repeated. Default is entire page
+ -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
+ -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
+                            --columns 10.1,20.2,30.3. If all values are
+                            between 0-100 (inclusive) and preceded by '%',
+                            input will be taken as % of actual width of
+                            the page. Example: --columns %25,50,80.6
  -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
  -g,--guess                 Guess the portion of the page to analyze per
                             page.
  -h,--help                  Print this help text.
  -i,--silent                Suppress all stderr output.
- -n,--no-spreadsheet        Force PDF not to be extracted using
-                            spreadsheet-style extraction (if there are
-                            ruling lines separating each cell, as in a PDF
-                            of an Excel spreadsheet)
+ -l,--lattice               Force PDF to be extracted using lattice-mode
+                            extraction (if there are ruling lines
+                            separating each cell, as in a PDF of an Excel
+                            spreadsheet)
+ -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
+                            not to be extracted using spreadsheet-style
+                            extraction (if there are no ruling lines
+                            separating each cell)
  -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                             Default: -
  -p,--pages <PAGES>         Comma separated list of ranges, or all.
                             Examples: --pages 1-3,5-7, --pages 3 or
                             --pages all. Default is --pages 1
- -r,--spreadsheet           Force PDF to be extracted using
-                            spreadsheet-style extraction (if there are
-                            ruling lines separating each cell, as in a PDF
-                            of an Excel spreadsheet)
+ -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
+                            PDF to be extracted using spreadsheet-style
+                            extraction (if there are ruling lines
+                            separating each cell, as in a PDF of an Excel
+                            spreadsheet)
  -s,--password <PASSWORD>   Password to decrypt document. Default is empty
+ -t,--stream                Force PDF to be extracted using stream-mode
+                            extraction (if there are no ruling lines
+                            separating each cell)
  -u,--use-line-returns      Use embedded line returns in cells. (Only in
                             spreadsheet mode.)
  -v,--version               Print version and exit.
-
 ```
 
-It also includes a debugging tool, run `java -cp ./target/tabula-0.9.1-jar-with-dependencies.jar technology.tabula.debug.Debug -h` for the available options.
+It also includes a debugging tool, run `java -cp ./target/tabula-1.0.5-jar-with-dependencies.jar technology.tabula.debug.Debug -h` for the available options.
 
 You can also integrate `tabula-java` with any JVM language. For Java examples, see the [`tests`](src/test/java/technology/tabula/) folder.
 
 JVM start-up time is a lot of the cost of the `tabula` command, so if you're trying to extract many tables from PDFs, you have a few options for speeding it up:
 
 - the -b option, which allows you to convert all pdfs in a given directory
 - the [drip](https://github.com/ninjudd/drip) utility
-- the [Ruby](http://github.com/tabulapdf/tabula-extractor), [R](https://github.com/leeper/tabulizer), and [Node.js](https://github.com/ezodude/tabula-js) bindings
+- the [Ruby](http://github.com/tabulapdf/tabula-extractor), [Python](https://github.com/chezou/tabula-py), [R](https://github.com/leeper/tabulizer), and [Node.js](https://github.com/ezodude/tabula-js) bindings
 - writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java.
-- waiting for us to implement an API/server-style system (it's on the roadmap)
+- waiting for us to implement an API/server-style system (it's on the [roadmap](https://github.com/tabulapdf/tabula-api))
 
 ## Building from Source
 
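Review note on the new `--area` help text in the hunk above: the `%` prefix means top and bottom scale with page height while left and right scale with page width. The sketch below illustrates only that arithmetic. It is not tabula-java's parser; the class `AreaSketch`, the method `toPoints`, and the decision to reject out-of-range percentages are assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical helper, not part of tabula-java: converts an --area style
// argument ("top,left,bottom,right", optionally prefixed by '%') to points.
class AreaSketch {
    static double[] toPoints(String spec, double pageWidth, double pageHeight) {
        boolean percent = spec.startsWith("%");
        String[] parts = (percent ? spec.substring(1) : spec).split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected top,left,bottom,right");
        }
        double[] v = new double[4];
        for (int i = 0; i < 4; i++) {
            v[i] = Double.parseDouble(parts[i].trim());
        }
        if (!percent) {
            return v; // values are already in points from the top left corner
        }
        for (double x : v) {
            // Assumption: this sketch rejects percentages outside 0-100.
            if (x < 0 || x > 100) {
                throw new IllegalArgumentException("percentage out of range: " + x);
            }
        }
        // top and bottom are fractions of the height, left and right of the width
        return new double[] {
            v[0] / 100 * pageHeight, v[1] / 100 * pageWidth,
            v[2] / 100 * pageHeight, v[3] / 100 * pageWidth
        };
    }

    public static void main(String[] args) {
        // A US Letter page is 612 x 792 points.
        System.out.println(Arrays.toString(toPoints("%0,0,100,50", 612, 792)));
        // prints [0.0, 0.0, 792.0, 306.0]
    }
}
```

With the help text's own example, `--area %0,0,100,50` on a 612 x 792 point page selects the full height and the left half of the width.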
@@ -76,3 +88,30 @@ Clone this repo and run:
 ```
 mvn clean compile assembly:single
 ```
+
+## Contributing
+
+Interested in helping out? We'd love to have your help!
+
+You can help by:
+
+- [Reporting a bug](https://github.com/tabulapdf/tabula-java/issues).
+- Adding or editing documentation.
+- Contributing code via a Pull Request.
+- Spreading the word about `tabula-java` to people who might be able to benefit from using it.
+
+### Backers
+
+You can also support our continued work on `tabula-java` with a one-time or monthly donation [on OpenCollective](https://opencollective.com/tabulapdf#support). Organizations who use `tabula-java` can also [sponsor the project](https://opencollective.com/tabulapdf#support) for acknowledgement on [our official site](http://tabula.technology/) and this README.
+
+Special thanks to the following users and organizations for generously supporting Tabula with donations and grants:
+
+<a href="https://opencollective.com/tabulapdf/backer/0/website" target="_blank"><img src="https://opencollective.com/tabulapdf/backer/0/avatar"></a>
+<a href="https://opencollective.com/tabulapdf/backer/1/website" target="_blank"><img src="https://opencollective.com/tabulapdf/backer/1/avatar"></a>
+<a href="https://opencollective.com/tabulapdf/backer/2/website" target="_blank"><img src="https://opencollective.com/tabulapdf/backer/2/avatar"></a>
+<a href="https://opencollective.com/tabulapdf/backer/3/website" target="_blank"><img src="https://opencollective.com/tabulapdf/backer/3/avatar"></a>
+<a href="https://opencollective.com/tabulapdf/backer/4/website" target="_blank"><img src="https://opencollective.com/tabulapdf/backer/4/avatar"></a>
+<a href="https://opencollective.com/tabulapdf/backer/5/website" target="_blank"><img src="https://opencollective.com/tabulapdf/backer/5/avatar"></a>
+
+<a title="The John S. and James L. Knight Foundation" href="http://www.knightfoundation.org/" target="_blank"><img alt="The John S. and James L. Knight Foundation" src="https://knightfoundation.org/wp-content/uploads/2019/10/KF_Logotype_Icon-and-Stacked-Name.png" width="300"></a>
+<a title="The Shuttleworth Foundation" href="https://shuttleworthfoundation.org/" target="_blank"><img width="200" alt="The Shuttleworth Foundation" src="https://raw.githubusercontent.com/tabulapdf/tabula/gh-pages/shuttleworth.jpg"></a>
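Review note on the README changes: the updated help text deprecates `-r/--spreadsheet` in favor of `-l/--lattice` and `-n/--no-spreadsheet` in favor of `-t/--stream`. A wrapper that still passes the old flags could translate them before invoking the jar. The sketch below is one rough way to do that; `FlagSketch` and `modernize` are hypothetical names for this example, and the mapping is taken only from the help text shown in this diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of tabula-java: rewrites deprecated flags
// to the replacements named in the updated help text.
class FlagSketch {
    static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("-r", "-l");
        REPLACEMENTS.put("--spreadsheet", "--lattice");
        REPLACEMENTS.put("-n", "-t");
        REPLACEMENTS.put("--no-spreadsheet", "--stream");
    }

    static List<String> modernize(List<String> args) {
        List<String> out = new ArrayList<>();
        for (String arg : args) {
            // Naive token match: an option value that happens to equal a
            // deprecated flag (for example a password "-r") is rewritten too.
            out.add(REPLACEMENTS.getOrDefault(arg, arg));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> old = Arrays.asList("-r", "-p", "all", "--no-spreadsheet", "in.pdf");
        System.out.println(modernize(old));
        // prints [-l, -p, all, --stream, in.pdf]
    }
}
```

The token-by-token match keeps the sketch short; a real wrapper would need to know which options take values before rewriting.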

appveyor.yml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+version: '{build}'
+install:
+  - ps: |
+      Add-Type -AssemblyName System.IO.Compression.FileSystem
+      if (!(Test-Path -Path "C:\maven\apache-maven-3.5.4" )) {
+        (new-object System.Net.WebClient).DownloadFile(
+          'http://www-us.apache.org/dist/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.zip',
+          'C:\maven-bin.zip'
+        )
+        [System.IO.Compression.ZipFile]::ExtractToDirectory("C:\maven-bin.zip", "C:\maven")
+      }
+  - cmd: SET PATH=C:\maven\apache-maven-3.5.4\bin;%JAVA_HOME%\bin;%PATH%
+  - cmd: SET MAVEN_OPTS=-Xmx2g
+  - cmd: SET JAVA_OPTS=-Xmx2g
+build_script:
+  - mvn clean package -B -DskipTests -Dmaven.javadoc.skip=true
+test_script:
+  - mvn install -B -Dmaven.javadoc.skip=true -Dgpg.skip
+cache:
+  - C:\maven -> appveyor.yml
+  - C:\Users\appveyor\.m2 -> appveyor.yml

jbang-catalog.json

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+{
+  "catalogs": {},
+  "aliases": {
+    "tabula": {
+      "script-ref": "https://github.com/tabulapdf/tabula-java/releases/download/v1.0.4/tabula-1.0.4-jar-with-dependencies.jar"
+    }
+  }
+}
