Extract Structured Data Using Google's Pinpoint
You can use Pinpoint to extract structured data from a collection of similarly formatted digitized or scanned PDF documents into a set of spreadsheets.
This feature works best when the documents in the collection:
- Share the same template
- Share the same reading order (left-to-right or right-to-left only)
- Use a form-like format, a tabular format, or a combination of both
For example, if you have ten thousand scanned auto accident reports that use a similar form, you can import the scans and export a spreadsheet that enables you to group, sort, or filter accidents by date, automobile manufacturer, or any other fields provided in the source documents.
You must have full access to Pinpoint to use this feature. If you don't have full access, you can request full access using this form.
Prepare your Pinpoint collection
- Navigate to the collection that contains the documents from which you want to extract structured data
- If you don't have a collection in Pinpoint for processing, create a new collection with the documents from which you want to extract structured data
- Make sure your collection has been fully processed by Pinpoint. Depending on the size and number of files, processing can take up to 24 hours
- Click the "Extract Structured Data" link on the lower left side of the collection view
- Click the "Process collection" button. The processing can take from seconds to hours, depending on the size of your collection
- Once the processing is complete, click "Annotate collection"
If you add documents to the processed Pinpoint collection, you need to reprocess the collection. See "Reprocess annotated collection" for more detail.
Choose golden document
The Extract Structured Data tool will direct you to the annotation editor page and automatically select a "golden" document for you. This is a single document in which you create an annotation template to be applied to all of the documents in the same collection.
If you think the selected golden document is not the best fit for annotation, you can replace it with another document in the collection. See "Replace golden document".
If the document template in your collection has a lot of optional fields in it, we recommend choosing the document with the most optional fields available as the golden document to ensure the highest matching compatibility with all of the documents in your collection.
In the rare case where not all desired fields are covered in a single golden document, you can then add more golden documents to accommodate additional optional fields. See "Add golden document".
Annotate collection
The annotation editor page is divided into four major sections:
- Main editor
This is the dominant part of the page where you will perform document annotations. You will see your golden document and your added annotations in this section.
- Toolbar
This section is at the top of the page, where you can find all of the action menus for the annotation editor page as well as the name of the golden document that you are working on.
- Annotations list
This section is at the right hand side of the page where you will see the list of annotations you created in the golden document.
- Preview table
This section is at the bottom of the page, where you can preview the values of the extracted fields from 10 randomly selected documents in your collection.
Currently, the tool only supports extraction to text or checkbox (boolean). All numerical values will be converted to text/string.
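Because numerical values arrive as text, you may want to convert columns yourself after export. The following is a minimal Python sketch and is not part of Pinpoint. The helper names `to_number` and `to_bool` are hypothetical. It also assumes that numbers use commas as thousands separators and that checkbox values appear in the export as words such as "true" or "false"; this article does not confirm either, so check your own export first.

```python
from typing import Optional


def to_number(text: str) -> Optional[float]:
    """Parse an extracted text value as a number, or return None.

    Assumption: commas are thousands separators (for example "1,250.50").
    """
    cleaned = text.strip().replace(",", "")
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        # The value is not numeric (for example "n/a"); leave it to the caller.
        return None


def to_bool(text: str) -> Optional[bool]:
    """Map an extracted checkbox value to True/False, or None if unrecognized.

    Assumption: the accepted spellings below are guesses, not a documented
    Pinpoint format.
    """
    normalized = text.strip().lower()
    if normalized in {"true", "yes", "1", "checked"}:
        return True
    if normalized in {"false", "no", "0", "unchecked"}:
        return False
    return None
```

Returning `None` for unrecognized values, rather than raising an error, makes it easier to spot cells where the extraction picked up unexpected text.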
Key/Value
This tool is best used to extract a single labeled value from your collection. An example of the result of this annotation is "Country" as key and "United States of America" as the value.
To annotate your document using the Key/value annotation, follow these steps:
- Select the Key/value annotation tool at the top of the annotation editor page
- Draw a rectangle around the value that you want to extract. You should make the rectangle longer to accommodate values with more characters in other documents
- The tool will automatically select and mark a key for the value you selected. You can drag and edit this marker for accurate annotation
- To change the name of the column header in the extracted data, you can edit the name of the key parameter within the Annotations section on the right side of the window
- Repeat the steps for all the key-value pairs you wish to extract from your document collection
Each annotation is an approximate marker for the tool to extract the data from all of the documents in your collection.
When available, you can follow grids or markers in your document. If not, make sure you leave room for longer values.
Repeated section
This tool is best used to extract a section with repeating key-value pair(s). The annotation will be able to cover any number of continuous repeated sections over multiple pages.
To annotate your document using the Repeated Section annotation, follow these steps:
- Select the Repeated Section annotation tool at the top of the annotation editor page
- Mark across the height of the first repeated instance of the section
- The tool will automatically create a line approximately below the marked instance. Drag the line until the whole section you want to annotate is highlighted
- Enter the name of the section in the "Repeated section name" pop-up
- Click "Save section"
- Select the Key/value annotation tool at the top of the annotation editor page
- Within the range of the first repeated instance, follow the key/value annotation steps for all of the key-value pairs you want to extract
Tables
This tool is best used to extract data stored in tabular format. You will need to annotate each table you wish to extract in the document. Please note that the tool will work for a table that spans multiple pages, including repeated headers.
The tool will work best if the annotated table has the same horizontal dimension, format, and headers across all of the documents in the collection.
To annotate your document using the Tables annotation, follow these steps:
- Select the Tables annotation tool at the top of the annotation editor page
- Draw a rectangle over the table you wish to extract your data from. If the table spans multiple pages, you can highlight only the first page of the table
- The tool will try to approximately detect the table. If this doesn't roughly cover the table, please repeat the highlighting step
- Adjust the outline to match the outline of the table. Drag the bottom line so all parts of the table are highlighted, including repeated headers and parts which are on following pages
- Enter the table name inside the pop-up box
- Indicate whether the table has a header using the toggle in the pop-up box
- Adjust the header and column separator lines to match the table's formatting, so that they clearly mark the column widths and the table headers in the document. You can add or delete column separators by right-clicking a column separator
- Click "Save Table"
Extract and download your data
Once you are happy with the result shown in the preview table, you can extract your data by clicking the "Extract" button in the top right-hand corner of the annotation editor page. This extraction applies only to the current set of annotations. If you edit the annotations for your collection at a later date, you need to redo the extraction.
Once the extraction is complete, you can download the data by clicking "Download". You will get a zip file containing CSV file(s), one for each tab in the preview table and one summary file for all the documents in the collection.
You can review the extraction result for a document by clicking the link corresponding to that document provided in the summary file. See "Review extraction result".
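If you want to work with the downloaded zip file programmatically, the sketch below loads every CSV in the archive into memory using only the Python standard library. It is an illustration under stated assumptions, not a Pinpoint API. The function name `load_extraction` and the file name `pinpoint_export.zip` are hypothetical, and the code assumes the CSV files are UTF-8 encoded with a header row.

```python
import csv
import io
import zipfile


def load_extraction(zip_source):
    """Return {csv_name: [row_dict, ...]} for every CSV file in the archive.

    zip_source may be a file path or a binary file-like object.
    Assumption: the CSVs are UTF-8 (with or without a BOM) and have a header row.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip anything in the archive that is not a CSV
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables


# Example usage (hypothetical file and column names):
# tables = load_extraction("pinpoint_export.zip")
# for name, rows in tables.items():
#     print(name, len(rows), "rows")
```

Each row is a dictionary keyed by column header, so you can filter or group on any annotated field with ordinary Python before moving the data into a spreadsheet or database.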
Review extraction result
Manage annotated collection
Reprocess annotated collection
To redo the processing that the Extract Structured Data tool performs on your collection, follow these steps:
- Navigate into the annotation editor page for your collection
- In the annotation editor page, click the (three dots) menu
- Select "Re-process collection"
- Continue with choosing golden document(s) and annotating your collection
Manage golden documents
Replace golden document
To replace a golden document with another one, follow these steps:
- Navigate into the annotation editor page for your collection
- In the annotation editor page, click the (three dots) menu
- Select "Replace golden document"
- Select your preferred golden document from the sample set, then click "OK"
- In the document review page, click "Set as golden" in the top right-hand corner
- Select "Replace an existing golden document", then click "OK"
The next step is dependent on whether the collection has a previously annotated golden document:
- If it does, see "Annotation transfer"
- If it doesn't, start annotating your golden document
Add golden document
While reviewing extraction results, you can add more golden documents to accommodate slight differences in the document template and additional optional fields that appear only in some documents.
You can do this by following these steps:
- Navigate to the document review page linked in the sample set preview table or the downloadable main summary CSV
- Click "Set as golden" in the top right-hand corner
- Select "Add a new golden document", then click "OK"
The annotation process for additional golden documents is different from the regular annotation process. See "Annotation transfer" for details.
Remove a golden document from the set
- Select the name of the golden document from the file name dropdown at the top of the annotation editor page
- In the same dropdown, select âRemove from golden documents setâ
- Click "Delete" in the following prompt to approve the action
Annotation transfer
After you add a new golden document to the set or replace an existing annotated golden document, the tool will approximately match the previously existing annotations to the new golden document.
If the tool can't match a previously annotated field to the new golden document, the field will be marked as "Needs attention" in the Annotations section on the right-hand side of the annotation editor page.
To resolve this, do one of the following:
- If the field is actually available in the new golden document:
  - Add the annotation for that field
  - Select "Resolve a 'needs attention' key/value" in the prompt window
  - Select the name of the field from the dropdown
  - Click "OK"
- If the field is not available in the new golden document:
  - Select the field box that needs attention in the Annotations section
  - Click to mark the field as missing from only the new golden document
If the new golden document contains data that is not covered in the Annotations section, you can manually annotate that data to add it only to the new golden document.
Edit annotation
Change field name or type
- Select the field box in the Annotations section on the right hand side of the annotation editor page
- Edit field name or type directly in the field box
- Click "OK" in the following prompt
Adjust key/value annotation
- Click on the value annotation box you wish to adjust
- Drag the selected box to move it, or adjust its dimensions by moving the edges
- The adjustment applies only to the currently edited golden document
Adjust repeated section annotation
- Click anywhere on the repeated section annotation you wish to adjust
- Adjust the section's dimensions by moving the separator lines vertically
- The adjustment applies only to the currently edited golden document
Adjust table annotation
- Click anywhere on the table annotation box you wish to adjust
- Drag the lines within the box to adjust the dimensions, column widths, and header row
- The adjustment applies only to the currently edited golden document
Delete annotation
To delete any annotation from all golden documents, follow these steps:
- Select a field in the Annotations section on the right hand side of the annotation editor page
- Click and acknowledge that you wish to delete the field from all golden documents