Releases · Unstructured-IO/unstructured

0.16.6

Enhancements

Every <table> tag is considered to be ontology.Table: Added special handling for tables in HTML partitioning. This change is made to improve the accuracy of table extraction from HTML documents.
Every HTML tag has a default ontology class assigned: When parsing HTML to ontology, each HTML tag defined in the ontology has an assigned default ontology class. This allows assigning an ontology class instead of UncategorizedText when the HTML tag is predicted correctly but has no class assigned.
Use an average weighted by the actual number of tables for table metrics: In evaluating table metrics, the mean aggregation now uses the actual number of tables in a document to weight the metric scores.
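
The weighting described in the last entry can be illustrated with a minimal sketch. This is not the library's evaluation code: the function name weighted_mean, the sample scores, and the table counts are all hypothetical and only show how a per-document score weighted by table count differs from a plain mean.

```python
from typing import Sequence


def weighted_mean(scores: Sequence[float], table_counts: Sequence[int]) -> float:
    """Mean of per-document scores, weighted by each document's table count."""
    if len(scores) != len(table_counts):
        raise ValueError("scores and table_counts must have the same length")
    total = sum(table_counts)
    if total == 0:
        # No tables in any document: this sketch returns NaN instead of guessing.
        return float("nan")
    return sum(s * n for s, n in zip(scores, table_counts)) / total


# Hypothetical per-document table scores and actual table counts.
doc_scores = [1.0, 0.5, 0.0]
doc_table_counts = [1, 3, 0]

weighted = weighted_mean(doc_scores, doc_table_counts)  # (1.0*1 + 0.5*3 + 0.0*0) / 4
unweighted = sum(doc_scores) / len(doc_scores)
```

In this example the document with no tables no longer drags the aggregate down, and the document with three tables counts three times as much as the one with a single table.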

Features

None added in this release.

Fixes

ElementMetadata consolidation: Now, text_as_html metadata is combined across all elements in CompositeElement when chunking HTML output.

0.16.5

What's Changed

Add max recursion limit and fix to_text() method by @plutasnyy in #3773
Fix extracting value from field by @plutasnyy in #3774
chore: remove dev and release as 0.16.5 by @badGarnet in #3775

Full Changelog: 0.16.4...0.16.5

0.16.4

Enhancements

value attribute in <input/> element is parsed to OntologyElement.text in ontology
id and class attributes removed from Table subtags in HTML partitioning
Cleaned up to_html and introduced a new to_text method in OntologyElement
Elements created from V2 HTML are less granular. Added merging of adjacent text elements and inline HTML tags in the HTML partitioner to reduce the number of elements created from V2 HTML.

Features

Add support for link extraction in pdf hi_res strategy. The partition_pdf() function now supports link extraction when using the hi_res strategy, allowing users to extract hyperlinks from PDF documents more effectively.

Fixes

0.16.3

Enhancements

Features

Fixes

V2 elements without first parent ID can be parsed
Fix missing elements when layout element parsed in V2 ontology
Updated unstructured-inference to 0.8.1 in requirements/extra-pdf-image.in

0.16.2

Enhancements

Features

Whitespace-invariant CCT distance metric. CCT Levenshtein distance for strings is by default computed with standardized whitespace.
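
As a rough illustration of what a whitespace-invariant edit distance means, the sketch below collapses every run of whitespace to a single space before computing a plain Levenshtein distance. The helper names are hypothetical and the normalization rule is an assumption; the library's own metric may standardize whitespace differently.

```python
def standardize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def whitespace_invariant_distance(a: str, b: str) -> int:
    """Edit distance after whitespace standardization (assumed behavior)."""
    return levenshtein(standardize_whitespace(a), standardize_whitespace(b))
```

With this sketch, two strings that differ only in spacing or line breaks, such as "Total  revenue\n2023" and "Total revenue 2023", have a distance of zero, while the raw distance between them is nonzero.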

Fixes

Fixed retry config settings for partition_via_api function. If the SDK's default retry config is not set, the retry config getter function no longer fails.

0.16.1

Enhancements

Bump unstructured-inference to 0.7.39 and upgrade other dependencies
Round coordinates. Round coordinates when computing bounding box overlaps in pdfminer_processing.py to nearest machine precision. This can help reduce nondeterministic behavior from machine precision that affects which bounding boxes to combine.
Request retry parameters in partition_via_api function. Expose retry-mechanism related parameters in the partition_via_api function to allow users to configure the retry behavior of the API requests.
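
To show why the coordinate rounding entry above matters, here is a minimal sketch of an overlap computation with and without rounding. It is not the code in pdfminer_processing.py: the function name, the (x1, y1, x2, y2) box layout, and the choice of six decimal places are assumptions made for illustration.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed layout


def intersection_area(box_a: Box, box_b: Box, ndigits: Optional[int] = 6) -> float:
    """Area of overlap between two boxes, optionally rounding coordinates first."""
    if ndigits is not None:
        box_a = tuple(round(v, ndigits) for v in box_a)
        box_b = tuple(round(v, ndigits) for v in box_b)
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    width = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    height = max(0.0, min(ay2, by2) - max(ay1, by1))
    return width * height


# Two boxes that should only touch at x = 0.3, but 0.1 + 0.2 is slightly above 0.3.
left = (0.0, 0.0, 0.1 + 0.2, 1.0)
right = (0.3, 0.0, 1.0, 1.0)

raw_area = intersection_area(left, right, ndigits=None)  # tiny positive sliver
rounded_area = intersection_area(left, right)            # exactly 0.0
```

Without rounding, floating-point noise produces a tiny positive overlap, so a decision such as "combine boxes that overlap" can depend on noise alone; rounding first removes that sliver.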

Features

Parsing HTML to Unstructured Elements and back

Fixes

Remove unsupported chipper model
Rewrite of partition.email module and tests. Use modern Python stdlib email module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
Minify text_as_html from XLSX. Previously .metadata.text_as_html for XLSX tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Minify text_as_html from CSV. Previously .metadata.text_as_html for CSV tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Minify text_as_html from PPTX. Previously .metadata.text_as_html for PPTX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text and structure.
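
The minification entries above can be pictured with a small sketch. This is a regex-based approximation, not the library's implementation: the function name minify_table_html and the sample markup are hypothetical, and a real implementation would more likely work on a parsed HTML tree.

```python
import re


def minify_table_html(html: str) -> str:
    """Strip tag attributes and inter-tag whitespace while keeping cell text."""
    # Drop attributes from opening tags, e.g. <table border="1"> -> <table>.
    html = re.sub(r"<(\w+)\s[^>]*?(/?)>", r"<\1\2>", html)
    # Remove whitespace that only separates one tag from the next.
    html = re.sub(r">\s+<", "><", html)
    # Trim leading and trailing whitespace around cell text.
    html = re.sub(r">\s+", ">", html)
    html = re.sub(r"\s+<", "<", html)
    return html.strip()


# Hypothetical pretty-printed table markup of the kind a formatter might emit.
bloated = """<table border="1" class="dataframe">
  <tbody>
    <tr>
      <td>  Name </td>
      <td>Total revenue</td>
    </tr>
  </tbody>
</table>"""

minified = minify_table_html(bloated)
```

The cell text survives unchanged while indentation, line breaks, and attributes are dropped, which is the kind of reduction that keeps chunk sizes closer to the actual text content.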

0.16.0

Enhancements

Remove ingest implementation. The deprecated ingest functionality has been removed, as it is now maintained in the separate unstructured-ingest repository.
- Replace extras in requirements/ingest directory with a new ingest.txt extra for installing the unstructured-ingest library.
- Remove the unstructured.ingest submodule.
- Delete all shell scripts previously used for destination ingest tests.

Features

Fixes

Add language parameter to OCRAgentGoogleVision. Introduces an optional language parameter in the OCRAgentGoogleVision constructor to serve as a language hint for document_text_detection. This ensures compatibility with the OCRAgent's get_instance method and resolves errors when parsing PDFs with Google Cloud Vision as the OCR agent.

0.15.14

Enhancements

Features

Add (but do not install) a new post-partitioning decorator to handle metadata added for all file-types, like .filename, .filetype and .languages. This will be installed in a closely following PR to replace the four currently being used for this purpose.

Fixes

Update Python SDK usage in partition_via_api. Make a minor syntax change to ensure forward compatibility with the upcoming 0.26.0 Python SDK.
Remove "unused" date_from_file_object parameter. As part of simplifying partitioning parameter set, remove date_from_file_object parameter. A file object does not have a last-modified date attribute so can never give a useful value. When a file-object is used as the document source (such as in Unstructured API) the last-modified date must come from the metadata_last_modified argument.
Fix occasional KeyError when mapping parent ids to hash ids. Occasionally the input elements into assign_and_map_hash_ids can contain duplicated element instances, which leads to an error when mapping parent ids.
Allow empty text files. Fixes an issue where text files with only white space would fail to be partitioned.
Remove double-decoration for CSV, DOC, ODT partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (CSV and DOCX in this case); remove decoration from delegating partitioners.
Remove double-decoration for PPTX, TSV, XLSX, and XML partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
Remove double-decoration for HTML, EPUB, MD, ORG, RST, and RTF partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (HTML in this case); remove decoration from delegating partitioners.
Remove obsolete min_partition/max_partition args from TXT and EML. The legacy min_partition and max_partition parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from partition_text() and partition_email().
Remove double-decoration on EML and MSG. Refactor these partitioners to rely on the new @apply_metadata() decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
Remove double-decoration for PPT. Remove decorators from the delegating PPT partitioner.
Quick-fix CI error in auto test-filetype. Better fix to follow shortly.

0.15.13

Enhancements

Improve pdfminer image cleanup process. Optimized the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This improvement reduces execution time and enhances overall processing speed of PDF documents.

Features

Fixes

Fix high memory overhead for intersection area computation. Use numpy.float32 for coordinates and remove intermediate variables to reduce memory usage when computing intersection areas.
Fix the arm64 image build. arm64 builds are now fixed and will be available again starting with the 0.15.13 release.

0.15.12

Enhancements

Improve pdfminer element processing. Implemented splitting of pdfminer elements (groups of text chunks) into smaller bounding boxes (text lines). This prevents loss of information from the object detection model and facilitates more effective removal of duplicated pdfminer text.
