Releases: Unstructured-IO/unstructured

0.16.6

22 Nov 02:09
626f73a

Enhancements

  • Every <table> tag is considered to be ontology.Table: Added special handling for tables in HTML partitioning. This change is made to improve the accuracy of table extraction from HTML documents.
  • Every HTML tag has a default ontology class assigned: When parsing HTML to ontology, each HTML tag defined in the ontology has an assigned default ontology class. This allows assigning an ontology class instead of UncategorizedText when the HTML tag is predicted correctly but has no class assigned.
  • Use a weighted average (by actual number of tables) for table metrics: In evaluating table metrics, the mean aggregation now uses the actual number of tables in each document to weight the metric scores.
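
The weighted aggregation described in the last item can be illustrated with a minimal sketch. This is not the library's evaluation code; the function name and the per-document inputs are hypothetical, and it assumes one score and one table count per document.

```python
def weighted_table_metric(scores, table_counts):
    """Mean of per-document scores, weighted by each document's table count.

    Hypothetical helper for illustration only.
    """
    total_tables = sum(table_counts)
    if total_tables == 0:
        return None  # no tables in any document, so the metric is undefined
    weighted_sum = sum(s * n for s, n in zip(scores, table_counts))
    return weighted_sum / total_tables


# Three documents: the second has three tables, the third has none.
scores = [1.0, 0.5, 0.0]
table_counts = [1, 3, 0]
print(weighted_table_metric(scores, table_counts))  # 0.625
print(sum(scores) / len(scores))                    # 0.5 (unweighted mean)
```

In this sketch a document with no tables contributes nothing to the aggregate, while a document with many tables counts proportionally more.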

Features

  • None added in this release.

Fixes

  • ElementMetadata consolidation: Now, text_as_html metadata is combined across all elements in CompositeElement when chunking HTML output.

0.16.5

07 Nov 20:32
a6aefee

What's Changed

Full Changelog: 0.16.4...0.16.5

0.16.4

31 Oct 19:00
df156eb

Enhancements

  • value attribute in <input/> element is parsed to OntologyElement.text in ontology
  • id and class attributes removed from Table subtags in HTML partitioning
  • Cleaned up to_html and introduced to_text in OntologyElement
  • Elements created from V2 HTML are less granular. Added merging of adjacent text elements and inline HTML tags in the HTML partitioner to reduce the number of elements created from V2 HTML.
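
The idea of merging adjacent text elements can be sketched as follows. This is only an illustration under stated assumptions: the Element dataclass, its kind values, and the single-space join are hypothetical and do not reflect the HTML partitioner's actual types or rules.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a parsed element."""
    kind: str  # e.g. "text" or "table"
    text: str


def merge_adjacent_text(elements):
    """Collapse runs of consecutive "text" elements into one element each."""
    merged = []
    for el in elements:
        if merged and el.kind == "text" and merged[-1].kind == "text":
            # Extend the previous text element instead of emitting a new one.
            merged[-1] = Element("text", merged[-1].text + " " + el.text)
        else:
            merged.append(Element(el.kind, el.text))
    return merged


parts = [
    Element("text", "Hello"),
    Element("text", "world"),
    Element("table", "<table></table>"),
    Element("text", "Footer"),
]
print(len(merge_adjacent_text(parts)))  # 3
```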

Features

  • Add support for link extraction in pdf hi_res strategy. The partition_pdf() function now supports link extraction when using the hi_res strategy, allowing users to extract hyperlinks from PDF documents more effectively.

Fixes

0.16.3

25 Oct 20:55
340a07f

Enhancements

Features

Fixes

  • V2 elements without first parent ID can be parsed
  • Fix missing elements when layout element parsed in V2 ontology
  • updated unstructured-inference to be 0.8.1 in requirements/extra-pdf-image.in

0.16.2

24 Oct 17:36
9835fe4

Enhancements

Features

  • Whitespace-invariant CCT distance metric. CCT Levenshtein distance for strings is now computed with standardized whitespace by default.
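
A minimal sketch of a whitespace-invariant Levenshtein distance is shown below. It is not the library's metric code: the function names are hypothetical, and it assumes that "standardized" means collapsing every run of whitespace to a single space and trimming the ends.

```python
def standardize_whitespace(s):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    return " ".join(s.split())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def whitespace_invariant_distance(a, b):
    """Edit distance after standardizing whitespace in both strings."""
    return levenshtein(standardize_whitespace(a), standardize_whitespace(b))


print(levenshtein("hello   world\n", "hello world"))                    # 3
print(whitespace_invariant_distance("hello   world\n", "hello world"))  # 0
```

With this normalization, strings that differ only in spacing or trailing newlines are scored as identical.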

Fixes

  • Fixed retry config settings for partition_via_api function. If the SDK's default retry config is not set, the retry config getter function no longer fails.

0.16.1

23 Oct 19:26
0b4c72a

Enhancements

  • Bump unstructured-inference to 0.7.39 and upgrade other dependencies
  • Round coordinates. Round coordinates when computing bounding box overlaps in pdfminer_processing.py to the nearest machine precision. This can help reduce nondeterministic behavior from machine precision that affects which bounding boxes to combine.
  • Request retry parameters in partition_via_api function. Expose retry-mechanism related parameters in the partition_via_api function to allow users to configure the retry behavior of the API requests.
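
To illustrate why rounding coordinates matters for the overlap computation mentioned above, here is a small sketch. It is not the code in pdfminer_processing.py; the function name, the (x1, y1, x2, y2) box layout, and the fixed ndigits value are assumptions made for the example.

```python
def intersection_area(box_a, box_b, ndigits=6):
    """Area of overlap between two (x1, y1, x2, y2) boxes.

    Coordinates are rounded first so that floating-point noise cannot turn
    two touching boxes into an apparent overlap. ndigits is an assumed value.
    """
    ax1, ay1, ax2, ay2 = (round(v, ndigits) for v in box_a)
    bx1, by1, bx2, by2 = (round(v, ndigits) for v in box_b)
    width = min(ax2, bx2) - max(ax1, bx1)
    height = min(ay2, by2) - max(ay1, by1)
    return max(width, 0.0) * max(height, 0.0)


# 0.1 + 0.2 is slightly larger than 0.3 in binary floating point, so without
# rounding these two boxes would appear to overlap by a sliver.
left = (0.0, 0.0, 0.1 + 0.2, 1.0)
right = (0.3, 0.0, 1.0, 1.0)
print((0.1 + 0.2) > 0.3)               # True
print(intersection_area(left, right))  # 0.0
```

Because decisions such as "combine these two boxes" often depend on whether an overlap is zero or not, removing this noise makes the outcome repeatable across machines.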

Features

  • Parsing HTML to Unstructured Elements and back

Fixes

  • Remove unsupported chipper model
  • Rewrite of partition.email module and tests. Use modern Python stdlib email module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
  • Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
  • Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
  • Minify text_as_html from XLSX. Previously .metadata.text_as_html for XLSX tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
  • Minify text_as_html from CSV. Previously .metadata.text_as_html for CSV tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
  • Minify text_as_html from PPTX. Previously .metadata.text_as_html for PPTX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text and structure.
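
The text_as_html minification items above can be pictured with a rough sketch. This is not how the library implements it: the function name is hypothetical, the regular expressions assume simple table markup (no ">" inside attribute values, no inline tags within cell text), and treating attributes as the "noise" to drop is an assumption.

```python
import re


def minify_table_html(html):
    """Strip attributes and layout whitespace from simple table HTML."""
    # Drop attributes from opening tags: <td style="..."> becomes <td>.
    html = re.sub(r"<(\w+)\s[^>]*>", r"<\1>", html)
    # Remove whitespace that directly follows or precedes a tag.
    html = re.sub(r">\s+", ">", html)
    html = re.sub(r"\s+<", "<", html)
    return html.strip()


bloated = """
<table border="1" class="dataframe">
  <tr>
    <td style="text-align: right;">  a </td>
    <td>b</td>
  </tr>
</table>
"""
print(minify_table_html(bloated))
# <table><tr><td>a</td><td>b</td></tr></table>
```

The cell text survives unchanged while the character count drops sharply, which is what reduces over-chunking downstream.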

0.16.0

17 Oct 19:32
9049e4e

Enhancements

  • Remove ingest implementation. The deprecated ingest functionality has been removed, as it is now maintained in the separate unstructured-ingest repository.
    • Replace extras in requirements/ingest directory with a new ingest.txt extra for installing the unstructured-ingest library.
    • Remove the unstructured.ingest submodule.
    • Delete all shell scripts previously used for destination ingest tests.

Features

Fixes

  • Add language parameter to OCRAgentGoogleVision. Introduces an optional language parameter in the OCRAgentGoogleVision constructor to serve as a language hint for document_text_detection. This ensures compatibility with the OCRAgent's get_instance method and resolves errors when parsing PDFs with Google Cloud Vision as the OCR agent.

0.15.14

10 Oct 20:55
6ba376a

Enhancements

Features

  • Add (but do not install) a new post-partitioning decorator to handle metadata added for all file-types, like .filename, .filetype and .languages. This will be installed in a closely following PR to replace the four currently being used for this purpose.

Fixes

  • Update Python SDK usage in partition_via_api. Make a minor syntax change to ensure forward compatibility with the upcoming 0.26.0 Python SDK.
  • Remove "unused" date_from_file_object parameter. As part of simplifying the partitioning parameter set, remove the date_from_file_object parameter. A file object does not have a last-modified date attribute, so it can never give a useful value. When a file object is used as the document source (such as in Unstructured API), the last-modified date must come from the metadata_last_modified argument.
  • Fix occasional KeyError when mapping parent ids to hash ids. Occasionally the input elements into assign_and_map_hash_ids can contain duplicated element instances, which leads to an error when mapping parent ids.
  • Allow empty text files. Fixes an issue where text files containing only whitespace would fail to be partitioned.
  • Remove double-decoration for CSV, DOC, ODT partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (CSV and DOCX in this case); remove decoration from delegating partitioners.
  • Remove double-decoration for PPTX, TSV, XLSX, and XML partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
  • Remove double-decoration for HTML, EPUB, MD, ORG, RST, and RTF partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (HTML in this case); remove decoration from delegating partitioners.
  • Remove obsolete min_partition/max_partition args from TXT and EML. The legacy min_partition and max_partition parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from partition_text() and partition_email().
  • Remove double-decoration on EML and MSG. Refactor these partitioners to rely on the new @apply_metadata() decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
  • Remove double-decoration for PPT. Remove decorators from the delegating PPT partitioner.
  • Quick-fix CI error in auto test-filetype. Better fix to follow shortly.

0.15.13

20 Sep 14:25
7d66a23

Enhancements

  • Improve pdfminer image cleanup process. Optimized the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This improvement reduces execution time and enhances overall processing speed of PDF documents.

Features

Fixes

  • Fix high memory overhead for intersection area computation. Use numpy.float32 for coordinates and remove intermediate variables to reduce memory usage when computing intersection areas.
  • Fix the arm64 image build. arm64 builds are now fixed and will be available again starting with the 0.15.13 release.

0.15.12

13 Sep 14:39
8b7e5bb

Enhancements

  • Improve pdfminer element processing. Implemented splitting of pdfminer elements (groups of text chunks) into smaller bounding boxes (text lines). This prevents loss of information from the object detection model and facilitates more effective removal of duplicated pdfminer text.