0.16.6

Enhancements

Every <table> tag is considered to be ontology.Table. Added special handling for tables in HTML partitioning to improve the accuracy of table extraction from HTML documents.
Every HTML tag has a default ontology class assigned. When parsing HTML to ontology, each HTML tag defined in the ontology has a default ontology class assigned. This makes it possible to assign an ontology class instead of UncategorizedText when the HTML tag is predicted correctly but no class is assigned.
Use (number of actual tables) weighted average for table metrics. When evaluating table metrics, the mean aggregation now weights each document's metric scores by the actual number of tables in that document.
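
As an illustration of the table-count weighting described above, here is a minimal sketch. The function name `weighted_mean` and the sample numbers are hypothetical and are not taken from the evaluation code.

```python
from typing import Sequence


def weighted_mean(scores: Sequence[float], table_counts: Sequence[int]) -> float:
    """Mean of per-document scores, weighted by each document's table count."""
    if len(scores) != len(table_counts):
        raise ValueError("scores and table_counts must have the same length")
    total_tables = sum(table_counts)
    if total_tables == 0:
        # No tables anywhere: the weighted mean is undefined.
        return float("nan")
    return sum(s * n for s, n in zip(scores, table_counts)) / total_tables


# Hypothetical data: the second document has three tables, so its score
# counts three times as much as the score of the first document.
doc_scores = [1.0, 0.5]
doc_table_counts = [1, 3]
overall = weighted_mean(doc_scores, doc_table_counts)
```

With an unweighted mean the two documents would contribute equally; here the document with more tables dominates the aggregate.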

Features

Fixes

ElementMetadata consolidation. text_as_html metadata is now combined across all elements in a CompositeElement when chunking HTML output.

0.16.5

Enhancements

Features

Fixes

Fixes parsing in the HTML v2 parser. A max recursion limit is now set, and the value is correctly extracted from the ontology element.

Add language parameter to OCRAgentGoogleVision. Introduces an optional language parameter in the OCRAgentGoogleVision constructor to serve as a language hint for document_text_detection. This ensures compatibility with the OCRAgent's get_instance method and resolves errors when parsing PDFs with Google Cloud Vision as the OCR agent.

0.15.14

Improve pdfminer element processing. Implemented splitting of pdfminer elements (groups of text chunks) into smaller bounding boxes (text lines). This prevents loss of information from the object detection model and facilitates more effective removal of duplicated pdfminer text.

Features

Fixes

Fixed table accuracy metric. Table accuracy was incorrectly using the column content difference when calculating row accuracy.

0.15.11

Enhancements

Add deprecation warning to embed code

Remove ingest console script

0.15.10

Enhancements

Enhance pdfminer element cleanup. Expand removal of pdfminer elements to include those inside all non-pdfminer elements, not just tables.

Modified analysis drawing tools to dump to files and draw from dumps. If the analysis parameter of the partition_pdf function is set to True, the layouts for Object Detection, Pdfminer Extraction, OCR, and the final layout are dumped as JSON files. The drawers now accept dict (dump) objects instead of instances of internal classes.

Vectorize pdfminer elements deduplication computation. Use numpy operations to compute IOU and sub-region membership instead of a simple loop. This improves the speed of deduplicating elements for pages with a lot of elements.

Features

Fixes

0.15.9

Enhancements

Features

Add support for encoding parameter in partition_csv

Fixes

Check storage contents for OLE file type detection. Updates detect_filetype to check the content of OLE files to more reliably differentiate DOC, PPT, XLS, and MSG files. As part of this, the "msg" extra was removed because the python-oxmsg package is now a base dependency.

Fix disk space leaks and Windows errors when accessing file.name on a NamedTemporaryFile. Uses of NamedTemporaryFile(..., delete=False) and/or uses of file.name of NamedTemporaryFiles have been replaced with TemporaryFileDirectory to avoid a known issue: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile
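
The general pattern behind this fix can be sketched with the standard library. This is a sketch under assumptions, not the library's code: it uses tempfile.TemporaryDirectory, and the helper name `roundtrip_via_temp_dir` is hypothetical.

```python
import os
import tempfile
from typing import Tuple


def roundtrip_via_temp_dir(data: bytes, filename: str = "document.bin") -> Tuple[bytes, str]:
    """Write `data` to a file inside a temporary directory and read it back.

    Returns the bytes read and the (by then deleted) directory path.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, filename)
        with open(path, "wb") as f:
            f.write(data)
        # The file handle is closed here, so the path can be reopened
        # (including on Windows) without holding an open temp-file object.
        with open(path, "rb") as f:
            result = f.read()
    # Leaving the `with` block removes the directory and everything in it,
    # so nothing is left behind on disk.
    return result, tmp_dir
```

Because cleanup is tied to the context manager rather than to delete=False bookkeeping, the temporary file cannot outlive the call.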

0.15.8

Enhancements

Bump unstructured.paddleocr to 2.8.1.0.

Features

Fixes

Fix a bug where multiple soffice processes could be attempted. Add a wait mechanism in convert_office_doc so that the function first checks whether another soffice process is already running; if so, it waits until that process finishes or the wait timeout elapses before spawning a subprocess to run soffice.
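
A wait mechanism of this kind might look roughly like the following sketch. The function `wait_until_free` and its `is_busy` callback are hypothetical names; how convert_office_doc actually detects a running soffice process is not shown here.

```python
import time
from typing import Callable


def wait_until_free(is_busy: Callable[[], bool], timeout: float = 10.0, poll: float = 0.1) -> bool:
    """Poll `is_busy` until it returns False or `timeout` seconds have passed.

    Returns True if the resource became free, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while is_busy():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

A caller would invoke this before spawning its own subprocess and, as the entry describes, proceed once the other process has finished or the timeout has elapsed.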

partition() now forwards strategy arg to partition_docx(), partition_pptx(), and their brokering partitioners for DOC, ODT, and PPT formats. A strategy argument passed to partition() (or the default value "auto" assigned by partition()) is now forwarded to partition_docx(), partition_pptx(), and their brokering partitioners when those filetypes are detected.

0.14.8

Enhancements

Move arm64 image to wolfi-base. The arm64 image now runs on wolfi-base. The arm64 build for wolfi-base does not yet include libreoffice, so arm64 does not currently support processing .doc, .ppt, or .xls files. If you need to process those files on arm64, use the legacy rockylinux image.

Features

Fixes

Bump unstructured-inference==0.7.36. Fix ValueError when converting cells to html.

partition() now forwards strategy arg to partition_docx(), partition_ppt(), and partition_pptx(). A strategy argument passed to partition() (or the default value "auto" assigned by partition()) is now forwarded to partition_docx(), partition_ppt(), and partition_pptx() when those filetypes are detected.

Fix missing sensitive field markers for embedders

0.14.7

Enhancements

Pull from wolfi-base image. The amd64 image now pulls from the unstructured wolfi-base image to avoid duplication of dependency setup steps.

Fix windows temp file. Make the creation of a temp file in unstructured/partition/pdf_image/ocr.py windows compatible.

Features

8-bit string Outlook MSG files are parsed. partition_msg() is now able to parse non-unicode Outlook MSG emails.

Attachments to Outlook MSG files are extracted intact. partition_msg() is now able to extract attachments without corruption.

0.14.4

Enhancements

Move logger error to debug level when PDFminer fails to extract text, which includes the error message for Invalid dictionary construct.

Add support for Pinecone serverless. Adds Pinecone serverless to the connector tests. Pinecone serverless works with versions >=0.14.2, but hadn't been tested until now.

Features

Allow configuration of the Google Vision API endpoint. Add an environment variable to select the Google Vision API in the US or the EU.

Fixes

Address the issue of unrecognized tables in UnstructuredTableTransformerModel. When a table is not recognized, the element.metadata.text_as_html attribute is set to an empty string.

Remove root handlers in ingest logger. Removes root handlers in ingest loggers to ensure secrets aren't accidentally exposed in Colab notebooks.

Fix V2 S3 Destination Connector authentication. Fixes bugs with S3 Destination Connector where the connection config was neither registered nor properly deserialized.

Clarified dependence on particular version of python-docx. Pinned python-docx version to ensure a particular method unstructured uses is included.

Ingest preserves original file extension. Ingest V2 introduced a change that dropped the original extension for upgraded connectors. This reverts that change.

0.14.3

Enhancements

Move category field from Text class to Element class.

partition_docx() now supports pluggable picture sub-partitioners. A subpartitioner that accepts a DOCX Paragraph and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.

Add VoyageAI embedder. Adds VoyageAI embeddings to support embedding via Voyage AI.

Features

Add form extraction basics (document elements and placeholder code in partition). This lays the groundwork for future work. Form extraction models are not currently available in the library. An attempt to use this functionality will raise a NotImplementedError.

Fixes

Add missing starting_page_num param to partition_image

Make the filename and file params for partition_image and partition_pdf match the other partitioners

Fix include_slide_notes and include_page_breaks params in partition_ppt

Re-apply: skip accuracy calculation feature. It had been overwritten by mistake.

Fix type hint for paragraph_grouper param. paragraph_grouper can be set to False, but the type hint did not reflect this previously.

Remove links param from partition_pdf. links is extracted during partitioning and is not needed as a parameter in partition_pdf.

Improve CSV delimiter detection. partition_csv() would raise on CSV files with very long lines.

Fix disk-space leak in partition_doc(). Remove temporary file created but not removed when file argument is passed to partition_doc().

Fix possible SyntaxError or SyntaxWarning on regex patterns. Change regex patterns to raw strings to avoid these warnings/errors in Python 3.11+.
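
For illustration, a raw-string pattern looks like the sketch below. The pattern and the helper `parse_numbered_item` are hypothetical examples, not patterns taken from the library.

```python
import re

# Without the r prefix, "\d" and "\s" are invalid string escapes that newer
# Python versions warn about. The raw string passes the backslashes through
# to the regex engine unchanged.
NUMBERED_ITEM = re.compile(r"^\s*(\d+)\.\s+(.*)$")


def parse_numbered_item(line: str):
    """Return (number, text) for lines like '3. Install pandoc', else None."""
    match = NUMBERED_ITEM.match(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```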

Fix disk-space leak in partition_odt(). Remove temporary file created but not removed when file argument is passed to partition_odt().

AstraDB: option to prevent indexing metadata

Fix Missing py.typed

0.13.7

Enhancements

Remove page_number metadata fields for HTML partition until we have a better strategy to decide page counting.

Extract OCRAgent.get_agent(). Generalize access to the configured OCRAgent instance beyond its use for PDFs.

Add calculation of table related metrics which take into account colspans and rowspans

Evaluation: skip accuracy calculation for files for which output and ground truth sizes differ greatly

Features

Partitioning raises on file-like object with .name not a local file path. When partitioning a file using the file= argument, and file is a file-like object (e.g. io.BytesIO) having a .name attribute, and the value of file.name is not a valid path to a file present on the local filesystem, FileNotFoundError is raised. This prevents use of the file.name attribute for downstream purposes to, for example, describe the source of a document retrieved from a network location via HTTP.

Fix SharePoint dates with inconsistent formatting. Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string.

Include warnings about the potential risk of installing a version of pandoc which does not support RTF files + instructions that will help resolve that issue.

Incorporate the install-pandoc Makefile recipe into relevant stages of CI workflow, ensuring it is a version that supports RTF input files.

Fix Google Drive source key. Allow passing string for source connector key.

Fix table structure evaluation calculations. Replaced special value -1.0 with np.nan and corrected row filtering of file metrics based on that.
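
A small sketch shows why a numeric sentinel such as -1.0 distorts aggregates while NaN can be filtered out cleanly. This sketch uses math.nan from the standard library as a stand-in for np.nan, and `mean_of_valid` and the sample scores are hypothetical.

```python
import math
from typing import List

MISSING = math.nan  # stand-in for np.nan in this stdlib-only sketch


def mean_of_valid(scores: List[float]) -> float:
    """Average only the scores that are not NaN; NaN if none are valid."""
    valid = [s for s in scores if not math.isnan(s)]
    if not valid:
        return math.nan
    return sum(valid) / len(valid)


# Hypothetical per-file scores where the middle file has no table metric.
scores_with_sentinel = [0.8, -1.0, 0.6]
scores_with_nan = [0.8, MISSING, 0.6]

# The sentinel is silently averaged in and drags the mean down.
naive_mean = sum(scores_with_sentinel) / len(scores_with_sentinel)
# NaN rows are dropped explicitly before averaging.
clean_mean = mean_of_valid(scores_with_nan)
```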

Fix Sharepoint-with-permissions test. Ignore permissions metadata, update test.

Fix table structure evaluations for edge case. Fixes the issue when the prediction does not contain any table; it no longer errors in that case.

0.11.7

Enhancements

Add intra-chunk overlap capability. Implement overlap for split-chunks where text-splitting is used to divide an oversized chunk into two or more chunks that fit in the chunking window. Note this capability is not yet available from the API but will shortly be made accessible using a new overlap kwarg on partition functions.
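
The idea of overlap between split chunks can be sketched as follows. This is a character-based simplification under assumptions: the name `split_with_overlap` and its parameters are hypothetical, and the library's text-splitting may well respect word boundaries, which this sketch does not.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int) -> List[str]:
    """Split `text` into windows of at most `max_characters`, where each window
    repeats the last `overlap` characters of the previous one."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")
    chunks: List[str] = []
    start = 0
    step = max_characters - overlap
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


pieces = split_with_overlap("abcdefghij", max_characters=4, overlap=1)
```

Each chunk after the first begins with the tail of its predecessor, so context that straddles a split point appears in both chunks.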

Update encoders to leverage dataclasses. All encoders now follow a class approach and are annotated with the dataclass decorator. Similar to the connectors, each uses a nested dataclass for the configs required to configure a client, as well as a field/property approach to cache the client. This makes sure any variable associated with the class exists as a dataclass field.

Features

Add Qdrant destination connector. Adds support for writing documents and embeddings into a Qdrant collection.

Store base64 encoded image data in metadata fields. Rather than saving to file, stores base64 encoded data of the image bytes and the mimetype for the image in metadata fields: image_base64 and image_mime_type (if that is what the user specifies by some other param like pdf_extract_to_payload). This would allow the API to have parity with the library.

Fixes

Fix table structure metric script. Update the call to table agent to now provide OCR tokens as required.

Fix element extraction not working when using "auto" strategy for pdf and image. If element extraction is specified, the "auto" strategy falls back to the "hi_res" strategy.

Fix a bug passing a custom url to partition_via_api. Users that self host the api were not able to pass their custom url to partition_via_api.

0.11.6

Enhancements

Update the layout analysis script. The previous script only supported annotating final elements. The updated script also supports annotating inferred and extracted elements.

AWS Marketplace API documentation: Added the user guide, including setting up VPC and CloudFormation, to deploy Unstructured API on AWS platform.

Azure Marketplace API documentation: Improved the user guide to deploy Azure Marketplace API by adding references to Azure documentation.

Integration documentation: Updated URLs for the staging_for bricks

Features

Partition emails with base64-encoded text. Automatically handles and decodes base64 encoded text in emails with content type text/plain and text/html.
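
The decoding step itself can be illustrated with Python's standard email package. This sketch only shows how a base64 text/plain body is decoded; it is not the partition_email implementation, and `decoded_body` and the sample message are made up for the example.

```python
from email import message_from_string

# A made-up message whose text/plain body is base64-encoded.
RAW_EMAIL = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Greeting\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "SGVsbG8sIHdvcmxkIQ==\n"
)


def decoded_body(raw: str) -> str:
    """Return the body of a single-part message with its transfer encoding removed."""
    msg = message_from_string(raw)
    # decode=True undoes the Content-Transfer-Encoding and returns bytes.
    payload = msg.get_payload(decode=True)
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset)
```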

Add Chroma destination connector. Chroma database connector added to ingest CLI. Users may now use unstructured-ingest to write partitioned/embedded data to a Chroma vector database.

Add Elasticsearch destination connector. Problem: After ingesting data from a source, users might want to move their data into a destination. Elasticsearch is a popular storage solution for various functionality such as search, or providing intermediary caches within data pipelines. Feature: Added Elasticsearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Elasticsearch.

Fixes

Enable --fields argument omission for elasticsearch connector. Solves two bugs where removing the optional parameter --fields broke the connector due to an integer processing error and using an elasticsearch config for a destination connector resulted in a serialization issue when optional parameter --fields was not provided.

Add hi_res_model_name. Adds the kwarg to relevant functions and adds comments that model_name is to be deprecated.

0.11.5

Features

Fixes

Removed ebooklib as a dependency. ebooklib is licensed under AGPL3, which is incompatible with the Apache 2.0 license. Thus it is being removed.

Caching fixes in ingest pipeline. Previously, steps like the source node were not leveraging parameters such as re_download to dictate if files should be forced to redownload rather than use what might already exist locally.

0.10.26

Enhancements

Add text CCT CI evaluation workflow. Adds cct text extraction evaluation metrics to the current ingest workflow to measure the performance of each file extracted as well as aggregated-level performance.

Features

Functionality to catch and classify overlapping/nested elements. Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the overlapping_elements, overlapping_case, overlapping_percentage, largest_ngram_percentage, overlap_percentage_total, max_area, min_area, and total_area.

Add Local connector source metadata. Python's os module is used to pull stats from the local file when processing via the local connector and to populate fields such as last modified time and created time.

Fixes

Fixes elements partitioned from an image file missing certain metadata. Metadata for image files, like file type, was being handled differently from other file types. This caused a bug where other metadata, like the file name, was being missed. This change brought metadata handling for image files to be more in line with the handling for other file types so that file name and other metadata fields are being captured.

Adds typing-extensions as an explicit dependency. This package is an implicit dependency, but the module is being imported directly in unstructured.documents.elements so the dependency should be explicit in case changes in other dependencies lead to typing-extensions being dropped as a dependency.

Stop passing extract_tables to unstructured-inference since it is now supported in unstructured instead. Table extraction previously occurred in unstructured-inference, but that logic, except for the table model itself, is now a part of the unstructured library. Thus the parameter triggering table extraction is no longer passed to the unstructured-inference package. Also noted the table output regression for PDF files.

Fix a bug in Table partitioning. Previously the skip_infer_table_types variable used in partition was not being passed down to specific file partitioners. Now you can utilize the skip_infer_table_types list variable when calling partition to specify the filetypes for which you want to skip table extraction, or the infer_table_structure boolean variable on the file specific partitioning function.

Fix partition docx without sections. Some docx files, like those from teams output, do not contain sections, and partitioning would produce no results because the code assumes all components are in sections. Now, if no sections are detected in a document, we iterate through the paragraphs and return the contents found in the paragraphs.

Fix out-of-order sequencing of split chunks. Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded max_characters.
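
The intended ordering can be sketched as follows: oversized sections are split where they occur, so the pieces stay in document order. This is a simplified character-based illustration, and `chunk_sections` is a hypothetical name rather than the chunker's real interface.

```python
from typing import List


def chunk_sections(sections: List[str], max_characters: int) -> List[str]:
    """Emit one chunk per section, splitting oversized sections in place."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    chunks: List[str] = []
    for section in sections:
        if len(section) <= max_characters:
            chunks.append(section)
        else:
            # Append the pieces at the current position instead of
            # inserting them at the front of the chunk sequence.
            for start in range(0, len(section), max_characters):
                chunks.append(section[start:start + max_characters])
    return chunks


ordered = chunk_sections(["1", "2", "3a3b", "4", "5a5b"], max_characters=2)
```

With sections 3 and 5 exceeding the limit, the result keeps the sequence 1, 2, 3a, 3b, 4, 5a, 5b.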

Deserialization of ingest docs fixed. When ingest docs are being deserialized as part of the ingest pipeline process (cli), there were certain fields that weren't getting persisted (metadata and date processed). The from_dict method was updated to take these into account and a unit test added to check.

Map source cli command configs when destination set. Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectly, causing the source connector to break. The configs were fixed and updated to take into account Fsspec-specific connectors.

0.10.25

Enhancements

Duplicate CLI param check. Given that many of the options associated with the Click based cli ingest commands are added dynamically from a number of configs, a check was incorporated to make sure there were no duplicate entries to prevent new configs from overwriting already added options.

Ingest CLI refactor for better code reuse. Much of the ingest cli code can be templated and was a copy-paste across files, adding potential risk. Code was refactored to use a base class which had much of the shared code templated.

Features

Cleans up temporary files after conversion. Previously a file conversion utility was leaving temporary files behind on the filesystem without removing them when no longer needed. This fix helps prevent an accumulation of temporary files taking up excessive disk space.

Fixes under_non_alpha_ratio dividing by zero. Although this function guarded against a specific cause of division by zero, there were edge cases slipping through like strings with only whitespace. This update more generally prevents the function from performing a division by zero.
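
A general guard of this kind can be sketched as below. This is not the library's under_non_alpha_ratio: the function `alpha_share_below`, its default threshold, and the choice to return False for empty input are all assumptions made for illustration.

```python
def alpha_share_below(text: str, threshold: float = 0.5) -> bool:
    """True if alphabetic characters make up less than `threshold` of the
    non-whitespace characters in `text`."""
    non_space = [ch for ch in text if not ch.isspace()]
    if not non_space:
        # Empty or whitespace-only input: there is nothing to measure,
        # and returning early means no division is attempted.
        return False
    alpha_count = sum(1 for ch in non_space if ch.isalpha())
    return alpha_count / len(non_space) < threshold
```

Checking the denominator itself, rather than one specific cause such as an empty string, covers whitespace-only strings as well.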

Fix languages default. Previously the default language was being set to English when elements didn't have text or if langdetect could not detect the language. It now defaults to None so there is not misleading information about the language detected.

Fixes recursion limit error that was being raised when partitioning Excel documents of a certain size. Previously we used a recursive method to find subtables within an excel sheet. However this would run afoul of Python's recursion depth limit when there was a contiguous block of more than 1000 cells within a sheet. This function has been updated to use the NetworkX library which avoids Python recursion issues.
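
To show why an iterative approach avoids the recursion limit, here is a stdlib-only sketch that groups contiguous cells with a queue instead of recursion. It is a stand-in for the NetworkX-based approach named above, not the library's code; `connected_blocks` and the 4-neighbor adjacency rule are assumptions.

```python
from collections import deque
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]  # (row, column) of a non-empty cell


def connected_blocks(cells: Iterable[Cell]) -> List[Set[Cell]]:
    """Group cells into blocks of horizontally or vertically adjacent cells."""
    remaining = set(cells)
    blocks: List[Set[Cell]] = []
    while remaining:
        seed = remaining.pop()
        block = {seed}
        queue = deque([seed])
        # Breadth-first search with an explicit queue: the call stack
        # stays flat regardless of how large the block is.
        while queue:
            row, col = queue.popleft()
            for neighbor in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    block.add(neighbor)
                    queue.append(neighbor)
        blocks.append(block)
    return blocks
```

A single column of several thousand contiguous cells is processed as one block without nesting any function calls.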

0.10.22

Enhancements

Fixes an issue that caused a partition error for some PDFs. Fixes GH Issue 1460 by bypassing a coordinate check if an element has invalid coordinates.

0.10.15

Enhancements

Support for better element categories from the next-generation image-to-text model ("chipper"). Previously, not all of the classifications from Chipper were being mapped to proper unstructured element categories so the consumer of the library would see many UncategorizedText elements. This fixes the issue, improving the granularity of the element categories outputs for better downstream processing and chunking. The mapping update is:

"Threading": NarrativeText

"Form": NarrativeText

"Field-Name": Title

"Value": NarrativeText

"Link": NarrativeText

"Headline": Title (with category_depth=1)

"Subheadline": Title (with category_depth=2)

"Abstract": NarrativeText

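The mapping listed above can be written as a lookup table. This is only a sketch of the listed pairs: the names `CHIPPER_TO_CATEGORY` and `map_label` are hypothetical, categories are represented as plain strings, and the UncategorizedText fallback for unlisted labels is an assumption.

```python
from typing import Dict, Optional, Tuple

# label -> (element category, category_depth or None)
CHIPPER_TO_CATEGORY: Dict[str, Tuple[str, Optional[int]]] = {
    "Threading": ("NarrativeText", None),
    "Form": ("NarrativeText", None),
    "Field-Name": ("Title", None),
    "Value": ("NarrativeText", None),
    "Link": ("NarrativeText", None),
    "Headline": ("Title", 1),
    "Subheadline": ("Title", 2),
    "Abstract": ("NarrativeText", None),
}


def map_label(label: str) -> Tuple[str, Optional[int]]:
    """Look up a label; unlisted labels fall back to UncategorizedText (assumed)."""
    return CHIPPER_TO_CATEGORY.get(label, ("UncategorizedText", None))
```
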
Better ListItem grouping for PDFs (fast strategy). The partition_pdf with fast strategy previously broke down some numbered list item lines as separate elements. This enhancement leverages the x,y coordinates and bbox sizes to help decide whether the following chunk of text is a continuation of the immediately preceding ListItem element, rather than detecting it as its own non-ListItem element.

Fall back to text-based classification for uncategorized Layout elements for Images and PDFs. Improves element classification by running existing text-based rules on previously UncategorizedText elements.

Adds table partitioning for many doc types including: .html, .epub, .md, .rst, .odt, and .msg. At the core of this change is the .html partition functionality, which is leveraged by the other affected doc types. This impacts many scenarios where Table Elements are now properly extracted.

Create and add add_chunking_strategy decorator to partition functions. Previously, users were responsible for their own chunking after partitioning elements, often required for downstream applications. Now, individual elements may be combined into right-sized chunks where min and max character size may be specified if chunking_strategy=by_title. Relevant elements are grouped together for better downstream results. This enables users to immediately use partitioned results effectively in downstream applications (e.g. RAG architecture apps) without any additional post-processing.

Adds languages as an input parameter and marks ocr_languages kwarg for deprecation in pdf, image, and auto partitioning functions. Previously, language information was only being used for Tesseract OCR for image-based documents and was in a Tesseract specific string format, but by refactoring into a list of standard language codes independent of Tesseract, the unstructured library will better support languages for other non-image pipelines and/or support for other OCR engines.

Removes UNSTRUCTURED_LANGUAGE env var usage and replaces language with languages as an input parameter to unstructured-partition-text_type functions. The previous parameter/input setup was not user-friendly or scalable to the variety of elements being processed. By refactoring the inputted language information into a list of standard language codes, we can support future applications of the element language such as detection, metadata, and multi-language elements. Now, to skip English specific checks, set the languages parameter to any non-English language(s).

Adds xlsx and xls filetype extensions to the skip_infer_table_types default list in partition. By adding these file types to the input parameter these files should not go through table extraction. Users can still specify if they would like to extract tables from these filetypes, but will have to set the skip_infer_table_types to exclude the desired filetype extension. This avoids mis-representing complex spreadsheets where there may be multiple sub-tables and other content.

Better debug output related to sentence counting internals. Clarify message when sentence is not counted toward sentence count because there aren't enough words, relevant for developers focused on unstructured's NLP internals.

Faster ocr_only speed for partitioning PDF and images. Use unstructured_pytesseract.run_and_get_multiple_output function to reduce the number of calls to tesseract by half when partitioning pdf or image with tesseract

Adds data source properties to fsspec connectors. These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.

Add delta table destination connector. New delta table destination connector added to ingest CLI. Users may now use unstructured-ingest to write partitioned data from over 20 data sources (so far) to a Delta Table.

Rename to Source and Destination Connectors in the Documentation. Maintain naming consistency between Connectors codebase and documentation with the first addition to a destination connector.

Non-HTML text files now return unstructured-elements as opposed to HTML-elements. Previously the text based files that went through partition_html would return HTML-elements but now we preserve the format from the input using source_format argument in the partition call.

Adds PaddleOCR as an optional alternative to Tesseract for OCR in processing of PDF or Image files, it is installable via the makefile command install-paddleocr. For experimental purposes only.

Bump unstructured-inference to 0.5.28. This version bump markedly improves the output of table data, rendered as metadata.text_as_html in an element. These changes include:

add env variable ENTIRE_PAGE_OCR to specify using paddle or tesseract on entire page OCR

table structure detection now pads the input image by 25 pixels in all 4 directions to improve its recall (0.5.27)

support paddle with both cpu and gpu and assume it is pre-installed (0.5.26)

fix a bug where cells_to_html doesn't handle cells spanning multiple rows properly (0.5.25)

remove cv2 preprocessing step before OCR step in table transformer (0.5.24)

partition_html breaks on <br> elements.

Ingest error handling to properly raise errors when wrapped

GH issue 1361: fixes a sorting error that prevented some PDFs from being parsed

Bump unstructured-inference

Brings back embedded images in PDF's (0.5.23)

0.10.12

Enhancements

Removed PIL pin as issue has been resolved upstream

Bump unstructured-inference

Support for yolox_quantized layout detection model (0.5.20)

YoloX element types added

Features

Fix a bug where mismatched elements and bboxes are passed into add_pytesseract_bbox_to_elements

0.10.9

Enhancements

Fix test_json to handle only non-extra dependencies file types (plain-text)

Update partition_html to respect the order of <pre> tags.

Fix bug in partition_pdf_or_image where two partitions were called if strategy == "ocr_only".

Bump unstructured-inference

Fix issue where temporary files were being left behind (0.5.16)

Adds deprecation warning for the file_filename kwarg to partition, partition_via_api, and partition_multiple_via_api.

Fix documentation build workflow by pinning dependencies

0.10.5

Enhancements

Create new CI Pipelines

Checking text, xml, email, and html doc tests against the library installed without extras

Checking each library extra against their respective tests

partition raises an error and tells the user to install the appropriate extra if a filetype is detected that is missing dependencies.

Add custom errors to ingest

Bump unstructured-inference==0.5.15

Handle an uncaught TesseractError (0.5.15)

Add TIFF test file and TIFF filetype to test_from_image_file in test_layout (0.5.14)

Use entire_page ocr mode for pdfs and images

Add notes on extra installs to docs

Adds ability to reuse connections per process in unstructured-ingest

Features

Add delta table connector

Fixes

0.10.4

Pass ocr_mode in partition_pdf and set the default back to individual pages for now

Add diagrams and descriptions for ingest design in the ingest README

Features

Supports multipage TIFF image partitioning

Fixes

0.10.2

Enhancements

Bump unstructured-inference==0.5.13:

Fix extracted image elements being included in layout merge, addresses the issue where an entire-page image in a PDF was not passed to the layout model when using hi_res.

Features

Fixes

0.10.1

Enhancements

Bump unstructured-inference==0.5.12:

fix to avoid trace for certain PDF's (0.5.12)

better defaults for DPI for hi_res and Chipper (0.5.11)

implement full-page OCR (0.5.10)

Features

Fixes

Fix dead links in repository README (Quick Start > Install for local development, and Learn more > Batch Processing)

Update document dependencies to include tesseract-lang for additional language support (required for tests to pass)

0.10.0

Enhancements

Add include_header kwarg to partition_xlsx and change default behavior to True

Update the links and emphasized_texts metadata fields

Features

Fixes

0.9.3

Enhancements

Pinned dependency cleanup.

Update partition_csv to always use soupparser_fromstring to parse html text

Update partition_tsv to always use soupparser_fromstring to parse html text

Add metadata.section to capture epub table of contents data

Add unique_element_ids kwarg to partition functions. If True, will use a UUID for element IDs instead of a SHA-256 hash.
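
The difference between the two ID schemes can be sketched as follows. The helper `element_id` is hypothetical, and what exactly the library hashes (and whether it shortens the digest) is not shown here; the sketch only contrasts a deterministic content hash with a random UUID.

```python
import hashlib
import uuid


def element_id(text: str, unique_element_ids: bool = False) -> str:
    """Return a random UUID when unique IDs are requested, otherwise a
    SHA-256 hex digest of the text (assumed hash input)."""
    if unique_element_ids:
        return str(uuid.uuid4())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Hash-based IDs are stable across runs but collide for elements with identical text; UUIDs are unique per element but differ on every run.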

Update partition_xlsx to always use soupparser_fromstring to parse html text

Add functionality to switch html text parser based on whether the html text contains emoji

Add functionality to check if a string contains any emoji characters

Add CI tests around Notion

Features

Remove unused _partition_via_api function

Fixed emoji bug in partition_xlsx.

Pass file_filename metadata when partitioning file object

Skip ingest test on missing Slack token

Add Dropbox variables to CI environments

Remove default encoding for ingest

Adds new element type EmailAddress for recognising email address in the text

Simplifies min_partition logic; makes partitions falling below the min_partition less likely.

Fix bug where ingest test check for number of files fails in smoke test

Fix unstructured-ingest entrypoint failure

0.9.0

Enhancements

Dependencies are now split by document type, creating a slimmer base installation.

0.8.8

Enhancements

Features

Adds Outlook connector

Add support for dpi parameter in inference library

Adds Onedrive connector.

Add Confluence connector for ingest cli to pull the body text from all documents from all spaces in a confluence domain.

Fixes

Fixes issue with email partitioning where From field was being assigned the To field value.

Use the image_metadata property of the PageLayout instance to get the page image info in the document_to_element_list

Add functionality to write images to computer storage temporarily instead of keeping them in memory for ocr_only strategy

Add functionality to convert a PDF in small chunks of pages at a time for ocr_only strategy

Adds .txt, .text, and .tab to list of extensions to check if file has a text/plain MIME type.

Enables filters to be passed to partition_doc so it doesn't error with LibreOffice7.

Removed old error message that's superseded by requires_dependencies.

Removes using hi_res as the default strategy value for partition_via_api and partition_multiple_via_api

0.8.1

Enhancements

Add support for Python 3.11

Features

Fixes

Fixed auto strategy detecting a scanned document as having extractable text and using the fast strategy, resulting in no output.

Fix list detection in MS Word documents.

Don't instantiate an element with a coordinate system when there isn't a way to get its location data.

0.8.0

Enhancements

Allow model used for hi res pdf partition strategy to be chosen when called.

Updated inference package

Features

Add metadata_filename parameter across all partition functions

Fixes

Update to ensure convert_to_dataframe grabs all of the metadata fields.

Adjust encoding recognition threshold value in detect_file_encoding

Fix KeyError when isd_to_elements doesn't find a type

Fix _output_filename for local connector, allowing single files to be written correctly to the disk

Fix for cases where an invalid encoding is extracted from an email header.

BREAKING CHANGES

Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the coordinates attribute of the element's metadata.
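For readers migrating code, the change in access pattern can be sketched as follows. This is a minimal illustration with stand-in dataclasses, not the library's own classes: `Element`, `Metadata`, and the tuple-of-points shape used for `coordinates` are simplified assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Metadata:
    # Stand-in for the element metadata object; only the relevant field is modeled.
    coordinates: Optional[Tuple[Point, ...]] = None


@dataclass
class Element:
    # Stand-in element: location data is no longer a top-level attribute.
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def get_coordinates(element: Element) -> Optional[Tuple[Point, ...]]:
    """Read location data from the metadata, where it now lives."""
    return element.metadata.coordinates


element = Element(
    "Hello",
    Metadata(coordinates=((0.0, 0.0), (0.0, 10.0), (50.0, 10.0), (50.0, 0.0))),
)
```

Code that previously read location data directly from the element would read it through `element.metadata.coordinates` instead.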

0.7.12

Enhancements

Adds include_metadata kwarg to partition_doc, partition_docx, partition_email, partition_epub, partition_json, partition_msg, partition_odt, partition_org, partition_pdf, partition_ppt, partition_pptx, partition_rst, and partition_rtf

Features

Add Elasticsearch connector for ingest cli to pull specific fields from all documents in an index.

Adds Dropbox connector

Fixes

Fix tests that call unstructured-api by passing through an api-key

Fixed page breaks being given (incorrect) page numbers

Fix skipping download on ingest when a source document exists locally

0.7.11

Enhancements

More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)

Make large model available (from unstructured-inference bump to 0.5.3)

Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)

partition_email and partition_msg will now process attachments if process_attachments=True and an attachment partitioning function is passed through with attachment_partitioner=partition.

Features

Fixes

Fix tests that call unstructured-api by passing through an api-key

Fixed page breaks being given (incorrect) page numbers

Fix skipping download on ingest when a source document exists locally

0.7.10

Enhancements

Adds a max_partition parameter to partition_text, partition_pdf, partition_email, partition_msg and partition_xml that sets a limit on the size of individual document elements. Defaults to 1500 for everything except partition_xml, which has a default value of None.

DRY connector refactor
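The effect of a size limit such as max_partition can be sketched roughly as below. This is not the library's implementation: `split_to_max_size` is a hypothetical helper, and it assumes the limit is measured in characters and that splits fall on whitespace.

```python
from typing import List, Optional


def split_to_max_size(text: str, max_partition: Optional[int] = 1500) -> List[str]:
    """Split text into whitespace-delimited pieces of at most max_partition characters.

    None disables splitting. A single word longer than the limit is kept whole.
    """
    if max_partition is None or len(text) <= max_partition:
        return [text]
    pieces: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_partition and current:
            # Adding this word would exceed the limit, so close the current piece.
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Passing `max_partition=None` mirrors the partition_xml default described above, where no limit is applied.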

Features

hi_res model for pdfs and images is selectable via environment variable.

Fixes

CSV check now ignores escaped commas.

Fix for filetype exploration util when file content does not have a comma.

Adds negative lookahead to bullet pattern to avoid detecting plain text line breaks like ------- as list items.

Fix pre tag parsing for partition_html

Fix lookup error for annotated Arabic and Hebrew encodings
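The negative-lookahead fix for the bullet pattern can be illustrated with a small regular expression. The pattern below is an illustrative assumption, not the pattern used by the library, and `looks_like_list_item` is a hypothetical helper.

```python
import re

# Illustrative pattern only: a single bullet character that is NOT immediately
# followed by another bullet character, then the start of the item text.
BULLET_RE = re.compile(r"^\s*[-*\u2022](?![-*\u2022])\s*\S")


def looks_like_list_item(line: str) -> bool:
    """Return True for lines such as '- item', False for rules such as '-------'."""
    return BULLET_RE.match(line) is not None
```

Without the `(?!...)` lookahead, a separator line made of dashes would match as a bullet followed by text.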

Features

Add stage_for_weaviate to stage unstructured outputs for upload to Weaviate, along with a helper function for defining a class to use in Weaviate schemas.

Builds from the Unstructured base image, which is built off of Rocky Linux 8.7. This resolves almost all CVEs in the image.

Fixes

0.7.0

Enhancements

Installing detectron2 from source is no longer required when using the local-inference extra.

Updates .pptx parsing to include text in tables.

Features

Features

Fixes

Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.

Adds additional MIME types for CSV
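The encoding fallback described above can be sketched as follows. This is a minimal illustration, not the library's code: `decode_with_fallback` and the list of fallback encodings are assumptions chosen for the example.

```python
from typing import Optional, Sequence, Tuple

# Assumed fallback order for illustration; latin-1 accepts any byte sequence.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    fallbacks: Sequence[str] = COMMON_ENCODINGS,
) -> Tuple[str, str]:
    """Decode bytes, trying common encodings only when none was specified.

    Returns the decoded text and the encoding that worked.
    """
    if encoding is not None:
        # The user chose an encoding, so errors propagate without any fallback.
        return data.decode(encoding), encoding
    for candidate in fallbacks:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode data with any of {tuple(fallbacks)}")
```

The key design point is that the fallback loop runs only when the caller has not specified an encoding, so an explicit choice is never silently overridden.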

0.6.8

Enhancements

Features

Add partition_csv for CSV files.

Fixes

0.6.7

Enhancements

Deprecate --s3-url in favor of --remote-url in CLI

Refactor out non-connector-specific config variables

Add file_directory to metadata

Add page_name to metadata. Currently used for the sheet name in XLSX documents.

Added a --partition-strategy parameter to unstructured-ingest so that users can specify partition strategy in CLI. For example, --partition-strategy fast.

Added metadata for filetype.

Add Discord connector to pull messages from a list of channels

Refactor unstructured/file-utils/filetype.py to better utilise hashmap to return mime type.

Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for test_filetype.py.

Features

0.6.3

Enhancements

Add an "ocr_only" strategy for partition_image.

Features

FileNotFound error when filename is provided but file is not on disk

0.5.9

Enhancements

Features

Fixes

Convert file to str in helper split_by_paragraph for partition_text

0.5.8

Enhancements

Update elements_to_json to return string when filename is not specified

elements_from_json may take a string instead of a filename with the text kwarg

detect_filetype now does a final fallback to file extension.

Empty tags are now skipped during the depth check for HTML processing.
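The final fallback to the file extension in detect_filetype can be pictured with a sketch like the one below. Everything here is hypothetical: `guess_type`, the magic-prefix table, and the extension table are simplified stand-ins rather than the library's detection logic.

```python
import os
from typing import Optional

# Small illustrative tables; a real detector would know many more types.
MAGIC_PREFIXES = {b"%PDF-": "application/pdf"}
EXTENSION_TYPES = {".txt": "text/plain", ".csv": "text/csv", ".pdf": "application/pdf"}


def guess_type(filename: str, head: bytes = b"") -> Optional[str]:
    """Prefer content-based detection, then fall back to the file extension."""
    for prefix, mime in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return mime
    # Final fallback: nothing in the content was recognized.
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_TYPES.get(extension)
```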

Features

Add local file system to unstructured-ingest

Add --max-docs parameter to unstructured-ingest

Added partition_msg for processing MSFT Outlook .msg files.

Fixes

convert_file_to_text now passes through the source_format and target_format kwargs. Previously they were hard coded.

Partitioning functions that accept a text kwarg no longer raise an error if an empty string is passed (an empty list of elements is returned instead).

partition_json no longer fails if the input is an empty list.

Fixed bug in chunk_by_attention_window that caused the last word in segments to be cut off in some cases.

BREAKING CHANGES

stage_for_transformers now returns a list of elements, making it consistent with other staging bricks

0.5.7

Enhancements

Refactored codebase using exactly_one

Adds ability to pass headers when passing a url in partition_html()

Added optional content_type and file_filename parameters to partition() to bypass file detection

Features

Add --flatten-metadata parameter to unstructured-ingest

Add --fields-include parameter to unstructured-ingest

0.5.2

Enhancements

Fully move from printing to logging.

unstructured-ingest now uses a default --download_dir of $HOME/.cache/unstructured/ingest rather than a "tmp-ingest-" dir in the working directory.

Features

Fixes

setup_ubuntu.sh no longer fails in some contexts by interpreting DEBIAN_FRONTEND=noninteractive as a command

unstructured-ingest no longer re-downloads files when --preserve-downloads is used without --download-dir.

Fixed an issue that was causing text to be skipped in some HTML documents.

0.5.1

Fixes

Update to ensure all elements are preserved during serialization/deserialization

0.4.14

Automatically install nltk models in the tokenize module.

0.4.13

Fixes unstructured-ingest cli.

0.4.12

Adds console_entrypoint for unstructured-ingest, other structure/doc updates related to ingest.

Add parser parameter to partition_html.

0.4.11

Adds partition_doc for partitioning Word documents in .doc format. Requires libreoffice.

Adds partition_ppt for partitioning PowerPoint documents in .ppt format. Requires libreoffice.

0.4.10

Fixes ElementMetadata so that it's JSON serializable when the filename is a Path object.

0.4.9

Added ingest modules and s3 connector, sample ingest script

Default to url=None for partition_pdf and partition_image

Add ability to skip English specific check by setting the UNSTRUCTURED_LANGUAGE env var to "".

Document Element objects now track metadata

0.4.8

Modified XML and HTML parsers not to load comments.

0.4.7

Added the ability to pull an HTML document from a url in partition_html.

Added the ability to get file summary info from lists of filenames and lists of file contents.

Added optional page break to partition for .pptx, .pdf, images, and .html files.

Added to_dict method to document elements.

Include more unicode quotes in replace_unicode_quotes.

0.4.6

Loosen the default cap threshold to 0.5.

Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling the cap ratio threshold.

Unknown text elements are identified as Text for HTML and plain text documents.

Body Text styles no longer default to NarrativeText for Word documents. The style information is insufficient to determine that the text is narrative.

Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.

Adds an Address element for capturing elements that only contain an address.

Suppress the UserWarning when detectron is called.

Checks that titles and narrative text have at least one English word.

Checks that titles and narrative text are at least 50% alpha characters.

Restricts titles to a maximum word length. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH environment variable for controlling the max number of words in a title.

Updated partition_pptx to order the elements on the page
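The text heuristics listed for this release (cap ratio threshold, 50% alpha characters, maximum title word count) can be sketched as below. These helpers are illustrative assumptions, not the library's functions: the exact counting rules are guesses, and the default of 12 words is a placeholder that does not come from this changelog. Only the environment variable name is taken from the entry above.

```python
import os


def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    """True when the share of words starting with an uppercase letter is above threshold."""
    words = [w for w in text.split() if w[0].isalpha()]
    if not words:
        return False
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) > threshold


def has_enough_alpha(text: str, threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the non-whitespace characters are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    alpha = sum(1 for c in chars if c.isalpha())
    return alpha / len(chars) >= threshold


def title_max_word_length(default: int = 12) -> int:
    """Read the title word limit from the environment (placeholder default)."""
    return int(os.environ.get("UNSTRUCTURED_TITLE_MAX_WORD_LENGTH", default))


def could_be_title(text: str) -> bool:
    """Combine the word-count limit and the alpha check."""
    return len(text.split()) <= title_max_word_length() and has_enough_alpha(text)
```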

0.4.4

Updated partition_pdf and partition_image to return unstructured Element objects

Fixed the healthcheck url path when partitioning images and PDFs via API

Adds an optional coordinates attribute to document objects

Adds FigureCaption and CheckBox document elements

Added ability to split lists detected in LayoutElement objects

Adds partition_pptx for partitioning PowerPoint documents

LayoutParser models now download from Hugging Face Hub instead of Dropbox

Fixed file type detection for XML and HTML files on Amazon Linux

0.4.3

Adds requests as a base dependency

Fix in exceeds_cap_ratio so the function doesn't break with empty text

Fix bug in _parse_received_data.

Update detect_filetype to properly handle .doc, .xls, and .ppt.

0.4.2

Added partition_image to process documents in an image format.

Fixed utf-8 encoding error in partition_email with attachments for text/html

0.4.1

Added support for text files in the partition function

Pinned opencv-python for easier installation on Linux

0.4.0

Added generic partition brick that detects the file type and routes a file to the appropriate partitioning brick.

Added a file type detection module.

Updated partition_html and partition_eml to support file-like objects in 'rb' mode.

Cleaning brick for removing ordered bullets clean_ordered_bullets.

Extract brick method for ordered bullets extract_ordered_bullets.

Test for clean_ordered_bullets.

Test for extract_ordered_bullets.

Added partition_docx for pre-processing Word Documents.

Added new REGEX patterns to extract email header information

Added new functions to extract header information parse_received_data and partition_header

Added new function to parse plain text files partition_text

Added new cleaners functions extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz

Add new Image element and function to find embedded images find_embedded_images

Added get_directory_file_info for summarizing information about source documents

0.3.5

Add support for local inference

Add new pattern to recognize plain text dash bullets

Add test for bullet patterns

Fix for partition_html that allows for processing div tags that have both text and child elements

Add ability to extract document metadata from .docx, .xlsx, and .jpg files.

Helper functions for identifying and extracting phone numbers

Add new function extract_attachment_info that extracts and decodes the attachment of an email.

Staging brick to convert a list of Elements to a pandas dataframe.

Add plain text functionality to partition_email

0.3.4

Python-3.7 compat

0.3.3

Removes BasicConfig from logger configuration

Adds the partition_email partitioning brick

Adds the replace_mime_encodings cleaning bricks

Small fix to HTML parsing related to processing list items with sub-tags

Add EmailElement data structure to store email documents

0.3.2

Added translate_text brick for translating text between languages

Add an apply method to make it easier to apply cleaners to elements

0.3.1

Added __init__.py to partition

0.3.0

Implement staging brick for Argilla. Converts lists of Text elements to argilla dataset classes.

Removing the local PDF parsing code and any dependencies and tests.

Reorganizes the staging bricks in the unstructured.partition module

Allow entities to be passed into the Datasaur staging brick

Added HTML escapes to the replace_unicode_quotes brick

Fix bad responses in partition_pdf to raise ValueError

Adds partition_html for partitioning HTML documents.

0.2.6

Small change to how _read is placed within the inheritance structure since it doesn't really apply to pdf

Add partitioning brick for calling the document image analysis API

0.2.5

Update python requirement to >=3.7

0.2.4

Add alternative way of importing Final to support google colab

0.2.3

Add cleaning bricks for removing prefixes and postfixes

Add cleaning bricks for extracting text before and after a pattern

0.2.2

Add staging brick for Datasaur

0.2.1

Added brick to convert an ISD dictionary to a list of elements

Update PDFDocument to use the from_file method

Added staging brick for CSV format for ISD (Initial Structured Data) format.

Added staging brick for separating text into attention window size chunks for transformers.

Added staging brick for LabelBox.

Added ability to upload LabelStudio predictions

Added utility function for JSONL reading and writing

Added staging brick for CSV format for Prodigy

Added staging brick for Prodigy

Added ability to upload LabelStudio annotations

Added text_field and id_field to stage_for_label_studio signature

0.2.0

Initial release of unstructured
