Based on the Unstructured.io project. Follow the instructions at https://docs.unstructured.io/open-source/introduction/overview to set up the Unstructured environment. The Python dependencies should be installed from the requirements.txt file.
The Unstructured File Processor is a Python project designed to process a variety of file types, extract their content, and convert them into structured formats. It handles multiple file formats, including:
- Text files (.txt)
- CSV and TSV files (.csv, .tsv)
- Microsoft Office documents (.doc, .docx, .ppt, .pptx, .xlsx)
- PDF files (.pdf)
- Image files (.png, .jpg, .jpeg, .tiff, .bmp, .heic)
- HTML files (.html)
- Markdown files (.md)
- Rich Text Format files (.rtf)
- And many others
The extracted content can be saved in plain text, JSON, and annotated text formats, providing flexibility for downstream processing and analysis.
- Prerequisites
- Setup Instructions
- Usage
- Project Structure
- Detailed Description
- Troubleshooting
- Contributing
- License
- Operating System: macOS, Linux, or Windows
- Python Version: Python 3.8 or higher
All the required Python packages are listed in requirements.txt. These can be installed using pip.
Some file types require additional system dependencies, which can be installed using Homebrew (for macOS and Linux):
- Tesseract OCR: Used for optical character recognition (OCR) on images and PDFs.
- Poppler: Provides pdftotext, pdftohtml, and other utilities for PDF processing.
- libmagic: Used by python-magic for file type detection.
git clone https://github.com/j-chacko/Unstructured.git
cd Unstructured
It's recommended to use a virtual environment to manage dependencies.
python3 -m venv venv
Activate the virtual environment:
- macOS/Linux:
source venv/bin/activate
- Windows:
venv\Scripts\activate
Upgrade pip and install the required packages:
pip install --upgrade pip
pip install -r requirements.txt
macOS/Linux:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install tesseract poppler libmagic
Note for Windows Users:
- Install Tesseract OCR for Windows.
- Install Poppler for Windows and add it to your PATH.
Create a .env file in the project root directory:
touch .env
Then add the following variables to the .env file:
LOCAL_FILE_INPUT_DIR=/path/to/input
LOCAL_FILE_OUTPUT_DIR=/path/to/output
LOG_DIR=/path/to/logs
LOG_LEVEL=INFO
- LOCAL_FILE_INPUT_DIR: Absolute path to the input directory containing files to process.
- LOCAL_FILE_OUTPUT_DIR: Absolute path to the output directory where processed files will be saved.
- LOG_DIR: Absolute path to the directory where logs will be stored.
- LOG_LEVEL: Logging level (e.g., INFO, DEBUG, WARNING).
Example:
LOCAL_FILE_INPUT_DIR=/Users/username/Documents/unstructured/input
LOCAL_FILE_OUTPUT_DIR=/Users/username/Documents/unstructured/output
LOG_DIR=/Users/username/Documents/unstructured/logs
LOG_LEVEL=INFO
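The variables above can be read in several ways. The following is a minimal sketch that uses only the standard library; it is not necessarily how main.py loads its configuration. The names parse_env_text and load_settings are hypothetical, and the rule that shell exports take precedence over .env values is an assumption.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_path=".env"):
    """Merge .env values with the process environment (assumed: exports win)."""
    path = Path(env_path)
    file_values = parse_env_text(path.read_text()) if path.is_file() else {}
    names = (
        "LOCAL_FILE_INPUT_DIR",
        "LOCAL_FILE_OUTPUT_DIR",
        "LOG_DIR",
        "LOG_LEVEL",
    )
    settings = {name: os.environ.get(name, file_values.get(name)) for name in names}
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return settings
```

Calling load_settings() from the project root returns a dictionary with the four settings, or raises an error that names whichever ones are missing.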
Ensure that the input, output, and log directories specified in your .env file exist:
mkdir -p /path/to/input
mkdir -p /path/to/output
mkdir -p /path/to/logs
The main.py script processes all supported file types in the input directory.
Command:
python main.py
What It Does:
- Scans the input directory recursively.
- Identifies supported file types.
- Processes each file using the appropriate processing script.
- Saves outputs in the specified output directory.
- Logs processing details and errors.
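The steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual contents of main.py: the HANDLERS table, process_text, find_supported_files, and process_all are hypothetical names, and only two extensions are registered to keep the example short.

```python
from pathlib import Path


def process_text(path):
    """Hypothetical stand-in for a per-type script such as process_text.py."""
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical dispatch table: lowercase extension -> processing function.
HANDLERS = {
    ".txt": process_text,
    ".md": process_text,
}


def find_supported_files(input_dir):
    """Recursively collect files whose extension has a registered handler."""
    return sorted(
        path
        for path in Path(input_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in HANDLERS
    )


def process_all(input_dir):
    """Run the matching handler on each file; collect results and failures."""
    results, failures = {}, {}
    for path in find_supported_files(input_dir):
        handler = HANDLERS[path.suffix.lower()]
        try:
            results[path] = handler(path)
        except Exception as exc:  # keep going so one bad file does not stop the run
            failures[path] = str(exc)
    return results, failures
```

A dictionary keyed by extension keeps the per-type logic in separate functions, which matches the one-script-per-format layout of the project.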
Each file type has a dedicated processing script in the project directory (e.g., process_pdf.py, process_image.py).
Example: Processing PDFs
python process_pdf.py
Note:
- Ensure the environment variables are set, either via the .env file or by exporting them in your shell.
- Individual scripts can be useful for debugging or processing specific file types.
Unstructured/
├── main.py
├── .env
├── .gitignore
├── requirements.txt
├── utils.py
├── process_csv.py
├── process_doc_and_docx.py
├── process_email.py
├── process_epub.py
├── process_html.py
├── process_image.py
├── process_md.py
├── process_msg.py
├── process_odt.py
├── process_org.py
├── process_pdf.py
├── process_ppt_and_pptx.py
├── process_rst.py
├── process_rtf.py
├── process_text.py
├── process_tsv.py
├── process_xlsx.py
├── process_xml.py
├── input/
├── output/
└── logs/
└── ...
- main.py: Entry point script to process all supported file types.
- utils.py: Contains utility functions for logging and directory management.
- process_*.py: Scripts dedicated to processing specific file types.
- logs/: Directory where log files are stored.
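To illustrate the kind of logging helper that utils.py is described as providing, here is a minimal sketch built on the standard logging module. The function name setup_logging, the logger name file_processor, and the file name processing.log are assumptions made for this example and may differ from the real code.

```python
import logging
from pathlib import Path


def setup_logging(log_dir, level_name="INFO", logger_name="file_processor"):
    """Create the log directory and return a logger writing to file and console."""
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)

    # Fall back to INFO when LOG_LEVEL is not a recognised level name.
    level = getattr(logging, str(level_name).upper(), None)
    if not isinstance(level, int):
        level = logging.INFO

    logger = logging.getLogger(logger_name)
    logger.setLevel(level)

    # Attach handlers only once so repeated calls do not duplicate log lines.
    if not logger.handlers:
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        file_handler = logging.FileHandler(
            log_path / "processing.log", encoding="utf-8"
        )
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger
```

A call such as setup_logging(os.environ["LOG_DIR"], os.environ.get("LOG_LEVEL", "INFO")) would connect it to the settings in the .env file.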
The goal of the Unstructured File Processor is to automate the extraction and conversion of unstructured data from various file formats into structured formats suitable for data analysis, machine learning, or archival purposes.
Features:
- Multi-format Support: Handles a wide range of file types.
- Recursive Processing: Processes files in subdirectories.
- Error Handling: Logs errors and unsupported files for review.
- Customizable Logging: Configurable logging levels and outputs.
- Modular Scripts: Individual processing scripts for flexibility.
- Initialization:
  - Loads environment variables.
  - Sets up logging configuration.
  - Ensures the input, output, and log directories exist.
- Processing Workflow:
  - Scans the input directory for files.
  - Filters files by supported extensions.
  - For each file:
    - Calls the appropriate processing function.
    - Extracts content using specialized libraries (e.g., unstructured, pytesseract).
    - Saves output in multiple formats.
- Output Formats:
  - Plain Text (.txt): Extracted text content.
  - JSON (.json): Structured data including metadata.
  - Annotated Text (_annotated.txt): Text with annotations indicating element types.
- Logging and Error Handling:
  - Logs processing steps and any errors encountered.
  - Creates log files in the logs directory.
  - Logs unsupported or failed files for manual review.
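The three output formats described above could be written along these lines. This is a sketch that assumes each extracted element is a plain dictionary with type and text keys; the save_outputs name and the exact [Type] annotation layout are hypothetical, and the project's real annotated format may differ.

```python
import json
from pathlib import Path


def save_outputs(elements, source_name, output_dir):
    """Write .txt, .json, and _annotated.txt files for one processed document."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(source_name).stem

    # Plain text: element texts separated by blank lines.
    text_path = out / f"{stem}.txt"
    text_path.write_text(
        "\n\n".join(element["text"] for element in elements), encoding="utf-8"
    )

    # JSON: the full element dictionaries, including any metadata they carry.
    json_path = out / f"{stem}.json"
    json_path.write_text(
        json.dumps(elements, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Annotated text: each line prefixed with its element type (assumed layout).
    annotated_path = out / f"{stem}_annotated.txt"
    annotated_path.write_text(
        "\n".join(f"[{element['type']}] {element['text']}" for element in elements),
        encoding="utf-8",
    )
    return text_path, json_path, annotated_path
```

For an input named report.pdf, this writes report.txt, report.json, and report_annotated.txt into the output directory.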
- No Content Extracted:
  - Check if the file is corrupt or contains unsupported content.
  - Review the logs for warnings.
- Errors During Processing:
  - Ensure all dependencies are installed.
  - Verify that environment variables are correctly set.
  - Check that required system packages (e.g., Tesseract, Poppler) are installed.
- Logging Issues:
  - Confirm that the LOG_DIR exists and is writable.
  - Adjust LOG_LEVEL in the .env file for more detailed logs.
- Permission Errors:
  - Ensure you have read permissions for input files and write permissions for output directories.
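Several of the checks above can be automated before a run. The sketch below is one possible pre-flight check, not part of the project: check_environment is a hypothetical helper, and it assumes the Tesseract and Poppler executables are named tesseract and pdftotext on your PATH.

```python
import os
import shutil


def check_environment(env=None):
    """Return a list of readable problems; an empty list means none were found."""
    env = os.environ if env is None else env
    problems = []

    # System tools named in the prerequisites: Tesseract and Poppler's pdftotext.
    for tool in ("tesseract", "pdftotext"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} was not found on PATH")

    # Directory variables and the access each one needs.
    required = {
        "LOCAL_FILE_INPUT_DIR": os.R_OK,
        "LOCAL_FILE_OUTPUT_DIR": os.W_OK,
        "LOG_DIR": os.W_OK,
    }
    for name, mode in required.items():
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} does not point to a directory: {value}")
        elif not os.access(value, mode):
            problems.append(f"{name} lacks the required permissions: {value}")
    return problems
```

Printing each entry of check_environment() before running main.py gives a quick view of missing tools, unset variables, and permission problems.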
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch:
git checkout -b feature/your-feature-name
- Make your changes and commit them:
git commit -m "Description of your feature"
- Push to your forked repository:
git push origin feature/your-feature-name
- Create a pull request detailing your changes.
This project is licensed under the MIT License.
Note: Replace any placeholder paths (e.g., /path/to/input) with actual paths relevant to your environment.