This Go-based project provides a microservice that offers a REST API and a web view to convert PDFs and images to text using the Tesseract OCR engine.
Just a proof of concept at this point. In future development it will be split into a multi-tier application architecture for better scalability, again for instructional purposes.
docker build -t ocr-tesseract .
docker run --privileged=true -d -t -i \
-p 8080:80 \
-e UPLOADED_FILES_DIR='/tmp/pdf-cache' \
-v /tmp/pdf-cache:/tmp/pdf-cache ocr-tesseract
The service provides some minimal web views for trying out its features.
http://localhost:8080/web/pdf
http://localhost:8080/web/img
http://localhost:8080/api/upload/pdf
http://localhost:8080/api/upload/img
This project uses the following SDKs:
- Tesseract OCR: OCR engine
- Ghostscript: PDF interpreter used to convert a PDF into a set of images (one per page)
(C) 2019