Named entity extraction from financial documents with OpenCV, PyTesseract, spaCy (OCR + NER)
Image preprocessing steps: 1. Binarization 2. Rescaling 3. Dilation
Text detection levels: 1. Single character 2. Word 3. Line
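A minimal preprocessing sketch with OpenCV, assuming a grayscale business-card scan; the file names, scale factor, and kernel size are illustrative choices, not values from the project:

```python
# Hypothetical preprocessing sketch (file names, scale factor, kernel size are placeholders).
import cv2
import numpy as np

img = cv2.imread("business_card.jpg", cv2.IMREAD_GRAYSCALE)

# Rescaling: upscale small scans so Tesseract sees enough pixels per glyph.
img = cv2.resize(img, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

# Binarization: Otsu's threshold separates ink from background.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Dilation: thicken strokes so characters merge into word- or line-level blobs,
# which helps when locating text regions at word or line level.
kernel = np.ones((3, 3), np.uint8)
dilated = cv2.dilate(cv2.bitwise_not(binary), kernel, iterations=1)

cv2.imwrite("preprocessed.png", cv2.bitwise_not(dilated))
```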
This was the most time-consuming task, taking more than 10 hours a day over several weeks.
Learning
-- Collecting good data in real life is not a cakewalk
Eyeballing the scanned results of one of the most common and easy inputs you can get; in real life, the input can be anything in the range of crazy to very crazy.
You are observing the NER predictions on the scanned results of the above business card.
Finding the organisation and name is still a bit difficult. Clearly, I have to increase the business card data from 3,000+ cards to maybe 10,000+, and in parallel I need to refine my approach a bit more.
Develop a customized Named Entity Recognizer to extract entities from scanned document images such as:
- Invoice
- Business Card [my focus] || extract entities such as Name, Phone, Email, Organisation, and Website link
- Shipping Bill, etc.
Computer Vision modules were used to (see the sketch after this list):
- scan the document
- identify the location of the text
- extract text from the image
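A sketch of how text location and extraction could look with pytesseract; the input path and the confidence threshold are assumptions, not the project's actual values:

```python
# Locate words (bounding boxes) and extract text with pytesseract.
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("preprocessed.png")

# image_to_data returns one entry per detected word, with bounding box and confidence.
data = pytesseract.image_to_data(img, output_type=Output.DICT)

words = []
for i in range(len(data["text"])):
    text = data["text"][i].strip()
    if text and float(data["conf"][i]) > 30:      # drop empty / low-confidence boxes
        x, y, w, h = (data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i])
        words.append({"text": text, "box": (x, y, w, h)})

full_text = " ".join(w["text"] for w in words)
```

The bounding boxes can later be used to map a recognised entity back to its position on the card.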
Natural Language Processing was used to (see the sketch after this list):
- extract entities from the text
- clean the text
- parse entities from the text into:
  - the location of each entity
  - the text of the corresponding entity
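A sketch of the entity-parsing step with spaCy; the model name en_core_web_sm and the sample OCR string are placeholders for the project's custom model and real input:

```python
# Run a spaCy pipeline over the cleaned OCR text and read off entity text + location.
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the custom business-card model

ocr_text = "John Doe Senior Engineer Acme Corp john.doe@acme.com +1 555 0100 www.acme.com"

doc = nlp(ocr_text)
for ent in doc.ents:
    # ent.text is the text of the corresponding entity;
    # ent.start_char / ent.end_char give its location in the OCR string.
    print(ent.label_, ent.text, ent.start_char, ent.end_char)
```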
- I am using a spaCy NER model based on a BERT architecture, i.e. I have to provide more data to this model to see performance improvements. I can also improve the data preparation framework.
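A sketch of what data preparation could look like for a custom spaCy (v3) NER model; the labels, example text, and character offsets are made up for illustration:

```python
# Annotations are (start_char, end_char, label) spans over the OCR text.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
train_data = [
    ("John Doe Acme Corp john@acme.com",
     [(0, 8, "NAME"), (9, 18, "ORG"), (19, 32, "EMAIL")]),
]

db = DocBin()
for text, spans in train_data:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in spans:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:           # skip spans that do not align with token boundaries
            ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("train.spacy")  # then: python -m spacy train config.cfg --paths.train train.spacy
```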
- I am using PyTesseract (a wrapper around Google's Tesseract OCR) to extract text; it has some limitations (see the sketch after this list):
  - image resolution must be at least 200 dpi, or width and height must be at least 300 pixels
  - text must not be rotated or skewed
  - text must not have effects applied to it
  - text must not be blurred
  - text must not be cursive handwriting
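A sketch of guarding against the resolution constraint before calling Tesseract; the 300-pixel threshold follows the note above, while the interpolation choice is an assumption:

```python
# Upscale images whose shorter side is below ~300 px before running OCR.
import cv2
import pytesseract

def ocr_with_min_size(path: str, min_side: int = 300) -> str:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = img.shape[:2]
    scale = max(1.0, min_side / min(h, w))
    if scale > 1.0:
        # Resize so both width and height are at least `min_side` pixels.
        img = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_CUBIC)
    return pytesseract.image_to_string(img)
```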