Named entity extraction from financial documents with OpenCV, PyTesseract, spaCy (OCR + NER)
Image preprocessing steps: 1. Binarization 2. Rescaling 3. Dilation
Text detection levels: 1. Single character 2. Word 3. Line
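A minimal preprocessing sketch with OpenCV, assuming a grayscale business-card scan; the file names, scale factor, and kernel size are illustrative choices, not values from the project:

```python
# Hypothetical preprocessing sketch (file names, scale factor, kernel size are placeholders).
import cv2
import numpy as np

img = cv2.imread("business_card.jpg", cv2.IMREAD_GRAYSCALE)

# Rescaling: upscale small scans so Tesseract sees enough pixels per glyph.
img = cv2.resize(img, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

# Binarization: Otsu's threshold separates ink from background.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Dilation: thicken strokes so characters merge into word- or line-level blobs,
# which helps when locating text regions at word or line level.
kernel = np.ones((3, 3), np.uint8)
dilated = cv2.dilate(cv2.bitwise_not(binary), kernel, iterations=1)

cv2.imwrite("preprocessed.png", cv2.bitwise_not(dilated))
```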
This was the most time-consuming task, taking more than 10 hours a day over several weeks.
Learning
-- Collecting good data in real life is not a cakewalk
Eyeballing the scanned results of one of the most common and easy inputs you can get; in real life, the input can be anything in the range of crazy to very crazy.
You are observing the NER predictions on the scanned results of the above business card.
Finding the organisation and name is still a bit difficult. Clearly, I have to increase the business card data from 3,000+ cards to maybe 10,000+, and in parallel I need to refine my approach a bit more.
Develop a customized Named Entity Recognizer to extract entities from scanned document images such as:
- Invoice
- Business Card [my focus] || extract entities such as Name, Phone, Email, Organisation, and Website link
- Shipping Bill, etc.
Computer Vision modules were used to (see the sketch after this list):
- scan the document
- identify the location of the text
- extract text from the image
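A sketch of how text location and extraction could look with pytesseract; the input path and the confidence threshold are assumptions, not the project's actual values:

```python
# Locate words (bounding boxes) and extract text with pytesseract.
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("preprocessed.png")

# image_to_data returns one entry per detected word, with bounding box and confidence.
data = pytesseract.image_to_data(img, output_type=Output.DICT)

words = []
for i in range(len(data["text"])):
    text = data["text"][i].strip()
    if text and float(data["conf"][i]) > 30:      # drop empty / low-confidence boxes
        x, y, w, h = (data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i])
        words.append({"text": text, "box": (x, y, w, h)})

full_text = " ".join(w["text"] for w in words)
```

The bounding boxes can later be used to map a recognised entity back to its position on the card.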
Natural Language Processing was used to (see the sketch after this list):
- extract entities from the text
- clean the text
- parse entities from the text into:
  - the location of each entity
  - the text of the corresponding entity
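A sketch of the entity-parsing step with spaCy; the model name en_core_web_sm and the sample OCR string are placeholders for the project's custom model and real input:

```python
# Run a spaCy pipeline over the cleaned OCR text and read off entity text + location.
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the custom business-card model

ocr_text = "John Doe Senior Engineer Acme Corp john.doe@acme.com +1 555 0100 www.acme.com"

doc = nlp(ocr_text)
for ent in doc.ents:
    # ent.text is the text of the corresponding entity;
    # ent.start_char / ent.end_char give its location in the OCR string.
    print(ent.label_, ent.text, ent.start_char, ent.end_char)
```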
- I am using a spaCy NER model based on a BERT architecture, i.e. I have to provide more data to this model to see performance improvements. I can also improve the data preparation framework.
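A sketch of what data preparation could look like for a custom spaCy (v3) NER model; the labels, example text, and character offsets are made up for illustration:

```python
# Annotations are (start_char, end_char, label) spans over the OCR text.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
train_data = [
    ("John Doe Acme Corp john@acme.com",
     [(0, 8, "NAME"), (9, 18, "ORG"), (19, 32, "EMAIL")]),
]

db = DocBin()
for text, spans in train_data:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in spans:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:           # skip spans that do not align with token boundaries
            ents.append(span)
    doc.ents = ents
    db.add(doc)

db.to_disk("train.spacy")  # then: python -m spacy train config.cfg --paths.train train.spacy
```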
- I am using PyTesseract (a wrapper around Google's Tesseract OCR) to extract text; it has some limitations (see the sketch after this list):
  - image resolution must be at least 200 dpi, or width and height must be at least 300 pixels
  - text must not be rotated or skewed
  - text must not have effects applied to it
  - text must not be blurred
  - text must not be cursive handwriting
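A sketch of guarding against the resolution constraint before calling Tesseract; the 300-pixel threshold follows the note above, while the interpolation choice is an assumption:

```python
# Upscale images whose shorter side is below ~300 px before running OCR.
import cv2
import pytesseract

def ocr_with_min_size(path: str, min_side: int = 300) -> str:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = img.shape[:2]
    scale = max(1.0, min_side / min(h, w))
    if scale > 1.0:
        # Resize so both width and height are at least `min_side` pixels.
        img = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_CUBIC)
    return pytesseract.image_to_string(img)
```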