Document-heavy industries — logistics, finance, legal, healthcare — spend significant human effort on data entry from PDFs, scanned forms, and images. AI document processing pipelines can automate 90%+ of this work. Here's how we build them at Softotic.
The Problem: Unstructured Documents at Scale
Manual document processing suffers from:
- High error rates (4–8% is typical for manual data entry)
- Slow throughput — humans process ~50–100 documents/hour
- Inability to scale during peak periods
- No audit trail of extracted values
Pipeline Architecture
A complete document processing pipeline has six stages:

```
[Ingestion] → [Pre-processing] → [OCR] → [Classification] → [Extraction] → [Validation & ERP Push]
```
Stage 1: Document Ingestion
Documents arrive via:
- Email attachments (via IMAP listener or email webhook)
- API upload (POST /documents with multipart form)
- FTP/SFTP directory watch
- WhatsApp/messaging (webhook)
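All of these channels converge on a drop location that the pipeline polls. A minimal sketch of the directory-watch case (the function name and the simple polling model are illustrative, not our production watcher):

```python
from pathlib import Path

def scan_inbox(inbox: Path, seen: set) -> list:
    """Return files that appeared since the last scan, oldest name first."""
    new_files = sorted(
        p for p in inbox.iterdir() if p.is_file() and p not in seen
    )
    seen.update(new_files)  # Remember them so the next scan skips them.
    return new_files
```

Call this on a timer (or from an RQ scheduled job) and enqueue each returned path for normalisation.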
Normalise everything to PDF on arrival: convert TIFF/JPEG images with img2pdf, convert DOCX with LibreOffice in headless mode, and use pikepdf to repair or merge PDFs that are already PDFs.
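The normalisation step reduces to a dispatch on file extension. A sketch of that dispatch (the img2pdf and LibreOffice invocations are one reasonable choice of tooling, not the only one):

```python
from pathlib import Path

def pdf_conversion_command(path: Path):
    """Return the command that converts `path` to PDF, or None if it already is one."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return None  # Already normalised.
    if suffix in {".tif", ".tiff", ".jpg", ".jpeg", ".png"}:
        # img2pdf embeds images losslessly instead of re-encoding them.
        return ["img2pdf", str(path), "-o", str(path.with_suffix(".pdf"))]
    if suffix in {".doc", ".docx"}:
        return ["libreoffice", "--headless", "--convert-to", "pdf", str(path)]
    raise ValueError(f"unsupported input format: {suffix}")
```

Run the returned command with `subprocess.run`, then hand the resulting PDF to pre-processing.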
Stage 2: Pre-Processing
- Deskew and denoise scanned images using OpenCV
- Split multi-page documents into individual page images
- Resize to optimal resolution for OCR (300 DPI for thermal prints, 200 DPI for standard scans)
```python
import cv2
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    # Grayscale, then denoise, then binarise with Otsu's method so the
    # OCR engine sees clean black-on-white text.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    denoised = cv2.fastNlMeansDenoising(gray)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```
Stage 3: OCR
Use a hybrid approach for best accuracy:
- AWS Textract for structured forms and tables (handles 2-column layouts, checkboxes)
- Tesseract as fallback (for offline or cost-sensitive pipelines)
- Azure Form Recognizer for specific form templates with pre-built models
```python
import boto3

textract = boto3.client("textract")

def run_ocr(document_bytes: bytes) -> list:
    # Ask Textract for both table and key-value (form) structure.
    response = textract.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )
    # "Blocks" is a flat list of pages, lines, words, tables and key-value sets.
    return response["Blocks"]
```
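Because the `Blocks` list is flat, key-value pairs have to be reassembled by following each block's `Relationships`. A minimal sketch of that traversal (it covers only `KEY_VALUE_SET` blocks and does no confidence filtering):

```python
def blocks_to_kv(blocks: list) -> dict:
    """Reassemble Textract KEY_VALUE_SET blocks into a {key: value} dict."""
    by_id = {b["Id"]: b for b in blocks}

    def text_of(block: dict) -> str:
        # Concatenate the WORD children linked via CHILD relationships.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = by_id[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for value_id in rel["Ids"]:
                        pairs[text_of(block)] = text_of(by_id[value_id])
    return pairs
```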
Stage 4: Document Classification
Train a multi-class classifier on document layout and keyword features:
- Inputs: OCR text, page count, presence of keywords (e.g., "INVOICE", "BILL OF LADING")
- Model: Fine-tuned DistilBERT or simpler TF-IDF + LogisticRegression for high-volume low-cost classification
- Classes: Invoice, Delivery Note, Customs Declaration, Contract, ID Document, etc.
Confidence threshold: if < 0.85, route to human review queue.
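Before the ML model is trained, a weighted-keyword scorer makes a serviceable stand-in and exercises the same routing logic. A sketch with two classes (the keyword weights are illustrative):

```python
# Illustrative keyword weights per class; a trained model replaces this later.
KEYWORD_SCORES = {
    "invoice": {"INVOICE": 3, "TOTAL DUE": 2, "VAT": 1},
    "delivery_note": {"BILL OF LADING": 3, "DELIVERY": 2, "CARRIER": 1},
}

def classify(text: str, threshold: float = 0.85):
    """Return (class, confidence); anything below `threshold` routes to review."""
    upper = text.upper()
    scores = {
        cls: sum(weight for kw, weight in kws.items() if kw in upper)
        for cls, kws in KEYWORD_SCORES.items()
    }
    total = sum(scores.values())
    if total == 0:
        return "review", 0.0
    best = max(scores, key=scores.get)
    confidence = scores[best] / total
    return (best, confidence) if confidence >= threshold else ("review", confidence)
```

The same threshold-and-route shape carries over unchanged when the scorer is swapped for DistilBERT softmax probabilities.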
Stage 5: Field Extraction
Per document class, extract structured fields using:
- Template matching: regex for IDs, amounts, dates in known positions
- ML-based extraction: LayoutLM or Donut models for zero-shot extraction from new templates
```python
import re

def _first_group(pattern: str, text: str):
    # Return the first capture group as a string, or None when absent.
    match = re.search(pattern, text, re.I)
    return match.group(1) if match else None

def extract_invoice_fields(text: str) -> dict:
    return {
        "invoice_number": _first_group(r"Invoice\s*#?\s*([A-Z0-9-]+)", text),
        "amount_due": _first_group(r"Total\s+Due\s*:?\s*[\$£]?([\d,]+(?:\.\d+)?)", text),
        "due_date": _first_group(r"Due\s+Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", text),
    }
```
Stage 6: Validation & ERP Push
Validate extracted fields against business rules:
- Amount is a valid number
- Due date is not in the past (for invoices)
- Supplier ID exists in ERP
If validation passes, push to ERP via REST webhook. If not, route to review UI.
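The rules above translate almost directly into code. A sketch (the date format and field names assume the extractor's output shape; the supplier set is a stand-in for a real ERP lookup):

```python
from datetime import date, datetime

def validate_invoice(fields: dict, known_suppliers: set) -> list:
    """Return rule violations; an empty list means the document can be pushed."""
    errors = []
    try:
        float(str(fields.get("amount_due") or "").replace(",", ""))
    except ValueError:
        errors.append("amount_due is not a valid number")
    try:
        due = datetime.strptime(fields.get("due_date") or "", "%d/%m/%Y").date()
        if due < date.today():
            errors.append("due_date is in the past")
    except ValueError:
        errors.append("due_date is not a valid date")
    if fields.get("supplier_id") not in known_suppliers:
        errors.append("supplier_id not found in ERP")
    return errors
```

An empty result triggers the ERP push; a non-empty one attaches the violation list to the review-queue item so operators see why it failed.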
Infrastructure Stack
- FastAPI — async REST API for document ingestion
- Redis Queue (RQ) — background job processing
- PostgreSQL — document metadata and extracted fields storage
- S3 — original document storage
- Docker — containerised deployment
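The stack wires together with a handful of containers. A minimal docker-compose sketch (service names, ports, queue name, and the app image are placeholders, not our production config):

```yaml
services:
  api:
    build: .                        # FastAPI app, served by uvicorn
    ports: ["8000:8000"]
    depends_on: [redis, db]
  worker:
    build: .
    command: rq worker documents    # RQ worker consuming the "documents" queue
    depends_on: [redis]
  redis:
    image: redis:7
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
```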
Handling the Review Queue
Low-confidence extractions go to a human review UI where operators:
- See the original document alongside extracted fields
- Correct any errors
- Approve and push to ERP
- These corrections feed back into model fine-tuning
Conclusion
AI document processing delivers ROI within months for high-volume operations. The key investment is in the extraction and validation layers — getting those right is what separates a 60% accurate prototype from a 96%+ production system.
Ready to automate your document workflows? Talk to Softotic's AI team.