Document-heavy industries — logistics, finance, legal, healthcare — spend significant human effort on data entry from PDFs, scanned forms, and images. AI document processing pipelines can automate 90%+ of this work. Here's how we build them at Softotic.
The Problem: Unstructured Documents at Scale
Manual document processing suffers from:
- High error rates (4–8% is typical for manual data entry)
- Slow throughput — humans process ~50–100 documents/hour
- Inability to scale during peak periods
- No audit trail of extracted values
Pipeline Architecture
A complete document processing pipeline has six stages:

```
[Ingestion] → [Pre-processing] → [OCR] → [Classification] → [Extraction] → [Validation & ERP Push]
```
Stage 1: Document Ingestion
Documents arrive via:
- Email attachments (via IMAP listener or email webhook)
- API upload (POST /documents with multipart form)
- FTP/SFTP directory watch
- WhatsApp/messaging (webhook)
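All of these channels converge on a drop location that the pipeline polls. A minimal sketch of the directory-watch case (the function name and the simple polling model are illustrative, not our production watcher):

```python
from pathlib import Path

def scan_inbox(inbox: Path, seen: set) -> list:
    """Return files that appeared since the last scan, oldest name first."""
    new_files = sorted(
        p for p in inbox.iterdir() if p.is_file() and p not in seen
    )
    seen.update(new_files)  # Remember them so the next scan skips them.
    return new_files
```

Call this on a timer (or from an RQ scheduled job) and enqueue each returned path for normalisation.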
Normalise everything to PDF on arrival: convert TIFF/JPEG images with img2pdf, convert DOCX with LibreOffice in headless mode, and use pikepdf to repair or merge PDFs that are already PDFs.
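The normalisation step reduces to a dispatch on file extension. A sketch of that dispatch (the img2pdf and LibreOffice invocations are one reasonable choice of tooling, not the only one):

```python
from pathlib import Path

def pdf_conversion_command(path: Path):
    """Return the command that converts `path` to PDF, or None if it already is one."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return None  # Already normalised.
    if suffix in {".tif", ".tiff", ".jpg", ".jpeg", ".png"}:
        # img2pdf embeds images losslessly instead of re-encoding them.
        return ["img2pdf", str(path), "-o", str(path.with_suffix(".pdf"))]
    if suffix in {".doc", ".docx"}:
        return ["libreoffice", "--headless", "--convert-to", "pdf", str(path)]
    raise ValueError(f"unsupported input format: {suffix}")
```

Run the returned command with `subprocess.run`, then hand the resulting PDF to pre-processing.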
Stage 2: Pre-Processing
- Deskew and denoise scanned images using OpenCV
- Split multi-page documents into individual page images
- Resize to optimal resolution for OCR (300 DPI for thermal prints, 200 DPI for standard scans)
```python
import cv2
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    # Grayscale, then denoise, then binarise with Otsu's method so the
    # OCR engine sees clean black-on-white text.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    denoised = cv2.fastNlMeansDenoising(gray)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```
Stage 3: OCR
Use a hybrid approach for best accuracy:
- AWS Textract for structured forms and tables (handles 2-column layouts, checkboxes)
- Tesseract as fallback (for offline or cost-sensitive pipelines)
- Azure Form Recognizer for specific form templates with pre-built models
```python
import boto3

textract = boto3.client("textract")

def run_ocr(document_bytes: bytes) -> list:
    # Ask Textract for both table and key-value (form) structure.
    response = textract.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )
    # "Blocks" is a flat list of pages, lines, words, tables and key-value sets.
    return response["Blocks"]
```
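Because the `Blocks` list is flat, key-value pairs have to be reassembled by following each block's `Relationships`. A minimal sketch of that traversal (it covers only `KEY_VALUE_SET` blocks and does no confidence filtering):

```python
def blocks_to_kv(blocks: list) -> dict:
    """Reassemble Textract KEY_VALUE_SET blocks into a {key: value} dict."""
    by_id = {b["Id"]: b for b in blocks}

    def text_of(block: dict) -> str:
        # Concatenate the WORD children linked via CHILD relationships.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = by_id[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for value_id in rel["Ids"]:
                        pairs[text_of(block)] = text_of(by_id[value_id])
    return pairs
```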
Stage 4: Document Classification
Train a multi-class classifier on document layout and keyword features:
- Inputs: OCR text, page count, presence of keywords (e.g., "INVOICE", "BILL OF LADING")
- Model: Fine-tuned DistilBERT or simpler TF-IDF + LogisticRegression for high-volume low-cost classification
- Classes: Invoice, Delivery Note, Customs Declaration, Contract, ID Document, etc.
Confidence threshold: if < 0.85, route to human review queue.
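Before the ML model is trained, a weighted-keyword scorer makes a serviceable stand-in and exercises the same routing logic. A sketch with two classes (the keyword weights are illustrative):

```python
# Illustrative keyword weights per class; a trained model replaces this later.
KEYWORD_SCORES = {
    "invoice": {"INVOICE": 3, "TOTAL DUE": 2, "VAT": 1},
    "delivery_note": {"BILL OF LADING": 3, "DELIVERY": 2, "CARRIER": 1},
}

def classify(text: str, threshold: float = 0.85):
    """Return (class, confidence); anything below `threshold` routes to review."""
    upper = text.upper()
    scores = {
        cls: sum(weight for kw, weight in kws.items() if kw in upper)
        for cls, kws in KEYWORD_SCORES.items()
    }
    total = sum(scores.values())
    if total == 0:
        return "review", 0.0
    best = max(scores, key=scores.get)
    confidence = scores[best] / total
    return (best, confidence) if confidence >= threshold else ("review", confidence)
```

The same threshold-and-route shape carries over unchanged when the scorer is swapped for DistilBERT softmax probabilities.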
Stage 5: Field Extraction
Per document class, extract structured fields using:
- Template matching: regex for IDs, amounts, dates in known positions
- ML-based extraction: LayoutLM or Donut models for zero-shot extraction from new templates
```python
import re

def _first_group(pattern: str, text: str):
    # Return the first capture group as a string, or None when absent.
    match = re.search(pattern, text, re.I)
    return match.group(1) if match else None

def extract_invoice_fields(text: str) -> dict:
    return {
        "invoice_number": _first_group(r"Invoice\s*#?\s*([A-Z0-9-]+)", text),
        "amount_due": _first_group(r"Total\s+Due\s*:?\s*[\$£]?([\d,]+(?:\.\d+)?)", text),
        "due_date": _first_group(r"Due\s+Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", text),
    }
```
Stage 6: Validation & ERP Push
Validate extracted fields against business rules:
- Amount is a valid number
- Due date is not in the past (for invoices)
- Supplier ID exists in ERP
If validation passes, push to ERP via REST webhook. If not, route to review UI.
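The rules above translate almost directly into code. A sketch (the date format and field names assume the extractor's output shape; the supplier set is a stand-in for a real ERP lookup):

```python
from datetime import date, datetime

def validate_invoice(fields: dict, known_suppliers: set) -> list:
    """Return rule violations; an empty list means the document can be pushed."""
    errors = []
    try:
        float(str(fields.get("amount_due") or "").replace(",", ""))
    except ValueError:
        errors.append("amount_due is not a valid number")
    try:
        due = datetime.strptime(fields.get("due_date") or "", "%d/%m/%Y").date()
        if due < date.today():
            errors.append("due_date is in the past")
    except ValueError:
        errors.append("due_date is not a valid date")
    if fields.get("supplier_id") not in known_suppliers:
        errors.append("supplier_id not found in ERP")
    return errors
```

An empty result triggers the ERP push; a non-empty one attaches the violation list to the review-queue item so operators see why it failed.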
Infrastructure Stack
- FastAPI — async REST API for document ingestion
- Redis Queue (RQ) — background job processing
- PostgreSQL — document metadata and extracted fields storage
- S3 — original document storage
- Docker — containerised deployment
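The stack wires together with a handful of containers. A minimal docker-compose sketch (service names, ports, queue name, and the app image are placeholders, not our production config):

```yaml
services:
  api:
    build: .                        # FastAPI app, served by uvicorn
    ports: ["8000:8000"]
    depends_on: [redis, db]
  worker:
    build: .
    command: rq worker documents    # RQ worker consuming the "documents" queue
    depends_on: [redis]
  redis:
    image: redis:7
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
```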
Handling the Review Queue
Low-confidence extractions go to a human review UI where operators:
- See the original document alongside extracted fields
- Correct any errors
- Approve and push to ERP
- These corrections feed back into model fine-tuning
Conclusion
AI document processing delivers ROI within months for high-volume operations. The key investment is in the extraction and validation layers — getting those right is what separates a 60% accurate prototype from a 96%+ production system.
Ready to automate your document workflows? Talk to Softotic's AI team.