LIPI — Text

Text intelligence — any document, any format, 22 Indian languages.

BUILDING

The need it answers

Text lives everywhere a machine still can't read it — inside images, in handwriting, across 22 Indian languages and mixed scripts. Generic OCR stops at characters and breaks on Indian documents. LIPI turns any document — any language, any format — into structured, attributed, machine-usable meaning.

What it is

LIPI is the text-perception layer. For any document (PDF, image, or handwriting), it detects the format and language, extracts the text, classifies it into domains, and attributes it to a writer. It reads across 22 Indian languages, turning raw documents into structured facts.
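The detect → extract → classify → attribute flow can be pictured as a chain of stages that each fill in one part of a shared record. The sketch below is only an illustration under stated assumptions: the `Document` fields, the stage functions, and the domain labels `"finance"` and `"general"` are all hypothetical placeholders, and each stage body is a toy stand-in rather than LIPI's actual logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Document:
    """Record passed through the stages. All field names are hypothetical."""
    raw: bytes
    fmt: Optional[str] = None
    text: Optional[str] = None
    domain: Optional[str] = None
    writer: Optional[str] = None


def detect(doc: Document) -> Document:
    # Toy format sniffing: PDF files start with the magic bytes "%PDF".
    doc.fmt = "pdf" if doc.raw.startswith(b"%PDF") else "image"
    return doc


def extract(doc: Document) -> Document:
    # Stand-in for OCR or text-layer extraction: decode the bytes as UTF-8.
    doc.text = doc.raw.decode("utf-8", errors="ignore")
    return doc


def classify(doc: Document) -> Document:
    # Toy keyword rule; the labels are placeholders, not LIPI's real domains.
    doc.domain = "finance" if "invoice" in (doc.text or "").lower() else "general"
    return doc


def attribute(doc: Document) -> Document:
    # Stand-in for writer identification; a real model would go here.
    doc.writer = "unknown"
    return doc


# The stage order mirrors the flow described in the text.
PIPELINE: List[Callable[[Document], Document]] = [detect, extract, classify, attribute]


def run(raw: bytes) -> Document:
    doc = Document(raw=raw)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    print(run(b"%PDF-1.7 Invoice 42"))
```

Keeping every stage behind the same `Document -> Document` signature means a stage can be swapped or tested in isolation without touching the others.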

By the numbers

How much faster is document intelligence?

Manual data entry carries an error rate of 18–40%. LIPI-class extraction cuts processing time per document from about 20 minutes to under 2, reduces errors by 80–90%, and reads across 22 Indian languages, turning typing into reading.

~10×

faster per document

20 min → <2 min

80–90%

fewer errors

up to 99% accuracy

22

Indian languages

Eighth Schedule

18–40%

manual entry error

Docsumo 2025

Manual entry: 3 docs/hr

LIPI extraction: 30 docs/hr

Dimension	Manual entry	With LIPI	Gain
Time per document (capture speed)	~20 min	<2 min	~10×
Error rate (fidelity)	18–40%	~1%	80–90% fewer
Cost, year 1 (operating cost)	baseline	60–80% lower	major
Coverage (languages)	English-centric OCR	22 Indian languages	India-first

Market baselines for document automation, validated 2026-06-10. LIPI targets these figures as an India-first extraction layer.

Sources: Docsumo, IDP statistics 2025; Mindee, IDP explained
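The throughput and speed figures above follow from the per-document times by simple arithmetic. The short sketch below reproduces that calculation; the function name `docs_per_hour` is introduced here for illustration only, and because the LIPI-side time is quoted as "under 2 minutes", the values it yields are lower bounds.

```python
def docs_per_hour(minutes_per_doc: float) -> float:
    """Convert a per-document handling time into hourly throughput."""
    return 60.0 / minutes_per_doc


# Figures quoted in the text: ~20 min manual, <2 min with extraction.
manual_rate = docs_per_hour(20)   # manual entry
lipi_rate = docs_per_hour(2)      # extraction, taking the 2-minute upper bound
speedup = lipi_rate / manual_rate

if __name__ == "__main__":
    print(f"manual: {manual_rate:.0f} docs/hr")
    print(f"extraction: {lipi_rate:.0f} docs/hr")
    print(f"speedup: {speedup:.0f}x")
```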

The evolution

How it was distilled, and what shaped it

🌱 Seed

Extract text from documents and images — OCR.

← shaped by the gap: off-the-shelf OCR fails on Indian scripts and real-world formats.

🛤 Path

Built the L1 perception module — detect → extract → classify → attribute, across PDF, image and handwriting.

← shaped by the stack principle — text perception is the foundation layer everything sits on.

🔀 Pivot

From OCR to language intelligence — not just the characters, but the language understood.

← shaped by the CV↔LIPI boundary — CV reads the text inside pixels, then hands the words to LIPI.

💎 Crystal

LIPI = the text (L1) layer of VANI, with India-first domain schemas.

← shaped by bottom-up architecture — Phase-1 facts are prerequisite inputs to Phase-3 intelligence.

⭐ Principle

Any document, any Indian language, any format → detected, extracted, classified, attributed, in real time.

← shaped by industry-agnostic document intelligence built for India first.

Where we stand today

Built & working

✓ Extraction pipeline: format detection, language detection, and OCR
✓ Text classifier across 6 domains
✓ Writer identification (LBP + SVM, closed-set)
✓ 22 Indian languages via Unicode matching + OCR
✓ Conversational intake agent (Stage 0)
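One way to read "Unicode matching" in the list above is as script detection by code point range: each character is checked against the Unicode block of a known script, and the script with the most hits wins. The sketch below is an assumption about how such matching might look, not LIPI's implementation. The block ranges come from the Unicode Standard, only nine scripts are included for brevity, and the function names are hypothetical. A script is not a language (Devanagari alone is shared by Hindi, Marathi, Nepali, and others), so a real system would need a further step to pick the language.

```python
from collections import Counter
from typing import Optional

# Inclusive Unicode block ranges for nine Indic scripts (Unicode Standard).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_counts(text: str) -> Counter:
    """Count characters per script; characters outside the table are ignored."""
    counts: Counter = Counter()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


def dominant_script(text: str) -> Optional[str]:
    """Return the script with the most characters, or None if none matched."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    print(dominant_script("नमस्ते"))
    print(script_counts("नमस्ते hello தமிழ்"))
```

Because `script_counts` returns per-script totals rather than a single label, the same function also exposes mixed-script documents, where more than one script has a nonzero count.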

What's next

On the path

→ OCR optimization for local image processing
→ Authorship + authenticity layer
→ Cross-document entity linking
→ Feed structured facts into higher intelligence

★ the moonshot

Text understood to its deeper meaning — authorship, intent, authenticity — atop rock-solid multilingual extraction.