IDP Core Bench

v1.0

Built by Nanonets. Roughly 2,000 invoices, receipts, forms, and handwritten documents. Four tasks: key information extraction from structured documents (KIE), OCR on printed and handwritten text, table cell extraction and structure parsing, and visual question answering over document content (VQA). The Overall score is the mean of the four task scores.

Models evaluated: 25
Dataset size: ~2,000 documents
Metrics: 4
Source: View on GitHub

Overall Score = Average of KIE, OCR, Table, and VQA scores
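
Concretely, the Overall column is the unweighted mean of the four per-task scores. A minimal sketch reproducing it for one leaderboard row (the dictionary and field names are illustrative, not from the benchmark's code):

```python
# Unweighted mean of the four task scores, as defined above.
# The row dict and its keys are illustrative, not a benchmark API.
def overall_score(row: dict) -> float:
    tasks = ("kie", "ocr", "table", "vqa")
    return sum(row[t] for t in tasks) / len(tasks)

row = {"kie": 89.5, "ocr": 73.7, "table": 96.3, "vqa": 65.2}  # Claude Sonnet 4.6
print(f"{overall_score(row):.1f}")  # 81.2, matching its Overall below
```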

Rankings

| # | Model | Organization | Overall | KIE | OCR | Table | VQA |
|---|-------|--------------|---------|-----|-----|-------|-----|
| 1 | Gemini 3.1 Pro | Google | 89.6 | 86.8 | 82.8 | 96.4 | 85.0 |
| 2 | GPT-5.4 | OpenAI | 84.4 | 85.7 | 69.1 | 94.8 | 78.2 |
| 3 | Gemini-3-Pro | Google | 81.8 | 85.7 | 81.8 | 95.8 | 64.1 |
| 4 | Claude Sonnet 4.6 | Anthropic | 81.2 | 89.5 | 73.7 | 96.3 | 65.2 |
| 5 | Claude Opus 4.6 | Anthropic | 81.1 | 89.8 | 74.0 | 96.0 | 64.4 |
| 6 | Gemini-3-Flash | Google | 80.5 | 91.1 | 81.7 | 85.6 | 63.5 |
| 7 | Nanonets OCR-3 | Nanonets | 80.2 | 84.3 | 73.8 | 85.2 | 73.0 |
| 8 | Qwen3-VL-235B | Alibaba | 80.0 | 83.8 | 71.7 | 85.0 | 75.5 |
| 9 | Qwen3-VL-Plus | Alibaba | 79.8 | 83.8 | 71.9 | 85.0 | 74.7 |
| 10 | GPT-5.2 | OpenAI | 77.4 | 87.5 | 72.8 | 86.0 | 63.5 |
| 11 | Qwen3.5-9B | Alibaba | 76.2 | 86.5 | 65.5 | 76.6 | 79.5 |
| 12 | GPT-4.1 | OpenAI | 74.7 | 87.1 | 75.6 | 73.1 | 63.0 |
| 13 | Qwen3.5-4B | Alibaba | 74.5 | 86.0 | 64.7 | 76.7 | 72.4 |
| 14 | Nanonets OCR2+ | Nanonets | 73.8 | 86.4 | 64.0 | 79.7 | 65.1 |
| 15 | GPT-5-Mini | OpenAI | 73.3 | 85.7 | 73.0 | 69.5 | 65.0 |
| 16 | Claude Haiku 4.5 | Anthropic | 72.9 | 85.6 | 65.0 | 81.7 | 59.2 |
| 17 | Ministral-8B | Mistral AI | 71.7 | 85.7 | 67.8 | 75.9 | 57.4 |
| 18 | Mistral Small 4 | Mistral AI | 68.5 | 78.3 | 57.4 | 67.6 | 77.9 |
| 19 | Qwen3.5-2B | Alibaba | 67.1 | 78.5 | 56.2 | 72.4 | 59.8 |
| 20 | GPT-5-Nano | OpenAI | 65.8 | 84.7 | 69.6 | 45.3 | 63.5 |
| 21 | Qwen3.5-0.8B | Alibaba | 61.2 | 75.9 | 62.9 | 59.9 | 52.4 |
| 22 | Pixtral-12B | Mistral AI | 59.0 | 76.2 | 54.8 | 47.5 | 57.5 |
| 23 | Llama-3.2-Vision-11B | Meta | 58.6 | 76.1 | 65.8 | 41.1 | 51.5 |
| 24 | GLM-OCR | Zhipu AI | 54.9 | 83.5 | 66.7 | 24.5 | 44.9 |
| 25 | Gemma-3-12B-IT | Google | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Metrics

KIE (higher is better)

Key Information Extraction accuracy on invoices, receipts, and forms using exact-match and fuzzy-match metrics.
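
The benchmark describes KIE scoring as a mix of exact and fuzzy matching but does not publish the scoring code here, so the sketch below is a minimal illustration of that idea; the lowercase/whitespace normalization and the 0.9 similarity threshold are assumptions, not the benchmark's actual settings.

```python
from difflib import SequenceMatcher

def field_score(predicted: str, gold: str, fuzzy_threshold: float = 0.9) -> float:
    """Score one extracted field: full credit for an exact match after
    light normalization, partial credit when string similarity clears
    the threshold. Normalization and threshold are assumptions."""
    pred, ref = predicted.strip().lower(), gold.strip().lower()
    if pred == ref:
        return 1.0
    similarity = SequenceMatcher(None, pred, ref).ratio()
    return similarity if similarity >= fuzzy_threshold else 0.0

print(field_score("Acme Corp", "Acme Corp."))  # ~0.95: near miss keeps most credit
print(field_score("Acme Inc", "Acme Corp."))   # 0.0: below threshold, no credit
```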

OCR (higher is better)

OCR accuracy on mixed handwritten and printed text documents.
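
OCR on benchmarks like this is conventionally scored with a character-level edit-distance metric. The exact formula is not given here, so the sketch below uses the standard character accuracy, 1 - CER (character error rate), as an assumed stand-in.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(predicted: str, reference: str) -> float:
    """1 - CER, clamped at 0; an assumed formula, not the benchmark's."""
    if not reference:
        return float(predicted == reference)
    cer = levenshtein(predicted, reference) / len(reference)
    return max(0.0, 1.0 - cer)

print(char_accuracy("Tota1 due: 42.00", "Total due: 42.00"))  # 0.9375
```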

Table (higher is better)

Table understanding including cell-level extraction and structural parsing.
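
Cell-level table extraction is commonly scored as precision/recall/F1 over (row, column, text) triples. The benchmark's own metric is not spelled out here, so the following is an assumed illustration of that idea.

```python
def cell_f1(predicted: set[tuple[int, int, str]],
            gold: set[tuple[int, int, str]]) -> float:
    """F1 over (row, col, text) cell triples: a predicted cell counts
    only if its position and text both match a gold cell exactly.
    An illustrative metric; the benchmark's own scoring may differ."""
    if not predicted or not gold:
        return float(predicted == gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {(0, 0, "Item"), (0, 1, "Qty"), (1, 0, "Widget"), (1, 1, "3")}
pred = {(0, 0, "Item"), (0, 1, "Qty"), (1, 0, "Widget"), (1, 1, "8")}  # one OCR slip
print(round(cell_f1(pred, gold), 2))  # 0.75
```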

VQA (higher is better)

Visual Question Answering requiring reasoning over document layout and content.
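
Document VQA is often scored with ANLS (Average Normalized Levenshtein Similarity) rather than strict exact match. Whether IDP Core Bench uses ANLS is not stated here, so treat the sketch below, which reuses the levenshtein() helper from the OCR sketch above, as purely illustrative.

```python
def anls(predicted: str, answers: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best normalized similarity against any gold
    answer, zeroed below the customary tau = 0.5 threshold. Assumes
    the levenshtein() helper defined in the OCR sketch above."""
    best = 0.0
    for gold in answers:
        pred, ref = predicted.strip().lower(), gold.strip().lower()
        denom = max(len(pred), len(ref)) or 1
        similarity = 1.0 - levenshtein(pred, ref) / denom
        best = max(best, similarity)
    return best if best >= tau else 0.0

print(anls("march 3, 2024", ["March 3, 2024", "03/03/2024"]))  # 1.0
```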