OmniDocBench

v1.5

Built by OpenDataLab, OmniDocBench covers 1,355 pages drawn from papers, books, slides, exams, newspapers, and magazines. It scores text extraction by edit distance, formula recognition by CDM, table structure by TEDS, and reading order by edit distance. Overall = ((1 − Text Edit) × 100 + Table TEDS + Formula CDM) / 3; reading order is reported separately and does not enter the Overall score.

Overall Score = ((1 − Text Edit) × 100 + Table TEDS + Formula CDM) / 3
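
The formula can be checked with a few lines of Python. This is a minimal sketch: the function and parameter names are illustrative and not part of any OmniDocBench tooling. The inputs are the Gemini-3-Flash values from the rankings.

```python
def overall_score(text_edit: float, table_teds: float, formula_cdm: float) -> float:
    """Combine the three component metrics into the Overall score.

    text_edit is an edit distance in [0, 1] (lower is better);
    table_teds and formula_cdm are already on a 0-100 scale.
    """
    # Flip Text Edit so that higher is better and rescale it to 0-100.
    text_score = (1.0 - text_edit) * 100.0
    # Unweighted mean of the three components.
    return (text_score + table_teds + formula_cdm) / 3.0


# Gemini-3-Flash: Text Edit 0.077, TEDS 87.7, CDM 90.2
score = overall_score(text_edit=0.077, table_teds=87.7, formula_cdm=90.2)
print(round(score, 1))  # 90.1
```

Because the published component values are themselves rounded, recomputing Overall from them can differ from the listed figure in the last digit.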

Rankings

#	Model	Vendor	Overall	Text Edit↓	CDM↑	TEDS↑	TEDS-S↑	Read Order↓
1	Gemini-3-Flash	Google	90.1	0.077	90.2	87.7	92.6	0.081
2	Nanonets OCR-3	Nanonets	90.0	0.068	87.7	88.9	93.3	0.100
3	Nanonets OCR2+	Nanonets	89.5	0.056	90.3	79.1	83.6	0.090
4	Gemini-3-Pro	Google	88.8	0.078	87.3	87.0	91.7	0.084
5	GPT-5.2	OpenAI	88.0	0.111	90.1	84.9	89.5	0.098
6	Claude Sonnet 4.6	Anthropic	86.9	0.165	90.2	87.1	91.2	0.149
7	Claude Opus 4.6	Anthropic	85.9	0.151	88.5	84.4	89.1	0.136
8	Datalab Marker	Datalab	85.5	0.109	88.3	79.1	83.7	0.106
9	Gemini 3.1 Pro	Google	85.3	0.082	83.3	80.8	85.4	0.073
10	GPT-5.4	OpenAI	85.3	0.089	83.4	81.3	86.7	0.077
11	Qwen3-VL-Plus	Alibaba	82.5	0.157	76.6	86.6	90.7	0.099
12	GPT-5-Mini	OpenAI	82.5	0.138	86.7	74.6	80.1	0.121
13	Qwen3-VL-235B	Alibaba	81.9	0.162	75.1	86.8	90.6	0.101
14	GPT-4.1	OpenAI	79.9	0.167	82.2	74.0	83.8	0.115
15	Claude Haiku 4.5	Anthropic	79.6	0.224	84.2	77.1	83.8	0.178
16	Ministral-8B	Mistral AI	78.3	0.157	83.3	67.1	73.8	0.125
17	Qwen3.5-9B	Alibaba	76.7	0.253	81.4	73.9	77.6	0.116
18	Mistral Small 4	Mistral AI	76.4	0.242	78.3	75.1	82.7	0.162
19	GLM-OCR	Zhipu AI	69.2	0.144	84.7	37.4	39.3	0.141
20	Qwen3.5-4B	Alibaba	67.6	0.292	71.5	60.4	64.6	0.106
21	GPT-5-Nano	OpenAI	63.4	0.319	61.0	61.2	69.5	0.243
22	Qwen3.5-2B	Alibaba	48.7	0.621	62.9	45.3	48.2	0.401
23	Qwen3.5-0.8B	Alibaba	47.3	0.583	62.3	37.9	41.0	0.352
24	Gemma-3-12B-IT	Google	44.6	0.476	50.0	31.6	46.9	0.364
25	Llama-3.2-Vision-11B	Meta	44.6	0.541	55.4	32.6	42.9	0.340
26	Pixtral-12B	Mistral AI	42.3	0.641	58.8	32.1	50.8	0.422
27	Qwen-VL-OCR	Alibaba	34.1	0.823	22.6	62.1	67.7	0.810

Metrics

Text Edit (lower is better)

Character-level edit distance between predicted and ground-truth text blocks. Lower values indicate more accurate text extraction.
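
As a rough illustration, a character-level edit distance can be computed with the standard Levenshtein recurrence and scaled to the 0–1 range shown in the rankings. The sketch below is not the benchmark's own implementation. The function names are hypothetical, and dividing by the longer string's length is an assumed normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance scaled to [0, 1]; the max-length divisor is an assumption."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(pred, truth) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

A perfect extraction scores 0.0, and a prediction that shares nothing with the ground truth scores 1.0.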

CDM (higher is better)

Character Detection Matching score for display formulas. Measures structural and symbolic accuracy of recognized mathematical expressions.

TEDS (higher is better)

Tree Edit Distance-based Similarity for tables. Evaluates both content and structure of extracted tables.
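
TEDS is commonly defined as 1 − EditDist(T_pred, T_truth) / max(|T_pred|, |T_truth|), where each table is parsed into an HTML tree and |T| is its node count. The sketch below shows only this final normalization step, scaled to the 0–100 range used in the rankings. It assumes the tree edit distance has already been computed elsewhere (for example by a Zhang-Shasha or APTED implementation), and the function name is hypothetical.

```python
def teds(tree_edit_distance: float, n_pred_nodes: int, n_truth_nodes: int) -> float:
    """Turn a tree edit distance into a 0-100 similarity score.

    tree_edit_distance: cost of transforming the predicted table tree into
        the ground-truth tree (assumed to be computed by a separate routine).
    n_pred_nodes, n_truth_nodes: node counts of the two trees.
    """
    larger = max(n_pred_nodes, n_truth_nodes)
    if larger == 0:
        return 100.0  # two empty trees are treated as identical
    return (1.0 - tree_edit_distance / larger) * 100.0


# Five edits between a 20-node prediction and a 10-node ground truth
print(teds(5, 20, 10))  # 75.0
```

Normalizing by the larger tree keeps the score bounded, provided the edit distance never exceeds the larger tree's node count.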

TEDS-S (higher is better)

Structure-only TEDS that ignores cell content. Focuses purely on table layout and cell spanning.
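
One way to picture the structure-only variant is to reduce each table to its skeleton before comparison. The sketch below assumes tables are represented as HTML and uses Python's standard `html.parser`. The class and function names are hypothetical, and the benchmark's own preprocessing may differ.

```python
from html.parser import HTMLParser


class StructureOnly(HTMLParser):
    """Rebuild a table keeping structural tags and span attributes only."""

    STRUCTURAL_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "th", "td"}
    SPAN_ATTRS = {"rowspan", "colspan"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.STRUCTURAL_TAGS:
            return  # inline markup such as <b> counts as cell content
        kept = "".join(f' {k}="{v}"' for k, v in attrs if k in self.SPAN_ATTRS)
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.STRUCTURAL_TAGS:
            self.parts.append(f"</{tag}>")

    # handle_data is left as the base-class no-op, so cell text is dropped.


def strip_cell_content(table_html: str) -> str:
    """Return the table's skeleton: layout and cell spans, no content."""
    parser = StructureOnly()
    parser.feed(table_html)
    parser.close()
    return "".join(parser.parts)


html = '<table><tr><td colspan="2">Total</td></tr><tr><td><b>a</b></td><td>b</td></tr></table>'
print(strip_cell_content(html))
# <table><tr><td colspan="2"></td></tr><tr><td></td><td></td></tr></table>
```

Two tables with the same skeleton then compare as identical regardless of what their cells contain. This is consistent with every model in the rankings scoring at least as high on TEDS-S as on TEDS.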

Read Order (lower is better)

Edit distance measuring how well the model preserves the correct reading order across multi-column and complex layouts.
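
The same edit-distance idea applies to sequences of layout blocks instead of characters. The following sketch compares a predicted block ordering with the ground-truth ordering. The block labels, function names, and max-length normalization are all assumptions made for illustration.

```python
from typing import Hashable, Sequence


def sequence_edit_distance(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> int:
    """Levenshtein distance where each element is a block ID, not a character."""
    prev = list(range(len(truth) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(truth, start=1):
            curr.append(min(
                prev[j] + 1,             # drop a predicted block
                curr[j - 1] + 1,         # insert a missing block
                prev[j - 1] + (p != t),  # keep or replace a block
            ))
        prev = curr
    return prev[-1]


def read_order_score(pred_order: Sequence[Hashable], truth_order: Sequence[Hashable]) -> float:
    """Normalized reading-order error in [0, 1]; 0.0 means the order matches."""
    longest = max(len(pred_order), len(truth_order))
    if longest == 0:
        return 0.0
    return sequence_edit_distance(pred_order, truth_order) / longest


# Two-column page: the correct order finishes column 1 before column 2,
# while the prediction reads straight across the columns.
truth = ["title", "col1-p1", "col1-p2", "col2-p1", "col2-p2"]
pred = ["title", "col1-p1", "col2-p1", "col1-p2", "col2-p2"]
print(read_order_score(pred, truth))  # 0.4
```

Swapping two adjacent blocks costs two edits here because plain Levenshtein distance has no transposition operation.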