ByteDance Open-Sources Dolphin-v2: Document Image Parsing via Heterogeneous Anchor Prompting

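Dolphin-v2 is an enhanced universal document parsing model that substantially improves on the original Dolphin. With a document-type-aware two-stage architecture and scalable anchor prompting, it handles any type of document, whether digital-born or photographed.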

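📑 Overview

Document image parsing is challenging because document types vary widely and pages interleave complex elements such as text paragraphs, figures, formulas, tables, and code blocks. Dolphin-v2 addresses these challenges with a document-type-aware two-stage approach:

🔍 Stage 1: Document type classification (digital vs. photographed) plus layout analysis with reading-order prediction
🧩 Stage 2: Hybrid parsing strategy, with holistic parsing for photographed documents and parallel element-wise parsing for digital documents

Dolphin performs strongly across page-level and element-level parsing tasks, and its lightweight architecture and parallel parsing mechanism keep inference efficient.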
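The routing between the two stages can be sketched as follows. This is a minimal illustration, not Dolphin's actual code: `Element`, `Stage1Result`, `parse_page`, and the toy stand-in functions are all hypothetical names, and a thread pool merely stands in for whatever parallel decoding the model really uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """One layout element from stage 1 (hypothetical structure)."""
    order: int   # position in the predicted reading order
    kind: str    # e.g. "text", "table", "formula", "code"
    crop: str    # stand-in for the cropped element image


@dataclass
class Stage1Result:
    doc_type: str            # "digital" or "photographed"
    elements: List[Element]  # layout elements found on the page


def parse_page(page: str,
               analyze: Callable[[str], Stage1Result],
               parse_whole: Callable[[str], str],
               parse_element: Callable[[Element], str]) -> str:
    """Route a page through stage 1, then pick the stage 2 strategy."""
    stage1 = analyze(page)
    if stage1.doc_type == "photographed":
        # Photographed pages are parsed holistically in one pass.
        return parse_whole(page)
    # Digital pages: parse each element independently, then join the
    # results in reading order. pool.map preserves input order.
    ordered = sorted(stage1.elements, key=lambda e: e.order)
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(parse_element, ordered))
    return "\n\n".join(parts)


# Toy stand-ins so the sketch runs without a model.
def toy_analyze(page: str) -> Stage1Result:
    if page.startswith("photo"):
        return Stage1Result("photographed", [])
    return Stage1Result("digital", [Element(1, "table", "T"),
                                    Element(0, "text", "A")])


def toy_parse_whole(page: str) -> str:
    return f"<whole:{page}>"


def toy_parse_element(el: Element) -> str:
    return f"<{el.kind}:{el.crop}>"


digital_out = parse_page("digital_page", toy_analyze,
                         toy_parse_whole, toy_parse_element)
photo_out = parse_page("photo_page", toy_analyze,
                       toy_parse_whole, toy_parse_element)
print(digital_out)
print(photo_out)
```

Passing the three stage functions in as arguments keeps the routing logic testable without loading any model weights.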

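📅 Changelog

🔥 2025.12.12 Released the Dolphin-v2 model. Upgraded to 3B parameters, with detection of 21 element types, attribute field extraction, dedicated formula/code parsing, and robust parsing of photographed documents. (Dolphin-1.5 moved to the v1.5 branch.)
🔥 2025.10.16 Released the Dolphin-1.5 model. It delivers significant parsing improvements while keeping the lightweight 0.3B-parameter architecture. (Dolphin 1.0 moved to the v1.0 branch.)
🔥 2025.07.10 Released the Fox-Page benchmark, a manually refined subset of the original Fox dataset. Download: Baidu Yun | Google Drive
🔥 2025.06.30 Added TensorRT-LLM support for accelerated inference!
🔥 2025.06.27 Added vLLM support for accelerated inference!
🔥 2025.06.13 Added multi-page PDF document parsing.
🔥 2025.05.21 Our demo is released. Access it at: Link
🔥 2025.05.20 Released the Dolphin pretrained model and inference code.
🔥 2025.05.16 Our paper was accepted by ACL 2025. Paper link: arXiv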

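📈 Performance

Comprehensive evaluation of document parsing on OmniDocBench (v1.5)

| Model | Size | Overall↑ | Text (Edit)↓ | Formula (CDM)↑ | Table (TEDS)↑ | Table (TEDS-S)↑ | Read Order (Edit)↓ |
|---|---|---|---|---|---|---|---|
| Dolphin | 0.3B | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| Dolphin-1.5 | 0.3B | 85.06 | 0.085 | 79.44 | 84.25 | 88.06 | 0.071 |
| Dolphin-v2 | 3B | 89.78 | 0.054 | 87.63 | 87.02 | 90.48 | 0.054 |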
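The Overall column appears consistent with a composite score that averages (1 − Text Edit) × 100, Table TEDS, and Formula CDM. That definition is an assumption here, since this document does not state it. The sketch below recomputes Overall from the other columns; small gaps against the reported values are expected because the published columns are rounded.

```python
# Rows copied from the table above; only the columns used by the
# assumed composite formula are included.
ROWS = {
    "Dolphin":     {"overall": 74.67, "text_edit": 0.125,
                    "formula_cdm": 67.85, "table_teds": 68.70},
    "Dolphin-1.5": {"overall": 85.06, "text_edit": 0.085,
                    "formula_cdm": 79.44, "table_teds": 84.25},
    "Dolphin-v2":  {"overall": 89.78, "text_edit": 0.054,
                    "formula_cdm": 87.63, "table_teds": 87.02},
}


def overall_score(text_edit: float, formula_cdm: float,
                  table_teds: float) -> float:
    """Assumed composite: mean of three scores on a 0-100 scale."""
    text_score = (1.0 - text_edit) * 100.0  # edit distance -> accuracy
    return (text_score + table_teds + formula_cdm) / 3.0


# Absolute gap between the reported and recomputed Overall per model.
gaps = {}
for name, row in ROWS.items():
    recomputed = overall_score(row["text_edit"], row["formula_cdm"],
                               row["table_teds"])
    gaps[name] = abs(recomputed - row["overall"])
    print(f"{name}: reported {row['overall']:.2f}, "
          f"recomputed {recomputed:.2f}")
```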

🛠️ Installation

Clone the repository:

```
git clone https://github.com/ByteDance/Dolphin.git
cd Dolphin
```

Install the dependencies:
```
pip install -r requirements.txt
```
Download the pretrained Dolphin-v2 model:

Visit our Hugging Face model card, or download the model as follows:

```
# Download the model from Hugging Face Hub
git lfs install
git clone https://huggingface.co/ByteDance/Dolphin-v2 ./hf_model

# Or use the Hugging Face CLI
pip install huggingface_hub
huggingface-cli download ByteDance/Dolphin-v2 --local-dir ./hf_model
```
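The same download can also be scripted from Python. This is a minimal sketch assuming the `huggingface_hub` package is installed; `download_dolphin_v2` is a hypothetical helper name, while the repository ID and target directory are the ones used above.

```python
from pathlib import Path

REPO_ID = "ByteDance/Dolphin-v2"  # repository named in this section
LOCAL_DIR = Path("./hf_model")    # same target directory as above


def download_dolphin_v2(local_dir: Path = LOCAL_DIR) -> str:
    """Fetch the model snapshot and return the local folder path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=str(local_dir))
```

Calling `download_dolphin_v2()` requires network access and enough disk space for the model weights.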

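⚡ Inference

Dolphin provides two inference frameworks and supports two parsing granularities:

Page-level parsing: parse an entire document page into structured JSON and Markdown
Element-level parsing: parse individual document elements (text, table, formula)

📄 Page-level Parsing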

```
# Process a single document image
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png

# Process a single document pdf
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf

# Process all documents in a directory
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs

# Process with custom batch size for parallel element decoding
python demo_page.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs \
    --max_batch_size 8
```
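To make the effect of `--max_batch_size` concrete, the sketch below splits a page's element crops into batches of at most that size. How `demo_page.py` batches internally is an assumption; `batched` and the crop names are hypothetical and only illustrate the idea.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield list(items[start:start + max_batch_size])


# 19 hypothetical element crops decoded with a batch size of 8.
crops = [f"element_{i}" for i in range(19)]
sizes = [len(batch) for batch in batched(crops, 8)]
print(sizes)  # [8, 8, 3]
```

A larger batch size means fewer decoding rounds per page, at the cost of more memory per round.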

🧩 Element-level Parsing

```
# Process element images (specify element_type: table, formula, text, or code)
python demo_element.py --model_path ./hf_model --save_dir ./results \
    --input_path <input_path> \
    --element_type [table|formula|text|code]
```
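When calling the element-level demo from a script, it helps to validate the element type before launching the process. The helper below is a hypothetical sketch: `element_command` is not part of the repository, it only assembles the flags shown above, and `./my_table.png` is a made-up input path.

```python
import shlex
from typing import List

# Element types accepted by --element_type, as listed above.
ELEMENT_TYPES = ("table", "formula", "text", "code")


def element_command(input_path: str, element_type: str,
                    model_path: str = "./hf_model",
                    save_dir: str = "./results") -> List[str]:
    """Build the argv list for demo_element.py, rejecting bad types."""
    if element_type not in ELEMENT_TYPES:
        raise ValueError(
            f"element_type must be one of {ELEMENT_TYPES}, "
            f"got {element_type!r}")
    return ["python", "demo_element.py",
            "--model_path", model_path,
            "--save_dir", save_dir,
            "--input_path", input_path,
            "--element_type", element_type]


cmd = element_command("./my_table.png", "table")  # hypothetical input file
print(shlex.join(cmd))
```

From the repository root, the resulting list can be passed to `subprocess.run(cmd, check=True)`.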

🎨 Layout Parsing

```
# Process a single document image
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_1.png

# Process a single PDF document
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs/page_6.pdf

# Process all documents in a directory
python demo_layout.py --model_path ./hf_model --save_dir ./results \
    --input_path ./demo/page_imgs
```

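🌟 Key Features

🔄 Two-stage analyze-then-parse approach built on a single vision-language model
📊 Strong performance on document parsing tasks
🔍 Element sequence generation in natural reading order
🧩 Heterogeneous anchor prompting for different document elements
⏱️ Efficient parallel parsing mechanism
🤗 Hugging Face Transformers support for easy integration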
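One way to picture heterogeneous anchor prompting: each layout element acts as an anchor, and its type selects a type-specific prompt. The sketch below is illustrative only. The prompt strings, `PROMPTS`, and `build_requests` are hypothetical and are not the prompts Dolphin actually uses.

```python
from typing import Dict, List, Tuple

# Hypothetical prompt texts, one per element type.
PROMPTS: Dict[str, str] = {
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
    "code": "Read the code in the image.",
}
DEFAULT_PROMPT = "Read the text in the image."  # fallback for other types


def build_requests(anchors: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Map (reading_order, element_type) anchors to per-type prompts.

    The result is sorted by reading order so that parsed outputs can be
    concatenated in the same sequence.
    """
    return [(order, PROMPTS.get(kind, DEFAULT_PROMPT))
            for order, kind in sorted(anchors)]


requests = build_requests([(2, "table"), (0, "text"), (1, "formula")])
for order, prompt in requests:
    print(order, prompt)
```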

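📮 Notice

Call for bad cases: if you run into cases where the model performs poorly, we would greatly appreciate it if you shared them in an issue. We are continuously optimizing and improving the model.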

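💖 Acknowledgements

We would like to thank the following open-source projects for providing inspiration and reference for this work: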

OmniDocBench
Donut
Nougat
GOT
MinerU
Swin
Hugging Face Transformers

📝 Citation

If you find this code useful for your research, please use the following BibTeX entry.

```
@article{feng2025dolphin,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and others},
  journal={arXiv preprint arXiv:2505.14059},
  year={2025}
}
```