Integrating MinerU with Milvus: A Complete Guide to Vector Ingestion After Extraction

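1. Why store PDF extraction results in a vector database

Does this sound familiar? You spend a whole day using MinerU to convert dozens of technical white papers, research papers, and product manuals into Markdown, and then the files just sit in the output folder. When you need a particular formula derivation or architecture diagram, you are back to opening files by hand and pressing Ctrl+F. Semantic retrieval, knowledge Q&A, or an enterprise document assistant are even further out of reach.

That is the problem this guide addresses: putting high-quality PDF extraction results to work. The MinerU 2.5-1.2B image already restores the text, tables, formulas, and images in a PDF as structured Markdown, but that is only the first step. The real value comes from turning that content into vector data that can be searched, linked, and reasoned over.

Milvus, one of the more mature and stable open-source vector databases, is the piece that turns static text into usable knowledge. It supports millisecond-level similarity search, plugs into RAG pipelines, and can serve as the basis for mixed text-and-image queries (for example, "find every passage that mentions the Transformer architecture and whose accompanying figure shows an attention-mechanism diagram"). It is also simple to deploy and scales well.

This guide skips the abstract concepts and focuses on one thing: what happens from the moment you finish running mineru -p test.pdf -o ./output until the data is written into Milvus and can be queried. It covers how to perform each step, the pitfalls along the way, and how to verify that the result is correct. Everything runs in the image that ships with GLM-4V-9B and MinerU2.5 preinstalled, so no extra installation should be needed.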

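2. Environment setup and dependency checks

Before writing any code, confirm that the image environment is ready. This section is not a formality: it rules out most of the common causes of failure early.

2.1 Check the status of the base services

After you enter the image, the default path is /root/workspace. First confirm that the key components work: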

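    # 1. Confirm the Conda environment is active (Python 3.10)
    python --version

    # 2. Check GPU availability (confirms the CUDA driver loaded)
    nvidia-smi -L

    # 3. Verify that the MinerU command is callable
    which mineru

    # 4. Check that the Milvus client is preinstalled (this image bundles pymilvus)
    python -c "import pymilvus; print('Milvus client ready')"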

If step 4 fails with ModuleNotFoundError, pymilvus is not preinstalled in your image and has to be installed manually:

    pip install pymilvus==2.4.7

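Note: this guide uses Milvus 2.4.x together with pymilvus 2.4.7. Keep the client on the same 2.4 release line as the server; interfaces can differ between release lines, so check compatibility before upgrading either side.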
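The version constraint above can be checked in code rather than by eye. The following is a minimal sketch; same_release_line is a hypothetical helper name, and comparing only the major and minor components is an assumption about what "compatible" means here, not an official Milvus rule.

```python
def same_release_line(client_version: str, server_line: str = "2.4") -> bool:
    """Return True if client_version (e.g. '2.4.7') shares major.minor with server_line."""
    # Compare the first two dot-separated components as strings,
    # so that "2.40.0" is not mistaken for the "2.4" line.
    return client_version.split(".")[:2] == server_line.split(".")[:2]


print(same_release_line("2.4.7"))  # client on the 2.4 line
print(same_release_line("2.3.1"))  # client on an older line
```

In the image, the installed client version can be read from pymilvus.__version__ and passed to this function.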

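2.2 Start the Milvus service (lightweight standalone mode)

This image ships with Milvus Standalone preinstalled, so no Docker or Kubernetes orchestration is needed. Start it directly: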

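    # Start Milvus in the background, logging to milvus.log
    nohup milvus run standalone > milvus.log 2>&1 &

    # Wait 10 seconds, then check that the service is listening on port 19530
    sleep 10
    netstat -tuln | grep 19530

If you see a line such as tcp6 0 0 :::19530 :::* LISTEN, the service is ready. Port 19530 is Milvus's default gRPC port, and it is the port the Python client scripts in this guide connect to.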
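A fixed sleep can be too short on a slow machine and wastefully long on a fast one. As a minimal sketch, the readiness check can instead poll the port until a TCP connection succeeds. wait_for_port is a hypothetical helper, and a successful connection only shows that something is listening, not that Milvus has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False if timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, wait_for_port("localhost", 19530) waits up to 30 seconds for the port that the pymilvus scripts in this guide connect to; if it returns False, milvus.log is the first place to look.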

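3. From PDF extraction to structured text: the hands-on workflow

MinerU's strength is that it does more than OCR: it understands the visual layout of a PDF. Its output, however, is a Markdown file, and we need to pull out content blocks that are suitable for vectorization. Treating the whole file as a single piece of text will not do, because it loses semantic granularity and makes retrieval inaccurate.

3.1 Understand MinerU's output structure

After running mineru -p test.pdf -o ./output --task doc, the ./output directory contains a structure like this: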

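    output/
    ├── test.md            # Main Markdown file (text plus image/formula placeholders)
    ├── images/            # All extracted images (PNG)
    │   ├── fig_001.png
    │   └── table_002.png
    ├── formulas/          # All recognized LaTeX formulas (.tex files)
    │   └── formula_001.tex
    └── meta.json          # Extraction metadata (page numbers, heading levels, block types, etc.)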

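The key point is that images and formulas in test.md are embedded as references such as ![fig_001](images/fig_001.png) and $$\int_0^1 x^2 dx$$, not as raw binary data. For vectorization, this means we need to:

- embed the plain-text paragraphs;
- embed image paths and formula LaTeX source as separate chunks (which leaves room for multimodal retrieval later).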
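To make the two reference forms concrete, here is a minimal sketch that pulls image paths and display formulas out of Markdown text with regular expressions. extract_references and the two pattern names are hypothetical, and the patterns assume the reference syntax shown above.

```python
import re

# ![alt](path): capture the alt text and the path
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)]+)\)")
# Dollar signs are escaped so they match a literal "$$" rather than end-of-line
DISPLAY_FORMULA = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def extract_references(md_text: str) -> dict:
    """Collect image paths and display-formula LaTeX source from Markdown text."""
    images = [path for _alt, path in IMAGE_REF.findall(md_text)]
    formulas = [body.strip() for body in DISPLAY_FORMULA.findall(md_text)]
    return {"images": images, "formulas": formulas}


sample = "Intro text.\n\n![fig_001](images/fig_001.png)\n\n$$\\int_0^1 x^2 dx$$\n"
refs = extract_references(sample)
print(refs)
```

Keeping the references in their own lists makes it straightforward to store them as separate chunks from the surrounding prose.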

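3.2 Write the text-splitting script: keep the semantics, avoid hard truncation

Rather than splitting naively by line or by character count, we split along the Markdown structure. Below is a compact splitting routine (save it as split_md.py):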

    # split_md.py
    import json
    import re
    from pathlib import Path

    # Level-1/2 headings; the capture group makes split() keep them in the result
    HEADING = re.compile(r"(^#{1,2} .+$)", re.MULTILINE)
    # A paragraph that is only an image reference or a $$...$$ display formula.
    # The dollar signs are escaped: an unescaped $ is an end-of-line anchor in a regex.
    IMAGE_OR_FORMULA = re.compile(r"!\[[^\]]*\]\([^)]*\)|\$\$.*\$\$", re.DOTALL)


    def split_markdown(md_path: str, output_dir: str) -> None:
        """Split MinerU Markdown output into paragraph-level chunks."""
        md_text = Path(md_path).read_text(encoding="utf-8")

        # Step 1: split at level-1/2 headings
        sections = HEADING.split(md_text)

        blocks = []
        for sec in sections:
            # Skip empty pieces and the heading lines themselves (used only as boundaries)
            if not sec.strip() or HEADING.fullmatch(sec.strip()):
                continue
            # Step 2: split each section into paragraphs on blank lines
            paras = [p.strip() for p in re.split(r"\n\s*\n", sec) if p.strip()]
            for para in paras:
                # Image-only and formula-only paragraphs are handled separately below
                if IMAGE_OR_FORMULA.fullmatch(para):
                    continue
                if len(para) > 30:  # keep only substantive text (> 30 characters)
                    blocks.append(para)

        # Step 3: read meta.json and add image and formula chunks
        meta_path = Path(md_path).parent / "meta.json"
        if meta_path.exists():
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
            # Image chunks: file name plus page number
            for img in meta.get("images", []):
                blocks.append(f"[Image] {img['filename']} (page {img['page']})")
            # Formula chunks: LaTeX source
            for frm in meta.get("formulas", []):
                blocks.append(f"[Formula] {frm['latex']}")

        # Save as JSONL (one chunk per line, convenient for batch embedding)
        output_path = Path(output_dir) / "chunks.jsonl"
        with open(output_path, "w", encoding="utf-8") as f:
            for i, block in enumerate(blocks):
                f.write(json.dumps({"id": f"chunk_{i}", "text": block}, ensure_ascii=False) + "\n")
        print(f"Generated {len(blocks)} chunks, saved to {output_path}")


    if __name__ == "__main__":
        split_markdown("./output/test.md", "./output")

Run it:

    python split_md.py

In ./output/chunks.jsonl you will see content like this:

    {"id": "chunk_0", "text": "This paper proposes a new sparse attention mechanism that reduces computational complexity from O(n²) to O(n log n)."}
    {"id": "chunk_1", "text": "[Image] fig_001.png (page 3)"}
    {"id": "chunk_2", "text": "[Formula] \\frac{d}{dx} \\sin(x) = \\cos(x)"}
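Before embedding, it can be useful to see how many chunks of each kind were produced. The sketch below counts chunks by their leading bracketed tag, whatever that tag is. count_chunk_types is a hypothetical helper; note that a paragraph that happens to start with a bracket, such as a "[1]" citation, would be counted as its own tag.

```python
import json
import re
from collections import Counter

# A leading "[tag]" marks image and formula chunks in chunks.jsonl
TAG = re.compile(r"^\[([^\]]+)\]")


def count_chunk_types(jsonl_lines) -> Counter:
    """Count chunks by leading [tag]; untagged chunks are counted as 'text'."""
    counts = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        text = json.loads(line)["text"]
        match = TAG.match(text)
        counts[match.group(1) if match else "text"] += 1
    return counts


sample_lines = [
    '{"id": "chunk_0", "text": "A plain paragraph about sparse attention."}',
    '{"id": "chunk_1", "text": "[Image] fig_001.png (page 3)"}',
    '{"id": "chunk_2", "text": "[Formula] E = mc^2"}',
]
counts = count_chunk_types(sample_lines)
print(dict(counts))
```

On a real run, pass the open file object for ./output/chunks.jsonl instead of sample_lines.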

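The main advantage of this splitting logic is that each chunk carries a clear, self-contained meaning. Chunks average around 150 characters, which fits comfortably within the input window of mainstream embedding models. Note that the script only enforces a minimum length (30 characters) and does not cap chunk length, so an unusually long paragraph is truncated at 512 tokens by the embedding step in the next section.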
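If long paragraphs are a concern, a length cap can be added on top of the splitting script. The following is a minimal sketch that breaks text at sentence boundaries and falls back to a hard split only when a single sentence exceeds the limit. cap_chunk and the 300-character default are assumptions for illustration, not part of split_md.py.

```python
import re

# Split after Chinese sentence punctuation, or after Western punctuation
# that is followed by whitespace (so "3.14" is not split).
SENTENCE_END = re.compile(r"(?<=[。！？])|(?<=[.!?])(?=\s)")


def cap_chunk(text: str, max_chars: int = 300) -> list:
    """Split text into pieces of at most max_chars, preferring sentence boundaries."""
    pieces, current = [], ""
    for sentence in SENTENCE_END.split(text):
        if not sentence:
            continue
        # Hard-split any single sentence that is longer than the limit
        while len(sentence) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if current and len(current) + len(sentence) > max_chars:
            pieces.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        pieces.append(current)
    return [p.strip() for p in pieces if p.strip()]


print(cap_chunk("A sentence. Another one! Third.", max_chars=15))
```

Applied to each paragraph before it is appended to the chunk list, this keeps every chunk under a known size instead of relying on tokenizer truncation.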

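4. Generating text embeddings with GLM-4V-9B

This image ships with the GLM-4V-9B model weights and all dependencies preinstalled, and we use it as the embedding engine. The intent is to handle the specialized terminology and mathematical notation of technical documents better than a small general-purpose model such as bge-small. Keep in mind that GLM-4V-9B is a generative model rather than a dedicated embedding model, so verify retrieval quality on your own documents.

4.1 Load the model and create the embedding function

Create embed_chunks.py: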

    # embed_chunks.py
    import json

    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL_PATH = "/root/GLM-4V-9B"

    # Load GLM-4V-9B; only text inputs are passed, so the vision branch is not used
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).half().cuda()
    model.eval()


    def get_embeddings(texts: list) -> list:
        """Return one embedding vector per input text."""
        inputs = tokenizer(
            texts, return_tensors="pt", padding=True, truncation=True, max_length=512
        ).to("cuda")
        with torch.no_grad():
            outputs = model(**inputs)
        # GLM-4V-9B is a decoder-only model with no [CLS] token, and position 0 may be
        # padding, so this version swaps in mean pooling over the real (non-padding) tokens.
        # Assumes the model output exposes last_hidden_state as (batch, seq, hidden).
        hidden = outputs.last_hidden_state.float()
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return embeddings.cpu().numpy().tolist()


    # Load the chunks produced by split_md.py
    chunks_path = "./output/chunks.jsonl"
    with open(chunks_path, "r", encoding="utf-8") as f:
        chunks = [json.loads(line) for line in f if line.strip()]

    texts = [c["text"] for c in chunks]
    print(f"Generating embeddings for {len(texts)} chunks...")

    # Process in batches of 32 to avoid out-of-memory errors
    batch_size = 32
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        all_embeddings.extend(get_embeddings(texts[i:i + batch_size]))
        print(f"  -> processed {min(i + batch_size, len(texts))}/{len(texts)}")

    # Save the embeddings (same order as chunks.jsonl)
    emb_path = "./output/embeddings.json"
    with open(emb_path, "w", encoding="utf-8") as f:
        json.dump(all_embeddings, f)
    print(f"Embeddings saved to {emb_path}")

Run:

    python embed_chunks.py

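Memory note: GLM-4V-9B is a 9B-parameter model, so its fp16 weights alone occupy roughly 18 GB (2 bytes per parameter), and activations add to that. If you hit an out-of-memory error, reduce batch_size to 16 or 8. CPU inference is a last resort: it means loading the model without .half().cuda() and moving the inputs to "cpu" instead of "cuda" in get_embeddings, and it is considerably slower.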
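As a rough cross-check on memory figures, the weights-only footprint is simply the parameter count times the bytes per parameter. The sketch below does that arithmetic; weights_footprint_gb is a hypothetical helper, and 9e9 is taken from the "9B" in the model name, so the real count (including the vision branch) may be higher.

```python
def weights_footprint_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only memory in decimal gigabytes; activations and caches are excluded."""
    return num_params * bytes_per_param / 1e9


# Compare common precisions for a model with about 9 billion parameters
for label, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{label}: {weights_footprint_gb(9e9, bytes_per_param):.1f} GB")
```

This is a lower bound: the batch of activations processed in get_embeddings comes on top, which is why reducing batch_size helps.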

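4.2 Check embedding quality: a simple similarity test

Before loading the vectors into the database, run a quick sanity check on the embeddings: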

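    # test_similarity.py
    import json

    import numpy as np


    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


    with open("./output/embeddings.json", "r", encoding="utf-8") as f:
        embs = np.array(json.load(f))

    # Similarity between the first two chunks (should be lower, since the content differs)
    sim_01 = cosine_sim(embs[0], embs[1])
    print(f"chunk 0 vs chunk 1: {sim_01:.3f}")

    # Similarity of chunk 0 with itself (should be 1.0)
    sim_00 = cosine_sim(embs[0], embs[0])
    print(f"chunk 0 vs itself: {sim_00:.3f}")

Illustrative output (the first value depends on your document):

    chunk 0 vs chunk 1: 0.215
    chunk 0 vs itself: 1.000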

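If similarities between different chunks are consistently above 0.8, the model may not have loaded correctly or the text preprocessing may be off; if they are all close to 0, the model output is probably abnormal. Treat these thresholds as rough guides. In either case, go back to the pooling step in get_embeddings in embed_chunks.py and check that it reads the hidden states from the intended positions.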
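To see why the choice of positions matters, consider a padded batch. The following is a minimal pure-Python sketch of masked mean pooling, one common way to turn per-token hidden states into a single vector while ignoring padding. masked_mean_pool is a hypothetical helper written with plain lists for readability; in embed_chunks.py the same idea would be expressed with torch tensors.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0.

    hidden_states: list of sequences, each a list of token vectors (lists of floats)
    attention_mask: list of sequences of 0/1 flags with the same layout
    """
    pooled = []
    for tokens, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, keep in zip(tokens, mask) if keep]
        if not kept:
            raise ValueError("sequence has no unmasked tokens")
        dim = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled


# One left-padded sequence: position 0 is padding and must not leak into the embedding
hidden = [[[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]]
mask = [[0, 1, 1]]
print(masked_mean_pool(hidden, mask))  # [[2.0, 3.0]]
```

Reading position 0 directly would have returned the padding vector [9.0, 9.0] here, which is the kind of mistake the similarity test is meant to surface.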

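5. Loading vectors into Milvus: create, insert, verify

We now have the structured text chunks (chunks.jsonl) and the corresponding high-dimensional vectors (embeddings.json), so we can load them into the database.

5.1 Create the Milvus collection

In Milvus, a collection is the counterpart of a table in a relational database. We create a dedicated collection for the PDF documents: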

    # create_collection.py
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

    # Connect to the local Milvus service
    connections.connect("default", host="localhost", port="19530")

    # Fields: id (primary key), text (original text), vector (embedding)
    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
        # dim must equal the embedding length; 4096 is the GLM-4V-9B text dimension used in this guide
        FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=4096),
    ]
    schema = CollectionSchema(fields, description="Vector store of PDF chunks extracted by MinerU")
    collection = Collection("pdf_knowledge_base", schema=schema)

    # Create the index (IVF_FLAT is a reasonable balance of speed and accuracy)
    index_params = {
        "index_type": "IVF_FLAT",
        "metric_type": "COSINE",
        "params": {"nlist": 100},
    }
    collection.create_index("vector", index_params)
    print("Milvus collection 'pdf_knowledge_base' created and indexed")

Run:

    python create_collection.py
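Because the schema fixes the vector dimension and the string limits, it is worth checking the data against it before inserting anything. The sketch below is one way to do that. validate_records is a hypothetical helper, its defaults mirror the values in create_collection.py, and measuring string lengths in UTF-8 bytes is an assumption about how the VARCHAR limit is counted.

```python
def validate_records(ids, texts, vectors, dim=4096, max_id_bytes=100, max_text_bytes=65535):
    """Return a list of problems that would break an insert or the id-to-vector mapping."""
    problems = []
    if not (len(ids) == len(texts) == len(vectors)):
        problems.append(
            f"length mismatch: {len(ids)} ids, {len(texts)} texts, {len(vectors)} vectors"
        )
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    for i, (id_, text, vec) in enumerate(zip(ids, texts, vectors)):
        # Assumption: string limits are compared against the UTF-8 encoded length
        if len(id_.encode("utf-8")) > max_id_bytes:
            problems.append(f"row {i}: id too long")
        if len(text.encode("utf-8")) > max_text_bytes:
            problems.append(f"row {i}: text too long")
        if len(vec) != dim:
            problems.append(f"row {i}: vector has {len(vec)} dimensions, expected {dim}")
    return problems


print(validate_records(["a", "b"], ["x", "y"], [[0.0] * 4, [1.0] * 4], dim=4))  # []
```

An empty list means the rows are consistent with the schema; anything else is cheaper to fix here than after a partial insert.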

5.2 Batch-insert the vector data

Create insert_to_milvus.py:

    # insert_to_milvus.py
    import json

    from pymilvus import Collection, connections

    connections.connect("default", host="localhost", port="19530")
    collection = Collection("pdf_knowledge_base")

    # Load the chunks and their embeddings
    with open("./output/chunks.jsonl", "r", encoding="utf-8") as f:
        chunks = [json.loads(line) for line in f if line.strip()]
    with open("./output/embeddings.json", "r", encoding="utf-8") as f:
        vectors = json.load(f)  # list of lists, in the same order as chunks

    # Milvus requires every field column to have the same length
    assert len(chunks) == len(vectors), "chunks and embeddings are out of sync"
    ids = [c["id"] for c in chunks]
    texts = [c["text"] for c in chunks]

    # Insert in batches of 500 to keep each request small
    batch_size = 500
    for i in range(0, len(ids), batch_size):
        collection.insert([ids[i:i + batch_size], texts[i:i + batch_size], vectors[i:i + batch_size]])
        print(f"  -> inserted {min(i + batch_size, len(ids))}/{len(ids)}")

    # Flush so the new data is persisted and available to queries
    collection.flush()
    print(f"All {len(ids)} rows written to Milvus")

Run:

    python insert_to_milvus.py

5.3 Verify the ingestion: a real query

Finally, run a query to confirm that everything works:

    # verify_query.py
    import json

    from pymilvus import Collection, connections

    connections.connect("default", host="localhost", port="19530")
    collection = Collection("pdf_knowledge_base")
    collection.load()  # a collection must be loaded into memory before it can be searched

    # Build a query vector by reusing the first chunk's embedding (it should match itself)
    with open("./output/embeddings.json", "r", encoding="utf-8") as f:
        query_vector = json.load(f)[0]

    # Search for the 3 most similar results
    res = collection.search(
        data=[query_vector],
        anns_field="vector",
        param={"metric_type": "COSINE", "params": {"nprobe": 10}},
        limit=3,
        output_fields=["id", "text"],
    )

    print("Query results (most similar first):")
    for hit in res[0]:
        text = hit.entity.get("text")
        print(f"  ID: {hit.id} | similarity: {hit.score:.3f}")
        print(f"  Text: {text[:60]}...")

In the expected output, the first result has id chunk_0, a score close to 1.0, and text identical to the first line of chunks.jsonl.
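The expectation above can be turned into an automated check instead of a visual one. The sketch below works on plain (id, score) pairs so it does not depend on a running Milvus instance. check_self_match is a hypothetical helper, and the 0.99 threshold is an assumption that leaves room for floating-point rounding.

```python
def check_self_match(hits, expected_id, min_score=0.99):
    """hits: list of (id, score) pairs, most similar first. Returns (ok, message)."""
    if not hits:
        return False, "no results returned"
    top_id, top_score = hits[0]
    if top_id != expected_id:
        return False, f"top hit is {top_id}, expected {expected_id}"
    if top_score < min_score:
        return False, f"top score {top_score:.3f} is below {min_score}"
    return True, "ok"


print(check_self_match([("chunk_0", 1.0), ("chunk_7", 0.62)], "chunk_0"))
```

With a pymilvus search result, the pairs can be built as [(hit.id, hit.score) for hit in res[0]] and passed in together with the id of the chunk whose vector was used as the query.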