Pi0 Embodied Intelligence in Practice: Object Grasp Trajectory Generation with YOLOv8 - 育师

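Pi0, the embodied-intelligence model, has been getting a lot of attention in robotics circles lately. You may have seen the demo videos: robots folding laundry, arranging flowers, and tidying a desk with surprisingly fluid motion. Still, many people come away from those demos with the same question: how do you actually use this in a real project? You can't have the robot fold laundry every time.

I have been working on an industrial automation project in which a robot arm has to recognize and pick parts off a conveyor belt on its own. The traditional options either rely on complex vision algorithms or require a lot of manual teaching, and both are painful to debug. Since the Pi0 model had been open-sourced, I wanted to see whether it could simplify the workflow.

The result was better than I expected. With YOLOv8 handling object detection and Pi0 generating the grasp trajectory, the system came together quickly and has run reliably. This article shares that hands-on experience. If you are working on robotic grasping, it should give you some useful ideas.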

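1. Project Background and Requirements

Let me start with the actual problem. Our factory has an assembly line where different part models show up on the conveyor at random, and a robot arm has to move each one to a designated location. The traditional approach looked like this:

Vision: hand-written OpenCV image-processing code, with parameters tuned separately for every part type
Grasp planning: either manually teaching every grasp point or using a complex motion-planning algorithm
Debugging cost: every new part type meant another full round of tuning, which was driving the engineers up the wall

The worst part is that some parts have irregular shapes, and the traditional grasp-point calculation often fails on them. Take a part with a hole in it: the geometric center can land inside the hole, the algorithm picks it as the grasp point, and the gripper closes on nothing.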

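# Traditional grasp-point calculation (fails often)
import cv2

def calculate_grasp_point(contour):
    # Fit the minimum-area bounding rectangle
    rect = cv2.minAreaRect(contour)
    # Choose the grasp point from 2D geometry alone
    grasp_point = (rect[0][0], rect[0][1])  # rectangle center
    grasp_angle = rect[2]  # rotation angle in degrees
    return grasp_point, grasp_angle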

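The limitation is obvious: the method only uses 2D geometry. It ignores the object's 3D shape, material, and weight distribution, let alone the mechanics of the grasp itself. That is why we so often ended up with "the algorithm says it can be grasped, but the arm can't pick it up".

Pi0 addresses exactly this problem. It is a vision-language-action (VLA) model that takes in images and language instructions together and produces suitable actions. Put simply: you show it an image, tell it "pick up this part", and it outputs a complete grasp trajectory.

2. Technical Design

Our design is quite simple: combine YOLOv8 with Pi0. The overall flow is:

Perception: YOLOv8 detects parts on the conveyor in real time and identifies their type and position
Instruction generation: the detection result is turned into a natural-language instruction for Pi0
Trajectory generation: Pi0 outputs pick-and-place trajectories from the image and the instruction
Trajectory execution: the robot arm executes the generated trajectory

The diagram below shows the architecture of the whole system: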

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Industrial cam  │───▶│ YOLOv8          │───▶│ Instruction gen │
│ (live images)   │    │ (detection)     │    │ (natural lang.) │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
┌─────────────────┐    ┌─────────────────┐    ┌────────▼────────┐
│ Arm controller  │◀───│ Trajectory parse│◀───│ Pi0 model       │
│ (executes)      │    │ (coord. xform)  │    │ (trajectory gen)│
└─────────────────┘    └─────────────────┘    └─────────────────┘
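
The data flow in the diagram can be sketched as one cycle that wires four pluggable stages together. This is only an illustration of the decoupling, not the project's actual code: `Detection`, `run_once`, and the four callables are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Detection:
    class_name: str
    confidence: float
    center_norm: tuple  # (x, y) in [0, 1] image coordinates


def run_once(
    image,
    detect: Callable[[object], Sequence[Detection]],
    make_instruction: Callable[[Detection], str],
    generate: Callable[[object, str], Optional[list]],
    execute: Callable[[list], None],
) -> bool:
    """Run one perception -> instruction -> trajectory -> execution cycle."""
    detections = detect(image)
    if not detections:
        return False  # nothing on the belt
    target = max(detections, key=lambda d: d.confidence)
    instruction = make_instruction(target)
    trajectory = generate(image, instruction)
    if not trajectory:
        return False  # the model returned nothing usable
    execute(trajectory)
    return True


# Usage with stand-in stages (real ones would wrap YOLOv8, Pi0, and the arm)
log = []
ok = run_once(
    "frame",
    detect=lambda img: [Detection("bolt", 0.93, (0.2, 0.5))],
    make_instruction=lambda d: f"pick up the {d.class_name}",
    generate=lambda img, text: [text],
    execute=log.append,
)
print(ok, log)
```

Because each stage is just a callable, the detector or the trajectory generator can be swapped or mocked without touching the others.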

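2.1 Why combine YOLOv8 and Pi0?

You might ask: why not let a single model do everything? Why split it into two steps?

There are a few practical reasons.

Strengths of YOLOv8:

Fast detection, running at 100+ FPS on an RTX 4060
Easy to train, with decent results from a few hundred annotated images
Outputs class, position, and confidence together, so nothing is missing

Strengths of Pi0:

Understands complex grasping tasks and accounts for 3D shape and grasp mechanics
Produces smooth, natural trajectories that resemble human manipulation
Accepts natural-language instructions, so during debugging you can simply say "pick up the one on the left"

Most importantly, the combination is flexible. If only the detection requirements change, we retrain just the YOLOv8 model. If the grasp strategy needs tuning, we adjust the instruction given to Pi0. With the two modules decoupled, maintenance is much easier.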

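3. Environment Setup and Model Deployment

3.1 Hardware

Our test setup:

Robot arm: UR5e (6-axis collaborative arm)
Camera: Intel RealSense D435i (RGB-D camera)
Industrial PC: i7-12700 + RTX 4060 + 32 GB RAM
Operating system: Ubuntu 20.04 + ROS Noetic

If you use a different arm, such as a Franka or an Aubo, the approach carries over largely unchanged. You only need to adapt the coordinate-transform part.

3.2 Software installation

Deploying Pi0 was easier than expected. We used a Docker image, so deployment was close to one command: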

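# 1. Pull the Pi0 image
docker pull spiritai/pi0:latest

# 2. Run the container (with access to the camera and robot devices)
#    Note: reaching device nodes usually also needs --device or --privileged
docker run -it --rm \
  --gpus all \
  --network host \
  -v /dev:/dev \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  -e DISPLAY=$DISPLAY \
  spiritai/pi0:latest

# 3. Start the Pi0 service inside the container
python -m pi0.server --host 0.0.0.0 --port 7860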

Deploying YOLOv8 is even simpler. Just install the Ultralytics pip package:

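# Install YOLOv8 (run in a shell)
pip install ultralytics

# Verify the installation (run in Python)
import torch
from ultralytics import YOLO

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Load a pretrained model
model = YOLO('yolov8n.pt')  # start with the nano variant for testing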

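3.3 Camera and robot arm configuration

One tip here: take camera calibration seriously. We used Zhang's calibration method with a 6×9 chessboard and 30 mm squares. Note that OpenCV's pattern_size counts inner corners rather than squares, so the value passed to the function below has to match your physical board.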

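import cv2
import numpy as np

def calibrate_camera(images, pattern_size=(8, 6), square_size=0.03):
    """Calibrate the camera from chessboard images.

    pattern_size is the number of INNER corners (columns, rows), not squares;
    square_size is the edge length of one square in meters.
    """
    obj_points = []  # 3D points in the board frame
    img_points = []  # matching 2D points in the image
    image_size = None

    # 3D coordinates of the board corners (all on the z = 0 plane)
    objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
    objp *= square_size

    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        image_size = gray.shape[::-1]  # (width, height)
        ret, corners = cv2.findChessboardCorners(gray, pattern_size, None)
        if ret:
            obj_points.append(objp)
            # Refine corner locations to sub-pixel accuracy
            corners_refined = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
            img_points.append(corners_refined)

    # Fail early instead of passing empty lists to calibrateCamera
    if not obj_points:
        raise ValueError("No chessboard corners found in any image")

    # Solve for the intrinsics and distortion coefficients
    ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, image_size, None, None)
    return ret, mtx, dist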

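Once calibration is done, save the parameters to a file. They are used repeatedly later on.

4. Object Detection with YOLOv8

4.1 Data preparation and training

We have five part types: bolts, nuts, washers, bearings, and housings. Each class has 200-300 annotated images, about 1,200 in total.

We annotated with LabelImg in YOLO format (normalized coordinates). The dataset is split 8:1:1 into training, validation, and test sets.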
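
The 8:1:1 split can be done with a few lines of standard-library Python. The sketch below is an assumption about how one might do it, not the script used in the project: `split_dataset` and the file names are made up for illustration, and a fixed seed keeps the split reproducible.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train/val/test lists by ratio."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # the remainder goes to test
    }


splits = split_dataset([f"img_{i:04d}.jpg" for i in range(1200)])
print({name: len(files) for name, files in splits.items()})
# {'train': 960, 'val': 120, 'test': 120}
```

Each image's label file has to be moved along with it, so in practice you would split on the file stem and copy both the image and its `.txt` label.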

# data.yaml: dataset configuration
path: /home/robot/parts_dataset
train: images/train
val: images/val
test: images/test

nc: 5  # number of classes
names: ['bolt', 'nut', 'washer', 'bearing', 'housing']

The training command is short:

# Use the YOLOv8s model (a balance between speed and accuracy)
yolo train data=data.yaml model=yolov8s.pt epochs=100 imgsz=640 batch=16

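Training took about 2 hours (100 epochs) and reached an mAP@0.5 of 0.92 on the validation set, which is a good result.

4.2 Real-time detection and coordinate conversion

The detection code is fairly compact: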

import cv2
import numpy as np
from ultralytics import YOLO

class PartDetector:
    def __init__(self, model_path='best.pt', depth_scale=0.001):
        self.model = YOLO(model_path)
        self.camera_matrix = np.load('camera_matrix.npy')  # camera intrinsics
        self.dist_coeffs = np.load('dist_coeffs.npy')
        # Meters per raw depth unit; 0.001 assumes the depth image is in millimeters
        self.depth_scale = depth_scale

    def detect_parts(self, rgb_image, depth_image=None):
        """Detect parts and return their 3D positions in the camera frame.

        depth_image must be pixel-aligned with rgb_image.
        """
        # Run YOLOv8 detection
        results = self.model(rgb_image, verbose=False)[0]
        detections = []
        for box in results.boxes:
            # 2D bounding box
            x1, y1, x2, y2 = box.xyxy[0].cpu().numpy().tolist()
            cls_id = int(box.cls[0])
            conf = float(box.conf[0])
            # Box center in pixel coordinates
            center_x = (x1 + x2) / 2
            center_y = (y1 + y2) / 2

            position_3d = None  # only filled in when a depth image is given
            if depth_image is not None:
                depth = float(depth_image[int(center_y), int(center_x)]) * self.depth_scale
                if depth <= 0:
                    continue  # skip parts without a valid depth reading
                # Undistort the pixel; P keeps the result in pixel units
                point_2d = np.array([[[center_x, center_y]]], dtype=np.float32)
                point_2d = cv2.undistortPoints(point_2d, self.camera_matrix,
                                               self.dist_coeffs, P=self.camera_matrix)
                u, v = point_2d[0][0]
                # Back-project to camera coordinates with the pinhole model
                position_3d = [
                    float((u - self.camera_matrix[0, 2]) * depth / self.camera_matrix[0, 0]),
                    float((v - self.camera_matrix[1, 2]) * depth / self.camera_matrix[1, 1]),
                    depth,
                ]

            detections.append({
                'class_id': cls_id,
                'class_name': self.model.names[cls_id],
                'confidence': conf,
                'bbox_2d': [x1, y1, x2, y2],
                'position_3d': position_3d,
                'center_pixel': [center_x, center_y]
            })
        return detections

    def draw_detections(self, image, detections):
        """Draw the detection results on the image."""
        for det in detections:
            x1, y1, x2, y2 = map(int, det['bbox_2d'])
            label = f"{det['class_name']} {det['confidence']:.2f}"
            # Bounding box
            cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
            # Label background
            cv2.rectangle(image, (x1, y1-25), (x1+len(label)*10, y1), (0, 255, 0), -1)
            # Label text
            cv2.putText(image, label, (x1, y1-5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 2)
            # Center point
            cx, cy = map(int, det['center_pixel'])
            cv2.circle(image, (cx, cy), 5, (0, 0, 255), -1)
        return image

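The key step here is the coordinate transform. The detector returns positions in the camera frame. To express them in the robot base frame, you need the relative pose between the camera and the robot. We used eye-to-hand calibration, which yields a fixed transformation matrix.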
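
Applying that fixed matrix is a single homogeneous transform. The following is a minimal sketch in plain Python: `transform_point` is a hypothetical helper, and the values in `T_BASE_CAM` are made up for illustration (a real matrix comes out of your hand-eye calibration).

```python
def transform_point(T_base_cam, point_cam):
    """Apply a 4x4 homogeneous transform to a 3D point.

    T_base_cam maps camera-frame coordinates into the robot base frame.
    """
    x, y, z = point_cam
    homogeneous = (x, y, z, 1.0)
    return [
        sum(T_base_cam[row][col] * homogeneous[col] for col in range(4))
        for row in range(3)
    ]


# Hypothetical calibration result: camera mounted at (0.5, 0.0, 1.0) m in the
# base frame and looking straight down (a 180-degree rotation about x).
T_BASE_CAM = [
    [1.0,  0.0,  0.0, 0.5],
    [0.0, -1.0,  0.0, 0.0],
    [0.0,  0.0, -1.0, 1.0],
    [0.0,  0.0,  0.0, 1.0],
]

# A part seen 0.8 m in front of the camera ends up about 0.2 m above the base
print(transform_point(T_BASE_CAM, [0.1, 0.05, 0.8]))
```

With NumPy the same operation is `T @ [x, y, z, 1]`. Make sure the point and the translation column use the same unit (meters here).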

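5. Pi0 Trajectory Generation and Integration

5.1 Talking to the Pi0 model

The Pi0 service in our deployment exposes an HTTP API, which we call directly from Python: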

import base64

import cv2
import requests

class Pi0Client:
    def __init__(self, server_url="http://localhost:7860"):
        self.server_url = server_url

    def generate_trajectory(self, image, instruction, robot_state=None):
        """Call Pi0 to generate a trajectory."""
        # Encode the image as a base64 JPEG
        ok, buffer = cv2.imencode('.jpg', image)
        if not ok:
            print("Image encoding failed")
            return None, 0.0
        image_base64 = base64.b64encode(buffer).decode('utf-8')

        # Build the request payload
        payload = {
            "image": image_base64,
            "instruction": instruction,
            "robot_state": robot_state or {
                "joint_positions": [0, 0, 0, 0, 0, 0],  # current joint positions
                "gripper_open": True  # gripper state
            }
        }

        # Send the request
        try:
            response = requests.post(
                f"{self.server_url}/generate",
                json=payload,
                timeout=10.0
            )
            if response.status_code == 200:
                result = response.json()
                return result.get("trajectory"), result.get("success_probability", 0.0)
            print(f"Pi0 API error: {response.status_code}")
            return None, 0.0
        except (requests.RequestException, ValueError) as e:
            # Network failure, timeout, or a response body that is not JSON
            print(f"Pi0 call failed: {e}")
            return None, 0.0

    def generate_grasp_instruction(self, part_type, position_relative):
        """Build a natural-language instruction from part type and position.

        position_relative[0] is the horizontal image position normalized to [0, 1].
        """
        # Describe the position
        if position_relative[0] < 0.3:
            position_desc = "on the left"
        elif position_relative[0] > 0.7:
            position_desc = "on the right"
        else:
            position_desc = "in the middle"

        # Tailor the instruction to the part type
        instructions = {
            "bolt": f"Pick up the bolt {position_desc}; it is a long, thin metal part",
            "nut": f"Pick up the nut {position_desc}; note that it is hexagonal",
            "washer": f"Pick up the washer {position_desc}; it is very thin, so grasp it flat",
            "bearing": f"Pick up the bearing {position_desc}; do not squeeze too hard or it may be damaged",
            "housing": f"Pick up the housing {position_desc}; it is fairly heavy, so hold it firmly"
        }
        return instructions.get(part_type, f"Pick up the {part_type} {position_desc}")

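5.2 Parsing and executing the trajectory

The trajectory returned by Pi0 is a sequence of joint angles or end-effector poses. It has to be converted into commands the robot arm can execute: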

import time

class TrajectoryExecutor:
    def __init__(self, robot_ip="192.168.1.100"):
        self.robot_ip = robot_ip
        # Initialize the connection for your specific arm here.
        # For UR robots, the ur_rtde or urx libraries can be used.

    def parse_pi0_trajectory(self, trajectory_data):
        """Parse the trajectory data returned by Pi0."""
        # Example of the trajectory format returned by Pi0:
        # {
        #   "waypoints": [
        #     {"joints": [0, -1.57, 0, -1.57, 0, 0], "gripper": 0.0},
        #     {"joints": [0.1, -1.4, 0.2, -1.4, 0.1, 0], "gripper": 0.0},
        #     ...
        #   ],
        #   "timestamps": [0.0, 0.1, 0.2, ...]
        # }
        waypoints = trajectory_data.get("waypoints", [])
        timestamps = trajectory_data.get("timestamps", [])

        # Check that the trajectory is usable
        if len(waypoints) < 2:
            print("Too few waypoints; generation may have failed")
            return None

        # Convert to robot commands
        commands = []
        for i, wp in enumerate(waypoints):
            # Timestamps are cumulative, so the segment duration is the
            # difference to the previous one (0.1 s when none is available)
            if 0 < i < len(timestamps):
                duration = timestamps[i] - timestamps[i - 1]
            else:
                duration = 0.1
            commands.append({
                "target_joints": wp["joints"],
                "gripper_position": wp.get("gripper", 0.0),
                "duration": duration,
                "blend_radius": 0.01 if i < len(waypoints) - 1 else 0.0
            })
        return commands

    def execute_trajectory(self, commands, speed_factor=0.5):
        """Execute the trajectory."""
        print(f"Executing trajectory with {len(commands)} waypoints")
        for i, cmd in enumerate(commands):
            print(f"Waypoint {i+1}/{len(commands)}: {cmd['target_joints']}")
            # Call the arm's control interface here,
            # for example moveJ on a UR robot:
            # self.robot.movej(cmd['target_joints'], acc=0.5, vel=speed_factor)

            # Gripper control
            gripper_pos = cmd['gripper_position']
            if gripper_pos < 0.5:
                # self.robot.open_gripper()
                pass
            else:
                # self.robot.close_gripper()
                pass

            # Wait for completion (in practice, use feedback from the arm)
            # time.sleep(cmd['duration'])
        print("Trajectory execution finished")

6. Full System Integration and Testing

6.1 Main control loop

Putting all of the modules above together gives the complete main program:

import time

import numpy as np
import pyrealsense2 as rs

from detection import PartDetector
from pi0_client import Pi0Client
from trajectory import TrajectoryExecutor

class GraspingSystem:
    def __init__(self):
        print("Initializing grasping system...")
        # Initialize the modules
        self.detector = PartDetector('models/best.pt')
        self.pi0_client = Pi0Client("http://localhost:7860")
        self.executor = TrajectoryExecutor("192.168.1.100")

        # Initialize the camera (RealSense as an example)
        self.pipeline = rs.pipeline()
        config = rs.config()
        config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
        config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
        self.pipeline.start(config)
        # Align depth to color so both images share pixel coordinates
        self.align = rs.align(rs.stream.color)

        # Drop-off pose (robot base frame)
        self.drop_position = [0.4, 0.2, 0.1]  # x, y, z
        self.drop_orientation = [3.14, 0, 0]  # roll, pitch, yaw
        print("System initialized")

    def run(self):
        """Main loop."""
        print("Starting grasping system")
        try:
            while True:
                # 1. Grab frames and align depth to color
                frames = self.align.process(self.pipeline.wait_for_frames())
                color_frame = frames.get_color_frame()
                depth_frame = frames.get_depth_frame()
                if not color_frame or not depth_frame:
                    continue
                # Convert to numpy arrays
                color_image = np.asanyarray(color_frame.get_data())
                depth_image = np.asanyarray(depth_frame.get_data())

                # 2. Detect parts
                detections = self.detector.detect_parts(color_image, depth_image)
                if not detections:
                    print("No parts detected, waiting...")
                    time.sleep(1)
                    continue
                # Pick the detection with the highest confidence
                best_det = max(detections, key=lambda d: d['confidence'])
                print(f"Detected: {best_det['class_name']}, "
                      f"confidence: {best_det['confidence']:.2f}")

                # 3. Build the instruction from the normalized image position
                #    (the left/right thresholds expect values in [0, 1])
                height, width = color_image.shape[:2]
                cx, cy = best_det['center_pixel']
                instruction = self.pi0_client.generate_grasp_instruction(
                    best_det['class_name'], [cx / width, cy / height])
                print(f"Instruction: {instruction}")

                # 4. Ask Pi0 for a trajectory
                trajectory, success_prob = self.pi0_client.generate_trajectory(
                    color_image, instruction)
                success_prob = success_prob or 0.0  # treat a missing value as 0

                if trajectory and success_prob > 0.7:
                    print(f"Trajectory generated, confidence: {success_prob:.2f}")
                    # 5. Parse and execute the trajectory
                    commands = self.executor.parse_pi0_trajectory(trajectory)
                    if commands:
                        self.executor.execute_trajectory(commands)
                        print("Grasp finished")
                    else:
                        print("Trajectory parsing failed")
                else:
                    print(f"Trajectory generation failed or confidence too low: {success_prob:.2f}")

                # Wait before the next cycle
                time.sleep(2)
        except KeyboardInterrupt:
            print("Interrupted by user, stopping")
        finally:
            self.pipeline.stop()
            print("System stopped")

if __name__ == "__main__":
    system = GraspingSystem()
    system.run()

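6.2 Test results

We tested the system in the factory for a week, and it held up well.

Success rates:

Bolts and nuts: above 95% (regular shapes, easy to grasp)
Washers: about 85% (thin, so they sometimes slip)
Bearings: 90% (Pi0 adjusts the grip force)
Housings: 92% (heavy, but Pi0 holds them firmly)

Speed:

Detection plus trajectory generation: about 1.2 s
Trajectory execution: 2-4 s (depending on travel distance)
Full cycle: 5-7 s per part

That is far faster than what we had before: a worker used to handle roughly 200-300 parts per hour, and the system now handles 500-600.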
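
As a quick sanity check on these numbers, cycle time converts to hourly throughput as 3600 divided by the seconds per part. The helper below is a hypothetical one-liner written for this check. It suggests that the reported 500-600 parts per hour corresponds to the slower end of the 5-7 s range.

```python
def parts_per_hour(cycle_seconds):
    """Convert a per-part cycle time into hourly throughput."""
    if cycle_seconds <= 0:
        raise ValueError("cycle time must be positive")
    return 3600.0 / cycle_seconds


for cycle in (5.0, 6.0, 7.0):
    print(f"{cycle:.0f} s/part -> {parts_per_hour(cycle):.0f} parts/hour")
# 5 s/part -> 720 parts/hour
# 6 s/part -> 600 parts/hour
# 7 s/part -> 514 parts/hour
```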

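7. Problems and Solutions

We naturally ran into problems during deployment. Here are a few typical ones.

7.1 Changing lighting

Lighting on the factory floor changes throughout the day, with different conditions in the morning, at noon, and in the evening. YOLOv8 sometimes misdetects under strong reflections.

What we did:

Added data augmentation to simulate different lighting during training
Installed a ring light around the camera
Retried automatically when detection failed, up to 3 times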

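def adaptive_detection(self, image, depth_image=None, max_retries=3):
    """Retry detection with image adjustments to cope with lighting changes."""
    for retry in range(max_retries):
        detections = self.detector.detect_parts(image, depth_image)
        if detections:
            return detections
        # On failure, adjust the image before the next attempt
        if retry < max_retries - 1:
            # adjust_image is a helper not shown here (contrast and brightness)
            image = self.adjust_image(image)
            print(f"Detection failed, retry {retry + 1}...")
    return []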

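7.2 Unnatural Pi0 trajectories

Sometimes the trajectory Pi0 generates is a bit odd: the arm takes a detour, or the end-effector orientation is not ideal.

What we did:

Added more context to the instruction
Post-processed and smoothed the generated trajectory
Set up trajectory checks that filter out clearly unreasonable trajectories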

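def validate_trajectory(self, trajectory, start_pose, target_pose):
    """Check that a trajectory is plausible before executing it."""
    waypoints = trajectory.get("waypoints", [])
    if len(waypoints) < 3:
        return False, "too few waypoints"

    # Check joint limits (±3.14 rad here; set this to your arm's limits)
    for wp in waypoints:
        joints = wp["joints"]
        if any(abs(j) > 3.14 for j in joints):
            return False, "joint angle out of range"

    # Check the end-effector workspace
    # More checks can be added here...
    return True, "trajectory valid"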
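
The smoothing step from the list above is not shown in the article. One simple option is a moving average over neighboring joint waypoints that leaves the start and goal untouched. This is a sketch of that idea under my own assumptions, not the project's post-processing code, and `smooth_joint_path` is a hypothetical name.

```python
def smooth_joint_path(joint_path, window=3):
    """Moving-average smoothing of joint waypoints; endpoints stay fixed."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    n = len(joint_path)
    half = window // 2
    smoothed = []
    for i in range(n):
        if i == 0 or i == n - 1:
            smoothed.append(list(joint_path[i]))  # keep start and goal exact
            continue
        lo, hi = max(0, i - half), min(n, i + half + 1)
        neighbors = joint_path[lo:hi]
        # Average each joint over the neighboring waypoints
        smoothed.append([
            sum(wp[j] for wp in neighbors) / len(neighbors)
            for j in range(len(joint_path[i]))
        ])
    return smoothed


# A zig-zag in joint 0 gets flattened while both endpoints are preserved
path = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
for waypoint in smooth_joint_path(path):
    print([round(q, 3) for q in waypoint])
```

Averaging in joint space knows nothing about obstacles, so a smoothed path should go through the same validity checks as the original before it is sent to the arm.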

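7.3 Real-time requirements

Our conveyor runs at 0.5 m/s with parts spaced about 20 cm apart, so a new part arrives every 0.2 m ÷ 0.5 m/s = 0.4 s, and the system has to finish processing within that window.

Optimizations:

Switched YOLOv8 to TensorRT, cutting inference time from 50 ms to 15 ms
Deployed the Pi0 service locally, so network latency is close to zero
Ran trajectory execution in parallel with the next round of detection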
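
The last item, overlapping execution with the next round of detection, can be sketched with a single background worker: while the arm executes the plan for one frame, detection and planning for the next frame already run in a thread. `run_pipelined`, `plan`, and `execute` are hypothetical names, and the stages here are stand-ins rather than real YOLOv8 or Pi0 calls.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipelined(frames, plan, execute):
    """Overlap planning for frame i+1 with execution of the plan for frame i."""
    executed = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # future holding the plan that will be executed next
        for frame in frames:
            next_plan = pool.submit(plan, frame)  # plan in the background
            if pending is not None:
                # Runs on the main thread while next_plan is being computed
                executed.append(execute(pending.result()))
            pending = next_plan
        if pending is not None:
            executed.append(execute(pending.result()))
    return executed


# Stand-in stages: "planning" doubles the frame id, "execution" adds one
print(run_pipelined([1, 2, 3], plan=lambda f: f * 2, execute=lambda p: p + 1))
# [3, 5, 7]
```

One caveat with this design: the next frame is captured while the arm is still moving, so the camera view has to stay clear of the arm, or the frame should be grabbed only once the arm has left the field of view.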

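8. Summary and Recommendations

Overall, the YOLOv8 + Pi0 combination has performed well for industrial grasping. The biggest advantage is development speed: going from nothing to a running system took us just two weeks. With the traditional approach, tuning the motion-planning algorithm alone could have taken a month.

There are a few things to keep in mind, though.

Where Pi0 fits:

Manipulation tasks such as picking, placing, and assembly
Complex shapes and deformable objects, which it handles well
Setups with reasonably clear visual input (image quality cannot be too poor)

Where it does not fit:

Ultra-high-precision assembly (tolerances below 0.1 mm)
High-speed dynamic grasping (fast-moving targets)
Extreme environments (high temperature, strong magnetic fields, vacuum, and so on)

If you want to try this approach yourself, my suggestions are:

Start small: test with one or two part types first and get familiar with the whole workflow
Prioritize data quality: YOLOv8 training data needs accurate labels and varied images
Handle errors properly: robot systems cope badly with surprises, so every stage needs exception handling
Safety first: test thoroughly before real deployment, and set up an emergency stop and safety zones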
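
On the software side, a safety zone can be as simple as a bounding box that every planned tool position must stay inside. The sketch below is illustrative only: `SafetyZone`, `first_violation`, and the box limits are made up, and a check like this complements the hardware emergency stop and the controller's own safety settings rather than replacing them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyZone:
    """Axis-aligned box in the robot base frame, in meters."""
    x: tuple
    y: tuple
    z: tuple

    def contains(self, point):
        """True if the (x, y, z) point lies inside the box, borders included."""
        return all(lo <= value <= hi
                   for value, (lo, hi) in zip(point, (self.x, self.y, self.z)))


def first_violation(zone, tcp_points):
    """Return the index of the first point outside the zone, or None."""
    for index, point in enumerate(tcp_points):
        if not zone.contains(point):
            return index
    return None


# Hypothetical work cell: reject any plan whose tool path dips below z = 0.05 m
zone = SafetyZone(x=(0.2, 0.8), y=(-0.4, 0.4), z=(0.05, 0.6))
planned = [(0.4, 0.0, 0.30), (0.5, 0.1, 0.02)]
print(first_violation(zone, planned))
# 1
```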

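After finishing this project, my sense is that embodied intelligence in industrial settings is only getting started. Models like Pi0 make robot programming much simpler: motion planning that used to require a specialist is now within reach of an ordinary engineer. As the models keep improving, I expect many more scenarios to benefit from this technology.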
