Integrating Qwen3-ASR-1.7B with Flask: Quickly Building a Speech Recognition Web Service - 育师

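Does this sound familiar? You have a pile of meeting recordings, interview audio, or user-uploaded voice files that need to be turned into text quickly. Manual transcription is slow, tedious, and error-prone. With the open-source Qwen3-ASR-1.7B model and the lightweight Flask framework, you can stand up your own speech recognition web service in about half an hour.

Qwen3-ASR-1.7B is a recently open-sourced speech recognition model from the Qwen (通义千问) team. Its standout feature is that a single model recognizes 52 languages and dialects, including Mandarin, Cantonese, and English, and it can even handle songs with background music. Flask is one of the most popular Python web frameworks; it is simple and flexible, which makes it a good fit for quickly building API services.

In this article I will walk you through combining the two, building from scratch a web service that accepts audio uploads, runs recognition automatically, and returns the text. No complicated configuration is needed; just follow along.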

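1. Preparation: Environment Setup and Model Download

Before writing any code, we need a working environment. I assume you know basic Python. If Python is not installed yet, install Python 3.9 or later first (recent releases of PyTorch, Transformers, and Flask no longer support 3.8).

1.1 Create the project directory and virtual environment

First, create a dedicated project directory. This keeps the environment clean and avoids package conflicts.

    # Create the project directory
    mkdir qwen-asr-web-service
    cd qwen-asr-web-service

    # Create the virtual environment (on Windows: python -m venv venv)
    python3 -m venv venv

    # Activate the virtual environment
    # Linux/Mac:
    source venv/bin/activate
    # Windows:
    # venv\Scripts\activate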

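Once the virtual environment is active, your prompt should start with (venv), which indicates you are working inside it.

1.2 Install the required Python packages

Next, install Flask and the core packages the model needs. I suggest using the Tsinghua mirror, which is much faster for users in mainland China.

    # Upgrade pip
    pip install --upgrade pip

    # Install the core packages (torchaudio is imported by app.py later on)
    pip install flask torch torchaudio transformers -i https://pypi.tuna.tsinghua.edu.cn/simple

    # Install the audio-processing packages
    pip install soundfile librosa -i https://pypi.tuna.tsinghua.edu.cn/simple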

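A quick note on what each package does:

flask: the web framework, which handles HTTP requests
torch: the PyTorch deep learning framework that the model runs on
transformers: Hugging Face's model library, used to load and run pretrained models
torchaudio, soundfile and librosa: audio file handling, for reading and converting audio formats (the code in this article loads and resamples audio with torchaudio)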
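Before moving on, it can help to confirm that the packages are importable from the active virtual environment. The following is a minimal sketch using only the standard library; the file name check_env.py and the helper find_missing are illustrative names, not part of the original project.

```python
# check_env.py (hypothetical helper script)
import importlib.util

# Import names of the packages installed in section 1.2.
REQUIRED_MODULES = ["flask", "torch", "torchaudio", "transformers", "soundfile", "librosa"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it with python check_env.py inside the activated environment; any name it reports can be installed with pip as shown above.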

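1.3 Download the Qwen3-ASR-1.7B model

There are two ways to download the model; pick one based on your network:

Option 1: Download directly from Hugging Face (recommended if you have a good connection)

    # download_model.py - a simple download script
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model_name = "Qwen/Qwen3-ASR-1.7B"

    print(f"Downloading model: {model_name}")
    print("This may take a while; the model is about 3.4GB...")

    # Download the model and processor
    model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name)
    processor = AutoProcessor.from_pretrained(model_name)

    print("Model download complete!")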

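Option 2: Use ModelScope (recommended for users in mainland China)

If you are in mainland China, downloading through ModelScope is faster:

    # Install modelscope first
    pip install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple

Then create a download script:

    # download_model_modelscope.py
    from modelscope import snapshot_download

    model_dir = snapshot_download('Qwen/Qwen3-ASR-1.7B')
    print(f"Model downloaded to: {model_dir}")

After the download script runs, the model is stored in a local cache. The first run has to fetch the model files, which can take a while depending on your connection, but after that the files are reused. If you downloaded through ModelScope, pass the printed model_dir path to from_pretrained in the code that follows instead of the repository name "Qwen/Qwen3-ASR-1.7B"; otherwise transformers will try to fetch the model from Hugging Face again.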

2. Core Code: Integrating the Model into a Flask Application

With the environment ready, we can start writing code. I will break the application into several parts and explain them one at a time.

2.1 Basic structure of the Flask application

First, create app.py, which is our main program file.

    # app.py
    import os
    from flask import Flask, request, jsonify
    from werkzeug.utils import secure_filename
    import torch
    import torchaudio
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    # Initialize the Flask application
    app = Flask(__name__)

    # Configure the upload folder and the allowed file types
    UPLOAD_FOLDER = 'uploads'
    ALLOWED_EXTENSIONS = {'wav', 'mp3', 'm4a', 'flac'}
    app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
    app.config['MAX_CONTENT_LENGTH'] = 100 * 1024 * 1024  # Limit uploads to 100MB

    # Create the upload folder
    os.makedirs(UPLOAD_FOLDER, exist_ok=True)


    def allowed_file(filename):
        """Check whether the file type is allowed."""
        return '.' in filename and \
            filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

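This code does several things:

Imports the required libraries
Creates the Flask application instance
Configures where uploads are saved and which file types are allowed
Creates the upload folder if it does not exist
Defines a helper function that checks the file type by its extension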

2.2 Load the speech recognition model

Next, add the model-loading code. To avoid reloading the model on every request, we keep the model in module-level globals and call load_model() once at startup (section 2.6). Flask's before_first_request decorator is not used for this, because it was removed in Flask 2.3.

    # Globals that hold the model and processor
    model = None
    processor = None
    device = None


    def load_model():
        """Load the model once, at application startup."""
        global model, processor, device

        print("Loading the Qwen3-ASR-1.7B model...")

        # Use the GPU if one is available
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {device}")

        # Load the model and processor
        try:
            model = AutoModelForSpeechSeq2Seq.from_pretrained(
                "Qwen/Qwen3-ASR-1.7B",
                torch_dtype=torch.float16 if device == "cuda:0" else torch.float32,
                low_cpu_mem_usage=True,
                use_safetensors=True
            )

            # Move the model to the selected device
            model.to(device)

            # Load the processor
            processor = AutoProcessor.from_pretrained("Qwen/Qwen3-ASR-1.7B")

            print("Model loaded!")

        except Exception as e:
            print(f"Failed to load model: {e}")
            raise

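A few key points:

We check whether a GPU is available and use it if so; otherwise we fall back to the CPU
On the GPU the model runs in float16, which roughly halves weight memory compared with float32; on the CPU it stays in float32
low_cpu_mem_usage=True lowers peak RAM use while the weights are being loaded
If loading fails, the error is printed and re-raised
The loader class and arguments shown here are the ones used throughout this article; check the model card to confirm they match your installed transformers version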
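The "load once, reuse for every request" idea can also be written so that loading is safe even when several threads ask for the model at the same time. The sketch below is an illustration under assumptions: LazyModel and the loader callable are hypothetical names, and the loader stands in for the real loading code above.

```python
import threading


class LazyModel:
    """Hold a model that is loaded at most once, even under concurrent access."""

    def __init__(self, loader):
        # `loader` is a hypothetical zero-argument callable that returns the
        # loaded model (for example, a function wrapping from_pretrained).
        self._loader = loader
        self._lock = threading.Lock()
        self._model = None

    def get(self):
        """Return the cached model, loading it on first use."""
        if self._model is None:
            with self._lock:
                # Re-check inside the lock: another thread may have finished
                # loading while this one was waiting.
                if self._model is None:
                    self._model = self._loader()
        return self._model
```

A request handler would call something like holder.get() instead of reading a global. Loading eagerly at startup, as this article does, is simpler; the lazy variant mainly helps when the application object is imported by another server process.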

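2.3 Audio preprocessing

Audio files come in different formats and with different parameters, so we need a function that normalizes them.

    def preprocess_audio(audio_path):
        """Preprocess an audio file into the format the model expects."""
        try:
            # Load the audio with torchaudio
            # (MP3/M4A decoding depends on the installed torchaudio backend, e.g. FFmpeg)
            waveform, sample_rate = torchaudio.load(audio_path)

            # Downmix to mono if the file has more than one channel
            if waveform.shape[0] > 1:
                waveform = torch.mean(waveform, dim=0, keepdim=True)

            # Resample to 16000Hz if the source rate differs
            if sample_rate != 16000:
                resampler = torchaudio.transforms.Resample(sample_rate, 16000)
                waveform = resampler(waveform)
                sample_rate = 16000

            # Drop the channel dimension and convert to a 1-D numpy array
            audio_array = waveform.squeeze(0).numpy()

            return audio_array, sample_rate

        except Exception as e:
            print(f"Audio preprocessing failed: {e}")
            return None, None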

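This function does several important things:

Loads the audio file
Downmixes stereo to mono by averaging the channels
Resamples to 16000Hz when the source uses a different rate (16000Hz is the sample rate the model expects)
Converts the result to a numpy array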
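To make the two numeric steps concrete, here is a small pure-Python sketch of downmixing and resampling. It is for illustration only: the function names are made up for this example, and the linear interpolation used here is a simplification, whereas torchaudio's Resample applies a band-limited sinc interpolation filter that preserves audio quality much better.

```python
def downmix_to_mono(channels):
    """Average equal-length channels sample by sample."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (illustration only)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # how many source samples one output sample spans
    out = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring source samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out
```

For example, one second of 44100Hz audio (44100 samples) comes out as 16000 samples after resample_linear(samples, 44100, 16000).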

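2.4 The core recognition function

This is the heart of the service: it calls the model to perform speech recognition.

    def transcribe_audio(audio_path, language=None):
        """Run speech recognition with the Qwen3-ASR model."""
        if model is None or processor is None:
            return {"error": "Model not loaded, please retry later"}

        try:
            # Preprocess the audio
            audio_array, sample_rate = preprocess_audio(audio_path)
            if audio_array is None:
                return {"error": "Audio preprocessing failed"}

            # Prepare the model inputs
            inputs = processor(
                audio=audio_array,
                sampling_rate=sample_rate,
                return_tensors="pt",
                padding=True
            )

            # Move the inputs to the model's device; floating-point tensors are also
            # cast to the model's dtype (float16 on GPU) to avoid a dtype mismatch
            inputs = {
                k: v.to(device, dtype=model.dtype) if torch.is_floating_point(v) else v.to(device)
                for k, v in inputs.items()
            }

            # Generate the transcription
            # (max_new_tokens caps the output length, so long recordings are cut off at this limit)
            with torch.no_grad():
                generated_ids = model.generate(**inputs, max_new_tokens=256)

            # Decode the result
            transcription = processor.batch_decode(
                generated_ids,
                skip_special_tokens=True
            )[0]

            # The language argument is accepted but not used here.
            # Note: Qwen3-ASR detects the language automatically, so it rarely needs to be set by hand

            return {
                "success": True,
                "text": transcription,
                "language": "auto-detected",
                "audio_duration": len(audio_array) / sample_rate
            }

        except Exception as e:
            print(f"Speech recognition failed: {e}")
            return {"error": f"Error during recognition: {str(e)}"}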

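The function works as follows:

Checks that the model is loaded
Preprocesses the audio file
Uses the processor to build the model inputs
Calls the model to generate the transcription
Decodes and returns the result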
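Because generation is capped at a fixed number of new tokens, a long recording such as a full meeting may need to be split into shorter pieces that are transcribed one after another. The sketch below only computes the sample ranges; the function name chunk_bounds and the 30-second chunk and 1-second overlap defaults are assumptions for illustration, not values taken from the model's documentation.

```python
def chunk_bounds(n_samples, sample_rate=16000, chunk_seconds=30.0, overlap_seconds=1.0):
    """Return (start, end) sample index pairs that cover the whole signal."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")

    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # the last chunk reached the end of the signal
        start += step  # consecutive chunks share `overlap_seconds` of audio
    return bounds
```

Each pair can be used to slice the array returned by preprocess_audio, for example audio_array[start:end], and the partial transcripts joined afterwards. The overlap reduces the chance of cutting a word in half, but words inside the overlap may appear twice and need to be merged.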

2.5 Create the web API endpoints

Now we create the Flask routes, that is, the API endpoints.

    @app.route('/')
    def index():
        """Home page with brief usage instructions."""
        return '''
        <h1>Qwen3-ASR Speech Recognition Service</h1>
        <p>A speech recognition web service based on Qwen3-ASR-1.7B</p>
        <p>Usage:</p>
        <ul>
            <li>Upload an audio file: POST /transcribe</li>
            <li>Supported formats: WAV, MP3, M4A, FLAC</li>
            <li>Maximum file size: 100MB</li>
        </ul>
        '''


    @app.route('/transcribe', methods=['POST'])
    def transcribe():
        """Handle audio upload and transcription."""
        # Check that a file was uploaded
        if 'file' not in request.files:
            return jsonify({"error": "No file uploaded"}), 400

        file = request.files['file']

        # Check that the filename is not empty
        if file.filename == '':
            return jsonify({"error": "No file selected"}), 400

        # Check the file type
        if not allowed_file(file.filename):
            supported = ', '.join(sorted(ALLOWED_EXTENSIONS))
            return jsonify({"error": f"Unsupported file type. Supported types: {supported}"}), 400

        try:
            # Save the upload under a sanitized filename
            filename = secure_filename(file.filename)
            filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
            file.save(filepath)

            print(f"File saved: {filepath}")

            # Read the optional language parameter
            language = request.form.get('language', None)

            # Run speech recognition
            result = transcribe_audio(filepath, language)

            # Delete the temporary file (optional)
            # os.remove(filepath)

            return jsonify(result)

        except Exception as e:
            print(f"Error while handling request: {e}")
            return jsonify({"error": f"Internal server error: {str(e)}"}), 500


    @app.route('/health', methods=['GET'])
    def health_check():
        """Health check endpoint."""
        if model is not None and processor is not None:
            return jsonify({
                "status": "healthy",
                "model_loaded": True,
                "device": device
            })
        else:
            return jsonify({
                "status": "unhealthy",
                "model_loaded": False
            }), 503

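We created three endpoints:

Home page (/): shows brief usage instructions
Transcription endpoint (/transcribe): handles file upload and speech recognition
Health check (/health): reports whether the service is running and the model is loaded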
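One detail of the upload endpoint is worth a sketch. secure_filename drops non-ASCII characters, so a name such as 会议.wav is reduced to just wav, and two uploads with the same name overwrite each other in the upload folder. A possible alternative, shown here under the assumption that the original filename does not need to be preserved on disk, is to generate the stored name yourself; make_upload_name is a hypothetical helper, not part of the original code.

```python
import os
import uuid

ALLOWED_EXTENSIONS = {"wav", "mp3", "m4a", "flac"}  # same set as in app.py


def make_upload_name(original_name):
    """Return a collision-free stored name, or None if the extension is not allowed."""
    _, ext = os.path.splitext(original_name)
    ext = ext.lstrip(".").lower()
    if ext not in ALLOWED_EXTENSIONS:
        return None
    # A random 32-character hex name avoids clashes between concurrent uploads
    # and never contains user-controlled path components.
    return f"{uuid.uuid4().hex}.{ext}"
```

In the route, the file would then be saved under os.path.join(app.config['UPLOAD_FOLDER'], stored_name), with the original name returned in the JSON response if the caller needs it.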

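2.6 Start the application

Finally, add the startup code.

    if __name__ == '__main__':
        # Load the model once at startup
        with app.app_context():
            load_model()

        # Start the Flask application
        print("Starting the Flask application...")
        print("Open http://localhost:5000 to see the home page")
        print("POST audio files to http://localhost:5000/transcribe")
        # use_reloader=False stops debug mode from spawning a second process that
        # would load the model again; turn debug off outside local development
        app.run(host='0.0.0.0', port=5000, debug=True, use_reloader=False)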

3. Complete Code and Testing

Now let's put all the code into one file. Create the complete app.py:

    # app.py - complete code
    import os
    from flask import Flask, request, jsonify
    from werkzeug.utils import secure_filename
    import torch
    import torchaudio
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    # Initialize the Flask application
    app = Flask(__name__)

    # Configuration
    UPLOAD_FOLDER = 'uploads'
    ALLOWED_EXTENSIONS = {'wav', 'mp3', 'm4a', 'flac'}
    app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
    app.config['MAX_CONTENT_LENGTH'] = 100 * 1024 * 1024

    # Create the upload folder
    os.makedirs(UPLOAD_FOLDER, exist_ok=True)

    # Globals
    model = None
    processor = None
    device = None


    def allowed_file(filename):
        return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


    def load_model():
        global model, processor, device
        print("Loading the Qwen3-ASR-1.7B model...")
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {device}")
        try:
            model = AutoModelForSpeechSeq2Seq.from_pretrained(
                "Qwen/Qwen3-ASR-1.7B",
                torch_dtype=torch.float16 if device == "cuda:0" else torch.float32,
                low_cpu_mem_usage=True,
                use_safetensors=True
            )
            model.to(device)
            processor = AutoProcessor.from_pretrained("Qwen/Qwen3-ASR-1.7B")
            print("Model loaded!")
        except Exception as e:
            print(f"Failed to load model: {e}")
            raise


    def preprocess_audio(audio_path):
        try:
            waveform, sample_rate = torchaudio.load(audio_path)
            if waveform.shape[0] > 1:
                waveform = torch.mean(waveform, dim=0, keepdim=True)
            if sample_rate != 16000:
                resampler = torchaudio.transforms.Resample(sample_rate, 16000)
                waveform = resampler(waveform)
                sample_rate = 16000
            audio_array = waveform.squeeze(0).numpy()
            return audio_array, sample_rate
        except Exception as e:
            print(f"Audio preprocessing failed: {e}")
            return None, None


    def transcribe_audio(audio_path, language=None):
        if model is None or processor is None:
            return {"error": "Model not loaded"}
        try:
            audio_array, sample_rate = preprocess_audio(audio_path)
            if audio_array is None:
                return {"error": "Audio preprocessing failed"}
            inputs = processor(
                audio=audio_array,
                sampling_rate=sample_rate,
                return_tensors="pt",
                padding=True
            )
            # Cast floating-point inputs to the model's dtype to avoid a mismatch on GPU
            inputs = {
                k: v.to(device, dtype=model.dtype) if torch.is_floating_point(v) else v.to(device)
                for k, v in inputs.items()
            }
            with torch.no_grad():
                generated_ids = model.generate(**inputs, max_new_tokens=256)
            transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
            return {
                "success": True,
                "text": transcription,
                "language": "auto-detected",
                "audio_duration": len(audio_array) / sample_rate
            }
        except Exception as e:
            print(f"Speech recognition failed: {e}")
            return {"error": f"Recognition error: {str(e)}"}


    @app.route('/')
    def index():
        return '''
        <h1>Qwen3-ASR Speech Recognition Service</h1>
        <p>A speech recognition web service based on Qwen3-ASR-1.7B</p>
        <p>Usage: POST /transcribe to upload an audio file</p>
        <p>Supported formats: WAV, MP3, M4A, FLAC</p>
        '''


    @app.route('/transcribe', methods=['POST'])
    def transcribe():
        if 'file' not in request.files:
            return jsonify({"error": "No file uploaded"}), 400
        file = request.files['file']
        if file.filename == '':
            return jsonify({"error": "No file selected"}), 400
        if not allowed_file(file.filename):
            return jsonify({"error": "Unsupported file type"}), 400
        try:
            filename = secure_filename(file.filename)
            filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
            file.save(filepath)
            language = request.form.get('language', None)
            result = transcribe_audio(filepath, language)
            return jsonify(result)
        except Exception as e:
            return jsonify({"error": f"Server error: {str(e)}"}), 500


    @app.route('/health', methods=['GET'])
    def health_check():
        if model is not None and processor is not None:
            return jsonify({"status": "healthy", "model_loaded": True, "device": device})
        else:
            return jsonify({"status": "unhealthy", "model_loaded": False}), 503


    if __name__ == '__main__':
        with app.app_context():
            load_model()
        print("Starting service...")
        print("Open http://localhost:5000")
        # use_reloader=False stops debug mode from loading the model a second time
        app.run(host='0.0.0.0', port=5000, debug=True, use_reloader=False)

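3.1 Test the service

With the code saved, let's check that the service works.

Step 1: Start the service

    python app.py

If everything is fine, you will see output like this:

    Loading the Qwen3-ASR-1.7B model...
    Using device: cuda:0  # or cpu
    Model loaded!
    Starting service...
    Open http://localhost:5000
     * Serving Flask app 'app'
     * Debug mode: on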

Step 2: Test the health check endpoint

Open http://localhost:5000/health in a browser. You should see:

    {
      "status": "healthy",
      "model_loaded": true,
      "device": "cuda:0"
    }
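Since loading the model can take a while, a script that depends on the service may want to wait until /health reports success before sending audio. The following is a rough sketch using only the standard library; fetch_health and wait_until_healthy are illustrative names, and the attempt count and delay are arbitrary defaults.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(url):
    """Return the decoded /health JSON, or None if the service is unreachable or unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, the 503 returned while the model is
        # missing, and response bodies that are not valid JSON.
        return None


def wait_until_healthy(url, attempts=30, delay=2.0, fetch=fetch_health, sleep=time.sleep):
    """Poll the health endpoint until model_loaded is true, or give up."""
    for _ in range(attempts):
        payload = fetch(url)
        if payload and payload.get("model_loaded"):
            return True
        sleep(delay)
    return False
```

A caller would use wait_until_healthy("http://localhost:5000/health") and only proceed when it returns True. The fetch and sleep parameters exist so the loop can be exercised without a running server.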

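Step 3: Test file upload with curl

Prepare a test audio file (for example test.wav), then run:

    curl -X POST http://localhost:5000/transcribe \
      -F "file=@test.wav"

If everything is fine, you will get a response like this:

    {
      "success": true,
      "text": "This is a test recording used to verify that the speech recognition service works.",
      "language": "auto-detected",
      "audio_duration": 5.2
    }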

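Step 4: Test with Python

You can also test with Python code (this requires the requests package: pip install requests):

    # test_client.py
    import requests

    url = "http://localhost:5000/transcribe"

    # Open the file in a with block so it is closed after the upload
    with open("test.wav", "rb") as f:
        response = requests.post(url, files={"file": f})

    print("Status code:", response.status_code)
    print("Response:", response.json())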

4. Advanced Features and Optimization Tips

The basics are in place, but real-world use often calls for more. Here are a few practical extensions.

4.1 Add batch processing

If you need to process several audio files at once, add a batch endpoint:

    @app.route('/batch_transcribe', methods=['POST'])
    def batch_transcribe():
        """Process several audio files in one request."""
        if 'files' not in request.files:
            return jsonify({"error": "No files uploaded"}), 400

        files = request.files.getlist('files')
        if len(files) == 0:
            return jsonify({"error": "No files selected"}), 400

        results = []

        for file in files:
            if file.filename == '':
                continue

            if not allowed_file(file.filename):
                results.append({
                    "filename": file.filename,
                    "error": "Unsupported file type"
                })
                continue

            try:
                filename = secure_filename(file.filename)
                filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
                file.save(filepath)

                result = transcribe_audio(filepath)
                result["filename"] = filename
                results.append(result)

                # Remove the temporary file
                os.remove(filepath)

            except Exception as e:
                results.append({
                    "filename": file.filename,
                    "error": str(e)
                })

        return jsonify({
            "total_files": len(files),
            "processed_files": len(results),
            "results": results
        })

4.2 Add a progress query endpoint

Recognition can take a while for longer audio files. We can add a simple task table and a progress query endpoint:

    import uuid
    import threading
    import time

    # A simple in-memory task table (per process; lost on restart)
    tasks = {}


    @app.route('/async_transcribe', methods=['POST'])
    def async_transcribe():
        """Asynchronous transcription: returns a task ID immediately."""
        if 'file' not in request.files:
            return jsonify({"error": "No file uploaded"}), 400

        file = request.files['file']
        if file.filename == '':
            return jsonify({"error": "No file selected"}), 400

        if not allowed_file(file.filename):
            return jsonify({"error": "Unsupported file type"}), 400

        # Generate a unique task ID
        task_id = str(uuid.uuid4())

        # Save the file
        filename = secure_filename(file.filename)
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], f"{task_id}_{filename}")
        file.save(filepath)

        # Initialize the task state
        tasks[task_id] = {
            "status": "processing",
            "progress": 0,
            "result": None,
            "filepath": filepath,
            "created_at": time.time()
        }

        # Process the task in a background thread
        def process_task(tid, fpath):
            try:
                tasks[tid]["progress"] = 10
                result = transcribe_audio(fpath)
                tasks[tid]["result"] = result
                # transcribe_audio reports failures through an "error" key instead of raising
                tasks[tid]["status"] = "failed" if "error" in result else "completed"
                tasks[tid]["progress"] = 100
            except Exception as e:
                tasks[tid]["result"] = {"error": str(e)}
                tasks[tid]["status"] = "failed"
            finally:
                # Clean up the uploaded file whether or not recognition succeeded
                if os.path.exists(fpath):
                    os.remove(fpath)

        thread = threading.Thread(target=process_task, args=(task_id, filepath))
        thread.start()

        return jsonify({
            "task_id": task_id,
            "status": "processing",
            "progress_url": f"/task/{task_id}/progress"
        })


    @app.route('/task/<task_id>/progress', methods=['GET'])
    def get_task_progress(task_id):
        """Query task progress."""
        if task_id not in tasks:
            return jsonify({"error": "Task not found"}), 404

        task = tasks[task_id]
        return jsonify({
            "task_id": task_id,
            "status": task["status"],
            "progress": task["progress"],
            "result": task["result"]
        })
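The task table above records created_at but never removes finished tasks, so it grows for as long as the process runs. One way to address that, sketched here under the assumption that results only need to be kept for a limited time, is to wrap the dictionary in a small thread-safe class with age-based cleanup. TaskStore and its method names are illustrative, not part of the original code.

```python
import threading
import time


class TaskStore:
    """Thread-safe in-memory task table with age-based cleanup."""

    def __init__(self, clock=time.time):
        self._tasks = {}
        self._lock = threading.Lock()
        self._clock = clock  # injectable so the cleanup logic can be tested

    def create(self, task_id):
        with self._lock:
            self._tasks[task_id] = {
                "status": "processing",
                "progress": 0,
                "result": None,
                "created_at": self._clock(),
            }

    def update(self, task_id, **fields):
        with self._lock:
            self._tasks[task_id].update(fields)

    def get(self, task_id):
        """Return a copy of the task record, or None if it does not exist."""
        with self._lock:
            task = self._tasks.get(task_id)
            return dict(task) if task is not None else None

    def purge_older_than(self, max_age_seconds):
        """Drop finished tasks older than max_age_seconds; return how many were removed."""
        cutoff = self._clock() - max_age_seconds
        with self._lock:
            stale = [
                tid for tid, task in self._tasks.items()
                if task["status"] != "processing" and task["created_at"] < cutoff
            ]
            for tid in stale:
                del self._tasks[tid]
        return len(stale)
```

The background thread would call update(task_id, status="completed", progress=100, result=result), and purge_older_than could run from a periodic timer or at the start of each request. Like the plain dictionary, this store lives in one process only.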

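4.3 Performance optimization tips

For a real deployment, consider the following:

Use a production-grade server: Flask's development server is not suitable for production; consider Gunicorn or uWSGI. Note that app.py calls load_model() only inside the if __name__ == '__main__' block, which does not run when Gunicorn imports app:app, so move that call to module level before deploying this way. Each worker process loads its own copy of the model and keeps its own tasks table, so start with a single worker.

    # Install Gunicorn
    pip install gunicorn

    # Start the service
    # -w 1: every worker holds a full copy of the model
    # --timeout 300: the default 30-second worker timeout is too short for long audio
    gunicorn -w 1 --timeout 300 -b 0.0.0.0:5000 app:app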

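Add caching: for identical audio files, cache the recognition result to avoid recomputing it.
Limit concurrent requests: speech recognition is resource-intensive, so cap the number of requests processed at the same time.
Use model quantization: if memory or compute is limited, quantization can reduce resource use.

    # 8-bit quantization (requires the bitsandbytes and accelerate packages and a CUDA GPU)
    from transformers import BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=6.0
    )

    # device_map="auto" places the weights, so do not call model.to(device) afterwards
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        "Qwen/Qwen3-ASR-1.7B",
        quantization_config=quantization_config,
        device_map="auto"
    )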
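The caching and concurrency-limit suggestions can be sketched with the standard library alone. This is a minimal illustration under assumptions: TranscriptCache and ConcurrencyLimit are hypothetical names, the cache is unbounded and in-memory, and compute stands in for a function that runs recognition on raw audio bytes.

```python
import hashlib
import threading


class TranscriptCache:
    """Cache results keyed by the SHA-256 digest of the audio bytes."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    @staticmethod
    def key_for(audio_bytes):
        return hashlib.sha256(audio_bytes).hexdigest()

    def get_or_compute(self, audio_bytes, compute):
        key = self.key_for(audio_bytes)
        with self._lock:
            if key in self._results:
                return self._results[key]
        result = compute(audio_bytes)  # run outside the lock; it may be slow
        with self._lock:
            self._results[key] = result
        return result


class ConcurrencyLimit:
    """Allow at most max_parallel callers inside the guarded section."""

    def __init__(self, max_parallel):
        self._slots = threading.BoundedSemaphore(max_parallel)

    def try_run(self, func, *args):
        """Run func if a slot is free; return (ran, result)."""
        if not self._slots.acquire(blocking=False):
            return False, None
        try:
            return True, func(*args)
        finally:
            self._slots.release()
```

In a route, a False first element from try_run could be turned into an HTTP 429 or 503 response so that clients retry later. A real cache would also want a size limit and should probably skip storing results that contain an "error" key.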

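5. Real-World Use Cases

Once the speech recognition web service is running, it can be used in many practical scenarios:

5.1 Automated meeting notes

Upload meeting recordings to the service to generate transcripts automatically; paired with a summarization model, this makes it quick to produce meeting minutes.

5.2 Customer service call analysis

Collect customer service call recordings and transcribe them in bulk to analyze customer issues and evaluate service quality.

5.3 Education

Turn lecture recordings into text so students can review them; the text can also be used to produce subtitles.

5.4 Multimedia content processing

Generate subtitles for video files automatically, or process podcast episodes into written transcripts.

5.5 Multilingual support

Qwen3-ASR supports 52 languages and dialects, so it can handle multilingual content such as recordings of international conferences or foreign-language learning materials.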

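6. Summary

In this article we built a complete speech recognition web service. From environment setup and model download, through Flask application development and endpoint design, to the advanced extensions, I have tried to explain each step clearly.

In practice, Qwen3-ASR-1.7B's recognition accuracy is good, and its Chinese support is especially solid. Flask's lightweight nature also keeps development smooth, without much complicated configuration.

If you are new to this kind of project, get the basic functionality running first and make sure the service starts and recognizes audio correctly. Then add batch processing, asynchronous tasks, caching, and other advanced features as your needs require. When deploying to production, remember to use a production-grade server such as Gunicorn, and put proper error handling and logging in place.

Speech recognition is becoming more and more common, with use cases ranging from meeting notes to customer service analysis. With a self-hosted service you can integrate it flexibly into all kinds of systems without depending on a third-party API, which also gives you better control over data security.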
