[Qwen] train() Function Description - 育师

`train()` function documentation

train(attn_implementation='flash_attention_2')

Runs the main training loop for Qwen VL (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, or Qwen3-VL-MoE) instruction tuning.
Parses command-line arguments for model, data, and training config; loads the appropriate model class and processor; optionally applies LoRA or configures which modules to tune (vision encoder, MLP merger, LLM); builds the supervised data module and Hugging Face `Trainer`; runs training (with optional resume); then saves the final model and processor to `output_dir`.

Parameters

Name	Type	Default	Description
attn_implementation	`str`	`"flash_attention_2"`	Attention implementation passed to the model (e.g. `"flash_attention_2"` for Flash Attention 2).

Command-line arguments (parsed via `HfArgumentParser`)

ModelArguments
- `model_name_or_path` (str) – Hugging Face model ID or local path (e.g. `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen3-VL-8B-Instruct`). Used to select the model class (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, or Qwen3-VL-MoE).
- `tune_mm_llm` (bool) – Whether to train the language model (and `lm_head`).
- `tune_mm_mlp` (bool) – Whether to train the vision merger (MLP).
- `tune_mm_vision` (bool) – Whether to train the vision encoder.
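The name-based class selection described for `model_name_or_path` can be sketched as follows. This is an illustration rather than the script's actual logic: the helper `select_model_class_name` is hypothetical, the substring checks are assumptions, and in particular the MoE check assumes that MoE checkpoints carry an active-parameter tag such as `A3B` in their name. The function returns class names as strings so that it runs without `transformers` installed.

```python
import re


def select_model_class_name(model_name_or_path: str) -> str:
    """Pick a model class name from the model id (hypothetical helper)."""
    name = model_name_or_path.lower()
    if "qwen3" in name:
        # Assumption: MoE checkpoints are tagged like "-A3B" (active parameters).
        if re.search(r"-a\d+b", name):
            return "Qwen3VLMoeForConditionalGeneration"
        return "Qwen3VLForConditionalGeneration"
    if "qwen2.5" in name:
        return "Qwen2_5_VLForConditionalGeneration"
    # Fallback: treat everything else as Qwen2-VL.
    return "Qwen2VLForConditionalGeneration"


print(select_model_class_name("Qwen/Qwen3-VL-8B-Instruct"))
print(select_model_class_name("Qwen/Qwen2.5-VL-3B-Instruct"))
```

A local checkpoint path that does not contain the model family name would fall through to the Qwen2-VL branch in this sketch, so a real implementation may need a more robust check.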
DataArguments
- `dataset_use` (str) – Comma-separated dataset names (with optional `%N` sampling, e.g. `dataset1%50`).
- `data_flatten` (bool) – Whether to flatten/concatenate batch sequences.
- `data_packing` (bool) – Whether to use packed data (requires preprocessing with `pack_data.py`).
- `max_pixels` (int) – Maximum image pixels (default `28*28*576`).
- `min_pixels` (int) – Minimum image pixels (default `28*28*16`).
- `video_max_frames`, `video_min_frames`, `video_max_pixels`, `video_min_pixels`, `video_fps` – Video sampling and resolution settings.
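To make the `dataset_use` format concrete, here is a minimal sketch of how such a string could be split into names and sampling rates. The helper `parse_dataset_use` is hypothetical, and the reading of `%N` as "sample N percent of the dataset" is an assumption. The real data module may parse the string differently.

```python
from typing import List, Tuple


def parse_dataset_use(dataset_use: str) -> List[Tuple[str, float]]:
    """Split 'a%50,b' into [('a', 0.5), ('b', 1.0)] (hypothetical helper)."""
    entries = []
    for item in dataset_use.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas
        name, sep, percent = item.partition("%")
        # Assumption: "%N" means keep N percent; no suffix means the full dataset.
        rate = float(percent) / 100.0 if sep else 1.0
        entries.append((name, rate))
    return entries


print(parse_dataset_use("dataset1%50,dataset2"))
```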
TrainingArguments (extends `transformers.TrainingArguments`)
- `cache_dir` (str, optional) – Cache directory for model/processor.
- `model_max_length` (int) – Maximum sequence length for the tokenizer.
- `lora_enable` (bool) – If `True`, apply LoRA and ignore `tune_mm_*` for the base model.
- `lora_r`, `lora_alpha`, `lora_dropout` – LoRA rank, alpha, and dropout.
- `mm_projector_lr`, `vision_tower_lr` – Optional learning rates for the projector and vision tower.
- Plus standard Trainer args: `output_dir`, `bf16`, `per_device_train_batch_size`, `gradient_accumulation_steps`, `learning_rate`, `num_train_epochs`, `save_steps`, `gradient_checkpointing`, `deepspeed`, etc.
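The interaction between `lora_enable` and the `tune_mm_*` flags can be summarized with a small sketch. The function `trainable_groups` and the group names in the returned dictionary are hypothetical. The assumption that the base model stays frozen under LoRA reflects how LoRA is commonly applied, and is not confirmed by this page.

```python
from typing import Dict


def trainable_groups(lora_enable: bool, tune_mm_vision: bool,
                     tune_mm_mlp: bool, tune_mm_llm: bool) -> Dict[str, bool]:
    """Report which parameter groups would be trained (hypothetical helper)."""
    if lora_enable:
        # tune_mm_* flags are ignored; assumption: only LoRA adapters are trained.
        return {"vision_encoder": False, "merger": False,
                "llm": False, "lora_adapters": True}
    return {"vision_encoder": tune_mm_vision, "merger": tune_mm_mlp,
            "llm": tune_mm_llm, "lora_adapters": False}


# Settings matching "--tune_mm_vision False --tune_mm_mlp True --tune_mm_llm True"
print(trainable_groups(False, tune_mm_vision=False, tune_mm_mlp=True, tune_mm_llm=True))
```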

Returns

None. Model and processor are saved under `training_args.output_dir`.

Notes

If `output_dir` already contains `checkpoint-*` directories, training is resumed with `resume_from_checkpoint=True`.
When `data_flatten` or `data_packing` is enabled, the Qwen2 VL attention class is replaced for compatibility.
Qwen3-VL MoE models use `Qwen3VLMoeForConditionalGeneration`; other Qwen3-VL models use `Qwen3VLForConditionalGeneration`; Qwen2.5-VL and Qwen2-VL use the corresponding classes inferred from `model_name_or_path`.
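The resume rule in the first note amounts to checking `output_dir` for `checkpoint-*` directories before calling the trainer. A minimal sketch, using only the standard library, is shown below. The helper name `has_checkpoints` is hypothetical, and the actual script may perform the check differently.

```python
import tempfile
from pathlib import Path


def has_checkpoints(output_dir: str) -> bool:
    """Return True if output_dir contains at least one checkpoint-* directory."""
    return any(p.is_dir() for p in Path(output_dir).glob("checkpoint-*"))


with tempfile.TemporaryDirectory() as tmp:
    fresh = has_checkpoints(tmp)            # empty directory: start from scratch
    (Path(tmp) / "checkpoint-500").mkdir()  # simulate a saved checkpoint
    resumed = has_checkpoints(tmp)          # now training would resume

print(fresh, resumed)  # False True
```

In a training script the result would decide whether to pass `resume_from_checkpoint=True` to `Trainer.train()`.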

Example

# Typical usage: arguments are passed via command line (e.g. from scripts/sft_qwen3_4b.sh)
torchrun --nproc_per_node=4 qwenvl/train/train_qwen.py \
    --model_name_or_path Qwen/Qwen3-VL-8B-Instruct \
    --dataset_use my_dataset \
    --data_flatten True \
    --tune_mm_vision False --tune_mm_mlp True --tune_mm_llm True \
    --output_dir ./output \
    --bf16 --per_device_train_batch_size 4 --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 --num_train_epochs 0.5

# Programmatic call (still requires sys.argv or explicit parse for HfArgumentParser)
from qwenvl.train.train_qwen import train

train(attn_implementation="flash_attention_2")