
CANN/cann-recipes-train: Qwen3-30B-A3B Medical SFT Training Example

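The cann-recipes-train project provides CANN-based optimization samples for typical models and acceleration algorithms used in LLM and multimodal model training. Project address: https://gitcode.com/cann/cann-recipes-train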

This example uses the torchtitan-npu framework to fine-tune Qwen3-30B-A3B on a medical-domain SFT task. Training effectiveness is measured via Keyword Recall on medical Q&A samples.

The Medical R1 dataset (question/think/answer three-field format) is used for training. The MoE parallelism config (EP=8) enables full-parameter fine-tuning on a single node with 16 cards. Evaluation uses vLLM + vLLM-Ascend to compare the base model, CPT checkpoint, and SFT model under the same conditions.

Supported Products

| Item | Spec |
| --- | --- |
| Product | Atlas A3 series |
| Recommended cards | 16 (EP=8) |
| CANN version | 9.0.0 |
| Python | 3.11 |
| Training framework | torchtitan-npu |
| Inference framework | vLLM + vLLM-Ascend |

Files

| File | Description |
| --- | --- |
| `README_EN.md` | This document |
| `README.md` | Chinese documentation |
| `config_registry_medical.py` | torchtitan-npu Qwen3-30B-A3B medical SFT config |
| `run_medical_sft.sh` | Training launch script (copy to the torchtitan-npu directory before running) |
| `prepare_medical_r1_dataset.py` | Medical R1 dataset split tool |
| `figures/training_loss.png` | Training loss curve (Epoch 1-5, optimal at step 156) |

Environment Setup

1. Docker Container

Use an Ascend training image with CANN 9.0.0 and Python 3.11 pre-installed. Example for single-node 16-card setup:

    docker run -itd \
        --device=/dev/davinci0 --device=/dev/davinci1 \
        --device=/dev/davinci2 --device=/dev/davinci3 \
        --device=/dev/davinci4 --device=/dev/davinci5 \
        --device=/dev/davinci6 --device=/dev/davinci7 \
        --device=/dev/davinci_manager --device=/dev/devmm_svm \
        --device=/dev/hisi_hdc \
        -v /usr/local/dcmi:/usr/local/dcmi \
        -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
        -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
        -v /home:/home \
        -v /data:/data \
        --net=host \
        --shm-size=128g \
        --privileged \
        --name qwen3_30b_medical_sft \
        cann:9.0.0-a3-openeuler24.03-py3.11 \
        /bin/bash

Initialize CANN after entering the container. CANN paths vary by deployment method, so adjust them to match your environment:

    # Docker image default path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh

    # Conda installation path (example for CANN 9.0.0)
    source /home/developer/Ascend/cann-9.0.0/set_env.sh
    source /home/developer/Ascend/nnal/atb/set_env.sh

    # If the system libstdc++ is too old
    export LD_PRELOAD=/path/to/conda/envs/torchtitan/lib/libstdc++.so.6

2. Install torchtitan-npu

    git clone https://link.gitcode.com/i/a16fe6012169aa86df6ff4c2d4faa8cd.git
    cd torchtitan-npu
    pip install -r requirements.txt
    pip install -e .

Dataset

Download `r1_data_example.jsonl` from ModelScope and place it in the `assets` directory of torchtitan-npu:

    cd /path/to/torchtitan-npu
    mkdir -p assets
    # Manually download from
    # https://modelscope.cn/datasets/krisfu/delicate_medical_r1_data/files
    # to assets/
    ls assets/r1_data_example.jsonl

Then split using the provided script:

    python /path/to/recipe/prepare_medical_r1_dataset.py \
        --input ./assets/r1_data_example.jsonl \
        --output ./assets/medical_r1

Split result:

| Dataset | Samples | Use |
| --- | --- | --- |
| `train.jsonl` | ~2,166 | SFT training |
| `test.jsonl` | ~241 | Keyword Recall evaluation |
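The internals of `prepare_medical_r1_dataset.py` are not shown in this document. The sample counts above are consistent with a split of roughly 90/10, so a split of that kind could be sketched as follows. The function name `split_jsonl`, the 0.9 ratio, the shuffling, and the seed are all illustrative assumptions, not the recipe's actual behavior.

```python
import json
import random
from pathlib import Path


def split_jsonl(input_path, output_dir, train_ratio=0.9, seed=42):
    """Shuffle a JSONL file and write train.jsonl / test.jsonl (hypothetical helper)."""
    text = Path(input_path).read_text(encoding="utf-8")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for ln in lines:
        json.loads(ln)  # fail early on malformed records

    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(lines)
    n_train = int(len(lines) * train_ratio)

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "train.jsonl").write_text("\n".join(lines[:n_train]) + "\n", encoding="utf-8")
    (out / "test.jsonl").write_text("\n".join(lines[n_train:]) + "\n", encoding="utf-8")
    return n_train, len(lines) - n_train
```

Whatever the real script does, keeping the test split fixed matters here because the same 241 samples are reused to compare models in the evaluation section.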

Model Weights

Download the `Qwen3-30B-A3B` weights (~60 GB) from ModelScope and create a symlink in the torchtitan-npu source directory:

    pip install modelscope
    mkdir -p /data/models/Qwen3-30B-A3B
    modelscope download \
        --model Qwen/Qwen3-30B-A3B \
        --local_dir /data/models/Qwen3-30B-A3B

    cd /path/to/torchtitan-npu
    mkdir -p assets/hf
    ln -sf /data/models/Qwen3-30B-A3B assets/hf/Qwen3-30B-A3B

Training Configuration

Config Registration

Copy `config_registry_medical.py` to the torchtitan-npu source:

    cp /path/to/recipe/config_registry_medical.py \
        /path/to/torchtitan-npu/torchtitan_npu/models/qwen3/config_registry_medical.py

Then append the following to `torchtitan_npu/models/qwen3/config_registry.py`:

    from torchtitan_npu.models.qwen3.config_registry_medical import (
        sft_qwen3_30ba3b_medical,
        sft_qwen3_30ba3b_medical_tnd,
    )

Parallelism Strategy

Single-node 16-card MoE parallelism (CP=2, EP=8, TP=2):

| Parameter | Value | Description |
| --- | --- | --- |
| `NGPU` | 16 | Total cards |
| `context_parallel_degree` | 2 | Context parallelism |
| `tensor_parallel_degree` | 2 | Tensor parallelism |
| `expert_parallel_degree` | 8 | 128 experts sharded along EP |
| `pipeline_parallel_degree` | 1 | PP disabled |
| `data_parallel_shard_degree` | -1 | FSDP full shard (mesh size = 4) |
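The FSDP mesh size of 4 follows from the other degrees: a value of -1 leaves the shard degree to be inferred from the cards that remain after CP, TP, and PP are accounted for. The arithmetic can be sketched as below. The helper `derive_mesh` is hypothetical, and it assumes no separate data-parallel replicate dimension.

```python
def derive_mesh(ngpu, cp, tp, pp, n_experts, ep):
    """Derive the FSDP shard degree and experts per EP rank (illustrative helper)."""
    denom = cp * tp * pp
    if ngpu % denom:
        raise ValueError("NGPU must be divisible by CP * TP * PP")
    if n_experts % ep:
        raise ValueError("expert count must be divisible by EP")
    return {
        "dp_shard": ngpu // denom,              # inferred when the config value is -1
        "experts_per_ep_rank": n_experts // ep,  # experts hosted by each EP rank
    }


mesh = derive_mesh(ngpu=16, cp=2, tp=2, pp=1, n_experts=128, ep=8)
print(mesh)  # {'dp_shard': 4, 'experts_per_ep_rank': 16}
```

With the values in the table, each EP rank holds 16 of the 128 experts.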

Hyperparameters

| Config | Recommended Value | Description |
| --- | --- | --- |
| `steps` | 156 | Training steps (5 epochs, ~31 steps/epoch) |
| `lr` | 2e-5 | Learning rate |
| `warmup_steps` | 5 | Warmup steps |
| `local_batch_size` | 1 | Per-device batch size |
| `seq_len` | 4096 | Sequence length |
| `activation_checkpoint` | selective | Selective recomputation |
| `TRAIN_DATA` | Split training set | Training data path, set via the `TRAIN_DATA` env var |
| `MODEL_DIR` | `assets/hf/Qwen3-30B-A3B` | HF weights path |

Sample Format

This example uses the R1 think template format, wrapping the dataset's `think` field in `<think>` tags:

    def _process_sample(sample):
        output = f"<think>\n{sample['think']}\n</think>\n\n{sample['answer']}"
        return [
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": output},
        ]
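To make the resulting message layout concrete, here is the same function applied to a single record. The record is made up for illustration and is not taken from the dataset.

```python
def _process_sample(sample):
    output = f"<think>\n{sample['think']}\n</think>\n\n{sample['answer']}"
    return [
        {"role": "user", "content": sample["question"]},
        {"role": "assistant", "content": output},
    ]


# Made-up record for illustration only; not taken from the dataset.
record = {
    "question": "What are the two components of consciousness?",
    "think": "Recall the standard clinical description.",
    "answer": "Arousal and content.",
}

messages = _process_sample(record)
print(messages[1]["content"])
```

The assistant turn carries the reasoning inside a single `<think>` block, followed by a blank line and then the answer. The evaluation later checks model outputs for that structure.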

Attention Variants

| Config function | Attention type | Description |
| --- | --- | --- |
| `sft_qwen3_30ba3b_medical` | BSND (SDPA) | Reference only |
| `sft_qwen3_30ba3b_medical_tnd` | TND (NPUVarlenAttention) | Recommended, validated |

Use `CONFIG=sft_qwen3_30ba3b_medical_tnd` for the TND variant.

Training

Launch

Copy the launch script to the torchtitan-npu directory and execute:

    cp /path/to/recipe/run_medical_sft.sh /path/to/torchtitan-npu/
    cd /path/to/torchtitan-npu
    bash run_medical_sft.sh

The script uses the environment variables `NGPU=16` and `CONFIG=sft_qwen3_30ba3b_medical_tnd` for the TND variant. Example log output (EP=8):

    step:   1  loss: 1.45426  memory: 37.73GiB(61.58%)  tps:   59  69.018s (compilation)
    step:   2  loss: 1.39178  memory: 52.27GiB(85.31%)  tps:  798   5.135s
    step:   3  loss: 1.26931  memory: 52.31GiB(85.37%)  tps: 1215   3.370s
    step:  10  loss: 1.02183  memory: 52.44GiB(85.59%)  tps:  993   4.126s
    step:  20  loss: 0.95751  memory: 52.44GiB(85.59%)  tps: 1199   3.416s
    step:  31  loss: 0.70617  memory: 52.44GiB(85.59%)  tps: 1345   3.046s  ← epoch 1 end
    step:  32  loss: 0.67716  memory: 52.44GiB(85.59%)  tps:  701   5.842s
    step:  50  loss: 0.58786  memory: 52.50GiB(85.69%)  tps: 1010   4.056s
    step:  62  loss: 0.34057  memory: 52.56GiB(85.79%)  tps: 1177   3.479s  ← epoch 2 end
    step:  63  loss: 0.33076  memory: 52.56GiB(85.79%)  tps:  803   5.102s
    step:  90  loss: 0.19230  memory: 52.56GiB(85.79%)  tps:  733   5.590s
    step:  93  loss: 0.16940  memory: 52.56GiB(85.79%)  tps: 1014   4.040s  ← epoch 3 end
    step:  94  loss: 0.16507  memory: 52.56GiB(85.79%)  tps: 1286   3.185s
    step: 120  loss: 0.08754  memory: 52.62GiB(85.88%)  tps:  942   4.349s
    step: 124  loss: 0.08219  memory: 52.62GiB(85.88%)  tps: 1257   3.260s  ← epoch 4 end
    step: 125  loss: 0.08480  memory: 52.62GiB(85.88%)  tps: 1274   3.215s
    step: 150  loss: 0.04411  memory: 52.62GiB(85.88%)  tps:  918   4.462s
    step: 155  loss: 0.04376  memory: 52.62GiB(85.88%)  tps: 1199   3.416s
    step: 156  loss: 0.04450  memory: 52.62GiB(85.88%)  tps: 1244   3.292s  ← end (epoch 5)
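For tracking runs, lines like the ones above can be parsed into structured records. The sketch below targets the excerpt exactly as printed in this document; real torchtitan-npu logs may carry extra fields, timestamps, or color codes, so the regular expression is an assumption to adapt. The names `LINE_RE` and `parse_log_line` are illustrative.

```python
import re

# Matches the fields shown in the excerpt: step, loss, memory, tps, step time.
LINE_RE = re.compile(
    r"step:\s*(?P<step>\d+)\s+loss:\s*(?P<loss>[\d.]+)\s+"
    r"memory:\s*(?P<mem>[\d.]+)GiB\((?P<pct>[\d.]+)%\)\s+"
    r"tps:\s*(?P<tps>[\d,]+)\s+(?P<sec>[\d.]+)s"
)


def parse_log_line(line):
    """Return a dict of metrics for one log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "loss": float(m["loss"]),
        "memory_gib": float(m["mem"]),
        "tps": int(m["tps"].replace(",", "")),
        "seconds": float(m["sec"]),
    }


row = parse_log_line("step: 3 loss: 1.26931 memory: 52.31GiB(85.37%) tps: 1215 3.370s")
print(row["tps"], round(4096 / row["seconds"]))  # 1215 1215
```

The second printed value suggests how to read `tps`: in this excerpt it matches `seq_len * local_batch_size / step_time` (4096 tokens per step on each card), which points to a per-card rather than aggregate throughput.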

Training Loss Curve

Based on loss curve analysis, Epoch 5 (step 156) is the optimal stop: the loss decline flattens after step 150, and training beyond 186 steps (epoch 6+) enters the overfitting regime with no meaningful loss improvement.

Model Export

With `last_save_in_hf=True`, the final checkpoint is exported in HuggingFace format:

    mkdir -p /data/models/Qwen3-30B-A3B-SFT
    cp /data/models/Qwen3-30B-A3B/*.json /data/models/Qwen3-30B-A3B-SFT/
    cp /data/models/Qwen3-30B-A3B/tokenizer* /data/models/Qwen3-30B-A3B-SFT/
    cp checkpoint_medical/step-156/*.safetensors* /data/models/Qwen3-30B-A3B-SFT/

Evaluation Results

Evaluation Method

This experiment uses a jieba-based keyword extraction method with POS tagging (n, v, a, i, j, l) to extract keywords from both reference answers and model outputs, then computes:

  • Recall = matched reference keywords / total reference keywords
  • Precision = matched reference keywords / total model keywords
  • F1 = harmonic mean of Recall and Precision
  • Think Rate = proportion of outputs containing `<think>` reasoning
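The evaluation script itself is not included here. The scoring step defined above could be sketched as follows, assuming keywords have already been extracted upstream (for example with jieba POS tagging) and that matching is done over unique keywords. Both the set-based matching and the function names are assumptions.

```python
def keyword_scores(reference_keywords, model_keywords):
    """Recall, precision and F1 over unique keywords (set-based matching is an assumption)."""
    ref, out = set(reference_keywords), set(model_keywords)
    matched = len(ref & out)
    recall = matched / len(ref) if ref else 0.0
    precision = matched / len(out) if out else 0.0
    f1 = 2 * recall * precision / (recall + precision) if matched else 0.0
    return recall, precision, f1


def think_rate(outputs):
    """Fraction of outputs that contain a <think> tag."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum("<think>" in o for o in outputs) / len(outputs)


# Toy keyword lists, not real evaluation data.
scores = keyword_scores(
    ["arousal", "content", "consciousness", "cortex"],
    ["arousal", "content", "brain"],
)
print(tuple(round(s, 3) for s in scores))  # (0.5, 0.667, 0.571)
```

Long outputs tend to raise Recall while lowering Precision, since extra model keywords only enlarge the Precision denominator. That makes it worth reading the two columns together.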

Evaluation data: 241 medical Q&A samples. Base model, CPT intermediate checkpoint, and SFT model are compared under the same conditions.

Keyword Recall Comparison

| Model | Recall | Precision | F1 |
| --- | --- | --- | --- |
| Base (Qwen3-30B-A3B) | 53.83% | 25.16% | 33.30% |
| CPT Checkpoint (step 156) | 62.45% | 28.06% | 37.82% |
| Improvement | +8.62pp | +2.90pp | +4.52pp |

Output Format Comparison

| Metric | Base | CPT |
| --- | --- | --- |
| Avg output length | 1,061 chars | 831 chars (-21.7%) |
| Format errors (repeated `</think>`) | 199/241 | 9/241 |
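The exact rule behind the format-error count is not spelled out. A minimal sketch, assuming an output counts as an error when it contains more than one `</think>` closing tag, might look like this (the function names are hypothetical):

```python
def is_format_error(output):
    """Flag outputs with more than one </think> tag (assumed meaning of 'repeated')."""
    return output.count("</think>") > 1


def format_error_rate(outputs):
    """Fraction of outputs flagged by is_format_error."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(is_format_error(o) for o in outputs) / len(outputs)


# Toy outputs for illustration only.
samples = [
    "<think>\nreasoning\n</think>\n\nanswer",
    "</think> answer </think> more </think>",
]
print(format_error_rate(samples))  # 0.5
```

For scale, the counts in the table correspond to about 82.6% of base-model outputs (199/241) and about 3.7% of checkpoint outputs (9/241).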

Sample: "What are the two components of consciousness?"

| Item | Base Model | CPT Checkpoint |
| --- | --- | --- |
| Answer | `</think>` The components... arousal... content... (Markdown list + 3x `</think>`) | Consciousness consists of two parts: the content and the switch system... (conversational) |
| Recall | 52.4% | 95.2% |
| Length | 392 chars | 287 chars |

Training Metrics

| Metric | Value |
| --- | --- |
| Stable step time | ~3.2-3.5s |
| Stable memory | ~52.6 GiB/card (85.9%) |
| Loss start (step 1) | 1.45 |
| Loss end (step 156) | 0.045 |
| Total time (156 steps) | ~8-9 minutes |

Note: With CP=2, TP=2 the memory usage per card is ~52.6 GiB (85.9%).

FAQ

1. Loss starts abnormally high

If initial loss is significantly higher than expected (e.g., ~12), check whether HF pretrained weights were loaded correctly. Delete the checkpoint directory before re-running:

rm -rf checkpoint_medical
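The ~12 figure is what a model with effectively random weights would produce: uniform predictions over the vocabulary give a cross-entropy of ln(vocab size). A quick check, assuming the vocabulary size of 151,936 listed in the public Qwen3 `config.json` (confirm against the `config.json` under `assets/hf/Qwen3-30B-A3B`):

```python
import math

VOCAB_SIZE = 151_936  # assumption: vocab_size from the public Qwen3 config.json

# Cross-entropy of a uniform distribution over the vocabulary.
uniform_loss = math.log(VOCAB_SIZE)
print(f"{uniform_loss:.2f}")  # 11.93
```

An initial loss near this value therefore suggests the pretrained weights were not loaded, whereas the healthy run above starts at about 1.45.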

2. NPU out of memory

Check for residual processes occupying NPU memory and ensure `PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"` is set. If necessary, add `torch.npu.set_per_process_memory_fraction(1.0)` at the `entry.py` entry point.
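As a rough sketch of how both settings could be applied from Python at the start of the entry point: the helper name `lift_memory_cap` is hypothetical, and the block assumes `torch_npu` is installed on the training host (it degrades to a no-op elsewhere).

```python
import os

# Set before torch / torch_npu initialize the NPU allocator; keeps any existing value.
os.environ.setdefault("PYTORCH_NPU_ALLOC_CONF", "expandable_segments:True")


def lift_memory_cap(fraction=1.0):
    """Apply the per-process memory fraction if an NPU is usable (hypothetical helper)."""
    try:
        import torch
        import torch_npu  # noqa: F401  (registers the torch.npu namespace)
    except ImportError:
        return False
    if not torch.npu.is_available():
        return False
    torch.npu.set_per_process_memory_fraction(fraction)
    return True
```

Exporting the variable in the launch shell works just as well. The key point is that it is in place before the first NPU allocation.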

3. HCCL communication timeout

Multi-card training may trigger HCCL watchdog timeout. If intermittent, restarting training usually resolves it. If frequent, check HCCL network configuration and inter-node communication.
