Researchers fine-tune Qwen3-1.7B into MedQA on AMD Instinct MI300X using ROCm

5/8/2026, 8:26:27 AM

On May 8, 2026, Harikrishna (HK2184) published MedQA, a clinical multiple-choice question-answering model fine-tuned from Qwen3-1.7B entirely on AMD hardware using ROCm. The project demonstrates that LoRA fine-tuning can run on an AMD Instinct MI300X without CUDA. The model is published as HK2184/medqa-qwen3-lora, alongside a live demo and a public GitHub repository.

The experimental run used a single AMD Instinct MI300X, which provides 192 GB of HBM3. The team trained the ≈1.7 billion-parameter base model with LoRA in full fp16 and explicitly avoided 4-bit and 8-bit quantization. On a 2,000-sample slice of MedMCQA, training completed in roughly five minutes.
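To put the memory headroom in perspective, here is a back-of-envelope sketch of the fp16 weight footprint. It is an estimate, not a measurement from the project: it uses the nominal 1.7 billion parameter count, treats 192 GB as decimal gigabytes, and ignores activations, gradients, and optimizer state, which add to the total.

```python
# Back-of-envelope estimate only; none of these figures were measured in the project.
NOMINAL_PARAMS = 1.7e9   # nominal "1.7 billion" parameter count of the base model
BYTES_PER_FP16 = 2       # fp16 stores each value in 2 bytes
HBM_BYTES = 192e9        # 192 GB of HBM3, assumed to mean decimal gigabytes

weight_bytes = NOMINAL_PARAMS * BYTES_PER_FP16
weights_gb = weight_bytes / 1e9
share_of_hbm = weight_bytes / HBM_BYTES

print(f"fp16 weights: ~{weights_gb:.1f} GB")
print(f"share of 192 GB HBM3: ~{share_of_hbm:.1%}")
```

Under these assumptions the frozen weights take up only a few gigabytes, which is consistent with the decision to skip 4-bit and 8-bit quantization.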

From a tooling standpoint, the authors report that the standard Transformers/PEFT/TRL/Accelerate stack runs on ROCm without changes to the training code. The same script used for CUDA works on ROCm once three environment variables are set: os.environ["ROCR_VISIBLE_DEVICES"] = "0", os.environ["HIP_VISIBLE_DEVICES"] = "0", and os.environ["HSA_OVERRIDE_GFX_VERSION"] = "9.4.2". No custom kernels or CUDA compatibility shims were required.
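The environment setup described above could look like the following minimal sketch. The three variables and their values come from the article; the helper name `configure_rocm_env` and the `ROCM_ENV` dictionary are hypothetical, and placing the call before `import torch` is an assumption about safe ordering rather than something the authors state.

```python
import os

# Variable names and values as quoted in the article.
ROCM_ENV = {
    "ROCR_VISIBLE_DEVICES": "0",          # expose only device 0 to the ROCm runtime
    "HIP_VISIBLE_DEVICES": "0",           # expose only device 0 to HIP
    "HSA_OVERRIDE_GFX_VERSION": "9.4.2",  # GFX version override used in the article
}


def configure_rocm_env(env=os.environ):
    """Hypothetical helper: write the three variables into `env` and return it."""
    env.update(ROCM_ENV)
    return env


# Assumption: call this before importing torch so the runtime sees the values
# when it initializes.
configure_rocm_env()

# On a ROCm build of PyTorch, the usual torch.cuda API is backed by HIP,
# so device code such as model.to("cuda") can stay as it is.
```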

Fine-tuning used PEFT's LoRA adapter approach with LoraConfig parameters r=8, lora_alpha=16, lora_dropout=0.05, target_modules=["q_proj","v_proj"], and bias="none". After applying LoRA, the script reports 2,228,224 trainable parameters out of 1,543,901,184 total, or 0.1443%. Because the base weights stay frozen, memory and compute costs remain low.
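As a quick check of the figures above, the sketch below collects the quoted hyperparameters in a plain dictionary and recomputes the trainable percentage from the two reported counts. The dictionary name `lora_kwargs` and the helper `trainable_percentage` are hypothetical; the block does not import PEFT, so the comment showing how the dictionary would be passed to `LoraConfig` is an untested illustration.

```python
# Hyperparameters quoted in the article, kept as keyword arguments.
lora_kwargs = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
    "bias": "none",
}

# With peft installed, these would typically be used as:
#   from peft import LoraConfig, get_peft_model
#   model = get_peft_model(base_model, LoraConfig(**lora_kwargs))

# Parameter counts as reported by the training script in the article.
TRAINABLE_PARAMS = 2_228_224
TOTAL_PARAMS = 1_543_901_184


def trainable_percentage(trainable, total):
    """Return the share of trainable parameters as a percentage."""
    return 100.0 * trainable / total


# Standard LoRA scales the adapter update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

print(f"trainable: {trainable_percentage(TRAINABLE_PARAMS, TOTAL_PARAMS):.4f}%")
print(f"LoRA scaling factor: {scaling}")
```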

The team preserved consistent instruction formatting by using a single prompt template for training and inference. Each training example followed: "### Question: {question} ### Options: A) {opa} B) {opb} C) {opc} D) {opd} ### Answer: {answer_letter}) {answer_text} ### Explanation: {explanation}". During training the model sees the full sequence including answer and explanation; at inference the system provides text up to "### Answer:\n" and lets the model complete the answer and explanation.
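The shared template can be sketched as two small functions, one for training text and one for inference prompts. The field names come from the template quoted above; the function names, the exact line breaks between sections, and the sample record are assumptions made for illustration (the record is invented and not taken from MedMCQA). Building the training text from the inference prompt ensures the two can never drift apart.

```python
# Assumed layout: one section per line, with the prompt ending in "### Answer:\n".
PROMPT_PREFIX = (
    "### Question: {question}\n"
    "### Options: A) {opa} B) {opb} C) {opc} D) {opd}\n"
    "### Answer:\n"
)
COMPLETION = "{answer_letter}) {answer_text}\n### Explanation: {explanation}"


def build_inference_prompt(example):
    """Text shown to the model at inference: everything up to '### Answer:'."""
    return PROMPT_PREFIX.format(**example)


def build_training_text(example):
    """Full training sequence: the inference prompt plus answer and explanation."""
    return build_inference_prompt(example) + COMPLETION.format(**example)


# Invented record for illustration only.
example = {
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A",
    "opb": "Vitamin C",
    "opc": "Vitamin D",
    "opd": "Vitamin K",
    "answer_letter": "B",
    "answer_text": "Vitamin C",
    "explanation": "Vitamin C is required for collagen synthesis.",
}

print(build_training_text(example))
```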

The project challenges the default reliance on NVIDIA/CUDA in open‑source medical AI by showing ROCm as a practical alternative for builders with MI300X access. The MI300X’s 192 GB HBM3 reduces VRAM constraints, enabling fp16 fine‑tuning without aggressive quantization and potentially faster iteration. The authors stress this is a small‑scale proof (2,000 samples) and caution that clinical QA demands rigorous evaluation, safety checks, and substantially larger datasets before any clinical deployment.

Sources

Hugging Face Blog · 5/8/2026
