
On May 8, 2026, Harikrishna (HK2184) published MedQA, a clinical multiple-choice question-answering model fine-tuned from Qwen3-1.7B entirely on AMD hardware using ROCm. The project demonstrates that LoRA fine-tuning can run on an AMD Instinct MI300X without CUDA. The model is published as HK2184/medqa-qwen3-lora, alongside a live demo and a public GitHub repository.
The experimental run used one AMD Instinct MI300X device, which provides 192 GB of HBM3. The team trained the ≈1.7 billion-parameter base model with LoRA in full fp16 and explicitly avoided 4-bit and 8-bit quantization. Training on a 2,000-sample slice of MedMCQA completed in roughly five minutes on the MI300X.
From a tooling standpoint, the authors report that the standard Transformers/PEFT/TRL/Accelerate stack runs on ROCm without changes to the training logic. The same script used for CUDA works on ROCm once three environment variables are set: os.environ["ROCR_VISIBLE_DEVICES"] = "0", os.environ["HIP_VISIBLE_DEVICES"] = "0", and os.environ["HSA_OVERRIDE_GFX_VERSION"] = "9.4.2". No custom kernels or CUDA compatibility shims were required.
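The environment setup described above can be sketched as follows. This is a minimal illustration, not the authors' script: the helper names `configure_rocm_env` and `describe_backend` are hypothetical, and only the three variable names and values come from the post. The variables are set before PyTorch is imported, since device-visibility settings are generally read when the runtime initializes.

```python
import os

# The three variables and values reported in the post.
ROCM_ENV = {
    "ROCR_VISIBLE_DEVICES": "0",
    "HIP_VISIBLE_DEVICES": "0",
    "HSA_OVERRIDE_GFX_VERSION": "9.4.2",
}


def configure_rocm_env(env=None):
    """Write the ROCm variables into `env` (defaults to os.environ).

    Hypothetical helper: call it before importing torch.
    """
    if env is None:
        env = os.environ
    for name, value in ROCM_ENV.items():
        env[name] = value
    return {name: env[name] for name in ROCM_ENV}


def describe_backend():
    """Report which GPU backend PyTorch was built for, if it is installed."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    # ROCm builds of PyTorch expose a HIP version string and reuse the
    # torch.cuda namespace, which is why CUDA-style scripts can run as-is.
    hip_version = getattr(torch.version, "hip", None)
    if hip_version:
        return f"ROCm/HIP build {hip_version}, GPU available: {torch.cuda.is_available()}"
    return f"non-ROCm build, GPU available: {torch.cuda.is_available()}"


if __name__ == "__main__":
    configure_rocm_env()
    print(describe_backend())
```

Because ROCm builds of PyTorch present devices through `torch.cuda`, code that selects the `"cuda"` device typically needs no edits, which is consistent with the authors' report.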
Fine-tuning used PEFT's LoRA adapter approach with LoraConfig parameters r=8, lora_alpha=16, lora_dropout=0.05, target_modules=["q_proj","v_proj"], and bias="none". After LoRA is applied, the script reports 2,228,224 trainable parameters out of 1,543,901,184 total, a trainable share of 0.1443%. The base weights stay frozen, which keeps memory and compute costs low.
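A short sketch of how the reported configuration and the trainable share fit together. The five hyperparameters and the two parameter counts are taken from the post; `task_type="CAUSAL_LM"` and the helper names `trainable_percentage` and `build_lora_config` are assumptions added for illustration. The PEFT import is deferred so the arithmetic can be checked without the library installed.

```python
# LoRA hyperparameters as reported in the post.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
    "bias": "none",
}


def trainable_percentage(trainable, total):
    """Share of parameters that receive gradients, in percent."""
    return 100.0 * trainable / total


def build_lora_config():
    """Build a PEFT LoraConfig from the reported values.

    task_type is an assumption (causal language modeling); the post
    does not state it.
    """
    from peft import LoraConfig  # deferred: requires `pip install peft`
    return LoraConfig(task_type="CAUSAL_LM", **LORA_KWARGS)


if __name__ == "__main__":
    pct = trainable_percentage(2_228_224, 1_543_901_184)
    print(f"trainable%: {pct:.4f}")  # prints "trainable%: 0.1443"
    # Standard LoRA scales the adapter update by lora_alpha / r.
    print("scaling:", LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"])  # prints "scaling: 2.0"
```

With a real model loaded, `peft.get_peft_model(model, build_lora_config())` followed by `print_trainable_parameters()` is the usual way to obtain a report like the one quoted above.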
The team preserved consistent instruction formatting by using a single prompt template for training and inference. Each training example followed: "### Question: {question} ### Options: A) {opa} B) {opb} C) {opc} D) {opd} ### Answer: {answer_letter}) {answer_text} ### Explanation: {explanation}". During training the model sees the full sequence including answer and explanation; at inference the system provides text up to "### Answer:\n" and lets the model complete the answer and explanation.
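One way to keep the two formats in sync is to build the training text on top of the inference prompt, so the prompt is always an exact prefix of what the model saw in training. The sketch below assumes a newline after each `###` header and a blank line between sections; the post only confirms the newline after "### Answer:". The function names and the sample question are hypothetical.

```python
LETTERS = "ABCD"


def build_inference_prompt(question, options):
    """Text given to the model at inference, ending right after '### Answer:'."""
    opa, opb, opc, opd = options
    return (
        f"### Question:\n{question}\n\n"
        f"### Options:\nA) {opa}\nB) {opb}\nC) {opc}\nD) {opd}\n\n"
        "### Answer:\n"
    )


def build_training_text(question, options, answer_index, explanation):
    """Full training sequence: the inference prompt plus answer and explanation."""
    prompt = build_inference_prompt(question, options)
    answer_letter = LETTERS[answer_index]
    answer_text = options[answer_index]
    return (
        f"{prompt}{answer_letter}) {answer_text}\n\n"
        f"### Explanation:\n{explanation}"
    )


if __name__ == "__main__":
    # Made-up example for illustration only, not drawn from MedMCQA.
    options = ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"]
    text = build_training_text(
        "Deficiency of which vitamin causes scurvy?",
        options,
        answer_index=1,
        explanation="Vitamin C is required for collagen synthesis.",
    )
    print(text)
```

Deriving both strings from one function means a later change to the template cannot make training and inference drift apart.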
The project challenges the default reliance on NVIDIA/CUDA in open‑source medical AI by showing ROCm as a practical alternative for builders with MI300X access. The MI300X’s 192 GB HBM3 reduces VRAM constraints, enabling fp16 fine‑tuning without aggressive quantization and potentially faster iteration. The authors stress this is a small‑scale proof (2,000 samples) and caution that clinical QA demands rigorous evaluation, safety checks, and substantially larger datasets before any clinical deployment.