How to Train an SDXL LoRA Locally on an AMD GPU in 2025
Complete guide to training SDXL LoRAs on AMD GPUs using ROCm 6.2+ in 2025. Step-by-step setup with Kohya, optimal parameters, and troubleshooting for Radeon cards.
You have an AMD GPU and want to train SDXL LoRAs locally, but every guide assumes NVIDIA hardware, and SDXL's higher VRAM requirements make you wonder whether your Radeon card can handle it at all. Training SDXL LoRAs on AMD GPUs is absolutely possible in 2025 thanks to ROCm improvements, but it requires more VRAM than SD 1.5 and specific configuration adjustments for the dual text encoder architecture.
Quick Answer: Training SDXL LoRAs on AMD GPUs requires 16GB+ VRAM minimum (24GB recommended), ROCm 6.2+, Python 3.10, and Kohya's sd-scripts with AMD-specific configuration. Key differences from SD 1.5 include 1024x1024 training resolution, a tokenizer path fix for the dual CLIP encoders, aggressive caching to manage memory, and longer training times. The RX 7900 XTX (24GB) handles SDXL comfortably, the RX 6800 XT and 6900 XT (16GB) require careful optimization, and cards under 16GB cannot train SDXL LoRAs reliably.
- 16GB VRAM absolute minimum, 20-24GB recommended for comfortable training
- ROCm 6.2+ with the PyTorch ROCm 6.3 build required (same as SD 1.5)
- Must fix tokenizer paths in sdxl_train_util.py for training to work
- 1024x1024 training resolution quadruples the pixel count versus SD 1.5's 512x512
- Aggressive caching and batch size 1 essential for 16GB cards
 
What Makes SDXL LoRA Training Different on AMD GPUs?
SDXL introduces architectural complexity beyond SD 1.5 that affects AMD GPU training. Understanding these differences helps you configure appropriately and avoid common failures.
The dual text encoder architecture uses both OpenAI's CLIP-ViT-L and OpenCLIP's bigG encoder. This dual encoding provides richer text understanding but doubles the text processing memory footprint. SD 1.5 uses a single CLIP encoder, making SDXL inherently more memory-intensive before considering the larger UNet.
Resolution requirements increase from 512x512 to 1024x1024 as SDXL's native training resolution. This quadruples the pixel count, dramatically increasing latent generation and VAE encoding costs. While you can train at lower resolutions, SDXL LoRAs work best when trained near the model's native resolution.
Model size grows substantially with SDXL's UNet containing approximately 2.6 billion parameters versus SD 1.5's 860 million. This 3x parameter increase translates directly to higher VRAM requirements for model weights, activations, and gradients during training.
The tokenizer configuration issue specifically affects SDXL training on AMD with Kohya. The default tokenizer path for the second encoder points to a model that often fails to download reliably. You must manually edit sdxl_train_util.py to change both TOKENIZER1_PATH and TOKENIZER2_PATH to "openai/clip-vit-large-patch14" before training works.
- RX 7900 XTX (24GB) handles SDXL comfortably with standard settings
- RX 7900 XT (20GB) works well with moderate optimization
- RX 6800 XT/6900 XT (16GB) requires aggressive caching and batch size 1
- Cards under 16GB cannot reliably train SDXL LoRAs
- Training takes 40-60% longer than on equivalent NVIDIA cards
 
VRAM consumption during SDXL training typically reaches 18-22GB with standard settings. This puts 16GB cards at the absolute edge, requiring every optimization technique. 20-24GB cards provide comfortable headroom for reasonable batch sizes and less aggressive caching.
Training time increases proportionally to the added complexity. Where SD 1.5 LoRA training might take 1-2 hours on an RX 7900 XTX, equivalent SDXL training takes 2-4 hours. The larger model, higher resolution, and dual text encoders all contribute to slower iteration.
For users wanting SDXL image generation without training custom LoRAs, platforms like Apatero.com provide access to professionally trained SDXL models through optimized interfaces.
How Do You Set Up Your Environment for SDXL Training?
SDXL training setup mirrors SD 1.5 setup with identical software requirements. If you already configured your environment for SD 1.5 LoRA training on AMD, you can use the same setup for SDXL.
ROCm 6.2 or newer remains the requirement, with ROCm 6.3 providing the best performance and compatibility. Verify your installation with rocm-smi showing your GPU correctly, and set HSA_OVERRIDE_GFX_VERSION appropriately for your architecture (11.0.0 for RDNA 3, 10.3.0 for RDNA 2), as shown below.
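For reference, a minimal shell sketch of that check, assuming an RDNA 3 card (swap in 10.3.0 for RDNA 2):

```bash
# Tell ROCm to treat the GPU as RDNA 3 class; use 10.3.0 for RDNA 2 cards
export HSA_OVERRIDE_GFX_VERSION=11.0.0

# Confirm ROCm sees the GPU
rocm-smi
```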
Python 3.10 provides optimal compatibility with current training scripts and ROCm-enabled PyTorch. Create a virtual environment specifically for SDXL training or reuse your SD 1.5 training environment if you have one configured.
PyTorch installation uses the ROCm 6.3 build, which properly interfaces with AMD GPUs through ROCm. After activating your venv, install it as shown below.
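A minimal sketch of the venv creation and install, assuming Python 3.10 is available as python3.10; the environment name sdxl-train is just an example:

```bash
# Create and activate a dedicated venv ("sdxl-train" is an illustrative name)
python3.10 -m venv sdxl-train
source sdxl-train/bin/activate

# Install the ROCm 6.3 build of PyTorch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
```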
Kohya sd-scripts installation follows the standard process. Clone from GitHub, install requirements, configure Accelerate for single-machine training with fp16 precision, and install additional dependencies like tensorflow-rocm and onnxruntime-rocm as detailed in the SD 1.5 guide.
The critical SDXL-specific fix involves editing ./sd-scripts/library/sdxl_train_util.py. Open this file and locate the TOKENIZER1_PATH and TOKENIZER2_PATH variables near the top. Change both to "openai/clip-vit-large-patch14". The default path for TOKENIZER2_PATH points to "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k" which cannot be reliably located, causing training to fail with tokenizer errors.
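One way to apply the edit from the shell, assuming the default TOKENIZER2_PATH value in your copy of the file matches the laion string below; check the file first, since defaults can change between sd-scripts versions:

```bash
# Swap the unreliable bigG tokenizer path for the OpenAI CLIP path
sed -i 's|laion/CLIP-ViT-bigG-14-laion2B-39B-b160k|openai/clip-vit-large-patch14|g' \
  ./sd-scripts/library/sdxl_train_util.py

# Verify both TOKENIZER*_PATH variables now point at openai/clip-vit-large-patch14
grep -n "TOKENIZER" ./sd-scripts/library/sdxl_train_util.py
```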
Storage requirements increase for SDXL. The base SDXL model weighs 6-7GB. Training datasets at 1024x1024 consume more space than 512x512 datasets. Cache files grow proportionally to resolution. Budget 150-200GB minimum for comfortable SDXL training workflows with multiple experiments.
Verification after setup involves testing PyTorch GPU detection as with SD 1.5. Import torch, check torch.cuda.is_available() returns True, and verify your AMD GPU appears with torch.cuda.get_device_name(0).
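As a one-liner (ROCm builds of PyTorch expose the GPU through the torch.cuda API, so this check works unchanged on AMD):

```bash
# Should print True followed by your GPU name, e.g. an RX 7900 XTX
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
```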
What Training Parameters Work Best for SDXL on AMD?
SDXL training parameters require adjustment from SD 1.5 defaults to account for higher resolution and memory requirements. These configurations optimize for AMD GPU characteristics.
Batch size must be 1 for most 16GB cards even with aggressive optimization. Cards with 20-24GB can experiment with batch size 2 but may still need batch size 1 depending on network dimensions and caching strategies. Start with 1 and only increase if VRAM monitoring shows substantial headroom.
Mixed precision with fp16 or bf16 is mandatory for SDXL training on AMD. Full precision fp32 is impractical due to memory requirements. Use --mixed_precision="fp16" as standard. Some users report slightly better quality with bf16 if your GPU supports it well.
Learning rate for SDXL typically ranges from 1e-4 to 5e-5. The larger model sometimes benefits from slightly lower learning rates than SD 1.5. Start with 1e-4 and reduce to 5e-5 or 8e-5 if you observe artifacts or instability. Use cosine scheduler with warmup for smooth training dynamics.
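For illustration, the learning-rate portion of a Kohya launch might look like the fragment below; the warmup step count is an arbitrary example, not a recommendation from this guide:

```bash
# Fragment of a Kohya launch command: starting LR with cosine schedule and warmup
--learning_rate=1e-4 \
--lr_scheduler="cosine" \
--lr_warmup_steps=100
```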
- Batch size: 1 (mandatory for 16GB, recommended for 20GB)
 - Mixed precision: fp16 (mandatory)
 - Learning rate: 1e-4 to 5e-5 with cosine scheduler
 - Network dimension: 32-64 (lower than SD 1.5 due to memory)
 - Network alpha: 16-32 (half of dimension)
 - Resolution: 1024x1024 standard, 896x896 minimum
 - Max epochs: 10-20 (fewer than SD 1.5 due to slower training)
 
Network dimension (LoRA rank) often runs lower for SDXL than SD 1.5 due to memory constraints. Where SD 1.5 commonly uses 64-128, SDXL on AMD GPUs works better with 32-64. For 16GB cards, dimension 32 or 48 may be necessary. For 24GB cards, dimension 64 works comfortably. Higher dimensions risk OOM errors or excessively slow training.
Network alpha follows the half-dimension rule. For dimension 32, use alpha 16. For dimension 64, use alpha 32. This ratio provides good learning dynamics for most subjects.
Resolution for SDXL centers on 1024x1024 as the model's native training size. You can train at 896x896 or 960x960 to save memory, but quality may suffer slightly. Going below 896x896 significantly degrades LoRA quality. For 16GB cards, 896x896 or 960x960 represents a reasonable compromise. For 20GB+ cards, use full 1024x1024.
Max epochs decrease compared to SD 1.5 because SDXL's larger capacity learns faster and training takes longer per epoch. Where SD 1.5 might train for 20-30 epochs, SDXL typically needs 10-20 epochs for similar dataset sizes. Monitor sample images to catch optimal stopping point before overfitting.
Caching becomes absolutely critical for SDXL memory management. Enable all caching options. Use --cache_latents --cache_latents_to_disk --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk. For dual text encoders, caching text outputs saves substantial VRAM. Disk caching trades storage and some speed for dramatically lower memory usage.
The no_half_vae flag remains important with --no_half_vae preventing numerical instabilities in VAE operations. SDXL's VAE has similar precision sensitivities as SD 1.5.
Optimizer choice affects memory significantly. AdamW8bit uses less memory than standard AdamW with minimal quality impact. For tight VRAM situations, specify --optimizer_type="AdamW8bit" to save memory. Adafactor saves even more memory but with adaptive learning rate side effects.
Gradient checkpointing with --gradient_checkpointing trades compute for memory by recomputing activations during backpropagation rather than storing them. This technique saves substantial VRAM on SDXL training, enabling training on 16GB cards. The speed penalty is acceptable given the memory savings.
How Do You Prepare Datasets for SDXL Training?
Dataset preparation for SDXL follows similar principles to SD 1.5 but with resolution considerations. Proper preparation significantly impacts training outcomes.
Image collection should target SDXL's higher resolution capabilities. Gather high-quality source images, ideally at 1024x1024 or larger. Lower-resolution images upscaled to 1024x1024 work, but they don't fully exploit SDXL's capabilities.
Image count remains similar to SD 1.5 guidelines. Collect 15-40 high-quality images for most subjects. SDXL's larger capacity learns from slightly smaller datasets than SD 1.5 in some cases, but quality still matters more than quantity.
Folder structure uses the same {repeat_count}_{trigger_word} convention. For example, 10_charactername trains on images in that folder 10 times per epoch. Adjust repeat counts to balance training emphasis across different image categories.
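A sketch of that layout, with illustrative names (charactername stands in for your trigger word):

```bash
# Example dataset layout for Kohya's sd-scripts
mkdir -p train/10_charactername reg/1_person output
# train/10_charactername/  -> training images plus matching .txt caption files
#   e.g. train/10_charactername/img001.png and img001.txt
# reg/1_person/            -> regularization images of the subject's class
```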
- Source images 1024x1024+ resolution preserve SDXL quality advantages
 - 15-40 high-quality images typically sufficient due to SDXL's larger capacity
 - Captioning slightly more important due to dual text encoder architecture
 - Regularization images help prevent overfitting on powerful SDXL base
 
Captioning gains importance with SDXL's dual text encoders. The model understands more nuanced descriptions and benefits from detailed, accurate captions. Spend time writing quality captions that describe subjects, actions, settings, styles, and relevant details. The improved text understanding makes good captions more impactful.
Regularization images remain important for SDXL training. The model's power makes overfitting a real risk without regularization. Use 30-100 regularization images of your subject's class (people, objects, styles) at 1024x1024 resolution in a folder like 1_person or 1_art.
Bucketing handles various aspect ratios automatically, but SDXL benefits from training data that matches your intended use cases. If you'll generate 9:16 vertical images, include vertical training images. If you focus on landscape 16:9, prioritize landscape training images.
Preprocessing considerations include ensuring adequate lighting and detail in source images. SDXL can generate fine details, so training images should show the details you want the LoRA to learn. Blurry or low-quality training images waste SDXL's capabilities.
What Is the Complete Training Command for SDXL?
Launching SDXL training involves a long command with numerous parameters. Understanding each component helps you adjust for your specific needs.
A typical SDXL training command for AMD GPUs looks like this (note that xformers is omitted: it is generally unavailable on ROCm, and Kohya's --xformers is an on/off switch rather than a flag that accepts =False):

```bash
accelerate launch --mixed_precision="fp16" sdxl_train_network.py \
  --pretrained_model_name_or_path="/path/to/sdxl_base_1.0.safetensors" \
  --train_data_dir="./train" \
  --reg_data_dir="./reg" \
  --output_dir="./output" \
  --output_name="mySDXL_LoRA" \
  --network_module="networks.lora" \
  --network_dim=48 \
  --network_alpha=24 \
  --learning_rate=8e-5 \
  --max_train_epochs=15 \
  --save_every_n_epochs=3 \
  --train_batch_size=1 \
  --max_token_length=225 \
  --cache_latents \
  --cache_latents_to_disk \
  --cache_text_encoder_outputs \
  --cache_text_encoder_outputs_to_disk \
  --no_half_vae \
  --mixed_precision="fp16" \
  --optimizer_type="AdamW8bit" \
  --gradient_checkpointing \
  --persistent_data_loader_workers \
  --resolution="1024,1024"
```
Key parameter explanations include max_token_length set to 225 instead of SD 1.5's 75. SDXL supports longer text encoding, and 225 tokens accommodates more detailed captions. The resolution parameter explicitly sets 1024x1024 training size.
The sdxl_train_network.py script name differs from SD 1.5's train_network.py. Ensure you use the SDXL-specific script which handles the dual text encoder architecture properly.
All caching options are enabled with disk storage to minimize VRAM usage. This configuration is critical for 16GB cards and beneficial even for larger cards.
Gradient checkpointing trades speed for memory, essential for most AMD GPU training scenarios with SDXL. The speed penalty is acceptable given the ability to train on available hardware.
Monitor VRAM usage during initial training with rocm-smi in another terminal. If usage approaches your card's limit, reduce network dimension, enable more aggressive optimizations, or lower resolution to 960x960 or 896x896.
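A simple way to do that, using rocm-smi's VRAM report on a two-second refresh:

```bash
# Refresh VRAM usage every 2 seconds while training runs in another terminal
watch -n 2 rocm-smi --showmeminfo vram
```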
Training time expectations for this configuration on an RX 7900 XTX with a 25-image dataset at 15 epochs range from 3-5 hours. RX 6800 XT with 16GB takes 5-8 hours for similar training due to tighter memory requiring more conservative settings.
Sample generation during training uses --sample_every_n_epochs=3 --sample_prompts="./sdxl_sample_prompts.txt" to generate test images periodically. Create sdxl_sample_prompts.txt with prompts that exercise your LoRA. SDXL sample generation itself consumes significant VRAM, so generate samples less frequently than with SD 1.5 (every 3 epochs instead of every 2).
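A minimal sdxl_sample_prompts.txt might look like this; the prompts and trigger word are illustrative, and --w/--h/--s/--d are Kohya's per-prompt width, height, step count, and seed options:

```bash
# Write two illustrative test prompts (replace charactername with your trigger word)
cat > sdxl_sample_prompts.txt <<'EOF'
charactername portrait, detailed face, studio lighting --w 1024 --h 1024 --s 28 --d 1
charactername full body, city street at night --w 1024 --h 1024 --s 28 --d 2
EOF
```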
Frequently Asked Questions
What's the minimum VRAM for SDXL LoRA training on AMD?
16GB represents the absolute minimum with aggressive optimization including fp16 precision, batch size 1, network dimension 32-48, all caching to disk, gradient checkpointing, and possibly reducing resolution to 896x896. The RX 6800 XT and 6900 XT with 16GB can train SDXL LoRAs but require careful configuration. For comfortable training, 20GB (RX 7900 XT) or 24GB (RX 7900 XTX) is recommended. Cards under 16GB cannot reliably train SDXL LoRAs.
Why does my SDXL training fail with tokenizer errors?
This is the most common SDXL training failure on AMD. Edit ./sd-scripts/library/sdxl_train_util.py and change both TOKENIZER1_PATH and TOKENIZER2_PATH to "openai/clip-vit-large-patch14". The default path for TOKENIZER2_PATH points to a model that fails to download reliably. This fix is mandatory for SDXL training on AMD with Kohya's scripts. After editing, save the file and restart training.
Can I train at 512x512 to save VRAM?
You can technically train SDXL at 512x512, but quality suffers significantly. SDXL's architecture trained at 1024x1024, and LoRAs trained at much lower resolutions don't transfer well. Minimum practical resolution is 896x896 for acceptable quality. If your card can't handle 896x896, it's too small for effective SDXL LoRA training. Consider cloud GPU rental or focusing on SD 1.5 instead.
How much longer does SDXL training take versus SD 1.5 on AMD?
SDXL training typically takes 2-3x longer than SD 1.5 for equivalent epochs and dataset sizes. The larger model, higher resolution (4x pixels), and dual text encoders all contribute. On an RX 7900 XTX, SD 1.5 LoRA training might take 1.5-2 hours while SDXL takes 3-5 hours for comparable configurations. On 16GB cards, the gap widens further due to more aggressive memory optimizations slowing SDXL training.
Should I use AdamW or AdamW8bit for SDXL?
AdamW8bit is recommended for most SDXL training on AMD GPUs. It uses less memory than standard AdamW with minimal quality impact, enabling higher network dimensions or less aggressive caching. Only use standard AdamW if you have 24GB VRAM and want to experiment with potential marginal quality gains. For 16GB cards, AdamW8bit is essentially mandatory.
Can I use the same LoRA rank as SD 1.5?
SDXL typically uses lower LoRA ranks than SD 1.5 due to VRAM constraints. Where SD 1.5 commonly uses rank 64-128, SDXL on AMD works better with rank 32-64. Start with rank 48 for balanced results. Only increase to 64 if you have 24GB VRAM and want maximum LoRA capacity. For 16GB cards, rank 32 may be necessary. SDXL's larger base model means lower-rank LoRAs still provide significant adaptation capability.
What if I get OOM errors during SDXL training?
Reduce network dimension to 32, enable all caching with disk storage, use AdamW8bit optimizer, enable gradient checkpointing if not already active, reduce resolution to 960x960 or 896x896, ensure batch size is 1, close other GPU applications, and verify mixed precision is set to fp16. If OOM errors persist after all optimizations, your GPU likely lacks sufficient VRAM for SDXL training.
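Pulled together, a low-VRAM flag set reflecting those suggestions might look like this fragment (a sketch of the recommendations above, not a guaranteed fix):

```bash
# Fragment of a Kohya launch: most aggressive memory-saving configuration
--network_dim=32 --network_alpha=16 \
--train_batch_size=1 \
--mixed_precision="fp16" \
--optimizer_type="AdamW8bit" \
--gradient_checkpointing \
--cache_latents --cache_latents_to_disk \
--cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
--resolution="896,896"
```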
Do SDXL LoRAs work with SD 1.5 models?
No, SDXL LoRAs are not compatible with SD 1.5 models due to completely different architectures. SDXL's dual text encoders, larger UNet, and different VAE make the models incompatible. You must train separate LoRAs for SD 1.5 and SDXL. However, once trained, SDXL LoRAs work with all SDXL-based models and checkpoints (SDXL base, refiner, and community fine-tunes).
Can I train SDXL LoRAs faster by reducing epochs?
Reducing epochs lowers training time but risks undertrained LoRAs that don't capture your subject well. SDXL typically needs 10-20 epochs for good results with standard dataset sizes. Going below 10 epochs often produces weak LoRAs. Instead of reducing epochs, optimize per-epoch speed with proper caching and settings, or accept the longer training time as necessary for quality results. SDXL's complexity requires sufficient training.
What batch size can I use with 24GB VRAM?
With 24GB VRAM on cards like RX 7900 XTX, batch size 2 is possible for SDXL training with moderate network dimensions (48-64) and standard caching. Some configurations allow batch size 4 with very aggressive caching and lower network dimensions, but gains are minimal and instability risks increase. Batch size 1 works reliably and produces good results, so many users stick with it even on 24GB cards. Experiment carefully if trying larger batches.
Succeeding with SDXL LoRA Training on AMD Hardware
SDXL LoRA training on AMD GPUs requires more resources and careful optimization compared to SD 1.5, but remains practical for users with adequate hardware. The 16GB minimum VRAM threshold limits accessibility compared to SD 1.5's 12GB minimum, but RX 7900 series cards handle SDXL training capably.
The mandatory tokenizer fix addresses a specific compatibility issue that would otherwise prevent training entirely. This small edit to two variables in sdxl_train_util.py is non-obvious but critical, illustrating the AMD-specific considerations that differ from NVIDIA workflows.
Aggressive caching strategies, gradient checkpointing, 8-bit optimizers, and batch size 1 form the optimization toolkit that makes SDXL training work on AMD hardware with limited VRAM. Understanding which optimizations provide the best memory-speed tradeoffs helps you configure appropriately.
Training times of 3-8 hours depending on hardware represent substantial investments compared to SD 1.5's 1-3 hours, but remain practical for users training occasionally. The quality improvements SDXL provides over SD 1.5 justify the additional time and resource requirements for many use cases.
For users wanting SDXL image generation without training custom LoRAs, platforms like Apatero.com provide access to professionally trained SDXL models through optimized interfaces, eliminating setup and training complexity entirely.
As ROCm continues maturing and AMD's AI compute presence grows, expect SDXL training workflows to become smoother with better performance parity to NVIDIA solutions. The foundations established in 2025 position AMD GPUs as viable platforms for SDXL LoRA training in the Stable Diffusion ecosystem.