SD 3.5 Large LoRA Training Locally - Complete Guide
Train Stable Diffusion 3.5 Large LoRAs on local hardware with optimized settings for consumer GPUs and professional quality results
Stable Diffusion 3.5 Large represents a significant architectural leap over SDXL, featuring three text encoders including the massive T5-XXL, a DiT (Diffusion Transformer) backbone, and rectified flow training objectives. These improvements translate to better prompt adherence, superior text rendering, and enhanced compositional understanding. However, SD 3.5 LoRA training on consumer hardware presents genuine challenges that require careful configuration to overcome. The good news is that with proper optimization, an RTX 4090 can produce high-quality SD 3.5 Large LoRAs that take full advantage of the model's capabilities. If you're new to LoRA training concepts, our comprehensive troubleshooting guide covers common issues you may encounter during SD 3.5 LoRA training.
The memory demands stem from SD 3.5's architecture. Where SDXL uses two text encoders (CLIP-L and CLIP-G) totaling around 1.3GB, SD 3.5 adds T5-XXL at approximately 10GB. The diffusion backbone itself is larger than SDXL's UNet, with more parameters and deeper attention stacks. During SD 3.5 LoRA training, you need memory for the model, gradients, optimizer states, and activations. Without optimization, these requirements reach roughly 40GB of VRAM or more, well beyond consumer hardware. With the techniques in this guide, you'll configure SD 3.5 LoRA training that completes successfully on 24GB cards while producing LoRAs that capture your subjects and styles effectively. For understanding VRAM optimization techniques that apply to SD 3.5 LoRA training, see our VRAM optimization flags guide.
Understanding SD 3.5 Large Architecture for Training
Effective SD 3.5 LoRA training optimization requires understanding what consumes memory and why SD 3.5 behaves differently from previous Stable Diffusion versions during training. That architectural knowledge is what makes training on consumer hardware feasible.
The Triple Text Encoder System
SD 3.5 Large uses three text encoders that work together to process prompts. CLIP-L provides short-range token relationships and basic semantic understanding. CLIP-G adds broader context and style understanding. T5-XXL delivers deep language comprehension, enabling better understanding of complex prompts and relationships between concepts.
During SD 3.5 LoRA training, all three encoders process your image captions to create conditioning embeddings. The encoders themselves aren't trained during LoRA training (you're only training the diffusion backbone), but they must be loaded into memory for the forward pass that creates these embeddings. T5-XXL alone consumes around 10GB in FP16, which is why encoder offloading is critical for consumer GPUs running SD 3.5 LoRA training.
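If you want to see these sizes on your own machine, the short diffusers sketch below counts each encoder's parameters (an illustration, assuming a recent diffusers release with SD3 support and access to the gated stabilityai/stable-diffusion-3.5-large weights; the pipeline loads into system RAM here):

from diffusers import StableDiffusion3Pipeline
import torch

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
)
for name in ("text_encoder", "text_encoder_2", "text_encoder_3"):   # CLIP-L, CLIP-G, T5-XXL
    encoder = getattr(pipe, name)
    params = sum(p.numel() for p in encoder.parameters())
    # Two bytes per parameter gives a rough bf16 memory footprint
    print(f"{name}: {params / 1e9:.2f}B params, ~{params * 2 / 1e9:.1f} GB in bf16")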
The practical implication is that your captions are understood far better than with SDXL. Natural language descriptions like "a woman wearing a blue dress standing in a sunlit forest clearing" work well because T5 actually comprehends that sentence structure. This means your caption quality matters more than ever for training results.
Diffusion Transformer Architecture
SD 3.5 replaces SDXL's UNet with a Diffusion Transformer (DiT). Instead of convolutional residual blocks, DiT uses transformer blocks with self-attention layers. This architecture enables better global coherence in generated images but requires more memory during training because attention computations scale with sequence length.
For SD 3.5 LoRA training, the key difference is in how gradients flow. Transformer gradients can be larger than UNet gradients at equivalent model sizes, and the attention mechanisms create larger activation tensors. Gradient checkpointing becomes essential because it trades compute time for memory by recomputing activations during the backward pass rather than storing them all.
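A minimal sketch of what gradient checkpointing does in a DiT-style transformer stack, in plain PyTorch (a toy model for illustration, not Kohya's or Stability's implementation): each block's activations are discarded after the forward pass and recomputed during the backward pass.

import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy transformer block standing in for a DiT block."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim)
        )
    def forward(self, x):
        x = x + self.attn(x, x, x)[0]
        return x + self.mlp(x)

class TinyDiT(torch.nn.Module):
    def __init__(self, depth=4, use_checkpointing=True):
        super().__init__()
        self.blocks = torch.nn.ModuleList([Block() for _ in range(depth)])
        self.use_checkpointing = use_checkpointing
    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpointing and self.training:
                # Don't store this block's activations; recompute them during backward
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x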
The DiT architecture also responds differently to hyperparameters during SD 3.5 LoRA training. Learning rates that worked for SDXL may need adjustment for SD 3.5. The community is still developing best practices, so be prepared for experimentation.
Rectified Flow Training Objective
SD 3.5 uses a rectified flow formulation instead of the traditional DDPM noise prediction objective. This changes how the model learns to denoise images and affects SD 3.5 LoRA training dynamics.
Rectified flow models learn more directly, which can mean faster convergence but also faster overfitting if you're not careful during SD 3.5 LoRA training. You may find that SD 3.5 LoRAs need fewer steps than comparable SDXL LoRAs, but the margin for error is smaller. Watch your sample outputs closely and be ready to stop training before overfitting occurs.
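For intuition, here is a minimal rectified-flow training step in PyTorch. It is a simplified sketch: model, latents, and cond are placeholders, and real trainers add timestep shifting and loss weighting on top of this, but the core objective is predicting the straight-line velocity from data to noise.

import torch
import torch.nn.functional as F

def rectified_flow_loss(model, latents, cond):
    # model, latents, cond are placeholders for your DiT, VAE latents, and text conditioning
    noise = torch.randn_like(latents)
    # Uniform timesteps in [0, 1]; SD 3.5 actually uses a shifted sampling schedule
    t = torch.rand(latents.shape[0], device=latents.device).view(-1, 1, 1, 1)
    noisy = (1.0 - t) * latents + t * noise      # linear interpolation between data and noise
    target = noise - latents                     # straight-line velocity target
    pred = model(noisy, t.flatten(), cond)
    return F.mse_loss(pred, target)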
Hardware Requirements and Memory Budget
Understanding your memory budget helps you configure SD 3.5 LoRA training that maximizes quality within your hardware limits. If you're just getting started with AI image generation, our complete beginner guide covers foundational concepts.
Memory Breakdown
Here's where VRAM goes during SD 3.5 LoRA training at FP16 without optimization:
- T5-XXL encoder: ~10GB
- CLIP-L encoder: ~0.4GB
- CLIP-G encoder: ~0.9GB
- DiT backbone: ~8GB
- LoRA weights: ~0.1-0.5GB (depending on rank)
- Optimizer states (Adam): ~0.2-1GB for LoRA parameters
- Gradients: ~0.1-0.5GB
- Activations at 1024x1024: ~15-25GB
Total without optimization: 35-46GB
This obviously doesn't fit on a 24GB card. Optimization reduces it to approximately 13-18GB:
- Encoders offloaded to CPU: 0GB GPU
- DiT backbone in BF16: ~4GB
- LoRA weights: ~0.1-0.5GB
- 8-bit optimizer states: ~0.1-0.5GB
- Gradients: ~0.1-0.5GB
- Activations with checkpointing at 512x512: ~8-12GB
Total with optimization: 13-18GB
This leaves headroom on a 24GB card for stability and minor memory spikes.
GPU Recommendations
RTX 4090 (24GB): The ideal consumer card for SD 3.5 LoRA training. All optimizations in this guide target 24GB. SD 3.5 LoRA training is comfortable at 512x512 with room for occasional 768x768 experiments.
RTX 3090 (24GB): Works for SD 3.5 LoRA training but slower due to older Ampere tensor cores and lower compute throughput. The same memory capacity means the same configurations work, but expect 20-30% longer training times.
RTX 4080 (16GB): Challenging but possible with extreme optimization. You'll need 512x512 resolution maximum, rank 8-16 maximum, and very aggressive checkpointing. Quality may suffer from the constraints.
48GB+ Professional Cards: A6000, A100, or H100 remove most constraints. You can train at 1024x1024 with higher ranks and larger batch sizes. If budget allows, these provide a much better experience.
Complete Training Configuration
This section provides specific settings for Kohya SS, the most popular SD 3.5 LoRA training tool, optimized for RTX 4090. These configurations have been tested extensively for reliable SD 3.5 LoRA training results.
Model and Precision Settings
# Model configuration
pretrained_model_name_or_path: "stabilityai/stable-diffusion-3.5-large"
v_parameterization: false
zero_terminal_snr: false
# Precision settings
mixed_precision: "bf16"
full_bf16: true
BF16 (bfloat16) is essential for SD 3.5 LoRA training. It provides the same memory savings as FP16 but with better numerical stability for gradients. The larger dynamic range of BF16 prevents the gradient underflow issues that can occur with FP16 in transformer architectures during SD 3.5 LoRA training.
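You can see the dynamic-range difference directly in PyTorch: FP16 tops out around 65,504 with a narrow exponent range, while BF16 shares float32's exponent range, so small gradients are far less likely to underflow.

import torch

print(torch.finfo(torch.float16))   # max ~6.55e4, narrow exponent range
print(torch.finfo(torch.bfloat16))  # max ~3.39e38, same exponent range as float32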
Memory Optimization Settings
# Critical memory optimizations
gradient_checkpointing: true
cpu_offload_checkpointing: false
# Text encoder handling
text_encoder_offload: true # Offloads T5-XXL to CPU after encoding
# Attention optimization
xformers: true
# OR
sdpa: true # PyTorch native scaled dot product attention
Gradient checkpointing is non-negotiable for 24GB training. It reduces activation memory by 60-70% by recomputing intermediate values during backpropagation instead of storing them. The time penalty is around 20-30% longer training, but this enables training that would otherwise be impossible.
Text encoder offloading moves the roughly 11GB of text encoders to CPU RAM after they encode your captions. Since encoding happens only once per training sample, these models would otherwise sit idle in VRAM during the actual diffusion training. Moving them to CPU frees that memory for activations.
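Conceptually the offload looks like the sketch below. This is a simplified illustration rather than Kohya's actual implementation (the real pipeline also combines pooled and per-token embeddings from each encoder): captions are encoded once, the small embeddings are cached, and the heavy encoders are moved off the GPU.

import torch

def precompute_conditioning(text_encoders, tokenizers, captions, device="cuda"):
    # text_encoders/tokenizers are placeholders for the CLIP-L, CLIP-G, and T5-XXL models
    cached = []
    with torch.no_grad():
        for caption in captions:
            per_encoder = []
            for tokenizer, encoder in zip(tokenizers, text_encoders):
                ids = tokenizer(caption, padding="max_length", truncation=True,
                                return_tensors="pt").input_ids.to(device)
                # Embeddings are tiny compared to the encoders, so keep them on CPU
                per_encoder.append(encoder(ids)[0].cpu())
            cached.append(per_encoder)
    for encoder in text_encoders:
        encoder.to("cpu")           # frees roughly 11GB of VRAM for activations
    torch.cuda.empty_cache()
    return cached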
Resolution and Batch Configuration
# Resolution
resolution: "512,512"
# OR for bucket training:
enable_bucket: true
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
# Batch configuration
train_batch_size: 1
gradient_accumulation_steps: 4
Training at 512x512 instead of SD 3.5's native 1024x1024 reduces activation memory by approximately 75% (memory scales with the square of resolution). This is the single biggest memory saving after gradient checkpointing.
You might worry that 512x512 training produces inferior LoRAs. In practice, LoRAs trained at lower resolution generalize well to higher generation resolution. The concepts and styles you're training transfer because they're not resolution-specific. You're teaching the model features, not pixel arrangements.
Batch size of 1 with gradient accumulation of 4 simulates batch size 4 while only ever holding one sample's activations in memory. This provides training stability comparable to larger batches without the memory cost.
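The accumulation loop itself is simple. A hedged sketch, where dataloader, compute_loss, model, and optimizer stand in for whatever your trainer provides:

accum_steps = 4
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    # Scale each loss so the summed gradient matches a true batch of 4
    loss = compute_loss(model, batch) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()            # one optimizer update per 4 samples
        optimizer.zero_grad()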
Network Architecture Settings
# LoRA network configuration
network_module: "networks.lora"
network_dim: 16 # rank
network_alpha: 16 # or 8 for different scaling
network_args:
- "conv_dim=8"
- "conv_alpha=8"
Network dimension (rank) determines the capacity of your LoRA. Rank 16 works well for characters and simple concepts. Rank 32 captures more complex styles and detailed subjects. Higher ranks (64+) can capture even more but require more training data and risk overfitting.
Network alpha scales the learned weights. Alpha equal to rank provides standard scaling. Alpha at half the rank reduces the effective learning rate of the LoRA, which can help with stability. Experiment with both.
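The alpha/rank relationship is easiest to see in a minimal LoRA layer. This is a generic sketch, not Kohya's networks.lora module, but the scaling works the same way:

import torch

class LoRALinear(torch.nn.Module):
    """Frozen base weight plus a low-rank update scaled by alpha/rank."""
    def __init__(self, base: torch.nn.Linear, rank=16, alpha=16):
        super().__init__()
        self.base = base.requires_grad_(False)       # base model stays frozen
        self.down = torch.nn.Linear(base.in_features, rank, bias=False)
        self.up = torch.nn.Linear(rank, base.out_features, bias=False)
        torch.nn.init.zeros_(self.up.weight)         # LoRA starts as a no-op
        self.scale = alpha / rank                    # alpha == rank -> 1.0; alpha == rank/2 -> 0.5
    def forward(self, x):
        return self.base(x) + self.up(self.down(x)) * self.scale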
The conv_dim and conv_alpha arguments apply LoRA to convolutional layers in addition to attention layers. This can improve detail capture but uses more memory. For 24GB training, keep conv_dim at half the main network_dim.
Optimizer Configuration
# Optimizer settings
optimizer_type: "AdamW8bit"
learning_rate: 1e-4
lr_scheduler: "constant"
lr_warmup_steps: 0
# Alternative optimizers
# optimizer_type: "Prodigy" # Adaptive learning rate
# optimizer_type: "Adafactor" # Memory efficient
AdamW8bit from bitsandbytes quantizes optimizer states to INT8, cutting optimizer memory usage in half compared to full precision Adam. Quality impact is minimal since the critical values remain in higher precision.
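If you write your own training loop instead of using Kohya, the same optimizer comes straight from bitsandbytes. A sketch where network is a placeholder for your LoRA module:

import bitsandbytes as bnb

# Only the LoRA parameters require gradients; the base model stays frozen
lora_params = [p for p in network.parameters() if p.requires_grad]
# Optimizer states are kept in 8-bit, roughly halving Adam's memory overhead
optimizer = bnb.optim.AdamW8bit(lora_params, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-2)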
Learning rate of 1e-4 is a solid starting point for SD 3.5 LoRA training. If training seems slow to converge, try 2e-4. If you see quality degradation or rapid overfitting, try 5e-5.
Constant learning rate scheduler works well for LoRA training because the optimization space is relatively smooth. Cosine schedulers can provide marginal improvement but add a hyperparameter (number of steps) that requires tuning.
Training Duration Settings
# Training duration
max_train_steps: 1500 # Adjust based on dataset and goal
# Alternatively, use epochs
# max_train_epochs: 10
# Checkpointing
save_every_n_steps: 500
save_model_as: "safetensors"
Training duration depends on your dataset size and goal:
- Character LoRAs (10-20 images): 1000-1500 steps
- Style LoRAs (50-100 images): 2000-3000 steps
- Complex concepts (100+ images): 3000-5000 steps
These are starting points. Watch your sample outputs and adjust based on actual results. Overtraining produces LoRAs that only recreate training images rather than generalizing to new prompts.
Sample Generation Settings
# Sample generation during training
sample_every_n_steps: 200
sample_prompts: "path/to/prompts.txt"
sample_sampler: "euler"
Sample generation lets you monitor training progress. Generate samples every 200 steps using prompts that test your trigger word in various contexts.
Example prompts file content:
ohwxperson standing in a forest, natural lighting
ohwxperson portrait, studio lighting, professional photo
ohwxperson wearing casual clothes, urban background
Watch for these stages in samples:
- Steps 0-300: Trigger word has no effect, outputs are generic
- Steps 300-800: Trigger starts affecting output, subject begins appearing
- Steps 800-1500: Subject becomes clear and consistent
- Steps 1500+: Risk of overfitting, watch for loss of variation
Stop training when samples look good but still show variation. If every sample looks identical to training images, you've overfit.
Dataset Preparation
Quality training data produces quality LoRAs. Time invested in dataset preparation pays off in training results.
Image Selection and Requirements
Quantity guidelines:
- Characters: 10-20 images showing variety
- Styles: 50-200 images showing range
- Objects: 15-30 images with different angles
Quality requirements:
- Sharp focus on subject
- Good lighting that doesn't obscure features
- Clean backgrounds that don't confuse training
- No watermarks, text overlays, or artifacts
Variety requirements:
- Different poses and angles for characters
- Various lighting conditions (avoid only one lighting setup)
- Multiple expressions for faces
- Range of examples for styles
Include variety because your LoRA learns from variation. If all training images have the same lighting, the LoRA associates that lighting with the concept. If you want your character in different lighting conditions, train with different lighting conditions.
Image Processing
Resize images to the training resolution or slightly above it. There's no benefit to loading 4K images when training at 512x512, and oversized images slow down data loading.
from PIL import Image
import os

def process_images(input_dir, output_dir, max_size=768):
    os.makedirs(output_dir, exist_ok=True)
    for filename in os.listdir(input_dir):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.webp')):
            img_path = os.path.join(input_dir, filename)
            img = Image.open(img_path)
            # Resize if larger than max_size while maintaining aspect ratio
            if max(img.size) > max_size:
                ratio = max_size / max(img.size)
                new_size = (int(img.size[0] * ratio), int(img.size[1] * ratio))
                img = img.resize(new_size, Image.LANCZOS)
            # Convert to RGB if necessary
            if img.mode != 'RGB':
                img = img.convert('RGB')
            # Save as PNG for lossless quality
            output_path = os.path.join(output_dir, f"{os.path.splitext(filename)[0]}.png")
            img.save(output_path, 'PNG')
            print(f"Processed: {filename} -> {os.path.basename(output_path)}")

process_images("raw_images", "training_images", max_size=768)
Captioning for SD 3.5
Captions are critical for SD 3.5 because its T5 encoder actually understands natural language. Write captions as descriptive sentences, not tag lists.
Good caption example:
A photograph of ohwxperson, a young woman with shoulder-length auburn wavy hair and green eyes, wearing a white blouse, sitting at a cafe table with a coffee cup, natural daylight from a window, shallow depth of field
Poor caption example:
ohwxperson, woman, auburn hair, green eyes, blouse, cafe, coffee, window light
The natural language caption tells T5 the relationships between elements. The woman is wearing the blouse, she's sitting at the table, the light comes from the window. Tag lists lose this relational information.
Include your trigger word in every caption. The trigger word (like "ohwxperson") is how you'll activate the LoRA during generation. It should be unique and not conflict with real words.
Automated captioning with refinement:
Generate initial captions using BLIP-2, CogVLM, or similar tools, then refine them manually:
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch
from PIL import Image
import os

# Load BLIP-2
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16
).to("cuda")

def caption_image(image_path, trigger_word="ohwxsubject"):
    image = Image.open(image_path).convert('RGB')
    # Generate detailed caption
    prompt = "Describe this image in detail, including the subject, clothing, pose, background, and lighting:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=100)
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    # Prepend trigger word
    caption = f"{trigger_word}, {caption}"
    return caption

# Process directory
image_dir = "training_images"
for filename in os.listdir(image_dir):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
        image_path = os.path.join(image_dir, filename)
        caption = caption_image(image_path, "ohwxperson")
        # Save caption to text file
        caption_path = os.path.join(image_dir, f"{os.path.splitext(filename)[0]}.txt")
        with open(caption_path, 'w') as f:
            f.write(caption)
        print(f"{filename}: {caption[:100]}...")
After automated captioning, review each caption file and correct inaccuracies. The auto-captioner might misidentify colors, miss important details, or describe things incorrectly. Manual review ensures your LoRA learns accurate associations.
Dataset Structure
Organize your dataset in the folder structure your training tool expects. For Kohya SS:
dataset/
├── 10_ohwxperson/
│ ├── image1.png
│ ├── image1.txt
│ ├── image2.png
│ ├── image2.txt
│ └── ...
The folder name "10_ohwxperson" means: repeat these images 10 times per epoch, with class name "ohwxperson". Adjust the repeat number based on your dataset size. Smaller datasets benefit from more repeats; larger datasets need fewer.
Monitoring and Troubleshooting Training
Watch training progress and intervene when things go wrong.
Loss Curve Interpretation
Training loss should generally decrease and stabilize. Normal patterns:
- Rapid decrease in first 20% of training
- Gradual decrease in middle 60%
- Plateau or minimal decrease in final 20%
Warning signs:
- Loss increases after initial decrease (learning rate too high)
- Loss stays flat from start (learning rate too low, or dataset issue)
- Loss oscillates wildly (learning rate too high, or batch too small)
- Loss goes to zero (extreme overfitting)
Common Issues and Fixes
Out of memory during training:
- Reduce resolution to 512x512
- Enable gradient checkpointing if not already
- Reduce network rank to 16 or 8
- Enable text encoder offloading
- Close other GPU applications
Training doesn't converge (trigger word has no effect):
- Increase learning rate (try 2e-4)
- Increase training steps
- Check captions include trigger word
- Verify images contain what you're training
Overfitting (samples look exactly like training images):
- Reduce training steps
- Reduce learning rate
- Increase dataset variety
- Reduce network rank
Style bleeding (LoRA changes overall style too much):
- Use lower network rank (8-16)
- Fewer training steps
- Better captions that describe non-subject elements
- Lower LoRA weight during inference
VRAM Monitoring
Monitor GPU memory during training to understand your headroom:
# In a separate terminal
watch -n 1 nvidia-smi
Note the peak memory usage. If it's consistently above 22GB on a 24GB card, you risk OOM errors during memory spikes. Reduce settings to get usage below 20GB for stability.
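If you are modifying the training script yourself, PyTorch can report the same peak figure directly (a sketch; Kohya does not expose this hook by default):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few hundred training steps ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.1f} GB")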
Evaluating and Using Your LoRA
After SD 3.5 LoRA training completes, evaluate your LoRA to determine if it meets your needs. For loading and testing your trained LoRAs, our ComfyUI essential nodes guide covers the workflow setup.
Testing Protocol
Load your LoRA in ComfyUI or your preferred inference tool. Test with prompts that vary:
- Context (indoor, outdoor, various backgrounds)
- Pose and action (standing, sitting, various activities)
- Lighting (daylight, studio, dramatic)
- Style modifiers (photo realistic, artistic, painterly)
Your LoRA should maintain subject identity across these variations while allowing prompt control of other elements.
Weight and Strength Tuning
LoRA weight (strength) affects how strongly the LoRA influences generation:
- 0.6-0.8: Subtle influence, easier to combine with other elements
- 0.8-1.0: Strong influence, clear subject representation
- 1.0+: Can overpower other prompt elements
Start at 0.8 and adjust based on results. If the subject doesn't appear clearly, increase weight. If other prompt elements are ignored, decrease weight.
Combining with Other LoRAs
SD 3.5 Large supports combining multiple LoRAs. Use a character LoRA with a style LoRA for styled character images. Keep combined weights below 1.5 total to avoid artifacts.
# ComfyUI example
Character LoRA: weight 0.7
Style LoRA: weight 0.5
Total: 1.2
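Outside ComfyUI, the same combination works in diffusers through its PEFT integration. A sketch, assuming character.safetensors and style.safetensors are your trained LoRA files and you have a recent diffusers release with SD3 LoRA support:

from diffusers import StableDiffusion3Pipeline
import torch

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("character.safetensors", adapter_name="character")  # hypothetical filenames
pipe.load_lora_weights("style.safetensors", adapter_name="style")
pipe.set_adapters(["character", "style"], adapter_weights=[0.7, 0.5])      # combined weight 1.2
image = pipe("ohwxperson portrait, studio lighting", num_inference_steps=28).images[0]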
Frequently Asked Questions
Is 24GB VRAM really sufficient for SD 3.5 Large LoRA training?
Yes, with all optimizations enabled. Gradient checkpointing, BF16 precision, 8-bit optimizer, text encoder offloading, and 512x512 resolution together bring memory requirements to approximately 15-18GB, leaving headroom on 24GB cards.
How long does training take on an RTX 4090?
A typical character LoRA (1500 steps) takes 45-90 minutes depending on dataset size and exact settings. Style LoRAs needing 3000 steps take 2-3 hours. Gradient checkpointing adds about 25% to these times.
Can I train at 1024x1024 resolution on consumer hardware?
Not practically on 24GB. The activation memory requirement exceeds available VRAM even with aggressive optimization. Train at 512x512 and generate at 1024x1024. The LoRA concepts transfer across resolutions.
What network rank should I use?
Start with rank 16 for characters, 32 for styles. If results lack detail, increase rank. If you see overfitting, decrease rank. Higher ranks capture more information but need more training data and risk memorizing rather than learning.
Do I need different learning rates for SD 3.5 versus SDXL?
Start with the same rate (1e-4) and adjust based on convergence. SD 3.5's rectified flow training can converge faster, so you may need lower rates to avoid overfitting or can use fewer steps at the same rate.
Should I train all three text encoders?
For LoRA training, you typically only train the diffusion backbone. Text encoders are used for conditioning but frozen. Training text encoders requires much more memory and is usually unnecessary for most LoRA use cases.
How many images do I need for good results?
10-20 for characters with good variety. 50-200 for styles with range of examples. Quality and variety matter more than quantity. 15 diverse images train better than 50 similar ones.
Can I use my SDXL LoRAs with SD 3.5?
No, they're completely incompatible architectures. SD 3.5's DiT backbone differs fundamentally from SDXL's UNet, so you need to train new LoRAs specifically for SD 3.5.
Is SD 3.5 Medium easier to train than Large?
Yes, Medium has lower memory requirements and trains faster. If Large is too demanding for your hardware, Medium is a valid alternative that still provides the architectural benefits over SDXL.
Will my SD 3.5 LoRA work with future model versions?
Likely not directly. Stability AI's future models will probably use different architectures or modifications that require retraining. LoRAs are typically specific to the exact model version they were trained for.
Conclusion
SD 3.5 LoRA training on consumer hardware is achievable with proper configuration. The key optimizations for SD 3.5 LoRA training, including gradient checkpointing, BF16 precision, 8-bit optimizer, text encoder offloading, and reduced resolution, together bring memory requirements within 24GB reach while maintaining training quality.
Dataset preparation deserves significant attention for SD 3.5 LoRA training. SD 3.5's T5 encoder understands natural language, so write captions as descriptive sentences that convey relationships and context. Quality images with variety train better than large quantities of similar images.
Monitor your SD 3.5 LoRA training with sample generation and be ready to stop before overfitting. SD 3.5's rectified flow training can overfit faster than SDXL, so watch samples closely in the latter half of training.
The resulting LoRAs from SD 3.5 LoRA training take advantage of SD 3.5's architectural improvements, producing better prompt adherence and text rendering than SDXL LoRAs while capturing your custom subjects and styles effectively.
For SD 3.5 LoRA training without hardware constraints, cloud services provide access to 48GB+ GPUs that can train at full resolution with larger batch sizes. This can improve quality and speed for important projects. Apatero.com offers such infrastructure for users who want SD 3.5's capabilities without managing local memory limitations. For FLUX model training, see our dedicated FLUX LoRA training guide.
With the configurations in this guide, your RTX 4090 becomes a capable SD 3.5 LoRA training workstation. The initial setup takes effort, but once configured, you can train custom LoRAs that use everything SD 3.5 offers. For maintaining consistent characters across your generated images, see our character consistency guide. To optimize your inference workflow, explore TeaCache and SageAttention for faster generation speeds.