VRAM Optimization Flags Explained - ComfyUI and AI Generation Guide
Understand all VRAM optimization flags for ComfyUI and AI generation including attention modes, model offloading, and precision settings
You've seen the error messages: "CUDA out of memory. Tried to allocate 2.00 GiB." You know the frustration of a generation failing at 90% because your GPU ran out of VRAM. The fix involves the flags and settings you've seen mentioned—lowvram, xformers, FP16, CPU offloading—but rarely explained. Understanding what these mechanisms actually do, and when to use them, transforms you from someone randomly trying flags until something works into someone who can configure their system deliberately for any model and workflow. If you're new to ComfyUI, our essential nodes guide provides foundational knowledge that complements these techniques.
Proper VRAM optimization is essential for running modern AI models on consumer hardware. This guide covers every major optimization flag and technique available in ComfyUI.
VRAM (Video Random Access Memory) is the primary constraint for local AI generation. Unlike system RAM, you can't just add more or use swap space effectively. The model weights, intermediate tensors, and activations must all fit simultaneously during generation. When they don't, generation fails. Optimization flags give you control over this equation by changing how and where data is stored and computed.
The Fundamentals of GPU Memory in AI Generation
Before diving into specific flags, understanding what consumes VRAM during generation helps you predict which optimizations will help for your situation.
What Consumes VRAM
Several distinct categories of data consume GPU memory during AI image generation:
Model weights: The trained parameters of the neural network. For a typical SDXL model, this is approximately 6.5GB in FP16 precision. The model must be loaded before generation can begin.
Activations: Intermediate results computed during the forward pass. As data flows through network layers, each layer's output becomes input to the next. These activations consume substantial memory, especially at high resolutions.
Attention matrices: Self-attention and cross-attention operations compute matrices that grow quadratically with sequence length. For images, sequence length relates to resolution—higher resolution means quadratically larger attention matrices.
Optimizer states: Only relevant for training, not inference. If you're training LoRAs, optimizer states add significant memory overhead. For common training issues, see our LoRA training troubleshooting guide.
Caching: Various caching mechanisms can consume memory for performance improvements.
Memory Scaling Behaviors
Different components scale differently with generation parameters:
- Model weights: Constant regardless of resolution
- Activations: Scale roughly linearly with pixel count (twice the pixels ≈ twice the activation memory)
- Attention: Scales quadratically with the number of image tokens, which grows with pixel count
This quadratic scaling is why high-resolution generation is so much harder than low-resolution. A 1024x1024 image has 4x the pixels of a 512x512 image, so the full attention matrix is roughly 16x larger, not just 4x.
Understanding Memory Allocation Patterns
PyTorch's memory allocator reserves memory in chunks rather than allocating exactly what's needed. This reduces allocation overhead but means your actual available memory is less than total VRAM. The allocator also fragments memory over time, causing situations where you have enough total free memory but not in contiguous blocks large enough for the next allocation.
You can observe this behavior by comparing torch.cuda.memory_allocated() (actually used) versus torch.cuda.memory_reserved() (held by allocator). The gap between these values represents fragmented or pre-allocated memory.
For workflows that run multiple different generations, memory fragmentation accumulates. Restarting ComfyUI periodically clears fragmentation and restores full memory availability. Some users restart between significantly different workloads (like switching from SDXL to Flux) to ensure clean memory state.
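If you want to see the allocator's behavior without restarting, a minimal sketch like the following (assuming a CUDA-capable PyTorch install) shows how much of the reserved pool is actually in use and what torch.cuda.empty_cache() returns to the driver:
import torch

def report(label):
    # memory_allocated = tensors currently in use; memory_reserved = pool held by the caching allocator
    alloc = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"{label}: allocated {alloc:.0f}MB, reserved {reserved:.0f}MB")

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")   # ~64MB in FP32
    report("after allocation")
    del x
    report("after del (reserved pool is kept)")
    torch.cuda.empty_cache()                      # release cached blocks back to the driver
    report("after empty_cache")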
Precision Flags: The Foundation of Memory Optimization
Precision settings control how numbers are stored, offering the most fundamental tradeoff between memory and quality. They form the foundation of your memory management strategy.
FP32 (Full Precision)
FP32 uses 32 bits (4 bytes) per number. This provides maximum numerical precision with approximately 7 significant digits and an enormous dynamic range.
# Force FP32 mode (rarely needed)
python main.py --force-fp32
FP32 is almost never necessary for inference. It uses double the memory of FP16 with no perceptible quality improvement. Some legacy workflows or unusual models might require it, but treat it as a debugging option rather than standard practice.
Memory usage: ~13GB for SDXL model weights alone
Use case: Debugging numerical issues, legacy compatibility
FP16 (Half Precision)
FP16 uses 16 bits (2 bytes) per number, halving memory requirements compared to FP32.
# FP16 is typically the default, but can be forced
python main.py --force-fp16
FP16 is the standard precision for AI inference. The reduced precision (about 3 significant digits) has no perceptible impact on image quality. Models are often distributed in FP16, and inference in FP16 is well-tested and reliable.
The limitation of FP16 is its reduced dynamic range (approximately 5.96 x 10^-8 to 65,504). Values outside this range become infinity or zero. This causes the NaN errors discussed in other guides, particularly in VAE decoding.
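You can see this overflow behavior directly in PyTorch; the small demonstration below runs on the CPU, needs no model, and shows a value leaving FP16's range and the NaN that follows:
import torch

x = torch.tensor([300.0], dtype=torch.float16)
y = x * x                      # 90,000 exceeds FP16's maximum of 65,504
print(y)                       # tensor([inf], dtype=torch.float16)
print(y - y)                   # inf - inf produces NaN, the value that poisons VAE decodes
print(x.float() * x.float())   # the same math is fine in FP32 (or BF16)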
Memory usage: ~6.5GB for SDXL model weights
Use case: Standard inference for most workflows
BF16 (Brain Float 16)
BF16 also uses 16 bits but allocates them differently than FP16. It maintains the same dynamic range as FP32 (roughly 10^-38 to 10^38) but with reduced precision (about 2 significant digits).
# Run the diffusion model in BF16
python main.py --bf16-unet
BF16's larger dynamic range means fewer overflow and underflow errors. This makes it slightly more numerically stable than FP16 for certain operations, particularly in training.
BF16 requires Ampere or newer NVIDIA GPUs (RTX 30 series, A100, etc.). Older GPUs don't have native BF16 support and will be much slower or fail.
Memory usage: ~6.5GB for SDXL model weights (same as FP16)
Use case: Training, workflows with numerical stability issues
Requirement: Ampere+ GPU (RTX 30xx, 40xx, A100)
FP8 and INT8 Quantization
Newer formats use only 8 bits per number, providing another 50% memory reduction over FP16.
# Store the diffusion model weights in FP8
python main.py --fp8_e4m3fn-unet
FP8 and INT8 quantization enable running larger models on smaller GPUs but with potential quality impact. The severity of quality degradation depends on the model and how it was trained.
Some models are trained with quantization awareness and handle low precision gracefully. Others degrade significantly. Test your specific model to evaluate the tradeoff.
Memory usage: ~3.25GB for SDXL model weights
Use case: Running large models on limited VRAM, production inference where throughput matters
Requirement: Ada Lovelace+ (RTX 40xx, L40) or Hopper for native FP8 compute; FP8 weight storage with FP16 compute works on older CUDA GPUs, and INT8 needs only general CUDA support
Choosing Precision
For most users, FP16 is the right choice. It provides the best balance of memory efficiency and quality with broad hardware support.
Use BF16 if you have compatible hardware and experience numerical stability issues with FP16.
Use FP8/INT8 when you need to fit models that won't otherwise load, and you've verified acceptable quality with your specific model.
Use FP32 only for debugging or if specific documentation recommends it.
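As a rough rule of thumb, weight memory is simply parameter count times bytes per value. The sketch below assumes roughly 2.6 billion parameters for the SDXL UNet alone (the ~6.5GB figure above also includes the text encoders and VAE):
# Back-of-the-envelope weight memory for an assumed 2.6e9-parameter model
BYTES_PER_VALUE = {"fp32": 4, "fp16/bf16": 2, "fp8/int8": 1}
params = 2.6e9

for name, nbytes in BYTES_PER_VALUE.items():
    print(f"{name:>9}: {params * nbytes / 1024**3:.1f} GB")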
Attention Optimization Flags
Attention computation is memory-intensive and benefits enormously from optimization. Different attention implementations provide different memory/speed tradeoffs.
Standard PyTorch Attention
The default PyTorch attention implementation computes the full attention matrix at once. For an image represented as a sequence of patches:
- Sequence length L = (height/patch_size) * (width/patch_size)
- Attention matrix size = L * L * num_heads * batch_size
This quadratic scaling makes default attention impractical for high resolutions.
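Plugging illustrative numbers into that formula shows why the scaling bites. The token counts and head count below are assumptions for a typical latent-diffusion attention layer, not measurements of any particular model:
# Full attention matrix memory (FP16, 2 bytes per value) for one attention layer
def attn_matrix_gb(tokens, num_heads=8, batch_size=1, bytes_per_value=2):
    return tokens * tokens * num_heads * batch_size * bytes_per_value / 1024**3

print(f"512x512 (~4,096 tokens): {attn_matrix_gb(4096):.2f} GB")
print(f"1024x1024 (~16,384 tokens): {attn_matrix_gb(16384):.2f} GB")  # 4x the tokens, 16x the memory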
xFormers Memory-Efficient Attention
xFormers implements attention in chunks rather than computing the full matrix simultaneously.
# xFormers is picked up automatically when the package is installed;
# disable it if it's causing issues:
python main.py --disable-xformers
The chunked approach changes memory scaling from quadratic to near-linear. A 1024x1024 generation that would require 16GB for attention might need only 4GB with xFormers.
xFormers often improves speed as well because memory-efficient operations have better cache behavior.
Installation: xFormers is a separate package that must be installed:
pip install xformers
Match xFormers version to your PyTorch and CUDA versions. Mismatches cause errors or poor performance.
Use case: Standard optimization for most users. Keep it installed unless you have a specific reason not to.
Flash Attention
Flash Attention fuses attention operations to minimize memory transfers between GPU compute units and memory.
# Enable Flash Attention (where supported)
python main.py --use-flash-attention
Flash Attention is typically faster than xFormers with similar memory efficiency. However, it has stricter requirements:
- Requires Ampere+ GPU
- Not all sequence lengths are supported
- Some model architectures don't support it
Use case: Best performance on compatible hardware. Use if available and working.
SageAttention
SageAttention uses custom Triton kernels for attention computation.
# Enable SageAttention
python main.py --use-sage-attention
Performance often exceeds both xFormers and Flash Attention when properly configured. The custom kernels are optimized for specific GPU architectures.
Requirements: Triton installation, may need compilation for your GPU
Use case: Maximum performance for users willing to do additional setup
Scaled Dot Product Attention (SDPA)
PyTorch 2.0+ includes built-in scaled dot product attention with multiple backend options.
# Use PyTorch's SDPA
python main.py --use-pytorch-cross-attention
SDPA automatically selects between Flash Attention, Memory-Efficient Attention, and mathematical attention based on hardware and inputs. It's a good default choice that provides optimization without manual configuration.
Use case: Good default when you don't want to manage attention backends manually
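Under the hood this is the torch.nn.functional.scaled_dot_product_attention call. A minimal standalone example (the shapes are arbitrary and chosen purely for illustration; it falls back to CPU and FP32 if no CUDA GPU is present):
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, sequence_length, head_dim)
q = torch.randn(1, 8, 4096, 64, device=device, dtype=dtype)
k = torch.randn(1, 8, 4096, 64, device=device, dtype=dtype)
v = torch.randn(1, 8, 4096, 64, device=device, dtype=dtype)

# PyTorch chooses Flash, memory-efficient, or math attention based on hardware and inputs
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 4096, 64])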
Attention Slicing
Attention slicing is a last-resort optimization that processes attention in small sequential batches.
# Sliced attention implementations in ComfyUI
python main.py --use-split-cross-attention
# Or the even more frugal quad slicing:
python main.py --use-quad-cross-attention
This dramatically reduces memory by computing only a portion of attention at a time, but significantly slows generation because operations that could parallelize are now sequential.
Use case: Only when other attention optimizations aren't enough to fit in memory. Expect 2-4x slower generation.
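The idea behind slicing is simple: compute attention for one chunk of queries at a time so the full score matrix never exists at once. A toy sketch of the approach (not ComfyUI's actual implementation):
import torch
import torch.nn.functional as F

def sliced_attention(q, k, v, slice_size=1024):
    # Process queries in chunks; each chunk builds only a (slice_size x seq_len) score matrix
    outputs = []
    for start in range(0, q.shape[-2], slice_size):
        q_chunk = q[..., start:start + slice_size, :]
        scores = q_chunk @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        outputs.append(F.softmax(scores, dim=-1) @ v)
    return torch.cat(outputs, dim=-2)

q = k = v = torch.randn(1, 8, 4096, 64)
out = sliced_attention(q, k, v)
print(out.shape)  # same result as full attention, just with a lower peak memory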
Choosing Attention Mode
Try attention modes in this order:
- SageAttention or Flash Attention: Best performance if supported
- xFormers: Reliable, well-tested, broad compatibility
- PyTorch SDPA: Good automatic selection
- Attention slicing: Last resort when nothing else fits
Only one attention mode can be active at a time. They're alternatives, not complements.
Offloading Flags
Offloading moves model components to CPU RAM, freeing GPU memory at the cost of transfer time. These flags are essential on memory-constrained systems.
Text Encoder Offloading
Text encoders (CLIP, T5) are only needed at generation start to encode your prompt.
# ComfyUI's model management moves text encoders off the GPU automatically when VRAM runs low;
# their footprint can also be shrunk by storing them in FP8:
python main.py --fp8_e4m3fn-text-enc
After encoding your prompt, the text encoder's VRAM can be freed for the main model. This frees roughly 1GB for CLIP-based models (SD/SDXL) and several GB for the much larger T5-XXL encoder used by Flux.
The speed impact is minimal since text encoding is a small fraction of total generation time. This optimization provides good memory savings with little downside.
Memory savings: ~1GB for CLIP, several GB for T5-XXL
Speed impact: Minimal (seconds at generation start)
Recommendation: Rely on ComfyUI's automatic offloading; add FP8 text encoder storage on memory-constrained systems
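If you manage models yourself, for example in a custom node or a standalone script, the pattern is just moving the encoder between devices. In the sketch below, encoder is a hypothetical placeholder standing in for a real CLIP or T5 module:
import torch
import torch.nn as nn

encoder = nn.Linear(768, 768)              # placeholder for a real CLIP/T5 text encoder
tokens = torch.randn(1, 77, 768)

# Move to GPU only for the encoding step, then send it back to system RAM
if torch.cuda.is_available():
    encoder.to("cuda")
    cond = encoder(tokens.to("cuda")).cpu()
    encoder.to("cpu")
    torch.cuda.empty_cache()               # return the freed blocks to the allocator pool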
VAE Offloading
VAEs decode latents to images at generation end.
# Run the VAE on the CPU so it never occupies VRAM
python main.py --cpu-vae
With this flag the VAE runs entirely on the CPU. Decoding is slower, but the VAE's weights and its decode activations never touch GPU memory.
Memory savings: ~160MB (FP16) to ~320MB (FP32) of weights, plus the decode activations, which can spike to several GB at high resolution
Speed impact: Moderate (CPU decoding at the end of generation)
Recommendation: Enable if needed; the weight savings are small, but avoiding the decode spike can matter
Model Offloading (lowvram Mode)
Aggressive offloading moves main model components to CPU during generation.
# Enable aggressive offloading
python main.py --lowvram
With --lowvram, model components move between CPU and GPU as needed during computation. Only the actively-computing portion stays on GPU.
This dramatically reduces VRAM requirements but significantly slows generation due to CPU-GPU transfer overhead. Generation that takes 30 seconds without offloading might take 3-5 minutes with aggressive offloading.
Memory savings: Massive (can run SDXL in 4GB VRAM)
Speed impact: Severe (5-10x slower)
Recommendation: Use only when nothing else fits
Sequential Offloading (novram Mode)
The most aggressive offloading level keeps as little of the model on the GPU as possible, moving weights over piece by piece as they are needed.
# ComfyUI's most extreme memory mode
python main.py --novram
Weights stay in system RAM and stream to the GPU only while they are in use, which shrinks peak GPU memory to roughly the active layers plus activations.
Generation is extremely slow—potentially 20-30 minutes for a single image. But it enables running models that would otherwise be impossible on available hardware.
Memory savings: Maximum possible
Speed impact: Extreme (20x+ slower)
Recommendation: Absolute last resort for models that won't fit any other way
Medvram Mode
A moderate middle ground between no offloading and lowvram. The --medvram flag itself comes from Automatic1111-style launchers; ComfyUI's default smart memory management (--normalvram) plays the same role.
# ComfyUI's default behavior; force it explicitly if another mode is active
python main.py --normalvram
This keeps models in VRAM only while they're needed and offloads text encoders opportunistically, without the severe slowdowns of full lowvram mode.
Memory savings: Moderate
Speed impact: Small
Recommendation: Good starting point for 8-12GB GPUs
Combined Optimization Configurations
Multiple flags can be combined for cumulative benefit. Here are tested configurations for different hardware tiers.
For video generation workflows, our Wan 2.2 complete guide shows how to apply these techniques to video models.
4-6GB VRAM Configuration
For GTX 1060 6GB, RTX 3050, etc.:
python main.py --lowvram --cpu-vae --force-fp16
This configuration:
- Aggressive offloading for the main model and text encoders
- Memory-efficient attention via xFormers (install it) or PyTorch SDPA
- VAE running on the CPU
- FP16 precision
Expect very slow generation (5-10 minutes per image) but it will complete. Limit resolution to 512x512 for SD 1.5 or 768x768 maximum for SDXL.
8GB VRAM Configuration
For RTX 3070, RTX 4060, RTX 2080:
python main.py --normalvram --force-fp16
This configuration:
- ComfyUI's default moderate offloading
- Efficient attention via xFormers or SDPA
- Text encoders moved off the GPU when VRAM runs low
- FP16 precision
Generation time will be reasonable (1-2 minutes for typical workflows). You can run SD 1.5 comfortably and SDXL with care. Resolution up to 768x768 or higher with tiling. For additional guidance, see our beginner's guide to AI image generation.
12GB VRAM Configuration
For RTX 4070, RTX 3080 12GB:
python main.py --force-fp16
This is the sweet spot configuration:
- Efficient attention via xFormers or SDPA
- Text encoder offloading handled automatically
- FP16 precision
- No main model offloading needed
Most models and workflows run without issue. Generation time is fast (30-60 seconds). SDXL at 1024x1024 works well.
16-24GB VRAM Configuration
For RTX 4080, RTX 4090, RTX 3090:
python main.py --use-sage-attention --force-fp16
Or for maximum speed:
python main.py --use-sage-attention --gpu-only --force-fp16
With abundant VRAM:
- Best attention implementation
- No offloading needed
- Keep everything on GPU for speed
Focus on speed rather than memory savings. Generation time is fast (10-30 seconds). All models and resolutions are accessible.
Environment Variables for Fine-Tuning
Beyond command-line flags, environment variables provide additional control over memory behavior.
PYTORCH_CUDA_ALLOC_CONF
This variable controls PyTorch's memory allocator behavior:
# Reduce fragmentation with smaller allocation chunks
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# Or more aggressive for very limited VRAM
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64
Smaller split sizes reduce fragmentation but increase allocation overhead. For systems hitting OOM despite having "enough" total memory, this can help.
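The same setting can be applied from Python, as long as it happens before the first CUDA allocation. A minimal sketch:
import os
# Must be set before PyTorch makes its first CUDA allocation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the variable is set
if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # allocator now uses the smaller split size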
CUDA_VISIBLE_DEVICES
Control which GPUs are visible to PyTorch:
# Use only GPU 0
export CUDA_VISIBLE_DEVICES=0
# Use GPUs 0 and 2, skip 1
export CUDA_VISIBLE_DEVICES=0,2
Useful for multi-GPU systems where you want to reserve certain GPUs for other tasks.
TF32 Settings
On Ampere+ GPUs, TF32 provides a speed boost with minimal precision loss:
# Enable TF32 for matrix operations
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
This is usually enabled by default in recent PyTorch versions but can be explicitly set for clarity.
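The equivalent switches are also exposed in Python through torch.backends, which is often clearer than an environment variable:
import torch

# Allow TF32 tensor cores for matmul and cuDNN convolutions (Ampere+ GPUs)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

print(torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32)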
Workflow-Specific Optimization Strategies
Different types of workflows benefit from different optimization approaches.
Image Generation Workflows
Standard text-to-image and image-to-image workflows:
- Enable efficient attention (xFormers or Flash)
- Offload text encoder after encoding
- Use FP16 precision
- Match resolution to available VRAM
Video Generation Workflows
Video models like Wan 2.1 or LTX Video have different memory patterns:
- Temporal attention adds significant memory overhead
- Consider frame-by-frame generation with consistency techniques
- Aggressive quantization often necessary
- Accept longer generation times for quality
Training Workflows
LoRA and fine-tuning have different requirements than inference:
- Gradient storage adds 2-3x memory overhead
- Gradient checkpointing trades compute for memory
- 8-bit optimizers reduce optimizer state memory
- Batch size has a major impact - use gradient accumulation (see the sketch after this list)
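Gradient accumulation lets a small per-step batch behave like a larger effective batch. The sketch below is generic PyTorch with a placeholder model and random data, not any specific trainer:
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                     # placeholder for the network being trained
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4                              # effective batch = per-step batch x accum_steps

optimizer.zero_grad()
for step in range(8):
    x, y = torch.randn(2, 16), torch.randn(2, 1)   # tiny per-step batch keeps activation memory low
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                          # gradients accumulate across steps
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()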
Multi-Model Workflows
ControlNet, IP-Adapter, and similar additions:
- Each model adds its memory footprint
- Consider sequential loading/unloading
- Prioritize which models need to stay resident
- Use model caching wisely
Monitoring and Debugging Memory Usage
Understanding actual memory usage helps you tune configurations.
Real-Time Monitoring
Monitor VRAM during generation:
# In separate terminal
watch -n 0.5 nvidia-smi
This shows VRAM usage updating every 0.5 seconds. Observe peak usage during different generation phases:
- Loading: Model weights loading into VRAM
- Encoding: Text encoding
- Sampling: Main diffusion process (usually peak usage)
- Decoding: VAE decoding to image
Python Memory Tracking
Add memory tracking to understand usage programmatically:
import torch

def print_memory():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB")

# Call at different points in your workflow
print_memory()
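Peak usage is often more informative than the current value. PyTorch tracks it for you, and resetting the counter between phases shows where the spike occurs; the phase boundaries in this sketch are placeholders for your own workflow steps:
import torch

def report_peak(phase):
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{phase}: peak {peak:.2f}GB")
    torch.cuda.reset_peak_memory_stats()  # start fresh for the next phase

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    # ... run text encoding here ...
    report_peak("encoding")
    # ... run sampling here ...
    report_peak("sampling")
    # ... run VAE decode here ...
    report_peak("vae decode")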
Identifying Memory Peaks
Memory issues usually occur at specific phases:
High-resolution attention: Quadratic scaling makes this the most common memory bottleneck. Use efficient attention.
Large batch sizes: Each image in a batch multiplies activation memory. Reduce batch size if OOM during sampling.
Multiple models loaded: Having multiple models in VRAM simultaneously (like ControlNet + main model) accumulates. Offload unused models.
VAE decoding at high resolution: The VAE operates on full-resolution images. Use tiled VAE for very high resolutions.
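Tiled decoding works by splitting the latent into patches, decoding each separately, and stitching the results. The toy sketch below uses a fake decoder and skips the overlap blending that real implementations add to hide seams:
import torch

def fake_decode(latent_tile):
    # Stand-in for a real VAE decode: latents upscale 8x to pixel space
    return torch.nn.functional.interpolate(latent_tile, scale_factor=8)

def tiled_decode(latent, tile=32):
    _, _, h, w = latent.shape
    rows = []
    for y in range(0, h, tile):
        row = [fake_decode(latent[:, :, y:y + tile, x:x + tile]) for x in range(0, w, tile)]
        rows.append(torch.cat(row, dim=-1))
    return torch.cat(rows, dim=-2)

latent = torch.randn(1, 4, 128, 128)          # a 1024x1024 image in SD latent space
image = tiled_decode(latent)
print(image.shape)                            # torch.Size([1, 4, 1024, 1024])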
Using ComfyUI Manager for Troubleshooting
ComfyUI Manager provides tools for monitoring and managing your ComfyUI installation. It can help identify which custom nodes might be consuming unexpected memory and provides easy model management.
Frequently Asked Questions
What flags should I use for 8GB VRAM?
Start with xFormers installed and --force-fp16, and let ComfyUI's default memory management handle offloading. If you still hit OOM, add --cpu-vae. If still OOM, step up to --lowvram.
Does FP16 affect image quality?
For inference, quality impact is imperceptible. FP16 is standard for image generation and extensively tested. Use it unless you have specific numerical issues.
Why is generation slow with --lowvram?
The --lowvram flag uses aggressive CPU-GPU transfers for every operation. This overhead is inherent to the approach. It's the price of running on limited VRAM.
Can I use multiple attention optimizations together?
No, they're alternatives. Choose one: xFormers OR Flash Attention OR SageAttention OR attention slicing. Using multiple causes errors or unexpected behavior.
What's the difference between FP16 and BF16?
Same memory usage, different numerical representation. BF16 has larger dynamic range but less precision. Use BF16 if you have Ampere+ GPU and experience numerical issues with FP16.
Should I always use the most aggressive optimization?
No. Excessive optimization wastes speed. Use the minimum optimization needed for stable operation. If your workflow runs fine without --lowvram, don't use it.
Why do I get OOM even with all optimizations enabled?
Some models genuinely require more VRAM than available. Very large models or very high resolutions may not fit regardless of optimization. Consider cloud GPU instances for these use cases.
Does attention slicing help with quality?
No, it's mathematically equivalent to full attention. It only affects memory and speed. Use it only when memory-efficient attention modes aren't enough.
How do I know which optimization is actually helping?
Enable one at a time and check VRAM usage with nvidia-smi. This identifies which optimizations provide actual benefit for your specific workflow.
Can these optimizations help with training?
Yes, similar optimizations apply. Training also benefits from gradient checkpointing, which trades compute for memory by recomputing activations during backward pass rather than storing them.
What if I'm still having memory issues after trying everything?
Consider these additional steps:
- Update GPU drivers and CUDA
- Close all other applications using GPU
- Restart ComfyUI to clear memory fragmentation
- Verify no memory leaks in custom nodes
- Consider cloud services for models that exceed your hardware
Conclusion
VRAM optimization flags give you precise control over the memory/performance tradeoff in AI generation. Understanding what each flag actually does helps you configure optimal settings for your hardware rather than trying configurations at random.
For LoRA training, which requires different memory management, our Flux LoRA training guide covers training-specific optimization techniques.
For most users, the key optimizations are:
- FP16 precision: Half the memory with no quality loss
- Efficient attention (xFormers, Flash, or Sage): Near-linear instead of quadratic memory scaling
- Text encoder offloading: Free 1-2GB with minimal speed impact
Add more aggressive optimizations only as needed:
- VAE offloading: Moderate additional savings
- Moderate offloading (medvram-style, ComfyUI's default behavior): Balance of memory savings and speed
- lowvram: Maximum savings, significant speed cost
The goal is finding the minimum optimization level that runs your workflow reliably. More optimization than necessary wastes performance without benefit.
With this understanding, you can confidently configure ComfyUI for any hardware, predict which models will fit, and troubleshoot memory issues systematically rather than through trial and error.
Getting Started with VRAM Optimization
For users new to VRAM optimization, understanding the fundamentals before diving into specific flags prevents confusion and trial-and-error frustration.
Recommended Learning Path
Step 1 - Understand Your Hardware: Know your GPU's VRAM capacity and generation. RTX 30xx and newer support BF16 and Flash Attention, and RTX 40xx adds native FP8; older cards lack these capabilities, which affects which optimizations work.
Step 2 - Learn ComfyUI Basics: Understand how workflows function before optimizing them. Our essential nodes guide covers foundational concepts that make optimization choices clearer.
Step 3 - Monitor Before Optimizing: Use nvidia-smi to observe actual VRAM usage during your normal workflows. Understanding your baseline helps you identify which optimizations will help.
Step 4 - Apply Optimizations Incrementally: Add one optimization at a time and measure impact. This isolates which changes help and which cause issues.
Step 5 - Document Working Configurations: Record configurations that work well for your hardware and workflows. This prevents re-discovery and enables quick setup on new systems.
First Optimization Recommendations
For All Users:
- Install xFormers so memory-efficient attention is enabled automatically
- Add --force-fp16 if FP16 is not already the default on your hardware
- These two provide major benefits with no drawbacks on compatible hardware
For 8-12GB VRAM:
- Add --cpu-vae for an easy extra saving; ComfyUI offloads text encoders automatically when VRAM runs low
- Step up to --lowvram only if you are still hitting OOM
For 16GB+ VRAM:
- Try SageAttention or Flash Attention for best speed
- Usually no offloading needed
- Focus on speed rather than memory savings
Understanding OOM Error Patterns
OOM During Model Loading: Model is too large for VRAM. Use lower precision (FP16, quantization) or smaller model.
OOM During Sampling: Attention computation or activations exceed available memory. Use efficient attention (xFormers) or reduce resolution/batch size.
OOM During VAE Decode: High-resolution VAE decode exceeds memory. Use tiled VAE decoding for very high resolutions.
OOM With Multiple Models: Too many models loaded simultaneously. Offload unused models or reduce concurrent model count.
For complete beginners wanting to understand AI image generation before optimizing it, our beginner's guide provides foundational context that makes optimization choices more meaningful.