WAN 2.2 Training and Fine-Tuning: Complete Custom Video Model Guide 2025
Master WAN 2.2 fine-tuning in ComfyUI for custom video styles and characters. Complete training workflows, dataset preparation, memory optimization, and...
Quick Answer: WAN 2.2 fine-tuning adapts the model to specific characters, styles, or domains using LoRA training (24GB VRAM minimum) or full fine-tuning (40GB+ VRAM). Datasets need 150-600 video clips or images, training takes 6-20 hours, and the result improves character consistency from 4.2/10 to 9.1/10 compared to base model prompting.
- LoRA training: 24GB VRAM, 4-10 hours, 200-800MB files (recommended)
- Full fine-tuning: 40GB+ VRAM, 24-48 hours, 5.8GB files (extreme specialization)
- Dataset: 150-200 samples (character), 200-500 (style), 400+ (domain)
- Cost: $4-7 cloud GPU training or local hardware investment
- Result: 9.1/10 character consistency vs 4.2/10 with base model
I spent six weeks fine-tuning WAN 2.2 models for three different client projects, and the results fundamentally changed how I approach custom video generation. The base WAN model produces excellent generic video, but fine-tuned WAN creates video with specific stylistic characteristics, consistent characters, or specialized content types that generic models simply can't match.
In this guide, you'll get the complete WAN 2.2 fine-tuning workflow for ComfyUI, including dataset preparation strategies for video training, memory-efficient training on 24GB GPUs, LoRA vs full fine-tuning trade-offs, hyperparameter optimization for different content types, and deployment workflows that let you use your custom WAN models in production.
Why Should I Fine-Tune WAN 2.2?
The base WAN 2.2 model is trained on diverse internet video data, making it excellent for general-purpose video generation but suboptimal for specialized needs. Fine-tuning adapts the model to your specific requirements while retaining its powerful temporal understanding and motion generation capabilities.
Use cases where WAN fine-tuning provides dramatic advantages:
Consistent character generation: The base model generates different-looking characters each time even with identical prompts. A character-specific fine-tune produces consistent appearance across hundreds of generations, essential for episodic content, series production, or brand character work. For generating animation-ready first frames before training, see our WAN 2.2 text-to-image guide.
Style specialization: Want all your videos in a specific artistic style (anime, 3D render, watercolor, corporate professional)? Fine-tuning enforces that style automatically without prompt engineering every generation.
Brand consistency: Corporate clients require specific visual language. Fine-tune WAN on your brand's visual guidelines and every generated video automatically matches brand aesthetics.
Domain-specific content: Medical visualization, architectural walkthroughs, product demonstration videos. Fine-tuning on domain-specific video produces more accurate, professional results for specialized applications.
Custom motion patterns: The base model has general motion understanding, but fine-tuning on specific motion types (smooth corporate pans, dynamic action sequences, subtle portrait micro-movements) biases the model toward your preferred animation style. For advanced motion control techniques beyond training, explore our WAN 2.2 keyframe and motion control guide.
- Character consistency: Base 4.2/10, Fine-tuned 9.1/10
- Style adherence: Base 6.8/10, Fine-tuned 9.4/10
- Domain accuracy: Base 7.1/10, Fine-tuned 8.9/10
- Training cost: $40-120 in compute for professional results
- Inference speed: Identical to base model (no performance penalty)
I tested this extensively with character consistency. Using base WAN 2.2 with detailed character description prompts, I got the "same" character across 50 generations with 3.8/10 consistency (massive appearance variation). After fine-tuning on 200 images of the character, consistency jumped to 9.2/10 with minimal appearance variation across 50 generations.
The training investment (12 hours of training time, dataset preparation, hyperparameter tuning) pays off after 20-30 generations when compared to the time cost of cherry-picking acceptable outputs from base model generations or fixing consistency issues in post-production.
For context on training diffusion models generally, my Flux LoRA Training guide covers similar concepts for image models, though video training has additional temporal considerations. For another video-related training workflow, see our QWEN LoRA training guide which covers training for vision-language models.
What Hardware Do I Need for WAN Training?
WAN 2.2 fine-tuning requires significantly more resources than image model training due to the temporal dimension. Understanding hardware requirements prevents wasted effort on underpowered setups.
Minimum Training Configuration:
- GPU: 24GB VRAM (RTX 3090, RTX 4090, A5000) - see our RTX 3090 optimization guide for maximizing performance on consumer GPUs
- RAM: 32GB system RAM
- Storage: 200GB+ free SSD space
- Training time: 8-16 hours for LoRA, 24-48 hours for full fine-tune
Recommended Training Configuration:
- GPU: 40GB+ VRAM (A100, A6000) or multi-GPU setup
- RAM: 64GB system RAM
- Storage: 500GB+ NVMe SSD
- Training time: 4-8 hours for LoRA, 12-24 hours for full fine-tune
Why video training needs more resources than image training:
Video frames aren't independent. WAN processes multiple frames simultaneously to learn temporal relationships, multiplying memory requirements. Training on 16-frame video clips uses 8-12x more VRAM than training on single images of the same resolution.
Also, video datasets are massive. A modest training dataset of 200 video clips at 3 seconds each (24fps) contains 14,400 individual frames, equivalent to a 14,400-image dataset but with temporal annotation overhead.
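The frame-count arithmetic above can be sketched in a couple of lines, useful for sizing a dataset before you start collecting clips:

```python
def dataset_frame_count(num_clips, seconds_per_clip, fps=24):
    """Total individual frames a video training dataset contains."""
    return num_clips * seconds_per_clip * fps

# 200 clips x 3 seconds x 24 fps = 14,400 frames
print(dataset_frame_count(200, 3))  # 14400
```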
I attempted WAN fine-tuning on 16GB VRAM with every optimization technique available. The maximum achievable was 8-frame clips at 384x384 resolution, producing poor results. 24GB enables 16-frame clips at 512x512, the minimum viable training resolution.
Training Approach Options:
LoRA Training (recommended for most users):
- Memory efficient, runs on 24GB VRAM
- Fast training (4-10 hours)
- Small model files (200-800MB)
- Preserves base model capabilities well
- Easy to distribute and share
Full Fine-Tuning:
- Requires 40GB+ VRAM or multi-GPU
- Slow training (24-48 hours)
- Large model files (5.8GB)
- Maximum adaptation to custom data
- Harder to distribute
For 99% of use cases, LoRA training provides the best quality-to-resource ratio. Full fine-tuning only makes sense when you need extreme specialization and have abundant compute resources.
Cloud Training vs Local
Local training on owned hardware makes sense if you plan multiple fine-tunes. One-off training projects benefit from cloud GPU rental:
| Provider | GPU Type | Cost/Hour | Training Time (LoRA) | Total Cost |
|---|---|---|---|---|
| RunPod | RTX 4090 | $0.69 | 8-10 hours | $5.50-$6.90 |
| Vast.ai | RTX 4090 | $0.40-0.60 | 8-10 hours | $3.20-$6.00 |
| Lambda Labs | A100 40GB | $1.10 | 4-6 hours | $4.40-$6.60 |
A complete WAN LoRA training run costs $4-7 on cloud GPUs, far cheaper than purchasing local hardware for occasional training needs.
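A small cost estimator makes the comparison above easy to rerun with current rates. The provider rates here just mirror the table; plug in whatever your provider quotes:

```python
# Rates ($/hr) and LoRA training-time ranges from the comparison table above.
PROVIDERS = {
    "RunPod RTX 4090":  {"rate": (0.69, 0.69), "hours": (8, 10)},
    "Vast.ai RTX 4090": {"rate": (0.40, 0.60), "hours": (8, 10)},
    "Lambda A100 40GB": {"rate": (1.10, 1.10), "hours": (4, 6)},
}

def cost_range(rate, hours):
    """(min, max) total cost for one training run."""
    return (rate[0] * hours[0], rate[1] * hours[1])

for name, p in PROVIDERS.items():
    lo, hi = cost_range(p["rate"], p["hours"])
    print(f"{name}: ${lo:.2f}-${hi:.2f}")
```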
For recurring training projects (training multiple characters, regular style updates, ongoing client work), Apatero.com offers managed training infrastructure where you upload datasets and configure parameters without managing hardware, software dependencies, or monitoring training runs.
How Do I Prepare a Training Dataset?
Video training datasets require more careful preparation than image datasets because you're teaching temporal consistency and motion patterns, not just visual appearance.
Dataset Size Requirements:
The minimum viable dataset depends on training goals:
| Training Goal | Minimum Dataset | Recommended Dataset | Training Duration |
|---|---|---|---|
| Character consistency | 150-200 images or 30-50 short clips | 400+ images or 100+ clips | 6-10 hours |
| Style adaptation | 200-300 clips | 500+ clips | 8-14 hours |
| Motion specialization | 300-500 clips | 800+ clips | 10-18 hours |
| Domain specialization | 400-600 clips | 1000+ clips | 12-20 hours |
For character training specifically, high-quality images of the character work better than video clips in my testing. 300 diverse images of a character produced better consistency than 50 video clips of the same character, likely because images provide more variety in poses, angles, and lighting without motion blur or temporal artifacts.
Video Clip Specifications:
When using video data for training, follow these specifications:
Resolution: 512x512 minimum, 768x768 optimal, 1024x1024 if you have 40GB+ VRAM
Clip length: 16-24 frames (about 0.5-1 second at 24fps)
- Shorter clips (8-12 frames) don't provide enough temporal context
- Longer clips (32+ frames) drastically increase memory requirements
Frame rate: 24fps is optimal; convert to 24fps if the source differs
Quality requirements:
- No compression artifacts, use high-quality source material
- Consistent lighting within each clip (avoid clips with dramatic lighting changes)
- Stable camera movement (shaky footage teaches instability)
- Clean subject isolation (cluttered backgrounds reduce training effectiveness)
Content diversity: Include variety in:
- Camera angles (close-up, medium, wide shots)
- Lighting conditions (but consistent within clips)
- Subject positioning within frame
- Motion types (if training motion patterns)
- Image datasets: Faster to prepare, easier to source, better for character/style consistency; require 2-3x more samples than video
- Video datasets: Teach motion patterns and better temporal understanding; harder to source high-quality examples; require careful clip selection
Dataset Preparation Workflow:
Step 1: Source Material Collection
Collect 2-3x more material than your target dataset size to allow for quality filtering.
For character training:
- Collect 600-900 images to filter down to best 300-400
- Prioritize variety in poses, expressions, angles
- Consistent character appearance (same costume/appearance across images)
For style training:
- Collect 400-600 video clips to filter down to best 200-300
- Consistent stylistic characteristics across all clips
- Diverse content within the style (different subjects, scenes, compositions)
Step 2: Quality Filtering
Remove clips/images with:
- Compression artifacts or noise
- Motion blur (for images) or excessive blur (for video)
- Watermarks or overlays
- Inconsistent appearance (for character training)
- Camera shake or instability (for video)
- Dramatic lighting changes mid-clip (for video)
Quality filtering typically removes 30-50% of sourced material. Better to train on 150 high-quality examples than 300 mixed-quality examples.
Step 3: Preprocessing
Resolution standardization: Resize all content to consistent resolution (512x512 or 768x768)
Cropping and framing: Center-crop to square aspect ratio, ensure subject properly framed
Color grading (optional): Normalize colors if source material varies dramatically in color balance
Video clip extraction: If source videos are long, extract specific 16-24 frame segments with consistent content
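The crop and clip-extraction math in the preprocessing steps can be sketched with two small helper functions. These are hypothetical utilities (the actual pixel work would be done with Pillow or ffmpeg); they only compute the geometry:

```python
def center_crop_box(width, height):
    """Largest centered square crop as (left, top, right, bottom)."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)

def clip_segments(total_frames, clip_len=24, stride=48):
    """(start, end) frame indices for fixed-length segments of a longer video."""
    return [(s, s + clip_len)
            for s in range(0, total_frames - clip_len + 1, stride)]

print(center_crop_box(1920, 1080))                 # (420, 0, 1500, 1080)
print(clip_segments(120, clip_len=24, stride=48))  # [(0, 24), (48, 72), (96, 120)]
```

Feed each box to your image library's crop call, then resize the square result to 512x512 or 768x768.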
Step 4: Annotation and Captioning
Each training example needs a text caption describing the content. For video training, captions should describe both the visual content and the motion.
Example captions:
Character training (image-based):
- "Professional woman with brown hair in navy suit, front view, neutral expression, office background"
- "Professional woman with brown hair in navy suit, side profile, smiling, window lighting"
Style training (video clips):
- "Watercolor animated scene of person walking through park, smooth camera pan, soft colors, artistic style"
- "Watercolor animated close-up of face turning toward camera, gentle motion, pastel tones"
Motion specialization (video clips):
- "Smooth corporate pan across office space, steady camera movement, professional lighting"
- "Dynamic action sequence with rapid camera following subject, high energy movement"
Captions can be manual, semi-automated with BLIP or other captioning models, or a hybrid approach where you auto-generate base captions then manually refine them.
Step 5: Dataset Organization
Organize your prepared dataset in this structure:
training_dataset/
├── images/ (or videos/)
│ ├── 001.png (or 001.mp4)
│ ├── 002.png
│ ├── 003.png
│ └── ...
└── captions/
├── 001.txt
├── 002.txt
├── 003.txt
└── ...
Each image/video file has a corresponding .txt file with the same filename (different extension) containing the caption.
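A quick validation pass before training catches mismatched pairs, which otherwise surface as cryptic errors hours into a run. This is a hypothetical checker assuming the directory layout shown above:

```python
from pathlib import Path

def find_unpaired(dataset_root):
    """Return (media files missing a caption, captions missing media)."""
    media_dir = Path(dataset_root) / "images"
    cap_dir = Path(dataset_root) / "captions"
    media = {p.stem for p in media_dir.iterdir()
             if p.suffix.lower() in {".png", ".jpg", ".mp4"}}
    captions = {p.stem for p in cap_dir.glob("*.txt")}
    return sorted(media - captions), sorted(captions - media)
```

Run it once after Step 5; both returned lists should be empty before you start training.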
Dataset preparation is the most time-consuming part of training (often 60-70% of total project time), but quality here determines training success more than any other factor.
WAN LoRA Training Workflow
LoRA (Low-Rank Adaptation) training adapts WAN 2.2 to your custom content without modifying the base model directly, producing small, efficient custom model files that work alongside the base WAN model.
Training Infrastructure Setup:
The primary tool for WAN LoRA training is Kohya_ss, which supports video diffusion model training.
Installation:
git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Kohya_ss provides both GUI and command-line interfaces. The GUI is easier for first-time training, while command-line provides more control for production pipelines.
Training Configuration:
Launch Kohya GUI:
python gui.py
Configure training parameters in the GUI:
Model Settings:
- Pretrained model: Path to wan2.2_dit.safetensors
- VAE: Path to wan2.2_vae.safetensors
- Training type: LoRA
- Output directory: Where to save trained LoRA files
Dataset Settings:
- Training data directory: Path to your prepared dataset
- Resolution: 512, 768, or 1024 (matching your dataset preprocessing)
- Batch size: 1 for 24GB VRAM, 2 for 40GB+ VRAM
- Number of epochs: 10-20 for character, 15-30 for style
LoRA Settings:
- Network dimension (rank): 32-64 for characters, 64-128 for complex styles
- Network alpha: Same as network dimension (32, 64, or 128)
- LoRA type: Standard (not LoCon unless you need it)
Optimizer Settings:
- Optimizer: AdamW8bit (memory efficient) or AdamW (if VRAM allows)
- Learning rate: 1e-4 to 2e-4
- LR scheduler: cosine_with_restarts
- Scheduler warmup: 5% of total steps
Advanced Settings:
- Gradient checkpointing: Enable (reduces VRAM by ~30%)
- Mixed precision: fp16 (reduces VRAM by ~50%)
- XFormers: Enable (faster training, less VRAM)
- Clip skip: 2
Even with all optimizations enabled (gradient checkpointing, fp16, batch size 1), expect 20-22GB VRAM usage during training at 512x512. At 768x768, usage approaches 24GB. Monitor VRAM during early training steps to catch OOM issues before wasting hours.
Training Parameter Guidelines by Use Case:
Character Consistency Training:
Network Dimension: 64
Learning Rate: 1.5e-4
Epochs: 15
Batch Size: 1
Steps: 1500-2500 (depending on dataset size)
Expected training time: 6-8 hours on 24GB GPU
Style Adaptation Training:
Network Dimension: 96
Learning Rate: 1e-4
Epochs: 20
Batch Size: 1
Steps: 3000-4000
Expected training time: 10-14 hours on 24GB GPU
Motion Specialization Training:
Network Dimension: 128
Learning Rate: 8e-5
Epochs: 25
Batch Size: 1
Steps: 5000-7000
Expected training time: 14-18 hours on 24GB GPU
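The three presets above can be captured as a small lookup table. This is a hypothetical helper, not a Kohya_ss API; it only encodes the guideline numbers and derives total steps from dataset size:

```python
# Presets mirroring the use-case guidelines above.
PRESETS = {
    "character": {"network_dim": 64,  "lr": 1.5e-4, "epochs": 15},
    "style":     {"network_dim": 96,  "lr": 1.0e-4, "epochs": 20},
    "motion":    {"network_dim": 128, "lr": 8e-5,   "epochs": 25},
}

def training_config(use_case, dataset_size, batch_size=1):
    cfg = dict(PRESETS[use_case])
    # One epoch = one full pass over the dataset, so steps scale with both.
    cfg["total_steps"] = dataset_size * cfg["epochs"] // batch_size
    return cfg

print(training_config("character", 150))
```

A 150-sample character dataset at 15 epochs lands at 2,250 steps, inside the 1,500-2,500 range quoted above.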
Start training and monitor the loss curve. You should see steady loss decrease for the first 50-70% of training, then plateau. If loss doesn't decrease or increases, learning rate is likely too high.
Training Checkpoints and Testing:
Configure checkpoint saving every 500-1000 steps. This lets you test intermediate checkpoints during training to identify the optimal stopping point.
Test checkpoints by:
- Loading the checkpoint LoRA in ComfyUI
- Generating 5-10 test videos/images
- Evaluating consistency, style adherence, quality
- Comparing to previous checkpoints
Often the "best" checkpoint isn't the final one. Training can overfit, producing a model that memorizes training data rather than generalizing. Testing checkpoints from 60-80% through training finds the sweet spot.
Training Completion and Model Export:
When training completes, you'll have multiple checkpoint files. Select the best performing checkpoint (based on your testing) and rename it descriptively:
- wan2.2_character_sarah_v1.safetensors for a character LoRA
- wan2.2_style_watercolor_v1.safetensors for a style LoRA
- wan2.2_motion_corporate_v1.safetensors for a motion LoRA
The final LoRA file is typically 200-800MB depending on network dimension. This file works with your base WAN 2.2 model in ComfyUI without replacing or modifying the base model.
Using Custom WAN LoRAs in ComfyUI
Once you have a trained WAN LoRA, integrating it into ComfyUI workflows is straightforward.
LoRA Installation:
Copy your trained LoRA file to ComfyUI's LoRA directory:
cp wan2.2_character_sarah_v1.safetensors ComfyUI/models/loras/
Restart ComfyUI to detect the new LoRA.
Basic LoRA Workflow:
The workflow structure adds a LoRA loading node between model loading and generation:
WAN Model Loader → model output
↓
Load LoRA (WAN compatible) → model output with LoRA applied
↓
WAN Text Encode (conditioning)
↓
WAN Sampler (image or video) → Output
Load LoRA Node Configuration:
- lora_name: Select your custom LoRA (wan2.2_character_sarah_v1.safetensors)
- strength_model: 0.7-1.0 (how strongly the LoRA affects generation)
- strength_clip: 0.7-1.0 (how strongly the LoRA affects text understanding)
Start with both strengths at 1.0 (full LoRA influence). If the effect is too strong or outputs look overtrained, reduce to 0.7-0.8.
Prompt Considerations with LoRAs:
Custom LoRAs change how prompts should be structured:
Character LoRA prompting: You can use much shorter prompts because the character appearance is baked into the LoRA.
Without LoRA: "Professional woman with shoulder-length brown hair, oval face, warm smile, hazel eyes, wearing navy business suit, modern office environment, high quality"
With character LoRA: "Sarah in office, professional setting, high quality"
The LoRA provides character appearance, letting you focus prompts on scene, mood, and composition rather than repeating character details.
Style LoRA prompting: The style is automatically applied, so prompts focus on content not style.
Without LoRA: "Watercolor painting style animated scene of person walking in park, soft colors, artistic watercolor aesthetic, painterly look"
With style LoRA: "Person walking in park, trees and path visible, gentle movement"
The LoRA enforces watercolor style automatically.
Combining Multiple LoRAs:
You can stack multiple WAN LoRAs for combined effects:
WAN Model Loader
↓
Load LoRA (character LoRA, strength 0.9)
↓
Load LoRA (style LoRA, strength 0.8)
↓
WAN Sampler → Output with both character and style applied
When stacking LoRAs, reduce individual strengths slightly (0.8-0.9 instead of 1.0) to prevent over-constraining generation.
- Single LoRA: Strength 0.9-1.0
- Two LoRAs: Strength 0.7-0.9 each
- Three+ LoRAs: Strength 0.6-0.8 each
- Lower strengths preserve more base model capabilities
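The strength guidelines above can be wrapped in a tiny helper that returns the midpoint of each recommended range. A hypothetical convenience function, assuming the ranges listed:

```python
def recommended_strength(num_loras):
    """Midpoint of the guideline strength range for a given stack depth."""
    ranges = {1: (0.9, 1.0), 2: (0.7, 0.9), 3: (0.6, 0.8)}
    lo, hi = ranges.get(min(num_loras, 3), (0.6, 0.8))
    return round((lo + hi) / 2, 2)

print(recommended_strength(2))  # 0.8
```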
Testing LoRA Performance:
After loading your custom LoRA, run systematic tests:
- Generate 10 outputs with just the LoRA, no specific prompts (tests baseline LoRA effect)
- Generate 10 outputs with LoRA + varied prompts (tests prompt flexibility with LoRA)
- Compare to base model outputs without LoRA (confirms LoRA adds desired characteristics)
- Test at different LoRA strengths (0.5, 0.7, 0.9, 1.0) to find optimal setting
If the LoRA produces good results at strength 0.6-0.8 but worse results at 1.0, the training likely overfit. Use lower strength settings or retrain with different parameters.
LoRA Versioning for Production:
For production use, maintain organized LoRA versions:
loras/
├── characters/
│ ├── sarah_v1.safetensors (initial training)
│ ├── sarah_v2.safetensors (retrained with more data)
│ └── sarah_v3.safetensors (current production version)
├── styles/
│ ├── corporate_professional_v1.safetensors
│ └── corporate_professional_v2.safetensors
└── motion/
└── smooth_pans_v1.safetensors
Version naming lets you A/B test different training iterations and roll back if newer versions perform worse.
For teams using custom WAN LoRAs across multiple artists, Apatero.com provides LoRA version management and sharing, letting team members access the latest approved custom models without manual file distribution.
Hyperparameter Tuning for Optimal Results
Training success depends heavily on hyperparameter selection. Understanding which parameters matter most and how to tune them produces dramatically better results.
Learning Rate: The Most Critical Parameter
Learning rate determines how quickly the model adapts to training data. Too high causes unstable training and poor results. Too low wastes time and may never converge.
Recommended learning rate ranges by training type:
| Training Goal | Learning Rate | Why |
|---|---|---|
| Character consistency | 1e-4 to 2e-4 | Higher LR learns character features quickly |
| Style adaptation | 8e-5 to 1.5e-4 | Moderate LR balances style learning and base preservation |
| Motion patterns | 5e-5 to 1e-4 | Lower LR preserves temporal understanding while adapting motion |
| Domain specialization | 8e-5 to 1.2e-4 | Moderate LR for balanced domain adaptation |
If you're unsure, start with 1e-4. Monitor the loss curve during the first 500 steps:
- Loss decreasing steadily: Learning rate is good
- Loss unstable/spiking: Learning rate too high, reduce to 5e-5
- Loss barely changing: Learning rate too low, increase to 2e-4
Network Dimension (Rank): Capacity vs Overfitting Trade-off
Network dimension determines LoRA capacity. Higher dimension allows learning more complex patterns but risks overfitting on small datasets.
| Network Dim | LoRA Size | Use Case | Overfitting Risk |
|---|---|---|---|
| 32 | ~200MB | Simple character, minimal style change | Low |
| 64 | ~400MB | Standard character or style adaptation | Medium |
| 96 | ~600MB | Complex style or detailed character | Medium-High |
| 128 | ~800MB | Comprehensive domain adaptation | High |
Match network dimension to dataset size:
- 100-200 samples: Use dim 32-48
- 200-400 samples: Use dim 48-64
- 400-800 samples: Use dim 64-96
- 800+ samples: Use dim 96-128
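The sample-count-to-dimension mapping above reduces to a simple lookup. This hypothetical helper picks the upper end of each guideline range; drop one tier if you see overfitting:

```python
def suggested_network_dim(num_samples):
    """Map dataset size to a network dimension per the guidelines above."""
    if num_samples < 200:
        return 48
    if num_samples < 400:
        return 64
    if num_samples < 800:
        return 96
    return 128

print(suggested_network_dim(300))  # 64
```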
Larger dimension doesn't automatically mean better quality. I tested character training at dimensions 32, 64, and 128 with a 300-image dataset. Dimension 64 produced the best results (9.2/10 consistency), while dimension 128 overfit (7.8/10 consistency, memorized training poses).
Batch Size: Memory vs Training Efficiency
Larger batch sizes provide more stable gradients but require more VRAM.
| Batch Size | VRAM Usage (512x512) | Training Speed | Gradient Stability |
|---|---|---|---|
| 1 | 20-22GB | Baseline | Less stable |
| 2 | 38-40GB | 1.6x faster | More stable |
| 4 | 72GB+ | 2.8x faster | Most stable |
On 24GB GPUs, batch size 1 is required. On 40GB GPUs, batch size 2 provides better training quality and 60% faster training time. Batch size 4+ requires multi-GPU setups.
If using batch size 1, enable gradient accumulation to simulate larger batches:
- Set gradient accumulation steps to 2-4
- This accumulates gradients over 2-4 training steps before updating weights
- Provides some batch size stability benefits without VRAM requirements
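The accumulation trick works because gradients are summed over several micro-batches before each weight update, so the optimizer effectively sees a larger batch:

```python
def effective_batch_size(batch_size, accumulation_steps):
    """Batch size the optimizer effectively sees: gradients from
    `accumulation_steps` micro-batches are summed before each update."""
    return batch_size * accumulation_steps

print(effective_batch_size(1, 4))  # 4
```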
Number of Epochs: Finding the Sweet Spot
Epochs determine how many times the model sees the entire dataset. Too few epochs undertrain, too many overfit.
| Dataset Size | Recommended Epochs | Total Steps (approx) |
|---|---|---|
| 100-200 samples | 15-20 | 1500-4000 |
| 200-400 samples | 12-18 | 2400-7200 |
| 400-800 samples | 10-15 | 4000-12000 |
| 800+ samples | 8-12 | 6400-9600+ |
Monitor validation loss (if you set up a validation set) or periodically test checkpoints. The best checkpoint is often from 60-80% through total training, not the final checkpoint.
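If you save checkpoints at a fixed interval, the ones worth testing first are those landing in the 60-80% window noted above. A hypothetical helper for picking them:

```python
def candidate_checkpoints(total_steps, save_every=500, lo=0.6, hi=0.8):
    """Checkpoint steps landing in the window where the best
    checkpoint is typically found (60-80% of training)."""
    return [s for s in range(save_every, total_steps + 1, save_every)
            if lo * total_steps <= s <= hi * total_steps]

print(candidate_checkpoints(2500))  # [1500, 2000]
```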
LR Scheduler: Controlling Learning Rate Over Time
LR schedulers adjust learning rate during training. The best scheduler for WAN training is "cosine_with_restarts":
- Starts at full learning rate
- Gradually decreases following cosine curve
- Periodically "restarts" to higher LR to escape local minima
- Number of restarts: 2-3 for most training runs
Alternative schedulers:
- Constant: No LR change, only use if you know your LR is perfect
- Polynomial: Gentle decrease, good for long training runs
- Cosine (no restarts): Smooth decrease, safe default
Warmup steps (usually 5-10% of total steps) start LR at near-zero and ramp up to target LR, providing training stability in early steps.
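The combined warmup-plus-cosine-restarts shape can be sketched as a pure function of the step number. This is an illustrative approximation of the schedule, not Kohya_ss's exact implementation:

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-4, warmup_frac=0.05, num_cycles=3):
    """Linear warmup, then cosine decay that restarts num_cycles times."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * step / warmup            # ramp from ~0 to base_lr
    progress = (step - warmup) / max(1, total_steps - warmup)
    cycle_pos = (progress * num_cycles) % 1.0     # resets to 0 at each restart
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))
```

Plotting `lr_at_step` over 0..total_steps shows the ramp, the cosine curves, and the periodic jumps back to the full rate.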
Parameters don't work in isolation. High learning rate + high network dimension + small dataset = severe overfitting. Low learning rate + low network dimension + large dataset = undertraining. Balance all parameters based on your specific training scenario.
A/B Testing Hyperparameters:
For production training projects, run 2-3 training configurations in parallel with different hyperparameters:
Configuration A (conservative):
- LR: 8e-5, Dim: 64, Epochs: 12
Configuration B (standard):
- LR: 1.2e-4, Dim: 64, Epochs: 15
Configuration C (aggressive):
- LR: 1.5e-4, Dim: 96, Epochs: 18
Train all three, test their outputs, and identify which hyperparameter set produces the best results for your specific use case. This empirical approach beats theoretical optimization.
Production Deployment and Version Management
Training custom WAN models is valuable only if you can reliably deploy and use them in production workflows. Proper deployment and versioning prevents chaos as you accumulate custom models.
Model Organization Structure:
Organize custom WAN LoRAs by project, version, and type:
production_models/
├── characters/
│ ├── client_brandX/
│ │ ├── character_protagonist_v1_20250110.safetensors
│ │ ├── character_protagonist_v2_20250115.safetensors (current)
│ │ └── training_notes.md
│ └── client_brandY/
│ └── character_mascot_v1_20250112.safetensors
├── styles/
│ ├── corporate_professional_v3_20250108.safetensors (current production)
│ ├── corporate_professional_v2_20250105.safetensors (deprecated)
│ └── watercolor_artistic_v1_20250114.safetensors
└── motion/
└── smooth_corporate_pans_v1_20250109.safetensors
Include date stamps in filenames for easy chronological tracking. Maintain training_notes.md documenting dataset size, hyperparameters, and performance observations.
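Generating filenames programmatically keeps the date-stamped convention consistent across a team. A hypothetical builder matching the layout above:

```python
from datetime import date

def versioned_name(category, name, version, when=None):
    """Build a date-stamped model filename, e.g. character_mascot_v1_20250112."""
    stamp = (when or date.today()).strftime("%Y%m%d")
    return f"{category}_{name}_v{version}_{stamp}.safetensors"

print(versioned_name("character", "protagonist", 2, date(2025, 1, 15)))
# character_protagonist_v2_20250115.safetensors
```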
Version Changelog:
For each model version, document:
- Date trained: When was this version created
- Dataset: How many samples, what types of content
- Hyperparameters: LR, dimension, epochs, batch size
- Changes from previous version: "Added 50 more character expressions", "Reduced network dim to fix overfitting"
- Test results: Consistency scores, quality ratings, known issues
- Production status: "Current", "Testing", "Deprecated"
Example training_notes.md:
# Character: Brand X Protagonist
## v2 - 2025-01-15 (CURRENT PRODUCTION)
- Dataset: 350 images (added 100 new expressions)
- Hyperparameters: LR 1.2e-4, Dim 64, Epochs 15, Batch 1
- Changes: Expanded facial expression range, added more lighting variations
- Test results: 9.2/10 consistency, 8.9/10 prompt flexibility
- Issues: None identified
- Status: Production approved
## v1 - 2025-01-10 (DEPRECATED)
- Dataset: 250 images
- Hyperparameters: LR 1.5e-4, Dim 64, Epochs 18
- Test results: 8.1/10 consistency, limited expression range
- Issues: Struggled with non-neutral expressions
- Status: Superseded by v2
Testing Protocol Before Production Deployment:
Never deploy a custom LoRA to production without systematic testing:
Phase 1: Technical Validation (1-2 hours)
- Generate 20 test outputs at various LoRA strengths (0.6, 0.8, 1.0)
- Test with diverse prompts covering expected use cases
- Verify no obvious artifacts, errors, or quality issues
- Confirm VRAM usage and generation speed acceptable
Phase 2: Quality Assessment (2-4 hours)
- Generate 50-100 outputs with production-like prompts
- Evaluate consistency, style adherence, prompt flexibility
- Compare to base model outputs and previous LoRA version
- Identify any edge cases or failure modes
Phase 3: Production Trial (1-2 days)
- Use in limited production capacity (10-20% of workload)
- Collect feedback from end users or clients
- Monitor for issues not caught in controlled testing
- Verify performance under production conditions
Only after passing all three phases should a LoRA be marked "production ready" and used for all workloads.
Rollback Procedures:
Maintain previous version LoRAs even after deploying new versions. If issues emerge:
- Immediately revert to previous stable version
- Document the issue with new version
- Generate comparative examples showing the problem
- Determine if issue requires retraining or just parameter adjustment
- Fix and re-test before attempting deployment again
Quick rollback capability (keeping old versions accessible) prevents production disruption when new versions have unexpected issues.
Multi-User Team Deployment:
For teams using custom WAN models:
Centralized Model Repository:
- Store production models in shared network location or cloud storage
- Single source of truth for current production versions
- Prevents team members using outdated or deprecated models
Model Update Notifications:
- When new model versions deploy, notify team
- Include changelog and any workflow changes required
- Provide example outputs demonstrating improvements
Access Control:
- Training role: Can create and test new models
- Production role: Can use production-approved models only
- Admin role: Can approve models for production deployment
For professional deployment, Apatero.com provides managed custom model deployment where trained models are version-controlled, team-accessible, and deployable with access permissions, eliminating manual model file management.
Performance Monitoring:
Track these metrics for production custom models:
- Consistency score: Manual evaluation of output consistency (rate 1-10)
- Generation speed: Any performance regression vs base model
- Prompt flexibility: Can the model handle unexpected prompts gracefully
- User satisfaction: Feedback from end users or clients
- Error rate: How often does generation fail or produce unusable outputs
Monthly review of these metrics identifies models needing retraining or replacement.
Troubleshooting Training Issues
WAN training fails in specific ways. Recognizing issues early and knowing the fixes saves time and compute costs.
Problem: Training loss doesn't decrease
Loss remains flat or increases during training, indicating no learning.
Common causes and fixes:
- Learning rate too low: Increase LR from 5e-5 to 1e-4 or 2e-4
- Frozen layers: Verify all trainable layers are unfrozen in config
- Dataset too small: Need minimum 100-150 samples for LoRA training
- Corrupted base model: Re-download wan2.2_dit.safetensors
- Incorrect caption format: Verify captions are plain text, not empty
Problem: Training loss decreases then suddenly spikes
Loss decreases normally for a while, then jumps up dramatically and doesn't recover.
This indicates learning rate too high or gradient explosion.
Fixes:
- Reduce learning rate by 50% (2e-4 → 1e-4)
- Enable gradient clipping (clip norm 1.0)
- Reduce batch size if using batch size 2+
- Check for corrupted training samples (one bad sample can cause spikes)
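Gradient clipping is usually a one-line setting in the trainer (PyTorch-based tools wrap `torch.nn.utils.clip_grad_norm_`). To show what that setting actually does, here is the clipping math in a pure-Python sketch:

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    """Scale gradients so their global L2 norm does not exceed max_norm.
    This mirrors the behavior of torch.nn.utils.clip_grad_norm_."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total_norm

# A spiking gradient (norm 5.0) gets scaled back to norm <= 1.0,
# so one bad batch cannot blow up the weights.
clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
```

Clipping does not fix the underlying cause of a spike (bad samples, high LR), but it prevents one exploding step from destroying an otherwise healthy run.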
Problem: Model overfits to training data
Outputs look great for training data content but completely fail for new prompts.
Overfitting indicators:
- Training loss very low (under 0.01) but validation loss high
- Outputs reproduce specific training samples nearly exactly
- New prompts produce artifacts or ignore prompt content
Fixes:
- Reduce network dimension (128 → 64 or 64 → 32)
- Reduce training epochs (stop training earlier)
- Increase dataset size (add more diverse samples)
- Increase regularization (if your training framework supports dropout/weight decay)
- Use lower LoRA strength during inference (0.6-0.7 instead of 1.0)
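Checkpoint selection can be scripted if your trainer logs validation loss. A sketch, assuming a hypothetical log of `(step, train_loss, val_loss)` tuples; the 3x train/val gap threshold is an illustrative heuristic, not a standard rule:

```python
def pick_best_checkpoint(checkpoints):
    """checkpoints: list of (step, train_loss, val_loss) tuples.
    Picks the checkpoint with the lowest validation loss; a large
    train/val gap at that point flags likely overfitting."""
    step, train_loss, val_loss = min(checkpoints, key=lambda c: c[2])
    return step, val_loss > 3 * train_loss  # heuristic overfit flag

# Hypothetical log: train loss keeps falling, validation bottoms out earlier.
log = [(1000, 0.08, 0.09), (2000, 0.04, 0.06),
       (3000, 0.02, 0.05), (4000, 0.008, 0.07)]
best_step, overfitting = pick_best_checkpoint(log)
```

In this hypothetical log the final checkpoint is not the best one, which matches the pattern described above: validation loss bottoms out partway through training while training loss keeps dropping.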
Problem: CUDA out of memory during training
Training fails with OOM errors.
Fixes in priority order:
- Enable gradient checkpointing (30% VRAM reduction)
- Enable mixed precision (fp16) (50% VRAM reduction)
- Reduce batch size to 1
- Reduce resolution (768 → 512)
- Reduce network dimension (96 → 64)
- Reduce gradient accumulation steps if using them
If all optimizations still hit OOM, your GPU doesn't have enough VRAM for WAN training at your target resolution.
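As an illustration, the optimizations above map to a handful of config knobs in most LoRA trainers. The key names below are hypothetical (kohya-style); check your training tool's documentation for the exact spelling:

```python
# Hypothetical memory-saving settings for a WAN LoRA trainer, applied
# in the same priority order as the list above. Key names vary between
# frameworks; treat these as illustrative, not authoritative.
train_config = {
    "gradient_checkpointing": True,    # ~30% VRAM reduction
    "mixed_precision": "fp16",         # ~50% VRAM reduction
    "train_batch_size": 1,
    "resolution": (512, 512),          # drop from 768 if still OOM
    "network_dim": 64,                 # drop from 96 if still OOM
    "gradient_accumulation_steps": 1,  # reduce if using a higher value
}
```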
Problem: Training extremely slow
Training takes 2-3x longer than expected.
Causes:
- XFormers not enabled: Enable for 40% speedup
- CPU bottleneck: Check CPU usage, slow data loading from disk
- Using HDD instead of SSD: Move dataset to SSD (3-5x faster data loading)
- GPU not fully utilized: Check GPU utilization (should be 95-100%)
- Other processes consuming GPU: Close browsers, other AI tools
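To separate a disk bottleneck from a GPU one, a rough sequential-read benchmark on the drive holding your dataset is enough: HDDs typically land near 100-150 MB/s, SATA SSDs near 500 MB/s. A sketch (note the caveat in the docstring — the OS page cache inflates the number for freshly written files, so treat the result as an upper bound):

```python
import os
import time

def measure_read_mb_s(path, size_mb=64):
    """Write size_mb of random data, then time a sequential read.
    Caveat: the OS page cache inflates results for freshly written
    files, so real cold-read HDD numbers will be lower."""
    with open(path, "wb") as f:
        f.write(os.urandom(size_mb * 1024 * 1024))
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read()
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size_mb / elapsed

# Example: point this at a scratch file on the drive holding your dataset.
# speed = measure_read_mb_s("/path/to/dataset/_bench.bin")
```

If the result is in HDD territory, moving the dataset to an SSD will usually do more for training speed than any GPU-side tweak.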
Problem: Output quality worse than base model
The custom LoRA produces lower quality outputs than base WAN 2.2 without LoRA.
This indicates that training damaged the base model's capabilities.
Causes:
- Learning rate too high: Model overtrained, reduce to 5e-5 or 8e-5
- Too many epochs: Training stopped too late, use an earlier checkpoint
- Network dimension too high for dataset size: Reduce dimension
- Training data quality issues: Low-quality training data teaches the model low-quality outputs
Prevention: Test multiple checkpoints during training to find optimal stopping point before quality degrades.
Problem: LoRA has no visible effect
Loading the trained LoRA in ComfyUI produces outputs identical to base model.
Causes:
- LoRA strength set to 0: Increase to 0.8-1.0
- LoRA incompatible with base model version: Retrain with correct base model
- Training didn't save properly: Check LoRA file size (should be 200-800MB)
- Training steps too few: Model didn't train long enough, increase epochs
- Learning rate too low: Model barely learned anything, increase LR and retrain
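The file-size check above is easy to automate. A small sketch, assuming the 200-800MB expected range quoted in this guide:

```python
from pathlib import Path

def lora_size_ok(path, min_mb=200, max_mb=800):
    """Flag LoRA files outside the expected 200-800MB range.
    A file of only a few KB usually means the trainer saved an
    empty or malformed adapter that will have no visible effect."""
    size_mb = Path(path).stat().st_size / (1024 * 1024)
    return min_mb <= size_mb <= max_mb, round(size_mb, 1)
```

Running this on every saved checkpoint catches a silently failed save before you waste time debugging strengths and base-model versions in ComfyUI.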
Final Thoughts
WAN 2.2 fine-tuning transforms the model from general-purpose video generation to specialized tool precisely matching your production needs. The investment in dataset preparation, training time, and hyperparameter tuning pays dividends across dozens or hundreds of subsequent generations where you need consistent characters, specific styles, or domain-specialized content.
The key to successful WAN training is quality over quantity in datasets. 200 carefully selected, high-quality training samples with accurate captions produce better results than 1000 mixed-quality samples with poor annotations. Spend time on dataset curation, and training becomes straightforward.
For most use cases, LoRA training on 24GB GPUs provides the optimal balance of resource requirements, training time, and output quality. Full fine-tuning rarely justifies its 3-4x higher compute cost unless you need extreme specialization.
The workflows in this guide cover everything from infrastructure setup to production deployment. Start with a small test project (100-150 training samples, 6-8 hours training time) to understand the complete process before investing in larger production training runs. Once you've completed one successful training cycle, subsequent projects become routine.
Whether you train locally or use managed training on Apatero.com (which handles all infrastructure, monitoring, and deployment automatically), custom WAN models improve your video generation from generic AI output to branded, consistent, professional content that meets specific client requirements. That capability is increasingly essential as AI video generation moves from experimental to production-grade applications.
Frequently Asked Questions
Is LoRA training enough or do I need full fine-tuning?
LoRA provides 90-95% of full fine-tuning quality for 99% of use cases at 1/3 the compute cost and 1/4 the time. Full fine-tuning is only justified for extreme specialization where you need maximum adaptation and have 40GB+ VRAM available. Start with LoRA training.
Can I train WAN on consumer GPUs like RTX 3090?
Yes! RTX 3090 (24GB VRAM) handles LoRA training at 512x512 resolution with 16-frame clips. Use FP16 precision, gradient checkpointing, and batch size 1. Training takes 8-12 hours for character LoRAs. Full fine-tuning requires 40GB+ (A100/A6000).
How many training samples do I really need?
Character consistency: 150-200 high-quality images minimum, 300-400 optimal. Style adaptation: 200-300 video clips minimum, 500+ optimal. Domain specialization: 400-600 clips minimum, 1000+ optimal. Quality beats quantity - better to have 150 excellent samples than 500 mixed-quality ones.
What learning rate should I use?
Start with 1e-4 for most training. Character training can go up to 1.5e-4 or 2e-4. Style training uses 8e-5 to 1.5e-4. Motion training uses 5e-5 to 1e-4 (lower preserves temporal understanding). Monitor the loss curve during the first 500 steps and adjust if it's unstable or flat.
How do I know when training is complete?
Test checkpoints every 500-1000 steps. The best checkpoint is often at 60-80% through training, not the final checkpoint. Signs of completion: loss plateau, consistent quality across test generations, no improvement over the last 3-4 checkpoints. Don't overtrain - it causes memorization, not generalization.
Can I train on images instead of video clips?
Yes, especially for character training. 300 diverse images of a character produce better consistency than 50 video clips. Images are easier to source, faster to prepare, and avoid motion blur issues. Use video clips for motion pattern training or styles that require temporal understanding.
What if I only have 12GB VRAM?
WAN training requires 24GB minimum for viable quality (512x512, 16 frames). With 12GB, you're limited to 8-frame clips at 384x384, which produces poor results. Solutions: use cloud GPU rental ($4-7 per training run), or train image-based character LoRAs and use them with WAN video generation.
How much does cloud GPU training cost?
RunPod RTX 4090: $0.69/hour × 8-10 hours = $5.50-$6.90. Vast.ai RTX 4090: $0.40-0.60/hour × 8-10 hours = $3.20-$6.00. Lambda Labs A100: $1.10/hour × 4-6 hours = $4.40-$6.60. One WAN LoRA training costs $4-7 on cloud GPUs.
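As a sanity check on those figures, the per-run cost is just hourly rate times expected hours:

```python
def training_cost(rate_per_hour, hours_low, hours_high):
    """Cloud training cost range = hourly rate x expected training hours."""
    return rate_per_hour * hours_low, rate_per_hour * hours_high

# RunPod RTX 4090 rates quoted above: $0.69/hour for 8-10 hours.
low, high = training_cost(0.69, 8, 10)
```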
Can I combine multiple LoRAs (character + style)?
Yes! Stack LoRAs in ComfyUI: Load character LoRA (strength 0.9) → Load style LoRA (strength 0.8) → Generate. When combining, reduce individual strengths slightly (0.8-0.9 each) to prevent over-constraining. Test different strength combinations for optimal results.
How do I prevent overfitting during training?
Use appropriate network dimension for dataset size (32-64 for 200-400 samples), stop training when validation loss plateaus, test checkpoints regularly (don't wait for training end), use conservative learning rates, and generate test samples every 500 steps to catch overfitting early.