How to Train WAN 2.2 LoRA for Person: The Pro Method That Actually Works
Complete guide to training WAN 2.2 LoRAs for consistent person/character video generation. Dataset prep, optimal settings, and pro techniques.
Training LoRAs for WAN 2.2 is nothing like training them for image models. I learned this the hard way after wasting 40+ hours applying SD LoRA techniques to video generation. The dual-model architecture, the motion considerations, the dataset requirements. Everything is different.
Quick Answer: WAN 2.2 person LoRA training requires sigmoid time step scheduling, 10-30 varied images/clips, 3000-5000 training steps at 0.0002 learning rate, and Differential Output Preservation set to "person" for character training. Use AI Toolkit or diffusion-pipe for the actual training.
- Use Sigmoid time step type for person/character training specifically
- 10-30 images/clips with varied poses, lighting, and backgrounds
- 3000-5000 steps at 0.0002 learning rate for faster convergence
- DOP (Differential Output Preservation) set to "person" preserves base realism
- Training produces TWO LoRAs: high_noise and low_noise for different generation phases
- Expect 24-72 hours training time depending on hardware
Why WAN 2.2 LoRA Training Is Different
Let me explain the architecture first, because this is why standard LoRA techniques fail.
WAN 2.2 uses a Mixture of Experts (MoE) architecture with separate handling for high-noise and low-noise generation phases. When you train a LoRA, you're actually training two specialized models:
- high_noise_lora: Optimized for initial motion planning and temporal structure
- low_noise_lora: Optimized for refining motion details and smooth transitions
Both get applied during generation at different stages. If you only train one or use the wrong settings, your person LoRA won't transfer the identity properly across video frames.
This is why SD LoRA knowledge doesn't translate directly. Different architecture, different training approach.
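To make the dual-expert idea concrete, here is a tiny sketch of the phase split. The switch threshold and naming are illustrative only; the real handoff point is managed by the sampler or workflow, not a number you normally touch:
# Conceptual sketch: which expert/LoRA pair handles a given noise level.
def pick_expert(noise_level, switch_point=0.875):
    # Early, high-noise steps plan coarse motion and temporal structure.
    # Late, low-noise steps refine detail and smooth transitions.
    return "high_noise" if noise_level >= switch_point else "low_noise"

schedule = [1.0 - i / 9 for i in range(10)]  # 10 steps from pure noise to clean
print([pick_expert(s) for s in schedule])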
Hardware Reality Check
I'll be honest about the hardware requirements because I've seen people start this process without understanding the commitment.
Minimum viable:
- 24GB VRAM GPU (RTX 3090/4090 or A6000)
- 64GB system RAM
- 500GB+ fast storage
- Training time: 2-3 days
Comfortable setup:
- 48GB+ VRAM (A6000, dual GPUs)
- 128GB system RAM
- NVMe storage
- Training time: 12-24 hours
Low VRAM option (16-24GB):
- Enable VRAM block swapping
- Significantly longer training times
- Works, just slower
On an NVIDIA A6000, training took me about 24 hours for a solid person LoRA. Consumer hardware takes 2-3 days but absolutely works.
Dataset Preparation: The Most Important Step
Here's the thing. Your LoRA will only be as good as your dataset. I've seen people spend 48 hours training on garbage data and wonder why results are bad.
Image Requirements for Person LoRAs
Quantity: 10-30 high-quality images or short clips
Variety is critical:
- Multiple poses (front, side, three-quarter angles)
- Different backgrounds (don't let the model learn background = person)
- Various lighting conditions
- Different expressions if relevant
- Multiple outfits unless clothing is part of identity
Quality requirements:
- Clear, well-lit images
- Subject fully visible (not cropped awkwardly)
- No heavy filters or extreme stylization
- Consistent person across all images
- Resolution: at least 512x512, ideally higher
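Before committing to a multi-day run, it is worth sanity-checking the folder against these requirements. A short script like this (assuming a flat folder of images and the Pillow library) catches undersized files and miscounts early:
# Quick dataset check: image count and minimum resolution.
from pathlib import Path
from PIL import Image

dataset_dir = Path("dataset")  # adjust to your dataset folder
images = [p for p in sorted(dataset_dir.iterdir())
          if p.suffix.lower() in {".jpg", ".jpeg", ".png", ".webp"}]
print(f"Found {len(images)} images (target: 10-30)")
for path in images:
    with Image.open(path) as img:
        w, h = img.size
    if min(w, h) < 512:
        print(f"WARNING: {path.name} is {w}x{h}, below the 512x512 minimum")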
Video Clip Requirements
If training from video clips instead of images:
- 4-8 seconds per clip
- 256x256 minimum resolution during training
- 12-20 high-quality clips recommended
- Clips should show representative motions you want to reproduce
- Avoid clips with occlusion or motion blur
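For cutting raw footage down to training-friendly clips, plain ffmpeg does the job. The sketch below assumes ffmpeg is installed and on your PATH; the start time, clip length, and output size are example values you will change per clip:
# Trim one 6-second training clip from longer footage using ffmpeg.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-ss", "00:00:12",      # where in the source the clip starts
    "-t", "6",              # clip length in seconds (aim for 4-8s)
    "-i", "raw_footage.mp4",
    "-vf", "scale=-2:480",  # keep the short side comfortably above 256px
    "-an",                  # drop audio; it is not used for training
    "clip_001.mp4",
], check=True)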
Caption Format
Every image/clip needs captions. This is the structure that works:
[trigger token], [description of person], [description of scene/pose]
Example:
zxq-person, a woman with long dark hair, wearing a blue dress, standing in a garden, natural lighting, full body shot
The trigger token (like "zxq-person") should be unique and not a real word. This becomes how you invoke the LoRA during generation.
Captioning tips:
- Include appearance details that define the person
- Describe clothing, lighting, and framing
- Be consistent with terminology across captions
- Don't include elements you don't want the LoRA to learn
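Most trainers, AI Toolkit included, read captions from a .txt file sitting next to each image with the same base name (check your tool's docs to confirm). A small helper like this writes caption stubs with the trigger token already in place, which you then hand-edit per image:
# Write a sidecar .txt caption stub next to each image, starting with the trigger token.
from pathlib import Path

dataset_dir = Path("dataset")
trigger = "zxq-person"
for img in sorted(dataset_dir.glob("*.jpg")):
    caption_file = img.with_suffix(".txt")
    if caption_file.exists():
        continue  # never overwrite captions you already wrote
    caption_file.write_text(f"{trigger}, DESCRIBE APPEARANCE, CLOTHING, POSE, LIGHTING, FRAMING\n")
    print(f"wrote {caption_file.name}")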
Training Tools and Setup
AI Toolkit Method
AI Toolkit is the most common approach for WAN 2.2 LoRA training. Here's the configuration that works:
Key settings:
model_type: wan2.2
time_step_type: sigmoid # Critical for person training
learning_rate: 0.0002 # Higher than default for faster training
max_train_steps: 5000 # More steps = better results, longer time
dop_preservation_class: person # Enables DOP for character training
Sigmoid vs other time step types:
- Sigmoid is specifically designed for person/character training
- Produces better identity preservation
- Other types (uniform, cosine) work better for style/motion LoRAs
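If you are curious what the sigmoid setting actually changes: it affects how training timesteps are sampled. A common formulation (my understanding of the idea; AI Toolkit's exact implementation may differ) passes a normal sample through a sigmoid, which clusters training around mid-range noise levels instead of spreading it uniformly:
# Sigmoid (logit-normal style) timestep sampling versus uniform sampling.
import torch

torch.manual_seed(0)
uniform_t = torch.rand(8)                  # uniform: every noise level equally likely
sigmoid_t = torch.sigmoid(torch.randn(8))  # sigmoid: clustered around mid noise levels
print("uniform:", [round(x, 2) for x in uniform_t.tolist()])
print("sigmoid:", [round(x, 2) for x in sigmoid_t.tolist()])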
Diffusion-pipe Method
Diffusion-pipe is an alternative to AI Toolkit, particularly popular with researchers.
Setup requires:
- Enable WSL (Windows) or use native Linux
- Install Ubuntu and diffusion-pipe
- Configure training parameters
Diffusion-pipe gives you more low-level control but has a steeper learning curve. I recommend AI Toolkit for first-time trainers.
DOP (Differential Output Preservation)
This is the secret sauce for person LoRAs that actually look good.
DOP helps maintain WAN's strong base realism while learning your specific person. Without it, LoRAs often degrade overall quality while learning the new identity.
How to configure:
- Set preservation class to "person" for character training
- For style LoRAs, use different preservation classes
- The model preserves base capabilities while learning new content
I cannot stress enough how much DOP improved my results. Before using it, my person LoRAs had an "uncanny valley" quality. After, they maintain WAN's natural motion and appearance.
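My mental model of what DOP does (a rough conceptual sketch, not AI Toolkit's actual internals): alongside the usual loss on your dataset, the trainer penalizes the LoRA-patched model for drifting away from the frozen base model's predictions on generic examples of the preservation class, here "person". Roughly:
# Conceptual sketch of a preservation-style loss. Names are illustrative only.
import torch.nn.functional as F

def dop_style_loss(pred_lora_dataset, target_dataset,
                   pred_lora_person, pred_base_person, preservation_weight=1.0):
    identity_loss = F.mse_loss(pred_lora_dataset, target_dataset)    # learn your person
    preservation_loss = F.mse_loss(pred_lora_person,                 # stay close to the
                                   pred_base_person.detach())        # frozen base on "person"
    return identity_loss + preservation_weight * preservation_loss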
The Training Process
Step 1: Prepare Your Dataset
- Collect 10-30 images/clips of your person
- Ensure variety in poses, lighting, backgrounds
- Create captions for each with your trigger token
- Organize in the expected folder structure
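The exact folder layout depends on the trainer, but for AI Toolkit a flat folder with image/clip files and matching .txt captions is the usual shape (treat this as a typical example rather than a spec):
dataset/
  img_001.jpg
  img_001.txt
  img_002.jpg
  img_002.txt
  clip_001.mp4
  clip_001.txt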
Step 2: Configure Training
# Example AI Toolkit config for person LoRA
base_model: wan2.2_14b
time_step_type: sigmoid
learning_rate: 0.0002
max_train_steps: 5000
batch_size: 1
gradient_accumulation_steps: 4
dop_preservation_class: person
save_every_n_steps: 1000
Step 3: Start Training
python train.py --config your_config.yaml
And wait. This is the part nobody warns you about. Training takes a long time. I recommend:
- Start training in the evening
- Let it run overnight or over a weekend
- Monitor early checkpoints for sanity checking
Step 4: Evaluate Checkpoints
LoRA saves happen at intervals (every 1000 steps in my config). Test these intermediate checkpoints:
- Does the person identity transfer?
- Is video quality maintained?
- Are there artifacts or distortions?
Often the best checkpoint isn't the final one. Training can overfit. Test and compare.
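When comparing checkpoints, change only one variable: same prompt, same seed, same settings, different checkpoint. Even a trivial helper like this (the output folder and file naming are whatever your trainer produced; adjust to match) keeps the comparison honest:
# Enumerate saved checkpoints with the fixed prompt/seed to render each one against.
from pathlib import Path

checkpoint_dir = Path("output/my_person_lora")  # adjust to your trainer's output folder
test_prompt = "zxq-person walking through a park, natural lighting"
test_seed = 42
for ckpt in sorted(checkpoint_dir.glob("*.safetensors")):
    print(f"{ckpt.name}: render '{test_prompt}' with seed {test_seed}")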
Step 5: Use Your LoRA
Load both high_noise and low_noise LoRAs in your workflow. They apply to different phases of generation automatically.
My Actual Results and Lessons
I've trained about 15 person LoRAs for WAN 2.2 now. Here's what I've learned:
What works:
- More images beat fewer images (until around 30, then diminishing returns)
- Background variety is crucial to avoid background bleed
- Higher learning rate (0.0002) produces faster convergence without quality loss
- DOP is non-negotiable for character work
- Testing checkpoints saves time vs. always using final
What doesn't work:
- Using single-outfit datasets (LoRA learns outfit = identity)
- Training without DOP (quality degrades)
- Too few steps (underfitting, weak identity)
- Too many steps (overfitting, artifacts)
- Ignoring the high/low noise dual model nature
Time investment reality: My first LoRA took a week of trial and error. Now I can prep data and configure training in an afternoon, then let it run. The process is front-loaded with learning curve.
Cloud Training Options
Not everyone has 24GB+ VRAM sitting around. Cloud options exist:
WaveSpeedAI: They offer WAN 2.2 14B LoRA trainers with claims of 10x faster training. Worth considering if you lack local hardware.
RunComfy: Cloud ComfyUI with LoRA training capabilities. More accessible for those already familiar with the platform.
MimicPC: AI Toolkit hosting with WAN 2.2 base model support.
The cost/benefit depends on how many LoRAs you plan to train. For one or two, cloud is probably cheaper. For ongoing work, local hardware pays off.
Using Person LoRAs Effectively
Once you have a trained LoRA, application matters:
Strength settings:
- Start at 0.7 and adjust
- Too high: artifacts, frozen motion
- Too low: weak identity preservation
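Under the hood, strength is just a scalar on the LoRA's low-rank update before it is added to the base weights, which is why pushing it too high distorts the model. A toy example with random matrices (real implementations also scale the update by alpha over rank):
# LoRA strength scales the low-rank update added to the base weight.
import torch

d, rank = 64, 8
base_weight = torch.randn(d, d)
lora_A, lora_B = torch.randn(rank, d) * 0.01, torch.randn(d, rank) * 0.01
strength = 0.7  # the suggested starting point
patched_weight = base_weight + strength * (lora_B @ lora_A)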
Combining with other techniques:
- LoRA + IPAdapter for additional face reference
- LoRA + ControlNet for pose control
- LoRA + base prompt for scene variation
Prompt usage: Always include your trigger token when using the LoRA:
zxq-person walking through a forest, cinematic lighting...
Without the trigger, the LoRA may not activate properly.
What Apatero Can Do for Training
Full disclosure: I'm involved with Apatero. Currently, Apatero focuses on inference (using models) rather than training (creating models). But LoRA training is something we're looking at.
For now, if you train locally and want to use your LoRA with Apatero's workflows, that's a conversation to have. The platform is primarily designed around pre-loaded models, but custom LoRA support is technically feasible.
Frequently Asked Questions
How long does WAN 2.2 LoRA training take?
With consumer hardware (RTX 4090), expect 2-3 days for 5000 steps. With enterprise hardware (A6000+), 12-24 hours. Low VRAM setups with block swapping take longer.
Can I use SD LoRA training techniques?
Not directly. WAN 2.2's dual-model architecture requires different approaches. Sigmoid time stepping and DOP are specific to video LoRA training.
How many images do I really need?
10-30 works well for person LoRAs. Under 10 tends to underfit. Over 30 has diminishing returns unless you need extreme variety.
Why does my LoRA produce artifacts?
Usually overfitting (too many steps) or dataset issues (poor quality images, insufficient variety). Try earlier checkpoints or improve your dataset.
What's the difference between high_noise and low_noise LoRAs?
High_noise handles initial structure and motion planning. Low_noise handles detail refinement and smooth transitions. Both are needed for complete results.
Can I train on video clips instead of images?
Yes, and it often works better for motion-focused LoRAs. Use 4-8 second clips at 256x256+ resolution. Around 12-20 clips is a good amount.
Final Thoughts
WAN 2.2 LoRA training is a commitment. The hardware requirements are significant, the training times are long, and the learning curve is real. But the results are worth it.
A well-trained person LoRA means consistent character identity across generated videos. No more hoping the model remembers what your character looks like. No more face drift between clips. Reliable consistency.
If you're doing production work with recurring characters, invest the time to learn this. The upfront cost pays off every time you generate without fighting identity consistency.
Start with one LoRA. Learn the process. Optimize your dataset and settings. Then scale up. The knowledge transfers to future training once you've done it right the first time.
Related guides: WAN SCAIL Character Animation, WAN 2.6 Complete Guide, Character Consistency Techniques