
How to Train WAN 2.2 LoRA for Person: The Pro Method That Actually Works

Complete guide to training WAN 2.2 LoRAs for consistent person/character video generation. Dataset prep, optimal settings, and pro techniques.


Training LoRAs for WAN 2.2 is nothing like training them for image models. I learned this the hard way after wasting 40+ hours applying SD LoRA techniques to video generation. The dual-model architecture, the motion considerations, the dataset requirements. Everything is different.

Quick Answer: WAN 2.2 person LoRA training requires sigmoid time step scheduling, 10-30 varied images/clips, 3000-5000 training steps at 0.0002 learning rate, and Differential Output Preservation set to "person" for character training. Use AI Toolkit or diffusion-pipe for the actual training.

Key Takeaways:
  • Use Sigmoid time step type for person/character training specifically
  • 10-30 images/clips with varied poses, lighting, and backgrounds
  • 3000-5000 steps at 0.0002 learning rate for faster convergence
  • DOP (Differential Output Preservation) set to "person" preserves base realism
  • Training produces TWO LoRAs: high_noise and low_noise for different generation phases
  • Expect 24-72 hours training time depending on hardware

Why WAN 2.2 LoRA Training Is Different

Let me explain the architecture first, because this is why standard LoRA techniques fail.

WAN 2.2 uses a Mixture of Experts (MoE) architecture with separate handling for high-noise and low-noise generation phases. When you train a LoRA, you're actually training two specialized models:

  • high_noise_lora: Optimized for initial motion planning and temporal structure
  • low_noise_lora: Optimized for refining motion details and smooth transitions

Both get applied during generation at different stages. If you only train one or use the wrong settings, your person LoRA won't transfer the identity properly across video frames.
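
To make the split concrete, here's a minimal sketch of how the two experts divide a denoising run. This is conceptual pseudocode, not any tool's actual API: load_expert() and denoise_step() are hypothetical placeholders, and the boundary value is purely illustrative.

# Conceptual sketch of WAN 2.2's two-phase denoising with a person LoRA pair.
# load_expert() and denoise_step() are hypothetical placeholders, not a real API,
# and the boundary value is illustrative only.

HIGH_NOISE_BOUNDARY = 0.875  # assumed switch point between the two experts

def generate(latents, timesteps, prompt):
    high_expert = load_expert("wan2.2_high_noise", lora="person_high_noise.safetensors")
    low_expert = load_expert("wan2.2_low_noise", lora="person_low_noise.safetensors")

    for t in timesteps:  # timesteps run from 1.0 (pure noise) down to 0.0
        if t >= HIGH_NOISE_BOUNDARY:
            # Early, high-noise steps: motion planning and temporal structure
            latents = denoise_step(high_expert, latents, t, prompt)
        else:
            # Later, low-noise steps: detail refinement and smooth transitions
            latents = denoise_step(low_expert, latents, t, prompt)
    return latents

Each LoRA only ever influences its half of the trajectory, which is why training or loading just one of them gives weak identity transfer.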

This is why SD LoRA knowledge doesn't translate directly. Different architecture, different training approach.

Hardware Reality Check

I'll be honest about the hardware requirements because I've seen people start this process without understanding the commitment.

Minimum viable:

  • 24GB VRAM GPU (RTX 3090/4090 or A6000)
  • 64GB system RAM
  • 500GB+ fast storage
  • Training time: 2-3 days

Comfortable setup:

  • 48GB+ VRAM (A6000, dual GPUs)
  • 128GB system RAM
  • NVMe storage
  • Training time: 12-24 hours

Low VRAM option (16-24GB):

  • Enable VRAM block swapping
  • Significantly longer training times
  • Works, just slower

On an NVIDIA A6000 (48GB VRAM), training took me about 24 hours for a solid person LoRA. Consumer hardware takes 2-3 days but absolutely works.

Heads Up: WAN 2.2 training takes longer than WAN 2.1 because of the dual high/low-noise model approach. Plan accordingly and consider starting training overnight.

Dataset Preparation: The Most Important Step

Here's the thing. Your LoRA will only be as good as your dataset. I've seen people spend 48 hours training on garbage data and wonder why results are bad.

Image Requirements for Person LoRAs

Quantity: 10-30 high-quality images or short clips

Variety is critical:

  • Multiple poses (front, side, three-quarter angles)
  • Different backgrounds (don't let the model learn background = person)
  • Various lighting conditions
  • Different expressions if relevant
  • Multiple outfits unless clothing is part of identity

Quality requirements:

  • Clear, well-lit images
  • Subject fully visible (not cropped awkwardly)
  • No heavy filters or extreme stylization
  • Consistent person across all images
  • Resolution: at least 512x512, ideally higher (a quick check script follows below)
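
Before committing GPU time, it's worth an automated pass over the folder. A minimal sketch with Pillow, assuming your images live in a ./dataset folder (adjust the path and threshold to match your setup):

# Quick dataset sanity check: flag unreadable or low-resolution images.
# The ./dataset path and the 512px floor are assumptions; adjust as needed.
from pathlib import Path
from PIL import Image

MIN_SIDE = 512  # matches the "at least 512x512" guideline above

for path in sorted(Path("dataset").iterdir()):
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
        continue
    try:
        with Image.open(path) as img:
            width, height = img.size
        if min(width, height) < MIN_SIDE:
            print(f"LOW RES    {path.name}: {width}x{height}")
    except OSError:
        print(f"UNREADABLE {path.name}")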

Video Clip Requirements

If training from video clips instead of images:

  • 4-8 seconds per clip
  • 256x256 minimum resolution during training
  • 12-20 high-quality clips recommended
  • Clips should show representative motions you want to reproduce
  • Avoid clips with occlusion or motion blur

Caption Format

Every image/clip needs captions. This is the structure that works:

[trigger token], [description of person], [description of scene/pose]

Example:

zxq-person, a woman with long dark hair, wearing a blue dress, standing in a garden, natural lighting, full body shot

The trigger token (like "zxq-person") should be unique and not a real word. This becomes how you invoke the LoRA during generation.

Captioning tips:

  • Include appearance details that define the person
  • Describe clothing, lighting, and framing
  • Be consistent with terminology across captions
  • Don't include elements you don't want the LoRA to learn
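
If you're writing captions by hand, keeping the phrasing consistent is easier with a tiny helper that stamps out one .txt per image from a shared template. The trigger token, folder path, and descriptions below are placeholders; swap in your own.

# Write one caption .txt per image so the trigger token and base description stay consistent.
# "zxq-person", the ./dataset path, and the scene descriptions are placeholders.
from pathlib import Path

TRIGGER = "zxq-person"
BASE_DESC = "a woman with long dark hair"

# Per-image scene/pose details, keyed by filename stem
scenes = {
    "img_001": "wearing a blue dress, standing in a garden, natural lighting, full body shot",
    "img_002": "wearing a white shirt, sitting at a cafe table, soft window light, upper body shot",
}

dataset = Path("dataset")
for stem, scene in scenes.items():
    caption = f"{TRIGGER}, {BASE_DESC}, {scene}"
    (dataset / f"{stem}.txt").write_text(caption + "\n")
    print(f"{stem}.txt -> {caption}")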

Training Tools and Setup

AI Toolkit Method

AI Toolkit is the most common approach for WAN 2.2 LoRA training. Here's the configuration that works:


Key settings:

model_type: wan2.2
time_step_type: sigmoid  # Critical for person training
learning_rate: 0.0002    # Higher than default for faster training
max_train_steps: 5000    # More steps = better results, longer time
dop_preservation_class: person  # Enables DOP for character training

Sigmoid vs other time step types:

  • Sigmoid is specifically designed for person/character training
  • Produces better identity preservation
  • Other types (uniform, cosine) work better for style/motion LoRAs

Diffusion-pipe Method

Alternative to AI Toolkit, particularly popular with some researchers.

Setup requires:

  1. Enable WSL (Windows) or use native Linux
  2. Install Ubuntu and diffusion-pipe
  3. Configure training parameters

Diffusion-pipe gives you more low-level control but has a steeper learning curve. I recommend AI Toolkit for first-time trainers.

DOP (Differential Output Preservation)

This is the secret sauce for person LoRAs that actually look good.

DOP helps maintain WAN's strong base realism while learning your specific person. Without it, LoRAs often degrade overall quality while learning the new identity.

How to configure:

  • Set preservation class to "person" for character training
  • For style LoRAs, use different preservation classes
  • The model preserves base capabilities while learning new content

I cannot stress enough how much DOP improved my results. Before using it, my person LoRAs had an "uncanny valley" quality. With it, they maintain WAN's natural motion and appearance.


The Training Process

Step 1: Prepare Your Dataset

  1. Collect 10-30 images/clips of your person
  2. Ensure variety in poses, lighting, backgrounds
  3. Create captions for each with your trigger token
  4. Organize in the expected folder structure (the pairing check sketched below catches mismatches)
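
Before kicking off a multi-day run, do a last sanity check that every image or clip has a caption and that each caption starts with the trigger token. A small sketch, assuming image/caption pairs share a filename stem in one folder (confirm the exact layout your trainer expects):

# Verify every image/clip has a caption .txt that starts with the trigger token.
# The folder layout and trigger token are assumptions; confirm the exact structure
# your trainer (AI Toolkit or diffusion-pipe) expects.
from pathlib import Path

TRIGGER = "zxq-person"
dataset = Path("dataset")
media_exts = {".jpg", ".jpeg", ".png", ".webp", ".mp4"}

for item in sorted(p for p in dataset.iterdir() if p.suffix.lower() in media_exts):
    caption_file = item.with_suffix(".txt")
    if not caption_file.exists():
        print(f"MISSING CAPTION: {item.name}")
    elif not caption_file.read_text().strip().startswith(TRIGGER):
        print(f"NO TRIGGER TOKEN: {caption_file.name}")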

Step 2: Configure Training

# Example AI Toolkit config for person LoRA
base_model: wan2.2_14b
time_step_type: sigmoid
learning_rate: 0.0002
max_train_steps: 5000
batch_size: 1
gradient_accumulation_steps: 4
dop_preservation_class: person
save_every_n_steps: 1000

Step 3: Start Training

python train.py --config your_config.yaml

And wait. This is the part nobody warns you about. Training takes a long time. I recommend:

  • Start training in the evening
  • Let it run overnight or over a weekend
  • Monitor early checkpoints for sanity checking

Step 4: Evaluate Checkpoints

LoRA saves happen at intervals (every 1000 steps in my config). Test these intermediate checkpoints:

  • Does the person identity transfer?
  • Is video quality maintained?
  • Are there artifacts or distortions?

Often the best checkpoint isn't the final one. Training can overfit. Test and compare.
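
To keep the comparison fair, render the same prompt with the same seed for every saved checkpoint so the only variable is training progress. A sketch of that sweep, where generate_test_clip() stands in for whatever inference workflow you actually use and the output folder layout is illustrative:

# Render one fixed test prompt per checkpoint so differences reflect training progress,
# not sampling randomness. generate_test_clip() is a hypothetical placeholder for your
# inference setup; the step_* directory layout is an assumption.
from pathlib import Path

TEST_PROMPT = "zxq-person walking through a park, natural lighting"
SEED = 42

for step_dir in sorted(Path("output").glob("step_*")):
    preview = Path("previews") / f"{step_dir.name}.mp4"
    # Each save step contains a high_noise/low_noise LoRA pair; load both together.
    generate_test_clip(checkpoint_dir=step_dir, prompt=TEST_PROMPT, seed=SEED, out=preview)
    print("rendered", preview)

Each save produces a high_noise and low_noise pair, so compare checkpoints as pairs rather than mixing files from different steps.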

Step 5: Use Your LoRA

Load both high_noise and low_noise LoRAs in your workflow. They apply to different phases of generation automatically.

My Actual Results and Lessons

I've trained about 15 person LoRAs for WAN 2.2 now. Here's what I've learned:

What works:

  • More images beat fewer images (until around 30, then diminishing returns)
  • Background variety is crucial to avoid background bleed
  • Higher learning rate (0.0002) produces faster convergence without quality loss
  • DOP is non-negotiable for character work
  • Testing checkpoints saves time vs. always using final

What doesn't work:

  • Using single-outfit datasets (LoRA learns outfit = identity)
  • Training without DOP (quality degrades)
  • Too few steps (underfitting, weak identity)
  • Too many steps (overfitting, artifacts)
  • Ignoring the high/low noise dual model nature

Time investment reality: My first LoRA took a week of trial and error. Now I can prep data and configure training in an afternoon, then let it run. The process is front-loaded with learning curve.

Cloud Training Options

Not everyone has 24GB+ VRAM sitting around. Cloud options exist:


WaveSpeedAI: They offer WAN 2.2 14B LoRA trainers with claims of 10x faster training. Worth considering if you lack local hardware.

RunComfy: Cloud ComfyUI with LoRA training capabilities. More accessible for those already familiar with the platform.

MimicPC: AI Toolkit hosting with WAN 2.2 base model support.

The cost/benefit depends on how many LoRAs you plan to train. For one or two, cloud is probably cheaper. For ongoing work, local hardware pays off.

Using Person LoRAs Effectively

Once you have a trained LoRA, application matters:

Strength settings:

  • Start at 0.7 and adjust
  • Too high: artifacts, frozen motion
  • Too low: weak identity preservation

Combining with other techniques:

  • LoRA + IPAdapter for additional face reference
  • LoRA + ControlNet for pose control
  • LoRA + base prompt for scene variation

Prompt usage: Always include your trigger token when using the LoRA:

zxq-person walking through a forest, cinematic lighting...

Without the trigger, the LoRA may not activate properly.
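
Pulling that together, here's a hedged sketch of the settings I'd hand to a generation run: both LoRAs loaded at matching strength, trigger token in the prompt. The dict keys and run_generation() are placeholders for whatever workflow or script you actually use, not a specific tool's API.

# Illustrative generation settings: both LoRAs at matching strength, trigger token in
# the prompt. Keys and run_generation() are placeholders, not a specific tool's API.
settings = {
    "high_noise_lora": ("person_high_noise.safetensors", 0.7),  # (file, strength)
    "low_noise_lora": ("person_low_noise.safetensors", 0.7),
    "prompt": "zxq-person walking through a forest, cinematic lighting",
}

# run_generation(settings)  # hypothetical entry point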

What Apatero Can Do for Training

Full disclosure: I'm involved with Apatero. Currently, Apatero focuses on inference (using models) rather than training (creating models). But LoRA training is something we're looking at.

For now, if you train locally and want to use your LoRA with Apatero's workflows, that's a conversation to have. The platform is primarily designed around pre-loaded models, but custom LoRA support is technically feasible.

Frequently Asked Questions

How long does WAN 2.2 LoRA training take?

With consumer hardware (RTX 4090), expect 2-3 days for 5000 steps. With enterprise hardware (A6000+), 12-24 hours. Low VRAM setups with block swapping take longer.

Can I use SD LoRA training techniques?

Not directly. WAN 2.2's dual-model architecture requires different approaches. Sigmoid time stepping and DOP are specific to video LoRA training.

How many images do I really need?

10-30 works well for person LoRAs. Under 10 tends to underfit. Over 30 has diminishing returns unless you need extreme variety.

Why does my LoRA produce artifacts?

Usually overfitting (too many steps) or dataset issues (poor quality images, insufficient variety). Try earlier checkpoints or improve your dataset.

What's the difference between high_noise and low_noise LoRAs?

High_noise handles initial structure and motion planning. Low_noise handles detail refinement and smooth transitions. Both are needed for complete results.

Can I train on video clips instead of images?

Yes, and it often works better for motion-focused LoRAs. Use 4-8 second clips at 256x256+ resolution. Around 12-20 clips is a good amount.

Final Thoughts

WAN 2.2 LoRA training is a commitment. The hardware requirements are significant, the training times are long, and the learning curve is real. But the results are worth it.

A well-trained person LoRA means consistent character identity across generated videos. No more hoping the model remembers what your character looks like. No more face drift between clips. Reliable consistency.

If you're doing production work with recurring characters, invest the time to learn this. The upfront cost pays off every time you generate without fighting identity consistency.

Start with one LoRA. Learn the process. Optimize your dataset and settings. Then scale up. The knowledge transfers to future training once you've done it right the first time.


Related guides: WAN SCAIL Character Animation, WAN 2.6 Complete Guide, Character Consistency Techniques
