Microsoft Research’s World-R1 Uses Flow-GRPO and 3D-Aware Rewards to Inject Geometric Consistency Into Wan 2.1 Without Architectural Changes

Video foundation models can paint a beautiful frame. They are still notoriously bad at remembering it. Push the camera through a corridor in Wan 2.1 or CogVideoX and walls warp, objects morph, and details vanish — the giveaway that these models are fitting 2D pixel correlations rather than simulating a coherent 3D scene.

A team of researchers from Microsoft Research and Zhejiang University introduced World-R1: a framework that aligns video generation with 3D constraints through reinforcement learning. The research team leans on a recent finding that video foundation models already encode rich 3D geometric information internally. The job, then, is to elicit that latent knowledge rather than supervise it with expensive 3D assets. World-R1 does this by post-training an existing text-to-video (T2V) model with reinforcement learning, using rewards derived from pre-trained 3D foundation models and a vision-language critic. The base architecture is left untouched and inference cost is unchanged.

Two World-R1 variants are released: World-R1-Small (built on Wan2.1-T2V-1.3B) and World-R1-Large (built on Wan2.1-T2V-14B).

https://arxiv.org/pdf/2604.24764

The setup: Flow-GRPO on a flow-matching video model

World-R1 uses Flow-GRPO-Fast, a recent adaptation of GRPO to flow-matching diffusion models. Flow-GRPO converts the deterministic ODE sampler into a reverse-time SDE so the policy is stochastic enough for advantage estimation, then optimizes a clipped GRPO surrogate with KL regularization to a reference policy. The Fast variant injects SDE noise only at randomly selected intermediate steps, cutting rollout cost.
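For concreteness, here is a minimal PyTorch sketch of that objective: group-normalized advantages, a PPO-style clipped importance ratio, and a simple KL penalty toward the frozen reference policy. The per-rollout trajectory log-probabilities are abstracted to scalars, and clip_eps and kl_weight are illustrative defaults, not the paper's hyperparameters.

```python
import torch

def flow_grpo_loss(logp_new, logp_old, logp_ref, rewards,
                   clip_eps=0.2, kl_weight=0.01):
    """Clipped GRPO surrogate with KL regularization (illustrative sketch).

    logp_new / logp_old / logp_ref: (G,) summed log-probabilities of each
    rollout's SDE denoising trajectory under the current, rollout-time,
    and frozen reference policies.
    rewards: (G,) scalar rewards for one group of G rollouts per prompt.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped importance-sampling surrogate.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Simple KL estimate toward the reference policy.
    kl = (logp_new - logp_ref).mean()
    return -surrogate.mean() + kl_weight * kl
```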

Training runs at 832×480 resolution on 48 NVIDIA H200 GPUs for the Small model and 96 H200s for the Large model, with a GRPO group size of G=8 across 48 parallel groups (384 rollouts per optimization step).

The 3D-aware reward: analysis-by-synthesis

The interesting work happens in the reward. For each generated video x, the system reconstructs a 3D Gaussian Splatting (3DGS) representation ΦGS using Depth Anything 3 and recovers an estimated camera trajectory Ê. The composite 3D reward is:

R3D = Smeta + Srecon + Straj

Smeta renders ΦGS from a meta-view — a camera pose offset from the generation trajectory — and asks Qwen3-VL to score the reconstruction from 0–9 as a “3D vision expert,” penalizing floaters, billboard artifacts, and texture stretching that look fine head-on but collapse off-axis.

Srecon re-renders the scene along Ê and compares against x via 1 − LPIPS.

Straj measures deviation between the requested trajectory E and the recovered Ê using L2 for translation and geodesic distance for rotation, wrapped in a negative exponential.
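A minimal sketch of how such a trajectory term can be computed, assuming per-frame poses given as a rotation matrix plus a translation vector; the scale inside the exponential is an illustrative assumption.

```python
import numpy as np

def traj_reward(E, E_hat, scale=1.0):
    """Trajectory-alignment reward sketch: L2 translation error plus
    geodesic rotation error per frame, mapped into (0, 1] with a
    negative exponential.

    E, E_hat: lists of (R, t) pairs per frame, R a 3x3 rotation matrix,
    t a 3-vector (requested vs. recovered camera pose).
    """
    errs = []
    for (R, t), (R_hat, t_hat) in zip(E, E_hat):
        trans_err = np.linalg.norm(t - t_hat)
        # Geodesic distance on SO(3): rotation angle of R^T @ R_hat.
        cos = (np.trace(R.T @ R_hat) - 1.0) / 2.0
        rot_err = np.arccos(np.clip(cos, -1.0, 1.0))
        errs.append(trans_err + rot_err)
    return float(np.exp(-scale * np.mean(errs)))
```

A perfect pose match yields a reward of 1, decaying smoothly toward 0 as either error grows, which keeps the signal dense enough for group-relative advantage estimation.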

A general aesthetic term Rgen, computed as the mean HPSv3 score across the first K frames, is added with λgen = 1 to keep visual quality from collapsing under geometric pressure.
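Assembled, the scalar the policy is optimized against looks roughly like the following; the lpips package call for Srecon is standard, while the VGG backbone choice and any rescaling of the components to comparable ranges are assumptions.

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual metric; backbone choice is an assumption

def recon_reward(frames, rerenders):
    """S_recon sketch: 1 - LPIPS between the generated frames and the
    3DGS re-renders along the recovered trajectory Ê.
    frames, rerenders: (T, 3, H, W) tensors scaled to [-1, 1]."""
    with torch.no_grad():
        return float(1.0 - lpips_fn(frames, rerenders).mean())

def total_reward(s_meta, s_recon, s_traj, r_gen, lambda_gen=1.0):
    """Composite reward R = R3D + λgen · Rgen, with
    R3D = Smeta + Srecon + Straj as described above."""
    return (s_meta + s_recon + s_traj) + lambda_gen * r_gen
```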

Implicit camera conditioning via noise warping

Rather than training a CameraCtrl-style adapter, World-R1 follows the Go-with-the-Flow paradigm: the prompt is parsed for motion tokens (push_in, orbit_left, pull_out, etc.), a sequence of camera extrinsics is generated, projected into 2D optical flow under a fronto-parallel scene assumption, and used to perform discrete noise transport on the initial latent. The transported noise preserves unit variance via a density-tracker normalization, so the diffusion prior is undisturbed but the latent already encodes the requested trajectory. No new parameters, no architectural change.
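A simplified sketch of the discrete transport step on an integer latent grid appears below; the actual Go-with-the-Flow machinery tracks sub-pixel particle densities, so this illustrates the variance-preserving idea rather than reproducing the method.

```python
import torch

def warp_noise(noise, flow):
    """Discrete noise transport sketch in the spirit of Go-with-the-Flow.

    noise: (C, H, W) i.i.d. Gaussian latent noise.
    flow:  (2, H, W) integer pixel displacements derived from the camera path.
    """
    C, H, W = noise.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ty = (ys + flow[1].long()).clamp(0, H - 1)
    tx = (xs + flow[0].long()).clamp(0, W - 1)
    idx = (ty * W + tx).reshape(-1)                      # flattened target cells

    out = torch.zeros(C, H * W)
    count = torch.zeros(H * W)
    out.index_add_(1, idx, noise.reshape(C, -1))         # scatter-add samples
    count.index_add_(0, idx, torch.ones_like(idx, dtype=torch.float))

    # Density normalization: a sum of k unit-variance samples has variance k,
    # so dividing by sqrt(k) restores unit variance per cell.
    out = out / count.clamp(min=1.0).sqrt()
    # Cells that received no sample get fresh Gaussian noise.
    empty = count == 0
    out[:, empty] = torch.randn(C, int(empty.sum()))
    return out.reshape(C, H, W)
```

The square-root renormalization is the crux: it is what keeps the warped latent statistically indistinguishable from fresh noise, so the diffusion prior never sees an out-of-distribution input.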

A pure text dataset, and periodic decoupling to keep motion alive

Training data is a synthetic Pure Text Dataset of roughly 3,000 prompts generated by Gemini, organized along the WorldScore camera-trajectory taxonomy (intra-scene, inter-scene, composite, static) and across Natural Landscapes, Urban & Architectural, Micro & Still Life, Fantasy & Surrealism, and Artistic Styles. Going text-only dissociates 3D learning from the visual biases of any specific video corpus.

Strict 3D rewards have a known failure mode: the model overfits to rigid scenes and stops generating dynamic content. World-R1 mitigates this with periodic decoupled training. Every 100 steps, R3D is suspended and the model is fine-tuned with Rgen alone on a roughly 500-prompt dynamic data subset (waterfalls, crowds, fire, transforming objects). Removing this stage actually raises reconstruction PSNR but drops VBench AVG from 85.21 to 82.64 — exactly the reward-hacking degeneracy the research team flags.
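Schematically, the alternation can be expressed as a phase schedule like the one below; the paper specifies the 100-step cadence, while the one-step duration of the decoupled phase here is an assumption.

```python
def training_phase(step, decouple_every=100):
    """Periodic decoupled training schedule (scheduling sketch).

    Most steps optimize the full reward R3D + λgen · Rgen on the 3K-prompt
    pure-text set; every `decouple_every` steps, R3D is suspended and the
    model trains with Rgen alone on the ~500 dynamic prompts, keeping
    motion from collapsing into static, easy-to-reconstruct scenes.
    """
    if step > 0 and step % decouple_every == 0:
        return {"reward": "R_gen", "data": "dynamic_subset_500"}
    return {"reward": "R_3D + R_gen", "data": "pure_text_3000"}
```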

Understanding the Results

On a 3DGS-based reconstruction protocol, World-R1-Large hits 27.67 PSNR / 0.865 SSIM / 0.162 LPIPS, against 19.76 / 0.629 / 0.405 for Wan2.1-T2V-14B — a 7.91 dB PSNR gain. World-R1-Small posts a 10.23 dB gain over its 1.3B backbone. On the reconstruction-independent Multi-View Consistency Score (MVCS) borrowed from GeoVideo, World-R1-Large reaches 0.993, ahead of all 3D-conditioned and camera-control baselines tested (Voyager, ViewCrafter, FlashWorld, ReCamMaster, etc.).

Camera control is competitive with specialized methods: RotErr 1.21, TransErr 1.30, CamMC 2.95 for the Large model, edging out CamCloneMaster and ReCamMaster despite not being a dedicated camera-control architecture. VBench scores improve over the base Wan 2.1 in Aesthetic Quality, Imaging Quality, Motion Smoothness, and Subject Consistency, with only a small regression on Background Consistency.

Two robustness results stand out for AI professionals. A dataset scaling sweep shows monotonic gains from 1K → 2K → 3K prompts on both 3D consistency and VBench AVG, suggesting the recipe is data-efficient and could scale further. And although training is on short clips, World-R1-Large generalizes to 121-frame generations, lifting PSNR from 18.32 to 26.32 over the Wan2.1-T2V-14B backbone. A 25-participant double-blind user study reports win rates of 92% for geometric consistency, 76% for camera control accuracy, and 86% for overall preference versus Wan 2.1.

Key Takeaways

RL replaces architectural surgery for 3D consistency. World-R1 post-trains Wan2.1 with Flow-GRPO-Fast instead of bolting on 3D modules or training on 3D-supervised datasets. The base architecture and inference cost are unchanged.

The reward is analysis-by-synthesis. Each generated video is lifted to a 3D Gaussian Splatting representation via Depth Anything 3, then scored on three axes: meta-view plausibility (judged by Qwen3-VL), reconstruction fidelity (1 − LPIPS), and trajectory alignment — combined with an HPSv3 aesthetic reward to prevent quality collapse.

Camera control comes from noise warping, not new parameters. Motion tokens in the prompt are turned into camera extrinsics, projected to 2D optical flow, and used to warp the initial latent via Go-with-the-Flow’s discrete noise transport. No CameraCtrl-style adapter required.

Periodic decoupled training prevents reward hacking. Every 100 steps, the 3D reward is suspended and the model is fine-tuned with the aesthetic reward alone on ~500 dynamic prompts. Removing this stage raises PSNR but tanks VBench — the model collapses into static, easy-to-reconstruct outputs.

The numbers are large and hold up off-pipeline. World-R1-Large gains 7.91 dB PSNR over Wan2.1-T2V-14B, generalizes to 121-frame videos, and improves the reconstruction-independent MVCS metric — with an 86% overall preference win rate in a 25-participant blind user study.

Check out the Paper, Codes and Project Page.