Streaming Video Geometry

Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

A lightweight recurrent stabilizer that turns image geometry foundation models into coherent streaming video models.

¹The University of Hong Kong   ·   ²USTC   ·   ³Voyager Research, Didi Chuxing
* Equal contribution. † Project leader. Correspondence: xjqi@eee.hku.hk.
DyFN transforms a continuous stream into a stitched scale-shift consistent point cloud

Overview

Modern monocular geometry and depth foundation models are accurate on individual images, but when they are applied frame-by-frame to video streams, their predictions drift in scale and shift. DyFN targets that drift directly, preserving the frozen model's single-image accuracy while stabilizing geometry over time.

Core idea. The paper traces temporal artifacts to fluctuations in latent feature statistics, then inserts a small recurrent normalization module that predicts stable mean and variance for the feature stream.
Problem

Scale-shift drift

Naive streaming inference makes each frame live in a slightly different coordinate system, creating layering breaks and positional jitter.

Empirical Study

Feature statistics drive scale

The paper modulates latent mean and variance and observes large scale-shift changes while relative geometry remains accurate after alignment.

Method

Train only 2%

DyFN freezes the pretrained encoder and decoder and fine-tunes only a lightweight recurrent module. The resulting model achieves state-of-the-art results on video depth estimation benchmarks.

Problem: Diagnosing Temporal Inconsistency

DyFN starts from a simple empirical question: do streaming artifacts come from bad per-frame geometry, or from a drifting coordinate system across frames? The paper tests this with MoGe by fusing video-frame predictions under two alignment settings.

When each frame is independently aligned to metric scale using its own affine scale and shift, the reconstructed geometry is accurate and coherent. When the full sequence must share one scale and shift, the same model produces visible non-rigid warping and drift. This isolates the failure mode: the geometry is present, but its global scale and offset fluctuate over time.

Ground Truth: Reference reconstruction for the continuous sequence.
MoGe, Per-frame Aligned (99.8): Each frame gets its own scale and shift, recovering coherent geometry.
MoGe, Sequence Aligned (62.5): One shared scale and shift exposes drift and non-rigid warping.
Conclusion. The failure is not a lack of per-frame geometric understanding. MoGe can reconstruct coherent geometry when each frame is aligned independently, but a single sequence-wide alignment exposes frame-to-frame scale-shift drift as the root cause of temporal inconsistency.
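
The two alignment settings can be illustrated with a minimal sketch. It assumes depth maps rather than full point maps, a closed-form least-squares fit for scale and shift, and synthetic data in which every frame carries its own hidden scale and shift. The function names (`fit_scale_shift`, `per_frame_aligned`, `sequence_aligned`) are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def fit_scale_shift(pred, gt):
    """Least-squares (s, t) minimizing ||s * pred + t - gt||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s, t

def abs_rel(pred, gt):
    """Mean absolute relative error; assumes gt > 0 everywhere."""
    return float(np.mean(np.abs(pred - gt) / gt))

def per_frame_aligned(preds, gts):
    """Fit a separate scale and shift for every frame."""
    aligned = []
    for p, g in zip(preds, gts):
        s, t = fit_scale_shift(p, g)
        aligned.append(s * p + t)
    return np.stack(aligned)

def sequence_aligned(preds, gts):
    """Fit one scale and shift shared by the whole sequence."""
    s, t = fit_scale_shift(preds, gts)
    return s * preds + t

# Synthetic sequence (assumption): 8 frames of 16 x 16 depth, each
# prediction correct up to its own unknown scale and shift.
rng = np.random.default_rng(0)
gts = rng.uniform(1.0, 10.0, size=(8, 16, 16))
scales = rng.uniform(0.5, 2.0, size=(8, 1, 1))
shifts = rng.uniform(-0.3, 0.3, size=(8, 1, 1))
preds = (gts - shifts) / scales

err_frame = abs_rel(per_frame_aligned(preds, gts), gts)
err_seq = abs_rel(sequence_aligned(preds, gts), gts)
print(f"per-frame AbsRel: {err_frame:.4f}, sequence AbsRel: {err_seq:.4f}")
```

In this toy setup the per-frame error is essentially zero while the sequence-aligned error is not, which mirrors the diagnosis above: the relative geometry is intact, and only the per-frame scale and shift disagree.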

Empirical Study on Latent Statistics

The study tests whether latent feature statistics control the scale-shift drift. It scales the encoder feature variance by α and the mean by β, then decodes the result and inspects how the point cloud changes across the sweep.

Figure: empirical pipeline. An outdoor RGB input image passes through the frozen MGE encoder. The variance \(\hat{\sigma}_t \in \mathbb{R}^{1\times d}\) of the encoded feature map is scaled by α and the mean \(\hat{\mu}_t \in \mathbb{R}^{1\times d}\) by β. The frozen MGE decoder then turns the modulated feature map into a 3D point cloud.
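
The modulation step can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: statistics are taken per channel over the token axis (giving shape 1 × d), \(\hat{\sigma}_t\) is treated as a standard deviation, random features stand in for the frozen encoder output, and the α/β grid values are hypothetical because the page does not state the sweep range.

```python
import numpy as np

def modulate_statistics(feat, alpha, beta, eps=1e-6):
    """Normalize features, then restore them with rescaled statistics.

    feat: (n_tokens, d) feature map for one frame.
    alpha scales the per-channel spread, beta scales the per-channel mean.
    """
    mu = feat.mean(axis=0, keepdims=True)      # shape (1, d)
    sigma = feat.std(axis=0, keepdims=True)    # shape (1, d)
    normalized = (feat - mu) / (sigma + eps)
    return normalized * (alpha * sigma) + beta * mu

# Stand-in for one frame of encoder features (assumption: 64 tokens, d = 8).
rng = np.random.default_rng(0)
feat = rng.normal(loc=2.0, scale=3.0, size=(64, 8))

# Hypothetical 5 x 5 sweep over (alpha, beta).
alphas = np.linspace(0.5, 1.5, 5)
betas = np.linspace(0.5, 1.5, 5)
sweep = {(a, b): modulate_statistics(feat, a, b) for a in alphas for b in betas}
```

With α = β = 1 the features are returned essentially unchanged. In the study itself, each modulated feature map would be passed to the frozen decoder and the decoded geometry scored.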

Indoor scale-shift sweep

Figure: 5 × 5 AbsRel color map over α (scale modulation) and β (shift modulation).
Step 1

Separate geometry from scale

Compare per-frame affine alignment against one sequence-wide alignment to expose whether drift comes from global scale and shift.

Step 2

Perturb feature statistics

Normalize encoder features, rescale their mean and variance, and decode with the frozen MGE decoder.

Conclusion

Stabilize statistics

Because mean and variance govern scale-shift behavior, DyFN learns temporally stable statistics for the streaming feature sequence.

Dynamic Feature Normalization

DyFN stabilizes the feature statistics that the empirical study identifies as the source of scale-shift drift. The detailed module structure is shown below.

Detailed DyFN structure. A compact recurrent module predicts the statistics used to re-modulate latent features over time.
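
To make the idea of temporally stable statistics concrete, here is a deliberately simplified causal sketch. It replaces the learned recurrent module with a fixed exponential moving average over per-frame statistics, so it is a stand-in technique and not the DyFN architecture. The class name `EmaStatStabilizer`, the momentum value, and the synthetic stream are all assumptions made for illustration.

```python
import numpy as np

class EmaStatStabilizer:
    """Causal stand-in for a learned recurrent normalizer.

    Keeps a running mean and spread (exponential moving average) and
    re-modulates each incoming frame with those running statistics.
    """

    def __init__(self, momentum=0.9, eps=1e-6):
        self.momentum = momentum
        self.eps = eps
        self.mu = None      # running per-channel mean, shape (1, d)
        self.sigma = None   # running per-channel std, shape (1, d)

    def step(self, feat):
        """feat: (n_tokens, d) features of the current frame only."""
        mu_t = feat.mean(axis=0, keepdims=True)
        sigma_t = feat.std(axis=0, keepdims=True)
        if self.mu is None:
            self.mu, self.sigma = mu_t, sigma_t
        else:
            m = self.momentum
            self.mu = m * self.mu + (1.0 - m) * mu_t
            self.sigma = m * self.sigma + (1.0 - m) * sigma_t
        normalized = (feat - mu_t) / (sigma_t + self.eps)
        return normalized * self.sigma + self.mu

# Synthetic stream (assumption): the same content with per-frame
# scale and shift jitter applied to the features.
rng = np.random.default_rng(0)
base = rng.normal(size=(32, 4))
stream = [base * rng.uniform(0.8, 1.2) + rng.uniform(-0.5, 0.5) for _ in range(100)]

stabilizer = EmaStatStabilizer(momentum=0.9)
outputs = [stabilizer.step(f) for f in stream]

in_means = np.array([f.mean() for f in stream])
out_means = np.array([o.mean() for o in outputs])
print(f"mean jitter before: {in_means.std():.3f}, after: {out_means.std():.3f}")
```

The sketch only uses past and current frames, so it runs online. A learned module, as described on this page, would predict the statistics instead of averaging them with a fixed momentum.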

Results

Across Sintel, ScanNet, KITTI, and Bonn, DyFN improves video depth stability while preserving the single-frame quality of its base model. The paper compares against six categories of depth and geometry methods under a sequence-level alignment protocol for video stability.

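Method categories: Relative Depth, Metric Depth, Multi-frame Geometry, Streaming Geometry, Video Depth, Streaming Depth.

| Method | Sintel (50 frames) AbsRel ↓ | Sintel (50 frames) \(\delta_1 < 1.25\) ↑ | ScanNet (90 frames) AbsRel ↓ | ScanNet (90 frames) \(\delta_1 < 1.25\) ↑ | KITTI (110 frames) AbsRel ↓ | KITTI (110 frames) \(\delta_1 < 1.25\) ↑ | Bonn (110 frames) AbsRel ↓ | Bonn (110 frames) \(\delta_1 < 1.25\) ↑ |
|---|---|---|---|---|---|---|---|---|
| Marigold | 0.532 | 51.5 | 0.166 | 76.9 | 0.149 | 79.6 | 0.091 | 93.1 |
| DA V1 | 0.325 | 56.4 | 0.130 | 83.8 | 0.142 | 80.3 | 0.078 | 93.9 |
| DA V2 | 0.367 | 55.4 | 0.135 | 82.2 | 0.140 | 80.4 | 0.106 | 92.1 |
| MoGe v1 | 0.216 | 65.3 | 0.117 | 84.7 | 0.076 | 96.0 | 0.074 | 95.5 |
| DepthPro | 0.319 | 52.0 | (0.088) | (92.7) | (0.088) | (92.2) | (0.063) | (96.6) |
| MoGe v2 | 0.214 | 69.5 | (0.110) | (88.2) | (0.183) | (58.8) | (0.049) | (98.0) |
| VGGT | 0.287 | 66.1 | 0.031 | 98.5 | 0.070 | 96.5 | 0.055 | 97.1 |
| Monst3R | 0.335 | 58.5 | 0.123 | 83.2 | 0.104 | 89.5 | 0.063 | 96.4 |
| CUT3R | 0.421 | 47.9 | 0.097 | 88.7 | 0.118 | 88.1 | 0.078 | 93.7 |
| TTT3R | 0.404 | 50.0 | 0.114 | 87.7 | 0.113 | 90.4 | 0.068 | 95.4 |
| DepthCrafter | 0.270 | 69.7 | 0.123 | 85.6 | 0.104 | 89.6 | 0.071 | 97.2 |
| VDA | 0.300 | 63.3 | 0.075 | 95.4 | 0.079 | 95.0 | 0.051 | 98.1 |
| FlashDepth | 0.265 | 64.2 | 0.101 | 90.3 | 0.103 | 89.5 | 0.053 | 98.0 |
| Ours | 0.180 | 73.0 | 0.073 | 96.6 | 0.062 | 97.3 | 0.044 | 98.4 |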
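
For readers unfamiliar with the two reported metrics, the following sketch computes them in their standard form. It assumes strictly positive predicted and ground-truth depths with invalid pixels already masked out, and reports \(\delta_1\) as a percentage to match the table. The evaluation code used by the paper may differ in masking and alignment details.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over valid pixels (lower is better)."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta1(pred, gt, threshold=1.25):
    """Percentage of pixels with max(pred/gt, gt/pred) < threshold (higher is better)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold) * 100.0)

# Tiny worked example with four pixels.
gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.0, 6.0, 8.0])

print(abs_rel(pred, gt))  # approximately 0.15
print(delta1(pred, gt))   # 75.0
```

Here the relative errors are 0.1, 0, 0.5, and 0, so AbsRel averages to about 0.15, and three of the four pixels fall within the 1.25 ratio threshold.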
Qualitative comparison (indoor scene: ground truth, FlashDepth, Video Depth Anything, and DyFN). DyFN reduces non-rigid warping and produces more geometrically coherent stitched point clouds.

Video Demos

The demo shows DyFN running on a continuous stream alongside the reconstructed geometry.

Streaming reconstruction demo

Takeaways

1

Temporal instability is statistical

The dominant failure is scale-shift drift caused by fluctuations in latent feature statistics.

2

Backbones can stay frozen

DyFN adapts strong image geometry models to streams without sacrificing their single-frame accuracy.

3

Causal and efficient

The recurrent module supports online streaming and is small enough to fine-tune efficiently.

Citation

Please cite the paper if you find the project useful.

@inproceedings{lyu2026streamingdepth,
  title={Stabilizing Streaming Video Geometry via Dynamic Feature Normalization},
  author={Lyu, Xiaoyang and Liu, Muxin and Wu, Xiaoshan and Wang, Ruicheng and Huang, Yi-Hua and Sun, Yang-Tian and Shi, Shaoshuai and Qi, Xiaojuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}