Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

Overview

Modern monocular geometry and depth foundation models are accurate on individual images, but when they are applied frame-by-frame to video streams, their predictions drift in scale and shift. DyFN targets that drift directly, preserving the frozen model's single-image accuracy while stabilizing geometry over time.

Core idea. The paper traces temporal artifacts to fluctuations in latent feature statistics, then inserts a small recurrent normalization module that predicts stable mean and variance for the feature stream.

Problem

Scale-shift drift

Naive streaming inference makes each frame live in a slightly different coordinate system, creating layering breaks and positional jitter.

Empirical Study

Feature statistics drive scale

The paper modulates latent mean and variance and observes large scale-shift changes while relative geometry remains accurate after alignment.

Method

Train only 2%

DyFN freezes the pretrained encoder and decoder, fine-tuning only a lightweight recurrent module. The final results achieves SOTA on the video depth estimation benchmark.

Problem: Diagnosing Temporal Inconsistency

DyFN starts from a simple empirical question: do streaming artifacts come from bad per-frame geometry, or from a drifting coordinate system across frames? The paper tests this with MoGe by fusing video-frame predictions under two alignment settings.

When each frame is independently aligned to metric scale using its own affine scale and shift, the reconstructed geometry is accurate and coherent. When the full sequence must share one scale and shift, the same model produces visible non-rigid warping and drift. This isolates the failure mode: the geometry is present, but its global scale and offset fluctuate over time.

Ground TruthReference reconstruction for the continuous sequence.

99.8

MoGe, Per-frame AlignedEach frame gets its own scale and shift, recovering coherent geometry.

62.5

MoGe, Sequence AlignedOne shared scale and shift exposes drift and non-rigid warping.

Conclusion. The failure is not a lack of per-frame geometric understanding. MoGe can reconstruct coherent geometry when each frame is aligned independently, but a single sequence-wide alignment exposes frame-to-frame scale-shift drift as the root cause of temporal inconsistency.

Empirical Study on Latent Statistics

The study tests whether latent feature statistics control the scale-shift drift. We modulate the encoder feature variance and mean with α and β, then decode the result and inspect how the point cloud changes across the sweep.

Outdoor RGB input image from the empirical pipeline slide

MGE
Encoder❄️

Encoded MGE feature map from the empirical pipeline slide

Variance \(\hat{\sigma}_t \in \mathbb{R}^{1\times d}\)

×α

Mean \(\hat{\mu}_t \in \mathbb{R}^{1\times d}\)

×β

Decoded 3D point cloud output from the empirical pipeline slide

MGE
Decoder❄️

Modulated MGE feature map from the empirical pipeline slide

Indoor scale-shift sweep

5 x 5 AbsRel color map

β: shift modulation

α: scale modulation

Step 1

Separate geometry from scale

Compare per-frame affine alignment against one sequence-wide alignment to expose whether drift comes from global scale and shift.

Step 2

Perturb feature statistics

Normalize encoder features, rescale their mean and variance, and decode with the frozen MGE decoder.

Conclusion

Stabilize statistics

Because mean and variance govern scale-shift behavior, DyFN learns temporally stable statistics for the streaming feature sequence.

Dynamic Feature Normalization

DyFN stabilizes the feature statistics that the empirical study identifies as the source of scale-shift drift. The detailed module structure is shown below.

Detailed DyFN module structure — Detailed DyFN structure. A compact recurrent module predicts the statistics used to re-modulate latent features over time.

Results

Across Sintel, ScanNet, KITTI, and Bonn, DyFN improves video depth stability while preserving the single-frame quality of its base model. The paper compares against six categories of depth and geometry methods under a sequence-level alignment protocol for video stability.

Relative Depth Metric Depth Multi-frame Geometry Streaming Geometry Video Depth Streaming Depth

Method	Sintel (50 frames)		ScanNet (90 frames)		KITTI (110 frames)		Bonn (110 frames)
Method	AbsRel ↓	\(\delta_1 < 1.25\) ↑	AbsRel ↓	\(\delta_1 < 1.25\) ↑	AbsRel ↓	\(\delta_1 < 1.25\) ↑	AbsRel ↓	\(\delta_1 < 1.25\) ↑
Marigold	0.532	51.5	0.166	76.9	0.149	79.6	0.091	93.1
DA V1	0.325	56.4	0.130	83.8	0.142	80.3	0.078	93.9
DA V2	0.367	55.4	0.135	82.2	0.140	80.4	0.106	92.1
MoGe v1	0.216	65.3	0.117	84.7	0.076	96.0	0.074	95.5
DepthPro	0.319	52.0	(0.088)	(92.7)	(0.088)	(92.2)	(0.063)	(96.6)
MoGe v2	0.214	69.5	(0.110)	(88.2)	(0.183)	(58.8)	(0.049)	(98.0)
VGGT	0.287	66.1	0.031	98.5	0.070	96.5	0.055	97.1
Monst3R	0.335	58.5	0.123	83.2	0.104	89.5	0.063	96.4
CUT3R	0.421	47.9	0.097	88.7	0.118	88.1	0.078	93.7
TTT3R	0.404	50.0	0.114	87.7	0.113	90.4	0.068	95.4
DepthCrafter	0.270	69.7	0.123	85.6	0.104	89.6	0.071	97.2
VDA	0.300	63.3	0.075	95.4	0.079	95.0	0.051	98.1
FlashDepth	0.265	64.2	0.101	90.3	0.103	89.5	0.053	98.0
Ours	0.180	73.0	0.073	96.6	0.062	97.3	0.044	98.4

Qualitative indoor comparison among ground truth, FlashDepth, Video Depth Anything, and DyFN — Qualitative comparison. DyFN reduces non-rigid warping and produces more geometrically coherent stitched point clouds.

Video Demos

The demo shows DyFN on a continuous stream and reconstructed geometry.

Streaming reconstruction demo

Takeaways

Temporal instability is statistical

The dominant failure is scale-shift drift caused by latent feature statistic fluctuation.

Backbones can stay frozen

DyFN adapts strong image geometry models to streams without sacrificing their single-frame accuracy.

Causal and efficient

The recurrent module supports online streaming and is small enough to fine-tune efficiently.

Citation

Please cite the paper if you find the project useful.

@inproceedings{lyu2026streamingdepth,
  title={Stabilizing Streaming Video Geometry via Dynamic Feature Normalization},
  author={Lyu, Xiaoyang and Liu, Muxin and Wu, Xiaoshan and Wang, Ruicheng and Huang, Yi-Hua and Sun, Yang-Tian and Shi, Shaoshuai and Qi, Xiaojuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}