Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis

WACV 2026

1 Noah's Ark, Huawei Paris Research Center, France
2 COSYS, Gustave Eiffel University, France
3 LASTIG, IGN-ENSG, Gustave Eiffel University, France

PointmapDiff is a method for extrapolated view synthesis in urban scenes. We show viewpoints generated at a 45-degree angle to the right (first row) and shifted 1.5 m to the left (second row). Our approach significantly outperforms the baselines when rendering viewpoints beyond the original ego trajectory, where other methods suffer from severe artifacts.

Abstract

Synthesizing extrapolated views remains a difficult task, especially in urban driving scenes, where the only reliable sources of data are limited RGB captures and sparse LiDAR points. To address this problem, we present PointmapDiff, a novel view synthesis framework built on pre-trained 2D diffusion models. Our method leverages pointmaps (i.e., rasterized 3D scene coordinates) as a conditioning signal, transferring geometric and photometric priors from the reference images to guide the image generation process. With our proposed reference attention blocks and a ControlNet for pointmap features, the model generates accurate and consistent results across varying viewpoints while respecting the scene geometry. Experiments on real-life driving data demonstrate that PointmapDiff achieves high-quality generation with flexible control over the pointmap conditioning signal (e.g., a dense depth map or even sparse LiDAR points) and can be distilled into 3D representations such as 3D Gaussian Splatting to improve view extrapolation.

Method

(left) Our PointmapDiff model is trained in the latent space of a fixed VAE with encoder $\mathcal{E}$ and decoder $\mathcal{D}$. Given a reference RGB image $I^{r}$ and its corresponding geometry $D^{r}$, we obtain a pair of pointmaps $\{X^{r,t}, X^{t,t}\}$ as input. We predict the target image $I^{t}$ from the geometry signal of the target pointmap and appearance information from the reference U-Net. In particular, two Pointmap ControlNets extract geometric feature correspondences and concatenate them with the intermediate SD feature maps. We freeze the original SD model and train only the Pointmap ControlNets and the reference attention module. (right) We extract reference features using our reference U-Net. These augmented features are integrated into the target U-Net through a reference-guided cross-view attention mechanism, which is inserted throughout the target U-Net.
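The pointmap conditioning described above can be sketched in two steps: a dense depth map is unprojected into world-frame 3D scene coordinates, and those coordinates are then rasterized into the target view to form the conditioning pointmap. Below is a minimal NumPy sketch assuming a pinhole intrinsic matrix `K` and 4x4 camera poses; the function names and the z-buffering strategy are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def depth_to_pointmap(depth, K, cam_to_world):
    """Unproject a dense depth map into a pointmap: an H x W x 3 image
    whose pixels store 3D scene coordinates in the world frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project each pixel (u, v, 1) through K^-1, then scale by depth.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T
    cam_pts = rays * depth[..., None]
    # Move camera-frame points to world coordinates with the 4x4 pose.
    cam_h = np.concatenate([cam_pts, np.ones_like(depth)[..., None]], axis=-1)
    world = cam_h @ cam_to_world.T
    return world[..., :3]

def render_pointmap(world_pts, K, world_to_cam, H, W):
    """Rasterize world-frame 3D points into a target view, producing a
    pointmap conditioning image (z-buffered: the nearest point wins)."""
    pts = world_pts.reshape(-1, 3)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=-1)
    cam = (pts_h @ world_to_cam.T)[:, :3]
    valid = cam[:, 2] > 1e-6           # keep points in front of the camera
    cam, pts = cam[valid], pts[valid]
    proj = cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, cam, pts = uv[inside], cam[inside], pts[inside]
    pointmap = np.zeros((H, W, 3))
    # Sort far to near so the nearest point is written last (a simple z-buffer).
    order = np.argsort(-cam[:, 2])
    pointmap[uv[order, 1], uv[order, 0]] = pts[order]
    return pointmap
```

Rendering the reference pointmap into the target camera yields $X^{r,t}$, while unprojecting the target geometry in its own camera gives $X^{t,t}$; pixels left at zero mark regions with no reference coverage.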

Results

Extrapolation in street-view reconstruction

PointmapDiff can enhance 3DGS rendering results at extrapolated viewpoints, including lateral shifting, rotation, and flying up.


3DGS PointmapDiff 3DGS+PointmapDiff


LiDAR-aligned Generation

PointmapDiff can generate images that both respect reference appearance and LiDAR geometry.
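Sparse LiDAR conditioning follows the same projection as the dense case, except that most pixels receive no point; a validity mask distinguishes observed pixels from empty ones. The sketch below is an illustrative NumPy assumption (names are not from the released code), assuming world-frame LiDAR points, a pinhole `K`, and a 4x4 world-to-camera pose.

```python
import numpy as np

def lidar_to_sparse_pointmap(points_world, K, world_to_cam, H, W):
    """Project sparse LiDAR points into the target view, yielding a sparse
    pointmap plus a validity mask; pixels with no point stay zero."""
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=-1)
    cam = (pts_h @ world_to_cam.T)[:, :3]
    keep = cam[:, 2] > 1e-6            # discard points behind the camera
    cam, pts = cam[keep], points_world[keep]
    proj = cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, cam, pts = uv[inside], cam[inside], pts[inside]
    pointmap = np.zeros((H, W, 3))
    mask = np.zeros((H, W), dtype=bool)
    # Far-to-near write order: when two points land on one pixel, the nearest wins.
    order = np.argsort(-cam[:, 2])
    pointmap[uv[order, 1], uv[order, 0]] = pts[order]
    mask[uv[:, 1], uv[:, 0]] = True
    return pointmap, mask
```

The same conditioning pathway thus accepts dense depth or raw LiDAR returns; only the fill rate of the pointmap changes.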


Baseline PointmapDiff


Videos

Flying upward

3DGS
3DGS+PointmapDiff


Lateral shift

Original trajectory
3DGS
3DGS+PointmapDiff
3DGS
FreeVS
3DGS+PointmapDiff


BibTeX

@inproceedings{nguyen2026pointmap,
    title={Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis},
    author={Nguyen, Thang-Anh-Quan and Piasco, Nathan and Rold{\~a}o, Luis and Bennehar, Moussab and Tsishkou, Dzmitry and Caraffa, Laurent and Tarel, Jean-Philippe and Br{\'e}mond, Roland},
    booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
    year={2026}
}