Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis

WACV 2026

1 Noah's Ark, Huawei Paris Research Center, France
2 COSYS, Gustave Eiffel University, France
3 LASTIG, IGN-ENSG, Gustave Eiffel University, France

PointmapDiff is a method for extrapolated view synthesis in urban scenes. We show viewpoints generated at a 45-degree angle to the right (first row) and shifted 1.5 m to the left (second row). Our approach significantly outperforms the baselines when rendering viewpoints beyond the original ego trajectory, where other methods suffer from severe artifacts.

Abstract

Synthesizing extrapolated views remains a difficult task, especially in urban driving scenes, where the only reliable sources of data are limited RGB captures and sparse LiDAR points. To address this problem, we present PointmapDiff, a novel view synthesis framework built on pre-trained 2D diffusion models. Our method leverages pointmaps (i.e., rasterized 3D scene coordinates) as a conditioning signal, transferring geometric and photometric priors from the reference images to guide the image generation process. With our proposed reference attention blocks and a ControlNet for pointmap features, the model generates accurate and consistent results across varying viewpoints while respecting the scene geometry. Experiments on real-life driving data demonstrate that PointmapDiff achieves high-quality generation with flexible control over the pointmap conditioning signal (e.g., a dense depth map or even sparse LiDAR points) and can be distilled into 3D representations such as 3D Gaussian Splatting to improve view extrapolation.

Method

(left) Our PointmapDiff model is trained in the latent space of a fixed VAE with encoder $\mathcal{E}$ and decoder $\mathcal{D}$. Given a reference RGB image $I^{r}$ and its corresponding geometry $D^{r}$, we obtain a pair of pointmaps $\{X^{r,t}, X^{t,t}\}$ as input. We predict the target image $I^{t}$ from the geometry signal of the target pointmap and appearance information from the reference U-Net. In particular, two Pointmap ControlNets extract geometric feature correspondences and concatenate them with the intermediate SD feature maps. We freeze the original SD model and train only the Pointmap ControlNets and the reference attention module. (right) We extract reference features using our reference U-Net. These augmented features are integrated into the target U-Net through a reference-guided cross-view attention mechanism, which is inserted throughout the target U-Net.
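The pointmap conditioning described above can be sketched in two steps: a dense depth map is unprojected into world-frame 3D scene coordinates, and those coordinates are then rasterized into the target view to form the conditioning pointmap. Below is a minimal NumPy sketch assuming a pinhole intrinsic matrix `K` and 4x4 camera poses; the function names and the z-buffering strategy are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def depth_to_pointmap(depth, K, cam_to_world):
    """Unproject a dense depth map into a pointmap: an H x W x 3 image
    whose pixels store 3D scene coordinates in the world frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project each pixel (u, v, 1) through K^-1, then scale by depth.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T
    cam_pts = rays * depth[..., None]
    # Move camera-frame points to world coordinates with the 4x4 pose.
    cam_h = np.concatenate([cam_pts, np.ones_like(depth)[..., None]], axis=-1)
    world = cam_h @ cam_to_world.T
    return world[..., :3]

def render_pointmap(world_pts, K, world_to_cam, H, W):
    """Rasterize world-frame 3D points into a target view, producing a
    pointmap conditioning image (z-buffered: the nearest point wins)."""
    pts = world_pts.reshape(-1, 3)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=-1)
    cam = (pts_h @ world_to_cam.T)[:, :3]
    valid = cam[:, 2] > 1e-6           # keep points in front of the camera
    cam, pts = cam[valid], pts[valid]
    proj = cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, cam, pts = uv[inside], cam[inside], pts[inside]
    pointmap = np.zeros((H, W, 3))
    # Sort far to near so the nearest point is written last (a simple z-buffer).
    order = np.argsort(-cam[:, 2])
    pointmap[uv[order, 1], uv[order, 0]] = pts[order]
    return pointmap
```

Rendering the reference pointmap into the target camera yields $X^{r,t}$, while unprojecting the target geometry in its own camera gives $X^{t,t}$; pixels left at zero mark regions with no reference coverage.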

Results

Extrapolation in street-view reconstruction

PointmapDiff can enhance 3DGS rendering results at extrapolated viewpoints, including lateral shifting, rotation, and flying up.


3DGS PointmapDiff 3DGS+PointmapDiff


LiDAR-aligned Generation

PointmapDiff can generate images that both respect reference appearance and LiDAR geometry.
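Sparse LiDAR conditioning follows the same projection as the dense case, except that most pixels receive no point; a validity mask distinguishes observed pixels from empty ones. The sketch below is an illustrative NumPy assumption (names are not from the released code), assuming world-frame LiDAR points, a pinhole `K`, and a 4x4 world-to-camera pose.

```python
import numpy as np

def lidar_to_sparse_pointmap(points_world, K, world_to_cam, H, W):
    """Project sparse LiDAR points into the target view, yielding a sparse
    pointmap plus a validity mask; pixels with no point stay zero."""
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=-1)
    cam = (pts_h @ world_to_cam.T)[:, :3]
    keep = cam[:, 2] > 1e-6            # discard points behind the camera
    cam, pts = cam[keep], points_world[keep]
    proj = cam @ K.T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, cam, pts = uv[inside], cam[inside], pts[inside]
    pointmap = np.zeros((H, W, 3))
    mask = np.zeros((H, W), dtype=bool)
    # Far-to-near write order: when two points land on one pixel, the nearest wins.
    order = np.argsort(-cam[:, 2])
    pointmap[uv[order, 1], uv[order, 0]] = pts[order]
    mask[uv[:, 1], uv[:, 0]] = True
    return pointmap, mask
```

The same conditioning pathway thus accepts dense depth or raw LiDAR returns; only the fill rate of the pointmap changes.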


Baseline PointmapDiff


Videos

Flying upward

3DGS
3DGS+PointmapDiff


Lateral shift

Original trajectory
3DGS
3DGS+PointmapDiff
3DGS
FreeVS
3DGS+PointmapDiff


BibTeX

@inproceedings{nguyen2026pointmap,
    title={Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis},
    author={Nguyen, Thang-Anh-Quan and Piasco, Nathan and Rold{\~a}o, Luis and Bennehar, Moussab and Tsishkou, Dzmitry and Caraffa, Laurent and Tarel, Jean-Philippe and Br{\'e}mond, Roland},
    booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
    year={2026}
}