RadarSFD: Single-Frame Diffusion with Pretrained Priors for Radar Point Clouds

Rice University
[Figure: graphical abstract comparing (a) RadarHD (41 frames), (b) RadarHD (1 frame), (c) Zhang et al., (d) RadarSFD (Ours), and (e) LiDAR ground truth.]

RadarSFD reconstructs dense LiDAR-like point clouds from a single radar frame, recovering sharper walls and gaps than prior single-frame baselines without requiring motion or synthetic aperture radar.

Abstract

Millimeter-wave radar provides perception that is robust to fog, smoke, dust, and low light, making it attractive for size-, weight-, and power-constrained robotic platforms. Current radar imaging methods, however, rely on synthetic aperture or multi-frame aggregation to improve resolution, which is impractical for small aerial, inspection, or wearable systems.

We present RadarSFD, a conditional latent diffusion framework that reconstructs dense LiDAR-like point clouds from a single radar frame without motion or SAR. Our approach transfers geometric priors from a pretrained monocular depth estimator into the diffusion backbone, anchors them to radar inputs via channel-wise latent concatenation, and regularizes outputs with a dual-space objective combining latent and pixel-space losses.

On the RadarHD benchmark, RadarSFD achieves state-of-the-art single-frame performance, with the lowest mean Chamfer Distance (0.35) and mean MHD (0.28) among the compared baselines. Qualitative results show recovery of fine walls and narrow gaps, and experiments in new environments confirm strong generalization. Ablation studies highlight the importance of pretrained initialization, radar BEV conditioning, and the dual-space loss. Together, these results establish a practical single-frame, no-SAR mmWave radar pipeline for dense point cloud perception in compact robotic systems.

System Overview

[Figure: RadarSFD diffusion architecture showing pixel space, latent space, and radar conditioning.]

RadarSFD takes a single radar Range-Azimuth BEV and reconstructs a dense LiDAR-like point cloud through a conditional latent diffusion pipeline. Radar and LiDAR BEV images are first encoded into latent space. The radar latent c is concatenated with the noisy LiDAR latent z_t and passed into a pretrained U-Net denoiser. After iterative denoising, the decoder reconstructs a LiDAR-like point cloud with sharp geometry from a single radar frame.
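The conditioning step described above can be illustrated with a toy sketch. It is not the authors' implementation: latents are plain nested lists in [channel][height][width] layout, the four-channel latent size and 8x8 resolution are assumptions, the noising rule is the standard DDPM forward step, and the order of concatenation (radar first) is an arbitrary choice. The helper names `make_latent`, `add_noise`, and `concat_channels` are hypothetical.

```python
import math
import random


def make_latent(channels, height, width, fill):
    """Build a toy [C][H][W] nested-list latent filled with a constant."""
    return [[[fill] * width for _ in range(height)] for _ in range(channels)]


def add_noise(z0, alpha_bar_t, rng):
    """Standard DDPM forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [
        [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in ch]
        for ch in z0
    ]


def concat_channels(c, z_t):
    """Channel-wise concatenation of the radar latent and the noisy latent."""
    same_h = len(c[0]) == len(z_t[0])
    same_w = len(c[0][0]) == len(z_t[0][0])
    if not (same_h and same_w):
        raise ValueError("spatial sizes of the two latents must match")
    return c + z_t  # list concatenation along the channel axis


rng = random.Random(0)
radar_latent = make_latent(4, 8, 8, fill=1.0)   # stands in for c (assumed shape)
lidar_latent = make_latent(4, 8, 8, fill=0.0)   # stands in for the clean LiDAR latent
noisy_lidar = add_noise(lidar_latent, alpha_bar_t=0.5, rng=rng)
unet_input = concat_channels(radar_latent, noisy_lidar)

# Channels double while the spatial size is unchanged.
print(len(unet_input), len(unet_input[0]), len(unet_input[0][0]))
```

Because the radar channels are carried alongside the noisy latent at every denoising step, the denoiser always sees the observation it is supposed to explain. In a real pipeline the first convolution of the pretrained U-Net would need to accept the extra input channels.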

  • Lightly thresholded BEV preserves useful structure while keeping the input image-like.
  • Pretrained Marigold weights inject geometric priors into the diffusion backbone.
  • Channel-wise latent concatenation anchors the generation to the observed radar input.
  • Dual-space supervision combines latent denoising with decoded image reconstruction losses.
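As a rough illustration of the dual-space supervision in the last bullet, the sketch below adds a latent-space noise-prediction term to a pixel-space reconstruction term. It is a minimal sketch under stated assumptions: tensors are flattened to plain lists, the pixel term uses only L1, and `pixel_weight` is a hypothetical balancing factor, not a value from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l1(a, b):
    """Mean absolute error between two equal-length flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dual_space_loss(eps_pred, eps_true, img_pred, img_true, pixel_weight=1.0):
    """Latent denoising loss plus a weighted pixel-space reconstruction loss."""
    latent_term = mse(eps_pred, eps_true)   # latent space: predicted vs. true noise
    pixel_term = l1(img_pred, img_true)     # pixel space: decoded vs. target image
    return latent_term + pixel_weight * pixel_term


# Toy values only, chosen so each term is easy to check by hand.
loss = dual_space_loss(
    eps_pred=[0.0, 1.0], eps_true=[0.0, 0.0],   # MSE = 0.5
    img_pred=[0.5, 0.5], img_true=[1.0, 0.0],   # L1  = 0.5
    pixel_weight=0.5,
)
print(loss)
```

The pixel term requires decoding the predicted latent back to an image during training, which costs extra compute but lets the loss penalize errors that are visible in the final BEV.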

Results

Evaluation on the RadarHD Dataset

Method                  # Frames   Mean CD ↓   Mean MHD ↓
CFAR                    1          0.84        0.91
RadarHD                 41         0.44        0.34
RadarHD single-frame    1          0.56        0.45
Luan et al.             5          0.59        0.50
Zhang et al.            1          0.38        0.29
RadarSFD (Ours)         1          0.35        0.28

RadarSFD achieves the best single-frame performance on the RadarHD dataset, improving over Luan et al. (ICRA 2024) and Zhang et al. (RA-L 2024).
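For readers unfamiliar with the CD metric, here is a minimal sketch of one common form of Chamfer Distance between two 2D point sets. The benchmark's exact variant (squared versus unsquared distances, sum versus mean, units) may differ, so this is an assumption for illustration and not the evaluation code behind the reported numbers.

```python
import math


def _mean_nearest(src, dst):
    """Average distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean of the two directed nearest-neighbor terms."""
    return 0.5 * (_mean_nearest(a, b) + _mean_nearest(b, a))


predicted = [(0.0, 0.0), (1.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

The brute-force nearest-neighbor search is quadratic in the number of points; for dense clouds a k-d tree would be the usual replacement.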

[Figure: four scenes (rows), each shown as (a) CA-CFAR, (b) RadarHD (41 frames), (c) RadarHD single-frame, (d) Zhang et al., (e) RadarSFD, and (f) LiDAR ground truth.]

Qualitative comparison of point cloud reconstructions on four representative scenes with varying complexity. All results are shown in Cartesian coordinates for direct comparison.

  • (a) CA-CFAR is sparse and noisy.
  • (b) RadarHD (41 frames) captures geometry but with blurry edges and clutter.
  • (c) RadarHD single-frame misses structures.
  • (d) Zhang et al. (RA-L 2024) yields cleaner but lower-resolution outputs.
  • (e) RadarSFD achieves sharp, complete reconstructions that closely match (f) LiDAR ground truth.
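Since the comparison above is rendered in Cartesian coordinates while the radar input is a range-azimuth image, a conversion between the two is implied. The sketch below shows the basic geometry under an assumed convention: azimuth is measured in radians from the sensor's forward axis, with x lateral and y forward. The function name and convention are hypothetical and may not match the authors' pipeline.

```python
import math


def polar_to_cartesian(range_m, azimuth_rad):
    """Convert a (range, azimuth) detection to (x, y) with y pointing forward."""
    x = range_m * math.sin(azimuth_rad)   # lateral offset
    y = range_m * math.cos(azimuth_rad)   # forward distance
    return x, y


detections = [(2.0, 0.0), (2.0, math.pi / 2), (1.0, -math.pi / 6)]
points = [polar_to_cartesian(r, az) for r, az in detections]
for (r, az), (x, y) in zip(detections, points):
    print(f"range={r:.2f} m, azimuth={math.degrees(az):6.1f} deg -> x={x:.3f}, y={y:.3f}")
```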

Generalization to Unseen Scenes

[Figure: two unseen scenes (rows), each shown as (a) RGB image of unseen building, (b) 3D layout, (c) RadarHD, (d) Zhang et al. (RA-L 2024), and (e) RadarSFD (ours).]

Real-world generalization results on completely unseen scenes from our campus building. All models are trained on the same RadarHD dataset. The panels show (a) an RGB image of the unseen environment, (b) the 3D floor-plan layout, (c) the RadarHD single-frame baseline, (d) Zhang et al. (RA-L 2024), and (e) RadarSFD single-frame latent diffusion.

Ablation Insights

[Figure: three box-plot panels titled Input Representation, Pretrained Priors, and Training Loss.]

Ablation box plots using Chamfer Distance (CD). From left to right, the plots evaluate input representation, pretrained priors, and training losses.

  • Thresholded BEV inputs outperform raw I/Q signals.
  • Pretrained priors, whether from the Marigold depth estimator or the SDv2 image diffusion model it builds on, perform much better than random initialization.
  • Adding pixel-space L1 drives most of the gain, while SSIM and LPIPS provide only marginal benefit.
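To make the thresholded BEV input in the first bullet concrete, the sketch below zeroes out weak cells of a radar intensity image while leaving the rest untouched. The rule used here (keep cells at or above a fraction of the peak) and the default fraction are assumptions for illustration; the page does not state how the light threshold is chosen.

```python
def threshold_bev(bev, keep_fraction_of_max=0.1):
    """Zero out cells below a fraction of the peak intensity; keep the rest as-is."""
    peak = max(max(row) for row in bev)
    cutoff = keep_fraction_of_max * peak
    return [[v if v >= cutoff else 0.0 for v in row] for row in bev]


# Toy 2x2 intensity image; real BEVs are much larger.
toy_bev = [
    [0.05, 1.0],
    [0.2, 0.0],
]
filtered = threshold_bev(toy_bev)
print(filtered)
```

A low cutoff like this removes only the weakest returns, so the result still looks like a dense image and not a sparse detection list such as the one CFAR produces.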

BibTeX

@inproceedings{zhao2026radarsfd,
  title     = {RadarSFD: Single-Frame Diffusion with Pretrained Priors for Radar Point Clouds},
  author    = {Zhao, Bin and Garg, Nakul},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026}
}