Definition

A point cloud is an unordered set of 3D points, where each point is represented by its spatial coordinates (x, y, z) and optionally additional attributes such as color (RGB), surface normal vectors, intensity, or semantic labels. Point clouds are the primary 3D data representation in robotics, providing metric spatial information that 2D images cannot provide: the actual distance to objects, their 3D shape, and their spatial relationship to the robot.

Unlike structured representations such as voxel grids or meshes, point clouds make no assumptions about surface connectivity or grid alignment. This makes them flexible and memory-efficient for representing scenes of arbitrary complexity, but also presents challenges for neural network processing since the data is inherently unordered and irregularly sampled.
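In code, a point cloud is just an N x 3 array (or N x 6 with per-point color), and the "unordered set" property means any row permutation represents the same cloud. A minimal NumPy sketch with illustrative coordinates:

```python
import numpy as np

# An unordered set of N points: rows are (x, y, z) in meters.
points = np.array([
    [0.40, 0.10, 0.02],
    [0.42, 0.11, 0.03],
    [0.41, 0.09, 0.02],
])

# Optional per-point attributes, e.g. RGB color in [0, 1].
colors = np.array([
    [0.8, 0.2, 0.2],
    [0.8, 0.2, 0.2],
    [0.1, 0.1, 0.9],
])

# Because the set is unordered, shuffling rows yields the same cloud.
rng = np.random.default_rng(0)
same_cloud = points[rng.permutation(len(points))]

centroid = points.mean(axis=0)  # an order-independent statistic
assert np.allclose(centroid, same_cloud.mean(axis=0))
```

Any algorithm that consumes point clouds must be robust to this row ordering, which is exactly the property PointNet-style networks are built around.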

Sources: How Point Clouds Are Generated

Several sensor technologies produce point clouds, each with distinct tradeoffs:

  • RGB-D cameras (structured light) — Intel RealSense D435/D455, Microsoft Azure Kinect, and Orbbec cameras project infrared patterns and measure deformation to compute depth. Range: 0.2-10m. Resolution: 640x480 to 1280x720 depth pixels. Cost: $150-400. Best for indoor manipulation at close range.
  • Stereo depth cameras — ZED 2, OAK-D use stereo vision (two RGB cameras) to compute depth via disparity. Work outdoors (unlike structured light) and provide high-resolution color, but struggle with textureless surfaces. Range: 0.3-20m.
  • LiDAR — Spinning or solid-state laser scanners (Velodyne, Ouster, Livox) measure time-of-flight of laser pulses. High precision (mm-level) and long range (up to 200m). Standard for autonomous driving and large-scale mapping. Cost: $500-10,000+.
  • Time-of-Flight (ToF) cameras — Emit modulated light and measure phase shift to compute depth. Fast (60+ fps) but lower resolution and precision than structured light. Used in some mobile robots and consumer devices.
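Whatever the sensor, the depth image it produces is converted to a point cloud by back-projecting each pixel through the pinhole camera model. A sketch with made-up intrinsics (the fx, fy, cx, cy values below are illustrative, not real device parameters):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) into an Nx3 point cloud
    in the camera frame via the pinhole model:
        x = (u - cx) * z / fx,  y = (v - cy) * z / fy
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Toy 4x4 depth frame with one invalid pixel.
depth = np.full((4, 4), 0.5)
depth[0, 0] = 0.0
pts = depth_to_points(depth, fx=600.0, fy=600.0, cx=2.0, cy=2.0)
print(pts.shape)  # (15, 3)
```

Real drivers (librealsense, the ZED SDK) perform this projection on-device or in their SDKs, but the math is the same.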

Applications in Robotics

Grasp planning: Point clouds are the primary input for 6-DOF grasp prediction. Networks like GraspNet, Contact-GraspNet, and AnyGrasp take a scene point cloud and predict gripper poses ranked by success probability. The 3D geometry directly reveals surface normals, curvature, and clearance needed to plan stable grasps.

Object detection and pose estimation: 3D object detectors (VoteNet, PointPillars) and pose estimators (DenseFusion, FoundationPose) use point clouds to localize objects and estimate their 6-DOF poses for manipulation. Point clouds are particularly important when objects are partially occluded or stacked.

Navigation and mapping: Mobile robots build occupancy maps from LiDAR point clouds using SLAM algorithms (LOAM, LIO-SAM, KISS-ICP). The resulting 3D maps enable autonomous navigation and obstacle avoidance.

Scene reconstruction: Multiple point clouds from different viewpoints can be registered (aligned) to build complete 3D reconstructions of environments. This is used for digital twins, sim-to-real environment modeling, and workspace analysis.

Policy input: Some learned manipulation policies use point clouds directly as observations. 3D-Diffusion-Actor and DP3 (3D Diffusion Policy) process point clouds with PointNet-style encoders to predict actions, providing better spatial reasoning than 2D image-based policies for tasks requiring precise depth understanding.

Processing Libraries and Tools

  • Open3D — Python/C++ library for point cloud processing, visualization, and 3D reconstruction. Provides voxel downsampling, normal estimation, ICP registration, RANSAC plane fitting, and mesh reconstruction. The most popular library for robotics research.
  • PCL (Point Cloud Library) — C++ library with comprehensive filtering, segmentation, registration, and feature extraction. The original standard, widely used in ROS/ROS2 pipelines. More mature but less Pythonic than Open3D.
  • NumPy-based processing — For simple operations (cropping, downsampling, transforming), raw NumPy array manipulation is often fastest. Point clouds are just Nx3 or Nx6 arrays, and many operations reduce to vectorized math.
  • PointNet / PointNet++ — Neural network architectures that process raw point clouds. PointNet uses per-point MLPs and max-pooling to achieve permutation invariance. PointNet++ adds hierarchical grouping for local feature extraction. Foundational for all learned point cloud processing.
  • ROS2 integration — The sensor_msgs/PointCloud2 message type is the standard for publishing and subscribing to point clouds in ROS2. PCL and Open3D both have ROS2 conversion utilities.
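PointNet's permutation invariance is worth seeing concretely: the same weights are applied to every point, and a symmetric max-pool collapses the point dimension, so row order cannot affect the output. A toy NumPy sketch in which a single random linear layer stands in for the full per-point MLP:

```python
import numpy as np

rng = np.random.default_rng(42)
points = rng.normal(size=(128, 3))   # a cloud of 128 points
W = rng.normal(size=(3, 64))         # shared per-point weights (toy "MLP")
b = rng.normal(size=(64,))

def pointnet_feature(pts):
    per_point = np.maximum(pts @ W + b, 0.0)  # same MLP + ReLU on each point
    return per_point.max(axis=0)              # symmetric max-pool over points

feat = pointnet_feature(points)
shuffled = pointnet_feature(points[rng.permutation(len(points))])
assert np.allclose(feat, shuffled)  # identical feature for any point order
```

The max-pool is the key design choice: any symmetric function (max, sum, mean) would give invariance, and the PointNet paper found max-pooling works best in practice.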

Compression and Transmission Challenges

A single 640x480 depth frame produces roughly 300,000 3D points. At 30 fps, that is 9 million points per second per camera. Multi-camera setups can easily generate 30-50 million points per second, creating bandwidth and storage challenges:

Voxel downsampling: The most common preprocessing step. A voxel grid (e.g., 5mm resolution) replaces all points within each voxel with their centroid, reducing point count by 10-100x while preserving geometric structure.
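The centroid-per-voxel idea can be sketched in a few lines of NumPy: quantize each point to a voxel index, then average the points sharing an index (Open3D's `voxel_down_sample` does the same thing efficiently in C++):

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Replace all points in each voxel with their centroid."""
    idx = np.floor(points / voxel_size).astype(np.int64)  # voxel index per point
    idx -= idx.min(axis=0)                                # make indices non-negative
    dims = idx.max(axis=0) + 1
    # Flatten the 3D voxel index into a single integer key per point.
    key = (idx[:, 0] * dims[1] + idx[:, 1]) * dims[2] + idx[:, 2]
    _, inverse, counts = np.unique(key, return_inverse=True, return_counts=True)
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)                      # sum points per voxel
    return sums / counts[:, None]                         # centroid per voxel

rng = np.random.default_rng(1)
cloud = rng.uniform(0, 0.1, size=(10_000, 3))    # 10k points in a 10 cm cube
down = voxel_downsample(cloud, voxel_size=0.005)  # 5 mm voxels
print(len(cloud), "->", len(down))
```

The reduction factor depends on how densely points fill the voxel grid; for a real depth frame at 5 mm resolution it is typically one to two orders of magnitude.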

Region-of-interest cropping: Discard points outside the robot's workspace (e.g., keep only a 1m x 1m x 0.5m box on the table). This alone can reduce data by 80-90%.
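Workspace cropping is a single boolean mask over the point array (the box bounds below are illustrative, not a specific robot's workspace):

```python
import numpy as np

def crop_box(points, lo, hi):
    """Keep only points inside the axis-aligned box [lo, hi]."""
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return points[mask]

rng = np.random.default_rng(2)
cloud = rng.uniform(-1.0, 1.0, size=(100_000, 3))  # simulated full scene
ws = crop_box(cloud,
              lo=np.array([0.0, -0.5, 0.0]),        # a 1m x 1m x 0.5m box
              hi=np.array([1.0,  0.5, 0.5]))
print(f"kept {len(ws) / len(cloud):.0%} of points")
```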

Octree compression: Hierarchical spatial partitioning that enables progressive level-of-detail streaming. Used by PCL's compression module and Draco (Google's open-source 3D compression library).

Network transmission: For teleoperation and cloud robotics, point clouds must be compressed for transmission. ROS2's DDS middleware supports best-effort QoS for point cloud topics, trading reliability for latency.

Practical Requirements

Calibration: Depth cameras require intrinsic calibration (factory-provided) and extrinsic calibration (camera-to-robot transform). Calibration errors directly translate to point cloud positioning errors, which cause grasp failures. Hand-eye calibration accuracy of 1-2mm is typically sufficient for tabletop manipulation.
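Applying the extrinsic calibration is one rigid transform per point. A sketch with a hypothetical 4x4 camera-to-base matrix (real values come from hand-eye calibration, not from this example):

```python
import numpy as np

def transform_points(points, T):
    """Apply a 4x4 homogeneous transform to an Nx3 point cloud."""
    R, t = T[:3, :3], T[:3, 3]
    return points @ R.T + t

# Hypothetical extrinsic: camera 0.5 m above the robot base, rotated
# 180 degrees about x (i.e. looking straight down at the table).
T_base_cam = np.array([
    [1.0,  0.0,  0.0, 0.0],
    [0.0, -1.0,  0.0, 0.0],
    [0.0,  0.0, -1.0, 0.5],
    [0.0,  0.0,  0.0, 1.0],
])

cam_points = np.array([[0.1, 0.2, 0.4]])  # 0.4 m in front of the camera
base_points = transform_points(cam_points, T_base_cam)
print(base_points)  # [[0.1, -0.2, 0.1]]
```

Note how a millimeter-level error in T_base_cam shifts every point by the same amount in the robot frame, which is why calibration errors translate directly into grasp errors.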

Noise and artifacts: Depth sensors produce noisy measurements, particularly at depth discontinuities (edges of objects), reflective surfaces (metal, glass), and transparent materials. Statistical outlier removal and bilateral filtering are standard preprocessing steps.

Compute: Real-time processing of a single downsampled camera stream at 30 fps is feasible on a modern multi-core CPU. GPU-accelerated libraries (CUDA-enabled Open3D, cuPCL) are needed to process dense or multi-camera point clouds at full frame rates.

Point Cloud Processing Pipeline for Manipulation

A standard point cloud processing pipeline at SVRC for tabletop manipulation follows these steps:

  • 1. Capture and transform — Acquire depth frames from calibrated cameras (RealSense D435/D455, ZED 2). Apply extrinsic calibration to transform points from camera frame to robot base frame. For multi-camera setups, merge point clouds from 2-3 viewpoints to reduce occlusion.
  • 2. Region-of-interest crop — Discard points outside the manipulation workspace (typically a 0.6m x 0.6m x 0.4m box above the table). This removes background clutter and reduces data volume by 80-90%.
  • 3. Table plane removal — Fit a plane using RANSAC and remove points within 5mm of the fitted plane. This isolates objects on the table from the table surface itself.
  • 4. Noise filtering — Statistical outlier removal (remove points whose average distance to k nearest neighbors exceeds 2 standard deviations). This eliminates depth sensor noise and flying pixels at object edges.
  • 5. Voxel downsampling — Downsample to 3-5mm voxel resolution for grasp planning, or 1-2mm for precision tasks. This ensures uniform point density and reduces compute for downstream algorithms.
  • 6. Object segmentation — Euclidean clustering or learned segmentation (Mask3D, SAM-3D) to identify individual objects. Each cluster becomes a candidate for grasp planning.
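Steps 3 and 4 of the pipeline above (RANSAC plane removal and statistical outlier filtering) can be sketched in plain NumPy on a toy scene; Open3D's `segment_plane` and `remove_statistical_outlier` provide the production versions:

```python
import numpy as np

def ransac_plane(points, n_iters=200, threshold=0.005, rng=None):
    """Fit a plane with RANSAC; return a boolean inlier mask."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue                                  # degenerate (collinear) sample
        normal /= norm
        dist = np.abs((points - p0) @ normal)         # point-to-plane distances
        mask = dist < threshold
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask

def remove_outliers(points, k=8, std_ratio=2.0):
    """Statistical outlier removal via mean distance to k nearest neighbors."""
    diff = points[:, None, :] - points[None, :, :]    # O(N^2): fine for small N
    d = np.linalg.norm(diff, axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)  # skip self-distance
    keep = knn < knn.mean() + std_ratio * knn.std()
    return points[keep]

# Toy scene: a flat table at z = 0 plus a small object sitting above it.
rng = np.random.default_rng(3)
table = np.column_stack([rng.uniform(0, 0.5, 400),
                         rng.uniform(0, 0.5, 400),
                         rng.normal(0, 0.001, 400)])
obj = rng.uniform([0.2, 0.2, 0.03], [0.25, 0.25, 0.08], size=(100, 3))
scene = np.vstack([table, obj])

plane_mask = ransac_plane(scene)        # step 3: fit and mask the table plane
objects = scene[~plane_mask]            # keep only points off the table
objects = remove_outliers(objects)      # step 4: drop flying pixels / stragglers
print(len(scene), "->", len(objects))
```

The 5 mm plane threshold and k = 8 neighbors mirror the values given in the pipeline above; both are tuning knobs that trade off how aggressively thin objects and sensor noise are removed.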

See Also

  • Data Services — Multi-camera point cloud capture and processing for manipulation datasets
  • Data Platform — Store and visualize point cloud datasets alongside image and proprioceptive data
  • Hardware Catalog — Depth cameras, LiDAR sensors, and mounting accessories
  • Robot Leasing — Pre-configured robot cells with calibrated depth camera arrays

Key Papers

  • Qi, C.R. et al. (2017). "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation." CVPR 2017. The foundational architecture for learning directly from point clouds.
  • Qi, C.R. et al. (2017). "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space." NeurIPS 2017. Adds local feature extraction through hierarchical grouping.
  • Fang, H. et al. (2020). "GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping." CVPR 2020. Benchmark and method for point cloud-based 6-DOF grasp detection.
  • Ze, Y. et al. (2024). "3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations." RSS 2024. Demonstrates the benefit of point cloud observations for diffusion-based robot policies.

Related Terms

  • Sim-to-Real Transfer — Synthetic point clouds in simulation must match real sensor characteristics
  • Workspace Analysis — Point clouds help verify that objects are within the robot's reachable space
  • Motion Planning — Uses point cloud-derived collision models for obstacle avoidance
  • Tactile Sensing — Complements point clouds with contact information not visible to depth cameras
  • Diffusion Policy — 3D Diffusion Policy variants use point clouds as observation input

Apply This at SVRC

Silicon Valley Robotics Center equips every manipulation cell with calibrated Intel RealSense and ZED depth cameras. Our data platform records synchronized point clouds alongside joint-state and image data, enabling teams to train 3D-aware policies. We provide pre-configured Open3D and PCL processing pipelines for common tasks like table-plane segmentation, object clustering, and grasp candidate generation.
