Meta has released SAM 3D, an extension of its Segment Anything Model (SAM) family that enables 3D reconstruction and spatial understanding from a single 2D image. The release includes two models—SAM 3D Objects and SAM 3D Body—along with model checkpoints and inference code. Together, they target enterprise scenarios that require spatial reasoning, asset generation, or human pose estimation without relying on multi-view capture or specialized sensors.
SAM 3D reflects a broader enterprise shift toward AI systems that operate reliably on real-world visual data rather than on data from controlled or synthetic environments. Single-image 3D reconstruction reduces deployment complexity and cost, making it practical for applications in commerce, media production, robotics, and simulation where collecting depth data or multiple viewpoints is infeasible.
SAM 3D Objects focuses on reconstructing object geometry, texture, and spatial layout from natural images that include occlusion, indirect viewpoints, and clutter. Users can select individual objects within an image and generate posed 3D assets that can be manipulated independently or viewed from different camera angles. This capability addresses a persistent limitation in 3D pipelines, where models trained on isolated or synthetic assets fail to generalize to everyday scenes.
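In application terms, the workflow is: supply an image plus a selection of one object, and receive a textured mesh together with its pose in the scene. The sketch below illustrates that input/output shape with placeholder types; the class, function, and field names are assumptions for illustration, not the released inference API.

```python
# Illustrative interface sketch for single-image object reconstruction.
# All names below (PosedAsset, reconstruct_object, etc.) are hypothetical;
# consult the released SAM 3D Objects inference code for the real API.
from dataclasses import dataclass
import numpy as np


@dataclass
class PosedAsset:
    vertices: np.ndarray   # (V, 3) mesh geometry
    faces: np.ndarray      # (F, 3) triangle indices
    texture: np.ndarray    # (H, W, 3) albedo texture
    pose: np.ndarray       # (4, 4) object-to-camera transform (layout in the scene)


def reconstruct_object(image: np.ndarray, object_mask: np.ndarray) -> PosedAsset:
    """Given an RGB image and a binary mask selecting one object,
    return a posed, textured 3D asset (stubbed here for illustration)."""
    assert image.ndim == 3 and object_mask.shape == image.shape[:2]
    # A real model would infer geometry, texture, and pose here.
    return PosedAsset(
        vertices=np.zeros((0, 3)),
        faces=np.zeros((0, 3), dtype=int),
        texture=np.zeros((0, 0, 3)),
        pose=np.eye(4),
    )
```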
A key contribution of SAM 3D Objects is its data strategy. High-quality 3D ground truth is scarce and expensive, traditionally requiring skilled 3D artists. To scale beyond this bottleneck, Meta developed a data annotation engine that relies on human ranking and verification of model-generated meshes rather than manual creation. Annotators evaluate multiple candidate reconstructions produced by models in the loop, while expert artists are reserved for the most difficult cases. Using this approach, nearly one million real-world images were annotated, generating more than three million candidate meshes.
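The decision flow of such an engine can be sketched simply: models propose candidate meshes, annotators rank them, and only low-scoring cases escalate to artists. The snippet below is a hypothetical simplification; the function names and acceptance threshold are assumptions rather than details of Meta's pipeline.

```python
# Simplified, hypothetical sketch of a rank-and-verify annotation loop.
# Function names and the acceptance threshold are illustrative assumptions.
from typing import Callable, List, Optional, Tuple


def annotate_image(
    image_id: str,
    generate_candidates: Callable[[str], List[dict]],                   # models in the loop propose meshes
    rank_candidates: Callable[[List[dict]], List[Tuple[dict, float]]],  # human ranking produces scores
    artist_queue: List[str],
    accept_score: float = 0.8,  # assumed quality bar
) -> Optional[dict]:
    candidates = generate_candidates(image_id)
    ranked = sorted(rank_candidates(candidates), key=lambda pair: pair[1], reverse=True)
    best_mesh, score = ranked[0]
    if score >= accept_score:
        return best_mesh              # verified mesh enters the training set
    artist_queue.append(image_id)     # hard case reserved for an expert 3D artist
    return None
```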
The training process mirrors recent large language model practices: synthetic 3D data serves as the pre-training stage, followed by a post-training alignment phase on natural images to close the sim-to-real gap. Improvements in model quality feed back into the data engine, creating an iterative loop that increases robustness and output quality over time. This tight coupling of data generation and post-training is intended to make 3D perception more scalable and less dependent on specialized labor.
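Schematically, the recipe is a two-stage pipeline wrapped in a data flywheel, as the sketch below suggests; the stage functions and the data-engine interface are placeholders, not Meta's published training code.

```python
# Schematic sketch of the synthetic pre-train / natural-image post-train loop.
# The stage callables and the data-engine methods are placeholders for illustration.
def train_sam3d(model, synthetic_data, data_engine, pretrain, posttrain_align, rounds=3):
    model = pretrain(model, synthetic_data)       # learn general 3D priors from synthetic assets
    for _ in range(rounds):
        verified = data_engine.collect(model)     # human-verified meshes on natural images
        model = posttrain_align(model, verified)  # alignment pass to close the sim-to-real gap
        data_engine.update_proposers(model)       # stronger model proposes better candidates next round
    return model
```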
To address the lack of realistic evaluation benchmarks, Meta collaborated with artists to create the SAM 3D Artist Objects (SA-3DAO) dataset, which will be released separately. The dataset pairs natural images with artist-validated meshes and is designed to be significantly more challenging than existing benchmarks that rely on staged scenes or synthetic assets, providing a more relevant measure of progress for real-world deployment.
SAM 3D Body targets 3D human pose and shape estimation from a single image, including cases with occlusion, unusual postures, or multiple people. The model is promptable, allowing users to guide predictions using segmentation masks or 2D keypoints. It introduces the Meta Momentum Human Rig (MHR), an open-source parametric mesh format that separates skeletal structure from soft-tissue shape, improving interpretability and downstream control for animation, simulation, and avatar systems.
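The split MHR draws between skeleton and soft tissue can be pictured as two independent parameter blocks driving one mesh, with segmentation masks or 2D keypoints as optional prompts. The sketch below is a conceptual illustration; its names, joint count, and shape dimensions are assumptions, not the MHR or SAM 3D Body API.

```python
# Conceptual sketch of promptable human mesh recovery with a rig that keeps
# skeletal pose and body-shape parameters separate (in the spirit of MHR).
# Names, parameter counts, and the stubbed model are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional
import numpy as np


@dataclass
class HumanRigParams:
    skeleton_pose: np.ndarray   # per-joint rotations controlling articulation
    body_shape: np.ndarray      # soft-tissue/shape coefficients, editable independently


def estimate_body(
    image: np.ndarray,
    mask: Optional[np.ndarray] = None,          # optional segmentation-mask prompt
    keypoints_2d: Optional[np.ndarray] = None,  # optional 2D keypoint prompt, shape (K, 2)
) -> HumanRigParams:
    """Single-image 3D human pose/shape estimation, stubbed for illustration.
    Prompts, when given, steer which person is reconstructed and how."""
    # A real model would condition on the image and prompts; here we return neutral parameters.
    return HumanRigParams(
        skeleton_pose=np.zeros((52, 3)),  # assumed joint count; placeholder values
        body_shape=np.zeros(16),          # assumed number of shape coefficients
    )
```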
SAM 3D Body is trained on a curated dataset of approximately eight million images drawn from large-scale photo collections, multi-camera video captures, and synthetic data, with emphasis on rare poses and challenging conditions. On standard benchmarks, it outperforms prior models in accuracy and robustness. Meta has released MHR under a permissive commercial license, enabling enterprise reuse.
Overall, SAM 3D represents a step toward operationally viable 3D perception, aligning with enterprise priorities around scalability, reliability, and deployment in uncontrolled real-world environments.