Human Universal Grasping
Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto
THE PROBLEM
Multi-fingered robot grasping remains far from human-level generality. Prior approaches suffer from fundamental limitations: synthetic methods (optimization-based or learned generators in simulation) struggle with the sim-to-real gap and require retraining per robot hand; teleoperation produces real embodiment-specific grasps but is tedious and cannot cover object diversity; and learning from robot data is expensive because dexterous teleoperation is slow and difficult to scale. Most prior work trains on simulation (e.g., DexGraspNet, Dex1B, UniDexGrasp++) or lab-collected data (DexYCB, AnyDexGrasp), which lack the scale and diversity of real-world grasping. Critically, existing multi-fingered methods often require complete object point clouds, which are unavailable in single-view real deployment, limiting practical applicability. The core problem is one of data sourcing: robots need the diverse, naturally executed grasping experience humans accumulate daily, but collecting this via robot-specific teleoperation is prohibitively expensive. Recent advances in lightweight egocentric sensors (Aria Gen 2) and anthropomorphic robot hands with learned retargeting have made a previously infeasible approach practical: collect human grasps at scale, learn the natural distribution of human grasping, and retarget to robots. No prior work has demonstrated this pipeline for multi-fingered dexterous grasping.
HOW IT WORKS
1M-HUGs Dataset Collection & Curation
Flow-Matching Architecture with RGB-PC Fusion
Hand Retargeting to Robot Embodiments
HUG-Bench: Metric-Scale Benchmark Construction
Evaluation in Simulation and Real World
KEY RESULTS
Beats DexGraspNet by +23%, Dex1B by +34%
HUG achieves significantly higher success than prior multi-fingered methods, demonstrating that learning from natural human grasps generalizes better than synthetic or simulation-trained approaches. The gap widens on this challenging set of articulated, tiny, and oversized objects.
Consistent zero-shot transfer across stereo cameras and robot embodiments
The 62% in-the-wild rate shows HUG generalizes robustly to deployment conditions far from the collection distribution. This is the most realistic setting (unconstrained object placement, variable lighting, different cameras), yet performance remains strong, validating that human grasping distributions capture generalizable strategies.
Figure 7 shows scaling curve; performance continues to improve without saturation
The paper demonstrates a positive scaling law: larger datasets yield better generalization. This is critical evidence that the approach benefits from more human grasp data, suggesting further scaling could push performance higher. The curve does not plateau, implying diminishing but ongoing returns.
Figure 9 shows qualitative cases (pineapple, hairbrush, spoon) where single modality fails but fusion succeeds
Point painting and joint RGB-PC conditioning are necessary: RGB alone struggles on transparent/reflective objects (anchovies in water, glass), while point clouds alone lose texture-based information. The fusion approach balances both signals, which is critical for diverse real-world objects.
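To make the point-painting idea concrete, here is a minimal sketch of its general form: project each 3D point into the image and attach the features of the pixel it lands on. The function name paint_points, the pinhole intrinsics matrix K, and the nearest-pixel lookup are assumptions made for illustration; the paper may well paint points with learned image features rather than raw RGB, and its exact procedure is not described in this summary.

```python
import numpy as np

def paint_points(points_cam, image, K):
    """Attach per-pixel image features to 3D points (illustrative sketch).

    points_cam: (N, 3) points in the camera frame, z pointing forward.
    image:      (H, W, C) array of per-pixel features (raw RGB here).
    K:          (3, 3) pinhole intrinsics (assumed known from calibration).
    Returns an (M, 3 + C) array for the M points that are in front of the
    camera and project inside the image.
    """
    height, width = image.shape[:2]

    # Keep only points in front of the camera to avoid invalid projections.
    pts = points_cam[points_cam[:, 2] > 1e-6]

    # Pinhole projection: homogeneous pixel coordinates, then divide by depth.
    uvw = pts @ K.T
    col = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    row = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Discard points that fall outside the image bounds.
    inside = (col >= 0) & (col < width) & (row >= 0) & (row < height)

    # Nearest-pixel lookup, then concatenate geometry with appearance.
    feats = image[row[inside], col[inside]]
    return np.concatenate([pts[inside], feats], axis=1)
```

The painted array can then be fed to a point-cloud encoder so that each point carries both geometry and appearance, which is the intuition behind fusing the two modalities.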
WHY DEVELOPERS SHOULD CARE
For software developers building robot systems, this paper demonstrates a paradigm shift: multi-fingered dexterous grasping can be learned from human data rather than robot data or simulation, and the resulting policy transfers zero-shot to new robot embodiments without retraining. This is significant because it decouples data collection from deployment. Developers no longer need to commission expensive robot teleoperation campaigns or endure sim-to-real gaps; instead, they can leverage egocentric video, which is increasingly easy to collect at scale with consumer smart glasses, to bootstrap manipulation capabilities. The 1M-HUGs dataset and aria2mano curation pipeline provide a concrete template for scaling this approach: capture diverse human grasps with calibrated depth and hand tracking, fit anatomical hand models, and train a simple flow-matching model.
The architecture itself is surprisingly standard (DINOv2 + PointNeXt + DiT with cross-attention), suggesting that the bottleneck is data quality and diversity, not model design. For roboticists, the key takeaway is that natural human grasping distributions matter. Rather than optimizing for force-closure or sampling all physically valid grasps, learning what humans actually do, which is often simpler and more conservative, produces policies that execute reliably on real robot hardware. The paper also introduces a new evaluation standard: HUG-Bench, with metric-scale reconstructions and paired simulation-to-real evaluation, is a more honest benchmark than purely simulation-only tests. The open release of code, data, trained models, and benchmark assets lowers the barrier for future work, making this a platform for advancing manipulation research.
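For readers unfamiliar with flow matching, the sketch below shows the commonly used linear-path formulation: regress a velocity field toward the straight line from noise to data, then sample by integrating that field with Euler steps. This is an assumption about the general technique, not a description of HUG's exact objective; the grasp vector layout, the function names, and the oracle velocity used in the demo (which stands in for the trained DiT) are all hypothetical.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one training example for linear-path flow matching.

    x1: (D,) target grasp vector (hypothetical layout, e.g. wrist pose
        plus finger joint angles).
    Returns (x_t, t, v_target); a network v(x_t, t, condition) would be
    trained to regress v_target with a squared-error loss.
    """
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform(0.0, 1.0)           # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight noise-to-data path
    v_target = x1 - x0                  # constant velocity along that path
    return x_t, t, v_target

def euler_sample(velocity_fn, dim, steps, rng):
    """Generate a sample by integrating dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = rng.standard_normal(dim)
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Demo with an oracle velocity for a single known target (stand-in for a trained model).
rng = np.random.default_rng(0)
target = np.array([0.2, -0.1, 0.5, 1.0])
oracle = lambda x, t: (target - x) / (1.0 - t)
sample = euler_sample(oracle, dim=4, steps=10, rng=rng)
```

In a real system the oracle would be replaced by the learned network conditioned on image and point-cloud features, and different noise draws would yield different candidate grasps.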
LIMITATIONS
The paper lists several practical constraints: HUG is trained only on right-handed grasps with a fixed canonical MANO hand, so it does not model left-handed, bimanual, or hand-specific morphology. Retargeting can fail when a robot hand cannot realize the predicted human pose, and real-world executions are open-loop, so shifted or articulated objects can break the plan. Labels can also be noisy under hand occlusion; accuracy drops for very small objects due to 224 x 224 inputs and for large or far objects that are rare in egocentric data; and the evaluation remains indoor-only.
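One simple way a developer might detect the retargeting failure mode described above is to check the retargeted joint angles against the robot hand's joint limits before executing. The sketch below is an assumption for illustration only: the function name, the tolerance, and the limit values are hypothetical, and the paper's learned retargeting may handle infeasible poses differently.

```python
import numpy as np

def check_joint_limits(q, lower, upper, tol=0.05):
    """Clamp retargeted joint angles and report how far they were out of range.

    q, lower, upper: (J,) joint angles and limits in radians (hypothetical values).
    tol: largest violation (radians) still accepted as executable.
    Returns (clamped, violation, feasible).
    """
    q = np.asarray(q, dtype=float)
    clamped = np.clip(q, lower, upper)
    # Largest distance any joint had to move to get back inside its limits.
    violation = float(np.max(np.abs(q - clamped)))
    return clamped, violation, violation <= tol

# Hypothetical three-joint example: two joints fall outside the limits.
clamped, violation, feasible = check_joint_limits(
    q=[0.1, 1.8, -0.3],
    lower=np.zeros(3),
    upper=np.full(3, 1.5),
)
```

A pose flagged as infeasible could be skipped in favor of another sampled grasp instead of being executed in its clamped form.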
WHAT COMES NEXT
The natural next step is to turn HUG from a single open-loop grasp predictor into a closed-loop grasping system: generate multiple candidate grasps, rank them, and replan during contact and lift with visual feedback. The paper also points toward broader grasp data collection: left-handed and bimanual grasps, variable hand morphology, outdoor or less controlled scenes, and more data for large or far objects would make the human-to-robot transfer story more complete.
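The generate, rank, and replan loop suggested above might be structured roughly as follows. This is a simplified sketch under stated assumptions: sample_candidates, score, and execute are hypothetical callables standing in for the grasp generator, a ranking function, and the robot execution with a success check, and replanning here happens only between attempts rather than continuously during contact.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

G = TypeVar("G")  # whatever type represents a grasp in the host system

def closed_loop_grasp(
    sample_candidates: Callable[[int], List[G]],
    score: Callable[[G], float],
    execute: Callable[[G], bool],
    num_candidates: int = 8,
    max_attempts: int = 3,
) -> Tuple[Optional[G], int]:
    """Try up to max_attempts grasps, re-sampling candidates before each attempt.

    Returns (successful_grasp, attempts_used), or (None, max_attempts) if
    every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        # Re-sample on every attempt so a shifted object is observed afresh.
        candidates = sample_candidates(num_candidates)
        if not candidates:
            continue
        best = max(candidates, key=score)  # rank and keep the top candidate
        if execute(best):                  # execution feedback closes the loop
            return best, attempt
    return None, max_attempts

# Demo with stub components: grasps are plain floats, success means score > 0.5.
batches = iter([[0.1, 0.3], [0.2, 0.9]])
result = closed_loop_grasp(
    sample_candidates=lambda n: next(batches),
    score=lambda g: g,
    execute=lambda g: g > 0.5,
)
```

In the demo the first attempt fails because its best candidate scores 0.3, and the second attempt succeeds with 0.9.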