NVIDIA Research debuts GraspGen-X, LCDrive, and NitroGen foundation models at CVPR 2026
At CVPR 2026, NVIDIA Research presented three new physical-AI foundation models built around a common thesis: train at sufficient scale, and systems generalize across embodiments, scenarios and worlds. GraspGen-X, billed as the first foundation model for zero-shot robotic grasping, was trained on roughly 2 billion simulated grasps. LCDrive replaces text-based chain-of-thought reasoning for autonomous vehicles with compact latent representations. NitroGen, built on the NVIDIA Isaac GR00T architecture, was trained across more than 1,000 games and 40,000 hours of interactive gameplay.
What's new
- GraspGen-X — described as the first foundation model for grasping, built to generalize across gripper embodiments. NVIDIA generated 2 billion simulated grasps across thousands of object shapes and synthetic gripper configurations. Given an unfamiliar gripper and an unknown object, the model produces reliable grasp pose proposals out of the box. It pairs with curoboV2, a new CUDA-accelerated motion planning library, for closed-environment execution.
- LCDrive — replaces chain-of-thought reasoning in autonomous driving with a compressed latent loop. The system alternates between proposing candidate actions and predicting the resulting world state, refining its plan in latent space. NVIDIA reports comparable trajectory quality to text-based reasoning at roughly half the tokens. The model is built on NVIDIA Alpamayo and trained from existing vehicle data.
- NitroGen — a generalized gameplay AI foundation model. Trained on more than 1,000 games and 40,000 hours of interaction using a GR00T-based architecture, it was evaluated across action role-playing games, platformers, roguelikes and open-world games. In low-data conditions, NVIDIA reports up to a 52% improvement over previous state-of-the-art methods. The model is open source on GitHub and Hugging Face.
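The propose-and-predict loop described for LCDrive can be illustrated with a toy sketch. This is not NVIDIA's implementation: the function names (`propose_actions`, `predict_next_latent`, `latent_plan`), the additive dynamics, the random proposal step and the squared-distance cost are all assumptions chosen to keep the example self-contained. It shows only the general pattern of alternating between candidate actions and predicted states without emitting any text.

```python
import random


def propose_actions(latent, rng, n_candidates=8, step=0.5):
    """Hypothetical proposal head: sample small candidate action vectors."""
    return [[rng.uniform(-step, step) for _ in latent] for _ in range(n_candidates)]


def predict_next_latent(latent, action):
    """Hypothetical latent world model: simple additive dynamics (an assumption)."""
    return [z + a for z, a in zip(latent, action)]


def cost(latent, goal):
    """Squared distance between a latent state and a goal latent."""
    return sum((z - g) ** 2 for z, g in zip(latent, goal))


def latent_plan(start, goal, n_rounds=20, seed=0):
    """Alternate between proposing actions and predicting the resulting state.

    Each round keeps the candidate whose predicted next state is closest to
    the goal, and only if it improves on the current state.
    """
    rng = random.Random(seed)
    latent = list(start)
    plan = []
    for _ in range(n_rounds):
        candidates = propose_actions(latent, rng)
        best = min(candidates, key=lambda a: cost(predict_next_latent(latent, a), goal))
        predicted = predict_next_latent(latent, best)
        if cost(predicted, goal) < cost(latent, goal):
            plan.append(best)
            latent = predicted
    return plan, latent


if __name__ == "__main__":
    plan, final = latent_plan(start=[0.0, 0.0], goal=[2.0, -1.0])
    print(len(plan), "actions kept; final cost:", cost(final, [2.0, -1.0]))
```

In a real system the proposal head and world model would be learned networks operating on compressed representations. The toy version only makes the control flow concrete: the plan is refined entirely in the latent vectors, with no intermediate text tokens.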
Context
Each paper targets a long-standing physical-AI bottleneck. GraspGen-X attacks the embodiment-specific nature of most grasping policies, where switching grippers has historically required fresh training data and fine-tuning. LCDrive attacks the cost of running text-based reasoning on the embedded compute already shipping inside cars. NitroGen attacks the lack of varied training environments for embodied agents by treating video games as a structured corpus. NVIDIA notes that related work, Grasp-MPC, presented earlier this year at ICRA 2026, extends the GraspGen line into closed-loop grasp execution.
Why it matters
For industrial robotics, GraspGen-X is the most consequential of the three. Eliminating per-gripper training cycles would remove one of the longest-standing friction points in real-world robot deployment and shorten the time between selecting a new manipulator and putting it into production. LCDrive's roughly 50% token reduction matters more quietly: text-based chain-of-thought is the dominant reasoning paradigm for AV decision-making, and halving the token count on embedded hardware translates into either faster reaction times or cheaper inference silicon at the same latency. NitroGen is a longer-term bet: the proposition that game environments are useful pretraining for real-world embodied agents has been tested before, but the scale of the corpus and the GR00T backbone give it real reach. Taken together, the three papers show NVIDIA Research staking out the position that the foundation-model paradigm now extends fully into manipulation, driving and embodied agency.
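The latency-versus-silicon trade-off can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes reasoning latency is dominated by autoregressive decoding and scales linearly with token count. The token counts and throughput figure are illustrative placeholders, not numbers reported by NVIDIA.

```python
def decode_latency_ms(n_tokens, tokens_per_second):
    """Time to decode n_tokens at a fixed throughput, in milliseconds."""
    return 1000.0 * n_tokens / tokens_per_second


def required_throughput(n_tokens, latency_budget_ms):
    """Tokens per second needed to decode n_tokens within a latency budget."""
    return n_tokens * 1000.0 / latency_budget_ms


# Illustrative numbers only (assumptions, not measurements).
baseline_tokens = 200   # hypothetical text chain-of-thought length
reduced_tokens = 100    # roughly half the tokens
throughput = 1000.0     # hypothetical tokens per second on the embedded chip

# Option 1: same hardware, faster reaction.
print(decode_latency_ms(baseline_tokens, throughput))  # 200.0
print(decode_latency_ms(reduced_tokens, throughput))   # 100.0

# Option 2: same 200 ms budget, half the throughput suffices.
print(required_throughput(reduced_tokens, 200.0))      # 500.0
```

Under these assumptions, halving the tokens either halves decode latency on fixed hardware or halves the throughput needed to hold the same latency, which is the choice the paragraph above describes.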
Corroborating sources
- Blogs.nvidia
https://blogs.nvidia.com/blog/cvpr-research-grasping-driving-agent-training/
“They generated 2 billion simulated grasps across thousands of object shapes and synthetic gripper configurations, spanning the diversity of form factors a deployed robot might encounter.”