Research & Benchmarks
Papers, findings, evals, and leaderboard results — the science behind the models.
Papers, findings, evals, and leaderboard results — the science behind the models.
Google on June 10, 2026 released DiffusionGemma, an experimental open-weights language model that generates text through a fundamentally different mechanism than standard autoregressive models. While…
Researchers at Boston Children's Hospital have used OpenAI's o3 reasoning model to identify new diagnoses for 18 children whose rare genetic diseases had gone unresolved by standard clinical workup.…
OpenAI's alignment research team published findings today showing that reinforcement learning applied to a targeted set of "beneficial trait" training scenarios produces safety improvements that…
OpenAI and Molecule.one published research on June 17, 2026 showing a near-autonomous AI system successfully improved a challenging reaction in medicinal chemistry — the first time an AI has tackled…
ChatGPT has lost its majority position in the AI assistant market for the first time since its 2022 launch, dropping to 46.4% market share as of May 2026, according to Sensor Tower's State of AI…
Google's AMIE (Articulate Medical Intelligence Explorer) has matched the performance of primary care physicians in managing chronic health conditions and significantly outperformed them in plan…
Anthropic published research on June 16 showing that the single strongest predictor of success with Claude Code is domain expertise in the task at hand — not whether a user has a background in…
Alibaba's Qwen team released a technical report on June 15, 2026 introducing Qwen-RobotWorld, a foundation model for embodied AI that predicts future visual trajectories across robotics, autonomous…
NVIDIA's Blackwell GPU platform claimed the top spot across all seven benchmarks in MLPerf Training 6.0, the latest edition of the peer-reviewed AI training benchmark suite. The results, published…
Anthropic published research on June 8, 2026, examining why AI agents struggle to reliably retrieve biological data — and how a single deterministic tool layer nearly eliminates the performance gap…
NVIDIA's Blackwell architecture has taken the top spot in AgentPerf, the first benchmark designed specifically to measure agentic AI infrastructure performance, according to results published June 12…
Anthropic on June 12 released the inaugural results from its Anthropic Public Record, a new survey series the company says it will repeat to track how U.S. public attitudes toward AI shift over time.…
Alibaba's Qwen research team has published Qwen-Image-Flash: Beyond Objective Design on arXiv, presenting a few-step distillation framework that accelerates the Qwen-Image-2.0 model family for both…
DeepSeek researchers published a paper on June 9, 2026 introducing Lookahead Sparse Attention (LSA), an inference-time technique that compresses the GPU memory footprint of long-context LLM serving…
Vercel published its AI gateway production index for May 2026, drawing on routing traffic across its infrastructure to document a widening divergence in the AI provider market: DeepSeek's share of…
Anthropic disclosed on June 3, 2026 that it detected and disrupted what it describes as the first large-scale cyberattack executed largely without human intervention — a campaign in which a Chinese…
Google DeepMind published details on May 20, 2026 of an accessibility agent that guides blind and low-vision athletes while running outdoors in real time, using a Pixel 10 Pro smartphone worn on the…
Anthropic on June 5 published benchmark data showing that Claude Opus 4.7 — a general-purpose frontier model with no chemistry-specific fine-tuning — performs comparably to specialized NMR software…
Google DeepMind's AI research assistant Co-Scientist helped biologists at MIT identify more than 20 genetic factors that may reverse cellular aging — and lab experiments confirmed that several of…
A team of researchers including prominent Google DeepMind scientists published a paper at the 43rd International Conference on Machine Learning (ICML 2026) arguing that the dominant paradigm in AI…
Anthropic has released a public best-practices guide for using Claude to find and fix vulnerabilities in source code, paired with an open-source reference harness on GitHub. The guide is grounded in…
Perplexity on June 1 published research introducing Search as Code (SaC), an architecture in which AI agents compose search pipelines programmatically from atomic building blocks rather than calling…
Anthropic on June 3 published a year-long study of how malicious actors have used Claude, examining 832 accounts the company banned for cyber abuse between March 2025 and March 2026. The headline…
On June 2, 2026, Anthropic disclosed that it is significantly expanding Project Glasswing, its program to harden critical-infrastructure software with frontier-model-assisted security analysis.…
Hugging Face users running the Transformers library below version 5.3.0 are exposed to a critical remote-code-execution vulnerability that turns a routine model load into attacker-controlled Python…
On June 2, 2026 a team at the University of Toronto and the Vector Institute released a paper on arXiv showing that an AI agent can power a computer worm capable of writing fresh exploits as it…
Anthropic's engineering team published a retrospective on 2026-05-25 describing how it contains Claude agents across three products — claude.ai, Claude Code, and Claude Cowork — with a focus on…
At CVPR 2026, NVIDIA Research presented three new physical-AI foundation models built around a common thesis: train at sufficient scale, and systems generalize across embodiments, scenarios and…
In the GPT-5.5 system card (April 23, 2026; updated April 24 for API safeguards), OpenAI classifies the model's biological/chemical and cybersecurity capabilities as "High" under its Preparedness…