
xAI adds image search to Grok's web search API, returning images as Markdown embeds in responses

May 27, 2026; published Jun 5, 2026

xAI updated its web search capability on May 27, 2026 to support explicit image search. The addition allows Grok-powered applications to retrieve and embed relevant images as part of model responses, extending the API's multimodal output beyond text-based results.

What's new

The update introduces an enable_image_search parameter for the web search API. When it is enabled, Grok can search directly for relevant images, and responses can include the results as Markdown image embeds. Downstream interfaces can render those embeds as-is or pass them on as input for further model processing.

From the xAI release notes: "Web Search now supports explicitly searching for images. Enable enable_image_search to let Grok search directly for relevant images; responses can include returned images as Markdown image embeds."

The feature is documented in the Enable Image Search section of the xAI developer documentation. No separate API call or additional authentication is required — the parameter extends the existing web search path.
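
As a rough illustration of what turning the flag on might look like, the sketch below only builds a JSON request body and makes no network call. The flag name enable_image_search comes from the release notes; everything else (the "web_search" tool entry, the "input" and "model" fields, and the placeholder model name) is an assumption made for illustration, so check the Enable Image Search section of the xAI documentation for the actual request shape.

```python
import json


def build_search_request(prompt: str, enable_image_search: bool = True) -> dict:
    """Build a request body with image search switched on.

    ASSUMPTION: only the flag name `enable_image_search` is taken from the
    release notes. The surrounding structure is hypothetical.
    """
    return {
        "model": "grok-model-placeholder",  # hypothetical model name
        "input": prompt,                    # hypothetical field name
        "tools": [
            {
                "type": "web_search",       # hypothetical tool entry
                "enable_image_search": enable_image_search,
            }
        ],
    }


if __name__ == "__main__":
    body = build_search_request("What does the new product look like?")
    print(json.dumps(body, indent=2))
```

Keeping the flag behind a single builder function means the request shape can be corrected in one place once it has been checked against the official documentation.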

How it integrates in practice

Because the format is Markdown image embeds, the results slot naturally into several common integration patterns:

Rendered chat interfaces that display inline images
Multi-model pipelines that forward images to a vision model for additional analysis
Research or documentation tools that need visual references alongside text answers
Product or e-commerce interfaces that surface item imagery with text descriptions
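
For the pipeline-style patterns above, an application first has to pull the image URLs out of the response text. A minimal sketch, assuming the response is a plain Markdown string that uses the standard inline image syntax ![alt](url); the names ImageEmbed and extract_image_embeds are hypothetical helpers, not part of any xAI SDK:

```python
import re
from typing import NamedTuple


class ImageEmbed(NamedTuple):
    alt: str
    url: str


# Matches inline Markdown images: ![alt](url) or ![alt](url "title").
# Limitation: URLs containing spaces or parentheses are not handled.
_IMAGE_RE = re.compile(r'!\[([^\]]*)\]\(\s*<?([^\s)>]+)>?(?:\s+"[^"]*")?\s*\)')


def extract_image_embeds(markdown: str) -> list[ImageEmbed]:
    """Return every inline image embed found in `markdown`, in order."""
    return [ImageEmbed(m.group(1), m.group(2)) for m in _IMAGE_RE.finditer(markdown)]


if __name__ == "__main__":
    response_text = (
        "The new model ships in red: "
        "![A red shoe](https://example.com/shoe.png) "
        "and in blue."
    )
    for embed in extract_image_embeds(response_text):
        print(embed.alt, "->", embed.url)
```

The extracted URLs can then be forwarded to a vision model or shown in a gallery, while chat interfaces that already render Markdown can display the response text unchanged.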

Context

The image search addition is part of a concentrated stretch of API capability releases from xAI across May 2026. The same week saw Smart Turn end-of-turn detection for streaming speech-to-text. Earlier in May: Grok Build 0.1 for agentic coding, Context Compaction for long-context efficiency, WebSocket Responses mode for lower latency on tool-heavy agent workloads, and Custom Voices for text-to-speech voice cloning.

Web search is one of Grok's most-used developer-facing features, enabling real-time information grounding for model responses. For applications where answers depend on current information rather than training-data knowledge, web search is frequently the first integration developers reach for. The image extension adds a visual dimension to that retrieval pipeline.

A model that can answer "what does the new product look like?" with both a text description and an image gives a more complete response than one limited to text alone. Previously, developers building multimodal search experiences with Grok had to run image retrieval in a separate pipeline and merge the results before returning them to users.

Why it matters

Native image search at the API level reduces integration overhead for applications that need visual content alongside text. Because images arrive through the same API call as the text answer, interfaces such as product comparisons, visual news aggregators, and research tools no longer need a separate retrieval step.

The Markdown embed format is a practical choice. It passes through model outputs cleanly, renders in most chat interfaces without additional code, and works naturally in multi-model pipelines where the image needs to be forwarded to a vision model for further processing.

xAI's May releases (context compaction, WebSocket connections, Smart Turn detection, and now image search) target specific bottlenecks in production agentic and multimodal deployments rather than adding surface-level features. Each capability closes a gap that developers building Grok-powered applications are likely to encounter.

Corroborating sources

docs.x.ai
https://docs.x.ai/docs/release-notes
“Web Search now supports explicitly searching for images. Enable `enable_image_search` to let Grok search directly for relevant images; responses can include returned images as Markdown image embeds.”
