Amazon SageMaker Async Inference now supports inline request payloads, eliminating mandatory S3 uploads
Amazon has updated SageMaker AI Async Inference to accept raw inference payloads directly in the request body, removing the long-standing requirement to stage input data in Amazon S3 before submitting each async request. The change, published June 17, supports payloads up to 128,000 bytes.
What's new
Developers can now pass inference data directly via a new Body parameter on the InvokeEndpointAsync API. Previously, every async inference request required a two-step process: upload the payload to an S3 bucket, then reference that S3 object in the request. The new approach collapses that into a single API call.
Key technical details:
- Maximum inline payload: 128,000 bytes (raw)
- New parameter: Body accepts raw bytes in the API call
- Mutually exclusive: Body and InputLocation cannot both be set in the same request
- Backward compatible: Existing S3-based InputLocation workflows continue unchanged
- No endpoint reconfiguration required
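The constraints listed above can be checked on the client side before a request is sent. The following is a minimal sketch, not AWS's implementation: the helper name build_async_request and the constant MAX_INLINE_BYTES are hypothetical, and the Body key is taken from the announcement rather than from a verified SDK release. EndpointName, ContentType, and InputLocation are existing InvokeEndpointAsync parameters.

```python
from typing import Optional

# Inline limit as stated in the announcement (assumed to apply to the raw body).
MAX_INLINE_BYTES = 128_000


def build_async_request(endpoint_name: str,
                        body: Optional[bytes] = None,
                        input_location: Optional[str] = None,
                        content_type: str = "application/json") -> dict:
    """Build keyword arguments for an InvokeEndpointAsync call.

    Hypothetical helper: enforces that exactly one of body or
    input_location is set and that an inline body fits the limit.
    """
    if (body is None) == (input_location is None):
        raise ValueError("set exactly one of body or input_location")

    params = {"EndpointName": endpoint_name, "ContentType": content_type}
    if body is not None:
        if len(body) > MAX_INLINE_BYTES:
            raise ValueError(
                f"inline body is {len(body)} bytes; limit is {MAX_INLINE_BYTES}"
            )
        params["Body"] = body  # new inline parameter, per the announcement
    else:
        params["InputLocation"] = input_location  # existing S3-based path
    return params
```

Assuming an SDK version that exposes the new parameter, the resulting dictionary could be passed to a boto3 sagemaker-runtime client as client.invoke_endpoint_async(**params). Payloads over the limit would still need the S3-based InputLocation path.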
Eliminating the S3 write removes one network round-trip and one S3 PUT charge per invocation. AWS also notes simpler IAM permission structures — teams no longer need to provision an S3 input bucket for new inline workflows. Failed inline requests return immediate synchronous validation errors rather than requiring asynchronous polling.
Context
SageMaker Async Inference launched in 2021 for workloads that can tolerate processing delays of several seconds: large payloads, long-running model inference, or post-processing pipelines. The original S3 staging requirement made sense when payloads were large (documents, images), but it added unnecessary overhead for smaller requests, where the cost of the extra upload could outweigh the benefit of async queuing.
AWS has been systematically reducing friction in its inference products. Earlier this week, the team added container caching to SageMaker AI (cutting model cold-start latency by 51%) and introduced P-EAGLE parallelized speculative decoding for throughput gains. The inline payload feature continues that pattern.
Why it matters
For teams running high-frequency async workloads with payloads under 128 KB (structured data, short text, embeddings), the change meaningfully simplifies the architecture: simpler IAM permissions, no S3 input bucket to provision, one fewer failure path per request, and lower per-invocation cost.
At scale, meaning tens of thousands of inference calls per hour, eliminating the S3 write round-trip reduces both latency and per-request S3 charges. The change is incremental rather than transformative, but it addresses a specific friction point that SageMaker users have encountered since the service launched.
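To put the per-request saving in perspective, the avoided S3 PUT charges can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: the function name is hypothetical, the call volume is illustrative, and the PUT price of $0.005 per 1,000 requests is an assumed figure that should be checked against current S3 pricing for the relevant region and storage class.

```python
def monthly_put_savings(calls_per_hour: int,
                        put_price_per_1000: float,
                        hours_per_month: int = 730) -> float:
    """Estimate monthly S3 PUT charges avoided by sending payloads inline.

    Assumes one S3 PUT per async invocation under the old workflow.
    """
    requests_per_month = calls_per_hour * hours_per_month
    return requests_per_month / 1000 * put_price_per_1000


# Illustrative inputs only; neither figure comes from the announcement.
estimate = monthly_put_savings(calls_per_hour=50_000, put_price_per_1000=0.005)
print(f"Estimated PUT charges avoided: ${estimate:,.2f} per month")
```

Under these assumed inputs the saving is modest in absolute terms, which fits the description of the change as incremental; the removed round-trip and simpler setup are likely the larger practical benefits.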
Corroborating sources
- AWS Machine Learning Blog (aws.amazon.com)
https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/
“Customers can now send inference payloads directly in the request body of the InvokeEndpointAsync API, removing the need to upload input data to Amazon S3 before each invocation.”