DeepSpeed in Production: Inference Optimization and Model: Deploy LLMs efficiently with optimized serving, quantization, and low-latency inference for real-time applications

Author:   Tara Malhotra
Publisher:   Independently Published
ISBN:   9798274507356


Pages:   290
Publication Date:   14 November 2025
Format:   Paperback
Availability:   Available To Order
We have confirmation that this item is in stock with the supplier. It will be ordered in for you and dispatched immediately.

Our Price:   $92.27



Overview

Run large language models with predictable latency, controlled cost, and production reliability.

Shipping LLMs is an operational problem. Teams struggle with time to first token, tokens per second, GPU memory pressure, and a moving target of engines and datatypes. This book turns those issues into clear practices you can apply with DeepSpeed and the serving layers you already use. You get a practical path from checkpoint to stable API, with configuration that fits real workloads, not toy demos. Every topic is grounded in measurable outcomes, so your stack meets SLOs under mixed traffic and budget constraints.

You will learn to:

- Place DeepSpeed correctly in your stack and configure kernel injection, tensor parallelism, and ZeRO for real services
- Understand TTFT and throughput from prefill to decode, and set metrics for p95 latency and queue time
- Size and control the KV cache with paged attention, batching, and safe headroom targets
- Apply quantization that holds up under load, including W8A8, AWQ, GPTQ, FP8, and FP4
- Use speculative decoding with a sound drafter choice, acceptance math, and stable fallbacks
- Operate vLLM, TensorRT-LLM on Triton, and TGI with clean API surfaces and core flags
- Scale with Ray Serve and plan capacity from workload shapes and arrival patterns
- Tune for NVIDIA Hopper and Blackwell or AMD MI300X, with attention backends and NVLink planning
- Run on Kubernetes with the GPU Operator, device plugin, MIG, and topology-aware placement
- Wire observability with Prometheus, DCGM, and OpenTelemetry spans, plus vllm bench, trtllm-bench, and genai-perf
- Ship safely with quotas, redaction, audit logs, go-live gates, and instant rollback plans

This is a code-heavy guide with working YAML, JSON, Shell, and Python examples that map directly to production, from gateway limits and network policies to rollout templates and exportable benchmark scripts. Grab your copy today and build an LLM service that stays fast, measurable, and dependable.
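The KV-cache sizing mentioned above comes down to simple arithmetic. As a minimal sketch (not taken from the book; the model shape below is an assumed 7B-class configuration with 32 layers, 32 KV heads, and head dimension 128):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Memory for the KV cache: K and V tensors (the factor of 2)
    per layer, per KV head, per token, at the given element width
    (fp16/bf16 = 2 bytes)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
print(per_token / 2**20)  # 0.5 MiB per token
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)
print(full / 2**30)       # 16.0 GiB for 8 concurrent 4k-token sequences
```

Engines with paged attention allocate this memory in fixed-size blocks rather than per sequence, which is why the book's headroom targets matter: the budget above is an upper bound, not a steady-state figure.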
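The "acceptance math" for speculative decoding can likewise be illustrated with a commonly used simplified model (an assumption for illustration, not the book's derivation): if each of k drafted tokens is accepted independently with probability alpha, a verification pass emits on average (1 - alpha^(k+1)) / (1 - alpha) tokens, since the target model always contributes one token itself.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass when
    each of k drafted tokens is accepted i.i.d. with probability alpha.
    Geometric-series closed form; the +1 token is the target model's own."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# A strong drafter (alpha = 0.8) with a 4-token draft:
print(expected_tokens_per_pass(0.8, 4))  # ~3.36 tokens per pass
```

The diminishing returns are visible directly: pushing k higher helps little once alpha^(k+1) is small, which is the intuition behind choosing a short draft length and a stable fallback path.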

Full Product Details

Author:   Tara Malhotra
Publisher:   Independently Published
Imprint:   Independently Published
Dimensions:   Width: 17.80cm , Height: 1.50cm , Length: 25.40cm
Weight:   0.508kg
ISBN:   9798274507356


Pages:   290
Publication Date:   14 November 2025
Audience:   General/trade ,  General
Format:   Paperback
Publisher's Status:   Active
Availability:   Available To Order

Table of Contents

Reviews

Author Information


Countries Available

All regions
