Overview

Mastering InfiniBand is a definitive, practitioner-focused guide to designing, building, and operating the fabrics that power modern HPC clusters, AI training platforms, and data-centric infrastructure. It distills the InfiniBand architecture from first principles, covering end-to-end channel semantics, addressing (GUIDs, LIDs, GIDs), packet formats, virtual lanes, and credit-based flow control, and continues through the management planes (SMA, SM, SA, PMA, BMA) and IP transport via IPoIB. The book then grounds readers in physical and link-layer engineering, covering signaling from SDR to HDR/NDR and emerging XDR, lane bonding and breakouts, FEC/CRC and error propagation, port state machines, arbitration and deadlock avoidance, optics and cabling for reach and BER, and structured wiring with proactive telemetry to keep large-scale fabrics healthy.

For software and system engineers, the text provides a deep dive into transport semantics and the RDMA programming model: RC, UC, UD, XRC, and DC; queue pairs and scalable completion paths; work requests, S/G lists, and polling strategies; memory registration, MR caching, and ODP; atomics, fencing, and ordering. Advanced coverage of mlx5 direct verbs and DevX enables direct hardware programming, while guidance on doorbells, BlueFlame, inline thresholds, batching, tag-matching offload, and multi-rail striping shows how to extract real-world performance. Integration chapters bridge the fabric to MPI (UCX, libfabric/OFI, HPC-X), in-network compute with SHARP, GPU networking with GPUDirect RDMA/Async and NCCL topology-aware collectives, storage over RDMA (SRP, iSER, NVMe/RDMA, SMB Direct) and parallel file systems, plus virtualization (SR-IOV, VFIO, nested) and Kubernetes device plugins, CNI, and pod-level QoS, ensuring clean workflows across HPC, AI, and service-oriented stacks.
Architects and operators will find rigorous treatment of fabric topologies (fat-tree, dragonfly(+), torus, hypercube), routing strategies and adaptive policies, QoS design, congestion control and tuning, multicast scaling, and capacity planning. A comprehensive performance engineering toolkit spans host architecture (PCIe/NVLink, NUMA), IOMMU/ATS, huge pages, message sizing, connection scaling, interrupt moderation, jitter and tail-latency control, along with fair microbenchmarking and end-to-end roofline-style modeling. Day-2 operations are covered end to end: PMA-driven telemetry pipelines, SLO dashboards, BER/FEC health signals, failure domains and fast reroute, troubleshooting loops and misroutes, incast containment, packet capture and tracing, and incident response playbooks. The roadmap closes with HDR/NDR deployment trade-offs, InfiniBand routers and multi-subnet scale-out, Ethernet interoperability and RoCE contrasts, DPUs and control-plane offload, time sync, energy efficiency, zero-trust security, migration strategies, and the future of in-network compute and XDR, equipping readers to build resilient, efficient fabrics that scale with confidence.

Full Product Details

Author: Nova Trex
Publisher: Independently Published
Imprint: Independently Published
Dimensions: Width: 15.20 cm, Height: 2.90 cm, Length: 22.90 cm
Weight: 0.748 kg
ISBN: 9798262218943
Pages: 568
Publication Date: 25 August 2025
Audience: General/trade, General
Format: Paperback
Publisher's Status: Active
Availability: Available To Order

We have confirmation that this item is in stock with the supplier. It will be ordered in for you and dispatched immediately.

Countries Available: All regions