This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multimillion-dollar challenge in AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large, AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI’s impact.
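To put the cited figures in operational terms, the short sketch below converts mean time to failure into expected failures per day. It assumes failures arrive at a steady average rate, which is a simplification made for this sketch and not a claim from the cited research; the function name is illustrative.

```python
# Mean-time-to-failure figures quoted above, in hours, keyed by cluster size.
MTTF_HOURS = {1_024: 7.9, 16_384: 1.8}


def failures_per_day(mttf_hours: float) -> float:
    """Expected failures in 24 hours, assuming a steady average failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return 24.0 / mttf_hours


if __name__ == "__main__":
    for gpus, mttf in MTTF_HOURS.items():
        rate = failures_per_day(mttf)
        print(f"{gpus:>6} GPUs: about {rate:.1f} failures per day")
```

Under that assumption, the quoted figures work out to roughly 3 failures per day at 1,024 GPUs and about 13 per day at 16,384 GPUs.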

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem by handling costly AI workload failures proactively, resolving them before the job stops or needs to restart. For enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now service impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margins and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three migration scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
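The three scenarios above can be pictured as a simple classification step. The sketch below is purely illustrative: the event labels, the `MigrationKind` enum, and the `classify` function are hypothetical names invented for this example and do not represent TorchPass's API or its actual detection logic.

```python
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"      # sudden hard failures
    PREEMPTIVE = "pre-emptive"   # early warning signals
    PLANNED = "planned"          # operator-initiated work


# Hypothetical event labels, taken from the examples in the list above.
_EVENT_TO_KIND = {
    "kernel_crash": MigrationKind.UNPLANNED,
    "power_failure": MigrationKind.UNPLANNED,
    "gpu_fault": MigrationKind.UNPLANNED,
    "rising_temperature": MigrationKind.PREEMPTIVE,
    "ecc_memory_error": MigrationKind.PREEMPTIVE,
    "maintenance": MigrationKind.PLANNED,
    "patching": MigrationKind.PLANNED,
    "rebalancing": MigrationKind.PLANNED,
}


def classify(event: str) -> MigrationKind:
    """Map an event label to the migration scenario it would fall under."""
    try:
        return _EVENT_TO_KIND[event]
    except KeyError:
        raise ValueError(f"unknown event: {event!r}") from None


if __name__ == "__main__":
    for name in ("gpu_fault", "ecc_memory_error", "patching"):
        print(f"{name} -> {classify(name).value}")
```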

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
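As a rough consistency check, the figures quoted in this release can be combined in a few lines. The sketch assumes that each failure costs about three minutes of progress under Live GPU Migration (the typical recovery time quoted above) and that failures arrive at the rate implied by the 7.9-hour mean time to failure cited for 1,024 GPUs. Both are simplifying assumptions made here for illustration, not the company's published methodology.

```python
MTTF_HOURS_1024 = 7.9              # quoted mean time to failure, 1,024 GPUs
BASELINE_LOST_MIN_PER_DAY = 180.0  # "approximately three hours per day"
RECOVERY_MIN_PER_FAILURE = 3.0     # "approximately three minutes" (assumed lost)


def lost_minutes_per_day(mttf_hours: float, minutes_per_failure: float) -> float:
    """Minutes lost per day if every failure costs a fixed number of minutes."""
    return (24.0 / mttf_hours) * minutes_per_failure


def reduction(baseline_minutes: float, new_minutes: float) -> float:
    """Fractional reduction in lost time relative to the baseline."""
    return 1.0 - new_minutes / baseline_minutes


if __name__ == "__main__":
    lost = lost_minutes_per_day(MTTF_HOURS_1024, RECOVERY_MIN_PER_FAILURE)
    saved = reduction(BASELINE_LOST_MIN_PER_DAY, lost)
    print(f"Lost time with migration: about {lost:.1f} minutes per day")
    print(f"Reduction versus baseline: about {saved:.0%}")
```

Under these assumptions the result is about 9.1 minutes of lost time per day and a reduction of roughly 95%, in line with the figures stated above.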

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

“When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

“Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out-of-memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.
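The headline figure can be broken down with simple arithmetic. The sketch below derives the savings per GPU from the quoted numbers and expresses a block of recovered GPU-hours as a share of the cluster's annual capacity. The 300,000 GPU-hour input is a hypothetical reading of "hundreds of thousands" chosen only for illustration; the release does not state the exact figure or the GPU-hour price behind its estimate.

```python
ANNUAL_SAVINGS_USD = 6_000_000  # "over $6 million" per year, as quoted
CLUSTER_GPUS = 2_048            # deployment size, as quoted
HOURS_PER_YEAR = 24 * 365


def savings_per_gpu(total_usd: float, gpus: int) -> float:
    """Annual savings attributed to each GPU in the cluster."""
    return total_usd / gpus


def annual_capacity_gpu_hours(gpus: int) -> int:
    """Total GPU-hours the cluster offers in a 365-day year."""
    return gpus * HOURS_PER_YEAR


def share_of_capacity(gpu_hours: float, gpus: int) -> float:
    """Fraction of annual capacity represented by a number of GPU-hours."""
    return gpu_hours / annual_capacity_gpu_hours(gpus)


if __name__ == "__main__":
    per_gpu = savings_per_gpu(ANNUAL_SAVINGS_USD, CLUSTER_GPUS)
    capacity = annual_capacity_gpu_hours(CLUSTER_GPUS)
    hypothetical_hours = 300_000  # illustrative value, not from the release
    share = share_of_capacity(hypothetical_hours, CLUSTER_GPUS)
    print(f"Savings per GPU per year: ${per_gpu:,.0f}")
    print(f"Annual capacity: {capacity:,} GPU-hours")
    print(f"{hypothetical_hours:,} GPU-hours is {share:.1%} of capacity")
```

On the quoted numbers, the savings come to just under $3,000 per GPU per year against roughly 17.9 million GPU-hours of annual capacity.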

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16 to 19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork


