We're still figuring out how to detect "silent" failures where the job doesn't crash but stops making progress — like NCCL hangs where ranks are waiting indefinitely, or gradient norm explosions that don't trigger OOM but tank the loss. Right now we rely on explicit errors in logs, but we're curious how others approach detecting "the job is technically running but something is very wrong" (if at all)?
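To make the question concrete, here's roughly the kind of check I mean (purely a sketch, not something we run today; the heartbeat file path, the thresholds, and the idea of a separate watchdog process are all placeholders):

```python
# Sketch: the training loop writes a heartbeat after each optimizer step;
# a separate watchdog reads it and flags a stale heartbeat (e.g. an NCCL
# hang) or an exploding gradient norm. Everything here is illustrative.
import json
import time
from pathlib import Path

HEARTBEAT = Path("/tmp/train_heartbeat.json")  # assumed location

def write_heartbeat(step: int, loss: float, grad_norm: float) -> None:
    """Called from the training loop after every optimizer step."""
    HEARTBEAT.write_text(json.dumps({
        "step": step,
        "loss": loss,
        "grad_norm": grad_norm,
        "ts": time.time(),
    }))

def check_heartbeat(stale_after_s: float = 600.0,
                    max_grad_norm: float = 1e3) -> list[str]:
    """Run from a separate watchdog process/cron; returns alert messages."""
    alerts = []
    if not HEARTBEAT.exists():
        return ["no heartbeat file yet"]
    beat = json.loads(HEARTBEAT.read_text())
    age_s = time.time() - beat["ts"]
    if age_s > stale_after_s:
        alerts.append(f"no progress since step {beat['step']} "
                      f"({age_s:.0f}s ago), possible hang")
    if beat["grad_norm"] > max_grad_norm:
        alerts.append(f"grad norm {beat['grad_norm']:.1f} at step {beat['step']}")
    return alerts
```

The stale-heartbeat case is what would catch an NCCL hang; the grad-norm check is the crude version of "loss is tanking".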
Measurement and alerting are usually done on business metrics, not on the causes. That way you catch whole classes of problems.
Not sure about expected loss, though; isn't that just a decay rate?
But stuck jobs are caught via the number of tasks being processed and their average latency.
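A rough sketch of what that kind of check could look like, in case it helps (the window size and thresholds are made up, and a real version would push to your metrics/alerting stack instead of returning strings):

```python
# Sketch: alert on the rate of finished tasks and their average latency
# over a rolling window, rather than on specific error causes.
from collections import deque
import time

class ThroughputMonitor:
    def __init__(self, window_s: float = 300.0,
                 min_tasks: int = 10, max_avg_latency_s: float = 30.0):
        self.window_s = window_s
        self.min_tasks = min_tasks
        self.max_avg_latency_s = max_avg_latency_s
        self.events = deque()  # (finish_ts, latency_s) pairs

    def record(self, latency_s: float) -> None:
        """Call whenever a task finishes."""
        self.events.append((time.time(), latency_s))

    def check(self) -> list[str]:
        """Call periodically; returns alerts for the last window."""
        cutoff = time.time() - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        alerts = []
        n = len(self.events)
        if n < self.min_tasks:
            alerts.append(f"only {n} tasks finished in the last {self.window_s:.0f}s")
        if n > 0:
            avg = sum(lat for _, lat in self.events) / n
            if avg > self.max_avg_latency_s:
                alerts.append(f"average latency {avg:.1f}s over the last {n} tasks")
        return alerts
```

The point is that the alert fires on "work isn't getting done", whatever the underlying cause turns out to be.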
Would love to hear how you're handling recovery for long-running training jobs today, as well as what failure modes are most common/annoying for you.