Utkarsh M — Principal Software Engineer

Problem

Industrial 3D printing facilities were losing prints to failures that took hours to become visible. A print failure on an industrial machine costs $50K–200K in material, machine time, and missed delivery. The only monitoring was a technician walking the floor twice a shift.

Context

The machines were from five different manufacturers with five different telemetry formats. Some exposed APIs. Some didn't. Some had cameras. Some didn't. The team wanted a single operations view — but the data landscape was heterogeneous and hostile.

Constraint

Real-time. Any telemetry older than 30 seconds was operationally useless for early failure detection. Machine manufacturers could not be required to change firmware. The budget covered software only.

The decision

What we chose and why.

Built an operations center with a telemetry abstraction layer that translated each manufacturer's data format into a common event schema. Anomaly detection operated on the normalized stream. Camera feeds ran through a lightweight local model that detected visual anomalies before they became structural failures.

Tradeoffs

Telemetry abstraction layeroverDirect manufacturer integrations

Manufacturers change APIs. Adding a new manufacturer must not require touching the core system. The adapter is disposable; the schema is permanent.

Contextual baselines per machineoverGlobal anomaly thresholds

Machine A running at 78°C may be normal. Machine B at 78°C may be failing. A global threshold produces alerts that operators learn to ignore.

Edge inference for camerasoverCloud video streaming

Streaming 40 camera feeds continuously was cost-prohibitive. Local edge models sent only frames flagged as anomalous — reducing bandwidth by 97%.

Architecture

Service

Data Store

Queue / Bus

Client

Click any node to inspect

The failure

The first alerting system used static thresholds. It generated 78% false positive alerts within the first two weeks. Operators muted the alert channel.

DISCOVERED — During the first operational week. A senior technician mentioned he had turned off notifications.

IMPACT — Three real failures occurred during this period that were caught manually — not by the system.

Iteration

Replaced static thresholds with per-machine contextual baselines computed over rolling 72-hour windows. Alert confidence scores were added. Low-confidence anomalies were grouped and reviewed in a daily digest rather than triggering immediate alerts. False positive rate dropped from 78% to 4%.

Outcome

67% reduction in print failures. 4,200 hours saved.

The facility went from detecting failures after they happened to preventing them. One prevented failure per month pays for the entire system. The operations team reduced floor walk frequency from 6× per shift to 2×, redirecting that time to maintenance planning.

Lessons

Alert fatigue is worse than no alerts. A muted alert channel is a monitoring system with zero coverage.

Contextual baselines always outperform global thresholds. Everything depends on what normal looks like for this specific machine.

An abstraction layer for heterogeneous data is not overhead. It's what makes the system extensible.

Edge inference exists for a reason. Not everything needs to go to the cloud.