Case Study
Beam Pump Failure Prediction
ML models trained on historian & SCADA signals to predict rod‑lift failures early and recommend proactive workovers—delivered as real‑time alerts and operator decision support in Ignition.
Oil & Gas
Predictive Maintenance
Python / ML
Ignition
OPC UA · MQTT
PI / Canary
Executive Snapshot
Objective
Anticipate beam‑pump (rod‑lift) failures days in advance to reduce unplanned downtime, avoid deferred production, and prioritize maintenance.
My Role
IT/OT Solutions Architect & hands‑on ML developer. Led 3‑person team; built data pipelines, feature set, models, and Ignition UI/alerts.
Duration
~6–8 months (pilot → production)
Scope
100–300 wells • multi‑year history • near real‑time scoring (5–15 min)
Architecture
PLC → OPC UA/MQTT → Ignition Gateway → Historian (PI/Canary) → Python models → Alerts/CMMS
scikit‑learn / XGBoost
Time‑series features
Anomaly detection
RBAC · TLS
MLOps (retrain)
Outcome
- Achieved ~70%+ precision on failure predictions in pilot; alert lead time typically 24–72 hours.
- Reduced unplanned downtime on covered wells; improved maintenance prioritization and crew dispatching.
- Operator UI in Ignition with risk score, top drivers (feature importance), and recommended actions.
Exact KPI deltas can be plugged in once we confirm production numbers for your site(s).
Problem
Beam‑pump failures (rod/tubing wear, pump‑off conditions, motor faults) caused frequent unplanned downtime and deferred production. Signals indicating degradation (e.g., amperage asymmetry, strokes‑per‑minute drift, pump‑card shape changes) were buried across SCADA tags and historian data—difficult for operators to monitor proactively at scale.
Solution
- Built pipelines to aggregate PLC/SCADA data via OPC UA and MQTT into Ignition and the historian (PI/Canary).
- Engineered time‑series features from motor current, SPM, runtime, fluid level proxies, temperatures, alarms, and pump‑card stats.
- Trained classification models (e.g., XGBoost / Random Forest) and an anomaly model for early deviation detection (see the feature/training sketch after this list).
- Deployed a Python scoring service in the Ignition Gateway; surfaced risk scores & explanations to operators (see the scoring sketch after this list).
- Integrated alert routing (email/SMS) and optional CMMS ticket creation for high‑risk wells.
- Hardened with Purdue‑aligned zones, TLS, mutual certs, and RBAC; all changes auditable.
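For illustration, a minimal sketch of the feature-engineering and training steps, assuming per-well telemetry has already been pulled from the historian into a timestamp-indexed pandas DataFrame. Column names, window lengths, and the labeling rule are placeholders, not the production code.

```python
# Sketch only: assumes a per-well DataFrame indexed by timestamp with raw SCADA
# columns such as motor_amps, spm, runtime_min, tubing_temp (names are illustrative).
import pandas as pd
from xgboost import XGBClassifier
from sklearn.ensemble import IsolationForest

WINDOWS = ["1h", "6h", "24h"]  # rolling windows used for drift/variability features

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Rolling-window statistics approximating the degradation signals
    (amperage changes, SPM drift, runtime shifts) described above."""
    feats = pd.DataFrame(index=df.index)
    for col in ["motor_amps", "spm", "runtime_min", "tubing_temp"]:
        for w in WINDOWS:
            roll = df[col].rolling(w)
            feats[f"{col}_mean_{w}"] = roll.mean()
            feats[f"{col}_std_{w}"] = roll.std()
        # drift: short-window mean relative to long-window mean
        feats[f"{col}_drift"] = (
            df[col].rolling("1h").mean() - df[col].rolling("24h").mean()
        )
    return feats.dropna()

# X, y assembled per well: y = 1 if a failure/workover occurred within the
# agreed prediction window (e.g., 72 h) after the sample, else 0.
def train_models(X: pd.DataFrame, y: pd.Series):
    clf = XGBClassifier(
        n_estimators=300, max_depth=5, learning_rate=0.05,
        scale_pos_weight=(len(y) - y.sum()) / max(y.sum(), 1),  # class imbalance
        eval_metric="logloss",
    )
    clf.fit(X, y)
    # unsupervised detector for wells/conditions with few labeled failures
    iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=42)
    iso.fit(X[y == 0])  # fit on "healthy" periods only
    return clf, iso
```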
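The scoring side can be sketched the same way. The model path, alert threshold, and the simple importance-weighted "top drivers" ranking below are assumptions for illustration; the production service read engineered features from the Gateway/historian rather than a file.

```python
# Sketch only: scores the latest feature vector for one well and returns a risk
# score plus the top contributing features. Paths and thresholds are illustrative.
import json
import joblib
import pandas as pd

MODEL_PATH = "models/rod_lift_xgb.joblib"   # hypothetical artifact path
ALERT_THRESHOLD = 0.7                       # tuned per site during shadow mode

def score_well(features: pd.DataFrame) -> dict:
    """Return risk score, alert flag, and a simple 'top drivers' explanation
    for the most recent row of engineered features."""
    clf = joblib.load(MODEL_PATH)
    latest = features.iloc[[-1]]                       # most recent sample
    risk = float(clf.predict_proba(latest)[0, 1])

    # crude driver ranking: global feature importance weighted by how far the
    # latest value sits from its recent mean (in standard deviations)
    importance = pd.Series(clf.feature_importances_, index=features.columns)
    z = ((latest.iloc[0] - features.mean()) / features.std().replace(0, 1)).abs()
    drivers = (importance * z).sort_values(ascending=False).head(3)

    return {
        "risk_score": round(risk, 3),
        "alert": risk >= ALERT_THRESHOLD,
        "top_drivers": [{"feature": k, "weight": round(float(v), 3)}
                        for k, v in drivers.items()],
    }

if __name__ == "__main__":
    feats = pd.read_parquet("well_1234_features.parquet")  # placeholder input
    print(json.dumps(score_well(feats), indent=2))
```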
Architecture
Standards: ISA‑95. Security: segmented ICS zones, strict firewall allow‑lists, TLS, cert‑based MQTT, RBAC.
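As a sketch of the certificate-based MQTT leg of this pipeline (assuming the paho-mqtt 2.x client), the subscriber below authenticates with a client certificate over TLS before handing telemetry to the ingestion/feature pipeline. Broker address, topic layout, and certificate paths are placeholders, not the production values.

```python
# Sketch only (paho-mqtt 2.x): certificate-authenticated MQTT subscriber that
# forwards well telemetry into the ingestion pipeline. Broker, topic, and cert
# paths are placeholders.
import json
import paho.mqtt.client as mqtt

BROKER = "mqtt.example.local"               # hypothetical broker
TOPIC = "field/+/beam_pump/telemetry"       # hypothetical topic layout

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    well_id = msg.topic.split("/")[1]
    # hand off to buffering / historian write / feature pipeline
    print(well_id, payload.get("motor_amps"), payload.get("spm"))

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.tls_set(                              # mutual TLS: broker CA + client cert
    ca_certs="certs/ca.pem",
    certfile="certs/edge-client.pem",
    keyfile="certs/edge-client.key",
)
client.on_message = on_message
client.connect(BROKER, 8883)                 # TLS port
client.subscribe(TOPIC, qos=1)
client.loop_forever()
```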
KPIs & Evidence
| Metric | Definition | Pilot Result |
|---|---|---|
| Prediction Precision | Share of alerts that preceded a true failure/workover within the time window. | ~70%+ (site‑specific) |
| Lead Time | Avg. time between first alert and failure event. | ~24–72 hours |
| Downtime Reduction | Change in unplanned downtime hours on covered wells. | Plug production value here (e.g., 10–20%) |
| Deferred Production Avoided | Estimated barrels avoided due to proactive intervention. | Range TBD by site economics |
We can swap in your actual metrics once you confirm them; structure and copy are ready.
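For transparency, a sketch of how the Prediction Precision and Lead Time figures can be computed from alert and failure/workover logs. The DataFrame layout and the 72-hour window are assumptions that mirror the definitions in the table.

```python
# Sketch only: computes alert precision and mean lead time against a failure log.
# Assumes two DataFrames with well_id and datetime columns; the 72 h window
# mirrors the definition in the KPI table.
import pandas as pd

WINDOW = pd.Timedelta(hours=72)

def precision_and_lead_time(alerts: pd.DataFrame, failures: pd.DataFrame):
    """alerts: columns [well_id, alert_time]; failures: columns [well_id, failure_time]."""
    true_alerts, lead_times = 0, []
    for _, a in alerts.iterrows():
        f = failures[(failures.well_id == a.well_id) &
                     (failures.failure_time >= a.alert_time) &
                     (failures.failure_time <= a.alert_time + WINDOW)]
        if not f.empty:
            true_alerts += 1
            lead_times.append(f.failure_time.min() - a.alert_time)
    precision = true_alerts / len(alerts) if len(alerts) else float("nan")
    mean_lead = pd.Series(lead_times).mean() if lead_times else pd.NaT
    return precision, mean_lead

# Interpretation: precision ~0.7 means roughly 7 of 10 alerts preceded a real
# event within 72 h; mean_lead is the average head start operators received.
```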
- Data audit & label strategy (failure taxonomy, windows, leakage checks).
- Feature engineering & model selection; backtest on rolling windows.
- Shadow‑mode scoring vs. operator notes/workover logs; threshold tuning.
- Production rollout with alerting, CMMS hook, and monthly retrain job.
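The rolling-window backtest in the second step can look like the sketch below. Fold sizes, the 0.7 alert threshold, and the model settings are illustrative; the point is that every fold trains only on data that precedes its scoring window, so pilot metrics are not inflated by leakage.

```python
# Sketch only: expanding-window backtest where each fold trains strictly on the
# past and scores the following period; split sizes are illustrative.
import pandas as pd
from sklearn.metrics import precision_score
from xgboost import XGBClassifier

def rolling_backtest(X: pd.DataFrame, y: pd.Series, n_folds: int = 6,
                     test_days: int = 30) -> pd.Series:
    """X/y indexed by timestamp across all wells; returns precision per fold."""
    end = X.index.max()
    results = {}
    for k in range(n_folds, 0, -1):
        test_start = end - pd.Timedelta(days=test_days * k)
        test_end = test_start + pd.Timedelta(days=test_days)
        train_mask = X.index < test_start
        test_mask = (X.index >= test_start) & (X.index < test_end)
        if y[train_mask].sum() == 0 or test_mask.sum() == 0:
            continue  # skip folds with no labeled failures to learn from
        clf = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05)
        clf.fit(X[train_mask], y[train_mask])
        preds = (clf.predict_proba(X[test_mask])[:, 1] >= 0.7).astype(int)
        results[test_start.date()] = precision_score(
            y[test_mask], preds, zero_division=0)
    return pd.Series(results, name="precision")
```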
Team & Roles
- You: Architecture, ML design, Ignition integration, stakeholder alignment.
- Developers (3): Pipelines, model training, dashboards, alerting.
- Operations/Production Eng: Label curation, thresholds, action playbooks.
Lessons Learned
- Labels matter: agreed failure taxonomy & time windows avoid data leakage.
- Explainability builds trust: feature importances and trend panels in HMI.
- Iterative thresholds: start conservative, tighten with feedback.