Engineering Retention: Interpretable Churn Cohort Discovery for Broadband

Nish · April 15, 2026



Churn prediction is only useful if someone can do something with it.

That sounds obvious, but it is where a lot of retention modelling quietly breaks down. A model can rank every customer by churn risk, produce a clean AUC, and still leave commercial teams asking the one question that actually matters:

Fine, but what should we do with this customer?

This post walks through a broadband retention system built around a different framing. Instead of asking for another black-box churn score, the goal was to discover interpretable cohorts: groups of customers with elevated churn risk, a readable reason for that risk, and a plausible intervention path.

The output was not a probability column. It was a portfolio of SQL rules that could be reviewed by humans, refreshed quarterly, scored weekly, and handed to campaign systems with clear control and treatment splits.

TL;DR

  • Treat churn retention as a triage problem, not just a prediction problem.
  • The first gate is risk: is this customer segment genuinely more likely to churn?
  • The second gate is actionability: do we have an intervention that is operationally possible and economically worthwhile?
  • Interpretable rule mining works well when the desired output is a SQL cohort definition rather than a model score.
  • Multi-snapshot training, deterministic splitting, robust target design, and holdout validation matter more than squeezing out another few points of model metric.
  • Missing data can be signal. Preserving missingness as explicit features helped discover meaningful behavioural cohorts.
  • The most important production artefact was the rule portfolio, not the model object.

The Problem With Churn Scores

Every broadband provider has a churn problem. Even a low monthly churn rate becomes expensive when the customer base is measured in millions. Tens of thousands of customers can raise cancellation events in a short window.

The issue was not that churn existed. The issue was that historical retention activity had become a patchwork of disconnected approaches.

| Historical approach | What it did | Why it was not enough |
| --- | --- | --- |
| Static SQL profiles | Hand-written customer segments created once and reused for months | Relevance decayed as pricing, competitors, and customer behaviour changed |
| Behavioural flags | Identified browsing or call patterns that looked churn-indicative | Useful signals, but not consistently connected to downstream activation |
| Geographic anomaly detection | Flagged local competitor activity spikes | Good area-level signal, weak individual-level guidance |
| Generic propensity models | Ranked customers by predicted churn probability | Said who might leave, but not why or what to do next |

The failure mode was familiar: as contract renewals, price rises, or competitor pushes approached, teams would scramble. Someone would ask for a risky audience. Someone else would ask for a campaign. A model might produce a list. But the link between risk, reason, channel, intervention, and economics was often implicit.

A customer with a very high churn score might be impossible to save profitably. They might be out of footprint for the relevant upgrade, have no contact permission, be on a legacy technology with no viable offer, or already be in another campaign. High risk is not the same thing as high value.

The useful stakeholder brief was refreshingly direct:

We do not care about predicting churn for its own sake. We care about engineering retention.

That sentence changed the design.

From Prediction To Triage

The core reframing was to split the decision into two gates.

flowchart LR
    A["Customer base"] --> B{"Risk gate<br/>is this segment likely<br/>to churn?"}
    B -->|No| C["No retention<br/>action"]
    B -->|Yes| D{"Intervention gate<br/>can we act and<br/>is it worth it?"}
    D -->|No| E["Ignore or monitor"]
    D -->|Yes| F["Targeted campaign<br/>with measurement"]
    class B,D decision;
    class F terminal;
    class C,E guardrail;
Retention triage separates statistical risk from operational actionability.

Traditional churn modelling tends to optimise the first gate. It asks: who is most likely to leave?

Retention engineering needs the second gate as well. It asks: which risky customers can we plausibly help, through which intervention, at what cost, and with what measurement design?

That changed the required output.

| Traditional churn model | Retention triage system |
| --- | --- |
| Rank all customers by risk score | Discover discrete, interpretable subgroups |
| Output is a probability between 0 and 1 | Output is a readable rule and cohort membership |
| Optimises predictive discrimination | Optimises actionability and deployment fit |
| Often becomes a generic campaign input | Becomes a campaign brief |
| Hard to validate with non-technical stakeholders | Easy to review with domain experts |

A rule like this is not just a model artefact:

recent_cancel_call_flag = 1
AND sales_call_count_last_60d > 0
AND actual_price_gbp > 26.30

It is a story: this customer has expressed cancellation intent, has already spoken to sales, and is paying above a threshold. That suggests a very different intervention from a customer whose risk is driven by local fibre competition or a recent contract-end event.

System Shape

At a high level, the system had four layers: data assembly, rule discovery, rule curation, and operational activation.

flowchart TB
    subgraph Data["Data layer"]
        A["Weekly customer<br/>snapshots"] --> D["Base and target<br/>spine"]
        B["Trading events<br/>churn signals"] --> D
        C["Pricing, behaviour,<br/>competition features"] --> E["Feature<br/>engineering"]
        D --> F["Assembled<br/>dataset"]
        E --> F
        F --> G["Train, eval,<br/>holdout splits"]
    end
    subgraph Training["Training pipeline"]
        G --> H["Preprocess<br/>and impute"]
        H --> I["Rule mining"]
        I --> J["Portfolio<br/>selection"]
        J --> K["Holdout<br/>validation"]
    end
    subgraph Outputs["Outputs"]
        K --> L["HTML inspection<br/>report"]
        K --> M["Imputation map"]
        K --> N["SQL cohort<br/>definitions"]
    end
    subgraph Activation["Activation"]
        N --> O["Daily or weekly<br/>cohort scoring"]
        O --> P["Eligibility filters"]
        P --> Q["Control and treatment<br/>randomisation"]
        Q --> R["Campaign systems"]
    end
    class I,J focus;
    class P,Q guardrail;
    class R terminal;
The model discovers candidate cohorts, but the production handoff is SQL rules plus activation configuration.

The distinction between discovery and activation is important. The discovery pipeline learns which segments look risky. The downstream data product decides who can be contacted today, how they are split into treatment and control, and which channel receives the record.

Designing The Data Product

The modelling work was only as good as the dataset it learned from. In churn problems, the target definition and time design are usually where the real bugs hide.

Use Multiple Snapshots, Not One

A single customer snapshot is convenient, but it is brittle. It can overrepresent one week of market conditions: an outage, a competitor promotion, a price-rise wave, or a campaign that happened to be live.

Instead, the training set used multiple weekly snapshots. Each snapshot captured the broadband base at that point in time, then attached a forward-looking churn target.

training snapshots:  week 1, week 2, ..., week 12
evaluation snapshot: later single week
holdout snapshot:    even later single week

The point was not just more rows. It was temporal diversity. A useful rule should survive across several weeks of operating conditions, not just look clever on one date.

The eval and holdout snapshots were separated in time. Eval was used for stakeholder review and tuning decisions. Holdout was kept as a true out-of-time check.
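
The snapshot layout above can be sketched in a few lines of Python. This is a minimal illustration, not the production scheduler: the `plan_snapshots` helper and the seven-week gap between train, eval, and holdout are assumptions, chosen here only so that the gap exceeds the 45-day target window.

```python
from datetime import date, timedelta

TARGET_WINDOW_DAYS = 45  # target window used in the post


def plan_snapshots(first_week: date, n_train: int = 12, gap_weeks: int = 7) -> dict:
    """Lay out weekly training snapshots plus later eval and holdout weeks.

    gap_weeks is an assumed value: 7 weeks (49 days) is longer than the
    45-day label window, so the label periods of the last training
    snapshot, the eval snapshot, and the holdout snapshot do not overlap.
    """
    train = [first_week + timedelta(weeks=i) for i in range(n_train)]
    eval_week = train[-1] + timedelta(weeks=gap_weeks)
    holdout_week = eval_week + timedelta(weeks=gap_weeks)
    return {"train": train, "eval": [eval_week], "holdout": [holdout_week]}


plan = plan_snapshots(date(2025, 1, 6))
```

Whatever gap is used in practice, the ordering is the important part: every training snapshot precedes eval, and eval precedes holdout.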

Define Churn At The Moment You Can Still Act

In a telecoms environment, “churn” is not a single clean event. There can be direct cancellations, competitor-triggered switching events, legacy gateway loss records, duplicate system events, home movers, bad debt disconnections, same-day reversals, and agent errors.

The target was defined as a valid raised churn event within 45 days.

The word “raised” matters. If you wait for a churn event to close, the customer may already be gone. Raised events are earlier. They are noisier, but they are closer to the moment when intervention is still possible.

A simplified version of the target logic looked like this:

AND (
    (trading_activity = 'Gateway Loss'
     AND activity_status = 'Gateway Loss')
    OR
    (trading_activity = 'OTS Churn'
     AND activity_status = 'Raised')
    OR
    (trading_activity = 'Churn'
     AND activity_status = 'Raised'
     AND COALESCE(home_move_flag, 0) = 0
     AND COALESCE(churn_group, '') != 'Competitive')
)

Several exclusions were added after domain review.

| Exclusion | Why it mattered |
| --- | --- |
| Home movers | Address changes can create churn-like records that are not genuine voluntary churn |
| Bad debt | Forced disconnections need different treatment from retention |
| Competitive duplicates | Some switching events create linked duplicate churn records |
| Agent errors | Operational mistakes should not teach the model false intent |
| Same-day cancellations | Immediate reversals often indicate non-genuine churn intent |

One subtle decision was joining churn events at account level rather than service level. Service-level joins inflated churn because one account could have multiple service rows. Account-level joins better matched the business decision: if an account is leaving, treat that account once.
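
As a rough sketch of the account-level label, assuming the raised events have already been filtered to valid ones, the logic reduces to a set membership check. The `label_accounts` helper and the inclusive day-45 boundary are illustrative assumptions rather than the production SQL.

```python
from datetime import date, timedelta

TARGET_WINDOW_DAYS = 45


def label_accounts(snapshot_date, account_ids, raised_events):
    """Return {account_id: 0 or 1} for churn raised within the target window.

    raised_events is an iterable of (account_id, raised_date) pairs. An
    account with several service rows may appear several times; collecting
    churners in a set means the account is still counted once.
    """
    window_end = snapshot_date + timedelta(days=TARGET_WINDOW_DAYS)
    churned = {
        account_id
        for account_id, raised_date in raised_events
        if snapshot_date < raised_date <= window_end
    }
    return {account_id: int(account_id in churned) for account_id in account_ids}


snapshot = date(2025, 1, 6)
events = [
    ("A", snapshot + timedelta(days=14)),  # service row 1
    ("A", snapshot + timedelta(days=14)),  # service row 2, same account
    ("B", snapshot + timedelta(days=46)),  # just outside the window
    ("D", snapshot + timedelta(days=45)),  # last day inside the window
]
labels = label_accounts(snapshot, ["A", "B", "C", "D"], events)
```

Account A has two service rows but contributes a single positive label, which is the behaviour the account-level join was meant to guarantee.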

Scope The Base To The Decision You Can Influence

The first version focused on out-of-contract customers.

That was not just a business preference. It was a modelling choice. Customers close to contract end or already out of contract have different risk dynamics from customers still safely inside their contract term. If all lifecycle stages are mixed together, contract-end timing can dominate the model and drown out subtler behavioural or competitive signals.

The modelling base therefore focused on out-of-contract windows:

AND (
    days_until_contract_end BETWEEN -90 AND -1
    OR days_until_contract_end BETWEEN -360 AND -91
    OR days_until_contract_end BETWEEN -720 AND -361
    OR days_until_contract_end < -720
)

This gave the system a narrower but more useful question: among customers already outside contract protection, which subgroups are both high risk and actionable?
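
The same windows can be expressed as a small bucketing function. The bucket labels below are borrowed from the `contract_lifecycle` values that appear in the deployed SQL later in this post; pairing each label with a specific day range is an assumption made for illustration.

```python
from typing import Optional


def lifecycle_bucket(days_until_contract_end: int) -> Optional[str]:
    """Map days until contract end to an out-of-contract bucket.

    Returns None for customers still in contract (0 days or more remaining),
    because they fall outside the modelling base.
    """
    d = days_until_contract_end
    if d >= 0:
        return None
    if d >= -90:
        return "EOOC 0-3"    # assumed label for -90..-1
    if d >= -360:
        return "LOOC 3-12"   # assumed label for -360..-91
    if d >= -720:
        return "LOOC 12-24"  # assumed label for -720..-361
    return "LOOC +"          # assumed label for anything older
```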

Why A 45-Day Window?

The target window had to balance two competing forces.

If the window is too short, churn is too rare and the model sees very few positives. If the window is too long, the label becomes less connected to the intervention moment. A customer who churns six months later might not be responding to any signal visible today.

The 45-day target was a practical compromise:

  • It produced enough positive examples to learn from.
  • It allowed time for scoring, campaign setup, delivery, and customer response.
  • It stayed close enough to the risk moment to support operational action.
  • It aligned with familiar 30-to-60-day retention planning windows.

You could use a donut window, such as labelling churn between days 14 and 45 only. But that creates a different problem: customers who churn tomorrow become negatives. For retention, those urgent cases are often exactly the ones you care about.
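
A tiny comparison makes the trade-off concrete. Both helpers are hypothetical and take the number of days between the snapshot and the raised churn event.

```python
def plain_window_label(days_to_churn: int, window: int = 45) -> int:
    """Positive if churn is raised any time within the window."""
    return int(0 < days_to_churn <= window)


def donut_window_label(days_to_churn: int, start: int = 14, end: int = 45) -> int:
    """Positive only if churn is raised between day `start` and day `end`."""
    return int(start <= days_to_churn <= end)


# A customer who raises churn tomorrow:
urgent_plain = plain_window_label(1)   # counted as a churner
urgent_donut = donut_window_label(1)   # counted as a stayer
```

Under the donut definition the most urgent churners are taught to the model as negatives, which is the problem described above.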

The Feature Space

The feature engineering layer produced around 140 signals across five broad families.

| Feature family | Examples |
| --- | --- |
| Pricing and product | Actual price, standard price, price-rise history, price percentile in local area, product add-ons |
| Behavioural signals | Cancellation calls, sales calls, complaints, sentiment, broadband search, cancel-page visits |
| Competitive environment | Local competitor presence, fibre availability, recent local churn rates |
| Customer attributes | Technology type, speed tier, service maturity, household and convergence indicators |
| Geographic aggregates | Area-level regrade rates, ARPU deltas, call volumes, and sentiment |

Everything was left joined onto the base-target spine. Missingness was not cleaned away upstream. It was passed into preprocessing deliberately, because missing customer data often tells you something about customer behaviour.

Deterministic Splitting

The split design used a salted hash of account_id, so the same customer always landed in the same pool across snapshots.

DECLARE hash_salt STRING DEFAULT 'EXP_2025_Q4_V1';
DECLARE stayer_downsample_pct FLOAT64 DEFAULT 0.01;

WITH base_with_hash AS (
    SELECT
        *,
        ABS(MOD(FARM_FINGERPRINT(CONCAT(account_id, hash_salt)), 1000)) AS customer_hash
    FROM final_dataset
),
with_pool_assignment AS (
    SELECT
        *,
        CASE
            WHEN customer_hash BETWEEN 0 AND 399 THEN 'TRAIN_POOL'
            WHEN customer_hash BETWEEN 400 AND 699 THEN 'EVAL_POOL'
            WHEN customer_hash BETWEEN 700 AND 999 THEN 'HOLDOUT_POOL'
        END AS customer_pool
    FROM base_with_hash
)
SELECT *
FROM with_pool_assignment
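WHERE
    -- Keep every churner, whichever pool the account falls in.
    target_churn_45d = 1
    -- Keep only a thin slice of train-pool stayers. At the default setting this is
    -- buckets 0-9 of the 400 train buckets (0-399), i.e. 2.5% of train stayers.
    OR (customer_pool = 'TRAIN_POOL'
        AND target_churn_45d = 0
        AND customer_hash < (1000 * stayer_downsample_pct))
    -- Keep eval and holdout rows untouched so they retain natural churn rates.
    OR customer_pool IN ('EVAL_POOL', 'HOLDOUT_POOL');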

This gave three useful properties.

| Property | Why it matters |
| --- | --- |
| No customer overlap | The same account cannot leak across train, eval, and holdout |
| Reproducibility | The split can be regenerated exactly from the salt |
| Train balanced, test natural | Training can downsample stayers, while eval and holdout preserve real-world rates |
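
For readers without BigQuery to hand, the same idea can be sketched in Python. Note the swap: this uses SHA-256 from the standard library rather than `FARM_FINGERPRINT`, so the bucket numbers will not match the warehouse values. Only the pattern (salted hash, fixed bucket ranges) is the same, and the helper names are hypothetical.

```python
import hashlib

HASH_SALT = "EXP_2025_Q4_V1"


def customer_hash(account_id: str, salt: str = HASH_SALT) -> int:
    """Map an account to a stable bucket in 0..999 using a salted hash."""
    digest = hashlib.sha256(f"{account_id}{salt}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 1000


def customer_pool(bucket: int) -> str:
    """Apply the same bucket ranges as the SQL CASE expression."""
    if bucket <= 399:
        return "TRAIN_POOL"
    if bucket <= 699:
        return "EVAL_POOL"
    return "HOLDOUT_POOL"


# The same account lands in the same pool on every snapshot.
pool_week_1 = customer_pool(customer_hash("ACC-000123"))
pool_week_9 = customer_pool(customer_hash("ACC-000123"))
```

Changing the salt reshuffles every account, which is useful when a new experiment should not inherit the previous split.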

Why Interpretable Rules?

For this use case, a black-box model was the wrong output shape.

Gradient boosting would probably rank customers well. But the downstream system needed cohort definitions that humans could inspect, discuss, edit, and deploy as SQL. The model had to produce rules, not just scores.

The point of using an interpretable model was not aesthetics. It was the fact that the model output had to survive stakeholder review and become production configuration.

That led to SkopeRulesClassifier from the imodels ecosystem. The algorithm mines high-precision decision rules from tree ensembles. Each rule is a conjunction of simple conditions:

actual_price_gbp > 26.30
AND recent_cancel_call_flag > 0.5
AND sales_call_count_last_60d > 0.5

Those rules are naturally reviewable and SQL-codifiable.

| Requirement | Why rules fit |
| --- | --- |
| SQL deployment | A rule is already close to a WHERE clause |
| Stakeholder review | Non-technical teams can read the logic |
| Campaign mapping | Rule features imply likely intervention pathways |
| Domain validation | Experts can reject rules that are technically valid but operationally silly |
| Refreshability | New rules can be shipped as configuration rather than new application code |

How The Rule Miner Works

The rough mining process is simple.

flowchart TD
    A["Fit many shallow<br/>decision trees"] --> B["Extract tree paths<br/>as candidate rules"]
    B --> C["Filter by precision<br/>and recall"]
    C --> D["Evaluate rules<br/>on eval snapshot"]
    D --> E["Select non-redundant<br/>rule portfolio"]
    E --> F["Translate to SQL<br/>cohorts"]
    class C,D,E focus;
    class F terminal;
SkopeRules searches many tree paths, then keeps rules that are pure enough, large enough, and stable enough.

The configuration encoded the intended behaviour:

skope:
  min_risk: 0.62
  min_recall: 0.05
  n_estimators: 150
  max_depth: [2, 3, 4]
  min_samples_per_split: 150

The min_risk threshold was intentionally aggressive. If a cohort is expensive to contact or discount, false positives have a real cost. A smaller set of purer cohorts is often more useful than a broad audience nobody knows how to treat.

Depths of 2, 3, and 4 gave a multi-resolution search. Depth-2 rules found broad patterns. Depth-4 rules found more specific subgroups. The minimum split size acted as regularisation, preventing the trees from forming rules around tiny pockets of noise.
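
The filtering step is easy to illustrate in isolation. The sketch below is not the imodels implementation: it assumes candidate rules have already been extracted as lists of `(feature, operator, threshold)` conditions, and it only shows how `min_risk` (precision) and `min_recall` would prune them. All names and the toy data are hypothetical.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}


def rule_matches(rule, row):
    """A rule is a conjunction: every condition must hold for the row."""
    return all(OPS[op](row[feature], threshold) for feature, op, threshold in rule)


def rule_metrics(rule, rows, labels):
    """Return (precision, recall) of a rule against binary churn labels."""
    hits = [rule_matches(rule, row) for row in rows]
    n_hits = sum(hits)
    n_positives = sum(labels)
    true_positives = sum(1 for hit, y in zip(hits, labels) if hit and y == 1)
    precision = true_positives / n_hits if n_hits else 0.0
    recall = true_positives / n_positives if n_positives else 0.0
    return precision, recall


def filter_rules(rules, rows, labels, min_risk=0.62, min_recall=0.05):
    """Keep rules whose precision and recall both clear the thresholds."""
    kept = []
    for rule in rules:
        precision, recall = rule_metrics(rule, rows, labels)
        if precision >= min_risk and recall >= min_recall:
            kept.append((rule, precision, recall))
    return kept


rows = [
    {"price": 30.0, "cancel_call": 1}, {"price": 30.0, "cancel_call": 1},
    {"price": 30.0, "cancel_call": 1}, {"price": 20.0, "cancel_call": 0},
    {"price": 40.0, "cancel_call": 0}, {"price": 40.0, "cancel_call": 0},
]
labels = [1, 1, 0, 0, 1, 0]
specific_rule = [("price", ">", 26.3), ("cancel_call", ">", 0.5)]
broad_rule = [("price", ">", 26.3)]
kept = filter_rules([specific_rule, broad_rule], rows, labels)
```

On this toy data the broad price-only rule reaches 60% precision and is dropped, while the more specific rule reaches roughly 67% and survives.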

Missingness As A Feature

The preprocessing step made one design choice that mattered a lot: missing numeric values were imputed, but missingness itself was preserved as binary indicators.

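from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Numeric: fill with the training median and append a binary "was missing" column.
numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
])

# Boolean: fill with the most common value.
boolean_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
])

# Categorical: treat missing as its own category, then one-hot encode.
categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="__MISSING__")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])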

This avoids conflating two very different cases:

customer genuinely has sentiment score = 0
customer has no sentiment score because they never called

Those are not the same customer. In retention, absence of interaction can itself be a behavioural signal. Some of the stronger rules used missing indicators to distinguish disengaged customers from engaged but unhappy customers.
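
To see why the indicator matters, here is a dependency-free sketch of what median imputation with an indicator does to a single column. It mirrors the idea behind `add_indicator=True` rather than reproducing scikit-learn's internals, and the sentiment values are made up.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill None with the median of observed values and flag where that happened."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return imputed, was_missing, fill


# Customer 0 genuinely has neutral sentiment; customer 1 never called.
sentiment = [0.0, None, -0.4, 0.4]
imputed, was_missing, fill = impute_with_indicator(sentiment)
```

After imputation the first two customers share the same sentiment value, and only the indicator column still tells them apart.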

The preprocessing output also saved an imputation map:

{
  "actual_price_gbp": 51.35,
  "sales_call_count_last_60d": 0.0,
  "cancel_intent_call_count_last_60d": 0.0,
  "total_active_services": 2.0,
  "avg_call_sentiment_last_60d": 0.0
}

Those values were later used in SQL through COALESCE, keeping Python training logic and SQL scoring logic aligned.

Safe Feature Names

One implementation detail saved a lot of pain: before rule mining, all feature columns were remapped to safe tokens such as f0, f1, and f2.

Tree-derived rule strings are easier to parse when feature names cannot collide with SQL keywords, spaces, punctuation, or partial substrings of other feature names.

orig_cols = list(X_train.columns)  # assumes X_train is a pandas DataFrame
safe_cols = [f"f{i}" for i in range(X_train.shape[1])]
orig_to_safe = dict(zip(orig_cols, safe_cols))
safe_to_orig = {v: k for k, v in orig_to_safe.items()}

After mining, the display rules were mapped back to human-readable feature names. The mapping was saved as an artefact so any downstream process could reconstruct the translation.
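
Mapping back needs a little care, because `f1` is a prefix of `f12`. A minimal sketch using whole-token matching is shown below. The rule string format (`"f1 > 26.305 and f12 > 0.5"`) is an assumption about how the mined rules are rendered, and `to_display_rule` is a hypothetical helper.

```python
import re

SAFE_TOKEN = re.compile(r"\bf(\d+)\b")


def to_display_rule(rule: str, safe_to_orig: dict) -> str:
    """Replace safe tokens such as f12 with their original feature names.

    Matching whole tokens avoids rewriting the 'f1' inside 'f12'.
    Unknown tokens are left untouched.
    """
    return SAFE_TOKEN.sub(lambda m: safe_to_orig.get(m.group(0), m.group(0)), rule)


safe_to_orig = {"f1": "actual_price_gbp", "f12": "recent_cancel_call_flag"}
display_rule = to_display_rule("f1 > 26.305 and f12 > 0.5", safe_to_orig)
```

A naive chain of `str.replace` calls would turn `f12` into `actual_price_gbp2` here, which is exactly the kind of collision the safe tokens are meant to make easy to avoid.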

From Raw Rules To A Cohort Portfolio

Raw rule mining produces too much. Some rules overlap heavily. Some pass training thresholds but do not generalise. Some are statistically valid but operationally useless.

The post-processing layer turned a large rule list into a deployable portfolio.

flowchart LR
    A["Raw mined<br/>rules"] --> B{"Robustness<br/>gate"}
    B -->|Fail| X["Discard"]
    B -->|Pass| C{"Candidate<br/>filters"}
    C -->|Fail| X
    C -->|Pass| D["Sort by<br/>priority"]
    D --> E{"Incremental<br/>value gate"}
    E -->|Fail| X
    E -->|Pass| F["Selected<br/>portfolio"]
    F --> G["Stakeholder<br/>grouping"]
    G --> H["Macro cohorts<br/>and SQL"]
    class B,C,E decision;
    class F,G focus;
    class H terminal;
    class X guardrail;
Post-processing removes fragile, redundant, and low-value rules before stakeholder review.

The Robustness Gate

Each rule was evaluated on the eval snapshot and marked robust only if it passed size, lift, and statistical-confidence thresholds.

is_robust = bool(
    (eval_rule_size >= min_eval_size)
    and (eval_lift >= min_eval_lift)
    and (eval_z_score >= 1.645)
)

The z-score compared the cohort churn rate with the base churn rate using the standard error of a proportion:

\[z = \frac{\hat{p}_{cohort} - \hat{p}_{base}}{\sqrt{\frac{\hat{p}_{base}(1 - \hat{p}_{base})}{n_{cohort}}}}\]

A one-tailed threshold of 1.645 roughly corresponds to 95% confidence that the cohort’s elevated churn rate is not just sampling noise.

This is not a perfect guarantee. It is a guardrail. It stops the most obvious failure mode: deploying a rule because it looked impressive on a small, lucky slice of data.
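
A short function makes the effect of cohort size visible. The churn rates and cohort sizes below are illustrative numbers, not results from the project.

```python
import math

Z_THRESHOLD = 1.645  # one-tailed threshold used by the robustness gate


def cohort_z_score(cohort_churn_rate: float, base_churn_rate: float, cohort_size: int) -> float:
    """z-score of a cohort's churn rate against the base rate.

    Uses the standard error of a proportion evaluated at the base rate,
    matching the formula above.
    """
    standard_error = math.sqrt(base_churn_rate * (1 - base_churn_rate) / cohort_size)
    return (cohort_churn_rate - base_churn_rate) / standard_error


# Same 8% cohort churn rate against a 5% base, two different cohort sizes.
z_large = cohort_z_score(0.08, 0.05, cohort_size=400)
z_small = cohort_z_score(0.08, 0.05, cohort_size=50)
```

With 400 customers the z-score is about 2.75 and the rule passes. With 50 customers the same lift gives a z-score below 1, so the rule would be rejected as possibly lucky.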

Overlap-Aware Selection

Rules overlap. If one rule covers 5,000 customers and another covers 5,000 customers, the combined audience is not necessarily 10,000. They may share most of the same people.

The selector walked through rules in priority order and only accepted a rule if its newly covered customers were large enough and risky enough.

import numpy as np

covered = np.zeros_like(y_eval, dtype=bool)
selected_rules = []

for rule in rules_ordered:
    rule_mask = masks[rule.id]
    new_mask = rule_mask & ~covered  # customers not yet captured by an accepted rule

    incremental_size = new_mask.sum()
    if incremental_size == 0:
        continue  # nothing new, and avoids the mean of an empty slice
    incremental_churn_rate = y_eval[new_mask].mean()
    incremental_lift = incremental_churn_rate / base_churn_rate

    if incremental_size >= 300 and incremental_lift >= 1.15:
        covered |= rule_mask
        selected_rules.append(rule)

This kept the portfolio honest. A new rule had to add incremental reach, not just re-describe customers already captured by a stronger rule.

Do Not Overfit The Eval Snapshot

It is tempting to sort rules by eval lift. That is also a good way to overfit the eval snapshot.

The portfolio used training performance as the primary ordering key, then used eval metrics as validation gates and tiebreakers.

order_by:
  - Train_Lift
  - Eval_Size
  - Eval_Lift
  - Eval_Recall
ascending: [false, false, false, false]

The reasoning was simple: a rule that worked across 12 diverse training snapshots is more credible than one that spiked on one eval week. Eval should keep us honest, not become the thing we optimise too aggressively.
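
In code, this ordering is a multi-key descending sort. The rule records below are invented to show the behaviour. Only the key names come from the configuration above.

```python
ORDER_BY = ["Train_Lift", "Eval_Size", "Eval_Lift", "Eval_Recall"]


def order_rules(rules):
    """Sort rules by each key in ORDER_BY, all descending."""
    return sorted(rules, key=lambda r: tuple(r[k] for k in ORDER_BY), reverse=True)


rules = [
    {"id": "r1", "Train_Lift": 2.5, "Eval_Size": 900, "Eval_Lift": 2.0, "Eval_Recall": 0.06},
    # r2 spikes on eval but is weaker on training, so it does not jump the queue.
    {"id": "r2", "Train_Lift": 2.1, "Eval_Size": 400, "Eval_Lift": 3.4, "Eval_Recall": 0.05},
    {"id": "r3", "Train_Lift": 2.5, "Eval_Size": 1200, "Eval_Lift": 1.8, "Eval_Recall": 0.07},
]
ordered_ids = [r["id"] for r in order_rules(rules)]
```

The two rules tied on training lift are separated by eval size, and the eval-spiking rule goes last despite having the highest eval lift.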

Add The Economics

Risk alone is not enough. The portfolio also estimated expected value per customer.

\[EV = (\text{churn risk} \times \text{save rate} \times \text{lifetime value}) - \text{intervention cost}\]

For example, if a cohort has 62% churn risk, a 10% expected save rate, GBP 1,200 lifetime value, and GBP 10.05 intervention cost:

\[EV = (0.62 \times 0.10 \times 1200) - 10.05 = 64.35\]

The exact values are business assumptions, not universal constants. The important point is that every cohort passed through the same economic frame. A statistically risky group with no profitable action should not be treated as a campaign win.
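
The same frame is simple to encode, which makes it easy to apply identically to every cohort. The break-even helper is a small derived extra, obtained by setting EV to zero and solving for the save rate. Both function names are hypothetical.

```python
def expected_value(churn_risk, save_rate, lifetime_value, intervention_cost):
    """Expected value per contacted customer, as defined above."""
    return churn_risk * save_rate * lifetime_value - intervention_cost


def break_even_save_rate(churn_risk, lifetime_value, intervention_cost):
    """Save rate at which the intervention exactly pays for itself (EV = 0)."""
    return intervention_cost / (churn_risk * lifetime_value)


ev = expected_value(0.62, 0.10, 1200, 10.05)
min_save_rate = break_even_save_rate(0.62, 1200, 10.05)
```

Under the example assumptions the cohort is worth about GBP 64.35 per customer, and the intervention would still break even with a save rate of roughly 1.35%.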

The Output: Macro Cohorts

After stakeholder review, individual rules were grouped into five macro cohorts. The grouping was manual on purpose. Rules may be mined by an algorithm, but interventions need human naming, prioritisation, and ownership.

| Macro cohort | Signal | Intervention pathway |
| --- | --- | --- |
| Cancel-call intent | Explicit cancellation calls, sales contact, above-threshold price | Urgent retention call with calibrated offer |
| Fresher low-sentiment | Newer customers, weak sentiment, low product attachment | Value-add engagement and onboarding support |
| Recent sales other-call | Sales interaction, regrade exposure, no conversion | Follow-up with improved or clearer offer |
| Recent contract-end no-landline | Recently out of contract with weaker bundle anchor | Loyalty offer or competitive match |
| Structural mix | Pricing, competition, service, and lifecycle combinations | Varies by sub-rule |

The final deployment artefact was a set of SQL cohort rules. A simplified macro rule looked like this:

WHERE (
    (
        COALESCE(actual_price_gbp, 51.35) > 26.305
        AND COALESCE(recent_cancel_call_flag, 0.0) > 0.5
        AND COALESCE(sales_call_count_last_60d, 0.0) > 0.5
    )
    OR
    (
        COALESCE(actual_price_gbp, 51.35) > 26.305
        AND COALESCE(cancel_intent_call_count_last_60d, 0.0) > 0.5
        AND COALESCE(sales_call_count_last_60d, 0.0) > 0.5
        AND COALESCE(total_active_services, 2.0) > 1.5
    )
)
AND contract_lifecycle IN (
    'EOOC 0-3',
    'LOOC 3-12',
    'LOOC 12-24',
    'LOOC +'
);

Notice the COALESCE values. They come from the training-time imputation map. Without that, the Python model and SQL scoring path would silently disagree about how missing values are handled.
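
One way to keep the two paths aligned is to generate the SQL conditions from the saved imputation map instead of typing fill values by hand. The sketch below assumes rules are available as `(feature, operator, threshold)` tuples. The helper names are hypothetical and this is not the project's actual translator.

```python
IMPUTATION_MAP = {
    "actual_price_gbp": 51.35,
    "sales_call_count_last_60d": 0.0,
}


def condition_to_sql(feature, op, threshold, imputation_map):
    """Render one condition, wrapping the column in COALESCE with its training fill."""
    if feature not in imputation_map:
        # Fail loudly rather than emit SQL that treats missing values differently.
        raise KeyError(f"no training-time fill value saved for {feature!r}")
    return f"COALESCE({feature}, {imputation_map[feature]}) {op} {threshold}"


def rule_to_sql(rule, imputation_map):
    """Join a rule's conditions into a single SQL conjunction."""
    return " AND ".join(
        condition_to_sql(feature, op, threshold, imputation_map)
        for feature, op, threshold in rule
    )


rule = [("actual_price_gbp", ">", 26.305), ("sales_call_count_last_60d", ">", 0.5)]
sql_condition = rule_to_sql(rule, IMPUTATION_MAP)
```

Raising on a feature with no saved fill value is a deliberate choice in this sketch: a missing entry is a sign that training and scoring have drifted apart.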

Operationalisation

The discovery pipeline did not contact customers directly. It handed SQL cohort definitions to an operational data product.

That downstream system handled eligibility, exclusions, priority arbitration, control/treatment randomisation, discount parameterisation, and routing into campaign channels.

flowchart TD
    A["Active cohort<br/>configuration"] --> B["Eligibility and<br/>exclusion filters"]
    B --> C["Apply SQL rules<br/>to eligible base"]
    C --> D{"Multiple cohort<br/>matches?"}
    D -->|Yes| E["Use priority<br/>assignment"]
    D -->|No| F["Keep cohort<br/>assignment"]
    E --> G["Deterministic control<br/>treatment split"]
    F --> G
    G --> H["Attach treatment<br/>parameters"]
    H --> I["Route to email,<br/>decisioning, outbound"]
    class C,G,H focus;
    class D decision;
    class B,E guardrail;
    class I terminal;
The activation layer turns rule membership into measurable customer treatment.

The configuration table contained fields like:

| Field | Purpose |
| --- | --- |
| cohort_type | Human-readable cohort name |
| cohort_rule | SQL condition defining membership |
| priority | Which cohort wins when a customer matches several |
| control_pct | Percentage held out for measurement |
| treatment_pct | Percentage eligible for action |
| discount_parameter | Offer or incentive value to attach |
| channel_flags | Whether the cohort can go to email, outbound, decisioning, or other channels |

The split between discovery and activation kept the system maintainable.

| Concern | Discovery pipeline | Activation data product |
| --- | --- | --- |
| Main question | Which cohorts are risky and interpretable? | Which customers can we contact today? |
| Cadence | Quarterly retraining or major refresh | Daily or weekly scoring |
| Primary users | Data science and commercial stakeholders | Data engineering and campaign operations |
| Output | Rules, reports, validation metrics | Activation-ready customer tables |
| Experimentation | Rule thresholds and cohort discovery | Control/treatment splits and channel tests |

This separation is what made the system practical. Data science could improve cohort definitions without editing campaign infrastructure. Campaign teams could adjust treatment parameters without retraining the model.
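
Two of the activation steps, priority arbitration and the deterministic control/treatment split, can be sketched briefly. The details here are assumptions: that a lower `priority` number wins, that the split hashes account, cohort, and a salt with SHA-256, and the cohort names and salt themselves.

```python
import hashlib


def assign_cohort(matched_cohorts, priority):
    """Pick one cohort when a customer matches several (lower number wins, assumed)."""
    if not matched_cohorts:
        return None
    return min(matched_cohorts, key=lambda cohort: priority[cohort])


def assign_arm(account_id: str, cohort: str, control_pct: int, salt: str = "ACTIVATION_V1") -> str:
    """Deterministically place an account in control or treatment.

    The same account, cohort, and salt always give the same arm, so
    re-scoring does not move customers between arms.
    """
    digest = hashlib.sha256(f"{account_id}|{cohort}|{salt}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "control" if bucket < control_pct else "treatment"


priority = {"cancel_call_intent": 1, "structural_mix": 5}
cohort = assign_cohort(["structural_mix", "cancel_call_intent"], priority)
arm = assign_arm("ACC-000123", cohort, control_pct=10)
```

Including the cohort in the hash input means an account's arm in one cohort does not dictate its arm in another, which keeps the experiments independent of each other.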

Infrastructure And Cadence

The training pipeline ran as a managed ML workflow. The important steps were standard but deliberately explicit.

flowchart LR
    A["Load data<br/>from warehouse"] --> B["Preprocess<br/>features"]
    B --> C["Train rule<br/>miner"]
    C --> D["Post-process<br/>portfolio"]
    C --> E["Validate on<br/>holdout"]
    D --> F["Publish<br/>artefacts"]
    E --> F
    F --> G["Inspection reports<br/>and SQL rules"]
    class C,D,E focus;
    class G terminal;
The pipeline trains slowly and carefully, while the resulting SQL rules can be scored often.

The operating principle was train once, score often.

Training quarterly made sense because structural churn drivers do not change every day. Scoring weekly, or daily where the downstream product supports it, made sense because customer behaviour changes quickly. A customer who looked safe last week might have called to cancel yesterday.

There is no value in refreshing faster than the business can act, but monthly scoring is usually too slow for fast-moving retention signals.

What We Learned

1. Eval-To-Holdout Decay Is Real

One of the most important empirical findings was lift decay from eval to holdout. Rules that looked like 3.0x lift on eval might look closer to 2.0x on a later holdout snapshot. Rules around 2.0x eval lift might decay toward 1.3x.

That does not automatically mean the model is broken. It reflects temporal instability. Customer segments are not physics. Competitors move, campaigns change, pricing changes, and customer behaviour shifts.

The mitigation was structural:

  • Train across multiple snapshots.
  • Use eval for tuning, but keep holdout as truth.
  • Do not sort primarily by eval lift.
  • Set deployment thresholds with expected decay in mind.
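
The last point can be turned into arithmetic. One simple way, assumed here for illustration, is to treat decay as a ratio of holdout lift to eval lift and raise the eval threshold accordingly. The figures reuse the indicative lifts quoted above.

```python
def lift_retention(eval_lift: float, holdout_lift: float) -> float:
    """Fraction of eval lift that survived on the later holdout snapshot."""
    return holdout_lift / eval_lift


def required_eval_lift(min_deployed_lift: float, expected_retention: float) -> float:
    """Eval lift a rule needs so that, after expected decay, it still clears the bar."""
    return min_deployed_lift / expected_retention


retention_strong = lift_retention(3.0, 2.0)      # about two thirds retained
retention_moderate = lift_retention(2.0, 1.3)    # 65% retained
eval_bar = required_eval_lift(min_deployed_lift=1.3, expected_retention=0.65)
```

With roughly 65% of lift surviving, a rule needs about 2.0x lift on eval to be expected to deliver 1.3x in deployment.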

2. Missing Data Was Not Just A Nuisance

Some missing indicators were genuinely predictive. A missing sentiment score might mean the customer never called. A missing product interaction might mean no engagement. Those are behaviours, not just database problems.

The lesson is simple: impute values if the model needs complete inputs, but preserve the fact that imputation happened.

3. Interpretability Changed The Conversation

The HTML inspection report became the shared artefact. It showed each rule in plain language, with size, lift, churn rate, and robustness indicators.

That meant commercial stakeholders could say useful things:

This segment makes sense, but we should not send it to email. It needs an outbound call.

or:

This rule is statistically valid, but we do not have a viable treatment for it yet.

That kind of feedback is hard to get from a probability score. It is easy to get from a readable rule.

4. Anti-Overfitting Is A System Design Problem

Model regularisation helped, but the bigger protections were architectural:

  • Multi-snapshot training.
  • Salted account-level splitting.
  • Separate eval and holdout windows.
  • Robustness gates.
  • Overlap-aware portfolio selection.
  • Human review before activation.

No single parameter prevents overfitting in a live commercial system. You need several layers of friction between a pattern found in data and a customer being contacted.

5. Target Definition Deserves More Time Than Model Choice

The target went through multiple iterations as edge cases surfaced. Closed churn became raised churn. Service-level joins became account-level joins. Broad lifecycle scope became out-of-contract scope. Duplicate system events were excluded.

That is normal. In customer systems, labels are rarely clean at the start. The model cannot recover from a target that encodes operational artefacts as customer intent.

Reproducing The Approach

The specific feature names are broadband-specific, but the architecture transfers well to other retention problems.

If you are building something similar, the most reusable decisions are:

  1. Frame the problem as retention engineering, not churn prediction.
  2. Discover cohorts that humans can understand and intervene on.
  3. Train across multiple time snapshots to reduce temporal overfitting.
  4. Keep eval and holdout separate, and expect some lift decay.
  5. Preserve missingness as signal.
  6. Select rules by incremental coverage, not just individual lift.
  7. Use expected value to separate risky from worth targeting.
  8. Translate the output into SQL or another operationally native rule format.
  9. Separate discovery from activation.
  10. Put domain experts in the loop early, especially for target definition.

The broader vision is a cohort engine: a repeatable system that discovers high-risk segments, maps them to interventions, measures outcomes, and learns from the next cycle.

Not a one-off churn model. A way of doing retention properly.

Resources

  • Interpretable Machine Learning by Christoph Molnar is a useful reference for thinking about transparent models and explanation quality.
  • imodels contains practical implementations of interpretable modelling approaches, including rule-based methods.
  • BigQuery FARM_FINGERPRINT is useful for deterministic splitting and reproducible control/treatment assignment.
  • scikit-learn SimpleImputer documents the add_indicator pattern used to preserve missingness.
  • Vertex AI Pipelines is one way to productionise repeatable training and artefact publication workflows.

Citation Information

If you find this content useful and plan to use it, please consider citing it in the following format:

@misc{nish-blog,
  title = {Engineering Retention: Interpretable Churn Cohort Discovery for Broadband},
  author = {Nish},
  howpublished = {\url{https://www.nishbhana.com/Engineering-Retention/}},
  note = {[Online; accessed]},
  year = {2026}
}
