Engineering Retention: Interpretable Churn Cohort Discovery for Broadband

Nish · January 15, 2026

ML Case-Study

22 min read

TL;DR
A Few Terms Worth Defining
The Problem With Churn Scores
From Prediction To Triage
System Shape
Designing The Data Product
Why Interpretable Rules?
From Raw Rules To A Cohort Portfolio
The Output: Macro Cohorts
Operationalisation
Cohorts Are Levers, Not Proof
Infrastructure And Cadence
What We Learned
Reproducing The Approach
Resources

Churn prediction is only useful if someone can do something with it.

That sounds obvious, but it is where a lot of retention modelling quietly breaks down. A model can rank every customer by churn risk, produce a clean AUC, and still leave commercial teams asking the one question that actually matters:

Fine, but what should we do with this customer?

This post walks through a broadband retention system built around a different framing. Instead of asking for another black-box churn score, the goal was to discover interpretable cohorts: groups of customers with elevated churn risk, a readable reason for that risk, and a plausible intervention path.

The output was not a probability column. It was a portfolio of SQL rules that could be reviewed by humans, refreshed quarterly, scored weekly, and handed to campaign systems with clear control and treatment splits.

TL;DR

Treat churn retention as a triage problem, not just a prediction problem.
The first gate is risk: is this customer segment genuinely more likely to churn?
The second gate is actionability: do we have an intervention that is operationally possible and economically worthwhile?
Interpretable rule mining works well when the desired output is a SQL cohort definition rather than a model score.
Multi-snapshot training, deterministic splitting, robust target design, and holdout validation matter more than squeezing out another few points of model metric.
Missing data can be signal. Preserving missingness as explicit features helped discover meaningful behavioural cohorts.
The most important production artefact was the rule portfolio, not the model object.

A Few Terms Worth Defining

Most of the language in this post is deliberately plain, but three terms are worth making explicit because they shape the design.

Term	Meaning	Why it matters here
OTS	One Touch Switch: a broadband switching process that makes it easier for customers to move to another provider.	Spikes in OTS churn can reveal competitor-driven pressure that is visible before ordinary cancellation reporting catches up.
Lift	The churn rate of a cohort divided by the churn rate of the overall base.	A 2.0x lift cohort is churning at twice the background rate, which makes rules comparable across different audience sizes.
Holdout	A later, untouched dataset used after tuning decisions have been made.	It is the closest offline proxy for whether the rule portfolio survives time rather than just fitting the evaluation snapshot.

The Problem With Churn Scores

Every broadband provider has a churn problem. Even a low monthly churn rate becomes expensive when the customer base is measured in millions. Tens of thousands of customers can raise cancellation events in a short window.

The issue was not that churn existed. The issue was that historical retention activity had become a patchwork of disconnected approaches.

Historical approach	What it did	Why it was not enough
Static SQL profiles	Hand-written customer segments created once and reused for months	Relevance decayed as pricing, competitors, and customer behaviour changed
Behavioural flags	Identified browsing or call patterns that looked churn-indicative	Useful signals, but not consistently connected to downstream activation
Geographic anomaly detection	Flagged local competitor activity spikes	Good area-level signal, weak individual-level guidance
Generic propensity models	Ranked customers by predicted churn probability	Said who might leave, but not why or what to do next

The failure mode was familiar: as contract renewals, price rises, or competitor pushes approached, teams would scramble. Someone would ask for a risky audience. Someone else would ask for a campaign. A model might produce a list. But the link between risk, reason, channel, intervention, and economics was often implicit.

A customer with a very high churn score might be impossible to save profitably. They might be out of footprint for the relevant upgrade, have no contact permission, be on a legacy technology with no viable offer, or already be in another campaign. High risk is not the same thing as high value.

The useful stakeholder brief was refreshingly direct:

We do not care about predicting churn for its own sake. We care about engineering retention.

That sentence changed the design.

From Prediction To Triage

The core reframing was to split the decision into two gates.

flowchart LR
    A["Customer base"] --> B{"Risk gate is this segment likely to churn?"}
    B -->|No| C["No retention action"]
    B -->|Yes| D{"Intervention gate can we act and is it worth it?"}
    D -->|No| E["Ignore or monitor"]
    D -->|Yes| F["Targeted campaign with measurement"]
    class B,D decision;
    class F terminal;
    class C,E guardrail;

Retention triage separates statistical risk from operational actionability.

Traditional churn modelling tends to optimise the first gate. It asks: who is most likely to leave?

Retention engineering needs the second gate as well. It asks: which risky customers can we plausibly help, through which intervention, at what cost, and with what measurement design?

That changed the required output.

Traditional churn model	Retention triage system
Rank all customers by risk score	Discover discrete, interpretable subgroups
Output is a probability between 0 and 1	Output is a readable rule and cohort membership
Optimises predictive discrimination	Optimises actionability and deployment fit
Often becomes a generic campaign input	Becomes a campaign brief
Hard to validate with non-technical stakeholders	Easy to review with domain experts

A rule like this is not just a model artefact:

recent_cancel_call_flag = 1
AND sales_call_count_last_60d > 0
AND actual_price_gbp > 26.30

It is a story: this customer has expressed cancellation intent, has already spoken to sales, and is paying above a threshold. That suggests a very different intervention from a customer whose risk is driven by local fibre competition or a recent contract-end event.

System Shape

At a high level, the system had four layers: data assembly, rule discovery, rule curation, and operational activation.

flowchart TB
    subgraph Data["Data layer"]
        A["Weekly customer snapshots"] --> D["Base and target spine"]
        B["Trading events churn signals"] --> D
        C["Pricing, behaviour, competition features"] --> E["Feature engineering"]
        D --> F["Assembled dataset"]
        E --> F
        F --> G["Train, eval, holdout splits"]
    end
    subgraph Training["Training pipeline"]
        G --> H["Preprocess and impute"]
        H --> I["Rule mining"]
        I --> J["Portfolio selection"]
        J --> K["Holdout validation"]
    end
    subgraph Outputs["Outputs"]
        K --> L["HTML inspection report"]
        K --> M["Imputation map"]
        K --> N["SQL cohort definitions"]
    end
    subgraph Activation["Activation"]
        N --> O["Daily or weekly cohort scoring"]
        O --> P["Eligibility filters"]
        P --> Q["Control and treatment randomisation"]
        Q --> R["Campaign systems"]
    end
    class I,J focus;
    class P,Q guardrail;
    class R terminal;

The model discovers candidate cohorts, but the production handoff is SQL rules plus activation configuration.

The distinction between discovery and activation is important. The discovery pipeline learns which segments look risky. The downstream data product decides who can be contacted today, how they are split into treatment and control, and which channel receives the record.

Designing The Data Product

The modelling work was only as good as the dataset it learned from. In churn problems, the target definition and time design are usually where the real bugs hide.

Use Multiple Snapshots, Not One

A single customer snapshot is convenient, but it is brittle. It can overrepresent one week of market conditions: an outage, a competitor promotion, a price-rise wave, or a campaign that happened to be live.

Instead, the training set used multiple weekly snapshots. Each snapshot captured the broadband base at that point in time, then attached a forward-looking churn target.

training snapshots:  week 1, week 2, ..., week 12
evaluation snapshot: later single week
holdout snapshot:    even later single week

The point was not just more rows. It was temporal diversity. A useful rule should survive across several weeks of operating conditions, not just look clever on one date.

The eval and holdout snapshots were separated in time. Eval was used for stakeholder review and tuning decisions. Holdout was kept as a true out-of-time check.

Define Churn At The Moment You Can Still Act

In a telecoms environment, “churn” is not a single clean event. There can be direct cancellations, competitor-triggered switching events, legacy gateway loss records, duplicate system events, home movers, bad debt disconnections, same-day reversals, and agent errors.

The target was defined as a valid raised churn event within 45 days.

The word “raised” matters. If you wait for a churn event to close, the customer may already be gone. Raised events are earlier. They are noisier, but they are closer to the moment when intervention is still possible.

A simplified version of the target logic looked like this:

AND (
    (trading_activity = 'Gateway Loss'
     AND activity_status = 'Gateway Loss')
    OR
    (trading_activity = 'OTS Churn'
     AND activity_status = 'Raised')
    OR
    (trading_activity = 'Churn'
     AND activity_status = 'Raised'
     AND COALESCE(home_move_flag, 0) = 0
     AND COALESCE(churn_group, '') != 'Competitive')
)

Several exclusions were added after domain review.

Exclusion	Why it mattered
Home movers	Address changes can create churn-like records that are not genuine voluntary churn
Bad debt	Forced disconnections need different treatment from retention
Competitive duplicates	Some switching events create linked duplicate churn records
Agent errors	Operational mistakes should not teach the model false intent
Same-day cancellations	Immediate reversals often indicate non-genuine churn intent

One subtle decision was joining churn events at account level rather than service level. Service-level joins inflated churn because one account could have multiple service rows. Account-level joins better matched the business decision: if an account is leaving, treat that account once.
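A minimal sketch of the account-level, 45-day target in plain Python. The event tuple layout, the `label_accounts` helper, and the choice to include both window boundaries are assumptions for illustration; only the 45-day window, the use of raised dates, and the account-level grain come from the description above.

```python
from datetime import date

TARGET_WINDOW_DAYS = 45  # from the text: valid raised churn event within 45 days


def label_accounts(snapshot_date, account_ids, raised_events):
    """Return {account_id: 0 or 1} for one snapshot.

    raised_events holds (account_id, service_id, raised_date) tuples that have
    already passed the validity exclusions (hypothetical layout).
    """
    churned_accounts = {
        account_id
        for account_id, _service_id, raised_date in raised_events
        if 0 <= (raised_date - snapshot_date).days <= TARGET_WINDOW_DAYS
    }
    # One label per account, however many service rows raised an event.
    return {acc: int(acc in churned_accounts) for acc in account_ids}


snapshot = date(2025, 9, 1)
events = [
    ("A1", "broadband", date(2025, 9, 20)),   # inside the window
    ("A1", "landline", date(2025, 9, 20)),    # same account, second service row
    ("A2", "broadband", date(2025, 11, 30)),  # outside the 45-day window
]
labels = label_accounts(snapshot, ["A1", "A2", "A3"], events)
print(labels)
```

Account `A1` is counted once even though two of its services raised an event, which is the inflation the account-level join avoids.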

Scope The Base To The Decision You Can Influence

The first version focused on out-of-contract customers.

That was not just a business preference. It was a modelling choice. Customers close to contract end or already out of contract have different risk dynamics from customers still safely inside their contract term. If all lifecycle stages are mixed together, contract-end timing can dominate the model and drown out subtler behavioural or competitive signals.

The modelling base therefore focused on out-of-contract windows:

AND (
    days_until_contract_end BETWEEN -90 AND -1
    OR days_until_contract_end BETWEEN -360 AND -91
    OR days_until_contract_end BETWEEN -720 AND -361
    OR days_until_contract_end < -720
)

This gave the system a narrower but more useful question: among customers already outside contract protection, which subgroups are both high risk and actionable?

Why A 45-Day Window?

The target window had to balance two competing forces.

If the window is too short, churn is too rare and the model sees very few positives. If the window is too long, the label becomes less connected to the intervention moment. A customer who churns six months later might not be responding to any signal visible today.

The 45-day target was a practical compromise:

It produced enough positive examples to learn from.
It allowed time for scoring, campaign setup, delivery, and customer response.
It stayed close enough to the risk moment to support operational action.
It aligned with familiar 30-to-60-day retention planning windows.

You could use a donut window, such as labelling churn between days 14 and 45 only. But that creates a different problem: customers who churn tomorrow become negatives. For retention, those urgent cases are often exactly the ones you care about.
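The trade-off is easy to see on a toy example. This sketch compares the full 0-to-45-day label with a 14-to-45-day donut label; the `window_label` helper and the three example customers are hypothetical.

```python
def window_label(days_to_churn, start_day=0, end_day=45):
    """1 if a raised churn event falls in [start_day, end_day], else 0.

    days_to_churn is None for customers with no raised event.
    """
    if days_to_churn is None:
        return 0
    return int(start_day <= days_to_churn <= end_day)


days_to_churn = {"churns_tomorrow": 1, "churns_in_a_month": 30, "stays": None}

full_window = {k: window_label(v) for k, v in days_to_churn.items()}
donut_window = {k: window_label(v, start_day=14) for k, v in days_to_churn.items()}

print(full_window)
print(donut_window)
```

Under the donut label, the customer who churns tomorrow is indistinguishable from the customer who stays.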

The Feature Space

The feature engineering layer produced around 140 signals across five broad families.

Feature family	Examples
Pricing and product	Actual price, standard price, price-rise history, price percentile in local area, product add-ons
Behavioural signals	Cancellation calls, sales calls, complaints, sentiment, broadband search, cancel-page visits
Competitive environment	Local competitor presence, fibre availability, recent local churn rates
Customer attributes	Technology type, speed tier, service maturity, household and convergence indicators
Geographic aggregates	Area-level regrade rates, ARPU deltas, call volumes, and sentiment

Every feature table was left-joined onto the base-target spine. Missingness was not cleaned away upstream. It was passed into preprocessing deliberately, because missing customer data often tells you something about customer behaviour.

Deterministic Splitting

The split design used a salted hash of account_id, so the same customer always landed in the same pool across snapshots.

DECLARE hash_salt STRING DEFAULT 'EXP_2025_Q4_V1';
DECLARE stayer_downsample_pct FLOAT64 DEFAULT 0.01;

WITH base_with_hash AS (
    SELECT
        *,
        ABS(MOD(FARM_FINGERPRINT(CONCAT(account_id, hash_salt)), 1000)) AS customer_hash
    FROM final_dataset
),
with_pool_assignment AS (
    SELECT
        *,
        CASE
            WHEN customer_hash BETWEEN 0 AND 399 THEN 'TRAIN_POOL'
            WHEN customer_hash BETWEEN 400 AND 699 THEN 'EVAL_POOL'
            WHEN customer_hash BETWEEN 700 AND 999 THEN 'HOLDOUT_POOL'
        END AS customer_pool
    FROM base_with_hash
)
SELECT *
FROM with_pool_assignment
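WHERE
    -- Keep every churner in every pool.
    target_churn_45d = 1
    -- Downsample train-pool stayers. The filter reuses the pool hash, so 0.01
    -- keeps buckets 0-9: 1% of all accounts, or 2.5% of the 400 train buckets.
    OR (customer_pool = 'TRAIN_POOL'
        AND target_churn_45d = 0
        AND customer_hash < (1000 * stayer_downsample_pct))
    -- Eval and holdout keep their natural churn rates.
    OR customer_pool IN ('EVAL_POOL', 'HOLDOUT_POOL');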

This gave three useful properties.

Property	Why it matters
No customer overlap	The same account cannot leak across train, eval, and holdout
Reproducibility	The split can be regenerated exactly from the salt
Train balanced, test natural	Training can downsample stayers, while eval and holdout preserve real-world rates
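The same idea can be sketched outside the warehouse. This Python version is an illustration, not the production logic: SHA-256 stands in for `FARM_FINGERPRINT`, so bucket values will not match BigQuery, and the second salt used for downsampling is an optional variation that keeps the sampling decision independent of the pool bucket. The account identifiers are made up.

```python
import hashlib

HASH_SALT = "EXP_2025_Q4_V1"  # salt value taken from the SQL above


def bucket(account_id, salt, n_buckets=1000):
    """Stable bucket in [0, n_buckets); SHA-256 stands in for FARM_FINGERPRINT."""
    digest = hashlib.sha256(f"{account_id}{salt}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def assign_pool(account_id, salt=HASH_SALT):
    """Same 40/30/30 bucket ranges as the SQL CASE expression."""
    b = bucket(account_id, salt)
    if b < 400:
        return "TRAIN_POOL"
    if b < 700:
        return "EVAL_POOL"
    return "HOLDOUT_POOL"


def keep_train_stayer(account_id, keep_pct=0.01, salt=HASH_SALT + "_DOWNSAMPLE"):
    """Downsample with a second salt (an assumption, not the original design)."""
    return bucket(account_id, salt, n_buckets=10_000) < keep_pct * 10_000


accounts = [f"ACC{i:06d}" for i in range(20_000)]  # hypothetical account ids
pools = {acc: assign_pool(acc) for acc in accounts}
train_share = sum(p == "TRAIN_POOL" for p in pools.values()) / len(accounts)
print(round(train_share, 3))
```

Because the pool depends only on the account id and the salt, re-running the assignment on any later snapshot puts each account in the same pool.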

Why Interpretable Rules?

For this use case, a black-box model was the wrong output shape.

Gradient boosting would probably rank customers well. But the downstream system needed cohort definitions that humans could inspect, discuss, edit, and deploy as SQL. The model had to produce rules, not just scores.

The point of using an interpretable model was not aesthetics. The model output had to survive stakeholder review and become production configuration.

That led to SkopeRulesClassifier from the imodels ecosystem. The algorithm mines high-precision decision rules from tree ensembles. Each rule is a conjunction of simple conditions:

actual_price_gbp > 26.30
AND recent_cancel_call_flag > 0.5
AND sales_call_count_last_60d > 0.5

Those rules are naturally reviewable and SQL-codifiable.

Requirement	Why rules fit
SQL deployment	A rule is already close to a `WHERE` clause
Stakeholder review	Non-technical teams can read the logic
Campaign mapping	Rule features imply likely intervention pathways
Domain validation	Experts can reject rules that are technically valid but operationally silly
Refreshability	New rules can be shipped as configuration rather than new application code

How The Rule Miner Works

The rough mining process is simple.

flowchart TD
    A["Fit many shallow decision trees"] --> B["Extract tree paths as candidate rules"]
    B --> C["Filter by precision and recall"]
    C --> D["Evaluate rules on eval snapshot"]
    D --> E["Select non-redundant rule portfolio"]
    E --> F["Translate to SQL cohorts"]
    class C,D,E focus;
    class F terminal;

SkopeRules searches many tree paths, then keeps rules that are pure enough, large enough, and stable enough.

The configuration encoded the intended behaviour:

skope:
  min_risk: 0.62
  min_recall: 0.05
  n_estimators: 150
  max_depth: [2, 3, 4]
  min_samples_per_split: 150

The min_risk threshold, the minimum precision a rule must reach to be kept, was intentionally aggressive. If a cohort is expensive to contact or discount, false positives have a real cost. A smaller set of purer cohorts is often more useful than a broad audience nobody knows how to treat.

Depths of 2, 3, and 4 gave a multi-resolution search. Depth-2 rules found broad patterns. Depth-4 rules found more specific subgroups. The minimum split size acted as regularisation, preventing the trees from forming rules around tiny pockets of noise.
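To make the filtering step concrete, here is a small stand-alone sketch of the precision and recall gate. It is not the imodels implementation: the rule representation, helper names, and toy data are hypothetical, and it treats `min_risk` as the precision floor, in line with the "Filter by precision and recall" step.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}


def rule_mask(rule, rows):
    """A rule is a list of (feature, op, threshold) conditions joined by AND."""
    return [all(OPS[op](row[f], t) for f, op, t in rule) for row in rows]


def precision_recall(rule, rows, labels):
    mask = rule_mask(rule, rows)
    covered = sum(mask)
    hits = sum(1 for m, y in zip(mask, labels) if m and y == 1)
    positives = sum(labels)
    precision = hits / covered if covered else 0.0
    recall = hits / positives if positives else 0.0
    return precision, recall


def filter_rules(rules, rows, labels, min_risk=0.62, min_recall=0.05):
    """Keep rules whose precision and recall clear both thresholds."""
    kept = []
    for rule in rules:
        p, r = precision_recall(rule, rows, labels)
        if p >= min_risk and r >= min_recall:
            kept.append((rule, p, r))
    return kept


# Toy training rows and churn labels (made up for illustration).
rows = [
    {"actual_price_gbp": 30.0, "recent_cancel_call_flag": 1},
    {"actual_price_gbp": 32.0, "recent_cancel_call_flag": 1},
    {"actual_price_gbp": 35.0, "recent_cancel_call_flag": 1},
    {"actual_price_gbp": 20.0, "recent_cancel_call_flag": 0},
    {"actual_price_gbp": 40.0, "recent_cancel_call_flag": 0},
    {"actual_price_gbp": 22.0, "recent_cancel_call_flag": 0},
]
labels = [1, 1, 0, 0, 1, 0]

candidates = [
    [("actual_price_gbp", ">", 26.30), ("recent_cancel_call_flag", ">", 0.5)],
    [("actual_price_gbp", "<=", 26.30)],
]
kept = filter_rules(candidates, rows, labels)
print(len(kept))
```

Only the first candidate survives here: it covers three customers, two of whom churn, while the second covers no churners at all.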

Missingness As A Feature

The preprocessing step made one design choice that mattered a lot: missing numeric values were imputed, but missingness itself was preserved as binary indicators.

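from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Numeric: median fill, plus a binary "was missing" column per affected feature.
numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
])

# Boolean: fill with the most common value.
boolean_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
])

# Categorical: missing becomes its own category before one-hot encoding.
categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="__MISSING__")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])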

This avoids conflating two very different cases:

customer genuinely has sentiment score = 0
customer has no sentiment score because they never called

Those are not the same customer. In retention, absence of interaction can itself be a behavioural signal. Some of the stronger rules used missing indicators to distinguish disengaged customers from engaged but unhappy customers.
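A dependency-free sketch of what the indicator buys you. This mimics the median-plus-indicator behaviour on a single column with the standard library; the `impute_with_indicator` helper and the four example values are hypothetical.

```python
from statistics import median


def impute_with_indicator(values):
    """Median-impute a numeric column and return (filled, missing_flags, fill_value)."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    filled = [fill_value if v is None else v for v in values]
    missing_flags = [int(v is None) for v in values]
    return filled, missing_flags, fill_value


# avg_call_sentiment_last_60d for four customers; None means the customer never called.
sentiment = [0.0, None, -0.4, 0.0]
filled, was_missing, fill_value = impute_with_indicator(sentiment)

print(filled)
print(was_missing)
```

After imputation the first two customers share the same sentiment value, and only the indicator column still tells them apart.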

The preprocessing output also saved an imputation map:

{
  "actual_price_gbp": 51.35,
  "sales_call_count_last_60d": 0.0,
  "cancel_intent_call_count_last_60d": 0.0,
  "total_active_services": 2.0,
  "avg_call_sentiment_last_60d": 0.0
}

Those values were later used in SQL through COALESCE, keeping Python training logic and SQL scoring logic aligned.
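One way to keep the two paths aligned is to render the SQL from the rule and the imputation map together. This is a sketch under assumptions: the tuple-based rule format and both helper functions are hypothetical, and only `>` and `<=` conditions are handled.

```python
IMPUTATION_MAP = {
    "actual_price_gbp": 51.35,
    "sales_call_count_last_60d": 0.0,
    "total_active_services": 2.0,
}


def condition_to_sql(feature, op, threshold, imputation_map):
    """Render one condition, wrapping the column in COALESCE when a fill value exists."""
    if op not in {">", "<="}:
        raise ValueError(f"unsupported operator: {op}")
    if feature in imputation_map:
        column = f"COALESCE({feature}, {imputation_map[feature]})"
    else:
        column = feature
    return f"{column} {op} {threshold}"


def rule_to_sql(rule, imputation_map):
    """Join the rendered conditions with AND, one per line."""
    return "\n    AND ".join(
        condition_to_sql(f, op, t, imputation_map) for f, op, t in rule
    )


rule = [
    ("actual_price_gbp", ">", 26.305),
    ("sales_call_count_last_60d", ">", 0.5),
    ("total_active_services", ">", 1.5),
]
sql = rule_to_sql(rule, IMPUTATION_MAP)
print(sql)
```

A rule that uses a missingness indicator would need its own translation, most likely an `IS NULL` check on the source column, which this sketch does not cover.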

Safe Feature Names

One implementation detail saved a lot of pain: before rule mining, all feature columns were remapped to safe tokens such as f0, f1, and f2.

Tree-derived rule strings are easier to parse when feature names cannot collide with SQL keywords, spaces, punctuation, or partial substrings of other feature names.

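# Assumes X_train is a pandas DataFrame; keep the original names for the reverse mapping.
orig_cols = list(X_train.columns)
safe_cols = [f"f{i}" for i in range(len(orig_cols))]
orig_to_safe = dict(zip(orig_cols, safe_cols))
safe_to_orig = {v: k for k, v in orig_to_safe.items()}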

After mining, the display rules were mapped back to human-readable feature names. The mapping was saved as an artefact so any downstream process could reconstruct the translation.
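The reverse mapping is worth doing with whole-token matching, otherwise `f1` can be replaced inside `f10`. A minimal sketch, with made-up column names beyond the two that appear in this post and an assumed rule-string format:

```python
import re

# Hypothetical column list: two real feature names plus nine placeholders.
orig_cols = ["actual_price_gbp", "recent_cancel_call_flag"] + [
    f"other_feature_{i}" for i in range(2, 11)
]
safe_cols = [f"f{i}" for i in range(len(orig_cols))]
safe_to_orig = dict(zip(safe_cols, orig_cols))

TOKEN = re.compile(r"\bf\d+\b")  # whole tokens only, so f1 never matches inside f10


def to_display_rule(safe_rule):
    """Swap safe tokens such as f0 back to their original feature names."""
    return TOKEN.sub(lambda m: safe_to_orig[m.group(0)], safe_rule)


display = to_display_rule("f0 > 26.3 and f1 > 0.5 and f10 <= 3.0")
print(display)
```

A single regex pass also means replaced text is never rescanned, so an original name that happens to contain something like `f3` cannot be mangled.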

From Raw Rules To A Cohort Portfolio

Raw rule mining produces too much. Some rules overlap heavily. Some pass training thresholds but do not generalise. Some are statistically valid but operationally useless.

The post-processing layer turned a large rule list into a deployable portfolio.

flowchart LR
    A["Raw mined rules"] --> B{"Robustness gate"}
    B -->|Fail| X["Discard"]
    B -->|Pass| C{"Candidate filters"}
    C -->|Fail| X
    C -->|Pass| D["Sort by priority"]
    D --> E{"Incremental value gate"}
    E -->|Fail| X
    E -->|Pass| F["Selected portfolio"]
    F --> G["Stakeholder grouping"]
    G --> H["Macro cohorts and SQL"]
    class B,C,E decision;
    class F,G focus;
    class H terminal;
    class X guardrail;

Post-processing removes fragile, redundant, and low-value rules before stakeholder review.

The Robustness Gate

Each rule was evaluated on the eval snapshot and marked robust only if it passed size, lift, and statistical-confidence thresholds.

is_robust = bool(
    (eval_rule_size >= min_eval_size)
    and (eval_lift >= min_eval_lift)
    and (eval_z_score >= 1.645)
)

The z-score compared the cohort churn rate with the base churn rate using the standard error of a proportion:

\[z = \frac{\hat{p}_{cohort} - \hat{p}_{base}}{\sqrt{\frac{\hat{p}_{base}(1 - \hat{p}_{base})}{n_{cohort}}}}\]

A one-tailed threshold of 1.645 roughly corresponds to 95% confidence that the cohort’s elevated churn rate is not just sampling noise.

This is not a perfect guarantee. It is a guardrail. It stops the most obvious failure mode: deploying a rule because it looked impressive on a small, lucky slice of data.
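The z-score formula above translates directly into a few lines of Python. The cohort size, churner count, and 3% base rate below are hypothetical numbers chosen for illustration.

```python
import math


def cohort_z_score(cohort_churners, cohort_size, base_rate):
    """One-sample z-score of a cohort churn rate against the base churn rate."""
    p_cohort = cohort_churners / cohort_size
    standard_error = math.sqrt(base_rate * (1 - base_rate) / cohort_size)
    return (p_cohort - base_rate) / standard_error


# Hypothetical numbers: 3% base churn, a 1,000-customer cohort with 60 churners.
base_rate = 0.03
z = cohort_z_score(cohort_churners=60, cohort_size=1000, base_rate=base_rate)
lift = (60 / 1000) / base_rate
is_significant = z >= 1.645

print(round(z, 2), round(lift, 2), is_significant)
```

The same 2.0x lift on a cohort of only 50 customers would give a far smaller z-score, which is why the size and confidence thresholds sit alongside the lift threshold.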

Overlap-Aware Selection

Rules overlap. If one rule covers 5,000 customers and another covers 5,000 customers, the combined audience is not necessarily 10,000. They may share most of the same people.

The selector walked through rules in priority order and only accepted a rule if its newly covered customers were large enough and risky enough.

import numpy as np

covered = np.zeros_like(y_eval, dtype=bool)
selected_rules = []

for rule in rules_ordered:
    rule_mask = masks[rule.id]
    new_mask = rule_mask & ~covered  # customers no earlier rule has captured

    incremental_size = new_mask.sum()
    if incremental_size == 0:
        continue  # nothing new; avoids taking the mean of an empty slice

    incremental_churn_rate = y_eval[new_mask].mean()
    incremental_lift = incremental_churn_rate / base_churn_rate

    if incremental_size >= 300 and incremental_lift >= 1.15:
        covered |= rule_mask
        selected_rules.append(rule)

This kept the portfolio honest. A new rule had to add incremental reach, not just re-describe customers already captured by a stronger rule.

Do Not Overfit The Eval Snapshot

It is tempting to sort rules by eval lift. That is also a good way to overfit the eval snapshot.

The portfolio used training performance as the primary ordering key, then used eval metrics as validation gates and tiebreakers.

order_by:
  - Train_Lift
  - Eval_Size
  - Eval_Lift
  - Eval_Recall
ascending: [false, false, false, false]

The reasoning was simple: a rule that worked across 12 diverse training snapshots is more credible than one that spiked on one eval week. Eval should keep us honest, not become the thing we optimise too aggressively.

Add The Economics

Risk alone is not enough. The portfolio also estimated expected value per customer.

\[EV = (\text{churn risk} \times \text{save rate} \times \text{lifetime value}) - \text{intervention cost}\]

For example, if a cohort has 62% churn risk, a 10% expected save rate, GBP 1,200 lifetime value, and GBP 10.05 intervention cost:

\[EV = (0.62 \times 0.10 \times 1200) - 10.05 = 64.35\]

The exact values are business assumptions, not universal constants. The important point is that every cohort passed through the same economic frame. A statistically risky group with no profitable action should not be treated as a campaign win.
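The same frame as a small function, using the worked numbers above. The break-even save rate is a derived quantity added for illustration: it is the save rate at which the expected value is zero for this cohort.

```python
def expected_value(churn_risk, save_rate, lifetime_value, intervention_cost):
    """Expected value per contacted customer, in the same currency as the inputs."""
    return churn_risk * save_rate * lifetime_value - intervention_cost


ev = expected_value(
    churn_risk=0.62, save_rate=0.10, lifetime_value=1200, intervention_cost=10.05
)

# Save rate at which EV is zero: cost / (risk * lifetime value).
break_even_save_rate = 10.05 / (0.62 * 1200)

print(round(ev, 2), round(break_even_save_rate, 4))
```

With these assumptions, the intervention pays for itself once it saves a little over 1.35% of the customers it touches.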

The Output: Macro Cohorts

After stakeholder review, individual rules were grouped into five macro cohorts. The grouping was manual on purpose. Rules may be mined by an algorithm, but interventions need human naming, prioritisation, and ownership.

Macro cohort	Signal	Intervention pathway
Cancel-call intent	Explicit cancellation calls, sales contact, above-threshold price	Urgent retention call with calibrated offer
Fresher low-sentiment	Newer customers, weak sentiment, low product attachment	Value-add engagement and onboarding support
Recent sales other-call	Sales interaction, regrade exposure, no conversion	Follow-up with improved or clearer offer
Recent contract-end no-landline	Recently out of contract with weaker bundle anchor	Loyalty offer or competitive match
Structural mix	Pricing, competition, service, and lifecycle combinations	Varies by sub-rule

The final deployment artefact was a set of SQL cohort rules. A simplified macro rule looked like this:

WHERE (
    (
        COALESCE(actual_price_gbp, 51.35) > 26.305
        AND COALESCE(recent_cancel_call_flag, 0.0) > 0.5
        AND COALESCE(sales_call_count_last_60d, 0.0) > 0.5
    )
    OR
    (
        COALESCE(actual_price_gbp, 51.35) > 26.305
        AND COALESCE(cancel_intent_call_count_last_60d, 0.0) > 0.5
        AND COALESCE(sales_call_count_last_60d, 0.0) > 0.5
        AND COALESCE(total_active_services, 2.0) > 1.5
    )
)
AND contract_lifecycle IN (
    'EOOC 0-3',
    'LOOC 3-12',
    'LOOC 12-24',
    'LOOC +'
);

Notice the COALESCE values. They come from the training-time imputation map. Without that, the Python model and SQL scoring path would silently disagree about how missing values are handled.

Operationalisation

The discovery pipeline did not contact customers directly. It handed SQL cohort definitions to an operational data product.

That downstream system handled eligibility, exclusions, priority arbitration, control/treatment randomisation, discount parameterisation, and routing into campaign channels.

flowchart TD
    A["Active cohort configuration"] --> B["Eligibility and exclusion filters"]
    B --> C["Apply SQL rules to eligible base"]
    C --> D{"Multiple cohort matches?"}
    D -->|Yes| E["Use priority assignment"]
    D -->|No| F["Keep cohort assignment"]
    E --> G["Deterministic control treatment split"]
    F --> G
    G --> H["Attach treatment parameters"]
    H --> I["Route to email, decisioning, outbound"]
    class C,G,H focus;
    class D decision;
    class B,E guardrail;
    class I terminal;

The activation layer turns rule membership into measurable customer treatment.

The configuration table contained fields like:

Field	Purpose
`cohort_type`	Human-readable cohort name
`cohort_rule`	SQL condition defining membership
`priority`	Which cohort wins when a customer matches several
`control_pct`	Percentage held out for measurement
`treatment_pct`	Percentage eligible for action
`discount_parameter`	Offer or incentive value to attach
`channel_flags`	Whether the cohort can go to email, outbound, decisioning, or other channels

The split between discovery and activation kept the system maintainable.

Concern	Discovery pipeline	Activation data product
Main question	Which cohorts are risky and interpretable?	Which customers can we contact today?
Cadence	Quarterly retraining or major refresh	Daily or weekly scoring
Primary users	Data science and commercial stakeholders	Data engineering and campaign operations
Output	Rules, reports, validation metrics	Activation-ready customer tables
Experimentation	Rule thresholds and cohort discovery	Control/treatment splits and channel tests

This separation is what made the system practical. Data science could improve cohort definitions without editing campaign infrastructure. Campaign teams could adjust treatment parameters without retraining the model.
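A rough sketch of the two activation steps that matter most for measurement: priority arbitration and a deterministic control/treatment split. The cohort names, percentages, salt, and the convention that the lowest priority number wins are all assumptions, and SHA-256 is used here only as a convenient stable hash.

```python
import hashlib

# Hypothetical configuration rows; field names follow the configuration table.
COHORT_CONFIG = {
    "cancel_call_intent": {"priority": 1, "control_pct": 10},
    "structural_mix": {"priority": 5, "control_pct": 20},
}


def pick_cohort(matched_cohorts):
    """Resolve multiple matches: the lowest priority number wins (an assumption)."""
    return min(matched_cohorts, key=lambda name: COHORT_CONFIG[name]["priority"])


def assign_arm(account_id, cohort, salt="ACTIVATION_V1"):
    """Deterministic control/treatment split from a salted hash of the account."""
    key = f"{account_id}|{cohort}|{salt}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "control" if bucket < COHORT_CONFIG[cohort]["control_pct"] else "treatment"


cohort = pick_cohort(["structural_mix", "cancel_call_intent"])
arm_first_run = assign_arm("ACC000123", cohort)
arm_second_run = assign_arm("ACC000123", cohort)

print(cohort, arm_first_run == arm_second_run)
```

Because the arm depends only on the account, cohort, and salt, a customer stays in the same arm across weekly scoring runs, which keeps the control group clean.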

Cohorts Are Levers, Not Proof

There is an important causality caveat here: a high-lift cohort does not prove that an intervention will work.

It only says risk is concentrated. It tells us that a particular slice of the customer base is behaving differently from the background population. That is useful, but it is not the same as evidence that a discount, call, email, upgrade offer, or service fix will save those customers.

The cohort identifier is a lever to pull. Its value is that it gives the business a concrete, reviewable object to experiment with:

cohort hypothesis -> available intervention -> control/treatment split -> measured outcome -> next cohort refresh

That loop is the real product.

The cultural shift was moving from “find me high-risk customers” to better questions:

Which cohorts are risky enough to deserve action?
Which of those cohorts have an intervention we can actually deliver?
Which intervention is economically sensible for that cohort?
What should the control group be so we can measure incremental saves?
What did we learn, and how should the next cohort refresh change?

This is why interpretability mattered so much. The business could reason about a cohort, choose a treatment, measure the result, and improve the targeting over time. The model did not need to be the final answer. It needed to help the organisation build an evidence-based retention muscle.
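The measurement step of that loop can be as simple as comparing churn rates between arms. The readout numbers below are invented for illustration, and a real analysis would also check whether the difference is statistically distinguishable from zero.

```python
def incremental_save_rate(control_churned, control_size, treated_churned, treated_size):
    """Difference in churn rate between the control and treatment arms."""
    control_rate = control_churned / control_size
    treated_rate = treated_churned / treated_size
    return control_rate - treated_rate


# Hypothetical campaign readout for one cohort.
uplift = incremental_save_rate(
    control_churned=60, control_size=500, treated_churned=400, treated_size=4500
)
estimated_saves = uplift * 4500

print(round(uplift, 4), round(estimated_saves))
```

Here the control arm churns at 12% and the treated arm at roughly 8.9%, so about 140 of the 4,500 treated customers count as incremental saves.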

Infrastructure And Cadence

The training pipeline ran as a managed ML workflow. The important steps were standard but deliberately explicit.

flowchart LR
    A["Load data from warehouse"] --> B["Preprocess features"]
    B --> C["Train rule miner"]
    C --> D["Post-process portfolio"]
    C --> E["Validate on holdout"]
    D --> F["Publish artefacts"]
    E --> F
    F --> G["Inspection reports and SQL rules"]
    class C,D,E focus;
    class G terminal;

The pipeline trains slowly and carefully, while the resulting SQL rules can be scored often.

The operating principle was train once, score often.

Training quarterly made sense because structural churn drivers do not change every day. Scoring weekly, or daily where the downstream product supports it, made sense because customer behaviour changes quickly. A customer who looked safe last week might have called to cancel yesterday.

There is no value in refreshing faster than the business can act, but monthly scoring is usually too slow for fast-moving retention signals.

What We Learned

1. Eval-To-Holdout Decay Is Real

One of the most important empirical findings was lift decay from eval to holdout. Rules that looked like 3.0x lift on eval might look closer to 2.0x on a later holdout snapshot. Rules around 2.0x eval lift might decay toward 1.3x.

That does not automatically mean the model is broken. It reflects temporal instability. Customer segments are not physics. Competitors move, campaigns change, pricing changes, and customer behaviour shifts.

The mitigation was structural:

Train across multiple snapshots.
Use eval for tuning, but keep holdout as truth.
Do not sort primarily by eval lift.
Set deployment thresholds with expected decay in mind.
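One simple way to set thresholds with decay in mind is to work backwards from the lift you need in production. This is an illustrative heuristic rather than the method used in the project: the retention ratio, the helper names, and the 1.5x live target are assumptions.

```python
def lift_retention(eval_lift, holdout_lift):
    """Fraction of eval lift that survived to holdout."""
    return holdout_lift / eval_lift


def required_eval_lift(target_live_lift, expected_retention):
    """Eval lift needed so that the decayed lift still clears the live target."""
    return target_live_lift / expected_retention


# The two illustrative pairs from the text: 3.0x -> 2.0x and 2.0x -> 1.3x.
retentions = [lift_retention(3.0, 2.0), lift_retention(2.0, 1.3)]
worst_case = min(retentions)

# Hypothetical requirement: cohorts should still show 1.5x lift once live.
threshold = required_eval_lift(target_live_lift=1.5, expected_retention=worst_case)

print(round(worst_case, 2), round(threshold, 2))
```

Under these assumptions a rule would need roughly 2.3x lift on eval to be expected to hold 1.5x later.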

2. Missing Data Was Not Just A Nuisance

Some missing indicators were genuinely predictive. A missing sentiment score might mean the customer never called. A missing product interaction might mean no engagement. Those are behaviours, not just database problems.

The lesson is simple: impute values if the model needs complete inputs, but preserve the fact that imputation happened.

3. Interpretability Changed The Conversation

The HTML inspection report became the shared artefact. It showed each rule in plain language, with size, lift, churn rate, and robustness indicators.

That meant commercial stakeholders could say useful things:

This segment makes sense, but we should not send it to email. It needs an outbound call.

or:

This rule is statistically valid, but we do not have a viable treatment for it yet.

That kind of feedback is hard to get from a probability score. It is easy to get from a readable rule.

4. Anti-Overfitting Is A System Design Problem

Model regularisation helped, but the bigger protections were architectural:

Multi-snapshot training.
Salted account-level splitting.
Separate eval and holdout windows.
Robustness gates.
Overlap-aware portfolio selection.
Human review before activation.

No single parameter prevents overfitting in a live commercial system. You need several layers of friction between a pattern found in data and a customer being contacted.

5. Target Definition Deserves More Time Than Model Choice

The target went through multiple iterations as edge cases surfaced. Closed churn became raised churn. Service-level joins became account-level joins. Broad lifecycle scope became out-of-contract scope. Duplicate system events were excluded.

That is normal. In customer systems, labels are rarely clean at the start. The model cannot recover from a target that encodes operational artefacts as customer intent.

Reproducing The Approach

The specific feature names are broadband-specific, but the architecture transfers well to other retention problems.

If you are building something similar, the most reusable decisions are:

Frame the problem as retention engineering, not churn prediction.
Discover cohorts that humans can understand and intervene on.
Train across multiple time snapshots to reduce temporal overfitting.
Keep eval and holdout separate, and expect some lift decay.
Preserve missingness as signal.
Select rules by incremental coverage, not just individual lift.
Use expected value to separate risky from worth targeting.
Translate the output into SQL or another operationally native rule format.
Separate discovery from activation.
Put domain experts in the loop early, especially for target definition.

The broader vision is a cohort engine: a repeatable system that discovers high-risk segments, maps them to interventions, measures outcomes, and learns from the next cycle.

Not a one-off churn model. A way of doing retention properly.

Resources

Interpretable Machine Learning by Christoph Molnar is a useful reference for thinking about transparent models and explanation quality.
imodels contains practical implementations of interpretable modelling approaches, including rule-based methods.
BigQuery FARM_FINGERPRINT is useful for deterministic splitting and reproducible control/treatment assignment.
scikit-learn SimpleImputer documents the add_indicator pattern used to preserve missingness.
Vertex AI Pipelines is one way to productionise repeatable training and artefact publication workflows.

Citation Information

If you find this content useful, please cite this work as:

Bhana, Nish. "Engineering Retention: Interpretable Churn Cohort Discovery for Broadband". Nish Blog (January 2026). https://www.nishbhana.com/Engineering-Retention/

Or use the BibTeX citation:

@article{bhana2026engineeringretention,
  title   = {Engineering Retention: Interpretable Churn Cohort Discovery for Broadband},
  author  = {Bhana, Nish},
  journal = {nishbhana.com},
  year    = {2026},
  month   = {January},
  url     = {https://www.nishbhana.com/Engineering-Retention/}
}
