Most ML problems are not ML problems
We get hired to ship ML. Half the time the answer is to ship something simpler. Here is how we tell which half a project is in.
A founder comes to us with a recommendation problem. Or a fraud problem. Or a classification problem. They have read about transformers; they want a model. Half the time we deliver a model. Half the time we deliver a SQL query, a heuristic, or a small rule engine, and the result is better.
This post explains how we decide which half a project belongs in. The framing might save you a quarter.
The honest test we run first
When a problem arrives that has the shape of an ML problem (input data, a desired prediction, some labelled examples), we resist building the model and ask four questions in order. We move on to the next question only when the answer to the previous one is clearly “no”.
1. Can a rule do this with acceptable accuracy?
Most fraud-detection problems we see can be filtered to 90% precision with five hand-written rules. A model gets you to 92%. The model also costs a feature pipeline, a training loop, a versioning system, a serving layer, monitoring, and a person who understands all of that on call. Two percentage points of precision are often not worth that bill.
The honest version of this: if you have not tried hand-written rules, you have not earned the right to a model.
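A minimal sketch of what trying rules first can look like: a few predicates, plus a precision check against labelled history. The field names (amount, country, home_country, txns_last_hour, is_fraud), the thresholds, and the sample rows are all hypothetical, invented for illustration.

```python
from typing import Any, Callable, Dict, List

Txn = Dict[str, Any]

# Hypothetical rules: each returns True when a transaction looks suspicious.
RULES: Dict[str, Callable[[Txn], bool]] = {
    "high_velocity": lambda t: t["txns_last_hour"] > 10,
    "foreign_country": lambda t: t["country"] != t["home_country"],
    "large_amount": lambda t: t["amount"] > 5000,
}


def flag(txn: Txn) -> List[str]:
    """Return the names of the rules that fire for this transaction."""
    return [name for name, rule in RULES.items() if rule(txn)]


def precision(txns: List[Txn]) -> float:
    """Fraction of flagged transactions that were labelled as fraud."""
    flagged = [t for t in txns if flag(t)]
    if not flagged:
        return 0.0
    return sum(1 for t in flagged if t["is_fraud"]) / len(flagged)


# Invented labelled history, only to make the sketch runnable.
HISTORY: List[Txn] = [
    {"amount": 6000, "country": "US", "home_country": "US", "txns_last_hour": 1, "is_fraud": True},
    {"amount": 20, "country": "FR", "home_country": "US", "txns_last_hour": 2, "is_fraud": False},
    {"amount": 30, "country": "US", "home_country": "US", "txns_last_hour": 15, "is_fraud": True},
    {"amount": 40, "country": "US", "home_country": "US", "txns_last_hour": 1, "is_fraud": False},
]
```

Calling precision(HISTORY) gives the number to beat. Any model proposed later has to justify its running costs against that baseline.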
2. Can a SQL query do this with acceptable accuracy?
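A surprising number of “recommendation” problems are window functions. A surprising number of “anomaly detection” problems are a subquery that computes avg(value) OVER (...) and stddev(value) OVER (...), followed by a filter that keeps rows where abs(value - avg) > 3 * stddev. (A window function cannot appear directly in a WHERE clause, so the statistics are computed one level down.) The infrastructure you already have, your database, is more powerful than most founders expect.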
The honest version of this: if you have not written the SQL, you have not earned the right to a model.
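A rough sketch of the three-standard-deviations check, run through Python's sqlite3 module as a stand-in for whatever database you already operate. The readings table, its columns, and the sample rows are hypothetical. The filter compares squared deviations against nine times the variance, which is the same test without needing a square-root function. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Window functions cannot appear in WHERE, so the per-account statistics
# are computed in a subquery and filtered in the outer query.
OUTLIER_SQL = """
SELECT account_id, value
FROM (
    SELECT account_id,
           value,
           avg(value)         OVER (PARTITION BY account_id) AS mean,
           avg(value * value) OVER (PARTITION BY account_id) AS mean_sq
    FROM readings
)
WHERE (value - mean) * (value - mean) > 9.0 * (mean_sq - mean * mean)
ORDER BY account_id, value
"""


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory table with one obvious outlier (invented data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (account_id TEXT, value REAL)")
    rows = [("a", 10.0)] * 19 + [("a", 500.0)]
    rows += [("b", v) for v in (9.0, 10.0, 11.0) * 7]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


def find_outliers(conn: sqlite3.Connection):
    """Return (account_id, value) rows more than 3 std devs from the mean."""
    return conn.execute(OUTLIER_SQL).fetchall()
```

One caveat: with only a handful of rows per partition, no value can ever sit three standard deviations from the mean, so a check like this needs a reasonable amount of history before it is useful.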
3. Can a small model on a single CPU do this with acceptable accuracy?
If a rule and a SQL query both fall short, the next stop is the smallest possible model. A scikit-learn logistic regression on engineered features. A gradient-boosted tree from XGBoost. A k-nearest-neighbours classifier. These are small models. On modest tabular data they train in seconds and serve in microseconds to low milliseconds. They fit on a single CPU. A logistic regression can be read off its coefficients, and the others are still far easier to inspect than a deep network.
The honest version of this: most “ML problems” we ship end here. The model is small, the system around it is boring, and the maintenance bill is low.
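To give a sense of how small a small model can be, here is a k-nearest-neighbours classifier in plain Python. It is a teaching sketch, not the code we ship: in practice a library implementation such as scikit-learn's KNeighborsClassifier would take its place. The two features (amount in hundreds, transactions in the last hour) and the training rows are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Point = Sequence[float]
Example = Tuple[Point, str]


def knn_predict(train: List[Example], query: Point, k: int = 3) -> str:
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented training data: (amount in hundreds, txns in the last hour).
TRAIN: List[Example] = [
    ((0.2, 1), "ok"),
    ((0.3, 2), "ok"),
    ((0.5, 1), "ok"),
    ((60, 12), "fraud"),
    ((55, 15), "fraud"),
    ((70, 11), "fraud"),
]
```

There is no training step here at all; the model is the table of examples. Real features would need scaling so that one column does not dominate the distance.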
4. Does this actually need a deep model?
By the time we are seriously considering a deep model — a transformer, a CNN, an LLM in the loop — we have already failed to solve the problem three other ways. The question we ask at this stage is not “can a deep model do this?” — it usually can — but “is the marginal accuracy worth the operational complexity?”
A deep model in production needs:
- A reproducible training pipeline (data + code + hyperparameters + seed)
- A versioned model store (we use MLflow)
- A serving layer with autoscaling
- Drift monitoring that fires alerts when input distribution changes
- A rollback path when the model degrades
- Someone who understands the model well enough to debug it at 3am
That is real engineering effort. If the rule-based version solves 90% of the problem and a deep model gets you to 95%, the right answer is often the rule.
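To make the drift-monitoring item in the list above concrete, here is a sketch of one common approach, the Population Stability Index (PSI), computed for a single numeric feature. This is an illustrative choice and not necessarily the metric any particular team uses. The bin edges, the sample values, and the 0.25 alert threshold (a widely quoted rule of thumb) are assumptions.

```python
import math
from typing import List, Sequence

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cut-off, tune per feature


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; `edges` are the interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # The bin index is the number of cut points at or below the value.
        counts[sum(1 for e in edges if v >= e)] += 1
    total = len(values)
    # A small floor keeps empty bins from producing log(0).
    return [max(c / total, 1e-6) for c in counts]


def psi(reference: Sequence[float], live: Sequence[float], edges: Sequence[float]) -> float:
    """Population Stability Index between training-time and live inputs."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(live, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def should_alert(reference, live, edges) -> bool:
    """True when the live distribution has moved past the threshold."""
    return psi(reference, live, edges) > ALERT_THRESHOLD


# Invented samples: the live values have shifted upward.
REFERENCE = [1, 2, 3, 4, 5, 6, 7, 8]
LIVE = [6, 7, 8, 9, 10, 11, 12, 13]
EDGES = [3, 6]
```

Even this small check needs a scheduled job, a stored reference sample, and somewhere for the alert to go, which is the point of the list: each bullet is a system of its own.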
Where deep models genuinely earn their place
To be fair to the technology, deep models are the right choice when:
- The input is unstructured (text, image, audio) and the rules-based version would require an army to write
- The problem has scale (millions of decisions per day) and the small accuracy gain compounds into real money
- You have a labelled dataset large enough to train and validate honestly
- You have an operations team (or partner) who can keep the model alive in production
If three or four of those apply, build the model. If they do not, you are paying complexity tax for a result you could have gotten cheaper.
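The scale criterion above is easy to check with back-of-envelope arithmetic. The sketch below assumes every input: the decision volumes, the value of one extra correct decision, and the yearly cost of running the model are made-up numbers, and the 5-point gain simply mirrors the 90% versus 95% example earlier in the post.

```python
def annual_gain(decisions_per_day: float, accuracy_gain: float,
                value_per_decision: float) -> float:
    """Yearly value of the extra correct decisions a better model makes."""
    return decisions_per_day * 365 * accuracy_gain * value_per_decision


def worth_building(decisions_per_day: float, accuracy_gain: float,
                   value_per_decision: float, annual_running_cost: float) -> bool:
    """True when the yearly gain exceeds the yearly cost of operating the model."""
    gain = annual_gain(decisions_per_day, accuracy_gain, value_per_decision)
    return gain > annual_running_cost


# Hypothetical inputs: 5-point accuracy gain, one cent per extra correct
# decision, 150,000 per year to keep the model alive.
HIGH_VOLUME = worth_building(1_000_000, 0.05, 0.01, 150_000)
LOW_VOLUME = worth_building(2_000, 0.05, 0.01, 150_000)
```

Under these invented numbers, a million decisions a day yields roughly 182,500 a year and clears the bar; two thousand a day yields roughly 365 a year and does not. The same accuracy gain gives opposite answers depending on volume.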
Two examples from our actual work
A “recommendation” engine that became a SQL query. A founder asked for “ML-powered recommendations” on a marketplace. The actual problem: surface items similar to ones the user already bought. The actual answer: a Postgres pg_trgm similarity query joined with the user’s purchase history, indexed properly. It runs in milliseconds. It is explainable to the founder. There is no model to drift.
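For readers who have not used pg_trgm, the idea behind it fits in a few lines. The sketch below approximates trigram similarity in Python (lower-case each word, pad it, compare the sets of three-character chunks). It is not the production query, and matching on item titles is an assumption made for the example; the catalogue is invented.

```python
import re
from typing import List, Set


def trigrams(text: str) -> Set[str]:
    """Three-character chunks of each word, padded as pg_trgm describes."""
    grams: Set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def recommend(purchased: List[str], catalogue: List[str], top_n: int = 3) -> List[str]:
    """Rank not-yet-purchased items by their best match to any purchase."""
    candidates = [item for item in catalogue if item not in purchased]
    candidates.sort(key=lambda item: max(similarity(item, p) for p in purchased),
                    reverse=True)
    return candidates[:top_n]


# Invented catalogue, for illustration only.
CATALOGUE = ["blue running shoes", "garden hose", "red running shoes"]
```

In Postgres the same ranking is a join and an ORDER BY, and a trigram index does the heavy lifting, which is why the whole thing stays in the database.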
A “fraud detection” model that started as five rules. The same founder asked for fraud detection on transactions. We wrote five rules first: velocity check, geographic distance, device-fingerprint mismatch, amount-vs-history outlier, and time-of-day anomaly. The five rules caught 89% of confirmed fraud cases in the historical data. A model on top of those rules added another 4%. The combined system runs as rules first; the model is a tiebreaker.
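One way a rules-first system with a model as tiebreaker might be wired is sketched below. The routing policy (no rules fire: allow; two or more fire: block; exactly one fires: ask the model) is our illustration of the idea, not a description of the deployed system. Only two of the five rules are shown, and the field names, thresholds, and scoring function are hypothetical.

```python
from typing import Any, Callable, Dict, List

Txn = Dict[str, Any]
Rule = Callable[[Txn], bool]

# Two illustrative rules with invented thresholds.
RULES: List[Rule] = [
    lambda t: t["txns_last_hour"] > 10,          # velocity check
    lambda t: t["amount"] > 5 * t["avg_amount"],  # amount-vs-history outlier
]


def decide(txn: Txn, rules: List[Rule],
           model_score: Callable[[Txn], float], threshold: float = 0.5) -> str:
    """Rules decide the clear cases; the model only breaks ties."""
    fired = sum(1 for rule in rules if rule(txn))
    if fired == 0:
        return "allow"
    if fired >= 2:
        return "block"
    # Exactly one rule fired: ambiguous, so defer to the model's score.
    return "block" if model_score(txn) >= threshold else "allow"
```

A useful property of this shape: if the model is rolled back or unavailable, the tiebreaker can fall back to a fixed default and the rules keep working on their own.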
Both cases shipped in weeks instead of quarters. Neither needed an MLOps team to keep alive.
When we WILL build the model
To be specific about what we are good at — we do build production ML systems when the problem actually needs one. Our deliverables in those cases:
- A reproducible training pipeline you can re-run
- A model versioning + serving setup that survives a rollback
- Monitoring that flags drift before users do
- A documented “this is what to do when the model breaks” runbook
- An honest evaluation report — not just the headline accuracy, but the failure modes
The bar we hold ourselves to is not “the model works in a notebook”. It is “the model works in production at 3am when nobody is watching”.
The takeaway
If you have a problem that smells like ML, run the four questions in order before you commit a quarter of engineering. If the answer is genuinely a model — great, that is what we build. If the answer is a SQL query, we will tell you that instead. The fastest road to “this works in production” is often not the road that requires a GPU.
We are an email away.