Engineering · 2026-06-05

Why we built the Crash Doctor — pattern matching beats AI for Minecraft crashes

A short post about the tool we shipped this week — Crash Doctor — and the boring but useful engineering choices behind it.

The same five crashes, every week

We host Minecraft servers for a living. Almost every customer-impacting incident we see ends up being one of the same five or six crash families:

wrong Java version (Java 21 running a 1.12 modpack, or Java 8 running a 1.21 pack);
a missing mandatory dependency (Architectury, Cloth Config, Forgify);
NoSuchMethodError between two mods that disagree on an API version;
Mixin apply failures after a mod auto-update;
Sinytra Connector falling over a Fabric mod it can't translate;
OutOfMemoryError because a 12 GB modpack is on an 8 GB plan.

Each one has a known one-line fix. The hard part for a customer staring at a 2,000-line crash-report.txt is recognising which of those families they're looking at. That recognition step is what the internal version of this tool was built for.

Why pattern matching beats AI for this job

The obvious 2026 instinct is to throw an LLM at the problem: feed it the crash report, prompt for a diagnosis. We tried that. Three reasons we didn't ship it:

Latency. A regex pass is sub-millisecond. An LLM round-trip is 2-8 seconds. For a free public tool with no auth, slow feels broken.
Determinism. With patterns, the same crash report always produces the same diagnosis. With an LLM, the same input can produce different summaries on re-runs, and the model occasionally invents fixes that don't apply to the actual problem.
Data leakage. We'd have to send customer crash reports to an external model. Even with redaction, the threat model gets harder. With local regex matching, the report never leaves the box.

The 40+ patterns in the database cover roughly 80% of the crashes we see in production. The long tail — novel crashes, multi-cause failures — is what an LLM would actually help with, and we'll layer it on later for that specific case. The 80% case doesn't need it.
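
To make the approach concrete, here is a minimal Python sketch of a regex pattern table with a first-match lookup. It is an illustration, not the Crash Doctor's actual database: the CrashPattern type, the three entries, and the fix strings are assumptions made up for this example. Only the JVM exception class names are standard.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrashPattern:
    name: str
    regex: re.Pattern
    fix: str


# Illustrative entries only; a real database would hold many more patterns.
PATTERNS = [
    CrashPattern(
        name="Wrong Java version",
        # Covers only the "Java too old for the pack" direction.
        regex=re.compile(r"java\.lang\.UnsupportedClassVersionError"),
        fix="Run the pack on the Java version it targets.",
    ),
    CrashPattern(
        name="Out of memory",
        regex=re.compile(r"java\.lang\.OutOfMemoryError"),
        fix="Give the server more RAM or trim the modpack.",
    ),
    CrashPattern(
        name="Mod API mismatch",
        regex=re.compile(r"java\.lang\.NoSuchMethodError"),
        fix="Align the versions of the two mods involved.",
    ),
]


def diagnose(report: str) -> Optional[CrashPattern]:
    """Return the first pattern whose regex matches, or None."""
    for pattern in PATTERNS:
        if pattern.regex.search(report):
            return pattern
    return None
```

Because the lookup is a plain loop over compiled regexes, the same report always yields the same entry, and a report that matches nothing returns None rather than a guess.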

How sanitisation works

Crash reports include things they shouldn't. RCON passwords in argv dumps. Server IPs. Discord webhook URLs in error messages. Anything in environment variables when a child process dies. People paste their crash report into our tool expecting it to be safe; making that safe is the work.

The current sanitiser does four things, in order:

strips Discord webhook URLs by regex (replaced with [redacted-webhook]);
strips IPv4 and IPv6 addresses (both compressed and full forms);
normalises Unicode to NFKC so "Ｐassword" (fullwidth) folds to ASCII before keyword matching;
then redacts any line containing a secret-like keyword (password, secret, token, api_key, bearer, cookie:, authorization:, .env, etc.).

The order matters. We replace specific patterns first because the redaction markers we leave ("[redacted-webhook]") contain words like "webhook" that would re-trigger the keyword check and erase the whole line. Sanitisation runs before regex pattern matching, so the matcher only ever sees redacted data — even if a pattern later goes wrong, the secrets are already gone.
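
The four steps can be sketched in Python roughly as follows. This is a sketch under stated assumptions, not the production sanitiser: the regexes are simplified, the keyword tuple contains only the keywords listed above, and the [redacted-ip] and [redacted-line] markers are hypothetical names chosen for this example.

```python
import ipaddress
import re
import unicodedata

# Simplified regexes; assumptions for this sketch, not the production ones.
WEBHOOK_RE = re.compile(
    r"https?://(?:(?:canary|ptb)\.)?discord(?:app)?\.com/api/(?:v\d+/)?webhooks/\d+/[\w-]+"
)
# Also matches four-part version strings; an accepted false positive here.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Finds runs of hex digits and colons; each candidate is validated below.
IPV6_CANDIDATE_RE = re.compile(r"(?<![\w:])[0-9A-Fa-f:]{2,39}(?![\w:])")
SECRET_KEYWORDS = ("password", "secret", "token", "api_key", "bearer",
                   "cookie:", "authorization:", ".env")


def _redact_ipv6(match: re.Match) -> str:
    token = match.group(0)
    if token.count(":") < 2:
        return token
    try:
        ipaddress.IPv6Address(token)  # accepts full and compressed forms
    except ValueError:
        return token  # e.g. a timestamp such as 12:34:56
    return "[redacted-ip]"


def sanitise(text: str) -> str:
    # Steps 1 and 2: replace specific patterns with markers.
    text = WEBHOOK_RE.sub("[redacted-webhook]", text)
    text = IPV4_RE.sub("[redacted-ip]", text)
    text = IPV6_CANDIDATE_RE.sub(_redact_ipv6, text)
    # Step 3: fold compatibility characters (fullwidth letters etc.).
    text = unicodedata.normalize("NFKC", text)
    # Step 4: drop any line that still contains a secret-like keyword.
    lines = []
    for line in text.split("\n"):
        lowered = line.casefold()
        if any(keyword in lowered for keyword in SECRET_KEYWORDS):
            lines.append("[redacted-line]")
        else:
            lines.append(line)
    return "\n".join(lines)
```

For example, sanitise("Connecting to 192.168.1.10:25565") keeps the port and returns "Connecting to [redacted-ip]:25565", while a line such as "rcon.password=abc" is replaced wholesale. Validating IPv6 candidates with the standard ipaddress module, rather than one large regex, keeps timestamps like 12:34:56 from being redacted by mistake.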

Why filesystem storage and not a database

Shareable result URLs (the coalhosting.com/tools/crash-doctor/r/X links) are JSON files on disk with a 30-day TTL, not a Prisma model.

Adding a model would have triggered our migration window and yellow-zone deploy gate. JSON-on-disk shipped same-day. The tradeoffs are honest: filesystem storage doesn't scale to millions of shares, can't be queried, and won't survive a multi-node deployment. None of those is a problem today — and if usage proves the tool out, migrating to a Prisma model becomes a 1-hour job (read JSON, upsert, flip the read path).
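
A JSON-on-disk store with a TTL can be very small. The sketch below is one possible shape, not the code behind the share links: save_result, load_result, the random id scheme, and the lazy delete-on-read are all assumptions for illustration.

```python
import json
import secrets
import time
from pathlib import Path
from typing import Optional

TTL_SECONDS = 30 * 24 * 60 * 60  # the 30-day TTL described above


def save_result(store: Path, diagnosis: dict) -> str:
    """Write a diagnosis to <store>/<id>.json and return the share id."""
    store.mkdir(parents=True, exist_ok=True)
    share_id = secrets.token_urlsafe(8)  # hypothetical id scheme
    payload = {"created_at": time.time(), "diagnosis": diagnosis}
    (store / f"{share_id}.json").write_text(json.dumps(payload), encoding="utf-8")
    return share_id


def load_result(store: Path, share_id: str,
                now: Optional[float] = None) -> Optional[dict]:
    """Return the stored diagnosis, or None if it is missing or expired."""
    # Accept only alphanumerics, '-' and '_' so an id cannot escape the store.
    if not share_id.replace("-", "").replace("_", "").isalnum():
        return None
    path = store / f"{share_id}.json"
    if not path.is_file():
        return None
    payload = json.loads(path.read_text(encoding="utf-8"))
    now = time.time() if now is None else now
    if now - payload["created_at"] > TTL_SECONDS:
        path.unlink()  # lazily delete expired shares on read
        return None
    return payload["diagnosis"]
```

Each share is a single self-describing file, which is what would keep a later migration simple: read each JSON file, upsert it into the new store, then switch the read path.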

The version 1 lesson we keep relearning: don't promote infrastructure prematurely. The simple thing in the right place beats the right thing in the wrong place.

What we learned during the pre-launch warm-up

Before the public launch we ran eight canonical crash signatures through the production endpoint to verify the tool actually answered correctly. Seven of the eight were perfect. The eighth, a Sinytra Connector crash, got misdiagnosed as a generic "Missing class / dependency".

The root cause was pattern priority. Sinytra Connector crashes always cascade into a downstream NoClassDefFoundError for a Fabric-only class. Our catch-all "Missing class" pattern was earlier in the iteration order than the Sinytra-specific pattern, so it stole the match.

We moved the Sinytra pattern up, added a regression test that asserts Sinytra wins when both signatures appear in the same report, and shipped the fix the same hour. The pre-warm was supposed to be a smoke test; it became a real bug catch. Worth it.
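
The ordering fix and its regression test might look something like this in Python. The two regexes and the sample report are synthetic stand-ins written for this example, not the production signatures or a real log; the pattern names and the "no pattern detected" fallback follow the wording used in this post.

```python
import re

# Ordered most-specific first: the first matching entry wins.
# Regexes are illustrative assumptions, not the production signatures.
PATTERNS = [
    ("Sinytra Connector incompatibility", re.compile(r"sinytra", re.IGNORECASE)),
    ("Missing class / dependency", re.compile(r"java\.lang\.NoClassDefFoundError")),
]


def diagnose(report: str) -> str:
    for name, regex in PATTERNS:
        if regex.search(report):
            return name
    return "no pattern detected"


def test_sinytra_wins_over_missing_class() -> None:
    # Synthetic report containing both signatures, mimicking the cascade.
    report = (
        "org.sinytra.connector: failed to transform mod\n"
        "java.lang.NoClassDefFoundError: some/fabric/OnlyClass\n"
    )
    assert diagnose(report) == "Sinytra Connector incompatibility"
```

With first-match-wins iteration, list position is the priority, so a catch-all pattern belongs at the end. The test pins that ordering down: if someone later moves the generic entry back above the specific one, it fails.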

What's next

Three things in the immediate roadmap:

Lag Analyzer — paste a spark profile URL, get the top 5 mods consuming CPU per tick. Same no-signup pattern as the Crash Doctor, different input.
Performance Index — monthly published boot times, RAM ceilings, and crash rates for the top 50 modpacks, sourced from anonymised customer-server data. The kind of numbers we'd want before deciding to host a pack ourselves.
More patterns — the database grows with every report it can't recognise. If you have a crash that comes back as "no pattern detected", the relevant lines in a reply or an email are the easiest contribution.

Try the tool

The Crash Doctor lives at coalhosting.com/tools/crash-doctor. Free, no signup, secrets auto-stripped. The long-form fix guides walk through each crash family one at a time, in case you'd rather read than paste.