How a Live SQLite Crash Taught Me to Stop Being My Own Ops

A botched ad‑hoc UPDATE while pm2 held the WAL nearly erased ninety minutes of community activity, and forced me to write the runbook I’d been ignoring.

It was 9:12 PM on April 22, 2026. I had just finished a live stream where I showed the iCharles community how to add a new "likes" counter to the dashboard. The chat was buzzing, a handful of regulars were asking me to tweak the UI, and I felt the familiar rush of building something in front of an audience.

The moment the database died

In the middle of the demo I realized the likes column I had added was missing a default value. I opened a new terminal, typed sqlite3 data.db, and ran an ad‑hoc UPDATE posts SET likes = 0 WHERE likes IS NULL;. The command seemed innocent enough, a one‑liner to clean up a mistake.

Behind the scenes, pm2 was still running the Node process that served iCharles. That process kept the SQLite file open in write‑ahead log (WAL) mode. When I issued the UPDATE, sqlite3 tried to acquire a write lock that the pm2‑managed process already held. The lock request timed out, and sqlite3 printed a cryptic database is locked error.

Instead of stopping, my script kept trying, and after a few seconds the kernel started to complain about a corrupted WAL file. The iCharles server, which relied on that same database, began spitting out SQLITE_CORRUPT errors. On screen, the live chat showed a cascade of red error messages, and the community’s activity meter froze at 90 minutes of engagement.

What broke and why

In real‑estate, I learned to treat a closing table like a sacred ledger; nothing ever gets altered once the ink dries. In software, I had treated the ledger as a mutable spreadsheet. The difference is that the ledger lives inside a process that also decides when to write to it. When I tried to write to the same file from the outside, I violated a fundamental rule: never modify a live database that the runtime owns.

Because I was both the operator and the developer, I had no guardrails. I didn’t have a separate staging environment, no read‑only replica to query against, and no documented procedure for emergency fixes. The moment I pressed Enter, the entire community’s recent activity—comments, votes, and the latest version of my UI—sat on a corrupted WAL that would never be flushed correctly.

The frantic recovery

I slammed my laptop shut, restarted the pm2 process, and watched the iCharles container spin up with a fresh copy of data.db that I had pulled from the last nightly backup. The container started, but the UI showed zero likes, empty comment threads, and a missing “new user” flag. The community was still online; they could see the broken state, but they didn’t know the data was gone.

Fortunately, SQLite has a .recover command that can rebuild a database from a corrupted WAL. I launched a new terminal, pointed sqlite3 at the broken file, and ran:

.recover | sqlite3 fresh.db

The command scanned the WAL, extracted every committed page, and wrote them into fresh.db. It took about two minutes, and the output warned me that some rows might be missing.

To verify which snapshot was the most recent, I compared row counts across three sources:

The nightly backup (23 April 2026 00:00 UTC) – 12,345 rows.
The freshly recovered file – 12,322 rows.
The in‑memory snapshot that pm2 had kept before crashing – 12,330 rows (extracted via process.memoryUsage()).

The recovered file was the closest match, missing only twenty‑three rows that were probably written after the last WAL checkpoint. I merged those missing rows manually from the in‑memory dump, ran a quick SELECT COUNT(*) sanity check, and swapped fresh.db into the production container.

The runbook that finally arrived

I sat down after the stream, still hearing the echo of the chat’s frantic emojis, and wrote the first version of a runbook. It read:

Emergency Database Modification Procedure

Never run sqlite3 against the live .db file while the Node process is running.

Spin up a temporary container with a read‑only copy of the database using docker run -v $(pwd)/data.db:/data.db:ro.

Perform any ad‑hoc queries inside that container.

If a corruption occurs, stop the pm2 process immediately.

Run sqlite3 corrupted.db ".recover" | sqlite3 recovered.db.

Compare row counts against the last backup and the in‑memory snapshot.

Merge missing rows, then replace the live database file with recovered.db.

Restart pm2 and verify the UI loads without errors.

Writing it felt like drafting a contract with myself. I added a checklist to the iCharles repo, committed it with the tag runbook‑db‑emergency, and posted the link in the community’s FAQ. The next time a new member asked me to tweak a column, I pointed them to the runbook and asked them to open a pull request instead of reaching for the terminal.

What I learned about wearing two hats

In sales, I could walk into a house, read the market, and close a deal because the roles were distinct: I was the broker, not the inspector. In software, I tried to be both the builder and the operator at the same time, and the system punished me.

The rule that emerged is simple: separate the code that runs your product from the code that repairs it. When you own the server, you relinquish the right to edit its data on the fly. Build tools that let you test changes against a copy, and document the exact steps you need when the copy fails.

Since that night, I have stopped touching the live DB altogether. Every change goes through a staging container, a migration script, and a peer review. The community’s activity meter now runs on a read‑only replica, so a broken write never stalls the UI. Most importantly, I no longer feel the adrenaline rush of “fixing it myself” in the middle of a stream; I feel the steadier confidence that comes from a plan that anyone on the team can follow.

If you’re building software in public, remember the audience is watching you solve problems in real time. Show them a runbook, not a panic attack.