# Preventing Automated Business Disasters: A Practical Framework
Most AI automation failures are not caused by the model making a mistake.
They are caused by the system around the model having no plan for when a mistake happens.
This is the distinction between an AI workflow and an AI system. A workflow is a sequence of steps. A system includes failure modes, recovery paths, and control boundaries.
This framework is what we use at ZENTRY to prevent automated business disasters before they reach customers.
## What counts as an automated business disaster
Not every error is a disaster. A disaster is an error that:
1. **Reaches the public**: live content, sent emails, posted tweets
2. **Cannot be silently fixed**: the error is visible, indexed, or already consumed
3. **Damages trust or brand**: incorrect claims, missing approvals, wrong recipients
4. **Requires reactive work under pressure**: the team scrambles instead of executing a documented recovery
By this definition, a draft file with an error is not a disaster. A live Ghost post with wrong information, sent to a newsletter list of 2,000 subscribers, is a disaster.
The framework is designed to prevent the second kind of error: the one that is already live and in front of readers.
## Risk categories in AI publishing
Before building controls, you need to classify risk. Not all actions carry the same exposure.
### Category A — Irreversible actions
Actions that cannot be undone once executed:
- Newsletter send (email cannot be recalled)
- Public tweet (visible immediately, deletion leaves a trace)
- Social media posts on external platforms
**Control requirement:** Explicit package-level approval. The approved package must state which channels are included. In ZENTRY, Peter approves the scheduled publishing package — Ghost, the linked newsletter teaser, and X — while execution still requires preflight checks, Backup Guard, Ghost verification before X, and Evidence Gate.
### Category B — Reversible but visible actions
Actions that can be corrected, but the correction is itself visible:
- Ghost post published (can be reverted to draft, but Google may have indexed it)
- Slug published (changing it later creates a redirect, damages SEO temporarily)
- External link shared in a newsletter pointing to a broken URL
**Control requirement:** Pre-publish validation. Slug collision check. URL pre-verification before send.
### Category C — Internal state errors
Errors that affect the system state but do not immediately reach the public:
- Queue file corrupted or overwritten incorrectly
- Approval timestamp missing or wrong
- Backup not created before a write operation
**Control requirement:** Backup-then-write pattern. Every queue update creates a timestamped `.bak` of the current file before the write happens.
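The backup-first rule for queue updates can be sketched as follows. This is a minimal illustration, not the ZENTRY implementation: the function name `backup_then_write`, the backup file naming, the size check, and the temp-file rename are all assumptions made for the example.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def backup_then_write(queue_path: Path, new_text: str) -> Optional[Path]:
    """Copy the current queue file to a timestamped .bak, then write the update.

    Returns the backup path, or None when there was no existing file to back up.
    """
    backup_path = None
    if queue_path.exists():
        # Hypothetical naming scheme: <name>.<UTC timestamp>.bak
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        backup_path = queue_path.with_name(f"{queue_path.name}.{stamp}.bak")
        shutil.copy2(queue_path, backup_path)
        # Refuse to continue if the copy does not match the original size.
        if backup_path.stat().st_size != queue_path.stat().st_size:
            raise RuntimeError(f"Backup verification failed for {queue_path}")
    # Write to a temp file and rename it, so a crash mid-write
    # cannot leave a half-written queue file behind.
    tmp_path = queue_path.with_name(queue_path.name + ".tmp")
    tmp_path.write_text(new_text, encoding="utf-8")
    tmp_path.replace(queue_path)
    return backup_path
```

The ordering is the point: if the backup step raises, the write never runs, so the previous queue state is always recoverable.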
## The six error prevention controls
### Control 1 — Explicit approval package with channel scope
The most common source of automated disasters is assuming that one approval covers everything.
It does not.
An approval for Ghost content does not mean:
- The X post has been reviewed
- The newsletter recipient list is correct
- The send timing is approved
- The subject line is approved
The approved package must explicitly state which channels are included. The queue file must reflect this with package-level fields:
```json
{
  "status": "APPROVED_BY_PETER",
  "channels_approved": ["ghost", "newsletter_teaser", "x"],
  "newsletter_params_approved": true,
  "newsletter": "default-newsletter",
  "email_segment": "all"
}
```
An executor that treats approval as vague, incomplete, or detached from the queue state has failed at control design — not at execution.
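One way an executor might enforce channel scope is to check every channel against the package before acting on it. The sketch below uses the field names from the queue example above; the function name `channel_is_approved` and the rule that the newsletter teaser also requires `newsletter_params_approved` are assumptions for illustration.

```python
def channel_is_approved(package: dict, channel: str) -> bool:
    """Return True only if the package explicitly approves this channel."""
    if package.get("status") != "APPROVED_BY_PETER":
        return False
    if channel not in package.get("channels_approved", []):
        return False
    # Assumption: the newsletter also needs its send parameters approved.
    if channel == "newsletter_teaser":
        return package.get("newsletter_params_approved") is True
    return True


package = {
    "status": "APPROVED_BY_PETER",
    "channels_approved": ["ghost", "x"],
    "newsletter_params_approved": False,
}
print(channel_is_approved(package, "ghost"))              # True
print(channel_is_approved(package, "newsletter_teaser"))  # False
```

A missing field is treated the same as a refusal, so an incomplete package can never widen the approval by accident.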
### Control 2 — Pre-publish checklist (automated)
Before any live action, the executor runs a verification sequence:
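```python
# `queue` and the helper functions are provided by the executor.
# Checks are named so the report can state exactly which ones failed.
checks = {
    "approval_present": queue.approved_by is not None,                   # Human approval present
    "approval_timestamp": queue.approval_recorded_at is not None,        # Timestamp recorded
    "backup_guard": backup_guard_passed(),                               # Local + offsite backup verified
    "slug_unique": not slug_collision_detected(queue.slug),              # No duplicate slug
    "content_exists": content_file_exists(queue.ghost_post_file),        # Content file present
    "content_non_empty": content_file_non_empty(queue.ghost_post_file),  # Not empty
    "no_secrets": no_secrets_in_content(queue.ghost_post_file),          # No API keys in content
    "dry_run_matches": dry_run_output_matches_final(),                   # Dry-run confirmed
}
failed = [name for name, passed in checks.items() if not passed]
if failed:
    STOP(report_failed_checks(failed))
```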
If any check fails: the system stops, generates a report, and waits for human intervention. No partial execution.
### Control 3 — Dry-run before every live action
Every executor run starts in dry-run mode by default. The operator (or the cron job) must explicitly pass `--live` to trigger live API calls.
Dry-run outputs:
- Exactly what content would be published
- Exactly which channels would receive it
- Exactly which API endpoints would be called
The operator reviews the dry-run output. If it matches expectations, they pass `--live`. If not, they stop and investigate.
This single control has prevented more errors at ZENTRY than any other.
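The "dry-run unless told otherwise" default is straightforward to express with the standard `argparse` module. This is a sketch under assumptions: only the `--live` flag comes from the description above, while the function names and the `DRY_RUN`/`LIVE` mode labels are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Publishing executor (sketch)")
    # store_true defaults to False, so omitting the flag always means dry-run.
    parser.add_argument(
        "--live",
        action="store_true",
        help="perform live API calls instead of a dry-run",
    )
    return parser


def run_mode(argv: list[str]) -> str:
    """Return the mode the executor would run in for the given arguments."""
    args = build_parser().parse_args(argv)
    return "LIVE" if args.live else "DRY_RUN"


print(run_mode([]))          # DRY_RUN
print(run_mode(["--live"]))  # LIVE
```

Because the safe mode is the default, a misconfigured cron entry or a forgotten argument results in a dry-run rather than a live publish.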
### Control 4 — Slug and duplicate detection
Before publishing to Ghost, the executor queries the Ghost API for existing slugs. If the target slug already exists — even as a draft — the executor stops.
This prevents the `-2` slug collision problem that caused a duplicate content incident in our 20/05 cycle.
The check is not optional. It runs before every Ghost publish action, regardless of whether the operator believes the slug is unique.
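The comparison step of this check might look like the sketch below. It assumes the list of existing slugs, drafts included, has already been fetched from Ghost; the fetch itself is omitted. The function name and the optional matching of numerically suffixed variants (`my-post-2`) are illustrative assumptions, and the suffix heuristic can be switched off for slugs that legitimately end in a number.

```python
import re
from typing import Iterable


def find_slug_collisions(
    target: str,
    existing_slugs: Iterable[str],
    include_suffixed: bool = True,
) -> list[str]:
    """Return existing slugs that equal the target.

    With include_suffixed=True, slugs of the form "<target>-<number>" are
    also reported, since they suggest an earlier collision was auto-renamed.
    """
    suffix = r"(-\d+)?" if include_suffixed else ""
    pattern = re.compile(rf"^{re.escape(target)}{suffix}$")
    return sorted(slug for slug in existing_slugs if pattern.match(slug))


existing = {"ai-disasters", "ai-disasters-2", "weekly-notes"}
collisions = find_slug_collisions("ai-disasters", existing)
if collisions:
    print(f"STOP: slug collision with {collisions}")
```

Any non-empty result is a STOP condition; the executor does not try to pick an alternative slug on its own.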
### Control 5 — Secret scanning before content publish
No content file should ever contain API keys, tokens, or other secrets. Before any publish action, the executor scans the content for known secret patterns:
- Ghost Admin API key prefix patterns
- Cloudflare token prefix patterns
- Long hex strings (≥ 32 chars)
- Sequences of 8+ identical characters (common in test tokens)
If any pattern is found: **STOP. Do not publish. Alert immediately.**
This is not hypothetical. AI-generated content can inadvertently reproduce examples from training data that look like real credentials. The scan is mandatory.
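A minimal version of the scan for the two generic patterns in the list above could look like this. The vendor-specific prefix patterns are deliberately not reproduced here and would be passed in through the hypothetical `extra_patterns` argument. Restricting the repeated-character rule to letters and digits is also an assumption, made so that markdown rules such as `--------` are not flagged.

```python
import re
from typing import Optional

# Generic patterns only; vendor prefix patterns are supplied by the caller.
GENERIC_PATTERNS = {
    "long_hex_string": re.compile(r"[0-9a-fA-F]{32,}"),           # 32+ hex chars
    "repeated_character": re.compile(r"([A-Za-z0-9])\1{7,}"),     # 8+ identical chars
}


def scan_for_secrets(
    text: str,
    extra_patterns: Optional[dict] = None,
) -> list[str]:
    """Return the names of all patterns that match somewhere in the text."""
    patterns = {**GENERIC_PATTERNS, **(extra_patterns or {})}
    return sorted(name for name, rx in patterns.items() if rx.search(text))


findings = scan_for_secrets("Use the token XXXXXXXX in your config.")
if findings:
    print(f"STOP: possible secrets found: {findings}")
```

The function reports pattern names rather than the matched text, so the alert itself does not leak the suspected credential.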
### Control 6 — Audit trail and post-publish verification
After every successful publish action, the system:
1. Records the live URL, publish timestamp, and channel
2. Makes a HEAD request to verify the URL is reachable
3. Updates the queue file with the verified state
4. Creates a final manifest in the `/evidence/` folder
The manifest is the source of truth for what was published, when, and by whose approval. It is never overwritten — only appended to.
If the post-publish verification fails (URL not reachable, API error), the system flags the queue item as `PUBLISH_ERROR` and alerts for manual intervention.
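The verification and append-only manifest steps can be sketched with the standard library alone. Several details are assumptions for illustration: the JSON Lines manifest format, the function names, the `recorded_at` field, and the `VERIFIED` state label (only `PUBLISH_ERROR` comes from the text above). The reachability check is injectable so it can be replaced in tests.

```python
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def url_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the final response was 2xx."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def record_publish(
    manifest_path: Path,
    channel: str,
    url: str,
    check: Callable[[str], bool] = url_is_reachable,
) -> dict:
    """Verify the URL, then append one entry to the manifest."""
    entry = {
        "channel": channel,
        "url": url,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "state": "VERIFIED" if check(url) else "PUBLISH_ERROR",
    }
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever appends; earlier entries are never rewritten.
    with manifest_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

A failed verification is still written to the manifest, so the evidence trail shows both what was attempted and what could not be confirmed.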
## The operational decision tree
When the executor runs, it follows this decision tree:
```
START
  ↓
Is today's queue item in status PENDING_PETER_REVIEW?
  NO  → STOP (nothing to publish today)
  YES ↓
Is approved_by set?
  NO  → STOP (waiting for Peter approval)
  YES ↓
Run pre-publish checklist
  FAIL → STOP + report failed checks
  PASS ↓
Execute dry-run
  Output mismatch → STOP + alert
  Match confirmed ↓
Execute live publish (Ghost)
  API error → STOP + report + flag PUBLISH_ERROR
  Success ↓
Verify Ghost URL
  Not reachable → flag + alert + wait
  Reachable ↓
Update queue: published_ghost = true, ghost_url = [url]
  ↓
Check: are X and newsletter teaser included in the approved package?
  NO  → STOP (wait for explicit package approval)
  YES ↓
Execute live X post only after Ghost is verified
  ↓
Check: are newsletter params approved and recorded?
  NO  → STOP (wait for newsletter params approval)
  YES ↓
Execute newsletter teaser send
  ↓
Create final manifest in /evidence/
  ↓
COMPLETE: all channels verified
```
Every branch that leads to STOP generates a report. No silent failures.
## What this framework does not solve
Honesty requires acknowledging the limits.
**This framework does not prevent:**
- **Wrong content that passes review** — If Peter approves incorrect content, the system publishes it. The framework ensures the content reaches Peter for review; it cannot verify that the review itself is accurate.
- **External platform outages** — If Ghost or X APIs are unavailable, the executor stops. But it cannot guarantee that a delayed publish will not create a timing issue with already-sent newsletters.
- **Model hallucinations in generated content** — The secret scan checks for credential patterns, not factual accuracy. Content review by a human remains the only control for factual claims.
**These gaps are documented, not hidden.** The framework works within its scope. Decisions outside the scope remain human decisions.
## Implementation checklist
For teams implementing this framework for the first time:
```
[ ] Define risk categories for your specific channels
[ ] Build an explicit approval package that lists included channels
[ ] Implement pre-publish checklist with automated checks
[ ] Add dry-run as default mode — live requires explicit flag
[ ] Add slug/duplicate detection before every publish
[ ] Add secret scanning before every content publish
[ ] Implement post-publish URL verification
[ ] Create evidence folder with final manifests
[ ] Document all STOP conditions and escalation paths
[ ] Test rollback procedure before first live cycle
```
Do not skip the last item. The rollback test is the one most teams postpone.
It is also the one most teams wish they had done when the first incident occurs.
## Conclusion
Automated disasters do not announce themselves in advance.
They happen at 06:00 UTC when no one is watching, because an automated system executed exactly what it was programmed to do — and what it was programmed to do was wrong.
The framework described here does not eliminate automation risk. It structures it, controls it, and ensures that when something goes wrong, the damage is contained, the recovery is documented, and the team knows exactly what happened.
That is the difference between an automated system you can trust and one you can only hope works correctly.
At ZENTRY, we do not hope. We verify.