The challenge with integrating Large Language Models (LLMs) into production workflows is not their intelligence; it is their variance. When an AI agent acts as a “Managing Editor” or “Content Creator”, how do we ensure it doesn’t just produce a great “vibe,” but also follows every strict technical requirement of the platform?
The Problem: The “Vibe Check” vs. Production Standards
In the early stages of AI development, we rely on “vibe checks”: reading an output and deciding if it looks correct. This doesn’t scale. For example, on this platform, every post must meet eight specific criteria:
- Correct directory structure (Leaf Bundles).
- Category and Tag governance compliance.
- Specific AEO/SEO frontmatter fields.
- Presence of a “Quick Answer” block.
- Presence of FAQ blocks.
- Semantic navigation (Related/Next) with valid parameters.
- Executive style (e.g. speak in the first person).
- A successful site build.
Manually checking these is tedious and error-prone. We need a way to make the agent its own most rigorous critic.
The Solution: Modular Deterministic Evals
I implemented a Deterministic Evaluation Pipeline bundled directly within the skill directory. This architecture treats an AI skill as a self-contained unit of work that includes both its instructions and its verification logic.
1. The Directory Structure
Co-locating the evals with the skill's instructions keeps the skill portable and maintainable: copy the directory, and the verification logic travels with it.
./skills/managing-editor/
├── SKILL.md          # The "Brain" (Instructions)
└── evals/            # The "Nervous System" (Verification)
    ├── runner.py     # Orchestrator
    ├── config.yaml   # Pipeline Definition
    └── checks/       # Atomic Python Validators
        ├── structure.py
        ├── frontmatter.py
        ├── aeo.py
        └── build.py

2. The Evaluation Protocol
The runner.py acts as a central orchestrator. It reads a config.yaml file to determine which checks to run and in what order. Each check is a standalone Python script that follows a simple contract:
- Input: Receives the path to the file being validated.
- Output: Returns an exit code (0 for pass, 1 for fail) and a JSON string via STDOUT.
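The contract above can be sketched as a single check. This is a minimal illustration, not one of the checks from the repository: the check name `quick_answer`, the JSON keys (`check`, `passed`, `errors`), and the rule that the block is detected by the literal phrase "Quick Answer" are all assumptions made for the example.

```python
import json
import sys
from pathlib import Path


def check_quick_answer(path: str) -> dict:
    """Validate one post; 'errors' is empty when the check passes."""
    errors = []
    try:
        text = Path(path).read_text(encoding="utf-8")
    except OSError as exc:
        # An unreadable file is reported as a failure rather than a crash.
        errors.append(f"Cannot read file: {exc}")
        text = ""
    # Assumption: the block is identified by the literal phrase "Quick Answer".
    if not errors and "quick answer" not in text.lower():
        errors.append("Missing 'Quick Answer' block")
    return {"check": "quick_answer", "passed": not errors, "errors": errors}


def main(argv: list[str]) -> int:
    """Print a JSON report on STDOUT and return the exit code (0 pass, 1 fail)."""
    if len(argv) != 2:
        result = {"check": "quick_answer", "passed": False,
                  "errors": ["usage: quick_answer.py <path-to-post>"]}
    else:
        result = check_quick_answer(argv[1])
    print(json.dumps(result))
    return 0 if result["passed"] else 1
```

Saved as a script, the file would end with `sys.exit(main(sys.argv))` under the usual `if __name__ == "__main__":` guard, so that the return value becomes the process exit code the runner reads.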
3. The Self-Correction Loop
The most powerful aspect of this framework is its integration into the agent’s workflow. The SKILL.md defines a Self-Evaluation protocol:
- Act: The agent creates the content.
- Eval: The agent runs the runner.py.
- Analyze: The agent parses the JSON results.
- Correction: If a check fails (e.g., “Missing ‘about’ field”), the agent applies a surgical fix and re-runs the eval.
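The loop above might look roughly like the following. This is a sketch under stated assumptions, not the repository's runner.py: it takes the list of check scripts directly instead of reading config.yaml, and `apply_fix` is a hypothetical callable standing in for the agent's correction step. The function names and the `max_rounds` cap are also invented for illustration.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Callable


def run_check(script: Path, target: Path) -> dict:
    """Run one check script and normalise its output into a dict."""
    proc = subprocess.run(
        [sys.executable, str(script), str(target)],
        capture_output=True, text=True,
    )
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        report = None
    if not isinstance(report, dict):
        # Output that is not a JSON object is surfaced as an error, not ignored.
        report = {"errors": [f"{script.name} did not print a JSON object"]}
    report["check"] = script.stem
    report["passed"] = proc.returncode == 0  # the exit code decides pass/fail
    return report


def run_pipeline(scripts: list[Path], target: Path) -> dict:
    """Eval + Analyze: run every check and collect the failures."""
    results = [run_check(script, target) for script in scripts]
    failures = [r for r in results if not r["passed"]]
    return {"passed": not failures, "failures": failures}


def self_correct(scripts: list[Path], target: Path,
                 apply_fix: Callable[[Path, list], None],
                 max_rounds: int = 3) -> dict:
    """Re-run the eval after each fix until it passes or the round cap is hit."""
    summary = run_pipeline(scripts, target)
    rounds = 0
    while not summary["passed"] and rounds < max_rounds:
        apply_fix(target, summary["failures"])   # Correction (hypothetical hook)
        rounds += 1
        summary = run_pipeline(scripts, target)  # re-run the eval
    summary["rounds"] = rounds
    return summary
```

Capping the number of rounds is a design choice for the sketch: it keeps an agent that cannot fix a failure from looping indefinitely, and the remaining failures are returned to the caller instead.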
Sample Implementation
You can see this pattern in action in this GitHub Repository.
For example, our build.py check doesn’t just say the build failed; it captures the build’s error output and passes it back to the LLM:
# Extract relevant error context for the agent
error_lines = [line for line in result.stderr.split('\n') if "ERROR" in line]
print(json.dumps({"errors": ["Hugo build failed"], "hugo_output": error_lines}))

Looking Ahead: The Future of Evals
While deterministic checks are perfect for structure and syntax, they struggle with “soft” metrics like tone or clarity. Our framework is designed to evolve:
- LLM-as-Judge: We could add a check script that passes a section of content to a separate LLM call to grade “readability” or “persuasiveness” against a rubric.
- Human-in-the-Loop: After the deterministic suite passes, the results could be posted as a comment on the PR for final human verification.
- Hybrid Pipelines: Combining regex-based structure checks with embedding-based similarity checks to ensure content remains on-topic.
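As a rough illustration of the hybrid idea, the sketch below pairs a regex structure check with a similarity score. It makes several assumptions: word-count vectors stand in for real embeddings so the example stays dependency-free, the FAQ block is assumed to be a Markdown heading containing "FAQ", and the `0.2` threshold is an arbitrary placeholder, not a tuned value.

```python
import math
import re
from collections import Counter

# Assumption: an FAQ block is introduced by a Markdown heading containing "FAQ".
FAQ_HEADING = re.compile(r"^#+\s.*\bFAQ\b", re.IGNORECASE | re.MULTILINE)


def word_vector(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hybrid_check(content: str, topic: str, threshold: float = 0.2) -> dict:
    """Combine a deterministic structure check with a soft on-topic check."""
    errors = []
    if not FAQ_HEADING.search(content):
        errors.append("Missing FAQ heading")
    score = cosine_similarity(word_vector(content), word_vector(topic))
    if score < threshold:
        errors.append(f"Content drifts from topic (similarity {score:.2f})")
    return {"passed": not errors, "errors": errors, "similarity": round(score, 3)}
```

Swapping `word_vector` for a call to an embedding model would leave the rest of the check unchanged, and the returned dictionary could be printed as JSON to fit the same pass/fail contract as the deterministic checks.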
By moving from “vibe-checking” to “automated verification,” we transform AI agents from helpful assistants into reliable production-grade operators.


