
Deterministic Evals: Hardening AI Skills for Production

John Januszczak
Bridging technology, capital, and leadership for the next generation of transformative ventures

The challenge with integrating Large Language Models (LLMs) into production workflows is not their intelligence; it is their variance. When an AI agent acts as a “Managing Editor” or “Content Creator”, how do we ensure it doesn’t just produce a great “vibe,” but also follows every strict technical requirement of the platform?

Quick Answer
Evaluation frameworks (evals) are quality gates that verify AI outputs against specific rules and rubrics. By bundling a modular pipeline of Python-based validators directly with an AI agent skill, we can automatically enforce governance, formatting and style, and build standards, so the agent can self-correct before human review.

The Problem: The “Vibe Check” vs. Production Standards

In the early stages of AI development, we rely on “vibe checks”: reading an output and deciding if it looks correct. This doesn’t scale. For example, on this platform, every post must meet eight specific criteria:

  1. Correct directory structure (Leaf Bundles).
  2. Category and Tag governance compliance.
  3. Specific AEO/SEO frontmatter fields.
  4. Presence of a “Quick Answer” block.
  5. Presence of FAQ blocks.
  6. Semantic navigation (Related/Next) with valid parameters.
  7. Executive style (e.g. speak in the first person).
  8. A successful site build.

Manually checking these is tedious and error-prone. We need a way to make the agent its own most rigorous critic.

The Solution: Modular Deterministic Evals

I implemented a Deterministic Evaluation Pipeline bundled directly within the skill directory. This architecture treats an AI skill as a self-contained unit of work that includes both its instructions and its verification logic.

1. The Directory Structure

By co-locating the evals, we ensure the skill is portable and maintainable.

./skills/managing-editor/
├── SKILL.md                 # The "Brain" (Instructions)
└── evals/                   # The "Nervous System" (Verification)
    ├── runner.py            # Orchestrator
    ├── config.yaml          # Pipeline Definition
    └── checks/              # Atomic Python Validators
        ├── structure.py
        ├── frontmatter.py
        ├── aeo.py
        └── build.py

2. The Evaluation Protocol

runner.py acts as the central orchestrator. It reads config.yaml to determine which checks to run and in what order. Each check is a standalone Python script that follows a simple contract:

  • Input: Receives the path to the file being validated.
  • Output: Exits with code 0 for pass or 1 for fail, and prints a JSON result to STDOUT.
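To make the contract concrete, here is a minimal sketch of what one check might look like. It is not taken from the repository: the function names, the plain-text "Quick Answer" marker, and the `{"errors": [...]}` payload shape (borrowed from the build example later in this post) are all assumptions for illustration.

```python
import json
from pathlib import Path


def check_quick_answer(text: str) -> list[str]:
    """Return error messages; an empty list means the check passed."""
    errors = []
    # Assumption: the block is identifiable by the literal text "Quick Answer".
    if "Quick Answer" not in text:
        errors.append("Missing 'Quick Answer' block")
    return errors


def main(argv: list[str]) -> int:
    """Validate the file at argv[1], print JSON to STDOUT, return the exit code."""
    if len(argv) != 2:
        print(json.dumps({"errors": ["usage: check.py <path-to-file>"]}))
        return 1
    path = Path(argv[1])
    try:
        text = path.read_text(encoding="utf-8")
    except OSError as exc:
        print(json.dumps({"errors": [f"Cannot read {path}: {exc}"]}))
        return 1
    errors = check_quick_answer(text)
    print(json.dumps({"errors": errors}))
    return 1 if errors else 0

# A real check script would end with: sys.exit(main(sys.argv))
```

Keeping the rule in a pure function (`check_quick_answer`) and the I/O in `main` means the rule can be unit-tested without touching the file system.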

3. The Self-Correction Loop

The most powerful aspect of this framework is its integration into the agent’s workflow. The SKILL.md defines a Self-Evaluation protocol:

  1. Act: The agent creates the content.
  2. Eval: The agent runs runner.py.
  3. Analyze: The agent parses the JSON results.
  4. Correct: If a check fails (e.g., “Missing ‘about’ field”), the agent applies a surgical fix and re-runs the eval.
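The loop above can be sketched in a few lines of Python. This is an illustration rather than the skill's actual logic: `run_evals` and `apply_fix` are hypothetical callables standing in for "run the runner" and "let the agent edit the file", the `{"errors": [...]}` payload shape is assumed, and the `max_fixes` cap is my own addition to keep the loop from running forever.

```python
import json
from typing import Callable


def self_correct(
    run_evals: Callable[[], tuple[int, str]],
    apply_fix: Callable[[list[str]], None],
    max_fixes: int = 3,
) -> bool:
    """Run evals, hand failures to the fixer, and repeat until pass or budget spent."""
    for attempt in range(max_fixes + 1):
        exit_code, stdout = run_evals()  # Eval
        if exit_code == 0:
            return True
        if attempt < max_fixes:
            errors = json.loads(stdout).get("errors", [])  # Analyze
            apply_fix(errors)  # Correct, then loop back to Eval
    return False  # still failing after max_fixes corrections
```

In practice, `run_evals` might wrap a `subprocess.run` call that invokes runner.py and returns its exit code and STDOUT. When the function returns `False`, the failure can be escalated to a human instead of being retried indefinitely.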

Sample Implementation

You can see this pattern in action in this GitHub Repository.

For example, our build.py check doesn’t just say the build failed; it captures the build’s error output and passes it back to the LLM:

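# `result` is assumed to be the subprocess.CompletedProcess from the Hugo build,
# captured in text mode so that result.stderr is a str.
# Extract relevant error context for the agent
error_lines = [line for line in result.stderr.splitlines() if "ERROR" in line]
print(json.dumps({"errors": ["Hugo build failed"], "hugo_output": error_lines}))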
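The fragment above relies on a `result` object that is not shown. As a sketch of how the surrounding check could be put together, the version below runs a build command with `subprocess.run` and applies the same filtering. The function names, the handling of a missing executable, and the choice to pass the command in as a parameter are assumptions, not the repository's code.

```python
import json
import subprocess


def run_build_check(command: list[str]) -> tuple[int, dict]:
    """Run the build command and return (exit_code, JSON-serializable payload)."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        return 1, {"errors": [f"Build command not found: {command[0]}"]}
    if result.returncode == 0:
        return 0, {"errors": []}
    # Extract relevant error context for the agent
    error_lines = [line for line in result.stderr.splitlines() if "ERROR" in line]
    return 1, {"errors": ["Hugo build failed"], "hugo_output": error_lines}


def main() -> int:
    """Entry point: build with the `hugo` CLI and report the result on STDOUT."""
    exit_code, payload = run_build_check(["hugo"])
    print(json.dumps(payload))
    return exit_code

# A real check script would end with: sys.exit(main())
```

Taking the command as a parameter keeps the check testable on machines where Hugo is not installed.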

Looking Ahead: The Future of Evals

While deterministic checks are well suited to structure and syntax, they struggle with “soft” metrics like tone or clarity. Our framework is designed to evolve:

  • LLM-as-Judge: We could add a check script that passes a section of content to a separate LLM call to grade “readability” or “persuasiveness” against a rubric.
  • Human-in-the-Loop: After the deterministic suite passes, the results could be posted as a comment on the PR for final human verification.
  • Hybrid Pipelines: Combining regex-based structure checks with embedding-based similarity checks to ensure content remains on-topic.
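As a rough illustration of the LLM-as-Judge idea, a graded check could be reduced to the same pass/fail shape by applying a threshold to the judge's score. Everything here is hypothetical: `judge` is a placeholder callable standing in for a real model call, and the 0-to-1 score scale, the default threshold, and the payload fields are assumptions.

```python
from typing import Callable


def judge_check(
    text: str,
    judge: Callable[[str, str], float],
    rubric: str,
    threshold: float = 0.7,
) -> tuple[int, dict]:
    """Grade `text` against `rubric` and map the score to an exit code and payload."""
    # Assumption: judge(rubric, text) returns a score between 0.0 and 1.0.
    score = judge(rubric, text)
    if score >= threshold:
        return 0, {"errors": [], "score": score}
    message = f"Score {score:.2f} is below threshold {threshold:.2f} for rubric: {rubric}"
    return 1, {"errors": [message], "score": score}
```

A check like this is no longer deterministic, since the same input can score differently across runs, so it would make sense to run it only after the deterministic suite has passed.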

By moving from “vibe-checking” to “automated verification,” we transform AI agents from helpful assistants into reliable production-grade operators.

Frequently Asked Questions

What is a 'deterministic' eval?

A deterministic evaluation is a test with a binary (pass/fail) outcome based on hard rules, such as the presence of a specific string, a valid file path, or a successful software build. Unlike a “vibe check,” it returns the same result every time it runs on the same input, so it is consistent and reproducible.

Why bundle evals directly with the AI skill?

Bundling evals with the skill ensures that the verification logic is always available wherever the skill is used, creating a self-contained, portable, and production-ready unit of automation.