The eval system helps you find and fix weaknesses in your skills before deploying them.

The eval workflow

Common issues and fixes

Low completeness score

The skill does not cover all aspects of the task. Fix: Ask yourself “what would a new team member need to know?” Add missing steps, edge cases, and error handling instructions.

Low clarity score

Instructions are ambiguous or could be interpreted multiple ways. Fix: Replace vague language with specific actions:
  • Before: "Handle errors appropriately" → After: "Wrap database calls in try/catch and return a 500 status with the error message"
  • Before: "Use good naming" → After: "Use camelCase for variables, PascalCase for components, UPPER_SNAKE for constants"
  • Before: "Follow best practices" → After: remove it entirely; it adds no information

Low constraints score

The skill does not set clear boundaries. Fix: Add specific, measurable constraints:
  • File size limits
  • Naming conventions
  • Forbidden patterns
  • Required dependencies
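One way to make such constraints measurable is to state them as concrete values rather than prose. A sketch, with every value and pattern chosen purely for illustration:

```typescript
// Hypothetical constraint set; the specific numbers, patterns, and package
// names are examples, not recommendations from the eval system.
const constraints = {
  maxFileLines: 300,                  // file size limit
  variableCase: "camelCase",          // naming convention
  forbiddenPatterns: [/\bany\b/],     // forbidden patterns (e.g. TypeScript `any`)
  requiredDependencies: ["zod"],      // required dependencies
};
```

Constraints written this way can be checked mechanically, which also raises the verification score discussed below.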

Low verification score

The agent cannot check its own work. Fix: Add a verification checklist with items that can be independently checked. Each item should be binary: it either passes or fails.
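Binary checklist items can be thought of as named predicates over the agent's output: each one passes or fails with no partial credit. A minimal sketch, where both example checks are invented for illustration:

```typescript
// Each checklist item is a predicate that either passes or fails.
type Check = { name: string; pass: (output: string) => boolean };

// Hypothetical example checks, not items from a real skill.
const checklist: Check[] = [
  { name: "mentions a three-digit status code", pass: (o) => /\b\d{3}\b/.test(o) },
  { name: "under 500 characters", pass: (o) => o.length < 500 },
];

// Run every check independently and report a binary result for each.
function verify(output: string): { name: string; passed: boolean }[] {
  return checklist.map((c) => ({ name: c.name, passed: c.pass(output) }));
}
```

An item like "the response is well written" cannot be expressed this way, which is exactly the signal that it should be rewritten or dropped.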

Low context score

The skill assumes knowledge the agent may not have. Fix: Add background sections explaining domain concepts, project conventions, or architectural decisions that inform the instructions.

Low structure score

The skill is poorly organized or uses inconsistent formatting. Fix: Follow the standard section order: Overview, Instructions, Example Prompts. Use consistent heading levels and list formatting.

Tracking improvements

The Evals view shows sparkline trends per skill. Use these to:
  • Verify that edits actually improved scores
  • Catch regressions when updating skills
  • Compare quality across your skill library
Last modified on March 19, 2026