# The eval workflow

## Common issues and fixes

### Low completeness score
The skill does not cover all aspects of the task. Fix: Ask yourself “what would a new team member need to know?” Add missing steps, edge cases, and error-handling instructions.

### Low clarity score

Instructions are ambiguous or could be interpreted multiple ways. Fix: Replace vague language with specific actions:

| Before | After |
|---|---|
| "Handle errors appropriately" | "Wrap database calls in try/catch and return a 500 status with the error message" |
| "Use good naming" | "Use camelCase for variables, PascalCase for components, UPPER_SNAKE for constants" |
| "Follow best practices” | Remove entirely, this adds no information |
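The "After" instruction in the first row is specific enough to implement directly. A minimal sketch of the pattern it describes, with a hypothetical `getUser` function standing in for a real database client:

```typescript
// Hypothetical database call; stands in for whatever client the project uses.
async function getUser(id: string): Promise<{ id: string; name: string }> {
  if (id === "missing") throw new Error("user not found");
  return { id, name: "Ada" };
}

// The rewritten instruction, applied: wrap the database call in try/catch
// and return a 500 status with the error message on failure.
async function handleGetUser(
  id: string
): Promise<{ status: number; body: unknown }> {
  try {
    const user = await getUser(id);
    return { status: 200, body: user };
  } catch (err) {
    return { status: 500, body: { error: (err as Error).message } };
  }
}
```

Because the instruction names the wrapper, the status code, and the payload shape, two agents following it independently will produce the same kind of handler; "handle errors appropriately" guarantees no such agreement.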
### Low constraints score

The skill does not set clear boundaries. Fix: Add specific, measurable constraints:

- File size limits
- Naming conventions
- Forbidden patterns
- Required dependencies
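A constraints section covering these four categories might look like the following sketch; the specific limits, names, and libraries are illustrative, not recommendations:

```markdown
## Constraints

- Keep each component file under 200 lines.
- Name hooks `useXxx` and test files `Xxx.test.tsx`.
- Never use `any`; prefer `unknown` plus a type guard.
- Use `date-fns` for dates; do not add `moment`.
```

Each line is measurable: a reviewer (or the agent itself) can check it with a line count, a filename glob, or a grep, with no judgment call required.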
### Low verification score

The agent cannot check its own work. Fix: Add a verification checklist with items that can be independently checked. Each item should be binary: it either passes or fails.

### Low context score
The skill assumes knowledge the agent may not have. Fix: Add background sections explaining domain concepts, project conventions, or architectural decisions that inform the instructions.

### Low structure score
The skill is poorly organized or uses inconsistent formatting. Fix: Follow the standard section order: Overview, Instructions, Example Prompts. Use consistent heading levels and list formatting.

## Tracking improvements
The Evals view shows sparkline trends per skill. Use these to:

- Verify that edits actually improved scores
- Catch regressions when updating skills
- Compare quality across your skill library