# The eval workflow

## Common issues and fixes

### Low completeness score
The skill does not cover all aspects of the task. Fix: Ask yourself “what would a new team member need to know?” Add missing steps, edge cases, and error-handling instructions.

### Low clarity score

Instructions are ambiguous or could be interpreted multiple ways. Fix: Replace vague language with specific actions:

| Before | After |
|---|---|
| "Handle errors appropriately" | "Wrap database calls in try/catch and return a 500 status with the error message" |
| "Use good naming" | "Use camelCase for variables, PascalCase for components, UPPER_SNAKE for constants" |
| "Follow best practices” | Remove entirely, this adds no information |
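The "After" instruction in the first row is specific enough to implement directly. A minimal sketch of the pattern it describes, with a hypothetical `getUser` function standing in for a real database client:

```typescript
// Hypothetical database call; stands in for whatever client the project uses.
async function getUser(id: string): Promise<{ id: string; name: string }> {
  if (id === "missing") throw new Error("user not found");
  return { id, name: "Ada" };
}

// The rewritten instruction, applied: wrap the database call in try/catch
// and return a 500 status with the error message on failure.
async function handleGetUser(
  id: string
): Promise<{ status: number; body: unknown }> {
  try {
    const user = await getUser(id);
    return { status: 200, body: user };
  } catch (err) {
    return { status: 500, body: { error: (err as Error).message } };
  }
}
```

Because the instruction names the wrapper, the status code, and the payload shape, two agents following it independently will produce the same kind of handler; "handle errors appropriately" guarantees no such agreement.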
### Low constraints score

The skill does not set clear boundaries. Fix: Add specific, measurable constraints:

- File size limits
- Naming conventions
- Forbidden patterns
- Required dependencies
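A constraints section covering these four categories might look like the following sketch; the specific limits, names, and libraries are illustrative, not recommendations:

```markdown
## Constraints

- Keep each component file under 200 lines.
- Name hooks `useXxx` and test files `Xxx.test.tsx`.
- Never use `any`; prefer `unknown` plus a type guard.
- Use `date-fns` for dates; do not add `moment`.
```

Each line is measurable: a reviewer (or the agent itself) can check it with a line count, a filename glob, or a grep, with no judgment call required.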
### Low verification score

The agent cannot check its own work. Fix: Add a verification checklist with items that can be independently checked. Each item should be binary: it either passes or fails.

### Low context score
The skill assumes knowledge the agent may not have. Fix: Add background sections explaining domain concepts, project conventions, or architectural decisions that inform the instructions.

### Low structure score
The skill is poorly organized or uses inconsistent formatting. Fix: Follow the standard section order: Overview, Instructions, Example Prompts. Use consistent heading levels and list formatting.

## Tracking improvements
The Evals view shows sparkline trends per skill. Use these to:

- Verify that edits actually improved scores
- Catch regressions when updating skills
- Compare quality across your skill library