## Quality dimensions
| Dimension | What it measures |
|---|---|
| Structure | Is the skill well-organized with clear sections? |
| Content | Does the skill cover the task with sufficient depth? |
| Evidence | Are claims backed by references or links? |
| Usage | Does the skill include example prompts and usage patterns? |
| Toolchain | Are tool requirements and allowed-tools specified? |
| Freshness | Is the content current and recently maintained? |
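The dimensions above can be pictured as independent per-skill scores rolled up into an overall score. The sketch below is illustrative only: the dimension names come from the table, but the `EvalResult` class, 0.0–1.0 scale, and unweighted mean are assumptions, not the app's actual implementation.

```python
# Illustrative sketch of per-dimension scoring; the class, scale, and
# unweighted mean are assumptions for explanation, not the real internals.
from dataclasses import dataclass

DIMENSIONS = ["structure", "content", "evidence", "usage", "toolchain", "freshness"]

@dataclass
class EvalResult:
    scores: dict[str, float]  # one score per dimension, 0.0-1.0

    @property
    def overall(self) -> float:
        # Simple unweighted mean across all dimensions.
        return sum(self.scores.values()) / len(self.scores)

result = EvalResult(scores={d: 0.8 for d in DIMENSIONS})
print(round(result.overall, 2))  # 0.8
```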
## Running evals
- Open a skill in the Editor or Library
- Navigate to the Evals tab
- Click Run evaluation
- Review the scores and any flagged issues
Evals run locally. No data is sent to external services.
## Score history
Every eval run is recorded with a timestamp. The Evals view shows:
- Score trends. The latest five scores per skill, shown as a text sequence.
- Run history. Full list of past runs with scores and timestamps.
- Run heatmap. Visual overview of agent skill usage (success/failure) over time. This tracks when agents used the skill, not eval runs.
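The score-trend view described above can be sketched as a small history structure: record each run with a timestamp, then render the latest five scores as a text sequence. This is a hypothetical sketch; `ScoreHistory`, its field names, and the `->` separator are assumptions, not the app's real data model.

```python
# Hypothetical sketch of score history tracking; class and field names
# are illustrative assumptions.
from datetime import datetime, timezone

class ScoreHistory:
    def __init__(self):
        self.runs = []  # (timestamp, score) tuples, oldest first

    def record(self, score: float) -> None:
        # Each eval run is stored with a UTC timestamp.
        self.runs.append((datetime.now(timezone.utc), score))

    def trend(self, n: int = 5) -> str:
        # Latest n scores rendered as a text sequence, oldest to newest.
        latest = self.runs[-n:]
        return " -> ".join(f"{s:.1f}" for _, s in latest)

history = ScoreHistory()
for s in [0.6, 0.7, 0.7, 0.8, 0.9, 0.9]:
    history.record(s)
print(history.trend())  # 0.7 -> 0.7 -> 0.8 -> 0.9 -> 0.9
```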
## Improvement suggestions
When a dimension scores below threshold, the eval system provides specific suggestions:
- Missing overview section
- Short or thin content that needs expansion
- No concrete examples with input/output
- No references or evidence links
- Missing tags in frontmatter
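Threshold-based suggestions like those above can be sketched as a simple lookup from low-scoring dimensions to remediation text. The mapping, the `0.6` cutoff, and the `suggest` helper below are all assumptions for illustration; the actual suggestion logic is not specified by this document.

```python
# Hypothetical sketch: map dimensions scoring below an assumed cutoff
# to improvement suggestions. Mapping and threshold are illustrative.
THRESHOLD = 0.6

SUGGESTIONS = {
    "structure": "Add a clear overview section",
    "content": "Expand short or thin content",
    "usage": "Add concrete examples with input/output",
    "evidence": "Add references or evidence links",
    "toolchain": "Specify tool requirements and allowed-tools",
}

def suggest(scores: dict[str, float]) -> list[str]:
    """Return one suggestion per dimension scoring below threshold."""
    return [SUGGESTIONS[d] for d, s in scores.items()
            if s < THRESHOLD and d in SUGGESTIONS]

print(suggest({"structure": 0.4, "content": 0.9, "evidence": 0.5}))
# ['Add a clear overview section', 'Add references or evidence links']
```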