The Evals system scores your skills against 6 quality dimensions and tracks scores over time so you can catch regressions.

Quality dimensions

  • Structure. Is the skill well-organized with clear sections?
  • Content. Does the skill cover the task with sufficient depth?
  • Evidence. Are claims backed by references or links?
  • Usage. Does the skill include example prompts and usage patterns?
  • Toolchain. Are tool requirements and allowed-tools specified?
  • Freshness. Is the content current and recently maintained?
Each dimension is scored independently. The overall score is a weighted aggregate.
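As a rough illustration of how independent dimension scores could combine into a weighted aggregate, here is a minimal sketch. The weight values, function name, and 0–100 scale are assumptions for illustration, not the tool's actual weighting.

```python
# Hypothetical sketch: combining per-dimension scores into an overall score.
# The weights below are illustrative only, not the tool's real weighting.
WEIGHTS = {
    "structure": 0.20,
    "content":   0.25,
    "evidence":  0.15,
    "usage":     0.15,
    "toolchain": 0.10,
    "freshness": 0.15,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted aggregate of independent dimension scores (0-100 scale).

    Normalizes by the total weight of the dimensions actually present,
    so skills missing a dimension are not penalized for it here.
    """
    total_weight = sum(WEIGHTS[dim] for dim in scores)
    return sum(scores[dim] * WEIGHTS[dim] for dim in scores) / total_weight

# A uniform input maps to the same overall value:
print(round(overall_score({dim: 80.0 for dim in WEIGHTS}), 1))  # 80.0
```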

Running evals

  1. Open a skill in the Editor or Library
  2. Navigate to the Evals tab
  3. Click Run evaluation
  4. Review the scores and any flagged issues
Evals run locally. No data is sent to external services.
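Conceptually, a local evaluation pass is just a set of checks applied to the skill's text on your machine. The sketch below is a hypothetical stand-in for the real scoring logic (the function name and the individual checks are illustrative assumptions); it only demonstrates the local, no-network shape of a run.

```python
# Hypothetical sketch of a local eval pass over a skill's text.
# Each check is an illustrative stand-in for the real scoring logic;
# nothing leaves the machine.
def run_evaluation(skill_text: str) -> dict[str, float]:
    """Score a few example dimensions with simple pass/fail heuristics."""
    checks = {
        "structure": "#" in skill_text,                 # has headings
        "content":  len(skill_text.split()) > 50,       # non-trivial length
        "evidence": "http" in skill_text,               # links present
        "usage":    "example" in skill_text.lower(),    # example prompts
    }
    return {dim: 100.0 if passed else 0.0 for dim, passed in checks.items()}
```

A real evaluator would grade on a gradient rather than pass/fail, but the flow is the same: read the skill, score each dimension, flag what fell short.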

Score history

Every eval run is recorded with a timestamp. The Evals view shows:
  • Score trends. The latest five scores per skill, shown as a text sequence.
  • Run history. Full list of past runs with scores and timestamps.
  • Run heatmap. Visual overview of agent skill usage (success/failure) over time. This tracks when agents used the skill, not eval runs.
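The run history and trend display above can be sketched as a timestamped record plus a formatter that renders the latest five scores as a text sequence. The `EvalRun` type, `trend` function, and arrow separator are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical sketch: recording eval runs and rendering the latest five
# scores per skill as a text sequence. Names and output format are assumed.
@dataclass
class EvalRun:
    skill: str
    score: float
    timestamp: datetime

def trend(runs: list[EvalRun], skill: str, n: int = 5) -> str:
    """Render the n most recent scores for a skill, oldest to newest."""
    recent = sorted((r for r in runs if r.skill == skill),
                    key=lambda r: r.timestamp)[-n:]
    return " -> ".join(f"{r.score:.0f}" for r in recent)

runs = [EvalRun("my-skill", 70.0 + i, datetime(2026, 1, i + 1, tzinfo=timezone.utc))
        for i in range(7)]
print(trend(runs, "my-skill"))  # 72 -> 73 -> 74 -> 75 -> 76
```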

Improvement suggestions

When a dimension scores below its threshold, the eval system provides specific suggestions, such as:
  • Missing overview section
  • Short or thin content that needs expansion
  • No concrete examples with input/output
  • No references or evidence links
  • Missing tags in frontmatter
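The suggestion logic above amounts to mapping low-scoring dimensions to fixes. In this hypothetical sketch, the threshold value and the dimension-to-suggestion pairing are illustrative assumptions, not the tool's actual rules.

```python
# Hypothetical sketch: map dimensions scoring below a threshold to the
# suggestions listed above. Threshold and pairing are illustrative only.
THRESHOLD = 60.0

SUGGESTIONS = {
    "structure": "Missing overview section",
    "content":   "Short or thin content that needs expansion",
    "usage":     "No concrete examples with input/output",
    "evidence":  "No references or evidence links",
}

def suggest(scores: dict[str, float]) -> list[str]:
    """Return improvement suggestions for dimensions below THRESHOLD."""
    return [SUGGESTIONS[dim] for dim, score in scores.items()
            if score < THRESHOLD and dim in SUGGESTIONS]

print(suggest({"content": 40.0, "structure": 90.0}))
# ['Short or thin content that needs expansion']
```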
Last modified on March 19, 2026