The Evals system scores your skills against 6 quality dimensions and tracks scores over time so you can catch regressions.

Quality dimensions

  • Structure. Is the skill well-organized with clear sections?
  • Content. Does the skill cover the task with sufficient depth?
  • Evidence. Are claims backed by references or links?
  • Usage. Does the skill include example prompts and usage patterns?
  • Toolchain. Are tool requirements and allowed-tools specified?
  • Freshness. Is the content current and recently maintained?
Each dimension is scored independently. The overall score is a weighted aggregate.
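As a rough illustration of how independent dimension scores could combine into a weighted aggregate, here is a minimal sketch. The weight values, function name, and 0–100 scale are assumptions for illustration, not the tool's actual weighting.

```python
# Hypothetical sketch: combining per-dimension scores into an overall score.
# The weights below are illustrative only, not the tool's real weighting.
WEIGHTS = {
    "structure": 0.20,
    "content":   0.25,
    "evidence":  0.15,
    "usage":     0.15,
    "toolchain": 0.10,
    "freshness": 0.15,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted aggregate of independent dimension scores (0-100 scale).

    Normalizes by the total weight of the dimensions actually present,
    so skills missing a dimension are not penalized for it here.
    """
    total_weight = sum(WEIGHTS[dim] for dim in scores)
    return sum(scores[dim] * WEIGHTS[dim] for dim in scores) / total_weight

# A uniform input maps to the same overall value:
print(round(overall_score({dim: 80.0 for dim in WEIGHTS}), 1))  # 80.0
```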

Running evals

  1. Open a skill in the Editor or Library
  2. Navigate to the Evals tab
  3. Click Run evaluation
  4. Review the scores and any flagged issues
Evals run locally. No data is sent to external services.
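Conceptually, a local evaluation pass is just a set of checks applied to the skill's text on your machine. The sketch below is a hypothetical stand-in for the real scoring logic (the function name and the individual checks are illustrative assumptions); it only demonstrates the local, no-network shape of a run.

```python
# Hypothetical sketch of a local eval pass over a skill's text.
# Each check is an illustrative stand-in for the real scoring logic;
# nothing leaves the machine.
def run_evaluation(skill_text: str) -> dict[str, float]:
    """Score a few example dimensions with simple pass/fail heuristics."""
    checks = {
        "structure": "#" in skill_text,                 # has headings
        "content":  len(skill_text.split()) > 50,       # non-trivial length
        "evidence": "http" in skill_text,               # links present
        "usage":    "example" in skill_text.lower(),    # example prompts
    }
    return {dim: 100.0 if passed else 0.0 for dim, passed in checks.items()}
```

A real evaluator would grade on a gradient rather than pass/fail, but the flow is the same: read the skill, score each dimension, flag what fell short.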

Score history

Every eval run is recorded with a timestamp. The Evals view shows:
  • Score trends. The latest five scores per skill, shown as a text sequence.
  • Run history. Full list of past runs with scores and timestamps.
  • Run heatmap. Visual overview of agent skill usage (success/failure) over time. This tracks when agents used the skill, not eval runs.
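The run history and trend display above can be sketched as a timestamped record plus a formatter that renders the latest five scores as a text sequence. The `EvalRun` type, `trend` function, and arrow separator are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical sketch: recording eval runs and rendering the latest five
# scores per skill as a text sequence. Names and output format are assumed.
@dataclass
class EvalRun:
    skill: str
    score: float
    timestamp: datetime

def trend(runs: list[EvalRun], skill: str, n: int = 5) -> str:
    """Render the n most recent scores for a skill, oldest to newest."""
    recent = sorted((r for r in runs if r.skill == skill),
                    key=lambda r: r.timestamp)[-n:]
    return " -> ".join(f"{r.score:.0f}" for r in recent)

runs = [EvalRun("my-skill", 70.0 + i, datetime(2026, 1, i + 1, tzinfo=timezone.utc))
        for i in range(7)]
print(trend(runs, "my-skill"))  # 72 -> 73 -> 74 -> 75 -> 76
```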

Improvement suggestions

When a dimension scores below its threshold, the eval system provides specific suggestions, such as:
  • Missing overview section
  • Short or thin content that needs expansion
  • No concrete examples with input/output
  • No references or evidence links
  • Missing tags in frontmatter
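The suggestion logic above amounts to mapping low-scoring dimensions to fixes. In this hypothetical sketch, the threshold value and the dimension-to-suggestion pairing are illustrative assumptions, not the tool's actual rules.

```python
# Hypothetical sketch: map dimensions scoring below a threshold to the
# suggestions listed above. Threshold and pairing are illustrative only.
THRESHOLD = 60.0

SUGGESTIONS = {
    "structure": "Missing overview section",
    "content":   "Short or thin content that needs expansion",
    "usage":     "No concrete examples with input/output",
    "evidence":  "No references or evidence links",
}

def suggest(scores: dict[str, float]) -> list[str]:
    """Return improvement suggestions for dimensions below THRESHOLD."""
    return [SUGGESTIONS[dim] for dim, score in scores.items()
            if score < THRESHOLD and dim in SUGGESTIONS]

print(suggest({"content": 40.0, "structure": 90.0}))
# ['Short or thin content that needs expansion']
```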
Last modified on March 19, 2026