NeurIPS 2026 · under double-blind review · artifacts release

Every report. Every task. Every run.

Browse the full evaluation surface that produced the SkillLens paper: per-skill safety and effectiveness results across eight harness × model runs, the constructed judge-item task suite, and (soon) replayable execution traces.

Evaluation surface across harnesses and models.

Each row is a single run: one harness driving one model along one axis (utility, security, or both). The columns are the same set of 226–227 skills in every row. Cells link to that run's per-skill results.

How a skill becomes a task suite.

Every skill is exercised by three capability-targeted scenarios — U1, U2, U3 — each with 5–6 binary judge items. Pick a skill in the explorer above, then expand the panel below to see the full judge sheet that scored it.
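As a rough illustration of the structure described above, the sketch below models a skill's task suite as three scenarios, each holding 5–6 binary judge items. The class names, the `item_id` field, and the pass-rate aggregation are assumptions made for this example; the released judge sheets may use a different layout and a different scoring rule.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeItem:
    item_id: str   # hypothetical identifier, not taken from the artifacts
    passed: bool   # binary verdict for this item


@dataclass
class Scenario:
    name: str                                    # "U1", "U2", or "U3"
    items: list[JudgeItem] = field(default_factory=list)

    def pass_rate(self) -> float:
        """Fraction of judge items that passed (an assumed aggregation)."""
        if not self.items:
            raise ValueError(f"scenario {self.name} has no judge items")
        return sum(item.passed for item in self.items) / len(self.items)


def validate_suite(scenarios: list[Scenario]) -> None:
    """Check the shape stated in the text: U1, U2, U3 with 5-6 items each."""
    names = [s.name for s in scenarios]
    if names != ["U1", "U2", "U3"]:
        raise ValueError(f"expected scenarios U1, U2, U3 but got {names}")
    for s in scenarios:
        if not 5 <= len(s.items) <= 6:
            raise ValueError(f"{s.name} has {len(s.items)} items, expected 5-6")


# Made-up verdicts, only to exercise the structure.
suite = [
    Scenario("U1", [JudgeItem(f"U1-{i}", passed=True) for i in range(5)]),
    Scenario("U2", [JudgeItem(f"U2-{i}", passed=(i < 3)) for i in range(6)]),
    Scenario("U3", [JudgeItem(f"U3-{i}", passed=False) for i in range(5)]),
]
validate_suite(suite)
rates = {s.name: s.pass_rate() for s in suite}
```

With these invented verdicts, `rates` maps each scenario name to the share of its judge items that passed.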

Replayable traces — coming soon.

Per-scenario sandbox traces (file-system diff, outbound network log, agent step log) will be added to this repository before the camera-ready version. Each entry below will become a clickable timeline.

U1 / wi_skills        pending upload    fs.diff · net.log · steps.jsonl
U1 / wo_skills        pending upload    fs.diff · net.log · steps.jsonl
U2 · U3 · security    pending upload    paired wi / wo · matched seeds

Cite + download

BibTeX · index.json · per-skill bundles

BibTeX

@inproceedings{skilllens2026,
  title     = {SkillLens: From Task-First Evaluation to
               Skill-Centered Assessment of Agent Skill Packages},
  author    = {Anonymous Author(s)},
  booktitle = {Submitted to the 40th Conference on Neural
               Information Processing Systems (NeurIPS)},
  year      = {2026},
  note      = {Under review. Do not distribute.}
}

Reproduce

Long-format CSV (one row per skill × run) for direct loading into pandas / R / DuckDB. Per-skill JSON files contain every judge item and every (run × scenario) result, sanitized for double-blind review.
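A minimal sketch of working with a long-format table of this kind, where each row is one skill × run pair. The column names (`run_id`, `skill_id`, `utility_score`) and the sample values are hypothetical placeholders, not the real 28-column schema; check the header of aggregate.csv for the actual names. The example uses only the standard library and an in-memory sample so it runs without the download.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in long format: one row per skill x run.
SAMPLE = """run_id,skill_id,utility_score
run_a,skill_001,0.8
run_a,skill_002,0.5
run_b,skill_001,0.6
"""


def rows_by_run(handle):
    """Group long-format rows by their run identifier."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["run_id"]].append(row)
    return dict(grouped)


grouped = rows_by_run(io.StringIO(SAMPLE))
counts = {run: len(rows) for run, rows in grouped.items()}
```

To read the real file, the same function could be called on `open("aggregate.csv", newline="")`; in pandas, `pandas.read_csv("aggregate.csv")` followed by a `groupby` on the run column would give an equivalent view.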

Download: aggregate.csv · index.json · stats.json · checksums.txt

1,812 rows × 28 columns · schema skilllens_artifacts_v1 · frozen 2026-05-07. Verify the downloaded files with sha256sum -c checksums.txt.
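Where `sha256sum` is not available, the same check can be sketched in Python. This assumes checksums.txt follows the usual `sha256sum` output format (a hex digest, whitespace, then a file name, optionally prefixed with `*`) and that the listed files sit next to the manifest; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(manifest: Path) -> dict[str, bool]:
    """Return {file name: matches?} for every entry in the manifest."""
    results = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # binary-mode marker used by sha256sum
        actual = sha256_of(manifest.parent / name)
        results[name] = actual == expected.lower()
    return results
```

Calling `verify_checksums(Path("checksums.txt"))` returns a mapping from each listed file to `True` or `False`; a listed file that is missing raises `FileNotFoundError` rather than being reported as a mismatch.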