Every report. Every task. Every run.
Browse the full evaluation surface that produced the SkillLens paper: per-skill safety and effectiveness results across eight harness × model runs, the constructed judge-item task suite, and (soon) replayable execution traces.
Evaluation surface across harnesses and models.
Each row is a single run: one harness driving one model along one axis (utility, security, or both). The columns are the same 226–227 skills in every run; each cell links to that run's per-skill results.
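As a minimal sketch of working with this grid, the long-format CSV described under "Reproduce" below can be pivoted back into the run × skill shape with pandas. The file name "skilllens_long.csv" and the columns run_id, skill_id, and score are assumptions for illustration; check index.json for the real schema.

# Sketch: rebuild the run x skill grid from the long-format CSV.
# File name and column names are assumed, not confirmed by the release.
import pandas as pd

df = pd.read_csv("skilllens_long.csv")

# One row per run, one column per skill -- the shape of the explorer above.
grid = df.pivot_table(index="run_id", columns="skill_id", values="score")
print(grid.shape)  # expect (8 runs, ~226-227 skills)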
How a skill becomes a task suite.
Every skill is exercised by three capability-targeted scenarios (U1, U2, U3), each with 5–6 binary judge items. Pick a skill in the explorer above, then expand the panel below to see the full judge sheet that scored it.
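A hedged sketch of reading one such judge sheet from a per-skill JSON bundle follows; the file name and the keys "scenarios", "id", "judge_items", "verdict", and "prompt" are assumptions, not the published schema.

# Sketch: walk one per-skill JSON bundle and print its judge sheet.
# All field names here are illustrative assumptions.
import json

with open("skills/example-skill.json") as f:
    skill = json.load(f)

for scenario in skill["scenarios"]:          # U1, U2, U3
    print(scenario["id"])
    for item in scenario["judge_items"]:     # 5-6 binary items each
        print(f'  [{item["verdict"]}] {item["prompt"]}')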
Replayable traces — coming soon.
Per-scenario sandbox traces (file-system diff, outbound network log, agent step log) will be added to this repository before the camera-ready version. Each entry below will become a clickable timeline.
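One plausible shape for such a trace entry, sketched as a dataclass. Every field name below is an assumption about the unreleased format; only the three artifact kinds (file-system diff, outbound network log, agent step log) come from the description above.

# Sketch: a possible record type for a replayable scenario trace.
# Field names are assumptions; the actual trace format is not yet released.
from dataclasses import dataclass, field

@dataclass
class ScenarioTrace:
    skill_id: str
    scenario_id: str                                       # e.g. "U1"
    fs_diff: list[str] = field(default_factory=list)       # file-system changes
    network_log: list[str] = field(default_factory=list)   # outbound requests
    agent_steps: list[str] = field(default_factory=list)   # step-by-step log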
Cite + download
BibTeX · index.json · per-skill bundles
BibTeX
@inproceedings{skilllens2026,
  title     = {SkillLens: From Task-First Evaluation to Skill-Centered
               Assessment of Agent Skill Packages},
  author    = {Anonymous Author(s)},
  booktitle = {Submitted to the 40th Conference on Neural Information
               Processing Systems (NeurIPS)},
  year      = {2026},
  note      = {Under review. Do not distribute.}
}
Reproduce
Long-format CSV (one row per skill × run) for direct loading into pandas / R / DuckDB. Per-skill JSON files contain every judge item and every (run × scenario) result, sanitized for double-blind review.
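For example, the same CSV can be queried in place through DuckDB's Python API; as before, the file name and column names are assumptions for illustration.

# Sketch: aggregate per-run mean scores straight from the CSV with DuckDB.
# "skilllens_long.csv" and the column names are assumed, not confirmed.
import duckdb

duckdb.sql("""
    SELECT run_id, AVG(score) AS mean_score
    FROM 'skilllens_long.csv'
    GROUP BY run_id
    ORDER BY mean_score DESC
""").show()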