NeurIPS 2026 · under double-blind review

Should you install this skill?

Type a skill name. We'll show you whether it measurably helps the agent — and whether it triggers exploits in a runtime sandbox.


227 audited · 41 unsafe · 93 confirmed exploits · 4,256 judge items


Riskiest skills you should know about.

93 confirmed exploits across 41 skills

Twelve hand-picked findings spanning five exploit classes. Click any card to inspect that skill.

How we score a skill

5-step pipeline · two independent axes

Every skill goes through the same pipeline. The same execution pass produces both axes — effectiveness and safety — never combined into a single score.

01 Profile Static scan of SKILL.md, scripts, and dependencies. Each finding gets an existence_confidence ∈ [0, 1]. → static_scan.json

02 Generate Capability-targeted scenarios (U1 / U2 / U3) with 5–6 binary judge items each, calibrated against the no-skill baseline difficulty. → scenarios/U*.yaml

03 Execute Containerized Harbor runs, paired with-skill / without-skill (wi / wo) on matched seeds. The sandbox records the filesystem diff and outbound network traffic in real time. → fs.diff · net.log

04 Judge An LLM judge scores each binary item; the security judge composes existence × exploitability against the runtime trace. → judges/*.json

05 Report Per-skill report with pass_rate_gain, efficiency_score, and security.score = max(10, 100 − Σ base × existence × exploit). → skill_report.json
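The security.score formula in step 05 can be sketched in a few lines. This is a minimal illustration rather than the pipeline's actual code: the `Finding` class, its field names, and the example severity weights are assumptions made for the sketch. Only the formula itself, including the floor of 10, comes from the text above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Finding:
    """One security finding (hypothetical structure for this sketch)."""
    base: float       # severity weight; the scale used here is an assumption
    existence: float  # existence_confidence from the static scan, in [0, 1]
    exploit: float    # exploitability from the runtime trace, assumed in [0, 1]


def security_score(findings: Iterable[Finding]) -> float:
    """Compute max(10, 100 - sum(base * existence * exploit))."""
    penalty = 0.0
    for f in findings:
        if not 0.0 <= f.existence <= 1.0:
            raise ValueError("existence must lie in [0, 1]")
        if not 0.0 <= f.exploit <= 1.0:
            raise ValueError("exploit must lie in [0, 1]")
        penalty += f.base * f.existence * f.exploit
    # The score is floored at 10, as written in the formula.
    return max(10.0, 100.0 - penalty)


if __name__ == "__main__":
    # Made-up findings, not taken from any audited skill.
    print(security_score([]))
    print(security_score([Finding(base=40, existence=0.5, exploit=0.5)]))
```

Because each penalty term is a product, a finding that the static scan flags with high confidence but that never fires in the sandbox (exploit close to 0) costs almost nothing, and the same holds in reverse.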

See it on GitHub

Chrome extension · injects the verdict on any SKILL.md repo

The Chrome MV3 extension recognizes any GitHub repository whose root contains a SKILL.md and renders the same verdict directly on the page — at the moment someone is deciding whether to install.

github.com/anthropics/mcp-builder 2026-05-04 12:18:37 UTC

anthropics / mcp-builder

SKILL.md

# MCP Builder

Generates MCP server scaffolding for…

## Capabilities

- spawn server templates

- wire transports (stdio, HTTP)

SkillLens

Effectiveness: +16.7 pp

Efficiency: 0.31

Safety: 96 / 100

Findings (H/M/L): 0 / 0 / 0

Verdict: Adopt

// captured during precomputed run · 2026-05-04 · run d4f8c2

Cite + download

BibTeX · benchmark.json (4.5 MB) for reproducibility

BibTeX

@inproceedings{skilllens2026,
  title     = {SkillLens: From Task-First Evaluation to
               Skill-Centered Assessment of Agent Skill Packages},
  author    = {Anonymous Author(s)},
  booktitle = {Submitted to the 40th Conference on Neural
               Information Processing Systems (NeurIPS)},
  year      = {2026},
  note      = {Under review. Do not distribute.}
}

Reproduce

benchmark.json bundles every audit's skill_report.json (judge items, finding rationale, paired wi / wo numbers, severity weighting) for all 227 evaluated skills.

Download benchmark.json

~4.5 MB · schema skilllens_benchmark_v1 · frozen 2026-05-04
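As a rough illustration of how the paired wi / wo numbers relate to an effectiveness figure in percentage points, the sketch below computes a gain from two lists of binary judge outcomes. It assumes that pass_rate_gain is the with-skill pass rate minus the without-skill pass rate; the page does not spell out the definition, and the function names and sample data are hypothetical.

```python
from typing import Sequence


def pass_rate(items: Sequence[bool]) -> float:
    """Fraction of binary judge items that passed."""
    if not items:
        raise ValueError("need at least one judge item")
    return sum(items) / len(items)


def pass_rate_gain_pp(with_skill: Sequence[bool],
                      without_skill: Sequence[bool]) -> float:
    """Gain in percentage points: with-skill rate minus without-skill rate.

    Assumed definition; the report format may compute it differently.
    """
    return 100.0 * (pass_rate(with_skill) - pass_rate(without_skill))


if __name__ == "__main__":
    # Made-up outcomes for one scenario, not taken from benchmark.json.
    wi = [True, True, True, False]    # 3 of 4 items pass with the skill
    wo = [True, False, True, False]   # 2 of 4 items pass without it
    print(f"{pass_rate_gain_pp(wi, wo):+.1f} pp")  # prints "+25.0 pp"
```

A negative result would mean the skill made the agent do worse on these items than the no-skill baseline.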