Skill v1.0.1
---
version: "1.0.1"
name: audit-agents-skills
description: "Audit Claude Code agents, skills, and commands for quality and production readiness. Use when evaluating skill quality, checking production readiness scores, or comparing agents against best-practice templates."
allowed-tools: Read Grep Glob Bash Write
effort: high
disable-model-invocation: true
metadata:
  version: 1.0.0
---
Audit Agents/Skills/Commands (Advanced Skill)
Comprehensive quality audit system for Claude Code agents, skills, and commands. Provides quantitative scoring, comparative analysis, and production readiness grading based on industry best practices.
Purpose
Problem: Manual validation of agents/skills is error-prone and inconsistent. According to the LangChain Agent Report 2026, 29.5% of organizations deploy agents without systematic evaluation, leading to "agent bugs" as the top challenge (18% of teams).
Solution: Automated quality scoring across 16 weighted criteria with production readiness thresholds (80% = Grade B minimum for production deployment).
Key Features:
- Quantitative scoring (32 points for agents/skills, 20 for commands)
- Weighted criteria (Identity 3x, Prompt 2x, Validation 1x, Design 2x)
- Production readiness grading (A-F scale with 80% threshold)
- Comparative analysis vs reference templates
- JSON/Markdown dual output for programmatic integration
- Fix suggestions for failing criteria
Modes
| Mode | Usage | Output |
|---|---|---|
| Quick Audit | Top-5 critical criteria only | Fast pass/fail (3-5 min for 20 files) |
| Full Audit | All 16 criteria per file | Detailed scores + recommendations (10-15 min) |
| Comparative | Full + benchmark vs templates | Analysis + gap identification (15-20 min) |
Default: Full Audit (recommended for first run)
Methodology
Why These Criteria?
The 16-criteria framework is derived from:
- Claude Code Best Practices (Ultimate Guide line 4921: Agent Validation Checklist)
- Industry Data (LangChain Agent Report 2026: evaluation gaps)
- Production Failures (Community feedback on hardcoded paths, missing error handling)
- Composition Patterns (Skills should reference other skills, agents should be modular)
Scoring Philosophy
Weight Rationale:
- Identity (3x): If users can't find/invoke the agent, quality is irrelevant (discoverability > quality)
- Prompt (2x): Determines reliability and accuracy of outputs
- Validation (1x): Improves robustness but is secondary to core functionality
- Design (2x): Impacts long-term maintainability and scalability
Grade Standards:
- A (90-100%): Production-ready, minimal risk
- B (80-89%): Good, meets production threshold
- C (70-79%): Needs improvement before production
- D (60-69%): Significant gaps, not production-ready
- F (<60%): Critical issues, requires major refactoring
Industry Alignment: The 80% threshold follows a convention commonly used for production gates elsewhere in software engineering (e.g., code coverage >80%, security scan pass rates).
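The grade bands and the production gate can be sketched as a small lookup. This is a minimal illustration, not the skill's actual implementation: the function names `grade_for` and `is_production_ready` are hypothetical, and it assumes scores are compared without rounding (so 89.9% is a B).

```python
def grade_for(score_pct: float) -> str:
    """Map a percentage score to a letter grade using the bands listed above."""
    if score_pct >= 90:
        return "A"
    if score_pct >= 80:
        return "B"
    if score_pct >= 70:
        return "C"
    if score_pct >= 60:
        return "D"
    return "F"


def is_production_ready(score_pct: float, threshold: float = 80.0) -> bool:
    """A file passes the production gate at or above the threshold (Grade B minimum)."""
    return score_pct >= threshold


print(grade_for(78.1), is_production_ready(78.1))  # C False
print(grade_for(82.5), is_production_ready(82.5))  # B True
```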
Workflow
Phase 1: Discovery
- Scan directories:
  ```
  .claude/agents/
  .claude/skills/
  .claude/commands/
  examples/agents/    (if exists)
  examples/skills/    (if exists)
  examples/commands/  (if exists)
  ```
- Classify files by type (agent/skill/command)
- Load reference templates (for Comparative mode):
  ```
  guide/examples/agents/    (benchmark files)
  guide/examples/skills/    (benchmark files)
  guide/examples/commands/  (benchmark files)
  ```
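Classification by type can be derived from the directory a file lives in. The sketch below assumes the directory layout listed above is the only signal used; `classify` and `TYPE_BY_DIR` are illustrative names, not part of the skill.

```python
from pathlib import PurePosixPath
from typing import Optional

# Directory name -> file type, taken from the scan list above.
TYPE_BY_DIR = {"agents": "agent", "skills": "skill", "commands": "command"}


def classify(path: str) -> Optional[str]:
    """Return 'agent', 'skill', or 'command' based on the parent directories."""
    parts = PurePosixPath(path).parts
    for root in (".claude", "examples"):
        if root in parts:
            idx = parts.index(root)
            if idx + 1 < len(parts):
                # The directory right below the root decides the type.
                return TYPE_BY_DIR.get(parts[idx + 1])
    return None


print(classify(".claude/agents/debugging-specialist.md"))  # agent
print(classify("guide/examples/skills/testing-expert/SKILL.md"))  # skill
```

Files outside the scanned roots return `None`, so the caller can skip them.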
Phase 2: Scoring Engine
Load scoring criteria from scoring/criteria.yaml:
```yaml
agents:
  max_points: 32
  categories:
    identity:
      weight: 3
      criteria:
        - id: A1.1
          name: "Clear name"
          points: 3
          detection: "frontmatter.name exists and is descriptive"
        # ... (16 total criteria)
```
For each file:
- Parse frontmatter (YAML)
- Extract content sections
- Run detection patterns (regex, keyword search)
- Calculate score: `(points / max_points) × 100`
- Assign grade (A-F)
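The score calculation can be sketched as follows. The `Criterion` dataclass, the `score` helper, and the `C0`..`C15` ids are hypothetical; the sketch assumes each criterion is pass/fail and awards its full weighted points when it passes.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    id: str
    points: int   # points awarded when the criterion passes
    passed: bool


def score(criteria: list[Criterion]) -> tuple[int, int, float]:
    """Return (points_obtained, max_points, percentage rounded to one decimal)."""
    max_points = sum(c.points for c in criteria)
    obtained = sum(c.points for c in criteria if c.passed)
    pct = round(obtained / max_points * 100, 1) if max_points else 0.0
    return obtained, max_points, pct


# 4 criteria per category at weights 3, 2, 1, 2 gives the 32-point maximum.
weights = [3] * 4 + [2] * 4 + [1] * 4 + [2] * 4
failed = {0, 4, 8, 9}  # one 3-pt, one 2-pt and two 1-pt criteria fail
criteria = [Criterion(f"C{i}", w, i not in failed) for i, w in enumerate(weights)]
print(score(criteria))  # (25, 32, 78.1)
```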
Phase 3: Comparative Analysis (Comparative Mode Only)
For each project file:
- Find closest matching template (by description similarity)
- Compare scores per criterion
- Identify gaps: `template_score - project_score`
- Flag significant gaps (>10 points difference)
Example:
```
Project file: .claude/agents/debugging-specialist.md (Score: 78%, Grade C)
Closest template: examples/agents/debugging-specialist.md (Score: 94%, Grade A)

Gaps:
- Anti-hallucination measures: -2 points (template has, project missing)
- Edge cases documented: -1 point (template has 5 examples, project has 1)
- Integration documented: -1 point (template references 3 skills, project none)

Total gap: 16 percentage points (explains C vs A difference)
```
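A minimal sketch of the gap step, assuming per-criterion points are available as dictionaries and that the ">10 points" rule applies to the overall percentage difference. The helper names and every criterion id except `A2.4` are made up for illustration.

```python
def criterion_gaps(template: dict[str, int], project: dict[str, int]) -> dict[str, int]:
    """Per-criterion gap (template_score - project_score), keeping only positive gaps."""
    gaps = {}
    for cid, t_points in template.items():
        gap = t_points - project.get(cid, 0)
        if gap > 0:
            gaps[cid] = gap
    return gaps


def is_significant(template_pct: float, project_pct: float, limit: float = 10.0) -> bool:
    """Flag when the overall difference exceeds the limit."""
    return (template_pct - project_pct) > limit


template = {"A2.4": 2, "A3.2": 1, "A4.3": 1}
project = {"A2.4": 0, "A3.2": 0, "A4.3": 0}
print(criterion_gaps(template, project))  # {'A2.4': 2, 'A3.2': 1, 'A4.3': 1}
print(is_significant(94.0, 78.0))         # True
```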
Phase 4: Report Generation
Markdown Report (audit-report.md):
- Summary table (overall + by type)
- Individual scores with top issues
- Detailed breakdown per file (collapsible)
- Prioritized recommendations
JSON Output (audit-report.json):
```json
{
  "metadata": {
    "project_path": "/path/to/project",
    "audit_date": "2026-02-07",
    "mode": "full",
    "version": "1.0.0"
  },
  "summary": {
    "overall_score": 82.5,
    "overall_grade": "B",
    "total_files": 15,
    "production_ready_count": 10,
    "production_ready_percentage": 66.7
  },
  "by_type": {
    "agents": { "count": 5, "avg_score": 85.2, "grade": "B" },
    "skills": { "count": 8, "avg_score": 78.9, "grade": "C" },
    "commands": { "count": 2, "avg_score": 92.0, "grade": "A" }
  },
  "files": [
    {
      "path": ".claude/agents/debugging-specialist.md",
      "type": "agent",
      "score": 78.1,
      "grade": "C",
      "points_obtained": 25,
      "points_max": 32,
      "failed_criteria": [
        {
          "id": "A2.4",
          "name": "Anti-hallucination measures",
          "points_lost": 2,
          "recommendation": "Add section on source verification"
        }
      ]
    }
  ],
  "top_issues": [
    {
      "issue": "Missing error handling",
      "affected_files": 8,
      "impact": "Runtime failures unhandled",
      "priority": "high"
    }
  ]
}
```
Phase 5: Fix Suggestions (Optional)
For each failing criterion, generate actionable fix:
```markdown
### File: .claude/agents/debugging-specialist.md

**Issue**: Missing anti-hallucination measures (2 points lost)

**Fix**: Add this section after "Methodology":

## Source Verification

- Always cite sources for technical claims
- Use phrases: "According to [documentation]...", "Based on [tool output]..."
- If uncertain, state: "I don't have verified information on..."
- Never invent: statistics, version numbers, API signatures, stack traces

**Detection**: Grep for keywords: "verify", "cite", "source", "evidence"
```
Scoring Criteria
See scoring/criteria.yaml for complete definitions. Summary:
Agents (32 points max)
| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Identity | 3x | 4 | 12 |
| Prompt Quality | 2x | 4 | 8 |
| Validation | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
Key Criteria:
- Clear name (3 pts): Not generic like "agent1"
- Description with triggers (3 pts): Contains "when"/"use"
- Role defined (2 pts): "You are..." statement
- 3+ examples (1 pt): Usage scenarios documented
- Single responsibility (2 pts): Focused, not "general purpose"
Skills (32 points max)
| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Structure | 3x | 4 | 12 |
| Content | 2x | 4 | 8 |
| Technical | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
Key Criteria:
- Valid SKILL.md (3 pts): Proper naming
- Name valid (3 pts): Lowercase, 1-64 chars, no spaces
- Methodology described (2 pts): Workflow section exists
- No hardcoded paths (1 pt): No `/Users/`, `/home/`
- Clear triggers (2 pts): "When to use" section
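Two of these checks are mechanical enough to sketch with regular expressions. The helper names are hypothetical, and allowing digits and hyphens in a skill name is an assumption; the criteria above only say lowercase, 1-64 characters, no spaces.

```python
import re

# Lowercase, 1-64 chars, no spaces. Allowing digits and hyphens is an assumption.
NAME_RE = re.compile(r"[a-z0-9-]{1,64}")
# Absolute paths under the user-specific roots named in the criterion.
HARDCODED_PATH_RE = re.compile(r"(/Users/|/home/)\S*")


def is_valid_skill_name(name: str) -> bool:
    """True when the whole name matches the allowed character set and length."""
    return NAME_RE.fullmatch(name) is not None


def find_hardcoded_paths(content: str) -> list[str]:
    """Return every path-like token that starts with /Users/ or /home/."""
    return [m.group(0) for m in HARDCODED_PATH_RE.finditer(content)]


print(is_valid_skill_name("audit-agents-skills"))               # True
print(find_hardcoded_paths("see /Users/alice/proj and ./rel"))  # ['/Users/alice/proj']
```

A pattern this simple also matches URLs that happen to contain `/home/`, so matches are better treated as candidates for review than as definite failures.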
Commands (20 points max)
| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Structure | 3x | 4 | 12 |
| Quality | 2x | 4 | 8 |
Key Criteria:
- Valid frontmatter (3 pts): name + description
- Argument hint (3 pts): If uses `$ARGUMENTS`
- Step-by-step workflow (3 pts): Numbered sections
- Error handling (2 pts): Mentions failure modes
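The argument-hint criterion is conditional: it only applies when the command body uses `$ARGUMENTS`. A sketch of that rule, assuming the hint is stored under an `argument-hint` frontmatter key (an assumption here, as is the function name):

```python
def argument_hint_ok(frontmatter: dict, body: str) -> bool:
    """Pass when the command does not use $ARGUMENTS, or declares a hint for it."""
    if "$ARGUMENTS" not in body:
        return True  # criterion does not apply
    # 'argument-hint' is the assumed frontmatter key for the hint.
    return bool(frontmatter.get("argument-hint"))


print(argument_hint_ok({"name": "audit"}, "Audit the project."))    # True
print(argument_hint_ok({"name": "audit"}, "Audit $ARGUMENTS now"))  # False
```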
Detection Patterns
Frontmatter Parsing
```python
import re

import yaml


def parse_frontmatter(content):
    # Capture the YAML between the opening and closing '---' delimiters.
    match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
    if match:
        return yaml.safe_load(match.group(1))
    return None
```
Keyword Detection
```python
def has_keywords(text, keywords):
    # Case-insensitive substring match, so 'use' also matches 'user' or 'because'.
    text_lower = text.lower()
    return any(kw in text_lower for kw in keywords)


# Example
has_trigger = has_keywords(description, ['when', 'use', 'trigger'])
has_error_handling = has_keywords(content, ['error', 'failure', 'fallback'])
```
Overlap Detection (Duplication Check)
```python
def jaccard_similarity(text1, text2):
    # Jaccard index over the sets of lowercase, whitespace-separated words.
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    intersection = words1 & words2
    union = words1 | words2
    return len(intersection) / len(union) if union else 0


# Flag if similarity > 0.5 (50% keyword overlap)
if jaccard_similarity(desc1, desc2) > 0.5:
    issues.append("High overlap with another file")
```
Token Counting (Approximate)
```python
def estimate_tokens(text):
    # Rough estimate: 1 token ≈ 0.75 words, i.e. about 1.3 tokens per word
    word_count = len(text.split())
    return int(word_count * 1.3)


# Check budget
tokens = estimate_tokens(file_content)
if tokens > 5000:
    issues.append("File too large (>5K tokens)")
```
Industry Context
Source: LangChain Agent Report 2026 (public report, pages 14-22)
Key Findings:
- 29.5% of organizations deploy agents without systematic evaluation
- 18% cite "agent bugs" as their primary challenge
- Only 12% use automated quality checks (88% manual or none)
- 43% report difficulty maintaining agent quality over time
- Top issues: Hallucinations (31%), poor error handling (28%), unclear triggers (22%)
Implications:
- Automation gap: Most teams rely on manual checklists (error-prone at scale)
- Quality debt: Agents deployed without validation accumulate technical debt
- Maintenance burden: 43% struggle with quality over time (no tracking system)
This skill addresses:
- Automation: Replaces manual checklists with quantitative scoring
- Tracking: JSON output enables trend analysis over time
- Standards: 80% threshold provides clear production gate
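As one illustration of trend tracking, two JSON reports from different runs can be compared file by file. The sketch assumes both reports follow the `files[].path` / `files[].score` layout shown under Report Generation; `score_deltas` is a hypothetical helper, not something the skill ships.

```python
def score_deltas(previous: dict, current: dict) -> dict[str, float]:
    """Score change per file for files present in both reports."""
    prev = {f["path"]: f["score"] for f in previous["files"]}
    return {
        f["path"]: round(f["score"] - prev[f["path"]], 1)
        for f in current["files"]
        if f["path"] in prev
    }


old = {"files": [{"path": ".claude/agents/debugging-specialist.md", "score": 78.1}]}
new = {"files": [{"path": ".claude/agents/debugging-specialist.md", "score": 85.0}]}
print(score_deltas(old, new))  # {'.claude/agents/debugging-specialist.md': 6.9}
```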
Output Examples
Quick Audit (Top-5 Criteria)
```markdown
# Quick Audit: Agents/Skills/Commands

**Files**: 15 (5 agents, 8 skills, 2 commands)
**Critical Issues**: 3 files fail top-5 criteria

## Top-5 Criteria (Pass/Fail)

| File | Valid Name | Has Triggers | Error Handling | No Hardcoded Paths | Examples |
|------|------------|--------------|----------------|--------------------|----------|
| agent1.md | ✅ | ✅ | ❌ | ✅ | ❌ |
| skill2/ | ✅ | ❌ | ✅ | ❌ | ✅ |

## Action Required

1. **Add error handling**: 5 files
2. **Remove hardcoded paths**: 3 files
3. **Add usage examples**: 4 files
```
Full Audit
See Phase 4: Report Generation above for full structure.
Comparative (Full + Benchmarks)
```markdown
# Comparative Audit

## Project vs Templates

| File | Project Score | Template Score | Gap | Top Missing |
|------|---------------|----------------|-----|-------------|
| debugging-specialist.md | 78% (C) | 94% (A) | -16 pts | Anti-hallucination, edge cases |
| testing-expert/ | 85% (B) | 91% (A) | -6 pts | Integration docs |

## Recommendations

Focus on these gaps to reach template quality:

1. **Anti-hallucination measures** (8 files): Add source verification sections
2. **Edge case documentation** (5 files): Add failure scenario examples
3. **Integration documentation** (4 files): List compatible agents/skills
```
Usage
Basic (Full Audit)
```
# In Claude Code
Use skill: audit-agents-skills

# Specify path
Use skill: audit-agents-skills for ~/projects/my-app
```
With Options
```
# Quick audit (fast)
Use skill: audit-agents-skills with mode=quick

# Comparative (benchmark analysis)
Use skill: audit-agents-skills with mode=comparative

# Generate fixes
Use skill: audit-agents-skills with fixes=true

# Custom output path
Use skill: audit-agents-skills with output=~/Desktop/audit.json
```
JSON Output Only
```
# For programmatic integration
Use skill: audit-agents-skills with format=json output=audit.json
```
Integration with CI/CD
Pre-commit Hook
```bash
#!/bin/bash
# .git/hooks/pre-commit

# Run quick audit on changed agent/skill/command files
changed_files=$(git diff --cached --name-only | grep -E "^\.claude/(agents|skills|commands)/")

if [ -n "$changed_files" ]; then
  echo "Running quick audit on changed files..."
  # Run audit (requires Claude Code CLI wrapper)
  # Exit with 1 if any file scores <80%
fi
```
GitHub Actions
```yaml
name: Audit Agents/Skills
on: [pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run quality audit
        run: |
          # Run audit skill
          # Parse JSON output
          # Fail if overall_score < 80
```
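The "parse JSON output, fail if overall_score < 80" step could look roughly like this in Python. It assumes the report layout shown under Report Generation; `gate` is a hypothetical helper and the way the report file is produced in CI is left out.

```python
import sys


def gate(report: dict, threshold: float = 80.0) -> int:
    """Return a process exit code: 0 if the overall score meets the threshold, else 1."""
    overall = report["summary"]["overall_score"]
    # List individual files below the threshold to make the CI log actionable.
    failing = [f["path"] for f in report.get("files", []) if f["score"] < threshold]
    for path in failing:
        print(f"below {threshold}%: {path}", file=sys.stderr)
    return 0 if overall >= threshold else 1


# Typical use in a CI step (the report path is an assumption):
#   import json
#   sys.exit(gate(json.load(open("audit-report.json", encoding="utf-8"))))
```

This gates on the overall score only; a stricter variant could return 1 whenever `failing` is non-empty, which matches the pre-commit hook's "any file scores <80%" rule.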
Comparison: Command vs Skill
| Aspect | Command (/audit-agents-skills) | Skill (this file) |
|---|---|---|
| Scope | Current project only | Multi-project, comparative |
| Output | Markdown report | Markdown + JSON |
| Speed | Fast (5-10 min) | Slower (10-20 min with comparative) |
| Depth | Standard 16 criteria | Same + benchmark analysis |
| Fix suggestions | Via --fix flag | Built-in with recommendations |
| Programmatic | Terminal output | JSON for CI/CD integration |
| Best for | Quick checks, dev workflow | Deep audits, quality tracking |
Recommendation: Use command for daily checks, skill for release gates and quality tracking.
Maintenance
Updating Criteria
Edit scoring/criteria.yaml:
```yaml
agents:
  categories:
    identity:
      criteria:
        - id: A1.5  # New criterion
          name: "API versioning specified"
          points: 3
          detection: "mentions API version or compatibility"
```
Version bump: Increment version in frontmatter when criteria change.
Adding File Types
To support new file types (e.g., "workflows"):
- Add to `scoring/criteria.yaml`:
  ```yaml
  workflows:
    max_points: 24
    categories: [...]
  ```
- Update detection logic (file path patterns)
- Update report templates
Related
- Command version: `.claude/commands/audit-agents-skills.md`
- Agent Validation Checklist: guide line 4921 (manual 16 criteria)
- Skill Validation: guide line 5491 (spec documentation)
- Reference templates: `examples/agents/`, `examples/skills/`, `examples/commands/`
Changelog
v1.0.0 (2026-02-07):
- Initial release
- 16-criteria framework (agents/skills/commands)
- 3 audit modes (quick/full/comparative)
- JSON + Markdown output
- Fix suggestions
- Industry context (LangChain 2026 report)
Skill ready for use: audit-agents-skills