version: "1.0.1"
name: ln-635-test-isolation-auditor
description: "Audits whether test results can be trusted: flakiness, isolation, real external dependencies, time/random/order dependency, and shared state. Use when auditing test trustworthiness."
allowed-tools: Read, Grep, Glob, Bash
license: MIT
model: claude-haiku-4-5
Paths: File paths (references/,../ln-*) are relative to this skill directory.
Trustworthiness Auditor (L3 Worker)
Type: L3 Worker
Specialized worker auditing whether automated test results are deterministic, isolated, and trustworthy.
Purpose & Scope
- Audit Test Trustworthiness (Category 5: Medium Priority)
- Check determinism, isolation, and dependency control
- Detect flaky tests, time/random/order dependency, shared state, and real external dependencies
- Emit REWRITE_FOR_DETERMINISM or DELETE_IF_LOW_VALUE
- Calculate compliance score (X/10)
Inputs
MANDATORY READ: Load references/audit_worker_core_contract.md.
Receives contextStore with: tech_stack, testFilesMetadata, codebase_root, output_dir.
Workflow
Detection policy: use two-layer detection (candidate scan, then context verification); load references/two_layer_detection.md only when the verification method is ambiguous.
1) Parse Context: Extract tech stack, trustworthiness checklist, test file list, output_dir from contextStore
2) Check Isolation (Layer 1): Check isolation for 6 categories (APIs, DB, FS, Time, Random, Network)
2b) Context Analysis (Layer 2 -- MANDATORY): For each isolation violation, ask:
- Is this an integration test? (real dependencies are intentional) -> do NOT flag. Only flag isolation issues in unit tests
- Is in-memory DB configured via test config (not visible in grep)? -> skip
- Is this a test helper that sets up mocks for other tests? -> skip
3) Check Determinism: Check for flaky tests, time-dependent assertions, order-dependent tests, shared mutable state
4) Evaluate trust action: Use REWRITE_FOR_DETERMINISM by default; use DELETE_IF_LOW_VALUE only when the test is both untrustworthy and low-value according to obvious local evidence
5) Collect Findings: Record each violation with severity, location (file:line), effort estimate (S/M/L), action, recommendation
6) Calculate Score: Count violations by severity, calculate compliance score (X/10)
7) Write Report: Build full markdown report in memory per references/templates/audit_worker_report_template.md, write to {output_dir}/ln-635--global.md in single Write call
8) Return Summary: Return minimal summary to coordinator (see Output Format)
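The two-layer flow in steps 2 and 2b can be sketched as below. This is a minimal illustration, not the contents of references/two_layer_detection.md: the pattern lists, the `isIntegrationTest` path convention, and the `scanFile` helper are all assumptions made for the example, and the finding shape only mirrors the fields named in step 5.

```javascript
// Layer 1 patterns: raw call sites that may reach a real external API.
const CANDIDATE_PATTERNS = [/\baxios\.get\b/, /\bfetch\(/, /\bhttp\.request\b/];

// Layer 2 hints: evidence that calls are controlled (illustrative, not exhaustive).
const MOCK_HINTS = [/\bjest\.mock\(/, /\bnock\(/, /\bsinon\./];

// Hypothetical naming convention for recognising integration tests.
function isIntegrationTest(path) {
  return /(^|\/)integration\//.test(path) || /\.integration\.test\.[jt]s$/.test(path);
}

// Returns findings for one test file given its path and source text.
function scanFile(path, source) {
  // Layer 2: file-level context can clear every candidate at once.
  if (isIntegrationTest(path)) return [];
  if (MOCK_HINTS.some((re) => re.test(source))) return [];

  // Layer 1: line-by-line candidate scan.
  const findings = [];
  source.split("\n").forEach((line, index) => {
    if (CANDIDATE_PATTERNS.some((re) => re.test(line))) {
      findings.push({
        principle: "Isolation",
        severity: "HIGH",
        location: `${path}:${index + 1}`,
        effort: "M",
        action: "REWRITE_FOR_DETERMINISM",
        recommendation: "Control external API calls (mock, stub, or test server).",
      });
    }
  });
  return findings;
}
```

A real audit would verify context per call site rather than per file; the file-level shortcut here only keeps the sketch short.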
Audit Rules: Test Isolation
1. External APIs
Good: Mocked (jest.mock, sinon, nock)
Bad: Real HTTP calls to external APIs
Detection:
- Grep for axios.get, fetch(, http.request without mocks
- Check if test makes actual network calls
Severity: HIGH
Recommendation: Ensure external API calls are controlled (mock, stub, or test server). Tool choice depends on project stack.
Exception: Integration tests are EXPECTED to use real dependencies -- do NOT flag
Effort: M
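One library-free way to control an external API call is to pass the HTTP client in as a parameter and substitute a hand-written stub. The sketch below assumes a hypothetical `fetchUserName` unit and a placeholder `api.example.com` URL; projects already using jest.mock, sinon, or nock would normally use those instead.

```javascript
// Hypothetical unit under test: the HTTP client is a parameter, not a global.
async function fetchUserName(httpGet, userId) {
  const response = await httpGet(`https://api.example.com/users/${userId}`);
  return response.data.name;
}

// Hand-written stub: records each requested URL and returns a canned response.
function createHttpGetStub(cannedData) {
  const calls = [];
  const stub = async (url) => {
    calls.push(url);
    return { status: 200, data: cannedData };
  };
  stub.calls = calls;
  return stub;
}

// Usage: no network traffic happens, and the requested URL can be asserted on.
const httpGetStub = createHttpGetStub({ name: "Ada" });
const pendingName = fetchUserName(httpGetStub, 42);
```

Because the stub is called before the first `await` completes, `httpGetStub.calls` already holds the requested URL once `fetchUserName` has been invoked; the resolved name is available by awaiting `pendingName`.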
2. Database
Good: In-memory DB (sqlite :memory:) or mocked
Bad: Real database (PostgreSQL, MySQL)
Detection:
- Check DB connection strings (localhost:5432, real DB URL)
- Grep for beforeAll(async () => { await db.connect() }) without :memory:
Severity: MEDIUM
Recommendation: Ensure DB state is controlled and isolated between test runs.
Exception: Integration tests with in-memory DB via config -> skip
Effort: M-L
3. File System
Good: Mocked (mock-fs, vol)
Bad: Real file reads/writes
Detection:
- Grep for fs.readFile, fs.writeFile without mocks
- Check if test creates/deletes real files
Severity: MEDIUM
Recommendation: Ensure file system operations are isolated (mock, temp directory, or cleanup). Tool choice depends on project stack
Effort: S-M
4. Time/Date
Good: Mocked (jest.useFakeTimers, sinon.useFakeTimers)
Bad: new Date(), Date.now() without mocks
Detection:
- Grep for new Date() in test files without useFakeTimers
Severity: MEDIUM
Recommendation: Ensure time-dependent logic uses controlled clock (fake timers, injected clock, or time provider). Tool choice depends on project stack
Effort: S
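The "injected clock" option in the recommendation can be sketched without any test framework. The `isExpired` function and the token shape below are hypothetical; the point is only that the clock is a parameter the test can pin.

```javascript
// Hypothetical unit under test: the clock is injected and defaults to the real one.
function isExpired(token, now = () => Date.now()) {
  return now() >= token.expiresAt;
}

// A fixed clock makes the result independent of when the test runs.
const fixedNow = () => Date.UTC(2024, 0, 1, 12, 0, 0);

// One token expires a second after the fixed instant, the other a second before.
const liveToken = { expiresAt: Date.UTC(2024, 0, 1, 12, 0, 1) };
const staleToken = { expiresAt: Date.UTC(2024, 0, 1, 11, 59, 59) };

const liveResult = isExpired(liveToken, fixedNow);
const staleResult = isExpired(staleToken, fixedNow);
```

With the fixed clock, `liveResult` is `false` and `staleResult` is `true` on every run; the same assertions against the real clock would depend on the date the suite happens to run.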
5. Random
Good: Seeded random (Math.seedrandom, fixed seed)
Bad: Math.random() without seed
Detection:
- Grep for Math.random() without seed setup
Severity: LOW
Recommendation: Use seeded random for deterministic tests
Effort: S
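As an illustration of seeded randomness, the sketch below swaps in a small linear congruential generator rather than the seedrandom library, so it runs with no dependencies. `createSeededRandom` and `shuffle` are hypothetical helpers; the generator is fine for repeatable tests but is not suitable for anything security-related.

```javascript
// Small linear congruential generator: the same seed always yields the same sequence.
function createSeededRandom(seed) {
  let state = seed >>> 0; // coerce to an unsigned 32-bit integer
  return function next() {
    state = (1664525 * state + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1), like Math.random()
  };
}

// Hypothetical unit under test: takes its random source as a parameter.
function shuffle(items, random) {
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// Two runs with the same seed produce the same order.
const firstRun = shuffle([1, 2, 3, 4, 5], createSeededRandom(42));
const secondRun = shuffle([1, 2, 3, 4, 5], createSeededRandom(42));
```

Passing the random source as an argument is the design choice that matters here; the specific generator can be replaced by whatever seeded source the project stack provides.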
6. Network
Good: Controlled (supertest for Express, no fixed port)
Bad: Real network requests (localhost:3000, binding to a fixed port)
Detection:
- Grep for app.listen(3000) in tests
- Check for real HTTP requests
Severity: MEDIUM
Recommendation: Use supertest, which binds the app to an ephemeral port so tests do not depend on a fixed port
Effort: M
Audit Rules: Determinism
1. Flaky Tests
What: Tests that pass/fail randomly
Detection:
- Run tests multiple times, check for inconsistent results
- Grep for setTimeout, setInterval without proper awaits
- Check for race conditions (async operations not awaited)
Severity: HIGH
Recommendation: Fix race conditions, use proper async/await
Effort: M-L
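The unawaited-async pattern that the detection step looks for can be shown in a few lines. `saveRecord` and both check functions are hypothetical; the timer stands in for any asynchronous write.

```javascript
// Hypothetical async unit: the write lands only after a short timer fires.
function saveRecord(store, record) {
  return new Promise((resolve) => {
    setTimeout(() => {
      store.push(record);
      resolve(store.length);
    }, 5);
  });
}

// Racy: the promise is not awaited, so the store is read before the write lands.
function unawaitedCheck(store) {
  saveRecord(store, { id: 1 });
  return store.length;
}

// Deterministic: the store is read only after the promise has settled.
async function awaitedCheck(store) {
  await saveRecord(store, { id: 1 });
  return store.length;
}

const racyLength = unawaitedCheck([]);
const settledLength = awaitedCheck([]);
```

Here `racyLength` is read before the timer fires, while `settledLength` resolves only after the write. In a real suite the racy form tends to pass or fail depending on what else delays the assertion, which is what makes it flaky.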
2. Time-Dependent Assertions
What: Assertions on current time (expect(timestamp).toBeCloseTo(Date.now()))
Detection:
- Grep for Date.now(), new Date() in assertions
Severity: MEDIUM
Recommendation: Mock time
Effort: S
3. Order-Dependent Tests
What: Tests that fail when run in different order
Detection:
- Run tests in random order, check for failures
- Grep for shared mutable state between tests
Severity: MEDIUM
Recommendation: Isolate tests, reset state in beforeEach
Effort: M
4. Shared Mutable State
What: Global variables modified across tests
Detection:
- Grep for let globalVar at module level
- Check for state shared between tests
Severity: MEDIUM
Recommendation: Use beforeEach to reset state
Effort: S-M
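The difference between module-level state and per-test state can be sketched as follows. The cart helpers are hypothetical, and the factory plays the role a beforeEach hook would play in a Jest or Mocha suite.

```javascript
// Module-level state: every test that touches it sees earlier tests' writes.
let sharedCart = [];
function addToSharedCart(item) {
  sharedCart.push(item);
  return sharedCart.length;
}

// Factory: each test builds its own cart, so no state leaks between tests.
function createCart() {
  const items = [];
  return {
    add(item) {
      items.push(item);
      return items.length;
    },
  };
}

// Two "tests" that each expect a cart holding exactly one item.
const sharedFirst = addToSharedCart("apple");
const sharedSecond = addToSharedCart("pear"); // sees the apple from the first test

const isolatedFirst = createCart().add("apple");
const isolatedSecond = createCart().add("pear");
```

With shared state the second test's result depends on whether the first one ran before it (`sharedSecond` is 2, not 1), which is also the mechanism behind order-dependent failures; with the factory both results are 1 in any order.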
Audit Rules: Trustworthiness Drag
1. Overlarge Test With Shared Setup (>100 lines)
What: Test with >100 lines, testing too many scenarios
Detection:
- Count lines per test
- If >100 lines -> Giant
Severity: MEDIUM
Recommendation: Split into focused tests (one scenario per test)
Effort: S-M
2. Slow Poke (>5 seconds)
What: Test taking >5 seconds to run
Detection:
- Measure test duration
- If >5s -> Slow Poke
Severity: MEDIUM
Recommendation: Control external deps with test doubles or in-memory services selected from the project stack; parallelize only after isolation is verified
Effort: M
3. Conjoined Twins (Unit test without controlled dependencies)
What: Test labeled "Unit" but not mocking dependencies
Detection:
- Check if test name includes "Unit"
- Verify all dependencies are mocked
- If no mocks -> actually Integration test
Severity: LOW
Recommendation: Either mock dependencies OR rename to Integration test
Effort: S
4. Default Value Blindness (Tests with default config)
What: Tests with default config values only. Use the non-default config rule from references/risk_based_testing_guide.md; load references/risk_based_testing_methodology.md only when examples are needed.
Detection:
- Grep for common defaults in test setup: :8080, :3000, 30000, limit: 20, offset: 0
- Check if test config values match framework/library defaults
- Look for || DEFAULT patterns in source code with matching test values
Severity: HIGH
Effort: S
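A short example shows why a test value that equals the default cannot be trusted. Both `resolvePort` functions are hypothetical, and this sketch is not taken from references/risk_based_testing_guide.md.

```javascript
// Hypothetical buggy unit: it ignores config.port and always returns the default.
function resolvePortBuggy(config) {
  return 3000;
}

// Hypothetical correct unit: uses the configured port, falling back to the default.
function resolvePort(config) {
  return config.port || 3000;
}

// A test value equal to the default cannot tell the two implementations apart.
const defaultValueTestPassesForBuggy = resolvePortBuggy({ port: 3000 }) === 3000;
const defaultValueTestPassesForCorrect = resolvePort({ port: 3000 }) === 3000;

// A non-default test value exposes the bug.
const nonDefaultTestPassesForBuggy = resolvePortBuggy({ port: 4567 }) === 4567;
const nonDefaultTestPassesForCorrect = resolvePort({ port: 4567 }) === 4567;
```

Only the non-default value (4567 here, chosen arbitrarily) distinguishes the broken implementation from the working one, which is why a passing test built on defaults says little about whether configuration is actually honored.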
Scoring Algorithm
MANDATORY READ: Load references/audit_scoring.md.
Severity mapping:
- Flaky tests, External API not controlled, Default Value Blindness -> HIGH
- Real database, File system, Time/Date, Network, Time-dependent assertions, Order-dependent, Shared mutable state, Overlarge shared setup, Slow Poke -> MEDIUM
- Random without seed, Conjoined Twins -> LOW
Output Format
MANDATORY READ: Load references/templates/audit_worker_report_template.md.
Write JSON summary per references/audit_summary_contract.md. In managed mode the caller passes both runId and summaryArtifactPath; in standalone mode the worker generates its own run-scoped artifact path per shared contract.
Write report to {output_dir}/ln-635--global.md with category: "Test Trustworthiness" and checks: api_isolation, db_isolation, fs_isolation, time_isolation, random_isolation, network_isolation, flaky_tests, order_dependency, shared_state, default_value_blindness.
Return summary per references/audit_summary_contract.md.
When summaryArtifactPath is absent, write the standalone runtime summary under .hex-skills/runtime-artifacts/runs/{run_id}/evaluation-worker/{worker}--{identifier}.json and optionally echo the same summary in structured output.
Report written: .hex-skills/runtime-artifacts/runs/{run_id}/audit-report/ln-635--global.md
Score: X.X/10 | Issues: N (C:N H:N M:N L:N)
Note: Findings are flattened into single array. Use principle field prefix (Isolation / Determinism / Dependency Control) to identify issue category. Each finding includes action: "REWRITE_FOR_DETERMINISM" or action: "DELETE_IF_LOW_VALUE".
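The summary line above can be assembled from the flat findings array roughly as follows. This sketch assumes that "C" stands for a CRITICAL severity and that the score has already been computed per references/audit_scoring.md; the penalty algorithm itself is not reproduced here, and `countBySeverity` and `formatSummary` are hypothetical helper names.

```javascript
// Count findings per severity; "CRITICAL" for the C bucket is an assumption.
function countBySeverity(findings) {
  const counts = { CRITICAL: 0, HIGH: 0, MEDIUM: 0, LOW: 0 };
  for (const finding of findings) {
    if (finding.severity in counts) counts[finding.severity] += 1;
  }
  return counts;
}

// Format the one-line summary; the score is supplied by the caller.
function formatSummary(score, findings) {
  const c = countBySeverity(findings);
  return (
    `Score: ${score.toFixed(1)}/10 | Issues: ${findings.length} ` +
    `(C:${c.CRITICAL} H:${c.HIGH} M:${c.MEDIUM} L:${c.LOW})`
  );
}

// Usage with a flat findings array distinguished by the principle field.
const exampleFindings = [
  { principle: "Isolation", severity: "HIGH", action: "REWRITE_FOR_DETERMINISM" },
  { principle: "Determinism", severity: "MEDIUM", action: "REWRITE_FOR_DETERMINISM" },
  { principle: "Dependency Control", severity: "MEDIUM", action: "DELETE_IF_LOW_VALUE" },
];
const summaryLine = formatSummary(7.5, exampleFindings);
```

The score of 7.5 is an arbitrary input for the example, not the result of applying the scoring reference to these findings.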
Critical Rules
Apply the already-loaded references/audit_worker_core_contract.md.
- Do not auto-fix: Report only
- Effort realism: S = <1h, M = 1-4h, L = >4h
- Flat findings: Merge isolation + determinism + dependency-control findings into single findings array, use principle prefix to distinguish
- Context-aware: Supertest with real Express app is acceptable for integration tests
- Unique angle: Only audit whether test results can be trusted. Do not evaluate product behavior, E2E journey value, portfolio value, missing coverage, oracle strength, manual evidence, or structure.
- Action required: Every finding uses REWRITE_FOR_DETERMINISM unless evidence shows the test is also low-value enough to use DELETE_IF_LOW_VALUE.
Monitor (2.1.98+): For repeated test runs expected >30s each, use Monitor. Fallback: Bash(run_in_background=true).
Definition of Done
Apply the already-loaded references/audit_worker_core_contract.md.
- [ ] contextStore parsed successfully (including output_dir)
- [ ] All 3 audit groups completed:
- Isolation (6 categories: APIs, DB, FS, Time, Random, Network)
- Determinism (4 checks: flaky, time-dependent, order-dependent, shared state)
- Dependency control (overlarge shared setup, slow tests, conjoined dependencies, default-value blindness)
- [ ] Findings collected with severity, location, effort, action, recommendation
- [ ] Score calculated using penalty algorithm
- [ ] Report written to {output_dir}/ln-635--global.md (atomic single Write call)
- [ ] Summary written per contract
Version: 3.0.0 Last Updated: 2025-12-23