Skill v1.0.2
name: pdf-extract-progressive-tools
description: Progressive tool-chain PDF extraction with explicit read_file, run_shell, and execute_code_sandbox sequencing
PDF Text Extraction with Progressive Tool Fallback
This skill provides a robust workflow for extracting text from PDF documents using a sequenced approach with agent tools, with explicit fallback mechanisms based on observed tool behavior.
Critical Insight from Execution Data
read_file often returns binary/image data for PDFs, not extracted text. When this occurs, immediately escalate to run_shell with pdftotext before attempting Python-based extraction.
Entry Point: Determine Your Starting Point
Before beginning, identify your scenario:
| Scenario | Start Here | Skip |
|---|---|---|
| PDF already on local disk | Step 1 (read_file attempt) | Download steps |
| PDF at a web URL | Download first, then Step 1 | None |
| PDF content already extracted | Step 4 (Quality verification) | Steps 1-3 |
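The scenario table can be expressed as a small routing function. This is a minimal sketch; `choose_entry_point` and its `already_extracted` flag are hypothetical names introduced for illustration, and a source counts as a web URL only when it has an http or https scheme.

```python
from urllib.parse import urlparse


def choose_entry_point(source: str, already_extracted: bool = False) -> str:
    """Map a scenario from the table to the step where work should begin."""
    if already_extracted:
        # Text is already in hand, so Steps 1-3 are skipped.
        return "Step 4"
    if urlparse(source).scheme in ("http", "https"):
        # Remote file: it has to be on disk before read_file can be tried.
        return "Download, then Step 1"
    return "Step 1"


print(choose_entry_point("https://example.com/report.pdf"))
print(choose_entry_point("/data/report.pdf"))
```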
Overview
PDF extraction failures cascade when tool sequencing is unclear. This workflow improves the success rate by fixing the order in which tools are tried:
- read_file - Quick attempt, but may return binary data
- run_shell + pdftotext - Reliable extraction when read_file fails
- execute_code_sandbox + PyMuPDF - Final fallback for complex PDFs
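The progression above amounts to a generic fallback chain. The sketch below assumes each tool is wrapped in a hypothetical Python callable that returns extracted text or None; `run_chain` and the three stub extractors are illustrative only and are not part of any agent API.

```python
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


def run_chain(pdf_path: str, extractors: List[Tuple[str, Extractor]]):
    """Try each extractor in order; return (text, log) from the first that yields text."""
    log = []
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # a failing tool must not stop the chain
            log.append(f"{name}: failed - {exc}")
            continue
        if text and text.strip():
            log.append(f"{name}: success")
            return text, log
        log.append(f"{name}: no text")
    return None, log


# Hypothetical stand-ins for the three tools, for illustration only.
def fake_read_file(path):
    return None  # simulates binary output that was rejected


def fake_pdftotext(path):
    raise FileNotFoundError("pdftotext not installed")


def fake_pymupdf(path):
    return "Extracted page text"


text, log = run_chain(
    "document.pdf",
    [("read_file", fake_read_file), ("pdftotext", fake_pdftotext), ("PyMuPDF", fake_pymupdf)],
)
print(text)
print(log)
```

Keeping the log alongside the result means the failure of each earlier tool is still recorded when a later one succeeds.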
Step-by-Step Instructions
Step 1: Attempt read_file First
Always try the simplest approach first:
    Tool: read_file
    Path: document.pdf
Expected outcome: Extracted text content
Critical check: Examine the returned content:
- ✅ Text visible: Proceed to Step 4 (Quality verification)
- ⚠️ Binary/image data detected: Immediately proceed to Step 2
- ❌ File not found: Verify path or download first
Binary data indicators:
- Content starts with the %PDF- header, with no extracted text
- Content appears as garbled characters or base64
- Content contains PNG/JPEG markers within PDF wrapper
- File size seems reasonable but no readable text
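These indicators can be checked mechanically. The sketch below assumes the raw tool output is available as bytes; `binary_indicators`, the 64-byte minimum for the base64 check, and the 10% control-byte threshold are illustrative choices, not values taken from any tool's documentation.

```python
import re

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # standard 8-byte PNG file signature
JPEG_SOI = b"\xff\xd8\xff"            # JPEG start-of-image marker
BASE64_ONLY = re.compile(rb"[A-Za-z0-9+/=\r\n]+")


def binary_indicators(data: bytes) -> list:
    """Return the names of the binary-data indicators present in data."""
    found = []
    if data.lstrip().startswith(b"%PDF-"):
        found.append("pdf_header")
    if PNG_SIGNATURE in data or JPEG_SOI in data:
        found.append("image_marker")
    # Long runs made only of the base64 alphabet (no spaces) are unlikely to be prose.
    if len(data) >= 64 and BASE64_ONLY.fullmatch(data):
        found.append("base64_like")
    # Control bytes other than tab, newline, carriage return suggest raw binary.
    control = sum(1 for b in data if b < 32 and b not in (9, 10, 13))
    if data and control / len(data) > 0.1:
        found.append("non_printable")
    return found


print(binary_indicators(b"%PDF-1.7\n\x00\x01\x02"))
print(binary_indicators(b"Plain readable text.\n"))
```

Any non-empty result is a reason to move on to Step 2 rather than retrying read_file.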
Step 2: Escalate to run_shell with pdftotext
When read_file returns binary data, do NOT attempt execute_code_sandbox yet. Use run_shell immediately:
    Tool: run_shell
    Command: pdftotext document.pdf document.txt
If pdftotext is not available:
    Tool: run_shell
    Command: apt-get update && apt-get install -y poppler-utils && pdftotext document.pdf document.txt
Then read the extracted text:
    Tool: read_file
    Path: document.txt
Expected outcome: Clean text extraction
If this fails:
- Check if file is password-protected
- Check if file is corrupted (run: file document.pdf)
- Proceed to Step 3
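The failure checks above can be run as a pre-flight before retrying. This is a heuristic sketch, assuming the PDF bytes can be read locally: `preflight` and `looks_encrypted` are hypothetical helpers, and searching for the `/Encrypt` key only suggests encryption; it does not prove a password is required.

```python
import shutil


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: encrypted PDFs reference an /Encrypt dictionary."""
    return b"/Encrypt" in pdf_bytes


def preflight(pdf_bytes: bytes) -> list:
    """List reasons why pdftotext is likely to fail for this file."""
    problems = []
    if shutil.which("pdftotext") is None:
        problems.append("pdftotext not on PATH")
    if not pdf_bytes.startswith(b"%PDF-"):
        problems.append("missing %PDF- header (possibly corrupted)")
    if looks_encrypted(pdf_bytes):
        problems.append("encrypted (may need a password)")
    return problems


print(preflight(b"not a pdf at all"))
```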
Step 3: Final Fallback to execute_code_sandbox with PyMuPDF
Only attempt this if Steps 1-2 fail:
    Tool: execute_code_sandbox
    Language: python
    Code: |
      import fitz  # PyMuPDF
      try:
          doc = fitz.open("document.pdf")
          text = ""
          for page in doc:
              text += page.get_text()
          doc.close()
          with open("document_pymupdf.txt", "w") as f:
              f.write(text)
          print("SUCCESS: Extracted {} characters".format(len(text)))
      except Exception as e:
          print(f"FAILED: {e}")
Then read the result:
    Tool: read_file
    Path: document_pymupdf.txt
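Because Step 3 depends on PyMuPDF being importable in the sandbox, it can help to probe for the module before running the extraction. The sketch below uses only the standard library; `module_available` and `pymupdf_import_name` are hypothetical helper names, and checking `pymupdf` in addition to `fitz` assumes a recent PyMuPDF release that exposes both import names.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True when a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


def pymupdf_import_name():
    """Return the first importable PyMuPDF module name, or None if it is missing."""
    for name in ("pymupdf", "fitz"):
        if module_available(name):
            return name
    return None


print(pymupdf_import_name())
```

A None result is the signal to install the package first or to record the limitation instead of running the extraction script.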
Step 4: Quality Verification
Regardless of which method succeeded, verify extraction quality:
- Check text length: Should be proportional to PDF pages (~500-2000 chars per page)
- Check readability: Text should form coherent sentences
- Check for truncation: Look for cut-off words or missing sections
- Compare methods: If multiple methods worked, compare outputs
If quality is poor:
- Try alternative extraction tools (pdfplumber, camelot-py for tables)
- Consider OCR for scanned documents
- Document limitations clearly
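The length and readability checks can be approximated in code. This is a rough sketch: the 500-2000 characters-per-page range comes from the list above, while the 0.95 printable ratio and the 2-12 average word length bounds are assumptions chosen for illustration.

```python
def quality_report(text: str, page_count: int) -> dict:
    """Summarize simple quality signals for extracted text."""
    chars_per_page = len(text) / page_count if page_count else 0.0
    printable = sum(1 for c in text if c.isprintable() or c in "\n\r\t")
    printable_ratio = printable / len(text) if text else 0.0
    words = text.split()
    avg_word_len = sum(map(len, words)) / len(words) if words else 0.0
    return {
        "chars_per_page": chars_per_page,
        "length_ok": 500 <= chars_per_page <= 2000,
        "printable_ratio": printable_ratio,
        # Assumed bounds: mostly printable, with word lengths typical of prose.
        "readable": printable_ratio > 0.95 and 2 <= avg_word_len <= 12,
    }


sample = "The quick brown fox jumps over the lazy dog. " * 20
print(quality_report(sample, page_count=1))
```

These signals do not detect truncation or missing sections, so a manual read of the start and end of the output is still needed.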
Step 5: Graceful Degradation to Domain Knowledge
If all extraction methods fail:
- Document the specific failure mode for each tool attempted
- Extract any partial content that was successfully retrieved
- Supplement missing content from established domain knowledge
- Clearly mark which portions are from source vs. generated from knowledge
- Provide citations for any claimed requirements or specifications
Example degradation note:
    NOTE: Source document [path/URL] was inaccessible due to [specific tool failures].
    Content below combines partial extraction with established domain knowledge
    for [topic]. All claims verified against [alternative sources] where possible.

    Tool Failure Log:
    - read_file: Returned binary data (no text extraction)
    - run_shell/pdftotext: Command not available in environment
    - execute_code_sandbox/PyMuPDF: Sandbox execution failed with [error]
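A note in this shape can be generated from the failure log so that no attempted tool is left out. The sketch below is one possible formatting helper; `degradation_note` and its parameters are hypothetical, and the wording follows the example note only loosely.

```python
def degradation_note(source: str, topic: str, failures: dict) -> str:
    """Build a degradation note from a mapping of tool name to failure reason."""
    lines = [
        f"NOTE: Source document {source} was inaccessible due to tool failures.",
        "Content below combines partial extraction with established domain "
        f"knowledge for {topic}.",
        "",
        "Tool Failure Log:",
    ]
    # One log line per tool, in the order the tools were attempted.
    lines += [f"- {tool}: {reason}" for tool, reason in failures.items()]
    return "\n".join(lines)


note = degradation_note(
    "document.pdf",
    "the document's topic",
    {
        "read_file": "Returned binary data (no text extraction)",
        "run_shell/pdftotext": "Command not available in environment",
    },
)
print(note)
```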
Complete Tool Orchestration Script
    # pdf-extract-orchestrator.py
    # Implements the progressive tool fallback pattern.
    # read_file, run_shell, and execute_code_sandbox stand for the agent tools
    # named in this skill; they are assumed to be callable from this script.
    import shlex


    def extract_pdf_text(pdf_path):
        """
        Progressive PDF extraction following tool precedence:
        1. read_file (quick check)
        2. run_shell + pdftotext (primary extraction)
        3. execute_code_sandbox + PyMuPDF (final fallback)
        """
        extraction_log = []

        # Step 1: Try read_file
        print("Step 1: Attempting read_file...")
        try:
            content = read_file(pdf_path)
            if is_binary_or_image_data(content):
                extraction_log.append("read_file: Returned binary data")
                # Proceed to Step 2
            else:
                extraction_log.append("read_file: Success")
                return content, extraction_log
        except Exception as e:
            extraction_log.append(f"read_file: Failed - {e}")

        # Step 2: Try run_shell with pdftotext
        print("Step 2: Attempting run_shell + pdftotext...")
        try:
            # Quote the path so spaces and shell metacharacters are handled safely
            run_shell(f"pdftotext {shlex.quote(pdf_path)} output.txt")
            content = read_file("output.txt")
            if content and len(content) > 100:
                extraction_log.append("run_shell/pdftotext: Success")
                return content, extraction_log
            else:
                extraction_log.append("run_shell/pdftotext: Empty extraction")
        except Exception as e:
            extraction_log.append(f"run_shell/pdftotext: Failed - {e}")

        # Step 3: Try execute_code_sandbox with PyMuPDF
        print("Step 3: Attempting execute_code_sandbox + PyMuPDF...")
        try:
            # {pdf_path!r} embeds the path as a valid Python string literal
            code = (
                "import fitz\n"
                f"doc = fitz.open({pdf_path!r})\n"
                'text = ""\n'
                "for page in doc:\n"
                "    text += page.get_text()\n"
                "doc.close()\n"
                "print(text[:1000])  # Preview\n"
            )
            result = execute_code_sandbox(language="python", code=code)
            extraction_log.append("execute_code_sandbox/PyMuPDF: Success")
            return result, extraction_log
        except Exception as e:
            extraction_log.append(f"execute_code_sandbox/PyMuPDF: Failed - {e}")

        # Step 4: All methods failed
        extraction_log.append("ALL METHODS FAILED - Escalate to domain knowledge")
        return None, extraction_log


    def is_binary_or_image_data(content):
        """Detect if content is binary/image data rather than extracted text"""
        if not content:
            return True
        # Check for PDF header without text extraction
        if content.startswith("%PDF-"):
            return True
        # Check for high ratio of non-printable characters
        non_printable = sum(1 for c in content if ord(c) < 32 and c not in '\n\r\t')
        if len(content) > 0 and non_printable / len(content) > 0.1:
            return True
        return False
Tool Precedence Decision Tree
                ┌─────────────────┐
                │   Start: PDF    │
                │   Available?    │
                └────────┬────────┘
                         │
                ┌────────▼────────┐
                │     Step 1:     │
                │    read_file    │
                └────────┬────────┘
                         │
          ┌──────────────┼──────────────┐
          │              │              │
    ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
    │   Text    │  │  Binary   │  │  Error/   │
    │ Returned  │  │   Data    │  │ Not Found │
    └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
          │              │              │
          │         ┌────▼─────┐   ┌────▼─────┐
          │         │ Step 2:  │   │ Download │
          │         │ run_shell│   │ or Fix   │
          │         │ pdftotext│   │ Path     │
          │         └────┬─────┘   └──────────┘
          │              │
          │         ┌────▼─────┐
          │         │ Success? │
          │         └────┬─────┘
          │              │
    ┌─────▼─────┐  ┌─────▼─────┐
    │    Yes    │  │    No     │
    └─────┬─────┘  └─────┬─────┘
          │              │
          │         ┌────▼─────────┐
          │         │ Step 3:      │
          │         │ execute_     │
          │         │ code_sandbox │
          │         │ PyMuPDF      │
          │         └──────────────┘
          │
    ┌─────▼──────────────────┐
    │ Step 4: Quality Check  │
    │ Step 5: Document       │
    │         Limitations    │
    └────────────────────────┘
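The same decision tree can be written as a pure function, which makes the branching easy to test. This is a sketch under stated assumptions: `next_action` is a hypothetical name, the read_file outcome is reduced to one of three labels, and `pdftotext_ok` stays None until Step 2 has been attempted.

```python
from typing import Optional


def next_action(read_file_result: str, pdftotext_ok: Optional[bool] = None) -> str:
    """Return the next step given the outcomes observed so far.

    read_file_result is one of "text", "binary", or "error".
    """
    if read_file_result == "text":
        return "Step 4: quality check"
    if read_file_result == "error":
        return "Download or fix path"
    if read_file_result != "binary":
        raise ValueError(f"unknown read_file result: {read_file_result!r}")
    if pdftotext_ok is None:
        # Binary data: escalate straight to pdftotext, not to the sandbox.
        return "Step 2: run_shell pdftotext"
    if pdftotext_ok:
        return "Step 4: quality check"
    return "Step 3: execute_code_sandbox PyMuPDF"


print(next_action("binary"))
print(next_action("binary", pdftotext_ok=False))
```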
Best Practices
- Check read_file output immediately: Don't assume it extracted text - verify before proceeding
- Escalate quickly on binary data: Don't waste iterations trying read_file multiple times
- Prefer run_shell over execute_code_sandbox: Shell tools are more reliable for PDF extraction when available
- Log each tool attempt: Document which method succeeded for future reference
- Preserve extraction artifacts: Keep intermediate files for debugging
- Verify extraction quality: Check text length and readability before accepting results
- Document tool failures: When falling back to domain knowledge, specify which tools failed and why
Common Failure Modes by Tool
| Tool | Symptom | Cause | Solution |
|---|---|---|---|
| read_file | Binary PDF data | Tool doesn't extract PDF text | Escalate to run_shell immediately |
| read_file | PNG/JPEG data | PDF contains embedded images | Use OCR tools or request text version |
| run_shell | pdftotext not found | Tool not installed | Install poppler-utils first |
| run_shell | Empty output | Password-protected PDF | Request accessible version |
| execute_code_sandbox | Unknown error | Sandbox execution issue | Try run_shell alternative or document limitation |
| execute_code_sandbox | Import error | PyMuPDF not installed | Include pip install in script |
When to Use This Skill
- PDFs from web downloads: After downloading, apply this extraction workflow
- PDFs already local: Start at Step 1 with existing file path
- Automated document processing: Where reliability matters more than speed
- Regulatory/compliance documents: Where source verification is critical
- Situations with tool uncertainty: When environment capabilities are unknown
Migration from Parent Skill
This skill enhances pdf-download-extract-fallback by:
- Explicit tool sequencing: Parent described shell commands; this specifies agent tool order
- Binary detection: Parent assumed download success; this checks read_file output quality
- Faster escalation: Parent tried pdftotext then PyMuPDF; this escalates immediately on binary data
- Agent-focused: Parent was shell-script focused; this is optimized for agent tool calls
- Execution insights: Incorporates learnings from failed task 0353ee0c showing read_file limitations