Skill v1.0.2
name: pdf-extract-create-workflow
description: Complete PDF lifecycle: download, extract, and generate structured documents with reportlab
PDF Extract and Create Workflow
This skill provides a complete PDF lifecycle workflow: acquiring PDF documents from web sources or local files, extracting their text content, and generating new structured PDFs from processed data, with fallback mechanisms at each stage.
Overview
When working with PDFs, you may need to:
- Download PDFs from web sources (including sites with anti-bot protection)
- Extract text content from PDFs (with fallback strategies)
- Generate new PDFs from processed data (with professional formatting)
This workflow improves the success rate by trying progressively more tolerant fallback strategies for extraction and by using templated approaches for generation.
Entry Point: Determine Your Starting Point
Before beginning, identify your scenario:
| Scenario | Start Here | Skip |
|---|---|---|
| PDF already on local disk | Step 2 (Verify File Type) | Step 1 (Download) |
| PDF at a web URL | Step 1 (Download) | None |
| Need to CREATE a PDF from data | Mode C (Generate) | Modes A & B |
| Need to extract AND create | Mode A/B → Mode C | None |
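The table above can be read as a small decision function. The following is a minimal sketch, not part of the original skill: the function name `choose_start` and its parameters are hypothetical, and the URL test mirrors the `^https?://` check used by the workflow script later in this document.

```python
import re

# Same idea as the shell script's [[ "$INPUT" =~ ^https?:// ]] test.
URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)


def choose_start(source=None, generate_only=False):
    """Return the starting point from the scenario table (hypothetical helper)."""
    if generate_only or source is None:
        return "Mode C (Generate)"
    if URL_PATTERN.match(source):
        return "Step 1 (Download)"
    return "Step 2 (Verify File Type)"


if __name__ == "__main__":
    print(choose_start("https://example.com/report.pdf"))
    print(choose_start("reports/local.pdf"))
    print(choose_start(generate_only=True))
```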
Mode A: Web URL Download
Step 1: Download PDF with Browser User-Agent
Many PDF hosting sites redirect requests or block automated clients. Use curl with a realistic browser user-agent. Note that curl follows HTTP redirects only; it does not execute JavaScript-based redirects:
curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o output.pdf "URL_HERE"
Key flags:
- -L: Follow redirects
- -A: Set user-agent header to mimic a real browser
- -o: Specify output filename
Additional headers for difficult sites:
    curl -L \
      -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" \
      -H "Accept: application/pdf,*/*" \
      -H "Accept-Language: en-US,en;q=0.9" \
      -H "Connection: keep-alive" \
      -o output.pdf "URL_HERE"
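When curl is unavailable, the same request can be approximated from Python's standard library. This is a sketch under assumptions: `build_pdf_request` and `download_pdf` are hypothetical names, the header values are copied from the curl example above, and a site that blocks curl may well block this too.

```python
import shutil
import urllib.request

# Header values taken from the curl example; adjust as needed.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/pdf,*/*",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_pdf_request(url):
    """Create a request that carries browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def download_pdf(url, output_path="output.pdf", timeout=30):
    """Stream the response body to disk; HTTP redirects are followed by default."""
    request = build_pdf_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(output_path, "wb") as out:
            shutil.copyfileobj(response, out)
    return output_path
```

`urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so a blocked request surfaces as an exception instead of an HTML error page saved under a .pdf name.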
Mode B: Local File Processing & Extraction
If you already have the PDF file locally, skip Step 1 and begin here:
Step 2: Verify File Type Before Parsing
Always validate the downloaded file is actually a PDF before attempting extraction:
file output.pdf
Expected output should contain "PDF document". If not:
- The URL may have redirected to an HTML error page
- The file may be corrupted
- Access may be blocked
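On systems without the file utility, the same check can be sketched in Python by inspecting the first bytes of the file. The helper names below are hypothetical, and the check is deliberately strict: it expects the `%PDF-` signature at the start of the file (after optional whitespace) and treats an HTML preamble as a sign that the download returned an error page.

```python
def classify_header(head):
    """Classify the leading bytes of a file as 'pdf', 'html', or 'unknown'."""
    stripped = head.lstrip()
    if stripped.startswith(b"%PDF-"):
        return "pdf"
    if stripped.lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


def sniff_file_kind(path, probe_bytes=1024):
    """Read the first bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(probe_bytes))
```

A result of "html" usually corresponds to the first bullet above (a redirect to an error page); "unknown" points to corruption or an unexpected file type.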
Step 3: Primary Extraction with pdftotext
First attempt extraction using the standard pdftotext utility (part of poppler-utils):
pdftotext output.pdf output.txt
If pdftotext is not available, install it:
    # Debian/Ubuntu
    apt-get update && apt-get install -y poppler-utils

    # macOS
    brew install poppler

    # RHEL/CentOS
    yum install -y poppler-utils
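If the extraction is driven from Python, the availability check and the pdftotext call can be wrapped as follows. This is a sketch with hypothetical function names; the `-layout` option is a standard pdftotext flag that preserves the physical layout of the page, and it is optional here.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, layout=False):
    """Build the argument list for pdftotext."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # keep the original physical layout
    command += [pdf_path, txt_path]
    return command


def run_pdftotext(pdf_path, txt_path, layout=False):
    """Return True on success, False if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return False  # caller should move on to the next fallback
    result = subprocess.run(
        pdftotext_command(pdf_path, txt_path, layout),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```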
Step 4: Fallback to PyMuPDF (fitz)
If pdftotext fails or produces poor results, use Python's PyMuPDF library:
    import fitz  # PyMuPDF

    doc = fitz.open("output.pdf")
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()

    with open("output.txt", "w") as f:
        f.write(text)
Install if needed:
pip install pymupdf
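Step 4 depends on judging when pdftotext "produces poor results", which the text does not define. One possible heuristic is sketched below; the function name and both thresholds are assumptions chosen for illustration, not values from the original workflow.

```python
def looks_poor(text, page_count, min_chars_per_page=50, min_printable_ratio=0.9):
    """Heuristic: True if extracted text is empty, very sparse, or mostly junk.

    The thresholds are illustrative assumptions; tune them for your documents.
    """
    stripped = text.strip()
    if not stripped:
        return True  # nothing extracted (e.g. scanned or protected PDF)
    if page_count > 0 and len(stripped) / page_count < min_chars_per_page:
        return True  # suspiciously little text per page
    printable = sum(1 for ch in stripped if ch.isprintable() or ch.isspace())
    return printable / len(stripped) < min_printable_ratio
```

When this returns True for the pdftotext output, rerun the extraction with PyMuPDF and keep whichever result passes.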
Step 5: Graceful Degradation to Domain Knowledge
If the PDF cannot be accessed or extracted after all attempts:
- Document the failure mode (network issue, corrupted file, access denied, etc.)
- Extract any partial content that was successfully retrieved
- Supplement missing content from established domain knowledge
- Clearly mark which portions are from source vs. generated from knowledge
- Provide citations for any claimed requirements or specifications
Example degradation note:
    NOTE: Source document [URL] was inaccessible due to [reason].
    Content below combines partial extraction with established domain knowledge
    for [topic]. Verify against official sources when available.
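To keep the note consistent across runs, the template can be filled in programmatically. A minimal sketch, assuming a hypothetical helper named `degradation_note` whose three parameters correspond to the bracketed placeholders above:

```python
def degradation_note(url, reason, topic):
    """Fill the degradation note template with concrete values."""
    return (
        f"NOTE: Source document {url} was inaccessible due to {reason}.\n"
        "Content below combines partial extraction with established domain knowledge\n"
        f"for {topic}. Verify against official sources when available."
    )


if __name__ == "__main__":
    print(degradation_note("https://example.com/spec.pdf", "HTTP 403", "the example topic"))
```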
Mode C: PDF Generation with ReportLab
After extracting or processing data, generate professional PDFs using Python's reportlab library.
Installation
pip install reportlab
Step C1: Basic Document Structure
Create a multi-page PDF with title page, sections, and proper formatting:
    from reportlab.lib.pagesizes import letter
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
    from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
    from reportlab.lib.units import inch
    from reportlab.lib.enums import TA_CENTER, TA_LEFT


    def create_structured_pdf(output_path, title, sections):
        """Create a structured PDF with title page and sections.

        Args:
            output_path: Path for output PDF
            title: Document title
            sections: List of dicts with 'heading' and 'content' keys
        """
        doc = SimpleDocTemplate(
            output_path,
            pagesize=letter,
            rightMargin=72,
            leftMargin=72,
            topMargin=72,
            bottomMargin=72,
        )
        styles = getSampleStyleSheet()
        story = []

        # Title Page
        title_style = ParagraphStyle(
            'CustomTitle',
            parent=styles['Heading1'],
            fontSize=24,
            alignment=TA_CENTER,
            spaceAfter=30,
        )
        story.append(Paragraph(title, title_style))
        story.append(Spacer(1, 2*inch))
        story.append(PageBreak())

        # Content Sections
        heading_style = ParagraphStyle(
            'CustomHeading',
            parent=styles['Heading2'],
            fontSize=16,
            spaceBefore=12,
            spaceAfter=6,
        )
        body_style = ParagraphStyle(
            'CustomBody',
            parent=styles['Normal'],
            fontSize=11,
            leading=14,
            spaceAfter=12,
        )

        for section in sections:
            story.append(Paragraph(section['heading'], heading_style))
            # Handle long text by splitting into paragraphs
            for paragraph in section['content'].split('\n\n'):
                if paragraph.strip():
                    story.append(Paragraph(paragraph, body_style))
            story.append(Spacer(1, 0.2*inch))

        doc.build(story)
        print(f"PDF created: {output_path}")
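ReportLab's Paragraph interprets its text as XML-like markup, so extracted text containing characters such as `<` or `&` can be misread or rejected. A small pre-processing step, sketched here with a hypothetical helper name, splits content the same way as the function above and escapes each chunk with the standard library before it is passed to Paragraph.

```python
from xml.sax.saxutils import escape


def to_paragraph_chunks(content):
    """Split content on blank lines and escape &, <, > in each chunk."""
    chunks = []
    for chunk in content.split("\n\n"):
        if chunk.strip():
            chunks.append(escape(chunk.strip()))
    return chunks


if __name__ == "__main__":
    for chunk in to_paragraph_chunks("Use a < b & c.\n\nSecond paragraph."):
        print(chunk)
```

Skip the escaping for content that intentionally uses Paragraph markup such as bold or italic tags.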
Step C2: Adding Tables
For structured data, include tables with proper formatting:
    from reportlab.platypus import Table, TableStyle
    from reportlab.lib import colors


    def create_table(data, col_widths=None):
        """Create a formatted table for PDF.

        Args:
            data: List of lists (rows x columns)
            col_widths: Optional list of column widths
        """
        table = Table(data, colWidths=col_widths)
        table.setStyle(TableStyle([
            # Header row
            ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
            ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
            ('ALIGN', (0, 0), (-1, -1), 'LEFT'),
            ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
            ('FONTSIZE', (0, 0), (-1, 0), 12),
            ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
            # Data rows
            ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
            ('TEXTCOLOR', (0, 1), (-1, -1), colors.black),
            ('FONTNAME', (0, 1), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 1), (-1, -1), 10),
            # Grid
            ('GRID', (0, 0), (-1, -1), 1, colors.black),
            ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey]),
        ]))
        return table
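The table function expects a list of lists with the header as the first row. Processed data often arrives as a list of dicts instead, so a conversion step may be useful. The helper below is a hypothetical sketch; missing keys become empty cells and all values are converted to strings.

```python
def records_to_rows(records, columns):
    """Convert a list of dicts into header + data rows for a table."""
    header = list(columns)
    rows = [[str(record.get(column, "")) for column in columns] for record in records]
    return [header] + rows


if __name__ == "__main__":
    data = records_to_rows(
        [{"name": "alpha", "pages": 12}, {"name": "beta"}],
        ["name", "pages"],
    )
    for row in data:
        print(row)
```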
Step C3: Multi-Section Document Template
Complete example creating an organized document with multiple sections:
    from reportlab.lib.pagesizes import letter
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak, Table, TableStyle
    from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
    from reportlab.lib.units import inch  # needed for the Spacer sizes below
    from reportlab.lib.enums import TA_CENTER
    from reportlab.lib import colors


    def create_report_pdf(output_path, title, subtitle, sections, table_data=None):
        """Create a complete report PDF with title, sections, and optional tables.

        Args:
            output_path: Output PDF path
            title: Main title
            subtitle: Subtitle or date
            sections: List of {'heading': str, 'content': str} dicts
            table_data: Optional list of lists for tables
        """
        doc = SimpleDocTemplate(
            output_path,
            pagesize=letter,
            rightMargin=50,
            leftMargin=50,
            topMargin=50,
            bottomMargin=50,
        )
        styles = getSampleStyleSheet()
        story = []

        # Title Page
        title_style = ParagraphStyle(
            'Title',
            parent=styles['Heading1'],
            fontSize=28,
            alignment=TA_CENTER,
            spaceAfter=20,
            fontName='Helvetica-Bold',
        )
        subtitle_style = ParagraphStyle(
            'Subtitle',
            parent=styles['Normal'],
            fontSize=14,
            alignment=TA_CENTER,
            spaceAfter=50,
            textColor=colors.darkgrey,
        )
        story.append(Paragraph(title, title_style))
        story.append(Paragraph(subtitle, subtitle_style))
        story.append(PageBreak())

        # Content
        heading_style = ParagraphStyle(
            'SectionHeading',
            parent=styles['Heading2'],
            fontSize=16,
            spaceBefore=20,
            spaceAfter=10,
            fontName='Helvetica-Bold',
            textColor=colors.darkblue,
        )
        body_style = ParagraphStyle(
            'Body',
            parent=styles['Normal'],
            fontSize=11,
            leading=15,
            spaceAfter=12,
        )

        for i, section in enumerate(sections):
            story.append(Paragraph(section['heading'], heading_style))
            # Split content into paragraphs
            for para in section['content'].split('\n\n'):
                if para.strip():
                    # Handle very long paragraphs
                    story.append(Paragraph(para, body_style))

            # Add table after specific section if provided
            if table_data and i == 0:
                story.append(Spacer(1, 0.3*inch))
                table = Table(table_data)
                table.setStyle(TableStyle([
                    ('BACKGROUND', (0, 0), (-1, 0), colors.darkblue),
                    ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
                    ('ALIGN', (0, 0), (-1, -1), 'LEFT'),
                    ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
                    ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
                    ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey]),
                ]))
                story.append(table)
                story.append(Spacer(1, 0.3*inch))

            if i < len(sections) - 1:
                story.append(PageBreak())

        doc.build(story)
        return output_path


    # Example usage
    if __name__ == "__main__":
        sections = [
            {
                'heading': 'Section 1: Overview',
                'content': 'This is the first section content...\n\nAdditional paragraph here.'
            },
            {
                'heading': 'Section 2: Details',
                'content': 'Detailed information goes here...'
            }
        ]
        table_data = [
            ['Header 1', 'Header 2', 'Header 3'],
            ['Row 1 Col 1', 'Row 1 Col 2', 'Row 1 Col 3'],
            ['Row 2 Col 1', 'Row 2 Col 2', 'Row 2 Col 3'],
        ]
        create_report_pdf(
            "output_report.pdf",
            "Report Title",
            "Generated: 2024",
            sections,
            table_data
        )
Step C4: Error Handling for PDF Generation
PDF generation can fail in multiple ways. Handle gracefully:
    import os
    import time
    import traceback


    def safe_pdf_generation(output_path, title, sections, subtitle="", max_retries=3):
        """Generate PDF with retry logic and error handling."""
        for attempt in range(max_retries):
            try:
                # create_report_pdf (Step C3) takes the subtitle before the sections
                create_report_pdf(output_path, title, subtitle, sections)
                # Verify file was created
                if os.path.exists(output_path) and os.path.getsize(output_path) > 0:
                    print(f"✓ PDF generated successfully: {output_path}")
                    return True
                else:
                    raise Exception("PDF file empty or not created")
            except Exception as e:
                print(f"Attempt {attempt + 1}/{max_retries} failed: {e}")
                if attempt < max_retries - 1:
                    time.sleep(1)  # Brief delay before retry
                else:
                    print(f"PDF generation failed after {max_retries} attempts")
                    print(traceback.format_exc())
                    # Fallback: create minimal text file
                    with open(output_path.replace('.pdf', '.txt'), 'w') as f:
                        f.write(f"Title: {title}\n\n")
                        for section in sections:
                            f.write(f"{section['heading']}\n{section['content']}\n\n")
                    return False
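The fallback above derives the text file name with `output_path.replace('.pdf', '.txt')`, which also rewrites any ".pdf" that appears earlier in the path and does nothing when the extension is missing. A possible alternative, sketched with hypothetical helper names, uses pathlib to change only the final suffix and builds the same fallback text as a string.

```python
from pathlib import Path


def text_fallback_path(output_path):
    """Return the path with only its final suffix replaced by .txt."""
    return Path(output_path).with_suffix(".txt")


def render_text_fallback(title, sections):
    """Build the plain-text fallback content in the same layout as above."""
    parts = [f"Title: {title}\n\n"]
    for section in sections:
        parts.append(f"{section['heading']}\n{section['content']}\n\n")
    return "".join(parts)


if __name__ == "__main__":
    print(text_fallback_path("my.pdf.files/report.pdf"))
    print(render_text_fallback("Demo", [{"heading": "Intro", "content": "Body text"}]))
```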
Step C5: Best Practices for PDF Generation
- Page Breaks: Insert PageBreak() between major sections for readability
- Consistent Styling: Define ParagraphStyle objects once and reuse
- Text Wrapping: ReportLab handles wrapping automatically; split long content with \n\n
- Margins: Use at least 50-72 point margins for standard letter size
- Font Selection: Stick to Helvetica, Times-Roman, or Courier for compatibility
- File Verification: Always check the PDF was created and has content > 0 bytes
- Error Recovery: Have a fallback (e.g., .txt output) if PDF generation fails
- Memory Management: For very large documents, build in chunks or use separate files
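The last practice, building very large documents in chunks or separate files, can be sketched as follows. Both helper names and the "_partN" naming scheme are assumptions for illustration; each chunk would then be passed to the generation function with its own output path.

```python
from pathlib import Path


def chunk_sections(sections, chunk_size):
    """Split the section list into consecutive groups of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [sections[i:i + chunk_size] for i in range(0, len(sections), chunk_size)]


def part_path(output_path, part_number):
    """Derive a per-chunk file name, e.g. report.pdf -> report_part2.pdf."""
    path = Path(output_path)
    return path.with_name(f"{path.stem}_part{part_number}{path.suffix}")


if __name__ == "__main__":
    groups = chunk_sections(["s1", "s2", "s3", "s4", "s5"], 2)
    for number, group in enumerate(groups, start=1):
        print(part_path("report.pdf", number), group)
```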
Complete Workflow Script (Handles Download, Extract, and Generate)
    #!/bin/bash
    # pdf-lifecycle-workflow.sh
    # Handles URL downloads, local files, and PDF generation

    INPUT="$1"
    MODE="${2:-extract}"  # extract, generate, or both
    OUTPUT_PDF="downloaded.pdf"
    OUTPUT_TXT="extracted.txt"
    OUTPUT_REPORT="generated_report.pdf"

    if [[ "$MODE" == "generate" ]]; then
        echo "Mode: PDF Generation"
        python3 generate_pdf.py
        exit $?
    fi

    if [[ "$INPUT" =~ ^https?:// ]]; then
        # Mode A: URL download
        PDF_URL="$INPUT"
        echo "Downloading PDF from URL..."
        curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$PDF_URL"
    else
        # Mode B: Local file
        if [ ! -f "$INPUT" ]; then
            echo "ERROR: Local file not found: $INPUT"
            exit 1
        fi
        OUTPUT_PDF="$INPUT"
        echo "Using local file: $INPUT"
    fi

    # Step 2: Verify file type
    echo "Verifying file type..."
    if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
        echo "WARNING: File is not a valid PDF"
        echo "Attempting fallback extraction anyway..."
    fi

    # Step 3: Try pdftotext
    echo "Attempting pdftotext extraction..."
    if command -v pdftotext &> /dev/null; then
        if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null; then
            echo "Extraction successful with pdftotext"
            if [[ "$MODE" == "both" ]]; then
                echo "Proceeding to PDF generation..."
                python3 generate_pdf.py
            fi
            exit 0
        fi
    fi

    # Step 4: Fallback to PyMuPDF
    # The input and output paths are passed as arguments so that local files
    # (Mode B) are read instead of a hardcoded downloaded.pdf.
    echo "Falling back to PyMuPDF..."
    python3 - "$OUTPUT_PDF" "$OUTPUT_TXT" << 'PYTHON_SCRIPT'
    import sys
    import fitz

    pdf_path, txt_path = sys.argv[1], sys.argv[2]
    try:
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            text += page.get_text()
        doc.close()
        with open(txt_path, "w") as f:
            f.write(text)
        print("Extraction successful with PyMuPDF")
    except Exception as e:
        print(f"PyMuPDF failed: {e}")
        sys.exit(1)
    PYTHON_SCRIPT

    # Step 5: Handle complete failure
    if [ $? -ne 0 ]; then
        echo "ERROR: All extraction methods failed."
        echo "ACTION: Generate content from domain knowledge and clearly mark source limitations."
    fi

    if [[ "$MODE" == "both" ]]; then
        echo "Proceeding to PDF generation with extracted/fallback content..."
        python3 generate_pdf.py
    fi
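The script calls generate_pdf.py, which is not shown in this skill. Its input side might look like the sketch below, which turns the extracted text into the `sections` structure used by the Mode C functions. The function name and the "Page N" headings are assumptions. The split relies on pdftotext's default behavior of writing a form feed character between pages; the PyMuPDF fallback as written does not insert form feeds, so its output would come back as a single section.

```python
def text_to_sections(text):
    """Split extracted text on form feeds into one section per non-empty page."""
    sections = []
    for number, page in enumerate(text.split("\f"), start=1):
        content = page.strip()
        if content:
            sections.append({"heading": f"Page {number}", "content": content})
    return sections


if __name__ == "__main__":
    sample = "First page text.\n\nMore text.\fSecond page text.\f"
    for section in text_to_sections(sample):
        print(section["heading"], "->", repr(section["content"]))
```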
Common Failure Modes & Solutions
| Symptom | Cause | Solution |
|---|---|---|
| HTML content in PDF | URL redirected to error page | Check HTTP status, try alternate URL |
| Empty extraction | Password-protected or scanned PDF | Try OCR tools or request accessible version |
| Garbled text | Encoding issues | Try PyMuPDF with different extraction mode |
| Curl blocked | Anti-bot measures | Add more headers, use delay between requests |
| PDF generation fails | Missing fonts or memory | Use standard fonts, build in chunks |
| ReportLab errors | Version incompatibility | Use pip install --upgrade reportlab |
| Unknown shell_agent error | Timeout on complex operations | Use direct Python execution instead |
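For the "use delay between requests" remedy, one common pattern is to retry with growing delays. The sketch below is illustrative only: the function names, the doubling schedule, and the 30-second cap are assumptions, and `action` stands for any download callable that raises OSError (which includes urllib's URLError) when blocked.

```python
import time


def backoff_delays(attempts, base=1.0, factor=2.0, cap=30.0):
    """Return a list of delays that grow geometrically up to a cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def retry_with_delays(action, delays, sleep=time.sleep):
    """Call action(); after each OSError wait the next delay and try again."""
    for delay in delays:
        try:
            return action()
        except OSError:
            sleep(delay)
    return action()  # final attempt; any error propagates to the caller


if __name__ == "__main__":
    print(backoff_delays(4))
```

The `sleep` parameter exists so the waiting can be replaced in tests or dry runs.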
When to Use This Skill
| Mode | Use Case |
|---|---|
| Mode A (URL download) | Downloading regulatory documents from government websites |
| Mode B (Local file) | Processing PDFs already saved to disk |
| Mode C (Generate) | Creating reports from extracted/processed data |
| Both (extract + generate) | Full pipeline: acquire → process → report |
Specific Scenarios
- Extracting content from technical manuals or handbooks
- Processing PDFs in automated pipelines where reliability matters
- Creating structured reports from multiple data sources
- Generating documentation with consistent formatting
- Any situation where PDF access may be unreliable or restricted
- Producing professional PDFs from text data with tables and sections
Quick Reference: Mode Selection
    Need to get PDF from web?  → Mode A (Download)
    Have PDF file already?     → Mode B (Extract)
    Need to CREATE a PDF?      → Mode C (Generate)
    Need full pipeline?        → Mode A/B → Mode C
    Extraction failed?         → Step 5 (Domain knowledge fallback)
    Generation failed?         → Step C4 (Error handling + txt fallback)