Extract & Analyze Data from Multiple PDFs
Extract data from multiple PDFs and create summary tables with basic analysis. No manual data entry, no copy-paste, no hours of tedious work.
Why This Matters
The problem: Manually extracting data from multiple PDFs is time-consuming and error-prone, often taking 2-3 hours with no guarantee of accuracy.
The FabriXWork way: Your agent extracts data from all PDFs and creates summary tables with basic analysis in minutes. Add more PDFs, re-extract instantly.
See It in Action: Invoice Data Extraction
This demo shows how 20 invoice PDFs become a summary table with analysis in minutes.
Note: Some scenes in this video have been accelerated to 10× speed to enhance the viewing experience. The prompt used in this video can be found in the Featured Example section below.
How it Works:
┌──────────────────────────────────────────────────────────┐
│  INPUT                          OUTPUT                   │
│  ┌──────────────────┐           ┌─────────────────────┐  │
│  │ invoice-001.pdf  │           │ Summary Table       │  │
│  │ invoice-002.pdf  │           │ • All Invoices      │  │
│  │ ...              │ ───────►  │ • Totals            │  │
│  │ invoice-020.pdf  │  Agent    │ • Averages          │  │
│  └──────────────────┘  Extract  │ • By Vendor         │  │
│                                 │ • Insights          │  │
│                                 │                     │  │
│                                 │ CSV / Excel         │  │
│                                 │ (Ready to use)      │  │
│                                 └─────────────────────┘  │
└──────────────────────────────────────────────────────────┘
The Pattern:
- You have multiple PDFs with similar structure (invoices, reports, forms)
- Agent extracts specified fields from all PDFs
- Agent creates summary table with basic analysis
- You review, export, and use for reporting/decision-making
Try It Out
Tip
Choose the scenario that best matches your needs, then adapt the prompt to fit your content and goals.
- Prepare your PDFs — Collect all PDFs in one folder (ensure they're text-based, not scanned)
- Choose an agent — All agents support data extraction (Oscar - Operations Analyst is recommended for data analysis, and Claire - Claims Specialist is recommended for claims-related tasks)
- Connect your folder — Select agent → Click "Browse" → Choose folder with PDFs
- Enter your prompt - Use the examples below as inspiration. Adapt them to your content and goals
Featured Example: Invoice Data Extraction
Scenario: You have 20 invoice PDFs from different vendors and need to extract key data into a summary table for accounts payable processing.
Example Files:
invoices/ — Folder with 20 invoice PDFs (invoice-001.pdf through invoice-020.pdf)
Example Prompt:
Tip
Use Plan Mode first to review the proposed extraction structure before building. Learn more about the different modes in How to Interact with an AI Agent.
Extract data from all invoice PDFs in the invoices folder and create a summary table.
**Fields to extract:**
Invoice Number, Invoice Date, Vendor Name, Total Amount, Due Date, PO Number
**Output:**
- CSV file: invoice-summary.csv
- Summary: total amount, average invoice, count by vendor, overdue invoices
- Flag invoices over $10,000 for review
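The figures requested in this prompt can also be recomputed independently as a cross-check on the agent's summary. The following is a minimal sketch, not a FabriXWork feature: it assumes the exported CSV uses the field names from the prompt as column headers, that Total Amount is a plain number without currency symbols, and that Due Date is in YYYY-MM-DD format. The function name `summarize_invoices` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from datetime import date

REVIEW_THRESHOLD = 10_000  # mirrors the "$10,000" rule in the prompt


def summarize_invoices(rows, today):
    """Recompute the summary figures from already-extracted invoice rows."""
    amounts = [float(r["Total Amount"]) for r in rows]
    return {
        "total": sum(amounts),
        "average": sum(amounts) / len(amounts) if amounts else 0.0,
        "count_by_vendor": dict(Counter(r["Vendor Name"] for r in rows)),
        # Overdue here means: due date earlier than the day of the check.
        "overdue": [r["Invoice Number"] for r in rows
                    if date.fromisoformat(r["Due Date"]) < today],
        "over_threshold": [r["Invoice Number"] for r in rows
                           if float(r["Total Amount"]) > REVIEW_THRESHOLD],
    }


# Made-up sample data standing in for invoice-summary.csv.
sample_csv = """Invoice Number,Invoice Date,Vendor Name,Total Amount,Due Date,PO Number
INV-001,2026-01-05,Acme,12000.00,2026-02-04,PO-17
INV-002,2026-01-20,Acme,3000.00,2026-03-20,PO-18
INV-003,2026-02-01,Globex,4500.00,2026-03-03,Not Found
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
summary = summarize_invoices(rows, today=date(2026, 3, 10))
print(summary)
```

With a real export, read the rows from the file instead of the sample text, for example with `csv.DictReader(open("invoice-summary.csv", newline=""))`.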
Make It Your Own
Don't simply copy this prompt; adapt it. Ask yourself:
- What PDFs are you processing: invoices, reports, forms, statements?
- What fields do you need: amounts, dates, names, numbers, codes?
- What analysis do you want: totals, averages, grouping, trends?
Examples:
- Expense reports → "Extract employee, date, amount, category. Summarize by employee and category"
- Purchase orders → "Extract PO number, vendor, items, total. Flag POs over budget"
More Examples to Inspire You
Example 2: Quarterly Financial Report Comparison — See how to compare data across time periods
Scenario: You have quarterly financial reports (PDFs) and need to compare key metrics across quarters to identify trends.
Example Files:
financial-reports/ — Folder with 4 quarterly report PDFs (Q1-2025.pdf through Q4-2025.pdf)
Example Prompt:
Extract and compare data from all quarterly financial report PDFs.
**Fields to extract:**
Revenue, Operating Expenses, Net Profit, Cash Flow, Headcount
**Output:**
- Excel file: financial-comparison.xlsx (quarterly data + comparisons)
- Summary: QoQ growth %, trends, best/worst quarter
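For reference, quarter-over-quarter (QoQ) growth is conventionally computed as (current − previous) / previous × 100. The sketch below shows that calculation on made-up revenue figures so you can verify the agent's percentages by hand; the function name `qoq_growth` is hypothetical, and how the agent itself computes growth is not specified here.

```python
def qoq_growth(values):
    """Return quarter-over-quarter growth in percent for consecutive values.

    The first quarter has no predecessor, so the result has one fewer entry.
    A previous value of zero yields None because growth is undefined.
    """
    growth = []
    for prev, curr in zip(values, values[1:]):
        growth.append(None if prev == 0 else round((curr - prev) / prev * 100, 2))
    return growth


# Made-up revenue figures for Q1..Q4, for illustration only.
revenue = {"Q1": 100.0, "Q2": 110.0, "Q3": 99.0, "Q4": 120.0}
quarters = list(revenue)
growth = qoq_growth(list(revenue.values()))
best = max(revenue, key=revenue.get)    # quarter with the highest revenue
worst = min(revenue, key=revenue.get)   # quarter with the lowest revenue

for quarter, pct in zip(quarters[1:], growth):
    print(f"{quarter}: {pct:+.2f}%")
print("best:", best, "worst:", worst)
```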
Make It Your Own
Adapt this for:
- Monthly sales reports: "Extract revenue by region, compare month-over-month growth"
- Budget vs actual: "Extract budget and actual from each report, calculate variances"
- KPI dashboards: "Extract KPIs from each period, track progress toward targets"
Example 3: Claims Pattern Analysis — See how to find patterns in form data
Scenario: You have 50 claim form PDFs and need to analyze patterns in claim amounts, types, and frequencies.
Example Files:
claims/ — Folder with 50 claim form PDFs
Example Prompt:
Extract data from all claim form PDFs and analyze patterns.
**Fields to extract:**
Claim Number, Claim Date, Claim Type, Claim Amount, Claimant Name, Status
**Output:**
- CSV file: claims-data.csv
- Summary: total claims, approval rate %, by claim type, top 5 highest, outliers
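The summary items in this prompt can be sketched in a few lines, which helps when you want to confirm what "approval rate" or "outliers" means in the agent's output. This is an illustration under stated assumptions: the keys `type`, `amount`, and `status` are simplified stand-ins for the extracted columns, any status other than "Approved" counts as not approved, and outliers are defined with the 1.5 × IQR rule, which is one common convention rather than the agent's documented method.

```python
import statistics
from collections import defaultdict


def analyze_claims(claims):
    """Summarize extracted claim rows (dicts with 'type', 'amount', 'status')."""
    amounts = [c["amount"] for c in claims]
    approved = sum(1 for c in claims if c["status"] == "Approved")

    totals_by_type = defaultdict(float)
    for c in claims:
        totals_by_type[c["type"]] += c["amount"]

    # IQR rule (an assumption): flag values beyond 1.5 * IQR from the quartiles.
    q1, _, q3 = statistics.quantiles(amounts, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

    return {
        "total_claims": len(claims),
        "approval_rate_pct": round(approved / len(claims) * 100, 1),
        "total_by_type": dict(totals_by_type),
        "top_5": sorted(claims, key=lambda c: c["amount"], reverse=True)[:5],
        "outliers": [c for c in claims if not low <= c["amount"] <= high],
    }


# Made-up claims, for illustration only.
claims = [
    {"id": "C1", "type": "Medical", "amount": 100.0, "status": "Approved"},
    {"id": "C2", "type": "Auto", "amount": 120.0, "status": "Approved"},
    {"id": "C3", "type": "Medical", "amount": 130.0, "status": "Denied"},
    {"id": "C4", "type": "Auto", "amount": 150.0, "status": "Approved"},
    {"id": "C5", "type": "Medical", "amount": 160.0, "status": "Approved"},
    {"id": "C6", "type": "Auto", "amount": 170.0, "status": "Approved"},
    {"id": "C7", "type": "Medical", "amount": 180.0, "status": "Pending"},
    {"id": "C8", "type": "Auto", "amount": 5000.0, "status": "Approved"},
]
result = analyze_claims(claims)
print(result["approval_rate_pct"], [c["id"] for c in result["outliers"]])
```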
Make It Your Own
Adapt this for:
- Customer feedback: "Extract ratings and comments, analyze sentiment by product"
- Inspection reports: "Extract pass/fail status, defect types, analyze by location"
- Time sheets: "Extract hours by employee and project, analyze utilization"
Make It Even Better
Quick Wins
- Use text-based PDFs — Scanned PDFs need OCR first. Ensure PDFs have selectable text
- Specify exact fields — e.g. "Extract: Invoice Number, Date, Vendor Name, Total Amount"
- Request summary statistics — e.g. "Calculate: totals, averages, counts by category"
- Define output format upfront — e.g. "Save as CSV with columns: [list columns]"
- Ask for outlier detection — e.g. "Flag any values over [threshold] for review"
Review & Refine
Always verify extracted data before using for reporting or decisions.
What to Check:
- Completeness — All PDFs were processed, none skipped
- Accuracy — Extracted values match the source PDFs
- Format consistency — Dates, currency, numbers formatted correctly
- Missing data — Fields marked "Not Found" are genuinely absent from the source PDF, not just missed by the extraction
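The completeness check in particular is easy to automate. The sketch below compares the PDF names in a folder against the file names recorded in the output. It assumes you asked the agent to include a source-file column in the CSV, which the example prompts above do not request by default; the helper names `list_pdfs` and `find_unprocessed` are hypothetical.

```python
from pathlib import Path


def list_pdfs(folder):
    """List PDF file names in a folder (non-recursive)."""
    return sorted(p.name for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def find_unprocessed(pdf_names, extracted_sources):
    """Return PDF names that have no matching row in the extraction output."""
    seen = {name.lower() for name in extracted_sources}
    return sorted(n for n in pdf_names if n.lower() not in seen)


# Illustration with in-memory names. With real files, pass
# list_pdfs("invoices") and the source-file column read from the CSV.
pdfs = [f"invoice-{i:03d}.pdf" for i in range(1, 21)]
extracted = [n for n in pdfs if n not in ("invoice-007.pdf", "invoice-019.pdf")]
missing = find_unprocessed(pdfs, extracted)
print(f"{len(extracted)} of {len(pdfs)} processed; missing: {missing}")
```

The resulting list of file names can be pasted directly into a correction request.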
How to Request Corrections:
For missing PDFs:
"The extraction only processed 18 of 20 PDFs. Please check if [file names] were included and re-extract."
For incorrect extraction:
"Invoice amounts for [vendor] are incorrect. The correct amounts should be [values]. Please re-extract."
For missing fields:
"The PO Number field shows 'Not Found' for all invoices, but most invoices have PO numbers. Please re-extract this field."
For format issues:
"Dates are in different formats. Please standardize all dates to YYYY-MM-DD format."
PDF Analysis Tips
- Invoices: Extract vendor, date, amount, PO number. Summarize by vendor and month
- Financial Reports: Extract revenue, expenses, profit. Compare periods, calculate growth %
- Forms/Claims: Extract type, amount, date, status. Analyze patterns and outliers
Reference & Details
Advanced Prompting Tips — Get better results with these techniques
1. Specify Exact Fields
✅ Good: "Extract: Invoice Number, Invoice Date, Vendor Name, Total Amount, Due Date"
❌ Vague: "Extract all the important data from the invoices"
2. Define Output Structure
✅ Good: "Save as CSV with columns: Invoice#, Date, Vendor, Amount, Due Date, Status"
❌ Vague: "Create a summary table"
3. Request Calculations
✅ Good: "Calculate: total amount, average invoice, count by vendor, overdue count"
❌ Vague: "Add some summary statistics"
4. Handle Missing Data
✅ Good: "If a field is not found, mark it as 'Not Found' and continue processing"
❌ Vague: "Handle missing data"
5. Define Thresholds
✅ Good: "Flag invoices over $10,000 for review. Highlight vendors with > 5 invoices"
❌ Vague: "Flag unusual items"
6. Specify Formatting
✅ Good: "Format dates as YYYY-MM-DD. Format currency with 2 decimal places. Sort by date"
❌ Vague: "Format it nicely"
7. Iterate on Extraction
After the first extraction:
✅ Good: "Good start! Now also extract the payment terms and add a column for days until due"
❌ Vague: "Add more information"
Common PDF Types — What you can extract from
Financial Documents
| PDF Type | Common Fields | Typical Analysis |
|---|---|---|
| Invoices | Invoice #, Date, Vendor, Amount, Due Date, PO # | Totals by vendor, aging, overdue |
| Receipts | Date, Vendor, Amount, Category, Payment Method | Totals by category, monthly spend |
| Bank Statements | Date, Description, Debit, Credit, Balance | Cash flow, categorization, trends |
| Financial Reports | Revenue, Expenses, Profit, Assets, Liabilities | Period comparison, growth %, ratios |
Business Forms
| PDF Type | Common Fields | Typical Analysis |
|---|---|---|
| Claims Forms | Claim #, Date, Type, Amount, Status | Approval rates, patterns, outliers |
| Purchase Orders | PO #, Date, Vendor, Items, Total | Spend by vendor, budget variance |
| Expense Reports | Employee, Date, Amount, Category, Project | By employee, by project, policy compliance |
| Time Sheets | Employee, Date, Project, Hours | Utilization, project hours, overtime |
Reports & Statements
| PDF Type | Common Fields | Typical Analysis |
|---|---|---|
| Sales Reports | Date, Product, Region, Quantity, Revenue | By product, by region, trends |
| Inventory Reports | SKU, Description, Quantity, Location, Value | Stock levels, turnover, valuation |
| Customer Statements | Customer, Date, Transaction, Amount, Balance | AR aging, payment patterns |
| Compliance Reports | Metric, Target, Actual, Status | Compliance rate, gaps, trends |
Troubleshooting — Common issues and solutions
Quick Fixes
| Issue | What to Try |
|---|---|
| PDF not processed | "Check if [file name] is in the folder and is a valid PDF. Re-process all files" |
| Scanned PDF (image) | "This is a scanned image PDF. Use OCR tool first or convert to text-based PDF" |
| Field not found | "The field [name] is labeled as [different name] in the PDFs. Update field name" |
| Inconsistent formats | "PDFs have different formats. Extract what's common, note variations in report" |
| Wrong data extracted | "The extracted [field] is incorrect. It should be from [section/location] in PDF" |
| Missing calculations | "Add calculations: [list calculations]. Update the summary report" |
| Export not working | "Save the output as [CSV/Excel] format. Ensure file is not locked" |
PDF-Specific Issues
| Issue | PDF Type | What to Try |
|---|---|---|
| Tables not extracted correctly | Financial reports | "Extract table data row by row. Preserve column structure" |
| Multi-page PDFs | Long reports | "Process all pages. Combine data from entire document" |
| Different layouts | Mixed vendors | "Extract based on field labels, not position. Handle format variations" |
| Handwritten notes | Forms | "Skip handwritten fields or mark as 'Manual Review Required'" |
| Password protected | Secure PDFs | "Remove password protection first or provide password" |
| Corrupted files | Any PDF | "Skip corrupted files, list them in report for manual processing" |
Technical Details — How the output works
Output Formats
CSV (Comma-Separated Values)
- ✅ Universal format (opens in Excel, Google Sheets, any spreadsheet)
- ✅ Easy to import into databases or BI tools
- ✅ Lightweight, fast to process
- ❌ No formatting (no colors, formulas, multiple sheets)
- Best For: Data extraction, further analysis, importing to other systems
Excel (XLSX)
- ✅ Multiple sheets (data + summary + analysis)
- ✅ Formulas and calculations
- ✅ Formatting (colors, conditional formatting, charts)
- ✅ Pivot tables for analysis
- Best For: Financial analysis, dashboards, sharing with stakeholders
Markdown Report
- ✅ Readable in any text editor
- ✅ Easy to convert to PDF or HTML
- ✅ Version control friendly
- ✅ Can include tables, charts (as code)
- Best For: Summary reports, insights documentation, sharing findings
Data Quality Considerations
Accuracy Checks:
- Spot-check 10% of extractions against source PDFs
- Verify totals and calculations
- Check for duplicate entries
- Validate date ranges and amount ranges
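Two of these checks, the 10% spot-check and the duplicate check, can be sketched as follows. This is an illustrative helper under assumptions: rows are dicts as produced by `csv.DictReader`, the key column is called "Invoice Number", and the function names are hypothetical. A fixed seed is used so the same sample can be reproduced when a colleague repeats the check.

```python
import math
import random
from collections import Counter


def pick_spot_check(rows, percent=10, seed=0):
    """Pick a reproducible random sample of rows to compare against the PDFs."""
    if not rows:
        return []
    size = max(1, math.ceil(len(rows) * percent / 100))
    return random.Random(seed).sample(rows, size)


def find_duplicates(rows, key="Invoice Number"):
    """Return key values that appear on more than one row."""
    counts = Counter(r[key] for r in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Made-up rows: 19 distinct invoices plus one deliberate duplicate.
rows = [{"Invoice Number": f"INV-{i:03d}"} for i in range(1, 20)]
rows.append({"Invoice Number": "INV-001"})

sample = pick_spot_check(rows)
duplicates = find_duplicates(rows)
print("check these against the PDFs:", [r["Invoice Number"] for r in sample])
print("duplicated invoice numbers:", duplicates)
```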
Handling Variations:
- Different PDF layouts: Extract by field label, not position
- Missing fields: Mark as "Not Found" or "N/A"
- Format differences: Standardize in output (dates, currency, numbers)
- Multi-currency: Convert to base currency or note currency per row
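As an illustration of the "format differences" item, dates can also be standardized after export with a small normalizer. The sketch below tries a list of candidate formats in order; that list is an assumption and should be adjusted to what your PDFs actually contain. Note that an ambiguous date such as 03/04/2026 is read as month/day here, and that month names are parsed according to the current locale (English by default).

```python
from datetime import datetime

# Candidate input formats, tried in order (an assumption, adjust as needed).
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return None


raw_dates = ["2026-03-27", "03/27/2026", "27 Mar 2026", "March 27, 2026", "n/a"]
normalized = [to_iso_date(d) for d in raw_dates]
print(normalized)
```

Values that come back as `None` are good candidates for manual review rather than silent correction.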
Best Practices:
- Keep original PDFs organized in folders
- Name output files with date (e.g., invoice-summary-2026-03-27.csv)
- Document any manual adjustments made
- Save extraction prompts for future use
- Version control: Keep track of extraction iterations
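If you post-process outputs with a script, the dated naming convention can be applied automatically. A minimal sketch, with a hypothetical helper name:

```python
from datetime import date


def dated_filename(stem, extension, on=None):
    """Build a name like invoice-summary-2026-03-27.csv."""
    on = on or date.today()
    return f"{stem}-{on.isoformat()}.{extension.lstrip('.')}"


name = dated_filename("invoice-summary", "csv", on=date(2026, 3, 27))
print(name)
```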
Processing Limits
Typical Capacity:
- 10-50 PDFs: Quick processing (2-5 minutes)
- 50-200 PDFs: Moderate processing (5-15 minutes)
- 200+ PDFs: Consider batching into groups
File Size:
- Individual PDF: Up to 10MB recommended
- Total batch: Up to 500MB for optimal performance
- Larger batches: Split into multiple folders
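Splitting a large collection into batches can be planned with a short script before you create the folders. The sketch below groups files so that each batch stays within a file-count limit and a total-size limit. The defaults (200 files, 500 MB counted as 500 × 1024 × 1024 bytes) are taken from the guidance above as an assumption, and `plan_batches` is a hypothetical helper, not a product feature.

```python
def plan_batches(files, max_files=200, max_bytes=500 * 1024 * 1024):
    """Group (name, size_in_bytes) pairs into batches within both limits.

    Files keep their original order. A single file larger than max_bytes
    still gets a batch of its own rather than being dropped.
    """
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        too_many = len(current) >= max_files
        too_big = bool(current) and current_bytes + size > max_bytes
        if too_many or too_big:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
# Made-up example: 450 claim PDFs of 2 MB each.
files = [(f"claim-{i:03d}.pdf", 2 * MB) for i in range(1, 451)]
batches = plan_batches(files)
print([len(b) for b in batches])
```

For real folders, the sizes can be read with `Path(name).stat().st_size` from the standard `pathlib` module.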
Related Use Cases
- Customer Insight & Strategy Recommendations — Analyze extracted data to generate deeper insights and recommendations
- Auto-Fill Forms (DOCX, Excel, PDF) — Populate forms with extracted data
- Check Documents for Compliance — Validate extracted data against compliance rules
- Create Presentation Slides from Your Documents — Present your analysis findings to stakeholders