gmail-invoice-processor
npx skills add https://github.com/erafat/skills --skill gmail-invoice-processor
Agent 安装分布
Skill 文档
Gmail Invoice Processor
Automate the workflow of finding invoice PDFs in Gmail, extracting key information (vendor name and amount), and creating organized Excel summaries.
Quick Start
- Search Gmail for invoice emails
- Download PDF attachments
- Extract vendor and amount data
- Create formatted Excel summary
- Present results to user
Workflow Steps
Step 1: Search Gmail for Invoices
Use search_gmail_messages with search query targeting subject lines:
query = 'subject:(invoice OR invoices) has:attachment filename:pdf'
Add date filters if needed: after:2025/01/01 or newer_than:30d
Step 2: Download PDF Attachments
IMPORTANT: Gmail API attachment handling has known limitations. Attachments may not be returned in the parts array even when they exist.
Approach A: Check for Attachment Parts (Preferred)
For each message with PDFs:
- Use
read_gmail_threadorread_gmail_messageto get message details - Check if
payload.partsarray contains items withmimeType: 'application/pdf' - If found, decode base64 attachment data and save to
/home/claude/
Example:
import base64
from pathlib import Path
# Check message payload for parts
if message['payload'].get('parts'):
for part in message['payload']['parts']:
if part.get('mimeType') == 'application/pdf' or part.get('filename', '').endswith('.pdf'):
# Extract attachment
if 'data' in part.get('body', {}):
attachment_data = part['body']['data']
pdf_bytes = base64.urlsafe_b64decode(attachment_data)
filename = part.get('filename', 'invoice.pdf')
Path(filename).write_bytes(pdf_bytes)
Approach B: Extract Data from Email Body (Fallback)
When PDFs aren’t accessible via API, extract invoice data directly from email text:
import re
email_body = message['payload']['body'].get('data', '')
if email_body:
# Decode email body
decoded_body = base64.urlsafe_b64decode(email_body).decode('utf-8', errors='ignore')
# Search for common invoice patterns
vendor_match = re.search(r'(?:from|vendor|company)[:\s]+([A-Za-z0-9\s&.,]+)', decoded_body, re.IGNORECASE)
amount_match = re.search(r'\$\s*([0-9,]+\.[0-9]{2})', decoded_body)
vendor = vendor_match.group(1).strip() if vendor_match else 'Unknown'
amount = amount_match.group(1).replace(',', '') if amount_match else 'N/A'
Approach C: Detect and Flag for Manual Processing
If neither approach works:
- Check
mimeType: 'multipart/mixed'andsizeEstimateto confirm attachments exist - Note in Excel: “PDF attachment exists but not accessible via API – manual download required”
- Include email subject and sender for reference
Detection logic:
has_attachment = (
message['payload'].get('mimeType') in ['multipart/mixed', 'multipart/related'] and
message.get('sizeEstimate', 0) > 10000 and # Larger than text-only
not message['payload'].get('parts') # But parts array is empty
)
if has_attachment:
note = "PDF attachment detected but not accessible - manual download required"
Step 3: Extract Invoice Data
Use the bundled extraction script for automated parsing:
python scripts/extract_invoice_data.py invoice1.pdf invoice2.pdf
Output format: FILENAME|VENDOR|AMOUNT|CURRENCY
The script uses pattern matching to find:
- Vendor: Company name (typically at top of first page)
- Amount: Total/Amount Due/Balance Due values
- Currency: USD (default), EUR, GBP, CAD
If extraction quality is poor, manually extract using pdfplumber and regex patterns.
Step 4: Create Excel Summary
Create a professional Excel file with:
Required columns:
- Date Processed (when you processed it)
- Email Subject (subject line of the email)
- Filename (original PDF name)
- Vendor (extracted company name)
- Amount (invoice total)
- Currency (USD, EUR, etc.)
- Notes (for manual review flags or issues)
Formatting standards:
- Header row: Bold, filled background (light blue/gray)
- Font: Arial or similar professional font
- Column widths: Auto-fit to content
- Amount column: Number format with 2 decimals, thousands separator
- Borders: Light borders around all cells
Use openpyxl for creation:
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Border, Side, Alignment
wb = Workbook()
ws = wb.active
ws.title = "Invoice Summary"
# Headers
headers = ['Date Processed', 'Email Subject', 'Filename', 'Vendor', 'Amount', 'Currency', 'Notes']
ws.append(headers)
# Style header row
for cell in ws[1]:
cell.font = Font(bold=True)
cell.fill = PatternFill(start_color="D3D3D3", fill_type="solid")
Save to /mnt/user-data/outputs/invoice_summary_YYYYMMDD.xlsx
Step 5: Present Results
- Use
present_filestool to share the Excel file - Provide concise summary:
- Number of invoices processed
- Total amount (by currency if multiple)
- Any invoices needing manual review
Bundled Resources
Scripts
-
scripts/extract_invoice_data.py: Automated vendor and amount extraction from invoice PDFs
- Requires: pdfplumber (
pip install pdfplumber --break-system-packages) - Usage:
python extract_invoice_data.py file1.pdf file2.pdf ... - Returns pipe-delimited output for easy parsing
- Requires: pdfplumber (
-
scripts/gmail_attachment_helper.py: Enhanced Gmail PDF detection and email body extraction
- Detects PDF attachments even when Gmail API parts array is empty
- Extracts invoice data from email body text as fallback
- Provides recommendations for handling different scenarios
- Usage: Import functions in main processing workflow
References
- references/workflow.md: Comprehensive step-by-step guide with code examples, error handling, multi-layered PDF access strategies, and tips for better results
Error Handling
PDF attachments not accessible:
- Gmail API may not return attachment data in parts array even when attachments exist
- Indicators:
mimeType: 'multipart/mixed', largesizeEstimate, but emptypartsarray - Solutions:
- Extract data from email body text if invoice details are included
- Flag in Excel with note: “PDF detected but not accessible – manual download required”
- Advise user to download PDF manually from Gmail for processing
Missing data: If vendor or amount cannot be extracted, mark as “Unknown – Review Required” or “N/A – Manual Review Needed” in Excel with note
Large attachments: Some Gmail attachments may be truncated; check attachment size and handle appropriately
Multiple currencies: List separately by currency; don’t sum different currencies together
Email body extraction: When PDFs aren’t accessible, parse email body HTML/text for invoice details:
# Common patterns in email bodies
patterns = {
'amount': r'\$\s*([0-9,]+\.[0-9]{2})',
'total': r'(?:total|amount due|balance)[:\s]*\$?\s*([0-9,]+\.?[0-9]*)',
'vendor': r'(?:from|vendor|company)[:\s]+([A-Za-z0-9\s&.,]+)',
'invoice_number': r'(?:invoice|inv|ref)[\s#:]+([A-Za-z0-9-]+)'
}
Tips
- Start with date-filtered searches to avoid processing old invoices:
newer_than:30d - Use multi-layered approach for PDF access: try direct attachment â extract from email body â flag for manual download
- Test the
gmail_attachment_helper.pyscript to detect attachment presence even when parts array is empty - Invoice formats vary widely; some may need manual data entry in Excel
- Save original filenames and email subjects for traceability
- When PDFs aren’t accessible via API, parse email body HTML/text for invoice details
- Create clear notes in Excel for items requiring manual review to track follow-up needed