pdf-reader
Total installs: 1
Weekly installs: 1
Site rank: #44793
Install command:
npx skills add https://github.com/ngnnah/nhat-learn-in-public --skill pdf-reader
Agent install distribution
opencode: 1
codex: 1
claude-code: 1
antigravity: 1
Skill documentation
PDF Reader
This skill helps you extract and read text content from PDF files using Python libraries.
When to use this skill
Use this skill when:
- Reading text content from PDF files
- Extracting specific pages from PDFs
- Analyzing PDF document structure
- Converting PDF text to plain text
- The user mentions “PDF”, “read PDF”, “extract text from PDF”
Requirements
Install required packages:
pip install PyPDF2 pdfplumber
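Before running the examples, it can help to confirm that both packages are importable. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, not part of this skill.

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# The import names match the pip package names used above.
missing = missing_packages(["PyPDF2", "pdfplumber"])
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are installed")
```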
Quick start
Basic text extraction with PyPDF2
import PyPDF2

def read_pdf_pypdf2(pdf_path):
    """Extract all text from a PDF file using PyPDF2"""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        # Get number of pages
        num_pages = len(pdf_reader.pages)
        print(f"PDF has {num_pages} pages")
        # Extract text from all pages
        full_text = ""
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            text = page.extract_text()
            full_text += f"\n--- Page {page_num + 1} ---\n{text}"
    return full_text

# Usage
text = read_pdf_pypdf2("document.pdf")
print(text)
Advanced extraction with pdfplumber
pdfplumber often preserves text layout better than PyPDF2 and can also detect tables:
import pdfplumber

def read_pdf_pdfplumber(pdf_path):
    """Extract text with better formatting using pdfplumber"""
    with pdfplumber.open(pdf_path) as pdf:
        full_text = ""
        for i, page in enumerate(pdf.pages):
            # Extract text; fall back to "" for pages with no text layer
            text = page.extract_text() or ""
            full_text += f"\n--- Page {i + 1} ---\n{text}\n"
            # Optionally extract tables
            tables = page.extract_tables()
            if tables:
                full_text += f"\n[Found {len(tables)} table(s) on page {i + 1}]\n"
    return full_text

# Usage
text = read_pdf_pdfplumber("document.pdf")
print(text)
Extract specific pages
import pdfplumber

def read_pdf_pages(pdf_path, page_numbers):
    """Extract text from specific pages only (page_numbers are 0-indexed)"""
    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for page_num in page_numbers:
            if 0 <= page_num < len(pdf.pages):
                page = pdf.pages[page_num]
                text += f"\n--- Page {page_num + 1} ---\n"
                # Fall back to "" so a page without text does not raise TypeError
                text += page.extract_text() or ""
            else:
                print(f"Warning: Page {page_num + 1} doesn't exist")
    return text

# Usage: Read pages 1, 3, and 5 (0-indexed: 0, 2, 4)
text = read_pdf_pages("document.pdf", [0, 2, 4])
print(text)
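Because the function above takes 0-indexed page numbers while people usually think in 1-based pages, a small converter can reduce off-by-one mistakes. This is a sketch only; `parse_page_spec` and its "1,3,5-7" spec format are assumptions introduced here for illustration.

```python
def parse_page_spec(spec, num_pages):
    """Convert a 1-based spec such as "1,3,5-7" into sorted 0-based indices.

    Pages outside 1..num_pages are dropped rather than raising an error.
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "5-7" is inclusive on both ends
            start_str, end_str = part.split("-", 1)
            start, end = int(start_str), int(end_str)
        else:
            start = end = int(part)
        for page in range(start, end + 1):
            if 1 <= page <= num_pages:
                indices.add(page - 1)
    return sorted(indices)
```

The result can be passed as the `page_numbers` argument of `read_pdf_pages`, with `num_pages` taken from `len(pdf.pages)`.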
Get PDF metadata
import pdfplumber

def get_pdf_info(pdf_path):
    """Get metadata and information about the PDF"""
    with pdfplumber.open(pdf_path) as pdf:
        info = {
            'num_pages': len(pdf.pages),
            'metadata': pdf.metadata,
        }
        # Get dimensions of first page
        if pdf.pages:
            first_page = pdf.pages[0]
            info['page_width'] = first_page.width
            info['page_height'] = first_page.height
    return info

# Usage
info = get_pdf_info("document.pdf")
print(f"Pages: {info['num_pages']}")
print(f"Title: {info['metadata'].get('Title', 'N/A')}")
print(f"Author: {info['metadata'].get('Author', 'N/A')}")
Common use cases
Search for text in PDF
import pdfplumber

def search_in_pdf(pdf_path, search_term):
    """Search for a term and return pages where it appears"""
    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            # Fall back to "" so pages without text do not break .lower()
            text = page.extract_text() or ""
            if search_term.lower() in text.lower():
                results.append({
                    'page': i + 1,
                    'text_snippet': text[:200]  # First 200 chars of the page as preview
                })
    return results

# Usage
results = search_in_pdf("document.pdf", "important keyword")
for result in results:
    print(f"Found on page {result['page']}")
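The preview above is the first 200 characters of the page, which may not contain the match itself. As one possible refinement, the sketch below returns context around the first match instead; `snippet_around` is a hypothetical helper, and it assumes lowercasing does not change the string length (true for typical ASCII and most Latin text).

```python
def snippet_around(text, term, width=80):
    """Return up to `width` characters of context on each side of the first
    case-insensitive match of `term`, or None if there is no match."""
    if not text or not term:
        return None
    pos = text.lower().find(term.lower())
    if pos == -1:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    return text[start:end]
```

Inside `search_in_pdf`, this could replace `text[:200]` as the value of `'text_snippet'`.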
Extract tables from PDF
import pdfplumber

def extract_tables(pdf_path):
    """Extract all tables from PDF"""
    all_tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for j, table in enumerate(tables):
                all_tables.append({
                    'page': i + 1,
                    'table_number': j + 1,
                    'data': table  # List of rows; each row is a list of cell values
                })
    return all_tables

# Usage
tables = extract_tables("document.pdf")
for table_info in tables:
    print(f"Table {table_info['table_number']} from page {table_info['page']}")
    print(table_info['data'])
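Extracted tables are plain lists of rows, so they can be saved with the standard `csv` module. The sketch below assumes empty cells arrive as `None`, which is how pdfplumber typically reports them; `table_to_csv` is a hypothetical helper name.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows) to CSV text, writing "" for None cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```

Each `table_info['data']` value from `extract_tables` can be passed to this function and the result written to a file.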
Tips and best practices
- Choose the right library:
  - Use PyPDF2 for simple text extraction and PDF manipulation
  - Use pdfplumber for better text extraction and table detection
  - Use both if needed for different tasks
- Handle errors gracefully:
  try:
      text = read_pdf_pdfplumber("document.pdf")
  except FileNotFoundError:
      print("PDF file not found")
  except Exception as e:
      print(f"Error reading PDF: {e}")
- Memory management: For large PDFs, process pages one at a time instead of loading all text at once
- Text quality: Some PDFs (especially scanned images) may not have extractable text. Consider OCR tools like pytesseract for those cases.
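The memory-management tip can be sketched with a generator that yields one page's text at a time, so the caller never holds the whole document as a single string. `iter_page_text` and `count_words` are hypothetical names; the code only assumes each page object has an `extract_text()` method, as pdfplumber and PyPDF2 pages do.

```python
def iter_page_text(pages):
    """Yield (page_number, text) lazily, one page at a time (1-based numbering)."""
    for number, page in enumerate(pages, start=1):
        # Fall back to "" for pages with no extractable text
        yield number, page.extract_text() or ""


def count_words(pages):
    """Example consumer: count words without accumulating the full text."""
    total = 0
    for _, text in iter_page_text(pages):
        total += len(text.split())
    return total


# Usage with pdfplumber (requires the package and a real file):
# with pdfplumber.open("document.pdf") as pdf:
#     print(count_words(pdf.pages))
```

This only avoids building one large string; the library may still keep per-page data in memory, so very large files can need further care.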
Troubleshooting
- No text extracted: The PDF might be image-based. Use OCR tools.
- Garbled text: Try pdfplumber instead of PyPDF2; it often handles formatting better.
- Missing packages: Run pip install PyPDF2 pdfplumber
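For the "no text extracted" case, a rough heuristic can flag PDFs that are probably image-based before reaching for OCR. This is a sketch under stated assumptions: `looks_image_based` and its threshold of 20 non-whitespace characters per page are illustrative choices, not values from any library.

```python
def looks_image_based(page_texts, min_chars_per_page=20):
    """Return True when the average non-whitespace character count per page
    is below the threshold, which suggests scanned pages without a text layer."""
    texts = list(page_texts)
    if not texts:
        return False
    total = sum(len("".join((text or "").split())) for text in texts)
    return total / len(texts) < min_chars_per_page
```

The input is the per-page result of `extract_text()`, for example `[page.extract_text() for page in pdf.pages]`.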
Related skills
- For PDF form filling: Consider creating a pdf-forms skill
- For PDF merging/splitting: Consider creating a pdf-manipulation skill
- For OCR on image PDFs: Consider using pytesseract with pdf2image
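The pytesseract and pdf2image combination mentioned in the last item could look roughly like the sketch below. It assumes both packages are installed along with the Tesseract and Poppler system binaries they depend on; `ocr_pdf` and `join_pages` are hypothetical names, and the 300 dpi default is an illustrative choice.

```python
def join_pages(texts):
    """Join per-page strings using the same page headers as the examples above."""
    return "".join(
        f"\n--- Page {number} ---\n{text}"
        for number, text in enumerate(texts, start=1)
    )


def ocr_pdf(pdf_path, dpi=300):
    """Render each page to an image and run OCR on it."""
    # Imported lazily so this module loads even without the OCR dependencies
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return join_pages(pytesseract.image_to_string(image) for image in images)


# Usage (requires pdf2image, pytesseract, Poppler, and Tesseract):
# print(ocr_pdf("scanned.pdf"))
```

OCR is much slower than direct text extraction, so it is usually worth trying pdfplumber first and falling back to this only when no text is found.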