xdrshjr-google-images-crawler
Install command
npx skills add https://smithery.ai
Skill Documentation
Google Image Crawler
A Playwright-based Google Images crawler that extracts original high-resolution image URLs and supports batch downloads.
Quick Start
Install dependencies
pip install playwright aiohttp requests tqdm
playwright install chromium
Basic usage
import asyncio
from crawler import GoogleImageCrawler

async def main():
    async with GoogleImageCrawler() as crawler:
        # Search for images
        results = await crawler.search("cute cats", num_images=10)

        # Print the results
        for result in results:
            print(f"URL: {result.url}")
            print(f"Size: {result.width}x{result.height}")

        # Download the images
        downloaded = await crawler.download_images(
            results,
            output_dir="./downloads"
        )

asyncio.run(main())
Synchronous API (quick calls)
from crawler import search_images_sync

# Search for images in one line
results = search_images_sync("mountain landscape", num_images=5)
for r in results:
    print(r.url)
Core Features
1. Image search (GoogleImageCrawler.search)
Automates a Google Images search with Playwright and extracts the original high-resolution image URLs.
results = await crawler.search(
    keyword="search term",   # search keyword
    num_images=10,           # number of images to fetch
    safe_search=True,        # enable SafeSearch
    min_width=None,          # minimum width filter
    min_height=None          # minimum height filter
)
2. Image download (GoogleImageCrawler.download_images)
Downloads images asynchronously in batches, with concurrency control.
downloaded = await crawler.download_images(
    results=results,          # list of search results
    output_dir="./images",    # output directory
    concurrency=3             # number of concurrent downloads
)
3. Standalone download module (ImageDownloader)
A more capable downloader with thread-pool concurrency and progress display.
from downloader import ImageDownloader

downloader = ImageDownloader(
    output_dir="./downloads",
    concurrent=5,
    max_retries=3
)
results = downloader.download_batch(urls)
4. Command-line tool
# Download from a URL list file
python cli.py -f urls.txt -o ./images

# Download with custom concurrency and timeout
python cli.py -f urls.txt -c 10 -t 60 -o ./images

# Download a single image
python cli.py -u "https://example.com/image.jpg" -o ./images
Usage Examples
Example 1: Basic search
import asyncio
from crawler import GoogleImageCrawler

async def basic_search():
    async with GoogleImageCrawler(headless=True) as crawler:
        results = await crawler.search("sunset beach", num_images=5)

        print(f"Found {len(results)} images:")
        for i, result in enumerate(results, 1):
            print(f"{i}. {result.title[:40] if result.title else 'No title'}")
            print(f"   URL: {result.url[:70]}...")
            print(f"   Size: {result.width}x{result.height}")

asyncio.run(basic_search())
Example 2: HD wallpaper search (with size filtering)
async def hd_wallpaper_search():
    async with GoogleImageCrawler() as crawler:
        results = await crawler.search(
            "4k wallpaper",
            num_images=5,
            min_width=1920,
            min_height=1080
        )

        print(f"Found {len(results)} HD images:")
        for result in results:
            print(f"✓ {result.width}x{result.height} - {result.url[:60]}...")

asyncio.run(hd_wallpaper_search())
Example 3: Search and download
async def search_and_download():
    async with GoogleImageCrawler() as crawler:
        # Search
        results = await crawler.search("puppy", num_images=10)
        print(f"Search complete, downloading {len(results)} images...")

        # Download
        downloaded = await crawler.download_images(
            results,
            output_dir="./downloads",
            concurrency=2
        )

        print(f"Downloaded {len(downloaded)} images")
        for path in downloaded:
            print(f"  ✓ {path}")

asyncio.run(search_and_download())
Example 4: Using the standalone download module
from downloader import ImageDownloader

# Use the download module on its own
downloader = ImageDownloader(
    output_dir="./downloads",
    concurrent=5,
    timeout=30,
    max_retries=3
)

# Download from a list of URLs
urls = [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg"
]
results = downloader.download_batch(urls)

# Tally the results
success = len(results['success'])
failed = len(results['failed'])
print(f"Succeeded: {success}, failed: {failed}")
Example 5: Combined workflow (recommended)
import asyncio
from crawler import GoogleImageCrawler
from downloader import ImageDownloader

async def combined_workflow():
    # 1. Crawl the image URLs
    async with GoogleImageCrawler() as crawler:
        results = await crawler.search("mountain", num_images=20)
        urls = [r.url for r in results]

    # 2. Download with ImageDownloader (supports more features)
    downloader = ImageDownloader(
        output_dir="./mountain_images",
        concurrent=5,
        max_retries=3
    )
    with downloader:
        results = downloader.download_batch(urls)

    # Tally the results
    success = len(results['success'])
    failed = len(results['failed'])
    print(f"Succeeded: {success}/{success + failed}")

asyncio.run(combined_workflow())
Parameter Reference
GoogleImageCrawler initialization parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| headless | bool | True | Headless mode (no browser window shown) |
| timeout | int | 30 | Page load timeout (seconds) |
| scroll_pause | float | 1.5 | Pause between scrolls (seconds) |
| max_retries | int | 3 | Maximum retry count |
| proxy | str | None | Proxy server address (e.g., "http://proxy:8080") |
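A minimal sketch that passes every constructor option from the table above; the values shown are the documented defaults, not recommendations.

crawler = GoogleImageCrawler(
    headless=True,        # run without a visible browser window
    timeout=30,           # page load timeout in seconds
    scroll_pause=1.5,     # pause between scrolls in seconds
    max_retries=3,        # maximum retry count
    proxy=None            # optional proxy, e.g. "http://proxy:8080"
)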
search() method parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| keyword | str | required | Search keyword |
| num_images | int | 10 | Number of images to fetch |
| safe_search | bool | True | Whether to enable SafeSearch |
| min_width | int | None | Minimum image width filter |
| min_height | int | None | Minimum image height filter |
ImageResult attributes
| Attribute | Type | Description |
|---|---|---|
| url | str | Original high-resolution image URL |
| thumbnail_url | str | Thumbnail URL |
| title | str | Image title |
| source_url | str | Source page URL |
| width | int | Image width |
| height | int | Image height |
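For reference, a quick sketch that touches every field in the table, assuming results comes from crawler.search() as in the examples above.

for r in results:
    print(r.url)            # original high-resolution URL
    print(r.thumbnail_url)  # thumbnail URL
    print(r.title)          # image title
    print(r.source_url)     # page the image appeared on
    print(f"{r.width}x{r.height}")  # dimensions in pixels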
ImageDownloader initialization parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| output_dir | str | "./downloads" | Output directory |
| timeout | int | 30 | Request timeout (seconds) |
| max_retries | int | 3 | Maximum retry count |
| concurrent | int | 5 | Number of concurrent downloads |
| headers | dict | None | Custom request headers |
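A sketch that sets every constructor option from the table, including custom headers; the User-Agent string is illustrative.

from downloader import ImageDownloader

downloader = ImageDownloader(
    output_dir="./downloads",
    timeout=30,
    max_retries=3,
    concurrent=5,
    headers={"User-Agent": "Mozilla/5.0"}  # illustrative custom headers
)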
CLI arguments
| Argument | Short | Description |
|---|---|---|
| --file | -f | Path to a URL list file |
| --url | -u | Single image URL |
| --output | -o | Output directory |
| --concurrent | -c | Number of concurrent downloads |
| --timeout | -t | Timeout (seconds) |
| --retries | -r | Maximum retry count |
| --limit | -l | Limit on the number of downloads |
| --proxy | | Proxy server address |
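An illustrative invocation that exercises the remaining flags (--retries, --limit, --proxy) alongside the ones shown earlier; the proxy address is a placeholder, and combining all flags in one call is assumed to be supported.

python cli.py -f urls.txt -o ./images -c 5 -t 30 -r 5 -l 100 --proxy http://proxy:8080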
Error Handling
Common errors and solutions
1. Browser initialization failed

# Error: BrowserInitError or another Playwright-related error
# Fix: make sure the browser is installed by running:
# playwright install chromium
2. Search timeout or page load failure

# Error: TimeoutError or the page fails to load
# Fix: increase the timeout or use a proxy
async with GoogleImageCrawler(timeout=60) as crawler:
    results = await crawler.search("keyword", num_images=10)
3. Image download failure

# Error: DownloadError or an HTTP error
# Fix: increase the retry count or use the standalone download module
downloader = ImageDownloader(max_retries=5, timeout=60)
results = downloader.download_batch(urls)
4. IP rate-limited

# Error: frequent connection refusals or CAPTCHAs
# Fix: use a proxy and increase the scroll interval
async with GoogleImageCrawler(
    proxy="http://proxy:8080",
    scroll_pause=3.0  # longer pause between scrolls
) as crawler:
    results = await crawler.search("keyword", num_images=10)
Exception handling example
import asyncio
from crawler import GoogleImageCrawler

async def safe_search():
    try:
        async with GoogleImageCrawler() as crawler:
            results = await crawler.search("keyword", num_images=10)
            return results
    except Exception as e:
        print(f"Search failed: {e}")
        return []

async def safe_download(crawler, results):
    try:
        downloaded = await crawler.download_images(
            results,
            output_dir="./downloads",
            concurrency=2
        )
        return downloaded
    except Exception as e:
        print(f"Download failed: {e}")
        return []

# Usage
asyncio.run(safe_search())
Technical Details
How original-image URLs are extracted
Each result on a Google Images search page carries an /imgres link:

/imgres?imgurl=https://example.com/original.jpg&imgrefurl=...

The crawler recovers the original image URL by parsing the imgurl parameter.
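A minimal sketch of that parsing step using only the standard library; extract_original_url is a hypothetical helper for illustration, not the crawler's actual internal API.

from urllib.parse import parse_qs, urlparse

# Hypothetical helper showing the imgurl extraction described above.
def extract_original_url(imgres_link):
    query = parse_qs(urlparse(imgres_link).query)
    urls = query.get("imgurl")
    return urls[0] if urls else None

link = "/imgres?imgurl=https://example.com/original.jpg&imgrefurl=https://example.com/page"
print(extract_original_url(link))  # https://example.com/original.jpg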
Anti-blocking strategies
- Uses a normal desktop User-Agent
- Sets a realistic viewport size (1920×1080)
- Automatically handles the cookie consent popup
- Pauses between scrolls to mimic human behavior (1.5 s)
- Retries with exponential backoff (see the sketch below)
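A minimal sketch of the exponential-backoff idea, assuming retries wrap an async operation; fetch_page is a hypothetical coroutine and the delays are illustrative, not the crawler's actual internals.

import asyncio
import random

async def with_backoff(fetch_page, max_retries=3, base_delay=1.0):
    # Hypothetical wrapper: retry an async operation, doubling the delay
    # after each failure and adding jitter to avoid synchronized bursts.
    for attempt in range(max_retries):
        try:
            return await fetch_page()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            await asyncio.sleep(base_delay * (2 ** attempt) + random.random())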
File Structure
google-image-crawler/
├── crawler.py            # Crawler core module (Playwright + async)
├── downloader.py         # Image download module (Requests + thread pool)
├── config.py             # Configuration management
├── cli.py                # Command-line tool
├── example.py            # Usage examples
└── docs/                 # Documentation
    ├── architecture.md   # Architecture design
    ├── api-reference.md  # API reference
    └── examples.md       # More examples
Dependencies
- playwright: browser automation
- aiohttp: async HTTP client
- requests: synchronous HTTP requests
- tqdm: progress bar display
Notes
- Follow the terms of service: comply with Google's Terms of Service and the target site's robots.txt.
- Control request frequency: avoid overly frequent requests; set a reasonable scroll_pause.
- Respect copyright: confirm image licensing before any commercial use.
- Use a proxy: for large crawls, a proxy is recommended to avoid IP restrictions.
- Release resources: use the async with context manager to ensure the browser is closed properly.
License
MIT