image-generator
npx skills add https://github.com/dair-ai/dair-academy-plugins --skill Image Generator
Skill Documentation
Image Generator
This skill generates and edits images using Google’s Gemini Nano Banana Pro model (gemini-3-pro-image-preview).
IMPORTANT: Setup Required
Before using this skill, the user must set the GEMINI_API_KEY environment variable:
- Get a free API key from Google AI Studio
- Export the key in your shell profile (~/.zshrc, ~/.bashrc, etc.): export GEMINI_API_KEY="your_api_key_here"
- Restart your terminal or run source ~/.zshrc (or source ~/.bashrc)
The skill will not work without this configuration.
Pre-flight Check
Before making any API call, verify the key is set:
if [ -z "$GEMINI_API_KEY" ]; then
    echo "ERROR: GEMINI_API_KEY is not set. Please export it in your shell profile."
    exit 1
fi
If the key is missing, stop and tell the user to set it using the instructions above.
Configuration
Model: gemini-3-pro-image-preview
API Key: Read from the GEMINI_API_KEY environment variable
Iterating on User-Provided Images
When the user provides a path to an image they want to edit or iterate on, use this workflow:
Step 1: Read and encode the image to base64
# Get the image path from user
IMG_PATH="/path/to/user/image.png"
# Detect mime type
if [[ "$IMG_PATH" == *.png ]]; then
    MIME_TYPE="image/png"
elif [[ "$IMG_PATH" == *.jpg ]] || [[ "$IMG_PATH" == *.jpeg ]]; then
    MIME_TYPE="image/jpeg"
elif [[ "$IMG_PATH" == *.webp ]]; then
    MIME_TYPE="image/webp"
else
    MIME_TYPE="image/png"
fi
# Encode to base64 (works on both macOS and Linux)
if [[ "$(uname)" == "Darwin" ]]; then
    IMG_BASE64=$(base64 -i "$IMG_PATH")
else
    IMG_BASE64=$(base64 -w0 "$IMG_PATH")
fi
Step 2: Send image with edit prompt (File-Based Approach)
IMPORTANT: Always use a file-based approach for the request body. Base64-encoded images are too large for command-line arguments and will cause “argument list too long” errors.
# User's edit request
EDIT_PROMPT="Add a santa hat to the person in this image"
# Write request to a JSON file (avoids command line length limits)
cat > /tmp/gemini_request.json << JSONEOF
{
  "contents": [{
    "parts": [
      {"text": "$EDIT_PROMPT"},
      {
        "inline_data": {
          "mime_type": "$MIME_TYPE",
          "data": "$IMG_BASE64"
        }
      }
    ]
  }],
  "generationConfig": {
    "responseModalities": ["TEXT", "IMAGE"]
  }
}
JSONEOF
# Call the API using the file
curl -s -X POST \
"https://generativelanguage.googleapis.com/v1beta/models/gemini-3-pro-image-preview:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d @/tmp/gemini_request.json > /tmp/gemini_response.json
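One caveat with the heredoc above: the shell interpolates $EDIT_PROMPT directly into the JSON, so a prompt containing double quotes or backslashes produces an invalid request file. An optional hardening sketch (not required by the API) builds the file with python3 and json.dump instead; the sample values below are placeholders:

```shell
# Hardening sketch for Step 2: build the request with json.dump instead of a
# heredoc, so quotes/backslashes in the prompt cannot break the JSON.
# Sample values below are placeholders.
EDIT_PROMPT='Make the sign say "Open 24/7"'
MIME_TYPE="image/png"
IMG_BASE64="PLACEHOLDER_BASE64_DATA"

# Pass the values through the environment so python3 sees them verbatim.
EDIT_PROMPT="$EDIT_PROMPT" MIME_TYPE="$MIME_TYPE" IMG_BASE64="$IMG_BASE64" \
python3 - <<'PYEOF'
import json, os

request = {
    "contents": [{
        "parts": [
            {"text": os.environ["EDIT_PROMPT"]},
            {"inline_data": {
                "mime_type": os.environ["MIME_TYPE"],
                "data": os.environ["IMG_BASE64"],
            }},
        ]
    }],
    "generationConfig": {"responseModalities": ["TEXT", "IMAGE"]},
}
with open("/tmp/gemini_request.json", "w") as f:
    json.dump(request, f)
PYEOF
```

The curl call is unchanged; it still reads the request from /tmp/gemini_request.json.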
Step 3: Extract and save the edited image
# Extract image from response and save
python3 -c "
import json
import base64

with open('/tmp/gemini_response.json') as f:
    data = json.load(f)

for part in data['candidates'][0]['content']['parts']:
    if 'inlineData' in part:
        img_data = part['inlineData']['data']
        mime = part['inlineData']['mimeType']
        ext = 'png' if 'png' in mime else 'jpg'
        with open('edited_image.' + ext, 'wb') as out:
            out.write(base64.b64decode(img_data))
        print(f'Saved: edited_image.{ext}')
    elif 'text' in part:
        print(part['text'])
"
Complete Example (File-Based)
For iterating on images, always use file-based requests:
# Variables
IMG_PATH="/path/to/image.png"
EDIT_PROMPT="Make the background a sunset beach"
OUTPUT_PATH="edited_output.png"
# Detect mime type and encode
MIME_TYPE=$([[ "$IMG_PATH" == *.png ]] && echo "image/png" || echo "image/jpeg")
IMG_BASE64=$(base64 -i "$IMG_PATH" 2>/dev/null || base64 -w0 "$IMG_PATH")
# Write request to file (required - base64 images are too large for command line)
cat > /tmp/gemini_request.json << JSONEOF
{
  "contents": [{
    "parts": [
      {"text": "$EDIT_PROMPT"},
      {"inline_data": {"mime_type": "$MIME_TYPE", "data": "$IMG_BASE64"}}
    ]
  }],
  "generationConfig": {
    "responseModalities": ["TEXT", "IMAGE"]
  }
}
JSONEOF
# Call API and extract image
curl -s -X POST \
"https://generativelanguage.googleapis.com/v1beta/models/gemini-3-pro-image-preview:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d @/tmp/gemini_request.json > /tmp/gemini_response.json
# Save the output image
python3 -c "
import json, base64

with open('/tmp/gemini_response.json') as f:
    data = json.load(f)

for part in data.get('candidates', [{}])[0].get('content', {}).get('parts', []):
    if 'inlineData' in part:
        with open('$OUTPUT_PATH', 'wb') as out:
            out.write(base64.b64decode(part['inlineData']['data']))
        print('Saved: $OUTPUT_PATH')
"
Multi-Image Input (Combine/Compose)
To combine elements from multiple images (also uses file-based approach):
IMG1_PATH="/path/to/image1.png"
IMG2_PATH="/path/to/image2.png"
PROMPT="Put the dress from the first image on the person in the second image"
IMG1_BASE64=$(base64 -i "$IMG1_PATH" 2>/dev/null || base64 -w0 "$IMG1_PATH")
IMG2_BASE64=$(base64 -i "$IMG2_PATH" 2>/dev/null || base64 -w0 "$IMG2_PATH")
# Write request to file
cat > /tmp/gemini_request.json << JSONEOF
{
  "contents": [{
    "parts": [
      {"text": "$PROMPT"},
      {"inline_data": {"mime_type": "image/png", "data": "$IMG1_BASE64"}},
      {"inline_data": {"mime_type": "image/png", "data": "$IMG2_BASE64"}}
    ]
  }],
  "generationConfig": {"responseModalities": ["TEXT", "IMAGE"]}
}
JSONEOF
curl -s -X POST \
"https://generativelanguage.googleapis.com/v1beta/models/gemini-3-pro-image-preview:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-d @/tmp/gemini_request.json > /tmp/gemini_response.json
Capabilities
Text-to-Image Generation
- Generate high-quality images from text descriptions
- Support for photorealistic, stylized, and artistic outputs
- Accurate text rendering in images (logos, infographics, diagrams)
Image Editing
- Add or remove elements from images
- Inpainting with semantic masking (edit specific parts)
- Style transfer (apply artistic styles to photos)
- Multi-image composition (combine elements from multiple images)
Advanced Features
- High Resolution: 1K, 2K, or 4K output
- Aspect Ratios: 1:1, 2:3, 3:2, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9
- Google Search Grounding: Generate images based on real-time data
- Multi-turn Editing: Iteratively refine images through conversation
- Up to 14 Reference Images: Combine multiple inputs for complex compositions
API Usage
Basic Text-to-Image (Python)
from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents=["Your prompt here"],
    config=types.GenerateContentConfig(
        response_modalities=['TEXT', 'IMAGE'],
        image_config=types.ImageConfig(
            aspect_ratio="16:9",  # Optional
            image_size="2K"       # Optional: "1K", "2K", "4K"
        )
    )
)

for part in response.parts:
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:
        image = part.as_image()
        image.save("generated_image.png")
Basic Text-to-Image (JavaScript)
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({});
const response = await ai.models.generateContent({
  model: "gemini-3-pro-image-preview",
  contents: "Your prompt here",
  config: {
    responseModalities: ['TEXT', 'IMAGE'],
    imageConfig: {
      aspectRatio: "16:9",
      imageSize: "2K"
    }
  }
});

for (const part of response.candidates[0].content.parts) {
  if (part.text) {
    console.log(part.text);
  } else if (part.inlineData) {
    const buffer = Buffer.from(part.inlineData.data, "base64");
    fs.writeFileSync("generated_image.png", buffer);
  }
}
REST API (curl)
curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-3-pro-image-preview:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{
      "parts": [{"text": "Your prompt here"}]
    }],
    "generationConfig": {
      "responseModalities": ["TEXT", "IMAGE"],
      "imageConfig": {
        "aspectRatio": "16:9",
        "imageSize": "2K"
      }
    }
  }' | jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' | base64 --decode > output.png
Image Editing (with input image)
from google import genai
from google.genai import types
from PIL import Image

client = genai.Client()
input_image = Image.open('input.png')
prompt = "Add a wizard hat to the cat in this image"

response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents=[prompt, input_image],
    config=types.GenerateContentConfig(
        response_modalities=['TEXT', 'IMAGE']
    )
)

for part in response.parts:
    if part.inline_data is not None:
        image = part.as_image()
        image.save("edited_image.png")
Multi-Image Composition
from google import genai
from google.genai import types
from PIL import Image

client = genai.Client()
image1 = Image.open('dress.png')
image2 = Image.open('model.png')
prompt = "Put the dress from the first image on the model from the second image"

response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents=[image1, image2, prompt],
    config=types.GenerateContentConfig(
        response_modalities=['TEXT', 'IMAGE'],
        image_config=types.ImageConfig(
            aspect_ratio="3:4",
            image_size="2K"
        )
    )
)
With Google Search Grounding
from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents="Visualize the current weather forecast for San Francisco",
    config=types.GenerateContentConfig(
        response_modalities=['TEXT', 'IMAGE'],
        image_config=types.ImageConfig(aspect_ratio="16:9"),
        tools=[{"google_search": {}}]
    )
)
Prompting Best Practices
1. Be Descriptive, Not Keyword-Based
Instead of: cat, wizard hat, cute
Write: A fluffy orange cat wearing a small knitted wizard hat, sitting on a wooden floor with soft natural lighting from a window
2. Specify Style and Mood
- Photography terms: “shot with 85mm lens”, “soft bokeh background”, “golden hour lighting”
- Artistic styles: “in the style of Van Gogh”, “minimalist illustration”, “photorealistic”
- Mood: “warm and cozy atmosphere”, “dramatic noir lighting”
3. For Text in Images
Be explicit about:
- The exact text to render
- Font style (descriptively): “clean, bold, sans-serif font”
- Placement and size
4. For Editing
- Describe what to change and what to preserve
- Use “keep everything else unchanged”
- Reference specific elements clearly
5. For Product/Commercial Images
Mention:
- Lighting setup: “three-point softbox lighting”
- Background: “clean white studio background”
- Camera angle: “slightly elevated 45-degree shot”
Resolution and Aspect Ratio Reference
| Aspect Ratio | 1K Resolution | 2K Resolution | 4K Resolution |
|---|---|---|---|
| 1:1 | 1024×1024 | 2048×2048 | 4096×4096 |
| 16:9 | 1376×768 | 2752×1536 | 5504×3072 |
| 9:16 | 768×1376 | 1536×2752 | 3072×5504 |
| 3:2 | 1264×848 | 2528×1696 | 5056×3392 |
| 2:3 | 848×1264 | 1696×2528 | 3392×5056 |
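For scripts that need to validate or pre-allocate output sizes, the table above transcribes into a small lookup helper. This is a convenience sketch only (output_dimensions is not an SDK function), and it covers only the five ratios listed in the table:

```python
# Output dimensions (width, height) by aspect ratio and image size,
# transcribed from the resolution reference table above.
# Only the five aspect ratios in the table are included here.
RESOLUTIONS = {
    ("1:1", "1K"): (1024, 1024), ("1:1", "2K"): (2048, 2048), ("1:1", "4K"): (4096, 4096),
    ("16:9", "1K"): (1376, 768), ("16:9", "2K"): (2752, 1536), ("16:9", "4K"): (5504, 3072),
    ("9:16", "1K"): (768, 1376), ("9:16", "2K"): (1536, 2752), ("9:16", "4K"): (3072, 5504),
    ("3:2", "1K"): (1264, 848), ("3:2", "2K"): (2528, 1696), ("3:2", "4K"): (5056, 3392),
    ("2:3", "1K"): (848, 1264), ("2:3", "2K"): (1696, 2528), ("2:3", "4K"): (3392, 5056),
}

def output_dimensions(aspect_ratio: str, image_size: str) -> tuple:
    """Return (width, height) for a supported aspect_ratio/image_size pair."""
    return RESOLUTIONS[(aspect_ratio, image_size)]
```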
Common Use Cases
Logo Creation
Create a modern, minimalist logo for a coffee shop called 'The Daily Grind'.
The text should be in a clean, bold, sans-serif font.
Black and white color scheme. Put the logo in a circle.
Product Photography
A high-resolution, studio-lit product photograph of a minimalist ceramic
coffee mug in matte black on a polished concrete surface. Three-point
softbox lighting with soft, diffused highlights. Slightly elevated
45-degree camera angle. Sharp focus on steam rising from the coffee.
Style Transfer
Transform this photograph of a city street at night into Vincent van Gogh's
'Starry Night' style. Preserve the composition but render with swirling,
impasto brushstrokes and deep blues with bright yellows.
Infographic
Create a vibrant infographic explaining photosynthesis as a recipe.
Show "ingredients" (sunlight, water, CO2) and "finished dish" (sugar/energy).
Style like a colorful kids' cookbook, suitable for 4th graders.
Error Handling
Common issues:
- No image returned: Check that response_modalities includes 'IMAGE'
- Safety filters: Some prompts may be blocked; try rephrasing
- Rate limits: Implement exponential backoff for retries
- Large images: For 4K, ensure sufficient timeout settings
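The exponential-backoff advice above can be sketched as a small retry wrapper. This is a hypothetical helper, not part of the google-genai SDK:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() with exponential backoff plus jitter.

    fn is any zero-argument callable that raises on a retryable failure
    (e.g. an HTTP 429 from the generateContent endpoint).
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            # Re-raise after the last attempt instead of sleeping again.
            if attempt == max_retries - 1:
                raise
            # Sleep base_delay, 2x, 4x, ... plus a little random jitter.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Usage with the Python SDK would look like response = call_with_backoff(lambda: client.models.generate_content(...)). A real implementation would catch only retryable errors (rate limits, timeouts) rather than every Exception.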
Dependencies
To use the Python SDK:
pip install google-genai pillow
For JavaScript:
npm install @google/genai
Important Notes
- All generated images include a SynthID watermark
- The model uses a “thinking” process for complex prompts
- For best text rendering, generate text first, then request image with that text
- Images are not stored by the API – save outputs locally