azure-ai-inference-py
Install command
npx skills add https://github.com/microsoft/agent-skills --skill azure-ai-inference-py
Skill documentation
Azure AI Inference SDK for Python
Client library for Azure AI model inference with chat completions and embeddings.
Installation
pip install azure-ai-inference
# With OpenTelemetry tracing
pip install "azure-ai-inference[opentelemetry]"
Environment Variables
# Inference endpoint
AZURE_INFERENCE_ENDPOINT=https://<resource>.services.ai.azure.com/models
AZURE_INFERENCE_CREDENTIAL=<your-api-key> # If using API key
# Optional: specific model deployment
AZURE_INFERENCE_MODEL=gpt-4o-mini
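The variables above can be read and validated once at startup, so a missing endpoint fails early instead of surfacing as a confusing request error. The following is a minimal sketch that assumes only the three variable names listed here; `InferenceSettings` and `load_settings` are hypothetical names introduced for illustration and are not part of the SDK.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceSettings:
    """Hypothetical container for the environment variables listed above."""
    endpoint: str
    api_key: Optional[str]  # None means no API key was provided (e.g. Entra ID is used)
    model: Optional[str]    # None means no specific model deployment was requested


def load_settings(env: Mapping[str, str] = os.environ) -> InferenceSettings:
    """Read the settings from a mapping, failing early if the endpoint is missing."""
    endpoint = env.get("AZURE_INFERENCE_ENDPOINT", "").strip()
    if not endpoint:
        raise RuntimeError("AZURE_INFERENCE_ENDPOINT is not set")
    return InferenceSettings(
        endpoint=endpoint,
        api_key=env.get("AZURE_INFERENCE_CREDENTIAL") or None,
        model=env.get("AZURE_INFERENCE_MODEL") or None,
    )
```

Passing the mapping as a parameter keeps the helper easy to test with a plain dictionary; calling `load_settings()` with no arguments reads the real process environment.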
Authentication
API Key
import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential
client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"])
)
Entra ID (Recommended)
# Requires: pip install azure-identity
import os
from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential
client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=DefaultAzureCredential()
)
Chat Completions
Basic Completion
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential
# Read connection details from the environment variables defined above
endpoint = os.environ["AZURE_INFERENCE_ENDPOINT"]
key = os.environ["AZURE_INFERENCE_CREDENTIAL"]
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key)
)
response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="What is Azure AI?")
    ],
    model="gpt-4o-mini"  # Optional for single-model endpoints
)
print(response.choices[0].message.content)
Streaming Completions
response = client.complete(
    stream=True,
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Write a poem about Azure.")
    ]
)
# Each update carries an incremental delta; some updates have no choices
for update in response:
    if update.choices:
        print(update.choices[0].delta.content or "", end="")
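When the full text is needed after streaming (for logging or for appending to the conversation history), the deltas can be accumulated while they are printed. This is a sketch under the assumption that each update exposes `choices[0].delta.content` as in the loop above; `collect_stream` and `fake_update` are hypothetical helpers, and the fake objects only stand in for real streamed updates so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(updates: Iterable) -> str:
    """Print each streamed delta as it arrives and return the concatenated text."""
    parts = []
    for update in updates:
        # Some updates carry no choices, so skip them.
        if not update.choices:
            continue
        piece = update.choices[0].delta.content or ""
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # Finish the line once the stream ends
    return "".join(parts)


def fake_update(text):
    """Offline stand-in for a streamed update; None produces an update with no choices."""
    if text is None:
        return SimpleNamespace(choices=[])
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


full_text = collect_stream([fake_update("Hello"), fake_update(None), fake_update(", Azure")])
```

With a real client, the same helper would be called as `collect_stream(client.complete(stream=True, messages=...))`.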
With Default Settings
# Set defaults at client creation
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key),
    temperature=0.5,
    max_tokens=1000
)
# Defaults apply to every call and can be overridden per call
response = client.complete(
    messages=[UserMessage(content="Hello")],
    temperature=0.8  # Overrides the client default of 0.5
)
Embeddings
import os
from azure.ai.inference import EmbeddingsClient
from azure.core.credentials import AzureKeyCredential
client = EmbeddingsClient(
    endpoint="https://<resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"])
)
response = client.embed(
    input=["Your text string goes here"],
    model="text-embedding-3-small"
)
# One embedding is returned per input string, in the same order
embedding = response.data[0].embedding
print(f"Embedding dimensions: {len(embedding)}")
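A common next step with embeddings is comparing two vectors. The sketch below computes cosine similarity in plain Python; it is not part of the SDK, `cosine_similarity` is a hypothetical helper name, and the short vectors are toy values standing in for `response.data[i].embedding`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors; real embeddings have many more dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.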
Async Client
import asyncio
import os
from azure.ai.inference.aio import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential
async def main():
    client = ChatCompletionsClient(
        endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"])
    )
    try:
        response = await client.complete(
            messages=[
                SystemMessage(content="You are a helpful assistant."),
                UserMessage(content="What is Azure AI?")
            ]
        )
        print(response.choices[0].message.content)
    finally:
        # Close the client even if the request fails
        await client.close()
asyncio.run(main())
Model Information
# Get model info (Serverless API / Managed Compute only)
model_info = client.get_model_info()
print(f"Model name: {model_info.model_name}")
print(f"Model provider: {model_info.model_provider_name}")
print(f"Model type: {model_info.model_type}")
Using load_client
from azure.ai.inference import load_client
from azure.core.credentials import AzureKeyCredential
# Auto-detect client type based on endpoint
client = load_client(
    endpoint=endpoint,
    credential=AzureKeyCredential(key)
)
print(f"Created client of type: {type(client).__name__}")
Tool Calling
from azure.ai.inference.models import (
    SystemMessage, UserMessage, AssistantMessage, ToolMessage,
    ChatCompletionsToolDefinition, FunctionDefinition
)
tools = [
    ChatCompletionsToolDefinition(
        function=FunctionDefinition(
            name="get_weather",
            description="Get current weather for a location",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        )
    )
]
response = client.complete(
    messages=[UserMessage(content="What's the weather in Seattle?")],
    tools=tools
)
# Handle tool calls in response
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    # Execute tool and send result back
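The "execute tool" step above can be sketched as a lookup in a small registry. This example assumes the tool call exposes `function.name` and a JSON-encoded `function.arguments` string; `TOOL_REGISTRY`, `run_tool_call`, and the `get_weather` implementation are hypothetical names introduced for illustration, and the fake tool call only stands in for a real one so the code runs offline.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical tool implementation; a real one would query a weather service."""
    return json.dumps({"location": location, "forecast": "unknown"})


# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> str:
    """Look up the requested function, decode its JSON arguments, and call it."""
    name = tool_call.function.name
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        # Report unknown tools as data rather than raising.
        return json.dumps({"error": f"unknown tool: {name}"})
    arguments = json.loads(tool_call.function.arguments or "{}")
    return handler(**arguments)


# Offline stand-in shaped like the tool call read from the response above.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Seattle"}'),
)
result = run_tool_call(fake_call)
```

In a real conversation, the returned string would typically go back to the model as a ToolMessage tied to `tool_call.id`, placed after an AssistantMessage carrying the original tool calls, followed by another `client.complete` call.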
Message Types
| Type | Description |
|---|---|
| SystemMessage | System instructions for the model |
| UserMessage | User input (text, images, audio) |
| AssistantMessage | Model responses |
| ToolMessage | Tool execution results |
| DeveloperMessage | Developer-level instructions |
Client Types
| Client | Purpose |
|---|---|
| ChatCompletionsClient | Chat and text completions |
| EmbeddingsClient | Text and image embeddings |
| load_client | Auto-detect client from endpoint |
Best Practices
- Use Entra ID for production authentication
- Set defaults at client creation for consistent behavior
- Handle streaming for long responses to improve UX
- Close async clients explicitly or use context managers
- Specify model when endpoint serves multiple deployments
- Use load_client when you don’t know the endpoint type
- Reuse model_info: the client caches it after the first get_model_info() call
Reference Files
| File | Contents |
|---|---|
| references/streaming.md | Streaming responses, async iteration, SSE patterns, error handling |
| references/tool-calling.md | Function/tool calling patterns, tool registry, multi-turn conversations |
| scripts/chat_completion.py | CLI for chat completions, embeddings, and interactive mode |