building-dashboards
npx skills add https://github.com/axiomhq/skills --skill building-dashboards
Building Dashboards
You design dashboards that help humans make decisions quickly. Dashboards are products: audience, questions, and actions matter more than chart count.
Philosophy
- Decisions first. Every panel answers a question that leads to an action.
- Overview → drilldown → evidence. Start broad, narrow on click/filter, end with raw logs.
- Rates and percentiles over averages. Averages hide problems; p95/p99 expose them.
- Simple beats dense. One question per panel. No chart junk.
- Validate with data. Never guess fields; discover schema first.
Entry Points
Choose your starting point:
| Starting from | Workflow |
|---|---|
| Vague description | Intake → design blueprint → APL per panel → deploy |
| Template | Pick template → customize dataset/service/env → deploy |
| Splunk dashboard | Extract SPL → translate via spl-to-apl → map to chart types → deploy |
| Exploration | Use axiom-sre to discover schema/signals → productize into panels |
Intake: What to Ask First
Before designing, clarify:
- Audience & decision
  - Oncall triage? (fast refresh, error-focused)
  - Team health? (daily trends, SLO tracking)
  - Exec reporting? (weekly summaries, high-level)
- Scope
  - Service, environment, region, cluster, endpoint?
  - Single service or cross-service view?
- Datasets
  - Which Axiom datasets contain the data?
  - Run `getschema` to discover fields; never guess:
    ['dataset'] | where _time between (ago(1h) .. now()) | getschema
- Golden signals
  - Traffic: requests/sec, events/min
  - Errors: error rate, 5xx count
  - Latency: p50, p95, p99 duration
  - Saturation: CPU, memory, queue depth, connections
- Drilldown dimensions
  - What do users filter/group by? (service, route, status, pod, customer_id)
Dashboard Blueprint
Use this 4-section structure as the default:
1. At-a-Glance (Statistic panels)
Single numbers that answer “is it broken right now?”
- Error rate (last 5m)
- p95 latency (last 5m)
- Request rate (last 5m)
- Active alerts (if applicable)
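The Statistic chart section below shows the error-rate query; here is a minimal sketch for the p95 latency stat, assuming a duration_ms field:
// p95 latency as a single number (duration_ms is an assumed field name)
['logs']
| where service == "api"
| summarize p95_ms = percentile(duration_ms, 95)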
2. Trends (TimeSeries panels)
Time-based patterns that answer “what changed?”
- Traffic over time
- Error rate over time
- Latency percentiles over time
- Stacked by status/service for comparison
3. Breakdowns (Table/Pie panels)
Top-N analysis that answers “where should I look?”
- Top 10 failing routes
- Top 10 error messages
- Worst pods by error rate
- Request distribution by status
4. Evidence (LogStream + SmartFilter)
Raw events that answer “what exactly happened?”
- LogStream filtered to errors
- SmartFilter for service/env/route
- Key fields projected for readability
Chart Types
Note: Dashboard queries inherit time from the UI picker; no explicit _time filter is needed.
Validation: TimeSeries, Statistic, Table, Pie, LogStream, Note, MonitorList are fully validated by dashboard-validate. Heatmap, ScatterPlot, SmartFilter work but may trigger warnings.
Statistic
When: Single KPI, current value, threshold comparison.
['logs']
| where service == "api"
| summarize
total = count(),
errors = countif(status >= 500)
| extend error_rate = round(100.0 * errors / total, 2)
| project error_rate
Pitfalls: Don’t use for time series; ensure query returns single row.
TimeSeries
When: Trends over time, before/after comparison, rate changes.
// Single metric - use bin_auto for automatic sizing
['logs']
| summarize ['req/min'] = count() by bin_auto(_time)
// Latency percentiles - use percentiles_array for proper overlay
['logs']
| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)
Best practices:
- Use `bin_auto(_time)` instead of fixed `bin(_time, 1m)`; it auto-adjusts to the time window
- Use `percentiles_array()` instead of multiple `percentile()` calls so all percentiles render as one chart
- Too many series are unreadable; use `top N` or filter (see the sketch below)
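A minimal sketch of the bounded-series pattern, assuming a route field whose interesting values you already know:
// One series per whitelisted route keeps the chart readable
['logs']
| where route in ("checkout", "search", "login")
| summarize count() by bin_auto(_time), route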
Table
When: Top-N lists, detailed breakdowns, exportable data.
['logs']
| where status >= 500
| summarize errors = count() by route, error_message
| top 10 by errors
| project route, error_message, errors
Pitfalls:
- Always use `top N` to prevent unbounded results
- Use `project` to control column order and names
Pie
When: Share-of-total for LOW cardinality dimensions (≤6 slices).
['logs']
| summarize count() by status_class = case(
status < 300, "2xx",
status < 400, "3xx",
status < 500, "4xx",
"5xx"
)
Pitfalls:
- Never use for high cardinality (routes, user IDs)
- Prefer tables for >6 categories
- Always aggregate to reduce slices
LogStream
When: Raw event inspection, debugging, evidence gathering.
['logs']
| where service == "api" and status >= 500
| project-keep _time, trace_id, route, status, error_message, duration_ms
| take 100
Pitfalls:
- Always include `take N` (100–500 max)
- Use `project-keep` to show relevant fields only
- Filter aggressively; raw logs are expensive
Heatmap
When: Distribution visualization, latency patterns, density analysis.
['logs']
| summarize histogram(duration_ms, 15) by bin_auto(_time)
Best for: Latency distributions, response time patterns, identifying outliers.
Scatter Plot
When: Correlation between two metrics, identifying patterns.
['logs']
| summarize avg(duration_ms), avg(resp_size_bytes) by route
Best for: Response size vs latency correlation, resource usage patterns.
SmartFilter (Filter Bar)
When: Interactive filtering for the entire dashboard.
SmartFilter is a chart type that creates dropdown/search filters. Requires:
- A `SmartFilter` chart with filter definitions
- `declare query_parameters` in each panel query
Filter types:
- `selectType: "apl"` – Dynamic dropdown from an APL query
- `selectType: "list"` – Static dropdown with predefined options
- `type: "search"` – Free-text input
Panel query pattern:
declare query_parameters (country_filter:string = "");
['logs'] | where isempty(country_filter) or ['geo.country'] == country_filter
See reference/smartfilter.md for full JSON structure and cascading filter examples.
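A sketch of the same pattern with two cascading parameters; the service and route field names are assumptions:
// Each panel declares every parameter its SmartFilters can set
declare query_parameters (service_filter:string = "", route_filter:string = "");
['logs']
| where isempty(service_filter) or service == service_filter
| where isempty(route_filter) or route == route_filter
| summarize count() by bin_auto(_time)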
Monitor List
When: Display monitor status on operational dashboards.
No APL needed; select monitors from the UI. Shows:
- Monitor status (normal/triggered/off)
- Run history (green/red squares)
- Dataset, type, notifiers
Note
When: Context, instructions, section headers.
Use GitHub Flavored Markdown for:
- Dashboard purpose and audience
- Runbook links
- Section dividers
- On-call instructions
Chart Configuration
Charts support JSON configuration options beyond the query. See reference/chart-config.md for full details.
Quick reference:
| Chart Type | Key Options |
|---|---|
| Statistic | colorScheme, customUnits, unit, showChart (sparkline), errorThreshold/warningThreshold |
| TimeSeries | aggChartOpts: variant (line/area/bars), scaleDistr (linear/log), displayNull |
| LogStream/Table | tableSettings: columns, fontSize, highlightSeverity, wrapLines |
| Pie | hideHeader |
| Note | text (markdown), variant |
Common options (all charts):
- `overrideDashboardTimeRange: boolean`
- `overrideDashboardCompareAgainst: boolean`
- `hideHeader: boolean`
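For illustration, a hedged sketch of a Statistic panel's options using the names above (values are invented; see reference/chart-config.md for the authoritative shape):
{
  "_note": "illustrative values only, not the canonical schema",
  "unit": "percent",
  "showChart": true,
  "warningThreshold": 1,
  "errorThreshold": 5,
  "hideHeader": false
}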
APL Patterns
Time Filtering in Dashboards vs Ad-hoc Queries
Dashboard panel queries do NOT need explicit time filters. The dashboard UI time picker automatically scopes all queries to the selected time window.
// DASHBOARD QUERY: no time filter needed
['logs']
| where service == "api"
| summarize count() by bin_auto(_time)
Ad-hoc queries (Axiom Query tab, axiom-sre exploration) MUST have explicit time filters:
// AD-HOC QUERY: always include a time filter
['logs']
| where _time between (ago(1h) .. now())
| where service == "api"
| summarize count() by bin_auto(_time)
Bin Size Selection
Prefer `bin_auto(_time)`; it automatically adjusts to the dashboard time window.
Manual bin sizes (only when auto doesn’t fit your needs):
| Time window | Bin size |
|---|---|
| 15m | 10s–30s |
| 1h | 1m |
| 6h | 5m |
| 24h | 15m–1h |
| 7d | 1h–6h |
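For example, pinning 15m buckets for a panel that always renders a 24h window, per the table above:
// Fixed bin size; prefer bin_auto unless the window is known in advance
['logs']
| summarize count() by bin(_time, 15m)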
Cardinality Guardrails
Prevent query explosion:
// GOOD: bounded
| summarize count() by route | top 10 by count_
// BAD: unbounded high-cardinality grouping
| summarize count() by user_id // millions of rows
Field Escaping
Fields with dots need bracket notation:
| where ['kubernetes.pod.name'] == "frontend"
Fields with dots IN the name (not hierarchy) need escaping:
| where ['kubernetes.labels.app\\.kubernetes\\.io/name'] == "frontend"
Golden Signal Queries
Traffic:
| summarize requests = count() by bin_auto(_time)
Errors (as rate %):
| summarize total = count(), errors = countif(status >= 500) by bin_auto(_time)
| extend error_rate = iff(total > 0, round(100.0 * errors / total, 2), 0.0)
| project _time, error_rate
Latency (use percentiles_array for proper chart overlay):
| summarize percentiles_array(duration_ms, 50, 95, 99) by bin_auto(_time)
Layout Composition
Grid Principles
- Dashboard width = 12 units
- Typical panel: w=3 (quarter), w=4 (third), w=6 (half), w=12 (full)
- Stats row: 4 panels × w=3, h=2
- TimeSeries row: 2 panels × w=6, h=4
- Tables: w=6 or w=12, h=4–6
- LogStream: w=12, h=6–8
Section Layout Pattern
Row 0-1: [Stat w=3] [Stat w=3] [Stat w=3] [Stat w=3]
Row 2-5: [TimeSeries w=6, h=4] [TimeSeries w=6, h=4]
Row 6-9: [Table w=6, h=4] [Pie w=6, h=4]
Row 10+: [LogStream w=12, h=6]
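A sketch of how the stats row above might look as panel geometry, assuming a grid-layout-style array of x/y/w/h entries (key names are an assumption; copy the exact shape from the files in reference/templates/):
{
  "_note": "key names assumed from typical grid layouts, not verified",
  "layout": [
    { "i": "stat-error-rate", "x": 0, "y": 0, "w": 3, "h": 2 },
    { "i": "stat-p95-latency", "x": 3, "y": 0, "w": 3, "h": 2 },
    { "i": "ts-traffic", "x": 0, "y": 2, "w": 6, "h": 4 },
    { "i": "logstream-errors", "x": 0, "y": 6, "w": 12, "h": 6 }
  ]
}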
Naming Conventions
- Use question-style titles: “Error rate by route” not “Errors”
- Prefix with context if multi-service: “[API] Error rate”
- Include units: “Latency (ms)”, “Traffic (req/s)”
Dashboard Settings
Refresh Rate
Dashboard auto-refreshes at configured interval. Options: 15s, 30s, 1m, 5m, etc.
⚠️ Query cost warning: Short refresh (15s) + long time range (90d) = expensive queries running constantly.
Recommendations:
| Use case | Refresh rate |
|---|---|
| Oncall/real-time | 15s–30s |
| Team health | 1m–5m |
| Executive/weekly | 5m–15m |
Sharing
- Just Me: Private, only you can access
- Group: Specific team/group in your org
- Everyone: All users in your Axiom org
Data visibility is still governed by dataset permissions; users only see data from datasets they can access.
URL Time Range Parameters
?t_qr=24h (quick range), ?t_ts=...&t_te=... (custom), ?t_against=-1d (comparison)
Setup
Run scripts/setup to check requirements (curl, jq, ~/.axiom.toml).
Config in ~/.axiom.toml (shared with axiom-sre):
[deployments.prod]
url = "https://api.axiom.co"
token = "xaat-your-token"
org_id = "your-org-id"
Deployment
Scripts
| Script | Usage |
|---|---|
| `scripts/get-user-id <deploy>` | Get your user ID for the owner field |
| `scripts/dashboard-list <deploy>` | List all dashboards |
| `scripts/dashboard-get <deploy> <id>` | Fetch dashboard JSON |
| `scripts/dashboard-validate <file>` | Validate JSON structure |
| `scripts/dashboard-create <deploy> <file>` | Create dashboard |
| `scripts/dashboard-update <deploy> <id> <file>` | Update (needs version) |
| `scripts/dashboard-copy <deploy> <id>` | Clone dashboard |
| `scripts/dashboard-link <deploy> <id>` | Get shareable URL |
| `scripts/dashboard-delete <deploy> <id>` | Delete (with confirmation) |
| `scripts/axiom-api <deploy> <method> <path>` | Low-level API calls |
Workflow
⚠️ CRITICAL: Always validate queries BEFORE deploying.
- Design the dashboard (sections + panels)
- Write APL for each panel
- Build JSON (from a template or manually)
- Validate queries using axiom-sre with an explicit time filter
- `dashboard-validate` to check structure
- `dashboard-create` or `dashboard-update` to deploy
- `dashboard-link` to get the URL; NEVER construct Axiom URLs manually (org IDs and base URLs vary per deployment)
- Share the link with the user
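A typical end-to-end run with the scripts above; <id> stands for the new dashboard's id (find it via dashboard-list if create doesn't print it):
# Validate structure, deploy, then fetch the canonical share URL
scripts/dashboard-validate ./dashboard.json
scripts/dashboard-create prod ./dashboard.json
scripts/dashboard-link prod <id>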
Sibling Skill Integration
spl-to-apl: Translate Splunk SPL → APL. Map `timechart` → TimeSeries, `stats` → Statistic/Table. See reference/splunk-migration.md.
axiom-sre: Discover schema with getschema, explore baselines, identify dimensions, then productize into panels.
Templates
Pre-built templates in reference/templates/:
| Template | Use case |
|---|---|
| `service-overview.json` | Single-service oncall dashboard with Heatmap |
| `service-overview-with-filters.json` | Same, with SmartFilter (route/status dropdowns) |
| `api-health.json` | HTTP API with traffic/errors/latency |
| `blank.json` | Minimal skeleton |
Placeholders: {{owner_id}}, {{service}}, {{dataset}}
Usage:
USER_ID=$(scripts/get-user-id prod)
scripts/dashboard-from-template service-overview "my-service" "$USER_ID" "my-dataset" ./dashboard.json
scripts/dashboard-validate ./dashboard.json
scripts/dashboard-create prod ./dashboard.json
⚠️ Templates assume field names (`service`, `status`, `route`, `duration_ms`). Discover your schema first and use `sed` to fix mismatches.
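For example, a sketch of patching a template whose field names don't match your schema (latency_ms is a hypothetical field):
# Rename the template's duration_ms to what getschema actually returned
sed -i 's/duration_ms/latency_ms/g' ./dashboard.json
scripts/dashboard-validate ./dashboard.json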
Common Pitfalls
| Problem | Cause | Solution |
|---|---|---|
| “unable to find dataset” errors | Dataset name doesn’t exist in your org | Check available datasets in Axiom UI |
| “creating dashboards for other users” 403 | Owner ID doesn’t match your token | Use scripts/get-user-id prod to get your UUID |
| All panels show errors | Field names don’t match your schema | Discover schema first, use sed to fix field names |
| Dashboard shows no data | Service filter too restrictive | Remove or adjust where service == 'x' filters |
| Queries time out | Missing time filter or too broad | Dashboard inherits time from picker; ad-hoc queries need explicit time filter |
| Wrong org in dashboard URL | Manually constructed URL | Always use `dashboard-link <deploy> <id>`; never guess org IDs or base URLs |
Reference
- `reference/chart-config.md` – All chart configuration options (JSON)
- `reference/smartfilter.md` – SmartFilter/FilterBar full configuration
- `reference/chart-cookbook.md` – APL patterns per chart type
- `reference/layout-recipes.md` – Grid layouts and section blueprints
- `reference/splunk-migration.md` – Splunk panel → Axiom mapping
- `reference/design-playbook.md` – Decision-first design principles
- `reference/templates/` – Ready-to-use dashboard JSON files
For APL syntax: https://axiom.co/docs/apl/introduction