```
npx skills add https://github.com/yzlnew/infra-skills --skill tilelang-developer
```
TileLang Developer
Write high-performance AI compute kernels using TileLang – a tile-based programming model that bridges the gap between CUDA’s low-level control and high-level abstractions.
When to Use This Skill
Use this skill when the user needs to:
- Implement custom GPU kernels for AI operations (matrix multiplication, attention mechanisms, etc.)
- Optimize performance-critical operators for modern GPUs (NVIDIA Ampere/Hopper, AMD MI300X, Ascend NPU)
- Debug TileLang code or resolve performance issues
- Port kernels across different hardware platforms
- Understand or modify existing TileLang implementations
Kernel Development Workflow
Follow these steps when writing a TileLang kernel:
Step 1: Analyze Requirements
Gather essential information:
Input/Output Specifications:
- Tensor shapes (M, N, K dimensions)
- Data types (float16, float32, bfloat16, int8)
- Memory layout (row-major, column-major)
Hardware Target:
- NVIDIA GPU (Ampere A100, Hopper H100, etc.)
- AMD GPU (MI300X, etc.)
- Huawei Ascend NPU
Performance Goals:
- Target throughput or latency
- Memory bandwidth constraints
- Comparison baseline (cuBLAS, vendor libraries)
Ask clarifying questions if details are missing.
Step 2: Set Up Kernel Structure
Create the basic kernel scaffold:
```python
import tilelang
import tilelang.language as T

@tilelang.jit(target="cuda", out_idx=[2])  # Specify output indices
def kernel_name(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16"),
    ):
        # Kernel logic will go here
        pass

    return main
```
Key decisions:
- `target`: "cuda" (NVIDIA), "hip" (AMD), or "cpu"
- `out_idx`: list of indices of output parameters
- Block dimensions: typical values are 64, 128, or 256
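The grid dimensions in the next step follow directly from these block sizes via ceiling division. As a quick sanity check, here is a plain-Python equivalent of `T.ceildiv` (illustrative only, no TileLang required):

```python
def ceildiv(a, b):
    # Number of blocks of size b needed to cover a elements
    # (the last block may be only partially full)
    return -(-a // b)

# M = N = 1000 with 128x128 output tiles -> an 8 x 8 grid of blocks
grid_x = ceildiv(1000, 128)  # 8
grid_y = ceildiv(1000, 128)  # 8
```

Note that non-divisible shapes (1000 vs 1024) still need the same grid; the kernel must guard out-of-bounds accesses in the partial tiles.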
Step 3: Define Grid and Memory Hierarchy
Set up computation grid and allocate memory:
```python
# Define grid dimensions
with T.Kernel(
    T.ceildiv(N, block_N),  # Grid X
    T.ceildiv(M, block_M),  # Grid Y
    threads=128
) as (bx, by):
    # Allocate shared memory (L1 cache)
    A_shared = T.alloc_shared((block_M, block_K), "float16")
    B_shared = T.alloc_shared((block_K, block_N), "float16")
    # Allocate register fragments (accumulators)
    C_local = T.alloc_fragment((block_M, block_N), "float32")
    # CRITICAL: Apply swizzle layout to avoid bank conflicts
    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared),
        B_shared: T.make_swizzled_layout(B_shared),
    })
```
Memory hierarchy:
- Global Memory (HBM): Input/output tensors, slowest
- Shared Memory (L1): Explicitly managed cache, ~164KB on A100
- Registers: Fastest, used for accumulators and temporaries
Critical optimization: Always apply T.make_swizzled_layout to shared memory to eliminate bank conflicts.
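To see why swizzling helps, here is a simplified, illustrative model (not TileLang's actual layout function): shared memory is divided into 32 banks, and when a tile's row stride in words is a multiple of 32, a column-wise access pattern sends every thread to the same bank. XOR-permuting the column index by the row index spreads those accesses across all banks:

```python
BANKS = 32  # NVIDIA shared memory has 32 banks of 4-byte words

def swizzled_col(row, col):
    # Toy XOR swizzle: permute columns differently in each row
    return col ^ (row % BANKS)

rows = range(BANKS)
# Column 0 accessed by 32 rows WITHOUT swizzle: every access hits bank 0
without = {0 % BANKS for _ in rows}
# WITH swizzle: the same accesses land on 32 distinct banks
with_swz = {swizzled_col(r, 0) % BANKS for r in rows}
```

Here `len(without)` is 1 (a 32-way conflict, serialized into 32 transactions) while `len(with_swz)` is 32 (fully parallel). `T.make_swizzled_layout` applies a hardware-appropriate permutation of this kind automatically.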
Step 4: Implement Computation Logic
Use TileLang primitives for data movement and computation:
```python
# Initialize accumulator
T.clear(C_local)

# Main computation loop with software pipelining
for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
    # Load tiles from global to shared memory
    T.copy(A[by * block_M, k * block_K], A_shared)
    T.copy(B[k * block_K, bx * block_N], B_shared)
    # Compute using Tensor Cores
    T.gemm(A_shared, B_shared, C_local, transpose_B=False)

# Write results back
T.copy(C_local, C[by * block_M, bx * block_N])
```
Key primitives:
- `T.copy`: intelligent data transfer (auto-selects cp.async, TMA, etc.)
- `T.gemm`: matrix multiplication using Tensor Cores
- `T.Pipelined`: software pipelining to overlap compute and memory transfer
- `T.Parallel`: element-wise parallel operations
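The tiling structure of the loop above can be mirrored in plain Python as a slow but readable reference, which is useful for reasoning about block and tile indexing before touching GPU code (illustrative only; the real kernel runs the `by`/`bx` loops as the grid, in parallel):

```python
def tiled_matmul(A, B, block_M, block_N, block_K):
    # Pure-Python reference mirroring the kernel's loop nest
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for by in range(0, M, block_M):              # grid Y (one block per tile)
        for bx in range(0, N, block_N):          # grid X
            # C[by:by+block_M, bx:bx+block_N] plays the role of C_local
            for k0 in range(0, K, block_K):      # the pipelined K loop
                for i in range(by, min(by + block_M, M)):
                    for j in range(bx, min(bx + block_N, N)):
                        for k in range(k0, min(k0 + block_K, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The `min(...)` bounds handle partial tiles at the edges, the same role that bounds guards play on the GPU.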
Pipeline stages:
- `num_stages=2`: double buffering
- `num_stages=3`: triple buffering (recommended for most workloads)
- `num_stages=4+`: diminishing returns; increases shared memory usage
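A quick way to decide how many stages fit: each pipeline stage buffers one A tile and one B tile in shared memory, so the footprint can be estimated with a hypothetical helper like this (actual usage may also include padding and other allocations):

```python
def smem_bytes(block_M, block_N, block_K, num_stages, elem_bytes=2):
    # Each stage holds one A tile (block_M x block_K) and
    # one B tile (block_K x block_N); elem_bytes=2 for float16.
    per_stage = (block_M * block_K + block_K * block_N) * elem_bytes
    return per_stage * num_stages

smem_bytes(128, 128, 32, 3)  # 49152 bytes -- well under A100's 164 KB
```

The same arithmetic shows why large blocks with deep pipelines can exceed the 64 KB limit on some AMD parts.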
Step 5: Validate and Test
Generate test code to verify correctness:
```python
# Example instantiation
func = kernel_name(
    M=1024, N=1024, K=1024,
    block_M=128, block_N=128, block_K=32
)

# Test against a reference implementation
import torch

A = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
B = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
C_reference = A @ B

# With out_idx=[2] in the decorator, the compiled kernel
# allocates and returns the output buffer
C_tilelang = func(A, B)

# Verify with a tolerance appropriate for FP16
torch.testing.assert_close(C_tilelang, C_reference, rtol=1e-3, atol=1e-3)
```
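Once correctness is established, throughput can be estimated from the standard 2·M·N·K FLOP count of a GEMM. A hypothetical back-of-envelope helper (runtime would come from your own timing harness, e.g. CUDA events):

```python
def gemm_tflops(M, N, K, runtime_ms):
    # A GEMM performs 2 * M * N * K floating-point operations
    # (one multiply and one add per inner-product term)
    flops = 2 * M * N * K
    return flops / (runtime_ms * 1e-3) / 1e12

# A 1024^3 GEMM finishing in 0.02 ms sustains ~107 TFLOPS
gemm_tflops(1024, 1024, 1024, 0.02)
```

Comparing this number against the device's peak Tensor Core throughput (and against cuBLAS on the same shapes) tells you whether Step 6 is needed.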
Step 6: Optimize Performance
Apply advanced optimizations if performance is suboptimal:
Block Size Tuning:
- A100: Try 128×128×32 or 64×64×32
- H100: Can use larger blocks (256×128×32)
- MI300X: May need smaller blocks due to 64KB shared memory limit
Pipeline Depth:
- Increase num_stages if memory-bound
- Decrease num_stages if shared memory is exhausted
Warp Policy (for advanced cases):
```python
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullRow)  # For attention
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullCol)  # For MLA decode
```
Block-level swizzle:
```python
T.use_swizzle(panel_size=10)  # Improves L2 cache hit rate
```
Common Kernel Patterns
Matrix Multiplication (GEMM)
Most fundamental kernel. See EXAMPLES.md for complete implementation.
Key features:
- 3-stage pipelining
- Swizzle layout for shared memory
- Float32 accumulator for precision
FlashAttention
Memory-efficient attention with online softmax. See EXAMPLES.md for complete implementation.
Key features:
- O(N) memory complexity
- Online softmax statistics
- Fused kernel (no intermediate materialization)
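The online-softmax trick that makes the O(N) memory footprint possible can be sketched in plain Python: maintain a running maximum `m` and a running sum `s` that is rescaled whenever the maximum changes, so scores can be consumed in a single streaming pass (illustrative sketch; the kernel rescales the output accumulator by `exp(m_old - m_new)` rather than re-reading scores):

```python
import math

def online_softmax(scores):
    # Streaming softmax with O(1) running statistics
    m, s = float("-inf"), 0.0
    for x in scores:
        m_new = max(m, x)
        # Rescale the old sum to the new max before adding the new term
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in scores]
```

Because every exponential is taken relative to the running max, the computation is numerically safe even for large logits.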
DeepSeek MLA
Multi-Head Latent Attention with KV compression. See EXAMPLES.md for complete implementation.
Key features:
- Split-KV parallelization
- Non-standard dimensions
- FullCol warp policy for narrow matrices
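Split-KV works because per-split softmax statistics can be merged exactly after the fact. A plain-Python sketch of the merge, where each split contributes a (max, sum-of-exponentials, weighted accumulator) triple (function names are illustrative, not TileLang API):

```python
import math

def split_stats(scores, values):
    # Statistics computed independently by one KV split
    m = max(scores)
    s = sum(math.exp(x - m) for x in scores)
    acc = sum(math.exp(x - m) * v for x, v in zip(scores, values))
    return (m, s, acc)

def merge_splits(parts):
    # Rescale each split's statistics to the global max, then combine
    m = max(p[0] for p in parts)
    s = sum(p[1] * math.exp(p[0] - m) for p in parts)
    acc = sum(p[2] * math.exp(p[0] - m) for p in parts)
    return acc / s
```

The merged result is bit-for-bit the same computation as running softmax attention over the full KV sequence, which is why the splits can be processed by independent thread blocks.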
Reference Documentation
When you need specific information:
- API details (parameters, signatures, options): Read API_REFERENCE.md
- Complete code examples (GEMM, Attention, MLA): Read EXAMPLES.md
- Troubleshooting (errors, performance issues): Read DEBUGGING.md
Critical Performance Guidelines
Always include these optimizations:
- Swizzle layout for shared memory:
  `T.annotate_layout({A_shared: T.make_swizzled_layout(A_shared)})`
- Software pipelining:
  `for k in T.Pipelined(num_blocks, num_stages=3):`
- Float32 accumulators:
  `C_local = T.alloc_fragment((M, N), "float32")  # Not float16`
- Aligned block_K:
  `block_K = 32  # Or 16; must align for Tensor Cores`
- Initialize accumulators:
  `T.clear(C_local)`
Hardware-Specific Considerations
NVIDIA GPUs
- Ampere (A100): Use cp.async, num_stages=3, block_K=32
- Hopper (H100): Can use TMA, larger blocks (256×128), num_stages=4
- Shared memory: 164KB (A100), 228KB (H100)
AMD GPUs
- MI300X: Use target="hip", smaller blocks, 64KB shared memory limit
- Test with both HIP and CUDA backends for compatibility
Huawei Ascend
- More experimental backend
- May require specific block sizes
- Consult Ascend-specific documentation
Code Quality Standards
When generating TileLang code:
- Add explanatory comments for non-obvious optimizations
- Specify hardware assumptions (e.g., “optimized for A100”)
- Include usage examples showing instantiation
- Document block size choices and tuning rationale
- Provide performance expectations (e.g., “~90% of cuBLAS”)
Example Kernel Request Flow
User: “Write a FP16 matrix multiplication kernel for A100”
Response:
- Clarify dimensions (if not specified)
- Generate complete kernel code with:
- Proper structure (@tilelang.jit, @T.prim_func)
- Swizzle layouts
- 3-stage pipelining
- Appropriate block sizes (128×128×32)
- Add usage example
- Explain key optimizations:
- “Swizzle layout eliminates bank conflicts”
- “3-stage pipeline overlaps memory and compute”
- “Float32 accumulator prevents overflow”
- Suggest testing approach
Troubleshooting Quick Reference
Compilation errors:
- Shared memory exceeded → Reduce block size or num_stages
- Shape mismatch → Verify dimension alignment in T.gemm
Runtime errors:
- Results all zeros → Check T.clear() and out_idx in decorator
- NaN/Inf → Use a float32 accumulator; add epsilon in divisions
Performance issues:
- Low throughput → Verify swizzle layout and pipelining are enabled
- Low occupancy → Reduce shared memory usage or block size
- Bank conflicts → Apply T.make_swizzled_layout
For detailed solutions, consult DEBUGGING.md.