tilelang-developer

📁 yzlnew/infra-skills 📅 Jan 22, 2026

Total installs: 8 · Weekly installs: 4 · Site rank: #34435

Install command:

npx skills add https://github.com/yzlnew/infra-skills --skill tilelang-developer

Agent install distribution:

windsurf 3 · trae 3 · opencode 3 · codex 3 · claude-code 3 · antigravity 3

Skill Documentation
TileLang Developer

Write high-performance AI compute kernels using TileLang – a tile-based programming model that bridges the gap between CUDA’s low-level control and high-level abstractions.

When to Use This Skill

Use this skill when the user needs to:

  • Implement custom GPU kernels for AI operations (matrix multiplication, attention mechanisms, etc.)
  • Optimize performance-critical operators for modern GPUs (NVIDIA Ampere/Hopper, AMD MI300X, Ascend NPU)
  • Debug TileLang code or resolve performance issues
  • Port kernels across different hardware platforms
  • Understand or modify existing TileLang implementations

Kernel Development Workflow

Follow these steps when writing a TileLang kernel:

Step 1: Analyze Requirements

Gather essential information:

Input/Output Specifications:

  • Tensor shapes (M, N, K dimensions)
  • Data types (float16, float32, bfloat16, int8)
  • Memory layout (row-major, column-major)

Hardware Target:

  • NVIDIA GPU (Ampere A100, Hopper H100, etc.)
  • AMD GPU (MI300X, etc.)
  • Huawei Ascend NPU

Performance Goals:

  • Target throughput or latency
  • Memory bandwidth constraints
  • Comparison baseline (cuBLAS, vendor libraries)

Ask clarifying questions if details are missing.
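The gathered requirements map directly onto kernel parameters. A minimal sketch of how they might be recorded before writing any kernel code (the `KernelSpec` name and fields are illustrative, not part of the TileLang API):

```python
from dataclasses import dataclass

@dataclass
class KernelSpec:
    # Illustrative container for the requirements gathered above;
    # not part of the TileLang API.
    M: int
    N: int
    K: int
    dtype: str = "float16"    # float16, float32, bfloat16, int8
    target: str = "cuda"      # "cuda", "hip", or "cpu"
    baseline: str = "cuBLAS"  # comparison library for performance goals

spec = KernelSpec(M=1024, N=1024, K=1024)
print(spec.M, spec.dtype, spec.target)
```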

Step 2: Set Up Kernel Structure

Create the basic kernel scaffold:

import tilelang
import tilelang.language as T

@tilelang.jit(target="cuda", out_idx=[2])  # Specify output indices
def kernel_name(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16")
    ):
        # Kernel logic will go here
        pass

    return main

Key decisions:

  • target: "cuda" (NVIDIA), "hip" (AMD), or "cpu"
  • out_idx: List indices of output parameters
  • Block dimensions: Typical values are 64, 128, or 256

Step 3: Define Grid and Memory Hierarchy

Set up computation grid and allocate memory:

# Define grid dimensions
with T.Kernel(
    T.ceildiv(N, block_N),  # Grid X
    T.ceildiv(M, block_M),  # Grid Y
    threads=128
) as (bx, by):

    # Allocate shared memory (L1 cache)
    A_shared = T.alloc_shared((block_M, block_K), "float16")
    B_shared = T.alloc_shared((block_K, block_N), "float16")

    # Allocate register fragments (accumulators)
    C_local = T.alloc_fragment((block_M, block_N), "float32")

    # CRITICAL: Apply swizzle layout to avoid bank conflicts
    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared),
        B_shared: T.make_swizzled_layout(B_shared)
    })

Memory hierarchy:

  • Global Memory (HBM): Input/output tensors, slowest
  • Shared Memory (L1): Explicitly managed cache, ~164KB on A100
  • Registers: Fastest, used for accumulators and temporaries

Critical optimization: Always apply T.make_swizzled_layout to shared memory to eliminate bank conflicts.
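The grid dimensions above follow directly from the ceiling division of the problem size by the block size; for the running 1024×1024 example, the arithmetic can be checked in plain Python:

```python
def ceildiv(a, b):
    # Same rounding as T.ceildiv: number of blocks needed to cover `a`
    return -(a // -b)

M = N = 1024
block_M = block_N = 128

grid_x = ceildiv(N, block_N)  # blocks along N
grid_y = ceildiv(M, block_M)  # blocks along M
print(grid_x, grid_y)         # 8 x 8 = 64 thread blocks

# Non-divisible sizes round up, so edge tiles are partially filled
print(ceildiv(1000, 128))     # 8
```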

Step 4: Implement Computation Logic

Use TileLang primitives for data movement and computation:

# Initialize accumulator
T.clear(C_local)

# Main computation loop with software pipelining
for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
    # Load tiles from global to shared memory
    T.copy(A[by * block_M, k * block_K], A_shared)
    T.copy(B[k * block_K, bx * block_N], B_shared)

    # Compute using Tensor Cores
    T.gemm(A_shared, B_shared, C_local, transpose_B=False)

# Write results back
T.copy(C_local, C[by * block_M, bx * block_N])

Key primitives:

  • T.copy: Intelligent data transfer (auto-selects cp.async, TMA, etc.)
  • T.gemm: Matrix multiplication using Tensor Cores
  • T.Pipelined: Software pipelining to overlap compute and memory transfer
  • T.Parallel: Element-wise parallel operations

Pipeline stages:

  • num_stages=2: Double buffering
  • num_stages=3: Triple buffering (recommended for most workloads)
  • num_stages=4+: Diminishing returns, increases shared memory usage
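Each pipeline stage holds its own copy of the A and B tiles in shared memory, so stage count trades latency hiding against the per-SM shared-memory budget. A back-of-the-envelope check (capacity numbers from the hardware notes below; float16 is 2 bytes):

```python
def smem_bytes(block_M, block_N, block_K, num_stages, elem_bytes=2):
    # Per-stage tiles: A is block_M x block_K, B is block_K x block_N
    tile = (block_M * block_K + block_K * block_N) * elem_bytes
    return num_stages * tile

cfg = dict(block_M=128, block_N=128, block_K=32)
for stages in (2, 3, 4):
    print(stages, smem_bytes(**cfg, num_stages=stages))  # 16 KB per stage

# num_stages=3 fits easily in A100's 164 KB; on MI300X's 64 KB limit,
# num_stages=4 (64 KB) consumes the entire budget.
assert smem_bytes(**cfg, num_stages=3) <= 164 * 1024
```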

Step 5: Validate and Test

Generate test code to verify correctness:

# Example instantiation
func = kernel_name(
    M=1024, N=1024, K=1024,
    block_M=128, block_N=128, block_K=32
)

# Test against reference implementation
import torch
A = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
B = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
C_reference = A @ B

# With out_idx=[2] in the decorator, the JIT kernel allocates
# and returns the output tensor itself
C_tilelang = func(A, B)

# Verify with appropriate tolerance for FP16
torch.testing.assert_close(C_tilelang, C_reference, rtol=1e-3, atol=1e-3)

Step 6: Optimize Performance

Apply advanced optimizations if performance is suboptimal:

Block Size Tuning:

  • A100: Try 128×128×32 or 64×64×32
  • H100: Can use larger blocks (256×128×32)
  • MI300X: May need smaller blocks due to 64KB shared memory limit

Pipeline Depth:

  • Increase num_stages if memory-bound
  • Decrease if shared memory is exhausted

Warp Policy (for advanced cases):

T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullRow)  # For attention
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullCol)  # For MLA decode

Block-level swizzle:

T.use_swizzle(panel_size=10)  # Improves L2 cache hit rate
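To decide whether further tuning is worthwhile, compare measured runtime against the kernel's arithmetic work. For GEMM the FLOP count is 2·M·N·K (one multiply and one add per term of each K-deep dot product); the timing value below is illustrative, not a measurement:

```python
def gemm_tflops(M, N, K, seconds):
    # 2*M*N*K: one multiply + one add per inner-product term
    flops = 2 * M * N * K
    return flops / seconds / 1e12

# Example: a 1024^3 GEMM measured at 50 microseconds (illustrative number)
print(gemm_tflops(1024, 1024, 1024, 50e-6))  # ~43 TFLOPS
# Compare against the device peak (A100 FP16 Tensor Core peak is ~312 TFLOPS)
# or against cuBLAS on the same shape to judge remaining headroom.
```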

Common Kernel Patterns

Matrix Multiplication (GEMM)

Most fundamental kernel. See EXAMPLES.md for complete implementation.

Key features:

  • 3-stage pipelining
  • Swizzle layout for shared memory
  • Float32 accumulator for precision

FlashAttention

Memory-efficient attention with online softmax. See EXAMPLES.md for complete implementation.

Key features:

  • O(N) memory complexity
  • Online softmax statistics
  • Fused kernel (no intermediate materialization)

DeepSeek MLA

Multi-Head Latent Attention with KV compression. See EXAMPLES.md for complete implementation.

Key features:

  • Split-KV parallelization
  • Non-standard dimensions
  • FullCol warp policy for narrow matrices

Reference Documentation

When you need specific information:

  • API details (parameters, signatures, options): Read API_REFERENCE.md
  • Complete code examples (GEMM, Attention, MLA): Read EXAMPLES.md
  • Troubleshooting (errors, performance issues): Read DEBUGGING.md

Critical Performance Guidelines

Always include these optimizations:

  1. Swizzle layout for shared memory:

    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared)
    })
    
  2. Software pipelining:

    for k in T.Pipelined(num_blocks, num_stages=3):
    
  3. Float32 accumulators:

    C_local = T.alloc_fragment((M, N), "float32")  # Not float16
    
  4. Aligned block_K:

    block_K = 32  # Or 16, must align for Tensor Core
    
  5. Initialize accumulators:

    T.clear(C_local)
    

Hardware-Specific Considerations

NVIDIA GPUs

  • Ampere (A100): Use cp.async, num_stages=3, block_K=32
  • Hopper (H100): Can use TMA, larger blocks (256×128), num_stages=4
  • Shared memory: 164KB (A100), 228KB (H100)

AMD GPUs

  • MI300X: Use target="hip", smaller blocks, 64KB shared memory limit
  • Test with both HIP and CUDA backends for compatibility

Huawei Ascend

  • More experimental backend
  • May require specific block sizes
  • Consult Ascend-specific documentation

Code Quality Standards

When generating TileLang code:

  1. Add explanatory comments for non-obvious optimizations
  2. Specify hardware assumptions (e.g., “optimized for A100”)
  3. Include usage examples showing instantiation
  4. Document block size choices and tuning rationale
  5. Provide performance expectations (e.g., “~90% of cuBLAS”)

Example Kernel Request Flow

User: “Write a FP16 matrix multiplication kernel for A100”

Response:

  1. Clarify dimensions (if not specified)
  2. Generate complete kernel code with:
    • Proper structure (@tilelang.jit, @T.prim_func)
    • Swizzle layouts
    • 3-stage pipelining
    • Appropriate block sizes (128×128×32)
  3. Add usage example
  4. Explain key optimizations:
    • “Swizzle layout eliminates bank conflicts”
    • “3-stage pipeline overlaps memory and compute”
    • “Float32 accumulator prevents overflow”
  5. Suggest testing approach

Troubleshooting Quick Reference

Compilation errors:

  • Shared memory exceeded → Reduce block size or num_stages
  • Shape mismatch → Verify dimension alignment in T.gemm

Runtime errors:

  • Results all zeros → Check T.clear() and out_idx in decorator
  • NaN/Inf → Use float32 accumulator, add epsilon in division

Performance issues:

  • Low throughput → Verify swizzle layout and pipelining enabled
  • Low occupancy → Reduce shared memory usage or block size
  • Bank conflicts → Apply T.make_swizzled_layout

For detailed solutions, consult DEBUGGING.md.