debug:tensorflow

📁 snakeo/claude-debug-and-refactor-skills-plugin 📅 Jan 19, 2026
Install command
npx skills add https://github.com/snakeo/claude-debug-and-refactor-skills-plugin --skill debug:tensorflow

Skill Documentation

TensorFlow Debugging Guide

This skill provides a systematic approach to debugging TensorFlow applications, covering common error patterns, debugging tools, and resolution strategies.

Common Error Patterns

1. Shape Mismatch Errors

Symptoms:

  • InvalidArgumentError: Incompatible shapes
  • ValueError: Shapes (X,) and (Y,) are incompatible
  • Matrix multiplication failures

Diagnostic Steps:

# Print shapes at key points
print(f"Input shape: {x.shape}")
print(f"Expected shape: {model.input_shape}")

# Use tf.debugging for assertions
tf.debugging.assert_shapes([
    (x, ('batch', 'features')),
    (y, ('batch', 'classes'))
])

# Run tf.function bodies eagerly so shapes can be inspected with print() or a debugger
tf.config.run_functions_eagerly(True)

Common Causes:

  • Batch dimension mismatch (missing or extra dimension)
  • Incorrect reshape operations
  • Mismatched layer input/output dimensions
  • Broadcasting issues with incompatible shapes
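
The broadcasting cause is easiest to see by walking the rule by hand. The sketch below is a framework-free illustration of the right-aligned broadcasting rule that TensorFlow shares with NumPy. The helper name `broadcast_shape` is hypothetical and not part of TensorFlow; the library's own static check is `tf.broadcast_static_shape`.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of two static shapes, or raise ValueError.

    Hypothetical helper for illustration: shapes are aligned from the
    trailing dimension, and each pair of sizes must match or contain a 1.
    """
    result = []
    for dim_a, dim_b in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"Shapes {tuple(a)} and {tuple(b)} are incompatible")
    return tuple(reversed(result))


# A per-feature bias broadcasts across the batch dimension.
print(broadcast_shape((32, 10), (10,)))    # (32, 10)
# A column vector broadcasts across the feature dimension.
print(broadcast_shape((32, 10), (32, 1)))  # (32, 10)

# A rank-1 tensor of length 32 aligns with the last axis (size 10), so it fails.
try:
    broadcast_shape((32, 10), (32,))
except ValueError as err:
    print(err)
```

The last case is the typical "missing dimension" bug: per-example values of shape (32,) need an explicit trailing axis, for example via `tf.expand_dims(x, axis=-1)`, before they can combine with a (32, 10) tensor.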

Solutions:

# Expand dimensions if needed
x = tf.expand_dims(x, axis=0)  # Add batch dimension

# Reshape explicitly
x = tf.reshape(x, [-1, height, width, channels])

# Use tf.ensure_shape for runtime validation
x = tf.ensure_shape(x, [None, 224, 224, 3])

2. OOM (Out of Memory) Errors

Symptoms:

  • ResourceExhaustedError: OOM when allocating tensor
  • CUDA_ERROR_OUT_OF_MEMORY
  • Training crashes after a few epochs

Diagnostic Steps:

# Check GPU memory usage
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        details = tf.config.experimental.get_device_details(gpu)
        print(f"GPU: {gpu.name}, Details: {details}")

# Monitor memory during training (current and peak bytes allocated by TensorFlow)
memory = tf.config.experimental.get_memory_info('GPU:0')
print(f"Current: {memory['current'] / 1e6:.1f} MB, peak: {memory['peak'] / 1e6:.1f} MB")

Solutions:

# Enable memory growth (prevent TF from allocating all GPU memory)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

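# Alternatively, cap GPU memory with a logical device (do not combine with
# memory growth on the same GPU; both must be set before the GPU is initialized)
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]  # 4096 MB = 4 GB
)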

# Reduce batch size
BATCH_SIZE = 16  # Try smaller values

# Use gradient checkpointing for large models
# (recompute activations during the backward pass, e.g. by wrapping a block with tf.recompute_grad)

# Clear session between runs
tf.keras.backend.clear_session()

# Use mixed precision training
tf.keras.mixed_precision.set_global_policy('mixed_float16')
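
To judge how much a smaller batch or a narrower dtype might help, a back-of-the-envelope estimate is often enough. The sketch below only counts the bytes of a single dense tensor; real training also holds activations for every layer, gradients, optimizer state, and workspace buffers, so treat the numbers as a lower bound. The helper `tensor_bytes` and the image shape are illustrative assumptions.

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed to store one dense tensor (4 bytes per float32 element)."""
    return math.prod(shape) * bytes_per_element


# One float32 batch of 32 RGB images at 224x224.
full = tensor_bytes((32, 224, 224, 3))
# Same images with the batch size halved.
half_batch = tensor_bytes((16, 224, 224, 3))
# Same batch stored as float16 (2 bytes per element).
fp16 = tensor_bytes((32, 224, 224, 3), bytes_per_element=2)

for label, size in [("batch=32 float32", full),
                    ("batch=16 float32", half_batch),
                    ("batch=32 float16", fp16)]:
    print(f"{label}: {size / 2**20:.1f} MiB")
```

Both halving the batch size and switching the stored dtype to float16 halve this tensor's footprint, which is why they are usually the first two things to try.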

3. NaN/Inf in Loss

Symptoms:

  • Loss becomes nan or inf during training
  • Model predictions are all NaN
  • Gradient norm explodes

Diagnostic Steps:

# Enable numeric checking
tf.debugging.enable_check_numerics()

# Check for NaN in tensors
tf.debugging.check_numerics(tensor, "Tensor contains NaN or Inf")

# Use TensorBoard Debugger V2 (the first argument is named dump_root, not logdir)
tf.debugging.experimental.enable_dump_debug_info(
    dump_root='/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000
)

Common Causes:

  • Learning rate too high
  • Exploding gradients
  • Log of zero or negative numbers
  • Division by zero
  • Incorrect loss function for data range

Solutions:

# Reduce learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

# Add gradient clipping
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# or
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)

# Use numerically stable operations
# Instead of: tf.math.log(x)
tf.math.log(x + 1e-7)  # Add epsilon

# Instead of: x / y
tf.math.divide_no_nan(x, y)

# Add batch normalization
model.add(tf.keras.layers.BatchNormalization())

# Check data for NaN before training
assert not tf.reduce_any(tf.math.is_nan(train_data)).numpy()
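
Adding an epsilon is one form of numerical stabilization; another common one is shifting values before exponentiating. The following plain-Python sketch (not TensorFlow code) shows why a naive log-sum-exp breaks on large logits while the shifted version does not. The function names are hypothetical. In TensorFlow, `tf.reduce_logsumexp` and the `from_logits=True` option of the Keras cross-entropy losses apply the same idea internally.

```python
import math


def naive_logsumexp(values):
    """log(sum(exp(v))) computed directly; overflows for large inputs."""
    return math.log(sum(math.exp(v) for v in values))


def stable_logsumexp(values):
    """Same quantity, shifted by the maximum so every exponent is <= 0."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values))


logits = [1000.0, 1000.0]

try:
    naive_logsumexp(logits)
except OverflowError:
    # Python raises here; float32 tensor math would silently produce inf instead.
    print("naive version overflowed")

print(stable_logsumexp(logits))  # 1000 + log(2)
```

When a loss turns into NaN only for confident predictions, a hand-written softmax or log in the loss function is a likely place to look for the unshifted form.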

4. Gradient Issues

Symptoms:

  • Vanishing gradients (weights not updating)
  • Exploding gradients (loss becomes NaN)
  • Training stalls, loss doesn’t decrease

Diagnostic Steps:

# Inspect gradients with GradientTape
with tf.GradientTape() as tape:
    predictions = model(x, training=True)
    loss = loss_fn(y, predictions)

gradients = tape.gradient(loss, model.trainable_variables)

for var, grad in zip(model.trainable_variables, gradients):
    if grad is not None:
        print(f"{var.name}: grad_norm={tf.norm(grad).numpy():.6f}")
    else:
        print(f"{var.name}: NO GRADIENT (disconnected)")

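# Check for dead ReLUs: fraction of non-positive activations on a real batch
# (layer index 5 is illustrative; layer.output is symbolic, so evaluate it via a sub-model)
activation_model = tf.keras.Model(model.inputs, model.layers[5].output)
activations = activation_model(x, training=False)
dead_fraction = tf.reduce_mean(tf.cast(activations <= 0, tf.float32))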

Solutions:

# For vanishing gradients
# Use He initialization for ReLU networks
initializer = tf.keras.initializers.HeNormal()

# Use LeakyReLU instead of ReLU
# (in Keras 3 / TF 2.16+ the argument is named negative_slope instead of alpha)
model.add(tf.keras.layers.LeakyReLU(alpha=0.1))

# Add residual connections (skip connections)

# For exploding gradients
# Apply gradient clipping
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)

# Use proper weight initialization
initializer = tf.keras.initializers.GlorotUniform()
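
Global-norm clipping rescales all gradients by a single factor, so the update direction is preserved while its length is capped. The sketch below re-implements that arithmetic in plain Python purely for illustration, under the assumption that gradients are flat lists of floats; in real code, call `tf.clip_by_global_norm` as shown above.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one factor so their joint L2 norm is <= clip_norm.

    Plain-Python sketch: `grads` is a list of flat lists of floats.
    Returns (clipped_grads, global_norm), like the TensorFlow function.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The factor is 1.0 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


grads = [[3.0], [4.0]]  # global norm = sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm)     # 5.0
print(clipped)  # each value scaled by 1/5, direction preserved
```

Logging the returned global norm over training is a cheap way to tell whether clipping triggers occasionally or on every step; the latter usually points at the learning rate or initialization rather than at rare outlier batches.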

5. GPU Not Detected

Symptoms:

  • tf.config.list_physical_devices('GPU') returns empty list
  • Training runs on CPU (slow)
  • CUDA errors on startup

Diagnostic Steps:

import subprocess
import tensorflow as tf

# Check available devices
print("Physical devices:", tf.config.list_physical_devices())
print("GPU devices:", tf.config.list_physical_devices('GPU'))
print("Built with CUDA:", tf.test.is_built_with_cuda())
# tf.test.is_gpu_available() is deprecated; check the device list instead
print("GPU available:", bool(tf.config.list_physical_devices('GPU')))

# Check the NVIDIA driver and the highest CUDA version it supports
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)

# Verify the TensorFlow build (includes the CUDA/cuDNN versions it was built against)
print(tf.__version__)
print(tf.sysconfig.get_build_info())

Common Causes:

  • Wrong TensorFlow package (CPU-only version)
  • CUDA/cuDNN version mismatch
  • NVIDIA driver issues
  • GPU not visible to container (Docker)

Solutions:

# Install the correct TensorFlow GPU package (shell)
pip install "tensorflow[and-cuda]"  # TF 2.14+ on Linux
# or
pip install tensorflow-gpu  # Legacy package for old releases only; deprecated and no longer updated

# Verify CUDA compatibility
# TF 2.15: CUDA 12.2, cuDNN 8.9
# TF 2.14: CUDA 11.8, cuDNN 8.7
# TF 2.13: CUDA 11.8, cuDNN 8.6

# For Docker, pass the GPUs through (requires the NVIDIA Container Toolkit)
docker run --gpus all -it tensorflow/tensorflow:latest-gpu

# Force GPU visibility (Python; set this before importing tensorflow)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Use first GPU

# Verify GPU is being used
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)
    print(c.device)  # Should show GPU

6. SavedModel Loading Errors

Symptoms:

  • OSError: SavedModel file does not exist
  • ValueError: Unknown layer when loading
  • Version compatibility errors

Diagnostic Steps:

# Check SavedModel structure
import os
for root, dirs, files in os.walk('saved_model_dir'):
    for file in files:
        print(os.path.join(root, file))

# Verify model signature
loaded = tf.saved_model.load('saved_model_dir')
print(list(loaded.signatures.keys()))
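
A "SavedModel file does not exist" error frequently means the path points one level too high or too low. A SavedModel directory normally contains a `saved_model.pb` file and a `variables/` subdirectory (plus optional `assets/`). The sketch below checks for those two entries using only the standard library; the helper name `missing_saved_model_parts` is hypothetical.

```python
import os
import tempfile


def missing_saved_model_parts(export_dir):
    """List the entries a SavedModel directory is expected to contain but lacks."""
    expected = {
        "saved_model.pb": os.path.isfile,  # serialized graph and signatures
        "variables": os.path.isdir,        # checkpoint with the weights
    }
    return [name for name, exists in expected.items()
            if not exists(os.path.join(export_dir, name))]


# Demonstrate on a throwaway directory that mimics an incomplete export.
with tempfile.TemporaryDirectory() as export_dir:
    os.mkdir(os.path.join(export_dir, "variables"))
    print(missing_saved_model_parts(export_dir))  # ['saved_model.pb']
```

An empty result means the layout looks plausible; it does not prove the files are readable by the installed TensorFlow version.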

Solutions:

# Save model correctly
model.save('my_model.keras')  # Keras format (recommended)
model.save('my_model')  # SavedModel format (TF 2.15 and earlier; with Keras 3 use model.export('my_model'))

# Load with custom objects
custom_objects = {
    'CustomLayer': CustomLayer,
    'custom_loss': custom_loss
}
model = tf.keras.models.load_model('my_model.keras', custom_objects=custom_objects)

# For version mismatches, save weights only
model.save_weights('model_weights.weights.h5')
# Then rebuild model architecture and load weights
new_model.load_weights('model_weights.weights.h5')

7. Data Pipeline Issues

Symptoms:

  • InvalidArgumentError during training
  • Slow training (input bottleneck)
  • Memory leaks during data loading

Diagnostic Steps:

# Profile input pipeline
import tensorflow as tf

# Enable profiler
tf.profiler.experimental.start('/tmp/logdir')
# ... run training ...
tf.profiler.experimental.stop()

# Check dataset element spec
print(dataset.element_spec)

# Iterate and inspect
for batch in dataset.take(1):
    print(f"Batch shape: {batch[0].shape}")
    print(f"Dtype: {batch[0].dtype}")

Solutions:

# Optimize pipeline
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.cache()  # Cache after expensive operations
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # Overlap data loading

# Use parallel processing
dataset = dataset.map(
    preprocess_fn,
    num_parallel_calls=tf.data.AUTOTUNE
)

# Handle variable-length sequences
dataset = dataset.padded_batch(32, padded_shapes=([None], []))
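
What `padded_batch` does with `padded_shapes=[None]` can be pictured without TensorFlow: each batch is padded to the length of its own longest sequence. The following plain-Python sketch mirrors that behavior on lists; the function names are hypothetical and the real op works on tensors.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    target = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (target - len(seq)) for seq in sequences]


def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches and pad each batch independently."""
    for start in range(0, len(sequences), batch_size):
        yield pad_batch(sequences[start:start + batch_size], pad_value)


data = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]]
for batch in padded_batches(data, batch_size=2):
    print(batch)
# [[1, 2, 3], [4, 0, 0]]
# [[5, 6, 0, 0], [7, 8, 9, 10]]
```

Because the padded length differs between batches, downstream layers must accept a variable sequence dimension, and padded positions usually need masking so they do not affect the loss.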

Debugging Tools

tf.debugging Module

# Shape assertions
tf.debugging.assert_shapes([
    (x, ('N', 'H', 'W', 'C')),
    (y, ('N', 'num_classes'))
])

# Value assertions
tf.debugging.assert_non_negative(x)
tf.debugging.assert_near(x, y, rtol=1e-5)
tf.debugging.assert_equal(x.shape, expected_shape)

# Numeric checking
tf.debugging.check_numerics(tensor, "check: tensor contains NaN/Inf")
tf.debugging.enable_check_numerics()  # Global check

# Type assertions
tf.debugging.assert_type(x, tf.float32)

TensorBoard

import datetime

# Set up TensorBoard logging
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    profile_batch='500,520'  # Profile batches 500-520
)

model.fit(
    x_train, y_train,
    epochs=5,
    callbacks=[tensorboard_callback]
)

# Launch TensorBoard
# tensorboard --logdir logs/fit

TensorBoard Debugger V2

# Enable debug info dumping (the first argument is named dump_root, not logdir)
tf.debugging.experimental.enable_dump_debug_info(
    dump_root='/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000
)

# Run training...
model.fit(x_train, y_train, epochs=5)

# View in TensorBoard
# tensorboard --logdir /tmp/tfdbg2_logdir

Eager Execution Debugging

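# Eager execution is the default in TF 2.x, but tf.function bodies still run as graphs.
# Run them eagerly so Python breakpoints and print() work inside them:
tf.config.run_functions_eagerly(True)

# Debug inside @tf.function without leaving graph mode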
@tf.function
def my_function(x):
    tf.print("Debug:", x)  # Works in graph mode
    # Use tf.debugging.assert_* for runtime checks
    tf.debugging.assert_positive(x)
    return x * 2

# Disable tf.function for debugging
@tf.function
def buggy_function(x):
    # Temporarily remove @tf.function decorator
    # or use tf.config.run_functions_eagerly(True)
    return x

tf.print() for Graph Mode

@tf.function
def compute(x):
    # Python print() runs only while the function is traced, not on every call
    tf.print("Shape:", tf.shape(x))
    tf.print("Values:", x, summarize=-1)  # -1 for all values
    tf.print("Stats - min:", tf.reduce_min(x),
             "max:", tf.reduce_max(x),
             "mean:", tf.reduce_mean(x))
    return x * 2

Memory Profiler

# Enable memory growth so reported usage reflects actual allocations
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Use TensorFlow Profiler
with tf.profiler.experimental.Profile('/tmp/logdir'):
    model.fit(x_train, y_train, epochs=1)

# Check memory info
tf.config.experimental.get_memory_info('GPU:0')
# Returns: {'current': bytes, 'peak': bytes}
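
The raw byte counts are easier to compare against a memory limit once converted. A small formatting sketch follows, assuming the dictionary shape shown above; the helper `summarize_memory` and the sample values are made up for illustration and are not measurements.

```python
def summarize_memory(info, limit_mib=None):
    """Format a {'current': bytes, 'peak': bytes} dict as MiB."""
    mib = 2 ** 20
    text = f"current={info['current'] / mib:.1f} MiB, peak={info['peak'] / mib:.1f} MiB"
    if limit_mib is not None:
        text += f", peak is {100 * info['peak'] / (limit_mib * mib):.0f}% of limit"
    return text


# Illustrative values only; in practice pass the result of get_memory_info().
sample = {"current": 512 * 2**20, "peak": 3072 * 2**20}
print(summarize_memory(sample, limit_mib=4096))
# current=512.0 MiB, peak=3072.0 MiB, peak is 75% of limit
```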

The Four Phases of TensorFlow Debugging

Phase 1: Reproduce and Isolate

  1. Create minimal reproduction

    # Minimal test case
    import tensorflow as tf
    
    # Smallest possible model
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(5,)),
        tf.keras.layers.Dense(10)
    ])
    
    # Synthetic data
    x = tf.random.normal((32, 5))
    y = tf.random.normal((32, 10))
    
    model.compile(optimizer='adam', loss='mse')
    model.fit(x, y, epochs=1)
    
  2. Enable eager execution for line-by-line debugging

    tf.config.run_functions_eagerly(True)
    
  3. Add assertions at key points

    def debug_forward_pass(model, x):
        for i, layer in enumerate(model.layers):
            x = layer(x)
            tf.debugging.check_numerics(x, f"Layer {i} output")
            print(f"Layer {i}: {x.shape}, range=[{float(tf.reduce_min(x)):.3f}, {float(tf.reduce_max(x)):.3f}]")
        return x
    

Phase 2: Analyze and Understand

  1. Inspect tensor shapes throughout the pipeline

    def trace_shapes(model, x):
        shapes = []
        for layer in model.layers:
            x = layer(x)
            shapes.append((layer.name, x.shape))
        return shapes
    
  2. Check gradient flow

    def analyze_gradients(model, x, y, loss_fn):
        with tf.GradientTape() as tape:
            pred = model(x, training=True)
            loss = loss_fn(y, pred)
    
        grads = tape.gradient(loss, model.trainable_variables)
    
        analysis = []
        for var, grad in zip(model.trainable_variables, grads):
            if grad is None:
                analysis.append((var.name, "NONE - disconnected"))
            else:
                norm = tf.norm(grad).numpy()
                analysis.append((var.name, f"norm={norm:.6f}"))
        return analysis
    
  3. Profile performance

    # Use tf.profiler
    tf.profiler.experimental.start('/tmp/logdir')
    model.fit(x, y, epochs=1)
    tf.profiler.experimental.stop()
    

Phase 3: Fix and Verify

  1. Apply targeted fixes based on diagnosis

    • Shape issues: Add explicit reshapes and assertions
    • NaN issues: Add epsilon, reduce learning rate, clip gradients
    • OOM issues: Reduce batch size, enable memory growth
    • GPU issues: Check CUDA compatibility, install correct packages
  2. Verify fix doesn’t break other functionality

    # Run comprehensive tests
    def test_model_components(model, loss_fn, x, y, expected_shape):
        # Test forward pass
        output = model(x)
        assert output.shape == expected_shape
    
        # Test backward pass (Keras losses take y_true first, then y_pred)
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        assert all(g is not None for g in grads)
    
        # Test save/load round trip (compare with a tolerance, not exact equality)
        model.save('/tmp/test_model.keras')
        loaded = tf.keras.models.load_model('/tmp/test_model.keras')
        tf.debugging.assert_near(model(x), loaded(x), rtol=1e-5)
    

Phase 4: Prevent and Document

  1. Add permanent assertions for critical invariants

    class RobustModel(tf.keras.Model):
        def call(self, x, training=False):
            tf.debugging.assert_shapes([(x, ('batch', 'features'))])
    
            x = self.layer1(x)
            tf.debugging.check_numerics(x, "After layer1")
    
            return self.output_layer(x)
    
  2. Set up monitoring callbacks

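    import math
    
    # Keras also ships tf.keras.callbacks.TerminateOnNaN for the common case
    class NanCallback(tf.keras.callbacks.Callback):
        def on_train_batch_end(self, batch, logs=None):
            loss = (logs or {}).get('loss')
            if loss is not None and not math.isfinite(loss):
                self.model.stop_training = True
                raise ValueError(f"Non-finite loss detected at batch {batch}")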
    
  3. Document the issue and solution

    # BUGFIX: Shape mismatch in attention layer
    # Issue: Input was (batch, seq, features) but attention expected (batch, heads, seq, features_per_head)
    # Solution: Split the feature axis into heads, then move the heads axis forward
    x = tf.reshape(x, [batch_size, seq_len, num_heads, -1])
    x = tf.transpose(x, perm=[0, 2, 1, 3])
    

Quick Reference Commands

Device and Configuration

# List devices
tf.config.list_physical_devices()
tf.config.list_physical_devices('GPU')

# GPU memory growth
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Force CPU execution
with tf.device('/CPU:0'):
    result = model(x)

# Check if built with CUDA
tf.test.is_built_with_cuda()

Debugging Assertions

# Numeric checks
tf.debugging.check_numerics(tensor, message)
tf.debugging.enable_check_numerics()

# Shape checks
tf.debugging.assert_shapes([(tensor, shape_tuple)])
tf.ensure_shape(tensor, shape)

# Value checks
tf.debugging.assert_positive(tensor)
tf.debugging.assert_non_negative(tensor)
tf.debugging.assert_near(a, b, rtol=1e-5)
tf.debugging.assert_equal(a, b)
tf.debugging.assert_less(a, b)
tf.debugging.assert_greater(a, b)

Profiling and Logging

# TensorBoard logging
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='./logs',
    histogram_freq=1
)

# Start profiler
tf.profiler.experimental.start('/tmp/logdir')
# ... code ...
tf.profiler.experimental.stop()

# Debug info for TensorBoard Debugger V2
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2',
    tensor_debug_mode='FULL_HEALTH'
)

Memory Management

# Clear session
tf.keras.backend.clear_session()

# Get memory info
tf.config.experimental.get_memory_info('GPU:0')

# Mixed precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')

Gradient Debugging

# Inspect gradients
with tf.GradientTape() as tape:
    loss = compute_loss()
gradients = tape.gradient(loss, model.trainable_variables)

# Clip gradients
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)

# Check for None gradients (disconnected graph)
for var, grad in zip(model.trainable_variables, gradients):
    if grad is None:
        print(f"Warning: {var.name} has no gradient")

Version Compatibility Reference

TensorFlow   Python      CUDA   cuDNN
2.16.x       3.9-3.12    12.3   8.9
2.15.x       3.9-3.11    12.2   8.9
2.14.x       3.9-3.11    11.8   8.7
2.13.x       3.8-3.11    11.8   8.6
2.12.x       3.8-3.11    11.8   8.6
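
When checking an installation against this table, a small lookup keyed by the major.minor version can save some squinting. The sketch below copies the table's values into a dictionary; the names `COMPATIBILITY` and `expected_toolchain` are hypothetical. The result can be compared by hand with the output of `tf.sysconfig.get_build_info()`, which on CUDA builds typically includes `cuda_version` and `cudnn_version` entries.

```python
# Values copied from the compatibility table above.
COMPATIBILITY = {
    "2.16": {"python": "3.9-3.12", "cuda": "12.3", "cudnn": "8.9"},
    "2.15": {"python": "3.9-3.11", "cuda": "12.2", "cudnn": "8.9"},
    "2.14": {"python": "3.9-3.11", "cuda": "11.8", "cudnn": "8.7"},
    "2.13": {"python": "3.8-3.11", "cuda": "11.8", "cudnn": "8.6"},
    "2.12": {"python": "3.8-3.11", "cuda": "11.8", "cudnn": "8.6"},
}


def expected_toolchain(tf_version):
    """Return the table row for a version string such as '2.15.1', or None."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return COMPATIBILITY.get(major_minor)


print(expected_toolchain("2.15.1"))
print(expected_toolchain("2.11.0"))  # None: not covered by the table
```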

Additional Resources