debug:tensorflow
npx skills add https://github.com/snakeo/claude-debug-and-refactor-skills-plugin --skill debug:tensorflow
Skill Documentation
TensorFlow Debugging Guide
This skill provides a systematic approach to debugging TensorFlow applications, covering common error patterns, debugging tools, and resolution strategies.
Common Error Patterns
1. Shape Mismatch Errors
Symptoms:
- InvalidArgumentError: Incompatible shapes
- ValueError: Shapes (X,) and (Y,) are incompatible
- Matrix multiplication failures
Diagnostic Steps:
# Print shapes at key points
print(f"Input shape: {x.shape}")
print(f"Expected shape: {model.input_shape}")
# Use tf.debugging for assertions
tf.debugging.assert_shapes([
    (x, ('batch', 'features')),
    (y, ('batch', 'classes'))
])
# Enable eager execution for immediate shape inspection
tf.config.run_functions_eagerly(True)
Common Causes:
- Batch dimension mismatch (missing or extra dimension)
- Incorrect reshape operations
- Mismatched layer input/output dimensions
- Broadcasting issues with incompatible shapes
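The broadcasting cause above is easy to reproduce; a minimal sketch (shapes chosen purely for illustration):

```python
import tensorflow as tf

# Broadcasting works when trailing dimensions are equal or 1.
a = tf.ones((32, 10))
b = tf.ones((32, 1))
ok = a + b  # (32, 1) broadcasts against (32, 10) -> (32, 10)

# Trailing dimensions 10 vs 7 are incompatible: raises in eager mode.
c = tf.ones((32, 7))
message = ""
try:
    a + c
except tf.errors.InvalidArgumentError as e:
    message = str(e)
```

The exception message includes both shapes, which is usually enough to locate the offending operation.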
Solutions:
# Expand dimensions if needed
x = tf.expand_dims(x, axis=0) # Add batch dimension
# Reshape explicitly
x = tf.reshape(x, [-1, height, width, channels])
# Use tf.ensure_shape for runtime validation
x = tf.ensure_shape(x, [None, 224, 224, 3])
2. OOM (Out of Memory) Errors
Symptoms:
- ResourceExhaustedError: OOM when allocating tensor
- CUDA_ERROR_OUT_OF_MEMORY
- Training crashes after a few epochs
Diagnostic Steps:
# Check GPU memory usage
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        details = tf.config.experimental.get_device_details(gpu)
        print(f"GPU: {gpu.name}, Details: {details}")
# Monitor memory during training
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000
)
Solutions:
# Enable memory growth (prevent TF from allocating all GPU memory)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
# Limit GPU memory
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]  # 4GB
)
# Reduce batch size
BATCH_SIZE = 16 # Try smaller values
# Use gradient checkpointing for large models
# (recompute activations during backward pass)
# Clear session between runs
tf.keras.backend.clear_session()
# Use mixed precision training
tf.keras.mixed_precision.set_global_policy('mixed_float16')
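The gradient-checkpointing bullet above can be sketched with `tf.recompute_grad`, which re-runs the wrapped function during the backward pass instead of storing its activations. The sub-network and sizes here are illustrative:

```python
import tensorflow as tf

# Hypothetical memory-heavy sub-network whose activations we don't want to keep.
hidden = [tf.keras.layers.Dense(128, activation="relu") for _ in range(3)]

def block(x):
    for layer in hidden:
        x = layer(x)
    return x

_ = block(tf.zeros((1, 32)))           # build variables outside the wrapped call
checkpointed = tf.recompute_grad(block)

x = tf.random.normal((8, 32))
with tf.GradientTape() as tape:
    y = checkpointed(x)
    loss = tf.reduce_mean(tf.square(y))

variables = [v for layer in hidden for v in layer.trainable_variables]
grads = tape.gradient(loss, variables)
```

The trade-off is extra compute on the backward pass in exchange for a smaller activation footprint.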
3. NaN/Inf in Loss
Symptoms:
- Loss becomes nan or inf during training
- Model predictions are all NaN
- Gradient norm explodes
Diagnostic Steps:
# Enable numeric checking
tf.debugging.enable_check_numerics()
# Check for NaN in tensors
tf.debugging.check_numerics(tensor, "Tensor contains NaN or Inf")
# Use TensorBoard Debugger V2
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',  # dump root directory (first positional argument)
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000
)
Common Causes:
- Learning rate too high
- Exploding gradients
- Log of zero or negative numbers
- Division by zero
- Incorrect loss function for data range
Solutions:
# Reduce learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
# Add gradient clipping
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# or
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)
# Use numerically stable operations
# Instead of: tf.math.log(x)
tf.math.log(x + 1e-7) # Add epsilon
# Instead of: x / y
tf.math.divide_no_nan(x, y)
# Add batch normalization
model.add(tf.keras.layers.BatchNormalization())
# Check data for NaN before training
assert not tf.reduce_any(tf.math.is_nan(train_data)).numpy()
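A quick demonstration of why the epsilon and `divide_no_nan` fixes above work, with illustrative values:

```python
import tensorflow as tf

x = tf.constant([1.0, 0.0, 2.0])
y = tf.constant([2.0, 0.0, 0.0])

# Plain x / y would produce nan (0/0) and inf (2/0) here.
safe_div = tf.math.divide_no_nan(x, y)  # returns 0.0 where y == 0

# Plain log(0.0) would produce -inf; the epsilon keeps it finite.
safe_log = tf.math.log(x + 1e-7)
```

Both results are finite, so downstream loss and gradient computations stay numerically healthy.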
4. Gradient Issues
Symptoms:
- Vanishing gradients (weights not updating)
- Exploding gradients (loss becomes NaN)
- Training stalls, loss doesn’t decrease
Diagnostic Steps:
# Inspect gradients with GradientTape
with tf.GradientTape() as tape:
    predictions = model(x, training=True)
    loss = loss_fn(y, predictions)
gradients = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, gradients):
    if grad is not None:
        print(f"{var.name}: grad_norm={tf.norm(grad).numpy():.6f}")
    else:
        print(f"{var.name}: NO GRADIENT (disconnected)")
# Check for dead ReLUs (fraction of non-positive activations)
probe = tf.keras.Model(model.inputs, model.layers[5].output)
activations = probe(x)
dead_fraction = tf.reduce_mean(tf.cast(activations <= 0, tf.float32))
Solutions:
# For vanishing gradients
# Use He initialization for ReLU networks
initializer = tf.keras.initializers.HeNormal()
# Use LeakyReLU instead of ReLU
model.add(tf.keras.layers.LeakyReLU(alpha=0.1))
# Add residual connections (skip connections)
# For exploding gradients
# Apply gradient clipping
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
# Use proper weight initialization
initializer = tf.keras.initializers.GlorotUniform()
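The residual-connection suggestion above can be sketched as a small custom layer; the class name and sizes are illustrative:

```python
import tensorflow as tf

class ResidualBlock(tf.keras.layers.Layer):
    """Minimal residual (skip-connection) block: output = x + F(x)."""

    def __init__(self, units):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(units, activation="relu")
        self.dense2 = tf.keras.layers.Dense(units)

    def call(self, x):
        # The identity shortcut gives gradients a direct path past F(x),
        # which mitigates vanishing gradients in deep stacks.
        return x + self.dense2(self.dense1(x))

x = tf.random.normal((4, 64))
y = ResidualBlock(64)(x)
```

Note the shortcut requires matching input/output widths; otherwise project the input with an extra Dense layer first.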
5. GPU Not Detected
Symptoms:
- tf.config.list_physical_devices('GPU') returns an empty list
- Training runs on CPU (slow)
- CUDA errors on startup
Diagnostic Steps:
# Check available devices
print("Physical devices:", tf.config.list_physical_devices())
print("GPU devices:", tf.config.list_physical_devices('GPU'))
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())  # deprecated; prefer list_physical_devices('GPU')
# Check CUDA/cuDNN versions
import subprocess
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)
# Verify TensorFlow GPU package
import tensorflow as tf
print(tf.__version__)
print(tf.sysconfig.get_build_info())
Common Causes:
- Wrong TensorFlow package (CPU-only version)
- CUDA/cuDNN version mismatch
- NVIDIA driver issues
- GPU not visible to container (Docker)
Solutions:
# Install correct TensorFlow GPU package
pip install tensorflow[and-cuda] # TF 2.15+
# or
pip install tensorflow-gpu # Older versions
# Verify CUDA compatibility
# TF 2.15: CUDA 12.x, cuDNN 8.9
# TF 2.14: CUDA 11.8, cuDNN 8.7
# TF 2.13: CUDA 11.8, cuDNN 8.6
# For Docker, use nvidia-docker
docker run --gpus all -it tensorflow/tensorflow:latest-gpu
# Force GPU visibility
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0' # Use first GPU
# Verify GPU is being used
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)
print(c.device)  # Should show GPU
6. SavedModel Loading Errors
Symptoms:
- OSError: SavedModel file does not exist
- ValueError: Unknown layer when loading
- Version compatibility errors
Diagnostic Steps:
# Check SavedModel structure
import os
for root, dirs, files in os.walk('saved_model_dir'):
    for file in files:
        print(os.path.join(root, file))
# Verify model signature
loaded = tf.saved_model.load('saved_model_dir')
print(list(loaded.signatures.keys()))
Solutions:
# Save model correctly
model.save('my_model.keras')  # Keras format (recommended in TF 2.16+ / Keras 3)
model.save('my_model')        # SavedModel format (TF <= 2.15)
# Load with custom objects
custom_objects = {
    'CustomLayer': CustomLayer,
    'custom_loss': custom_loss
}
model = tf.keras.models.load_model('my_model', custom_objects=custom_objects)
# For version mismatches, save weights only
model.save_weights('model_weights.weights.h5')
# Then rebuild model architecture and load weights
new_model.load_weights('model_weights.weights.h5')
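The rebuild-then-load pattern can be sketched end to end; the architecture, helper name, and file path here are hypothetical:

```python
import os
import tempfile

import tensorflow as tf

def build_model():
    # Must recreate the exact architecture the weights were saved from.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(4),
    ])

model = build_model()
path = os.path.join(tempfile.mkdtemp(), "model_weights.weights.h5")
model.save_weights(path)

# Rebuild the architecture in the current TF version, then restore weights.
new_model = build_model()
new_model.load_weights(path)

x = tf.random.normal((2, 8))
y1, y2 = model(x), new_model(x)
```

Because only weights cross the version boundary, this sidesteps serialization-format mismatches between TF releases.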
7. Data Pipeline Issues
Symptoms:
- InvalidArgumentError during training
- Slow training (input bottleneck)
- Memory leaks during data loading
Diagnostic Steps:
# Profile input pipeline
import tensorflow as tf
# Enable profiler
tf.profiler.experimental.start('/tmp/logdir')
# ... run training ...
tf.profiler.experimental.stop()
# Check dataset element spec
print(dataset.element_spec)
# Iterate and inspect
for batch in dataset.take(1):
    print(f"Batch shape: {batch[0].shape}")
    print(f"Dtype: {batch[0].dtype}")
Solutions:
# Optimize pipeline
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.cache() # Cache after expensive operations
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE) # Overlap data loading
# Use parallel processing
dataset = dataset.map(
    preprocess_fn,
    num_parallel_calls=tf.data.AUTOTUNE
)
# Handle variable-length sequences
dataset = dataset.padded_batch(32, padded_shapes=([None], []))
Debugging Tools
tf.debugging Module
# Shape assertions
tf.debugging.assert_shapes([
    (x, ('N', 'H', 'W', 'C')),
    (y, ('N', 'num_classes'))
])
# Value assertions
tf.debugging.assert_non_negative(x)
tf.debugging.assert_near(x, y, rtol=1e-5)
tf.debugging.assert_equal(x.shape, expected_shape)
# Numeric checking
tf.debugging.check_numerics(tensor, "check: tensor contains NaN/Inf")
tf.debugging.enable_check_numerics() # Global check
# Type assertions
tf.debugging.assert_type(x, tf.float32)
TensorBoard
# Set up TensorBoard logging
import datetime
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    profile_batch='500,520'  # Profile batches 500-520
)
model.fit(
    x_train, y_train,
    epochs=5,
    callbacks=[tensorboard_callback]
)
# Launch TensorBoard
# tensorboard --logdir logs/fit
TensorBoard Debugger V2
# Enable debug info dumping
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',  # dump root directory (first positional argument)
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000
)
# Run training...
model.fit(x_train, y_train, epochs=5)
# View in TensorBoard
# tensorboard --logdir /tmp/tfdbg2_logdir
Eager Execution Debugging
# Enable eager execution (default in TF 2.x)
tf.config.run_functions_eagerly(True)
# Debug with breakpoints in @tf.function
@tf.function
def my_function(x):
    tf.print("Debug:", x)  # Works in graph mode
    # Use tf.debugging.assert_* for runtime checks
    tf.debugging.assert_positive(x)
    return x * 2
# Disable tf.function for debugging
@tf.function
def buggy_function(x):
    # Temporarily remove the @tf.function decorator
    # or use tf.config.run_functions_eagerly(True)
    return x
tf.print() for Graph Mode
@tf.function
def compute(x):
    # Regular print won't work in graph mode
    tf.print("Shape:", tf.shape(x))
    tf.print("Values:", x, summarize=-1)  # -1 for all values
    tf.print("Stats - min:", tf.reduce_min(x),
             "max:", tf.reduce_max(x),
             "mean:", tf.reduce_mean(x))
    return x * 2
Memory Profiler
# Profile memory usage: enable growth for each GPU first
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
# Use TensorFlow Profiler
with tf.profiler.experimental.Profile('/tmp/logdir'):
    model.fit(x_train, y_train, epochs=1)
# Check memory info
tf.config.experimental.get_memory_info('GPU:0')
# Returns: {'current': bytes, 'peak': bytes}
The Four Phases of TensorFlow Debugging
Phase 1: Reproduce and Isolate
- Create minimal reproduction
# Minimal test case
import tensorflow as tf
# Smallest possible model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(5,))
])
# Synthetic data
x = tf.random.normal((32, 5))
y = tf.random.normal((32, 10))
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=1)
- Enable eager execution for line-by-line debugging
tf.config.run_functions_eagerly(True)
- Add assertions at key points
def debug_forward_pass(model, x):
    for i, layer in enumerate(model.layers):
        x = layer(x)
        tf.debugging.check_numerics(x, f"Layer {i} output")
        print(f"Layer {i}: {x.shape}, range=[{tf.reduce_min(x):.3f}, {tf.reduce_max(x):.3f}]")
    return x
Phase 2: Analyze and Understand
- Inspect tensor shapes throughout the pipeline
def trace_shapes(model, x):
    shapes = []
    for layer in model.layers:
        x = layer(x)
        shapes.append((layer.name, x.shape))
    return shapes
- Check gradient flow
def analyze_gradients(model, x, y, loss_fn):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    analysis = []
    for var, grad in zip(model.trainable_variables, grads):
        if grad is None:
            analysis.append((var.name, "NONE - disconnected"))
        else:
            norm = tf.norm(grad).numpy()
            analysis.append((var.name, f"norm={norm:.6f}"))
    return analysis
- Profile performance
# Use tf.profiler
tf.profiler.experimental.start('/tmp/logdir')
model.fit(x, y, epochs=1)
tf.profiler.experimental.stop()
Phase 3: Fix and Verify
- Apply targeted fixes based on diagnosis
  - Shape issues: Add explicit reshapes and assertions
  - NaN issues: Add epsilon, reduce learning rate, clip gradients
  - OOM issues: Reduce batch size, enable memory growth
  - GPU issues: Check CUDA compatibility, install correct packages
- Verify fix doesn’t break other functionality
# Run comprehensive tests
def test_model_components():
    # Test forward pass
    output = model(sample_input)
    assert output.shape == expected_shape
    # Test backward pass
    with tf.GradientTape() as tape:
        loss = loss_fn(model(x), y)
    grads = tape.gradient(loss, model.trainable_variables)
    assert all(g is not None for g in grads)
    # Test save/load
    model.save('/tmp/test_model')
    loaded = tf.keras.models.load_model('/tmp/test_model')
    assert tf.reduce_all(model(x) == loaded(x))
Phase 4: Prevent and Document
- Add permanent assertions for critical invariants
class RobustModel(tf.keras.Model):
    def call(self, x, training=False):
        tf.debugging.assert_shapes([(x, ('batch', 'features'))])
        x = self.layer1(x)
        tf.debugging.check_numerics(x, "After layer1")
        return self.output_layer(x)
- Set up monitoring callbacks
class NanCallback(tf.keras.callbacks.Callback):
    def on_batch_end(self, batch, logs=None):
        if logs and tf.math.is_nan(logs.get('loss', 0)):
            self.model.stop_training = True
            raise ValueError(f"NaN detected at batch {batch}")
- Document the issue and solution
# BUGFIX: Shape mismatch in attention layer
# Issue: Input was (batch, seq, features) but attention expected (batch, heads, seq, features)
# Solution: Added reshape before attention layer
x = tf.reshape(x, [batch_size, num_heads, seq_len, -1])
Quick Reference Commands
Device and Configuration
# List devices
tf.config.list_physical_devices()
tf.config.list_physical_devices('GPU')
# GPU memory growth
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
# Force CPU execution
with tf.device('/CPU:0'):
    result = model(x)
# Check if built with CUDA
tf.test.is_built_with_cuda()
Debugging Assertions
# Numeric checks
tf.debugging.check_numerics(tensor, message)
tf.debugging.enable_check_numerics()
# Shape checks
tf.debugging.assert_shapes([(tensor, shape_tuple)])
tf.ensure_shape(tensor, shape)
# Value checks
tf.debugging.assert_positive(tensor)
tf.debugging.assert_non_negative(tensor)
tf.debugging.assert_near(a, b, rtol=1e-5)
tf.debugging.assert_equal(a, b)
tf.debugging.assert_less(a, b)
tf.debugging.assert_greater(a, b)
Profiling and Logging
# TensorBoard logging
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='./logs',
    histogram_freq=1
)
# Start profiler
tf.profiler.experimental.start('/tmp/logdir')
# ... code ...
tf.profiler.experimental.stop()
# Debug info for TensorBoard Debugger V2
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2',
    tensor_debug_mode='FULL_HEALTH'
)
Memory Management
# Clear session
tf.keras.backend.clear_session()
# Get memory info
tf.config.experimental.get_memory_info('GPU:0')
# Mixed precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')
Gradient Debugging
# Inspect gradients
with tf.GradientTape() as tape:
    loss = compute_loss()
gradients = tape.gradient(loss, model.trainable_variables)
# Clip gradients
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
# Check for None gradients (disconnected graph)
for var, grad in zip(model.trainable_variables, gradients):
    if grad is None:
        print(f"Warning: {var.name} has no gradient")
Version Compatibility Reference
| TensorFlow | Python | CUDA | cuDNN |
|---|---|---|---|
| 2.16.x | 3.9-3.12 | 12.3 | 8.9 |
| 2.15.x | 3.9-3.11 | 12.2 | 8.9 |
| 2.14.x | 3.9-3.11 | 11.8 | 8.7 |
| 2.13.x | 3.8-3.11 | 11.8 | 8.6 |
| 2.12.x | 3.8-3.11 | 11.8 | 8.6 |
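For scripting, the table above can be encoded as a small lookup helper; the function name is illustrative, and the pairings come directly from the table:

```python
# (CUDA, cuDNN) required per TensorFlow minor version, from the table above.
CUDA_COMPAT = {
    "2.16": ("12.3", "8.9"),
    "2.15": ("12.2", "8.9"),
    "2.14": ("11.8", "8.7"),
    "2.13": ("11.8", "8.6"),
    "2.12": ("11.8", "8.6"),
}

def required_cuda(tf_version: str) -> tuple:
    """Return the (CUDA, cuDNN) pair required for a given TF version string."""
    minor = ".".join(tf_version.split(".")[:2])  # "2.15.1" -> "2.15"
    return CUDA_COMPAT.get(minor, ("unknown", "unknown"))
```

A quick check such as `required_cuda(tf.__version__)` against `nvidia-smi` output can catch mismatches before a long training run fails.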