gcp-bq-data-loading

📁 funnelenvy/agents_webinar_demos 📅 Jan 24, 2026
Total installs: 9
Weekly installs: 7
Site-wide rank: #31788

Install command
npx skills add https://github.com/funnelenvy/agents_webinar_demos --skill gcp-bq-data-loading

Installs by agent

claude-code 6
gemini-cli 4
antigravity 4
codex 4
opencode 4

Skill documentation

BigQuery Data Loading

Use this skill when importing data into BigQuery from various file formats and sources.

Basic Load Command

bq load \
  --location=LOCATION \
  --source_format=FORMAT \
  PROJECT:DATASET.TABLE \
  SOURCE_PATH \
  SCHEMA
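
For orientation, a concrete invocation of this template might look like the following sketch (the project, dataset, and bucket names are hypothetical placeholders):

bq load \
  --location=US \
  --source_format=CSV \
  --skip_leading_rows=1 \
  my-project:sales.transactions \
  gs://my-bucket/transactions.csv \
  customer_id:STRING,amount:FLOAT,date:DATE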

Loading from CSV

Basic CSV Load

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT,date:DATE

CSV with Schema File

# Create schema.json
echo '[
  {"name": "customer_id", "type": "STRING"},
  {"name": "amount", "type": "FLOAT"},
  {"name": "date", "type": "DATE"}
]' > schema.json

# Load with schema file
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  ./schema.json

CSV Options

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --field_delimiter=',' \
  --quote='"' \
  --allow_quoted_newlines \
  --allow_jagged_rows \
  --max_bad_records=100 \
  --null_marker='NULL' \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Key flags:

  • --skip_leading_rows=N – Skip header rows
  • --field_delimiter=',' – Column separator (default: comma)
  • --allow_quoted_newlines – Allow newlines in quoted fields
  • --allow_jagged_rows – Allow rows with missing fields
  • --max_bad_records=N – Tolerate N parsing errors
  • --null_marker='NULL' – String representing NULL values

CSV with Auto-detect

bq load \
  --source_format=CSV \
  --autodetect \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv

Warning: Auto-detect is convenient but not recommended for production, because the detected schema can change between loads.
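
If you do use auto-detect during exploration, one way to pin the result down for production is to export the detected schema and reuse it as an explicit schema file on later loads (table, bucket, and file names below are illustrative):

# Export the schema BigQuery detected into a reusable schema file
bq show --schema --format=prettyjson dataset.table > schema.json

# Later loads reuse the pinned schema instead of --autodetect
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  ./schema.json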

Loading from JSON

Newline-Delimited JSON

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  dataset.table \
  gs://bucket/data.json \
  customer_id:STRING,amount:FLOAT,date:DATE

Required format (one JSON object per line):

{"customer_id": "C001", "amount": 99.99, "date": "2024-01-15"}
{"customer_id": "C002", "amount": 149.99, "date": "2024-01-15"}

NOT standard JSON array:

// ❌ This won't work
[
  {"customer_id": "C001", "amount": 99.99},
  {"customer_id": "C002", "amount": 149.99}
]
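
If your source is a standard JSON array, one common fix, assuming jq is available, is to convert it to newline-delimited JSON before loading:

# Emit one compact JSON object per line (NDJSON)
jq -c '.[]' data_array.json > data.ndjson

# data.ndjson can then be loaded with --source_format=NEWLINE_DELIMITED_JSON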

JSON with Auto-detect

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  dataset.table \
  gs://bucket/data.json

JSON with Nested Fields

Schema with nested STRUCT:

[
  {"name": "customer_id", "type": "STRING"},
  {"name": "address", "type": "RECORD", "fields": [
    {"name": "street", "type": "STRING"},
    {"name": "city", "type": "STRING"},
    {"name": "zip", "type": "STRING"}
  ]},
  {"name": "orders", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "order_id", "type": "STRING"},
    {"name": "amount", "type": "FLOAT"}
  ]}
]
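
For reference, a newline-delimited record matching this schema, and a load command that uses it as a schema file, might look like this (dataset, bucket, and file names are illustrative):

{"customer_id": "C001", "address": {"street": "1 Main St", "city": "Springfield", "zip": "12345"}, "orders": [{"order_id": "O1", "amount": 49.99}, {"order_id": "O2", "amount": 19.99}]}

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  dataset.customers \
  gs://bucket/customers.json \
  ./nested_schema.json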

Loading from Avro

Basic Avro Load

bq load \
  --source_format=AVRO \
  dataset.table \
  gs://bucket/data.avro

Key benefit: Schema is auto-detected from Avro metadata. No schema specification needed!

Avro with Wildcards

bq load \
  --source_format=AVRO \
  dataset.table \
  "gs://bucket/path/to/*.avro"

Note: Use quotes around wildcard paths.

Avro Advantages

  • Self-describing: Schema embedded in file
  • Efficient: Compact binary format
  • Type-safe: Strong typing preserved
  • No schema drift: Schema always matches data

Loading from Parquet

Basic Parquet Load

bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.parquet

Like Avro: Schema auto-detected from Parquet metadata.

Parquet with Compression

bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.snappy.parquet

Supported compression: SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD

Loading Strategies

Append Data (Default)

bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/new_data.csv \
  schema.json

Behavior: Adds rows to existing table.

Replace Table

bq load \
  --source_format=CSV \
  --replace \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Behavior: Deletes all existing data, loads new data.

Overwrite Partition

bq load \
  --source_format=CSV \
  --replace \
  --time_partitioning_field=date \
  dataset.table\$20240115 \
  gs://bucket/data_20240115.csv \
  schema.json

Syntax: TABLE$YYYYMMDD targets a specific partition.

Behavior: Replaces only that partition, leaves others intact.

Append or Skip

# Skip load if table already has data
bq load \
  --source_format=CSV \
  --if_not_exists \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Loading from Cloud Storage

Single File

bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Multiple Files (List)

bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/file1.csv,gs://bucket/file2.csv,gs://bucket/file3.csv \
  schema.json

Wildcard Pattern

bq load \
  --source_format=CSV \
  dataset.table \
  "gs://bucket/path/data-*.csv" \
  schema.json

Patterns:

  • gs://bucket/*.csv – All CSV files under the bucket root
  • gs://bucket/2024/data-*.csv – Files under a prefix whose names match the pattern

Note: BigQuery supports only a single * wildcard per URI; it can appear inside or at the end of the object name, and character classes such as [0-9] are not supported.

Loading from Local Files

Local CSV

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  /path/to/local/data.csv \
  schema.json

Limitation: Files are uploaded first, then loaded. Slower for large files. Use GCS for better performance.
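
For larger files, a common pattern is to stage the file in Cloud Storage first and then load from the gs:// URI; a minimal sketch (bucket and path names are illustrative):

# Stage the file in Cloud Storage, then load from the gs:// URI
gsutil cp /path/to/local/data.csv gs://bucket/staging/data.csv

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/staging/data.csv \
  schema.json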

Schema Handling

Inline Schema

bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT64,order_date:DATE,active:BOOLEAN

Schema File

[
  {"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "amount", "type": "FLOAT64"},
  {"name": "order_date", "type": "DATE"},
  {"name": "metadata", "type": "JSON"}
]

Modes:

  • REQUIRED – Field must have value
  • NULLABLE – Field can be NULL (default)
  • REPEATED – Field is an array
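
A small schema file exercising all three modes might look like this (field names are illustrative):

[
  {"name": "order_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "coupon_code", "type": "STRING", "mode": "NULLABLE"},
  {"name": "tags", "type": "STRING", "mode": "REPEATED"}
]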

Auto-detect Schema

bq load \
  --source_format=CSV \
  --autodetect \
  dataset.table \
  gs://bucket/data.csv

Pros:

  • Quick for exploration
  • No schema definition needed

Cons:

  • Not reliable for production
  • Schema may change between loads
  • May misinterpret types

Schema Evolution

Add new columns:

bq load \
  --source_format=CSV \
  --schema_update_option=ALLOW_FIELD_ADDITION \
  dataset.table \
  gs://bucket/data_with_new_column.csv \
  schema.json

Relax REQUIRED columns to NULLABLE:

bq load \
  --schema_update_option=ALLOW_FIELD_RELAXATION \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Compression

Compressed Files

BigQuery automatically detects compression:

# Gzip compressed CSV
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv.gz \
  schema.json

Supported: gzip for CSV and JSON files. Avro, Parquet, and ORC are instead compressed internally at the block level (for example DEFLATE or Snappy), so whole-file compression is not used for them.

Performance note: Uncompressed files load faster (parallel processing). Use compression only if network/storage is bottleneck.

Error Handling

Allow Bad Records

bq load \
  --source_format=CSV \
  --max_bad_records=1000 \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Behavior: Skip up to 1000 rows with errors, load the rest.

Ignore Unknown Values

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  dataset.table \
  gs://bucket/data.json \
  schema.json

Behavior: Ignore JSON fields not in schema.

Validate Only (Dry Run)

bq load \
  --dry_run \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json

Behavior: Validate schema and format without loading data.

Loading to Partitioned Tables

Time-Partitioned Table

bq load \
  --source_format=CSV \
  --time_partitioning_field=order_date \
  --time_partitioning_type=DAY \
  dataset.partitioned_orders \
  gs://bucket/orders_2024.csv \
  order_id:STRING,customer_id:STRING,order_date:DATE,amount:FLOAT

Load Specific Partition

# Load into 2024-01-15 partition
bq load \
  --source_format=CSV \
  dataset.orders\$20240115 \
  gs://bucket/orders_20240115.csv \
  schema.json

Range-Partitioned Table

bq load \
  --source_format=CSV \
  --range_partitioning=customer_id,0,1000,100 \
  dataset.range_partitioned \
  gs://bucket/data.csv \
  customer_id:INTEGER,amount:FLOAT

Performance Optimization

Parallel Loading

BigQuery automatically parallelizes loads from:

  • Multiple files with wildcards
  • Uncompressed files (internal parallelization)

Optimal: split large files into chunks of roughly 1 GB, as sketched below.
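
A sketch of one way to split and stage a large CSV, assuming a POSIX shell with coreutils and an illustrative bucket name (the header is split off first, so the chunks need no --skip_leading_rows):

# Separate the header so each chunk contains only data rows
head -n 1 big.csv > header.csv
tail -n +2 big.csv | split -C 1G - chunk_

# Upload the chunks and load them all with a single wildcard URI
gsutil -m cp chunk_* gs://bucket/chunks/
bq load \
  --source_format=CSV \
  dataset.table \
  "gs://bucket/chunks/chunk_*" \
  schema.json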

File Format Recommendations

Best performance:

  1. Avro – Fastest, self-describing, splittable
  2. Parquet – Fast, columnar, good compression
  3. CSV uncompressed – Good for parallel processing
  4. JSON – Flexible but slower

Avoid:

  • Very large single files (>10GB)
  • CSV with complex escaping
  • Highly compressed files (limits parallelism)

Monitoring Loads

Check Load Jobs

bq ls --jobs --max_results=10

Job Details

bq show -j JOB_ID

Failed Loads

SELECT
  job_id,
  user_email,
  error_result.message as error_message,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'LOAD'
  AND state = 'DONE'
  AND error_result IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY creation_time DESC;

Common Patterns

Daily Batch Load

#!/bin/bash
DATE=$(date +%Y%m%d)
bq load \
  --source_format=CSV \
  --replace \
  dataset.daily_data\$$DATE \
  gs://bucket/data_$DATE.csv \
  schema.json

Incremental Load with Deduplication

# Load to staging
bq load \
  --source_format=CSV \
  dataset.staging_orders \
  gs://bucket/new_orders.csv \
  schema.json

# Merge to production (dedup)
bq query --use_legacy_sql=false '
MERGE `project.dataset.orders` T
USING `project.dataset.staging_orders` S
ON T.order_id = S.order_id
WHEN NOT MATCHED THEN INSERT ROW
'
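
After the MERGE, the staging table is typically cleared so the next increment starts from empty; a minimal way to do that, assuming the same table names, is:

bq query --use_legacy_sql=false '
TRUNCATE TABLE `project.dataset.staging_orders`
'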

Troubleshooting

“Too many errors”

Problem: The number of bad rows exceeds --max_bad_records
Solution: Increase --max_bad_records or fix the data quality issue

“Schema mismatch”

Problem: CSV columns don’t match the schema
Solution: Verify column order and count

“Invalid CSV format”

Problem: Unescaped quotes or newlines
Solution: Use --allow_quoted_newlines

“Permission denied”

Problem: No access to the GCS bucket
Solution: Grant the BigQuery service account the Storage Object Viewer role
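
A minimal sketch of granting that access, assuming the load reads the bucket as a service account (the account and bucket names are illustrative):

# Grant read access on the source bucket to the loading identity
gsutil iam ch \
  serviceAccount:loader@my-project.iam.gserviceaccount.com:objectViewer \
  gs://bucket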

Quick Reference

Format priorities (fastest first):

  1. Avro (splittable, self-describing)
  2. Parquet (columnar, efficient)
  3. CSV uncompressed (parallel)
  4. JSON (flexible)

Schema strategies:

  • Production: Explicit schema file
  • Development: Auto-detect
  • Evolution: Allow field addition/relaxation

Loading strategies:

  • New data: Append (default)
  • Replace all: --replace
  • Replace partition: TABLE$YYYYMMDD
  • Incremental: Load staging → MERGE