3d-cv-labeling-2026
npx skills add https://github.com/majiayu000/claude-skill-registry --skill 3d-cv-labeling-2026
3D Computer Vision Labeling Expert (2026)
Expert guidance on 3D annotation tools, AI-assisted labeling workflows, and training architectures for LiDAR/point cloud computer vision in autonomous vehicles, robotics, infrastructure inspection, and geospatial applications.
When to Use This Skill
✅ Use for:
- Selecting 3D point cloud annotation tools (BasicAI, Supervisely, Segments.ai, Deepen AI)
- Implementing SAM4D/Point-SAM for auto-labeling workflows
- Designing human-in-the-loop annotation pipelines
- Sensor fusion annotation (camera + LiDAR + radar)
- Training architecture decisions: specialized models vs VLMs
- Vertical-specific 3D detection (autonomous driving, inspection, agriculture, wildfire)
❌ NOT for:
- 2D image labeling without 3D context (use clip-aware-embeddings or Label Studio docs)
- General ML model training (use ml-engineer)
- Video annotation without point clouds (use computer-vision-pipeline)
- VLM prompt engineering (use prompt-engineer)
- Photogrammetry/3D reconstruction (use geo processing tools)
2026 Tool Landscape Overview
Commercial Leaders
| Tool | Strength | Best For | Key AI Feature |
|---|---|---|---|
| BasicAI | One-click detection | Autonomous driving | Pre-labeling models fine-tuned for AV |
| Supervisely | Customization | R&D teams | AI tracking, 2D→3D single-click |
| Segments.ai | 2D+3D sync | Robotics perception | Sequential propagation |
| Deepen AI | Sensor calibration | In-house perception | Pixel-perfect multi-sensor |
| Dataloop | Enterprise MLOps | Large annotation teams | Model-assisted + Point Cloud Focus |
| Encord | Full workflow | Multi-modal projects | Track-ID management |
| Ango Hub (iMerit) | Dense annotation | Complex multi-modal | Frame-to-frame propagation |
Open Source Options
| Tool | Maturity | Notes |
|---|---|---|
| CVAT | Stable | 3D bounding boxes only, limited interpolation |
| 3D BAT | Good | Full-surround annotation, semi-automatic tracking |
| Label Studio | Partial 3D | Better for multi-format work; not specialized for 3D |
SAM Evolution for 3D (2024-2026)
SAM4D (ICCV 2025) – Multi-Modal + Temporal
Key innovation: Unified Multi-modal Positional Encoding (UMPE) aligns camera and LiDAR in shared 3D space.
Camera Stream → Feature Extraction ──┐
                                     ├─→ UMPE Alignment → Promptable 3D Segmentation
LiDAR Stream  → Point Encoding ──────┘
Data engine breakthrough: Automatic pseudo-label generation, 100x+ faster than human annotation, using:
- VFM-driven video masklets
- Spatiotemporal 4D reconstruction
- Cross-modal masklet fusion
Dataset: Waymo-4DSeg (300k+ camera-LiDAR aligned masklets)
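As a toy illustration of the idea behind a shared positional space (SAM4D's actual UMPE is a learned encoding; everything below is an assumption for illustration), a fixed sinusoidal encoding over 3D coordinates can embed both camera rays and LiDAR points once they are expressed in the same frame:

```python
import numpy as np

def sinusoidal_encoding_3d(points, num_freqs=4):
    """Encode (N, 3) xyz coordinates with sinusoidal frequencies.

    Illustrative only: SAM4D's real UMPE is learned. This just shows
    how any 3D position (from either modality) maps to one shared
    embedding space once both sensors are expressed in 3D coordinates.
    """
    points = np.asarray(points, dtype=np.float64)   # (N, 3)
    freqs = 2.0 ** np.arange(num_freqs)             # (F,) geometric frequencies
    # (N, 3, F) phase values, then sin/cos -> (N, 3 * F * 2)
    phases = points[:, :, None] * freqs[None, None, :]
    enc = np.concatenate([np.sin(phases), np.cos(phases)], axis=-1)
    return enc.reshape(points.shape[0], -1)

pts = np.array([[1.0, 2.0, 0.5]])
emb = sinusoidal_encoding_3d(pts, num_freqs=4)
print(emb.shape)  # (1, 24)
```

Because the encoding depends only on position, a camera feature and a LiDAR feature at the same 3D location receive the same positional signal, which is the property the alignment relies on.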
Point-SAM (ICLR 2025) – Native 3D Prompting
Architecture: Efficient transformer designed specifically for point clouds (not adapted from 2D).
Knowledge distillation: 2D SAM → 3D Point-SAM via a data engine that generates:
- Part-level pseudo-labels
- Object-level pseudo-labels
Benchmarks: Outperforms state-of-the-art on indoor (ScanNet) and outdoor (nuScenes, Waymo) datasets.
SAMNet++ (2025) – Hybrid Pipeline
Two-stage approach:
- SAM performs unsupervised segmentation
- Adapted PointNet++ refines for semantic accuracy
Best for: UAV/drone workflows where colorized point clouds from L1 LiDAR + RGB cameras are available.
Human-in-the-Loop Architecture
The Model-in-the-Loop Paradigm (2023-2026)
Old approach: Human labels → Train model → Deploy
New approach: Model assists → Human validates → Rapid iteration
┌──────────────────────────────────────────────────────┐
│                  LABELING PIPELINE                   │
├──────────────────────────────────────────────────────┤
│  Raw Data → AI Pre-label → Human Review → QA Check   │
│                  │              │            │       │
│             SAM4D/VLM      Corrections   Consensus   │
│             generates      only where    sampling    │
│             proposals      AI uncertain              │
└──────────────────────────────────────────────────────┘
Efficiency Gains
| Approach | Time for 10k frames | Annotation Quality |
|---|---|---|
| Manual only | 400 hours | 95% (expert) |
| AI pre-label + review | 50 hours | 97% (AI+human) |
| SAM4D data engine | 4 hours | 92% (pseudo) |
The 80/20 rule: ~80% of ML project time is data prep. Model-in-the-loop cuts this dramatically.
Quality Assurance Strategies
- Consensus sampling: Multiple annotators on subset, measure agreement
- Active learning: Route uncertain predictions to experts
- Tiered review: Tier 1 (critical objects) get SME validation, Tier 2/3 use AI confidence thresholds
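The tiered-review rule above can be sketched as a simple routing function (the `route_annotation` helper and its thresholds are illustrative, not taken from any specific tool):

```python
def route_annotation(pred_confidence, tier):
    """Route a pre-labeled object to human review or auto-accept.

    Tier 1 (safety-critical) always gets human review: its threshold
    is above any possible confidence. Tiers 2/3 use confidence cutoffs.
    All thresholds here are made-up examples.
    """
    thresholds = {1: 1.1, 2: 0.9, 3: 0.8}
    if pred_confidence >= thresholds[tier]:
        return "auto_accept"
    return "human_review"

print(route_annotation(0.95, tier=1))  # human_review (Tier 1 never auto-accepts)
print(route_annotation(0.95, tier=2))  # auto_accept
print(route_annotation(0.85, tier=2))  # human_review
```

In practice the thresholds would be calibrated per class against held-out annotator agreement, not hard-coded.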
Why Specialized Training > VLMs for 3D
The Core Trade-off
| Aspect | Specialized (YOLO, PointPillars) | VLMs (GPT-4V, Gemini) |
|---|---|---|
| Latency | 10-50ms (real-time) | 500-2000ms |
| 3D precision | Strong geometric priors | Noisy text-3D alignment |
| Novel objects | Closed-set (what you train) | Open-vocabulary |
| Compute | Edge-deployable | GPU cluster required |
| Hallucinations | Minimal (closed-set outputs) | Yes (safety-critical risk) |
| Domain shift | Struggles (fog, night) | Better generalization |
When to Use Each
Use Specialized Models When:
- Real-time inference required (autonomous vehicles, robotics)
- Known object classes (infrastructure defects, crop types)
- Safety-critical deployment (can’t tolerate hallucinations)
- Edge deployment (drones, embedded systems)
Use VLMs/Foundation Models When:
- Zero-shot exploration of new domains
- Generating training data (weak labels)
- Open-vocabulary requirements (“find anything damaged”)
- Domain adaptation bootstrapping
The Hybrid Architecture (2025+ Best Practice)
┌─────────────────────────┐
│    VLM (Slow Brain)     │
│  • Scene understanding  │
│  • Open vocabulary      │
│  • Anomaly detection    │
└────────────┬────────────┘
             │ High-level context
             ▼
┌──────────────────────────────────────────────────────────┐
│            Specialized Detector (Fast Brain)             │
│  • Real-time inference (YOLO, PointPillars, CenterPoint) │
│  • Known object detection & tracking                     │
│  • Safety-critical decisions                             │
└──────────────────────────────────────────────────────────┘
Examples:
- VOLTRON: YOLOv8 + LLaMA2 for hazard identification
- DrivePI: Point clouds + multi-view + language instructions (0.5B Qwen2.5)
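The slow-brain/fast-brain split can be sketched as a dispatch loop (the `fast_brain` and `slow_brain` stubs below are hypothetical stand-ins; VOLTRON's and DrivePI's actual internals differ):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float

def fast_brain(frame):
    """Stand-in for a real-time detector (YOLO/CenterPoint). Returns
    canned detections here; a real one runs every frame at 10-50ms."""
    return [Detection("vehicle", 0.97), Detection("unknown", 0.31)]

def slow_brain(frame, uncertain_indices):
    """Stand-in for a VLM queried only on low-confidence detections,
    off the critical path. Returns canned open-vocabulary labels."""
    return {i: "construction debris" for i in uncertain_indices}

def hybrid_step(frame, vlm_threshold=0.5):
    detections = fast_brain(frame)                      # every frame
    uncertain = [i for i, d in enumerate(detections)    # defer the rest
                 if d.confidence < vlm_threshold]
    context = slow_brain(frame, uncertain) if uncertain else {}
    return detections, context

dets, ctx = hybrid_step(frame=None)
print(ctx)  # {1: 'construction debris'}
```

The key design point is that the VLM call is off the safety-critical path: the fast brain's output is usable immediately, and the slow brain only enriches it.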
Vertical-Specific Training Architecture
Infrastructure Inspection
Objects: Utility poles, insulators, conductors, vegetation, damage types
Sensor fusion: RGB + thermal + LiDAR
Training data needs:
- Thermal anomaly samples (varied temperatures)
- Damage taxonomy (cracks, corrosion, rust grades)
- Vegetation clearance measurements
Architecture:
LiDAR   → Point cloud encoder ──┐
Thermal → 2D encoder ───────────┼─→ Fusion → Multi-task head
RGB     → 2D encoder ───────────┘               ├─ Object detection
                                                ├─ Defect classification
                                                └─ Clearance regression
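As a toy example of the clearance-regression target, the minimum distance from vegetation points to a conductor (modeled as a line segment) can be computed directly from the point cloud; the function name and geometry are illustrative, not from any production pipeline:

```python
import numpy as np

def clearance_to_conductor(veg_points, a, b):
    """Minimum distance from vegetation points (N, 3) to the conductor
    segment from a to b. A toy version of the clearance measurement the
    regression head above would be trained to predict."""
    veg = np.asarray(veg_points, float)
    a, b = np.asarray(a, float), np.asarray(b, float)
    ab = b - a
    # Project each point onto the segment, clamped to its endpoints.
    t = np.clip((veg - a) @ ab / (ab @ ab), 0.0, 1.0)
    closest = a + t[:, None] * ab
    return np.linalg.norm(veg - closest, axis=1).min()

veg = [[0.0, 1.0, 5.0], [5.0, 3.0, 5.0]]
d = clearance_to_conductor(veg, a=[0, 0, 10], b=[10, 0, 10])
print(round(d, 3))  # 5.099 (sqrt(26): 1 m lateral, 5 m vertical)
```

Training on such geometric labels is exactly what a VLM cannot replace: the target is a continuous measurement, not a description.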
Autonomous Driving
Objects: Vehicles, pedestrians, cyclists, traffic signs, lane markings
Key requirement: Temporal consistency (track-IDs across frames)
Training data needs:
- Long-tail scenarios (emergency vehicles, animals, debris)
- Adverse weather (fog, rain, snow, night)
- Edge cases (construction zones, accidents)
Architecture: CenterPoint, PointPillars, or Voxel-based detectors with BEV (Bird’s Eye View) representation.
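A minimal sketch of the BEV step, a simplified occupancy-count version of the pillarization detectors like PointPillars perform before their 2D backbone (ranges and cell size below are arbitrary assumptions):

```python
import numpy as np

def points_to_bev_occupancy(points, x_range=(0, 40), y_range=(-20, 20), cell=1.0):
    """Scatter (N, 3) LiDAR points into a 2D bird's-eye-view grid of
    per-cell point counts. Real pillar encoders keep per-point features
    per cell; this keeps only counts, to show the spatial binning."""
    pts = np.asarray(points, float)
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = np.zeros((nx, ny), dtype=np.int32)
    ix = ((pts[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)  # drop out-of-range
    np.add.at(grid, (ix[keep], iy[keep]), 1)              # count per cell
    return grid

pts = [[5.5, 0.2, -1.0], [5.7, 0.3, -0.8], [100.0, 0.0, 0.0]]  # last is out of range
bev = points_to_bev_occupancy(pts)
print(bev.sum(), bev[5, 20])  # 2 2
```

Once in this grid form, standard 2D convolutional detection heads apply, which is why BEV is the dominant representation for driving.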
Agriculture/Wildfire
Objects: Crop rows, canopy height, fuel load, fire spread boundaries
Sensor fusion: RGB + multispectral + LiDAR
Training data needs:
- Crop growth stages
- Disease/pest visual signatures
- Fuel load density from LiDAR CHM (Canopy Height Model)
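The CHM computation itself is a per-cell subtraction: digital surface model (top of canopy) minus digital terrain model (bare ground). A minimal sketch, assuming the LiDAR returns have already been rasterized into the two grids:

```python
import numpy as np

def canopy_height_model(dsm, dtm):
    """CHM = DSM (top-of-canopy surface) - DTM (bare-ground terrain),
    clamped at zero since negative heights are rasterization noise.
    Fuel-load estimates then aggregate CHM statistics per cell."""
    chm = np.asarray(dsm, float) - np.asarray(dtm, float)
    return np.clip(chm, 0.0, None)

dsm = [[12.0, 15.0], [10.0, 9.5]]   # toy 2x2 grids, heights in meters
dtm = [[10.0, 10.0], [10.0, 10.0]]
chm = canopy_height_model(dsm, dtm)
print(chm.tolist())  # [[2.0, 5.0], [0.0, 0.0]]
```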
Why not just VLM? VLMs can’t:
- Measure precise heights (LiDAR regression)
- Classify at hyperspectral wavelengths
- Maintain spatial precision for prescription maps
Common Anti-Patterns
Anti-Pattern: “Just Use SAM on Everything”
Novice thinking: “SAM segments anything, so I’ll just run it on my LiDAR data”
Reality:
- SAM 1/2 are 2D models; they don't understand 3D geometry
- Point clouds need Point-SAM or SAM4D specifically
- Raw application produces noisy masks without geometric priors
Correct approach: Use Point-SAM for native 3D, or project to 2D for SAM → lift back to 3D.
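The project-then-lift route can be sketched with standard pinhole geometry (the boolean `mask` below stands in for a SAM output, and the intrinsics and points are toy values, not from any real sensor):

```python
import numpy as np

def lift_mask_to_3d(points, K, mask):
    """Project LiDAR points (N, 3, camera frame) through intrinsics K
    and keep the points whose pixel lands inside the 2D mask. The mask
    would come from SAM; the projection is ordinary pinhole math."""
    pts = np.asarray(points, float)
    uvw = pts @ np.asarray(K, float).T            # homogeneous pixel coords
    in_front = uvw[:, 2] > 0                      # ignore points behind camera
    uv = (uvw[:, :2] / np.where(in_front, uvw[:, 2], 1.0)[:, None]).astype(int)
    h, w = mask.shape
    ok = in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    sel = ok.copy()
    sel[ok] = mask[uv[ok, 1], uv[ok, 0]]          # mask indexed (row=v, col=u)
    return pts[sel]

K = [[100, 0, 32], [0, 100, 32], [0, 0, 1]]       # toy intrinsics, 64x64 image
mask = np.zeros((64, 64), bool)
mask[30:40, 30:40] = True                         # pretend SAM segmented this region
pts = [[0.0, 0.0, 5.0], [2.0, 0.0, 5.0]]          # first projects into the mask
print(len(lift_mask_to_3d(pts, K, mask)))  # 1
```

The lifted points are still view-dependent (occluded geometry never gets a label), which is why native 3D prompting remains preferable when available.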
Anti-Pattern: Skipping Human Validation
Novice thinking: “AI pre-labels are 95% accurate, we can skip review”
Reality:
- 5% error on 100k objects = 5,000 wrong labels
- Errors compound in edge cases (exactly where you need accuracy)
- Model learns to reproduce annotation mistakes
Correct approach: Tier 1 (safety-critical) always human-validated. Use confidence thresholds for Tier 2/3.
Anti-Pattern: VLM for Real-Time Inference
Novice thinking: “GPT-4V can identify damage in my photos”
Reality:
- 500-2000ms latency per frame
- Can’t run on edge devices
- Hallucination risk in safety-critical contexts
Correct approach: Use VLM for data generation/exploration, specialized model for deployment.
Anti-Pattern: Single-Modal Training
Novice thinking: “LiDAR is enough for 3D detection”
Reality:
- LiDAR: Precise geometry, no color/texture
- Camera: Rich semantics, no depth
- Fusion outperforms single-modal by 5-15% mAP
Correct approach: Sensor fusion from day one. SAM4D shows fusion pseudo-labels > single-modal.
Decision Tree: Choosing Your Approach
Do you need real-time inference?
          /            \
        YES             NO
         |               |
 Use specialized    Is this exploration?
 detector (YOLO,       /         \
 CenterPoint)        YES          NO
         |            |            |
 Have labeled      Use VLM     Generate
 data?             for zero-   pseudo-labels
    /    \         shot        with SAM4D
  YES     NO
   |       |
 Train   Use SAM4D/
 model   Point-SAM for
         auto-labeling
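The same tree as a function (the name `choose_approach` and the flag names are illustrative):

```python
def choose_approach(real_time, have_labels=False, exploring=False):
    """Encode the decision tree above as explicit branching."""
    if real_time:
        if have_labels:
            return "train specialized detector (YOLO, CenterPoint)"
        return "auto-label with SAM4D/Point-SAM, then train detector"
    if exploring:
        return "use VLM for zero-shot"
    return "generate pseudo-labels with SAM4D"

print(choose_approach(real_time=True, have_labels=False))
print(choose_approach(real_time=False, exploring=True))
```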
Tool Selection Decision Matrix
| Requirement | Recommended Tool |
|---|---|
| Autonomous driving at scale | Deepen AI or BasicAI |
| R&D/research flexibility | Supervisely or Segments.ai |
| Multi-modal (camera+LiDAR+radar) | Ango Hub or Dataloop |
| Self-hosted/open source | CVAT + 3D plugins or 3D BAT |
| Robotics perception | Segments.ai (2D+3D sync) |
| Budget-conscious | Label Studio + custom scripts |
References
- /references/sam4d-architecture.md – Deep dive on SAM4D UMPE and data engine
- /references/tool-comparison-matrix.md – Detailed feature comparison of all tools
- /references/hybrid-architecture-examples.md – VOLTRON, DrivePI implementation patterns
- /references/vertical-training-recipes.md – Infrastructure, AV, agriculture specifics