
Dataset Augmentation Tutorial

This tutorial covers full dataset augmentation: creating an entirely new dataset that contains multiple augmented versions of each episode. Unlike episode augmentation, which modifies the original files in place, dataset augmentation writes a new dataset and leaves the original untouched.

Overview

Dataset augmentation allows you to:

  • Scale your data: Create 2x, 5x, 10x larger datasets automatically
  • Preserve originals: Keep original episodes alongside augmented versions
  • Support research: Create reproducible experimental datasets with seeds
  • Handle all formats: Works with LeRobot v1.0, v1.1, v2.0, v2.1+ automatically

Quick Start

Basic Dataset Augmentation

Create a dataset with 2 augmented versions per episode:

dataphy augment dataset \
  --dataset-path ./original_data \
  --output-path ./augmented_data \
  --config examples/dataset_augmentation_config.yaml \
  --num-augmented 2

Result Structure:

augmented_data/
├── videos/chunk-001/observation.images.webcam/
│   ├── episode_000000.mp4          # Original (if preserved)
│   ├── episode_000000_aug_001.mp4  # Augmented version 1
│   ├── episode_000000_aug_002.mp4  # Augmented version 2
│   ├── episode_000001.mp4
│   ├── episode_000001_aug_001.mp4
│   └── episode_000001_aug_002.mp4
├── data/chunk-001/
│   ├── episode_000000.parquet
│   ├── episode_000000_aug_001.parquet
│   ├── episode_000000_aug_002.parquet
│   └── ...
└── dataset_metadata.json           # Augmentation info

Step-by-Step Guide

Step 1: Prepare Your Configuration

Create a comprehensive augmentation configuration:

research_augmentation.yaml
# Dataset Augmentation Configuration
pipeline:
  sync_views: true # Synchronized across all cameras

  steps:
    # Spatial augmentations for robustness
    - name: random_crop_pad
      type: RandomCropPad
      params:
        keep_ratio_min: 0.85
        probability: 0.8

    - name: random_translate
      type: RandomTranslate
      params:
        px: 8
        probability: 0.6

    # Photometric augmentations for lighting variations
    - name: color_jitter
      type: ColorJitter
      params:
        magnitude: 0.2
        probability: 0.9

    # Texture and occlusion augmentations
    - name: random_conv
      type: RandomConv
      params:
        kernel_variance: 0.05
        probability: 0.4

    - name: cutout
      type: Cutout
      params:
        holes: 2
        size_range: [16, 32]
        probability: 0.3

# Global settings for reproducibility
settings:
  device: auto
  seed: 42
  deterministic: true
  output_quality: 95

# Dataset configuration
dataset:
  preserve_original: true
  update_metadata: true

# Progress tracking
logging:
  level: INFO
  progress_bar: true
  continue_on_error: true

Domain Randomization Configuration

For maximum realism and robustness, include domain randomization transforms:

domain_randomization_config.yaml
pipeline:
  sync_views: true
  steps:
    # Standard transforms
    - name: random_crop_pad
      keep_ratio_min: 0.88
      probability: 0.8

    - name: color_jitter
      magnitude: 0.15
      probability: 0.9

    # Domain randomization transforms
    - name: lighting_rand
      p: 0.6
      ambient_tint: [0.95, 1.05]
      directional_intensity: [0.0, 0.4]
      preserve_robot_color: true

    - name: camera_intrinsics_jitter
      p: 0.5
      fx_jitter: [0.98, 1.02]
      cx_jitter_px: [-4, 4]
      update_intrinsics: true

    - name: camera_extrinsics_jitter
      p: 0.4
      rot_deg: [-3.0, 3.0]
      transl_px: [-6, 6]
      same_delta_per_seq: true

    - name: rgb_sensor_noise
      p: 0.4
      shot_k: [0.5, 1.5]
      read_sigma: [0.002, 0.01]
      iso_range: [100, 800]

settings:
  device: auto
  seed: 42
  deterministic: true

Step 2: Inspect Your Source Dataset

First, understand your dataset structure:

# Get dataset information
dataphy dataset load --dataset-path ./source_data --info

# List episodes to see what's available
dataphy dataset load --dataset-path ./source_data --list-episodes

Example output:

Dataset Information:
  Format: lerobot
  Episodes: 25
  Total timesteps: 12,500
  Features: ['observation.images.webcam', 'observation.images.laptop', 'action']

Available Episodes:
  0. episode_000000
  1. episode_000001
  2. episode_000002
  ...
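
If you prefer to inspect the dataset from Python, here is a minimal sketch that lists episodes by globbing the parquet files. It assumes the chunked data/chunk-*/episode_*.parquet layout shown earlier; adjust the pattern if your dataset differs.

from pathlib import Path

# List episodes by filename, assuming the chunked parquet layout shown above
dataset_root = Path("./source_data")
episodes = sorted(p.stem for p in dataset_root.glob("data/chunk-*/episode_*.parquet"))

print(f"Found {len(episodes)} episodes")
for name in episodes[:5]:
    print(" ", name)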

Step 3: Plan Your Augmentation Strategy

Choose your augmentation strategy based on your use case:

Research Workflow (Reproducible)

dataphy augment dataset \
  --dataset-path ./research_data \
  --output-path ./experiment_1_data \
  --config research_augmentation.yaml \
  --num-augmented 3 \
  --seed 42 \
  --preserve-original

Training Workflow (Maximum Data)

dataphy augment dataset \
  --dataset-path ./training_source \
  --output-path ./training_data \
  --config heavy_augmentation.yaml \
  --num-augmented 9 \
  --sync-views

Specific Episodes (Targeted)

dataphy augment dataset \
  --dataset-path ./data \
  --output-path ./specific_aug \
  --config light_augmentation.yaml \
  --episodes episode_000000,episode_000005,episode_000010 \
  --num-augmented 5

Camera-Specific (Single View)

dataphy augment dataset \
  --dataset-path ./data \
  --output-path ./webcam_only \
  --config webcam_augmentation.yaml \
  --cameras observation.images.webcam \
  --num-augmented 10

Step 4: Execute with Dry Run

Preview what will be created before running:

dataphy augment dataset \
  --dataset-path ./source_data \
  --output-path ./preview_data \
  --config research_augmentation.yaml \
  --num-augmented 3 \
  --dry-run

Example output:

Augmentation Plan:
  Source: ./source_data
  Output: ./preview_data
  Episodes to process: 25 / 25
  Augmented versions per episode: 3
  Camera streams: ALL
  Preserve originals: True
  Result: 100 total episodes

Dry run completed - no files were created
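
The planned episode count is simple arithmetic you can check yourself; a quick sketch using the numbers from the plan above:

# Expected size of the augmented dataset (matches the dry-run summary above)
original_episodes = 25
num_augmented = 3          # --num-augmented
preserve_original = True   # --preserve-original (default in this tutorial)

total = original_episodes * num_augmented + (original_episodes if preserve_original else 0)
print(total)  # 100 = 25 originals + 75 augmented versions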

Step 5: Execute Full Augmentation

Run the actual augmentation:

dataphy augment dataset \
  --dataset-path ./source_data \
  --output-path ./augmented_data \
  --config research_augmentation.yaml \
  --num-augmented 3 \
  --preserve-original

Monitor progress:

Starting dataset augmentation...
Detected LeRobot version: v2.0
Structure type: chunked_with_meta
Found cameras: ['observation.images.webcam', 'observation.images.laptop']

[4.0%] Processing episode episode_000000 (1/25)
  Creating episode_000000_aug_001 (1/75)
  Successfully created episode_000000_aug_001
  Creating episode_000000_aug_002 (2/75)
  Successfully created episode_000000_aug_002
  Creating episode_000000_aug_003 (3/75)
  Successfully created episode_000000_aug_003
  Episode episode_000000 completed successfully (3/3 augmentations)

[8.0%] Processing episode episode_000001 (2/25)
...

Dataset augmentation completed:
  Total episodes in new dataset: 100
  Successfully augmented: 75
  Failed augmentations: 0
  Success rate: 100.0%
  Metadata saved: ./augmented_data/dataset_metadata.json

Advanced Usage

LeRobot Version Compatibility

The system automatically detects and handles different LeRobot versions:

# v2.0 format (carpit680/giraffe_clean_desk2)
dataphy augment dataset \
  --dataset-path ./giraffe_data \
  --output-path ./giraffe_aug \
  --config config.yaml \
  --num-augmented 3

# Output: Detected LeRobot version: v2.0

# v2.1 format (lerobot/svla_so100_sorting)
dataphy augment dataset \
  --dataset-path ./svla_data \
  --output-path ./svla_aug \
  --config config.yaml \
  --num-augmented 5

# Output: Detected LeRobot version: v2.1
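
If you want to confirm the version yourself before running, LeRobot v2.x datasets record it in meta/info.json; a small sketch (older v1.x datasets typically do not ship this file):

import json
from pathlib import Path

# Peek at the version a LeRobot v2.x dataset reports in meta/info.json
info_path = Path("./giraffe_data") / "meta" / "info.json"
if info_path.exists():
    info = json.loads(info_path.read_text())
    print("codebase_version:", info.get("codebase_version", "unknown"))
else:
    print("No meta/info.json found - likely an older (v1.x) layout")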

Programmatic Usage

Use the Python API for more control:

dataset_augmentation_script.py
from dataphy.dataset.registry import create_dataset_loader
from dataphy.dataset.augmentor import DatasetAugmentor, AugmentationConfig

# Load dataset
loader = create_dataset_loader("./source_data")

# Create augmentor
augmentor = DatasetAugmentor(loader)

# Progress callback
def progress_callback(message, current, total):
    progress = (current / total) * 100 if total > 0 else 0
    print(f"[{progress:5.1f}%] {message}")

# Configuration
config = AugmentationConfig(
    pipeline_config="research_config.yaml",
    target="dataset",
    num_augmented_episodes=3,
    preserve_original=True,
    sync_views=True,
    random_seed=42,
    progress_callback=progress_callback
)

# Execute augmentation
results = augmentor.augment_full_dataset(config, "./augmented_data")

# Results
print(f"Created {results['total_new_episodes']} episodes")
print(f"Success rate: {results['augmentation_stats']['successful']/75*100:.1f}%")

Configuration Templates

Light Augmentation (Subtle Changes)

light_augmentation.yaml
pipeline:
  sync_views: true
  steps:
    - name: color_jitter
      type: ColorJitter
      params:
        magnitude: 0.1
        probability: 0.7
    - name: random_translate
      type: RandomTranslate
      params:
        px: 4
        probability: 0.3

settings:
  seed: 42
  output_quality: 98

Heavy Augmentation (Maximum Diversity)

heavy_augmentation.yaml
pipeline:
  sync_views: true
  steps:
    - name: random_crop_pad
      type: RandomCropPad
      params:
        keep_ratio_min: 0.7
        probability: 0.9
    - name: random_translate
      type: RandomTranslate
      params:
        px: 12
        probability: 0.8
    - name: color_jitter
      type: ColorJitter
      params:
        magnitude: 0.3
        probability: 1.0
    - name: random_conv
      type: RandomConv
      params:
        kernel_variance: 0.1
        probability: 0.6
    - name: cutout
      type: Cutout
      params:
        holes: 3
        size_range: [20, 80]
        probability: 0.4

settings:
  seed: null # Random seed each run

Quality Assurance

Verify Output Dataset

After augmentation, verify the results:

# Check dataset structure
dataphy dataset load --dataset-path ./augmented_data --info

# Compare with original
dataphy dataset load --dataset-path ./source_data --info

# Visualize augmented episode
dataphy dataset visualize --dataset-path ./augmented_data --episode episode_000000_aug_001
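
For a quick sanity check of the episode counts, here is a sketch that tallies original versus augmented episodes from the filename pattern shown earlier (episode_XXXXXX vs. episode_XXXXXX_aug_NNN):

from pathlib import Path

# Count original vs augmented episodes by filename pattern
root = Path("./augmented_data")
parquets = list(root.glob("data/chunk-*/*.parquet"))
augmented = [p for p in parquets if "_aug_" in p.stem]
originals = [p for p in parquets if "_aug_" not in p.stem]

print(f"Originals: {len(originals)}  Augmented: {len(augmented)}  Total: {len(parquets)}")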

Metadata Inspection

Check the augmentation metadata:

import json

# Load augmentation metadata
with open("./augmented_data/dataset_metadata.json", "r") as f:
    metadata = json.load(f)

# Inspect augmentation info
aug_info = metadata["augmentation_info"]
print(f"Original episodes: {aug_info['original_episode_count']}")
print(f"Augmented per original: {aug_info['augmented_episodes_per_original']}")
print(f"Total episodes: {aug_info['total_episodes']}")
print(f"Success rate: {aug_info['augmentation_stats']}")

Best Practices

1. Resource Management

# For large datasets, use multiple smaller batches
dataphy augment dataset \
  --dataset-path ./large_data \
  --output-path ./batch_1 \
  --episodes episode_000000,episode_000001,episode_000002 \
  --num-augmented 5

dataphy augment dataset \
  --dataset-path ./large_data \
  --output-path ./batch_2 \
  --episodes episode_000003,episode_000004,episode_000005 \
  --num-augmented 5
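
If you have many episodes, you can script the batching instead of writing each command by hand. The sketch below splits an episode list into batches and calls the same CLI shown above; the episode names and batch size are placeholders.

batch_augment.py
import subprocess

# Placeholder episode list - replace with the names from --list-episodes
episodes = [f"episode_{i:06d}" for i in range(6)]
batch_size = 3

for start in range(0, len(episodes), batch_size):
    batch = episodes[start:start + batch_size]
    out_dir = f"./batch_{start // batch_size + 1}"
    subprocess.run([
        "dataphy", "augment", "dataset",
        "--dataset-path", "./large_data",
        "--output-path", out_dir,
        "--episodes", ",".join(batch),
        "--num-augmented", "5",
    ], check=True)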

2. Reproducible Research

# Always use seeds for reproducible results
settings:
  seed: 42
  deterministic: true

# Document your configuration
dataset:
  naming_scheme: "episode_{original_id}_aug_{aug_index:03d}"
  update_metadata: true
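
To verify that two runs with the same seed really produced the same data, one option is to hash the generated parquet files; a minimal sketch follows (video files may not be bit-identical after re-encoding, so the parquet data is the more reliable comparison):

import hashlib
from pathlib import Path

def dataset_digest(root: str) -> str:
    """Hash all parquet files in a dataset, in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).glob("data/chunk-*/*.parquet")):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()

# Identical digests => the two seeded runs produced identical episode data
print(dataset_digest("./run_a") == dataset_digest("./run_b"))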

3. Quality Control

# Start with dry runs
--dry-run

# Use progress tracking
logging:
  progress_bar: true
  detailed_stats: true

# Enable error recovery
logging:
  continue_on_error: true
  max_retry_attempts: 3

4. Storage Optimization

# Adjust quality for storage vs quality tradeoff
settings:
  output_quality: 85 # Lower for more compression
  preserve_aspect_ratio: true

# Use efficient naming
dataset:
  naming_scheme: "ep{original_id}_a{aug_index:02d}" # Shorter names

Troubleshooting

Common Issues

Out of Disk Space

# Check space requirements first
du -sh ./source_data
# Multiply by (num_augmented + 1) for the space needed when originals are preserved

# Use smaller batches or lower quality
--num-augmented 1
# OR in config:
settings:
  output_quality: 75
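
A small sketch that estimates the space requirement up front and compares it with free space on the output drive (the num_augmented + 1 factor assumes originals are preserved):

import shutil
from pathlib import Path

source = Path("./source_data")
num_augmented = 3

# Rough estimate: source size x (num_augmented + 1) when originals are kept
source_bytes = sum(p.stat().st_size for p in source.rglob("*") if p.is_file())
needed = source_bytes * (num_augmented + 1)
free = shutil.disk_usage(".").free

print(f"Estimated need: {needed / 1e9:.1f} GB, free: {free / 1e9:.1f} GB")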

Memory Issues

# Use low memory mode
advanced:
  low_memory_mode: true
  frame_batch_size: 16
  max_processes: 2

Version Compatibility Issues

# Force specific version handling if auto-detection fails
dataset:
  auto_detect_version: false
  force_version: "v2.0"

Error Recovery

If augmentation fails partway through:

# Continue from where it left off - the system will skip existing files
dataphy augment dataset \
  --dataset-path ./source_data \
  --output-path ./partially_augmented \
  --config config.yaml \
  --num-augmented 3

# Check what succeeded
dataphy dataset load --dataset-path ./partially_augmented --list-episodes
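
To see exactly which augmented versions are still missing before re-running, here is a sketch that compares the source episodes against the output; it assumes the episode_XXXXXX_aug_NNN naming used throughout this tutorial.

from pathlib import Path

source = Path("./source_data")
output = Path("./partially_augmented")
num_augmented = 3

source_eps = sorted(p.stem for p in source.glob("data/chunk-*/episode_*.parquet"))
done = {p.stem for p in output.glob("data/chunk-*/*_aug_*.parquet")}

# Report any source episode whose augmented versions are incomplete
for ep in source_eps:
    missing = [i for i in range(1, num_augmented + 1) if f"{ep}_aug_{i:03d}" not in done]
    if missing:
        print(f"{ep}: missing augmentation(s) {missing}")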

Next Steps

After creating your augmented dataset:

  1. Verify Quality: Visualize and inspect augmented episodes
  2. Update Training: Point your training pipeline to the new dataset
  3. Compare Results: Train models on original vs augmented data
  4. Iterate: Adjust augmentation parameters based on results