
Dataset Augmentation Tutorial

This tutorial covers full dataset augmentation: creating an entirely new dataset that contains multiple augmented versions of each episode. Unlike episode augmentation, which modifies the original files in place, dataset augmentation writes a new dataset and leaves the original untouched.

Overview

Dataset augmentation allows you to:

  • Scale your data: Create 2x, 5x, 10x larger datasets automatically
  • Preserve originals: Keep original episodes alongside augmented versions
  • Support research: Create reproducible experimental datasets with seeds
  • Handle all formats: Works with LeRobot v1.0, v1.1, v2.0, v2.1+ automatically

Quick Start

Basic Dataset Augmentation

Create a dataset with 2 augmented versions per episode:

dataphy augment dataset \
  --dataset-path ./original_data \
  --output-path ./augmented_data \
  --config examples/dataset_augmentation_config.yaml \
  --num-augmented 2

Result Structure:

augmented_data/
├── videos/chunk-001/observation.images.webcam/
│   ├── episode_000000.mp4          # Original (if preserved)
│   ├── episode_000000_aug_001.mp4  # Augmented version 1
│   ├── episode_000000_aug_002.mp4  # Augmented version 2
│   ├── episode_000001.mp4
│   ├── episode_000001_aug_001.mp4
│   └── episode_000001_aug_002.mp4
├── data/chunk-001/
│   ├── episode_000000.parquet
│   ├── episode_000000_aug_001.parquet
│   ├── episode_000000_aug_002.parquet
│   └── ...
└── dataset_metadata.json           # Augmentation info

Step-by-Step Guide

Step 1: Prepare Your Configuration

Create a comprehensive augmentation configuration:

research_augmentation.yaml
# Dataset Augmentation Configuration
pipeline:
  sync_views: true # Synchronized across all cameras

  steps:
    # Spatial augmentations for robustness
    - name: random_crop_pad
      type: RandomCropPad
      params:
        keep_ratio_min: 0.85
        probability: 0.8

    - name: random_translate
      type: RandomTranslate
      params:
        px: 8
        probability: 0.6

    # Photometric augmentations for lighting variations
    - name: color_jitter
      type: ColorJitter
      params:
        magnitude: 0.2
        probability: 0.9

    # Texture and occlusion augmentations
    - name: random_conv
      type: RandomConv
      params:
        kernel_variance: 0.05
        probability: 0.4

    - name: cutout
      type: Cutout
      params:
        holes: 2
        size_range: [16, 32]
        probability: 0.3

# Global settings for reproducibility
settings:
  device: auto
  seed: 42
  deterministic: true
  output_quality: 95

# Dataset configuration
dataset:
  preserve_original: true
  update_metadata: true

# Progress tracking
logging:
  level: INFO
  progress_bar: true
  continue_on_error: true

Domain Randomization Configuration

For maximum realism and robustness, include domain randomization transforms:

domain_randomization_config.yaml
pipeline:
  sync_views: true
  steps:
    # Standard transforms
    - name: random_crop_pad
      keep_ratio_min: 0.88
      probability: 0.8

    - name: color_jitter
      magnitude: 0.15
      probability: 0.9

    # Domain randomization transforms
    - name: lighting_rand
      p: 0.6
      ambient_tint: [0.95, 1.05]
      directional_intensity: [0.0, 0.4]
      preserve_robot_color: true

    - name: camera_intrinsics_jitter
      p: 0.5
      fx_jitter: [0.98, 1.02]
      cx_jitter_px: [-4, 4]
      update_intrinsics: true

    - name: camera_extrinsics_jitter
      p: 0.4
      rot_deg: [-3.0, 3.0]
      transl_px: [-6, 6]
      same_delta_per_seq: true

    - name: rgb_sensor_noise
      p: 0.4
      shot_k: [0.5, 1.5]
      read_sigma: [0.002, 0.01]
      iso_range: [100, 800]

settings:
  device: auto
  seed: 42
  deterministic: true

Step 2: Inspect Your Source Dataset

First, understand your dataset structure:

# Get dataset information
dataphy dataset load --dataset-path ./source_data --info

# List episodes to see what's available
dataphy dataset load --dataset-path ./source_data --list-episodes

Example output:

Dataset Information:
  Format: lerobot
  Episodes: 25
  Total timesteps: 12,500
  Features: ['observation.images.webcam', 'observation.images.laptop', 'action']

Available Episodes:
  0. episode_000000
  1. episode_000001
  2. episode_000002
  ...
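
If you prefer to inspect the dataset from Python, here is a minimal sketch that lists episodes by globbing the parquet files. It assumes the chunked data/chunk-*/episode_*.parquet layout shown earlier; adjust the pattern if your dataset differs.

from pathlib import Path

# List episodes by filename, assuming the chunked parquet layout shown above
dataset_root = Path("./source_data")
episodes = sorted(p.stem for p in dataset_root.glob("data/chunk-*/episode_*.parquet"))

print(f"Found {len(episodes)} episodes")
for name in episodes[:5]:
    print(" ", name)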

Step 3: Plan Your Augmentation Strategy

Choose your augmentation strategy based on your use case:

Research Workflow (Reproducible)

dataphy augment dataset \
  --dataset-path ./research_data \
  --output-path ./experiment_1_data \
  --config research_augmentation.yaml \
  --num-augmented 3 \
  --seed 42 \
  --preserve-original

Training Workflow (Maximum Data)

dataphy augment dataset \
  --dataset-path ./training_source \
  --output-path ./training_data \
  --config heavy_augmentation.yaml \
  --num-augmented 9 \
  --sync-views

Specific Episodes (Targeted)

dataphy augment dataset \
  --dataset-path ./data \
  --output-path ./specific_aug \
  --config light_augmentation.yaml \
  --episodes episode_000000,episode_000005,episode_000010 \
  --num-augmented 5

Camera-Specific (Single View)

dataphy augment dataset \
  --dataset-path ./data \
  --output-path ./webcam_only \
  --config webcam_augmentation.yaml \
  --cameras observation.images.webcam \
  --num-augmented 10

Step 4: Execute with Dry Run

Preview what will be created before running:

dataphy augment dataset \
  --dataset-path ./source_data \
  --output-path ./preview_data \
  --config research_augmentation.yaml \
  --num-augmented 3 \
  --dry-run

Example output:

Augmentation Plan:
  Source: ./source_data
  Output: ./preview_data
  Episodes to process: 25 / 25
  Augmented versions per episode: 3
  Camera streams: ALL
  Preserve originals: True
  Result: 100 total episodes

Dry run completed - no files were created
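
The planned episode count is simple arithmetic you can check yourself; a quick sketch using the numbers from the plan above:

# Expected size of the augmented dataset (matches the dry-run summary above)
original_episodes = 25
num_augmented = 3          # --num-augmented
preserve_original = True   # --preserve-original (default in this tutorial)

total = original_episodes * num_augmented + (original_episodes if preserve_original else 0)
print(total)  # 100 = 25 originals + 75 augmented versions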

Step 5: Execute Full Augmentation

Run the actual augmentation:

dataphy augment dataset \
  --dataset-path ./source_data \
  --output-path ./augmented_data \
  --config research_augmentation.yaml \
  --num-augmented 3 \
  --preserve-original

Monitor progress:

Starting dataset augmentation...
Detected LeRobot version: v2.0
Structure type: chunked_with_meta
Found cameras: ['observation.images.webcam', 'observation.images.laptop']

[4.0%] Processing episode episode_000000 (1/25)
  Creating episode_000000_aug_001 (1/75)
  Successfully created episode_000000_aug_001
  Creating episode_000000_aug_002 (2/75)
  Successfully created episode_000000_aug_002
  Creating episode_000000_aug_003 (3/75)
  Successfully created episode_000000_aug_003
  Episode episode_000000 completed successfully (3/3 augmentations)

[8.0%] Processing episode episode_000001 (2/25)
...

Dataset augmentation completed:
  Total episodes in new dataset: 100
  Successfully augmented: 75
  Failed augmentations: 0
  Success rate: 100.0%
  Metadata saved: ./augmented_data/dataset_metadata.json

Advanced Usage

LeRobot Version Compatibility

The system automatically detects and handles different LeRobot versions:

# v2.0 format (carpit680/giraffe_clean_desk2)
dataphy augment dataset \
  --dataset-path ./giraffe_data \
  --output-path ./giraffe_aug \
  --config config.yaml \
  --num-augmented 3

# Output: Detected LeRobot version: v2.0

# v2.1 format (lerobot/svla_so100_sorting)
dataphy augment dataset \
  --dataset-path ./svla_data \
  --output-path ./svla_aug \
  --config config.yaml \
  --num-augmented 5

# Output: Detected LeRobot version: v2.1
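
If you want to confirm the version yourself before running, LeRobot v2.x datasets record it in meta/info.json; a small sketch (older v1.x datasets typically do not ship this file):

import json
from pathlib import Path

# Peek at the version a LeRobot v2.x dataset reports in meta/info.json
info_path = Path("./giraffe_data") / "meta" / "info.json"
if info_path.exists():
    info = json.loads(info_path.read_text())
    print("codebase_version:", info.get("codebase_version", "unknown"))
else:
    print("No meta/info.json found - likely an older (v1.x) layout")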

Programmatic Usage

Use the Python API for more control:

dataset_augmentation_script.py
from dataphy.dataset.registry import create_dataset_loader
from dataphy.dataset.augmentor import DatasetAugmentor, AugmentationConfig

# Load dataset
loader = create_dataset_loader("./source_data")

# Create augmentor
augmentor = DatasetAugmentor(loader)

# Progress callback
def progress_callback(message, current, total):
    progress = (current / total) * 100 if total > 0 else 0
    print(f"[{progress:5.1f}%] {message}")

# Configuration
config = AugmentationConfig(
    pipeline_config="research_config.yaml",
    target="dataset",
    num_augmented_episodes=3,
    preserve_original=True,
    sync_views=True,
    random_seed=42,
    progress_callback=progress_callback
)

# Execute augmentation
results = augmentor.augment_full_dataset(config, "./augmented_data")

# Results
print(f"Created {results['total_new_episodes']} episodes")
print(f"Success rate: {results['augmentation_stats']['successful']/75*100:.1f}%")

Configuration Templates

Light Augmentation (Subtle Changes)

light_augmentation.yaml
pipeline:
  sync_views: true
  steps:
    - name: color_jitter
      type: ColorJitter
      params:
        magnitude: 0.1
        probability: 0.7
    - name: random_translate
      type: RandomTranslate
      params:
        px: 4
        probability: 0.3

settings:
  seed: 42
  output_quality: 98

Heavy Augmentation (Maximum Diversity)

heavy_augmentation.yaml
pipeline:
  sync_views: true
  steps:
    - name: random_crop_pad
      type: RandomCropPad
      params:
        keep_ratio_min: 0.7
        probability: 0.9
    - name: random_translate
      type: RandomTranslate
      params:
        px: 12
        probability: 0.8
    - name: color_jitter
      type: ColorJitter
      params:
        magnitude: 0.3
        probability: 1.0
    - name: random_conv
      type: RandomConv
      params:
        kernel_variance: 0.1
        probability: 0.6
    - name: cutout
      type: Cutout
      params:
        holes: 3
        size_range: [20, 80]
        probability: 0.4

settings:
  seed: null # Random seed each run

Quality Assurance

Verify Output Dataset

After augmentation, verify the results:

# Check dataset structure
dataphy dataset load --dataset-path ./augmented_data --info

# Compare with original
dataphy dataset load --dataset-path ./source_data --info

# Visualize augmented episode
dataphy dataset visualize --dataset-path ./augmented_data --episode episode_000000_aug_001
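
For a quick sanity check of the episode counts, here is a sketch that tallies original versus augmented episodes from the filename pattern shown earlier (episode_XXXXXX vs. episode_XXXXXX_aug_NNN):

from pathlib import Path

# Count original vs augmented episodes by filename pattern
root = Path("./augmented_data")
parquets = list(root.glob("data/chunk-*/*.parquet"))
augmented = [p for p in parquets if "_aug_" in p.stem]
originals = [p for p in parquets if "_aug_" not in p.stem]

print(f"Originals: {len(originals)}  Augmented: {len(augmented)}  Total: {len(parquets)}")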

Metadata Inspection

Check the augmentation metadata:

import json

# Load augmentation metadata
with open("./augmented_data/dataset_metadata.json", "r") as f:
    metadata = json.load(f)

# Inspect augmentation info
aug_info = metadata["augmentation_info"]
print(f"Original episodes: {aug_info['original_episode_count']}")
print(f"Augmented per original: {aug_info['augmented_episodes_per_original']}")
print(f"Total episodes: {aug_info['total_episodes']}")
print(f"Success rate: {aug_info['augmentation_stats']}")

Best Practices

1. Resource Management

# For large datasets, use multiple smaller batches
dataphy augment dataset \
  --dataset-path ./large_data \
  --output-path ./batch_1 \
  --episodes episode_000000,episode_000001,episode_000002 \
  --num-augmented 5

dataphy augment dataset \
  --dataset-path ./large_data \
  --output-path ./batch_2 \
  --episodes episode_000003,episode_000004,episode_000005 \
  --num-augmented 5
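
If you have many episodes, you can script the batching instead of writing each command by hand. The sketch below splits an episode list into batches and calls the same CLI shown above; the episode names and batch size are placeholders.

batch_augment.py
import subprocess

# Placeholder episode list - replace with the names from --list-episodes
episodes = [f"episode_{i:06d}" for i in range(6)]
batch_size = 3

for start in range(0, len(episodes), batch_size):
    batch = episodes[start:start + batch_size]
    out_dir = f"./batch_{start // batch_size + 1}"
    subprocess.run([
        "dataphy", "augment", "dataset",
        "--dataset-path", "./large_data",
        "--output-path", out_dir,
        "--episodes", ",".join(batch),
        "--num-augmented", "5",
    ], check=True)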

2. Reproducible Research

# Always use seeds for reproducible results
settings:
  seed: 42
  deterministic: true

# Document your configuration
dataset:
  naming_scheme: "episode_{original_id}_aug_{aug_index:03d}"
  update_metadata: true
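
To verify that two runs with the same seed really produced the same data, one option is to hash the generated parquet files; a minimal sketch follows (video files may not be bit-identical after re-encoding, so the parquet data is the more reliable comparison):

import hashlib
from pathlib import Path

def dataset_digest(root: str) -> str:
    """Hash all parquet files in a dataset, in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).glob("data/chunk-*/*.parquet")):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()

# Identical digests => the two seeded runs produced identical episode data
print(dataset_digest("./run_a") == dataset_digest("./run_b"))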

3. Quality Control

# Start with dry runs
--dry-run

# Use progress tracking
logging:
  progress_bar: true
  detailed_stats: true

# Enable error recovery
logging:
  continue_on_error: true
  max_retry_attempts: 3

4. Storage Optimization

# Adjust quality for storage vs quality tradeoff
settings:
  output_quality: 85 # Lower for more compression
  preserve_aspect_ratio: true

# Use efficient naming
dataset:
  naming_scheme: "ep{original_id}_a{aug_index:02d}" # Shorter names

Troubleshooting

Common Issues

Out of Disk Space

# Check space requirements first
du -sh ./source_data
# Multiply by (num_augmented + 1) for the space needed when originals are preserved

# Use smaller batches or lower quality
--num-augmented 1
# OR in config:
settings:
  output_quality: 75
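
A small sketch that estimates the space requirement up front and compares it with free space on the output drive (the num_augmented + 1 factor assumes originals are preserved):

import shutil
from pathlib import Path

source = Path("./source_data")
num_augmented = 3

# Rough estimate: source size x (num_augmented + 1) when originals are kept
source_bytes = sum(p.stat().st_size for p in source.rglob("*") if p.is_file())
needed = source_bytes * (num_augmented + 1)
free = shutil.disk_usage(".").free

print(f"Estimated need: {needed / 1e9:.1f} GB, free: {free / 1e9:.1f} GB")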

Memory Issues

# Use low memory mode
advanced:
  low_memory_mode: true
  frame_batch_size: 16
  max_processes: 2

Version Compatibility Issues

# Force specific version handling if auto-detection fails
dataset:
  auto_detect_version: false
  force_version: "v2.0"

Error Recovery

If augmentation fails partway through:

# Continue from where it left off - the system will skip existing files
dataphy augment dataset \
  --dataset-path ./source_data \
  --output-path ./partially_augmented \
  --config config.yaml \
  --num-augmented 3

# Check what succeeded
dataphy dataset load --dataset-path ./partially_augmented --list-episodes
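
To see exactly which augmented versions are still missing before re-running, here is a sketch that compares the source episodes against the output; it assumes the episode_XXXXXX_aug_NNN naming used throughout this tutorial.

from pathlib import Path

source = Path("./source_data")
output = Path("./partially_augmented")
num_augmented = 3

source_eps = sorted(p.stem for p in source.glob("data/chunk-*/episode_*.parquet"))
done = {p.stem for p in output.glob("data/chunk-*/*_aug_*.parquet")}

# Report any source episode whose augmented versions are incomplete
for ep in source_eps:
    missing = [i for i in range(1, num_augmented + 1) if f"{ep}_aug_{i:03d}" not in done]
    if missing:
        print(f"{ep}: missing augmentation(s) {missing}")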

Next Steps

After creating your augmented dataset:

  1. Verify Quality: Visualize and inspect augmented episodes
  2. Update Training: Point your training pipeline to the new dataset
  3. Compare Results: Train models on original vs augmented data
  4. Iterate: Adjust augmentation parameters based on results