Dataset Augmentation Tutorial¶
This tutorial covers full dataset augmentation: creating entirely new datasets that contain multiple augmented versions of each episode. Unlike episode augmentation, which modifies the original files, dataset augmentation writes to a new dataset and leaves the original untouched.
Overview¶
Dataset augmentation allows you to:
- Scale your data: Create 2x, 5x, 10x larger datasets automatically
- Preserve originals: Keep original episodes alongside augmented versions
- Support research: Create reproducible experimental datasets with seeds
- Handle all formats: Works with LeRobot v1.0, v1.1, v2.0, v2.1+ automatically
Quick Start¶
Basic Dataset Augmentation¶
Create a dataset with 2 augmented versions per episode:
dataphy augment dataset \
  --dataset-path ./original_data \
  --output-path ./augmented_data \
  --config examples/dataset_augmentation_config.yaml \
  --num-augmented 2
Result Structure:
augmented_data/
├── videos/chunk-001/observation.images.webcam/
│   ├── episode_000000.mp4          # Original (if preserved)
│   ├── episode_000000_aug_001.mp4  # Augmented version 1
│   ├── episode_000000_aug_002.mp4  # Augmented version 2
│   ├── episode_000001.mp4
│   ├── episode_000001_aug_001.mp4
│   └── episode_000001_aug_002.mp4
├── data/chunk-001/
│   ├── episode_000000.parquet
│   ├── episode_000000_aug_001.parquet
│   ├── episode_000000_aug_002.parquet
│   └── ...
└── dataset_metadata.json           # Augmentation info
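The naming pattern in the tree above makes it possible to map every augmented file back to its source episode. The following is a minimal sketch, assuming the `episode_<id>_aug_<index>` pattern shown in the example; `group_by_original` is a hypothetical helper written for illustration, not part of the dataphy API.

```python
import re

# Assumed pattern, taken from the example tree: episode_000000_aug_001.mp4
EPISODE_PATTERN = re.compile(r"^(episode_\d+)(?:_aug_(\d+))?\.(?:mp4|parquet)$")


def group_by_original(filenames):
    """Map each original episode name to the sorted augmentation indices found."""
    groups = {}
    for name in filenames:
        match = EPISODE_PATTERN.match(name)
        if match is None:
            continue  # not an episode file, ignore it
        original, aug_index = match.groups()
        indices = groups.setdefault(original, [])
        if aug_index is not None:
            indices.append(int(aug_index))
    return {original: sorted(indices) for original, indices in groups.items()}


files = [
    "episode_000000.mp4",
    "episode_000000_aug_001.mp4",
    "episode_000000_aug_002.mp4",
    "episode_000001.mp4",
    "episode_000001_aug_001.mp4",
    "episode_000001_aug_002.mp4",
]
print(group_by_original(files))
```

An episode that maps to an empty list has no augmented versions in the listing, which is a quick way to spot episodes that were skipped.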
Step-by-Step Guide¶
Step 1: Prepare Your Configuration¶
Create a comprehensive augmentation configuration:
# Dataset Augmentation Configuration
pipeline:
  sync_views: true # Synchronized across all cameras
  steps:
    # Spatial augmentations for robustness
    - name: random_crop_pad
      type: RandomCropPad
      params:
        keep_ratio_min: 0.85
        probability: 0.8
    - name: random_translate
      type: RandomTranslate
      params:
        px: 8
        probability: 0.6
    # Photometric augmentations for lighting variations
    - name: color_jitter
      type: ColorJitter
      params:
        magnitude: 0.2
        probability: 0.9
    # Texture and occlusion augmentations
    - name: random_conv
      type: RandomConv
      params:
        kernel_variance: 0.05
        probability: 0.4
    - name: cutout
      type: Cutout
      params:
        holes: 2
        size_range: [16, 32]
        probability: 0.3
# Global settings for reproducibility
settings:
  device: auto
  seed: 42
  deterministic: true
  output_quality: 95
# Dataset configuration
dataset:
  preserve_original: true
  update_metadata: true
# Progress tracking
logging:
  level: INFO
  progress_bar: true
  continue_on_error: true
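Before launching a long run, it can help to sanity-check the pipeline section. The sketch below works on a Python dict that mirrors the YAML structure above (load the file with any YAML parser you already use). `validate_pipeline` is a hypothetical helper, and the rules it enforces (every step has a `type`, every `probability` lies in [0, 1]) are assumptions inferred from the example, not documented dataphy requirements.

```python
def validate_pipeline(config):
    """Return a list of human-readable problems found in a pipeline config dict."""
    problems = []
    steps = config.get("pipeline", {}).get("steps", [])
    if not steps:
        problems.append("pipeline.steps is empty")
    for index, step in enumerate(steps):
        label = step.get("name", f"step #{index}")
        if "type" not in step:
            problems.append(f"{label}: missing 'type'")
        probability = step.get("params", {}).get("probability")
        if probability is None:
            problems.append(f"{label}: missing 'probability'")
        elif not 0.0 <= probability <= 1.0:
            problems.append(f"{label}: probability {probability} outside [0, 1]")
    return problems


# Dict equivalent of three steps from the YAML above; the last one is deliberately wrong.
config = {
    "pipeline": {
        "sync_views": True,
        "steps": [
            {"name": "random_crop_pad", "type": "RandomCropPad",
             "params": {"keep_ratio_min": 0.85, "probability": 0.8}},
            {"name": "color_jitter", "type": "ColorJitter",
             "params": {"magnitude": 0.2, "probability": 0.9}},
            {"name": "cutout", "type": "Cutout",
             "params": {"holes": 2, "size_range": [16, 32], "probability": 1.3}},
        ],
    }
}
for problem in validate_pipeline(config):
    print(problem)
```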
Domain Randomization Configuration¶
For maximum realism and robustness, include domain randomization transforms:
pipeline:
  sync_views: true
  steps:
    # Standard transforms
    - name: random_crop_pad
      keep_ratio_min: 0.88
      probability: 0.8
    - name: color_jitter
      magnitude: 0.15
      probability: 0.9
    # Domain randomization transforms
    - name: lighting_rand
      p: 0.6
      ambient_tint: [0.95, 1.05]
      directional_intensity: [0.0, 0.4]
      preserve_robot_color: true
    - name: camera_intrinsics_jitter
      p: 0.5
      fx_jitter: [0.98, 1.02]
      cx_jitter_px: [-4, 4]
      update_intrinsics: true
    - name: camera_extrinsics_jitter
      p: 0.4
      rot_deg: [-3.0, 3.0]
      transl_px: [-6, 6]
      same_delta_per_seq: true
    - name: rgb_sensor_noise
      p: 0.4
      shot_k: [0.5, 1.5]
      read_sigma: [0.002, 0.01]
      iso_range: [100, 800]
settings:
  device: auto
  seed: 42
  deterministic: true
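Range parameters such as `rot_deg: [-3.0, 3.0]` combined with `same_delta_per_seq: true` suggest that a single value is drawn once and reused for every frame of a sequence. The sketch below illustrates that reading only; it is an assumption about the semantics, and `sample_sequence_params` is a hypothetical function, not the library's implementation.

```python
import random


def sample_sequence_params(rng, num_frames, p, value_range, same_delta_per_seq):
    """Return one sampled value per frame, or None where the transform is skipped.

    Assumption: the probability `p` gates the transform once per sequence.
    """
    if rng.random() >= p:
        return [None] * num_frames  # transform not applied to this sequence
    low, high = value_range
    if same_delta_per_seq:
        delta = rng.uniform(low, high)  # one draw shared by all frames
        return [delta] * num_frames
    return [rng.uniform(low, high) for _ in range(num_frames)]


rng = random.Random(42)
rotations = sample_sequence_params(
    rng, num_frames=4, p=1.0, value_range=(-3.0, 3.0), same_delta_per_seq=True
)
print(rotations)  # four identical values within [-3.0, 3.0]
```

Sharing one delta per sequence keeps the simulated camera pose stable over time, so the episode looks like it was recorded from a slightly different but fixed viewpoint rather than from a shaking camera.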
Step 2: Inspect Your Source Dataset¶
First, understand your dataset structure:
# Get dataset information
dataphy dataset load --dataset-path ./source_data --info
# List episodes to see what's available
dataphy dataset load --dataset-path ./source_data --list-episodes
Example output:
Dataset Information:
  Format: lerobot
  Episodes: 25
  Total timesteps: 12,500
  Features: ['observation.images.webcam', 'observation.images.laptop', 'action']
Available Episodes:
  0. episode_000000
  1. episode_000001
  2. episode_000002
  ...
Step 3: Plan Your Augmentation Strategy¶
Choose your augmentation strategy based on your use case:
Research Workflow (Reproducible)¶
dataphy augment dataset \
  --dataset-path ./research_data \
  --output-path ./experiment_1_data \
  --config research_augmentation.yaml \
  --num-augmented 3 \
  --seed 42 \
  --preserve-original
Training Workflow (Maximum Data)¶
dataphy augment dataset \
  --dataset-path ./training_source \
  --output-path ./training_data \
  --config heavy_augmentation.yaml \
  --num-augmented 9 \
  --sync-views
Specific Episodes (Targeted)¶
dataphy augment dataset \
  --dataset-path ./data \
  --output-path ./specific_aug \
  --config light_augmentation.yaml \
  --episodes episode_000000,episode_000005,episode_000010 \
  --num-augmented 5
Camera-Specific (Single View)¶
dataphy augment dataset \
  --dataset-path ./data \
  --output-path ./webcam_only \
  --config webcam_augmentation.yaml \
  --cameras observation.images.webcam \
  --num-augmented 10
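When these workflows are launched from a script, assembling the command as an argument list avoids shell-quoting mistakes. The sketch below uses only flags that appear in this tutorial; `build_augment_command` is a hypothetical helper, and joining several camera names with commas is an assumption (the tutorial only shows a single camera).

```python
import shlex


def build_augment_command(dataset_path, output_path, config, num_augmented,
                          episodes=None, cameras=None, seed=None,
                          preserve_original=False, sync_views=False, dry_run=False):
    """Assemble the argument list for `dataphy augment dataset`."""
    command = [
        "dataphy", "augment", "dataset",
        "--dataset-path", str(dataset_path),
        "--output-path", str(output_path),
        "--config", str(config),
        "--num-augmented", str(num_augmented),
    ]
    if episodes:
        command += ["--episodes", ",".join(episodes)]
    if cameras:
        # Assumption: multiple cameras are comma-separated like episodes.
        command += ["--cameras", ",".join(cameras)]
    if seed is not None:
        command += ["--seed", str(seed)]
    if preserve_original:
        command.append("--preserve-original")
    if sync_views:
        command.append("--sync-views")
    if dry_run:
        command.append("--dry-run")
    return command


# The research workflow from above, expressed as an argument list.
command = build_augment_command(
    "./research_data", "./experiment_1_data", "research_augmentation.yaml",
    num_augmented=3, seed=42, preserve_original=True,
)
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run(command, check=True)`.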
Step 4: Execute with Dry Run¶
Preview what will be created before running:
dataphy augment dataset \
  --dataset-path ./source_data \
  --output-path ./preview_data \
  --config research_augmentation.yaml \
  --num-augmented 3 \
  --dry-run
Example output:
Augmentation Plan:
  Source: ./source_data
  Output: ./preview_data
  Episodes to process: 25 / 25
  Augmented versions per episode: 3
  Camera streams: ALL
  Preserve originals: True
  Result: 100 total episodes
Dry run completed - no files were created
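The totals in the plan follow from simple arithmetic: each source episode yields `num_augmented` new episodes, plus the original when it is preserved. A small sketch of that calculation (`plan_totals` is a hypothetical helper for illustration):

```python
def plan_totals(num_episodes, num_augmented, preserve_original=True):
    """Compute how many episodes an augmentation run is expected to produce."""
    augmented = num_episodes * num_augmented
    originals = num_episodes if preserve_original else 0
    return {"augmented": augmented, "originals": originals, "total": augmented + originals}


# 25 source episodes with 3 augmented versions each, originals preserved.
print(plan_totals(25, 3))
# {'augmented': 75, 'originals': 25, 'total': 100}
```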
Step 5: Execute Full Augmentation¶
Run the actual augmentation:
dataphy augment dataset \
  --dataset-path ./source_data \
  --output-path ./augmented_data \
  --config research_augmentation.yaml \
  --num-augmented 3 \
  --preserve-original
Monitor progress:
Starting dataset augmentation...
Detected LeRobot version: v2.0
Structure type: chunked_with_meta
Found cameras: ['observation.images.webcam', 'observation.images.laptop']
[4.0%] Processing episode episode_000000 (1/25)
  Creating episode_000000_aug_001 (1/75)
  Successfully created episode_000000_aug_001
  Creating episode_000000_aug_002 (2/75)
  Successfully created episode_000000_aug_002
  Creating episode_000000_aug_003 (3/75)
  Successfully created episode_000000_aug_003
  Episode episode_000000 completed successfully (3/3 augmentations)
[8.0%] Processing episode episode_000001 (2/25)
...
Dataset augmentation completed:
  Total episodes in new dataset: 100
  Successfully augmented: 75
  Failed augmentations: 0
  Success rate: 100.0%
  Metadata saved: ./augmented_data/dataset_metadata.json
Advanced Usage¶
LeRobot Version Compatibility¶
The system automatically detects and handles different LeRobot versions:
# v2.0 format (carpit680/giraffe_clean_desk2)
dataphy augment dataset \
  --dataset-path ./giraffe_data \
  --output-path ./giraffe_aug \
  --config config.yaml \
  --num-augmented 3
# Output: Detected LeRobot version: v2.0
# v2.1 format (lerobot/svla_so100_sorting)
dataphy augment dataset \
  --dataset-path ./svla_data \
  --output-path ./svla_aug \
  --config config.yaml \
  --num-augmented 5
# Output: Detected LeRobot version: v2.1
Programmatic Usage¶
Use the Python API for more control:
from dataphy.dataset.registry import create_dataset_loader
from dataphy.dataset.augmentor import DatasetAugmentor, AugmentationConfig
# Load dataset
loader = create_dataset_loader("./source_data")
# Create augmentor
augmentor = DatasetAugmentor(loader)
# Progress callback
def progress_callback(message, current, total):
    progress = (current / total) * 100 if total > 0 else 0
    print(f"[{progress:5.1f}%] {message}")
# Configuration
config = AugmentationConfig(
    pipeline_config="research_config.yaml",
    target="dataset",
    num_augmented_episodes=3,
    preserve_original=True,
    sync_views=True,
    random_seed=42,
    progress_callback=progress_callback
)
# Execute augmentation
results = augmentor.augment_full_dataset(config, "./augmented_data")
# Results
print(f"Created {results['total_new_episodes']} episodes")
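# 25 source episodes x 3 augmented versions each; adjust for your dataset
expected_augmentations = 25 * 3
print(f"Success rate: {results['augmentation_stats']['successful'] / expected_augmentations * 100:.1f}%")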
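The success rate can also be derived from the stats dictionary itself, so the calculation does not depend on knowing the episode count in advance. This is a sketch under one assumption: that `augmentation_stats` carries a `failed` count next to `successful` (the log output reports both numbers, but the key name is a guess).

```python
def success_rate(stats):
    """Percentage of attempted augmentations that succeeded (0.0 if none were attempted)."""
    successful = stats.get("successful", 0)
    failed = stats.get("failed", 0)  # assumed key name
    attempted = successful + failed
    return 100.0 * successful / attempted if attempted else 0.0


# Example stats shaped like the summary printed at the end of a run.
print(f"Success rate: {success_rate({'successful': 75, 'failed': 0}):.1f}%")
```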
Configuration Templates¶
Light Augmentation (Subtle Changes)¶
pipeline:
  sync_views: true
  steps:
    - name: color_jitter
      type: ColorJitter
      params:
        magnitude: 0.1
        probability: 0.7
    - name: random_translate
      type: RandomTranslate
      params:
        px: 4
        probability: 0.3
settings:
  seed: 42
  output_quality: 98
Heavy Augmentation (Maximum Diversity)¶
pipeline:
  sync_views: true
  steps:
    - name: random_crop_pad
      type: RandomCropPad
      params:
        keep_ratio_min: 0.7
        probability: 0.9
    - name: random_translate
      type: RandomTranslate
      params:
        px: 12
        probability: 0.8
    - name: color_jitter
      type: ColorJitter
      params:
        magnitude: 0.3
        probability: 1.0
    - name: random_conv
      type: RandomConv
      params:
        kernel_variance: 0.1
        probability: 0.6
    - name: cutout
      type: Cutout
      params:
        holes: 3
        size_range: [20, 80]
        probability: 0.4
settings:
  seed: null # Random seed each run
Quality Assurance¶
Verify Output Dataset¶
After augmentation, verify the results:
# Check dataset structure
dataphy dataset load --dataset-path ./augmented_data --info
# Compare with original
dataphy dataset load --dataset-path ./source_data --info
# Visualize augmented episode
dataphy dataset visualize --dataset-path ./augmented_data --episode episode_000000_aug_001
Metadata Inspection¶
Check the augmentation metadata:
import json
# Load augmentation metadata
with open("./augmented_data/dataset_metadata.json", "r") as f:
    metadata = json.load(f)
# Inspect augmentation info
aug_info = metadata["augmentation_info"]
print(f"Original episodes: {aug_info['original_episode_count']}")
print(f"Augmented per original: {aug_info['augmented_episodes_per_original']}")
print(f"Total episodes: {aug_info['total_episodes']}")
print(f"Augmentation stats: {aug_info['augmentation_stats']}")
Best Practices¶
1. Resource Management¶
# For large datasets, use multiple smaller batches
dataphy augment dataset \
  --dataset-path ./large_data \
  --output-path ./batch_1 \
  --episodes episode_000000,episode_000001,episode_000002 \
  --num-augmented 5
dataphy augment dataset \
  --dataset-path ./large_data \
  --output-path ./batch_2 \
  --episodes episode_000003,episode_000004,episode_000005 \
  --num-augmented 5
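For datasets with many episodes, the `--episodes` lists for each batch can be generated rather than typed by hand. A minimal sketch, assuming episodes follow the `episode_NNNNNN` naming used throughout this tutorial (`batch_episodes` is a hypothetical helper):

```python
def batch_episodes(episode_names, batch_size):
    """Split a list of episode names into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        episode_names[start:start + batch_size]
        for start in range(0, len(episode_names), batch_size)
    ]


episodes = [f"episode_{index:06d}" for index in range(6)]
for number, batch in enumerate(batch_episodes(episodes, 3), start=1):
    print(f"--output-path ./batch_{number} --episodes {','.join(batch)}")
```

Each printed line supplies the two arguments that differ between the batch commands above.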
2. Reproducible Research¶
# Always use seeds for reproducible results
settings:
  seed: 42
  deterministic: true
# Document your configuration
dataset:
  naming_scheme: "episode_{original_id}_aug_{aug_index:03d}"
  update_metadata: true
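A fixed seed only helps if each augmented version still receives different random parameters. One common approach is to derive a separate random generator per episode and augmentation index from the base seed. Whether dataphy does this internally is not stated here; the sketch below only illustrates the idea, and `make_rng` is a hypothetical helper.

```python
import random


def make_rng(base_seed, episode_name, aug_index):
    """Create a generator that is repeatable for one (seed, episode, version) combination."""
    # Seeding with a string is deterministic across runs and platforms.
    return random.Random(f"{base_seed}:{episode_name}:{aug_index}")


first = make_rng(42, "episode_000000", 1).uniform(-0.2, 0.2)
again = make_rng(42, "episode_000000", 1).uniform(-0.2, 0.2)
other = make_rng(42, "episode_000000", 2).uniform(-0.2, 0.2)

print(first == again)  # same seed, episode, and version: identical draw
print(first != other)  # different version: different draw
```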
3. Quality Control¶
# Start with a dry run (CLI flag)
--dry-run
# In the config: track progress and enable error recovery
logging:
  progress_bar: true
  detailed_stats: true
  continue_on_error: true
  max_retry_attempts: 3
4. Storage Optimization¶
# Adjust quality for storage vs quality tradeoff
settings:
  output_quality: 85 # Lower for more compression
  preserve_aspect_ratio: true
# Use efficient naming
dataset:
  naming_scheme: "ep{original_id}_a{aug_index:02d}" # Shorter names
Troubleshooting¶
Common Issues¶
Out of Disk Space¶
# Check space requirements first
du -sh ./source_data
# Multiply by (num_augmented + 1) for the space needed when originals are preserved
# Use fewer augmented versions, smaller batches, or lower quality (CLI flag)
--num-augmented 1
# OR in config:
settings:
  output_quality: 75
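The same estimate can be computed in Python before starting a run. This is a rough sketch that assumes each augmented copy takes about as much space as its source; re-encoding at a different `output_quality` will change the real figure. Both function names are hypothetical.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def estimated_output_bytes(source_bytes, num_augmented, preserve_original=True):
    """Estimate output size as one source-sized copy per augmented version, plus the originals."""
    copies = num_augmented + (1 if preserve_original else 0)
    return source_bytes * copies


source_bytes = directory_size_bytes("./source_data")
needed = estimated_output_bytes(source_bytes, num_augmented=3)
print(f"Estimated space needed: {needed / 1e9:.2f} GB")
```

Comparing the estimate against `shutil.disk_usage(output_dir).free` gives an early warning before any files are written.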
Memory Issues¶
Version Compatibility Issues¶
# Force specific version handling if auto-detection fails
dataset:
  auto_detect_version: false
  force_version: "v2.0"
Error Recovery¶
If augmentation fails partway through:
# Continue from where it left off - the system will skip existing files
dataphy augment dataset \
  --dataset-path ./source_data \
  --output-path ./partially_augmented \
  --config config.yaml \
  --num-augmented 3
# Check what succeeded
dataphy dataset load --dataset-path ./partially_augmented --list-episodes
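To see exactly which augmented episodes are still missing after a partial run, the expected file names can be compared with what is on disk. The sketch below assumes the `episode_<id>_aug_<index>` naming and the `data/chunk-001/` location from the example structure earlier in this tutorial; `missing_augmentations` is a hypothetical helper.

```python
from pathlib import Path


def missing_augmentations(data_dir, episode_names, num_augmented, suffix=".parquet"):
    """List expected augmented file names that do not exist in data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for episode in episode_names:
        for aug_index in range(1, num_augmented + 1):
            name = f"{episode}_aug_{aug_index:03d}{suffix}"
            if not (data_dir / name).exists():
                missing.append(name)
    return missing


episodes = ["episode_000000", "episode_000001"]
for name in missing_augmentations("./partially_augmented/data/chunk-001", episodes, 3):
    print(f"missing: {name}")
```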
Next Steps¶
After creating your augmented dataset:
- Verify Quality: Visualize and inspect augmented episodes
- Update Training: Point your training pipeline to the new dataset
- Compare Results: Train models on original vs augmented data
- Iterate: Adjust augmentation parameters based on results
Related Tutorials¶
- Episode Augmentation - For single episode modifications
- Basic Usage - For general SDK usage
- Configuration Reference - For detailed options
- Python API Examples - For programmatic usage