Getting Started

This guide walks you through installing the Dataphy SDK and using it for robotics data management and augmentation.

Installation

Requirements

Python 3.10 or higher
Poetry (recommended) or pip
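
Before installing, it can help to confirm that the local environment meets these requirements. The following is a minimal sketch using only the Python standard library; the function name `check_requirements` is made up for illustration and is not part of the SDK.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements above


def check_requirements() -> dict:
    """Report whether the interpreter and installers match the stated requirements."""
    return {
        # sys.version_info compares element-wise against a plain tuple
        "python_ok": sys.version_info >= MIN_PYTHON,
        # shutil.which returns None when the executable is not on PATH
        "poetry_found": shutil.which("poetry") is not None,
        "pip_found": shutil.which("pip") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

Only one of Poetry or pip needs to be present, so a "no" for one of them is not necessarily a problem.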

Install with Poetry (Recommended)

# Clone the repository into a dataphy-sdk directory
git clone https://github.com/dataphy/dataphy.git dataphy-sdk
cd dataphy-sdk

# Install base dependencies
poetry install

# Install with all extras for full functionality
poetry install --extras "torch aws hf parquet rerun"

# Upgrade rerun for visualization compatibility
poetry run dataphy-upgrade-rerun

# Verify installation
poetry run dataphy version

Install with pip

# Quote the extra so shells such as zsh do not expand the brackets
pip install "dataphy[rerun]"

Quick Verification

Test your installation:

# Show available commands
dataphy --help

# List supported dataset formats
dataphy dataset list-formats

# Show version information
dataphy version
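
The same check can be scripted, for example in a CI job. The sketch below shells out to the `dataphy version` command shown above and relies only on the standard library; the helper name `dataphy_version` is an illustrative assumption, and no particular output format is assumed.

```python
import shutil
import subprocess
from typing import Optional


def dataphy_version(executable: str = "dataphy") -> Optional[str]:
    """Return the output of `<executable> version`, or None if it cannot be run."""
    if shutil.which(executable) is None:
        return None  # CLI not on PATH, e.g. the virtual environment is not active
    result = subprocess.run(
        [executable, "version"],
        capture_output=True,
        text=True,
        check=False,  # inspect the return code ourselves instead of raising
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = dataphy_version()
    print(version if version is not None else "dataphy CLI not found or failed")
```

With a Poetry install, the CLI is only on PATH inside the Poetry environment, so run the script through `poetry run python`.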

First Steps

1. Fetch a Dataset

Let's start by downloading a sample robotics dataset:

# Fetch a LeRobot dataset
dataphy dataset fetch \
  --format lerobot \
  --repo-id carpit680/giraffe_clean_desk2 \
  --output ./my-dataset
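
When fetching several datasets from a script, it can be convenient to build the command as an argument list instead of a shell string. This sketch uses only the flags shown above; the helper `build_fetch_command` is hypothetical and simply assembles the arguments.

```python
import shlex


def build_fetch_command(repo_id: str, output_dir: str, fmt: str = "lerobot") -> list[str]:
    """Assemble the `dataphy dataset fetch` invocation as an argument list."""
    return [
        "dataphy", "dataset", "fetch",
        "--format", fmt,
        "--repo-id", repo_id,
        "--output", output_dir,
    ]


cmd = build_fetch_command("carpit680/giraffe_clean_desk2", "./my-dataset")
# shlex.join renders the list as a copy-pasteable shell command
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without shell quoting issues.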

2. Explore the Dataset

# Get dataset information
dataphy dataset info --format lerobot --repo-id carpit680/giraffe_clean_desk2

# Load and inspect locally
dataphy dataset load --dataset-path ./my-dataset --info

# List episodes
dataphy dataset load --dataset-path ./my-dataset --list-episodes

# Examine a specific timestep of an episode
dataphy dataset load --dataset-path ./my-dataset --episode 0 --timestep 10

3. Visualize in 2D

# Launch interactive 2D visualization
dataphy dataset visualize --format lerobot --dataset-path ./my-dataset

4. Apply Augmentations

Create a simple augmentation config:

aug.yaml

version: 1
pipeline:
  sync_views: true
  steps:
    - name: color_jitter
      magnitude: 0.1
    - name: cutout
      holes: 1
      size_range: [8, 16]
  background:
    adapter: none
seed: 42
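
A quick structural check can catch indentation mistakes before a long augmentation run. The sketch below operates on the dictionary a YAML loader would produce for aug.yaml; the rules it enforces (a non-empty `steps` list, a `name` on each step, `size_range` as `[min, max]`, an integer `seed` when present) are inferred from the example above and are assumptions, not the SDK's actual schema.

```python
from typing import Any


def check_aug_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    pipeline = config.get("pipeline")
    if not isinstance(pipeline, dict):
        return ["missing 'pipeline' mapping"]

    problems: list[str] = []
    steps = pipeline.get("steps")
    if not isinstance(steps, list) or not steps:
        problems.append("'pipeline.steps' must be a non-empty list")
        steps = []

    for index, step in enumerate(steps):
        if not isinstance(step, dict) or "name" not in step:
            problems.append(f"step {index} has no 'name'")
            continue
        size_range = step.get("size_range")
        if size_range is not None and (
            not isinstance(size_range, list)
            or len(size_range) != 2
            or size_range[0] > size_range[1]
        ):
            problems.append(f"step {index}: 'size_range' should be [min, max]")

    seed = config.get("seed")
    if seed is not None and not isinstance(seed, int):
        problems.append("'seed' should be an integer")
    return problems


# The aug.yaml example above, written as the dict a YAML loader would return.
example = {
    "version": 1,
    "pipeline": {
        "sync_views": True,
        "steps": [
            {"name": "color_jitter", "magnitude": 0.1},
            {"name": "cutout", "holes": 1, "size_range": [8, 16]},
        ],
        "background": {"adapter": "none"},
    },
    "seed": 42,
}
print(check_aug_config(example))  # prints []
```

With PyYAML installed, `yaml.safe_load(open("aug.yaml"))` yields a dictionary of this shape that can be passed to the same function.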

Apply augmentations:

# Augment a single episode in place (the original is modified; backups are kept)
dataphy augment dataset \
  --dataset-path ./my-dataset \
  --config aug.yaml \
  --episode 0

# Create augmented dataset (preserves original)
dataphy augment dataset \
  --dataset-path ./my-dataset \
  --output-path ./augmented-dataset \
  --config examples/dataset_augmentation_config.yaml \
  --num-augmented 2

# Visualize results
dataphy dataset visualize --format lerobot --dataset-path ./augmented-dataset

Dataset Augmentation Deep Dive

Dataphy provides two augmentation approaches:

1. Episode Augmentation

Purpose: Modify specific episodes in place
Use case: Quick experimentation, single-episode fixes
Result: The original dataset is modified in place; backups are kept

2. Full Dataset Augmentation

Purpose: Create an entirely new dataset containing multiple augmented versions of each episode
Use case: Training data expansion, research experiments
Result: A new dataset containing the original episodes plus the augmented ones

# Compare dataset sizes
dataphy dataset load --dataset-path ./my-dataset --info
# Episodes: 25

dataphy dataset load --dataset-path ./augmented-dataset --info
# Episodes: 75 (25 original + 50 augmented)
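
The episode counts above follow from simple arithmetic: each original episode is kept and contributes `--num-augmented` extra copies. A small sketch, assuming every episode is augmented the same number of times (the function name is illustrative):

```python
def expected_episode_count(original_episodes: int, num_augmented: int) -> int:
    """Episodes in the output dataset: the originals plus num_augmented copies of each."""
    if original_episodes < 0 or num_augmented < 0:
        raise ValueError("counts must be non-negative")
    return original_episodes * (1 + num_augmented)


# 25 originals with --num-augmented 2 gives 25 + 50 episodes
print(expected_episode_count(25, 2))  # prints 75
```

Comparing this number with the `--info` output is a cheap sanity check after a full dataset augmentation.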

LeRobot Version Compatibility

Dataphy automatically detects and handles different LeRobot dataset formats:

v1.0: Early format with direct video structure
v1.1: Intermediate format with chunked data
v2.0: Modern format (e.g., carpit680/giraffe_clean_desk2)
v2.1: Latest format (e.g., lerobot/svla_so100_sorting)

# The system automatically detects and adapts:
# Detected LeRobot version: v2.0
# Structure type: chunked_with_meta
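
To see which version a local dataset declares, you can inspect its metadata by hand. The sketch below assumes the dataset contains a `meta/info.json` file with a `codebase_version` field, as LeRobot v2.x datasets typically do. It is not Dataphy's own detection logic, and older layouts may not have this file, in which case the function returns None.

```python
import json
from pathlib import Path
from typing import Optional


def read_lerobot_version(dataset_path: str) -> Optional[str]:
    """Return the declared codebase version, or None if it cannot be found."""
    info_file = Path(dataset_path) / "meta" / "info.json"  # assumed v2.x layout
    if not info_file.is_file():
        return None
    with info_file.open(encoding="utf-8") as handle:
        info = json.load(handle)
    if not isinstance(info, dict):
        return None
    return info.get("codebase_version")


if __name__ == "__main__":
    print(read_lerobot_version("./my-dataset"))
```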

What's Next?

Basic Usage Tutorial: Learn core dataset operations
Episode Augmentation Guide: Master single episode augmentation
Dataset Augmentation Tutorial: Create expanded datasets
API Reference: Explore programmatic usage
Configuration Files: Ready-to-use augmentation configs
Python Examples: See ready-to-run code samples

Getting Help

Documentation: You're reading it!
GitHub Issues: Report bugs or request features
Examples: Check the examples/ directory in the repository

Configuration

Environment Variables

The SDK respects these environment variables:

DATAPHY_CACHE_DIR: Custom cache directory for datasets
HF_TOKEN: Hugging Face token for private datasets
RERUN_STRICT_MODE: Enable strict mode for visualization
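
Your own scripts can read the same variables to stay consistent with the SDK. The sketch below shows one way to do that with the standard library. The fallback directory `~/.cache/dataphy` is a hypothetical placeholder, since the SDK's real default is not stated here, and both helper names are illustrative.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical fallback; the SDK's actual default cache location is not documented here.
FALLBACK_CACHE_DIR = "~/.cache/dataphy"


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return DATAPHY_CACHE_DIR if set and non-empty, else the placeholder fallback."""
    env = os.environ if env is None else env
    raw = env.get("DATAPHY_CACHE_DIR") or FALLBACK_CACHE_DIR
    return Path(raw).expanduser()  # expand a leading ~ to the home directory


def has_hf_token(env: Optional[Mapping[str, str]] = None) -> bool:
    """Report whether HF_TOKEN is set, without exposing its value."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN"))


if __name__ == "__main__":
    print(f"cache dir: {resolve_cache_dir()}")
    print(f"HF token set: {has_hf_token()}")
```

Accepting the environment as a parameter keeps the helpers easy to test without touching the real process environment.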

Config Files

You can create a global config file at ~/.dataphy/config.yaml:

# Global Dataphy configuration
cache_dir: "~/dataphy-cache"
default_format: "lerobot"
visualization:
  auto_open: true
  port: 9876
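
If you prefer to create this file from a setup script, the sketch below writes the example configuration above under a given home directory and leaves an existing file untouched. The helper `write_global_config` is illustrative and not part of the SDK.

```python
from pathlib import Path

# The example configuration from above, reproduced verbatim.
CONFIG_TEXT = """\
# Global Dataphy configuration
cache_dir: "~/dataphy-cache"
default_format: "lerobot"
visualization:
  auto_open: true
  port: 9876
"""


def write_global_config(home: Path, overwrite: bool = False) -> Path:
    """Write <home>/.dataphy/config.yaml unless it already exists; return its path."""
    config_path = home / ".dataphy" / "config.yaml"
    if config_path.exists() and not overwrite:
        return config_path  # keep an existing config untouched
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(CONFIG_TEXT, encoding="utf-8")
    return config_path
```

Calling `write_global_config(Path.home())` places the file at ~/.dataphy/config.yaml. Taking the home directory as a parameter makes it possible to try the function against a temporary directory first.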