Getting Started

This guide will help you install and start using the Dataphy SDK for robotics data management and augmentation.

Installation

Requirements

  • Python 3.10 or higher
  • Poetry (recommended) or pip

Install with Poetry

# Clone the repository
git clone https://github.com/dataphy/dataphy.git
cd dataphy

# Install base dependencies
poetry install

# Install with all extras for full functionality
poetry install --extras "torch aws hf parquet rerun"

# Upgrade rerun for visualization compatibility
poetry run dataphy-upgrade-rerun

# Verify installation
poetry run dataphy version

Install with pip

pip install "dataphy[rerun]"

Quick Verification

Test your installation:

# Show available commands
dataphy --help

# List supported dataset formats
dataphy dataset list-formats

# Show version information
dataphy version

First Steps

1. Fetch a Dataset

Let's start by downloading a sample robotics dataset:

# Fetch a LeRobot dataset
dataphy dataset fetch \
  --format lerobot \
  --repo-id carpit680/giraffe_clean_desk2 \
  --output ./my-dataset

2. Explore the Dataset

# Get dataset information
dataphy dataset info --format lerobot --repo-id carpit680/giraffe_clean_desk2

# Load and inspect locally
dataphy dataset load --dataset-path ./my-dataset --info

# List episodes
dataphy dataset load --dataset-path ./my-dataset --list-episodes

# Examine specific episode
dataphy dataset load --dataset-path ./my-dataset --episode 0 --timestep 10

3. Visualize in 2D

# Launch interactive 2D visualization
dataphy dataset visualize --format lerobot --dataset-path ./my-dataset

4. Apply Augmentations

Create a simple augmentation config:

aug.yaml
version: 1
pipeline:
  sync_views: true
  steps:
    - name: color_jitter
      magnitude: 0.1
    - name: cutout
      holes: 1
      size_range: [8, 16]
  background:
    adapter: none
seed: 42
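The `seed` and `sync_views` options above control reproducibility: with a fixed seed, the same random draws are made on every run, and synced views share a single draw per step instead of being jittered independently. A minimal Python sketch of that idea (illustrative only, not the SDK's internals):

```python
import random

def jitter_for_views(num_views: int, magnitude: float, seed: int) -> list[float]:
    """Draw one brightness offset and reuse it for every camera view,
    mimicking what a sync_views-style option is meant to guarantee."""
    rng = random.Random(seed)                    # seeded generator -> reproducible runs
    offset = rng.uniform(-magnitude, magnitude)  # one draw shared by all views
    return [offset] * num_views

a = jitter_for_views(num_views=2, magnitude=0.1, seed=42)
b = jitter_for_views(num_views=2, magnitude=0.1, seed=42)
print(a == b)        # True: same seed, same augmentation
print(a[0] == a[1])  # True: views stay in sync
```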

Apply augmentations:

# Augment single episode (modifies original)
dataphy augment dataset \
  --dataset-path ./my-dataset \
  --config aug.yaml \
  --episode 0

# Create augmented dataset (preserves original)
dataphy augment dataset \
  --dataset-path ./my-dataset \
  --output-path ./augmented-dataset \
  --config examples/dataset_augmentation_config.yaml \
  --num-augmented 2

# Visualize results
dataphy dataset visualize --format lerobot --dataset-path ./augmented-dataset

Dataset Augmentation Deep Dive

Dataphy provides two powerful augmentation approaches:

1. Episode Augmentation

  • Purpose: Modify specific episodes in-place
  • Use case: Quick experimentation, single episode fixes
  • Result: Original dataset modified with backups

2. Full Dataset Augmentation

  • Purpose: Create entirely new datasets with multiple versions
  • Use case: Training data expansion, research experiments
  • Result: New dataset with original + augmented episodes

# Compare dataset sizes
dataphy dataset load --dataset-path ./my-dataset --info
# Episodes: 25

dataphy dataset load --dataset-path ./augmented-dataset --info
# Episodes: 75 (25 original + 50 augmented)
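Assuming `--num-augmented N` keeps every original episode and adds N augmented copies of each (as the counts above suggest), the expected output size is easy to compute:

```python
def augmented_episode_count(original: int, num_augmented: int) -> int:
    """Total episodes after full dataset augmentation: originals are kept,
    and each one contributes num_augmented extra augmented copies."""
    return original * (1 + num_augmented)

print(augmented_episode_count(25, 2))  # 75, matching the example above
```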

LeRobot Version Compatibility

Dataphy automatically detects and handles different LeRobot dataset formats:

  • v1.0: Early format with direct video structure
  • v1.1: Intermediate format with chunked data
  • v2.0: Modern format (e.g., carpit680/giraffe_clean_desk2)
  • v2.1: Latest format (e.g., lerobot/svla_so100_sorting)

# The system automatically detects and adapts:
# Detected LeRobot version: v2.0
# Structure type: chunked_with_meta

Configuration

Environment Variables

The SDK respects these environment variables:

  • DATAPHY_CACHE_DIR: Custom cache directory for datasets
  • HF_TOKEN: Hugging Face token for private datasets
  • RERUN_STRICT_MODE: Enable strict mode for visualization
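A short sketch of the usual precedence for such a variable — `DATAPHY_CACHE_DIR` wins when set, otherwise a default is used (the fallback path here is an assumption for illustration, not the SDK's documented default):

```python
import os
from pathlib import Path

def resolve_cache_dir() -> Path:
    """Return DATAPHY_CACHE_DIR when set, expanding ~; otherwise fall back
    to a default location (the fallback below is an assumed example)."""
    custom = os.environ.get("DATAPHY_CACHE_DIR")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / ".dataphy" / "cache"
```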

Config Files

You can create a global config file at ~/.dataphy/config.yaml:

# Global Dataphy configuration
cache_dir: "~/dataphy-cache"
default_format: "lerobot"
visualization:
  auto_open: true
  port: 9876
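Conceptually, a global config like this is merged over built-in defaults, with user keys overriding defaults and `~` in paths expanded. A minimal illustration (the default values and merge logic below are assumptions, not the SDK's actual loader):

```python
from pathlib import Path

# Assumed built-in defaults; key names mirror the example config above.
DEFAULTS = {
    "cache_dir": "~/.dataphy/cache",
    "default_format": "lerobot",
    "visualization": {"auto_open": False, "port": 9090},
}

def apply_config(defaults: dict, overrides: dict) -> dict:
    """Merge user overrides over defaults, recursing into nested sections
    and expanding ~ in cache_dir."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_config(merged[key], value)  # merge nested section
        else:
            merged[key] = value
    if "cache_dir" in merged:
        merged["cache_dir"] = str(Path(merged["cache_dir"]).expanduser())
    return merged

user_cfg = {"cache_dir": "~/dataphy-cache", "visualization": {"port": 9876}}
cfg = apply_config(DEFAULTS, user_cfg)
print(cfg["visualization"])  # auto_open kept from defaults, port overridden
```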