Project Page | Paper | arXiv | HuggingFace Model | Real-World Codebase | Twitter/X | YouTube Video
Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He
SPA is a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. It leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We also present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios.
🥳 NEWS:
- Jan. 2025: SPA is accepted by ICLR 2025!
- Oct. 2024: Codebase and pre-trained checkpoints are released! Paper is available on arXiv.
- Project Structure
- Installation
- Usage
- Pre-Training
- SPA Large-Scale Evaluation
- Gotchas
- License
- Acknowledgement
- Citation
Our codebase draws significant inspiration from the excellent Lightning Hydra Template. The directory structure of this project is organized as follows:
├── .github                   <- GitHub Actions workflows
│
├── configs                   <- Hydra configs
│   ├── callbacks                <- Callbacks configs
│   ├── data                     <- Data configs
│   ├── debug                    <- Debugging configs
│   ├── experiment               <- Experiment configs
│   ├── extras                   <- Extra utilities configs
│   ├── hydra                    <- Hydra configs
│   ├── local                    <- Local configs
│   ├── logger                   <- Logger configs
│   ├── model                    <- Model configs
│   ├── paths                    <- Project paths configs
│   ├── trainer                  <- Trainer configs
│   │
│   └── train.yaml               <- Main config for training
│
├── data                      <- Project data
│
├── logs                      <- Logs generated by hydra and lightning loggers
│
├── scripts                   <- Shell or Python scripts
│
├── spa                       <- Source code of SPA
│   ├── data                     <- Data scripts
│   ├── models                   <- Model scripts
│   ├── utils                    <- Utility scripts
│   │
│   └── train.py                 <- Run SPA pre-training
│
├── .gitignore                <- List of files ignored by git
├── .project-root             <- File for inferring the position of project root directory
├── requirements.txt          <- File for installing python dependencies
├── setup.py                  <- File for installing project as a package
└── README.md
⚠️ Warning: We have observed that using the latest PyTorch versions (e.g., PyTorch 2.6) can lead to different feature maps compared to our original experiments. Currently, we do not know the exact reason for these discrepancies, nor can we confirm whether these differences will impact the final evaluation results. For reproducibility, we strongly recommend using PyTorch 2.2.1, which is the version used in our original development and testing. If you choose to use a newer version, please be aware of this potential issue and proceed with caution.
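To make the recommendation above easy to enforce, a script can check the installed version before extracting features. The following is a minimal sketch, not part of the SPA codebase: the helper names `parse_version` and `matches_recommended_torch` are hypothetical, and the check only compares version strings.

```python
import re
import warnings

RECOMMENDED_TORCH = (2, 2, 1)  # version recommended in the warning above


def parse_version(version: str) -> tuple[int, int, int]:
    """Turn a string such as '2.2.1+cu118' into (2, 2, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    return major, minor, patch


def matches_recommended_torch(version: str) -> bool:
    """Return True for the recommended release; warn and return False otherwise."""
    if parse_version(version) == RECOMMENDED_TORCH:
        return True
    warnings.warn(
        f"PyTorch {version} differs from the recommended 2.2.1; "
        "feature maps may not match the original experiments."
    )
    return False
```

In practice you would call `matches_recommended_torch(torch.__version__)` once at the start of your script, after `import torch`.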
Basics
# clone project
git clone https://github.com/HaoyiZhu/SPA.git
cd SPA
# create conda environment
conda create -n spa python=3.11 -y
conda activate spa
# install PyTorch, please refer to https://pytorch.org/ for other CUDA versions
# e.g. cuda 11.8:
pip3 install torch==2.2.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# install basic packages
pip3 install -r requirements.txt
SPA
# (optional) if you want to use SPA's volume decoder
cd libs/spa-ops
pip install -e .
cd ../..
# install SPA, so that you can import from anywhere
pip install -e .
Example of Using SPA Pre-trained Encoder
We provide pre-trained SPA weights for feature extraction. The checkpoints are available on 🤗 Hugging Face. You don't need to download the weights manually, as SPA will handle this automatically if needed.
import torch
from spa.models import spa_vit_base_patch16, spa_vit_large_patch16
image = torch.rand((1, 3, 224, 224)) # range in [0, 1]
# Example usage of SPA-Large (recommended)
# or you can use `spa_vit_base_patch16` for SPA-base
model = spa_vit_large_patch16(pretrained=True)
model.eval()
# Freeze the model
model.freeze()
# (Recommended) move to CUDA
image = image.cuda()
model = model.cuda()
# Obtain the [CLS] token
cls_token = model(image) # torch.Size([1, 1024])
# Obtain the reshaped feature map concatenated with [CLS] token
feature_map_cat_cls = model(
image, feature_map=True, cat_cls=True
) # torch.Size([1, 2048, 14, 14])
# Obtain the reshaped feature map without [CLS] token
feature_map_wo_cls = model(
image, feature_map=True, cat_cls=False
) # torch.Size([1, 1024, 14, 14])
Note: The inputs will be automatically resized to 224 x 224 and normalized within the SPA ViT encoder.
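The shapes in the comments above follow from the 224 x 224 input and the patch size of 16 (224 / 16 = 14 patches per side). The sketch below reproduces that arithmetic. It is an illustration only: the function `expected_output_shapes` is hypothetical, and the assumption that `cat_cls=True` doubles the channel count is inferred from the shape comments above, not from the encoder's source.

```python
def expected_output_shapes(embed_dim, batch_size=1, image_size=224, patch_size=16):
    """Compute the output shapes one would expect for the three calls shown above."""
    grid = image_size // patch_size  # 224 // 16 = 14 patches per side
    return {
        "cls_token": (batch_size, embed_dim),
        "feature_map_wo_cls": (batch_size, embed_dim, grid, grid),
        # Assumption: concatenating the [CLS] token doubles the channel dimension.
        "feature_map_cat_cls": (batch_size, 2 * embed_dim, grid, grid),
    }


# SPA-Large reports 1024-dimensional features in the example above.
shapes = expected_output_shapes(embed_dim=1024)
for name, shape in shapes.items():
    print(name, shape)
```

This can be handy when sizing a downstream policy head, since the number of input channels depends on whether `cat_cls` is enabled.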
Example of Pre-Training on ScanNet
We give an example on pre-training SPA on the ScanNet v2 dataset.
- Prepare the dataset:
  - Download the ScanNet v2 dataset.
  - Pre-process and extract RGB-D images following PonderV2. The preprocessed data should be put under data/scannet/.
  - Pre-generate metadata for fast data loading. The following command will generate metadata under data/scannet/metadata:
    python scripts/generate_scannet_metadata.py
- Run the following command for pre-training. Remember to modify hyper-parameters such as the number of nodes and GPU devices according to your machines.
python spa/train.py experiment=spa_pretrain_vitl trainer.num_nodes=5 trainer.devices=8
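When you change `trainer.num_nodes` or `trainer.devices`, the total number of processes changes with them, and so does the effective batch size. The sketch below shows the arithmetic under the usual data-parallel assumption that each GPU processes its own batch. The per-device batch size of 4 is a hypothetical value, since this README does not state the one used for pre-training.

```python
def world_size(num_nodes: int, devices_per_node: int) -> int:
    """Total number of GPUs (processes) taking part in training."""
    return num_nodes * devices_per_node


def effective_batch_size(
    per_device_batch: int,
    num_nodes: int,
    devices_per_node: int,
    accumulate_grad_batches: int = 1,
) -> int:
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * world_size(num_nodes, devices_per_node) * accumulate_grad_batches


# The command above uses 5 nodes with 8 GPUs each.
print(world_size(num_nodes=5, devices_per_node=8))
# Hypothetical per-device batch size of 4, for illustration only.
print(effective_batch_size(per_device_batch=4, num_nodes=5, devices_per_node=8))
```

If you train on fewer GPUs, you may want to compensate with gradient accumulation or a learning-rate adjustment to stay close to the original setting.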
VC-1 Evaluation
We evaluate on VC-1's MetaWorld, Adroit, DMControl, and TriFinger benchmarks. We also maintain a forked version of the repository that includes code and configuration for evaluating SPA.
- Clone the forked VC-1 repo, or use the submodule by running git submodule update --init --recursive and cd evaluation/eai-vc. Then follow the instructions in the CortexBench README to set up the MuJoCo and TriFinger environments and to download the required datasets.
- Create a configuration file <spa_model>.yaml for SPA (e.g., spa_vit_large.yaml for SPA-Large) in <vc-1_path>/vc_models/src/vc_models/conf/model.
- To run the VC-1 evaluation for SPA, specify the model config as a parameter (embedding=<spa_model>) for each of the benchmarks in cortexbench.
LIBERO Evaluation
Please first run git submodule update --init --recursive. Then install the LIBERO environment:
cd evaluations/LIBERO
pip3 install -r requirements.txt
pip3 install -e .
Then download the LIBERO datasets:
python benchmark_scripts/download_libero_datasets.py
Then choose BENCHMARK from [LIBERO_SPATIAL, LIBERO_OBJECT, LIBERO_GOAL, LIBERO_90, LIBERO_10] and run the following:
export CUDA_VISIBLE_DEVICES=GPU_ID && \
export MUJOCO_EGL_DEVICE_ID=GPU_ID && \
python libero/lifelong/main.py seed=SEED \
benchmark_name=BENCHMARK \
policy=bc_transformer_policy \
lifelong=multitask \
policy/image_encoder=spa_encoder.yaml
Note that in the SPA paper, we remove all data augmentations, since we aim for a simple and fair setting rather than a SOTA policy. To do so, run the following:
export CUDA_VISIBLE_DEVICES=GPU_ID && \
export MUJOCO_EGL_DEVICE_ID=GPU_ID && \
python libero/lifelong/main.py seed=SEED \
benchmark_name=BENCHMARK \
policy=bc_transformer_policy \
lifelong=multitask \
policy/image_encoder=spa_encoder.yaml \
policy/data_augmentation@policy.color_aug=identity_aug.yaml \
policy/data_augmentation@policy.translation_aug=identity_aug.yaml
In SPA's experiments, to save time, we use only 20 demos for each task and train for only 25 epochs. To use 20 demos, you may need to manually modify the datasets.
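To run every benchmark without retyping the command, the argument list can be generated in Python. This is a minimal sketch that only assembles the commands shown above. The helper names `build_libero_command` and `build_environment` are hypothetical, and the seed value is a placeholder.

```python
import shlex

BENCHMARKS = ["LIBERO_SPATIAL", "LIBERO_OBJECT", "LIBERO_GOAL", "LIBERO_90", "LIBERO_10"]


def build_libero_command(benchmark: str, seed: int, no_augmentation: bool = True) -> list[str]:
    """Assemble the LIBERO training command for one benchmark and seed."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"Unknown benchmark: {benchmark}")
    command = [
        "python", "libero/lifelong/main.py",
        f"seed={seed}",
        f"benchmark_name={benchmark}",
        "policy=bc_transformer_policy",
        "lifelong=multitask",
        "policy/image_encoder=spa_encoder.yaml",
    ]
    if no_augmentation:
        # Same overrides as the augmentation-free command above.
        command += [
            "policy/data_augmentation@policy.color_aug=identity_aug.yaml",
            "policy/data_augmentation@policy.translation_aug=identity_aug.yaml",
        ]
    return command


def build_environment(gpu_id: int) -> dict[str, str]:
    """Environment variables that pin both CUDA and MuJoCo EGL to one GPU."""
    return {"CUDA_VISIBLE_DEVICES": str(gpu_id), "MUJOCO_EGL_DEVICE_ID": str(gpu_id)}


for benchmark in BENCHMARKS:
    print(shlex.join(build_libero_command(benchmark, seed=0)))
```

Each command can then be launched from evaluations/LIBERO with `subprocess.run(command, env={**os.environ, **build_environment(0)}, check=True)`.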
If you encounter the following error, it is caused by the NumPy version that LIBERO expects:
AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
You can downgrade your NumPy version:
pip install "numpy<1.24"
For more details, please refer to LIBERO's official documentation.
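If you want to detect the problem before launching a long run, you can test whether the installed NumPy satisfies the `numpy<1.24` pin used above. The sketch below works on a version string only. The function name `satisfies_numpy_pin` is hypothetical.

```python
import re


def satisfies_numpy_pin(numpy_version: str) -> bool:
    """Return True if the version is older than 1.24, i.e. matches 'numpy<1.24'."""
    match = re.match(r"(\d+)\.(\d+)", numpy_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {numpy_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (1, 24)


for version in ["1.23.5", "1.24.0", "1.26.4"]:
    print(version, satisfies_numpy_pin(version))
```

In a real environment you would pass `numpy.__version__` to the function and stop early with a clear message when it returns False.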
Camera Pose Evaluation
To reproduce the camera pose evaluation, we have open-sourced the code in evaluations/probe3d. Please first run git submodule update --init --recursive and cd evaluations/probe3d. Then follow the instructions in probe3d to prepare the NAVI dataset. Finally, run the following command to evaluate SPA:
python evaluate_navi_camera_pose.py
Featuremap Visualization
⚠️ Warning: We have observed that using the latest PyTorch versions (e.g., PyTorch 2.6) can lead to different feature maps compared to our original experiments. Currently, we do not know the exact reason for these discrepancies, nor can we confirm whether these differences will impact the final evaluation results. For reproducibility, we strongly recommend using PyTorch 2.2.1, which is the version used in our original development and testing. If you choose to use a newer version, please be aware of this potential issue and proceed with caution.
To reproduce the feature map visualization results, you can run with:
python scripts/visualize_featuremap.py --image_folder assets/feature_map_vis
Override any config parameter from command line
This codebase is based on Hydra, which allows for convenient configuration overriding:
python spa/train.py trainer.max_epochs=20 seed=300
Note: You can also add new parameters with the + sign.
python spa/train.py +some_new_param=some_new_value
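The difference between a plain override and a `+` override can be illustrated with a toy model. The sketch below is not Hydra's implementation. It mimics the basic rule on a nested dict: a plain `key=value` must target an existing key, while `+key=value` must introduce a new one. The starting config values are hypothetical.

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one 'a.b=value' or '+a.b=value' override to a nested dict in place."""
    is_addition = override.startswith("+")
    dotted_key, raw_value = override.removeprefix("+").split("=", 1)
    *parents, leaf = dotted_key.split(".")

    node = config
    for name in parents:
        if name not in node:
            if not is_addition:
                raise KeyError(f"'{dotted_key}' is not in the config; use '+' to add it")
            node[name] = {}
        node = node[name]

    if is_addition and leaf in node:
        raise KeyError(f"'{dotted_key}' already exists; drop the '+' to override it")
    if not is_addition and leaf not in node:
        raise KeyError(f"'{dotted_key}' is not in the config; use '+' to add it")
    node[leaf] = raw_value  # kept as a string here; Hydra itself parses typed values


config = {"trainer": {"max_epochs": 10}, "seed": 42}  # hypothetical defaults
for item in ["trainer.max_epochs=20", "seed=300", "+some_new_param=some_new_value"]:
    apply_override(config, item)
print(config)
```

The toy version keeps every value as a string and ignores the rest of Hydra's override grammar, so treat it only as a mental model for when the `+` prefix is needed.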
Train on CPU, GPU, multi-GPU and TPU
# train on CPU
python spa/train.py trainer=cpu
# train on 1 GPU
python spa/train.py trainer=gpu
# train on TPU
python spa/train.py +trainer.tpu_cores=8
# train with DDP (Distributed Data Parallel) (4 GPUs)
python spa/train.py trainer=ddp trainer.devices=4
# train with DDP (Distributed Data Parallel) (8 GPUs, 2 nodes)
python spa/train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2
# simulate DDP on CPU processes
python spa/train.py trainer=ddp_sim trainer.devices=2
# accelerate training on mac
python spa/train.py trainer=mps
Train with mixed precision
# train with PyTorch native automatic mixed precision (AMP)
python spa/train.py trainer=gpu +trainer.precision=16
Use different tricks available in PyTorch Lightning
# gradient clipping may be enabled to avoid exploding gradients
python spa/train.py trainer.gradient_clip_val=0.5
# run validation loop 4 times during a training epoch
python spa/train.py +trainer.val_check_interval=0.25
# accumulate gradients
python spa/train.py trainer.accumulate_grad_batches=10
# terminate training after 12 hours
python spa/train.py +trainer.max_time="00:12:00:00"
Note: PyTorch Lightning provides 40+ useful trainer flags.
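The `max_time` string above uses the DD:HH:MM:SS layout, which is easy to misread as HH:MM:SS with an extra field. As a small sketch for building or checking such values, the hypothetical helper below converts the string into a `timedelta`.

```python
from datetime import timedelta


def parse_max_time(value: str) -> timedelta:
    """Convert a 'DD:HH:MM:SS' string into a timedelta."""
    days, hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


print(parse_max_time("00:12:00:00"))  # 12:00:00
```

Note that a value such as "12:00:00:00" would mean twelve days, not twelve hours.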
Easily debug
# runs 1 epoch in default debugging mode
# changes logging directory to `logs/debugs/...`
# sets level of all command line loggers to 'DEBUG'
# enforces debug-friendly configuration
python spa/train.py debug=default
# run 1 train, val and test loop, using only 1 batch
python spa/train.py debug=fdr
# print execution time profiling
python spa/train.py debug=profiler
# try overfitting to 1 batch
python spa/train.py debug=overfit
# raise exception if there are any numerical anomalies in tensors, like NaN or +/-inf
python spa/train.py +trainer.detect_anomaly=true
# use only 20% of the data
python spa/train.py +trainer.limit_train_batches=0.2 \
+trainer.limit_val_batches=0.2 +trainer.limit_test_batches=0.2
Note: Visit configs/debug/ for different debugging configs.
Resume training from checkpoint
python spa/train.py ckpt_path="/path/to/ckpt/name.ckpt"
Note: The checkpoint can be either a path or a URL.
Note: Currently, loading a checkpoint doesn't resume the logger experiment, but this will be supported in a future Lightning release.
Create a sweep over hyperparameters
# this will run 9 experiments one after the other,
# each with different combination of seed and learning rate
python spa/train.py -m seed=100,200,300 model.optimizer.lr=0.0001,0.00005,0.00001
Note: Hydra composes configs lazily at job launch time. If you change code or configs after launching a job/sweep, the final composed configs might be impacted.
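The count of 9 experiments comes from the Cartesian product of the two value lists (3 seeds x 3 learning rates). The sketch below enumerates the combinations with `itertools.product`. It only illustrates the grid, and the order in which a sweeper actually launches the jobs may differ.

```python
from itertools import product

seeds = [100, 200, 300]
learning_rates = ["0.0001", "0.00005", "0.00001"]  # kept as strings to print verbatim

# One list of overrides per job in the sweep.
jobs = [
    [f"seed={seed}", f"model.optimizer.lr={lr}"]
    for seed, lr in product(seeds, learning_rates)
]

for index, overrides in enumerate(jobs):
    print(index, " ".join(overrides))
```

Adding a third swept parameter multiplies the number of jobs again, so it is worth computing the grid size before launching a sweep.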
Execute all experiments from folder
python spa/train.py -m 'exp_maniskill2_act_policy/maniskill2_task@maniskill2_task=glob(*)'
Note: Hydra provides special syntax for controlling the behavior of multiruns. The command above executes all task experiments from configs/exp_maniskill2_act_policy/maniskill2_task.
Execute run for multiple different seeds
python spa/train.py -m seed=100,200,300 trainer.deterministic=True
Note: trainer.deterministic=True makes PyTorch more deterministic but impacts performance.
For more instructions, refer to the official documentation for PyTorch Lightning, Hydra, and Lightning Hydra Template.
This repository is released under the MIT license.
Our work is primarily built upon PointCloudMatters, PonderV2, UniPAD, PyTorch Lightning, Hydra, Lightning Hydra Template, RLBench, PerAct, LIBERO, Meta-World, ACT, Diffusion Policy, DP3, TIMM, VC1, and R3M. We extend our gratitude to all these authors for their generously open-sourced code and their significant contributions to the community.
Contact Haoyi Zhu if you have any questions or suggestions.
@article{zhu2024spa,
title = {SPA: 3D Spatial-Awareness Enables Effective Embodied Representation},
author = {Zhu, Haoyi and Yang, Honghui and Wang, Yating and Yang, Jiange and Wang, Limin and He, Tong},
journal = {arXiv preprint arXiv:2410.08208},
year = {2024},
}