SPA: 3D SPatial-Awareness Enables Effective Embodied Representation

Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He

SPA is a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. It leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We also present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios.

🥳 NEWS:

Jan. 2025: SPA is accepted by ICLR 2025!
Oct. 2024: Codebase and pre-trained checkpoints are released! Paper is available on arXiv.

🔭 Project Structure

Our codebase draws significant inspiration from the excellent Lightning Hydra Template. The directory structure of this project is organized as follows:


├── .github                   <- GitHub Actions workflows
│
├── configs                   <- Hydra configs
│   ├── callbacks                         <- Callbacks configs
│   ├── data                              <- Data configs
│   ├── debug                             <- Debugging configs
│   ├── experiment                        <- Experiment configs
│   ├── extras                            <- Extra utilities configs
│   ├── hydra                             <- Hydra configs
│   ├── local                             <- Local configs
│   ├── logger                            <- Logger configs
│   ├── model                             <- Model configs
│   ├── paths                             <- Project paths configs
│   ├── trainer                           <- Trainer configs
│   │
│   └── train.yaml            <- Main config for training
│
├── data                   <- Project data
│
├── logs                   <- Logs generated by hydra and lightning loggers
│
├── scripts                <- Shell or Python scripts
│
├── spa                    <- Source code of SPA
│   ├── data                     <- Data scripts
│   ├── models                   <- Model scripts
│   ├── utils                    <- Utility scripts
│   │
│   └── train.py                 <- Run SPA pre-training
│
├── .gitignore                <- List of files ignored by git
├── .project-root             <- File for inferring the position of project root directory
├── requirements.txt          <- File for installing python dependencies
├── setup.py                  <- File for installing project as a package
└── README.md

🔨 Installation

⚠️ Warning: We have observed that newer PyTorch versions (e.g., PyTorch 2.6) can produce feature maps that differ from those in our original experiments. We do not yet know the cause of these discrepancies, nor whether they affect the final evaluation results. For reproducibility, we strongly recommend PyTorch 2.2.1, the version used in our original development and testing. If you choose a newer version, please be aware of this potential issue and proceed with caution.
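One way to guard against silently running with a different build is a small check at the top of your own scripts. The sketch below is not part of SPA: `check_torch_version` and `RECOMMENDED_TORCH` are hypothetical names, and the function takes the version string as an argument so it does not itself depend on PyTorch being importable.

```python
import warnings

RECOMMENDED_TORCH = "2.2.1"  # version recommended above for reproducibility


def check_torch_version(version: str, recommended: str = RECOMMENDED_TORCH) -> bool:
    """Return True if `version` matches the recommended release, else warn.

    Local build suffixes such as "+cu118" are ignored in the comparison.
    """
    base = version.split("+", 1)[0]
    if base == recommended:
        return True
    warnings.warn(
        f"PyTorch {version} detected; feature maps may differ from the "
        f"original experiments, which used {recommended}.",
        stacklevel=2,
    )
    return False


# Typical use (assumes torch is installed):
#   import torch
#   check_torch_version(torch.__version__)
```

The check only warns rather than raising, since the warning above says the effect on final results is unknown.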

Basics

# clone project
git clone https://github.com/HaoyiZhu/SPA.git
cd SPA

# create conda environment
conda create -n spa python=3.11 -y
conda activate spa

# install PyTorch, please refer to https://pytorch.org/ for other CUDA versions
# e.g. cuda 11.8:
pip3 install torch==2.2.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# install basic packages
pip3 install -r requirements.txt

SPA

# (optional) if you want to use SPA's volume decoder
cd libs/spa-ops
pip install -e .
cd ../..

# install SPA, so that you can import from anywhere
pip install -e .

🌟 Usage

Example of Using SPA Pre-trained Encoder

We provide pre-trained SPA weights for feature extraction. The checkpoints are available on 🤗Hugging Face. You don't need to manually download the weights, as SPA will automatically handle this if needed.

import torch

from spa.models import spa_vit_base_patch16, spa_vit_large_patch16

image = torch.rand((1, 3, 224, 224))  # range in [0, 1]

# Example usage of SPA-Large (recommended)
# or you can use `spa_vit_base_patch16` for SPA-base
model = spa_vit_large_patch16(pretrained=True)
model.eval()

# Freeze the model
model.freeze()

# (Recommended) move to CUDA
image = image.cuda()
model = model.cuda()

# Obtain the [CLS] token
cls_token = model(image)  # torch.Size([1, 1024])

# Obtain the reshaped feature map concatenated with [CLS] token
feature_map_cat_cls = model(
    image, feature_map=True, cat_cls=True
)  # torch.Size([1, 2048, 14, 14])

# Obtain the reshaped feature map without [CLS] token
feature_map_wo_cls = model(
    image, feature_map=True, cat_cls=False
)  # torch.Size([1, 1024, 14, 14])

Note: The inputs will be automatically resized to 224 x 224 and normalized within the SPA ViT encoder.
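The shapes in the comments above follow from simple arithmetic, sketched below. `spa_output_shapes` is a hypothetical helper, not part of the `spa` package. It assumes a 16-pixel patch size (suggested by the `patch16` model names) and that `cat_cls=True` concatenates the [CLS] token along the channel axis, which is inferred from the 2048-channel shape above rather than stated in this README.

```python
def spa_output_shapes(embed_dim: int, batch: int = 1,
                      image_size: int = 224, patch_size: int = 16) -> dict:
    """Expected output shapes for the three calls shown above."""
    grid = image_size // patch_size  # 224 // 16 = 14 patches per side
    return {
        "cls_token": (batch, embed_dim),
        "feature_map_wo_cls": (batch, embed_dim, grid, grid),
        # Assumption: the [CLS] token is concatenated channel-wise,
        # which doubles the channel count.
        "feature_map_cat_cls": (batch, 2 * embed_dim, grid, grid),
    }


# SPA-Large: embedding width 1024, as in the shape comments above.
for name, shape in spa_output_shapes(1024).items():
    print(name, shape)
```

For SPA-base, pass the encoder's embedding width instead of 1024; a standard ViT-Base uses 768, but this README does not state the value for SPA-base.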

🚀 Pre-Training

Example of Pre-Training on ScanNet

We give an example of pre-training SPA on the ScanNet v2 dataset.

Prepare the dataset
- Download the ScanNet v2 dataset.
- Pre-process and extract RGB-D images following PonderV2. The preprocessed data should be put under data/scannet/.
- Pre-generate metadata for fast data loading. The following command will generate metadata under data/scannet/metadata.
```
python scripts/generate_scannet_metadata.py
```
Run the following command for pre-training. Remember to modify hyper-parameters such as number of nodes and GPU devices according to your machines.
```
python spa/train.py experiment=spa_pretrain_vitl trainer.num_nodes=5 trainer.devices=8
```
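If you launch pre-training from a Python script (for example, from a cluster submission wrapper), the command above can be assembled programmatically. This is a minimal sketch: `build_pretrain_command` is a hypothetical helper, and the `num_nodes=1, devices=4` values in the usage line are placeholders for your own machine.

```python
import shlex


def build_pretrain_command(num_nodes: int, devices: int,
                           experiment: str = "spa_pretrain_vitl") -> list[str]:
    """Build the argv list for the pre-training command shown above."""
    if num_nodes < 1 or devices < 1:
        raise ValueError("num_nodes and devices must both be >= 1")
    return [
        "python", "spa/train.py",
        f"experiment={experiment}",
        f"trainer.num_nodes={num_nodes}",
        f"trainer.devices={devices}",
    ]


cmd = build_pretrain_command(num_nodes=1, devices=4)
print(shlex.join(cmd))
# To launch from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```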

💡 SPA Large-Scale Evaluation

VC-1 Evaluation

We evaluate on VC-1's MetaWorld, Adroit, DMControl, and TriFinger benchmarks. We also maintain a forked version of the VC-1 repository that includes the code and configuration for evaluating SPA.

- Clone the forked VC-1 repo, or use the submodule by running git submodule update --init --recursive and cd evaluations/eai-vc. Then follow the instructions in the CortexBench README to set up the MuJoCo and TriFinger environments and download the required datasets.
- Create a configuration for SPA, <spa_model>.yaml (e.g., spa_vit_large.yaml for SPA-Large), in <vc-1_path>/vc_models/src/vc_models/conf/model.
- To run the VC-1 evaluation for SPA, pass the model config as a parameter (embedding=<spa_model>) to each of the benchmarks in cortexbench.

LIBERO Evaluation

Please first run git submodule update --init --recursive. Then install the LIBERO environment:

cd evaluations/LIBERO
pip3 install -r requirements.txt
pip3 install -e .

Then you have to download LIBERO datasets:

python benchmark_scripts/download_libero_datasets.py

Then choose BENCHMARK from [LIBERO_SPATIAL, LIBERO_OBJECT, LIBERO_GOAL, LIBERO_90, LIBERO_10] and run the following, replacing GPU_ID and SEED with your own values:

export CUDA_VISIBLE_DEVICES=GPU_ID && \
export MUJOCO_EGL_DEVICE_ID=GPU_ID && \
python libero/lifelong/main.py seed=SEED \
                               benchmark_name=BENCHMARK \
                               policy=bc_transformer_policy \
                               lifelong=multitask \
                               policy/image_encoder=spa_encoder.yaml

Note that in the SPA paper, we remove all data augmentations, since we aim for a simple and fair setting rather than training a SOTA policy. To do the same, run the following:

export CUDA_VISIBLE_DEVICES=GPU_ID && \
export MUJOCO_EGL_DEVICE_ID=GPU_ID && \
python libero/lifelong/main.py seed=SEED \
                               benchmark_name=BENCHMARK \
                               policy=bc_transformer_policy \
                               lifelong=multitask \
                               policy/image_encoder=spa_encoder.yaml \
                               policy/data_augmentation@policy.color_aug=identity_aug.yaml \
                               policy/data_augmentation@policy.translation_aug=identity_aug.yaml
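To run several benchmarks and seeds in sequence, the commands above can be generated in a loop. The sketch below only builds the argument lists and environment variables; `libero_job` is a hypothetical helper, and the seeds `[0, 1, 2]` and `gpu_id=0` are placeholder values, not the ones used in the paper.

```python
import os
from itertools import product

BENCHMARKS = ["LIBERO_SPATIAL", "LIBERO_OBJECT", "LIBERO_GOAL",
              "LIBERO_90", "LIBERO_10"]


def libero_job(benchmark: str, seed: int, gpu_id: int,
               no_augmentation: bool = True) -> tuple[list[str], dict]:
    """Return (argv, env) for one LIBERO run, mirroring the commands above."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark}")
    argv = [
        "python", "libero/lifelong/main.py",
        f"seed={seed}",
        f"benchmark_name={benchmark}",
        "policy=bc_transformer_policy",
        "lifelong=multitask",
        "policy/image_encoder=spa_encoder.yaml",
    ]
    if no_augmentation:
        # Replace both augmentations with the identity, as in the second command.
        argv += [
            "policy/data_augmentation@policy.color_aug=identity_aug.yaml",
            "policy/data_augmentation@policy.translation_aug=identity_aug.yaml",
        ]
    env = dict(os.environ,
               CUDA_VISIBLE_DEVICES=str(gpu_id),
               MUJOCO_EGL_DEVICE_ID=str(gpu_id))
    return argv, env


jobs = [libero_job(b, s, gpu_id=0) for b, s in product(BENCHMARKS, [0, 1, 2])]
print(len(jobs), "jobs prepared")
# To launch each job from evaluations/LIBERO:
#   import subprocess; subprocess.run(argv, env=env, check=True)
```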

For speed, SPA's experiments use only 20 demos per task and train for only 25 epochs. To use 20 demos, you may need to modify the datasets manually.
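When trimming a dataset to 20 demos, take care to select them in numeric rather than lexicographic order, since plain string sorting places "demo_10" before "demo_2". The sketch below shows only the key selection. It assumes demos are stored under keys named "demo_0", "demo_1", and so on, which this README does not confirm, and `first_n_demos` is a hypothetical helper. Which 20 demos were kept in the original experiments is also not stated here; taking the lowest-numbered ones is just one choice.

```python
import re


def first_n_demos(demo_keys, n: int = 20) -> list[str]:
    """Return the n lowest-numbered demo keys, sorted by their numeric index."""
    def index(key: str) -> int:
        match = re.fullmatch(r"demo_(\d+)", key)
        if match is None:
            raise ValueError(f"unexpected demo key: {key!r}")
        return int(match.group(1))

    return sorted(demo_keys, key=index)[:n]


keys = [f"demo_{i}" for i in range(50)]
subset = first_n_demos(keys)
print(subset[0], subset[-1], len(subset))  # demo_0 demo_19 20
```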

If you encounter the following error, it is because LIBERO relies on the np.bool alias, which NumPy 1.24 removed:

AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

You can downgrade your numpy version:

pip install "numpy<1.24"
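To check an environment before starting a long run, you can compare the installed NumPy version against the same constraint. This is a minimal sketch: `satisfies_numpy_pin` is a hypothetical helper that only inspects the major and minor version numbers.

```python
def satisfies_numpy_pin(version: str) -> bool:
    """Return True if `version` satisfies the "numpy<1.24" constraint above."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (1, 24)


# Typical use (assumes numpy is installed):
#   import numpy as np
#   assert satisfies_numpy_pin(np.__version__), "run: pip install 'numpy<1.24'"
print(satisfies_numpy_pin("1.23.5"), satisfies_numpy_pin("1.26.4"))  # True False
```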

For more details, please refer to LIBERO's official documentation.

Camera Pose Evaluation

To reproduce the camera pose evaluation, we have open-sourced the code in evaluations/probe3d. Please first run git submodule update --init --recursive and cd evaluations/probe3d. Then follow the instructions in probe3d to prepare the NAVI dataset. Finally, run the following command to evaluate SPA:

python evaluate_navi_camera_pose.py

Featuremap Visualization

⚠️ Warning: We have observed that newer PyTorch versions (e.g., PyTorch 2.6) can produce feature maps that differ from those in our original experiments. We do not yet know the cause of these discrepancies, nor whether they affect the final evaluation results. For reproducibility, we strongly recommend PyTorch 2.2.1, the version used in our original development and testing. If you choose a newer version, please be aware of this potential issue and proceed with caution.

To reproduce the feature map visualization results, you can run with:

python scripts/visualize_featuremap.py --image_folder assets/feature_map_vis

🎉 Gotchas

Override any config parameter from command line

This codebase is based on Hydra, which allows for convenient configuration overriding:

python spa/train.py trainer.max_epochs=20 seed=300

Note: You can also add new parameters with the + sign.

python spa/train.py +some_new_param=some_new_value
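The difference between a plain override and a + override can be illustrated with a toy model of the rule: a plain `key=value` must target a key that already exists, while `+key=value` must introduce a new one. The sketch below is not Hydra's implementation. `apply_override` is a hypothetical helper, the starting config values are made up, and values are kept as strings, whereas Hydra parses them into typed values.

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one 'a.b=value' or '+a.b=value' override to a nested dict in place."""
    key, sep, value = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    adding = key.startswith("+")
    if adding:
        key = key[1:]
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        if part not in node:
            if not adding:
                raise KeyError(f"'{part}' not in config; prefix with '+' to add it")
            node[part] = {}
        node = node[part]
    if adding and leaf in node:
        raise KeyError(f"'{leaf}' already exists; drop the '+' to override it")
    if not adding and leaf not in node:
        raise KeyError(f"'{leaf}' not in config; prefix with '+' to add it")
    node[leaf] = value


# Made-up starting config, for illustration only.
config = {"trainer": {"max_epochs": "10"}, "seed": "42"}
for item in ["trainer.max_epochs=20", "seed=300", "+some_new_param=some_new_value"]:
    apply_override(config, item)
print(config)
```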

Train on CPU, GPU, multi-GPU and TPU

# train on CPU
python spa/train.py trainer=cpu

# train on 1 GPU
python spa/train.py trainer=gpu

# train on TPU
python spa/train.py +trainer.tpu_cores=8

# train with DDP (Distributed Data Parallel) (4 GPUs)
python spa/train.py trainer=ddp trainer.devices=4

# train with DDP (Distributed Data Parallel) (8 GPUs total: 4 per node on 2 nodes)
python spa/train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2

# simulate DDP on CPU processes
python spa/train.py trainer=ddp_sim trainer.devices=2

# accelerate training on mac
python spa/train.py trainer=mps
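Because `trainer.devices` counts devices per node, the total number of DDP processes is the product of devices and nodes, and each process consumes its own batch. The arithmetic is sketched below; `world_size` and `effective_batch_size` are hypothetical helpers, and the per-device batch size of 16 is a placeholder, not a value taken from SPA's configs.

```python
def world_size(devices: int, num_nodes: int = 1) -> int:
    """Total number of DDP processes: one per device on every node."""
    return devices * num_nodes


def effective_batch_size(per_device_batch: int, devices: int,
                         num_nodes: int = 1,
                         accumulate_grad_batches: int = 1) -> int:
    """Samples contributing to each optimizer step under DDP."""
    return per_device_batch * world_size(devices, num_nodes) * accumulate_grad_batches


print(world_size(devices=4, num_nodes=2))                 # 8
print(effective_batch_size(16, devices=4, num_nodes=2))   # 128
```

Keep this in mind when changing the number of GPUs: the effective batch size changes with it unless the per-device batch size or gradient accumulation is adjusted to compensate.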

Train with mixed precision

# train with PyTorch native automatic mixed precision (AMP)
python spa/train.py trainer=gpu +trainer.precision=16

Use different tricks available in PyTorch Lightning

# gradient clipping may be enabled to avoid exploding gradients
python spa/train.py trainer.gradient_clip_val=0.5

# run validation loop 4 times during a training epoch
python spa/train.py +trainer.val_check_interval=0.25

# accumulate gradients
python spa/train.py trainer.accumulate_grad_batches=10

# terminate training after 12 hours
python spa/train.py +trainer.max_time="00:12:00:00"

Note: PyTorch Lightning provides 40+ useful trainer flags.
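The `max_time` string uses the DD:HH:MM:SS format, so "00:12:00:00" means 0 days and 12 hours. A minimal sketch for checking such a value before launching a long run (`parse_max_time` is a hypothetical helper, not a Lightning function):

```python
from datetime import timedelta


def parse_max_time(value: str) -> timedelta:
    """Parse a 'DD:HH:MM:SS' string into a timedelta."""
    parts = value.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected DD:HH:MM:SS, got {value!r}")
    days, hours, minutes, seconds = (int(p) for p in parts)
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


print(parse_max_time("00:12:00:00"))  # 12:00:00
```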

Easily debug

# runs 1 epoch in default debugging mode
# changes logging directory to `logs/debugs/...`
# sets level of all command line loggers to 'DEBUG'
# enforces debug-friendly configuration
python spa/train.py debug=default

# run 1 train, val and test loop, using only 1 batch
python spa/train.py debug=fdr

# print execution time profiling
python spa/train.py debug=profiler

# try overfitting to 1 batch
python spa/train.py debug=overfit

# raise exception if there are any numerical anomalies in tensors, like NaN or +/-inf
python spa/train.py +trainer.detect_anomaly=true

# use only 20% of the data
python spa/train.py +trainer.limit_train_batches=0.2 \
+trainer.limit_val_batches=0.2 +trainer.limit_test_batches=0.2

Note: Visit configs/debug/ for different debugging configs.

Resume training from checkpoint

python spa/train.py ckpt_path="/path/to/ckpt/name.ckpt"

Note: The checkpoint can be either a path or a URL.

Note: Loading a checkpoint currently does not resume the logger experiment, but this will be supported in a future Lightning release.

Create a sweep over hyperparameters

# this will run 9 experiments one after the other,
# each with different combination of seed and learning rate
python spa/train.py -m seed=100,200,300 model.optimizer.lr=0.0001,0.00005,0.00001

Note: Hydra composes configs lazily at job launch time. If you change code or configs after launching a job/sweep, the final composed configs might be impacted.
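The count of 9 comes from the Cartesian product of the two comma-separated lists (3 seeds × 3 learning rates). The sketch below enumerates the combinations with the standard library; it only illustrates which runs the sweep contains and does not claim to reproduce the order in which Hydra launches them.

```python
from itertools import product

seeds = ["100", "200", "300"]
learning_rates = ["0.0001", "0.00005", "0.00001"]

# One override string per run in the sweep.
runs = [
    f"seed={seed} model.optimizer.lr={lr}"
    for seed, lr in product(seeds, learning_rates)
]

print(len(runs))  # 9
for run in runs:
    print(run)
```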

Execute all experiments from folder

python spa/train.py -m 'exp_maniskill2_act_policy/maniskill2_task@maniskill2_task=glob(*)'

Note: Hydra provides special syntax for controlling behavior of multiruns. Learn more here. The command above executes all task experiments from configs/exp_maniskill2_act_policy/maniskill2_task.

Execute run for multiple different seeds

python spa/train.py -m seed=100,200,300 trainer.deterministic=True

Note: trainer.deterministic=True makes PyTorch more deterministic but can reduce performance.

For more instructions, refer to the official documentation for PyTorch Lightning, Hydra, and Lightning Hydra Template.

📚 License

This repository is released under the MIT license.

✨ Acknowledgement

Our work is primarily built upon PointCloudMatters, PonderV2, UniPAD, PyTorch Lightning, Hydra, Lightning Hydra Template, RLBench, PerAct, LIBERO, Meta-World, ACT, Diffusion Policy, DP3, TIMM, VC1, and R3M. We extend our gratitude to all these authors for their generously open-sourced code and their significant contributions to the community.

Contact Haoyi Zhu if you have any questions or suggestions.

📝 Citation

@article{zhu2024spa,
    title = {SPA: 3D Spatial-Awareness Enables Effective Embodied Representation},
    author = {Zhu, Haoyi and Yang, Honghui and Wang, Yating and Yang, Jiange and Wang, Limin and He, Tong},
    journal = {arXiv preprint arXiv:2410.08208},
    year = {2024},
}
