VLM-Lens


Environment Setup

We recommend using a virtual environment to manage your dependencies. You can create and activate one with the following commands:

virtualenv --no-download "venv/vlm-lens-base" --prompt "vlm-lens-base"  # Or "python3.10 -m venv venv/vlm-lens-base"
source venv/vlm-lens-base/bin/activate

Then, install the required dependencies:

pip install --upgrade pip
pip install -r envs/base/requirements.txt

Some models require different dependencies, and we recommend creating a separate virtual environment for each of them to avoid conflicts. For such models, we provide a separate requirements file under envs/<model_name>/requirements.txt, which can be installed in the same way as above. Each model-specific environment is independent of the base environment and can be installed on its own.

Notes:

  1. There may be local constraints (e.g., cluster-specific regulations) that cause the above commands to fail. In such cases, feel free to adapt them as needed. We welcome issues and pull requests to help us keep the dependencies up to date.
  2. Some models, due to the resources available at development time, may not be fully supported on modern GPUs. While our released environments are tested on L40S GPUs, we recommend following the error messages to adjust the environment setup for your specific hardware.

Example Usage: Extract Qwen2-VL-2B Embeddings with VLM-Lens

General Command-Line Demo

The general command to run the quick command-line demo is:

python src/main.py \
  --config <config-file-path> \
  --debug

where the optional --debug flag prints more detailed output.

Note that the config file should be in YAML format, and that any arguments you want to pass to the Hugging Face API should go under the model key. See configs/models/qwen/qwen-2b.yaml for an example.

Run Qwen2-VL-2B Embeddings Extraction

The file configs/models/qwen/qwen-2b.yaml contains the configuration for running the Qwen2-VL-2B model.

architecture: qwen  # Architecture of the model, see more options in src/models/configs.py
model_path: Qwen/Qwen2-VL-2B-Instruct  # HuggingFace model path
model:  # Model configuration, i.e., arguments to pass to the model
  - torch_dtype: auto
output_db: output/qwen.db  # Output database file to store embeddings
input_dir: ./data/  # Directory containing images to process
prompt: "Describe the color in this image in one word."  # Textual prompt
pooling_method: None  # Pooling method for aggregating token embeddings (options: None, mean, max)
modules:  # List of modules to extract embeddings from
  - lm_head
  - visual.blocks.31

To run the extraction on an available GPU, use the following command:

python src/main.py --config configs/models/qwen/qwen-2b.yaml --debug

If there is no GPU available, you can run it on CPU with:

python src/main.py --config configs/models/qwen/qwen-2b.yaml --device cpu --debug
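
Under the hood, extracting intermediate embeddings from a PyTorch model is typically done by registering forward hooks on the modules of interest. The following is a generic, minimal sketch of that pattern on a toy model; it is illustrative only and is not the VLM-Lens implementation itself.

# Generic forward-hook sketch for capturing a module's output (illustrative only).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
captured = {}

def hook(module, inputs, output):
    # Store the output of the hooked module so it can be saved later.
    captured["embedding"] = output.detach()

handle = model[0].register_forward_hook(hook)
_ = model(torch.randn(2, 8))
handle.remove()
print(captured["embedding"].shape)  # torch.Size([2, 16])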

Layers of Interest in a VLM

Retrieving All Named Modules

Unfortunately, there is no way to find out which layers are available to match against without loading the model, which can take a significant amount of time.

Instead, we offer cached results under logs/ for each model, which were generated by including the -l or --log-named-modules flag when running python src/main.py.

When using this flag, it is not necessary to set modules or anything else besides the architecture and the HuggingFace model path.
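
For reference, these names correspond to PyTorch's model.named_modules(). A minimal sketch of listing them directly, assuming a recent transformers version is installed (this is not the VLM-Lens logging code itself, and loading the full model may take a while):

# List named modules of a Hugging Face model (illustrative sketch).
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
for name, _module in model.named_modules():
    print(name)  # e.g., model.layers.0.self_attn.q_proj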

Matching Layers

To specify which layers to match, use Unix-style patterns, where * denotes a wildcard.

For example, to match all of the attention layers' query projection layers in Qwen, simply add the following lines to the .yaml file:

modules:
  - model.layers.*.self_attn.q_proj
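
The wildcard matching follows Unix filename patterns. Below is a minimal sketch of how such a pattern selects module names, using Python's fnmatch; this is illustrative and not necessarily the exact matching logic used internally.

# Illustrative Unix-style wildcard matching over module names.
import fnmatch

named_modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "visual.blocks.31",
]
pattern = "model.layers.*.self_attn.q_proj"
matched = [name for name in named_modules if fnmatch.fnmatch(name, pattern)]
print(matched)  # ['model.layers.0.self_attn.q_proj', 'model.layers.1.self_attn.q_proj']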

Feature Extraction using HuggingFace Datasets

VLM-Lens can be used with hosted or local datasets; there are multiple methods depending on where the input images are located.

First, your dataset must be standardized to a format that includes the attributes prompt, label, and image_path. Here is a snippet of the compling/coco-val2017-obj-qa-categories dataset, adjusted to include these attributes:

id       prompt                                            label  image_path
397,133  Is this A photo of a dining table on the bottom   yes    /path/to/397133.png
37,777   Is this A photo of a dining table on the top      no     /path/to/37777.png

This can be done manually or by using the helper script in scripts/map_datasets.py.
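
As a minimal sketch of the target schema (illustrative only; this is not the scripts/map_datasets.py helper, and the paths are placeholders), a tiny dataset with the required columns can be built with the datasets library:

# Build a tiny dataset with the required prompt/label/image_path columns (illustrative).
from datasets import Dataset

ds = Dataset.from_dict({
    "prompt": ["Is this A photo of a dining table on the bottom"],
    "label": ["yes"],
    "image_path": ["/path/to/397133.png"],
})
print(ds.column_names)  # ['prompt', 'label', 'image_path']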

Method 1: Using hosted datasets

If you are using datasets hosted on a platform such as HuggingFace, the images will either be hosted as well, or downloaded locally with an identifier (e.g., a filename) that maps back to the hosted dataset.

Set the dataset_path attribute in your configuration file, along with the appropriate dataset_split (if one exists; otherwise leave it out).

1(a): Hosted Dataset with Hosted Images

dataset:
  - dataset_path: compling/coco-val2017-obj-qa-categories
  - dataset_split: val2017

1(b): Hosted Dataset with Local Images

🚨 NOTE: The image_path attribute in the dataset must contain either filenames or relative paths, such that a cell value of train/00023.png can be joined with image_dataset_path to form the full absolute path: /path/to/local/images/train/00023.png. If the image_path attribute does not require any additional path joining, you can leave out the image_dataset_path attribute.

dataset:
  - dataset_path: compling/coco-val2017-obj-qa-categories
  - dataset_split: val2017
  - image_dataset_path: /path/to/local/images  # downloaded using configs/dataset/download-coco.yaml

Method 2: Using local datasets

2(a): Local Dataset containing Image Files

dataset:
  - local_dataset_path: /path/to/local/CLEVR
  - dataset_split: train # leave out if unspecified

2(b): Local Dataset with Separate Input Image Directory

🚨 NOTE: The image_path attribute in the dataset must contain either filenames or relative paths, such that a cell value of train/00023.png can be joined with image_dataset_path to form the full absolute path: /path/to/local/images/train/00023.png. If the image_path attribute does not require any additional path joining, you can leave out the image_dataset_path attribute.

dataset:
  - local_dataset_path: /path/to/local/CLEVR
  - dataset_split: train # leave out if unspecified
  - image_dataset_path: /path/to/local/CLEVR/images

Output Database

The output database, specified by the -o or --output-db flag (or the output_db key in the config file), stores the extracted embeddings. It contains a single table named tensors with the following columns:

name, architecture, timestamp, image_path, prompt, label, layer, pooling_method, tensor_dim, tensor

where each column contains:

  1. name is the model path from HuggingFace.
  2. architecture is the architecture flag, as listed in the supported options above.
  3. timestamp is the time at which the model was run.
  4. image_path is the absolute path to the image.
  5. prompt stores the prompt used for that instance.
  6. label is an optional cell that stores the "ground-truth" answer, which is helpful in use cases such as classification.
  7. layer is the matched layer from model.named_modules().
  8. pooling_method is the pooling method used for aggregating token embeddings.
  9. tensor_dim is the dimension of the saved tensor.
  10. tensor is the saved embedding.
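
For quick inspection, the output .db file can be queried directly. A minimal sketch, assuming it is a standard SQLite database and reading only the metadata columns (the on-disk serialization of the tensor blob is not covered here):

# Inspect the output database (illustrative sketch; metadata columns only).
import sqlite3

conn = sqlite3.connect("output/qwen.db")
rows = conn.execute(
    "SELECT name, layer, image_path, prompt, tensor_dim FROM tensors LIMIT 5"
).fetchall()
for name, layer, image_path, prompt, tensor_dim in rows:
    print(name, layer, image_path, prompt, tensor_dim)
conn.close()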

Principal Component Analysis over Primitive Concepts

Data Collection

Download license-free images for primitive concepts (e.g., colors):

pip install -r data/concepts/requirements.txt
python data/concepts/download.py --config configs/concepts/colors.yaml

Embedding Extraction

Run the LLaVA model to obtain embeddings of the concept images:

python src/main.py --config configs/models/llava-7b/llava-7b-concepts-colors.yaml --device cuda

Also, run the LLaVA model to obtain embeddings of the test images:

python src/main.py --config configs/models/llava-7b/llava-7b.yaml --device cuda

Run PCA

Several PCA-based analysis scripts are provided:

pip install -r src/concepts/requirements.txt
python src/concepts/pca.py
python src/concepts/pca_knn.py
python src/concepts/pca_separation.py
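
Conceptually, these scripts project the extracted embeddings into a low-dimensional space. A minimal, self-contained sketch of PCA over a matrix of embeddings with scikit-learn (illustrative only; the random matrix here is a stand-in for vectors loaded from the output database):

# PCA over a matrix of embeddings (illustrative; random data as a stand-in).
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.randn(100, 1536)  # 100 embeddings of dimension 1536
pca = PCA(n_components=2)
projected = pca.fit_transform(embeddings)
print(projected.shape)  # (100, 2)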

Contributing to VLM-Lens

We welcome contributions to VLM-Lens! If you have suggestions, improvements, or bug fixes, please consider submitting a pull request; we actively review them.

We generally follow the Google Python Style Guide to ensure readability, with a few exceptions stated in .flake8. We use pre-commit hooks to ensure code quality and consistency, so please make sure to run the following commands before committing:

pip install pre-commit
pre-commit install

Miscellaneous

Using a Cache

To use a specific cache directory, set the HF_HOME environment variable as follows:

HF_HOME=./cache/ python src/main.py --config configs/models/clip/clip.yaml --debug

Using Submodule-Based Models

Some models, such as Glamm, require separate submodules to be cloned. To use these models, follow the instructions below to download the submodules.

Glamm

For Glamm (GroundingLMM), one needs to clone the separate submodules, which can be done with the following command:

git submodule update --recursive --init

See our documentation for details on the installation.

About

Extracting internal representations from vision-language models. Documentation: https://compling-wat.github.io/vlm-lens/
