- Environment Setup
- Example Usage: Extract Qwen2-VL-2B Embeddings with VLM-Lens
- Layers of Interest in a VLM
- Feature Extraction using HuggingFace Datasets
- Output Database
- Demo: Principal Component Analysis over Primitive Concept
- Contributing to VLM-Lens
- Miscellaneous
We recommend using a virtual environment to manage your dependencies. You can create and activate one with the following commands:
virtualenv --no-download "venv/vlm-lens-base" --prompt "vlm-lens-base" # Or "python3.10 -m venv venv/vlm-lens-base"
source venv/vlm-lens-base/bin/activate
Then, install the required dependencies:
pip install --upgrade pip
pip install -r envs/base/requirements.txt
Some models require different dependencies, and we recommend creating a separate virtual environment for each of them to avoid conflicts.
For such models, we provide a separate requirements.txt file under envs/<model_name>/requirements.txt, which can be installed in the same way as above.
All model-specific environments are independent of the base environment and can be installed individually.
Notes:
- There may be local constraints (e.g., cluster regulations) that cause the above commands to fail. In such cases, feel free to modify them as needed. We welcome issues and pull requests to help us keep the dependencies up to date.
- Due to the resources available at development time, some models may not be fully supported on newer GPUs. Our released environments are tested on L40s GPUs; if you run into errors on other hardware, follow the error messages to adjust the environment setup accordingly.
The general command to run the quick command-line demo is:
python src/main.py \
--config <config-file-path> \
--debug
with an optional --debug flag to see more detailed output.
Note that the config file should be in YAML format, and any arguments you want to pass to the HuggingFace API should go under the model key. See configs/models/qwen/qwen-2b.yaml as an example.
The file configs/models/qwen/qwen-2b.yaml contains the configuration for running the Qwen2-VL-2B model:
architecture: qwen # Architecture of the model, see more options in src/models/configs.py
model_path: Qwen/Qwen2-VL-2B-Instruct # HuggingFace model path
model: # Model configuration, i.e., arguments to pass to the model
- torch_dtype: auto
output_db: output/qwen.db # Output database file to store embeddings
input_dir: ./data/ # Directory containing images to process
prompt: "Describe the color in this image in one word." # Textual prompt
pooling_method: None # Pooling method for aggregating token embeddings (options: None, mean, max)
modules: # List of modules to extract embeddings from
- lm_head
- visual.blocks.31
To run the extraction on an available GPU, use the following command:
python src/main.py --config configs/models/qwen/qwen-2b.yaml --debug
If there is no GPU available, you can run it on CPU with:
python src/main.py --config configs/models/qwen/qwen-2b.yaml --device cpu --debug
Unfortunately, there is no way to find out which layers can be matched without loading the model, which can take a considerable amount of time.
Instead, we offer cached results under logs/ for each model, generated by including the -l or --log-named-modules flag when running python src/main.py.
When running with this flag, you only need to set the architecture and the HuggingFace model path; modules and other settings are not required.
To specify which layers to extract, use Unix-style patterns, where * denotes a wildcard.
For example, to match the query projection layer of every attention layer in Qwen, add the following lines to the .yaml file:
modules:
- model.layers.*.self_attn.q_proj
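To see how such a pattern resolves, here is a minimal Python sketch that matches Unix-style patterns against module names with the standard fnmatch module; it only mirrors the idea and is not the exact matching code used inside VLM-Lens.
import fnmatch

# Hypothetical module names, as they would appear in model.named_modules()
# (see the cached lists under logs/ for the real names of each model).
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "visual.blocks.31",
]

pattern = "model.layers.*.self_attn.q_proj"
matched = [name for name in module_names if fnmatch.fnmatch(name, pattern)]
print(matched)  # ['model.layers.0.self_attn.q_proj', 'model.layers.1.self_attn.q_proj']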
VLM-Lens can be used with either hosted or local datasets; the appropriate method depends on where the input images are stored.
First, your dataset must be standardized to a format that includes the attributes prompt, label, and image_path. Here is a snippet of the compling/coco-val2017-obj-qa-categories dataset, adjusted to include these attributes:
| id | prompt | label | image_path |
|---|---|---|---|
| 397133 | Is this A photo of a dining table on the bottom | yes | /path/to/397133.png |
| 37777 | Is this A photo of a dining table on the top | no | /path/to/37777.png |
This can be achieved manually or with the helper script in scripts/map_datasets.py.
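As a rough illustration of the manual route, here is a sketch using the HuggingFace datasets library; the source column names (question, answer, file_name) are hypothetical placeholders, and the actual interface of scripts/map_datasets.py may differ.
# Sketch: standardize a dataset to the prompt/label/image_path format.
# The source columns "question", "answer", and "file_name" are hypothetical;
# rename them to match your own dataset.
from datasets import load_dataset

dataset = load_dataset("json", data_files="my_dataset.json", split="train")

def to_vlm_lens_format(example):
    return {
        "prompt": example["question"],
        "label": example["answer"],
        "image_path": example["file_name"],  # filename or relative path
    }

dataset = dataset.map(to_vlm_lens_format)
dataset = dataset.remove_columns(
    [c for c in dataset.column_names if c not in {"id", "prompt", "label", "image_path"}]
)
print(dataset.column_names)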
If you are using a dataset hosted on a platform such as HuggingFace, you will either use images that are also hosted, or images downloaded locally with an identifier (e.g., a filename) that maps back to the hosted dataset.
Set the dataset_path attribute in your configuration file, along with the appropriate dataset_split (if one exists; otherwise leave it out):
dataset:
- dataset_path: compling/coco-val2017-obj-qa-categories
- dataset_split: val2017
🚨 NOTE: The image_path attribute in the dataset must contain either filenames or relative paths, such that a cell value of train/00023.png can be joined with image_dataset_path to form the full absolute path /path/to/local/images/train/00023.png. If the image_path attribute does not require any additional path joining, you can leave out the image_dataset_path attribute.
dataset:
- dataset_path: compling/coco-val2017-obj-qa-categories
- dataset_split: val2017
- image_dataset_path: /path/to/local/images # downloaded using configs/dataset/download-coco.yaml
If your dataset is stored locally, use the local_dataset_path attribute instead, along with the appropriate dataset_split:
dataset:
- local_dataset_path: /path/to/local/CLEVR
- dataset_split: train # leave out if unspecified
🚨 NOTE: The same image_path requirement described above applies to local datasets. If the image paths need to be joined with a local image directory, also set image_dataset_path:
dataset:
- local_dataset_path: /path/to/local/CLEVR
- dataset_split: train # leave out if unspecified
- image_dataset_path: /path/to/local/CLEVR/images
The output database is specified by the -o or --output-db flag (or the output_db key in the config file). It contains a single SQL table named tensors with the following columns:
name, architecture, timestamp, image_path, prompt, label, layer, pooling_method, tensor_dim, tensor
where each column contains:
- name is the model path from HuggingFace.
- architecture is one of the supported architecture flags above.
- timestamp is the time at which the model was run.
- image_path is the absolute path to the image.
- prompt stores the prompt used in that instance.
- label is an optional cell that stores the "ground-truth" answer, which is helpful in use cases such as classification.
- layer is the matched layer from model.named_modules().
- pooling_method is the pooling method used for aggregating token embeddings.
- tensor_dim is the dimension of the saved tensor.
- tensor is the saved embedding.
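To inspect the stored embeddings, the database can be queried directly. Below is a minimal sketch that assumes the .db file is SQLite (as the extension suggests); how the tensor column itself is serialized is not shown here, so check the VLM-Lens source before deserializing it.
# Sketch: inspect the output database (assumed to be SQLite).
import sqlite3

conn = sqlite3.connect("output/qwen.db")
rows = conn.execute(
    "SELECT name, layer, prompt, label, tensor_dim FROM tensors LIMIT 5"
).fetchall()
for name, layer, prompt, label, tensor_dim in rows:
    print(name, layer, prompt, label, tensor_dim)
conn.close()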
Download license-free images for primitive concepts (e.g., colors):
pip install -r data/concepts/requirements.txt
python data/concepts/download.py --config configs/concepts/colors.yaml
Run the LLaVA model to obtain embeddings of the concept images:
python src/main.py --config configs/models/llava-7b/llava-7b-concepts-colors.yaml --device cuda
Also, run the LLaVA model to obtain embeddings of the test images:
python src/main.py --config configs/models/llava-7b/llava-7b.yaml --device cuda
Several PCA-based analysis scripts are provided:
pip install -r src/concepts/requirements.txt
python src/concepts/pca.py
python src/concepts/pca_knn.py
python src/concepts/pca_separation.py
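For orientation, the core step behind these scripts is a PCA projection of the stored embeddings. The following is a minimal, self-contained sketch with scikit-learn and random stand-in data; the actual scripts under src/concepts/ read embeddings from the output database and differ in detail.
# Sketch: PCA over concept embeddings with scikit-learn.
# Random arrays stand in for real layer embeddings and concept labels.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 768))  # (num_samples, hidden_dim)
labels = rng.integers(0, 5, size=100)     # e.g., one of five color concepts

pca = PCA(n_components=2)
projected = pca.fit_transform(embeddings)  # 2-D projection for visualization
print(pca.explained_variance_ratio_)       # variance captured by each component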
We welcome contributions to VLM-Lens! If you have suggestions, improvements, or bug fixes, please consider submitting a pull request; we actively review them.
We generally follow the Google Python Style Guide to ensure readability, with a few exceptions stated in .flake8.
We use pre-commit hooks to ensure code quality and consistency. Please make sure to run the following commands before committing:
pip install pre-commit
pre-commit install
To use a specific cache directory, set the HF_HOME environment variable as follows:
HF_HOME=./cache/ python src/main.py --config configs/models/clip/clip.yaml --debug
Some models require separate submodules to be cloned, such as Glamm. To use these models, please follow the instructions below to download the submodules.
For Glamm (GroundingLMM), clone the required submodules with the following command:
git submodule update --recursive --init
See our documentation for details on the installation.