Modalities got latent: a novel approach that uses latent spaces for more efficient resource usage in multi-modal models.

[Link to the deployed app is attached below]

We present a novel approach to solving multi-modal problems that we call "Latent Processing". The approach was heavily inspired by DeepSeek's Multi-head Latent Attention paper, published in 2024. However, instead of applying latent compression only to the key-value cache, Latent Processing compresses the Query, Key, and Value vectors into latent spaces at the very beginning, before passing them to the transformer encoder blocks, as shown in the image below.

[Figure: Latent Processing architecture, with Q, K, and V compressed into a latent space before the transformer encoder blocks]
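A minimal sketch of the idea in PyTorch, assuming hypothetical module names and dimensions (d_model and d_latent are placeholders; the actual implementation in this repo may differ): the token embeddings are down-projected to a smaller latent width once, up front, and the encoder's attention then computes Q, K, and V entirely in that latent space.

```python
import torch
import torch.nn as nn

class LatentSelfAttention(nn.Module):
    """Self-attention whose Q/K/V live in a compressed latent space.

    Hypothetical sketch: tokens of width d_model are down-projected to
    d_latent before attention, so the attention matmuls scale with
    d_latent rather than d_model.
    """
    def __init__(self, d_model=512, d_latent=128, n_heads=4):
        super().__init__()
        # Compress once, up front, before the encoder block sees the tokens.
        self.compress = nn.Linear(d_model, d_latent)
        self.attn = nn.MultiheadAttention(d_latent, n_heads, batch_first=True)
        # Project back up so downstream layers keep the original width.
        self.expand = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = self.compress(x)         # (B, T, d_latent)
        out, _ = self.attn(z, z, z)  # Q, K, and V all in latent space
        return self.expand(out)      # (B, T, d_model)

x = torch.randn(2, 16, 512)          # batch of 2 sequences, 16 tokens each
y = LatentSelfAttention()(x)
print(y.shape)                       # torch.Size([2, 16, 512])
```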

This approach not only makes the model easier to train, but also enables faster inference on lower-end hardware, in a world where high-end GPUs are getting more expensive and harder to procure by the day...

With further improvement, this model could be of much use in edge-AI applications such as robotics. Paired with proper robotics hardware and a more diverse real-world dataset for scene reasoning (like GQA), this approach could certainly qualify for the big league of Vision-Language-Action models!

The Visual Question Answering model we built with this approach is merely a demonstration of Latent Processing's capabilities...

Team IkAI members:
Srijito Ghosh (GitHub: https://www.github.com/Srijito354)
Muskan Kumari (GitHub: https://www.github.com/Muskan040399)

In this project, we built a Visual Question Answering (VQA) web app using a CLIP model written entirely from scratch and trained with the same original-to-latent-space compression technique mentioned above. It was trained on the EasyVQA dataset (GitHub: https://github.com/vzhou842/easy-VQA.git).
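As a rough illustration of the VQA side, here is a hedged sketch of how a fused classification head over the CLIP image and question embeddings could look; VQAHead, embed_dim, and num_answers are placeholder names and values, and the real model in this repo may be wired differently.

```python
import torch
import torch.nn as nn

class VQAHead(nn.Module):
    """Fuses CLIP-style image and question embeddings, then classifies
    over the dataset's small answer vocabulary (placeholder sketch)."""
    def __init__(self, embed_dim=128, num_answers=13):
        super().__init__()
        # num_answers: answer-vocabulary size (EasyVQA's is small; value assumed)
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim * 2, 256),  # concatenated image + text features
            nn.ReLU(),
            nn.Linear(256, num_answers),    # logits over candidate answers
        )

    def forward(self, img_emb, txt_emb):
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return self.classifier(fused)

# Example: embeddings as produced by the two CLIP encoders (shapes assumed).
img_emb = torch.randn(2, 128)
txt_emb = torch.randn(2, 128)
logits = VQAHead()(img_emb, txt_emb)
print(logits.shape)  # torch.Size([2, 13])
```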

Libraries and frameworks used: PyTorch, Streamlit

To use the model, run "streamlit run app.py" in a terminal after installing the libraries and frameworks listed above.

Note: The model was trained in WSL (Windows Subsystem for Linux). However, please make sure to use the repo on Windows to make proper use of it. Thank you!

Check out the deployed app here!: https://latent-clip-busmwsdi4hghbhw6erkays.streamlit.app/
