General Introduction
MegaPairs is an open-source project on GitHub from the VectorSpaceLab team that builds multimodal embedding models for image-text-to-image retrieval through large-scale data synthesis. Based on a dataset of more than 26 million heterogeneous KNN triplets, the project trains the BGE-VL series of models, including BGE-VL-CLIP (base and large versions) and BGE-VL-MLLM (S1 and S2 versions). Among them, BGE-VL-MLLM-S1 improves performance by 8.1% on the CIRCO zero-shot image retrieval benchmark (mAP@5) and also performs well on the MMEB multimodal embedding benchmark. The code and models are open-sourced on GitHub and Hugging Face, and the dataset is planned for later release under the MIT license, with data from Recap-DataComp (CC BY 4.0 license).
Function List
- Large-scale dataset generation: Provides over 26 million heterogeneous KNN triplets for training multimodal embedding models.
- BGE-VL-CLIP embedding models: Base and large versions that generate embedding representations of images and text and support efficient retrieval.
- BGE-VL-MLLM embedding models: S1 and S2 versions that generate high-performance multimodal embeddings and support zero-shot retrieval.
- Zero-shot retrieval: Generate embeddings and perform image-text retrieval tasks without additional training.
- Open-source models and extensions: Pre-trained models are provided on Hugging Face and can be downloaded, used, and fine-tuned.
Usage Help
MegaPairs distributes code and models via GitHub and Hugging Face, allowing users to quickly generate multimodal embeddings and complete retrieval tasks. Below is a detailed how-to guide, based on the official instructions for BGE-VL-MLLM-S1 (Hugging Face).
Acquisition and Installation
- Access the GitHub repository: Open https://github.com/VectorSpaceLab/MegaPairs to view the project details.
- Clone the repository: Run the following commands in a terminal to download the code:
git clone https://github.com/VectorSpaceLab/MegaPairs.git
cd MegaPairs
- Install dependencies: Using Python 3.10, create a virtual environment and install the required libraries:
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
pip install torch transformers==4.41.2 sentencepiece
Hugging Face requires transformers==4.41.2 and sentencepiece; a quick version check is shown below.
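A quick way to confirm that the pinned versions are installed (an optional check, not part of the official instructions):
import transformers
import sentencepiece  # should import without error
print(transformers.__version__)  # expected: 4.41.2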
- Download the model: Get BGE-VL-MLLM-S1 from Hugging Face:
- Visit https://huggingface.co/BAAI/BGE-VL-MLLM-S1
- Automatic download via Python script (see below).
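If you prefer to pre-download the weights instead of letting from_pretrained fetch them on first use, the huggingface_hub client can be used; this is an optional convenience, not an official MegaPairs step:
from huggingface_hub import snapshot_download

# Download the full model repository into the local Hugging Face cache
local_dir = snapshot_download(repo_id="BAAI/BGE-VL-MLLM-S1")
print(local_dir)  # path to the cached weights and config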
Using the Main Functions
1. Using the dataset
The MegaPairs dataset, which contains 26 million triplets for training multimodal embedding models, has not yet been fully released; it is scheduled to be made available through Hugging Face.
- How to get it: Watch for official updates, then download the dataset for model training or validation.
- Data format: triplets of (query image, text description, target image), supporting embedding generation and retrieval; an illustrative sketch of such a record follows.
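The dataset itself is not yet available, so the record layout below is purely illustrative; the field names are assumptions, not the released schema:
from dataclasses import dataclass
from PIL import Image

@dataclass
class MegaPairsTriplet:  # hypothetical structure, for illustration only
    query_image: Image.Image   # source image of the query
    query_text: str            # instruction describing the desired target
    target_image: Image.Image  # image that satisfies image + instruction

# Example using the sample files referenced later in this guide:
triplet = MegaPairsTriplet(
    query_image=Image.open("./cir_query.png").convert("RGB"),
    query_text="Make the background dark",
    target_image=Image.open("./cir_candi_1.png").convert("RGB"),
)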
2. Generating multimodal embeddings (BGE-VL-MLLM-S1)
BGE-VL-MLLM-S1 is the core embedding model: it generates embedding representations of images and text and performs retrieval. The following is the official code:
- Loading Models:
import torch
from transformers import AutoModel, AutoProcessor
model_name = "BAAI/BGE-VL-MLLM-S1"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()
model.cuda()  # use GPU acceleration
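If no GPU is available, the model can be placed on whichever device exists instead of calling model.cuda(); this is a small adaptation of the official snippet, and the inputs below must then be moved to the same device:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)  # CPU works too, just slower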
- Generate embedding and retrieve:
from PIL import Image

# Prepare the inputs
query_image = Image.open("./cir_query.png").convert("RGB")
query_text = "Make the background dark, as if the camera has taken the photo at night"
candidate_images = [
    Image.open("./cir_candi_1.png").convert("RGB"),
    Image.open("./cir_candi_2.png").convert("RGB"),
]

# Process the query data
query_inputs = processor(
    text=query_text,
    images=query_image,
    task_instruction="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: ",
    return_tensors="pt",
    q_or_c="q",
)
query_inputs = {k: v.cuda() for k, v in query_inputs.items()}

# Process the candidate data
candidate_inputs = processor(
    images=candidate_images,
    return_tensors="pt",
    q_or_c="c",
)
candidate_inputs = {k: v.cuda() for k, v in candidate_inputs.items()}

# Generate embeddings and compute similarity
with torch.no_grad():
    query_embs = model(**query_inputs, output_hidden_states=True).hidden_states[-1][:, -1, :]
    candi_embs = model(**candidate_inputs, output_hidden_states=True).hidden_states[-1][:, -1, :]

query_embs = torch.nn.functional.normalize(query_embs, dim=-1)
candi_embs = torch.nn.functional.normalize(candi_embs, dim=-1)
scores = torch.matmul(query_embs, candi_embs.T)
print(scores)  # print the similarity scores
- Interpreting the results: scores gives the similarity between the query embedding and each candidate embedding; the higher the score, the better the match. A short ranking example follows.
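To turn the similarity scores into an explicit ranking of the candidates, a simple sort can be added. This continues from the snippet above (reusing torch and scores) and is an illustrative addition rather than part of the official example:
# Rank candidates for the query by descending similarity
ranking = torch.argsort(scores, dim=-1, descending=True)
best_idx = ranking[0, 0].item()
print(f"Best match: candidate #{best_idx} (score {scores[0, best_idx].item():.4f})")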
3. Generating embeddings with BGE-VL-CLIP
BGE-VL-CLIP (base/large) can also generate multimodal embeddings:
- Load and Run:
import torch
from transformers import AutoModel

model_name = "BAAI/BGE-VL-base"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.set_processor(model_name)
model.eval()

with torch.no_grad():
    query = model.encode(images="./cir_query.png", text="Make the background dark")
    candidates = model.encode(images=["./cir_candi_1.png", "./cir_candi_2.png"])
    scores = query @ candidates.T
print(scores)
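The scores matrix is query-by-candidate, so the same pattern scales to a larger pool: encode every candidate once, keep the embeddings, and rank them per query. A minimal continuation of the block above, assuming encode returns torch tensors (which the @ product implies):
with torch.no_grad():
    pool = model.encode(images=["./cir_candi_1.png", "./cir_candi_2.png"])  # encode the candidate pool once
    q = model.encode(images="./cir_query.png", text="Make the background dark")
    top_scores, top_idx = (q @ pool.T).topk(k=2, dim=-1)  # best-matching candidates per query
print(top_idx, top_scores)
Because the candidate embeddings do not depend on the query, they can be computed once, cached (for example with torch.save), and reused across many queries.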
4. Model fine-tuning
Users can fine-tune the model with the dataset:
- Data preparation: Prepare image-text pairs or triplets.
- Fine-tuning process: The official fine-tuning code has not been released yet; in the meantime, the transformers Trainer API can be used (a generic contrastive-loss sketch is given below).
- Validation: Evaluate the fine-tuned model on the CIRCO or MMEB benchmarks.
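Since the official fine-tuning code has not been released, the snippet below is only a generic sketch of contrastive training on (query, target) embedding pairs with in-batch negatives; it is not the MegaPairs recipe, and every name in it is illustrative:
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs: torch.Tensor,
                              target_embs: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss: each query should match its own target;
    the other targets in the batch act as negatives."""
    query_embs = F.normalize(query_embs, dim=-1)
    target_embs = F.normalize(target_embs, dim=-1)
    logits = query_embs @ target_embs.T / temperature              # (batch, batch) similarities
    labels = torch.arange(query_embs.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings standing in for model outputs:
loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())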
Featured Functions
Zero-Shot Embedding Generation and Retrieval
BGE-VL-MLLM-S1 supports zero-shot operation:
- Input images and text to generate embeddings and retrieve directly, without any training.
- Improves mAP@5 on the CIRCO benchmark by 8.1%.
High Performance and Scalability
- Performance: Generates strong multimodal embeddings on the MMEB benchmark; the S2 version is further optimized.
- Scalability: Embedding quality improves as the amount of training data grows, and with 500,000 samples it already outperforms traditional models.
Notes
- Hardware requirements: A GPU with 16 GB of VRAM or more is recommended.
- Dependency versions: Use transformers==4.41.2 and sentencepiece.
- Documentation: Check the GitHub and Hugging Face pages.
- Community Help: Ask a question in GitHub Issues or Hugging Face Discussions.
With the above steps, users can generate multimodal embeddings and complete retrieval tasks.