Initializing SIGLIP vision model in Idefics2
Hi,
First off, thanks for providing such great models and papers to the community. I'm a big fan of your "What matters when building vision-language models?" paper.
I'm trying to reproduce the pre-training of the model using similar datasets, and then do some custom pre-training and modifications.
However, I'm stuck on how to initialize the vision_model (SIGLIP) part. Any hints on how to load the checkpoint from https://huggingface.co/google/siglip-so400m-patch14-384 into the vision_model (or its state_dict) of Idefics2Model?
I've noticed that SIGLIP's position embeddings are Embedding(729, 1152), while Idefics2 uses Embedding(4900, 1152). I think I need to do some interpolation here, but I'm not sure about the details.
I've checked out https://huggingface.co/HuggingFaceM4/idefics2-8b-base/discussions/5 and Connector initialization is clear. It's mainly the vision model I'm unsure about.
Any tips or pointers would be greatly appreciated!
Thanks in advance!
Thanks for the interest!
At the beginning of our training, the weights of SigLIP are exactly the same as the original ones from Google.
We modify SigLIP to accept higher image resolutions, up to 980x980; the original version was limited to 384x384.
We also allow SigLIP to take images that are not resized to a square (we preserve the original aspect ratio).
Concretely, the only modeling change these modifications require is a larger position-embedding table, so that images larger than 384x384 have position embeddings available.
The patch size is 14, so:
729 = floor(384 / 14) * floor(384 / 14) = 27 * 27
4900 = (980 / 14) * (980 / 14) = 70 * 70
This explains the difference between (729, 1152) and (4900, 1152).
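The arithmetic above can be checked with a few lines of Python. This is only a sketch: `num_patches` is a hypothetical helper (not part of the Idefics2 or SigLIP code), and the non-square 980x490 case assumes the same floor division applies independently to each side.

```python
PATCH_SIZE = 14  # patch size stated above


def num_patches(height: int, width: int, patch_size: int = PATCH_SIZE) -> tuple[int, int, int]:
    """Return (rows, cols, total) of whole patches that fit in an image."""
    rows = height // patch_size
    cols = width // patch_size
    return rows, cols, rows * cols


print(num_patches(384, 384))  # (27, 27, 729)   original SigLIP table size
print(num_patches(980, 980))  # (70, 70, 4900)  enlarged table size
print(num_patches(980, 490))  # (70, 35, 2450)  non-square image (assumed handling)
```

Under that assumption, any image whose sides are at most 980 needs no more than 4900 positions, which is why a (4900, 1152) table is enough.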
The resulting model is uploaded at https://huggingface.co/HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit
You can compare its files with those of the original repo, which does not have our modifications: https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384.
The code we used to initialize the new positional embeddings (to go from 384 to 980):
import json
import math
import os
import torch
import torch.nn as nn
from safetensors.torch import load_file, save_file
# Source and destination file paths
source_dir = (
"/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/siglip-so400m-14-384-flash-attn2"
)
out_dir = (
"/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/siglip-so400m-14-980-flash-attn2"
)
config_input_file_path = f"{source_dir}/config.json"
config_out_file_path = f"{out_dir}/config.json"
os.makedirs(out_dir, exist_ok=True)
state_dict = load_file(f"{source_dir}/model.safetensors")
new_size = 980
with open(config_input_file_path, "r") as f:
    model_config = json.loads(f.read())
vision_model_config = model_config["vision_config"]
k = "vision_model.embeddings.position_embedding.weight"
v = state_dict[k]
print(f"Shape before interpolation: {v.shape}")
height = new_size
width = new_size
patch_pos_embed = state_dict[k].unsqueeze(0)
num_positions = patch_pos_embed.shape[1]
embed_dim = patch_pos_embed.shape[-1]
num_h_patches = height // vision_model_config["patch_size"]
num_w_patches = width // vision_model_config["patch_size"]
# we add a small number to avoid floating point error in the interpolation
# see discussion at https://github.com/facebookresearch/dino/issues/8
num_h_patches, num_w_patches = num_h_patches + 0.1, num_w_patches + 0.1
sqrt_num_positions = math.sqrt(num_positions)
patch_pos_embed = patch_pos_embed.reshape(1, int(sqrt_num_positions), int(sqrt_num_positions), embed_dim)
patch_pos_embed_dtype = patch_pos_embed.dtype
patch_pos_embed = patch_pos_embed.permute(0, 3, 1, 2).to(torch.float)
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed,
    scale_factor=(num_h_patches / sqrt_num_positions, num_w_patches / sqrt_num_positions),
    mode="bicubic",
    align_corners=False,
).to(patch_pos_embed_dtype)
if int(num_h_patches) != patch_pos_embed.shape[-2] or int(num_w_patches) != patch_pos_embed.shape[-1]:
    raise ValueError(
        f"Number of patches for images ({int(num_h_patches), int(num_w_patches)}) don't match the "
        f"shape of position embedding ({patch_pos_embed.shape[-2], patch_pos_embed.shape[-1]})"
    )
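patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, embed_dim)
patch_pos_embed = patch_pos_embed.squeeze(0)
# save_file rejects non-contiguous tensors; .contiguous() is a no-op if the layout is already contiguous
state_dict[k] = patch_pos_embed.contiguous()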
# Sanity check
print(k)
print(f"Shape after interpolation: {state_dict[k].shape}")
save_file(state_dict, f"{out_dir}/model.safetensors", metadata={"format": "pt"})
# Update config accordingly
with open(config_input_file_path, "r") as f:
    model_config = json.loads(f.read())
model_config["vision_config"]["image_size"] = new_size
with open(config_out_file_path, "w") as json_file:
    json.dump(model_config, json_file)
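A note on the `+ 0.1` in the script above. When `interpolate` is given a `scale_factor`, the output size is, to my understanding, derived as floor(input_size * scale_factor). A floating-point product that should be an exact integer can land just below it and lose a row or column. The pure-Python sketch below mimics that computation to show why the nudge is safe. `derived_size` and `find_rounding_failures` are hypothetical helpers written for this illustration, not PyTorch functions.

```python
import math


def derived_size(src_side: int, target_side: float) -> int:
    """Mimic floor(input_size * scale_factor) for one spatial dimension."""
    scale_factor = target_side / src_side
    return math.floor(src_side * scale_factor)


def find_rounding_failures(max_side: int = 200) -> list[tuple[int, int]]:
    """List (src, target) pairs where the un-nudged size comes out wrong."""
    return [
        (src, target)
        for src in range(1, max_side + 1)
        for target in range(1, max_side + 1)
        if derived_size(src, target) != target
    ]


failures = find_rounding_failures()
print(f"pairs that lose a row/column without the nudge: {len(failures)}")

# With the +0.1 nudge, every pair in the same range lands on the intended size.
nudged_ok = all(
    derived_size(src, target + 0.1) == target
    for src in range(1, 201)
    for target in range(1, 201)
)
print(f"all sizes correct with the nudge: {nudged_ok}")

# The case from the script: 27 x 27 source grid, 70 x 70 target grid.
print(derived_size(27, 70 + 0.1))
```

The nudge moves the product well away from the integer boundary, so the floor no longer depends on the last bit of the division. The script's explicit shape check after interpolation guards against the same failure from the other side.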