Initializing SIGLIP vision model in Idefics2

by johnwicker - opened Aug 17, 2024

Discussion

johnwicker, Aug 17, 2024 (edited Aug 17, 2024)

Hi,

First off, thanks for providing such great models and papers to the community. I'm a big fan of your "What matters when building vision-language models?" paper.

I'm trying to reproduce the pre-training of the model using similar datasets, and then do some custom pre-training and modifications.

However, I'm stuck on how to initialize the vision_model (SIGLIP) part. Any hints on how to load the checkpoint from https://huggingface.co/google/siglip-so400m-patch14-384 into the vision_model (or its state_dict) of Idefics2Model?
I've noticed that SIGLIP's position embeddings are Embedding(729, 1152), while Idefics2 uses Embedding(4900, 1152). I think I need to do some interpolation here, but I'm not sure about the details.

I've checked out https://huggingface.co/HuggingFaceM4/idefics2-8b-base/discussions/5 and Connector initialization is clear. It's mainly the vision model I'm unsure about.

Any tips or pointers would be greatly appreciated!

Thanks in advance!

HugoLaurencon, Aug 18, 2024

Thanks for the interest!

At the beginning of our training, the weights of SigLIP are exactly the same as the original ones from Google.

We modify SigLIP so that it can handle higher image resolutions, up to 980x980, whereas the original version was limited to 384x384.
We also let SigLIP take images in their original aspect ratio instead of resizing them to squares.

Concretely, the only change needed in the modeling code is to add more positional embeddings, so that images larger than 384x384 have positions to map to.
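
To illustrate how a fixed table of positional embeddings can serve images of varying size and aspect ratio, here is a minimal sketch. It assumes a 70x70 grid of learned positions (980 / 14) and maps each patch of an h x w patch grid to a grid cell according to its relative position in the image. The function name `position_ids` and the integer bucketing rule are assumptions made for illustration; this thread does not spell out the exact indexing Idefics2 uses.

```python
def position_ids(num_h_patches: int, num_w_patches: int, grid: int = 70) -> list[int]:
    """Map each patch of an image to a row of a (grid * grid) position table.

    Hypothetical scheme: a patch at relative position (i / h, j / w) is sent
    to the grid cell that covers the same relative position.
    """
    if not (0 < num_h_patches <= grid and 0 < num_w_patches <= grid):
        raise ValueError("image has more patches per side than the position grid")
    ids = []
    for i in range(num_h_patches):
        row = (i * grid) // num_h_patches
        for j in range(num_w_patches):
            col = (j * grid) // num_w_patches
            ids.append(row * grid + col)
    return ids


# A tiny non-square image of 2 x 3 patches
print(position_ids(2, 3))  # [0, 23, 46, 2450, 2473, 2496]
```

With a full 70 x 70 patch grid the mapping is the identity, so a 980x980 image uses every learned position exactly once.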

The patch size is 14.
729 = floor(384 / 14) * floor(384 / 14) = 27 * 27
4900 = (980 / 14) * (980 / 14) = 70 * 70
This explains the difference between (729, 1152) and (4900, 1152).
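
The arithmetic can be checked directly. This is a small sketch; the helper name `num_positions` is introduced here for illustration only.

```python
def num_positions(image_size: int, patch_size: int = 14) -> int:
    """Number of patch positions for a square image of the given side length."""
    per_side = image_size // patch_size  # floor division drops the partial patch
    return per_side * per_side


print(num_positions(384))  # 729  (27 * 27)
print(num_positions(980))  # 4900 (70 * 70)
```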

The resulting model is uploaded at https://huggingface.co/HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit
You can compare its files with those of the original repo, which does not have our modifications: https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384
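
One way to compare two checkpoints is to diff tensor shapes key by key. The sketch below uses plain dicts of name to shape as stand-ins for what you would read from each repo's weights; `diff_shapes` is a hypothetical helper, and the two example entries are illustrative rather than a full listing of either checkpoint.

```python
def diff_shapes(original: dict, modified: dict) -> dict:
    """Return {name: (original_shape, modified_shape)} for entries that differ.

    A name missing from one side shows up with None on that side.
    """
    names = sorted(set(original) | set(modified))
    return {
        name: (original.get(name), modified.get(name))
        for name in names
        if original.get(name) != modified.get(name)
    }


pos_key = "vision_model.embeddings.position_embedding.weight"
bias_key = "vision_model.embeddings.patch_embedding.bias"  # illustrative unchanged entry
original = {pos_key: (729, 1152), bias_key: (1152,)}
modified = {pos_key: (4900, 1152), bias_key: (1152,)}

print(diff_shapes(original, modified))
```

Going by the description above, only the position embedding entry would be expected to show up in such a diff.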

This is the code we used to initialize the new positional embeddings (going from 384 to 980):

import json
import math
import os

import torch
import torch.nn as nn
from safetensors.torch import load_file, save_file


# Source and destination file paths
source_dir = (
    "/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/siglip-so400m-14-384-flash-attn2"
)
out_dir = (
    "/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/siglip-so400m-14-980-flash-attn2"
)
config_input_file_path = f"{source_dir}/config.json"
config_out_file_path = f"{out_dir}/config.json"


os.makedirs(out_dir, exist_ok=True)


state_dict = load_file(f"{source_dir}/model.safetensors")
new_size = 980

with open(config_input_file_path, "r") as f:
    model_config = json.loads(f.read())
    vision_model_config = model_config["vision_config"]

k = "vision_model.embeddings.position_embedding.weight"
v = state_dict[k]
print(f"Shape before interpolation: {v.shape}")
height = new_size
width = new_size
# (num_positions, embed_dim) -> (1, num_positions, embed_dim)
patch_pos_embed = v.unsqueeze(0)
num_positions = patch_pos_embed.shape[1]

embed_dim = patch_pos_embed.shape[-1]
# Target grid: 980 // 14 = 70 patches per side
num_h_patches = height // vision_model_config["patch_size"]
num_w_patches = width // vision_model_config["patch_size"]
# we add a small number to avoid floating point error in the interpolation
# see discussion at https://github.com/facebookresearch/dino/issues/8
num_h_patches, num_w_patches = num_h_patches + 0.1, num_w_patches + 0.1
# Source grid: sqrt(729) = 27 patches per side for the 384 checkpoint
sqrt_num_positions = math.sqrt(num_positions)
# Lay the embeddings out as a 2D grid: (1, 27, 27, embed_dim)
patch_pos_embed = patch_pos_embed.reshape(1, int(sqrt_num_positions), int(sqrt_num_positions), embed_dim)
patch_pos_embed_dtype = patch_pos_embed.dtype
# interpolate expects channels first, so move embed_dim to dim 1 and work in float32
patch_pos_embed = patch_pos_embed.permute(0, 3, 1, 2).to(torch.float)
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed,
    scale_factor=(num_h_patches / sqrt_num_positions, num_w_patches / sqrt_num_positions),
    mode="bicubic",
    align_corners=False,
).to(patch_pos_embed_dtype)
if int(num_h_patches) != patch_pos_embed.shape[-2] or int(num_w_patches) != patch_pos_embed.shape[-1]:
    raise ValueError(
        f"Number of patches for images ({int(num_h_patches), int(num_w_patches)}) don't match the "
        f"shape of position embedding ({patch_pos_embed.shape[-2], patch_pos_embed.shape[-1]})"
    )
# Back to (num_positions, embed_dim); safetensors only saves contiguous tensors
patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).reshape(1, -1, embed_dim)
patch_pos_embed = patch_pos_embed.squeeze(0).contiguous()
state_dict[k] = patch_pos_embed

# Sanity check
print(k)
print(f"Shape after interpolation: {state_dict[k].shape}")

save_file(state_dict, f"{out_dir}/model.safetensors", metadata={"format": "pt"})
# Update config accordingly
with open(config_input_file_path, "r") as f:
    model_config = json.loads(f.read())
    model_config["vision_config"]["image_size"] = new_size

with open(config_out_file_path, "w") as json_file:
    json.dump(model_config, json_file)
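
To try the interpolation step without the checkpoint files, the core of the script can be wrapped in a function and run on a random tensor. This is a sketch under assumptions, not the script above: `resize_pos_embed` is a hypothetical helper, it handles square grids only, and it passes `size=` to `interpolate` instead of the `scale_factor` plus 0.1 trick, so the resulting values may differ slightly from what the script produces.

```python
import math

import torch
import torch.nn.functional as F


def resize_pos_embed(weight: torch.Tensor, new_side: int) -> torch.Tensor:
    """Bicubically resize a square grid of position embeddings.

    weight: (old_side * old_side, embed_dim)
    returns: (new_side * new_side, embed_dim), same dtype as the input
    """
    num_positions, embed_dim = weight.shape
    old_side = math.isqrt(num_positions)
    if old_side * old_side != num_positions:
        raise ValueError("position table is not a square grid")
    # (N, D) -> (1, D, old_side, old_side), the layout interpolate expects
    grid = weight.reshape(1, old_side, old_side, embed_dim).permute(0, 3, 1, 2)
    resized = F.interpolate(
        grid.float(), size=(new_side, new_side), mode="bicubic", align_corners=False
    )
    # (1, D, new_side, new_side) -> (new_side * new_side, D)
    out = resized.permute(0, 2, 3, 1).reshape(-1, embed_dim)
    return out.to(weight.dtype).contiguous()


torch.manual_seed(0)
dummy = torch.randn(27 * 27, 8)  # random stand-in with a small embed_dim
print(resize_pos_embed(dummy, 70).shape)  # torch.Size([4900, 8])
```

Using a small `embed_dim` keeps the check fast; the real table has 1152 columns, which does not change the logic.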
