AIMv2-Large-Patch14-Native Image Classification

Original AIMv2 Paper | BibTeX

This repository contains a version of the original AIMv2 model adapted to work with the AutoModelForImageClassification class from Hugging Face Transformers, so the model can be loaded through the standard image classification API.

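Note: this model has not been trained or fine-tuned for classification, so its class predictions should not be relied on until it has been fine-tuned on a labeled dataset.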

Introduction

We have adapted the original apple/aimv2-large-patch14-native model to work with AutoModelForImageClassification. The AIMv2 family consists of vision models pre-trained with a multimodal autoregressive objective, offering robust performance across various benchmarks.

Some highlights of the AIMv2 models include:

Outperforming OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
Surpassing DINOv2 in open-vocabulary object detection and referring expression comprehension.
Demonstrating strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

Usage

PyTorch

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Download an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
)
model = AutoModelForImageClassification.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
    trust_remote_code=True,
)

inputs = processor(images=image, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Get predicted class
predictions = outputs.logits.softmax(dim=-1)
predicted_class = predictions.argmax(-1).item()

print(f"Predicted class: {model.config.id2label[predicted_class]}")
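
The snippet above reports only the single most likely class. Since the model has not been fine-tuned, it can be more informative to look at the full ranking. The following is a minimal sketch in plain Python, assuming the logits have first been converted to a list (for example with outputs.logits[0].tolist()). The helper name top_k_predictions and the example labels are hypothetical and are not part of Transformers or of this repository.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs for one image.

    `logits` is a flat list of floats, e.g. `outputs.logits[0].tolist()`.
    `id2label` maps class index to label name, e.g. `model.config.id2label`.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sort class indices by probability, highest first, and keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and labels, for illustration only.
example_logits = [2.0, 0.5, -1.0]
example_labels = {0: "cat", 1: "dog", 2: "bird"}
for label, prob in top_k_predictions(example_logits, example_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With a real model, pass model.config.id2label as the second argument.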

Model Details

Model Name: amaye15/aimv2-large-patch14-native-image-classification
Original Model: apple/aimv2-large-patch14-native
Adaptation: Modified to be compatible with AutoModelForImageClassification for direct use in image classification tasks.
Framework: PyTorch
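
Because the model has not been fine-tuned, using it for a real task usually starts with defining the target label set. The sketch below shows one possible way to build the id2label and label2id mappings and pass them to from_pretrained. The helper names and class names are hypothetical, and it is an assumption (not verified against this repository's custom code) that the remote model class honours num_labels and re-initialises the head when ignore_mismatched_sizes=True is set.

```python
def build_label_maps(class_names):
    """Build the id2label / label2id dictionaries that Transformers configs use."""
    id2label = {i: name for i, name in enumerate(class_names)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def load_for_fine_tuning(class_names):
    """Load the checkpoint with a head sized for `class_names` (downloads weights)."""
    # Imported here so the mapping helper above works without transformers installed.
    from transformers import AutoModelForImageClassification

    id2label, label2id = build_label_maps(class_names)
    return AutoModelForImageClassification.from_pretrained(
        "amaye15/aimv2-large-patch14-native-image-classification",
        num_labels=len(class_names),
        id2label=id2label,
        label2id=label2id,
        # Assumption: re-initialise the head if its shape differs from the checkpoint.
        ignore_mismatched_sizes=True,
        trust_remote_code=True,
    )


# Hypothetical class names, for illustration only.
id2label, label2id = build_label_maps(["cat", "dog", "bird"])
print(id2label)
print(label2id)
```

Calling load_for_fine_tuning(["cat", "dog", "bird"]) would then return a model ready to be passed to a training loop of your choice.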

Citation

If you use this model or find it helpful, please consider citing the original AIMv2 paper:

@article{fini2024aimv2,
  title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
  author={Fini, Enrico and others},
  journal={arXiv preprint arXiv:2411.14402},
  year={2024}
}

Model size: 0.3B params
Tensor type: F32
Weights format: Safetensors
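
As a rough guide, the parameter count and tensor type give the memory needed just to hold the weights. The sketch below is back-of-the-envelope arithmetic, assuming 4 bytes per F32 parameter and ignoring activations, optimizer state, and framework overhead. The helper weight_memory_gb is hypothetical, and 0.3B is the rounded figure reported for this model.

```python
# Bytes per parameter for common tensor types.
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "BF16": 2}


def weight_memory_gb(num_params, tensor_type="F32"):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[tensor_type] / 1e9


# 0.3B is a rounded parameter count, so the result is only approximate.
print(f"{weight_memory_gb(0.3e9, 'F32'):.1f} GB")
```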

Paper: Multimodal Autoregressive Pre-training of Large Vision Encoders (arXiv:2411.14402, published Nov 21, 2024)