Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper: arXiv:2411.14402
This repository contains an adapted version of the original AIMv2 model, modified to be compatible with the AutoModelForImageClassification class from Hugging Face Transformers. This adaptation enables seamless use of the model for image classification tasks.
This model has not been trained or fine-tuned for image classification. Fine-tune it on a labeled dataset before relying on its predictions.
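Because the checkpoint is not fine-tuned, a typical first step is to load it with a classification head sized for your own labels. The sketch below is a minimal, untested outline of that step. The class names are hypothetical, and it assumes the repository's remote code honors the standard `num_labels`, `id2label`, and `label2id` configuration arguments, which this model card does not confirm.

```python
def build_label_maps(class_names):
    """Return (id2label, label2id) dictionaries for a list of class names."""
    id2label = {i: name for i, name in enumerate(class_names)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def load_for_finetuning(class_names):
    """Load the adapted checkpoint with a head sized for `class_names`.

    Assumption: the remote model code accepts the standard config overrides.
    """
    # Imported lazily so the label-map helper works without transformers installed.
    from transformers import AutoModelForImageClassification

    id2label, label2id = build_label_maps(class_names)
    return AutoModelForImageClassification.from_pretrained(
        "amaye15/aimv2-large-patch14-native-image-classification",
        num_labels=len(class_names),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # allow the head to be re-initialized
        trust_remote_code=True,
    )


# Hypothetical class names, for illustration only.
id2label, label2id = build_label_maps(["cat", "dog", "bird"])
print(id2label)
```

Calling `load_for_finetuning(["cat", "dog", "bird"])` would download the checkpoint. The returned model can then be trained with whatever training loop you already use.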
We have adapted the original apple/aimv2-large-patch14-native model to work with AutoModelForImageClassification. The AIMv2 family consists of vision models pre-trained with a multimodal autoregressive objective, offering robust performance across various benchmarks.
Some highlights of the AIMv2 models include:
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Download a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the adapted model
processor = AutoImageProcessor.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
)
model = AutoModelForImageClassification.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
    trust_remote_code=True,
)

# Preprocess the image and run inference without tracking gradients
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get predicted class
predictions = outputs.logits.softmax(dim=-1)
predicted_class = predictions.argmax(-1).item()
print(f"Predicted class: {model.config.id2label[predicted_class]}")
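The last three lines of the snippet above turn logits into probabilities and pick the most likely class. The sketch below reproduces that step in plain Python and extends it to the top-k labels, so the logic can be checked without downloading the model. The logits and label names here are made up for illustration and are not outputs of this model.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and labels, for illustration only.
example_logits = [2.0, 0.5, -1.0, 1.0]
example_id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2", 3: "LABEL_3"}

for label, prob in top_k(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, `outputs.logits[0].tolist()` and `model.config.id2label` would take the place of the example values. Softmax does not change the ranking, so the predicted class is the same whether the argmax is taken over logits or probabilities.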
Model name: amaye15/aimv2-large-patch14-native-image-classification
Original model: apple/aimv2-large-patch14-native
Adaptation: Modified to be compatible with AutoModelForImageClassification for direct use in image classification tasks.

If you use this model or find it helpful, please consider citing the original AIMv2 paper:
@article{fini2024multimodal,
title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
author={Fini, Enrico and others},
journal={arXiv preprint arXiv:2411.14402},
year={2024}
}
Base model: apple/aimv2-large-patch14-native