# Florence-2 Icon Caption (Fine-tuned)

Fine-tuned Florence-2 model for UI icon recognition in desktop applications.

Based on the OmniParser-v2.0 `icon_caption` weights, further fine-tuned on 12.8k curated icon samples from 163 desktop applications (including WeChat, Photoshop, VS Code, Figma, and others).
## Key Features
- Functional icon recognition: Trained only on learnable UI elements (buttons, tools, nav icons), excluding avatars/thumbnails/decorative elements
- Clean, standardized labels: 2-5 word functional descriptions like `search button`, `settings gear`, `chats nav icon`
- 163 app coverage: Adobe suite, Microsoft Office, WeChat, Slack, Chrome, and 150+ more
- Chinese app support: WeChat, DingTalk, Feishu, QQ, Bilibili, etc.
## Performance
| Model | Val Loss | Exact Match | Output Quality |
|---|---|---|---|
| OmniParser (baseline) | - | 0% | Verbose, generic ("a loading or buffering indicator") |
| This model | 1.329 | 18.8% | Concise, functional ("settings gear", "search button") |
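The exact-match column can be computed with a plain string comparison between predicted and reference labels. A minimal sketch; the lowercase/whitespace normalization is an assumed convention, not the exact evaluation code used for the table:

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal the reference label exactly,
    after lowercasing and collapsing whitespace (assumed normalization)."""
    assert len(predictions) == len(references) and predictions
    norm = lambda s: " ".join(s.lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(predictions)

preds = ["settings gear", "Search  Button", "a loading or buffering indicator"]
refs  = ["settings gear", "search button", "loading indicator"]
print(exact_match(preds, refs))  # 2 of 3 match -> 0.666...
```

Verbose baseline outputs like "a loading or buffering indicator" never match the short reference labels exactly, which is why the baseline scores 0% on this metric.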
## Training Pipeline
- YOLO detection → crop icons from 750+ screenshots across 163 apps
- Claude annotation → send the original screenshot + icon grid to Claude for context-aware labeling
- Smart filtering → skip avatars, thumbnails, video frames, line numbers (unlearnable elements)
- Label standardization → normalize synonyms ("close window button" → "close button")
- Frequency filtering → remove labels appearing fewer than 3 times
- Full-parameter training → vision tower unfrozen, lr=3e-6, 15 epochs
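The standardization and frequency-filtering steps can be sketched in a few lines. The synonym map below is hypothetical; the real mapping is part of the curation pipeline:

```python
from collections import Counter

# Hypothetical synonym map -- the actual table is part of the curation pipeline.
SYNONYMS = {
    "close window button": "close button",
    "magnifier icon": "search button",
}

def standardize(label):
    """Lowercase, collapse whitespace, and map known synonyms to a canonical label."""
    label = " ".join(label.lower().split())
    return SYNONYMS.get(label, label)

def filter_by_frequency(labels, min_count=3):
    """Keep only samples whose standardized label occurs at least min_count times."""
    std = [standardize(l) for l in labels]
    counts = Counter(std)
    return [l for l in std if counts[l] >= min_count]

raw = ["close window button", "Close Button", "close button",
       "search button", "search button", "magnifier icon", "star rating"]
print(filter_by_frequency(raw))  # "star rating" appears once and is dropped
```

Standardizing before counting matters: the three spellings of "close button" above only clear the frequency threshold once merged.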
## Usage
```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from pathlib import Path
from PIL import Image
import torch

# Load processor
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

# Load model structure from the OmniParser config
config_path = hf_hub_download("microsoft/OmniParser-v2.0", "icon_caption/config.json")
config = AutoConfig.from_pretrained(str(Path(config_path).parent), trust_remote_code=True)
config._attn_implementation = "eager"
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Load fine-tuned weights
weights = hf_hub_download("josley/florence-2-icon-caption", "model.safetensors")
model.load_state_dict(load_file(weights, device="cpu"), strict=False)
model = model.to("cuda", dtype=torch.float16).eval()

# Inference
image = Image.open("icon.png").convert("RGB")
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt").to("cuda")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)
gen = model.generate(input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
                     max_new_tokens=20, num_beams=1, use_cache=False)
print(processor.batch_decode(gen, skip_special_tokens=True)[0])
# Output: "search button"
```
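To caption every icon in a full screenshot, the detection-then-crop idea from the training pipeline applies at inference time as well: crop each detected box (with a little padding for context) and run each crop through the model above. A sketch of the box-padding arithmetic; the padding ratio is an assumption, not a documented setting:

```python
def pad_box(box, img_w, img_h, pad_ratio=0.1):
    """Expand an (x1, y1, x2, y2) detection box by pad_ratio of its size on
    each side, clamped to the image bounds, so crops keep some context."""
    x1, y1, x2, y2 = box
    pw = (x2 - x1) * pad_ratio
    ph = (y2 - y1) * pad_ratio
    return (max(0, int(x1 - pw)), max(0, int(y1 - ph)),
            min(img_w, int(x2 + pw)), min(img_h, int(y2 + ph)))

print(pad_box((100, 100, 140, 140), img_w=1920, img_h=1080))  # (96, 96, 144, 144)
```

Each padded box can then be passed to `image.crop(...)` and fed through the `processor`/`model.generate` call shown above.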
## Training Details
- Base weights: microsoft/OmniParser-v2.0 (`icon_caption`)
- Training data: 12,789 curated samples from 163 apps
- Validation: 1,422 samples
- Best epoch: 9 (val_loss=1.329)
- Config: batch=16 (8×2 grad_accum), lr=3e-6, fp16, vision tower unfrozen, 231M params
- Annotation: Claude-powered with original screenshot context + smart filtering pipeline
## Model Tree

- Base model: microsoft/Florence-2-base