Florence-2 Icon Caption (Fine-tuned)

Fine-tuned Florence-2 model for UI icon recognition in desktop applications.

Based on the OmniParser-v2.0 icon_caption weights, further fine-tuned on 12.8k curated icon samples from 163 desktop applications (including WeChat, Photoshop, VS Code, and Figma).

Key Features

  • Functional icon recognition: Trained only on learnable UI elements (buttons, tools, nav icons), excluding avatars/thumbnails/decorative elements
  • Clean, standardized labels: 2-5 word functional descriptions like search button, settings gear, chats nav icon
  • 163 app coverage: Adobe suite, Microsoft Office, WeChat, Slack, Chrome, and 150+ more
  • Chinese app support: WeChat, DingTalk, Feishu, QQ, Bilibili, etc.

Performance

| Model                 | Val Loss | Exact Match | Output Quality                                          |
|-----------------------|----------|-------------|---------------------------------------------------------|
| OmniParser (baseline) | -        | 0%          | Verbose, generic ("a loading or buffering indicator")   |
| This model            | 1.329    | 18.8%       | Concise, functional ("settings gear", "search button")  |
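The card does not spell out how exact match is scored; a common choice is case-insensitive string equality after whitespace normalization. A minimal sketch under that assumption (the `normalize` rules are illustrative, not the card's stated metric):

```python
def normalize(caption: str) -> str:
    # Lowercase and collapse whitespace; assumed normalization, not specified in the card.
    return " ".join(caption.lower().split())

def exact_match(preds, refs):
    # Fraction of predictions that equal the reference label after normalization.
    hits = sum(normalize(p) == normalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)

preds = ["Search Button", "settings gear", "a loading or buffering indicator"]
refs = ["search button", "settings gear", "loading spinner"]
score = exact_match(preds, refs)  # 2 of 3 captions match
```

Under this metric, any extra or reworded token counts as a miss, which is why the verbose baseline scores 0%.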

Training Pipeline

  1. YOLO detection β†’ crop icons from 750+ screenshots across 163 apps
  2. Claude annotation β†’ send original screenshot + icon grid to Claude for context-aware labeling
  3. Smart filtering β†’ skip avatars, thumbnails, video frames, line numbers (unlearnable elements)
  4. Label standardization β†’ normalize synonyms (close window button β†’ close button)
  5. Frequency filtering β†’ remove labels appearing < 3 times
  6. Full parameter training β†’ vision tower unfrozen, lr=3e-6, 15 epochs
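Steps 4-5 of the pipeline (label standardization and frequency filtering) can be sketched as below. The synonym map is a hypothetical stand-in; the actual normalization table used in training is not published:

```python
from collections import Counter

# Hypothetical synonym map illustrating step 4; the real table is not published.
SYNONYMS = {
    "close window button": "close button",
    "magnifier icon": "search button",
}

def standardize(label: str) -> str:
    # Lowercase, collapse whitespace, then map known synonyms to a canonical form.
    label = " ".join(label.lower().split())
    return SYNONYMS.get(label, label)

def filter_rare(samples, min_count=3):
    # Step 5: drop samples whose standardized label appears fewer than min_count times.
    labels = [standardize(s["label"]) for s in samples]
    counts = Counter(labels)
    return [dict(s, label=lb) for s, lb in zip(samples, labels) if counts[lb] >= min_count]

samples = [{"label": "Close Window Button"}] * 3 + [{"label": "rare icon"}]
kept = filter_rare(samples)  # three "close button" samples survive; "rare icon" is dropped
```

Standardizing before counting matters: synonyms are merged first, so a label that is rare only because its spellings were split can still clear the frequency threshold.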

Usage

from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from pathlib import Path
from PIL import Image
import torch

# Load processor
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

# Load model structure from OmniParser config
config_path = hf_hub_download("microsoft/OmniParser-v2.0", "icon_caption/config.json")
config = AutoConfig.from_pretrained(str(Path(config_path).parent), trust_remote_code=True)
config._attn_implementation = "eager"
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Load fine-tuned weights
weights = hf_hub_download("josley/florence-2-icon-caption", "model.safetensors")
model.load_state_dict(load_file(weights, device="cpu"), strict=False)
model = model.to("cuda", dtype=torch.float16).eval()

# Inference
image = Image.open("icon.png").convert("RGB")
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt").to("cuda")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)
gen = model.generate(input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
                     max_new_tokens=20, num_beams=1, use_cache=False)
print(processor.batch_decode(gen, skip_special_tokens=True)[0])
# Output: "search button"

Training Details

  • Base weights: microsoft/OmniParser-v2.0 (icon_caption)
  • Training data: 12,789 curated samples from 163 apps
  • Validation: 1,422 samples
  • Best epoch: 9 (val_loss=1.329)
  • Config: batch=16 (8Γ—2 grad_accum), lr=3e-6, fp16, vision tower unfrozen, 231M params
  • Annotation: Claude-powered with original screenshot context + smart filtering pipeline