Sure, here you go; this data was collected on my Mac. I haven't been able to test it on an iPhone, as I don't have the 16, but I believe it should also work on the iPhone 16.
Context and Setup
Device used: Mac Studio M1 Ultra 64GB with macOS 15.0.1
Model tested: Mistral-7B-Instruct v0.3, in two precisions:
- INT4
- FP16
Decoding algorithm: Greedy
First Run (Model Adaptation):
This is the initial execution after the app is first launched, when Core ML adapts the model to your device's hardware. This adaptation is time-consuming because Core ML optimizes the model so that future runs can fully leverage the device's capabilities.
| Configuration | Execution Type | Time to First Token (s) | Generation Speed (tokens/s) |
|---------------|----------------|-------------------------|-----------------------------|
| INT4 | First Run (Model Adaptation) | 32.81 | - |
| INT4 | Subsequent Runs - First Token | 2.62 | 4.55 |
| INT4 | Additional Iteration | 0.32 | 3.90 |
| FP16 | First Run (Model Adaptation) | 87.28 | - |
| FP16 | Subsequent Runs - First Token | 23.17 | 5.93 |
| FP16 | Additional Iteration | 7.52 | 4.09 |