Qwen3
This version of Qwen3-4B-Int8 has been converted to run on the Axera NPU using w8a16 quantization.
Compatible with Pulsar2 version: 4.2 (not yet released)
If you are interested in model conversion, you can try exporting an axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen3-4B
Pulsar2 Link: How to Convert LLM from Huggingface to axmodel
| Chips | w8a16 | w4a16 |
|---|---|---|
| AX650 | 4.5 tokens/sec | TBD |
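As a rough back-of-envelope check on the w8a16 figure above (illustrative only; real decode speed varies with context length and sampling settings):

```python
# Rough latency estimate from the AX650 w8a16 throughput in the table above.
def decode_seconds(num_tokens: float, tokens_per_sec: float) -> float:
    """Time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_sec

# A 256-token reply at 4.5 tokens/sec takes just under a minute.
print(f"{decode_seconds(256, 4.5):.1f} s")  # → 56.9 s
```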
Method 1: clone the repository and run the install script:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Method 2: one-line install (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Method 3: download the prebuilt executable exported by GitHub Actions CI (for users without a build environment):
If you do not have a build environment, go to:
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
download the latest CI-exported executable (axllm), then:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
First create the model directory and enter it, then download the model into it:
mkdir -p AXERA-TECH/Qwen3-4B
cd AXERA-TECH/Qwen3-4B
hf download AXERA-TECH/Qwen3-4B --local-dir .
# structure of the downloaded files
.
└── AXERA-TECH
└── Qwen3-4B
├── README.md
├── config.json
├── model.embed_tokens.weight.bfloat16.bin
├── post_config.json
├── qwen3_p128_l0_together.axmodel
...
├── qwen3_p128_l9_together.axmodel
├── qwen3_post.axmodel
└── qwen3_tokenizer.txt
2 directories, 42 files
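The per-layer weights follow the `qwen3_p128_l<idx>_together.axmodel` naming shown in the listing above. One gotcha if you ever script over these files: a plain lexicographic sort puts `l10` before `l2`, so sort numerically by the layer index. A small sketch (the `layer_index` helper is hypothetical, not part of ax-llm):

```python
import re

# Pattern taken from the file listing above:
# qwen3_p128_l<idx>_together.axmodel
LAYER_RE = re.compile(r"qwen3_p128_l(\d+)_together\.axmodel$")

def layer_index(name: str) -> int:
    """Extract the numeric layer index from a per-layer axmodel filename."""
    m = LAYER_RE.search(name)
    if m is None:
        raise ValueError(f"not a layer file: {name}")
    return int(m.group(1))

files = ["qwen3_p128_l9_together.axmodel", "qwen3_p128_l0_together.axmodel"]
print(sorted(files, key=layer_index)[0])  # → qwen3_p128_l0_together.axmodel
```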
(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-4B/
[I][ Init][ 127]: LLM init start
tokenizer_type = 1
97% | ████████████████████████████████ | 38 / 39 [13.13s<13.47s, 2.89 count/s] init post axmodel ok,remain_cmm(5175 MB)
[I][ Init][ 188]: max_token_len : 1023
[I][ Init][ 191]: kv_cache_size : 1024, kv_cache_num: 1023
[I][ Init][ 194]: prefill_token_num : 128
[I][ Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 198]: grp: 2, prefill_max_kv_cache_num : 128
[I][ Init][ 198]: grp: 3, prefill_max_kv_cache_num : 256
[I][ Init][ 198]: grp: 4, prefill_max_kv_cache_num : 384
[I][ Init][ 198]: grp: 5, prefill_max_kv_cache_num : 512
[I][ Init][ 203]: prefill_max_token_num : 512
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 39 / 39 [13.13s<13.13s, 2.97 count/s] embed_selector init ok
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 1,
"top_p": 0.8
}
[I][ Init][ 224]: LLM init ok
Type "q" to exit
Ctrl+c to stop current running
"reset" to reset kvcache
"dd" to remove last conversation.
"pp" to print history.
----------------------------------------
prompt >> who are you
[I][ SetKVCache][ 357]: prefill_grpid:2 kv_cache_num:128 precompute_len:0 input_num_token:22
[I][ SetKVCache][ 359]: current prefill_max_token_num:512
[I][ SetKVCache][ 360]: first run
[I][ Run][ 412]: input token num : 22, prefill_split_num : 1
[I][ Run][ 474]: ttft: 910.64 ms
<think>
Okay, the user asked, "who are you?" I need to respond in a friendly and informative way. Let me start by introducing myself clearly. I should mention that I'm Qwen, developed by Alibaba Cloud. It's important to highlight my capabilities, like answering questions, creating content, and helping with various tasks. I should also note that I can communicate in multiple languages. Maybe add something about being a helpful assistant. Keep it concise but cover the key points. Let me check if there's anything else important to include. Oh, and make sure the tone is approachable and not too technical. Alright, that should cover it.
</think>
Hello! I'm Qwen, a large language model developed by Alibaba Cloud. I can help with answering questions, creating content, and assisting with various tasks. I can communicate in multiple languages and am designed to be helpful and friendly. How can I assist you today? 😊
[N][ Run][ 554]: hit eos,avg 4.36 token/s
[I][ GetKVCache][ 331]: precompute_len:211, remaining:301
prompt >> q
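The log above hints at how the runtime buckets a prompt into a prefill group: the 22-token prompt lands in grp 2 (`prefill_max_kv_cache_num: 128`), and after the reply, `remaining:301` is `prefill_max_token_num` (512) minus `precompute_len` (211). A sketch of that bookkeeping, inferred from the log rather than taken from the actual ax-llm implementation:

```python
# Capacities per prefill group, copied from the init log above.
PREFILL_GROUPS = [1, 128, 256, 384, 512]  # prefill_max_kv_cache_num per group
PREFILL_MAX_TOKEN_NUM = 512

def pick_prefill_group(input_tokens: int) -> int:
    """Return the 1-based group whose KV-cache capacity covers the prompt."""
    for grp, cap in enumerate(PREFILL_GROUPS, start=1):
        if input_tokens <= cap:
            return grp
    raise ValueError("prompt longer than prefill_max_token_num")

print(pick_prefill_group(22))        # → 2, matching "prefill_grpid:2" above
print(PREFILL_MAX_TOKEN_NUM - 211)   # → 301, matching "remaining:301" above
```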
(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-4B/
[I][ Init][ 127]: LLM init start
tokenizer_type = 1
97% | ████████████████████████████████ | 38 / 39 [9.17s<9.41s, 4.15 count/s] init post axmodel ok,remain_cmm(5175 MB)
[I][ Init][ 188]: max_token_len : 1023
[I][ Init][ 191]: kv_cache_size : 1024, kv_cache_num: 1023
[I][ Init][ 194]: prefill_token_num : 128
[I][ Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 198]: grp: 2, prefill_max_kv_cache_num : 128
[I][ Init][ 198]: grp: 3, prefill_max_kv_cache_num : 256
[I][ Init][ 198]: grp: 4, prefill_max_kv_cache_num : 384
[I][ Init][ 198]: grp: 5, prefill_max_kv_cache_num : 512
[I][ Init][ 203]: prefill_max_token_num : 512
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 39 / 39 [9.17s<9.17s, 4.25 count/s] embed_selector init ok
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 1,
"top_p": 0.8
}
[I][ Init][ 224]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-4B'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-4B
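Note that the sampler config printed at startup enables only top-k sampling with `top_k: 1`, which degenerates to greedy decoding (always pick the highest-logit token). A minimal sketch of that selection rule; `top_k_sample` is a hypothetical helper for illustration, not the runtime's code:

```python
import math
import random

def top_k_sample(logits, k, rng=random.Random(0)):
    """Sample among the k highest-logit tokens, softmax-weighted."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the max logit before exp() for numerical stability.
    weights = [math.exp(logits[i] - logits[top[0]]) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.5, -1.0, 0.7]
print(top_k_sample(logits, k=1))  # → 1: with k=1 this is plain argmax
```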
from openai import OpenAI
API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-4B"
messages = [
{"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
{"role": "user", "content": "hello"},
]
client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
model=MODEL,
messages=messages,
)
print(completion.choices[0].message.content)
from openai import OpenAI
API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-4B"
messages = [
{"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
{"role": "user", "content": "hello"},
]
client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
model=MODEL,
messages=messages,
stream=True,
)
print("assistant:")
for ev in stream:
delta = getattr(ev.choices[0], "delta", None)
if delta and getattr(delta, "content", None):
print(delta.content, end="", flush=True)
print()