quant method about kv cache #1024

Open
sitabulaixizawaluduo opened this issue Jan 2, 2025 · 3 comments
Labels
bug Something isn't working
@sitabulaixizawaluduo

Describe the bug
How can I quantize the kv cache but not the weights?

I tried a recipe like this:

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                observer: "minmax"
                strategy: token
                dynamic: true
                symmetric: true
"""

but the config.json produced after quantization contains no compressed-tensors information.
Does llm-compressor support quantizing only the kv cache? If so, what should I do?
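
For reference, here is a minimal sketch of how such a recipe is typically passed to llm-compressor's oneshot entrypoint; the model id, calibration dataset, and sample counts below are placeholders (assumptions), not values taken from this issue:

# Minimal sketch, assuming llm-compressor's oneshot entrypoint and the `recipe`
# string defined above. Model id, dataset name, and sample counts are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-KV-Cache-FP8"  # placeholder output path

oneshot(
    model=model,
    recipe=recipe,              # the kv_cache_scheme recipe shown above
    dataset="open_platypus",    # placeholder calibration dataset
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=SAVE_DIR,
)
tokenizer.save_pretrained(SAVE_DIR)

If the kv cache quantization is applied, the saved config.json should contain a quantization_config entry with a kv_cache_scheme section from compressed-tensors.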

@sitabulaixizawaluduo sitabulaixizawaluduo added the bug Something isn't working label Jan 2, 2025
@Kha-Zix-1

Maybe you can quantize the kv cache when launching the vLLM server? For example:

python -m "vllm.entrypoints.openai.api_server" --model "{path}" --tensor-parallel-size {d} --gpu-memory-utilization 0.9 --served-model-name {model} --max-model-len 32768 --max-seq-len-to-capture 32768 --kv-cache-dtype fp8 --block-size 16 --port {port}

i.e., use --kv-cache-dtype fp8.

@sitabulaixizawaluduo
Author

Maybe you can quantize the kv cache when launching the vLLM server? For example:

python -m "vllm.entrypoints.openai.api_server" --model "{path}" --tensor-parallel-size {d} --gpu-memory-utilization 0.9 --served-model-name {model} --max-model-len 32768 --max-seq-len-to-capture 32768 --kv-cache-dtype fp8 --block-size 16 --port {port}

i.e., use --kv-cache-dtype fp8.

I thought llm-compressor could use per-token kv cache scales to achieve higher accuracy and avoid online quantization, but the quantization result is per-tensor.
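
For comparison, a static per-tensor kv cache scheme, the form used in llm-compressor's fp8 kv cache example (treat the exact values as illustrative), would look like:

# Per-tensor, statically calibrated variant of the kv_cache_scheme above (illustrative).
per_tensor_recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor    # one scale per k/v tensor, computed offline from calibration data
                dynamic: false      # static scales stored in the checkpoint
                symmetric: true
"""

The strategy and dynamic fields are where this differs from the per-token recipe above (strategy: token, dynamic: true).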

@dsikka
Collaborator

dsikka commented Jan 3, 2025

Hi @sitabulaixizawaluduo,
Can you share the exact code you ran to quantize your model?
Thanks

@dsikka dsikka self-assigned this Jan 3, 2025