
Does llmcompressor support hybrid sparsity? #1037

Open · jiangjiadi opened this issue Jan 6, 2025 · 5 comments · May be fixed by vllm-project/vllm#11889
Labels: enhancement (New feature or request)

@jiangjiadi

Is your feature request related to a problem? Please describe.
I've found that the model's performance (output quality) suffers when 2:4 sparsity is applied to all linear layers, but improves significantly when only some layers are pruned to 2:4 sparsity.

Describe the solution you'd like
Does llmcompressor support such a hybrid compression format? Specifically, compressing the linear layers that meet the 2:4 sparsity criteria in the 2:4 sparse format, while compressing the remaining linear layers with the standard quantization format.

jiangjiadi added the enhancement (New feature or request) label on Jan 6, 2025
@rahul-tuli (Collaborator)

Hi @jiangjiadi,

Thank you for raising this question and for your interest in llm-compressor! The short answer is yes—llm-compressor does support a hybrid compression approach, enabling you to apply 2:4 sparsity to certain layers while using standard quantization for others.

The key to achieving this lies in the configuration of your recipe. Below is an example of how such a recipe might look:

pruning_stage:
    obcq_modifiers:
        SparseGPTModifier:
            sparsity: 0.5
            sequential_update: true
            mask_structure: "2:4"
            targets: ['re:model.layers.1.*$'] # Example: applies 2:4 sparsity only to the layers whose names match this regex

quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"] # Excludes specific layers from quantization
            scheme: "FP8_DYNAMIC"
            targets: ["Linear"] # Applies quantization to all linear layers

    pruning_modifiers:
        ConstantPruningModifier:
            targets: [
                're:.*q_proj.weight',
                're:.*k_proj.weight', 
                're:.*v_proj.weight',
                're:.*o_proj.weight',
                're:.*gate_proj.weight',
                're:.*up_proj.weight',
                're:.*down_proj.weight',
            ]
            start: 0 # Ensures sparsity is retained during quantization

Key Notes:

  1. Flexible Targeting: The targets and ignore attributes in the recipe allow fine-grained control, enabling you to specify which layers should be subjected to sparsity, quantization, or both.
  2. Important Caveat: Any layer targeted for both 2:4 sparsity and quantization must also be included in the ConstantPruningModifier. This ensures the induced sparsity is preserved during the quantization process and avoids any unintended conflicts.
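
For reference, a minimal sketch of how such a recipe might be applied, assuming it has been saved to a local recipe.yaml. The model name, dataset alias, and output path below are placeholders, not values from this issue, and depending on your llm-compressor version a multi-stage recipe may need the stage-aware apply entrypoint rather than oneshot:

from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot

# Placeholder model; substitute your own checkpoint.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", torch_dtype="auto")

oneshot(
    model=model,
    dataset="open_platypus",      # placeholder calibration dataset
    recipe="recipe.yaml",         # the multi-stage recipe shown above
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save in the compressed-tensors format that vLLM can load.
model.save_pretrained("model-2of4-fp8", save_compressed=True)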

Feel free to experiment with these configurations or let us know if you encounter any issues while implementing this setup. We're here to help!

rahul-tuli self-assigned this on Jan 6, 2025
@jiangjiadi (Author)

Hi @rahul-tuli, I got the following error when using the recipe above.
[screenshot of the error message; not transcribed]

@jiangjiadi (Author)

@rahul-tuli After modifying the recipe as follows, I did indeed obtain a model:

sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    WandaPruningModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: true
      targets: ['Qwen2MLP']

quant_stage:
    run_type: oneshot
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"] # Excludes specific layers from quantization
            scheme: "FP8_DYNAMIC"
            targets: ["Linear"] # Applies quantization to all linear layers

    pruning_modifiers:
        ConstantPruningModifier:
            targets: [
                're:.*q_proj.weight',
                're:.*k_proj.weight', 
                're:.*v_proj.weight',
                're:.*o_proj.weight',
                're:.*gate_proj.weight',
                're:.*up_proj.weight',
                're:.*down_proj.weight',
            ]
            start: 0 # Ensures sparsity is retained during quantization

However, I am not sure whether the model is actually quantized in this mixed manner. I inspected the compressed tensors and found no difference in format between the tensors in the MLP layers (e.g. model.layers.0.mlp.down_proj.weight) and those in the attention layers (e.g. model.layers.0.self_attn.k_proj.weight). I also tested the inference speed of this hybrid model against the plain FP8 model and found no significant difference.
[screenshot of the inference-speed comparison; not transcribed]
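
A minimal sketch of the kind of tensor inspection described above, assuming a single-file safetensors checkpoint at a hypothetical path (a sharded checkpoint would need each shard checked):

from safetensors import safe_open

CKPT = "model.safetensors"  # hypothetical path to the compressed checkpoint

with safe_open(CKPT, framework="pt") as f:
    for name in f.keys():
        # Compare how an MLP projection and an attention projection are stored;
        # a layer saved in a sparse compressed format would be expected to carry
        # additional packed/metadata tensors alongside its weight.
        if "layers.0.mlp.down_proj" in name or "layers.0.self_attn.k_proj" in name:
            t = f.get_tensor(name)
            print(name, tuple(t.shape), t.dtype)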

@jiangjiadi (Author)

jiangjiadi commented Jan 9, 2025

Hi @rahul-tuli, I dug into vLLM's handling of partially 2:4-sparse quantized models and found that vLLM currently processes the 2:4-sparse layers as regular quantized layers. PR vllm-project/vllm#11889 addresses this issue.

@jiangjiadi (Author)

Hi @rahul-tuli, I attempted to switch the quantization method to int4 using the recipe below, but I encountered an error when creating the quantized model.
[screenshot of the error message; not transcribed]

Recipe:

sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    WandaPruningModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: true
      targets: ['Qwen2MLP']

quantization_stage:
    run_type: oneshot
    quantization_modifiers:
      GPTQModifier:
        ignore: ["lm_head"]
        dampening_frac: 0.01
        config_groups:
          group_0:
            weights:
              num_bits: 4
              type: "int"
              symmetric: true
              strategy: "group"
              group_size: 128
            targets: ["Linear"]
    pruning_modifiers:
        ConstantPruningModifier:
            targets: [
                're:.*q_proj.weight',
                're:.*k_proj.weight', 
                're:.*v_proj.weight',
                're:.*o_proj.weight',
                're:.*gate_proj.weight',
                're:.*up_proj.weight',
                're:.*down_proj.weight',
            ]
            start: 0 # Ensures sparsity is retained during quantization
