
Does llmcompressor support hybrid sparsity? #1037

Open · jiangjiadi opened this issue Jan 6, 2025 · 5 comments · May be fixed by vllm-project/vllm#11889
Labels: enhancement (New feature or request)

@jiangjiadi

Is your feature request related to a problem? Please describe.
I've found that the model's performance (output quality) suffers when 2:4 sparsity is applied to all linear layers, but improves significantly when only some layers are pruned to 2:4 sparsity.

Describe the solution you'd like
Does llmcompressor support such a hybrid compression format? Specifically, compressing the linear layers that meet the 2:4 sparsity criteria in the 2:4 sparse format, while compressing the remaining linear layers with the standard quantization format.

jiangjiadi added the enhancement (New feature or request) label on Jan 6, 2025
@rahul-tuli (Collaborator)

Hi @jiangjiadi,

Thank you for raising this question and for your interest in llm-compressor! The short answer is yes—llm-compressor does support a hybrid compression approach, enabling you to apply 2:4 sparsity to certain layers while using standard quantization for others.

The key to achieving this lies in the configuration of your recipe. Below is an example of how such a recipe might look:

pruning_stage:
    obcq_modifiers:
        SparseGPTModifier:
            sparsity: 0.5
            sequential_update: true
            mask_structure: "2:4"
            targets: ['re:model.layers.1.*$'] # Example: applies 2:4 sparsity only to the layers whose names match this regex

quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"] # Excludes specific layers from quantization
            scheme: "FP8_DYNAMIC"
            targets: ["Linear"] # Applies quantization to all linear layers

    pruning_modifiers:
        ConstantPruningModifier:
            targets: [
                're:.*q_proj.weight',
                're:.*k_proj.weight', 
                're:.*v_proj.weight',
                're:.*o_proj.weight',
                're:.*gate_proj.weight',
                're:.*up_proj.weight',
                're:.*down_proj.weight',
            ]
            start: 0 # Ensures sparsity is retained during quantization

Key Notes:

  1. Flexible Targeting: The targets and ignore attributes in the recipe allow fine-grained control, enabling you to specify which layers should be subjected to sparsity, quantization, or both.
  2. Important Caveat: Any layer targeted for both 2:4 sparsity and quantization must also be included in the ConstantPruningModifier. This ensures the induced sparsity is preserved during the quantization process and avoids any unintended conflicts.
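
For reference, a minimal sketch of how such a recipe might be applied, assuming it has been saved to a local recipe.yaml. The model name, dataset alias, and output path below are placeholders, not values from this issue, and depending on your llm-compressor version a multi-stage recipe may need the stage-aware apply entrypoint rather than oneshot:

from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot

# Placeholder model; substitute your own checkpoint.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", torch_dtype="auto")

oneshot(
    model=model,
    dataset="open_platypus",      # placeholder calibration dataset
    recipe="recipe.yaml",         # the multi-stage recipe shown above
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save in the compressed-tensors format that vLLM can load.
model.save_pretrained("model-2of4-fp8", save_compressed=True)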

Feel free to experiment with these configurations or let us know if you encounter any issues while implementing this setup. We're here to help!

rahul-tuli self-assigned this on Jan 6, 2025
@jiangjiadi (Author)

Hi @rahul-tuli, I got the following error when using the recipe above.
[screenshot of the error message; not transcribed]

@jiangjiadi (Author)

@rahul-tuli After modifying the recipe as follows, I did indeed obtain a model:

sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    WandaPruningModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: true
      targets: ['Qwen2MLP']

quant_stage:
    run_type: oneshot
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"] # Excludes specific layers from quantization
            scheme: "FP8_DYNAMIC"
            targets: ["Linear"] # Applies quantization to all linear layers

    pruning_modifiers:
        ConstantPruningModifier:
            targets: [
                're:.*q_proj.weight',
                're:.*k_proj.weight', 
                're:.*v_proj.weight',
                're:.*o_proj.weight',
                're:.*gate_proj.weight',
                're:.*up_proj.weight',
                're:.*down_proj.weight',
            ]
            start: 0 # Ensures sparsity is retained during quantization

However, I am not sure whether the model is actually quantized in this mixed manner. I inspected the compressed tensors and found no difference in format between the tensors in the MLP layers (e.g. model.layers.0.mlp.down_proj.weight) and those in the attention layers (e.g. model.layers.0.self_attn.k_proj.weight). I also tested the inference speed of this hybrid model against the plain FP8 model and found no significant difference.
[screenshot of the inference-speed comparison; not transcribed]
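
A minimal sketch of the kind of tensor inspection described above, assuming a single-file safetensors checkpoint at a hypothetical path (a sharded checkpoint would need each shard checked):

from safetensors import safe_open

CKPT = "model.safetensors"  # hypothetical path to the compressed checkpoint

with safe_open(CKPT, framework="pt") as f:
    for name in f.keys():
        # Compare how an MLP projection and an attention projection are stored;
        # a layer saved in a sparse compressed format would be expected to carry
        # additional packed/metadata tensors alongside its weight.
        if "layers.0.mlp.down_proj" in name or "layers.0.self_attn.k_proj" in name:
            t = f.get_tensor(name)
            print(name, tuple(t.shape), t.dtype)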

@jiangjiadi (Author)

jiangjiadi commented Jan 9, 2025

Hi @rahul-tuli, I dug into vLLM's handling of partially 2:4-sparse quantized models and found that vLLM currently processes the 2:4-sparse layers as regular quantized layers. PR vllm-project/vllm#11889 addresses this issue.

@jiangjiadi (Author)

Hi @rahul-tuli, I attempted to switch the quantization method to int4 using the recipe below, but I encountered an error when creating the quantized model.
[screenshot of the error message; not transcribed]

Recipe:

sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    WandaPruningModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: true
      targets: ['Qwen2MLP']

quantization_stage:
    run_type: oneshot
    quantization_modifiers:
      GPTQModifier:
        ignore: ["lm_head"]
        dampening_frac: 0.01
        config_groups:
          group_0:
            weights:
              num_bits: 4
              type: "int"
              symmetric: true
              strategy: "group"
              group_size: 128
            targets: ["Linear"]
    pruning_modifiers:
        ConstantPruningModifier:
            targets: [
                're:.*q_proj.weight',
                're:.*k_proj.weight', 
                're:.*v_proj.weight',
                're:.*o_proj.weight',
                're:.*gate_proj.weight',
                're:.*up_proj.weight',
                're:.*down_proj.weight',
            ]
            start: 0 # Ensures sparsity is retained during quantization
