Error when quantizing LLama 3.3 70b to FP8 #963
Comments
Hello! @Syst3m1cAn0maly I have not encountered your bug yet, but I have encountered a similar bug.
Thank you @Kha-Zix-1 it works with the workaround you provided.
Details: vllm-project#963
I get a consistent error when quantizing Llama 3.3 70B to FP8 (static) on 2xH100. It always fails at the 82nd step of calibration with the following error:
Loading checkpoint shards: 100% 30/30 [00:03<00:00, 7.96it/s]
Loading checkpoint shards: 100% 30/30 [01:09<00:00, 1.98s/it]
Map: 100% 512/512 [00:00<00:00, 1513.65 examples/s]
Map: 100% 512/512 [00:01<00:00, 458.48 examples/s]
2024-12-06T23:51:22.108862+0000 | main | WARNING - Process rank: 0, device: cuda:0, n_gpu: 2, distributed training: True, 16-bits training: False
2024-12-06T23:51:22.110751+0000 | main | INFO - Training/evaluation parameters TrainingArguments(
_n_gpu=2,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
clear_sparse_session=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_oneshot=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/data/models/Llama-3.3-70B-Instruct-FP8/runs/Dec06_23-51-22_2343050e0892,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
oneshot_device=cuda:0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=/data/models/Llama-3.3-70B-Instruct-FP8,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
recipe=
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
,
recipe_args=None,
remove_unused_columns=True,
report_to=[],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=/data/models/Llama-3.3-70B-Instruct-FP8,
run_stages=False,
save_compressed=True,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
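For readers unfamiliar with the recipe embedded in the arguments above: it asks for static (`dynamic: false`), symmetric, per-tensor FP8 quantization of both weights and input activations on every `Linear` layer except `lm_head`. Static activation scales are what make the calibration pass necessary. The sketch below only illustrates the arithmetic, in plain Python. `AbsMaxObserver` and `fake_quantize` are hypothetical names, and the running abs-max rule is an assumption; llm-compressor's real observers may compute the range differently.

```python
# Largest finite value representable in float8 e4m3fn.
FP8_E4M3_MAX = 448.0


class AbsMaxObserver:
    """Hypothetical observer: tracks the largest |x| seen across calibration batches."""

    def __init__(self):
        self.abs_max = 0.0

    def update(self, values):
        # One call per calibration batch; keep the running maximum magnitude.
        self.abs_max = max(self.abs_max, max(abs(v) for v in values))

    def scale(self):
        # Symmetric per-tensor scale: map the observed range onto [-448, 448].
        return self.abs_max / FP8_E4M3_MAX if self.abs_max > 0 else 1.0


def fake_quantize(values, scale):
    # Scale, clamp to the FP8 range, and scale back.
    # Rounding to the FP8 grid is deliberately omitted in this sketch.
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) * scale for v in values]


observer = AbsMaxObserver()
observer.update([0.5, -2.0])   # calibration batch 1
observer.update([1.0, 4.48])   # calibration batch 2
static_scale = observer.scale()
clipped = fake_quantize([10.0], static_scale)  # value outside the calibrated range
```

Once calibration ends, the scale is frozen, so any later activation larger than the calibrated maximum is clipped, as the last line shows.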
2024-12-06T23:51:22.574722+0000 | _check_create_state | INFO - State created for compression lifecycle
2024-12-06T23:51:22.576664+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-12-06T23:51:22.577698+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-12-06T23:51:22.634068+0000 | one_shot | INFO - *** One Shot ***
2024-12-06T23:51:22.701228+0000 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created
/opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/session_mixin.py:95: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
2024-12-06T23:51:23.012585+0000 | _calibrate | INFO - Running QuantizationModifier calibration with 512 samples...
16%|█▌ | 82/512 [04:17<22:32, 3.14s/it]
RuntimeError Traceback (most recent call last)
Cell In[2], line 76
66 return tokenizer(
67 sample["text"],
68 padding=False,
(...)
71 add_special_tokens=False,
72 )
74 ds = ds.map(tokenize, remove_columns=ds.column_names)
---> 76 oneshot(
77 model=model,
78 output_dir=output_dir,
79 dataset=ds,
80 recipe=recipe,
81 max_seq_length=MAX_SEQUENCE_LENGTH,
82 num_calibration_samples=NUM_CALIBRATION_SAMPLES,
83 save_compressed=True,
84 )
File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/text_generation.py:76, in oneshot(**kwargs)
74 model_args, data_args, training_args = parse_args(**kwargs)
75 training_args.do_oneshot = True
---> 76 main(model_args, data_args, training_args)
File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/text_generation.py:363, in main(model_args, data_args, training_args)
361 # One Shot
362 if training_args.do_oneshot:
--> 363 stage_runner.one_shot()
365 # Evaluation
366 if training_args.do_eval:
File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/runner.py:171, in StageRunner.one_shot(self, stage)
167 self.trainer.model(**dummy_inp)
169 self.trainer.accelerator.wait_for_everyone()
--> 171 self.trainer.one_shot(calibration_data=calib_data, stage=stage)
173 if is_fsdp_model(self.trainer.model):
174 try:
File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/session_mixin.py:439, in SessionManagerMixIn.one_shot(self, calibration_data, stage)
430 def one_shot(
431 self, calibration_data: Optional[DataLoader] = None, stage: Optional[str] = None
432 ):
433 """
434 Run oneshot calibration on the active model
435
436 :param stage: which stage of the recipe to run, or None to run whole recipe
437 :param calib_data: dataloader of calibration data
438 """
--> 439 apply(
440 recipe=self.recipe,
441 recipe_stage=stage,
442 recipe_args=self.recipe_args,
443 model=self.model,
444 calib_data=calibration_data,
445 start=-1,
446 copy_data=False,
447 accelerator=self.accelerator,
448 min_tokens_per_module=self.min_tokens_per_module,
449 )
451 # log model sparsity
452 # self.maybe_log_model_sparsification()
453 self.accelerator.wait_for_everyone()
File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/session_functions.py:184, in apply(recipe, recipe_stage, recipe_args, model, teacher_model, train_data, val_data, test_data, calib_data, copy_data, start, steps_per_epoch, batches_per_step, **kwargs)
146 def apply(
147 recipe: Union[str, List[str], "Recipe", List["Recipe"], None] = None,
148 recipe_stage: Union[str, List[str], None] = None,
(...)
160 **kwargs,
161 ) -> ModifiedState:
162 """
163 A method to apply the recipe in one-shot manner. This will invoke the initialize
164 and then finalize methods for each modifier in the active session's lifecycle.
(...)
182 :return: the modified state of the active session after applying the recipe
183 """
--> 184 return active_session().apply(
185 recipe=recipe,
186 recipe_stage=recipe_stage,
187 recipe_args=recipe_args,
188 model=model,
189 teacher_model=teacher_model,
190 train_data=train_data,
191 val_data=val_data,
192 test_data=test_data,
193 calib_data=calib_data,
194 copy_data=copy_data,
195 start=start,
196 steps_per_epoch=steps_per_epoch,
197 batches_per_step=batches_per_step,
198 **kwargs,
199 )
File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/session.py:210, in CompressionSession.apply(self, **kwargs)
201 def apply(self, **kwargs):
202 """
203 Apply the recipe in one-shot manner. This will invoke the initialize
204 and then finalize methods for each modifier in the session's lifecycle.
(...)
208 finalize methods
209 """
--> 210 self.initialize(**kwargs)
212 return self.finalize(**kwargs)
File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/session.py:156, in CompressionSession.initialize(self, recipe, recipe_stage, recipe_args, model, teacher_model, optimizer, attach_optim_callbacks, train_data, val_data, test_data, calib_data, copy_data, start, steps_per_epoch, batches_per_step, loggers, **kwargs)
105 def initialize(
106 self,
107 recipe: Union[str, List[str], "Recipe", List["Recipe"], None] = None,
(...)
123 **kwargs,
124 ) -> ModifiedState:
125 """
126 Initialize the session for compression. This will run the initialize method
127 for each modifier in the session's lifecycle. This will also set the session's
(...)
153 :return: the modified state of the session after initializing
154 """
--> 156 mod_data = self._lifecycle.initialize(
157 recipe=recipe,
158 recipe_stage=recipe_stage,
159 recipe_args=recipe_args,
160 model=model,
161 teacher_model=teacher_model,
162 optimizer=optimizer,
163 attach_optim_callbacks=attach_optim_callbacks,
164 train_data=train_data,
165 val_data=val_data,
166 test_data=test_data,
167 calib_data=calib_data,
168 copy_data=copy_data,
169 start=start,
170 steps_per_epoch=steps_per_epoch,
171 batches_per_step=batches_per_step,
172 loggers=loggers,
173 **kwargs,
174 )
176 return ModifiedState(
177 model=self.state.model,
178 optimizer=self.state.optimizer,
179 loss=self.state.loss,
180 modifier_data=mod_data,
181 )
File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/lifecycle.py:126, in CompressionLifecycle.initialize(self, **kwargs)
124 mod_data = []
125 for mod in self.modifiers:
--> 126 data = mod.initialize(state=self.state, **extras)
127 logger.debug("Initialized modifier: {}", mod)
128 if data is not None:
File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/stage.py:124, in StageModifiers.initialize(self, state, **kwargs)
122 accelerator = kwargs.get("accelerator", None)
123 for modifier in self.modifiers:
--> 124 modifier.initialize(state, **kwargs)
125 if accelerator:
126 accelerator.wait_for_everyone()
File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/modifier.py:119, in Modifier.initialize(self, state, **kwargs)
113 if (
114 self.calculate_end() >= 0
115 and state.start_event.current_index >= self.calculate_end()
116 ):
117 return
--> 119 initialized = self.on_initialize(state=state, **kwargs)
121 if not isinstance(initialized, bool):
122 raise ValueError(
123 "on_initialize must return a boolean value; "
124 "True for success, False for not initialized"
125 )
File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py:105, in QuantizationModifier.on_initialize(self, state, **kwargs)
103 module.apply(apply_calibration_status)
104 self.calibration_hooks_ = []
--> 105 self._calibrate_if_possible(module)
106 self._check_token_distribution(
107 module, threshold=kwargs.get("min_tokens_per_module")
108 )
109 module.apply(freeze_module_quantization)
File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py:268, in QuantizationModifier._calibrate_if_possible(self, module)
266 module.apply(lambda model: initialize_observer(model, base_name="output"))
267 module.apply(self.register_calibration_hooks)
--> 268 self.calibrate(module)
269 module.apply(set_unset_kv_cache)
270 for h in self.calibration_hooks:
File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py:325, in QuantizationModifier.calibrate(self, module)
322 module_training = module.training
323 module.eval()
--> 325 run_calibration_forward(
326 module,
327 self.calibration_dataloader,
328 self.num_calibration_steps,
329 self.calibration_function_,
330 )
332 if module_training:
333 module.train()
File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py:105, in run_calibration_forward(model, calibration_dataloader, num_calibration_steps, calibration_function, device, mask_padding)
101 intermediates.append((e.args, e.kwargs))
103 # TODO: not ideal, figure out where we aren't freeing memory instead
104 # currently without this we run OOM on the 2nd forward pass
--> 105 torch.cuda.empty_cache()
107 return intermediates
File /opt/conda/lib/python3.11/site-packages/torch/cuda/memory.py:170, in empty_cache()
159 r"""Release all unoccupied cached memory currently held by the caching
160 allocator so that those can be used in other GPU application and visible in
161 `nvidia-smi`.
(...)
167 more details about GPU memory management.
168 """
169 if is_initialized():
--> 170 torch._C._cuda_emptyCache()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
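Since the error message says the failure may be reported asynchronously, the frame at `torch.cuda.empty_cache()` is not necessarily where the fault occurred. A minimal sketch of the suggested `CUDA_LAUNCH_BLOCKING=1` setting for a notebook follows. `enable_sync_cuda_errors` is a hypothetical helper name. The sketch assumes the variable is set before CUDA is initialized in the process, so in a notebook the kernel would likely need a restart first.

```python
import os


def enable_sync_cuda_errors(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 so kernel launches run synchronously.

    Intended to run before torch initializes CUDA; the safest place is
    before `import torch` at the top of a fresh process or notebook kernel.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_sync_cuda_errors()

# Import torch only after the variable is set, then re-run the same
# oneshot(...) call from the failing cell:
# import torch
```

With synchronous launches the run will be slower, but the traceback should point at the call that actually triggered the launch failure.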