feat(gpu): optimize packing keyswitch on gpu #1908

andrei-stoian-zama · 2024-12-30T15:19:49Z

GPU optimization for packing keyswitch for any level count using GEMM.
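For readers unfamiliar with the GEMM formulation mentioned above, here is a minimal host-side sketch of the idea, not the kernel from this PR. It assumes the decomposed digits of a batch of LWE masks are laid out as one matrix per level, and that the keyswitch key is laid out as one matrix per level as well. The names `Matrix`, `gemm_accumulate` and `keyswitch_as_gemm` are hypothetical, and handling of the LWE body and of the final sign is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Torus = uint64_t;

// Hypothetical row-major matrix stored in a flat vector.
struct Matrix {
  size_t rows, cols;
  std::vector<Torus> data;
  Matrix(size_t r, size_t c) : rows(r), cols(c), data(r * c, 0) {}
  Torus &at(size_t r, size_t c) { return data[r * cols + c]; }
  Torus at(size_t r, size_t c) const { return data[r * cols + c]; }
};

// acc += a * b, with all arithmetic wrapping modulo 2^64.
void gemm_accumulate(const Matrix &a, const Matrix &b, Matrix &acc) {
  assert(a.cols == b.rows && acc.rows == a.rows && acc.cols == b.cols);
  for (size_t i = 0; i < a.rows; ++i)
    for (size_t k = 0; k < a.cols; ++k)
      for (size_t j = 0; j < b.cols; ++j)
        acc.at(i, j) += a.at(i, k) * b.at(k, j);
}

// Assumed layout: digits_per_level[l] is (num_lwes x lwe_dimension) and
// ksk_per_level[l] is (lwe_dimension x output_size). Summing one product per
// level is what lets the same code path serve any level_count.
Matrix keyswitch_as_gemm(const std::vector<Matrix> &digits_per_level,
                         const std::vector<Matrix> &ksk_per_level) {
  assert(!digits_per_level.empty());
  assert(digits_per_level.size() == ksk_per_level.size());
  Matrix acc(digits_per_level[0].rows, ksk_per_level[0].cols);
  for (size_t l = 0; l < digits_per_level.size(); ++l)
    gemm_accumulate(digits_per_level[l], ksk_per_level[l], acc);
  return acc;
}
```

Negative digits are stored as their two's-complement `Torus` value, so the wrapping multiply-add gives the expected result modulo 2^64. On a GPU the triple loop would be replaced by a tiled kernel or a library GEMM; the sketch only shows the algebraic shape.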

Benchmarked with make bench_integer_compression_gpu on a GTX 4060 mobile:

Bench	Timing (ms)	Latency difference w.r.t. previous impl
cuda::packing_compression::pack_u2	1.6173	-35.22%
cuda::packing_compression::pack_u8	1.6211	-31.76%
cuda::packing_compression::pack_u16	1.6316	-32.55%
cuda::packing_compression::pack_u32	1.6345	-38.88%
cuda::packing_compression::pack_u64	1.648	-70.32%
cuda::packing_compression::pack_u128	1.7274	-81.30%
cuda::packing_compression::pack_u256	3.3431	-81.69%
cuda::packing_compression::pack_u512	5.9654	-84.47%

pdroalves · 2025-01-06T13:46:16Z

Performance improvement on H100s is also quite impressive:

Bench	Timing (ms)	Latency difference w.r.t. previous impl
cuda::packing_compression::pack_u2	2.4167	-37.42%
cuda::packing_compression::pack_u8	2.4250	-32.89%
cuda::packing_compression::pack_u16	2.4298	-32.18%
cuda::packing_compression::pack_u32	2.4338	-32.20%
cuda::packing_compression::pack_u64	2.4452	-32.26%
cuda::packing_compression::pack_u128	2.4713	-32.88%
cuda::packing_compression::pack_u256	2.6083	-61.40%
cuda::packing_compression::pack_u512	2.6287	-75.61%

pdroalves

Just some small fixes, although I still want to better understand the algorithm here. Also, let's run some extensive tests to be sure we are not breaking anything before merging.

pdroalves · 2025-01-06T13:49:09Z

backends/tfhe-cuda-backend/cuda/src/crypto/fast_packing_keyswitch.cuh

@@ -31,8 +31,7 @@ __host__ inline bool can_use_pks_fast_path(uint32_t lwe_dimension,
                                           uint32_t polynomial_size,
                                           uint32_t level_count,
                                           uint32_t glwe_dimension) {
-  // TODO: Generalize to level_count > 1 by transposing the KSK
-  return level_count == 1;
+  return true;


I think we can get rid of this function now?

pdroalves · 2025-01-06T13:52:31Z

backends/tfhe-cuda-backend/cuda/src/crypto/fast_packing_keyswitch.cuh


  Torus a_i = lwe_in[read_val_idx];

  Torus state = init_decomposer_state(a_i, base_log, level_count);

  Torus mod_b_mask = (1ll << base_log) - 1ll;
  lwe_out[write_val_idx] = decompose_one<Torus>(state, mod_b_mask, base_log);
+  __syncthreads();


Can you replace __syncthreads() with synchronize_threads_in_block()? This is obviously not wrong, but we are trying to encapsulate some CUDA intrinsics.
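As background for the snippets in this review, the calls to init_decomposer_state and decompose_one follow a stateful pattern: a state is initialized once per coefficient, then one digit is extracted per level while the state carries over to the next level. The host-side sketch below illustrates that pattern under simplifying assumptions. It is not the backend's implementation: the `*_sketch` names are hypothetical, and the tie-breaking rule for a digit equal to half the base is a simplification chosen here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Torus = uint64_t;

// Hypothetical helper: round `value` to the closest multiple of
// 2^(64 - base_log * level_count) and keep only the representable top bits.
Torus init_state_sketch(Torus value, uint32_t base_log, uint32_t level_count) {
  const uint32_t rep_bits = base_log * level_count;
  assert(base_log >= 1 && level_count >= 1 && rep_bits < 64);
  const uint32_t non_rep_bits = 64 - rep_bits;
  Torus rounded = (value >> (non_rep_bits - 1)) + 1; // add the rounding bit
  rounded >>= 1;
  return rounded & ((Torus{1} << rep_bits) - 1);
}

// Hypothetical helper: extract the lowest digit as a signed value in
// (-B/2, B/2] with B = 2^base_log, pushing a carry into the remaining state.
int64_t decompose_one_sketch(Torus &state, uint32_t base_log) {
  assert(base_log >= 1 && base_log < 63);
  const Torus base = Torus{1} << base_log;
  const Torus digit = state & (base - 1);
  state >>= base_log;
  if (digit > base / 2) {
    state += 1; // carry so that digit - B plus the carry keeps the value
    return static_cast<int64_t>(digit) - static_cast<int64_t>(base);
  }
  return static_cast<int64_t>(digit);
}

// Digits are returned from the least significant level to the most significant.
std::vector<int64_t> decompose_sketch(Torus value, uint32_t base_log,
                                      uint32_t level_count) {
  Torus state = init_state_sketch(value, base_log, level_count);
  std::vector<int64_t> digits;
  for (uint32_t k = 0; k < level_count; ++k)
    digits.push_back(decompose_one_sketch(state, base_log));
  return digits;
}

// Rebuild the rounded value modulo 2^64 from its signed digits.
Torus recompose_sketch(const std::vector<int64_t> &digits, uint32_t base_log) {
  const uint32_t level_count = static_cast<uint32_t>(digits.size());
  const uint32_t non_rep_bits = 64 - base_log * level_count;
  Torus acc = 0;
  for (uint32_t k = 0; k < level_count; ++k)
    acc += static_cast<Torus>(digits[k]) << (non_rep_bits + k * base_log);
  return acc;
}
```

Because the state is the only thing that links one level to the next, a kernel that processes levels in separate passes has to keep it somewhere between passes, which is the role the state writes in the diffs above appear to play.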

pdroalves · 2025-01-06T13:53:13Z

backends/tfhe-cuda-backend/cuda/src/crypto/fast_packing_keyswitch.cuh


  Torus a_i = lwe_in[read_val_idx];

  Torus state = init_decomposer_state(a_i, base_log, level_count);

  Torus mod_b_mask = (1ll << base_log) - 1ll;
  lwe_out[write_val_idx] = decompose_one<Torus>(state, mod_b_mask, base_log);
+  __syncthreads();
+  lwe_out[write_state_idx] = state;
+  __syncthreads();


This line is not needed.

pdroalves · 2025-01-06T13:53:29Z

backends/tfhe-cuda-backend/cuda/src/crypto/fast_packing_keyswitch.cuh


  Torus mod_b_mask = (1ll << base_log) - 1ll;

  buffer_in[val_idx] = decompose_one<Torus>(state, mod_b_mask, base_log);
+  __syncthreads();
+  buffer_in[state_idx] = state;
+  __syncthreads();


This line is not needed.

cla-bot bot added the cla-signed label Dec 30, 2024

andrei-stoian-zama requested a review from pdroalves December 30, 2024 20:20

pdroalves requested changes Jan 6, 2025

andrei-stoian-zama requested a review from pdroalves January 9, 2025 15:53

feat(gpu): optimize packing keyswitch on gpu

e120ed8

andrei-stoian-zama force-pushed the feat/as_optimize_pks_all_levels branch from 9589a09 to e120ed8 Compare January 9, 2025 15:55
