Unrolling uint64 encoding #97

tenderlove · 2024-05-23T21:48:45Z

We have to encode 64 bit integers all the time (basically every field header requires it), so I tried unrolling the uint64 encoding logic to see what kind of performance gains we can get.

I tried 3 different ways of unrolling the loop:

Just a naive way translating the while loop to if statements
Same as 1, but eliminating intermediate local variable writes
Rather than appending to the buffer for each byte, use the Array#pack optimization

The code is below:

Click me for benchmark

# frozen_string_literal: true

require "benchmark/ips"

def uint64_enc1 x, buff
  while x != 0
    byte = x & 0x7F
    x >>= 7
    byte |= 0x80 if x > 0
    buff << byte
  end
end

def uint64_enc2 x, buff
  if (x > 0)
    byte = (x >> 0) & 0x7F

    if (x >> 7) > 0
      buff << (byte | 0x80)

      byte = (x >> 7) & 0x7F

      if (x >> 14) > 0
        buff << (byte | 0x80)

        byte = (x >> 14) & 0x7F

        if (x >> 21) > 0
          buff << (byte | 0x80)

          byte = (x >> 21) & 0x7F

          if (x >> 28) > 0
            buff << (byte | 0x80)

            byte = (x >> 28) & 0x7F

            if (x >> 35) > 0
              buff << (byte | 0x80)

              byte = (x >> 35) & 0x7F

              if (x >> 42) > 0
                buff << (byte | 0x80)

                byte = (x >> 42) & 0x7F

                if (x >> 49) > 0
                  buff << (byte | 0x80)

                  byte = (x >> 49) & 0x7F

                  if (x >> 56) > 0
                    buff << (byte | 0x80)

                    byte = (x >> 56) & 0x7F

                    if (x >> 63) > 0
                      buff << (byte | 0x80)

                      buff << 1
                    else
                      buff << byte
                    end
                  else
                    buff << byte
                  end
                else
                  buff << byte
                end
              else
                buff << byte
              end
            else
              buff << byte
            end
          else
            buff << byte
          end
        else
          buff << byte
        end
      else
        buff << byte
      end
    else
      buff << byte
    end
  end
end

def uint64_enc3 x, buff
  if (x > 0)
    if (x >> 7) > 0
      buff << (((x >> 0) & 0x7F) | 0x80)

      if (x >> 14) > 0
        buff << (((x >> 7) & 0x7F) | 0x80)

        if (x >> 21) > 0
          buff << (((x >> 14) & 0x7F) | 0x80)

          if (x >> 28) > 0
            buff << (((x >> 21) & 0x7F) | 0x80)

            if (x >> 35) > 0
              buff << (((x >> 28) & 0x7F) | 0x80)

              if (x >> 42) > 0
                buff << (((x >> 35) & 0x7F) | 0x80)

                if (x >> 49) > 0
                  buff << (((x >> 42) & 0x7F) | 0x80)

                  if (x >> 56) > 0
                    buff << (((x >> 49) & 0x7F) | 0x80)

                    if (x >> 63) > 0
                      buff << (((x >> 56) & 0x7F) | 0x80)

                      buff << 1
                    else
                      buff << (x >> 56)
                    end
                  else
                    buff << (x >> 49)
                  end
                else
                  buff << (x >> 42)
                end
              else
                buff << (x >> 35)
              end
            else
              buff << (x >> 28)
            end
          else
            buff << (x >> 21)
          end
        else
          buff << (x >> 14)
        end
      else
        buff << (x >> 7)
      end
    else
      buff << x
    end
  end
end

def uint64_enc4 x, buff
  if (x > 0)
    if (x >> 7) > 0
      byte0 = (((x >> 0) & 0x7F) | 0x80)

      if (x >> 14) > 0
        byte1 = (((x >> 7) & 0x7F) | 0x80)

        if (x >> 21) > 0
          byte2 = (((x >> 14) & 0x7F) | 0x80)

          if (x >> 28) > 0
            byte3 = (((x >> 21) & 0x7F) | 0x80)

            if (x >> 35) > 0
              byte4 = (((x >> 28) & 0x7F) | 0x80)

              if (x >> 42) > 0
                byte5 = (((x >> 35) & 0x7F) | 0x80)

                if (x >> 49) > 0
                  byte6 = (((x >> 42) & 0x7F) | 0x80)

                  if (x >> 56) > 0
                    byte7 = (((x >> 49) & 0x7F) | 0x80)

                    if (x >> 63) > 0
                      byte8 = (((x >> 56) & 0x7F) | 0x80)

                      buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, byte8, 1].pack("CCCCCCCCCC")
                    else
                      buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, (x >> 56)].pack("CCCCCCCCC")
                    end
                  else
                    buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, (x >> 49)].pack("CCCCCCCC")
                  end
                else
                  buff << [byte0, byte1, byte2, byte3, byte4, byte5, (x >> 42)].pack("CCCCCCC")
                end
              else
                buff << [byte0, byte1, byte2, byte3, byte4, (x >> 35)].pack("CCCCCC")
              end
            else
              buff << [byte0, byte1, byte2, byte3, (x >> 28)].pack("CCCCC")
            end
          else
            buff << [byte0, byte1, byte2, (x >> 21)].pack("CCCC")
          end
        else
          buff << [byte0, byte1, (x >> 14)].pack("CCC")
        end
      else
        buff << [byte0, (x >> 7)].pack("CC")
      end
    else
      buff << [(x >> 0)].pack("C")
    end
  end
end

Benchmark.ips do |x|
  x.report("uint64_enc1 (loop)") {
    uint64_enc1(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc2 (unroll)") {
    uint64_enc2(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc3 (unroll better)") {
    uint64_enc3(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc4 (Array#pack)") {
    uint64_enc4(0xFFFF_FFFF, "".b)
  }
end

Results are below.

Interpreter:

$ ruby test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)   345.396k i/100ms
uint64_enc2 (unroll)   380.889k i/100ms
uint64_enc3 (unroll better)
                       415.911k i/100ms
uint64_enc4 (Array#pack)
                       319.319k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)      3.413M (± 2.4%) i/s -     17.270M in   5.063010s
uint64_enc2 (unroll)      3.743M (± 1.2%) i/s -     19.044M in   5.088180s
uint64_enc3 (unroll better)
                          4.119M (± 0.6%) i/s -     20.796M in   5.048463s
uint64_enc4 (Array#pack)
                          3.144M (± 0.7%) i/s -     15.966M in   5.078173s

YJIT:

$ ruby --yjit test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) +YJIT [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)     1.067M i/100ms
uint64_enc2 (unroll)     1.095M i/100ms
uint64_enc3 (unroll better)
                         1.089M i/100ms
uint64_enc4 (Array#pack)
                       640.782k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)     11.533M (± 0.5%) i/s -     58.679M in   5.087999s
uint64_enc2 (unroll)     11.700M (± 0.6%) i/s -     59.114M in   5.052717s
uint64_enc3 (unroll better)
                         11.760M (± 0.5%) i/s -     59.869M in   5.090842s
uint64_enc4 (Array#pack)
                          6.785M (± 1.4%) i/s -     33.961M in   5.006533s

Unrolling the loop seems to help the interpreter, but doesn't seem to do much for YJIT. I'm a little bit surprised by how slow the Array#pack solution is on YJIT, I expected it to fare better since I merged ruby/ruby#8734

The text was updated successfully, but these errors were encountered:

maximecb · 2024-05-27T14:37:51Z

Might make sense to look at the disasm and make sure that your optimization is getting applied, see what the generated code looks like.

I'm a bit surprised that unrolling doesn't help YJIT at all. Wondering if there is a big source of overhead somewhere else or something?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unrolling uint64 encoding #97

Unrolling uint64 encoding #97

tenderlove commented May 23, 2024 •

edited

Loading

maximecb commented May 27, 2024

Unrolling uint64 encoding #97

Unrolling uint64 encoding #97

Comments

tenderlove commented May 23, 2024 • edited Loading

maximecb commented May 27, 2024

tenderlove commented May 23, 2024 •

edited

Loading