Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unrolling uint64 encoding #97

Open
tenderlove opened this issue May 23, 2024 · 1 comment
Open

Unrolling uint64 encoding #97

tenderlove opened this issue May 23, 2024 · 1 comment

Comments

@tenderlove
Copy link
Contributor

tenderlove commented May 23, 2024

We have to encode 64 bit integers all the time (basically every field header requires it), so I tried unrolling the uint64 encoding logic to see what kind of performance gains we can get.

I tried 3 different ways of unrolling the loop:

  1. Just a naive way translating the while loop to if statements
  2. Same as 1, but eliminating intermediate local variable writes
  3. Rather than appending to the buffer for each byte, use the Array#pack optimization

The code is below:

Click me for benchmark
# frozen_string_literal: true

require "benchmark/ips"

def uint64_enc1 x, buff
  while x != 0
    byte = x & 0x7F
    x >>= 7
    byte |= 0x80 if x > 0
    buff << byte
  end
end

def uint64_enc2 x, buff
  if (x > 0)
    byte = (x >> 0) & 0x7F

    if (x >> 7) > 0
      buff << (byte | 0x80)

      byte = (x >> 7) & 0x7F

      if (x >> 14) > 0
        buff << (byte | 0x80)

        byte = (x >> 14) & 0x7F

        if (x >> 21) > 0
          buff << (byte | 0x80)

          byte = (x >> 21) & 0x7F

          if (x >> 28) > 0
            buff << (byte | 0x80)

            byte = (x >> 28) & 0x7F

            if (x >> 35) > 0
              buff << (byte | 0x80)

              byte = (x >> 35) & 0x7F

              if (x >> 42) > 0
                buff << (byte | 0x80)

                byte = (x >> 42) & 0x7F

                if (x >> 49) > 0
                  buff << (byte | 0x80)

                  byte = (x >> 49) & 0x7F

                  if (x >> 56) > 0
                    buff << (byte | 0x80)

                    byte = (x >> 56) & 0x7F

                    if (x >> 63) > 0
                      buff << (byte | 0x80)

                      buff << 1
                    else
                      buff << byte
                    end
                  else
                    buff << byte
                  end
                else
                  buff << byte
                end
              else
                buff << byte
              end
            else
              buff << byte
            end
          else
            buff << byte
          end
        else
          buff << byte
        end
      else
        buff << byte
      end
    else
      buff << byte
    end
  end
end

def uint64_enc3 x, buff
  if (x > 0)
    if (x >> 7) > 0
      buff << (((x >> 0) & 0x7F) | 0x80)

      if (x >> 14) > 0
        buff << (((x >> 7) & 0x7F) | 0x80)

        if (x >> 21) > 0
          buff << (((x >> 14) & 0x7F) | 0x80)

          if (x >> 28) > 0
            buff << (((x >> 21) & 0x7F) | 0x80)

            if (x >> 35) > 0
              buff << (((x >> 28) & 0x7F) | 0x80)

              if (x >> 42) > 0
                buff << (((x >> 35) & 0x7F) | 0x80)

                if (x >> 49) > 0
                  buff << (((x >> 42) & 0x7F) | 0x80)

                  if (x >> 56) > 0
                    buff << (((x >> 49) & 0x7F) | 0x80)

                    if (x >> 63) > 0
                      buff << (((x >> 56) & 0x7F) | 0x80)

                      buff << 1
                    else
                      buff << (x >> 56)
                    end
                  else
                    buff << (x >> 49)
                  end
                else
                  buff << (x >> 42)
                end
              else
                buff << (x >> 35)
              end
            else
              buff << (x >> 28)
            end
          else
            buff << (x >> 21)
          end
        else
          buff << (x >> 14)
        end
      else
        buff << (x >> 7)
      end
    else
      buff << x
    end
  end
end

def uint64_enc4 x, buff
  if (x > 0)
    if (x >> 7) > 0
      byte0 = (((x >> 0) & 0x7F) | 0x80)

      if (x >> 14) > 0
        byte1 = (((x >> 7) & 0x7F) | 0x80)

        if (x >> 21) > 0
          byte2 = (((x >> 14) & 0x7F) | 0x80)

          if (x >> 28) > 0
            byte3 = (((x >> 21) & 0x7F) | 0x80)

            if (x >> 35) > 0
              byte4 = (((x >> 28) & 0x7F) | 0x80)

              if (x >> 42) > 0
                byte5 = (((x >> 35) & 0x7F) | 0x80)

                if (x >> 49) > 0
                  byte6 = (((x >> 42) & 0x7F) | 0x80)

                  if (x >> 56) > 0
                    byte7 = (((x >> 49) & 0x7F) | 0x80)

                    if (x >> 63) > 0
                      byte8 = (((x >> 56) & 0x7F) | 0x80)

                      buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, byte8, 1].pack("CCCCCCCCCC")
                    else
                      buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, (x >> 56)].pack("CCCCCCCCC")
                    end
                  else
                    buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, (x >> 49)].pack("CCCCCCCC")
                  end
                else
                  buff << [byte0, byte1, byte2, byte3, byte4, byte5, (x >> 42)].pack("CCCCCCC")
                end
              else
                buff << [byte0, byte1, byte2, byte3, byte4, (x >> 35)].pack("CCCCCC")
              end
            else
              buff << [byte0, byte1, byte2, byte3, (x >> 28)].pack("CCCCC")
            end
          else
            buff << [byte0, byte1, byte2, (x >> 21)].pack("CCCC")
          end
        else
          buff << [byte0, byte1, (x >> 14)].pack("CCC")
        end
      else
        buff << [byte0, (x >> 7)].pack("CC")
      end
    else
      buff << [(x >> 0)].pack("C")
    end
  end
end

Benchmark.ips do |x|
  x.report("uint64_enc1 (loop)") {
    uint64_enc1(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc2 (unroll)") {
    uint64_enc2(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc3 (unroll better)") {
    uint64_enc3(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc4 (Array#pack)") {
    uint64_enc4(0xFFFF_FFFF, "".b)
  }
end

Results are below.

Interpreter:

$ ruby test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)   345.396k i/100ms
uint64_enc2 (unroll)   380.889k i/100ms
uint64_enc3 (unroll better)
                       415.911k i/100ms
uint64_enc4 (Array#pack)
                       319.319k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)      3.413M (± 2.4%) i/s -     17.270M in   5.063010s
uint64_enc2 (unroll)      3.743M (± 1.2%) i/s -     19.044M in   5.088180s
uint64_enc3 (unroll better)
                          4.119M (± 0.6%) i/s -     20.796M in   5.048463s
uint64_enc4 (Array#pack)
                          3.144M (± 0.7%) i/s -     15.966M in   5.078173s

YJIT:

$ ruby --yjit test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) +YJIT [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)     1.067M i/100ms
uint64_enc2 (unroll)     1.095M i/100ms
uint64_enc3 (unroll better)
                         1.089M i/100ms
uint64_enc4 (Array#pack)
                       640.782k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)     11.533M (± 0.5%) i/s -     58.679M in   5.087999s
uint64_enc2 (unroll)     11.700M (± 0.6%) i/s -     59.114M in   5.052717s
uint64_enc3 (unroll better)
                         11.760M (± 0.5%) i/s -     59.869M in   5.090842s
uint64_enc4 (Array#pack)
                          6.785M (± 1.4%) i/s -     33.961M in   5.006533s

Unrolling the loop seems to help the interpreter, but doesn't seem to do much for YJIT. I'm a little bit surprised by how slow the Array#pack solution is on YJIT, I expected it to fare better since I merged ruby/ruby#8734

@maximecb
Copy link
Contributor

Might make sense to look at the disasm and make sure that your optimization is getting applied, see what the generated code looks like.

I'm a bit surprised that unrolling doesn't help YJIT at all. Wondering if there is a big source of overhead somewhere else or something?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants