Add XOR swap and XCHG assembly optimization for integral types #5215

ohhmm · 2025-01-02T12:51:17Z

Add _ENABLE_XOR_SWAP macro with default value of 0
Implement XOR swap optimization for integral types
Add x86/x64 assembly XCHG optimization
Add comprehensive test coverage for swap operations

Tests are initially disabled to allow for review before enabling.

stl/inc/utility

tests/std/tests/XOR_swap_demo/test.cpp

Co-Authored-By: Serg Kryvonos <[email protected]>

StephanTLavavej · 2025-01-04T04:01:51Z

This doesn't seem like a pure performance win, and would therefore fall under our non-goals. Notably, compilers and processors are good at recognizing temporary variables used for swaps, and I believe that processors can even accomplish it sometimes via register renaming, i.e. no actual instructions are spent. In contrast, the need for a branch to defend against self-swap, and the actual instructions used for xoring, are not an obvious win.

We'll talk about this at the next weekly maintainer meeting, but I am not inclined to proceed with this PR - although we always appreciate interest in improving the repo!
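The trade-off described in the comment above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names `plain_swap`, `xor_swap_unguarded`, and `xor_swap_guarded` are hypothetical. It shows why the XOR variant needs a self-swap check (without it, swapping an object with itself zeroes it), while the temporary-based version has no such hazard.

```cpp
#include <type_traits>

// Plain swap through a temporary: safe when a and b alias, and a
// pattern that optimizers commonly recognize.
template <class T>
void plain_swap(T& a, T& b) {
    T tmp = a;
    a     = b;
    b     = tmp;
}

// XOR swap with no guard (hypothetical, for illustration only).
// If a and b refer to the same object, the first step zeroes it.
template <class T>
void xor_swap_unguarded(T& a, T& b) {
    static_assert(std::is_integral<T>::value, "XOR swap requires an integral type");
    a ^= b;
    b ^= a;
    a ^= b;
}

// XOR swap with the self-swap guard: correct, but pays for a branch
// plus three dependent XOR operations.
template <class T>
void xor_swap_guarded(T& a, T& b) {
    static_assert(std::is_integral<T>::value, "XOR swap requires an integral type");
    if (&a != &b) {
        a ^= b;
        b ^= a;
        a ^= b;
    }
}
```

Calling `xor_swap_unguarded(x, x)` on an `int x = 7` leaves `x == 0`, whereas `plain_swap(x, x)` and `xor_swap_guarded(x, x)` leave it unchanged. Whether the guarded version is ever faster than the plain one would need to be measured on the target compiler and processor.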

AlexGuteniev · 2025-01-04T09:08:41Z

This looks like a de-optimization for multiple reasons (ordered from the most important to the least important):

1. The xchg instruction with a memory operand is implicitly atomic, even without a lock prefix. Note how _InterlockedExchange compiles to the same instruction: https://godbolt.org/z/vWGnTbv1M. The resulting performance is terrible.
2. Inline assembly in MSVC by itself inhibits optimization, as it forces the compiler to serialize and spill in the surrounding code. (It is not that bad for other compilers, although it is better to avoid inline assembly for them too.)
3. The xor trick, while seemingly eliminating a temporary variable, is not a win against the plain swap code, as @StephanTLavavej observed.
4. Some compilers are able to vectorize swapping of an array when it is written as a plain loop. MSVC does not do this yet, which is why there is manual vectorization of swap, and Auto-vectorize arrays swap #4991. But we have other targets besides MSVC (officially Clang and NVCC, unofficially Intel too). The xor trick is not frequent in real code, so it is unlikely to be recognized as a swap, and so it may not be vectorized, or it may be vectorized as an xor sequence, which would be much worse than the usual vector swap.
5. The self-swap check is an extra branch, as @StephanTLavavej observed. While it may be optimized away in some cases, it is likely to persist in others.
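The vectorization point can be made concrete with two ways of writing an array swap. This is a sketch under stated assumptions: the names `swap_arrays_plain` and `swap_arrays_xor` are hypothetical and not taken from the PR or the STL, and the two ranges are assumed either to be identical or not to overlap at all. The code only shows that both forms produce the same result. Whether a given compiler vectorizes either loop has to be checked in its generated assembly.

```cpp
#include <cstddef>

// Plain element-wise swap loop: the familiar form that some compilers
// can recognize and auto-vectorize.
void swap_arrays_plain(unsigned* a, unsigned* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        unsigned tmp = a[i];
        a[i]         = b[i];
        b[i]         = tmp;
    }
}

// XOR-based loop (hypothetical): each iteration carries a self-swap
// branch and three dependent XORs, a pattern that is less likely to be
// recognized as a swap.
void swap_arrays_xor(unsigned* a, unsigned* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (&a[i] != &b[i]) {
            a[i] ^= b[i];
            b[i] ^= a[i];
            a[i] ^= b[i];
        }
    }
}
```

Both functions exchange the contents of two `unsigned` arrays of length `n`. The second one stays correct when `a == b` only because of the per-element check.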

ohhmm requested a review from a team as a code owner January 2, 2025 12:51

frederick-vs-ja suggested changes Jan 2, 2025

ohhmm force-pushed the devin/-xor-swap-optimization branch 4 times, most recently from 3c3058b to fc2ef56 Compare January 2, 2025 18:34

ohhmm force-pushed the devin/-xor-swap-optimization branch from fc2ef56 to 9362cb3 Compare January 2, 2025 20:02
