`<regex>`: Simplify `regex_traits<_Elem>::translate(_Elem)` #5209

muellerj2 · 2024-12-27T15:07:49Z

Simplifies regex_traits<_Elem>::translate(_Elem) to just return its only argument. In #5204 (comment), I voiced my suspicion that the current implementation essentially does just that in a very complicated and expensive way. I verified this now by running the following program that tested 902 locales available on my machine (no output to stdout produced by the program):

#include <iostream>
#include <locale>
#include <regex>
#include <string>
#include <vector>

#include <windows.h>

using namespace std;

BOOL add_locale(LPWSTR str, DWORD flags, LPARAM param) {
    vector<string>& vec = *reinterpret_cast<vector<string>*>(param);
    int size            = WideCharToMultiByte(CP_ACP, 0, str, -1, nullptr, 0, nullptr, nullptr);
    if (size) {
        vector<char> narrow(static_cast<size_t>(size), '\0');
        if (WideCharToMultiByte(CP_ACP, 0, str, -1, narrow.data(), size, nullptr, nullptr)) {
            vec.push_back(narrow.data());
        }
    }
    return TRUE;
}

int main() {
    vector<string> locales;
    locales.push_back("C");
    EnumSystemLocalesEx(&add_locale, 0, reinterpret_cast<LPARAM>(&locales), nullptr);

    for (const string& loc_name : locales) {
        const locale loc(loc_name);
        regex_traits<char> traits;
        regex_traits<wchar_t> wtraits;
        traits.imbue(loc);
        wtraits.imbue(loc);

        for (unsigned int i = 0; i <= 0xff; ++i) {
            try {
                const char x = (char) (unsigned char) i;
                if (traits.translate(x) != x) {
                    cout << loc_name << i << '\n';
                }
            } catch (const length_error&) {
            }
        }

        for (unsigned int i = 0; i <= 0xffff; ++i) {
            try {
                const wchar_t x = (wchar_t) i;
                if (wtraits.translate(x) != x) {
                    cout << loc_name << i << '\n';
                }
            } catch (const length_error&) {
            }
        }
    }
    return 0;
}

Even so, this PR still introduces a minor behavior change: The previous implementation can throw length_error("string too long") when it is passed a char that isn't a valid character in the locale's encoding (e.g., 0x80 in locales using UTF-8 encoding). But I think that the old behavior is undesirable anyway, as it makes the regex engine always fail with an exception in regex_constants::collate mode when a locale using UTF-8 encoding is imbued and the regex engine is applied to strings containing non-ASCII characters.

frederick-vs-ja

Aha, it seems that we took too much time to look into this.
[re.traits]/4 simply specifies that an implementation-provided regex_traits<C>::translate just returns the argument. See also https://en.cppreference.com/w/cpp/regex/regex_traits/translate.

muellerj2 · 2024-12-27T16:19:03Z

I put so much work in this mainly because I was worried about mix-and-match scenarios, not because I consider the old implementation correct. Think of the scenario that the implementations actually commonly produce different results and regex parser and matcher pick up different implementations or some other strange combination. Then this change had the potential to actually degrade the regex engine in such a mix-and-match scenario and lead to regex bugs that are difficult to understand and reproduce.

<regex>: Simplify regex_traits<_Elem>::translate(_Elem)

fdebb65

muellerj2 requested a review from a team as a code owner December 27, 2024 15:07

frederick-vs-ja approved these changes Dec 27, 2024

View reviewed changes

muellerj2 mentioned this pull request Dec 27, 2024

<locale>: std::collate<_Elem>::do_transform() should behave appropriately when _LStrxfrm() fails #5210

Open

StephanTLavavej added the enhancement Something can be improved label Jan 4, 2025

StephanTLavavej self-assigned this Jan 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`<regex>`: Simplify `regex_traits<_Elem>::translate(_Elem)` #5209

`<regex>`: Simplify `regex_traits<_Elem>::translate(_Elem)` #5209

muellerj2 commented Dec 27, 2024 •

edited

Loading

frederick-vs-ja left a comment

muellerj2 commented Dec 27, 2024

<regex>: Simplify regex_traits<_Elem>::translate(_Elem) #5209

Are you sure you want to change the base?

<regex>: Simplify regex_traits<_Elem>::translate(_Elem) #5209

Conversation

muellerj2 commented Dec 27, 2024 • edited Loading

frederick-vs-ja left a comment

Choose a reason for hiding this comment

muellerj2 commented Dec 27, 2024

`<regex>`: Simplify `regex_traits<_Elem>::translate(_Elem)` #5209

`<regex>`: Simplify `regex_traits<_Elem>::translate(_Elem)` #5209

muellerj2 commented Dec 27, 2024 •

edited

Loading