One more bug:
The call to pcre2_compile_32 should be changed from:
    code = pcre2_compile_32(pattern_ucs, pattern.size(),
                            PCRE2_NO_UTF_CHECK | flags, &error_code,
                            &error_offset, 0);
To:
    code = pcre2_compile_32(pattern_ucs, pattern.size(),
                            PCRE2_UTF | PCRE2_UCP | flags, &error_code,
                            &error_offset, 0);
Without PCRE2_UTF, Unicode semantics will not be applied (for example, case matching for non-ASCII characters will not be handled properly).
PCRE2_UCP is a little less obvious. I think it makes sense to enable it, since we care more about correctness than performance. Here's what the documentation has to say about it:
“This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcre2pattern page. If you set PCRE2_UCP, matching one of the items it affects takes much longer.”
Finally, I don't think it makes sense to use PCRE2_NO_UTF_CHECK: at best it's a no-op (since we're using UTF-32), and at worst it can cause a crash when trying to match an invalid string. That risk is not worth what little performance benefit there is to gain from it.
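To illustrate what the validity check amounts to for UTF-32: each code unit must be a Unicode scalar value, i.e. no larger than 0x10FFFF and not in the surrogate range 0xD800-0xDFFF. Below is a minimal sketch of such a check, in case anyone wants to see why skipping it buys so little. Note that is_valid_utf32 is a hypothetical helper name I made up for illustration; it is not part of PCRE2 or of the code under discussion.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a PCRE2 function): returns true if every
// 32-bit code unit in [s, s + length) is a valid Unicode scalar value.
bool is_valid_utf32(const std::uint32_t *s, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        const std::uint32_t c = s[i];
        if (c > 0x10FFFF) {
            return false;  // beyond the Unicode code space
        }
        if (c >= 0xD800 && c <= 0xDFFF) {
            return false;  // surrogates are not valid in UTF-32
        }
    }
    return true;
}
```

This is a single linear pass with two comparisons per code unit, so a caller who really did want to skip the library's check would have to do roughly the same work to guarantee valid input.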
Regards,
Elias