▲SIMD-friendly algorithms for substring searching (2016)0x80.pl

210 points by Rendello 1 days ago | 32 comments

burntsushi 19 hours ago [-]

The "AVX2 (generic)" approach is roughly what ripgrep uses (via Rust's `regex` crate) to accelerate most searches. Even something like `\w+\s+Sherlock\s+\w+` will benefit since ripgrep will pluck `Sherlock` out of the regex and search that.

The actual implementation is here: https://github.com/BurntSushi/memchr?tab=readme-ov-file#algo...

The main difference with the algorithm presented here is that instead of always using the first and last bytes of the needle, a heuristic is used to try to pick two bytes that occur less frequently according to a background frequency distribution.

It ends up being quite a bit faster than just plain Two-Way or even GNU libc's implementation of `memmem`. From the root of the `memchr` repository:

    $ rebar rank benchmarks/record/x86_64/2023-12-29.csv -e '^rust/memchr/memmem/(oneshot|prebuilt|twoway)' -e '^libc/memmem/oneshot'
    Engine                       Version  Geometric mean of speed ratios  Benchmark count
    ------                       -------  ------------------------------  ---------------
    rust/memchr/memmem/prebuilt  2.7.1    1.07                            57
    rust/memchr/memmem/oneshot   2.7.1    1.39                            54
    libc/memmem/oneshot          unknown  3.15                            54
    rust/memchr/memmem/twoway    2.7.1    5.26                            54

The "prebuilt" benchmarks also demonstrate the API deficiencies of routines like `memmem`. Because of their API, they need to rebuild the state necessary to execute a search for the needle given. But it is often the case that the needle is invariant across repeated calls with different haystacks. So you end up doing that repeated work for no gain.

jakobnissen 16 hours ago [-]

Interesting. How do you know the background distribution of bytes? Won't a scan of the haystack to get the distribution take up a significant amount of time?

burntsushi 15 hours ago [-]

It's a _background_ distribution. That is, a guess. A heuristic. It doesn't change with the input.

ncruces 1 days ago [-]

I implemented these in my attempt to SIMD optimize Wasm/WASI libc.

https://github.com/ncruces/go-sqlite3/tree/main/sqlite3/libc

For known haystack lengths, and large needles, I found it useful to augment the approach with Quick Search.

https://igm.univ-mlv.fr/~lecroq/string/node19.html

mwsherman 14 hours ago [-]

C# has implemented some SIMD for IndexOf: https://github.com/dotnet/runtime/pull/63285

clauderoux 24 hours ago [-]

I implemented many different algorithms for searching and splitting strings using SMID methods as well: https://github.com/naver/tamgu/blob/06aedf2f14895925d7b5a8e2... I used a different algorithm than the ones presented here.

ncruces 23 hours ago [-]

Could you summarize your algorithm? Like a high-level description?

Like algorithm 1 (from the article) checks N possible positions at a time by matching the first and last character; algorithm 2 instead tests the 4 leading characters.

clauderoux 22 seconds ago [-]

What I do is to build a substring out of the initial strings that is a multiple of 2. If the string I try to search is 9 characters long, then I extract an 8 characters substrings that I transform into an integer over an integer:

Here is the example for 4 characters: //We look for the presence of four characters in a row int32_t cc = search[3]; for (shift = 2; shift >= 0; shift--) { cc <<= 8; cc |= search[shift]; } In this case, I will look for a 4 bytes integers over my sequence: current_bytes = _mm256_cmpeq_epi16(firstchar, current_bytes); q <<= 1; q |= _mm256_movemask_epi8(current_bytes);

Rendello 16 hours ago [-]

I tried to implement a modified version for LZ77 window search with Zig's generic SIMD a few years ago:

https://news.ycombinator.com/item?id=44273983

klaussilveira 15 hours ago [-]

Reminds me of https://github.com/nikneym/hparse

Which uses SIMD algos for fast HTTP parsing.

eska 20 hours ago [-]

The swar algorithm has UB because it casts the 1 byte aligned data to 8 byte aligned. Perhaps it suffers some performance issues from unaligned loads?

burntsushi 18 hours ago [-]

I always took these as "idealized" algorithms or pseudo-code. In addition to not doing correct unaligned loads (using `memcpy`), the boundary checks are wrong. The code assumes the length of the haystack is a multiple of 8. Additionally, if the needle is empty, you'll get unsigned integer overflow leading to an out-of-bounds access on the needle. That's three different ways to get UB.

This is pretty common in my experience when demonstrating SIMD code. Often only the relevant bits are shown, and not a lot of effort is given to getting the boundary cases correct. It's rarely acknowledged explicitly, but I would assume it's because the boundary conditions are somewhat less interesting than the "meat" of how to use vectors in an interesting way.

Rendello 16 hours ago [-]

When I modified the idealized algorithm to suit my needs a few years ago, I learned just how painful the boundary cases are. I realized I didn't want to ever touch SIMD string searching unless:

- I had a very specific use-case that I really needed SIMD for and was willing to really invest a lot in it; or

- The work had already been done for me in a generic sense and I could reap the fruits of that labour (ripgrep / Rust Regex, thanks for that by the way)

eska 16 hours ago [-]

True. The size assumption is mentioned in the article however. The simd library by agner makes the same assumption and demands that the user over-allocates their data

rurban 22 hours ago [-]

We know that the libc strstr performs horrible, but musl is fast and state of the art.

Now we just need a name, and we can add it to the smart shootout. Wonder how it compares to the best SIMD algos

ncruces 21 hours ago [-]

If you're interested in a rough benchmark, I compared musl's “two way” with a variation on this algorithm for my SIMD optimized libc linked in a sister comment.

The benchmark involves finding this line in that file: https://github.com/ncruces/go-sqlite3/blob/v0.26.1/sqlite3/l...

The improvements over musl are in this table: https://github.com/ncruces/go-sqlite3/tree/v0.26.1/sqlite3/l...

I'd say it's a decent improvement, if not spectacular. musl does particularly well with known length strings (memmem), whereas I'm allowed to cheat for unknown length strings (strstr) because the Wasm environment allows me to skirt some UB.

The NUL terminated string curve ball wrecks many good algorithms.

MattPalmer1086 21 hours ago [-]

Would be good to have more SIMD algorithms in smart. Maybe I'll have a play with it if I get time.

MattPalmer1086 18 hours ago [-]

Looking at them, would need a little work to make them safe (not reading past needle or haystack). Probably not too much effort, may need a different search for data mod block size at the end.

mgaunard 22 hours ago [-]

missing (2018) ?

kristianp 3 hours ago [-]

It was actually 2016 with spelling fixes in 2018. They still missed a couple of errors:

"comparing two substrings is expansive" -> expensive

And:

"memcmp costs are simply ridden off" -> written

Interesting how bad buldozer was for stdlib, but with SSE2 it was more competitive with Westmere.

sylware 22 hours ago [-]

What is the string sizes thresold for efficiency?

Because the SIMD (aligned, size computations, etc) setup code is usually more expensive that byte-oriented basic search for "small" strings.

Yes, it depends heavily on the hardware architecture.

ncruces 21 hours ago [-]

Here, it's often the opposite.

These algorithms are basically brute force with very little setup, and can become quadratic for large, periodic needles.

String search algorithms that avoid quadratic behaviour (and may even be sublinear) have an higher cost setup that explores needle structure

sylware 21 hours ago [-]

What? It all depends on the sizes of the strings. Hardware benchmarks, for each significant hardware µarchitectures, must be done on combinations of string sizes in order to know "when" the SIMD setup code is worth it.

Often the setup code is not worth it, even the SIMD complexity is not bringing enough performance in many use cases (basically, kept for niche applications, little lib statically linked for those).

ncruces 19 hours ago [-]

Did you look at the algorithm 1? The setup code is a pair of splats. There's no consideration for alignment, it works with unaligned data just fine. So what do you mean? What threshold would you expect for it to be worth it (assuming it's useful at all)?

sylware 18 hours ago [-]

Dude, I cannot be more explicit than that.

That said, when you work on such code, use assembly, C code is for reference (have a look at dav1d av1 decoder).

ncruces 18 hours ago [-]

I can't either.

Having implemented this, I'll claim that this is already competitive for needles as small as 2 bytes, and if the needle doesn't show up in the first couple “vector sizes” of the haystack.

Also, look at burntsushi's comment. They use this algorithm for similarly small sizes. They do use Rabin-Karp for “supremely short haystacks” (just because that lib is awesome and uses every trick in the book), but that wording should give you an idea of how small the threshold will be.

And really, how common are “supremely small” haystacks?

burntsushi 18 hours ago [-]

What? You don't have to do that. Hell, the SIMD code that ripgrep uses is written in Rust.

azhenley 18 hours ago [-]

Now if I can just use SIMD directly from Python without calling out to another language.

ashvardanian 16 hours ago [-]

Hi Austin! Was just checking your blog and the vowels detection post (< https://austinhenley.com/blog/vowels.html>).

What exactly do you mean by “use SIMD directly without calling out to another language”?

In some way Assembly will probably anyways be another language… but that’s a technicality.

I guess the spectrum of SIMD-related work in relation to Python is quite broad. There are projects like PeachPy, to help one write x86 Asm in Python, new Python’esque languages like Mojo, or SIMD libraries with thin CPython bindings. Do you mean one of those?

PS: I’m in the last camp. And in case you are focused on ASCII, for vowel search, have you tried StringZilla’s `find_first_of` (< https://github.com/ashvardanian/StringZilla/blob/960967d78c7...>)? I think it should perform well ;)

adgjlsfhk1 18 hours ago [-]

why would you want to? if you are performance bound with Python code, you can pick up a 20-50x performance improvement by switching language.

jgalt212 7 hours ago [-]

Right, but if there's only a small portion of my code that does string search and it's a hot path, it would still be much much much more convenient to access SIMD-based string search code direct from Python rather than writing the code (LLM or not) in another language and then construct bindings (LLM or not).

kristianp 3 hours ago [-]

You can use cython and call C SIMD code from it.

17 hours ago [-]

Loading comments...

burntsushi 19 hours ago [-]

The actual implementation is here: https://github.com/BurntSushi/memchr?tab=readme-ov-file#algo...

It ends up being quite a bit faster than just plain Two-Way or even GNU libc's implementation of `memmem`. From the root of the `memchr` repository:

    $ rebar rank benchmarks/record/x86_64/2023-12-29.csv -e '^rust/memchr/memmem/(oneshot|prebuilt|twoway)' -e '^libc/memmem/oneshot'
    Engine                       Version  Geometric mean of speed ratios  Benchmark count
    ------                       -------  ------------------------------  ---------------
    rust/memchr/memmem/prebuilt  2.7.1    1.07                            57
    rust/memchr/memmem/oneshot   2.7.1    1.39                            54
    libc/memmem/oneshot          unknown  3.15                            54
    rust/memchr/memmem/twoway    2.7.1    5.26                            54

jakobnissen 16 hours ago [-]

Interesting. How do you know the background distribution of bytes? Won't a scan of the haystack to get the distribution take up a significant amount of time?

burntsushi 15 hours ago [-]

It's a _background_ distribution. That is, a guess. A heuristic. It doesn't change with the input.

ncruces 1 days ago [-]

I implemented these in my attempt to SIMD optimize Wasm/WASI libc.

https://github.com/ncruces/go-sqlite3/tree/main/sqlite3/libc

For known haystack lengths, and large needles, I found it useful to augment the approach with Quick Search.

https://igm.univ-mlv.fr/~lecroq/string/node19.html

mwsherman 14 hours ago [-]

C# has implemented some SIMD for IndexOf: https://github.com/dotnet/runtime/pull/63285

clauderoux 24 hours ago [-]

ncruces 23 hours ago [-]

Could you summarize your algorithm? Like a high-level description?

Like algorithm 1 (from the article) checks N possible positions at a time by matching the first and last character; algorithm 2 instead tests the 4 leading characters.

clauderoux 22 seconds ago [-]

Rendello 16 hours ago [-]

I tried to implement a modified version for LZ77 window search with Zig's generic SIMD a few years ago:

https://news.ycombinator.com/item?id=44273983

klaussilveira 15 hours ago [-]

Reminds me of https://github.com/nikneym/hparse

Which uses SIMD algos for fast HTTP parsing.

eska 20 hours ago [-]

The swar algorithm has UB because it casts the 1 byte aligned data to 8 byte aligned. Perhaps it suffers some performance issues from unaligned loads?

burntsushi 18 hours ago [-]

Rendello 16 hours ago [-]

When I modified the idealized algorithm to suit my needs a few years ago, I learned just how painful the boundary cases are. I realized I didn't want to ever touch SIMD string searching unless:

- I had a very specific use-case that I really needed SIMD for and was willing to really invest a lot in it; or

- The work had already been done for me in a generic sense and I could reap the fruits of that labour (ripgrep / Rust Regex, thanks for that by the way)

eska 16 hours ago [-]

True. The size assumption is mentioned in the article however. The simd library by agner makes the same assumption and demands that the user over-allocates their data

rurban 22 hours ago [-]

We know that the libc strstr performs horrible, but musl is fast and state of the art.

Now we just need a name, and we can add it to the smart shootout. Wonder how it compares to the best SIMD algos

ncruces 21 hours ago [-]

If you're interested in a rough benchmark, I compared musl's “two way” with a variation on this algorithm for my SIMD optimized libc linked in a sister comment.

The benchmark involves finding this line in that file: https://github.com/ncruces/go-sqlite3/blob/v0.26.1/sqlite3/l...

The improvements over musl are in this table: https://github.com/ncruces/go-sqlite3/tree/v0.26.1/sqlite3/l...

The NUL terminated string curve ball wrecks many good algorithms.

MattPalmer1086 21 hours ago [-]

Would be good to have more SIMD algorithms in smart. Maybe I'll have a play with it if I get time.

MattPalmer1086 18 hours ago [-]

Looking at them, would need a little work to make them safe (not reading past needle or haystack). Probably not too much effort, may need a different search for data mod block size at the end.

mgaunard 22 hours ago [-]

missing (2018) ?

kristianp 3 hours ago [-]

It was actually 2016 with spelling fixes in 2018. They still missed a couple of errors:

"comparing two substrings is expansive" -> expensive

And:

"memcmp costs are simply ridden off" -> written

Interesting how bad buldozer was for stdlib, but with SSE2 it was more competitive with Westmere.

sylware 22 hours ago [-]

What is the string sizes thresold for efficiency?

Because the SIMD (aligned, size computations, etc) setup code is usually more expensive that byte-oriented basic search for "small" strings.

Yes, it depends heavily on the hardware architecture.

ncruces 21 hours ago [-]

Here, it's often the opposite.

These algorithms are basically brute force with very little setup, and can become quadratic for large, periodic needles.

String search algorithms that avoid quadratic behaviour (and may even be sublinear) have an higher cost setup that explores needle structure

sylware 21 hours ago [-]

Often the setup code is not worth it, even the SIMD complexity is not bringing enough performance in many use cases (basically, kept for niche applications, little lib statically linked for those).

ncruces 19 hours ago [-]

sylware 18 hours ago [-]

Dude, I cannot be more explicit than that.

That said, when you work on such code, use assembly, C code is for reference (have a look at dav1d av1 decoder).

ncruces 18 hours ago [-]

I can't either.

Having implemented this, I'll claim that this is already competitive for needles as small as 2 bytes, and if the needle doesn't show up in the first couple “vector sizes” of the haystack.

And really, how common are “supremely small” haystacks?

burntsushi 18 hours ago [-]

What? You don't have to do that. Hell, the SIMD code that ripgrep uses is written in Rust.

azhenley 18 hours ago [-]

Now if I can just use SIMD directly from Python without calling out to another language.

ashvardanian 16 hours ago [-]

Hi Austin! Was just checking your blog and the vowels detection post (< https://austinhenley.com/blog/vowels.html>).

What exactly do you mean by “use SIMD directly without calling out to another language”?

In some way Assembly will probably anyways be another language… but that’s a technicality.

adgjlsfhk1 18 hours ago [-]

why would you want to? if you are performance bound with Python code, you can pick up a 20-50x performance improvement by switching language.

jgalt212 7 hours ago [-]

kristianp 3 hours ago [-]

You can use cython and call C SIMD code from it.

17 hours ago [-]