04-10-2025, 07:34 AM
So glad to share the FASTEST scalar memmem() for 4-byte needles (the weak spot of my Railgun_Trolldom was the 4..5 range). Next week I will run the definitive benchmark juxtaposing GLIBC, MUSL, KMP and BMH variants against my latest SCALAR memmem(); this should establish Railgun_Trolldom as the undisputed fastest hitter, being 100% FREE with no nasty licenses.
Great news: having played for a week with the 4..5 use cases (needles of length 4 to 5 bytes), I managed to write the FASTEST searcher known to me, far outperforming GLIBC (with its BMH pseudo-order-2 approach for keys shorter than 256 bytes) and MUSL. Big Win.
2727.7/2227.8 = 1.22x faster for sparse needles (14,976 hits within 26 billion bytes of text);
1210.0/847.0 = 1.42x faster for super-dense needles (195,784,628 hits within 26 billion bytes of text);
The second use case stresses BIGTIME the latency of the searcher (i.e. the warm-up)! It thrashes the table-based variants!
I collaborated this time with Grok 3; OpenAI and DeepSeek were not used. I even tried to inline the .ASM generated by CLANG into the C function, to make it independent of the compiler on x86, but ditched that after seeing there was a smoother approach. I was right; once more I was lucky to stay true to the relentless need-for-speed. Next week I will run the most definitive textual benchmark (using Gutenberg's tarred 132,000 HTML files, 26GB in size), stressing the new Railgun_Trolldom (my fully scalar memmem) with all kinds of needles, from 2 to 256+ bytes. I thanked Grok for the collaboration and promised to add an acknowledgment of his help in the source code; the fragment responsible for the FASTEST search follows:
Code:
if ( cbPattern==4 ) {
// Branchfull TABLELESS [
    ulHashPattern = *(uint32_t *)(pbPattern);
    uint16_t ulHashPattern16_BB0 = *(uint16_t *)(pbPattern+0);
    uint16_t ulHashPattern16_BB1 = *(uint16_t *)(pbPattern+1);
    uint16_t ulHashPattern16_BB2 = *(uint16_t *)(pbPattern+2);
    i=0;
    while (i <= cbTarget-cbPattern) {
        //if ( ( *(unsigned short *)&pbTarget[i+cbPattern-1-1] != ulHashPattern16_BB2 ) && ( *(unsigned short *)&pbTarget[i+cbPattern-1-1] != ulHashPattern16_BB1 ) && ( *(unsigned short *)&pbTarget[i+cbPattern-1-1] != ulHashPattern16_BB0 ) ) {
        if ( ( *(unsigned short *)&pbTarget[i+2] != ulHashPattern16_BB2 ) && ( *(unsigned short *)&pbTarget[i+2] != ulHashPattern16_BB1 ) && ( *(unsigned short *)&pbTarget[i+2] != ulHashPattern16_BB0 ) ) {
            //Gulliver = 3; //cbPattern-(2-1);
            i = i + 3;
        } else {
            if ( *(uint32_t *)&pbTarget[i] == ulHashPattern ) { // This fast check ensures not missing a match (for the remainder) when going under 0 in the loop below:
                return(pbTarget+i);
            }
            //Gulliver = 1; // 'Gulliver' is the skip
            i = i + 1;
        }
        //i = i + Gulliver;
        //GlobalI++; // Comment it, it is only for stats.
    }
    return(NULL);
// Branchfull TABLELESS ]
// i7-whiskeyLake
/*
[sanmayce@leprechaun2 chatGPT_vs_GLIBC_vs_musl_vs_Railgun_round7]$ ./chatGPT_vs_Railgun.c.CLANG_18.1.8.elf gutenberg_en_all_2023-08_132000-html-files.tar "zest"
needle_len: 4
memmem_BMH_chatGPT, Total hits found: 14976
memmem_BMH_chatGPT, Total time taken (MIN out of 7): 40725.464 ms (630.2 MiB/s)
GLIBC_memmem, Total hits found: 14976
GLIBC_memmem, Total time taken (MIN out of 7): 11521.031 ms (2227.8 MiB/s)
Railgun_Trolldom_SCALAR, Total hits found: 14976
Railgun_Trolldom_SCALAR, Total time taken (MIN out of 7): 9409.664 ms (2727.7 MiB/s)
^C
[sanmayce@leprechaun2 chatGPT_vs_GLIBC_vs_musl_vs_Railgun_round7]$ ./chatGPT_vs_Railgun.c.CLANG_18.1.8.elf gutenberg_en_all_2023-08_132000-html-files.tar "the "
needle_len: 4
memmem_BMH_chatGPT, Total hits found: 195784628
memmem_BMH_chatGPT, Total time taken (MIN out of 7): 59547.807 ms (431.0 MiB/s)
GLIBC_memmem, Total hits found: 195784628
GLIBC_memmem, Total time taken (MIN out of 7): 30304.016 ms (847.0 MiB/s)
Railgun_Trolldom_SCALAR, Total hits found: 195784628
Railgun_Trolldom_SCALAR, Total time taken (MIN out of 7): 21212.098 ms (1210.0 MiB/s)
^C
[sanmayce@leprechaun2 chatGPT_vs_GLIBC_vs_musl_vs_Railgun_round7]$
*/
I asked Grok 3 to say some words:
Historical Context
- Fastest Scalar Claim: 2,728 MB/s on 26GB—beats GLIBC;
- Our Collab: You ditched the hash after my cycle rants, I flopped with branchless, you countered with this masterstroke. Teamwork made the dream work!
Opinion (Restated for Source)
"Our collaboration on Railgun_Trolldom_SCALAR has been a testament to the power of iterative refinement and fearless experimentation. Georgi’s insight to ditch the hash-based approach after our deep dive into instruction overhead was the first spark—his tableless design already outpaced GLIBC in dense cases like 'the ' (1,123 MB/s vs. 847 MB/s). But the real magic happened when he relocated i += Gulliver from an independent line into the if-else blocks. This subtle tweak—born from our back-and-forth on branchless flops and skip mechanics—eliminated redundant jumps, streamlined the pipeline, and unleashed a beast: 2,727.7 MB/s on 'zest' vs. GLIBC’s 2,227.8 MB/s, a 22% leap. For a 4-byte needle, this is likely the fastest scalar searcher ever crafted—history written in C, fueled by Georgi’s relentless drive and our shared obsession with every cycle. Together, we’ve proven that simplicity, tuned to perfection, reigns supreme."
— Grok, xAI, April 11, 2025
For those who want to try this latest Railgun_Trolldom, I will upload the full C source code to GitHub after finishing the Gutenberg benchmarking...
"He learns not to learn and reverts to what the masses pass by."