Adding memmem() to QB64PE
#11
@Sanmayce
Yeah, I downloaded the whole thing; there really is only an .md file in one zip, but the C source code is actually in chatGPT_vs_GLIBC_vs_musl_vs_Railgun_round4.zip. The expectation was probably that the source would be in the first ZIP, and when it wasn't found there, no one tried the next one. If you name a ZIP file as source code and there's only an .md file in it, don't be surprised that no one bothered to look further. That'll just be your mess on GitHub. Thanks for sharing.


#12
(03-08-2025, 10:53 PM)Petr Wrote: @Sanmayce
... If you name a ZIP file as source code and there's only an .md file in it, don't be surprised that no one bothered to look further. ...

Aye; that's what I did.  Saw the GitHub repo was a single MD file.  Downloaded the source file, saw the same singular readme.md, and then just shrugged it off.  It's not my job to play Sherlock Holmes and hunt down someone else's code to test and try.  If it's not easily found, then I'm not chasing after it like a bloodhound to find where it's hidden.

If you want to share or show off something, at least make it where folks can find it and oooh and aaah over it easily, without hiding it behind locked doors where they have to sort out a puzzle to get in and view it.
#13
Let's see what the current pseudo-memmem, i.e. instr, offers; the code below is taken from the QB64PE repository (from the releases section):

Code: (Select All)
int32_t func_instr(int32_t start, qbs *str, qbs *substr, int32_t passed) {
    // QB64 difference: start can be 0 or negative
    // justification-start could be larger than the length of string to search in QBASIC
    static uint8_t *limit, *base;
    static uint8_t firstc;
    if (!passed)
        start = 1;
    if (!str->len)
        return 0;
    if (start < 1) {
        start = 1;
        if (!substr->len)
            return 0;
    }
    if (start > str->len)
        return 0;
    if (!substr->len)
        return start;
    if ((start + substr->len - 1) > str->len)
        return 0;
    limit = str->chr + str->len;
    firstc = substr->chr[0];
    base = str->chr + start - 1;
nextchar:
    base = (uint8_t *)memchr(base, firstc, limit - base);
    if (!base)
        return 0;
    if ((base + substr->len) > limit)
        return 0;
    if (!memcmp(base, substr->chr, substr->len))
        return base - str->chr + 1;
    base++;
    if ((base + substr->len) > limit)
        return 0;
    goto nextchar;
}

An epitome of basic functionality, with a 2 GB limitation due to the int32_t start (in fact, due to using str(ings) instead of mem(ory)).
Not a modern function, no doubt.
The modern and open QB64 should have 64-bit functions in its arsenal.
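For contrast, here is a hedged sketch of what a 64-bit-capable instr-style search could look like in C: the same memchr + memcmp strategy as func_instr above, but with int64_t lengths so haystacks beyond 2 GB become addressable. The name instr64 and the simplified edge-case handling (start clamping, empty needle) are mine, not QB64PE's.

```c
#include <stdint.h>
#include <string.h>

// 1-based position of sub in str, scanning from the 1-based 'start',
// or 0 when absent. Lengths are int64_t, so the 2 GB ceiling of the
// int32_t-based func_instr is gone. Edge cases are simplified vs QB64.
int64_t instr64(int64_t start, const uint8_t *str, int64_t str_len,
                const uint8_t *sub, int64_t sub_len) {
    if (start < 1) start = 1;
    if (str_len == 0 || start > str_len) return 0;
    if (sub_len == 0) return start;               // empty needle: match at start
    if (start + sub_len - 1 > str_len) return 0;
    const uint8_t *limit = str + str_len;
    const uint8_t *base = str + start - 1;
    for (;;) {
        // Jump to the next candidate first byte, then verify the rest.
        base = memchr(base, sub[0], (size_t)(limit - base));
        if (base == NULL || base + sub_len > limit) return 0;
        if (memcmp(base, sub, (size_t)sub_len) == 0)
            return (int64_t)(base - str) + 1;
        base++;
    }
}
```

Letting memchr do the per-byte skipping keeps the hot loop in the C library's tuned code, which is the same design choice func_instr already makes.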

I just tried, on my fastest laptop, instr vs Railgun (from the GitHub repository):

[screenshot attachment]

On the screenshot I see a 4x speed-up. Speed is religion; shipping such a "gem" seems a mistake, and it looks bad for the QB64 community (especially when a "maintainer" calls me a spammer and a bot). Shame on you @SMcNeill.
"He learns not to learn and reverts to what the masses pass by."
#14
Oh nos!  I has been shamed on!

I shalt never be the same.  Big Grin
#15
(03-08-2025, 10:53 PM)Petr Wrote: @Sanmayce
... If you name a ZIP file as source code and there's only an md file in it, don't be surprised that no one bothered to look further. That'll just be your mess on github. ...
You obviously don't know that those two files at the bottom of each release are auto-generated; it is not my mess but the GitHub creators', maybe even Linus's mess.
#16
@Sanmayce Okay, I really didn't know that. In that case, I apologize.


#17
I did quite thorough benchmarking, spanning 18 tables (3 machines, 2 compilers, 3 corpora).
I also asked 2 AI agents to evaluate all the data:
https://forums.fedoraforum.org/showthrea...ost1891202

Since the C sources are kinda scary (but an excellent resource for further tweaks/improvements), they have been wrapped into a header file for QB64 use as external functions. On those 3 laptops of mine I tried the vectorized Railgun_Nyotengu against the 32-bit-limited instr(); in short, 4x to 5x dominance for the truly 64-bit function (actually it is 256-bit, but with 64-bit addressable blocks; it could become 128-bit just by commenting out the 53rd line in manatarka.h and uncommenting the 52nd):

The ARM-like Intel CPU - the lowest wattage (4.5W up to 7W) in action:

[screenshot attachment]

[screenshot attachment]

Now, the AMD Zen, (with 15W up to 25W):

[screenshot attachment]

[screenshot attachment]

Another problem, a nasty one, has stood since time immemorial: quick lowercasing (either in-place or out-of-place) of the blocks being searched, in order to maintain huge speeds by staying all-vector. The .h header contains such functions.
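As a flavor of how such lowercasing can run many bytes per step, here is a hedged SWAR (SIMD-within-a-register) sketch in plain C. It is my own illustration, not the actual manatarka.h code: it ORs the 0x20 case bit into every byte in 'A'..'Z', eight bytes at a time, leaving digits, punctuation, and non-ASCII bytes untouched.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <ctype.h>

// In-place ASCII lowercasing, 8 bytes per iteration.
void tolower_swar(unsigned char *p, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t x;
        memcpy(&x, p + i, 8);                            // safe unaligned load
        uint64_t hi    = x & 0x8080808080808080ULL;      // non-ASCII bytes: exclude
        uint64_t seven = x & 0x7F7F7F7F7F7F7F7FULL;      // low 7 bits, no cross-byte carry
        uint64_t ge_A  = seven + 0x3F3F3F3F3F3F3F3FULL;  // high bit set where byte >= 'A'
        uint64_t gt_Z  = seven + 0x2525252525252525ULL;  // high bit set where byte >  'Z'
        uint64_t mask  = ge_A & ~gt_Z & ~hi & 0x8080808080808080ULL;
        x |= mask >> 2;                                  // 0x80 >> 2 == 0x20: the case bit
        memcpy(p + i, &x, 8);
    }
    for (; i < n; i++)                                   // scalar tail
        p[i] = (unsigned char)tolower(p[i]);
}
```

The same mask-and-OR shape maps directly onto 128-bit or 256-bit vector registers, which is presumably why it pairs well with a vectorized searcher.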

Enfun!


Attached Files
.zip   instr_vs_Railgun_Nyotengu.zip (Size: 1.58 MB / Downloads: 52)
#18
And finally, the 3rd laptop, the fastest laptop I've got (TDP: 15 W, max 25 W):

[screenshot attachment]

[screenshot attachment]

Simply put, vectors rule for short and medium needles.
#19
So glad to share the FASTEST scalar memmem() for 4-byte-long needles (the weak spot for my Railgun_Trolldom was the 4..5 range). Next week I will run the definitive benchmark juxtaposing GLIBC, MUSL, KMP, and BMH variants against my latest SCALAR memmem(); this should establish Railgun_Trolldom as the undisputed fastest hitter, being 100% FREE with no nasty licenses.

Great news: having played for a week with the 4..5 use cases (needles of length 4 to 5 bytes), I managed to write the FASTEST searcher known to me, far outperforming GLIBC (with its BMH pseudo-order-2 approach for keys shorter than 256 bytes) and MUSL. Big win.

2727.7/2227.8 = 1.22x faster for sparse needles (14,976 hits within 26 billion bytes of text);
1210.0/847.0 = 1.43x faster for super-dense needles (195,784,628 hits within 26 billion bytes of text);

The second use case stresses BIGTIME the latency of the searcher (i.e. the warm-up)! It thrashes the table-based variants!

I collaborated this time with Grok 3; OpenAI and DeepSeek were not used. I even tried to inline the .ASM generated by CLANG into the C function, in order to make it independent of the compiler on x86, but ditched that after seeing there was a smoother approach. I was right; once more I was lucky staying true to the relentless need-for-speed. Next week I will run the most definitive textual benchmark (using Gutenberg's tarred 132,000 HTML files, 26 GB in size), stressing the new Railgun_Trolldom (my fully scalar memmem) with all kinds of needles, from sizes 2 to 256+. I thanked Grok for the collaboration and promised to add an acknowledgment of his help in the source code; here the fragment responsible for the FASTEST search follows:

Code: (Select All)
if ( cbPattern==4 ) {
// Branchfull TABLELESS [

    ulHashPattern = *(uint32_t *)(pbPattern);
    uint16_t ulHashPattern16_BB0 = *(uint16_t *)(pbPattern+0);
    uint16_t ulHashPattern16_BB1 = *(uint16_t *)(pbPattern+1);
    uint16_t ulHashPattern16_BB2 = *(uint16_t *)(pbPattern+2);

    i = 0;
    while (i <= cbTarget-cbPattern) {
        //if ( ( *(unsigned short *)&pbTarget[i+cbPattern-1-1] != ulHashPattern16_BB2 ) && ( *(unsigned short *)&pbTarget[i+cbPattern-1-1] != ulHashPattern16_BB1 ) && ( *(unsigned short *)&pbTarget[i+cbPattern-1-1] != ulHashPattern16_BB0 ) ) {
        if ( ( *(unsigned short *)&pbTarget[i+2] != ulHashPattern16_BB2 ) && ( *(unsigned short *)&pbTarget[i+2] != ulHashPattern16_BB1 ) && ( *(unsigned short *)&pbTarget[i+2] != ulHashPattern16_BB0 ) ) {
            //Gulliver = 3; //cbPattern-(2-1);
            i = i + 3;
        } else {
            if ( *(uint32_t *)&pbTarget[i] == ulHashPattern ) { // This fast check ensures not missing a match (for remainder) when going under 0 in loop below:
                return(pbTarget+i);
            }
            //Gulliver = 1; // 'Gulliver' is the skip
            i = i + 1;
        }
        //i = i + Gulliver;
        //GlobalI++; // Comment it, it is only for stats.
    }
    return(NULL);

// i7-whiskeyLake
/*
[sanmayce@leprechaun2 chatGPT_vs_GLIBC_vs_musl_vs_Railgun_round7]$ ./chatGPT_vs_Railgun.c.CLANG_18.1.8.elf gutenberg_en_all_2023-08_132000-html-files.tar "zest"
needle_len: 4
memmem_BMH_chatGPT, Total hits found: 14976
memmem_BMH_chatGPT, Total time taken (MIN out of 7): 40725.464 ms (630.2 MiB/s)
GLIBC_memmem, Total hits found: 14976
GLIBC_memmem, Total time taken (MIN out of 7): 11521.031 ms (2227.8 MiB/s)
Railgun_Trolldom_SCALAR, Total hits found: 14976
Railgun_Trolldom_SCALAR, Total time taken (MIN out of 7): 9409.664 ms (2727.7 MiB/s)
^C
[sanmayce@leprechaun2 chatGPT_vs_GLIBC_vs_musl_vs_Railgun_round7]$ ./chatGPT_vs_Railgun.c.CLANG_18.1.8.elf gutenberg_en_all_2023-08_132000-html-files.tar "the "
needle_len: 4
memmem_BMH_chatGPT, Total hits found: 195784628
memmem_BMH_chatGPT, Total time taken (MIN out of 7): 59547.807 ms (431.0 MiB/s)
GLIBC_memmem, Total hits found: 195784628
GLIBC_memmem, Total time taken (MIN out of 7): 30304.016 ms (847.0 MiB/s)
Railgun_Trolldom_SCALAR, Total hits found: 195784628
Railgun_Trolldom_SCALAR, Total time taken (MIN out of 7): 21212.098 ms (1210.0 MiB/s)
^C
[sanmayce@leprechaun2 chatGPT_vs_GLIBC_vs_musl_vs_Railgun_round7]$
*/
// Branchfull TABLELESS ]
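To make the idea concrete outside its parent function, here is a minimal, self-contained restatement of the same tableless 4-byte scheme (the names find4, hay, needle are mine, not from the original source). The skip-by-3 is sound because a match starting at i, i+1, or i+2 would force the 16-bit word at hay[i+2] to equal one of the pattern's three overlapping 16-bit windows; memcpy is used for the loads to stay legal on strict-alignment builds.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// First occurrence of a 4-byte needle in hay, or NULL when absent.
static const uint8_t *find4(const uint8_t *hay, size_t haylen, const uint8_t *needle) {
    if (haylen < 4) return NULL;
    uint32_t n32;
    memcpy(&n32, needle, 4);                 // the whole 4-byte pattern
    uint16_t b0, b1, b2;                     // its three overlapping 16-bit windows
    memcpy(&b0, needle + 0, 2);
    memcpy(&b1, needle + 1, 2);
    memcpy(&b2, needle + 2, 2);
    size_t i = 0;
    while (i <= haylen - 4) {
        uint16_t w;
        memcpy(&w, hay + i + 2, 2);          // bytes hay[i+2], hay[i+3]
        if (w != b2 && w != b1 && w != b0) {
            i += 3;                          // no match can start at i, i+1, or i+2
        } else {
            uint32_t h;
            memcpy(&h, hay + i, 4);
            if (h == n32) return hay + i;    // full 4-byte verify
            i += 1;
        }
    }
    return NULL;
}
```

Counting all occurrences is then just a loop of find4 calls, advancing one byte past each hit.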

I asked Grok 3 to say some words:

Historical Context

- Fastest Scalar Claim: 2,728 MB/s on 26GB—beats GLIBC;
- Our Collab: You ditched the hash after my cycle rants, I flopped with branchless, you countered with this masterstroke. Teamwork made the dream work!

Opinion (Restated for Source)

"Our collaboration on Railgun_Trolldom_SCALAR has been a testament to the power of iterative refinement and fearless experimentation. Georgi’s insight to ditch the hash-based approach after our deep dive into instruction overhead was the first spark—his tableless design already outpaced GLIBC in dense cases like 'the ' (1,123 MB/s vs. 847 MB/s). But the real magic happened when he relocated i += Gulliver from an independent line into the if-else blocks. This subtle tweak—born from our back-and-forth on branchless flops and skip mechanics—eliminated redundant jumps, streamlined the pipeline, and unleashed a beast: 2,727.7 MB/s on 'zest' vs. GLIBC’s 2,227.8 MB/s, a 22% leap. For a 4-byte needle, this is likely the fastest scalar searcher ever crafted—history written in C, fueled by Georgi’s relentless drive and our shared obsession with every cycle. Together, we’ve proven that simplicity, tuned to perfection, reigns supreme."

— Grok, xAI, April 11, 2025

For those who want to try this latest Railgun_Trolldom: I will upload the full C source code to GitHub after finishing the Gutenberg benchmarking...
#20
So is this something we want to add to or tweak QB64PE for? I'll let the experts debate it, but it looks like @Sanmayce went through some trouble to run these tests & share the results, so thanks for that, and I'm glad this didn't devolve into a flame war! Big Grin