The need for memmem()-like function accepting Edit Distance as a parameter stands for a long time.
In addition to Exact and Wildcards modes, the Fuzzy mode is a must.
For example, often searches are targeting misspelled phrases/words, the brute-force way is to harness the whole CPU power to locate them - this is what the following tool does.
Stomp, stomp... here comes DraFF - the Dragonic Fuzzy Finderess
The idea is, once having a superfast console tool written in C (by moi or/and AI) to transition to a QB64 function wrapper invoking the very same C function. Thus, the user will be able to compare speeds and hits correctness across both usecases.
Exactly a decade ago I wrote (the sourcecode is attached) the fastest SCALAR console tool Kazahana doing the tri modes (Exact, Wildcards, Fuzzy), but it is not that elegant codewise, moreover I stupidly limited it to using 16 threads and focused unnecessarily on text-only inputs. In contrast, DraFF is the simplicity itself while utilizing all the threads available (imagine how it will fly on 192 cores), also the input could be binary (filtering NUL,TAB,CR,LF on the fly, simply converting them to SPC). The advantage of Kazahana is the physical line dumping - giving the "sentences" housing the hits, simply put, it is more readable whereas DraFF targets getting the offsets of hits i.e. being more of an indexer. Also, I use some optimization heuristics making Kazahana twice as fast, see the results_X270.txt console dump file in the attachment.
For now, I did only the initial step, benchmarking the console tools against all the 453,474,349 digrams of English Wikipedia:
Excited I am, lacking fuzzy functionality is to be thing of the past... the question is whether the QB64 wrapper will be allowed to execute the OpenMP C function! If problem arises then will rely on SHELL command.
Looking back, on the old QB64 forum, in this post I detected 2 errors in KJV bible - the two misspellings "daughers":
https://qb64forum.alephc.xyz/index.php?t...#msg139414
Using DraFF to find all the errorful "daughters" within Edit Distance 2, the console is this:
Thus, we found only one type of misspelling within 2 insertions/additions/substitutions.
In addition to Exact and Wildcards modes, the Fuzzy mode is a must.
For example, often searches are targeting misspelled phrases/words, the brute-force way is to harness the whole CPU power to locate them - this is what the following tool does.
Stomp, stomp... here comes DraFF - the Dragonic Fuzzy Finderess
The idea is, once having a superfast console tool written in C (by moi or/and AI) to transition to a QB64 function wrapper invoking the very same C function. Thus, the user will be able to compare speeds and hits correctness across both usecases.
Exactly a decade ago I wrote (the sourcecode is attached) the fastest SCALAR console tool Kazahana doing the tri modes (Exact, Wildcards, Fuzzy), but it is not that elegant codewise, moreover I stupidly limited it to using 16 threads and focused unnecessarily on text-only inputs. In contrast, DraFF is the simplicity itself while utilizing all the threads available (imagine how it will fly on 192 cores), also the input could be binary (filtering NUL,TAB,CR,LF on the fly, simply converting them to SPC). The advantage of Kazahana is the physical line dumping - giving the "sentences" housing the hits, simply put, it is more readable whereas DraFF targets getting the offsets of hits i.e. being more of an indexer. Also, I use some optimization heuristics making Kazahana twice as fast, see the results_X270.txt console dump file in the attachment.
For now, I did only the initial step, benchmarking the console tools against all the 453,474,349 digrams of English Wikipedia:
Code: (Select All)
[sanmayce@leprechaun2 DraFF]$ ls -l
-rw-r--r-- 1 sanmayce sanmayce 8271489435 Oct 18 12:51 enwiki-20241001-pages-articles.xml.2.sorted-unique
-rwxrwxrwx 1 sanmayce sanmayce 18544 Dec 31 08:42 parallel_edit_distance
-rwxrwxrwx 1 sanmayce sanmayce 6608 Dec 31 08:42 parallel_edit_distance.c
-rwxrwxrwx 1 sanmayce sanmayce 18504 Dec 31 08:42 parallel_edit_distance_RAM
-rwxrwxrwx 1 sanmayce sanmayce 7290 Dec 31 08:41 parallel_edit_distance_RAM.c
[sanmayce@leprechaun2 DraFF]$ perf stat -d ./parallel_edit_distance enwiki-20241001-pages-articles.xml.2.sorted-unique "dragon fury" 2
Searching using 4 threads for pattern 'dragon fury' ...
Thread 3: Match found at file offset 190942553: '_dragonfury' (Edit Distance: 2)
Thread 3: Match found at file offset 190942554: 'dragonfury ' (Edit Distance: 2)
Thread 3: Match found at file offset 548591669: '_dragonfury' (Edit Distance: 2)
Thread 3: Match found at file offset 548591670: 'dragonfury ' (Edit Distance: 2)
Thread 0: Match found at file offset 563134337: '_dragonfury' (Edit Distance: 2)
Thread 0: Match found at file offset 563134338: 'dragonfury ' (Edit Distance: 2)
Thread 0: Match found at file offset 567968201: '_dragonfury' (Edit Distance: 2)
Thread 0: Match found at file offset 567968202: 'dragonfury ' (Edit Distance: 2)
Thread 1: Match found at file offset 1197260642: '_dragonfury' (Edit Distance: 2)
...
[sanmayce@leprechaun2 DraFF_rev1]$ ls -l
-rwxr-xr-x 1 sanmayce sanmayce 166928 Jan 29 2024 Kazahana_Hexadecad_CLANG_SSE42_64bit.elf
[sanmayce@leprechaun2 DraFF_rev1]$ perf stat -d ./Kazahana_Hexadecad_CLANG_SSE42_64bit.elf 2e "dragon fury" enwiki-20241001-pages-articles.xml.2.sorted-unique 1234
____ __. .__
| |/ _|_____ _____________ | |__ _____ ____ _____
| < \__ \ \___ /\__ \ | | \ \__ \ / \ \__ \
| | \ / __ \_ / / / __ \_| Y \ / __ \_| | \ / __ \_
|____|__ \(____ //_____ \(____ /|___| /(____ /|___| /(____ /
\/ \/ \/ \/ \/ \/ \/ \/
Kazahana, a typhoon-class exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcheress, r. 1-++fix+nowait_critical_nixFIX_Wolfram+fixITER+EX+CS_fix_DEFINE_Trolldom_LineByLine_fix_BackToBuffer_Lowercasing-fix, copyleft Kaze 2024-Jan-29.
WILDCARD_IP_flag = 0
WILDCARD_FAST_flag = 2
Exact_flag = 0
EXHAUSTIVE_flag = 1
Enforcing Case-Insensitive Fuzzy (EXHAUSTIVE) mode ...
Pattern: dragon fury
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 1234KB ... OK
/; Speed: 00,003,968,824 bytes/second; Traversed: 8,270,313,894 bytes; Dumped: 69
Kazahana: Total/Checked/Dumped xgrams: 453,474,349/14,330,902,985/69
Kazahana: Performance: 3,873 KB/s
Kazahana: Performance: 217,454 xgrams/s
Kazahana: Performance: Total/fread() clocks: 2,085,376,715/10,118,262
Kazahana: Performance: I/O time, i.e. fread() time, is 0 percents
Kazahana: Done.
Performance counter stats for './Kazahana_Hexadecad_CLANG_SSE42_64bit.elf 2e dragon fury enwiki-20241001-pages-articles.xml.2.sorted-unique 1234':
2,116,690.74 msec task-clock:u # 3.900 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
637 page-faults:u # 0.301 /sec
4,952,369,330,793 cycles:u # 2.340 GHz (50.00%)
8,194,339,388,346 instructions:u # 1.65 insn per cycle (62.50%)
1,381,429,000,327 branches:u # 652.636 M/sec (62.50%)
6,690,074,767 branch-misses:u # 0.48% of all branches (62.50%)
2,219,023,836,944 L1-dcache-loads:u # 1.048 G/sec (62.50%)
22,229,652,767 L1-dcache-load-misses:u # 1.00% of all L1-dcache accesses (62.50%)
7,365,496,829 LLC-loads:u # 3.480 M/sec (50.00%)
8,776,501 LLC-load-misses:u # 0.12% of all LL-cache accesses (50.00%)
542.778126319 seconds time elapsed
2077.248144000 seconds user
9.108950000 seconds sys
[sanmayce@leprechaun2 DraFF_rev1]$ sort Kazahana.txt
AJ:u_dragonfury
BJ:by_dragonfury
BJ:on_dragonfury
BJ:to_dragonfury
CJ:per_dragonfury
CJ:was_dragonfury
EK:moved_dragonsfury
EK:title_dragonsfury
FC:dragon_fur
FC:dragon_ury
FD:dragon_frye
FD:dragon_fury
FD:dragon_gury
FD:dragon_jury
FE:dragon_furea
FE:dragon_furia
FE:dragon_surya
FG:dragon_fureimu
FG:dragon_furious
FG:dragon_fursuit
FG:dragon_further
FI:dragon_furniture
FJ:dragon_furnishing
FJ:enwiki_dragonfury
GD:dragons_fury
GJ:comment_dragonfury
HJ:dragfyre_dragonfury
HJ:username_dragonfury
IC:seadragon_fry
IE:mondragon_faryd
IG:pendragon_further
IJ:userlinks_dragonfury
JA:dragonfury_s
JB:dragonfury_lt
JC:dragonfury_amp
JC:dragonfury_and
JC:dragonfury_has
JC:dragonfury_not
JC:dragonfury_was
JD:dragonfury_fred
JD:dragonfury_fwiw
JD:dragonfury_quot
JD:dragonfury_talk
JD:dragonfury_user
JD:dragonfury_with
JE:dragonfury_moved
JE:dragonfury_reply
JF:dragonfury_coibot
JF:dragonfury_enwiki
JF:dragonfury_notice
JF:dragonfury_target
JG:dragonfury_comment
JH:dragonfury_contribs
JH:dragonfury_previous
JH:dragonfury_reported
JH:dragonfury_username
JI:dragonfury_dysepsion
JI:greydragon_furniture
JJ:dragonfury_dragonfury
JJ:dragonfury_dragonhawk
JP:adventures_dragonsfuryclose
KF:dragonsfury_roller
KJ:userreports_dragonfury
KJ:usersummary_dragonfury
MJ:contributions_dragonfury
MJ:dragonfederal_dragonfury
ND:marcbluedragon_jury
OD:rcdbdragonsfury_quot
PD:tparkdragonsfury_quot
[sanmayce@leprechaun2 DraFF_rev1]$
Excited I am, lacking fuzzy functionality is to be thing of the past... the question is whether the QB64 wrapper will be allowed to execute the OpenMP C function! If problem arises then will rely on SHELL command.
Looking back, on the old QB64 forum, in this post I detected 2 errors in KJV bible - the two misspellings "daughers":
https://qb64forum.alephc.xyz/index.php?t...#msg139414
Using DraFF to find all the errorful "daughters" within Edit Distance 2, the console is this:
Code: (Select All)
[sanmayce@djudjeto7 DraFF_rev1]$ ./parallel_edit_distance TERAPIG_The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt daughters 2
Searching using 4 threads for pattern 'daughters' ...
Thread 1: Match found at file offset 1048698: ' daughter' (Edit Distance: 2)
Thread 1: Match found at file offset 1048699: 'daughters' (Edit Distance: 0)
Thread 1: Match found at file offset 1048700: 'aughters ' (Edit Distance: 2)
Thread 1: Match found at file offset 1048744: ' daughter' (Edit Distance: 2)
...
[sanmayce@djudjeto7 DraFF_rev1]$ sort TERAPIG_The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.hits -u
aughter s
aughter's
aughters
aughters,
aughters.
aughters:
aughters;
aughters?
daughers
daughers
daughter
daughter
daughter!
daughter'
daughter,
daughter.
daughter:
daughter;
daughter?
Daughter,
daughters
Daughters
laughter
laughter,
laughter.
laughter:
laughter;
[sanmayce@djudjeto7 DraFF_rev1]$
Thus, we found only one type of misspelling within 2 insertions/additions/substitutions.
"He learns not to learn and reverts to what the masses pass by."