Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Masakari - the abandoned text viewer
#1
Huh, how time flies, just found some old stuff:

Topic: A skeleton code for Text Scroller via Drag-and-Drop
https://qb64forum.alephc.xyz/index.php?t...#msg130335

Decided to recompile it with QB64PE 3.9.1 and see how my main ThinkPad X270 (32GB RAM, 4TB nvme) running Linux Fedora 40 fares with loading-for-browsing latest English Wikipedia XML dump, 97.1GiB:

Since each line is indexed, it needs 22.8 GiB RAM ~= 1.5 billion lines with 8 bytes OFFSET + 8 bytes LENGTH descriptor per line = 1.5*16= 24,000,000,000/1024/1024/1024), not as its successor TriMasakari which needs ... 0 RAM.

Wanted to see how fast "loads" and searches:

   

- All lines of enwiki indexed in: 420 seconds
- All hitlines with "NyoTengu" found in: 90 seconds

One of useful abilities is to sort all the lines by ... their length, this is done with 'Shift+1' combo.
Never saw an editor (even the legendary UltraEdit) with such a function, it is useful sometimes to view longest line(s) - clustered at the bottom. Regarding this XML dump, the longest line is 2,069,029 bytes long, it took 64 seconds to sort the 1.5 billion lines, pressing 'Enter' over the highlighted line wraps it for scrolling.

'Shift+2' sorts by Offset i.e. it returns to the original content.

Looking back, it features some nifty ideas, hate that it has to step aside and make path for better scrolling, faster searching, better ergonomic, more-more functionality... anyway, for archival purposes here comes revision 8.2+ package wit all sources.

My focus again is on English texts and phrase ripping/searching, hopefully next year will show the unseen English Phrase-Checker which is to be part of TriMasakari...
As a gift to all English language learners-n-aficionados, I collected under one roof a printer-friendly corpus 173 pages strong of all must-have (kinda cheatsheets) basics of English suffix/preposition craft, ONE UNSEEN RESOURCE, my word. These pages are product of more than a decade of my diving into English language. ENFUN!

   

   

Also, I showcased how the digrams/trigrams/tetragrams of a given interview (or book, or a superhuge book collection e.g. 'Library Genesis non-fiction' which I ripped recently, by the way 450GB is only the text and 1TB more are the x-grams where the tetragrams are 18+billion) in order the reader to have a superb index (exhaustive one) and to feed the incoming TriMasakari with this phrase-checking files...


Attached Files
.pdf   DRAGON_SUFFIX_ROSTER_REVISION_9.pdf (Size: 9.26 MB / Downloads: 23)
.7z   MASAKARI_r8.2+.7z (Size: 33.83 MB / Downloads: 24)
"He learns not to learn and reverts to what the masses pass by."
Reply
#2
The need for memmem()-like function accepting Edit Distance as a parameter stands for a long time.
In addition to Exact and Wildcards modes, the Fuzzy mode is a must.
For example, often searches are targeting misspelled phrases/words, the brute-force way is to harness the whole CPU power to locate them - this is what the following tool does.

Stomp, stomp... here comes DraFF - the Dragonic Fuzzy Finderess

   

The idea is, once having a superfast console tool written in C (by moi or/and AI) to transition to a QB64 function wrapper invoking the very same C function. Thus, the user will be able to compare speeds and hits correctness across both usecases.

Exactly a decade ago I wrote (the sourcecode is attached) the fastest SCALAR console tool Kazahana doing the tri modes (Exact, Wildcards, Fuzzy), but it is not that elegant codewise, moreover I stupidly limited it to using 16 threads and focused unnecessarily on text-only inputs. In contrast, DraFF is the simplicity itself while utilizing all the threads available (imagine how it will fly on 192 cores), also the input could be binary (filtering NUL,TAB,CR,LF on the fly, simply converting them to SPC). The advantage of Kazahana is the physical line dumping - giving the "sentences" housing the hits, simply put, it is more readable whereas DraFF targets getting the offsets of hits i.e. being more of an indexer. Also, I use some optimization heuristics making Kazahana twice as fast, see the results_X270.txt console dump file in the attachment.

For now, I did only the initial step, benchmarking the console tools against all the 453,474,349 digrams of English Wikipedia:

Code: (Select All)
[sanmayce@leprechaun2 DraFF]$ ls -l
-rw-r--r-- 1 sanmayce sanmayce 8271489435 Oct 18 12:51 enwiki-20241001-pages-articles.xml.2.sorted-unique
-rwxrwxrwx 1 sanmayce sanmayce      18544 Dec 31 08:42 parallel_edit_distance
-rwxrwxrwx 1 sanmayce sanmayce      6608 Dec 31 08:42 parallel_edit_distance.c
-rwxrwxrwx 1 sanmayce sanmayce      18504 Dec 31 08:42 parallel_edit_distance_RAM
-rwxrwxrwx 1 sanmayce sanmayce      7290 Dec 31 08:41 parallel_edit_distance_RAM.c
[sanmayce@leprechaun2 DraFF]$ perf stat -d ./parallel_edit_distance enwiki-20241001-pages-articles.xml.2.sorted-unique "dragon fury" 2
Searching using 4 threads for pattern 'dragon fury' ...
Thread 3: Match found at file offset 190942553: '_dragonfury' (Edit Distance: 2)
Thread 3: Match found at file offset 190942554: 'dragonfury ' (Edit Distance: 2)
Thread 3: Match found at file offset 548591669: '_dragonfury' (Edit Distance: 2)
Thread 3: Match found at file offset 548591670: 'dragonfury ' (Edit Distance: 2)
Thread 0: Match found at file offset 563134337: '_dragonfury' (Edit Distance: 2)
Thread 0: Match found at file offset 563134338: 'dragonfury ' (Edit Distance: 2)
Thread 0: Match found at file offset 567968201: '_dragonfury' (Edit Distance: 2)
Thread 0: Match found at file offset 567968202: 'dragonfury ' (Edit Distance: 2)
Thread 1: Match found at file offset 1197260642: '_dragonfury' (Edit Distance: 2)
...
[sanmayce@leprechaun2 DraFF_rev1]$ ls -l
-rwxr-xr-x 1 sanmayce sanmayce    166928 Jan 29  2024 Kazahana_Hexadecad_CLANG_SSE42_64bit.elf
[sanmayce@leprechaun2 DraFF_rev1]$ perf stat -d ./Kazahana_Hexadecad_CLANG_SSE42_64bit.elf 2e "dragon fury" enwiki-20241001-pages-articles.xml.2.sorted-unique 1234
____  __.                        .__                           
|    |/ _|_____  _____________  |  |__  _____    ____  _____ 
|      <  \__  \  \___  /\__  \  |  |  \ \__  \  /    \ \__  \ 
|    |  \  / __ \_ /    /  / __ \_|  Y  \ / __ \_|  |  \ / __ \_
|____|__ \(____  //_____ \(____  /|___|  /(____  /|___|  /(____  /
        \/    \/      \/    \/      \/      \/      \/      \/
Kazahana, a typhoon-class exact & wildcards & Levenshtein Distance (Wagner-Fischer) searcheress, r. 1-++fix+nowait_critical_nixFIX_Wolfram+fixITER+EX+CS_fix_DEFINE_Trolldom_LineByLine_fix_BackToBuffer_Lowercasing-fix, copyleft Kaze 2024-Jan-29.
WILDCARD_IP_flag = 0
WILDCARD_FAST_flag = 2
Exact_flag = 0
EXHAUSTIVE_flag = 1
Enforcing Case-Insensitive Fuzzy (EXHAUSTIVE) mode ...
Pattern: dragon fury
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating Master-Buffer 1234KB ... OK
/; Speed: 00,003,968,824 bytes/second; Traversed: 8,270,313,894 bytes; Dumped: 69
Kazahana: Total/Checked/Dumped xgrams: 453,474,349/14,330,902,985/69
Kazahana: Performance: 3,873 KB/s
Kazahana: Performance: 217,454 xgrams/s
Kazahana: Performance: Total/fread() clocks: 2,085,376,715/10,118,262
Kazahana: Performance: I/O time, i.e. fread() time, is 0 percents
Kazahana: Done.
Performance counter stats for './Kazahana_Hexadecad_CLANG_SSE42_64bit.elf 2e dragon fury enwiki-20241001-pages-articles.xml.2.sorted-unique 1234':
      2,116,690.74 msec task-clock:u                    #    3.900 CPUs utilized           
                0      context-switches:u              #    0.000 /sec                     
                0      cpu-migrations:u                #    0.000 /sec                     
              637      page-faults:u                    #    0.301 /sec                     
4,952,369,330,793      cycles:u                        #    2.340 GHz                        (50.00%)
8,194,339,388,346      instructions:u                  #    1.65  insn per cycle              (62.50%)
1,381,429,000,327      branches:u                      #  652.636 M/sec                      (62.50%)
    6,690,074,767      branch-misses:u                  #    0.48% of all branches            (62.50%)
2,219,023,836,944      L1-dcache-loads:u                #    1.048 G/sec                      (62.50%)
    22,229,652,767      L1-dcache-load-misses:u          #    1.00% of all L1-dcache accesses  (62.50%)
    7,365,496,829      LLC-loads:u                      #    3.480 M/sec                      (50.00%)
        8,776,501      LLC-load-misses:u                #    0.12% of all LL-cache accesses    (50.00%)
    542.778126319 seconds time elapsed
    2077.248144000 seconds user
      9.108950000 seconds sys
[sanmayce@leprechaun2 DraFF_rev1]$ sort Kazahana.txt
AJ:u_dragonfury
BJ:by_dragonfury
BJ:on_dragonfury
BJ:to_dragonfury
CJ:per_dragonfury
CJ:was_dragonfury
EK:moved_dragonsfury
EK:title_dragonsfury
FC:dragon_fur
FC:dragon_ury
FD:dragon_frye
FD:dragon_fury
FD:dragon_gury
FD:dragon_jury
FE:dragon_furea
FE:dragon_furia
FE:dragon_surya
FG:dragon_fureimu
FG:dragon_furious
FG:dragon_fursuit
FG:dragon_further
FI:dragon_furniture
FJ:dragon_furnishing
FJ:enwiki_dragonfury
GD:dragons_fury
GJ:comment_dragonfury
HJ:dragfyre_dragonfury
HJ:username_dragonfury
IC:seadragon_fry
IE:mondragon_faryd
IG:pendragon_further
IJ:userlinks_dragonfury
JA:dragonfury_s
JB:dragonfury_lt
JC:dragonfury_amp
JC:dragonfury_and
JC:dragonfury_has
JC:dragonfury_not
JC:dragonfury_was
JD:dragonfury_fred
JD:dragonfury_fwiw
JD:dragonfury_quot
JD:dragonfury_talk
JD:dragonfury_user
JD:dragonfury_with
JE:dragonfury_moved
JE:dragonfury_reply
JF:dragonfury_coibot
JF:dragonfury_enwiki
JF:dragonfury_notice
JF:dragonfury_target
JG:dragonfury_comment
JH:dragonfury_contribs
JH:dragonfury_previous
JH:dragonfury_reported
JH:dragonfury_username
JI:dragonfury_dysepsion
JI:greydragon_furniture
JJ:dragonfury_dragonfury
JJ:dragonfury_dragonhawk
JP:adventures_dragonsfuryclose
KF:dragonsfury_roller
KJ:userreports_dragonfury
KJ:usersummary_dragonfury
MJ:contributions_dragonfury
MJ:dragonfederal_dragonfury
ND:marcbluedragon_jury
OD:rcdbdragonsfury_quot
PD:tparkdragonsfury_quot
[sanmayce@leprechaun2 DraFF_rev1]$

Excited I am, lacking fuzzy functionality is to be thing of the past... the question is whether the QB64 wrapper will be allowed to execute the OpenMP C function! If problem arises then will rely on SHELL command.

Looking back, on the old QB64 forum, in this post I detected 2 errors in KJV bible - the two misspellings "daughers":
https://qb64forum.alephc.xyz/index.php?t...#msg139414

   

Using DraFF to find all the errorful "daughters" within Edit Distance 2, the console is this:

Code: (Select All)
[sanmayce@djudjeto7 DraFF_rev1]$ ./parallel_edit_distance TERAPIG_The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt daughters 2
Searching using 4 threads for pattern 'daughters' ...
Thread 1: Match found at file offset 1048698: ' daughter' (Edit Distance: 2)
Thread 1: Match found at file offset 1048699: 'daughters' (Edit Distance: 0)
Thread 1: Match found at file offset 1048700: 'aughters ' (Edit Distance: 2)
Thread 1: Match found at file offset 1048744: ' daughter' (Edit Distance: 2)
...
[sanmayce@djudjeto7 DraFF_rev1]$ sort TERAPIG_The_Project_Gutenberg_EBook_of_The_King_James_Bible_kjv10.txt.hits -u
aughter s
aughter's
aughters
aughters,
aughters.
aughters:
aughters;
aughters?
 daughers
daughers
 daughter
daughter
daughter!
daughter'
daughter,
daughter.
daughter:
daughter;
daughter?
Daughter,
daughters
Daughters
laughter
laughter,
laughter.
laughter:
laughter;
[sanmayce@djudjeto7 DraFF_rev1]$

Thus, we found only one type of misspelling within 2 insertions/additions/substitutions.


Attached Files
.7z   DraFF_rev1.7z (Size: 37.03 MB / Downloads: 8)
"He learns not to learn and reverts to what the masses pass by."
Reply
#3
Still didn't try making the QB64 wrapper of DraFF, wanted to clear the heavy fuzzy benchmark first by showcasing the showdown between DraFF & Kazahana - the two exhaustive fuzzy finderesses of mine, glad to share the complete package which will be the base for further functions in my QB64 GUI assistants.

In essence, the included benchmark addresses the problem with finding hard-to-find errors/misspellings which cannot be fixed with mere spell-check wordlists. The fuzzy searches are indispensable when phrases are the targets.

Now, the problematic question solved:
How many unique misspellings of "Sylvester Stallone" are in the English Wikipedia (from 2024-Oct) XML dump?

   

Code: (Select All)
[sanmayce@djudjeto SS_vs_enwiki]$ cat enwiki-20241001-pages-articles.xml.hits.unique
Salvestor Stallone
Silvester Stallone
=Silvester Stalone
Silvester Stalone
Silvestro Stallone
:Silvestr Stallone
Silvestr Stallone]
Sylvester Stallone
Sylvester Stalone
>Sylvester Stalone
[Sylvester Stalone
|Sylvester Stalone
Sylvester Stalone
Sylvester Stalone,
Sylvester Stalone.
Sylvester Stalone<
Sylvester Stalone]
Sylvester Stalone|
Sylvestor Stallone
[sanmayce@djudjeto SS_vs_enwiki]$
In the near future, I envision DraFF becoming part of my QB64PE text-sidekick TriMasakari...


Attached Files
.7z   CPU_Benchmark_Fuzzy-Search-Wikipedia_2025.7z (Size: 88.13 MB / Downloads: 4)
.pdf   Fuzzy-Finding_DraFF_plus_Kazahana_sourcecode.pdf (Size: 15.81 MB / Downloads: 4)
"He learns not to learn and reverts to what the masses pass by."
Reply




Users browsing this thread: 1 Guest(s)