Glad to share the first (known to me) English phrase-checker, called Pro(o)fanitto: the little profane that proofs, heh-heh, in the spirit of "The Little Engine That Could".
Simply put, the tool checks one file (the needle) against another (the haystack), line by line, dumping every needle line that was not found, paired with the haystack line at which the mismatch was detected.
The beauty is that no index structure is needed.
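For readers wondering how a check can get by without an index, here is a minimal Python sketch of one way to do it. This is not the actual Pro(o)fanitto source; it is my guess based on the `.sorted-unique` file names and the "last checked Haystack_Line" wording in the output. It assumes both files are sorted in plain byte order and free of duplicates, so a single merge-style pass is enough. The function name `unknown_lines` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def unknown_lines(needles: Iterable[str],
                  haystack: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (needle, last_checked_haystack_line) for each needle missing
    from the haystack.

    Assumption: both inputs are sorted ascending (plain code-point order,
    which equals byte order for ASCII) and contain no duplicates.
    """
    hay = iter(haystack)
    current = next(hay, None)   # haystack line under inspection
    last = ""                   # last haystack line consumed so far
    for needle in needles:
        # Skip haystack lines that sort before the needle; they can never
        # match this needle or any later one.
        while current is not None and current < needle:
            last = current
            current = next(hay, None)
        if current is None:
            # Haystack exhausted: report the last line that was read.
            yield needle, last
        elif current != needle:
            # First haystack line greater than the needle: a miss.
            yield needle, current
        # If current == needle the phrase is known; nothing to report.


if __name__ == "__main__":
    needles = ["BG:my_hoffman", "KI:significant_milestone"]
    haystack = ["BG:my_hoglund", "KI:significant_milestone",
                "KJ:significant_milestones"]
    for needle, hay_line in unknown_lines(needles, haystack):
        print(f"Needle_Line '{needle}', not found, "
              f"last checked Haystack_Line: '{hay_line}'.")
```

Each file is read exactly once, front to back, so memory use stays constant no matter how large the haystack is. The price is that both files have to be sorted beforehand.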
It works on both Linux and Windows and will be the very basis for the upcoming GUI phrase-checking in TriMasakari...
The package contains everything needed to properly RIP and PHRASECHECK any English text, with all source code included.
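To give an idea of what ripping produces, here is a rough Python sketch. It is not the packaged ripper. The key layout is inferred from the sample output further down: each word's length seems to be encoded as a letter (A=1, B=2, ...), followed by a colon and the words joined with underscores, e.g. 'KI:significant_milestone' for an 11-letter and a 9-letter word. The tokenization (lowercased runs of a-z only), the skipping of words longer than 26 letters, and the names `length_code` and `rip_ngrams` are all my assumptions. Only digram keys appear in the output shown, so the same layout for trigrams/tetragrams is an extrapolation.

```python
import re
from typing import List


def length_code(word: str) -> str:
    """Map a word length to a letter: 1 -> 'A', 2 -> 'B', ... 26 -> 'Z'.

    Assumption inferred from the sample output, not taken from the source.
    """
    return chr(ord("A") + len(word) - 1)


def rip_ngrams(text: str, n: int = 2) -> List[str]:
    """Return the sorted, duplicate-free n-gram keys found in text."""
    # Assumed tokenization: lowercase, keep only runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    keys = set()
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        # Assumption: words longer than 26 letters have no length letter,
        # so n-grams containing them are skipped in this sketch.
        if any(len(w) > 26 for w in gram):
            continue
        prefix = "".join(length_code(w) for w in gram)
        keys.add(prefix + ":" + "_".join(gram))
    # Plain code-point sort, matching byte order for ASCII keys.
    return sorted(keys)


if __name__ == "__main__":
    for key in rip_ngrams("A significant milestone, meticulously planned."):
        print(key)
```

Putting the length letters first groups keys by word length, and the sorted, de-duplicated list is what a line-by-line check against another sorted list needs.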
The file being ripped/phrasechecked can be even bigger than my biggest corpus, Library-Genesis-non-fiction: a 780GB tar, 1.5B/7.6B/18.5B digrams/trigrams/tetragrams strong.
To illustrate the functionality, I checked Sam's interview (from the previous post) against all the digrams/trigrams/tetragrams in the English Wikipedia dump from October 2024.
Speedwise, the proofing (dumping only the unfamiliar 2-grams, 3-grams and 4-grams) took 1.9 seconds, not bad.
Code:
[sanmayce@leprechaun2 rip]$ time sh Proofanitto.sh Sam_Obernik.txt enwiki-20241001-pages-articles.xml
real 0m1.899s
user 0m0.907s
sys 0m0.985s
[sanmayce@leprechaun2 rip]$ cat Sam_Obernik.txt.unknown-to-enwiki-20241001-pages-articles.xml.digrams.txt
Needle_Line 'BG:my_hoffman', not found, last checked Haystack_Line: 'BG:my_hoglund'.
Needle_Line 'CF:son_myself', not found, last checked Haystack_Line: 'CF:son_mysore'.
Needle_Line 'DE:dove_cause', not found, last checked Haystack_Line: 'DE:dove_cease'.
Needle_Line 'EI:gifts_awareness', not found, last checked Haystack_Line: 'EI:gifts_baksheesh'.
Needle_Line 'FC:norcap_and', not found, last checked Haystack_Line: 'FC:norcap_eng'.
Needle_Line 'GB:obernik_mr', not found, last checked Haystack_Line: 'GB:obernik_on'.
Needle_Line 'GC:obernik_had', not found, last checked Haystack_Line: 'GC:obernik_tom'.
Needle_Line 'GD:newborn_part', not found, last checked Haystack_Line: 'GD:newborn_paul'.
Needle_Line 'GE:hoffman_tools', not found, last checked Haystack_Line: 'GE:hoffman_topsy'.
Needle_Line 'GG:various_hoffman', not found, last checked Haystack_Line: 'GG:various_hokkien'.
Needle_Line 'GH:parents_patterns', not found, last checked Haystack_Line: 'GH:parents_patteson'.
Needle_Line 'GI:obernik_interview', not found, last checked Haystack_Line: 'GI:obernik_publisher'.
Needle_Line 'HD:stellify_your', not found, last checked Haystack_Line: 'HD:stellina_arte'.
Needle_Line 'HF:loveless_crimes', not found, last checked Haystack_Line: 'HF:loveless_critic'.
Needle_Line 'HL:physical_housekeeping', not found, last checked Haystack_Line: 'HL:physical_hydrodynamic'.
Needle_Line 'IC:butterfly_any', not found, last checked Haystack_Line: 'IC:butterfly_apa'.
Needle_Line 'IF:contacted_norcap', not found, last checked Haystack_Line: 'IF:contacted_norman'.
Needle_Line 'IF:emotional_reboot', not found, last checked Haystack_Line: 'IF:emotional_recall'.
Needle_Line 'JJ:collective_contagious', not found, last checked Haystack_Line: 'JJ:collective_constructs'.
Needle_Line 'KE:opportunity_steps', not found, last checked Haystack_Line: 'KE:opportunity_steve'.
Needle_Line 'NB:procrastinator_so', not found, last checked Haystack_Line: 'NB:procrastinator_yy'.
[sanmayce@leprechaun2 rip]$ tail Sam_Obernik.txt.2.sorted-unique
KI:significant_milestone
LB:cancellation_so
LB:consequences_it
LG:meticulously_planned
LG:particularly_towards
LK:particularly_challenging
MC:accommodation_was
MI:understanding_exercises
MM:compassionate_understanding
NB:procrastinator_so
[sanmayce@leprechaun2 rip]$ grep significant_milestone enwiki-20241001-pages-articles.xml.2.sorted-unique
KI:significant_milestone
KJ:significant_milestones
[sanmayce@leprechaun2 rip]$
"He learns not to learn and reverts to what the masses pass by."