QB64 Phoenix Edition
A hybrid pattern-based data compressor - Printable Version

+- QB64 Phoenix Edition (https://qb64phoenix.com/forum)
+-- Forum: QB64 Rising (https://qb64phoenix.com/forum/forumdisplay.php?fid=1)
+--- Forum: Code and Stuff (https://qb64phoenix.com/forum/forumdisplay.php?fid=3)
+---- Forum: Utilities (https://qb64phoenix.com/forum/forumdisplay.php?fid=8)
+---- Thread: A hybrid pattern-based data compressor (/showthread.php?tid=4254)



A hybrid pattern-based data compressor - Dav - 12-23-2025

This is made as a learning experience, and not to try and replace DEFLATE or a real compressor.  It's a hybrid data compressor using pattern recognition in a data window size, sort of like LZ77 does, and VLQ encoding like base-128 uses.  It searches for patterns in the data and encodes those as references, compressing data.   It's better than RLE, but not that much.  Was a fun code to work on.  More info in the code.

- Dav

Code: (Select All)
'=========
'PCOMP.BAS
'=========
'Simple Pattern-Based data Compressor with VLQ encoding.
'Coded by Dav for QB64-PE, DEC/2025

'This is a hybrid pattern-based data compressor. It searches data for
'repeated patterns within a specified window size, similar to how LZ77
'does, but without maintaining a sliding window or dynamic dictionary.
'Instead, it performs pattern matching within a fixed window size.

'Pattern references (distance and length) are encoded using VLQ, using
'the most significant bit (MSB) as a flag. If MSB=1, then more bytes
'follow. If MSB=0, this is the last byte. This method is like base-128
'encoding and is also the method used in MIDI files.

'Unlike RLE, which detects only consecutive runs, this compressor can
'identify non-consecutive repeats, matching patterns that appeared
'earlier in the data.

'Note: This compressor works well with patterned data but is not good
'for compressing random data.

'=====================================================================

Screen 12: _FullScreen

'make string of data for testing
For o = 1 To 12
    orig$ = orig$ + _Trim$(Str$(Timer))
    orig$ = orig$ + Time$ + Date$
Next

cdat$ = Compress$(orig$)
ddat$ = Decompress$(cdat$)

_ControlChr Off

Print "==============================================================================="
Print "ORIGINAL STRING      Size ="; Len(orig$); "bytes"
Print "==============="
Print orig$
Print
Print "==============================================================================="
Print "COMPRESS STRING      Size ="; Len(cdat$); "bytes"
Print "==============="
Print cdat$
Print
Print "==============================================================================="
Print "DECOMPRESSED STRING  Size ="; Len(ddat$); "bytes"
Print "==================="
Print ddat$
Print
Print "==============================================================================="
Print
If orig$ = ddat$ Then
    Print "MATCH! Original and decompressed are the same!"
    If Len(cdat$) >= Len(orig$) Then
        Print " >> Sorry, this data could not be compressed!"
    Else
        Print "Ratio ------>: Compressed to"; _Round(Len(cdat$) / Len(orig$) * 100); "% of original!"
    End If
    Print
Else
    Print "ERROR! Original and decompressed don't match!"
End If
Print "============================================="

Function Compress$ (in$)
    p = 1
    While p <= Len(in$)
        matchlen = 0: dis = 0
        For f = _Max(1, p - 2048) To p - 1
            lenmatch = 0
            While (p + lenmatch <= Len(in$)) And (f + lenmatch <= Len(in$)) And (Mid$(in$, f + lenmatch, 1) = Mid$(in$, p + lenmatch, 1))
                lenmatch = lenmatch + 1: If lenmatch >= 64 Then Exit While
            Wend
            If lenmatch > matchlen Then
                matchlen = lenmatch
                dis = p - f
            End If
        Next
        If matchlen >= 4 Then
            out$ = out$ + Chr$(255)
            v = dis: d$ = ""
            Do
                bval = v Mod 128: v = v \ 128
                If v > 0 Then bval = bval Or 128
                d$ = d$ + Chr$(bval)
            Loop While v > 0
            out$ = out$ + d$
            v = matchlen: b$ = ""
            Do
                bval = v Mod 128: v = v \ 128
                If v > 0 Then bval = bval Or 128
                b$ = b$ + Chr$(bval)
            Loop While v > 0
            out$ = out$ + b$
            p = p + matchlen
        Else
            c$ = Mid$(in$, p, 1)
            bval = Asc(c$)
            If bval = 255 Then
                out$ = out$ + Chr$(0) + Chr$(2)
            ElseIf bval = 0 Then
                out$ = out$ + Chr$(0) + Chr$(1)
            Else
                out$ = out$ + c$
            End If
            p = p + 1
        End If
    Wend
    Compress$ = out$
End Function

Function Decompress$ (in$)
    p = 1
    While p <= Len(in$)
        v = Asc(Mid$(in$, p, 1))
        If v = 255 Then
            p = p + 1: vlq = 0: b = 0
            Do
                bVal = Asc(Mid$(in$, p, 1)): p = p + 1
                vlq = vlq + ((bVal And 127) * (2 ^ b))
                b = b + 7
            Loop While (bVal And 128) <> 0
            d = vlq: vlq = 0: b = 0
            Do
                bVal = Asc(Mid$(in$, p, 1)): p = p + 1
                vlq = vlq + ((bVal And 127) * (2 ^ b))
                b = b + 7
            Loop While (bVal And 128) <> 0
            l = vlq
            For k = 1 To l
                If Len(out$) - d + 1 < 1 Then Exit For
                out$ = out$ + Mid$(out$, Len(out$) - d + 1, 1)
            Next
        ElseIf v = 0 Then
            bVal = Asc(Mid$(in$, p + 1, 1))
            If bVal = 1 Then
                out$ = out$ + Chr$(0)
            ElseIf bVal = 2 Then
                out$ = out$ + Chr$(255)
            End If
            p = p + 2
        Else
            out$ = out$ + Chr$(v)
            p = p + 1
        End If
    Wend
    Decompress$ = out$
End Function



RE: A hybrid pattern-based data compressor - PhilOfPerth - 12-24-2025

Very impressive - especially to someone whos nearest thing to data compression is using LOL and ASAP in my E-mails!   Big Grin Nice work.


RE: A hybrid pattern-based data compressor - Dav - 12-24-2025

(12-24-2025, 05:28 AM)PhilOfPerth Wrote: Very impressive - especially to someone whos nearest thing to data compression is using LOL and ASAP in my E-mails!   Big Grin Nice work.

lol.  Thanks, Phil.

Tested it by compressing the qb64pe.bas source and compared to _DEFLATE$.  Got the following results:

original size: 1,084,708 bytes
using PCOM:  383,271 bytes
_DEFLATE$:  176,435 bytes

Not too shabby, but PCOM took a LONG time to compress the file.  I'll work on it and try to improve compression, or at least speed it up.  Maybe replace the string concatenating building with the MID$ swapping method instead that @RhoSigma suggested on my base-85 encoder - that speed up the encoding/decoding dramatically.

- Dav