A hybrid pattern-based data compressor - Dav - 12-23-2025
This is made as a learning experience, and not to try and replace DEFLATE or a real compressor. It's a hybrid data compressor using pattern recognition in a data window size, sort of like LZ77 does, and VLQ encoding like base-128 uses. It searches for patterns in the data and encodes those as references, compressing data. It's better than RLE, but not that much. Was a fun code to work on. More info in the code.
- Dav
Code: (Select All)
'=========
'PCOMP.BAS
'=========
'Simple Pattern-Based data Compressor with VLQ encoding.
'Coded by Dav for QB64-PE, DEC/2025
'This is a hybrid pattern-based data compressor. It searches data for
'repeated patterns within a specified window size, similar to how LZ77
'does, but without maintaining a sliding window or dynamic dictionary.
'Instead, it performs pattern matching within a fixed window size.
'Pattern references (distance and length) are encoded using VLQ, using
'the most significant bit (MSB) as a flag. If MSB=1, then more bytes
'follow. If MSB=0, this is the last byte. This method is like base-128
'encoding and is also the method used in MIDI files.
'Unlike RLE, which detects only consecutive runs, this compressor can
'identify non-consecutive repeats, matching patterns that appeared
'earlier in the data.
'Note: This compressor works well with patterned data but is not good
'for compressing random data.
'=====================================================================
Screen 12: _FullScreen
'make string of data for testing
For o = 1 To 12
orig$ = orig$ + _Trim$(Str$(Timer))
orig$ = orig$ + Time$ + Date$
Next
cdat$ = Compress$(orig$)
ddat$ = Decompress$(cdat$)
_ControlChr Off
Print "==============================================================================="
Print "ORIGINAL STRING Size ="; Len(orig$); "bytes"
Print "==============="
Print orig$
Print
Print "==============================================================================="
Print "COMPRESS STRING Size ="; Len(cdat$); "bytes"
Print "==============="
Print cdat$
Print
Print "==============================================================================="
Print "DECOMPRESSED STRING Size ="; Len(ddat$); "bytes"
Print "==================="
Print ddat$
Print
Print "==============================================================================="
Print
If orig$ = ddat$ Then
Print "MATCH! Original and decompressed are the same!"
If Len(cdat$) >= Len(orig$) Then
Print " >> Sorry, this data could not be compressed!"
Else
Print "Ratio ------>: Compressed to"; _Round(Len(cdat$) / Len(orig$) * 100); "% of original!"
End If
Print
Else
Print "ERROR! Original and decompressed don't match!"
End If
Print "============================================="
Function Compress$ (in$)
p = 1
While p <= Len(in$)
matchlen = 0: dis = 0
For f = _Max(1, p - 2048) To p - 1
lenmatch = 0
While (p + lenmatch <= Len(in$)) And (f + lenmatch <= Len(in$)) And (Mid$(in$, f + lenmatch, 1) = Mid$(in$, p + lenmatch, 1))
lenmatch = lenmatch + 1: If lenmatch >= 64 Then Exit While
Wend
If lenmatch > matchlen Then
matchlen = lenmatch
dis = p - f
End If
Next
If matchlen >= 4 Then
out$ = out$ + Chr$(255)
v = dis: d$ = ""
Do
bval = v Mod 128: v = v \ 128
If v > 0 Then bval = bval Or 128
d$ = d$ + Chr$(bval)
Loop While v > 0
out$ = out$ + d$
v = matchlen: b$ = ""
Do
bval = v Mod 128: v = v \ 128
If v > 0 Then bval = bval Or 128
b$ = b$ + Chr$(bval)
Loop While v > 0
out$ = out$ + b$
p = p + matchlen
Else
c$ = Mid$(in$, p, 1)
bval = Asc(c$)
If bval = 255 Then
out$ = out$ + Chr$(0) + Chr$(2)
ElseIf bval = 0 Then
out$ = out$ + Chr$(0) + Chr$(1)
Else
out$ = out$ + c$
End If
p = p + 1
End If
Wend
Compress$ = out$
End Function
Function Decompress$ (in$)
p = 1
While p <= Len(in$)
v = Asc(Mid$(in$, p, 1))
If v = 255 Then
p = p + 1: vlq = 0: b = 0
Do
bVal = Asc(Mid$(in$, p, 1)): p = p + 1
vlq = vlq + ((bVal And 127) * (2 ^ b))
b = b + 7
Loop While (bVal And 128) <> 0
d = vlq: vlq = 0: b = 0
Do
bVal = Asc(Mid$(in$, p, 1)): p = p + 1
vlq = vlq + ((bVal And 127) * (2 ^ b))
b = b + 7
Loop While (bVal And 128) <> 0
l = vlq
For k = 1 To l
If Len(out$) - d + 1 < 1 Then Exit For
out$ = out$ + Mid$(out$, Len(out$) - d + 1, 1)
Next
ElseIf v = 0 Then
bVal = Asc(Mid$(in$, p + 1, 1))
If bVal = 1 Then
out$ = out$ + Chr$(0)
ElseIf bVal = 2 Then
out$ = out$ + Chr$(255)
End If
p = p + 2
Else
out$ = out$ + Chr$(v)
p = p + 1
End If
Wend
Decompress$ = out$
End Function
RE: A hybrid pattern-based data compressor - PhilOfPerth - 12-24-2025
Very impressive - especially to someone whos nearest thing to data compression is using LOL and ASAP in my E-mails! Nice work.
RE: A hybrid pattern-based data compressor - Dav - 12-24-2025
(12-24-2025, 05:28 AM)PhilOfPerth Wrote: Very impressive - especially to someone whos nearest thing to data compression is using LOL and ASAP in my E-mails! Nice work.
lol. Thanks, Phil.
Tested it by compressing the qb64pe.bas source and compared to _DEFLATE$. Got the following results:
original size: 1,084,708 bytes
using PCOM: 383,271 bytes
_DEFLATE$: 176,435 bytes
Not too shabby, but PCOM took a LONG time to compress the file. I'll work on it and try to improve compression, or at least speed it up. Maybe replace the string concatenating building with the MID$ swapping method instead that @RhoSigma suggested on my base-85 encoder - that speed up the encoding/decoding dramatically.
- Dav
|