QB64 Phoenix Edition
Performance improvement splitting huge file - Printable Version

+- QB64 Phoenix Edition (https://qb64phoenix.com/forum)
+-- Forum: QB64 Rising (https://qb64phoenix.com/forum/forumdisplay.php?fid=1)
+--- Forum: Code and Stuff (https://qb64phoenix.com/forum/forumdisplay.php?fid=3)
+---- Forum: Help Me! (https://qb64phoenix.com/forum/forumdisplay.php?fid=10)
+---- Thread: Performance improvement splitting huge file (/showthread.php?tid=2115)



Performance improvement splitting huge file - mdijkens - 10-21-2023

This is a simplified part of a more complex process to split 1 huge inFile into multiple smaller outFile() ones:

Code: (Select All)
  ' inSize, splitFiles and splitSize are assumed to be set earlier in the full program
  Dim As _Unsigned _Byte splitFiles, splitFile
  Dim As _Unsigned _Integer64 inSize, inPos, splitPos, splitSize
  Dim As _Unsigned _Byte inFile(1 To inSize), char
  Get #1, 1, inFile() ' read the whole input file (opened as #1) in one go
  Dim As String outFile(splitFiles)
  For splitFile = 1 To splitFiles
    outFile(splitFile) = String$(splitSize, 0) ' pre-size each output buffer
  Next splitFile

  For splitPos = 1 To splitSize
    For splitFile = 1 To splitFiles ' round-robin: each next byte goes to the next file
      inPos = inPos + 1
      If inPos <= inSize Then
        char = inFile(inPos)
        Mid$(outFile(splitFile), splitPos, 1) = Chr$(char)
      End If
    Next splitFile
  Next splitPos
  For splitFile = 1 To splitFiles
    Put #splitFile, , outFile(splitFile) ' output files assumed opened elsewhere under these file numbers
  Next splitFile
inFile() is the byte array of the input file
inSize is the size in bytes of inFile()
inPos is the current byte position in inFile()
outFile() holds the strings built for the split files
splitFiles is the number of files to split into
splitSize is the size in bytes of each outFile (e.g. roundup(inSize / splitFiles))
splitFile is the index of the current split file
splitPos is the current byte position within each outFile string

The above works, but variable-length strings and the Mid$() command are very time-expensive (a 2 GB inFile takes ~3 minutes).

I've tried 2-dimensional byte arrays for the output files, like outFile(files, length), but QB64 does not support Put with one dimension fixed, like Put #x, , outFile(x).
I've also tried mapping this 2-dimensional array with _MEM, but have not succeeded so far.
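
One direction that might make the _MEM mapping work (an untested sketch with placeholder sizes and file numbers): QB64 stores multi-dimensional arrays with the leftmost index varying fastest, so with the file index as the *last* dimension each output file's bytes should be contiguous, and a whole column can be copied out and written with a single Put:

Code: (Select All)
Const splitSize = 1024, splitFiles = 4 ' placeholder sizes
Dim outFile(1 To splitSize, 1 To splitFiles) As _Unsigned _Byte
Dim m As _MEM, f As Integer, column As String
m = _Mem(outFile()) ' one block covering the whole 2-D array
For f = 1 To splitFiles
    ' column f = splitSize contiguous bytes starting at (f - 1) * splitSize
    column = Space$(splitSize)
    _MemGet m, m.OFFSET + (f - 1) * splitSize, column ' fills Len(column) bytes
    Put f + 1, , column ' output files assumed already open as #2..#5
Next f
_MemFree m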

Does anyone have a clever trick to speed this up?


RE: Performance improvement splitting huge file - bplus - 10-21-2023

(10-21-2023, 01:36 PM)mdijkens Wrote: This is a simplified part of a more complex process to split 1 huge inFile into multiple smaller outFile() ones: [...] Does anyone have a clever trick to speed this up?

Binary access: not a clever trick, and no splitting needed, just clever coding ;-)) You're not saying why you think you need to split.


RE: Performance improvement splitting huge file - mdijkens - 10-21-2023

Sorry bplus, but I do need to split.

This snippet is part of a much bigger, complex process where I split the input to be parallel-processed on different servers.
The input has different record types (in the snippet, just bytes) that need processing on specific servers.

This approach works but is slow.
Also, because of variable-length strings, I can only process in chunks of max 2 GB.


RE: Performance improvement splitting huge file - SMcNeill - 10-21-2023

2GB file split in about 3 seconds on my drive.

Code: (Select All)
_DEFINE A-Z AS _INTEGER64
PRINT "Creating file data"
'create a nice large temp file
a$ = "Hello World. I'm Testing Crap!"
FOR i = 1 TO 26
    a$ = a$ + a$
NEXT


OPEN "test.txt" FOR BINARY AS #1
PRINT "Writing file data", LEN(a$)
PUT #1, , a$
CLOSE

infile$ = "test.txt" 'the file to read in and split
savefile$ = "split test" 'the name of the files to save to, without extension or index
Splitfiles = 11 'split it into 11 parts (or whatever you want)


OPEN infile$ FOR BINARY AS #1
SplitFileSize = _CEIL(LOF(1) / Splitfiles)
PRINT "Splitting file data", SplitFileSize



t## = TIMER
i = 0: p = 1
DO UNTIL EOF(1) OR p > LOF(1)
    i = i + 1
    IF p + SplitFileSize > LOF(1) THEN SplitFileSize = LOF(1) - SplitFileSize * (i - 1) 'last chunk: whatever is left
    partialfile$ = SPACE$(SplitFileSize)
    GET #1, , partialfile$
    p = p + SplitFileSize
    outfile$ = savefile$ + "(" + _TRIM$(STR$(i)) + ").split"
    PRINT "Writing file: "; outfile$, LEN(partialfile$)
    OPEN outfile$ FOR OUTPUT AS #2: CLOSE #2 'truncate any existing file first
    OPEN outfile$ FOR BINARY AS #2
    PUT #2, , partialfile$
    CLOSE #2
    IF i > 100 THEN Explode 'safety valve against an endless split
LOOP
PRINT USING "Done in ###.### seconds."; TIMER - t##

SUB Explode
    BEEP
    BEEP
    BEEP
    PRINT "OMG!! WE just filled the drive with a ton of crap from a bad, endless split!!"
    BEEP
    BEEP
    BEEP
    PRINT "HELP!!"
    BEEP
    BEEP
    BEEP
    PRINT "HELP!!"
    BEEP
    BEEP
    BEEP
    PRINT "DELETE IT ALL!! HEEEEEELLLPPPP!!!"
    BEEP
    BEEP
    BEEP
    END
END SUB



RE: Performance improvement splitting huge file - SMcNeill - 10-21-2023

1.93 GB (2,080,374,784 bytes) size, anyway...


RE: Performance improvement splitting huge file - mdijkens - 10-21-2023

Thanks SMcNeill, but this puts the first part of the inFile in the first outFile, etc.:
1111222233334444 > 1111 2222 3333 4444

What I need (and what requires a lot more processing) is:
1234123412341234 > 1111 2222 3333 4444

So every next byte (or record) needs to go to the next file...

I've really tried to narrow the snippet down to as small and simple as possible; please take a look at what it does.
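
To make the required distribution concrete, a tiny illustration (placeholder file count) of the round-robin rule: byte n of the input goes to output file ((n - 1) Mod splitFiles) + 1:

Code: (Select All)
Const splitFiles = 4 ' placeholder count
Dim n As Long
For n = 1 To 16
    Print "input byte"; n; "-> outFile"; ((n - 1) Mod splitFiles) + 1
Next n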


RE: Performance improvement splitting huge file - SMcNeill - 10-21-2023

Code: (Select All)
'$CHECKING:OFF
_DEFINE A-Z AS _INTEGER64
PRINT "Creating file data"
'create a nice large temp file
a$ = "0123456789"
FOR i = 1 TO 27
    a$ = a$ + a$
NEXT

OPEN "test.txt" FOR OUTPUT AS #1: CLOSE #1
OPEN "test.txt" FOR BINARY AS #1
PRINT "Writing file data", LEN(a$) '1.3 GB file
PUT #1, , a$
CLOSE

DIM FileData AS _MEM
FileData = _MEMNEW(LEN(a$)) 'can't point a mem pointer to a variable length string, so make a new memblock
_MEMPUT FileData, FileData.OFFSET, a$ 'and put the string in it

DIM splitData(0 TO 9) AS STRING


OPEN "test.txt" FOR BINARY AS #1

FOR i = 0 TO 9
    OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR OUTPUT AS #i + 2: CLOSE #i + 2
    OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR BINARY AS #i + 2
    splitData(i) = SPACE$(1000)
NEXT

PRINT "Writing Data..."
t## = TIMER

DIM o AS _OFFSET
o = FileData.OFFSET

c = 1 'c is the push counter

DO UNTIL o >= FileData.OFFSET + FileData.SIZE
    FOR i = 0 TO 9
        ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)
    NEXT
    IF c MOD 1000 = 0 THEN 'dump data in small segments so our strings don't grow too large
        c = 0
        FOR j = 0 TO 9
            PUT 2 + j, , splitData(j)
        NEXT
    END IF
    o = o + 10
    c = c + 1
LOOP
IF c <> 1 THEN 'dump the remainder of the data
    FOR j = 0 TO 9
        l$ = LEFT$(splitData(j), c - 1)
        PUT 2 + j, , l$
        splitData(j) = ""
    NEXT
END IF

PRINT USING "###.### seconds to break data into 10 files."; TIMER - t##

1.3GB split into 10 files in about 13 seconds.

The biggest trick that I'm using here is a fixed string size of 1000 bytes, whose values I then change with ASC and _MEMGET, so there's never any string addition or concatenation at work.

        ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)

c is the push count. (Every 1000 elements, I put the data to the drive so we can start back over at 0, which allows me to use only 10,000 bytes for the output buffers, since I'm already holding several GB of data in memory from the file itself.)
o is the data position in the file itself.

ASC is how I assign the value I get to the splitData() array.
_MEMGET lets me peek directly into the FileData.

I could've also used _MEMPUT to put the data into each of those splitData() arrays, but ASC seemed sufficient for the task.  It's fairly well optimized and oodles faster than MID$, and it allows us to avoid any calls to those slow string routines.
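
If you want to see the difference for yourself, here's a minimal timing sketch (buffer size and iteration count are arbitrary) comparing the two in-place byte writes:

Code: (Select All)
Dim s As String, i As Long, t As Double
s = Space$(10000000) '10 MB buffer
t = Timer
For i = 1 To 10000000
    Mid$(s, i, 1) = "A" 'write one byte via the string routine
Next
Print Using "Mid$: ###.### seconds"; Timer - t
t = Timer
For i = 1 To 10000000
    Asc(s, i) = 65 'poke the byte value directly
Next
Print Using "Asc : ###.### seconds"; Timer - t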

Feel free to swap out to _MEMPUT and see if that makes an even larger difference for you.  At 10 seconds per GB, I'm pretty satisfied with the speed here.  :-)
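
If you do try the _MEMPUT route, a rough sketch of the shape it might take (1000-byte buffers assumed; the commented line shows where it would slot into the loop above):

Code: (Select All)
Dim buf(0 To 9) As _MEM, i As Integer
For i = 0 To 9
    buf(i) = _MemNew(1000) 'one 1000-byte block per output file, instead of a string
Next
'inside the copy loop, this would replace the ASC assignment:
'_MemPut buf(i), buf(i).OFFSET + (c - 1), _MemGet(FileData, o + i, _Unsigned _Byte) As _Unsigned _Byte
For i = 0 To 9
    _MemFree buf(i)
Next

Getting a block back out for the Put still needs a _MEMGET into a string, so it's worth timing both approaches before committing to one.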


NOTE:  This splits "0123456789012345678901234567890...." into files of:

"0000000000...."
"1111111111...."
"2222222222...."
and so on, as you mentioned.


RE: Performance improvement splitting huge file - mdijkens - 10-22-2023

YESSS, that's what I was looking for!
Thanks a lot, SMcNeill

It was this line I was looking for :-)
ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)

There are still some minor issues in your code when FileData.SIZE MOD NrOfSplitFiles <> 0, but those are easy to fix.

Really happy with an over 10x speed increase!
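
For the record, a self-contained toy version of the remainder handling (made-up sizes, Print instead of Put), not the exact fix I used: stop the round-robin at the true end of the data and track how many bytes each output actually received:

Code: (Select All)
Const files = 10
Dim count(0 To files - 1) As Long, outS(0 To files - 1) As String
Dim src As String, p As Long, f As Integer
src = String$(23, "x") '23 bytes: deliberately not a multiple of 10
For f = 0 To files - 1: outS(f) = Space$(100): Next
For p = 1 To Len(src)
    f = (p - 1) Mod files 'round-robin target file
    count(f) = count(f) + 1
    Asc(outS(f), count(f)) = Asc(src, p)
Next p
For f = 0 To files - 1
    Print "file"; f; "got"; count(f); "bytes" 'files 0-2 get 3 bytes, the rest get 2
Next f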


RE: Performance improvement splitting huge file - SpriggsySpriggs - 10-23-2023

What kind of files are needing to be processed that this kind of splitting is required?


RE: Performance improvement splitting huge file - mdijkens - 10-23-2023

This is for an exception log of millions of smart meters producing minute readings (~10 GB / hour).
Once every hour these log lines need to be split per energy provider (~20) to be processed in parallel by different systems.

In the short term it is too difficult to have the log provider split them up already at creation time (the future scenario), so we need this for the time being.

I love doing these temporary utilities in QB64 :-)
I love to do these temporary utilities in QB64 :-)