Performance improvement splitting huge file
#1
This is a simplified part of a more complex process to split 1 huge inFile into multiple smaller outFile() ones:

Code: (Select All)
  Dim As _Unsigned _Byte splitFiles, splitFile
  Dim As _Unsigned _Integer64 inPos, inSize, splitPos, splitSize ' inPos must hold positions far beyond 255
  Dim As _Unsigned _Byte inFile(1 To inSize), char
  Get #1, 1, inFile()
  Dim As String outFile(splitFiles)
  For splitFile = 1 To splitFiles
    outFile(splitFile) = String$(splitSize, 0)
  Next splitFile

  For splitPos = 1 To splitSize
    For splitFile = 1 To splitFiles
      inPos = inPos + 1
      If inPos <= inSize Then
        char = inFile(inPos)
        Mid$(outFile(splitFile), splitPos, 1) = Chr$(char)
      End If
    Next splitFile
  Next splitPos
  For splitFile = 1 To splitFiles
    Put #splitFile, , outFile(splitFile)
  Next splitFile
inFile is the byte array holding the input file
inSize is the size in bytes of inFile
inPos is the current byte position in inFile
outFile() are the strings built for the split files
splitFiles is the number of files to split into
splitSize is the size in bytes of each outFile (e.g. roundup(inSize/splitFiles))
splitFile is the current split file
splitPos is the current byte position in the outFile

The above works, but variable-length strings and the Mid$() statement are very slow (a 2GB inFile takes ~3 minutes).

I've tried 2-dimensional byte arrays for the out-files, like outFile(files, length), but QB64 does not support Put with only one dimension given, like Put #x, , outFile(x).
I've also tried mapping this 2-dimensional array with _MEM, but have not succeeded so far.

Does anyone have a clever trick to speed this up?
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
#2
(10-21-2023, 01:36 PM)mdijkens Wrote: This is a simplified part of a more complex process to split 1 huge inFile into multiple smaller outFile() ones:

Above works, but variable length strings and the Mid$() command are very time-expensive (2GB inFile takes >10 minutes)


Does anyone have a clever trick to speed this up?

Binary access, not clever but no splitting needed, just clever coding ;-)) You are not saying why you think you need to split.
b = b + ...
#3
Sorry bplus, but I do need to split.

This snippet is part of a much bigger, more complex process where I split the input so it can be processed in parallel on different servers.
The input has different record types (in the snippet, just bytes) that need processing on specific servers.

This approach works but is slow.
Also, because of the variable-length strings, I can only process in chunks of max 2GB.
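
For the 2GB limit mentioned above, one option is to read the input in fixed-size chunks and reuse a single string buffer. This is only a minimal sketch: the file name and the chunk size are placeholders I made up, and the demo builds its own small input file so it can run on its own. If the chunk size is a multiple of the number of split files, every chunk starts with a byte for the first file, so the round-robin order carries over from one chunk to the next.

```basic
' Sketch only: file name and chunk size are placeholders, not from the thread.
CONST ChunkSize = 12 ' tiny for the demo; keep it a multiple of the file count
DIM AS _INTEGER64 remaining, chunkLen, chunks
DIM demo AS STRING, buf AS STRING

' Build a small throwaway input file so the sketch runs on its own.
demo = "0123456789012345678901234567890" ' 31 bytes
OPEN "chunk_demo.bin" FOR OUTPUT AS #1: CLOSE #1 ' truncate any old copy
OPEN "chunk_demo.bin" FOR BINARY AS #1
PUT #1, , demo
CLOSE #1

OPEN "chunk_demo.bin" FOR BINARY AS #1
remaining = LOF(1)
DO WHILE remaining > 0
    chunkLen = ChunkSize
    IF remaining < chunkLen THEN chunkLen = remaining ' last, shorter chunk
    buf = SPACE$(chunkLen) ' GET fills exactly LEN(buf) bytes
    GET #1, , buf
    chunks = chunks + 1
    PRINT "chunk"; chunks; "holds"; LEN(buf); "bytes"
    ' ...split buf here, then reuse it for the next chunk...
    remaining = remaining - chunkLen
LOOP
CLOSE #1
KILL "chunk_demo.bin" ' remove the demo file again
```

With a real input, ChunkSize would be something much larger (still well under 2GB), and the memory needed no longer depends on the total file size.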
#4
2GB file split in about 3 seconds on my drive.

Code: (Select All)
_DEFINE A-Z AS _INTEGER64
PRINT "Creating file data"
'create a nice large temp file
a$ = "Hello World. I'm Testing Crap!"
FOR i = 1 TO 26
a$ = a$ + a$
NEXT


OPEN "test.txt" FOR BINARY AS #1
PRINT "Writing file data", LEN(a$)
PUT #1, , a$
CLOSE

infile$ = "test.txt" 'the file to read in and split
savefile$ = "split test" 'the name of the files to save to, without extension or index
Splitfiles = 11 'split it into 11 parts (or whatever you want)


OPEN infile$ FOR BINARY AS #1
SplitFileSize = _CEIL(LOF(1) / Splitfiles)
PRINT "Splitting file data", SplitFileSize



t## = TIMER
i = 0: p = 1
DO UNTIL EOF(1) OR p > LOF(1)
i = i + 1
IF p + SplitFileSize > LOF(1) THEN SplitFileSize = LOF(1) - SplitFileSize * (i - 1)
partialfile$ = SPACE$(SplitFileSize)
GET #1, , partialfile$
p = p + SplitFileSize
outfile$ = savefile$ + "(" + _TRIM$(STR$(i)) + ").split"
PRINT "Writing file: "; outfile$, LEN(partialfile$)
OPEN outfile$ FOR OUTPUT AS #2: CLOSE #2
OPEN outfile$ FOR BINARY AS #2
PUT #2, , partialfile$
CLOSE #2
IF i > 100 THEN Explode
LOOP
PRINT USING "Done in ###.### seconds."; TIMER - t##

SUB Explode
BEEP
BEEP
BEEP
PRINT "OMG!! WE just filled the drive with a ton of crap from a bad, endless split!!"
BEEP
BEEP
BEEP
PRINT "HELP!!"
BEEP
BEEP
BEEP
PRINT "HELP!!"
BEEP
BEEP
BEEP
PRINT "DELETE IT ALL!! HEEEEEELLLPPPP!!!"
BEEP
BEEP
BEEP
END
END SUB
#5
1.93 GB (2,080,374,784 bytes) size, anyway...
#6
Thanks SMcNeill, but this puts the first part of the inFile in the first outFile, etc.
1111222233334444 > 1111 2222 3333 4444

What I need (and requires a lot more processing) is:
1234123412341234 > 1111 2222 3333 4444

So every next byte (or record) needs to go to the next file...

I've really tried to narrow the snippet down to be as small and simple as possible; please take a look at what it does.
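
The round-robin mapping described above comes down to plain index arithmetic. Here is a small sketch of it; TargetFile& and TargetPos& are names I made up for illustration, and both use 1-based positions like the snippet in the first post.

```basic
' Sketch: TargetFile& and TargetPos& are illustrative names, not from the thread.
DIM p AS LONG, n AS LONG
n = 4 ' number of split files
FOR p = 1 TO 8
    PRINT "input byte"; p; "-> file"; TargetFile&(p, n); "position"; TargetPos&(p, n)
NEXT

' 1-based index of the file that receives input byte p
FUNCTION TargetFile& (p AS LONG, n AS LONG)
    TargetFile& = ((p - 1) MOD n) + 1
END FUNCTION

' 1-based position of input byte p inside its target file
FUNCTION TargetPos& (p AS LONG, n AS LONG)
    TargetPos& = ((p - 1) \ n) + 1
END FUNCTION
```

For example, with 4 files, input byte 5 lands in file 1 at position 2.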
#7
Code: (Select All)
'$CHECKING:OFF
_DEFINE A-Z AS _INTEGER64
PRINT "Creating file data"
'create a nice large temp file
a$ = "0123456789"
FOR i = 1 TO 27
    a$ = a$ + a$
NEXT

OPEN "test.txt" FOR OUTPUT AS #1: CLOSE #1
OPEN "test.txt" FOR BINARY AS #1
PRINT "Writing file data", LEN(a$) '1.3 GB file
PUT #1, , a$
CLOSE

DIM FileData AS _MEM
FileData = _MEMNEW(LEN(a$)) 'can't point a mem pointer to a variable length string, so make a new memblock
_MEMPUT FileData, FileData.OFFSET, a$ 'and put the string in it

DIM splitData(0 TO 9) AS STRING


OPEN "test.txt" FOR BINARY AS #1

FOR i = 0 TO 9
    OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR OUTPUT AS #i + 2: CLOSE #i + 2
    OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR BINARY AS #i + 2
    splitData(i) = SPACE$(1000)
NEXT

PRINT "Writing Data..."
t## = TIMER

DIM o AS _OFFSET
o = FileData.OFFSET

c = 1 'c is the push counter

DO UNTIL o >= FileData.OFFSET + FileData.SIZE
    FOR i = 0 TO 9
        ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)
    NEXT
    IF c MOD 1000 = 0 THEN 'dump data in small segments so our strings don't grow too large
        c = 0
        FOR j = 0 TO 9
            PUT 2 + j, , splitData(j)
        NEXT
    END IF
    o = o + 10
    c = c + 1
LOOP
IF c <> 1 THEN 'dump remainder of data
    FOR j = 0 TO 9
        l$ = LEFT$(splitData(j), c - 1)
        PUT 2 + j, , l$
        splitData(j) = ""
    NEXT
END IF

PRINT USING "###.### seconds to break data into 10 files."; TIMER - t##

1.3GB split into 10 files in about 13 seconds.

The biggest trick that I'm using here is a set string size of 1000 bytes, which I then change the values of with ASC and _MEMGET, so there's never any string addition or concatenation at work.

        ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)

c is the push count. (Every 1000 elements, I put the data to the drive so we can start back over at 0, which allows me to only use 10,000 bytes for sorting, since I'm already holding several GB of data in memory from the file itself.)
o is the data position in the file itself.

ASC is how I assign the value I get to the splitData() array.
_MEMGET lets me peek directly into the FileData.

I could've also used _MEMPUT to put the data into each of those splitData() arrays, but ASC seemed sufficient for the task. It's fairly well optimized and oodles faster than MID$, and allows us to avoid any calls to those slow string routines.
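
If you want to check the ASC vs MID$ difference on your own machine, a rough timing sketch like the one below should do. The buffer size and the fill value are arbitrary choices of mine, and the numbers will vary with hardware and QB64 version, so treat it as a quick comparison rather than a proper benchmark.

```basic
' Sketch: buffer size and fill value are arbitrary; timings depend on the machine.
CONST BufLen = 5000000
DIM a AS STRING, b AS STRING
DIM i AS LONG, t AS DOUBLE

a = SPACE$(BufLen): b = SPACE$(BufLen)

t = TIMER
FOR i = 1 TO BufLen
    MID$(a, i, 1) = CHR$(65) ' goes through CHR$ and the string routines each pass
NEXT
PRINT "MID$ with CHR$:"; TIMER - t; "seconds"

t = TIMER
FOR i = 1 TO BufLen
    ASC(b, i) = 65 ' writes the byte value directly into the existing string
NEXT
PRINT "ASC statement:"; TIMER - t; "seconds"

' Both loops must produce the same content.
IF a = b THEN PRINT "Buffers match" ELSE PRINT "Buffers differ"
```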

Feel free to swap out to _MEMPUT and see if that makes an even larger difference for you. At 10 seconds per GB, I'm pretty satisfied with the speed here. :)


NOTE:  This splits "0123456789012345678901234567890...." into files of:

"0000000000...."
"1111111111...."
"2222222222...."
and so on, as you mentioned.
#8
YESSS, That's what I was looking for!
Thanks a lot, SMcNeill

It was this line I was looking for :-)
ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)

There are still some minor issues in your code when FileData.SIZE MOD NrOfSplitFiles <> 0, but that's easy to fix.
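
One possible way to handle that case, as a sketch only: do the complete rounds first, then hand out the leftover bytes to the first few outputs, so _MEMGET never reads past the end of the block. The variable names are mine, and it runs on a small in-memory sample instead of a file to keep it short.

```basic
' Sketch with made-up names; works on a small in-memory sample, not a file.
DIM src AS STRING, n AS LONG, i AS LONG
DIM AS _INTEGER64 total, fullRows, tail, row
src = "0123456789012" ' 13 bytes: not a multiple of n
n = 5
total = LEN(src)
fullRows = total \ n ' complete rounds across all n outputs
tail = total MOD n ' leftover bytes; only outputs 0..tail-1 get one

DIM m AS _MEM
m = _MEMNEW(total)
_MEMPUT m, m.OFFSET, src

DIM part(0 TO n - 1) AS STRING
FOR i = 0 TO n - 1
    IF i < tail THEN part(i) = SPACE$(fullRows + 1) ELSE part(i) = SPACE$(fullRows)
NEXT

DIM o AS _OFFSET
o = m.OFFSET
FOR row = 1 TO fullRows
    FOR i = 0 TO n - 1
        ASC(part(i), row) = _MEMGET(m, o + i, _UNSIGNED _BYTE)
    NEXT
    o = o + n
NEXT
FOR i = 0 TO tail - 1 ' partial last round: never reads past the block
    ASC(part(i), fullRows + 1) = _MEMGET(m, o + i, _UNSIGNED _BYTE)
NEXT
_MEMFREE m

FOR i = 0 TO n - 1
    PRINT i; part(i)
NEXT
```

With the 13-byte sample and 5 outputs, the parts come out as "050", "161", "272", "38" and "49". The same idea should carry over to the chunked version with the 1000-byte buffers, where only the very last round needs the shorter inner loop.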

Really happy with an over 10x speed increase
#9
What kind of files need to be processed that this kind of splitting is required?
Tread on those who tread on you

#10
This is for an exception log of millions of smart meters processing minute readings (~10 GB / hour).
Once every hour these log lines need to be split per energy provider (~20) so they can be processed in parallel by different systems.

In the short term it is too difficult to have the log provider split them up at creation time (that is a future scenario), so we need this for the time being.

I love to do these temporary utilities in QB64 :-)