Posts: 128
Threads: 12
Joined: Apr 2022
Reputation: 14
10-21-2023, 01:36 PM
(This post was last modified: 10-21-2023, 01:44 PM by mdijkens.)
This is a simplified part of a more complex process that splits one huge inFile into multiple smaller outFile() files:
Code: (Select All)
Dim As _Unsigned _Byte splitFiles, splitFile
Dim As _Unsigned _Integer64 inSize, inPos, splitPos, splitSize ' inPos must be 64-bit; a byte would wrap at 255
Dim As _Unsigned _Byte inFile(1 To inSize), char
Get #1, 1, inFile() ' read the whole input file into the byte array
Dim As String outFile(splitFiles)
For splitFile = 1 To splitFiles
    outFile(splitFile) = String$(splitSize, 0) ' pre-size each output buffer
Next splitFile
For splitPos = 1 To splitSize
    For splitFile = 1 To splitFiles
        inPos = inPos + 1
        If inPos <= inSize Then
            char = inFile(inPos)
            Mid$(outFile(splitFile), splitPos, 1) = Chr$(char) ' each byte goes to the next file in turn
        End If
    Next splitFile
Next splitPos
' assumes the input was closed and the outputs were opened as #1 .. #splitFiles
For splitFile = 1 To splitFiles
    Put #splitFile, , outFile(splitFile)
Next splitFile
inFile is the byte array holding the input file
inSize is the size in bytes of inFile
inPos is the current character position in inFile
outFile() are the strings built for the split files
splitFiles is the number of files to split into
splitSize is the size in bytes of each outFile (e.g. roundup(inSize/splitFiles))
splitFile is the current split file
splitPos is the current character position in the outFile
The above works, but variable-length strings and the Mid$() statement are very slow (a 2GB inFile takes ~3 minutes).
I've tried 2-dimensional byte arrays for the out-files, like outFile(files, length), but QB64 does not support Put on a single dimension, as in Put #x, , outFile(x).
I've also tried mapping this 2-dimensional array with _MEM, but have not succeeded so far.
Does anyone have a clever trick to speed this up?
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
Posts: 3,981
Threads: 177
Joined: Apr 2022
Reputation: 220
10-21-2023, 01:40 PM
(This post was last modified: 10-21-2023, 01:41 PM by bplus.)
Binary access: not clever, but no splitting needed, just clever coding ;-)) You haven't said why you think you need to split.
b = b + ...
Posts: 128
Threads: 12
Joined: Apr 2022
Reputation: 14
Sorry bplus, but I do need to split.
This snippet is part of a much bigger, complex process where I split input to be processed in parallel on different servers.
The input has different record types (in the snippet, just bytes) that need processing on specific servers.
This approach works but is slow.
Also, because of variable-length strings, I can only process in chunks of at most 2GB.
Posts: 2,700
Threads: 327
Joined: Apr 2022
Reputation: 217
2GB file split in about 3 seconds on my drive.
Code: (Select All)
_DEFINE A-Z AS _INTEGER64
PRINT "Creating file data"
'create a nice large temp file
a$ = "Hello World. I'm Testing Crap!"
FOR i = 1 TO 26
    a$ = a$ + a$
NEXT
OPEN "test.txt" FOR BINARY AS #1
PRINT "Writing file data", LEN(a$)
PUT #1, , a$
CLOSE
infile$ = "test.txt" 'the file to read in and split
savefile$ = "split test" 'the name of the files to save to, without extension or index
Splitfiles = 11 'split it into 11 parts (or whatever you want)
OPEN infile$ FOR BINARY AS #1
SplitFileSize = _CEIL(LOF(1) / Splitfiles)
PRINT "Splitting file data", SplitFileSize
t## = TIMER
i = 0: p = 1
DO UNTIL EOF(1) OR p > LOF(1)
    i = i + 1
    'the last part only gets the bytes that remain
    IF p + SplitFileSize > LOF(1) THEN SplitFileSize = LOF(1) - SplitFileSize * (i - 1)
    partialfile$ = SPACE$(SplitFileSize)
    GET #1, , partialfile$
    p = p + SplitFileSize
    outfile$ = savefile$ + "(" + _TRIM$(STR$(i)) + ").split"
    PRINT "Writing file: "; outfile$, LEN(partialfile$)
    OPEN outfile$ FOR OUTPUT AS #2: CLOSE #2 'truncate any existing file first
    OPEN outfile$ FOR BINARY AS #2
    PUT #2, , partialfile$
    CLOSE #2
    IF i > 100 THEN Explode 'safety net against a runaway loop
LOOP
PRINT USING "Done in ###.### seconds."; TIMER - t##
SUB Explode
    BEEP
    BEEP
    BEEP
    PRINT "OMG!! WE just filled the drive with a ton of crap from a bad, endless split!!"
    BEEP
    BEEP
    BEEP
    PRINT "HELP!!"
    BEEP
    BEEP
    BEEP
    PRINT "HELP!!"
    BEEP
    BEEP
    BEEP
    PRINT "DELETE IT ALL!! HEEEEEELLLPPPP!!!"
    BEEP
    BEEP
    BEEP
    END
END SUB
Posts: 2,700
Threads: 327
Joined: Apr 2022
Reputation: 217
1.93 GB (2,080,374,784 bytes) size, anyway...
Posts: 128
Threads: 12
Joined: Apr 2022
Reputation: 14
Thanks SMcNeill, but this puts the first part of the inFile in the first outFile, etc.:
1111222233334444 > 1111 2222 3333 4444
What I need (and it requires a lot more processing) is:
1234123412341234 > 1111 2222 3333 4444
So every next byte (or record) needs to go to the next file...
I've really tried to narrow the snippet down to something as small and simple as possible; please take a look at what it does.
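The difference between the two layouts is plain index arithmetic. As a minimal sketch (src$, n, part() and the loop variables are names made up for this illustration, not taken from the real process): with 1-based positions, byte p of the input belongs to output ((p - 1) Mod n) + 1, at position ((p - 1) \ n) + 1 inside that output. Mid$ is used here only to keep the example short, not because it is fast.

```basic
' Round-robin (interleaved) split of a small string, for illustration only.
' src$, n and part() are hypothetical names, not from the original program.
Dim As Long n, p, f, q
src$ = "1234123412341234"
n = 4
Dim part(1 To n) As String
For f = 1 To n
    part(f) = Space$(Len(src$) \ n) ' the length divides evenly in this example
Next
For p = 1 To Len(src$)
    f = ((p - 1) Mod n) + 1 ' which output receives byte p
    q = ((p - 1) \ n) + 1 ' position of that byte inside the output
    Mid$(part(f), q, 1) = Mid$(src$, p, 1)
Next
For f = 1 To n
    Print part(f)
Next
```

Run on the 16-byte sample, this should print 1111, 2222, 3333 and 4444 on separate lines.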
Posts: 2,700
Threads: 327
Joined: Apr 2022
Reputation: 217
10-21-2023, 07:36 PM
(This post was last modified: 10-21-2023, 09:15 PM by SMcNeill.)
Code: (Select All)
'$CHECKING:OFF
_DEFINE A-Z AS _INTEGER64
PRINT "Creating file data"
'create a nice large temp file
a$ = "0123456789"
FOR i = 1 TO 27
    a$ = a$ + a$
NEXT
OPEN "test.txt" FOR OUTPUT AS #1: CLOSE #1
OPEN "test.txt" FOR BINARY AS #1
PRINT "Writing file data", LEN(a$) '1.3 GB file
PUT #1, , a$
CLOSE
DIM FileData AS _MEM
FileData = _MEMNEW(LEN(a$)) 'can't point a mem pointer to a variable length string, so make a new memblock
_MEMPUT FileData, FileData.OFFSET, a$ 'and put the string in it
DIM splitData(0 TO 9) AS STRING
OPEN "test.txt" FOR BINARY AS #1
FOR i = 0 TO 9
    OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR OUTPUT AS #i + 2: CLOSE #i + 2
    OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR BINARY AS #i + 2
    splitData(i) = SPACE$(1000)
NEXT
PRINT "Writing Data..."
t## = TIMER
DIM o AS _OFFSET
o = FileData.OFFSET
c = 1 'c is the push counter
DO UNTIL o >= FileData.OFFSET + FileData.SIZE
    FOR i = 0 TO 9
        ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)
    NEXT
    IF c MOD 1000 = 0 THEN 'dump data in small segments so our strings don't grow too large
        c = 0
        FOR j = 0 TO 9
            PUT 2 + j, , splitData(j)
        NEXT
    END IF
    o = o + 10
    c = c + 1
LOOP
IF c <> 1 THEN 'dump the remainder of the data
    FOR j = 0 TO 9
        l$ = LEFT$(splitData(j), c - 1)
        PUT 2 + j, , l$
        splitData(j) = ""
    NEXT
END IF
PRINT USING "###.### seconds to break data into 10 files."; TIMER - t##
1.3GB split into 10 files in about 13 seconds.
The biggest trick that I'm using here is a fixed string size of 1000 bytes, whose bytes I then overwrite with ASC and _MEMGET, so there's never any string addition or concatenation at work.
ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)
c is the push count. (Every 1000 elements, I put the data to the drive so we can start back over at 0, which allows me to only use 10,000 bytes for sorting, since I'm already holding several GB of data in memory from the file itself.)
o is the data position in the file itself.
ASC is how I assign the value I get to the splitData() array.
_MEMGET lets me peek directly into the FileData.
I could've also used _MEMPUT to put the data into each of those splitData() arrays, but ASC seemed sufficient for the task. It's fairly well optimized and oodles faster than MID$, and allows us to avoid any calls to those slow string routines.
Feel free to swap out to _MEMPUT and see if that makes an even larger difference for you. At 10 seconds per GB, I'm pretty satisfied with the speed here.
NOTE: This splits "0123456789012345678901234567890...." into files of:
"0000000000...."
"1111111111...."
"2222222222...."
and so on, as you mentioned.
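To see the difference between the two statements in isolation, a rough timing sketch like the one below can help. The buffer size and variable names are arbitrary choices for this example, and the numbers will vary by machine, so treat it as a quick experiment rather than a benchmark.

```basic
' Rough timing of two ways to overwrite single bytes in a pre-sized string.
' The buffer size and all names are assumptions made for this sketch.
Dim buf As String
Dim As Long i, n
Dim As Double t0, tMid, tAsc
n = 10000000
buf = Space$(n)

t0 = Timer
For i = 1 To n
    Mid$(buf, i, 1) = Chr$(i And 255) ' goes through a temporary 1-byte string
Next
tMid = Timer - t0

t0 = Timer
For i = 1 To n
    Asc(buf, i) = i And 255 ' stores the byte value directly
Next
tAsc = Timer - t0

Print "MID statement, seconds:"; tMid
Print "ASC statement, seconds:"; tAsc
```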
Posts: 128
Threads: 12
Joined: Apr 2022
Reputation: 14
YESSS, That's what I was looking for!
Thanks a lot SMcNeill
It was this line I was looking for :-)
ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)
There are still some minor issues in your code when FileData.SIZE MOD NrOfSplitFiles <> 0 but that's easy to fix
Really happy with an over 10x speed increase
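One way to handle an input whose size is not a multiple of the number of output files is to bounds-check each read and track how many bytes each output has received. This is only a sketch under assumptions: NFILES, BUFSIZE, part(), used() and the 14-byte sample are invented for illustration, it prints the results instead of writing files, and it leaves out the periodic flush to disk used in the code above.

```basic
' Interleaved split that tolerates a size that is not a multiple of NFILES.
' All names and sizes are illustrative assumptions, not the final fix.
Const NFILES = 4
Const BUFSIZE = 1000 ' large enough for this tiny sample, so no flush is needed
Dim src As _MEM
Dim As _Offset o, pastEnd
Dim As Long i
Dim part(0 To NFILES - 1) As String
Dim used(0 To NFILES - 1) As Long ' bytes stored so far in each output

a$ = "12341234123412" ' 14 bytes: 14 Mod 4 = 2, so the last round is partial
src = _MemNew(Len(a$))
_MemPut src, src.OFFSET, a$
pastEnd = src.OFFSET + src.SIZE

For i = 0 To NFILES - 1
    part(i) = Space$(BUFSIZE)
Next

o = src.OFFSET
Do While o < pastEnd
    For i = 0 To NFILES - 1
        If o + i < pastEnd Then ' skip slots that would read past the data
            used(i) = used(i) + 1
            Asc(part(i), used(i)) = _MemGet(src, o + i, _Unsigned _Byte)
        End If
    Next
    o = o + NFILES
Loop

For i = 0 To NFILES - 1
    Print Left$(part(i), used(i)) ' only the bytes that were really stored
Next
_MemFree src
```

With the 14-byte sample this should print 1111, 2222, 333 and 444: the last two outputs are one byte shorter because the final round is incomplete.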
Posts: 734
Threads: 30
Joined: Apr 2022
Reputation: 43
What kind of files are needing to be processed that this kind of splitting is required?
Tread on those who tread on you
Posts: 128
Threads: 12
Joined: Apr 2022
Reputation: 14
This is for an exception log of millions of smart meters processing minute readings (~10 GB / hour).
Once every hour these log lines need to be split per energy provider (~20) to be processed in parallel by different systems.
In the short term it is too difficult to have the log provider split them up at creation time (a future scenario), so we need this for the time being.
I love to do these temporary utilities in QB64 :-)