Performance improvement splitting huge file - Printable Version

QB64 Phoenix Edition (https://qb64phoenix.com/forum)
Thread: Performance improvement splitting huge file (/showthread.php?tid=2115)
Performance improvement splitting huge file - mdijkens - 10-21-2023

This is a simplified part of a more complex process to split 1 huge inFile into multiple smaller outFile() ones:

Code: (Select All)
Dim As _Integer64 inSize, inPos, splitSize, splitPos ' positions and sizes need a 64-bit type for a multi-GB file
Dim As _Unsigned _Byte splitFiles, splitFile

inSize is the size in bytes of the inFile
inPos is the current character position of the inFile
outFile() are the strings built for the split files
splitFiles is the number of files to split into
splitSize is the size in bytes of each outFile (e.g. roundup(inSize / splitFiles))
splitFile is the current splitFile
splitPos is the current character position of the outFile

The above works, but variable-length strings and the Mid$() command are very time-expensive (a 2 GB inFile takes ~3 minutes). I've tried 2-dimensional byte arrays for the out-files, like outFile(files, length), but QB64 does not support Put with one dimension, like Put #x, , outFile(x). I've also tried mapping this 2-dimensional array with _MEM, but did not succeed so far. Does anyone have a clever trick to speed this up?

RE: Performance improvement splitting huge file - bplus - 10-21-2023

(10-21-2023, 01:36 PM) mdijkens Wrote: This is a simplified part of a more complex process to split 1 huge inFile into multiple smaller outFile() ones:

Binary access, not clever, but then no splitting is needed, just clever coding ;-)) You are not saying why you think you need to split.

RE: Performance improvement splitting huge file - mdijkens - 10-21-2023

Sorry bplus, but I do need to split. This snippet is part of a much bigger, complex process where I split input to be parallel-processed on different servers. The input has different record types (in the snippet just bytes) that need processing on specific servers. This approach works, but it is slow. Also, because of variable-length strings, I can only process in chunks of max 2 GB.

RE: Performance improvement splitting huge file - SMcNeill - 10-21-2023

2 GB file split in about 3 seconds on my drive.

Code: (Select All)
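' Sketch of a straight contiguous binary-access split like the one this post
' reports (2 GB in about 3 seconds). Not the exact listing from the thread:
' "big.dat", the "part<n>.dat" names, and splitFiles = 4 are assumed here.
Dim As _Integer64 inSize, splitSize, chunk
Dim As Long splitFiles, i
Dim As String buffer, out

splitFiles = 4
Open "big.dat" For Binary Access Read As #1
inSize = LOF(1)
splitSize = (inSize + splitFiles - 1) \ splitFiles ' round up, per the first post

For i = 1 To splitFiles
    chunk = splitSize
    If Loc(1) + chunk > inSize Then chunk = inSize - Loc(1) ' last part may be shorter
    buffer = Space$(chunk)
    Get #1, , buffer ' one big read into a preallocated buffer
    out = "part" + LTrim$(Str$(i)) + ".dat"
    If _FileExists(out) Then Kill out ' Binary mode does not truncate an existing file
    Open out For Binary Access Write As #2
    Put #2, , buffer ' one big write; no per-byte string work at all
    Close #2
Next
Close #1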
RE: Performance improvement splitting huge file - SMcNeill - 10-21-2023

1.93 GB (2,080,374,784 bytes) in size, anyway...

RE: Performance improvement splitting huge file - mdijkens - 10-21-2023

Thanks SMcNeill, but this puts the first part of the inFile in the first outFile, and so on:

1111222233334444 > 1111 2222 3333 4444

What I need (and it requires a lot more processing) is:

1234123412341234 > 1111 2222 3333 4444

So every next byte (or record) needs to go to the next file... I've really tried to narrow the snippet down to be as small and simple as possible; please take a look at what it does.

RE: Performance improvement splitting huge file - SMcNeill - 10-21-2023

Code: (Select All)
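' Sketch of the interleaved split explained in the reply below: byte 1 goes to
' file 1, byte 2 to file 2, ..., byte 11 back to file 1, and so on. Not the
' exact listing from the thread: "big.dat", the "part<n>.dat" names, and the
' 10-way split are assumptions. The key line is the ASC/_MEMGET assignment.
Const NrOfSplitFiles = 10
Dim FileData As _MEM
Dim As _Integer64 inSize, o
Dim As Long i, c, f
Dim splitData(1 To NrOfSplitFiles) As String
Dim As String raw, pending, out

Open "big.dat" For Binary Access Read As #1
inSize = LOF(1)
raw = Space$(inSize)
Get #1, , raw ' read the whole file in one go
Close #1
FileData = _MemNew(inSize)
_MemPut FileData, FileData.OFFSET, raw ' hold the file bytes in a _MEM block
raw = ""

For i = 1 To NrOfSplitFiles
    splitData(i) = Space$(1000) ' fixed 1000-byte buffers, allocated once
    out = "part" + LTrim$(Str$(i)) + ".dat"
    If _FileExists(out) Then Kill out ' Binary mode does not truncate an existing file
    f = i + 1
    Open out For Binary Access Write As #f
Next

c = 0: o = 0
Do While o + NrOfSplitFiles <= inSize
    c = c + 1 ' the push count: current position inside each 1000-byte buffer
    For i = 1 To NrOfSplitFiles
        ' one byte per output buffer; ASC assignment avoids any string concatenation
        ASC(splitData(i), c) = _MemGet(FileData, FileData.OFFSET + o + i - 1, _Unsigned _Byte)
    Next
    o = o + NrOfSplitFiles
    If c = 1000 Then ' buffers full: put them to the drive and start over at 0
        For i = 1 To NrOfSplitFiles: f = i + 1: Put #f, , splitData(i): Next
        c = 0
    End If
Loop
If c > 0 Then ' flush the partly filled buffers
    For i = 1 To NrOfSplitFiles
        pending = Left$(splitData(i), c)
        f = i + 1: Put #f, , pending
    Next
End If
' Tail bytes when inSize Mod NrOfSplitFiles <> 0 are not written here;
' the thread comes back to that case below.
Close ' closes all open files
_MemFree FileData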
1.3 GB split into 10 files in about 13 seconds. The biggest trick I'm using here is a set string size of 1000 bytes, whose values I then change with ASC and _MEMGET, so there's never any string addition or concatenation at work:

ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)

c is the push count. (Every 1000 elements, I put the data to the drive so we can start back over at 0, which allows me to use only 10,000 bytes for the sorting, since I'm already holding several GB of data in memory from the file itself.)
o is the data position in the file itself.
ASC is how I assign the value I get to the splitData() array.
_MEMGET lets me peek directly into the FileData.

I could've also used _MEMPUT to put the data into each of those splitData() arrays, but ASC seemed sufficient for the task. It's fairly well optimized and oodles faster than MID$, and it lets us avoid any calls to those slow string routines. Feel free to swap in _MEMPUT and see if that makes an even larger difference for you. At 10 seconds per GB, I'm pretty satisfied with the speed here.

NOTE: This splits "0123456789012345678901234567890..." into files of "0000000000...", "1111111111...", "2222222222...", and so on, as you mentioned.

RE: Performance improvement splitting huge file - mdijkens - 10-22-2023

YESSS, that's what I was looking for! Thanks a lot, SMcNeill. It was this line I was looking for :-)

ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)

There are still some minor issues in your code when FileData.SIZE MOD NrOfSplitFiles <> 0, but that's easy to fix. Really happy with an over 10x speed increase.

RE: Performance improvement splitting huge file - SpriggsySpriggs - 10-23-2023

What kind of files need to be processed such that this kind of splitting is required?

RE: Performance improvement splitting huge file - mdijkens - 10-23-2023

This is for an exception log of millions of smart meters processing minute-readings (~10 GB / hour). Once every hour these log lines need to be split per energy provider (~20 of them) to be processed in parallel by different systems. In the short term it is too difficult to have the log provider split them up already at creation time (that's the future scenario), so we need this for the time being. I love doing these temporary utilities in QB64 :-)
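For completeness: the tail-byte case mdijkens mentions above (FileData.SIZE Mod NrOfSplitFiles <> 0) can be patched with a short epilogue that runs after the main loop, before the output files are closed. The sketch below reuses the assumed names from the earlier sketch and is likewise only an illustration, not code from the thread.

Code: (Select All)
' After the main loop fewer than NrOfSplitFiles bytes remain, so a per-byte
' Put is cheap here; o still points at the first unprocessed byte.
Dim As Long t
Dim tail As String * 1
t = 0
Do While o < inSize
    t = t + 1
    tail = Chr$(_MemGet(FileData, FileData.OFFSET + o, _Unsigned _Byte))
    f = t + 1 ' same file numbering as above: file t is open as #t + 1
    Put #f, , tail
    o = o + 1
Loop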