Code:
'$CHECKING:OFF
_DEFINE A-Z AS _INTEGER64
PRINT "Creating file data"
'create a nice large temp file
a$ = "0123456789"
FOR i = 1 TO 27
a$ = a$ + a$
NEXT
OPEN "test.txt" FOR OUTPUT AS #1: CLOSE #1
OPEN "test.txt" FOR BINARY AS #1
PRINT "Writing file data", LEN(a$) '1.3 GB file
PUT #1, , a$
CLOSE
DIM FileData AS _MEM
FileData = _MEMNEW(LEN(a$)) 'can't point a mem pointer to a variable length string, so make a new memblock
_MEMPUT FileData, FileData.OFFSET, a$ 'and put the string in it
DIM splitData(0 TO 9) AS STRING
OPEN "test.txt" FOR BINARY AS #1
FOR i = 0 TO 9
OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR OUTPUT AS #i + 2: CLOSE #i + 2
OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR BINARY AS #i + 2
splitData(i) = SPACE$(1000)
NEXT
PRINT "Writing Data..."
t## = TIMER
DIM o AS _OFFSET
o = FileData.OFFSET
c = 1 'c is the push counter
DO UNTIL o >= FileData.OFFSET + FileData.SIZE
FOR i = 0 TO 9
ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)
NEXT
IF c MOD 1000 = 0 THEN 'dump data in small segments so our strings don't grow too large
c = 0
FOR j = 0 TO 9
PUT 2 + j, , splitData(j)
NEXT
END IF
o = o + 10
c = c + 1
LOOP
IF c <> 1 THEN 'dump remainder of data
FOR j = 0 TO 9
l$ = LEFT$(splitData(j), c - 1)
PUT 2 + j, , l$
splitData(j) = ""
NEXT
END IF
PRINT USING "###.### seconds to break data into 10 files."; TIMER - t##
1.3 GB split into 10 files in about 13 seconds.
The biggest trick I'm using here is a fixed string size of 1000 bytes, whose values I change in place with ASC and _MEMGET, so there's never any string addition or concatenation at work.
ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)
c is the push count. (Every 1000 elements, I write the data to the drive so we can start back over at 0, which allows me to use only 10,000 bytes for the split buffers, since I'm already holding over a GB of data in memory from the file itself.)
o is the data position in the file itself.
ASC is how I assign the value I get to the splitData() array.
_MEMGET lets me peek directly into the FileData.
I could've also used _MEMPUT to put the data into each of those splitData() arrays, but ASC seemed sufficient for the task. It's fairly well optimized and oodles faster than MID$, and allows us to avoid any calls to those slow string routines.
Feel free to swap out to _MEMPUT and see if that makes an even larger difference for you. At 10 seconds per GB, I'm pretty satisfied with the speed here.
NOTE: This splits "0123456789012345678901234567890...." into files of:
"0000000000...."
"1111111111...."
"2222222222...."
and so on, as you mentioned.