Code:
'$CHECKING:OFF
_DEFINE A-Z AS _INTEGER64
PRINT "Creating file data"
'create a nice large temp file
a$ = "0123456789"
FOR i = 1 TO 27
a$ = a$ + a$
NEXT
OPEN "test.txt" FOR OUTPUT AS #1: CLOSE #1
OPEN "test.txt" FOR BINARY AS #1
PRINT "Writing file data", LEN(a$) '1.3 GB file
PUT #1, , a$
CLOSE
DIM FileData AS _MEM
FileData = _MEMNEW(LEN(a$)) 'can't point a mem pointer to a variable length string, so make a new memblock
_MEMPUT FileData, FileData.OFFSET, a$ 'and put the string in it
DIM splitData(0 TO 9) AS STRING
OPEN "test.txt" FOR BINARY AS #1
FOR i = 0 TO 9
OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR OUTPUT AS #i + 2: CLOSE #i + 2
OPEN "test_split(" + _TRIM$(STR$(i)) + ").txt" FOR BINARY AS #i + 2
splitData(i) = SPACE$(1000)
NEXT
PRINT "Writing Data..."
t## = TIMER
DIM o AS _OFFSET
o = FileData.OFFSET
c = 1 'c is the push counter
DO UNTIL o >= FileData.OFFSET + FileData.SIZE
FOR i = 0 TO 9
ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)
NEXT
IF c MOD 1000 = 0 THEN 'dump data in small segments so our strings don't grow too large
c = 0
FOR j = 0 TO 9
PUT 2 + j, , splitData(j)
NEXT
END IF
o = o + 10
c = c + 1
LOOP
IF c <> 1 THEN 'dump remainder of data
FOR j = 0 TO 9
l$ = LEFT$(splitData(j), c - 1)
PUT 2 + j, , l$
splitData(j) = ""
NEXT
END IF
PRINT USING "###.### seconds to break data into 10 files."; TIMER - t##
1.3 GB split into 10 files in about 13 seconds.
The biggest trick I'm using here is a fixed string size of 1000 bytes, whose values I change in place with ASC and _MEMGET, so there's never any string addition or concatenation at work.
ASC(splitData(i), c) = _MEMGET(FileData, o + i, _UNSIGNED _BYTE)
c is the push count. (Every 1000 elements, I write the data to the drive so we can start back over at 0, which allows me to use only 10,000 bytes for the split buffers, since I'm already holding over a GB of data in memory from the file itself.)
o is the data position in the file itself.
ASC is how I assign the value I get to the splitData() array.
_MEMGET lets me peek directly into the FileData.
I could've also used _MEMPUT to put the data into each of those splitData() arrays, but ASC seemed sufficient for the task. It's fairly well optimized and oodles faster than MID$, and allows us to avoid any calls to those slow string routines.
Feel free to swap out to _MEMPUT and see if that makes an even larger difference for you. At 10 seconds per GB, I'm pretty satisfied with the speed here.
NOTE: This splits "0123456789012345678901234567890...." into files of:
"0000000000...."
"1111111111...."
"2222222222...."
and so on, as you mentioned.