Posts: 120
Threads: 35
Joined: Apr 2022
Reputation:
9
This is a cautionary tale.
Machine #1: i7, 8 cores, DDR3. Machine #2: i9, 24 cores, DDR5. Each machine used a RAM drive for the outputs. The specs are only here for reference.
I had a very large list of unsorted data (file #1). Each line was meant to go into another file, chosen by a name contained in that line (so the destination can change line by line).
My first thought was to find the name, open the destination file for append, write the line out, and then close the output file.
Well, this way was very slow on machine #1 and somewhat slow on machine #2.
The right way to do this, which was 10,000% faster:
Keep a running list of open file handles in an array, e.g. Dim o$(filename, open handle#).
Search the array for the filename. If it was opened before, keep its handle number in a variable.
If it is not found, use the next position in the array for the new filename and handle number, open the new file for append, and save both in the array.
Write the data to the appropriate file. DO NOT CLOSE THE FILE.
Since this all sits inside a DO/LOOP, loop back and get the next name from file #1.
On EOF(1) you are done. SYSTEM or END will close all handles.
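The steps above can be sketched roughly as follows. This is only a minimal illustration, not my actual program: the input name "input.txt", the rule that the destination name is the text before the first comma, the ".log" extension, and the MAXFILES limit are all assumptions made up for the example. It also uses two parallel arrays instead of one two-dimensional array, and takes handle numbers from FreeFile.

```basic
' Sketch: send each line of an input file to an output file named in the line,
' keeping every output file open until the input is exhausted.
' Assumed for illustration only: "input.txt", name = text before first comma,
' ".log" extension, and no more than MAXFILES distinct names.
Const MAXFILES = 200

Dim outName$(1 To MAXFILES) ' destination filenames seen so far
Dim outHandle%(1 To MAXFILES) ' matching open file handles
Dim used%, slot%, i%, p&

Open "input.txt" For Input As #1
Do Until EOF(1)
    Line Input #1, l$
    p& = InStr(l$, ",")
    If p& > 1 Then
        dest$ = Left$(l$, p& - 1) + ".log"

        ' Search the list of files that are already open
        slot% = 0
        For i% = 1 To used%
            If outName$(i%) = dest$ Then slot% = i%: Exit For
        Next i%

        ' Not found: open the file once and remember its handle
        If slot% = 0 Then
            If used% = MAXFILES Then Print "Too many output files": End
            used% = used% + 1
            slot% = used%
            outName$(slot%) = dest$
            outHandle%(slot%) = FreeFile
            Open dest$ For Append As #outHandle%(slot%)
        End If

        ' Write the line; the file stays open for the next match
        Print #outHandle%(slot%), l$
    End If
Loop

' Close everything explicitly (END or SYSTEM would also close the handles)
Close
```

Lines without a leading name are simply skipped here; a real program would decide what to do with them. Keep in mind that the operating system limits how many files can be open at once, so MAXFILES cannot be raised without bound.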
Lots of little details are still needed to flesh out my description, but the core of the procedure is valid.
It seems that opening and closing files for every append is very wasteful of processor cycles, even on a RAM drive. (Assuming otherwise was my first mistake.)
Take it for what it is.
Posts: 143
Threads: 14
Joined: Apr 2022
Reputation:
15
01-15-2025, 09:40 PM
(This post was last modified: 01-15-2025, 10:12 PM by mdijkens.)
Good point!
Besides keeping the files open, buffering helps too: writing millions of small strings is also expensive.
I've adapted my writeBigFile function for your scenario so it appends to multiple output files (~250 MB/sec).
I'm using this approach to merge log files of over 100GB each.
Code:
'open all output files
Dim fh%(0 To 9)
For i% = 0 To 9
    fh%(i%) = writeBigFile(0, "~test" + _ToStr$(i%) + ".tmp")
Next i%

'append 10mln random strings to random output files
t0! = Timer
For n& = 1 To 10000000
    i% = Int(Rnd * 10)
    v$ = String$(1 + Rnd * 100, 33 + Rnd * 30)
    tbytes&& = tbytes&& + writeBigFile(fh%(i%), v$)
Next n&
t1! = Timer

'close all output files
For i% = 0 To 9
    Print Using "~test#.tmp ###,###,### bytes"; i%; writeBigFile(-fh%(i%), "")
Next i%
Print Using "###,###,### total bytes written in #.### seconds"; tbytes&&; t1! - t0!

Function writeBigFile~&& (file%, content$)
    Const BLOCKSIZE = 2 ^ 22 ' 4 MB
    Static buf$(1000), bufPos(1000) As _Integer64
    Dim As _Unsigned _Integer64 contentLen, bufRemain
    If file% > 0 Then 'append file
        contentLen = Len(content$): bufRemain = BLOCKSIZE - bufPos(file%)
        writeBigFile = contentLen
        If contentLen > bufRemain Then
            Mid$(buf$(file%), bufPos(file%) + 1, bufRemain) = Left$(content$, bufRemain)
            Print #file%, buf$(file%);
            content$ = Mid$(content$, bufRemain + 1)
            contentLen = contentLen - bufRemain
            bufPos(file%) = 0
        End If
        Mid$(buf$(file%), bufPos(file%) + 1, contentLen) = Left$(content$, contentLen)
        bufPos(file%) = bufPos(file%) + contentLen
    ElseIf file% = 0 Then 'new file
        file% = FreeFile
        buf$(file%) = String$(BLOCKSIZE, 0): bufPos(file%) = 0
        Open content$ For Append As #file%
        writeBigFile = file%
    ElseIf file% < 0 Then 'close file
        file% = -file%
        If bufPos(file%) > 0 Then Print #file%, Left$(buf$(file%), bufPos(file%));
        buf$(file%) = ""
        writeBigFile = LOF(file%)
        Close #file%
    End If
End Function
You can change APPEND to OUTPUT in the Open statement inside writeBigFile if you want to overwrite the output files.
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
01-15-2025, 09:48 PM
(This post was last modified: 01-15-2025, 10:11 PM by mdijkens.)
BTW, I've also created a standard function to read huge (>100GB) files and process them line by line or field by field extremely fast.
Code:
Function CSV.read& (fileName$, eol$) ' 4M lines/sec
    Const BLOCKSIZE = 2 ^ 22 ' 4 MB
    If Not _FileExists(fileName$) Then CSV.read& = 0: Exit Function
    'Print "Reading lines from "; fileName$; " ";: cpos% = Pos(0)
    eoll% = Len(eol$)
    Dim As _Unsigned _Integer64 blocks
    Dim block As String * BLOCKSIZE
    ff% = FreeFile
    Open fileName$ For Binary Access Read As #ff%
    blocks = .5 + LOF(ff%) / Len(block)
    sep& = 0
    lines& = -1
    $Checking:Off
    For curblock& = 1 To blocks
        Get #ff%, , block
        If curblock& > 1 Then
            buf$ = Mid$(buf$, sep&) + block
            r0& = InStr(buf$, eol$) + eoll%
        Else
            buf$ = block
            r0& = 1
        End If
        r1& = InStr(r0&, buf$, eol$)
        Do While r1& >= r0& And r0& > 0
            lin$ = Mid$(buf$, r0&, r1& - r0& + eoll%)
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            ret% = CSV.line(lin$) ' Process lin$
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            lines& = lines& + 1
            sep& = r1&: r0& = r1& + eoll%: r1& = InStr(r0&, buf$, eol$)
        Loop
        'Locate , cpos%, 0: Print lines&;
    Next curblock&
    $Checking:On
    Close #ff%
    buf$ = ""
    'Locate , cpos%, 0
    CSV.read& = lines&
End Function

Function CSV.line% (l$)
    fields% = Val(CSV.field(l$, 0))
    For i% = 1 To fields%
        Print CSV.field(l$, i%); ",";
    Next i%
    Print
    CSV.line% = 1
End Function

Function CSV.field$ (lin$, n%)
    Const MAXFIELDS = 99
    Static cf%, rec$, f$(1 To MAXFIELDS)
    If rec$ <> lin$ Then
        rec$ = lin$
        cf% = 0: q% = 0: i0% = 0: ll% = Len(rec$)
        For i% = 1 To ll%
            cc% = Asc(Mid$(rec$, i%, 1))
            If cc% = 13 Or cc% = 10 Then
                Exit For
            ElseIf cc% = 34 Then '34 = "
                q% = 1 - q%
            ElseIf cc% = 44 And q% = 0 Then '44 = ,
                cf% = cf% + 1: f$(cf%) = Mid$(rec$, i0%, i% - i0%)
                i0% = i% + 1
            End If
        Next i%
        cf% = cf% + 1: f$(cf%) = Mid$(rec$, i0%, i% - i0%)
    End If
    If n% <= 0 Then CSV.field$ = _ToStr$(cf%) Else If n% <= cf% Then CSV.field$ = f$(n%) Else CSV.field$ = ""
End Function
01-16-2025, 10:23 AM
(This post was last modified: 01-16-2025, 10:25 AM by doppler.)
That's exactly what I am doing: splitting a "common" log file into the respective machine "specific" logs. Thinking about it, I am just using a FOR/NEXT loop to search the open files in use, but I am using the loop value as the file number. I am going to change that so the handle comes from the safer FREEFILE function. All output handles are in the array I am searching for filenames, so the value of a handle does not have to be known in advance, only stored for lookup.