Cautionary tale of open, append, close
#1
This is a cautionary tale.

Machine #1: i7, 8 cores, DDR3.  Machine #2: i9, 24 cores, DDR5.  RAM drive on each machine for outputs.  Only for reference here.

I had a very large list of unsorted data (file #1).  Each line was meant to go into another file based on a name found inside the line itself.  (line by line, new names)

My first thought was to find the name, open the destination file for append, write the line out, then close the output file.
Well, this way was very slow on machine #1 and somewhat slow on machine #2.

The right way to do this, and around 10,000% faster:
Keep a running list of open file handles in an array, e.g. Dim o$(filename, open handle#).
Search the array for a previously opened filename.  If found, keep its handle number in a variable.
If not found, use the next position in the array for the new filename and open handle number, open the new filename for append, and save both in the array.

Write the data to the appropriate file.  DO NOT CLOSE THE FILE.
All of this sits inside a Do/Loop that gets the next name from file #1.
On EOF(1) you are done.  System or End will close all handles.

Lots of little bits are missing from my description, but the core of the procedure is valid.
It seems that repeatedly opening and closing files for append is very wasteful of processor cycles, even on a RAM drive. (my first wrong assumption)
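The steps above can be sketched roughly like this. This is a minimal QB64 sketch, not the original poster's code; the input filename, the array sizes, and the name-extraction rule in GetDestName$ are assumptions for illustration only.

```basic
' Sketch of the cached-handle approach: open each output file once, keep it open.
Dim o$(1 To 1000) ' filenames already opened (assumed upper bound)
Dim h%(1 To 1000) ' matching open handle numbers
nOpen% = 0

Open "file1.dat" For Input As #1 ' hypothetical input filename
Do Until EOF(1)
    Line Input #1, rec$
    dest$ = GetDestName$(rec$)
    found% = 0
    For i% = 1 To nOpen% ' search the array for a previously opened filename
        If o$(i%) = dest$ Then found% = i%: Exit For
    Next i%
    If found% = 0 Then ' first time this name appears: open once, keep open
        nOpen% = nOpen% + 1
        o$(nOpen%) = dest$
        h%(nOpen%) = FreeFile
        Open dest$ For Append As #h%(nOpen%)
        found% = nOpen%
    End If
    Print #h%(found%), rec$ ' write the line; do NOT close the file
Loop
Close ' Close with no arguments closes every open handle
System

Function GetDestName$ (rec$)
    ' Hypothetical rule: the destination filename is the first comma-separated field
    p% = InStr(rec$, ",")
    If p% = 0 Then GetDestName$ = rec$ Else GetDestName$ = Left$(rec$, p% - 1)
End Function
```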

Take it for what it is.
Reply
#2
Good point!
Besides keeping files open, writing millions of small strings one at a time is also expensive.
I've adapted my writeBigFile function to your scenario, appending to multiple output files at ~250 MB/sec.
I'm using this approach to merge logfiles of over 100 GB each.

Code:
'open all output files
Dim fh%(0 To 9)
For i% = 0 To 9
  fh%(i%) = writeBigFile(0, "~test" + _ToStr$(i%) + ".tmp")
Next i%

'append 10 million random strings to random output files
t0! = Timer
For n& = 1 To 10000000
  i% = Int(Rnd * 10)
  v$ = String$(1 + Rnd * 100, 33 + Rnd * 30)
  tbytes&& = tbytes&& + writeBigFile(fh%(i%), v$)
Next n&
t1! = Timer

'close all output files
For i% = 0 To 9
  Print Using "~test#.tmp ###,###,### bytes"; i%; writeBigFile(-fh%(i%), "")
Next i%
Print Using "###,###,### total bytes written in #.### seconds"; tbytes&&; t1! - t0!

Function writeBigFile~&& (file%, content$)
    Const BLOCKSIZE = 2 ^ 22 ' 4 MB
    Static buf$(1000), bufPos(1000) As _Integer64
    Dim As _Unsigned _Integer64 contentLen, bufRemain

    If file% > 0 Then 'append to file
        contentLen = Len(content$): bufRemain = BLOCKSIZE - bufPos(file%)
        writeBigFile = contentLen
        If contentLen > bufRemain Then 'buffer full: top it off, flush, keep remainder
            Mid$(buf$(file%), bufPos(file%) + 1, bufRemain) = Left$(content$, bufRemain)
            Print #file%, buf$(file%);
            content$ = Mid$(content$, bufRemain + 1)
            contentLen = contentLen - bufRemain
            bufPos(file%) = 0
        End If
        Mid$(buf$(file%), bufPos(file%) + 1, contentLen) = Left$(content$, contentLen)
        bufPos(file%) = bufPos(file%) + contentLen
    ElseIf file% = 0 Then 'open new file, content$ is the filename
        file% = FreeFile
        buf$(file%) = String$(BLOCKSIZE, 0): bufPos(file%) = 0
        Open content$ For Append As #file%
        writeBigFile = file%
    ElseIf file% < 0 Then 'close file, flushing whatever is left in the buffer
        file% = -file%
        If bufPos(file%) > 0 Then Print #file%, Left$(buf$(file%), bufPos(file%));
        buf$(file%) = ""
        writeBigFile = LOF(file%)
        Close #file%
    End If
End Function

You can change Append to Output in the Open statement inside writeBigFile's 'new file' branch if you want to overwrite the output files instead of appending.
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
Reply
#3
BTW, I've also created a standard function to read huge (>100 GB) files and process them line by line or field by field, extremely fast.

Code:
Function CSV.read& (fileName$, eol$) ' 4M lines/sec
  Const BLOCKSIZE = 2 ^ 22 ' 4 MB
  If Not _FileExists(fileName$) Then CSV.read& = 0: Exit Function
  'Print "Reading lines from "; fileName$; " ";: cpos% = Pos(0)
  eoll% = Len(eol$)
  Dim As _Unsigned _Integer64 blocks
  Dim block As String * BLOCKSIZE
  ff% = FreeFile
  Open fileName$ For Binary Access Read As #ff%
  blocks = .5 + LOF(ff%) / Len(block)

  sep& = 0
  lines& = -1
  $Checking:Off
  For curblock& = 1 To blocks
    Get #ff%, , block
    If curblock& > 1 Then
      buf$ = Mid$(buf$, sep&) + block
      r0& = InStr(buf$, eol$) + eoll%
    Else
      buf$ = block
      r0& = 1
    End If
    r1& = InStr(r0&, buf$, eol$)
    Do While r1& >= r0& And r0& > 0
      lin$ = Mid$(buf$, r0&, r1& - r0& + eoll%)
      ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
      ret% = CSV.line(lin$) ' Process lin$
      ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
      lines& = lines& + 1
      sep& = r1&: r0& = r1& + eoll%: r1& = InStr(r0&, buf$, eol$)
    Loop
    'Locate , cpos%, 0: Print lines&;
  Next curblock&
  $Checking:On
  Close #ff%
  buf$ = ""
  'Locate , cpos%, 0
  CSV.read& = lines&
End Function

Function CSV.line% (l$)
  fields% = Val(CSV.field(l$, 0))
  For i% = 1 To fields%
    Print CSV.field(l$, i%); ",";
  Next i%
  Print
  CSV.line% = 1
End Function

Function CSV.field$ (lin$, n%)
  Const MAXFIELDS = 99
  Static cf%, rec$, f$(1 To MAXFIELDS)
  If rec$ <> lin$ Then
    rec$ = lin$
    cf% = 0: q% = 0: i0% = 1: ll% = Len(rec$)
    For i% = 1 To ll%
      cc% = Asc(Mid$(rec$, i%, 1))
      If cc% = 13 Or cc% = 10 Then
        Exit For
      ElseIf cc% = 34 Then '34 = "
        q% = 1 - q%
      ElseIf cc% = 44 And q% = 0 Then '44 = ,
        cf% = cf% + 1: f$(cf%) = Mid$(rec$, i0%, i% - i0%)
        i0% = i% + 1
      End If
    Next i%
    cf% = cf% + 1: f$(cf%) = Mid$(rec$, i0%, i% - i0%)
  End If
  If n% <= 0 Then CSV.field$ = _ToStr$(cf%) Else If n% <= cf% Then CSV.field$ = f$(n%) Else CSV.field$ = ""
End Function
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
Reply
#4
It's exactly what I am doing: a "common" log file split into machine-"specific" logs.  Thinking about it, I am just using a For/Next loop to search the open files in use, with the loop value serving as the file handle number.  I am going to change the open handles to the safer FreeFile function.  All output handles are in the array I am searching for filenames, so the value of the handle does not have to be known, only stored for lookup.
Reply



