Processing huge files
#1
I saw the discussion in the MemFile System thread about fast file reading.
That all works fine for files up to 2GB.

I sometimes have to process huge (100GB+) CSV files.
For those I created this reader (about 2x slower, but with no size limit):
Code:
t! = Timer
recs~&& = processBigFile("20.csv", Chr$(10))
Print "Done"; " in"; (Timer - t!); "seconds"
End

Function processBigFile~&& (ifile$, eol$)
  Const BLOCKSIZE = 4 * 1024 * 1024 'on average 4MB blocks seems fastest
  Dim block As String * BLOCKSIZE
  filenum% = FreeFile
  Open ifile$ For Random Access Read As filenum% Len = Len(block)
  blocks~& = .5 + LOF(filenum%) / Len(block) 'number of blocks to read, rounded up to cover a final partial block
  buf$ = "": recs~&& = 0: bufpos~& = 1 'bufpos~& = first unprocessed character in buf$
  $Checking:Off
  For blck~& = 1 To blocks~&
    Get filenum%, blck~&, block: buf$ = Mid$(buf$, bufpos~&) + block 'prepend the unfinished line carried over from the previous block
    bufpos~& = 1: endline~& = InStr(bufpos~&, buf$, eol$)
    Do While endline~& >= bufpos~&
      recs~&& = recs~&& + 1
      lin$ = Mid$(buf$, bufpos~&, endline~& - bufpos~&)
      processLine lin$
      bufpos~& = endline~& + Len(eol$): endline~& = InStr(bufpos~&, buf$, eol$)
    Loop
    Locate , 1, 0: Print recs~&&;
  Next blck~&
  Print
  $Checking:On
  buf$ = "": Close
  processBigFile~&& = recs~&&
End Function

Sub processLine (lin$)
  ' do something with lin$
  'f3$ = CSV.field$(lin$, 3)
End Sub

Function CSV.field$ (lin$, n%)
  Const MAXFIELDS = 100
  Static rec$, fld$(1 To MAXFIELDS)
  If rec$ <> lin$ Then
    rec$ = lin$
    cf% = 0: q% = 0: i0% = 1: ll% = Len(rec$) 'i0% = start of the current field, q% = inside-quotes flag
    For i% = 1 To ll%
      cc% = Asc(Mid$(rec$, i%, 1))
      If cc% = 13 Or cc% = 10 Then
        Exit For
      ElseIf cc% = 34 Then '34 = "
        q% = 1 - q%
      ElseIf cc% = 44 And q% = 0 Then '44 = ,
        cf% = cf% + 1: fld$(cf%) = Mid$(rec$, i0%, i% - i0%)
        i0% = i% + 1
      End If
    Next i%
    cf% = cf% + 1: fld$(cf%) = Mid$(rec$, i0%, i% - i0%)
  End If
  CSV.field$ = fld$(n%)
End Function
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
#2
(08-22-2022, 05:08 PM)mdijkens Wrote: I sometimes have to process huge (100GB+) CSV files.
:
Seriously, I give up.

The thing is that when I try to copy a 4GB file to a USB 3.0 external disk it takes more than half an hour! It seems more onerous doing it on Linux than on Windows. How long does it take for you to copy a file that large from one disk to another? Or do you need to copy it?

Thank you for this program, at any rate. :tu:
#3
Yes. Processing log files from server clusters.
By the way, the code above also supports line- and field-based processing.
The CSV.field$ function (also usable in other line-based processing) splits a line into its fields only on the first call for that line, which makes repeated field lookups pretty fast.
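For instance (just a sketch; the field numbers and the "ERROR" value are made up for illustration), a processLine along these lines pays the splitting cost only on the first CSV.field$ call per line:
Code:
Sub processLine (lin$)
  ' Illustration only: the field numbers and values here are invented.
  ' The first CSV.field$ call for a line parses and caches all its fields;
  ' the second call on the same line just returns a cached value.
  Static errCount~&&
  host$ = CSV.field$(lin$, 1)
  status$ = CSV.field$(lin$, 3)
  If status$ = "ERROR" And host$ <> "" Then errCount~&& = errCount~&& + 1
End Sub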
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
#4
(08-24-2022, 10:13 PM)mnrvovrfc Wrote:
(08-22-2022, 05:08 PM)mdijkens Wrote: I sometimes have to process huge (100GB+) CSV files.
:
Seriously, I give up.

The thing is that when I try to copy a 4GB file to a USB 3.0 external disk it takes more than half an hour! It seems more onerous doing it on Linux than on Windows. How long does it take for you to copy a file that large from one disk to another? Or do you need to copy it?

Thank you for this program, at any rate. :tu:

Did not see the yellow text at first...
Just copying with this block-based approach is mostly limited by the write speed of the USB stick.
I have a relatively fast one that reaches around 250MB/s.
But over PCIe/Thunderbolt this same routine reaches 3GB/s.
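As a rough sketch of what such a block-based copy can look like (illustrative only, not the exact routine behind those numbers; the function name is made up, and Random-access writes pad the final block, so the copy ends up rounded up to a multiple of BLOCKSIZE):
Code:
Function copyBigFile~&& (ifile$, ofile$)
  Const BLOCKSIZE = 4 * 1024 * 1024 'same 4MB blocks as the reader above
  Dim block As String * BLOCKSIZE
  src% = FreeFile
  Open ifile$ For Random Access Read As src% Len = Len(block)
  dst% = FreeFile
  Open ofile$ For Random Access Write As dst% Len = Len(block) 'creates ofile$ if needed; an existing longer file is not truncated
  blocks~& = .5 + LOF(src%) / Len(block)
  For blck~& = 1 To blocks~&
    Get src%, blck~&, block 'read one block...
    Put dst%, blck~&, block '...and write it straight back out
  Next blck~&
  Close src%: Close dst%
  copyBigFile~&& = blocks~&
End Function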
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience



