Processing huge files
#1
I saw the discussion in the MemFile System thread about fast file reading.
That all works fine for files up to 2GB.

I sometimes have to process huge (100GB+) CSV files.
For those I created this reader (about 2x slower, but with no size limit):
Code:
t! = Timer
recs~&& = processBigFile("20.csv", Chr$(10))
Print "Done"; " in"; (Timer - t!); "seconds"
End

Function processBigFile~&& (ifile$, eol$)
  Const BLOCKSIZE = 4 * 1024 * 1024 'on average 4MB blocks seems fastest
  Dim block As String * BLOCKSIZE
  filenum% = FreeFile
  Open ifile$ For Random Access Read As filenum% Len = Len(block)
  blocks~& = .5 + LOF(filenum%) / Len(block) 'number of blocks to read, rounded up to cover a final partial block
  buf$ = "": recs~&& = 0: bufpos~& = 1 'bufpos~& = first unprocessed character in buf$
  $Checking:Off
  For blck~& = 1 To blocks~&
    Get filenum%, blck~&, block: buf$ = Mid$(buf$, bufpos~&) + block 'prepend the unfinished line carried over from the previous block
    bufpos~& = 1: endline~& = InStr(bufpos~&, buf$, eol$)
    Do While endline~& >= bufpos~&
      recs~&& = recs~&& + 1
      lin$ = Mid$(buf$, bufpos~&, endline~& - bufpos~&)
      processLine lin$
      bufpos~& = endline~& + Len(eol$): endline~& = InStr(bufpos~&, buf$, eol$)
    Loop
    Locate , 1, 0: Print recs~&&;
  Next blck~&
  Print
  $Checking:On
  buf$ = "": Close
  processBigFile~&& = recs~&&
End Function

Sub processLine (lin$)
  ' do something with lin$
  'f3$ = CSV.field$(lin$, 3)
End Sub

Function CSV.field$ (lin$, n%)
  Const MAXFIELDS = 100
  Static rec$, fld$(1 To MAXFIELDS)
  If rec$ <> lin$ Then
    rec$ = lin$
    cf% = 0: q% = 0: i0% = 1: ll% = Len(rec$) 'i0% = start of the current field, q% = inside-quotes flag
    For i% = 1 To ll%
      cc% = Asc(Mid$(rec$, i%, 1))
      If cc% = 13 Or cc% = 10 Then
        Exit For
      ElseIf cc% = 34 Then '34 = "
        q% = 1 - q%
      ElseIf cc% = 44 And q% = 0 Then '44 = ,
        cf% = cf% + 1: fld$(cf%) = Mid$(rec$, i0%, i% - i0%)
        i0% = i% + 1
      End If
    Next i%
    cf% = cf% + 1: fld$(cf%) = Mid$(rec$, i0%, i% - i0%)
  End If
  CSV.field$ = fld$(n%)
End Function
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
#2
(08-22-2022, 05:08 PM)mdijkens Wrote: I sometimes have to process huge (100GB+) CSV files.
:
Seriously, I give up.

The thing is that when I try to copy a 4GB file to a USB 3.0 external disk it takes more than half an hour! It seems more onerous doing it on Linux than on Windows. How long does it take for you to copy a file that large from one disk to another? Or do you need to copy it?

Thank you for this program, at any rate. :tu:
#3
Yes. Processing log files from server clusters.
By the way, the code above also supports line- and field-based processing.
The CSV.field$ function (also usable in other line-based processing) splits a line into its fields only on the first call for that line, which makes repeated field lookups pretty fast.
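For instance (just a sketch; the field numbers and the "ERROR" value are made up for illustration), a processLine along these lines pays the splitting cost only on the first CSV.field$ call per line:
Code:
Sub processLine (lin$)
  ' Illustration only: the field numbers and values here are invented.
  ' The first CSV.field$ call for a line parses and caches all its fields;
  ' the second call on the same line just returns a cached value.
  Static errCount~&&
  host$ = CSV.field$(lin$, 1)
  status$ = CSV.field$(lin$, 3)
  If status$ = "ERROR" And host$ <> "" Then errCount~&& = errCount~&& + 1
End Sub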
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
#4
(08-24-2022, 10:13 PM)mnrvovrfc Wrote:
(08-22-2022, 05:08 PM)mdijkens Wrote: I sometimes have to process huge (100GB+) CSV files.
:
Seriously, I give up.

The thing is that when I try to copy a 4GB file to a USB 3.0 external disk it takes more than half an hour! It seems more onerous doing it on Linux than on Windows. How long does it take for you to copy a file that large from one disk to another? Or do you need to copy it?

Thank you for this program, at any rate. :tu:

Did not see the yellow text at first...
Just copying with this block-based approach is mostly limited by the write speed of the USB stick.
I have a relatively fast one that reaches around 250MB/s.
But over PCIe/Thunderbolt this same routine reaches 3GB/s.
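As a rough sketch of what such a block-based copy can look like (illustrative only, not the exact routine behind those numbers; the function name is made up, and Random-access writes pad the final block, so the copy ends up rounded up to a multiple of BLOCKSIZE):
Code:
Function copyBigFile~&& (ifile$, ofile$)
  Const BLOCKSIZE = 4 * 1024 * 1024 'same 4MB blocks as the reader above
  Dim block As String * BLOCKSIZE
  src% = FreeFile
  Open ifile$ For Random Access Read As src% Len = Len(block)
  dst% = FreeFile
  Open ofile$ For Random Access Write As dst% Len = Len(block) 'creates ofile$ if needed; an existing longer file is not truncated
  blocks~& = .5 + LOF(src%) / Len(block)
  For blck~& = 1 To blocks~&
    Get src%, blck~&, block 'read one block...
    Put dst%, blck~&, block '...and write it straight back out
  Next blck~&
  Close src%: Close dst%
  copyBigFile~&& = blocks~&
End Function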
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience



