Most efficient way to build a big variable length string?
#1
I need to build a huge variable-length string (>1 GB) in memory before calling a function or writing to a file.
I already found out that I can build it up most efficiently in a _MEM block.
But what is the best way to move it into a variable-length string at the end without copying it and doubling the memory used?
So basically I am looking for a more efficient way to do:
Code: (Select All)
sql$ = Space$(mpos): _MemGet m, m.OFFSET, sql$
for very big mpos?
Is that even possible?

Related to this, what is the most efficient way to work with a big string from which you have to remove a lot of pieces, without doing something like:
Code: (Select All)
sql$ = Left$(sql$, mpos - 1) + Mid$(sql$, mpos + 1)
With very big strings this is also expensive.
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
#2
Why do you need to do it this way?

A more efficient way to do this would be an array of shorter strings. I think you could keep the same information there just as easily. Then, if you need to remove or alter parts in the middle, you don't have to move the entire block.

If it needs to go back into a file in one piece, that's pretty easily done with an array when you write it out to the file.
#3
For the first, you can always use a fixed-length string large enough to hold mpos bytes.

DIM sql AS STRING * 1500000000 '1.5 GB string
DIM m AS _MEM: m = _MEM(sql)

mpos tracks the size of the string, so when you finally need to print to disk or whatever, just:

PUT #1, , LEFT$(sql, mpos)

'similar to the below:

Code: (Select All)
Dim sql As String * 10
Dim m As _MEM: m = _Mem(sql)
For i = 0 To 9
    _MemPut m, m.OFFSET + i, 65 + i As _BYTE
Next

msize = 5
Print Left$(sql, msize)

Just make certain sql is large enough to hold the data to begin with, rather than variable length in size.  (And set a max limit to how much data you're going to process at one time.  For example, what would you do with a 100 GB database?  You can't load that thing all at once into your memory, so you've got to work with it 1 GB at a time max and process it in chunks.)



As for the second, instead of this:
sql$ = Left$(sql$, mpos - 1) + Mid$(sql$, mpos + 1)

Simply do:
MID$(sql$, mpos) = MID$(sql$, mpos+1)

Now note that this won't reduce the size of your string.  You'd have to track how many bytes you shaved off with such a method and then take the needed part later, but this is much more efficient than what you're doing above for processing it.  You're not building a new string, adding strings, freeing old strings...  All you're doing is writing to an existing string.  It'd just be up to you to manually keep track of the true size of that string and then resize it (or only use the valid part of it) once you're finished stripping out parts of it.

Code: (Select All)
Dim sql As String * 10
Dim m As _MEM: m = _Mem(sql)
For i = 0 To 9
    _MemPut m, m.OFFSET + i, 65 + i As _BYTE
Next

msize = 5
Print Left$(sql, msize)
Mid$(sql, 3) = Mid$(sql, 5)
Print sql

Print Left$(sql, Len(sql) - 2) ' I removed two bytes above (5 - 3 = 2), so this should look like you'd expect now.
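
A minimal sketch of the same idea when several pieces have to come out, assuming the buffer is a fixed-length string and a separate counter holds the number of valid bytes. The names buf, used, p and tailLen are made up for illustration and are not from the posts above. Only the valid tail is shifted each time, so the unused end of the buffer is never copied.

```basic
' Sketch only: buf, used, p and tailLen are illustrative names.
Dim buf As String * 10
Dim used As Long, p As Long, tailLen As Long

buf = "ABCDEFGHIJ"
used = Len(buf) ' number of valid bytes in buf

' Remove the character at position 3, then the one that ends up at position 5.
p = 3: GoSub RemoveOne
p = 5: GoSub RemoveOne

Print Left$(buf, used) ' only the first "used" bytes are meaningful
End

RemoveOne:
tailLen = used - p ' valid bytes that follow position p
If tailLen > 0 Then Mid$(buf, p, tailLen) = Mid$(buf, p + 1, tailLen)
used = used - 1
Return
```

Each removal still shifts everything after p, so this is a bookkeeping pattern rather than a cure for many removals near the start of a huge buffer.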
#4
Thanks for the suggestions!


I like the idea of an array of shorter strings; much easier and quicker to work with.
But in the end I'd still need to put them all together in one big string, which is very slow with over 10 million pieces.

I did experiment with _MEM(sql$), but PUT #1, , LEFT$(sql, mpos) still makes a copy in memory.

I didn't think of MID$(sql$, mpos) = MID$(sql$, mpos+1).
Very nice! It at least saves copying half the string!

Converting/merging/processing big files/databases is a bit of my thing with qb64pe. I've already created lots of very fast functions to handle big data.
So if there are no better options for the above challenges, I can live with that, but sometimes it's the small things that make a big difference :)

Thanks again for thinking with me
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
#5
(Today, 09:37 AM)mdijkens Wrote: I need to build a huge variable-length string (>1 GB) in memory before calling a function or writing to a file.
I already found out that I can build it up most efficiently in a _MEM block.
But what is the best way to move it into a variable-length string at the end without copying it and doubling the memory used?
So basically I am looking for a more efficient way to do:
Code: (Select All)
sql$ = Space$(mpos): _MemGet m, m.OFFSET, sql$
for very big mpos?
Is that even possible?

Related to this, what is the most efficient way to work with a big string from which you have to remove a lot of pieces, without doing something like:
Code: (Select All)
sql$ = Left$(sql$, mpos - 1) + Mid$(sql$, mpos + 1)
With very big strings this is also expensive
Just to save you some time and frustration: you cannot create a 1 GB variable-length string. Beyond the general problems (strings get copied a decent amount), there are bugs in the handling of extremely large strings like this that will crash your program. A 1 GB fixed-length string might work because it's allocated differently from a variable-length one, but I wouldn't bother, as you're likely to run into similar issues with it. _MEM is the only way to deal with a piece of data that large.

Like Steve mentioned, you're much better off just splitting it up into smaller parts. Unfortunately we don't have a `_MemPut` that will let you write the _MEM directly to a file, so a copy is going to be necessary.
#6
(9 hours ago)mdijkens Wrote: I did experiment with the _MEM(sql$) but the PUT #1, , LEFT$(sql, mpos) still makes a copy in memory.

The only way to write it to a file without that copy would be to write it in fixed-size chunks:

FOR i = 0 TO mpos - 1000000 STEP 1000000 'write 1 MB to the drive at a time
   PUT #1, , _MEMGET(m, m.OFFSET + i, STRING * 1000000)
NEXT

'After 999 writes you've written 999,000,000 bytes.
'If your file was 999,999,999 bytes in size, just make a temp string with the 999,999 leftover bytes and write it last. There won't be much overhead in creating a temp string of that size in your program.
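
A fuller sketch of the chunked write, including the leftover bytes, might look like the following. It assumes the data sits in a _MEM block and that mpos fits in a LONG (under 2 GB). The file name, the small stand-in block size, and the names chunkBuf, tail and rest are placeholders, not taken from the posts above. It reads each chunk into a reusable fixed-length string first, which sidesteps the question of whether PUT accepts an expression.

```basic
' Sketch only: sizes and the file name are placeholders for illustration.
Dim m As _MEM
Dim chunkBuf As String * 1000000 ' reusable 1 MB buffer
Dim tail As String
Dim mpos As Long, i As Long, rest As Long

m = _MemNew(2500000) ' stand-in for the real, much larger block
mpos = 2345678 ' number of valid bytes in the block

Open "out.bin" For Binary As #1

' Write all full 1 MB chunks.
i = 0
Do While i + 1000000 <= mpos
    _MemGet m, m.OFFSET + i, chunkBuf
    Put #1, , chunkBuf
    i = i + 1000000
Loop

' Write whatever is left (always less than 1 MB).
rest = mpos - i
If rest > 0 Then
    tail = Space$(rest)
    _MemGet m, m.OFFSET + i, tail
    Put #1, , tail
End If

Close #1
_MemFree m
```

The extra memory in use at any moment is the 1 MB buffer plus the short tail string, regardless of how large the block is.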
#7
(10 hours ago)ahenry3068 Wrote: Why do you need to do it this way?

A more efficient way to do this would be an array of shorter strings. I think you could keep the same information there just as easily. Then, if you need to remove or alter parts in the middle, you don't have to move the entire block.

If it needs to go back into a file in one piece, that's pretty easily done with an array when you write it out to the file.

for @ahenry3068
You can't use arrays in UDTs in QB64; that's a BIG obstacle when trying to structure your program.

You can't return arrays from Functions in QB64 either, but a String is a piece of cake for both UDTs and Function returns.

I am giving this Topic 5 stars!
b = b + ...
#8
Yes, for file writing you can do 1 MB chunks, but for calling a function that expects a string, options are limited.

Passing part of mem +0x0 cast to string would be nice.
45y and 2M lines of MBASIC>BASICA>QBASIC>QBX>QB64 experience
#9
for @ahenry3068

That's another problem. UDTs (User Defined Types), which originally started way back in old GW Basic, could always hold fixed-length strings when it came to saving to or retrieving from files, but variable-length strings in UDTs are new to QB64pe (not QB64? I think...). Anyway, variable-length strings still cannot be used directly for file data.
b = b + ...
#10
(6 hours ago)bplus Wrote:
(10 hours ago)ahenry3068 Wrote: Why do you need to do it this way?

A more efficient way to do this would be an array of shorter strings. I think you could keep the same information there just as easily. Then, if you need to remove or alter parts in the middle, you don't have to move the entire block.

If it needs to go back into a file in one piece, that's pretty easily done with an array when you write it out to the file.

for @ahenry3068
You can't use arrays in UDTs in QB64; that's a BIG obstacle when trying to structure your program.

You can't return arrays from Functions in QB64 either, but a String is a piece of cake for both UDTs and Function returns.

I am giving this Topic 5 stars!
If I were doing this, I wouldn't put the array in a UDT. I would probably use an array of 256-character strings and write a set of subs/funcs that treat that array as a single virtual string. Then, when writing to a file, I would just iterate through the array and write out each piece.
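
A rough sketch of the write-out step, assuming the pieces are kept as variable-length strings in an array (rather than 256-character fixed ones) and that a removed piece is simply set to an empty string. The array name pieces, the sample contents, and the file name are hypothetical.

```basic
' Sketch only: pieces(), its contents and the file name are illustrative.
Dim i As Long
ReDim pieces(1 To 3) As String

pieces(1) = "INSERT INTO t VALUES (1);"
pieces(2) = "" ' a removed piece: nothing else has to move
pieces(3) = "INSERT INTO t VALUES (3);"

Open "out.sql" For Binary As #1
For i = 1 To UBound(pieces)
    ' In BINARY mode PUT writes just the bytes of the string, back to back.
    If Len(pieces(i)) > 0 Then Put #1, , pieces(i)
Next
Close #1
```

The file ends up as one contiguous run of bytes without the big string ever existing in memory. With millions of small pieces, grouping several into a larger buffer before each PUT may be worth trying.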