QB64 Phoenix Edition
Extended KotD #19: _UCHARPOS


Extended KotD #19: _UCHARPOS - SMcNeill - 06-26-2024

Okies guys, this is one of those keywords that I've kinda been dreading having to cover.

Why??

Because the *concept* behind this one and what it does for us is rather complex to describe fully.  I have a feeling this is going to be a long arse post, so I'll probably end up wrapping it into spoiler tags so folks can navigate it a little easier, but grab yourself a soda and a snack before delving into this topic.

INTRO

UTF-8 vs UTF-16 vs UTF-32


CODE POINTS vs GLYPHS


And THAT gets us to the point where _UCharPos comes into existence for us!!

As I explained above, a string of Unicode data in UTF-8 may use 1, 2, 3, or 4 bytes to encode a codepoint.  And then it might take multiple codepoints and merge them together to make a final glyph/character, before printing it to the screen.

So a string of 20-bytes, in UTF-8 format, might be 20 characters.  Or 19 characters.  Or it might be only 2.   Heck, it might even be formatted wrong and not be any!!

UGHH!!!

So how the BLEEP would someone know how many characters a Unicode string has??  How would you underline the word "FOO" in bright red, if you can't even know how many characters are before it, or after it??

HOW THE FLIP DOES ANYONE DO ANYTHING MUCH AT ALL WITH UTF-8 FORMATTED CRAP???

UGGGHHHHHH!!!



Have no fear, _UCharPos is here!!

(to be continued in the next post below this one, as the forum has text limits and I don't want to write and write and write, just to have it cut me off or lose my work)


RE: Extended KotD #19: _UCharPos - SMcNeill - 06-26-2024

Now, with all the above covered as a sort of foreword, let's get into what _UCharPos actually does for us.

As I mentioned previously, a UTF-8 text string consisting of 25 bytes may have anywhere from 1 to 25 characters associated with it.   So how the heck do we know how many characters it *actually* has, and how many of them are we going to draw onto the screen?

https://qb64phoenix.com/qb64wiki/index.php/UCHARPOS

_UCharPos is the closest tool we have at the moment, and it's not 100% perfect.

WHY do I say that??

Because it only gives us the number of CODE POINTS that we generate with our string -- and you remember what I mentioned about them and glyphs being different?   ":" plus ")" = a smiley face.   Two (or more) code points can be added together to generate ONE glyph/character.

Off the top of my head I can think of several examples of this:
1) Emoji are probably the most obvious for most folks -- thus my mentioning them.
2) Several of the country flag symbols are composite and made from multiple code points.
3) A whole bunch of characters in non-Latin scripts are composed of multiple code points.

So let me emphasize for everyone here:   _UCharPos isn't guaranteed to give you the combined GLYPH/CHARACTER count. At the moment, it only gives you the CODE POINT count.



And with that pointed out, let me now step back and also point out:

We don't currently do ANY code point combining in QB64PE (as of version 3.13).  

That's right...   As of now, you can't type a colon and a parenthesis and generate a smiley face in QB64PE.   

What we have is a basic TRUE TYPE FONT RENDERING LIBRARY that we use.  It simply draws the characters that we specify for it.  It doesn't do the combining of multiple code points to create composite glyphs.

So why am I even bothering to mention the difference and point them out for folks??

*Because we may eventually expand our functionality and add additional font functionality to our source.*



At this moment, _UCharPos will give you the number of characters that you're printing to the screen -- but folks need to keep in mind that what it's REALLY giving you is the number of CODE POINTS that you're printing to the screen.   IF a library is added later that does do the combining to make emojis and such, then the number of code points may not match the number of characters you print to the screen.

At the moment however, it does, as we don't have any combination code point libraries working for us.



So for now, _UCHARPOS can tell you how many code points (characters) your string holds.

This is the basic return value for the command, and is used as simply as:

Code: (Select All)
codepoints = _UCharPos(utf8$, , 8)

Notice that 8 in the code above?  That's *required*, if you want a correct answer, so you can tell _UCharPos how your string is formatted.

0 = ASCII
8 = UTF-8
16 = UTF-16
32 = UTF-32

If you leave that value out, the return value you're going to get is always going to be the LENGTH of the text.   May as well just use LEN(text$) for the same value there, as ASCII encoded text is always 1 code point per byte.
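
To see the difference that 3rd parameter makes, here's a minimal sketch. The test string is built byte by byte with CHR$ so the source file's own encoding doesn't matter. The 32-bit graphics screen is an assumption on my part (so that a font is active when the function runs); if your build wants a TrueType font, load one with _LOADFONT and _FONT first.

```qb64
' A sketch, not an official example: compare LEN with _UCharPos results.
SCREEN _NEWIMAGE(640, 480, 32) ' assumption: use a graphics screen

DIM utf8 AS STRING
' "naive" with a diaeresis on the i: that letter is the two bytes C3 AF
utf8 = "na" + CHR$(195) + CHR$(175) + "ve"

PRINT "LEN:                    "; LEN(utf8)
PRINT "_UCharPos, no encoding: "; _UCHARPOS(utf8)
PRINT "_UCharPos, UTF-8:       "; _UCHARPOS(utf8, , 8)
```

Going by the description above, the first two lines should report the byte count (6) and only the last one should report the code point count (5).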

And with that basic return value out of the way (it returns the number of code points for us), and that 3rd parameter mentioned above (that's the 8 I was talking about in that little sample line of code), let's talk about the rest of the parameters:

codepoints& = _UCHARPOS(text$[, posArray&()][, utfEncoding&][, fontHandle&])

Now, the text$ is the text that you want to get this information back on.  I'd imagine it's going to be a Unicode string -- if not, why aren't you just doing things the same way you always have, with LEN and _PRINTWIDTH or _FONTWIDTH??

That second parameter is posArray&(), and it's a LONG array.  Let's take a moment to talk about this parameter and what it does for us:




PosArray()

Since _UCharPos can tell us how many code points our string has, what do you guys think PosArray does here?

I'll be nice and go ahead and tell you!

PosArray is a long array which we pass to the command, and it passes back the POSition of each character in that ARRAY.

Let me give you a simple make believe example:

ABCD😀EF

Now, if the above is in a monospaced font, it's simple enough to calculate these positions.   Let's say the monospace width is 10:
A starts at pixel 0 and goes to pixel 9
B starts at pixel 10 and goes to pixel 19
and so on, with each character being 10 pixels wide.  

But now, what if this wasn't a monospaced set of characters?

abcWii😀i

a , b, and c...  look close enough that each of them might be 10 pixels wide.
W ... this is obviously wider.  It's probably 15 pixels in width.
i  ...  these puny little i's are probably no more than 5 pixels in width.

And THAT's what the PosArray gives us --- It's an array of the starting points of those characters on the screen.

(And note, that these character positions have zilch to do with the byte-position of our text.   Remember, it can take 1 to a variable X-number bytes to make a character.)
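
If you want to see that byte-versus-character mismatch for yourself, here's a small hand-rolled sketch in plain BASIC (no _UCharPos involved). It assumes the string is well-formed UTF-8 and does no validation; it only relies on the fact that UTF-8 continuation bytes always look like 10xxxxxx in binary.

```qb64
' Print the byte offset at which each UTF-8 code point starts.
DIM utf8 AS STRING
DIM i AS LONG, n AS LONG

' "a", then e-acute (the two bytes C3 A9), then "b": 4 bytes, 3 code points
utf8 = "a" + CHR$(195) + CHR$(169) + "b"

FOR i = 1 TO LEN(utf8)
    ' Mask the top two bits; a result of 128 marks a continuation byte
    IF (ASC(utf8, i) AND 192) <> 128 THEN
        n = n + 1
        PRINT "code point"; n; "starts at byte"; i
    END IF
NEXT
PRINT "Bytes:"; LEN(utf8); "  Code points:"; n
```

The third code point starts at byte 4, not byte 3, which is exactly why you can't use byte offsets as character positions.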



Think of PosArray as basically running a for loop, and getting the _UPrintWidth for each character in your string of text.   Your text might be 20 bytes in length, only have 5 characters to it, and those characters might be 16 pixels, 8 pixels, 8 pixels, 11 pixels and 13 pixels in width.

That's basically what _UCharPos can return back to you:

codepoints = _UCharPos <-- the function gives you the number of codepoints (which is currently also the number of characters, but may change if we ever expand the library to do glyph merging)

PosArray&()   <-- this LONG array that you pass it gives you the starting x-position of each actual character/codepoint in the text.

0/8/16/32 <--- this 3rd parameter tells the function what encoding it's looking for with your string.

fonthandle&   <--- and the last parameter is simply a shortcut to tell it which font you're calculating these values using.
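
Putting those pieces together, a call might look something like the sketch below. A few things here are my assumptions rather than documented facts: the graphics screen, using the current font by leaving fontHandle& off, and the array sizing. Since a code point always needs at least one byte, LEN(utf8) + 1 slots is always enough room, whichever index the function starts filling from; check the wiki page for the exact requirement.

```qb64
' A sketch, not an official example: fetch the x-positions for a UTF-8 string.
SCREEN _NEWIMAGE(640, 480, 32) ' assumption: use a graphics screen

DIM utf8 AS STRING
DIM codepoints AS LONG, i AS LONG

' "abcW", then e-acute (bytes C3 A9), then "i"
utf8 = "abcW" + CHR$(195) + CHR$(169) + "i"

' Deliberately oversized LONG array (sizing is an assumption, see above)
REDIM positions(0 TO LEN(utf8)) AS LONG

codepoints = _UCHARPOS(utf8, positions(), 8)

PRINT "Bytes:       "; LEN(utf8)
PRINT "Code points: "; codepoints

' Dump every slot so you can see for yourself which ones get filled
FOR i = LBOUND(positions) TO UBOUND(positions)
    PRINT "positions("; i; ") ="; positions(i)
NEXT
```

Dumping the whole array instead of just the first few entries is on purpose: it lets you see where the values start and stop without trusting my guess about the indexing.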



Steve's honest thoughts on this command:

It's complicated.  It's messy.  BUT, Unicode and UTF-8 aren't the simplest things in the world to work with.  Codepoints.  Glyphs.  Characters.   Multiple encoding methods....  UGGHHH!!  It's enough to make your head spin!!

Unless you're just working with unicode specifically, and unless you NEED to know the number of code points and positional placement of each character in that string of text, I'd honestly just whistle merrily and forget this command even existed.

What it does for us is essential, and we need it for working with Unicode -- but only in extremely specific cases, such as if you're writing Unicode word wrapping routines.

For most folks, who don't tend to worry about such things, add this to your list of "Things I'm not worried about."

If you need it, it's here.  I just honestly don't think a lot of folks are going to need:

1) To know the number of Unicode codepoints they're printing to a line.
2) The x-position of each unicode codepoint that they print to that line.

And that's basically what this function gives to us.  Use it, or leave it, as your personal programming needs dictate.

(And my needs dictate that I now need a little more vodka for my coffee....)