
**SMcNeill** · 06-10-2024, 08:37 PM

Not quite. Wide-text is basically UNICODE.

https://learn.microsoft.com/en-us/window...th-strings

Now, unicode tends to work kinda odd. It tries to keep file/memory sizes as small as possible. Let me try and give a simplified version.

Let's say I have a dataset of 16 items, but the first 8 items appear 90% of the time. How can I pack my data to make it as small as possible, using base-10 numbers?

Sure, I could use 2-digit numbers for everything:

Item 1 = 01
Item 2 = 02
Item 3 = 03
...
Item 14 = 14
Item 15 = 15
Item 16 = 16

If I have X items in a list, then that list is going to be 2 * X characters long. Rather simple to understand. Right?

But it's not the SMALLEST we can make! Instead, let's say:
Items 1 to 8 are represented by the numbers 1 to 8.
Items 9 to 16 are represented by the number 9, followed by a 1 to 8.

Now, since 9 to 16 are rarely used (think the box characters in ASCII), we now can write the vast majority of our data with single digit values, and only occasionally need double digit values.
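To put numbers on that, here is a quick Python sketch of both schemes. The function names and the sample list are made up for illustration; the only thing taken from the post is the rule itself (1 to 8 as-is, 9 to 16 as a 9 followed by 1 to 8).

```python
def encode_fixed(items):
    """Two digits per item: 1 -> "01", 16 -> "16"."""
    return "".join(f"{item:02d}" for item in items)


def encode_packed(items):
    """Items 1-8 as one digit; items 9-16 as "9" plus a digit 1-8."""
    out = []
    for item in items:
        if 1 <= item <= 8:
            out.append(str(item))
        elif 9 <= item <= 16:
            out.append("9" + str(item - 8))
        else:
            raise ValueError(f"item out of range: {item}")
    return "".join(out)


# Hypothetical sample: mostly common items, plus one rare one (item 12).
sample = [1, 3, 8, 2, 12, 5, 1, 7, 4, 6]

fixed = encode_fixed(sample)
packed = encode_packed(sample)
print(fixed, len(fixed))    # 01030802120501070406 20
print(packed, len(packed))  # 13829451746 11
```

Ten items cost 20 digits the fixed way and 11 digits the packed way, because only the one rare item pays for a second digit.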

......

And that's basically UNICODE in a nutshell.

128 base characters for ANSI text.
128 extended characters which then point to extended character pages.

For most folks, 128 characters are all they need and use. (How often do you use anything higher than that, honestly?)

In this simplified picture, (65) is A. (129, 65) = accented A. (130, 65) = reverse-accent A. (Made-up values, just to show the pattern.)

(See the ANSI value in there, and that extended value before it??)

Now, the problem comes with Unicode *ALWAYS* expanding! First we had basically ANSI and the various ANSI code pages. Then someone added Japanese, Korean, Greek, Hebrew, Klingon, Wingdings.... and then they added EMOJI... and then...

We ran out of room, so how do we expand??

Extended-Extended Characters! And then Extend those more! And extend those more!

In my 1 to 16 example, we saw:

1 to 8 was represented by 1 to 8.
9 to 16 was represented by 9, followed by 1 to 8.

So for 17 to 24, following that same pattern, it'd be: 9, followed by 9, followed by 1 to 8.
And 25 to 32? 9, 9, 9, 1 to 8...
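The "keep stacking 9s" pattern can be sketched in Python like this. It is only the toy scheme from this post, not how any real Unicode encoding lays out its bytes, and `encode_item` / `decode_stream` are names invented for the example.

```python
def encode_item(item):
    """Item n -> zero or more "9" prefixes, then a digit 1-8."""
    if item < 1:
        raise ValueError("items start at 1")
    prefixes, last = divmod(item - 1, 8)
    return "9" * prefixes + str(last + 1)


def decode_stream(digits):
    """Read a digit stream back into item numbers."""
    items = []
    prefixes = 0
    for d in digits:
        if d == "9":
            prefixes += 1  # still in the middle of one item
        elif d in "12345678":
            items.append(prefixes * 8 + int(d))
            prefixes = 0
        else:
            raise ValueError(f"unexpected digit: {d!r}")
    return items


items = [3, 17, 8, 25]
stream = "".join(encode_item(n) for n in items)
print(stream)                 # 399189991
print(decode_stream(stream))  # [3, 17, 8, 25]
```

Four items took 9 digits here, and there is no way to find the third item by multiplying: you have to walk the stream from the start and count prefixes.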

And that's UNICODE!!

A single Unicode character might take 1 byte. Or 2. Or 8!!

So you can't honestly say "character * 2 for width".

"Steve is [Big Grin smiley]" may be 16 characters in Unicode:
Steve_is_ <--- that's 9 characters.
[Big Grin smiley] <-- this might be 6 unicode characters!
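If you want to see real numbers, Python can show the bytes directly. This sketch assumes UTF-8 as the encoding and picks U+1F600 (grinning face) to stand in for the smiley; both are my choices, and the exact counts change with a different encoding or a different emoji. `utf8_bytes` is a helper invented for the example.

```python
def utf8_bytes(text):
    """Return the UTF-8 byte values for text as a list of ints."""
    return list(text.encode("utf-8"))


samples = [
    ("plain A", "A"),
    ("A with grave accent", "\u00c0"),
    ("euro sign", "\u20ac"),
    ("grinning face emoji", "\U0001F600"),
]
for label, ch in samples:
    raw = utf8_bytes(ch)
    print(f"{label}: {raw} -> {len(raw)} byte(s)")

message = "Steve is \U0001F600"
print(len(message))              # 10 code points
print(len(utf8_bytes(message)))  # 13 bytes in UTF-8
```

The plain A is the single byte 65, the accented A is two bytes (195, 128), and the emoji is four, so the byte count of a string is not a fixed multiple of its character count.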

So that basic idea of "space + ASCII character" = WIDE character is only going to hold true for a small subset of characters. If folks change code pages or use extended characters, it won't hold true at all.

And with wide characters, 2 bytes may not hold the whole character, so you'll need to account for that as well.
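Here is a small Python sketch of that last point, assuming the wide text is UTF-16 (which is what Windows wide strings use). The `utf16_units` helper and the sample string are made up for the example.

```python
import struct


def utf16_units(text):
    """Split text into its 16-bit UTF-16 code units (little-endian)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


text = "Hi \U0001F600"  # three ASCII characters plus one emoji
units = utf16_units(text)

print(len(text))                    # 4 code points
print(len(units))                   # 5 two-byte units
print([f"{u:04X}" for u in units])  # ['0048', '0069', '0020', 'D83D', 'DE00']
```

The emoji does not fit in one 2-byte unit, so it is stored as a pair (D83D, DE00). Code that assumes "one wide character = 2 bytes" would cut that pair in half.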
