Need help capturing Unicode directory names
#21
(8 hours ago)doppler Wrote: Thanks Steve for coming up with a QB64PE-only solution.  And I realize it's the console window that stays open, allowing the code page to be found (or changed) by multiple SHELLs.  I will experiment with different Unicode pages in the console window, to see if they can be identified separately.  I suspect they can.

This should be referenced in the wiki, in the SHELL and $CONSOLE:ONLY pages, as footnotes.  I can't be the only one stumped by this.

Thanks again

My external solution works too, but not as elegantly as yours.  I hate having to rely on third-party programs.
If it can't be done with QB64PE, then keep bashing until it can.

With the CHCP 65001 code page, you shouldn't have to worry about "different unicode pages".  The whole point of Unicode is basically to make every character available to you at the same time.  With ASCII and ANSI you only have 128 or 256 characters available for use.  Almost all code pages share the same first 128 ASCII characters, and you swap out various code pages for the 129-256 character range that you're using.  Unicode doesn't have that limit, so it isn't something we have to fret over so much.  Just set that console/terminal to CHCP 65001 and you're good to go.  (As long as your font has the characters you're looking for in it -- not all Unicode fonts hold every possible character set.)
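
If you'd rather make that switch from code than by shelling to CHCP, here's a minimal C sketch of the same idea using the Windows console API (SetConsoleOutputCP is the API equivalent of chcp; the Czech sample string is just an illustration):

[c]
#include <stdio.h>
#include <windows.h>

int main(void) {
    /* Put the console into UTF-8 mode -- same effect as `chcp 65001`.
       CP_UTF8 is the Windows constant for code page 65001. */
    SetConsoleOutputCP(CP_UTF8);

    /* A UTF-8 string literal (assumes this source file is saved as UTF-8).
       It prints correctly as long as the console font has the glyphs. */
    printf("Haló světe\n");
    return 0;
}
[/c]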

I'm glad this works for you.

Note that you probably don't need a $CONSOLE:ONLY line to get it working.  $CONSOLE followed by a _CONSOLE ON command would probably work just as well, since you'd still have that same persistent console to make changes to.  The problem with using SHELL without any console is that the command runs, a console opens, the command finishes, and then the console closes -- so the changes aren't necessarily persistent for the next SHELL issued.

Use of $CONSOLE:ONLY, or $CONSOLE with _CONSOLE ON, should keep everything in the same console and let you make the changes you need to get back the information you're looking for with DIR.
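
To see why the persistent console matters, here's a minimal C sketch of that behavior (an illustration, not QB64PE code, under the assumption that the code page belongs to the console object shared by the parent process and its child shells):

[c]
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

int main(void) {
    /* The active code page is a property of the console itself,
       not of any one shell attached to it. */
    printf("Code page before: %u\n", GetConsoleOutputCP());

    system("chcp 65001 > nul"); /* first "SHELL": child cmd changes the console's code page */

    /* The change survives the child shell exiting, because this
       process is still holding the same console open. */
    printf("Code page after:  %u\n", GetConsoleOutputCP());

    system("dir");              /* second "SHELL": runs under code page 65001 */
    return 0;
}
[/c]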
#22
I read that every Unicode text should have a BOM at the beginning, which should serve as its identification -- and the code page should probably be set accordingly. I'm trying to find out more about it now.


#23
(3 hours ago)Petr Wrote: I read that every Unicode text should have a BOM at the beginning, which should serve as its identification -- and the code page should probably be set accordingly. I'm trying to find out more about it now.
Get the program Hexplorer here: https://sourceforge.net/projects/hexplor...t/download
It's from 2018 -- old, but it still works damn well.  Watch out: it can modify file contents.

To understand the Unicode ID bytes: https://en.wikipedia.org/wiki/Byte_order_mark
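
For a programmatic check, here's a minimal C sketch that sniffs those ID bytes at the start of a file (the byte sequences are the standard BOMs from the article above; the filename is just a placeholder):

[c]
#include <stdio.h>
#include <string.h>

/* Return a name for the BOM found at the start of buf, or NULL if none.
   Check the longer UTF-32 marks before UTF-16, since they share a prefix. */
static const char *bom_name(const unsigned char *buf, size_t n) {
    if (n >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32 LE";
    if (n >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32 BE";
    if (n >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (n >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16 LE";
    if (n >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16 BE";
    return NULL; /* no BOM -- note a UTF-8 file is allowed to omit it */
}

int main(void) {
    unsigned char buf[4];
    FILE *f = fopen("test.txt", "rb"); /* placeholder filename */
    if (!f) { perror("fopen"); return 1; }
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);

    const char *name = bom_name(buf, n);
    printf("%s\n", name ? name : "no BOM found");
    return 0;
}
[/c]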

Good luck
#24
So there is an API in Windows that can do some language conversion, but it works online through Microsoft: you have to write Java headers, send strings over the Microsoft network, then request them back and decode them... Then there are the Python libraries NLTK and spaCy. For C there are supposedly ICU, libiconv and/or chardet. So I turned to Gemini with the problem:
The basic idea is to try converting the text from each candidate encoding and see whether any errors occur during the conversion. If the conversion completes without errors, it is likely that we have found the correct encoding.

Example code:

[c]
#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void) {

    /* Input text whose encoding we want to guess.  The Czech sample
       "Haló světe" has two non-ASCII characters, which is what makes
       the test meaningful (pure ASCII would pass every candidate). */
    char inbuf[] = "Haló světe";
    char outbuf[1024];
    size_t inbytesleft, outbytesleft;
    iconv_t cd;

    /* Candidate source encodings to try.  Keep permissive encodings
       like ISO-8859-1 (which accepts any byte sequence) at the end. */
    char *codings[] = {"UTF-8", "Windows-1250", "ISO-8859-1"};
    int num_codings = sizeof(codings) / sizeof(codings[0]);

    for (int i = 0; i < num_codings; i++) {
        /* iconv_open(tocode, fromcode): read the input as codings[i]
           and try to convert it to UTF-8 */
        cd = iconv_open("UTF-8", codings[i]);
        if (cd == (iconv_t)-1) {
            perror("iconv_open");
            continue;
        }

        inbytesleft = strlen(inbuf);
        outbytesleft = sizeof(outbuf);
        char *in = inbuf;
        char *out = outbuf;

        if (iconv(cd, &in, &inbytesleft, &out, &outbytesleft) == (size_t)-1) {
            /* Conversion error: the text is probably not in this encoding */
            printf("Not %s\n", codings[i]);
        } else {
            /* Converted without errors: likely the right encoding */
            printf("Probable encoding: %s\n", codings[i]);
            iconv_close(cd);
            break; /* stop at the first encoding that works */
        }

        iconv_close(cd);
    }

    return 0;
}
[/c]

How the code works:

Define the input text: The variable inbuf stores the text whose encoding we want to determine.
Encoding list: The codings array lists the candidate encodings to try. You can add more as needed.
Loop through encodings: For each candidate, we create a converter with iconv_open that reads the input as that encoding and writes UTF-8.
Conversion: We attempt the conversion with iconv. If an error occurs, the text is probably not in that encoding.
Result: The first conversion that succeeds without errors is printed as the probable encoding, and the loop stops.

Important notes:

Encoding list: Customize the list to your needs, but keep permissive encodings last; ISO-8859-1 accepts every byte sequence, so it would otherwise always "win".
Errors: This code only provides an estimate of the encoding. It is not guaranteed to detect the correct encoding in all cases.
More complex scenarios: For harder cases (e.g. detecting multiple languages in one file), you need more sophisticated algorithms and heuristics.
libiconv library: Make sure the iconv functions are available and linked into your project (they are built into glibc; with a standalone libiconv you typically add -liconv).
Other options:

chardet: This library specializes in encoding detection and can provide more accurate results.
ICU: The ICU library offers even more features for working with Unicode and can be useful for more complex tasks.
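
For the curious, here's a minimal sketch of that ICU route using ICU4C's charset detector (the ucsdet API), assuming ICU is installed and linked (e.g. -licui18n -licuuc); the sample bytes are just an illustration, and real detection wants much more text than this:

[c]
#include <stdio.h>
#include <string.h>
#include <unicode/ucsdet.h>

int main(void) {
    UErrorCode status = U_ZERO_ERROR;

    /* "Haló světe" encoded as Windows-1250 bytes (ó = 0xF3, ě = 0xEC) */
    const char *text = "Hal\xF3 sv\xECte";

    UCharsetDetector *det = ucsdet_open(&status);
    ucsdet_setText(det, text, (int32_t)strlen(text), &status);

    /* Ask ICU for its best guess; detection on a sample this short is
       unreliable, so feed it as much text as you can in practice. */
    const UCharsetMatch *match = ucsdet_detect(det, &status);
    if (U_SUCCESS(status) && match != NULL) {
        printf("Detected: %s (confidence %d)\n",
               ucsdet_getName(match, &status),
               ucsdet_getConfidence(match, &status));
    } else {
        printf("No match found\n");
    }

    ucsdet_close(det);
    return 0;
}
[/c]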





