Need help capturing unicoded directory names
#24
So there is an API in Windows that can do some language conversion, but it works online through Microsoft: you need to write Java headers, send the strings over the Microsoft network, then request them back and decode them... Then there are the Python libraries NLTK and spaCy. Well, for C there is supposedly ICU, libiconv, and/or chardet. So I turned to Gemini with the problem:
The basic idea is to run the text through a converter with each candidate encoding and check whether the conversion reports an error. If a conversion completes without errors, that encoding is a plausible candidate, though not necessarily the correct one.

Example code:

[c]

#include <stdio.h>
#include <string.h>   // strlen
#include <iconv.h>

int main(void) {

    // Input bytes whose encoding we want to guess.
    // The Czech version, "Haló světe", would contain two non-ASCII characters.
    const char *inbuf = "Hello world!";
    char outbuf[1024];
    size_t inbytesleft, outbytesleft;
    iconv_t cd;

    // Candidate encodings, strictest first.
    // ISO-8859-1 accepts every byte value, so it has to come last.
    const char *codings[] = {"UTF-8", "Windows-1250", "ISO-8859-1"};
    int num_codings = sizeof(codings) / sizeof(codings[0]);

    for (int i = 0; i < num_codings; i++) {
        // iconv_open(tocode, fromcode): decode FROM the candidate TO UTF-8
        cd = iconv_open("UTF-8", codings[i]);
        if (cd == (iconv_t)-1) {
            perror("iconv_open");
            continue;
        }

        inbytesleft = strlen(inbuf);
        outbytesleft = sizeof(outbuf);
        char *in = (char *)inbuf;   // iconv reads the input but does not modify it
        char *out = outbuf;

        size_t rc = iconv(cd, &in, &inbytesleft, &out, &outbytesleft);
        iconv_close(cd);            // release the converter before any break

        if (rc == (size_t)-1) {
            // The bytes are not valid in this encoding (or outbuf was too small)
            printf("Not %s\n", codings[i]);
        } else {
            printf("Probable encoding: %s\n", codings[i]);
            break; // The conversion succeeded, so we stop here
        }
    }

    return 0;
}
[/c]

How the code works:

Define the input text: The variable inbuf stores the bytes whose encoding we want to find out.
Encoding list: The codings array lists the candidate encodings, strictest first. You can add more encodings here as needed, but keep ISO-8859-1 last because it accepts any byte sequence.
Loop through encodings: For each candidate, we create a converter with iconv_open. Its first argument is the target encoding and its second is the source, so the candidate goes second.
Conversion: We try to decode the input from the candidate encoding into UTF-8 using iconv. If an error occurs, the bytes are not valid in that encoding and we move on to the next one.
Result: If the conversion succeeds without errors, we print the encoding name as probable and stop.

Important notes:

Encoding list: Customize the list of candidate encodings to match the data you expect.
Errors: This code only provides an estimate. A successful conversion shows that the bytes are valid in that encoding, not that the text was actually written in it, so the correct encoding is not guaranteed to be detected.
More complex scenarios: For more complex scenarios (e.g. detecting multiple languages in one file), you can use more sophisticated algorithms and heuristics.
libiconv library: Make sure an iconv implementation is available. With glibc on Linux, iconv is part of the C library; elsewhere you may need to install libiconv and link your project with -liconv.

Other options:

chardet: This library specializes in encoding detection and can give more accurate results than trial conversion.
ICU: The ICU library offers many more features for working with Unicode, including charset detection, and can be useful for more complex tasks.




Messages In This Thread
RE: Need help capturing unicoded directory names - by Petr - 6 hours ago


