Posts: 277
Threads: 46
Joined: May 2022
Reputation:
28
I read that every Unicode text should have a BOM (byte order mark) at the beginning, which should identify its encoding, and that the code page should probably be set accordingly. I'm trying to find out more about it now.
Posts: 277
Threads: 46
Joined: May 2022
Reputation:
28
So there is an API in Windows that can do some language conversion. But it happens online with Microsoft and you need to write Java headers and send strings over the Microsoft network and then request them back and decode them... Then there are Python libraries such as NLTK or spaCy. And for C there is supposedly ICU, libiconv, and/or chardet. So I turned to Gemini with the problem:
The basic idea is to try converting the text to different encodings and see if there are any errors during the conversion. If there are no errors during the conversion, it is likely that we have found the correct encoding.
Example code:
[c]
#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void) {
    // Your input text (in Czech: "Haló světe", which has two non-ASCII characters)
    char inbuf[] = "Hello world!";
    char outbuf[1024];
    size_t inbytesleft, outbytesleft;
    iconv_t cd;
    // List of encodings you want to try, strictest first:
    // ISO-8859-1 accepts every byte sequence, so it goes last
    const char *codings[] = {"UTF-8", "Windows-1250", "ISO-8859-1"};
    int num_codings = sizeof(codings) / sizeof(codings[0]);
    for (int i = 0; i < num_codings; i++) {
        // Try decoding the input AS the candidate encoding
        // (iconv_open takes the target first, then the source)
        cd = iconv_open("UTF-8", codings[i]);
        if (cd == (iconv_t)-1) {
            perror("iconv_open");
            continue;
        }
        inbytesleft = strlen(inbuf);
        outbytesleft = sizeof(outbuf);
        char *in = inbuf;
        char *out = outbuf;
        size_t rc = iconv(cd, &in, &inbytesleft, &out, &outbytesleft);
        iconv_close(cd); // Close the converter on every path, including before break
        if (rc == (size_t)-1) {
            // An error occurred during conversion, so the input is not valid in this encoding
            printf("Not valid as: %s\n", codings[i]);
        } else {
            printf("Probable encoding: %s\n", codings[i]);
            break; // If the conversion was successful, we end the loop
        }
    }
    return 0;
}
[/c]
How the code works:
Define the input text: The variable inbuf stores the text whose encoding we want to find out.
Encoding list: The codings array lists the candidate encodings, strictest first. You can add more encodings here as needed.
Loop through encodings: For each candidate, we create a converter from that encoding to UTF-8 using iconv_open.
Conversion: We try to decode the input text as the candidate encoding using iconv. If an error occurs, the input is not valid in that encoding.
Result: If the conversion succeeds without errors, we print the encoding name as probable and stop.
Important notes:
Encoding list: You can customize the list of encodings you want to try according to your needs.
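Errors: This code only provides an estimate of the encoding. It is not guaranteed to detect the correct encoding in all cases: single-byte encodings such as ISO-8859-1 assign a character to every byte value, so almost any input converts without error.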
More complex scenarios: For more complex scenarios (e.g. detecting multiple languages in one file), you can use more sophisticated algorithms and heuristics.
libiconv library: Make sure you have libiconv library installed and linked with your project.
Other options:
chardet: This library is specialized in encoding detection and can provide more accurate results.
ICU: The ICU library offers even more features for working with Unicode and can be useful for more complex tasks.
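One check that needs no library at all is strict UTF-8 validation: text in a legacy single-byte encoding that contains accented characters almost never happens to be well-formed UTF-8, so a pass is strong evidence. The following is only a minimal sketch of that idea; is_valid_utf8 is a hypothetical helper name, not a function from iconv, chardet, or ICU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if buf[0..len) is well-formed UTF-8.
   Hypothetical helper for illustration only. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;       /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80) { i++; continue; }                      /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < extra) return false;                /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;             /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                           /* overlong form */
        if (cp > 0x10FFFF) return false;                      /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;       /* UTF-16 surrogate */
        i += extra + 1;
    }
    return true;
}
```

For the Czech test string, the UTF-8 bytes of "Haló světe" pass, while the same text stored as Windows-1250 (ó = 0xF3, ě = 0xEC) fails. Pure ASCII passes too, which is fine because ASCII is valid in all three candidate encodings.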
Posts: 2,733
Threads: 331
Joined: Apr 2022
Reputation:
227
If I have to sort out what type of encoding you used for a mystery text file, I simply don't need to read that mystery file. If it's not ASCII page 437 (which is pretty much the default), or UTF-8, then you need to tell me what format it's in, or else I simply won't worry over it at all. IF it's something that seems like it might *kill* me not to know its contents, then I might try *once* to load it in Word or Office and see if it can detect the information. If not, then I'll just die.
Nobody has time to be running around and guessing at encoding types. If the other guy can't *tell* you what the encoding is and what the endianness is, then he doesn't have anything worth saying anyway. It's just not worth the hassle.
Posts: 759
Threads: 35
Joined: Apr 2022
Reputation:
51
(12-18-2024, 11:34 PM)SMcNeill Wrote: If I have to sort out what type of encoding you used for a mystery text file, I simply don't need to read that mystery file. If it's not ASCII page 437 (which is pretty much the default), or UTF-8, then you need to tell me what format it's in, or else I simply won't worry over it at all. IF it's something that seems like it might *kill* me not to know its contents, then I might try *once* to load it in Word or Office and see if it can detect the information. If not, then I'll just die.
Nobody has time to be running around and guessing at encoding types. If the other guy can't *tell* you what the encoding is and what the endianness is, then he doesn't have anything worth saying anyway. It's just not worth the hassle.
Seems a bit of a foolish take. A 5 second search on Google provides this:
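Code:
#include <fstream>
#include <iostream>

int main() {
    std::ifstream file("your_file.txt", std::ios::binary);
    if (!file.is_open()) {
        std::cerr << "Error opening file!" << std::endl;
        return 1;
    }
    // unsigned char: with a signed plain char, bom[0] == 0xEF could never be true
    unsigned char bom[3] = {0, 0, 0};
    file.read(reinterpret_cast<char*>(bom), 3);
    // Number of bytes actually read (the file may be shorter than 3 bytes)
    const std::streamsize n = file.gcount();
    if (n >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) {
        std::cout << "Encoding: UTF-8" << std::endl;
    } else if (n >= 2 && bom[0] == 0xFF && bom[1] == 0xFE) {
        std::cout << "Encoding: UTF-16 Little Endian" << std::endl;
    } else if (n >= 2 && bom[0] == 0xFE && bom[1] == 0xFF) {
        std::cout << "Encoding: UTF-16 Big Endian" << std::endl;
    } else {
        // No BOM found; the file may still be UTF-8 or UTF-16 without one
        std::cout << "Encoding: Unknown" << std::endl;
    }
    return 0;
}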
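The same BOM check can be written in plain C against a buffer instead of a file, and extended to the UTF-32 marks. This is only a sketch: the enum bom_kind and the function detect_bom are names made up for illustration. The UTF-32 patterns are tested first because the UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for illustration. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE };

/* Reports which byte order mark, if any, starts buf[0..len). */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    /* 4-byte marks first, so FF FE 00 00 is not reported as UTF-16 LE */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
        return BOM_UTF32LE;
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
        return BOM_UTF32BE;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16BE;
    return BOM_NONE;  /* no mark: says nothing about the encoding either way */
}
```

Two caveats: FF FE 00 00 could in principle also be a UTF-16 LE file whose first character is U+0000, and BOM_NONE is the normal result for most UTF-8 files, since a BOM is optional there.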
A 15 second search on Google reveals this:
AutoItConsulting/text-encoding-detect: C# and C++ UTF8/UTF16 encoding detection library.
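For UTF-16 files without a BOM, a common heuristic is to count zero bytes at even and odd offsets: mostly-Latin text has its zero bytes in the high half of each 16-bit unit. The sketch below shows that general idea only and makes no claim about how the library above works internally; guess_utf16_endianness and its thresholds are assumptions picked for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for illustration. */
enum utf16_guess { UTF16_UNKNOWN, UTF16_LE, UTF16_BE };

/* Guesses UTF-16 byte order from where the zero bytes fall. */
enum utf16_guess guess_utf16_endianness(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t zeros_even = 0;  /* zero bytes at offsets 0, 2, 4, ... */
    size_t zeros_odd = 0;   /* zero bytes at offsets 1, 3, 5, ... */
    size_t pairs = len / 2; /* number of complete 16-bit units */

    for (size_t i = 0; i + 1 < len; i += 2) {
        if (buf[i] == 0) zeros_even++;
        if (buf[i + 1] == 0) zeros_odd++;
    }
    if (pairs == 0) return UTF16_UNKNOWN;

    /* Assumed thresholds: at least half of one side is zero,
       and at most 10% of the other side is zero. */
    if (zeros_odd * 2 >= pairs && zeros_even * 10 <= pairs) return UTF16_LE;
    if (zeros_even * 2 >= pairs && zeros_odd * 10 <= pairs) return UTF16_BE;
    return UTF16_UNKNOWN;
}
```

This only works for text dominated by characters below U+0100; for CJK or other scripts the zero-byte pattern disappears and the function returns UTF16_UNKNOWN.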
The noticing will continue