Weblog entry #46 for Utumno
(locale: us_US.UTF-8; system is fully UTF-8)
I am a Pole living in Taiwan, so I frequently need to view Chinese characters or data with Polish diacritic marks (I don't need to input them, just view them). This mostly works in a graphical environment but fails miserably in the console. Also, I cannot seem to correctly serve files encoded in ISO-8859-2 with Apache (Polish-specific letters are garbled). More specifically, when I set the charset in Apache to ISO-8859-2, file contents are displayed correctly, but file NAMES are not (take a look for yourself: www.koltunski.pl/test). I suspect this and the garbled Polish data in the console are really one and the same problem.
[RANT]
Shouldn't all this just work transparently?? Isn't a fully UTF-8 system all about being able to view whatever you want, whenever you want?? Why is it so complicated?? We have to have kernel support for various codepages, we probably have to mount filesystems with the correct 'charset' and 'codepage' options, we have to set all those "LANGs", "LC_ALLs" and whatnot, we have to install appropriate fonts and God knows what else...
[/RANT]
Comments on this Entry
> This mostly works in a graphical environment but fails miserably in the console.
When you say console, do you actually mean the console at the computer itself, or are you using something to access the console remotely?
Maybe you mean Putty, in which case set the character set in Putty to utf8.
> I cannot seem to correctly serve files encoded in ISO-8859-2 with Apache (Polish-specific letters are garbled). More specifically, when I set the charset in Apache to ISO-8859-2, file contents are displayed correctly, but file NAMES are not (take a look for yourself: www.koltunski.pl/test).
You can force a particular character set for a given file extension, but remember that your filesystem stores filenames as utf-8.
Why not convert the files to utf-8 with iconv? It's 2008.
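For anyone who would rather script the conversion than call iconv by hand, here is a minimal Python sketch of the same idea. The function name `convert_to_utf8` and the file names are made up for illustration, and it assumes the source files really are ISO-8859-2.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "iso-8859-2") -> None:
    """Re-encode a text file as UTF-8 (roughly `iconv -f ISO-8859-2 -t UTF-8`)."""
    # Decode with the assumed legacy encoding, then write the same text as UTF-8.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


# Small demonstration in a throwaway directory (file names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    converted = Path(tmp) / "converted.txt"
    legacy.write_bytes("zażółć gęślą jaźń".encode("iso-8859-2"))
    convert_to_utf8(legacy, converted)
    print(converted.read_text(encoding="utf-8"))
```

If the source encoding is guessed wrongly, the decode step either fails or silently produces the wrong letters, so it is worth checking a few converted files by eye.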
[ Send Message | View Utumno's Scratchpad | View Weblogs ]
leszek@utumno:~/encoding-tests$ ls
Byty[garbled characters]
leszek@utumno:~/encoding-tests$ echo `ls` | iconv -f ISO-8859-2 -t UTF-8
Byty[garbled characters]
( should be 'Byty[Polish characters, lost on posting]', as you can see at http://www.koltunski.pl/test/ISO-8859-2 )
Edit: sweet! The Polish characters do not work here, either! (The first and second occurrences should be garbled, as they are in my console, although they are garbled differently here than in my console. The third, however, is a correct UTF-8 string, copied and pasted; if all this were so easy, it would get displayed here correctly!!)
[ Send Message | View Utumno's Scratchpad | View Weblogs ]
Heh, let's try to copy/paste some Chinese characters:
[Chinese text, garbled on posting]
Polish characters:
[Polish text 'oplatają się', garbled on posting]
[ Send Message | View Utumno's Scratchpad | View Weblogs ]
Lol
And I am not just talking about Debian or Linux; character sets are broken every freaking where, including Windows, Mac, the Internet, ISO standards, everywhere...
And it's been broken ever since some moron decided to use a 7-bit charset back in the 60s (?). Ever since then, one could only keep building an increasingly complicated mess on top of it.
[ Send Message | View Utumno's Scratchpad | View Weblogs ]
Wait, there's more. Go to Firefox -> Menu -> View -> Character Encoding and set your encoding to something more funky, like the Chinese Big5. Then watch the hilarious mess this site has turned into.
That's just pathetic. I feel like grabbing the 'inventor' of ASCII and slapping some sense into him.
[ Parent | Reply to this comment ]
You're doing something wrong.
1. Your filenames on disk are stored in utf-8.
2. Your files are stored either in a Polish encoding (ISO-8859-2) or in utf-8.
To view the contents of a file properly, your editor and console/terminal emulator must both be using the correct character set. If one is using utf-8 and the other ISO-8859-2 (as seems to be the case for you), then you're doing something wrong.
Can you:
1. Tell us which editor you are using
2. Tell us which console/terminal emulator you are using.
Try first with vim, and gnome-terminal/konsole/Putty.
Make sure you set the character set to utf-8.
Can you see the utf-8 files correctly? If not, how do you know they are utf-8?
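To make the mismatch concrete, here is a small Python sketch of what happens when the bytes are in one encoding and the viewer assumes the other, plus a rough check for "is this file utf-8 at all". The sample text and the helper name `is_valid_utf8` are illustrative assumptions, not anything taken from your setup.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "zażółć gęślą jaźń"  # sample Polish text, chosen only for illustration

# Case 1: the file is UTF-8, but the viewer decodes it as ISO-8859-2.
utf8_seen_as_latin2 = text.encode("utf-8").decode("iso-8859-2")

# Case 2: the file is ISO-8859-2, but the viewer decodes it as UTF-8.
latin2_seen_as_utf8 = text.encode("iso-8859-2").decode("utf-8", errors="replace")

print(utf8_seen_as_latin2)   # garbled: each Polish letter turns into two unrelated characters
print(latin2_seen_as_utf8)   # garbled: Polish letters are replaced by U+FFFD marks
print(is_valid_utf8(text.encode("utf-8")))       # True
print(is_valid_utf8(text.encode("iso-8859-2")))  # False
```

Note that the check is only a heuristic: pure ASCII text is valid in both encodings, so it cannot tell them apart when no accented letters are present.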
[ Send Message | View Utumno's Scratchpad | View Weblogs ]
I know how to view the files, and I could convert the Chinese characters to UTF-8 before pasting them here, which, I really do hope, would have worked.
That, however, is not the point, and my entries here are not intended to be questions; they are intended to be a rant against the current state of i18n in the IT world.
I could delve into the technical details of why it is HORRIBLY broken everywhere, including Windows, Macs and the Internet, but instead I am simply going to say this:
I18N in the IT world sucks. BIGTIME.
It is a hint for you not to try to convince me that it doesn't. There is no law of nature that prevents copying Russian characters from wherever you happen to have them and pasting them to a Chinese forum online from simply working.
[ Send Message | View Utumno's Scratchpad | View Weblogs ]
Well, I came here to rant :)
[ Parent | Reply to this comment ]
Convert all iso2 to utf8 and use only the latter. That's the way I did it - I'm using en_GB.UTF-8 as the system locale and read/write both Polish and Hebrew in UTF-8.
My advice: forget ISO in a multi-language environment; UTF-8 is not ideal, but for some languages it is good enough.
Regards,
rjc
[ Send Message | View Utumno's Scratchpad | View Weblogs ]
Forget ISO? Heh. When I go to a Russian site, copy some Russian characters from it, then go to a Chinese forum and paste them there, I expect them to be shown correctly, but they come out garbled. Whose fault is this? Did I just fail to 'forget ISO'?
No, it's the fault of whatever moron assigned 1 byte to the char (was it Ritchie or his buddy Kernighan?) and another one who decided to use only 7 bits of it for the charset.
7 bits ought to be enough for everybody, no?
[ Parent | Reply to this comment ]
K&R's famous contributions came a LONG time after ASCII appeared (about 1963).
7 bit ASCII characters won't be causing you any issues at all if you use UTF8 everywhere.
If you don't like 7 bit ASCII you should go back in time and try the alternatives (EBCDIC is no doubt still fun for some folks).
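A quick way to convince yourself of the ASCII point is the following Python sketch. The helper names are invented for this example; the only facts it relies on are that ASCII text encodes to the same bytes in UTF-8 and ISO-8859-2, and that UTF-8 encodes every non-ASCII character using only bytes with the high bit set.

```python
def encodes_identically(text: str, encodings=("ascii", "utf-8", "iso-8859-2")) -> bool:
    """Return True if the text produces the same bytes in every listed encoding."""
    try:
        return len({text.encode(name) for name in encodings}) == 1
    except UnicodeEncodeError:
        # At least one encoding cannot represent the text at all.
        return False


def utf8_bytes_outside_ascii(text: str) -> bool:
    """Return True if every byte of the UTF-8 encoding has the high bit set."""
    return all(byte >= 0x80 for byte in text.encode("utf-8"))


print(encodes_identically("plain ASCII text"))  # True
print(encodes_identically("ąę"))                # False
print(utf8_bytes_outside_ascii("ąę"))           # True
```

So the 7-bit range is never reused for anything else in UTF-8, which is why plain ASCII files need no conversion at all.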
[ Send Message | View Utumno's Scratchpad | View Weblogs ]
I use ISO-8859-2 in the small forum I host.
Reason? Users cling to their IE5s and IE6s like their very life depended on it. Now, I don't know, it's probably possible to view and post in UTF-8 using IE5, but users report they have 'problems' and I don't have one around to test. I probably should do it, though.
[ Parent | Reply to this comment ]
I did have the same problem with filenames when I first moved to UTF-8: I had a filename stored with extended characters in iso-8859-1. I just had to rename some files to their correct names. Now I have filenames in Cyrillic, hiragana, hangul, and Latin characters.
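The renaming can be scripted. Below is a rough Python sketch, not the exact procedure I used: the function names are hypothetical, it assumes the old names are iso-8859-1, and it only prints a plan rather than renaming anything.

```python
import os
from typing import List, Tuple


def repair_filename(raw_name: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Return the UTF-8 form of a file name; valid UTF-8 names are returned unchanged."""
    try:
        raw_name.decode("utf-8")
        return raw_name
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy encoding and re-encode.
        return raw_name.decode(legacy_encoding).encode("utf-8")


def plan_renames(directory: str, legacy_encoding: str = "iso-8859-1") -> List[Tuple[bytes, bytes]]:
    """List (old, new) name pairs for entries whose names are not valid UTF-8. Renames nothing."""
    plan = []
    # Passing a bytes path makes os.listdir return the raw bytes of each name.
    for name in os.listdir(os.fsencode(directory)):
        fixed = repair_filename(name, legacy_encoding)
        if fixed != name:
            plan.append((name, fixed))
    return plan


for old, new in plan_renames("."):
    print(old, "->", new)
```

A legacy name can occasionally happen to be valid UTF-8 by coincidence, in which case this sketch leaves it alone, so review the plan before turning it into actual `os.rename` calls.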