Encyclopedia > Wikipedia:Unicode

Article Content

Wikipedia:Unicode

This is merely a discusion page, go to Unicode for the real McCoy.

Someone is going through and adding UNICODE characters for various symbols in the martial arts areas. I have mixed feelings about this. Is it a good idea?

In my browser, this doesn't look good or useful at all. It just puts question marks there.

I would imagine that for other people it is useful and nifty.

Opinions?

(And, this chit chat should be removed soon to make a real UNICODE article!)

I'm in process of downloading fonts so I can see the characters. I don't think people should have to download fonts to see the characters. --Koyaanis Qatsi

This is an encyclopedia. It is intended for people to research on various topics and termonologies. If a term is originated from a foreign language, it would be helpful to include the term in its original writing, such as Chinese, Hebrew, Arabic etc. This is made possible only recently because of the UNICODE standard. The original writing is helpful for researchers to communicate with people who knows the language. It also helps resolve confusion in variation of transliteration in different languages, e.g. Qi vs. Chi vs. Ki etc. When you copy down the original writings, you get help from native speakers easier than trying to explain what exactly you are talking about. I have gone through some articles and put in the UNICODE for just the title terms only. I have refrained myself from touching other content of the articles just for the reason you have complained about. In my opinion, the UNICODE addition is an important resource for any researchers. If no one appreciates this contribution, I can stop right away.

If you believe that Unicode is the standard encoding for HTML in the future, you should upgrade your browser to such a state that you can see all text from around the world regardless whether you can read it or not.

Rebuttal?

Well, I don't really have a rebuttal, exactly. I think it's a toss-up. On the one hand, it is probably true that over the next couple of years, people will naturally and accidentally find their computers using Unicode so that they can just magically see all these nice characters. On the other hand, I don't know how many people can see them now.

It isn't really about me. I'm a computer savvy guy. I probably should upgrade my browsing situation to UNICODE. It's really about less computer savvy people. We shouldn't be snobbish and demand that everyone in the world upgrade to the latest browser, should we?

--Jimbo Wales

Does this means unicode in articles like Hebrew alphabet are not welcome here either? They all show up as question marks and alot of numbers. However, I believe some people somewhere are able to read that page if they care to set up the computer with the fonts. The missing wiki links are shown as ? too, to me, they also look very ugly. Can we get rid of them too?

Jimbo, I thought you own this project. You have enough authority to decide one way or the other.

Actually the unicode does not really affect the operation of the encyclopedia. The user just need to know that the ?'s are for user who care to see the foreign characters, all those who don't need to know can simply ignore the ???s.

As somebody whose system can display Japanese/Chinese unicode characters (mozilla 0.9.5 on Debian), they certainly look way cool, but I'm not convinced they add much to the article (then again, I'm not a scholar of Asian languages). Then again, having (???) there shouldn't be so bad.

Would it be practical to have a routine to convert characters specified in unicode to gifs and embed them in the page for the interim period until the majority of people have unicode browsers and sufficient fontsets to view Asian-language glyphs? --Robert Merkel

Most people with unsophisticated browsing arrangements (read: Windows 98/ME/2000/XP with Internet Explorer 5+) will either see those symbols fine as it is, or will get a automatic prompt from Explorer to download appropriate fonts to see them if they want to.

I strongly agree with native-language rendering in case of important names, I think it only adds to the articles. We might want to consider adding some kind of standard one-line disclaimer to articles that feature a lot of Unicode characters that are likely to demand additional fonts from the user, something that would link to a special page which would explain the situation, where to DL the fonts if necessary, etc. --AV

The need to include native writing is obvious, as pointed out by the comments given above. You can see the same approach in other articles such as Munich where the first paragraph says

Munich is a city in and the state capital of Bavaria in Germany. Its German name is München.

The only difference here is that the German text has the luxury of showing in the ISO-Latin1 code page. i.e. it is visible to most western users while the Asian text nor Polish text nor Arabic text are not visible without special browser setup. Nevertheless the need is there regardless of the font issue. On the other hand, most Chinese and Japanese users who use the asian code page will not be able to see extended European characters on their browsers. So it is an even game. So if your argument to ban foreign characters is purely based on displayability of the text, then you must ban all non-English characters including German and French.

Another way is to specify the codepoint and instruction for the user how to look up the Unicode character as in Chinese numerals. My opinion is that it is not as convenient as the in-line Unicode inclusion.

ISO-Latin-1 codes should be converted into Unicode HTML entities (this is especially true for all accented characters). We've discussed this before, and all the evidence's for it. I'll fix Munich right away. --AV

changing the Latin1 character into a ü code does not help much. The character is still shown as a ? on a Chinese browser. So as I pointed out earlier, the font problem is from the browser, not whether unicode is used or not. If you don't like Asian text to show as ??? on your English browser, I don't like German text to show as ??? on my Chinese browser either. If you really want to please everyone, you can only use the lowest common denominator, which is pure English. Why can't we just tolerate each other's ??? as long as everyone knows that someone else are able to see those ??? as native text.

changing the Latin character does help much: it allows browsers to correctly display the text without identifying the encoding. If your Chinese browser still shows ?, then your Chinese browser is deficient, and it is the problem, not Wikipedia. The article now correctly gives it all the information about the character; and if you use Windows, accented European characters are available in default fonts in all Windows versions including Asian versions. And in fact, the Chinese character show up perfectly correctly on my English Windows and browser.

I'm not trying to please everyone, I'm trying to make it possible for everyone to see the right text, which is only possible by using Unicode. I'm not speaking against Chinese characters, on the contrary. My opinion is that short names and crucial concepts should be given in their native rendering, but sentence fragments or complete sentences should be in English. --AV

Actually, AV, ISO-8859-1 characters do not need to be converted, and in fact they're more likely to display correctly on some older (pre-HTML4) browers if they aren't. The Bomis server sends a "Character-Encoding:" header with the pages that specifies ISO-8859-1, so all of those characters will come across with no problems unless

the browser reading the text is broken (and in that case, the entities probably won't help). It's only characters outside the Latin1 range that have to be specified as entities. It is unfortunate that present software and standards bodies are often out of sync, but that's the way it is. --LDC

LDC, the HTML source doesn't contain any specification of the encoding (just view the source of any Wikipedia page). Maybe the HTTP headers contain a header which specify ISO-8859-1, I haven't checked, but even if it does, it's not enough: too many browsers, including for example this here IE 5+ I'm using, don't use that information to automatically set the page's encoding. When I visit a Wikipedia page which uses even Latin1 8-bit characters, I often see them as random Cyrillic characters (owing to the fact that I often visit Russian websites prior to visiting Wikipedia), the browser isn't smart enough to switch the encoding.

Yes, I'm talking about HTTP headers, not HTML source. Here they are exactly:

  Date: Tue, 23 Oct 2001 01:56:47 GMT
  Server: Apache/1.3.19 (Unix) PHP/4.0.4pl1 mod_fastcgi/2.2.10
  Connection: close
  Transfer-Encoding: chunked
  Content-Type: text/html; charset=iso-8859-1

Yes, some modern browsers will ignore that header; most older browsers don't even know it exists. But that's precisely my point: they will assume 8859-1 because that's the native character set (or Windows code page 1251, which is close enough), and do the right thing for the wrong reason. Those same older browsers won't know what ü means, because that's a recent HTML-ism. All I'm saying is that those browsers that get it wrong when not encoded and get it right when encoded are few (you seem to have one), but the ones that will get it right when not encoded are many. And I'm not saying we should actually do the wrong thing just because it works on some browsers--what I recommend is technically correct. Do you have your IE5 set to "Auto-select" encoding? That's the View->Encoding->Auto-select menu item. It should switch back and forth correctly--mine does.

The encoding information should be inside the HTML document, and the Wiki software currently doesn't do this. There're many good reasons why HTTP header isn't enough, not just practical ones: for instance, HTML files should strive to be XHTML-compliant, and this absolutely requires encoding information inside the HTML stream. If the Wiki software is modified to insert the appropriate attribute into all outgoing streams, I agree that converting Latin-1 characters won't be needed for correct displaying; I still think that it's better to present as entities all characters outside of 7-bit ASCII. --AV

I would like to see the encoding inside the HTML as well. Perhaps after a while when old pre-HTML4 browsers have been phased out even overseas, the entity refs will be the clearly better thing to do. But I think you underestimate the number of IE2, IE3, and Netscape 3.x machines in the world who won't know what a ü is.

Are you sure Netscape 3.x won't understand ü ? Anyway, one way to check would be to ask for Bomis administrators to analyze the Wikipedia traffic and give us statistics on browsers used to access Wikipedia now. --AV

Instead of ignoring very old browsers, another way could be to translate things like ü into their 8-bit equivalents when it's needed; e.g. when a user selects a certain flag or even better, automatically when an older browser is detected (as all pages are dynamic anyway). --AV

One compromise would be to set a policy that only the subject title can include the native writing and it must be enclosed in () so that it does not interrupt the flow of the sentence. I also agree the one line to point out what the ??? are will be helpful for the clueless.

As long as an article doesn't depend on the characters, and they are merely parenthetical extra information (as I rendered them in the Ang Lee article), then they can be nothing but a benefit. Also, this page should be deleted and moved to something like "Wiki special characters/ChineseTalk". --Lee Daniel Crocker

Though the Chinese characters started this discussion, I think the rule can extend to other languages. We have a lot of terms in this wikipedia that are based on Arabic e.g. Al-Qaida and Hebrew e.g. Yahweh. It would be benefitial for future schoolars to actually see which text is transliterated when there are multiple versions of the English spelling.

Agreed; we can certainly represent eth and thorn in Anglo-Saxon with Unicode so that AS verse is rendered as is to all browsers. sjc

One problem with Hebrew and Arabic is that the text reads from right to left. The text will be displayed wrong if the browsers do not handle bi-directional writings. On the other hand, people who use the wrong browsers will see the text as ???, it reads the same either way. :-)

I like the idea of having Unicode as the base font of the wikipedia instead of ASCII. This adds considerably to the quality and eases the description of many things because you can use specific characters and symbols.

Alan Wood has a comprehensive site on Unicode fonts for the various operating systems. [1] (http://www.alanwood.net/unicode/). In particular he indicates where to download Unicode fonts so that Internet Explorer 5.0 (two years old) and above display Unicode fonts.

Hannes Hirzel

One danger of using Unicode is that people start to overdo it. Since Unicode supports all languages of the world, there is a tendency to add too much non-English text into an English encyclopedia.

I would suggest that wikipedia put down a policy to limit foreign text just for specifying the original writings of any transliterated English word such as Tao, Al-Qaida and Yahweh etc. I have no objection on including quotation in the original language provided that it is set in a block that does not affect the flow of the reading assuming some people will just see all ???s if they don't care about the original text.

I can see that including native text may not be important for any alphabet based language because there is always a one-to-one mapping from the English alphabet/syllable back to the foreign alphabet, so including the Unicode does not add much. However, for non-alphabetic languages such as Japanese, Chinese etc. The English transliteration seldem maps back to the original text correctly. For example, in Chinese, one pronunciation can maps to over 100 different characters. On the other hand, in Japanese, one Kanji character (e.g. the character for one) can map to over 100 different pronunciations depending on the context it is in. Including the native text is the only way to solve such problem.

In several of the Physics articles I help work on, there is a need for a way to display h-bar (Planck's constant divided by 2π) in the various formulae. In Unicode, it is as simple as entering ℏ. A hack is to use a struck-through "h", like this: h. Which unfortunately looks terrible. My gut feeling is to use Unicode, but is are there people who feel strongly against it? --CYD

For the love of all that is good, yes, use a Unicode character reference and not some ugly hack! (My opinion, anyway.) If the specific Planck's over 2π (ℏ = ℏ) isn't sufficiently widely viewable (works for me), try an italicized lowercase h-with-stroke (ħ = ''ħ'') which is fairly close and should be in basic fonts on most recent OSs, being in ISO-8859-3 and the Unicode Latin-extended B section rather than the math symbols section. --Brion VIBBER

All Wikipedia text is available under the terms of the GNU Free Documentation License

Search Encyclopedia

Search over one million articles, find something about almost anything!