Unicode, Browsers, Python, and Kvetching

My HTML/unicode character utility is now in a reasonably usable state. I ended up devoting rather more effort to it than I had originally planned, especially given that there are other perfectly useful such things out there. But once you start tweaking, it’s hard to stop. There are now many wonderful subtleties there that no one but me will ever notice.

What gave me the most grief was handling characters outside the Basic Multilingual Plane, i.e. those with codes above 0xFFFF. That’s hardly surprising. And I suppose it shouldn’t be surprising that browsers handle them so inconsistently. All the four major browsers try to display BMP characters using whatever fonts are installed, but not so for the higher ones. In detail:

  • Firefox makes a valiant effort to display them, using whatever installed fonts it can find. It’s fairly inconsistent about which ones it uses, though.
  • IE7 and Opera make no effort to find fonts with the appropriate characters. They do work if you specify an appropriate font.
  • Safari (on Windows) doesn’t display them even if you specify a font. This does not further endear Safari to me.

Oh, and on a couple of XP machines I had to reinstall Cambria Math (really useful for, you know, math) to get the browsers to find it. There must be something odd about how the Office 2007 compatibility pack installed its fonts the first time (I assume that’s how they got there).

On the server side, I knew I would have to do some surrogate-pair processing myself, and that didn’t bother me. Finding character names and the like was more annoying. I was delighted with python’s unicodedata library until I started trying to get the supplementary planes to work. The library restricts itself to the BMP, presumably because python unicode strings have 16-bit characters. The reason for the restriction is somewhat obscure to me—the library’s functions could presumably work either with single characters or surrogate pairs; and I’m pretty sure all the data is actually there (the \N{} for string literals works for supplementary-plane characters, for example).

The whole unicode range ought to work in wide builds of python, but I have no idea if that would work with Django and apache/mod_python and Webfaction, and I’m far too lazy to try. So I processed the raw unicode data into my own half-assed extended unicode library, basically just a ginormous dict with a couple of functions to extract what I want (so far just names, categories and things to come if I ever get around to it).


Tags: , , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: