4.8.05. Learning Python and Unicode

There's been a lot of discussion around the web recently about Unicode in Python. Now that I'm doing Zope 3 development in earnest for a client, I'm running into the Unicode wall. My problems have nothing to do with Python's Unicode support and whether it's good, bad, ugly, or whatever. I just don't understand good Unicode usage / ethics. When do I encode? When do I decode? What does it all mean? I don't even understand the basics. This doesn't just apply to Python: I also don't really understand setting encoding options on basic HTML and other documents, and often just go with the defaults and hope for the best.

Phillipe Normand points out some slides on this very topic by Marc-André Lemburg that, at first glance, look like they cover the basics. I also just remembered that there's a Dive Into Python chapter on Unicode that's been sitting in my bookmarks for quite some time.

I'm writing some file system adapters for some custom Zope 3 content, which is useful for FTP support, and this is where I started running into issues. Some of it may come from fairly old document-parsing code of mine that I'm still using, which dates back to the late nineties and pre-Unicode Python. There are assumptions that I used to be comfortable making about strings and incoming text that I need to re-evaluate.
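As far as I can tell, the usual rule of thumb is: decode bytes into Unicode as soon as they come in, work only with Unicode text internally, and encode back to bytes at the last moment on the way out. Here's a minimal sketch of that idea. It assumes Python 3 syntax, where `str` is Unicode text and `bytes` is raw data; the function names and the UTF-8 default are hypothetical, made up for illustration, and aren't from my actual Zope 3 code.

```python
def parse_document(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode at the boundary, then do all processing on Unicode text.

    Hypothetical helper: the name and the UTF-8 default are assumptions.
    """
    text = raw.decode(encoding)   # bytes -> Unicode, done once on input
    return text.strip().upper()   # string operations act on characters


def serialize_document(text: str, encoding: str = "utf-8") -> bytes:
    """Encode only when the text leaves the program (file, FTP, socket)."""
    return text.encode(encoding)  # Unicode -> bytes, done once on output


# Simulated incoming data, such as an uploaded file's contents.
incoming = "  Marc-André  ".encode("utf-8")

title = parse_document(incoming)
outgoing = serialize_document(title)

# The byte string is longer than the text: "é" is one character
# but takes two bytes in UTF-8.
print(len(incoming.strip()), len(title))
```

The old assumption this breaks is "one byte equals one character": any length, slice, or case operation done on the undecoded bytes can go wrong once the text isn't plain ASCII.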

Update: I also found this essay by Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely Positively Must Know About Unicode and Character Sets (No Excuses!)". In one of the opening paragraphs, he says that, like many programmers, he just wished it would all blow over somehow. I've been one of those many, resisting the outside world from my landlocked American position.