Unicode Security: Yes, there is such a thing

Like all aspects of computers, Unicode has its own security issues. And like all Unicode issues, most engineers spend their entire professional careers trying to avoid dealing with them. It’s ok, you can be honest, I understand. When I gave my talk about Twitter International at Chirp (the Twitter developer conference) I mentioned some of these issues. After that talk I was surprised by how many people who know more about internationalization than I do said they hadn’t considered some of them.

I’m not going to go into a ton of detail since I’m not a security researcher. I am, however, an engineer focused on international, and as such I think it’s my business to know where my push to internationalize everything reaches its limit. If you’re in a similar position, pushing people to internationalize, you should make sure you fully understand these issues. If you push people to internationalize and in the process create security flaws, you’ll be spending your credibility. Don’t spend it on this – the cost is too high.


On Heroism – A Rocky Analogy

After reading Alex Payne’s post on heroism (Don’t Be A Hero) I have to say I was a little irked. I disagree somewhat on the details of what defines a hero in this context, and that seems to be the crux of my discomfort. I don’t think heroes have to work until four in the morning. Nor do I think a hero creates inherently lower-quality software. A hero is someone so dedicated and passionate about what they are doing that they are willing to work hard and deliver when other people are not (and for some people, what they are passionate about is not low-quality “feature work”). For some “heroes” this becomes late nights, for others early mornings, and for still others it’s a during-the-day activity with no extra time. I’ll be honest, that last case is pretty rare, because the passionate usually see time as flexible and success as a rigid goal.

I was never really sure how to sum up my feelings on the post until last night. Oddly, it was the Eye of the Tiger scene in Persepolis that enlightened me. I’ve seen Rocky many times – and even once in the last few weeks – but somehow seeing that well-worn scene re-used highlighted my feelings. What’s great about Rocky is that it gives me a way to sum up not just The Hero, but also the personalities that often surround them.

Unicode with Ruby – Regular Expressions

Unicode support in Ruby doesn’t get much attention. Most of the information about it focuses on MySQL more than it does on actual Ruby support. Ruby can read and write Unicode data without much trouble, but actually working with it, and moreover making sure it does not get corrupted, is one of the lesser-visited back alleys of Ruby. Hopefully I can make some more time to blog about other Ruby/Unicode interaction, but I have to start somewhere, so Regular Expressions are as good a place as any. Perhaps better, since they’re their own dark art.

Tokens are not just for Chuck-e-cheese

Tokenization refers to splitting any data into chunks, and in the case of this post I’m focusing on splitting text into words. The process of turning free-form text into individual pieces of information (words, phrases, sentences, etc.) is something that natural language processing (NLP) researchers have been interested in for years. There is a whole field of study on the subject that this post does not hope to even touch on. Developers with no language experience usually overlook this process as absurdly simple. I mean, split(/\W+/), right? If you nodded, then this is for you. If you think that was overly simple, this will probably be old hat.
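To make the split(/\W+/) problem concrete, here is a minimal Ruby sketch. The method names naive_tokens and unicode_tokens are made up for illustration, and the Unicode-aware pattern is only one plausible alternative, not a complete tokenizer. It still assumes words are separated by something, which does not hold for languages written without spaces.

```ruby
# Hypothetical helper: the "absurdly simple" approach from the text.
# In Ruby, \w and \W only know about ASCII letters, digits and underscore,
# so any accented or non-Latin letter is treated as a separator.
def naive_tokens(text)
  text.split(/\W+/)
end

# Hypothetical helper: collect runs of Unicode letters (\p{L}),
# combining marks (\p{M}) and numbers (\p{N}) instead.
def unicode_tokens(text)
  text.scan(/[\p{L}\p{M}\p{N}]+/)
end

# "café au lait", with the é written as an escape to avoid encoding doubts.
sample = "caf\u00e9 au lait"

p naive_tokens(sample)    # the é is swallowed as a separator
p unicode_tokens(sample)  # the é stays inside its word
```

The naive version silently drops the last letter of the first word, which is exactly the kind of corruption that is easy to miss when all of your test data is English.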

Character Direction for Developers

When English-speaking developers first encounter languages like Hebrew or Arabic, where things are written from right to left, they react in one of two ways. Either they see this as insurmountable to support in their application, or they feel the opposite and assume that since they have UTF-8 everything will just work. While most modern programming languages support UTF-8 encoding, that does not mean that everything does it correctly, and often the right-to-left layout is an overlooked part of UTF-8 support. This post hopes to clarify a little bit about right-to-left processing, and Arabic in particular, since I speak some of that and it inspired this post.
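As a rough illustration of why UTF-8 alone is not enough, the Ruby sketch below guesses a string’s base direction from its first letter. The names RTL_SCRIPTS and base_direction are hypothetical, and the check covers only Hebrew and Arabic, so treat it as a simplification of the idea rather than an implementation of the Unicode Bidirectional Algorithm.

```ruby
# Assumption: only Hebrew and Arabic are treated as right-to-left here.
# Other right-to-left scripts exist and are ignored by this sketch.
RTL_SCRIPTS = /[\p{Hebrew}\p{Arabic}]/

# Hypothetical helper: returns :rtl, :ltr, or :neutral based on the
# first letter found in the text. Digits and punctuation are skipped.
def base_direction(text)
  first_letter = text[/\p{L}/]
  return :neutral if first_letter.nil?

  first_letter.match?(RTL_SCRIPTS) ? :rtl : :ltr
end

arabic_hello = "\u0645\u0631\u062d\u0628\u0627"  # Arabic greeting, stored in reading order
p base_direction(arabic_hello)
p base_direction("Hello")
p base_direction("123")
```

Note that the string itself is stored in the order it is read, first letter first. Laying it out from right to left is a separate display step, which is where "it’s UTF-8, so it works" tends to fall down.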

Language Detection Geekery

There have been a few questions on the Twitter API development list asking how search.twitter.com is able to detect the language of a tweet. The methods used are nothing new to the field of natural language processing (NLP), but most developers haven’t studied much NLP. I’ll cover the industry-standard method we’re using, as well as its shortcomings.