Category Archives: Languages

R2rb – Mirroring CSS direction

I’m proud to announce the release of what is possibly the smallest Ruby gem I’ve ever worked on, R2 (R2rb on github, simply r2 on rubygems.org). Anybody who has read my older posts knows that I’m interested in Arabic, and more specifically Arabic information processing. While talking about something unrelated I found out that Dustin Diaz (@deddustindiaz.com) has written a Node.js module called R2 for mirroring the appropriate CSS values needed to alter the directionality of a page. While this isn’t a silver bullet it does do a very good job on pages that have successfully separated presentation from markup (read as: don’t use inline CSS styles). Continue reading

Advertisements

Unicode Security: Yes, there is such a thing

Like all aspects of computers Unicode has its own security issues. And like all Unicode issues most engineers spend their entire professional career trying to avoid dealing with them. It’s ok, you can be honest, I understand. When I gave my talk about Twitter International at Chirp (the Twitter developer conference) I mentioned some of these issues. After that talk I was surprised how many people who know more about internationalization than I do said they hadn’t considered some of these issues.

I’m not going to go into a ton of detail since I’m not a security researcher. I am, however, and engineer focused on international and as such I think it’s my business to know where my push to internationalize everything reaches it’s limit. If you’re in a similar position, pushing people to internationalize, you should make sure you fully understand these issues. If you push people to internationalize and in the process create security flaws you’ll be spending your credibility. Don’t spend it on this – the cost is too high.

Continue reading

Tokens are not just for Chuck-e-cheese

Tokenization refers to splitting any data into chunks, and in the case of this post I’m focusing on splitting text into words. The process of turning free-form text into individual pieces of information (word, phrases, sentences, etc) is something that natural language parsing (NLP) researchers have been interested in for years. There is a whole field of study on the subject that this post does not hope to even touch on. For developers with no language experience this process is usually overlooked as absurdly simple, I mean split(/\W+/), right? If you nodded then this is for you. If you think that was overly simple this will probably be old hat. Continue reading

Character Direction for Developers

When English speaking developers first encounter languages like Hebrew or Arabic where things are written from right to left they react in one of two ways. Either they see this as insurmountable to support in their application or they feel the opposite and assume that since they have UTF-8 everything will just work. While most modern programming languages support UTF-8 encoding that does not mean that everything does it correctly, and often the right-to-left layout is an overlooked part of UTF-8 support. This post hopes to clarify a little bit about right-to-left processing and Arabic in particular since I speak some of that and it inspired this post.
Continue reading

Language Detection Geekery

There have been a few questions on the Twitter API development list asking about how search.twitter.com is able to detect the language of a tweet. The methods used are nothing new to the field of natural language processing (NLP), but most developers haven’t studied much NLP. I’ll cover the industry standard method we’re using, as well as the shortcomings.
Continue reading