Category Archives: Programming

Sufferin’ Safari: Quirks Between Safari Versions

Browser incompatibility is so 1999, isn’t it? Well, while we spend our time fretting about IE version incompatibility and cross-browser issues, we often overlook the version issues of other browsers. Over the past week I’ve been working on the twitter-text-js support for hashtags in Russian, Korean, Japanese and Chinese. Along the way I ran into two bugs in some versions of Safari that surprised me. I didn’t find much online about them, so I wanted to take a moment and jot this down. Continue reading

R2rb – Mirroring CSS direction

I’m proud to announce the release of what is possibly the smallest Ruby gem I’ve ever worked on, R2 (R2rb on GitHub, simply r2 on rubygems.org). Anybody who has read my older posts knows that I’m interested in Arabic, and more specifically Arabic information processing. While talking about something unrelated I found out that Dustin Diaz (@ded, dustindiaz.com) has written a Node.js module called R2 for mirroring the appropriate CSS values needed to alter the directionality of a page. While this isn’t a silver bullet, it does a very good job on pages that have successfully separated presentation from markup (read as: don’t use inline CSS styles). Continue reading
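
As a rough illustration of what "mirroring" means here, the sketch below flips a few direction-sensitive CSS declarations. It is not R2's actual implementation or API: `swap_left_right` and `mirror_declaration` are hypothetical names, only a handful of properties are handled, and a real tool covers many more cases.

```ruby
# Hypothetical sketch of CSS mirroring; not the R2 gem's real code or API.

# Swap the words "left" and "right" wherever they appear as whole words.
def swap_left_right(text)
  text.gsub(/\b(left|right)\b/) { |word| word == "left" ? "right" : "left" }
end

# Mirror a single "property: value" declaration for a right-to-left layout.
def mirror_declaration(declaration)
  property, value = declaration.split(":", 2).map(&:strip)

  # Properties such as margin-left become margin-right.
  property = swap_left_right(property)

  case property
  when "float", "clear", "text-align"
    # Keyword values left/right trade places.
    value = swap_left_right(value)
  when "margin", "padding"
    # Four-value shorthand is "top right bottom left": swap right and left.
    parts = value.split
    parts[1], parts[3] = parts[3], parts[1] if parts.size == 4
    value = parts.join(" ")
  when "direction"
    # Flip ltr/rtl and leave any other value (e.g. inherit) alone.
    value = { "ltr" => "rtl", "rtl" => "ltr" }.fetch(value, value)
  end

  "#{property}: #{value}"
end

puts mirror_declaration("float: left")
puts mirror_declaration("margin-left: 10px")
puts mirror_declaration("padding: 1px 2px 3px 4px")
```

Working on one declaration at a time is also why inline styles are a problem: anything that lives in the markup instead of the stylesheet never passes through a step like this.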

Crowdsourcing vs. Community

This post is about how I have come to use the words “crowdsourced” and “community” to distinguish different, but related, activities. I’ve been working on Twitter’s community translation tools since before they were launched, and this is a lesson I’ve learned during that time. This all started with my reply to a Quora topic, and much of the information was already covered there. But since Quora is a smaller community than the web at large, I wanted to re-format the information for the widest consumption and change some of the examples to be a little bit clearer.

Continue reading

MySQL and Unicode

I used MySQL for a great many projects over the years with the assumption that a charset of utf8 and a collation of utf8_unicode_ci was going to support all of UTF-8 and that was all I needed to do. I was sorely mistaken, but there was no point in writing about it until now, because MySQL 5.5 has finally helped rectify the issue. Up until MySQL 5.5 (released in December 2010) the UTF-8 support was severely hobbled. With MySQL 5.5 the server can now support the full range of characters that UTF-8 allows, but it’s not the default behavior. There are still plenty of pitfalls for the naïve developer starting out with MySQL. Continue reading
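
One concrete form of the limitation: MySQL's `utf8` charset stores at most three bytes per character, so characters outside the Basic Multilingual Plane need the `utf8mb4` charset added in the 5.5 series. Below is a minimal Ruby sketch of a check you might run before inserting data. The helper name `needs_utf8mb4?` is made up for illustration, and the sketch assumes the input is valid UTF-8.

```ruby
# Hypothetical helper: true when a string contains characters that take
# four bytes in UTF-8 (code points above U+FFFF), which a three-byte
# MySQL `utf8` column cannot store.
def needs_utf8mb4?(text)
  text.encode("UTF-8").each_char.any? { |char| char.bytesize > 3 }
end

puts needs_utf8mb4?("caf\u00E9")        # accented Latin letter, two bytes
puts needs_utf8mb4?("\u4E2D\u6587")     # CJK characters inside the BMP, three bytes each
puts needs_utf8mb4?("\u{1F600} hello")  # emoji outside the BMP, four bytes
```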

Grease Monkey programming for #NewTwitter

So the new Twitter redesign (a.k.a. #NewTwitter) is out in the wild at last, even if it’s only for a small percentage of users. Soon enough we’ll all have access, but even before that I wanted to write about customizing #NewTwitter using Grease Monkey. Much has been said about the new right side “Details Pane” real estate as a platform, but I don’t know about any of that. I suspect that annotations and the Details Pane will be a match made in heaven, but that’s not something I heard at the office, just my personal view as a former Platform team member and former 3rd party Twitter developer. What I’m interested in right now is customizing the Details Pane for myself using Grease Monkey.

Continue reading

Unicode Security: Yes, there is such a thing

Like all aspects of computers, Unicode has its own security issues. And like all Unicode issues, most engineers spend their entire professional career trying to avoid dealing with them. It’s ok, you can be honest, I understand. When I gave my talk about Twitter International at Chirp (the Twitter developer conference) I mentioned some of these issues. After that talk I was surprised how many people who know more about internationalization than I do said they hadn’t considered some of these issues.

I’m not going to go into a ton of detail since I’m not a security researcher. I am, however, an engineer focused on international, and as such I think it’s my business to know where my push to internationalize everything reaches its limit. If you’re in a similar position, pushing people to internationalize, you should make sure you fully understand these issues. If you push people to internationalize and in the process create security flaws, you’ll be spending your credibility. Don’t spend it on this – the cost is too high.

Continue reading

Unicode with Ruby – Regular Expressions

Unicode support in Ruby doesn’t get much attention. Most of the information about it focuses on MySQL more than it does on actual Ruby support. Ruby can read and write Unicode data without much trouble, but actually working with it, and moreover making sure it does not get corrupted, is one of the lesser-visited back alleys of Ruby. Hopefully I can make some more time to blog about other Ruby/Unicode interactions, but I have to start somewhere, so Regular Expressions are as good a place as any. Perhaps better, since they’re their own dark art. Continue reading

Tokens are not just for Chuck E. Cheese

Tokenization refers to splitting any data into chunks, and in the case of this post I’m focusing on splitting text into words. The process of turning free-form text into individual pieces of information (words, phrases, sentences, etc.) is something that natural language processing (NLP) researchers have been interested in for years. There is a whole field of study on the subject that this post does not hope to even touch on. Developers with no language experience usually dismiss this process as absurdly simple. I mean, split(/\W+/), right? If you nodded then this is for you. If you think that was overly simple this will probably be old hat. Continue reading
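
To see why split(/\W+/) is not enough, here is a small Ruby sketch. It relies on the fact that `\W` in Ruby's regular expressions is ASCII-only, so accented and non-Latin letters are treated as separators. The sample string and variable names are just for illustration.

```ruby
# Sample text: "naive" and "cafe" with accents, followed by an Arabic word.
text = "na\u00EFve caf\u00E9 \u0645\u0631\u062D\u0628\u0627"

# \W is ASCII-only in Ruby, so accented and Arabic letters act as separators
# and the words are cut apart or lost.
ascii_tokens = text.split(/\W+/)

# Splitting on anything that is not a Unicode letter, mark, or number
# keeps each word intact.
unicode_tokens = text.split(/[^\p{L}\p{M}\p{N}]+/)

p ascii_tokens
p unicode_tokens
```

Even the Unicode-aware version only helps for languages that put spaces between words. Text in Chinese or Japanese, for example, needs a different approach entirely.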

Character Direction for Developers

When English-speaking developers first encounter languages like Hebrew or Arabic, where things are written from right to left, they react in one of two ways. Either they see this as insurmountable to support in their application, or they feel the opposite and assume that since they have UTF-8 everything will just work. While most modern programming languages support UTF-8 encoding, that does not mean that everything handles it correctly, and right-to-left layout is an often overlooked part of supporting these languages. This post hopes to clarify a little bit about right-to-left processing, and Arabic in particular, since I speak some of that and it inspired this post.
Continue reading
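
To make the "UTF-8 is not the whole story" point concrete, here is a small Ruby sketch that guesses the base direction of a string from its first Hebrew, Arabic, or Latin letter. It is a deliberate simplification, not the Unicode Bidirectional Algorithm, and `base_direction` is a hypothetical helper name.

```ruby
# Hypothetical helper: guess the base direction of a string from its first
# strongly directional letter. Only Hebrew, Arabic, and Latin scripts are
# checked in this sketch; anything else is reported as :neutral.
def base_direction(text)
  first = text[/[\p{Hebrew}\p{Arabic}\p{Latin}]/]
  return :neutral if first.nil?

  first.match?(/[\p{Hebrew}\p{Arabic}]/) ? :rtl : :ltr
end

p base_direction("\u0645\u0631\u062D\u0628\u0627 Twitter")  # Arabic word first
p base_direction("Twitter \u0645\u0631\u062D\u0628\u0627")  # Latin word first
p base_direction("12345")                                   # digits only
```

The bytes are valid UTF-8 in all three cases. Which way the text should be laid out is a separate question that the encoding does not answer.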

Language Detection Geekery

There have been a few questions on the Twitter API development list asking about how search.twitter.com is able to detect the language of a tweet. The methods used are nothing new to the field of natural language processing (NLP), but most developers haven’t studied much NLP. I’ll cover the industry standard method we’re using, as well as the shortcomings.
Continue reading