Friday, July 21, 2006

Thinking versus doing

Ever had the feeling that:
  • the task at hand would be easier if you mastered the topic more?

  • you had researched a topic thoroughly but had nothing to show for it?

In the last few months, I've become more aware of the tradeoffs that programming requires. There is a constant battle between thinking and doing. More importantly, there is a need to be careful about over-thinking and over-doing.

Thinking
  • research
  • RTFM
  • planning
  • designing

Doing
  • implementing
  • experimenting
  • practicing
  • typing (!)

Pitfalls of "thinking" too much?
  • fewer/no results
  • less/no practice
  • less/no hands-on experience to guide

Pitfalls of "doing" too much?
  • re-inventing the wheel
  • less than optimal solutions
  • little or no benefit from accepted best practices (lessons from other people's mistakes)

The key is to constantly switch modes. Code a little bit, research a little bit, refactor/readjust your code, research with your newly acquired experience, and so on.

...

A typical scenario: ASCII downcast

On one of the projects I am working on, we ask for city names and display their peer-produced restaurant reviews. The problem comes with cities like "Montréal", whose names contain accents. We capture the string as UTF-8, so that's not the problem. Inserting it into a URL is another matter: it works, but it makes for ugly URLs such as Montr%E9al.
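A small sketch of what the escaping looks like, assuming Ruby's standard uri library (the post does not say which escaping routine the project uses, so that choice is an assumption). It also shows that %E9 is the ISO-8859-1 byte for "é", while escaping the UTF-8 bytes gives a longer sequence:

```ruby
require "uri"

city = "Montréal" # UTF-8 string literal

# Percent-encode the UTF-8 bytes of the name.
utf8_escaped = URI.encode_www_form_component(city)

# Percent-encode the ISO-8859-1 bytes of the same name instead.
latin1_escaped = URI.encode_www_form_component(city.encode("ISO-8859-1"))

puts utf8_escaped    # Montr%C3%A9al
puts latin1_escaped  # Montr%E9al
```

Either way the accent survives only as an escape sequence, which is what motivates stripping it before building the URL.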

The problem becomes: how do I convert "Montréal" to "Montreal"? More broadly, how do I remove all the accents? I call this "ASCII downcast". I would be happy to learn the real name of this process; it would surely guide my efforts towards a solution.

Here are the steps I followed:
  • I googled for a few keywords
    (ruby, conversion, ASCII, UTF-8, downcast)
  • I researched the Ruby standard library
    (especially: Iconv)
  • I experimented a little bit in irb
    (utf-8 to ASCII, utf-8 to iso8859-1 to ASCII)
  • I googled for more keywords, including problem-specific keywords
    (e-acute, escape sequences)
  • I investigated the tr function of String
  • I researched tr and multi-byte encoding
Time elapsed: ~1 hour.

This could have taken a LOT longer. Consider that I'm already quite familiar with encodings and conversions, that I had sample files with the same content in multiple encodings, and that I knew the command-line versions of tr and iconv.

The solution that I came up with is:
  • convert UTF-8 to ISO-8859-1 (raise exception if not possible)
  • use the tr function and map all accented characters to their non-accented counterparts
  • manually map characters -- a one-shot investment
  • detect other "evil" characters (raise exception if needed)
It works, but it feels like a hack. I feel like I'm missing a library somewhere that already does this. I cannot believe that this problem has not been solved before, which implies I didn't do my research well enough.
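The steps above can be sketched roughly as follows. This is a minimal sketch, not the code from the project: it assumes a modern Ruby, where String#encode stands in for Iconv (which has since been removed from the standard library). The helper name ascii_downcast and the mapping table are hypothetical, and the table is far from complete.

```ruby
# Hand-written mapping table (the "one-shot investment"); illustrative only.
ACCENTED   = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöùúûüýÿ"
UNACCENTED = "AAAAAACEEEEIIIINOOOOOUUUUYaaaaaaceeeeiiiinooooouuuuyy"

# tr needs its arguments in an encoding compatible with the receiver,
# so the tables are converted to ISO-8859-1 once, up front.
LATIN1_ACCENTED   = ACCENTED.encode("ISO-8859-1")
LATIN1_UNACCENTED = UNACCENTED.encode("ISO-8859-1")

# Hypothetical helper: returns an ASCII-only version of a UTF-8 name.
def ascii_downcast(utf8_name)
  # Step 1: UTF-8 to ISO-8859-1; raises Encoding::UndefinedConversionError
  # when a character has no ISO-8859-1 equivalent.
  latin1 = utf8_name.encode("ISO-8859-1")

  # Steps 2 and 3: map each accented character to its plain counterpart.
  plain = latin1.tr(LATIN1_ACCENTED, LATIN1_UNACCENTED)

  # Step 4: anything still outside ASCII is an unmapped "evil" character,
  # so this final conversion raises as well.
  plain.encode("US-ASCII")
end

puts ascii_downcast("Montréal") # Montreal
```

A name such as "Straße" passes the first conversion but raises at the last step, because "ß" is not in the table; that is the cue to extend the mapping by hand.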

This reminds me of a similar situation I faced recently with Flickr-style tags. We implemented a "naive" version of tags for the above-mentioned project, only to find this.

It was a learning experience, both coding it and being humbled by a better solution.
