Jonathan's Techno-Tales: Thinking versus doing

Ever had the feeling that:

the task at hand would be easier if you mastered the topic more?

you had researched a topic thoroughly but had nothing to show for it?

In the last few months, I've become more aware of the tradeoffs required to program. There is a constant battle between thinking and doing. More importantly, there is a need to be careful about over-thinking and over-doing.

Thinking	Doing
research	implementing
RTFM	experimenting
planning	practicing
designing	typing (!)

Pitfalls of "thinking" too much?

less/no results
less/no practice
less/no hands-on experience to guide

Pitfalls of "doing" too much?

re-inventing the wheel
less than optimal solutions
less/no benefits from accepted best-practice (other people's mistakes)

The key is to constantly switch modes. Code a little bit, research a little bit, refactor/readjust your code, research with your newly acquired experience, and so on.

...

A typical scenario: ASCII downcast

On one of the projects I am working on, we ask for city names and display their peer-produced restaurant reviews. The problem comes with cities like "Montréal", which contain accents. We capture the string as UTF-8, so that's not the problem. Inserting it into a URL is another matter: it works, but it makes for ugly URLs such as Montr%E9al.
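As an aside on that escaped form: %E9 is the single ISO-8859-1 byte for é, whereas a UTF-8 string escapes to two bytes. A minimal sketch, assuming a modern Ruby where URI.encode_www_form_component from the standard library is available (this is not necessarily what the project used):

```ruby
require "uri"

city = "Montréal"  # a UTF-8 string literal

# Percent-encoding works on bytes, so the result depends on the string's encoding.
utf8_escaped   = URI.encode_www_form_component(city)
latin1_escaped = URI.encode_www_form_component(city.encode("ISO-8859-1"))

puts utf8_escaped    # Montr%C3%A9al
puts latin1_escaped  # Montr%E9al
```

Either way the URL is harder to read than a plain "Montreal".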

The problem becomes: how do I convert "Montréal" to "Montreal"? More broadly, how do I remove all the accents? This is something I call "ASCII downcast". I would be happy to learn the real name of this process; it would surely guide my efforts towards a solution.

Here are the steps I followed:

I googled for a few keywords
(ruby, conversion, ASCII, UTF-8, downcast)
I researched the Ruby standard library
(especially: Iconv)
I experimented a little bit in irb
(utf-8 to ASCII, utf-8 to iso8859-1 to ASCII)
I googled for more keywords, including problem-specific keywords
(e-acute, escape sequences)
I investigated the tr function of String
I researched tr and multi-byte encoding

Time elapsed: ~1 hour.

This could have taken a LOT longer. Consider that I was already quite familiar with encodings and conversions, that I had sample files with the same content in multiple encodings, and that I knew the command-line versions of tr and iconv.

The solution that I came up with is:

convert UTF-8 to ISO-8859-1 (raise exception if not possible)
use the tr function to map all accented characters to their non-accented counterparts
manually map characters -- a one-shot investment
detect other "evil" characters (raise exception if needed)
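The steps above can be sketched roughly as follows. This is an illustration rather than the project's code: it assumes a modern Ruby, where String#encode stands in for Iconv, and the method name ascii_downcast and the character table are made up for the example and far from complete.

```ruby
# Hypothetical, hand-built mapping table (the "one-shot investment").
# Both strings list characters in the same order and have the same length.
ACCENTED   = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöùúûüýÿ".encode("ISO-8859-1")
UNACCENTED = "AAAAAACEEEEIIIINOOOOOUUUUYaaaaaaceeeeiiiinooooouuuuyy".encode("ISO-8859-1")

def ascii_downcast(text)
  # Step 1: UTF-8 to ISO-8859-1; raises Encoding::UndefinedConversionError
  # for characters that do not exist in ISO-8859-1.
  latin1 = text.encode("ISO-8859-1")
  # Steps 2 and 3: map each accented character to its plain counterpart.
  plain = latin1.tr(ACCENTED, UNACCENTED)
  # Step 4: anything still outside ASCII is an "evil" character and raises.
  plain.encode("US-ASCII")
end

puts ascii_downcast("Montréal")  # Montreal
```

A character that exists in ISO-8859-1 but is missing from the table (ß, for instance) fails at the last step instead of slipping into a URL unnoticed.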

It works, but it feels like a hack. I feel like I'm missing a library somewhere that already does this. I cannot believe that this problem has not been solved before, which implies I didn't do my research well enough.

This reminds me of a similar situation I faced recently with Flickr-style tags. We implemented a "naive" version of tags for the above-mentioned project, only to find this.

It was a learning experience, both coding it and being humbled by a better solution.

Friday, July 21, 2006
