Ever had the feeling that:
- the task at hand would be easier if you mastered the topic more?
- you had researched a topic thoroughly but had nothing to show for it?
In the last few months, I've become more aware of the tradeoffs involved in programming. There is a constant battle between thinking and doing. More importantly, there is a need to be careful about over-thinking and over-doing.
| Thinking | Doing |
|---|---|
| research | implementing |
| RTFM | experimenting |
| planning | practicing |
| designing | typing (!) |
Pitfalls of "thinking" too much?
- less/no results
- less/no practice
- less/no hands-on experience to guide you
Pitfalls of "doing" too much?
- re-inventing the wheel
- less-than-optimal solutions
- less/no benefits from accepted best-practice (other people's mistakes)
The key is to constantly switch modes. Code a little bit, research a little bit, refactor/readjust your code, research with your newly acquired experience, and so on.
...
A typical scenario: ASCII downcast
On one of the projects I am working on, we ask for city names and display their peer-produced restaurant reviews. The problem comes with cities like "Montréal" which contain accents. We capture the string as UTF-8; that's not the problem. Inserting it into a URL is another matter: it works, but it makes for ugly URLs like Montr%E9al.
The problem becomes: how do I convert "Montréal" to "Montreal"? More broadly, how do I remove all the accents? This is something I call "ASCII downcast"; I would be happy to learn the real name of this process -- it would surely guide my efforts towards a solution.
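For a concrete picture of the "ugly URL" part, here is a rough irb-style illustration using Ruby's URI module (not the project's code; with a UTF-8 string the escape comes out as %C3%A9, while the %E9 form above corresponds to the same character in a Latin-1 string):

```ruby
require 'uri'

city = "Montréal"                        # captured as UTF-8

# Percent-encoding the raw name gives the ugly URLs mentioned above.
URI.encode_www_form_component(city)      # => "Montr%C3%A9al"

# What we would like to end up with instead:
# "Montreal"
```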
Here are the steps I followed:
- I googled for a few keywords (ruby, conversion, ASCII, UTF-8, downcast)
- I researched the Ruby standard library (especially: Iconv)
- I experimented a little bit in irb (utf-8 to ASCII, utf-8 to iso8859-1 to ASCII; see the sketch after this list)
- I googled for more keywords, including problem-specific keywords (e-acute, escape sequences)
- I investigated the tr function of String
- I researched tr and multi-byte encoding
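The irb experiments looked roughly like this (a reconstruction, not a saved transcript; Iconv lives in the 1.8/1.9 standard library and is a separate gem on newer Rubies):

```ruby
require 'iconv'

city = "Montréal"                            # UTF-8 input

# UTF-8 -> ISO-8859-1 works: every character here has a Latin-1 equivalent.
Iconv.conv('ISO-8859-1', 'UTF-8', city)      # => "Montr\xE9al"

# UTF-8 -> ASCII does not: Iconv raises, because "é" has no plain-ASCII
# representation (rescuing Iconv::Failure, the common parent of its errors).
begin
  Iconv.conv('ASCII', 'UTF-8', city)
rescue Iconv::Failure => e
  puts "no direct ASCII conversion: #{e.class}"
end
```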
Time elapsed: ~1 hour.
This could have taken a LOT longer. Consider that I'm already quite familiar with encodings and conversions, that I had sample files with the same content in multiple encodings, and that I knew the command-line versions of tr and iconv.
The solution that I came up with (sketched in code after this list) is:
- convert UTF-8 to ISO-8859-1 (raise exception if not possible)
- use the tr function and map all accented characters to their unaccented counterparts
- manually map characters -- a one-shot investment
- detect other "evil" characters (raise exception if needed)
It works, but it feels like a hack. I feel like I'm missing a library somewhere that already does this. I cannot believe that this problem has not been solved before, which suggests I didn't do my research well enough.
This reminds me of another similar situation I faced recently with Flickr-style tags. We implemented a "naive" version of tags for the above-mentioned projects, only to find this.
It was a learning experience, both coding it and being humbled by a better solution.