I was playing around with google this weekend. The original problem I wanted to solve was that last.fm returns strange strange release dates for albums, so I was writing a small script that would extract the correct release date from various sources. I was aiming at www.metal-archives.com and wikipedia. Both of these sites have different search pages, and in general I’ve come to rely more on google’s site:xxx functionaly, than on individual pages own search engines. So I thought, why not just use google programmatically to search the sites. Seems easy enough.
Failure 1 (I’m feeling lucky):
Google has a very nice feature called “I’m feeling lucky” that will direct you to the first result. If I could specify my queries good enough, I could rely on that, and not have to parse google to get the url. It’s very simple, you just add &btnI at the end of your query and google will redirect. Sadly it works fine most of the time, but sometimes it just fails to redirect you. I couldn’t find any patterns to this randomness and a “works sometimes solution” is not a good one 😐
Failure 2 (google ajax):
I then found out that google has a seemingly very nice api that lets you do queries and get JSON back. JSON is easy to work with and it also allows one to go through several results, in case google doesn’t return the right one as the first result. After a bit poking around I found out that google ajax randomly returns different results from the normal google. It’s like using Yahoo instead of google. A bit of poking around returns the following 2 year old bug report. Furthermore the TOS directly forbids using the API for this kind of activity. Oh well, it didn’t work anyway.
Failure 3 (parsing google results directly):
After two bitter defeats I thought screw it, I’ll just parse the damn google result pages, how hard can it be? At least I know that it gives me the right results. So I did that, coded everything up and checked that things was working. Then let it loose on my collection (2×275 requests) and around the middle it stopped working. I poked a bit around, and found out that google has identified my program as a bad boy and decided to spank it by returning a “Please identify yourself as a human” page back instead of the normal google result page.
As a side note, after 3 bitter defeats I was ready to jump ship and try bing or Yahoo. That was a quick detour though, as none of them where up for the challenge of returning good results.