My Own Search Engine

I guess every software engineer should be writing their own competitors to Google and Facebook in their garage (in a future post I’ll include pics of our garage data center). Because of issues I saw with Facebook, I created ReadPath, a social network with more of a focus on privacy and news sharing. It’s currently about 70% done. The UI still needs lots of tweaks, and some features have yet to be completed.

One nice bonus of running ReadPath is that it is constantly spidering content from RSS feeds for the news reader. The other day I realized that I’ve now stored a full billion content items going back several years. Of course, with that much content, I had to build a search engine to mine it. So I created MiniSearch to play with the different concepts involved in running a general search engine. A lot of things turned out to be much harder than expected.

The index is still being built and currently includes only 20% of the available content. Ranking also still needs a lot of work. I’ll post again when I think it’s in a more usable state.
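For anyone unfamiliar with the two pieces mentioned above, indexing and ranking, here is a minimal sketch of the general idea: an in-memory inverted index with a simple TF-IDF score. This is not MiniSearch's code. The `TinyIndex` class, its method names, and the sample documents are all hypothetical and exist only for illustration; a real index over a billion items would be sharded and stored on disk.

```python
import math
import re
from collections import Counter, defaultdict


class TinyIndex:
    """Toy in-memory inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        # term -> {doc_id: number of times the term occurs in that document}
        self.postings = defaultdict(dict)
        self.doc_count = 0

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        # Record the term frequency of every term in this document.
        for term, tf in Counter(self.tokenize(text)).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def search(self, query):
        # Sum tf * idf over the query terms, then sort best-first.
        scores = defaultdict(float)
        for term in self.tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        # Ties are broken by doc_id so the ordering is deterministic.
        return sorted(scores, key=lambda d: (-scores[d], d))


index = TinyIndex()
index.add("a", "Hadoop and HBase cluster notes")
index.add("b", "RSS feeds and news readers")
index.add("c", "HBase replaces MySQL, HBase scales")
print(index.search("hbase"))  # ['c', 'a']
```

Document "c" ranks first because it mentions the term twice. Even this toy version hints at where the hard parts start: term frequency alone is easy to game, and real ranking needs many more signals.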

Hadoop Lives

After the announcement that Microsoft would be taking over search development for Yahoo, I was a bit concerned that Hadoop would suffer. But according to this, development will continue, and may even be stepped up.

I’ve been using Hadoop and HBase with ReadPath and have been thoroughly impressed with the results. Prior to testing it out, I had read many comments about HBase being usable for batch processing but not really up to the task of replacing a database for live requests.

As an initial test, I’ve replaced ReadPath’s dictionary system, which had been running on MySQL, with HBase. This system is responding to hundreds of requests a second and returning in ~10 ms, which is on par with what MySQL was doing. The big wins are in scalability, though. I currently have a five-server cluster on which I’m seeing a 100x increase in inserts/updates over MySQL, with a much larger data set. As I convert other databases to HBase and add their servers to the Hadoop cluster, performance should only increase.