My Own Search Engine

I guess every software engineer should be writing their own competitors to Google and Facebook in their garage (in a future post I’ll include pics of our garage data center). Because of issues I saw with Facebook, I created ReadPath, a social network with more of a focus on privacy and news sharing. It’s currently about 70% done. The UI still needs lots of tweaks and there are some features that need to be completed.
One nice bonus of running ReadPath is that it is constantly spidering content from RSS feeds for the news reader. The other day I realized that I’ve now stored a full billion content items going back several years. With that much content, of course I had to create a search engine to mine it. So I created MiniSearch to play with the different concepts involved in running a general search engine. A lot of things turned out to be much harder than expected.
Currently the index is in the process of being built and only includes 20% of available content. There is also a lot of work to be done with ranking still. I’ll post again when I think it’s in a more usable state.
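For anyone curious what "building the index" means at the smallest possible scale, here is a toy sketch of an inverted index, the basic structure behind most text search. This is not MiniSearch's actual code; the function names and sample documents are made up for illustration, and there is no ranking at all, just a lookup of which items contain every query term.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start with the first term's postings, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical content items, standing in for spidered feed entries.
docs = {
    1: "Firefox 4 released today",
    2: "Chrome adds RSS discovery",
    3: "Firefox and Chrome compared",
}
index = build_index(docs)
```

With these sample items, `search(index, "firefox chrome")` returns only item 3. The hard parts at real scale (sharding the index, ranking, keeping it fresh) are exactly what this sketch leaves out.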

Moving back to Firefox

With the release of Firefox 4.0 today, I’ve switched my default browser back to Firefox. Chrome had taken over for a while because it was cleaner and faster. The latest Firefox seems to be just as fast now, and I prefer having access to Firebug when I need it without having to open another browser.
Great job, Mozilla guys 🙂

Virtualizing Mission Critical Applications

Jaimie has organized a webinar to discuss what it takes to manage a large scale virtualization project.
One of the speakers, Mr. Brodhun, is uniquely qualified on this subject having previously served as Technical Director for Enterprise Standards and Technologies for the United States Marine Corps, where he oversaw the deployment of approximately 2,300 ESX hosts and nearly 7,000 virtual machines across 167 sites.
Regardless of the size of your virtualization project, you’ll learn how to maximize uptime and performance of mission critical applications, while eliminating hidden costs that can decrease virtualization ROI by upwards of 50%.
Click here to register for this webinar.

Streaming music on the appleTV

I just stumbled across a little feature on the appleTV that I hadn’t been aware of before. It appears that if you create a playlist in the iTunes library that the appleTV is connected to and put streaming URLs into that playlist, then when you go to the Internet tab on the appleTV, there will be a playlist menu item with those items in it. This way you should be able to get whichever streaming media you want on your appleTV.

The Future has arrived

Ever since I moved out to California, one of the things that I’ve secretly wanted was to be able to listen to my favorite music while driving. The problem has always been that my favorite channel by far is the Vocal Trance channel off of Digitally Imported Radio. So this meant that I would need to be able to stream internet radio while driving in the car.
Well, today that day has finally arrived. I noticed yesterday that DI has an iPhone app that allows you to stream their premium channels over 3G. I was listening today for about an hour while out running some errands with Caitlin. I only lost the signal once, for about 5 seconds while driving in some hills; the app does a great job of buffering and keeping the music going. The quality is great, and with ~1hr of streaming it only used 25MB of data according to the built-in meter (I’ll have to double check with AT&T’s meter).
One of the greatest things about the app is that if it does cut out for any reason, it can tell that it’s at the end of the buffered stream and gracefully fade out, so that there aren’t any jarring cuts in or out. Every streaming app should copy this.

Adding RSS discovery to Chrome

For Chrome on the Mac, it appears that RSS auto discovery is not included by default. This is the feature that puts a little RSS icon in the URL bar when the page that you’re on has an RSS feed available. In order to enable this feature for Chrome on the Mac you need to install this extension. This is an extension from Google and seems to work great.
Then if you find an RSS feed that you’d like to add to ReadPath, drag the link below to your bookmark bar.
Add To ReadPath
Then when you’re on a page that you want to subscribe to, press the bookmarklet and it will have ReadPath subscribe you to the feed.
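If you’re wondering how auto discovery finds a feed in the first place: pages advertise their feeds with a `<link rel="alternate">` tag in the HTML head, with a type of `application/rss+xml` (or `application/atom+xml` for Atom). Here is a rough Python sketch of that lookup. It is not the code from the Chrome extension or from ReadPath; the class and function names, and the example URL, are just made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally mark a feed link.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None for bare attributes; normalize them.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").lower()
        href = attr.get("href", "")
        if "alternate" in rels and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return the feed URLs advertised by an HTML page."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Calling `discover_feeds(page_html, "http://example.com/blog/")` on a page whose head links to `/feed.xml` would give back the absolute feed URL, which is what a reader needs in order to subscribe.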

Changing my comment policy

I’ve turned off comments on this blog for now. While I’ve gotten some great comments in the past, the volume of spam just isn’t worth the hassle. Instead I’ve put a mailto: link at the bottom of each article. This is the best way to get ahold of me anyway. If you send something relevant and worth sharing I’ll add it to the blog post.
Oh, I changed the name of the blog as well. Not as worried about having my real name on the web anymore.

CDH3 HBase

I’ve spent the last several days playing with and configuring CDH3B2 with HBase. My test cluster is using an ESXi server with 20GB of RAM to boot up a bunch of CentOS 5 VMs. Definitely not something that you’d want to run in production, but it works great for testing. It actually helps to expose some bugs due to the slow IO of running a bunch of servers on the same disk.
My production cluster is still running HBase 0.20.3 and has performed flawlessly. It has a table holding half a billion content items, taking up several terabytes of space, and has made it through several disk failures without a hitch. However, I’m looking at the Cloudera distro because every time a new release comes out I have to repackage everything, test it, push it to the live cluster, and then retest to make sure that everything made it over properly. I’m hoping that using the Cloudera distro will simplify a lot of this. I’m also hoping that with the patches they include and the testing being done, I’ll have a more stable cluster. I had a really bad experience with the 0.20.4 update, which is why production is still on 0.20.3.
One major problem that I still have, even with the Cloudera distro, is that the LZO code isn’t included due to licensing problems. I’m really hoping that the new compression options can be packaged up soon so that these libraries don’t need to be maintained separately any more.
A couple of quick notes from my testing:

  • The way that HBase interacts with and manages ZooKeeper has changed. It’s more like running an unmanaged ZooKeeper setup. Not only did the ZooKeeper configs in hbase-site.xml need to be correct on all of the servers, but map-reduce jobs run against HBase seemed to read from /etc/zookeeper/zoo.cfg, so that file needed to be correct on all of the regionservers too. I initially had only edited it on the server running my ZooKeeper instance. I also added the file to the HADOOP_CLASSPATH in hadoop-env.sh, but I’m not sure that that’s required.
  • I wish that there was a better way to manage the HADOOP_CLASSPATH and its inclusion of the HBase and ZooKeeper jar files. I’m trying to find a way so that this doesn’t need to be edited each time I update the software to a new version.
  • I had to change the value for dfs.datanode.socket.write.timeout. On the live cluster I have it set to 0, which is supposed to mean no timeout, but it appears that there is a bug in this latest version that doesn’t respect that value properly. So I just set it to a high number.
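On the HADOOP_CLASSPATH point, one idea I’ve been toying with is generating the value from whatever jars are installed instead of hard-coding version numbers. Below is a rough Python sketch of that idea, not something I’ve run against the cluster. The function name is made up, and any install paths you pass in (for example /usr/lib/hbase/hbase-*.jar) are assumptions that would need to match where the packages actually put their jars.

```python
import glob
import os


def build_classpath(jar_patterns, extra_entries=()):
    """Expand glob patterns into a classpath string.

    jar_patterns: glob patterns for versioned jars, so the result follows
    whatever version is currently installed.
    extra_entries: fixed paths (such as a config directory) appended as-is.
    """
    entries = []
    for pattern in jar_patterns:
        for path in sorted(glob.glob(pattern)):
            # Skip test jars, which are not needed on the job classpath.
            if path.endswith("-tests.jar"):
                continue
            entries.append(path)
    entries.extend(extra_entries)
    return os.pathsep.join(entries)
```

The output of something like `build_classpath(["/usr/lib/hbase/hbase-*.jar", "/usr/lib/zookeeper/zookeeper-*.jar"])` could then be written into hadoop-env.sh by a small deploy script, so an upgrade would only require rerunning the script and not hand-editing version numbers.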

Evening news displays its intellectual dishonesty again

Lead story and headline on the NBC evening news again tonight talked about the “Controversial statement by President Obama about the NY Mosque”. If the news editors were being honest they might say that “President Obama today reaffirmed a belief in freedom of religion that has been a foundation of our country for 200+ years”. But of course that isn’t as exciting as saying that Republicans are trying to stir up fear and hatred of foreigners again.
It is quite sad to see this spin. Journalism’s purpose is to find the truth in the story, not to make up spin to sensationalize it. But we’ve gotten to the point where the corporate interests behind the news have an interest in keeping the crossfire between the right and the left going. If this means taking some liberties, then they’ve shown that they’re willing to do so.
I wouldn’t be terribly sad to see almost every major news body we have today collapse. They’ve strayed too far from their true purpose. There will still be a need for honest news gathering and there are people willing to devote their lives to that purpose. The way just needs to be cleared for these people to come to the fore again.