CDH3 HBase

I’ve spent the last several days playing with and configuring CDH3B2 with HBase. My test cluster is using an ESXi server with 20GB of RAM to boot up a bunch of CentOS 5 VMs. Definitely not something that you’d want to run in production, but it works great for testing. It actually helps to expose some bugs, thanks to the slow IO of running a bunch of servers on the same disk.

My production cluster is still running HBase 0.20.3 and has performed flawlessly. It has a table holding half a billion content items, taking up several terabytes of space, and has made it through several disk failures without a hitch. However, I’m looking at the Cloudera distro because I’m not happy with having to repackage everything, test it out, push to the live cluster, and then retest to make sure that everything made it properly every time a new release comes out. I’m hoping that using the Cloudera distro will simplify a lot of this. I’m also hoping that with the patches they include and the testing being done, I’ll have a more stable cluster. I had a really bad experience with the 0.20.4 update, which is why production is still on 0.20.3.

One major problem that I still have, even with the Cloudera distro, is that the LZO code isn’t included due to licensing problems. I’m really hoping that the new compression options can be packaged up soon so that these libraries don’t need to be maintained separately any more.

A couple of quick notes from my testing:

  • The way that HBase interacts with and manages ZooKeeper has changed. It’s more like running an unmanaged ZooKeeper setup. Not only did the ZooKeeper settings in hbase-site.xml need to be correct on all of the servers, but map-reduce jobs run against HBase also seemed to read from /etc/zookeeper/zoo.cfg, so that file needed to be correct on all of the regionservers as well. I had initially edited it only on the server running my ZooKeeper instance. I also added the file to the HADOOP_CLASSPATH in hadoop-env.sh, but I’m not sure that’s required.
  • I wish that there was a better way to manage the HADOOP_CLASSPATH and its inclusion of the HBase and ZooKeeper jar files. I’m trying to find a way so that this doesn’t need to be edited each time I update the software to a new version.
  • I had to change the value for dfs.datanode.socket.write.timeout. On the live cluster I have it set to 0, which is supposed to mean no timeout, but there appears to be a bug in this latest version that doesn’t respect that value. So I just set it to a high number.
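
For the HADOOP_CLASSPATH note, one possible approach is to build the value by matching jar names with wildcards instead of hard-coding version numbers. The sketch below is only an illustration and not something from my actual setup: the install directories (/usr/lib/hbase, /usr/lib/zookeeper), the jar name patterns, and the extra /etc/zookeeper entry are all assumptions that would need to be adjusted to wherever the packages really put things.

```python
import glob
import os


def find_jars(directory, pattern):
    """Return the sorted paths of files in `directory` matching `pattern`."""
    return sorted(glob.glob(os.path.join(directory, pattern)))


def build_classpath(locations, extra=()):
    """Join every matching jar, plus any extra entries, with the path separator."""
    entries = []
    for directory, pattern in locations:
        entries.extend(find_jars(directory, pattern))
    entries.extend(extra)
    return os.pathsep.join(entries)


if __name__ == "__main__":
    # Assumed install locations and jar naming; not taken from a real cluster.
    locations = [
        ("/usr/lib/hbase", "hbase-*.jar"),
        ("/usr/lib/zookeeper", "zookeeper-*.jar"),
    ]
    # Assumption: the directory that holds zoo.cfg is added as an extra entry.
    print(build_classpath(locations, extra=["/etc/zookeeper"]))
```

If this were saved under a hypothetical name such as build_classpath.py, hadoop-env.sh could set HADOOP_CLASSPATH from the script's output, so a version bump in the jar file names would be picked up without editing the file. Note that a pattern like hbase-*.jar also matches any test jars sitting in the same directory, so it may need tightening.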

Evening news displays its intellectual dishonesty again

The lead story and headline on the NBC evening news tonight again talked about the “Controversial statement by President Obama about the NY Mosque”. If the news editors were being honest they might say that “President Obama today reaffirmed a belief in freedom of religion that has been a foundation of our country for 200+ years”. But of course that isn’t as exciting as saying that Republicans are trying to stir up fear and hatred of foreigners again.

It is quite sad to see this spin. Journalism’s purpose is to find the truth in the story, not to make up its own spin to sensationalize. But we’ve gotten to the point where the corporate interests behind the news have an interest in keeping the crossfire between the right and the left going. If this means taking some liberties, then they’ve shown that they’re willing to do so.

I wouldn’t be terribly sad to see almost every major news body we have today collapse. They’ve strayed too far from their true purpose. There will still be a need for honest news gathering and there are people willing to devote their lives to that purpose. The way just needs to be cleared for these people to come to the fore again.

Playing with ESXi

I had to test out a desktop virtualization product (Pano Logic) this week and as part of the installation I needed a VMware ESX base system. I’m a huge user of their Workstation product, but I had never used the ESX line since it used to be so expensive and required certified hardware. Things have changed though, and it’s now possible to download a copy of ESXi for free and run it without a dedicated SAN.

One of the difficulties with VMware is that their acronyms can be very difficult to wade through. ESXi is what they refer to as a hypervisor. This is essentially a very cut-down operating system that is designed only to run other virtual machines. There are some hardware requirements for running ESXi: I had to go through 3-4 servers before I found one for which the installer had all of the drivers. I finally got it to run on a server I had picked up from Penguin Computing (2x dual-core Opteron with 4GB of memory and a 250GB hard drive).

Once I found a server that worked, the system installed quickly. The next problem was that you need to download the vSphere client to administer the server, and it is Windows only (there are command line clients for other operating systems, but I wasn’t ready for that yet). I didn’t have a Windows box lying around (all Linux and Mac), so I had to launch a WinXP VM in Workstation on my Linux desktop to administer my ESXi server. Amazingly, everything worked great.

The next issue that I ran into was that I already had a large number of VMs that I was using in Workstation, but I couldn’t see how to get them onto the ESXi server. In the vSphere client there are clear instructions on how to create a new VM or download an appliance, but not on how to import an existing VM. It turns out that VMware has a very simple way of doing this using the VMware Converter. This product works as a switchboard, allowing you to convert or move VMs from one place to another. A really handy tool.

Overall ESXi is a great tool for running a whole bunch of server VMs. VMware offers a huge number of management products in the vSphere product line for managing load and moving VMs in a datacenter. But if you just need to run a few VMs on a single server I would definitely recommend looking at ESXi.

Just ordered a new Kindle

I just placed a pre-order for Amazon’s newest Kindle the other day. This one is to replace Jaimie’s Kindle since she was still using the very first generation model and there are quite a few updates in this newest version. Supposedly much better battery life, better contrast, and faster page turns, all great things for a power reader.

We also decided to go with the Wi-Fi only model, which is cheaper, since most of our book reading is at home where there is total coverage. If we’re out and about we read on the iPhone and then sync back to the Kindle to continue reading at home.

Hopefully this latest version will arrive before the next round of books that we’ve been waiting for. I’m waiting for The Evolutionary Void and we’re very happy that Mockingjay will also be released in a Kindle version.

I was happy to hear that Bezos was focusing on creating the best book reader instead of chasing yet another tablet.