CDH3 HBase

I’ve spent the last several days playing with and configuring CDH3B2 with HBase. My test cluster runs on an ESXi server with 20 GB of RAM, which boots up a bunch of CentOS 5 VMs. It’s definitely not something that you’d want to run in production, but it works great for testing. It actually helps expose some bugs, thanks to the slow IO of running a bunch of servers on the same disk.

My production cluster is still running HBase 0.20.3 and has performed flawlessly. It has a table holding half a billion content items, taking up several terabytes of space, and has made it through several disk failures without a hitch. However, I’m looking at the Cloudera distro because I’m not happy with having to repackage everything, test it, push it to the live cluster, and then retest to make sure that everything made it over properly every time a new release comes out. I’m hoping that using the Cloudera distro will simplify a lot of this. I’m also hoping that the patches they include and the testing they do will give me a more stable cluster. I had a really bad experience with the 0.20.4 update, which is why production is still on 0.20.3.

One major problem that I still have, even with the Cloudera distro, is that the LZO code isn’t included due to licensing problems. I’m really hoping that the new compression options can be packaged up soon so that these libraries don’t need to be maintained separately any more.

A couple of quick notes from my testing:

  • The way that HBase interacts with and manages ZooKeeper has changed. It’s more like running an unmanaged ZooKeeper setup. Not only did the ZooKeeper settings in hbase-site.xml need to be correct on all of the servers, but map-reduce jobs run against HBase also seemed to read from /etc/zookeeper/zoo.cfg, so that file needed to be correct on all of the regionservers as well. I initially had only edited it on the server running my ZooKeeper instance. I also added the file to the HADOOP_CLASSPATH, but I’m not sure that that’s required.
  • I wish that there was a better way to manage the HADOOP_CLASSPATH and its inclusion of the HBase and ZooKeeper jar files. I’m trying to find a way to avoid editing it each time I update the software to a new version.
  • I had to change the value for dfs.datanode.socket.write.timeout. On the live cluster I have it set to 0, which is supposed to mean no timeout, but this latest version appears to have a bug where that value isn’t respected. So I just set it to a high number.
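
On the HADOOP_CLASSPATH note above, one possible approach is to build the value from glob patterns instead of hard-coding versioned jar names, so an upgrade that renames the jars doesn’t require an edit. This is only a rough sketch, not something I’ve run against a real cluster: the function name is made up, and the /usr/lib/hbase and /usr/lib/zookeeper paths are assumptions about where the packages install their jars. The /etc/zookeeper entry is the directory holding zoo.cfg.

```python
import glob
import os


def build_hadoop_classpath(jar_patterns, extra_entries=()):
    """Expand glob patterns into a classpath string.

    The patterns carry no version number, so the result picks up
    whatever jar version is currently installed.
    """
    entries = []
    for pattern in jar_patterns:
        # sorted() keeps the order stable from one run to the next
        entries.extend(sorted(glob.glob(pattern)))
    # Non-jar entries (for example a config directory) go last
    entries.extend(extra_entries)
    return os.pathsep.join(entries)


if __name__ == "__main__":
    # Assumed install locations; adjust for your own layout.
    print(build_hadoop_classpath(
        ["/usr/lib/hbase/hbase-*.jar",
         "/usr/lib/zookeeper/zookeeper-*.jar"],
        extra_entries=["/etc/zookeeper"],
    ))
```

The printed string could then be assigned to HADOOP_CLASSPATH wherever you currently set it by hand. A pattern that matches nothing simply contributes no entries, so a wrong path fails quietly; check the output after an upgrade.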

Playing with ESXi

I had to test out a desktop virtualization product (Pano Logic) this week and as part of the installation I needed a VMware ESX base system. I’m a huge user of their Workstation product, but I had never used the ESX line since it used to be so expensive and required certified hardware. Things have changed though and it’s now possible to download a copy of ESXi for free and to run without a dedicated SAN.

One of the difficulties with VMware is that their acronyms can be very difficult to wade through. ESXi is what they refer to as a hypervisor. This is essentially a very cut-down operating system that is designed only to run other virtual machines. ESXi does have some hardware requirements: I had to go through 3-4 servers before I found one for which the installer had all of the drivers. I finally got it to run on a server I had picked up from Penguin Computing (2x dual-core Opteron with 4 GB of memory and a 250 GB hard drive).

Once I found a server that worked, the system installed quickly. The next problem was that you need to download the vSphere client to administer the server, and it is Windows only (there are command line clients for other operating systems, but I wasn’t ready for that yet). I didn’t have a Windows box lying around (all Linux and Mac), so I had to launch a WinXP VM in Workstation on my Linux desktop to administer my ESXi server. Amazingly, everything worked great.

The next issue that I ran into was that I already had a large number of VMs that I was using on Workstation, but I couldn’t see how to get them onto the ESXi server. In the vSphere client there are clear instructions on how to create a new VM or download an appliance, but not on how to import an existing VM. It turns out that VMware has a very simple way of doing this using the VMware Converter. This product works as a switchboard, allowing you to convert or move VMs from one place to another. It’s a really handy tool.

Overall, ESXi is a great tool for running a whole bunch of server VMs. VMware offers a huge number of management products in the vSphere product line for managing load and moving VMs in a datacenter. But if you just need to run a few VMs on a single server, I would definitely recommend looking at ESXi.