Binary Talks

How to kick off Hadoop’s rack awareness

Hadoop, an open source framework for reliable, scalable, distributed computing and data storage, has a nice feature called rack awareness. With it, the namenode knows which rack (and which data center) each datanode lives in, so you can spread your Hadoop cluster over machines in different racks and even in different data centers that are worlds apart from each other. Sadly, like almost everything regarding Hadoop, this isn't well documented: the project is under heavy development, and far fewer people actually work with Hadoop than with other huge open source projects.

Hadoop - Rack Awareness

Anyway, kicking off Hadoop’s rack awareness is no big deal in general. Here’s how to achieve this goal:

Put a small script, in whatever language you prefer, in a location of your choice that is accessible to the local Hadoop user on the namenode. The only requirement is that the script prints a record to stdout: the rack path for each address it receives as an argument. In this example I'm using a small Python script written by Vadim Zaliva, stored in the Hadoop user's home directory as /home/hadoop/topology.py:

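#!/usr/bin/env python

'''
This script is used by Hadoop to determine network/rack topology. It
should be specified in hadoop-site.xml via the topology.script.file.name
property:

<property>
 <name>topology.script.file.name</name>
 <value>/home/hadoop/topology.py</value>
</property>
'''

from __future__ import print_function  # keeps print() working on Python 2

import sys

# Rack reported for any address that is not listed in RACK_MAP.
DEFAULT_RACK = '/default/rack0'

# Maps each datanode IP address to its /datacenter/rack path.
RACK_MAP = { '10.72.10.1' : '/datacenter0/rack0',

             '10.112.110.26' : '/datacenter1/rack0',
             '10.112.110.27' : '/datacenter1/rack0',
             '10.112.110.28' : '/datacenter1/rack0',

             '10.2.5.1' : '/datacenter2/rack0',
             '10.2.10.1' : '/datacenter2/rack1'
    }

if len(sys.argv) == 1:
    # Called without arguments: report the default rack.
    print(DEFAULT_RACK)
else:
    # One rack path per argument, space-separated, in argument order.
    print(" ".join(RACK_MAP.get(i, DEFAULT_RACK) for i in sys.argv[1:]))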
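To make the calling convention concrete, here is a minimal sketch of the same lookup wrapped in a function, so the input/output contract can be tried without a running cluster. The function name resolve_racks is made up for illustration, and the map is a shortened copy of the sample map from the script; the addresses are sample values, not real hosts.

```python
# Illustrative sketch only: resolve_racks is a hypothetical helper name,
# and RACK_MAP is a shortened copy of the sample map from the script.
DEFAULT_RACK = '/default/rack0'

RACK_MAP = {
    '10.72.10.1': '/datacenter0/rack0',
    '10.112.110.26': '/datacenter1/rack0',
    '10.2.10.1': '/datacenter2/rack1',
}


def resolve_racks(addresses):
    """Return one rack path per address, in the same order as the input."""
    return [RACK_MAP.get(address, DEFAULT_RACK) for address in addresses]


# A known address maps to its rack; an unknown one falls back to the default.
racks = resolve_racks(['10.72.10.1', '10.9.9.9'])
print(" ".join(racks))  # prints: /datacenter0/rack0 /default/rack0
```

The order of the output matters: each rack path is matched to the argument in the same position.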

Then add a property directive to the hadoop-site.xml you're using for your cluster's configuration. (Newer Hadoop releases read this setting from core-site.xml under the name net.topology.script.file.name.)

<property>
 <name>topology.script.file.name</name>
 <value>/home/hadoop/topology.py</value>
</property>
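It can be worth running the script by hand, the way a caller would, before relying on it. The sketch below is one possible sanity check and not part of Hadoop: check_topology_command is a hypothetical helper that runs a command with a few addresses and verifies that it prints exactly one rack path per address, each starting with a slash like the entries in the sample map. The script file will most likely also need its executable bit set for the local Hadoop user.

```python
import subprocess


def check_topology_command(command, addresses):
    """Run `command` with `addresses` appended and validate what it prints.

    `command` is a list such as ['/home/hadoop/topology.py'].
    Returns a dict mapping each address to the rack path reported for it.
    """
    output = subprocess.check_output(list(command) + list(addresses))
    racks = output.decode('utf-8').split()

    if len(racks) != len(addresses):
        raise ValueError('expected %d rack paths, got %d: %r'
                         % (len(addresses), len(racks), racks))
    for rack in racks:
        if not rack.startswith('/'):
            raise ValueError('rack path does not start with "/": %r' % rack)

    return dict(zip(addresses, racks))


# Example call (commented out because it needs the script to be in place):
# print(check_topology_command(['/home/hadoop/topology.py'],
#                              ['10.72.10.1', '10.2.5.1']))
```

Passing the command as a list keeps the helper usable with an explicit interpreter too, for example ['python', '/home/hadoop/topology.py'].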

Simply restart the namenode process. From now on, every time a new datanode tries to join the cluster, the namenode runs the script and looks up the record for that datanode.
Keep in mind that connectivity between multiple locations (via VPN or otherwise) and proper DNS resolution are your business, not Hadoop's. Make sure that each datanode's DNS record can be resolved and that the datanode is reachable within your Hadoop environment.
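A quick way to catch resolution problems early is to try resolving every datanode name from the namenode host. The following is a minimal sketch under assumptions: the hostnames are placeholders, find_unresolvable is a hypothetical helper name, and the resolver is passed in as a parameter (defaulting to the standard library's socket.gethostbyname) so the check can be exercised without real DNS.

```python
import socket


def find_unresolvable(hostnames, resolver=socket.gethostbyname):
    """Return the hostnames that `resolver` fails to turn into an address."""
    failed = []
    for hostname in hostnames:
        try:
            resolver(hostname)
        except socket.error:  # covers socket.gaierror for failed lookups
            failed.append(hostname)
    return failed


# Example call with placeholder names (commented out: it needs real DNS):
# print(find_unresolvable(['datanode1.example.com', 'datanode2.example.com']))
```

An empty result means every name resolved from the machine the check ran on; it says nothing about whether the datanodes are reachable over the network.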
