Hadoop, an open source framework for reliable, scalable, distributed computing and data storage, has a nice feature called rack awareness. It simply means that you can spread your Hadoop cluster over machines in different racks, or even in different data centers that are worlds apart, and tell Hadoop where each machine sits. Sadly, like almost everything about Hadoop, this isn’t well documented, because the project is under heavy development and far fewer people work with it than with other large open source projects.
Anyway, kicking off Hadoop’s rack awareness is no big deal in general. Here’s how to achieve this goal:
Put a small script, written in whatever language you prefer, in a location of your choice that the local Hadoop user on the namenode can access. The only requirement is that the script prints a rack record to stdout for each address it is given. In this example I’m using a small Python script written by Vadim Zaliva, stored in the Hadoop user’s home directory under /home/hadoop:
    #!/usr/bin/env python
    '''
    This script used by hadoop to determine network/rack topology.
    It should be specified in hadoop-site.xml via the
    topology.script.file.name property:

    topology.script.file.name
    /home/hadoop/topology.py
    '''
    import sys

    # Rack reported for any address that is not listed in RACK_MAP.
    DEFAULT_RACK = '/default/rack0'

    # Datanode IP address -> rack path.
    RACK_MAP = {
        '10.72.10.1': '/datacenter0/rack0',
        '10.112.110.26': '/datacenter1/rack0',
        '10.112.110.27': '/datacenter1/rack0',
        '10.112.110.28': '/datacenter1/rack0',
        '10.2.5.1': '/datacenter2/rack0',
        '10.2.10.1': '/datacenter2/rack1',
    }

    if len(sys.argv) == 1:
        # No addresses given: report the default rack.
        print(DEFAULT_RACK)
    else:
        # One rack path per argument, separated by spaces, in argument order.
        print(" ".join([RACK_MAP.get(i, DEFAULT_RACK) for i in sys.argv[1:]]))
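Before wiring the script into Hadoop, it can help to check the lookup on its own. The following is only a minimal sketch: the function name resolve_racks is made up for illustration, and the map is a trimmed copy of the one in the script above.

```python
# Hypothetical helper mirroring the lookup in topology.py (not part of Hadoop).
DEFAULT_RACK = '/default/rack0'

# Trimmed copy of the IP -> rack mapping from the script above.
RACK_MAP = {
    '10.72.10.1': '/datacenter0/rack0',
    '10.112.110.26': '/datacenter1/rack0',
    '10.2.10.1': '/datacenter2/rack1',
}


def resolve_racks(addresses):
    """Return one rack path per address, in order.

    Addresses missing from RACK_MAP fall back to DEFAULT_RACK.
    """
    return [RACK_MAP.get(addr, DEFAULT_RACK) for addr in addresses]


if __name__ == '__main__':
    # Same output format as the script: rack paths separated by spaces.
    print(" ".join(resolve_racks(['10.72.10.1', '192.168.0.99'])))
```

An address that is missing from the map silently lands in the default rack, so a typo in an IP address will not raise an error. It is worth checking the output for every datanode you expect.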
Then add a property directive to the hadoop-site.xml you’re using for your cluster’s configuration:
    <property>
      <name>topology.script.file.name</name>
      <value>/home/hadoop/topology.py</value>
    </property>
Simply restart the namenode process. From now on, the namenode runs the script and looks up the record for a datanode every time a new datanode tries to join the cluster.
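To get a feel for what happens at that point, here is a rough simulation of the call, not Hadoop's own code. It assumes the namenode passes addresses as command-line arguments and reads whitespace-separated rack paths back from stdout. The stand-in script, run_topology_script and demo are all hypothetical names used only for this sketch.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in topology script with a trimmed-down, hypothetical mapping.
SCRIPT = '''
import sys
DEFAULT_RACK = '/default/rack0'
RACK_MAP = {'10.72.10.1': '/datacenter0/rack0', '10.2.10.1': '/datacenter2/rack1'}
print(" ".join(RACK_MAP.get(a, DEFAULT_RACK) for a in sys.argv[1:]))
'''


def run_topology_script(script_path, addresses):
    """Run the script with the addresses as arguments; return {address: rack}."""
    output = subprocess.check_output([sys.executable, script_path] + list(addresses))
    racks = output.decode().split()
    if len(racks) != len(addresses):
        # The script must answer with exactly one rack path per address.
        raise ValueError("got %d racks for %d addresses" % (len(racks), len(addresses)))
    return dict(zip(addresses, racks))


def demo():
    """Write the stand-in script to a temp file, query it, then clean up."""
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as handle:
        handle.write(SCRIPT)
        path = handle.name
    try:
        return run_topology_script(path, ['10.72.10.1', '10.9.9.9'])
    finally:
        os.remove(path)


if __name__ == '__main__':
    print(demo())
```

Running your real script by hand in the same way, as the Hadoop user on the namenode, is a quick check that it is executable and prints one rack per address.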
Keep in mind that connectivity between multiple locations, via VPN or otherwise, and proper DNS resolution are your business, not Hadoop’s. Make sure that each datanode’s DNS record can be resolved and that the datanode is reachable within your Hadoop environment.
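A quick way to check the DNS side is to try resolving every datanode name from the namenode. This is only a sketch: the hostnames are placeholders and unresolvable_hosts is a made-up helper. It also only tests name resolution, not whether the host is actually reachable.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that `resolve` could not turn into an address.

    `resolve` defaults to the system resolver; it is a parameter so the
    check can be tried out without touching the network.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except socket.error:
            # socket.gaierror (lookup failure) is a subclass of socket.error.
            failed.append(name)
    return failed


if __name__ == '__main__':
    # Placeholder names; replace them with your own datanode hostnames.
    datanodes = ['datanode1.example.com', 'datanode2.example.com']
    for name in unresolvable_hosts(datanodes):
        print("cannot resolve %s" % name)
```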

Is the script actually run on the NN when a datanode tries to participate? I’m having problems changing rackids on specific datanodes. I’ve updated the script, wiped the dfs directories and re-added the datanode (both by restarting manually from the dn, and also by running start-dfs.sh on the NN on a running cluster), and it still shows in the old rackid…
Thoughts?