<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Matejunkie &#187; apache</title>
	<atom:link href="http://www.matejunkie.com/tag/apache/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.matejunkie.com</link>
	<description>&#34;Look behind you, a Three-Headed Monkey!&#34;</description>
	<lastBuildDate>Thu, 07 Jan 2010 14:26:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>How to kick off Hadoop&#8217;s rack awareness</title>
		<link>http://www.matejunkie.com/how-to-kick-off-hadoops-rack-awareness/</link>
		<comments>http://www.matejunkie.com/how-to-kick-off-hadoops-rack-awareness/#comments</comments>
		<pubDate>Tue, 21 Jul 2009 22:31:39 +0000</pubDate>
		<dc:creator>Mike Adolphs</dc:creator>
				<category><![CDATA[Binary Talks]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[distributed data storage]]></category>
		<category><![CDATA[framework]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HDFS]]></category>
		<category><![CDATA[namenode]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[rack awareness]]></category>

		<guid isPermaLink="false">http://www.matejunkie.com/?p=1198</guid>
		<description><![CDATA[Hadoop, an Open Source framework for reliable, scalable, distributed computing and data storage, has a nice feature called rack awareness. This means nothing more than that you&#8217;re able to widely spread your Hadoop cluster over multiple machines within different racks and even different data centers that are worlds apart from each other. Sadly this isn&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://hadoop.apache.org/" title="Hadoop">Hadoop</a>, an Open Source framework for reliable, scalable, distributed computing and data storage, has a nice feature called rack awareness. It lets you tell Hadoop which rack, and even which data center, each machine of your cluster lives in, so a cluster can be spread over multiple racks and over data centers that are worlds apart from each other. Sadly, like almost everything regarding Hadoop, this isn&#8217;t well documented: the project is under heavy development, and few people actually work with it compared to other huge Open Source projects.</p>
<p style="text-align: left;"><a href="http://www.matejunkie.com/wp-content/uploads/2009/07/hadoop_rack_awareness_example_01.png"><img class="aligncenter size-full wp-image-1199" title="Hadoop - Rack Awareness" src="http://www.matejunkie.com/wp-content/uploads/2009/07/hadoop_rack_awareness_example_01.png" alt="Hadoop - Rack Awareness" width="500" height="337" /></a></p>
<p>Anyway, kicking off Hadoop&#8217;s rack awareness is no big deal. Here&#8217;s how to do it:</p>
<p>Place a small script, written in whatever language you prefer, in a location of your choice that is accessible to the local Hadoop user on the namenode. The only requirement is that the script prints a record to stdout: one rack path for each address it receives as an argument. In this example I&#8217;m using a small Python script written <a href="http://www.nabble.com/Re%3A-Hadoop-topology.script.file.name-Form-p22588620.html" title="Nabble - Hadoop Core Users">by Vadim Zaliva</a>, stored in the Hadoop user&#8217;s home directory under /home/hadoop:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python</span>
&nbsp;
<span style="color: #483d8b;">''</span><span style="color: #483d8b;">'
This script is used by Hadoop to determine the network/rack topology.  It
should be specified in hadoop-site.xml via the topology.script.file.name
property.
&nbsp;
 topology.script.file.name
 /home/hadoop/topology.py
&nbsp;
'</span><span style="color: #483d8b;">''</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span>
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">string</span> <span style="color: #ff7700;font-weight:bold;">import</span> join
&nbsp;
DEFAULT_RACK = <span style="color: #483d8b;">'/default/rack0'</span>
&nbsp;
RACK_MAP = <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'10.72.10.1'</span> : <span style="color: #483d8b;">'/datacenter0/rack0'</span>,
&nbsp;
             <span style="color: #483d8b;">'10.112.110.26'</span> : <span style="color: #483d8b;">'/datacenter1/rack0'</span>,
             <span style="color: #483d8b;">'10.112.110.27'</span> : <span style="color: #483d8b;">'/datacenter1/rack0'</span>,
             <span style="color: #483d8b;">'10.112.110.28'</span> : <span style="color: #483d8b;">'/datacenter1/rack0'</span>,
&nbsp;
             <span style="color: #483d8b;">'10.2.5.1'</span> : <span style="color: #483d8b;">'/datacenter2/rack0'</span>,
             <span style="color: #483d8b;">'10.2.10.1'</span> : <span style="color: #483d8b;">'/datacenter2/rack1'</span>
    <span style="color: black;">&#125;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#41;</span>==<span style="color: #ff4500;">1</span>:
    <span style="color: #ff7700;font-weight:bold;">print</span> DEFAULT_RACK
<span style="color: #ff7700;font-weight:bold;">else</span>:
    <span style="color: #ff7700;font-weight:bold;">print</span> join<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span>RACK_MAP.<span style="color: black;">get</span><span style="color: black;">&#40;</span>i, DEFAULT_RACK<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#93;</span>,<span style="color: #483d8b;">&quot; &quot;</span><span style="color: black;">&#41;</span></pre></div></div>
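
<p>The script above is written for Python 2: it uses the <code>print</code> statement and <code>string.join</code>. As a minimal sketch, assuming a namenode where only Python 3 is available, the same lookup could be written as follows. The <code>racks_for</code> helper name is purely illustrative and not part of the original script; the addresses are the same example values as above:</p>

```python
#!/usr/bin/env python3
"""Python 3 sketch of the topology script above (not the original author's code)."""
import sys

DEFAULT_RACK = "/default/rack0"

# Same example addresses as in the script above; replace them with your own nodes.
RACK_MAP = {
    "10.72.10.1": "/datacenter0/rack0",
    "10.112.110.26": "/datacenter1/rack0",
    "10.112.110.27": "/datacenter1/rack0",
    "10.112.110.28": "/datacenter1/rack0",
    "10.2.5.1": "/datacenter2/rack0",
    "10.2.10.1": "/datacenter2/rack1",
}


def racks_for(addresses):
    """Return one rack path per address, falling back to DEFAULT_RACK."""
    return [RACK_MAP.get(address, DEFAULT_RACK) for address in addresses]


if __name__ == "__main__":
    # With no arguments, print the default rack, as the original script does.
    args = sys.argv[1:]
    print(" ".join(racks_for(args)) if args else DEFAULT_RACK)
```

<p>Keeping the lookup in a function of its own makes it easy to test the mapping without going through the command line.</p>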

<p>Then add a property to the hadoop-site.xml you&#8217;re using for your cluster&#8217;s configuration:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;"><span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
 <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>topology.script.file.name<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
 <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>/home/hadoop/topology.py<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/value<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/property<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p>Simply restart the namenode process. From now on, the namenode runs the script and looks up the datanode&#8217;s record every time a new datanode tries to join the cluster.<br />
Keep in mind that connecting multiple locations (via VPN or otherwise) and proper DNS resolution are your business, not Hadoop&#8217;s. Make sure that each datanode&#8217;s DNS record resolves and that the datanode is reachable from within your Hadoop environment.</p>
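
<p>A quick way to check the DNS side is to try resolving every datanode name from the namenode. This is a rough sketch using Python&#8217;s standard <code>socket</code> module; <code>unresolvable</code> is a made-up helper name, and the host names in the usage line below are placeholders:</p>

```python
import socket


def unresolvable(hostnames, resolver=socket.gethostbyname):
    """Return the hostnames that the resolver cannot turn into an address."""
    failed = []
    for name in hostnames:
        try:
            resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed
```

<p>Calling <code>unresolvable(["datanode01.example.com", "datanode02.example.com"])</code> on the namenode returns the names that failed to resolve. An empty list only means DNS is fine for those hosts; it says nothing about whether the link between your locations actually carries Hadoop&#8217;s traffic.</p>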
]]></content:encoded>
			<wfw:commentRss>http://www.matejunkie.com/how-to-kick-off-hadoops-rack-awareness/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
