Hadoop

* Good article on MapReduce
http://www.bigsynapse.com/mapreduce-internals

* org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> is a generic class. There is also a generic Mapper interface, but it lives in a different package, org.apache.hadoop.mapred.Mapper<K1,V1,K2,V2> (the older API), and has nothing to do with the Mapper class.

The Reducer is also a generic class similar to Mapper.

* Hadoop also has two special annotations that indicate the audience and stability of any interface or class:
- InterfaceAudience
- InterfaceStability

* All key and value types in Hadoop MapReduce programming must implement the Writable interface (key types additionally need to be comparable, via WritableComparable). As a result, each type implements its own efficient serialization using DataInput/DataOutput.
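To make the Writable contract a bit more concrete, here is a minimal sketch that mirrors its two methods, write(DataOutput) and readFields(DataInput), using only java.io. The class name PointRecord, its fields and the toBytes/fromBytes helpers are hypothetical. The class deliberately does not implement org.apache.hadoop.io.Writable so that it compiles without the Hadoop jars; in a real job you would add "implements Writable".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type. It copies the method signatures of Hadoop's
// Writable interface but has no dependency on Hadoop itself.
final class PointRecord {
    private int id;
    private double amount;

    // Writable types are normally created empty and then filled by readFields,
    // so a no-argument constructor is kept here.
    PointRecord() { }

    PointRecord(int id, double amount) {
        this.id = id;
        this.amount = amount;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeDouble(amount);
    }

    // Deserialize the fields in exactly the same order as write().
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        amount = in.readDouble();
    }

    int getId() { return id; }

    double getAmount() { return amount; }

    // Helper (not part of Writable): serialize a record to a byte array.
    static byte[] toBytes(PointRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Helper (not part of Writable): rebuild a record from a byte array.
    static PointRecord fromBytes(byte[] bytes) {
        PointRecord record = new PointRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new PointRecord(7, 12.5));
        PointRecord copy = fromBytes(bytes);
        // An int (4 bytes) plus a double (8 bytes) gives 12 bytes.
        System.out.println(bytes.length + " bytes, id=" + copy.getId() + ", amount=" + copy.getAmount());
    }
}
```

The serialized form contains only the raw field values, with no class name or field metadata, which is where the compactness comes from.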

* The command hadoop fs -put input/sample.txt hdfs://quickstart.cloudera:8020/sample.txt returned "ConnectException" because the namenode service was not running. I used the following commands to check the status and restart the service:

sudo service hadoop-hdfs-namenode status
sudo service hadoop-hdfs-namenode restart

* Turning off safe mode

sudo -u hdfs hadoop dfsadmin -safemode leave

The namenode starts in safe mode. Block replication does not start while it is in safe mode. Datanodes send blockreport messages to the namenode.

Once certain conditions on these blockreport messages are met, the namenode exits safe mode and starts replicating the blocks that are under-replicated.
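The exit condition can be pictured roughly as a ratio check. The sketch below is a simplified model, not Hadoop's actual code: the class SafeModeCheck and its method are hypothetical, and the 0.999 threshold reflects what I understand to be the default of dfs.namenode.safemode.threshold-pct (check hdfs-default.xml for your version). The real namenode applies further conditions that are not modelled here.

```java
// Simplified, hypothetical model of the safe mode exit check.
final class SafeModeCheck {
    private SafeModeCheck() { }

    // Returns true when the fraction of blocks reported by datanodes
    // has reached the configured threshold.
    static boolean thresholdReached(long reportedBlocks, long totalBlocks, double thresholdPct) {
        if (totalBlocks == 0) {
            return true; // an empty file system has no blocks to wait for
        }
        return (double) reportedBlocks / totalBlocks >= thresholdPct;
    }

    public static void main(String[] args) {
        double threshold = 0.999; // assumed default, see the note above
        System.out.println(thresholdReached(500, 1000, threshold));  // false
        System.out.println(thresholdReached(1000, 1000, threshold)); // true
    }
}
```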

* If no datanode is running, a Hadoop MapReduce job throws an exception, so make sure at least one datanode is running. At least, that is what happened to me.

* If you see a message like the following:

ipc.Client: Retrying connect to server: quickstart.cloudera/127.0.0.1:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

Port 8032 indicates the ResourceManager (it is its default client port), which is probably not running. Start the YARN service as follows:

sudo service hadoop-yarn-resourcemanager start

* Fencing : The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption—a method known as fencing ( from Tom White's Hadoop Definitive Guide )

STONITH (Shoot The Other Node In The Head) is one technique used for fencing. It forcibly powers down the host machine.

* The Cloudera QuickStart VM sometimes does not start all the services, so if you have problems connecting to the namenode, datanode, Hue, Hive, YARN, etc., look up the service name under /etc/init.d and check whether the service is running.

* Nice link which talks about MapR and Linux version compatibility

http://maprdocs.mapr.com/home/InteropMatrix/r_os_matrix.html

* Hive expects a leading negative sign for numbers. It cannot handle a trailing negative sign. This is a huge problem when importing data from financial systems such as SAP or POS systems, which at times use a trailing negative sign for reporting convenience.
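One possible workaround is to normalize such values before they reach Hive. The sketch below is only an illustration: the class TrailingSign and its normalize method are hypothetical names, and it assumes the values arrive as plain strings such as "123.45-". Values that do not end in a minus sign are returned trimmed but otherwise unchanged.

```java
import java.math.BigDecimal;

// Hypothetical pre-processing helper for numbers with a trailing minus sign.
final class TrailingSign {
    private TrailingSign() { }

    // Moves a trailing '-' to the front, e.g. "123.45-" becomes "-123.45".
    static String normalize(String raw) {
        String value = raw.trim();
        boolean trailingMinus = value.length() > 1
                && value.endsWith("-")
                && !value.startsWith("-");
        if (trailingMinus) {
            return "-" + value.substring(0, value.length() - 1).trim();
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(normalize("123.45-")); // -123.45
        System.out.println(normalize("42"));      // 42
        // After normalization the value parses as an ordinary negative number.
        System.out.println(new BigDecimal(normalize("0.50-")).signum()); // -1
    }
}
```

The same logic could run in whatever step stages the export files, so that the data Hive reads already carries leading signs.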

* YARN - Yet Another Resource Negotiator. YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.

* OLTP - Online Transaction Processing (both reads and writes). OLAP - Online Analytical Processing (fewer writes, more reads).

* High Performance Computing - Distributes the workload across the machines in a cluster while sharing the file system. This becomes an issue when each node needs to read a large volume of data. Data locality is at the heart of data processing in the Hadoop ecosystem.

* Edge Node - Edge nodes (also called gateway nodes) are the interface between the Hadoop cluster and the outside network. See good explanation at link. These nodes are used for running client apps and cluster admin tools.
