
Showing posts from February, 2019

HDFS

(Notes from Hadoop: The Definitive Guide, 4th edition.) Hadoop is written in Java. HDFS is good for very large files, streaming data access, and commodity hardware. It is not suitable for low-latency applications, lots of small files, or multiple writers and arbitrary file modifications.

A disk block is different from a file system block: the blocks of a (single-disk) file system consist of multiple disk blocks, and a disk block is typically 512 bytes. HDFS also has a concept of a block, but it is much larger than a disk block or a regular single-disk file system block. The HDFS block size is large so that the time spent transferring data dominates the time spent seeking. Map tasks in MapReduce normally operate on one block at a time.

Why HDFS blocks? A file can be larger than any single disk; fixed-size blocks simplify storage management; and blocks fit well with replication.

Block cache: frequently used file blocks can be cached in memory, administered by adding cache directives to cache pools.

Namenode/datanode => master/slave. Namenode: (master…
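To make the block discussion concrete, here is a minimal Scala sketch against the standard Hadoop FileSystem API that prints a file's block size and the datanodes holding each block's replicas; the path /data/sample.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockInfo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()                        // picks up core-site.xml / hdfs-site.xml
    val fs   = FileSystem.get(conf)
    val path = new Path("/data/sample.txt")               // hypothetical file

    val status = fs.getFileStatus(path)
    println(s"Block size: ${status.getBlockSize} bytes")

    // One BlockLocation per HDFS block, with the hosts holding its replicas.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { bl =>
      println(s"offset=${bl.getOffset} length=${bl.getLength} hosts=${bl.getHosts.mkString(",")}")
    }
    fs.close()
  }
}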

HBase: Distributed, Scalable Hadoop Database

HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Use Apache HBase™ when you need random, realtime read/write access to your Big Data. It is a NoSQL database (a key/value store). Conceptually HBase is very much like Google's Bigtable (a multidimensional hashmap); see https://research.google.com/archive/bigtable.html for more details.

It stores data by column families. You can add a new column qualifier (column family:qualifier) to an existing column family at any time. The full column name is "column family name:qualifier".

Row Key | Timestamp | Column Family1
R1      | t1        | CF1:q1=""
R1      | t2        | CF1:q1=""

Row Key | Timestamp | Column Family2
R1      | t3        | CF2:q2
R1      | t4        | CF2:q2

Row Key | Timestamp | Column Family3
R1      | t5        | CF3:q3

Column families must be declared up front at schema definition time, whereas columns do not need to be defined at schema time and can be conjured on the fly while the table is up and running. Physically…
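A minimal Scala sketch against the standard HBase client API shows the family:qualifier model in action; the table name "notes" and column family "CF1" are assumptions matching the example rows above, and the table is assumed to already exist.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseDemo {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("notes"))

    // Write cell CF1:q1 for row R1; the qualifier q1 needs no schema change.
    val put = new Put(Bytes.toBytes("R1"))
    put.addColumn(Bytes.toBytes("CF1"), Bytes.toBytes("q1"), Bytes.toBytes("v"))
    table.put(put)

    // Read it back; each cell is addressed by (row, family, qualifier, timestamp).
    val result = table.get(new Get(Bytes.toBytes("R1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("CF1"), Bytes.toBytes("q1"))))

    table.close(); conn.close()
  }
}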

Scala In 10 Minutes

- Pure object oriented: everything is an object, including numbers.
- Single-argument methods can be called as: obj method arg, obj method { arg }, obj.method(arg), or obj.method { arg }.
- Functions are objects too. Anonymous functions are supported.
- Classes can accept arguments (constructor parameters).
- Methods can be defined without any parameters (no parentheses).
- To override a method you must say so explicitly (the override modifier).
- scala.AnyRef is the superclass of all reference types.
- Case classes: special classes used in pattern-matching scenarios (they fill the role that algebraic data types play in other languages).
- A "trait" is like an interface that can also contain code. (Scala has abstract classes too.)
- Companion object: an object having the same name as a class. The class is then called the companion class. A companion class and its object can access each other's private members. (Static-style methods go in the companion object, instance methods in the companion class.)
- All operators that end in : are right-associative in Scala.
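These points are easiest to see in code; a small self-contained sketch:

object ScalaTour extends App {
  // Case classes act like algebraic-data-type constructors in pattern matching.
  sealed trait Shape
  case class Circle(r: Double) extends Shape
  case class Rect(w: Double, h: Double) extends Shape

  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }
  println(area(Rect(2, 3)))                 // 6.0

  // Single-argument methods can be called infix, with or without braces.
  println(List(1, 2, 3) map { _ * 10 })     // List(10, 20, 30)

  // Operators ending in ':' are right-associative: 0 :: xs calls xs.::(0).
  println(0 :: 1 :: Nil)                    // List(0, 1)

  // A companion object can see its companion class's private members.
  class Counter private (val n: Int)
  object Counter { def apply(): Counter = new Counter(0) } // factory, like a static method
  println(Counter().n)                      // 0
}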

Spark

1. System.exit(0); exits the spark shell.
2. You don't have to start the Spark cluster (EMR) with dynamicAllocation enabled; you can always start the spark-shell with --conf spark.dynamicAllocation.enabled=true. You can also pass --conf maximizeResourceAllocation=true.
3. You can customize logging for spark-submit by passing a custom log4j properties file (assuming you are running in client mode on a YARN cluster):
--files ../log4j.properties --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:../log4j.properties'
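The dynamic-allocation setting from note 2 can also be set in code when building the session; a minimal sketch (the app name is a placeholder, and classic dynamic allocation on YARN also needs the external shuffle service enabled):

import org.apache.spark.sql.SparkSession

object DynAllocDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dyn-alloc-demo")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")   // required by dynamic allocation on YARN
      .getOrCreate()

    println(spark.range(1000).count())                   // 1000
    spark.stop()
  }
}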

Amazon EMR

While trying to create an EMR cluster I got the following error: "Could not create cluster. The instance profile for the newly created default role is not yet visible. Please try after a few seconds." Tried again and again but no luck. Ultimately I deleted all the default EMR roles that were there and recreated them as follows (the full delete-and-recreate sequence is sketched below):

aws emr create-default-roles --profile admin --region us-east-1

After that I was able to create the cluster.

https://aws.amazon.com/blogs/big-data/optimize-amazon-emr-costs-with-idle-checks-and-automatic-resource-termination-using-advanced-amazon-cloudwatch-metrics-and-aws-lambda/

The master node does not need high compute power. Core nodes need both compute and storage; task nodes don't need storage. You cannot change the instance type for master and core nodes while the cluster is running, but you can do so for task nodes. By default EMR schedules jobs in such a way that even if any task node gets terminated, jobs can still…
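The excerpt above only shows the recreate step; a sketch of the full delete-and-recreate sequence for the two default roles might look like the following. It assumes the roles still have the standard AWS managed policies attached; verify the attached policies in your account before deleting anything.

# Detach the managed policy from the EMR service role, then delete the role
aws iam detach-role-policy --role-name EMR_DefaultRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole
aws iam delete-role --role-name EMR_DefaultRole

# The EC2 role also lives inside an instance profile, which must be emptied and removed first
aws iam remove-role-from-instance-profile --instance-profile-name EMR_EC2_DefaultRole \
    --role-name EMR_EC2_DefaultRole
aws iam delete-instance-profile --instance-profile-name EMR_EC2_DefaultRole
aws iam detach-role-policy --role-name EMR_EC2_DefaultRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role
aws iam delete-role --role-name EMR_EC2_DefaultRole

# Recreate both default roles (the command from the post)
aws emr create-default-roles --profile admin --region us-east-1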

Networking

Subnet mask: in the address 98.114.91.230/24, the /24 is the subnet mask. It says that the first 24 bits of the IP address are the subnet (network) address (98.114.91.0) and the remaining 8 bits are the host address (0.0.0.230). Addresses whose host portion is all ones or all zeros are reserved, so 98.114.91.0 and 98.114.91.255 are not valid host addresses (the first and last address on any subnet cannot be assigned to a host). /24 can also be represented as 255.255.255.0.

Network classes:
Class A: 255.0.0.0, first octet 0-127
Class B: 255.255.0.0, first octet 128-191
Class C: 255.255.255.0, first octet 192-223

Reference: https://support.microsoft.com/en-us/help/164015/understanding-tcp-ip-addressing-and-subnetting-basics
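The mask arithmetic is easy to check in code; a small Scala sketch that reproduces the 98.114.91.230/24 example above:

object Subnet {
  // Pack dotted-quad IPv4 into a 32-bit Int.
  def toInt(ip: String): Int =
    ip.split('.').foldLeft(0)((acc, octet) => (acc << 8) | octet.toInt)

  // Unpack a 32-bit Int back into dotted-quad form.
  def toDotted(n: Int): String =
    Seq(24, 16, 8, 0).map(s => (n >>> s) & 0xff).mkString(".")

  def main(args: Array[String]): Unit = {
    val ip   = toInt("98.114.91.230")
    val mask = -1 << (32 - 24)                // /24 prefix

    println(toDotted(mask))                   // 255.255.255.0
    println(toDotted(ip & mask))              // network part: 98.114.91.0
    println(toDotted(ip & ~mask))             // host part:    0.0.0.230
    println(toDotted((ip & mask) | ~mask))    // last address: 98.114.91.255
  }
}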

AWS RedShift

Columnar storage; parallel and distributed queries across multiple nodes; can use existing BI tools (standard SQL, JDBC, ODBC support); scale in/out and up/down; column-level compression; automated admin activities like snapshots/backups, etc.

-- Sample URLs
Host: lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com
Endpoint: lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com:5439/lab
JDBC: jdbc:redshift://lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com:5439/lab
ODBC: Driver={Amazon Redshift (x64)}; Server=lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com; Database=lab

-- Does not use indexes for sequential scans like a traditional database.

-- To see disk capacity on cluster nodes:
SELECT
  owner AS node,
  diskno,
  used,
  capacity,
  used/capacity::numeric * 100 AS percent_used
FROM stv_partitions
WHERE host = node
ORDER BY 1, 2;

-- To see table usage (blocks of 1 MB each):
SELECT name, count(*)
FROM stv_blocklist
JOIN (SELECT DISTINCT name, id AS tbl FROM stv_tbl_perm) ids ON ids.tbl = tbl
GROUP BY name;
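Since the JDBC URL above is the usual way in from code, here is a minimal Scala sketch that runs the disk-capacity query over JDBC; the user name and password are placeholders, and the Redshift JDBC driver jar is assumed to be on the classpath.

import java.sql.DriverManager

object RedshiftQuery {
  def main(args: Array[String]): Unit = {
    val url  = "jdbc:redshift://lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com:5439/lab"
    val conn = DriverManager.getConnection(url, "awsuser", "password")  // placeholder credentials
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT owner AS node, diskno, used, capacity FROM stv_partitions ORDER BY 1, 2")
      while (rs.next())
        println(s"${rs.getInt("node")} ${rs.getInt("diskno")} ${rs.getLong("used")} ${rs.getLong("capacity")}")
    } finally conn.close()
  }
}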