
Showing posts from February, 2019

HDFS

(Notes from Hadoop: The Definitive Guide, 4th edition.) Hadoop is written in Java. HDFS is good for very large files, streaming data access, and commodity hardware. It is not suitable for low-latency applications, lots of small files, or multiple writers and arbitrary file modifications.

A disk block is different from a file system block: the blocks of a (single-disk) file system consist of multiple disk blocks, and a disk block is typically 512 bytes. HDFS also has a concept of a block, but it is much larger than a disk block or a regular single-disk file system block. The HDFS block size is large so that the time spent transferring data dominates the time spent seeking. Map tasks in MapReduce normally operate on one block at a time.

Why HDFS blocks? A file can be larger than any single disk; fixed-size blocks simplify storage management; and blocks fit well with replication.

Block cache: frequently used file blocks can be cached in memory, administered by adding cache directives to cache pools.

Namenode/datanode => master/slave. Namenode: (master…
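To make the block discussion concrete, here is a minimal Scala sketch against the standard Hadoop FileSystem API that prints a file's block size and the datanodes holding each block's replicas; the path /data/sample.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockInfo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()                        // picks up core-site.xml / hdfs-site.xml
    val fs   = FileSystem.get(conf)
    val path = new Path("/data/sample.txt")               // hypothetical file

    val status = fs.getFileStatus(path)
    println(s"Block size: ${status.getBlockSize} bytes")

    // One BlockLocation per HDFS block, with the hosts holding its replicas.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { bl =>
      println(s"offset=${bl.getOffset} length=${bl.getLength} hosts=${bl.getHosts.mkString(",")}")
    }
    fs.close()
  }
}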

HBase: Distributed, Scalable Hadoop Database

HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Use Apache HBase™ when you need random, realtime read/write access to your Big Data. It is a NoSQL database (a key/value store). Conceptually HBase is very much like Google's Bigtable (a multidimensional hashmap); see https://research.google.com/archive/bigtable.html for more details.

It stores data by column families. You can add a new column qualifier (column family:qualifier) to an existing column family at any time. The full column name is "column family name:qualifier".

Row Key | Timestamp | Column Family1
R1      | t1        | CF1:q1=""
R1      | t2        | CF1:q1=""

Row Key | Timestamp | Column Family2
R1      | t3        | CF2:q2
R1      | t4        | CF2:q2

Row Key | Timestamp | Column Family3
R1      | t5        | CF3:q3

Column families must be declared up front at schema definition time, whereas columns do not need to be defined at schema time and can be conjured on the fly while the table is up and running. Physically…
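A minimal Scala sketch against the standard HBase client API shows the family:qualifier model in action; the table name "notes" and column family "CF1" are assumptions matching the example rows above, and the table is assumed to already exist.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseDemo {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("notes"))

    // Write cell CF1:q1 for row R1; the qualifier q1 needs no schema change.
    val put = new Put(Bytes.toBytes("R1"))
    put.addColumn(Bytes.toBytes("CF1"), Bytes.toBytes("q1"), Bytes.toBytes("v"))
    table.put(put)

    // Read it back; each cell is addressed by (row, family, qualifier, timestamp).
    val result = table.get(new Get(Bytes.toBytes("R1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("CF1"), Bytes.toBytes("q1"))))

    table.close(); conn.close()
  }
}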

Scala In 10 Minutes

- Pure object oriented: everything is an object, including numbers.
- Single-argument methods can be called as: obj method arg, obj method { arg }, obj.method(arg), or obj.method { arg }.
- Functions are objects too. Anonymous functions are supported.
- Classes can accept arguments (constructor parameters).
- Methods can be defined without any parameters (no parentheses).
- To override a method you must say so explicitly (the override modifier).
- scala.AnyRef is the superclass of all reference types.
- Case classes: special classes used in pattern-matching scenarios (they fill the role that algebraic data types play in other languages).
- A "trait" is like an interface that can also contain code. (Scala has abstract classes too.)
- Companion object: an object having the same name as a class. The class is then called the companion class. A companion class and its object can access each other's private members. (Static-style methods go in the companion object, instance methods in the companion class.)
- All operators that end in : are right-associative in Scala.
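These points are easiest to see in code; a small self-contained sketch:

object ScalaTour extends App {
  // Case classes act like algebraic-data-type constructors in pattern matching.
  sealed trait Shape
  case class Circle(r: Double) extends Shape
  case class Rect(w: Double, h: Double) extends Shape

  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }
  println(area(Rect(2, 3)))                 // 6.0

  // Single-argument methods can be called infix, with or without braces.
  println(List(1, 2, 3) map { _ * 10 })     // List(10, 20, 30)

  // Operators ending in ':' are right-associative: 0 :: xs calls xs.::(0).
  println(0 :: 1 :: Nil)                    // List(0, 1)

  // A companion object can see its companion class's private members.
  class Counter private (val n: Int)
  object Counter { def apply(): Counter = new Counter(0) } // factory, like a static method
  println(Counter().n)                      // 0
}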

Spark

1. System.exit(0); exits the spark shell.
2. You don't have to start the Spark cluster (EMR) with dynamicAllocation enabled; you can always start the spark-shell with --conf spark.dynamicAllocation.enabled=true. You can also pass --conf maximizeResourceAllocation=true.
3. You can customize logging for spark-submit by passing a custom log4j properties file (assuming you are running in client mode on a YARN cluster):
--files ../log4j.properties --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:../log4j.properties'
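The dynamic-allocation setting from note 2 can also be set in code when building the session; a minimal sketch (the app name is a placeholder, and classic dynamic allocation on YARN also needs the external shuffle service enabled):

import org.apache.spark.sql.SparkSession

object DynAllocDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dyn-alloc-demo")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")   // required by dynamic allocation on YARN
      .getOrCreate()

    println(spark.range(1000).count())                   // 1000
    spark.stop()
  }
}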

Amazon EMR

While trying to create an EMR cluster I got the following error: "Could not create cluster. The instance profile for the newly created default role is not yet visible. Please try after a few seconds." Tried again and again but no luck. Ultimately I deleted all the default EMR roles that were there and recreated them as follows (the full delete-and-recreate sequence is sketched below):

aws emr create-default-roles --profile admin --region us-east-1

After that I was able to create the cluster.

https://aws.amazon.com/blogs/big-data/optimize-amazon-emr-costs-with-idle-checks-and-automatic-resource-termination-using-advanced-amazon-cloudwatch-metrics-and-aws-lambda/

The master node does not need high compute power. Core nodes need both compute and storage; task nodes don't need storage. You cannot change the instance type for master and core nodes while the cluster is running, but you can do so for task nodes. By default EMR schedules jobs in such a way that even if any task node gets terminated, jobs can still…
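The excerpt above only shows the recreate step; a sketch of the full delete-and-recreate sequence for the two default roles might look like the following. It assumes the roles still have the standard AWS managed policies attached; verify the attached policies in your account before deleting anything.

# Detach the managed policy from the EMR service role, then delete the role
aws iam detach-role-policy --role-name EMR_DefaultRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole
aws iam delete-role --role-name EMR_DefaultRole

# The EC2 role also lives inside an instance profile, which must be emptied and removed first
aws iam remove-role-from-instance-profile --instance-profile-name EMR_EC2_DefaultRole \
    --role-name EMR_EC2_DefaultRole
aws iam delete-instance-profile --instance-profile-name EMR_EC2_DefaultRole
aws iam detach-role-policy --role-name EMR_EC2_DefaultRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role
aws iam delete-role --role-name EMR_EC2_DefaultRole

# Recreate both default roles (the command from the post)
aws emr create-default-roles --profile admin --region us-east-1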

Networking

Subnet mask: in the address 98.114.91.230/24, the /24 is the subnet mask. It says that the first 24 bits of the IP address are the subnet (network) address (98.114.91.0) and the remaining 8 bits are the host address (0.0.0.230). Addresses whose host portion is all ones or all zeros are reserved, so 98.114.91.0 and 98.114.91.255 are not valid host addresses (the first and last address on any subnet cannot be assigned to a host). /24 can also be represented as 255.255.255.0.

Network classes:
Class A: 255.0.0.0, first octet 0-127
Class B: 255.255.0.0, first octet 128-191
Class C: 255.255.255.0, first octet 192-223

Reference: https://support.microsoft.com/en-us/help/164015/understanding-tcp-ip-addressing-and-subnetting-basics
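The mask arithmetic is easy to check in code; a small Scala sketch that reproduces the 98.114.91.230/24 example above:

object Subnet {
  // Pack dotted-quad IPv4 into a 32-bit Int.
  def toInt(ip: String): Int =
    ip.split('.').foldLeft(0)((acc, octet) => (acc << 8) | octet.toInt)

  // Unpack a 32-bit Int back into dotted-quad form.
  def toDotted(n: Int): String =
    Seq(24, 16, 8, 0).map(s => (n >>> s) & 0xff).mkString(".")

  def main(args: Array[String]): Unit = {
    val ip   = toInt("98.114.91.230")
    val mask = -1 << (32 - 24)                // /24 prefix

    println(toDotted(mask))                   // 255.255.255.0
    println(toDotted(ip & mask))              // network part: 98.114.91.0
    println(toDotted(ip & ~mask))             // host part:    0.0.0.230
    println(toDotted((ip & mask) | ~mask))    // last address: 98.114.91.255
  }
}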

AWS RedShift

Columnar storage; parallel and distributed queries across multiple nodes; can use existing BI tools (standard SQL, JDBC, ODBC support); scale in/out and up/down; column-level compression; automated admin activities like snapshots/backups, etc.

-- Sample URLs
Host: lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com
Endpoint: lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com:5439/lab
JDBC: jdbc:redshift://lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com:5439/lab
ODBC: Driver={Amazon Redshift (x64)}; Server=lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com; Database=lab

-- Does not use indexes for sequential scans like a traditional database.

-- To see disk capacity on cluster nodes:
SELECT
  owner AS node,
  diskno,
  used,
  capacity,
  used/capacity::numeric * 100 AS percent_used
FROM stv_partitions
WHERE host = node
ORDER BY 1, 2;

-- To see table usage (blocks of 1 MB each):
SELECT name, count(*)
FROM stv_blocklist
JOIN (SELECT DISTINCT name, id AS tbl FROM stv_tbl_perm) ids ON ids.tbl = tbl
GROUP BY name;
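Since the JDBC URL above is the usual way in from code, here is a minimal Scala sketch that runs the disk-capacity query over JDBC; the user name and password are placeholders, and the Redshift JDBC driver jar is assumed to be on the classpath.

import java.sql.DriverManager

object RedshiftQuery {
  def main(args: Array[String]): Unit = {
    val url  = "jdbc:redshift://lab.cwfgigx176xf.us-west-2.redshift.amazonaws.com:5439/lab"
    val conn = DriverManager.getConnection(url, "awsuser", "password")  // placeholder credentials
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT owner AS node, diskno, used, capacity FROM stv_partitions ORDER BY 1, 2")
      while (rs.next())
        println(s"${rs.getInt("node")} ${rs.getInt("diskno")} ${rs.getLong("used")} ${rs.getLong("capacity")}")
    } finally conn.close()
  }
}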