Hbase - Distributed , Scalable,Hadoop Database

HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Use Apache HBase™ when you need random, realtime read/write access to your Big Data. It is a NoSQL database ( it is a key/value store ) 

Conceptually HBase is very much like Google's Bigtable ( multidimensional hashmap ) . See https://research.google.com/archive/bigtable.html  for more details.

It stores the data by column families.  You can add a new column qualifier ( column family:qualifier)  to a existing column family anytime.  The column name is "column family name: qualifier" 

Row Key,  Timestamp, Column Family1

R1, t1, CF1:q1=""
R1,t2,CF1:q1=""

Row Key,  Timestamp, Column Family2

R1,t3,CF2:q2
R1,t4, CF2:q2

Row Key,  Timestamp, Column Family3
R1,t5,CF3:q3

Column families must be declared up front at schema definition time whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running.

Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.

While rows and column keys are expressed as bytes, the version is specified using a long integer.


HBase is not suitable for transactional applications, large volume MapReduce jobs, relational analytics, etc. It is preferred when you have a variable schema with slightly different rows. It is also suitable when you are going for a key dependent access to your stored data.
HBase is also suitable when you need fault tolerant, fast and usable data management in a non-relational environment. 

Real Use Cases
  • Use of HBase by Mozilla: They generally store all crash data in HBase
  • Use of HBase by Facebook: Facebook uses HBase storage to store real-time messages.
In Hbase  rowkey plays a role in partitioning the data.  So careful consideration should be given to the design of rowkey. The idea is also to keep the similar data rows together.  Hence rowkey should be meaningful  rather than arbitrary. 



Comments

Popular posts from this blog

SQL

Analytics

HIVE