Posts

Showing posts from March, 2018

Big Data Security

Big data has opened up new areas for attack and, at the same time, provided tools for deriving the intelligence needed to secure systems. Big data did not start with security in mind; security became a necessary afterthought.

Below are a few nice articles on big data security:

9 Key Big Data Security Issues
Security Buzzwords

In response to big data security challenges:

Apache Accumulo - a sorted, distributed key/value data store on top of HDFS that provides robust, scalable data storage and retrieval with cell-level access control.
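To make "cell-level access control" a little more concrete, here is a minimal sketch in Python of the general idea: each cell carries a label expression, and a reader sees the cell only if their authorizations satisfy it. This is a toy model, not the Accumulo API; the function name `visible` and the sample labels are made up for illustration, and the sketch simply evaluates `&` and `|` left to right.

```python
def visible(expression: str, authorizations: set) -> bool:
    """Toy evaluator for label expressions such as 'admin&(audit|finance)'.

    Hypothetical helper for illustration only; not part of any Accumulo client.
    """
    pos = 0

    def parse_term() -> bool:
        nonlocal pos
        if expression[pos] == "(":
            pos += 1                  # consume '('
            value = parse_expr()
            pos += 1                  # consume ')'
            return value
        start = pos
        while pos < len(expression) and expression[pos] not in "&|()":
            pos += 1
        # A bare label is satisfied when the reader holds that authorization.
        return expression[start:pos] in authorizations

    def parse_expr() -> bool:
        nonlocal pos
        value = parse_term()
        while pos < len(expression) and expression[pos] in "&|":
            op = expression[pos]
            pos += 1
            rhs = parse_term()
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    # Assumption in this sketch: an empty expression means no restriction.
    return parse_expr() if expression else True


print(visible("admin&(audit|finance)", {"admin", "finance"}))
print(visible("admin&audit", {"admin"}))
```

The point of the sketch is only that the access decision is made per cell from a label stored with the data, rather than per table or per file.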

Cloudera Security

Security in a Cloudera cluster can be implemented at four logical levels:

No security
Basic (Authentication, Authorization, Audit Trail)
Data Security & Governance (Encryption & Key Management, Metadata Discovery, Lineage Visibility)
Enterprise Data Hub - Fully compliant

Implementation Mechanisms

Authentication - Kerberos, using MIT Kerberos or Microsoft Active Directory
Authorization - Sentry, HDFS Access Control Lists
Encryption - transparent HDFS encryption for data at rest, using the enterprise-grade Key Trustee Server; Navigator Encrypt for the rest of the Cloudera applications' metadata
Auditing - Cloudera Navigator

Impala Web UI (Debugging / Diagnostic)

Impala has a nice debugging feature: each of its daemon processes (impalad, catalogd, statestored) has a built-in web server, each on a different port.

http(s)://impalad_host:25000
http(s)://catalogd_host:25020
http(s)://statestored_host:25010

For impalad, various kinds of information are available on the following pages:

http(s)://impalad_host:25000/backends
http(s)://impalad_host:25000/catalog
http(s)://impalad_host:25000/logs
http(s)://impalad_host:25000/memz
http(s)://impalad_host:25000/metrics
http(s)://impalad_host:25000/queries
http(s)://impalad_host:25000/sessions
http(s)://impalad_host:25000/threadz
http(s)://impalad_host:25000/varz
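If you check these pages often, a small helper can build the URLs for a given host. The following is a minimal sketch in Python, assuming the default ports listed above; the function names `base_url` and `impalad_page_urls` are made up for this example, and the ports may differ if your cluster overrides them.

```python
# Debug web UI ports as listed in the post (defaults; a cluster may override them).
DEFAULT_PORTS = {"impalad": 25000, "statestored": 25010, "catalogd": 25020}

# impalad debug pages as listed in the post.
IMPALAD_PAGES = [
    "backends", "catalog", "logs", "memz", "metrics",
    "queries", "sessions", "threadz", "varz",
]


def base_url(host: str, daemon: str = "impalad", scheme: str = "http") -> str:
    """Return the root debug URL for one daemon on the given host."""
    return f"{scheme}://{host}:{DEFAULT_PORTS[daemon]}"


def impalad_page_urls(host: str, scheme: str = "http") -> list:
    """Return the URL of every impalad debug page for the given host."""
    root = base_url(host, "impalad", scheme)
    return [f"{root}/{page}" for page in IMPALAD_PAGES]


# "impalad_host" is a placeholder host name, as in the post.
for url in impalad_page_urls("impalad_host"):
    print(url)
```

Pass scheme="https" if the web servers are configured for TLS.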

Parquet Wins

Hadoop supports multiple storage formats such as CSV, Avro (binary), Parquet, etc. If performance is the only criterion, then Parquet wins in most scenarios. See the nice blog post below.

http://blog.cloudera.com/blog/2016/04/benchmarking-apache-parquet-the-allstate-experience/

There is another format similar to Parquet called ORC, promoted by Hortonworks. Parquet was created jointly by Cloudera and Twitter and is promoted by Cloudera.