Posts from 2018

SQL

Joins - not always very obvious (the following example uses Hive).

create table x_temp (name string, p_date int);
insert into x_temp (name, p_date) values ('x1', 20181025), ('x2', 20181026), ('x3', 20181027);

create table y_temp (name string, s_date int);
insert into y_temp (name, s_date) values ('x1', 20181025), ('x2', 20181026), ('x3', 20181027);

create table z_temp (name string, p_date int, s_date int);
insert into z_temp (name, p_date, s_date) values ('z1', 20181025, 20181025);

The following query generates 4 records. z_temp has a single row (p_date = s_date = 20181025), so the AND condition keeps only the two x_temp rows and the two y_temp rows that differ from it, and their cross join yields 2 x 2 = 4 records.

select x.*, y.* from x_temp x, y_temp y, z_temp z where x.p_date != z.p_date and y.s_date != z.s_date;

x.name, x.p_date, y.name, y.s_date
x2, 20181026, x2, 20181026
x3, 20181027, x2, 20181026
x2, 20181026, x3, 20181027
x3, 20181027, x3, 20181027

The same query with OR generates 8 records: of the 3 x 3 = 9 (x, y) pairs, only the pair where both dates match the z_temp row is excluded.

select x.*, y.* from x_temp x, y_temp y, z_temp z where x.p_date != z.p_date or y.s_date != z.s_date;

x.name, x.p_date, y.name, y.s_date
x1, 20181025, x2, 20181026
x1, 20181025, x3, 20181027
x2, 20181026, x1, 20181025
x2, 20181026, x2, 20181026
x2, 20181026, x3, 20181027
x3, 20181027, x1, 20181025
x3, 20181027, x2, 20181026
x3, 20181027, x3, 20181027

HIVE

1. In multi-table joins, all tables except the last (right-most) one are buffered in the reducers; the last table is streamed. To keep the memory footprint on the reducers low, the largest table should come last, as it is streamed rather than buffered. (Impala behaves differently.) See the sketch after this list.
2. You can filter unwanted records within the join itself:
SELECT a.*, b.* FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.name = 'Tim' AND a.name = 'Kim')
3. Joins are left-associative, so execution happens from left to right and the result of each join is fed into the next join operation.
4. To find the value of a variable:
set hive.execution.engine;
5. To find the values of all variables:
set;
6. To pass different Hive configuration settings to the hive shell, use --hiveconf. For example:
hive --hiveconf hive.optimize.sort.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict
7. In case you accidentally drop the p
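A minimal sketch for point 1 (the table names here are hypothetical). If you cannot reorder the FROM clause, Hive's STREAMTABLE hint lets you pick which table is streamed instead of buffered:

-- Stream the largest table instead of the default right-most one.
SELECT /*+ STREAMTABLE(big) */ big.val, s1.val, s2.val
FROM big_table big
JOIN small_table1 s1 ON (big.key = s1.key)
JOIN small_table2 s2 ON (big.key = s2.key);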

Spark SQL

1. By default, Spark SQL uses 200 shuffle partitions, so an insert can create up to 200 part files inside the Parquet table folder. If the data you are inserting is small, you end up with that many tiny file parts inside the table folder, which is detrimental to performance. You can change it in the following manner, here to 10:
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
2. To enable compression in Spark SQL using the default codec (gzip):
sqlContext.setConf("hive.exec.compress.output", "true")
3. When you write a compressed RCFile using Spark with default settings and dynamic partitions, you may end up writing too many small files. This may be true for other formats as well, not just RCFile. The solution is to use buckets: with buckets you can control the number of files. A large number of small files causes OutOfMemory errors in Hive as well as in Spark SQL. See the sketch after this list.
4. When you execute a query in Spark SQL, the n
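A minimal HiveQL sketch for point 3; the table and column names are hypothetical. Bucketing caps the number of files written per partition at the declared bucket count:

-- Hypothetical bucketed table: each partition is written as at most
-- 16 bucket files, regardless of how many tasks produce the data.
CREATE TABLE events_bucketed (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS RCFILE;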

GZIP

To test whether a file is a proper .gz file and is not corrupted, use:

gzip -t <gzipped_file_name>

Hive vs Impala

Hive's split() takes a regular expression and returns a 0-indexed array; Impala's split_part() takes a literal delimiter and a 1-based field index.

Hive:   select split('abc-def', '\\-')[0]
Impala: select split_part('abc-def', '-', 1)
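A quick illustration of the index offset; both queries return 'def':

-- Hive: 0-based array index
select split('abc-def', '\\-')[1];
-- Impala: 1-based field index
select split_part('abc-def', '-', 2);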

Big Data Security

Big data has opened up new areas for attack and, at the same time, provided tools for deriving intelligence to secure systems. Big data did not start with security in mind; security became a necessary afterthought. Below are a few nice articles on big data security:

9 Key Big Data Security Issues
Security Buzzwords

In response to big data security challenges: Apache Accumulo - a sorted, distributed key/value data store on top of HDFS that provides robust, scalable data storage and retrieval with cell-level access control.

Cloudera Security

Security in a Cloudera cluster can be implemented at 4 logical levels:

1. No security
2. Basic (Authentication, Authorization, Audit Trail)
3. Data Security & Governance (Encryption & Key Management, Metadata Discovery, Lineage Visibility)
4. Enterprise Data Hub - fully compliant

Implementation mechanisms:

Authentication - Kerberos (MIT or Microsoft Active Directory implementation)
Authorization - Sentry, HDFS Access Control Lists
Encryption - transparent HDFS encryption for data-at-rest using the enterprise-grade Key Trustee Server; Navigator Encrypt for the rest of the Cloudera applications' metadata
Auditing - Cloudera Navigator

Impala Web UI (Debugging / Diagnostic)

Impala has a nice debugging feature. Each of its daemon processes (impalad, catalogd, statestored) has a web server built in, on a different port:

http(s)://impalad_host:25000
http(s)://catalogd_host:25020
http(s)://statestored_host:25010

For impalad, various information is available at the following pages:

http(s)://impalad_host:25000/backends
http(s)://impalad_host:25000/catalog
http(s)://impalad_host:25000/logs
http(s)://impalad_host:25000/memz
http(s)://impalad_host:25000/metrics
http(s)://impalad_host:25000/queries
http(s)://impalad_host:25000/sessions
http(s)://impalad_host:25000/threadz
http(s)://impalad_host:25000/varz

Parquet Wins

Hadoop supports multiple storage formats like CSV, Avro (binary), Parquet, etc. If performance is the only criterion, then in most scenarios Parquet wins. See the nice blog below.

http://blog.cloudera.com/blog/2016/04/benchmarking-apache-parquet-the-allstate-experience/

There is another format similar to Parquet called ORC, promoted by Hortonworks; Parquet was created and promoted by Cloudera.
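A minimal HiveQL sketch (the table names are hypothetical) of how a table is stored as Parquet, including converting an existing table with CTAS:

-- Declare a new table in Parquet format.
CREATE TABLE clicks_parquet (
  user_id BIGINT,
  url STRING
)
STORED AS PARQUET;

-- Or convert an existing table's data in one step.
CREATE TABLE clicks_parquet2 STORED AS PARQUET AS
SELECT * FROM clicks_csv;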

Ping and TCP/IP

Ping is a heavily used command to see whether a machine is up and reachable. One thing we often forget is that ping uses the ICMP protocol and does not use any particular port. So ping tells you that the machine is up and reachable, but it does not guarantee that the machine is reachable on a particular TCP port, which utilities like SFTP, SSH, etc. depend on; tools such as telnet or nc can check a specific port. See a nice article at https://www.rootusers.com/how-to-ping-a-port/

HIVE Lateral View Explode

The "lateral view explode" construct in HIVE allows optimizing  storage for data where only one column values differ for multiple records.  It allows collapsing all such records into a single record, with the column having different values in a array or comma separated list.  By using lateral view explode, you can always generate a exploded view of the data. HIVE supports "lateral view explode" construct but Impala  does not. In Impala you have to achieve the same using complex types like ARRAY. http://www.ericlin.me/2013/09/how-to-use-hive-lateral-view-in-your-query/ http://linehrr.blogspot.com/2013/09/how-to-use-lateral-view-explode.html

HDFS Commands

* HDFS Commands

Running file system commands in HDFS:

hdfs dfs -du <hdfs_path>
hdfs dfs -ls -R <hdfs_path>  : list all files under the HDFS path. You can also use wildcards for pattern matching, e.g. hdfs dfs -ls dir1/dir2*/file1
hdfs dfs -rm -R <hdfs_path>  : remove all files under the HDFS path
hdfs dfs -rm -R -skipTrash dir1/dir2*/file1  (wildcard)

"hdfs dfs" is the same as the now-deprecated "hadoop dfs". "hdfs dfs" works only for HDFS, while "hadoop fs" works for all kinds of file systems.

* To temporarily authenticate a user to Kerberos using a keytab:

kinit -k -t /home/username/username.keytab username@domain