Posts from 2018

SQL

Joins - not always very obvious (the following example uses Hive).

create table x_temp (name string, p_date int);
insert into x_temp (name, p_date) values ('x1', 20181025), ('x2', 20181026), ('x3', 20181027);

create table y_temp (name string, s_date int);
insert into y_temp (name, s_date) values ('x1', 20181025), ('x2', 20181026), ('x3', 20181027);

create table z_temp (name string, p_date int, s_date int);
insert into z_temp (name, p_date, s_date) values ('z1', 20181025, 20181025);

The following query generates 4 records. z_temp has a single row (p_date = s_date = 20181025), so the AND condition keeps only the two x_temp rows and the two y_temp rows that differ from it, and their cross join yields 2 x 2 = 4 records.

select x.*, y.* from x_temp x, y_temp y, z_temp z where x.p_date != z.p_date and y.s_date != z.s_date;

x.name, x.p_date, y.name, y.s_date
x2, 20181026, x2, 20181026
x3, 20181027, x2, 20181026
x2, 20181026, x3, 20181027
x3, 20181027, x3, 20181027

The same query with OR generates 8 records: of the 3 x 3 = 9 (x, y) pairs, only the pair where both dates match the z_temp row is excluded.

select x.*, y.* from x_temp x, y_temp y, z_temp z where x.p_date != z.p_date or y.s_date != z.s_date;

x.name, x.p_date, y.name, y.s_date
x1, 20181025, x2, 20181026
x1, 20181025, x3, 20181027
x2, 20181026, x1, 20181025
x2, 20181026, x2, 20181026
x2, 20181026, x3, 20181027
x3, 20181027, x1, 20181025
x3, 20181027, x2, 20181026
x3, 20181027, x3, 20181027

HIVE

1. In multi-table joins, all tables except the last (right-most) one are buffered in the reducers; the last table is streamed. To keep the memory footprint on the reducers low, the largest table should come last, as it is streamed rather than buffered. (Impala behaves differently.) See the sketch after this list.
2. You can filter unwanted records within the join itself:
SELECT a.*, b.* FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.name = 'Tim' AND a.name = 'Kim')
3. Joins are left-associative, so execution happens from left to right and the result of each join is fed into the next join operation.
4. To find the value of a variable:
set hive.execution.engine;
5. To find the values of all variables:
set;
6. To pass different Hive configuration settings to the hive shell, use --hiveconf. For example:
hive --hiveconf hive.optimize.sort.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict
7. In case you accidentally drop the p
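A minimal sketch for point 1 (the table names here are hypothetical). If you cannot reorder the FROM clause, Hive's STREAMTABLE hint lets you pick which table is streamed instead of buffered:

-- Stream the largest table instead of the default right-most one.
SELECT /*+ STREAMTABLE(big) */ big.val, s1.val, s2.val
FROM big_table big
JOIN small_table1 s1 ON (big.key = s1.key)
JOIN small_table2 s2 ON (big.key = s2.key);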

Spark SQL

1. By default, Spark SQL uses 200 shuffle partitions, so an insert can create up to 200 part files inside the Parquet table folder. If the data you are inserting is small, you end up with that many tiny file parts inside the table folder, which is detrimental to performance. You can change it in the following manner, here to 10:
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
2. To enable compression in Spark SQL using the default codec (gzip):
sqlContext.setConf("hive.exec.compress.output", "true")
3. When you write a compressed RCFile using Spark with default settings and dynamic partitions, you may end up writing too many small files. This may be true for other formats as well, not just RCFile. The solution is to use buckets: with buckets you can control the number of files. A large number of small files causes OutOfMemory errors in Hive as well as in Spark SQL. See the sketch after this list.
4. When you execute a query in Spark SQL, the n
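A minimal HiveQL sketch for point 3; the table and column names are hypothetical. Bucketing caps the number of files written per partition at the declared bucket count:

-- Hypothetical bucketed table: each partition is written as at most
-- 16 bucket files, regardless of how many tasks produce the data.
CREATE TABLE events_bucketed (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS RCFILE;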

GZIP

To test whether a file is a proper .gz file and is not corrupted, use:

gzip -t <gzipped_file_name>

Hive vs Impala

Hive's split() takes a regular expression and returns a 0-indexed array; Impala's split_part() takes a literal delimiter and a 1-based field index.

Hive:   select split('abc-def', '\\-')[0]
Impala: select split_part('abc-def', '-', 1)
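A quick illustration of the index offset; both queries return 'def':

-- Hive: 0-based array index
select split('abc-def', '\\-')[1];
-- Impala: 1-based field index
select split_part('abc-def', '-', 2);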

Big Data Security

Big data has opened up new areas for attack and, at the same time, provided tools for deriving intelligence to secure systems. Big data did not start with security in mind; security became a necessary afterthought. Below are a few nice articles on big data security:

9 Key Big Data Security Issues
Security Buzzwords

In response to big data security challenges: Apache Accumulo - a sorted, distributed key/value data store on top of HDFS that provides robust, scalable data storage and retrieval with cell-level access control.

Cloudera Security

Security in a Cloudera cluster can be implemented at 4 logical levels:

1. No security
2. Basic (Authentication, Authorization, Audit Trail)
3. Data Security & Governance (Encryption & Key Management, Metadata Discovery, Lineage Visibility)
4. Enterprise Data Hub - fully compliant

Implementation mechanisms:

Authentication - Kerberos (MIT or Microsoft Active Directory implementation)
Authorization - Sentry, HDFS Access Control Lists
Encryption - transparent HDFS encryption for data-at-rest using the enterprise-grade Key Trustee Server; Navigator Encrypt for the rest of the Cloudera applications' metadata
Auditing - Cloudera Navigator

Impala Web UI (Debugging / Diagnostic)

Impala has a nice debugging feature. Each of its daemon processes (impalad, catalogd, statestored) has a web server built in, on a different port:

http(s)://impalad_host:25000
http(s)://catalogd_host:25020
http(s)://statestored_host:25010

For impalad, various information is available at the following pages:

http(s)://impalad_host:25000/backends
http(s)://impalad_host:25000/catalog
http(s)://impalad_host:25000/logs
http(s)://impalad_host:25000/memz
http(s)://impalad_host:25000/metrics
http(s)://impalad_host:25000/queries
http(s)://impalad_host:25000/sessions
http(s)://impalad_host:25000/threadz
http(s)://impalad_host:25000/varz

Parquet Wins

Hadoop supports multiple storage formats like CSV, Avro (binary), Parquet, etc. If performance is the only criterion, then in most scenarios Parquet wins. See the nice blog below.

http://blog.cloudera.com/blog/2016/04/benchmarking-apache-parquet-the-allstate-experience/

There is another format similar to Parquet called ORC, promoted by Hortonworks; Parquet was created and promoted by Cloudera.
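A minimal HiveQL sketch (the table names are hypothetical) of how a table is stored as Parquet, including converting an existing table with CTAS:

-- Declare a new table in Parquet format.
CREATE TABLE clicks_parquet (
  user_id BIGINT,
  url STRING
)
STORED AS PARQUET;

-- Or convert an existing table's data in one step.
CREATE TABLE clicks_parquet2 STORED AS PARQUET AS
SELECT * FROM clicks_csv;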

Ping and TCP/IP

Ping is a heavily used command to see whether a machine is up and reachable. One thing we often forget is that ping uses the ICMP protocol and does not use any particular port. So ping tells you that the machine is up and reachable, but it does not guarantee that the machine is reachable on a particular TCP port, which utilities like SFTP, SSH, etc. depend on; tools such as telnet or nc can check a specific port. See a nice article at https://www.rootusers.com/how-to-ping-a-port/

HIVE Lateral View Explode

The "lateral view explode" construct in HIVE allows optimizing  storage for data where only one column values differ for multiple records.  It allows collapsing all such records into a single record, with the column having different values in a array or comma separated list.  By using lateral view explode, you can always generate a exploded view of the data. HIVE supports "lateral view explode" construct but Impala  does not. In Impala you have to achieve the same using complex types like ARRAY. http://www.ericlin.me/2013/09/how-to-use-hive-lateral-view-in-your-query/ http://linehrr.blogspot.com/2013/09/how-to-use-lateral-view-explode.html

HDFS Commands

* HDFS Commands

Running file system commands in HDFS:

hdfs dfs -du <hdfs_path>
hdfs dfs -ls -R <hdfs_path>  : list all files under the HDFS path. You can also use wildcards for pattern matching, e.g. hdfs dfs -ls dir1/dir2*/file1
hdfs dfs -rm -R <hdfs_path>  : remove all files under the HDFS path
hdfs dfs -rm -R -skipTrash dir1/dir2*/file1  (wildcard)

"hdfs dfs" is the same as the now-deprecated "hadoop dfs". "hdfs dfs" works only for HDFS, while "hadoop fs" works for all kinds of file systems.

* To temporarily authenticate a user to Kerberos using a keytab:

kinit -k -t /home/username/username.keytab username@domain