
HIVE

1. When doing multi-table joins, every table except the last (rightmost) one is buffered in the reducers; the last table is streamed. To keep the memory footprint on the reducers small, the largest table should be the last one, since it is streamed rather than buffered (see the sketch after this list). For Impala, this behaviour is different.
2. You can filter the unwanted records within the join itself:
   SELECT a.*, b.* FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.name = 'Tim' AND a.name = 'Kim')
3. Joins are left associative, so execution happens from left to right and the result of each join is fed into the next join operation.
4. To find the value of a variable:
   set hive.execution.engine;
5. To find the values of all variables:
   set;
6. If you want to pass different Hive configuration settings to the hive shell, you can use --hiveconf. For example:
   hive --hiveconf hive.optimize.sort.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict
7. In case you accidentally drop the p
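A minimal HiveQL sketch of point 1, using made-up table names (small_dim, big_fact) purely for illustration; the STREAMTABLE hint is Hive's way of marking which table should be streamed when you cannot simply move it to the end of the join:

   -- Preferred: put the largest table last so it is streamed, not buffered
   SELECT d.name, f.amount
   FROM small_dim d
   JOIN big_fact f ON (d.cust_id = f.cust_id);

   -- If the largest table cannot be last, hint it as the streamed table
   SELECT /*+ STREAMTABLE(f) */ d.name, f.amount
   FROM big_fact f
   JOIN small_dim d ON (f.cust_id = d.cust_id);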

Spark SQL

1. By default, Spark SQL uses 200 shuffle partitions, so an insert into a Parquet table can create up to 200 part files inside the table folder. If the data you are inserting is small, you end up with that many tiny file parts, which is detrimental to performance. You can change it, for example to 10, as follows (a fuller sketch appears after this list):
   sqlContext.setConf("spark.sql.shuffle.partitions", "10")
2. To enable compression in Spark SQL using the default codec (gzip):
   sqlContext.setConf("hive.exec.compress.output", "true")
3. When you write a compressed RCFile using Spark with default settings and dynamic partitions, you may end up writing too many small files. This can be true for other formats as well, not just RCFile. The solution is to use buckets: with buckets you can control the number of files. A large number of files will cause OutOfMemory errors in Hive as well as Spark SQL.
4. When you execute a query in Spark SQL, the n
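A minimal Scala sketch of point 1, assuming a Hive-enabled sqlContext in spark-shell; the table names raw_events and events_parquet are hypothetical, and the GROUP BY forces the shuffle whose partition count determines how many part files the insert writes:

   // Cap the shuffle at 10 partitions so the insert produces at most 10 part files
   sqlContext.setConf("spark.sql.shuffle.partitions", "10")

   // The aggregation shuffles the data; its output partitions become the written files
   sqlContext.sql(
     """INSERT OVERWRITE TABLE events_parquet
       |SELECT user_id, event_type, COUNT(*) AS cnt
       |FROM raw_events
       |GROUP BY user_id, event_type""".stripMargin)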