Spark SQL
1. By default, Spark SQL uses 200 shuffle partitions, so an insert can create up to 200 part files inside the Parquet table folder. If the data you are inserting into the table is small, you will see that many tiny file parts created inside the table folder, which is detrimental to performance. You can change this in the following manner, for example to 10.
sqlContext.setConf("spark.sql.shuffle.partitions", "10");
2. To enable compression in Spark SQL using the default codec (gzip):
sqlContext.setConf("spark.sql.shuffle.partitions", "10");
2. To enable compression in spark sql using default codec (gzip).
sqlContext.setConf("hive.exec.compress.output","true")
3. When you write an RCFile (compressed) using Spark with default settings and dynamic partitions, you may end up writing too many small files. This may be true for other formats as well, not just RCFile. The solution is to use buckets; with buckets you can control the number of files. A large number of files will cause OutOfMemory errors in Hive as well as Spark SQL. A bucketed-table sketch follows below.
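As a sketch, a bucketed table caps the number of files per partition at the bucket count; the table, columns, and bucket count below are hypothetical, and you should verify that your Spark version actually enforces bucketing on write:
// Hive settings for bucketed, dynamically partitioned inserts
sqlContext.sql("SET hive.enforce.bucketing=true")
sqlContext.sql("SET hive.exec.dynamic.partition=true")
sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
// 32 buckets => at most 32 files per partition directory (hypothetical schema)
sqlContext.sql("CREATE TABLE events_bucketed (user_id INT, payload STRING) PARTITIONED BY (dt STRING) CLUSTERED BY (user_id) INTO 32 BUCKETS STORED AS RCFILE")
sqlContext.sql("INSERT OVERWRITE TABLE events_bucketed PARTITION (dt) SELECT user_id, payload, dt FROM events")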
4. When you execute a query in Spark SQL, the number of tasks depends on the number of partitions and on the actual data within those partitions. The HDFS scan phase depends very much on the actual data volume: if the volume is small, this stage moves quickly even if the number of files (due to partitions) is high; on the other hand, even if the number of files is small, the HDFS scan will take time when the data is large.
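A quick way to see how many tasks the scan stage will launch is to check the partition count of the table's underlying RDD; the table name here is made up:
// Partition count roughly corresponds to the number of scan tasks (hypothetical table)
val df = sqlContext.table("events")
println(df.rdd.partitions.size)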
5. spark-shell
A few common parameters used with spark-shell:
--num-executors 6
--executor-cores 1
--executor-memory 3g
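Put together, with the same values as above, the invocation looks like this:
spark-shell --num-executors 6 --executor-cores 1 --executor-memory 3g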
6. Spark SQL throws an IndexOutOfBoundsException while inserting data into a table using "INSERT OVERWRITE ... SELECT ..." if the SELECT columns don't align with the columns in the table definition.
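For example, if the target table is defined with two columns, the SELECT must produce exactly those two columns in order; the table and column names here are hypothetical:
// Target table assumed to be defined as (id INT, name STRING)
sqlContext.sql("INSERT OVERWRITE TABLE target SELECT id, name FROM source")
// Selecting a different number/order of columns is what triggers the IndexOutOfBoundsException
// sqlContext.sql("INSERT OVERWRITE TABLE target SELECT id FROM source")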
7. When we set the number of buckets/files (shuffle partitions) under a directory to a smaller value, we run into memory issues, and those memory issues go away when you increase this setting.
sqlContext.setConf("spark.sql.shuffle.partitions", "10");
Also, as you reduce this setting, the time taken to write the same amount of data increases, even when the setting is still higher than the number of cores available.
8. A good article on Coalesce & Repartition in Spark.
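The short version, as a sketch (coalesce avoids a full shuffle when reducing partitions, repartition always shuffles); the DataFrame and output paths are hypothetical:
// Reduce the number of output files without a full shuffle (hypothetical df and paths)
df.coalesce(10).write.parquet("/tmp/out_coalesced")
// Full shuffle into evenly sized partitions
df.repartition(10).write.parquet("/tmp/out_repartitioned")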
9. Another good article on how to partition data on disk.
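For partitioning data on disk, DataFrameWriter.partitionBy is the usual route; the column and path are assumptions for illustration:
// Writes one sub-directory per distinct value of "dt" (hypothetical column and path)
df.write.partitionBy("dt").parquet("/data/events_by_day")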