Hadoop/Hive Data Ingestion

Data Ingestion:

    Files: Stage the files and use the Hadoop/Hive CLI.
    Database: Sqoop; skip CDC for smaller tables and use it only for larger ones (10M+ rows); use the -m option to run a large database dump with parallel mappers (see the Sqoop sketch after this list). NiFi is another option.
    Streaming: NiFi, Flume, StreamSets. NiFi is popular.
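
A minimal Sqoop sketch for a parallel full dump and a CDC-style incremental import; the JDBC URL, credentials, table, and column names here are hypothetical placeholders.

    # Parallel full dump of a large table (8 mappers); connection string and table are placeholders
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      -m 8

    # Incremental (CDC-style) import for a large table, keyed on an ever-increasing id column
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders_incr \
      --incremental append \
      --check-column order_id \
      --last-value 1000000 \
      -m 8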

   
File Ingestion:   

CSV into TEXTFILE:

Overwrite: 
  • Move the file to HDFS and create an external TEXTFILE table on top of the HDFS location.
  • You can also create the table and run "LOAD DATA LOCAL INPATH '<localpath>' OVERWRITE INTO TABLE <tablename>". This approach is handy for internal (managed) tables where no location is specified and you don't know the HDFS warehouse directory the table was created in. LOAD DATA works for loading from a local path (with LOCAL) as well as from an HDFS path (without LOCAL). See the sketch below.
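
A minimal HiveQL sketch of the overwrite path; the paths, table names, and columns are hypothetical.

    -- Option 1: stage the file in HDFS and point an external TEXTFILE table at it
    -- hdfs dfs -mkdir -p /data/staging/customers
    -- hdfs dfs -put customers.csv /data/staging/customers/
    CREATE EXTERNAL TABLE customers_ext (
      id    INT,
      name  STRING,
      email STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/staging/customers';

    -- Option 2: create a managed table and load the local CSV into it, replacing existing data
    CREATE TABLE customers (
      id    INT,
      name  STRING,
      email STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH '/home/etl/customers.csv' OVERWRITE INTO TABLE customers;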
Append: 
  • You can use "LOAD DATA INPATH '...' INTO TABLE <tablename>" (without OVERWRITE), which appends the file to the existing data.
  • Alternatively, create a staging table using the overwrite approach and then INSERT INTO the original table from it (see the sketch below).
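
A minimal append sketch, reusing the hypothetical customers table from above:

    -- Append a new HDFS file without touching existing data (no OVERWRITE keyword)
    LOAD DATA INPATH '/data/incoming/customers_2024_01.csv' INTO TABLE customers;

    -- Or: load into a staging table first, then append from it
    CREATE TABLE customers_stage LIKE customers;
    LOAD DATA LOCAL INPATH '/home/etl/customers_delta.csv' OVERWRITE INTO TABLE customers_stage;
    INSERT INTO TABLE customers SELECT * FROM customers_stage;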
The same approach works for partitioned tables; with LOAD DATA you just add a PARTITION clause, as in the sketch below.
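
A sketch with a hypothetical daily partition column:

    CREATE TABLE customers_part (
      id    INT,
      name  STRING,
      email STRING
    )
    PARTITIONED BY (load_dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Overwrite (or drop OVERWRITE to append) into a specific partition
    LOAD DATA INPATH '/data/incoming/customers_2024_01_15.csv'
      OVERWRITE INTO TABLE customers_part PARTITION (load_dt='2024-01-15');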

  CSV file into PARQUET: first create a temporary TEXTFILE table, then create a PARQUET table, and insert from the TEXTFILE table into the PARQUET table (see the sketch below).
  CSV file into AVRO: similar to PARQUET.
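
A minimal sketch of the TEXTFILE-to-PARQUET flow, reusing the hypothetical columns from above:

    -- 1. Staging table over the raw CSV (TEXTFILE)
    CREATE EXTERNAL TABLE customers_csv (
      id    INT,
      name  STRING,
      email STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/staging/customers';

    -- 2. Target table in PARQUET (use STORED AS AVRO for the Avro variant)
    CREATE TABLE customers_parquet (
      id    INT,
      name  STRING,
      email STRING
    )
    STORED AS PARQUET;

    -- 3. One INSERT ... SELECT converts the data
    INSERT OVERWRITE TABLE customers_parquet SELECT * FROM customers_csv;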
    

PRECAUTIONS:

Try to minimize the number of INSERT operations. Each operation results in a MapReduce job and also creates additional files in HDFS, so club as many INSERT operations into a single statement as possible; in particular, merge individual single-row inserts into one (see the sketch below).
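
A sketch of clubbing inserts together; the table and values are hypothetical:

    -- Avoid: many single-row inserts, each launching its own job and writing its own small file
    -- INSERT INTO TABLE events VALUES (1, 'login');
    -- INSERT INTO TABLE events VALUES (2, 'click');
    -- INSERT INTO TABLE events VALUES (3, 'logout');

    -- Prefer: one statement with all the rows
    INSERT INTO TABLE events VALUES (1, 'login'), (2, 'click'), (3, 'logout');

    -- Or one multi-table insert that reads the source once and feeds several targets
    FROM staging_events
      INSERT INTO TABLE events       SELECT id, action
      INSERT INTO TABLE events_audit SELECT id, action;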

Hive is not well suited for UPDATE/DELETE operations; INSERT/append is still fine.

An UPDATE in Hive basically writes a new (delta) file and marks the earlier records for delete. Hive is not built for updates, so try to minimize them. See the link for more details.
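
Note that UPDATE and DELETE only work on transactional (ACID) tables in the first place; a minimal sketch, assuming ACID is enabled on the cluster, with a hypothetical table and columns:

    -- ACID table: bucketed, ORC, transactional
    CREATE TABLE accounts (
      id     INT,
      status STRING
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Each UPDATE/DELETE writes delta files rather than editing data in place
    UPDATE accounts SET status = 'closed' WHERE id = 42;
    DELETE FROM accounts WHERE status = 'purged';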

Run a COMPACTION after such operations to clean up the unnecessary delta files (see the sketch below).
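
A sketch of triggering compaction manually; the table and partition names are hypothetical:

    -- Major compaction rewrites delta files into a new base file; 'minor' just merges deltas
    ALTER TABLE accounts COMPACT 'major';

    -- For a partitioned transactional table, compact one partition at a time
    ALTER TABLE customers_part_acid PARTITION (load_dt='2024-01-15') COMPACT 'major';

    -- Check compaction progress
    SHOW COMPACTIONS;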

    
