Hadoop/Hive Data Ingestion

Data Ingestion:

    Files: stage the files and use the Hadoop/Hive CLI.
    Database: Sqoop; skip CDC (incremental imports) for smaller tables and use it only for larger ones (10M+ rows); use the -m option to parallelize a large database dump (see the Sqoop sketch after this list). NiFi is another option.
    Streaming: NiFi, Flume, StreamSets. NiFi is the popular choice.
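
A minimal Sqoop import sketch, assuming a MySQL source; the connection string, table, and split column are placeholders, not from the original setup:

    # Parallel import: -m sets the number of mappers and --split-by the
    # column used to divide the table across them.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/salesdb \
      --username etl_user -P \
      --table orders \
      --split-by order_id \
      -m 8 \
      --target-dir /staging/orders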

   
File Ingestion:   

CSV into TEXTFILE:

Overwrite: 
  • Move the file to HDFS and create an external TEXTFILE table on top of that HDFS location.
  • Alternatively, create the table and run "LOAD DATA LOCAL INPATH '<localpath>' OVERWRITE INTO TABLE <tablename>". This is handy for managed (internal) tables where no location is specified and you don't know where in the HDFS warehouse the table was created. LOAD DATA works with both local files (LOCAL INPATH) and HDFS files (INPATH). Both options are sketched below.
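
A sketch of both overwrite options; the table name, columns, and paths are illustrative only:

    -- Option 1: external table over a staged HDFS directory
    -- (stage first with: hadoop fs -put sales.csv /data/sales_csv/)
    CREATE EXTERNAL TABLE sales_ext (id INT, amount DOUBLE, sale_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/sales_csv';

    -- Option 2: managed table + LOAD DATA; OVERWRITE replaces existing data
    CREATE TABLE sales (id INT, amount DOUBLE, sale_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH '/tmp/sales.csv' OVERWRITE INTO TABLE sales;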
Append: 
  • You can still use LOAD DATA: "LOAD DATA [LOCAL] INPATH '<path>' INTO TABLE <tablename>" (omitting OVERWRITE appends the file instead of replacing the data).
  • Alternatively, create a staging table using the overwrite approach and then INSERT INTO the original table by selecting from the staging table.
The same approaches work for partitioned tables as well; just add a PARTITION clause (see the sketch below).
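
A sketch of the append path, including the partitioned variant; the table names and the partition column are assumptions for illustration (sales_part is assumed to be partitioned by sale_date):

    -- Append: without OVERWRITE, the file is added to the table
    LOAD DATA LOCAL INPATH '/tmp/sales_day2.csv' INTO TABLE sales;

    -- Or stage first, then append with INSERT INTO ... SELECT
    INSERT INTO TABLE sales SELECT * FROM sales_staging;

    -- Partitioned table: same command plus a PARTITION clause
    LOAD DATA LOCAL INPATH '/tmp/sales_day2.csv'
    INTO TABLE sales_part PARTITION (sale_date = '2019-01-02');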

  CSV file into PARQUET: first create a temporary TEXTFILE table over the CSV, create a PARQUET table, and then insert from the TEXTFILE table into the PARQUET table (a CSV cannot be loaded into a Parquet table directly). See the sketch below.
  CSV file into AVRO: similar to PARQUET, with STORED AS AVRO.
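
A sketch of the TEXTFILE-to-PARQUET conversion; the table names, columns, and HDFS path are placeholders:

    -- Staging table over the raw CSV
    CREATE EXTERNAL TABLE sales_csv (id INT, amount DOUBLE, sale_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/sales_csv';

    -- Target table in Parquet format (use STORED AS AVRO for Avro)
    CREATE TABLE sales_parquet (id INT, amount DOUBLE, sale_date STRING)
    STORED AS PARQUET;

    -- Hive rewrites the text rows as Parquet files
    INSERT OVERWRITE TABLE sales_parquet SELECT * FROM sales_csv;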
    

PRECAUTIONS:

Try to minimize the number of INSERT operations. Each statement results in a MapReduce job and creates additional files in HDFS, so combine as many INSERTs into a single statement as possible; in particular, merge individual single-row inserts into one multi-row insert.
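
For example, a multi-row INSERT and Hive's multi-insert form each run as one job instead of several; the tables here are illustrative:

    -- One statement, many rows: one job instead of three
    INSERT INTO TABLE sales VALUES
      (1, 10.5, '2019-01-01'),
      (2, 20.0, '2019-01-01'),
      (3,  7.2, '2019-01-02');

    -- Multi-insert: scan the source once, write to several targets
    FROM sales_staging s
    INSERT INTO TABLE sales         SELECT s.id, s.amount, s.sale_date
    INSERT INTO TABLE sales_archive SELECT s.id, s.amount, s.sale_date;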

Hive is especially not suited for UPDATE/DELETE operations; INSERT/append is still fine.

An UPDATE in Hive basically writes new (delta) files and marks the earlier data for deletion rather than modifying files in place. Hive is not built for updates, so try to minimize them.
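
For context, UPDATE and DELETE only work on ACID (transactional) tables; a minimal sketch, assuming Hive 3.x with the ACID settings enabled and an illustrative table name:

    -- Transactional table stored as ORC (required for UPDATE/DELETE)
    CREATE TABLE sales_txn (id INT, amount DOUBLE, sale_date STRING)
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- Each UPDATE writes new delta files under the table directory
    UPDATE sales_txn SET amount = 0 WHERE id = 42;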

Run compaction after such operations to clean up the unnecessary delta files.
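
A sketch of triggering compaction manually on the transactional table from the sketch above (compaction can also be left to the metastore's automatic scheduling):

    -- MAJOR compaction rewrites base + delta files into a new base file;
    -- MINOR merges delta files only
    ALTER TABLE sales_txn COMPACT 'major';

    -- Monitor progress
    SHOW COMPACTIONS;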

    
