Posts

Spark with Python

References:  https://towardsdatascience.com/pyspark-and-sparksql-basics-6cb4bf967e53 

Transactions in HIVE

Full ACID semantics at the row level have been supported since Hive 0.13; earlier, ACID was supported only at the partition level.

Atomicity: The entire operation is a single unit; either all of it happens or none of it does.
Consistency: Once an operation completes, every subsequent operation sees the same result.
Isolation: One user's operation does not impact another user's.
Durability: Once an operation is done, its result remains thereafter.

At present, isolation in Hive is only at the snapshot level. The following isolation levels exist in various DBMSs:

Snapshot: The snapshot of the data taken at the beginning of the transaction is visible throughout the transaction; whatever happens in other transactions is never seen by this one.
Dirty Read: Uncommitted updates from other transactions can be seen.
Read Committed: Only updates committed at the time of the read are seen by this transaction.
Repeatable Read: A read lock is held on data being read and a write lock on data being created/updated/deleted, for the duration of the transaction.
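A minimal HiveQL sketch of row-level ACID in practice. The settings, table name, and columns below are illustrative assumptions, not from the post; row-level UPDATE and DELETE require Hive 0.14 or later on a bucketed ORC table with the transactional property set.

-- Session-level settings required for ACID (usually configured cluster-wide in hive-site.xml)
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Transactional tables must be stored as ORC, bucketed, and marked transactional
CREATE TABLE orders (id INT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level operations available on the transactional table
INSERT INTO TABLE orders VALUES (1, 'NEW'), (2, 'NEW');
UPDATE orders SET status = 'SHIPPED' WHERE id = 1;
DELETE FROM orders WHERE id = 2;

Each statement runs as its own transaction; readers of the table see a consistent snapshot taken when their query starts, which matches the snapshot isolation described above.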

Hadoop/Hive Data Ingestion

Data Ingestion:
Files: Stage the files and use the Hadoop/Hive CLI.
Database: Sqoop; skip CDC for smaller tables and use it only for larger ones (10M+ rows); use the -m option (number of mappers) for large database dumps. NiFi is another option.
Streaming: NiFi, Flume, StreamSets. NiFi is popular.

File Ingestion: CSV into TEXTFILE (see the sketch after this list):
Overwrite: Move the file to HDFS and create an external TEXTFILE table on top of the HDFS location. You can also create the table and use "LOAD DATA LOCAL INPATH localpath OVERWRITE INTO TABLE tablename". This approach is handy for internal tables where the location is not specified and you don't know the HDFS warehouse location where the table was created. The LOAD DATA command works for loading data from a local file as well as an HDFS file.
Append: You can still use "LOAD DATA INPATH ... INTO TABLE tablename" (without OVERWRITE), or create a temporary staging table using the overwrite approach and then insert into the original table from the staging table. The same approach works for partitioned tables.
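A HiveQL sketch of the overwrite and append paths described above; the paths, table names, and columns are illustrative assumptions, not from the post.

-- Overwrite: external TEXTFILE table on top of a staged HDFS directory
CREATE EXTERNAL TABLE sales_ext (id INT, amount DOUBLE, sale_dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/staging/sales';

-- Overwrite a managed table from a local file (no need to know the warehouse path)
CREATE TABLE sales (id INT, amount DOUBLE, sale_dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/sales.csv' OVERWRITE INTO TABLE sales;

-- Append: load without OVERWRITE (here from an HDFS path)
LOAD DATA INPATH '/data/incoming/sales.csv' INTO TABLE sales;

-- Append via a staging table loaded with the overwrite approach
CREATE TABLE sales_stage LIKE sales;
LOAD DATA LOCAL INPATH '/tmp/sales_delta.csv' OVERWRITE INTO TABLE sales_stage;
INSERT INTO TABLE sales SELECT * FROM sales_stage;
DROP TABLE sales_stage;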

HCFS

 Hadoop Compatible File System

Data Lakes

Technological advances in cloud computing and big data processing have enabled data lakes, and data lakes are becoming a natural choice for organizations to harness the power of data. A data lake creates a central repository for all sorts of data: structured, semi-structured, or unstructured. Data lakes store data from all sorts of sources and in all sorts of formats. No preparation is required before storing the data, and huge quantities of data can be stored in a cost-effective manner. Data pipelines are set up to cleanse and transform the data. Data can be consumed in multiple ways: via interactive queries, or by exporting into data warehousing or business intelligence solutions.

Functional Areas:
Data Ingestion or Collection: batch or streaming.
Catalog & Search: data cataloging, metadata creation, tagging.
Manage & Secure Data: cost-effective storage; security (access restrictions and encryption at rest).
Processing: cleansing and transformation; ETL or ELT pipelines; raw data to curated, consumable data.

EC2-Classic

This was the initial EC2 platform. All EC2 instances were launched in a flat network shared by all customers; there was no concept of a VPC. Accounts created after 2013-12-04 do not have support for EC2-Classic.