Posts

Spark with Python

References:  https://towardsdatascience.com/pyspark-and-sparksql-basics-6cb4bf967e53 

Transactions in HIVE

Full ACID semantics at the row level have been supported since Hive 0.13; earlier, ACID was supported only at the partition level.

Atomicity: The entire operation is a single unit; either all of it happens or none of it does.
Consistency: Once an operation completes, every subsequent operation sees the same result.
Isolation: One user's operation does not impact another user's.
Durability: Once an operation is done, its result remains thereafter.

At present, isolation in Hive is only at the snapshot level. The following isolation levels exist in various DBMSs:

Snapshot: The snapshot of the data taken at the beginning of the transaction is visible throughout the transaction; whatever happens in other transactions is never seen by this one.
Dirty Read: Uncommitted updates from other transactions can be seen.
Read Committed: Only updates committed at the time of the read are seen by this transaction.
Repeatable Read: A read lock is held on data being read and a write lock on data being created/updated/deleted, for the duration of the transaction.
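A minimal HiveQL sketch of row-level ACID in practice. The settings, table name, and columns below are illustrative assumptions, not from the post; row-level UPDATE and DELETE require Hive 0.14 or later on a bucketed ORC table with the transactional property set.

-- Session-level settings required for ACID (usually configured cluster-wide in hive-site.xml)
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Transactional tables must be stored as ORC, bucketed, and marked transactional
CREATE TABLE orders (id INT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level operations available on the transactional table
INSERT INTO TABLE orders VALUES (1, 'NEW'), (2, 'NEW');
UPDATE orders SET status = 'SHIPPED' WHERE id = 1;
DELETE FROM orders WHERE id = 2;

Each statement runs as its own transaction; readers of the table see a consistent snapshot taken when their query starts, which matches the snapshot isolation described above.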

Hadoop/Hive Data Ingestion

Data Ingestion:
Files: Stage the files and use the Hadoop/Hive CLI.
Database: Sqoop; skip CDC for smaller tables and use it only for larger ones (10M+ rows); use the -m option (number of mappers) for large database dumps. NiFi is another option.
Streaming: NiFi, Flume, StreamSets. NiFi is popular.

File Ingestion: CSV into TEXTFILE (see the sketch after this list):
Overwrite: Move the file to HDFS and create an external TEXTFILE table on top of the HDFS location. You can also create the table and use "LOAD DATA LOCAL INPATH localpath OVERWRITE INTO TABLE tablename". This approach is handy for internal tables where the location is not specified and you don't know the HDFS warehouse location where the table was created. The LOAD DATA command works for loading data from a local file as well as an HDFS file.
Append: You can still use "LOAD DATA INPATH ... INTO TABLE tablename" (without OVERWRITE), or create a temporary staging table using the overwrite approach and then insert into the original table from the staging table. The same approach works for partitioned tables.
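A HiveQL sketch of the overwrite and append paths described above; the paths, table names, and columns are illustrative assumptions, not from the post.

-- Overwrite: external TEXTFILE table on top of a staged HDFS directory
CREATE EXTERNAL TABLE sales_ext (id INT, amount DOUBLE, sale_dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/staging/sales';

-- Overwrite a managed table from a local file (no need to know the warehouse path)
CREATE TABLE sales (id INT, amount DOUBLE, sale_dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/sales.csv' OVERWRITE INTO TABLE sales;

-- Append: load without OVERWRITE (here from an HDFS path)
LOAD DATA INPATH '/data/incoming/sales.csv' INTO TABLE sales;

-- Append via a staging table loaded with the overwrite approach
CREATE TABLE sales_stage LIKE sales;
LOAD DATA LOCAL INPATH '/tmp/sales_delta.csv' OVERWRITE INTO TABLE sales_stage;
INSERT INTO TABLE sales SELECT * FROM sales_stage;
DROP TABLE sales_stage;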

HCFS

 Hadoop Compatible File System

Data Lakes

Technological advances in cloud computing and big data processing have enabled data lakes, and data lakes are becoming a natural choice for organizations to harness the power of data. A data lake creates a central repository for all sorts of data: structured, semi-structured, or unstructured. Data lakes store data from all sorts of sources and in all sorts of formats. No preparation is required before storing the data, and huge quantities of data can be stored in a cost-effective manner. Data pipelines are set up to cleanse and transform the data. Data can be consumed in multiple ways: via interactive queries, or by exporting into data warehousing or business intelligence solutions.

Functional Areas:
Data Ingestion or Collection: batch or streaming.
Catalog & Search: data cataloging, metadata creation, tagging.
Manage & Secure Data: cost-effective storage; security (access restrictions and encryption at rest).
Processing: cleansing and transformation; ETL or ELT pipelines; raw data to curated, consumable data.

EC2-Classic

This was the initial EC2 platform. All EC2 instances were launched in a flat network shared by all customers; there was no concept of a VPC. Accounts created after 2013-12-04 do not have support for EC2-Classic.