Posts

Showing posts from November, 2020

Spark with Python

References:  https://towardsdatascience.com/pyspark-and-sparksql-basics-6cb4bf967e53 

Transactions in HIVE

Full ACID semantics at the row level have been supported since Hive 0.13; earlier, transactions were supported only at the partition level.
Atomicity: An operation is all-or-nothing; either the entire operation happens or none of it does.
Consistency: Once an operation completes, every subsequent operation sees its result.
Isolation: One user's operation does not impact another user's.
Durability: Once an operation is done, its result persists thereafter.
At present, Hive's isolation is only at the snapshot level. The following isolation levels exist across various DBMSs:
Snapshot: The snapshot of data taken at the beginning of the transaction is visible throughout the transaction; whatever happens in other transactions is never seen by this one.
Dirty Read: Uncommitted updates from other transactions can be seen.
Read Committed: Only updates that are committed at the time of the read are seen by this transaction.
Repeatable Read: A read lock is taken on data being read and a write lock on data being created/updated…
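
The row-level ACID support described above requires the table to be declared transactional. A hypothetical sketch (table and column names are illustrative; ACID tables must be stored as ORC, and on older Hive versions also had to be bucketed):

```sql
-- Sketch of a Hive table that supports row-level ACID operations.
CREATE TABLE events (
  id INT,
  payload STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level operations enabled by ACID support:
UPDATE events SET payload = 'fixed' WHERE id = 42;
DELETE FROM events WHERE id = 7;
```

Without `transactional = 'true'`, Hive only allows whole-partition or whole-table rewrites.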

Hadoop/Hive Data Ingestion

Data Ingestion:
Files: Stage the files and use the Hadoop/Hive CLI.
Database: Sqoop; skip CDC for smaller tables and use it only for larger ones (10M+ rows); use the -m option for large database dumps. NiFi is another option.
Streaming: NiFi, Flume, StreamSets. NiFi is popular.
File Ingestion (CSV into TEXTFILE):
Overwrite: Move the file to HDFS and create an external TEXTFILE table on top of the HDFS location. You can also create the table and use "LOAD DATA LOCAL INPATH localpath OVERWRITE INTO TABLE tablename". This approach is handy for internal tables where no location is specified and you don't know the HDFS warehouse location where the table was created. The LOAD DATA command works for loading data from local as well as HDFS files.
Append: You can still use "LOAD DATA INPATH ... INTO TABLE tablename", or create a temporary table using the overwrite approach and then insert into the original table from the temporary table. The same approach works for partitioned tables.
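
A sketch of the two file-ingestion approaches above (paths and names are hypothetical):

```sql
-- Overwrite: external TEXTFILE table over a staged HDFS directory.
CREATE EXTERNAL TABLE sales_raw (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/staging/sales';

-- Or load a local CSV, replacing existing data (handy for internal tables):
LOAD DATA LOCAL INPATH '/tmp/sales.csv' OVERWRITE INTO TABLE sales_raw;

-- Append: same command without OVERWRITE, here reading from HDFS.
LOAD DATA INPATH '/data/incoming/sales_new.csv' INTO TABLE sales_raw;
```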

HCFS

Hadoop Compatible File System

Data Lakes

Advances in cloud computing and big data processing have enabled data lakes, which are becoming a natural choice for organizations looking to harness the power of their data. A data lake creates a central repository for all sorts of data: structured, semi-structured, or unstructured. Data lakes store data from all sorts of sources and in all sorts of formats; no preparation is required before storing the data, and huge quantities can be stored cost-effectively. Data pipelines are set up to cleanse and transform the data. Data can be consumed in multiple ways: via interactive queries, or by exporting into data warehousing or business intelligence solutions.
Functional Areas:
Data Ingestion or Collection: batch or streaming.
Catalog & Search: data cataloging, metadata creation, tagging.
Manage & Secure Data: cost-effective storage; security (access restrictions and encryption at rest).
Processing: cleansing and transformation, ETL or ELT pipelines, raw data to…

EC2-Classic

This was the initial EC2 platform. All EC2 instances were launched into a single flat network shared by all customers; there was no concept of a VPC. Accounts created after 2013-12-04 do not have support for EC2-Classic.

ELBs

ALB is a layer 7 load balancer. It is context/application aware and capable of content-based routing: it examines the content of the request and forwards it accordingly. It also supports AWS Outposts. NLB is a layer 3/4 load balancer. It does not know anything about the application it is load balancing; it forwards traffic based on connection parameters (source IP, source port, TCP sequence number). Each TCP connection can have a different port and sequence number even when it comes from the same client, so requests from the same client can be forwarded to different targets, though traffic from a single connection goes to the same target for the entire connection duration. The only difference for UDP traffic is that it has no sequence number, so packets are forwarded based on source IP and source port. This kind of routing does not require much processing, so it is very fast and can handle millions of requests per second. NLB also provides one static IP (an Elastic IP if internet facing) per zone. NLB only pres…
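
The flow-based forwarding described above can be modeled as a hash over the connection parameters. A toy sketch (the hash and target names are illustrative, not the actual NLB algorithm):

```python
import hashlib

def pick_target(targets, src_ip, src_port, dst_port, protocol="tcp"):
    """Toy flow-hash routing: the same flow tuple always maps to the
    same target; a new source port may map somewhere else."""
    flow = f"{protocol}:{src_ip}:{src_port}:{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(flow).digest()[:8], "big")
    return targets[digest % len(targets)]

targets = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]

# The same connection (same flow tuple) is always routed to one target...
a = pick_target(targets, "203.0.113.5", 54321, 443)
assert a == pick_target(targets, "203.0.113.5", 54321, 443)

# ...while a new connection from the same client (new source port) may not be.
b = pick_target(targets, "203.0.113.5", 54322, 443)
print(a, b)
```

This is why a single TCP connection stays on one target while separate requests from the same client can land on different targets.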

IPv6

IPv6: 128-bit address space (8 groups of 4 hexadecimal digits separated by colons). IPv4: 32-bit (4 groups of integers between 0-255 separated by dots). The concept of private networks helped deal with the scarcity of IP addresses in the IPv4 protocol. A machine on a private network didn't have its own public IP but used a NAT server instead. NAT (Network Address Translation) facilitates communication from machines on private networks to the Internet. Since machines on private networks don't have public IPs, their traffic goes through a NAT device (router or firewall), which takes the private source IP, assigns its own public IP address to the traffic, and sends it on to the external public IP. It does the reverse to the reply that comes back to it from the external system.
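
The rewrite-and-reverse behavior of NAT can be sketched with a small translation table. A toy model (IPs and the port pool are hypothetical; real NAT devices track full connection state):

```python
import itertools

class ToyNat:
    """Toy NAT: rewrites (private_ip, port) -> (public_ip, allocated_port)
    on the way out, and reverses the mapping for replies."""
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.ports = itertools.count(40000)  # hypothetical port pool
        self.table = {}                      # public_port -> (priv_ip, priv_port)

    def outbound(self, priv_ip, priv_port, dst_ip, dst_port):
        pub_port = next(self.ports)
        self.table[pub_port] = (priv_ip, priv_port)
        # The packet leaves with the NAT's public IP as its source.
        return (self.public_ip, pub_port, dst_ip, dst_port)

    def inbound(self, dst_port):
        # A reply addressed to (public_ip, dst_port) is forwarded back inside.
        return self.table[dst_port]

nat = ToyNat("198.51.100.7")
src = nat.outbound("192.168.1.10", 5555, "93.184.216.34", 443)
print(src)                  # ('198.51.100.7', 40000, '93.184.216.34', 443)
print(nat.inbound(src[1]))  # ('192.168.1.10', 5555)
```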

HTTP/2

Many efforts were made to address HTTP/1.1's issues; HTTP/2 originated out of those, especially Google's SPDY protocol. HTTP/2 extends HTTP/1.1 rather than replacing it and is fully backward compatible: it is an implementation change that keeps the interface the same. Major differences from HTTP/1.1:
Server Push - allows servers to push additional content needed for page loading without the browser requesting it.
Request Multiplexing - allows multiple concurrent requests over the same connection.
Request prioritization.
Header compression.
Binary Message Framing - more efficient processing of messages through the use of binary framing.
For more details see https://developers.google.com/web/fundamentals/performance/http2. One major problem with HTTP/1.1 was "head-of-line blocking". HTTP/2 fixed it by multiplexing and prioritizing requests over a connection, but the problem still remains at the TCP level: one lost packet in the TCP stream makes all streams wait until it is retransmitted.
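
Request multiplexing can be illustrated with a toy model of binary framing: each response is split into frames tagged with a stream ID, the frames are interleaved onto one connection, and the receiver reassembles them per stream (a sketch, not the real HTTP/2 frame format):

```python
def frame(stream_id, data, chunk=4):
    """Split one response into (stream_id, chunk) frames."""
    return [(stream_id, data[i:i + chunk]) for i in range(0, len(data), chunk)]

def interleave(*streams):
    """Round-robin frames from several streams onto one connection."""
    wire, queues = [], [list(s) for s in streams]
    while any(queues):
        for q in queues:
            if q:
                wire.append(q.pop(0))
    return wire

def reassemble(wire):
    out = {}
    for sid, chunk in wire:
        out[sid] = out.get(sid, "") + chunk
    return out

# Two responses share one connection instead of going sequentially.
wire = interleave(frame(1, "index.html ..."), frame(3, "style.css ..."))
print(reassemble(wire))   # {1: 'index.html ...', 3: 'style.css ...'}
```

At the TCP layer, though, the interleaved frames are still one byte stream, which is why a single lost packet stalls every stream.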

gRPC

gRPC is a Remote Procedure Call framework used for high-performance communication between services. It is an alternative to REST, especially for communication between microservices. gRPC uses the HTTP/2 protocol, and it encodes data with protocol buffers, which are lighter and more efficient than JSON/XML in terms of bandwidth consumed during data transfer. HTTP/2's request multiplexing also lets multiple requests/responses be served at the same time rather than sequentially. gRPC is built to overcome the limitations of REST in microservice communication. Understanding gRPC
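
A hypothetical protocol buffers service definition, to show the shape of the schema (all names here are illustrative):

```protobuf
syntax = "proto3";

package inventory;

// Field numbers, not field names, go on the wire, which is one reason
// protobuf payloads are smaller than equivalent JSON/XML.
message ItemRequest {
  int64 item_id = 1;
}

message ItemReply {
  int64 item_id = 1;
  string name   = 2;
  double price  = 3;
}

service InventoryService {
  rpc GetItem (ItemRequest) returns (ItemReply);
}
```

The protoc compiler generates client stubs and server skeletons from this definition for the languages both services use.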

TCP vs UDP

TCP (Transmission Control Protocol): connection-based, flow control, error checking, in-order delivery, guaranteed delivery, relatively slower (FTP, HTTP/HTTPS, SSH, POP/IMAP, SMTP, DNS).
UDP (User Datagram Protocol): connectionless, no flow control, no error checking, packet loss possible and packets can arrive out of order, faster (VPN tunneling, video streaming, online games, live broadcasting, DNS, VoIP, TFTP).
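
The connectionless nature of UDP is easy to see with Python's socket module: there is no handshake, each sendto() is an independent datagram (a minimal loopback sketch):

```python
import socket

# Minimal UDP exchange over the loopback interface.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))            # OS picks a free port
addr = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"ping", addr)             # no connection setup, just send

data, client_addr = server.recvfrom(1024)
server.sendto(b"pong", client_addr)

reply, _ = client.recvfrom(1024)
print(reply)                             # b'pong'

client.close()
server.close()
```

A TCP version of the same exchange would need listen()/accept() on the server and connect() on the client before any data moves.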

SSL vs. TLS

SSL and TLS are both cryptographic protocols, but SSL is the older version and has been replaced by TLS. These protocols allow the client to authenticate the server and encrypt the traffic between server and client. Nice article on how SSL cryptography works: SSL Cryptography. In short, the server sends its public key to the browser/client. The browser generates a symmetric session key and encrypts it using the server's public key. The server decrypts it and retrieves the symmetric session key. Now the browser and server both communicate by encrypting/decrypting data with the symmetric session key, which is used for that session only. In short, SSL ends up using both asymmetric and symmetric encryption. Asymmetric (public key) encryption algorithms include: RSA (the public key contains the product of two large primes and the private key is derived from those two large prime numbers), ECC (Elliptic Curve Cryptography - relies on the fact that it is impractical to find the discrete logarithm of a random elliptic curve element in relation to a publicly known base point).
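
The message flow above (asymmetric once to share a key, symmetric thereafter) can be sketched with a toy. Note the "encryption" here is plain XOR and the key pair is faked; this is NOT real cryptography, it only mimics the sequence of steps:

```python
import os

def xor(data, key):
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# 1. Server "key pair" (toy: the same bytes stand in for both halves).
server_private = os.urandom(16)
server_public = server_private

# 2. Client generates a random symmetric session key and "encrypts" it
#    with the server's public key.
session_key = os.urandom(16)
wrapped = xor(session_key, server_public)

# 3. Server "decrypts" with its private key and recovers the session key.
recovered = xor(wrapped, server_private)
assert recovered == session_key

# 4. Both sides now use the shared symmetric key for the session's traffic.
ciphertext = xor(b"hello over TLS", recovered)
print(xor(ciphertext, session_key))      # b'hello over TLS'
```

In real TLS, step 2-3 uses RSA or an elliptic-curve key exchange, and step 4 uses an authenticated symmetric cipher such as AES-GCM.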

OSI

The purpose of the OSI (Open Systems Interconnection) model is to provide a set of design standards for equipment manufacturers so their equipment can communicate with each other. Nice explanation of the various layers: OSI & TCP/IP Models. More details on the OSI layers and how different protocols and devices fit into them: OSI Layers, Protocols, Devices. Routers work at the Network layer (3), switches work at the Data Link layer (2), and hubs and cables work at the Physical layer (1). Network Devices: Routers, Switches, Hubs. One key difference between a switch and a hub is that a switch forwards data frames intelligently to the port connected to the destination device, while a hub sends them to all ports. A hub tends to cause traffic congestion in the network: data frames end up at devices that are not the intended destination, and those devices have to process them unnecessarily only to figure out they are not the intended recipient. A router is needed to route traffic between different networks.
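
The hub-vs-switch difference can be sketched as two forwarding policies: a hub repeats frames everywhere, while a switch learns which MAC address lives on which port and forwards only there (a toy model; port counts and MACs are illustrative):

```python
class Hub:
    """Toy hub: repeats every frame out of every port except the ingress."""
    def __init__(self, n_ports):
        self.n_ports = n_ports

    def forward(self, in_port, frame):
        return [p for p in range(self.n_ports) if p != in_port]

class Switch:
    """Toy switch: learns source MACs, forwards only to the known port,
    and floods like a hub only when the destination is still unknown."""
    def __init__(self, n_ports):
        self.n_ports = n_ports
        self.mac_table = {}

    def forward(self, in_port, src_mac, dst_mac):
        self.mac_table[src_mac] = in_port           # learn where src lives
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]        # unicast to one port
        return [p for p in range(self.n_ports) if p != in_port]  # flood

hub = Hub(4)
print(hub.forward(0, "any"))      # [1, 2, 3] -- every other port gets it

sw = Switch(4)
print(sw.forward(0, "aa", "bb"))  # [1, 2, 3] -- 'bb' unknown yet, flood
print(sw.forward(1, "bb", "aa"))  # [0]       -- 'aa' was learned on port 0
```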

Blue Green Deployment

A deployment strategy for minimal or no downtime. This strategy becomes more feasible and relevant in cloud environments as infrastructure provisioning becomes automated. Let us call the existing prod environment Blue. 1. Create a clone of it; call it Green. 2. Switch the prod traffic to Green. 3. Update Blue with the changes and test. 4. When everything looks OK, switch the traffic back to Blue. 5. Terminate Green.

Utilities/Tools

Below is a list of tools/sites that can be handy at times: mailinator.com, awwapp.com, mockable.io

RTO & RPO

RTO (Recovery Time Objective): the maximum acceptable time to restore service after a disruption. RPO (Recovery Point Objective): the maximum acceptable amount of data loss, measured as a window of time.

Industry Regulations

There are numerous regulations/guidelines across industries. The major ones are listed here as an executive summary.
Finance:
SOX: The Sarbanes-Oxley Act of 2002 came in response to financial scandals in the early 2000s involving publicly traded companies such as Enron Corporation, Tyco International plc, and WorldCom.
GLBA: The GLBA was an attempt to update and modernize the financial industry. Passed in 1999 under the Clinton administration, it allowed commercial banks to provide financial services such as investments and insurance. It is also known as the repeal of the Glass-Steagall Act of 1933.
PCI DSS: Security guidelines for the payment card industry. You can find the latest version of PCI DSS in the PCI Document Library.
GDPR: GDPR lays out the basic premise that individuals should have control over their own data and places new restrictions on financial institutions and other organizations seeking to store, process or transmit that data (FINANCE AND GDPR: WHAT Y…