Posts

Showing posts from 2020

Spark with Python

References:  https://towardsdatascience.com/pyspark-and-sparksql-basics-6cb4bf967e53 
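A minimal PySpark/SparkSQL sketch along the lines of that article; the input file and column names (people.csv, name, age) are hypothetical:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("pyspark-basics").getOrCreate()

# Load a CSV into a DataFrame; people.csv is a hypothetical input file
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# The DataFrame API and SparkSQL give two views of the same data
df.filter(df.age > 30).show()

df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```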

Transactions in HIVE

Full ACID semantics at the row level have been supported since Hive 0.13; earlier, transactions were only supported at the partition level.
Atomicity: the entire operation is a single unit; either all of it happens or none of it does.
Consistency: once an operation is completed, every subsequent operation sees its result.
Isolation: one user's operation does not impact another user's.
Durability: once an operation is done, its result remains thereafter.
At present, Hive's isolation is only at the snapshot level. Various DBMSs offer the following isolation levels:
Snapshot: the snapshot of data at the beginning of the transaction is visible throughout the transaction; whatever happens in other transactions is never seen by this one.
Dirty Read: uncommitted updates from other transactions can be seen.
Read Committed: only updates that are committed at the time of the read are seen by this transaction.
Repeatable Read: read locks on data being read and write locks on data being created/updated…
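A hedged sketch (not from the post) of creating and updating a transactional Hive table through PyHive; the host, user, and table names are hypothetical, ACID must be enabled on the cluster, and on older Hive versions the table must be a bucketed ORC table:

```python
from pyhive import hive  # pip install pyhive

# Connect to a (hypothetical) HiveServer2 endpoint
conn = hive.connect(host="hive-server.example.com", port=10000, username="etl")
cursor = conn.cursor()

# Transactional tables are ORC with the 'transactional' property set;
# older Hive versions additionally require bucketing, as shown here
cursor.execute("""
    CREATE TABLE IF NOT EXISTS orders (id INT, status STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true')
""")

# Row-level UPDATE/DELETE are only allowed on transactional tables
cursor.execute("UPDATE orders SET status = 'shipped' WHERE id = 42")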

Hadoop/Hive Data Ingestion

Data Ingestion:
Files: stage the files and use the Hadoop/Hive CLI.
Database: Sqoop; full dumps for smaller tables and CDC only for larger ones (10M+ rows); use the -m option (parallel mappers) for large DB dumps. NiFi is another option.
Streaming: NiFi, Flume, StreamSets. NiFi is popular.
File Ingestion (CSV into TEXTFILE):
Overwrite: move the file to HDFS and create an external TEXTFILE table on top of the HDFS location. You can also create the table and use "LOAD DATA LOCAL INPATH '<localpath>' OVERWRITE INTO TABLE <tablename>". This approach is handy for internal tables where no location is specified and you don't know the HDFS warehouse location where the table was created. You can use the LOAD DATA command for loading data from local as well as HDFS files (drop LOCAL for an HDFS source). A sketch of both variants follows this list.
Append: you can still use "LOAD DATA INPATH '<path>' INTO TABLE <tablename>" (without OVERWRITE), or create a temporary table using the overwrite approach and then insert into the original table from the temporary table. The same approach works for partitioned tables…
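A hedged sketch of the two LOAD DATA variants above, issued through PyHive; host, paths, and table names are hypothetical:

```python
from pyhive import hive  # same PyHive connection style; all names are hypothetical

conn = hive.connect(host="hive-server.example.com", port=10000, username="etl")
cursor = conn.cursor()

# Overwrite: replace the table contents with a local CSV
cursor.execute(
    "LOAD DATA LOCAL INPATH '/tmp/users.csv' OVERWRITE INTO TABLE users"
)

# Append: omit OVERWRITE; this source file already sits in HDFS, so no LOCAL
cursor.execute(
    "LOAD DATA INPATH '/landing/users_delta.csv' INTO TABLE users"
)
```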

HCFS

 Hadoop Compatible File System

Data Lakes

Advances in cloud computing and big data processing enabled data lakes, and they are becoming a natural choice for organizations to harness the power of data. A data lake creates a central repository for all sorts of data: structured, semi-structured, or unstructured. Data lakes store data from all sorts of sources and in all sorts of formats. No preparation is required before storing the data, and huge quantities of data can be stored in a cost-effective manner. Data pipelines are set up to cleanse and transform the data. Data can be consumed in multiple ways: via interactive queries, or by exporting into data warehousing or business intelligence solutions.
Functional Areas:
Data Ingestion or Collection: batch or streaming
Catalog & Search: data cataloging, metadata creation, tagging
Manage & Secure Data: cost-effective storage; security (access restrictions and encryption at rest)
Processing: cleansing and transformation, ETL or ELT pipelines, raw data to…

EC2-Classic

This was the initial EC2 platform. In it, all EC2 instances were launched into a flat network shared by all customers; there was no concept of a VPC. Accounts created after 2013-12-04 do not have support for EC2-Classic.

ELBs

ALB is a layer 7 load balancer. It is context/application aware and capable of content-based routing: it examines the content of the request and forwards it accordingly. It also supports AWS Outposts. NLB is a layer 3/4 load balancer. It does not know the application it is load balancing; it just forwards traffic based on connection parameters (source IP, source port, TCP sequence number). Each TCP connection can have a different port and sequence number even if it comes from the same client, so requests from the same client can be forwarded to different targets, though traffic from a single connection goes to the same target for the entire connection duration. The only difference for UDP traffic is that it has no sequence number, so packets are forwarded based on source IP and source port. This kind of routing does not require a lot of processing, so it is very fast and can handle millions of requests per second. NLB also provides one static and elastic IP (elastic IP only if internet-facing) per zone. NLB only pres…

IPv6

IPv6 - 128-bit address space (8 groups of 4 hexadecimal digits separated by colons). IPv4 - 32-bit (4 groups of integers between 0-255 separated by dots). The concept of private networks helped deal with the scarcity of IP addresses in the IPv4 protocol. Machines on a private network didn't have their own public IPs but rather used a NAT server for this. NAT - Network Address Translation - is used to facilitate communication from machines on private networks to the Internet. As machines on private networks don't have public IPs, their traffic goes through NAT (a router or firewall), which takes the private source IP, assigns its own public IP address to the traffic, and sends it to the external public IP. It does the reverse for the reply that comes back to it from the external system.
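A small illustrative sketch using Python's standard ipaddress module to make the address formats and the private-range/NAT idea concrete:

```python
import ipaddress

# IPv4: 32 bits, 4 dotted-decimal groups; this one is in an RFC 1918 private range
v4 = ipaddress.ip_address("192.168.1.10")
print(v4.version, v4.is_private)       # 4 True  -> would sit behind NAT

# IPv6: 128 bits, 8 colon-separated groups of 4 hex digits (here abbreviated)
v6 = ipaddress.ip_address("2001:4860:4860::8888")
print(v6.version, v6.is_global)        # 6 True  -> globally routable, no NAT needed
print(v6.exploded)                     # full 8-group form
```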

HTTP/2

Many efforts were made to address HTTP/1.1 issues, and HTTP/2 originated out of those, esp. Google's SPDY protocol. HTTP/2 extends HTTP/1.1 rather than replacing it and is fully backward compatible; it is an implementation change while keeping the interface the same. Major differences from HTTP/1.1: Server Push - allows servers to push additional content for page loading even without the browser requesting it. Request Multiplexing - allows multiple requests over the same connection. Request prioritization. Header compression. Binary Message Framing - more efficient processing of messages through the use of binary message framing. For more details see https://developers.google.com/web/fundamentals/performance/http2 . There was one major problem with HTTP/1.1, called "head-of-line blocking". HTTP/2 fixed it by multiplexing and prioritizing requests over a connection, but the problem still remains at the TCP level: one lost packet in the TCP stream makes all streams wait until that packet is retransmitted.
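An illustrative, hedged sketch of negotiating HTTP/2 from Python using the httpx client (one of several HTTP/2-capable clients); it assumes httpx is installed with its http2 extra:

```python
# Requires: pip install 'httpx[http2]'
import httpx

# HTTP/2 is negotiated via ALPN during the TLS handshake; the client falls
# back to HTTP/1.1 if the server does not support it
with httpx.Client(http2=True) as client:
    resp = client.get("https://www.google.com/")
    print(resp.http_version)   # "HTTP/2" if negotiation succeeded
```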

gRPC

gRPC - a Remote Procedure Call framework used for high-performance communication between services. It is an alternative to REST, esp. for microservice communication. gRPC uses the HTTP/2 protocol. It uses protocol buffers for encoding data, which is lighter and more efficient than JSON/XML in terms of bandwidth consumed during data transfer. HTTP/2 also allows multiplexing, so multiple requests/responses can be served at the same time rather than sequentially. gRPC is built to overcome the limitations of REST in microservice communication. Understanding gRPC

TCP vs UDP

TCP - Transmission Control Protocol: connection-based, flow control, error checking, in-order delivery, guaranteed delivery, relatively slower (FTP, HTTP/HTTPS, SSH, POP/IMAP, SMTP, DNS). UDP - User Datagram Protocol: connectionless, no flow control, no error checking, packet loss is possible and packets can arrive out of order, faster (VPN tunneling, video streaming, online games, live broadcasting, DNS, VOIP, TFTP).
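A minimal sketch contrasting the two socket types in Python; the host and ports are placeholders (port 9 is the discard service and may be filtered):

```python
import socket

# TCP: connection-oriented; a handshake precedes any data, delivery is ordered and reliable
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect(("example.com", 80))                       # 3-way handshake happens here
tcp.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
print(tcp.recv(1024))                                  # bytes arrive in order
tcp.close()

# UDP: connectionless; just fire a datagram, no handshake, no delivery guarantee
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"ping", ("example.com", 9))                # may be silently dropped
udp.close()
```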

SSL vs. TLS

SSL and TLS are both cryptographic protocols, but SSL is the older version and has been replaced by TLS. These protocols allow authenticating the server and encrypting the traffic between server and client. Nice article on how SSL cryptography works: SSL Cryptography. In short, the server sends its public key to the browser/client. The browser generates a symmetric session key and encrypts it using the server's public key. The server decrypts it and retrieves the symmetric session key. Now the browser and server both communicate by encrypting/decrypting data using the symmetric session key, which is used for that session only. In short, SSL ends up using both asymmetric and symmetric encryption. Asymmetric (public key) encryption algorithms include: RSA (the public key is a product of two large primes and the private key is those two large prime numbers), ECC (Elliptic Curve Cryptography - relies on the fact that it is impractical to find the discrete logarithm of a random elliptic-curve element in relation to a publicly known base point…
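A short illustrative sketch of the client side of this handshake using Python's standard ssl module; the host is a placeholder:

```python
import socket
import ssl

# Perform a TLS handshake and inspect what was negotiated
ctx = ssl.create_default_context()          # verifies the server certificate chain
with socket.create_connection(("example.com", 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
        print(tls.version())                # e.g. 'TLSv1.3'
        print(tls.cipher())                 # negotiated cipher suite
        # After the handshake, application data on this socket is encrypted
        # with the symmetric session key agreed during key exchange
```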

OSI

The purpose of the OSI (Open Systems Interconnection) Model is to provide a set of design standards for equipment manufacturers so their equipment can communicate with each other. Nice explanation of the various layers: OSI & TCP/IP Models. More details on OSI layers and how different protocols and devices fit into them: OSI Layers, Protocols, Devices. Routers work at the Network layer (3), switches work at the Data Link layer (2), and hubs and cables work at the Physical layer (1). Network Devices: Routers, Switches, Hubs. One key difference between a switch and a hub is that a switch switches data frames intelligently to the port connected to the destination device, while a hub sends them to all ports. A hub tends to cause traffic congestion in the network, and data frames end up on devices that are not their intended destination and have to process them unnecessarily only to figure out they are not meant for them. A router is needed to route traffic between different networks…

Blue Green Deployment

A deployment strategy for minimal or no downtime. This strategy becomes more feasible and relevant in cloud environments, where infrastructure provisioning is automated. Let us call the existing prod environment Blue.
1. Create a clone of it; call it Green.
2. Switch the prod traffic to Green.
3. Update Blue with the changes and test.
4. When everything looks OK, switch the traffic back to Blue.
5. Terminate Green.
A sketch of the traffic-switch step on AWS follows.
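A hedged boto3 sketch of shifting traffic between two ALB target groups via listener weights; all ARNs are placeholders and both target groups are assumed to already exist:

```python
import boto3

elbv2 = boto3.client("elbv2")

def shift_traffic(listener_arn, blue_tg_arn, green_tg_arn, green_weight):
    """Route green_weight% of traffic to Green, the rest to Blue."""
    elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {"TargetGroups": [
                {"TargetGroupArn": blue_tg_arn, "Weight": 100 - green_weight},
                {"TargetGroupArn": green_tg_arn, "Weight": green_weight},
            ]},
        }],
    )

# Step 2 of the list above: send all prod traffic to Green (placeholder ARNs)
shift_traffic("arn:aws:...listener", "arn:aws:...blue-tg", "arn:aws:...green-tg", 100)
```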

Utilities/Tools

Below is a list of tools/sites which can be handy at times: mailinator.com, awwapp.com, mockable.io

RTO & RPO

RTO - Recovery Time Objective: the maximum acceptable time to restore service after an outage. RPO - Recovery Point Objective: the maximum acceptable amount of data loss measured in time, i.e., how far back the last recoverable copy may be.

Industry Regulations

There are numerous regulations/guidelines across industries; a majority of them are listed here as an executive summary.
Finance
SOX: The Sarbanes-Oxley Act of 2002 came in response to financial scandals in the early 2000s involving publicly traded companies such as Enron Corporation, Tyco International plc, and WorldCom.
GLBA: The GLBA was an attempt to update and modernize the financial industry. Passed in 1999 under the Clinton administration, it allowed commercial banks to provide financial services like investments, insurance, etc. It is also known as the repeal of the Glass-Steagall Act of 1933.
PCI DSS: Security guidelines for the payment card industry. You can find the latest version of PCI DSS in the PCI Document Library.
GDPR: GDPR lays out the basic premise that individuals should have control over their own data, and places new restrictions on financial institutions and other organizations seeking to store, process, or transmit that data (FINANCE AND GDPR: WHAT Y…

Multiple python versions on Mac

I had macOS Mojave. It needed multiple versions of Python; esp. I needed a version between 2.6.x and 3.0.x. I came across this nice article: https://medium.com/faun/pyenv-multi-version-python-development-on-mac-578736fb91aa
I followed all the steps, but when I tried to install 2.7.0 and 2.7.1 I ran into this issue:
ERROR: The Python ssl extension was not compiled. Missing the OpenSSL lib?
I had OpenSSL installed, but it seems it was not the version the build was looking for. Eventually I was able to install Python 2.7.15. It looks like 2.7.15 used the OpenSSL lib I had installed, hence no issues.
bash-3.2$ pyenv install 2.7.15
python-build: use openssl@1.1 from homebrew
python-build: use readline from homebrew
Downloading Python-2.7.15.tar.xz...
-> https://www.python.org/ftp/python/2.7.15/Python-2.7.15.tar.xz
Installing Python-2.7.15...
python-build: use readline from homebrew
python-build: use zlib from xcode sdk
Installed Python-2.7.15 to /Users/glbairwa/.pyenv…

REST API Calls from JavaScript

https://www.freecodecamp.org/news/here-is-the-most-popular-ways-to-make-an-http-request-in-javascript-954ce8c95aaa/

Microsoft Active Directory

I followed the following blog post for installing Active Directory in my own AWS account: https://www.ecloudture.com/en/use-ec2-to-build-windows-active-directory-2/ I ran into one issue: when I tried to add a machine (PC01 in the post) into the domain (ADLAB.com), I got an error. When I looked into AD Server Manager, on the AD DS service, DFS Replication was failing with the following error: Additional Information: Error: 1355 (The specified domain either does not exist or could not be contacted.) The DNS service was also showing an error/warning complaining that it was waiting for some signal from AD DS. After some research I came across this post: https://community.spiceworks.com/topic/1726627-the-specified-domain-either-does-not-exist-or-could-not-be-contacted One of the answers on this post recommends the following procedure: https://support.microsoft.com/en-us/kb/947022 I followed the procedure and after that I was able to add the machine to the domain. I am not sure whether following t…

Hashing

Interesting article on Hash Functions. Consistent Hashing is a strategy used in system design, esp. for scaling backend data stores. The typical hash-based sharding used to horizontally scale databases doesn't handle adding/removing shards efficiently; consistent hash sharding takes care of that. Here is a nice article comparing different sharding strategies.
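A minimal consistent-hash ring sketch in Python (illustrative, not from the referenced articles); virtual nodes smooth out the key distribution, and adding or removing a node remaps only the keys between ring neighbors:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes          # virtual nodes per physical node
        self.ring = []                # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get(self, key):
        # The first virtual node clockwise from the key's hash owns the key
        i = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
print(ring.get("user:42"))   # adding "db4" later remaps only ~1/4 of the keys
```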

Angular

Angular. Recently came across this framework. It lets you do lots of cool things which are difficult otherwise.
1. Getting JavaScript object values written to the DOM or HTML field names:

<label ng-repeat="obj in objCollection" class="radio-inline">
  <input name="obj" id="obj-{{$index}}" value="{{obj.value}}" ng-model="scopedObj.val" type="radio">
  {{obj.label}}
</label>

objCollection is the collection of obj objects. objCollection and scopedObj are both in the scope.
2. If there is an object available in scope, you can directly access it using a URL.
3. Similar to jQuery, you can make AJAX calls easily:

var uri = "http://mydomain/uri";
var $http = angular.element('html').injector().get('$http');
$http.get(uri, {
}).success(function(data, status, headers, config) {
    console.log(data);
});

Okta

Okta is a cloud-based Identity and Access Management platform. Below are some notes from a whitepaper published on the Okta website. Okta can be used as a foundation for implementing Zero Trust within organizations. Zero Trust is a security framework developed by Forrester Research: never trust, always verify. The Zero Trust framework later evolved into ZTX (Zero Trust eXtended Ecosystem). The focus has shifted from the network perimeter to access (who is accessing the system), i.e., to identity and access control. There are four stages of Zero Trust implementation:
Fragmented Identity
Unified IAM
Contextual Access: context-based access policies
Adaptive Workforce: risk-based access policies, continuous and adaptive authn and authz
Services offered by Okta:
Okta Universal Directory
Okta SSO
Okta Advanced Server Access
Okta Adaptive MFA
Okta also integrates with a number of vendors in the security ecosystem. That helps in pinpointing the root cause of compromise…

Azure Functions

Azure Functions is the Azure equivalent of AWS Lambda. Azure Functions can be triggered by HTTP and can respond to HTTP requests, but as of now there is no input binding for HTTP; a function cannot take input from HTTP via a binding.

Google Cloud

Scopes - Global (network), Region (static external IP), Zone (disks, VMs).
Project (id, name, number) - any GC resource must belong to a project. An ID is unique and can never be reused. A project can be seen as a workspace. Click here to see a comparison of services among various cloud providers.
Cloud SQL is similar to AWS RDS.
Cloud Storage is similar to S3; it has multiple tiers and supports Object Lifecycle Management to transition objects from one storage class to another based on certain criteria.
Workflow is similar to AWS Glue or Azure Data Factory.
BigQuery compares to Redshift in AWS.
Pub/Sub + Dataflow compares to Kinesis in AWS and Event Hubs in Azure.

Enterprise Integration Patterns

Enterprise Integration Patterns

Kerberos

[Diagram: Kerberos authentication flow]
Kerberos is an authentication protocol for trusted hosts on untrusted networks. The authentication among the various parties happens as shown in the diagram. The diagram is based on the video presentation at the following link: Kerberos Authentication Explained | A deep dive

S3 Transfer Acceleration

S3 Transfer Acceleration takes advantage of Amazon CloudFront's globally distributed edge locations. It is used for fast, easy, and secure transfer of files over long distances between a client and an S3 bucket.

AWS DirectConnect

AWS Direct Connect is a service to connect your on-premises systems with AWS without going through the internet. You may need this specifically if you need high speed and/or low latency.

AWS GuardDuty

AWS GuardDuty is a threat detection service that works by continuously analyzing event log data. It can monitor VPC Flow Logs, CloudTrail event logs, and DNS logs, and it integrates with Amazon CloudWatch Events. It generates alerts (findings).

AWS DataSync

AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect. DataSync can copy data between Network File System (NFS) and Server Message Block (SMB) file servers, self-managed object storage, or AWS Snowcone, and Amazon Simple Storage Service (Amazon S3) buckets, Amazon Elastic File System (Amazon EFS) file systems, and Amazon FSx for Windows File Server file systems. DataSync includes encryption and integrity validation to help make sure your data arrives securely and intact. DataSync does both full initial copies and incremental transfers of changing data.

Elasticsearch

What is Elasticsearch
ELK stack - Elasticsearch, Logstash, and Kibana; with Beats added, it is known as the Elastic Stack.

AWS DynamoDB

Global & Local Secondary Indexes: used when data access patterns cannot be accommodated using the primary keys alone.
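A hedged boto3 sketch of declaring a global secondary index at table-creation time; the table, attribute, and index names are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="orders",
    AttributeDefinitions=[
        {"AttributeName": "order_id", "AttributeType": "S"},
        {"AttributeName": "customer_id", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
    GlobalSecondaryIndexes=[{
        # Lets you query orders by customer, an access pattern the primary key can't serve
        "IndexName": "by-customer",
        "KeySchema": [{"AttributeName": "customer_id", "KeyType": "HASH"}],
        "Projection": {"ProjectionType": "ALL"},
    }],
)
```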

AWS Big Data Specialty

Ingestion Tools Delivery:
Guaranteed ordering: delivered by all AWS services except Firehose and SQS (Standard).
Exactly-once delivery: only DynamoDB Streams and Amazon SQS (FIFO); all others are at-least-once.
AWS Lambda: limited buffering capabilities.
AWS EMR: single Availability Zone.
AWS Redshift does not support resource-based policies.
DynamoDB provides fine-grained access to your tables and data.

PowerBI Gateway

References: https://blog.pragmaticworks.com/power-bi-and-data-security-on-premises-data-gateway

AWS GLUE

AWS Glue - batch jobs, ETL, minimum 5-minute intervals, no support for NoSQL stores; it is not suitable for heterogeneous processing (use AWS Data Pipeline for that). Configurable DPUs (Data Processing Units), fully managed (serverless), scale-out Apache Spark environment, pay-as-you-go ETL service. It discovers and profiles data via the Glue Data Catalog, generates ETL code to transform data into the target schema, can run the job to load data into the destination, and allows you to configure, orchestrate, and monitor complex data flows. The AWS Glue Data Catalog is Apache Hive Metastore compatible and is a drop-in replacement for the Apache Hive Metastore for big data applications running on Amazon EMR. In the context of updating the metadata, whatever you can do with a Hive DB, you can also do with the AWS Glue Data Catalog. AWS Glue = Data Catalog + Flexible Scheduler. For supported data sources see the AWS Glue FAQ. AWS Glue can also be used for complex ETL of streaming data. If the focus is on delivery of streaming data, u…

AWS DMS

AWS DMS: for one-time migration and ongoing replication, i.e., change data capture (CDC), with no impact on the source database. For CDC it uses native database APIs to read change logs from the source DB and replays them in the target store. It uses EC2 as the replication instance, which can be scaled up/down.

DynamoDB Streams

DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table and stores this information in a log for up to 24 hours.  
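A hedged sketch of reading such a stream directly with boto3's low-level API (the stream ARN is a placeholder; Lambda triggers or the Kinesis adapter are the more common consumers):

```python
import boto3

streams = boto3.client("dynamodbstreams")
stream_arn = "arn:aws:dynamodb:...:table/orders/stream/..."  # placeholder

# A stream is split into shards, much like Kinesis
shard = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]
it = streams.get_shard_iterator(
    StreamArn=stream_arn,
    ShardId=shard["ShardId"],
    ShardIteratorType="TRIM_HORIZON",   # oldest record still in the 24-hour window
)["ShardIterator"]

for record in streams.get_records(ShardIterator=it)["Records"]:
    print(record["eventName"], record["dynamodb"].get("Keys"))  # INSERT/MODIFY/REMOVE
```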

AWS Kinesis

Various Kinesis services:
Kinesis Data Streams: collect streaming data, at scale, for real-time analytics (sub-second latency), custom processing, choice of processing framework, no limit on the number of shards. Each shard: 1 MB/sec write, 2 MB/sec read, 1,000 puts/sec.
Kinesis Data Firehose: prepare and load real-time data streams into data stores and analytics services; 60-second or higher latency; use existing analytics tools on S3, Redshift, Elasticsearch; zero administration; transformation of data; autoscaling to match throughput.
Kinesis Agent: a stand-alone Java application that offers an easy way to collect and send data to Kinesis Data Streams.
https://aws.amazon.com/blogs/architecture/serverless-stream-based-processing-for-real-time-insights/
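A hedged boto3 sketch of writing a single record to a Kinesis Data Stream; the stream name and payload are hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream",                                   # placeholder stream
    Data=json.dumps({"user": "42", "action": "click"}).encode(),
    PartitionKey="42",   # records with the same key land on the same shard, preserving order
)
```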

Bootable Backup

If you ever need to replace or back up your bootable hard drive, you can use software (Carbon Copy Cloner) from bombich.com to clone it. It will make a bootable clone.

Analytics

Types of Analytics (notes from an AWS course):
Descriptive - What happened? hindsight
Diagnostic - Why did it happen? hindsight & insight
Predictive - What will happen? insight & foresight
Prescriptive - What should I do? foresight
Cognitive & AI - What are the recommended actions? foresight & hypothesis input
Popular BI tools: https://www.cio.com/article/3284415/10-bi-tools-for-data-visualization.html

Apache Knox

https://www.adaltas.com/en/2019/02/04/apache-knox/

ODBC Test

If you are developing, debugging, or supporting ODBC drivers, Microsoft has a great utility called ODBC Test which appears very useful. To know more about ODBC, click on this link.

DBeaver

Recently I used this tool for connecting to Hive via the Knox gateway. A good thing about this tool is that it automatically downloads the driver and configures it; that is a pain in tools like SQuirreL. It also makes a JDBC connection to Hive. Below is how the HOST and JDBC URL settings look:
HOST: <KNOX_HOST>:8443/;ssl=true;transportMode=http;httpPath=gateway/<topology>/hive
JDBC URL: jdbc:hive2://<KNOX_HOST>:8443/;ssl=true;transportMode=http;httpPath=gateway/<topology>/hive
What I noticed on my setup was that if you keep the default port (10000) filled in while the URL uses 8443, it does not work, so you have to leave the port blank. That may be specific to my setup. Also, you might notice that ssl=true but there are no sslTrustStore=<path to the ".jks" file in the NiFi node>;trustStorePassword=<TrustStore password> parameters. The reason is that if you import the gateway cert into the trust store at this location "<DBeaver_Insta…