Data Lakes

Advances in cloud computing and big data processing have enabled data lakes, which are becoming a natural choice for organizations looking to harness the power of their data. A data lake is a central repository for all sorts of data: structured, semi-structured, or unstructured. It stores data from all kinds of sources and in all kinds of formats, and no preparation is required before storing the data. Huge quantities of data can be stored in a cost-effective manner. Data pipelines are set up to cleanse and transform the data, and the data can be consumed in multiple ways: via interactive queries, or by exporting it into data warehousing or business intelligence solutions.

Functional Areas
  • Data Ingestion or Collection: batch or streaming
  • Catalog & Search: data cataloging, metadata creation, tagging
  • Manage & Secure Data: cost-effective storage; security (access restrictions and encryption at rest)
  • Processing: cleansing and transformation, ETL or ELT pipelines, raw data to processed data
  • Consumption: interactive query, data warehousing systems, BI tools
  • Access & User Interface: administrative tasks

AWS Services: 

Data Ingestion: DMS, Kinesis, Glue, Lambda
Catalog & Search: AWS Glue, DynamoDB, Amazon Elasticsearch Service
Manage & Secure Data: S3, IAM, KMS, CloudTrail, CloudWatch
Processing: Amazon EMR (Hive, Spark), AWS Glue (Python, Scala), Apache Hive
Consumption: Apache Hive, Amazon Athena, Amazon Redshift, Amazon QuickSight
User Interface: AWS AppSync, Amazon API Gateway, Amazon Cognito
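
As a sketch of the consumption layer, Amazon Athena can run interactive SQL directly over raw files in S3 with no ETL in between; the bucket, table, and column names below are hypothetical:

    -- Hypothetical example: expose CSV files in S3 as a queryable table.
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
      order_id   INT,
      order_date STRING,
      amount     DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-data-lake/raw/sales/'
    TBLPROPERTIES ('skip.header.line.count' = '1');

    -- Interactive query over the raw data.
    SELECT order_date, SUM(amount) AS daily_total
    FROM sales_raw
    GROUP BY order_date;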

Non-AWS

Data Ingestion:

    Files: stage the files and use the Hadoop CLI (see the sketch after this list)
    Database: Sqoop; incremental (CDC) imports pay off only for larger tables (10M+ rows), not for smaller ones; use the -m option to parallelize a large database dump; NiFi is another option
    Streaming: NiFi, Flume, StreamSets; NiFi is popular
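
A minimal sketch of the file and database paths above, assuming a local staging directory and a MySQL source; the hosts, paths, and table names are hypothetical:

    # Stage local files into HDFS with the Hadoop CLI.
    hdfs dfs -mkdir -p /data/raw/sales
    hdfs dfs -put /staging/sales_2021.csv /data/raw/sales/

    # Import a large table with Sqoop, using the -m option
    # to run the dump with 8 parallel mappers.
    sqoop import \
      --connect jdbc:mysql://db-host:3306/shop \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      -m 8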

   
File Ingestion:
    CSV into TEXTFILE: move the file to HDFS, then create an external TEXTFILE table on top of the HDFS location (sketched below).
    CSV file into PARQUET: first create a temporary TEXTFILE table, then create a PARQUET table and insert from the TEXTFILE table into the PARQUET table (both steps are sketched below).
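
A minimal HiveQL sketch of both steps, reusing the hypothetical HDFS path and columns from the ingestion example above:

    -- Step 1: external TEXTFILE table over the HDFS directory holding the CSV,
    -- skipping the header row (assuming the file has one).
    CREATE EXTERNAL TABLE sales_text (
      order_id   INT,
      order_date STRING,
      amount     DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/raw/sales'
    TBLPROPERTIES ('skip.header.line.count' = '1');

    -- Step 2: Parquet-backed table, populated from the text table;
    -- Hive rewrites the text rows as Parquet files. The text table is
    -- only a staging table and can be dropped afterwards.
    CREATE TABLE sales_parquet (
      order_id   INT,
      order_date STRING,
      amount     DOUBLE
    )
    STORED AS PARQUET;

    INSERT INTO TABLE sales_parquet
    SELECT order_id, order_date, amount FROM sales_text;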