Data Lakes
Advances in cloud computing and big data processing have made data lakes practical, and they are becoming a natural choice for organizations that want to harness the power of their data. A data lake is a central repository for all sorts of data: structured, semi-structured, or unstructured. It stores data from all sorts of sources and in all sorts of formats, and no preparation is required before the data is stored. Huge quantities of data can be stored cost-effectively. Data pipelines are set up to cleanse and transform the data, which can then be consumed in multiple ways: through interactive queries, or by exporting it into data warehousing or business intelligence solutions.
Functional Areas
- Data Ingestion or Collection: batch or streaming
- Catalog & Search: data cataloging, metadata creation, tagging
- Manage & Secure Data: cost-effective storage; security (access restrictions & encryption at rest)
- Processing: cleansing & transformation, ETL or ELT pipelines, raw data to processed data
- Consumption: interactive query, data warehousing systems, BI tools
- Access & User Interface: administrative tasks
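To make the Processing step concrete, here is a minimal sketch of turning raw data into processed data using only the Python standard library. The column names (id, amount) and the cleansing rules are assumptions made up for illustration; a real pipeline would normally run in an ETL tool rather than in a single function.

```python
import csv
import io


def cleanse(raw_csv: str) -> list:
    """Turn raw CSV text into cleaned records (illustrative rules only)."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    cleaned = []
    for row in reader:
        # Trim whitespace from every named field; missing values become "".
        row = {k.strip(): (v or "").strip()
               for k, v in row.items() if k is not None}
        # Assumed rule: rows without an id are unusable and get dropped.
        if not row.get("id"):
            continue
        # Assumed rule: amount must parse as a number, otherwise skip the row.
        try:
            row["amount"] = float(row["amount"])
        except (KeyError, ValueError):
            continue
        cleaned.append(row)
    return cleaned


# Hypothetical raw input: one good row, one without id, one bad amount, one good row.
RAW = "id,amount\n1, 10.5 \n,3\n2,abc\n3,7\n"
PROCESSED = cleanse(RAW)
```

The raw text stays untouched, which matches the idea that the lake keeps data as it arrived and the pipeline produces a separate processed copy.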
AWS Services:
Data Ingestion: DMS, Kinesis, Glue, Lambda
Catalog & Search: AWS Glue, DynamoDB, Amazon Elasticsearch Service
Manage & Secure Data: S3, IAM, KMS, CloudTrail, CloudWatch
Processing: Amazon EMR (Hive, Spark), AWS Glue (Python, Scala), Apache Hive
Consumption: Apache Hive, Amazon Athena, Amazon Redshift, Amazon QuickSight
User Interface: AWS AppSync, Amazon API Gateway, Amazon Cognito
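As an illustration of the Consumption side, the sketch below submits an interactive query to Athena and waits for it to finish. It assumes a boto3-style Athena client is passed in; the database name, SQL, and S3 output location in the usage note are placeholders, not values from this post.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query and block until Athena reports a terminal state."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

With boto3 installed and credentials configured, the client would typically come from boto3.client("athena"), for example run_athena_query(client, "SELECT 1", "lake_db", "s3://example-bucket/athena-results/"). Passing the client in keeps the function easy to try out with a stub.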
Non-AWS
Data Ingestion:
Files: stage the files and use the Hadoop CLI.
Database: use Sqoop. Skip CDC for smaller tables and use it only for larger (10M+) ones. Use the -m option (number of parallel map tasks) for large database dumps. NiFi is another option.
Streaming: NiFi, Flume, StreamSets. NiFi is the popular choice.
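For the database case, a Sqoop import with the -m option might be assembled as in the sketch below. The JDBC URL, table, user, and target directory are hypothetical; the flags (--connect, --username, -P, --table, --target-dir, -m, --split-by) are standard Sqoop import arguments. The command is only built here, not executed.

```python
import shlex


def sqoop_import_command(jdbc_url, table, target_dir, username,
                         split_by=None, num_mappers=1):
    """Build the argument list for a `sqoop import` of one table."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                      # prompt for the password at run time
        "--table", table,
        "--target-dir", target_dir,
        "-m", str(num_mappers),    # number of parallel map tasks
    ]
    # With several mappers Sqoop splits the work on a column; it defaults
    # to the primary key, and --split-by names a column explicitly.
    if num_mappers > 1 and split_by:
        args += ["--split-by", split_by]
    return args


# Hypothetical large table dumped with 8 parallel mappers.
ARGS = sqoop_import_command(
    "jdbc:mysql://db.example.com/sales", "orders", "/data/raw/orders",
    "etl_user", split_by="order_id", num_mappers=8,
)
COMMAND = shlex.join(ARGS)  # shell-safe string for logging or a scheduler
```

Returning a list rather than a string means the same arguments can be handed to subprocess.run without shell quoting issues.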
File Ingestion:
CSV into TEXTFILE: move the file to HDFS, then create an external TEXTFILE table on top of the HDFS location.
CSV into PARQUET: first create a temporary TEXTFILE table, then create a PARQUET table, and finally insert from the TEXTFILE table into the PARQUET table.
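The two file ingestion recipes above can be sketched as generated HiveQL. The table name, columns, and HDFS path are assumptions for illustration, and the statements are only produced as strings; they would still need to be run through a Hive client after the CSV has been copied to HDFS (for example with hdfs dfs -put).

```python
def csv_to_parquet_statements(table, columns, hdfs_location):
    """Return the HiveQL for CSV -> TEXTFILE staging table -> PARQUET table.

    columns is a list of (name, hive_type) pairs.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    staging = f"{table}_staging"  # the temporary TEXTFILE table
    return [
        # Step 1: external TEXTFILE table over the directory holding the CSV.
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {staging} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_location}'",
        # Step 2: managed table with the same columns, stored as Parquet.
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\nSTORED AS PARQUET",
        # Step 3: copy the rows; Hive rewrites them in Parquet format.
        f"INSERT INTO TABLE {table} SELECT * FROM {staging}",
    ]


# Hypothetical table definition and HDFS directory.
STATEMENTS = csv_to_parquet_statements(
    "orders",
    [("order_id", "BIGINT"), ("amount", "DOUBLE")],
    "/data/raw/orders",
)
```

For the plain CSV into TEXTFILE case, only the first statement is needed. Once the insert has finished, the staging table can be dropped; because it is external, dropping it leaves the CSV file in HDFS.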