End-to-end batch processing, data orchestration, and real-time streaming analytics on GCP

Google Cloud Platform is catching up, and many companies have already started moving their infrastructure to GCP. I have worked on multiple projects aimed at giving practical solutions to real-world data engineering use cases on the cloud. These projects are designed around the end-to-end lifecycle of a typical big data ETL project, covering both batch processing and real-time streaming and analytics.

Considering the most important components of any batch-processing or streaming job, this project makes use of:

  1. Writing ETL jobs in PySpark from scratch (see the first sketch after this list)
  2. Storage components on GCP (GCS and Dataproc HDFS)
  3. Loading data into GCP's data warehouse, BigQuery
  4. Writing data orchestration and handling dependencies with Apache Airflow (Cloud Composer) in Python from scratch (DAG sketch below)
  5. Batch data ingestion using Sqoop, Cloud SQL, and Apache Airflow (ingestion sketch below)
  6. Real-time data streaming and analytics using the latest API, Spark Structured Streaming, with Python (streaming sketch below)
  7. Micro-batching using PySpark streaming and Hive on Dataproc (covered in the streaming sketch)
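
A minimal sketch of the batch ETL pattern from items 1 to 3, assuming a Dataproc cluster with the spark-bigquery connector available; every bucket, dataset, and table name here is a placeholder, not the project's actual one:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-etl").getOrCreate()

# Extract: read raw CSV files from a GCS bucket (placeholder path).
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("gs://my-bucket/raw/events/"))

# Transform: a simple daily aggregation as an illustration.
daily = (raw
         .withColumn("event_date", F.to_date("event_ts"))
         .groupBy("event_date", "event_type")
         .agg(F.count("*").alias("event_count")))

# Load: write to BigQuery; the connector stages data through a temporary GCS bucket.
(daily.write
 .format("bigquery")
 .option("table", "my_project.analytics.daily_events")
 .option("temporaryGcsBucket", "my-temp-bucket")
 .mode("overwrite")
 .save())
```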
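
Orchestration (item 4) can be expressed as an Airflow DAG running on Cloud Composer. A hedged sketch, assuming the Google provider package is installed; the project ID, region, cluster name, and script path are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

# Placeholder job spec pointing at the PySpark ETL script on GCS.
PYSPARK_JOB = {
    "reference": {"project_id": "my-project"},
    "placement": {"cluster_name": "my-dataproc-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/batch_etl.py"},
}

with DAG(
    dag_id="batch_etl_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit the batch ETL job to the Dataproc cluster once per day.
    run_etl = DataprocSubmitJobOperator(
        task_id="run_batch_etl",
        job=PYSPARK_JOB,
        region="us-central1",
        project_id="my-project",
    )
```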
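
For batch ingestion (item 5), one common pattern is an Airflow task that shells out to Sqoop to pull a table from Cloud SQL (MySQL) into HDFS. This is a sketch under that assumption; the JDBC host, credentials file, and table name are hypothetical, and in a real Composer setup the command would typically run over SSH on a node where Sqoop is installed rather than on the Composer worker itself:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sqoop_ingest_orders",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Import the "orders" table from Cloud SQL into HDFS; all
    # connection details below are placeholders.
    sqoop_import = BashOperator(
        task_id="sqoop_import_orders",
        bash_command=(
            "sqoop import "
            "--connect jdbc:mysql://10.0.0.5:3306/retail_db "
            "--username sqoop_user --password-file /user/airflow/.pw "
            "--table orders "
            "--target-dir /data/raw/orders "
            "--num-mappers 4"
        ),
    )
```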
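
Items 6 and 7 come together in Spark Structured Streaming, where `foreachBatch` gives micro-batch semantics over the stream. A minimal sketch assuming a Kafka source; the broker address, topic, checkpoint path, and Hive table are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("stream-analytics")
         .enableHiveSupport()
         .getOrCreate())

# Read a stream of events from Kafka; the value column arrives as bytes.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(F.col("value").cast("string").alias("payload")))

def write_batch(batch_df, batch_id):
    # Each micro-batch is appended to a Hive table on the cluster
    # (assumes the "analytics" database already exists).
    batch_df.write.mode("append").saveAsTable("analytics.raw_events")

query = (events.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "gs://my-bucket/checkpoints/events")
         .start())

query.awaitTermination()
```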

What I learnt