Data Engineering

From Andy’s Data Science Wiki

Latest revision as of 18:54, 25 November 2024

Overview

Data Engineering focuses on designing, building, and optimizing systems for data collection, storage, and processing. I specialize in creating scalable ETL pipelines, managing data warehouses, and enabling robust analytics.

Skills and Tools

  • ETL Pipelines: Airflow, Luigi
  • Data Warehousing: BigQuery, Snowflake
  • Programming: Python, SQL, Spark
  • Cloud Platforms: AWS, GCP
  • Database Management: PostgreSQL, MySQL

Featured Projects

ETL Optimization for Streaming Data
  • Objective: Built an ETL pipeline for real-time data ingestion
  • Tools Used: Apache Kafka, Airflow, and BigQuery
  • Outcome: Reduced processing time by 40% and enabled near real-time analytics
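
The streaming project above can be illustrated with a small micro-batching sketch. This is not the project's actual code: it uses only the Python standard library, a plain iterable stands in for a Kafka topic, and a callable stands in for a BigQuery load step. The names `micro_batches`, `transform`, `run_pipeline`, and the `user_id`/`amount` fields are all hypothetical, chosen only for illustration.

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[list]:
    """Group an event stream into lists of at most batch_size items."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def transform(raw: str) -> dict:
    """Parse one JSON event and normalize its types (hypothetical schema)."""
    record = json.loads(raw)
    return {
        "user_id": int(record["user_id"]),
        "amount_usd": round(float(record["amount"]), 2),
    }


def run_pipeline(
    events: Iterable[str],
    load: Callable[[list], None],
    batch_size: int = 2,
) -> int:
    """Extract events, transform each batch, and pass it to load.

    `load` is a stand-in for a warehouse write; returns rows loaded.
    """
    loaded = 0
    for batch in micro_batches(events, batch_size):
        rows = [transform(event) for event in batch]
        load(rows)
        loaded += len(rows)
    return loaded
```

Calling `run_pipeline(stream, sink.append)` with a list `sink` collects each transformed batch in memory, which is a convenient way to try the flow before wiring in a real consumer and warehouse client. Small batches keep latency low; larger ones reduce the number of load calls.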
ETL Optimization for Batch Data
  • Objective: Build an ETL/ELT pipeline for async batch data from 300+ data collectors for Harvard University
  • Tools Used: Amazon Web Services (AWS), Google Cloud Platform (GCP), Python
  • Outcome:
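
One possible shape for pulling batch data from many collectors is sketched below, assuming "async" here means the collectors can be polled concurrently with `asyncio`. The page does not say how the real pipeline works, so everything in this block is hypothetical: `fetch_collector` simulates a collector with no network access, and the failure rule (every hundredth collector fails) exists only to demonstrate error handling.

```python
import asyncio
from typing import Iterable


async def fetch_collector(collector_id: int) -> dict:
    """Simulated collector read (hypothetical; no real network call)."""
    await asyncio.sleep(0)  # stand-in for network I/O
    if collector_id % 100 == 0:
        # Artificial failure so the error path is exercised.
        raise ConnectionError(f"collector {collector_id} unreachable")
    return {"collector_id": collector_id, "rows": collector_id % 5}


async def collect_all(collector_ids: Iterable[int], max_concurrency: int = 20):
    """Fetch from all collectors with bounded concurrency.

    Returns (successful payloads, ids of collectors that failed).
    """
    ids = list(collector_ids)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(cid: int) -> dict:
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return await fetch_collector(cid)

    # return_exceptions=True keeps one failure from cancelling the batch.
    results = await asyncio.gather(
        *(guarded(cid) for cid in ids), return_exceptions=True
    )
    succeeded = [r for r in results if not isinstance(r, Exception)]
    failed = [cid for cid, r in zip(ids, results) if isinstance(r, Exception)]
    return succeeded, failed
```

Running `asyncio.run(collect_all(range(1, 301)))` polls 300 simulated collectors and returns the failed ids separately, so a later run can retry only those instead of the whole batch.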

Tutorials and Resources

Achievements

  • Google Cloud Professional Data Engineer (2024)
  • Published "Scaling Data Pipelines in the Cloud" on Medium