Shambhu Adhikari
Senior Data Engineer
Building large-scale data pipelines and modern data platforms across AWS, Azure, and GCP. Specializing in distributed ETL/ELT workflows, lakehouse architectures, and GenAI-powered data solutions.
About
Senior Data Engineer with extensive experience building large-scale data pipelines and modern data platforms across AWS, Azure, and GCP environments. Proven track record of designing distributed systems and automating complex data processing workflows.
Key Achievements
- Designed distributed ETL/ELT workflows using PySpark, Apache Airflow, AWS Glue, Azure Data Factory, and GCP Dataflow
- Built lakehouse architectures with Delta Lake, Iceberg, and Hudi for batch and streaming workloads
- Implemented GenAI-powered data quality pipelines using OpenAI APIs, LangChain, and custom prompt engineering
- Developed ML-ready feature pipelines supporting the end-to-end model lifecycle and real-time inference
- Created cost-optimized multi-cloud deployments using Terraform and CI/CD workflows
- Secured sensitive data in compliance with HIPAA, HITRUST, and PCI DSS standards
Technical Skills
Big Data
Databases
Languages
Cloud Technologies
AI & ML Tools
Data Formats
DevOps & Tools
Visualization
Professional Experience
Sr. Data Engineer
United Airlines
- Designed and implemented fully automated ETL pipelines on AWS leveraging S3, Glue, Redshift, and Step Functions
- Built event-driven workflows using Amazon EventBridge and Step Functions to orchestrate complex tasks
- Engineered distributed data processing jobs on AWS EMR using PySpark, processing 15+ TB of data weekly
- Tuned Amazon Redshift for high-concurrency workloads, reducing query latency by 40%
- Integrated Amazon Kinesis Data Streams and Firehose for real-time data ingestion
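As an illustration of the kind of layout such an S3/Glue/Redshift pipeline typically relies on, here is a minimal plain-Python sketch of Hive-style partitioned S3 key prefixes, which let Glue crawlers and Redshift Spectrum prune partitions at query time (the bucket and table names below are hypothetical, not from any actual United Airlines system):

```python
from datetime import date

def partition_key(bucket: str, table: str, d: date) -> str:
    """Build a Hive-style partitioned S3 key prefix (year=/month=/day=)
    so downstream engines can skip partitions that a query doesn't need."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    )

# Example: a daily EMR batch writes each run's output under its own prefix.
print(partition_key("flight-data-lake", "bookings", date(2024, 3, 7)))
# s3://flight-data-lake/bookings/year=2024/month=03/day=07/
```

Writing each batch under its own date prefix also makes reruns idempotent: a failed day can be recomputed and overwritten without touching other partitions.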
Sr. ETL Data Engineer
American Express
- Designed and maintained scalable ETL pipelines using Python, SQL, and Apache Airflow
- Leveraged AWS Glue and EMR with PySpark to process over 12 TB of daily financial data
- Implemented AI-enriched fraud detection data pipelines for real-time model scoring
- Tuned ETL job performance using advanced partitioning, reducing processing time by 40%
- Led migration of legacy ETL processes to a modern Spark-based lakehouse architecture
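One partitioning technique behind this kind of speedup is key salting, which spreads a skewed ("hot") key across several sub-partitions so no single Spark task processes all of its rows. A plain-Python sketch of the idea (the merchant key is hypothetical; in PySpark the same trick is applied to join/group-by keys):

```python
import random

def salt_key(key: str, num_salts: int = 8) -> str:
    """Append a random salt so rows sharing a hot key are spread across
    num_salts sub-partitions instead of landing on one skewed task."""
    return f"{key}#{random.randrange(num_salts)}"

def unsalt_key(salted: str) -> str:
    """Recover the original key after the per-salt partial aggregation."""
    return salted.rsplit("#", 1)[0]

# Rows for a hot merchant now hash to up to 8 different partitions;
# partial aggregates are combined again after unsalting.
print(unsalt_key(salt_key("MERCHANT_001")))  # MERCHANT_001
```

The usual pattern is a two-stage aggregation: aggregate on the salted key first, then unsalt and aggregate once more to get the final per-key result.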
Azure Data Engineer
Cedar Gate Company
- Designed scalable data pipelines using Azure Data Factory to ingest healthcare data from 30+ sources
- Built Delta Lake architecture on Azure Data Lake Gen2 with bronze, silver, and gold layers
- Engineered distributed ETL pipelines using Azure Databricks (PySpark) for multi-terabyte datasets
- Integrated FHIR APIs and HL7 feeds for real-time clinical data ingestion
- Led migration from on-prem SSIS to cloud-native ADF and Databricks with 40% cost reduction
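The bronze/silver/gold (medallion) flow can be sketched in plain Python rather than the actual Databricks PySpark code: bronze holds raw records as landed, silver drops malformed rows and normalizes fields, and gold aggregates for reporting (the healthcare fields below are made up for illustration):

```python
def to_silver(bronze_rows):
    """Bronze -> silver: drop malformed records and normalize fields."""
    silver = []
    for r in bronze_rows:
        if r.get("patient_id") and r.get("charge") is not None:
            silver.append({"patient_id": r["patient_id"].strip().upper(),
                           "charge": float(r["charge"])})
    return silver

def to_gold(silver_rows):
    """Silver -> gold: aggregate total charges per patient for reporting."""
    totals = {}
    for r in silver_rows:
        totals[r["patient_id"]] = totals.get(r["patient_id"], 0.0) + r["charge"]
    return totals

bronze = [
    {"patient_id": " p1 ", "charge": "10.5"},
    {"patient_id": None, "charge": "3"},   # malformed, dropped in silver
    {"patient_id": "P1", "charge": "4.5"},
]
print(to_gold(to_silver(bronze)))  # {'P1': 15.0}
```

Keeping the raw bronze layer intact means the silver and gold tables can always be rebuilt when cleansing rules change.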
Big Data Developer
Cotiviti
- Developed Spark applications using PySpark and Spark SQL for data extraction and transformation
- Used Spark Streaming to receive real-time data from Kafka and store it in HDFS
- Built on-premises data pipelines using Kafka and Spark for real-time data analysis
- Created Tableau dashboards from Hive outputs for data visualization
- Used AWS EMR to transform and move large amounts of data into S3 and DynamoDB
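The micro-batch model Spark Streaming applies to a Kafka feed amounts to grouping events into fixed-size time windows; a plain-Python sketch of tumbling windows (timestamps in seconds, data made up):

```python
def tumbling_windows(events, window_s=60):
    """Group (timestamp, value) events into fixed, non-overlapping
    (tumbling) windows keyed by each window's start time."""
    windows = {}
    for ts, value in events:
        start = (ts // window_s) * window_s  # floor to the window boundary
        windows.setdefault(start, []).append(value)
    return windows

events = [(5, "a"), (59, "b"), (61, "c")]
print(tumbling_windows(events))  # {0: ['a', 'b'], 60: ['c']}
```

Each window can then be aggregated independently, which is what lets the streaming job keep up with the Kafka feed in near real time.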
Hadoop Developer
Mango Software
- Built end-to-end ETL pipelines using Hadoop ecosystem components
- Used Hive to analyze partitioned and bucketed data for reporting metrics
- Created shell scripts, Oozie workflows, and Coordinator jobs for automation
- Implemented performance optimizations including distributed cache and map-side joins
- Implemented ACID-compliant Hive managed tables to support SCD Type 1 updates
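SCD Type 1 semantics can be sketched in plain Python (column names are hypothetical; the actual implementation used ACID Hive managed tables): incoming rows overwrite the current row for the same business key, and no history is retained, unlike Type 2.

```python
def scd_type1_merge(target, updates, key="customer_id"):
    """SCD Type 1 upsert: each incoming row overwrites the current row
    for the same business key; new keys are inserted; no history kept."""
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row  # overwrite existing or insert new
    return sorted(merged.values(), key=lambda r: r[key])

dim = scd_type1_merge(
    [{"customer_id": 1, "city": "Delhi"}],
    [{"customer_id": 1, "city": "Gurugram"}, {"customer_id": 2, "city": "Pune"}],
)
print(dim)  # customer 1 is overwritten in place, customer 2 is inserted
```

On ACID Hive tables the same logic is expressed as a single `MERGE INTO ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT` statement.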
Education
Master of Data Science
University of New Haven
Connecticut, US
Bachelor of Computer Science and Engineering
Ansal University
Gurugram, India
Get In Touch
I'm always open to discussing new opportunities, collaborations, or data engineering challenges.
© 2025 Shambhu Adhikari. Built with Next.js and Tailwind CSS.