In the dynamic field of data engineering, we at Anghami have undergone a significant evolution in our approach to data processing and storage. This transformation, marked by the transition from Redshift to a Spark-powered framework and the integration of Amazon S3, has opened up a multitude of opportunities. At the heart of this endeavor lies Gong, a versatile ETL pipeline framework. This article dives into the technical details of Gong, uncovering how it works and the pivotal role it plays in Anghami’s Data Engineering and Platforms ecosystem.
A little background first. Anghami’s data engineering journey began with Amazon Redshift, a robust solution that efficiently handled a multitude of ETL and ad-hoc tasks daily. However, as Anghami’s data needs grew, the constraints imposed by Redshift’s quotas became apparent, capping how far we could scale.
- Storage Constraints: Redshift’s storage quotas proved inadequate to accommodate rapidly growing data volume. The inability to seamlessly scale storage capacity hindered the organization’s ability to retain and analyze historical data, a critical aspect of data-driven decision-making.
- Computational Bottlenecks: The computational capabilities of Redshift struggled to match the accelerating demands. Limited processing power resulted in extended query execution times, obstructing the generation of real-time data insights and undermining the agility of timely decision-making processes.
- Cost Considerations: Redshift’s pricing model became increasingly cost-prohibitive as usage grew, and the escalating costs risked diverting resources away from further data-driven initiatives.
These limitations led to a transformative shift: leveraging the power of Spark and the flexibility of Amazon S3. This strategic move marked a pivotal moment in Anghami’s data infrastructure evolution. By embracing Spark’s computational prowess (for ETL), Trino (for ad-hoc analytics), and S3’s adaptability, Anghami not only resolved Redshift’s constraints but also positioned itself at the forefront of modern data processing and storage. This laid the foundation for enhanced efficiency, scalability, and innovation within Anghami’s evolving data ecosystem.
Proposed solution:
From a technical perspective, Gong is a Spark application deployed on AWS EMR infrastructure, accepting various arguments tailored to specific business logic requirements. The process kicks off by reading the source data into a DataFrame over a JDBC (Java Database Connectivity) connection, established with any of the supported relational database management systems (RDBMS): MySQL, PostgreSQL, or SQL Server. Gong then efficiently transfers this data to the designated S3 storage environment and creates a Hive table on top of the S3-stored data. Currently, more than 2,000 pipelines run daily, reading from 15+ different databases.
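In PySpark terms, that core flow is only a few calls. Below is a minimal sketch under assumed names; the host, schema, bucket, and credentials are hypothetical, not Gong’s actual configuration:

```python
from pyspark.sql import SparkSession

# A minimal sketch of the flow described above; all names are hypothetical.
spark = (
    SparkSession.builder
    .appName("gong-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# 1. Read the source table over JDBC (MySQL here; PostgreSQL and SQL Server
#    differ only in the driver and URL).
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/app_db")
    .option("dbtable", "app_db.users")
    .option("user", "reader")
    .option("password", "secret")
    .option("numPartitions", 8)           # parallelize the read
    .option("partitionColumn", "id")      # split on a numeric key
    .option("lowerBound", 1)
    .option("upperBound", 10_000_000)
    .load()
)

# 2. Land the data on S3 as Parquet.
target = "s3://data-lake/ingestion/app_db/users"
df.write.mode("overwrite").parquet(target)

# 3. Expose it as an external table (LOCATION makes the table external),
#    queryable from Spark SQL and Trino alike.
spark.sql("CREATE DATABASE IF NOT EXISTS lake")
spark.sql(f"CREATE TABLE IF NOT EXISTS lake.users USING PARQUET LOCATION '{target}'")
```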
Gong’s Role in Anghami’s Ecosystem
Approaching Gong from a product standpoint reveals its power as a tool for business users. It empowers users to seamlessly extract data from various databases, monitor processes via platforms like Slack, email, and workplace alerts, track metrics, and execute business logic within a distributed computing environment. Gong plays a pivotal role in the ingestion layer, acting as a source for the subsequent Transformation layer, ultimately contributing to the Visualization layer.
Detailed Flow and Task Lifecycle
Gong’s Capabilities & Benefits:
1. Zero Code Onboarding
Coupled with Airflow, Gong makes onboarding a new pipeline a zero-code exercise. Its generic code structure means new jobs are incorporated by editing environment variables, with no code refactoring required, as the sketch below illustrates.
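For illustration, onboarding a new table might then amount to adding a configuration entry along these lines (the key names and values are hypothetical, not Gong’s actual schema):

```python
# Hypothetical job declaration: one entry per pipeline, consumed by the
# generic Gong application and the Airflow DAG factory.
GONG_JOBS = {
    "app_db.users": {
        "source_conn": "mysql_app_db",      # JDBC connection to read from
        "load_type": "incremental",         # full | incremental | delta
        "checkpoint_column": "updated_at",  # drives incremental loads
        "exclude_columns": ["email"],       # PII never leaves the source DB
        "target_path": "s3://data-lake/ingestion/app_db/users",
        "hive_table": "lake.users",
    },
}
```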
2. Database Compatibility
It seamlessly supports MySQL, PostgreSQL, and SQL Server on RDS, ensuring compatibility across various database systems.
3. Load Strategies
Equipped to handle diverse load strategies, Gong accommodates Full Loads, Incremental Loads, and Delta Loads within specified timeframes. It also supports both partitioned and unpartitioned Hive tables.
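One way to picture the three strategies is as different pushdown queries handed to the JDBC reader. The sketch below is illustrative, not Gong’s actual implementation:

```python
# Illustrative only: derive the JDBC pushdown query for each load strategy,
# so filtering happens inside the source database rather than in Spark.
def build_source_query(table, load_type, checkpoint_col=None,
                       last_value=None, window_start=None, window_end=None):
    if load_type == "full":
        # Re-read the entire table on every run.
        return f"(SELECT * FROM {table}) AS src"
    if load_type == "incremental":
        # Only rows changed since the last checkpoint.
        return (f"(SELECT * FROM {table} "
                f"WHERE {checkpoint_col} > '{last_value}') AS src")
    if load_type == "delta":
        # Only rows inside an explicit time window.
        return (f"(SELECT * FROM {table} WHERE {checkpoint_col} "
                f"BETWEEN '{window_start}' AND '{window_end}') AS src")
    raise ValueError(f"unknown load type: {load_type}")

# The result is passed straight to the reader:
#   spark.read.format("jdbc").option("dbtable", build_source_query(...)).load()
```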
4. Checkpointing
The checkpointing feature efficiently manages incremental loads, ensuring data integrity and consistency: a failed run simply resumes from the last recorded checkpoint, which makes pipelines idempotent and contributes to enhanced reliability.
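A minimal checkpoint store could look like the following, assuming checkpoints are kept as small JSON objects on S3 (the bucket and key layout are hypothetical):

```python
import json

import boto3

# Hypothetical checkpoint store: one small JSON object per job on S3.
s3 = boto3.client("s3")
BUCKET, PREFIX = "data-lake", "checkpoints"

def load_checkpoint(job_id):
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=f"{PREFIX}/{job_id}.json")
        return json.loads(obj["Body"].read())["last_value"]
    except s3.exceptions.NoSuchKey:
        return None  # first run: fall back to a full load

def save_checkpoint(job_id, last_value):
    # Written only after the S3 write and Hive swap succeed, so a failed
    # run is retried from the previous mark, which keeps runs idempotent.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{PREFIX}/{job_id}.json",
        Body=json.dumps({"last_value": last_value}),
    )
```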
5. Column Exclusion/Selection
Gong offers selective data fetching, enabling users to exclude or choose specific columns rather than retrieving all available ones. This selectivity safeguards sensitive information, eliminates unnecessary data transfer, and reduces RDS I/O usage, optimizing costs.
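Conceptually, this boils down to building the SELECT list before the query is pushed down to the database; a tiny illustrative helper:

```python
# Illustrative: build the SELECT list so excluded columns (e.g. PII) are
# never read from, or transferred out of, the source database.
def project_columns(all_columns, include=None, exclude=None):
    if include:
        return ", ".join(include)
    dropped = set(exclude or [])
    return ", ".join(c for c in all_columns if c not in dropped)

# project_columns(["id", "email", "plan"], exclude=["email"]) -> "id, plan",
# which is spliced into the JDBC pushdown query in place of "*".
```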
6. Race Condition Handling
Gong writes each run to a new S3 target path that incorporates an EPOCHVALUE generated at runtime. This ensures uninterrupted user queries during job execution and lets multiple jobs run concurrently without risking data inconsistencies, enhancing overall data reliability.
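A sketch of that write-then-swap pattern, assuming the table is repointed with a single Hive metadata operation:

```python
import time

# Illustrative write-then-swap: each run lands in a fresh epoch-suffixed
# prefix, and the Hive table is repointed in one metadata operation.
def write_and_swap(spark, df, base_path, hive_table):
    epoch = int(time.time())
    new_path = f"{base_path}/{epoch}"  # e.g. .../users/1700000000

    # Readers keep querying the old location while this write runs.
    df.write.mode("overwrite").parquet(new_path)

    # Queries started after this single statement see the new data.
    spark.sql(f"ALTER TABLE {hive_table} SET LOCATION '{new_path}'")

    # The previous path is recorded (e.g. in the metrics table's old_s3
    # column) so the pruner can delete it later.
    return new_path
```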
7. Tracking Metrics
Every job execution captures a comprehensive set of metrics in a MySQL table, including database and table names, record count, transferred data size, load type, and elapsed time, among other key performance indicators (KPIs). This robust metric collection serves multiple purposes: ensuring data quality, maintaining data integrity, monitoring Service Level Agreements (SLAs), and tracking the efficiency of both sluggish and optimized job executions.
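Capturing such a row might look like this in PySpark, with hypothetical column names and connection details:

```python
import time

# Illustrative metrics capture: one row per run, appended over JDBC to a
# MySQL table (column names and the connection are hypothetical).
def record_metrics(spark, *, db_name, table_name, record_count, bytes_written,
                   load_type, started_at, old_s3, new_s3):
    row = [(db_name, table_name, record_count, bytes_written, load_type,
            round(time.time() - started_at, 1), old_s3, new_s3)]
    cols = ["db_name", "table_name", "record_count", "bytes_written",
            "load_type", "elapsed_seconds", "old_s3", "new_s3"]
    (
        spark.createDataFrame(row, cols)
        .write.format("jdbc")
        .option("url", "jdbc:mysql://metrics-host:3306/ops")
        .option("dbtable", "ops.gong_job_metrics")
        .option("user", "writer")
        .option("password", "secret")
        .mode("append")
        .save()
    )
```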
8. Airflow Custom Dependency Operator
We use the MySQL table mentioned in point #7 as the source of truth for whether a particular table is up to date. Every Airflow job that depends on a Gong table uses this operator for validation before proceeding, preventing downstream tasks from processing stale or inaccurate data.
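Such a check maps naturally onto an Airflow sensor. The sketch below assumes a metrics table and connection id that are illustrative, not Gong’s actual ones:

```python
from airflow.providers.mysql.hooks.mysql import MySqlHook
from airflow.sensors.base import BaseSensorOperator

# Hypothetical freshness sensor over the metrics table from point #7.
class GongTableFreshnessSensor(BaseSensorOperator):
    def __init__(self, table_name, mysql_conn_id="gong_metrics", **kwargs):
        super().__init__(**kwargs)
        self.table_name = table_name
        self.mysql_conn_id = mysql_conn_id

    def poke(self, context):
        hook = MySqlHook(mysql_conn_id=self.mysql_conn_id)
        count = hook.get_first(
            "SELECT COUNT(*) FROM gong_job_metrics "
            "WHERE table_name = %s AND DATE(finished_at) = %s",
            parameters=(self.table_name, context["ds"]),
        )[0]
        # Downstream tasks proceed only once the day's load has landed.
        return count > 0
```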
Orchestration from Airflow:
All Gong tasks are defined as environment variables in Airflow and are dynamically triggered from Airflow workflows.
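A simplified picture of that dynamic generation, assuming the job definitions arrive as a JSON environment variable (the variable name and spark-submit arguments are hypothetical):

```python
import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative dynamic generation: task definitions come from an environment
# variable, so onboarding a pipeline never touches DAG code.
jobs = json.loads(os.environ.get("GONG_JOBS", "{}"))

with DAG(
    dag_id="gong_ingestion",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for name, cfg in jobs.items():
        BashOperator(
            task_id=f"gong_{name.replace('.', '_')}",
            # Every task submits the same generic Spark application,
            # parameterized entirely by its config entry.
            bash_command=(
                "spark-submit gong.py "
                f"--table {name} "
                f"--load-type {cfg['load_type']} "
                f"--target {cfg['target_path']}"
            ),
        )
```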
Captured job metrics:
As described in the capabilities above, each task execution writes a row of metrics (database and table names, record count, data transfer volume, load type, elapsed time, and other KPIs) to a MySQL table, and this collection serves purposes beyond monitoring alone.
We also run a dedicated “s3 pruner” job that reads this table and deletes the files under the paths recorded in the “old_s3” column. This also lets us retain a fixed number of historical versions of a table when needed.
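A pruner along those lines could be sketched as follows, with hypothetical schema and connection details:

```python
import boto3
import pymysql

# Illustrative pruner: fetch obsolete prefixes from the metrics table, then
# delete everything under them (schema and retention logic are hypothetical).
def prune():
    conn = pymysql.connect(host="metrics-host", user="reader",
                           password="secret", database="ops")
    with conn.cursor() as cur:
        # Select paths outside the retention window, e.g. all but the
        # newest N versions per table.
        cur.execute("SELECT old_s3 FROM gong_job_metrics "
                    "WHERE old_s3 IS NOT NULL AND pruned = 0")
        paths = [row[0] for row in cur.fetchall()]

    s3 = boto3.client("s3")
    for path in paths:
        bucket, _, prefix = path.removeprefix("s3://").partition("/")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
            if keys:
                # delete_objects accepts up to 1,000 keys, matching page size.
                s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
```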
Wrapping up, Gong’s capabilities and seamless integration into Anghami’s data ecosystem have revolutionized data ingestion and data management within the organization. As the company continues to grow and its data needs evolve, Gong will play an even more pivotal role. We are constantly exploring new ways to enhance Gong’s capabilities, empowering Anghami to make data-driven decisions that will drive its success in the years to come.