Guitar + Honeycomb: Anghami’s Complete Data Engineering Solution

by Ajinkya Bhat
February 13, 2026
in Engineering
How we built a unified Data Engineering platform for schema management and Spark job development that transformed our workflows

Overview

At Anghami & OSN+, we process terabytes of data daily across multiple Spark jobs running on AWS EMR. Our engineering team faced two critical challenges:

  1. Spark Development Complexity: Making Spark development fast, safe, and accessible to developers who just want to write business logic
  2. Schema Management Chaos: Managing Hive table schemas at scale with proper governance, versioning, and automation

We solved both by building Guitar and Honeycomb, two complementary systems that form a complete data engineering platform.

🎸 Guitar is a Docker-based Spark framework that abstracts away Spark’s complexity and lets developers focus on business logic (SQL and high-level job patterns). 

🍯🐝 Honeycomb is a Git-based schema management system that treats Hive schemas as code: versioned, validated, and automatically deployed.

Together they cover the full workflow: Honeycomb manages your schemas, and Guitar runs your jobs against those schemas.

Why the name Guitar? 

The name Guitar is the latest step in a long evolution: Harp → Lyre → Guitar, representing roughly a decade of learning at Anghami. The ML team was the first to run Spark jobs at the company; they built and maintained the Harp repo. Backend developers later created the Lyre repo (v2) for their own Spark use cases, learning from Harp. Guitar is v3: it takes those lessons and makes things better and simpler, with unified local development, schema-as-code with Honeycomb, and a single platform for data engineering.


The Problems We Faced Before

1. Slow Feedback Loops

Testing ETL pipelines on EMR clusters was time-consuming and expensive:

  • Long feedback loops (deploy to EMR to test)
  • 2-3 hours per schema change (manual process)
  • Hard to replicate production environment locally

2. Weak Schema Governance

Anyone with EMR/Hive access could create, drop, or alter Hive tables directly with no governance or approval layer:

  • Accidental table creation in production
  • Schema mismatches could cause job failures
  • No easy rollback mechanism for schema updates
  • Multiple conflicting “sources of truth” for table definitions

3. Shared Cluster Risk

Testing non-production jobs on EMR could impact production workloads:

  • Resource contention between test and production jobs
  • Accidental data loss from test jobs affecting production tables
  • Risk of breaking critical production pipelines

4. Uncontrolled Costs

Unoptimized jobs could spike AWS costs even for small test changes:

  • Testing multiple code changes on EMR incurred significant cloud charges
  • No way to validate job efficiency before production deployment
  • Difficulty estimating job costs before running

5. Human Error

Typos in S3 paths, table names, or secrets could break pipelines:

  • No validation before job execution
  • Manual, error-prone schema deployments
  • Inconsistent error handling
  • Ad-hoc backfills left no tracking documentation

The Solution

Guitar: Docker-based Spark Framework

Guitar abstracts away Spark’s complexity so developers can focus on business logic. The framework also ships a full Docker data engineering stack: Spark (EMR image for distributed data processing), Hive metastore (metadata catalog for tables), LocalStack (local S3 object storage emulator), Trino (distributed SQL query engine), Redash (data visualization and dashboarding), Airflow (workflow orchestration and scheduling), Postgres, and MySQL, all running as containers. One command brings the whole ecosystem up; developers run jobs, query with SQL, and build dashboards locally.

Core requirements:

  • Built-in Security: Table validation, schema checks, production safeguards
  • Complete Local Environment: Docker-based setup with Spark, Hive, S3 (LocalStack), Airflow, Trino, and Redash
  • Zero-Downtime Deployments: SWAP folder strategy for safe production updates (new data is written to a new S3 path, preventing downtime and partial-data issues)
  • Dual Language Support: Scala and Python (PySpark) with identical patterns

Key Design Principles:

  • Tables are never created by jobs: Tables are created first in Honeycomb (schema-as-code), then deployed to Hive. Jobs only write into existing tables.
  • Production writes use SWAP logic: Instead of saveAsTable() (which creates/overwrites tables), production jobs use SWAP-folder logic for refreshes.
  • Local development mirrors production: All Production dependencies including Spark, Hive metastore, S3, Trino, Redash, and Airflow run as Docker containers with minimal setup required.

Honeycomb: Schema Management as Code

Honeycomb is integrated as a submodule within the Guitar repository, ensuring seamless schema management within Guitar’s local big data development environment. This integration allows developers to work with schemas alongside their data pipeline code, maintaining consistency across the entire data engineering workflow.

Core requirements:

  • Git as Single Source of Truth: All schemas versioned in Git
  • Automated Deployment: Smart ALTER statement generation and execution
  • Multi-Layer Validation: Pre-commit, PR validation, and deployment verification
  • Clear Audit Trail: Complete history of who changed what, when, and why
  • Built-in Guardrails: Type compatibility checks, reserved keyword handling

Key Design Principles:

  • Git defines reality: The repo is the source of truth for schemas. The Hive metastore is treated as an implementation detail that is continuously reconciled to match what’s in Git, enabling traceability and review through version control.
  • No production touch during review: All PR checks operate purely on Git state, comparing branch changes against main without querying or mutating the live metastore. This keeps validation fast, predictable, and risk-free.
  • Deploy only what actually changed: Post-merge deployments diff Git against the real metastore and apply only the necessary ALTERs. There’s no guessing: changes are driven by the observed difference between desired and actual state.

Guitar Internals:

The Spark framework runs alongside a stack of Docker services: Postgres (backing the Hive metastore, Airflow, and Redash), the Hive metastore itself, LocalStack (S3 + Secrets), MySQL (ETL metadata), Trino, Redash, and Airflow. Together they give a production-like setup (same catalog, same storage model) with one command to bring the stack up and one to tear it down. Nothing needs to be installed on the host except Docker.

Dependencies: what starts in what order

Containers start in tiers, and each tier waits for the previous one via health checks. The tiers reflect what each stage does, so the startup order is easy to remember and debug.

Production jobs must not use saveAsTable to create or replace tables. Tables are owned by schema-as-code (Honeycomb); jobs may only write into tables that already exist in the Hive metastore, and the framework enforces this at runtime.
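A runtime guard of this kind is easy to picture. The sketch below is a hypothetical stand-in, not Guitar's actual code: `catalog` plays the role of the metastore's table listing (e.g. what `spark.catalog.tableExists` would consult), and `write_fn` plays the role of the actual insert.

```python
class TableNotManagedError(Exception):
    """Raised when a job tries to write to a table Honeycomb hasn't created."""

def guarded_insert(catalog, table, write_fn):
    """Allow writes only into tables that already exist in the metastore.

    Because jobs go through this guard instead of saveAsTable, they can
    never create or replace a table as a side effect of a write.
    """
    if table not in catalog:
        raise TableNotManagedError(
            f"{table} does not exist -- define it in Honeycomb and deploy first"
        )
    return write_fn(table)
```

The payoff is that a typo in a table name fails loudly before any data moves, instead of silently creating an orphan table in production.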

Dictator linting: strict checks that fail the build

  • Scala: Scalafmt enforces formatting; Scalafix runs with strict rules: unused imports and variables are removed, and violations fail the build.
  • Python: Ruff handles lint and format for DAGs, PySpark code, and scripts. Ruff check and format are required in CI; failures block the PR. Together with the Scala tooling, the repo follows a “dictator” style: one format and one lint standard, enforced everywhere.

Honeycomb First, Then Guitar

The correct flow for ETL pipeline creation is:

  1. Honeycomb: Define or alter tables in Git using schema-as-code.
  2. Deploy to Hive: Honeycomb’s pipeline applies the schema to Hive (and optionally prepares S3 paths).
  3. Guitar jobs: Write to the existing Hive table. No table creation occurs in Guitar.

Honeycomb Internals:

When we went looking for a declarative way to manage schemas, Skeema caught our attention as its model was simple and powerful: define the desired schema state using CREATE statements in Git, and let the tool safely compute and apply the required changes. It’s open source, well-designed, and works great for MySQL and MariaDB.

Then we ran into the catch. Skeema doesn’t support Hive!

Hive schemas don’t just describe columns and types; they encode storage details like S3 locations, table properties, and partitions. No existing tool handled these concerns well.

So we built our own!

Inspired by Skeema’s core ideas, we created a Hive-native, open-source schema diff engine backed by Git as the source of truth. It generates precise CREATE, DROP, and ALTER statements for tables and columns, and tracks changes to comments, S3 paths, and other table properties, giving us a reliable, reviewable way to evolve Hive schemas with confidence.

Intelligent Schema Diffing

Honeycomb automatically detects:

  • New columns
  • Dropped columns
  • Type changes (with compatibility checking)
  • Comment updates (table and column level)
  • New tables/views
  • Deleted tables/views
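At its core, column-level detection is a diff over two mappings: the desired schema parsed from Git and the actual schema read from the metastore. A minimal sketch, with hypothetical column names and a simplified `{column: hive_type}` representation (the real engine also tracks comments, locations, and properties):

```python
def diff_columns(desired, actual):
    """Diff two {column: hive_type} mappings (Git = desired, metastore = actual).

    Returns the additions, drops, and type changes a deployment would apply.
    Dict insertion order is preserved, matching column order in the schema file.
    """
    added   = {c: t for c, t in desired.items() if c not in actual}
    dropped = {c: t for c, t in actual.items() if c not in desired}
    changed = {
        c: (actual[c], desired[c])           # (old type, new type)
        for c in desired
        if c in actual and actual[c] != desired[c]
    }
    return added, dropped, changed
```

Everything downstream (which ALTER to emit, whether a change is compatible) is driven by this diff rather than by guessing what a PR intended.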

Smart handling:
  • Handle reserved keywords (like time, type, timestamp) by automatically quoting column names
  • Detect incompatible type changes (e.g., bigint → timestamp) and use REPLACE COLUMNS instead of CHANGE COLUMN
  • Preserve column order and relationships

Smart Type Compatibility Detection

Hive has strict rules about type conversions. Honeycomb knows them:

✅ Allowed (uses CHANGE COLUMN):

  • TINYINT → SMALLINT → INT → BIGINT (widening)
  • FLOAT → DOUBLE
  • VARCHAR(n) → STRING

❌ Not Allowed (uses REPLACE COLUMNS):

  • BIGINT → TIMESTAMP
  • STRING → INT
  • Any narrowing conversions

The system automatically chooses the right approach.
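The decision rule can be sketched as a lookup over Hive's widening chains, combined with the reserved-keyword quoting mentioned above. This is an illustrative simplification (real Hive has more conversion and quoting rules than the few sampled here), not Honeycomb's actual table:

```python
# Widening conversions Hive allows via ALTER TABLE ... CHANGE COLUMN.
WIDENING = {
    "tinyint":  {"smallint", "int", "bigint"},
    "smallint": {"int", "bigint"},
    "int":      {"bigint"},
    "float":    {"double"},
    "varchar":  {"string"},
}

RESERVED = {"time", "type", "timestamp"}  # sample of Hive reserved words

def quote(col):
    """Backtick-quote reserved identifiers so generated DDL stays valid."""
    return f"`{col}`" if col.lower() in RESERVED else col

def alter_for(table, col, old_type, new_type):
    """Pick CHANGE COLUMN for widening conversions, REPLACE COLUMNS otherwise."""
    old = old_type.lower().split("(")[0]   # varchar(64) -> varchar
    new = new_type.lower()
    if new in WIDENING.get(old, set()):
        return (f"ALTER TABLE {table} CHANGE COLUMN "
                f"{quote(col)} {quote(col)} {new_type}")
    # Incompatible change: restate the column list instead (the sketch
    # shows only the affected column for brevity).
    return f"ALTER TABLE {table} REPLACE COLUMNS ({quote(col)} {new_type})"
```

So `int → bigint` yields a safe CHANGE COLUMN, while `bigint → timestamp` falls through to the REPLACE COLUMNS path, matching the rules listed above.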

Concluding

By building Guitar and Honeycomb, we transformed our data engineering workflow from a fragile, manual process into a robust, automated platform. This combination of fast local Spark pipeline development and schema-as-code has not only improved our productivity but also eliminated entire categories of production issues.

The key takeaways from our journey:

  • Automation is essential: Manual schema management doesn’t scale
  • Developer experience matters: Make it easy to do the right thing
  • Version control everything: Git is your friend for both code and schemas
  • Test locally first: Save time and money by catching issues early with production-parity environments
  • Build in guardrails: Prevent errors before they reach production

Ajinkya Bhat

Passionate about Data Engineering, Scaling Infrastructure. I navigate the intricate realms of analytics with a keen eye for detail. As a self-proclaimed data nerd, I thrive on transforming raw information into actionable insights. Beyond the world of tech, I'm a jetsetter exploring the globe, all while embracing a vegetarian lifestyle.
