Guitar + Honeycomb: Anghami’s Complete Data Engineering Solution

by Ajinkya Bhat
February 13, 2026
in Engineering
How we built a unified Data Engineering platform for schema management and Spark job development that transformed our workflows

Overview

At Anghami & OSN+, we process terabytes of data daily across multiple Spark jobs running on AWS EMR. Our engineering team faced two critical challenges:

  1. Spark Development Complexity: Making Spark development fast, safe, and accessible to developers who just want to write business logic
  2. Schema Management Chaos: Managing Hive table schemas at scale with proper governance, versioning, and automation

We solved both by building Guitar and Honeycomb, two complementary systems that form a complete data engineering platform.

🎸 Guitar is a Docker-based Spark framework that abstracts away Spark’s complexity and lets developers focus on business logic (SQL and high-level job patterns). 

🍯🐝 Honeycomb is a Git-based schema management system that treats Hive schemas as code: versioned, validated, and automatically deployed.

Together they cover the full workflow: Honeycomb manages your schemas, and Guitar runs your jobs against those schemas.

Why the name Guitar? 

The name Guitar is the latest step in a long evolution: Harp → Lyre → Guitar, representing roughly a decade of learning at Anghami. The ML team was the first to run Spark jobs at the company; they built and maintained the Harp repo. Backend developers later created the Lyre repo (v2) for their own Spark use cases, learning from Harp. Guitar is v3: it takes those lessons and makes things better and simpler, with unified local development, schema-as-code with Honeycomb, and a single platform for data engineering.


The Problems We Faced Before

1. Slow Feedback Loops

Testing ETL pipelines on EMR clusters was time-consuming and expensive:

  • Long feedback loops (deploy to EMR to test)
  • 2-3 hours per schema change (manual process)
  • Hard to replicate production environment locally

2. Weak Schema Governance

Anyone with EMR/Hive access could create, drop, or alter Hive tables directly with no governance or approval layer:

  • Accidental table creation in production
  • Schema mismatches could cause job failures
  • No easy rollback mechanism for schema updates
  • Multiple conflicting “sources of truth” for table definitions

3. Shared Cluster Risk

Testing non-production jobs on EMR could impact production workloads:

  • Resource contention between test and production jobs
  • Accidental data loss from test jobs affecting production tables
  • Risk of breaking critical production pipelines

4. Uncontrolled Costs

Unoptimized jobs could spike AWS costs even for small test changes:

  • Testing multiple code changes on EMR incurred significant cloud charges
  • No way to validate job efficiency before production deployment
  • Difficulty estimating job costs before running

5. Human Error

Typos in S3 paths, table names, or secrets could break pipelines:

  • No validation before job execution
  • Manual, error-prone schema deployments
  • Inconsistent error handling
  • Ad-hoc backfills left no tracking documentation

The Solution

Guitar: Docker-based Spark Framework

Guitar abstracts away Spark’s complexity so developers can focus on business logic. The framework also ships a full Docker data engineering stack: Spark (EMR image for distributed data processing), Hive metastore (metadata catalog for tables), LocalStack (local S3 object storage emulator), Trino (distributed SQL query engine), Redash (data visualization and dashboarding), Airflow (workflow orchestration and scheduling), Postgres, and MySQL, all running as containers. One command brings the whole ecosystem up; developers run jobs, query with SQL, and build dashboards locally.

Core requirements:

  • Built-in Security: Table validation, schema checks, production safeguards
  • Complete Local Environment: Docker-based setup with Spark, Hive, S3 (LocalStack), Airflow, Trino, and Redash
  • Zero-Downtime Deployments: SWAP folder strategy for safe production updates (new data is written to a new S3 path, preventing downtime and partial-data issues)
  • Dual Language Support: Scala and Python (PySpark) with identical patterns

Key Design Principles:

  • Tables are never created by jobs: Tables are created first in Honeycomb (schema-as-code), then deployed to Hive. Jobs only write into existing tables.
  • Production writes use SWAP logic: Instead of saveAsTable() (which creates/overwrites tables), production jobs use SWAP-folder logic for refreshes.
  • Local development mirrors production: All Production dependencies including Spark, Hive metastore, S3, Trino, Redash, and Airflow run as Docker containers with minimal setup required.

Honeycomb: Schema Management as Code

Honeycomb is integrated as a submodule within the Guitar repository, ensuring seamless schema management within Guitar’s local big data development environment. This integration allows developers to work with schemas alongside their data pipeline code, maintaining consistency across the entire data engineering workflow.

Core requirements:

  • Git as Single Source of Truth: All schemas versioned in Git
  • Automated Deployment: Smart ALTER statement generation and execution
  • Multi-Layer Validation: Pre-commit, PR validation, and deployment verification
  • Clear Audit Trail: Complete history of who changed what, when, and why
  • Built-in Guardrails: Type compatibility checks, reserved keyword handling

Key Design Principles:

  • Git defines reality: The repo is the source of truth for schemas. The Hive metastore is treated as an implementation detail that is continuously reconciled to match what’s in Git, enabling traceability and review through version control.
  • No production touch during review: All PR checks operate purely on Git state, comparing branch changes against main without querying or mutating the live metastore. This keeps validation fast, predictable, and risk-free.
  • Deploy only what actually changed: Post-merge deployments diff Git against the real metastore and apply only the necessary ALTERs. There’s no guessing: changes are driven by the observed difference between desired and actual state.

Guitar Internals:

The Spark framework runs alongside a stack of Docker services: Postgres (backing the Hive metastore, Airflow, and Redash), the Hive metastore itself, LocalStack (S3 + Secrets), MySQL (ETL metadata), Trino, Redash, and Airflow. Together they give a production-like setup (same catalog, same storage model) with one command to bring the stack up and one to tear it down. Nothing needs to be installed on the host except Docker.

Dependencies: what starts in what order

Containers start in tiers, and each tier waits for the previous one via health checks. The tiers reflect what each stage does, so the startup order is easy to remember and debug.

Production jobs must not use saveAsTable to create or replace tables. Tables are owned by schema-as-code (Honeycomb); jobs may only write into tables that already exist in the Hive metastore, and the framework enforces this at runtime.
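A runtime guard of this kind is easy to picture. The sketch below is a hypothetical stand-in, not Guitar's actual code: `catalog` plays the role of the metastore's table listing (e.g. what `spark.catalog.tableExists` would consult), and `write_fn` plays the role of the actual insert.

```python
class TableNotManagedError(Exception):
    """Raised when a job tries to write to a table Honeycomb hasn't created."""

def guarded_insert(catalog, table, write_fn):
    """Allow writes only into tables that already exist in the metastore.

    Because jobs go through this guard instead of saveAsTable, they can
    never create or replace a table as a side effect of a write.
    """
    if table not in catalog:
        raise TableNotManagedError(
            f"{table} does not exist -- define it in Honeycomb and deploy first"
        )
    return write_fn(table)
```

The payoff is that a typo in a table name fails loudly before any data moves, instead of silently creating an orphan table in production.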

Dictator linting: strict checks that fail the build

  • Scala: Scalafmt enforces formatting; Scalafix runs with strict rules: unused imports and variables are removed, and violations fail the build.
  • Python: Ruff handles lint and format for DAGs, PySpark code, and scripts. Ruff check and format are required in CI; failures block the PR. Together with the Scala tooling, the repo follows a “dictator” style: one format and one lint standard, enforced everywhere.

Honeycomb First, Then Guitar

The correct flow for ETL pipeline creation is:

  1. Honeycomb: Define or alter tables in Git using schema-as-code.
  2. Deploy to Hive: Honeycomb’s pipeline applies the schema to Hive (and optionally prepares S3 paths).
  3. Guitar jobs: Write to the existing Hive table. No table creation occurs in Guitar.

Honeycomb Internals:

When we went looking for a declarative way to manage schemas, Skeema caught our attention as its model was simple and powerful: define the desired schema state using CREATE statements in Git, and let the tool safely compute and apply the required changes. It’s open source, well-designed, and works great for MySQL and MariaDB.

Then we ran into the catch. Skeema doesn’t support Hive!

Hive schemas don’t just describe columns and types; they encode storage details like S3 locations, table properties, and partitions. No existing tool handled these concerns well.

So we built our own!

Inspired by Skeema’s core ideas, we created a Hive-native, open-source schema diff engine backed by Git as the source of truth. It generates precise CREATE, DROP, and ALTER statements for tables and columns, and tracks changes to comments, S3 paths, and other table properties, giving us a reliable, reviewable way to evolve Hive schemas with confidence.

Intelligent Schema Diffing

Honeycomb automatically detects:

  • New columns
  • Dropped columns
  • Type changes (with compatibility checking)
  • Comment updates (table and column level)
  • New tables/views
  • Deleted tables/views
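At its core, column-level detection is a diff over two mappings: the desired schema parsed from Git and the actual schema read from the metastore. A minimal sketch, with hypothetical column names and a simplified `{column: hive_type}` representation (the real engine also tracks comments, locations, and properties):

```python
def diff_columns(desired, actual):
    """Diff two {column: hive_type} mappings (Git = desired, metastore = actual).

    Returns the additions, drops, and type changes a deployment would apply.
    Dict insertion order is preserved, matching column order in the schema file.
    """
    added   = {c: t for c, t in desired.items() if c not in actual}
    dropped = {c: t for c, t in actual.items() if c not in desired}
    changed = {
        c: (actual[c], desired[c])           # (old type, new type)
        for c in desired
        if c in actual and actual[c] != desired[c]
    }
    return added, dropped, changed
```

Everything downstream (which ALTER to emit, whether a change is compatible) is driven by this diff rather than by guessing what a PR intended.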

Smart handling:
  • Handle reserved keywords (like time, type, timestamp) by automatically quoting column names
  • Detect incompatible type changes (e.g., bigint → timestamp) and use REPLACE COLUMNS instead of CHANGE COLUMN
  • Preserve column order and relationships

Smart Type Compatibility Detection

Hive has strict rules about type conversions. Honeycomb knows them:

✅ Allowed (uses CHANGE COLUMN):

  • TINYINT → SMALLINT → INT → BIGINT (widening)
  • FLOAT → DOUBLE
  • VARCHAR(n) → STRING

❌ Not Allowed (uses REPLACE COLUMNS):

  • BIGINT → TIMESTAMP
  • STRING → INT
  • Any narrowing conversions

The system automatically chooses the right approach.
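The decision rule can be sketched as a lookup over Hive's widening chains, combined with the reserved-keyword quoting mentioned above. This is an illustrative simplification (real Hive has more conversion and quoting rules than the few sampled here), not Honeycomb's actual table:

```python
# Widening conversions Hive allows via ALTER TABLE ... CHANGE COLUMN.
WIDENING = {
    "tinyint":  {"smallint", "int", "bigint"},
    "smallint": {"int", "bigint"},
    "int":      {"bigint"},
    "float":    {"double"},
    "varchar":  {"string"},
}

RESERVED = {"time", "type", "timestamp"}  # sample of Hive reserved words

def quote(col):
    """Backtick-quote reserved identifiers so generated DDL stays valid."""
    return f"`{col}`" if col.lower() in RESERVED else col

def alter_for(table, col, old_type, new_type):
    """Pick CHANGE COLUMN for widening conversions, REPLACE COLUMNS otherwise."""
    old = old_type.lower().split("(")[0]   # varchar(64) -> varchar
    new = new_type.lower()
    if new in WIDENING.get(old, set()):
        return (f"ALTER TABLE {table} CHANGE COLUMN "
                f"{quote(col)} {quote(col)} {new_type}")
    # Incompatible change: restate the column list instead (the sketch
    # shows only the affected column for brevity).
    return f"ALTER TABLE {table} REPLACE COLUMNS ({quote(col)} {new_type})"
```

So `int → bigint` yields a safe CHANGE COLUMN, while `bigint → timestamp` falls through to the REPLACE COLUMNS path, matching the rules listed above.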

Concluding

By building Guitar and Honeycomb, we transformed our data engineering workflow from a fragile, manual process into a robust, automated platform. This combination of fast local Spark pipeline development and schema-as-code has not only improved our productivity but also eliminated entire categories of production issues.

The key takeaways from our journey:

  • Automation is essential: Manual schema management doesn’t scale
  • Developer experience matters: Make it easy to do the right thing
  • Version control everything: Git is your friend for both code and schemas
  • Test locally first: Save time and money by catching issues early with production-parity environments
  • Build in guardrails: Prevent errors before they reach production

Ajinkya Bhat

Passionate about Data Engineering, Scaling Infrastructure. I navigate the intricate realms of analytics with a keen eye for detail. As a self-proclaimed data nerd, I thrive on transforming raw information into actionable insights. Beyond the world of tech, I'm a jetsetter exploring the globe, all while embracing a vegetarian lifestyle.
