How we built a unified Data Engineering platform for schema management and Spark job development that transformed our workflows
Overview
At Anghami & OSN+, we process terabytes of data daily across multiple Spark jobs running on AWS EMR. Our engineering team faced two critical challenges:
- Spark Development Complexity: Making Spark development fast, safe, and accessible to developers who just want to write business logic
- Schema Management Chaos: Managing Hive table schemas at scale with proper governance, versioning, and automation
We solved both by building Guitar and Honeycomb, two complementary systems that form a complete data engineering platform.
🎸 Guitar is a Docker-based Spark framework that abstracts away Spark’s complexity and lets developers focus on business logic (SQL and high-level job patterns).
🍯🐝 Honeycomb is a Git-based schema management system that treats Hive schemas as code: versioned, validated, and automatically deployed.
Together they form the platform: Honeycomb manages your schemas, and Guitar runs your jobs against those schemas.



Why the name Guitar?
The name Guitar is the latest step in a long evolution, Harp → Lyre → Guitar, with roughly a decade of learning at Anghami! The ML team started running Spark jobs at the company from the very beginning; they built and maintained the Harp repo. Backend developers later created the Lyre repo (v2) for their own Spark use cases, learning from Harp. Guitar is v3: it takes those lessons and makes things better and simpler: unified local development, schema-as-code with Honeycomb, and a single platform for data engineering.
The Problems We Faced Before
1. Slow Feedback Loops
Testing ETL pipelines on EMR clusters was time-consuming and expensive:
- Long feedback loops (deploy to EMR to test)
- 2-3 hours per schema change (manual process)
- Hard to replicate production environment locally
2. Weak Schema Governance
Anyone with EMR/Hive access could create, drop, or alter Hive tables directly with no governance or approval layer:
- Accidental table creation in production
- Schema mismatches could cause job failures
- No easy rollback mechanism for schema updates
- Multiple conflicting “sources of truth” for table definitions
3. Shared Cluster Risk
Testing non-production jobs on EMR could impact production workloads:
- Resource contention between test and production jobs
- Accidental data loss from test jobs affecting production tables
- Risk of breaking critical production pipelines
4. Uncontrolled Costs
Unoptimized jobs could spike AWS costs even for small test changes:
- Testing multiple code changes on EMR incurred significant cloud charges
- No way to validate job efficiency before production deployment
- Difficulty estimating job costs before running
5. Human Error
Typos in S3 paths, table names, or secrets could break pipelines:
- No validation before job execution
- Manual, error-prone schema deployments
- Inconsistent error handling
- Ad-hoc backfills left no tracking documentation
The Solution
Guitar: Docker-based Spark Framework



Guitar abstracts away Spark’s complexity so developers can focus on business logic. The framework also ships a full Docker data engineering stack: Spark (the EMR image, for distributed data processing), Hive metastore (metadata catalog for tables), S3 (emulated locally with LocalStack), Trino (distributed SQL query engine), Redash (data visualization and dashboarding), Airflow (workflow orchestration and scheduling), Postgres, and MySQL all run as containers. One command brings the whole ecosystem up; developers run jobs, execute SQL, and build dashboards locally.
Core requirements:
- Built-in Security: Table validation, schema checks, production safeguards
- Complete Local Environment: Docker-based setup with Spark, Hive, S3 (LocalStack), Airflow, Trino, and Redash
- Zero-Downtime Deployments: SWAP folder strategy (new data is written to a new S3 path, preventing downtime and partial-data issues) for safe production updates
- Dual Language Support: Scala and Python (PySpark) with identical patterns
Key Design Principles:
- Tables are never created by jobs: Tables are created first in Honeycomb (schema-as-code), then deployed to Hive. Jobs only write into existing tables.
- Production writes use SWAP logic: Instead of saveAsTable() (which creates/overwrites tables), production jobs use SWAP-folder logic for refreshes.
- Local development mirrors production: All production dependencies, including Spark, the Hive metastore, S3, Trino, Redash, and Airflow, run as Docker containers with minimal setup required.
Honeycomb: Schema Management as Code



Honeycomb is integrated as a submodule within the Guitar repository, ensuring seamless schema management within Guitar’s local big data development environment. This integration allows developers to work with schemas alongside their data pipeline code, maintaining consistency across the entire data engineering workflow.
Core requirements:
- Git as Single Source of Truth: All schemas versioned in Git
- Automated Deployment: Smart ALTER statement generation and execution
- Multi-Layer Validation: Pre-commit, PR validation, and deployment verification
- Clear Audit Trail: Complete history of who changed what, when, and why
- Built-in Guardrails: Type compatibility checks, reserved keyword handling
Key Design Principles:
- Git defines reality: The repo is the source of truth for schemas. The Hive metastore is treated as an implementation detail that is continuously reconciled to match what’s in Git, enabling traceability and reviews through version control.
- No production touch during review: All PR checks operate purely on Git state, comparing branch changes against the main branch, without querying or mutating the live metastore. This keeps validation fast, predictable, and risk-free.
- Deploy only what actually changed: Post-merge deployments diff Git against the real metastore and apply only the necessary ALTERs. There’s no guessing: changes are driven by the observed difference between desired and actual state.
Guitar Internals:
The Spark framework runs against a stack of companion Docker apps: Postgres (backing the Hive metastore, Airflow, and Redash), the Hive metastore itself, LocalStack (S3 + Secrets), MySQL (ETL metadata), Trino, Redash, and Airflow. Together they give a production-like setup with the same catalog and the same storage model, with one command to bring the stack up and one command to tear it down. Nothing needs to be installed on the host except Docker.



Dependencies: what starts in what order
Containers start in tiers. Each tier waits for the previous one via health checks. The tiers in the image reflect what each stage does so the startup order is easy to remember and debug.
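The tiered startup described above can be sketched as a small dependency resolver. The service names and dependency edges below are illustrative (the real stack presumably expresses this with Docker Compose health checks, not Python): a service joins a tier only once everything it depends on is already healthy.

```python
def start_tiers(deps: dict[str, set[str]]) -> list[set[str]]:
    """Group services into startup tiers: a service starts only after all of
    its dependencies appeared (and became healthy) in an earlier tier."""
    tiers: list[set[str]] = []
    started: set[str] = set()
    remaining = dict(deps)
    while remaining:
        ready = {svc for svc, needs in remaining.items() if needs <= started}
        if not ready:
            raise ValueError("dependency cycle among services")
        tiers.append(ready)
        started |= ready
        for svc in ready:
            remaining.pop(svc)
    return tiers

# Illustrative subset of the stack's dependency graph
stack = {
    "postgres": set(),
    "localstack": set(),
    "hive-metastore": {"postgres"},
    "trino": {"hive-metastore"},
    "spark": {"hive-metastore", "localstack"},
}
```

Running `start_tiers(stack)` puts Postgres and LocalStack in the first tier, the Hive metastore in the second, and Trino and Spark last, which mirrors the wait-for-healthy ordering in the image.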



Prod jobs must not use saveAsTable to create or replace tables. Tables are owned by schema-as-code (Honeycomb); jobs can only write into tables that already exist in the Hive metastore. The framework enforces this at runtime.
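A minimal sketch of what such a runtime guard might look like (a hypothetical helper, not Guitar’s actual code; the real framework would consult the live Hive catalog rather than a plain set):

```python
def safe_insert_target(existing_tables: set[str], table: str) -> str:
    """Runtime guard (simplified): jobs may only write into tables that already
    exist in the metastore. A missing table is a hard failure, never a CREATE."""
    if table not in existing_tables:
        raise RuntimeError(
            f"Table '{table}' does not exist. Create it via Honeycomb "
            "(schema-as-code) first; Guitar jobs never create tables."
        )
    return table
```

Failing fast here turns a silent `saveAsTable` table creation (or a typo’d table name) into an immediate, explicit error before any data is written.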
Dictator linting: strict checks that fail the build
- Scala: Scalafmt enforces formatting; Scalafix runs with strict rules: unused imports and variables are removed, and violations fail the build.
- Python: Ruff runs lint and format checks on DAGs, PySpark code, and scripts. Ruff check and format are required in CI; failures block the PR. Together with the Scala tooling, the repo uses a “dictator” style: one format and one lint standard, enforced everywhere.
Honeycomb First, Then Guitar
The correct flow for ETL pipeline creation is:
- Honeycomb: Define or alter tables in Git using schema-as-code.
- Deploy to Hive: Honeycomb’s pipeline applies the schema to Hive (and optionally prepares S3 paths).
- Guitar jobs: Write to the existing Hive table. No table creation occurs in Guitar.
Honeycomb Internals:
When we went looking for a declarative way to manage schemas, Skeema caught our attention as its model was simple and powerful: define the desired schema state using CREATE statements in Git, and let the tool safely compute and apply the required changes. It’s open source, well-designed, and works great for MySQL and MariaDB.
Then we ran into the catch. Skeema doesn’t support Hive!
Hive schemas don’t just describe columns and types; they encode storage details like S3 locations, table properties, and partitions. No existing tool handled these concerns well.
So we built our own!
Inspired by Skeema’s core ideas, we created a Hive-native, open-source schema diff engine backed by Git as the source of truth. It generates precise CREATE, DROP, and ALTER statements for tables and columns, and tracks changes to comments, S3 paths, and other table properties, giving us a reliable, reviewable way to evolve Hive schemas with confidence.
Intelligent Schema Diffing
Honeycomb automatically detects:
- New columns
- Dropped columns
- Type changes (with compatibility checking)
- Comment updates (table and column level)
- New tables/views
- Deleted tables/views
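The column-level part of this detection can be sketched in a few lines (a hypothetical, simplified helper: it ignores column order, comments, and storage properties for brevity, all of which the real engine tracks):

```python
def diff_columns(current: dict[str, str], desired: dict[str, str]) -> dict[str, list[str]]:
    """Compare a table's live columns (from the metastore) against the
    Git-defined desired state. Values are Hive type strings, e.g. 'bigint'."""
    return {
        "added": [c for c in desired if c not in current],
        "dropped": [c for c in current if c not in desired],
        "type_changed": [c for c in desired if c in current and current[c] != desired[c]],
    }
```

Each bucket then maps to DDL: added columns become `ADD COLUMNS`, dropped ones feed into `REPLACE COLUMNS`, and type changes go through the compatibility check described below.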



Smart handling:
- Handle reserved keywords (like time, type, timestamp) by automatically quoting column names
- Detect incompatible type changes (e.g., bigint → timestamp) and use REPLACE COLUMNS instead of CHANGE COLUMN
- Preserve column order and relationships
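The reserved-keyword handling, for instance, can be as simple as backtick-quoting on emit (the keyword set below is an illustrative subset, not Hive’s full reserved list):

```python
HIVE_RESERVED = {"time", "type", "timestamp"}  # illustrative subset

def quote_if_reserved(column: str) -> str:
    """Backtick-quote a column name when it collides with a reserved keyword,
    so generated DDL like `ALTER TABLE ... CHANGE COLUMN` stays valid."""
    return f"`{column}`" if column.lower() in HIVE_RESERVED else column
```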
Smart Type Compatibility Detection
Hive has strict rules about type conversions. Honeycomb knows them:
✅ Allowed (uses CHANGE COLUMN):
- TINYINT → SMALLINT → INT → BIGINT (widening)
- FLOAT → DOUBLE
- VARCHAR(n) → STRING
❌ Not Allowed (uses REPLACE COLUMNS):
- BIGINT → TIMESTAMP
- STRING → INT
- Any narrowing conversions
The system automatically chooses the right approach.
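That selection logic can be sketched like this (a hypothetical, simplified compatibility table covering only the conversions listed above, not Honeycomb’s full rule set):

```python
# Widening conversions Hive accepts for CHANGE COLUMN (illustrative subset)
WIDENING = {
    "tinyint": {"smallint", "int", "bigint"},
    "smallint": {"int", "bigint"},
    "int": {"bigint"},
    "float": {"double"},
}

def is_compatible(old: str, new: str) -> bool:
    """True when Hive allows an in-place type change (CHANGE COLUMN)."""
    old, new = old.lower(), new.lower()
    if old == new:
        return True
    if old.startswith("varchar") and new == "string":
        return True
    return new in WIDENING.get(old, set())

def alter_strategy(old: str, new: str) -> str:
    """Pick the DDL approach for a column type change."""
    return "CHANGE COLUMN" if is_compatible(old, new) else "REPLACE COLUMNS"
```

So `int → bigint` widens and stays a `CHANGE COLUMN`, while `bigint → timestamp` (or any narrowing) falls back to `REPLACE COLUMNS`.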
Concluding
By building Guitar and Honeycomb, we transformed our data engineering workflow from a fragile, manual process into a robust, automated platform. The combination of fast local Spark pipeline development and schema-as-code has not only improved our productivity but also eliminated entire categories of production issues.
The key takeaways from our journey:
- Automation is essential: Manual schema management doesn’t scale
- Developer experience matters: Make it easy to do the right thing
- Version control everything: Git is your friend for both code and schemas
- Test locally first: Save time and money by catching issues early with production-parity environments
- Build in guardrails: Prevent errors before they reach production






