I have wanted to write this post for a while now. But honestly, after the marathon of delivering this product on a very short timeline, burnout had set in deep. Still, better late than never, and here we are. My colleague Elie Hage wrote about his perspective and experience in this blog post last year. I’d like to build on his thoughts and focus more on the server side of things.
Disclaimer: A lot of what I will write below is very opinionated; it stems from my own experience after a decade of working on backend systems.
Chapter 1: The Four-Month Countdown
The Reality on the Ground
Around November 2023, Elie Habib, Anghami’s co-founder, informed us that Anghami would soon become part of a bigger group called OSN and that we would be responsible for rebuilding their VOD offering, OSN+, from scratch. While we had a lot of experience in the music streaming business, this was our first venture into the world of video streaming. Our engineering team is small, but it has a track record of building fast and building well. We try our best!
The kicker? We couldn’t start building for real until January due to legal matters. This wasn’t a simple integration project: we weren’t bolting two systems together or building an adapter layer. OSN+ needed to be completely rebuilt on new infrastructure. For many reasons, we could not inherit any of the legacy codebase, which at least meant we didn’t inherit any of its technical debt.
Here’s the thing about mergers: they come with non-negotiable timelines. TL;DR: we had 4 months to deliver, and all we had was a blank slate.
The Technical Landscape
OSN+ was already an established product, with hundreds of thousands of active users watching movies and shows daily on the platform. HBO partnerships meant same-day-as-US releases, 4K streaming, DRM protection, multiple device types, and complex subscription models integrated with dozens of telco providers, among many other complications. The main driver behind the decision to rebuild from scratch was that the legacy OSN+ infrastructure was a patchwork. Different contracting agencies had built different pieces. Each platform had its own quirks, its own assumptions about how the backend worked.
There was no unified technical vision because there had never been one team. It’s hard enough to manage and organize teams within a single company; imagine doing it with so many external variables. On top of that, many of the core features, such as search, recommendations, and video streaming, were provided by third-party vendors. Those dependencies made any kind of modification or improvement very hard to implement. This setup meant a few things: running OSN+ was way more expensive than it should be, way more complicated to reason about than it should be, and way harder to maintain than it should be. And the worst part? Most of the contracts with external vendors ended in April, and renewing them was not cheap.
Like I said earlier, we couldn’t officially start building before January, but that didn’t mean we couldn’t start exploring our options. We started by hacking around the legacy (then-current) iOS app. We used Proxyman to sniff the network traffic flowing in and out of the app while we used it and explored its features. After one sleepless night, we had a basic web server written in Go that could speak the same language as the app. After proxying the API calls from the app to my local machine, I could launch the app, pick a profile, display a homepage, display a movie page, and play a static video. We presented this to the team, and it triggered a thought: we could potentially buy some time by keeping the legacy apps alive, simply by swapping their backend with ours. This would give us more time to build our new apps.
The Parallel Mission
We couldn’t just build the new platform and flip a switch. OSN+ users had active subscriptions, watch histories, and preferences. And from past experience, we already knew that people don’t tend to update their apps immediately. The scenario of having users on both the new and old backends was inevitable, and we had to prepare accordingly. So we had two missions:
- Build the future: Create the new OSN+ backend from scratch, designed to last
- Support the past: Keep legacy OSN+ apps running on our new infrastructure without requiring a major app update. Remember that updating all platforms was no small feat given the distributed nature of the teams maintaining the various apps.
The second mission was crucial. If we could reverse engineer the legacy API contracts and implement them on our new backend, we could migrate users transparently. They’d wake up one day using a new backend without knowing anything changed. This meant we’d be building two API layers: one for the new apps (built in parallel by the frontend teams), and one that mimicked the legacy contracts perfectly enough to fool apps built by contractors we’d never met. The demotivating factor here was that the people working on the legacy mimicry already knew that every piece of code they were writing was destined for the garbage chute. It was simply there to give us time to breathe.
The Immovable Deadline
April 2024. That was the line in the sand. HBO’s House of the Dragon Season 2 was launching in June. We needed the new platform production-ready with time to spare for users to update their apps. Anybody who has developed an iOS app knows that the Apple approval process takes time, sometimes more time than we could afford. We couldn’t take any chances. The other pressure point was the expiry date of the legacy OSN+ vendor contracts. If we weren’t ready by then, OSN+ would simply stop working overnight. It would be a shell of an app that did nothing useful. We had four months to:
- Reverse engineer a black box system built by multiple agencies
- Design and build a new backend from scratch
- Implement video streaming infrastructure we’d never built before
- Migrate user data and subscriptions without downtime
- Merge our user bases
- Support both legacy and new apps simultaneously
- Build 30+ telco integrations
- Add 4K streaming support (which wasn’t supported by the legacy apps)
The math didn’t add up. But the deadline wasn’t negotiable.
Chapter 2: Black Box Refactoring
The Transparent Proxy Strategy
Understanding the legacy contracts through conversation and documentation (where it existed) was one thing. But we needed much more. We needed to see the actual data flowing between the apps and the legacy backend. We decided to put a transparent proxy, running on our infrastructure, between the apps and the legacy backend. We used the Go standard library’s reverse proxy implementation from net/http/httputil. The concept was simple: place ourselves in the middle. The legacy apps would make requests thinking they’re talking to the old backend, but those requests would flow through our proxy first. Now we could inspect everything: from request and response formats to headers, error cases, and authentication flows. After sniffing around for what we needed, we could forward the request to the real legacy backend and then capture the responses.
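To give a flavor of how little code this takes, here is a minimal sketch of that proxy using the standard library’s httputil.NewSingleHostReverseProxy. The legacy backend URL, the port, and the logging are illustrative, not our real setup.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// The real legacy backend; the URL here is a placeholder.
	target, err := url.Parse("https://legacy-api.example.com")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(target)

	// Inspect every response coming back from the legacy backend
	// before it reaches the app.
	proxy.ModifyResponse = func(resp *http.Response) error {
		log.Printf("%s %s -> %d", resp.Request.Method, resp.Request.URL.Path, resp.StatusCode)
		return nil
	}

	// Log every incoming request, then hand it off to the proxy untouched.
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("incoming: %s %s", r.Method, r.URL.Path)
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", handler))
}
```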
That gave us something invaluable: real traffic from real users at scale. We started logging everything: every endpoint hit, every parameter passed, every response code returned. The proxy allowed us to gradually start reimplementing the endpoints that the apps relied on, so we immediately started moving as much as possible, as fast as possible. Prioritizing was key here because not every feature is created equal. We had to focus on the core first: letting users watch videos.
At the same time, we realized that the proxy was the ideal place to start gathering data. After figuring out which endpoints were responsible for listing and creating user profiles, for example, we could simply start saving them in our database before proxying the requests. This allowed us to gather everything we needed, mainly:
- The active customers
- The user profiles
- The users’ bookmarks (which enable features like “Continue Watching”)
- The users’ movies and shows lists
Combining this real-time data gathering with eventual data dumps provided by the OSN team, we had solved the data migration problem, at least in theory. This approach came with its own challenges. We needed to handle the data carefully: user privacy concerns, data consistency, and conflicts when the same user might be active on both systems during the transition. The proxy wasn’t just a read-only observer anymore; it was actively participating in maintaining state across two systems.
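Building on the proxy sketch above, here is roughly how that capture could look: intercept the responses for the endpoints we care about, persist a copy, and pass the payload through untouched. The endpoint path and the saveProfiles helper are placeholders for this sketch, not our real schema or persistence code.

```go
package proxy

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// saveProfiles stands in for the real persistence layer that wrote captured
// payloads into our own database for the migration.
func saveProfiles(path string, payload []byte) {
	log.Printf("captured %d bytes from %s", len(payload), path)
}

func newCapturingProxy(target *url.URL) *httputil.ReverseProxy {
	p := httputil.NewSingleHostReverseProxy(target)
	p.ModifyResponse = func(resp *http.Response) error {
		// Only intercept the endpoints we care about; the path is illustrative.
		if !strings.HasPrefix(resp.Request.URL.Path, "/v1/profiles") {
			return nil
		}
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return err
		}
		resp.Body.Close()
		saveProfiles(resp.Request.URL.Path, body)
		// Hand the untouched payload back to the legacy app.
		resp.Body = io.NopCloser(bytes.NewReader(body))
		return nil
	}
	return p
}
```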
Building the Compatibility Layer
Now we had the contracts. We had the data. We needed to build the thing that would actually serve the legacy apps. This wasn’t like building a normal API. When you design a new API, you think about what makes sense: clean resource models, consistent naming, RESTful patterns where they fit, proper error handling, etc. You get to make decisions that optimize for clarity and maintainability. But we weren’t designing; we were replicating. Unfortunately, this meant accepting some truly bizarre decisions. I won’t go into the weeds here, but it’s safe to say this was not a fun exercise. And the least fun part was that we had focused on reverse engineering the iOS app, thinking that would be enough to serve all the other platforms. But alas, every platform had made its own assumptions about how to interact with the backend.
The compatibility layer was an exercise in humility. Every instinct to “fix” something had to be suppressed; we did not have a choice. Our job wasn’t to make it better; our job was to make it identical. We built test suites that compared our responses byte-for-byte with the legacy system. We collected thousands of real request-response pairs from the proxy and turned them into regression tests. If the legacy API returned a specific error message, we returned the same error message, typos and all.
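A regression test over those recorded pairs can be surprisingly simple. The sketch below assumes the captured pairs are stored as JSON files and that newCompatHandler returns our reimplementation of the legacy API; the names and file layout are illustrative, not our actual test suite.

```go
package compat

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"testing"
)

// recordedPair mirrors what the transparent proxy logged for each call.
type recordedPair struct {
	Method       string `json:"method"`
	Path         string `json:"path"`
	StatusCode   int    `json:"status_code"`
	ResponseBody []byte `json:"response_body"`
}

func TestCompatLayerMatchesLegacy(t *testing.T) {
	// newCompatHandler is a placeholder for the compatibility layer under test.
	srv := httptest.NewServer(newCompatHandler())
	defer srv.Close()

	files, err := filepath.Glob("testdata/pairs/*.json")
	if err != nil {
		t.Fatal(err)
	}
	for _, f := range files {
		raw, err := os.ReadFile(f)
		if err != nil {
			t.Fatal(err)
		}
		var pair recordedPair
		if err := json.Unmarshal(raw, &pair); err != nil {
			t.Fatal(err)
		}

		req, err := http.NewRequest(pair.Method, srv.URL+pair.Path, nil)
		if err != nil {
			t.Fatal(err)
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			t.Fatal(err)
		}
		got, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		if resp.StatusCode != pair.StatusCode {
			t.Errorf("%s: got status %d, legacy returned %d", pair.Path, resp.StatusCode, pair.StatusCode)
		}
		// Byte-for-byte, typos and all.
		if !bytes.Equal(got, pair.ResponseBody) {
			t.Errorf("%s: body differs from the legacy capture", pair.Path)
		}
	}
}
```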
Chapter 3: Contracts as Infrastructure
The Speed vs. Safety Paradox
There’s a common misconception that speed and safety are opposing forces in software development. The common trope is “move fast and break things.” Build carefully and you miss deadlines. Pick one; you can’t have both. In our case, we didn’t have the luxury of picking one. We needed both. Four months to build a platform from scratch meant we had to move incredibly fast. But hundreds of thousands of active users meant we couldn’t afford to break things. A bug in production wouldn’t just be embarrassing; it would cost real money and destroy user trust.
Traditionally, moving fast means cutting corners: skipping documentation, deferring testing for the “later” that never comes, building quick and dirty prototypes to iterate on, except the iteration doesn’t always happen. When you’re building infrastructure that needs to support multiple teams working in parallel, cutting corners doesn’t make you fast. It makes you slower. Every ambiguity becomes a blocker, every assumption becomes a point of coordination, and every misunderstanding becomes rework: a loss of precious time we did not have.
We needed a different approach. We needed a way to move fast that actually made us safer, not less safe. The answer to that was ironclad contracts. Not contracts in the legal sense, but in the technical sense. Explicit, unambiguous definitions of how every piece of the system would interact with every other piece. Before anyone wrote a single line of production code, we would define exactly what the API would look like. The default instinct is to assume that this would involve more planning, more meetings, and more documentation. In reality, the opposite happened, because once we had the contract defined, everyone was unblocked and could move independently. The contract became our shared source of truth. Not documentation that might be outdated. Not API endpoints that might change. A versioned, explicit contract that everyone agreed on and could build against with confidence.
Protobuf as the Foundation
We chose Protocol Buffers (Protobuf) as our contract definition language. We had some previous experience using them, but we had never gone all the way and made them the core of our architecture. We had an intuition that it would work out and decided to go with it. To be honest, there wasn’t much time to second-guess decisions. We had to trust each other. Protobuf is a typed, language-agnostic schema definition format. You define your data structures once, and you can generate code for any platform: Go for the backend, Swift for iOS, Kotlin for Android, TypeScript for web. Everyone works with the same structure, enforced by the compiler. For more information and examples of how we used Protobuf to define our contracts, please refer to the article that Elie Hage wrote.
Once a contract was defined, the backend committed to returning that structure and the frontend committed to consuming it. No ambiguity about field names, types, or structure. The compiler enforces it. In addition to type safety, it gave us evolution. We could add new fields without breaking existing clients. We could deprecate old fields gracefully. Version management was built into the format. This was crucial when we were moving fast: we could iterate on the contract without breaking everything that depended on it.
Parallel Development in Practice
The real power of contracts showed up in how teams worked. Traditional API development typically involves the backend building an endpoint while the frontend waits. Then the frontend tries to use it, finds problems, and reports them back. The backend fixes the issues, the frontend waits again… Repeat until it works. This is a slow process, and when you have multiple frontend platforms, it’s multiplicatively slower. Our approach flipped this model on its head. Everything started by defining the contract based on requirements; after that, everybody could work in parallel. When all sides were ready, integration was trivial. The frontend pointed at the real API instead of mocks. If the contract was followed, it just worked. No surprises, no “oh, I didn’t realize you needed this field,” no back-and-forth. We did this across dozens of features simultaneously. Different engineers, different features, all moving in parallel. The contract kept everyone aligned without constant coordination.
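In practice, “working against mocks” could be as simple as a throwaway Go server that returns canned data in the agreed shape while the real implementation is still being built. The endpoint and field names below are invented for this sketch; in reality, the types came from the Protobuf-generated code.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// HomepageResponse mirrors, in plain Go, the shape the contract defines for
// the homepage endpoint. The fields are illustrative only.
type HomepageResponse struct {
	Rows []Row `json:"rows"`
}

type Row struct {
	Title string   `json:"title"`
	Items []string `json:"items"`
}

func main() {
	// Frontend teams point at this mock until the real backend is ready.
	http.HandleFunc("/v1/homepage", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(HomepageResponse{
			Rows: []Row{{Title: "Trending Now", Items: []string{"show-1", "show-2"}}},
		})
	})
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```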
Removing the coordination headache allowed teams to focus on other work that was crucial to the merger. The billing team had to integrate with more than 30 telco providers. For those who don’t know, this can quickly become a nightmare: every provider handles things differently. This pattern repeated across the entire project. DRM integration, video encoding, analytics: complex systems built by different people, each of which could be worked on without worrying about how everything would fit together in the end.
Contracts didn’t just make us fast; they made us scalable. They let a small team accomplish what would normally require a much larger and more coordinated effort.
Chapter 4: The Monolith Decision
The Microservices Pressure
In 2024, if you’re building a new backend system, there’s an almost gravitational pull toward microservices. It’s the default architecture in most technical discussions. “We’re building a video streaming platform? Obviously we need microservices.” Separate services for authentication, content catalog, playback, recommendations, billing. Each with its own database, its own deployment pipeline, its own team ownership. As a matter of fact, that’s how Amazon Prime works, right?! (It did until it didn’t).
The arguments for microservices are always compelling: independent deployments, flexibility in technology choice, small teams owning specific domains, fault isolation, independent scaling, you name it. These are real benefits, but only in the right context. In my opinion, microservices simply do not work for a small team working in a new domain on a four-month timeline. They bring far more DevOps and infrastructure challenges: distributed tracing and monitoring, high availability, and scaling all become exponentially more complicated the more services you add to the stack. Networking becomes a big failure point. Not to mention that data consistency in that world is a tough pickle to handle. There was some pressure to go with microservices. Industry best practices pointed that way. Conference talks and blog posts from companies we admired described microservices architectures. Some team members had worked in microservices environments before and advocated for them.
Like my disclaimer mentioned, this article is going to be opinionated. We didn’t think microservices would be a viable approach, so we decided to go in another direction.
The Modular Monolith Approach
In my opinion, modular monoliths give you most of the benefits that microservices offer, without the downsides. When people think of monoliths, they think of a ball of mud. But a modular monolith defies this notion.
The idea is simple: structure your codebase as if the services were separate, but deploy them together. Just squint your eyes and pretend they are microservices by creating strong business boundaries between your modules, and voila: you get a maintainable piece of software. All the design patterns out there, from Domain-Driven Design (DDD) to hexagonal architecture to whatever the flavor of the month is, are trying to teach the same thing: create good boundaries that respect the domain language.
Modules are not allowed to depend on each other, and thankfully we can enforce this easily with the magic “internal” directory in Go. Modules can only communicate through their exposed public API represented by Go interfaces; the rest is kept private. If the billing module needed user data, it called the users module through its interface, never by directly querying the users database. This gave us logical separation without physical separation. Modules couldn’t accidentally depend on each other’s implementation details. We got the architectural benefits of service boundaries. But operationally, it was simple. One binary to build. One process to deploy. One set of logs to search. One monitoring dashboard. When we needed to debug a request that touched multiple modules, we could trace through it in a single codebase. No distributed tracing infrastructure required.
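Here’s a minimal sketch of what that looks like on disk and in code. The module names, paths, and interface are illustrative; the point is that other modules can only import this small public surface, while everything under internal/ stays out of reach.

```go
// modules/users/api.go — the only file other modules are allowed to import.
// The implementation lives under modules/users/internal/ and is invisible
// to the rest of the codebase, thanks to Go's internal-package rule.
package users

import "context"

// User is the shape the users module exposes to other modules;
// the fields here are illustrative.
type User struct {
	ID    string
	Email string
}

// API is the public surface of the users module. The billing module holds a
// value of this interface; it never queries the users tables directly.
type API interface {
	GetUser(ctx context.Context, id string) (User, error)
}
```

The billing module then takes a users.API in its constructor, which keeps the dependency explicit and trivially mockable in tests.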
What We Gained (and What We Gave Up)
The modular monolith let us move fast. We spent our time building features, not infrastructure. Development velocity was high because the feedback loop was tight. Change some code, run tests, deploy. No waiting for multiple services to build and deploy. No coordinating releases across services. Debugging was straightforward. Set a breakpoint, step through code, see exactly what happened. No trying to piece together what happened across service boundaries from distributed logs.
But we gave up some things. We couldn’t scale individual components independently. If playback was the bottleneck, we had to scale the whole monolith, not just the playback service. In practice, this mattered less than you’d think. We couldn’t use different technologies for different problems. Everything was Go. For us, this was fine. We lost some fault isolation. If one module had a bug that crashed the process, the whole thing went down. We mitigated this with good error handling and panic recovery, but the risk was real.
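That panic recovery was middleware-shaped. Something along these lines (a simplified sketch, not our exact code) turns a panic in one request into a 500 for that request instead of taking the whole process down with it:

```go
package middleware

import (
	"log"
	"net/http"
	"runtime/debug"
)

// Recover converts a panic inside any downstream handler into a 500 response
// for that single request, logging the stack trace so we can fix the bug.
func Recover(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				log.Printf("panic on %s %s: %v\n%s", r.Method, r.URL.Path, err, debug.Stack())
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}
```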
The monolith was the right choice for us, at that time, with that team, under those constraints. It let us ship in four months. Could we have done it with microservices? Maybe. But I doubt it. Sometimes the boring choice is the right choice.
Chapter 5: Tech Stack Choices (And Mistakes)
Go: The Safe Choice
We chose Go for the backend. This wasn’t a controversial decision; it was practically a non-decision. Anghami’s backend was already Go. The team knew Go. We had years of production experience with Go. We had libraries, patterns, and infrastructure built around Go.
When you have four months to ship, you don’t experiment with your foundation. You use what you know works. Go gave us what we needed: fast compilation, strong typing, great concurrency primitives, excellent standard library, and a mature ecosystem. The performance was good enough for our scale. The tooling was solid. Most importantly, everyone on the team could be productive immediately.
There’s a certain appeal to choosing the hot new technology. “We’re building something from scratch, let’s use Rust! Let’s try Elixir! Let’s go with TypeScript on the backend!” But every new technology comes with a learning curve. With unknowns. With sharp edges you won’t discover until production. We didn’t have time for that. Go was boring. Go was proven. Go was the right choice.
gRPC: The Wrong Choice
Now here’s where we screwed up.
We decided to use gRPC as our API protocol (my bad!). On paper, it made sense. We were already using Protobuf for contracts and gRPC is built on Protobuf. It’s fast. It’s efficient. It has built-in code generation. Major companies use it. It felt like the modern, forward-thinking choice. Didn’t I just say that the shiny new thing is more often than not the wrong choice? I guess I should practice what I preach more often.
The reasoning went like this: we’re defining everything in Protobuf anyway. gRPC services are just Protobuf with RPC semantics. We can generate both client and server code. Type safety end-to-end. Streaming support built-in (sounds cool, but we never ended up using it). Better performance than JSON over HTTP. So we built our backend APIs as gRPC services. Generated Go server code from our Protobuf definitions. Started implementing endpoints. Everything worked great… in development.
Mobile apps don’t speak gRPC natively; or rather, they can, but it’s complicated. You lose a lot of the support the operating system gives you for free when you use plain HTTP. gRPC-Web exists, but it’s not the same as native gRPC. Browser support is limited. Our web frontend couldn’t easily call gRPC services directly. We had built a gRPC backend for a world that needed HTTP. This was a fundamental mismatch.
The “right” solution would have been to go back and rebuild with HTTP. Start over with REST or HTTP+JSON endpoints. But we didn’t have time for that. We were weeks into development. We had endpoints built, business logic implemented, tests written. Starting over meant missing our deadline.
So we did what you do when you’ve made a bad architectural choice and can’t afford to fix it properly: we found a workaround.
The grpc-gateway Bandaid
Enter grpc-gateway. It’s a reverse proxy that translates HTTP+JSON calls into gRPC calls. You annotate your Protobuf definitions with HTTP mappings, and it generates a proxy that sits in front of your gRPC server. So instead of clients calling gRPC directly, they call HTTP endpoints. The gateway translates the HTTP request into a gRPC call, forwards it to our backend, gets the gRPC response, and translates it back to HTTP+JSON for the client.
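Wiring the gateway up is only a few lines on top of the generated code. The sketch below uses grpc-gateway v2; the service name (CatalogService), the generated package path, and the ports are placeholders, not our actual setup.

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Generated from the annotated Protobuf definitions; the path is illustrative.
	pb "example.com/osn/gen/catalog/v1"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// The gateway is a plain HTTP server that translates JSON requests into
	// gRPC calls against the real backend and translates the responses back.
	mux := runtime.NewServeMux()
	opts := []grpc.DialOption{grpc.WithTransportCredentials(insecure.NewCredentials())}

	// This registration function is generated per service by protoc-gen-grpc-gateway.
	if err := pb.RegisterCatalogServiceHandlerFromEndpoint(ctx, mux, "localhost:9090", opts); err != nil {
		log.Fatal(err)
	}

	// Clients speak HTTP+JSON to :8080; the gRPC server keeps listening on :9090.
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```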
It worked. It let us keep our gRPC backend while exposing HTTP APIs. But it was a bandaid. An extra layer of translation. Another place for bugs to hide. More complexity in the request path. Performance overhead from the translation layer. And it meant we were now maintaining HTTP API definitions (the gateway annotations) in addition to our gRPC definitions. The worst part? We knew it was wrong while we were doing it. We knew we’d taken a wrong turn. But the deadline was immovable, and going back meant failure. So we moved forward with grpc-gateway and made it work.
In production, it actually held up fine. The performance overhead was there, but we could live with it. The translation layer didn’t cause major bugs. Users didn’t know or care that there was an extra hop in the request path (who knew that users don’t care about your technology choices?). By pragmatic measures, it was fine. But as engineers, we knew. We had built something more complicated than it needed to be. We had a layer of indirection that served no real purpose except to patch up an early mistake.
Silver Lining: Building sebuf
Here’s the thing about mistakes: sometimes they teach you something valuable.
The grpc-gateway experience taught us that the translation layer between HTTP and Protobuf was actually useful. Not as a workaround for gRPC, but as a general pattern. What if we could define APIs in Protobuf and automatically get HTTP support, without needing gRPC at all?
That idea eventually led to building sebuf: https://github.com/SebastienMelki/sebuf
Sebuf is a code generator that takes Protobuf definitions and generates HTTP handlers for Go. You get the type safety and contract benefits of Protobuf, but your actual API is just HTTP+JSON. No gRPC dependency. No translation layer. Just clean, simple HTTP services with Protobuf-derived types. It solved the problem we should have solved from the beginning: how to use Protobuf contracts without committing to gRPC as a transport protocol.
In hindsight, if we’d had sebuf at the start, we would have used it. We would have skipped gRPC entirely, built HTTP services from day one, and avoided the whole grpc-gateway detour. But we wouldn’t have built sebuf if we hadn’t made the gRPC mistake first. The pain of working with grpc-gateway motivated us to find a better solution. The experience taught us exactly what we needed: Protobuf contracts with HTTP simplicity.
This became useful beyond just OSN+. We started thinking about how to unify the Anghami and OSN+ backends. They both used Protobuf contracts. They both needed to expose HTTP APIs. With sebuf, we could have both platforms share the same contract definitions and generate compatible HTTP endpoints. We’re getting close to that reality now. But it started with a mistake. With choosing gRPC when we should have chosen HTTP. Sometimes the best tools come from solving your own pain.
Chapter 6: Building a Unified Backend
The Unified Backend Vision
Here’s something we realized early: we weren’t just building OSN+. We were building the foundation for something bigger. Anghami and OSN+ would remain separate apps, separate brands, and separate user experiences. Music and video are different products with different needs. But under the hood? There was no reason they couldn’t share infrastructure.
The vision was simple: one backend codebase that could serve both platforms. Not two separate backends that happened to look similar. One actual codebase, with both Anghami and OSN+ as different entry points into the same system. This wasn’t just an architectural aesthetic. It had real benefits:
Shared user accounts: A user could have one account that worked across both platforms. Sign in once, access everything. Your profile, preferences, and subscriptions, all unified.
Shared infrastructure: Authentication, user management, payment processing, analytics. Build it once, use it everywhere. No duplicating effort, no keeping two systems in sync.
Cross-learning: Improvements to one platform benefit the other. Better caching strategy in Anghami? OSN+ gets it too. New recommendation algorithm in OSN+? Anghami can use it.
Operational simplicity: One deployment pipeline. One monitoring system. One on-call rotation. Instead of maintaining two separate backends, we’d maintain one that served two products.
The key insight was that Protobuf contracts made this possible. If both platforms used the same contract definitions, they could share code naturally. Anghami endpoints could be defined in Protobuf. OSN+ endpoints could be defined in Protobuf. And code that didn’t care about music vs. video could be truly shared.
We weren’t there yet when we started building OSN+ in January 2024. But we knew where we wanted to go. And we made architectural decisions with that future in mind. And now we are reaping those benefits as we speak, building more fun features for both products!
Leveraging Anghami’s Proven Stack
We had a huge advantage: Anghami’s backend was already mature. Years of production experience. Millions of users. Battle-tested systems for things that OSN+ would need. We’d been through multiple iterations, solved edge cases, handled security properly. We could take that system wholesale and use it for OSN+.
Authentication, search, recommendation engines, caching layers, analytics, monitoring: we had so much we could build on top of. This wasn’t just code reuse; it was also knowledge reuse. The team that built Anghami knew how to build streaming platforms at scale. They’d made mistakes, learned lessons, and built something that worked. OSN+ got to benefit from all of that experience. The modular monolith architecture helped here. We could take Anghami modules and reuse them as part of the OSN+ implementation. With clear interfaces between modules, the music-specific parts stayed separate while the general infrastructure was shared.
Of course, there were many things to adapt. Music streaming and video streaming are different after all. Content delivery especially was a much bigger beast when serving huge 4K videos with DRM than when serving tiny 4 MB audio files. The bandwidth requirements were different. Buffering and playback logic were completely different. I won’t go in depth on the technical aspects here, as it is not my area of expertise, but I will give a huge shoutout to the engineers who solved those big challenges. The approach was pragmatic: reuse what makes sense, rebuild what doesn’t. Don’t force shared code where the domains are fundamentally different. But don’t rebuild from scratch when you have working solutions.
Chapter 7: The March Decision
Two Tracks, Finite Resources
By March, we’d been running on two parallel tracks for months. Track one: build the new OSN+ backend from scratch. Track two: reverse engineer the legacy system and keep old apps running on new infrastructure. Both were progressing. The new backend was taking shape: core features implemented, APIs defined, modules built. The reverse engineering effort had also made real progress. We had the transparent proxy running, capturing traffic, collecting data. We’d implemented compatibility layers for major legacy endpoints. Some legacy apps were already running against our backend in testing.
But we were a small team. Every engineer working on legacy compatibility was an engineer not working on the new platform. Every hour spent debugging why a legacy Android app expected a slightly different JSON structure was an hour not spent building the future. The original plan was to keep both tracks running until launch. Ship the new apps on the new backend, but also support legacy apps seamlessly. Give users time to update gradually. No forced upgrades, no breaking changes.
That was the plan. But plans meet reality, and reality doesn’t care about your plans.
The Reality Check
Towards the end of February, we did an honest assessment. We looked at what we’d built, what remained, and how much time we had. April 1st was only a few weeks away. Here’s what the picture looked like:
New backend: Core functionality mostly done. But missing critical features. 4K streaming support: not implemented. Advanced DRM features: partially done. Some telco integrations: still in progress. Testing and hardening: barely started. We could ship something by April 1st, but it would be rough. Features would be missing. Bugs would be lurking.
Legacy compatibility: We’d proven the concept. Legacy apps could run on our backend for basic flows. But “basic flows” wasn’t enough. Edge cases were everywhere. Different app versions with different assumptions. Platform-specific quirks. Error handling that didn’t match legacy behavior. To truly support legacy apps in production, reliably, at scale, with all the edge cases handled, we needed weeks more work.
We couldn’t do both well. We could half-ass both and ship a shaky platform that sort-of supported new apps and sort-of supported legacy apps. Or we could focus. There was another factor: the legacy OSN+ vendor contracts. The ones that were expiring at the end of April.
The Risk Assessment
Those vendor contracts were for critical services:
- Video streaming infrastructure (CDN, encoding)
- Search provider
- Recommendation engine
When those contracts expired, legacy OSN+ apps would break. Not degrade, break. Users would open the app and nothing would work. No video playback. No search. No recommendations. The app would be a shell. Our options were limited.
Option A (support legacy): Risk of shipping a buggy new platform. Risk of building a compatibility layer we’d maintain for months or years. Risk of spreading the team too thin and doing nothing well.
Option B (focus on new): Risk of breaking legacy apps. Risk of bad user experience for people who don’t update. Risk of negative reviews and support tickets. Potential revenue loss during the transition.
We gathered the team. Backend, frontend, product, leadership. Laid out the situation. Debated the options. That’s when I started going even balder than I already was.
Making the Call
The decision was made, and I was the unfortunate messenger: stop working on legacy compatibility. Everyone focuses on the new platform. It wasn’t a comfortable decision. We’d spent months on the reverse engineering effort. Engineers had put real work into the compatibility layer. And now we were abandoning it. Not because it didn’t work, but because we couldn’t afford to finish it properly.
The reasoning was this: our job was to build the best platform we could, not to avoid short-term pain. Yes, some users would be forced to update. Yes, that would cause friction. But the alternative was shipping a half-finished platform that both new and legacy users would suffer with for years. We could build a great new platform and manage a bumpy transition, or we could build a mediocre platform trying to serve two masters. The choice was uncomfortable but clear.
We communicated the decision clearly: after April 30th, when the legacy vendor contracts expired, only updated apps would work. We’d push hard on getting users to update. In-app notifications, emails, and support communications would hopefully do the trick. We’d monitor update rates closely. But we wouldn’t compromise the new platform to save legacy apps.
The team shifted focus immediately. Engineers moved off legacy work and onto new platform features. The transparent proxy stayed running (we still needed the data collection, after all), but we stopped building new compatibility endpoints. All hands on deck for the new backend.
It was a bet. We were betting that the new platform would be good enough to justify the transition pain. That users would update. That the improved experience would win them back even if the forced update annoyed them initially.
Looking back, it was the right call. But in March, when we made it, we didn’t know that. We knew the deadline was immovable, the resources were finite, and we had to choose. Sometimes leadership is about making the least-bad choice under constraints you can’t change.
Not every decision in software is technical. Sometimes it’s about understanding what you’re optimizing for, what you’re willing to risk, and what you’re not willing to compromise on. We optimized for platform quality over transition smoothness. We risked short-term user friction for long-term product health. And we refused to compromise on shipping something we’d be proud of.
The contracts expired. Legacy apps broke. Users updated. The platform held. It worked out. But that’s the story we tell now. In March, when we made the call, it just felt like the least-bad option we had.
Chapter 8: Performance Through Architecture
Performance: Speed Through Architecture
One of the most visible improvements after the rebuild was performance. API response times dropped significantly. The platform felt faster, more responsive. Users noticed.
This wasn’t about rewriting slow code or finding performance bugs in the legacy system. Our advantage was starting fresh with a clear architecture and modern patterns.
Multi-tiered caching was our biggest win. We implemented a layered approach:
- In-memory caching for hot data like frequently accessed content metadata. Microsecond access times.
- Redis for shared state across servers, like user preferences and computed recommendations. Single-digit millisecond latency.
- Edge CDN for static content and media. Served from locations closest to users.
The key was being smart about what to cache and for how long. Content catalogs change infrequently: cache aggressively. User watch positions update constantly: cache briefly. We tuned TTLs based on actual usage patterns.
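As a rough illustration of the first two tiers, here is a simplified read-through cache: check a local in-memory map first, fall back to Redis, and let the caller hit the database on a full miss. It’s a sketch only (no eviction, no per-key TTLs, no request coalescing), and it assumes the go-redis client; our real caching layer is more involved.

```go
package cache

import (
	"context"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

type entry struct {
	value     string
	expiresAt time.Time
}

// TieredCache checks process memory first (microseconds), then Redis
// (single-digit milliseconds). A miss on both means the caller goes to the
// database and calls Set with the result.
type TieredCache struct {
	mu    sync.RWMutex
	local map[string]entry
	rdb   *redis.Client
	ttl   time.Duration
}

func New(rdb *redis.Client, ttl time.Duration) *TieredCache {
	return &TieredCache{local: make(map[string]entry), rdb: rdb, ttl: ttl}
}

func (c *TieredCache) Get(ctx context.Context, key string) (string, bool) {
	c.mu.RLock()
	e, ok := c.local[key]
	c.mu.RUnlock()
	if ok && time.Now().Before(e.expiresAt) {
		return e.value, true
	}
	val, err := c.rdb.Get(ctx, key).Result()
	if err != nil {
		return "", false // miss (or Redis error): fall through to the database
	}
	c.setLocal(key, val)
	return val, true
}

func (c *TieredCache) Set(ctx context.Context, key, val string) {
	c.setLocal(key, val)
	c.rdb.Set(ctx, key, val, c.ttl)
}

func (c *TieredCache) setLocal(key, val string) {
	c.mu.Lock()
	c.local[key] = entry{value: val, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
}
```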
Database access patterns were also optimized from the start. The modular monolith helped here. Each module owned its data and we could optimize queries for specific access patterns. Proper indexing, query optimization, connection pooling. The basics, done right.
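Connection pooling in Go’s database/sql is mostly a matter of setting sane limits and actually thinking about them. The numbers and the Postgres driver below are illustrative only; in reality these get tuned against observed load, not copied from a blog post.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver shown as an example; use whatever your database needs
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/osnplus?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Illustrative pool settings: cap concurrency, keep some warm connections
	// for bursts, and recycle connections so failovers get picked up.
	db.SetMaxOpenConns(50)
	db.SetMaxIdleConns(25)
	db.SetConnMaxLifetime(30 * time.Minute)

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("database pool ready")
}
```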
API design favored fewer, richer endpoints over chatty back-and-forth. Instead of multiple round trips to build a screen, one request returned everything needed. This reduced network overhead and simplified client logic. We have been questioning this decision lately, though, and are aiming for a healthier compromise. Maybe we will discuss this in a future article.
The result: homepage loads that used to take seconds now took hundreds of milliseconds. Search results appeared instantly. Video playback started faster. The whole experience felt snappier.
We’re currently experimenting with edge compute, running logic closer to users rather than centrally. Early results are promising for certain use cases, though we’re still figuring out where it makes sense versus adding complexity.
The performance gains weren’t magic. They came from thoughtful architecture decisions, aggressive but smart caching, and the luxury of building with performance in mind from day one rather than retrofitting it later.
Chapter 9: Launch Week
Launch Week: Hell, But Prepared Hell
April 1st. Launch day. After four months of nonstop work, we were flipping the switch.
The new apps started rolling out across platforms. iOS first, our biggest user base. Then Apple TV. Android followed. Web updates deployed. The new backend went live, serving production traffic. We’d done everything we could to prepare. Load testing. Stress testing. Dry runs of the deployment. Monitoring dashboards set up. Runbooks written. On-call rotations scheduled. Everyone knew their role.
But there’s a difference between testing and production. Testing is controlled. Production is chaos. Real users doing unexpected things. Edge cases you never thought of. Load patterns that don’t match your predictions. Dependencies that fail in ways they never failed before.
We barely slept that week. Not because things were falling apart, they weren’t. But because we needed to watch everything. Every metric. Every error. Every anomaly. When you launch something this big, you don’t trust that it’s working. You verify, constantly.
The User Matching Chaos
The biggest issue we hit wasn’t technical infrastructure; it was user identity. We were running a unified backend for both Anghami and OSN+. Same user accounts, same authentication system. In theory, beautiful. In practice, messy.
OSN+ users from the legacy system needed to be matched to accounts in our own system. Email addresses and phone numbers were the primary matching keys, but the same person didn’t always have the same email in both systems. Some users had multiple accounts. Some had typos in their emails. Some legacy data was incomplete or corrupted.
When users opened the new OSN+ app and tried to log in, we needed to do all of the following (a simplified sketch of the matching step follows the list):
- Authenticate them
- Find their legacy OSN+ account
- Match it to their new account
- Migrate their watch history, preferences, subscriptions
- Do all of this seamlessly, in milliseconds, without the user noticing
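At its most basic, the matching step looked something like this: normalize the identifiers, try email first, then phone. Everything here (the types, the store interface, the normalization) is a simplified sketch; the real logic grew far hairier as edge cases surfaced.

```go
package identity

import (
	"context"
	"errors"
	"strings"
)

// Account and LegacyAccount are stripped down for this sketch; the real
// records carried subscriptions, bookmarks, preferences, and more.
type Account struct {
	ID    string
	Email string
	Phone string
}

type LegacyAccount struct {
	ID    string
	Email string
	Phone string
}

var ErrNoMatch = errors.New("no legacy account matched")

// Store is a placeholder for the lookups the matcher needs.
type Store interface {
	LegacyByEmail(ctx context.Context, email string) (LegacyAccount, error)
	LegacyByPhone(ctx context.Context, phone string) (LegacyAccount, error)
}

// MatchLegacy tries email first, then phone, after normalizing both.
func MatchLegacy(ctx context.Context, s Store, acc Account) (LegacyAccount, error) {
	if email := normalizeEmail(acc.Email); email != "" {
		if legacy, err := s.LegacyByEmail(ctx, email); err == nil {
			return legacy, nil
		}
	}
	if phone := normalizePhone(acc.Phone); phone != "" {
		if legacy, err := s.LegacyByPhone(ctx, phone); err == nil {
			return legacy, nil
		}
	}
	return LegacyAccount{}, ErrNoMatch
}

func normalizeEmail(s string) string { return strings.ToLower(strings.TrimSpace(s)) }

// normalizePhone keeps digits only; real normalization also handled country codes.
func normalizePhone(s string) string {
	var b strings.Builder
	for _, r := range s {
		if r >= '0' && r <= '9' {
			b.WriteRune(r)
		}
	}
	return b.String()
}
```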
It worked most of the time. But “most of the time” isn’t good enough when you have hundreds of thousands of users. The edge cases piled up:
- Users who couldn’t log in because their accounts didn’t match correctly
- Subscriptions that didn’t transfer properly
- Watch history that went missing
- Users who thought they had an account but the legacy data said otherwise
Support tickets flooded in. Our customer service team was overwhelmed. We were debugging individual user issues while trying to identify systemic patterns. Was this a one-off problem or a category of failures we needed to fix?
The backend held up. API response times were fine, servers weren’t crashing, databases weren’t melting. But the user experience was rough for the segment that hit edge cases. And when users are paying for a service and can’t access it, “we’re working on it” doesn’t feel like enough. We worked around the clock fixing issues. Writing scripts to correct mismatched accounts. Updating matching logic to handle new edge cases we discovered. Manually intervening for high-value users who couldn’t wait for automated fixes.
We Didn’t Go Down
The platform held. That’s the headline buried in all the chaos.
Despite the user matching issues, despite the edge cases, despite the support tickets: the backend never went down. API response times stayed fast. Video playback worked. Authentication worked (even when matching was messy, users could authenticate). The core infrastructure was solid.
This was the payoff of the architectural decisions we’d made:
- The modular monolith meant no distributed system failures
- The caching layers kept load manageable
- The contract-based design meant frontend and backend stayed in sync
- The focus on the new platform (abandoning legacy compatibility) meant we’d built something solid rather than something fragile
We had issues. Absolutely. Some users were frustrated. Support was slammed. We were exhausted. But the platform didn’t break. It bent under pressure but it held.
By the end of the first week, the major issues were resolved. Account matching was working reliably. Edge cases were handled. Support tickets were trending down. Most users who’d had problems were sorted out. New users coming in had smooth experiences.
The team was exhausted but proud. We’d launched something big, under pressure, on a tight deadline. It wasn’t perfect. But it worked. And that’s what mattered.
Chapter 10: After the Storm
What Came After
Launch week was hell, but we made it through. The weeks that followed showed us whether the foundation we’d built was solid.
The platform held up. As traffic grew and more users updated to the new apps, performance stayed consistent. The caching strategies worked. The modular architecture made it easy to identify and fix issues. The monitoring gave us confidence that problems would surface quickly, not silently degrade.
Subscriptions grew. Once the initial matching issues were resolved, the numbers looked good. The 41% growth in video subscribers between April and October wasn’t just about content; it was about a platform that worked reliably. Users could trust that when they opened the app, it would work.
Development velocity increased. This was the real win. With the new architecture in place, we could build features fast. Chromecast support shipped within a week of launch. 4K streaming with Dolby Vision and Dolby Atmos went live within a month. These weren’t small features; they were significant additions that would have taken much longer with the legacy system.
The team could move quickly because the foundations were solid. Clear contracts meant no guessing about API behavior. The modular monolith meant no coordination overhead between services. Good caching meant performance stayed fast even as we added features.
Team chemistry strengthened. Going through hell together does something to a team. We’d seen each other at 3 AM debugging production issues. We’d made hard decisions together. We’d bet on each other’s judgment under pressure. That builds trust you can’t get any other way.
The burnout was real. We’d pushed hard for four months straight and then immediately into a chaotic launch. But there was also pride. We’d built something significant. We’d made tough calls and they’d paid off. We’d shipped.
The unified backend vision materialized. With both Anghami and OSN+ running on the same infrastructure, we started seeing the benefits. Improvements to one platform automatically benefited the other. Engineers could work across both products without context switching between different codebases. The operational simplicity of one deployment and one monitoring system made the team more efficient.
We’re not done. There’s still technical debt to pay down. The grpc-gateway layer is still there (though we’re slowly migrating endpoints to use sebuf). There are features we cut for launch that need to be built. There are optimizations we want to make. But the foundation is solid. The platform works. The team is stronger. And we’re building on top of something we’re proud of.
Four months to rebuild a streaming platform from scratch. It shouldn’t have been possible. But with the right decisions (ironclad contracts, modular architecture, and a focus on the new platform over legacy compatibility), we made it work.
Sometimes the boring choices are the right choices. Sometimes saying no is more important than saying yes. Sometimes building less, but building it well, beats building everything halfway.
Chapter 11: Lessons Learned
Looking back at those four months, certain lessons stand out:
Contracts enable speed; they don’t slow it down. The upfront investment in defining Protobuf contracts felt like overhead at first. But it was the foundation that let everyone move in parallel without blocking each other. When you’re under time pressure, clear contracts aren’t bureaucracy; they’re infrastructure.
Boring technology wins under constraints. We chose Go because we knew it. We chose a monolith because it was simple. We resisted the pull of microservices and shiny new tools. Those boring choices let us focus on building features instead of fighting infrastructure. Sometimes the best technical decision is the one that doesn’t require learning something new.
Know when to cut losses. The March decision to abandon legacy compatibility was painful. We’d invested months of work. But finishing it would have compromised the new platform. Recognizing when to stop, even when it hurts, is as important as knowing when to push forward.
Small teams can move fast if you remove coordination overhead. We weren’t a large team, but we shipped something significant. The modular monolith, shared contracts, and unified codebase meant we didn’t waste time coordinating between services or teams. Every hour spent building was an hour spent building, not an hour spent in meetings figuring out who owns what.
Architecture decisions compound. Every choice we made (modular monolith, Protobuf contracts, a focused caching strategy) paid dividends later. Good architecture doesn’t just make the current work easier; it makes future work possible. Bad architecture (looking at you, gRPC; just kidding, it was just the wrong tool for the job) creates drag that slows everything down.
Not every mistake is fatal. We made wrong calls. gRPC was a mistake. Abandoning legacy compatibility was risky. But we adapted, found workarounds, and kept moving. Perfect decisions are a luxury. Good-enough decisions made quickly beat perfect decisions made too late.
Teams get stronger through adversity. The four-month sprint and the chaotic launch week forged something real. We trusted each other because we’d been through hell together. That trust became the foundation for everything we built afterward.
Closing Thoughts
When we started in January 2024, it felt impossible. Too much to build, not enough time, too many unknowns.
We made it work through focus. Focus on the new platform over legacy. Focus on contracts over ad-hoc coordination. Focus on simplicity over cleverness. Focus on shipping something we could be proud of rather than shipping everything half-finished. The result wasn’t perfect. We cut corners where we had to. We made compromises. We left technical debt that we’re still paying down. But we shipped a platform that worked, that users liked, that let the business grow.
My colleague Elie covered the frontend journey. This was the backend story. Different perspectives on the same wild ride.
If you’re facing a similar challenge (impossible timeline, small team, ambitious goals), here’s what I’d tell you:
Start with clear contracts. They’re the foundation everything else builds on.
Choose boring technology. Save your innovation budget for the problems that matter.
Build modular systems. Even if you’re shipping a monolith, structure it like it could be split apart someday.
Know what you’re optimizing for. We optimized for platform quality over transition smoothness. That shaped every decision.
Be willing to make hard calls. Some decisions won’t feel good. Make them anyway.
Trust your team. When everyone’s running at full speed, trust is what keeps you aligned.
We rebuilt OSN+ in four months. It was hell. It worked. And I’d do it again… maybe!
Thanks for reading. Now go build something. I’m going to bed.
…oh and if you liked what you read please click here to join the team!