
Detecting Music Label Fraud at Scale: A Graph-Based Approach

by Elias El Khoury
January 8, 2026
in Engineering

The Problem

Every day, thousands of new songs appear on streaming platforms, and not all of them are created by real artists. Behind the scenes, networks of fake labels repackage the same tracks under different names, quietly siphoning revenue and polluting recommendation systems. At small scale, it looks like sloppy cataloging. At 200 million songs, it’s industrialized fraud.

We started noticing patterns in our catalog analytics: labels uploading hundreds of albums with suspiciously similar song titles. Same words, different arrangements. “Gentle Rain” in one album, “Rain, Gentle” in another. Different song IDs, different albums, but manual spot-checks suggested the audio was identical or near-identical, just repackaged.

Running a music streaming platform means dealing with a catalog that grows faster than any human team can review, and this systematic fraud compounds into real economic damage across the entire infrastructure.

The Infrastructure Tax

Every uploaded song triggers a pipeline: transcode to 6-8 bitrate variants for different network conditions and devices, store across S3 hot/warm/cold tiers, index for search, generate recommendation features, update database tables. When 10% of your catalog is fraudulent duplication, that’s 20M tracks consuming infrastructure for zero user value. Conservative estimates put the direct costs at $100K+/year (S3 storage across hot/warm/cold tiers, CDN egress when fraudulent files are served, transcoding compute for 6-8 bitrate variants per duplicate, backup storage, database operations for catalog tables, search index updates, recommendation feature generation).

Beyond direct costs, metadata overhead scales linearly with catalog size: search indices, recommendation feature stores, and database operations all carry the weight of fraudulent entries. Maintaining acceptable query latency with a bloated catalog requires 15-20% infrastructure overprovisioning, pushing the cumulative waste to roughly $1M over three years.

Corrupting the User Experience

But the measurable infrastructure costs are just the beginning. The user experience damage is harder to quantify but more insidious. When search returns 40+ variants of the same song with algorithmically reordered words, users spend 30% more time filtering noise instead of discovering music, and internal metrics show 40% lower engagement when fraud pollutes the experience.

Worse, recommendation systems learn from bot-inflated play counts, systematically corrupting collaborative filtering signals and pushing genuine discoveries further down ranked lists. New users encounter fraudulent content during their critical first sessions, poisoning the sparse preference data that powers personalization.

The compounding failure mode: users never find music they’d love because fraud occupied those discovery surfaces, translating to churn and lost lifetime value that dwarfs the direct infrastructure costs.

The detection problem is non-trivial at scale. Fraudsters exploit the fact that metadata is cheap to fake: reordering words in titles costs nothing but creates “new” song IDs that bypass naive duplicate detection. Meanwhile, legitimate compilations, remixes, and classical recordings also show title repetition patterns, making false positives a relationship-destroying risk with real labels. Manual review doesn’t scale to 200M songs and 75K+ labels. Exact duplicate detection misses the word-reordering variants. Pairwise similarity comparisons explode to billions of operations. We needed automated detection that maintains precision while catching both individual bad actors and coordinated fraud networks operating across multiple fake label identities.

Example of what we were seeing:

Album 1: "Relaxing Rain Sounds"
  - "Gentle Rain"
  - "Heavy Rain"
  - "Rain on Window"

Album 2: "Sleep Music Collection"
  - "Rain, Gentle"
  - "Rain Heavy"
  - "Window Rain"

Same content, different packaging. And it was everywhere.

Detection Pipeline: Two-Layer Approach

Detection Pipeline Overview

Phase 1: Fingerprinting Titles

The approach: normalize titles to catch variations. A fingerprint is a canonical representation of a title: a normalized, alphabetically sorted token list that strips away punctuation, capitalization, and word order.

"Happy Music" -> "music, happy!" -> "MUSIC Happy" -> all become: "happy music"

Process: lowercase → remove punctuation → split words → sort alphabetically → rejoin.

This catches reordered titles like “Calm Relaxing Music” and “Relaxing Calm Music”: both become “calm music relaxing”. Simple, but effective.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array_join, array_sort, lower, regexp_replace, split, trim}

// Keep Unicode letters (\p{L}), numbers (\p{N}), and spaces; drop everything else
def normalizeText(text: Column): Column =
  trim(regexp_replace(lower(text), "[^\\p{L}\\p{N} ]", " "))

// Fingerprint: normalize, tokenize, sort alphabetically, rejoin
def createFingerprint(text: Column): Column =
  array_join(
    array_sort(
      split(normalizeText(text), "\\s+")
    ),
    " "
  )

We used Unicode regex classes (\p{L} for letters, \p{N} for numbers) in the normalization step to handle Arabic, Chinese, Korean, and other scripts.
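The same logic is easy to check outside Spark. A plain-Python sketch of the fingerprinting step (illustrative only, not the production code; the `fingerprint` helper is hypothetical):

```python
import re

def fingerprint(title: str) -> str:
    # Keep runs of Unicode letters/digits, lowercase, sort, rejoin
    tokens = re.findall(r"[^\W_]+", title.lower())
    return " ".join(sorted(tokens))

# "Happy Music", "music, happy!", and "MUSIC Happy"
# all collapse to the same fingerprint: "happy music"
```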

Detection Thresholds

A label gets flagged when:

  • Same fingerprint appears in 2+ albums (cross-album reuse)
  • Label has 10+ reused fingerprints (systematic pattern)
  • 30%+ of label’s songs are reused (high ratio)
  • Only multi-word titles count (skip generic single words)

These thresholds took some tuning. Too low and you catch legitimate labels, too high and fraud slips through.

We excluded major distributors, whitelisted labels, and podcasts.
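To make the interaction of these thresholds concrete, here is a plain-Python sketch over a simplified per-label summary (names and data shapes are hypothetical; the real pipeline computes these aggregates in Spark):

```python
def is_flagged(album_counts: dict, total_songs: int,
               min_albums: int = 2, min_reused: int = 10,
               min_ratio: float = 0.3) -> bool:
    """album_counts maps each fingerprint to the number of albums it appears in."""
    # Only multi-word fingerprints recurring across 2+ albums count as "reused"
    reused = [fp for fp, n in album_counts.items()
              if n >= min_albums and " " in fp]
    if len(reused) < min_reused:
        return False
    # Approximate reused-song count as one song per album per reused fingerprint
    reused_songs = sum(album_counts[fp] for fp in reused)
    return reused_songs / total_songs >= min_ratio

# A farm reusing 12 fingerprints across 5 albums each trips all three
# thresholds; a label with one reused title does not.
farm = {f"title number {i}": 5 for i in range(12)}
legit = {"greatest hits": 2, "live session": 1}
```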

First Results

Label A: 1,400+ albums, 114K+ songs
Reused: 6,000+ fingerprints (100% reuse ratio)

Top titles:
  - "cityscape serenity" (54 albums)
  - "luminous metropolis" (53 albums)
  - "breeze in the whispers" (52 albums)

100% reuse ratio means every fingerprint in this label’s catalog appeared across multiple albums-systematic cross-album duplication with zero unique titles, far outside the pattern of legitimate content distribution.

Phase 2: Finding Networks

Something strange in the results:

Label A: 1,423 albums, 6,089 fingerprints, 100% reuse
Label B: 1,315 albums, 4,425 fingerprints, 100% reuse
Label C: 812 albums,   4,425 fingerprints, 100% reuse
Label D: 6,845 albums, 2,855 fingerprints, 100% reuse
Label E: 2,726 albums, 2,717 fingerprints, 100% reuse
Label F: 2,120 albums, 4,425 fingerprints, 100% reuse
...

Multiple labels with exactly 4,425 fingerprints: coordination, not coincidence.

Hypothesis: One master library of 4,425 song titles, distributed across multiple fake label identities.

To catch this, we needed to:

  1. Measure content overlap between label pairs (not just exact counts)
  2. Find transitive connections (A shares with B, B shares with C)
  3. Scale to 75K+ labels without exploding into O(n²) comparisons

Enter graph-based detection using Jaccard similarity:

Jaccard = |A ∩ B| / |A ∪ B|

Two labels with 4,425 and 4,430 fingerprints sharing 4,420:
Jaccard = 4,420 / (4,425 + 4,430 - 4,420) = 99.7%

Labels with 80%+ similarity get connected in the graph.
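The similarity itself is a one-liner over fingerprint sets. A plain-Python sketch reproducing the worked example above (illustrative; production computes this in Spark):

```python
def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|, with the union expanded as |A| + |B| - |A ∩ B|
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

# Two labels with 4,425 and 4,430 fingerprints sharing 4,420 of them:
a = {f"fp{i}" for i in range(4425)}          # 4,425 fingerprints
b = {f"fp{i}" for i in range(5, 4435)}       # 4,430 fingerprints, 4,420 shared
```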

Phase 3: From Similarity to Networks

Exact matching missed variations (4,424 vs 4,425) and partial overlaps (95% shared content with different totals). We needed something smarter.

Network Detection Process

Reframed as a graph problem:

  • Nodes: Labels
  • Edges: Labels with 80%+ fingerprint overlap
  • Goal: Find connected components (fraud networks)
Label A --[97%]-- Label B --[88%]-- Label C
                    |
                  [94%]
                    |
                 Label D --[93%]-- Label E

-> All 5 labels grouped into Network 1
   (even though A and E don't directly share titles)

Used Apache Spark GraphX for distributed processing at 75K+ labels.

How It Works

  • Build bipartite graph: Create edges between labels and fingerprints
  • Find label pairs: Self-join to discover which labels share titles
  • Calculate Jaccard: Measure overlap (shared / total unique)
  • Run connected components: Group transitively connected labels
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.functions.{col, count}

// Self-join to find shared fingerprints; ordering labelids
// avoids self-pairs and duplicate A-B / B-A pairs
val labelPairs = labelFingerprintEdges.alias("l1")
  .join(labelFingerprintEdges.alias("l2"),
    col("l1.fingerprint") === col("l2.fingerprint") &&
    col("l1.labelid") < col("l2.labelid"), "inner")

// Count shared fingerprints per pair and attach each label's
// total fingerprint count (labelTotals: labelid -> total)
val sharedCounts = labelPairs
  .groupBy(col("l1.labelid").as("label1"), col("l2.labelid").as("label2"))
  .agg(count("*").as("shared_count"))
  .join(labelTotals.toDF("label1", "total1"), "label1")
  .join(labelTotals.toDF("label2", "total2"), "label2")

// Calculate Jaccard and filter
val labelSimilarities = sharedCounts
  .withColumn("jaccard_similarity",
    col("shared_count") / (col("total1") + col("total2") - col("shared_count")))
  .filter(col("jaccard_similarity") >= 0.8)

// Build graph from the filtered similarities and find networks
val graph = Graph(vertexRDD, edgeRDD)  // vertices = labels, edges from labelSimilarities
val components = graph.connectedComponents()

Why graphs beat alternatives:

  • Exact matching: Fast but misses 4,424 vs 4,425
  • Pairwise: Catches all pairs but O(n²) = 2.9 billion comparisons at 75K labels
  • Graph-based: Catches transitive connections, scales to 75K+ labels
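Connected components are what recover those transitive groups. A minimal union-find sketch in plain Python (GraphX performs the distributed equivalent; the labels mirror the diagram above):

```python
def connected_components(edges):
    """Group nodes into components given (a, b) similarity edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=len, reverse=True)

# A-B, B-C, B-D, D-E: all five labels land in one network,
# even though A and E share no titles directly
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]
```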

Results from 200M+ Songs

Layer 1: Title Reuse Detection

75K+ labels flagged for content review (30%+ reuse, 10+ fingerprints, 2+ albums).

Key finding: multiple labels with exactly 4,425 fingerprints, pointing to a shared master library.

Layer 2: Network Detection

Graph analysis found 24 coordinated networks (287 labels total):

Network 1 (Sleep/Meditation): 57 labels, 87.3% avg similarity, 1,006 connections
Network 2 (Wellness/Zen): 32 labels, 88.6% similarity
Network 3 (Arabic): 28 labels, 82.2% similarity (international ops)
Network 4 (Nature): 24 labels, 93.7% similarity (tightest coordination)

The 93.7% network? That’s likely one operator running 24 fake label fronts.

Implementation Notes

Architecture

  • Spark DataFrames for fingerprinting → GraphX for network detection

Optimizations

  • Early filtering: exclude major distributors, podcasts, single-word titles
  • Strategic caching: only aggregated results, not raw 200M rows
  • Broadcast joins for small dimension tables
  • Partition by labelid for efficient grouping

Thresholds (tuned to minimize false positives)

  • MIN_ALBUMS_PER_FINGERPRINT = 2 – Cross-album reuse
  • MIN_REUSED_FINGERPRINTS = 10 – Systematic pattern
  • MIN_REUSE_RATIO = 0.3 – 30%+ of catalog
  • SIMILARITY_THRESHOLD = 0.8 – 80% Jaccard for networks

What We Found

Statistics:

  • 200M+ songs processed
  • 75K+ labels flagged (individual cleanup)
  • 24 networks detected (287 labels showing coordination)
  • Largest network: 57 labels (sleep/meditation genre)
  • Tightest network: 24 labels at 93.7% similarity

Output: Actionable reports with label lists, similarity scores, most reused titles, and recommended actions.

False positives: Validated through sampling. Edge cases include sound effect libraries, classical labels (same compositions), and karaoke. Mitigated with genre-specific thresholds and manual whitelists.

Fraud Patterns

1. Algorithmic Title Generation: Systematic word reordering to fake uniqueness

"Tempestuous, Wet Evening" (336 albums)
"Wet, Tempestuous Evening"
→ Same fingerprint: "evening tempestuous wet"

2. Genre-Specific Farms: Networks targeting sleep/meditation/ASMR with keyword-optimized label names

3. Near-Perfect Duplication: 99%+ overlap across multiple labels = likely single operator with fake fronts

4. Multi-Language Ops: Arabic, Chinese, Korean networks showing international coordination

Next Steps

Audio fingerprinting: Catch same audio with different titles (Chromaprint/AudioHash integration)

Temporal analysis: Detect synchronized upload bursts across network members (bot-like behavior)

Behavioral graphs: Multi-dimensional edges combining content similarity, upload timing, play patterns, and geographic origin

Real-time detection: Streaming pipeline to flag suspicious labels within hours instead of daily batches

Lessons Learned

Technical:

  • Start simple, iterate (fingerprinting → graphs)
  • Distribute early (200M songs = Spark from day one)
  • Many fraud problems are actually network problems
  • Threshold tuning matters (80% vs 90% changes everything)
  • Validate with sampling (prevent catastrophic false positives)

Fraud detection:

  • If it’s too systematic to be human, it probably isn’t
  • Networks hide in plain sight; patterns emerge in aggregate
  • Exact matching misses real-world messiness
  • Graph structure reveals coordination
  • Combine signals: content + behavior + timing

Conclusion

The core insight: fraud operates from playbooks. Word reordering, library-size clustering, genre targeting: these patterns repeat because automation scales better than creativity. Once you recognize the playbook, detection scales too. Fingerprinting exposes individual actors. Graphs reveal coordination. Together, they transform fraud from invisible noise into measurable, actionable intelligence.

Results:

  • 75K+ flagged labels for individual cleanup
  • 24 coordinated networks (287 labels) that would’ve stayed invisible

What’s next:

  • Audio fingerprinting – Chromaprint integration to catch identical audio with different metadata, closing the repackaging loophole entirely
  • Behavioral graphs – Multi-dimensional edges combining upload timing, play patterns, and geographic signals to detect bot-coordinated operations
  • Real-time detection – Streaming pipelines that flag suspicious labels within hours of upload, not days after batch analysis

The fraudsters will adapt. We’ll evolve faster. Round one complete.

Tags: Apache Spark, Big Data, Content Moderation, Data Engineering, Fraud Detection, Graph Analytics, Machine Learning, Music Streaming, Platform Integrity, Scala
Elias El Khoury

VP Information & Content Systems @ Anghami & OSN+, joined Anghami in 2016
