The Problem
Every day, thousands of new songs appear on streaming platforms, and not all of them come from real artists. Behind the scenes, networks of fake labels repackage the same tracks under different names, quietly siphoning revenue and polluting recommendation systems. At small scale, it looks like sloppy cataloging. At 200 million songs, it’s industrialized fraud.
We started noticing patterns in our catalog analytics: labels uploading hundreds of albums with suspiciously similar song titles. Same words, different arrangements. “Gentle Rain” in one album, “Rain, Gentle” in another. Different song IDs, different albums, but manual spot-checks suggested the audio was identical or near-identical, just repackaged.
Running a music streaming platform means dealing with a catalog that grows faster than any human team can review, and this systematic fraud compounds into real economic damage across the entire infrastructure.
The Infrastructure Tax
Every uploaded song triggers a pipeline: transcode to 6-8 bitrate variants for different network conditions and devices, store across S3 hot/warm/cold tiers, index for search, generate recommendation features, update database tables. When 10% of your catalog is fraudulent duplication, that’s 20M tracks consuming infrastructure for zero user value. Conservative estimates put the direct costs at $100K+/year (S3 storage across hot/warm/cold tiers, CDN egress when fraudulent files are served, transcoding compute for 6-8 bitrate variants per duplicate, backup storage, database operations for catalog tables, search index updates, recommendation feature generation).
Beyond direct costs, metadata overhead scales linearly with catalog size: search indices, recommendation feature stores, and database operations all carry the weight of fraudulent entries. Maintaining acceptable query latency with a bloated catalog requires 15-20% infrastructure overprovisioning, pushing the cumulative waste to roughly $1M over three years.
Corrupting the User Experience
But the measurable infrastructure costs are just the beginning. The user experience damage is harder to quantify but more insidious. When search returns 40+ variants of the same song with algorithmically reordered words, users spend 30% more time filtering noise instead of discovering music, and internal metrics show 40% lower engagement when fraud pollutes their experience.
Worse, recommendation systems learn from bot-inflated play counts, systematically corrupting collaborative filtering signals and pushing genuine discoveries further down ranked lists. New users encounter fraudulent content during their critical first sessions, poisoning the sparse preference data that powers personalization.
The compounding failure mode: users never find music they’d love because fraud occupied those discovery surfaces, translating to churn and lost lifetime value that dwarfs the direct infrastructure costs.
The detection problem is non-trivial at scale. Fraudsters exploit the fact that metadata is cheap to fake: reordering words in titles costs nothing but creates “new” song IDs that bypass naive duplicate detection. Meanwhile, legitimate compilations, remixes, and classical recordings also show title repetition patterns, making false positives a relationship-destroying risk with real labels. Manual review doesn’t scale to 200M songs and 75K+ labels. Exact duplicate detection misses the word-reordering variants. Pairwise similarity comparisons explode into billions of operations. We needed automated detection that maintains precision while catching both individual bad actors and coordinated fraud networks operating across multiple fake label identities.
Example of what we were seeing:
Album 1: "Relaxing Rain Sounds"
- "Gentle Rain"
- "Heavy Rain"
- "Rain on Window"

Album 2: "Sleep Music Collection"
- "Rain, Gentle"
- "Rain Heavy"
- "Window Rain"
Same content, different packaging. And it was everywhere.
Detection Pipeline: Two-Layer Approach



Phase 1: Fingerprinting Titles
The approach: normalize titles to catch variations. A fingerprint is a canonical representation of a title: a normalized, alphabetically sorted list of its words, with punctuation, capitalization, and word order stripped away.
"Happy Music", "music, happy!", "MUSIC Happy" -> all become: "happy music"
Process: lowercase → remove punctuation → split words → sort alphabetically → rejoin.
This catches reordered titles like “Calm Relaxing Music” and “Relaxing Calm Music”; both become “calm music relaxing”. Simple, but effective.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Reconstructed helper (the original normalizeText is not shown):
// strip everything but Unicode letters, digits, and whitespace, then lowercase
def normalizeText(text: Column): Column =
  lower(regexp_replace(text, "[^\\p{L}\\p{N}\\s]", " "))

// Fingerprint: split into words, sort alphabetically, rejoin
def createFingerprint(text: Column): Column =
  array_join(array_sort(split(normalizeText(text), "\\s+")), " ")
We used Unicode character classes in the normalization regex (\p{L} for letters, \p{N} for digits) to handle Arabic, Chinese, Korean, and other scripts.
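The normalization pipeline can be sketched in a few lines of standalone Python (an illustrative stand-in for the Spark version, not the production code; Python 3’s \w is Unicode-aware, so non-Latin scripts survive):

```python
import re

def create_fingerprint(title: str) -> str:
    # Lowercase, replace punctuation with spaces (\w keeps Unicode
    # letters/digits; underscore is stripped separately), then
    # split, sort alphabetically, and rejoin
    cleaned = re.sub(r"[^\w\s]|_", " ", title.lower())
    return " ".join(sorted(cleaned.split()))

create_fingerprint("Gentle Rain")   # 'gentle rain'
create_fingerprint("Rain, Gentle")  # 'gentle rain'
```

Any reordering or repunctuation of the same words collapses to one canonical string, which is exactly what makes the reuse counting possible.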
Detection Thresholds
A label gets flagged when:
- Same fingerprint appears in 2+ albums (cross-album reuse)
- Label has 10+ reused fingerprints (systematic pattern)
- 30%+ of label’s songs are reused (high ratio)
- Only multi-word titles count (skip generic single words)
These thresholds took some tuning. Too low and you catch legitimate labels, too high and fraud slips through.
We excluded major distributors, whitelisted labels, and podcasts.
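Putting the four rules together, the per-label flagging logic might look like this (a simplified sketch; the function name `flag_label` and the `(album_id, fingerprint)` input shape are my own, not the production code):

```python
from collections import defaultdict

def flag_label(songs, min_albums=2, min_reused=10, min_ratio=0.3):
    """songs: list of (album_id, fingerprint) pairs for one label."""
    albums_per_fp = defaultdict(set)
    for album, fp in songs:
        if " " in fp:  # only multi-word titles count
            albums_per_fp[fp].add(album)
    # A fingerprint is "reused" when it appears in 2+ albums
    reused = {fp for fp, albums in albums_per_fp.items()
              if len(albums) >= min_albums}
    if len(reused) < min_reused:
        return False
    reused_songs = sum(1 for _, fp in songs if fp in reused)
    return reused_songs / len(songs) >= min_ratio

# Toy label: two fingerprints each reused across two albums
toy = [("A1", "gentle rain"), ("A2", "gentle rain"),
       ("A1", "heavy rain"), ("A2", "heavy rain"),
       ("A1", "sunset")]
flag_label(toy, min_reused=2)  # True: 2 reused fingerprints, 80% reuse ratio
```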
First Results
Label A: 1,400+ albums, 114K+ songs
Reused: 6,000+ fingerprints (100% reuse ratio)
Top titles:
- "cityscape serenity" (54 albums)
- "luminous metropolis" (53 albums)
- "breeze in the whispers" (52 albums)
100% reuse ratio means every fingerprint in this label’s catalog appeared across multiple albums: systematic cross-album duplication with zero unique titles, far outside the pattern of legitimate content distribution.
Phase 2: Finding Networks
Something strange in the results:
Label A: 1,423 albums, 6,089 fingerprints, 100% reuse
Label B: 1,315 albums, 4,425 fingerprints, 100% reuse
Label C:   812 albums, 4,425 fingerprints, 100% reuse
Label D: 6,845 albums, 2,855 fingerprints, 100% reuse
Label E: 2,726 albums, 2,717 fingerprints, 100% reuse
Label F: 2,120 albums, 4,425 fingerprints, 100% reuse
...
Multiple labels with exactly 4,425 fingerprints: that’s coordination, not coincidence.
Hypothesis: One master library of 4,425 song titles, distributed across multiple fake label identities.
To catch this, we needed to:
- Measure content overlap between label pairs (not just exact counts)
- Find transitive connections (A shares with B, B shares with C)
- Scale to 75K+ labels without exploding into O(n²) comparisons
Enter graph-based detection using Jaccard similarity:
Jaccard = |A ∩ B| / |A ∪ B|

Two labels with 4,425 and 4,430 fingerprints sharing 4,420:
Jaccard = 4,420 / (4,425 + 4,430 - 4,420) = 99.7%
Labels with 80%+ similarity get connected in the graph.
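The computation itself is a one-liner over fingerprint sets; here is the worked example above in illustrative Python:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two labels' fingerprint sets."""
    return len(a & b) / len(a | b)

# Two labels sharing 4,420 of their 4,425 and 4,430 fingerprints
label_a = set(range(4425))           # 4,425 fingerprints
label_b = set(range(5, 4435))        # 4,430 fingerprints, 4,420 shared
round(jaccard(label_a, label_b), 3)  # 0.997
```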
Phase 3: From Similarity to Networks
Exact matching missed variations (4,424 vs 4,425) and partial overlaps (95% shared content with different totals). We needed something smarter.



Reframed as a graph problem:
- Nodes: Labels
- Edges: Labels with 80%+ fingerprint overlap
- Goal: Find connected components (fraud networks)
Label A --[97%]-- Label B --[88%]-- Label C
|
[94%]
|
Label D --[93%]-- Label E
-> All 5 labels grouped into Network 1
(even though A and E don't directly share titles)
Used Apache Spark GraphX for distributed processing at 75K+ labels.
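The grouping GraphX performs can be illustrated with a tiny union-find in plain Python (a sketch, not the Spark code; the edge list below assumes the 94% edge in the diagram connects Label B to Label D):

```python
from collections import defaultdict

def connected_components(edges):
    # Union-find: transitively connected labels end up sharing one root
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]
connected_components(edges)  # one component containing all five labels
```

A and E never share a title directly, yet they land in the same component because the chain of 80%+ edges connects them.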
How It Works
- Build bipartite graph: Create edges between labels and fingerprints
- Find label pairs: Self-join to discover which labels share titles
- Calculate Jaccard: Measure overlap (shared / total unique)
- Run connected components: Group transitively connected labels
// Self-join on fingerprint to find every label pair sharing a title
// (the labelid ordering keeps each pair once and drops self-matches)
val labelPairs = labelFingerprintEdges.alias("l1")
  .join(labelFingerprintEdges.alias("l2"),
    col("l1.fingerprint") === col("l2.fingerprint") &&
      col("l1.labelid") < col("l2.labelid"), "inner")

// Calculate Jaccard and filter
// (sharedCounts aggregates labelPairs into shared_count, total1, total2)
val labelSimilarities = sharedCounts
  .withColumn("jaccard_similarity",
    col("shared_count") / (col("total1") + col("total2") - col("shared_count")))
  .filter(col("jaccard_similarity") >= 0.8)

// Build the label graph and group transitively connected labels
val graph = Graph(vertexRDD, edgeRDD)
val components = graph.connectedComponents()
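Outside Spark, the same pair-finding idea (invert by fingerprint so you only compare labels that actually share something, instead of all O(n²) pairs) can be sketched in Python; `similar_label_pairs` and the dict input shape are illustrative names, not the production API:

```python
from collections import defaultdict
from itertools import combinations

def similar_label_pairs(label_fps, threshold=0.8):
    """label_fps: dict mapping label -> set of fingerprints."""
    # Inverted index: fingerprint -> labels carrying it
    by_fp = defaultdict(set)
    for label, fps in label_fps.items():
        for fp in fps:
            by_fp[fp].add(label)
    # Count shared fingerprints only for labels that co-occur
    shared = defaultdict(int)
    for labels in by_fp.values():
        for a, b in combinations(sorted(labels), 2):
            shared[(a, b)] += 1
    # Jaccard = shared / (total1 + total2 - shared), thresholded at 80%
    edges = []
    for (a, b), s in shared.items():
        jacc = s / (len(label_fps[a]) + len(label_fps[b]) - s)
        if jacc >= threshold:
            edges.append((a, b, jacc))
    return edges

# Toy catalog: A and B share 9 fingerprints, C is unrelated
catalog = {"A": set(range(10)),
           "B": set(range(9)) | {99},
           "C": {100, 101}}
similar_label_pairs(catalog)  # one edge: ('A', 'B', ~0.82)
```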
Why graphs beat alternatives:
- Exact matching: Fast but misses 4,424 vs 4,425
- Pairwise: Catches all pairs, but O(n²) ≈ 2.8 billion comparisons at 75K labels
- Graph-based: Catches transitive connections, scales to 75K+ labels
Results from 200M+ Songs
Layer 1: Title Reuse Detection
75K+ labels flagged for content review (30%+ reuse, 10+ fingerprints, 2+ albums).
Key finding: Multiple labels with exactly 4,425 fingerprints, pointing to a shared master library.
Layer 2: Network Detection
Graph analysis found 24 coordinated networks (287 labels total):
Network 1 (Sleep/Meditation): 57 labels, 87.3% avg similarity, 1,006 connections
Network 2 (Wellness/Zen): 32 labels, 88.6% similarity
Network 3 (Arabic): 28 labels, 82.2% similarity (international ops)
Network 4 (Nature): 24 labels, 93.7% similarity (tightest coordination)
The 93.7% network? That’s likely one operator running 24 fake label fronts.
Implementation Notes
Architecture
- Spark DataFrames for fingerprinting → GraphX for network detection
Optimizations
- Early filtering: exclude major distributors, podcasts, single-word titles
- Strategic caching: only aggregated results, not raw 200M rows
- Broadcast joins for small dimension tables
- Partition by labelid for efficient grouping
Thresholds (tuned to minimize false positives)
- MIN_ALBUMS_PER_FINGERPRINT = 2 – cross-album reuse
- MIN_REUSED_FINGERPRINTS = 10 – systematic pattern
- MIN_REUSE_RATIO = 0.3 – 30%+ of catalog
- SIMILARITY_THRESHOLD = 0.8 – 80% Jaccard for networks
What We Found
Statistics:
- 200M+ songs processed
- 75K+ labels flagged (individual cleanup)
- 24 networks detected (287 labels showing coordination)
- Largest network: 57 labels (sleep/meditation genre)
- Tightest network: 24 labels at 93.7% similarity
Output: Actionable reports with label lists, similarity scores, most reused titles, and recommended actions.
False positives: Validated through sampling. Edge cases include sound effect libraries, classical labels (same compositions), and karaoke. Mitigated with genre-specific thresholds and manual whitelists.
Fraud Patterns
1. Algorithmic Title Generation: Systematic word reordering to fake uniqueness
"Tempestuous, Wet Evening" (336 albums)
"Wet, Tempestuous Evening"
→ Same fingerprint: "evening tempestuous wet"
2. Genre-Specific Farms: Networks targeting sleep/meditation/ASMR with keyword-optimized label names
3. Near-Perfect Duplication: 99%+ overlap across multiple labels = likely single operator with fake fronts
4. Multi-Language Ops: Arabic, Chinese, Korean networks showing international coordination
Next Steps
Audio fingerprinting: Catch same audio with different titles (Chromaprint/AudioHash integration)
Temporal analysis: Detect synchronized upload bursts across network members (bot-like behavior)
Behavioral graphs: Multi-dimensional edges combining content similarity, upload timing, play patterns, and geographic origin
Real-time detection: Streaming pipeline to flag suspicious labels within hours instead of daily batches
Lessons Learned
Technical:
- Start simple, iterate (fingerprinting → graphs)
- Distribute early (200M songs = Spark from day one)
- Many fraud problems are actually network problems
- Threshold tuning matters (80% vs 90% changes everything)
- Validate with sampling (prevent catastrophic false positives)
Fraud detection:
- If it’s too systematic to be human, it probably isn’t
- Networks hide in plain sight; the patterns only emerge in aggregate
- Exact matching misses real-world messiness
- Graph structure reveals coordination
- Combine signals: content + behavior + timing
Conclusion
The core insight: Fraud operates from playbooks. Word reordering, library size clustering, genre targeting: these patterns repeat because automation scales better than creativity. Once you recognize the playbook, detection scales too. Fingerprinting exposes individual actors. Graphs reveal coordination. Together, they transform fraud from invisible noise into measurable, actionable intelligence.
Results:
- 75K+ flagged labels for individual cleanup
- 24 coordinated networks (287 labels) that would’ve stayed invisible
What’s next:
- Audio fingerprinting – Chromaprint integration to catch identical audio with different metadata, closing the repackaging loophole entirely
- Behavioral graphs – Multi-dimensional edges combining upload timing, play patterns, and geographic signals to detect bot-coordinated operations
- Real-time detection – Streaming pipelines that flag suspicious labels within hours of upload, not days after batch analysis
The fraudsters will adapt. We’ll evolve faster. Round one complete.





