CrazyGames uses ThriveAI to monitor product health across their gaming platform of 50 million monthly players. ThriveAI connects to their analytics and support tools, runs automated weekly health reports, detects anomalies using statistical significance testing, and correlates support ticket patterns with product metrics. The result: 100+ validated detections per month, a false-positive rate under 2%, and issues fixed within hours of detection rather than days. Their VP of Product, Jonas Boonen, describes the impact: "We don't chase problems anymore; we get to them first."
The scale problem nobody talks about
When your product has 50 million monthly players, the monitoring problem isn't that nothing's happening. It's that everything's happening, all at once.
CrazyGames hosts thousands of browser games. Players arrive from every corner of the internet, on every browser, every screen size, every connection speed. At any given moment, something is probably slightly broken for someone. The question isn't "is anything wrong?" It's "what's wrong enough to matter?"
Most product teams at this scale solve the problem by hiring. Add a data analyst. Add a PM focused on platform health. Add a QA lead who spends mornings scanning dashboards. Each person covers one slice, but nobody sees the full picture.
CrazyGames took a different approach.
What "monitoring" actually looked like before
Before ThriveAI, monitoring at CrazyGames looked like it does at most product teams:
The manual sweep. Someone opens the analytics tool, pulls last week's numbers, and compares them to the week before. Did engagement drop? Did any games break? Is the load time creeping up?
The support scan. Someone opens the support tool, reads through recent tickets, tries to spot patterns. Are players reporting the same bug? Is there a new category of complaint?
The gap. The analytics person and the support person are looking at different slices of the same reality. A 4% drop in session duration might be connected to a spike in "game won't load" tickets - but nobody connects those dots unless they happen to be in the same room at the same time.
At 50 million monthly players, the gap isn't just inconvenient. It's where real problems hide. A regression that affects 2% of players is still one million people.
For more on why this gap matters and how to think about closing it, see our guide on connecting support tickets to product metrics.
What CrazyGames set up
CrazyGames connected ThriveAI to two things: their analytics platform and their support tool. Setup took under 5 minutes.
Once connected, ThriveAI started doing three things automatically, with no manual triggers needed:
1. Weekly product health reports
Every Monday, a structured report lands in the team's Slack channel. It covers:
Player engagement metrics (weekly active users, session depth, stickiness) compared to the previous week and a 4-week baseline
Support ticket themes grouped into near-term, mid-term, and long-term windows, with percentage-point changes showing what's growing and what's shrinking
Statistical significance testing on every metric, so the team knows whether a change is real or noise
The report doesn't just list numbers. It connects them. If session duration drops and "loading" tickets spike the same week, both appear in the same report with the correlation flagged.
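ThriveAI's internals aren't public, but the underlying check is standard. As a rough sketch of what "significance testing on every metric" means in practice - with invented metric values and a plain Welch's t-test standing in for whatever test ThriveAI actually runs - it looks something like this:

```python
# Minimal sketch of a week-over-baseline significance check.
# An illustration of the general technique, not ThriveAI's actual
# implementation; function name, values, and alpha are invented.
from scipy import stats

def weekly_change_is_significant(current_week, baseline_weeks, alpha=0.05):
    """Compare this week's daily values against a pooled 4-week baseline.

    current_week   -- 7 daily values (e.g., avg session duration, minutes)
    baseline_weeks -- 28 daily values from the prior 4 weeks
    Returns (significant, p_value, pct_change).
    """
    baseline_mean = sum(baseline_weeks) / len(baseline_weeks)
    current_mean = sum(current_week) / len(current_week)
    # Welch's t-test: did this week differ from the baseline beyond noise?
    _, p_value = stats.ttest_ind(current_week, baseline_weeks, equal_var=False)
    pct_change = 100 * (current_mean - baseline_mean) / baseline_mean
    return p_value < alpha, p_value, pct_change

# Example: session duration dipped this week vs. the 4-week baseline
this_week = [11.2, 10.8, 10.9, 11.0, 10.7, 10.5, 10.6]
baseline = [11.8, 11.9, 12.1, 11.7] * 7  # stand-in for 28 daily values
sig, p, change = weekly_change_is_significant(this_week, baseline)
print(f"significant={sig} p={p:.4f} change={change:+.1f}%")
```

The point of the test is the distinction the report draws: a change only appears as "real" when it clears the noise floor, not just when the number moved.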
For a walkthrough of what these reports look like and how to set one up, see how to automate your weekly product health sweep.
2. Anomaly detection
ThriveAI runs continuous anomaly detection using two independent baselines: one based on what's normal for this weekday and hour (pooled over the past 2 weeks), and one based on the last 7 days. Both must agree that something changed, and both must pass statistical significance testing, before an alert fires.
This is what keeps the false-positive rate under 2%. A single baseline would flag every weekend traffic dip as a regression. Every holiday pattern. Every viral spike. Requiring two independent baselines to agree filters out almost all false alarms, so when an alert does fire, the team trusts it.
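To make the dual-baseline idea concrete, here's a minimal sketch of the agreement logic. This is not ThriveAI's actual code; the z-score threshold, window shapes, and traffic numbers are assumptions for illustration:

```python
# Sketch of the dual-baseline rule: an alert fires only when a metric
# deviates from BOTH (1) the same weekday-hour slot pooled over prior
# weeks and (2) the trailing 7 days. Threshold and data are invented.
import statistics

def is_anomaly(value, weekday_hour_history, trailing_7d_history, z_thresh=3.0):
    """Return True only if `value` is anomalous against both baselines."""
    def z_score(x, sample):
        mean = statistics.mean(sample)
        stdev = statistics.stdev(sample)
        return abs(x - mean) / stdev if stdev > 0 else 0.0

    z_weekly = z_score(value, weekday_hour_history)   # same hour, same weekday
    z_trailing = z_score(value, trailing_7d_history)  # last 7 days of values
    # Both baselines must agree: weekend dips look normal to the
    # weekday-hour baseline, and one-off spikes get absorbed by the
    # trailing window, so neither alone triggers an alert.
    return z_weekly > z_thresh and z_trailing > z_thresh

# Saturday 3pm traffic dip: low against the trailing week, but normal
# for Saturdays at 3pm, so no alert fires.
saturday_3pm_history = [4200, 4350, 4100]              # same slot, prior weeks
trailing_week = [6800, 7100, 6900, 7000, 6950, 4300, 4250]
print(is_anomaly(4180, saturday_3pm_history, trailing_week))  # False
```

The key property is the AND: either baseline alone over-fires, and requiring both is what keeps the alert rate low enough to trust.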
When a real anomaly is detected, it's classified by severity and impact:
Critical - significant regression, high impact (e.g., load time doubled, affecting 30%+ of sessions)
Warning - significant regression, moderate impact
Notable - significant but lower impact
Positive - improvement detected (also tracked, so the team knows what's working)
Each detection includes the specific metric, the magnitude of change, the affected segment (browser, device, region), and the statistical confidence level.
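As an illustration of how a validated detection might map to one of these tiers, here's a hypothetical classifier. ThriveAI's exact thresholds aren't published; the function name and cutoffs below are invented:

```python
# Hypothetical mapping from a validated detection to a severity tier.
# Cutoffs are invented to illustrate the classification described above.
def classify_detection(magnitude_pct, affected_share, is_regression, significant):
    """Classify a detected change.

    magnitude_pct  -- size of the change in percent (always positive)
    affected_share -- fraction of sessions in the affected segment (0..1)
    is_regression  -- True if the metric moved in the bad direction
    significant    -- True if the change passed significance testing
    """
    if not significant:
        return None            # noise: no detection is emitted at all
    if not is_regression:
        return "Positive"      # improvements are tracked too
    if magnitude_pct >= 50 and affected_share >= 0.30:
        return "Critical"      # e.g., load time doubled for 30%+ of sessions
    if affected_share >= 0.10:
        return "Warning"
    return "Notable"

# Load time up 100% for 35% of sessions -> Critical
print(classify_detection(100, 0.35, is_regression=True, significant=True))
```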
For a deeper look at how anomaly detection works without SQL, see the PM's guide to anomaly detection.
3. Ad-hoc investigation
Beyond automated monitoring, the CrazyGames team can tag ThriveAI directly in Slack to ask questions:
"What's causing the drop in session depth this week?"
"Are players on Safari having a different experience than Chrome?"
"Show me all support tickets related to payment issues in the last 30 days, grouped by theme"
These aren't canned reports. ThriveAI pulls from both analytics and support data to answer the specific question, cross-referencing where relevant.
What 100+ detections per month looks like
The number itself, 100+ validated detections per month, might sound abstract. Here's what it means in practice.
At CrazyGames' scale, with thousands of games and 50 million monthly players, a typical month includes:
Deploy regressions. A code deploy introduces a bug that increases error rates on certain pages. ThriveAI catches it within the detection window (typically the next anomaly scan after the regression appears in data), before the support queue fills up. The team rolls back or patches.
Across ThriveAI clients, we've seen this pattern - a dashboard exception rate climbing from 6% to 17.1% after a deploy, caught within hours by comparing the new build's metrics against the prior build's baseline. The fix ships before most users notice.
Browser-specific issues. A browser update breaks something in the game loading flow for a specific browser-OS combination. Because ThriveAI segments detections by platform, the alert pinpoints the exact combination, not just "load times are up."
Slow-burn trends. A support theme that was 3% of tickets last month is now 8%. It hasn't triggered any single-day anomaly, but the biweekly support pulse catches the trend shift. The team investigates before it becomes 15%. (A sketch of this kind of share-shift check appears below.)
False positives filtered out. For every real issue caught, the dual-baseline system quietly passes on dozens of fluctuations that would have triggered alerts in a simpler system. A weekend traffic dip that looks like a regression against a 7-day average? Filtered out because the weekday-hour baseline shows it's normal. A one-time spike from a viral game? Filtered out because the trailing average normalizes it within hours.
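For the slow-burn case, the standard tool is a two-proportion test on ticket counts between two windows. Here's a sketch with made-up ticket volumes matching the 3% to 8% example - again an illustration of the technique, not ThriveAI's pipeline:

```python
# Illustrative check for a slow-burn theme shift: a two-proportion
# z-test on support ticket counts between two windows. Counts and
# window sizes are invented for the example.
from math import sqrt
from scipy.stats import norm

def theme_share_shift(theme_then, total_then, theme_now, total_now):
    """Test whether a support theme's share of all tickets changed."""
    p1 = theme_then / total_then
    p2 = theme_now / total_now
    pooled = (theme_then + theme_now) / (total_then + total_now)
    se = sqrt(pooled * (1 - pooled) * (1 / total_then + 1 / total_now))
    z = (p2 - p1) / se
    p_value = 2 * norm.sf(abs(z))       # two-sided
    return p1, p2, p_value

# A theme at 3% of 2,000 tickets last month, 8% of 2,100 this month
p1, p2, p = theme_share_shift(60, 2000, 168, 2100)
print(f"{p1:.1%} -> {p2:.1%}, p={p:.2g}")  # highly significant shift
```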
The under-2% false-positive rate means the team trusts the alerts. When ThriveAI says something's wrong, it almost always is. That trust is what makes the system work at scale. If the team started ignoring alerts because half were false, the whole system would collapse.
The result: getting to problems first
The shift Jonas Boonen describes - "We don't chase problems anymore; we get to them first" - is what happens when monitoring moves from reactive to proactive.
Before: The team finds out about issues when players complain, when someone happens to check the right dashboard at the right time, or during the weekly sync when someone mentions something "looked off."
After: Issues surface automatically. The team's first interaction with a problem is reading a Slack message that says what happened, how severe it is, and what segment is affected. The conversation starts at "what do we do about this?" instead of "is this even a real problem?"
At CrazyGames' scale, the math is straightforward:
Without monitoring: A 2% regression affects 1 million players. Average time to detection: days (or weeks if it's gradual). Players churn, NPS drops, support queue grows.
With automated monitoring: The same regression gets flagged within hours. Fix deployed before most players even notice. Support queue stays clean.
The "without extra headcount" part matters because the alternative is real. At 50 million monthly players, dedicated monitoring staff would cost $150K-250K per year (analyst + PM time). ThriveAI runs at $10/hour, billing only for active analysis time.
For a detailed comparison of these economics, see ThriveAI vs. hiring a PM: $10/hr vs. $200K/yr.
What makes this work (and what doesn't)
Not every product team would get the same results CrazyGames does. Here's what made ThriveAI effective for their situation:
What works well:
High traffic volumes. At 50M monthly players, there's enough statistical power to detect even small regressions with confidence. If your product has 500 weekly active users, anomaly detection won't have enough data to be useful.
Existing analytics + support tooling. CrazyGames already had analytics and a support tool in place. ThriveAI connects to what's there. It doesn't replace your analytics platform.
Slack-native team. The team already lived in Slack. Reports and alerts arrive where the conversation happens, not in a separate dashboard nobody checks.
What it won't do:
Fix issues for you. ThriveAI detects and diagnoses. The engineering team still needs to deploy the fix.
Work without data. If your analytics instrumentation has gaps (missing events, broken properties), ThriveAI will tell you what's missing, but it can't analyze data that doesn't exist.
Replace product judgment. The detection tells you what changed. The PM still decides whether it matters and what to do about it.
Getting started
If you're running a product at scale and spending too much time on manual monitoring, here's how to start:
Connect your analytics. PostHog, Mixpanel, or Amplitude - whichever you use. It takes a couple of minutes.
Connect your support tool. Zendesk or Intercom.
Tell ThriveAI what to watch. Which products, which metrics, which Slack channel. A few minutes of conversation.
Wait for Monday. Your first automated report arrives on the first Monday after setup.
Your first 2 weeks are free. At $10/hour after that, billing only for active analysis time, you'll know quickly whether automated monitoring is worth it for your product.
For a broader overview of what product health monitoring is and how to think about it, start with our definitive guide: What is product health monitoring?
Stop chasing problems. Get to them first.
CrazyGames serves 50 million monthly players and catches issues before players notice. Your first 2 weeks with ThriveAI are free. Connect your analytics and support tools, and see your first automated report next Monday.
FAQ
How long does it take to set up ThriveAI?
Under 5 minutes. Connect your analytics platform (PostHog, Mixpanel, or Amplitude), your support tool (Zendesk or Intercom), and configure what to monitor. Your first automated report arrives the following Monday.
What does ThriveAI cost?
$10/hour, with the first 2 weeks free. ThriveAI only bills for active analysis time. If it spends 5 minutes generating your weekly report, you're billed for 5 minutes. At CrazyGames' scale, this replaces $150K-250K/year in analyst and PM time.
What's the false-positive rate?
Under 2% at CrazyGames. ThriveAI uses dual baselines (weekday-hour pooled + 7-day trailing) with statistical significance testing. Both baselines must agree before flagging an issue, which filters out most false alarms.
Does ThriveAI work for smaller products?
ThriveAI works for any product with analytics and support data. However, anomaly detection is most effective with higher traffic volumes. With 500 weekly active users, there may not be enough data for statistical significance on smaller metrics. Weekly health reports and support analysis work at any scale.
What analytics platforms does ThriveAI support?
PostHog, Mixpanel, and Amplitude. For support data, ThriveAI connects to Zendesk and Intercom.
