How AI bullying detection actually works (and where it doesn't)

Under the hood of KidsHalo's safety engine: what we flag, how we avoid false alarms, and the cases where humans still win.

March 30, 2026 · 8 min read

What we're actually looking for

We score every message against a small set of risk axes: targeted insults, repeated negative attention from one sender, exclusion patterns, threats, and grooming signals. No single axis is reliable on its own. Combined, they are considerably more accurate than any one of them.
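As a rough illustration of why several weak signals can add up to a strong one, here is a minimal Python sketch that merges per-axis scores with a noisy-OR rule. The axis names, the 0-to-1 score scale, and the noisy-OR rule itself are assumptions made for this example, not a description of KidsHalo's actual engine.

```python
from math import prod

# Hypothetical axis names, mirroring the list in the text above.
AXES = (
    "targeted_insult",
    "repeated_negative_attention",
    "exclusion",
    "threat",
    "grooming",
)


def combine_axis_scores(scores: dict[str, float]) -> float:
    """Combine per-axis scores (each in [0, 1]) with a noisy-OR rule.

    The result is 1 minus the probability that every axis is "quiet",
    so two moderate signals produce a higher combined score than either alone.
    Missing axes count as 0.0.
    """
    unknown = set(scores) - set(AXES)
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    # Clamp each score into [0, 1] before combining.
    clamped = (min(max(scores.get(axis, 0.0), 0.0), 1.0) for axis in AXES)
    return 1.0 - prod(1.0 - s for s in clamped)
```

With this rule, two axes at 0.5 each combine to 0.75, while a message that scores 0 on every axis stays at 0.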

Why false positives hurt more than misses

A safety system that cries wolf gets disabled. We tune our thresholds toward fewer alerts that matter, not more alerts that don't. You'll get a few notifications a week, not a few per day.
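One way to picture tuning toward fewer alerts is to pick the threshold from an alert budget rather than the other way round. The sketch below is a hypothetical helper, not KidsHalo's tuning procedure: it assumes an alert fires when a score is strictly greater than the threshold, and it takes a list of past scores plus the maximum number of alerts you are willing to send for that period.

```python
def threshold_for_budget(scores: list[float], max_alerts: int) -> float:
    """Return a threshold such that at most `max_alerts` scores exceed it.

    Assumption: an alert fires only when score > threshold.
    """
    if max_alerts < 0:
        raise ValueError("max_alerts must be non-negative")
    ranked = sorted(scores, reverse=True)
    if max_alerts >= len(ranked):
        # Budget covers every message, so no cutoff is needed.
        return 0.0
    # The first score outside the budget becomes the cutoff. Because the
    # comparison is strict, it and everything below it stays silent.
    return ranked[max_alerts]


def count_alerts(scores: list[float], threshold: float) -> int:
    """Count how many scores would fire an alert at this threshold."""
    return sum(1 for s in scores if s > threshold)
```

For example, with one week of scores `[0.1, 0.9, 0.4, 0.8, 0.3, 0.95, 0.2]` and a budget of 3, the helper returns 0.4, and exactly three messages exceed it. Ties at the cutoff produce fewer alerts than the budget, never more.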

Where humans still beat the model

Sarcasm between close friends. Inside jokes that look hostile out of context. Coded slang that flips meaning every six months. The model still flags these, but it cannot tell them apart from real hostility, so we surface them as 'context check' alerts rather than crisis alerts and leave the decision to you.
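The two-tier idea can be sketched in a few lines of Python. Everything here is hypothetical: the `ambiguous` flag stands in for whatever signal marks a message as possibly sarcasm, an inside joke, or shifting slang, and the tier names simply echo the wording above.

```python
from enum import Enum


class AlertTier(Enum):
    NONE = "none"
    CONTEXT_CHECK = "context_check"
    CRISIS = "crisis"


def route_alert(risk: float, threshold: float, ambiguous: bool) -> AlertTier:
    """Pick an alert tier for a scored message.

    Assumptions: an alert fires only when risk > threshold, and `ambiguous`
    is supplied by some upstream check that this sketch does not implement.
    """
    if risk <= threshold:
        return AlertTier.NONE
    # Ambiguous flags go to the parent as a lower-urgency context check.
    return AlertTier.CONTEXT_CHECK if ambiguous else AlertTier.CRISIS
```

The point of the split is that an ambiguous flag is never dropped and never escalated automatically. It is handed to a person who knows the context.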

FAQ

Does KidsHalo read every message?

Our agent classifies messages on-device when possible. Only the risk score and a redacted snippet leave the device when a flag triggers. We never store full message history server-side.

Try KidsHalo free

AI safety alerts, screen time, content filtering, and live location in one calm dashboard. Free Forever plan, no credit card.
