How AI bullying detection actually works (and where it doesn't)
Under the hood of KidsHalo's safety engine: what we flag, how we avoid false alarms, and the cases where humans still win.
March 30, 2026 · 8 min read
What we're actually looking for
We score every message against a small set of risk axes: targeted insults, repeated negative attention from one sender, exclusion patterns, threats, and grooming signals. No single axis is reliable on its own, because each one misfires on ordinary conversation. Combined, they are far more accurate than any one of them alone.
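One way to picture how weak per-axis signals could add up is a noisy-OR combination. The sketch below is illustrative only: the axis names come from the list above, but the weights, the function name combine_axis_scores, and the noisy-OR rule itself are assumptions, not a description of KidsHalo's actual engine.

```python
from math import prod

# Axis names follow the article; the weights are made-up values for illustration.
AXIS_WEIGHTS = {
    "targeted_insult": 0.6,
    "repeated_negative_attention": 0.5,
    "exclusion": 0.4,
    "threat": 0.9,
    "grooming": 0.9,
}


def combine_axis_scores(scores: dict) -> float:
    """Noisy-OR: each axis independently 'fails to fire' with probability
    1 - weight * score; the combined risk is 1 minus the product of those."""
    return 1.0 - prod(
        # Clamp each score to [0, 1]; missing axes count as 0.
        1.0 - AXIS_WEIGHTS[axis] * min(max(scores.get(axis, 0.0), 0.0), 1.0)
        for axis in AXIS_WEIGHTS
    )
```

Under these assumed weights, a message scoring 0.5 on both targeted insults and repeated negative attention combines to about 0.475, higher than either weighted signal alone (0.30 and 0.25).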
Why false positives hurt more than misses
A safety system that cries wolf gets disabled. We tune our thresholds toward fewer alerts that matter, not more alerts that don't. You'll get a few notifications a week, not a few per day.
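Tuning toward fewer, better alerts usually means choosing a threshold against a precision target. The following is a minimal sketch of that idea, assuming a hand-labeled validation set of (score, is_harmful) pairs; the function name pick_threshold and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    scored: List[Tuple[float, bool]], min_precision: float
) -> Optional[float]:
    """Return the lowest threshold whose alerts still meet the precision
    target, or None if no threshold does."""
    best = None
    # Walk candidate thresholds from strictest to loosest.
    for threshold in sorted({score for score, _ in scored}, reverse=True):
        flagged = [harmful for score, harmful in scored if score >= threshold]
        precision = sum(flagged) / len(flagged)  # share of alerts that were real
        if precision >= min_precision:
            best = threshold
    return best


# Hypothetical validation data: (risk score, was it actually harmful?)
validation = [(0.95, True), (0.9, True), (0.8, False),
              (0.7, True), (0.4, False), (0.2, False)]
threshold = pick_threshold(validation, min_precision=0.75)
```

With this sample data the chosen threshold is 0.7: three of the four messages at or above it are genuinely harmful. Raising the precision target pushes the threshold up and the alert count down, which is the trade-off described above.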
Where humans still beat the model
Sarcasm between close friends. Inside jokes that look hostile out of context. Coded slang that flips meaning every six months. The model cannot reliably tell these apart from real hostility, so when it flags them we surface them as 'context check' alerts, not crisis alerts, and you decide.
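The split between 'context check' and crisis alerts can be sketched as a simple routing step. Everything here is an assumption made for illustration: the Flag fields, the two ambiguity signals, and the 0.7 threshold are hypothetical, and a real system would need to stop serious threats from being downgraded just because the sender is a friend.

```python
from dataclasses import dataclass
from enum import Enum


class AlertKind(Enum):
    NONE = "none"
    CONTEXT_CHECK = "context_check"
    CRISIS = "crisis"


@dataclass
class Flag:
    risk_score: float              # combined score in [0, 1]
    sender_is_close_friend: bool   # hypothetical ambiguity signal
    uses_unrecognized_slang: bool  # hypothetical ambiguity signal


def route(flag: Flag, alert_threshold: float = 0.7) -> AlertKind:
    """Send ambiguous flags to a human-reviewed 'context check' tier."""
    if flag.risk_score < alert_threshold:
        return AlertKind.NONE
    if flag.sender_is_close_friend or flag.uses_unrecognized_slang:
        # The model cannot judge tone here, so a person decides.
        return AlertKind.CONTEXT_CHECK
    return AlertKind.CRISIS
```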
FAQ
Does KidsHalo read every message?
Every message is scored, but our agent classifies messages on-device when possible. When a flag triggers, only the risk score and a redacted snippet leave the device. We never store full message history server-side.
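As a rough illustration of "only the risk score and a redacted snippet leave the device", the sketch below builds the outgoing payload on-device. The redaction rules (emails and phone numbers), the 80-character cap, the 0.7 threshold, and the names redact and build_payload are all assumptions; the article does not specify what KidsHalo redacts.

```python
import re
from typing import Optional

# Simple illustrative patterns; real redaction would need to cover far more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def redact(text: str, max_len: int = 80) -> str:
    """Mask emails and phone numbers, then truncate to a short snippet."""
    text = EMAIL.sub("[email]", text)
    text = PHONE.sub("[phone]", text)
    return text[:max_len]


def build_payload(
    message: str, risk_score: float, threshold: float = 0.7
) -> Optional[dict]:
    """Return what would be sent to the server, or None if nothing is sent."""
    if risk_score < threshold:
        return None  # below the flag threshold: nothing leaves the device
    return {"risk_score": round(risk_score, 2), "snippet": redact(message)}
```

In this sketch a message that never crosses the threshold produces no payload at all, and a flagged one is reduced to two fields before anything is transmitted.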
Try KidsHalo free
AI safety alerts, screen time, content filtering, and live location in one calm dashboard. Free Forever plan, no credit card.
