Case Study

sleep² in the Great Sleep Tracker Comparison – Accuracy on Par with Polysomnography


Category

Insurance

Scope

8 sleep trackers, 5 nights

Participants

18 people (home PSG)

How accurate are sleep trackers really?

Sleep trackers and wearables are now indispensable in everyday life and research. But how reliable are the sleep data from Oura, Apple Watch, Fitbit, and others? A recent study (Topalidis et al., 2025) systematically investigated this question and tested eight widely used consumer sleep trackers (CSTs) under real-world conditions against polysomnography (PSG), the EEG-based gold standard in sleep medicine. For sleep², the result is clear: no other tested method classified sleep stages as accurately.

The study design: five nights, home PSG, targeted stress tests

Eighteen participants completed five consecutive nights (Monday to Friday) with ambulatory home polysomnography while simultaneously wearing two identical devices of each tracker. To test the algorithms under challenging conditions, the protocol included targeted sleep manipulations, such as shortened and extended sleep. The evaluation followed a standardized framework of epoch-by-epoch analysis (30-second segments) and a discrepancy analysis of key sleep parameters.

The following systems were tested: sleep² (with Polar Verity Sense and Polar H10), Oura Ring 3, Apple Watch Series 9, Fitbit Charge 6, Garmin Vivoactive 6 and Venu 3, WHOOP 4, and Circul+.

The results at a glance

The study measured agreement with PSG across all sleep stages, expressed as accuracy and as Cohen's κ, which corrects for agreement expected by chance.

 

Device                            Accuracy   Cohen's κ
sleep² (Polar H10)                84.0 %     0.76
sleep² (Polar Verity Sense)       83.7 %     0.76
Oura Ring 3                       72.5 %     0.59
Apple Watch Series 9              72.3 %     0.56
Fitbit Charge 6                   66.2 %     0.47
WHOOP 4                           65.2 %     0.48
Garmin Vivoactive 6 and Venu 3    63.4 %     0.41
Circul+                           55.6 %     0.33

 

Epoch-by-epoch accuracy and Cohen's κ compared to polysomnography. Note: the maximum achievable accuracy is around 88%, which corresponds to the inter-rater reliability of PSG scoring. Source: Topalidis et al. (2025).
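To make the two metrics concrete, here is a minimal sketch of how epoch-by-epoch accuracy and Cohen's κ can be computed from two sequences of stage labels, one from PSG and one from a tracker. The four stage names and the short example hypnograms are invented for illustration and are not taken from the study, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of epochs in which the tracker label equals the PSG label."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def cohens_kappa(reference, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(reference)
    p_observed = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement: probability that both label the same stage by accident,
    # given how often each of them uses that stage overall.
    p_chance = sum(ref_counts[s] * pred_counts[s] for s in ref_counts) / (n * n)
    if p_chance == 1.0:
        return float("nan")  # undefined when both use a single identical label
    return (p_observed - p_chance) / (1.0 - p_chance)


# Invented example: ten 30-second epochs (assumed labels, not study data).
psg = ["wake", "wake", "light", "light", "light",
       "deep", "deep", "rem", "rem", "wake"]
tracker = ["wake", "light", "light", "light", "light",
           "light", "deep", "rem", "light", "light"]

acc = accuracy(psg, tracker)
kappa = cohens_kappa(psg, tracker)
print(f"accuracy = {acc:.2f}, kappa = {kappa:.2f}")
```

In this toy case the tracker labels most epochs as light sleep, so part of its 60% accuracy would be expected by chance alone. κ removes that share, which is why it is lower than the raw accuracy.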

What the numbers mean

With a Cohen's κ of 0.76, sleep² achieves substantial agreement with PSG and is the only system in the test field to reach this level. In contrast, most wrist-worn trackers overestimated total sleep time and substantially underestimated wake after sleep onset (WASO). This effect was particularly evident on atypical nights with fragmented, shortened, or extended sleep, which is precisely when accurate data matters most.

The cardiac-based sleep² measurements using arm and chest bands showed only minor deviations from PSG and remained stable even on challenging nights. The Oura Ring 3 and Apple Watch Series 9 achieved moderately good accuracy but varied considerably from night to night.
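The overestimation of total sleep time and underestimation of WASO described above can be illustrated with a small discrepancy calculation. This is a sketch under stated assumptions: WASO is taken here as wake time between the first and last sleep epoch, which is one common definition but may differ from the study's exact one. The hypnograms are invented, and the function names are hypothetical.

```python
EPOCH_MINUTES = 0.5  # one epoch = 30 seconds, as in the study's analysis


def sleep_parameters(hypnogram):
    """Return total sleep time (TST) and WASO in minutes for one night."""
    sleep_idx = [i for i, stage in enumerate(hypnogram) if stage != "wake"]
    if not sleep_idx:
        return {"tst_min": 0.0, "waso_min": 0.0}
    first, last = sleep_idx[0], sleep_idx[-1]
    # Assumed definition: wake epochs after sleep onset, before final awakening.
    wake_epochs = sum(1 for stage in hypnogram[first:last + 1] if stage == "wake")
    return {
        "tst_min": len(sleep_idx) * EPOCH_MINUTES,
        "waso_min": wake_epochs * EPOCH_MINUTES,
    }


def discrepancy(psg, tracker):
    """Tracker minus PSG for each parameter; positive means overestimation."""
    ref = sleep_parameters(psg)
    est = sleep_parameters(tracker)
    return {key: est[key] - ref[key] for key in ref}


# Invented example: the tracker misses a 2-minute awakening in the night.
psg = (["wake"] * 2 + ["light"] * 4 + ["wake"] * 4
       + ["deep"] * 6 + ["rem"] * 2 + ["wake"] * 2)
tracker = (["wake"] * 2 + ["light"] * 8
           + ["deep"] * 6 + ["rem"] * 2 + ["wake"] * 2)

diff = discrepancy(psg, tracker)
print(diff)
```

Every wake epoch scored as sleep shifts both parameters at once: TST goes up and WASO goes down by the same amount, which is the pattern reported for most wrist-worn trackers.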

 

Why sleep² measures so accurately

The difference lies in the sensors and the deep-learning AI method. While many wearables derive their sleep stages from motion data, sleep² uses measurements near the heart and measures the heartbeat with millisecond precision via inter-beat intervals (IBI). This signal precisely depicts the nocturnal regulation of the autonomic nervous system and makes the measurement robust – even when sleep is restless or the bed partner moves next to you.
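For readers unfamiliar with the term, the following sketch shows what inter-beat intervals are and how a simple variability measure can be derived from them. It is not the sleep² algorithm, whose deep-learning model is not described here. The beat timestamps are invented, the function names are hypothetical, and RMSSD is used only as a well-known example of a heart rate variability measure.

```python
import math


def inter_beat_intervals(beat_times_ms):
    """Time between consecutive heartbeats, in milliseconds."""
    return [later - earlier
            for earlier, later in zip(beat_times_ms, beat_times_ms[1:])]


def mean_heart_rate_bpm(ibis_ms):
    """Average heart rate implied by the intervals (60,000 ms per minute)."""
    return 60000.0 / (sum(ibis_ms) / len(ibis_ms))


def rmssd(ibis_ms):
    """Root mean square of successive IBI differences, a standard HRV measure."""
    diffs = [b - a for a, b in zip(ibis_ms, ibis_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Invented beat timestamps in milliseconds (assumed, not sensor data).
beats = [0, 1000, 1980, 3000, 4010]

ibis = inter_beat_intervals(beats)
print("IBIs (ms):", ibis)
print(f"mean heart rate: {mean_heart_rate_bpm(ibis):.1f} bpm")
print(f"RMSSD: {rmssd(ibis):.1f} ms")
```

The beat-to-beat differences in this example are only tens of milliseconds, which illustrates why millisecond timing precision matters for an IBI-based method.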

Recommendation

  • To reliably capture sleep, one should rely on validated methods tested against PSG. Accuracy relative to the gold standard is the critical benchmark.
  • Interpret single-night values from wrist trackers with caution, especially during restless or unusually short or long sleep.
  • For research and care, a standardized, IBI-based method with near-heart sensors like sleep² is recommended.

 

Sources:

Topalidis, P., Kogler, L., Mitterer, C., Hinterberger, A., Baron, S., Schabus, M., & ter Horst, R. (2025). Beyond the Hype? A Standardised Real-World Evaluation of Consumer Sleep Trackers (CST) in Extracting Sleep. PsyArXiv. https://doi.org/10.31234/osf.io/27wun_v1
