Multimodal Egocentric Analysis of Focused Interactions

Sophia Bano, Tamas Suveges, Jianguo Zhang, Stephen McKenna (Lead / Corresponding author)

Research output: Contribution to journal › Article

1 Citation (Scopus)
76 Downloads (Pure)

Abstract

Continuous detection of social interactions from wearable sensor data streams has a range of potential applications in domains including health and social care, security, and assistive technology. We contribute an annotated, multimodal dataset capturing such interactions using video, audio, GPS and inertial sensing. We present methods for automatic detection and temporal segmentation of focused interactions using support vector machines and recurrent neural networks with features extracted from both audio and video streams. Focused interaction occurs when co-present individuals, having mutual focus of attention, interact by first establishing face-to-face engagement and direct conversation. We describe an evaluation protocol including framewise, extended framewise and event-based measures and provide empirical evidence that fusion of visual face track scores with audio voice activity scores provides an effective combination. The methods, contributed dataset and protocol together provide a benchmark for future research on this problem. The dataset is available at https://doi.org/10.15132/10000134
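
The pipeline summarised in the abstract (per-frame audio-visual scores, score-level fusion, framewise classification, and temporal segmentation into interaction events) can be illustrated with a minimal sketch. This is not the authors' implementation: the two-score representation, the synthetic data, and the frames_to_events helper below are assumptions made purely for illustration; the paper and the linked dataset describe the actual features, models and evaluation protocol.

    # Minimal illustrative sketch (Python, assumed pipeline - not the paper's code):
    # fuse a per-frame visual face-track score with an audio voice-activity score,
    # classify frames with a linear SVM, then group positives into events.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n_frames = 2000

    # Synthetic ground truth: a few contiguous focused-interaction spans.
    labels = np.zeros(n_frames, dtype=int)
    for a, b in [(300, 700), (1200, 1600), (1700, 1900)]:
        labels[a:b] = 1

    # Synthetic per-frame scores in [0, 1] standing in for the real inputs:
    # column 0 ~ visual face-track score, column 1 ~ audio voice-activity score.
    face_score = np.clip(0.6 * labels + 0.3 * rng.random(n_frames), 0.0, 1.0)
    vad_score = np.clip(0.5 * labels + 0.4 * rng.random(n_frames), 0.0, 1.0)
    X = np.column_stack([face_score, vad_score])  # score-level (late) fusion

    # Framewise detection: linear SVM over the fused scores.
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(X[:1500], labels[:1500])
    pred = clf.predict(X[1500:])
    print("framewise accuracy:", (pred == labels[1500:]).mean())

    # Temporal segmentation: merge consecutive positive frames into events,
    # the unit scored by event-based evaluation measures.
    def frames_to_events(framewise):
        events, start = [], None
        for i, p in enumerate(framewise):
            if p == 1 and start is None:
                start = i
            elif p == 0 and start is not None:
                events.append((start, i - 1))
                start = None
        if start is not None:
            events.append((start, len(framewise) - 1))
        return events

    print("detected events (test segment):", frames_to_events(pred))
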
Original language: English
Pages (from-to): 37493-37505
Number of pages: 13
Journal: IEEE Access
Volume: 6
Early online date: 25 Jun 2018
DOIs: 10.1109/ACCESS.2018.2850284
Publication status: Published - 25 Jun 2018

Keywords

  • Social interaction
  • egocentric sensing
  • multimodal analysis
  • temporal segmentation

Cite this

@article{effb1146a135471aafaebf17a88ac9bd,
title = "Multimodal Egocentric Analysis of Focused Interactions",
abstract = "Continuous detection of social interactions from wearable sensor data streams has a range of potential applications in domains including health and social care, security, and assistive technology. We contribute an annotated, multimodal dataset capturing such interactions using video, audio, GPS and inertial sensing. We present methods for automatic detection and temporal segmentation of focused interactions using support vector machines and recurrent neural networks with features extracted from both audio and video streams. Focused interaction occurs when co-present individuals, having mutual focus of attention, interact by first establishing face-to-face engagement and direct conversation. We describe an evaluation protocol including framewise, extended framewise and event-based measures and provide empirical evidence that fusion of visual face track scores with audio voice activity scores provides an effective combination. The methods, contributed dataset and protocol together provide a benchmark for future research on this problem. The dataset is available at https://doi.org/10.15132/10000134",
keywords = "Social interaction, egocentric sensing, multimodal analysis, temporal segmentation",
author = "Sophia Bano and Tamas Suveges and Jianguo Zhang and Stephen McKenna",
note = "This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant EP/N014278/1: ACE-LP: Augmenting Communication using Environmental Data to drive Language Prediction.",
year = "2018",
month = "6",
day = "25",
doi = "10.1109/ACCESS.2018.2850284",
language = "English",
volume = "6",
pages = "37493--37505",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "Institute of Electrical and Electronics Engineers",
}

Multimodal Egocentric Analysis of Focused Interactions. / Bano, Sophia; Suveges, Tamas; Zhang, Jianguo; McKenna, Stephen (Lead / Corresponding author).

In: IEEE Access, Vol. 6, 25.06.2018, p. 37493-37505.
