Abstract

In recent years, the surge in wearable devices has amplified the need for advanced computer vision techniques to analyse video gathered by cameras efficiently. This thesis investigates machine learning methods for extracting information from video data acquired by wearable cameras. More specifically, we leverage contextual information from egocentric video data to support language prediction. Our motivating application is a voice output communication aid that provides speech predictions for users with communication disabilities when supplied with relevant contextual information.
This thesis considers two sources of contextual information: (1) face recognition, which can identify conversational partners, and (2) personal map building, which supports localisation. We define a deterministic modification of the softmax calculation, cam-softmax, which alters the target neuron's activation when the model is trained for classification. We motivate this change by carefully analysing the problem and highlighting a potential flaw in the face recognition component of current training pipelines. As a second source of contextual information, we consider map building and localisation. While a similar problem is well researched in robotics, person-specific semantic localisation and mapping has not previously been treated as an online unsupervised clustering problem in the computer vision community.
Unlike traditional classification tasks, face recognition under the open-set protocol anticipates unseen subjects at test time, yet this assumption is not upheld during training. We present a geometric interpretation of cam-softmax and show that it incorporates the open-set assumption during training, leading to face representations better suited to the task. Our solution can be viewed as part of a family of methods whose relationships we make clear by defining a generic framework. We show that our method's performance is comparable to state-of-the-art solutions on pair-matching benchmarks and report a performance gain in an open-set setting. Furthermore, we discuss the method's ability to counter label noise in the training set and test this capability on a toy example. We conclude that our solution performs well in the presence of high labelling noise compared to existing modifications of the softmax activation.
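The abstract does not give the cam-softmax formula, so the following is only a minimal sketch of the general idea it describes: altering the target neuron's activation inside a softmax cross-entropy loss during training. The `alpha` scaling used here is a hypothetical placeholder transform, not the actual cam-softmax modification.

```python
import numpy as np

def modified_softmax_loss(logits, target, alpha=1.0):
    """Cross-entropy loss with an altered target activation.

    Illustrative sketch only: cam-softmax alters the target neuron's
    activation before normalisation; scaling the target logit by
    `alpha` (an assumption, not the thesis's transform) stands in
    for that modification.
    """
    z = np.asarray(logits, dtype=float).copy()
    z[target] = alpha * z[target]   # placeholder alteration of the target activation
    z -= z.max()                    # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(probs[target])
```

With `alpha=1.0` this reduces to the standard softmax cross-entropy; other values change only the target neuron's contribution, leaving non-target activations untouched.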
We consider unsupervised learning of semantic, user-specific maps from first-person video. We address the task as a semantic, non-geometric form of simultaneous localisation and mapping (SLAM), differing in significant ways from formulations typical in robotics. We create a generic and abstract framework, egomap, in which locations are termed places. Our maps are modelled as a hierarchy of probabilistic place graphs and view graphs; view graphs capture an aspect of user behaviour within places. We present a practical instantiation of the framework that supports our motivating application, dividing the notion of place into stations and routes. Stations typically correspond to rooms or areas where a user spends time, places to which they might refer in spoken conversation, while routes are the physical links between stations. Visits are temporally segmented based on qualitative visual motion and used to update the map, either by updating an existing map station or by adding a new one. We contribute a labelled dataset suitable for evaluating this novel SLAM task. Quantitative experiments compare mapping performance with and without view graphs and demonstrate better online mapping than an offline baseline clustering. In addition, qualitative results give the reader an intuitive visual summary by overlaying the learned map and its components on a manually created metric map.
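As a reading aid, the online station-update loop described above can be sketched as follows. Everything concrete here is an assumption rather than the thesis's method: visits are reduced to fixed-length descriptors, matching uses Euclidean distance to the nearest station, and a single threshold `tau` decides between updating an existing station and adding a new one. The actual instantiation (probabilistic place graphs, view graphs, visual-motion segmentation) is richer than this.

```python
import math

class EgoMap:
    """Illustrative sketch of online, unsupervised station mapping."""

    def __init__(self, tau=1.0):
        self.tau = tau              # assumed distance threshold for matching
        self.stations = []          # list of (mean descriptor, visit count)
        self.routes = set()         # undirected edges between station ids
        self._last = None           # id of the most recently visited station

    @staticmethod
    def _dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def add_visit(self, descriptor):
        """Match a temporally segmented visit to the map; update or add a station."""
        best, best_d = None, float("inf")
        for i, (d, _) in enumerate(self.stations):
            dist = self._dist(descriptor, d)
            if dist < best_d:
                best, best_d = i, dist
        if best is not None and best_d <= self.tau:
            # Update existing station: running mean of its descriptors.
            d, n = self.stations[best]
            self.stations[best] = (
                tuple((n * x + y) / (n + 1) for x, y in zip(d, descriptor)),
                n + 1,
            )
            sid = best
        else:
            # No station close enough: add a new one.
            self.stations.append((tuple(descriptor), 1))
            sid = len(self.stations) - 1
        if self._last is not None and self._last != sid:
            # Routes link consecutively visited stations.
            self.routes.add(frozenset((self._last, sid)))
        self._last = sid
        return sid
```

The key property the sketch captures is the online nature of the clustering: each visit either reinforces an existing map station or extends the map, and routes emerge from the order in which stations are visited.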
|Date of Award||2021|
|Sponsors||Engineering and Physical Sciences Research Council|
|Supervisor||Stephen McKenna (Supervisor) & Annalu Waller (Supervisor)|