Multi-Modal Recognition of Manipulation Activities through Visual Accelerometer Tracking, Relational Histograms, and User-Adaptation

  • Sebastian Stein

    Student thesis: Doctoral Thesis, Doctor of Philosophy


    Activity recognition research in computer vision and pervasive computing has progressed remarkably from distinguishing full-body motion patterns to recognizing complex activities. Manipulation activities, such as those occurring in food preparation, are particularly challenging to recognize: they involve many different objects, have non-unique task orders, and are subject to personal idiosyncrasies. Video data and data from embedded accelerometers provide complementary information, which motivates an investigation of effective methods for fusing these sensor modalities.

    This thesis proposes a method for multi-modal recognition of manipulation activities that combines accelerometer data and video at multiple stages of the recognition pipeline. A method for accelerometer tracking is introduced that provides, for each accelerometer-equipped object, a location estimate in the camera view by identifying a point trajectory that closely matches the accelerometer data. It is argued that associating accelerometer data with locations in the video provides a key link for modelling interactions between accelerometer-equipped objects and other visual entities in the scene. Estimates of accelerometer locations and their visual displacements are used to extract two new types of features: (i) Reference Tracklet Statistics, which characterize statistical properties of an accelerometer's visual trajectory, and (ii) RETLETS, a feature representation that encodes relative motion by using an accelerometer's visual trajectory as a reference frame for dense tracklets. In comparison to a traditional sensor fusion approach, in which features are extracted from each sensor type independently and concatenated for classification, it is shown that combining RETLETS and Reference Tracklet Statistics with those sensor-specific features performs considerably better.

    Specifically addressing scenarios in which a recognition system would be used primarily by a single person (e.g., cognitive situational support), this thesis investigates three methods for adapting activity models to a target user based on user-specific training data. Via randomized control trials it is shown that these methods indeed learn user idiosyncrasies.
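
    The accelerometer tracking idea described above, and the use of the matched trajectory as a reference frame, can be illustrated with a minimal sketch. The abstract does not specify the thesis's matching criterion or feature definitions, so everything here is an assumption made for illustration: the function names are hypothetical, the match score is a plain Pearson correlation between acceleration magnitudes, and the accelerometer signal is assumed to be already resampled and time-aligned to the video frames.

```python
import numpy as np


def visual_acceleration_magnitude(track):
    """Approximate acceleration magnitude of one point trajectory.

    track: (T, 2) array of image positions, one row per frame.
    Returns a (T-2,) array (second finite difference, then Euclidean norm).
    """
    accel = np.diff(track, n=2, axis=0)
    return np.linalg.norm(accel, axis=1)


def match_accelerometer(tracks, accel_mag):
    """Pick the trajectory whose visual acceleration best matches the sensor.

    tracks: (N, T, 2) candidate point trajectories.
    accel_mag: (T-2,) accelerometer magnitude, assumed aligned to the frames.
    Returns (index of best trajectory, array of N correlation scores).
    """
    scores = []
    for track in tracks:
        vis = visual_acceleration_magnitude(track)
        if vis.std() == 0 or accel_mag.std() == 0:
            # Correlation is undefined for a constant signal; rank it last.
            scores.append(-1.0)
        else:
            scores.append(np.corrcoef(vis, accel_mag)[0, 1])
    scores = np.array(scores)
    return int(np.argmax(scores)), scores


def relative_displacements(tracks, ref_index):
    """Frame-to-frame displacements of all tracks relative to a reference track.

    Returns an (N, T-1, 2) array; the reference track itself maps to zeros.
    """
    disp = np.diff(tracks, axis=1)   # (N, T-1, 2) per-frame displacements
    return disp - disp[ref_index]    # subtract the reference motion
```

    In this sketch, `match_accelerometer` stands in for the location estimate, and `relative_displacements` shows one simple way to express the motion of other tracklets relative to the accelerometer-equipped object; histogramming or otherwise encoding those relative motions into features is left out.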

    All proposed methods are evaluated on two new challenging datasets of food preparation activities that have been made publicly available. Both datasets feature a novel combination of video and accelerometers attached to objects. The Accelerometer Localization dataset is the first publicly available dataset that enables quantitative evaluation of accelerometer tracking algorithms. The 50 Salads dataset contains 50 sequences of people preparing mixed salads with detailed activity annotations.
    Date of Award: 2014
    Original language: English
    Supervisor: Stephen McKenna


    • activity recognition
    • sensor fusion
    • computer vision
    • accelerometers
    • machine learning
    • action recognition
    • food preparation
    • situational support
    • user adaptation
    • accelerometer localization
    • tracking
