PERCEPTER
Perceiving & Understanding Humans in Video

A CeMNet SmartSpace Research Initiative

Centre for Multimedia & Network Technology
School of Computer Engineering
Nanyang Technological University
Singapore

Principal Investigator: A/Prof Cham Tat Jen

Contents

Overview

Ongoing Research
Face Detectors on Speed: Increasing detection speed through online machine learning
Unsupervised Human Action Classification with Ambiguous Correspondences

Demonstrations and Applications
Automatic Control of Pan-Tilt-Zoom Cameras for Tracking Presenters

Earlier Research
CORGA: Facial Gaze Correction in Monocular Videoconferencing
3D Recovery of Human Motion from Monocular Video
SpliceWorld: Tracking-Assisted Video Editing
Dynamic Ordering for Fast Sequential Registration
Dynamic Bayesian Network for Figure Tracking
Multiple Hypothesis Articulated Figure Tracking

Other Links

Overview

Ongoing Research

Face Detectors on Speed: Increasing detection speed through online machine learning

(Joint work with Minh-Tri Pham)

This work investigates how the speed of an object detector can be rapidly increased through a caching framework when input sequences are quasi-repetitive. In the proposed framework, observed output states are discretized into a large number of classes. Each class induces its own discriminant subspace in the feature space, and is associated with a feature exemplar and a local metric. The feature exemplars and the local metrics are learned online using a novel piecewise linear discriminant analysis. When the current image feature vector is similar to a previously observed feature exemplar, execution of the original object detector is skipped and the previously detected object state is simply recalled. Exemplar recognition is carried out via a 1-pass approximate nearest neighbor search in an index tree based on k-means clustering. Preliminary results show up to a 5-fold speed-up when applied to the Viola & Jones [23] face detector, raising the frame rate from 10fps to 50fps. Experiments were done on the standard MPEG-4 testing sequences and real data.
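
As a rough illustration of the caching idea only, the following Python sketch skips a detector whenever the current feature vector lies close to a stored exemplar. It makes several simplifying assumptions that are not part of the work described above: a plain Euclidean distance and brute-force search stand in for the learned local metrics and the k-means index tree, and the names DetectorCache, slow_detector and threshold are hypothetical.

```python
import numpy as np

class DetectorCache:
    """Skip an expensive detector when the input is close to a stored exemplar."""

    def __init__(self, detector, threshold):
        self.detector = detector    # expensive function: feature vector -> object state
        self.threshold = threshold  # hypothetical distance threshold for a cache hit
        self.exemplars = []         # feature exemplars observed so far
        self.states = []            # detector output recalled on a hit
        self.hits = 0
        self.misses = 0

    def __call__(self, feature):
        feature = np.asarray(feature, dtype=float)
        if self.exemplars:
            # Assumption: brute-force Euclidean search replaces the learned
            # local metric and the approximate index-tree search.
            dists = np.linalg.norm(np.stack(self.exemplars) - feature, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= self.threshold:
                self.hits += 1
                return self.states[nearest]
        # Cache miss: run the detector and remember the result.
        self.misses += 1
        state = self.detector(feature)
        self.exemplars.append(feature)
        self.states.append(state)
        return state


def slow_detector(feature):
    # Toy stand-in for a real detector: the "state" is a rounded grid cell.
    return tuple(int(v) for v in np.round(feature))


cache = DetectorCache(slow_detector, threshold=0.1)
frames = [[1.0, 2.0], [1.02, 2.01], [5.0, 5.0], [1.01, 1.99]]
results = [cache(f) for f in frames]
print(results, cache.hits, cache.misses)
```

In this toy sequence the second and fourth frames are near-repeats of the first, so the detector runs only for the first and third frames.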

 

Unsupervised Human Action Classification with Ambiguous Correspondences

(Joint work with Zhou Feng)

We have created a combined tracking-classification framework for the unsupervised classification of human action. While most existing approaches assume that feature-wise correspondences on people are either fully available or not available at all, this method explicitly formalizes how correspondence probabilities can be used in computation when the correspondences are ambiguous. It can also exploit, in a probabilistic manner, any preprocessed foreground-background segmentation, even one of low confidence. A principled analysis of the problem leads to a novel probabilistic action representation called the correspondence-ambiguous feature histogram array (CAFHA) that is robust to variations across similar actions. Our results show that the new framework outperforms the recent Zelnik-Manor & Irani method for unsupervised event classification. Additionally, the framework is extended to quasi real-time action inference, achieving good recognition accuracy despite changes in person identity and variations in the actions.
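
The general idea of using correspondence probabilities rather than hard assignments can be illustrated with a probability-weighted histogram. The Python sketch below is not the CAFHA definition, which is not given on this page; it only assumes that each feature has several candidate correspondences, each with a quantized value and a probability, and accumulates expected counts. The function name expected_histogram and the toy numbers are hypothetical.

```python
import numpy as np

def expected_histogram(candidate_bins, candidate_probs, n_bins):
    """Histogram of expected counts under ambiguous correspondences.

    candidate_bins[i][k]  : histogram bin of the k-th candidate match of feature i
    candidate_probs[i][k] : probability that this candidate is the correct match
    """
    bins = np.asarray(candidate_bins).ravel()
    probs = np.asarray(candidate_probs, dtype=float).ravel()
    hist = np.zeros(n_bins)
    # Each candidate votes with its probability instead of a hard count of 1.
    np.add.at(hist, bins, probs)
    return hist / hist.sum()


# Two features, each with two candidate correspondences.
bins = [[0, 2], [2, 3]]
probs = [[0.9, 0.1], [0.5, 0.5]]
hist = expected_histogram(bins, probs, n_bins=4)
print(hist)
```

A confident correspondence (probability 0.9) contributes almost a full vote to one bin, while an ambiguous one spreads its vote across its candidates.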

 

Demonstrations and Applications

Please see a 3-minute video (20Mb, WMV format) showcasing some of our demos.

Automatic Control of Pan-Tilt-Zoom Cameras for Tracking Presenters

(Based on FYP projects by Kelvin Koh and Chia Lee Wong)

Earlier Research

CORGA: Facial Gaze Correction in Monocular Videoconferencing

(Joint work with Shyam Krishnamoorthy and Michael Jones)

One of the key problems faced in establishing video-conferencing as a ubiquitous mode of communication is that it is currently impossible to establish eye-contact over a video link. This results in poor natural body-language communication, making interaction awkward and unsatisfying. The reason is that the camera and screen cannot be physically aligned. This project aims to overcome the problem by synthesizing virtual views such that when the user is looking at the display window, the image transmitted to the corresponding party is that of the user looking at the camera. The goal is to allow eye-contact between video-conferencing participants. Furthermore, in multi-party conferencing involving multiple display windows, this technology will allow speakers to naturally indicate the person they are speaking to by simply looking at that person's display window.

Approach

The approach is broadly illustrated in figure 3. It involves 3 stages:

  1. Face Registration. A 2D face model is registered to the person's face appearing in the incoming video frames. The model computes pixel-wise displacements and intensity variations of the input face image from a reference face image and outputs a low-dimensional state vector representing the face parameters. This is discussed in more detail in [1].

  2. Parameter Mapping. The face parameters represent the facial image in the original view. In order to transform the original face image to one which is gaze-corrected, the face parameters are mapped through a function. The function is computed via locally-weighted regression [2] from exemplars in a training set. The training set comprises pairs of face parameters obtained by registering vertical-baseline stereo images of different persons. The top image of the stereo pair is the original view, while the bottom image of the stereo pair is the desired, gaze-corrected view. These were collected in a separate training phase. The main justification is that the mapping from face parameters in the original view to face parameters in the gaze-corrected view is expected to be relatively invariant across the population.

  3. Face Synthesis. Given the gaze-corrected face parameters, a synthetic image can be generated and superimposed on the original image frame.
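
The parameter-mapping stage relies on locally-weighted regression. A minimal Python sketch of that technique is given here for illustration, assuming a Gaussian kernel and a local affine model; the function name, the bandwidth and ridge parameters, and the synthetic training pairs are all hypothetical and are not taken from the project.

```python
import numpy as np

def locally_weighted_regression(query, X, Y, bandwidth=1.0, ridge=1e-6):
    """Predict the output at `query` from a linear fit weighted towards nearby exemplars."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    query = np.asarray(query, dtype=float)
    # Gaussian kernel weights: training pairs near the query dominate the fit.
    sq_dist = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    # Append a constant column so that the local model is affine.
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    qa = np.append(query, 1.0)
    # Weighted least squares, with a small ridge term for numerical stability.
    A = Xa.T @ (w[:, None] * Xa) + ridge * np.eye(Xa.shape[1])
    B = Xa.T @ (w[:, None] * Y)
    beta = np.linalg.solve(A, B)
    return qa @ beta


# Synthetic stand-ins for (original view, gaze-corrected view) parameter pairs.
rng = np.random.default_rng(0)
original = rng.normal(size=(50, 3))
true_map = np.array([[1.0, 0.2, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 1.1]])
corrected = original @ true_map + 0.5

query = np.array([0.1, -0.2, 0.3])
predicted = locally_weighted_regression(query, original, corrected)
print(predicted)
```

Because the fit is recomputed for every query, the mapping can follow a nonlinear relationship even though each local model is linear.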

 

Here are some preliminary results:

 

 

3D Recovery of Human Motion from Monocular Video

(Joint work with David DiFranco and Jim Rehg)

Figure tracking can be used for video-based motion capture, i.e. recovering 3D motion. In our case, figure tracking is carried out with a 2D kinematic model (in fact a scaled-prismatic model; see the figure tracking web page), which is robust to singularities and defers 3D estimation to a later stage. This provides 2D correspondences, from which 3D pose must be estimated. Because the estimation is ill-posed, we need to use prior knowledge and constraints.

See paper for details.

 

SpliceWorld: Tracking-Assisted Video Editing

(Joint work with Jim Rehg and Sing Bing Kang)

One of the benefits of figure tracking is that it can be used to aid segmentation of the figure. It provides a prior for what the foreground figure regions are, which can then be further improved on. We used a coupled figure tracking and background-model estimation framework, since the two processes are mutually dependent. Background modeling involves estimating the background pixels (which may be occluded by the figure in some frames) over time, through image registration. Subtracting this background model from the current image frame will identify a more accurate foreground region, which can be used to improve tracking. Conversely, tracking helps identify foreground pixels to ignore when registering image frames in background modeling, thereby improving the accuracy of the registration. This is illustrated in the diagram below. Figure segmentation is useful for video editing applications, especially when foreground objects and background scenes from old footage are treated as separate layers to be combined into new footage. We demonstrate such a process in the paper listed below.
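
The coupling between tracking and background modeling can be sketched in a few lines of Python. The example assumes that the frames have already been registered to a common coordinate frame and that the tracker supplies a rough foreground mask per frame; a per-pixel temporal median is used as the background estimator, which is a common choice but not necessarily the one used in this work. The function names and the threshold are hypothetical.

```python
import numpy as np

def estimate_background(frames, fg_masks):
    """Per-pixel temporal median, ignoring pixels the tracker flags as foreground."""
    frames = np.asarray(frames, dtype=float)
    fg_masks = np.asarray(fg_masks, dtype=bool)
    # Masked pixels become NaN so that nanmedian skips them.
    # A pixel that is foreground in every frame would yield NaN.
    masked = np.where(fg_masks, np.nan, frames)
    return np.nanmedian(masked, axis=0)

def segment_foreground(frame, background, threshold=20.0):
    """Background subtraction: large differences from the model are foreground."""
    return np.abs(np.asarray(frame, dtype=float) - background) > threshold


# Toy 1x4 grayscale sequence: a bright figure moves across a uniform background.
frames = np.full((3, 1, 4), 10.0)
masks = np.zeros((3, 1, 4), dtype=bool)
for t in range(3):
    frames[t, 0, t] = 200.0   # figure pixel in frame t
    masks[t, 0, t] = True     # tracker's foreground prior for frame t

background = estimate_background(frames, masks)
mask = segment_foreground(frames[1], background)
print(background, mask)
```

In a full loop, the refined mask would be fed back to the tracker, and the improved track would in turn provide better masks for the next background estimate.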

 

Dynamic Ordering for Fast Sequential Registration

(Joint work with Jim Rehg)

We have developed a method for efficiently registering a scaled-prismatic model to an image. Existing methods, which predetermine the order in which segments of a kinematic chain are registered (typically serially along the chain), are sub-optimal in speed. Instead, our method at each step selects the features with the smallest matching ambiguity, a measure of registration cost based on successively updated prior estimates.
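
A toy Python sketch of greedy dynamic ordering is given here for illustration. It assumes that the matching ambiguity of a link can be summarized by a single prior variance, and that registering a link tightens the prior of each adjacent link to a fixed value joint_var. Both are simplifications introduced for this example, as are the link names and numbers.

```python
def dynamic_order(prior_var, neighbors, joint_var=1.0):
    """Greedy ordering: always register the link whose current prior is tightest."""
    var = dict(prior_var)   # current ambiguity of each unregistered link
    order = []
    while var:
        link = min(var, key=var.get)   # smallest ambiguity first
        order.append(link)
        del var[link]
        # Registering a link tightens the priors of its unregistered neighbours.
        for nb in neighbors.get(link, ()):
            if nb in var:
                var[nb] = min(var[nb], joint_var)
    return order


# A simple kinematic chain: torso - thigh - shin - foot.
neighbors = {
    "torso": ["thigh"],
    "thigh": ["torso", "shin"],
    "shin": ["thigh", "foot"],
    "foot": ["shin"],
}

weak_prior = {"torso": 5.0, "thigh": 50.0, "shin": 60.0, "foot": 70.0}
shin_known = {"torso": 5.0, "thigh": 50.0, "shin": 0.1, "foot": 70.0}

order_weak = dynamic_order(weak_prior, neighbors)
order_shin = dynamic_order(shin_known, neighbors)
print(order_weak)
print(order_shin)
```

With the weak prior the order follows the chain from the torso outwards, whereas a tightly known shin is registered first and the ordering spreads out from there.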

The left image shows the initial state of the figure with a weak prior (large variances), while the right image shows the efficient ordering of sequentially registered kinematic links.       

 

The left image again shows the initial state of the figure, except that the position of the right shin is known a priori. The right image shows the efficient ordering of sequentially registered kinematic links, which differs from the ordering above.

Details of this work can be found in

 

Dynamic Bayesian Network for Figure Tracking

(Joint work with Vladimir Pavlovic, Jim Rehg and Kevin Murphy)

 

Multiple Hypothesis Articulated Figure Tracking

(Joint work with Jim Rehg)

Figure tracking in a general environment faces the problem of data association ambiguities. Background clutter may generate distractors which are similar in appearance to the limbs of the figure, or the limbs may appear similar to one another. It may be difficult in some instances to select the correct tracker state based on measurements from the current image frame. A tracker which is forced to select only a single target measurement from each time frame may fail to make the correct choice. Even if the correct target measurement can be easily identified in subsequent time frames, it is possible that the tracker will have been irrecoverably trapped in an erroneous state.

A multiple-hypothesis (MH) tracker instead defers the decision of selecting the right measurement by maintaining a set of multiple hypotheses for the correct state of the tracker. A postprocessing stage may then be used to decide on the optimal track after measurements have been made for all image frames. In the past, MH methods have been developed for radar systems by associating hypotheses with distinct output features (`blips'). Unfortunately, these methods cannot be extended to articulated figure tracking because the data association between a `figure feature' and image pixels is non-trivial and would require the invention of a `figure detector'. In our recently proposed Bayesian framework [3,4], hypotheses are instead generated from random seeds in state-space which are subsequently locally optimized. This avoids the need for distinct features. Furthermore, this mechanism is efficient even in the high-dimensional state-space of a figure model, unlike traditional Markov Chain Monte Carlo (MCMC) methods, which are based on propagating sets of random samples. A result obtained from our algorithm is shown below:

A 2D Scaled Prismatic Model (SPM) for figure tracking is shown in different states.
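
The idea of generating hypotheses from locally optimized random seeds can be illustrated on a one-dimensional toy problem. The Python sketch below is not the tracker of [3,4]: the bimodal likelihood, the numerical gradient ascent, and the merging tolerance are all assumptions chosen to keep the example short, and every function name is hypothetical.

```python
import numpy as np

def log_likelihood(x):
    # Toy bimodal likelihood: the true limb near x = 2, a distractor near x = -3.
    return np.log(np.exp(-0.5 * (x - 2.0) ** 2) + 0.6 * np.exp(-0.5 * (x + 3.0) ** 2))

def local_optimize(x, step=0.1, iters=500, eps=1e-5):
    """Gradient ascent with a numerical gradient; climbs to the mode of the seed's basin."""
    for _ in range(iters):
        grad = (log_likelihood(x + eps) - log_likelihood(x - eps)) / (2.0 * eps)
        x = x + step * grad
    return x

def generate_hypotheses(n_seeds=20, low=-6.0, high=6.0, merge_tol=0.1, seed=0):
    """Draw random seeds in state space, optimize each, and keep one state per mode."""
    rng = np.random.default_rng(seed)
    modes = []
    for x0 in rng.uniform(low, high, size=n_seeds):
        x = local_optimize(x0)
        # Seeds that converge to an already known mode add no new hypothesis.
        if all(abs(x - m) > merge_tol for m in modes):
            modes.append(float(x))
    return sorted(modes)


hypotheses = generate_hypotheses()
print(hypotheses)
```

Both the true mode and the distractor survive as hypotheses, so a later stage can choose between them once more frames have been observed.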
 

Other Links

CeMNet SmartSpace - where imagination evolves into reality