
Ongoing Research
Face Detectors on Speed: Increasing detection speed
through online machine learning
Unsupervised Human Action Classification with Ambiguous
Correspondences
Demonstrations and Applications
Automatic Control of Pan-Tilt-Zoom Cameras for Tracking
Presenters
Earlier Research
CORGA: Facial Gaze Correction in Monocular Videoconferencing
3D Recovery of Human Motion from Monocular Video
SpliceWorld: Tracking-Assisted Video Editing
Dynamic Ordering for Fast Sequential Registration
Dynamic Bayesian Network for Figure Tracking
Multiple Hypothesis Articulated Figure Tracking


(Joint work with Minh-Tri Pham)
This
work investigates how the speed of an object detector can be rapidly increased
through a caching framework when input sequences are quasi-repetitive. In the
proposed framework, observed output states are discretized into a large number
of classes. Each class induces its own discriminant subspace in the feature
space, and is associated with a feature exemplar and a local metric. The feature
exemplars and the local metrics are learned online using a novel piecewise
linear discriminant analysis. The execution of the original object detector is
skipped when the current image feature vector is similar to previously observed
feature exemplars, and previously detected object states may simply be recalled.
The exemplar recognition is carried out via a one-pass approximate nearest-neighbor
search in an index tree based on k-means clustering. Preliminary results show up
to a 5-fold speed-up when applied to the Viola & Jones [23] face detector,
raising throughput from 10 fps to 50 fps. Experiments were done on the standard
MPEG-4 testing sequences and real data.
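
The caching idea can be illustrated with a minimal sketch. It is not the published method: the learned local metrics and the k-means index tree are replaced here by a brute-force Euclidean nearest-neighbor search with a single fixed threshold, and the names `DetectionCache`, `detector` and `threshold` are hypothetical.

```python
import math


class DetectionCache:
    """Recall a stored detector output when a frame's feature vector is close
    to a previously seen exemplar; otherwise run the (slow) detector."""

    def __init__(self, detector, threshold):
        self.detector = detector    # hypothetical slow detector: features -> state
        self.threshold = threshold  # assumed fixed Euclidean distance threshold
        self.exemplars = []         # list of (feature_vector, detected_state)
        self.hits = 0
        self.misses = 0

    def _nearest(self, features):
        # Brute-force search; the original work uses an approximate
        # nearest-neighbor search in a k-means index tree instead.
        best_state, best_dist = None, math.inf
        for exemplar, state in self.exemplars:
            dist = math.dist(exemplar, features)
            if dist < best_dist:
                best_state, best_dist = state, dist
        return best_state, best_dist

    def detect(self, features):
        state, dist = self._nearest(features)
        if dist <= self.threshold:
            self.hits += 1          # cache hit: skip the detector entirely
            return state
        self.misses += 1            # cache miss: run the detector and remember
        state = self.detector(features)
        self.exemplars.append((tuple(features), state))
        return state


def slow_detector(features):
    # Stand-in for a full face detector; returns a rounded "object state".
    return (round(features[0]), round(features[1]))


cache = DetectionCache(slow_detector, threshold=0.5)
frames = [(10.0, 20.0), (10.1, 20.1), (30.0, 5.0), (10.05, 19.9)]
states = [cache.detect(frame) for frame in frames]
```

On quasi-repetitive input most frames fall near an existing exemplar, so the detector is run only for the few frames that are genuinely new.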


Minh-Tri Pham and Tat-Jen Cham. Detection caching for faster object detection. In Proc. IEEE International Workshop on Modeling People and Human Interaction (PHI), Beijing, China, 2005. Podium presentation. [PDF]
(Joint work with Zhou Feng)
We
have created a combined tracking-classification framework for the unsupervised
classification of human action. While most existing approaches assume that
feature-wise correspondences on people are either fully available or not
available at all, this method explicitly formalizes how the probability of
correspondences can be used in computation when the correspondences are
ambiguous. It is also able to exploit, in a probabilistic manner, any
preprocessed foreground-background segmentation, even if the segmentation is of
low confidence. A principled
analysis of the problem leads to a novel probabilistic action representation
called the correspondence-ambiguous feature histogram array (CAFHA) that is
robust to variations across similar actions. Our results show that the new
framework outperforms the recent Zelnik-Manor & Irani method for unsupervised
event classification. Additionally, the framework is extended to quasi real-time
action inference, achieving good recognition accuracy despite changes in person
identity and variations in the actions.
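
As a rough illustration of using correspondence probabilities rather than hard assignments, the sketch below accumulates a soft histogram. It is not the CAFHA definition from the paper: the input layout (a quantized feature bin, a foreground probability, and a probability per body part) and the function name are assumptions made for this example only.

```python
def soft_feature_histogram(observations, n_parts, n_bins):
    """Accumulate one histogram per body part using soft votes.

    observations: iterable of (feature_bin, foreground_prob, part_probs),
    where part_probs[p] is the assumed probability that the feature
    corresponds to part p. Every name here is hypothetical.
    """
    histogram = [[0.0] * n_bins for _ in range(n_parts)]
    for feature_bin, foreground_prob, part_probs in observations:
        for part, prob in enumerate(part_probs):
            # A feature votes for every part, weighted by how likely the
            # correspondence is and how likely the pixel is foreground.
            histogram[part][feature_bin] += foreground_prob * prob
    return histogram


observations = [
    (0, 1.0, [0.5, 0.5]),  # ambiguous between the two parts
    (1, 0.5, [1.0, 0.0]),  # low-confidence foreground, unambiguous part
]
histogram = soft_feature_histogram(observations, n_parts=2, n_bins=2)
```

An ambiguous feature spreads its vote across parts instead of being forced into one, and a low-confidence segmentation simply scales the vote down.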


Please see a 3-minute video (20 MB, WMV format) showcasing some of our demos.
(Based on FYP projects by Kelvin Koh and Chia Lee Wong)

(Joint work with Shyam Krishnamoorthy and Michael Jones)
One of the key problems faced in establishing video-conferencing as a ubiquitous
mode of communication is that it is currently impossible to establish
eye contact over a video link. This results in poor natural body-language
communication, making interaction awkward and unsatisfying. The
reason is that the camera and screen cannot be physically aligned. This
project aims to overcome the problem by synthesizing virtual views such that,
when the user is looking at the display window, the image transmitted to the
corresponding party is that of the user looking at the camera. The goal is to
allow eye contact between video-conferencing participants. Furthermore, in
multi-party conferencing involving multiple display windows, this technology
will allow speakers to naturally indicate the person they are speaking to by
simply looking at that person's display window.
Approach
The approach is broadly illustrated in figure 3. It involves three stages:
Face Registration. A 2D face model is registered to the person's face appearing in the incoming video frames. The model computes pixel-wise displacements and intensity variations of the input face image relative to a reference face image, and outputs a low-dimensional state vector representing the face parameters. This is discussed in more detail in [1].
Parameter Mapping. The face parameters represent the facial image in the original view. To transform the original face image into one that is gaze-corrected, the face parameters are mapped through a function. The function is computed via locally-weighted regression [2] from exemplars in a training set. The training set comprises pairs of face parameters obtained by registering vertical-baseline stereo images of different persons, collected in a separate training phase. The top image of each stereo pair is the original view, while the bottom image is the desired, gaze-corrected view. The main justification is that the mapping from face parameters in the original view to face parameters in the gaze-corrected view is expected to be relatively invariant across the population.
Face Synthesis. Given the gaze-corrected face parameters, a synthetic image can be generated and superimposed on the original image frame.
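
The parameter-mapping stage can be sketched as follows. This is only an illustration under simplifying assumptions: it uses the locally constant (zeroth-order) form of locally weighted regression, i.e. a Gaussian kernel-weighted average of training targets, and the function name, the `bandwidth` parameter and the toy parameter vectors are all hypothetical rather than taken from the project.

```python
import math


def map_parameters(query, training_pairs, bandwidth=1.0):
    """Map original-view face parameters to gaze-corrected parameters.

    training_pairs: list of (original_view_params, corrected_view_params).
    Each training target is weighted by a Gaussian kernel on the distance
    between the query and the corresponding original-view exemplar.
    """
    weights = []
    for source, _ in training_pairs:
        sq_dist = sum((q - s) ** 2 for q, s in zip(query, source))
        weights.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from every training exemplar")
    dim = len(training_pairs[0][1])
    return [
        sum(w * target[i] for w, (_, target) in zip(weights, training_pairs)) / total
        for i in range(dim)
    ]


# Toy training set: two exemplar pairs with 2-D parameter vectors.
pairs = [((0.0, 0.0), (1.0, 1.0)), ((10.0, 10.0), (5.0, 5.0))]
near_first = map_parameters((0.0, 0.0), pairs)  # dominated by the first pair
midway = map_parameters((5.0, 5.0), pairs)      # equal weight on both pairs
```

A full locally weighted linear fit would solve a small weighted least-squares problem per query instead of averaging; the kernel weighting of nearby exemplars is the same in both cases.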

Here are some preliminary results:

(Joint work with David DiFranco and Jim Rehg)
Figure tracking can be used for video-based motion capture, i.e. recovering 3D motion. In our case, figure tracking is carried out with a 2D kinematic model (in fact a scaled-prismatic model; see the figure tracking web page), which is robust to singularities and defers 3D estimation to a later stage. This provides 2D correspondences, from which 3D pose must be estimated. Because the estimation is ill-posed, we need to use prior knowledge and constraints, in particular:

See paper for details.
D.E. DiFranco, T.J. Cham and J.M. Rehg. Recovery of 3D articulated models from 2D correspondences. In Proc. IEEE Conference on Computer Vision & Pattern Recognition (CVPR), Kauai, Hawaii, December 2001. [PDF]
(Joint work with Jim Rehg and Sing Bing Kang)
One of the benefits of figure tracking is that it can be used
to aid segmentation of the figure. It provides a prior for what the foreground
figure regions are, which can then be further improved on. We used a coupled
figure tracking and background-model estimation framework, since the two
processes are mutually dependent. Background modeling involves estimating the
background pixels (which may be occluded by the figure in some frames) over
time, through image registration. Subtracting this background model from the
current image frame will identify a more accurate foreground region, which can
be used to improve tracking. Conversely, tracking helps identify foreground
pixels to ignore when registering image frames in background modeling, thereby
improving the accuracy of the registration. This is illustrated in the diagram
below. Figure segmentation is useful for video editing applications,
especially when foreground objects and background scenes from old footage are
treated as separate layers to be combined into new footage. We demonstrate such
a process in the paper listed below.
J.M. Rehg, S.B. Kang, and T.J. Cham. Video-editing using figure tracking and image-based rendering. In Proc. International Conference on Image Processing (ICIP), Vancouver, BC, Canada, 2000. [PDF]
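
The coupling between tracking and background modeling described above can be sketched on toy data. The sketch assumes the frames are already registered to a common coordinate system, uses a per-pixel temporal median as the background estimator, and takes the tracker's output as boolean figure masks; these choices and all function names are assumptions for illustration, not the framework from the paper.

```python
from statistics import median


def estimate_background(frames, figure_masks):
    """Per-pixel median over time, ignoring pixels the tracker marks as figure.

    frames: list of 2-D grayscale images (lists of rows).
    figure_masks: same shape, True where the tracked figure covers the pixel.
    """
    height, width = len(frames[0]), len(frames[0][0])
    background = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            samples = [
                frame[y][x]
                for frame, mask in zip(frames, figure_masks)
                if not mask[y][x]
            ]
            if samples:  # stays None if the figure occludes it in every frame
                background[y][x] = median(samples)
    return background


def segment_foreground(frame, background, threshold=20):
    """Background subtraction: True where the frame differs from the model."""
    return [
        [bg is not None and abs(pixel - bg) > threshold
         for pixel, bg in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# One-row toy sequence: a bright figure (200) moves across a dark background (10).
frames = [[[10, 200, 10]], [[10, 10, 200]], [[10, 10, 10]]]
masks = [[[False, True, False]], [[False, False, True]], [[False, False, False]]]
background = estimate_background(frames, masks)
foreground = segment_foreground(frames[0], background)
```

In the coupled setting the resulting foreground map would feed back into the tracker, and the refined track would in turn yield better masks for the next background estimate.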
(Joint work with Jim Rehg)
We have developed a method for efficiently registering a scaled-prismatic model to an image. Existing methods that predetermine the order in which segments of a kinematic chain are registered (typically a serial order along the chain) are sub-optimal in speed. Instead, our method selects, at each step, the feature with the smallest matching ambiguity, a measure of registration cost based on successively updated prior estimates.
Figure: The left image shows the initial state of the figure with a weak prior (large variances), while the right image shows the efficient ordering of sequentially registered kinematic links.
Figure: The left image again shows the initial state of the figure, except that the position of the right shin is known a priori. The right image shows the efficient ordering of sequentially registered kinematic links, which differs from the ordering above.
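
The greedy selection rule can be illustrated with a toy sketch. Here matching ambiguity is approximated by a single prior variance per link, and registering a link is assumed to tighten the priors of its kinematic neighbours by a fixed `coupling` amount; both simplifications, and all names, are assumptions for illustration rather than the cost measure defined in the paper.

```python
def dynamic_order(prior_variance, neighbours, coupling=0.1):
    """Greedily register the link with the smallest current ambiguity.

    prior_variance: dict mapping link name -> prior variance (ambiguity proxy).
    neighbours: dict mapping link name -> links adjacent in the kinematic chain.
    """
    variance = dict(prior_variance)
    remaining = set(variance)
    order = []
    while remaining:
        # Ties are broken alphabetically to keep the result deterministic.
        link = min(remaining, key=lambda name: (variance[name], name))
        order.append(link)
        remaining.remove(link)
        for other in neighbours.get(link, ()):
            if other in remaining:
                # A registered link constrains its neighbours' priors.
                variance[other] = min(variance[other], variance[link] + coupling)
    return order


chain = {
    "torso": ["thigh"],
    "thigh": ["torso", "shin"],
    "shin": ["thigh", "foot"],
    "foot": ["shin"],
}
weak_prior = {"torso": 1.0, "thigh": 5.0, "shin": 5.0, "foot": 5.0}
shin_known = {"torso": 1.0, "thigh": 5.0, "shin": 0.1, "foot": 5.0}

order_weak = dynamic_order(weak_prior, chain)
order_shin = dynamic_order(shin_known, chain)
```

With the weak prior the order runs outward from the torso, while a confident prior on the shin makes registration start there and work outward in both directions, mirroring the two cases in the figures.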
Details of this work can be found in
T.J. Cham and J.M. Rehg. Dynamic feature ordering for efficient registration. In Proc. International Conference on Computer Vision (ICCV), vol 2, pages 1084-1091, Kerkyra (Corfu), Greece, 1999. [PDF]
(Joint work with Vladimir Pavlovic, Jim Rehg and Kevin Murphy)
(Joint work with Jim Rehg)
Figure tracking in a general environment faces the problem of data association ambiguities. Background clutter may generate distractors which are similar in appearance to the limbs of the figure, or the limbs may appear similar to one another. It may be difficult in some instances to select the correct tracker state based on measurements from the current image frame. A tracker which is forced to select only a single target measurement from each time frame may fail to make the correct choice. Even if the correct target measurement can be easily identified in subsequent time frames, it is possible that the tracker will have been irrecoverably trapped in an erroneous state.
A multiple-hypothesis (MH) tracker instead defers the decision of selecting the right measurement by maintaining a set of hypotheses for the correct state of the tracker. A postprocessing stage may then be used to decide on the optimal track after measurements have been made for all image frames. In the past, MH methods have been developed for radar systems by associating hypotheses with distinct output features ('blips'). Unfortunately, these methods cannot be extended to articulated figure tracking, because the data association between a 'figure feature' and image pixels is non-trivial and would require the invention of a 'figure detector'. In our recently proposed Bayesian framework [3,4], hypotheses are instead generated from random seeds in state-space, which are subsequently locally optimized. This avoids the need for distinct features. Furthermore, this mechanism is efficient even in the high-dimensional state-space of a figure model, unlike traditional Markov Chain Monte Carlo (MCMC) methods, which are based on propagating sets of random samples. A result obtained from our algorithm is shown below:
[Figure: multiple-hypothesis tracking result]
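
The seed-then-optimize mechanism for generating hypotheses can be sketched in one dimension. This is a toy under stated assumptions, not the Bayesian framework of [3,4]: the state is a single number, the score function with two peaks is invented for the example, and a simple step-halving hill climb stands in for the local optimizer.

```python
import math
import random


def local_optimise(score, x, step=0.5, min_step=1e-4):
    """Hill-climb from x, halving the step whenever no neighbour improves."""
    while step > min_step:
        best = max((x, x - step, x + step), key=score)  # ties favour staying
        if best == x:
            step /= 2.0
        else:
            x = best
    return x


def generate_hypotheses(score, low, high, n_seeds=20, merge_dist=0.05, seed=0):
    """Draw random seeds in state space, optimize each, keep distinct modes."""
    rng = random.Random(seed)
    modes = []
    for _ in range(n_seeds):
        x = local_optimise(score, rng.uniform(low, high))
        if all(abs(x - m) > merge_dist for m in modes):
            modes.append(x)
    # Strongest hypothesis first; weaker ones are kept, not discarded.
    return sorted(modes, key=score, reverse=True)


def ambiguous_score(x):
    # Hypothetical likelihood: the true limb near 2, a distractor near 7.
    return math.exp(-((x - 2.0) ** 2) / 0.5) + 0.8 * math.exp(-((x - 7.0) ** 2) / 0.5)


hypotheses = generate_hypotheses(ambiguous_score, low=0.0, high=9.0)
```

Both modes survive as hypotheses, so a later frame that disambiguates the distractor can still recover the correct track instead of being committed to a single early choice.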


CeMNet SmartSpace - where imagination evolves into reality
