Research Topics

Visual Tracking, Stereo and Sparse IBR, Facial Modeling and Analysis, Image and Video Processing


Visual Tracking

Object Tracking with Dynamic Feature Graphs
A good object representation describes one or more objects by high-level features from all perspectives, both spatial and temporal. Many representations have been proposed to model objects. They can be roughly categorized as global or local. Examples of global representations include color appearance models and subspace methods (such as PCA). A local representation describes the object using a set of local features, which are usually selected to characterize object parts. This kind of representation usually incorporates relations between local features to capture the object structure. A typical example is the region adjacency graph. However, relatively little work has been done on modeling the dynamic changes of the object model. The dynamic feature graph is designed as a representation that models both the spatial and temporal characteristics of an object. Spatially, the object is represented as an attributed relational graph, with features as nodes and their relations as edges. Temporally, the graph adaptively updates itself to keep good features and eliminate unstable ones.
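
As a rough illustration of the idea, the sketch below stores features as nodes and pairwise distances as edge attributes, and prunes features that repeatedly fail to match. The class names, the exponential stability score, and the `decay` and `min_stability` parameters are assumptions made for this example; they are not taken from the original work.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Tuple

Point = Tuple[float, float]


@dataclass
class FeatureNode:
    position: Point          # image coordinates of the local feature
    stability: float = 1.0   # hypothetical running score in [0, 1]


@dataclass
class DynamicFeatureGraph:
    nodes: Dict[int, FeatureNode] = field(default_factory=dict)
    # Edge attribute (assumed here): Euclidean distance between two features.
    edges: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def add_feature(self, fid: int, position: Point) -> None:
        self.nodes[fid] = FeatureNode(position)

    def connect(self, a: int, b: int) -> None:
        key = (min(a, b), max(a, b))
        self.edges[key] = math.dist(self.nodes[a].position, self.nodes[b].position)

    def update(self, matched: Dict[int, Point],
               decay: float = 0.5, min_stability: float = 0.2) -> None:
        """Temporal update: refresh matched features, drop unstable ones."""
        for fid in list(self.nodes):
            node = self.nodes[fid]
            hit = fid in matched
            if hit:
                node.position = matched[fid]
            # Exponential moving average of the match indicator.
            node.stability = decay * node.stability + (1.0 - decay) * (1.0 if hit else 0.0)
            if node.stability < min_stability:
                del self.nodes[fid]
                self.edges = {k: v for k, v in self.edges.items() if fid not in k}
        # Recompute the attributes of the surviving edges.
        for a, b in self.edges:
            self.edges[(a, b)] = math.dist(self.nodes[a].position, self.nodes[b].position)
```

With these default values, a feature that goes unmatched for three consecutive frames falls below the threshold and is removed together with its edges.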

Trajectory Based Multiple Object Tracking
Most tracking algorithms are based on the maximum a posteriori (MAP) solution of a probabilistic framework called the hidden Markov model, where the distribution of the object state at the current time instant is estimated from current and previous observations. However, this approach is prone to errors caused by temporal distractions such as occlusion, background clutter and multi-object confusion. In this paper we propose a multiple object tracking algorithm that seeks the optimal state sequence, the one that maximizes the joint state-observation probability. We name this algorithm trajectory tracking since it estimates the state sequence or “trajectory” instead of the current state. The algorithm is capable of tracking multiple objects whose number is unknown and varies during tracking. We introduce an observation model composed of the original image, the foreground mask given by background subtraction, and the object detection map generated by an object detector. The image provides the object appearance information. The foreground mask enables the likelihood computation to consider the multi-object configuration in its entirety. The detection map consists of pixelwise object detection scores, which drive the tracking algorithm to perform joint inference on both the number of objects and their configurations efficiently.
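
To illustrate the difference between estimating the current state and estimating the whole state sequence, here is a minimal sketch of the classical Viterbi recursion on a small discrete hidden Markov model. It is a textbook simplification for a single object with a fixed, finite state set, not the multi-object algorithm described above; the function name and argument layout are assumptions.

```python
from typing import List, Sequence


def viterbi(log_prior: Sequence[float],
            log_trans: Sequence[Sequence[float]],
            log_obs: Sequence[Sequence[float]]) -> List[int]:
    """Return the state sequence maximizing the joint state-observation probability.

    log_prior[s]      : log P(state_0 = s)
    log_trans[p][s]   : log P(state_t = s | state_{t-1} = p)
    log_obs[t][s]     : log P(observation_t | state_t = s)
    """
    n = len(log_prior)
    score = [log_prior[s] + log_obs[0][s] for s in range(n)]
    back = []  # back[t-1][s] = best predecessor of state s at time t
    for t in range(1, len(log_obs)):
        prev = score
        ptr = [max(range(n), key=lambda p: prev[p] + log_trans[p][s]) for s in range(n)]
        score = [prev[ptr[s]] + log_trans[ptr[s]][s] + log_obs[t][s] for s in range(n)]
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

When a single frame briefly favors the wrong state (for example during a short occlusion), a smooth transition model lets the sequence-level optimum stay on the correct state, whereas choosing the best state frame by frame from the observations alone would switch.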

Background layer model for tracking through occlusion
In this work, we extend previous research on layer-based trackers by introducing the concept of background occluding layers and explicitly inferring the depth ordering of foreground layers. Experimental results show that under various conditions with occlusion, including situations with moving objects undergoing complex motions or having complex interactions, our tracking algorithm is able to handle many difficult tracking tasks reliably.

Dynamic layer representation and its applications to tracking
A dynamic layer representation is proposed for tracking moving objects. Previous work on layered representations has largely concentrated on two-/multi-frame batch formulations, and tracking research has not addressed the issue of joint estimation of object motion, ownership and appearance. This research work extends the estimation of layers in a dynamic scene to an incremental estimation formulation and demonstrates how this naturally solves the tracking problem.

Sampling methods for tracking and detecting multiple objects
The CONDENSATION algorithm and its variants enable the estimation of arbitrary multi-modal posterior distributions that potentially represent multiple tracked objects. However, the specific state representation adopted in the earlier work does not explicitly support counting, addition, deletion and occlusion of objects. Furthermore, the representation may increasingly bias the posterior density estimates towards objects with dominant likelihood as the estimation progresses over many frames.
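
For readers unfamiliar with CONDENSATION, the sketch below shows one select-predict-measure iteration of a basic particle filter on a scalar state. The random-walk motion model, the Gaussian observation likelihood, and all parameter names are assumptions chosen for brevity; the state representation discussed above is considerably richer.

```python
import math
import random
from typing import List, Tuple


def condensation_step(particles: List[float], weights: List[float],
                      observation: float, rng: random.Random,
                      motion_std: float = 1.0,
                      obs_std: float = 1.0) -> Tuple[List[float], List[float]]:
    """One iteration: resample, diffuse, then reweight by the observation."""
    n = len(particles)
    # 1. Select: resample particles in proportion to their current weights.
    selected = rng.choices(particles, weights=weights, k=n)
    # 2. Predict: apply an assumed random-walk motion model.
    predicted = [x + rng.gauss(0.0, motion_std) for x in selected]
    # 3. Measure: weight each particle by an assumed Gaussian likelihood.
    new_weights = [math.exp(-0.5 * ((observation - x) / obs_std) ** 2) for x in predicted]
    total = sum(new_weights)
    if total == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        return predicted, [1.0 / n] * n
    return predicted, [w / total for w in new_weights]
```

Because resampling favors high-weight particles, a mode with dominant likelihood tends to absorb the particle set over many frames, which is the bias noted above.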


Stereo Computation and Image-Based New View Rendering

Learning Based Stereo
This paper describes a novel learning-based approach for improving the performance of stereo computation. The behavior of a given window-based matching method is characterized by whether the matching scores lead to the true depth, the nearby foreground depth, or random depth values. The probabilities that the matching result belongs to each of these three categories are determined by the original stereo images, the underlying scene structure and the size of the matching window. This conditional probability is learned from training data and is integrated into a depth estimation algorithm using the MAP-MRF framework. Preliminary experimental results show that the learning process captures common errors in SSD matching, including the fattening effect, the aperture effect, and mismatches in occluded or low-texture regions. It is also demonstrated that the proposed approach significantly improves the accuracy of the depth computation.
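
The window-based matcher whose behavior is being characterized can be pictured with a minimal sum-of-squared-differences (SSD) search along one rectified scanline. This sketch only shows the baseline matching step, not the learning or MAP-MRF stages; the function name and the convention that a left pixel at `x` matches a right pixel at `x - d` are assumptions.

```python
from typing import Sequence, Tuple


def ssd_disparity(left: Sequence[float], right: Sequence[float],
                  x: int, half_window: int, max_disp: int) -> Tuple[int, float]:
    """Return (disparity, cost) minimizing the SSD for the window centered at x.

    Both scanlines are assumed to have the same length and to be rectified.
    """
    best_d, best_cost = 0, float("inf")
    lo, hi = x - half_window, x + half_window
    for d in range(max_disp + 1):
        # Skip disparities whose window would leave either scanline.
        if lo - d < 0 or hi >= len(left):
            continue
        cost = sum((left[i] - right[i - d]) ** 2 for i in range(lo, hi + 1))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost
```

Larger windows make the cost more stable in low-texture regions but let foreground pixels dominate near depth boundaries, which is one source of the fattening effect mentioned above.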

Direct range space rendering
We propose an algorithm that addresses the sparse image-based rendering (IBR) problem. Unlike the traditional stereo or sparse IBR approach, our method does not explicitly recover the scene geometry or the pixel-wise correspondences between the two images. Instead, we solve this problem by using a range space rendering algorithm, in which the depth information is computed only implicitly in each new view. We show that the rendering result is good even though the local depth maps are not correctly recovered.

Depth recovery from unsynchronized cameras
An algorithm is proposed for estimating dense depth information of dynamic scenes using multiple video streams captured from unsynchronized fixed cameras. We solve this problem by first imposing two assumptions about the scene motions and the time difference between cameras. The scene motion is represented using a local constant velocity model, and the camera temporal difference is modeled as a constant within a short period of time. Based on these models, geometric relations between the images of moving scene points, the scene depth, the scene motions, and the camera temporal offset are investigated, and an estimation method is developed to compute the camera temporal difference. The algorithm is tested on both synthetic data and real images. Promising quantitative and qualitative experimental results are demonstrated in the paper.

Dynamic depth recovery from synchronized video streams
This work addresses the problem of extracting depth information of nonrigid dynamic 3D scenes from multiple synchronized video streams. Three main issues are discussed in this context: (i) temporally consistent depth estimation, (ii) sharp depth discontinuity estimation around object boundaries, and (iii) enforcement of the global visibility constraint. We present a framework in which the scene is modeled as a collection of 3D piecewise planar surface patches induced by color based image segmentation. This representation is continuously estimated using an incremental formulation in which the 3D geometric, motion, and global visibility constraints are enforced over space and time. The proposed algorithm optimizes a cost function that incorporates the spatial color consistency constraint and a smooth scene motion model.

Color segmentation based stereo and the global matching criteria
We present a new analysis-by-synthesis computational framework for stereo vision. It is designed to achieve the following goals: (1) enforcing global visibility constraints, (2) obtaining reliable depth for depth boundaries and thin structures, (3) obtaining correct depth for textureless regions, and (4) hypothesizing correct depth for unmatched regions. The framework employs depth and visibility based rendering within a global matching criterion to compute depth, in contrast with approaches that rely on local matching measures and relaxation. A color segmentation based depth representation guarantees smoothness in textureless regions. Hypothesizing depth from neighboring segments enables propagation of correct depth and produces reasonable depth values for unmatched regions. A practical algorithm that integrates all these aspects is presented in this paper. Comparative experimental results are shown for real images. Results on new view rendering based on a single stereo pair are also demonstrated.


Image and Video Processing

Image hallucination with primal sketch priors (Collaboration with Microsoft Research Asia)
We propose a Bayesian approach to image hallucination. Given a generic low resolution image, we hallucinate a high resolution image using a set of training images. Our work is inspired by recent progress on natural image statistics showing that the priors of image primitives can be well represented by examples. Specifically, primal sketch priors (e.g., edges, ridges and corners) are constructed and used to enhance the quality of the hallucinated high resolution image. Moreover, a contour smoothness constraint enforces consistency of primitives in the hallucinated image by a Markov-chain based inference algorithm. A reconstruction constraint is also applied to further improve the quality of the hallucinated image. Experiments demonstrate that our approach can hallucinate high quality super-resolution images.
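
One common way to write a Bayesian formulation of this kind is as a MAP estimate. The notation below is an assumption introduced for illustration, with $I_L$ the low resolution input and $I_H$ the high resolution estimate; it is not quoted from the original work.

```latex
I_H^{*} = \arg\max_{I_H} \; p(I_L \mid I_H)\, p(I_H)
```

Under this reading, the likelihood $p(I_L \mid I_H)$ corresponds to the reconstruction constraint (the estimate, once smoothed and downsampled, should reproduce the input), while the prior $p(I_H)$ is where the example-based primal sketch priors and the contour smoothness constraint would enter.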


Facial Modeling, Animation, Analysis, and Transmission

The piecewise Bézier volume deformation model (PBVD)
Capturing real facial motions from videos enables automated construction of dynamic models for facial animation. We propose an explanation-based facial motion tracking algorithm based on a piecewise Bézier volume deformation model (PBVD). The PBVD is a suitable model for both synthesis and analysis of facial images. With this model, basic facial movements, or action units, are first interactively defined. Then, by linearly combining these action units, various facial movements are synthesized. The magnitudes of these action units can be estimated from real videos using a model-based tracking algorithm. The predefined PBVD action units may also be adaptively modified to customize the dynamic model for a particular face. Experimental results on PBVD-based animation, model-based tracking, and explanation-based tracking are demonstrated.
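
The synthesis step, linearly combining action units, can be sketched as follows. Here each action unit is assumed to be stored as a flat list of per-coordinate vertex displacements from the neutral face; that storage layout and the function name are illustrative assumptions, and the Bézier volume machinery that produces the displacements is not shown.

```python
from typing import List, Sequence


def synthesize(neutral: Sequence[float],
               action_units: Sequence[Sequence[float]],
               magnitudes: Sequence[float]) -> List[float]:
    """Return neutral + sum_k magnitudes[k] * action_units[k], per coordinate."""
    out = list(neutral)
    for unit, magnitude in zip(action_units, magnitudes):
        out = [v + magnitude * d for v, d in zip(out, unit)]
    return out
```

Tracking then amounts to the inverse problem: finding the magnitudes that best explain the motion observed in the video.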

