Walk, Stefan (2013)
Multi-Cue People Detection from Video.
Technische Universität Darmstadt
Ph.D. Thesis, Primary publication
thesis.pdf. Copyright: CC BY-NC-ND 2.5 Generic (Creative Commons Attribution, NonCommercial, NoDerivs).
Item Type: Ph.D. Thesis
Type of entry: Primary publication
Title: Multi-Cue People Detection from Video
Language: English
Referees: Roth, Prof. Ph.D. Stefan; Schiele, Prof. Dr. Bernt; Schindler, Prof. Dr. Konrad
Date: 2 July 2013
Place of Publication: Darmstadt
Date of oral examination: 26 September 2012
Abstract:

This thesis aims to advance the state of the art in pedestrian detection. Since pedestrian detection has many applications, for example automotive safety or human-robot interaction in robotics, there is a strong desire for improvement. This thesis studies the benefits of combining multiple features that gather information from different cues, for example image color, motion, and depth. Training techniques and evaluation procedures are also investigated, improving both performance and the reliability of results, especially when different methods are compared.

While motion features have been used before, they were either conceptually restricted to a fixed-camera setting (e.g. surveillance) or did not yield an improvement on the full-image detection task. This thesis presents the modifications needed to make the optical-flow-based approach of Dalal et al. work in the full-image detection setting. Beyond this, substantial improvements from motion features are shown even when the camera is moving significantly, which had not been tested before. A variant of the motion feature that performs equally well at a significantly lower feature dimension is also introduced.

Another cue used in the present work is color. Incorporating color information into computer vision algorithms usually requires dealing with the color constancy problem. This thesis introduces a new feature called color self-similarity (CSS), which encodes long-range similarities of color distributions between positions within the detector window. By only comparing colors inside the detector window, the color constancy problem is circumvented: effects of lighting and camera properties are far less likely to vary significantly within the detector window than they are over the whole dataset. Additionally, it is shown that even raw color information can be useful if the training set covers enough variability.

Depth is also a useful cue. An existing stereo feature, the stereo-based HOG of Rohrbach et al., is adopted, and a new feature is introduced that exploits a useful relation between stereo disparity and the height of an object in the image. This feature is computationally cheap and encodes local scene information, such as object height and the presence of a ground plane, in a completely data-driven way (all parameters are learned during training). It reduces both false positives (eliminating detections of the wrong size) and false negatives (pedestrians missed because the detector estimated their size wrongly).

For the classifier part of the pipeline, it is shown that AdaBoost with decision stumps cannot handle the multi-cue, multi-view detection setting examined here. A recently proposed boosting classifier, MPLBoost, turns out to be superior, with classification performance comparable to support vector machines. It is also demonstrated that error rates can be reduced by using support vector machines and boosting classifiers in combination. Another contribution of this thesis is a procedure to combine training datasets with different sets of cues during training, e.g. a monochrome dataset with a color dataset, or a dataset without motion information with a dataset from video. This greatly increases the amount of available training data when multiple cues are used.

A collection of evaluation pitfalls is also highlighted. It is demonstrated that the PASCAL overlap criterion encourages overestimating the bounding box size. Care must also be taken when evaluating on subsets of annotations, e.g. only on occluded pedestrians or on pedestrians of certain sizes: naive approaches to determining the strengths of different methods can easily lead to wrong conclusions. This thesis proposes better methods for comparing approaches.

Finally, an application of the detector in a 3D scene reasoning framework is presented. Multiple detectors trained on partial views (e.g. only the upper body) are combined, and 3D reasoning is used to infer which parts of a pedestrian should be visible; the framework uses this information to weight the contributions of the partial detectors. This allows the detection system to find pedestrians even when they are occluded for extended periods of time.
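The color self-similarity idea described in the abstract can be sketched in a few lines: compute a local color histogram for each cell of the detector window, then compare every pair of cell histograms, so that lighting and camera effects (roughly constant within the window) cancel out. This is only an illustrative sketch; the cell size, histogram binning, and the histogram-intersection similarity below are assumptions for illustration, not the thesis's exact parameters.

```python
import numpy as np

def color_self_similarity(window, cell=8, bins=4):
    """Hypothetical CSS sketch.

    window: H x W x 3 uint8 image patch (the detector window).
    Returns one similarity value per unordered pair of cells.
    """
    h, w, _ = window.shape
    hists = []
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            pixels = window[y:y + cell, x:x + cell].reshape(-1, 3)
            # joint color histogram of the cell, L1-normalised
            hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                                     range=((0, 256),) * 3)
            hist = hist.ravel()
            hists.append(hist / max(hist.sum(), 1e-9))
    hists = np.asarray(hists)  # n_cells x bins^3
    n = len(hists)
    # histogram intersection between every pair of cells gives the
    # long-range, illumination-insensitive feature vector
    return np.asarray([np.minimum(hists[i], hists[j]).sum()
                       for i in range(n) for j in range(i + 1, n)])
```

For a 16 x 16 window with 8 x 8 cells this yields 4 cells and hence 6 pairwise similarities, each in [0, 1]; a uniformly colored window scores 1.0 for every pair, since identical histograms intersect completely.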
URN: urn:nbn:de:tuda-tuprints-35002
Classification DDC: 000 Generalities, computers, information > 004 Computer science
Divisions: 20 Department of Computer Science; 20 Department of Computer Science > Interactive Graphics Systems; 20 Department of Computer Science > Multimodale Interaktive Systeme
Date Deposited: 11 Jul 2013 09:07
Last Modified: 09 Jul 2020 00:29
URI: https://tuprints.ulb.tu-darmstadt.de/id/eprint/3500
PPN: 386305390