Abstract: In this paper, I propose a real-time object recognition system for
smart phone environments. The proposed object recognition system consists of
two key modules: feature extraction and object recognition. Feature detectors
such as Scale Invariant Feature Transform (SIFT) and Speeded Up Robust Feature
(SURF) yield high-quality features; however, they are too
computationally intensive for use in real-time applications of any complexity.
Compared to PC platforms, smart phone platforms have limited resources, so
computation-intensive SIFT and SURF descriptors are less usable in such
resource-limited environments. This paper uses the FAST corner detector,
which provides faster feature computation by extracting only corner information.
The number of corners detected by the FAST corner detector varies, so
normalization is applied to adjust the extracted corners (interest points) to
the same number. Based on the normalized corner information, support vector
machine (SVM) and back-propagation neural network (BPNN) training are performed
for the efficient recognition of objects. Compared to conventional SIFT and
SURF algorithms, the proposed object recognition system based on the FAST
corner detector yields increased speed with low performance degradation on smart
phones.

Keywords: Detection, Tracking, Recognition, Android, Smartphone, OpenCV, FAST, SIFT.

1. Introduction

In the past ten
years, mobile phones have increasingly displaced several conventional tools and
devices by performing increasingly varied and complex tasks. Today it is
possible to use a mobile device for everything from finding a route to a
destination and reading a book to browsing the web and organizing one's day.
In recent years, vision-based
tools such as barcode scanners and landmark recognition systems have made a
positive impact on extending usability. Furthermore, single-instance
recognition systems have been used for tasks such as art (painting)
recognition and book cover and CD album cover recognition. These applications
extract simple features on-device and use powerful hashing strategies to
retrieve results. However, such methods do not generalize to generic object
category detection.

The
ability to perform object detection on a mobile platform opens the door to
compelling applications such as visual searching, object specific augmented
reality, cataloguing objects in a scene, object category specific
recommendation systems, etc. Despite the range of opportunities, implementing
an efficient and accurate algorithm for object detection on a mobile phone is
an open challenge. Existing state-of-the-art object recognition methods have
been successfully applied to object detection and have been shown to obtain
reasonable recognition rates. However, in practice, implementing and
transferring such methods to a mobile platform is far from an easy
task. Mobile devices have limited memory and computational power, whereas
methods such as [7] have significant computational and memory needs.
Adapting these methods for a mobile phone therefore needs an effective split
between the mobile phone (client) and a server. Moreover, conventional object
recognition methods are not designed to detect the object from multiple
view-points. This ability is critical when:

1. The object is three-dimensional. Unlike a painting or book cover, 3D objects
   vary considerably from different view-points and often cannot even be viewed
   completely from a single vantage point.
2. Active sensing scenarios are considered. In such scenarios the device can
   provide feedback to the user to point the camera from the most informative
   view-point.

In this work I present a technique for leveraging multiple frames
towards improving object detection and further applying it in a mobile platform
assuming multiple frames are extracted from a short video sequence. I define a
short video sequence to consist of physically adjacent frames in which the object
remains in the field of view of the camera and the view of the object changes
smoothly. Note that, while steady motion is preferred, it is not
essential for the technique to provide improvements.

The multi-frame
detection approach I present is a generalization of the Class-Specific Hough
Forests framework for object detection proposed by Gall et al. and of the
Generalized Hough Transform, in which patches in a single image vote for possible
locations of the object in the image. I extend this by accumulating votes from
different patches across multiple frames in a single frame through a technique
I introduce as point transfer. In this technique I associate patches across
frames using tracking to link the voting spaces of the separate frames. I
experimentally demonstrate that my multi-frame technique performs notably
better than the single-frame version. I also implemented the multi-frame
detector on a mobile platform to demonstrate it in its
natural use case. In this contribution, I present a division of labour that
requires a significant, non-trivial contribution from the client side and heavy
processing from the server (back-end) for future work. Through a number of
timing analyses, I show that such a system is close to practical. In section 5
I discuss this division of labour and in section 6 I present the implementation
details.

In summary, my main contributions are:

- A novel Hough forest based multi-frame object detection framework.
- Point transfer: a novel technique to integrate the detection process across
  multiple frames of a short video sequence through tracking.
- A client-server framework with a non-trivial client to perform object
  detection on the mobile phone, which is significantly more than a simple
  image/video capture task.
- A comparative time performance study of three versions of the object
  detector: on a mobile device, on a desktop machine, and in the proposed
  client-server framework for future work.
- An analysis of the effects of lowering image resolutions on detection
  performance and the resulting system speed-up.

I
next present work pertinent to the point transfer technique in section 2. In
section 2 I also discuss prior works implementing object detectors on mobile
devices. Section 3 introduces the single frame detector that my work builds on.
Section 4 begins with the generalization of the single frame approach and
discusses the point transfer technique in detail. Section 5 provides details of
the mobile application organization. Finally, I provide experimental analysis of
both the point-transfer technique and the mobile application in section 6.

2. Related Work

Extending the Hough
Transform to work with arbitrary shapes was first introduced by D.H. Ballard in
1981. Since then it has been successfully applied to the problem of
single-instance object detection as well as object category detection. The Implicit
Shape Model (ISM) introduced by Leibe et al. forms the basis for many part
based object detection algorithms. The ISM learns a model of the spatial
distribution of local patches with respect to the object centre. During testing
the learned model is used to cast a probabilistic vote for the location of the
object centre. ISM builds a codebook of patches by clustering patches based on
appearance. Several modifications to the ISM have been proposed over the years.
Maji et al. propose a method that computes weights for the ISM using a
max-margin technique. To overcome the computational drawbacks of building a
codebook using clustering, Gall et al. propose to learn a discriminative
codebook using a random forest that they call a Class-Specific Hough Forest.
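
The voting step shared by the ISM and Hough Forests can be illustrated with a short sketch. It is not the implementation of any of the cited works: the codebook layout, the names `hough_vote` and `best_centre`, and the integer pixel offsets are assumptions made purely for illustration. Each patch looks up the offsets stored with its codebook entry and adds weighted votes for the object centre to a voting space; the strongest cell is taken as the detection hypothesis.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each codebook entry stores (dx, dy, weight) offsets
# from a patch to the object centre, as learned during training.
Offset = Tuple[int, int, float]
Patch = Tuple[int, int, int]  # (x, y, codebook entry id)


def hough_vote(patches: List[Patch], codebook: Dict[int, List[Offset]],
               width: int, height: int) -> List[List[float]]:
    """Accumulate weighted object-centre votes into a height x width grid."""
    votes = [[0.0] * width for _ in range(height)]
    for x, y, entry in patches:
        for dx, dy, weight in codebook.get(entry, []):
            cx, cy = x + dx, y + dy
            if 0 <= cx < width and 0 <= cy < height:
                votes[cy][cx] += weight  # votes outside the image are dropped
    return votes


def best_centre(votes: List[List[float]]) -> Tuple[int, int, float]:
    """Return (x, y, score) of the strongest cell in the voting space."""
    score, x, y = max((v, x, y) for y, row in enumerate(votes)
                      for x, v in enumerate(row))
    return x, y, score


# Toy example: three patches agree on one centre; a fourth votes off-image.
codebook = {0: [(2, 0, 1.0)], 1: [(-2, 0, 1.0)], 2: [(0, 3, 0.5)]}
patches = [(3, 4, 0), (7, 4, 1), (5, 1, 2), (0, 0, 1)]
votes = hough_vote(patches, codebook, width=10, height=8)
print(best_centre(votes))
```

A practical detector would typically smooth the voting space and apply non-maximum suppression rather than take a single maximum.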
Sun et al. introduce the idea of Depth-Encoded Hough Voting, which incorporates
depth information in training in order to learn a one-to-one mapping between
scene depth and patch scale. In my work, I do not introduce any modifications
to the training set-up. Thomas et al. extend the ISM for multi-view object
recognition, by learning many single view codebooks that they interconnect via
what they call activation links. These activation links are obtained through
the image exploration algorithm proposed by Ferrari et al. for recognizing
specific objects and establishing dense multi-view correspondence across
multiple views. In some sense, this is similar to the idea I pursue in my
multi-frame detector. However, while they build associations between code words
across multiple codebooks corresponding to multiple views, I build associations
between patches observed across multiple frames using tracking at test time.
Other works present techniques that leverage videos; however, these techniques
are limited to single-instance object recognition. Another work, like mine,
leverages short video sequences; however, it focuses on estimating poses of
object categories. On the mobile front, the range of works dealing with object
detection is limited. Hart et al. introduce an on-device segmentation and
detection technique for 2D objects such as coins, keys, and screws. Belhumeur
et al. demonstrate an object recognition system for identifying herbaria.
They implement their system on a head-mounted display. Both
specialize in detecting 2D object categories, whereas my work focuses on
detecting objects whose intrinsic shape is 3D. Other works on object recognition
and tracking present on-device techniques that recognize multiple objects by
tracking them in videos. However, these perform single-instance
object recognition; my work focuses on category-level detection.

3. Single-Frame Detection

In this section I briefly review the concept of Hough Forests used to localize
the object.

4. Multi-Frame Detection

The
main contributing factor for considering the use of multiple frames for object
categorization and detection is the potential presence of extra evidence that
is not present in a single frame. The ready availability of such frames on
mobile platforms makes this extra evidence more compelling to consider.
However, the best approach to use this potentially advantageous evidence is not
immediately apparent. In this section, I define the multi-frame object detection
problem and then explore an approach that leverages this definition.

5. Mobile Application Blueprint

In this section I
describe my design for implementing the object detection algorithm on the
mobile device. In order for the application to run within a practical amount of
time and within the memory limits of the mobile device, I propose splitting the
task of detection between the client and the server.

The
mobile device is first used to capture either a single image or a short video
sequence. The image or frames are then processed to extract the features used
for object detection. If multi-frame detection is performed, then pixel
tracking is also completed on the device. This information is packaged and
transferred to the server side over an HTTP connection. On receipt of the
features and tracking information, the server runs the vote-transfer detector.
The result of this process is sent back to the client and displayed.

6. Experiments

In this section I
present some qualitative and quantitative results of my multi-frame approach.
I also provide timing analysis on both the mobile device (client) and desktop
(server) to justify the proposed client-server model.

6.1. Datasets

I evaluate the performance of my
algorithms to detect objects using two datasets. The first is a new multi-view
dataset that I collected, and the second is the Car Show Dataset introduced by
Ozuysal et al. The new multi-view dataset contains videos of objects such as
cars, mice, bicycles, and keyboards. There are 10 different instances of each
object category, and the video sequences contain about 6000 frames per instance
covering the entire viewing circle (360 degrees of the azimuth pose angle).
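
The angular spacing implied by these figures can be checked with a few lines of arithmetic. The sketch below assumes the frames are spread evenly over the 360-degree sweep and uses the sampling rate of one image per 10 frames given in this subsection; the function name `angular_spacing` is illustrative only.

```python
def angular_spacing(frames_per_sweep: int, frame_step: int,
                    sweep_degrees: float = 360.0) -> float:
    """Degrees between images taken every `frame_step` frames,
    assuming frames are evenly spread over the sweep."""
    return sweep_degrees * frame_step / frames_per_sweep


FRAMES_PER_INSTANCE = 6000  # approximate figure from the text
SAMPLING_STEP = 10          # one sampled image per 10 frames

per_sample = angular_spacing(FRAMES_PER_INSTANCE, SAMPLING_STEP)
per_five_samples = angular_spacing(FRAMES_PER_INSTANCE, 5 * SAMPLING_STEP)
samples_per_instance = FRAMES_PER_INSTANCE // SAMPLING_STEP

print(f"{per_sample:.1f} degrees between sampled images")
print(f"{per_five_samples:.1f} degrees across 5 sampled images")
print(f"{samples_per_instance} sampled images per instance")
```

Under that even-spacing assumption, sampled images are about 0.6 degrees apart, or about 3 degrees for every 5 sampled frames.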
Each video is sampled at approximately one image per 10 frames. Each frame
contains a single instance of the object category. Additionally, the sampled
images are approximately 0.6° apart (or 3° every 5 sampled frames). The dataset
also contains bounding box information for the sampled frames. For each object
category, I train the Hough Forest using 420 images each of positive and negative
examples drawn from six instances, and use the remaining four instances
for testing. Also note that the objects are observed against a cluttered
background. At no stage are images or object instances from testing used in the
training set or vice versa.

I
also perform tests on the Car Show Dataset introduced by [19]. This dataset
contains 3600 images of 20 instances of cars with one image about every 3°. The
images are obtained as the cars are rotated on a turntable set-up. This dataset
is particularly relevant due to the real-world settings in which the images are
recorded. Also, the larger frame-to-frame angular distance allows me to further
test the capability of my method. I use a 50-50 evaluation approach where I train
the dictionary on ten instances with about 15 images per instance and the
remaining ten instances are used for testing.

Finally, in all tests performed on both datasets, a positive detection is
defined as one with at least 50% overlap with the ground-truth bounding box.

6.2.4 Performance vs. Image Resolution

Next I determine the optimal image resolution for detection. Since the time
and memory requirements for feature extraction increase with image
resolution, this is a key experiment to determine feasibility on a mobile
platform. To do this, I run a performance analysis of the detector in both
single and multi-frame versions on varying image resolutions. This experiment
allows me to infer the image resolution that provides suitably high
performance while minimizing the processing needs for feature extraction. My
experiments show that while the best performance is achieved at a resolution of
640×480, the performance at 320×240 remains reasonable by comparison, whereas
a large deterioration in performance is observed at 160×120.

6.3. Mobile Platform

In this section I first discuss some
implementation details. This is followed by timing analyses on the device only,
desktop only, and the client-server model.

6.3.1 Implementation

I implement
the object detector in three versions: a complete on-device single-frame
detector, a client-server single-frame application, and lastly its multi-frame
counterpart. The complete on-device application waits for the user to capture
an image. Once the image is captured, it is processed for feature
extraction through a pre-defined set of scales. Detection across multiple image
scales leads to a scale-invariant detector capable of identifying objects of
different sizes. After this, the Hough forest is loaded into system memory
and the patches are passed through the forest for label assignment. The patches
then cast their votes for the object center across all scales and possible aspect
ratios of the object. This is followed by post-processing involving non-maximum
suppression of the voting space. Scales and aspect ratios to be explored are
predefined based on object type. Also, the Hough forests are trained and
preloaded onto the device. In this on-device implementation, I assume the object
type is known; the application is capable of detecting cars, mice, and bicycles.

In
the remaining two versions, detection is run in parallel for multiple objects. For
both client-server implementations, feature extraction across various scales
is performed on the device. Patch labeling, voting, and post-processing are
performed on the server side. Additionally, for the multi-frame approach,
Lucas-Kanade (LK) tracking is performed on the device as well.

6.3.2 Client-Server Analysis

In this section I
demonstrate the merit of a client-server system through some experiments. The
tests were conducted on a Samsung G series device (device specifications are
available in [1]) running Android 2.2, on images of size 640×480 for detection.
This was the best mobile configuration at the time of implementation. LK tracking
is performed at the original image scale. Feature extraction produces a 16-channel
feature matrix and tracking produces a 2-channel displacement matrix. The
detection process is significantly slower on the device (~3 to 4 times slower). The
reason for the slower performance on the mobile platform is twofold: (a)
mobile devices have a slower, less powerful processor, and (b) mobile devices
lack a floating-point core to perform the floating-point computations required for
the detection step. I argue that this performance deficit justifies the use of
a client-server model. In contrast to the mobile implementation, the
client-server implementation significantly improves the performance from ~6s
(on device) to ~3s for exploring

7. Discussion

In this work I demonstrate a
new approach to multi-frame object detection using Hough Forests and demonstrate
through experiments the stability of this approach. The significance of this
method is its ability to bring evidence from minor changes in views to improve
object detection. To evaluate my technique, I introduce a new multi-view
dataset that consists of 3600 videos of three object categories. I further
test on the Car Show dataset. In addition to detection analysis, I also
compare two tracking approaches, study accuracy by varying frame distances,
and comment on the time taken to compute. Through a study of performance on
different image resolutions I further demonstrate the improved performance the
multi-frame method provides over its single-frame counterpart. Finally, I
demonstrate a realistic implementation of this technique on a mobile platform
using a client-server approach.

There
remains rich scope for future work to build on my contribution. First, a more
accurate and faster tracking technique would expand the applicability of this
technique beyond minor view changes. Secondly, this work has a natural
extension to pose estimation; however, using multiple views for
this requires understanding how view-point changes can foster pose estimation.
Also, compression of the feature and tracking matrices can help reduce
transmission time for the mobile implementation.

8. Acknowledgements

I acknowledge the
support of a Google Research Award and the Gigascale Systems Research Centre.
