A new view-based template approach to the representation of action is presented. The work is motivated by the observation that a human observer can easily and instantly recognize action in extremely low resolution imagery with no strong features or information about the three-dimensional structure of the scene. Our underlying representations for action are view-based template descriptions of the coarse image motion. Using these descriptions, we propose an appearance-based recognition strategy embedded within a hypothesis-and-test paradigm.
A binary motion energy image (MEI) is first computed to act as an index into the action library. The MEI coarsely describes the spatial distribution of motion energy for a given view of a given action. Any stored MEIs that plausibly match the unknown input MEI are then tested for coarse motion-history agreement with a known motion model of the action.
A motion history image (MHI) is the basis of that representation. The MHI is a static image template where pixel intensity is a function of the recency of motion in a sequence. Recognition is accomplished in a feature-based statistical framework.
The motion template technology has been used to recognize human movements within interactive environments such as Virtual PAT and The KidsRoom.
Video (blurred action sequence): 1.3MB Quicktime, 0.1MB MPEG.
The motivation for the approach presented in this research can be demonstrated with a single video sequence (see the blurred action sequence above). The video is a heavily blurred sequence (in this case an up-sampling from images of resolution 15x20 pixels) of a human performing a simple, yet readily recognizable, activity. When shown this video, the vast majority of a room full of spectators could identify the action in less than one second from the start of the sequence. What should be quite apparent is that most of the individual frames contain no discernible image of a human being. Even if a system knew that the images were of a person, no particular pose could reasonably be assigned, because the imagery lacks the necessary features.
When viewing the motion in a blurred sequence, two distinct patterns are apparent. The first is the spatial region in which the motion is occurring. This pattern is defined by the area of pixels where something is changing, largely independent of how it is moving. The second pattern is how the motion itself is behaving within these regions (e.g. an expanding or rotating field in a particular location). We developed our methods to exploit these notions of where and how, believing that these observations capture significant motion properties of actions that can be used for recognition.
Consider the example of someone sitting, as shown in the figure below. The top row contains key frames from a sitting sequence. The bottom row displays a cumulative binary motion energy image (MEI) sequence corresponding to the frames above. The MEIs highlight regions in the image where any form of motion was present. The summation of the square of consecutive image differences often provides a robust spatial motion-distribution signal. Image differencing also permits real-time acquisition of the MEIs. As expected, the MEI sequence sweeps out a particular (and perhaps distinctive) region of the image. Our claim is that the shape of the region can be used to suggest both the action occurring and the viewing condition (angle).
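As a rough illustration of the MEI idea described above, the following minimal sketch sums the squared differences of consecutive frames and thresholds the result to obtain a binary image. The function name `motion_energy_image`, the use of NumPy, and the `threshold` value are assumptions made for this example; they are not taken from the original system.

```python
import numpy as np


def motion_energy_image(frames, threshold=100.0):
    """Return a binary MEI for a sequence of grayscale frames.

    Hypothetical helper: accumulates the squared differences of consecutive
    frames and marks pixels whose accumulated energy exceeds `threshold`
    (an assumed, tunable value).
    """
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    energy = np.zeros_like(frames[0])
    for prev, curr in zip(frames, frames[1:]):
        # Squared difference responds to any change, regardless of direction.
        energy += (curr - prev) ** 2
    return energy > threshold


if __name__ == "__main__":
    # Toy sequence: one bright pixel moving left to right along row 1.
    seq = []
    for col in range(3):
        frame = np.zeros((4, 4))
        frame[1, col] = 255.0
        seq.append(frame)
    print(motion_energy_image(seq).astype(int))
```

In this toy sequence the MEI marks the three pixels the bright spot passed through, which is the "where" component of the motion; it carries no information about the order in which they were visited.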
To represent how the motion is moving, we developed a motion history image (MHI). In an MHI, pixel intensity is a function of the motion history at that location, where brighter values correspond to more recent motion. We currently use a simple replacement and linear decay operator applied to the binary image difference frames. Examples of MHIs for three actions (sit-down, arms-raise, crouch-down) are presented in the figure below right. Notice that the final motion locations appear brighter in the MHIs.
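One plausible reading of a "replacement and linear decay" operator can be sketched as follows: wherever the binary difference frame signals motion, the MHI pixel is replaced with a maximum value `tau`; elsewhere it decays by one per frame, down to zero. The names `update_mhi` and `motion_history_image`, the parameter `tau`, and the differencing threshold are assumptions for illustration, and the exact operator used in the original system may differ.

```python
import numpy as np


def update_mhi(mhi, motion_mask, tau):
    """One MHI step: replace with `tau` where motion occurred, else decay by 1."""
    decayed = np.maximum(mhi - 1.0, 0.0)
    return np.where(motion_mask, float(tau), decayed)


def motion_history_image(frames, tau=None, diff_threshold=15.0):
    """Build an MHI from grayscale frames (hypothetical helper).

    `diff_threshold` binarizes the absolute frame difference and `tau` sets
    how many frames a motion trace persists; both are assumed values.
    """
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    if tau is None:
        tau = len(frames) - 1
    mhi = np.zeros(frames[0].shape)
    for prev, curr in zip(frames, frames[1:]):
        motion_mask = np.abs(curr - prev) > diff_threshold
        mhi = update_mhi(mhi, motion_mask, tau)
    return mhi


if __name__ == "__main__":
    # Toy sequence: one bright pixel moving left to right along row 1.
    seq = []
    for col in range(3):
        frame = np.zeros((4, 4))
        frame[1, col] = 255.0
        seq.append(frame)
    print(motion_history_image(seq))
```

With this operator, the most recently moving pixels hold the largest values and older motion fades linearly, so the resulting template encodes the direction of the sweep as an intensity gradient.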
Video: 14.5MB Quicktime, 7.5MB MPEG.
Results show reasonable recognition with an MHI verification method that automatically performs temporal segmentation, is invariant to linear changes in speed, and runs in real time on a standard platform.
Here is a short demo of the current system using two cameras. The top two images show the camera input with motion bounding regions. Bounding boxes are used to account for the possibility of multiple (separate) people or objects. White boxes identify valid motion regions. The middle two images show the corresponding MHIs for the frames above. The "virtual" room at the bottom shows an avatar of me in particular poses when the system identifies any of the recognizable actions (sitting, waving, crouching).