Articulated body pose estimation
Articulated body pose estimation, in computer vision, is the study of algorithms and systems that recover the pose of an articulated body (one consisting of joints and rigid parts) from image-based observations. It is one of the longest-standing problems in computer vision, both because of the complexity of the models that relate observation to pose and because of the variety of situations in which it would be useful.[1][2]
Description
Perception of human beings in the surrounding environment is an important capability that robots must possess. If a person uses gestures to point to a particular object, the interacting machine should be able to understand the situation in a real-world context. Pose estimation is therefore an important and challenging problem in computer vision, and many algorithms have been deployed to solve it over the last two decades. Many solutions involve training complex models on large data sets.
Pose estimation is a difficult problem and an active subject of research because the human body has 244 degrees of freedom with 230 joints. Although not all movements between joints are evident, the human body can be modeled as ten large parts with twenty degrees of freedom. Algorithms must account for the large variability introduced by differences in appearance due to clothing, body shape, size, and hairstyle. The results may also be ambiguous due to partial occlusions, whether from self-articulation, such as a person's hand covering their face, or from external objects. Furthermore, most algorithms estimate pose from monocular (two-dimensional) images taken with a normal camera; these images lack the three-dimensional information of an actual body pose, leading to further ambiguities. Other issues include varying lighting and camera configurations, and the difficulties are compounded when there are real-time performance requirements. There is recent work in this area in which images from RGBD cameras provide information about both color and depth.[3]
There is a need to develop accurate, tether-less, vision-based articulated body pose estimation systems to recover the pose of bodies such as the human body, a hand, or non-human creatures. Such a system has several foreseeable applications, including
- Marker-less motion capture for human-computer interfaces,
- Physiotherapy,
- 3D animation,
- Ergonomics studies,
- Robot control, and
- Visual surveillance.
The typical articulated body pose estimation system involves a model-based approach, in which the pose estimate is obtained by maximizing the similarity (or, equivalently, minimizing the dissimilarity) between an observation (input) and a template model. Different kinds of sensors have been explored for use in making the observation, including
- Visible wavelength imagery,
- Long-wave thermal infrared imagery,[4]
- Time-of-flight imagery, and
- Laser range scanner imagery.
These sensors produce intermediate representations that are directly used by the model; these representations include
- Image appearance,
- Voxel (volume element) reconstruction,
- 3D point clouds and sums of Gaussian kernels,[5] and
- 3D surface meshes.
Part Models
The basic idea of the part-based model is inspired by the human skeleton. Any object with the property of articulation can be decomposed into smaller parts, each of which can take a different orientation, resulting in different articulations of the same object. Different scales and orientations of the main object therefore correspond to scales and orientations of its parts. To formulate the model in mathematical terms, the parts are connected to each other by springs; for this reason it is also known as the spring model. The degree of closeness between parts is accounted for by the compression and expansion of the springs. There are also geometric constraints on the orientation of the springs: for example, the limbs cannot rotate a full 360 degrees, so parts cannot take such extreme orientations, which reduces the number of possible configurations.[6]
The spring model forms a graph $G(V, E)$, where the nodes $V$ correspond to the parts and the edges $E$ represent the springs connecting neighboring parts. Each location in the image can be described by the $x$ and $y$ coordinates of a pixel. Let $l_i = (x_i, y_i)$ be the location of part $v_i$. Then the cost associated with the spring connecting parts $v_i$ and $v_j$, placed at $l_i$ and $l_j$, can be written $d_{ij}(l_i, l_j)$. Hence the total cost associated with placing the $n$ components at locations $L = (l_1, \ldots, l_n)$ is given by

$$S(L) = \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j).$$
The above equation simply represents the spring model used to describe body pose. To estimate pose from images, a cost or energy function must be minimized. This energy function consists of two terms: the first measures how well each component matches the image data, and the second measures how much the oriented (deformed) parts agree with the spring constraints, thus accounting for articulation along with object detection:

$$E(L) = \sum_{i=1}^{n} m_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j),$$

where $m_i(l_i)$ is the cost of placing part $v_i$ at location $l_i$ given the image data.[7]
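The energy minimization above can be sketched on a toy problem. In the sketch below, the part names, match-cost tables, spring rest length, and grid size are all illustrative assumptions (not values from any real detector), and the minimization is brute force rather than the dynamic-programming approach used in practice on tree-structured models:

```python
import itertools
import math

PARTS = ["torso", "head"]
EDGES = [("torso", "head")]              # springs between neighboring parts
REST_LENGTH = {("torso", "head"): 2.0}   # preferred spring length (pixels)

# m_i(l_i): how poorly part i matches the image at location l_i.
# Faked here with a small table of "detector responses" over a 4x4 grid.
MATCH_COST = {
    "torso": {(1, 1): 0.1, (0, 0): 2.0},
    "head":  {(1, 3): 0.2, (2, 2): 1.5},
}
GRID = [(x, y) for x in range(4) for y in range(4)]

def unary(part, loc):
    return MATCH_COST[part].get(loc, 1.0)   # default cost off the peaks

def spring(edge, li, lj):
    # d_ij(l_i, l_j): deformation cost, squared deviation from rest length
    return (math.dist(li, lj) - REST_LENGTH[edge]) ** 2

def energy(locs):
    e = sum(unary(p, locs[p]) for p in PARTS)
    e += sum(spring((i, j), locs[i], locs[j]) for (i, j) in EDGES)
    return e

# Brute-force minimization over all placements (fine for 2 parts on a 4x4
# grid; real systems exploit the tree structure with distance transforms).
best = min(
    (dict(zip(PARTS, combo)) for combo in itertools.product(GRID, repeat=len(PARTS))),
    key=energy,
)
print(best, round(energy(best), 3))
```

The minimizer picks the locations where the detector responses are strong and the head sits at the spring's rest length from the torso, illustrating how the unary and pairwise terms trade off.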
The part models, also known as pictorial structures, are among the basic models on which other, more efficient models are built by slight modification. One such example is the flexible mixture model, which reduces the database of hundreds or thousands of deformed parts by exploiting the notion of local rigidity.[8]
Articulated Model with Quaternions[9]
The kinematic skeleton is constructed by a tree-structured chain, as illustrated in the figure. Each rigid body segment $S_i$ has its own local coordinate system, which can be transformed to the world coordinate system via a 4×4 transformation matrix $T_i$,

$$T_i = T_{\mathrm{par}(i)} \, R_i,$$

where $R_i$ denotes the local transformation from body segment $S_i$ to its parent $\mathrm{par}(S_i)$. Each joint in the body has a 3-degree-of-freedom (DoF) rotation. Given a transformation matrix $T_i$, the joint position $\bar{p}$ at the T-pose can be transferred to its corresponding position $p = T_i \bar{p}$ in the world coordinate system. In many works, the 3D joint rotation is expressed as a normalized quaternion due to its continuity, which facilitates gradient-based optimization in the parameter estimation.
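Composing local transforms down such a chain can be sketched as follows. The two-segment "shoulder to elbow" chain, the segment offsets, and the joint rotations below are made-up illustrative values, not part of any published skeleton model:

```python
import numpy as np

def quat_to_rot(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)   # normalize, as in the text
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def local_transform(q, offset):
    """4x4 homogeneous transform: rotate by q, translate by offset from parent."""
    T = np.eye(4)
    T[:3, :3] = quat_to_rot(np.asarray(q, dtype=float))
    T[:3, 3] = offset
    return T

# Tree-structured chain, parent first: world -> upper arm -> forearm.
# Each joint's 3-DoF rotation is stored as a quaternion.
chain = [
    local_transform([1, 0, 0, 0], [0.0, 1.4, 0.0]),            # shoulder, no rotation
    local_transform([np.cos(np.pi/4), 0, 0, np.sin(np.pi/4)],  # elbow, 90 deg about z
                    [0.3, 0.0, 0.0]),
]

# Compose local transforms down the chain to get the world transform,
# then map a joint position from the T-pose into world coordinates.
T_world = np.eye(4)
for T_local in chain:
    T_world = T_world @ T_local

wrist_tpose = np.array([0.25, 0.0, 0.0, 1.0])  # homogeneous point in forearm frame
print((T_world @ wrist_tpose)[:3])             # ~ [0.3, 1.65, 0.0]
```

Normalizing the quaternion inside `quat_to_rot` mirrors the normalization step mentioned above, which is what keeps the parameterization well behaved during gradient-based optimization.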
Applications
Assisted living
Personal care robots may be deployed in future assisted living homes. For these robots, high-accuracy human detection and pose estimation is necessary to perform a variety of tasks, such as fall detection. Additionally, this application has a number of performance constraints.
Character animation
Traditionally, character animation has been a manual process; however, poses can be synced directly to a real-life actor through specialized pose estimation systems. Older systems relied on markers or specialized suits; recent advances in pose estimation and motion capture have enabled markerless applications, sometimes in real-time.[10]
Intelligent driver assisting system
Car accidents account for roughly two percent of deaths globally each year. As such, an intelligent system that constantly tracks the driver's pose may be useful for emergency alerts. Along the same lines, pedestrian detection algorithms have been used successfully in autonomous cars, enabling the car to make smarter decisions.
Video games
Commercially, pose estimation has been used in the context of video games, popularized with the Microsoft Kinect sensor (a depth camera). These systems track the user to render their avatar in-game, in addition to performing tasks like gesture recognition to enable the user to interact with the game. As such, this application has a strict real-time requirement.[11]
Other Applications
Other applications include physical therapy, study of cognitive brain development of young children, video surveillance, animal tracking and behavior understanding to preserve endangered species, sign language detection, advanced human–computer interaction, and markerless motion capture.
Related technology
A commercially successful but specialized computer vision-based articulated body pose estimation technique is optical motion capture. This approach involves placing markers on the individual at strategic locations to capture the 6 degrees-of-freedom of each body part.
Active Research Groups
A number of groups are actively researching pose estimation, including groups at Brown University; Carnegie Mellon University; MPI Saarbruecken; Stanford University; the University of California, San Diego; the University of Toronto; the École Centrale de Paris; ETH Zurich; the National University of Sciences and Technology (NUST);[12] and the University of California, Irvine.
References
- ↑ Survey of Computer Vision-Based Human Motion Capture (2001)
- ↑ Survey of Advances in Computer Vision-based Human Motion Capture (2006)
- ↑ Droeschel, David, and Sven Behnke. "3D body pose estimation using an adaptive person model for articulated ICP." Intelligent Robotics and Applications. Springer Berlin Heidelberg, 2011. 157–167.
- ↑ Han, J.; Gaszczak, A.; Maciol, R.; Barnes, S.E.; Breckon, T.P. (September 2013). "Human Pose Classification within the Context of Near-IR Imagery Tracking". Proc. SPIE Optics and Photonics for Counterterrorism, Crime Fighting and Defence (PDF). 8901. SPIE. pp. 1–10. doi:10.1117/12.2028375. Retrieved 5 November 2013.
- ↑ M. Ding and G. Fan, "Generalized Sum of Gaussians for Real-Time Human Pose Tracking from a Single Depth Sensor" 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), Jan 2015
- ↑ Fischler, Martin A., and Robert A. Elschlager. "The representation and matching of pictorial structures." IEEE Transactions on Computers 1 (1973): 67–92.
- ↑ Felzenszwalb, Pedro F., and Daniel P. Huttenlocher. "Pictorial structures for object recognition." International Journal of Computer Vision 61.1 (2005): 55–79.
- ↑ Yang, Yi, and Deva Ramanan. "Articulated pose estimation with flexible mixtures-of-parts." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
- ↑ M. Ding and G. Fan, "Articulated and Generalized Gaussian Kernel Correlation for Human Pose Estimation" IEEE Transactions on Image Processing, Vol. 25, No. 2, Feb 2016
- ↑ Dent, Steven. "What you need to know about 3D motion capture". Engadget. AOL Inc. Retrieved 31 May 2017.
- ↑ Kohli, Pushmeet; Shotton, Jamie. "Key Developments in Human Pose Estimation for Kinect" (PDF). Microsoft. Retrieved 31 May 2017.
- ↑ http://rise.smme.nust.edu.pk/
External links
- Michael J. Black, Professor at Brown University
- Research Project Page of German Cheung at Carnegie Mellon University
- Homepage of Dr.-Ing at MPI Saarbruecken
- Markerless Motion Capture Project at Stanford
- Computer Vision and Robotics Research Laboratory at the University of California, San Diego
- Research Projects of David J. Fleet at the University of Toronto
- Ronald Poppe at the University of Twente.
- Professor Nikos Paragios at the Ecole Centrale de Paris
- Articulated Pose Estimation with Flexible Mixtures of Parts Project at UC Irvine
- http://screenrant.com/crazy3dtechnologyjamescameronavatarkofi3367/
- 2D articulated human pose estimation software
- Articulated Pose Estimation with Flexible Mixtures of Parts