Maximally stable extremal regions

In computer vision, maximally stable extremal regions (MSER) are used as a method of blob detection in images. This technique was proposed by Matas et al.[1] to find correspondences between image elements from two images with different viewpoints. This method of extracting a comprehensive number of corresponding image elements contributes to the wide-baseline matching, and it has led to better stereo matching and object recognition algorithms.

Terms and Definitions

Image I is a mapping I : D \subset \mathbb{Z}^2 \to S. Extremal regions are well defined on images if:

  1. S is totally ordered (total, antisymmetric and transitive binary relations \le exist).
  2. An adjacency relation  A \subset D \times D is defined.

Region Q is a contiguous subset of D. (For each p,q \in Q there is a sequence p, a_1, a_2, .., a_n, q and pAa_1, a_iAa_{i+1}, a_nAq.)

(Outer) Region Boundary \partial Q = \{ q \in D \setminus Q: \exists p \in Q : qAp \}, which means the boundary \partial Q of  Q is the set of pixels adjacent to at least one pixel of Q but not belonging to Q.

Extremal Region Q \subset D is a region such that either for all  p \in Q, q \in \partial Q : I(p) > I(q) (maximum intensity region) or for all  p \in Q, q \in \partial Q : I(p) < I(q) (minimum intensity region).

Maximally Stable Extremal Region Let Q_1,.., Q_{i-1}, Q_i,... be a sequence of nested extremal regions (Q_i \subset Q_{i+1}). Extremal region Q_{i*} is maximally stable if and only if q(i) = | Q_{i+\Delta} \setminus Q_{i-\Delta} | / |Q_i| has a local minimum at i*. (Here | \cdot | denotes cardinality.)\Delta \in S is a parameter of the method.

The equation checks for regions that remain stable over a certain number of thresholds. If a region Q_{i+\Delta} is not significantly larger than a region Q_{i-\Delta}, region Q_i is taken as a maximally stable region.

The concept more simply can be explained by thresholding. All the pixels below a given threshold are 'black' and all those above or equal are 'white'. Given a source image, if we generate a sequence of thresholded result images I_t where each image t corresponds to a increasing threshold t, we would see first a white image, then 'black' spots corresponding to local intensity minima will appear then grow larger. These 'black' spots will eventually merge, until the whole image is black. The set of all connected components in the sequence is the set of all extremal regions. In that sense, the concept of MSER is linked to the one of component tree of the image.[2] The component tree indeed provide an easy way for implementing MSER.[3]


Extremal regions

Extremal regions in this context have two important properties, that the set is closed under...

  1. continuous transformation of image coordinates. This means it is affine invariant and it doesn't matter if the image is warped or skewed.
  2. monotonic transformation of image intensities. The approach is of course sensitive to natural lighting effects as change of day light or moving shadows.

Advantages of MSER

Because the regions are defined exclusively by the intensity function in the region and the outer border, this leads to many key characteristics of the regions which make them useful. Over a large range of thresholds, the local binarization is stable in certain regions, and have the properties listed below.

Comparison to other region detectors

In Mikolajczyk et al.,[6] six region detectors are studied (Harris-affine, Hessian-affine, MSER, edge-based regions, intensity extrema, and salient regions). A summary of MSER performance in comparison to the other five follows.

MSER consistently resulted in the highest score through many tests, proving it to be a reliable region detector.[6]

Implementation

The original algorithm of Matas et al.[1] is O(n\,\log(\log(n))) in the number n\, of pixels. It proceeds by first sorting the pixels by intensity. This would take O(n)\, time, using BINSORT. After sorting, pixels are marked in the image, and the list of growing and merging connected components and their areas is maintained using the union-find algorithm. This would take O(n\,\log(\log(n))) time. In practice these steps are very fast. During this process, the area of each connected component as a function of intensity is stored producing a data structure. A merge of two components is viewed as termination of existence of the smaller component and an insertion of all pixels of the smaller component into the larger one. In the extremal regions, the 'maximally stable' ones are those corresponding to thresholds where the relative area change as a function of relative change of threshold is at a local minimum, i.e. the MSER are the parts of the image where local binarization is stable over a large range of thresholds.[1][6]

The component tree is the set of all connected components of the thresholds of the image, ordered by inclusion. Efficient (quasi-linear whatever the range of the weights) algorithms for computing it do exist.[2] Thus this structure offers an easy way for implementing MSER.[3]

More recently, Nister and Stewenius have proposed a truly (if the weight are small integers) worst-case O(n)\, method in,[5] which is also much faster in practice. This algorithm is similar to the one of Ph. Salembier et al.[7]

Robust wide-baseline algorithm

The purpose of this algorithm is to match MSERs to establish correspondence points between images. First MSER regions are computed on the intensity image (MSER+) and on the inverted image (MSER-). Measurement regions are selected at multiple scales: the size of the actual region, 1.5x, 2x, and 3x scaled convex hull of the region. Matching is accomplished in a robust manner, so it is better to increase the distinctiveness of large regions without being severely affected by clutter or non-planarity of the region's pre-image. A measurement taken from an almost planar patch of the scene with stable invariant description are called a 'good measurement'. Unstable ones or those on non-planar surfaces or discontinuities are called 'corrupted measurements'. The robust similarity is computed: For each M_A^i on region A, k regions B_1,\dots, B_k from the other image with the corresponding i-th measurement M_{B_1}^i ,\dots, M_{B_k}^i nearest to  M_A^i are found and a vote is cast suggesting correspondence of A and each of B_1, \dots , B_k. Votes are summed over all measurements, and using probability analysis, we pick out the 'good measurements' as the 'corrupt measurements' will likely spread their votes randomly. By applying RANSAC to the centers of gravity of the regions, we can compute a rough epipolar geometry. An affine transformation between pairs of potentially corresponding regions is computed, and correspondences define it up to a rotation, which is then determined by epipolar lines. The regions are then filtered, and the ones with correlation of their transformed images above a threshold are chosen. RANSAC is applied again with a more narrow threshold, and the final eipolar geometry is estimated by the eight-point algorithm.

This algorithm can be tested here (Epipolar or homography geometry constrained matches): WBS Image Matcher

Use in Text Detection

The MSER algorithm has been used in text detection by Chen by combining MSER with Canny edges. Canny edges are used to help cope with the weakness of MSER to blur. MSER is first applied to the image in question to determine the character regions. To enhance the MSER regions any pixels outside the boundaries formed by Canny edges are removed. The separation of the letter provided by the edges greatly increase the usability of MSER in the extraction of blurred text.[8] An alternative use of MSER in text detection is the work by Shi using a graph model. This method again applies MSER to the image to generate preliminary regions. These are then used to construct a graph model based on the position distance and color distance between each MSER, which is treated as a node. Next the nodes are separated into foreground and background using cost functions. One cost function is to relate the distance from the node to the foreground and background. The other penalizes nodes for being significantly different from its neighbor. When these are minimized the graph is then cut to separate the text nodes from the non-text nodes.[9]To enable text detection in a general scene, Neumann uses the MSER algorithm in a variety of projections. In addition to the greyscale intensity projection, he uses the red, blue, and green color channels to detect text regions that are color distinct but not necessarily distinct in greyscale intensity. This method allows for detection of more text than solely using the MSER+ and MSER- functions discussed above.[10]

Extensions and Adaptations

Other Applications

See also

External links

References

  1. 1.0 1.1 1.2 J. Matas, O. Chum, M. Urban, and T. Pajdla. "Robust wide baseline stereo from maximally stable extremal regions." Proc. of British Machine Vision Conference, pages 384-396, 2002.
  2. 2.0 2.1 L. Najman and M. Couprie: "Building the component tree in quasi-linear time"; IEEE Transaction on Image Processing, Volume 15, Numbers 11 , 2006, pp 3531-3539
  3. 3.0 3.1 3.2 Donoser, M. and Bischof, H. Efficient Maximally Stable Extremal Region (MSER) Tracking CVPR, 2006.
  4. 4.0 4.1 4.2 Forssen, P-E. and Lowe, D.G. "Shape Descriptors for Maximally Stable Extremal Regions" ICCV, 2007.
  5. 5.0 5.1 Nister, D. and Stewenius, H., "Linear Time Maximally Stable Extremal Regions", ECCV, 2008.
  6. 6.0 6.1 6.2 K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, T. Kadir and L. Van Gool: "A Comparison of Affine Region Detectors"; International Journal of Computer Vision, Volume 65, Numbers 1-2 / November, 2005, pp 43-72
  7. Salembier, Philippe; A. Oliveras and L. Garrido (1998). "Anti-extensive Connected Operators for Image and Sequence Processing". IEEE Transactions on Image Processing 7 (4): 555–570. doi:10.1109/83.663500.
  8. Chen, Huizhong; Tsai, Sam; Schroth, Georg; Chen, David; Grzeszczuk, Radek; Girod, Bernd. "Robust Text Detection in Natural Images with Edge-enhanced Maximally Stable Extremal Regions". Proc. IEEE International Conference on Image Processing 2011.
  9. Shi, Cunzhao; Wang, Chunheng; Xiao, Baihua; Gao, Song (15 January 2013). "Scene Text Detection Using Graph Model Built Upon Maximally Stable Extremal Regions". Pattern Recognition Letters 34 (2): 107–116.
  10. Neumann, Lukas; Matas, Jiri (2011). "A Method for Text Localization and Recognition in Real-World Images". ACCV 2010: 770–783.
  11. Forssen, P-E. Maximally Stable Colour Regions for Recognition and Matching, CVPR, 2007.
  12. Chavez, Aaron; Gustafson, David (2011). "Color-Based Extensions to MSERs". ISVC 2011: 358–366.