EE5731 - Panoramic Image Stitching

SIFT is an algorithm from the paper Distinctive Image Features from Scale-Invariant Keypoints, published by David G. Lowe in 2004. A solution to one of its applications, image stitching, is proposed in Automatic Panoramic Image Stitching using Invariant Features by Matthew Brown and David G. Lowe in 2007. Continuous Assessment 1 of Visual Computing is based on these two papers.

3. Scale Invariant Feature Transform (SIFT) Algorithm

It’s the third feature detection algorithm introduced in the module EE5731. Here is an overview and the contents of the module: An Overview of EE5731 Visual Computing.

The SIFT method was first published in 1999 and refined into the 2004 paper Distinctive Image Features from Scale-Invariant Keypoints by David Lowe from UBC. Although only scale invariance appears in the title, the features in SIFT are far more powerful than that: they are invariant to image rotation and illumination change, and partially invariant to 3D camera viewpoint.

The paper is well written, starting from the theory and ending with a complete solution for object recognition. So instead of going through the whole paper, this post only introduces the framework, accompanied by useful notes from implementing the algorithm during the CA.

  1. Scale Space Extrema: Difference of Gaussian (DoG)
  2. Keypoint Localization: Taylor Expansion
    1. Contrast Threshold
    2. Edge Threshold
  3. Orientation Assignment
  4. Keypoint Descriptor
  5. Homography
  6. RANSAC

Steps 1–4 are the main steps covered in the paper. Steps 5 and 6 are implemented along with SIFT in the assignment; they are covered in another paper, Automatic Panoramic Image Stitching using Invariant Features, from the same authors.

3.1 Scale Space Extrema: Difference of Gaussian (DoG)

Input: an image $I(x,y)$
Output: the DoG $D(x,y,\sigma)$ and the locations of the scale space extrema in each DoG

When we think of selecting anchor points from an image, we prefer stable ones, usually extreme points or corners. It’s easy to find the extreme points that have larger pixel values than their neighbouring pixels, but this will also include noise pixels and pixels on edges, making the features unstable. Stable means the feature points can be repeatably detected under different views of the same object. We also hope the same object gives close feature descriptors, to simplify matching. The design of the descriptors will be introduced later.

Scale space extrema are the extreme points coming from the difference-of-Gaussian (DoG) pyramids. Each pyramid is called an octave, which is formed by $s$ images filtered with $s$ Gaussian kernels. An octave of $s$ Gaussian-filtered images can create $s-1$ difference-of-Gaussian images. Then we rescale the image by down-sampling it by a factor of 2 and repeat the process.

For an image $I(x,y)$, the scale space at a particular scale, $L(x,y,\sigma)$, is the convolution with a variable-scale Gaussian $G(x,y,\sigma)$: $L(x,y,\sigma) = G(x,y,\sigma) * I(x,y)$

The Gaussian blur in two dimensions is the product of two one-dimensional Gaussian functions: $G(x,y,\sigma) = \frac{1}{2\pi \sigma^2}e^{-\frac{x^2+y^2}{2\sigma^2}}$

Note that the formula of a Gaussian function in one dimension is: $G(x, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{x^2}{2\sigma^2}}$
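As a quick check of these two formulas, the 2D kernel can be built as the outer product of two 1D kernels. A minimal NumPy sketch (the truncation at a 3$\sigma$ radius is an assumption of this post, not something specified in the paper):

```python
import numpy as np

def gaussian_kernel_2d(sigma):
    """Normalized 2D Gaussian kernel built as the outer product of two 1D kernels."""
    radius = int(np.ceil(3 * sigma))       # assumed truncation radius of 3*sigma
    x = np.arange(-radius, radius + 1)
    g1d = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    g2d = np.outer(g1d, g1d)               # equals (1/(2*pi*sigma^2)) * exp(-(x^2+y^2)/(2*sigma^2))
    return g2d / g2d.sum()                 # renormalize so the discrete kernel sums to 1
```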

The difference-of-Gaussian $D(x,y,\sigma)$ is: $D(x,y,\sigma) = L(x,y,k\sigma) - L(x,y,\sigma)$

The DoG function is a close approximation to the scale-normalized Laplacian of Gaussian (LoG). It has been shown that the extrema of the LoG produce the most stable image features compared with a range of other possible image functions, such as the gradient, Hessian, or Harris corner function.

The maxima and minima of the difference-of-Gaussian images are then detected by comparing a pixel to its 26 neighbours at the current and adjacent scales (8 neighbours in the current image and 9 neighbours each in the scales above and below).
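Putting 3.1 together, here is a minimal, hedged sketch of building one DoG octave and finding its local extrema with NumPy/SciPy, assuming a float image normalized to [0, 1]; the base scale of 1.6, the number of scales, the scale step, and the brute-force loops are choices of this post, not requirements of the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, sigma0=1.6, s=5):
    """One octave: s Gaussian-blurred images give s-1 difference-of-Gaussian images."""
    k = 2 ** (1.0 / (s - 1))                                  # assumed scale step within the octave
    blurred = [gaussian_filter(image, sigma0 * k ** i) for i in range(s)]
    return np.stack([blurred[i + 1] - blurred[i] for i in range(s - 1)])

def local_extrema(dogs):
    """Keep pixels larger (or smaller) than all 26 neighbours across adjacent scales."""
    keypoints = []
    for i in range(1, dogs.shape[0] - 1):                     # skip the first and last DoG
        for y in range(1, dogs.shape[1] - 1):
            for x in range(1, dogs.shape[2] - 1):
                cube = dogs[i - 1:i + 2, y - 1:y + 2, x - 1:x + 2]
                v = dogs[i, y, x]
                if v == cube.max() or v == cube.min():
                    keypoints.append((i, y, x))               # (scale index, row, col)
    return keypoints
```

The next octave would be produced by down-sampling the image by a factor of 2 and repeating the same steps.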

3.2 Keypoint Localization: Taylor Expansion

Input: the locations of the scale space extrema from a DoG
Output: refined interpolated locations of the scale space extrema

Simply using the locations and pixel values of the candidate keypoints from 3.1 would not make the algorithm invalid. However, noise pixels also give high responses in the DoG and would be detected as keypoints.

The Taylor expansion of $D(x, y, \sigma)$ up to quadratic terms is used to find the location of the real extremum, $(\hat{x}, \hat{y})$ (or $(x+\hat{x}, y+\hat{y})$, where $(\hat{x}, \hat{y})$ is the offset).

Let $\bm{x} = (x, y, \sigma)^T$ be the location of the keypoint $(x, y)$ in the DoG with variance $\sigma$; we have: $D(x, y, \sigma) = D(\bm{x}) \approx D + \left(\frac{\partial D}{\partial \bm{x}}\right)^T \bm{x} + \frac{1}{2} \bm{x}^T \frac{\partial^2 D}{\partial \bm{x}^2} \bm{x}$

See Wikipedia for more help on vector multiplication. Setting the derivative $\frac{\partial D(\bm{x})}{\partial \bm{x}} = 0$ gives the extremum: $\hat{\bm{x}} = -\left(\frac{\partial^2 D}{\partial \bm{x}^2}\right)^{-1} \frac{\partial D}{\partial \bm{x}}$

Since the computation only involves a 3×3×3 block around the keypoint candidate $\bm{x}$ (3×3 neighbourhoods in the current DoG image and in the scales above and below), we can copy the block and set its centre as the origin. Then $\hat{\bm{x}}$ becomes the offset from the centre.

If $\hat{\bm{x}}$ is larger than 0.5 in any dimension, the actual extremum lies closer to a neighbouring sample point than to $\bm{x}$. In that case we move the centre to that neighbouring point, fetch a new 3×3×3 block around it, and repeat the calculation until all dimensions of $\hat{\bm{x}}$ are no larger than 0.5. With the final extremum, we update the keypoint with the refined location.
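A hedged sketch of this refinement: the gradient and Hessian of $D$ are approximated with central finite differences on the 3×3×3 block, and the offset is $\hat{\bm{x}} = -(\partial^2 D/\partial \bm{x}^2)^{-1}\,\partial D/\partial \bm{x}$. The iteration cap of 5 is an assumption of this post, not a value from the paper:

```python
import numpy as np

def refine_keypoint(dogs, i, y, x, max_iter=5):
    """Interpolate the extremum location inside the DoG stack (scale, row, col)."""
    for _ in range(max_iter):
        d = dogs[i - 1:i + 2, y - 1:y + 2, x - 1:x + 2].astype(float)
        # gradient of D at the centre pixel, by central finite differences
        grad = 0.5 * np.array([d[2, 1, 1] - d[0, 1, 1],     # dD/dsigma
                               d[1, 2, 1] - d[1, 0, 1],     # dD/dy
                               d[1, 1, 2] - d[1, 1, 0]])    # dD/dx
        # 3x3 Hessian of D, also by finite differences
        dss = d[2, 1, 1] - 2 * d[1, 1, 1] + d[0, 1, 1]
        dyy = d[1, 2, 1] - 2 * d[1, 1, 1] + d[1, 0, 1]
        dxx = d[1, 1, 2] - 2 * d[1, 1, 1] + d[1, 1, 0]
        dsy = 0.25 * (d[2, 2, 1] - d[2, 0, 1] - d[0, 2, 1] + d[0, 0, 1])
        dsx = 0.25 * (d[2, 1, 2] - d[2, 1, 0] - d[0, 1, 2] + d[0, 1, 0])
        dyx = 0.25 * (d[1, 2, 2] - d[1, 2, 0] - d[1, 0, 2] + d[1, 0, 0])
        H = np.array([[dss, dsy, dsx],
                      [dsy, dyy, dyx],
                      [dsx, dyx, dxx]])
        offset = -np.linalg.solve(H, grad)       # x_hat = -H^{-1} * grad
        if np.all(np.abs(offset) <= 0.5):        # converged: the extremum is within this sample
            return (i, y, x), offset
        # otherwise move to the neighbouring sample point and try again
        i, y, x = (i + int(round(offset[0])),
                   y + int(round(offset[1])),
                   x + int(round(offset[2])))
    return (i, y, x), offset
```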

3.2.1 Contrast Threshold

Input: the refined location of a keypoint $\hat{\bm{x}}$
Output: the contrast of the keypoint and the decision on whether it’s a noise pixel or not

The extremum location $\hat{\bm{x}}$ has another use in noise rejection. Most noise responses are not that strong: if the interpolated amplitude of the keypoint in the DoG, $\vert D(\hat{\bm{x}}) \vert = \vert D + \frac{1}{2} \left(\frac{\partial D}{\partial \bm{x}}\right)^T \hat{\bm{x}} \vert$, is less than 0.03 (the threshold used in the paper), the keypoint is dropped as a noise pixel.

Note that the image is normalized from $[0, 255]$ to $[0, 1]$.
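Using the DoG value at the candidate, the finite-difference gradient from the refinement sketch above, and the interpolated offset, the contrast test could look like this (the 0.03 threshold assumes pixel values normalized to [0, 1]):

```python
def passes_contrast(d_center, grad, offset, threshold=0.03):
    """Low-contrast check: keep the keypoint only if |D(x_hat)| is large enough."""
    d_hat = d_center + 0.5 * grad.dot(offset)   # D(x_hat) = D + 0.5 * (dD/dx)^T * x_hat
    return abs(d_hat) >= threshold
```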

3.2.2 Edge Threshold

Input: the refined location of a keypoint $\hat{\bm{x}}$
Output: whether the keypoint lies on an edge (line) or on a corner

Using the 3×3 block around the keypoint we can also compute a 2×2 Hessian matrix. Its trace and determinant can be written as the sum and the product of its two eigenvalues, $\alpha$ and $\beta$ (say $\alpha > \beta$), which are proportional to the principal curvatures: $\alpha$ corresponds to the maximum curvature at the point and $\beta$ to the minimum. $\bm{H} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{yx} & D_{yy} \end{bmatrix}$, $Tr(\bm{H}) = D_{xx} + D_{yy} = \alpha + \beta$, $Det(\bm{H}) = D_{xx} D_{yy} - D_{xy} D_{yx} = D_{xx} D_{yy} - D_{xy}^2 = \alpha \beta$

Let $r = \alpha / \beta$; then: $\frac{Tr(\bm{H})^2}{Det(\bm{H})} = \frac{(\alpha + \beta)^2}{\alpha \beta} = \frac{(r+1)^2}{r}$

The empirical threshold is $r = 10$, for which $\frac{(r+1)^2}{r} = 12.1$. If $\frac{Tr(\bm{H})^2}{Det(\bm{H})} > \frac{(r+1)^2}{r}$, then the keypoint is more likely to lie on an edge (line) and is discarded.

Consider the image as a 3D surface, where the height $z$ of a point is the pixel value at $(x, y)$. If the point lies on a line, then it sits on a ridge or valley, making $\vert \alpha \vert \gg \beta$.

Figure 1
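A minimal, hedged sketch of this edge test on a single DoG image, with the Hessian entries approximated by finite differences around the keypoint ($r = 10$ as in the paper):

```python
import numpy as np

def passes_edge_test(dog, y, x, r=10.0):
    """Reject keypoints whose principal curvature ratio exceeds r (likely on an edge)."""
    dxx = dog[y, x + 1] - 2 * dog[y, x] + dog[y, x - 1]
    dyy = dog[y + 1, x] - 2 * dog[y, x] + dog[y - 1, x]
    dxy = 0.25 * (dog[y + 1, x + 1] - dog[y + 1, x - 1]
                  - dog[y - 1, x + 1] + dog[y - 1, x - 1])
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:                      # curvatures have different signs: reject
        return False
    return tr * tr / det < (r + 1) ** 2 / r
```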

3.3 Orientation Assignment

Input: the image location and scale of a keypoint
Output: an orientation histogram and the dominant orientation(s) assigned to the keypoint
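The paper builds a 36-bin histogram of gradient orientations in a window around the keypoint, with each sample weighted by its gradient magnitude and by a Gaussian of width 1.5$\sigma$, and takes the histogram peak as the keypoint orientation. A hedged sketch (the window radius of three times 1.5$\sigma$ and the function name are assumptions of this post):

```python
import numpy as np

def dominant_orientation(L, y, x, sigma, num_bins=36):
    """36-bin histogram of gradient orientations around (y, x), Gaussian-weighted."""
    radius = int(round(3 * 1.5 * sigma))              # assumed window radius
    hist = np.zeros(num_bins)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            mag = np.hypot(gx, gy)
            theta = np.arctan2(gy, gx) % (2 * np.pi)
            weight = np.exp(-(dy**2 + dx**2) / (2 * (1.5 * sigma) ** 2))
            hist[int(theta / (2 * np.pi) * num_bins) % num_bins] += weight * mag
    return hist.argmax() * 2 * np.pi / num_bins       # dominant orientation in radians
```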

3.4 Keypoint Descriptor

Input: the refined location, scale, and orientation of a keypoint
Output: a 4x4x8 = 128 element feature vector

All the steps above are from the paper Distinctive Image Features from Scale-Invariant Keypoints. The library VLFeat used in the CA provides a reference implementation of these steps.
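As a rough illustration of how the 128-element vector is formed, here is a heavily simplified, hedged sketch: gradients in a 16×16 window are pooled into a 4×4 grid of cells with 8 orientation bins each. Unlike the paper, this sketch rotates only the gradient orientations (not the sampling window) and omits the Gaussian weighting and trilinear interpolation:

```python
import numpy as np

def sift_descriptor(L, y, x, theta, cell=4, grid=4, bins=8):
    """Simplified 4x4x8 = 128-D descriptor around (y, x) with keypoint orientation theta."""
    half = cell * grid // 2                            # 16x16 window -> half-width of 8
    desc = np.zeros((grid, grid, bins))
    for dy in range(-half, half):
        for dx in range(-half, half):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            mag = np.hypot(gx, gy)
            # gradient orientation relative to the keypoint orientation (rotation invariance)
            ori = (np.arctan2(gy, gx) - theta) % (2 * np.pi)
            cy, cx = (dy + half) // cell, (dx + half) // cell
            b = int(ori / (2 * np.pi) * bins) % bins
            desc[cy, cx, b] += mag
    v = desc.ravel()
    v = v / (np.linalg.norm(v) + 1e-12)                # normalize to unit length
    v = np.minimum(v, 0.2)                             # clamp large values, as in the paper
    return v / (np.linalg.norm(v) + 1e-12)             # renormalize
```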

3.5 Homography

Input: $n$ pairs of anchors in both images ($n \geq 4$, usually 6)
Output: the homography matrix

Under construction
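Although this part is still under construction, a common way to estimate the homography from point pairs is the direct linear transform (DLT): each correspondence contributes two rows to a $2n \times 9$ system $A\bm{h} = 0$, which is solved via the SVD. A hedged sketch (point normalization, which improves conditioning, is omitted):

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate a 3x3 homography H such that dst ~ H @ src, from n >= 4 point pairs."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(A, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)            # null-space vector = last row of V^T
    return H / H[2, 2]                  # scale so that H[2,2] = 1
```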

3.6 RANSAC

Input: all keypoints in both images
Output: the best $n$ pairs of keypoints to be the anchors ($n \geq 4$)

Under construction
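Similarly, a hedged sketch of how RANSAC could be wired around the DLT sketch above: repeatedly fit a homography to a random sample of 4 matches, count inliers by reprojection error, and keep the model with the most inliers. The iteration count and pixel threshold are assumptions of this post, not values from the paper:

```python
import numpy as np

def ransac_homography(src, dst, n_iter=1000, thresh=3.0):
    """Pick the homography (and its inliers) that best explains the matched keypoints."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    best_H, best_inliers = None, np.zeros(len(src), bool)
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        sample = rng.choice(len(src), size=4, replace=False)
        H = homography_dlt(src[sample], dst[sample])
        # project all source points and measure reprojection error in pixels
        ones = np.ones((len(src), 1))
        proj = (H @ np.hstack([src, ones]).T).T
        proj = proj[:, :2] / proj[:, 2:3]
        err = np.linalg.norm(proj - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    # refit on all inliers for a more accurate final estimate
    if best_inliers.sum() >= 4:
        best_H = homography_dlt(src[best_inliers], dst[best_inliers])
    return best_H, best_inliers
```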

3.7 Implementing Automatic Image Stitching
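The full pipeline chains the pieces above: detect and describe keypoints in both images, match descriptors, estimate the homography robustly, and warp one image into the other's frame. A rough, hedged sketch only (the 0.8 ratio test comes from the SIFT paper; the canvas size, the use of OpenCV's warpPerspective for the final warp, and the keypoint/descriptor inputs are assumptions of this post, not the CA's required solution):

```python
import numpy as np
import cv2  # used here only for the final warp; any equivalent warp would do

def stitch_pair(img1, img2, kp1, desc1, kp2, desc2):
    """Stitch img2 onto img1's frame, given (x, y) keypoints and 128-D descriptors for both."""
    # 1. match descriptors by nearest neighbour with the ratio test (threshold 0.8)
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nn = d.argmin(axis=1)
    sorted_d = np.sort(d, axis=1)
    keep = sorted_d[:, 0] < 0.8 * sorted_d[:, 1]
    src = np.array([kp2[nn[i]] for i in np.flatnonzero(keep)])   # points in img2
    dst = np.array([kp1[i] for i in np.flatnonzero(keep)])       # points in img1
    # 2. robustly estimate the homography mapping img2 into img1
    H, inliers = ransac_homography(src, dst)
    # 3. warp img2 onto a canvas (assumed twice the width), then paste img1 on top
    h, w = img1.shape[:2]
    canvas = cv2.warpPerspective(img2, H, (w * 2, h))
    canvas[:h, :w] = img1
    return canvas
```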

3.9 References

  1. Distinctive Image Features from Scale-Invariant Keypoints
  2. Automatic Panoramic Image Stitching using Invariant Features