image, many features from the background will not have any correct match in the database, giving rise to many false matches in addition to the correct ones. The correct matches can be filtered from the full set of matches by identifying subsets of keypoints that agree on the object and its location, scale, and orientation in the new image. The probability that several features will agree on these parameters by chance is much lower than the probability that any individual feature match will be in error. The determination of these consistent clusters can be performed rapidly by using an efficient hash table implementation of the generalized Hough transform.

Each cluster of 3 or more features that agree on an object and its pose is then subject to further detailed verification. First, a least-squares estimate is made for an affine approximation to the object pose. Any other image features consistent with this pose are identified, and outliers are discarded. Finally, a detailed computation is made of the probability that a particular set of features indicates the presence of an object, given the accuracy of fit and number of probable false matches. Object matches that pass all these tests can be identified as correct with high confidence.

2 Related research

The development of image matching by using a set of local interest points can be traced back to the work of Moravec (1981) on stereo matching using a corner detector. The Moravec detector was improved by Harris and Stephens (1988) to make it more repeatable under small image variations and near edges. Harris also showed its value for efficient motion tracking and 3D structure from motion recovery (Harris, 1992), and the Harris corner detector has since been widely used for many other image matching tasks.
While these feature detectors are usually called corner detectors, they are not selecting just corners, but rather any image location that has large gradients in all directions at a predetermined scale.

The initial applications were to stereo and short-range motion tracking, but the approach was later extended to more difficult problems. Zhang et al. (1995) showed that it was possible to match Harris corners over a large image range by using a correlation window around each corner to select likely matches. Outliers were then removed by solving for a fundamental matrix describing the geometric constraints between the two views of a rigid scene and removing matches that did not agree with the majority solution. At the same time, a similar approach was developed by Torr (1995) for long-range motion matching, in which geometric constraints were used to remove outliers for rigid objects moving within an image.

The ground-breaking work of Schmid and Mohr (1997) showed that invariant local feature matching could be extended to general image recognition problems in which a feature was matched against a large database of images. They also used Harris corners to select interest points, but rather than matching with a correlation window they used a rotationally invariant descriptor of the local image region. This allowed features to be matched under arbitrary orientation change between the two images. Furthermore, they demonstrated that multiple feature matches could accomplish general recognition under occlusion and clutter by identifying consistent clusters of matched features.

The Harris corner detector is very sensitive to changes in image scale, so it does not provide a good basis for matching images of different sizes. Earlier work by the author (Lowe, 1999) extended the local feature approach to achieve scale invariance.
This work also described a new local descriptor that provided more distinctive features while being less sensitive to local image distortions such as 3D viewpoint change. This current paper provides a more in-depth development and analysis of this earlier work, while also presenting a number of improvements in stability and feature invariance.

There is a considerable body of previous research on identifying representations that are stable under scale change. Some of the first work in this area was by Crowley and Parker (1984), who developed a representation that identified peaks and ridges in scale space and linked these into a tree structure. The tree structure could then be matched between images with arbitrary scale change. More recent work on graph-based matching by Shokoufandeh, Marsic and Dickinson (1999) provides more distinctive feature descriptors using wavelet coefficients. The problem of identifying an appropriate and consistent scale for feature detection has been studied in depth by Lindeberg (1993, 1994). He describes this as a problem of scale selection, and we make use of his results below.

Recently, there has been an impressive body of work on extending local features to be invariant to full affine transformations (Baumberg, 2000; Tuytelaars and Van Gool, 2000; Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002; Brown and Lowe, 2002). This allows for invariant matching to features on a planar surface under changes in orthographic 3D projection, in most cases by resampling the image in a local affine frame. However, none of these approaches are yet fully affine invariant, as they start with initial feature scales and locations selected in a non-affine-invariant manner due to the prohibitive cost of exploring the full affine space. The affine frames are also more sensitive to noise than those of the scale-invariant features,
so in practice the affine features have lower repeatability than the scale-invariant features unless the affine distortion is greater than about a 40 degree tilt of a planar surface (Mikolajczyk, 2002). Wider affine invariance may not be important for many applications, as training views are best taken at least every 30 degrees rotation in viewpoint (meaning that recognition is within 15 degrees of the closest training view) in order to capture non-planar changes and occlusion effects for 3D objects.

While the method to be presented in this paper is not fully affine invariant, a different approach is used in which the local descriptor allows relative feature positions to shift significantly with only small changes in the descriptor. This approach not only allows the descriptors to be reliably matched across a considerable range of affine distortion, but it also makes the features more robust against changes in 3D viewpoint for non-planar surfaces. Other advantages include much more efficient feature extraction and the ability to identify larger numbers of features. On the other hand, affine invariance is a valuable property for matching planar surfaces under very large view changes, and further research should be performed on the best ways to combine this with non-planar 3D viewpoint invariance in an efficient and stable manner.

Many other feature types have been proposed for use in recognition, some of which could be used in addition to the features described in this paper to provide further matches under differing circumstances. One class of features is those that make use of image contours or region boundaries, which should make them less likely to be disrupted by cluttered backgrounds near object boundaries. Matas et al. (2002) have shown that their maximally-stable extremal regions can produce large numbers of matching features with good stability.
Mikolajczyk et al. (2003) have developed a new descriptor that uses local edges while ignoring unrelated nearby edges, providing the ability to find stable features even near the boundaries of narrow shapes superimposed on background clutter. Nelson and Selinger (1998) have shown good results with local features based on groupings of image contours. Similarly, Pope and Lowe (2000) used features based on the hierarchical grouping of image contours, which are particularly useful for objects lacking detailed texture.

The history of research on visual recognition contains work on a diverse set of other image properties that can be used as feature measurements. Carneiro and Jepson (2002) describe phase-based local features that represent the phase rather than the magnitude of local spatial frequencies, which is likely to provide improved invariance to illumination. Schiele and Crowley (2000) have proposed the use of multidimensional histograms summarizing the distribution of measurements within image regions. This type of feature may be particularly useful for recognition of textured objects with deformable shapes. Basri and Jacobs (1997) have demonstrated the value of extracting local region boundaries for recognition. Other useful properties to incorporate include color, motion, figure-ground discrimination, region shape descriptors, and stereo depth cues. The local feature approach can easily incorporate novel feature types because extra features contribute to robustness when they provide correct matches, but otherwise do little harm other than their cost of computation. Therefore, future systems are likely to combine many feature types.

3 Detection of scale-space extrema

As described in the introduction, we will detect keypoints using a cascade filtering approach that uses efficient algorithms to identify candidate locations that are then examined in further detail.
The first stage of keypoint detection is to identify locations and scales that can be repeatably assigned under differing views of the same object. Detecting locations that are invariant to scale change of the image can be accomplished by searching for stable features across all possible scales, using a continuous function of scale known as scale space (Witkin, 1983).

It has been shown by Koenderink (1984) and Lindeberg (1994) that under a variety of reasonable assumptions the only possible scale-space kernel is the Gaussian function. Therefore, the scale space of an image is defined as a function, L(x, y, σ), that is produced from the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y):

    L(x, y, σ) = G(x, y, σ) ∗ I(x, y),

where ∗ is the convolution operation in x and y, and

    G(x, y, σ) = (1 / 2πσ²) e^(−(x² + y²)/2σ²).

To efficiently detect stable keypoint locations in scale space, we have proposed (Lowe, 1999) using scale-space extrema in the difference-of-Gaussian function convolved with the image, D(x, y, σ), which can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

    D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y)
               = L(x, y, kσ) − L(x, y, σ).                               (1)

There are a number of reasons for choosing this function. First, it is a particularly efficient function to compute, as the smoothed images, L, need to be computed in any case for scale space feature description, and D can therefore be computed by simple image subtraction.

Figure 1: For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale space images shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian (DOG) images on the right.
After each octave, the Gaussian image is down-sampled by a factor of 2, and the process repeated.

In addition, the difference-of-Gaussian function provides a close approximation to the scale-normalized Laplacian of Gaussian, σ²∇²G, as studied by Lindeberg (1994). Lindeberg showed that the normalization of the Laplacian with the factor σ² is required for true scale invariance. In detailed experimental comparisons, Mikolajczyk (2002) found that the maxima and minima of σ²∇²G produce the most stable image features compared to a range of other possible image functions, such as the gradient, Hessian, or Harris corner function.

The relationship between D and σ²∇²G can be understood from the heat diffusion equation (parameterized in terms of σ rather than the more usual t = σ²):

    ∂G/∂σ = σ∇²G.

From this, we see that σ∇²G can be computed from the finite difference approximation to ∂G/∂σ, using the difference of nearby scales at kσ and σ:

    σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ)

and therefore

    G(x, y, kσ) − G(x, y, σ) ≈ (k − 1)σ²∇²G.

This shows that when the difference-of-Gaussian function has scales differing by a constant factor it already incorporates the σ² scale normalization required for the scale-invariant Laplacian. The factor (k − 1) in the equation is a constant over all scales and therefore does not influence extrema location. The approximation error will go to zero as k goes to 1, but in practice we have found that the approximation has almost no impact on the stability of extrema detection or localization for even significant differences in scale, such as k = √2.

Figure 2: Maxima and minima of the difference-of-Gaussian images are detected by comparing a pixel (marked with X) to its 26 neighbors in 3x3 regions at the current and adjacent scales (marked with circles).

An efficient approach to construction of D(x, y, σ) is shown in Figure 1. The initial image is incrementally convolved with Gaussians to produce images separated by a constant factor k in scale space, shown stacked in the left column.
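As a concrete illustration of this construction, the following sketch builds one octave of the Gaussian and difference-of-Gaussian stack. It is a minimal sketch only: it assumes scipy's gaussian_filter for the convolutions, takes the base image as having no prior blur, and uses the parameter choices (base σ = 1.6, s = 3 intervals per octave) discussed below.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_dog_octave(image, sigma=1.6, s=3):
    """Build one octave: s + 3 Gaussian images and s + 2 DoG images.

    Incremental blurring: each level is produced from the previous one,
    so the i-th image has total blur sigma * k**i with k = 2**(1/s).
    """
    k = 2.0 ** (1.0 / s)
    gaussians = [gaussian_filter(image.astype(np.float64), sigma)]
    for i in range(1, s + 3):
        # incremental blur taking total blur from sigma*k**(i-1) to sigma*k**i
        prev = sigma * k ** (i - 1)
        increment = prev * np.sqrt(k * k - 1.0)
        gaussians.append(gaussian_filter(gaussians[-1], increment))
    # adjacent Gaussian images are subtracted to form the DoG stack
    dogs = [b - a for a, b in zip(gaussians, gaussians[1:])]
    return gaussians, dogs

def next_octave_base(gaussians):
    """Down-sample the image with twice the initial blur (2 images from the
    top of the stack) by taking every second pixel in each row and column."""
    return gaussians[-3][::2, ::2]
```

With s = 3 the octave holds 6 Gaussian images and 5 DoG images, and gaussians[-3] carries blur sigma * k**3 = 2 * sigma, matching the resampling rule described below.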
We choose to divide each octave of scale space (i.e., doubling of σ) into an integer number, s, of intervals, so k = 2^(1/s). We must produce s + 3 images in the stack of blurred images for each octave, so that final extrema detection covers a complete octave. Adjacent image scales are subtracted to produce the difference-of-Gaussian images shown on the right. Once a complete octave has been processed, we resample the Gaussian image that has twice the initial value of σ (it will be 2 images from the top of the stack) by taking every second pixel in each row and column. The accuracy of sampling relative to σ is no different than for the start of the previous octave, while computation is greatly reduced.

3.1 Local extrema detection

In order to detect the local maxima and minima of D(x, y, σ), each sample point is compared to its eight neighbors in the current image and nine neighbors in the scale above and below (see Figure 2). It is selected only if it is larger than all of these neighbors or smaller than all of them. The cost of this check is reasonably low due to the fact that most sample points will be eliminated following the first few checks.

An important issue is to determine the frequency of sampling in the image and scale domains that is needed to reliably detect the extrema. Unfortunately, it turns out that there is no minimum spacing of samples that will detect all extrema, as the extrema can be arbitrarily close together. This can be seen by considering a white circle on a black background, which will have a single scale space maximum where the circular positive central region of the difference-of-Gaussian function matches the size and location of the circle. For a very elongated ellipse, there will be two maxima, one near each end of the ellipse.
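The 26-neighbor comparison of Section 3.1 can be sketched as follows; this minimal illustration assumes the difference-of-Gaussian images for one octave are held as a list of equally sized 2D numpy arrays, and the function names are choices made here for illustration.

```python
import numpy as np

def is_extremum(dogs, s, y, x):
    """Return True if pixel (y, x) at DoG level s is strictly larger or
    strictly smaller than all 26 neighbors in the 3x3x3 neighborhood
    spanning the current and adjacent scales."""
    cube = np.stack([d[y-1:y+2, x-1:x+2] for d in dogs[s-1:s+2]])
    center = cube[1, 1, 1]
    neighbors = np.delete(cube.ravel(), 13)  # index 13 is the center sample
    return bool(center > neighbors.max() or center < neighbors.min())

def find_extrema(dogs):
    """Scan all interior pixels of the interior DoG levels. Most candidates
    fail after comparing only a few neighbors, so a real implementation
    would short-circuit rather than build the full 3x3x3 cube each time."""
    h, w = dogs[0].shape
    return [(s, y, x)
            for s in range(1, len(dogs) - 1)
            for y in range(1, h - 1)
            for x in range(1, w - 1)
            if is_extremum(dogs, s, y, x)]
```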
As the locations of maxima are a continuous function of the image, for some ellipse with intermediate elongation there will be a transition from a single maximum to two, with the maxima arbitrarily close to each other near the transition.

Figure 3: The top line of the first graph shows the percent of keypoints that are repeatably detected at the same location and scale in a transformed image as a function of the number of scales sampled per octave. The lower line shows the percent of keypoints that have their descriptors correctly matched to a large database. The second graph shows the total number of keypoints detected in a typical image as a function of the number of scale samples.

Therefore, we must settle for a solution that trades off efficiency with completeness. In fact, as might be expected and is confirmed by our experiments, extrema that are close together are quite unstable to small perturbations of the image. We can determine the best choices experimentally by studying a range of sampling frequencies and using those that provide the most reliable results under a realistic simulation of the matching task.

3.2 Frequency of sampling in scale

The experimental determination of sampling frequency that maximizes extrema stability is shown in Figures 3 and 4. These figures (and most other simulations in this paper) are based on a matching task using a collection of 32 real images drawn from a diverse range, including outdoor scenes, human faces, aerial photographs, and industrial images (the image domain was found to have almost no influence on any of the results). Each image was then subject to a range of transformations, including rotation, scaling, affine stretch, change in brightness and contrast, and addition of image noise.
Because the changes were synthetic, it was possible to precisely predict where each feature in an original image should appear in the transformed image, allowing for measurement of correct repeatability and positional accuracy for each feature.

Figure 3 shows these simulation results used to examine the effect of varying the number of scales per octave at which the image function is sampled prior to extrema detection. In this case, each image was resampled following rotation by a random angle and scaling by a random amount between 0.2 and 0.9 times the original size. Keypoints from the reduced resolution image were matched against those from the original image so that the scales for all keypoints would be present in the matched image. In addition, 1% image noise was added, meaning that each pixel had a random number added from the uniform interval [−0.01, 0.01], where pixel values are in the range [0, 1] (equivalent to providing slightly less than 6 bits of accuracy for image pixels).

Figure 4: The top line in the graph shows the percent of keypoint locations that are repeatably detected in a transformed image as a function of the prior image smoothing for the first level of each octave. The lower line shows the percent of descriptors correctly matched against a large database.

The top line in the first graph of Figure 3 shows the percent of keypoints that are detected at a matching location and scale in the transformed image. For all examples in this paper, we define a matching scale as being within a factor of √2 of the correct scale, and a matching location as being within σ pixels, where σ is the scale of the keypoint (defined from equation (1) as the standard deviation of the smallest Gaussian used in the difference-of-Gaussian function).
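The repeatability criterion just defined, together with the synthetic noise model, can be sketched as follows; the function names here are illustrative choices, not taken from the original implementation.

```python
import numpy as np

def add_image_noise(image, rng, amplitude=0.01):
    """Add 1% image noise: a uniform random value in [-amplitude, amplitude]
    per pixel, with pixel values kept in the range [0, 1]."""
    noisy = image + rng.uniform(-amplitude, amplitude, image.shape)
    return np.clip(noisy, 0.0, 1.0)

def is_repeated(predicted_xy, predicted_sigma, detected_xy, detected_sigma):
    """Matching criterion: the detected scale is within a factor of sqrt(2)
    of the predicted scale, and the location is within sigma pixels, where
    sigma is the scale of the keypoint."""
    ratio = max(predicted_sigma, detected_sigma) / min(predicted_sigma, detected_sigma)
    distance = np.hypot(predicted_xy[0] - detected_xy[0],
                        predicted_xy[1] - detected_xy[1])
    return bool(ratio < np.sqrt(2) and distance < predicted_sigma)
```

For example, a keypoint predicted at (10, 10) with σ = 2 would count as repeated if detected at (10.5, 10) with scale 2.2, but not if detected 5 pixels away.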
The lower line on this graph shows the number of keypoints that are correctly matched to a database of 40,000 keypoints using the nearest-neighbor matching procedure to be described in Section 6 (this shows that once the keypoint is repeatably located, it is likely to be useful for recognition and matching tasks). As this graph shows, the highest repeatability is obtained when sampling 3 scales per octave, and this is the number of scale samples used for all other experiments throughout this paper.

It might seem surprising that the repeatability does not continue to improve as more scales are sampled. The reason is that this results in many more local extrema being detected, but these extrema are on average less stable and therefore are less likely to be detected in the transformed image. This is shown by the second graph in Figure 3, which shows the average number of keypoints detected and correctly matched in each image. The number of keypoints rises with increased sampling of scales and the total number of correct matches also rises. Since the success of object recognition often depends more on the quantity of correctly matched keypoints, as opposed to their percentage of correct matches, for many applications it will be optimal to use a larger number of scale samples.
However, the cost of computation also rises with this number, so for the experiments in this paper we have chosen to use just 3 scale samples per octave.

To summarize, these experiments show that the scale-space difference-of-Gaussian function has a large number of extrema and that it would be very expensive to detect them all. Fortunately, we can detect the most stable and useful subset even with a coarse sampling of scales.

3.3 Frequency of sampling in the spatial domain

Just as we determined the frequency of sampling per octave of scale space, so we must determine the frequency of sampling in the image domain relative to the scale of smoothing. Given that extrema can be arbitrarily close together, there will be a similar trade-off between sampling frequency and rate of detection. Figure 4 shows an experimental determination of the amount of prior smoothing, σ, that is applied to each image level before building the scale space representation for an octave. Again, the top line is the repeatability of keypoint detection, and the results show that the repeatability continues to increase with σ. However, there is a cost to using a large σ in terms of efficiency, so we have chosen to use σ = 1.6, which provides close to optimal repeatability. This value is used throughout this paper and was used for the results in Figure 3.

Of course, if we pre-smooth the image before extrema detection, we are effectively discarding the highest spatial frequencies. Therefore, to make full use of the input, the image can be expanded to create more sample points than were present in the original. We double the size of the input image using linear interpolation prior to building the first level of the pyramid. While the equivalent operation could effectively have been performed by using sets of subpixel-offset filters on the original image, the image doubling leads to a more efficient implementation.
We assume that the original image has a blur of at least σ = 0.5 (the minimum needed to prevent significant aliasing), and that therefore the doubled image has σ = 1.0 relative to its new pixel spacing. This means that little additional smoothing is needed prior to creation of the first octave of scale space. The image doubling increases the number of stable keypoints by almost a factor of 4, but no significant further improvements were found with a larger expansion factor.

4 Accurate keypoint localization

Once a keypoint candidate has been found by comparing a pixel to its neighbors, the next step is to perform a detailed fit to the nearby data for location, scale, and ratio of principal curvatures. This information allows points to be rejected that have low contrast (and are therefore sensitive to noise) or are poorly localized along an edge.

The initial implementation of this approach (Lowe, 1999) simply located keypoints at the location and scale of the central sample point. However, recently Brown has developed a method (Brown and Lowe, 2002) for fitting a 3D quadratic function to the local sample points to determine the interpolated location of the maximum, and his experiments showed that this provides a substantial improvement to matching and stability. His approach uses the Taylor expansion (up to the quadratic terms) of the scale-space function, D(x, y, σ), shifted so that the origin is at the sample point:

    D(x) = D + (∂D/∂x)ᵀ x + ½ xᵀ (∂²D/∂x²) x,                           (2)

where D and its derivatives are evaluated at the sample point and x = (x, y, σ)ᵀ is the offset from this point. The location of the extremum, x̂, is determined by taking the derivative of this function with respect to x and setting it to zero, giving

    x̂ = −(∂²D/∂x²)⁻¹ (∂D/∂x).                                           (3)
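Assuming D is sampled on a unit grid, the gradient and Hessian in the Taylor expansion of equation (2) can be estimated by central finite differences from the 3×3×3 block of DoG samples around the candidate point, and the extremum offset x̂ = −(∂²D/∂x²)⁻¹ (∂D/∂x) found by solving a 3×3 linear system. The following is a minimal sketch of that fit; the helper name and the (σ, y, x) axis ordering are choices made here, not taken from the paper.

```python
import numpy as np

def refine_extremum(cube):
    """Given a 3x3x3 block of D values centered on a sample point (axes
    ordered sigma, y, x), fit the quadratic of equation (2) and return the
    offset x_hat = -(d2D/dx2)^-1 (dD/dx) and the interpolated value D(x_hat)."""
    c = cube[1, 1, 1]
    # gradient by central finite differences
    g = 0.5 * np.array([
        cube[2, 1, 1] - cube[0, 1, 1],
        cube[1, 2, 1] - cube[1, 0, 1],
        cube[1, 1, 2] - cube[1, 1, 0],
    ])
    # Hessian by second central differences
    H = np.empty((3, 3))
    H[0, 0] = cube[2, 1, 1] - 2 * c + cube[0, 1, 1]
    H[1, 1] = cube[1, 2, 1] - 2 * c + cube[1, 0, 1]
    H[2, 2] = cube[1, 1, 2] - 2 * c + cube[1, 1, 0]
    H[0, 1] = H[1, 0] = 0.25 * (cube[2, 2, 1] - cube[2, 0, 1] - cube[0, 2, 1] + cube[0, 0, 1])
    H[0, 2] = H[2, 0] = 0.25 * (cube[2, 1, 2] - cube[2, 1, 0] - cube[0, 1, 2] + cube[0, 1, 0])
    H[1, 2] = H[2, 1] = 0.25 * (cube[1, 2, 2] - cube[1, 2, 0] - cube[1, 0, 2] + cube[1, 0, 0])
    x_hat = -np.linalg.solve(H, g)
    # interpolated extremum value: D + (1/2) (dD/dx)^T x_hat
    value = c + 0.5 * g @ x_hat
    return x_hat, value
```

Because the finite differences are exact for a quadratic, feeding in samples of a known quadratic recovers its true extremum; in practice one would also reject points where |x_hat| exceeds 0.5 in any dimension and re-fit at the neighboring sample.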