
High-Precision Localization Using Ground Texture
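As for why this kind of system was devised in the first place: that part is not quoted here, but it is covered in Section II, RELATED WORK.
For the specifics, see Section III, SYSTEM, from subsection B onward.
For how well it actually performs, see Section IV, EVALUATION.
And for what is still missing, have a read of Section VI, DISCUSSION.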
The Global Positioning System (GPS) receiver has become an essential component of both hand-held mobile devices and vehicles of all types.
Applications of GPS, however, are constrained by a number of known limitations.
A GPS receiver must have access to unobstructed lines of sight to a minimum of four satellites, and obscured satellites significantly jeopardize localization quality.
Indoors, a GPS receiver either is slow to obtain a fix, or more likely does not work at all.
Even outdoors, under optimal circumstances, accuracy is limited to a few meters (or perhaps a meter with modern SBAS systems).
These limitations make GPS insufficient for fine-positioning applications such as guiding a car to a precise location in a parking lot, or guiding a robot within an indoor room or warehouse.
To overcome the robustness and accuracy limitations of GPS, alternative localization technologies have been proposed, which are either less accurate than GPS (e.g., triangulation of cellphone towers and WiFi hotspots), or expensive and cumbersome to deploy (e.g., RFID localization or special-purpose sensors embedded in the environment).
Inertial navigation and odometry, which are often used in robotics for fine-positioning tasks, require a known initial position, drift over time, and lose track (requiring manual re-initialization) when the device is powered off.
This paper proposes a system that provides millimeter-scale localization, both indoors and outside on land.
The key observation behind our approach is that seemingly-random ground textures exhibit distinctive features that, in combination, provide a means for unique identification.
Even apparently homogeneous surfaces contain small imperfections (cracks, scratches, or even a particular arrangement of carpet fibers) that are persistently identifiable as local features.
While a single feature is not likely to be unique over a large area, the spatial relationship among a group of such features in a small region is likely to be distinctive, at least up to the uncertainty achievable with coarse localization methods such as GPS or WiFi triangulation.
Inspired by this observation, we construct a system called Micro-GPS that includes a downward-facing camera to capture fine-scale ground textures, and an image processing unit capable of locating that texture patch in a pre-constructed compact database within a few hundred milliseconds.
The use of image features for precise localization has a rich history, including works such as Photo Tourism [1] and Computational Re-Photography [2].
Thus, a main contribution of our work is determining how some of the algorithms used for feature detection and matching in “natural” images, as used by previous work, can be adapted for “texture-like” images of the ground.
In searching for a robust combination of such methods, we exploit two key advantages of ground texture images.
First, the ground can be photographed from much closer range than typical features in the environment, leading to an order-of-magnitude improvement in precision.
Second, the statistics of texture-like images lead to a greater density of features, leading to greater robustness over time.
Our test robot is equipped with an NVIDIA Jetson TX1 development board that controls a Point Grey monocular camera.
Our system consists of two phases: an offline database construction phase, and an online localization phase (Figure 1).
We begin by collecting ground texture images and aligning them using global pose optimization.
We extract local features (keypoints) and store them in a database, which is subsequently compressed to a manageable size.
For localization, we find keypoints in a query image and search the database for candidate matches using approximate nearest neighbor matching.
Because it is common for more than 90% of the matches to be spurious, we use voting to reject outliers, based on the observation that inlier matches will vote for a consistent location whereas outliers distribute their votes randomly.
Finally, we use the remaining inlier matches to precisely calculate the location of the query image.
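The voting and pose-estimation steps described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes each match carries a keypoint position and orientation in both the query image and the map, that votes are accumulated in a fixed grid (the 50-unit cell size is an arbitrary choice), and that the final pose is a closed-form least-squares 2D rigid fit. The names vote_for_origin and fit_rigid_2d are hypothetical.

```python
import math
from collections import defaultdict


def vote_for_origin(matches, cell=50.0):
    """Return indices of the matches that agree on the image origin.

    Each match is (qx, qy, q_angle, mx, my, m_angle): a query keypoint in
    image coordinates and its matched database keypoint in map coordinates.
    """
    bins = defaultdict(list)
    for i, (qx, qy, qa, mx, my, ma) in enumerate(matches):
        rot = ma - qa  # image-to-map rotation implied by this match
        c, s = math.cos(rot), math.sin(rot)
        # Map position of the image origin implied by this match.
        ox = mx - (c * qx - s * qy)
        oy = my - (s * qx + c * qy)
        bins[(math.floor(ox / cell), math.floor(oy / cell))].append(i)
    if not bins:
        return []
    # Inliers pile up in one cell; outliers scatter over many cells.
    return max(bins.values(), key=len)


def fit_rigid_2d(pairs):
    """Least-squares rotation and translation taking (qx, qy) to (mx, my).

    pairs is a list of (qx, qy, mx, my); returns (angle, tx, ty).
    """
    n = len(pairs)
    qcx = sum(p[0] for p in pairs) / n
    qcy = sum(p[1] for p in pairs) / n
    mcx = sum(p[2] for p in pairs) / n
    mcy = sum(p[3] for p in pairs) / n
    cos_sum = sin_sum = 0.0
    for qx, qy, mx, my in pairs:
        ax, ay = qx - qcx, qy - qcy  # centered query point
        bx, by = mx - mcx, my - mcy  # centered map point
        cos_sum += ax * bx + ay * by
        sin_sum += ax * by - ay * bx
    angle = math.atan2(sin_sum, cos_sum)
    c, s = math.cos(angle), math.sin(angle)
    tx = mcx - (c * qcx - s * qcy)
    ty = mcy - (s * qcx + c * qcy)
    return angle, tx, ty
```

A fixed grid can split a true peak that straddles a cell boundary, so a practical version would probably also inspect neighboring cells or follow the vote with a robust estimator before the final fit.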
The major contributions of this paper are:
• Describing a low-cost global localization system based on ground textures and making relevant code and instructions available for reproduction.
• Capturing and making available datasets of seven indoor and outdoor ground textures.
• Investigating the design decisions necessary for practical matching in texture-like images, as opposed to natural images.
This includes the choice of descriptor, strategies for reducing storage costs, and a robust voting procedure that can find inliers with high reliability.
• Demonstrating a real-world application of precise localization: a robot that uses Micro-GPS to record a path and then follow it with sub-centimeter accuracy.
The ability to localize a vehicle or robot precisely has the potential for far-reaching applications.
A car could accurately park (or guide the driver to do so) in any location it recognizes from before, avoiding obstacles mere centimeters away.
A continuously-updated map of potholes could be used to guide drivers to turn slightly to avoid them.
The technology applies equally well to vehicles smaller than cars, such as Segways, electric wheelchairs, and mobility scooters for the elderly or disabled, any of which could be guided to precise locations or around hard-to-see obstacles.
Indoor applications include guidance of warehouse robots and precise control over assistive robotics in the home.
A. Mapping
Hardware Setup and Data Collection: Our imaging system consists of a Point Grey CM3 grayscale camera pointed downwards at the ground (Figure 1, left).
(Translator's note: Point Grey is now FLIR.)
A shield blocks ambient light around the portion of the ground imaged by the camera, and a set of LED lights arranged symmetrically around the lens provides rotation-invariant illumination.
The distance from the camera to the ground is set to 260 mm for most types of textures we have experimented with.
Our system is insensitive to this distance, as long as a sufficient number of features can be detected.
The camera output is processed by an NVIDIA Jetson TX1 development board.
Our prototype has the camera and development board mounted on a mobile cart, which may be moved manually or can be driven with a pair of computer-controlled motorized wheels.
The latter capability is used for the “automatic path following” demonstration described in Section V.
For initial data capture, however, we manually move the cart in a zig-zag path to ensure that an area can be fully covered.
This process, while easily mastered by non-experts, could be automated with more engineering effort, or even crowd-sourced once there are more users.
Image Stitching: To construct a global map, we assume that the surface is locally planar, which is true even for most outdoor surfaces.
Our image stitching pipeline consists of frame-to-frame matching followed by global optimization, leveraging extensive loop closures provided by the zig-zag path.
Since the computation becomes significantly more expensive as the area grows, we split a large area into several regions (which we reconstruct separately) and link the already-reconstructed regions.
This allows us to quickly map larger areas with decent quality.
Figure 1 (right) shows the “asphalt” dataset, which covers 19.76 m² in high detail.
Datasets: We have experimented with a variety of both indoor and outdoor datasets, covering ground types ranging from ordered (carpet) to highly stochastic (granite), and including both the presence (concrete) and absence (asphalt) of visible large-scale variation.
We have also captured test images for the datasets on a different day (to allow perturbations to the ground surfaces) to evaluate our system.
Figure 2 shows example patches from our dataset.
We will make these datasets, together with databases of SIFT features and test-image sequences, available to the research community.
Database Construction: The final stage in building a map is extracting a set of features from the images and constructing a data structure for efficiently locating them.
This step involves some key decisions, which we evaluate in Section IV.
Here we only describe our actual implementation.
We use the SIFT scale-space DoG detector and gradient orientation histogram descriptor [30], since we have found it to have high robustness and (with its GPU implementation [31]) reasonable computational time.
For each image in the map, we typically find 1000 to 2000 SIFT keypoints, and randomly select 50 of them to be stored in the database.
This limits the size of the database itself, as well as the data structures used for accelerating nearest-neighbor queries.
We choose random selection after observing that features with higher DoG response are not necessarily highly repeatable features: they are just as likely to be due to noise, dust, etc.
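The per-image subsampling described above amounts to a uniform random draw without replacement. A minimal sketch, assuming keypoints are held in a Python list; the function name subsample_keypoints and the optional rng argument are hypothetical:

```python
import random


def subsample_keypoints(keypoints, keep=50, rng=None):
    """Keep a uniform random subset of an image's keypoints.

    Detector response is deliberately ignored: the selection does not
    favor high-response keypoints.
    """
    if rng is None:
        rng = random.Random()
    if len(keypoints) <= keep:
        return list(keypoints)
    return rng.sample(keypoints, keep)
```

Passing a seeded random.Random instance makes the database build reproducible, which can help when comparing storage or accuracy across runs.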
To further speed up computation and reduce the size of the database, we apply PCA [32] to the set of SIFT descriptors and project each descriptor onto the top k principal components.
As described in Section IV, for good accuracy we typically use k = 8 or k = 16 in our implementation, and there is minimal cost to using a “universal” PCA basis constructed from a variety of textures, rather than a per-texture basis.
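To make the projection step concrete, here is a small standard-library sketch of PCA applied to descriptor vectors. It is an illustration under stated assumptions, not the paper's code: the principal components are found by power iteration with deflation (a technique chosen here for brevity; the text does not say how the basis is computed), and the helper names mean_vector, covariance, top_components and project are hypothetical. For real 128-dimensional SIFT descriptors a numerical library would be the more practical choice.

```python
import math
import random


def mean_vector(rows):
    """Per-dimension mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows, mean):
    """Sample covariance matrix of the rows around the given mean."""
    d, n = len(mean), len(rows)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        c = [x - m for x, m in zip(r, mean)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j]
    return [[v / (n - 1) for v in row] for row in cov]


def top_components(cov, k, iters=200, seed=0):
    """Top-k eigenvectors of cov by power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    work = [row[:] for row in cov]  # deflated copy; the input is untouched
    basis = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(d)) for i in range(d))
        basis.append(v)
        # Remove this component so the next pass finds the next eigenvector.
        for i in range(d):
            for j in range(d):
                work[i][j] -= lam * v[i] * v[j]
    return basis


def project(descriptor, mean, basis):
    """Project one descriptor onto the k basis vectors."""
    c = [x - m for x, m in zip(descriptor, mean)]
    return [sum(b * x for b, x in zip(vec, c)) for vec in basis]
```

With a “universal” basis, mean_vector, covariance and top_components would be run once on descriptors pooled from several textures, and only project would be called per keypoint at mapping and query time.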
One of the major advantages of our system is that the height of the camera is fixed, so that the scale of a particular feature is also fixed.
This means that when searching for a feature with scale s in the database, we only need to check features with scale s as well.
In practice, to allow some inconsistency, we quantize scale into 10 buckets and divide the database into 10 groups based on scale.
Then we build a search index for each group using FLANN [33].
During testing, given a feature with scale s, we only need to search for the nearest neighbor in the group to which s belongs.
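The scale-bucketed lookup can be sketched as below. This is a simplified stand-in, not the actual implementation: a brute-force scan replaces the FLANN index, the buckets are assumed to be log-spaced between hypothetical bounds s_min and s_max (the text only says that scale is quantized into 10 buckets), and the names scale_bucket and BucketedIndex are made up for illustration.

```python
import math


def scale_bucket(scale, s_min, s_max, n_buckets=10):
    """Quantize a keypoint scale into one of n_buckets log-spaced buckets."""
    span = math.log(s_max) - math.log(s_min)
    t = (math.log(scale) - math.log(s_min)) / span
    # Clamp so that scales outside [s_min, s_max] fall in the end buckets.
    return min(n_buckets - 1, max(0, int(t * n_buckets)))


class BucketedIndex:
    """One nearest-neighbor group per scale bucket (brute force here)."""

    def __init__(self, s_min, s_max, n_buckets=10):
        self.s_min, self.s_max, self.n_buckets = s_min, s_max, n_buckets
        self.groups = [[] for _ in range(n_buckets)]

    def _bucket(self, scale):
        return scale_bucket(scale, self.s_min, self.s_max, self.n_buckets)

    def add(self, descriptor, scale, payload):
        """Store a descriptor in the group for its scale bucket."""
        self.groups[self._bucket(scale)].append((list(descriptor), payload))

    def nearest(self, descriptor, scale):
        """Return (payload, squared distance) from the query's bucket only."""
        best = None
        for stored, payload in self.groups[self._bucket(scale)]:
            d2 = sum((a - b) ** 2 for a, b in zip(stored, descriptor))
            if best is None or d2 < best[1]:
                best = (payload, d2)
        return best
```

Because only one group is searched, a query whose scale falls just across a bucket boundary from its true match would miss it; how the real system handles that case is not stated here.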
Our system provides a simple, inexpensive solution to achieve fine absolute positioning, and mobile robots having such a requirement represent an ideal application.
To demonstrate the practicality of this approach, we build a robot that is able to follow a designed path exactly without any initialization of the position.
Our robot (shown in Figure 8a) has a differential drive composed of two 24V DC geared motors with encoders for closed-loop control of the motors.
Using the encoder readings, we implemented dead-reckoning odometry on board to track the position of the robot at reasonable accuracy within a short distance.
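Dead-reckoning odometry for a differential drive can be sketched as follows. This is the standard textbook update and not necessarily what runs on the robot; the parameters wheel_base, wheel_radius and ticks_per_rev are hypothetical and would come from the actual hardware.

```python
import math


def ticks_to_distance(ticks, ticks_per_rev, wheel_radius):
    """Convert encoder ticks to distance travelled by one wheel."""
    return 2.0 * math.pi * wheel_radius * ticks / ticks_per_rev


def odometry_step(pose, d_left, d_right, wheel_base):
    """Advance (x, y, theta) given the distance each wheel travelled."""
    x, y, theta = pose
    d_center = (d_left + d_right) / 2.0
    d_theta = (d_right - d_left) / wheel_base
    # Midpoint approximation: move along the average heading of the step.
    x += d_center * math.cos(theta + d_theta / 2.0)
    y += d_center * math.sin(theta + d_theta / 2.0)
    # Wrap the heading into [-pi, pi).
    theta = (theta + d_theta + math.pi) % (2.0 * math.pi) - math.pi
    return (x, y, theta)
```

Each step adds a small error from wheel slip and encoder quantization, and those errors accumulate, which is why this kind of estimate is only trustworthy over short distances.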
The drift in odometry is corrected using Micro-GPS running on the onboard NVIDIA Jetson TX1 computer at ~4 fps.
To test the repeatability of navigation using this strategy, we first manually drive the robot along a particular path (Figure 8b), and mark its final location on the ground using a piece of tape.
The robot then goes back to its starting position and re-plays the same path, fully automatically.
The sequences of the manual driving and automatic re-play are shown in the accompanying video; screen-shots from the video are compared in Figure 8c.
As shown in Figure 8d, the robot ends up in almost exactly the same position after automatic path following as it did after the manual driving.
[1] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo Tourism: Exploring
photo collections in 3D,” ACM Transactions on Graphics, vol. 25,
no. 3, pp. 835–846, Jul. 2006.
[2] S. Bae, A. Agarwala, and F. Durand, “Computational re-photography,”
ACM Transactions on Graphics, vol. 29, no. 3, pp. 24:1–24:15, Jul.
[3] K. Dana, B. van Ginneken, S. Nayar, and J. Koenderink, “Reflectance
and texture of real-world surfaces,” ACM Transactions on Graphics,
vol. 18, no. 1, pp. 1–34, 1999.
[4] T. Leung and J. Malik, “Representing and recognizing the visual appearance of materials using three-dimensional textons,” International
Journal of Computer Vision (IJCV), vol. 43, no. 1, pp. 29–44, Jun.
[5] D. Heeger and J. Bergen, “Pyramid-based texture analysis/synthesis,”
in Proc. ACM SIGGRAPH, 1995, pp. 229–238.
[6] A. Efros and T. Leung, “Texture synthesis by non-parametric sampling,” in IEEE International Conference on Computer Vision (ICCV),
1999, pp. 1033–1038.
[7] A. Kelly, B. Nagy, D. Stager, and R. Unnikrishnan, “An infrastructure-free automated guided vehicle based on computer vision,” IEEE
Robotics & Automation Magazine, vol. 14, no. 3, pp. 24–34, 2007.
[8] H. Fang, M. Yang, R. Yang, and C. Wang, “Ground-texture-based
localization for intelligent vehicles,” IEEE Transactions on Intelligent
Transportation Systems, vol. 10, no. 3, pp. 463–468, Sep. 2009.
[9] K. Kozak and M. Alban, “Ranger: A ground-facing camera-based
localization system for ground vehicles,” in IEEE/ION Position, Location and Navigation Symposium (PLANS), 2016, pp. 170–178.
[10] W. Clarkson, T. Weyrich, A. Finkelstein, N. Heninger, J. A. Halderman, and E. W. Felten, “Fingerprinting blank paper using commodity
scanners,” in IEEE Symposium on Security and Privacy, 2009, pp.
[11] T. Sattler, B. Leibe, and L. Kobbelt, “Fast image-based localization
using direct 2D-to-3D matching,” in IEEE International Conference
on Computer Vision (ICCV), 2011, pp. 667–674.
[12] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua, “Worldwide pose estimation using 3D point clouds,” in European Conference on Computer
Vision (ECCV), 2012, pp. 15–29.
[13] R. Mur-Artal, J. Montiel, and J. D. Tardós, “ORB-SLAM: A versatile
and accurate monocular SLAM system,” IEEE Transactions on Robotics,
vol. 31, no. 5, pp. 1147–1163, Oct. 2015.
[14] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional
network for real-time 6-DOF camera relocalization,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2938–2946.
[15] S. Ramalingam, S. Bouaziz, P. Sturm, and M. Brand, “Geolocalization
using skylines from omni-images,” in IEEE International Conference
on Computer Vision (ICCV) Workshops, 2009, pp. 23–30.
[16] Y. Li, N. Snavely, and D. Huttenlocher, “Location recognition using
prioritized feature matching,” in European Conference on Computer
Vision (ECCV), 2010, pp. 791–804.
[17] S. Cao and N. Snavely, “Minimal scene descriptions from structure
from motion models,” in Computer Vision and Pattern Recognition
(CVPR), 2014, pp. 461–468.
[18] B. Zeisl, T. Sattler, and M. Pollefeys, “Camera pose voting for large-scale image-based localization,” in IEEE International Conference on
Computer Vision (ICCV), 2015, pp. 2704–2712.
[19] Y. Avrithis and G. Tolias, “Hough pyramid matching: Speeded-up geometry re-ranking for large scale image retrieval,” International Journal of Computer Vision (IJCV), vol. 107, no. 1, pp. 1–19, Mar. 2014.
[20] X. Wu and K. Kashino, “Adaptive dither voting for robust spatial
verification,” in IEEE International Conference on Computer Vision
(ICCV), 2015, pp. 1877–1885.
[21] J. L. Schönberger, T. Price, T. Sattler, J.-M. Frahm, and M. Pollefeys,
“A vote-and-verify strategy for fast spatial verification in image retrieval,” in Asian Conference on Computer Vision (ACCV), 2016, pp.
[22] L. Svarm, O. Enqvist, F. Kahl, and M. Oskarsson, “City-scale localization for cameras with known vertical direction,” IEEE Transactions
on Pattern Analysis and Machine Intelligence (PAMI), pp. 1455–1461,
[23] G. Baatz, K. Köser, D. Chen, R. Grzeszczuk, and M. Pollefeys, “Handling
urban location recognition as a 2D homothetic problem,” in European Conference on Computer Vision (ECCV), 2010, pp. 266–279.
[24] H. Lim, S. N. Sinha, M. F. Cohen, M. Uyttendaele, and H. J. Kim,
“Real-time monocular image-based 6-DoF localization,” International
Journal of Robotics Research, vol. 34, no. 4-5, pp. 476–492, Apr.
[25] S. Middelberg, T. Sattler, O. Untzelmann, and L. Kobbelt, “Scalable
6-DoF localization on mobile devices,” in European Conference on
Computer Vision (ECCV), 2014, pp. 268–283.
[26] A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof, “From structure-from-motion point clouds to fast location recognition,” in Computer
Vision and Pattern Recognition (CVPR), 2009, pp. 2599–2606.
[27] A. Wendel, A. Irschara, and H. Bischof, “Natural landmark-based
monocular localization for MAVs,” in IEEE International Conference
on Robotics and Automation (ICRA), 2011, pp. 5792–5799.
[28] M. Schönbein and A. Geiger, “Omnidirectional 3D reconstruction in
augmented Manhattan worlds,” in IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), 2014, pp. 716–723.
[29] C. Arth, M. Klopschitz, G. Reitmayr, and D. Schmalstieg, “Real-time
self-localization from panoramic images on mobile devices,” in IEEE
International Symposium on Mixed and Augmented Reality (ISMAR),
2011, pp. 37–46.
[30] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision (IJCV), vol. 60,
no. 2, pp. 91–110, Nov. 2004.
[31] C. Wu, “SiftGPU: A GPU implementation of scale invariant feature
transform (SIFT),” http://www.cs.unc.edu/~ccwu/siftgpu/, 2007.
[32] K. Pearson, “LIII. On lines and planes of closest fit to systems of
points in space,” The London, Edinburgh, and Dublin Philosophical
Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
[33] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with
automatic algorithm configuration,” in International Conference on
Computer Vision Theory and Applications (VISAPP), 2009, pp. 331–
[34] L. Zhang and S. Rusinkiewicz, “Learning to detect features in texture
images,” in Computer Vision and Pattern Recognition (CVPR), 2018,
pp. 6325–6333.
[35] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust
features,” in European Conference on Computer Vision (ECCV), 2006,
pp. 404–417.
[36] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in IEEE International Conference
on Computer Vision (ICCV), 2011, pp. 2564–2571.
[37] Y. Tian, B. Fan, F. Wu et al., “L2-Net: Deep learning of discriminative
patch descriptor in Euclidean space,” in Computer Vision and Pattern
Recognition (CVPR), 2017.
[38] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas, “Working hard
to know your neighbor’s margins: Local descriptor learning loss,” in
Advances in Neural Information Processing Systems, 2017, pp. 4826–
[39] K. He, Y. Lu, and S. Sclaroff, “Local descriptors optimized for average
precision,” in Computer Vision and Pattern Recognition (CVPR), 2018,
pp. 596–605.
[40] X. Zhang, X. Y. Felix, S. Kumar, and S.-F. Chang, “Learning spread-out local feature descriptors,” in IEEE International Conference on
Computer Vision (ICCV), 2017, pp. 4605–4613.
[41] Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, and
L. Quan, “GeoDesc: Learning local descriptors by integrating geometry constraints,” European Conference on Computer Vision (ECCV),
[42] M. Keller, Z. Chen, F. Maffra, P. Schmuck, and M. Chli, “Learning
deep descriptors with scale-aware triplet networks,” in Computer Vision and Pattern Recognition (CVPR), 2018.
[43] S. A. Winder and M. Brown, “Learning local image descriptors,” in
Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
[44] C. Wu, “VisualSFM: A visual structure from motion system,” http://ccwu.me/vsfm/, 2011.
[45] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz, “Multicore bundle
adjustment,” in Computer Vision and Pattern Recognition (CVPR),
2011, pp. 3057–3064.
[46] M. Cornick, J. Koechling, B. Stanley, and B. Zhang, “Localizing
ground penetrating radar: a step toward robust autonomous ground
vehicle localization,” Journal of Field Robotics, vol. 33, no. 1, pp.
82–102, 2016.