The Art of Computing

Thursday, June 7, 2012

Nearest Neighbor Search Using Data Projections and Hashing

1. Introduction

Random data projection with hash-based index structures has come to be the state-of-the-art method for both an efficient near neighbor search and a scalable cluster analysis for very high-dimensional data. Data of very high dimensions pose serious computational challenge because of curse-of-dimensionality [Slide]. Random projections are also commonly used for dimension reduction. An instance of random projection, Locality sensitive hashing, has widely attracted the attention of research community and is an active area of further research. This article gives resources for detailed reading on projection method.

2. Near Neighbor Search

a) Types of queries

1. Nearest neighbor queries

2. Top-k nearest neighbors queries

3. Range queries

4. r-Near neighbor queries or ball-queries

5. Approximate nearest neighbor queries (ANN )

6. All nearest-neighbor queries or Group queries or multi-object queries
7. Reverse nearest neighbor queries
8. List of other queries

b) Hashing functions for different distance metrics

1. Stable distributions for L1 Norm and L2 Norm (Euclidean distance)

2. Bit sampling for Hamming distance

3. Min-Hash for Jaccard Similarity Coefficient

4. Random Hyperplane for Cosine Similarity

5. Earth Mover's Distance

6.

c) References

2012

1. SIMP: Accurate and Efficient Near Neighbor Search in High Dimensional Spaces ( Note: Method shown to yield very efficient search result on more than 100 million 128-dimensional vectors. This is one of the largest dataset used for measuring the performance of a high-dimensional search algorithm).

2011
2010
1. Kernelized Locality-Sensitive Hashing for Scalable Image Search
2. An Algorithm and Hardware Design for Very Fast Similarity Search in High Dimensional Space

2009

1. LSB Tree: Efficient and Accurate Nearest Neighbor and Closest Pair Search in High Dimensional Space

2008

1. A Posteriori Multi-Probe Locality Sensitive Hashing

2. Modeling LSH for Performance Tuning
3. Lecture Notes

2007

1. Multi-probe LSH: efficient indexing for high-dimensional similarity search ( Caveat : )

2006

1. Lower bounds on Locality Sensitive Hashing

2. Entropy based Nearest Neighbor Search in High Dimensions

3. Near-Optimal Hashing Algorithms for Near Neighbor Problem in High Dimensions

4. Nearest Neighbor Methods in Learning and Vision: Theory and Practice

2005

1. LSH Forest: Self-Tuning Indexes for Similarity Search

2004

1. Locality-sensitive hashing scheme based on p-stable distributions

2003
2002
1. Similarity Estimation Techniques from Rounding Algorithms

2001
1. Random projection in dimensionality reduction: Applications to image and text data
2. Database-friendly Random Projections
3. Database-friendly random projections: Johnson-Lindenstrauss with binary coins

2000
1. Experiments with Random Projections

1999
1. Similarity Search in High Dimensions via Hashing
2. A Small Approximately Min-Wise Independent Family of Hash Functions

1998
1. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
2. Min-Wise Independent Permutations

1997
1. Two algorithms for nearest-neighbor search in high dimensions

1984
1. W.B. Johnson and J. Lindenstrauss. Extensions of Lipshitz mapping into Hilbert space

d) Implementations

1. Implementation of p-Stable LSH
2. LSHKit
3. SIMP
4. LSB

3. Image Search using Random Projections and LSH

1. Random projection in dimensionality reduction: Applications to image and text data
2. Efficient Near-duplicate detection and sub-image retrieval
3. Near Duplicate Image Detection: min-Hash and tf-idf Weighting
4. Geometric min-Hashing: Finding a (Thick) Needle in a Haystack
5. Caltech Work

4. Scalable Cluster Analysis using Min-Hash

1. Scalable technique for clustering the web
2. Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach
3. Dimensionality Reduction by Random Mapping: Fast Similarity Computation for Clustering

Posted by Vish at 4:44 PM

Email This BlogThis!Share to X Share to Facebook Share to Pinterest

Labels: Algorithms, Data Structures, Indexing, Infomation Retrieval, IT, Programming, Technical Interview Questions

No comments:

Post a Comment

Newer Post Older Post Home

Subscribe to: Post Comments (Atom)

About Me

Vish

View my complete profile

Blog Archive

► 2013 (1)
- ► December (1)

▼ 2012 (37)
- ► November (1)
- ► August (7)
- ► July (15)
- ▼ June (3)
- ► May (3)
- ► April (4)
- ► March (1)
- ► February (3)

Labels

Algorithms (28)
Android (1)
Data Structures (28)
Indexing (1)
Infomation Retrieval (3)
IT (33)
Linked List (1)
Machine Learning (1)
Mobile App (1)
MongoDB (1)
NoSQL (1)
Oracle (1)
Predictive Analytics (1)
Programming (31)
Programming Language (2)
Statistics (1)
String (2)
System Installation (2)
Tablet App (1)
Technical Interview Questions (30)

Followers

Total Pageviews

Watermark theme. Powered by Blogger.