is thus the probability that a near neighbor p is not reported.

E2LSH can also be used to solve the nearest neighbor problem, where, given the query q, the data structure is required to report the point in P that is closest to q. This can be done by creating several R-near neighbor data structures, for radii R = R_1, R_2, ..., where the largest radius should be greater than the maximum distance from any query point to its nearest neighbor. The nearest neighbor can then be recovered by querying the data structures in increasing order of the radii, stopping whenever the first point is found.

E2LSH is based on the locality-sensitive hashing (LSH) scheme, as described in [2]. The original locality-sensitive hashing scheme solves an approximate version of the R-near neighbor problem, called the (R, c)-near neighbor problem. In that formulation, it is sufficient to report any point within distance at most cR from the query q, if there is a point in P at distance at most R from q, with a constant probability. For the approximate formulation, the LSH scheme achieves a query time of O(n^rho), where rho < 1/c. To solve the (R, 1 - delta) formulation, E2LSH uses the basic LSH scheme to get all near neighbors (including the approximate ones), and then drops the approximate near neighbors. Thus, the running time of E2LSH depends on the data set P. In particular, E2LSH is slower for "bad" data sets, e.g., when, for a query q, there are many points from P clustered right outside the ball of radius R centered at q (i.e., when there are many approximate near neighbors).
towards removing the curse of dimensionality. Proceedings of the Symposium on Theory of Computing, 1998.

[5] V. M. Zolotarev. One-Dimensional Stable Distributions. Vol. 65 of Translations of Mathematical Monographs, American Mathematical Society, 1986.
We will describe later how we choose the parameters k and L, and what time and memory bounds they give. Next, we present our choice for the LSH family H.

3.3 LSH scheme for the l_p norm

In this section we present the LSH family H that we use in our implementation. This LSH family is based on p-stable distributions and works for all p in (0, 2]. We use exactly the same LSH family as suggested by [2]. It should be noted that the implementation as described in Chapter 2 works only for the l_2 (Euclidean) norm. Since we consider points in l_p^d, without loss of generality we can assume that R = 1, since otherwise we can scale down all the points by a factor of R.

3.3.1 p-stable distributions

Stable distributions [5] are defined as limits of normalized sums of independent identically distributed variables (an alternate definition follows). The most well-known example of a stable distribution is the Gaussian (normal) distribution. However, the class is much wider; for example, it includes heavy-tailed distributions.

Stable Distribution: A distribution D over R is called p-stable if there exists p >= 0 such that for any n real numbers v_1, ..., v_n, and i.i.d. variables X_1, ..., X_n with distribution D, the random variable sum_i v_i X_i has the same distribution as the variable (sum_i |v_i|^p)^{1/p} X, where X is a random variable with distribution D. It is known [5] that stable distributions exist for any p in (0, 2]. In particular:
Definition 1. A family H = {h : S -> U} is called locality-sensitive if, for any q, the function p(t) = Pr_H[h(q) = h(v) : ||q - v|| = t] is strictly decreasing in t. That is, the probability of collision of the points q and v is decreasing with the distance between them.

Thus, if we consider any points q, v, u with v in B(q, R) and u not in B(q, R), then p(||q - v||) > p(||q - u||). Intuitively, we could hash the points from P into some domain U, and then, at query time, compute the hash of q and consider only the points with which q collides. However, to achieve the desired running time, we need to amplify the gap between the collision probabilities for the range [0, R], where the R-near neighbors lie, and the range (R, infinity). For this purpose we concatenate several functions h in H. In particular, for a k specified later, define a function family G = {g : S -> U^k} such that g(v) = (h_1(v), ..., h_k(v)), where h_i in H. For an integer L, the algorithm chooses L functions g_1, ..., g_L from G, independently and uniformly at random. During preprocessing, the algorithm stores each v in P (the input point set) in the buckets g_j(v), for all j = 1, ..., L. Since the total number of buckets may be large, the algorithm retains only the non-empty buckets, by resorting to (standard) hashing, explained later. To process a query q, the algorithm searches all buckets g_1(q), ..., g_L(q); for each point v found in a bucket, the algorithm computes the distance from q to v, and reports the point v iff ||q - v|| <= R, i.e., iff v is an R-near neighbor.
Therefore, when choosing the optimal parameters, E2LSH takes into consideration the upper bound on the memory it can use. Note that if E2LSH starts to swap, the performance decreases by a few orders of magnitude. The user thus can specify the maximal amount of memory that E2LSH may use, which should be at most the amount of physical memory available on the system before executing E2LSH. This upper bound is specified in the file bin/mem (in bytes). If this file does not exist, the main scripts will create it with an estimate of the available physical memory.

2.5 Additional utilities

bin/exact is a utility that computes the exact R-near neighbors using the simple linear-scan algorithm. Its usage is the same as that of bin/lsh:

bin/exact R data_set_file query_set_file

bin/compareOutputs is a utility for checking the correctness of the output generated by the E2LSH package (by bin/lsh or bin/lsh_fromParams). The usage is the following:

bin/compareOutputs correct_output LSH_output

where correct_output is the output from bin/exact, and LSH_output is the output from bin/lsh or bin/lsh_fromParams for the same R, data_set_file, and query_set_file. For each query point from query_set_file, bin/compareOutputs outputs whether E2LSH's output is a subset of the output of bin/exact (in this case OK = 1); if E2LSH outputs a point that is not an R-near neighbor, or outputs some points more than once, then OK = 0. bin/compareOutputs also outputs, for each query point, the fraction of
With the new functions g_i, this probability is greater than or equal to

1 - (1 - p^{k/2})^m - m * p^{k/2} * (1 - p^{k/2})^{m-1}.

To require a success probability of at least 1 - delta, we restrict m to be such that

(1 - p^{k/2})^m + m * p^{k/2} * (1 - p^{k/2})^{m-1} <= delta.

This inequation yields a slightly higher value for L = m(m-1)/2 than in the case when the functions g_i are independent, but L is still O(log(1/delta) / p^k). The time for computing the g functions for a query point q is reduced to T_g = O(dkm) = O(dk sqrt(L)), since we need only to compute the m functions u_i. The expression for the time T_c (the time to compute the distance to the points in the buckets g_1(q), ..., g_L(q)) remains unchanged, due to the linearity of expectation.

3.5 Implementation details

3.5.1 R-NN data structure construction

For constructing the R-NN data structure, the algorithm first computes the parameters k, m, L for the data structure. The parameters k, m, L are computed as a function of the data set P, the radius R, and the probability 1 - delta, as outlined in sections 3.4 and 3.4.1. For a value of k, the parameter m is chosen to be the smallest natural number satisfying (1 - p^{k/2})^m + m * p^{k/2} * (1 - p^{k/2})^{m-1} <= delta, and L is set to m(m-1)/2. Thus, in what follows, we consider m and L as functions of k, and the question remains only of how to choose k.
algParameters specifies the parameters with which the R-NN data structure will be constructed. The function returns the constructed data structure. The function is declared in LocalitySensitiveHashing.h.

Chapter 5: Frequently Asked Questions

In this section we give answers to some questions that the reader might ask when compiling and using the package.

1. Q: How to compile this thing?
A: Since you are reading this manual, you must have already gunzipped and untarred the original file. Now it suffices to type make in the main directory to compile the code.

2. Q: OK, it compiles. Now what?
A: You can run the program on a provided data set (mnist1k.dts) and query set (mnist1k.q). They reside in the main directory. Simply type

bin/lsh 0.6 mnist1k.dts mnist1k.q >o

This will create an output file o containing the results of the search. To see if it worked, run the exact algorithm:

bin/exact 0.6 mnist1k.dts mnist1k.q >o.e

and compare the outputs by running

bin/compareOutputs o.e o

You should receive an answer that looks like

Overall OK: 1
NN_LSH / NN_Correct: 5 / 5 = 1.000

This means that the run was OK, and that the randomized LSH algorithm found 5 out of 5 (i.e., all) near neighbors within distance 0.6 from the query points. Note that, since the algorithm is randomized, your results could be different (e.g., 4 out of 5 near neighbors). However, if you are getting 0 out of 5, then probably something is wrong.
Note that there might be different optimal k's for different query points; therefore, the goal is to optimize the mean query time over all query points. We discuss the optimization procedure in more detail in section 3.5.1.

3.4.1 Faster computation of hash functions

In this section we describe a slight modification of the LSH scheme that enables a considerable reduction of the time T_g, the time necessary for computing the functions g_i. In the original LSH scheme, we choose L functions g_i = (h_{i,1}, ..., h_{i,k}), where each function h_{i,j} is chosen uniformly at random from the LSH family H. For a given point q, we need O(d) time to compute one function h_{i,j}(q), and O(dkL) time to compute all the functions g_1(q), ..., g_L(q).

To reduce the time for computing the functions g_i for the query q, we reuse some of the functions h_{i,j}; in this case, the functions g_i are not totally independent. Specifically, define functions u_i in the following manner. Suppose k is even, and m is a fixed constant. Then, for i = 1, ..., m, let u_i = (h_{i,1}, ..., h_{i,k/2}), where each h_{i,j} is drawn uniformly at random from the family H. Thus, the u_i are vectors, each of k/2 functions drawn uniformly at random from the LSH family H. Now define the functions g_i as g_i = (u_a, u_b), where 1 <= a < b <= m. Note that we obtain L = m(m-1)/2 functions g_i. Since the functions g_i are interdependent, we need to derive a different expression for the probability that a near neighbor is reported.
to choose k. For choosing the value k, the algorithm experimentally estimates the times T_g and T_c as a function of k. Remember that the time T_c is dependent on the query point q; therefore, for estimating T_c, we need to use a set S of sample query points: the estimate of T_c is then the mean of the times T_c for the points from S. The sample set S is chosen to be a set of several points chosen at random from the query set. The package also provides the option of choosing S to be a subset of the data set P. Note that to estimate T_g and T_c precisely, we need to know the constants hidden by the O(.) notation in the expressions for T_g and T_c. To compute these constants, the implementation constructs a sample data structure and runs several queries on that sample data structure, measuring the actual times T_g and T_c.

Concluding, k is chosen such that T_g + mean(T_c) is minimal (while the data structure's space requirement is within the memory bounds), where mean(T_c) is the mean of the times T_c over all points in the sample query set S, that is, mean(T_c) = (sum over q in S of T_c(q)) / |S|. Once the parameters k, m, L are computed, the algorithm constructs the R-NN data structure containing the points from P.

3.5.2 Bucket hashing

Recall that the domain of each function g_i is too large to store all possible buckets explicitly, so only the non-empty buckets are stored. To this end, for each point v, the buckets g_1(v), ..., g_L(v) are hashed using universal hash functions. For each function g_i, i = 1, ..., L, there is a hash table H_i containing the buckets {g_i(v) : v in P}.
Currently, the values 0 and 3 are supported, but we suggest using the value 3. Referring to the hash table types described in section 3.6, the value 0 corresponds to the linked-list version of the hash tables, and the value 3 to the hash tables with the hybrid storage array Y.

Chapter 3: Algorithm description

In this chapter we first describe the general locality-sensitive hashing algorithm, as in [2], but with slight modifications. Next, we gradually add more details of the algorithm, as well as the optimizations used in our implementation.

3.1 Notations

We use l_p^d to denote the d-dimensional real space R^d under the l_p norm. For any point v in R^d, the notation ||v||_p represents the l_p norm of the vector v, that is, ||v||_p = (sum_{i=1}^d |v_i|^p)^{1/p}. In particular, ||v|| = ||v||_2 is the Euclidean norm. Let the data set P be a finite subset of R^d, and let n = |P|. A point q will usually stand for the query point (the query point may be any point from R^d). Points v, u will usually stand for some points in the data set P. The ball of radius r centered at v is denoted by B(v, r). For a query point q, we call v an R-near neighbor (or simply a near neighbor) iff v in B(q, R).

3.2 Generic locality-sensitive hashing scheme

To solve the R-NN problem, we use the technique of Locality-Sensitive Hashing, or LSH [4, 3]. For a domain S of points, an LSH family is defined as follows.
For this purpose, there are two associated hash functions, h_1 : Z^k -> {0, ..., tableSize - 1} and h_2 : Z^k -> {0, ..., C}; each g_i maps into Z^k. The function h_1 has the role of the usual hash function in a universal hashing scheme, and the collisions within the hash table are resolved by chaining. The second hash function identifies the buckets within a chain. When storing a bucket g_i(v) = (x_1, ..., x_k) in its chain, instead of storing the entire vector (x_1, ..., x_k) for bucket identification, we store only h_2(x_1, ..., x_k). Thus, a bucket g_i(v) = (x_1, ..., x_k) has only the following associated information stored in its chain: the identifier h_2(x_1, ..., x_k), and the points in the bucket, i.e., the points u in P with g_i(u) = (x_1, ..., x_k).

The reasons for using the second hash function h_2, instead of storing the value g_i(v) = (x_1, ..., x_k) directly, are twofold. Firstly, by using a fingerprint h_2(x_1, ..., x_k), we decrease the amount of memory for bucket identification from O(k) to O(1). Secondly, with the fingerprint it is faster to look up a bucket in the hash table. The domain of the function h_2 is chosen big enough to ensure, with a high probability, that any two different buckets in the same chain have different h_2 values.

All L hash tables use the same primary hash function h_1 (used to determine the index in the hash table) and the same secondary hash function h_2. These two hash functions have the form

h_j(a_1, ..., a_k) = ((sum_{i=1}^k r_i^{(j)} a_i) mod prime) mod B_j,

where the r_i^{(j)} are random integers, prime is a prime number, and B_1 = tableSize, B_2 = C.
All lines except the first one define the parameters in the following way: the first line is reserved for future use; after it, each parameter is given on two lines, with the parameter's name on one line and its value on the following line. A (partial) example, for a data set of 784-dimensional points:

Dimension
784
R^2
0.280899972
Use <u> functions
1
k
20
m [# independent tuples of LSH functions]
35
L
595
W
4.000000000
T
9991
typeHT
3

The parameters R, Success probability, Dimension, k, m, L, W are the parameters that appear in the algorithm description (note that Success probability, k, m, L are interrelated values, as described in section 3.5.1). The parameter R^2 is equal to the square of R. The parameter T is reserved for specifying how many points to look through before the query algorithm stops; this parameter is not implemented yet, and therefore it is set to n.

2.6.4 The remainder of the parameter file

Note: understanding the description below requires familiarity with the algorithm of Chapter 3. The parameter Use <u> functions signals whether to use the original g functions (each of the L functions g_i is a k-tuple of LSH functions, and all kL LSH functions are independent), or whether to use g's that are not totally independent, as described in section 3.4.1. If the value of the parameter is 0, the original g's are used (and L = m); if the value is 1, the modified g's are used (and L = m(m-1)/2). The parameter typeHT defines the type of the hash table used for storing the buckets containing the data set points.
E2LSH 0.1 User Manual

Alexandr Andoni
Piotr Indyk

June 21, 2005

Contents

1 What is E2LSH? ........ 2
2 E2LSH Usage ........ 4
  2.1 Compilation ........ 4
  2.2 Main usage ........ 4
  2.3 Manual setting of the parameters of the R-NN data structure ........ 4
  2.4 Memory ........ 6
  2.5 Additional utilities ........ 6
  2.6 File formats ........ 6
    2.6.1 Data set file and query set file ........ 6
    2.6.2 Output file format ........ 7
    2.6.3 File with the parameters for the R-NN data structure ........ 7
    2.6.4 The remainder of the parameter file ........ 8
3 Algorithm description ........ 9
  3.1 Notations ........ 9
  3.2 Generic locality-sensitive hashing scheme ........ 9
  3.3 LSH scheme for the l_p norm ........ 10
    3.3.1 p-stable distributions ........ 10
    3.3.2 Hash family ........ 10
  3.4 Parameters for the LSH scheme ........ 11
    3.4.1 Faster computation of hash functions ........ 12
  3.5 Implementation details ........ 12
    3.5.1 R-NN data structure construction
besides the files with the data set points and the query points. The script constructs the data structure given these parameters and runs the queries on the constructed data structure. The usage of bin/lsh_fromParams is the following:

bin/lsh_fromParams data_set_file query_set_file data_set_params_file

The file data_set_params_file must be of the same format as the output of bin/lsh_computeParams. Note that one does not need to specify the success probability and R, since these values are embedded in the file data_set_params_file. Thus, running the following two lines

bin/lsh_computeParams R data_set_file query_set_file > data_set_params_file
bin/lsh_fromParams data_set_file query_set_file data_set_params_file

is equivalent to running

bin/lsh R data_set_file query_set_file

To modify the parameters of the R-NN data structure manually, one should modify the file data_set_params_file before running the script bin/lsh_fromParams.

For ease of use, the script bin/lsh also outputs the parameters it used for the constructed R-near neighbor data structure. These parameters are written to the file data_set_file.params, where data_set_file is the name of the supplied data set file.

2.4 Memory

E2LSH uses a considerable amount of memory: for bigger data sets, the optimal parameters for the R-NN data structure might require an amount of memory that is greater than the available physical memory.
• a Cauchy distribution D_C, defined by the density function c(x) = (1/pi) * 1/(1 + x^2), is 1-stable;
• a Gaussian (normal) distribution D_G, defined by the density function g(x) = (1/sqrt(2 pi)) * e^{-x^2/2}, is 2-stable.

From a practical point of view, note that, despite the lack of closed-form density and distribution functions, it is known [1] that one can generate p-stable random variables essentially from two independent variables distributed uniformly over [0, 1].

3.3.2 Hash family

The LSH scheme proposed in [2] uses p-stable distributions as follows: compute the dot products (a . v) to assign a hash value to each vector v. Formally, each hash function h_{a,b} : R^d -> Z maps a d-dimensional vector v onto the set of integers. Each hash function in the family is indexed by a choice of random a and b, where a is a d-dimensional vector with entries chosen independently from a p-stable distribution, and b is a real number chosen uniformly from the range [0, w]. For fixed a, b, the hash function h_{a,b} is given by

h_{a,b}(v) = floor((a . v + b) / w).

The intuition behind the hash functions is as follows. The dot product a . v projects each vector onto the real line. It follows from p-stability that, for two vectors v_1, v_2, the distance between their projections, (a . v_1 - a . v_2), is distributed as ||v_1 - v_2||_p X, where X is drawn from a p-stable distribution. If one chops the real line into equi-width segments of appropriate size w, and assigns hash values to vectors based on which segment they project onto, then it is intuitively clear that this hash function will be locality-preserving in the sense described above.
Requiring that the point q collides with v on some function g_i is equivalent to the inequation

1 - (1 - p^k)^L >= 1 - delta,

which implies that L >= log(delta) / log(1 - p^k). Since there are no other conditions on k and L, other than minimizing the running time, we choose L equal to this lower bound (rounded up). The value k is then chosen as a function of the data set, to minimize the running time of a query. Note that this is different from the LSH scheme of [2], where k is chosen as a function of the approximation factor.

For a fixed value of k, and L = L(k) as above, we can decompose the query time into two terms. The first term is T_g = O(dkL), for computing the L functions g_i for the query point q, as well as for retrieving the buckets g_i(q) from the hash tables. The second term is T_c = O(d * #collisions), for computing the distance to all points encountered in the retrieved buckets; #collisions is the number of points encountered in the buckets g_1(q), ..., g_L(q), and its expected value is

E[#collisions] = L * sum over v in P of p(||q - v||).

Intuitively, T_g increases as a function of k, while T_c decreases as a function of k. The latter is due to the fact that higher values of k magnify the gap between the collision probabilities of close and far points, which (for proper values of L) decreases the probability of collision of far points. Thus, typically there exists an optimal value of k that minimizes the sum T_g + T_c for a given query point q.
that optimize the estimated query time. However, since these parameters are only estimates, there are no guarantees that they are optimal for particular query points. Therefore, setting these parameters manually may occasionally provide better query times. Two additional scripts give the possibility of manually setting the parameters: bin/lsh_computeParams and bin/lsh_fromParams.

The first script, bin/lsh_computeParams, computes the optimal parameters for the R-NN data structure from the given data set points, and outputs the parameters to the standard output. The usage of bin/lsh_computeParams is as follows:

bin/lsh_computeParams R data_set_file query_set_file successProbability

The script outputs an estimation of the optimal parameters of the R-NN data structure for the data set points in data_set_file. If one specifies the query_set_file as the third parameter, then several of the points from the query set are used for optimizing the data structure parameters; if a dot (.) is specified instead, then several points from the data set are used for the same purpose. The output is written to the standard output, and may be redirected to a file for later use, as follows:

bin/lsh_computeParams R data_set_file query_set_file > data_set_params_file

See section 2.6 for a description of the format of the parameter file.

The second script, bin/lsh_fromParams, takes as input a file containing the parameters for the R-NN data structure,
E2LSH is also different from the original LSH scheme in that E2LSH empirically estimates the optimal parameters for the data structure, as opposed to using theoretical formulas. This is because the theoretical formulas are geared towards worst-case point sets, and are therefore less adequate for real data sets. E2LSH computes the parameters as a function of the data set P, and optimizes them to minimize the actual running time of a query on the host system.

The outline of the remaining part of the manual is as follows. Chapter 2 describes the package and how to use it to solve the near neighbor problem. In Chapter 3, we describe the LSH algorithm used to solve the (R, 1 - delta)-near neighbor problem formulation, as well as the optimizations for decreasing the running time and memory usage. Chapter 4 discusses the structure of the code of E2LSH: the main data types and modules, as well as the main functions for constructing, parametrizing, and querying the data structure. Finally, Chapter 5 contains the FAQ.

Chapter 2: E2LSH Usage

In this chapter we describe how to use our E2LSH package. First we show how to compile and use the main script of the package, and then we describe two additional scripts to use when one wants to modify or set manually the parameters of the R-NN data structure. Next, we elaborate on the memory usage of E2LSH and how to control it. Finally, we describe the file formats used by E2LSH.
• Generalization of the functions g_i = (u_a, u_b). In section 3.4.1 we presented a new approach to choosing the functions g_i: specifically, we choose functions g_i = (u_a, u_b), 1 <= a < b <= m, where each u_j, j = 1, ..., m, is a k/2-tuple of random, independent hash functions from the LSH family H. In this way, we decreased the time to compute the functions g_i(q) for a query q from O(dkL) to O(dk sqrt(L)). This approach could be generalized to functions g that are t-tuples of functions u_j, where the u_j are drawn independently from H^{k/t}, reducing in this way the asymptotic time for computing the functions g_i. We did not pursue this generalization since, even though L would remain of the same order, the constant hidden by the O(.) notation for t > 2 would probably nullify the theoretical gain for the typical data sets.

Chapter 4: The E2LSH Code

4.1 Code overview

The core of the E2LSH code is divided into three main components:

• LocalitySensitiveHashing.cpp contains the implementation of the main LSH-based R-near neighbor data structure (except the hashing of the buckets). The main functionality is for constructing the data structure, given parameters such as k, m, L, and for answering a query.

• BucketHashing.cpp contains the implementation of the hash tables for the universal hashing of the buckets. The main functionality is for constructing the hash tables, adding new buckets and points to them, and looking up a bucket.

• SelfTuning.cpp contains the functions for computing the optimal parameters for the R-near neighbor data structure.
(such as k, m, L) from the construction of the R-NN data structure itself.

To construct the R-NN data structure, given as input 1 - delta, R, d, and the data set P, one can use the function

PRNearNeighborStructT initSelfTunedRNearNeighborWithDataSet(RealT thresholdR, RealT successProbability, Int32T nPoints, IntT dimension, PPointT *dataSet, IntT nSampleQueries, PPointT *sampleQueries);

The function estimates the optimal parameters (k, m, L) and constructs an R-NN data structure with these parameters. The parameters of the function are the input data of the algorithm: R, 1 - delta, n, d, and, respectively, P. The parameter sampleQueries represents a set of sample query points: the function optimizes the parameters of the constructed data structure for the points specified in this set; sampleQueries could be a sample of points from the actual query set, or from the data set P. nSampleQueries specifies the number of points in the set sampleQueries. The return value of the function is the R-NN data structure that is constructed. This function is declared in NearNeighbors.h.

For a query operation, one can use the function

Int32T getRNearNeighbors(PRNearNeighborStructT nnStruct, PPointT queryPoint, PPointT *&result, IntT &resultSize);

The parameters of the function have the following meaning:

• nnStruct: the R-NN data structure on which to perform the query;
• queryPoint: the query point;
that says whether that bucket is the last one in the chain or not; this bit is one of the unused bits of the 4-byte integer storing the index of the first point in the corresponding bucket (i.e., if the h_2 value of the bucket is stored at position e in Y, then we use a high-order bit of the integer at position e + 1 in Y). For storing the length of the bucket, we use the remaining unused bits of the first point in the bucket. When the remaining bits are not enough to represent the length, we store a special value (0) for the length, which means that there are more points in the bucket than can be counted in those bits, and that some additional points do not fit in the integers allotted in Y right after the h_2 value of the bucket. These additional points are also stored in Y, but at a different position; their start index and number are stored in the unused bits of the integers that follow the h_2 value of the bucket and the first point of the bucket (i.e., the unused bits of the integers at positions e + 2 and onwards).

3.7 Future possible optimizations

• Parameter w of the LSH scheme. In addition to optimizing k, the algorithm could also optimize the parameter w to achieve the best query time. The function p(c) depends on the parameter w, and thus both times T_g and T_c depend on w. Currently, we use a fixed value of w = 4, but the optimal value of w is a function of the data set P and a sample query set.
Precomputing h_1(g_i(q)) and h_2(g_i(q)) in the straightforward way would take O(Lk) time. However, since the g_i are of the form g_i = (u_a, u_b), we can reduce this time in the following way. Note that each function h_j, j = 1, 2, is of the form

h_j(x_1, ..., x_k) = ((sum_{i=1}^k r_i^{(j)} x_i) mod A) mod B_j.

Therefore, h_j(g_i(q)) = h_j(u_a(q), u_b(q)) may be computed as the sum of two partial sums, reduced mod A and then mod B_j: a "left" partial sum over the coordinates of u_a(q) (with the coefficients r_1^{(j)}, ..., r_{k/2}^{(j)}) and a "right" partial sum over the coordinates of u_b(q) (with the coefficients r_{k/2+1}^{(j)}, ..., r_k^{(j)}). Thus, it suffices to precompute the left and right partial sums (mod A) for each u_a(q), for j = 1, 2 and a = 1, ..., m, which takes only O(km) time; afterwards, each value h_j(g_i(q)) can be combined from two precomputed partial sums in O(1) time.

• Skipping repeated points. In the basic query algorithm, a point v in P might be encountered more than once: specifically, a point v might appear in more than one of the buckets g_1(q), ..., g_L(q). Since there is no need to compute the distance to any point v in P more than once, we keep track of the points for which we have already computed the distance ||q - v||, and do not compute the distance a second time. For this purpose, for each particular query, we keep a vector e such that e_v = 1 if we have already encountered the point v in an earlier bucket (and computed the distance ||q - v||), and e_v = 0 otherwise. The first time we encounter a point v in the buckets, we compute the distance to it, which takes O(d) time; for all subsequent encounters of v, we spend only O(1) time for checking the vector e.
One can compute the probability that two vectors v_1, v_2 collide under a hash function drawn uniformly at random from this family. Let f(t) denote the probability density function of the absolute value of the p-stable distribution (we drop the subscript p whenever it is clear from the context). For the two vectors v_1, v_2, let c = ||v_1 - v_2||_p. For a random vector a whose entries are drawn from a p-stable distribution, a.v_1 - a.v_2 is distributed as cX, where X is a random variable drawn from a p-stable distribution. Since b is drawn uniformly from [0, w], it is easy to see that

p(c) = Pr_{a,b}[h_{a,b}(v_1) = h_{a,b}(v_2)] = integral from 0 to w of (1/c) f(t/c) (1 - t/w) dt.

For a fixed parameter w, the probability of collision p(c) decreases monotonically with c = ||v_1 - v_2||_p, satisfying Definition 1. The optimal value for w depends on the data set and the query point, but it was suggested in [2] that w = 4 provides good results, and therefore we currently use the value w = 4 in our implementation.

3.4 Parameters for the LSH scheme

To use LSH, we need to specify the parameters k and L. From the problem formulation, specifically from the requirement that a near neighbor is reported with a probability at least 1 - delta, we can derive a necessary condition on k and L. Consider a query point q and a near neighbor v in B(q, R). Let p = p(R) (recall that we assumed R = 1, so p = p(1)). Then Pr_G[g(q) = g(v)] >= p^k; thus, q and v fail to collide on all L functions g_i with probability at most (1 - p^k)^L.
Collisions are resolved using chaining, as explained in sections 3.5.2 and 3.6. There are two main types of hash tables, HT_LINKED_LIST and HT_HYBRID_CHAINS; the field typeHT contains the type of the hash table. The type HT_LINKED_LIST corresponds to the linked-list version of the hash table, and HT_HYBRID_CHAINS to the one with the hybrid storage array Y (see section 3.6 for more details). Each hash table also contains pointers to the descriptions of the functions h_1 and h_2 used for the universal hashing. The structure is defined in BucketHashing.h.

• RNNParametersT: a struct containing the parameters necessary for constructing the RNearNeighborStructureT data structure. It is defined in LocalitySensitiveHashing.h.

• PPoint: a struct for storing a point from P, or a query point. This structure contains the coordinates of the point, the square of the norm of the point, and an index, which can be used by the caller (code outside the E2LSH core, such as LSHMain.cpp) for identifying the point; for example, it might be the index of the point in the set P (this index is not used within the core of E2LSH).

4.2 E2LSH Interface

The following are the functions at the interface of the E2LSH core code. The first two functions are sufficient for constructing the R-NN data structure and querying it afterwards; they are declared in the file NearNeighbors.h. The following two functions provide a separation of the estimation of the parameters
3. Q: I ran the code, but it is so slow...
A: If the code is unusually slow, it might mean two things:

• The code uses so much memory that the system starts swapping. This typically degrades the performance by 3 orders of magnitude, so it should definitely be avoided. To fix it, specify the amount of available (not total) memory in the file bin/mem.

• The search radius you specified is so large that many (most) of the data points are reported as near neighbors. If this is what you want, then the code will not run much faster; in fact, you might be better off using a linear scan instead, since it is simpler and has less overhead. Otherwise, try to adjust the search radius so that, on average, there are few near neighbors per query. You can use bin/exact for the experiments; it is likely to be faster than LSH if you perform just a few queries.

Bibliography

[1] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. J. Amer. Statist. Assoc., 71:340-344, 1976.

[2] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. DIMACS Workshop on Streaming Data Analysis and Mining, 2003.

[3] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), 1999.

[4] P. Indyk and R. Motwani. Approximate nearest neighbors:
…optimal parameters for the R-near-neighbor data structure; contains all the functions for estimating the times Tg and Tc, including the functions for estimating collisions.

Additional code making up the core is contained in the following files:

- Geometry.h: contains the definition of the point data type, PPointT.
- NearNeighbors.cpp, NearNeighbors.h: contain the functions at the interface of the E2LSH core (see a more detailed description below).
- Random.cpp, Random.h: contain the pseudo-random number generator.
- BasicDefinitions.h: contains the definitions of general-purpose types, such as IntT and RealT, and macros, such as the macros for timing operations.
- Utils.cpp, Utils.h: contain some general-purpose functions (e.g., copying vectors).

Another important part of the package is the file LSHMain.cpp, which is sample code for using E2LSH. LSHMain.cpp only reads the input files, parses the command-line parameters, and calls the corresponding functions from the package.

The most important data structures are:

- RNearNeighborStructureT: the fundamental R-near-neighbor data structure. This structure contains the parameters used to construct the structure, the description of the functions g, the index of the points in the structure, as well as pointers to the L hash tables for storing the buckets. The structure is defined in LocalitySensitiveHashing.h.

- UHashStructureT: the structure defining a hash table…
…the query point.
- result: the array where the near neighbors are to be stored; if the size of this array is not sufficient for storing the near neighbors, the array is reallocated to fit all near neighbors.
- resultSize: the size of the result array; if result is resized, this value is changed accordingly.

The function getRNearNeighbors returns the number of near neighbors that were found. The function is declared in the file NearNeighbors.h.

For estimating the optimal parameters for an R-NN data structure, one can use the function

    RNNParametersT computeOptimalParameters(RealT R, RealT successProbability, IntT nPoints, IntT dimension, PPointT dataSet, IntT nSampleQueries, PPointT sampleQueries)

The parameters of the function are the input data of the algorithm: R, 1−δ, n, d, and, respectively, P. The parameter sampleQueries represents a set of sample query points; the function optimizes the parameters of the data structure for the points specified in the set sampleQueries. sampleQueries could be a sample of points from the actual query set or from the data set P. nSampleQueries specifies the number of points in the set sampleQueries. The return value is the structure with the optimal parameters. This function is declared in SelfTuning.h.

For constructing the R-NN data structure from the optimal parameters, one can use the function

    PRNearNeighborStructT initLSH_WithDataSet(RNNParametersT algParameters, Int32T…
…some additional useful utilities, as well as the formats of the data files of the package. All the scripts and programs should be located in the bin directory, relative to the E2LSH package root directory.

2.1 Compilation

To compile the E2LSH package, it is sufficient to run make from E2LSH's root directory. It is also possible to compile by running the script bin/compile from E2LSH's root directory.

2.2 Main Usage

The main script of E2LSH is bin/lsh. It is invoked as follows:

    bin/lsh R data_set_file query_set_file [successProbability]

The script takes as its parameters the name data_set_file of the file with the data-set points and the file query_set_file with the query points; the format of the files is described in Section 2.6. Given these files, E2LSH constructs the optimized R-NN data structure and then runs the queries on the constructed data structure. The values R and successProbability specify the parameters R and 1−δ of the (R, 1−δ)-near-neighbor problem that E2LSH solves. Note that successProbability is an optional parameter; if not supplied, E2LSH uses a default value of 0.9 (90% success probability).

2.3 Manual setting of the parameters of the R-NN data structure

As described in Chapter 3, the LSH data structure needs three parameters, denoted by k, L, and m (where m ≈ √L). However, the script bin/lsh computes those parameters automatically in the first stage of data-structure construction. The parameters are chosen so th…
…mod tableSize, and

    h2(a1, …, ak) = ( Σ_{i=1}^{k} r''_i · a_i ) mod prime,

where the r'_i and r''_i are random integers, tableSize is the size of the hash tables, and prime is a prime number. In the current implementation, tableSize = |P|, the values a_i are represented by 32-bit integers, and prime is equal to 2^32 − 5. This value of prime allows fast hash-function computation without using modulo operations.

Specifically, consider computing h2(a1) for k = 1. We have that

    h2(a1) = (r''_1 · a1) mod (2^32 − 5) = ( low[r''_1 · a1] + 5 · high[r''_1 · a1] ) mod (2^32 − 5),

where low[r''_1 · a1] are the low-order 32 bits of r''_1 · a1 (a 64-bit number), and high[r''_1 · a1] are the high-order 32 bits of r''_1 · a1. If we choose r''_1 from the range {1, …, 2^29}, we will always have that

    α = low[r''_1 · a1] + 5 · high[r''_1 · a1] < 2 · (2^32 − 5).

This means that h2(a1) = α if α < 2^32 − 5, and h2(a1) = α − (2^32 − 5) if α ≥ 2^32 − 5.

For k > 1, we compute the sum Σ_{i=1}^{k} r''_i · a_i mod prime progressively, always keeping the partial sum modulo 2^32 − 5, using the same principle as above. Note that the range of the function h2 thus becomes {0, …, 2^32 − 6}.

3.5.3 Additional optimizations

We use the following additional optimizations:

- Precomputation of g_i(q), h1(g_i(q)), and h2(g_i(q)). To answer a query, we first need to compute g_i(q). As mentioned in Section 3.4.1, since g_i = (u_a, u_b), we need only compute the (k/2)-tuples u_a(q), for a = 1, …, m. Further, for searching for the buckets g_i(q) in the hash tables, we in fact need only h1(g_i(q))…
…Note that with this optimization, the estimation of Tc is no longer accurate. This is because Tc is the time for computing the distance to the points encountered in the buckets g_1(q), …, g_L(q). Taking into consideration only the distance computations (i.e., assuming we spend no time on subsequent copies of a point), the new expression for Tc is

    E[Tc] = d · Σ_{p'∈P} [ 1 − (1 − p1(q, p')^{k/2})^m − m · p1(q, p')^{k/2} · (1 − p1(q, p')^{k/2})^{m−1} ],

where p1(q, p') is the probability that a single function u_a collides on q and p'.

3.6 Memory

The R-NN data structure described above requires O(nL) memory: for each function g_i, we store the n points from P. Since L increases with k, the memory requirement is large for large data sets, or for moderate data sets for which the optimal time is achieved with higher values of k. Therefore, an upper limit on memory imposes an upper limit on k.

Because the memory requirement is large, the constant in front of O(nL) is very important. In our current implementation, with the best variant of the hash tables, this constant is 12 bytes. Note that it is the structure and layout of the L hash tables that plays the substantial role in the memory usage.

Below we show two variants of the layout of the hash tables that we deployed. We assume that: (1) the number of points is n ≤ 2^20; (2) each pointer is 4 bytes long; (3) tableSize = n for each hash table.

One of the most straightforward layouts of a hash table H is the following. For each index l of the hash table, we store a pointer to a singly-linked list of buckets in the chain. For each bucket…
…the R-near neighbors that E2LSH manages to find. Finally, the script outputs the overall statistics: the number of OKs for all queries, as well as the ratio of the number of R-near neighbors found by E2LSH to their actual number (as determined by bin/exact).

2.6 File formats

2.6.1 Data set file and query set file

Both the file for the data set and the file for the query set (data_set_file and query_set_file) are text files with the following format:

    coordinate_1_of_point_1 coordinate_2_of_point_1 … coordinate_D_of_point_1
    coordinate_1_of_point_2 coordinate_2_of_point_2 … coordinate_D_of_point_2
    …
    coordinate_1_of_point_N coordinate_2_of_point_N … coordinate_D_of_point_N

Each entry coordinate_j_of_point_i is a real number.

2.6.2 Output file format

The output of E2LSH is twofold. The main results are directed to the standard output (cout). The output stream has the following format:

    Query point i: found x NNs. They are: …
    Total time for R-NN query: y

Additional information is reported to the standard error (cerr).

2.6.3 File with the parameters for the R-NN data structure

The file with the parameters for the R-NN data structure is the output of bin/lsh_computeParams and the command-line parameter data_set_params_file for the script bin/lsh_fromParams. It specifies the estimate of the optimal parameters for a specific data set and for a specific machine. Below is an example of such a file:

    1 R 0.53 Success probability 0.…
…structure construction  12
3.5.2 Bucket hashing  13
3.5.3 Additional optimizations  14
3.6 Memory  14
3.7 Future possible optimizations  15
4 The E2LSH Code  16
4.1 Code overview  16
4.2 E2LSH Interface  17
5 Frequently Anticipated Questions  20

Chapter 1

What is E2LSH?

Short answer: E2LSH (Exact Euclidean LSH) is a package that provides a randomized solution for the high-dimensional near-neighbor problem in the Euclidean space l2. After preprocessing the data set, E2LSH answers queries, typically in sublinear time, with each near neighbor being reported with a certain probability. E2LSH is based on the Locality-Sensitive Hashing (LSH) scheme described in [2].

Long answer: The R-near-neighbor problem is defined as follows. Given a set of points P ⊂ R^d and a radius R > 0, construct a data structure that answers the following queries: for a query point q, find all points p ∈ P such that ||q − p||2 ≤ R, where ||q − p||2 is the Euclidean distance between q and p.

E2LSH solves a randomized version of this problem, which we call the (R, 1−δ)-near-neighbor problem. In this case, each point p satisfying ||q − p||2 ≤ R has to be reported with probability at least 1−δ…
…we store its value h2 and a pointer to a singly-linked list of the points in the bucket. The memory requirement per hash table is 4·tableSize + 8·(#buckets) + 8·n ≤ 20n, yielding a constant of 20.

To reduce this constant to 12 bytes, we do the following. Firstly, we index all points in P, so that we can refer to points by index; this index is constant across all hash tables. Referring to a point thus takes only 20 bits, and not 32 as in the case of a pointer.

Consider now a hash table H. For this hash table, we deploy a table Y of 32-bit unsigned integers that stores all the buckets, with their h2 values and the points in the buckets; thus, Y is a hybrid storage table, since it stores both bucket and point descriptions. The table has a length of (#buckets) + n and is used as follows. In the hash table H, at index l, we store a pointer to some index e of Y; e is the start of the description of the chain l. A chain is stored as follows: the h2 value of the first bucket in the chain (at position e in Y), followed by the indices of the points in this bucket (positions e+1, …, e+n_1); the h2 value of the second bucket in the chain (position e+n_1+1), followed by the indices of the points in this second bucket (positions e+n_1+2, …, e+n_1+1+n_2); and so forth. Note that we also need to store the number of buckets in each chain, as well as the number of points in each bucket. Instead of storing the chain length, we store for each bucket a bit t…