With the boom of powerful LLMs, everything around embeddings and vector stores has become a pretty hot topic. Nowadays, it's fast search over embeddings that makes it easy to add long-term memory to a chatbot or recommend similar videos or songs to your users.
However, when you have a large number of embeddings, searching for the nearest neighbors becomes a challenge. Comparing each embedding with all others is not an option: the complexity of such a solution would be O(n^2). Because of that, we need some kind of trick to reduce the complexity to O(n*log(n)) or even O(n). These techniques are called Approximate Nearest Neighbor (ANN) algorithms. They trade off some search accuracy for speed.
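For reference, the exact baseline is a simple brute-force scan over all vectors; a minimal sketch with NumPy (function and variable names are my own, not taken from the benchmark code):

    import numpy as np

    def brute_force_knn(query: np.ndarray, vectors: np.ndarray, k: int = 30) -> np.ndarray:
        """Exact nearest neighbors by cosine similarity: O(n) per query, O(n^2) for all pairs."""
        # Normalize so that the dot product equals cosine similarity.
        vectors_n = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        query_n = query / np.linalg.norm(query)
        similarities = vectors_n @ query_n
        # Indices of the k most similar vectors, best first.
        return np.argsort(-similarities)[:k]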
All data and charts were generated using these notebooks.
There are a lot of ready-made products and solutions that come with vector search functionality included, such as Chroma or Pinecone. However, my goal was to extend a column-oriented database with a vector search feature.
This is a non-trivial task, because most ANN algorithms are designed to work with stateful in-memory data structures called indexes. That works really well in row-oriented databases, which have native support for in-memory indexes. Most column-oriented databases, however, don't support such indexing, because the data is sorted in column-oriented projections which themselves act as an index.
Therefore I had to find a way to bypass this limitation, and for that I needed ANN techniques that don't produce a very heavy index representation, because I would have to store it in the database and load and recreate it on every search query. While doing research I stumbled upon a technique called Locality Sensitive Hashing (LSH), which seemed to be a perfect fit for my use case.
A basic LSH implementation can look like this:
    import uuid

    import numpy as np


    class HyperBin:
        def __init__(self, num_hyperplanes: int, num_dims: int) -> None:
            self.uuid = str(uuid.uuid4())
            # Random hyperplanes through the origin, normalized to unit length.
            self.hyperplanes = np.random.normal(size=(num_hyperplanes, num_dims))
            norms = np.linalg.norm(self.hyperplanes, axis=1, keepdims=True)
            self.hyperplanes = self.hyperplanes / norms

        def get_bin(self, point: np.ndarray) -> int:
            # The sign of each dot product tells on which side of the hyperplane the point lies.
            dot_products = np.dot(self.hyperplanes, point)
            bin_index = np.where(dot_products > 0, 1, 0)
            # Interpret the sign pattern as a binary number -> bucket id.
            bin_index = int("".join(bin_index.astype(str)), 2)
            return bin_index
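To make the idea concrete, here is a hypothetical usage sketch (the bucketing dict and variable names are mine, not part of the original code): vectors are grouped by their bin id at indexing time, and at query time only the query's bucket has to be scanned.

    import numpy as np

    num_dims = 384
    lsh = HyperBin(num_hyperplanes=5, num_dims=num_dims)

    # Indexing: group vector ids by their bucket id.
    vectors = np.random.normal(size=(10_000, num_dims))
    buckets: dict[int, list[int]] = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(lsh.get_bin(v), []).append(i)

    # Querying: only the vectors in the query's bucket are candidates.
    query = np.random.normal(size=num_dims)
    candidate_ids = buckets.get(lsh.get_bin(query), [])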
But before putting anything into the database, I decided to first benchmark LSH against the brute-force approach in plain Python, using embeddings generated with the all-MiniLM-L6-v2 model. I ran this test on a subset of 10k embedding vectors, searching for the 30 nearest neighbors of 100 test samples. I also included a simple heuristic which creates multiple LSH indexes and then joins the results from each of them (a rough sketch follows below). This way I can increase the number of candidates returned by LSH and improve the probability of finding the true nearest neighbors.
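The exact code for that heuristic is in the notebooks; a minimal sketch of how such a multi-index union could look, reusing the hypothetical buckets dict from the earlier example (all names here are my own assumptions):

    import numpy as np

    def multi_index_candidates(query: np.ndarray,
                               indexes: list[HyperBin],
                               buckets_per_index: list[dict[int, list[int]]]) -> set[int]:
        """Union of candidate ids coming from several independent LSH indexes."""
        candidates: set[int] = set()
        for lsh, buckets in zip(indexes, buckets_per_index):
            # Each index hashes the query into its own bucket; take the union of all of them.
            candidates.update(buckets.get(lsh.get_bin(query), []))
        return candidates

The union grows the candidate pool, so recall goes up at the cost of re-ranking a few more candidates with the exact distance.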
The results were not satisfying at all. The search accuracy was reasonable only for 4 and 5 hyperplanes, and that is not a great game changer, because it gives at most 32 possible hash values, which translates to only about an order-of-magnitude speedup.
It seemed fairly obvious that the main drawback here is the naive way of creating the hyperplanes, so I decided to try to improve the hyperplane creation process. I tried two approaches:
Then I was curious how expensive computing the KMeans clusters is.
Wow! The sklearn implementation of KMeans is really fast. Less than 4s to index 100k vectors? I can live with that.
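The timing code itself is in the notebooks; a rough sketch of how such a measurement could look with sklearn (the cluster count and random data are placeholder assumptions):

    import time

    import numpy as np
    from sklearn.cluster import KMeans

    vectors = np.random.normal(size=(100_000, 384)).astype(np.float32)

    start = time.perf_counter()
    # 32 clusters mirrors the 2^5 buckets of the 5-hyperplane LSH variant (my assumption).
    kmeans = KMeans(n_clusters=32).fit(vectors)
    print(f"fit took {time.perf_counter() - start:.2f}s")

    # The cluster centers can play the role of buckets: a vector's bucket is its nearest centroid.
    bucket_ids = kmeans.predict(vectors)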
But still, I was not sure if this was a really good result, so I decided to compare it with the performance of two cutting-edge techniques: HNSW and Annoy.
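The benchmark code lives in the notebooks; as an illustration, building and querying the two indexes could look roughly like this (the parameter values are placeholders, not necessarily the ones used in the benchmark):

    import numpy as np
    import hnswlib
    from annoy import AnnoyIndex

    dim = 384
    vectors = np.random.normal(size=(10_000, dim)).astype(np.float32)
    query = vectors[0]

    # HNSW (hnswlib): graph-based index.
    hnsw = hnswlib.Index(space="cosine", dim=dim)
    hnsw.init_index(max_elements=len(vectors), ef_construction=200, M=16)
    hnsw.add_items(vectors, np.arange(len(vectors)))
    hnsw.set_ef(50)
    labels, distances = hnsw.knn_query(query, k=30)

    # Annoy: forest of random projection trees.
    annoy = AnnoyIndex(dim, "angular")
    for i, v in enumerate(vectors):
        annoy.add_item(i, v.tolist())
    annoy.build(50)  # 50 trees
    neighbours = annoy.get_nns_by_vector(query, 30)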
And these were the results:
Damn. It seems that, at least in this case, KMeans LSH outperforms both HNSW and Annoy. However, that does not mean it is generally better; I've tested it only on this particular dataset with 384-dimensional embeddings.
I was also curious how long it takes to build, load, and save the index for each of these methods, so I measured that too:
Here we can see that the Annoy index with 50 trees, which is still not a lot, is already 213MB. But I have to admit that the indexing time is really good, and they've also done a great job with index load speed by utilizing mmap.
In comparison, HNSW does not look that great. It is much slower to both build and load from disk, although the index size is a bit smaller.
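For completeness, and continuing the sketch above, persisting and reloading the two indexes looks roughly like this (file names are placeholders); Annoy's load() memory-maps the file instead of reading it into RAM, which is why reopening it is so fast:

    # Annoy: save() writes the trees to disk, load() memory-maps the file (mmap),
    # so reopening an existing index is almost instant.
    annoy.save("vectors.ann")
    annoy_loaded = AnnoyIndex(dim, "angular")
    annoy_loaded.load("vectors.ann")

    # hnswlib: the whole graph is deserialized back into memory on load,
    # which takes noticeably longer, but the file on disk is smaller.
    hnsw.save_index("vectors_hnsw.bin")
    hnsw_loaded = hnswlib.Index(space="cosine", dim=dim)
    hnsw_loaded.load_index("vectors_hnsw.bin", max_elements=len(vectors))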
I feel that this can be a valuable insight into ANN algorithms. Honestly, I assumed that, because of the curse of dimensionality, there was no way to make LSH perform similarly to HNSW or Annoy. But it seems that is not true.
If you want to learn more, Pinecone has made excellent tutorials that guide you step by step through different vector search algorithms. You can find them here.