Space-efficient machine learning feature stores using probabilistic data structures - a benchmark

In this post, we describe a technique for providing machine learning feature stores with sublinear space requirements, and present a benchmark that uses a Bloom filter as the backing data store. Such feature stores can be an effective alternative to the commonly used key-value-store-based feature stores in certain situations.

Enno Shioji

Head of Engineering


The problem

When building Machine Learning (ML) applications - such as recommender systems - there is often a need to provide a "feature store" which can enrich the request to the system with additional ML features.

For example, whether a user has looked at an article before is often very informative about whether they will click on or buy that article this time. So companies keep a record of which articles their users have recently clicked on or bought, and use this data in their recommender systems. Other commonly used data include past browsing history, purchase history, and user information such as demographics and explicitly shared preferences.

These data are usually stored in key-value stores like Redis, using the user ID as the key and the features as the value.

When a request is made to the recommender system, a query is made to this key-value store using the user ID, and the retrieved features are fed to the recommendation algorithm together with the data contained in the original request. When there are many users, these feature stores can easily get very large.
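For a rough sense of what this looks like, below is a minimal sketch of such a lookup, assuming the features are stored as JSON blobs in Redis and accessed via the redis-py client (the key layout and client setup are illustrative, not the actual implementation):

import json

import redis

client = redis.Redis(host='localhost', port=6379)

def fetch_user_features(user_id: int) -> dict:
    """Fetch pre-computed historical features for a user; one network round trip per request."""
    raw = client.get(f'features:{user_id}')
    return json.loads(raw) if raw is not None else {}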

This creates significant challenges in terms of the development and operation of ML applications.

  • They add to the processing time: Adding a network call commonly adds 2-10ms to your response time. To make matters worse, it also adds a lot of variance to the response time, because message sizes vary across users
  • Additional hosting/maintenance costs: Distributed databases with strict performance requirements can be expensive to host
  • Additional operational complexities: Operations like backfills can become very expensive to set up and execute
  • Development complexities: An external database adds a dependency to the application code, which adds complexity to the development/testing process (like having to pre-populate this DB for tests). Intrusive performance optimizations like size limits, aggregations and prioritization of users are often necessary, which adds development time and increases the coupling between model design and infrastructure
  • Multiple lookups can be prohibitively expensive: For example, imagine you want to rank a thousand products and retrieve features for each of them - this would be extremely difficult with an external database under a strict latency budget. Another hypothetical example is retrieving features for composite keys (interactions), e.g. "How many times were products X and Y bought together?". If the feature state is small enough to live in the same process's memory, multiple look-ups are far cheaper and thus feasible.

The solution

What if, instead of having a big, unwieldy database, we could read a much smaller dataset into memory and query that as a feature store from within the process? This is essentially what we can do with "sketching" data structures, a type of probabilistic data structure.

Sketching data structures can store large amounts of data in a compact (sublinear) space at the expense of accuracy. In other words, they store a "summary" of the original data - essentially a lossy compression algorithm for your features. Just like JPEG compression for images, they can compress input data at varying "compression levels": a low compression level means better quality but a larger size, and a high compression level means lower quality but a smaller size.

This allows us to trade off accuracy for space. As we will see below, the trade-off is highly favorable - a very small sacrifice in accuracy can save a lot of space.

In this article we will only describe and benchmark Bloom-filter-backed feature stores in detail, but theoretically, other sketching data structures like HyperLogLog, Count-Min Sketch, Quotient Filters etc. could be used, too.

Benchmark of a sketching-data-structure-based feature store backed by a Bloom filter

Below is a benchmark based on a real-life click prediction dataset. It shows that a prediction model that uses a Bloom-filter-based feature store can achieve the same level of prediction accuracy and throughput with a vastly smaller feature state that can easily fit into memory.

Benchmark setting

We used a real-life click prediction dataset which has two types of features:

  • Request features: Features that are immediately available in the request, like country, article id, device type, context URL and so on
  • Historical features: Features that are based on accumulated historical data, like browsing history, purchase history, preferences that were saved in the past etc.

The historical features were aggregated using count, max etc. (e.g. how many times did a user browse an item, when did they last look at it) and were then discretized to yield categorical features, which were stored in the feature stores.
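As an illustration of this preprocessing step, here is a minimal sketch of such a discretization (the bucket boundaries and feature name are made up for this example and are not those used in the benchmark):

def discretize_view_count(raw_count: int) -> str:
    """Turn a raw 'how many times did the user view this item' count into a categorical feature."""
    if raw_count == 0:
        bucket = '0'
    elif raw_count < 5:
        bucket = '1-4'
    elif raw_count < 20:
        bucket = '5-19'
    else:
        bucket = '20+'
    return f'view_count={bucket}'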

The training data had about 5.7 million examples. Of these, 2.8 million had historical data (the rest had only request features). Combined, the data comprised 1.762 billion data points after feature extraction.

Finally, a logistic regression classifier was used to predict clicks. Our variants were as follows:

  • No history: A model without a feature store (so that it could only use request features)
  • Uncompressed history: A model that simulated use of a conventional feature store (the features were pre-fetched)
  • Compressed history: A model that used a Bloom-filter-based compressed feature store

Implementation of the Bloom-filter-based compressed feature store

Below is a simplified Python implementation that illustrates how the feature store works. Given a user_id, it returns which articles that user has looked at before. This is not the actual implementation used in the benchmark; the benchmark used a JVM-based implementation that was more general in nature (it stored arbitrary categorical features).

from typing import Set

from bloom_filter import BloomFilter  # e.g. the "bloom-filter" package on PyPI


class FeatureStore:
    def __init__(self, store: BloomFilter):
        self.store = store
        # Set of every article ID ever added; needed to reconstruct a user's history later
        self.possible_articles: Set[int] = set()

    def add(self, user_id: int, article_ids: Set[int]) -> None:
        """Record that the user has looked at the given articles."""
        for article_id in article_ids:
            self.possible_articles.add(article_id)
            composite_key = f'{user_id}^{article_id}'
            self.store.add(composite_key)

    def retrieve_articles(self, user_id: int) -> Set[int]:
        """Reconstruct the set of articles the user has (probably) looked at before."""
        ret = set()
        for article_id in self.possible_articles:
            composite_key = f'{user_id}^{article_id}'
            # Membership tests can return false positives, but never false negatives
            if composite_key in self.store:
                ret.add(article_id)
        return ret
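For completeness, a usage sketch - assuming the bloom_filter package from PyPI, whose filter is sized up front via max_elements and error_rate (both numbers below are illustrative):

store = FeatureStore(BloomFilter(max_elements=10_000_000, error_rate=0.01))

store.add(user_id=42, article_ids={101, 205})
store.add(user_id=7, article_ids={205})

print(store.retrieve_articles(42))   # {101, 205}, possibly with rare false positives
print(store.retrieve_articles(999))  # usually an empty set - false negatives cannot occur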

The most important element to point out is the additional state self.possible_articles. This holds the set of all possible features (in this case, all article IDs), and the code brute-forces over all of them in order to reconstruct the set of articles viewed by the user. This may appear very expensive, but in practice it is cheap relative to the total processing; in my simple benchmark, the difference was undetectable. It is also worth noting that this process could be optimized, for example through the use of binary search, and/or by only querying for important features.

The compressed history variant had a parameter that determined the level of compression: a higher compression level meant lower quality and a smaller size, and a lower compression level meant higher quality and a larger size. What do we mean by "quality" here? In a nutshell, the Bloom filter tells us whether a binary categorical feature is present (1) or not (0). When the Bloom filter says a feature is NOT present, it is always correct - there are no false negatives. However, when it says a feature IS present, it can be wrong. In other words, with some probability we will mistakenly set a feature value to 1 when in fact it should have been 0 (a false positive). This adds noise to our model's input. This probability can be tuned via a parameter, and the higher the false positive rate, the smaller the state size.
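To get a feel for this trade-off, the standard sizing formula for a Bloom filter, m = -n * ln(p) / (ln 2)^2, gives the number of bits m needed to hold n keys at false positive rate p. A quick sketch (the key count below is an assumed, illustrative volume, not the benchmark's):

import math

def bloom_filter_size_bytes(n_keys: int, false_positive_rate: float) -> int:
    """Approximate size of a Bloom filter holding n_keys at the given false positive rate."""
    bits = -n_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    return round(bits / 8)

# e.g. for 100 million user^article composite keys:
for p in (0.01, 0.05, 0.2):
    size_mb = bloom_filter_size_bytes(100_000_000, p) / 1_000_000
    print(f'p={p}: ~{size_mb:.0f} MB')  # the looser the tolerated error rate, the smaller the state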

For more details on how this compression level parameter works, and on how Bloom filters work more generally and their characteristics, see one of the many introductions to Bloom filters available online.

As an evaluation metric, we used click ROC-AUC (Area Under the Receiver Operating Characteristic Curve), a common metric for recommender systems.
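For reference, this is the ranking metric computed by, for example, scikit-learn's roc_auc_score (the labels and scores below are toy values, not taken from the benchmark):

from sklearn.metrics import roc_auc_score

# y_true: whether a click actually happened; y_score: the model's predicted click probability
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(roc_auc_score(y_true, y_score))  # 1.0 means perfect ranking, 0.5 means random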

Result

The scatter plot below shows the AUC (y-axis) of the classifier at varying compression levels (x-axis: size of the feature store in bytes, on a logarithmic scale). The dotted green line is the AUC with an equivalent key-value-store-based feature store (i.e. Uncompressed history). The dotted red line is the AUC without any history features (i.e. No history).

AUC plotted against feature state size

As expected, our Bloom-filter-backed feature store achieves performance between the two lines (uncompressed ~= 0.80 and no history ~= 0.70).

The estimated size of the key-value-store-based feature store was about 15GB. Hence, the results show that our compressed feature store achieves the same level of classification performance (AUC ~= 0.7997) using just 3% of the memory (470MB vs 15GB). The state size can be reduced further at the expense of classification performance. For example, 90% of the uplift provided by the feature store can be retained using merely ca. 40MB of state (AUC ~= 0.79) - just 0.3% of the size of an uncompressed feature store. Note that this saving grows as the data volume increases, due to the sublinear space complexity.

When it comes to throughput (computational efficiency), all of the variants achieved similar numbers (20-22k predictions per second per core on my 2018 Mac), i.e. the additional overhead was undetectable in my performance tests.

The limitations

So the benchmark results look very good - why would anyone use a conventional key-value-store-based feature store at all? Alas, the new feature stores come with severe limitations and are thus not a drop-in replacement for conventional feature stores.

You have to know what to ask

As described above, we need to keep the set of possible features in order to reconstruct the desired output. In many use cases this is not an issue, but in some situations it may be prohibitively expensive (e.g. imagine reconstructing a bag-of-words encoding of past user reviews).

They are difficult to update (and thus keep "fresh")

The second, and probably by far the more important weakness is the difficulty associated with updating them.

Feature "freshness" - how quickly recent events are reflected in the feature store - is very important, as recent events tend to have high informational value. Many distributed key-value stores have good write performance, so it is feasible to keep them very "fresh" even under high load. The situation is very different for sketching-data-structure-based feature stores.

First, let's consider the appending of new information to our new feature store.

Most sketching data structures (including Bloom filters) allow incremental appends (so far, so good). However, since the complete state is loaded into each node's RAM, every write must be applied on every node - so each node (process) must be able to handle 100% of the event traffic. This is usually impossible: common event streams like views and clicks are very high volume, and processing that many writes on a single node is not a practical option. One could consider batching, but for many key-value-store-based feature stores the target update latency is shorter than a few seconds, which makes this option extremely difficult.

Theoretically, Bloom filters could be distributed so that each node only needs to process a shard of the traffic - but at that point one would have converted one's real-time transaction server into a distributed database.

Second, let's consider deletion (expiry) of information.

The situation is even worse here because, by their nature, sketching data structures don't allow deletion of individual records. Thus, to delete a record from our new feature state, one has to completely regenerate it by re-processing the entire source dataset (minus the information we want to delete). This is extremely expensive and thus can only be done as a low-frequency batch job. There are some sketching-data-structure variants that allow some degree of expiry (see e.g. Age-Partitioned Bloom Filters), but there are no mature implementations available.

They cannot support complex queries and updates

Finally, sketching-data-structure-based feature stores don't support complex queries or updates like "remove all events that happened on day X". With key-value-store-based feature stores, the additional cost of storing some metadata (like event timestamps) is relatively minor, but for a sketching-data-structure-based feature store this can be a major undertaking.

Conclusion

Sketching-data-structure-based feature stores cannot substitute for conventional feature stores in all use cases, but they can be an attractive option when using an external feature store is prohibitively expensive - for example, if:

  • One can't afford the additional network call to an external feature store
  • Many feature lookups need to be performed per one request

We're hiring! Do you like working in an ever-evolving organization such as Zalando? Consider joining our teams as a Machine Learning Engineer!


