Hashing is a fundamental operation in database management, playing a key role in the implementation of core database data structures and algorithms. Hashing can also produce collisions, however, which reduce query performance. There are many well-known schemes for handling collisions, such as chaining, probing, and cuckoo hashing.
Hashing and hashing-based algorithms and data structures have many applications in computer science, such as machine learning, computer graphics, bioinformatics, and compilers.
A hash function generates a code that stands in for a data input. Because these codes are shorter than the actual data and usually of a fixed, predetermined length, it is easier to find and retrieve the original data.
Because standard hash functions assign codes essentially at random, two pieces of data can sometimes receive the same hash value. The result is a collision: a user searching for one item is directed to several data items that share a hash value, and it takes much longer to find the right one, slowing searches and degrading performance.
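The slowdown comes from scanning every item that shares a slot. A minimal Python sketch of one classic remedy, chaining (mentioned above), illustrates this; the class and method names are purely illustrative:

```python
# Toy chained hash table: colliding keys share a slot, and a lookup in
# that slot must scan a list -- the slowdown described above.
class ChainedHashTable:
    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def _slot(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:          # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        # The more keys collide into this slot, the longer this scan takes.
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        raise KeyError(key)
```

With only a handful of slots, many keys collide and every lookup degrades toward a linear scan; a good hash function keeps the chains short.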
Researchers from MIT and elsewhere set out to use machine learning to build better hash functions. Perfect hash functions are designed to map data in a way that prevents collisions, but they must be specially constructed for each dataset and take more time to compute than traditional hash functions. The team found that using learned models instead of traditional hash functions could result in half as many collisions, and their experiments also showed that learned models were often more computationally efficient than perfect hash functions.
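One intuition for how a learned model can serve as a hash function: if the model approximates the cumulative distribution function (CDF) of the keys, then mapping each key to floor(CDF(key) × num_slots) spreads keys almost evenly across slots. The toy sketch below uses an empirical CDF built from a sample; the paper's models are more elaborate, and all names here are illustrative:

```python
import bisect

def build_learned_hash(sample, num_slots):
    """Return a hash function based on the sample's empirical CDF."""
    sample = sorted(sample)

    def h(key):
        # Empirical CDF: fraction of sampled keys <= key.
        rank = bisect.bisect_right(sample, key)
        cdf = rank / len(sample)
        # Scale the CDF value into a slot index.
        return min(int(cdf * num_slots), num_slots - 1)

    return h
```

If the sample tracks the true key distribution well, each slot receives roughly the same number of keys, which is exactly what keeps collisions low.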
These findings, which will be presented at the International Conference on Very Large Data Bases (VLDB), show how a hash function can be designed to significantly speed up searches in a large database. The method could accelerate the computational systems scientists use to store and analyze DNA, amino acid sequences, and other biological data.
Ibrahim Sabek, a postdoc in the MIT Data Systems Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL), said, “What we found in this work is that in some situations, we can come up with a better tradeoff between the computation of the hash function and the collisions we will face. We can increase the computational time for the hash function a bit, but at the same time, we can reduce collisions very significantly in certain situations.”
He explains, “We may have a huge number of data inputs, and each one has a different gap between it and the next one, so learning that is quite difficult.”
He also said, “At a certain threshold of sub-models, you get enough information to build the approximation that you need for the hash function. But after that, it won’t lead to more improvement in collision reduction.”
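The sub-model idea Sabek describes can be sketched as a two-stage model: a root model routes each key to one of several ranges, and a small sub-model linearly interpolates the key's rank within its range. Raising the number of sub-models tightens the approximation until, past a threshold, it stops reducing collisions. This toy version assumes numeric keys, and all names are assumptions:

```python
import bisect

def build_piecewise_hash(keys, num_slots, num_submodels):
    keys = sorted(keys)
    n = len(keys)
    seg = max(1, n // num_submodels)
    # Root model: the first key covered by each sub-model's range.
    starts = [keys[i] for i in range(0, n, seg)]

    def h(key):
        # Stage 1: pick the sub-model responsible for this key.
        i = max(0, bisect.bisect_right(starts, key) - 1)
        lo = i * seg
        hi = min(lo + seg, n) - 1
        lo_key, hi_key = keys[lo], keys[hi]
        # Stage 2: linearly interpolate the key's rank within the range.
        if hi_key == lo_key:
            rank = lo
        else:
            frac = (key - lo_key) / (hi_key - lo_key)
            rank = lo + frac * (hi - lo)
        return min(int(rank / n * num_slots), num_slots - 1)

    return h
```

With few sub-models, a single line must cover keys with very different gaps between them; with more sub-models, each line covers a narrower, more regular range, which is why adding sub-models helps only until the approximation is already good enough.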
A traditional hash function generates a random number, or code, that corresponds to the slot where the key will be stored. But two keys can easily be assigned to the same slot, causing collisions.
Based on their investigation, the researchers plan to use learned models to design hash functions for other data types, and to explore learned hashing for databases where data can be inserted or deleted. When data is updated in this way, the model must be updated as well, but altering the model while maintaining accuracy is a difficult problem.
In the researchers' experiments, learned models reduced the ratio of colliding keys in a dataset from 30 percent to 15 percent compared with traditional hash functions, while achieving higher throughput and reducing runtime by over 30 percent. They also found that the number of sub-models had the most significant influence on collision reduction.
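The flavor of that measurement can be reproduced in a toy benchmark. The sketch below counts the fraction of keys that share a slot under a traditional modulo hash versus a rank-based hash that has learned the key order perfectly; the numbers are illustrative, not the paper's, and all names are assumptions:

```python
import random
from collections import Counter

def collision_ratio(slot_of, keys):
    """Fraction of keys that land in a slot shared with another key."""
    counts = Counter(slot_of(k) for k in keys)
    return sum(c for c in counts.values() if c > 1) / len(keys)

random.seed(0)
num_slots = 1000
# Distinct keys with irregular gaps between them.
keys = sorted(k * 7 for k in random.sample(range(1_000_000), num_slots))

def traditional(k):
    return hash(k) % num_slots  # essentially random placement

ranks = {k: i for i, k in enumerate(keys)}

def learned(k):
    # With perfect rank information and num_slots equal to the number
    # of keys, every key gets its own slot.
    return ranks[k] * num_slots // len(keys)

print(collision_ratio(traditional, keys))  # a substantial fraction collides
print(collision_ratio(learned, keys))      # 0.0
```

In practice a learned model only approximates the key ranks, so some collisions remain, but the gap between the two ratios is the effect the experiments measured.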
The researcher said, “We want to encourage the community to use machine learning inside more fundamental data structures and operations. Any kind of core data structure presents us with an opportunity to use machine learning to capture data properties and get better performance. There is still a lot we can explore.”
- Ibrahim Sabek, Kapil Vaidya, et al. Can Learned Models Replace Hash Functions? PVLDB. DOI: 10.14778/3570690.3570702