Introduction
Hash, salt, SHA-1, SHA-2, std::hash.. To a non-programming person, that may come up as some kind of recipe that just does not seem to add up. In a sense, this is indeed supposed to be gibberish to any third party and a strong, helpful mechanism for us, programmers.
At the start of writing this article, I had one clear idea to get across: to finally unveil this mystery of hashing in C++ for beginners. I, a beginner myself, also wanted to solidify my knowledge in this area, so let’s get started.
Briefly about hash functions
A hash function is essentially an algorithm that shreds and then mixes input data and outputs a completely different value. You can picture hash functions as a blender: we put in a set of vegetables or fruits and get something that does not look good or is similar to the initial input.
Good hash functions have the following properties:
They are deterministic. Given the same input, you will always get the same hash.
They are not too fast or too slow. If a hash function calculates hash way too quickly, it means that its hashing algorithm is easy to reverse engineer and break. On the other hand, if it is too slow, there is not much point in using it anywhere, since it is usually used in fast operations.
Fixed-length output. Given any amount of information, a hash function will always give an output of a predetermined length (usually 16 bytes).
For instance, here is what C++ STL's std::hash
would output, given some random strings:
"This is my first article, and I am writing this with a goal to teach people about hash functions and their applications in C++. This is not a really complex topic, so everyone can learn it once and for all" -> 8833303357991129216
"O" -> 1547029504053110495
As you can see, despite the size differences between the two strings, the output length is the same.
However, sometimes we may encounter so-called hash collisions. This is a common hash algorithm problem that may cause different problems depending on where a hash algorithm is used. It occurs when two distinct data inputs produce the same hash output.
Suppose that we have a string password123 and it is hashed to 18410906283934990467, then there is a miniscule possibility (1 / 2^256) that there is another string that will produce the same hash value. This happens because, though there are 2^256 (2
represents bits, 1 or 0, 256
is a number of these bits) possible hashes, they are still finite, but the options of data input that we can provide is infinite. Notably, hackers will often try to produce the same hash for a target piece of information (e.g. digital signatures, hashed passwords) to deceive a system into thinking that they are authorized users.
Hash functions in C++
Hash functions are encountered a lot in C++. Sometimes we do not notice where they are. For example, std::unordered_map
and std::unordered_set
utilize this exact algorithm for storing data and quickly accessing it. These containers are called hash tables. However, they have one thing we have to be aware of: they work well with common STL types and not really well with user types. By "work well" I mean that they rarely stumble upon hash collisions, giving them good element access speed, inserting, or erasing data (when a collision occurs, data in hash tables will be stored in a data list for the same hash value). This happens because only a set of common types is properly configured to be hashed, but user-type hashes are left up to the user.
Let's say that we are writing a car number plate detection system for the government and we need to store all number plates and the number of times that we have encountered them.
A good approach is to use some fast and intuitive container for such purposes, like std::unordered_map<NumberPlate, int>
, where NumberPlate
is a plate object—1 digit at the beginning, 3 letters in the middle, and 3 digits at the end— and int is the number of times we have seen a plate. Let's take a look at how we can code this system:
#include <iostream>
#include <unordered_map>
#include <string>
#include <array>
#include <functional>
#include <sstream>
class NumberPlate{
public:
NumberPlate(int series, char letter1, char letter2, char letter3, int number) : letters_{letter1, letter2, letter3}, series_(series), number_(number) {}
bool operator==(const NumberPlate& other_plate) const{
return this->GetTuple() == other_plate.GetTuple();
}
std::tuple<int, std::array<char, 3>, int> GetTuple() const {
return std::tie(series_, letters_, number_);
}
std::string GetString() const{
std::ostringstream str;
str << series_;
str << letters_[0] << letters_[1] << letters_[2];
str << number_;
return str.str();
}
private:
std::array<char, 3> letters_;
int series_;
int number_;
};
std::ostream& operator<<(std::ostream& out, const NumberPlate& plate){
out << plate.GetString();
return out;
}
int main(){
std::unordered_map<NumberPlate, int> plates;
plates[{0, 'D', 'S', 'H', 441}] = 2;
plates[{9, 'A', 'B', 'C', 114}] = 4;
for (const auto& [plate, freq] : plates){
std::cout << plate << " : " << freq << '\n';
}
}
In this code snippet, I have created a class NumberPlate, which represents a car number plate. Currently, we can either get a std::string
or an std::tuple
representation of a number plate. However, when we try to compile it, we get a similar error:
The problem lies in the fact that std::unordered_map does not know how to hash the NumberPlate
object. To fix this error, we need to write a custom hasher. Here we will write two versions of this hasher, an easy one that uses STL's std::hash
and our own hashing algorithm:
Using STL's
std::hash
. First, we create another class calledPlateHasher
and define itsoperator()
method, since inside our container, this class will be called like that:PlateHasher()
. When this method is called, the hasher createsstd::hash<std::string>{}
object and passes the result ofplate.GetString()
method as a parameter and computes the hash value of that plate string.
class PlateHasher{
public:
inline size_t operator()(const NumberPlate& plate) const{
std::string number_plate_str(plate.GetString());
size_t hashed_str = std::hash<std::string>{}(number_plate_str);
std::cout << "Number Plate: " << std::move(number_plate_str) << " | HASH: " << hashed_str << std::endl;
return hashed_str;
}
};
int main(){
std::unordered_map<NumberPlate, int, PlateHasher> plates;
++plates[{0, 'D', 'S', 'H', 441}];
++plates[{9, 'A', 'B', 'C', 114}];
++plates[{0, 'D', 'S', 'H', 441}];
for (const auto& [plate, freq] : plates){
std::cout << plate << " : " << freq << '\n';
}
}
// output:
// Number Plate: 0DSH441 | HASH: 12184712924349732319 <- |
// Number Plate: 9ABC114 | HASH: 9677638779162621388 | Same input = same hash
// Number Plate: 0DSH441 | HASH: 12184712924349732319 <- |
// 9ABC114 : 1
// 0DSH441 : 2
Using our own hashing algorithm. When writing your own hashing algorithm for a hash table, you need to keep in mind the following rules: your algorithm needs to be non-congenital (to avoid collisions as much as possible) and fast. In our case, we will be multiplying each value of our plate by different powers of three:
class PlateHasher{
public:
inline size_t operator()(const NumberPlate& plate) const{
uint64_t hashed = 0;
auto [series, letters, number] = plate.GetTuple();
std::string number_letters(letters.data());
hashed = (series * 9) + (std::hash<std::string>{}(std::move(number_letters))) + (number * 19683);
std::cout << "Number Plate: " << plate.GetString() << " | HASH: " << hashed << std::endl;
return static_cast<size_t>(hashed);
}
};
int main(){
std::unordered_map<NumberPlate, int, PlateHasher> plates;
++plates[{0, 'D', 'S', 'H', 441}];
++plates[{9, 'A', 'B', 'C', 114}];
++plates[{0, 'D', 'S', 'H', 441}];
for (const auto& [plate, freq] : plates){
std::cout << plate << " : " << freq << '\n';
}
}
// output:
// Number Plate: 0DSH441 | HASH: 6407075732577860629 <- |
// Number Plate: 9ABC114 | HASH: 11746482041455558785 | Same input = same hash
// Number Plate: 0DSH441 | HASH: 6407075732577860629 <- |
// 9ABC114 : 1
// 0DSH441 : 2
Sometimes hash values may differ in their length, but this is due to conversion between different digit types, so do not worry about it; the hash value is a fixed length anyway (size_t
, in our case).
Conclusion
In this article, I have addressed a common misunderstanding among people who are new to programming and the tech industry overall: "What are the hash functions? Where and how to use them?".
To sum up everything we have gone over:
A hash function is an algorithm that produces a totally indistinguishable output from an input.
Hash functions are widely used in cryptography and data structures.
Same input = same hash output. Sometimes hash collisions happen.
Hash collisions occur, when two or more diffrent inputs produce the same hash
To write a good hashing algorithm for hash tables, we need to multiply different parts of data by different powers of any number.
Hopefully, my first article was helpful to everyone who read it, and you have gained knew knowledge from it. Thank you for taking your time. See you soon!