Recently, I came across a problem where I need to store values inside a nested Hash. The catch is, the keys within that Hash are unknown, and each key can also have a nested Hash within. In addition to that, the final value of this nested Hash is an Integer that needs to be accumulated.
In Ruby, we need to define the Hash keys first before we can get or modify the value. If you try and access a key that does not exist yet, it will return nil. When you try and access another key within that non-existent key, it raises an exception:
> hash = {}
> hash[:level_1][:level_2][:level_3]
=> NoMethodError (undefined method `[]' for nil:NilClass)
There is a Hash method called dig that can traverse the Hash and get the value. It will not throw an exception even if the keys do not exist.
> hash.dig(:level_1, :level_2, :level_3)
=> nil
However, dig is only used to get the value in the hash, and not to set its value. This will not solve the problem that I had.
Taxonomic Rank
To illustrate the problem further, let’s say that we are instructed to sort a list of animals based on their Taxonomy. As a simple example, here are four animals with their Class, Order, and Species classification.
Kingdom | Class | Order | Species | |
Cat | Animalia | Mammalia | Carnivora | Felis catus |
Tiger | Animalia | Mammalia | Carnivora | Panthera tigris |
Dolphin | Animalia | Mammalia | Cetacea | Tursiops truncatus |
Crocodile | Animalia | Reptilia | Crocodilia | Crocodylus johnsoni |
The desired result would be a Hash that groups each animal accordingly:
{
"Animalia" => {
"Mammalia" => {
"Carnivora" => [
"Felis catus",
"Panthera tigris"
],
"Cetacea" => [
"Tursiops truncatus"
]
},
"Reptilia" => {
"Crocodilia" => [
"Crocodylus johnsoni"
]
}
}
}
To create this Hash, we can check if each nested key exists before we assign the value, like below:
if animals['Animalia']
if animals['Animalia']['Mammalia']
if animals['Animalia']['Mammalia']['Carnivora']
animals['Animalia']['Mammalia']['Carnivora'] << 'Felis catus'
end
end
end
However, this doesn’t look very clean due to the repeated existence checks and key declarations. The code also does not cover yet the case when the key doesn’t exist and needs to be initialized. You can imagine that the code would become convoluted pretty fast. It would be nice if we can create a Hash without caring at all if all the parent keys exist.
Taking a technique from Perl’s notebook, Ruby has a tool of its own, called Hash Autovivification.
Autovivi-what?
This comes from a combination of two words:
auto – self, or by itself
vivify – to enliven or animate
So autovivification means to be able to move or grow by itself.
How do we use this in Ruby? The magic happens when you declare your Hash like this:
hash = Hash.new { |h, k| h[k] = h.dup.clear }
Let’s break it down to its components.
Setting the default value
You can define the default value returned by a Hash using these syntax:
Hash.new(default)
Hash.new { |hash, key| default }
For example, you can set the default value to a string called “notfound” if the key is not found in the Hash:
> hash = Hash.new('notfound')
> hash[:first]
=> "notfound"
The second syntax where you provide a block is a lot more flexible as it gives you access to the Hash itself and also the key that is used:
> hash = Hash.new { |_hash, key| "The key #{key} cannot be found." }
> hash[:first]
=> "The key first cannot be found."
Making the Hash go alive
Using these concepts on how we set the default values of a Hash, we can make the Hash store the values even though we have not yet pre-defined the keys.
h[k] = h.dup.clear
In this part of the code, the default behavior (or if the key is not found) is that the non-existent key is assigned a value of the duplicate of the Hash itself, but without its contents (cleared).
In other words, if the key is not found, that key’s value now becomes an empty Hash that is also autovivified. Our previous example, which previously returns an exception, now becomes:
> hash = Hash.new { |h, k| h[k] = h.dup.clear }
> hash[:level_1][:level_2][:level_3]
=> {}
We can also now set a value to a key that is deeply nested inside:
> hash = Hash.new { |h, k| h[k] = h.dup.clear }
> hash[:level_1][:level_2][:level_3] = 'you found me'
> hash
=> {:level_1=>{:level_2=>{:level_3=>"you found me"}}}
> hash[:level_1][:level_2][:level_3]
=> "you found me"
Saving the Taxonomic ranks
Going back to our example problem where we group a list of animals based on their taxonomic rank, we can use autovivification to simplify the solution:
# Initialize the Hash
> animals = Hash.new { |h, k| h[k] = h.dup.clear }
> animals['Animalia']['Mammalia']['Carnivora'] = []
> animals['Animalia']['Mammalia']['Cetacea'] = []
> animals['Animalia']['Reptilia']['Crocodilia'] = []
# Save the values
> animals['Animalia']['Mammalia']['Carnivora'] << 'Felis catus'
> animals['Animalia']['Mammalia']['Carnivora'] << 'Panthera tigris'
> animals['Animalia']['Mammalia']['Cetacea'] << 'Tursiops truncatus'
> animals['Animalia']['Reptilia']['Crocodilia'] << 'Crocodylus johnsoni'
# Check the Hash
> pp animals
{"Animalia"=>
{"Mammalia"=>
{"Carnivora"=>["Felis catus", "Panthera tigris"],
"Cetacea"=>["Tursiops truncatus"]},
"Reptilia"=>{"Crocodilia"=>["Crocodylus johnsoni"]}}}
=> {"Animalia"=>{"Mammalia"=>{"Carnivora"=>["Felis catus", "Panthera tigris"], "Cetacea"=>["Tursiops truncatus"]}, "Reptilia"=>{"Crocodilia"=>["Crocodylus johnsoni"]}}}
Caveats
Some important things to consider when using autovivification are:
Know how the values are used
When we use autovivification, the default value of a key is also a Hash, so we need to check first if our code needs to use a Hash or some other type, like a String or an Array.
The Taxonomic rank example may not be the best way to highlight the strengths of this method because the desired end value is an Array. If the end value is a Hash, then we don’t need to initialize the Order key to an array at the beginning. We can just immediately assign values to arbitrary keys:
> animals = Hash.new { |h, k| h[k] = h.dup.clear }
> animals['Animalia']['Mammalia']['Carnivora']['Felis catus'] = 1
> animals
=> {"Animalia"=>{"Mammalia"=>{"Carnivora"=>{"Felis catus"=>1}}}}
Ensure that you control the input data
The biggest strength of autovivification is also its biggest weakness: the ability to create objects indefinitely. When you do not fully control the input data, such as when using it directly behind a user-submitted form, this can open your application to attacks. A malicious agent can send data to your application and populate this Hash with large amounts of data, consuming tons of memory and potentially cause a crash.
Keys must be predefined or of a known format
Since an autovifified Hash can be generated with any key that you specify, you need to make sure that you can read the Hash contents when you process it. If you are creating keys that are of different types on the same nested level, such as timestamps, strings, or integers, getting the correct values can be a challenge. It may even leave you with “unused” data in your Hash, which is not optimal and consumes unnecessary memory.
Like all things, using autovivification will not solve all of your problems when using a Hash. There are only specific use cases where this technique will shine, and could even cause bugs for other cases. However, when used properly this can make your code simpler and more concise.
For anyone looking for something more full-featured (e.g. default values, key lists, create on write instead of read, auto array or hash), there’s also my XKeys ruby gem.