An adventure of sorts
I have wanted to use Sankey diagrams to analyse data for some time. However, I have always found it daunting to figure out how to set them up, and for one reason or another I never managed to find time to dig into this area.
Well, that was till this weekend. What's a guy meant to do when they find themself at a loose end, with no family around, and nothing needs doing? It looks like it's Sankey time!!
What is a Sankey Diagram?
A Sankey diagram can display the flow of information, resources, etc., through a process. The thickness of the lines denotes where the process is most active, and thinner lines reflect the opposite.
Let's start at the beginning.
As I started learning about Sankeys, I realised that I would need to look on Medium, Python in Plain English, or Stack Overflow for learning resources. So I started with an article from Arslan Shahid on Python in Plain English, which was a great start. He provided some handy insights, and the following resonated with me.
I would highly encourage you to try different configurations manually to figure out how the diagram would change. Try adding more nodes and more intricate links etc. — Arslan Shahid, Python in Plain English
And so, having read Arslan Shahid's article, I took his advice and started playing with the data.
A few key terms.
Sankey diagrams are pretty simple once you get your head around a few key terms. These terms come up repeatedly and are fundamental to understanding Sankeys and their flows.
Source — This is the starting node; no, you don't need multiple sources to get depth. More on that later.
Target — This is the node that the source connects to
Value — This is the connection flow volume. It is a number and will denote the thickness of the lines that connect the Sankey diagram.
The only other term that comes up is Label, and you can probably guess what that is for. Yes! It labels your flows.
Well, that's enough preamble. Let's get into it.
Our first Sankey.
As we will be using Python and Plotly to create our Sankeys, we should first import our necessary modules.
import plotly.graph_objects as go
Next, as our Sankeys grow, we can use many colours to distinguish the links. That's why I create a color_link list. This list contains 104 colours, which hasn't let me down yet. Granted, the colour selection is not ideal for serious presentation purposes, but it is more than adequate for displaying our current Sankeys. You could spend an age looking for the right colours to illustrate your chart and make it more impactful, but that's a completely different discipline that I am not going to deal with here.
color_link = ['#000000', '#FFFF00', '#1CE6FF', '#FF34FF', '#FF4A46',
'#008941', '#006FA6', '#A30059','#FFDBE5', '#7A4900',
'#0000A6', '#63FFAC', '#B79762', '#004D43', '#8FB0FF',
'#997D87', '#5A0007', '#809693', '#FEFFE6', '#1B4400',
'#4FC601', '#3B5DFF', '#4A3B53', '#FF2F80', '#61615A',
'#BA0900', '#6B7900', '#00C2A0', '#FFAA92', '#FF90C9',
'#B903AA', '#D16100', '#DDEFFF', '#000035', '#7B4F4B',
'#A1C299', '#300018', '#0AA6D8', '#013349', '#00846F',
'#372101', '#FFB500', '#C2FFED', '#A079BF', '#CC0744',
'#C0B9B2', '#C2FF99', '#001E09', '#00489C', '#6F0062',
'#0CBD66', '#EEC3FF', '#456D75', '#B77B68', '#7A87A1',
'#788D66', '#885578', '#FAD09F', '#FF8A9A', '#D157A0',
'#BEC459', '#456648', '#0086ED', '#886F4C', '#34362D',
'#B4A8BD', '#00A6AA', '#452C2C', '#636375', '#A3C8C9',
'#FF913F', '#938A81', '#575329', '#00FECF', '#B05B6F',
'#8CD0FF', '#3B9700', '#04F757', '#C8A1A1', '#1E6E00',
'#7900D7', '#A77500', '#6367A9', '#A05837', '#6B002C',
'#772600', '#D790FF', '#9B9700', '#549E79', '#FFF69F',
'#201625', '#72418F', '#BC23FF', '#99ADC0', '#3A2465',
'#922329', '#5B4534', '#FDE8DC', '#404E55', '#0089A3',
'#CB7E98', '#A4E804', '#324E72', '#6A3A4C'
]
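If a diagram ever ends up with more links than the palette has colours, one option is to cycle through the palette. The following is a minimal sketch of that idea; colours_for_links is a hypothetical helper name of my own, not part of Plotly, and the three-colour palette simply stands in for the full color_link list.

```python
from itertools import cycle, islice


def colours_for_links(palette, n_links):
    """Return exactly n_links colours, repeating the palette if it is too short."""
    if not palette:
        raise ValueError("palette must contain at least one colour")
    return list(islice(cycle(palette), n_links))


# A three-colour stand-in for the full color_link list (assumption for illustration).
small_palette = ['#000000', '#FFFF00', '#1CE6FF']
link_colours = colours_for_links(small_palette, 5)
print(link_colours)
```

With a palette as long as color_link this only matters for very large diagrams, but it keeps the colour list and the link lists the same length.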
We should also set up some labels to see how our information flows.
# Our labels. The number in each trailing comment is the node index that the label belongs to.
label = ['Zero', # 0
'One', # 1
'Two', # 2
'Zero-Zero', # 3
'Zero-One', # 4
'One-Zero', # 5
'One-One', # 6
'Two-Zero', # 7
'Two-One' # 8
]
Sankey 1
And now to our first Sankey. A very modest one, but it provides the essential building blocks for our work.
IMPORTANT — Sankeys take numeric values, NOT descriptive names.
Here is our initial data
# Data
source = [0, 1, 2] # 3 Source Nodes
target = [1, 2, 3, 4, 5, 6] # 6 Target Nodes
value = [1, 1, 1, 1, 1, 1 ] # 6 Values
As you can see, we have three source nodes (0, 1, 2), six target nodes (1, 2, 3, 4, 5, 6), and six values.
We then put our data into Python dictionaries and pass them to Plotly's graph object Sankey.
# Link references our fields above
link = dict(source=source, target=target, value=value, color=color_link)
# node handles assigning labels and the housekeeping around the diagram
node = dict(label=label, pad=35, thickness=20)
# We then package the information into our data object
data = go.Sankey(link=link, node=node)
Next, we style our diagram in the standard Plotly way. I like a dark background on my diagrams hence the paper_bgcolor being set.
fig = go.Figure(data)
fig.update_layout(
    hovermode='x',
    title='Sankey 1 - 3 nodes, 6 targets, 6 values',
    font=dict(size=10, color='white'),
    # Set the diagram's background colour to dark grey
    paper_bgcolor='#51504f')
fig.show()
When you stitch it all together and run the code, you should end up with a diagram similar to the following.
Yes. There you have it, a Sankey diagram. What! It's not what you expected? I didn't expect that either when I first ran the code. Up to that point, I thought I understood what was happening. Luckily I do now.
Let me explain. First, we need to review two pieces of code. Our 'label' and our 'source' and 'target'.
# Our labels. The number in each trailing comment is the node index that the label belongs to.
label = ['Zero', # 0
'One', # 1
'Two', # 2
'Zero-Zero', # 3
'Zero-One', # 4
'One-Zero', # 5
'One-One', # 6
'Two-Zero', # 7
'Two-One' # 8
]
And our data
source = [0, 1, 2] # 3 Source Nodes
target = [1, 2, 3, 4, 5, 6] # 6 Target Nodes
value = [1, 1, 1, 1, 1, 1 ] # 6 Values
Looking at the above, you can see that;
source 0 is pointing to target 1
or using labels
Zero is pointing at One
Continuing in the above train of thought but using label names, we see the flow as follows.
source target
Zero One
One Two
Two Zero-Zero
Looking at the diagram, we see that is precisely what we get.
You will note that target nodes 4, 5, and 6 are not displayed. This is because the source list has only three entries, so there are no source nodes left to pair with the remaining targets.
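One way to see which links actually exist is to pair the lists up by position and translate the indices back into labels. This is a small sketch using Python's built-in zip, which stops as soon as the shortest list runs out. That matches the three links in the diagram above, but it is only an illustration of the pairing, not a description of Plotly's internals.

```python
label = ['Zero', 'One', 'Two', 'Zero-Zero', 'Zero-One',
         'One-Zero', 'One-One', 'Two-Zero', 'Two-One']
source = [0, 1, 2]
target = [1, 2, 3, 4, 5, 6]
value = [1, 1, 1, 1, 1, 1]

# zip pairs items by position and stops when the shortest list is exhausted,
# so only three (source, target, value) triples come out.
links = [(label[s], label[t], v) for s, t, v in zip(source, target, value)]

for src, tgt, v in links:
    print(f'{src} -> {tgt} (value {v})')
```

Printing the links by name like this is a quick way to check a diagram against what you intended before you plot it.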
Takeaway 1 — Understand the relationship between source and target nodes. A source and target of the same number are in the same locations. You will see an example of this later.
Sankey 2
Let's try and make it a bit more exciting and show individual flows of multiple sources reaching out to 'One' and 'Two' and thence, to Zero-One etc.
I won't explain the code in depth; I will point out the significant differences.
# Data
source2 = [0, 0, 1, 1, 2, 2] # 3 Source Nodes
target2 = [3, 4, 5, 6, 7, 8] # 6 Target Nodes
value2 = [1, 1, 1, 1, 1, 1 ] # 6 Values
# Link references our fields above
link2 = dict(source=source2, target=target2, value=value2, color=color_link)
# node handles assigning labels and the housekeeping around the diagram
node2 = dict(label=label, pad=35, thickness=20)
# We then package the information into our data object
data2 = go.Sankey(link=link2, node=node2)
fig2 = go.Figure(data2)
fig2.update_layout(
    hovermode='x',
    title='Sankey 2 - 3 individual nodes, 6 individual targets, 6 values',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f')
fig2.show()
The source2 and target2 lists in the code above show that we have three source nodes, 0, 1, and 2, pointing to targets 3–8. If you refer back to the label list above, you will see that these targets are Zero-Zero through Two-One.
Let's see what we get.
Now that is much better than Sankey 1. That's what I would have expected. If you look at the source and target list, you see that the source is 0–2, and targets start at 3, finishing at 8, precisely what is displayed above.
Takeaway 2 — When you know better, you do better. We saw that we got our expected outcome as long as the source and target list items didn't overlap.
Sankey 3
Now we are starting to get into the building blocks of Sankeys. In this Sankey, we will set up three source nodes, each having a flow to a different target node.
source3 = [0, 0, 1, 1, 2, 2] # 3 Source Nodes
target3 = [1, 2, 3, 4, 5, 6] # 6 Target Nodes
value3 = [1, 1, 1, 1, 1, 1 ] # 6 Values
link3 = dict(source=source3, target=target3, value=value3, color=color_link)
node3 = dict(label=label, pad=35, thickness=20)
data3 = go.Sankey(link=link3, node=node3)
fig3 = go.Figure(data3)
fig3.update_layout(
    hovermode='x',
    title='Sankey 3 - 3 nodes, 6 targets, 6 values',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f')
fig3.show()
If you look at the code, we see three source nodes defined as 0–2 and 6 target nodes defined as 1–6.
This should give us source node 0 pointing to target nodes 1 and 2, then source node 1 pointing to target nodes 3 and 4, and finally source node 2 pointing to target nodes 5 and 6.
Breaking it down as we did previously using label names
source target
Zero One
Zero Two
One Zero-Zero
One Zero-One
Two One-Zero
Two One-One
And we get the following running the code
This is precisely what we predicted above. As I mentioned at the beginning of this article, when discussing key terms, if we reference a target node from a source and the target is also a source node, we can then define depths to our Sankey diagrams. This becomes very powerful, as we see in the following two examples.
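To make the idea of depth concrete, here is a minimal sketch that works out which column each node would sit in, purely from the source and target lists. node_depths is a hypothetical helper of my own naming, not a Plotly function, and it assumes the flow has no loops. Plotly does its own layout, so treat this as a way of reasoning about levels rather than a description of how the diagram is drawn.

```python
def node_depths(source, target):
    """Map each node index to its column; nodes that are never a target get column 0."""
    nodes = set(source) | set(target)
    depth = {n: 0 for n in nodes}
    # Push every target at least one column to the right of its source.
    # For a flow without loops this settles within len(nodes) passes.
    for _ in range(len(nodes)):
        changed = False
        for s, t in zip(source, target):
            if depth[t] < depth[s] + 1:
                depth[t] = depth[s] + 1
                changed = True
        if not changed:
            break
    return depth


# Sankey 2: sources and targets do not overlap, so there are two columns.
depths2 = node_depths([0, 0, 1, 1, 2, 2], [3, 4, 5, 6, 7, 8])
# Sankey 3: nodes 1 and 2 are both targets and sources, which adds a third column.
depths3 = node_depths([0, 0, 1, 1, 2, 2], [1, 2, 3, 4, 5, 6])

print(max(depths2.values()) + 1, 'columns in Sankey 2')
print(max(depths3.values()) + 1, 'columns in Sankey 3')
```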
Takeaway 3 — By referencing different nodes between source and targets, we can start to build complicated Sankey diagrams.
Sankey 4
Now that we have the basic building blocks, we can map out more complicated processes. In this example, I will map out an imaginary process.
Firstly we will need some new labels.
label4 = ['Risk', # 0
'Reward', # 1
'Forfeit', # 2
'People', # 3
'Process', # 4
'Systems', # 5
'External Events', # 6
'Money', # 7
'Food', # 8
'Cash', # 9
'Car', # 10
'Holiday' # 11
]
Then our code.
source4 = [0, 0, 0, 0, 1, 1, 2, 2] # 3 Source Nodes
target4 = [3, 4, 5, 6, 7, 8, 9, 10] # 8 Target Nodes
value4 = [1, 1, 1, 1, 1, 1, 1, 1 ] # 8 Values
link4 = dict(source=source4, target=target4, value=value4, color=color_link)
node4 = dict(label=label4, pad=35, thickness=20)
data4 = go.Sankey(link=link4, node=node4)
fig4 = go.Figure(data4)
fig4.update_layout(
    hovermode='x',
    title='Testing 4 - 3 nodes, 8 targets, 8 values',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f')
fig4.show()
Running the above code results in Sankey 4.
There you have it; three distinct processes mapped out with their individual targets. This was a simple example, but the next posed a few more interesting issues.
Takeaway 4 — You can have more than one process mapped on a Sankey. The crucial part of the code that keeps everything separate is the 'pad=35' in our node dictionary. Adding or reducing the number has the effect of increasing or decreasing the spacing between nodes.
Sankey 5
Thanks for sticking with me. We are about to graduate from the basics. Here I will draw out a much larger process which brings with it its own issues. More on that as we work through the problem.
First, some new labels. Sorry, there are lots of them
label5 = ['Risk', # 0
'Reward', # 1
'Forfeit', # 2
'People', # 3
'Process', # 4
'Systems', # 5
'External Events', # 6
'Money', # 7
'Food', # 8
'Outdoors', # 9
'Car', # 10
'Holiday', # 11
'Tom', # 12
'Dick', # 13
'Harry', # 14
'Fast Forward', # 15
'Reverse', # 16
'Play', # 17
'XBox', # 18
'Playstation', # 19
'Nintendo', # 20
'Earthquake', # 21
'Terrorism', # 22
'Cheque', # 23
'Credit Card', # 24
'Burger', # 25
'Pizza', # 26
'Tent', # 27
'Pack', # 28
'Boots', # 29
'Landrover', # 30
'Mazda', # 31
'Ibiza', # 32
'Italy', # 33
'Spain', # 34
'USA', # 35
'Australia', # 36
'Son', # 37
'Daughter', # 38
'Red', # 39
'Blue', # 40
'Grey', # 41
'Silver' # 42
]
And now our code. It hasn't moved on much from when we started. The number of nodes, targets and values has just increased a lot.
source5 = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 10, 10, 11, 11, 11, 11, 11, 12, 12, 31, 31, 30, 30]
target5 = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
value5 = [5, 3, 3, 2, 2, 2, 3, 4, 5, 2,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 2, 2, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
link5 = dict(source=source5, target=target5, value=value5, color=color_link)
node5 = dict(label=label5, pad=35, thickness=20)
data5 = go.Sankey(link=link5, node=node5)
fig5 = go.Figure(data5)
fig5.update_layout(
    hovermode='x',
    title='Sankey 5 - 40 nodes, 40 targets',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f')
fig5.show()
This gives us a beautiful Sankey diagram of a proper flow.
I was thrilled when I got the diagram where I wanted it to be. You can probably see by looking at the source, target and values that there is an excellent opportunity for confusion. As I was knocking these up in Jupyter Notebooks, I stuck in the following to validate my data.
print(f'source5 has {len(source5)} items')
print(f'target5 has {len(target5)} items')
print(f'value5 has {len(value5)} items')
This outputs a very helpful confidence check.
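The same confidence check can be taken a step further. Here is a minimal sketch that confirms the three lists are the same length and that every index has a label to go with it. check_sankey_lists is a hypothetical helper of my own, not something Plotly provides, and I run it on the Sankey 1 data as an example.

```python
def check_sankey_lists(source, target, value, label):
    """Return a list of problems found; an empty list means the data looks consistent."""
    problems = []
    if not (len(source) == len(target) == len(value)):
        problems.append(
            f'length mismatch: source={len(source)}, '
            f'target={len(target)}, value={len(value)}')
    for name, indices in (('source', source), ('target', target)):
        for i in indices:
            # Every index must point at an existing entry in the label list.
            if not 0 <= i < len(label):
                problems.append(f'{name} index {i} has no label')
    return problems


label = ['Zero', 'One', 'Two', 'Zero-Zero', 'Zero-One',
         'One-Zero', 'One-One', 'Two-Zero', 'Two-One']
# The Sankey 1 data: three sources against six targets and six values.
problems = check_sankey_lists([0, 1, 2], [1, 2, 3, 4, 5, 6], [1] * 6, label)
for problem in problems:
    print(problem)
```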
But again, that was only one issue. I also had nodes connecting to nodes I didn't expect, and I had to go back to basics with paper and pen to work it out.
Do not get me wrong. I was delighted with the output, which helped me understand how Sankey works. It was just a bit laborious.
Takeaway 5 — When Sankeys start to get large and unwieldy, it's time to get our programming marbles on. But before you can do that, you NEED to understand how a Sankey is put together and how they work.
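One way to avoid counting indices by hand is to write each link using names and let a few lines of code assign the numbers. The following is a minimal sketch under that assumption; links_from_names is a hypothetical helper of my own, and the three links are just a small subset of the Sankey 5 flow, used for illustration.

```python
def links_from_names(named_links):
    """Turn (source_name, target_name, value) tuples into label/source/target/value lists."""
    label, index_of = [], {}
    source, target, value = [], [], []

    def node_index(name):
        # Give each distinct name the next free index the first time it is seen.
        if name not in index_of:
            index_of[name] = len(label)
            label.append(name)
        return index_of[name]

    for src, tgt, val in named_links:
        source.append(node_index(src))
        target.append(node_index(tgt))
        value.append(val)
    return label, source, target, value


named = [('Risk', 'People', 5), ('Risk', 'Systems', 3), ('People', 'Tom', 2)]
label, source, target, value = links_from_names(named)
print(label)
print(source, target, value)
```

The four lists that come back can be dropped straight into the link and node dictionaries used throughout this article.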
Reflection on the previous examples.
Well, that was fun! I now think I have a good understanding of how Sankeys work, and enough knowledge to realise that, for them to be useful, we will need to automate a lot of this. So off I went, back on my search for more Sankey information, and found a great article by Baysan on How to Automatically Generate Data Structure for Sankey Diagrams.
Onwards and upwards
Let's automate this
So in the previous section, we did all this manually. Now finally, we will automate the creation of the Sankey diagrams. I pulled heavily from Baysan's article and modified his function for my needs.
The data I use for this article is partially sourced from publicly available documents from BIS.
You can pull the CSV files I created from my GitHub here.
The Data
You can use either data file in the GitHub repo to run the program. Both files have the same format as below. risk_cat.csv is the larger of the two files. I ran the code in Jupyter Notebooks, but it runs fine in PyCharm or any other IDE with slight modifications.
The Code
After playing around with Baysan's function and the knowledge I gained from experimenting with Sankeys earlier, I pulled the following together. It will look very familiar if you have read the article from the beginning. It is not rocket science, but it does speed up the process of creating complicated Sankey diagrams.
The only new additions are pandas and webcolors. I assume you are all over pandas already.
To get semi-opaque links, you need to use RGBA colours. webcolors has a function, hex_to_rgb, that helps with this: it converts a hex colour into its red, green, and blue components, and we then add an alpha value of 0.4 to build the RGBA string.
rgb_link_color = ['rgba({},{},{}, 0.4)'.format(
hex_to_rgb(x)[0],
hex_to_rgb(x)[1],
hex_to_rgb(x)[2]) for x in color_link]
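If you would rather not install webcolors just for this, the conversion can be sketched with the standard library alone. hex_to_rgba below is a hypothetical helper name of my own, and it only handles the six-digit #RRGGBB form used in color_link.

```python
def hex_to_rgba(hex_colour, alpha=0.4):
    """Convert a '#RRGGBB' string to an 'rgba(r,g,b,a)' string."""
    digits = hex_colour.lstrip('#')
    if len(digits) != 6:
        raise ValueError(f'expected #RRGGBB, got {hex_colour!r}')
    # Read each pair of hex digits as one colour channel (0-255).
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return f'rgba({r},{g},{b},{alpha})'


sample = hex_to_rgba('#FF4A46')
print(sample)
```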
The complete code is below. Thanks to an article by Mattia Cinelli for helping me with this.
import pandas as pd
import plotly.graph_objects as go
from webcolors import hex_to_rgb
# You can use either risk_cat.csv or risk_cat1.csv
data = pd.read_csv(r'risk_cat1.csv')
df = pd.DataFrame(data)
# Setup our colours
color_link = ['#000000', '#FFFF00', '#1CE6FF', '#FF34FF', '#FF4A46',
'#008941', '#006FA6', '#A30059','#FFDBE5', '#7A4900',
'#0000A6', '#63FFAC', '#B79762', '#004D43', '#8FB0FF',
'#997D87', '#5A0007', '#809693', '#FEFFE6', '#1B4400',
'#4FC601', '#3B5DFF', '#4A3B53', '#FF2F80', '#61615A',
'#BA0900', '#6B7900', '#00C2A0', '#FFAA92', '#FF90C9',
'#B903AA', '#D16100', '#DDEFFF', '#000035', '#7B4F4B',
'#A1C299', '#300018', '#0AA6D8', '#013349', '#00846F',
'#372101', '#FFB500', '#C2FFED', '#A079BF', '#CC0744',
'#C0B9B2', '#C2FF99', '#001E09', '#00489C', '#6F0062',
'#0CBD66', '#EEC3FF', '#456D75', '#B77B68', '#7A87A1',
'#788D66', '#885578', '#FAD09F', '#FF8A9A', '#D157A0',
'#BEC459', '#456648', '#0086ED', '#886F4C', '#34362D',
'#B4A8BD', '#00A6AA', '#452C2C', '#636375', '#A3C8C9',
'#FF913F', '#938A81', '#575329', '#00FECF', '#B05B6F',
'#8CD0FF', '#3B9700', '#04F757', '#C8A1A1', '#1E6E00',
'#7900D7', '#A77500', '#6367A9', '#A05837', '#6B002C',
'#772600', '#D790FF', '#9B9700', '#549E79', '#FFF69F',
'#201625', '#72418F', '#BC23FF', '#99ADC0', '#3A2465',
'#922329', '#5B4534', '#FDE8DC', '#404E55', '#0089A3',
'#CB7E98', '#A4E804', '#324E72', '#6A3A4C'
]
# Collect the data we need from a dataframe to populate our Sankey data - source, target, and value
def get_sankey_data(data, cols, values):
    # Empty lists to hold our data
    sankey_data = {
        'label': [],
        'source': [],
        'target': [],
        'value': []
    }
    # Set our counter to zero
    cnt = 0
    # Start loop to retrieve data from our dataframe
    while (cnt < len(cols) - 1):
        for parent in data[cols[cnt]].unique():
            sankey_data['label'].append(parent)
            for sub in data[data[cols[cnt]] == parent][cols[cnt+1]].unique():
                sankey_data['source'].append(sankey_data['label'].index(parent))
                sankey_data['label'].append(sub)
                sankey_data['target'].append(sankey_data['label'].index(sub))
                sankey_data['value'].append(data[data[cols[cnt+1]] == sub][values].sum())
        cnt += 1
    return sankey_data
# We use this to create RGBA colours for our links.
# This enables us to have semi-opaque links, which in turn
# allows us to see flows without them being obscured by solid colours
rgb_link_color = ['rgba({},{},{}, 0.4)'.format(
hex_to_rgb(x)[0],
hex_to_rgb(x)[1],
hex_to_rgb(x)[2]) for x in color_link]
# Call our get_sankey_data function - dataframe, columns, values
sankey_chart = get_sankey_data(df, ['l1', 'l2', 'l3'], 'weight')
# Style our initial Sankey chart
data = go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=sankey_chart['label'],
        color="goldenrod"
    ),
    link=dict(
        source=sankey_chart['source'],
        target=sankey_chart['target'],
        value=sankey_chart['value'],
        # Use the RGBA colours so that the links are semi-opaque
        color=rgb_link_color
    ))
# Prepare our chart
fig = go.Figure(data)
# Update chart with some customisations
fig.update_layout(
    hovermode='x',
    title='Sankey - Example of an Operational Risk Taxonomy',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f',
    # Height is needed for risk_cat.csv as the diagram is large
    height=1500,
    margin={'t': 50, 'b': 20}
)
# Display chart
fig.show()
On running the above with risk_cat1.csv, you get the following diagram created.
If you run the above with risk_cat.csv, you get the much larger Sankey Diagram. Note the semi-opaque flows running through our diagram.
Takeaway 6 — Automating much larger Sankeys is not as complicated as expected. After getting the basics solidified, this was relatively quick.
Takeaway 7 — Remember, to get transparent links, you need to use RGBA colours, not HEX
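To show the idea behind the automation without needing pandas, here is a minimal plain-Python sketch. It is not the function used above: sankey_from_rows is a hypothetical name, the rows are made up for illustration, and it adds up the weight for each (parent, child) pair rather than for each child value on its own. It also treats the same name in two different columns as a single node.

```python
from collections import defaultdict


def sankey_from_rows(rows, cols, value_key):
    """Build label/source/target/value lists from dict rows, one link per (parent, child) pair."""
    totals = defaultdict(float)
    label, index_of = [], {}

    def node_index(name):
        # Assign the next free index the first time a name is seen.
        if name not in index_of:
            index_of[name] = len(label)
            label.append(name)
        return index_of[name]

    for row in rows:
        # Walk adjacent column pairs, e.g. (l1, l2) and then (l2, l3).
        for parent_col, child_col in zip(cols, cols[1:]):
            key = (node_index(row[parent_col]), node_index(row[child_col]))
            totals[key] += row[value_key]

    return {
        'label': label,
        'source': [s for s, _ in totals],
        'target': [t for _, t in totals],
        'value': list(totals.values()),
    }


# Made-up rows using the same column names as the call above (l1, l2, l3, weight).
rows = [
    {'l1': 'Risk', 'l2': 'People', 'l3': 'Fraud', 'weight': 2},
    {'l1': 'Risk', 'l2': 'People', 'l3': 'Error', 'weight': 1},
    {'l1': 'Risk', 'l2': 'Systems', 'l3': 'Outage', 'weight': 3},
]
chart = sankey_from_rows(rows, ['l1', 'l2', 'l3'], 'weight')
print(chart)
```

The returned dictionary has the same four keys as sankey_chart, so it could be passed to go.Sankey in the same way.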
Using Sankeys to spot incorrect process flow
Now that we have had time to play around with Sankeys let me show you one final thing.
You might wonder why I wanted to spend all this time getting an understanding of Sankeys. My reasoning is that a Sankey can very quickly show an incorrect process flow; moreover, because it does so visually, the problem is much easier to spot than in a spreadsheet or DataFrame.
Consider the following two small dataframes.
Even though the two DataFrames map the same process, there are slight differences between them. Yes, in a small set like this, we can spot the differences reasonably quickly, as there is only a small amount of data.
Now look at the same data plotted on two separate Sankey Diagrams
Good Flow
Bad Flow
You can easily spot where the issues are in the Bad Flow diagram, and the semi-opaque links allow us to trace them back easily. Scale this up to hundreds of entries, then try to spot the issue in the DataFrame. Sankeys are well suited to displaying this kind of information.
Takeaway 8— It is always easier to spot an inconsistency visually in a diagram than it is to spot it in a DataFrame or Spreadsheet.
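As a rough starting point for checking this in code rather than by eye, a simple set comparison gets part of the way. This is only a sketch under my own assumptions: unexpected_links is a hypothetical helper, and the two flows are invented examples, not the DataFrames shown above.

```python
def unexpected_links(expected, observed):
    """Return observed (source, target) pairs that do not appear in the expected flow."""
    allowed = set(expected)
    return sorted(pair for pair in set(observed) if pair not in allowed)


# Hypothetical flows, written as (source_name, target_name) pairs.
good_flow = [('Order', 'Payment'), ('Payment', 'Dispatch'), ('Dispatch', 'Delivery')]
bad_flow = [('Order', 'Payment'), ('Order', 'Dispatch'), ('Dispatch', 'Delivery')]

deviations = unexpected_links(good_flow, bad_flow)
print(deviations)
```

Any pair it returns is a link that would show up in the Bad Flow diagram but not in the Good Flow one.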
Conclusion
Well, that was a whistlestop tour of Sankey Diagrams that I learned over the weekend. I do hope you found it helpful.
Sankey diagrams are great for showing process flow. As shown above, it is easy to spot a deviation from the process when a link goes from one source to a different source/target than expected.
My next piece of work will be to automate the detection of deviations from a process. Programmatically this should be possible. When I crack it, I will let you know.
Until next time.
See ya!