Understanding Plotly Sankey Diagrams

Tom Welsh
15 min read · Oct 30, 2022


An adventure of sorts

A Plotly Sankey chart showing an example Operational Risk Taxonomy flow
Sankey of a possible Operational Risk Taxonomy Mapping — Image by Author

I have wanted to utilise Sankey diagrams to analyse data for some time. However, I have always found it daunting to figure out how to set these up. For some reason or another, I never managed to find time to dig into this area.

Well, that was till this weekend. What's a guy meant to do when they find themself at a loose end, with no family around, and nothing needs doing? It looks like it's Sankey time!!

What is a Sankey Diagram?

A Sankey diagram can display the flow of information, resources, etc., through a process. The thickness of the lines denotes where the process is most active, and thinner lines reflect the opposite.

Let's start at the beginning.

As I started learning about Sankeys, I realised that I would need to look on Medium, Python in Plain English, or Stack Overflow for learning resources. So I started with an article from Arslan Shahid on Python in Plain English, which was a great start. He provided some handy insights, and the following resonated with me.

I would highly encourage you to try different configurations manually to figure out how the diagram would change. Try adding more nodes and more intricate links etc. — Arslan Shahid, Python in Plain English

And so, having read Arslan Shahid's article, I took his advice and started playing with the data.

A few key terms.

Sankey diagrams are pretty simple once you get your head around a few key terms. These terms come up repeatedly and are fundamental to understanding Sankeys and their flows.

Source — This is the starting node; no, you don't need multiple sources to get depth. More on that later.

Target — This is the node that the source connects to

Value — This is the connection flow volume. It is a number and will denote the thickness of the lines that connect the Sankey diagram.

The only other term that comes up is Label, and you can probably guess what that is for. Yes! It labels your nodes.
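To make these terms concrete, here is a minimal sketch in plain Python (no Plotly needed). The label names and the helper describe_links are made up for illustration; the point is that each link is one source index, one target index and one value, read position by position.

```python
# Hypothetical labels: the position in the list is the node's index.
label = ['Input', 'Stage A', 'Stage B']

# Each position i describes one link: source[i] -> target[i] with value[i].
source = [0, 0]
target = [1, 2]
value = [3, 1]


def describe_links(label, source, target, value):
    """Return one readable line per link, using label names instead of indices."""
    return [
        f'{label[s]} -> {label[t]} (value {v})'
        for s, t, v in zip(source, target, value)
    ]


for line in describe_links(label, source, target, value):
    print(line)
```

Here the link from 'Input' to 'Stage A' carries three times the value of the link to 'Stage B', so it would be drawn three times as thick.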

Well, that's enough preamble. Let's get into it.

Our first Sankey.

As we will be using Python and Plotly to create our Sankeys, we should first import our necessary modules.

import plotly.graph_objects as go

Next, as we grow our Sankeys, we can use many colours to distinguish links. That's why I create a color_link list. This list contains 104 colours, which hasn't let me down yet. Granted, the colour selection is not ideal for serious presentation purposes, but it is more than adequate for displaying our current Sankeys. You could spend an age looking for the correct colour choice to illustrate your chart and make it more impactful. But that’s a completely different discipline that I am not going to deal with here.

color_link = ['#000000', '#FFFF00', '#1CE6FF', '#FF34FF', '#FF4A46',
'#008941', '#006FA6', '#A30059','#FFDBE5', '#7A4900',
'#0000A6', '#63FFAC', '#B79762', '#004D43', '#8FB0FF',
'#997D87', '#5A0007', '#809693', '#FEFFE6', '#1B4400',
'#4FC601', '#3B5DFF', '#4A3B53', '#FF2F80', '#61615A',
'#BA0900', '#6B7900', '#00C2A0', '#FFAA92', '#FF90C9',
'#B903AA', '#D16100', '#DDEFFF', '#000035', '#7B4F4B',
'#A1C299', '#300018', '#0AA6D8', '#013349', '#00846F',
'#372101', '#FFB500', '#C2FFED', '#A079BF', '#CC0744',
'#C0B9B2', '#C2FF99', '#001E09', '#00489C', '#6F0062',
'#0CBD66', '#EEC3FF', '#456D75', '#B77B68', '#7A87A1',
'#788D66', '#885578', '#FAD09F', '#FF8A9A', '#D157A0',
'#BEC459', '#456648', '#0086ED', '#886F4C', '#34362D',
'#B4A8BD', '#00A6AA', '#452C2C', '#636375', '#A3C8C9',
'#FF913F', '#938A81', '#575329', '#00FECF', '#B05B6F',
'#8CD0FF', '#3B9700', '#04F757', '#C8A1A1', '#1E6E00',
'#7900D7', '#A77500', '#6367A9', '#A05837', '#6B002C',
'#772600', '#D790FF', '#9B9700', '#549E79', '#FFF69F',
'#201625', '#72418F', '#BC23FF', '#99ADC0', '#3A2465',
'#922329', '#5B4534', '#FDE8DC', '#404E55', '#0089A3',
'#CB7E98', '#A4E804', '#324E72', '#6A3A4C'
]
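If a diagram ever has more links than the list has colours, one option is to repeat the palette. The sketch below assumes a hypothetical helper, colours_for_links, and uses a shortened stand-in palette so that it runs on its own.

```python
from itertools import cycle, islice

# Shortened stand-in for the full color_link list above.
palette = ['#000000', '#FFFF00', '#1CE6FF']


def colours_for_links(palette, n_links):
    """Return exactly n_links colours, repeating the palette when it runs out."""
    return list(islice(cycle(palette), n_links))


print(colours_for_links(palette, 5))
```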

We should also set up some labels to see how our information flows.

# Our labels. The number in each trailing comment is the node index that the label belongs to.
label = ['Zero', # 0
'One', # 1
'Two', # 2
'Zero-Zero', # 3
'Zero-One', # 4
'One-Zero', # 5
'One-One', # 6
'Two-Zero', # 7
'Two-One' # 8
]

Sankey 1

And now to our first Sankey. A very modest one, but it provides the essential building blocks for our work.

IMPORTANT — Sankeys take numeric values, NOT descriptive names.
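Because the diagram wants indices, it can help to keep the names in your own data and translate them just before plotting. A minimal sketch, assuming a hypothetical list of (source name, target name) pairs:

```python
label = ['Zero', 'One', 'Two', 'Zero-Zero', 'Zero-One']

# Hypothetical flows written with names, which is how we tend to think of them.
named_links = [('Zero', 'One'), ('Zero', 'Two'), ('One', 'Zero-Zero')]

# Map each name to its position in the label list.
index_of = {name: i for i, name in enumerate(label)}

# Translate the names into the numeric lists a Sankey expects.
source = [index_of[s] for s, _ in named_links]
target = [index_of[t] for _, t in named_links]

print(source)  # [0, 0, 1]
print(target)  # [1, 2, 3]
```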

Here is our initial data

# Data
source = [0, 1, 2] # 3 Source Nodes
target = [1, 2, 3, 4, 5, 6] # 6 Target Nodes
value = [1, 1, 1, 1, 1, 1 ] # 6 Values

As you can see, we have three source nodes (0, 1, 2), six target nodes (1, 2, 3, 4, 5, 6), and six values.

We then put our data in a Python dictionary, passing it to Plotly's graph object Sankey.

# Link references our fields above
link = dict(source=source, target=target, value=value, color=color_link)
# node handles assigning labels and the housekeeping around the diagram.
node = dict(label=label, pad=35, thickness=20)
# We then package the information into our data object
data = go.Sankey(link=link, node=node)

Next, we style our diagram in the standard Plotly way. I like a dark background on my diagrams hence the paper_bgcolor being set.

fig = go.Figure(data)
fig.update_layout(
    hovermode='x',
    title='Sankey 1 - 3 nodes, 6 targets, 6 values',
    font=dict(size=10, color='white'),
    # Set the diagram's background colour to dark grey
    paper_bgcolor='#51504f'
)
fig.show()

When you stitch it all together and run the code, you should end up with a diagram similar to the following.

Sankey 1 — Sankey Diagram but not what we were expecting — Image by Author

Yes. There you have it, a Sankey diagram. What! It's not what you expected? I didn't expect that either when I first ran the code. Up to that point, I thought I understood what was happening. Luckily I do now.

Let me explain. First, we need to review two pieces of code. Our 'label' and our 'source' and 'target'.

# Our labels. The number in each trailing comment is the node index that the label belongs to.
label = ['Zero', # 0
'One', # 1
'Two', # 2
'Zero-Zero', # 3
'Zero-One', # 4
'One-Zero', # 5
'One-One', # 6
'Two-Zero', # 7
'Two-One' # 8
]

And our data

source = [0, 1, 2]  # 3 Source Nodes
target = [1, 2, 3, 4, 5, 6] # 6 Target Nodes
value = [1, 1, 1, 1, 1, 1 ] # 6 Values

Looking at the above, you can see that:
source 0 is pointing to target 1
or using labels
Zero is pointing at One

Continuing in the above train of thought but using label names, we see the flow as follows.

source      target
Zero        One
One         Two
Two         Zero-Zero

Looking at the diagram, we see that is precisely what we get.

You will note that target nodes 4, 5, and 6 are not displayed. This is because links are paired by position: the source list has only three entries, so there are no sources to map to the remaining three targets.
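One way to see this pairing is to line the lists up by position. The sketch below uses Python's zip, which stops at the shortest list. It is only an illustration of the three links the diagram shows, not a description of Plotly's internals.

```python
label = ['Zero', 'One', 'Two', 'Zero-Zero', 'Zero-One',
         'One-Zero', 'One-One', 'Two-Zero', 'Two-One']
source = [0, 1, 2]
target = [1, 2, 3, 4, 5, 6]
value = [1, 1, 1, 1, 1, 1]

# zip pairs items by position and stops when the shortest list runs out.
links = [(label[s], label[t]) for s, t, _ in zip(source, target, value)]
print(links)

# Targets left over once the sources have run out.
unpaired = target[len(source):]
print(unpaired)
```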

Takeaway 1 — Understand the relationship between source and target nodes. A source and a target with the same number refer to the same node, so they occupy the same location in the diagram. You will see an example of this later.

Sankey 2

Let's try to make it a bit more exciting and show individual flows from the sources 'Zero', 'One' and 'Two' out to their own targets: Zero-Zero, Zero-One and so on.

I won't explain the code in depth; I will point out the significant differences.

# Data
source2 = [0, 0, 1, 1, 2, 2] # 3 Source Nodes
target2 = [3, 4, 5, 6, 7, 8] # 6 Target Nodes
value2 = [1, 1, 1, 1, 1, 1] # 6 Values
# Link references our fields above
link2 = dict(source=source2, target=target2, value=value2, color=color_link)
# node handles assigning labels and the housekeeping around the diagram
node2 = dict(label=label, pad=35, thickness=20)
# We then package the information into our data object
data2 = go.Sankey(link=link2, node=node2)
fig2 = go.Figure(data2)
fig2.update_layout(
    hovermode='x',
    title='Sankey 2 - 3 individual nodes, 6 individual targets, 6 values',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f'
)
fig2.show()

The source2 and target2 lists in the code above show that we have three source nodes, 0, 1, and 2, pointing to targets 3–8. If you reference the label code above, you will see that these are the targets Zero-Zero to Two-One.

Let's see what we get.

Sankey 2 - 3 sources going to 6 targets — Image by Author

Now that is much better than Sankey 1. That's what I would have expected. If you look at the source and target list, you see that the source is 0–2, and targets start at 3, finishing at 8, precisely what is displayed above.

Takeaway 2 — When you know better, you do better. We saw that we got our expected outcome as long as the source and target list items didn't overlap.
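A quick way to check whether any node appears on both sides is a set intersection. A small sketch using the lists from Sankey 1 and Sankey 2 (shared_nodes is a hypothetical helper):

```python
def shared_nodes(source, target):
    """Return the node indices that appear as both a source and a target."""
    return sorted(set(source) & set(target))


sankey1 = shared_nodes([0, 1, 2], [1, 2, 3, 4, 5, 6])
sankey2 = shared_nodes([0, 0, 1, 1, 2, 2], [3, 4, 5, 6, 7, 8])
print(sankey1)  # [1, 2]
print(sankey2)  # []
```

An empty result means every source sits on one side and every target on the other. Anything else means some nodes sit in the middle of a chain.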

Sankey 3

Now we are starting to get into the building blocks of Sankeys. In this Sankey, we will set up three source nodes, each flowing to two target nodes. Two of those targets, nodes 1 and 2, are also sources.

source3 = [0, 0, 1, 1, 2, 2]  # 3 Source Nodes
target3 = [1, 2, 3, 4, 5, 6] # 6 Target Nodes
value3 = [1, 1, 1, 1, 1, 1 ] # 6 Values
link3 = dict(source=source3, target=target3, value=value3, color=color_link)
node3 = dict(label = label, pad=35, thickness=20)
data3 = go.Sankey(link=link3, node=node3)
fig3 = go.Figure(data3)
fig3.update_layout(
    hovermode='x',
    title='Sankey 3 - 3 nodes, 6 targets, 6 values',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f'
)
fig3.show()

If you look at the code, we see three source nodes defined as 0–2 and 6 target nodes defined as 1–6.

This should give us source node 0 pointing to target nodes 1 and 2, then source node 1 pointing to target nodes 3 and 4. Finally, source node 2 points to target nodes 5 and 6.

Breaking it down as we did previously using label names

source      target
Zero        One
Zero        Two
One         Zero-Zero
One         Zero-One
Two         One-Zero
Two         One-One

Running the code gives us the following.

Sankey 3 — one node, to two nodes with two other target nodes. — Image by Author

This is precisely what we predicted above. As I mentioned at the beginning of this article, when discussing key terms, if we reference a target node from a source and the target is also a source node, we can then define depths to our Sankey diagrams. This becomes very powerful, as we see in the following two examples.
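The idea of depth can be sketched without Plotly: a node with no incoming link is at depth 0, and every other node sits one step beyond its deepest source. The function node_depths below is a hypothetical helper and assumes the links contain no cycles.

```python
def node_depths(source, target):
    """Return {node index: depth}; nodes without incoming links have depth 0.

    Assumes the links form no cycles.
    """
    nodes = set(source) | set(target)
    depth = {n: 0 for n in nodes}
    # Relax repeatedly: a target is at least one step deeper than its source.
    # With no cycles this settles within len(nodes) passes.
    for _ in range(len(nodes)):
        changed = False
        for s, t in zip(source, target):
            if depth[t] < depth[s] + 1:
                depth[t] = depth[s] + 1
                changed = True
        if not changed:
            break
    return depth


# Sankey 3: nodes 1 and 2 are targets of node 0 and sources for nodes 3 to 6.
depths = node_depths([0, 0, 1, 1, 2, 2], [1, 2, 3, 4, 5, 6])
for node in sorted(depths):
    print(node, depths[node])
```

For Sankey 3 this gives three levels, which matches the three columns in the diagram.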

Takeaway 3 — By referencing different nodes between source and targets, we can start to build complicated Sankey diagrams.

Sankey 4

Now that we have the basic building blocks, we can map out more complicated processes. In this example, I will map out an imaginary process.

Firstly we will need some new labels.

label4 = ['Risk',              # 0
'Reward', # 1
'Forfeit', # 2
'People', # 3
'Process', # 4
'Systems', # 5
'External Events', # 6
'Money', # 7
'Food', # 8
'Cash', # 9
'Car', # 10
'Holiday' # 11
]

Then our code.

source4 = [0, 0, 0, 0, 1, 1, 2, 2]  # 3 Source Nodes
target4 = [3, 4, 5, 6, 7, 8, 9, 10] # 8 Target Nodes
value4 = [1, 1, 1, 1, 1, 1, 1, 1 ] # 8 Values
link4 = dict(source=source4, target=target4, value=value4, color=color_link)
node4 = dict(label = label4, pad=35, thickness=20)
data4 = go.Sankey(link=link4, node=node4)
fig4 = go.Figure(data4)
fig4.update_layout(
    hovermode='x',
    title='Testing 4 - 3 nodes, 8 targets, 8 values',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f'
)
fig4.show()

Running the above code results in Sankey 4.

Sankey 4 - 3 distinct nodes, all with their own distinct targets. Image by Author

There you have it; three distinct processes mapped out with their individual targets. This was a simple example, but the next posed a few more interesting issues.

Takeaway 4 — You can have more than one process mapped on a Sankey. The crucial part of the code that keeps everything separate is the 'pad=35' in our node dictionary. Adding or reducing the number has the effect of increasing or decreasing the spacing between nodes.

Sankey 5

Thanks for sticking with me. We are about to graduate from the basics. Here I will draw out a much larger process which brings with it its own issues. More on that as we work through the problem.

First, some new labels. Sorry, there are lots of them.

label5 = ['Risk',              # 0
'Reward', # 1
'Forfeit', # 2
'People', # 3
'Process', # 4
'Systems', # 5
'External Events', # 6
'Money', # 7
'Food', # 8
'Outdoors', # 9
'Car', # 10
'Holiday', # 11
'Tom', # 12
'Dick', # 13
'Harry', # 14
'Fast Forward', # 15
'Reverse', # 16
'Play', # 17
'XBox', # 18
'Playstation', # 19
'Nintendo', # 20
'Earthquake', # 21
'Terrorism', # 22
'Cheque', # 23
'Credit Card', # 24
'Burger', # 25
'Pizza', # 26
'Tent', # 27
'Pack', # 28
'Boots', # 29
'Landrover', # 30
'Mazda', # 31
'Ibiza', # 32
'Italy', # 33
'Spain', # 34
'USA', # 35
'Australia', # 36
'Son', # 37
'Daughter', # 38
'Red', # 39
'Blue', # 40
'Grey', # 41
'Silver' # 42
]

And now our code. It hasn't moved on much from when we started. The sources, targets and values have just increased a lot.

source5 = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 10, 10, 11, 11, 11, 11, 11, 12, 12, 31, 31, 30, 30]  
target5 = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
value5 = [5, 3, 3, 2, 2, 2, 3, 4, 5, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
link5 = dict(source=source5, target=target5, value=value5, color=color_link)
node5 = dict(label = label5, pad=35, thickness=20)
data5 = go.Sankey(link=link5, node=node5)
fig5 = go.Figure(data5)
fig5.update_layout(
    hovermode='x',
    title='Sankey 5 - 40 nodes, 40 targets',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f'
)
fig5.show()

This gives us a beautiful Sankey diagram of a proper flow.

Sankey 5 — Imaginary real-world process. 40 nodes showing flows. Image by Author

I was thrilled when I got the diagram where I wanted it to be. You can probably see by looking at the source, target and values that there is an excellent opportunity for confusion. As I was knocking these up in Jupyter Notebooks, I stuck in the following to validate my data.

print(f'source5 has {len(source5)} items')
print(f'target5 has {len(target5)} items')
print(f'value5 has {len(value5)} items')

Which output a very helpful confidence check.

Confidence Check — Image by Author

But again, that was only one issue. I also had nodes connecting to nodes I didn't expect, and I had to go back to basics with paper and pen to work it out.
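These checks could be bundled into one function. The following is a sketch of the idea rather than the code used above: check_sankey_lists is a hypothetical helper that confirms the three lists have the same length and that every index points at a real label, then returns the links as label pairs so they can be read by eye.

```python
def check_sankey_lists(label, source, target, value):
    """Validate Sankey input lists and return the links as (source label, target label) pairs."""
    if not (len(source) == len(target) == len(value)):
        raise ValueError(
            f'Length mismatch: source={len(source)}, '
            f'target={len(target)}, value={len(value)}'
        )
    for i in list(source) + list(target):
        if not 0 <= i < len(label):
            raise ValueError(f'Node index {i} has no label')
    return [(label[s], label[t]) for s, t in zip(source, target)]


# Shortened, made-up example data.
label = ['Risk', 'Reward', 'People', 'Money']
pairs = check_sankey_lists(label, [0, 1], [2, 3], [1, 1])
for src, tgt in pairs:
    print(f'{src} -> {tgt}')
```

Reading the printed pairs is a quicker way to find a link that points at the wrong node than tracing index numbers across three long lists.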

Sankey 5 — Working out why my nodes were not lining up — Image by Author

Do not get me wrong. I was delighted with the output, which helped me understand how Sankey works. It was just a bit laborious.

Takeaway 5 — When Sankeys start to get large and unwieldy, it's time to get our programming marbles on. But before you can do that, you NEED to understand how a Sankey is put together and how they work.

Reflection on the previous examples.

Well, that was fun! I now think I have a good understanding of how Sankeys work, and enough knowledge to realise that, for them to be useful, we will need to automate a lot of this. So off I went, back on my search for more Sankey information, and found a great article by Baysan on How to Automatically Generate Data Structure for Sankey Diagrams.

Onwards and upwards

Let's automate this

So in the previous section, we did all this manually. Now finally, we will automate the creation of the Sankey diagrams. I pulled heavily from Baysan's article and modified his function for my needs.

The data I use for this article is partially sourced from publicly available documents from BIS.

You can pull the CSV files I created from my GitHub here.

The Data

You can use either data file in the GitHub repo to run the program. Both files have the same format as below. risk_cat.csv is the larger of the two files. I ran the code in Jupyter Notebooks, but it runs fine in PyCharm or any other IDE with slight modifications.

DataFrame of information used in our code. — Image by Author
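To show how rows in this shape turn into links, here is a plain-Python sketch. The rows are invented for illustration (they are not taken from the CSV files), the column names l1, l2, l3 and weight follow the code later in this article, and rows_to_links is a hypothetical helper, not Baysan's function. It also assumes each name is unique across the levels.

```python
from collections import defaultdict

# Invented rows in the same shape as the CSV: three levels plus a weight.
rows = [
    {'l1': 'Risk', 'l2': 'People', 'l3': 'Fraud', 'weight': 2},
    {'l1': 'Risk', 'l2': 'People', 'l3': 'Error', 'weight': 1},
    {'l1': 'Risk', 'l2': 'Systems', 'l3': 'Outage', 'weight': 3},
]


def rows_to_links(rows, cols, value_col):
    """Sum the weight of every adjacent (parent, child) pair across the level columns."""
    totals = defaultdict(int)
    for row in rows:
        for parent_col, child_col in zip(cols, cols[1:]):
            totals[(row[parent_col], row[child_col])] += row[value_col]

    # Give each distinct name one index, in order of first appearance.
    label = []
    for parent, child in totals:
        for name in (parent, child):
            if name not in label:
                label.append(name)

    source = [label.index(p) for p, _ in totals]
    target = [label.index(c) for _, c in totals]
    value = list(totals.values())
    return label, source, target, value


label, source, target, value = rows_to_links(rows, ['l1', 'l2', 'l3'], 'weight')
print(label)
print(list(zip(source, target, value)))
```

Each pair of neighbouring columns becomes one layer of links, which is the same idea the manual examples used, just generated from the data.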

The Code

After playing around with Baysan's function and the knowledge I gained from experimenting with Sankeys earlier, I pulled the following together. It will look very familiar if you have read the article from the beginning. It is not rocket science, but it does speed up the process of creating complicated Sankey diagrams.

The only new additions are pandas and webcolors. I assume you are all over pandas already.

To get semi-opaque links, you need to use RGBA colours. webcolors has a function, hex_to_rgb, that converts a hex colour into its red, green and blue components. We read in each hex colour, convert it, and add an alpha (opacity) value of 0.4 to produce an RGBA string.

rgb_link_color = ['rgba({},{},{}, 0.4)'.format(
hex_to_rgb(x)[0],
hex_to_rgb(x)[1],
hex_to_rgb(x)[2]) for x in color_link]
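If you would rather not add a dependency, the same conversion can be sketched with the standard library alone. hex_to_rgba below is a hypothetical helper, not part of webcolors, and it assumes six-digit hex strings like the ones in color_link.

```python
def hex_to_rgba(hex_colour, alpha=0.4):
    """Convert a '#RRGGBB' string to an 'rgba(r,g,b, a)' string."""
    h = hex_colour.lstrip('#')
    if len(h) != 6:
        raise ValueError(f'Expected a 6-digit hex colour, got {hex_colour!r}')
    # Read the red, green and blue pairs as base-16 integers.
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return f'rgba({r},{g},{b}, {alpha})'


print(hex_to_rgba('#FF4A46'))  # rgba(255,74,70, 0.4)
```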

The complete code is below. Thanks to an article by Mattia Cinelli for helping me with this.

import pandas as pd
import plotly.graph_objects as go
from webcolors import hex_to_rgb
# You can use either risk_cat.csv or risk_cat1.csv
data = pd.read_csv(r'risk_cat1.csv')
df = pd.DataFrame(data)
# Setup our colours
color_link = ['#000000', '#FFFF00', '#1CE6FF', '#FF34FF', '#FF4A46',
'#008941', '#006FA6', '#A30059','#FFDBE5', '#7A4900',
'#0000A6', '#63FFAC', '#B79762', '#004D43', '#8FB0FF',
'#997D87', '#5A0007', '#809693', '#FEFFE6', '#1B4400',
'#4FC601', '#3B5DFF', '#4A3B53', '#FF2F80', '#61615A',
'#BA0900', '#6B7900', '#00C2A0', '#FFAA92', '#FF90C9',
'#B903AA', '#D16100', '#DDEFFF', '#000035', '#7B4F4B',
'#A1C299', '#300018', '#0AA6D8', '#013349', '#00846F',
'#372101', '#FFB500', '#C2FFED', '#A079BF', '#CC0744',
'#C0B9B2', '#C2FF99', '#001E09', '#00489C', '#6F0062',
'#0CBD66', '#EEC3FF', '#456D75', '#B77B68', '#7A87A1',
'#788D66', '#885578', '#FAD09F', '#FF8A9A', '#D157A0',
'#BEC459', '#456648', '#0086ED', '#886F4C', '#34362D',
'#B4A8BD', '#00A6AA', '#452C2C', '#636375', '#A3C8C9',
'#FF913F', '#938A81', '#575329', '#00FECF', '#B05B6F',
'#8CD0FF', '#3B9700', '#04F757', '#C8A1A1', '#1E6E00',
'#7900D7', '#A77500', '#6367A9', '#A05837', '#6B002C',
'#772600', '#D790FF', '#9B9700', '#549E79', '#FFF69F',
'#201625', '#72418F', '#BC23FF', '#99ADC0', '#3A2465',
'#922329', '#5B4534', '#FDE8DC', '#404E55', '#0089A3',
'#CB7E98', '#A4E804', '#324E72', '#6A3A4C'
]
# Collect the data we need from a dataframe to populate our Sankey data - source, target, and value
def get_sankey_data(data, cols, values):
    # Empty lists to hold our data
    sankey_data = {
        'label': [],
        'source': [],
        'target': [],
        'value': []
    }
    # Set our counter to zero
    cnt = 0
    # Start loop to retrieve data from our dataframe
    while (cnt < len(cols) - 1):
        for parent in data[cols[cnt]].unique():
            sankey_data['label'].append(parent)
            for sub in data[data[cols[cnt]] == parent][cols[cnt+1]].unique():
                sankey_data['source'].append(sankey_data['label'].index(parent))
                sankey_data['label'].append(sub)
                sankey_data['target'].append(sankey_data['label'].index(sub))
                sankey_data['value'].append(data[data[cols[cnt+1]] == sub][values].sum())

        cnt += 1
    return sankey_data
# We use this to create RGBA colours for our links.
# This enables us to have semi-opaque links, which in turn
# allows us to see flows without them being obscured by solid colours
rgb_link_color = ['rgba({},{},{}, 0.4)'.format(
hex_to_rgb(x)[0],
hex_to_rgb(x)[1],
hex_to_rgb(x)[2]) for x in color_link]

# Call our get_sankey_data function - dataframe, columns, values
sankey_chart = get_sankey_data(df, ['l1', 'l2', 'l3'], 'weight')
# Style our initial Sankey chart
data = go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=sankey_chart['label'],
        color="goldenrod"
    ),
    link=dict(
        source=sankey_chart['source'],
        target=sankey_chart['target'],
        value=sankey_chart['value'],
        # Use the RGBA colours so that the links are semi-opaque
        color=rgb_link_color
    ))
# Prepare our chart
fig = go.Figure(data)
# Update chart with some customisations
fig.update_layout(
hovermode='x',
title='Sankey - Example of an Operational Risk Taxonomy',
font=dict(size=10, color='white'),
paper_bgcolor='#51504f',
# Height is needed for risk_cat.csv as the diagram is large
height=1500,
margin={'t':50,'b':20}
)
# display chart
fig.show()

On running the above with risk_cat1.csv, you get the following diagram created.

A Plotly Sankey chart showing an example Operational Risk Taxonomy flow
Sankey — Example Risk Taxonomy. Image by Author

If you run the above with risk_cat.csv, you get the much larger Sankey Diagram. Note the semi-opaque flows running through our diagram.

Takeaway 6 — Automating much larger Sankeys is not as complicated as expected. After getting the basics solidified, this was relatively quick.

Takeaway 7 — Remember, to get transparent links, you need to use RGBA colours, not HEX.

Using Sankeys to spot incorrect process flow

Now that we have had time to play around with Sankeys, let me show you one final thing.

You might wonder why I wanted to spend all this time getting an understanding of Sankeys. My reasoning is that Sankeys can very quickly show an incorrect process flow; moreover, because they do it visually, it is much easier to spot than in a spreadsheet or DataFrame.

Consider the following two small dataframes.

Two very similar DataFrames mapping the same process — Image by Author

Even though the two matrices are mapping the same process, there are slight differences between the DataFrames. Yes, in a small set like this, we can spot the differences reasonably quickly, as there is only a small amount of data.

Now look at the same data plotted on two separate Sankey diagrams.

Good Flow

Good Matrix showing correct process flow. — Image by Author

Bad Flow

You can easily spot where there are issues in the Bad Flow diagram; visually, it is straightforward. The semi-opaqueness allows us to trace back easily. Scale this up to hundreds of entries, then try and spot the issue in the DataFrame. Sankeys are well suited to displaying this kind of information.

Takeaway 8 — It is always easier to spot an inconsistency visually in a diagram than it is to spot it in a DataFrame or Spreadsheet.
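One possible way to make the same comparison in code, offered as a rough sketch rather than a finished solution, is to treat each flow as a (source, target) pair and compare the pairs against a known-good set. The process names and the function unexpected_links are made up for illustration.

```python
def unexpected_links(expected, observed):
    """Return (links only in observed, links only in expected), each sorted."""
    expected, observed = set(expected), set(observed)
    return sorted(observed - expected), sorted(expected - observed)


# Hypothetical known-good process and a version with one wrong hand-off.
good = [('Order', 'Payment'), ('Payment', 'Dispatch'), ('Dispatch', 'Delivery')]
bad = [('Order', 'Payment'), ('Payment', 'Delivery'), ('Dispatch', 'Delivery')]

extra, missing = unexpected_links(good, bad)
print('Unexpected:', extra)
print('Missing:', missing)
```

This only flags links that differ. It says nothing about the values on links that both versions share.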

Conclusion

Well, that was a whistle-stop tour of what I learned about Sankey diagrams over the weekend. I do hope you found it helpful.

Sankey diagrams are great for showing process flow. As shown above, it is easy to spot a deviation from the process when a link goes from one source to a different source/target than expected.

My next piece of work will be to automate the detection of deviations from a process. Programmatically this should be possible. When I crack it, I will let you know.

Until next time.

See ya!
