Published by Princeton University Press. Incomplete draft. This version: 2018-04-25.

Buy from Amazon

Buy from Powell's

Buy from the Publisher

“Finally! A data visualization guide that is simultaneously practical and elegant. Healy combines the beauty and insight of Tufte with the concrete helpfulness of Stack Exchange. Data Visualization is brimming with insights into how quantitative analysts can use visualization as a tool for understanding and communication. A must-read for anyone who works with data.” — Elizabeth Bruch, University of Michigan Cover “Healy’s fun and readable book is unusual in covering the ‘why do’ as well as the ‘how to’ of data visualization, demonstrating how dataviz is a key step in all stages of social science—from theory construction to measurement to modeling and interpretation of analyses—and giving readers the tools to integrate visualization into their own work.” — Andrew Gelman, Columbia University

“Data Visualization is a brilliant book that not only teaches the reader how to visualize data but also carefully considers why data visualization is essential for good social science. The book is broadly relevant, beautifully rendered, and engagingly written. It is easily accessible for students at any level and will be an incredible teaching resource for courses on research methods, statistics, and data visualization. It is packed full of clear-headed and sage insights.” Becky Pettit, University of Texas at Austin

“Healy provides a unique introduction to the process of visualizing quantitative data, offering a remarkably coherent treatment that will appeal to novices and advanced analysts alike. There is no other book quite like this.” — Thomas J. Leeper, London School of Economics

“Kieran Healy has written a wonderful book that fills an important niche in an increasingly crowded landscape of materials about software in R. Data Visualization is clear, beautifully formatted, and full of careful insights.” — Brandon Stewart, Princeton University

“Healy’s prose is clear and direct. I came away from this book with a much better understanding of both visualizations and R.” — Neal Caren, University of North Carolina, Chapel Hill

“Innovative and extraordinarily well-written.” — Jeremy Freese, Stanford University

Data Visualization

A practical introduction

Kieran Healy

Duke University
[email protected]

Preface

You should look at your data. Graphs and charts let you explore and learn about the structure of the information you collect. Good data visualizations also make it easier to communicate your ideas and findings to other people. Beyond that, producing effective plots from your own data is the best way to develop a good eye for reading and understanding graphs—good and bad—made by others, whether presented in research articles, business slide decks, public policy advocacy, or media reports. This book teaches you how to do it.

My main goal is to introduce you to both the ideas and the methods of data visualization in a sensible, comprehensible, reproducible way. Some classic works on visualizing data, such as The Visual Display of Quantitative Information (Tufte, 1983), present numerous examples of good and bad work together with some general taste-based rules of thumb for constructing and assessing graphs. In what has now become a large and thriving field of research, more recent work provides excellent discussions of the cognitive underpinnings of successful and unsuccessful graphics, again providing many compelling and illuminating examples (Ware, 2008). Other books provide good advice about how to graph data under different circumstances (Cairo, 2013; Few, 2009; Munzer, 2014), but choose not to teach the reader about the tools used to produce the graphics they show. This may be because the software used is some (proprietary, costly) point-and-click application that requires a fully visual introduction of its own, such as Tableau, Microsoft Excel, or SPSS. Or perhaps the necessary software is freely available, but showing how to use it is not what the book is about (Cleveland, 1994). Conversely, there are excellent cookbooks that provide code “recipes” for many kinds of plot (Chang, 2013). But for that reason they do not take the time to introduce the beginner to the principles behind the output they produce. Finally, we also have thorough introductions to particular software tools and libraries, including the one we will use in this book (Wickham, 2016). These can sometimes be hard for beginners to digest, as they may presuppose a background that the reader does not have.

Each of the books I have just cited is well worth your time. When teaching people how to make graphics with data, however, I have repeatedly found the need for an introduction that motivates and explains why you are doing something but that does not skip the necessary details of how to produce the images you see on the page. And so this book has two main aims. First, I want you get to the point where you can reproduce almost every figure in the text for yourself. Second, I want you to understand why the code is written the way it is, such that when you look at data of your own you can feel confident about your ability to get from a rough picture in your head to a high-quality graphic on your screen or page.

What you will learn

This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot. R is a powerful, widely used, and freely available programming language for data analysis. You may be interested in exploring ggplot after having used R before, or be entirely new to both R and ggplot and just want to graph your data. I do not assume you have any prior knowledge of R.

After installing the software we need, we begin with an overview of some basic principles of visualization. We focus not just on the aesthetic aspects of good plots, but on how their effectiveness is rooted in the way we perceive properties like length, absolute and relative size, orientation, shape, and color. We then learn how to produce and refine plots using ggplot2, a powerful, versatile, and widely-used visualization library for R (Wickham, 2016). The ggplot2 library implements a “grammar of graphics” (Wilkinson, 2005). This approach gives us a coherent way to produce visualizations by expressing relationships between the attributes of data and their graphical representation.

Through a series of worked examples, you will learn how to build plots piece by piece, beginning with scatterplots and summaries of single variables, then moving on to more complex graphics. Topics covered include plotting continuous and categorical variables, layering information on graphics; faceting grouped data to produce effective “small multiple” plots; transforming data to easily produce visual summaries on the graph such as trend lines, linear fits, error ranges, and boxplots; creating maps, and also some alternatives to maps worth considering when presenting country- or state-level data. We will also cover cases where we are not working directly with a dataset, but rather with estimates from a statistical model. From there, we will explore the process of refining plots to accomplish common tasks such as highlighting key features of the data, labeling particular items of interest, annotating plots, and changing their overall appearance. Finally we will examine some strategies for presenting graphical results in different formats, and to different sorts of audiences.

If you follow the text and examples in this book, then by the end you will:

Understand the basic principles behind effective data visualization.
Have a practical sense for why some graphs and figures work well, while others may fail to inform or actively mislead.
Know how to create a wide range of plots in R using ggplot2.
Know how to refine plots for effective presentation.

Learning how to visualize data effectively is more than just knowing how to write code that produces figures from data. This book will teach you how to do that. But it will also teach you how to think about the information you want to show, and how to consider the audience you are showing it to—including the most common case, when the audience is yourself.

This book is not a comprehensive guide to R, or even a comprehensive survey of everything ggplot can do. Nor is it a cookbook containing just examples of specific things people commonly want to do with ggplot. (Both these sorts of books already exist: see the references in the Appendix.) Neither is it a rigid set of rules, or a sequence of beautifully finished examples that you can admire but not reproduce. My goal is to get you quickly up and running in R, making plots in a well-informed way, with a solid grasp of the core sequence of steps—taking your data, specifying the relationship between variables and visible elements, and building up images layer by layer—that is at the heart of what ggplot does.

Learning ggplot does mean getting used to how R works, and also understanding how ggplot connects to other tools in the R language. As you work your way through the book, you will gradually learn more about some very useful idioms, functions, and techniques for manipulating data in R. In particular you will learn about some of the tools provided by the tidyverse library that ggplot belongs to. Similarly, although this is not a cookbook, once you get past Chapter 1 you will be able to see and understand the code used to produce almost every figure in the book. In most cases you will also see these figures built up piece by piece, a step at a time. If you use the book as it is designed, by the end you will have the makings of a version of the book itself, containing code you have written out and annotated yourself. And though we do not go into great depth on the topic of rules or principles of visualization, the discussion in Chapter 1 and its application throughout the book gives you more to think about than just a list of graph types. By the end of the book you should be able to look at a figure and be able to see it in terms of ggplot’s grammar, understanding how the various layers, shapes, and data are pieced together to make a finished plot.

The right frame of mind

It can be a little disorienting to learn a programming language like R, mostly because at the beginning there seem to be so many pieces to fit together in order for things to work properly. It can seem like you have to learn everything before you can do anything. The language has some possibly unfamiliar concepts that define how it works, like “object”, or “function”, or “class”. The syntactic rules for writing code are annoyingly picky. Error messages seem obscure; help pages are terse; other people seem to have had not quite the same issue as you. Beyond that, you sense that doing one thing often involves learning a bit about some other part of the language. To make a plot you need a table of data, but maybe you need to filter out some rows, recalculate some columns, or just get the computer to see it is there in the first place. And there is also a wider environment of supporting applications and tools that are good to know about, but involve new concepts of their own—editors that highlight what you write; applications that help you organize your code and its output; ways of writing your code that let you keep track of what you have done. It can all seem a bit confusing.

Don’t panic. You have to start somewhere. Starting with graphics is more rewarding than some of the other places you might begin, because you will be able to see the results of your efforts very quickly. As you build your confidence and ability in this area, you will gradually see the other tools as things that help you sort out some issue, or solve a problem that’s stopping you from making the picture you want. That makes them easier to learn. As you acquire them piecemeal—perhaps initially using them without completely understanding what is happening—you will begin to see how they fit together, and be more confident of your own ability to do what you need to do.

Even better, in the past decade or so the world of data analysis and programming generally has opened up in a way that has made help much easier to come by. Free tools for coding have been around for a long time, but in recent years what you might call the “ecology of assistance” has gotten better. There are more resources available for learning the various pieces, and more of them are oriented to the way writing code actually happens most of the time—which is to say, iteratively, in an error-prone fashion, and taking account of problems other people have run into and solved before.

How to use this book

This book can be used in any one of several ways. At a minimum, you can sit down and read it for a general overview of good practices in data visualization, together with many worked examples of graphics from their beginnings to a properly finished state. Even if you do not sit down and work through the code, you will get a good sense of how to think about visualization and a better understanding of the process through which good graphics are produced.

More usefully, if you set things up as described in Chapter 2, and then work through the examples⊕Or if you bring your own data to explore instead of or alongside the examples, as described in Chapter 2., then you will end up with a data visualization book of your own. If you approach the book this way, then by the end you will be comfortable using ggplot in particular and also be ready to learn more about the R language in general.

This book can also be used to teach with, either as the main focus of a course on data visualization or as a supplement to undergraduate or graduate courses in statistics or data analysis. My aim has been to make the “hidden tasks” of coding and polishing graphs more accessible and explicit. I want to make sure you are not left with the “How to Draw an Owl in Three Steps” problem common to many tutorials. You know the one. The first two steps are shown clearly enough. Sketch a few bird-shaped ovals. Make a line for a branch. But the final step, an owl such as John James Audubon might have drawn, is presented as a simple extension for readers to figure out for themselves.

If you have never used R or ggplot, you should start at the beginning of the book and work your way through to the end. If you know about R already but only want to learn the core of ggplot, then after installing the software descibed below, focus on Chapters 3 through 5. Chapter 6 (on models) necessarily incorporates some material on statistical modeling that the book cannot develop fully. This is not a statistics text. So, for example, I show generally how to fit and work with various kinds of model in Chapter 6, but I do not go through the important details of fitting, selecting, and fully understanding different approaches. I provide references in the text to other books that have this material as their main focus.

Each chapter ends with a section suggesting where to go next (apart from continuing to read the book). Sometimes I suggest other books or websites to explore. I also ask questions or pose some challenges that extend the material covered in the chapter, encouraging you to use the concepts and skills you have learned.

Conventions

In this book, we alternate between regular text (like this), samples of code that you can type and run yourself, and the output of that code. In the main text, references to objects or other things that exist in the R language or in your R project—like tables of data, variables, functions, and so on—will also appear in a monospaced or "typewriter" typeface. Code you can type directly into R at the console will be in gray boxes, and also monospaced. Like this:

my_numbers <- c(1, 1, 4, 1, 1, 4, 1)

If you type that line of code into R’s console it will create a thing called my_numbers.⊕Additional notes and information will sometimes appear in the margin, like this. Doing this doesn’t produce any output, however. When we write code that also produces output at the console, we will first see the code (in a gray box) and then the output in a monospaced font against a white background. Here we add two numbers and see the result:

4 + 1

## [1] 5

Two further notes about how to read this. First, by default in this book, anything that comes back to us at the console as the result of typing a command will be shown prefaced by two hash characters (##) at the beginning of each line of output. This is to help distinguish it from commands we type into the console. You will not see the hash characters at the console when you use R.

Second, both in the book and at the console, if the output of what you did results in a series of elements (like numbers, or observations from a variable, and so on) you will often see output that includes some number in square brackets at the beginning of the line. It looks like this: [1]. This is not part of the output itself, but just a counter or index keeping track of how many items have been printed out so far. In the case of adding 4 + 1 we got just one, or [1], thing back—the number five. If there are more elements returned as the result of some instruction or command, the counter will keep track of that on each line. In this next bit of code we will tell R to show us the lower-case letters of the alphabet:

letters

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

You can see the counter incrementing on each line as it keeps count of how many letters have been printed.

Before you begin

The book is designed for you to follow along in an active way, writing out the examples and experimenting with the code as you go. You will be able to reproduce almost all of the plots in the text. You will need to install some software first. Here is what to do:

Get the most recent version of R. R is free and available for Windows, Mac, and Linux operating systems. Download⊕cloud.r-project.org the version of R compatible with your operating system. If you are running Windows or MacOS, you should choose one of the precompiled binary distributions (i.e., ready-to-run applications) linked at the top of the R Project’s webpage.
Once⊕rstudio.com R is installed, download and install R Studio. R Studio is an “Integrated Development Environment”, or IDE. This means it is a front-end for R that makes it much easier to work with. R Studio is also free, and available for Windows, Mac, and Linux platforms.
Install⊕tidyverse.org the tidyverse library and several other add-on packages for R. These libraries provide useful functionality that we will take advantage of throughout the book. You can learn more about the tidyverse’s family of packages at its website.

To install the tidyverse, make sure you have an Internet connection and then launch R Studio. Type⊕I strongly recommend typing all the code examples right from the beginning, instead of copying and pasting. the following lines of code at R’s command prompt, located in the window named “Console”, and hit return. In the code below, the <- arrow is made up of two keystrokes, first < and then the short dash or minus symbol, -.

my_packages <- c("tidyverse", "broom", "coefplot", "cowplot",
                 "gapminder", "GGally", "ggrepel", "ggridges", "gridExtra",
                 "here", "interplot", "margins", "maps", "mapproj",
                 "mapdata", "MASS", "quantreg", "rlang", "scales",
                 "survey", "srvyr", "viridis", "viridisLite", "devtools")

install.packages(my_packages, repos = "http://cran.rstudio.com")

R Studio should then download and install these packages for you. It may take a little while to download everything.

With these packages available, you can then install one last library of material that’s useful specifically for this book. It⊕github.com is hosted on GitHub,⊕GitHub is a web-based service where users can host, develop, and share code. It uses git, a version control system that allows projects, or repositories, to preserve their history and incorporate changes from contributors in an organized way. rather than R’s central package repository, so we use a different function to fetch it.

devtools::install_github("kjhealy/socviz")

Once you’ve done that, we can get started.

☰ Menu