Welcome to the World Data Visualization Challenge Exploration Notebook! Let’s dig into how I (Jim Vallandingham) started the analysis of this data, found additional related historical data, and tried to find a story in it all.

My hope in releasing this quick analysis is to provide transparency to the final visualization, and to improve my analysis process.

Good Government Data

Let’s start with a look at the provided “what makes good government” dataset, which you can find from the challenge homepage.

My angle when looking at this data is skewed toward the idea that good government should support the well-being of the people it governs. My hypothesis is that more traditional measures of government performance and rank might be connected to this populous well-being. Meaning you don’t have to have one or the other - you can have happy people that are productive people, which in turn helps increase the power of the government.

Loading Data

First, we need to load the data. I downloaded a CSV version of the Google Doc sheet, and cleaned out the information rows. Its in the data directory.

We will use the wonderful tidyverse collection of packages for most of this, so let’s load that now.

── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Now load the data, indicating the - is used for missing values in the CSV.

filename <- "data/WDVP_good_gov.csv"
gov_data <- read_csv(filename, na = '-')
Missing column names filled in: 'X5' [5], 'X11' [11], 'X23' [23]Parsed with column specification:
  .default = col_double(),
  country = col_character(),
  `ISO Country code` = col_character(),
  population = col_number(),
  `surface area (Km2)` = col_number(),
  X5 = col_logical(),
  X11 = col_logical(),
  `GDP per capita (PPP)` = col_number(),
  `health expenditure per person` = col_number(),
  `education expenditure per person` = col_number(),
  X23 = col_logical()
See spec(...) for full column specifications.

There are some spacer columns in the original Google Sheet - so we get some warnings. This is nothing to worry about though.

Metric Descriptions

Before we dive in, it might be worthwhile to talk a bit about some of the metrics this dataset contains. Some are pretty sophisticated composite values that I wanted to learn a bit more about before I charged ahead with the visualization and analysis.


First the basics: GDP. I know its used everywhere, all the time, but I wanted to get a refresher on the details.

While looking for pros/cons of GDP as a performance measure, I found an article about why the GDP sucks which lists some alternatives. Lucky for me (thanks to the foresight of the Information is Beautiful data collectors), many of these alternatives are in the government dataset we are looking at.

Below are some nice quotes from that article, summarising in plain-english what these metrics are about.

Human Development Index (HDI)

Alongside financial measures, the Human Development Index combines statistics on life expectancy, education, and income per capita to rank countries into tiers of development. Often framed in terms of whether people are able to “be” and “do” desirable things in life, HDI was developed by Pakistani economist Mahbub ul Haq and is backed by the United Nations. The higher the number, the more “developed” the country.

I like this summary of this metric tracking if people can “be” and “do” the things they want.

Happy Planet Index

This uses global data on life expectancy, experienced well-being, and ecological footprint to calculate an index that ranks countries on how many long and happy lives they produce per unit of environmental input. The higher the number, the happier the population.

Gini Coefficient

One of the biggest issues with GDP is that it ignores inequality. The Gini coefficient, invented in 1912 by Italian statistician Corrado Gini, represents the income distribution of a nation’s residents. A low number represents low inequality, while a high one indicates a major gap between the rich and poor.

And from the CIA Factbook:

Gini index measures the degree of inequality in the distribution of family income in a country. The more nearly equal a country’s income distribution, the lower its Gini index, e.g., a Scandinavian country with an index of 25. The more unequal a country’s income distribution, the higher its Gini index, e.g., a Sub-Saharan country with an index of 50. If income were distributed with perfect equality the index would be zero; if income were distributed with perfect inequality, the index would be 100.

Gross National Income (GNI)

This one is not in our dataset, but does sound interesting from its description:

GNI measures the economic output of its citizens, taking into account the globalization of modern business. It’s calculated by adding up all the income of the residents of a country, including that received from abroad. As such, it helps show the economic strength of a country’s population, rather than the country itself.

Economic Freedom

This one is not listed in the GDP alternatives site, but sounds like a very interesting measurement to look more closely at.

From the index’s site:

Economic freedom is the fundamental right of every human to control his or her own labor and property. In an economically free society, individuals are free to work, produce, consume, and invest in any way they please. In economically free societies, governments allow labor, capital, and goods to move freely, and refrain from coercion or constraint of liberty beyond the extent necessary to protect and maintain liberty itself.

Economic freedom brings greater prosperity. The Index of Economic Freedom documents the positive relationship between economic freedom and a variety of positive social and economic goals. The ideals of economic freedom are strongly associated with healthier societies, cleaner environments, greater per capita wealth, human development, democracy, and poverty elimination.

This seems like it falls right in the wheelhouse of what a “good” government can provide!

Sustainable Economic Development Assessment (SEDA)

Also not in the GDP alternatives list. This is a new metric to measure well-being.

SEDA is primarily an objective measure (combining data on outcomes, such as in health and education, with quasi-objective data, such as governance assessments). It is also a relative measure that assesses how a country performs in comparison to either the entire universe of countries or to individual peers or groups.

SEDA uses 10 dimensions, including wealth, quality of environment, and income equality to calculate a country’s score. Sounds like quite a composite number!

Exploring Data

Now we have a bit of backstory on the metrics, let’s dig more into the data itself.

First, I’ll just rename some of the columns for easier access.

gov_data$gdp_per_cap <- gov_data$`GDP per capita (PPP)`
gov_data$hdi <- gov_data$`human development index`
gov_data$seda <- gov_data$`sustainable economic development assessment (SEDA)`
gov_data$gini <- gov_data$`GINI index`
gov_data$efree <- gov_data$`overall economic freedom score` 

We can also calculate a few additional metrics using the data we have, that might be useful for showing patterns.

# population density
gov_data$pop_density <- gov_data$population / gov_data$`surface area (Km2)`
# this is a suggested max population for 'small' countries
small_country_max_pop = 5000000
# we can use this threshold to create a new categorical column
gov_data$country_size <- ifelse(gov_data$population > small_country_max_pop, 'big', 'little')
gov_data$log_gdp <- log(gov_data$gdp_per_cap)


I like to use histograms to get a quick sense of what’s in a dataset, and to uncover any mistakes with importing data into R.

So let’s take a quick look at the Human Development Index (HDI) provided. We will use ggplot to visualize.

gov_data %>%
  ggplot(aes(x = hdi)) +
  geom_histogram(binwidth = 0.05) +
  labs(title = "Distribution of HDI")

A challenge with histograms is picking good bins… I’ve used a binwidth of 0.05 just because I know the HDI should range between 0 and 1, and I wanted to see generally where there are gaps and overlaps.

The warning also indicates that there are 9 countries with no HDI data. Not too many, given the dataset contains 195 countries. Still these might be good to filter out, if we pursue HDI further.


Next, let’s take a quick look at GDP.

gov_data %>%
  ggplot(aes(x = gdp_per_cap)) +
  geom_histogram(bins=30) +
  labs(title = "Distribution of GDP")