Welcome to the World Data Visualization Challenge Exploration Notebook! Let’s dig into how I (Jim Vallandingham) started the analysis of this data, found additional related historical data, and tried to find a story in it all.
My hope in releasing this quick analysis is to provide transparency to the final visualization, and to improve my analysis process.
Good Government Data
Let’s start with a look at the provided “what makes good government” dataset, which you can find from the challenge homepage.
My angle when looking at this data is skewed toward the idea that good government should support the well-being of the people it governs. My hypothesis is that more traditional measures of government performance and rank might be connected to the well-being of the populace. Meaning you don’t have to have one or the other: you can have happy people who are also productive people, which in turn helps increase the power of the government.
Loading Data
First, we need to load the data. I downloaded a CSV version of the Google Doc sheet, and cleaned out the information rows. It’s in the data directory.
We will use the wonderful tidyverse collection of packages for most of this, so let’s load that now.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Now load the data, indicating that - is used for missing values in the CSV.
filename <- "data/WDVP_good_gov.csv"
gov_data <- read_csv(filename, na = '-')
Missing column names filled in: 'X5' [5], 'X11' [11], 'X23' [23]
Parsed with column specification:
cols(
.default = col_double(),
country = col_character(),
`ISO Country code` = col_character(),
population = col_number(),
`surface area (Km2)` = col_number(),
X5 = col_logical(),
X11 = col_logical(),
`GDP per capita (PPP)` = col_number(),
`health expenditure per person` = col_number(),
`education expenditure per person` = col_number(),
X23 = col_logical()
)
See spec(...) for full column specifications.
There are some spacer columns in the original Google Sheet - so we get some warnings. This is nothing to worry about though.
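To see what the missing-value marker is doing, here is a minimal sketch on made-up data. It uses base R’s read.csv(), whose na.strings argument plays the same role as the na argument of read_csv(). The three-row table and its country names A, B, and C are invented for illustration and are not from the challenge dataset.

```r
# Toy CSV text standing in for the real file (all values are made up).
csv_text <- "country,population,gini
A,1000,25.1
B,2500,-
C,-,40.3"

# Treat a lone "-" as missing, so numeric columns stay numeric.
toy <- read.csv(text = csv_text, na.strings = "-", stringsAsFactors = FALSE)

str(toy)               # population and gini are parsed as numbers
colSums(is.na(toy))    # one missing value each in population and gini
```

Without the missing-value marker, the "-" entries would force both columns to be read as text.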
Metric Descriptions
Before we dive in, it might be worthwhile to talk a bit about some of the metrics this dataset contains. Some are pretty sophisticated composite values that I wanted to learn a bit more about before I charged ahead with the visualization and analysis.
GDP
First the basics: GDP. I know it’s used everywhere, all the time, but I wanted to get a refresher on the details.
While looking for pros/cons of GDP as a performance measure, I found an article about why the GDP sucks which lists some alternatives. Lucky for me (thanks to the foresight of the Information is Beautiful data collectors), many of these alternatives are in the government dataset we are looking at.
Below are some nice quotes from that article, summarising in plain English what these metrics are about.
Human Development Index (HDI)
Alongside financial measures, the Human Development Index combines statistics on life expectancy, education, and income per capita to rank countries into tiers of development. Often framed in terms of whether people are able to “be” and “do” desirable things in life, HDI was developed by Pakistani economist Mahbub ul Haq and is backed by the United Nations. The higher the number, the more “developed” the country.
I like this summary of this metric tracking if people can “be” and “do” the things they want.
Happy Planet Index
This uses global data on life expectancy, experienced well-being, and ecological footprint to calculate an index that ranks countries on how many long and happy lives they produce per unit of environmental input. The higher the number, the happier the population.
Gini Coefficient
One of the biggest issues with GDP is that it ignores inequality. The Gini coefficient, invented in 1912 by Italian statistician Corrado Gini, represents the income distribution of a nation’s residents. A low number represents low inequality, while a high one indicates a major gap between the rich and poor.
And from the CIA Factbook:
Gini index measures the degree of inequality in the distribution of family income in a country. The more nearly equal a country’s income distribution, the lower its Gini index, e.g., a Scandinavian country with an index of 25. The more unequal a country’s income distribution, the higher its Gini index, e.g., a Sub-Saharan country with an index of 50. If income were distributed with perfect equality the index would be zero; if income were distributed with perfect inequality, the index would be 100.
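To make the 0-to-100 scale concrete, here is a small sketch that computes a Gini index from a vector of incomes using the textbook mean-absolute-difference definition. The helper name gini_index and the toy incomes are my own for illustration. Published indices are estimated from survey data and may be calculated differently.

```r
# Gini index (0-100) from the mean absolute difference between all pairs.
# gini_index is a hypothetical helper, not part of any package used here.
gini_index <- function(income) {
  n <- length(income)
  mean_abs_diff <- sum(abs(outer(income, income, "-"))) / n^2
  100 * mean_abs_diff / (2 * mean(income))
}

equal_society   <- gini_index(c(10, 10, 10, 10))  # everyone earns the same
unequal_society <- gini_index(c(0, 0, 0, 100))    # one person earns everything

equal_society
unequal_society
```

With only four people the "one person has everything" case gives 75 rather than 100. For a finite group of n people the maximum is 100 * (n - 1) / n, which approaches 100 as the population grows.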
Gross National Income (GNI)
This one is not in our dataset, but does sound interesting from its description:
GNI measures the economic output of its citizens, taking into account the globalization of modern business. It’s calculated by adding up all the income of the residents of a country, including that received from abroad. As such, it helps show the economic strength of a country’s population, rather than the country itself.
Economic Freedom
This one is not listed in the GDP alternatives site, but sounds like a very interesting measurement to look more closely at.
From the index’s site:
Economic freedom is the fundamental right of every human to control his or her own labor and property. In an economically free society, individuals are free to work, produce, consume, and invest in any way they please. In economically free societies, governments allow labor, capital, and goods to move freely, and refrain from coercion or constraint of liberty beyond the extent necessary to protect and maintain liberty itself.
Economic freedom brings greater prosperity. The Index of Economic Freedom documents the positive relationship between economic freedom and a variety of positive social and economic goals. The ideals of economic freedom are strongly associated with healthier societies, cleaner environments, greater per capita wealth, human development, democracy, and poverty elimination.
This seems like it falls right in the wheelhouse of what a “good” government can provide!
Sustainable Economic Development Assessment (SEDA)
Also not in the GDP alternatives list. This is a new metric to measure well-being.
SEDA is primarily an objective measure (combining data on outcomes, such as in health and education, with quasi-objective data, such as governance assessments). It is also a relative measure that assesses how a country performs in comparison to either the entire universe of countries or to individual peers or groups.
SEDA uses 10 dimensions, including wealth, quality of environment, and income equality to calculate a country’s score. Sounds like quite a composite number!
Exploring Data
Now we have a bit of backstory on the metrics, let’s dig more into the data itself.
First, I’ll just rename some of the columns for easier access.
gov_data$gdp_per_cap <- gov_data$`GDP per capita (PPP)`
gov_data$hdi <- gov_data$`human development index`
gov_data$seda <- gov_data$`sustainable economic development assessment (SEDA)`
gov_data$gini <- gov_data$`GINI index`
gov_data$efree <- gov_data$`overall economic freedom score`
We can also calculate a few additional metrics using the data we have, that might be useful for showing patterns.
# population density
gov_data$pop_density <- gov_data$population / gov_data$`surface area (Km2)`
# this is a suggested max population for 'small' countries
small_country_max_pop = 5000000
# we can use this threshold to create a new categorical column
gov_data$country_size <- ifelse(gov_data$population > small_country_max_pop, 'big', 'little')
gov_data$log_gdp <- log(gov_data$gdp_per_cap)
HDI
I like to use histograms to get a quick sense of what’s in a dataset, and to uncover any mistakes with importing data into R.
So let’s take a quick look at the Human Development Index (HDI) provided. We will use ggplot to visualize.
gov_data %>%
ggplot(aes(x = hdi)) +
geom_histogram(binwidth = 0.05) +
labs(title = "Distribution of HDI")
A challenge with histograms is picking good bins… I’ve used a binwidth of 0.05 just because I know the HDI should range between 0 and 1, and I wanted to see generally where there are gaps and overlaps.
The warning also indicates that there are 9 countries with no HDI data. Not too many, given the dataset contains 195 countries. Still these might be good to filter out, if we pursue HDI further.
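The same check can be done without a plot. Below is a rough sketch in base R that counts the missing values and tabulates the 0.05-wide bins directly. The HDI values are invented for illustration.

```r
# Made-up HDI values, including two missing ones.
hdi <- c(0.36, 0.42, 0.48, 0.56, 0.61, 0.68, 0.72, 0.74, 0.79, 0.86, 0.91, 0.93, NA, NA)

# How many rows would the histogram silently drop?
n_missing <- sum(is.na(hdi))

# Same bins as binwidth = 0.05 on a 0-1 scale; hist() ignores the NAs.
bins <- hist(hdi, breaks = seq(0, 1, by = 0.05), plot = FALSE)
bin_table <- data.frame(bin_start = head(bins$breaks, -1), count = bins$counts)

n_missing
bin_table[bin_table$count > 0, ]
```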
GDP
Next, let’s take a quick look at GDP.
gov_data %>%
ggplot(aes(x = gdp_per_cap)) +
geom_histogram(bins=30) +
labs(title = "Distribution of GDP")
Ah, we see a much different distribution here! This looks a bit exponential, which is not too surprising. Let’s take a look at the countries with really big GDP per capita:
gov_data %>%
filter(gdp_per_cap > 50000) %>%
arrange(-gdp_per_cap) %>%
select(country, gdp_per_cap, country_size)
We can see that most of these are ‘little’ countries - with smaller populations. As noted, these countries can skew results (and they do - just check the histogram). They are candidates to filter out if we are more concerned with telling a ‘big picture’ story.
HDI vs GDP
A scatterplot can show a relationship between two variables. Here, we can check the relationship between HDI and GDP.
gov_data %>%
ggplot(aes(x = gdp_per_cap, y = hdi, colour = country_size)) +
geom_point(size=2) +
geom_text(check_overlap = TRUE, aes(label = country), colour = '#333333', alpha = 0.7) +
labs(title = "HDI vs GDP")
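I’ve colored the points by country size, so we can see the impact of these little countries again. There is definitely a relationship here: HDI rises with GDP per capita. Part of that is built in, since HDI includes income per capita as one of its components, so the plot alone doesn’t tell us which one drives the other.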
Metric Correlations
What other relationships are there in this dataset?
Let’s jam a bunch together into a faceted scatterplot! Calling the built-in plot() function on a data frame will do this for us nicely. Below, I select just a small subset of the columns so that the relationships are still (somewhat) visible.
gov_data %>%
select(gdp_per_cap, gini, efree, `happy planet index`, hdi, seda, `unemployment (%)`) %>%
plot()
So we can see some correlations visually from these plots. Let’s quantify this a bit more by plotting a correlation matrix. Using a nice tutorial as a guide, here I use the cor() function to calculate the correlation values between these variables, then plot them using corrplot.
# If you don't have corrplot install with:
# install.packages("corrplot")
library(corrplot)
corrplot 0.84 loaded
gov_data %>%
select(gdp_per_cap, gini, efree, `happy planet index`, hdi, seda, `unemployment (%)`) %>%
cor(use = "complete.obs") %>%
corrplot(type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)
It is not too surprising to see high correlation between SEDA, HDI, economic freedom, and GDP.
I am a bit surprised that the happy planet index is not as highly correlated with these measures as well - but perhaps that is because of the environment focus of this metric. There is not much correlation between Happy Planet and the GDP. This is unfortunate - as we can’t really make an argument for an improved environmental impact making an improved GDP - at least not at the time being.
It is also surprising that the Gini coefficient is not more negatively correlated with some of these other combo-metrics. SEDA takes inequality into account as part of its calculation so that explains why it is the most negatively correlated.
Not shown, but things don’t change too much when ‘little’ countries are removed. Try it out yourself with an additional filter() in the code above.
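One detail worth knowing about the cor() call above: use = "complete.obs" drops every row that has a missing value in any of the selected columns, so all the correlations are computed on the same (smaller) set of countries. Here is a sketch on made-up numbers comparing it with "pairwise.complete.obs", which keeps as many rows as each pair of columns allows.

```r
# Toy data with a missing value in two different columns (values are made up).
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2, 4, 6, 8, 10, 12),
  c = c(5, NA, 3, 2, 1, 0)
)

# Rows that survive "complete.obs": no NA in any column.
n_complete <- sum(complete.cases(toy))

cor_complete <- cor(toy, use = "complete.obs")           # same rows for every pair
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")  # rows chosen pair by pair

n_complete
round(cor_complete, 2)
round(cor_pairwise, 2)
```

With many sparsely populated columns (like Gini here), "complete.obs" can shrink the data a lot, so it is worth checking the number of complete rows before reading too much into the matrix.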
While there is some negative correlation between Gini and well-being metrics like SEDA and HDI, it’s still pretty messy.
Here is a plot showing HDI vs Gini, with sizes of dots indicating the GDP - as that was negatively correlated as well.
gov_data %>% #filter(country_size == 'big') %>%
ggplot(aes(x = gini, y=hdi, color = country_size, size = gdp_per_cap)) +
geom_point() +
geom_text(check_overlap = TRUE, aes(label = country), color = "#333333", size = 3) +
labs(title = "HDI vs GINI")
We can see that, roughly, an increase in the Gini coefficient (meaning an increase in inequality) goes along with a decrease in HDI. This is true for GDP per capita as well. But the correlation is weak.
As an aside, we can see that we are also missing a lot of Gini scores for countries - given the warning messages.
Could still contribute to an interesting story!
Linear Regression
As a quick look at how these variables relate, let’s build a linear model that tries to predict GDP per capita from a combination of the Human Development Index, the Gini coefficient, and the Economic Freedom score. We will avoid SEDA for now, as it explicitly uses GDP-like inputs as part of its calculation.
lm_out <- lm(formula = log_gdp ~ hdi + efree + gini, data = gov_data)
summary(lm_out)
Call:
lm(formula = log_gdp ~ hdi + efree + gini, data = gov_data)
Residuals:
Min 1Q Median 3Q Max
-2.06776 -0.23198 0.00734 0.22203 3.10917
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.828502 0.400787 9.552 <2e-16 ***
hdi 5.935777 0.382782 15.507 <2e-16 ***
efree 0.018581 0.005700 3.260 0.0014 **
gini 0.001139 0.005427 0.210 0.8340
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5227 on 138 degrees of freedom
(53 observations deleted due to missingness)
Multiple R-squared: 0.8111, Adjusted R-squared: 0.807
F-statistic: 197.6 on 3 and 138 DF, p-value: < 2.2e-16
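So, roughly speaking, these inputs capture ~80% of the variability in the log-transformed GDP per capita (we use the log of GDP to make that exponential relationship we saw before linear). The Gini coefficient adds essentially nothing once HDI and economic freedom are in the model (p = 0.83).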
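Because the response is on a log scale, the coefficients are easier to read as multipliers. Here is a small sketch that converts the two significant estimates from the summary output above. The step sizes (0.1 for HDI, 1 point for economic freedom) are my own choice for illustration.

```r
# Coefficients copied from the summary(lm_out) output above.
coef_hdi   <- 5.935777
coef_efree <- 0.018581

# The response is log(GDP per capita), so a change of d in a predictor
# multiplies the predicted GDP per capita by exp(coefficient * d).
hdi_multiplier <- exp(coef_hdi * 0.1)            # effect of +0.1 in HDI
efree_pct <- (exp(coef_efree * 1) - 1) * 100     # % effect of +1 point of economic freedom

round(hdi_multiplier, 2)
round(efree_pct, 2)
```

So, with the other inputs held fixed, a 0.1 higher HDI goes with a predicted GDP per capita about 1.8 times as large, and one extra point of economic freedom with a bit under 2% more. These are associations in this snapshot, not causal effects.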
At this point, I started thinking about the progress of an individual country.
The comparisons we have looked at show a snapshot of how countries are performing now, but what are the global and local trends? Where have these countries been to reach this spot? Are we moving forward with these metrics of human progress, or backwards? Is there a connection between the relative performance of a country over time and its GDP?
I wanted to find the answers to some of these questions. And to do that, I need historical data.
Historic Metric Data
Fortunately, the starting dataset provided references to the sources for all the columns of data.
Because the SEDA metric is such an amalgamation of other metrics, and because it is highly correlated with the other measurements, I chose not to pursue historical data for it.
Instead, I focused on getting historical data for HDI, economic freedom, and Gini, along with the more traditional GDP value.
Economic Freedom
The Heritage Foundation has an option to view all the historic data. Unfortunately, I couldn’t get the download button to download more than just the current year. So instead, I copied the table and pasted it into Numbers (which worked but took forever). Then saved it as a CSV.
filename <- "data/heritage_economic_freedom_score.csv"
ecfree_data <- read_csv(filename, na = "N/A")
Parsed with column specification:
cols(
name = col_character(),
`index year` = col_double(),
`overall score` = col_double(),
`property rights` = col_double(),
`government integrity` = col_double(),
`judicial effectiveness` = col_double(),
`tax burden` = col_double(),
`government spending` = col_double(),
`fiscal health` = col_double(),
`business freedom` = col_double(),
`labor freedom` = col_double(),
`monetary freedom` = col_double(),
`trade freedom` = col_double(),
`investment freedom` = col_double(),
`financial freedom` = col_double()
)
Human Development Index (HDI)
The UN provides HDI as a CSV. I downloaded it, ensured the encoding was in UTF-8, and put it in the data directory. We can load it easily, and quickly remove all the spacer columns.
filename <- "data/Human Development Index.csv"
hdi_data <- read_csv(filename)
Missing column names filled in: 'X4' [4], 'X6' [6], 'X8' [8], 'X10' [10], 'X12' [12], 'X14' [14], 'X16' [16], 'X18' [18], 'X20' [20], 'X22' [22], 'X24' [24], 'X26' [26], 'X28' [28], 'X30' [30], 'X32' [32], 'X34' [34], 'X36' [36], 'X38' [38], 'X40' [40], 'X42' [42], 'X44' [44], 'X46' [46], 'X48' [48], 'X50' [50], 'X52' [52], 'X54' [54], 'X56' [56], 'X58' [58]
Parsed with column specification:
cols(
.default = col_double(),
Country = col_character(),
X4 = col_logical(),
X6 = col_logical(),
X8 = col_logical(),
X10 = col_logical(),
X12 = col_logical(),
X14 = col_logical(),
X16 = col_logical(),
X18 = col_logical(),
X20 = col_logical(),
X22 = col_logical(),
X24 = col_logical(),
X26 = col_logical(),
X28 = col_logical(),
X30 = col_logical(),
X32 = col_logical(),
X34 = col_logical(),
X36 = col_logical(),
X38 = col_logical(),
X40 = col_logical()
# ... with 9 more columns
)
See spec(...) for full column specifications.
1 parsing failure.
row col expected   actual    file
190 --  58 columns 1 columns 'data/Human Development Index.csv'
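# drop the auto-named spacer columns (X4, X6, ...). Match the exact pattern,
# since contains('X') ignores case and would also drop any column with an "x" in its name.
hdi_data <- hdi_data %>% select(-matches('^X[0-9]+$'))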
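Why is dropping by a bare 'X' dangerous? The contains() helper matches case-insensitively by default, so any column name with an "x" anywhere in it would go too. Here is a sketch in base R that mimics both kinds of match with grepl(). The column names are made up, and "Max HDI" is a hypothetical column that is not in the UN file.

```r
# Hypothetical column names: two spacer columns plus one innocent bystander.
cols <- c("Country", "1990", "X4", "2017", "X6", "Max HDI")

# A loose, case-insensitive "contains x" match (what contains('X') amounts to).
loose <- grepl("x", cols, ignore.case = TRUE)

# A strict match: the whole name is "X" followed only by digits.
strict <- grepl("^X[0-9]+$", cols)

cols[loose]    # also catches "Max HDI"
cols[strict]   # only the spacer columns
```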
The original version is in a “wide” format, but we want it “long” so we can easily join by country and year with other datasets. gather() does this nicely for us. Note the cool use of a negative one_of() to exclude a specified set of columns, and thus include all the others.
hdi_data_long <- hdi_data %>% gather(key = "year", value = "hdi_value", -one_of("Country", "HDI Rank (2017)"))
Then, make some quick derived columns for joining.
hdi_data_long$country <- hdi_data_long$Country
hdi_data_long$year_n <- as.numeric(hdi_data_long$year)
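Here is the wide-to-long step on a tiny made-up table, so the shape change is easy to see. It assumes the tidyr package is installed. The countries A and B and their values are invented, and only the column names mirror the real file.

```r
library(tidyr)

# Toy "wide" table: one row per country, one column per year (values made up).
hdi_wide <- data.frame(
  Country = c("A", "B"),
  `HDI Rank (2017)` = c(1, 2),
  `1990` = c(0.70, 0.50),
  `2017` = c(0.80, 0.60),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Gather every column except the two identifier columns into year/value pairs.
hdi_long <- gather(hdi_wide, key = "year", value = "hdi_value",
                   -one_of("Country", "HDI Rank (2017)"))

# The year arrives as text (it was a column name), so convert it for joining.
hdi_long$year_n <- as.numeric(hdi_long$year)

hdi_long
```

Two countries times two year columns gives four rows, one per country-year.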
World Bank Data
The rest of the metrics I found I could get from the World Bank.
Lucky for me, with a bit of Googling, I found the sweet, sweet wbstats R package, which allows you to pull down multiple World Bank metrics via R! Super cool, and super useful. This will be a package to check out more for future projects as well.
# install with:
# install.packages('wbstats')
library(wbstats)
Using the search feature of wbstats and the online World Bank indicators page, I found the IDs of the metrics I wanted. Here’s how to pull down the data, with some notes:
# Metric IDs
# NY.GDP.PCAP.PP.CD = GDP per capita, PPP (current international $)
# NY.GDP.MKTP.KD.ZG = GDP growth (annual %)
# NY.GDP.PCAP.KD.ZG = GDP per capita growth (annual %)
# NY.GNP.PCAP.PP.CD = GNI per capita, PPP (current international $)
# NY.GNP.PCAP.KD.ZG = GNI per capita growth (annual %)
# SI.POV.GINI = GINI index (World Bank estimate)
# SP.POP.TOTL = population total
indicators <- c('NY.GDP.PCAP.PP.CD', 'NY.GNP.PCAP.PP.CD', 'NY.GNP.PCAP.KD.ZG', 'SI.POV.GINI', 'SP.POP.TOTL')
wb_data <- wb(country= "all", indicator = indicators, startdate = 2000, enddate = 2018, return_wide = TRUE)
You’ll note that I limited the year range to be between 2000 and 2018. This seemed like a decent enough year span to see some changes, but recent enough to be meaningful. Perhaps I should have investigated good year ranges more.
# need a numeric year for joining.
wb_data$year = as.numeric(wb_data$date)
Unfortunately, the HDI and economic freedom historical data only have country names, and not ISO codes. A huge oversight if you ask me. So we have to do some tweaking to get matches across these three datasets. I’ve tweaked the names in the starting CSV files for some countries, and below we recode the World Bank names for some others.
wb_data$country <- recode(wb_data$country,
"Yemen, Rep." = "Yemen",
"Lao PDR" = "Laos",
"Korea, Dem. People’s Rep." = "North Korea",
"Korea, Rep." = "South Korea",
"Egypt, Arab Rep." = "Egypt",
"Iran, Islamic Rep." = "Iran",
"Slovak Republic" = "Slovakia",
"Venezuela, RB" = "Venezuela",
"Hong Kong SAR, China" = "Hong Kong",
"Congo, Dem. Rep." = "Democratic Republic of Congo",
"Congo, Rep." = "Republic of Congo"
)
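A quick way to find which names still need recoding is to compare the two sets of country names before joining. The sketch below does this in base R on a handful of names. The short vectors and the partial name_map are stand-ins for the real columns, not the full lists.

```r
# Stand-ins for the country columns of two datasets.
heritage_names <- c("Yemen", "Laos", "South Korea", "Brazil")
wb_names <- c("Yemen, Rep.", "Lao PDR", "Korea, Rep.", "Brazil")

# Names in one dataset with no exact match in the other.
unmatched_before <- setdiff(heritage_names, wb_names)

# A deliberately incomplete mapping, to show what is left over afterwards.
name_map <- c("Yemen, Rep." = "Yemen", "Lao PDR" = "Laos")
recoded <- ifelse(wb_names %in% names(name_map), name_map[wb_names], wb_names)

unmatched_after <- setdiff(heritage_names, recoded)

unmatched_before
unmatched_after   # whatever is listed here still needs a recode entry
```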
Region Data
Finally, I wanted a way to group these countries into higher-level themes. I happened upon this GitHub repo, which includes a mapping from ISO code to region name. Seemed like a good addition to me!
filename <- "data/country_regions.csv"
regions_data <- read_csv(filename)
Parsed with column specification:
cols(
name = col_character(),
`alpha-2` = col_character(),
`alpha-3` = col_character(),
`country-code` = col_character(),
`iso_3166-2` = col_character(),
region = col_character(),
`sub-region` = col_character(),
`intermediate-region` = col_character(),
`region-code` = col_character(),
`sub-region-code` = col_character(),
`intermediate-region-code` = col_character()
)
Joining Data
Now that we have all the ingredients, let’s join them all together in a super dataframe!
There are probably better ways to do this in R, but I went with piecewise joining with left_join().
ecfree_hdi_data <- left_join(ecfree_data, hdi_data_long, by = c("name" = "country", "index year" = "year_n"))
ecfree_hdi_wb_data <- left_join(ecfree_hdi_data, wb_data, by = c("name" = "country", "index year" = "year"))
ecfree_hdi_wb_data <- left_join(ecfree_hdi_wb_data, regions_data, by = c("iso3c" = "alpha-3"))
So now this oddly named dataframe ecfree_hdi_wb_data includes columns from the above sources.
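Two things about left_join() are worth seeing on a toy example: rows from the left table are always kept (unmatched ones get NA), and when both tables have a non-key column with the same name, the result gets .x and .y suffixes. That is where a column like name.x comes from. The sketch assumes dplyr is installed, and all the values are made up.

```r
library(dplyr)

# Left table: scores keyed by country name, with an ISO code column.
scores <- data.frame(
  name = c("A", "B", "C"),
  score = c(60, 70, 80),
  iso3c = c("AAA", "BBB", NA),
  stringsAsFactors = FALSE
)

# Right table: regions keyed by ISO code, with its own 'name' column.
regions <- data.frame(
  name = c("Country A", "Country B"),
  `alpha-3` = c("AAA", "BBB"),
  region = c("North", "South"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Join on differently named key columns: iso3c on the left, alpha-3 on the right.
joined <- left_join(scores, regions, by = c("iso3c" = "alpha-3"))

names(joined)   # 'name' appears twice, as name.x and name.y
joined
```

Country C has no ISO code, so it survives the join but ends up with NA for its region.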
Normalization
Hastily, I added even more columns to the dataset for easier access.
# renames to make accessing columns easier.
# should probably rename column directly, instead of just tacking on more columns
ecfree_hdi_wb_data$country <- ecfree_hdi_wb_data$name.x
ecfree_hdi_wb_data$year <- ecfree_hdi_wb_data$`index year`
ecfree_hdi_wb_data$hdi <- ecfree_hdi_wb_data$hdi_value
ecfree_hdi_wb_data$efree <- ecfree_hdi_wb_data$`overall score`
ecfree_hdi_wb_data$gini <- ecfree_hdi_wb_data$SI.POV.GINI
ecfree_hdi_wb_data$population <- ecfree_hdi_wb_data$SP.POP.TOTL
ecfree_hdi_wb_data$gni_per_cap <- ecfree_hdi_wb_data$NY.GNP.PCAP.PP.CD
ecfree_hdi_wb_data$gni_per_cap_percent <- ecfree_hdi_wb_data$NY.GNP.PCAP.KD.ZG
ecfree_hdi_wb_data$gdp_per_cap <- ecfree_hdi_wb_data$NY.GDP.PCAP.PP.CD
ecfree_hdi_wb_data$country_size <- ifelse(ecfree_hdi_wb_data$population > small_country_max_pop, 'big', 'little')
Data Exploration
Now we can do some exploring!
First, let’s look at just a single year, and compare HDI to GDP like we did above - to confirm the historical data isn’t radically different.
ecfree_hdi_wb_data %>% filter(year == 2017) %>%
ggplot(aes(x = gdp_per_cap, y = hdi)) +
labs(title = "GDP vs HDI") +
geom_text(check_overlap = TRUE, aes(label = country)) +
geom_point()
We can use facet_wrap() to show the same data for a subset of the years.
ecfree_hdi_wb_data %>% filter(year > 2006) %>%
ggplot(aes(x = gdp_per_cap, y = hdi)) +
labs(title = "GDP vs HDI - 2007 - 2017") +
geom_point(alpha=0.2, size=1) +
geom_text(check_overlap = TRUE, aes(label = country), size = 2) +
facet_wrap(~ year)
Things are looking pretty good, though we note we are missing 2018 data. This is because World Bank doesn’t have this data available yet.
Let’s look at the correlations for the different metrics like we did before, but now with all these additional time points, to ensure the connections hold.
ecfree_hdi_wb_data %>%
select(gdp_per_cap, gini, efree, hdi) %>%
cor(use = "complete.obs") %>%
corrplot(type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)
Appears to be pretty similar!
Grouping by Country
I wanted to see trends in these metrics over time by country.
Here is a grouping process that uses group_by() and summarise() to summarise each metric by country. There is a bit of work to do around finding the first and last values, as not all years have data for all countries.
GDP
gdp_by_country <- ecfree_hdi_wb_data %>%
# sort by year so the first/last non-missing rows really are the earliest/latest years
arrange(country, year) %>%
group_by(country) %>% summarise(
first_gdp_year = year[min(which(!is.na(gdp_per_cap)))],
last_gdp_year = year[max(which(!is.na(gdp_per_cap)))],
years_of_gdp = last_gdp_year - first_gdp_year,
first_gdp_value = gdp_per_cap[min(which(!is.na(gdp_per_cap)))],
last_gdp_value = gdp_per_cap[max(which(!is.na(gdp_per_cap)))],
gdp_change = last_gdp_value - first_gdp_value,
gdp_change_by_year = gdp_change / years_of_gdp,
avg_gdp = mean(gdp_per_cap, na.rm = TRUE)
)
gdp_by_country %>%
ggplot(aes(x = gdp_change)) + geom_histogram(bins=30)
So almost all countries increased in GDP, which makes sense.
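The first-and-last bookkeeping is easier to follow on a single toy series. Here is a sketch in base R for one made-up country whose rows arrive newest-first and whose earliest and latest years are missing. The key assumption is that the rows must be sorted by year before picking the first and last non-missing positions.

```r
# One country's series, deliberately stored newest-first (values are made up).
years  <- c(2017, 2016, 2015, 2014)
values <- c(NA, 30, 20, NA)

# Sort by year so that position order matches time order.
ord <- order(years)
years  <- years[ord]
values <- values[ord]

# Positions that actually have data.
has_value <- which(!is.na(values))

first_year <- years[min(has_value)]
last_year  <- years[max(has_value)]
change_per_year <- (values[max(has_value)] - values[min(has_value)]) /
  (last_year - first_year)

c(first_year = first_year, last_year = last_year, change_per_year = change_per_year)
```

Without the sort, min() and max() of the positions would pick the wrong ends of the series whenever the rows are not already in ascending year order.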
Gini
gini_by_country <- ecfree_hdi_wb_data %>%
# sort by year so the first/last non-missing rows really are the earliest/latest years
arrange(country, year) %>%
group_by(country) %>% summarise(
first_gini_year = year[min(which(!is.na(gini)))],
last_gini_year = year[max(which(!is.na(gini)))],
years_of_gini = last_gini_year - first_gini_year,
first_gini_value = gini[min(which(!is.na(gini)))],
last_gini_value = gini[max(which(!is.na(gini)))],
gini_change = last_gini_value - first_gini_value,
gini_change_by_year = gini_change / years_of_gini,
avg_gini = mean(gini, na.rm = TRUE)
)
gini_by_country <- gini_by_country %>% filter(years_of_gini > 1)
gini_by_country$gini_category = ifelse(gini_by_country$gini_change_by_year < 0, "more equal", "less equal")
A little more effort gets our first and last values in long format:
gini_start_end <- gini_by_country %>%
select(c(country, gini_category, first_gini_value, last_gini_value)) %>%
gather(key = "metric", value = "value", first_gini_value, last_gini_value)
Which we can visualize:
gini_start_end %>%
ggplot(aes(x = metric, y = value, group = country, colour = gini_category)) +
geom_line() +
geom_text(check_overlap = TRUE, aes(label = country), size = 4, colour = 'black') +
facet_wrap(~ gini_category)
Here I wanted to see if there were trends in gini movement, so I made a quick slopechart between the country’s first and last Gini index value. Changes don’t seem to follow any immediate pattern.
Economic Freedom
efree_by_country <- ecfree_hdi_wb_data %>%
# sort by year so the first/last non-missing rows really are the earliest/latest years
arrange(country, year) %>%
group_by(country) %>% summarise(
first_efree_year = year[min(which(!is.na(efree)))],
last_efree_year = year[max(which(!is.na(efree)))],
years_of_efree = last_efree_year - first_efree_year,
first_efree_value = efree[min(which(!is.na(efree)))],
last_efree_value = efree[max(which(!is.na(efree)))],
efree_change = last_efree_value - first_efree_value,
efree_change_by_year = efree_change / years_of_efree,
avg_efree = mean(efree, na.rm = TRUE)
)
efree_by_country <- efree_by_country %>% filter(years_of_efree > 1)
efree_by_country %>%
ggplot(aes(x = last_efree_value - first_efree_value)) + geom_histogram()
Most countries remained relatively stable, with some big ups and downs.
Here are the countries that changed the most:
efree_by_country %>%
filter(abs(efree_change) > 15) %>%
arrange(-1 * efree_change) %>%
select(c(country, first_efree_value, last_efree_value, efree_change))
Here’s just another way to plot the same data:
efree_by_country %>% #filter(abs(last_efree_value - first_efree_value) > 10) %>%
ggplot(aes(x = first_efree_value, y = last_efree_value)) +
geom_point() +
geom_text(check_overlap = TRUE, aes(label = country), size = 4) +
labs(title = "Last Ec Freedom Value vs First Ec Freedom Value by Country")
Connected Scatterplots
At this point, I knew I wanted to show the connection between these different metrics over time. And I knew I wanted to try something a bit novel in terms of visualizing multiple metrics together. So I decided to check out what this data looked like using connected scatter plots!
ecfree_hdi_wb_data %>% filter(country %in% c('Brazil', 'China', 'United Kingdom', 'United States')) %>% arrange(country, year) %>%
ggplot(aes( y = hdi, x = efree, colour = country)) +
geom_text(check_overlap = TRUE, aes(label = year)) +
geom_point() +
geom_path() +
labs(title="Connected Scatterplot: HDI vs Economic Freedom")
Pretty nice!
Here are all of the countries together:
ecfree_hdi_wb_data %>% arrange(country, year) %>%
ggplot(aes( y = hdi, x = gdp_per_cap, group = country)) +
geom_text(check_overlap = TRUE, aes(label = year), alpha = 0.5) +
geom_point(alpha = 0.1) +
geom_path(alpha = 0.1) +
labs(title="Connected Scatterplot: All HDI vs GDP")
One other thing I wanted to see was how a country moved through this metric space relative to itself. How is a country performing compared to other years?
That is where normalization / scaling comes in handy!
range01 <- function(x){(x-min(x, na.rm = TRUE))/(max(x, na.rm = TRUE)-min(x, na.rm = TRUE))}
ecfree_hdi_wb_data <- ecfree_hdi_wb_data %>%
group_by(country) %>%
mutate(hdi_01 = range01(hdi), efree_01 = range01(efree), gdp_01 = range01(gdp_per_cap)) %>%
ungroup()
ecfree_hdi_wb_data %>% filter(country %in% c('Brazil', 'China', 'United Kingdom', 'United States')) %>% arrange(country, year) %>%
ggplot(aes( y = hdi_01, x = gdp_01, colour = country)) +
geom_text(check_overlap = TRUE, aes(label = year)) +
geom_point() +
geom_path() +
labs(title="Connected Scatterplot: Relative HDI vs GDP")
Pretty fun to see an individual country’s arc, as well as a bit of comparison between countries.
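To show what the per-country scaling does, here is a sketch in base R on two made-up countries. It reuses the range01() helper from above and applies it within each country using ave() in place of the group_by() and mutate() pair.

```r
# Same helper as above: rescale a vector to the 0-1 range, ignoring NAs.
range01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Two made-up countries at very different HDI levels.
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year = c(2015, 2016, 2017, 2015, 2016, 2017),
  hdi = c(0.50, 0.55, 0.60, 0.90, 0.91, 0.92)
)

# Scale within each country, so each one runs from its own minimum (0) to its own maximum (1).
toy$hdi_01 <- ave(toy$hdi, toy$country, FUN = range01)

# For contrast: scaling across all rows keeps the gap between countries.
toy$hdi_01_global <- range01(toy$hdi)

toy
```

After the per-country scaling both countries trace the same 0 to 1 arc, even though B’s absolute gains are much smaller. That is what makes the relative plot good for comparing shapes, and poor for comparing levels.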
Saving to CSV
While this represents only a subset of the data exploration I did to get to the final form, I think it gives a good taste for the basic munging and searching required.
To end, I saved a subset of our combined data to CSV to start digging in more with D3 in the main visualization piece.
gov_data_save <- ecfree_hdi_wb_data %>%
select(country, iso3c, region, `sub-region`, year, population, gni_per_cap, gni_per_cap_percent, gdp_per_cap, efree, hdi, gini)
write_csv(gov_data_save, "output/gov_data_year.csv")
FIN
