We will make a graph to map the different colonial histories of countries in South-East Asia!
Click here to add circular flags.
Packages we will need:
library(ggimage)library(rnaturalearth)library(countrycode)library(ggthemes)library(reshape2)
I use the COLDAT Colonial Dates Dataset by Bastien Becker (2020). We will only need the first nine columns in the dataset:
col_df <- data.frame(col_df[1:9])
Next we will need to turn the dataset from wide to long with the reshape2
package:
long_col <- melt(col_df, id.vars=c("country"), measure.vars = c("col.belgium","col.britain", "col.france", "col.germany", "col.italy", "col.netherlands", "col.portugal", "col.spain"), variable.name = "colony", value.name = "value")
We drop all the 0 values from the dataset:
long_col <- long_col[which(long_col$value == 1),]
Next we use ne_countries()
function from the rnaturalearth
package to create the map!
map <- ne_countries(scale = "medium", returnclass = "sf")
Click here to read more about the rnaturalearth
package.
Next we merge the two datasets together:
col_map <- merge(map, long_col, by.x = "iso_a3", by.y = "iso3", all.x = TRUE)
We can change the class and factors of the colony variable:
library(plyr)col_map$colony_factor <- as.factor(col_map$colony)col_map$colony_factor <- revalue(col_map$colony_factor, c("col.belgium"="Belgium", "col.britain" = "Britain", "col.france" = "France","col.germany" = "Germany", "col.italy" = "Italy", "col.netherlands" = "Netherlands", "col.portugal" = "Portugal", "col.spain" = "Spain", "No colony" = "No colony"))
Nearly there.
Next we will need to add the longitude and latitude of the countries. The data comes from the web and I can scrape the table with the rvest
package
library(rvest)coord <- read_html("https://developers.google.com/public-data/docs/canonical/countries_csv")coord_tables <- coord %>% html_table(header = TRUE, fill = TRUE)coord <- coord_tables[[1]]col_map <- merge(col_map, coord, by.x= "iso_a2", by.y = "country", all.y = TRUE)
Click here to read more about the rvest
package.
And we can make a vector with some hex colors for each of the European colonial countries.
my_palette <- c("#0d3b66","#e75a7c","#f4d35e","#ee964b","#f95738","#1b998b","#5d22aa","#85f5ff", "#19381F")
Next, to graph a map to look at colonialism in Asia, we can extract countries according to the subregion variable from the rnaturalearth
package and graph.
asia_map <- col_map[which(col_map$subregion == "South-Eastern Asia" | col_map$subregion == "Southern Asia"),]
Click here to read more about the geom_flag
function.
colony_asia_graph <- asia_map %>% ggplot() + geom_sf(aes(fill = colony_factor), position = "identity") + ggimage::geom_flag(aes(longitude-2, latitude-1, image = col_iso), size = 0.04) + geom_label(aes(longitude+1, latitude+1, label = factor(sovereignt))) + scale_fill_manual(values = my_palette)
And finally call the graph with the theme_map()
from ggthemes
package
colony_asia_graph + theme_map()
References
Becker, B. (2020). Introducing COLDAT: The Colonial Dates Dataset.
In this post, we can compare countries on the left – right political spectrum and graph the trends.
In the European Social Survey, they ask respondents to indicate where they place themselves on the political spectrum with this question: “In politics people sometimes talk of ‘left’ and ‘right’. Where would you place yourself on this scale, where 0 means the left and 10 means the right?”
Click here to read how to download data from the European Social survey.
round <- import_all_rounds()
Extract all the lists. I just want three of the variables for my graph.
r1 <- round[[1]]r1 <- data.frame(country = r1$cntry, round= r1$essround, lrscale = r1$lrscale)
Do this for all the data.frames
and rbind()
them all together.
round_df <- rbind(r1, r2, r3, r4, r5, r6, r7, r8, r9)
Convert all the variables to suitable types:
round_df$country <- as.factor(round_df$country)round_df$round <- as.numeric(round_df$round)round_df$lrscale <- as.numeric(round_df$lrscale)
Next we find the mean score for all respondents in each of the countries for each year.
round_df %>% dplyr::filter(!is.na(lrscale)) %>% dplyr::group_by(country, round) %>% dplyr::mutate(mean_lr = mean(lrscale)) -> round_df
We keep only one of the values for each country at each survey year.
round_df <- round_df[!duplicated(round_df$mean_lr),]
Create a vector of hex colors that correspond to the countries I want to look at: Ireland, France, the UK and Germany.
my_palette <- c( "DE" = "#FFCE00", "FR" = "#001489", "GB" = "#CF142B", "IE" = "#169B62")
And graph the plot:
library(ggthemes, ggimage)lrscale_graph <- round_df %>% dplyr::filter(country == "IE" | country == "GB" | country == "FR" | country == "DE") %>% ggplot(aes(x= round, y = mean_lr, group = country)) + geom_line(aes(color = factor(country)), size = 1.5, alpha = 0.5) + ggimage::geom_flag(aes(image = country), size = 0.04) + scale_color_manual(values = my_palette) + scale_x_discrete(name = "Year", limits=c("2002","2004","2006","2008","2010","2012","2014","2016","2018")) + labs(title = "Where would you place yourself on this scale,\n where 0 means the left and 10 means the right?", subtitle = "Source: European Social Survey, 2002 - 2018", fill="Country", x = "Year", y = "Left - Right Spectrum")lrscale_graph + guides(color=guide_legend(title="Country")) + theme_economist()
The European Social Survey (ESS) measure attitudes in thirty-ish countries (depending on the year) across the European continent. It has been conducted every two years since 2001.
The survey consists of acore module and two or more ‘rotating’ modules, on social and public trust; political interest and participation; socio-political orientations; media use; moral, political and social values; social exclusion, national, ethnic and religious allegiances; well-being, health and security; demographics and socio-economics.
So lots of fun data for political scientists to look at.
install.packages("essurvey")library(essurvey)
The very first thing you need to do before you can download any of the data is set your email address.
set_email("rforpoliticalscience@gmail.com")
Don’t forget the email address goes in as a string in “quotations marks”.
Show what countries are in the survey with the show_countries()
function.
show_countries()
[1] "Albania" "Austria" "Belgium" [4] "Bulgaria" "Croatia" "Cyprus" [7] "Czechia" "Denmark" "Estonia" [10] "Finland" "France" "Germany" [13] "Greece" "Hungary" "Iceland" [16] "Ireland" "Israel" "Italy" [19] "Kosovo" "Latvia" "Lithuania" [22] "Luxembourg" "Montenegro" "Netherlands" [25] "Norway" "Poland" "Portugal" [28] "Romania" "Russian Federation" "Serbia" [31] "Slovakia" "Slovenia" "Spain" [34] "Sweden" "Switzerland" "Turkey" [37] "Ukraine" "United Kingdom"
It’s important to know that country names are case sensitive and you can only use the name printed out byshow_countries()
. For example, you need to write “Russian Federation” to access Russian survey data; if you write “Russia”…
Using these country names, we can download specific rounds or waves (i.e survey years) withimport_country
. We have the option to choose the two most recent rounds, 8th (from 2016) and 9th round (from 2018).
ire_data <- import_all_cntrounds("Ireland")
The resulting data comes in the form of nine lists, one for each round
These rounds correspond to the following years:
- ESS Round 9 – 2018
- ESS Round 8 – 2016
- ESS Round 7 – 2014
- ESS Round 6 –2012
- ESS Round5 – 2010
- ESS Round4 – 2008
- ESS Round3 –2006
- ESS Round2 – 2004
- ESS Round 1 – 2002
I want to compare the first round and most recent round to see if Irish people’s views have changed since 2002. In 2002, Ireland was in the middle of an economic boom that we called the “Celtic Tiger”. People did mad things like buy panini presses and second house in Bulgaria to resell. Then the 2008 financial crash hit the country very hard.
Irish people during the Celtic Tiger:
Irish people after the Celtic Tiger crash:
Ireland in 2018 was a very different place. So it will be interesting to see if these social changes translated into attitude changes.
First, we use the import_country()
function to download data from ESS. Specify the country and rounds you want to download.
ire <-import_country(country = "Ireland", rounds = c(1, 9))
The resulting ire
object is a list, so we’ll need to extract the two data.frames
from the list:
ire_1 <- ire[[1]]ire_9 <- ire[[2]]
The exact same questions are not asked every year in ESS; there are rotating modules, sometimes questions are added or dropped. So to merge round 1 and round 9, first we find the common columns with the intersect()
function.
common_cols <- intersect(colnames(ire_1), colnames(ire_9))
And then bind subsets of the two data.frames
together that have the same columns with rbind()
function.
ire_df <- rbind(subset(ire_1, select = common_cols), subset(ire_9, select = common_cols))
Now with my merged data.frame
, I only want to look at a few of the variables and clean up the dataset for the analysis.
Click here to look at all the variables in the different rounds of the survey.
att9 <- data.frame(country = data9$cntry, round = data9$essround, imm_same_eth = data9$imsmetn, imm_diff_eth = data9$imdfetn, imm_poor = data9$impcntr, imm_econ = data9$imbgeco, imm_culture = data9$imueclt, imm_qual_life = data9$imwbcnt, left_right = data9$lrscale)class(att9$imm_same_eth)
All the variables in the dataset are a special class called “haven_labelled
“. So we must convert them to numeric variables with a quick function. We exclude the first variable because we want to keep country name as a string character variable.
att_df[2:15] <- lapply(att_df[2:15], function(x) as.numeric(as.character(x)))
We can look at the distribution of our variables and count how many missing values there are with the skim()
function from the skimr
package
library(skimr)skim(att_df)
We can run a quick t-test to compare the mean attitudes to immigrants on the statement: “Immigrants make country worse or better place to live” across the two survey rounds.
Lower scores indicate an attitude that immigrants undermine Ireland’ quality of life and higher scores indicate agreement that they enrich it!
t.test(att_df$imm_qual_life ~ att_df$round)
In future blog, I will look at converting the raw output of R into publishable tables.
The results of the independent-sample t-test show that if we compare Ireland in 2002 and Ireland in 2018, there has been a statistically significant increase in positive attitudes towards immigrants and belief that Ireland’s quality of life is more enriched by their presence in the country.
As I am currently an immigrant in a foreign country myself, I am glad to come from a country that sees the benefits of immigrants!
If we load the ggpubr
package, we can graphically look at the difference in mean attitude scores.
library(ggpubr)box1 <- ggpubr::ggboxplot(att_df, x = "round", y = "imm_qual_life", color = "round", palette = c("#d11141", "#00aedb"), ylab = "Attitude", xlab = "Round")box1 + stat_compare_means(method = "t.test")
It’s not the most glamorous graph but it conveys the shift in Ireland to more positive attitudes to immigration!
I suspect that a country’s economic growth correlates with attitudes to immigration.
So let’s take the mean annual score values
ire_agg <- ireland[!duplicated(ireland$mean_imm_qual_life),]ire_agg <- ire_agg %>% select(year, everything())
Next we can take data from Quandl website on annual Irish GDP growth (click here to learn how to access economic data via a Quandl API on R.)
gdp <- Quandl('ODA/IRL_LE', start_date='2002-01-01', end_date='2020-01-01',type="raw")
Create a year variable from the date variable
gdp$year <- substr(gdp$Date, start = 1, stop = 4)
Add year variable to the ire_agg
data.frame that correspond to the ESS survey rounds.
year =c("2002","2004","2006","2008","2010","2012","2014","2016","2018")year <- data.frame(year)ire_agg <- cbind(ire_agg, year)
Merge the GDP and ESS datasets
ire_agg <- merge(ire_agg, gdp, by.x = "year", by.y = "year", all.x = TRUE)
Scale the GDP and immigrant attitudes variables so we can put them on the same plot.
ire_agg$scaled_gdp <- scale(ire_agg$Value)ire_agg$scaled_imm_attitude <- scale(ire_agg$mean_imm_qual_life)
In order to graph both variables on the same graph, we turn the two scaled variables into two factors of a single variable.
ire_agg <- ire_agg %>% select(year, scaled_imm_attitude, scaled_gdp) %>% gather(key = "variable", value = "value", -year)
Next, we can change the names of the factors
ire_agg$variable <- revalue(ire_agg$variable, c("scaled_gdp"="GDP (scaled)", "scaled_imm_attitude" = "Attitudes (scaled)"))
And finally, we can graph the plot.
The geom_rect()
function graphs the coloured rectangles on the plot. I take colours from this color-hex website; the green rectangle for times of economic growth and red for times of recession. Makes sure the geom-rect()
comes before the geom_line().
library(ggpthemes)ggplot(ire_agg, aes(x = year, y = value, group = variable)) + geom_rect(aes(xmin= "2008",xmax= "2012",ymin=-Inf, ymax=Inf),fill="#d11141",colour=NA, alpha=0.01) + geom_rect(aes(xmin= "2002" ,xmax= "2008",ymin=-Inf, ymax=Inf),fill="#00b159",colour=NA, alpha=0.01) + geom_rect(aes(xmin= "2012" ,xmax= "2020",ymin=-Inf, ymax=Inf),fill="#00b159",colour=NA, alpha=0.01) + geom_line(aes(color = as.factor(variable), linetype = as.factor(variable)), size = 1.3) + scale_color_manual(values = c("#00aedb", "#f37735")) + geom_point() + geom_text(data=. %>% arrange(desc(year)) %>% group_by(variable) %>% slice(1), aes(label=variable), position= position_jitter(height = 0.3), vjust =0.3, hjust = 0.1, size = 4, angle= 0) + ggtitle("Relationship between Immigration Attitudes and GDP Growth") + labs(value = " ") + xlab("Year") + ylab("scaled") + theme_hc()
And we can see that there is a relationship between attitudes to immigrants in Ireland and Irish GDP growth. When GDP is growing, Irish people see that immigrants improve quality of life in Ireland and vice versa. The red section of the graph corresponds to the financial crisis.
This quick function can add rectangular flags to graphs.
Click here to add circular flags with the ggflags
package.
The data comes from a Wikipedia table on a recent report by OECD’s Overseas Development Aid (ODA) from donor countries in 2019.
Click here to read about scraping tables from Wikipedia with the rvest
package in R.
library(countrycode)library(ggimage)
In order to use the geom_flag()
function, we need a country’s two-digit ISO code (For example, Ireland is IE!)
To add the ISO code, we can use the countrycode()
function. Click here to read about a quick blog about the countrycode()
function.
In one function we can quickly add a new variable that converts the country name in our dataset into to ISO codes.
oda$iso2 <- countrycode(oda$donor, "country.name", "iso2c")
Also we can use the countrycode()
function to add a continent variable. We will use that to fill the colors of our bars in the graph.
oda$continent <- countrycode(oda$iso2, "iso2c", "continent")
We can now add the the geom_flag()
function to the graph. The y = -50
prevents the flags overlapping with the bars and places them beside their name label. The image
argument takes the iso2
variable.
Quick tip: with the reorder argument, if we wanted descending order (rather than ascending order of ODA amounts, we would put a minus sign in front of the oda_per_capita
in the reorder
() function for the x axis value.
oda_bar <- oda %>% ggplot(aes(x = reorder(donor, oda_per_capita), y = oda_per_capita, fill = continent)) + geom_flag(y = -50, aes(image = iso2)) + geom_bar(stat = "identity") + labs(title = "ODA donor spending ", subtitle = "Source: OECD's Development Assistance Committee, 2019 ", x = "Donor Country", y = "ODA per capita")
The fill
argument categorises the continents of the ODA donors. Sometimes I take my hex colors from https://www.color-hex.com/ website.
my_palette <- c("Americas" = "#0084ff", "Asia" = "#44bec7", "Europe" = "#ffc300", "Oceania" = "#fa3c4c")
Last we print out the bar graph. The expand_limits()
function moves the graph to fit the flags to the left of the y-axis.
oda_bar + coord_flip() + expand_limits(y = -50) + scale_fill_manual(values = my_palette)
We can all agree that Wikipedia is often our go-to site when we want to get information quick. When we’re doing IR or Poli Sci reesarch, Wikipedia will most likely have the most up-to-date data compared to other databases on the web that can quickly become out of date.
So in R, we can scrape a table from Wikipedia and turn into a database with the rvest
package .
First, we copy and paste the Wikipedia page we want to scrape into the read_html(
) function as a string:
nato_members <- read_html("https://en.wikipedia.org/wiki/Member_states_of_NATO")
Next we save all the tables on the Wikipedia page as a list. Turn the header = TRUE
.
nato_tables <- nato_members %>% html_table(header = TRUE, fill = TRUE)
The table that I want is the third table on the page, so use [[two brackets]] to access the third list.
nato_exp <- nato_tables[[3]]
The dataset is not perfect, but it is handy to have access to data this up-to-date. It comes from the most recent NATO report, published in 2019.
Some problems we will have to fix.
- The first row is a messy replication of the header / more information across two cells in Wikipedia.
- The headers are long and convoluted.
- There are a few values in as N/A in the dataset, which R thinks is a string.
- All the numbers have commas, so R thinks all the numeric values are all strings.
There are a few NA values that I would not want to impute because they are probably zero. Iceland has no armed forces and manages only a small coast guard. North Macedonia joined NATO in March 2020, so it doesn’t have all the data completely.
So first, let’s do some quick data cleaning:
Clean the variable names to remove symbols and adds underscores with a function from the janitor
package
library(janitor)nato_exp <- nato_exp %>% clean_names()
Delete the first row. which contains some extra header text:
nato_exp <- nato_exp[-c(1),]
Rename the headers to better reflect the original Wikipedia table headings In this rename()
function,
- the first string in the variable name we want and
- the second string is the original heading as it was cleaned from the above
clean_names()
function:
nato_exp <- nato_exp %>% rename("def_exp_millions" = "defence_expenditure_us_f", "def_exp_gdp" = "defence_expenditure_us_f_2", "def_exp_per_capita" = "defence_expenditure_us_f_3", "population" = "population_a", "gdp" = "gdp_nominal_e", "personnel" = "personnel_f")
Next turn all the N/A
value strings to NULL
. The na_strings object we create can be used with other instances of pesky missing data varieties, other than just N/A string.
na_strings <- c("N A", "N / A", "N/A", "N/ A", "Not Available", "Not available")nato_exp <- nato_exp %>% replace_with_na_all(condition = ~.x %in% na_strings)
Remove all the commas from the number columns and convert the character strings to numeric values with a quick function we apply to all numeric columns in the data.frame.
remove_comma <- function(x) {as.numeric(gsub(",", "", x, fixed = TRUE))}nato_exp[2:7] <- sapply(nato_exp[2:7], remove_comma)
Next, we can calculate the average NATO score of all the countries (excluding the member_state
variable, which is a character string).
We’ll exclude the NATO total column (as it is not a member_state
but an aggregate of them all) and the data about Iceland and North Macedonia, which have missing values.
nato_average <- nato_exp %>%filter(member_state != 'NATO' & member_state != 'Iceland' & member_state != 'North Macedonia') %>%summarise_if(is.numeric, mean, na.rm = TRUE)
Re-arrange the columns so the two data.frames
match:
nato_average$member_state = "NATO average"nato_average <- nato_average %>% select(member_state, everything())
Bind the two data.frames
together
nato_exp <- rbind(nato_exp, nato_average)
Create a new factor variable that categorises countries into either above or below the NATO average defense spending.
Also we can specify a category to distinguish those countries that have reached the NATO target of their defense spending equal to 2% of their GDP.
nato_exp <- nato_exp %>% filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>% dplyr::mutate(difference = case_when(def_exp_gdp >= 2 ~ "Above NATO 2% GDP quota", between(def_exp_gdp, 1.6143, 2) ~ "Above NATO average", between(def_exp_gdp, 1.61427, 1.61429) ~ "NATO average", def_exp_gdp <= 1.613 ~ "Below NATO average"))
Create a vector of hex colours to correspond to the different categories. I choose traffic light colors to indicate the
- green countries (those who have reached the NATO 2% quota),
- orange countries (above the NATO average but below the spending target) and
- red countries (below the NATO spending average).
The blue colour is for the NATO average bar,
my_palette <- c( "Below NATO average" = "#E60000", "NATO average" = "#012169", "Above NATO average" = "#FF7800", "Above NATO 2% GDP quota" = "#4CBB17")
Finally, we create a graph with ggplot
, and use the reorder()
function to arrange the bars in ascending order.
NATO allies are encouraged to hit the target of 2% of gross domestic product. So, we add a geom_vline()
to demarcate the NATO 2% quota.
nato_bar <- nato_exp %>% filter(member_state != 'NATO' & member_state!= "North Macedonia" & member_state!= "Iceland") %>% ggplot(aes(x= reorder(member_state, def_exp_gdp), y = def_exp_gdp, fill=factor(difference))) + geom_bar(stat = "identity") + geom_vline(xintercept = 22.55, colour="firebrick", linetype = "longdash", size = 1) + geom_text(aes(x=22, label="NATO 2% quota", y=3), colour="firebrick", text=element_text(size=20)) + labs(title = "NATO members Defense Expenditure as a percentage GDP ", subtitle = "Source: NATO, 2019", x = "NATO Member States", y = "Defense Expenditure (as % GDP) ")
Click here to read about adding flags to graphs with the ggimage
package.
library(countrycode)library(ggimage)nato_exp$iso2 <- countrycode(nato_exp$member_state, "country.name", "iso2c")
Finally, we can print out the nato_bar
graph!
nato_bar + geom_flag(y = -0.2, aes(image = nato_exp$iso2)) +coord_flip() +expand_limits(y = -0.2) +theme(legend.title = element_blank(), axis.text.x=element_text(angle=45, hjust=1)) + scale_fill_manual(values = my_palette)
Packages we will need:
library(WDI)library(tidyverse)library(magrittr) # for pipeslibrary(ggthemes)library(rnaturalearth) # to create mapslibrary(viridis) # for pretty colors
We will use this package to really quickly access all the indicators from the World Bank website.
Below is a screenshot of the World Bank’s data page where you can search through all the data with nice maps and information about their sources, their available years and the unit of measurement et cetera.
You can look at the World Bank website and browse all the indicators available.
In R when we download the WDI package, we can download the datasets directly into our environment.
With the WDIsearch
() function we can look for the World Bank indicator.
For this blog, we want to map out how dependent countries are on oil. We will download the dataset that measures oil rents as a percentage of a country’s GDP.
WDIsearch('oil rent')
The output is:
indicator name "NY.GDP.PETR.RT.ZS" "Oil rents (% of GDP)"
Copy the indicator string and paste it into the WDI()
function. The country codes are the iso2 codes, which you can input as many as you want in the c()
.
If you want all countries that the World Bank has, do not add country
argument.
We can compare Iran and Saudi Arabian oil rents from 1970 until the most recent value.
data = WDI(indicator='NY.GDP.PETR.RT.ZS', country=c('IR', 'SA'), start=1970, end=2019)
And graph out the output. All only takes a few steps.
my_palette = c("#DA0000", "#239f40") #both the hex colors are from the maps of the countriesoil_graph <- ggplot(oil_data, aes(year, NY.GDP.PETR.RT.ZS, color = country)) + geom_line(size = 1.4) + labs(title = "Oil rents as a percentage of GDP", subtitle = "In Iran and Saudi Arabia from 1970 to 2019", x = "Year", y = "Average oil rent as percentage of GDP", color = " ") + scale_color_manual(values = my_palette)oil_graph + ggthemes::theme_fivethirtyeight() + theme(plot.title = element_text(size = 30), axis.title.y = element_text(size = 20), axis.title.x = element_text(size = 20))
For some reason the World Bank does not have data for Iran for most of the early 1990s. But I would imagine that they broadly follow the trends in Saudi Arabia.
I added the flags myself manually after I got frustrated with geom_flag()
. It is something I will need to figure out for a future blog post!
It is really something that in the late 1970s, oil accounted for over 80% of all Saudi Arabia’s Gross Domestic Product.
Now we see both countries rely on a far smaller percentage. Due both to the fact that oil prices are volatile, climate change is a new constant threat and resource exhaustion is on the horizon, both countries have adjusted policies in attempts to diversify their sources of income.
Next we can use the World Bank data to create maps and compare regions on any World Bank scores.
We will compare all Asian and Middle Eastern countries with regard to all natural rents (not just oil) as a percentage of their GDP.
So, first we create a map with the rnaturalearth
package. Click here to read a previous tutorial about all the features of this package.
I will choose only the geographical continent of Asia, which covers the majority of Middle East also.
asia_map <- ne_countries(scale = "medium", continent = 'Asia', returnclass = "sf")
Then, once again we use the WDI()
function to download our World Bank data.
nat_rents = WDI(indicator='NY.GDP.TOTL.RT.ZS', start=2016, end=2018)
Next I’ll merge the with the asia_map
object I created.
asia_rents <- merge(asia_map, nat_rents, by.x = "iso_a2", by.y = "iso2c", all = TRUE)
We only want the value from one year, so we can subset the dataset
map_2017 <- asia_rents [which(asia_rents$year == 2017),]
And finally, graph out the data:
nat_rent_graph <- ggplot(data = map_2017) + geom_sf(aes(fill = NY.GDP.TOTL.RT.ZS), position = "identity") + labs(fill ='Natural Resource Rents as % GDP') + scale_fill_viridis_c(option = "viridis")nat_rent_graph + theme_map()
Packages we will need :
library(mctest)
The mctest
package’s functions have many multicollinearity diagnostic tests for overall and individual multicollinearity. Additionally, the package can show which regressors may be the reason of for the collinearity problem in your model.
Click here to read the CRAN PDF for all the function arguments available.
So – as always – we first fit a model.
Given the amount of news we have had about elections in the news recently, let’s look at variables that capture different aspects of elections and see how they relate to scores of democracy. These different election components will probably overlap.
In fact, I suspect multicollinearity will be problematic with the variables I am looking at.
Click here for a previous blog post on Variance Inflation Factor (VIF) score, the easiest and fastest way to test for multicollinearity in R.
The variables in my model are:
emb_autonomy
– the extent to which the election management body of the country has autonomy from the government to apply election laws and administrative rules impartially in national elections.election_multiparty
– the extent to which the elections involved real multiparty competition.election_votebuy
– the extent to which there was evidence of vote and/or turnout buying.election_intimidate
– the extent to which opposition candidates/parties/campaign workers subjected to repression, intimidation, violence, or harassment by the government, the ruling party, or their agents.election_free
– the extent to which the election was judged free and fair.
In this model the dependent variable is democracy
score for each of the 178 countries in this dataset. The score measures the extent to which a country ensures responsiveness and accountability between leaders and citizens. This is when suffrage is extensive; political and civil society organizations can operate freely; governmental positions are clean and not marred by fraud, corruption or irregularities; and the chief executive of a country is selected directly or indirectly through elections.
election_model <- lm(democracy ~ ., data = election_df)stargazer(election_model, type = "text")
However, I suspect these variables suffer from high multicollinearity. Usually your knowledge of the variables – and how they were operationalised – will give you a hunch. But it is good practice to check everytime, regardless.
The eigprop()
function can be used to detect the existence of multicollinearity among regressors. The function computes eigenvalues, condition indices and variance decomposition proportions for each of the regression coefficients in my election model.
To check the linear dependencies associated with the corresponding eigenvalue, the eigprop
compares variance proportion with threshold value (default is 0.5) and displays the proportions greater than given threshold from each row and column, if any.
So first, let’s run the overall multicollinearity test with the eigprop()
function :
mctest::eigprop(election_model)
If many of the Eigenvalues are near to 0, this indicates that there is multicollinearity.
Unfortunately, the phrase “near to” is not a clear numerical threshold. So we can look next door to the Condition Index score in the next column.
This takes the Eigenvalue index and takes a square root of the ratio of the largest eigenvalue (dimension 1) over the eigenvalue of the dimension.
Condition Index values over 10 risk multicollinearity problems.
In our model, we see the last variable – the extent to which an election is free and fair – suffers from high multicollinearity with other regressors in the model. The Eigenvalue is close to zero and the Condition Index (CI) is near 10. Maybe we can consider dropping this variable, if our research theory allows its.
Another battery of tests that the mctest package offers is the imcdiag
( ) function. This looks at individual multicollinearity. That is, when we add or subtract individual variables from the model.
mctest::imcdiag(election_model)
A value of 1 means that the predictor is not correlated with other variables. As in a previous blog post on Variance Inflation Factor (VIF) score, we want low scores. Scores over 5 are moderately multicollinear. Scores over 10 are very problematic.
And, once again, we see the last variable is HIGHLY problematic, with a score of 14.7. However, all of the VIF scores are not very good.
The Tolerance (TOL) score is related to the VIF score; it is the reciprocal of VIF.
The Wi score is calculated by the Farrar Wi, which an F-test for locating the regressors which are collinear with others and it makes use of multiple correlation coefficients among regressors. Higher scores indicate more problematic multicollinearity.
The Leamer score is measured by Leamer’s Method : calculating the square root of the ratio of variances of estimated coefficients when estimated without and with the other regressors. Lower scores indicate more problematic multicollinearity.
The CVIF score is calculated by evaluating the impact of the correlation among regressors in the variance of the OLSEs. Higher scores indicate more problematic multicollinearity.
The Klein score is calculated by Klein’s Rule, which argues that if Rj from any one of the models minus one regressor is greater than the overall R2 (obtained from the regression of y on all the regressors) then multicollinearity may be troublesome. All scores are 0, which means that the R2 score of any model minus one regression is not greater than the R2 with full model.
Click here to read the mctest paper by its authors – Imdadullah et al. (2016) – that discusses all of the mathematics behind all of the tests in the package.
In conclusion, my model suffers from multicollinearity so I will need to drop some variables or rethink what I am trying to measure.
Click here to run Stepwise regression analysis and see which variables we can drop and come up with a more parsimonious model (the first suspect I would drop would be the free and fair elections variable)
Perhaps, I am capturing the same concept in many variables. Therefore I can run Principal Component Analysis (PCA) and create a new index that covers all of these electoral features.
Next blog will look at running PCA in R and examining the components we can extract.
References
Imdadullah, M., Aslam, M., & Altaf, S. (2016). mctest: An R Package for Detection of Collinearity among Regressors.R J.,8(2), 495.
Packages we will need:
library(gvlma)
gvlma
stands for Global Validation of Linear Models Assumptions. See Peña and Slate’s (2006) paper on the package if you want to check out the math!
Linear regression analysis rests on many MANY assumptions. If we ignore them, and these assumptions are not met, we will not be able to trust that the regression results are true.
Luckily, R has many packages that can do a lot of the heavy lifting for us. We can check assumptions of our linear regression with a simple function.
So first, fit a simple regression model:
data(mtcars) summary(car_model <- lm(mpg ~ wt, data = mtcars))
We then feed our car_model
into the gvlma()
function:
gvlma_object <- gvlma(car_model)
- Global Stat checks whether the relationship between the dependent and independent relationship roughly linear. We can see that the assumption is met.
- Skewness and kurtosis assumptions show that the distribution of the residuals are normal.
- Link function checks to see if the dependent variable is continuous or categorical. Our variable is continuous.
- Heteroskedasticity assumption means the error variance is equally random and we have hom*oskedasticity!
Often the best way to check these assumptions is to plot them out and look at them in graph form.
Next we can plot out the model assumptions:
plot.gvlma(glvma_object)
The relationship is a negative linear relationship between the two variables.
This scatterplot of residuals on theyaxis and fitted values (estimated responses) on thexaxis. The plot is used to detect non-linearity, unequal error variances, and outliers.
As explained in this Penn State webpage on interpreting residuals versus fitted plots:
- The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
- The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
- No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.
In this histograpm of standardised residuals, we see they are relatively normal-ish (not too skewed, and there is a single peak).
Next, the normal probability standardized residuals plot, Q-Q plot of sample (y axis) versus theoretical quantiles (x axis). The points do not deviate too far from the line, and so we can visually see how the residuals are normally distributed.
Click here to check out the CRAN pdf for the gvlma package.
References
Peña, E. A., & Slate, E. H. (2006). Global validation of linear model assumptions.Journal of the American Statistical Association,101(473), 341-354.
Packages we will need:
library(Quandl)library(forecast) #for time series analysis and graphing
The website Quandl.com is a great resource I came across a while ago, where you can download heaps of datasets for variables such as energy prices, stock prices, World Bank indicators, OECD data other random data.
In order to download the data from the site, you need to first set up an account on the website, and indicate your intended use for the data.
Then you can access your API key, when you go to your “Account Setting” page.
Back on R, you call the API key with Quandl.api_key(
) function and now you can directly download data!
Quandl.api_key("StRiNgOfNuMbErSaNdLeTtErs")
Now, I click to search only through the free datasets. I have no idea how much a subscription costs but I imagine it is not cheap.
You can browse through the database and when you find the dataset you want, you copy and paste the string code into Quandl()
function.
We can choose the class of the time series object we will download with the type
= argument.
We also toggle the start_date
and end_data
of the time series.
So I will download employment data for Ireland from 1980 until 2019 as a zoo
object. We can check the Quandl page for the Irish employment data to learn about the data source and the unit of measurement
emp <- Quandl('ODA/IRL_LE', start_date='1980-01-01', end_date='2020-01-01',type="zoo")
Click here to check out the Quandl CRAN pdf documentation and learn more about the differen arguments you can use with this function. Here is the generic arguments you can play with when downloading your dataset:
Quandl(code, type = c("raw", "ts", "zoo", "xts", "timeSeries"), transform = c("", "diff", "rdiff", "normalize", "cumul", "rdiff_from"), collapse = c("", "daily", "weekly", "monthly", "quarterly", "annual")
Now we can graph the emp
data:
autoplot(emp[,"V1"]) + ggtitle("Employment level in Ireland") + labs("Source: International Monetary Fund data") + xlab("Year") + ylab("Employed people (in millions)")
The 1980s were a rough time for unemployment in Ireland. Also the 2008 recession had a sizeable impact on unemployment too. I am afraid how this graph will look with 2020 data.
Next, we can visually examine the autocorrelation plot.
With time series data, it is natural to assume that the value at the current time period is highly related (i.e. serially correlated) to its value in the previous period. This is autocorrelation, and it can be problematic when we want to forecast or run statistics. This is because it violates the assumption that values are independent of each other.
emp_ts <- ts(emp)forecast::ggAcf(emp_ts)
There is very high autocorrelation in employment level in Ireland over the years.
In next blog, we will look at how to correct for autocorrelation in time series data.
The ExPand package is an example of a shiny app.
What is a shiny app, you ask? Click to look at a quick Youtube explainer. It’s basically a handy GUI for R.
When we feed a panel data.frame into the ExPanD()
function, a new screen pops up from R IDE (in my case, RStudio) and we can interactively toggle with various options and settings to run a bunch of statistical and visualisation analyses.
Click here to see how to convert your data.frame to pdata.frame object with the plm package.
Be careful your pdata.frame is not too large with too many variables in the mix. This will make ExPanD upset enough to crash. Which, of course, I learned the hard way.
Also I don’t know why there are random capitalizations in the PaCkaGe name. Whenever I read it, I think of that Sponge Bob meme.
If anyone knows why they capitalised the package this way. please let me know!
So to open up the new window, we just need to feed the pdata.frame into the function:
ExPanD(mil_pdf)
For my computer, I got error messages for the graphing sections, because I had an old version of Cairo
package. So to rectify this, I had to first install a source version of Cairo and restart my R session. Then, the error message gods were placated and they went away.
install.packages("Cairo", type="source")
Then press command + shift + F10 to restart R session
library(Cairo)
You may not have this problem, so just ignore if you have an up-to-date version of the necessary packages.
When the new window opens up, the first section allows you to filter subsections of the panel data.frame. Similar to the filter() argument in the dplyr package.
For example, I can look at just the year 1989:
But let’s look at the full sample
We can toggle with variables to look at mean scores for certain variables across different groups. For example, I look at physical integrity scores across regime types.
- Purple plot: closed autocracy
- Turquoise plot: electoral autocracy
- Khaki plot: electoral democracy:
- Peach plot: liberal democracy
The plots show that there is a high mean score for physical integrity scores for liberal democracies and less variance. However with the closed and electoral autocracies, the variance is greater.
We can look at a visualisation of the correlation matrix between the variables in the dataset.
Next we can look at a scatter plot, with option for loess smoother line, to graph the relationship between democracy score and physical integrity scores. Bigger dots indicate larger GDP level.
Last we can run regression analysis, and add different independent variables to the model.
We can add fixed effects.
And we can subset the model by groups.
The first column, the full sample is for all regions in the dataset.
The second column, column 1 is
Column 2 Post Soviet countries
Column 3: Latin America
Column 4: AFRICA
Column 5: Europe, North America
Column 6: Asia
Running a regression model with too many variables – especially irrelevant ones – will lead to a needlessly complex model. Stepwise can help to choose the best variables to add.
Packages you need:
library(olsrr)library(MASS)library(stargazer)
First, choose a model and throw every variable you think has an impact on your dependent variable!
I hear the voice of my undergrad professor in my ear: ” DO NOT go for the “throw spaghetti at the wall and just see what STICKS” approach. A cardinal sin.
We must choose variables because we have some theoretical rationale for any potential relationship. Or else we could end up stumbling on spurious relationships.
Like the one between Nick Cage movies and incidence of pool drowning.
However …
… beyond just using our sound theoretical understanding of the complex phenomena we study in order to choose our model variables …
… one additional way to supplement and gauge which variables add to – or more importantly omit from – the model is to choose the one with the smallest amount of error.
We can operationalise this as the model with the lowest Akaike information criterion(AIC).
AIC is anestimatorof in-sample prediction error and is similar to the adjustedR-squared measures we see in our regression output summaries.
It effectively penalises us for adding more variables to the model.
Lower scores can indicate a more parsimonious model, relative to a model fit with a higher AIC. It can therefore give an indication of the relative quality ofstatistical modelsfor a given set of data.
As a caveat, we can only compare AIC scores with models that are fit to explain variance of the same dependent / response variable.
data(mtcars)car_model <- lm(mpg ~., data = mtcars) %>% stargazer
Very many variables and not many stars.
With our model, we can now feed it into the stepwise function. Hopefully, we can remove variables that are not contributing to the model!
For the direction
argument, you can choose between backward and forward stepwise selection,
- Forward steps: start the model with no predictors, just one intercept and search through all the single-variable models, adding variables, until we find the the best one (the one that results in the lowestresidual sum of squares)
- Backward steps: we start stepwisewith all the predictors and removes variable with the least statistically significant (the largest p-value) one by one until we find the lost AIC.
Backward stepwise is generally better because starting with the full model has the advantage of considering the effects of all variables simultaneously.
Unlike backward elimination, forward stepwise selection is more suitable in settingswhere the number of variables is bigger than the sample size.
So tldr: unless the number of candidate variables is greater than the sample size (such as dealing with genes), using a backward stepwise approach is default choice.
You can also choose direction = "both"
:
step_car <- stepAIC(car_model, trace = TRUE, direction= "both")
If you add the trace = TRUE
, R prints out all the steps.
I’ll show the last step to show you the output.
The goal is to have the combination of variables that has the lowest AIC or lowestresidual sum of squares (RSS).
BThe best model, based on the lowest AIC, includes the predictors wt (weight), qsec (quarter-mile time), and am (automatic/manual transmission). The formula is mpg ~ wt + qsec + am.
If we dropped the wt variable, the AIC would shoot up 82.79, so we won’t do that.
Model Comparison:
The initial model (null model) has an AIC of 61.31.
Each row in the table corresponds to a different model with additional predictors.
For example, adding the predictor hp reduces the AIC to 61.515, and adding carb reduces it further to 61.751.
The coefficients for the “best” model are given under “Call.” For the formula mpg ~ wt + qsec + am, the intercept is 9.618, and the coefficients for wt, qsec, and am are -3.917, 1.226, and 2.936, respectively.
stargazer(car_model, step_car, type = "text")
We can see that the stepwise model has only three variables compared to the ten variables in my original model.
And even with far fewer variables, the R2 has decreased by an insignificant amount. In fact the Adjusted R2 increased because we are not being penalised for throwing so many unnecessary variables.
So we can quickly find a model that loses no explanatory power by is far more parsimonious.
Plus in the original model, only one variable is significant but in the stepwise variable all three of the variables are significant.
We can plot out the stepwise regression with the oslrr package
car_model %>% ols_step_backward_p() %>% plot()
car_model %>% ols_step_backward_p(details = TRUE)
With this, we can iterate over the steps. First, we can look at the first step:
The removal of cyl suggests that, after evaluating its p-value, it was found to be not statistically significant in predicting the response variable.
The model is iteratively refining itself by removing variables that do not provide significant information.
In total, it runs five steps and we arrive at the final step and the final model.
Below, we can look at the Final Model Output:
Packages we will need:
library(olsrr)library(countrycode)library(WDI)library(stargazer)library(peacesciencer)library(plm)
One core assumption of linear regression analysis is that the residuals of the regression are normally distributed.
When the normality assumption is violated, interpretation and inferences may not be reliable. In worst case, the interpretations are not at all valid.
So it is important we check this assumption is not violated.
As well residuals being normal distributed, we must also check that the residuals have the same variance (i.e. hom*oskedasticity).
Click here to find out how to check for hom*oskedasticity.
Then, if there is a problem with the variance, click here to find out how to fix heteroskedasticity (which means the residuals have a non-random pattern in their variance) with the sandwich package in R.
There are three ways to check that the error in our linear regression has a normal distribution:
- plots or graphs such histograms, boxplots or Q-Q-plots, to get a visual approximation
- examining skewness and kurtosis indices
- formal normality tests, to check for a p-value
In this blog, we will see what factors affect military spending by a government.
We will take some variables from the World Bank via the WDI
package.
Click here to read more about downloading World Bank Data with the WDI package.
mil_spend_gdp = WDI(indicator = "MS.MIL.XPND.ZS")gdp_percap = WDI(indicator = "NY.GDP.PCAP.KD")mil_spend_gdp %>% inner_join(gdp_percap) %>% select(country, year, mil_spend_gdp = MS.MIL.XPND.ZS, gdp_percap = NY.GDP.PCAP.KD) %>% mutate(cown = countrycode(country, "country.name", "cown")) %>% filter(!is.na(cown)) -> wdi_data
And also download some variables via the peacesciencer
package.
Click here to read more about the peacesciencer
package in the blog posts about building datasets
peacesciencer::create_stateyears(system = "gw") %>% add_ucdp_acd() %>% add_democracy() %>% mutate(cown = countrycode(statename, "country.name", "cown")) -> peace_data
We can code a new binary variable that indicates if there was a UCDP conflict in the previous 10 years or not.
We could imagine a country that experienced war is more likely to keep investing in their military (as a larger percentage of their GDP) than countries that have only experienced relative peace in their recent past.
peace_data %<>% select(statename, cown, year, ucdpongoing, maxintensity, conflict_ids, v2x_polyarchy, polity2) %>% mutate(ucdpongoing_no_na = ifelse(is.na(ucdpongoing), 0, ucdpongoing)) %>% group_by(statename) %>% arrange(year) %>% mutate(war_past_10_years = ifelse(ucdpongoing == 1 & lag(ucdpongoing, order_by = year, default = 0, n = 10) == 1, 1, 0)) %>% mutate(war_past_10_years_no_na = ifelse(ucdpongoing_no_na == 1 & lag(ucdpongoing_no_na, order_by = year, default = 0, n = 10) == 1, 1, 0))
We merge the data by the Correlates of War codes.
Click here to read more about using COW codes with the countrycode package.
wdi_data %>% inner_join(peace_data, by = c("cown", "year")) -> wdi_peace
With these data, we can build our linear regression model.
Our dependent variable is military spending as a percentage of GDP (logged)
Our independent varibles are:
- GDP per capita (logged) from the World Bank
- Demoracy (as measured by the V-DEM polyarchy score)
- Binary variable that is 1 if a country had a UCDP conflict in the previous 10 years and 0 if none.
We will also add an interaction term with the GDP and democracy variable.
Given we have cross-sectional longitudinal data, the best option would be panel data analysis with the plm
package
plm(log(mil_spend_gdp) ~ log(gdp_percap)*v2x_polyarchy + as.factor(war_past_10_years_no_na), data = wdi_peace, index = c("cown", "year"), model = "within") %>% stargazer(., type = "text")
Dependent variable: | |
Military spending (GDP %) (ln) | |
GDP pc (ln) | -0.288^{***} |
(0.029) | |
Democracy | 1.004^{***} |
(0.353) | |
War 10 year dummy | 0.146^{***} |
(0.021) | |
GDP pc (ln) x Democracy | -0.199^{***} |
(0.046) | |
Observations | 3,686 |
R^{2} | 0.135 |
Adjusted R^{2} | 0.097 |
F Statistic | 137.989^{***} (df = 4; 3530) |
Note: | ^{*}p<0.1; ^{**}p<0.05; ^{***}p<0.01 |
However, the olsrr
package cannot handle plm
.
In future blog posts, we will lok more closely at plm() panel regressions and the diagnostic tests we have to run with these types of models.
So we will just look at one year, 2010.
lm(log(mil_spend_gdp) ~ log(gdp_percap)*v2x_polyarchy + as.factor(war_past_10_years_no_na), data = subset(wdi_peace, year == 2010)) -> war_model
Dependent variable: | |
Military spending (GDP %) (ln) | |
GDP pc (ln) | 0.261^{**} |
(0.101) | |
Democracy | 1.450 |
(1.565) | |
War 10 year dummy | 0.592^{***} |
(0.159) | |
GDP pc (ln) x Democracy | -0.343^{**} |
(0.168) | |
Constant | 0.183 |
(0.885) | |
Observations | 136 |
R^{2} | 0.381 |
Adjusted R^{2} | 0.362 |
Residual Std. Error | 0.615 (df = 131) |
F Statistic | 20.137^{***} (df = 4; 131) |
Note: | ^{*}p<0.1; ^{**}p<0.05; ^{***}p<0.01 |
So now we have our OLS model, we can run a heap of linear model diagnostic functions with the olsrr
package.
Built by Aravind Hebbali, the description of the package mentions that olsrr
has tools designed to make it easier for users, particularly beginner/intermediate R users to build ordinary least squares regression models. Thank you Aravind!
It includes regression output, heteroskedasticity tests, collinearity diagnostics, residual diagnostics, measures of influence, model fit assessment and variable selection procedures. Look through the CRAN PDF below or look at rsquaredacademy website to get a comprehensive overview of the package
olsrr: Tools for Building OLS Regression ModelsDownload
We will now check if the residuals in our model (the difference between what our model predicted and what the values actually are) are normally distributed
ols_test_normality(war_model)
Test | Statistic | p-value |
---|---|---|
Shapiro-Wilk | 0.9817 | 0.0653 |
Kolmogorov-Smirnov | 0.0524 | 0.8494 |
Cramer-von Mises | 14.1123 | 0.0000 |
Anderson-Darling | 0.469 | 0.2447 |
Let’s look at each test result in turn
Shapiro-Wilk:
The test statistic is 0.9817, and the p-value is 0.0653.
The null hypothesis is that the residuals are normally distributed.
In this case, the p-value is greater than the predefined significance level (typically 0.05), so you cannot reject the null hypothesis.
This suggests that the residuals may follow a normal distribution.
Woo!
Kolmogorov-Smirnov:
The test statistic is 0.0524, and the p-value is 0.8494.
Similar to the Shapiro-Wilk test, the p-value is greater than 0.05, so you cannot reject the null hypothesis of normality.
This suggests that the residuals may follow a normal distribution.
Yay.
Cramer-von Mises:
This test statistic is 14.1123, and the p-value is 0.0000.
The null hypothesis is that the residuals are normally distributed. The very low p-value indicates that you can reject the null hypothesis, suggesting that the residuals are not from a normal distribution.
Oh no.
Anderson-Darling: This is another test of normality. The test statistic is 0.469, and the p-value is 0.2447. Similar to the Shapiro-Wilk and Kolmogorov-Smirnov tests, the p-value is greater than 0.05, so you cannot reject the null hypothesis of normality.
Phew.
Which of the normality tests is the best?
And what is up with Cramer-von Mises?
A paper by Razali and Wah (2011) tested all these formal normality tests with 10,000 Monte Carlo simulation of sample data generated from alternative distributions that follow symmetric and asymmetric distributions.
Their results showed that the Shapiro-Wilk test is the most powerful normality test, followed by Anderson-Darling test, and Kolmogorov-Smirnov test. Their study did not look at the Cramer-Von Mises test.
The results of Razali and Wah’s study echo the previous findings of Mendes and Pala (2003) and Keskin (2006) in support of Shapiro-Wilk test as the most powerful normality test.
According Ahad and colleagues (2011: 641), they find that
“the performances of the normality tests, namely, the Kolmogorov-Smirnov test, Anderson-Darling test, Cramervon Mises test, and Shapiro-Wilk test, were evaluated under various spectrums of non-normal distributions and different sample sizes. The results showed that the ShapiroWilk test is the most sensitive normality test because this test rejects the null hypothesis of normality at the smallest sample sizes compared to the other tests, at all levels of skewness and kurtosis. Thus, when the four normality tests are available in a statistical package, we would recommend practitioners to use the Shapiro-Wilk normality to test the normality of data”
Ahad et al (2001: 641)
We can plot out the residuals distribution in a histogram with olsrr
olsrr::ols_plot_resid_hist(war_model)
Visually, we can confirm that the residuals have a lovely bell curve and are broadly normally distributed! With a few outliers at -2.
Nest we can check the residuals with a QQ plot.
A QQ plot is used to compare residuals to the normal distribution in linear regression. We can use a normal QQ plot to visually check if our residuals follow a theoretical normal distribution.
In addition to being good at identifying outliers and heavy tails, QQ plots can reveal characteristics such as skewness and bimodality, and can be effective even
for small samples (Marden, 2004).
olsrr::ols_plot_resid_qq(war_model)
Again we can see some outliers at -2
Next we can look at a scatter plot of residuals on the y axis and fitted values on the x axis to detect non-linearity, unequal error variances, and outliers.
Each point in the plot is a residual value (i.e. the difference between what the model predicted and what the value actually
When interpreting this plot, there are a few things we want to look out for.
- The points does not deviate too far from 0. This indicates the variance is hom*ogeneous (i.e. homescediasticity, one of my favourite words)
- The points are random (i.e. show no distinct pattern) around the horizontal red line at 0
ols_plot_resid_fit(war_model)
- There are some residual values at the -2, so they might be outliers. We we look at an outlier diagnostic plot in a bit.
- There are no discernible pattern in the scatterplot, so there does not seem to be any heteroscedasticity in the variance.
We can run an ols_plot_resid_lev()
to graph for detecting outliers and/or observations with high leverage.
ols_plot_resid_lev(war_model)
There are a few outliers with leverage that we need to look more closely and examine how they prove / challenge our given theory / hypotheses.
FINALLLY. for a more complete diagnostics check, we can insert the model into the ols_coll_diag() function to calculate the Variance Inflation Factor (VIF) and Eigenvalues of the variables in the model.
VIF scores highlight if there is multicollinearity between the independent variables. If they are too highly correlated, our model is in trouble.
- If the value of VIF is 1< VIF < 5, it specifies that the variables are moderately correlated to each other.
- The challenging value of VIF is between 5 to 10 as it specifies the highly correlated variables.
- If VIF ≥ 5 to 10, there will be multicollinearity among the predictors in the regression model.
- VIF > 10 indicate the regression coefficients are feebly estimated with the presence of multicollinearity
Read more about the issues with multicollinearity in Shrestha (2020)
ols_coll_diag(war_model)
Variables | Tolerance | VIF |
---|---|---|
GDP per capita (ln) | 0.13 | 7.6 |
Democracy | 0.02 | 56.2 |
War 10 years | 0.91 | 1.1 |
GDP pc (ln) X Democracy | 0.01 | 79.9 |
Next we look at the Eigenvalues
Eigenvalue | Condition Index | intercept | GDP pc (ln) | Democracy | War 10 years | GDP pc (ln) X Democracy |
---|---|---|---|---|---|---|
3.93 | 1.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 |
0.90 | 2.09 | 0.00 | 0.00 | 0.00 | 0.82 | 0.00 |
0.16 | 5.01 | 0.01 | 0.00 | 0.00 | 0.12 | 0.01 |
0.02 | 15.14 | 0.04 | 0.08 | 0.05 | 0.02 | 0.02 |
0.00 | 67.74 | 0.95 | 0.92 | 0.94 | 0.03 | 0.97 |
This is not good.
But it is probably due to the interaction term.
If we run the regression agaon without any interaction term , the VIF scores are all around 1!
lm(log(mil_spend_gdp) ~ log(gdp_percap) + v2x_polyarchy + as.factor(war_past_10_years_no_na), data = subset(wdi_peace, year == 2010)) -> war_model_no_interactionols_coll_diag(war_model_no_interaction)
There are plenty of other helpful functions in the olsrr
package that we can look at with our model.
For example, we can run AIC stepwise regression to see if we need to drop any variables
aic_step <- ols_step_both_p(war_model)
Step | Variable | Added/Removed | R-Square | Adj. R-Square | C(p) | AIC | RMSE |
---|---|---|---|---|---|---|---|
1 | v2x_polyarchy | addition | 0.298 | 0.293 | 16.4970 | 271.6880 | 0.6474 |
2 | as.factor(war_past_10_years_no_na) | addition | 0.347 | 0.338 | 8.0720 | 263.7888 | 0.6266 |
3 | log(gdp_percap) | addition | 0.361 | 0.347 | 7.1600 | 262.8901 | 0.6223 |
4 | log(gdp_percap):v2x_polyarchy | addition | 0.381 | 0.362 | 5.0000 | 260.6381 | 0.6150 |
LowerAICscores are better, and AIC penalizes models that use more parameters.
That means if two models explain the same amount of variation, the model with a smaller number of variable parameters will have a lower AIC score.
Many would argue that this would be the better-fit model.
We can see in step 4, the AIC is the lowest. So that is good news!
Variables doe not live in a vacuum in the model. When we run a model, we want to have a better understanding of the relationship between military spending and the independent variables conditional on the other independent variables
According to the package, the added variable plot provides information about the marginal importance of a GIVEN independent variable, given the other variables already in the model.
It shows the marginal importance of the variable in reducing the residual variability.
olsrr::ols_plot_added_variable(war_model)
The military spending dependent variable is on the y axis and we look at adding the named variable (given that the other variables are already in the model).
The democracy variable GDP variable interaction slope appears to decrease while all other variables increase.
What do theYandXresiduals represent? TheYresiduals represent the part ofYnot explained by all the variables other than X. TheXresiduals represent the part ofXnot explained by other variables. The slope of the line fitted to the points in the added variable plot is equal to the regression coefficient whenYis regressed on all variables includingX.
A strong linear relationship in the added variable plot indicates the increased importance of the contribution ofXto the model already containing the other predictors.
We can see, for example, that (with all other variables held constant) higher GDP per capita correlates with higher proportion of military spending. Richer countries seem to dedicate more of this money to building a military.
Thank you for reading !
References
Ahad, N. A., Yin, T. S., Othman, A. R., & Yaacob, C. R. (2011). Sensitivity of normality tests to non-normal data.Sains Malaysiana,40(6), 637-641.
Marden, J. I. (2004). Positions and QQ plots.Statistical Science, 606-614.
Razali, N. M., & Wah, Y. B. (2011). Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests.Journal of statistical modeling and analytics,2(1), 21-33.
Shrestha, N. (2020). Detecting multicollinearity in regression analysis.American Journal of Applied Mathematics and Statistics,8(2), 39-42.
Packages we will use:
install.packages("igraph")
library(igraph)
First create a dataframe with the two actors in the dyad.
countries_df <- data.frame(stateA = ww1_df$statea, stateB = ww1_df$stateb, stringsAsFactors = TRUE)
Next, convert to matrix so it is suitable for the next function
countries_matrix <- as.matrix(countries_df)
Feed the matrix into the graph.edgelist()
function. We can see that it returns an igraph
object:
countries_ig <- graph.edgelist(countries_matrix , directed=TRUE)
“Nodes” designate the vertices of a network, and “edges” designate its ties. Vertices are accessed using the V() function while edges are accessed with the E(). This igraph
object has 232 edges and 16 vertices over the four years.
Furthermore, the igraph
object has a name attribute as one of its vertices properties. To access, type:
V(countries_ig)$name
Which prints off all the countries in the ww1 dataset; this is all the countries that engaged in militarized interstate disputes between the years 1914 to 1918.
[1] "United Kingdom" "Austria-Hungary" "France" "United States" "Russia" "Romania" "Germany" "Greece" "Yugoslavia"[10] "Italy" "Belgium" "Turkey" "Bulgaria" "Portugal" "Estonia" "Latvia"
Next we can fit an algorithm to modify the graph distances. According to our pal Wikipedia, force-directed graph drawingalgorithms are a class ofalgorithmsfordrawing graphsin an aesthetically-pleasing way. Their purpose is to position the nodes of agraphin two-dimensional or three-dimensional space so that all the edges are of more or less equal length and there are as few crossing edges as possible, by assigning forces among the set of edges and the set of nodes, based on their relative positions!
We can do this in one simple step by feeding the igraph into the algorithm function.
Check out this blog post to see the differences between these distance algorithms.
I will choose the Kamada-Kawai algorithm.
kamada_layout <- layout.kamada.kawai(countries_ig)
Now to plot the WW1 countries and their war dispute networks
plot(countries_ig,
layout = kamada_layout,
vertex.size = 14,
vertex.color = "red",
vertex.frame.color = NA,
vertex.label.cex = 1.2,
edge.curved = .2,
edge.arrow.size = .3,
edge.width = 1)
A nice way to summarise all the variables in a dataset.
install.packages("skimr")
library(skimr)
The data we’ll look at is from the Correlates of War . It provides dyadic records of militarized interstate disputes (MIDs) over the period of 1816-2010.
skim(mid)
n_missing :
tells which variables have missing values
complete_rate :
the percentage of the variables which are missing
Column 4 – 7 gives the mean, standard deviation, min, 25th percentile, median, 75th percentile
and max
values.
The last column is a histogram of each variables, so you can easily scan and see if variables are normally distributed, skewed or binary.
Packages we will need:
library(cluster)
library(factoextra)
I am looking at 127 non-democracies on seeing how the cluster on measures of state capacity (variables that capture ability of the state to control its territory, collect taxes and avoid corruption in the executive).
We want to minimise the total within sums of squares error from the cluster mean when determining the clusters.
First, we need to find the optimal number of clusters. We set the max number of clusters at k = 15.
within_sum_squares <- function(k){kmeans(autocracy_df, k, nstart = 3)$tot.withinss}min_max <- 1:15
within_sum_squares_values <- map(min_max, within_sum_squares)
plot(min_max, within_sum_squares_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters",
ylab="Total within sum of squares")
K-means searches for theminimumsum of squaresassignment, i.e. it minimizes unnormalized variance by assigning points to cluster centers.
k_clusters <- kmeans(autocracy_df[3:5], centers = 6, nstart = 25)
class(k_clusters)
We can now take the k_clusters
object and feed it into the fviz_cluster()
function.
fviz_cluster(k_clusters, data = autocracy_df[3:5], ellipse.type = "convex")
Packages we need
install.packages("dendextend")
library(dendextend)
This blog will create dendogram to examine whether Asian countries cluster together when it comes to extent of judicial compliance. I’m examining Asian countries with populations over 1 million and data comes from the year 2019.
Judicial compliance measure how often a government complies with important decisions by courts with which it disagrees.
Higher scores indicate that the government often or always complies, even when they are unhappy with the decision. Lower scores indicate the government rarely or never complies with decisions that it doesn’t like.
It is important to make sure there are no NA values. So I will impute any missing variables.
Click here to read how to impute missing values in your dataset.
library(mice)
imputed_data <- mice(asia_df, method="cart")
asia_df <- complete(imputed_data)
Next we can scale the dataset. This step is for when you are clustering on more than one variable and the variable units are not necessarily equivalent. The distance value is related to the scale on which the different variables are made.
Therefore, it’s good to scale all to a common unit of analysis before measuring any inter-observation dissimilarities.
asia_scale <- scale(asia_df)
Next we calculate the distance between the countries (i.e. different rows) on the variables of interest and create a dist
object.
There are many different methods you can use to calculate the distances. Click here for a description of the main formulae you can use to calculate distances. In the linked article, they provide a helpful table to summarise all the common methods such as “euclidean
“, “manhattan
” or “canberra
” formulae.
I will go with the “euclidean
” method. but make sure your method suits the data type (binary, continuous, categorical etc.)
asia_judicial_dist <- dist(asia_scale, method = "euclidean")
class(asia_judicial_dist)
We now have a dist
object we can feed into the hclust()
function.
With this function, we will need to make another decision regarding the method we will use.
The possible methods we can use are "ward.D"
,"ward.D2"
,"single"
,"complete"
,"average"
(= UPGMA),"mcquitty"
(= WPGMA),"median"
(= WPGMC) or"centroid"
(= UPGMC).
Click here for a more indepth discussion of the different algorithms that you can use
Again I will choose a common "ward.D2"
method, which chooses the best clusters based on calculating: at each stage, which two clusters merge that provide the smallest increase in the combined error sum of squares.
asia_judicial_hclust <- hclust(asia_judicial_dist, method = "ward.D2")
class(asia_judicial_hclust)
We next convert our hclust
object into a dendrogram
object so we can plot it and visualise the different clusters of judicial compliance.
asia_judicial_dend <- as.dendrogram(asia_judicial_hclust)
class(asia_judicial_dend)
When we plot the different clusters, there are many options to change the color, size and dimensions of the dendrogram. To do this we use the set()
function.
Click here to see a very comprehensive list of all the set()
attributes you can use to modify your dendrogram from the dendextend package.
asia_judicial_dend %>%
set("branches_k_color", k=5) %>% # five clustered groups of different colors
set("branches_lwd", 2) %>% # size of the lines (thick or thin)
set("labels_colors", k=5) %>% # color the country labels, also five groups
plot(horiz = TRUE) # plot the dendrogram horizontally
I choose to divide the countries into five clusters by color:
And if I zoom in on the ends of the branches, we can examine the groups.
The top branches appear to be less democratic countries. We can see that North Korea is its own cluster with no other countries sharing similar judicial compliance scores.
The bottom branches appear to be more democratic with more judicial independence. However, when we have our final dendrogram, it is our job now to research and investigate the characteristics that each countries shares regarding the role of the judiciary and its relationship with executive compliance.
Singapore, even though it is not a democratic country in the way that Japan is, shows a highly similar level of respect by the executive for judicial decisions.
Also South Korean executive compliance with the judiciary appears to be more similar to India and Sri Lanka than it does to Japan and Singapore.
So we can see that dendrograms are helpful for exploratory research and show us a starting place to begin grouping different countries together regarding a concept.
A really quick way to complete all steps in one go, is the following code. However, you must use the default methods for the dist
and hclust
functions. So if you want to fine tune your methods to suit your data, this quicker option may be too brute.
asia_df %>%
scale %>%
dist %>%
hclust %>%
as.dendrogram %>%
set("branches_k_color", k=5) %>%
set("branches_lwd", 2) %>%
set("labels_colors", k=5) %>%
plot(horiz = TRUE)
All the packages I will be using:
library(rnaturalearth)library(countrycode)library(tidyverse)library(ggplot2)library(ggthemes)library(viridis)
First, we access and store a map object from the rnaturalearth package, with all the spatial information in contains. We specify returnclass = "sf"
, which will return a dataframe with simple features information.
Simple featuresorsimple feature accessrefers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on thespatialgeometry of these objects. Our map has these attributes stored in the object.
With the ne_countries()
function, we get the borders of all countries.
map <- ne_countries(scale = "medium", returnclass = "sf")View(map)
This map
object comes with lots of information about 241 countries and territories around the world.
In total, it has 65 columns, mostly with different variants of the names / locations of each country / territory. For example, ISO codes for each country. Further in the dataset, there are a few other variables such as GDP and population estimates for each country. So a handy source of data.
However, I want to use values from a different source; I have a freedom_df
dataframe with a freedom of association variable.
The freedom of association index broadly captures to what extent are parties, including opposition parties, allowed to form and to participate in elections, and to what extent are civil society organizations able to form and to operate freely in each country.
So, we can merge them into one dataset.
Before that, I want to only use the scores from the most recent year to map out. So, take out only those values in the year 2019 (don’t forget the comma sandwiched between the round bracket and the square bracket):
freedom19 <- freedom_df[which(freedom_df$year == 2019),]
My freedom19
dataset uses the Correlates of War codes but no ISO country codes. So let’s add these COW codes to the map
dataframe for ease of merging.
I will convert the ISO codes to COW codes with the countrycodes()
function:
map$COWcode <- countrycode(map$adm0_a3, "iso3c", "cown")
Click here to read more about the countrycode() function in R.
Now, with a universal variable common to both datasets, I can merge the two datasets with the common COW codes:
map19 <- merge(map, freedom19, by.x = "COWcode", by.y = "ccode", all = TRUE)
Click here to read more about the merge()
function in R.
We’re all ready to graph the map. We can add the freedom of association variable into the aes()
argument of the geom_sf()
function. Again, the sf
refers to simple features with geospatial information we will map out.
assoc_graph <- ggplot(data = map19) + geom_sf(aes(fill = freedom_association_index), position = "identity") + labs(fill='Freedom of Association Index') + scale_fill_viridis_c(option = "viridis")
The scale_fill_viridis_c(option = "viridis")
changes the color spectrum of the main variable.
Other options include:
"viridis"
"magma"
"plasma
“
And various others. Click here to learn more about this palette package.
Finally we call the new graph stored in the assoc_graph
object.
I use the theme_map()
function from the ggtheme
package to make the background very clean and to move the legend down to the bottom left of the screen where it takes up the otherwise very empty Pacific ocean / Antarctic expanse.
Click here for more information on the ggtheme package.
assoc_graph + theme_map()
And there we have it, a map of countries showing the Freedom of Association index across countries.
The index broadly captures to what extent are parties, including opposition parties, allowed to form and to participate in elections, and to what extent are civil society organizations able to form and to operate freely.
Yellow colors indicate more freedom, green colors indicate middle scores and blue colors indicate low levels of freedom.
Some of the countries have missing data, such as Germany and Yemen, for various reasons. A true perfectionist would go and find and fill in the data manually.
Coming soon
There is one caveat with this function that we are using from the car
package:
recode
is also in the dplyr
package so R gets confused if you just type in recode
on its own; it doesn’t know which package you’re using.
So, you must write car::recode
(). This placates the R gods and they are clear which package to use.
It is useful for all other times you want to explicitly tell R which package you want it to use to avoid any confusion. Just type the package name followed by two :: colons and a list of all the functions in the package drops down. So really, it can also be useful for exploring new packages you’ve installed and loaded!
install.packages("car")library(car)
First, subset the dataframe, so we are only looking at countries in the year 1990.
data_90 <- data[which(data$year==1990),]
Next look at a frequency of each way that regimes around the world ended.
plyr::count(data_90$regime_end)
To understand these numbers, we look at the codebook.
We want to make a new binary variable to indicate whether a coup occurred in a country in 1990 or not.
To do this we use the car::recode()
function.
First we can make a numeric variable. So in the brackets, we indicate our dataframe at the start.
Next bit is important, we put all the original and new variables in ” ” inverted commas.
Also important that we separate each level of the new variable with a ; semicolon.
The punctuation marks in this function are a bit fussy and difficult but it is important.
data_90$coup_numeric <- car::recode(data_90$regime_end, "0:2 = 1; 3:13=0; NA=0")
Alternatively, we can recode the variable as a string output when we choose to make the new variable values in ‘ apostrophe marks’.
data_90$coup_string <- car::recode(data_90$regime_end, "0:2 = 'coup'; 3:13= 'no coup'; NA='no coup'")
If you want to convert a continuous variable to discrete factors, we can go to our trusty mutate() function in the dplyr
package. And within mutate() we use another function: cut()
So instead of recoding binary variables or factor variables . . . we can turn a numeric variable into a discrete variable with cut()
We specify with the breaks
argument to indicate where we want to divide the variable and then we can label the factors with the labels
argument:
data_90 <- data_90 %>% dplyr::mutate(instability_discrete = cut(instability_continuous, breaks=c(-Inf, 0.3, 0.7, Inf), labels=c("low_instability", "mid_instability", "high_instability")))
A quick hack to create a year variable from a string variable and place it as column number one in your dataframe.
First problem with my initial dataset is that the date is a string of numbers and I want the first four characters in the string.
data$year <- substr(data$date, 0, 4)data$year <- as.numeric(data$year)
Now I want to place it at the beginning to keep things more organised:
data = data %>% select(year, everything())
And we are done!
Much better.
TheManifesto Project maintains a database of political party manifestos for around 50 countries. It covers free and democratic elections since 1945 in various countries.
To access the manifestos, we install and load the manifestoR
package, which provides an API between R and The Manifesto Project site.
On the website, we can navigate the Manifesto Project database and search for any country for a given time period in the Data Dashboard section on the site. For example, I search for Ireland from 2012 until the present day and I can see the most recent manifestos put forward by the parties.
We can see the number code for each party. (We will need to use these when downloading the texts into R via the API).
Click here for the full CRAN PDF which gives more information for the manifestoR package.
So first, install the packages.
install.packages("manifestoR")library(manifestoR)
But we cannot just access the data right away.
In order to download any manifesto text from the database, we need to first set up an account on the website and download an API key. So first step to do is go on the website and sign up.
Then you go to your website profile and click download API Key file (txt).
Then we go back on R and write in:
mp_key <- mp_setapikey(file.choose())
Choose the txt file downloaded from the website … and hopefully you should be all set up to access all the manifesto text data.
Now, we can choose the manifestos we want to download.
Using the mp_corpus()
function, we can choose the country and date that we want and download lists of all the texts.
manifestos_2016 <- mp_corpus(countryname == "Ireland" & edate > as.Date("2016-02-01"))
Note that the date I enter into the mp_corpus()
function corresponds with the date from the Manifesto Project website. If there is a way to look this up directly through R, please let me know!
If we look at the manifestos_2016
object we just downloaded:
View(manifestos_2016)
We see we have ten lists. Say, for example, the party I want to look at is Fine Gael, I need the party ID code assigned by the Manifesto Project.
Similar to how I got the date, I can look up the Data Dashboard to find the party code for Fine Gael. Or I can search for the ID via this site.
It was funny to see that all the names of the Irish parties are in English, which I never hear! Fine Gael is Irish for Tribes of Ireland and I guess Family is another way to translate that.
The ID code for the party is 53520, which is the seventh list. So index this list and create a new tibble structure for the manifesto text.
fg_2016 <- as_tibble(manifestos_2016[7])View(fg_2016)
The cmp_code
refers to the value that coders from the Manifesto Project have assigned to each topic or policy position that the party puts forward in their text.
For example 104 means that the party speaks positively of the military, whereas 105 means that the party speaks of the military in a negative way.
I don’t know what the eu_code
is in reference to, but it is all blank in the 2016 coding…
In another post, I hope to write about text mining and sentiment analysis with manifestos. But I’ll leave that to another day.
An alternative way to download and store the manifestos is to download everything from the database:
all_manifestos <- mp_maindataset()
And I want to subset all Irish parties only:
ireland_manifestos <- all_manifestos[which(all_manifestos$countryname == "Ireland"),]
With these all ready, there are some really interesting functions we can run with the data and the coding of the texts by the Manifesto Project.
For example, the rile() function
. This calculates the Right Left Score.
Essentially, higher RILE scores indicate that the party leans more right on the ideological spectrum, with a maximum score of +100 if the whole manifesto is devoted to ‘right’ categories. Conversely, lower RILE scores indicate that the party leans more left (and a score of -100 would mean the entire manifesto puts forward exclusively ‘left’ categories)
Of course, it is a crude instrument to compress such a variety of social, political and economic positions onto a single dimension. But as long as we keep that caveat in mind, it is a handy shorthanded approach to categorising the different parties.
Additionally, Molder (2016) in his paper, “The validity of the RILE left–right index as a measure of party policy” argues that the index is not very valid. Additional researchers have also found that RILE index inaccurately places political parties in policy space as manifestos are not actual binding policies but rather directional signals and aspriations (see Pelizzo’s (2003) paper, “Party positions or party direction? An analysis of Party Manifesto Data” for more on this)
So take these figures with a grain of salt. But it is interesting to visualise the trends.
I continue subsetting until I have only the largest parties in Ireland and put them into big_parties
object. The graph gets a bit hectic when including all the smaller parties in the country since 1949. Like in most other countries, party politics is rarely simple.
Next I can simply create a new rile_index
variable and graph it across time.
big_parties$rile_index <- rile(big_parties)
The large chuck in the geom_text()
command is to only show the name of the party at the end of the line. Otherwise, the graph is far more busy and far more unreadable.
graph_rile <- big_parties %>%group_by(partyname) %>%ggplot(aes(x= as.Date(edate), y = rile_index, color=partyname)) +geom_point() + geom_line() +geom_text(data=. %>%arrange(desc(edate)) %>%group_by(partyname) %>%slice(1),aes(label=partyname), position=position_jitter(height=2), hjust = 0, size = 5, angle=40) +ggtitle("Relative Left Right Ideological Position of Major Irish Parties 1949 - 2016") +xlab("Year") + ylab("Right Left (RILE) Index")
While the graph is a bit on the small size, what jumps out immediately is that there has been a convergence of the main political parties toward the ideological centre. In fact, they are all nearing left of centre. The most right-wing a party has ever been in Ireland was Fine Gael in the 1950s, with a RILE score nearing 80. Given their history of its predecessor “Blueshirts” group, this checks out.
The Labour Party has consistently been very left wing, with its most left-leaning RILE score of -40 something in the early 1950s and again in early 1980s.
Ireland joined the European Union in 1978, granted free third level education for all its citizens since the 1990s and in genenral, has seen a consistent trend of secularisation in society, these factors all could account for the constricting lines converging in the graph for various socio-economic reasons.
In recent years Ireland has become more socially liberal (as exemplified by legalisation of abortion, legalisation of same sex marriage) so these lines do not surprise. Additionally, we do not have full control over monetary policy since joining the euro, so again, this mitigates the trends of extreme economic positions laid out in manifestos.
References
Mölder, M. (2016). The validity of the RILE left–right index as a measure of party policy.Party Politics,22(1), 37-48.
Pelizzo, R. (2003). Party positions or party direction? An analysis of party manifesto data.West European Politics,26(2), 67-89.
This blog will run through how to make a word cloud with Mill’s “On Liberty”, a treatise which argues that the state should never restrict people’s individual pursuits or choices (unless such choices harm others in society).
First, we install and load the gutenbergr
package to access the catalogue of books from Project Gutenburg . This gutenberg_metadata
function provides access to the website and its collection of around 60,000 digitised books in the public domain, for which their U.S. copyright has expired. This website is an amazing resource in its own right.
install.packages("gutenbergr")library(gutenbergr)
Next we choose a book we want to download. We can search through the Gutenberg Project catalogue (with the help of the dplyr
package). In the filter( )
function, we can search for a book in the library by supplying a string search term in “quotations”. Click here to see the CRAN package PDF. For example, we can look for all the books written by John Stuart Mill (search second name, first name) on the website:
mill_all <- gutenberg_metadata %>% filter(author = "Mill, John Stuart")
Or we can search for the title of the book:
mill_liberty <- gutenberg_metadata %>% filter(title = "On Liberty")
We now have a tibble of all the sentences in the book!
View(mill_liberty)
We see there are two variables in this new datafram and 4,703 string rows.
To extract every word as a unit, we need the unnest_tokens( )
function from the tidytext
package:
install.packages("tidytext")library(tidytext)
We take our mill_liberty object
from above and indicate we want the unit to be words from the text. And we create a new mill_liberty_words
object to hold the book in this format.
mill_liberty_words <- mill_liberty %>% unnest_tokens(word, text) %>% anti_join(stop_words)
We now have a row for each word, totalling to 17,576 words! This excludes words such as “the”, “of”, “to” and all those small sentence builder words.
Now we have every word from “On Liberty”, we can see what words appear most frequently! We can either create a list with the count( ) function:
count_word <- mill_liberty_words %>% count(word, sort = TRUE)
The default for a tibble object is printing off the first ten observations. If we want to see more, we can increase the n in our print argument.
print(liberty_words, n=30)
An alternative to this is making a word cloud to visualise the relative frequencies of these terms in the text.
For this, we need to install the wordcloud
package.
install.packages("wordcloud")library(wordcloud)
To get some nice colour palettes, we can also install the RColorBrewer
package also:
install.packages("RColorBrewer")library(RColorBrewer)
Check out the CRAN PDF on the wordcloud package to tailor your specifications.
For example, the rot.per
argument indicates proportion words we want with 90 degree rotation. In my example, I have 30% of the words being vertical. I reran the code until the main one was horizontal, just so it pops out more.
With the scale
option, we can indicate the range of the size of the words (for example from size 4 to size 0.5) in the example below
We can choose how many words we want to include in the wordcloud with the max.words
argument
color_number <- 20color_palette <- colorRampPalette(brewer.pal(8, "Paired"))(color_number)wordcloud(words = mill_liberty_words$word, min.freq = 2, scale = c(4, 0.5) max.words=200, random.order=FALSE, rot.per=0.3, colors=color_palette)
We can see straightaway the most frequent word in the book is opinion. Given that this book forms one of the most rigorous defenses of the idea of freedom of speech, a free press and therefore against the a priori censorship of dissent in society, these words check out.
If we run the code with random.order=TRUE
option, the cloud would look like this:
And you can play with proportions, colours, sizes and word placement until you find one you like!
This word cloud highlights the most frequently used words in John Stuart Mill’s “Utilitarianism”:
Google Trendsis a search trends feature. It shows how frequently a given search term is entered intoGoogle’ssearch engine, relative to the site’s total search volume over a given period of time.
( So note: because the results are all relative to the other search terms in the time period, the dates you provide to the gtrendsR
function will change the shape of your graph and the relative percentage frequencies on the y axis of your plot).
To scrape data from Google Trends, we use the gtrends()
function from the gtrendsR
package and the get_interest()
function from the trendyy
package (a handy wrapper package for gtrendsR
).
If necessary, also load the tidyverse and ggplot packages.
library(tidyverse)library(gtrendsR)library(trendyy)
To scrape the Google trend data, call the trendy()
function and write in the search terms.
We can look at relative search hits for Yevgeny Prigozhin, leader of Russian mercenary army, Wagner Group.
prig <- trendy("Prigozhin", "2022-08-25", "2023-08-26") %>% get_interest()ggplot(data = prig, aes(x = as.Date(date), y = hits)) + geom_line(colour = "#005f73", size = 6.5, alpha = 0.1) + geom_line(colour = "#005f73", size = 4, alpha = 0.3) + geom_line(colour = "#005f73", size = 3, alpha = 0.5) + geom_line(colour = "#005f73", size = 2) + ylab(label = 'Relative Hits %') + bbplot::bbc_style() + xlab(label = "Search Dates") + ylab(label = 'Relative Hits %') + labs(title = "Relative hits for `Prigozhin` on Google", subtitle = "Data from Google search hits") + geom_curve(aes(x = as.Date("2021-10-01"), y = 45, xend = as.Date("2022-02-01"), yend = 30), size = 2, alpha = 0.8, color = "#001219", arrow = arrow(type = "closed"))
For the next example, here we search for the term “Kamala Harris” during the period from 1st of January 2019 until today.
If you want to check out more specifications, for the package, you can check out the package PDF here. For example, we can change the geographical region (US state or country for example) with the geo
specification.
We can also change the parameters of the time
argument, we can specify the time span of the query with any one of the following strings:
- “now 1-H” (previous hour)
- “now 4-H” (previous four hours)
- “today+5-y” last five years (default)
- “all” (since the beginning of Google Trends (2004))
If don’t supply a string, the default is five year search data.
kamala <- trendy("Kamala Harris", "2019-01-01", "2020-08-13") %>% get_interest()
We call the get_interest()
function to save this data from Google Trends into a data.frame version of the kamala
object. If we didn’t execute this last step, the data would be in a form that we cannot use with ggplot().
View(kamala)
In this data.frame, there is adate
variable for each week and ahits
variable that shows the interest during that week. Remember, this hits
figure shows how frequently a given search term is entered intoGoogle’ssearch engine relative to the site’s total search volume over a given period of time.
We will use these two variables to plot the y and x axis.
To look at the search trends in relation to the events during the Kamala Presidential campaign over 2019, we can add vertical lines along the date axis, with a data.frame, we can call kamala_events.
kamala_events = data.frame(date=as.Date(c("2019-01-21", "2019-06-25", "2019-12-03", "2020-08-12")), event=c("Launch Presidential Campaign", "First Primary Debate", "Drops Out Presidential Race", "Chosen as Biden's VP"))
Note the very specific order the as.Date()
function requires.
Next, we can graph the trends, using the above date and hits variables:
ggplot(kamala, aes(x = as.Date(date), y = hits)) + geom_line(colour = "steelblue", size = 2.5) + geom_vline(data=kamala_events, mapping=aes(xintercept=date), color="red") + geom_text(data=kamala_events, mapping=aes(x=date, y=0, label=event), size=4, angle=40, vjust=-0.5, hjust=0) + xlab(label = "Search Dates") + ylab(label = 'Relative Hits %')
Which produces:
I can update the graph
Cairo::CairoWin()ggplot(data = kamala, aes(x = as.Date(date), y = hits)) + geom_vline(data = kamala_events, mapping = aes(xintercept = date), size = 11, alpha = 0.1, color = "#9b2226") + geom_vline(data = kamala_events, mapping = aes(xintercept = date), size = 8, alpha = 0.2, color = "#9b2226") + geom_vline(data = kamala_events, mapping = aes(xintercept = date), size = 7, alpha = 0.3, color = "#9b2226") + geom_vline(data = kamala_events, mapping = aes(xintercept = date), size = 4, alpha = 0.4, color = "#9b2226") + geom_line(colour = "#005f73", size = 5.5, alpha = 0.4) + geom_line(colour = "#005f73", size = 5, alpha = 0.5) + geom_line(colour = "#005f73", size = 4, alpha = 0.6) + geom_line(colour = "#005f73", size = 3) + geom_text(data = kamala_events, mapping = aes(x = date, y = 0, label = event), size = 9, angle = 10, vjust = -4.6, hjust = 0.7) + bbplot::bbc_style() + xlab(label = "Search Dates") + ylab(label = 'Relative Hits %') + labs(title = "Relative hits for `Kamala Harris` on Google", subtitle = "Data from Google search hits") + theme(plot.title = element_text(size = 40), plot.subtitle = element_text(size = 20)) + geom_curve(aes(x = as.Date("2018-12-01"), y = 88, xend = as.Date("2019-01-15"), yend = 99), size = 2, alpha = 0.8, color = "#001219", curvature = -0.4, angle = 90, arrow = arrow(type = "closed")) + geom_curve(aes(x = as.Date("2019-05-01"), y = 69, xend = as.Date("2019-06-15"), yend = 70), size = 2, alpha = 0.8, color = "#001219", curvature = 0.7, angle = 90, arrow = arrow(type = "closed")) + geom_curve(aes(x = as.Date("2019-10-15"), y = 69, xend = as.Date("2019-11-20"), yend = 46), size = 2, alpha = 0.8, color = "#001219", curvature = 0.3, angle = 90, arrow = arrow(type = "closed"))
Super easy and a quick way to visualise the ups and downs of Kamala Harris’ political career over the past few months, operationalised as the relative frequency with which people Googled her name.
If I had chosen different dates, the relative hits as shown on the y axis would be different! So play around with it and see how the trends change when you increase or decrease the time period.
Without examining interaction effects in your model, sometimes we are incorrect about the real relationship between variables.
This is particularly evident in political science when we consider, for example, the impact of regime type on the relationship between our dependent and independent variables. The nature of the government can really impact our analysis.
For example, I were to look at the relationship between anti-government protests and executive bribery.
I would expect to see that the higher the bribery score in a country’s government, the higher prevalence of people protesting against this corrupt authority. Basically, people are angry when their government is corrupt. And they make sure they make this very clear to them by protesting on the streets.
First, I will describe the variables I use and their data type.
With the dependent variable democracy_protest
being an interval score, based upon the question: In this year, how frequent and large have events of mass mobilization for pro-democratic aims been?
The main independent variable is another interval score on executive_bribery scale and is based upon the question: How clean is the executive (the head of government, and cabinet ministers), and their agents from bribery (granting favors in exchange for bribes, kickbacks, or other material inducements?)
Higher scores indicate cleaner governing executives.
So, let’s run a quick regression to examine this relationship:
summary(protest_model <- lm(democracy_protest ~ executive_bribery, data = data_2010))
Examining the results of the regression model:
We see that there is indeed a negative relationship. The cleaner the government, the less likely people in the country will protest in the year under examination. This confirms our above mentioned hypothesis.
However, examining the R2, we see that less than 1% of the variance in protest prevalence is explained by executive bribery scores.
Not very promising.
Is there an interaction effect with regime type? We can look at a scatterplot and see if the different regime type categories cluster in distinct patterns.
The four regime type categories are
- purple: liberal democracy (such as Sweden or Canada)
- teal: electoral democracy (such as Turkey or Mongolia)
- khaki green: electoral autocracy (such as Georgia or Ethiopia)
- red: closed autocracy (such as Cuba or China)
The color clusters indicate regime type categories do cluster.
- Liberal democracies (purple) cluster at the top left hand corner. Higher scores in clean executive index and lower prevalence in pro-democracy protesting.
- Electoral autocracies (teal) cluster in the middle.
- Electoral democracies (khaki green) cluster at the bottom of the graph.
- The closed autocracy countries (red) seem to have a upward trend, opposite to the overall best fitted line.
So let’s examine the interaction effect between regime types and executive corruption with mass pro-democracy protests.
Plot the model and add the * interaction effect:
summary(protest_model_2 <-lm(democracy_protest ~ executive_bribery*regime_type, data = data_2010))
Adding the regime type variable, the R2 shoots up to 27%.
The interaction effect appears to only be significant between clean executive scores and liberal democracies. The cleaner the country’s executive, the prevalence of mass mobilization and protests decreases by -0.98 and this is a statistically significant relationship.
The initial relationship we saw in the first model, the simple relationship between clean executive scores and protests, has disappeared. There appears to be no relationship between bribery and protests in the semi-autocratic countries; (those countries that are not quite democratic but not quite fully despotic).
Let’s graph out these interactions.
In the plot_model()
function, first type the name of the model we fitted above, protest_model
.
Next, choose the type
. For different type arguments, scroll to the bottom of this blog post. We use the type = "pred"
argument, which plots the marginal effects.
Marginal effects tells us how adependent variable changes when a specificindependent variable changes, if other covariates are held constant.The two terms typed here are the two variables we added to the model with the * interaction term.
install.packages("sjPlot")library(sjPlot)plot_model(protest_model, type = "pred", terms = c("executive_bribery", "regime_type"), title = 'Predicted values of Mass Mobilization Index', legend.title = "Regime type")
Looking at the graph, we can see that the relationship changes across regime type. For liberal democracies (purple), there is a negative relationship. Low scores on the clean executive index are related to high prevalence of protests. So, we could say that when people in democracies see corrupt actions, they are more likely to protest against them.
However with closed autocracies (red) there is the opposite trend. Very corrupt countries in closed autocracies appear to not have high levels of protests.
This would make sense from a theoretical perspective: even if you want to protest in a very corrupt country, the risk to your safety or livelihood is often too high and you don’t bother. Also the media is probably not free so you may not even be aware of the extent of government corruption.
It seems that when there are no democratic features available to the people (free media, freedom of assembly, active civil societies, or strong civil rights protections, freedom of expression et cetera) the barriers to protesting are too high. However, as the corruption index improves and executives are seen as “cleaner”, these democratic features may be more accessible to them.
If we only looked at the relationship between the two variables and ignore this important interaction effects, we would incorrectly say that as
Of course, panel data would be better to help separate any potential causation from the correlations we can see in the above graphs.
The blue line is almost vertical. This matches with the regression model which found the coefficient in electoral autocracy is 0.001. Virtually non-existent.
Different Plot Types
type = "std"
– Plots standardized estimates.
type = "std2"
– Plots standardized estimates, however, standardization follows Gelman’s (2008) suggestion, rescaling the estimates by dividing them by two standard deviations instead of just one. Resulting coefficients are then directly comparable for untransformed binary predictors.
type = "pred"
– Plots estimated marginal means (or marginal effects). Simply wrapsggpredict
.
type = "eff"
– Plots estimated marginal means (or marginal effects). Simply wrapsggeffect
.
type = "slope"
andtype = "resid"
– Simple diagnostic-plots, where a linear model for each single predictor is plotted against the response variable, or the model’s residuals. Additionally, a loess-smoothed line is added to the plot. The main purpose of these plots is to check whether the relationship between outcome (or residuals) and a predictor is roughly linear or not. Since the plots are based on a simple linear regression with only one model predictor at the moment, the slopes (i.e. coefficients) may differ from the coefficients of the complete model.
type = "diag"
– ForStan-models, plots the prior versus posterior samples. Forlinear (mixed) models, plots for multicollinearity-check (Variance Inflation Factors), QQ-plots, checks for normal distribution of residuals and hom*oscedasticity (constant variance of residuals) are shown. Forgeneralized linear mixed models, returns the QQ-plot for random effects.
Packages we will need:
install.packages("car")library(car)
When one independent variable ishighlycorrelated with another independent variable (or with a combination of independent variables), the marginal contribution of that independent variable is influenced by other predictor variables in the model.
And so, as a result:
- Estimates for regression coefficients of the independent variables can be unreliable.
- Tests of significance for regression coefficients can be misleading.
To check for multicollinearity problem in our model, we need the vif()
function from the car
package in R. VIF stands for variance inflation factor. It measures how much the variance of any one of the coefficients is inflated due to multicollinearity in the overall model.
As a rule of thumb, a vif
score over 5 is a problem. A score over 10 should be remedied and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables.
This blog post will look only at the VIF score. Click here to look at how to interpret various other multicollinearity tests in the mctest package in addition to the the VIF score.
Back to our model, I want to know whether countries with high levels of clientelism, high levels of vote buying and low democracy scores lead to executive embezzlement?
So I fit a simple linear regression model (and look at the output with the stargazer
package)
summary(embezzlement_model_1 <- lm(executive_embezzlement ~ clientelism_index + vote_buying_score + democracy_score, data = data_2010))stargazer(embezzlement_model_1, type = "text")
I suspect that clientelism and vote buying variables will be highly correlated. So let’s run a test of multicollinearity to see if there is any problems.
car::vif(embezzlement_model_1)
The VIF score for the three independent variables are :
Both clientelism index and vote buying variables are both very high and the best remedy is to remove one of them from the regression. Since vote buying is considered one aspect of clientelist regime so it is probably overlapping with some of the variance in the embezzlement score that the clientelism index is already explaining in the model
So re-run the regression without the vote buying variable.
summary(embezzlement_model_2 <- lm(v2exembez ~ v2xnp_client + v2x_api, data = vdem2010))stargazer(embezzlement_model_2, embezzlement_model_2, type = "text")car::vif(embezzlement_mode2)
Comparing the two regressions:
And running a VIF test on the second model without the vote buying variable:
car::vif(embezzlement_model_2)
These scores are far below 5 so there is no longer any big problem of multicollinearity in the second model.
Click here to quickly add VIF scores to our regression output table in R with jtools
package.
Plus, looking at the adjusted R2, which compares two models, we see that the difference is very small, so we did not lose much predictive power in dropping a variable. Rather we have minimised the issue of highly correlated independent variables and thus an inability to tease out the real relationships with our dependent variable of interest.
tl;dr: As a rule of thumb, a vif
score over 5 is a problem. A score over 10 should be remedied (and you should consider dropping the problematic variable from the regression model or creating an index of all the closely related variables).
Click here to run stepwise regression analysis to help decide which problematic variables we can drop from our model (based on AIC scores)
Packages we will need:
library(sandwich)library(stargazer)library(lmtest)
If our OLS model demonstrates high level of heteroskedasticity (i.e. when the error term of our model is not randomly distributed across observations and there is a discernible pattern in the error variance), we run into problems.
Why? Because this means OLS will use sub-optimalestimators based on incorrect assumptions and the standard errors computed using these flawed least square estimators are more likely to be under-valued.
Since standard errors are necessary to compute our t – statistic and arrive at our p – value, these inaccurate standard errors are a problem.
Click here to check for heteroskedasticity in your model with the lmtest package.
To correct for heteroskedastcity in your model, you need the sandwich package and the lmtest package to employ the vcocHC argument.
First, let’s fit a simple OLS regression.
summary(free_express_model <- lm(freedom_expression ~ free_elections + deliberative_index, data = data_1990))
We can see that there is a small star beside the main dependent variable of interest! Success!
We have significance.
Thus, we could say that the more free and fair the elections a country has, this increases the mean freedom of expression index score for that country.
This ties in with a very minimalist understanding of democracy. If a country has elections and the populace can voice their choice of leadership, this will help set the scene for a more open society.
However, it is naive to look only at the p – value of any given coefficient in a regression output. If we run some diagnostic analyses and look at the relationship graphically, we may need to re-examine this seemingly significant relationship.
Can we trust the 0.087 standard error score that our OLS regression calculated? Is it based on sound assumptions?
First let’s look at the residuals. Can we assume that the variance of error is equal across all observations?
If we examine the residuals (the first graph), we see that there is actually a tapered fan-like pattern in the error variance. As we move across the x axis, the variance along the y axis gets continually smaller and smaller.
The error does not look random.
Let’s run a Breush-Pagan test (from the lmtest package) to check our suspicion of heteroskedasticity.
lmtest::bptest(free_exp_model)
We can reject the null hypothesis that the error variance is hom*oskedastic.
So the model does suffer from heteroskedasticty. We cannot trust those stars in the regression output!
In order to fix this and make our p-values more accuarate, we need to install the sandwich package to feed in the vcovHC
adjustment into the coeftest()
function.
vcovHC
stands for variance covariance Heteroskedasticity Consistent.
With the stargazer package (which prints out all the models in one table), we can compare the free_exp_model
alone with no adjustment, then four different variations of the vcovHC
adjustment using different formulae (as indicated in the type
argument below).
stargazer(free_exp_model, coeftest(free_exp_model, vcovHC(free_exp_model, type = "HC0")), coeftest(free_exp_model, vcovHC(free_exp_model, type = "HC1")), coeftest(free_exp_model, vcovHC(free_exp_model, type = "HC2")), coeftest(free_exp_model, vcovHC(free_exp_model, type = "HC3")), type = "text")
Looking at the standard error in the (brackets) across the OLS and the coeftest models, we can see that the standard error are all almost double the standard error from the original OLS regression.
There is a tiny difference between the different types of Heteroskedastic Consistent (HC) types.
The significant p – value disappears from the free and fair election variable when we correct with the vcovHC
correction.
The actual coefficient stays the same regardless of whether we use no correction or any one of the correction arguments.
Which HC estimator should I use in my vcovHC() function?
The default in the sandwich package is HC3.
STATA users will be familiar with HC1, as it is the default robust standard error correction when you add robust at the end of the regression command.
The difference between them is not very large.
The estimator HC0 was suggested in the econometrics literature by White in 1980 and is justified by asymptotic arguments.
For small sample sizes, the standard errors from HC0 are quite biased, usually downward, and this results in overly liberal inferences in regression models (Bera, Suprayitno & Premaratne, 2002). But with HC0, the bias shrinks as your sample size increases.
The estimator types HC1, HC2 and HC3 were put forward by MacKinnon and White (1985) to improve the performance in small samples.
Long and Ervin (2000) furthermore argue that HC3 provides the best performance in small samples as it gives less weight to influential observations in the model
In our freedom of expression regression, the HC3 estimate was the most conservative with the standard error calculations. however the difference between the approaches did not change the conclusion; ultimately the main independent variable of interest in this analysis – free and fair elections – can explain variance in the dependent variable – freedom of expression – does not find evidence in the model.
Click here to read an article by Hayes and Cai (2007) which discusses the matrix formulae and empirical differences between the different calculation approaches taken by the different types. Unfortunately it is all ancient Greek to me.
References
Bera, A. K., Suprayitno, T., & Premaratne, G. (2002). On some heteroskedasticity-robust estimators of variance–covariance matrix of the least-squares estimators.Journal of Statistical Planning and Inference,108(1-2), 121-136.
Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation.Behavior research methods,39(4), 709-722.
Long, J. S., & Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model.The American Statistician,54(3), 217-224.
MacKinnon, J. G., & White, H. (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties.Journal of econometrics,29(3), 305-325.
One core assumption when calculating ordinary least squares regressions is that all the random variables in the model have equal variance around the best fitting line.
Essentially, when we run an OLS, we expect that the error terms have no fan pattern.
Example of hom*oskedasticiy
So let’s look at an example of this assumption being satisfied. I run a simple regression to see whether there is a relationship between and media censorship and civil society repression in 178 countries in 2010.
ggplot(data_010, aes(media_censorship, civil_society_repression)) + geom_point() + geom_smooth(method = "lm") + geom_text(size = 3, nudge_y = 0.1, aes(label = country))
If we run a simple regression
summary(repression_model <- lm(media_censorship ~ civil_society_repression, data = data_2010))stargazer(repression_model, type = "text")
This is pretty common sense; a country that represses its citizens in one sphere is more likely to repress in other areas. In this case repressing the media correlates with repressing civil society.
We can plot the residuals of the above model with the autoplot() function from the ggfortify package.
library(ggfortify)autoplot(repression_model)
Nothing unusual appears to jump out at us with regard to evidence for heteroskedasticity!
In the first Residuals vs Fitted plot, we can see that blue line does not drastically diverge from the dotted line (which indicates residual value = 0).
The third plot Scale-Location shows again that there is no drastic instances of heteroskedasticity. We want to see the blue line relatively horizontal. There is no clear pattern in the distribution of the residual points.
In the Residual vs. Leverage plot (plot number 4), the high leverage observation 19257 is North Korea! A usual suspect when we examine model outliers.
While it is helpful to visually see the residuals plotted out, a more objective test can help us find whether the model is indeed free from heteroskedasticity problems.
For this we need the Breusch-Pagan test for heteroskedasticity from the lmtest package.
install.packages("lmtest)library(lmtest)bptest(repression_model)
The default in R is the studentized Breusch-Pagan. However if you add the studentize = FALSE argument, you have the non-studentized version
The null hypothesis of the Breusch-Pagan test is that the variance in the model is hom*oskedastic.
With our repression_model, we cannot reject the null, so we can say with confidence that there is no major heteroskedasticity issue in our model.
The non-studentized Breusch-Pagan test makes a very big assumption that the error term is from Gaussian distribution. Since this assumption is usually hard to verify, the default bptest() in R “studentizes” the test statistic and provide asymptotically correct significance levels for distributions forerror.
Why do we care about heteroskedasticity?
If our model demonstrates high level of heteroskedasticity (i.e. the random variables have non-random variation across observations), we run into problems.
Why?
- OLS uses sub-optimalestimators based on incorrect assumptions and
- The standard errors computed using these flawed least square estimators are more likely to be under-valued. Since standard errors are necessary to compute our t – statistics and arrive at our p – value, these inaccurate standard errors are a problem.
Example of heteroskedasticity
Let’s look at an example of this hom*oskedasticity assumption NOT being satisfied.
I run a simple regression to see whether there is a relationship between democracy score and respect for individuals’ private property rights in 178 countries in 2010.
When you are examining democracy as the main dependent variable, heteroskedasticity is a common complaint. This is because all highly democratic countries are all usually quite similar. However, when we look at autocracies, they are all quite different and seem to march to the beat of their own despotic drum. We cannot assume that the random variance across different regime types is equally likely.
First, let’s have a look at the relationship.
prop_graph <- ggplot(vdem2010, aes(v2xcl_prpty, v2x_api)) + geom_point(size = 3, aes(color = factor(regime_type))) + geom_smooth(method = "lm")prop_graph + scale_colour_manual(values = c("#D55E00", "#E69F00", "#009E73", "#56B4E9"))
Next, let’s fit the model to examine the relationship.
summary(property_model <- lm(property_score ~ democracy_score, data = data_2010))stargazer(property_model, type = "text")
To plot the residuals (and other diagnostic graphs) of the model, we can use the autoplot() function to look at the relationship in the model graphically.
autoplot(property_model)
Graph number 1 plots the residuals against the fitted model and we can see that lower values on the x – axis (fitted values) correspond with greater spread on the y – axis. Lower democracy scores relate to greater error on property rights index scores. Plus the blue line does not lie horizontal and near the dotted line. It appears we have non-random error patterns.
Examining the Scale – Location graph (number 3), we can see that the graph is not horizontal.
Again, interpreting the graph can be an imprecise art. So a more objective approach may be to run the bptest().
bptest(property_model)
Since the p – value is far smaller than 0.05, we can reject the null of hom*oskedasticity.
Rather, we have evidence that our model suffers from heteroskedasticity. The standard errors in the regression appear smaller than they actually are in reality. This inflates our t – statistic and we cannot trust our p – value.
In the next blog post, we can look at ways to rectify this violation of hom*oskedasticity and to ensure that our regression output has more accurate standard errors and therefore more accurate p – values.
Click here to use the sandwich package to fix heteroskedasticity in the OLS regression.
Political scientists are beginning to appreciate that multiple imputation represents a better strategy for analysing missing data to the widely used method of listwise deletion.
A very clear demonstration of this was a 2016 article by Ranjit Lall, an political economy professor in LSE. He essentially went back and examined the empirical results of multiple imputation in comparison to the commonplace listwise deletion in political science.
He did this by re-running comparative political economy studies published over a five-year period in International Organization and World Politics.
Shockingly, in almost half of the studies he re-ran, Lall found that most key results “disappeared” (by conventional statistical standards) when reanalyzed with multiple imputations rather than listwise deletion.
This is probably due to the fact that it is erroneous to assume that missing data is random and equally distributed among the overall data.
Listwise deletion involves omitting observations with missing values on any variable. This ultimately produces inefficient inferences as it is difficult to believe the assumption that the pattern of missing data is actually completely random.
This blog post will demonstrate a package for imputing missing data in a few lines of code.
Unlike what I initially thought, the name has nothing to do with the tiny rodent, MICE stands for Multivariate Imputation via Chained Equations.
Rather than abruptly deleting missing values, imputation uses information given from the non-missing predictors to provide an estimate of the missing values.
The mice package imputes in two steps. First, usingmice()
to build the model and subsequently callcomplete()
to generate the final dataset.
Themice()
function produces many complete copies of a dataset, each with different imputations of the missing data. Then the complete()
function returns these data sets, with the default being the first.
So first install and load the package:
install.packages("mice")library(mice)
You can check whether any variables in your potential model have an NAs (i.e. missing values) with anyNA()
function.
anyNA(data$clientelism)
If there are missing values, then you can go on ahead with imputing them. First create a new object to store the multiple imputed versions of your dataset.
This iteration process takes a while, depending on how many variables you have in your data.frame. My data
data.frame had about six variables so this stage took about three or four minutes to complete. I was distracted by Youtube for a bit, so I am not exactly sure. I imagine a very large dataset with hundreds of variables would make my computer freak out.
All the variables with missing values in my data.frame were continuous numerical values. I chose the method = "cart"
, which stands for classification and regression trees which appears quite versatile.
imputed_data <- mice(data, method="cart")
A CART is a predictive algorithm that determines how a given variable’s values can be predicted based on other values.
It is composed of decision trees where each fork is a split in a predictor variable and each node at the end has a prediction for the target variable.
After this iterative process is complete and the command has finished running, we then use the complete() function and assign the resulting data.frame to a new object. I call it full_data
full_data <- complete(imputed_data)
I ran a quick regression to see what effect the new fully imputed data.frame had on the relationship. I could have taken a bit longer and found a result that changed as a result of the data imputation step ( as was shown in the above mentioned Lall (2016) paper) but I decided to just stick with my first shot.
We can see that the model with the imputed values have increased the total number of values by about 3,000 or so.
Given that I already have a very large n size, it is not expected that many of thecoefficients would change drastically by adding a small percentage of imputed values. However, we see that the standard error (yay) and the coefficient value decreased (meh). Additionally the R2 (by a tiny amount) decreased (weh).
I chose the cart method but there are many of method options, depending on the characteristics of the data with missing values.
Built-in univariate imputation methods are:
pmm | any | Predictive mean matching |
midastouch | any | Weighted predictive mean matching |
sample | any | Random sample from observed values |
cart | any | Classification and regression trees |
rf | any | Random forest imputations |
mean | numeric | Unconditional mean imputation |
norm | numeric | Bayesian linear regression |
norm.nob | numeric | Linear regression ignoring model error |
norm.boot | numeric | Linear regression using bootstrap |
norm.predict | numeric | Linear regression, predicted values |
quadratic | numeric | Imputation of quadratic terms |
ri | numeric | Random indicator for nonignorable data |
logreg | binary | Logistic regression |
logreg.boot | binary | Logistic regression with bootstrap |
polr | ordered | Proportional odds model |
polyreg | unordered | Polytomous logistic regression |
lda | unordered | Linear discriminant analysis |
2l.norm | numeric | Level-1 normal heteroscedastic |
2l.lmer | numeric | Level-1 normal hom*oscedastic, lmer |
2l.pan | numeric | Level-1 normal hom*oscedastic, pan |
2l.bin | binary | Level-1 logistic, glmer |
2lonly.mean | numeric | Level-2 class mean |
2lonly.norm | numeric | Level-2 class normal |
Well this is just delightful! This package was created by Karthik Ram.
install.packages("wesanderson")library(wesanderson)library(hrbrthemes) # for plot themeslibrary(gapminder) # for datalibrary(ggbump) # for the bump plot
After you install the wesanderson
package, you can
- create a ggplot2 graph object
- choose the Wes Anderson color scheme you want to use and create a palette object
- add the graph object and and the palette object and behold your beautiful data
wes_palette(name, n, type = c("discrete", "continuous"))
To generate a vector of colors, the wes_palette()
function requires:
name
: Name of desired paletten
: Number of colors desired (i.e. how many categories so n = 4).
The names of all the palettes you can enter into the wes_anderson()
function
Click here to look at more palette and theme packages that are inspired by Studio Ghibli, Dutch Master painters, Google, AirBnB, the Economist and Wall Street Journal aesthetics and appearances.
We can use data from the gapminder
package. We will look at the scatterplot between life expectancy and GDP per capita.
We feed the wes_palette()
function into the scale_color_manual()
with the values = wes_palette()
argument.
We indicate that the colours would be the different geographic regions.
If we indicate fill in the geom_point()
arguments, we would change the last line to scale_fill_manual()
We can log the gapminder variables with the mutate(across(where(is.numeric), log))
. Alternatively, we could scale the axes when we are at the ggplot section of the code with the scale_*_continuous(trans='log10')
gapminder %>% filter(continent != "Oceania") %>% mutate(across(where(is.numeric), log)) %>% ggplot(aes(x = lifeExp, y = gdpPercap)) + geom_point(aes(color = continent), size = 3, alpha = 0.8) + # facet_wrap(~factor(continent)) + hrbrthemes::theme_ft_rc() + ggtitle("GDP per capita and life expectancy") + theme(legend.title = element_blank(), legend.text = element_text(size = 20), plot.title = element_text(size = 30)) + scale_color_manual(values= wes_palette("FantasticFox1", n = 4)) +
Next looking at bump plot of OECD data with the Royal Tenanbaum’s colour palette.
Click here to read more about the OECD dataset.
trust %>% filter(country_name == "Ireland" | country_name == "Sweden" | country_name == "Germany" | country_name == "Spain" | country_name == "Belgium") %>% group_by(year) %>% mutate(rank_budget = rank(-trust, ties.method = "min"), rank_budget = as.factor(rank_budget)) %>% ungroup() %>% ggplot(aes(x = year, y = reorder(rank_budget, desc(rank_budget)), group = country_name, color = country_name, fill = country_name)) + geom_bump(aes(), smooth = 7, size = 5, alpha = 0.9) + geom_point(aes(color = country_name), fill = "white", shape = 21, size = 5, stroke = 5) + labs(title = "Level of trust in government", subtitle = "Data Source: OECD") + theme(panel.border = element_blank(), legend.position = "bottom", plot.title = element_text(size = 30), legend.title = element_blank(), legend.text = element_text(size = 20, color = "white"), axis.text.y = element_text(size = 20), axis.text.x = element_text(size = 20), legend.background = element_rect(fill = "#5e6472"), axis.title = element_blank(), axis.text = element_text(color = "white", size = 10), text= element_text(size = 15, color = "white"), panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), legend.key = element_rect(fill = "#5e6472"), plot.background = element_rect(fill = "#5e6472"), panel.background = element_rect(fill = "#5e6472")) + guides(colour = guide_legend(override.aes = list(size=8))) + scale_color_manual(values= wes_palette("Royal2", n = 5)) + scale_x_continuous(n.breaks = 20)
Click here to read more about the bump plot from the ggbump package.
Last, we can look at Darjeeling colour palette.
trust %>% filter(country_name == "Ireland" | country_name == "Germany" | country_name == "Sweden"| country_name == "Spain" | country_name == "Belgium") %>% ggplot(aes(x = year, y = trust, group = country_name)) + geom_line(aes(color = country_name), size = 3) + geom_point(aes(color = country_name), fill = "white", shape = 21, size = 5, stroke = 5) + labs(title = "Level of trust in government", subtitle = "Data Source: OECD") + theme(panel.border = element_blank(), legend.position = "bottom", plot.title = element_text(size = 30), legend.title = element_blank(), legend.text = element_text(size = 20, color = "white"), axis.text.y = element_text(size = 20), axis.text.x = element_text(size = 20), legend.background = element_rect(fill = "#5e6472"), axis.title = element_blank(), axis.text = element_text(color = "white", size = 10), text= element_text(size = 15, color = "white"), panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank(), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), legend.key = element_rect(fill = "#5e6472"), plot.background = element_rect(fill = "#5e6472"), panel.background = element_rect(fill = "#5e6472")) + guides(colour = guide_legend(override.aes = list(size=8))) + scale_color_manual(values= wes_palette("Darjeeling1", n = 5)) + scale_x_continuous(n.breaks = 20)
Last we can look at a bar chart counting different regime types in the eighteenth century.
eighteenth_century <- data_1880s %>%filter(!is.na(regime)) %>%filter(!is.na(appointment)) %>%ggplot(aes(appointment)) + geom_bar(aes(fill = factor(regime)), position = position_stack(reverse = TRUE)) + theme(legend.position = "top", text = element_text(size=15), axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0))
Both the regime variable and the appointment variable are discrete categories so we can use the geom_bar() function. When adding the palette to the barplot object, we can use the scale_fill_manual()
function.
eighteenth_century + scale_fill_manual(values = wes_palette("Darjeeling1", n = 4)
Now to compare the breakdown with countries in the 21st century (2000 to present)
Sometimes the best way to examine the relationship between our variables of interest is to plot it out and give it a good looking over. For me, it’s most helpful to see where different countries are in relation to each other and to see any interesting outliers.
For this, I can use the geom_text()
function from the ggplot2 package.
I will look at the relationship between economic globalization and social globalization in OECD countries in the year 2000.
TheKOF Globalisation Index, introduced by Dreher (2006) measuresglobalizationalong theeconomic,socialand political dimension for most countries in the world
First, as always, we install and load the necessary package. This time, it is the ggplot2
package
install.packages("ggplot2")library(ggplot2)
Next add the following code:
fin <- ggplot(oecd2000, aes(economic_globalization, social_globalization)) + ggtitle("Relationship between Globalization Index Scores among OECD countries in 2000") + scale_x_continuous("Economic Globalization Index") + scale_y_continuous("Social Globalization Index") + geom_smooth(method = "lm") + geom_point(aes(colour = polity_score), size = 2) + labs(color = "Polity Score") + geom_text(hjust = 0, nudge_x = 0.5, size = 4, aes(label = country)) fin
In the aes()
function, we enter the two variables we want to plot.
Then I use the next three lines to add titles to axes and graph
I use the geom_smooth()
function with the “lm” method to add a best fitting regression line through the points on the plot. Click here to learn more about adding a regression line to a plot.
I add a legend to examine where countries with different democracy scores (taken from the Polity Index) are located on the globalization plane. Click here to learn about adding legends.
The last line is the geom_text()
function that I use to specify that I want to label each observation (i.e. each OECD country) with its name, rather than the default dataset number.
Some geom_text() commands to use:
- nudge_x(or nudge_y) slightly “nudge” the labels from their corresponding points to help minimise messy overlapping.
- hjustandvjustmove the text label “left”, “center”, “right”, “bottom”, “middle” or “top” of the point.
Yes, yes! There is a package that uses the color palettes of Wes Anderson movies to make graphs look just beautiful. Click here to use different Wes Anderson aesthetic themed graphs!
zissou_colors <- wes_palette("Zissou1", 100, type = "continuous")fin + scale_color_gradientn(colours = zissou_colors)
Which outputs:
Interestingly, it seems that at the very bottom left hand corner of the plot (which shows the countries that are both low in economic globalization and low in social globalization), we have two OECD countries that score high on democracy – Japan and South Korea- right next to two countries that score the lowest in the OECD on democracy, Turkey and Mexico.
So it could be interesting to further examine why these countries from opposite ends of the democracy spectrum have similar pattern of low globalization. It puts a spanner in the proverbial works with my working theory that countries higher in democracy are more likely to be more globalized! What is special about these two high democracy countries that gives them such low scores on globalization.