R for Statistical Data Analysis

A Beginner’s Guide for Linguists

Ludovic De Cuypere

R basics

R GUI

R as a calculator

4-1          # subtraction
[1] 3
4/2          # division
[1] 2
2^2          # two to the power of two
[1] 4
sqrt(4)      # sqrt = square root
[1] 2

R as a calculator

(34-4)+(5+4) # use of brackets  
[1] 39
pi           # 3.1415...
[1] 3.141593
log(x = 25, base = 5)    # log of 25 with base 5
[1] 2
exp(x = 2)   # e^2, with e ≈ 2.71828
[1] 7.389056

Exercise 1.1

  1. Use R to calculate the following:

    1. 7+8
    2. 15-8
    3. the product of 23 and 34
    4. 2 raised to the power of 3
    5. the square root of 25
    6. sine of π/4
    7. the expression \(\frac{(3 + 5) \times (2^3)}{\sqrt{16}}\)

R functions

A function has a name and arguments:

Name(argument, argument, ...)

R functions

  • In very simple terms:
    • a function “does something” with its arguments.
    • there usually is an input argument and some optional arguments.
  • For example, we wish to round the result of the quotient \(25/6\) (= 4.1666667) to two decimal places.
round(x = 25/6,   # "x" is the first argument (necessary input)
      digits = 2) # "digits" is the second argument (optional argument)
[1] 4.17

R functions and arguments

Strictly speaking, we don’t need to write the names of the arguments …

round(25/6, 2)
[1] 4.17

Similarly

ceiling(5.67) 
[1] 6
floor(5.67)
[1] 5

Packages

To use those other packages, you must first install them via install.packages() and then load them using library().

install.packages("bayesplot")
library("bayesplot")

(Later we will also use RStudio to install and load a package)

Cite your package(s)!

citation("ggplot2")

To cite ggplot2 in publications, please use

  H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
  Springer-Verlag New York, 2016.

A BibTeX entry for LaTeX users is

  @Book{,
    author = {Hadley Wickham},
    title = {ggplot2: Elegant Graphics for Data Analysis},
    publisher = {Springer-Verlag New York},
    year = {2016},
    isbn = {978-3-319-24277-4},
    url = {https://ggplot2.tidyverse.org},
  }

Cite R

citation()

To cite R in publications use:

  R Core Team (2023). R: A language and environment for statistical
  computing. R Foundation for Statistical Computing, Vienna, Austria.
  URL https://www.R-project.org/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2023},
    url = {https://www.R-project.org/},
  }

We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also 'citation("pkgname")' for
citing R packages.

The assignment operator

x <- 10

Read: “x gets the value \(10\)”.

The assignment operator

R now knows that x is the “name” for the value of \(10\). Let’s evaluate x:

x
[1] 10

print()

We used the function print() here, without spelling it out explicitly:

print(x)
[1] 10

assign()

assign("y", 10.3)   # assign the value 10.3 to y
y                   # shorthand for print(y)
[1] 10.3

x

R has the objects x and y in its memory, so we can now use x instead of \(10\).

x+9
[1] 19
y - 200
[1] -189.7

Let’s make a function

Writing your own function can be quite easy:

square <- function(x){x^2}

New function

  • Let’s apply this function to a number:
square(x = 5)
[1] 25
  • Or to a vector of numbers:
square(x = c(1,2,3,4,5))
[1]  1  4  9 16 25

As a beginner, you will find more than enough ready-made functions in base-R and the existing packages, so in this course we will not create our own functions.

Vector

  • A row of similar data (e.g. all numeric objects, all integers, all characters).
  • Example: we create a vector of 5 numbers and give it the name mydata:
mydata <- c(4,6,8,2,7)
mydata
[1] 4 6 8 2 7

A function applied to a vector

length(mydata)
[1] 5
  • Most functions in R are “vectorized”, which means that the function operates on all elements of the vector at once. There’s no need in R to write a loop to act on every element (this is getting a bit technical, but it is an important difference between R and other programming languages).

Add \(3\) to every element of mydata:

mydata+3
[1]  7  9 11  5 10
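Vectorization also applies to other arithmetic operations and functions. A small sketch, using the same mydata vector as above:

```r
mydata <- c(4, 6, 8, 2, 7)
mydata * 2                  # every element doubled
sqrt(mydata)                # square root of every element
mydata + c(1, 1, 1, 1, 1)   # element-wise addition of two vectors
```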

Character vector

It’s also possible to create a vector of “words” (or better still, of “characters”, sometimes referred to as “strings”):

myfriends <- c("Beatrijs", 
               "Fatemeh", 
               "Leo", 
               "Astrid", 
               "Lize")
str(myfriends)
 chr [1:5] "Beatrijs" "Fatemeh" "Leo" "Astrid" "Lize"

Your turn: How many friends are there?

Show the code
length(myfriends)
[1] 5

Statistical functions

  • First, we create a vector WordsPerSentence.
WordsPerSentence<-c(12,3,34,23,23,34,12,23,12,
                    23,34,23,12,23,34,23,26,27) 

Statistical functions

  • Then we apply some statistical functions (all from base-R):
max(WordsPerSentence)                # maximum
[1] 34
min(WordsPerSentence)                # minimum   
[1] 3
sum(WordsPerSentence)                # sum
[1] 401
mean(WordsPerSentence)               # mean
[1] 22.27778
median(WordsPerSentence)             # median
[1] 23

Statistical functions

range(WordsPerSentence)              # range (minimum and maximum)
[1]  3 34
var(WordsPerSentence)                # variance   
[1] 81.38889
sortedorder<-sort(WordsPerSentence)  # sorted vector
order(WordsPerSentence)              # order (indices that sort the vector)
 [1]  2  1  7  9 13  4  5  8 10 12 14 16 17 18  3  6 11 15
fivenum(WordsPerSentence)            # five number summary
[1]  3 12 23 27 34
summary(WordsPerSentence)            # five number summary plus the mean
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00   14.75   23.00   22.28   26.75   34.00 

ls() and rm()

  • What objects have I created?
ls()
[1] "mydata"           "myfriends"        "sortedorder"      "square"          
[5] "WordsPerSentence" "x"                "y"               
  • You can remove objects with rm():
rm("y") # to remove y, an object that we created earlier 
  • To remove all objects:
rm(list=ls())

Revision

Make a vocabulary list of your R language:

  • round(x=, digits=): rounds the number x to a given number of digits
  • <-: assigns a name to an object
  • c(): combines elements into a vector
  • mean(): calculates the mean of a numeric variable

Exercise 1.2

  1. We measured the height of basketball players (in cm): 189, 190, 198, 210, 213, 234, 205, 207, 198, 199, 189, 203, 204, 207, 188, 187, 179, 209, 205, 204, 206, 200, 199, 198, 198, 206, 204, 175, 201.

    1. Create a vector in R called “height” and assign the measurements to this object
    2. How many observations are there?
    3. What is the average height of the players?
    4. Use R to calculate:
      • maximum
      • minimum
      • median
      • standard deviation
      • variance

Exercise 1.2

  1. Use a function to find help on the mean() function

  2. R contains several datasets which are loaded automatically.

    1. Open the dataset ToothGrowth.
    2. Assign the name tg to the dataset ToothGrowth.
    3. How many variables does tg have? (tip: use str())

Categorical data

In statistics, we distinguish between two kinds of variables:

  • Continuous variable (also called “quantitative”)
  • Categorical variable (also called “qualitative”)

Continuous variables are based on measurements (e.g., length, number of words, height, etc.). Categorical variables have a limited number of categories as their values (e.g., Correctness with “correct” vs. “incorrect”, Animacy with “animate” vs. “inanimate”).

Example: categorical data

We asked 10 participants about their educational level, operationalized as a three-level categorical variable:

  • “high”: high school level
  • “ba”: bachelor level
  • “ma”: master level

Vector EduLevel

EduLevel <-c("high", 
              "high", 
              "ba", 
              "high", 
              "ba", 
              "ma", 
              "ma", 
              "ma", 
              "high", 
              "ma") 

Your turn

Use a function to find out what kind of variable EduLevel is.

Show the code
str(EduLevel)
 chr [1:10] "high" "high" "ba" "high" "ba" "ma" "ma" "ma" "high" "ma"

From character to factor

We transform the character variable to a factor, which is more convenient (and required) for a statistical analysis. A factor variable includes different levels.

EduLevel <- as.factor(EduLevel)
str(EduLevel)
 Factor w/ 3 levels "ba","high","ma": 2 2 1 2 1 3 3 3 2 3

EduLevel is now a factor variable with three categories or levels. R assigns a number to every category. Levels are ordered alphabetically by default; this can be overruled with factor(), and levels can be renamed or binned with the levels() function.
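A sketch of both operations, using the EduLevel values from above; the “uni” level that bins “ba” and “ma” is purely illustrative:

```r
EduLevel <- c("high", "high", "ba", "high", "ba",
              "ma", "ma", "ma", "high", "ma")
# Overrule the default alphabetical ordering of the levels:
EduLevel <- factor(EduLevel, levels = c("high", "ba", "ma"))
levels(EduLevel)
# Bin "ba" and "ma" into one level "uni" by renaming them:
levels(EduLevel) <- c("high", "uni", "uni")
table(EduLevel)   # high: 4, uni: 6
```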

table()

We summarize the factor variables by means of the table() function.

table(EduLevel) 
EduLevel
  ba high   ma 
   2    4    4 

prop.table()

Proportions or fractions are calculated with prop.table().

prop.table(table(EduLevel))
EduLevel
  ba high   ma 
 0.2  0.4  0.4 

Your turn

To avoid a nested function, first create an object based on the table() function and then apply prop.table().

Show the code
mytable <- table(EduLevel) 
prop.table(mytable)
EduLevel
  ba high   ma 
 0.2  0.4  0.4 
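If you prefer percentages over proportions, you can multiply the table of proportions by 100 and round the result; a sketch, reusing the EduLevel table from above:

```r
mytable <- table(EduLevel)
round(prop.table(mytable) * 100, digits = 1)   # percentages per level
```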

Missing values

  • Missing values are indicated by “NA”.
  • Example: RT (Reaction time) is a vector with one missing value:
RT <- c(345, 367, 440, 438, NA, 500, 270)
is.na(RT)
[1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

is.na() is a logical function. The output of the function is TRUE or FALSE.

which()

We can find the position of the missing observation by means of which():

which(is.na(RT))
[1] 5

So the fifth value of the vector RT is a missing value.

Your turn

But how many missing values are there?

Show the code
length(which(is.na(RT)))
[1] 1
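Because TRUE counts as 1 and FALSE as 0, sum() offers a shortcut here. Note also that many statistical functions need the argument na.rm = TRUE to ignore missing values; a sketch with the RT vector from above:

```r
RT <- c(345, 367, 440, 438, NA, 500, 270)
sum(is.na(RT))          # number of missing values: 1
mean(RT)                # returns NA!
mean(RT, na.rm = TRUE)  # mean of the non-missing values
```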

Logical operators

  • Is \(5\) larger than \(3\)?
5 > 3
[1] TRUE

Your turn

  • Given that: \(x=5\) and \(y=16\).
  • Question: is \(x\) smaller than \(y\)?
Show the code
x <- 5  
y <- 16  
x < y  
[1] TRUE

Other useful logical operators:

x <= 5    # smaller than or equal to
[1] TRUE
y >= 20   # larger than or equal to
[1] FALSE
y == 16   # equal to (note the double equals sign)
[1] TRUE
x != 5    # different from
[1] FALSE

Index function “[…]”

  • Square brackets (“[…]”) are important symbols in R.
  • They allow you to extract elements from an object.
  • For instance, here we extract the third element of the vector score.
score <- c(18, 12, 17)
score[3]
[1] 17
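The square brackets also work on the left-hand side of an assignment, which lets you replace individual elements (this comes back in Exercise 1.3). A sketch:

```r
score <- c(18, 12, 17)
score[3] <- 20   # replace the third element
score            # 18 12 20
```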

Index function

letters is a built-in R object (a character vector) consisting of the lower-case letters of the alphabet.

letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
alfabet <- letters
alfabet
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

We use the index function to extract different elements:

alfabet[3:5]                
[1] "c" "d" "e"
alfabet[c(8,5,12,12,15)]     
[1] "h" "e" "l" "l" "o"

Your turn

Extract all the letters from alfabet, except for the first and the third.

Show the code
alfabet[-c(1,3)]            
 [1] "b" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[20] "v" "w" "x" "y" "z"

Exercise 1.3

  1. Create a vector “numbers” with all numbers from 1 to 20
  2. Replace the third element with “NA”
  3. What is the average of the vector numbers?
  4. What is the sum of all numbers larger than \(10\)?
  5. How many numbers are larger than \(15\)?
  6. Extract the fourth element.
  7. A lexical decision task returned the following results: {word, non-word, non-word, non-word, non-word, word, word, word, word}. Use R to calculate the proportion of words.

Sampling

  • One of the major advantages of R is its capability to perform simulations. This allows one to bypass the need for actual experiments and instead conduct virtual experiments.
  • In contemporary empirical research, this approach is crucial as it enables the evaluation of hypotheses and the assessment of their plausibility.
  • A fundamental step in this process is the creation or selection of a sample.
  • In this context, we will generate a dataset consisting of \(N = 10\) elements and subsequently draw a random sample of 5 elements from it.

Sampling

  • Here, we generate a dataset consisting of \(N = 10\) elements and subsequently draw a random sample of 5 elements from it.
score<-c(65,45,78,56,89,         
        34,12,34,33,78)    
sample(x = score,                  
       size = 5,          
       replace=FALSE)    
[1] 56 89 45 78 34
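Random draws differ every time you run the code. If you want a reproducible sample, fix the random number generator first with set.seed(); the seed value (42 here) is arbitrary:

```r
score <- c(65, 45, 78, 56, 89, 34, 12, 34, 33, 78)
set.seed(42)   # any fixed number will do
sample(x = score, size = 5, replace = FALSE)
```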

Sampling with replacement

coin<-c("head","tail") 
sample(x = coin, 
       size = 20, 
       replace=TRUE)
 [1] "tail" "tail" "head" "head" "head" "head" "head" "head" "head" "tail"
[11] "tail" "head" "tail" "tail" "head" "tail" "head" "tail" "head" "tail"

Sampling with a given probability

data<-c(0,1) 
sample(x = data, 
       size = 20, 
       prob = c(0.30, 0.70), 
       replace=TRUE)
 [1] 1 1 1 0 0 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0

Sampling from a normal distribution

We draw a random sample of \(N=20\) observations from a normal distribution with mean \(\mu=4.5\) and standard deviation \(\sigma=2\).

rnorm(n = 20,           
      mean = 4.5,      
      sd = 2)           
 [1] 2.978715 5.082136 6.429688 3.570105 7.966115 3.532365 6.285349 5.019492
 [9] 2.127598 4.125893 1.638454 1.073755 7.182058 2.395545 3.406861 4.845484
[17] 7.970139 4.429499 7.808639 3.535287
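A quick sanity check: with a large enough sample, the sample mean and standard deviation should be close to the values we asked for. A sketch:

```r
s <- rnorm(n = 10000, mean = 4.5, sd = 2)
mean(s)   # close to 4.5
sd(s)     # close to 2
```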

Histogram

hist(Nile)

Histogram

ReactionTime <- rnorm(n = 1000, mean = 300, sd = 10)  
hist(ReactionTime,                                    
     main = "",                                       
     ylab = "Frequency",                              
     xlab = "Reaction Time (ms)")                     

Density curve

densRT <- density(ReactionTime)          
plot(densRT,                             
     main = "",                          
     ylab = "Density",             
     xlab = "Reaction Time (ms)")                      

Boxplot

boxplot(ReactionTime,                  
        ylab = "Reaction time (ms)")                

Stripchart

sampleRT <- sample(x = ReactionTime, size = 8)    
stripchart(sampleRT,                             
           xlab = "Reaction time (ms)",           
           pch = 1)                               

Barplot

We create a character vector “Animacy” with two values, “animate” vs. “inanimate”, using rep().

Animacy <- rep(x = c("animate", "inanimate"), 
               times = c(68, 32))                     

Let’s look at the first 10 observations of the vector:

head(Animacy, n = 10)
 [1] "animate" "animate" "animate" "animate" "animate" "animate" "animate"
 [8] "animate" "animate" "animate"

Always use table() first!

tt <- table(Animacy)                      
barplot(tt,                  
        xlab = "Animacy")    

Scatterplot

  • A mock dataset for ten participants who performed a pre- and a posttest.

 ID Pretest Posttest
  1      13       15
  2       8        9
  3      11       16
  4      12       13
  5      16       17
  6      15       17
  7      11       12
  8       6       10
  9      10       13
 10      14       15

Scatterplot

Pretest <- c(13, 8, 11, 12, 16, 15, 
             11, 6, 10, 14)                             
Posttest <- c(15, 9, 16, 13, 17, 17, 
              12, 10, 13, 15)                          
plot(x = Pretest, y = Posttest,                         
     xlab = "Score Pretest",                            
     ylab = "Score Posttest")                           

Scatterplot with a linear regression line

plot(x = Pretest, y = Posttest,
      xlab = "Score Pretest", 
      ylab = "Score Posttest")
abline(lm(Posttest ~ Pretest)) 
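A numeric companion to the regression line is the correlation between the two tests; cor() is part of base-R and uses the Pretest and Posttest vectors created above:

```r
cor(Pretest, Posttest)   # Pearson correlation between pre- and posttest
```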

Exercise 1.4

Reconsider the height dataset that we created in Exercise 1.2.

  1. Visualize the data by means of a yellow histogram.
  2. Visualize the data by means of a blue horizontal boxplot and label the horizontal axis.
  3. Take a random sample of 5 players and visualize the sample with a vertical stripchart.

The dataframe

RStudio

RStudio is an integrated development environment (IDE) for R and Python. It includes a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux).

(RStudio Team, 2023)

A typical dataset

Variable     Variable     Variable
measurement  measurement  measurement
measurement  measurement  measurement
measurement  measurement  measurement
  • In R this kind of object has its own data structure called a “dataframe”. Basically, you “tell” R that the combination of rows and columns is one dataset or dataframe.

How to make a dataframe in R: example 1

  • Let’s recreate the following dataset:
Participant Item Frequency ReactionTime
          1    a      high          320
          2    b       low          380
          3    c       low          400
          4    d      high          300
          5    e      high          356
          6    f       low          319

How to make a dataframe in R: example 1

Participant <- c(1,2,3,4,5,6)                  
Item <- c("a", "b", "c", "d", "e", "f")        
Freq <- as.factor(c("high", "low", "low",    
                    "high", "high", "low"))   
RT <- c(320, 380, 400, 300, 356, 319)          
df <- data.frame(Participant, Item, Freq, RT)  
df                                             
  Participant Item Freq  RT
1           1    a high 320
2           2    b  low 380
3           3    c  low 400
4           4    d high 300
5           5    e high 356
6           6    f  low 319
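After creating a dataframe, it is good practice to check the variable types and dimensions; a sketch using the df object built above:

```r
str(df)   # one line per variable, with its type
dim(df)   # number of rows and columns
```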

How to make a dataframe in R: example 2

  • Let’s recreate the following dataset:
 ID   Group Score
  1 Control    14
  2 Control    15
  …       …     …
 59   Treat    12
 60   Treat    17

How to make a dataframe in R: example 2

Group <- gl(n = 2,                                          
            k = 30,                                         
            labels = c("control", "treat"))                 
Score <- c(round(rnorm(n = 30, mean = 16, sd = 0.8),0),    
           round(rnorm(n = 30, mean = 12, sd = 0.9),0))    
df <- data.frame(Group, Score)                             
head(df)                                                   
    Group Score
1 control    17
2 control    16
3 control    17
4 control    16
5 control    15
6 control    16

View(), edit()

View(df)
edit(df)

BONUS

You can actually create a dataframe from scratch with:

df_2 <- edit(data.frame())

R datasets

Base-R and other packages have built-in datasets. We can open the sleep dataset simply by typing sleep.

sleep
   extra group ID
1    0.7     1  1
2   -1.6     1  2
3   -0.2     1  3
4   -1.2     1  4
5   -0.1     1  5
6    3.4     1  6
7    3.7     1  7
8    0.8     1  8
9    0.0     1  9
10   2.0     1 10
11   1.9     2  1
12   0.8     2  2
13   1.1     2  3
14   0.1     2  4
15  -0.1     2  5
16   4.4     2  6
17   5.5     2  7
18   1.6     2  8
19   4.6     2  9
20   3.4     2 10

Sleep

The sleep dataset contains three variables:

  • extra: the number of extra hours of sleep after taking a sleep medication,
  • group: two sleep medications were tested by 10 participants,
  • ID: an identification number for each participant (not for each row!).

str()

We can view the structure of a dataframe using str().

str(sleep)
'data.frame':   20 obs. of  3 variables:
 $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
 $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

extra is a numeric variable (a number with decimal places). group and ID are factors, categorical variables with 2 and 10 possible values, respectively. The output for the factor variables can be interpreted as follows:

$ group: Factor w/ 2 levels "1", "2": 1 1 1 ...

“1” and “2” are the names of the levels, which R converts into numerical values, 1 and 2. For instance, 1 1 1 indicates that the first three values of the variable are “1”. The levels are arranged alphabetically by R, unless specified otherwise.

data()

Want to find out which datasets are included in base-R?

data()

help(“DATA”)

For more information on the dataset sleep use:

help("sleep")

Opening a Text File as a Dataframe

ABSTRACT: This dataset contains one datafile (.csv) used to create the graphs and tables in the paper “(Im)polite uses of vocatives in present-day Madrilenian Spanish”. It includes 534 Spanish vocative tokens, i.e. (pro)nominal terms of direct address (e.g., tío ‘dude’), which were retrieved from CORMA, a conversational corpus of peninsular Spanish compiled between 2016 and 2019. The data is annotated for (i) form, (ii) communication, (iii) semantic category, (iv) speaker’s generation, (v) speaker’s gender, (vi) relationship between speaker and hearer, (vii) socio-pragmatic character of the hosting speech act, (viii) the hearer’s reaction, and (ix) the vocative’s socio-pragmatic effect.

Working directory

  • The current working directory can be found via getwd().
  • You change your working directory with setwd().
getwd()
# "C:/Users/lfdcuype/Documents"
setwd("C:/Users/lfdcuype/MILS_Module_1")

read.csv()

  • When your R-script and data file have been stored in the same folder which has been set as the working directory, you can use the read.csv() function to open the .csv-file, as follows:
vocative <- read.csv("dataset_ImPoliteVocatives_20230530.csv", 
                     sep=";",                                  
                     stringsAsFactors=TRUE)                    

Import Dataset from Environment

My personal favourite after many years of trial and lots of error is to use the Import dataset function in the Environment tab in RStudio:

Exploring a dataframe

head(sleep)
  extra group ID
1   0.7     1  1
2  -1.6     1  2
3  -0.2     1  3
4  -1.2     1  4
5  -0.1     1  5
6   3.4     1  6
str(sleep)
'data.frame':   20 obs. of  3 variables:
 $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
 $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

Exploring a dataframe

names(sleep)
[1] "extra" "group" "ID"   
summary(sleep)
     extra        group        ID   
 Min.   :-1.600   1:10   1      :2  
 1st Qu.:-0.025   2:10   2      :2  
 Median : 0.950          3      :2  
 Mean   : 1.540          4      :2  
 3rd Qu.: 3.400          5      :2  
 Max.   : 5.500          6      :2  
                         (Other):8  

Hmisc::describe()

Other descriptive functions have been developed by R developers. Here’s the describe() function from the Hmisc package (Harrell Jr, 2023).

options(width = 60)
library(Hmisc)
describe(sleep)
sleep 

 3  Variables      20  Observations
------------------------------------------------------------
extra 
       n  missing distinct     Info     Mean      Gmd 
      20        0       17    0.998     1.54    2.332 
     .05      .10      .25      .50      .75      .90 
  -1.220   -0.300   -0.025    0.950    3.400    4.420 
     .95 
   4.645 
                                                       
Value      -1.6 -1.2 -0.2 -0.1  0.0  0.1  0.7  0.8  1.1
Frequency     1    1    1    2    1    1    1    2    1
Proportion 0.05 0.05 0.05 0.10 0.05 0.05 0.05 0.10 0.05
                                                  
Value       1.6  1.9  2.0  3.4  3.7  4.4  4.6  5.5
Frequency     1    1    1    2    1    1    1    1
Proportion 0.05 0.05 0.05 0.10 0.05 0.05 0.05 0.05

For the frequency table, variable is rounded to the nearest 0
------------------------------------------------------------
group 
       n  missing distinct 
      20        0        2 
                  
Value        1   2
Frequency   10  10
Proportion 0.5 0.5
------------------------------------------------------------
ID 
       n  missing distinct 
      20        0       10 
                                                  
Value        1   2   3   4   5   6   7   8   9  10
Frequency    2   2   2   2   2   2   2   2   2   2
Proportion 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
------------------------------------------------------------

Exercise 2.1

  1. Create a new folder.
  2. Open a new R-script and set the new folder as your working directory
  3. Create a small mock-dataset in Excel with three variables and five rows, as in Figure 1:

Figure 1

Exercise 2.1

  1. Save the file as a .txt-file (FakeData.txt, cf. Figure 2) in the working directory.

Figure 2

  1. Open the dataset in R with read.delim().
  2. What are the names of the variables?
  3. Give a univariate summary of every variable

Applying functions to a variable from a dataframe

  • All the functions outlined earlier can be applied to variables of a dataframe.
table(sleep$group)

 1  2 
10 10 

Applying functions to a variable from a dataframe

boxplot(sleep$extra)

attach()

To avoid the use of “DATAFRAME$”, you can use the attach() function; when you’re done, you use the detach() function. This simplifies the code, but if you work with multiple dataframes you run the risk of losing track.

attach(sleep)
boxplot(extra)
detach(sleep)

Creating/adding new variables

  • The index function with the square brackets also works on dataframes.
  • Dataframes have rows and columns and so the index function involves two dimensions.
  • Example: we extract the first four rows from sleep.
sleep[1:4,]
  extra group ID
1   0.7     1  1
2  -1.6     1  2
3  -0.2     1  3
4  -1.2     1  4

Your turn

  • Extract all rows with a positive extra amount of sleep.
Show the code
sleep[sleep$extra>0,]
   extra group ID
1    0.7     1  1
6    3.4     1  6
7    3.7     1  7
8    0.8     1  8
10   2.0     1 10
11   1.9     2  1
12   0.8     2  2
13   1.1     2  3
14   0.1     2  4
16   4.4     2  6
17   5.5     2  7
18   1.6     2  8
19   4.6     2  9
20   3.4     2 10

Your turn

What is the average amount of extra sleep?

Show the code
mean(sleep$extra, na.rm=TRUE)
[1] 1.54

subset()

sleep_drug1 <- subset(sleep,         
                    group=="1")      
summary(sleep_drug1)                 
     extra        group        ID   
 Min.   :-1.600   1:10   1      :1  
 1st Qu.:-0.175   2: 0   2      :1  
 Median : 0.350          3      :1  
 Mean   : 0.750          4      :1  
 3rd Qu.: 1.700          5      :1  
 Max.   : 3.700          6      :1  
                         (Other):4  

subset()

Here’s a more complicated selection:

sleep_2 <- subset(sleep,                      
                    group=="1" & extra>0)      
sleep_2
   extra group ID
1    0.7     1  1
6    3.4     1  6
7    3.7     1  7
8    0.8     1  8
10   2.0     1 10

droplevels()

  • When you take a subset, the original levels remain:
summary(sleep_drug1)
     extra        group        ID   
 Min.   :-1.600   1:10   1      :1  
 1st Qu.:-0.175   2: 0   2      :1  
 Median : 0.350          3      :1  
 Mean   : 0.750          4      :1  
 3rd Qu.: 1.700          5      :1  
 Max.   : 3.700          6      :1  
                         (Other):4  

Use droplevels() to remove the unused levels.

sleep_drug1 <- subset(sleep,             
                    group=="1")      
sleep_drug1 <- droplevels(sleep_drug1)
summary(sleep_drug1)                     
     extra        group        ID   
 Min.   :-1.600   1:10   1      :1  
 1st Qu.:-0.175          2      :1  
 Median : 0.350          3      :1  
 Mean   : 0.750          4      :1  
 3rd Qu.: 1.700          5      :1  
 Max.   : 3.700          6      :1  
                         (Other):4  

Creating a new variable

We calculate a z-score for extra in the sleep data. The dollar sign (“$”) is used to add the new variable to the existing dataframe.

sleep$extra_z <- (sleep$extra-mean(sleep$extra))/sd(sleep$extra)
head(sleep)
  extra group ID    extra_z
1   0.7     1  1 -0.4162703
2  -1.6     1  2 -1.5560579
3  -0.2     1  3 -0.8622741
4  -1.2     1  4 -1.3578340
5  -0.1     1  5 -0.8127182
6   3.4     1  6  0.9217413

Note that if you do not use the dollar sign, you create a new vector without adding it as a variable to the sleep dataset.
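Base-R also offers scale() for standardizing. It returns a matrix, so we wrap it in as.numeric(); the name extra_z2 is just an illustrative choice. A sketch:

```r
extra_z2 <- as.numeric(scale(sleep$extra))   # same z-scores as the manual calculation
head(extra_z2)
```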

tapply()

This is a convenient function to compare different groups in your data. For instance, here we calculate the average extra time of sleep for both groups.

tapply(X = sleep$extra,            
       INDEX = sleep$group,       
       FUN = mean)                
   1    2 
0.75 2.33 
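The same group-wise summary can be obtained with aggregate(), which returns a small dataframe instead of a named vector; a sketch:

```r
aggregate(extra ~ group, data = sleep, FUN = mean)   # group 1: 0.75, group 2: 2.33
```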

Save a dataframe as a datafile

After you create a dataframe “df”, you can save it as a .csv-file as follows:

write.csv2(df, file = "df.csv")

Exercise 2.2

  1. Open the multitask dataset, which we created in Activity 1, and give the dataframe the name “multi”.
  2. Give a univariate summary of all variables.
  3. Use tapply to calculate the median Time for both Tasks.
  4. Take a subset of the simultaneous Task.
  5. Visualize this subset by means of a histogram.
  6. Create a new variable Time_c by subtracting the mean of the variable Time from every observation. (This is called centering the data, hence the use of “_c”.)

Tidyverse

Language family tree

Source: https://www.theguardian.com/education/gallery/2015/jan/23/a-language-family-tree-in-pictures

Programming language family tree

Source: Farooq et al. (2014)

Timeline of R

Source: Giorgi et al. (2022)

Tidyverse

library(tidyverse)

By doing so, we actually open multiple packages. Two packages that we will use in particular are:

  1. dplyr: to explore and manipulate data (Wickham et al., 2023)
  2. tidyr: to transform the data format (long to wide and vice versa) (Wickham et al., 2024).

tibble

library(tibble)                      
classroom <- tribble(                  
  ~name,    ~quiz1, ~quiz2, ~test1,
  "Billy",  NA,     "D",    "C",
  "Suzy",   "F",    NA,     NA,
  "Lionel", "B",    "C",    "B",
  "Jenny",  "A",    "A",    "B"
  )
classroom
# A tibble: 4 × 4
  name   quiz1 quiz2 test1
  <chr>  <chr> <chr> <chr>
1 Billy  <NA>  D     C    
2 Suzy   F     <NA>  <NA> 
3 Lionel B     C     B    
4 Jenny  A     A     B    

Verbeke & Simon (2023)

ABSTRACT: This dataset contains the results from 33 Flemish English as a Foreign Language (EFL) learners, who were exposed to eight native and non-native accents of English. These participants completed (i) a comprehensibility and accentedness rating task, followed by (ii) an orthographic transcription task. In the first task, listeners were asked to rate eight speakers of English on comprehensibility and accentedness on a nine-point scale (1 = easy to understand/no accent; 9 = hard to understand/strong accent). How Accentedness ratings and listeners’ Familiarity with the different accents impacted on their Comprehensibility judgements was measured using a linear mixed-effects model. The orthographic transcription task, then, was used to verify how well listeners actually understood the different accents of English (i.e. intelligibility). To that end, participants’ transcription Accuracy was measured as the number of correctly transcribed words and was estimated using a logistic mixed-effects model. Finally, the relation between listeners’ self-reported ease of understanding the different speakers (comprehensibility) and their actual understanding of the speakers (intelligibility) was assessed using a linear mixed-effects regression. R code for the data analysis is provided.

Listening to Accents: Codebook

  • ID: Unique identifier for each line in the dataset.
  • Participant: Unique identifier for each participant.
  • Accent: native and non-native accents of English (e.g., “GBE” = General British English, “GAE” = General American English, etc.)
  • Comprehensibility: Listeners’ rating of how comprehensible the speaker is on a scale from 1 (= easy to understand) to 9 (= hard to understand)
  • Accentedness: Listeners’ rating of how strong a speaker’s accent is on a scale from 1 (= no accent) to 9 (= strong accent)
  • Familiarity: Listeners’ familiarity with the sampled varieties of English (i.e. never; rarely; sometimes; often; very often)

Data: Open and summarize:

options(width = 70) 
comp <- read.csv("Listening_to_Accents_Comprehensibility_Accentedness.csv", 
                 stringsAsFactors=TRUE)
summary(comp)
       ID          Participant        Accent   Comprehensibility
 Min.   :  1.00   Min.   : 7.00   ChinEng:33   Min.   :1.000    
 1st Qu.: 66.75   1st Qu.:33.00   GAE    :33   1st Qu.:2.000    
 Median :132.50   Median :46.00   GBE    :33   Median :2.000    
 Mean   :132.50   Mean   :45.12   IndEng :33   Mean   :2.742    
 3rd Qu.:198.25   3rd Qu.:62.00   NBE    :33   3rd Qu.:4.000    
 Max.   :264.00   Max.   :75.00   NigEng :33   Max.   :8.000    
                                  (Other):66                    
  Accentedness      Familiarity
 Min.   :1.00   Never     :65  
 1st Qu.:4.00   Often     :32  
 Median :6.00   Rarely    :72  
 Mean   :5.58   Sometimes :56  
 3rd Qu.:7.00   Very Often:39  
 Max.   :9.00                  
                               

Change Familiarity to an ordered factor

levels(comp$Familiarity)
[1] "Never"      "Often"      "Rarely"     "Sometimes"  "Very Often"
comp$Familiarity <- factor(comp$Familiarity, 
                           ordered = TRUE, 
                           levels = c("Never", 
                                      "Rarely", 
                                      "Sometimes",
                                      "Often",
                                      "Very Often"))

Check levels and factor ordering

levels(comp$Familiarity)
[1] "Never"      "Rarely"     "Sometimes"  "Often"      "Very Often"
str(comp$Familiarity)
 Ord.factor w/ 5 levels "Never"<"Rarely"<..: 1 5 4 2 2 1 2 1 1 5 ...
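Once a factor is ordered, its levels support order-based comparisons with `<` and `>`. A minimal sketch with a small mock vector (not the comp data):

```r
# Ordered factor levels can be compared against a level name.
fam <- factor(c("Never", "Often", "Rarely"),
              ordered = TRUE,
              levels = c("Never", "Rarely", "Sometimes",
                         "Often", "Very Often"))
fam > "Rarely"
# [1] FALSE  TRUE FALSE
```

This is useful, for instance, to select all observations with at least a given familiarity level.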

Exercise 3.1

  1. Download and open Listening_to_Accents_Comprehensibility_Accentedness.csv. Give the dataframe the name “comp”.
  2. Visualise Comprehensibility by means of a barplot and a histogram.
  3. How many observations are there for every Accent?
  4. Visualise Accent with a pie chart (check help("pie")).
  5. Create a new variable in which you bin the Accent levels into two or three groups.

Compare

  • base-r:
mean(comp$Comprehensibility)
[1] 2.742424
  • dplyr:
comp |>                                      
  summarise(MEAN = mean(Comprehensibility))  
      MEAN
1 2.742424

More descriptive statistics

comp |>                                                       
  summarise(mean = mean(Comprehensibility, na.rm = TRUE),     
            median = median(Comprehensibility, na.rm = TRUE), 
            min = min(Comprehensibility, na.rm = TRUE),       
            max = max(Comprehensibility, na.rm = TRUE),       
            SD = sd(Comprehensibility, na.rm = TRUE))         
      mean median min max       SD
1 2.742424      2   1   8 1.567838

Basic dplyr vocabulary

  • select() to select columns/variables
  • filter() to select rows
  • mutate() to create new variables
  • group_by() to group the data according to a categorical variable
  • summarise() to calculate descriptive statistics
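These verbs are designed to be chained together with the pipe. A minimal sketch combining all five on a mock tibble (the column values are invented for illustration):

```r
library(dplyr)

mock <- tibble(Accent = c("GAE", "GAE", "GBE"),
               Score  = c(1, 2, 5))
mock |>
  filter(Score > 1) |>            # keep rows with Score above 1
  select(Accent, Score) |>        # keep these two columns
  mutate(Score2 = Score * 2) |>   # create a new variable
  group_by(Accent) |>             # group by Accent
  summarise(mean = mean(Score2))  # one mean per group
```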

group_by()

What is the average Comprehensibility for the eight different accents?

comp |>                                                      
  group_by(Accent) |>                                        
  summarise(mean = mean(Comprehensibility, na.rm = TRUE))    
# A tibble: 8 × 2
  Accent   mean
  <fct>   <dbl>
1 ChinEng  3.30
2 GAE      1.12
3 GBE      2.03
4 IndEng   3.61
5 NBE      2.24
6 NigEng   3.94
7 SAE      2.39
8 SpanEng  3.30

tapply()

tapply(X = comp$Comprehensibility, INDEX = comp$Accent, FUN = mean)
 ChinEng      GAE      GBE   IndEng      NBE   NigEng      SAE 
3.303030 1.121212 2.030303 3.606061 2.242424 3.939394 2.393939 
 SpanEng 
3.303030 
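tapply() also accepts a list of grouping factors as INDEX, which yields a cross-table of means. A minimal sketch with mock vectors:

```r
# Two grouping factors produce a matrix of group means.
x  <- c(1, 2, 3, 4)
g1 <- factor(c("a", "a", "b", "b"))
g2 <- factor(c("x", "y", "x", "y"))
tapply(X = x, INDEX = list(g1, g2), FUN = mean)
#   x y
# a 1 2
# b 3 4
```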

group_by() + summarise()

comp |>                                                
  group_by(Accent) |>                                  
  summarise(mean = mean(Comprehensibility, na.rm = TRUE),         
            median = median(Comprehensibility, na.rm = TRUE),     
            min = min(Comprehensibility, na.rm = TRUE),          
            max = max(Comprehensibility, na.rm = TRUE),           
            SD = sd(Comprehensibility, na.rm = TRUE))             
# A tibble: 8 × 6
  Accent   mean median   min   max    SD
  <fct>   <dbl>  <int> <int> <int> <dbl>
1 ChinEng  3.30      3     1     7 1.42 
2 GAE      1.12      1     1     2 0.331
3 GBE      2.03      2     1     5 0.883
4 IndEng   3.61      3     2     8 1.64 
5 NBE      2.24      2     1     6 1.28 
6 NigEng   3.94      4     1     7 1.62 
7 SAE      2.39      2     1     8 1.52 
8 SpanEng  3.30      3     1     6 1.24 

group_by() with multiple variables

comp |>                                               
  group_by(Accent, Familiarity) |>                   
  summarise(mean = mean(Comprehensibility, na.rm = TRUE),        
            SD = sd(Comprehensibility, na.rm = TRUE))            
# A tibble: 33 × 4
# Groups:   Accent [8]
   Accent  Familiarity  mean     SD
   <fct>   <ord>       <dbl>  <dbl>
 1 ChinEng Never        3.55  1.43 
 2 ChinEng Rarely       3     1.32 
 3 ChinEng Sometimes    3.33  1.53 
 4 ChinEng Often        1    NA    
 5 GAE     Sometimes    1     0    
 6 GAE     Often        1.43  0.535
 7 GAE     Very Often   1.04  0.204
 8 GBE     Never        2     0    
 9 GBE     Rarely       3    NA    
10 GBE     Sometimes    2.17  0.408
# ℹ 23 more rows
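Note that summarise() on data grouped by two variables drops only the last grouping level, which is why the result above is still grouped by Accent. The .groups argument controls this behaviour; a sketch on mock data:

```r
library(dplyr)

d <- tibble(a = c("x", "x", "y"),
            b = c("p", "q", "p"),
            v = c(1, 2, 3))
res <- d |>
  group_by(a, b) |>
  summarise(m = mean(v),
            .groups = "drop")  # return a fully ungrouped tibble
is_grouped_df(res)
# [1] FALSE
```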

base-r: aggregate()

aggregate(Comprehensibility ~ Accent + Familiarity, 
          data = comp, 
          FUN = function(x) c(MEAN = mean(x), SD = sd(x)))
    Accent Familiarity Comprehensibility.MEAN Comprehensibility.SD
1  ChinEng       Never              3.5500000            1.4317821
2      GBE       Never              2.0000000            0.0000000
3   IndEng       Never              4.3333333            1.9663842
4      NBE       Never              1.0000000                   NA
5   NigEng       Never              3.8571429            1.7113069
6      SAE       Never              3.5000000            0.7071068
7  SpanEng       Never              3.3076923            1.4366985
8  ChinEng      Rarely              3.0000000            1.3228757
9      GBE      Rarely              3.0000000                   NA
10  IndEng      Rarely              3.7500000            1.5852943
11     NBE      Rarely              2.1111111            0.9279607
12  NigEng      Rarely              4.1000000            1.6633300
13     SAE      Rarely              2.9000000            1.2866839
14 SpanEng      Rarely              3.2307692            1.0127394
15 ChinEng   Sometimes              3.3333333            1.5275252
16     GAE   Sometimes              1.0000000            0.0000000
17     GBE   Sometimes              2.1666667            0.4082483
18  IndEng   Sometimes              2.1666667            0.4082483
19     NBE   Sometimes              2.6250000            1.5000000
20  NigEng   Sometimes              4.0000000            0.0000000
21     SAE   Sometimes              2.1250000            1.7841898
22 SpanEng   Sometimes              3.8000000            1.4832397
23 ChinEng       Often              1.0000000                   NA
24     GAE       Often              1.4285714            0.5345225
25     GBE       Often              2.1538462            1.2810252
26  IndEng       Often              5.0000000                   NA
27     NBE       Often              1.0000000            0.0000000
28     SAE       Often              2.0000000            0.8164966
29 SpanEng       Often              2.5000000            0.7071068
30     GAE  Very Often              1.0416667            0.2041241
31     GBE  Very Often              1.7272727            0.4670994
32     NBE  Very Often              2.6666667            0.5773503
33     SAE  Very Often              1.0000000                   NA

filter() and select()

  • We select the observations in which listeners are familiar with the accent they heard (i.e., Familiarity equal to “Often” or “Very Often”) and check their Comprehensibility scores.
comp |>                    
  filter(Familiarity %in% c("Often", "Very Often")) |>    
  select(Participant, Comprehensibility) |> 
  group_by(Participant) |>
  summarise(mean = mean(Comprehensibility, na.rm = TRUE))
# A tibble: 32 × 2
   Participant  mean
         <int> <dbl>
 1           7  3   
 2          10  2   
 3          13  1.33
 4          15  1   
 5          18  1.75
 6          21  1.25
 7          22  1.67
 8          31  1   
 9          33  2   
10          34  1.5 
# ℹ 22 more rows

mutate()

  • mutate() allows you to create new variables. For instance, here we calculate a z-score for Comprehensibility.
comp |>                                            
  drop_na() |>                                       
  mutate(Comprehensibility_z = (Comprehensibility-mean(Comprehensibility))/sd(Comprehensibility)) |>
  summarise(AVG = mean(Comprehensibility_z))                    
           AVG
1 1.137203e-16
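Base R offers scale() for the same standardisation; a minimal sketch on a small numeric vector:

```r
# scale() computes (x - mean(x)) / sd(x) and returns a matrix;
# as.numeric() turns it back into a plain vector.
x <- c(2, 4, 6)
z <- as.numeric(scale(x))
z
# [1] -1  0  1
```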

drop_na()

mockdf <- tibble(
  Size = c("a", "a", "a", "b", "b", "b"), 
  Score = c(23,34,23,NA,27,4),
  Group = c("first", "second", "second", "second", "first", NA))
clean_df <- mockdf |> drop_na()
clean_df
# A tibble: 4 × 3
  Size  Score Group 
  <chr> <dbl> <chr> 
1 a        23 first 
2 a        34 second
3 a        23 second
4 b        27 first 
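drop_na() can also be restricted to specific columns, so that rows with NA elsewhere are kept. A sketch that rebuilds the same mock tibble:

```r
library(tidyr)
library(tibble)

mockdf <- tibble(
  Size  = c("a", "a", "a", "b", "b", "b"),
  Score = c(23, 34, 23, NA, 27, 4),
  Group = c("first", "second", "second", "second", "first", NA))
mockdf |> drop_na(Score)  # only the Score-NA row is removed;
                          # the row with NA in Group is kept
```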

count()

mockdf |>  count(Group)
# A tibble: 3 × 2
  Group      n
  <chr>  <int>
1 first      2
2 second     3
3 <NA>       1

group_by + count()

mockdf |> group_by(Group, Size) |> count()
# A tibble: 5 × 3
# Groups:   Group, Size [5]
  Group  Size      n
  <chr>  <chr> <int>
1 first  a         1
2 first  b         1
3 second a         2
4 second b         1
5 <NA>   b         1

Compare: Base-R

  • table()
table(mockdf$Size, mockdf$Group)
   
    first second
  a     1      2
  b     1      1
  • Or, if you prefer a dataframe format:
as.data.frame(xtabs(~ Group + Size, data = mockdf))
   Group Size Freq
1  first    a    1
2 second    a    2
3  first    b    1
4 second    b    1

Exercise 3.2

Use tidyverse to answer the following questions about the multitask dataset.

  1. What is the average Time by Task?
  2. Who are the three fastest international participants? Give the Participants’ names in order, starting with the fastest one.

Data transformation

The long data format

  Student  Test      Score
        1  pretest       8
        1  posttest     12
        2  pretest      12
        2  posttest     14
        3  pretest       9
        3  posttest      8

The wide data format

  Student  Pretest  Posttest
        1        8        12
        2       12        14
        3        9         8

Your turn

Recreate the following dataset in base-R:

  Student  Pretest  Posttest
        1        8        12
        2       12        14
        3        9         8
Show the code
df <- data.frame(Student = c(1, 2, 3),
                 Pretest = c(8, 12, 9),
                 Posttest = c(12, 14, 8))
df
  Student Pretest Posttest
1       1       8       12
2       2      12       14
3       3       9        8

Your turn

Recreate the following dataset as a tibble:

  Student  Pretest  Posttest
        1        8        12
        2       12        14
        3        9         8
Show the code
df2 <- tribble(                  
  ~Student, ~Pretest, ~Posttest, 
       "1",        8,        12,    
       "2",       12,        14,
       "3",        9,         8,
  )
df2
# A tibble: 3 × 3
  Student Pretest Posttest
  <chr>     <dbl>    <dbl>
1 1             8       12
2 2            12       14
3 3             9        8

glimpse()

glimpse(df2)
Rows: 3
Columns: 3
$ Student  <chr> "1", "2", "3"
$ Pretest  <dbl> 8, 12, 9
$ Posttest <dbl> 12, 14, 8

Change to long format with pivot_longer()

df_long <- df |> 
  pivot_longer(cols = c(Pretest, Posttest),    
               names_to = "Test",              
               values_to = "Score")            
df_long
# A tibble: 6 × 3
  Student Test     Score
    <dbl> <chr>    <dbl>
1       1 Pretest      8
2       1 Posttest    12
3       2 Pretest     12
4       2 Posttest    14
5       3 Pretest      9
6       3 Posttest     8
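The columns to pivot can also be specified by exclusion. A sketch that pivots everything except Student, equivalent to the call above:

```r
library(tidyr)

df <- data.frame(Student  = c(1, 2, 3),
                 Pretest  = c(8, 12, 9),
                 Posttest = c(12, 14, 8))
df |>
  pivot_longer(cols = -Student,   # all columns except Student
               names_to = "Test",
               values_to = "Score")
```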

Change to wide format with pivot_wider()

df_wide <- df_long |> 
  pivot_wider(id_cols = Student,      
              names_from = Test,      
              values_from = Score)    
df_wide
# A tibble: 3 × 3
  Student Pretest Posttest
    <dbl>   <dbl>    <dbl>
1       1       8       12
2       2      12       14
3       3       9        8

Exercise 3.3

  1. Transform the Multitask dataset from long to wide data format.

ggplot2

FalseBeginners Codebook

  • The dataset is based on De Wilde et al. (2019).
  • Student: A unique identifier for each student
  • School: A unique identifier for each school
  • Class: A unique identifier for each class
  • PPVT: Score for the Peabody Picture Vocabulary Test (PPVT). Maximum score = 120
  • Speaking: Score for the speaking test. Maximum score = 20
  • Listening: Score for the listening test. Maximum score = 25
  • ReadingWriting: Score for the reading and writing tests. Maximum score = 50
  • Attitude: Attitude of the student towards English: Positive vs. Negative
  • L1: First language: Dutch vs. Multilingual
  • Sex: Sex of the student: Male vs. Female

Open and summarize

options(width = 70) 
fb <- read.delim("FalseBeginners.txt",                
                 header = TRUE,                       
                 na.strings = "NA",                   
                 stringsAsFactors=TRUE)               
fb <- na.omit(fb) 
summary(fb)
    Student          School        Class          PPVT       
 Min.   :  2.0   S12    : 50   12A    : 27   Min.   : 31.00  
 1st Qu.:221.2   S34    : 46   9A     : 26   1st Qu.: 70.00  
 Median :443.5   S38    : 35   11A    : 23   Median : 78.00  
 Mean   :441.5   S51    : 35   12B    : 23   Mean   : 78.53  
 3rd Qu.:661.8   S8     : 34   42A    : 22   3rd Qu.: 88.00  
 Max.   :867.0   S24    : 32   21A    : 20   Max.   :116.00  
                 (Other):514   (Other):605                   
    Speaking        Listening     ReadingWriting      Attitude  
 Min.   : 0.000   Min.   : 0.00   Min.   : 0.00   negative: 26  
 1st Qu.: 2.000   1st Qu.:10.00   1st Qu.:13.00   positive:720  
 Median : 5.000   Median :15.00   Median :18.00                 
 Mean   : 6.782   Mean   :14.94   Mean   :21.13                 
 3rd Qu.:10.000   3rd Qu.:20.00   3rd Qu.:29.00                 
 Max.   :20.000   Max.   :25.00   Max.   :50.00                 
                                                                
     L1          Sex     
 dutch:548   female:365  
 multi:198   male  :381  
                         
                         
                         
                         
                         

Histogram

ggplot(data = fb,                  
       mapping =  aes(x=PPVT)) +  
       geom_histogram(bins = 25)  

Histogram

ggplot(fb,                                  
       aes(x=PPVT)) +                        
  geom_histogram(bins = 25,                 
                 fill = "#F1A42B") +        
  theme_minimal() +                         
  xlab("Score PPVT (120)")                  

Boxplot

ggplot(fb,                                 
       aes(y=PPVT)) +                      
  geom_boxplot() +                         
  scale_x_discrete( ) +                    
  theme_grey()                               

Your turn

  • Create a horizontal boxplot in green, with the label “PPVT” on the horizontal axis, and a minimalistic theme.
  • Use labs() to add a title, subtitle, and caption.
Show the code
ggplot(fb,                                 
       aes(x=PPVT)) +                      
  geom_boxplot(fill="green") +                         
  scale_y_discrete( ) +
  theme_minimal() +
  labs(title = "Distribution of PPVT scores",
       subtitle = "N = 746 pupils",
       caption = "Source: De Wilde et al. (2019)")

Density curve

ggplot(fb, aes(x = PPVT)) +  
  geom_density()  +          
  theme_minimal()

Barplot

fb |>                           
  count(Sex)  |>                
ggplot(aes(x = Sex,             
           y = n)) +            
  geom_col()                     
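Alternatively, geom_bar() computes the counts itself, so the count() step can be skipped. A minimal sketch on a mock data frame (with the fb data, the call would be the same with fb in place of the mock):

```r
library(ggplot2)

sex <- data.frame(Sex = c("female", "male", "male"))  # mock data
p <- ggplot(sex, aes(x = Sex)) +
  geom_bar() +       # counts the rows per Sex; no count() needed
  theme_minimal()
p
```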

Your turn

Add a nice colour to the bars

Show the code
fb |>                           
  count(Sex)  |>                
ggplot(aes(x = Sex,             
           y = n)) +            
  geom_col(fill = "#1E64C8") +  
  xlab("Sex")  +           
  theme_minimal()                

Exercise 4.1

Use the ggplot2 package to create the following visualisations for the Multitask dataset:

  1. Create a histogram of Time.
  2. Create a boxplot of Time.
  3. Create a barplot of Location.

Overlapping histograms

ggplot(fb,                                    
       aes(x=PPVT,                            
           fill = Sex)) +                     
  geom_histogram(bins = 20,                   
                 position = "identity",       
                 alpha = 0.4) +               
  theme_classic()                             

Overlapping density plots

ggplot(fb,                                    
       aes(x=PPVT,                            
           fill = Sex)) +                     
  geom_density(position = "identity",       
                 alpha = 0.4) +             
  theme_classic()                           

Your turn

Change the colour of the densityplots.

Show the code
ggplot(fb,                                    
       aes(x=PPVT,                            
           fill = Sex)) +                     
  geom_density(position = "identity",       
                 alpha = 0.4) +    
  scale_fill_manual(values=c("#0b5394", "#E69F00")) +
  theme_classic()  

Multiple boxplots

ggplot(fb, aes(x = Sex,           
               y = PPVT,            
               fill = Sex)) +         
  geom_boxplot() +                
  theme_minimal() +                
  theme(legend.position = "none")

Boxplots with a facet

ggplot(fb, aes(x = Sex,           
               y = PPVT,            
               fill = Sex)) +         
  geom_boxplot() +                
  theme_minimal() +   
  facet_wrap(L1 ~ .) +            
  theme_minimal() + 
  theme(legend.position = "none") 

Boxplots with a facet

ggplot(fb, aes(x = Sex,           
               y = PPVT,            
               fill = Sex)) +         
  geom_boxplot() +                
  theme_minimal() +   
  facet_wrap( ~ School) +            
  theme_minimal() + 
  theme(legend.position = "none") 

Boxplots with a facet

Clustered barplot

sa <- fb |> count(Sex, Attitude) 
ggplot(data=sa, aes(x=Sex,             
                    y=n,               
                    fill=Attitude)) +  
  geom_col(position = "dodge") +       
  theme_minimal()
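Changing the position argument to "fill" stacks the bars to a constant height, showing proportions instead of counts. A sketch with mock counts in the same shape as sa (the numbers are invented for illustration):

```r
library(ggplot2)

sa <- data.frame(Sex      = c("female", "female", "male", "male"),
                 Attitude = c("negative", "positive",
                              "negative", "positive"),
                 n        = c(2, 8, 3, 7))
ggplot(data = sa, aes(x = Sex,
                      y = n,
                      fill = Attitude)) +
  geom_col(position = "fill") +  # bars sum to 1: proportions
  ylab("Proportion") +
  theme_minimal()
```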

Intermezzo: colors

library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)

Your turn

Change the colours of the bars into a color blind friendly Brewer palette.

Show the code
sa <- fb |> count(Sex, Attitude) 
ggplot(data=sa, aes(x=Sex,             
                    y=n,               
                    fill=Attitude)) +  
  geom_col(position = "dodge") +       
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()

Clustered barplot

ssa <- fb |> group_by(School, Sex) |> count(Attitude)
ggplot(data=ssa, aes(x=Sex,             
                    y=n,                
                    fill=Attitude)) +   
  geom_col(position = "dodge") +        
  facet_wrap( ~ School) + 
  theme_minimal()

Clustered barplot

Scatterplot

ggplot(fb, aes(x = PPVT,          
               y = Speaking)) +   
  geom_point()   +                
  theme_minimal()

Scatterplot with regression line

ggplot(fb, aes(x = PPVT,          
               y = Speaking)) +   
  geom_point()   +              
  geom_smooth(method = "lm", col="blue")  +
  theme_minimal()

Scatterplot with smoother

ggplot(fb, aes(x = PPVT,          
               y = Speaking)) +   
  geom_point()   +        
  geom_smooth(method = "loess", col="blue")  +
  theme_minimal()

Scatterplot with smoother and facet

ggplot(fb, aes(x = PPVT,          
               y = Speaking)) +   
  geom_point()   +  
  geom_smooth(method = "loess", col="blue")  +
  facet_grid(Sex ~ L1) + 
  theme_minimal()

Exercise 4.2

Use the ggplot2 package to create the following visualisations for the Multitask dataset:

  1. Visualise Time by Task as a clustered boxplot
  2. Visualise Time by Task and University as overlapping density plots
  3. Add a vertical line to visualise the mean Time for both Tasks.
  4. Use ggplot() to visualise the paired differences in the multitask dataset
  5. Create a new variable Difference, based on the difference in Time between the two Tasks.
  6. Visualise the relation between Speaking and Listening.
  7. Visualise the relation between Speaking and Listening for both Sexes and L1s separately.

References

De Latte, F. (2023). Replication data for: (Im)polite uses of vocatives in present-day Madrilenian Spanish. DataverseNO. https://doi.org/10.18710/FOBMUQ
De Wilde, V., Brysbaert, M., & Eyckmans, J. (2019). Learning English through out-of-school exposure. Which levels of language proficiency are attained and which types of input are important? Bilingualism: Language and Cognition, 23(1), 171–185. https://doi.org/10.1017/s1366728918001062
Farooq, S., Khan, S. afzal, Ahmad, F., Islam, S., & Abid, A. (2014). An evaluation framework and comparative analysis of the widely used first programming languages. PLoS ONE, 9, e88941. https://doi.org/10.1371/journal.pone.0088941
Giorgi, F. M., Ceraolo, C., & Mercatelli, D. (2022). The R language: An engine for bioinformatics and data science. Life, 12(5). https://doi.org/10.3390/life12050648
Harrell Jr, F. E. (2023). Hmisc: Harrell miscellaneous. https://CRAN.R-project.org/package=Hmisc
RStudio Team. (2023). RStudio: Integrated development for R. Posit. https://posit.co/
Verbeke, G., & Simon, E. (2023). Replication Data for: Listening to Accents: Comprehensibility, accentedness and intelligibility of native and non-native English speech (Version V1) [dataset]. DataverseNO. https://doi.org/10.18710/8F0Q0L
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr
Wickham, H., Vaughan, D., & Girlich, M. (2024). tidyr: Tidy messy data. https://CRAN.R-project.org/package=tidyr