3  Tidyverse

3.1 Introduction

Programming languages have this in common with natural languages: users continuously develop variants that over time may evolve into new dialects or languages.

In R, there is currently a popular “dialect” called tidyverse (Wickham et al. 2019), which has gained widespread acceptance due to its consistency and (relative) simplicity. One of the core concepts of the tidyverse is that all data structures and outputs should consistently be represented as “tidy” data, where each row corresponds to a single observation.

Tidyverse has its own “dataframe”, called a “tibble”, which is basically a simplified form of a dataframe. We load the tidyverse:

library(tidyverse)

By doing so we actually load multiple packages. Two packages that we will use in particular are:

  1. dplyr: to explore and manipulate data (Wickham et al. 2023)
  2. tidyr: to transform the data format (long to wide and vice versa) (Wickham, Vaughan, and Girlich 2024).

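To make a tibble we use the tibble package. It is already attached by library(tidyverse), so loading it separately is only needed when the tidyverse has not been loaded.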

library(tibble)        # load the tibble package
LangTest <- tribble(   # tribble() fills the tibble by rows (note the -r-, which is not a typo)
  ~ID,    ~Pretest, ~Posttest,
  "S1",  NA,  13,
  "S2",  13,  14,
  "S3",  15,  14,
  "S4",  8,   12
  )
LangTest
# A tibble: 4 × 3
  ID    Pretest Posttest
  <chr>   <dbl>    <dbl>
1 S1         NA       13
2 S2         13       14
3 S3         15       14
4 S4          8       12
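As a minimal sketch, the same tibble can also be built column by column with tibble() instead of row by row with tribble(). The object name LangTest2 is an assumption made for this illustration. The last two lines check that a tibble is still a dataframe underneath.

```r
library(tibble)

# Column-wise construction with tibble(); same content as the tribble() call above.
# LangTest2 is a hypothetical name used only in this sketch.
LangTest2 <- tibble(
  ID       = c("S1", "S2", "S3", "S4"),
  Pretest  = c(NA, 13, 15, 8),
  Posttest = c(13, 14, 14, 12)
)

class(LangTest2)          # "tbl_df" "tbl" "data.frame"
is.data.frame(LangTest2)  # TRUE: a tibble is a special kind of dataframe
```

Because a tibble inherits from data.frame, functions written for ordinary dataframes generally accept it as well.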

To further illustrate the tidyverse approach we use a dataset by Verbeke and Simon (Verbeke and Simon 2023):

This dataset contains the results from 33 Flemish English as a Foreign Language (EFL) learners, who were exposed to eight native and non-native accents of English. These participants completed (i) a comprehensibility and accentedness rating task, followed by (ii) an orthographic transcription task. In the first task, listeners were asked to rate eight speakers of English on comprehensibility and accentedness on a nine-point scale (1 = easy to understand/no accent; 9 = hard to understand/strong accent). How Accentedness ratings and listeners’ Familiarity with the different accents impacted on their Comprehensibility judgements was measured using a linear mixed-effects model. The orthographic transcription task, then, was used to verify how well listeners actually understood the different accents of English (i.e. intelligibility). To that end, participants’ transcription Accuracy was measured as the number of correctly transcribed words and was estimated using a logistic mixed-effects model. Finally, the relation between listeners’ self-reported ease of understanding the different speakers (comprehensibility) and their actual understanding of the speakers (intelligibility) was assessed using a linear mixed-effects regression. R code for the data analysis is provided.

We will analyse the comprehensibility data from the first task, using the file Listening_to_Accents_Comprehensibility_Accentedness.csv. The codebook for the datafile (taken from the README file) describes every variable:

  • ID: Unique identifier for each line in the dataset.
  • Participant: Unique identifier for each participant.
  • Accent: native and non-native accents of English
    • “GBE” = General British English
    • “GAE” = General American English
    • “NBE” = Northern British English
    • “SAE” = Southern American English
    • “IndEng” = Indian English
    • “NigEng” = Nigerian English
    • “ChinEng” = Chinese English
    • “SpanEng” = Spanish English
  • Comprehensibility: Listeners’ rating of how comprehensible the speaker is on a scale from 1 (= easy to understand) to 9 (= hard to understand)
  • Accentedness: Listeners’ rating of how strong a speaker’s accent is on a scale from 1 (= no accent) to 9 (= strong accent)
  • Familiarity: Listeners’ familiarity with the sampled varieties of English (i.e. never; rarely; sometimes; often; very often)

We open the datafile:

comp <- read.csv("Listening_to_Accents_Comprehensibility_Accentedness.csv", 
                 stringsAsFactors=TRUE)

And summarize it:

summary(comp)
       ID          Participant        Accent   Comprehensibility  Accentedness 
 Min.   :  1.00   Min.   : 7.00   ChinEng:33   Min.   :1.000     Min.   :1.00  
 1st Qu.: 66.75   1st Qu.:33.00   GAE    :33   1st Qu.:2.000     1st Qu.:4.00  
 Median :132.50   Median :46.00   GBE    :33   Median :2.000     Median :6.00  
 Mean   :132.50   Mean   :45.12   IndEng :33   Mean   :2.742     Mean   :5.58  
 3rd Qu.:198.25   3rd Qu.:62.00   NBE    :33   3rd Qu.:4.000     3rd Qu.:7.00  
 Max.   :264.00   Max.   :75.00   NigEng :33   Max.   :8.000     Max.   :9.00  
                                  (Other):66                                   
     Familiarity
 Never     :65  
 Often     :32  
 Rarely    :72  
 Sometimes :56  
 Very Often:39  
                
                

Familiarity is a categorical variable with a natural order, but R sorts its levels alphabetically, so “Often” comes before “Rarely”:

levels(comp$Familiarity)
[1] "Never"      "Often"      "Rarely"     "Sometimes"  "Very Often"
str(comp$Familiarity)
 Factor w/ 5 levels "Never","Often",..: 1 5 2 3 3 1 3 1 1 5 ...

We need to indicate that Familiarity is an ordered factor:

comp$Familiarity <- factor(comp$Familiarity, 
                           ordered = TRUE, 
                           levels = c("Never", 
                                      "Rarely", 
                                      "Sometimes",
                                      "Often",
                                      "Very Often"))
levels(comp$Familiarity)
[1] "Never"      "Rarely"     "Sometimes"  "Often"      "Very Often"
str(comp$Familiarity)
 Ord.factor w/ 5 levels "Never"<"Rarely"<..: 1 5 4 2 2 1 2 1 1 5 ...
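What the ordering buys us can be sketched on a small toy vector (the values in fam are invented for illustration and are not taken from the dataset). With an ordered factor, comparisons such as >= and functions such as max() respect the order of the levels.

```r
# Toy values, invented for this sketch only
fam <- c("Never", "Very Often", "Often", "Rarely")

fam_ord <- factor(fam,
                  ordered = TRUE,
                  levels = c("Never", "Rarely", "Sometimes",
                             "Often", "Very Often"))

# Comparisons follow the level order, not the alphabet
at_least_often <- fam_ord >= "Often"
at_least_often          # FALSE  TRUE  TRUE FALSE

# The highest level that occurs in the vector
as.character(max(fam_ord))   # "Very Often"
```

With an unordered factor the same comparison returns NA with a warning, which is why the conversion matters.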

3.2 Data exploration and manipulation with dplyr

One of the key ideas of the tidyverse approach is to avoid nested functions and to create sequential code, where one function is performed after another. Let’s illustrate this idea with a very simple example, calculating an average.

In base-R we would calculate the average with:

mean(comp$Comprehensibility)
[1] 2.742424

In tidyverse we first “activate” the data and in the next step we calculate the average:

comp |>                                       # choose the comp dataframe (the comp object must already exist)
  summarise(MEAN = mean(Comprehensibility))   # apply summarise(); here we calculate the mean
      MEAN
1 2.742424

The result is a new dataframe with one variable called “MEAN”, the name that we chose in the code. This illustrates a second key idea of the tidyverse approach: the result is a dataframe, not a table, a list, or some other specific data structure. Tidyverse functions consistently return dataframe objects.
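The contrast between nested and sequential code can be sketched on a toy dataframe (the toy object and its score column are invented for illustration). The native pipe |> requires R 4.1 or later and passes the left-hand side as the first argument of the function on the right.

```r
library(dplyr)

# Toy scores, invented for this sketch only
toy <- data.frame(score = c(1, 4, 9, 16))

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(toy$score)), 1)

# Piped call: reads top to bottom; x |> f(y) is evaluated as f(x, y)
piped <- toy$score |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)   # TRUE

# summarise() returns a dataframe, even when the result is a single number
res <- toy |>
  summarise(MEAN = mean(score))
is.data.frame(res)         # TRUE
```

Both versions compute the same value; the piped version simply lists the steps in the order in which they are carried out.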

We can easily extend the summary dataframe with other descriptive statistics.

comp |>                                                     # choose comp
  summarise(mean = mean(Comprehensibility, na.rm = TRUE),   # summarise() calculates multiple summary statistics
            median = median(Comprehensibility, na.rm = TRUE),
            min = min(Comprehensibility, na.rm = TRUE),
            max = max(Comprehensibility, na.rm = TRUE),
            SD = sd(Comprehensibility, na.rm = TRUE))
      mean median min max       SD
1 2.742424      2   1   8 1.567838

The result is a dataframe with multiple columns.

summarise() is one of the basic functions in the dplyr vocabulary. The other key functions are:

  • select() to select columns/variables
  • filter() to select rows
  • mutate() to create new variables
  • group_by() to group the data according to a categorical variable

Let’s look at these functions in turn and how they can be combined to manipulate and explore a dataframe.

3.2.1 group_by()

What is the average Comprehensibility for the eight different accents?

comp |>                                                     # choose comp
  group_by(Accent) |>                                       # group the data according to Accent
  summarise(mean = mean(Comprehensibility, na.rm = TRUE))   # average Comprehensibility per group; na.rm = TRUE is added in case there are NAs
# A tibble: 8 × 2
  Accent   mean
  <fct>   <dbl>
1 ChinEng  3.30
2 GAE      1.12
3 GBE      2.03
4 IndEng   3.61
5 NBE      2.24
6 NigEng   3.94
7 SAE      2.39
8 SpanEng  3.30

The result is a tibble (a simplified dataframe) with two variables.

Now let’s add some more summary statistics.

comp |>                                                
  group_by(Accent) |>                                  
  summarise(mean = mean(Comprehensibility, na.rm = TRUE),         
            median = median(Comprehensibility, na.rm = TRUE),     
            min = min(Comprehensibility, na.rm = TRUE),          
            max = max(Comprehensibility, na.rm = TRUE),           
            SD = sd(Comprehensibility, na.rm = TRUE))             
# A tibble: 8 × 6
  Accent   mean median   min   max    SD
  <fct>   <dbl>  <int> <int> <int> <dbl>
1 ChinEng  3.30      3     1     7 1.42 
2 GAE      1.12      1     1     2 0.331
3 GBE      2.03      2     1     5 0.883
4 IndEng   3.61      3     2     8 1.64 
5 NBE      2.24      2     1     6 1.28 
6 NigEng   3.94      4     1     7 1.62 
7 SAE      2.39      2     1     8 1.52 
8 SpanEng  3.30      3     1     6 1.24 

We can easily add Familiarity as an extra grouping variable (dropping some summary statistics for the sake of brevity):

comp |>                                               
  group_by(Accent, Familiarity) |>                   
  summarise(mean = mean(Comprehensibility, na.rm = TRUE),        
            SD = sd(Comprehensibility, na.rm = TRUE))            
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by Accent and Familiarity.
ℹ Output is grouped by Accent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(Accent, Familiarity))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
# A tibble: 33 × 4
# Groups:   Accent [8]
   Accent  Familiarity  mean     SD
   <fct>   <ord>       <dbl>  <dbl>
 1 ChinEng Never        3.55  1.43 
 2 ChinEng Rarely       3     1.32 
 3 ChinEng Sometimes    3.33  1.53 
 4 ChinEng Often        1    NA    
 5 GAE     Sometimes    1     0    
 6 GAE     Often        1.43  0.535
 7 GAE     Very Often   1.04  0.204
 8 GBE     Never        2     0    
 9 GBE     Rarely       3    NA    
10 GBE     Sometimes    2.17  0.408
# ℹ 23 more rows
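The message printed by summarise() points to the .groups argument and to .by (the latter assumes dplyr 1.1.0 or later). A minimal sketch on invented toy data shows both options. It also adds n = n() to count the observations per cell, which explains the NA values in the SD column: the standard deviation of a single observation is undefined.

```r
library(dplyr)

# Toy data, invented for this sketch only
toy <- data.frame(
  Accent            = c("A", "A", "A", "B", "B", "B"),
  Familiarity       = c("Never", "Never", "Often", "Never", "Often", "Often"),
  Comprehensibility = c(3, 5, 1, 4, 2, 2)
)

# Option 1: group_by() plus .groups = "drop" returns an ungrouped result
by_both <- toy |>
  group_by(Accent, Familiarity) |>
  summarise(mean = mean(Comprehensibility),
            SD   = sd(Comprehensibility),
            n    = n(),
            .groups = "drop")

is_grouped_df(by_both)   # FALSE
by_both                  # SD is NA wherever n equals 1

# Option 2: per-operation grouping with .by, no group_by() needed
by_both2 <- toy |>
  summarise(mean = mean(Comprehensibility),
            .by = c(Accent, Familiarity))
```

Either option silences the message. Which one to prefer is largely a matter of taste; .by keeps the grouping local to a single summarise() call.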

One might wonder whether this could also be done in base-R. Well, try this:

aggregate(Comprehensibility ~ Accent + Familiarity, 
          data = comp, 
          FUN = function(x) c(MEAN = mean(x), SD = sd(x)))
    Accent Familiarity Comprehensibility.MEAN Comprehensibility.SD
1  ChinEng       Never              3.5500000            1.4317821
2      GBE       Never              2.0000000            0.0000000
3   IndEng       Never              4.3333333            1.9663842
4      NBE       Never              1.0000000                   NA
5   NigEng       Never              3.8571429            1.7113069
6      SAE       Never              3.5000000            0.7071068
7  SpanEng       Never              3.3076923            1.4366985
8  ChinEng      Rarely              3.0000000            1.3228757
9      GBE      Rarely              3.0000000                   NA
10  IndEng      Rarely              3.7500000            1.5852943
11     NBE      Rarely              2.1111111            0.9279607
12  NigEng      Rarely              4.1000000            1.6633300
13     SAE      Rarely              2.9000000            1.2866839
14 SpanEng      Rarely              3.2307692            1.0127394
15 ChinEng   Sometimes              3.3333333            1.5275252
16     GAE   Sometimes              1.0000000            0.0000000
17     GBE   Sometimes              2.1666667            0.4082483
18  IndEng   Sometimes              2.1666667            0.4082483
19     NBE   Sometimes              2.6250000            1.5000000
20  NigEng   Sometimes              4.0000000            0.0000000
21     SAE   Sometimes              2.1250000            1.7841898
22 SpanEng   Sometimes              3.8000000            1.4832397
23 ChinEng       Often              1.0000000                   NA
24     GAE       Often              1.4285714            0.5345225
25     GBE       Often              2.1538462            1.2810252
26  IndEng       Often              5.0000000                   NA
27     NBE       Often              1.0000000            0.0000000
28     SAE       Often              2.0000000            0.8164966
29 SpanEng       Often              2.5000000            0.7071068
30     GAE  Very Often              1.0416667            0.2041241
31     GBE  Very Often              1.7272727            0.4670994
32     NBE  Very Often              2.6666667            0.5773503
33     SAE  Very Often              1.0000000                   NA

So yes, it can quite easily be done, but the popularity of the dplyr approach has cast its shadow on the more “traditional” ways of using R. 1

3.2.2 filter() and select()

We select the rows in which Participants report being familiar with the accent they hear (i.e., Familiarity equal to “Often” or “Very Often”) and check their mean Comprehensibility scores. Note that the code below does not restrict the rows to a single accent.

comp |>                    
  filter(Familiarity %in% c("Often", "Very Often")) |>    
  select(Participant, Comprehensibility) |> 
  group_by(Participant) |>
  summarise(mean = mean(Comprehensibility, na.rm = TRUE))
# A tibble: 32 × 2
   Participant  mean
         <int> <dbl>
 1           7  3   
 2          10  2   
 3          13  1.33
 4          15  1   
 5          18  1.75
 6          21  1.25
 7          22  1.67
 8          31  1   
 9          33  2   
10          34  1.5 
# ℹ 22 more rows
  1. Choose comp,
  2. keep the rows where Familiarity is “Often” or “Very Often”,
  3. select the columns Participant and Comprehensibility,
  4. group the data by Participant,
  5. calculate the mean Comprehensibility for every Participant.
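filter() accepts several conditions at once and combines them with a logical AND. As a sketch on invented toy data (not the real dataset), restricting the rows to one accent as well as to high familiarity could look like this:

```r
library(dplyr)

# Toy data, invented for this sketch only
toy <- data.frame(
  Participant       = c(1, 1, 2, 2, 3, 3),
  Accent            = c("SpanEng", "GAE", "SpanEng", "GAE", "SpanEng", "GAE"),
  Familiarity       = c("Often", "Very Often", "Never", "Often", "Very Often", "Rarely"),
  Comprehensibility = c(2, 1, 4, 1, 3, 2)
)

span_familiar <- toy |>
  filter(Accent == "SpanEng",                          # condition 1: one accent only
         Familiarity %in% c("Often", "Very Often")) |> # condition 2: high familiarity
  select(Participant, Comprehensibility)               # keep two columns

span_familiar   # rows for Participants 1 and 3
```

Separating conditions with a comma is equivalent to joining them with &.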

3.2.3 mutate()

mutate() allows you to create new variables. For instance, here we calculate a z-score for Comprehensibility.

comp |>                                            
  mutate(Comprehensibility_z = (Comprehensibility-mean(Comprehensibility))/sd(Comprehensibility)) 
     ID Participant  Accent Comprehensibility Accentedness Familiarity
1     1           7 ChinEng                 3            7       Never
2     2           7     GAE                 1            1  Very Often
3     3           7     GBE                 5            6       Often
4     4           7  IndEng                 3            8      Rarely
5     5           7     NBE                 4            4      Rarely
6     6           7  NigEng                 2            4       Never
7     7           7     SAE                 2            7      Rarely
8     8           7 SpanEng                 2            6       Never
9     9          10 ChinEng                 5            8       Never
10   10          10     GAE                 1            1  Very Often
11   11          10     GBE                 3            4       Often
12   12          10  IndEng                 7            7      Rarely
13   13          10     NBE                 2            3      Rarely
14   14          10  NigEng                 6            7       Never
15   15          10     SAE                 3            6      Rarely
16   16          10 SpanEng                 4            5      Rarely
17   17          13 ChinEng                 3            5       Never
18   18          13     GAE                 1            1  Very Often
19   19          13     GBE                 1            4       Often
20   20          13  IndEng                 2            6      Rarely
21   21          13     NBE                 1            5   Sometimes
22   22          13  NigEng                 2            7       Never
23   23          13     SAE                 1            6   Sometimes
24   24          13 SpanEng                 2            5       Often
25   25          15 ChinEng                 2            6       Never
26   26          15     GAE                 1            1  Very Often
27   27          15     GBE                 1            3       Often
28   28          15  IndEng                 2            5       Never
29   29          15     NBE                 1            7      Rarely
30   30          15  NigEng                 5            5       Never
31   31          15     SAE                 2            8   Sometimes
32   32          15 SpanEng                 3            5       Never
33   33          18 ChinEng                 3            3       Never
34   34          18     GAE                 1            2  Very Often
35   35          18     GBE                 2            3  Very Often
36   36          18  IndEng                 3            6       Never
37   37          18     NBE                 1            2       Often
38   38          18  NigEng                 4            5       Never
39   39          18     SAE                 3            4       Often
40   40          18 SpanEng                 3            5       Never
41   41          21 ChinEng                 1            3       Often
42   42          21     GAE                 1            1  Very Often
43   43          21     GBE                 2            5  Very Often
44   44          21  IndEng                 2            4   Sometimes
45   45          21     NBE                 1            1       Often
46   46          21  NigEng                 1            3      Rarely
47   47          21     SAE                 1            3   Sometimes
48   48          21 SpanEng                 1            4      Rarely
49   49          22 ChinEng                 5            8       Never
50   50          22     GAE                 1            4  Very Often
51   51          22     GBE                 2            1       Often
52   52          22  IndEng                 3            7      Rarely
53   53          22     NBE                 3            6   Sometimes
54   54          22  NigEng                 4            8      Rarely
55   55          22     SAE                 2            7       Often
56   56          22 SpanEng                 4            8   Sometimes
57   57          31 ChinEng                 4            7       Never
58   58          31     GAE                 1            1  Very Often
59   59          31     GBE                 1            2  Very Often
60   60          31  IndEng                 4            8      Rarely
61   61          31     NBE                 2            4   Sometimes
62   62          31  NigEng                 2            5       Never
63   63          31     SAE                 1            3       Often
64   64          31 SpanEng                 4            6   Sometimes
65   65          33 ChinEng                 7            7       Never
66   66          33     GAE                 2            6       Often
67   67          33     GBE                 3            6   Sometimes
68   68          33  IndEng                 6            7       Never
69   69          33     NBE                 6            7   Sometimes
70   70          33  NigEng                 7            8       Never
71   71          33     SAE                 8            7   Sometimes
72   72          33 SpanEng                 6            8       Never
73   73          34 ChinEng                 3            7       Never
74   74          34     GAE                 1            1  Very Often
75   75          34     GBE                 2            3  Very Often
76   76          34  IndEng                 2            5      Rarely
77   77          34     NBE                 2            8  Very Often
78   78          34  NigEng                 3            8      Rarely
79   79          34     SAE                 1            5  Very Often
80   80          34 SpanEng                 3            6       Never
81   81          35 ChinEng                 5            4       Never
82   82          35     GAE                 1            1  Very Often
83   83          35     GBE                 2            2  Very Often
84   84          35  IndEng                 3            6      Rarely
85   85          35     NBE                 3            3   Sometimes
86   86          35  NigEng                 5            5       Never
87   87          35     SAE                 2            4   Sometimes
88   88          35 SpanEng                 3            5       Never
89   89          36 ChinEng                 2            9      Rarely
90   90          36     GAE                 1            4  Very Often
91   91          36     GBE                 2            8  Very Often
92   92          36  IndEng                 2            8   Sometimes
93   93          36     NBE                 1            8       Never
94   94          36  NigEng                 2            9       Never
95   95          36     SAE                 1            8   Sometimes
96   96          36 SpanEng                 2            9   Sometimes
97   97          38 ChinEng                 3            7       Never
98   98          38     GAE                 1            2  Very Often
99   99          38     GBE                 2            4   Sometimes
100 100          38  IndEng                 5            7      Rarely
101 101          38     NBE                 2            3      Rarely
102 102          38  NigEng                 3            5       Never
103 103          38     SAE                 2            3   Sometimes
104 104          38 SpanEng                 3            5      Rarely
105 105          39 ChinEng                 3            7       Never
106 106          39     GAE                 1            3  Very Often
107 107          39     GBE                 1            2  Very Often
108 108          39  IndEng                 3            9      Rarely
109 109          39     NBE                 1            3   Sometimes
110 110          39  NigEng                 4            8       Never
111 111          39     SAE                 1            9      Rarely
112 112          39 SpanEng                 3            9       Never
113 113          40 ChinEng                 2            5      Rarely
114 114          40     GAE                 1            3       Often
115 115          40     GBE                 1            6       Often
116 116          40  IndEng                 3            6      Rarely
117 117          40     NBE                 2            5   Sometimes
118 118          40  NigEng                 2            7       Never
119 119          40     SAE                 1            8      Rarely
120 120          40 SpanEng                 3            7      Rarely
121 121          44 ChinEng                 3            9      Rarely
122 122          44     GAE                 1            1       Often
123 123          44     GBE                 2            9   Sometimes
124 124          44  IndEng                 2            9      Rarely
125 125          44     NBE                 3            9      Rarely
126 126          44  NigEng                 4            9      Rarely
127 127          44     SAE                 2            9   Sometimes
128 128          44 SpanEng                 3            9      Rarely
129 129          46 ChinEng                 4            7       Never
130 130          46     GAE                 1            2  Very Often
131 131          46     GBE                 3            5       Often
132 132          46  IndEng                 7            8       Never
133 133          46     NBE                 2            4   Sometimes
134 134          46  NigEng                 6            7       Never
135 135          46     SAE                 5            7      Rarely
136 136          46 SpanEng                 6            8       Never
137 137          48 ChinEng                 1            3      Rarely
138 138          48     GAE                 1            1  Very Often
139 139          48     GBE                 1            1       Often
140 140          48  IndEng                 4            6      Rarely
141 141          48     NBE                 1            1      Rarely
142 142          48  NigEng                 2            8       Never
143 143          48     SAE                 1            3   Sometimes
144 144          48 SpanEng                 4            5      Rarely
145 145          50 ChinEng                 1            6       Never
146 146          50     GAE                 1            2  Very Often
147 147          50     GBE                 1            3  Very Often
148 148          50  IndEng                 3            9       Never
149 149          50     NBE                 2            9   Sometimes
150 150          50  NigEng                 2            8       Never
151 151          50     SAE                 1            7   Sometimes
152 152          50 SpanEng                 1            8       Never
153 153          54 ChinEng                 3            5       Never
154 154          54     GAE                 1            2  Very Often
155 155          54     GBE                 2            1  Very Often
156 156          54  IndEng                 4            8      Rarely
157 157          54     NBE                 3            3   Sometimes
158 158          54  NigEng                 7            9      Rarely
159 159          54     SAE                 1            9   Sometimes
160 160          54 SpanEng                 5            9      Rarely
161 161          56 ChinEng                 3            8       Never
162 162          56     GAE                 2            6       Often
163 163          56     GBE                 1            6       Often
164 164          56  IndEng                 4            8      Rarely
165 165          56     NBE                 2            7      Rarely
166 166          56  NigEng                 3            9       Never
167 167          56     SAE                 3            8   Sometimes
168 168          56 SpanEng                 2            8       Never
169 169          58 ChinEng                 4            4       Never
170 170          58     GAE                 2            2       Often
171 171          58     GBE                 2            4       Often
172 172          58  IndEng                 5            7       Never
173 173          58     NBE                 2            3      Rarely
174 174          58  NigEng                 4            6       Never
175 175          58     SAE                 4            5      Rarely
176 176          58 SpanEng                 3            6      Rarely
177 177          59 ChinEng                 1            5       Never
178 178          59     GAE                 1            4  Very Often
179 179          59     GBE                 4            7       Often
180 180          59  IndEng                 2            6      Rarely
181 181          59     NBE                 2            7   Sometimes
182 182          59  NigEng                 3            5       Never
183 183          59     SAE                 2            6   Sometimes
184 184          59 SpanEng                 3            5       Never
185 185          61 ChinEng                 3            7      Rarely
186 186          61     GAE                 1            1  Very Often
187 187          61     GBE                 2            6       Often
188 188          61  IndEng                 2            7   Sometimes
189 189          61     NBE                 2            6   Sometimes
190 190          61  NigEng                 5            8      Rarely
191 191          61     SAE                 2            6   Sometimes
192 192          61 SpanEng                 2            6      Rarely
193 193          62 ChinEng                 3            6      Rarely
194 194          62     GAE                 1            2  Very Often
195 195          62     GBE                 2            2  Very Often
196 196          62  IndEng                 5            8      Rarely
197 197          62     NBE                 2            5   Sometimes
198 198          62  NigEng                 3            5      Rarely
199 199          62     SAE                 3            8      Rarely
200 200          62 SpanEng                 3            6      Rarely
201 201          65 ChinEng                 4            7       Never
202 202          65     GAE                 1            2  Very Often
203 203          65     GBE                 2            4       Often
204 204          65  IndEng                 3            6      Rarely
205 205          65     NBE                 2            4      Rarely
206 206          65  NigEng                 5            7       Never
207 207          65     SAE                 4            7   Sometimes
208 208          65 SpanEng                 3            6      Rarely
209 209          66 ChinEng                 3            8      Rarely
210 210          66     GAE                 1            1  Very Often
211 211          66     GBE                 2            6   Sometimes
212 212          66  IndEng                 4            8      Rarely
213 213          66     NBE                 1            7   Sometimes
214 214          66  NigEng                 6            9      Rarely
215 215          66     SAE                 1            7   Sometimes
216 216          66 SpanEng                 3            6   Sometimes
217 217          67 ChinEng                 3            7   Sometimes
218 218          67     GAE                 1            6  Very Often
219 219          67     GBE                 2            8  Very Often
220 220          67  IndEng                 2            7   Sometimes
221 221          67     NBE                 1            5       Often
222 222          67  NigEng                 4            9   Sometimes
223 223          67     SAE                 2            8       Often
224 224          67 SpanEng                 3            9       Often
225 225          70     GBE                 2            3   Sometimes
226 226          70     GAE                 1            1       Often
227 227          70     NBE                 1            6       Often
228 228          70     SAE                 3            5      Rarely
229 229          70  IndEng                 2            7   Sometimes
230 230          70  NigEng                 4            7      Rarely
231 231          70 ChinEng                 2            6   Sometimes
232 232          70 SpanEng                 4            7      Rarely
233 233          71     GBE                 2            2       Never
234 234          71     GAE                 1            4   Sometimes
235 235          71     NBE                 3            6  Very Often
236 236          71     SAE                 4            7       Never
237 237          71  IndEng                 4            7      Rarely
238 238          71  NigEng                 6            8       Never
239 239          71 ChinEng                 5            8       Never
240 240          71 SpanEng                 4            6       Never
241 241          72     GBE                 2            5   Sometimes
242 242          72     GAE                 1            3       Often
243 243          72     NBE                 5            6   Sometimes
244 244          72     SAE                 4            6      Rarely
245 245          72  IndEng                 3            5   Sometimes
246 246          72  NigEng                 4            6   Sometimes
247 247          72 ChinEng                 5            8      Rarely
248 248          72 SpanEng                 6            8   Sometimes
249 249          73     GBE                 3            4      Rarely
250 250          73     GAE                 2            3  Very Often
251 251          73     NBE                 3            4  Very Often
252 252          73     SAE                 3            5      Rarely
253 253          73  IndEng                 5            6       Often
254 254          73  NigEng                 4            5      Rarely
255 255          73 ChinEng                 5            7   Sometimes
256 256          73 SpanEng                 4            5      Rarely
257 257          75     GBE                 2            2       Never
258 258          75     GAE                 1            1   Sometimes
259 259          75     NBE                 5            6   Sometimes
260 260          75     SAE                 3            7       Never
261 261          75  IndEng                 8            8      Rarely
262 262          75  NigEng                 6            8       Never
263 263          75 ChinEng                 5            8      Rarely
264 264          75 SpanEng                 4            7       Never
    Comprehensibility_z
1             0.1642872
2            -1.1113545
3             1.4399288
4             0.1642872
5             0.8021080
6            -0.4735336
7            -0.4735336
8            -0.4735336
9             1.4399288
10           -1.1113545
11            0.1642872
12            2.7155705
13           -0.4735336
14            2.0777497
15            0.1642872
16            0.8021080
17            0.1642872
18           -1.1113545
19           -1.1113545
20           -0.4735336
21           -1.1113545
22           -0.4735336
23           -1.1113545
24           -0.4735336
25           -0.4735336
26           -1.1113545
27           -1.1113545
28           -0.4735336
29           -1.1113545
30            1.4399288
31           -0.4735336
32            0.1642872
33            0.1642872
34           -1.1113545
35           -0.4735336
36            0.1642872
37           -1.1113545
38            0.8021080
39            0.1642872
40            0.1642872
41           -1.1113545
42           -1.1113545
43           -0.4735336
44           -0.4735336
45           -1.1113545
46           -1.1113545
47           -1.1113545
48           -1.1113545
49            1.4399288
50           -1.1113545
51           -0.4735336
52            0.1642872
53            0.1642872
54            0.8021080
55           -0.4735336
56            0.8021080
57            0.8021080
58           -1.1113545
59           -1.1113545
60            0.8021080
61           -0.4735336
62           -0.4735336
63           -1.1113545
64            0.8021080
65            2.7155705
66           -0.4735336
67            0.1642872
68            2.0777497
69            2.0777497
70            2.7155705
71            3.3533913
72            2.0777497
73            0.1642872
74           -1.1113545
75           -0.4735336
76           -0.4735336
77           -0.4735336
78            0.1642872
79           -1.1113545
80            0.1642872
81            1.4399288
82           -1.1113545
83           -0.4735336
84            0.1642872
85            0.1642872
86            1.4399288
87           -0.4735336
88            0.1642872
89           -0.4735336
90           -1.1113545
91           -0.4735336
92           -0.4735336
93           -1.1113545
94           -0.4735336
95           -1.1113545
96           -0.4735336
97            0.1642872
98           -1.1113545
99           -0.4735336
100           1.4399288
101          -0.4735336
102           0.1642872
103          -0.4735336
104           0.1642872
105           0.1642872
106          -1.1113545
107          -1.1113545
108           0.1642872
109          -1.1113545
110           0.8021080
111          -1.1113545
112           0.1642872
113          -0.4735336
114          -1.1113545
115          -1.1113545
116           0.1642872
117          -0.4735336
118          -0.4735336
119          -1.1113545
120           0.1642872
121           0.1642872
122          -1.1113545
123          -0.4735336
124          -0.4735336
125           0.1642872
126           0.8021080
127          -0.4735336
128           0.1642872
129           0.8021080
130          -1.1113545
131           0.1642872
132           2.7155705
133          -0.4735336
134           2.0777497
135           1.4399288
136           2.0777497
137          -1.1113545
138          -1.1113545
139          -1.1113545
140           0.8021080
141          -1.1113545
142          -0.4735336
143          -1.1113545
144           0.8021080
145          -1.1113545
146          -1.1113545
147          -1.1113545
148           0.1642872
149          -0.4735336
150          -0.4735336
151          -1.1113545
152          -1.1113545
153           0.1642872
154          -1.1113545
155          -0.4735336
156           0.8021080
157           0.1642872
158           2.7155705
159          -1.1113545
160           1.4399288
161           0.1642872
162          -0.4735336
163          -1.1113545
164           0.8021080
165          -0.4735336
166           0.1642872
167           0.1642872
168          -0.4735336
169           0.8021080
170          -0.4735336
171          -0.4735336
172           1.4399288
173          -0.4735336
174           0.8021080
175           0.8021080
176           0.1642872
177          -1.1113545
178          -1.1113545
179           0.8021080
180          -0.4735336
181          -0.4735336
182           0.1642872
183          -0.4735336
184           0.1642872
185           0.1642872
186          -1.1113545
187          -0.4735336
188          -0.4735336
189          -0.4735336
190           1.4399288
191          -0.4735336
192          -0.4735336
193           0.1642872
194          -1.1113545
195          -0.4735336
196           1.4399288
197          -0.4735336
198           0.1642872
199           0.1642872
200           0.1642872
201           0.8021080
202          -1.1113545
203          -0.4735336
204           0.1642872
205          -0.4735336
206           1.4399288
207           0.8021080
208           0.1642872
209           0.1642872
210          -1.1113545
211          -0.4735336
212           0.8021080
213          -1.1113545
214           2.0777497
215          -1.1113545
216           0.1642872
217           0.1642872
218          -1.1113545
219          -0.4735336
220          -0.4735336
221          -1.1113545
222           0.8021080
223          -0.4735336
224           0.1642872
225          -0.4735336
226          -1.1113545
227          -1.1113545
228           0.1642872
229          -0.4735336
230           0.8021080
231          -0.4735336
232           0.8021080
233          -0.4735336
234          -1.1113545
235           0.1642872
236           0.8021080
237           0.8021080
238           2.0777497
239           1.4399288
240           0.8021080
241          -0.4735336
242          -1.1113545
243           1.4399288
244           0.8021080
245           0.1642872
246           0.8021080
247           1.4399288
248           2.0777497
249           0.1642872
250          -0.4735336
251           0.1642872
252           0.1642872
253           1.4399288
254           0.8021080
255           1.4399288
256           0.8021080
257          -0.4735336
258          -1.1113545
259           1.4399288
260           0.1642872
261           3.3533913
262           2.0777497
263           1.4399288
264           0.8021080

Other important functions from the tidyverse vocabulary include:

  • drop_na() to drop rows with missing values
  • count() to count observations per category, similar to the table() function

I illustrate each in turn using a very simple dataframe:

mockdf <- tibble(
  ID = 1:6,
  Size = c("a", "a", "a", "b", "b", "b"), 
  Score = c(23,34,23,NA,27,4),
  Group = c("first", "second", "second", "second", "first", NA))
mockdf
# A tibble: 6 × 4
     ID Size  Score Group 
  <int> <chr> <dbl> <chr> 
1     1 a        23 first 
2     2 a        34 second
3     3 a        23 second
4     4 b        NA second
5     5 b        27 first 
6     6 b         4 <NA>  

Drop all rows that contain at least one NA:

clean_df <- mockdf |> drop_na()
clean_df
# A tibble: 4 × 4
     ID Size  Score Group 
  <int> <chr> <dbl> <chr> 
1     1 a        23 first 
2     2 a        34 second
3     3 a        23 second
4     5 b        27 first 
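
By default, drop_na() checks every column, so the row with a missing Group is removed as well. As a minimal sketch (mockdf is rebuilt so the example runs on its own, and score_only is just an illustrative name), you can pass column names to restrict the check to those columns:

```r
library(tibble)
library(tidyr)

mockdf <- tibble(
  ID = 1:6,
  Size = c("a", "a", "a", "b", "b", "b"),
  Score = c(23, 34, 23, NA, 27, 4),
  Group = c("first", "second", "second", "second", "first", NA))

# Only drop rows where Score is missing; the row with a missing Group is kept
score_only <- mockdf |> drop_na(Score)
score_only
```

This is useful when a missing value in one variable should not cost you the observations in another.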

Count the observations per Group:

mockdf |> count(Group)
# A tibble: 3 × 2
  Group      n
  <chr>  <int>
1 first      2
2 second     3
3 <NA>       1

Count the observations for each combination of Group and Size:

mockdf |> group_by(Group, Size) |> count()
# A tibble: 5 × 3
# Groups:   Group, Size [5]
  Group  Size      n
  <chr>  <chr> <int>
1 first  a         1
2 first  b         1
3 second a         2
4 second b         1
5 <NA>   b         1
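
As a shorthand, count() also accepts several variables directly. The sketch below (again rebuilding mockdf; counts is an illustrative name) should give the same counts as the group_by() version, with the difference that the result is not grouped:

```r
library(tibble)
library(dplyr)

mockdf <- tibble(
  ID = 1:6,
  Size = c("a", "a", "a", "b", "b", "b"),
  Score = c(23, 34, 23, NA, 27, 4),
  Group = c("first", "second", "second", "second", "first", NA))

# Count each Group-by-Size combination in one call, without group_by()
counts <- mockdf |> count(Group, Size)
counts
```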

This is of course similar to the base R table() function, which, however, leaves out missing values by default:

table(mockdf$Size, mockdf$Group)
   
    first second
  a     1      2
  b     1      1

Or, if you prefer a dataframe format:

as.data.frame(xtabs(~ Group + Size, data = mockdf))
   Group Size Freq
1  first    a    1
2 second    a    2
3  first    b    1
4 second    b    1
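
Note that the base R tabulations above leave out the observation with a missing Group, whereas count() reports it. If you want table() to show missing values too, one option is its useNA argument. A minimal sketch, rebuilding mockdf (tab is an illustrative name):

```r
library(tibble)

mockdf <- tibble(
  ID = 1:6,
  Size = c("a", "a", "a", "b", "b", "b"),
  Score = c(23, 34, 23, NA, 27, 4),
  Group = c("first", "second", "second", "second", "first", NA))

# useNA = "ifany" adds an NA category whenever missing values occur
tab <- table(mockdf$Size, mockdf$Group, useNA = "ifany")
tab
```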

3.3 Data transformation

Compare the two tables in Table 3.1.

Table 3.1: The long and wide data format
(a) The long data format

Student  Test      Score
1        pretest       8
1        posttest     12
2        pretest      12
2        posttest     14
3        pretest       9
3        posttest      8

(b) The wide data format

Student  Pretest  Posttest
1              8        12
2             12        14
3              9         8

In the long data format, each observation (one score at one test moment) is given its own row, resulting in two rows per student. This format includes a binary categorical variable Test and a continuous variable Score. In the wide data format, each row contains all the data for one student, with the two test moments (Pretest and Posttest) in separate columns. This wide format is common in paired and longitudinal data, where multiple measurements are taken on the same unit.

Using the correct data format is crucial for accurate data analysis and visualization. Therefore, being able to switch between these formats without re-entering all data is essential. Of course, you could copy-paste the columns in a spreadsheet, but using code to “pivot” the data is more efficient, in terms of speed and replicability.

To illustrate the data transformation process, we create a small dataframe in the wide format:

df <- data.frame(Student = c(1, 2, 3),
                 Pretest = c(8, 12, 9),
                 Posttest = c(12, 14, 8))
df
  Student Pretest Posttest
1       1       8       12
2       2      12       14
3       3       9        8

We now transform the data to the long format using pivot_longer():

df_long <- df |>                              # create a new dataset df_long
  pivot_longer(cols = c(Pretest, Posttest),   # cols: the columns to pivot
               names_to = "Test",             # names_to: new column holding the former column names
               values_to = "Score")           # values_to: new column holding the values
df_long
# A tibble: 6 × 3
  Student Test     Score
    <dbl> <chr>    <dbl>
1       1 Pretest      8
2       1 Posttest    12
3       2 Pretest     12
4       2 Posttest    14
5       3 Pretest      9
6       3 Posttest     8
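
The long format pays off as soon as you want to summarise by test moment. The following sketch is one way to do this, not the only one. It rebuilds df so that it runs on its own, uses the selection !Student ("all columns except Student") as an alternative to listing the columns, and then computes the mean Score per Test; means is an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Student = c(1, 2, 3),
                 Pretest = c(8, 12, 9),
                 Posttest = c(12, 14, 8))

# Pivot every column except Student into name/value pairs
df_long <- df |>
  pivot_longer(cols = !Student,
               names_to = "Test",
               values_to = "Score")

# One row per test moment, with the mean score across students
means <- df_long |>
  group_by(Test) |>
  summarise(Mean = mean(Score))
means
```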

To transform the long dataset back to the wide format, use pivot_wider():

df_wide <- df_long |>                 # name for the transformed dataset
  pivot_wider(id_cols = Student,      # id_cols: the clustering variable that identifies each row
              names_from = Test,      # names_from: the variable whose categories become new columns
              values_from = Score)    # values_from: the variable from which the values are taken
df_wide
# A tibble: 3 × 3
  Student Pretest Posttest
    <dbl>   <dbl>    <dbl>
1       1       8       12
2       2      12       14
3       3       9        8
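
The wide format, in turn, is convenient when you need to compare the two measurements of the same student. As a hedged sketch (the Gain column and the object name gains are my own additions for illustration), the code below repeats the round trip from wide to long and back, computes the difference between Posttest and Pretest, and checks that the round trip reproduces the original data:

```r
library(dplyr)
library(tidyr)

df <- data.frame(Student = c(1, 2, 3),
                 Pretest = c(8, 12, 9),
                 Posttest = c(12, 14, 8))

# Wide -> long -> wide
df_long <- df |>
  pivot_longer(cols = c(Pretest, Posttest),
               names_to = "Test",
               values_to = "Score")
df_wide <- df_long |>
  pivot_wider(id_cols = Student,
              names_from = Test,
              values_from = Score)

# In the wide format the gain per student is a simple column difference
gains <- df_wide |> mutate(Gain = Posttest - Pretest)
gains

# Compare the round-tripped data with the original dataframe
all.equal(as.data.frame(df_wide), df)
```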

3.4 Exercises

Exercise 3.1

Use tidyverse to answer the following questions about the multitask dataset.

  1. What is the average Time by Task?
  2. Who are the three fastest international participants? Give the names of the Participants in order, starting with the fastest one.

Exercise 3.2

  1. Transform the multitask dataset from the long to the wide data format.

  1. For an “opinionated” comparison between tidyverse and base R, see TidyverseSkeptic.