3 Tidyverse
3.1 Introduction
Programming languages have this in common with natural languages: users continuously develop variants that over time may evolve into new dialects or languages.
In R, there is currently a popular “dialect” called tidyverse (Wickham et al. 2019), which has gained widespread acceptance due to its consistency and (relative) simplicity. One of the core concepts of the tidyverse is that all data structures and outputs should consistently be represented as “tidy” data, where each row corresponds to a single observation.
Tidyverse has its own “dataframe”, called a “tibble”, which is basically a simplified form of a dataframe. We open tidyverse:
library(tidyverse)
By doing so we actually open multiple packages. Two packages that we will use in particular are:
- dplyr: to explore and manipulate data (Wickham et al. 2023)
- tidyr: to transform the data format (long to wide and vice versa) (Wickham, Vaughan, and Girlich 2024)
To make a tibble we also need the library tibble.
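A call consistent with the annotated steps and the tibble shown below might look like this (a sketch; the object name dat is illustrative, not from the original):

```r
library(tibble)

# Build the tibble row by row with tribble(); column names are given
# as formulas (~ID), and the values follow in row order.
dat <- tribble(
  ~ID,  ~Pretest, ~Posttest,
  "S1",       NA,        13,
  "S2",       13,        14,
  "S3",       15,        14,
  "S4",        8,        12)
dat
```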
1. Load the tibble package.
2. tribble() means we fill the tibble by rows (note the -r-, which is not a typo).
# A tibble: 4 × 3
ID Pretest Posttest
<chr> <dbl> <dbl>
1 S1 NA 13
2 S2 13 14
3 S3 15 14
4 S4 8 12
To further illustrate the tidyverse approach we use a dataset by Verbeke and Simon (2023):
This dataset contains the results from 33 Flemish English as a Foreign Language (EFL) learners, who were exposed to eight native and non-native accents of English. These participants completed (i) a comprehensibility and accentedness rating task, followed by (ii) an orthographic transcription task. In the first task, listeners were asked to rate eight speakers of English on comprehensibility and accentedness on a nine-point scale (1 = easy to understand/no accent; 9 = hard to understand/strong accent). How Accentedness ratings and listeners’ Familiarity with the different accents impacted on their Comprehensibility judgements was measured using a linear mixed-effects model. The orthographic transcription task, then, was used to verify how well listeners actually understood the different accents of English (i.e. intelligibility). To that end, participants’ transcription Accuracy was measured as the number of correctly transcribed words and was estimated using a logistic mixed-effects model. Finally, the relation between listeners’ self-reported ease of understanding the different speakers (comprehensibility) and their actual understanding of the speakers (intelligibility) was assessed using a linear mixed-effects regression. R code for the data analysis is provided.
We will analyse the comprehensibility data based on the first task, using the file: Listening_to_Accents_Comprehensibility_Accentedness.csv. The codebook provides information for every variable:
The codebook for the datafile (taken from the README file):
- ID: Unique identifier for each line in the dataset.
- Participant: Unique identifier for each participant.
- Accent: native and non-native accents of English
- “GBE” = General British English
- “GAE” = General American English
- “NBE” = Northern British English
- “SAE” = Southern American English
- “IndEng” = Indian English
- “NigEng” = Nigerian English
- “ChinEng” = Chinese English
- “SpanEng” = Spanish English
- Comprehensibility: Listeners’ rating of how comprehensible the speaker is on a scale from 1 (= easy to understand) to 9 (= hard to understand)
- Accentedness: Listeners’ rating of how strong a speaker’s accent is on a scale from 1 (= no accent) to 9 (= strong accent)
- Familiarity: Listeners’ familiarity with the sampled varieties of English (i.e. never; rarely; sometimes; often; very often)
We open the datafile:
comp <- read.csv("Listening_to_Accents_Comprehensibility_Accentedness.csv",
                 stringsAsFactors = TRUE)
And summarize it:
summary(comp)
       ID         Participant       Accent   Comprehensibility  Accentedness
Min. : 1.00 Min. : 7.00 ChinEng:33 Min. :1.000 Min. :1.00
1st Qu.: 66.75 1st Qu.:33.00 GAE :33 1st Qu.:2.000 1st Qu.:4.00
Median :132.50 Median :46.00 GBE :33 Median :2.000 Median :6.00
Mean :132.50 Mean :45.12 IndEng :33 Mean :2.742 Mean :5.58
3rd Qu.:198.25 3rd Qu.:62.00 NBE :33 3rd Qu.:4.000 3rd Qu.:7.00
Max. :264.00 Max. :75.00 NigEng :33 Max. :8.000 Max. :9.00
(Other):66
Familiarity
Never :65
Often :32
Rarely :72
Sometimes :56
Very Often:39
Familiarity is a categorical variable, but its ordering is not correctly interpreted by R:
levels(comp$Familiarity)
[1] "Never"      "Often"      "Rarely"     "Sometimes"  "Very Often"
str(comp$Familiarity)
 Factor w/ 5 levels "Never","Often",..: 1 5 2 3 3 1 3 1 1 5 ...
We need to indicate that Familiarity is an ordered factor:
comp$Familiarity <- factor(comp$Familiarity,
ordered = TRUE,
levels = c("Never",
"Rarely",
"Sometimes",
"Often",
                                      "Very Often"))
levels(comp$Familiarity)
[1] "Never"      "Rarely"     "Sometimes"  "Often"      "Very Often"
str(comp$Familiarity)
 Ord.factor w/ 5 levels "Never"<"Rarely"<..: 1 5 4 2 2 1 2 1 1 5 ...
3.2 Data exploration and manipulation with dplyr
One of the key ideas of the tidyverse approach is to avoid nested functions and to create sequential code, where one function is performed after another. Let’s illustrate this idea with a very simple example, calculating an average.
In base-R we would calculate the average with:
mean(comp$Comprehensibility)
[1] 2.742424
In tidyverse we first “activate” the data and in next step we calculate the average:
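Assembled from the annotated steps, the pipeline presumably looks something like this (a sketch; the column name MEAN is taken from the output shown below):

```r
library(dplyr)

# comp is the dataframe read in above; pipe it into summarise(),
# which returns a new one-row dataframe with the requested statistic.
comp |>
  summarise(MEAN = mean(Comprehensibility))
```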
1. Choose the comp dataframe (this requires that we already have created the comp object).
2. Apply the summarise() function; more specifically, we calculate the mean.
MEAN
1 2.742424
The result is a new dataframe with one variable called “MEAN”, the name that we created in the code. Here we see a second key idea of the tidyverse approach: the result is a dataframe, not a table, a list, or some other specialized data structure. Tidyverse consistently creates new dataframe objects.
We can easily extend the summary dataframe with other descriptive statistics.
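A sketch of such an extended call, matching the column names in the output below:

```r
library(dplyr)

# One summarise() call can compute several statistics at once;
# each named argument becomes a column in the resulting dataframe.
comp |>
  summarise(mean   = mean(Comprehensibility),
            median = median(Comprehensibility),
            min    = min(Comprehensibility),
            max    = max(Comprehensibility),
            SD     = sd(Comprehensibility))
```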
1. Choose comp.
2. Using summarise() we calculate multiple summary statistics.
mean median min max SD
1 2.742424 2 1 8 1.567838
The result is a dataframe with multiple columns.
summarise() is one of the basic functions in the dplyr vocabulary. The other key functions are:
- select() to select columns/variables
- filter() to select rows
- mutate() to create new variables
- group_by() to group the data according to a categorical variable
Let’s look at these functions in turn and how they can be combined to manipulate and explore a dataframe.
3.2.1 group_by()
What is the average Comprehensibility for the eight different accents?
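Following the annotated steps, the pipeline is presumably along these lines (a sketch reconstructed from the annotations and the output below):

```r
library(dplyr)

# group_by() splits comp by Accent; summarise() then computes
# the mean Comprehensibility within each Accent group.
comp |>
  group_by(Accent) |>
  summarise(mean = mean(Comprehensibility, na.rm = TRUE))
```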
1. Choose comp.
2. Group the data according to Accent.
3. Calculate the average Comprehensibility (for each group). I added na.rm = TRUE just in case there would be NAs.
# A tibble: 8 × 2
Accent mean
<fct> <dbl>
1 ChinEng 3.30
2 GAE 1.12
3 GBE 2.03
4 IndEng 3.61
5 NBE 2.24
6 NigEng 3.94
7 SAE 2.39
8 SpanEng 3.30
The result is a tibble (a simplified dataframe) with two variables.
Now let’s add some more summary statistics.
comp |>
group_by(Accent) |>
summarise(mean = mean(Comprehensibility, na.rm = TRUE),
median = median(Comprehensibility, na.rm = TRUE),
min = min(Comprehensibility, na.rm = TRUE),
max = max(Comprehensibility, na.rm = TRUE),
SD = sd(Comprehensibility, na.rm = TRUE))
# A tibble: 8 × 6
Accent mean median min max SD
<fct> <dbl> <int> <int> <int> <dbl>
1 ChinEng 3.30 3 1 7 1.42
2 GAE 1.12 1 1 2 0.331
3 GBE 2.03 2 1 5 0.883
4 IndEng 3.61 3 2 8 1.64
5 NBE 2.24 2 1 6 1.28
6 NigEng 3.94 4 1 7 1.62
7 SAE 2.39 2 1 8 1.52
8 SpanEng 3.30 3 1 6 1.24
We can easily add Familiarity as an extra grouping variable (dropping some summary statistics for the sake of brevity):
comp |>
group_by(Accent, Familiarity) |>
summarise(mean = mean(Comprehensibility, na.rm = TRUE),
SD = sd(Comprehensibility, na.rm = TRUE)) `summarise()` has regrouped the output.
ℹ Summaries were computed grouped by Accent and Familiarity.
ℹ Output is grouped by Accent.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(Accent, Familiarity))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
# A tibble: 33 × 4
# Groups: Accent [8]
Accent Familiarity mean SD
<fct> <ord> <dbl> <dbl>
1 ChinEng Never 3.55 1.43
2 ChinEng Rarely 3 1.32
3 ChinEng Sometimes 3.33 1.53
4 ChinEng Often 1 NA
5 GAE Sometimes 1 0
6 GAE Often 1.43 0.535
7 GAE Very Often 1.04 0.204
8 GBE Never 2 0
9 GBE Rarely 3 NA
10 GBE Sometimes 2.17 0.408
# ℹ 23 more rows
One might wonder whether this could also be done in base-R. Well, try this:
aggregate(Comprehensibility ~ Accent + Familiarity,
data = comp,
FUN = function(x) c(MEAN = mean(x), SD = sd(x)))
   Accent Familiarity Comprehensibility.MEAN Comprehensibility.SD
1 ChinEng Never 3.5500000 1.4317821
2 GBE Never 2.0000000 0.0000000
3 IndEng Never 4.3333333 1.9663842
4 NBE Never 1.0000000 NA
5 NigEng Never 3.8571429 1.7113069
6 SAE Never 3.5000000 0.7071068
7 SpanEng Never 3.3076923 1.4366985
8 ChinEng Rarely 3.0000000 1.3228757
9 GBE Rarely 3.0000000 NA
10 IndEng Rarely 3.7500000 1.5852943
11 NBE Rarely 2.1111111 0.9279607
12 NigEng Rarely 4.1000000 1.6633300
13 SAE Rarely 2.9000000 1.2866839
14 SpanEng Rarely 3.2307692 1.0127394
15 ChinEng Sometimes 3.3333333 1.5275252
16 GAE Sometimes 1.0000000 0.0000000
17 GBE Sometimes 2.1666667 0.4082483
18 IndEng Sometimes 2.1666667 0.4082483
19 NBE Sometimes 2.6250000 1.5000000
20 NigEng Sometimes 4.0000000 0.0000000
21 SAE Sometimes 2.1250000 1.7841898
22 SpanEng Sometimes 3.8000000 1.4832397
23 ChinEng Often 1.0000000 NA
24 GAE Often 1.4285714 0.5345225
25 GBE Often 2.1538462 1.2810252
26 IndEng Often 5.0000000 NA
27 NBE Often 1.0000000 0.0000000
28 SAE Often 2.0000000 0.8164966
29 SpanEng Often 2.5000000 0.7071068
30 GAE Very Often 1.0416667 0.2041241
31 GBE Very Often 1.7272727 0.4670994
32 NBE Very Often 2.6666667 0.5773503
33 SAE Very Often 1.0000000 NA
So yes, it can quite easily be done, but the popularity of the dplyr approach has overshadowed the more “traditional” ways of using R. 1
3.2.2 filter() and select()
We select the observations for which participants are familiar with the accent they heard (i.e., Familiarity equal to “Often” or “Very Often”) and check their Comprehensibility scores per Participant.
comp |>
filter(Familiarity %in% c("Often", "Very Often")) |>
select(Participant, Comprehensibility) |>
group_by(Participant) |>
summarise(mean = mean(Comprehensibility, na.rm = TRUE))
# A tibble: 32 × 2
Participant mean
<int> <dbl>
1 7 3
2 10 2
3 13 1.33
4 15 1
5 18 1.75
6 21 1.25
7 22 1.67
8 31 1
9 33 2
10 34 1.5
# ℹ 22 more rows
1. Choose comp.
2. Filter the rows where Familiarity is “Often” or “Very Often”.
3. Select Participant and Comprehensibility.
4. Group the data by Participant.
5. Calculate the mean Comprehensibility for every Participant.
3.2.3 mutate()
mutate() allows you to create new variables. For instance, here we calculate a z-score for Comprehensibility.
comp |>
  mutate(Comprehensibility_z = (Comprehensibility - mean(Comprehensibility)) / sd(Comprehensibility))
   ID Participant  Accent Comprehensibility Accentedness Familiarity Comprehensibility_z
1   1           7 ChinEng                 3            7       Never           0.1642872
2   2           7     GAE                 1            1  Very Often          -1.1113545
3   3           7     GBE                 5            6       Often           1.4399288
4   4           7  IndEng                 3            8      Rarely           0.1642872
5   5           7     NBE                 4            4      Rarely           0.8021080
6   6           7  NigEng                 2            4       Never          -0.4735336
[264 rows in total; the remaining rows are omitted here]
Other important functions from the tidyverse vocabulary include:
- drop_na() to drop missing values
- count() similar to the table() function
I illustrate each in turn using a very simple dataframe:
mockdf <- tibble(
ID = 1:6,
Size = c("a", "a", "a", "b", "b", "b"),
Score = c(23,34,23,NA,27,4),
Group = c("first", "second", "second", "second", "first", NA))
mockdf
# A tibble: 6 × 4
ID Size Score Group
<int> <chr> <dbl> <chr>
1 1 a 23 first
2 2 a 34 second
3 3 a 23 second
4 4 b NA second
5 5 b 27 first
6 6 b 4 <NA>
Drop the NAs
clean_df <- mockdf |> drop_na()
clean_df
# A tibble: 4 × 4
ID Size Score Group
<int> <chr> <dbl> <chr>
1 1 a 23 first
2 2 a 34 second
3 3 a 23 second
4 5 b 27 first
Count observations for Group
mockdf |> count(Group)
# A tibble: 3 × 2
Group n
<chr> <int>
1 first 2
2 second 3
3 <NA> 1
mockdf |> group_by(Group, Size) |> count()
# A tibble: 5 × 3
# Groups: Group, Size [5]
Group Size n
<chr> <chr> <int>
1 first a 1
2 first b 1
3 second a 2
4 second b 1
5 <NA> b 1
Which is of course similar to:
table(mockdf$Size, mockdf$Group)
first second
a 1 2
b 1 1
Or, if you prefer a dataframe format:
as.data.frame(xtabs(~ Group + Size, data = mockdf))
   Group Size Freq
1 first a 1
2 second a 2
3 first b 1
4 second b 1
3.3 Data transformation
Compare the two tables in Table 3.1. The first table is in the long format, the second in the wide format:
| Student | Test | Score |
|---|---|---|
| 1 | pretest | 8 |
| 1 | posttest | 12 |
| 2 | pretest | 12 |
| 2 | posttest | 14 |
| 3 | pretest | 9 |
| 3 | posttest | 8 |
| … | … | … |
| Student | Pretest | Posttest |
|---|---|---|
| 1 | 8 | 12 |
| 2 | 12 | 14 |
| 3 | 9 | 8 |
| … | … | … |
In the long data format, each observation (each score for both moments) is given its own row, resulting in two rows per student. This format includes a binary categorical variable Test and a continuous variable Score. In the wide data format, each row represents data for one student. There are two test moments (Pre- and Posttest) displayed separately. This wide format is common in paired and longitudinal data where multiple measurements are taken for one element.
Using the correct data format is crucial for accurate data analysis and visualization. Therefore, being able to switch between these formats without re-entering all data is essential. Of course, you could copy-paste the columns in a spreadsheet, but using code to “pivot” the data is more efficient, in terms of speed and replicability.
To illustrate the data transformation process, we create a small dataframe in the wide format:
df <- data.frame(Student = c(1, 2, 3),
Pretest = c(8, 12, 9),
Posttest = c(12, 14, 8))
df
  Student Pretest Posttest
1 1 8 12
2 2 12 14
3 3 9 8
We now transform the data to the long format using pivot_longer():
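A call consistent with the annotated steps and the output below (the name df_long follows the first annotation; the cols, names_to, and values_to arguments are inferred from the output columns Test and Score):

```r
library(tidyr)

# df is the wide dataframe created above. pivot_longer() stacks the
# Pretest and Posttest columns into a single Score column, with a new
# Test column recording which original column each value came from.
df_long <- df |>
  pivot_longer(cols = c(Pretest, Posttest),
               names_to = "Test",
               values_to = "Score")
df_long
```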
1. Creating a new dataset df_long.
2. Selecting columns to transform: the cols parameter specifies the columns to pivot.
3. Naming new variables.
# A tibble: 6 × 3
Student Test Score
<dbl> <chr> <dbl>
1 1 Pretest 8
2 1 Posttest 12
3 2 Pretest 12
4 2 Posttest 14
5 3 Pretest 9
6 3 Posttest 8
To transform the long dataset back to the wide format use pivot_wider():
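A call consistent with the annotated steps and the output below (the name df_wide is illustrative; df_long is the long dataset created in the previous step):

```r
library(tidyr)

# pivot_wider() spreads the categories of Test (Pretest, Posttest)
# into their own columns, filled with the corresponding Score values,
# with one row per Student.
df_wide <- df_long |>
  pivot_wider(id_cols = Student,
              names_from = Test,
              values_from = Score)
df_wide
```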
1. Name for the transformed dataset.
2. Clustering variable: the id_cols parameter specifies the variable to cluster values by.
3. Variables for new columns: the names_from parameter specifies the variable whose categories become new columns, and values_from specifies the variable from which the values are taken.
# A tibble: 3 × 3
Student Pretest Posttest
<dbl> <dbl> <dbl>
1 1 8 12
2 2 12 14
3 3 9 8
3.4 Exercises
Exercise 3.1
Use tidyverse to answer the following questions about the multitask dataset.
- What is the average Time by Task?
- What are the three fastest international participants? Give the names of the Participants in descending order, starting with the fastest one.
Exercise 3.2
- Transform the Multitask dataset from long to wide data format.
For an “opinionated” comparison between tidyverse and base-R, see TidyverseSkeptic↩︎