A Beginner’s Guide for Linguists
Use R to calculate the following:
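For example, R can be used as a simple calculator (these are sample calculations; the exercise items themselves may differ):

```r
1 + 2    # addition: 3
7 - 5    # subtraction: 2
2 * 2    # multiplication: 4
10 / 5   # division: 2
```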
A function has a name and arguments:
Name(argument, argument, ...)
Strictly speaking, we don’t need to write the names of the arguments …
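For instance, with seq() (used here purely as an illustration), the named and the unnamed call give the same result, as long as the unnamed arguments are supplied in the default order:

```r
seq(from = 1, to = 10, by = 2)   # named arguments
seq(1, 10, 2)                    # same result, arguments by position
# [1] 1 3 5 7 9
```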
To use those other packages, you must first install them via install.packages() and then open them using library().
(Later we will also use RStudio to install and load a package)
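For ggplot2, for instance, that two-step procedure looks like this (install.packages() is needed only once per computer, library() once per R session):

```r
install.packages("ggplot2")  # download and install the package (once)
library(ggplot2)             # load the package (every session)
```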
To cite ggplot2 in publications, please use
H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
Springer-Verlag New York, 2016.
A BibTeX entry for LaTeX users is
@Book{,
author = {Hadley Wickham},
title = {ggplot2: Elegant Graphics for Data Analysis},
publisher = {Springer-Verlag New York},
year = {2016},
isbn = {978-3-319-24277-4},
url = {https://ggplot2.tidyverse.org},
}
To cite R in publications use:
R Core Team (2023). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
URL https://www.R-project.org/.
A BibTeX entry for LaTeX users is
@Manual{,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2023},
url = {https://www.R-project.org/},
}
We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also 'citation("pkgname")' for
citing R packages.
x <- 10

R now knows that x is the "name" for the value 10. Let's evaluate x:
We used the function print() here, without spelling it out explicitly:
R has the objects x (and y) in its memory, so we can now use x instead of 10.
Writing your own function can be quite easy:
For a beginner, there are more than enough functions in base-R and in the existing packages, so in this course we will not create our own functions.
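Just for illustration (we will not need this later), a minimal user-defined function could look like this; the name percentage is made up for the example:

```r
# a function that converts a raw count into a percentage
percentage <- function(count, total) {
  100 * count / total
}

percentage(18, 36)
# [1] 50
```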
It’s also possible to create a vector of “words” (or better still of “characters”, sometimes referred to as “strings”).
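A hypothetical example of such a character vector (the words are made up):

```r
mywords <- c("tree", "house", "garden")  # quotes mark characters/strings
mywords
# [1] "tree"   "house"  "garden"
```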
The smallest and largest value of WordsPerSentence:

[1]  3 34
[1] 81.38889
[1] 2 1 7 9 13 4 5 8 10 12 14 16 17 18 3 6 11 15
[1] 3 12 23 27 34
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 14.75 23.00 22.28 26.75 34.00
rm(): removes objects from R’s memory.

Make a vocabulary list of your R language:
We measured the height of basketball players (in cm): 189, 190, 198, 210, 213, 234, 205, 207, 198, 199, 189, 203, 204, 207, 188, 187, 179, 209, 205, 204, 206, 200, 199, 198, 198, 206, 204, 175, 201.
Use a function to find help on the mean() function
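A sketch of how you could enter the data and look up the help page (mean() and the help operator ? are standard base-R):

```r
height <- c(189, 190, 198, 210, 213, 234, 205, 207, 198, 199,
            189, 203, 204, 207, 188, 187, 179, 209, 205, 204,
            206, 200, 199, 198, 198, 206, 204, 175, 201)
mean(height)   # average height in cm
?mean          # or: help("mean")
```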
R contains several datasets which are loaded automatically.
Assign the ToothGrowth dataset to an object tg. How many variables does tg have? (tip: use str())

In statistics, we distinguish between two kinds of variables:
Continuous variables are based on measurements (e.g., length, number of words, height, etc.). Categorical variables have a limited number of categories as their values (e.g., Correctness with “correct” vs. “incorrect”, Animacy with “animate” vs. “inanimate”).
We asked 10 participants about their educational level, operationalized as a three level categorical variable:
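The ten responses can be entered as a character vector (the values below are reconstructed from the factor output shown further on):

```r
EduLevel <- c("high", "high", "ba", "high", "ba",
              "ma", "ma", "ma", "high", "ma")
```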
EduLevel

Use a function to find out what kind of variable EduLevel is.
We transform the character variable to a factor, which is more convenient (and required) for a statistical analysis. A factor variable includes different levels.
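A minimal sketch of that conversion (assuming EduLevel is the character vector with ten responses from above):

```r
EduLevel <- c("high", "high", "ba", "high", "ba",
              "ma", "ma", "ma", "high", "ma")
EduLevel <- factor(EduLevel)  # convert character to factor
str(EduLevel)
```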
Factor w/ 3 levels "ba","high","ma": 2 2 1 2 1 3 3 3 2 3
EduLevel is now a factor variable with three categories or levels. R assigns a number to every category. Levels are assigned alphabetically. This can be overruled, and the levels can be renamed or regrouped with the levels() function.
We summarize the factor variables by means of the table() function.
Proportions or fractions are calculated with prop.table().
To avoid a nested function, first create an object based on the table() function and then apply prop.table().
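For example (with the EduLevel factor from above):

```r
EduLevel <- factor(c("high", "high", "ba", "high", "ba",
                     "ma", "ma", "ma", "high", "ma"))
tab <- table(EduLevel)  # counts: ba 2, high 4, ma 4
prop.table(tab)         # proportions: 0.2, 0.4, 0.4
```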
RT (reaction time) is a vector with one missing value:

is.na() is a logical function. The output of the function is TRUE or FALSE.
We can extract which observation is missing by means of which():
So the fifth value of the vector RT is a missing value.
But how many missing values are there?
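A self-contained sketch (the reaction times themselves are made up; only the position of the missing value matches the text):

```r
RT <- c(310, 290, 325, 300, NA, 340)  # hypothetical reaction times
is.na(RT)         # TRUE for the missing value, FALSE elsewhere
which(is.na(RT))  # position of the missing value: 5
sum(is.na(RT))    # number of missing values: 1
```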
letters is an R object (characters) consisting of the letters (in lower case) of the alphabet.
We use the index brackets [ ] to extract different elements:
Extract all the letters from alphabet, except for the first and third letter.
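A sketch of such indexing, assuming alphabet was created from the built-in letters object:

```r
alphabet <- letters   # "a" to "z"
alphabet[1]           # first element: "a"
alphabet[c(2, 4)]     # second and fourth element
alphabet[-c(1, 3)]    # everything except the first and third element
```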
We draw a random sample of \(N=200\) observations from a normal distribution with mean \(\mu=4.5\) and standard deviation \(\sigma=2\).
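In code (set.seed() is added so the random draw is reproducible; the seed value itself is arbitrary):

```r
set.seed(42)                         # arbitrary seed for reproducibility
x <- rnorm(200, mean = 4.5, sd = 2)  # N = 200 draws
mean(x)  # close to 4.5, but not exact, because of sampling error
sd(x)    # close to 2
```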
We create a factor vector “Animacy” with two values “animate” vs. “inanimate” with rep().
Let’s look at the first 10 observations of the vector:
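A sketch with rep() (the group sizes of 100 each are an assumption, matching the N = 200 sample above):

```r
Animacy <- factor(rep(c("animate", "inanimate"), each = 100))
head(Animacy, 10)  # the first 10 observations
```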
| ID | Pretest | Posttest |
|---|---|---|
| 1 | 13 | 15 |
| 2 | 8 | 9 |
| 3 | 11 | 16 |
| 4 | 12 | 13 |
| 5 | 16 | 17 |
| 6 | 15 | 17 |
| 7 | 11 | 12 |
| 8 | 6 | 10 |
| 9 | 10 | 13 |
| 10 | 14 | 15 |
Reconsider the height dataset that we created in Exercise 1.2.
RStudio is an integrated development environment (IDE) for R and Python. It includes a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux).
| Variable | Variable | Variable |
|---|---|---|
| measurement | measurement | measurement |
| measurement | measurement | measurement |
| measurement | measurement | measurement |
| Participant | Item | Frequency | ReactionTime |
|---|---|---|---|
| 1 | a | high | 320 |
| 2 | b | low | 380 |
| 3 | c | low | 400 |
| 4 | d | high | 300 |
| 5 | e | high | 356 |
| 6 | f | low | 319 |
Participant <- c(1,2,3,4,5,6)
Item <- c("a", "b", "c", "d", "e", "f")
Freq <- as.factor(c("high", "low", "low",
"high", "high", "low"))
RT <- c(320, 380, 400, 300, 356, 319)
df <- data.frame(Participant, Item, Freq, RT)
df
  Participant Item Freq  RT
1 1 a high 320
2 2 b low 380
3 3 c low 400
4 4 d high 300
5 5 e high 356
6 6 f low 319
| ID | Group | Score |
|---|---|---|
| 1 | Control | 14 |
| 2 | Control | 15 |
| … | … | … |
| 59 | Treat | 12 |
| 60 | Treat | 17 |
You can actually create a dataframe from scratch with:
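For example, a minimal sketch mirroring the Group/Score table above (only the four rows shown in the table; the elided rows are left out):

```r
scores <- data.frame(ID    = c(1, 2, 59, 60),
                     Group = c("Control", "Control", "Treat", "Treat"),
                     Score = c(14, 15, 12, 17))
scores
```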
Base-R and other packages have built-in datasets. We can open the sleep dataset simply by typing sleep.
The sleep dataset contains three variables:
- extra: the number of extra hours of sleep after taking a sleep medication
- group: two sleep medications were tested by 10 participants
- ID: an identification number for each participant (not for each row!)

We can view the structure of a dataframe using str().
'data.frame': 20 obs. of 3 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
extra is a numeric variable (a number with decimal places). group and ID are factors, categorical variables with 2 and 10 possible values, respectively. The output for the factor variables can be interpreted as follows:
$ group: Factor w/ 2 levels "1", "2": 1 1 1 ...
“1” and “2” are the names of the levels, which R converts into numerical values, 1 and 2. For instance, 1 1 1 indicates that the first three values of the variable are “1”. The levels are arranged alphabetically by R, unless specified otherwise.
Want to find out which datasets are included in base-R?
For more information on the dataset sleep use:
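Both are one-liners (data() and the help operator ? are base-R):

```r
data()   # list the datasets in the currently loaded packages
?sleep   # help page for the sleep dataset (or: help("sleep"))
```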
ABSTRACT: This dataset contains one datafile (.csv) used to create the graphs and tables in the paper “(Im)polite uses of vocatives in present-day Madrilenian Spanish”. It includes 534 Spanish vocative tokens, i.e. (pro)nominal terms of direct address (e.g., tío ‘dude’), which were retrieved from CORMA, a conversational corpus of peninsular Spanish compiled between 2016 and 2019. The data is annotated for (i) form, (ii) communication, (iii) semantic category, (iv) speaker’s generation, (v) speaker’s gender, (vi) relationship between speaker and hearer, (vii) socio-pragmatic character of the hosting speech act, (viii) the hearer’s reaction, and (ix) the vocative’s socio-pragmatic effect.
Check your working directory with getwd() and change it with setwd(). Then use the read.csv() function to open the .csv-file, as follows:

My personal favourite after many years of trial and lots of error is to use the Import Dataset function in the Environment tab in RStudio:
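A sketch of the read.csv() route (the folder and filename below are placeholders; substitute your own):

```r
getwd()                     # where is R currently looking for files?
setwd("~/myproject/data")   # hypothetical folder

# hypothetical filename; stringsAsFactors turns character columns into factors
vocatives <- read.csv("vocatives.csv", stringsAsFactors = TRUE)
```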
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
'data.frame': 20 obs. of 3 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
Other descriptive functions have been developed by R developers. Here’s the describe() function from the Hmisc package (Harrell Jr, 2023).
sleep
3 Variables 20 Observations
------------------------------------------------------------
extra
n missing distinct Info Mean Gmd
20 0 17 0.998 1.54 2.332
.05 .10 .25 .50 .75 .90
-1.220 -0.300 -0.025 0.950 3.400 4.420
.95
4.645
Value -1.6 -1.2 -0.2 -0.1 0.0 0.1 0.7 0.8 1.1
Frequency 1 1 1 2 1 1 1 2 1
Proportion 0.05 0.05 0.05 0.10 0.05 0.05 0.05 0.10 0.05
Value 1.6 1.9 2.0 3.4 3.7 4.4 4.6 5.5
Frequency 1 1 1 2 1 1 1 1
Proportion 0.05 0.05 0.05 0.10 0.05 0.05 0.05 0.05
For the frequency table, variable is rounded to the nearest 0
------------------------------------------------------------
group
n missing distinct
20 0 2
Value 1 2
Frequency 10 10
Proportion 0.5 0.5
------------------------------------------------------------
ID
n missing distinct
20 0 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 2 2 2 2 2 2 2 2 2 2
Proportion 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
------------------------------------------------------------
Figure 1
Figure 2
For tab-delimited .txt files, use read.delim() in the same way. To avoid the use of “DATAFRAME$”, you can use the attach() function, and when you’re done you use the detach() function. It simplifies the code, but if you work with multiple dataframes you run the risk of losing track.
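For example, with the built-in sleep data:

```r
attach(sleep)   # variable names now work without the sleep$ prefix
mean(extra)     # instead of mean(sleep$extra)
detach(sleep)   # clean up when you are done
```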
What is the average amount of extra sleep in the sleep dataset?
Here’s a more complicated selection:
extra group ID
Min. :-1.600 1:10 1 :1
1st Qu.:-0.175 2: 0 2 :1
Median : 0.350 3 :1
Mean : 0.750 4 :1
3rd Qu.: 1.700 5 :1
Max. : 3.700 6 :1
(Other):4
Use droplevels() to remove unused factor levels after subsetting.
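A sketch, selecting group 1 from the sleep data (the object name g1 is arbitrary):

```r
g1 <- sleep[sleep$group == "1", ]  # keep group 1 only
levels(g1$group)                   # "2" is still listed as a level
g1 <- droplevels(g1)               # drop the now-empty level
levels(g1$group)                   # only "1" remains
```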
We calculate a z-score for extra in the sleep data. The dollar sign (“$”) is used to add the new variable to the existing dataframe.
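The calculation written out (z = (x - mean) / sd; the built-in scale() function would give the same result):

```r
sleep$extra_z <- (sleep$extra - mean(sleep$extra)) / sd(sleep$extra)
head(sleep)
```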
extra group ID extra_z
1 0.7 1 1 -0.4162703
2 -1.6 1 2 -1.5560579
3 -0.2 1 3 -0.8622741
4 -1.2 1 4 -1.3578340
5 -0.1 1 5 -0.8127182
6 3.4 1 6 0.9217413
Note that if you do not use the dollar sign, you create a new vector without adding it as a variable to the sleep dataframe.
This is a convenient function to compare different groups in your data. For instance, here we calculate the average extra time of sleep for both groups.
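tapply() takes the numeric variable, the grouping factor, and the function to apply, in that order:

```r
tapply(sleep$extra, sleep$group, mean)
#    1    2
# 0.75 2.33
```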
After you create a dataframe “df”, you can save it as a .csv-file as follows:
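write.csv() mirrors read.csv() (the filename below is a placeholder; row.names = FALSE avoids writing an extra column of row numbers):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))  # any dataframe
write.csv(df, "df.csv", row.names = FALSE)       # hypothetical filename
```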
- Open the multitask dataset, which we created in Activity 1, and give the dataframe the name “multi”.
- Use tapply to calculate the median Time for both Tasks.
- Create Time_c by subtracting the mean of Time from every observation of Time. (This is called centering the data, hence the use of “_c”.)

Source: https://www.theguardian.com/education/gallery/2015/jan/23/a-language-family-tree-in-pictures
Source: Farooq et al. (2014)
Source: Giorgi et al. (2022)
By doing so, we actually open multiple packages. Two packages that we will use in particular are:
- dplyr: to explore and manipulate data (Wickham et al., 2023)
- tidyr: to transform the data format (long to wide and vice versa) (Wickham et al., 2024)

library(tibble)
classroom <- tribble(
~name, ~quiz1, ~quiz2, ~test1,
"Billy", NA, "D", "C",
"Suzy", "F", NA, NA,
"Lionel", "B", "C", "B",
"Jenny", "A", "A", "B"
)
classroom
# A tibble: 4 × 4
name quiz1 quiz2 test1
<chr> <chr> <chr> <chr>
1 Billy <NA> D C
2 Suzy F <NA> <NA>
3 Lionel B C B
4 Jenny A A B
ABSTRACT: This dataset contains the results from 33 Flemish English as a Foreign Language (EFL) learners, who were exposed to eight native and non-native accents of English. These participants completed (i) a comprehensibility and accentedness rating task, followed by (ii) an orthographic transcription task. In the first task, listeners were asked to rate eight speakers of English on comprehensibility and accentedness on a nine-point scale (1 = easy to understand/no accent; 9 = hard to understand/strong accent). How Accentedness ratings and listeners’ Familiarity with the different accents impacted on their Comprehensibility judgements was measured using a linear mixed-effects model. The orthographic transcription task, then, was used to verify how well listeners actually understood the different accents of English (i.e. intelligibility). To that end, participants’ transcription Accuracy was measured as the number of correctly transcribed words and was estimated using a logistic mixed-effects model. Finally, the relation between listeners’ self-reported ease of understanding the different speakers (comprehensibility) and their actual understanding of the speakers (intelligibility) was assessed using a linear mixed-effects regression. R code for the data analysis is provided.
options(width = 70)
comp <- read.csv("Listening_to_Accents_Comprehensibility_Accentedness.csv",
stringsAsFactors=TRUE)
summary(comp)
       ID        Participant       Accent    Comprehensibility
Min. : 1.00 Min. : 7.00 ChinEng:33 Min. :1.000
1st Qu.: 66.75 1st Qu.:33.00 GAE :33 1st Qu.:2.000
Median :132.50 Median :46.00 GBE :33 Median :2.000
Mean :132.50 Mean :45.12 IndEng :33 Mean :2.742
3rd Qu.:198.25 3rd Qu.:62.00 NBE :33 3rd Qu.:4.000
Max. :264.00 Max. :75.00 NigEng :33 Max. :8.000
(Other):66
Accentedness Familiarity
Min. :1.00 Never :65
1st Qu.:4.00 Often :32
Median :6.00 Rarely :72
Mean :5.58 Sometimes :56
3rd Qu.:7.00 Very Often:39
Max. :9.00
- Convert Familiarity to an ordered factor.
- Visualise Comprehensibility by means of a barplot and a histogram.
- … Accent?
- Visualise Accent with a piechart. (check help("pie"))

The most important dplyr functions:

- select() to select columns/variables
- filter() to select rows
- mutate() to create new variables
- group_by() to group the data according to a categorical variable
- summarise() to calculate descriptive statistics

What is the average Comprehensibility for the eight different accents?
comp |>
group_by(Accent) |>
summarise(mean = mean(Comprehensibility, na.rm = TRUE),
median = median(Comprehensibility, na.rm = TRUE),
min = min(Comprehensibility, na.rm = TRUE),
max = max(Comprehensibility, na.rm = TRUE),
            SD = sd(Comprehensibility, na.rm = TRUE))
# A tibble: 8 × 6
Accent mean median min max SD
<fct> <dbl> <int> <int> <int> <dbl>
1 ChinEng 3.30 3 1 7 1.42
2 GAE 1.12 1 1 2 0.331
3 GBE 2.03 2 1 5 0.883
4 IndEng 3.61 3 2 8 1.64
5 NBE 2.24 2 1 6 1.28
6 NigEng 3.94 4 1 7 1.62
7 SAE 2.39 2 1 8 1.52
8 SpanEng 3.30 3 1 6 1.24
comp |>
group_by(Accent, Familiarity) |>
summarise(mean = mean(Comprehensibility, na.rm = TRUE),
            SD = sd(Comprehensibility, na.rm = TRUE))
# A tibble: 33 × 4
# Groups: Accent [8]
Accent Familiarity mean SD
<fct> <ord> <dbl> <dbl>
1 ChinEng Never 3.55 1.43
2 ChinEng Rarely 3 1.32
3 ChinEng Sometimes 3.33 1.53
4 ChinEng Often 1 NA
5 GAE Sometimes 1 0
6 GAE Often 1.43 0.535
7 GAE Very Often 1.04 0.204
8 GBE Never 2 0
9 GBE Rarely 3 NA
10 GBE Sometimes 2.17 0.408
# ℹ 23 more rows
aggregate(Comprehensibility ~ Accent + Familiarity,
data = comp,
          FUN = function(x) c(MEAN = mean(x), SD = sd(x)))
    Accent Familiarity Comprehensibility.MEAN Comprehensibility.SD
1 ChinEng Never 3.5500000 1.4317821
2 GBE Never 2.0000000 0.0000000
3 IndEng Never 4.3333333 1.9663842
4 NBE Never 1.0000000 NA
5 NigEng Never 3.8571429 1.7113069
6 SAE Never 3.5000000 0.7071068
7 SpanEng Never 3.3076923 1.4366985
8 ChinEng Rarely 3.0000000 1.3228757
9 GBE Rarely 3.0000000 NA
10 IndEng Rarely 3.7500000 1.5852943
11 NBE Rarely 2.1111111 0.9279607
12 NigEng Rarely 4.1000000 1.6633300
13 SAE Rarely 2.9000000 1.2866839
14 SpanEng Rarely 3.2307692 1.0127394
15 ChinEng Sometimes 3.3333333 1.5275252
16 GAE Sometimes 1.0000000 0.0000000
17 GBE Sometimes 2.1666667 0.4082483
18 IndEng Sometimes 2.1666667 0.4082483
19 NBE Sometimes 2.6250000 1.5000000
20 NigEng Sometimes 4.0000000 0.0000000
21 SAE Sometimes 2.1250000 1.7841898
22 SpanEng Sometimes 3.8000000 1.4832397
23 ChinEng Often 1.0000000 NA
24 GAE Often 1.4285714 0.5345225
25 GBE Often 2.1538462 1.2810252
26 IndEng Often 5.0000000 NA
27 NBE Often 1.0000000 0.0000000
28 SAE Often 2.0000000 0.8164966
29 SpanEng Often 2.5000000 0.7071068
30 GAE Very Often 1.0416667 0.2041241
31 GBE Very Often 1.7272727 0.4670994
32 NBE Very Often 2.6666667 0.5773503
33 SAE Very Often 1.0000000 NA
Select the participants who are familiar with the accents (Familiarity equal to “Often” or “Very Often”) and check their Comprehensibility scores.

comp |>
filter(Familiarity %in% c("Often", "Very Often")) |>
select(Participant, Comprehensibility) |>
group_by(Participant) |>
  summarise(mean = mean(Comprehensibility, na.rm = TRUE))
# A tibble: 32 × 2
Participant mean
<int> <dbl>
1 7 3
2 10 2
3 13 1.33
4 15 1
5 18 1.75
6 21 1.25
7 22 1.67
8 31 1
9 33 2
10 34 1.5
# ℹ 22 more rows
mutate() allows you to create new variables. For instance, here we calculate a z-score for Comprehensibility.

table()

Use tidyverse to answer the following questions about the multitask dataset:
- … Time by Task?

| Student | Test | Score |
|---|---|---|
| 1 | pretest | 8 |
| 1 | posttest | 12 |
| 2 | pretest | 12 |
| 2 | posttest | 14 |
| 3 | pretest | 9 |
| 3 | posttest | 8 |
| … | … | … |
| Student | Pretest | Posttest |
|---|---|---|
| 1 | 8 | 12 |
| 2 | 12 | 14 |
| 3 | 9 | 8 |
| … | … | … |
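With tidyr, pivot_wider() turns the long table above into the wide one, and pivot_longer() reverses it (a sketch with the first three students only; column names as in the tables):

```r
library(tidyr)

long <- data.frame(Student = c(1, 1, 2, 2, 3, 3),
                   Test    = rep(c("Pretest", "Posttest"), 3),
                   Score   = c(8, 12, 12, 14, 9, 8))

wide <- pivot_wider(long, names_from = Test, values_from = Score)
wide   # columns: Student, Pretest, Posttest
```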
Recreate the following dataset in base-R:
| Student | Pretest | Posttest |
|---|---|---|
| 1 | 8 | 12 |
| 2 | 12 | 14 |
| 3 | 9 | 8 |
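One possible base-R solution:

```r
Student  <- c(1, 2, 3)
Pretest  <- c(8, 12, 9)
Posttest <- c(12, 14, 8)
tests <- data.frame(Student, Pretest, Posttest)
tests
```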
Recreate the following dataset as a tibble:
| Student | Pretest | Posttest |
|---|---|---|
| 1 | 8 | 12 |
| 2 | 12 | 14 |
| 3 | 9 | 8 |
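With the tibble package, tribble() lets you type the table row by row:

```r
library(tibble)

tests <- tribble(
  ~Student, ~Pretest, ~Posttest,
  1,        8,        12,
  2,        12,       14,
  3,        9,        8
)
tests
```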
- glimpse()
- pivot_longer()
- pivot_wider()

Transform the Multitask dataset from long to wide data format.

| ID | Variable | Description |
|---|---|---|
| 1 | Student | A unique identifier for each student |
| 2 | School | A unique identifier for each school |
| 3 | Class | A unique identifier for each class |
| 4 | PPVT | Score for the Peabody Picture Vocabulary Test (PPVT). Maximum score = 120 |
| 5 | Speaking | Score for the speaking test. Maximum score = 20 |
| 6 | Listening | Score for the listening test. Maximum score = 25 |
| 7 | ReadingWriting | Score for the reading and writing tests. Maximum score = 50 |
| 8 | Attitude | Attitude of student towards English: Positive vs. Negative |
| 9 | L1 | L1: Dutch vs. Multilingual |
| 10 | Sex | Sex of student: Male vs. Female |
options(width = 70)
fb <- read.delim("FalseBeginners.txt",
header = TRUE,
na.strings = NA,
stringsAsFactors=TRUE)
fb <- na.omit(fb)
summary(fb)
    Student        School        Class          PPVT
Min. : 2.0 S12 : 50 12A : 27 Min. : 31.00
1st Qu.:221.2 S34 : 46 9A : 26 1st Qu.: 70.00
Median :443.5 S38 : 35 11A : 23 Median : 78.00
Mean :441.5 S51 : 35 12B : 23 Mean : 78.53
3rd Qu.:661.8 S8 : 34 42A : 22 3rd Qu.: 88.00
Max. :867.0 S24 : 32 21A : 20 Max. :116.00
(Other):514 (Other):605
Speaking Listening ReadingWriting Attitude
Min. : 0.000 Min. : 0.00 Min. : 0.00 negative: 26
1st Qu.: 2.000 1st Qu.:10.00 1st Qu.:13.00 positive:720
Median : 5.000 Median :15.00 Median :18.00
Mean : 6.782 Mean :14.94 Mean :21.13
3rd Qu.:10.000 3rd Qu.:20.00 3rd Qu.:29.00
Max. :20.000 Max. :25.00 Max. :50.00
L1 Sex
dutch:548 female:365
multi:198 male :381
Add a nice colour to the bars
Use the ggplot2 package to create the following visualisations for the Multitask dataset:
- … Time.
- … Time.
- … Location.
- Change the colour of the density plots.
Change the colours of the bars to a colour-blind-friendly Brewer palette.
Use the ggplot2 package to create the following visualisations for the Multitask dataset:
- Time by Task as a clustered boxplot
- Time by Task and University as overlapping density plots
- … Time for both Tasks.
- Use ggplot() to visualise the paired differences in the multitask dataset.
- Create Difference, based on the difference in Time between the two Tasks.
- … Speaking and Listening.
- … Speaking and Listening for both Sexes and L1s separately.
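As a starting point for the first item, a clustered boxplot might look like this (a sketch: the stand-in data below is made up, assuming the multitask dataframe has columns Task and Time as in the exercises above):

```r
library(ggplot2)

# hypothetical stand-in for the multitask data (Task and Time columns)
multi <- data.frame(Task = rep(c("Task1", "Task2"), each = 20),
                    Time = c(rnorm(20, 30, 5), rnorm(20, 40, 5)))

p <- ggplot(multi, aes(x = Task, y = Time, fill = Task)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Dark2")  # colour-blind-friendly palette
p
```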