Data

MILS 2026: Intro to R

Author: Ludovic De Cuypere

Affiliation: Vrije Universiteit Brussel & Ghent University

Published: April 15, 2026

Modified: April 16, 2026

In this course, you will explore, summarize, and visualize variables from the following three datasets:

False beginners
Accents
Vocatives

Here is a brief description of each dataset, so that you have a basic understanding of the kind of data provided. For each dataset, the file format is given, along with a codebook and an overview of the first six rows of the data.

False beginners

For this part we use the FalseBeginners dataset. The dataset is based on De Wilde et al. (2019), available at https://osf.io/ndr47/. I have selected and renamed the variables for our purposes here. Table 1 provides an overview of the variables.

Table 1: Codebook for the FalseBeginners dataset

ID	Variable	Description
1	Student	A unique identifier for each student
2	School	A unique identifier for each school
3	Class	A unique identifier for each class
4	PPVT	Score for the Peabody Picture Vocabulary Test (PPVT). Maximum score = 120
5	Speaking	Score for the speaking test. Maximum score = 20
6	Listening	Score for the listening test. Maximum score = 25
7	ReadingWriting	Score for the reading and writing tests. Maximum score = 50
8	Attitude	Attitude of student towards English: Positive vs. Negative
9	L1	L1: Dutch vs. Multilingual
10	Sex	Sex of student: Male vs. Female

The file is a tab-delimited .txt file, which we open with read.delim(). The dataset is organised in long format. In the call to read.delim(), we specify:

the text file to open,
that the first row of the file contains the variable names,
that missing values are marked as “NA”,
that strings are interpreted as factors.
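
The four settings listed above map onto arguments of read.delim(). The following is a minimal sketch, not necessarily the exact call used in the course: the object name false_beginners is a hypothetical choice, and instead of the real course file the code writes two rows copied from the preview to a temporary file so that it runs on its own.

```r
# Stand-in for the course file (assumption): two rows taken from the preview,
# written to a temporary tab-delimited text file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Student\tSchool\tClass\tPPVT\tSpeaking\tListening\tReadingWriting\tAttitude\tL1\tSex",
  "10\tS1\t1A\t70\t4\t7\t13\tpositive\tdutch\tfemale",
  "14\tS1\t1A\t79\t7\t14\t18\tpositive\tdutch\tmale"
), path)

false_beginners <- read.delim(
  path,                     # the text file to open
  header = TRUE,            # the first row contains the variable names
  na.strings = "NA",        # missing values are marked as "NA"
  stringsAsFactors = TRUE   # strings are interpreted as factors
)

head(false_beginners)       # shows up to six rows (only two exist here)
str(false_beginners)        # check the type of each variable
```

With the real data, you would replace path with the location of the downloaded .txt file.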

The first six rows of the dataset:

  Student School Class PPVT Speaking Listening ReadingWriting Attitude    L1
1      10     S1    1A   70        4         7             13 positive dutch
2      14     S1    1A   79        7        14             18 positive dutch
3       2     S1    1A   98       19        24             37 positive dutch
4       3     S1    1A   80       11        17             30 positive dutch
5       4     S1    1A   76        7         9             18 positive dutch
6       5     S1    1A   85       16        17             22 positive dutch
     Sex
1 female
2   male
3   male
4   male
5   male
6   male

Accents

This dataset is published in the data repository TROLLing (https://dataverse.no/dataverse/trolling/) as Verbeke and Simon (2023). The dataset abstract provides a summary of the data:

This dataset contains the results from 33 Flemish English as a Foreign Language (EFL) learners, who were exposed to eight native and non-native accents of English. These participants completed (i) a comprehensibility and accentedness rating task, followed by (ii) an orthographic transcription task. In the first task, listeners were asked to rate eight speakers of English on comprehensibility and accentedness on a nine-point scale (1 = easy to understand/no accent; 9 = hard to understand/strong accent). How Accentedness ratings and listeners’ Familiarity with the different accents impacted on their Comprehensibility judgements was measured using a linear mixed-effects model. The orthographic transcription task, then, was used to verify how well listeners actually understood the different accents of English (i.e. intelligibility). To that end, participants’ transcription Accuracy was measured as the number of correctly transcribed words and was estimated using a logistic mixed-effects model. Finally, the relation between listeners’ self-reported ease of understanding the different speakers (comprehensibility) and their actual understanding of the speakers (intelligibility) was assessed using a linear mixed-effects regression. R code for the data analysis is provided.

The codebook for the datafile (taken from the README file):

ID: Unique identifier for each line in the dataset.
Participant: Unique identifier for each participant.
Accent: native and non-native accents of English
- “GBE” = General British English
- “GAE” = General American English
- “NBE” = Northern British English
- “SAE” = Southern American English
- “IndEng” = Indian English
- “NigEng” = Nigerian English
- “ChinEng” = Chinese English
- “SpanEng” = Spanish English
Comprehensibility: Listeners’ rating of how comprehensible the speaker is on a scale from 1 (= easy to understand) to 9 (= hard to understand)
Accentedness: Listeners’ rating of how strong a speaker’s accent is on a scale from 1 (= no accent) to 9 (= strong accent)
Familiarity: Listeners’ familiarity with the sampled varieties of English (i.e. never; rarely; sometimes; often; very often)

We open the data file with read.csv(), as this is a comma-separated text file.
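
A minimal sketch of such a call is given here. The object name accents, the use of stringsAsFactors = TRUE, and the temporary file holding three rows copied from the preview are all assumptions made for illustration. The last step is optional: R sorts factor levels alphabetically by default, so the sketch sets the order of Familiarity by hand. The capitalised level labels follow the data preview; "Sometimes" does not occur in the preview and its spelling is assumed.

```r
# Stand-in for the real file (assumption): three rows from the preview,
# written to a temporary comma-separated file.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Participant,Accent,Comprehensibility,Accentedness,Familiarity",
  "1,7,ChinEng,3,7,Never",
  "2,7,GAE,1,1,Very Often",
  "3,7,GBE,5,6,Often"
), path)

# read.csv() expects a comma as separator and a header row by default.
accents <- read.csv(path, stringsAsFactors = TRUE)

# Optional: give Familiarity its natural order instead of alphabetical order.
accents$Familiarity <- factor(
  accents$Familiarity,
  levels = c("Never", "Rarely", "Sometimes", "Often", "Very Often"),
  ordered = TRUE
)

head(accents)
levels(accents$Familiarity)
```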

The first six rows:

  ID Participant  Accent Comprehensibility Accentedness Familiarity
1  1           7 ChinEng                 3            7       Never
2  2           7     GAE                 1            1  Very Often
3  3           7     GBE                 5            6       Often
4  4           7  IndEng                 3            8      Rarely
5  5           7     NBE                 4            4      Rarely
6  6           7  NigEng                 2            4       Never

Vocatives

This is the data from De Latte (2023). The file is published in the online repository TROLLing (https://dataverse.no/dataverse/trolling/).

ABSTRACT: This dataset contains one datafile (.csv) used to create the graphs and tables in the paper “(Im)polite uses of vocatives in present-day Madrilenian Spanish”. It includes 534 Spanish vocative tokens, i.e. (pro)nominal terms of direct address (e.g., tío ‘dude’), which were retrieved from CORMA, a conversational corpus of peninsular Spanish compiled between 2016 and 2019. The data is annotated for (i) form, (ii) communication, (iii) semantic category, (iv) speaker’s generation, (v) speaker’s gender, (vi) relationship between speaker and hearer, (vii) socio-pragmatic character of the hosting speech act, (viii) the hearer’s reaction, and (ix) the vocative’s socio-pragmatic effect.

We open the data with read.csv(). Note that, although this is a .csv file, the fields are separated by a semicolon (;) rather than a comma, which we specify by means of the argument sep = ";".
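
The effect of the sep argument can be sketched as follows. The object name vocatives is a hypothetical choice, and the temporary file with two rows copied from the preview stands in for the real data file.

```r
# Stand-in for the real file (assumption): two rows from the preview,
# written to a temporary semicolon-separated file.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "form;communication;semantic_category;speaker_generation;speaker_gender;relationship;speech_act;reaction;effect_voc",
  "hermano;AM_GEN4_M_01a;nom_familiarizer;GEN4;M;intimates;neutral;na;neutral",
  "PropN;AM_GEN4_M_01a;PropN;GEN4;M;intimates;neutral;neutral;neutral"
), path)

# Without sep = ";" every line is read as a single field.
ncol(read.csv(path))             # 1

# With sep = ";" the nine variables are split correctly.
vocatives <- read.csv(path, sep = ";")
ncol(vocatives)                  # 9
head(vocatives)
```

Base R also offers read.csv2(), which uses the semicolon as its default separator. It additionally treats the comma as the decimal mark, so read.csv() with sep = ";" is the more conservative choice when the decimal convention of the file is not known.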

Refer to the README file of the dataset for the codebook: https://doi.org/10.18710/FOBMUQ.

The first six rows:

     form communication semantic_category speaker_generation speaker_gender
1    coño AM_GEN4_M_01a         nom_taboo               GEN4              M
2    coño AM_GEN4_M_01a         nom_taboo               GEN4              M
3 hermano AM_GEN4_M_01a  nom_familiarizer               GEN4              M
4   PropN AM_GEN4_M_01a             PropN               GEN4              M
5   PropN AM_GEN4_M_01a             PropN               GEN4              M
6   PropN AM_GEN4_M_01a             PropN               GEN4              M
  relationship speech_act reaction effect_voc
1    intimates    neutral  neutral     banter
2    intimates    neutral  neutral     banter
3    intimates    neutral       na    neutral
4    intimates    neutral  neutral    neutral
5    intimates    neutral  neutral     banter
6    intimates    neutral  neutral    neutral

References

De Latte, Fien. 2023. “Replication Data for: (Im)polite Uses of Vocatives in Present-Day Madrilenian Spanish.” DataverseNO. https://doi.org/10.18710/FOBMUQ.

De Wilde, Vanessa, Marc Brysbaert, and June Eyckmans. 2019. “Learning English Through Out-of-School Exposure. Which Levels of Language Proficiency Are Attained and Which Types of Input Are Important?” Bilingualism: Language and Cognition 23 (1): 171–85. https://doi.org/10.1017/s1366728918001062.

Verbeke, Gil, and Ellen Simon. 2023. “Replication Data for: Listening to Accents: Comprehensibility, Accentedness and Intelligibility of Native and Non-Native English Speech.” DataverseNO. https://doi.org/10.18710/8F0Q0L.