Section 5 Parsing ELAN files into R
Let’s assume we have the following kind of ELAN file:
If this were parsed into R, the expected structure would be something like:
example <- tribble(~token, ~form, ~utterance, ~reference, ~participant, ~time_start, ~time_end, ~filename,
                   'words', 'Words', 'Words here.', '.001', 'Niko', 0, 1000, 'kpv_izva20171010test.eaf',
                   'here', 'here', 'Words here.', '.001', 'Niko', 0, 1000, 'kpv_izva20171010test.eaf',
                   '.', '.', 'Words here.', '.001', 'Niko', 0, 1000, 'kpv_izva20171010test.eaf') %>%
  # the session name is the file name without the .eaf extension
  mutate(session_name = str_extract(filename, '.+(?=\\.eaf)'))
example %>% knitr::kable()
| token | form | utterance | reference | participant | time_start | time_end | filename | session_name |
|---|---|---|---|---|---|---|---|---|
| words | Words | Words here. | .001 | Niko | 0 | 1000 | kpv_izva20171010test.eaf | kpv_izva20171010test |
| here | here | Words here. | .001 | Niko | 0 | 1000 | kpv_izva20171010test.eaf | kpv_izva20171010test |
| . | . | Words here. | .001 | Niko | 0 | 1000 | kpv_izva20171010test.eaf | kpv_izva20171010test |
As far as I can see, this is what we have in the ELAN file. Of course there are other pieces of information, such as the annotator, the language of the tier, the last editing time, the media files and so on, but I have not needed those much myself. My convention with media files is that each session's media files are named identically to the ELAN file itself, so their names (and paths) can be derived from the ELAN file name if and when needed.
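For example, something like this is enough to recover a media path (a sketch, assuming the media are WAV files stored next to the ELAN files):
# A sketch: derive the media file name from the ELAN file name, assuming
# identically named .wav files in the same directory
example %>% mutate(media_file = str_replace(filename, '\\.eaf$', '.wav'))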
Things get more complicated when there is more content below these tiers. For example, there are often glosses, lemmas and pos-tags below the word tokens. Those will be discussed below.
First I want to mention the possibility of combining metadata with the ELAN files. The starting point is that at this point we usually have the two values we need: participant and session name. We haven't discussed session names yet, but in my convention each ELAN file has a unique name, and this also serves as the general session name by which this recording instance can be distinguished from the rest.
If we think about metadata more generally, we normally have variables which are connected to one of these items. The session itself has a recording place and time, it has files, a setting and a genre, for example. The participants themselves have a birth year, an age, a place of residence, language skills, roles in the recording and so on.
It is important to note that some of the variables listed above are quite different from the others. Especially age and role make sense only for the combination of a participant and a recording: the age is different in every recording, and the roles can also vary. We have numerous recordings where the interviewer gets to be the interviewee and vice versa.
So from this point of view the “result” we actually want to work with often looks more like this:
place_meta <- tibble(session_name = c('kpv_izva20171010test', 'kpv_izva20171015test2'),
rec_place = c('Paris, France', 'Syktyvkar, Komi'),
rec_place_lat = c(48.864716, 0),
rec_place_lon = c(2.349014, 0))
speaker_meta <- tibble(participant = 'Niko',
birthplace = 'Sulkava, Finland',
bplace_lat = 61.786100,
bplace_lon = 28.370586,
birthyear = 1986)
corpus <- left_join(example, place_meta) %>% left_join(speaker_meta)
## Joining, by = "session_name"
## Joining, by = "participant"
corpus
## # A tibble: 3 x 16
## token form utterance reference participant time_start time_end
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 words Words Words here. .001 Niko 0 1000
## 2 here here Words here. .001 Niko 0 1000
## 3 . . Words here. .001 Niko 0 1000
## # ... with 9 more variables: filename <chr>, session_name <chr>,
## # rec_place <chr>, rec_place_lat <dbl>, rec_place_lon <dbl>,
## # birthplace <chr>, bplace_lat <dbl>, bplace_lon <dbl>, birthyear <dbl>
As we are joining the tables by participant and session_name, we are able to handle the fact that some variables may have different values in different sessions. For example, in this case the session metadata table contained two different sessions from different years, but when the joins were done the non-matching ones were discarded. This is one of the properties of left_join; there are other join types that are also often useful.
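As a quick sketch of the alternatives: anti_join shows the rows that would be dropped, and inner_join keeps only the rows that match in both tables.
# Sessions in the metadata table with no counterpart in the corpus
place_meta %>% anti_join(example, by = 'session_name')
# Keep only the rows that have a match in both tables
example %>% inner_join(place_meta, by = 'session_name')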
These examples reflect the naming convention that has been used in Freiburg-based research projects.
{language}_{variety}{YYYYMMDD}-{Nth-recording}
There are a few variations of this, but the basic idea is the same: some aspects of the metadata are stored already in the filenames. I have regularly heard arguments that ideally the filenames would contain absolutely no metadata, which probably is good for personal privacy reasons. However, there are also pragmatic reasons to have filenames contain something that makes them easy, or even possible, for humans to navigate by. In the same sense it can be useful to have a mnemonic element there, which we have at times added to the end, but this also adds some new problems.
However, having some pieces of information in the filenames allows us to get a few more metadata columns at this point:
corpus <- dir('corpus/', pattern = 'eaf$', full.names = TRUE) %>% map(read_eaf) %>% bind_rows()
# We assume the session name starts with a three-letter ISO code
# If the convention is entirely reliable, we can just take the first three characters
corpus <- corpus %>% mutate(lang = str_extract(session_name, '.{3}')) %>%
mutate(variety = str_extract(session_name, '(?<=.{3}_)[a-z]+(?=\\d)')) %>%
mutate(rec_year = as.numeric(str_extract(session_name, '\\d{4}(?=\\d{4})'))) %>%
mutate(gender = str_extract(participant, '(?<=-)(F|M)(?=-)'))
corpus %>% select(lang, variety, rec_year, gender, everything())
## # A tibble: 568 x 15
## lang variety rec_year gender token
## <chr> <chr> <dbl> <chr> <chr>
## 1 kpv izva 2014 F ме
## 2 kpv izva 2014 F ,
## 3 kpv izva 2014 F кӧнечнэ
## 4 kpv izva 2014 F же
## 5 kpv izva 2014 F ,
## 6 kpv izva 2014 F кык
## 7 kpv izva 2014 F лун
## 8 kpv izva 2014 F вӧлі
## 9 kpv izva 2014 F в
## 10 kpv izva 2014 F шоке
## # ... with 558 more rows, and 10 more variables: utterance <chr>,
## # reference <chr>, participant <chr>, time_start <dbl>, time_end <dbl>,
## # session_name <chr>, filename <chr>, word <chr>, after <chr>,
## # before <chr>
corpus %>% count(variety)
## # A tibble: 2 x 2
## variety n
## <chr> <int>
## 1 izva 348
## 2 udo 220
corpus %>% count(gender)
## # A tibble: 2 x 2
## gender n
## <chr> <int>
## 1 F 281
## 2 M 287
corpus %>% count(participant)
## # A tibble: 4 x 2
## participant n
## <chr> <int>
## 1 JAI-M-1939 246
## 2 JSS-F-1988 183
## 3 MVF-F-1984 98
## 4 NTP-M-1986 41
This is one place where standardized and cross-comparable metadata would be useful. It seems to me that much of the metadata conversation has circulated around archiving and later data findability needs, but these questions also come up at very concrete levels. What do we call the columns in an R data frame? You have to refer to them all the time, so changing a name later will break lots of code that worked earlier. I guess having something systematic that works for you is the best solution for now. The naming conventions of different programming languages are probably also worth taking into account.
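As a hypothetical sketch of what that can look like in practice, deviating column names can be normalized once, right after parsing, so the rest of the code can rely on them (the new names here are just examples):
# A hypothetical convention: lowercase snake_case everywhere, fixed in one
# place immediately after parsing
corpus %>% rename(recording_year = rec_year, language = lang)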
The idea behind the structure in this data frame is that we can treat each token as one observation, and this format with one token per row allows easy plotting and statistical testing. The main point of this section is that we ultimately have to spend quite a lot of time thinking about where different pieces of metadata come from, and whether or not they can be derived from the data we have at hand.
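As a small sketch of that convenience, a frequency plot needs no reshaping at all (assuming ggplot2 is loaded with the rest of the tidyverse):
# Token counts per participant, plotted directly from the token-level table
corpus %>% count(participant) %>%
  ggplot(aes(participant, n)) +
  geom_col()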
5.0.1 Questions
- Places, coordinates?
- How exact are the times we have, and how exact do we need them to be?
- Are there some pieces of metadata we need all the time?
- What kind of data tends to change?
5.0.2 Example
I have written in this blog about more exact conventions for parsing ELAN files, especially with the FRelan R package that I have built to work with Freiburg project ELAN files. However, I have not been very successful in building a system that could be easily and intuitively adapted to a project with an entirely different tier structure. Probably the easiest way is to modify the code every time there is a new tier structure.
5.1 Customizing to a tier pattern
One of the example files has the following structure:
|- ref (id)
   \- orth (kpv)
      \- word (tokenized from orth)
         \- lemma
            \- pos
   \- ft-rus (Russian translation)
   \- ft-eng (English translation)
   \- note (different notes on utterance level)
The way I usually approach this is to read the individual tiers into R and join them together following the logic by which the ELAN tier structure has been set up.
read_custom_eaf <- function(path_to_file){
  # Read each tier separately, renaming the generic content/id columns so
  # that the ids encode the tier hierarchy
  ref <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "refT") %>%
    dplyr::select(content, annot_id, participant, time_slot_1, time_slot_2) %>%
    dplyr::rename(ref = content, ref_id = annot_id)
  orth <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "orthT") %>%
    dplyr::select(content, annot_id, ref_id, participant) %>%
    dplyr::rename(orth = content, orth_id = annot_id)
  token <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "wordT") %>%
    dplyr::select(content, annot_id, ref_id, participant) %>%
    dplyr::rename(token = content, token_id = annot_id, orth_id = ref_id)
  lemma <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "lemmaT") %>%
    dplyr::select(content, annot_id, ref_id, participant) %>%
    dplyr::rename(lemma = content, lemma_id = annot_id, token_id = ref_id)
  pos <- FRelan::read_tier(eaf_file = path_to_file, linguistic_type = "posT") %>%
    dplyr::select(content, ref_id, participant) %>%
    dplyr::rename(pos = content, lemma_id = ref_id)
  # Join the tiers top-down along the hierarchy and drop the helper ids
  elan <- left_join(ref, orth) %>%
    left_join(token) %>%
    left_join(lemma) %>%
    left_join(pos) %>%
    select(token, lemma, pos, time_slot_1, time_slot_2, everything(), -ends_with('_id'))
  # Replace the time slot references with the actual time values
  time_slots <- FRelan::read_timeslots(path_to_file)
  elan %>% left_join(time_slots %>% rename(time_slot_1 = time_slot_id)) %>%
    rename(time_start = time_value) %>%
    left_join(time_slots %>% rename(time_slot_2 = time_slot_id)) %>%
    rename(time_end = time_value) %>%
    select(token, lemma, pos, participant, time_start, time_end, everything(), -starts_with('time_slot_'))
}
read_custom_eaf('corpus/kpv_udo20120330SazinaJS-encounter.eaf')
## Joining, by = c("ref_id", "participant")
## Joining, by = c("participant", "orth_id")
## Joining, by = c("participant", "token_id")
## Joining, by = c("participant", "lemma_id")
## Joining, by = "time_slot_1"
## Joining, by = "time_slot_2"
## # A tibble: 239 x 8
## token lemma pos participant time_start time_end
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 И и CC NTP-M-1986 170 3730
## 2 эшшӧ эшшӧ _ NTP-M-1986 170 3730
## 3 ӧтик ӧтик Num NTP-M-1986 170 3730
## 4 тор тор N NTP-M-1986 170 3730
## 5 , , CLB NTP-M-1986 170 3730
## 6 мый мый CS NTP-M-1986 170 3730
## 7 тэнад тэ Pron NTP-M-1986 170 3730
## 8 , , CLB NTP-M-1986 170 3730
## 9 тэныд тэ Pron NTP-M-1986 170 3730
## 10 мам мам N NTP-M-1986 170 3730
## # ... with 229 more rows, and 2 more variables: ref <chr>, orth <chr>
read_custom_eaf('corpus/kpv_izva20140330-1-fragment.eaf')
## Joining, by = c("ref_id", "participant")
## Joining, by = c("participant", "orth_id")
## Joining, by = c("participant", "token_id")
## Joining, by = c("participant", "lemma_id")
## Joining, by = "time_slot_1"
## Joining, by = "time_slot_2"
## # A tibble: 98 x 8
## token lemma pos participant time_start time_end
## <chr> <lgl> <lgl> <chr> <dbl> <dbl>
## 1 Ме NA NA MVF-F-1984 0 6086
## 2 , NA NA MVF-F-1984 0 6086
## 3 кӧнечнэ NA NA MVF-F-1984 0 6086
## 4 же NA NA MVF-F-1984 0 6086
## 5 , NA NA MVF-F-1984 0 6086
## 6 кык NA NA MVF-F-1984 0 6086
## 7 лун NA NA MVF-F-1984 0 6086
## 8 вӧлі NA NA MVF-F-1984 0 6086
## 9 в NA NA MVF-F-1984 0 6086
## 10 шоке NA NA MVF-F-1984 0 6086
## # ... with 88 more rows, and 2 more variables: ref <chr>, orth <chr>
Note that in the second file the lemma and pos tiers are absent, so those columns are filled with NA values; this comes from the fallback behaviour of read_tier, shown later in this section. In practice the files would be parsed in the following manner:
corpus <- dir('corpus', pattern = 'eaf$', full.names = TRUE) %>% map(read_custom_eaf) %>% bind_rows()
## Joining, by = c("ref_id", "participant")
## Joining, by = c("participant", "orth_id")
## Joining, by = c("participant", "token_id")
## Joining, by = c("participant", "lemma_id")
## Joining, by = "time_slot_1"
## Joining, by = "time_slot_2"
## Joining, by = c("ref_id", "participant")
## Joining, by = c("participant", "orth_id")
## Joining, by = c("participant", "token_id")
## Joining, by = c("participant", "lemma_id")
## Joining, by = "time_slot_1"
## Joining, by = "time_slot_2"
## Joining, by = c("ref_id", "participant")
## Joining, by = c("participant", "orth_id")
## Joining, by = c("participant", "token_id")
## Joining, by = c("participant", "lemma_id")
## Joining, by = "time_slot_1"
## Joining, by = "time_slot_2"
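The join messages above are harmless, but they can be silenced, and the joins made more robust, by spelling out the keys explicitly. A sketch of how the joins inside read_custom_eaf would then look:
# Explicit keys silence the messages and guard against accidentally
# joining on unintended shared columns
elan <- left_join(ref, orth, by = c('ref_id', 'participant')) %>%
  left_join(token, by = c('orth_id', 'participant')) %>%
  left_join(lemma, by = c('token_id', 'participant')) %>%
  left_join(pos, by = c('lemma_id', 'participant'))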
5.2 Why read ELAN files into R?
I have often seen and read the idea that in order to analyze linguistic data in R, the first task should be to export the data from ELAN into a spreadsheet. The problems with this approach are manifold:
- The spreadsheet has to be manually updated every time the ELAN file changes
- If the spreadsheet is annotated further, it will no longer match the updates made to the ELAN files after the original export
From this point of view the ideal solution would be to store all information in the ELAN files and create an automatic export procedure for analysing the data further, or for storing it in another format such as a spreadsheet in case that is needed for some reason.
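A minimal sketch of such an export procedure (assuming readr is loaded for write_csv): the CSV is always regenerated from the ELAN files, never edited by hand, so it cannot drift out of date unnoticed.
# Re-export the whole corpus from the ELAN files in one step
dir('corpus', pattern = 'eaf$', full.names = TRUE) %>%
  map(read_custom_eaf) %>%
  bind_rows() %>%
  write_csv('corpus_export.csv')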
The only way to keep this working is through meticulous working practices; there is no way around this. However, these practices can be eased and made more robust by adopting workflows which minimize the possibilities to deviate from the convention. The simplest mechanism for this is to have a system that is as minimal as possible while still allowing the tasks we want to achieve.
Unfortunately the situations and the tools we work with are never perfect, and there may be situations where we settle into a workflow far from ideal, due to different constraints: time, practicality, limitations of the software. In these cases it is important to notice where the compromises have been made. For each example presented here I have also pointed out the things that will make the procedure difficult to maintain, where these can be easily imagined.
5.3 Parsing with the FRelan package
The function I have in the FRelan package for parsing one tier looks like this:
FRelan::read_tier
## function(eaf_file = "/Volumes/langdoc/langs/kpv/kpv_izva20140404IgusevJA/kpv_izva20140404IgusevJA.eaf", linguistic_type = "wordT", read_file = T, xml_object = F){
##
## `%>%` <- dplyr::`%>%`
##
## if (read_file == F){
##
## file = xml_object
##
## } else {
##
## file <- xml2::read_xml(eaf_file)
##
## }
##
## participants_in_file <- file %>% xml2::xml_find_all(paste0("//TIER[@LINGUISTIC_TYPE_REF='", linguistic_type, "']")) %>%
## xml2::xml_attr("PARTICIPANT")
##
## coerce_data_frame <- function(current_participant){
## dplyr::data_frame(
## content = file %>%
## xml2::xml_find_all(
## paste0("//TIER[@LINGUISTIC_TYPE_REF='", linguistic_type, "' and @PARTICIPANT='", current_participant,"']/ANNOTATION/*/ANNOTATION_VALUE")) %>%
## xml2::xml_text(),
## annot_id = file %>%
## xml2::xml_find_all(
## paste0("//TIER[@LINGUISTIC_TYPE_REF='", linguistic_type, "' and @PARTICIPANT='", current_participant,"']/ANNOTATION/*/ANNOTATION_VALUE/..")) %>%
## xml2::xml_attr("ANNOTATION_ID"),
## ref_id = file %>%
## xml2::xml_find_all(
## paste0("//TIER[@LINGUISTIC_TYPE_REF='", linguistic_type, "' and @PARTICIPANT='", current_participant,"']/ANNOTATION/*/ANNOTATION_VALUE/..")) %>%
## xml2::xml_attr("ANNOTATION_REF"),
## speaker = current_participant,
## tier_id = file %>%
## xml2::xml_find_all(
## paste0("//TIER[@LINGUISTIC_TYPE_REF='", linguistic_type, "' and @PARTICIPANT='", current_participant,"']/ANNOTATION/*/ANNOTATION_VALUE/../../..")) %>%
## xml2::xml_attr("TIER_ID"),
## type = file %>%
## xml2::xml_find_all(
## paste0("//TIER[@LINGUISTIC_TYPE_REF='", linguistic_type, "' and @PARTICIPANT='", current_participant,"']/ANNOTATION/*/ANNOTATION_VALUE/../../..")) %>%
## xml2::xml_attr("LINGUISTIC_TYPE_REF"),
## time_slot_1 = file %>%
## xml2::xml_find_all(
## paste0("//TIER[@LINGUISTIC_TYPE_REF='", linguistic_type, "' and @PARTICIPANT='", current_participant,"']/ANNOTATION/*")) %>%
## xml2::xml_attr("TIME_SLOT_REF1"),
## time_slot_2 = file %>%
## xml2::xml_find_all(
## paste0("//TIER[@LINGUISTIC_TYPE_REF='", linguistic_type, "' and @PARTICIPANT='", current_participant,"']/ANNOTATION/*")) %>%
## xml2::xml_attr("TIME_SLOT_REF2"))
## }
##
## if (length(participants_in_file) != 0){
##
## plyr::ldply(participants_in_file, coerce_data_frame) %>% dplyr::tbl_df() %>% dplyr::rename(participant = speaker)
##
## } else {
##
## all_participants <- file %>% xml2::xml_find_all("//TIER") %>%
## xml2::xml_attr("PARTICIPANT") %>% unique()
## tibble(content = NA,
## annot_id = '',
## ref_id = '',
## participant = all_participants,
## tier_id = '',
## type = linguistic_type,
## time_slot_1 = NA,
## time_slot_2 = NA)
##
## }
##
##
## }
## <bytecode: 0x7fabc6558ad8>
## <environment: namespace:FRelan>
It works fine, but I have not yet updated it to use the map() function from the purrr package. It also has some extra parts for parsing either an XML file or an XML object already read into memory. We usually want to read several tiers and merge them together, and in this context it is not good to open the XML file again every time we read a tier, as this would slow the function down considerably.
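A sketch of that reuse, with the read_file and xml_object arguments shown above: the file is parsed once with xml2, and the same object is passed to every read_tier call.
# Parse the XML once, then read several tiers from the same object
doc <- xml2::read_xml('corpus/kpv_udo20120330SazinaJS-encounter.eaf')
c('refT', 'orthT', 'wordT', 'lemmaT', 'posT') %>%
  map(~ FRelan::read_tier(linguistic_type = .x,
                          read_file = FALSE, xml_object = doc))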