Section 12 Example: Preprocessing workflow
In this example I walk through a preliminary exploration of one morphological variable in Komi-Zyrian dialects. I use general dplyr and purrr functions described in section Tools: R.
In principle there are four patterns that occur regularly: мунісныс, мунісны, муніны, муніныс. Only two first occur in this data, but the regex would in principle capture all of them.
The idea is to define a general function that with the simplest possible example data finds and classifies the wanted examples. This way we can differentiate the query and processing from the data we apply it into. So in principle we could take another Komi dialect sample and run similar analysis into that.
dummy_data <- tibble(token = c('мунісныс', 'мунісны', 'муніны', 'муніныс'))
filter_verbs <- function(data){
data %>% mutate(type = if_else(str_detect(token, 'сн..$') & str_detect(token, 'с$'),
true = 'isnɨs',
false = if_else(str_detect(token, 'сн.') & str_detect(token, 'ы$'),
true = 'isnɨ',
false = if_else(str_detect(token, '([^с]ныс$)'),
true = 'inɨs',
false = 'inɨ')))) %>%
mutate(type_final = as.factor(if_else(str_detect(type, 's$'),
true = 's-final',
false = 'vowel-final'))) %>%
mutate(type_medial = as.factor(if_else(str_detect(type, 'sn'),
true = 's-medial',
false = 'vowel-medial')))
}
dummy_data %>% filter_verbs() %>% knitr::kable()
token | type | type_final | type_medial |
---|---|---|---|
мунісныс | isnɨs | s-final | s-medial |
мунісны | isnɨ | vowel-final | s-medial |
муніны | inɨ | vowel-final | vowel-medial |
муніныс | inɨs | s-final | vowel-medial |
It seems that the result is correct, so we can start to apply it to the real data. However, if we find problems along the way, we can always return to this point and modify the function, as long as we haven’t started to do manual edits in the derived files.
In the next step we load the corpus directly into R from previously saved RDS file. How these can be worked with was described in section Tools: R.
kpv <- read_rds('corpus.rds')
Of course if the corpus is small enough we can also just read it directly from whatever source we have. However, as this process is repeated every time we compile this website, it is easier to read it this way.
verbs <- kpv %>%
filter(! participant %in% c('NTP-M-1986', 'MR-M-1974', 'RB-1974')) %>% # this simply removes the western researchers Niko Partanen, Michael Rießler and Rogier Blokland
filter(str_detect(token, '(и|і)(с)?ны(с)?$')) %>% # This selects the wanted tokens
filter_verbs() %>% # Here we call the function we set up above
select(token, type, everything())
We can easily look into how many types we have and which are the most common tokens. This doesn’t modify the data frame, but gives us information about how sensical the result is. Usually this kind of more finished document doesn’t show the whole workflow how the preprocessing function was edited iteratively while examination of the results shows something is off, but in principle older versions should usually be present in older Git commits. This can be useful when it is realized later that some of the older version was indeed the correct one.
verbs %>% count(type) %>% slice(1:10) %>% knitr::kable()
type | n |
---|---|
inɨ | 300 |
inɨs | 250 |
isnɨ | 1803 |
isnɨs | 1196 |
verbs %>% count(token) %>% arrange(desc(n)) %>% slice(1:10) %>% knitr::kable()
token | n |
---|---|
вӧліны | 176 |
вӧліныс | 113 |
кучисныс | 84 |
воисныс | 67 |
олісны | 67 |
шуисны | 57 |
локтісны | 46 |
ветлісны | 45 |
карисны | 44 |
олісныс | 44 |
verbs %>% filter(type == 'inɨs') %>% slice(1:10) %>% knitr::kable()
token | type | utterance | reference | participant | time_start | time_end | session_name | filename | word | type_final | type_medial |
---|---|---|---|---|---|---|---|---|---|---|---|
муніныс | inɨs | Нужник бӧкас сулалэ и бӧр муніныс. | kpv_izva19300000ArtijevI-135-10 | IXA-M-18XX | 90000 | 100000 | kpv_izva19300000ArtijevI-135 | /Volumes/langdoc/langs/kpv/kpv_izva19300000ArtijevI-135/kpv_izva19300000ArtijevI-135.eaf | муніныс | s-final | vowel-medial |
муніныс | inɨs | Муніныс ныа вӧлӧн. | kpv_izva19570000-290_3bz-11 | XXV-M-19XX | 52900 | 54220 | kpv_izva19570000-290_3bz-Bakur | /Volumes/langdoc/langs/kpv/kpv_izva19570000-290_3bz-Bakur/kpv_izva19570000-290_3bz-Bakur.eaf | Муніныс | s-final | vowel-medial |
вӧліныс | inɨs | Баракъясас вӧліныс гразнӧйӧсь зэй. | kpv_izva19570000-290_3bz-23 | XXV-M-19XX | 111380 | 114420 | kpv_izva19570000-290_3bz-Bakur | /Volumes/langdoc/langs/kpv/kpv_izva19570000-290_3bz-Bakur/kpv_izva19570000-290_3bz-Bakur.eaf | вӧліныс | s-final | vowel-medial |
лэччаніныс | inɨs | Рытнас шонді лэччаніныс ке сӧстэм - мӧдасылас лоэ шондіа лун , | kpv_izva19590000IgusevJA-280 | JAI-M-1939 | 1104086 | 1109573 | kpv_izva19590000IgusevJA | /Volumes/langdoc/langs/kpv/kpv_izva19590000IgusevJA/kpv_izva19590000IgusevJA.eaf | лэччаніныс | s-final | vowel-medial |
лэччаніныс | inɨs | Шонді лэччаніныс ке гӧрд , мӧдасылас лоас тӧла лун . | kpv_izva19590000IgusevJA-288 | JAI-M-1939 | 1165648 | 1169283 | kpv_izva19590000IgusevJA | /Volumes/langdoc/langs/kpv/kpv_izva19590000IgusevJA/kpv_izva19590000IgusevJA.eaf | лэччаніныс | s-final | vowel-medial |
уудиныс | inɨs | уудиныс (водьпомыс) ляпкыд. | kpv_izva19591216-05582_2az.29 | MXV-F-1937 | 147886 | 150661 | kpv_izva19591216-05582_2az | /Volumes/langdoc/langs/kpv/kpv_izva19591216-05582_2az/kpv_izva19591216-05582_2az.eaf | уудиныс | s-final | vowel-medial |
лоиныс | inɨs | а сыри- сыритяяс ӧні нин лоиныс важынкаяс: | kpv_izva19591216-05582_4a.078 | MXV-F-1937 | 519866 | 526876 | kpv_izva19591216-05582_4a | /Volumes/langdoc/langs/kpv/kpv_izva19591216-05582_4a/kpv_izva19591216-05582_4a.eaf | лоиныс | s-final | vowel-medial |
кӧйиныс | inɨs | кӧйиныс сёяс и кулэ. | kpv_izva19591216-05582_4a.101 | MXV-F-1937 | 717843 | 720943 | kpv_izva19591216-05582_4a | /Volumes/langdoc/langs/kpv/kpv_izva19591216-05582_4a/kpv_izva19591216-05582_4a.eaf | кӧйиныс | s-final | vowel-medial |
кӧйиныс | inɨs | кӧйиныс верме кӧрлы джагедны и луна. | kpv_izva19591216-05582_4a.112 | MXV-F-1937 | 774030 | 779556 | kpv_izva19591216-05582_4a | /Volumes/langdoc/langs/kpv/kpv_izva19591216-05582_4a/kpv_izva19591216-05582_4a.eaf | кӧйиныс | s-final | vowel-medial |
кӧйиныс | inɨs | кӧйиныс дебсе росся костэ; | kpv_izva19591216-05582_4a.113 | MXV-F-1937 | 779556 | 783611 | kpv_izva19591216-05582_4a | /Volumes/langdoc/langs/kpv/kpv_izva19591216-05582_4a/kpv_izva19591216-05582_4a.eaf | кӧйиныс | s-final | vowel-medial |
Often the actual benefit of having the data in a programmatic environment instead of just ELAN is that can use in the research variables from metadata that cannot be accessed in ELAN. In our participant naming system we use usually the convention where the gender and birthyear are marked to the name id.
verbs <- verbs %>% mutate(gender = str_extract(participant, '(?<=-)[MF](?=-)')) %>%
select(token, type, gender, participant, everything())
verbs %>%
count(gender)
## # A tibble: 3 x 2
## gender n
## <chr> <int>
## 1 F 2455
## 2 M 1029
## 3 <NA> 65
Here we see that for some speakers we didn’t find the gender specified as is described in the convention. The reason is that there are some old transcribed texts for which we don’t know who is the speaker – very likely the same text has been elicitated from several speakers and is some kind of a synthesis of those. We simply don’t know. This is somewhat typical for fieldwork data from the 19th and early 20th century, although variation in practices is also great. In this case we know (after looking a bit better which files are having this problem) what is the reason for this problem, and we can decide to leave out those cases if necessary. However, it is good to take into account that if there are missing values, it is very important to examine what is going on behind them.
verbs %>%
filter(is.na(gender)) %>%
count(type)
## # A tibble: 4 x 2
## type n
## <chr> <int>
## 1 inɨ 7
## 2 inɨs 3
## 3 isnɨ 52
## 4 isnɨs 3
In this point we can take a note that in the oldest data available there are very few examples of the s-final forms, but it is also a very small subset of the corpus.
Before advancing further, we can add one more variable to the dataframe we are working with.
verbs <- verbs %>%
mutate(year = str_extract(session_name, '\\d{4}(?=\\d{4})')) %>%
select(token, type, gender, year, everything())
Year is of course a bit problematic variable as we aren’t really having that much data for each year. So let’s add a new column for the decade.
verbs <- verbs %>%
mutate(type = as.factor(type)) %>%
mutate(year = as.numeric(year)) %>%
mutate(decade = (year %/% 10) * 10) %>%
select(token, type, gender, year, decade, everything())
After this the plotting will work nicely. We can analyze the distribution of tokens per decade:
ggplot(verbs,
aes(x = decade)) +
geom_bar()
This reflects well the data distribution of IKDP corpus, which makes sense as these verb forms should occur everywhere.
ggplot(verbs) +
geom_bar(mapping = aes(x = decade, fill = type_medial), position = "fill")
ggplot(verbs %>%
filter(year > 1930)) + # This leaves older data out as it is so gappy
geom_bar(mapping = aes(x = decade, fill = type_medial), position = "fill")
ggplot(verbs %>%
filter(year > 1930) %>%
filter(! is.na(gender))) + # If we want to use a variable later, we have to make sure it is available
geom_bar(mapping = aes(x = decade, fill = type_medial), position = "fill") +
facet_grid(. ~ gender)
Next thing I want to try is to associate birth places with areas and plot those. For this we need bit more metadata, which I’m now reading from our Filemaker Pro database, but which should be set up better for the actual teaching.
Once we have set up the processing workflow to the point where we have something useful, and maybe get into phase where we don’t know what we are doing, it can be an useful practice to write the dataframe into a new variable so that it is not necessary to do all previous changes when something goes wrong. In this case I create a new variable called verbs_test
, and when we accidentally do something we didn’t want to it is easy to run the code from this point onward.
source('/Volumes/langdoc/langs/kpv/FM_meta.R')
## Loading required package: DBI
## Loading required package: rJava
## Joining, by = "Actor_ID"
## Joining, by = "Session_ID"
## Joining, by = "RecPlace_OSM_ID"
## Joining, by = "PlaceofRes_OSM_ID"
## Joining, by = "Birthplace_OSM_ID"
## Warning in eval(ei, envir): NAs introduced by coercion
## Warning in eval(ei, envir): NAs introduced by coercion
## Warning in eval(ei, envir): NAs introduced by coercion
## Warning in eval(ei, envir): NAs introduced by coercion
## Warning in eval(ei, envir): NAs introduced by coercion
## Warning in eval(ei, envir): NAs introduced by coercion
## Warning in eval(ei, envir): NAs introduced by coercion
# meta %>% distinct(place_birth, birthplace_osm_id, lat_birth, lon_birth) %>%
# arrange(place_birth) %>%
# group_by(place_birth) %>%
# filter(n()>1)
verbs_test <- left_join(verbs, meta %>% distinct(participant, lat_birth, lon_birth, attr_foreign) %>% filter(participant %in% verbs$participant)) %>%
rename(lat = lat_birth,
lon = lon_birth)
## Joining, by = "participant"
verbs_test <- verbs_test %>%
mutate(variety = str_extract(session_name, '(?<=kpv_)[a-z]+'))
12.1 From points to polygon
#izva <- st_read('https://raw.githubusercontent.com/langdoc/IKDP-2/025e817c25181b683661a21ab36facb63c830604/data/izva_dialects.geojson')
izva <- st_read('/Users/niko/github/IKDP-2/data/izva_dialects-test.geojson')
## Reading layer `OGRGeoJSON' from data source `/Users/niko/github/IKDP-2/data/izva_dialects-test.geojson' using driver `GeoJSON'
## Simple feature collection with 19 features and 4 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 28.96325 ymin: 55.94814 xmax: 74.33177 ymax: 70.39286
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
ggplot(izva) +
geom_sf(aes(fill = variant)) +
geom_point(data = verbs_test %>% filter(! is.na(lon)),
aes(x = lon, y = lat))
verbs_test %>% filter(is.na(lat)) %>% count(participant) %>% arrange(desc(n))
## # A tibble: 31 x 2
## participant n
## <chr> <int>
## 1 IIB-M-1946 58
## 2 TFA-F-1934 46
## 3 APP-F-1957 39
## 4 unknown 25
## 5 AAZ-F-1940 20
## 6 KOM-F-1964 13
## 7 XXC-F-196X 8
## 8 NGK-F-1956 6
## 9 группа 6
## 10 S1 5
## # ... with 21 more rows
## based on this:
## https://gis.stackexchange.com/questions/222978/lon-lat-to-simple-features-sfg-and-sfc-in-r
geo_inside <- function(lon, lat, map, variable) {
variable <- enquo(variable)
pt <-
tibble::data_frame(x = lon,
y = lat) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(map))
pt %>% st_join(map) %>% pull(!!variable)
}
verbs_test <- verbs_test %>%
filter(! is.na(lon) | ! is.na(lat))
verbs_test %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 3281
verbs_test <- verbs_test %>%
mutate(region = geo_inside(lon, lat, izva, variant))
## although coordinates are longitude/latitude, it is assumed that they are planar
verbs_test <- verbs_test %>%
mutate(dialect = geo_inside(lon, lat, izva, dialect))
## although coordinates are longitude/latitude, it is assumed that they are planar
ggplot(data = verbs_test %>%
filter(str_detect(dialect, 'zva')),
aes(x = type)) +
geom_bar() +
facet_wrap(region ~ gender)
verbs_test %>% count(dialect)
## # A tibble: 9 x 2
## dialect n
## <fctr> <int>
## 1 Central Sysola 6
## 2 izva 1311
## 3 Izva 1566
## 4 Lower Vychegda 1
## 5 Luza-Letka 12
## 6 Syktyvdin 7
## 7 Udora 257
## 8 Upper Sysola 30
## 9 Upper Vychegda 91
ggplot(verbs_test %>%
filter(! is.na(region)) %>%
filter(str_detect(dialect, 'zva'))) +
geom_bar(mapping = aes(x = region, fill = type), position = "fill") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(verbs_test %>%
filter(! participant %in% c('MSF-F-1968', 'VPC-M-1993')) %>%
filter(! is.na(region)) %>%
filter(str_detect(dialect, 'zva'))) +
geom_bar(mapping = aes(x = region, fill = type), position = "fill") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
library(geofacet)
mygrid <- data.frame(
code = c("IZKA", "IZKP", "IZBT", "IZCO", "IZSIB", "IZEX", "IZKU", "UDVA", "UDMZ", "VM", "IZUP", "PE", "VYLO", "VYUP", "SK", "SD", "SC", "LL", "SU"),
name = c("Kanin", "Kola Peninsula", "Tundra", "Izhma core", "Siberia", "Izhma extension", "Kolva-Usa", "Vashka", "Mezen", "Vym", "Upper Izhma", "Pechora", "Lower Vychegda", "Upper Vychegda", "Syktyvkar", "Syktyvdin", "Central Sysola", "Luza-Letka", "Upper Sysola"),
row = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 6, 7, 7, 8),
col = c(2, 1, 3, 4, 7, 5, 6, 1, 2, 3, 4, 5, 3, 4, 3, 3, 4, 2, 5),
stringsAsFactors = FALSE
)
geofacet::grid_preview(mygrid)
## You provided a user-specified grid. If this is a generally-useful
## grid, please consider submitting it to become a part of the
## geofacet package. You can do this easily by calling:
## grid_submit(__grid_df_name__)
# verbs_test %>% mutate(name = as.character(region)) %>%
# mutate(name = if_else(name == 'Bolshezyemelskaya Tundra', 'Tundra', name)) %>%
# left_join(mygrid) %>%
# filter(! is.na(region)) %>% filter(variety == 'vym') %>% select(name)
ggplot(verbs_test %>%
mutate(name = as.character(region)) %>%
mutate(name = if_else(name == 'Bolshezyemelskaya Tundra', 'Tundra', name)) %>%
left_join(mygrid) %>%
filter(! is.na(region)) %>%
filter(! participant %in% c('MSF-F-1968', 'VPC-M-1993', 'VVF-F-1957'))) +
geom_bar(mapping = aes(x = factor(""), fill = type), position = "fill") +
facet_geo(~ name, grid = mygrid) +
labs(title = "First preterite plural verb allomorphs in Komi-Zyrian dialects",
subtitle = "Map approximates the location and contact relations of the dialects. For blanks no data available.",
caption = "Work done in LATTICE, Paris\nData Source: IKDP Author: Niko Partanen (2017)",
y = "Percentage of different types",
x = "") +
theme_bw() +
theme(axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
# axis.title.x=element_blank(),
# axis.title.y=element_blank(),
# legend.position="none",
panel.background=element_blank(),
# panel.border=element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank()) +
theme(strip.background = element_rect(fill="white", linetype = 'blank'))+
theme(strip.text = element_text(colour = 'black', size = 7))
## Joining, by = "name"
## You provided a user-specified grid. If this is a generally-useful
## grid, please consider submitting it to become a part of the
## geofacet package. You can do this easily by calling:
## grid_submit(__grid_df_name__)
# ggplot(verbs_test %>%
# filter(! participant %in% c('MSF-F-1968', 'VPC-M-1993')) %>%
# filter(! is.na(region)) %>%
# filter(! is.na(gender)) %>%
# filter(str_detect(dialect, 'zva'))) +
# geom_bar(mapping = aes(x = gender, fill = type), position = "fill") +
# theme(axis.text.x = element_text(angle = 90, hjust = 1))
#
ggplot(verbs_test %>%
filter(! is.na(region)) %>%
filter(! str_detect(dialect, 'zva'))) +
geom_bar(mapping = aes(x = region, fill = type), position = "fill") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# verbs_test %>% filter(type == 'inɨ')
verb_stems <- verbs_test %>%
mutate(stem = str_extract(token, '.+(?=(и|і)с?ны?с?)')) %>%
distinct(stem) %>%
arrange(stem)
# verb_stems %>%
# filter(! str_detect(stem, "(j|['])")) %>%
# write_csv('data/izva_verbs.csv')
# verbs_test %>%
# mutate(stem = str_extract(token, '.+(?=(и|і)с?ны?с?)')) %>%
# count(stem) %>%
# arrange(desc(n)) %>%
# slice(10:20)
# verbs_test %>%
# filter(variety == 'izva') %>%
# mutate(stem = str_extract(token, '.+(?=(и|і)с?ны?с?)')) %>%
# filter(! stem == 'в') %>%
# count(stem, type) %>%
# rename(hits = n) %>%
# arrange(stem) %>%
# split(.$stem) %>%
# map(~ mutate(.x, diff_types = n()) %>%
# mutate(type_ratio = hits / sum(hits)) %>%
# mutate(sum_hits = sum(hits))) %>%
# bind_rows %>%
# filter(sum_hits > 10) %>%
# arrange(desc(diff_types)) %>% View
# verbs_test %>%
# filter(variety == 'izva') %>%
# mutate(stem = str_extract(token, '.+(?=(и|і)с?ны?с?)')) %>%
# filter(stem == 'босьт') %>%
# filter(type == 'isnɨs') %>%
# select(utterance) %>%
# View
# verbs_test %>% filter(str_detect(utterance, 'всю')) %>% select(utterance, filename) %>% open_eaf(3)
# kpv %>% filter(str_detect(utterance, 'кеде ')) %>% distinct(utterance, participant, filename)
#
# split(.$type) %>%
# map(~ count(.x, stem) %>%
# arrange(desc(n)) %>%
# slice(10:20))
# verbs_test %>% filter(str_detect(token, '^кор')) %>% select(utterance, participant, variety, year)
# verbs_test %>% filter(str_detect(token, '^торйед')) %>% open_eaf(1)
verbs_test %>%
mutate(stem = str_extract(token, '.+(?=(и|і)с?ны?с?)')) %>%
left_join(read_csv('data/izva_verbs.csv')) %>%
filter(! is.na(category)) %>%
filter(stem != 'вӧл') %>%
# select(token, type, type_medial, gender, year, participant, category, variety, region, dialect) %>%
ggplot(data = .,
aes(x = category)) +
geom_bar() +
facet_grid(. ~ type)
## Parsed with column specification:
## cols(
## stem = col_character(),
## category = col_character(),
## remove = col_character()
## )
## Joining, by = "stem"
verbs_test %>%
mutate(stem = str_extract(token, '.+(?=(и|і)с?ны?с?)')) %>%
left_join(read_csv('data/izva_verbs.csv')) %>%
filter(! is.na(category)) %>%
filter(dialect %in% c('Udora', 'izva', 'Izva')) %>%
filter(stem != 'вӧл') %>%
select(token, type, gender, year, participant, category, variety, region, dialect) %>%
ggplot(data = .,
aes(x = category)) +
geom_bar() +
facet_grid(type ~ variety)
## Parsed with column specification:
## cols(
## stem = col_character(),
## category = col_character(),
## remove = col_character()
## )
## Joining, by = "stem"
int <- verbs_test %>%
mutate(stem = str_extract(token, '.+(?=(и|і)с?ны?с?)')) %>%
left_join(read_csv('data/izva_verbs.csv')) %>%
filter(! is.na(category))
## Parsed with column specification:
## cols(
## stem = col_character(),
## category = col_character(),
## remove = col_character()
## )
## Joining, by = "stem"
# int %>%
# left_join(count(int, stem) %>% rename(token_count = n)) %>%
# arrange(desc(token_count)) %>%
# distinct(stem, token_count) %>%
# ggplot(data = .,
# aes(x = token, y = token_count)) +
# geom_bar()
# блиныс
# verbs %>% arrange(token) %>% distinct(token) %>% write_csv('data/izva_verbs.csv')
# verbs %>% filter(str_detect(token, 'j'))
# verbs %>% mutate(variant = str_extract(session_name, '(?<=kpv_)[a-z]+(?=\\d)')) %>% count(variant)
#verbs %>% left_join()