Replace standardised orthography for long vowels in Kähler (1987) Enggano - German Dictionary

Context

In the transcription of Kähler (1987), the transcribers converted the original orthography directly into the standardised one. For instance, the original nasalised long vowel ȭ was entered as õõ using the Keyman keyboard.

When õõ is typed with Keyman, it consists of four code points under the hood (i.e., two sequences of o + ◌̃).
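The code-point count can be checked directly in base R. This is a minimal sketch that assumes the Keyman output is in decomposed form, as described above; the string is written with `\u` escapes (an assumption made here for reproducibility) so that the form does not depend on how the text was typed or pasted.

```r
# "o" + U+0303 (combining tilde), entered twice
long_o <- "o\u0303o\u0303"

# nchar(type = "chars") counts code points, not visible characters
nchar(long_o, type = "chars")
## [1] 4

# the underlying code points: 111 is "o", 771 is U+0303
utf8ToInt(long_o)
## [1] 111 771 111 771
```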

Aim

To search for and replace these multi-code-point characters, each combining a letter and a diacritic, we need to:

  • capture one such character as two code points with a regex capture group, (..)
  • then use a backreference (\\1) to match the same sequence of code points repeated immediately after it

Example code

library(stringr)

# create an example word with standardised orthography for long vowels using the IPA Keyman
x <- "amãpõpõõ"

# search the sequence of long vowels with strings also generated using the IPA Keyman
rgx <- "(..)(\\1)"

# test the regex pattern
# (str_view_all() is deprecated since stringr 1.5.0; str_view() now shows all matches)
str_view(x, rgx)
## [1] │ amãpõp<õõ>

# replace the long-vowel sequence with the original orthography (vowel + ◌̄)
str_replace_all(x, rgx, "\\1̄")
## [1] "amãpõpȭ"
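One caveat with `(..)(\\1)` is that `.` matches any code point, so the pattern also matches any repeated two-letter sequence such as "mama". A possible stricter variant, sketched below, restricts the group to one letter followed by one combining mark using the Unicode properties `\p{L}` and `\p{M}`. The name `strict`, the test word "mama", and the `\u` escapes are illustrative assumptions, not part of the original workflow, and vowels carrying more than one combining mark would need a wider pattern.

```r
library(stringr)

# same example word as above, written with explicit escapes (decomposed form)
x <- "ama\u0303po\u0303po\u0303o\u0303"

loose  <- "(..)(\\1)"          # any two code points, repeated
strict <- "(\\p{L}\\p{M})\\1"  # one letter + one combining mark, repeated

# the loose pattern gives a false positive on an ordinary repeated syllable
str_detect("mama", loose)
## [1] TRUE
str_detect("mama", strict)
## [1] FALSE

# the strict pattern still restores the long nasalised vowel (adds U+0304)
restored <- str_replace_all(x, strict, "\\1\u0304")
restored == "ama\u0303po\u0303po\u0303\u0304"
## [1] TRUE
```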

Notes

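  • If the text has been normalised with stringi::stri_trans_nfc(), each vowel + diacritic pair that has a precomposed form (e.g., õ) becomes a single code point. Any search pattern that contains such characters literally then also needs to be wrapped in stringi::stri_trans_nfc(), so that the pattern and the text are in the same form. Note that the (..) group in the example above then no longer corresponds to one vowel.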
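The effect of normalisation can be sketched as follows. The variable names are hypothetical, and `fixed()` is used here only to make the literal comparison explicit. An alternative shown in the last line is to decompose the text with `stringi::stri_trans_nfd()` before applying the two-code-point pattern.

```r
library(stringr)

decomposed <- "o\u0303o\u0303"                     # four code points
composed   <- stringi::stri_trans_nfc(decomposed)  # U+00F5 twice

stringi::stri_length(composed)
## [1] 2

# a decomposed literal pattern no longer matches NFC-normalised text
str_detect(composed, fixed(decomposed))
## [1] FALSE

# normalising the pattern in the same way restores the match
str_detect(composed, fixed(stringi::stri_trans_nfc(decomposed)))
## [1] TRUE

# alternatively, decompose the text first and keep the original regex
str_detect(stringi::stri_trans_nfd(composed), "(..)(\\1)")
## [1] TRUE
```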