Context

In the transcription of Kähler (1987), the transcribers converted the original orthography directly into a standardised one. For instance, the original spelling of a nasalised long vowel such as ȭ was changed into õõ using the Keyman keyboard.

When õõ is generated using Keyman, it consists of four code points under the hood (i.e., two sequences of o + ◌̃).
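The code-point count can be checked in base R. The sketch below assumes that Keyman emits the decomposed form (o followed by the combining tilde U+0303); the variable names are illustrative only.

```r
# Build the long vowel from escapes so the code points are unambiguous
# (assumption: decomposed form, o + U+0303, typed twice)
long_vowel <- "o\u0303o\u0303"

# utf8ToInt() returns one integer per Unicode code point
code_points <- utf8ToInt(long_vowel)
print(code_points)

# nchar(type = "chars") counts code points, not visible glyphs
n_code_points <- nchar(long_vowel, type = "chars")
print(n_code_points)
```

The two visible glyphs are therefore stored as four code points, which is why a single regex dot does not cover one vowel.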

Aim

To search for and replace these multi-code-point characters (a base letter plus a combining diacritic), we need to:

1. capture one such character as two code points with a regex group, (..), where the first dot matches the base letter and the second matches the diacritic
2. then use a backreference (\\1 in an R string) to match an immediate repetition of the same sequence

Example code

library(stringr)

# create an example word with standardised orthography for long vowels using the IPA Keyman
x <- "amãpõpõõ"

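# match any two-code-point sequence (base letter + diacritic) that is
# immediately repeated, i.e. a long vowel
rgx <- "(..)(\\1)"

# test the regex pattern (str_view_all() is deprecated since stringr 1.5.0;
# str_view() now highlights every match)
str_view(x, rgx)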
## [1] │ amãpõp<õõ>

# replace each long-vowel sequence with the original orthography (one vowel + ◌̄)
str_replace_all(x, rgx, "\\1̄")
## [1] "amãpõpȭ"
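Because (..) matches any two code points, the pattern above would also fire on plain reduplication such as papa. A stricter variant can be sketched with the Unicode property classes \p{L} (letter) and \p{M} (combining mark), which stringr's ICU regex engine supports. This is an illustration rather than part of the original workflow, and the names loose, strict and restored are hypothetical.

```r
library(stringr)

# Example word built from escapes (assumption: decomposed vowels, o + U+0303)
x <- "ama\u0303po\u0303po\u0303o\u0303"

loose  <- "(..)\\1"              # any two code points, repeated
strict <- "(\\p{L}\\p{M})\\1"    # letter + combining mark, repeated

# Compare both patterns on a word with plain reduplication
loose_hits_papa  <- str_detect("papa", loose)
strict_hits_papa <- str_detect("papa", strict)

# Replace the doubled vowel with one vowel + combining macron (U+0304)
restored <- str_replace_all(x, strict, "\\1\u0304")
print(restored)
```

Only the strict pattern leaves papa untouched, while both patterns find the long vowel at the end of the example word.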

Notes

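If the text has been normalised using stringi::stri_trans_nfc(), each letter + diacritic sequence is composed into a single code point wherever a precomposed form exists. A search pattern that contains such characters literally then also needs to be wrapped inside stringi::stri_trans_nfc(); note that the two-dot group (..) no longer corresponds to one vowel in normalised text.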
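The effect of normalisation on a literal search pattern can be sketched as follows, assuming the input starts out in decomposed form. fixed() is used so that the comparison is byte-for-byte; the variable names are illustrative.

```r
library(stringi)
library(stringr)

# Decomposed example word and its NFC-normalised counterpart
x_raw <- "ama\u0303po\u0303po\u0303o\u0303"
x_nfc <- stri_trans_nfc(x_raw)

# NFC composes each vowel + tilde pair into one precomposed code point
n_raw <- nchar(x_raw, type = "chars")
n_nfc <- nchar(x_nfc, type = "chars")

# A literal pattern typed in decomposed form (as Keyman is assumed to produce it)
pattern_raw <- "o\u0303o\u0303"

# The raw pattern misses the normalised text; the normalised pattern finds it
found_raw <- str_detect(x_nfc, fixed(pattern_raw))
found_nfc <- str_detect(x_nfc, fixed(stri_trans_nfc(pattern_raw)))
```

Normalising both the text and the pattern with the same function keeps the two sides comparable.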

gederajeg/change_standard_orthography_into_original_form_Kähler1987.md
