In the transcription of Kähler (1987), the transcribers converted the original orthography directly into the standardised one. For instance, the original spelling of a nasalised long vowel such as ȭ was changed into õõ using the Keyman keyboard.
When õõ is generated using Keyman, it consists of four code points under the hood (i.e., two sequences of o + ◌̃).
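The four-code-point claim can be checked directly in base R. The sketch below builds the string from explicit Unicode escapes rather than typed characters, on the assumption that Keyman emits the decomposed form (o, U+006F, followed by the combining tilde, U+0303); the variable names are only illustrative.

```r
# Build the long nasalised vowel from explicit code points so the example
# does not depend on how a keyboard or editor encodes it.
# Assumption: Keyman produces o (U+006F) + combining tilde (U+0303).
long_o <- "o\u0303o\u0303"

# Number of code points in the decomposed form
nchar(long_o, type = "chars")
## [1] 4

# The code points as integers: 111 is o, 771 is U+0303
utf8ToInt(long_o)
## [1] 111 771 111 771

# The precomposed form (U+00F5 twice) looks the same but has two code points
precomposed <- "\u00f5\u00f5"
nchar(precomposed, type = "chars")
## [1] 2

# The two forms are not the same string
identical(long_o, precomposed)
## [1] FALSE
```

This is why a pattern that treats each vowel as a single character will not match Keyman-generated text.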
To search for and replace these multi-code-point characters, which combine a base letter with a diacritic, we need to:
- capture each complex character as two characters with a capturing group, (..) (one group covers one letter plus its diacritic)
- then use a backreference (\\1) to match the same sequence of characters again
library(stringr)
# create an example word with standardised orthography for long vowels using the IPA Keyman
x <- "amãpõpõõ"
# search the sequence of long vowels with strings also generated using the IPA Keyman
rgx <- "(..)(\\1)"
# test the regex pattern (str_view_all() is deprecated since stringr 1.5.0; str_view() shows all matches)
str_view(x, rgx)
## [1] │ amãpõp<õõ>
# replace the long-vowel sequence with the original orthography (using ◌̄)
str_replace_all(x, rgx, "\\1̄")
## [1] "amãpõpȭ"
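One caveat is that (..) matches any two characters, so the pattern also fires on ordinary repeated pairs. A stricter variant, sketched below, restricts the group to a letter followed by a combining mark using the Unicode properties \p{L} and \p{M}. The word "baba" and the names loose and strict are hypothetical and only serve the illustration; the strings are built from escapes on the assumption that they match the decomposed form Keyman produces.

```r
library(stringr)

# Decomposed strings built from escapes (assumed to match Keyman's output)
x <- "ama\u0303po\u0303po\u0303o\u0303"  # amãpõpõõ
y <- "baba"                              # hypothetical word without diacritics

loose  <- "(..)(\\1)"          # any two characters, repeated
strict <- "(\\p{L}\\p{M})\\1"  # letter + combining mark, repeated

# The loose pattern also rewrites the plain repeated pair in "baba"
loose_y <- str_replace_all(y, loose, "\\1\u0304")

# The strict pattern leaves "baba" alone
strict_y <- str_replace_all(y, strict, "\\1\u0304")

# ... and still converts only the final long vowel in x
strict_x <- str_replace_all(x, strict, "\\1\u0304")
```

Whether the stricter pattern is needed depends on the data: if repeated two-character sequences without diacritics never occur in the transcription, the simpler pattern is sufficient.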
- If the multi-code-point characters have been normalised using stringi::stri_trans_nfc(), then the search regex also needs to be wrapped inside stringi::stri_trans_nfc(), so that the pattern and the text use the same Unicode form.
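The effect of normalisation can be sketched as follows. The example assumes the text was typed in decomposed form and then passed through stringi::stri_trans_nfc(). The literal pattern pat is hypothetical. Note also that after NFC each nasalised vowel is a single code point, so in this sketch the capturing group uses one dot instead of two.

```r
library(stringr)
library(stringi)

# Decomposed input (assumed Keyman output), then normalised to NFC
x     <- "ama\u0303po\u0303po\u0303o\u0303"  # amãpõpõõ
x_nfc <- stri_trans_nfc(x)

nchar(x, type = "chars")      # 12 code points before normalisation
nchar(x_nfc, type = "chars")  # 8 code points after normalisation

# A literal search string typed in decomposed form no longer matches ...
pat <- "o\u0303o\u0303"
str_detect(x_nfc, pat)
## [1] FALSE

# ... unless it is normalised in the same way
str_detect(x_nfc, stri_trans_nfc(pat))
## [1] TRUE

# Each vowel is now one code point, so one dot per group is enough
out <- str_replace_all(x_nfc, "(.)\\1", "\\1\u0304")

# Re-normalise so the added macron composes with the vowel (U+022D)
out_nfc <- stri_trans_nfc(out)
```

Normalising both the text and the pattern once, at the start of the workflow, avoids having to track which form each string is in.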