Wiktionary:Project - Text processing information
This page is no longer active. It is being kept for historical interest.
No discussion is needed to revive this page; simply remove the {{inactive}} tag and bring it up to date.
Text Processing Information
The idea of this project is for Wiktionary to provide raw, open information helpful to text processing tasks. For the tasks below, text processing systems often use a set of rules that are mostly right instead of trying to compile exhaustive information. Sometimes they also rely on proprietary data. Any project that would like to use more dictionary-type information could extract it from Wiktionary. Obvious candidates for using such information are word processors (MS Word and the like) and typesetting systems (such as LaTeX).
Below are some areas where a dictionary approach to text problems is helpful. Feel free to add ideas, and please help with existing ones.
Letter Boundaries
Most of the time, each character in a word corresponds to exactly one "language" letter ("table" is spelled t+a+b+l+e). Some languages, however, have digraphs in their alphabets (such as the Spanish "ch") or trigraphs (such as the Hungarian "dzs"), where one letter is made up of several characters, and this is where most of the problems come from. Languages can have other issues: when a Hungarian digraph is doubled in a word, sometimes only its first character is repeated (so "ny" + "ny" → "nny" or "nyny", depending on the word). The issue for text processing is to determine the underlying letters of a word (is "nny" = "n"+"ny" or "ny"+"ny"?). This is very important for collating words correctly, moving the cursor through text, etc.
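The ambiguity described above can be made concrete with a small sketch. The function name `candidate_letterings` and its parameters are hypothetical, and the letter inventory passed in is a deliberately tiny toy, not a full Hungarian alphabet. The code only enumerates the possible readings; choosing the right one still needs per-word information of the kind a dictionary could supply.

```python
def candidate_letterings(word, singles, multigraphs):
    """Enumerate every way to read `word` as a sequence of alphabet letters.

    `singles` is a set of one-character letters; `multigraphs` is a list of
    digraphs/trigraphs. Assumption for this sketch: a doubled multigraph may
    be written contracted, repeating only its first character
    ("ny" + "ny" -> "nny").
    """
    if not word:
        return [[]]
    results = []
    # Reading 1: the next character is a one-character letter.
    if word[0] in singles:
        for rest in candidate_letterings(word[1:], singles, multigraphs):
            results.append([word[0]] + rest)
    for mg in multigraphs:
        # Reading 2: the multigraph written out in full.
        if word.startswith(mg):
            for rest in candidate_letterings(word[len(mg):], singles, multigraphs):
                results.append([mg] + rest)
        # Reading 3: a contracted doubled multigraph, counted as two letters.
        contracted = mg[0] + mg
        if word.startswith(contracted):
            for rest in candidate_letterings(word[len(contracted):], singles, multigraphs):
                results.append([mg, mg] + rest)
    return results


# Toy inventory: just "n" and the digraph "ny".
print(candidate_letterings("nny", {"n"}, ["ny"]))
# [['n', 'ny'], ['ny', 'ny']]
```

Both readings come back, which is exactly the problem: nothing in the character string itself says which lettering is correct.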
Todo:
- List languages with ambiguously parsable character strings. A cursory table of Latin-based alphabets that contain digraphs was compiled below (with information from w:Alphabets derived from the Latin#Notes). If any issues other than digraphs and trigraphs can confuse letter-boundary algorithms, or if any non-Latin-based alphabets have digraphs, please note them.
- On words with ambiguously parsable character strings, use {{letters}} to define the actual lettering.
Language | Ambiguity? |
Albanian | ?? dh, gj, ll, nj, rr, sh, th, xh, zh |
Arbëresh | ?? dh, gj, hj, ll, nj, rr, sh, th, xh, zh |
Basque | ?? dd, ll, rr, ts, tt, tx, tz |
Belarusian | ?? ch, dz, dź, dž |
Breton | ?? ch, c'h, zh |
Catalan | ?? dz, gu, (gü), ig, ix, ll, l·l, nc, ny, qu, (qü), rr, ss, tz |
Corsican | ?? chj, ghj |
Croatian | ?? dž, lj, nj, dj |
Czech | ?? ch |
Dutch | ?? ij |
Filipino | ?? ng |
Galician | ?? gu, qu, ch, ll, nh, rr, ao |
Guaraní | ?? ch, mb, nd, ng, nt, rr |
Hausa | ?? sh, ts |
Hungarian | cs, dz, gy, ly, ny, sz, ty, zs, dzs. Plus composition is ambiguous. |
Irish | ?? bh, ch, dh, fh, gh, mh, ph, sh, th |
Italian | ?? ch, gh, gn, gl, sc |
Latvian | ?? dz, dž |
Maltese | ?? ie, għ |
Māori | ?? ng, wh |
Piedmontese | ?? n- |
Pinyin | ?? ch, sh, zh |
Polish | ?? ch, cz, dz, dż, dź, sz, rz |
Romani | ?? čh, dž, kh, ph, th |
Slovak | ?? dz, dž, ch |
Spanish | None for collation. ch and ll are treated as digraphs for collation purposes, but as distinct letters otherwise. [1] |
Swedish | ?? ch, dj, lj, rl, rn, rs, sj, sk, si, ti, sch, skj, stj |
Walloon | ?? ae, ch, dj, ea, jh, oe, oen, oi, sch, sh, tch, xh |
Welsh | ch, dd, ff, ng, ll, ph, rh, th |
Xhosa | ?? bh, ch, dl, dy, dz, gc, gq, gr, gx, hh, hl, kh, kr, lh, mb, mf, mh, n', nc, ndl, ndz, ngc, ngh, ngq, ngx, nh, nkc, nkq, nkx, nq, nx, ntl, ny, nyh, ph, qh, rh, sh, th, ths, thsh, thy, ts, tsh, ty, wh, xh, yh, zh |
Hyphenation
Making sure that words are hyphenated correctly is important so that, for instance, "cardiovascular" is hyphenated as "cardio-vascular" instead of "cardi-ovascular". Most hyphenation systems use either poorly performing rules or some compacted form of a proprietary hyphenation dictionary (see w:Hyphenation algorithm). Use {{hyphenation}} (e.g. {{hyphenation|co|incid|ence}}) to specify possible hyphenation points in a word. There will sometimes be regional differences (for example between the UK and the USA), but that's fine.
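As a rough illustration of how a consumer might extract this data, the sketch below pulls the parts out of a {{hyphenation}} call and lists the permitted line breaks. The names `hyphenation_parts` and `break_options` are hypothetical, and skipping any part that contains "=" (on the guess that it is a named parameter) is an assumption, not documented template behaviour.

```python
import re

# Matches a simple, non-nested {{hyphenation|...}} call.
TEMPLATE_RE = re.compile(r"\{\{hyphenation\|([^{}]*)\}\}")


def hyphenation_parts(wikitext):
    """Return the parts of the first {{hyphenation|...}} call, or None."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return None
    # Assumption: anything containing "=" is a named parameter, not a part.
    return [p for p in match.group(1).split("|") if p and "=" not in p]


def break_options(parts):
    """List each way to break the word at one permitted hyphenation point."""
    return [
        ("".join(parts[:i]) + "-", "".join(parts[i:]))
        for i in range(1, len(parts))
    ]


parts = hyphenation_parts("{{hyphenation|co|incid|ence}}")
print(parts)                 # ['co', 'incid', 'ence']
print(break_options(parts))  # [('co-', 'incidence'), ('coincid-', 'ence')]
```

A real extractor would also need to handle several templates per entry (for regional variants) and nested markup, which this regular expression does not attempt.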
Todo:
- Pretty straightforward: just make sure entries use {{hyphenation}} correctly (there are guidelines for hyphenation, but authorities disagree on many words).
Word Boundaries
In most languages and scripts, words are separated by spaces. This is not true of Thai, however, where there is no break between words (Thai words are generally short, so a reader who knows the language can easily pick them out). The Unicode Text Segmentation implementation, for instance, uses a small list of Thai words to determine where word breaks are. Wiktionary should be able to provide a more exhaustive (and up-to-date) dictionary of this type for Thai and for any other language that runs into these problems.
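To show why a word list matters here, the following is a minimal sketch of greedy longest-match segmentation, one simple dictionary-driven technique. It is not the Unicode algorithm itself, the function name `segment` is hypothetical, and the two-word lexicon is a toy stand-in for a list that would be extracted from Wiktionary.

```python
def segment(text, lexicon):
    """Split unspaced `text` into words by greedy longest match.

    At each position the longest lexicon entry that fits is taken; a
    character that starts no known word is emitted as its own token.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No entry matched: fall back to a single character.
            words.append(text[i])
            i += 1
    return words


# "ภาษาไทย" ("Thai language") is written without a space between its parts.
print(segment("ภาษาไทย", {"ภาษา", "ไทย"}))  # ['ภาษา', 'ไทย']
```

Greedy matching can pick a long word that leaves an unparsable remainder, so the quality of the result depends heavily on how complete the lexicon is, which is the gap an exhaustive open dictionary could help fill.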
Todo:
- Identify other languages that have difficult word break patterns (maybe other related languages such as Khmer or Lao?)
- Develop a format so that entries in these languages show whether they are a single word or a composite of several (and if so, what the constituent words are)
See
- Unicode Text Segmentation, dealing with boundaries between letters, words, and sentences.
- Unicode Line Breaking Algorithm for some thoughts on hyphenation.
- WT:GP#Wiktionary Augmenting Unicode and others