Differences
This shows you the differences between two versions of the page.
| user:zeman:transliteration-of-urdu-to-latin-script [2010/11/10 13:53] zeman Frequent words. | user:zeman:transliteration-of-urdu-to-latin-script [2010/11/16 08:25] (current) zeman ~ | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== Transliteration of Urdu to Latin Script ====== | ====== Transliteration of Urdu to Latin Script ====== | ||
| + | |||
| + | Copyright © 2010 by Dan Zeman < | ||
| + | License: GNU GPL | ||
| + | This research has been supported by the grant of the Czech Ministry of Education no. MSM0021620838. | ||
| ===== Transliteration versus Transcription ===== | ===== Transliteration versus Transcription ===== | ||
| Line 18: | Line 22: | ||
| ===== Consonants ===== | ===== Consonants ===== | ||
| - | Most of the consonants do not pose any serious problem. I decided to represent the retroflex consonants (ٹ, ڈ, ڑ) by a dot below their dental or other counterparts, | + | Most of the consonants do not pose any serious problem. I decided to represent the retroflex consonants (ٹ, ڈ, ڑ) by a dot below their dental or other counterparts, |
| Some other notes: //j// is pronounced as in English, not as in Czech or German. //č// and //š// are used in Baltic and Slavic languages (among others) to represent the sounds that are usually written “ch” or “sh”, respectively, | Some other notes: //j// is pronounced as in English, not as in Czech or German. //č// and //š// are used in Baltic and Slavic languages (among others) to represent the sounds that are usually written “ch” or “sh”, respectively, | ||
| Line 60: | Line 64: | ||
| | 0645 | م | m | m | | | 0645 | م | m | m | | ||
| | 0646 | ن | n | n | | | 0646 | ن | n | n | | ||
| - | | 06BA | ں | n | ñ | | + | | 06BA | ں | ~ | ñ | |
| | 0648 | و | v | w | | | 0648 | و | v | w | | ||
| | 06C1 | ہ | h | h | | | 06C1 | ہ | h | h | | ||
| Line 103: | Line 107: | ||
| | 06CC | ی | j, i:, e: | y, ī, e | | | 06CC | ی | j, i:, e: | y, ī, e | | ||
| | 06D2 | ے | e: | e | | | 06D2 | ے | e: | e | | ||
| - | | 0626 | ئ | - | 0 | | + | | 06D3 | ۓ | e: | e | |
| - | | 0674 | ٔ (high hamza) | - | 0 | | + | | 0624 | ؤ | u:, o: | ū, o | |
| + | | 0626 | ئ | -, i:, e | 0, ī, e | | ||
| + | | 0654 | (hamza above)ٔ | - | 0 | | ||
| + | | 0674 | (high hamza)ٔ | - | 0 | | ||
| The transliteration script should contain a gradually growing vocabulary that would help disambiguate known words. Otherwise there would be a very high number of ambiguous positions in any transliterated string. | The transliteration script should contain a gradually growing vocabulary that would help disambiguate known words. Otherwise there would be a very high number of ambiguous positions in any transliterated string. | ||
| Line 110: | Line 117: | ||
| ===== Short Vowels and Diacritics ===== | ===== Short Vowels and Diacritics ===== | ||
| - | Without diacritics (which is more common), every consonant that is not followed by a long vowel may or may not be followed by a short vowel. I denote this possibility by the character for the neutral | + | Without diacritics (which is more common), every consonant that is not followed by a long vowel may or may not be followed by a short vowel. I denote this possibility by the character for the neutral |
| //Warning! This section is under construction. I am still confused about the exact rules for Urdu vowel representation, | //Warning! This section is under construction. I am still confused about the exact rules for Urdu vowel representation, | ||
| Line 133: | Line 140: | ||
| Some frequent words cannot be disambiguated by character-based rules alone, but a vocabulary could identify them as existing unambiguous Urdu words and save much manual work. Here are some examples: | Some frequent words cannot be disambiguated by character-based rules alone, but a vocabulary could identify them as existing unambiguous Urdu words and save much manual work. Here are some examples: | ||
| - | * ہے => he | + | * ہے => he (“is”) |
| - | * میں => meñ | + | * میں => meñ (“in”) |
| - | * ایک => ek | + | * ایک => ek (“one”) |
| - | * اور => or | + | * اور => or (“and”) |
| Note however that there are inherently ambiguous words that cannot be disambiguated without human intervention (or at least without looking at the neighboring words). Examples: | Note however that there are inherently ambiguous words that cannot be disambiguated without human intervention (or at least without looking at the neighboring words). Examples: | ||
| - | * تو => to \| tū | + | * تو => to (“so”) |
| - | * اس => is \| us | + | * اس => is (“of this”) |
| - | * ان => in \| un | + | * ان => in (“of these”) |
| ===== The Transliteration Script ===== | ===== The Transliteration Script ===== | ||
| Line 150: | Line 157: | ||
| <code bash> | <code bash> | ||
| - | If you happen to sit on the ÚFAL network, you will find the script in '' | + | If you happen to sit on the ÚFAL network, you will find the script in '' |
| + | |||
| + | I am also attaching the current snapshot of the two folders to this wiki {{: | ||
| This is an example of an Urdu sentence and the romanized output produced by the script: | This is an example of an Urdu sentence and the romanized output produced by the script: | ||
| Line 156: | Line 165: | ||
| * آپ کو پچھلے 182 دنوں میں اپنی بیماری یا معزوری کے سبب مندرجہ ذیل میں سے کوئی ایک ملتا رہا ہے ؟ | * آپ کو پچھلے 182 دنوں میں اپنی بیماری یا معزوری کے سبب مندرجہ ذیل میں سے کوئی ایک ملتا رہا ہے ؟ | ||
| * āp ko pəčhəle 182 dənoñ meñ əpənī b[yīe]mārī yā məˀəz[wūo]rī ke səbəb mənədərəjəh ż[yīe]l meñ se koī ek mələtā rəhā he ? | * āp ko pəčhəle 182 dənoñ meñ əpənī b[yīe]mārī yā məˀəz[wūo]rī ke səbəb mənədərəjəh ż[yīe]l meñ se koī ek mələtā rəhā he ? | ||
| + | |||
| + | Afterwards, a speaker of Urdu is supposed to edit the transliteration and disambiguate all remaining cases: | ||
| + | |||
| + | * āp ko p**ə**čh**ə**le 182 d**ə**noñ meñ **ə**p**ə**nī b**[yīe]**mārī yā m**ə**ˀ**ə**z**[wūo]**rī ke s**ə**b**ə**b m**ə**n**ə**d**ə**r**ə**j**ə**h ż**[yīe]**l meñ se koī ek m**ə**l**ə**tā r**ə**hā he ? | ||
| + | * Exactly one of the alternatives in brackets has to be selected. Sometimes the brackets do not list all possibilities, but the missing ones are easy to guess. For instance, [yīe] should in fact be [əyə\|ī\|e]. | ||
| + | * The schwa //ə// is shorthand for [aiu] or the empty string (no vowel at that position). | ||
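
The two conventions above (bracketed alternatives and the schwa placeholder) can be illustrated with a small sketch that lists every candidate reading of one ambiguous word. This is not the author's script: the function name `expand`, the constant `SCHWA_OPTIONS`, and the choice of Python are assumptions made only for illustration. It also assumes that alternatives inside brackets are taken literally, so a schwa inside a bracketed alternative is not expanded further.

```python
import itertools
import re
import unicodedata

# Hypothetical helper, not part of the original transliteration script.
# Assumption taken from the notes above: a schwa stands for a, i, u or no vowel.
SCHWA_OPTIONS = ["a", "i", "u", ""]

# A token is a bracketed group, a single schwa, or a run of unambiguous characters.
TOKEN = re.compile(r"\[([^\]]*)\]|ə|[^\[ə]+")


def expand(word):
    """Return all candidate readings of one ambiguous romanized word."""
    # Work on precomposed characters so that "ī" or "ū" count as one alternative.
    word = unicodedata.normalize("NFC", word)
    slots = []
    for match in TOKEN.finditer(word):
        inner = match.group(1)
        if inner is not None:
            # "[a|b|c]" separates multi-character alternatives with "|";
            # "[abc]" lists single-character alternatives.
            # Alternatives are kept literally (no nested schwa expansion).
            slots.append(inner.split("|") if "|" in inner else list(inner))
        elif match.group(0) == "ə":
            slots.append(SCHWA_OPTIONS)
        else:
            slots.append([match.group(0)])
    # One reading per combination of choices, in left-to-right order.
    return ["".join(choice) for choice in itertools.product(*slots)]


if __name__ == "__main__":
    for reading in expand("ż[yīe]l"):
        print(reading)
```

The number of readings is the product of the options at each ambiguous position, so it grows quickly with word length. Expanding word by word rather than sentence by sentence keeps the lists manageable, and a vocabulary of known words could then be used to filter the candidates.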
