Differences
This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
user:zeman:transliteration-of-urdu-to-latin-script [2010/11/09 16:14] zeman Hamza. |
user:zeman:transliteration-of-urdu-to-latin-script [2010/11/16 08:25] (current) zeman ~ |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Transliteration of Urdu to Latin Script ====== | ====== Transliteration of Urdu to Latin Script ====== | ||
+ | |||
+ | Copyright © 2010 by Dan Zeman < | ||
+ | License: GNU GPL | ||
+ | This research has been supported by grant no. MSM0021620838 of the Czech Ministry of Education. | ||
===== Transliteration versus Transcription ===== | ===== Transliteration versus Transcription ===== | ||
Line 18: | Line 22: | ||
===== Consonants ===== | ===== Consonants ===== | ||
- | Most of the consonants do not pose any serious problem. I decided to represent the retroflex consonants (ٹ, ڈ, ڑ) by a dot below their dental or other counterparts, | + | Most of the consonants do not pose any serious problem. I decided to represent the retroflex consonants (ٹ, ڈ, ڑ) by a dot below their dental or other counterparts, |
Some other notes: //j// is pronounced as in English, not as in Czech or German. //č// and //š// are used in Baltic and Slavic languages (among others) to represent the sounds that are usually written “ch” or “sh”, respectively, | Some other notes: //j// is pronounced as in English, not as in Czech or German. //č// and //š// are used in Baltic and Slavic languages (among others) to represent the sounds that are usually written “ch” or “sh”, respectively, | ||
Line 60: | Line 64: | ||
| 0645 | م | m | m | | | 0645 | م | m | m | | ||
| 0646 | ن | n | n | | | 0646 | ن | n | n | | ||
- | | 06BA | ں | n | ñ | | + | | 06BA | ں | ~ | ñ | |
| 0648 | و | v | w | | | 0648 | و | v | w | | ||
| 06C1 | ہ | h | h | | | 06C1 | ہ | h | h | | ||
Line 68: | Line 72: | ||
===== Vowels ===== | ===== Vowels ===== | ||
- | The consonant (or semi-vowel) و //(w)// is also ambiguously used to represent the long vowels //ū// (pronounced as //oo// in English //fool//) and //o// (pronounced as //oo// in English //door//). I want to distinguish these three pronunciations. In most cases however, the script can only output //[wūo]// and leave the disambiguation to a human judgment: | + | The consonant (or semi-vowel) و //(w)// is also ambiguously used to represent the long vowels //ū// (pronounced as //oo// in English //fool//) and //o// (pronounced as //oo// in English //door//). I want to distinguish these three pronunciations |
* In word-initial position, I assume that only consonantal pronunciation is possible and always output //w//. | * In word-initial position, I assume that only consonantal pronunciation is possible and always output //w//. | ||
Line 76: | Line 80: | ||
* In all other cases I output //[wūo]//. | * In all other cases I output //[wūo]//. | ||
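The two rules for و that are visible in this diff (word-initial position and the fallback) can be sketched as below. This is only an illustration, not the author's script: the function name `transliterate_waw` is hypothetical, and the intermediate rules that the diff skips between these two bullets are deliberately not reproduced, so this sketch returns //[wūo]// more often than the real script would.

```python
WAW = "\u0648"  # و ARABIC LETTER WAW


def transliterate_waw(word: str, index: int) -> str:
    """Transliterate the waw at word[index] using only the two visible rules.

    Hypothetical helper: the rules elided from the diff are NOT implemented.
    """
    if word[index] != WAW:
        raise ValueError("character at index is not waw")
    if index == 0:
        # Word-initial position: assume the consonantal reading.
        return "w"
    # Fallback ("all other cases"): leave the choice to a human.
    return "[wūo]"
```

A word-initial waw yields `w`; any other position yields the bracketed alternatives for later manual disambiguation.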
- | The consonant (or semi-vowel) ی //(y)// is also ambiguously used to represent the long vowels //ī// (pronounced as //ee// in English //feet//) and //e// (pronounced roughly as //ai// in English //fair//). I want to distinguish these three pronunciations. In most cases however, the script can only output //[yīe]// and leave the disambiguation to a human judgment: | + | The consonant (or semi-vowel) ی //(y)// is also ambiguously used to represent the long vowels //ī// (pronounced as //ee// in English //feet//) and //e// (pronounced roughly as //ai// in English //fair//). I want to distinguish these three pronunciations |
* In word-initial position, I assume that only consonantal pronunciation is possible and always output //y//. | * In word-initial position, I assume that only consonantal pronunciation is possible and always output //y//. | ||
Line 103: | Line 107: | ||
| 06CC | ی | j, i:, e: | y, ī, e | | | 06CC | ی | j, i:, e: | y, ī, e | | ||
| 06D2 | ے | e: | e | | | 06D2 | ے | e: | e | | ||
- | | 0626 | ئ | - | 0 | | + | | 06D3 | ۓ | e: | e | |
- | | 0674 | ٔ (high hamza) | - | 0 | | + | | 0624 | ؤ | u:, o: | ū, o | |
+ | | 0626 | ئ | -, i:, e | 0, ī, e | | ||
+ | | 0654 | (hamza above)ٔ | - | 0 | | ||
+ | | 0674 | (high hamza)ٔ | - | 0 | | ||
+ | |||
+ | The transliteration script should contain a gradually growing vocabulary that would help disambiguate known words. Otherwise there would be a very high number of ambiguous positions in any transliterated string. | ||
+ | |||
+ | ===== Short Vowels and Diacritics ===== | ||
+ | |||
+ | In text without diacritics (the more common case), every consonant that is not followed by a long vowel may or may not be followed by a short vowel. I denote this possibility by the character for the neutral vowel schwa: //ə//. | ||
+ | |||
+ | //Warning! This section is under construction. I am still confused about the exact rules for Urdu vowel representation, | ||
+ | |||
+ | Although they are rarely used, Urdu has means to mark the three short vowels as well, using one of three diacritical marks. Long vowels can be disambiguated in the same way: for example, a consonant with the pesh mark followed by a waw without any diacritic means that the waw is a long vowel //[ūo]// and not the consonant //w//. | ||
+ | |||
+ | ^ Unicode ^ Unicode Name ^ Urdu Name ^ With Beh ^ Transliteration ^ | ||
+ | | 064E | ARABIC FATHA | zabar | بَ | ba | | ||
+ | | 064F | ARABIC DAMMA | pesh | بُ | bu | | ||
+ | | 0650 | ARABIC KASRA | zer | بِ | bi | | ||
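The table above can be read as a simple lookup from diacritic to short vowel. The following is a minimal sketch under that assumption; the names `SHORT_VOWELS`, `CONSONANTS` and `romanize_marked` are made up for illustration, only beh is included as a consonant because it is the only one the table shows, and the interaction with long vowels (which this section marks as under construction) is not modelled.

```python
# Diacritic -> short vowel, taken from the table above.
SHORT_VOWELS = {
    "\u064E": "a",  # ARABIC FATHA (zabar)
    "\u064F": "u",  # ARABIC DAMMA (pesh)
    "\u0650": "i",  # ARABIC KASRA (zer)
}

# Illustrative consonant map; only beh appears in the table.
CONSONANTS = {"\u0628": "b"}  # ب ARABIC LETTER BEH


def romanize_marked(text: str) -> str:
    """Romanize consonants and short-vowel marks; pass anything else through."""
    out = []
    for ch in text:
        if ch in SHORT_VOWELS:
            out.append(SHORT_VOWELS[ch])
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        else:
            out.append(ch)  # unknown character: leave untouched
    return "".join(out)
```

Applied to beh followed by each of the three marks, this reproduces the last column of the table.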
+ | |||
+ | pesh (ARABIC DAMMA, 064F) ... u ... کُون //kon// “who” | ||
+ | zabar (ARABIC FATHA, 064E) ... a ... کَون //kawn// | ||
+ | zer (ARABIC KASRA, 0650) ... i ... | ||
+ | |||
+ | Possible further reading: | ||
+ | * http:// | ||
+ | * http:// | ||
+ | |||
+ | ===== Vocabulary of Frequent Words ===== | ||
+ | |||
+ | Some frequent words cannot be disambiguated by character-based rules alone, but a vocabulary could recognize them as existing, unambiguous Urdu words and so save much manual work. Here are some examples: | ||
+ | |||
+ | * ہے => he (“is”) | ||
+ | * میں => meñ (“in”) | ||
+ | * ایک => ek (“one”) | ||
+ | * اور => or (“and”) | ||
+ | |||
+ | Note, however, that some words are inherently ambiguous and cannot be disambiguated without human intervention (or at least without looking at the neighboring words). Examples: | ||
+ | |||
+ | * تو => to (“so”) | tū (“thou”) | ||
+ | * اس => is (“of this”) | us (“of that”) | ||
+ | * ان => in (“of these”) | un (“of those”) | ||
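A vocabulary of this kind could be sketched as a dictionary from the Urdu word to its list of known readings. This is an assumption about the data structure, not a description of the actual library; `VOCABULARY` and `lookup` are hypothetical names, and writing several readings as `[a|b]` is borrowed from the bracket notation used elsewhere on this page.

```python
from typing import Optional

# Known words and their readings, copied from the two lists above.
VOCABULARY = {
    "ہے": ["he"],
    "میں": ["meñ"],
    "ایک": ["ek"],
    "اور": ["or"],
    "تو": ["to", "tū"],
    "اس": ["is", "us"],
    "ان": ["in", "un"],
}


def lookup(word: str) -> Optional[str]:
    """Return the reading of a known word, or None if the word is unknown.

    Inherently ambiguous words come back as bracketed alternatives so that
    a human can still pick the right one.
    """
    readings = VOCABULARY.get(word)
    if readings is None:
        return None  # fall back to character-based rules
    if len(readings) == 1:
        return readings[0]
    return "[" + "|".join(readings) + "]"
```

Unknown words return `None`, so a caller can fall back to the character-based rules described in the earlier sections.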
+ | |||
+ | ===== The Transliteration Script ===== | ||
+ | |||
+ | You need two files. All of the transliteration knowledge is encoded in the library '' | ||
+ | |||
+ | <code bash> | ||
+ | |||
+ | If you happen to sit on the ÚFAL network, you will find the script in '' | ||
+ | |||
+ | I am also attaching the current snapshot of the two folders to this wiki {{: | ||
+ | |||
+ | This is an example of an Urdu sentence and the romanized output produced by the script: | ||
+ | |||
+ | * آپ کو پچھلے 182 دنوں میں اپنی بیماری یا معزوری کے سبب مندرجہ ذیل میں سے کوئی ایک ملتا رہا ہے ؟ | ||
+ | * āp ko pəčhəle 182 dənoñ meñ əpənī b[yīe]mārī yā məˀəz[wūo]rī ke səbəb mənədərəjəh ż[yīe]l meñ se koī ek mələtā rəhā he ? | ||
+ | |||
+ | Afterwards, a speaker of Urdu is supposed to edit the transliteration and disambiguate all remaining cases: | ||
+ | * āp ko p**ə**čh**ə**le 182 d**ə**noñ meñ **ə**p**ə**nī b**[yīe]**mārī yā m**ə**ˀ**ə**z**[wūo]**rī ke s**ə**b**ə**b m**ə**n**ə**d**ə**r**ə**j**ə**h ż**[yīe]**l meñ se koī ek m**ə**l**ə**tā r**ə**hā he ? | ||
+ | * One of the alternatives in brackets has to be selected. Sometimes the brackets do not list all possibilities, but the missing ones are easy to guess. For instance, [yīe] should in fact be [əyə|ī|e]. | ||
+ | * The schwa //ə// is shorthand for [aiu] or an empty string (no vowel at this position). |
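The positions left for the human editor can be located mechanically. The sketch below marks every schwa and every bracketed group with the wiki bold markup `**…**`, in the way the highlighted example sentence above is marked. It is not part of the transliteration script; `highlight` and `count_open_positions` are hypothetical helper names.

```python
import re

# An open position is either a bracketed group of alternatives or a schwa.
AMBIGUOUS = re.compile(r"\[[^\]]+\]|ə")


def highlight(text: str) -> str:
    """Wrap every ambiguous position in ** ** (DokuWiki bold)."""
    return AMBIGUOUS.sub(lambda m: "**" + m.group(0) + "**", text)


def count_open_positions(text: str) -> int:
    """Count how many positions still need a human decision."""
    return len(AMBIGUOUS.findall(text))
```

For instance, `highlight("səbəb")` gives `s**ə**b**ə**b`, and `count_open_positions("məˀəz[wūo]rī")` reports three open positions.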