Differences
This shows you the differences between two versions of the page.
user:zeman:transliteration-of-urdu-to-latin-script [2010/11/10 09:38] zeman With alef it somehow was not visible, but maybe I had the characters in swapped order. |
user:zeman:transliteration-of-urdu-to-latin-script [2010/11/10 14:18] zeman New vowels with hamza. |
||
---|---|---|---|
Line 103: | Line 103: | ||
| 06CC | ی | j, i:, e: | y, ī, e | | | 06CC | ی | j, i:, e: | y, ī, e | | ||
| 06D2 | ے | e: | e | | | 06D2 | ے | e: | e | | ||
- | | 0626 | ئ | - | 0 | | + | | 06D3 | ۓ | e: | e | |
- | | 0674 | ٔ (high hamza) | - | 0 | | + | | 0624 | ؤ | u:, o: | ū, o | |
+ | | 0626 | ئ | -, i:, e | 0, ī, e | | ||
+ | | 0654 | (hamza above)ٔ | - | 0 | | ||
+ | | 0674 | (high hamza)ٔ | - | 0 | | ||
The transliteration script should contain a gradually growing vocabulary that would help disambiguate known words. Otherwise there would be a very high number of ambiguous positions in any transliterated string. | The transliteration script should contain a gradually growing vocabulary that would help disambiguate known words. Otherwise there would be a very high number of ambiguous positions in any transliterated string. | ||
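To illustrate why ambiguous positions multiply quickly, here is a minimal Python sketch built only from the table rows visible in this diff. The names `CANDIDATES`, `ambiguous_positions` and `reading_count` are hypothetical and are not taken from the actual transliteration library. It also assumes that the `0` in the romanization column means "no output", modeled here as an empty string.

```python
# Candidate romanizations for the table rows shown above (last column).
# Assumption: "0" in the table means the character yields no output ("").
CANDIDATES = {
    "\u06CC": ["y", "ī", "e"],  # ی
    "\u06D2": ["e"],            # ے
    "\u06D3": ["e"],            # ۓ
    "\u0624": ["ū", "o"],       # ؤ
    "\u0626": ["", "ī", "e"],   # ئ
    "\u0654": [""],             # hamza above
    "\u0674": [""],             # high hamza
}


def ambiguous_positions(text):
    """Return the indices of characters that have more than one candidate."""
    # Characters missing from the partial table are treated as unambiguous.
    return [i for i, ch in enumerate(text) if len(CANDIDATES.get(ch, [ch])) > 1]


def reading_count(text):
    """Return how many distinct romanizations the candidates allow in total."""
    count = 1
    for ch in text:
        count *= len(CANDIDATES.get(ch, [ch]))
    return count
```

With only these rows, a three-character string such as `"\u06CC\u06D2\u0626"` already has two ambiguous positions and nine possible readings, which is the growth a vocabulary is meant to cut down.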
Line 129: | Line 132: | ||
* http:// | * http:// | ||
+ | ===== Vocabulary of Frequent Words ===== | ||
+ | |||
+ | Some frequent words cannot be disambiguated by character-based rules alone, but a vocabulary could identify them as known, unambiguous Urdu words and save much manual disambiguation work. Here are some examples: | ||
+ | |||
+ | * ہے => he (“is”) | ||
+ | * میں => meñ (“in”) | ||
+ | * ایک => ek (“one”) | ||
+ | * اور => or (“and”) | ||
+ | |||
+ | Note, however, that some words are inherently ambiguous and cannot be disambiguated without human intervention (or at least without looking at the neighboring words). Examples: | ||
+ | |||
+ | * تو => to (“so”) | tū (“thou”) | ||
+ | * اس => is (“of this”) | us (“of that”) | ||
+ | * ان => in (“of these”) | un (“of those”) | ||
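The two lists above could be represented in a single lookup table, sketched below in Python. This is only an illustration under stated assumptions: `VOCABULARY` and `lookup` are hypothetical names, the entries are just the examples listed here, and the real library may store its vocabulary quite differently.

```python
# Hypothetical vocabulary: each Urdu word maps to its known romanizations.
# One reading means the word is unambiguous; several readings mean a human
# (or a look at the neighboring words) has to choose.
VOCABULARY = {
    "ہے": ["he"],         # "is"
    "میں": ["meñ"],       # "in"
    "ایک": ["ek"],        # "one"
    "اور": ["or"],        # "and"
    "تو": ["to", "tū"],   # "so" | "thou"
    "اس": ["is", "us"],   # "of this" | "of that"
    "ان": ["in", "un"],   # "of these" | "of those"
}


def lookup(word):
    """Return (readings, needs_review) for one word.

    readings is None when the word is not in the vocabulary, in which case
    the caller would fall back to character-based rules.
    """
    readings = VOCABULARY.get(word)
    if readings is None:
        return None, True
    return readings, len(readings) > 1
```

Keeping the ambiguous words in the same table lets the script narrow the choice to two or three known readings instead of falling back to every character-level combination.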
+ | |||
+ | ===== The Transliteration Script ===== | ||
+ | |||
+ | You need two files. All of the transliteration knowledge is encoded in the library '' | ||
+ | |||
+ | <code bash> | ||
+ | |||
+ | If you are on the ÚFAL network, you will find the script in '' | ||
+ | |||
+ | I am also attaching the current snapshot of the two folders to this wiki {{: | ||
+ | |||
+ | This is an example of an Urdu sentence and the romanized output produced by the script: | ||
+ | |||
+ | * آپ کو پچھلے 182 دنوں میں اپنی بیماری یا معزوری کے سبب مندرجہ ذیل میں سے کوئی ایک ملتا رہا ہے ؟ | ||
+ | * āp ko pəčhəle 182 dənoñ meñ əpənī b[yīe]mārī yā məˀəz[wūo]rī ke səbəb mənədərəjəh ż[yīe]l meñ se koī ek mələtā rəhā he ? |
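The square brackets in the output above, as in `b[yīe]mārī`, appear to mark positions where the script could not choose between several readings. Assuming that interpretation, and assuming each alternative inside the brackets is a single character, the following Python sketch lists every concrete reading of such a token. The function `expand_readings` is hypothetical and is not part of the script.

```python
import itertools
import re
import unicodedata


def expand_readings(token):
    """Expand a token such as 'b[yīe]mārī' into all of its concrete readings."""
    # re.split with one capturing group alternates literal text (even indices)
    # and the contents of one bracket (odd indices).
    parts = re.split(r"\[([^\]]+)\]", token)
    options = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            options.append([part])
        else:
            # NFC keeps letters like "ī" as one code point, so that splitting
            # the bracket content character by character is safe.
            options.append(list(unicodedata.normalize("NFC", part)))
    return ["".join(combo) for combo in itertools.product(*options)]
```

A token without brackets comes back as a single reading, and a token with two bracketed positions of three alternatives each yields nine readings, so a reviewer can pick the correct one from an explicit list.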