Differences

This shows you the differences between two versions of the page.

--- user:zeman:transliteration-of-urdu-to-latin-script [2010/11/10 09:38]
zeman With the alef it somehow was not visible, but maybe I had the characters in swapped order.
+++ user:zeman:transliteration-of-urdu-to-latin-script [2010/11/10 14:18]
zeman New vowels with hamza.
@@ Line 103: / Line 103: @@
 | 06CC | ی | j, i:, e: | y, ī, e |
 | 06D2 | ے | e: | e |
-| 0626 | ئ | - | 0 |
+| 06D3 | ۓ | e: | e |
-| 0674 | ٔ (high hamza) | - | 0 |
+| 0624 | ؤ | u:, o: | ū, o |
+| 0626 | ئ | -, i:, e | 0, ī, e |
+| 0654 | (hamza above)ٔ | - | 0 |
+| 0674 | (high hamza)ٔ | - | 0 |
 The transliteration script should contain a gradually growing vocabulary that would help disambiguate known words. Otherwise there would be a very high number of ambiguous positions in any transliterated string.
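To illustrate why a vocabulary is needed, here is a minimal Python sketch (not the Perl library itself) that counts how many readings a string has when only the per-character table is used. The dictionary holds just the rows visible in this hunk; the names ''CANDIDATES'', ''ambiguous_positions'' and ''count_readings'' are hypothetical, and treating the table's "0" as an empty string is an assumption.

```python
from math import prod

# Candidate Latin renderings per Urdu code point, copied from the table rows
# shown in this hunk. The empty string stands for the table's "0"
# (assumption: "0" means that nothing is output).
CANDIDATES = {
    "\u06CC": ["y", "ī", "e"],
    "\u06D2": ["e"],
    "\u06D3": ["e"],
    "\u0624": ["ū", "o"],
    "\u0626": ["", "ī", "e"],
    "\u0654": [""],
    "\u0674": [""],
}


def ambiguous_positions(text):
    """Return (index, candidates) for every character with more than one rendering."""
    return [
        (i, CANDIDATES[ch])
        for i, ch in enumerate(text)
        if len(CANDIDATES.get(ch, [ch])) > 1
    ]


def count_readings(text):
    """Number of distinct candidate strings if every position is chosen independently.

    Characters missing from this partial table are treated as unambiguous.
    """
    return prod(len(CANDIDATES.get(ch, [ch])) for ch in text)
```

Under these assumptions, the two-letter sequence U+0626 U+06CC alone already has 3 × 3 = 9 candidate readings, which is the growth the paragraph above warns about.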
@@ Line 129: / Line 132: @@
   * http://users.skynet.be/hugocoolens/newurdu/vowels.html
+===== Vocabulary of Frequent Words =====
+Some frequent words cannot be disambiguated by character-based rules alone, but a vocabulary could recognize them as known, unambiguous Urdu words and resolve them automatically, which would save much manual work. Here are some examples:
+  * ہے => he (“is”)
+  * میں => meñ (“in”)
+  * ایک => ek (“one”)
+  * اور => or (“and”)
+Note, however, that some words are inherently ambiguous and cannot be disambiguated without human intervention (or at least without looking at the neighboring words). Examples:
+  * تو => to (“so”) | tū (“thou”)
+  * اس => is (“of this”) | us (“of that”)
+  * ان => in (“of these”) | un (“of those”)
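The two kinds of vocabulary entries listed above could be represented as follows. This is only a Python sketch of the idea, not how ''urdund.pm'' stores its vocabulary; the names ''UNAMBIGUOUS'', ''AMBIGUOUS'' and ''lookup_word'' are hypothetical, and joining the alternatives with "|" merely mirrors the notation of the list above.

```python
# Words that always get the same romanization (examples from the first list).
UNAMBIGUOUS = {
    "ہے": "he",
    "میں": "meñ",
    "ایک": "ek",
    "اور": "or",
}

# Words whose reading depends on context (examples from the second list).
AMBIGUOUS = {
    "تو": ["to", "tū"],
    "اس": ["is", "us"],
    "ان": ["in", "un"],
}


def lookup_word(word):
    """Return (romanization, resolved).

    resolved is True only when the vocabulary gives exactly one reading.
    Ambiguous words keep all their readings, separated by "|", so that a
    human (or a later context-aware step) can pick one. Unknown words
    return (None, False) and are left to the character-based rules.
    """
    if word in UNAMBIGUOUS:
        return UNAMBIGUOUS[word], True
    if word in AMBIGUOUS:
        return "|".join(AMBIGUOUS[word]), False
    return None, False
```

Keeping the ambiguous words in a separate table makes it explicit which positions still need manual review.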
+===== The Transliteration Script =====
+You need two files. All of the transliteration knowledge is encoded in the library ''urdund.pm''. The Perl script ''translit_urdund.pl'' merely reads the standard input, passes it through the library and sends the result to the standard output. It is called like this:
+<code bash>perl translit_urdund.pl < urdu.txt > latin.txt</code>
+If you are on the ÚFAL network, you will find the script in ''~zeman/projekty/transliterace''. It should be able to find the library itself; the library is in ''~zeman/lib/translit'' (you will find programs and libraries for other writing systems in these two folders as well).
+I am also attaching the current snapshot of the two folders to this wiki {{:user:zeman:translit.zip|here}}. Note however that it will not be updated regularly.
+This is an example of an Urdu sentence and the romanized output produced by the script:
+  * آپ کو پچھلے 182 دنوں میں اپنی بیماری یا معزوری کے سبب مندرجہ ذیل میں سے کوئی ایک ملتا رہا ہے ؟
+  * āp ko pəčhəle 182 dənoñ meñ əpənī b[yīe]mārī yā məˀəz[wūo]rī ke səbəb mənədərəjəh ż[yīe]l meñ se koī ek mələtā rəhā he ?
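In the sample output, positions such as ''[yīe]'' and ''[wūo]'' appear to list the alternative renderings of a single ambiguous character (this reading is an assumption based on the character table; the page does not state it explicitly). Under that assumption, a token can be expanded into its candidate readings with a short Python sketch. The function name ''expand_alternatives'' is hypothetical and is not part of the Perl script.

```python
import re
import unicodedata
from itertools import product


def expand_alternatives(token):
    """Expand bracketed single-character alternatives into full candidate strings.

    Assumes each character inside [...] is one alternative. The token is
    normalized to NFC first so that letters such as "ī" are single code points.
    """
    token = unicodedata.normalize("NFC", token)
    # re.split with a capturing group keeps the bracket contents at odd indices.
    parts = re.split(r"\[([^\]]+)\]", token)
    choices = [list(p) if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for candidate in expand_alternatives("b[yīe]mārī"):
        print(candidate)
```

A vocabulary of known words could then be used to keep only the candidates that are real Urdu words.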


Institute of Formal and Applied Linguistics Wiki
