Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:transliteration-of-urdu-to-latin-script [2010/11/10 09:38]
zeman With alef it somehow was not visible, but maybe I had the characters in swapped order.
user:zeman:transliteration-of-urdu-to-latin-script [2010/11/10 13:59]
zeman Attached the software.
Line 129:
   * http://users.skynet.be/hugocoolens/newurdu/vowels.html
  
 +===== Vocabulary of Frequent Words =====
 +
 +Some frequent words cannot be disambiguated by character-based rules alone. A vocabulary could identify them as existing, unambiguous Urdu words and transliterate them directly, which would save much manual work. Here are some examples:
 +
 +  * ہے => he
 +  * میں => meñ
 +  * ایک => ek
 +  * اور => or
 +
 +Note, however, that some words are inherently ambiguous and cannot be disambiguated without human intervention (or at least without looking at the neighboring words). Examples:
 +
 +  * تو => to | tū
 +  * اس => is | us
 +  * ان => in | un
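
The vocabulary idea can be sketched as a simple lookup table. This is only a minimal illustration in Python, not the code in ''urdund.pm''; the names `UNAMBIGUOUS`, `AMBIGUOUS`, `lookup` and `transliterate_tokens` are hypothetical, and the entries are just the examples listed above. Unknown words are handed to a fallback, which stands in for the character-based rules.

```python
# Hypothetical vocabulary built only from the examples on this page.
UNAMBIGUOUS = {
    "ہے": "he",
    "میں": "meñ",
    "ایک": "ek",
    "اور": "or",
}

# Inherently ambiguous words keep all their candidate readings.
AMBIGUOUS = {
    "تو": ["to", "tū"],
    "اس": ["is", "us"],
    "ان": ["in", "un"],
}


def lookup(word):
    """Return one romanization, 'a | b' alternatives, or None if unknown."""
    if word in UNAMBIGUOUS:
        return UNAMBIGUOUS[word]
    if word in AMBIGUOUS:
        # Leave the choice to a human (or to a context-aware step).
        return " | ".join(AMBIGUOUS[word])
    return None


def transliterate_tokens(text, fallback=lambda word: word):
    """Look up each whitespace-separated token; unknown tokens go to fallback."""
    return " ".join(lookup(word) or fallback(word) for word in text.split())
```

Keeping the ambiguous words in a separate table makes it easy to flag them for manual review instead of silently picking one reading.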
 +
 +===== The Transliteration Script =====
 +
 +You need two files. All of the transliteration knowledge is encoded in the library ''urdund.pm''. The Perl script ''translit_urdund.pl'' merely reads the standard input, passes it through the library and sends the result to the standard output. It is called like this:
 +
 +<code bash>perl translit_urdund.pl < urdu.txt > latin.txt</code>
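
The split between a thin command-line filter and a library that holds all the rules can be sketched as follows. This is a Python illustration of the structure only, not the Perl script itself; `transliterate` is a hypothetical placeholder that returns its input unchanged, because the real rules live in ''urdund.pm'' and are not reproduced here.

```python
import sys


def transliterate(line):
    # Hypothetical placeholder for the library call; returns the line unchanged.
    return line


def run_filter(instream, outstream, convert=transliterate):
    """Read lines from instream, convert each one, and write it to outstream."""
    for line in instream:
        outstream.write(convert(line))


def main():
    # Urdu input and romanized output both contain non-ASCII characters,
    # so force UTF-8 regardless of the locale.
    sys.stdin.reconfigure(encoding="utf-8")
    sys.stdout.reconfigure(encoding="utf-8")
    run_filter(sys.stdin, sys.stdout)
```

Calling `main()` at the end of the file would let the sketch be used with the same redirection as the Perl script above.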
 +
 +If you are on the ÚFAL network, you will find the script in ''~zeman/projekty/transliterace''. It should be able to find the library itself; the library is in ''~zeman/lib/translit'' (you will find programs and libraries for other writing systems in these two folders as well).
 +
 +A current snapshot of the two folders is also attached to this wiki {{:user:zeman:translit.zip|here}}. Note, however, that it will not be updated regularly.
 +
 +This is an example of an Urdu sentence and the romanized output produced by the script:
 +
 +  * آپ کو پچھلے 182 دنوں میں اپنی بیماری یا معزوری کے سبب مندرجہ ذیل میں سے کوئی ایک ملتا رہا ہے ؟
 +  * āp ko pəčhəle 182 dənoñ meñ əpənī b[yīe]mārī yā məˀəz[wūo]rī ke səbəb mənədərəjəh ż[yīe]l meñ se koī ek mələtā rəhā he ?
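
The output above contains bracketed groups such as `b[yīe]mārī`. Assuming (this is not stated on the page) that a group lists alternative readings of one ambiguous letter, the candidates for a word could be enumerated as in the Python sketch below. The function name `expand_alternatives` is hypothetical and is not part of the attached software.

```python
import itertools
import re
import unicodedata


def expand_alternatives(word):
    """Expand every [abc] group into one candidate per listed character.

    Assumption: each bracketed group holds single-character alternatives.
    """
    # NFC keeps letters such as ī as one code point, so that splitting a
    # group into characters does not separate a letter from its diacritic.
    word = unicodedata.normalize("NFC", word)
    # With one capturing group, re.split alternates literal text (even
    # indices) and group contents (odd indices).
    parts = re.split(r"\[([^\]]+)\]", word)
    choices = [list(part) if i % 2 else [part] for i, part in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]
```

For `b[yīe]mārī` this yields three candidates, one per letter in the group; a word without brackets comes back as a single-element list.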