Differences

This shows you the differences between two versions of the page.

--- user:zeman:transliteration-of-urdu-to-latin-script [2010/11/10 13:53]
zeman Frequent words.
+++ user:zeman:transliteration-of-urdu-to-latin-script [2010/11/16 08:25] (current)
zeman ~
@@ Line 1: / Line 1: @@
 ====== Transliteration of Urdu to Latin Script ======
+Copyright © 2010 by Dan Zeman <zeman@ufal.mff.cuni.cz>
+License: GNU GPL
+This research has been supported by the grant of the Czech Ministry of Education no. MSM0021620838.
 ===== Transliteration versus Transcription =====
@@ Line 18: / Line 22: @@
 ===== Consonants =====
-Most of the consonants do not pose any serious problem. I decided to represent the retroflex consonants (ٹ, ڈ, ڑ) by a dot below their dental or other counterparts, as is usual across the Indo-Aryan languages. A dot above a letter distinguishes a two Arabic letters whose Urdu pronunciation is identical to other letters, from transliteration of those other letters (ث, ذ). Similarly, a cedilla below a letter distinguishes other five letters that occur in words of Arabic descent (ح, ص, ض, ط, ظ).
+Most of the consonants do not pose any serious problem. I decided to represent the retroflex consonants (ٹ, ڈ, ڑ) by a dot below their dental or other counterparts, as is usual across the Indo-Aryan languages. A dot above a letter distinguishes two Arabic letters whose Urdu pronunciation is identical to other letters, from transliteration of those other letters (ث, ذ). Similarly, a cedilla below a letter distinguishes five other letters that occur in words of Arabic descent (ح, ص, ض, ط, ظ).
 Some other notes: //j// is pronounced as in English, not as in Czech or German. //č// and //š// are used in Baltic and Slavic languages (among others) to represent the sounds that are usually written “ch” or “sh”, respectively, in English. Of similar descent is the character //ž//; the corresponding sound is sometimes represented as “zh” in English and corresponds to the French pronunciation of //j//. //x// represents (in accord with phonetic tradition) the same sound as Czech/German/Scottish “ch”. English-oriented transcriptions of Arabic often transcribe this sound as “kh”, a solution that we want to avoid. It would conflict with the aspirated //kh// of Urdu. //ğ// is taken from Turkish and describes the sound that is often transcribed “gh” from Arabic (which we cannot use, again because of the aspirated //gh//).
@@ Line 60: / Line 64: @@
 | 0645 | م | m | m |
 | 0646 | ن | n | n |
-| 06BA | ں | n | ñ |
+| 06BA | ں | ~ | ñ |
 | 0648 | و | v | w |
 | 06C1 | ہ | h | h |
@@ Line 103: / Line 107: @@
 | 06CC | ی | j, i:, e: | y, ī, e |
 | 06D2 | ے | e: | e |
-| 0626 | ئ | - | 0 |
+| 06D3 | ۓ | e: | e |
-| 0674 | ٔ (high hamza) | - | 0 |
+| 0624 | ؤ | u:, o: | ū, o |
+| 0626 | ئ | -, i:, e | 0, ī, e |
+| 0654 | (hamza above)ٔ | - | 0 |
+| 0674 | (high hamza)ٔ | - | 0 |
 The transliteration script should contain a gradually growing vocabulary that would help disambiguate known words. Otherwise there would be a very high number of ambiguous positions in any transliterated string.
@@ Line 110: / Line 117: @@
 ===== Short Vowels and Diacritics =====
-Without diacritics (which is more common), every consonant that is not followed by a long vowel may or may not be followed by a short vowel. I denote this possibility by the character for the neutral character schwa: //ə//.
+Without diacritics (which is more common), every consonant that is not followed by a long vowel may or may not be followed by a short vowel. I denote this possibility by the character for the neutral vowel schwa: //ə//.
 //Warning! This section is under construction. I am still confused about the exact rules for Urdu vowel representation, so I also expect more errors to occur here.//
@@ Line 133: / Line 140: @@
 Some frequent words cannot be disambiguated by character-based rules alone but a vocabulary could identify them as existing unambiguous Urdu words and save much manual work by disambiguating them. Here are some examples:
-  * ہے => he
+  * ہے => he (“is”)
-  * میں => meñ
+  * میں => meñ (“in”)
-  * ایک => ek
+  * ایک => ek (“one”)
-  * اور => or
+  * اور => or (“and”)
 Note however that there are inherently ambiguous words that cannot be disambiguated without human intervention (or at least without looking at the neighboring words). Examples:
-  * تو => to | tū
+  * تو => to (“so”) | tū (“thou”)
-  * اس => is | us
+  * اس => is (“of this”) | us (“of that”)
-  * ان => in | un
+  * ان => in (“of these”) | un (“of those”)
 ===== The Transliteration Script =====
@@ Line 150: / Line 157: @@
 <code bash>perl translit_urdund.pl < urdu.txt > latin.txt</code>
-If you happen to sit on the ÚFAL network, you will find the script in ''~zeman/projekty/transliterace''. It should be able to find the library itself; the library is in ''~zeman/lib/translit'' (you will programs and libraries for other writing systems in these two folders as well).
+If you happen to sit on the ÚFAL network, you will find the script in ''~zeman/projekty/transliterace''. It should be able to find the library itself; the library is in ''~zeman/lib/translit'' (you will find programs and libraries for other writing systems in these two folders as well).
+I am also attaching the current snapshot of the two folders to this wiki {{:user:zeman:translit.zip|here}}. Note however that it will not be updated regularly.
 This is an example of an Urdu sentence and the romanized output by the script:
@@ Line 156: / Line 165: @@
   * آپ کو پچھلے 182 دنوں میں اپنی بیماری یا معزوری کے سبب مندرجہ ذیل میں سے کوئی ایک ملتا رہا ہے ؟
   * āp ko pəčhəle 182 dənoñ meñ əpənī b[yīe]mārī yā məˀəz[wūo]rī ke səbəb mənədərəjəh ż[yīe]l meñ se koī ek mələtā rəhā he ?
+Afterwards, a speaker of Urdu is supposed to edit the transliteration and disambiguate all remaining cases:
+  * āp ko p**ə**čh**ə**le 182 d**ə**noñ meñ **ə**p**ə**nī b**[yīe]**mārī yā m**ə**ˀ**ə**z**[wūo]**rī ke s**ə**b**ə**b m**ə**n**ə**d**ə**r**ə**j**ə**h ż**[yīe]**l meñ se koī ek m**ə**l**ə**tā r**ə**hā he ?
+    * Exactly one of the alternatives listed in brackets has to be selected. The brackets do not always list all possibilities, but the missing ones are easy to guess. For instance, [yīe] should in fact be [əyə|ī|e].
+    * The schwa //ə// is a shortcut for [aiu] or an empty string (no vowel here).
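
The manual disambiguation step described in the last hunk can be pictured as choosing one candidate per ambiguous position. The following Python sketch is only an illustration and is not part of the author's Perl script: the names `slots` and `candidates` are hypothetical, each character inside brackets is assumed to be one alternative (the fuller form such as [əyə|ī|e] is not handled), and the schwa is expanded to a, i, u, or nothing, as the note above states.

```python
import itertools
import re
import unicodedata

# Assumption taken from the note above: a schwa stands for a, i, u, or no vowel.
SCHWA_OPTIONS = ("a", "i", "u", "")

# One token is a bracketed group, a single schwa (U+0259), or a literal run.
TOKEN = re.compile(r"\[([^\]]+)\]|(\u0259)|([^\[\u0259]+)")


def slots(word):
    """Split a romanized word into slots; each slot is a tuple of alternatives."""
    # NFC keeps letters such as ī as single code points, so that a bracket
    # group can be split character by character.
    word = unicodedata.normalize("NFC", word)
    result = []
    for match in TOKEN.finditer(word):
        bracket, schwa, literal = match.groups()
        if bracket is not None:
            result.append(tuple(bracket))   # one alternative per character
        elif schwa is not None:
            result.append(SCHWA_OPTIONS)    # short vowel or empty string
        else:
            result.append((literal,))       # unambiguous text
    return result


def candidates(word):
    """Return every fully disambiguated reading of the word."""
    return ["".join(choice) for choice in itertools.product(*slots(word))]


if __name__ == "__main__":
    for reading in candidates("r\u0259h\u0101"):
        print(reading)
```

The number of readings grows multiplicatively with the number of ambiguous positions (a word with two schwas already has 16), which is consistent with the page's point that a vocabulary of known words saves much manual work.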
