====== Transliteration of Urdu to Latin Script ======

Copyright © 2010 by Dan Zeman <zeman@ufal.mff.cuni.cz>
License: GNU GPL
This research has been supported by grant no. MSM0021620838 of the Czech Ministry of Education.

===== Transliteration versus Transcription =====

Transcription of texts between scripts can be guided by two main (often conflicting) principles:

  * Preserving orthography
  * Preserving pronunciation

To some extent, transcription is always guided by pronunciation (otherwise it would be //encryption//). However, if preserving pronunciation is the top priority, the mapping may become irreversible: silent letters are omitted and different strings that are pronounced the same way are mapped to the same target letter. A pronunciation-oriented mapping also depends on the target //language// (in addition to the target //script//) because different languages pronounce the same characters of the same script differently.

In contrast, //transliteration// aims first and foremost at preserving the original orthography. Ideally, transliteration is a one-to-one, letter-for-letter mapping. Diacritical marks are used if the target script does not have enough characters. Occasionally, a sequence of target letters (such as 'sh') can stand for one source character if one can be reasonably sure that the individual letters cannot otherwise occur next to each other.

Transliteration may even disambiguate cases that are ambiguous in the source text (e.g. recover short vowels missing from a text in Arabic script). As long as information is only added and never removed, this is acceptable: the original spelling can still be reconstructed.

In this document I describe my approach to Romanization (i.e. transliteration into a Latin-based alphabet) of Urdu text. I use this transliteration scheme in papers on Urdu. My goal is to reflect the original pronunciation as well as possible without violating the requirement that the original spelling be restorable. To help the reader with the pronunciation, I want to insert missing short vowels and disambiguate a few other cases. I provide a Perl script that implements the deterministic part of the transliteration and marks positions where a human decision is needed.

Urdu uses a few characters that are not used in the original Arabic script. Moreover, some of the original Arabic letters would call for a different Latin representation if the mapping were motivated by Arabic rather than Urdu pronunciation. On the target side, I had no particular language in mind when modeling the pronunciation. See below for details.

===== Consonants =====

Most of the consonants do not pose any serious problem. I decided to represent the retroflex consonants (ٹ, ڈ, ڑ) by a dot below their dental or other counterparts, as is usual across the Indo-Aryan languages. A dot above a letter marks two Arabic letters (ث, ذ) whose Urdu pronunciation is identical to that of other letters, so that their transliteration stays distinct from the transliteration of those other letters. Similarly, a cedilla below a letter distinguishes five other letters that occur in words of Arabic descent (ح, ص, ض, ط, ظ).

Some other notes: //j// is pronounced as in English, not as in Czech or German. //č// and //š// are used in Baltic and Slavic languages (among others) for the sounds usually written “ch” and “sh”, respectively, in English. Of similar descent is the character //ž//; the corresponding sound is sometimes written “zh” in English and corresponds to the French pronunciation of //j//. //x// represents (in accord with phonetic tradition) the same sound as Czech/German/Scottish “ch”. English-oriented transcriptions of Arabic often render this sound as “kh”, which I avoid because it would conflict with the aspirated //kh// of Urdu. //ğ// is taken from Turkish and stands for the sound that is often transcribed “gh” from Arabic (which I cannot use either, again because of the aspirated //gh//).

I do not attempt to map the special Semitic guttural consonant //ayin// to a Latin letter based on the pronunciation of some European language, as this sound is unfamiliar to most Europeans. In transcriptions of Arabic, it is sometimes represented by a superscript //c//. I use the IPA symbol ˀ (MODIFIER LETTER GLOTTAL STOP).

The letter ں (NOON GHUNNA) occurs only at the end of the word and marks nasalization of the preceding vowel rather than a real consonant.

There are two //h// letters: ہ (HEH GOAL) and ھ (HEH DOACHASHMEE). It is not necessary to distinguish them by diacritics because they occur in different positions. The normal consonant //h// is written using ہ (HEH GOAL), which can also appear at the end of a word to mark an (otherwise invisible) word-final short vowel //a// (transliterated //ah//). In contrast, ھ (HEH DOACHASHMEE) is used exclusively after other consonants (such as //k, g, č, j, t, d, b, p//) to form their aspirated counterparts. Thus, بھ is //bh//, پھ is //ph// etc.

^ Unicode ^ Character ^ Pronunciation ^ Transliteration ^
| 0628 | ب | b | b |
| 067E | پ | p | p |
| 062A | ت | ṱ | t |
| 0679 | ٹ | ʈ | ṭ |
| 062B | ث | s | ṡ |
| 062C | ج | ʤ | j |
| 0686 | چ | ʧ | č |
| 062D | ح | h | ḩ |
| 062E | خ | x | x |
| 062F | د | d | d |
| 0688 | ڈ | ɖ | ḍ |
| 0630 | ذ | z | ż |
| 0631 | ر | r | r |
| 0691 | ڑ | ɽ | ṛ |
| 0632 | ز | z | z |
| 0698 | ژ | ʒ | ž |
| 0633 | س | s | s |
| 0634 | ش | ʃ | š |
| 0635 | ص | s | ş |
| 0636 | ض | z | z̧ |
| 0637 | ط | ṱ | ţ |
| 0638 | ظ | z | ḑ |
| 0639 | ع | Ɂ | ˀ |
| 063A | غ | ɣ | ğ |
| 0641 | ف | f | f |
| 0642 | ق | q | q |
| 06A9 | ک | k | k |
| 06AF | گ | g | g |
| 0644 | ل | l | l |
| 0645 | م | m | m |
| 0646 | ن | n | n |
| 06BA | ں | ~ | ñ |
| 0648 | و | v | w |
| 06C1 | ہ | h | h |
| 06BE | ھ | h | h |
| 06CC | ی | j | y |
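
The table can be checked mechanically for the one-to-one property that transliteration requires. The following is a minimal sketch in Python, not the Perl library described later; the names ''CONSONANTS'' and ''collisions'' are hypothetical. The Latin letters are built from base letters and combining marks and then normalized, which is assumed to be equivalent to the precomposed characters in the table.

<code python>
import unicodedata
from collections import defaultdict

# Combining marks used by the scheme (dot below, dot above, cedilla, ...).
DOT_BELOW, DOT_ABOVE, CEDILLA = "\u0323", "\u0307", "\u0327"
CARON, BREVE, TILDE = "\u030C", "\u0306", "\u0303"

# Code point -> Latin transliteration, following the consonant table above.
CONSONANTS = {
    0x0628: "b", 0x067E: "p", 0x062A: "t", 0x0679: "t" + DOT_BELOW,
    0x062B: "s" + DOT_ABOVE, 0x062C: "j", 0x0686: "c" + CARON,
    0x062D: "h" + CEDILLA, 0x062E: "x", 0x062F: "d",
    0x0688: "d" + DOT_BELOW, 0x0630: "z" + DOT_ABOVE, 0x0631: "r",
    0x0691: "r" + DOT_BELOW, 0x0632: "z", 0x0698: "z" + CARON,
    0x0633: "s", 0x0634: "s" + CARON, 0x0635: "s" + CEDILLA,
    0x0636: "z" + CEDILLA, 0x0637: "t" + CEDILLA, 0x0638: "d" + CEDILLA,
    0x0639: "\u02C0", 0x063A: "g" + BREVE, 0x0641: "f", 0x0642: "q",
    0x06A9: "k", 0x06AF: "g", 0x0644: "l", 0x0645: "m", 0x0646: "n",
    0x06BA: "n" + TILDE, 0x0648: "w", 0x06C1: "h", 0x06BE: "h",
    0x06CC: "y",
}

def collisions(table):
    """Return every Latin string that is the image of more than one letter."""
    by_target = defaultdict(list)
    for codepoint, latin in table.items():
        by_target[unicodedata.normalize("NFC", latin)].append(codepoint)
    return {latin: cps for latin, cps in by_target.items() if len(cps) > 1}

for latin, codepoints in collisions(CONSONANTS).items():
    print(latin, [hex(cp) for cp in codepoints])
</code>

The only collision reported is //h//, shared by HEH GOAL and HEH DOACHASHMEE, which the text above resolves by position rather than by a diacritic.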

===== Vowels =====

The consonant (or semi-vowel) و //(w)// is also used, ambiguously, to represent the long vowels //ū// (pronounced as //oo// in English //fool//) and //o// (pronounced as //oo// in English //door//). I want to distinguish these three readings. (Note, however, that I do not attempt to further distinguish //o// from the slightly different vowel //ao//, pronounced as //au// in English //automatic//; I treat the two as identical.) In most cases, however, the script can only output //[wūo]// and leave the disambiguation to human judgment:

  * In word-initial position, I assume that only consonantal pronunciation is possible and always output //w//.
  * Anywhere immediately before ا (ALEF), I assume that only consonantal pronunciation is possible and always output //w//.
  * In word-final position, I believe that a vowel is more likely, although I am not sure that the consonant can be completely excluded. Nevertheless, I currently output //[ūo]//.
  * If it appears immediately before word-final ں (NOON GHUNNA), I consider it part of the plural oblique case suffix and invariably output //o//.
  * In all other cases I output //[wūo]//.
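
The rules above can be sketched as a small function. This is an illustration in Python under stated assumptions, not the code of the Perl library: the function name ''waw_reading'' is hypothetical, and the rules are assumed to apply in the order in which they are listed.

<code python>
WAW, ALEF, NOON_GHUNNA = "\u0648", "\u0627", "\u06BA"

def waw_reading(word, i):
    """Return the transliteration of the WAW at index i of an undiacritized word."""
    if word[i] != WAW:
        raise ValueError("no WAW at index %d" % i)
    last = len(word) - 1
    if i == 0:                                    # word-initial: consonant
        return "w"
    if i < last and word[i + 1] == ALEF:          # before ALEF: consonant
        return "w"
    if i == last:                                 # word-final: a long vowel
        return "[ūo]"
    if i == last - 1 and word[last] == NOON_GHUNNA:
        return "o"                                # plural oblique suffix
    return "[wūo]"                                # left to human judgment

# The word transliterated "dənoñ" in the sample sentence of this document.
print(waw_reading("\u062F\u0646\u0648\u06BA", 2))  # o
</code>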

The consonant (or semi-vowel) ی //(y)// is also used, ambiguously, to represent the long vowels //ī// (pronounced as //ee// in English //feet//) and //e// (pronounced roughly as //ai// in English //fair//). I want to distinguish these three readings. (Note, however, that I do not attempt to further distinguish //e// from the slightly different vowel //ae//, which is pronounced more open; I treat the two as identical.) In most cases, however, the script can only output //[yīe]// and leave the disambiguation to human judgment:

  * In word-initial position, I assume that only consonantal pronunciation is possible and always output //y//.
  * Anywhere immediately before ا (ALEF), I assume that only consonantal pronunciation is possible and always output //y//.
  * In word-final position, I assume that the only possible reading is //ī//.
  * In all other cases I output //[yīe]//.
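
A matching sketch for ی, again in Python and again only an illustration: ''yeh_reading'' is a hypothetical name, and the rules are assumed to be tried in the listed order. The loop at the end walks through one word and shows which positions remain ambiguous.

<code python>
YEH, ALEF = "\u06CC", "\u0627"

def yeh_reading(word, i):
    """Return the transliteration of the YEH at index i of an undiacritized word."""
    if word[i] != YEH:
        raise ValueError("no YEH at index %d" % i)
    last = len(word) - 1
    if i == 0:                              # word-initial: consonant
        return "y"
    if i < last and word[i + 1] == ALEF:    # before ALEF: consonant
        return "y"
    if i == last:                           # word-final: always the long vowel
        return "ī"
    return "[yīe]"                          # left to human judgment

# The word transliterated "b[yīe]mārī" in the sample sentence of this document.
word = "\u0628\u06CC\u0645\u0627\u0631\u06CC"
for i, ch in enumerate(word):
    if ch == YEH:
        print(i, yeh_reading(word, i))
</code>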

The letter ے (YEH BARREE) only appears in word-final position and is transliterated as //e// (which is written in other positions using the ambiguous ی).

The letter ا (ALEF) is ambiguous and can lead to many different readings:

  * In word-initial position, it merely says that the word begins with a vowel. It could be any of the three short vowels //[aiu]//: افریقہ //afrīqah// “Africa”, اسلام //islām// “Islam”, اردو //urdū// “Urdu”.
    * If word-initial ا is followed by و or ی, they together could represent a word-initial long vowel //[ūoīe]//, such as in ایک //ek// “one”. In this case, ا should map to an empty string (because the next character itself will allow for transliteration by the long vowel).
  * In word-internal and word-final positions, ا is transliterated to the long vowel //ā// (pronounced as //a// in English //father//).
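
The positional treatment of ا can be sketched in the same way. In this Python illustration (the name ''alef_reading'' is hypothetical), the word-initial case returns //ə//, which is an assumption based on the sample output at the end of this document (//əpənī//): there //ə// stands for any of //[aiu]// or the empty string, so it covers both word-initial sub-cases above.

<code python>
ALEF = "\u0627"

def alef_reading(word, i):
    """Return the transliteration of the plain ALEF at index i of a word."""
    if word[i] != ALEF:
        raise ValueError("no ALEF at index %d" % i)
    if i == 0:
        # Assumption: "ə" abbreviates [aiu] or nothing, so one symbol covers
        # both a short initial vowel and the silent ALEF before WAW or YEH.
        return "ə"
    return "ā"          # word-internal and word-final: long vowel

print(alef_reading("\u06CC\u0627", 1))  # ā
</code>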

The letter آ (ALEF MADDA) only appears in word-initial position and is transliterated as //ā// (which is written in other positions using normal ا).

The YEH with the diacritic HAMZA above separates two consecutive vowels, e.g. جائے گا //jāe gā// “will go” or کوئی //koī// “some”.

Similarly, the diacritic HAMZA above a و separates it from the preceding vowel as in ہاؤسنگ //hāūsing// “housing”. (In this case, the hamza is a separate character that is placed in the logical sequence after the و.)

^ Unicode ^ Character ^ Pronunciation ^ Transliteration ^
| 0627 | ا | -, a: | a, i, u, 0, ā |
| 0622 | آ | a: | ā |
| 0648 | و | v, u:, o: | w, ū, o |
| 06CC | ی | j, i:, e: | y, ī, e |
| 06D2 | ے | e: | e |
| 06D3 | ۓ | e: | e |
| 0624 | ؤ | u:, o: | ū, o |
| 0626 | ئ | -, i:, e | 0, ī, e |
| 0654 | (hamza above)ٔ | - | 0 |
| 0674 | (high hamza)ٔ | - | 0 |

In this table, //0// in the transliteration column stands for the empty string, and //-// in the pronunciation column means that the character has no sound of its own.

The transliteration script should contain a gradually growing vocabulary that would help disambiguate known words. Otherwise there would be a very high number of ambiguous positions in any transliterated string.

===== Short Vowels and Diacritics =====

In text without diacritics (the more common case), every consonant that is not followed by a long vowel may or may not be followed by a short vowel. I denote this possibility by the character for the neutral vowel schwa: //ə//.
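
A minimal sketch of schwa insertion, in Python and under assumptions that go beyond the sentence above. The input is a hypothetical list of already transliterated units, each tagged as consonant (''C''), aspiration mark (''H'') or anything vocalic or nasal (''V''). Two details are inferred from the sample output at the end of this document rather than stated as rules: no schwa is written after a word-final consonant, and none between a consonant and its aspiration mark.

<code python>
def insert_schwas(tokens):
    """Join (latin, kind) pairs, adding a schwa wherever a short vowel may hide.

    kind is "C" (consonant), "H" (aspiration mark after a consonant)
    or "V" (long vowel, ambiguous vowel bracket, or nasalization).
    """
    out = []
    for index, (latin, kind) in enumerate(tokens):
        out.append(latin)
        following = tokens[index + 1][1] if index + 1 < len(tokens) else None
        # A possible short vowel sits between a (possibly aspirated)
        # consonant and the next consonant.
        if kind in ("C", "H") and following == "C":
            out.append("ə")
    return "".join(out)

print(insert_schwas([("s", "C"), ("b", "C"), ("b", "C")]))  # səbəb
</code>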

//Warning! This section is under construction. I am still confused about the exact rules for Urdu vowel representation, so I also expect more errors to occur here.//

Although used rarely, Urdu has means to mark the three short vowels as well. This is done using one of the three diacritical marks. Long vowels can be disambiguated as well, e.g. a consonant with the pesh mark followed by a waw without any diacritic means that the waw is a long vowel //[ūo]// but not the consonant //w//.

^ Unicode ^ Unicode Name ^ Urdu Name ^ With Beh ^ Transliteration ^
| 064E | ARABIC FATHA | zabar | بَ | ba |
| 064F | ARABIC DAMMA | pesh | بُ | bu |
| 0650 | ARABIC KASRA | zer | بِ | bi |
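
The table and the pesh rule from the preceding paragraph can be sketched as follows. This is a Python illustration with hypothetical names (''SHORT_VOWELS'', ''waw_reading_with_marks''); since this section is still under construction, it encodes only what is stated above and should be read with the same caution.

<code python>
ZABAR, PESH, ZER = "\u064E", "\u064F", "\u0650"
WAW = "\u0648"

# Diacritic -> short vowel, following the table above.
SHORT_VOWELS = {ZABAR: "a", PESH: "u", ZER: "i"}

def waw_reading_with_marks(word, i):
    """Narrow the reading of the WAW at index i using neighbouring diacritics."""
    if word[i] != WAW:
        raise ValueError("no WAW at index %d" % i)
    unmarked = i + 1 >= len(word) or word[i + 1] not in SHORT_VOWELS
    if i > 0 and word[i - 1] == PESH and unmarked:
        return "[ūo]"       # pesh before an unmarked WAW: long vowel, not w
    return "[wūo]"          # no evidence from diacritics

# KAF + PESH + WAW + NOON
print(waw_reading_with_marks("\u06A9\u064F\u0648\u0646", 2))  # [ūo]
</code>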

  * pesh (ARABIC DAMMA, 064F): //u//, e.g. کُون //kon// “who”
  * zabar (ARABIC FATHA, 064E): //a//, e.g. کَون //kawn//
  * zer (ARABIC KASRA, 0650): //i// ...

Possible further reading:
  * http://en.wikipedia.org/wiki/Arabic_diacritics
  * http://users.skynet.be/hugocoolens/newurdu/vowels.html

===== Vocabulary of Frequent Words =====

Some frequent words cannot be disambiguated by character-based rules alone, but a vocabulary could identify them as existing unambiguous Urdu words and save much manual work. Here are some examples:

  * ہے => he (“is”)
  * میں => meñ (“in”)
  * ایک => ek (“one”)
  * اور => or (“and”)

Note however that there are inherently ambiguous words that cannot be disambiguated without human intervention (or at least without looking at the neighboring words). Examples:

  * تو => to (“so”) | tū (“thou”)
  * اس => is (“of this”) | us (“of that”)
  * ان => in (“of these”) | un (“of those”)
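
Both lists can be stored in one lookup table. The sketch below is in Python and is not the vocabulary of the Perl library: the names ''VOCABULARY'' and ''lookup'' are hypothetical, and writing the readings of an inherently ambiguous word as //[to|tū]// is an assumption that borrows the bracket notation used elsewhere in this document.

<code python>
# Word -> list of known readings; one reading means "unambiguous".
VOCABULARY = {
    "\u06C1\u06D2": ["he"],                # "is"
    "\u0645\u06CC\u06BA": ["meñ"],         # "in"
    "\u0627\u06CC\u06A9": ["ek"],          # "one"
    "\u0627\u0648\u0631": ["or"],          # "and"
    "\u062A\u0648": ["to", "tū"],          # "so" | "thou"
    "\u0627\u0633": ["is", "us"],          # "of this" | "of that"
    "\u0627\u0646": ["in", "un"],          # "of these" | "of those"
}

def lookup(word):
    """Return the reading of a known word, or None to fall back to the rules."""
    readings = VOCABULARY.get(word)
    if readings is None:
        return None
    if len(readings) == 1:
        return readings[0]
    # Inherently ambiguous: keep all readings for the human editor.
    return "[" + "|".join(readings) + "]"

print(lookup("\u0627\u06CC\u06A9"))  # ek
</code>

A transliterator would presumably consult such a table first and apply the character-based rules only to words that ''lookup'' does not know.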

===== The Transliteration Script =====

You need two files. All of the transliteration knowledge is encoded in the library ''urdund.pm''. The Perl script ''translit_urdund.pl'' merely reads the standard input, passes it through the library and sends the result to the standard output. It is called like this:

<code bash>perl translit_urdund.pl < urdu.txt > latin.txt</code>

If you are on the ÚFAL network, you will find the script in ''~zeman/projekty/transliterace''. It should be able to find the library by itself; the library is in ''~zeman/lib/translit'' (you will find programs and libraries for other writing systems in these two folders as well).

I am also attaching the current snapshot of the two folders to this wiki {{:user:zeman:translit.zip|here}}. Note however that it will not be updated regularly.

This is an example of an Urdu sentence and the romanized output of the script:

  * آپ کو پچھلے 182 دنوں میں اپنی بیماری یا معزوری کے سبب مندرجہ ذیل میں سے کوئی ایک ملتا رہا ہے ؟
  * āp ko pəčhəle 182 dənoñ meñ əpənī b[yīe]mārī yā məˀəz[wūo]rī ke səbəb mənədərəjəh ż[yīe]l meñ se koī ek mələtā rəhā he ?

Afterwards, a speaker of Urdu is supposed to edit the transliteration and disambiguate all remaining cases:

  * āp ko p**ə**čh**ə**le 182 d**ə**noñ meñ **ə**p**ə**nī b**[yīe]**mārī yā m**ə**ˀ**ə**z**[wūo]**rī ke s**ə**b**ə**b m**ə**n**ə**d**ə**r**ə**j**ə**h ż**[yīe]**l meñ se koī ek m**ə**l**ə**tā r**ə**hā he ?
    * One of the alternatives in brackets has to be selected. Sometimes the brackets do not list all possibilities, but the missing ones are easy to guess. For instance, [yīe] should in fact be [əyə|ī|e].
    * The schwa //ə// is shorthand for [aiu] or an empty string (no vowel here).
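
To get a feeling for how much work is left to the editor, the readings encoded in one output word can be enumerated. The following Python sketch is a hypothetical helper, not part of the distributed script. It assumes that a bracket without ''|'' lists single-character alternatives, that a bracket with ''|'' lists whole strings, and that a bare //ə// expands to //a//, //i//, //u// or nothing.

<code python>
import itertools
import re
import unicodedata

SCHWA_READINGS = ["a", "i", "u", ""]
AMBIGUITY = re.compile(r"\[([^\]]*)\]|ə")

def candidates(text):
    """List every reading encoded by brackets and schwas in one output word."""
    text = unicodedata.normalize("NFC", text)   # keep ī, ū as single characters
    parts, pos = [], 0
    for match in AMBIGUITY.finditer(text):
        parts.append([text[pos:match.start()]])         # unambiguous stretch
        if match.group(1) is None:                       # a bare schwa
            parts.append(SCHWA_READINGS)
        elif "|" in match.group(1):                      # [əyə|ī|e] style
            parts.append(match.group(1).split("|"))
        else:                                            # [yīe] style
            parts.append(list(match.group(1)))
        pos = match.end()
    parts.append([text[pos:]])
    return ["".join(choice) for choice in itertools.product(*parts)]

print(len(candidates("məˀəz[wūo]rī")))  # 48
</code>

Even a single word with two schwas and one bracket yields 48 candidate spellings, which illustrates why a vocabulary of known words saves so much manual work.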