Transliteration of Urdu to Latin Script

Transliteration versus Transcription

Converting text from one script to another can be guided by two main (often conflicting) principles:

Preserving orthography
Preserving pronunciation

To some extent, transcription is always guided by pronunciation (otherwise it would be encryption). However, if preserving pronunciation is the top priority, the mapping may become irreversible: silent letters are omitted and different strings that are pronounced the same way are mapped to the same target letter. A pronunciation-oriented mapping also depends on the target language (in addition to the target script), because different languages pronounce the same characters of the same script differently.

In contrast, transliteration aims first and foremost at preserving the original orthography. Ideally, transliteration is a letter-for-letter, one-to-one mapping. Diacritical marks are used if the target script does not have enough characters. Occasionally, a sequence of target letters (such as 'sh') can serve as a single character, provided one can be reasonably sure that the individual letters never occur next to each other otherwise.

Transliteration can even disambiguate cases that are ambiguous in the source text (e.g. recover short vowels missing from a text in Arabic script). As long as information is only added and never removed, this is acceptable: the original spelling can still be reconstructed.
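The difference in reversibility can be illustrated with a minimal Python sketch. It uses only three Urdu letters that are all pronounced s; the two dictionaries and the helper names `is_reversible` and `invert` are illustrative assumptions, not part of any existing tool.

```python
# Toy fragment: three Urdu letters that are all pronounced [s].
TRANSCRIPTION = {"س": "s", "ث": "s", "ص": "s"}    # pronunciation first
TRANSLITERATION = {"س": "s", "ث": "ṡ", "ص": "ş"}  # orthography first


def is_reversible(mapping):
    """A mapping can be inverted only if no two sources share a target."""
    return len(set(mapping.values())) == len(mapping)


def invert(mapping):
    """Build the target-to-source table; refuse if information was lost."""
    if not is_reversible(mapping):
        raise ValueError("mapping merges distinct letters and cannot be inverted")
    return {target: source for source, target in mapping.items()}


print(is_reversible(TRANSCRIPTION))    # False
print(is_reversible(TRANSLITERATION))  # True
```

Calling `invert(TRANSCRIPTION)` raises an error, which is exactly the loss of information that a transliteration has to avoid.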

In this document I describe my approach to Romanization (i.e. transliteration into a Latin-based alphabet) of Urdu text. I use this transliteration scheme in papers on Urdu. My goal is to reflect the original pronunciation as closely as possible without violating the requirement that the original spelling be restorable. To help the reader with the pronunciation, I also want to insert missing short vowels and disambiguate a few other cases. I provide a Perl script that implements the deterministic part of the transliteration and marks the positions where a human decision is needed.

Urdu uses a few characters that are not part of the original Arabic script. Moreover, some of the original Arabic letters would call for a different Latin representation if the mapping were motivated by Arabic rather than Urdu pronunciation. On the target side, I had no particular language in mind when modeling the pronunciation. See below for details.

Consonants

Most of the consonants do not pose any serious problem. I decided to represent the retroflex consonants (ٹ, ڈ, ڑ) by a dot below their dental or other counterparts, as is usual across the Indo-Aryan languages. A dot above a letter marks two Arabic letters (ث, ذ) whose Urdu pronunciation is identical to that of other letters, and so keeps their transliteration distinct from the transliteration of those other letters. Similarly, a cedilla below a letter distinguishes five other letters that occur in words of Arabic origin (ح, ص, ض, ط, ظ).
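When these diacritics are handled in software, it helps to know how the marked letters are encoded. The following Python sketch builds them from a base letter plus a combining mark and normalizes to NFC with the standard `unicodedata` module. The helper `with_mark` and the variable names are illustrative assumptions. One practical detail it exposes: Unicode has no precomposed z with cedilla, so that letter stays a sequence of two code points even after normalization.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
CEDILLA = "\u0327"    # COMBINING CEDILLA


def with_mark(base, mark):
    """Attach a combining mark and compose to NFC where Unicode allows it."""
    return unicodedata.normalize("NFC", base + mark)


retroflex = [with_mark(c, DOT_BELOW) for c in "tdr"]   # for ٹ ڈ ڑ
dotted = [with_mark(c, DOT_ABOVE) for c in "sz"]       # for ث ذ
cedilla = [with_mark(c, CEDILLA) for c in "hsztd"]     # for ح ص ض ط ظ

for letter in retroflex + dotted + cedilla:
    names = [unicodedata.name(ch) for ch in letter]
    print(letter, len(letter), names)
```

Code that compares or indexes transliterated strings should therefore normalize both sides to the same form and should not assume one code point per letter.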

Some other notes: j is pronounced as in English, not as in Czech or German. č and š are used in Baltic and Slavic languages (among others) to represent the sounds that are usually written “ch” and “sh”, respectively, in English. Of similar origin is the character ž; the corresponding sound is sometimes represented as “zh” in English and corresponds to the French pronunciation of j. x represents (in accord with phonetic tradition) the same sound as Czech/German/Scottish “ch”. English-oriented transcriptions of Arabic often render this sound as “kh”, a solution I want to avoid because it would conflict with the aspirated kh of Urdu. ğ is taken from Turkish and stands for the sound that is often transcribed “gh” from Arabic (which cannot be used here either, again because of the aspirated gh).

I do not attempt to map the special Semitic guttural consonant ayin to a Latin letter based on the pronunciation of some European language, as this sound is unfamiliar to most Europeans. In transcription of Arabic, it is sometimes represented by a superscript c. I use the symbol ˀ (MODIFIER LETTER GLOTTAL STOP).

The letter ں (NOON GHUNNA) occurs only at the end of the word and marks nasalization of the preceding vowel rather than a real consonant.

There are two h letters: ہ (HEH GOAL) and ھ (HEH DOACHASHMEE). It is not necessary to distinguish them by diacritics as they occur in different positions. The normal consonant h is written using ہ (HEH GOAL), which can also appear at the end of the word to mark an (otherwise invisible) word-final short vowel a (transcribed ah). In contrast, ھ (HEH DOACHASHMEE) is used exclusively after other consonants (such as k, g, č, j, t, d, b, p) to form their aspirated counterparts. Thus, بھ is bh, پھ is ph etc.
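The positional rule for ھ can be sketched in Python as follows. The set of aspirable consonants contains only the eight letters named above, which may not be the complete set, and the function names `check_heh_positions` and `aspirate` are hypothetical.

```python
HEH_GOAL = "\u06C1"         # ہ, the ordinary consonant h
HEH_DOACHASHMEE = "\u06BE"  # ھ, the aspiration marker

# Only the consonants named in the text; the full set may be larger.
ASPIRABLE = {
    "\u06A9": "k", "\u06AF": "g", "\u0686": "č", "\u062C": "j",
    "\u062A": "t", "\u062F": "d", "\u0628": "b", "\u067E": "p",
}


def check_heh_positions(word):
    """Return indexes where HEH DOACHASHMEE does not follow a listed consonant."""
    return [
        i for i, ch in enumerate(word)
        if ch == HEH_DOACHASHMEE and (i == 0 or word[i - 1] not in ASPIRABLE)
    ]


def aspirate(consonant):
    """Transliterate consonant + HEH DOACHASHMEE as a digraph such as 'bh'."""
    return ASPIRABLE[consonant] + "h"


print(aspirate("\u0628"), aspirate("\u067E"))   # bh ph
print(check_heh_positions("\u0628\u06BE\u06CC"))  # [] (ھ follows ب)
```

A non-empty result from `check_heh_positions` flags a spelling that falls outside the rule and would need a human decision.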

Unicode	Character	Pronunciation	Transliteration
0628	ب	b	b
067E	پ	p	p
062A	ت	ṱ	t
0679	ٹ	ʈ	ṭ
062B	ث	s	ṡ
062C	ج	ʤ	j
0686	چ	ʧ	č
062D	ح	h	ḩ
062E	خ	x	x
062F	د	d	d
0688	ڈ	ɖ	ḍ
0630	ذ	z	ż
0631	ر	r	r
0691	ڑ	ɽ	ṛ
0632	ز	z	z
0698	ژ	ʒ	ž
0633	س	s	s
0634	ش	ʃ	š
0635	ص	s	ş
0636	ض	z	z̧
0637	ط	ṱ	ţ
0638	ظ	z	ḑ
0639	ع	Ɂ	ˀ
063A	غ	ɣ	ğ
0641	ف	f	f
0642	ق	q	q
06A9	ک	k	k
06AF	گ	g	g
0644	ل	l	l
0645	م	m	m
0646	ن	n	n
06BA	ں	n	ñ
0648	و	v	w
06C1	ہ	h	h
06BE	ھ	h	h
06CC	ی	j	y
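The table can be written down as a lookup dictionary, as in the following Python sketch. This is not the Perl script mentioned earlier. The names `CONSONANTS`, `transliterate` and `collisions` are illustrative, and the bracket notation for unmapped characters (vowel letters, diacritics, punctuation) is an assumption made here to mark places that need further treatment.

```python
# Consonant table from above: Urdu code point -> Latin transliteration.
CONSONANTS = {
    "\u0628": "b", "\u067E": "p", "\u062A": "t", "\u0679": "ṭ",
    "\u062B": "ṡ", "\u062C": "j", "\u0686": "č", "\u062D": "ḩ",
    "\u062E": "x", "\u062F": "d", "\u0688": "ḍ", "\u0630": "ż",
    "\u0631": "r", "\u0691": "ṛ", "\u0632": "z", "\u0698": "ž",
    "\u0633": "s", "\u0634": "š", "\u0635": "ş", "\u0636": "z\u0327",
    "\u0637": "ţ", "\u0638": "ḑ", "\u0639": "ˀ", "\u063A": "ğ",
    "\u0641": "f", "\u0642": "q", "\u06A9": "k", "\u06AF": "g",
    "\u0644": "l", "\u0645": "m", "\u0646": "n", "\u06BA": "ñ",
    "\u0648": "w", "\u06C1": "h", "\u06BE": "h", "\u06CC": "y",
}


def transliterate(text):
    """Map table consonants; bracket everything else for later resolution."""
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch.isspace():
            out.append(ch)
        else:
            out.append("[" + ch + "]")
    return "".join(out)


def collisions(table):
    """Group source letters that share a Latin target."""
    seen = {}
    for source, target in table.items():
        seen.setdefault(target, []).append(source)
    return {t: s for t, s in seen.items() if len(s) > 1}


print(transliterate("\u067E\u06BE\u0644"))  # phl (short vowel still missing)
print(collisions(CONSONANTS))               # only 'h' is shared, by ہ and ھ
```

The only shared target is h, which matches the statement that the two h letters are distinguished by position rather than by diacritics.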


Institute of Formal and Applied Linguistics Wiki
