Transliterate Arabic, Greek, and others using UTF-8

Feb 21, '07 07:30:02AM

Contributed by: stsmith

I wrote this Perl script to transliterate ASCII into UTF-8 for colloquial and classical Arabic and for Greek; Cyrillic, Hebrew, and other scripts may follow at some point in the future. Input: ASCII. Output: the UTF-8 text and its octal representation.
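To make "octal representation of UTF-8" concrete, here is a minimal sketch that produces the same kind of backslash-octal string from any UTF-8 text using only od and awk. It is not taken from the transliterate script, and the function name utf8_to_octal is a hypothetical one chosen for illustration.

```shell
#!/bin/sh
# utf8_to_octal (hypothetical helper, not part of the transliterate script):
# print each byte of the argument as a three-digit \ooo octal escape.
utf8_to_octal() {
    printf '%s' "$1" |
        od -An -v -t o1 |                       # one octal number per byte
        awk '{ for (i = 1; i <= NF; i++) printf "\\%s", $i }
             END { print "" }'                  # finish with a newline
}

# Example: the bytes of "A b" (letter, space, letter).
utf8_to_octal 'A b'
```

The resulting string can be pasted into a printf format, as in the mp4tags example in this hint.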

I've used this to enter foreign-language titles into my iTunes world music collection. Once you've generated the octal UTF-8, you can tag a file by hand:

mp4tags -s "`printf '9rabiyuN \047anaa (\330\271\330\261\331\216\330\250\331\220\331\212\331\214\40\330\243\331\216\331\206\331\216\330\247)'`" -a "Yuri Mrakady" "'ajmal mnw9aat al-jaaz 01.m4a"

The iTunes song title of the AAC file 'ajmal mnw9aat al-jaaz 01.m4a will appear as 9rabiyuN 'anaa (عرَبِيٌ أَنَا). Alternatively, enter these codes into a cdrdao TOC file and run my cd2codec script as cd2codec --utf8 to do this automatically. I wrote this script for my own needs, but it's easy to modify to support other formats.

To use it, save the text to a file named transliterate, run chmod a+x transliterate, get or build the required tools, and type transliterate --help for usage instructions.
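Before writing a tag, it can help to preview what an octal string decodes to in the terminal. The following is a small sketch of one way to do that with printf; the function name octal_to_utf8 is hypothetical and is not part of the transliterate script or of mp4tags.

```shell
#!/bin/sh
# octal_to_utf8 (hypothetical helper): decode \ooo escapes back into bytes.
octal_to_utf8() {
    # Double any literal % so printf treats the text only as escapes,
    # never as a conversion specification.
    fmt=$(printf '%s' "$1" | sed 's/%/%%/g')
    printf "$fmt"
}

# Example: decode the first two letters of the Arabic title used above.
octal_to_utf8 '\330\271\330\261'
echo
```

A terminal set to UTF-8 should display the decoded letters; piping the output through od -c shows the raw bytes instead.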

The Arabic transliteration is simply a colloquial Arabic front end to Otakar Smrz's excellent ArabTeX Perl script, so you'll need to download and install Encode and Encode::Arabic from CPAN. I've also implemented a simple Greek transliteration engine (no accents or breathings) that runs without any additional modules. I've left placeholders for Cyrillic and Hebrew extensions, but these are not implemented.
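A quick way to see whether the two Perl modules are already available is to ask perl to load them. This is only a sketch: have_perl_module is a hypothetical helper name, and the suggested cpan command assumes the standard cpan client that ships with Perl.

```shell
#!/bin/sh
# have_perl_module (hypothetical helper): succeed if perl can load the module.
have_perl_module() {
    perl -M"$1" -e 1 2>/dev/null
}

# Report the status of the modules the Arabic front end depends on.
for module in Encode Encode::Arabic; do
    if have_perl_module "$module"; then
        echo "$module: installed"
    else
        echo "$module: missing (try: cpan $module)"
    fi
done
```

The Greek engine needs neither module, so a "missing" report only affects Arabic transliteration.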

[robg adds: I haven't tested this one.]

Mac OS X Hints
http://hints.macworld.com/article.php?story=20070214214951585