
Unix command files, UTF-8, and the byte order mark
"A little knowledge is a dangerous thing" as they say. A long story for a problem people may rarely if ever encounter, but here goes:

I love TextWrangler for editing all kinds of text files. I set it to save in UTF-8 (with the initial byte order mark, or BOM) by default. I discovered that the BOM makes Safari read HTML as Unicode automatically, without the need for a charset declaration or messy entity codes for special characters. So now I can just type HTML freely in any language and script I want.

Now over to Terminal: On my old Mac, I had a few default aliases set up for tcsh. I learned that now in Leopard the default shell is bash, which I am happy to note supports Unicode in pathnames seamlessly, but which uses a very different structure for keeping default aliases. I found my old ~/Library » init » tcsh » aliases.mine file and did my research: I copied the file, saved it as ~/.bash_alias, and created ~/.bash_profile to source it.
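For anyone setting up the same arrangement, it might look roughly like the sketch below. The file name ~/.bash_alias comes from the hint; the helper name load_aliases is made up for illustration, and "." is used because it is the portable spelling of bash's source.

```shell
# Hypothetical helper (not from the hint): read an alias file only if it exists.
load_aliases() {
    # "." does the same job as bash's "source" builtin.
    if [ -f "$1" ]; then
        . "$1"
    fi
}

# In ~/.bash_profile one would then call it on the file named in the hint:
load_aliases "$HOME/.bash_alias"
```

The existence check only keeps a missing alias file from producing an error at login; it does nothing about the contents of the file.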

But nothing would work. I got the strangest errors, like -bash: source: command not found. Say what?! The command is right there in /usr/bin/ where it belongs! I dug for answers on the net for hours, and kept trying things. Eventually I noticed that when I executed ~/.bash_alias myself on the command line, all but the first of my aliases loaded. When I changed the file to start with a blank line, all aliases loaded, with one error about an empty command. Aha! So the problem turned out to be the file format: the BOM, sitting invisibly at the very start of the file, turned the first word of the first line into nonsense. So I resaved both of my dot-files in "UTF-8, no BOM" mode, and all is well.
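If resaving from the editor is not convenient, the same check and fix can be sketched on the command line. The helper names has_bom and strip_bom below are made up for illustration; they rely only on head, tail, od and tr, and assume the UTF-8 BOM is the three bytes EF BB BF.

```shell
# Hypothetical helpers for illustration; the names are not from any tool.

# Succeeds (exit status 0) if the file starts with the UTF-8 BOM bytes EF BB BF.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \t\n')" = "efbbbf" ]
}

# Writes the file to standard output without a leading BOM, if there is one.
strip_bom() {
    if has_bom "$1"; then
        tail -c +4 "$1"    # start output at byte 4, skipping the 3 BOM bytes
    else
        cat "$1"
    fi
}
```

A cautious way to use it would be something like strip_bom ~/.bash_alias > ~/.bash_alias.clean, then look over the result before moving it into place.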

Moral of the story: Though we know "There ain't no such thing as plain text," Unix requires command files to be as close to it as possible.


It applies to many of the UNIX commands.
Authored by: yuji on Apr 06, '09 08:03:31AM

GCC can handle UTF-8 encoded source files, but it doesn't like the BOM either.



It applies to many of the UNIX commands.
Authored by: egreg on Apr 06, '09 08:23:44AM

Aquamacs Emacs doesn't like it either



Unix command files, UTF-8, and the byte order mark
Authored by: Nem on Apr 06, '09 09:18:07AM

Uhm... 'source' is a shell builtin and has nothing to do with the bash executable being in '/usr/bin' - just an FYI. ;-)
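This is easy to check. A small sketch, asking bash directly so the answer does not depend on which shell you happen to be typing in:

```shell
# Ask bash itself how it classifies the name "source".
# Using "bash -c" keeps the answer the same whatever shell you run this from.
bash -c 'type -t source'   # prints: builtin
```

Since source is built into the shell, there is no file on disk to look for; a "command not found" for it means the shell did not see the word source at all.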

---
Nem W. Schlecht
http://geekmuse.net/



Unix command files, UTF-8, and the byte order mark
Authored by: MJCube on Apr 07, '09 08:10:44AM
A good illustration of my level of knowledge/ignorance (thus the introductory quotation). My point was that the error was so basic as to seem impossible, and that was my clue. "U+FEFFsource" is of course not a command.
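The effect can be reproduced in a throwaway file, without touching any real dot-files. The function name bom_demo is made up for this sketch; it writes a one-line script that starts with the BOM bytes and runs it with bash.

```shell
# Hypothetical demonstration: run a one-line script whose first line begins with a BOM.
bom_demo() {
    demo=$(mktemp)
    # \357\273\277 are the octal escapes for the BOM bytes EF BB BF.
    printf '\357\273\277true\n' > "$demo"
    bash "$demo"          # bash complains on standard error that the command is not found
    status=$?
    rm -f "$demo"
    return "$status"
}
```

Calling bom_demo and then echo $? should show bash's "command not found" complaint followed by 127, the status the shell uses for a command it cannot find, even though true on its own is a perfectly good command.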

Unix command files, UTF-8, and the byte order mark
Authored by: Swift on Apr 06, '09 09:34:34AM

Can anybody help me with this problem?

I'm doing Quicktime Text on a Windows machine. Among the subtitles we're doing for this project are Asian languages. To get the code out of our subtitle editor, we check the "Unicode" box. Don't know what flavor of Unicode that is. But then we open up this text file in Word (Windows also.) We run a macro to get the headers in, and then I have to save the doc as text for Quicktime for Windows. I've tried every encoding possible. The one that works? Japanese (Mac), Korean (Mac), and so on. If I open that file, I get the correct font, all the characters, beautifully in Quicktime for Windows.

But it won't work on the Mac! In Quicktime for the Mac, even if I change the font in the header to a Mac Asian font, I get garbage characters.

I've tried different encodings in BBEdit, but all I've succeeded in doing is munging the text irretrievably.

I wish I had a better understanding of all this. It is nearly completely opaque to me. I just try one thing and then another.

---
------------------------
Screenplays for Royalty
since 1749



Unix command files, UTF-8, and the byte order mark
Authored by: boxcarl on Apr 06, '09 11:41:12AM
Here's a helpful summary of the issues with character encodings: http://www.joelonsoftware.com/articles/Unicode.html

Unix command files, UTF-8, and the byte order mark
Authored by: leamanc on Apr 06, '09 11:44:10AM

I didn't even get but a few sentences into this hint before I realized what the outcome would be. I've found, as others have here, that UTF-8 with BOM is just asking for trouble. I think BareBones should remove that option from BBEdit and TextWrangler! The no-BOM option seems to work great, and is well read across many apps that accept UTF-8.

The only downside from the hinter's perspective is that he has to declare his text encoding in HTML documents, but really you should do that anyway.
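As a rough sketch of what declaring the encoding looks like, the hypothetical helper below (write_utf8_page is a made-up name) writes a minimal page that states its own charset, so the file can be saved as UTF-8 without a BOM. The page content is only sample text.

```shell
# Hypothetical helper: write a minimal HTML page that declares UTF-8 itself,
# so that no BOM is needed for the browser to pick the right encoding.
write_utf8_page() {
    cat > "$1" <<'EOF'
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Test page</title>
</head>
<body>
<p>Grüße, 你好, Привет</p>
</body>
</html>
EOF
}
```

This assumes the script itself is saved as UTF-8 without a BOM, which is the point of the exercise.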



Unix command files, UTF-8, and the byte order mark
Authored by: Anonymous on Apr 07, '09 04:46:02PM
Err, no. Don't eliminate useful features. Barebones just needs to extend that feature, so that if you save an HTML file, it uses the fancy encoding. But if you save a shell script, it saves it in a format appropriate to that filetype.

Unix command files, UTF-8, and the byte order mark
Authored by: palahala on Apr 08, '09 12:59:09AM

Or if you open an HTML file with no BOM, it could scan for the <meta> header specifying the encoding. Likewise, if it opens XML, it could interpret the encoding attribute.

In fact, I wonder which is considered to be authoritative: the BOM or the encoding as specified in the file.



Unix command files, UTF-8, and the byte order mark
Authored by: palahala on Apr 06, '09 02:02:25PM

Just for whoever finds this hint because of the "bash [..] command not found": the same may happen when using line endings that are not recognized. This happens to me every now and then when using Windows editors to change cygwin files... :-)
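A quick way to check for this and clean it up, sketched with made-up helper names (has_cr and strip_cr). It assumes that any carriage return in the file is an unwanted leftover from Windows-style line endings.

```shell
# Hypothetical helpers for illustration.

# Succeeds if the file contains at least one carriage return (as in CRLF line endings).
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Writes the file to standard output with all carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1"
}
```

As with any such filter, write the output to a new file and check it before replacing the original.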



Unix command files, UTF-8, and the byte order mark
Authored by: Fairly on Apr 06, '09 02:22:53PM

I'm surprised you're surprised.



Unix command files, UTF-8, and the byte order mark
Authored by: palahala on Apr 06, '09 03:27:10PM

Well, I guess that's why MJCube is sharing this with us. Once it has happened to you one time, you'll surely remember it the next time you somehow mess up. But I recall it took me some time to figure out the line-endings problem the first time it happened to me, especially since the error may occur some time after you've made the changes, and the "command not found" message does not give a lot of information.



Unix command files, UTF-8, and the byte order mark
Authored by: palahala on Apr 06, '09 03:30:27PM
For troubleshooting: remember the file command to find information on odd line terminators and/or encodings.
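A small sketch of how that might be used, with od added to show the raw bytes. The helper name inspect_file is made up, and the exact wording printed by file differs between versions, so no sample output is given here.

```shell
# Hypothetical helper: show what "file" thinks of a file, then its first raw bytes.
inspect_file() {
    # "file" gives its best guess at the encoding and the line terminators.
    file "$1"
    # Hex dump of the first 16 bytes: a UTF-8 BOM shows up as ef bb bf,
    # and a Windows line ending as 0d 0a.
    head -c 16 "$1" | od -tx1
}
```

For example, inspect_file ~/.bash_alias would be a reasonable first step when a dot-file misbehaves for no visible reason.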

Unix command files, UTF-8, and the byte order mark
Authored by: Keltia on Apr 06, '09 03:57:28PM

You should note that a BOM is not needed for UTF-8 files: UTF-8 is a byte-oriented encoding, so there is no byte order to mark, whereas UTF-16 and UTF-32 each come in little-endian and big-endian variants (LE, BE). As you found out, the BOM can also be an obstacle when working with the file...
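To make the difference concrete, here is a sketch that names the encoding suggested by a file's leading bytes. The helper name bom_encoding is made up; the byte patterns are the standard BOM signatures for each encoding.

```shell
# Hypothetical helper: name the encoding suggested by a file's leading BOM, if any.
bom_encoding() {
    # First four bytes as a lowercase hex string, e.g. "efbbbf61".
    sig=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \t\n')
    case "$sig" in
        efbbbf*)  echo "UTF-8" ;;
        0000feff) echo "UTF-32BE" ;;
        fffe0000) echo "UTF-32LE" ;;   # tested before UTF-16LE, which shares its first two bytes
        feff*)    echo "UTF-16BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        *)        echo "no BOM" ;;
    esac
}
```

This is only a guess from the first bytes: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE, and a file without a BOM may still be in any of these encodings.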



Unix command files, UTF-8, and the byte order mark
Authored by: palahala on Apr 06, '09 04:56:12PM
"BOM is not needed for UTF-8 files"

True, if you know it is UTF-8. If you (or the processor of your file) do not know, and have no other means of specifying the encoding, then the BOM might be useful for UTF-8 as well. But I have not seen such cases yet. For example: the structure of both HTML and XML allows for specifying the encoding at the top of the actual content.

In fact, I guess that if Bash had supported UTF-8 command files, then the BOM would actually have been needed.



Unix command files, UTF-8, and the byte order mark
Authored by: siteisbroken on Apr 06, '09 06:29:56PM

If the poster is writing HTML to post on the web, rather than strictly for his own use, he should note that just because his browser recognizes the BOM as indicating UTF-8, that doesn't mean others will. He should stick to the standards.



HTML for my own use
Authored by: MJCube on Apr 07, '09 08:04:48AM

Yes, of course. I had written a sentence explaining that these web pages are mostly for my own reference, but I omitted it as being too much info for the hint. One of these days I'll look into browser support for UTF-8 without the declaration, but for these purposes I don't care.


