Talk:Byte-order mark

Detailed discussion of the BOM does not add to an understanding of endianness, and the BOM can be treated as a separate concept, so I've moved it back to its own article.

It really was messy in the endianness article, especially as BOM has its own category links, external links, and the like.

--Pengo 00:52, 27 Oct 2004 (UTC)

edits by Cherlin

Some of these edits seem rather dodgy to me.

used --> misused: You claim that using the BOM to mark text as being in a UTF format is misuse, yet http://www.unicode.org/unicode/uni2book/ch13.pdf ("Specials" section, "Byte Order Mark (BOM)" heading) states that the byte sequence may be used to indicate both byte order and character set.

"contrary to its definition": You claim that use of the BOM on UTF-8 is contrary to its definition, yet http://www.unicode.org/unicode/uni2book/ch13.pdf ("Specials" section, "Byte Order Mark (BOM)" heading) describes that use.

FF FE 00 00 --> 00 00 FF FE (already reverted): Encoding the code point FEFF in little-endian UTF-32 gives FF FE 00 00, as in the original, not 00 00 FF FE as your edit states. Furthermore, the table that was there before your edit corresponds exactly to the information given in http://www.unicode.org/unicode/uni2book/ch13.pdf ("Specials" section, "Byte Order Mark (BOM)" heading).
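
For anyone who wants to check the byte sequences rather than take either side's word for it, here is a minimal Python sketch. The helper name `bom_bytes` is made up for illustration; the only thing it relies on is the standard codecs, which do not prepend a BOM of their own when the byte order is named explicitly.

```python
# Encode the code point U+FEFF in each encoding form and show the raw bytes.
BOM = "\ufeff"

def bom_bytes(encoding: str) -> str:
    """Return the bytes of U+FEFF in the given encoding as spaced hex."""
    return BOM.encode(encoding).hex(" ").upper()

for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{name:10} {bom_bytes(name)}")
```

The `utf-32-le` row comes out as FF FE 00 00, which matches the table as it stood before the edit.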

Unless I see good justification for these edits, I will be reverting the two that I have not already reverted. Plugwash 16:13, 24 Dec 2004 (UTC)

It is now two days since you made the edits and you have not responded. Furthermore, I find you to be a very new contributor who has got into trouble elsewhere and made few other edits. I am therefore reverting the rest of the edits you made to this page. Plugwash 02:23, 27 Dec 2004 (UTC)

Byte Order Mark in UTF-8

Does anyone know why Windows software likes to put a BOM at the front of UTF-8 files? Isn't it true that the order is unambiguous, and thus it does nothing for any endianness problems? Is it simply a way of flagging a file as containing UTF-8 instead of ASCII? -R. S. Shaw 23:38, 5 Jun 2005 (UTC)

Yeah, it's simply used to mark the file as being UTF-8 rather than the system's legacy encoding. Plugwash 00:25, 6 Jun 2005 (UTC)
Whenever you save a file as UTF-8 in Windows Notepad, the UTF-8 BOM is prepended to it. You can use a different editor (a non–Unicode-aware editor or a hex editor) to remove the BOM. If the file contains one or more legal UTF-8 sequences, and only legal UTF-8 sequences, then removing the BOM will have no effect on the file—it’ll still be UTF-8. If the file contains only ASCII and you remove the BOM, Notepad will flag it as ANSI (8-bit codepage mode). If the file contains a BOM and you insert an illegal sequence into it (like a single FF byte in the middle of the text, or C2 E4, etc), then the file will stay intact, but if it hasn’t got a BOM and you insert such a sequence, it’ll revert to ANSI, and legal UTF-8 sequences too will be viewed in Notepad according to the current Windows ANSI codepage semantics (for example CF 80 as Ï€ instead of π if you’re on a US WinXP). --Shlomital 22:33, 2005 Jun 11 (UTC)
On Czech WinXP it works the same. Notepad marks the file with a BOM for easier recognition of the encoding, but does not require it. It is an unexpectedly tolerant approach.
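
The two points in this thread (removing the BOM leaves the UTF-8 payload untouched, and CF 80 reads differently under an ANSI codepage) can be reproduced outside Notepad. This is only a sketch: `strip_utf8_bom` is a hypothetical helper, and `cp1252` is assumed to be the "ANSI" codepage of a US Windows install. It does not model Notepad's own detection logic.

```python
import codecs

data = "π".encode("utf-8")          # the two bytes CF 80
with_bom = codecs.BOM_UTF8 + data   # EF BB BF CF 80

def strip_utf8_bom(raw: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if present; otherwise return the bytes unchanged."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

# The payload is identical with or without the BOM.
assert strip_utf8_bom(with_bom) == data

# The standard "utf-8-sig" codec skips a leading BOM while decoding.
assert with_bom.decode("utf-8-sig") == "π"

# The same two bytes under the two interpretations.
as_utf8 = data.decode("utf-8")    # one character
as_ansi = data.decode("cp1252")   # two characters
print(as_utf8, as_ansi)
```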

Why is this a problem?

as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages

All those tools are free software or have free software equivalents and it must be relatively easy to make them ignore the mark. Shinobu (talk) 10:18, 20 November 2007 (UTC)

True, though I could see that doing more harm than good. Imagine you wrote your script on your desktop and it ran fine, but when you put it on the production server an invisible character stopped it from running. Plugwash (talk) 10:22, 20 November 2007 (UTC)
That assumes that the "free software" is of varied quality, not following a standard. That may be true. However the context for the quote was biased to support this situation. Tedickey (talk) 11:18, 20 November 2007 (UTC)
"All those tools are free software or have free software equivalents" — no, not proprietary Unixes, and yes they are still around. -- intgr [talk] 11:27, 20 November 2007 (UTC)
The de-facto standard is for tools (including such core OS components as the binary loader) to recognise a script by the first two bytes of a file being "#!". If some versions of some tools start ignoring a preceding BOM but others don't (free software DOES NOT mean you can force your changes on your distro maker or server host), then IMO there is likely to be far more confusion than if scripts with a BOM universally fail (which afaict is the status quo). Plugwash (talk) 12:57, 20 November 2007 (UTC)
uh - no. No one's presented any evidence of scripts which would be ambiguous if someone provided a loader which handles BOM. Tedickey (talk) 13:10, 20 November 2007 (UTC)
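
To make the "first two bytes" point in this thread concrete, here is a small Python sketch. `looks_like_script` is a hypothetical stand-in for the check a loader performs, not the code of any real kernel or shell.

```python
import codecs

def looks_like_script(head: bytes) -> bool:
    """Mimic the de-facto check: a script starts with the two bytes '#!'."""
    return head[:2] == b"#!"

script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script   # EF BB BF now precedes '#!'

print(looks_like_script(script))            # True
print(looks_like_script(script_with_bom))   # False
```

With the BOM in front, the file begins with EF BB rather than 23 21, so a check of this kind no longer treats it as a script.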

"All those tools are free software or have free software equivalents and it must be relatively easy to make them ignore the mark." – In addition to what User:Plugwash writes above, I do not believe you can convince even a large minority of Unix users that placing a piece of crippled, limited character-encoding metadata into general files is a good idea. Although I only read about it just now, BOM for UTF-8 strikes me as an unusually stupid idea. The section on BOM in RFC 3629 illustrates some reasons why; it is full of heuristics and language that you rarely see in RFCs ("without a good reason", "only when really necessary", "an attempt at diminishing this uncertainty").

Should I interpret the article as saying that Windows Notepad is the only widespread software which actually creates UTF-8 BOMs? It would make sense; Microsoft do not care about plain text editing – they are more into "one application, one proprietary file format" – and they have historically not cared about the usefulness of Notepad.

JöG (talk) 09:13, 29 March 2008 (UTC)

OK, now I see the article says "Quite a lot of Windows software (including Windows Notepad)". But it would be interesting to know if popular, serious text editors on Windows (emacs, vim, UltraEdit and popular Windows-specific editors) do this by default. JöG (talk) 09:18, 29 March 2008 (UTC)

You named two ports to Windows and one native. That's a rather small and unrepresentative example. There are many Windows editors. Btw, the comment regarding interprocess communication is unnecessary, since it adds no factual information. Take a look at Windows PowerShell, which has to be doing this transparently. Tedickey (talk) 11:01, 29 March 2008 (UTC)

Too technical!

OK, I understand everything in the article, since I'm a unicodopath, but the intro should say:

  • Unicode is a computer encoding of the characters of all languages (in principle),
  • The byte order mark is designed so that a computer that reads it can guess (with reasonable probability) that the text data is Unicode, and
  • Guess what kind of Unicode encoding it is, since there are many. The article already says that; I just wanted to stress that it should.

The intro is a bit too technical for an intro. The current text qualifies as a technical description intended for me and you, not for any outsider. The missing nouns that should be in the intro are: computer, data coding, natural languages. L8R. Said: Rursus 10:15, 25 April 2008 (UTC)
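
The "guess" in the list above might look roughly like the following Python sketch. The function name `guess_encoding` and the table `_BOMS` are hypothetical; the BOM constants come from the standard `codecs` module. It really is only a guess: data without a BOM yields no answer, and nothing stops non-Unicode data from starting with the same bytes.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def guess_encoding(raw: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None

print(guess_encoding("\ufeffhello".encode("utf-16-le")))
print(guess_encoding(b"plain text"))
```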