Talk:File format


Object Format is a new concept being developed that involves storing functions together with the data (which is what an object is), thus offloading data transformations onto those who created the data format. This reduces the complexity needed in a file format and removes the complexity that isn't needed (effectively open-sourcing the implementation of the transformation code with the data). This technology is currently in development in open source projects I'm working on, so I felt it was worth mentioning.

The methods in the object can be protected using a public key encryption of the data CRC and any necessary error correction, a process I haven't dealt with but have faith in as a way to protect the objects from malicious adjustments. Beware file formats, your days are numbered. --Rofthorax 08:51, 24 August 2005 (UTC)

Sounds a lot like a Smalltalk image... Wouter Lievens

The Ogg format can potentially store video and/or audio, but actual implementations are currently rare as of December 2001, and only an audio codec called Ogg Vorbis exists for the format, though developers continue to work on video codecs such as the Tarkin video codec and to integrate other formats such as MNG (lossless and motion-JPEG compression), FLAC (lossless audio compression), and XML (text-based data such as captions and subtitles) into the Ogg framework. IFF is a now defunct format which, like AVI, is a shell, but IFF had no limitations, being able to store sound, image, movie, animation, data or archive. If a programmer wanted to store data in the IFF format, he just had to define the subformat following the general rules. For example, WAV files follow a variation of IFF called RIFF.

Removed, at least temporarily. The main point of including this seemed to be to make the point that some file formats allow storage of more than one kind of information. I have left one or two examples in the article, but I think such an in-depth treatment is excessive, at least considering the current length of the rest of the article. -- Ryguasu
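
For anyone weighing whether to restore part of the removed passage, here is a minimal sketch of what the "shell" idea means in practice for RIFF. It assumes the usual RIFF layout (a 12-byte header followed by chunks with a 4-byte ID, a little-endian 32-bit size, and padding to an even length). The names `riff_chunks` and `make_chunk` and the sample data are hypothetical, made up for illustration.

```python
import struct

def riff_chunks(data):
    """Yield (chunk_id, payload) for the top-level chunks in a RIFF buffer."""
    magic, size, _form = struct.unpack_from("<4sI4s", data, 0)
    if magic != b"RIFF":
        raise ValueError("not a RIFF container")
    pos, end = 12, min(8 + size, len(data))
    while pos + 8 <= end:
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, pos)
        yield chunk_id, data[pos + 8 : pos + 8 + chunk_size]
        # Chunks are padded to an even length.
        pos += 8 + chunk_size + (chunk_size & 1)

def make_chunk(chunk_id, payload):
    """Hypothetical helper that builds one chunk for the sample below."""
    pad = b"\x00" * (len(payload) & 1)
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad

# A made-up, WAV-shaped buffer: the container does not care what the chunks mean.
body = b"WAVE" + make_chunk(b"fmt ", b"\x00" * 16) + make_chunk(b"data", b"\x01\x02\x03")
sample = b"RIFF" + struct.pack("<I", len(body)) + body

for chunk_id, payload in riff_chunks(sample):
    print(chunk_id, len(payload))
```

The reader only understands the container. Interpreting the `fmt ` or `data` payloads is left to whoever defined the subformat, which is the point the passage was making.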

This text:

The most useful part of intellectual property law for protecting ownership of a file format seems to be patent law. Although you cannot patent a file format, some file formats require encoding data with patented algorithms.

is presumably talking about the laws of a particular country, the USA maybe? Maybe there are countries where it's possible to patent a file format, but patenting algorithms as such is not permitted in many places.

Yes, I was thinking about US law, which is not an altogether bad starting point, given the importance of US software developers. But a more international perspective is certainly in order. --Ryguasu


Data format

This article is about data formats in general, not only file formats: it should be renamed. For instance, JPEG is a data format but not a file format (JFIF is a file format using JPEG; TIFF is another one). A data format is required to exchange data; a file is only one means of exchange, and it is not always used. It is quite common for a server to generate data in some format (say MPEG) and send it to a client for immediate use, with no file storage at any point in the process. Marc Mongenet 18:37, 2004 Aug 30 (UTC)

Incorrect HTML magic number, HTML files should start with <!DOCTYPE

An HTML file, for instance, should begin with the ASCII characters <html>

This is not correct. HTML and XHTML documents should start with a DOCTYPE declaration.

See http://www.w3.org/TR/html4/struct/global.html#h-7.1

--Skjæve 09:21, 14 Sep 2004 (UTC)

You are of course correct, and I've changed the article to try and reflect this. I've left the mention of <html> in there, though, because if you were testing magic numbers it would be foolish not to look for this: I should think there are an awful lot of HTML files out there that don't include a DOCTYPE, especially hand-crafted ones. - IMSoP 15:46, 15 Sep 2004 (UTC)

And there are perfectly valid HTML documents that have no html tag (and no head and no body). They are all optional elements. HTML is the worst example of a magic number I can think of. Marc Mongenet 04:35, 2004 Sep 16 (UTC)

Or perhaps the best example of the imperfection of the approach! It is, after all, an attempt to apply rules that started off for internal use by readers of binary files, to the much more complex problem of identifying the large range of files on a modern computer. Perhaps a note should be added that this isn't always easy. Anyway, I only used HTML as an arbitrary example of "a file type most readers will have seen and understand", and included it in all three sections for consistency and to aid comparisons. (Although, I'm not sure they are strictly "optional" in terms of Standards-based validity, only in terms of renderability by most "tag soup" parsers, but I see your point) - IMSoP 13:07, 16 Sep 2004 (UTC)

Both the start and end tags are optional in HTML 2.0, HTML 3.2 and HTML 4.0(1), strict and loose DTD. Worst possible example... Marc Mongenet 20:02, 2004 Sep 30 (UTC)

Fair enough, I didn't know that. Still, I think you miss my point about what it's a good example of: it's a good example of magic numbers not working. It illustrates their Achilles' heel, so to speak. I've edited the article now, to explicitly make this point. [The downside being my nice parallel examples are now in different orders in different paragraphs. :( Maybe I'll fix that later.] - IMSoP 22:36, 30 Sep 2004 (UTC)
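
To make the disagreement in this thread concrete, here is a rough sketch of the kind of prefix test being discussed. The function name `looks_like_html` and the 64-byte window are assumptions for illustration, not anything the article or a real tool prescribes.

```python
def looks_like_html(data: bytes) -> bool:
    """Naive magic-number test: does the data start with a DOCTYPE or an <html tag?"""
    head = data.lstrip()[:64].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")

# Recognised: the document opens with a DOCTYPE declaration.
print(looks_like_html(b"<!DOCTYPE html><title>t</title><p>x"))

# Not recognised: the html, head and body tags are omitted, as the
# comments above say HTML permits, and there is no DOCTYPE either.
print(looks_like_html(b"<title>t</title><p>x"))
```

The second input is the case raised above: a document with neither a DOCTYPE nor an html tag slips past a simple prefix check.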

"magic number" vs. other ways to specify format

The "magic number" ... approach ... is only useful, however, if the interface used to access the files allows the user to easily manipulate any file in a variety of ways — as opposed to double clicking automatically doing the "right" thing...

I don't understand. When I double click on a file, what difference does it make if the OS looks at the "magic number" at the beginning of the file or the "extension" of the file name ? It seems to me that it doesn't make any difference, so it is just as "useful" either way. --DavidCary 20:26, 5 Jan 2005 (UTC)

Hmm... I wonder if the author of that statement (possibly me) was referring to the "can often determine more precise information" part of the previous sentence (this information being irrelevant in a big-icon just-double-click type environment - although I grant that sorting and searching can be greatly enhanced by it, even on Windows); or perhaps, it was intended to mean the usefulness of arbitrarily changing a file's "type" (e.g. renaming a file under Windows changes the double click behaviour). I certainly agree that the current statement is unclear, but am not 100% sure how to reword it. - IMSoP 23:21, 5 Jan 2005 (UTC)

A further disadvantage [of "magic numbers"] is that it requires scanning of both the file in question and a "magic file" listing known identifiers, making it less efficient, especially for displaying large lists of files.

I don't understand. Less efficient than what ? If we used external metadata or file extensions, we'd still have to (a) read that data or file extension and (b) look it up in a list of known file types.

Yes, this is definitely badly written. A more correct statement is that since reliable 'magic number' tests are often quite complex, and each file must be tested against every test known (the tests are not necessarily mutually exclusive, and there will often be fairly generic tests that match as well as more specific ones, so you can't even stop at the first match), it is less efficient. (Unlike an extension or standardised metadata test, where the data is checked once, and looked up in a potentially very efficient index of one-to-one relationships.) I'll reword that part now. - IMSoP 23:21, 5 Jan 2005 (UTC)
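
A small sketch of the contrast described in the reply above, under invented data: the extension table, the magic tests, and their "specificity" scores are all hypothetical. It only aims to show why one approach is a single index lookup while the other runs every test and then has to choose among overlapping matches.

```python
import os

# Hypothetical one-to-one index: one dictionary lookup per file name.
EXTENSION_INDEX = {".html": "text/html", ".png": "image/png", ".gif": "image/gif"}

# Hypothetical magic tests as (type, specificity, predicate). The generic
# text test overlaps with the HTML test, so stopping at the first match
# would give the wrong answer.
MAGIC_TESTS = [
    ("text/plain", 0, lambda d: all(b < 0x80 for b in d)),
    ("text/html", 10, lambda d: d.lstrip().lower().startswith((b"<!doctype html", b"<html"))),
    ("image/png", 10, lambda d: d.startswith(b"\x89PNG\r\n\x1a\n")),
    ("image/gif", 10, lambda d: d.startswith((b"GIF87a", b"GIF89a"))),
]

def type_from_extension(name):
    """Look the extension up once in the index."""
    return EXTENSION_INDEX.get(os.path.splitext(name)[1].lower())

def type_from_magic(data):
    """Run every test, then keep the most specific match."""
    matches = [(spec, mime) for mime, spec, test in MAGIC_TESTS if test(data)]
    return max(matches)[1] if matches else None

print(type_from_extension("index.HTML"))
print(type_from_magic(b"<html><body>hi</body></html>"))
```

With this layout the cost of `type_from_magic` grows with the number of tests, whereas `type_from_extension` stays a single lookup however many types are known.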

And again

This is only useful, however, if the interface used to access the files allows the user to easily manipulate any file in a variety of ways—as opposed to double clicking automatically doing the "right" thing; it is therefore more often associated with command line interfaces than graphical ones.

Still doesn't make any sense. Removed.

each file must be tested against every possibility in the "magic file"

Not strictly speaking true. Proper algorithms and data structures for the magic database (trie-like) can make magic detection remarkably efficient.
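
As a rough illustration of the trie idea, and not a description of how any particular magic database is implemented: if signatures are fixed byte strings at offset 0, they can be stored in a byte trie so that detection walks the file's leading bytes once instead of trying each signature in turn. The function names and the signature table are assumptions chosen for the example.

```python
def build_trie(signatures):
    """Build a byte trie from a {signature_bytes: type_name} mapping."""
    root = {}
    for sig, name in signatures.items():
        node = root
        for byte in sig:
            node = node.setdefault(byte, {})
        node[None] = name  # the None key marks the end of a signature
    return root

def detect(trie, data):
    """Walk the trie along the leading bytes and return the longest match."""
    node, found = trie, None
    for byte in data:
        node = node.get(byte)
        if node is None:
            break
        found = node.get(None, found)
    return found

# Illustrative signature table (offset 0 only).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
}
TRIE = build_trie(SIGNATURES)

print(detect(TRIE, b"GIF89a\x01\x00\x01\x00"))
print(detect(TRIE, b"plain text"))
```

The work per file is bounded by the length of the longest signature rather than by the number of signatures. Tests at variable offsets or with masks would need more than this sketch covers.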

And, as with the example of HTML, some filetypes just don't lend themselves to recognition in this way.

HTML is recognisable. fdo uses this magic:

   <magic priority="50">
     <match value="&lt;head" type="string" offset="0:64"/>
     <match value="&lt;TITLE" type="string" offset="0:64"/>
     <match value="&lt;title" type="string" offset="0:64"/>
     <match value="&lt;html" type="string" offset="0:64"/>
     <match value="&lt;HTML" type="string" offset="0:64"/>
     <match value="&lt;BODY" type="string" offset="0"/>
     <match value="&lt;body" type="string" offset="0"/>
     <match value="&lt;TITLE" type="string" offset="0"/>
     <match value="&lt;title" type="string" offset="0"/>
     <match value="&lt;!--" type="string" offset="0"/>
     <match value="&lt;h1" type="string" offset="0"/>
     <match value="&lt;H1" type="string" offset="0"/>
     <match value="&lt;!doctype HTML" type="string" offset="0"/>
     <match value="&lt;!DOCTYPE html" type="string" offset="0"/>
   </magic>

That's enough to match any real-world html document. EdC 17:28, 3 August 2006 (UTC)
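
For readers unfamiliar with the notation, here is a minimal sketch of how the rules in the listing above could be evaluated. It assumes that an offset of "0:64" means the string may begin at any byte offset from 0 to 64 inclusive, and that matching is case-sensitive as written. The names `HTML_RULES`, `matches_rule` and `is_html` are hypothetical.

```python
# Rules transcribed from the listing above: (value, first_offset, last_offset).
HTML_RULES = [
    (b"<head", 0, 64), (b"<TITLE", 0, 64), (b"<title", 0, 64),
    (b"<html", 0, 64), (b"<HTML", 0, 64),
    (b"<BODY", 0, 0), (b"<body", 0, 0), (b"<TITLE", 0, 0), (b"<title", 0, 0),
    (b"<!--", 0, 0), (b"<h1", 0, 0), (b"<H1", 0, 0),
    (b"<!doctype HTML", 0, 0), (b"<!DOCTYPE html", 0, 0),
]

def matches_rule(data, value, first, last):
    """True if value begins at some offset in the inclusive range [first, last]."""
    # bytes.find only reports matches lying wholly inside data[first:end],
    # so extending the end by len(value) caps the start offset at `last`.
    return data.find(value, first, last + len(value)) != -1

def is_html(data):
    """True if any single rule matches (the rules are alternatives)."""
    return any(matches_rule(data, v, a, b) for v, a, b in HTML_RULES)

print(is_html(b"<!DOCTYPE html><p>hello"))
print(is_html(b"\n\n  <html><body>hello"))
print(is_html(b"just some text"))
```

Under these assumptions, leading whitespace is tolerated only by the rules with a range, since the offset-0 rules must match at the very first byte.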

Uniform Type Identifiers

The page should be updated with information about the Uniform Type Identifiers (UTI), from Apple Computers.

Section on UTIs added. --Malpertuis 22:42, 4 August 2006 (UTC)

Odds and ends...

--Ccodere 13:46, 17 August 2006 (UTC) I will add some small clarifications on the MIME types, as a lot of people have added their own MIME types without actually registering them, which makes the MIME standard very awkward indeed. Furthermore, even though it is not widely in use, I have created my own file format identification scheme... I will explain it; please tell me if it is not appropriate here.

File structure addition...

I will be adding a new section giving information on the different possible file structures for file formats. If you think I have missed something, please feel free to discuss it with me. Hopefully this will help clarify the different file formats...

Furthermore, I explicitly [release] the copyright on my magicdb.org glossary terms page so it can be included in Wikipedia.

Ccodere 04:34, 11 January 2007 (UTC) Carl
