Is it necessary?
Tens of thousands of file formats already exist. Before creating a new one, check whether an existing format meets your needs. The best choice is one of the standard file formats; for more information, consult a list of standard file formats as well as the registered MIME types.
If those formats do not fit, consider reviving one of the defunct (no longer maintained) file formats. If you still wish to create a new file format, see below for further tips.
Choose a file extension
Choose a unique file extension for that specific file format. Avoid ambiguous extensions such as .IMG, which are already used by several unrelated formats.
Determine the encoding of the file format
Determine the type of encoding the file will have: ASCII (ISO 646 IRV), text or binary. ASCII encoding is portable across systems with different endianness. On the other hand, parsing a text or ASCII encoded file is usually slower than reading directly from a binary file. If the encoding will be binary, determine the byte order (endianness) of the encoded data. Most personal computers are little-endian, whereas network byte order and many server architectures have traditionally been big-endian. It is also possible to design a file format that supports both byte orders, for example by recording the byte order in the header.
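As a brief illustration of the byte-order decision, the following Python sketch writes the same 32-bit value in both byte orders with the standard struct module. The function names and the sample value are illustrative assumptions, not part of any existing format.

```python
import struct

def encode_u32(value: int, big_endian: bool = True) -> bytes:
    """Encode an unsigned 32-bit integer with an explicit byte order."""
    # ">" selects big-endian (network byte order), "<" selects little-endian.
    fmt = ">I" if big_endian else "<I"
    return struct.pack(fmt, value)

def decode_u32(data: bytes, big_endian: bool = True) -> int:
    """Decode 4 bytes written by encode_u32 with the same byte order."""
    fmt = ">I" if big_endian else "<I"
    return struct.unpack(fmt, data)[0]

print(encode_u32(0x12345678, big_endian=True).hex())   # 12345678
print(encode_u32(0x12345678, big_endian=False).hex())  # 78563412
```

Stating the byte order explicitly in every read and write, rather than relying on the native order of the machine, is what keeps such a file portable.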
Chunk based or direct stream format
Determine whether the file format needs to be extensible; if so, consider a chunk based file format (such as the XML, PNG, JPEG, IFF and RIFF file formats). Chunk based file formats contain chunks of data, each preceded by a header that identifies the data that follows, with an optional footer. This permits older software applications to skip over data that they do not recognize.
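The skipping behaviour described above can be sketched in Python as follows. The chunk layout used here (a 4-byte type, a 4-byte big-endian length, then the payload) and the chunk names are hypothetical, chosen only for illustration; real formats such as PNG or IFF add details like checksums or padding.

```python
import io
import struct

def write_chunk(stream, chunk_type: bytes, payload: bytes) -> None:
    """Write one chunk: 4-byte type, 4-byte big-endian length, payload."""
    if len(chunk_type) != 4:
        raise ValueError("chunk type must be exactly 4 bytes")
    stream.write(chunk_type)
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)

def read_known_chunks(stream, known_types):
    """Return (type, payload) pairs for known chunks, skipping the rest."""
    chunks = []
    while True:
        header = stream.read(8)
        if len(header) == 0:
            break  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated chunk header")
        chunk_type, length = struct.unpack(">4sI", header)
        if chunk_type in known_types:
            payload = stream.read(length)
            if len(payload) < length:
                raise ValueError("truncated chunk payload")
            chunks.append((chunk_type, payload))
        else:
            # Unknown chunk: the length field lets the reader skip it unread.
            stream.seek(length, io.SEEK_CUR)
    return chunks

buf = io.BytesIO()
write_chunk(buf, b"TITL", b"demo")
write_chunk(buf, b"NEW1", b"data from a newer version")
write_chunk(buf, b"BODY", b"\x01\x02\x03")
buf.seek(0)
print(read_known_chunks(buf, {b"TITL", b"BODY"}))
```

Here the reader knows only the TITL and BODY chunks, so the NEW1 chunk written by a hypothetical newer version is passed over without error.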
Create a magic value to identify the file
Signatures in files allow automatic identification tools, as well as general software, to identify file formats easily. In the 1980s file signatures often consisted of only 2 bytes (or 2 characters), but with the number of file formats in existence today, such short signatures lead to duplicate identifications. It is strongly suggested that your signature consist of at least 8 bytes or 8 characters.
ASCII encoded file: In the case of a text encoded file, it is suggested to create a character signature that is placed at position 0 of the file. Position 0 is selected because automatic identifier tools are not able to parse complex text.
Binary encoded file: In the case of a binary encoded file, it is suggested to create a signature that is placed at or near position 0 of the file. This avoids false matches against files that happen to contain the same bytes elsewhere (most files begin with a header, so the probability that another file's data equals your signature is much lower when the signature is near the start of the file). For added robustness, an extra signature can be placed at the end of the file.
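A signature check at position 0 might look like the following Python sketch. The 8-byte signature shown is invented for an imaginary format and is an assumption for illustration only; its mix of a non-ASCII byte, letters and control characters loosely follows the approach taken by the PNG signature.

```python
import io

# Hypothetical 8-byte signature for an imaginary format.
SIGNATURE = b"\x89MYFMT\x1a\n"

def has_signature(stream) -> bool:
    """Return True if the stream starts with SIGNATURE at position 0."""
    stream.seek(0)
    return stream.read(len(SIGNATURE)) == SIGNATURE

good = io.BytesIO(SIGNATURE + b"rest of the file")
bad = io.BytesIO(b"GIF89a" + b"rest of the file")
print(has_signature(good), has_signature(bad))  # True False
```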
Decide if metadata is required in the file format
Software and operating systems often read inside files to determine their characteristics. These characteristics are usually applicable to every file. For example, the following information is useful in most file formats:
Title: The title of the file, for example the name of the picture, or the name of the document.
Creator: The author of the file, for example the name of the composer of the music file.
Source: The origin of the file, such as a URI from which the file was downloaded, or the contact address of the creator of the file.
Rights: License information or copyright information for the file.
Identifier: Unique identifier for the resource or file. This can be a FID (as assigned from this site), an ISBN, an ISSN, an ISAN, or a DOI.
Creation date: The date the original file was created.
One option is to embed an Adobe XMP block in the file. For more information on the XMP standard, see the Adobe web site.
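For a format that does not embed XMP, the fields listed above could be carried in a small metadata record. The Python sketch below is one hypothetical way to do so: the Metadata class, the field names and the 'key=value' line serialization are all assumptions made for illustration, and the sketch assumes values contain no line breaks. It is not XMP.

```python
from dataclasses import dataclass, asdict

@dataclass
class Metadata:
    title: str = ""
    creator: str = ""
    source: str = ""
    rights: str = ""
    identifier: str = ""
    creation_date: str = ""  # ISO 8601 date, for example "2004-01-02"

def encode_metadata(meta: Metadata) -> bytes:
    """Serialize the non-empty fields as UTF-8 'key=value' lines."""
    lines = [f"{key}={value}" for key, value in asdict(meta).items() if value]
    return "\n".join(lines).encode("utf-8")

def decode_metadata(data: bytes) -> Metadata:
    """Parse 'key=value' lines produced by encode_metadata."""
    fields = {}
    for line in data.decode("utf-8").splitlines():
        key, _, value = line.partition("=")
        fields[key] = value
    return Metadata(**fields)

meta = Metadata(title="Sunset", creator="A. Author", creation_date="2004-01-02")
print(encode_metadata(meta).decode("utf-8"))
```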
Document the format
If the format is to be widely used, it is very important to have a clear and concise document describing the entire file format (document it in a plain ASCII file, or a PDF or HTML document).
Create samples of the format
For programmers to really understand the internals of the file format, it is important that complex sample files of the file format be available.
Create a simple API to access the files
To encourage uniformity and wider acceptance, consider providing a simple, free file access library, optionally with source code. For greater portability across languages, the access library should be written in C, Java or .NET.
Register the file format
Get a MIME type for your file format. RFC 2048 described how to register a new MIME type; it has since been superseded by RFC 6838.
If you decide to create a new IFF or RIFF type file, do not forget to register the new format with the registrar maintainers (the maintainers are not the original creators of the format!).
Submit the file format
Submit the file format to magicdb.org, as well as to wotsit.org and filext.com. It will be registered and added to the file formats database, and the tools will then be able to identify your file format.
Possible encoding types for files are described below:
Binary coded: Values are 8 bits each, and each value represents a number or a character, depending on the file format. This is the fastest format to process, because the software needs to perform only minimal verification of the data in the file.
ASCII encoded (ISO 646 IRV): Values are 7 bits each, and can be viewed in a standard text editor. ASCII encoded files are usually more permissive in their format, so a parser is usually required to verify the validity of the file.
Text encoded: This indicates text that may or may not be directly displayable by the operating system; ASCII is one type of text encoding. The most common text encodings are as follows:
Fixed width text encoding
ISO-8859: Each character is encoded in one byte, whose meaning depends on the specific character set in use. This permits representing most Latin-based character sets.
UCS-2 (ISO 10646): Each character is encoded as a 16-bit value. Although the byte order is not specified, it is usually network byte order (big-endian). This permits encoding all characters of the Basic Multilingual Plane (BMP, plane 0) of Unicode, which covers most languages in current use.
UTF-32/UCS-4 (UNICODE/ISO 10646): Each character is encoded as a 32-bit value. This permits representing all the characters from the Unicode standard (all planes).
Variable width text encoding
UTF-8 (UNICODE): Each character is represented as a variable length sequence of bytes. ASCII characters are represented, as usual, by a single 8-bit value; all other characters take two to four bytes. This permits representing all the characters from the Unicode standard (all planes).
UTF-16 (UNICODE): Each character is encoded as either one or two 16-bit values. Like UTF-8, this is a variable length encoding, and like UTF-8 it permits representing all the characters from the Unicode standard (all planes).
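To make the difference between the variable width and fixed width encodings concrete, the following Python sketch reports how many bytes a single character occupies in UTF-8, UTF-16 and UTF-32. The helper name and the sample characters are illustrative choices.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-be" codecs are used so that no byte order mark is added.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )

# "A", e with acute accent, euro sign, and an emoji outside the BMP.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32} bytes")
```

The last sample lies outside the Basic Multilingual Plane, so UTF-16 needs two 16-bit values for it, while UTF-32 uses four bytes for every character.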
These are suggestions for encoding different kinds of data:
Floating point values: Use the standard single or double precision floating point formats defined by IEEE 754.
Date and time values: Use ISO 8601, preferably in one of the following forms: YYYY-MM-DD or YYYY-MM-DDThh:mm:ssTZD. For more information on these formats, refer to the ISO 8601 standard.
Character encoding: ASCII encoding or UTF-8 encoding are encouraged.
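The floating point and date suggestions above can be sketched in Python as follows. The function names are hypothetical, and the choice of big-endian byte order for the double is an assumption made for this example only.

```python
import struct
from datetime import datetime, timezone

def encode_double(value: float) -> bytes:
    """Encode a float as an 8-byte big-endian IEEE 754 double."""
    return struct.pack(">d", value)

def encode_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ss+hh:mm."""
    return moment.isoformat(timespec="seconds")

print(encode_double(1.0).hex())  # 3ff0000000000000
print(encode_timestamp(datetime(2004, 1, 2, 3, 4, 5, tzinfo=timezone.utc)))
# 2004-01-02T03:04:05+00:00
```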
Last modification $Date: 2004/09/06 23:17:33 $
Copyright © 2004 Optima SC Inc.