Programs and file formats change over time, so old files may become difficult to read. This complicates the long-term use of digital information.
No requirements apply to the publication of research data and supplementary materials in the repositories of ETH Zurich. However, please be aware that some formats may become very difficult to use in the future; if possible, use the file formats in the left and middle columns of Table 1.
File collections containing a large number of files or subfolders should be published as uncompressed *.zip files on Windows computers and as *.tar files on Mac computers. Since uncompressed *.zip and *.tar are well-standardised formats, it should remain possible to unpack them in the long term. However, for the long-term use of your file collection, the file formats within these container files must also be usable in the long term. We can offer only limited services to validate and curate the contents of *.zip and *.tar files.
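The packaging step described above can be sketched with Python's standard library. This is a minimal illustration, not an official ETH Library tool: the function names `pack_uncompressed_zip` and `pack_tar` are hypothetical, and the sketch assumes the archive is written outside the folder being packed.

```python
import tarfile
import zipfile
from pathlib import Path


def pack_uncompressed_zip(folder: Path, archive: Path) -> None:
    """Store every file under `folder` in a ZIP archive without compression.

    Hypothetical helper: ZIP_STORED keeps the bytes of each file unchanged.
    Empty directories are not recorded by this sketch.
    """
    with zipfile.ZipFile(archive, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Keep the top-level folder name inside the archive.
                zf.write(path, arcname=path.relative_to(folder.parent))


def pack_tar(folder: Path, archive: Path) -> None:
    """Store `folder` in a plain TAR archive (mode "w" applies no compression)."""
    with tarfile.open(archive, mode="w") as tf:
        tf.add(folder, arcname=folder.name)
```

For example, `pack_uncompressed_zip(Path("collection"), Path("collection.zip"))` would produce an archive whose entries all start with `collection/`. The formats of the files inside the archive still determine whether the collection stays usable.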
Table 1: Our assessment of the future readability of some common file formats. (For more detailed information, see the recommendations of the Bundesarchiv (German), the KOST (German or French), Memoriav, the Forschungsdatenzentrum Archäologie & Altertumswissenschaften IANUS (Germany), the Library of Congress and the Harvard Library.)
File type | Recommended | Suitable to only a limited extent | Not suitable for archiving
PDF (*.pdf) with embedded fonts
Plain text (*.txt, *.asc, *.c, *.h, *.cpp, *.m, *.py, *.r etc.) (ISO 8859-1 coded)
Rich Text Format (*.rtf)
HTML and XML (The ASCII text is readable over the long term; try to avoid external links.)
Not accepted for publication, OK for supplementary materials: Word (*.docx), PowerPoint (*.pptx)
LaTeX, TeX (The ASCII text is readable over the long term; open-source software is required for formatting, and the resulting PDF should be included.)
OpenDocument formats (*.odm, *.odt, *.odg, *.odc, *.odf)
Spreadsheet or table
JPEG2000 (*.jp2, lossless compression)
Graphics
InDesign (*.indd), Illustrator (*.ait)
Encapsulated Postscript (*.eps)
Video (see footnote 2)
1 PDF/A-3 allows a wide variety of file formats to be attached, even if these are not suitable for archiving. We therefore rate PDF/A-3 as "suitable to only a limited extent". The ETH Data Archive will neither check nor curate attached files.
2 In addition to the file format (or container format), the codec and the compression method are also important. See IANUS, Memoriav and KOST for further information. The Motion JPEG 2000 file format (*.mj2, *.mjp2) was removed from the list on 24 October 2024 as it is no longer in use.
3 In the version of 21 November 2018 of this document, the QuickTime Movie format was downgraded from "Recommended" to "Suitable to only a limited extent". Apple discontinued support for QuickTime Player for Windows in 2016. Windows Media Player thus only supports file format versions 2.0, or earlier, of QuickTime Movie files.
If you plan to use your data for up to ten years, we recommend the formats in the middle and left columns of Table 1. Even lesser-known formats that are common in your field for this type of data are usually suitable. You should also consider the following points:
For storage over more than ten years, we recommend file formats in the left column of Table 1, such as PDF/A, ASCII text, and TIFF. PNG, SVG and JPEG2000 may also be appropriate.
Bear in mind that the future readability of a file also depends strongly on the file features used: reading advanced features of a format, such as video data within a PDF file, will be less reliable than reading basic features.
To remain usable for more than ten years, file formats should be very common and, if possible, follow open, non-proprietary standards.
Nevertheless, it cannot be guaranteed that your data will remain readable over the long term, as this depends on future software developments.
The ETH Library reviews the identified archived file formats annually in a format monitoring report and will convert outdated formats if a suitable current target format with better preservation prospects is available. The original file will always be kept.
We recommend the conversion methods shown in Table 2. Which conversions are useful also depends on the type of information stored in the files. You may store your Excel spreadsheets as *.csv files, but if the Excel file also contains macros, equations or embedded objects, this information will be lost. You should check the quality of your converted files, and archive both the original and the converted files.
Some more recent file types (*.docx, *.xlsx, *.pptx) are so-called container files. By appending the file extension ".zip" to the file name, you can open the file as an archive and inspect its individual components. You may also save these simpler component files separately.
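The container check described above can also be done without renaming the file. The following sketch uses Python's standard `zipfile` module; the function names `list_container_parts` and `extract_container_parts` are hypothetical, and the example assumes the file comes from a trusted source.

```python
import zipfile
from pathlib import Path


def list_container_parts(office_file: Path) -> list[str]:
    """Return the names of the component files inside a ZIP-based container."""
    if not zipfile.is_zipfile(office_file):
        raise ValueError(f"{office_file} is not a ZIP-based container")
    with zipfile.ZipFile(office_file) as zf:
        return zf.namelist()


def extract_container_parts(office_file: Path, target_dir: Path) -> None:
    """Save the component files separately below `target_dir`."""
    with zipfile.ZipFile(office_file) as zf:
        zf.extractall(target_dir)
```

Calling `list_container_parts(Path("report.docx"))` on a typical Word file would be expected to list XML parts and any embedded media, which can then be archived individually if that is useful.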
Table 2: Recommended file conversions
For large data collections, you can get an overview of your file formats using the free Java application DROID. This tool also detects unknown file formats as well as inconsistencies between file extensions and file contents (Figure 1).
With the exception of text files, files usually contain a special string of characters that indicates the file format. This character string is also referred to as a signature or magic number. If DROID finds a known signature within the file, it uses this to determine the file type. In this case, "Signature" or "Container" is shown in the "Method" column (see Figure 1). If the signature within the file is not consistent with the file extension, DROID shows a warning sign (yellow triangle with exclamation mark).
Pure text files (*.txt) and tables in text format (*.csv files) do not contain any signatures. DROID classifies such files by their file extension. If there is no signature and the file extension does not indicate a text file, the file is not classified at all (both files at the bottom of Figure 1).
The software tool docuteam packer is recommended by the ETH Library and set up for some customers. It detects files with unclear or unknown formats and produces a list comparable to that of DROID.
Figure 1: Screenshot showing DROID verification for some test files. Files with unclear or unknown file types can be easily detected.
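The classification logic described for DROID can be illustrated with a small sketch. It is not DROID's implementation: DROID draws on the much larger PRONOM signature registry and also looks inside containers, whereas the signature table, the extension lists and the function names below are simplified assumptions chosen for illustration.

```python
from pathlib import Path
from typing import Optional, Tuple

# Illustrative subset of well-known leading byte sequences ("magic numbers").
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"II*\x00": "tiff",    # little-endian TIFF
    b"MM\x00*": "tiff",    # big-endian TIFF
    b"PK\x03\x04": "zip",  # also used by *.docx, *.xlsx, *.pptx
}

# Extensions regarded as consistent with each detected format (assumption).
EXTENSIONS_FOR = {
    "pdf": {"pdf"},
    "png": {"png"},
    "jpeg": {"jpg", "jpeg"},
    "tiff": {"tif", "tiff"},
    "zip": {"zip", "docx", "xlsx", "pptx"},
}

# Text formats carry no signature and are classified by extension only.
TEXT_EXTENSIONS = {"txt", "csv"}


def detect_by_signature(path: Path) -> Optional[str]:
    """Return the format whose signature starts the file, or None."""
    with path.open("rb") as fh:
        head = fh.read(8)  # the longest signature in the table has 8 bytes
    for magic, fmt in SIGNATURES.items():
        if head.startswith(magic):
            return fmt
    return None


def classify(path: Path) -> Tuple[str, Optional[str], bool]:
    """Return (method, format, extension_mismatch) for one file."""
    ext = path.suffix.lower().lstrip(".")
    fmt = detect_by_signature(path)
    if fmt is not None:
        # A mismatch corresponds to the warning sign described in the text.
        return ("Signature", fmt, ext not in EXTENSIONS_FOR[fmt])
    if ext in TEXT_EXTENSIONS:
        return ("Extension", ext, False)
    return ("Unclassified", None, False)
```

A PDF file that was renamed to `scan.png` would be reported by this sketch as `("Signature", "pdf", True)`, while a file with neither a known signature nor a text extension is left unclassified.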
Further file types listed in Table 1: Text; Raw data and workspace; Raster image (bitmap); Vector graphics; CAD; Audio.
Table 2 lists recommended conversions by file type: Text; Tables; Workspace (dump in Matlab, R or S-Plus); Graphics.