Programs and file formats change over time, which means that old files may become difficult to read. This complicates the long-term use of digital information.

On this page we explain the minimum requirements for the acceptance of publications into the repositories of the ETH Library (Research Collection and ETH Data Archive). Furthermore, we evaluate file formats regarding their suitability for archiving. We also explain how to convert your files to suitable formats, and how to use the software DROID to identify unsuitable files, even within large data collections.

Research data and supplementary materials

The use of formats suitable for long-term archiving is not a requirement for the publication of data in the ETH Zurich repositories. However, please be aware that problematic formats can significantly impede future use. Therefore, if possible, use the file formats listed in the left and centre columns of Table 1.

File collections containing a large number of files or subfolders should be published as uncompressed *.zip files on Windows computers and as *.tar files on Macintosh computers (see Preparing your files). Since these are well-standardised formats, uncompressed *.zip and *.tar files can be unpacked in the long term. However, for long-term use of your file collection, the file formats within these container files must also be usable in the long term. Please note that we only offer limited services to validate and curate the contents of *.zip and *.tar files.
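The packing step described above can also be scripted. The following is a minimal sketch using only the Python standard library; the function name pack_uncompressed is hypothetical, and the sketch assumes that the archive is written to a location outside the folder being packed.

```python
import tarfile
import zipfile
from pathlib import Path


def pack_uncompressed(folder: str, archive: str) -> None:
    """Pack `folder` into an uncompressed *.zip or *.tar file, chosen by suffix."""
    root = Path(folder).resolve()
    if archive.endswith(".zip"):
        # ZIP_STORED stores every member without compression.
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
            for path in sorted(root.rglob("*")):
                if path.is_file():
                    # Keep the top-level folder name inside the archive.
                    zf.write(path, path.relative_to(root.parent).as_posix())
    elif archive.endswith(".tar"):
        # Mode "w" (rather than "w:gz" or "w:bz2") writes a plain tar file.
        with tarfile.open(archive, "w") as tf:
            tf.add(root, arcname=root.name)
    else:
        raise ValueError("archive name must end in .zip or .tar")
```

For example, pack_uncompressed("my_dataset", "my_dataset.tar") would write a plain tar file next to the folder my_dataset.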

Assessment of various file formats

Table 1: Our assessment of future readability of some common file formats. (For more detailed information, we refer to the recommendations of the Swiss Federal Archives (German), KOST (German or French), Memoriav, the Forschungsdatenzentrum Archäologie & Altertumswissenschaften IANUS (Germany), the Library of Congress and the Harvard Library.)

File type

Recommended

Suitable to only a limited extent

Not suitable for archiving

Formatted Text
  • PDF/A (*.pdf, recommended subtypes 1b, 2b and 2u)
  • XML (including XSD/XSL/XHTML etc.; the included or accessible schema and character encoding must be explicitly specified)
  • PDF (*.pdf) with embedded fonts

  • PDF/A-3 (*.pdf) 1
  • Rich Text Format (*.rtf)

  • HTML and XML (ASCII text is readable over the long term; avoid external links if possible)

Not accepted for publication, OK for supplementary materials:

  • Word *.docx

  • PowerPoint *.pptx

  • LaTeX, TeX (ASCII text is readable over the long term; any open source software used for formatting and the resulting PDF should be included)

  • OpenDocument formats (*.odm, *.odt, *.odg, *.odc, *.odf)

  • Markdown (*.md)
  • Word *.doc
  • PowerPoint *.ppt
Plain Text
  • Plain Text (*.txt, *.asc, *.c, *.h, *.cpp, *.m, *.py, *.r etc.) 4
  • Markdown (*.md) 5



Spreadsheets or tables

  • Comma- or tab-delimited text files (*.csv)
  • Excel *.xlsx (container format)
  • OpenDocument spreadsheets (*.ods)
  • Excel *.xls, *.xlsb (binary formats)
Raw data and workspace
  • ASCII Text is suitable for long-term use, but subsequent machine readability may be time-consuming.
  • S-Plus files (*.sdd) may be saved as text files.
  • Matlab *.mat files may be saved in HDF Format. Avoid nontrivial ASCII Matlab *.mat files, as they cannot be read with the Matlab load command (see Table 2).
  • Network Common Data Format or NetCDF (*.nc, *.cdf)
  • Hierarchical Data Format (HDF5) (*.h5, *.hdf5, *.he5)
  • Binary files such as outdated Matlab files *.mat (binary) or R files *.RData (see Table 2)
Raster image (bitmap)
  • TIFF (*.tif) (uncompressed, preferably TIFF 6.0, Part 1: Baseline TIFF). TIFF is preferred over PNG or JPEG2000.
  • Portable Network Graphics (*.png, uncompressed)
  • JPEG2000 (*.jp2, lossless compression)

  • Digital Negative (*.dng) if you want to keep raw data of digital photos in addition to TIFF files
  • TIFF (*.tif) (compressed)
  • GIF (*.gif)
  • BMP (*.bmp)
  • JPEG/JFIF (*.jpg)
  • JPEG2000 (lossy compression) (*.jp2)
  • Photoshop (*.psd)
Vector graphics
  • SVG without JavaScript binding (*.svg)

  • Graphics InDesign (*.indd), Illustrator (*.ait)

  • Encapsulated Postscript (*.eps)

CAD
  • AutoCAD Drawing (*.dwg)
  • Drawing Interchange Format, AutoCAD (*.dxf)
  • Extensible 3D, X3D (*.x3d, *.x3dv, *.x3db)


Audio
  • WAV (*.wav) (uncompressed, pulse-code modulated)
  • Advanced Audio Coding (*.mp4)
  • MP3 (*.mp3)

Video 2

  • FFV1 codec (version 3 or later) in Matroska container (*.mkv)
  • MPEG-2 (*.mpg, *.mpeg)
  • MP4, also called MPEG-4 Part 14 (*.mp4)
  • QuickTime Movie (*.mov) 3
  • Audio Video Interleave (*.avi)
  • Windows Media Video (*.wmv)

Footnotes

1 PDF/A-3 allows a wide variety of file formats to be attached, even if these are not suitable for archiving. We therefore rate PDF/A-3 as "suitable to only a limited extent". The ETH Data Archive will neither check nor curate attached files. 

2 In addition to the file format (or container format), the codec and compression method are also important. See IANUS, Memoriav and KOST for further information. The Motion JPEG 2000 file format (*.mj2, *.mjp2) was removed from the list on 24 October 2024, as it is no longer widely used.

3 The QuickTime Movie format was downgraded from "Recommended" to "Suitable to only a limited extent" in the 21 November 2018 version of this document. Apple discontinued support for the Windows version of QuickTime Player in 2016. Consequently, Windows Media Player only supports QuickTime Movie file versions 2.0 or earlier.

4 Text must be encoded as ASCII, UTF-8 or UTF-16 (the latter with BOM). Text according to ISO 8859-1 is not suitable.

5 Markdown according to CommonMark specification (https://commonmark.org).
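The encoding requirement in footnote 4 can be checked roughly with a short script. This is a heuristic sketch, not a validator: the function name text_encoding is hypothetical, and the check does not distinguish UTF-32 from UTF-16, since both can begin with the same byte order mark.

```python
import codecs


def text_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', 'utf-16' (with BOM) or 'other' for raw bytes."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # UTF-16 is only accepted when a byte order mark is present.
        try:
            data.decode("utf-16")
            return "utf-16"
        except UnicodeDecodeError:
            return "other"
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # Typically a legacy single-byte encoding such as ISO 8859-1.
    return "other"
```

Files reported as "other" are candidates for re-encoding to UTF-8 before archiving.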

Suitable to only a limited extent

If you are planning to use your data for up to ten years, we recommend the formats in the left and the centre columns of Table 1. Formats that are less well-known but commonly used for this type of data in your field are also usually suitable.

The following points should also be noted:

  • Files in rare formats should be converted into common formats whenever possible. You should archive both the original file and the converted file.
  • The files should not be dependent on references to data, fonts, templates or programs stored elsewhere. Instead, these objects should also be archived. If this is not possible, the existing dependencies on other files or programs should be described in a plain text file ("readme"). This file should then be archived together with the data.
  • Files should not be password-protected, encrypted or compressed. However, if you absolutely have to encrypt data, take the necessary precautions to ensure that an authorised person can open it even after you have left.
  • Use only letters, numbers, underscores (_) and hyphens (-) when naming folders and files. Avoid using spaces, slashes, umlauts and other special characters. For more information, see this guideline.
  • The file extension should be consistent with the actual file format.
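The naming rule above lends itself to an automated check before upload. The following is a minimal sketch; the function name unsafe_parts is hypothetical, and allowing dots (for file extensions) is an assumption that the rule itself does not state.

```python
import re

# ASCII letters, digits, underscores and hyphens, plus dots for the file
# extension (allowing dots is an assumption, not part of the stated rule).
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def unsafe_parts(relative_path: str) -> list[str]:
    """Return the folder and file names in a path that break the naming rule."""
    parts = relative_path.replace("\\", "/").split("/")
    return [p for p in parts if p and not SAFE_NAME.fullmatch(p)]
```

A path such as "Messung März/final version.xlsx" would be reported in full, because the folder name contains a space and an umlaut and the file name contains a space.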

Recommended file formats

For storage of more than ten years, we can only recommend the file formats listed in the left-hand column of Table 1, in particular PDF/A, ASCII text and TIFF. PNG, SVG and JPEG2000 may also be suitable. Note that the future readability of a file also depends heavily on the file features used: Reading advanced features, such as video data within a PDF file, is less reliable than reading basic features.

To ensure that files can be used for more than ten years, the file formats should be very widespread and, if possible, follow standards that are open and not proprietary. However, it cannot be guaranteed that your data will remain readable in the long term, as this depends on future software developments.

The ETH Library reviews the archived file formats annually as part of a Format Monitoring Report. If possible, outdated formats will be converted into more common formats that offer better prospects for preservation. The original file is always kept.

Recommended conversion methods

We recommend using the conversion methods shown in Table 2. Which conversions are useful also depends on the type of information stored in the files. For example, you could convert your Excel spreadsheets to *.csv files. However, if the Excel file also contains macros, formulas or embedded objects, this information will be lost in the conversion.

You should visually check the quality of your converted files. Both the original and the converted files should be archived.

Some more recent file types (*.docx, *.xlsx, *.pptx) are so-called container files. By adding the file extension “.zip” to the file name, you can view the individual components and also save suitable simpler files separately.
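Instead of renaming the file, the components of such a container can also be listed programmatically, because these formats are ZIP archives internally. A minimal sketch with the Python standard library follows; the function name list_container_parts is hypothetical.

```python
import zipfile


def list_container_parts(path: str) -> list[str]:
    """List the components stored inside a *.docx, *.xlsx or *.pptx file."""
    if not zipfile.is_zipfile(path):
        # Older binary formats such as *.doc or *.xls are not ZIP containers.
        raise ValueError(f"{path} is not a ZIP-based container file")
    with zipfile.ZipFile(path) as container:
        return container.namelist()
```

In a Word file, embedded images typically appear under a path such as word/media/, from where they can be extracted and saved separately.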

Table 2: Recommended file conversions

File type

Recommended conversions

Text
  • Word and PowerPoint files should be converted to PDF/A-2b format (or PDF/A-2u). See also our instructions on creating PDF/A files.
  • LaTeX (or TeX files) should be converted to PDF/A format and both versions should be submitted.
  • You should carefully check the quality of your converted files. Verify formulas, special characters, umlauts, special fonts, spelling errors, text selection and searching, tables, colours, transparent objects, comments, vector graphics and layered graphics. 
Tables
  • Convert Excel *.xls files to *.xlsx files
  • You may save a copy of important embedded objects (such as figures) as a separate file.
  • Tables can be converted to ASCII text *.csv files as follows: In Excel, save individual sheets as *.csv files; in R, use “write.csv” to save tables; in S-Plus, use “write.table” to save tables as *.sdd files.
Workspace Dump in Matlab, R or S-Plus
  • Matlab *.mat files should be saved as v7.3 files (using save -v7.3 x.mat), as the resulting *.mat file adheres to the HDF5 standard. (HDF5 is an open standard for tables, media data and complex data structures.)
  • The R workspace should be saved in HDF5 format using the R package rhdf5. The S-Plus function data.dump produces a file that can be read using the R function data.restore.
  • For complex data, it is usually not useful to save the workspace using ASCII, as the resulting files are difficult to read. Such an ASCII workspace dump can be saved in R with the command save(…, ascii = TRUE), in Matlab with save file.txt -ascii, or in S-Plus with dump().
  • Important tables in the workspace should also be saved as a separate CSV file.
Graphics
  • Vector graphics files will be more difficult to access in the long term than raster graphics (bitmaps). Embedding vector graphics in PDF files is also prone to errors. Files in special vector graphic formats, such as InDesign (*.indd) or Illustrator (*.ait), should be saved in a more suitable format if possible (see the left-hand column in Table 1). You should carefully check the quality of the converted files with regard to contrast, resolution, colours, transparent objects, and text.
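To confirm that a converted workspace file (see the Matlab entry in Table 2) really is HDF5-based, you can look for the HDF5 format signature. The sketch below relies on the HDF5 file format rule that the signature may sit at byte offset 0, 512, 1024, 2048 and so on. That Matlab v7.3 files place it after a 512-byte header is our assumption, not something stated in Table 2, and the function name is hypothetical.

```python
from typing import Optional

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def hdf5_signature_offset(data: bytes) -> Optional[int]:
    """Return the offset of the HDF5 signature in `data`, or None if absent."""
    offset = 0
    while offset + len(HDF5_SIGNATURE) <= len(data):
        if data[offset:offset + len(HDF5_SIGNATURE)] == HDF5_SIGNATURE:
            return offset
        # Allowed offsets are 0, 512, 1024, 2048, ... (doubling each time).
        offset = 512 if offset == 0 else offset * 2
    return None
```

Reading the first few kilobytes of a file in binary mode and passing them to this function is usually enough; a result of None suggests that the file is not HDF5-based.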


File format verification with DROID

For large data collections, the free Java application DROID provides an overview of your file formats. This tool can also detect unknown file formats as well as inconsistencies between file extensions and file contents (see Figure 1).

With the exception of text files, most files contain a special string of characters that indicates the file format. This string of characters is also referred to as 'signature' or 'magic number'. If DROID finds a known signature within the file, it is used to determine the file type. In this case, "Signature" or "Container" is displayed in the "Method" column (see Figure 1). If the file's signature is not consistent with its extension, DROID displays a warning (yellow triangle with exclamation mark).

Pure text files (*.txt) or tables in text format (*.csv files) do not contain any signatures. DROID classifies such files based on their file extension. If there is no signature and the file extension does not indicate that the file is text-based, the file is not classified at all (see the bottom two files in Figure 1).
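The classification logic described above can be illustrated with a small sketch. It is not DROID's implementation: DROID relies on a much larger signature registry, whereas the table below contains only a handful of well-known signatures, and the names classify, SIGNATURES and TEXT_EXTENSIONS are hypothetical.

```python
from pathlib import Path

# A few well-known signatures ("magic numbers") and the extensions they fit.
SIGNATURES = {
    b"%PDF-": {".pdf"},
    b"\x89PNG\r\n\x1a\n": {".png"},
    b"II*\x00": {".tif", ".tiff"},
    b"MM\x00*": {".tif", ".tiff"},
    b"\xff\xd8\xff": {".jpg", ".jpeg"},
    b"PK\x03\x04": {".zip", ".docx", ".xlsx", ".pptx"},
}
TEXT_EXTENSIONS = {".txt", ".csv"}


def classify(name: str, head: bytes) -> str:
    """Classify a file from its name and its first bytes.

    Returns 'signature' (signature fits the extension), 'mismatch'
    (signature contradicts the extension), 'extension' (text file
    identified by extension only) or 'unknown'.
    """
    ext = Path(name).suffix.lower()
    for magic, extensions in SIGNATURES.items():
        if head.startswith(magic):
            return "signature" if ext in extensions else "mismatch"
    return "extension" if ext in TEXT_EXTENSIONS else "unknown"
```

The first 16 bytes of a file, read in binary mode, are sufficient as the head argument for the signatures listed here. A "mismatch" result corresponds to the warning triangle in DROID, and "unknown" to files that are not classified at all.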

The ETH Library recommends and configures the software tool docuteam packer for certain customers. This tool can also detect files with unclear or unknown formats and generate a list similar to that created by DROID.


Figure 1: Screenshot showing DROID verification for some test files. Files with unclear or unknown formats can be easily detected.


General remark: Because of software updates, the design of the user interface may change, so the screenshot shown above may not match the current appearance. The process and the functionality, however, remain the same.
