Archival and publication of research data
Excerpt from "Guidelines on Digital Research Data at TU Darmstadt"
"Research data must be archived to the extent and for the periods specified in the Guidelines for Safeguarding Good Scientific Practice (usually at least ten years). As a rule, this archiving should take place in a recognized discipline-specific research data repository or in the institutional repository of TUDa (TUdatalib). If there are reasons why archiving at the above-mentioned locations is not possible, an archiving solution should be chosen that ensures a similar level of long-term integrity preservation, accessibility and findability of the data. If the data volume of the research data to be stored is too high for complete archiving, a selection will be made according to the aspects of traceability, reproducibility and reusability." (Guideline 3)
"In order to ensure the traceability and reproducibility of research results, research data should be published at an appropriate time and to an appropriate extent, unless other regulations (e.g. contracts in commissioned research or data protection law) or a planned commercial exploitation prevent this. In order to ensure their reusability and in accordance with the Open Access Policy of the TU Darmstadt, published research data should be assigned as open a license as possible." (Guideline 4)
Appraisal
Which research data you preserve depends on factors such as storage costs, but also on your goal: should archiving only safeguard the integrity of your research, or should the data also be available for future reuse? In general, we advise that you retain data that belongs to one of the following categories:
- Raw data: You should always try to preserve this data. If personal data is part of your research, make sure it is properly anonymized beforehand.
- Accompanying data: Data needed to recreate and verify your research process. If your original data can be easily and reliably restored this way, consider keeping only the input data, especially in the case of large, storage-intensive results such as numerical simulations. Accompanying data includes:
  - Settings (e.g. calibrations, instrument settings, etc.)
  - Input data
  - Templates of questionnaires, interview topic guides
  - Software and code
- Final outputs: You should retain the data on which your research results are based, and the data that could be used as the basis for future research. Retention of the final outputs is also important for research integrity and validation of findings.
- Third-party data: If you used third-party data in your research, you should store it with your own results. If legal reasons, e.g. licenses or contracts, prevent this, document its use during your research in detail so that your findings can be reproduced.
- Documentation: You should provide a comprehensive description of your data, including notes on the software used (e.g. versions) and on any conversion between file formats.
The Digital Curation Centre provides a comprehensive guide to help you decide what data to keep.
Decision to publish
The following flow chart helps you decide when publication is advisable and what you should consider beforehand:
[Figure: Decision tree for publication of research data, German version]
Some obligations, such as the patenting process, only result in a temporary embargo period, after which the data can be published.
File formats
Not every file format can ensure that its content remains readable in the future: newer software versions may not be backward-compatible, or long-term support may otherwise be lacking. This problem is especially acute for proprietary formats and for file formats that are not widely adopted. You should aim to use open, standardized file formats, which increases the chance that research data will remain accessible for the foreseeable future.
The following list of appropriate file formats is based on recommendations from ETH Zürich, IANUS - Forschungsdatenzentrum Archäologie & Altertumswissenschaften, forschungsdaten.info as well as the Hessian Research Data Infrastructures (HeFDI):
Text
Format | Recommended | Limited | Not recommended |
---|---|---|---|
PDF/A (*.pdf, preferably subtypes -2b and -2u) | |||
Unformatted text (*.txt, source code,...) (ASCII, UTF-8, or UTF-16 with Byte Order Mark (BOM) encoding) | |||
PDF (*.pdf) with embedded fonts | |||
Unformatted text (*.txt, source code,...) (ISO 8859-1 encoding) | |||
Rich Text Format (*.rtf) | |||
HTML and XML (no external content) | |||
Word (*.docx) | |||
PowerPoint (*.pptx) | |||
LaTeX and TeX (including license-free packages and resulting PDF) | |||
Word (*.doc) | |||
PowerPoint (*.ppt) |
Current versions of Microsoft Word (since 2016) export to PDF/A-3, which is not recommended for long-term storage. Instead, you can use PDFCreator and print the document as a PDF/A-2 file. Unformatted text files are always preferable to PDF files, which should only be used if the layout of the document is important. After conversion to PDF, inspect the result visually for errors.
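The text table above lists ISO 8859-1 encoded text separately from ASCII and UTF-8 text. Where re-encoding to UTF-8 is wanted, it could be sketched as follows. The function names are hypothetical, and the sketch assumes you already know the source encoding: ISO 8859-1 accepts every byte value, so a wrong guess will not raise an error and has to be caught by inspecting the result.

```python
from pathlib import Path


def reencode_to_utf8(source: Path, target: Path,
                     source_encoding: str = "iso-8859-1") -> None:
    """Decode `source` with `source_encoding` and write it to `target` as UTF-8.

    Bytes are read and written directly so that line endings stay unchanged.
    """
    text = source.read_bytes().decode(source_encoding)
    target.write_bytes(text.encode("utf-8"))


def is_valid_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Writing to a separate target file keeps the original untouched until the converted copy has been checked.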
Spreadsheet
Format | Recommended | Limited | Not recommended |
---|---|---|---|
Comma or tab-separated text files (*.csv) | |||
Excel (*.xlsx) | |||
OpenDocument formats (*.odm, *.odt, *.odg, *.odc, *.odf) | |||
Excel (*.xls), (*.xlsb) | convert to *.xlsx |
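
For tabular data that is produced by a script anyway, comma-separated text can be written directly. The sketch below uses only Python's standard `csv` module; the helper names `write_csv` and `read_csv` and the example column names are assumptions for illustration. It writes UTF-8 text with a header row and lets the `csv` module handle quoting of cells that contain commas.

```python
import csv
from pathlib import Path


def write_csv(path: Path, header: list[str], rows: list[list[object]]) -> None:
    """Write rows as comma-separated UTF-8 text with a header line."""
    # newline="" lets the csv module control line endings itself.
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path: Path) -> list[list[str]]:
    """Read the file back; every cell is returned as a string."""
    with path.open("r", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))
```

A call such as `write_csv(Path("results.csv"), ["sample", "temperature_K"], [["A1", 293.15]])` (placeholder names) produces a plain-text file. Because CSV stores no type information, units and data types should be described in the accompanying documentation.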
Raster image
Format | Recommended | Limited | Not recommended |
---|---|---|---|
TIFF (*.tif, uncompressed, TIFF 6.0+) | |||
Portable Network Graphics (*.png, uncompressed) | |||
JPEG2000 (*.jp2, lossless compression) | |||
Digital-Negative-Format (*.dng) | |||
TIFF (*.tif, compressed) | |||
GIF (*.gif) | |||
BMP (*.bmp) | |||
JPEG/JFIF (*.jpg) | |||
JPEG2000 (*.jp2, compressed) |
Vector image
Format | Recommended | Limited | Not recommended |
---|---|---|---|
SVG without JavaScript binding (*.svg) | |||
InDesign (*.indd) and Illustrator (*.ait) graphics | |||
Encapsulated Postscript (*.eps) | |||
Photoshop (*.psd) |
Audio
Format | Recommended | Limited | Not recommended |
---|---|---|---|
WAV (*.wav) (uncompressed, pulse-code modulated) | |||
Advanced Audio Coding (*.mp4) | |||
MP3 (*.mp3) |
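
Whether a given WAV file really contains uncompressed PCM audio can be checked with a short script. The following sketch relies on Python's standard `wave` module, which only supports uncompressed PCM files and raises `wave.Error` for other content; the function name `describe_pcm_wav` and the keys of the returned dictionary are hypothetical.

```python
import wave
from pathlib import Path


def describe_pcm_wav(path: Path) -> dict:
    """Return basic parameters of an uncompressed PCM WAV file.

    wave.open() raises wave.Error if the file is not a WAV file it can
    read, so a successful call doubles as a format check.
    """
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
            "frames": wav.getnframes(),
            "compression": wav.getcomptype(),  # "NONE" for PCM
        }
```

The reported sample width and frame rate are also useful details to include in the documentation of an audio dataset.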
Video
Format | Recommended | Limited | Not recommended |
---|---|---|---|
FFV1 Codec (since ver. 3) in Matroska Container (*.mkv) | |||
MPEG-2 (*.mpg, *.mpeg) | |||
MPEG-4 Part 14 (*.mp4) | |||
Audio Video Interleave (*.avi) | |||
Motion JPEG 2000 (*.mj2, *.mjp2) | |||
Windows Media Video (*.wmv) | |||
QuickTime Movie (*.mov) |
Besides the format, the codec and compression used determine long-term usability.
3D
Format | Recommended | Limited | Not recommended |
---|---|---|---|
AutoCAD Drawing (*.dwg) | |||
Drawing Interchange Format, AutoCAD (*.dxf) | |||
Extensible 3D, X3D (*.x3d, *.x3dv, *.x3db) |
Persistent Identifier
A persistent identifier (PID) refers unambiguously and permanently to an object. Even if the location of the object itself changes, the identifier remains the same. In addition, persistent identifiers can also store metadata about the referenced object.
For data, the Digital Object Identifier (DOI) is widely used. With a DOI, you can easily link the referenced data and a publication. A DOI also ensures that your research data is permanently findable, retrievable and citable. You can register DOIs in TUdatalib after you have published your data. Likewise, many other research data repositories (see below) allow you to register DOIs or other PIDs.
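When DOIs are collected in data documentation, it helps to store them in one consistent form. The following is a minimal sketch: the regular expression is a common heuristic for the shape `10.<registrant>/<suffix>` and not a full implementation of the DOI syntax, and the function names and the example DOI `10.1234/example.dataset` are made up for illustration.

```python
import re

# Assumption: heuristic pattern for typical DOIs, not the complete DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that are commonly written in front of the bare DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(value: str) -> str:
    """Strip a known prefix and return the bare DOI in lower case."""
    candidate = value.strip()
    for prefix in KNOWN_PREFIXES:
        if candidate.lower().startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    if not DOI_PATTERN.match(candidate):
        raise ValueError(f"not a recognisable DOI: {value!r}")
    # DOI names are case-insensitive, so lower-casing gives one canonical form.
    return candidate.lower()


def doi_url(value: str) -> str:
    """Build a resolver link; suffixes with special characters would
    additionally need percent-encoding, which this sketch omits."""
    return "https://doi.org/" + normalize_doi(value)
```

For example, `doi_url("doi:10.1234/Example.Dataset")` yields a link under `https://doi.org/` that can be placed in a README or a reference list.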
Similar to DOIs for digital objects such as data, ORCID (Open Researcher and Contributor ID) was developed to uniquely identify individuals. A person is assigned an identifier so that they can be identified even if their name or affiliation changes. You can obtain an ORCID iD via self-registration and maintain the information in your ORCID profile yourself.
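An ORCID iD consists of 16 characters, the last of which is a check character computed with the ISO 7064 MOD 11-2 algorithm, so typing errors in author metadata can be caught before deposit. The sketch below shows that check; the function names are hypothetical, and it only validates the form of an iD, not whether it is actually registered.

```python
import re


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the shape and the final check character of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_character(compact[:15]) == compact[15]
```

With this, `is_valid_orcid("0000-0002-1825-0097")` returns `True`, while changing the last digit to `8` makes the check fail.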
Repositories
Repositories are used to store and publish digital research objects. They help to make your data more findable and accessible, which can lead to reuse and more citations. In most repositories, you can enrich your data with additional metadata that describes it and makes it easier to search for and find. Publishing well-documented data objects in accepted and well-connected repositories is therefore a good step towards making your own data FAIR.
Many archives and repositories implement the Open Archival Information System (OAIS) reference model. Repositories with a high level of quality and integrity are often certified by one of these initiatives: CoreTrustSeal, nestor Seal/DIN 31644, or ISO 16363.
Discipline-specific
Whenever possible, use discipline-specific repositories to preserve your research data, as they are tailored to the specific needs of your field and are more visible in your research community. Some publishers, like Nature and PLOS, curate a list of recommended repositories. You can also use re3data, a global registry of data repositories, to search for discipline-specific repositories. Similarly, FAIRsharing.org provides a searchable database of discipline-specific repositories and of the data availability policies of various organisations such as journal publishers. When looking for a suitable repository, you can use the following criteria that a repository should fulfill [1]:
- Persistent identifier (PID) for data records
- PID for authors
- Metadata
- Download and export options
- Description or documentation
- Access options
- Licences
- Overview/preview of the data record
- Versioning
- Registration and processing
- Discovery by search engines
General-purpose
If a discipline-specific repository is not available, you can deposit your research data in TUdatalib, the institutional repository of TU Darmstadt. It is open to all researchers of TU Darmstadt, and up to 2 TB of new data per year and research group can be stored free of charge. TUdatalib provides the aforementioned features.
Data journals
Data and software journals are specialized journals that describe research data or software. They are peer-reviewed and provide a high level of quality. The data itself, however, is still published in a separate repository and linked to the articles. The re3data COREF project has compiled a list of data journals.
[1] Gerlach, R., Rex, J., Lang, K., Neute, N., Schwartze, V.: Fact Sheet: Research Data Repositories (2020)