Organization and documentation

Excerpt from "Guidelines on Digital Research Data at TU Darmstadt"

"[...] Associated with them are also the metadata, documentation, and software necessary to understand them. [...]" (Definition 1)

"Research data management is understood to refer to the entire handling of digital data in research, from the planning of its generation, through its organization, use and processing in research projects, to its selection and permanent archiving or even deletion, with the aim of achieving the aforementioned goals. This includes, in particular, the discipline-specific documentation of its creation in digital form, the secure storage, the appropriate processing and, if applicable, the publication in a suitable form." (Definition 2)

"[...] To implement the FAIR principles, the metadata describing the research data must be published to an appropriate extent." (Guideline 4)

Documentation

Documenting your research data means providing additional data, so-called metadata, that contains the information other scientists need to understand the context of your research data, its information content and its limits. Of course, good documentation serves not only others in understanding the data, but also yourself when you have to work with the data again in the future. In general, good data documentation is an essential step towards making your data conform to the FAIR principles of being Findable, Accessible, Interoperable, and Reusable.

There are a lot of things you might want to document:

  • what the data represents (object of investigation and recorded parameters),
  • how it was obtained (methods and tools including relevant parameters),
  • the context of data creation, your reasons for data generation,
  • your thoughts on the data, just to mention a few.

Think about what you would expect others to provide and have a look at well-documented datasets in your field.

When choosing a format for documenting your research data, try to create documentation that is as structured, consistent and machine-actionable as possible. This is not only beneficial for potential re-users of your data, but also for yourself, as it makes it easier to use the metadata for finding your data or for automating tasks such as analysis or visualization. Find out whether standards or best practices are recommended for your area of research, e.g. by getting in contact with suitable NFDI consortia or by searching databases such as the Fairsharing.org standards database, and try to create metadata that fulfills these.
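As an illustration, the following minimal Python sketch stores such a structured metadata record as a JSON file next to the data. All field names and values here are hypothetical and should be replaced by the terms recommended for your discipline.

    import json

    # Sketch of a structured, machine-actionable metadata record.
    # All field names and values are hypothetical examples.
    record = {
        "title": "Temperature sweep on sample A-17",
        "creator": "J. Doe",
        "date_created": "2023-01-12",
        "method": "Laser-induced fluorescence",
        "instrument": "Expensive Instrument, software version 1.3",
        "parameters": {"runtime_s": 300, "gain_V": 450},
        "related_files": ["RawData/00001_sampleA17.dat"],
        "license": "CC-BY-4.0",
    }

    # Storing the record as JSON next to the data keeps it human-readable
    # and easy to parse for search or automated analysis.
    with open("00001_metadata.json", "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, ensure_ascii=False)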

One of the highest levels of machine-actionability and interoperability of metadata (and data) is achieved by expressing it in the Resource Description Framework (RDF), which represents information as triples consisting of subject, predicate and object, each of which is selected from unambiguous, controlled identifiers (IRIs) provided by carefully curated terminologies.
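As a small, hedged example, the following Python sketch uses the rdflib library to express a few statements about a dataset as RDF triples, with predicates taken from the Dublin Core terms vocabulary; the dataset IRI is hypothetical and would in practice be a persistent identifier such as a DOI.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS, XSD

    # Hypothetical IRI for the dataset being described.
    dataset = URIRef("https://example.org/dataset/00001")

    g = Graph()
    # Each statement is a subject-predicate-object triple;
    # the predicates come from the Dublin Core terms vocabulary.
    g.add((dataset, DCTERMS.creator, Literal("J. Doe")))
    g.add((dataset, DCTERMS.created, Literal("2023-01-12", datatype=XSD.date)))
    g.add((dataset, DCTERMS.description, Literal("Raw measurement, runtime 300 s, gain 450 V")))

    # Serialize the triples in the human-readable Turtle format.
    print(g.serialize(format="turtle"))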

Automation is a great way both to ensure the quality of your data documentation and to keep the effort of creating it to a minimum. In many cases, a significant part of the necessary information already exists because it was generated automatically, for example by a measuring device, or entered by the user in some other system. Configuring once how this information is assembled into the documentation means considerably less documentation effort afterwards.
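A minimal sketch of such an automation in Python, assuming a hypothetical instrument output file whose header consists of "# key: value" lines: the script extracts the header once and merges it with manually entered information into a metadata record.

    import json
    from pathlib import Path

    def parse_instrument_header(path: Path) -> dict:
        """Read 'key: value' header lines from a hypothetical instrument output file."""
        metadata = {}
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.startswith("#"):
                break  # header ends where the measured values begin
            key, _, value = line.lstrip("# ").partition(":")
            metadata[key.strip()] = value.strip()
        return metadata

    # Assemble the documentation once; rerunning the script keeps it up to date.
    raw_file = Path("RawData/00001_sampleA17.dat")   # hypothetical file name
    metadata = parse_instrument_header(raw_file)
    metadata["operator"] = "J. Doe"                  # information entered manually
    Path("00001_metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")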

File-system-based approaches

If you cannot use a specialized infrastructure like an ELN for organizing your data, you can still achieve a minimal structured way of organizing and documenting data with a file-system-based approach that relies on a common, pre-defined folder and file structure with a standardized naming scheme. Once such a system has been implemented, ongoing effort is required to ensure that it is applied in the long term. In particular, new researchers have to be trained to navigate it and to file their own data according to the rules. Below, we present an example based on an approach used within the collaborative research centre TRR 150 at TU Darmstadt for research in engineering. Other systems have been developed for various scientific areas.

More advanced approaches towards data organization and documentation rely on specialized software like electronic laboratory notebooks and file versioning tools.

Method-focused data organization example

This system is based on research projects that are subdivided into parts we call investigations. An investigation is a research endeavour with the aim of creating and analysing a specific set of data using a specific research setup or method. Typically, a research publication will be assembled from the information obtained in several such investigations.

Another aspect of the method-focused data organization is the allocation of unique identifiers (IDs) to each data file within an investigation. Note that each measurement, data manipulation or analysis leads to a new dataset with a new ID, which is part of the file name. A central part of this approach is the ID table (or separate ID tables for different ID systems) that lists all identifiers within the investigation and describes how the respective dataset was obtained, linking to a previous identifier in the case of data manipulation. This way, the provenance of every dataset and visualization can be traced back to the original raw data. The following table is an example of how such an ID table might be organized:

ID    | Date       | Creator  | Origin              | Tool                                                   | Parameters                  | Description
00001 | 2023-01-12 | J. Doe   | Measurement         | Expensive Instrument; Software version 1.3             | Runtime: 300 s; Gain: 450 V | Additional information
00002 | 2023-01-12 | J. Doe   | Processed: ID 00001 | R Script in SourceCode/analysis.R, git commit zy123abc | Clusters: 12                | Additional information
00003 | 2023-01-13 | A. Smith | Processed: ID 00001 | R Script in SourceCode/analysis.R, git commit 12gea24b | Clusters: 7                 | Additional information
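Assuming the ID table is kept as a CSV file with the columns shown above (the file name is hypothetical), a short Python sketch can follow the Origin links to trace the provenance of a derived dataset back to its raw data:

    import csv

    def load_id_table(path):
        """Read the ID table (CSV with the columns shown above) into a dict keyed by ID."""
        with open(path, newline="", encoding="utf-8") as f:
            return {row["ID"]: row for row in csv.DictReader(f)}

    def provenance_chain(table, dataset_id):
        """Follow 'Processed: ID xxxxx' links until the original raw dataset is reached."""
        chain = [dataset_id]
        origin = table[dataset_id]["Origin"]
        while origin.startswith("Processed: ID"):
            dataset_id = origin.split("ID")[-1].strip()
            chain.append(dataset_id)
            origin = table[dataset_id]["Origin"]
        return chain

    table = load_id_table("id_table.csv")    # hypothetical file name
    print(provenance_chain(table, "00003"))  # e.g. ['00003', '00001']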

Within the investigation folder, there will be further subfolders containing different kinds of data according to a common structural scheme. The folders that are needed typically differ by research domain but might include a subset of the following:

  • software source code
  • raw data
  • processed or derived data
  • instrument configuration files
  • documentation

Within the investigation folder itself, there should be the ID table and a readme file describing the investigation.

The structure of such a file system with two investigations will look something like this:

Example folder structure for the method-focused data organization approach
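Such a skeleton can also be created automatically. The following Python sketch sets up the subfolders listed above together with an empty ID table and a readme file; the project name, investigation name and exact folder names are hypothetical and should follow your group's conventions.

    from pathlib import Path

    # Subfolder names follow the list above; adjust them to your own scheme.
    SUBFOLDERS = ["SourceCode", "RawData", "ProcessedData", "InstrumentConfig", "Documentation"]

    def create_investigation(project_root: str, investigation: str) -> None:
        """Create the folder skeleton, an empty ID table and a readme for a new investigation."""
        base = Path(project_root) / investigation
        for name in SUBFOLDERS:
            (base / name).mkdir(parents=True, exist_ok=True)
        id_table = base / "id_table.csv"
        if not id_table.exists():
            id_table.write_text("ID,Date,Creator,Origin,Tool,Parameters,Description\n", encoding="utf-8")
        readme = base / "README.txt"
        if not readme.exists():
            readme.write_text(f"Investigation: {investigation}\nDescription: \n", encoding="utf-8")

    # Hypothetical project and investigation names.
    create_investigation("ProjectX", "Investigation_001_LaserMeasurements")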

There are two common ways to create unique identifiers, both illustrated in the sketch after the following list. Please note that raw and processed data in the same investigation may be labelled in different styles.

  • Based on the date the respective dataset was created
    • YYYYMMDD_[key]_[number]
    • The key might, for example, refer to a process-ID, an instrument-ID, or an acquisition method
    • This format is particularly suitable for raw data as it is often referred to by acquisition date
  • Consecutively numbering the datasets
    • [5-digit-number]_[key]_[optional_tag]
    • This format "is more suitable for processed data often obtained in an iterative process which can take longer than one day, week or even month"

Structured data formats

As an alternative, or in addition, to relying on a convention for folder structure and file naming, you can also use structured file formats for organizing your data. These formats offer an internal organization similar to a folder structure and the possibility to incorporate data and documenting metadata within the same file, resulting in a data object that is inseparable from its documentation. Examples are the Hierarchical Data Format (HDF5) and the Research Object Crate (RO-Crate). A workshop on the use of HDF5 in engineering research, created by the working group for fluid systems at TU Darmstadt, can be found on GitLab. Approaches relying on structured data formats become especially powerful when combined with semantic, RDF-based metadata, resulting in self-documenting digital objects.
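As an illustration, the following Python sketch uses the h5py library to store raw data, derived data and their documenting metadata in a single HDF5 file; the group and attribute names are hypothetical examples, not a prescribed standard.

    import h5py
    import numpy as np

    # Raw data, processed data and documenting metadata in one HDF5 file.
    with h5py.File("investigation_001.h5", "w") as f:
        f.attrs["creator"] = "J. Doe"
        f.attrs["investigation"] = "Laser measurements on sample A-17"

        raw = f.create_group("raw_data")
        signal = raw.create_dataset("00001_signal", data=np.random.rand(300))
        signal.attrs["instrument"] = "Expensive Instrument, software version 1.3"
        signal.attrs["runtime_s"] = 300
        signal.attrs["gain_V"] = 450

        processed = f.create_group("processed_data")
        clusters = processed.create_dataset("00002_clusters", data=np.arange(12))
        clusters.attrs["derived_from"] = "/raw_data/00001_signal"
        clusters.attrs["tool"] = "analysis.R, git commit zy123abc"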

Data organization software

Electronic laboratory notebooks (ELNs)

For research that relies heavily on experiments, we recommend using electronic laboratory notebooks (ELNs) to document past experiments and plan future ones.

ELNs are specialized tools that help researchers keep track of their experiments and the data generated during the course of the research. In brief, they are a digital version of the classic, handwritten lab journal but offer a lot more features that will facilitate your day-to-day lab routines.

Those advanced features might include:

  • structured metadata creation
  • links between related experiments, datasets, equipment, or resources
  • linking to or uploading research data
  • file history to track data versions
  • direct upload of metadata and data from instruments
  • field-specific and full-text search functions
  • templates for ensuring creation of consistent data and metadata
  • collaborative documentation and data sharing for multi-researcher projects
  • backup of lab notebooks

Implementing an ELN, on the other hand, can present several challenges:

  • Selecting an ELN that suits your group's needs while avoiding vendor lock-in
  • Training the team, which requires dedicated time and effort
  • Clarifying access and rights management
  • Potential additional costs for software licenses and hardware (e.g. tablets)

Selecting the right ELN for your team

There is a multitude of ELNs available, both as free software and as commercial tools. The exact functionality as well as the user interface differs between the various solutions. This is especially true because there are both highly generic tools that try to cover the essentials of many disciplines and tools that focus on specific domains to offer specialized features. To identify software solutions that might be suitable for your research group, we recommend the ELN Finder service that the University and State Library Darmstadt offers in collaboration with ZB MED.

The TUdata team will assist you in getting into contact with active users of ELN solutions. Please send us an email if you are interested in learning more about a specific ELN. Additionally, if you are interested in ELNs, please feel free to join the TU Darmstadt ELN mailing list.

The following table lists projects at TU Darmstadt that use an ELN. If your project or research group also uses an ELN, please get in touch to be listed as well.

Project / Research group               | ELN software
SFB TRR 270 HoMMage                    | eLabFTW
Department of Physical Metallurgy      | eLabFTW
Ecological networks                    | eLabFTW
Macromolecular and Paper Chemistry     | eLabFTW
Organic Structure Analysis             | OpenEnventory, LOGS-ELN
Institute for Condensed Matter Physics | eLabFTW

Info

eLabFTW is currently being tested as a campus-wide offer by the university computing centre. If you would like to give it a try, send an email to .

Useful ELN resources

eLabFTW

File versioning software

Manually tracking changes to files has significant challenges, especially when it comes to scalability: a high frequency of changes, a high number of data files, or multiple scientists working on the same files in a collaborative setting. Several modern software solutions exist for file versioning, with git having become a de-facto standard in many settings. Since TU Darmstadt offers a git service called GitLab (see below), we will focus on this tool here.

git

Git allows for tracking, merging, and reverting changes with high efficiency for text files such as source code, but also for data stored in a textual format (such as CSV or XML files). Self-study courses exist that can help you familiarize yourself with the options offered by git.
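To give a feel for the basic workflow, here is a minimal sketch that calls the git command line from Python to put a text-based data file under version control; the folder and file names are hypothetical.

    import subprocess
    from pathlib import Path

    investigation = Path("Investigation_001_LaserMeasurements")  # hypothetical folder

    def git(*args: str) -> None:
        """Run a git command inside the investigation folder."""
        subprocess.run(["git", *args], cwd=investigation, check=True)

    git("init")                               # put the folder under version control
    git("add", "ProcessedData/clusters.csv")  # stage a text-based data file
    git("commit", "-m", "Add clustering results (ID 00002)")
    git("log", "--oneline", "--", "ProcessedData/clusters.csv")  # show the file's history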

GitLab

Using a central repository with advanced features makes git even more powerful than using it only on a local computer. One such software is GitLab, which provides, among other things, additional tools for collaborative work and features for automatically executed workflows. In collaboration with RWTH Aachen, TU Darmstadt offers a GitLab service to its scientists free of charge. Storage is limited to 2 GB per project. Please see the TU-GitLab Website for more information. Several online resources can help you get familiar with the basic features offered by GitLab.

The University and State Library also provides textbooks on git and GitLab that can be found in TUfind. One example is the eBook Mastering Git. A Beginner's Guide, which covers both git and GitLab.

Quality control

In general, researchers are the experts on the methods they work with and on the quality criteria their data has to meet. However, there are certain procedures and tools that might help you with organizing quality control.

When creating a data management plan, you typically envisage the types of measurements you will carry out and thus the structure of the data you will obtain. At this stage, also collect criteria that call for at least a closer examination, if not a repetition, of a measurement. One such criterion might be whether values stay within the instrument's measuring range.

If possible, put those criteria in a machine-readable form to be able to exploit automatic tools for data plausibility checks. For example, electronic lab notebooks may come with features that enable automatic data validation against pre-defined templates.
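As an illustration, a minimal Python sketch of such machine-readable criteria and an automatic range check; the parameter names and limits are hypothetical.

    # Machine-readable plausibility criteria; parameter names and ranges are hypothetical.
    CRITERIA = {
        "temperature_K": {"min": 270.0, "max": 1500.0},
        "gain_V": {"min": 0.0, "max": 1000.0},
    }

    def check_record(record: dict) -> list[str]:
        """Return warnings for missing values or values outside the defined ranges."""
        warnings = []
        for name, limits in CRITERIA.items():
            value = record.get(name)
            if value is None:
                warnings.append(f"{name}: value missing")
            elif not limits["min"] <= value <= limits["max"]:
                warnings.append(f"{name}: {value} outside [{limits['min']}, {limits['max']}]")
        return warnings

    print(check_record({"temperature_K": 2100.0, "gain_V": 450.0}))
    # ['temperature_K: 2100.0 outside [270.0, 1500.0]']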

Automation, for example the automatic transfer of information from a measurement device to a data organization system, can by itself help ensure data quality. The less you have to do manually, the easier it is to avoid errors such as forgetting to copy part of the data or introducing typos.

When it comes to manually validating research data, it is best to establish clear processes and workflows early on. For example, should data be checked and validated only by the researcher who gathers the data, or by a second pair of eyes? Again, there are tools that might help with such a review process, including workflows that can be defined in electronic lab notebooks.

Not only the data itself but also the metadata describing it must be of sufficient quality. Therefore, think about what information about the data and the context of its generation you will need at later stages for it to remain useful. Create the metadata in as structured and standardized a fashion as possible. One way to do this is to extract metadata terms from existing metadata vocabularies and to define mandatory and optional terms. In effect, you define a metadata application profile that specifies how your data should be described.
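A minimal sketch of what such an application profile could look like in Python; the chosen terms are only examples (here borrowed from Dublin Core term names) and would normally be drawn from a vocabulary agreed on in your field.

    # Application profile: metadata terms with a mandatory/optional flag.
    # The terms shown are hypothetical examples.
    PROFILE = {
        "title":       {"mandatory": True},
        "creator":     {"mandatory": True},
        "created":     {"mandatory": True},
        "description": {"mandatory": False},
        "license":     {"mandatory": False},
    }

    def validate_metadata(record: dict) -> list[str]:
        """Check a metadata record against the application profile."""
        missing = [term for term, rule in PROFILE.items()
                   if rule["mandatory"] and not record.get(term)]
        return [f"mandatory term missing: {term}" for term in missing]

    print(validate_metadata({"title": "Temperature sweep", "creator": "J. Doe"}))
    # ['mandatory term missing: created']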

Material

The SIG “Data organization and documentation” has created a checklist to help you answer typical questions about data organization and documentation and implement relevant points.

You can download the Word document and adapt it to your specific needs: checklist.docx (last updated: 10-09-2024)