Data Management

General Introduction

Data management can seem relatively trivial to beginners, but starting with best practices will make your future work significantly easier and help ensure your data is well organized and secure.

Where are we now?

Before we start a lesson, it is usually helpful to reflect on what we already know about a certain topic and what the lesson might try to teach us.

So please take a few minutes to reflect on the concept of data management with the following questions.

1.1 What is your understanding of data management?

1.2 What could you possibly learn?

1.3 How do you usually store/manage data?

Note

Feel free to do this in your head or in a separate document. Remember, to interactively engage with the material, either open it in MyBinder (the small rocket button at the top of the website) or download the course material (the small download button at the top), go through the setup process, and open this file (i.e., digital_literacy.ipynb in the introduction folder) using Jupyter Notebook or VS Code.

Roadmap

  • Goals

  • Data management

    1. Data management plan

    2. Setup local folder structure (BIDS)

  • Data storage

    1. Open Brain Consent Form/GDPR & you

    2. Open data

Goals

The specific goals of this first session are to:

  • gain a general understanding of data management

  • get familiar with the data management process

  • provide a checklist to follow

  • understand why project design is an essential step

Data Management


1. Data Management Plan

An initial step when starting any research project should be to set up a data management plan (DMP). This helps you to flesh out, describe, and document what data exactly you want to collect, what you’ll be doing with it, and where and how it’s stored and eventually shared.

A DMP helps you stay organized and reduces the potential for surprises in the future (e.g., due to too limited data storage capacities or unexpected costs). It is at times also required by, e.g., your University or agencies funding research endeavors.

Motivation

For the public good

  • if widely adopted, standardized data management makes it inherently easier to reproduce code and analysis pipelines built by others, thereby lowering scientific waste and improving efficiency

For yourself

  • You are likely the future user of the data and data analysis pipelines you’ve developed, so keeping your file structure standardized removes the need to remember where you’ve stored specific pieces of data, etc.

  • Enables and simplifies collaboration: it allows readers/collaborators to gain a quick understanding of what data you’ll be collecting, where this data can be found, and what exactly you’re planning on doing with it

  • Reviewers and funding agencies like to see clear, reproducible results

  • Open-science-based funding opportunities and awards are available (for instance, the OHBM Replication Award, Mozilla Open Science Fellowship, Google Summer of Code, and so on).


FAIR principles

FAIR principles: The FAIR principles stand for Findable, Accessible, Interoperable, and Reusable. These principles aim to make research data more accessible and reusable by promoting the use of standardized metadata, persistent identifiers, and open data formats. This allows research not only to be shared but also to be actually found.


What to consider in your data management plan

Most universities provide templates, tools, or guidance on how to create a DMP, so it is a good idea to check your university’s online presence or get in contact with your local library.

For the Goethe University Frankfurt, researchers can use the following tool: (German) Datenmanagementpläne mit dem Goethe-RDMO

There are also public tools to collect and share DMPs, such as DMPonline for the UK.

Here, you also find publicly published plans that you can use to check what your DMP could/should contain.

The Turing Way Project lists the following considerations when creating a DMP. Many of the specific points of this checklist have already been discussed in the previous steps.

Turing way DMP checklist

1. Roles and Responsibilities of project team members

- discuss who is responsible for different tasks related to project/data management
- e.g., who is responsible for maintaining the dataset, who takes care of the research ethics review

2. Type and size of data collected and documentation/metadata generated

- i.e., raw, preprocessed, or finalised data (these lead to different considerations, as, e.g., raw data can generally not be openly shared)
- the expected size of the dataset
- how well the dataset is described in additional (metadata) files:
    - what abbreviations are used; how are, e.g., experimental conditions coded
    - where, when, and how was the data collected
    - description of the sample population

3. Type of data storage used and backup procedures that are in place

- where is data stored
- data protection procedures
- how are backups handled, i.e. location and frequency of backups
- will a version control system be used?
- directory structure, file naming conventions

4. Preservation of the research outputs after the project.

- public repositories or local storage
- e.g. OSF

5. Reuse of your research outputs by others

- is the code and coding environment shared? (e.g., GitHub)
- conditions for reuse of the collected dataset (licensing etc.)

6. Costs

- potential costs of equipment and personnel for data collection
- costs for data storage

To create your DMP, you can either use the discussed tools or create a first draft by noting your thoughts/expectations regarding the above checklist in a document.
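If you'd rather start from a blank document, the checklist above can be turned into an empty outline programmatically. The following Python sketch writes such a skeleton; the filename dmp_draft.md and the shortened section titles are illustrative choices, not part of the Turing Way checklist itself.

```python
from pathlib import Path

# Sketch: generate a DMP draft skeleton as a Markdown file.
# Section titles are shortened versions of the Turing Way checklist above;
# the output filename "dmp_draft.md" is just an example.

DMP_SECTIONS = [
    "Roles and Responsibilities",
    "Type and Size of Data Collected",
    "Data Storage and Backup Procedures",
    "Preservation of Research Outputs",
    "Reuse of Research Outputs by Others",
    "Costs",
]

def write_dmp_skeleton(path="dmp_draft.md", sections=DMP_SECTIONS):
    """Write an empty DMP outline with one heading per checklist item."""
    lines = ["# Data Management Plan\n"]
    for i, section in enumerate(sections, start=1):
        lines.append(f"## {i}. {section}\n")
        lines.append("- TODO\n")
    Path(path).write_text("\n".join(lines))
    return path

write_dmp_skeleton()
```

You can then fill in each TODO item with your notes and expectations.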

2. Setup local folder structure

It is recommended to adopt a standardized approach to structuring your data, as this not only helps you stay consistent but also allows you and possible collaborators to easily identify where specific data is located.

General File Naming Conventions

To make sure that it is easily understood what a file contains and to make files easier for computers to process, you should follow certain naming conventions:

- be consistent
- use the date in the format YYYYMMDD
- use underscores `(_)` instead of spaces, or
- use camelCase (capitalize the first letter of each subsequent word in a phrase) instead of spaces
- avoid spaces, special characters `(+-"'|?!~@*%{[<>)`,  punctuation `(.,;:)`, slashes and backslashes `(/\)`
- avoid "version" names, e.g., v1, vers1, final, final_really, etc. (instead, use a version control system like Git)
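These conventions are easy to check automatically. The following Python sketch implements a rough validator for the rules above; the regular expressions and the helper name is_well_named are our own simplification, so treat it as a starting point rather than a complete rule set.

```python
import re

# Sketch: check a filename against the conventions listed above.
# The patterns are a rough approximation of the rules in the text;
# e.g., the version-word check will also flag words like "finalised".

VERSION_WORDS = re.compile(r"(v\d+|vers\d+|final)", re.IGNORECASE)

def is_well_named(filename: str) -> bool:
    """Return True if `filename` roughly follows the naming conventions."""
    stem, _, ext = filename.rpartition(".")
    if not stem or not ext:                  # missing name or extension
        return False
    if re.search(r"[^A-Za-z0-9_\-]", stem):  # spaces, special characters, ...
        return False
    if VERSION_WORDS.search(stem):           # "version" names (v1, final, ...)
        return False
    return True

print(is_well_named("20230115_jane-doe_project-report.txt"))  # True
print(is_well_named("report final!.txt"))                     # False
```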

Establish a folder hierarchy

Before you begin working on your project, you should start setting up the local folder structure on your system. This helps you keep organized and saves you a lot of work in the long run.

Your folder hierarchy, of course, depends on your project’s specific needs (e.g., folders for data, documents, images, etc.) and should be as clear and consistent as possible. The easiest way to achieve this is to copy and adapt an already existing folder hierarchy template for research projects.

One example (including a template) is the Transparent project management template for the OSF platform by C.H.J. Hartgerink.

The contained folder structure would then look like this:

project_name/
    ├── archive/
    ├── analyses/
    ├── bibliography/
    ├── data/
    ├── figure/
    ├── functions/
    ├── materials/
    ├── preregister/
    ├── submission/
    └── supplement/
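Rather than clicking folders together by hand, you can create such a template with a few lines of Python; the snippet below uses the folder names from the diagram above (the function name create_project is our own).

```python
from pathlib import Path

# Sketch: create the template hierarchy from the diagram above.
# Folder names are taken directly from the OSF template.

TEMPLATE = [
    "archive", "analyses", "bibliography", "data", "figure",
    "functions", "materials", "preregister", "submission", "supplement",
]

def create_project(root="project_name", folders=TEMPLATE):
    """Create the project root with one subfolder per template entry."""
    root = Path(root)
    for name in folders:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root

create_project("my_project")
```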

Another example would be the “research project structure” by Nikola Vukovic

Where the folder hierarchy would look like this:


project_name/
    ├── projectManagement/
    │   ├── proposals/
    │   ├── finance/
    │   └── reports/
    ├── EthicsGovernance/
    │   ├── ethicsApproval/
    │   └── consentForms/
    ├── ExperimentOne/
    │   ├── inputs/
    │   ├── data/
    │   ├── analysis/
    │   └── outputs/
    └── Dissemination/
        ├── presentations/
        ├── publications/
        └── publicity/



Incorporating experimental data/BIDS standard

Now, both of these examples provide an “experiment folder” but tend to utilize/establish their own standards.

However, we aim to make our folder structure easily understandable, interoperable (e.g., between systems and programs), and reproducible. Therefore, it is best to adapt our “experiment folder” to industry standards.

For most experimental data, the most promising approach to this is BIDS (Brain Imaging Data Structure). Originally conceptualized as a standardized format for the organization and description of fMRI data, the format can be extended to encompass other kinds of neuroimaging and behavioral data. Using the BIDS standard simplifies the integration of your data into most neuroscience analysis pipelines.


The BIDS (Brain Imaging Data Structure) standard is a community-driven specification that aims to facilitate the organization and sharing of neuroimaging data. It specifies a common format for storing and organizing neuroimaging data, including MRI, EEG, MEG, and iEEG, and can additionally be used to store behavioral data.

BIDS defines a specific folder hierarchy for organizing neuroimaging data, split into several separate folders, each with a specific purpose. As BIDS is mostly concerned with our data, it provides a standardized way to organize the data folder in the diagrams above. The data folder would then be structured in the following way.

data/
    ├── derivatives/
    └── subject/
        └── session/
            └── datatype/

/derivatives: contains processed data, such as the results of statistical analyses.

/sub- folder: contains data from one subject. Each subject is identified by a unique code that starts with “sub-“. This folder contains subfolders for each imaging session, each of which contains separate folders for each datatype (as in the diagram above) recorded for this specific subject.
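The sub/session/datatype hierarchy can be expressed as a small path-building helper. In the Python sketch below, the zero-padded two-digit labels (sub-01, ses-01) follow a common convention; the helper name bids_dir is our own.

```python
from pathlib import Path

# Sketch: build a BIDS-style directory path (sub-XX/ses-XX/datatype).
# The zero-padded two-digit labels are a common convention, not a strict rule.

def bids_dir(root, sub, ses, datatype):
    """Return e.g. data/sub-01/ses-01/func as a Path object."""
    return Path(root) / f"sub-{sub:02d}" / f"ses-{ses:02d}" / datatype

print(bids_dir("data", 1, 1, "func").as_posix())  # data/sub-01/ses-01/func
```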


Neuroimaging datasets mostly contain data from more than one subject; the data folder will therefore contain multiple subject folders, named sub-01, sub-02, ..., sub-n. This could look something like this:

project_data
    ├── dataset_description.json
    ├── participants.tsv
    ├── derivatives
    ├── sub-01
    │   ├── anat
    │   │   ├── sub-01_inplaneT2.nii.gz
    │   │   └── sub-01_T1w.nii.gz
    │   └── func
    │       ├── sub-01_task-X_run-01_bold.nii.gz
    │       ├── sub-01_task-X_run-01_events.tsv
    │       ├── sub-01_task-X_run-02_bold.nii.gz
    │       ├── sub-01_task-X_run-02_events.tsv
    │       ├── sub-01_task-X_run-03_bold.nii.gz
    │       └── sub-01_task-X_run-03_events.tsv
    ├── sub-02
    │   ├── anat
    │   │   ├── sub-02_inplaneT2.nii.gz
    │   │   └── sub-02_T1w.nii.gz
    │   └── func
    │       ├── sub-02_task-X_run-01_bold.nii.gz
    │       ├── sub-02_task-X_run-01_events.tsv
    │       ├── sub-02_task-X_run-02_bold.nii.gz
    │       └── sub-02_task-X_run-02_events.tsv

    ...
    ...

We’ll not go into detail about the different neuroimaging files here (the .nii.gz files), but there is another thing we can learn from this standard: The inclusion of metadata.


In the above diagram, you find two metadata files:

The participants.tsv file contains information about the participants in the study, such as demographics, behavioral data, and other relevant information. Specifically, it typically contains a row for each participant and columns that describe various aspects of each participant, such as their age, sex, handedness, cognitive scores, and any clinical diagnoses.
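Because participants.tsv is plain tab-separated text, it can be read with Python's standard csv module. The sketch below parses an inline example; the columns (age, sex) are typical but not mandatory.

```python
import csv
from io import StringIO

# Sketch: read a participants.tsv file with the standard csv module.
# The example rows and columns (age, sex) are typical, not mandatory.

example = "participant_id\tage\tsex\nsub-01\t24\tF\nsub-02\t31\tM\n"

def read_participants(tsv_text):
    """Parse TSV text into a list of dicts, one per participant row."""
    reader = csv.DictReader(StringIO(tsv_text), delimiter="\t")
    return list(reader)

rows = read_participants(example)
print(rows[0]["participant_id"])  # sub-01
```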

The dataset_description.json contains important metadata about the entire dataset, such as:

Name: A brief and informative name for the dataset.

License: The license under which the data are released (e.g., CC-BY-SA).

Authors: A list of individuals who contributed to the dataset.

Funding: Information about the funding sources that supported the creation of the dataset.

Description: A detailed description of the dataset, including information about the data collection methods, study participants, and any relevant processing or analysis steps that were taken.

Subjects: A list of the subjects (i.e., study participants) included in the dataset, including information about their demographics and any relevant clinical information.

Sessions: A list of the scanning sessions or experimental sessions that were conducted for each subject, including information about the acquisition parameters and any relevant task or stimulus information.

Task: Information about the task or stimulus used in the experiment, if applicable.

Modality: Information about the imaging modality used to acquire the data (e.g., fMRI, MRI, MEG, etc.).
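A minimal dataset_description.json can be written with Python's json module. In the sketch below, Name and BIDSVersion are the two fields required by the BIDS specification; all the concrete values are illustrative placeholders.

```python
import json

# Sketch: write a minimal dataset_description.json.
# "Name" and "BIDSVersion" are required by the BIDS specification;
# every concrete value below is an illustrative placeholder.

description = {
    "Name": "Example task dataset",
    "BIDSVersion": "1.8.0",
    "License": "CC-BY-SA-4.0",
    "Authors": ["Jane Doe", "John Doe"],
    "Funding": ["Example grant 01"],
}

with open("dataset_description.json", "w") as f:
    json.dump(description, f, indent=2)
```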


BIDS File naming conventions

The BIDS format also specifies how we should name our files, both to make sure that others understand what content to expect in a specific file and to make it easier to use automated tools that expect certain file names (e.g., for data analyses).


The general file naming convention looks like this:

key1-value1_key2-value2_suffix.extension

Where key-value pairs are separated by underscores (e.g., sub-01_task-01), followed by an underscore and a suffix describing the datatype (e.g., _events), which is followed by the file extension (e.g., .tsv). Resulting in:

sub-01_task-01_events.tsv
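The key-value scheme is simple enough to build and parse with a few lines of Python. The helper names build_name and parse_name below are our own; this is a sketch of the convention, not a full BIDS-aware parser.

```python
# Sketch: build and parse BIDS-style filenames.
# The helper names (build_name, parse_name) are our own.

def build_name(entities, suffix, extension):
    """Join key-value pairs with underscores, then append suffix and extension."""
    parts = [f"{key}-{value}" for key, value in entities.items()]
    return "_".join(parts + [suffix]) + extension

def parse_name(filename):
    """Split a BIDS-style filename back into key-value pairs and suffix."""
    stem, _, _ = filename.rpartition(".")
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    return entities, suffix

print(build_name({"sub": "01", "task": "01"}, "events", ".tsv"))
# sub-01_task-01_events.tsv
```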

It’s recommended that you adopt this file naming system and apply it to all of your files, e.g., your project report could be called:

firstname-lastname_project-report.txt

You may also want to add a date to non-data files (ideally in the year-month-day format (YYYYMMDD)), e.g.

YYYYMMDD_firstname-lastname_project-report.txt

Avoid adding descriptions such as version_01 or final_version etc. Instead, you should rely on digital tools with version history functionality, such as Google Docs. In the next section, we’ll further introduce the concept of a version control system to avoid this issue altogether.

To learn more

Check out the BIDS starter-kit


Data storage

Now that we have a standardized data format on our system, let’s turn to the topic of data storage. While your local file system may initially seem sufficient for small projects or purely behavioral research, for larger datasets and file sizes, such as in neuroimaging, you’ll quickly run out of storage. The same applies, of course, when you accumulate a larger number of smaller projects throughout your career. Further, do we really want to store data simply on some laptop until it is forgotten, deleted, and generally inaccessible to others who might make use of it?

1. Open Brain Consent Form/GDPR & you

(Un)fortunately, sharing data falls under the jurisdiction of local laws, e.g., the EU’s General Data Protection Regulation (GDPR), which applies in Germany. It is therefore essential to make sure that where and how you store or share the data you will be collecting, or have already collected, complies with the law.



2. Open data

Nowadays, there are a lot of cloud storage options that allow researchers to share their data, given certain rules and restrictions. The following is an incomplete but representative list of the storage options available to you.


Open Science Repositories: Repositories such as OSF, OpenNeuro, the OpenfMRI database, and the IEEE DataPort provide open access to MRI and EEG datasets, as well as other neuroimaging data.

U.S. National Institutes of Health: The National Institutes of Health (NIH) provides open access to many MRI and EEG datasets through the National Library of Medicine’s NIH Data Sharing Repository.

Research Data Repositories: Zenodo, Figshare, and other research data repositories allow scientists to store, share, and publish their data in an open and transparent manner. These repositories are often committed to open access principles and provide a centralized location for data and metadata, as well as version control and preservation features.

Research Collaborations: Collaborative projects, such as the Human Connectome Project, the International Neuroimaging Data-Sharing Initiative (INDI), and the ABIDE (Autism Brain Imaging Data Exchange) project, provide open access to large datasets collected from multiple sites.

Domain-Specific Repositories: There are also domain-specific repositories for various scientific fields, such as the NCBI Sequence Read Archive for genomics data, the European Geosciences Union Data Centre for Earth and Environmental Science data, and the International Astronomical Union’s SIMBAD database for astronomical data. These repositories often have specific requirements for data deposition and sharing but provide a dedicated space for researchers to share their data in an open and transparent manner.


Acknowledgments:

The go-to resource for creating and maintaining scientific projects was created by the Turing Way Project. We’ve adapted some of their material for the Data Management Plan section of the lesson above.

The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research (1.0.2). Zenodo. https://doi.org/10.5281/zenodo.7625728

Transparent project management template for the OSF platform by C.H.J. Hartgerink

We’ve adapted some of the directory structure diagrams from the BIDS Standard Guide.

Where are we now?

Please take a moment to reflect on if and how your understanding of data management has possibly changed following this lecture.

2.1 How well informed were you with regard to the current topic? (i.e., did you know what to expect?)

2.2 What was the most surprising part of the training/lecture for you? What information in particular stood out to you?