LibGuides: Research Data Management: Organizing Data

File Naming

At the start of a research project, it is easy to believe that you'll remember what name you gave to a file and where you put it. However, once your research gets underway, there may be multiple files in various formats, multiple versions, websites, citations, blogs, articles, methodologies, notes, spreadsheets, etc., all relating to your research. Trying to find a data file that you need that has been stored or named incorrectly or inaccurately can be both frustrating and a waste of valuable time.

Good file management practices are therefore essential to enable you to identify, locate and use your research data files efficiently and effectively. Additionally, good file management practices such as group file naming protocols are also required should you wish to share your files with others in a shared file space.

A key aspect of successfully managing your data is naming and organizing your files and associated folders effectively. A file name is a chief identifier for a research data file.

A good file and folder naming strategy will help you quickly find the files you need, easily understand what a particular data file is and what it contains, and differentiate between different files and different versions of the same file.

Adopting a sensible file naming strategy and applying it consistently will provide an audit trail for the development of your data. This will help prevent confusion when working on files, particularly when working collaboratively with others, and ensure data files are not accidentally overwritten or deleted.

There are three main criteria to consider regarding the naming and labeling of research data files, namely:

Organization - important for future access and retrieval and needs to take into account the file naming constraints of the system where the file is located.

Context - this could include content-specific or descriptive information, independent of where the data are stored.

Consistency - choose a naming convention and ensure that the rules are followed systematically by always including the same information in the same order.

The following list includes a number of common elements that should be considered when developing a file naming strategy:

Description of the content
Project number
Name of creator
Name of research team/department associated with the data
Date of creation; Publication date
Version number

Decide what labels are appropriate for your own data files and be consistent in applying them. Labelling your files effectively can provide an audit trail for the development of your data.
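
As an illustration of applying the same elements in the same order, here is a minimal Python sketch. The function name build_file_name, the chosen elements and their order, and the sample values (P042, soil-ph) are assumptions made for this example rather than a prescribed convention.

```python
from datetime import date


def build_file_name(project, description, created, version, extension):
    """Join the naming elements in one fixed order, separated by underscores."""
    parts = [
        project,              # project number
        description,          # short description of the content
        created.isoformat(),  # date of creation as YYYY-MM-DD
        f"v{version}",        # version number
    ]
    # The only full stop is the one that separates the extension.
    return "_".join(parts) + "." + extension


name = build_file_name(
    project="P042",
    description="soil-ph",
    created=date(2024, 3, 7),
    version=1,
    extension="csv",
)
print(name)  # P042_soil-ph_2024-03-07_v1.csv
```

Other elements from the list, such as the creator's initials or the team name, could be added as further parts, provided every file uses them in the same position.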

There are a number of easy-to-follow rules for naming files that will help to improve the use and re-use of your data:

1. Keep file names short and relevant

File names should be short and relevant - generally about 25 characters is a sufficient length to capture enough descriptive and contextual information for a data file.

2. Do not use special characters

Do not use special characters in a filename, such as: £"$%!”¬&*^()+=[]{}~@:;#,.<> as these are often used for specific tasks in different operating systems.

3. Do not use full-stops or spaces

Like special characters, full-stops and spaces are parsed differently on different systems - consider using underscores or hyphens instead.
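
Rules 2 and 3 can be applied automatically. The sketch below is one possible approach, assuming that only unaccented letters, digits, underscores and hyphens should survive; the function name sanitize_stem is hypothetical, and it expects the name without its extension.

```python
import re


def sanitize_stem(stem):
    """Return a name made only of letters, digits, underscores and hyphens."""
    # Replace each run of spaces or full stops with a single underscore.
    cleaned = re.sub(r"[\s.]+", "_", stem.strip())
    # Drop every remaining character outside the allowed set
    # (this also drops accented letters, which is an assumption of this sketch).
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "", cleaned)
    # Collapse repeated underscores left behind by removed characters.
    cleaned = re.sub(r"_+", "_", cleaned)
    return cleaned.strip("_-")


print(sanitize_stem("Interview notes (draft) #3"))  # Interview_notes_draft_3
```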

4. Date formatting

If including dates, format them consistently following the format Year-Month-Day:
YYYY-MM-DD
YYYY-MM
YYYY-YYYY

Formatting dates in this way maintains the chronological order and simplifies the process of sorting and browsing your data files.
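
A short Python sketch shows why this works: with the year first, plain alphabetical sorting is also chronological sorting. The file names below are invented for the example.

```python
from datetime import date

# Year-Month-Day: alphabetical order matches chronological order.
files_ymd = [
    "survey_2024-01-15.csv",
    "survey_2023-11-02.csv",
    "survey_2024-10-03.csv",
]
print(sorted(files_ymd))
# ['survey_2023-11-02.csv', 'survey_2024-01-15.csv', 'survey_2024-10-03.csv']

# Day-Month-Year: the same three dates sort out of chronological order.
files_dmy = [
    "survey_15-01-2024.csv",
    "survey_02-11-2023.csv",
    "survey_03-10-2024.csv",
]
print(sorted(files_dmy))
# ['survey_02-11-2023.csv', 'survey_03-10-2024.csv', 'survey_15-01-2024.csv']

# date.isoformat() produces the YYYY-MM-DD form directly.
stamp = date(2024, 3, 7).isoformat()
print(stamp)  # 2024-03-07
```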

5. Case dependency

Do not assume that a software application or instrument distinguishes between upper and lower case when naming or renaming files. Assume that TANGO, Tango and tango are treated as the same name, even though some file systems consider them different.
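
One way to check a set of file names for this problem is to group them by their case-folded form. This is a minimal sketch; the function name find_case_collisions and the sample names are assumptions for illustration.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Return groups of names that differ only in upper/lower case."""
    groups = defaultdict(list)
    for name in names:
        # casefold() gives a case-insensitive key for comparison.
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]


collisions = find_case_collisions(
    ["TANGO.csv", "Tango.csv", "tango.csv", "waltz.csv"]
)
print(collisions)  # [['TANGO.csv', 'Tango.csv', 'tango.csv']]
```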

Source: MANTRA

Versioning Files

It is important to identify and distinguish versions of research data files consistently. Versioning your files ensures that a clear audit trail exists for tracking the development of a data file and identifying earlier versions when needed.

When versioning files it is common practice to use consecutive numbering for major version changes, with decimals used for minor changes (v1; v1.1; v2.1; v2.2).
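
The major/minor scheme can be sketched in Python as follows. The "_v" suffix, the function name bump_version and the sample name soil-ph are assumptions made for this example; the function works on the file name without its extension.

```python
import re

# Matches names ending in _v<major> or _v<major>.<minor>, e.g. soil-ph_v1.2
VERSION_PATTERN = re.compile(
    r"^(?P<stem>.+)_v(?P<major>\d+)(?:\.(?P<minor>\d+))?$"
)


def bump_version(name, major_change=False):
    """Return the name with its version suffix advanced by one step."""
    match = VERSION_PATTERN.match(name)
    if match is None:
        # No version suffix yet: start at v1.
        return f"{name}_v1"
    stem = match.group("stem")
    major = int(match.group("major"))
    minor = int(match.group("minor") or 0)
    if major_change:
        # Major change: next whole number, minor part dropped.
        return f"{stem}_v{major + 1}"
    # Minor change: same major number, next decimal.
    return f"{stem}_v{major}.{minor + 1}"


print(bump_version("soil-ph"))                          # soil-ph_v1
print(bump_version("soil-ph_v1"))                       # soil-ph_v1.1
print(bump_version("soil-ph_v1.1"))                     # soil-ph_v1.2
print(bump_version("soil-ph_v1.2", major_change=True))  # soil-ph_v2
```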

Record every change irrespective of how minor that change may be. However, keep in mind that too many similar or related files may be confusing, both to yourself and to anyone else wanting to access or use your data. You may think that you know which data file is which but that may not always be the case as time passes and the number of different file versions grows.

It is easier to maintain a manageable number of versions with a clear naming structure. As long as the original 'raw' or definitive copy is retained and processing is well documented, intermediate working files can be discarded.

A number of dedicated and cloud-based tools provide support for versioning files.

Dedicated tools, such as Subversion and TortoiseSVN, provide systems for versioning files.

When working on files collaboratively, cloud-based services such as wikis and Google Docs provide automatic version tracking, allowing you to 'roll back' to older versions if necessary.

Microsoft SharePoint and OneDrive are cloud-based file storage and file sharing services which have in-built backup and file versioning tools. These can be particularly useful when working on files collaboratively.

The Open Science Framework (OSF) provides a free and open source project management tool. The OSF tool facilitates collaboration with external partners and includes functionality to organise, version, document and share research files and outputs.

Source: MANTRA

Organizing Codes

Data is seldom analysed in its raw form, and the term 'research software' refers to the code or processing scripts used to clean, transform or analyse research data.

Your code may be written in scripting languages such as UNIX shell or Python, in traditional programming languages such as C, C++, FORTRAN or Java, or in data analysis packages such as R or MATLAB.

For a research project this might include:

a few lines of a shell script to clean or filter your data
a set of R commands to generate graphs
Python code for text mining
SPSS syntax file for your statistical analysis
10,000 lines of Java for medical image analysis

Regardless of the scale, the syntax and code you generate are important research outputs required to validate your work. As with the data itself, you need to implement a strategy for naming, versioning and archiving your research software.

There are a number of useful tools available designed to help you track and version your code. The most widely used are based on the Git versioning system, and these include:

GitHub: https://github.com
GitLab: https://about.gitlab.com
Bitbucket: https://bitbucket.org
Gitea: https://gitea.io/en-us

These tools are designed specifically to track the development of your code and allow you to version it. They support and promote collaborative working, and allow you to make your code publicly available when appropriate. These services commonly provide some level of free access.

If your code is extensive, it is highly recommended to use a Git repository. There are extensive training videos on using Git repositories available online which will help you to use your repository of choice correctly and make the most of its functionality.

If your code base is small and comprises only one or two files you may prefer to use a manual versioning system instead.
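
A manual versioning system can be as simple as saving a numbered copy of a script before each round of changes. The sketch below is one possible approach, assuming a "_vN" suffix; the function name save_new_version and the file clean_data.py are hypothetical, and the demonstration runs in a temporary folder so that no real files are touched.

```python
import shutil
import tempfile
from pathlib import Path


def save_new_version(path):
    """Copy a file to the next free _vN name in the same folder."""
    path = Path(path)
    number = 1
    while True:
        candidate = path.with_name(f"{path.stem}_v{number}{path.suffix}")
        if not candidate.exists():
            break
        number += 1
    # copy2 also preserves the file's timestamps where possible.
    shutil.copy2(path, candidate)
    return candidate


with tempfile.TemporaryDirectory() as folder:
    script = Path(folder) / "clean_data.py"
    script.write_text("print('cleaning')\n")
    first = save_new_version(script)
    second = save_new_version(script)
    print(first.name, second.name)  # clean_data_v1.py clean_data_v2.py
```

Existing copies are never overwritten, because each call looks for the first version number that is not yet in use.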

Source: MANTRA

Computational Notebooks

Computational notebooks provide a workspace for writing and developing code, facilitating the sharing and publishing of script workflows.

Perhaps the most commonly known computational notebook, Jupyter Notebook provides an open-source web-based application for creating and sharing documents that contain live code, equations, visualisations and narrative text.

In addition to tracking the development of code, Jupyter Notebook can be used directly for cleaning and transforming data, numerical simulation, statistical modeling and more.

Source: MANTRA

Electronic Lab Notebooks

Other tools, such as electronic lab notebooks (ELNs), can support the organisation of research data.

ELNs perform the same function as paper lab notebooks, and also enable better sharing and searching for data.

Some electronic lab notebooks also support integration with other tools used in the lab, and with research infrastructure.

Source: MANTRA