LibGuides: Research Data Management: Metadata

Metadata Introduction

What is metadata?

Metadata is data that describes other data and provides the context needed for future use. Documentation and metadata are key to understanding a dataset and to promoting reproducibility and replicability.

Types of Metadata

Descriptive: describes the data
Administrative: the intellectual property rights and preservation strategy for the data
Structural: how one piece of data relates to another piece of data within a project
Markup Languages: these provide annotations for documents. Some annotations specify how the document is to be structured or displayed, using tags or elements such as <p> in HTML. Other examples of markup languages are XML and Markdown.
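As a rough illustration of how a markup language can carry the metadata types listed above, here is a minimal Python sketch that builds a small XML record. The element names (record, title, license, isPartOf) and the values passed in are hypothetical, chosen only for this example; they do not follow any particular metadata standard.

```python
import xml.etree.ElementTree as ET


def build_record(title: str, license_name: str, part_of: str) -> str:
    """Return a small XML metadata record as a string.

    The element names are assumptions made for this sketch only.
    """
    record = ET.Element("record")
    # Descriptive metadata: what the data is
    ET.SubElement(record, "title").text = title
    # Administrative metadata: rights attached to the data
    ET.SubElement(record, "license").text = license_name
    # Structural metadata: how this piece relates to the larger project
    ET.SubElement(record, "isPartOf").text = part_of
    return ET.tostring(record, encoding="unicode")


# Hypothetical values, for illustration only
xml_text = build_record(
    "Baseline measurements", "CC BY 4.0", "Example smoking study"
)
print(xml_text)
```

Each tag plays the same role as <p> in HTML: it annotates the text it wraps so that a person or a program can tell what that text means.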

Reference: Qin, J., & Zeng, M. L. (2020). Metadata. ALA Neal-Schuman.

Things to Consider:

Bad Example:

Here is a made-up example of some data. It appears to be a study about treatment and smoking, but it is unclear what the treatments are, and there is no explanation of what Smoking_Cat means. Many questions would need to be answered before the data could be useful or interpretable.

Good Example:

Here is the same data with its corresponding metadata. Notice that the columns now have better descriptors, with units displayed. The descriptors also show that these were baseline measurements rather than mid- or post-treatment measurements. The treatment acronyms are explained at the bottom, along with dosing. Smoking_Cat is defined as Smoking Categories: the amount a person smokes, placed into numeric bins.

The example is not perfect, though, because the type of smoking (most likely cigarettes in this case, as opposed to vaping) is not described. The protocol is also missing. For example, how were subjects chosen for this study? Were there any exclusion criteria, such as age or presence of disease?

Figure: Bad Example (the data without metadata)

Figure: Good Example (the same data with its metadata)

Metadata for a research project

Metadata for your Research Project:

ReadMe File: ReadMe files can be very helpful for people trying to understand your data or your code. They are commonly used in code development to describe how to install and/or use software. However, they can also be used with any sort of files to describe a dataset contained within a folder. Consider creating a ReadMe file for your data. It can describe when, how, where, and under what conditions the data were generated. The file can also explain any naming convention that you use (see File Naming). The ReadMe file can be a simple text document so that anyone can open and read it.
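One possible way to start a ReadMe file is to generate a plain-text skeleton and fill it in by hand. The Python sketch below is only an illustration: the function name render_readme, the section labels, and every value passed to it are hypothetical, and the temporary folder stands in for your real data folder.

```python
import tempfile
from datetime import date
from pathlib import Path


def render_readme(title: str, collected: str, location: str,
                  methods: str, naming_convention: str) -> str:
    """Return the text of a simple plain-text ReadMe file."""
    lines = [
        title,
        "=" * len(title),
        "",
        f"ReadMe written: {date.today().isoformat()}",
        f"When the data were generated: {collected}",
        f"Where the data were generated: {location}",
        f"How and under what conditions: {methods}",
        f"File naming convention: {naming_convention}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values, for illustration only
readme_text = render_readme(
    title="Example smoking study (hypothetical)",
    collected="fill in the collection dates",
    location="fill in the site or lab",
    methods="fill in instruments, settings, and conditions",
    naming_convention="fill in your pattern, e.g. project_date_version",
)

# Save the ReadMe next to the data; a temporary folder is used here
with tempfile.TemporaryDirectory() as tmp:
    readme_path = Path(tmp) / "README.txt"
    readme_path.write_text(readme_text, encoding="utf-8")
    saved_text = readme_path.read_text(encoding="utf-8")

print(saved_text)
```

Writing the file as plain text keeps it readable without any special software, which is the main point of a ReadMe.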

Data Dictionary: A data dictionary describes any variables or acronyms that are used within a dataset. The "Good Example" above shows what a data dictionary looks like. It can also be a spreadsheet that describes different samples/patients and the protocol and/or treatment that they went through. Data dictionaries help the next person who wants to reproduce or reuse your work to understand the data. Creating metadata helps save time and furthers scientific investigation.
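A data dictionary can be kept as a small table with one row per variable. The Python sketch below shows one possible layout and a simple check for columns that have no entry yet. Only Smoking_Cat comes from the example above; the column headings, the Treatment and Subject_ID names, and the placeholder descriptions are assumptions made for illustration.

```python
import csv
import io

FIELDS = ["variable", "description", "type", "units_or_codes"]

# Hypothetical entries; replace the placeholder text with real definitions
DATA_DICTIONARY = [
    {
        "variable": "Smoking_Cat",
        "description": "Smoking Categories: amount of smoking, binned",
        "type": "integer",
        "units_or_codes": "define each numeric bin here",
    },
    {
        "variable": "Treatment",
        "description": "Treatment group",
        "type": "text",
        "units_or_codes": "spell out each acronym and its dose here",
    },
]


def dictionary_to_csv(entries: list[dict]) -> str:
    """Serialize the data dictionary as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(columns: list[str], entries: list[dict]) -> list[str]:
    """Return dataset columns that have no data dictionary entry."""
    documented = {entry["variable"] for entry in entries}
    return [name for name in columns if name not in documented]


csv_text = dictionary_to_csv(DATA_DICTIONARY)
missing = undocumented_columns(
    ["Subject_ID", "Treatment", "Smoking_Cat"], DATA_DICTIONARY
)
print(csv_text)
print(missing)  # ['Subject_ID']
```

Running a check like undocumented_columns before sharing a dataset is a cheap way to catch variables that nobody has described yet.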

Codebooks: Annotating your code is essential for others to understand what you did. Use comments liberally and consider making a codebook. A codebook is a document that describes the variables, parameters, and lines of code within your script. This will help others to reuse your code for their own projects. Lastly, consider using Markdown, which is available in R (through R Markdown) and in Jupyter Notebooks. It's a formatted style of commenting that makes annotations much easier to read.
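To show what liberal commenting might look like in practice, here is a short annotated Python sketch. It assigns a smoking amount to a numeric bin, in the spirit of the Smoking_Cat variable above. The bin edges and the function name are hypothetical; a real study would take its bins from the protocol.

```python
from bisect import bisect_right

# Hypothetical bin edges, for illustration only.
# An amount below the first edge falls in bin 0; each edge that the
# amount reaches or passes moves it up one bin.
BIN_EDGES = (1, 10, 20)


def smoking_category(amount: float, edges: tuple = BIN_EDGES) -> int:
    """Return the numeric bin that `amount` falls into.

    Parameters
    ----------
    amount : the measured amount of smoking (units must be documented
        in the data dictionary)
    edges : ascending lower bounds of bins 1, 2, 3, ...
    """
    if amount < 0:
        # Negative amounts are treated as data-entry errors
        raise ValueError("amount must be zero or greater")
    # bisect_right counts how many edges are <= amount
    return bisect_right(edges, amount)


print(smoking_category(0))   # 0
print(smoking_category(5))   # 1
print(smoking_category(10))  # 2
print(smoking_category(25))  # 3
```

The docstring and inline comments carry the same information a codebook would: what each variable and parameter means and why each step is there.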

Resources

Harvard University Longwood Research Data Management: Documentation and Metadata

Cornell University: How to Write a README File