Research Data Management at CSHL: Data Organization

Data Organization

Organizing your data, documents, and file system is critical for research data management. Fortunately, a number of resources describe standards and best practices that can help. The resources below explain how to organize your data, how to organize your files, how to future-proof and share your data through data documentation and metadata, and how to choose electronic lab notebooks for data organization.

Please contact us for assistance and consultation on any of these resources:

Phone: +1 516.367.6872
Librarian Email: libraryhelp@cshl.edu

Organize your data

After your data are collected but before they are analyzed, they need to be organized so that they can be easily examined and analyzed. Common ways to organize your data include following spreadsheet best practices and creating data dictionaries, which also serve as a type of metadata (a description of the data).

1. Spreadsheet Best Practices

  • Use variable naming conventions. Variable names in a dataset should be intuitive and meaningful, e.g., "study_id". 
  • Entries for a given variable should be based on a consistent definition, e.g., consistent data type, units, and formatting
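
The two practices above can be checked automatically. The following is a minimal sketch, assuming a lowercase-with-underscores naming convention like "study_id" and a numeric column; the function names and the pattern are hypothetical choices for illustration, not a prescribed standard.

```python
import re

# Assumed convention (hypothetical): lowercase letters, digits, and
# underscores, starting with a letter, e.g. "study_id".
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def check_variable_names(names):
    """Return the variable names that do not follow the assumed convention."""
    return [name for name in names if not NAME_PATTERN.fullmatch(name)]


def inconsistent_entries(values, parser=float):
    """Return the entries that the parser cannot read.

    With the default parser, any entry that is not a plain number
    (for example a number with its unit typed into the cell) is reported.
    """
    bad = []
    for value in values:
        try:
            parser(value)
        except (TypeError, ValueError):
            bad.append(value)
    return bad


if __name__ == "__main__":
    print(check_variable_names(["study_id", "Body Weight", "age_years"]))
    print(inconsistent_entries(["12.5", "13", "n/a", "14 kg"]))
```

Entries flagged by the second function usually point to a unit or a missing-value code that belongs in the data dictionary rather than in the cell itself.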

2. Data Dictionaries

Data dictionaries are files that accompany and describe data files, particularly spreadsheet data, by defining each variable included in the dataset. Data dictionaries should be created for all datasets, so that you or others can understand the data now or in the future. As such, they are a common type of metadata, or data description, that provides important context for the data collected.

Variable definitions can include the following information: 

  • The variable name and a description of what the variable means
  • Data type (e.g., text, date, continuous values)
  • Possible values / entries (e.g., "0", "1")
  • How the data are coded (for example, if entries for a variable are coded as "0" or "1", what do these entries mean?)
  • Units
  • Calculations or other information for derived variables
  • How missing data are recorded
  • How the data were measured
  • Any other information that is relevant to understand the data and avoid ambiguity
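
As an illustration of the items above, a data dictionary can be kept as a small CSV file next to the data. This is a minimal sketch: the column names in FIELDS and the three example variables are hypothetical, and the layout is only one of several reasonable choices.

```python
import csv
import io

# Hypothetical column layout for the data dictionary itself.
FIELDS = ["variable", "description", "data_type",
          "allowed_values", "units", "missing_code"]

# Hypothetical entries describing three variables of an example dataset.
data_dictionary = [
    {"variable": "study_id", "description": "Unique participant identifier",
     "data_type": "text", "allowed_values": "", "units": "", "missing_code": ""},
    {"variable": "treated", "description": "Treatment group",
     "data_type": "integer", "allowed_values": "0 = control; 1 = treated",
     "units": "", "missing_code": "NA"},
    {"variable": "weight_g", "description": "Body weight at collection",
     "data_type": "continuous", "allowed_values": "",
     "units": "g", "missing_code": "NA"},
]


def dictionary_to_csv(entries):
    """Serialize the data dictionary as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_variables(dataset_columns, entries):
    """Return dataset columns that have no entry in the data dictionary."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in dataset_columns if column not in documented]


if __name__ == "__main__":
    print(dictionary_to_csv(data_dictionary))
    print(undocumented_variables(["study_id", "weight_g", "age_years"],
                                 data_dictionary))
```

Running the second function against the header row of a spreadsheet is a quick way to notice variables that were added to the data but never described.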

Additional Data Organization Resources

Preparing Tabular Data for Description and Archiving, Cornell University Library

 

Organize your files

Have you ever had trouble locating raw data, or any other file associated with a research project or publication? File organization is important to establish and maintain throughout a research project, and to aid reuse and reproducibility in the short and long term. The basic components of file organization include file naming, versioning, and file structure.  

1. File Naming

A File Naming Convention (FNC) is a framework for naming your files in a way that describes what they contain and their relationship to the project and other files. When establishing an FNC, keep three criteria in mind: organization, context, and consistency. A well-designed FNC will provide a preview of the content in each file, be organized logically (based on time of production), and identify the creator.

Aim for filenames no more than 25 characters in length.

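  • Use a date format such as YYYYMMDD, YYMMDD, or DDMMYY. Note that only formats that start with the year (YYYYMMDD, YYMMDD) keep files in chronological order when sorted by name. Whichever format you choose, be consistent.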
  • Do not use spaces or special characters such as  ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' ", etc.
  • Instead of spaces, use dashes (-), underscores (_), or CamelCase (FileName) to connect words
  • If using a sequential numbering system, use leading zeros so that files sort in sequence. For example, use "001, 002, 010, 011, 100, 101" instead of "1, 2, 10, 11, 100, 101"
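
The rules above can be combined into a small helper. The sketch below assumes one possible pattern, date_description_sequence.extension, and treats letters, digits, dots, dashes, and underscores as the only allowed characters; both the pattern and the function names are hypothetical and should be adapted to your own FNC.

```python
import re
from datetime import date

# Assumption: anything other than letters, digits, ".", "_" and "-"
# counts as a forbidden character.
FORBIDDEN = re.compile(r"[^A-Za-z0-9._-]")


def build_filename(description, when, sequence, extension):
    """Compose <YYYYMMDD>_<description>_<NNN>.<extension>.

    Spaces in the description become dashes, forbidden characters are
    dropped, and the sequence number is padded with leading zeros.
    """
    words = [FORBIDDEN.sub("", word) for word in description.split()]
    label = "-".join(word for word in words if word)
    return f"{when:%Y%m%d}_{label}_{sequence:03d}.{extension}"


def filename_problems(name, max_length=25):
    """Return a list of rule violations found in an existing file name."""
    problems = []
    if len(name) > max_length:
        problems.append("too long")
    if FORBIDDEN.search(name):
        problems.append("forbidden character")
    return problems


if __name__ == "__main__":
    name = build_filename("mouse387", date(2016, 3, 5), 7, "mp4")
    print(name, filename_problems(name))
    print(filename_problems("my data (final).xlsx"))
```

Generating names through one shared function, rather than typing them by hand, is one way to keep everyone in a lab on the same convention.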

Here is an example of file names created without and with an FNC:

File names with no FNC | File names with an FNC
Labwork_2017           | Labwork_Matt_03072017
Images_test            | Images_Leicaconfocal_testsamples1-7_07092016
Sequence125            | Sequence_mouse_sample125_06092015
Video_387              | Video_behaviour_mouse387_05032016

Always remember that a file naming convention breaks down if it is not followed consistently. When developing one, be sure to gather the relevant information and feedback from everyone who needs to use the FNC (e.g., fellow lab members), and make sure that everyone is aware of it and knows how to apply it.

File Naming Resources:

File Naming Best Practices Handout from MIT Libraries
Hints and tips for developing your FNC

2. Versioning 

Versioning allows you to maintain different versions, or iterations, of a file or set of files and to keep track of changes made over time. For example, in a collaborative project, you may want to know who made what changes, and why. You can do this by using version numbers within file names to distinguish updated versions (e.g., v1.1, v2.4), where a change in the first number represents a major revision and a change in the second number represents a relatively minor revision.

Example: FileName_1.0 (original file); FileName_1.1 (original file with minor changes); FileName_2.0 (original file with major revisions)
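
The numbering scheme in this example can be applied mechanically. The following is a minimal sketch, assuming the version is the last part of the file name (without extension) in the form _major.minor; the function name bump_version is hypothetical.

```python
import re

# Assumes the version is the final "_<major>.<minor>" part of the name,
# as in "FileName_1.0" (file extension not included).
VERSION = re.compile(r"_(\d+)\.(\d+)$")


def bump_version(stem, major=False):
    """Return the next file name for a minor (default) or major revision."""
    match = VERSION.search(stem)
    if match is None:
        # No version yet: treat this as the original file.
        return f"{stem}_1.0"
    major_no, minor_no = int(match.group(1)), int(match.group(2))
    if major:
        major_no, minor_no = major_no + 1, 0
    else:
        minor_no += 1
    return f"{stem[:match.start()]}_{major_no}.{minor_no}"


if __name__ == "__main__":
    print(bump_version("FileName"))
    print(bump_version("FileName_1.0"))
    print(bump_version("FileName_1.1", major=True))
```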

You can also create a log to substantively describe changes among versions. Such versioning logs can be created manually (e.g., in a text file) or automatically (e.g., using Google Drive). See the Versioning Resources below for more details. 
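
A manually kept log can be as simple as one line per change in a plain text file. The sketch below assumes a tab-separated layout of date, file name, author, and summary; the layout and the function names are hypothetical.

```python
from datetime import date


def log_entry(filename, author, summary, when):
    """Format one log line: ISO date, file name, author, summary (tab-separated)."""
    return f"{when.isoformat()}\t{filename}\t{author}\t{summary}"


def append_to_log(log_path, entry):
    """Append an entry to the text log, creating the file if needed."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(entry + "\n")


if __name__ == "__main__":
    entry = log_entry("FileName_1.1", "Matt",
                      "Corrected units in weight column", date(2017, 7, 3))
    print(entry)
```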

Versioning Resources:

Version Control Tools and Techniques handout from MIT Libraries

Using Git for version control from NYU Data Services 

3. File Structure

Developing a hierarchical filing/folder system can seem daunting, but a few simple best practices make it easier to develop a system that helps you find files quickly in the short and long term. Once you develop a file structure system that works for you, follow it consistently. In developing a file structure, consider the following:

  1. Determine the context you or others may use to find a file. Examples include project names, dates, types, or stages of research.
  2. Develop a folder hierarchy that aligns with the project. Example: [Project] / [Experiment] / [Instrument / File Type]
  3. Include a readme file for folders to list, link to, and describe the files contained therein.
  4. Folders should not get too large (i.e., too many individual files)
  5. Folder hierarchy should not get too deep (i.e., too many levels of subfolders).
  6. Avoid tangled folder nests. 
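
A folder hierarchy such as [Project] / [Experiment] / [Instrument / File Type] can be planned and created with a short script. This is a minimal sketch: the folder names, the depth limit of four levels, and the README.txt placeholder text are assumptions for illustration only.

```python
from pathlib import Path


def plan_folders(project, experiments, subfolders):
    """List relative paths of the form project/experiment/subfolder."""
    return [Path(project) / experiment / subfolder
            for experiment in experiments
            for subfolder in subfolders]


def too_deep(folders, max_levels=4):
    """Return the planned folders that exceed the assumed depth limit."""
    return [folder for folder in folders if len(folder.parts) > max_levels]


def create_folders(root, folders):
    """Create each folder under root and add a placeholder readme to it."""
    for folder in folders:
        target = Path(root) / folder
        target.mkdir(parents=True, exist_ok=True)
        readme = target / "README.txt"
        if not readme.exists():
            readme.write_text(
                f"Describe the files stored in {folder.as_posix()} here.\n",
                encoding="utf-8",
            )


if __name__ == "__main__":
    planned = plan_folders("ProjectA", ["Exp01", "Exp02"],
                           ["confocal", "sequencing"])
    for folder in planned:
        print(folder.as_posix())
```

Planning the paths first, before anything is written to disk, makes it easy to review the hierarchy with colleagues and to check its depth.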

File Structure Resources: 

Naming and Organizing Files and Folders from MIT Libraries

File organization strategies from NYU Libraries

 

Data Documentation and Metadata

Describing your data through documentation and metadata ("data about the data") provides necessary context for the future use or reuse of your data by yourself and others. Such descriptions are important to include with any stored or shared data files. There are multiple ways to document and describe data. Your choices should consider current and anticipated data uses. 

Common Types of Data Documentation:

  • Correspondence (electronic and paper)
  • Social media communications (blogs, wikis, tweets etc.)
  • Signed consent forms
  • Methodologies and workflows
  • Standard operating procedures and protocols
  • Questionnaires, transcripts, codebooks

Metadata is structured information describing the characteristics (content, context, structure, other details) of a data product. Creating metadata is important because it supports responsible data discoverability, re-use and preservation.

Common Types of Metadata:

  • Readme files
  • Abstracts or other summaries
  • Data dictionaries
  • Auto-generated, computer-readable metadata (sometimes created when submitting data to shared, external repositories)

Readme Files

Readme files are fantastic organizational tools that you can use to document and describe anything from your own filing system to a set of data and project-related documents that you share with others.

Readme best practices:

  • Create a Readme file for each data file/dataset
  • Name each Readme file to be clearly associated with the data file(s) it describes
  • Save Readme documents as plain text files
  • Consistently format and name your readme files 
  • Follow any disciplinary conventions 
  • For folders, readme files should list, link to, and describe all the files in a particular folder.
  • For datasets, readme files should include any information needed to use the data, such as: where to find it, how to access it, possible uses, known issues or limitations, collection methods, other details such as units or variable names, ethical/privacy restrictions, licensing, who to cite 
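
The dataset items in the last bullet can serve as a plain text readme template. The following is a minimal sketch: the section titles in SECTIONS paraphrase that bullet, and the function names and the "TODO" placeholder are hypothetical.

```python
# Section titles paraphrased from the readme best practices above.
SECTIONS = [
    "Where to find the data",
    "How to access it",
    "Possible uses",
    "Known issues or limitations",
    "Collection methods",
    "Units and variable names",
    "Ethical/privacy restrictions",
    "Licensing",
    "Who to cite",
]


def build_readme(dataset_name, answers):
    """Build plain text readme content; unanswered sections are marked TODO."""
    lines = [f"README for {dataset_name}", ""]
    for section in SECTIONS:
        lines.append(f"{section}: {answers.get(section, 'TODO')}")
    return "\n".join(lines) + "\n"


def missing_sections(answers):
    """Return the sections that still need to be filled in."""
    return [section for section in SECTIONS if not answers.get(section)]


if __name__ == "__main__":
    answers = {"Licensing": "CC BY 4.0"}
    print(build_readme("mouse_weights.csv", answers))
    print(missing_sections(answers))
```

Saving the returned text under a name that matches the data file, for example with a _README.txt suffix, keeps the readme clearly associated with the data it describes.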

Documentation and Metadata Resources

Metadata Naming Authorities and Taxonomies (NYU Libraries)

Metadata Authoring Software (NYU Libraries)

Metadata standards/schema: Dublin Core, MODS (Metadata Object Description Schema), Darwin Core