Creating a research data management plan - York University Libraries

Data Management Plans (DMP)

A DMP is a formal document that details the strategies and tools you will implement to effectively manage your data during the active phase of your research, and the mechanisms you will use for preserving and appropriately sharing your data at the end of the project. A DMP is a “living” document that can be modified throughout your project to reflect any changes that have occurred.

The DMP Assistant

The DMP Assistant is a national, online, bilingual data management planning tool developed by the Digital Research Alliance of Canada (the Alliance) in collaboration with the host institution, the University of Alberta, to assist researchers in preparing data management plans (DMPs). The tool is freely available to all researchers and builds a DMP through a series of key data management questions, supported by best-practice guidance and examples.

DMPs are one of the foundations of good research data management (RDM) and an international best practice. They are increasingly required by institutions and funders, including the Canadian Tri-Agencies, as outlined in their Research Data Management Policy.

Questions from the DMP Assistant

The section below reproduces the questions posed by the DMP Assistant for quick reference.

This section addresses data collection issues such as data types, file formats, naming conventions, and data organization – factors that will improve the usability of your data and contribute to the success of your project.

What types of data will you collect, create, link to, acquire and/or record?

Examples: numeric, images, audio, video, text, tabular data, modeling data, spatial data, instrumentation data.

What file formats will your data be collected in? Will these formats allow for data re-use, sharing and long-term access to the data?

Proprietary file formats requiring specialized software or hardware to use are not recommended, but may be necessary for certain data collection or analysis methods. Using open file formats or industry-standard formats (e.g. those widely used by a given community) is preferred whenever possible. Learn more by reviewing this table of recommended file formats from the UK Data Service.

What conventions and procedures will you use to structure, name and version-control your files to help you and others better understand how your data are organized?

It is important to keep track of different copies or versions of files, files held in different formats or locations, and information cross-referenced between files. This process is called 'version control'.
Logical file structures, informative naming conventions, and clear indications of file versions all contribute to better use of your data during and after your research project. These practices will help ensure that you and your research team are using the appropriate version of your data, and will minimize confusion regarding copies on different computers or on different media.
Read more on this topic at UK Data Archive: file organizing and version control and Consortium of European Social Science Data Archives: designing a data file structure, organisation of variables.
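
As an illustration of an informative naming convention with explicit version numbers, the following Python sketch builds and checks file names. The pattern used here (project, description, date, two-digit version) is a hypothetical convention chosen for the example, not one prescribed by the DMP Assistant; adapt it to whatever your team agrees on.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<description>_<YYYYMMDD>_v<NN>.<ext>
NAME_PATTERN = re.compile(r"^[a-z0-9-]+_[a-z0-9-]+_\d{8}_v\d{2}\.[a-z0-9]+$")


def _slug(text: str) -> str:
    """Lowercase the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(project: str, description: str, created: date,
                   version: int, ext: str) -> str:
    """Assemble a file name that follows the illustrative convention."""
    return f"{_slug(project)}_{_slug(description)}_{created:%Y%m%d}_v{version:02d}.{ext.lower()}"


def follows_convention(filename: str) -> bool:
    """Return True if the file name matches the illustrative convention."""
    return NAME_PATTERN.match(filename) is not None


if __name__ == "__main__":
    name = build_filename("SoilSurvey", "Site A interviews", date(2024, 3, 5), 2, "csv")
    print(name, follows_convention(name))
    print("final FINAL.xlsx", follows_convention("final FINAL.xlsx"))
```

Putting the date in year-month-day order and zero-padding the version number means that an alphabetical file listing is also a chronological one.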

Because data are rarely self-explanatory, all research data should be accompanied by metadata (information that describes the data according to community best practices). Metadata standards vary across disciplines, but generally state who created the data and when, how the data were created, their quality, accuracy, and precision, as well as other features necessary to facilitate data discovery, understanding and reuse.

Any restrictions on use of the data must be explained in the metadata, along with information on how to obtain approved access to the data, where possible.

What documentation will be needed for the data to be read and interpreted correctly in the future?

Typically, good documentation includes information about the study, data-level descriptions, and any other contextual information required to make the data usable by other researchers. Other elements you should document, as applicable, include: research methodology used, variable definitions, vocabularies, classification systems, units of measurement, assumptions made, format and file type of the data, a description of the data capture and collection methods, explanation of data coding and analysis performed (including syntax files), and details of who has worked on the project and performed each task, etc.

How will you make sure that documentation is created or captured consistently throughout your project?

Consider how you will capture this information and where it will be recorded, ideally in advance of data collection and analysis, to ensure accuracy, consistency, and completeness of the documentation. Often, resources you've already created can contribute to this (e.g. publications, websites, progress reports, etc.). It is useful to consult regularly with members of the research team to capture potential changes in data collection/processing that need to be reflected in the documentation. Individual roles and workflows should include gathering data documentation as a key element.

If you are using a metadata standard and/or tools to document and describe your data, please list here.

There are many general and discipline-specific metadata standards. Dataset documentation should be provided in one of these standard, machine-readable, openly accessible formats to enable the effective exchange of information between users and systems. Many of these standards are built on language-independent data formats such as XML, RDF, and JSON.
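
To show what a machine-readable description might look like, here is a minimal Python sketch that writes a small metadata record as JSON. The field names are loosely modelled on Dublin Core element names (title, creator, date, description, format, rights); the function name and the sample values are assumptions made for illustration, and a real project should follow the schema of its chosen standard or repository.

```python
import json


def make_metadata_record(title, creators, created, description, file_format, rights):
    """Return a JSON string describing a dataset.

    Field names are an illustrative subset inspired by Dublin Core,
    not a complete or validated implementation of any standard.
    """
    record = {
        "title": title,
        "creator": list(creators),
        "date": created,            # ISO 8601 date string, e.g. "2024-03-05"
        "description": description,
        "format": file_format,
        "rights": rights,           # restrictions on use and how to request access
    }
    missing = [key for key, value in record.items() if not value]
    if missing:
        raise ValueError(f"Missing metadata fields: {', '.join(missing)}")
    return json.dumps(record, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    print(make_metadata_record(
        title="Example interview transcripts",      # placeholder values
        creators=["Researcher, A."],
        created="2024-03-05",
        description="De-identified transcripts from an example study.",
        file_format="text/plain",
        rights="Restricted; contact the research team for access.",
    ))
```

Rejecting records with empty fields is one simple way to keep documentation consistent across the project.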

Planning how research data will be stored and backed up throughout and beyond a research project is critical in ensuring data security and integrity. Appropriate storage and backup not only helps protect research data from catastrophic losses (due to hardware and software failures, viruses, hackers, natural disasters, human error, etc.), but also facilitates appropriate access by current and future researchers.

What are the anticipated storage requirements for your project, in terms of storage space (in megabytes, gigabytes, terabytes, etc.) and the length of time you will be storing your data?

Storage-space estimates should take into account requirements for file versioning, backups, and growth over time.
If you are collecting data over a long period (e.g. several months or years), your data storage and backup strategy should accommodate data growth. Similarly, a long-term storage plan is necessary if you intend to retain your data after the research project.
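
A rough estimate that accounts for versioning, backups, and growth can be worked out with simple arithmetic. The Python sketch below uses one possible model, in which the working set grows at a fixed annual rate and each backup is a full copy; the function name, parameters, and model are assumptions for illustration only, and real requirements depend on your storage provider and backup method.

```python
def estimate_storage_gb(raw_gb, versions_kept, backup_copies,
                        annual_growth_rate, years):
    """Estimate total storage in gigabytes under a simple illustrative model.

    working set  = raw data size x number of versions kept
    after growth = working set x (1 + annual growth rate) ** years
    total        = after growth x (1 primary copy + backup copies)
    """
    working_set = raw_gb * versions_kept
    after_growth = working_set * (1 + annual_growth_rate) ** years
    return after_growth * (1 + backup_copies)


if __name__ == "__main__":
    # Example inputs: 100 GB of raw data, 2 versions kept, 2 backup copies,
    # 50% growth per year over 2 years.
    print(estimate_storage_gb(100, 2, 2, 0.5, 2))
```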

How and where will your data be stored and backed up during your research project?

The risk of losing data due to human error, natural disasters, or other mishaps can be mitigated by following the 3-2-1 backup rule:

Have at least three copies of your data.
Store the copies on two different media.
Keep one backup copy offsite.
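
The three conditions of the 3-2-1 rule can be checked against a written list of where your copies live. The following Python sketch is one way to do that; the `DataCopy` class, the `check_3_2_1` function, and the sample locations are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    location: str   # free-text label, e.g. "lab workstation"
    medium: str     # e.g. "internal disk", "external drive", "network storage"
    offsite: bool   # True if the copy is kept away from the primary site


def check_3_2_1(copies):
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({copy.medium for copy in copies}) < 2:
        problems.append("fewer than two different media")
    if not any(copy.offsite for copy in copies):
        problems.append("no offsite copy")
    return problems


if __name__ == "__main__":
    plan = [
        DataCopy("lab workstation", "internal disk", offsite=False),
        DataCopy("office external drive", "external drive", offsite=False),
        DataCopy("institutional network storage", "network storage", offsite=True),
    ]
    print(check_3_2_1(plan) or "plan satisfies the 3-2-1 rule")
```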

Data may be stored using optical or magnetic media, which can be removable (e.g. DVD and USB drives), fixed (e.g. desktop or laptop hard drives), or networked (e.g. networked drives or cloud-based servers). Each storage method has benefits and drawbacks that should be considered when determining the most appropriate solution.
Further information on storage and backup practices is available from the University of Sheffield Library and the UK Data Archive.

How will the research team and other collaborators access, modify, and contribute data throughout the project?

An ideal solution is one that facilitates co-operation and ensures data security, yet can be adopted by users with minimal training. Transmitting data between locations or within research teams can be challenging for data management infrastructure. Relying on email for data transfer is not a robust or secure solution. Third-party commercial file sharing services (such as Google Drive and Dropbox) facilitate file exchange, but they are not necessarily permanent or secure, and are often located outside Canada. Please contact your Faculty IT department or University Information Technology (UIT) to develop the best solution for your research project.

Data preservation will depend on potential reuse value, whether there are obligations to either retain or destroy data, and the resources required to properly curate the data and ensure that it remains usable in the future. In some circumstances, it may be desirable to preserve all versions of the data (e.g. raw, processed, analyzed, final), but in others, it may be preferable to keep only selected or final data (e.g. transcripts instead of audio interviews).

Where will you deposit your data for long-term preservation and access at the end of your research project?

The issue of data retention should be considered early in the research lifecycle. Data-retention decisions can be driven by external policies (e.g. funding agencies, journal publishers), or by an understanding of the enduring value of a given set of data. The need to preserve data in the short term (i.e. for peer-verification purposes) or the long term (for data of lasting value) will influence the choice of data repository or archive. A helpful analogy is to think of creating a 'living will' for the data, that is, a plan describing how future researchers will have continued access to the data.
If you need assistance locating a suitable data repository or archive, please contact the Library at yul_rdm@yorku.ca.
re3data.org is a directory of potential open data repositories. Verify whether or not the data repository will provide a statement agreeing to the terms of deposit outlined in your Data Management Plan.

Indicate how you will ensure your data is preservation ready. Consider preservation-friendly file formats, file integrity, anonymization and de-identification, and the inclusion of supporting documentation.

Some data formats are optimal for long-term preservation of data. For example, non-proprietary file formats, such as text ('.txt') and comma-separated ('.csv'), are considered preservation-friendly. The UK Data Archive provides a useful table of file formats for various types of data. Keep in mind that preservation-friendly files converted from one format to another may lose information (e.g. converting from an uncompressed TIFF file to a compressed JPG file), so changes to file formats should be documented.
Identify steps required following project completion in order to ensure the data you are choosing to preserve or share is anonymous, error-free, and converted to recommended formats with a minimal risk of data loss.
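
One common way to check file integrity is to record a checksum for each file at deposit time and recompute it later. The Python sketch below uses SHA-256 from the standard library; the function names and the manifest layout are assumptions for this example, and many repositories compute and verify checksums on your behalf.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(folder, manifest):
    """List files from the manifest that are now missing or have different content."""
    current = build_manifest(folder)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

Storing the manifest alongside the data (and in the project documentation) lets anyone confirm later that the preserved files are unchanged.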

Most Canadian research funding agencies now have policies requiring research data to be shared upon publication of the research results or within a reasonable period of time. While data sharing contributes to the visibility and impact of research, it has to be balanced with the legitimate desire of researchers to maximise their research outputs before releasing their data. Equally important is the need to protect the privacy of respondents and to properly handle sensitive data.

What data will you be sharing and in what form? (e.g. raw, processed, analyzed, final).

Raw data are the data directly obtained from the instrument, simulation or survey.
Processed data result from some manipulation of the raw data in order to eliminate errors or outliers, to prepare the data for analysis, to derive new variables, or to de-identify the human participants.
Analyzed data are the results of qualitative, statistical, or mathematical analysis of the processed data. They can be presented as graphs, charts or statistical tables.
Final data are processed data that have, if needed, been converted into a preservation-friendly format.
Consider which data may need to be shared in order to meet institutional or funding requirements, and which data may be restricted because of confidentiality/privacy/intellectual property considerations.

Have you considered what type of end-user license to include with your data?

Licenses determine what uses can be made of your data. Funding agencies and/or data repositories may have end-user license requirements in place; if not, they may still be able to guide you in the development of a license. Once created, please consider including a copy of your end-user license with your Data Management Plan. Note that only the intellectual property rights holder(s) can issue a license, so it is crucial to clarify who owns those rights.
There are several types of standard licenses available to researchers, such as the Creative Commons licenses and the Open Data Commons licenses. In fact, for most datasets it is easier to use a standard license rather than to devise a custom-made one. Note that even if you choose to make your data part of the public domain, it is preferable to make this explicit by using a license such as Creative Commons' CC0.

What steps will be taken to help the research community know that your data exists?

Possibilities include: data registries, repositories, indexes, word-of-mouth, publications.
How will the data be accessed (Web service, FTP, etc.)? If possible, choose a repository that will assign a persistent identifier (such as a DOI) to your dataset. This will ensure stable access to the dataset and make it retrievable by various discovery tools.
One of the best ways to refer other researchers to your deposited datasets is to cite them the same way you cite other types of publications (articles, books, proceedings). Note that some data repositories also create links from datasets to their associated papers, thus increasing the visibility of the publications.
Contact the Library at yul_rdm@yorku.ca for assistance with making your dataset visible and easily accessible.

Data management focuses on the 'what' and 'how' of operationally supporting data across the research lifecycle. Data stewardship focuses on 'who' is responsible for ensuring that data management happens. A large project, for example, will involve multiple data stewards. The Principal Investigator should identify at the beginning of a project all of the people who will have responsibilities for data management tasks during and after the project.

Identify who will be responsible for managing this project's data during and after the project and the major data management tasks for which they will be responsible.

Your data management plan has identified important data activities in your project. Identify who will be responsible -- individuals or organizations -- for carrying out these parts of your data management plan. This could also include the timeframe associated with these staff responsibilities and any training needed to prepare staff for these duties.

How will responsibilities for managing data activities be handled if substantive changes happen in the personnel overseeing the project's data, including a change of Principal Investigator?

Indicate a succession strategy for these data in the event that one or more people responsible for the data leaves (e.g. a graduate student leaving after graduation). Describe the process to be followed in the event that the Principal Investigator leaves the project. In some instances, a co-investigator or the department or division overseeing this research will assume responsibility.

What resources will you require to implement your data management plan? What do you estimate the overall cost for data management to be?

This estimate should incorporate data management costs incurred during the project as well as those required for the longer-term support for the data after the project is finished. Items to consider in the latter category of expenses include the costs of curating and providing long-term access to the data. Some funding agencies state explicitly the support that they will provide to meet the cost of preparing data for deposit. This might include technical aspects of data management, training requirements, file storage & backup, and contributions of non-project staff.

Researchers and their teams need to be aware of the policies and processes, both ethical and legal, to which their research data management must comply. Protection of respondent privacy is of paramount importance and informs many data management practices. In their data management plan, researchers must state how they will prepare, store, share, and archive the data in a way that ensures participant information is protected, throughout the research lifecycle, from disclosure, harmful use, or inappropriate linkages with other personal data.

It is recognized that there may be cases where certain data and metadata cannot be made public for policy or legal reasons. However, the default position should be that all research data and metadata are public.

If your research project includes sensitive data, how will you ensure that it is securely managed and accessible only to approved members of the project?

Consider where, how, and to whom sensitive data with acknowledged long-term value should be made available, and how long it should be archived. These decisions should align with Research Ethics Board requirements. For more information, consult the Data Retention and Deposit Guidelines for Research Involving Human Participants.
Restrictions can be imposed by limiting physical access to storage devices, by placing data on computers that do not have external network access (i.e. access to the Internet), through password protection, and by encrypting files. Sensitive data should never be shared via email or cloud storage services such as Dropbox.

If applicable, what strategies will you undertake to address secondary uses of sensitive data?

Obtaining the appropriate consent from research participants is an important step in assuring Research Ethics Boards that the data may be shared with researchers outside your project. The consent statement may identify certain conditions clarifying the uses of the data by other researchers. For example, it may stipulate that the data will only be shared for non-profit research purposes or that the data will not be linked with personally identified data from other sources.

How will you manage legal, ethical, and intellectual property issues?

Compliance with privacy legislation and with laws that may impose content restrictions on the data should be discussed with your institution's privacy officer or research services office. Research Ethics Boards are central to the research process.
Include here a description concerning ownership, licensing, and intellectual property rights of the data. Terms of reuse must be clearly stated, in line with the relevant legal and ethical requirements where applicable (e.g., subject consent, permissions, restrictions, etc.).

Note: Much of the text on this page can be attributed to the DMP Assistant, licensed under a CC0 license.