
Researchers at the Portal Project have been gathering data on the interactions between rodents, ants, and plants in Arizona since 1977. At first, the data were shared informally, but over the years the approach evolved to incorporate systematic updates. In the 2000s, the team began publishing data papers that combined new observations with the existing record, keeping the data set current and accurate. "Data collection is not a one-time effort," says Ethan White, an environmental data scientist at the University of Florida. To improve their data management, White and his colleagues developed a modern data workflow in 2019, built on platforms such as GitHub, Zenodo, and Travis CI. The system releases regular updates while preserving earlier versions; the project's Zenodo repository currently holds approximately 620 versions of the data.
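White's actual pipeline links GitHub, Zenodo, and Travis CI; as a rough illustration of what such a pipeline automates, the hypothetical Python sketch below runs basic checks on an updated data file and bumps a version number only if the checks pass. The file names, column names, and versioning scheme are assumptions for illustration, not the Portal Project's real code.

```python
"""Hypothetical sketch of an automated data-release step.

A continuous-integration service (such as Travis CI) could run a script
like this on every update: validate the new data and, only if the checks
pass, bump the version that a repository such as Zenodo will archive.
"""
import csv
import sys
from pathlib import Path

DATA_FILE = Path("surveys.csv")      # assumed data file
VERSION_FILE = Path("VERSION")       # assumed plain-text version marker
REQUIRED_COLUMNS = {"date", "plot", "species", "count"}  # assumed schema


def validate(path: Path) -> list[str]:
    """Return a list of problems found in the data file."""
    with path.open(newline="") as fh:
        reader = csv.DictReader(fh)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        problems = []
        # start=2 because line 1 of the CSV holds the header
        for line_no, row in enumerate(reader, start=2):
            if not (row["count"] or "").isdigit():
                problems.append(f"line {line_no}: non-numeric count {row['count']!r}")
        return problems


def bump_minor(version: str) -> str:
    """Increment the minor part of a MAJOR.MINOR version string."""
    major, minor = version.split(".")
    return f"{major}.{int(minor) + 1}"


if __name__ == "__main__":
    issues = validate(DATA_FILE)
    if issues:
        sys.exit("\n".join(issues))  # a non-zero exit fails the CI job
    new_version = bump_minor(VERSION_FILE.read_text().strip())
    VERSION_FILE.write_text(new_version + "\n")
    print(f"checks passed; data version is now {new_version}")
```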
The need for effective data management extends beyond long-term ecological research. Many researchers frequently update their data sets throughout their careers, leading to challenges in maintaining accuracy and accessibility. Crystal Lewis, a data-management consultant based in St. Louis, Missouri, notes that “there are no standards for repositories; the journals are not telling you how to correct a data set or how to cite new data.”
**Five Essential Strategies for Data Management**
Good data-science practices can provide clarity and improve the organization of ongoing projects. Here are five strategies to effectively manage and cite data sets.
**1. Choose an Appropriate Repository**
Using a data repository gives researchers a reliable way to store and share multiple versions of their data. Kristin Briney, a librarian at the California Institute of Technology, emphasizes that repositories help to prevent data from being lost or overlooked. Beginning in January 2024, US federal funding agencies will require researchers to deposit data in a repository, and some agencies, including the National Institutes of Health, have already implemented the policy. Journals such as PLOS ONE also encourage authors to use established data repositories, such as the Dryad Digital Repository and the Open Science Framework.
Repositories provide long-term storage with multiple backups. Zenodo, for instance, pledges to maintain deposited data for as long as CERN, which operates the site, exists. Repositories also guarantee that archived data remain unaltered, and they assign persistent identifiers that make the data easier for others to locate.
**2. Create Multiple Versions**
Creating new versions of data sets is vital for transparency and accessibility: overwriting old data can make it difficult or impossible to replicate previous analyses or to track changes over time. Researchers are often their own most important collaborators, Lewis points out: "Three months from now, you will forget what you did."
Many data repositories create new versions automatically as data are added. Zenodo, for example, generates a distinct digital object identifier (DOI) for each version, allowing researchers to cite exactly the data they used. Researchers who manage their data outside a repository can instead create a release on GitHub each time the data change.
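As a minimal sketch of that second option, the snippet below creates a release through GitHub's REST API. The owner, repository, and tag names are placeholders, and a personal access token is assumed to be available in the GITHUB_TOKEN environment variable.

```python
"""Minimal sketch: tag a data update as a GitHub release.

Uses GitHub's REST API (POST /repos/{owner}/{repo}/releases).
All names below are placeholders for illustration.
"""
import os
import requests

OWNER, REPO = "your-lab", "your-data-repo"   # placeholder repository
TAG = "v2.5.0"                               # one tag per data update

response = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/releases",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "tag_name": TAG,
        "name": f"Data release {TAG}",
        "body": "Added field observations for 2024; see CHANGELOG.md.",
    },
    timeout=30,
)
response.raise_for_status()
print("release created:", response.json()["html_url"])
```

If the repository is linked to Zenodo's GitHub integration, each such release can also be archived automatically and assigned its own DOI.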
**3. Define File Names and Metadata**
Establishing a consistent file-naming convention is essential for organizing data effectively. Briney advises researchers to include dates in their file names, ideally in the format YYYYMMDD or YYYY-MM-DD. This simple practice keeps related files distinct and, because the date leads, makes files sort chronologically, easing navigation of complex data sets.
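As a quick illustration (the file name here is made up), a date-stamped name can be generated in a few lines of Python:

```python
from datetime import date

# Build a file name with an ISO 8601 date prefix (YYYY-MM-DD), so that
# an alphabetical listing of files is also a chronological one.
stem, suffix = "rodent-survey", "csv"
filename = f"{date.today():%Y-%m-%d}_{stem}.{suffix}"
print(filename)  # e.g. 2025-06-01_rodent-survey.csv
```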
Documenting metadata is equally important. This includes explaining what each variable means and how files and folders are organized. Clear documentation not only helps the original researchers but also makes data sharing smoother, letting others understand what a given spreadsheet contains.
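One lightweight way to capture such metadata, sketched below with invented variables, is a machine-readable data dictionary stored alongside the data set:

```python
import csv

# A minimal data dictionary: one row per variable, kept next to the
# data set it documents. The variables shown are made up for illustration.
variables = [
    {"name": "plot", "type": "integer", "units": "", "description": "Survey plot ID"},
    {"name": "species", "type": "string", "units": "", "description": "Two-letter species code"},
    {"name": "weight", "type": "float", "units": "g", "description": "Body mass at capture"},
]

with open("data_dictionary.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["name", "type", "units", "description"])
    writer.writeheader()
    writer.writerows(variables)
```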
**4. Explicitly Document Queries and Terminology**
Researchers should also keep records of the terminology and queries used to generate and analyze their data. Sabina Leonelli, who specializes in big-data methods at the Technical University of Munich, illustrates the point with biomedical databases, in which definitions evolve over time. Capturing the specific terminology in use when the data were generated is crucial for keeping later analyses interpretable.
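One simple way to do this, sketched below with an invented database, query, and definition, is to log each query alongside a timestamp and the definitions in force at the time:

```python
import json
from datetime import datetime, timezone

# Record the exact query and the definitions in force when it was run.
# The database name, query, and definition below are invented examples.
record = {
    "run_at": datetime.now(timezone.utc).isoformat(),
    "database": "example-biomedical-db",
    "query": "SELECT * FROM cases WHERE diagnosis = 'obesity'",
    "definitions": {"obesity": "BMI >= 30 (criterion in use on this date)"},
}

with open("query_log.jsonl", "a") as fh:
    fh.write(json.dumps(record) + "\n")
```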
**5. Maintain a Change Log**
Keeping a change log is an effective way to track modifications to data sets over time. This log should document what changes were made, when they occurred, and the rationale behind each adjustment. Such records can be invaluable for both the original researchers and future collaborators, providing clarity and context for the data.
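A change log needs no special tooling; the sketch below appends a dated entry recording what changed and why to a plain-text file (the file name and entry format are just one possible convention):

```python
from datetime import date

def log_change(what: str, why: str, path: str = "CHANGELOG.md") -> None:
    """Append a dated entry recording what changed and why."""
    entry = f"## {date.today():%Y-%m-%d}\n- Change: {what}\n- Reason: {why}\n\n"
    with open(path, "a") as fh:
        fh.write(entry)

log_change(
    what="Corrected species codes for 12 records from the 2023 survey",
    why="Field sheets were transcribed with an outdated code list",
)
```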
These strategies not only ensure the integrity of data but also facilitate collaboration and reproducibility in research. As the landscape of data management continues to evolve, adopting these practices can help researchers navigate the complexities of maintaining and updating their data effectively.