Resources
Resources for bringing software engineering best practices into your research program:
Do you process or analyze data? Do you make computational models of your data? Do you make plots for your paper figures? If you answered “yes” to any of these questions, then you write software! Software has become a crucial part of doing science and engineering – across every academic field. The scientific methods used in your field have been refined for years to make sure the research you do is reliable and accessible to others in the scientific community. Similarly, the software you incorporate into your work should follow the same rigor.
Science and engineering are inherently collaborative processes, which means software development is as well. Developing software that is easy to understand, run, and modify when shared between researchers can drastically speed up your research, give you more confidence in the validity and accuracy of your research findings, and make it easier to share your findings with your research community.
Science and engineering also rest on an expectation of methodological rigor and accuracy. Just as you would test an experimental setup in the lab before collecting real data, you need to be confident that your software does what you expect before running real analyses! Adopting practices for good code quality and software testing helps ensure reliable results and helps avoid paper retractions down the road.
While the software industry has spent years refining best practices for software development, these practices are not always well-suited to academic projects and can therefore be a challenge to adopt. Here, we have tried to take these valuable lessons learned from the software industry and to distill a set of practices that are tailored to writing code in academia.
Version control for code and data
How do you keep your code and changes to it organized among your team? If you work with other people on the same piece of code, how do you make sure changes get shared consistently and everyone has access to the most recent version? Do you know how your code and data have changed so that you can reproduce results? When you need to look at an old version of a file, how easy is it for you to do so?
A version control system keeps track of how a project develops: it records each change to your project and makes it easy to share those changes with your team and collaborators.
A commonly used version control system is git. Git sets up a repository (“repo”), which is a collection of all of the files in a project and their histories. After working on a change to the project, you check the edited files into the repository. If you ever need to return to a prior version, git lets you travel back to any point in the history, so you can confidently edit the project without worrying about overwriting past work. When multiple people update separate areas of the repository, git merges their changes so everyone can safely work in parallel.
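As a rough illustration of checking files in and travelling back through the history, the sketch below drives git from Python in a throwaway directory. The file name `analysis.py`, the commit messages, and the author details are made up for the example, and it assumes git is installed. The same commands can be typed directly in a terminal.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside `repo` and return its text output."""
    result = subprocess.run(
        ["git",
         "-c", "user.name=Example Researcher",        # placeholder identity,
         "-c", "user.email=researcher@example.org",   # used only for this demo
         *args],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def demo_history():
    """Commit two versions of a file, then read the first one back."""
    if shutil.which("git") is None:
        return None  # git is not installed, so there is nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        script = repo / "analysis.py"  # hypothetical project file
        git(repo, "init", "--quiet")

        # First version: write the file, stage it, and check it in.
        script.write_text("threshold = 0.5\n")
        git(repo, "add", "analysis.py")
        git(repo, "commit", "--quiet", "-m", "Add analysis script")

        # Second version: edit the file and check in the change.
        script.write_text("threshold = 0.8\n")
        git(repo, "add", "analysis.py")
        git(repo, "commit", "--quiet", "-m", "Raise threshold")

        # HEAD~1 names the commit just before the most recent one.
        old_version = git(repo, "show", "HEAD~1:analysis.py")
        commit_count = int(git(repo, "rev-list", "--count", "HEAD"))
        return old_version, commit_count


if __name__ == "__main__":
    print(demo_history())
```

The file on disk ends up with the newer threshold, yet the earlier version is still recoverable from the repository, which is what makes it safe to keep editing.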
Sharing code with collaborators:
A version control tool like git manages changing code in a project, but does not by itself share the project with others. In order to collaborate, the code can be put in an online “remote” repository, which can be public or have restricted access. GitHub is the most widely used remote repository site. Each collaborator can create a local copy of the remote repository. Edits are then made to the local copy before being checked in and pushed to the remote repository so other collaborators stay up to date.
Version Control for Data:
Basic git doesn’t work well for storing large data sets. Instead, to keep track of different versions of your data, use Data Version Control (DVC).
DVC is an extension for git that stores the large data files outside of the repository and tracks them with lightweight files within the repo. The lightweight files point to a particular version of the large data files, so you can still version all of your data using git’s tools. With DVC tracking how your data changes, you can safely update the data while keeping track of past results.
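To make the idea of a lightweight pointer file concrete, here is a minimal sketch of the concept in plain Python. It is not DVC's implementation or its file format: the `.pointer.json` suffix, the JSON fields, and the function names are all assumptions chosen for illustration. The point is only that a few lines of text are enough to identify one exact version of a large file, and that small text file is what would be checked into git.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write a small JSON file that identifies one exact version of the data."""
    data_path = Path(data_path)
    # Hypothetical naming scheme: measurements.csv -> measurements.csv.pointer.json
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer = {
        "path": data_path.name,
        "size": data_path.stat().st_size,
        "sha256": file_digest(data_path),
    }
    pointer_path.write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer_path


def matches_pointer(data_path, pointer_path):
    """Return True if the data file is the version the pointer recorded."""
    pointer = json.loads(Path(pointer_path).read_text())
    return file_digest(data_path) == pointer["sha256"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "measurements.csv"  # stand-in for a large data file
        data.write_text("t,value\n0,1.0\n")
        pointer = write_pointer(data)
        print(matches_pointer(data, pointer))  # data unchanged since the pointer was written
        data.write_text("t,value\n0,2.0\n")
        print(matches_pointer(data, pointer))  # data edited after the pointer was written
```

Because the pointer changes whenever the data changes, committing the pointer alongside the code records which data version produced which results.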
Resources:
- git software carpentry
- “Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people”
- lessons 1-6 set up git
- lessons 7-9 cover collaborating with GitHub
- lessons 10-14 cover research concerns
- using Data Version Control (DVC)
- Introduction to git
Code Quality and Longevity
How do you ensure your code remains useful and easy to update? Can you read and understand your code without difficulty? Can you modify parts of your code without breaking other parts? These questions highlight the value of code quality. Well-structured code is easier to read, understand, maintain, and extend, leading to less overall development effort.
Internal Project Documentation
How do you remember and communicate the important decisions about how your software is constructed? Why did you choose a specific programming language or library? What algorithms are used, and why? How do new collaborators set up your software? What environment and dependencies do they need?
When new members join your team, they have a lot to learn about why your software was built the way it was. Internal project documentation makes it easier for everyone to understand your software’s internals.
Resources:
- Intro to Markdown
- GitHub Documentation (including wiki on repo)
- Thoughts and advice on documentation best practices
Software Testing
How do you know that your software is working the way you intended? When you make changes to a part of your software, are you sure that the rest of it still works as expected? What if the external code it depends on changes? Are you able to easily diagnose issues when they arise? How can you check whether others will be able to use your code even if their computer systems are set up differently from yours? How can you ensure that results returned from your software don’t change unexpectedly?
Software testing goes beyond running your code once to check that it works after you write it. Writing tests and running them regularly are key software best practices. Proper testing not only helps you detect bugs in new code; it also a) gives you confidence that existing code still works after new changes, b) helps you track down the source of errors more efficiently, and c) makes for a faster development process in the long run.
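As a small, hedged example of what such tests can look like, the sketch below uses Python's built-in `unittest` module. The `zscore` function is a made-up stand-in for a piece of analysis code. The tests pin down one known result, check a property that should hold for any input, and confirm that bad input fails loudly instead of silently returning nonsense.

```python
import math
import unittest


def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    if len(values) == 0:
        raise ValueError("zscore needs at least one value")
    mean = sum(values) / len(values)
    spread = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if spread == 0:
        raise ValueError("zscore is undefined when all values are equal")
    return [(v - mean) / spread for v in values]


class TestZscore(unittest.TestCase):
    def test_known_result(self):
        # A hand-checked case: mean is 2 and the standard deviation is 1.
        result = zscore([1.0, 3.0])
        self.assertAlmostEqual(result[0], -1.0)
        self.assertAlmostEqual(result[1], 1.0)

    def test_output_has_zero_mean(self):
        # A property that should hold no matter which numbers go in.
        result = zscore([2.0, 4.0, 9.0])
        self.assertAlmostEqual(sum(result), 0.0)

    def test_constant_input_raises(self):
        # Invalid input should produce a clear error, not a division by zero.
        with self.assertRaises(ValueError):
            zscore([5.0, 5.0, 5.0])
```

Saved in a file such as `test_zscore.py` (a hypothetical name), the tests can be run with `python -m unittest`. Re-running them after every change is what gives you confidence that existing behavior has not shifted.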
Resources:
- Lesson/Tutorial from CodeRefinery (includes resources and tools)
- Reading from MIT OpenCourseWare 6.005: overview of software testing
- “The Practical Test Pyramid” from Martin Fowler
- Software testing fundamentals: a site dedicated to basic knowledge about testing
- Types of Automation Testing: A Guide for Beginners
- Intro to Test Driven Design
- Visual Studio’s tools, walkthroughs and tutorials for testing
Python
MATLAB
Frontend testing tools
Code Coverage tools
Software Licensing
What are others allowed to do with the software you write? Do you want it to remain freely available and open-source? Do you mind if someone changes it and releases it as a commercial product? Do you want to charge licensing fees to fund further development?
These questions fall into the realm of software licensing. A software license dictates what uses are allowed for your software and can help protect you from liability. It is crucial to choose a license before you make any software publicly available. Once people begin using your software, it can be difficult to change its terms.
(This article is valuable because, even though it introduces only a few possible license options, it clearly explains some of the high-level issues in choosing between a permissive license and a “copyleft” license that requires that all further development be shared freely.)