Data Science Core Skills: Filling the Gaps with Community-Developed Workshops

Basic data science skills are necessary for all types of scientific research. Although many excellent resources are available, putting together a professional training program tailored to your research institution is a challenge. Interdisciplinary research programs attract students and staff with a wide range of basic knowledge and skills. Graduate students are funded through a hodgepodge of programs with different training and support requirements. The training of post-docs and early career researchers can be neglected, and many struggle to acquire the skills they need to advance their research and careers. Students and staff can start at any time of the year, although there is often a cohort of new students each fall.

We run a UKRI-funded program called Ed-DaSH, which develops new training workshop materials that are available to the entire research community. We work with The Carpentries, an inclusive community that teaches coding and data skills, and whose pedagogical model of hands-on, collaborative learning we have adopted in our workshops. Workshop topics include statistics, the FAIR principles (findability, accessibility, interoperability and reusability) of data management, and workflow management systems. Starting in the fall of 2021, our institutes began using these new materials in basic data science training programs, initially aimed at incoming doctoral students but available to all staff and students.

Identify unmet training needs

What does your audience need to learn to fulfill their potential as researchers? Surveys are a good start, especially if they are short and easy to complete. For example, a survey of the MRC Human Genetics Unit at the Institute of Genetics and Cancer about statistical training needs and support revealed a strong demand for both one-on-one training and workshops.

However, a survey can only capture what you ask about and what people know they need right now. Future needs should also be taken into account, especially for early career researchers. Expert observations matter too: what are researchers at the cutting edge doing? We observed a marked shift towards workflow management systems in health and bioscience research, and a lack of training to integrate them into daily use.

Collect local training

What does your institute offer? And how does this relate to the training needs you have identified? Your survey can tell you what training has helped in the past, and it’s also helpful to gather existing post-training surveys. If the training is offered through your institute’s research program, is it open only to students of that program, or could it be made more widely available? We had a number of locally developed workshops, such as an introduction to our university computing cluster and an introduction to genome browsers, which were well received and in high demand.

Find community-developed training

What training and resources have other people developed that you can use? Don’t waste time reinventing the wheel. Training materials developed by other members of the research community are often freely available, adaptable, and of high quality. Even better, academic research communities will generally welcome contributions and comments.

We believe that to foster a living curriculum, it is worth letting go of some control over its content. Data science is fortunate to have access to open-source workshop materials from The Carpentries. We use lessons from the Software Carpentry and Data Carpentry suites, covering the basics of the Unix shell, Python and R.

Don’t be afraid to adapt: we used to offer an in-house genomics workshop, but we have replaced it with a more up-to-date Carpentries lesson. Lessons developed with input from the wider research community are tested and updated by hundreds of instructors worldwide, which makes them easy to share across institutes. Our local Edinburgh Carpentries community facilitates collaboration.

Address training gaps

Looking back at your training needs, what is missing? In our case, statistics, data management and workflow management systems were the areas we felt most needed new materials. If you have the capacity to get started right away and want to make your new materials a community effort, talk to The Carpentries about their incubator program and take a look at their Curriculum Development Handbook. For funding, along with UKRI, the Software Sustainability Institute (for research software generally) and ELIXIR (for the biosciences specifically) may have relevant programs.

Timing and audience

When, where and how do you want to deliver your new training program?

  1. Audience: are you mainly targeting a single cohort of doctoral students or a broader group of researchers?
  2. Intensity: a week or two of full-day sessions can be efficient to schedule, but our feedback has been that it can overwhelm learners. A combination of half-day and full-day sessions spread over a longer period can improve retention and engagement.
  3. Timing: researchers are more engaged in training when they know they need it; for example, when they have data to analyze. Advanced topics may be more useful later in a program than in the first few weeks for new doctoral students.
  4. Sequencing: in what order should you teach? Some workshops will have obvious prerequisites. For example, an introduction to the R programming language should take place before a workshop that teaches statistics through R programming.
  5. External constraints: what other commitments do your learners have? Consult the student handbook so that you can schedule around training that is already provided. Avoid deadlines for first-year reports and major conferences in the field.
  6. Resources: who is available to lead the workshops? How much time do they have? How is this time funded? The Carpentries’ community approach helps by bringing back former learners as helpers, who then build their own data skills by training others.
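The sequencing question above can be treated as a small dependency problem. The following is a minimal sketch, assuming Python 3.9 or later for the standard-library `graphlib` module. The workshop names and their prerequisite links are hypothetical examples, not our actual curriculum, and `teaching_order` is a helper name invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workshops; each maps to the set of workshops that must come first.
prerequisites = {
    "Unix shell basics": set(),
    "Introduction to R": set(),
    "Statistics with R": {"Introduction to R"},
    "Computing cluster introduction": {"Unix shell basics"},
    "Workflow management systems": {
        "Unix shell basics",
        "Computing cluster introduction",
    },
}


def teaching_order(prereqs):
    """Return one valid teaching order in which prerequisites always come first.

    Raises graphlib.CycleError if the prerequisites are circular.
    """
    return list(TopologicalSorter(prereqs).static_order())


order = teaching_order(prerequisites)
for position, workshop in enumerate(order, start=1):
    print(f"{position}. {workshop}")
```

Several orders can satisfy the same prerequisites, so the result is a starting point to adjust around the timing and external constraints listed above.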

Long-term training

Data science is here for the long haul, and your program will need to evolve with changing needs. Collecting feedback and, more importantly, acting on it will keep your program relevant and effective. Community-developed materials help to spread the burden of updating your lessons, and you can pay it forward by contributing fixes and updates based on your experiences.

Alison Meynert is a Principal Investigator and Senior Manager of Bioinformatics Analysis in the MRC Human Genetics Unit, and Edward Wallace is a Sir Henry Dale Fellow in the School of Biological Sciences, both at the University of Edinburgh.

Sam D. Gomez