Homework

Bioinformatic Summer School - Comparative Genomics and Metataxonomic Analysis

The overall teaching aim of this module is to make students competent in independently performing bioinformatic analyses on the command line. This includes comparative analysis of bacterial genomes and metataxonomic analysis.

Specifically, we will:

For comparative genomics:

Download all genomes of a bacterial species
Annotate the genetic content of these genomes
Profile the prophage content of these genomes
Profile the secondary metabolism of these genomes
Build a whole-genome phylogeny of the genomes
Infer the linkage between data-driven analysis and fundamental biology

For metataxonomy:

Download 16S data
Demultiplex the data
Clean, filter and denoise the data
Taxonomically classify the data
Statistically analyze the data

The successful student will

Have a working knowledge of the Linux command line
Be able to use suggested command line tools
Investigate the bioinformatic literature to find further tools relevant for the biological questions
Infer the biological relevance of the genetic content in bacteria
Infer the composition of microbiomes
Discuss the biological implications of phylogeny, genetic content, microbial composition and the relatedness of all the above

Homework

1. Linux install and BASH

Installation of a Linux environment. Most bioinformatics is done on the Linux operating system rather than, for example, Windows. Ubuntu is a free and widely used Linux distribution. If you run Windows, you can install an Ubuntu terminal directly through the Windows Subsystem for Linux (WSL). Just follow these instructions: https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview

When you have successfully installed Ubuntu, make sure you update it with these two commands:

sudo apt update

sudo apt upgrade
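Once the update has finished, a quick sanity check along the lines of the sketch below can confirm that you are actually sitting in a Linux shell. This is only a suggested check and not part of the official course commands. It assumes that a WSL kernel reports the word "microsoft" in its release string, which is usually the case but may differ on your setup.

```shell
# Print the kernel name; on Ubuntu (including Ubuntu under WSL) this is "Linux".
kernel_name="$(uname -s)"
echo "Kernel: ${kernel_name}"

# Print the distribution name if the standard os-release file is readable.
# The subshell keeps the variables from that file out of the current shell.
if [ -r /etc/os-release ]; then
    ( . /etc/os-release; echo "Distribution: ${PRETTY_NAME:-unknown}" )
fi

# Assumption: WSL kernels include "microsoft" in the release string.
case "$(uname -r)" in
    *[Mm]icrosoft*) echo "Running under WSL" ;;
    *) echo "Not detected as WSL" ;;
esac
```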

**IMPORTANT**

THE COMMANDS OF THE COURSE ARE SPECIFICALLY MADE FOR A WINDOWS SYSTEM RUNNING THE WINDOWS SUBSYSTEM FOR LINUX (WSL) AS INSTALLED ABOVE – I STRONGLY SUGGEST THAT YOU ONLY USE THAT

IF YOU HAVE AN OLD VERSION OF UBUNTU ON YOUR WINDOWS MACHINE, GET RID OF IT AND INSTALL AS ABOVE

IF YOU HAVE UBUNTU THROUGH VIRTUALBOX, GET RID OF IT AND INSTALL AS ABOVE

IF YOU HAVE ANYTHING OTHER THAN AS DESCRIBED ABOVE, GET RID OF IT AND REINSTALL AS ABOVE

WE ALWAYS SPEND WAY TOO MUCH TIME TRYING TO MAKE ALTERNATIVES WORK, SO PLEASE JUST DO AS INSTRUCTED

**EVEN MORE IMPORTANT**

PLEASE JUST DO THE ABOVE, IT WILL SAVE YOU SO MUCH TIME

macOS users are already running a Unix-like system underneath the macOS interface (it is not Linux, but it behaves very similarly on the command line). You have access to the command line through a standard program called ‘Terminal’. The work we will be doing should all work here, with some minor differences.

Regardless of your operating system, make sure you have a working terminal and then do a Linux tutorial here:

https://app.datacamp.com/learn/courses/introduction-to-bash-scripting

The basic programming language of the Linux command line is called bash, and we will be using this extensively. If you show up with no knowledge of this, you will probably not learn a whole lot, so do the tutorial and play around as much as you can.
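To give a feel for what bash looks like, here is a minimal sketch using variables, a loop, a glob and a pipe. The file names and sequences are made up for illustration and are not course data. Everything happens in a temporary directory that is deleted at the end.

```shell
# Remember where we started, then work in a throwaway directory.
start_dir="$PWD"
workdir="$(mktemp -d)"
cd "$workdir"

# Create three small example FASTA files (hypothetical sample names).
for sample in sampleA sampleB sampleC; do
    printf '>%s\nACGTACGT\n' "$sample" > "${sample}.fasta"
done

# Count the FASTA files by looping over a glob.
fasta_count=0
for file in *.fasta; do
    fasta_count=$((fasta_count + 1))
done
echo "Found ${fasta_count} FASTA files"

# Count sequence headers (lines starting with ">") across all files.
header_count="$(cat *.fasta | grep -c '^>')"
echo "Found ${header_count} sequence headers"

# Go back and clean up.
cd "$start_dir"
rm -rf "$workdir"
```

Try changing the sample names or adding a second sequence to one of the files and see how the two counts respond.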

2. Conda installation

One of the trickiest parts of bioinformatics is the installation of packages. Package A needs version X of package B, but package C needs version Y of package B, and the two versions might not be able to coexist. Luckily, the conda package manager takes care of this for us by working out the details of these dependencies, and it allows us to make individual ‘environments’ for each set of packages for each analysis.
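As a rough sketch of the environment idea, the snippet below defines a small helper that creates a named environment with the packages you list. The helper name `create_analysis_env`, the environment names and the choice of the conda-forge and bioconda channels are assumptions made for illustration, not course requirements. It can only be used once conda has been installed with the instructions that follow.

```shell
# Hypothetical helper: create a named conda environment containing the
# listed packages, drawing from the conda-forge and bioconda channels.
create_analysis_env() {
    env_name="$1"
    shift
    conda create --yes --name "$env_name" \
        --channel conda-forge --channel bioconda "$@"
}

# Example calls (commented out so nothing is installed by accident).
# Two environments can hold conflicting versions of the same package:
# create_analysis_env env-old python=3.10
# create_analysis_env env-new python=3.12
# conda activate env-old    # switch into one environment
# conda env list            # list all environments on this machine
```

Because each environment is isolated, the two Python versions above never see each other, which is exactly the "version X versus version Y" situation described in the text.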

Follow the instructions here (for the Miniconda installation):

https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

For Mac users, follow this:

https://docs.conda.io/projects/conda/en/latest/user-guide/install/macos.html

As the first exercise to check your conda installation, run this code (from your ‘base’ environment):

conda install mamba -n base -c conda-forge

This will install mamba, a faster reimplementation of the conda package manager. Conda is pretty good, but working out dependencies can become slow as the number of packages grows, and mamba solves that.
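To confirm that both tools ended up on your PATH, something like the following sketch can be used. The helper name `report_tool` is made up for this example. It only prints the first line of each tool's `--version` output and installs nothing.

```shell
# Report the version of a tool if it is on the PATH, otherwise say so.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $("$1" --version 2>&1 | head -n 1)"
    else
        echo "$1: not found on PATH"
    fi
}

conda_report="$(report_tool conda)"
mamba_report="$(report_tool mamba)"
echo "$conda_report"
echo "$mamba_report"
```

If either line says "not found on PATH", close and reopen the terminal first, since the installer's changes often only take effect in a new shell session.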

3. Installation of R and RStudio

We need the programming language R for the metataxonomy. R is a statistically minded programming language, which is great for statistics and for plotting – all my plots are made in R.

We will use a great integrated development environment (IDE) for R called RStudio. RStudio makes writing and running code exceptionally easy. RStudio runs directly in Windows (or macOS), so don’t mix it up with Linux!

First you install R:

https://www.r-project.org/

Then you install RStudio:

https://posit.co/download/rstudio-desktop/
