Genome Annotation#

Lecture#


After you have de novo assembled your genome sequencing reads into contigs, it is useful to know what genomic features are on those contigs. The process of identifying and labelling those features is called genome annotation.

Prokka is a software tool that quickly annotates bacterial, archaeal and viral genomes and generates standard output files in GenBank, EMBL and GFF formats. It is a "wrapper": it brings together several pieces of software (from various authors), and so avoids "re-inventing the wheel".

Prokka finds and annotates the features (both protein-coding regions and RNA genes, i.e. tRNA and rRNA) present on a sequence. It annotates protein-coding regions in two steps: first, the protein-coding regions on the genome are identified using Prodigal; second, the function of each encoded protein is predicted by similarity to proteins in one of many protein or protein domain databases. More information about Prokka can be found here.

Prepare our computing environment#

We will first run the appropriate srun command to book the computing cores (cpus) on the cluster.

Tip

Ask the teacher which partition to use!

srun -p SELECTED_PARTITION --cpus-per-task 2 --pty bash -i

You are now on a computing node, with 2 CPUs reserved for you. That way, you can run commands interactively.
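
To confirm that the reservation worked, you can look at the environment variable that Slurm sets when `--cpus-per-task` is given. This is a minimal sketch; the helper name `reserved_cpus` is ours, not part of Slurm.

```shell
# Print the number of CPUs Slurm reserved for this task.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested;
# outside a Slurm job it is not defined, so we print "unset" instead.
reserved_cpus() {
  echo "${SLURM_CPUS_PER_TASK:-unset}"
}

reserved_cpus
```

On the computing node booked with the command above, this should print the number of CPUs you asked for.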

If you want to exit the srun interactive mode, press CTRL+D or type exit

Input data#

Prokka requires assembled contigs. You can now prepare your working directory for this annotation tutorial.

cd results
ls -l
mkdir annotation
cd annotation

You will link the hybrid assemblies of the 5 Klebsiella strains in this annotation directory:

ln -s ../unicycler_assemblies_5_Kp/K*_unicycler_scaffolds.fasta .
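
After creating the links, it can be worth checking that each one resolves to an existing file, since a symbolic link with a wrong relative path is created without any warning. The sketch below uses a hypothetical helper, `check_links`, that is not part of the course material.

```shell
# Report, for each path given, whether it points to an existing file.
# [ -e ] follows symbolic links, so a dangling link is reported as broken.
# Returns non-zero if at least one path is broken.
check_links() {
  rc=0
  for f in "$@"; do
    if [ -e "$f" ]; then
      echo "ok      $f"
    else
      echo "broken  $f"
      rc=1
    fi
  done
  return "$rc"
}

# Usage, from the annotation directory:
# check_links K*_unicycler_scaffolds.fasta
```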

You will also need a protein set specific to Klebsiella pneumoniae for the annotation. You can go to the UniProt/Swiss-Prot database website, search for all the protein sequences of the organism Klebsiella pneumoniae, select only the reviewed entries, and download them as a FASTA file.

Question

How many reviewed protein entries are available for Klebsiella pneumoniae in Swiss-Prot?

For your convenience, we have made the resulting file available. You can copy it by typing the following command:

cp /scratch/genesys_training/files/annotation/swissprot_kp_221107.fasta .
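
One way to check how many sequences this file contains is to count its header lines, since every FASTA record starts with a line beginning with `>`. The helper name `count_fasta_records` below is only illustrative.

```shell
# Count the records in a FASTA file: one header line (starting with '>') per record.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Usage, from the annotation directory:
# count_fasta_records swissprot_kp_221107.fasta
```

The number may differ from what the website shows today, because the file is a snapshot and the database keeps changing.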

Running prokka#

module load bioinfo/prokka/1.14.6

prokka --force --genus Klebsiella --species pneumoniae \
  --kingdom Bacteria --usegenus --proteins swissprot_kp_221107.fasta \
  --notrna --prefix K2 --outdir K2_prokka K2_unicycler_scaffolds.fasta

Once Prokka has finished, examine each of its output files.
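
As a starting point for this exploration, the sketch below summarises the GFF output by counting how many features of each type were annotated. It assumes the output directory and prefix used in the command above (`K2_prokka` and `K2`); the function name `count_gff_features` is ours.

```shell
# Count the features of each type (column 3) in a GFF file.
# Prokka appends the contig sequences after a "##FASTA" line, so we stop there.
count_gff_features() {
  awk -F'\t' '
    /^##FASTA/       { exit }       # the sequence section starts here
    !/^#/ && NF == 9 { n[$3]++ }    # a feature line has 9 tab-separated columns
    END              { for (t in n) print t "\t" n[t] }
  ' "$1" | sort
}

# Usage:
# ls -l K2_prokka                      # all the files written by Prokka
# count_gff_features K2_prokka/K2.gff  # e.g. number of CDS and rRNA features
# cat K2_prokka/K2.txt                 # summary statistics written by Prokka
```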

Preparing a Prokka script to run in a loop#

We now want to run Prokka on all our strains, so we can compare the annotations later.

We will prepare a prokka_kp.sh script for doing so, in the scripts directory. Here is an example:

#!/bin/bash
## SLURM CONFIG ##

#SBATCH --job-name=prokka
#SBATCH --output=%x.%j.out
#SBATCH --cpus-per-task 2
#SBATCH --time=24:00:00
#SBATCH -p SELECTED_PARTITION
#SBATCH --mail-type=FAIL,END
#SBATCH --mem-per-cpu=4G

# Variables: the first argument ($1) is the assembly FASTA file;
# removing the suffix gives the prefix used to name the outputs
suffix="_scaffolds.fasta"
prefix="${1%"$suffix"}"

module load bioinfo/prokka/1.14.6

prokka --force --genus Klebsiella --species pneumoniae \
          --kingdom Bacteria --usegenus --proteins swissprot_kp_221107.fasta \
          --notrna --prefix "${prefix}" --outdir "${prefix}_prokka" "${1}"
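
The script derives the output names from the input file name with a shell parameter expansion. The short sketch below shows what that expansion does on its own; the function name `prefix_of` is hypothetical and only used for illustration.

```shell
suffix="_scaffolds.fasta"

# ${var%pattern} removes the shortest match of pattern from the end of var.
prefix_of() {
  echo "${1%"$suffix"}"
}

prefix_of K2_unicycler_scaffolds.fasta   # prints K2_unicycler
```

With this naming, the results for strain K2 are written to `K2_unicycler_prokka`, not to the `K2_prokka` directory used in the interactive run.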

Once your script is ready, you can submit one job per assembly with a loop:

for i in K*scaffolds.fasta; do sbatch ../scripts/prokka_kp.sh "$i"; done
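
While the jobs are running, you may want to see which strains already have an annotation. The sketch below assumes the output naming of the example script (`<prefix>_prokka/<prefix>.gff`); the function name `report_prokka_outputs` is ours. A non-empty GFF file only suggests that Prokka got that far, so also check the Slurm `.out` files for errors.

```shell
# For each assembly, report whether the expected GFF file exists and is not empty.
# Assumes the naming used in prokka_kp.sh: <prefix>_prokka/<prefix>.gff
report_prokka_outputs() {
  for fasta in "$@"; do
    prefix="${fasta%_scaffolds.fasta}"
    gff="${prefix}_prokka/${prefix}.gff"
    if [ -s "$gff" ]; then
      echo "done     $gff"
    else
      echo "missing  $gff"
    fi
  done
}

# Usage, from the annotation directory:
# squeue -u "$USER"                        # jobs still pending or running
# report_prokka_outputs K*scaffolds.fasta
```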

Visualising the annotation#

Ugene is a graphical program for performing bioinformatics analyses. Notably, it lets you browse annotated genomes and curate annotations if needed. Download Ugene here and install it on your local computer.

Copy the .gff file produced by Prokka to your computer, and open it with Ugene.