Hybrid Genome Assembly Using Long- and Short-Read Data

In this practical, we will assemble the genome of Klebsiella pneumoniae using the short and long reads that we trimmed in the previous tutorials.

Getting the data

We trimmed the raw short reads and the long reads in the previous steps of the training. We will now assemble them into longer sequences called contigs with the Unicycler pipeline, using a hybrid assembly approach that combines both long and short reads.

Find your 3 fastq(.gz) files containing the trimmed reads for strain K2. Note their exact names: the commands below must use the same file names as the ones you see here.

cd results
ls -l
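
Before assembling, it can help to check that the three files are not empty or truncated. The sketch below counts the reads in a FASTQ file, assuming the standard layout of 4 lines per read; the function name count_reads is only illustrative and is not part of the training material.

```shell
# Count the reads in a FASTQ file (plain or gzipped).
# Assumes the standard FASTQ layout: one read = 4 lines.
count_reads() {
  # gzip -cdf decompresses .gz files and passes plain files through unchanged
  gzip -cdf "$1" | awk 'END { print NR / 4 }'
}

# Example (adjust the pattern to the file names listed by ls):
# for f in K2_*.fastq*; do echo "$f: $(count_reads "$f") reads"; done
```

The two short-read files (R1 and R2) should normally contain the same number of reads if they were trimmed as pairs.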

Unicycler hybrid assembly

As you might have noticed, assemblies can take some time. For that reason, we will prepare a unicycler_assembly_K2.sh script that we will run on the Slurm job scheduler, using the sbatch command.

Tip

Ask the teacher which partition to use!

#!/bin/bash
## SLURM CONFIG ##

#SBATCH --job-name=unicycler
#SBATCH --output=%x.%j.out
#SBATCH --cpus-per-task 4
#SBATCH --time=24:00:00
#SBATCH -p SELECTED_PARTITION
#SBATCH --mail-type=FAIL,END
#SBATCH --mem-per-cpu=4G

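## TOOLS ##
# Racon is needed by Unicycler for the long-read steps of the assembly
module load bioinfo/racon/1.4.3
module load bioinfo/unicycler/0.4.4

# Hybrid assembly: -1 and -2 are the paired short reads, --long the long reads.
# Adjust the file names to match your trimmed reads; -t should match --cpus-per-task.
unicycler -1 K2_Illu_R1_trimmed.fastq -2 K2_Illu_R2_trimmed.fastq --long K2_MinION.fastq.gz \
  -o K2_unicycler_assembly -t 4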

When your script is ready, submit the job to Slurm with the following command (exit the srun session first if it is still active):

sbatch unicycler_assembly_K2.sh

Then check that your job is running (state "R" in the ST column) by typing:

squeue

## to see only your jobs, select the user
squeue -u your_login

The result of the assembly is in the directory K2_unicycler_assembly, under the name assembly.fasta. First, have a look at the Unicycler output directory.

Question

What are the different files there?

Check the assembly graph (GFA file) with Bandage. To do so, first copy the file to your own machine by running the scp command from your computer.
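
The copy could look like the sketch below. The login, the host name and the remote directory are placeholders (assumptions, not values from this training), and assembly.gfa is the name Unicycler normally gives to its final graph; check the output directory if yours differs.

```shell
# Run this on your own computer, not on the cluster.
# LOGIN, HOST and REMOTE_DIR are placeholders: replace them with your own values.
LOGIN="your_login"
HOST="cluster.example.org"
REMOTE_DIR="results/K2_unicycler_assembly"
SRC="${LOGIN}@${HOST}:${REMOTE_DIR}/assembly.gfa"

# Dry run: print the command. Remove "echo" to actually copy the file
# into the current directory.
echo scp "$SRC" .
```

Once the file is on your computer, you can load it in Bandage to inspect the graph.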

Let's now create a symbolic link to the file containing the assembled scaffolds, to simplify running QUAST and BUSCO:

ln -s K2_unicycler_assembly/assembly.fasta K2_unicycler_scaffolds.fasta

and look at it

head K2_unicycler_scaffolds.fasta
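
Since head only shows the first lines, a quick overview of all sequences can be obtained from the FASTA header lines alone. This is a minimal sketch; the function name fasta_summary is illustrative. Unicycler normally writes the length, the read depth and a circular flag in each header, but check your own file to confirm.

```shell
# Print the number of sequences in a FASTA file, then all its header lines.
fasta_summary() {
  echo "sequences: $(grep -c '^>' "$1")"
  grep '^>' "$1"
}

# Example:
# fasta_summary K2_unicycler_scaffolds.fasta
```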

Quality of the Assembly

QUAST is a tool that evaluates the quality of genome assemblies by computing various metrics, including the total assembly length, the number of contigs, N50 and L50.

First, you may need to run the srun command to reserve resources, since the previous step was submitted with sbatch:

srun -p SELECTED_PARTITION --cpus-per-task 2 --pty bash -i

Run QUAST on your assembly (run the srun command first if you have not done so already):

module load bioinfo/quast/5.0.2
module load bioinfo/bedtools/2.30.0
module load bioinfo/minimap2/2.24

quast.py -o K2_unicycler_quast -t 2 --conserved-genes-finding --gene-finding \
  --pe1 K2_Illu_trimmed_R1.fastq.gz --pe2 K2_Illu_trimmed_R2.fastq.gz K2_unicycler_scaffolds.fasta

and take a look at the text report

cat K2_unicycler_quast/report.txt

Question

How well do the total size and coverage of the assembly correspond to your earlier estimate?

Question

How many contigs in total did the assembly produce?

Question

Has the assembly improved compared with the short-read-only assembly, for example in terms of N50 and L50?
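
As a reminder of what these two metrics mean, they can also be computed directly from a FASTA file. The sketch below is an illustration, not the way QUAST is implemented; the function name n50_l50 is hypothetical. Note that QUAST ignores contigs shorter than 500 bp by default, so its values may differ slightly on fragmented assemblies.

```shell
# Compute N50 and L50 from a FASTA file.
# N50: length of the contig at which the cumulative length, counted from the
#      longest contig downwards, reaches half of the total assembly length.
# L50: number of contigs needed to reach that point.
n50_l50() {
  # 1) print one length per sequence (sequences may span several lines)
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1" |
    # 2) sort the lengths from longest to shortest
    sort -rn |
    # 3) walk down the list until half of the total length is covered
    awk '{ lens[NR] = $1; total += $1 }
         END {
           for (i = 1; i <= NR; i++) {
             cum += lens[i]
             if (cum >= total / 2) { print "N50=" lens[i], "L50=" i; exit }
           }
         }'
}

# Example:
# n50_l50 K2_unicycler_scaffolds.fasta
```

A higher N50 together with a lower L50 indicates a more contiguous assembly.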

Let's now check the completeness of the assembly, in terms of expected essential genes, with BUSCO.

Assembly Completeness

You can find more information about BUSCO in the short-read assembly tutorial. Let's run it!

module load bioinfo/BUSCO/5.2.2

busco -i K2_unicycler_scaffolds.fasta -o K2_unicycler_busco --mode genome --lineage_dataset enterobacterales_odb10
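
Once BUSCO has finished, the one-line score can be pulled out of its short summary. This sketch assumes that BUSCO wrote a short_summary*.txt file at the top of the output directory, which is its usual behaviour in version 5; the exact file name is not taken from this training, hence the wildcard. The function name busco_score is illustrative.

```shell
# Print the one-line BUSCO score, of the form
# C:..%[S:..%,D:..%],F:..%,M:..%,n:..
# from the short summary file(s) found in a BUSCO output directory.
busco_score() {
  grep -h 'C:[0-9.]*%' "$1"/short_summary*.txt
}

# Example:
# busco_score K2_unicycler_busco
```

In this notation, C is the percentage of complete marker genes (S single-copy, D duplicated), F fragmented, M missing, and n the total number of markers searched.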

Question

How many marker genes has BUSCO found? Has this number improved compared to the previous short-read-only and long-read-only assemblies?