Hybrid Genome Assembly using long- and short-read data#
In this practical, we will assemble the genome of Klebsiella pneumoniae strain K2, using the short and long reads that we trimmed in the previous tutorials.
Getting the data#
We trimmed the raw short reads and long reads in the previous steps of the training. We will now assemble them into longer sequences called contigs with the Unicycler pipeline, using a hybrid approach that combines both read types.
Find your 3 fastq(.gz) files containing the trimmed reads for strain K2.
cd results
ls -l
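To confirm that the three files are the ones you expect, you can count the reads in each. This is a minimal sketch that assumes standard four-line FASTQ records; `count_reads` is a helper name introduced here for illustration, not part of any tool.

```shell
# Hypothetical helper: count the reads in a FASTQ file, gzip-compressed or not.
# Assumes standard 4-line FASTQ records (no wrapped sequence lines).
count_reads() {
    # gzip -cdf decompresses .gz input and passes plain text through unchanged
    gzip -cdf < "$1" | awk 'END { print NR / 4 }'
}

# Report the read count of every FASTQ file in the current directory
for f in *.fastq *.fastq.gz; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_reads "$f")"
    fi
done
```

The two paired short-read files should report the same number of reads if trimming kept the pairs in sync.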
Unicycler hybrid assembly#
As you might have noticed, assemblies can take some time. For that reason, we will prepare a script, unicycler_assembly_K2.sh, that we will submit to the Slurm job scheduler with the sbatch command.
Tip
Ask the teacher which partition to use!
#!/bin/bash
## SLURM CONFIG ##
#SBATCH --job-name=unicycler
#SBATCH --output=%x.%j.out
#SBATCH --cpus-per-task 4
#SBATCH --time=24:00:00
#SBATCH -p SELECTED_PARTITION
#SBATCH --mail-type=FAIL,END
#SBATCH --mem-per-cpu=4G
##
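# Load Unicycler and the Racon polisher that it calls
module load bioinfo/racon/1.4.3
module load bioinfo/unicycler/0.4.4
# Hybrid assembly: paired short reads (-1/-2) plus long reads (--long).
# Adjust the file names to match your own trimmed read files.
unicycler -1 K2_Illu_R1_trimmed.fastq -2 K2_Illu_R2_trimmed.fastq --long K2_MinION.fastq.gz \
-o K2_unicycler_assembly -t 4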
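Before submitting, it may be worth checking that the read files named in the script actually exist, since a job can wait in the queue only to fail on a missing input. The sketch below uses a hypothetical helper, `check_inputs`; the file names are copied from the Unicycler command and may differ from yours.

```shell
# Hypothetical helper: report which of the given files are missing or empty.
# Returns non-zero if at least one file is unusable.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# File names copied from the Unicycler command; adjust them to your own files
if check_inputs K2_Illu_R1_trimmed.fastq K2_Illu_R2_trimmed.fastq K2_MinION.fastq.gz; then
    echo "all inputs found"
fi
```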
When you think that your script is ready, submit the job to Slurm with this command (first exit the srun session if it is still active):
sbatch unicycler_assembly_K2.sh
Then check that your job is in the "Running" state by typing:
squeue
# To see only your own jobs, filter by user
squeue -u your_login
The result of the assembly is in the directory K2_unicycler_assembly, under the name assembly.fasta.
First, have a look at the Unicycler output directory.
Question
What are the different files there?
Check the assembly graph (GFA file) with Bandage. To do so, copy the file to your own computer by running the scp command from there.
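Before copying the graph, a quick count of its segments and links on the cluster already hints at how fragmented the assembly is. This sketch assumes the GFA 1 convention that segment lines start with `S` and link lines with `L`, and that the final graph is named assembly.gfa; `gfa_summary` is a name made up for this example.

```shell
# Hypothetical helper: count segment (S) and link (L) lines in a GFA file.
gfa_summary() {
    awk -F'\t' '
        $1 == "S" { segments++ }
        $1 == "L" { links++ }
        END { printf "segments=%d links=%d\n", segments, links }
    ' "$1"
}

# Assumed location of the final Unicycler graph; adjust if yours differs
if [ -f K2_unicycler_assembly/assembly.gfa ]; then
    gfa_summary K2_unicycler_assembly/assembly.gfa
fi
```

A graph with only a handful of segments is what you would hope for from a hybrid bacterial assembly; Bandage then shows how they connect.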
Let's now create a symbolic link to the file containing the assembled scaffolds, to simplify running QUAST and BUSCO:
ln -s K2_unicycler_assembly/assembly.fasta K2_unicycler_scaffolds.fasta
and look at it
head K2_unicycler_scaffolds.fasta
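Each header line (starting with `>`) marks one sequence, so the file can also be summarised from the command line. The following is a sketch only; `fasta_lengths` is a hypothetical helper that prints the name and length of every sequence.

```shell
# Hypothetical helper: print "name<TAB>length" for each sequence in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)   # first word of the header, without ">"
            len = 0
            next
        }
        { len += length($0) }      # sequence may span several lines
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# List the sequences of the assembly, longest first
if [ -f K2_unicycler_scaffolds.fasta ]; then
    fasta_lengths K2_unicycler_scaffolds.fasta | sort -k2,2nr
fi
```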
Quality of the Assembly#
QUAST is a tool that evaluates the quality of genome assemblies by computing various metrics, including the number of contigs, the total assembly length, N50 and L50.
First, we might need to run the srun command to book the resources, since the previous step used sbatch:
srun -p SELECTED_PARTITION --cpus-per-task 2 --pty bash -i
Run QUAST on your assembly (run the srun command first if you have not done so already):
module load bioinfo/quast/5.0.2
module load bioinfo/bedtools/2.30.0
module load bioinfo/minimap2/2.24
quast.py -o K2_unicycler_quast -t 2 --conserved-genes-finding --gene-finding \
--pe1 K2_Illu_trimmed_R1.fastq.gz --pe2 K2_Illu_trimmed_R2.fastq.gz K2_unicycler_scaffolds.fasta
and take a look at the text report
cat K2_unicycler_quast/report.txt
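The report is fairly long, so it can help to pull out only the rows needed for the questions. This sketch assumes the row labels used in QUAST text reports (`# contigs`, `Total length`, `Largest contig`, `N50`, `L50`); `quast_key_metrics` is a hypothetical helper name.

```shell
# Hypothetical helper: keep only a few headline rows of a QUAST report.txt.
# The pattern also matches the per-threshold rows such as "# contigs (>= 0 bp)".
quast_key_metrics() {
    grep -E '^(# contigs|Total length|Largest contig|N50|L50) ' "$1"
}

if [ -f K2_unicycler_quast/report.txt ]; then
    quast_key_metrics K2_unicycler_quast/report.txt
fi
```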
Question
How well do the total assembly size and coverage correspond to your earlier estimates?
Question
How many contigs in total did the assembly produce?
Question
Has the assembly improved compared with the short-read-only assembly, for example in terms of N50 and L50?
Let's now use BUSCO to check the completeness of the assembly in terms of expected essential genes.
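As a reminder, N50 is the length of the contig at which the cumulative length, taking contigs from longest to shortest, first reaches half of the total assembly size, and L50 is the number of contigs needed to get there. The sketch below recomputes both from a FASTA file with a hypothetical helper, `n50_l50`. QUAST ignores contigs shorter than 500 bp by default, so its values may differ slightly from this unfiltered calculation.

```shell
# Hypothetical helper: compute N50 and L50 from a FASTA file (no length filter).
n50_l50() {
    # Step 1: one sequence length per line
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }
        { len += length($0) }
        END { if (len > 0) print len }
    ' "$1" |
    # Step 2: longest first
    sort -rn |
    # Step 3: walk down the list until half of the total length is covered
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running * 2 >= total) {
                    printf "N50=%d L50=%d\n", lens[i], i
                    exit
                }
            }
        }
    '
}

if [ -f K2_unicycler_scaffolds.fasta ]; then
    n50_l50 K2_unicycler_scaffolds.fasta
fi
```

For example, contigs of 8, 6, 4 and 2 bp total 20 bp; the two longest together cover 14 bp, which passes the 10 bp halfway mark, so N50 is 6 and L50 is 2.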
Assembly Completeness#
You can find more information about BUSCO in the short-read assembly tutorial. Let's run it!
module load bioinfo/BUSCO/5.2.2
busco -i K2_unicycler_scaffolds.fasta -o K2_unicycler_busco --mode genome --lineage_dataset enterobacterales_odb10
Question
How many marker genes has BUSCO found?
Has this number improved compared with the previous short-read-only and long-read-only assemblies?
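To compare runs quickly, the one-line score can be extracted from the BUSCO short summary. This is a sketch under two assumptions: that the summary is written as a short_summary*.txt file inside the output directory, and that it contains a line of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:...`. `busco_score` is a hypothetical helper name.

```shell
# Hypothetical helper: print the one-line BUSCO score
# (C = complete, S = single-copy, D = duplicated, F = fragmented, M = missing, n = markers searched).
busco_score() {
    grep -o 'C:[0-9.]*%.*n:[0-9]*' "$1"
}

# Assumed location of the summary file; adjust the path if yours differs
for summary in K2_unicycler_busco/short_summary*.txt; do
    if [ -f "$summary" ]; then
        busco_score "$summary"
    fi
done
```

Running the same helper on the summaries of the earlier short-read-only and long-read-only assemblies gives three lines that can be compared side by side.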