Context

With the rise of long-read sequencing technologies, it is now possible to obtain reads that are tens of kilobases long. Recently, the advent of high-fidelity (HiFi) PacBio long reads has drastically improved the accuracy of SMRT sequencing platforms. In this post, I test the assembly of a rice genome using the Hifiasm assembler.

Installation

For the installation, I followed the developer's recommendations.


$ git clone https://github.com/chhylp123/hifiasm

$ cd hifiasm && make

Note: Make sure that g++ and zlib are installed before running make.

I am using a CentOS 7 distribution, so I installed g++ and zlib with the following commands:


$ sudo yum install gcc-c++


$ sudo yum install zlib-devel

Data retrieval

For this assembly, I will use the rice O. sativa MH63 HiFi data. I grabbed the dataset from the European Nucleotide Archive (ENA) under accession number SRR9969481.



# Download the data


$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR996/001/SRR9969481/SRR9969481_subreads.fastq.gz


# Rename the data

$ mv SRR9969481_subreads.fastq.gz Rice.fastq.gz


# Create a link in my working directory


$ ln -fs ~/datafiles/002_o.sat.MH63_PacBio.HiFi.14kb/Rice.fastq.gz .
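
Before launching a long assembly, it can be worth checking that the download is complete. The sketch below is not part of the original workflow: it counts reads and bases in a gzipped FASTQ file, assuming the standard four lines per record. The function name fastq_stats and the tiny demo file are made up for illustration; in practice one would point the function at Rice.fastq.gz.

```shell
#!/bin/bash
# Count reads and bases in a gzipped FASTQ file.
# Assumption: each record spans exactly 4 lines, with the sequence on line 2.
fastq_stats() {
    gzip -dc < "$1" | awk 'NR % 4 == 2 { reads++; bases += length($0) }
        END { printf "reads=%d bases=%.0f\n", reads, bases }'
}

# Tiny made-up FASTQ (2 reads, 10 bases) used only to demonstrate the function.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' | gzip -c > "$demo"
fastq_stats "$demo"
rm -f "$demo"
```

A truncated download usually shows up either as a gzip error or as a read count far below what the archive reports for the accession.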

Running

Using the vim text editor, I wrote a small bash script named run_hifiasm.sh:



#!/bin/bash

set -e


~/program/hifiasm/hifiasm -o Rice.asm -t 32 Rice.fastq.gz # assemble the genome



awk '/^S/{print ">"$2;print $3}' Rice.asm.p_ctg.gfa > Rice.asm.p_ctg.fa # get primary contigs in FASTA
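
To make the awk step less opaque, here is a small sketch of what it does on a toy GFA file. In GFA, segment lines start with S and carry the segment name in column 2 and its sequence in column 3; every other record type is skipped by the /^S/ pattern. The function name gfa_to_fasta and the contig names in the demo are hypothetical.

```shell
#!/bin/bash
# Convert GFA segment (S) lines to FASTA: column 2 is the name, column 3 the sequence.
gfa_to_fasta() {
    awk '/^S/ { print ">" $2; print $3 }' "$1"
}

# Made-up GFA with two segments and one non-segment line, for illustration only.
demo=$(mktemp)
printf 'S\tctg1\tACGTAC\tLN:i:6\nA\tctg1\t0\t+\tread1\t0\t6\nS\tctg2\tGGCC\tLN:i:4\n' > "$demo"
gfa_to_fasta "$demo"
rm -f "$demo"
```

Each segment becomes one FASTA record with its sequence on a single line, which is why the resulting file can be passed directly to tools such as assembly-stats.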

Then I ran the script with the following command:


$ /usr/bin/time -o out.ram.time.txt -v bash run_hifiasm.sh 2> log.hifiasm &

The trailing & starts the process in the background, so I could then type exit to leave the session while the job kept running.

Results

Let’s check the assembly statistics:


$ assembly-stats Rice.asm.p_ctg.fa

I got the following assembly statistics:



sum = 421362451, n = 1225, ave = 343969.35, largest = 37513714
N50 = 13841853, n = 12
N60 = 13500458, n = 15
N70 = 12402620, n = 18
N80 = 10695965, n = 22
N90 = 5876907, n = 27
N100 = 8552, n = 1225
N_count = 0
Gaps = 0
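
For readers who do not have assembly-stats at hand, the N50 and L50 values can be approximated with standard tools. The sketch below is an illustration, not the method assembly-stats uses internally: it sorts contig lengths in decreasing order and reports the length (N50) and rank (L50) of the contig at which the running sum first reaches half of the total assembly size. The function name n50 and the demo FASTA are made up.

```shell
#!/bin/bash
# Print N50 and L50 for a FASTA file (multi-line sequences are supported).
n50() {
    # Step 1: emit one length per sequence.
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort lengths from largest to smallest.
    sort -rn |
    # Step 3: walk down the list until half of the total length is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 cum += lens[i]
                 if (cum >= total / 2) {
                     printf "N50=%d L50=%d\n", lens[i], i
                     exit
                 }
             }
         }'
}

# Made-up FASTA with contigs of 8, 6, 4 and 2 bases (total 20).
demo=$(mktemp)
printf '>a\nAAAAAAAA\n>b\nCCCCCC\n>c\nGGGG\n>d\nTT\n' > "$demo"
n50 "$demo"
rm -f "$demo"
```

On the real assembly, one would call n50 Rice.asm.p_ctg.fa and compare the result with the N50 line reported above.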

Here are the resource usage statistics reported by /usr/bin/time for the run:



Command being timed: "bash run_hifiasm.sh"
        User time (seconds): 108396.17
        System time (seconds): 1079.09
        Percent of CPU this job got: 2615%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:09:46
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 21640460
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 7
        Minor (reclaiming a frame) page faults: 269598163
        Voluntary context switches: 11845273
        Involuntary context switches: 1155532
        Swaps: 0
        File system inputs: 20530400
        File system outputs: 13984408
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
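
The raw numbers in this report are easier to interpret after a little arithmetic. The sketch below, with hypothetical helper names, converts the maximum resident set size to GiB (assuming the kbytes reported by GNU time are 1024-byte units) and cross-checks the wall-clock time as (user time + system time) divided by the CPU percentage.

```shell
#!/bin/bash
# Convert a size in kbytes (assumed to be 1024-byte units) to GiB.
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / (1024 * 1024) }'
}

# Estimate wall-clock seconds from user time, system time and CPU percentage.
wall_seconds() {
    awk -v u="$1" -v s="$2" -v p="$3" 'BEGIN { printf "%.0f\n", (u + s) / (p / 100) }'
}

# Values copied from the /usr/bin/time -v report above.
echo "peak RSS: $(kb_to_gib 21640460) GiB"
echo "estimated wall-clock: $(wall_seconds 108396.17 1079.09 2615) s"
```

The estimated wall-clock value should agree with the elapsed time in the report (1:09:46, that is 4186 seconds), which confirms that the job kept roughly 26 of the 32 requested threads busy on average.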

Comments

Hifiasm (Cheng et al., 2021) is incredibly fast: the whole run took about 1 h 10 min of wall-clock time (roughly 30 h of user CPU time and ~18 min of system time) to assemble a ~430 Mb genome (Yu et al., 2002) with 32 threads.

Table 1: Basic assembly statistics of the R498 and MH63 assemblies

	MH63	R498	R498
Data type	PacBio HiFi	PacBio CLR	PacBio CLR
Assembler	Hifiasm	CANU	HERA
Assembly level	Contigs	Chromosome	Chromosome
Assembly size (bp)	421,362,451	390,983,850	391,626,037
Number of contigs	1,225	14	12
Largest contig (bp)	37,513,714	44,361,539	45,881,347
N50 (bp)	13,841,853	31,778,392	31,347,481
L50	12	6	6

The peak memory was also low, at about 20.6 GiB (maximum resident set size of 21,640,460 kbytes), showing that Hifiasm has a modest memory footprint for a genome of this size. The Hifiasm assembly size was 421 Mbp, representing about 98% of the estimated ~430 Mb genome size. Besides, the contiguity is quite high (N50 = 13.84 Mbp) compared with the recent PacBio CLR, linkage mapping and fosmid-based assembly released by Du et al. (2017). A basic comparison with the chromosome-scale assemblies produced by the CANU (Koren et al., 2017) and HERA (Du and Liang, 2019) assemblers from PacBio CLR data (Table 1) suggests that the Hifiasm contigs are a good starting point for Hi-C scaffolding.
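
The share of the estimated genome size covered by the assembly can be recomputed in one line. This is a minimal sketch with a hypothetical helper name; the 430 Mb figure is the approximate estimate quoted in these comments, so the result is only as precise as that estimate.

```shell
#!/bin/bash
# Percentage of an expected genome size covered by an assembly.
pct_of_estimate() {
    awk -v asm="$1" -v est="$2" 'BEGIN { printf "%.2f\n", 100 * asm / est }'
}

# Assembly size from assembly-stats; 430000000 is the approximate genome size estimate.
echo "assembly covers $(pct_of_estimate 421362451 430000000)% of the estimate"
```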