Context
With the rise of long-read sequencing technologies, it is now possible to obtain reads of tens of kilobases. More recently, the advent of high-fidelity (HiFi) PacBio long reads has drastically improved the accuracy of the SMRT sequencing platform. In this post, I test the assembly of the rice genome using the Hifiasm assembler.
Installation
For the installation, I followed the developer's recommendations.
$ git clone https://github.com/chhylp123/hifiasm
$ cd hifiasm && make
Note: Make sure the g++ and zlib dependencies are installed.
I am using a CentOS 7 distribution, so I installed g++ and zlib with the following commands:
$ sudo yum install gcc-c++
$ sudo yum install zlib-devel
Data retrieval
For this assembly, I use the rice O. sativa MH63 HiFi data set, which I downloaded from the European Nucleotide Archive (ENA) under accession number SRR9969481.
# Download the data
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR996/001/SRR9969481/SRR9969481_subreads.fastq.gz
# Rename the data
$ mv SRR9969481_subreads.fastq.gz Rice.fastq.gz
# Create a link in my working directory
$ ln -fs ~/datafiles/002_o.sat.MH63_PacBio.HiFi.14kb/Rice.fastq.gz .
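Before launching a long assembly, it can be worth checking that the download is intact and getting a rough idea of how much data it contains. The following is only a sketch: fastq_summary is a hypothetical helper name (not part of Hifiasm), and it assumes the usual 4-line FASTQ record layout.

```shell
#!/bin/bash
# Sketch: sanity-check a gzipped FASTQ file before assembly.
# fastq_summary is a hypothetical helper name; it assumes 4-line FASTQ records.
fastq_summary() {
    # Fail early if the gzip stream is truncated or corrupted.
    gzip -t "$1" || return 1
    # Line 2 of every 4-line record holds the sequence.
    gzip -dc "$1" | awk 'NR % 4 == 2 { reads++; bases += length($0) }
        END { printf "reads = %.0f, bases = %.0f\n", reads, bases }'
}

# Usage (file name taken from the commands above):
# fastq_summary Rice.fastq.gz
```

Dividing the reported number of bases by the estimated genome size gives an approximate sequencing depth.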
Running
Using the vim text editor, I wrote a small bash script named run_hifiasm.sh:
#!/bin/bash
set -e
~/program/hifiasm/hifiasm -o Rice.asm -t 32 Rice.fastq.gz # assemble the genome
awk '/^S/{print ">"$2;print $3}' Rice.asm.p_ctg.gfa > Rice.asm.p_ctg.fa # get primary contigs in FASTA
Then I ran the script with the following command:
$ /usr/bin/time -o out.ram.time.txt -v bash run_hifiasm.sh 2> log.hifiasm &
The trailing & runs the job in the background; I then typed exit to close the terminal while the job kept running.
Results
Let's check the assembly statistics:
$ assembly-stats Rice.asm.p_ctg.fa
I got the following assembly statistics:
sum = 421362451, n = 1225, ave = 343969.35, largest = 37513714
N50 = 13841853, n = 12
N60 = 13500458, n = 15
N70 = 12402620, n = 18
N80 = 10695965, n = 22
N90 = 5876907, n = 27
N100 = 8552, n = 1225
N_count = 0
Gaps = 0
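As a rough cross-check, the total length, contig count and N50 can be recomputed directly from the FASTA file. This is a minimal sketch, not the code used by assembly-stats: fasta_n50 is a hypothetical helper name, and N50 is taken here as the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total.

```shell
#!/bin/bash
# Sketch: compute total length, contig count and N50 from a FASTA file.
# fasta_n50 is a hypothetical helper name, not part of assembly-stats or hifiasm.
fasta_n50() {
    # 1) Sum the sequence length of each record (sequences may span several lines).
    # 2) Sort the lengths from longest to shortest.
    # 3) Walk down the sorted list until half of the total length is covered.
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) {
                     printf "sum = %.0f, n = %d, N50 = %.0f\n", total, NR, lens[i]
                     exit
                 }
             }
         }'
}

# Usage (file name taken from the script above):
# fasta_n50 Rice.asm.p_ctg.fa
```

The sum and n values should match the first line of the assembly-stats report; a different N50 would point to a different definition or to a problem in the GFA to FASTA conversion.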
Here are the resource usage statistics reported by /usr/bin/time for the run:
Command being timed: "bash run_hifiasm.sh"
User time (seconds): 108396.17
System time (seconds): 1079.09
Percent of CPU this job got: 2615%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:09:46
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 21640460
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 7
Minor (reclaiming a frame) page faults: 269598163
Voluntary context switches: 11845273
Involuntary context switches: 1155532
Swaps: 0
File system inputs: 20530400
File system outputs: 13984408
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
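The raw report is easier to read once the key figures are converted to minutes, hours and gigabytes. The sketch below parses the file written by /usr/bin/time -v; time_summary is a hypothetical helper name, and the conversion assumes 1 GB = 1,000,000 kbytes.

```shell
#!/bin/bash
# Sketch: summarise a GNU time -v report.
# time_summary is a hypothetical helper name; 1 GB is taken as 1e6 kbytes.
time_summary() {
    awk -F': ' '
        /User time/                 { user = $2 }
        /System time/               { sys = $2 }
        /Maximum resident set size/ { rss = $2 }
        /Elapsed/ {
            # The value is h:mm:ss or m:ss; fold the fields into seconds.
            n = split($2, t, ":")
            wall = 0
            for (i = 1; i <= n; i++) wall = wall * 60 + t[i]
        }
        END {
            printf "wall clock: %.1f min\n", wall / 60
            printf "CPU time (user + system): %.1f h\n", (user + sys) / 3600
            printf "peak memory: %.2f GB\n", rss / 1e6
        }' "$1"
}

# Usage (file name taken from the command above):
# time_summary out.ram.time.txt
```

For the report above, this gives a wall-clock time of 69.8 min, 30.4 h of CPU time and a peak memory of 21.64 GB.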
Comments
Hifiasm (Cheng et al., 2021) is incredibly fast: with 32 threads, it assembled a ~430 Mb genome (Yu et al., 2002) in about 1 h 10 min of wall-clock time (about 18 min of system time and about 30 h of user CPU time summed across threads).
Table 1: Basic assembly statistics of the R498 and MH63 assemblies
Peak memory was also low, at about 21.64 GB (maximum resident set size of 21,640,460 kbytes), showing that Hifiasm is memory-efficient. The Hifiasm assembly size was 421 Mbp, about 98% of the estimated genome size. Besides, the contiguity is quite high (N50 = 13.84 Mbp) compared to the recent assembly based on PacBio CLR, linkage mapping and fosmids released by Du et al. (2017). A basic comparison with the chromosome-scale assemblies produced by the CANU (Koren et al., 2017) and HERA (Du and Liang, 2019) assemblers from PacBio CLR data (Table 1) suggests that the Hifiasm assembly is a good starting point for Hi-C scaffolding.