
Review 1: "SARS-CoV-2 RECoVERY: a multi-platform open-source bioinformatic pipeline for the automatic construction and analysis of SARS-CoV-2 genomes from NGS sequencing data"

Reviewer: Martin Hölzer (Robert Koch Institute) 📒📒📒 ◻️◻️

Published on Apr 14, 2022
This Pub is a Review of
SARS-CoV-2 RECoVERY: a multi-platform open-source bioinformatic pipeline for the automatic construction and analysis of SARS-CoV-2 genomes from NGS sequencing data
Description

Abstract

Background
Since its first appearance in December 2019, the novel Severe Acute Respiratory Syndrome Coronavirus type 2 (SARS-CoV-2) has spread worldwide, causing an increasing number of cases and deaths (35,537,491 and 1,042,798, respectively, at the time of writing; https://covid19.who.int). Similarly, the number of complete viral genome sequences produced by Next Generation Sequencing (NGS) has increased exponentially. NGS enables a rapid accumulation of a large number of sequences. However, bioinformatics analyses are critical and require combined approaches for data analysis, which can be challenging for non-bioinformaticians.

Results
A user-friendly and sequencing-platform-independent bioinformatics pipeline, named SARS-CoV-2 RECoVERY (REconstruction of CoronaVirus gEnomes & Rapid analYsis), has been developed to build complete SARS-CoV-2 genomes from raw sequencing reads and to investigate variants. The genomes built by SARS-CoV-2 RECoVERY were compared with those obtained using other available software, revealing comparable or better performances of SARS-CoV-2 RECoVERY. Depending on the number of reads, complete genome reconstruction and variant analysis can be achieved in less than one hour. The pipeline was implemented in the multi-usage open-source Galaxy platform, allowing easy access to the software and providing computational and storage resources to the community.

Conclusions
SARS-CoV-2 RECoVERY is a piece of software destined for the scientific community working on SARS-CoV-2 phylogeny and molecular characterisation, providing a performant tool for the complete reconstruction and variant analysis of the viral genome. Additionally, the simple software interface and the ability to use it through a Galaxy instance, without the need to implement computing and storage infrastructures, make SARS-CoV-2 RECoVERY a resource also for virologists with little or no bioinformatics skills.

Availability and implementation
The pipeline SARS-CoV-2 RECoVERY (REconstruction of COronaVirus gEnomes & Rapid analYsis) is implemented in the Galaxy instance ARIES (https://aries.iss.it).

RR:C19 Evidence Scale rating by reviewer:

  • Potentially informative. The main claims made are not strongly justified by the methods and data, but may yield some insight. The results and conclusions of the study may resemble those from the hypothetical ideal study, but there is substantial room for doubt. Decision-makers should consider this evidence only with a thorough understanding of its weaknesses, alongside other evidence and theory. Decision-makers should not consider this actionable, unless the weaknesses are clearly understood and there is other theory and evidence to further support it.

***************************************

Review:


Here, the authors aim to provide an all-in-one pipeline for the reconstruction of SARS-CoV-2 genomes from different sequencing technologies. While such easy-to-use pipelines are needed by the worldwide community to rapidly reconstruct genomes for molecular surveillance and the detection of emerging variants, they also need to be accurate to support decision-making, even based on single nucleotide changes. Unfortunately, I think that the pipeline in its current state does not produce high-quality genome sequences. While the tools used seem, to some extent, reasonable for short-read data, they will fail to reconstruct accurate genomes from Nanopore data.

Thus, I highly recommend either focusing the pipeline on short reads only or including proper analysis steps and tools that also support Nanopore data. In its current state, I would not recommend the pipeline for Nanopore data at all.

Major

[1] Read quality analysis and trimming

As described, the authors use Trimmomatic for basic read QC. However, the removal of remaining 5’ and/or 3’ adapter sequences or primer sequences, in particular for Illumina protocols, is a crucial step that can also impact mapping/variant calling if not done properly. I recommend adding additional functionality for adapter trimming and primer clipping. For example, adapter trimming can be performed via fastp (which could also be a general replacement for Trimmomatic in terms of speed) while providing the adapter sequences in FASTA format. For primer clipping (e.g. for amplicons derived from Illumina's CleanPlex protocol), I can recommend bamclipper. This might complicate the workflow but is crucial for specific sequencing protocols, such as amplicon-based ones. Regarding Nanopore reads: do the authors also trim them with Trimmomatic? If so, there is no need; normally, Nanopore data is only filtered by length. E.g., many labs use the well-established ARTIC amplicon protocol and, for example, select only reads between 400 and 700 nt (V3 protocol) for further processing.
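The length filter mentioned for ARTIC-style Nanopore data can be sketched as follows. This is a toy illustration only (real workflows would use a dedicated tool such as `artic guppyplex` on FASTQ files); the function name and the 400–700 nt window are taken from the review text, the read set is invented.

```python
# Length-filter Nanopore reads to the expected amplicon size range,
# dropping fragments and chimeras outside the window (here ARTIC V3: 400-700 nt).

def filter_by_length(reads, min_len=400, max_len=700):
    """Keep only reads whose length falls inside the amplicon window."""
    return [r for r in reads if min_len <= len(r) <= max_len]

reads = ["A" * 120, "A" * 450, "A" * 650, "A" * 900]  # toy read set
kept = filter_by_length(reads)
print(len(kept))  # -> 2 (the 120 nt fragment and 900 nt chimera are dropped)
```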

[2] Subtraction of human sequences

I recommend not only mapping against the human reference genome but rather generating an index out of human+SARS-CoV-2. Otherwise, it could happen that (short) reads map sub-optimally against the human genome, which includes a not inconsiderable amount of endogenous viral elements. Do the authors map Nanopore reads with Bowtie2 as well, or with Minimap2?
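The idea behind the combined index is competitive mapping: each read is assigned to whichever reference it aligns to best, rather than being subtracted against human alone. A minimal sketch of that decision rule (the function, reference names, and scores are illustrative, not from any real aligner):

```python
# Toy illustration of competitive mapping with a combined human+viral index:
# a read is assigned to the reference with the best alignment score, so reads
# from endogenous-viral-element-like regions are only kept as viral if the
# viral hit is genuinely better than the human one.

def assign_read(scores):
    """scores: dict mapping reference name -> alignment score; best wins."""
    return max(scores, key=scores.get)

# A read scoring against both references is assigned to the better hit.
read_scores = {"human": 58, "sars_cov_2": 60}
print(assign_read(read_scores))  # -> sars_cov_2
```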

[3] Contig assembly

I am unsure whether the authors also use SPAdes for Nanopore data. I would recommend specialized long-read assembly tools such as Flye. Besides, it is questionable whether such a de novo step is needed at all if the authors construct the consensus reference-based.

[4] Genome reconstruction

The authors use mpileup and bcftools from SAMtools for variant calling and consensus reconstruction. While these are basic tools for such tasks, there are also more sophisticated variant callers already used by the SARS-CoV-2 community, such as LoFreq, FreeBayes, or GATK. Also, parameter settings such as allele frequency cutoffs are important to consider. For Nanopore data, the procedure used will result in many false variant calls (see below).
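The allele-frequency cutoff mentioned above can be sketched in a few lines. This is an illustration of the filtering logic only, not any particular caller's implementation; the thresholds, field names, and example calls are invented.

```python
# Minimal sketch of an allele-frequency (AF) cutoff for consensus variant
# calling: variants supported by too small a fraction of reads, or at too
# shallow a depth, are discarded as likely sequencing errors.

def filter_variants(variants, min_af=0.7, min_depth=10):
    """variants: list of dicts with 'pos', 'alt_reads', 'depth' (toy schema)."""
    kept = []
    for v in variants:
        af = v["alt_reads"] / v["depth"]
        if v["depth"] >= min_depth and af >= min_af:
            kept.append(v["pos"])
    return kept

calls = [
    {"pos": 241,   "alt_reads": 95, "depth": 100},  # well-supported variant
    {"pos": 3037,  "alt_reads": 6,  "depth": 100},  # likely error (AF = 0.06)
    {"pos": 23403, "alt_reads": 4,  "depth": 5},    # depth too shallow
]
print(filter_variants(calls))  # -> [241]
```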

[5] Nanopore

The pipeline lacks important steps needed for the proper analysis of Nanopore data. After mapping reads with e.g. minimap2, polishing steps (e.g. Racon, Medaka) are needed to reduce errors in Nanopore data. Also, variant calling should not be performed with default tools such as SAMtools but rather using machine learning models, e.g. as implemented in Medaka. If I understand Tab. 1 correctly, the results clearly show that the pipeline is not working for Nanopore data: 96% of consensus sequences with different nucleotide calls. Sure, Genome Detective is not much better (90%), but it might be that this tool is also not suitable for analyzing Nanopore data.
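A related safeguard that established SARS-CoV-2 pipelines apply, and that helps avoid exactly these false calls, is masking low-coverage consensus positions with N rather than emitting an uncertain base. A toy sketch of that step (thresholds and sequences are invented; this is not the paper's method):

```python
# Sketch of depth-based consensus masking: positions with insufficient read
# depth are replaced by 'N' instead of risking a false base call, which
# matters most for error-prone Nanopore data.

def mask_low_coverage(consensus, depths, min_depth=20):
    """Replace consensus bases with 'N' wherever per-position depth is low."""
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(consensus, depths)
    )

seq    = "ACGTACGT"
depths = [50, 45, 3, 40, 0, 60, 55, 12]  # toy per-position read depths
print(mask_low_coverage(seq, depths))  # -> ACNTNCGN
```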

[6] Benchmark

First of all, it is unclear which genomes the authors used from their pipeline: the reference-based or the de novo reconstructed ones? The authors report that the genomes produced are generally longer than those produced by CLC or Genome Detective. I wonder whether these tools perform de novo assemblies of the reads or also use a reference-guided consensus strategy. Most pipelines currently available (such as https://github.com/connor-lab/ncov2019-artic-nf, https://github.com/replikation/poreCov, https://gitlab.com/RKIBioinformaticsPipelines/ncov_minipipe, …) perform reference-based reconstructions and thus rely on the length of the (Wuhan) reference genome. Thus, the length of the consensus genome is not necessarily a meaningful quality metric. I also wonder why the pipeline on average produced longer genomes than the GISAID references, which might also be assembled reference-based. Are the reconstructions extended at the 5’ and 3’ ends of the genome?

Besides, CLC seems to perform much better than the pipeline regarding a very important metric: the percentage of consensus sequences with different nucleotide calls.
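The metric at issue, differing nucleotide calls between two consensus sequences, is straightforward to compute once the sequences are aligned, and it is far more informative than raw genome length. A minimal sketch (assumes the two sequences are already aligned to the same length; the example sequences are invented):

```python
# Sketch of a per-base comparison between two aligned consensus sequences:
# the fraction of positions with differing nucleotide calls is a more
# meaningful quality metric than consensus length.

def fraction_different(a, b):
    """Fraction of differing calls between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

ref = "ACGTACGTAC"
qry = "ACGTTCGTAC"  # one mismatch out of ten positions
print(fraction_different(ref, qry))  # -> 0.1
```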
