Next Generation Sequencing (NGS)/De novo assembly

De novo assembly
The generation of short reads by next generation sequencers has led to an increased need to assemble the vast number of short reads that are produced. This is no trivial problem, as the sheer number of reads makes it near impossible to use, for example, the overlap layout consensus (OLC) approach that had been used with longer reads. Therefore, most of the available assemblers that can cope with typical Illumina data use a k-mer-based de Bruijn graph approach.

A clear distinction has to be made based on the size of the genome to be assembled:
 * small (e.g. bacterial genomes: few Megabases)
 * medium (e.g. lower plant genomes: several hundred Megabases)
 * large (e.g. mammalian and plant genomes: Gigabases)

All de novo assemblers will be able to cope with small genomes and, given decent sequencing libraries, will produce relatively good results. Even for medium-sized genomes, most de novo assemblers mentioned here, and many others, will likely fare well and produce a decent assembly. That said, OLC-based assemblers might take weeks to assemble a typical genome. Large genomes are still difficult to assemble when only short reads (such as those provided by Illumina) are available. Assembling such a genome with Illumina reads will probably require a machine with about 256 GB, and potentially even 512 GB, of RAM, unless one is willing to use a small cluster or invest in commercial software.

Typical workflow
A genome assembly project, whatever its size, can generally be divided into stages:
 * 1) Experiment design
 * 2) Sample collection
 * 3) Sample preparation
 * 4) Sequencing
 * 5) Pre-processing
 * 6) Assembly
 * 7) Post-assembly analysis

Experiment design
Like any project, a good de novo assembly starts with proper experimental design. Biological, experimental, technical and computational issues have to be considered:


 * Biological issues: What is known about the genome?
 * How big is it? Bigger genomes will require more input material and more sequencing.
 * How frequent, how long and how conserved are repeat copies? More repetitive genomes will possibly require longer reads or long distance mate-pairs to resolve structure.
 * How AT rich/poor is it? Genomes which have a strong AT/GC imbalance (either way) are said to have low information content. In other words, spurious sequence similarities will be more frequent.
 * Is it haploid, diploid, or polyploid? Currently genome assemblers deal best with haploid samples, and some provide a haploid assembly with annotated heterozygous sites. Polyploid genomes (e.g. plants) are still largely problematic.


 * Experimental issues: What sample material is available?
 * Is it possible to extract a lot of DNA? If you have only a little material, you might have to amplify the sample (e.g. using MDA), thus introducing biases.
 * Does that DNA come from a single cell, a clonal population, or a heterogeneous collection of cells? Diversity in the sample can create more or less noise, which different assemblers handle differently.


 * Technical issues: What sequencing technologies to use?
 * How much does each cost?
 * What is the sequence quality? The greater the noise, the more coverage depth you will need to correct for errors.
 * How long are the reads? The longer the reads, the more useful they will be to disambiguate repetitive sequence.
 * Can paired reads be produced cost-effectively and reliably? If so, what is the fragment length? As with long reads, reliable long-distance pairs can help disambiguate repeats and scaffold the assembly.
 * Can you use a hybrid approach? E.g. short and cheap reads mixed with long expensive ones.


 * Computational issues: What software to run?
 * How much memory do they require? This criterion can be decisive: if a computer does not have enough memory, it will either crash or slow down tremendously as it swaps data on and off the hard drive.
 * How fast are they? This criterion is generally less stringent, since the assembly time is usually a minor part of a complete genome assembly and annotation project. However, some assemblers scale better than others.
 * Do they require specific hardware? (e.g. large memory machine, or cluster of machines)
 * How robust are they? Are they prone to crash? Are they well supported?
 * How easy are they to install and run?
 * Do they require a special protocol? Can they handle the chosen sequencing technology?

Some steps which are likely common to most assemblies:


 * 1) If it is within reason and would not tamper with the biology: Try to get DNA from haploid or at least mostly homozygous individuals.
 * 2) Make sure that all libraries are of acceptable quality and that there are no major concerns (e.g. use FastQC)
 * 3) For paired-end data you might also want to estimate the insert size based on draft assemblies or assemblies you have already made.
 * 4) Before submitting data to a de novo assembler it is often a good idea to clean the data, e.g. to trim away low-quality bases towards the read ends and/or to drop bad reads altogether. As low-quality bases are more likely to contain errors, they can complicate the assembly process and lead to higher memory consumption (more data is not always better). That said, several general purpose short read assemblers such as SOAPdenovo and ALLPATHS-LG can perform read correction prior to assembly.
 * 5) Before running any large assembly, double and triple check the parameters you feed the assembler.
 * 6) After assembly it is often advisable to check how well your read data actually agrees with the assembly and whether there are any problematic regions.
 * 7) If you run de Bruijn graph based assemblies you will want to try different k-mer sizes. Whilst there is no universal rule of thumb, for error-free reads a smaller k-mer size leads to a more tangled graph and a larger k-mer size to a less tangled one. However, a smaller k is more robust to sequencing errors, whereas a k that is too large may not yield enough overlapping k-mers (graph edges) and will therefore result in short contigs.
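To illustrate the effect of k, here is a minimal de Bruijn graph sketch in Python. The toy reads and the node/branch counts are purely illustrative of how graph tangledness changes with k; real assemblers use far more elaborate data structures and error handling.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Collect (k-1)-mer -> (k-1)-mer edges from every k-mer in the reads."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges

def branching_nodes(edges):
    """Nodes with more than one outgoing edge are what tangle the graph."""
    return sum(1 for successors in edges.values() if len(successors) > 1)

# toy example: a larger k typically gives fewer ambiguous branch points
reads = ["ATGGCGTGCA", "GGCGTGCAAT", "GTGCAATGGC"]
for k in (4, 7):
    graph = de_bruijn_edges(reads, k)
    print(k, len(graph), branching_nodes(graph))
```

Rerunning with different values of k on real reads is exactly the kind of parameter sweep suggested in point 7 above.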

Data pre-processing
For a more detailed discussion, see the chapter dedicated to pre-processing.

Data pre-processing consists of filtering the data to remove errors, thus facilitating the work of the assembler. Although most assemblers have integrated error-correction routines, filtering the reads beforehand will generally greatly reduce the time and memory required for assembly, and probably improve the results too.
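As an illustration of the kind of filtering meant here, below is a naive 3'-end quality trimmer in Python. The Phred threshold and minimum length are arbitrary illustrative values, and dedicated trimming tools use more sophisticated strategies (e.g. running-sum trimming).

```python
def trim_read(seq, quals, min_q=20, min_len=30):
    """Trim low-quality bases from the 3' end of a read.

    seq: read sequence; quals: per-base Phred scores.
    Returns the trimmed sequence, or None if the read becomes
    too short to be useful (thresholds are illustrative only).
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end] if end >= min_len else None
```

Dropping reads that shrink below a minimum length (rather than keeping tiny fragments) is one way the "more is not always better" principle is applied in practice.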

Genome assembly
Genome assembly consists of taking a collection of sequencing reads, which are much shorter than the actual genome, and creating a genome sequence which is a likely source of all these fragments. What defines a likely genome generally depends on heuristics and the data available. Firstly, by parsimony, the genome should be as short as possible. One could take all the reads and simply concatenate their sequences, but this would not be parsimonious. Secondly, the genome should include as much of the input data as possible. Finally, the genome should satisfy as many of the experimental constraints as possible. Typically, paired-end reads are expected to map onto the genome with a given respective orientation and at a given distance from each other.

The output of an assembler is generally decomposed into contigs, or contiguous regions of the genome which are nearly completely resolved, and scaffolds, or sets of contigs which are approximately placed and oriented with respect to each other.

There are many assemblers available (See the Wikipedia page on sequence assembly for more details). Tutorials on how to use some of them are below.

Techniques for comparing assemblies
Once several genome assemblies are generated, they need to be evaluated. Current methods include:


 * N50 (length of contigs or scaffolds)
 * mapping of reads that were used to produce the assembly
 * identification and counting of highly conserved genes expected to be present based on evolution
 * mapping of transcripts to genome assemblies
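The N50 statistic mentioned above can be computed in a few lines of Python. This is a sketch: ties and the empty-input case are ignored.

```python
def n50(lengths):
    """N50: the length L such that contigs of length >= L together
    cover at least half of the total assembly span."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([100, 80, 50, 30, 20]))  # -> 80 (100 + 80 covers half of 280)
```

Note that N50 alone rewards long contigs regardless of correctness, which is why it is usually combined with the other evaluation methods listed above.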

Post-assembly analysis
Once a genome has been obtained, a number of analyses are possible, if not necessary:
 * Quality control
 * Comparison to other assemblies
 * Variant detection
 * Annotation

ABySS
is a de novo assembler which can run on multiple nodes, using the Message Passing Interface (MPI) for communication. As ABySS distributes tasks across nodes, the amount of RAM needed per machine is smaller, and thus ABySS is able to cope with large genomes. See here for a tutorial.


 * Pros
 * distributed design, so a cluster can be used
 * a large genome can be assembled with relatively little RAM per compute node. A human genome was assembled on 21 nodes having 16GB RAM each


 * Cons
 * relatively slow

Allpaths-LG
ALLPATHS-LG is a novel assembler requiring specially prepared libraries. The authors benchmarked ALLPATHS-LG against SOAPdenovo and reported superior performance. However, it must be noted that they might not have used the SOAPdenovo gap-filling module for one of the data sets due to time constraints; this would probably have improved the contiguous sequence length of the SOAP assembly. In our own hands (usadellab) we have seen similarly good N50 results, and good N50 values have also been reported for ALLPATHS-LG Arabidopsis assemblies. ALLPATHS-LG was likewise named as performing well in the Assemblathon.


 * Pros
 * relatively fast runtime (slower than SOAP)
 * good scaffold length (likely better than SOAP)
 * can use long reads (e.g. PacBio), but only for small genomes


 * Cons
 * specially tailored libraries are necessary
 * large genomes (mammalian size) need a lot of RAM; the publication estimates that about 512GB would be sufficient, though
 * slower than SOAP

Euler SR USR
is an assembler that includes an error correction module.
 * Pros
 * Has an error correction module
 * Cons

MIRA
is a general purpose assembler that can integrate various platform data and perform true hybrid assemblies.


 * Pros
 * very well documented, with many options
 * can combine different sequencing technologies
 * likely produces relatively good quality assemblies


 * Cons
 * Only partly multithreaded and, due to its underlying approach, slow
 * Probably not recommended for assembling larger genomes

Ray
is a distributed scalable assembler tailored for bacterial genomes, metagenomes and virus genomes.

Tutorial available here


 * Pros
 * scalability (uses MPI)
 * correctness
 * usability
 * well documented
 * responsive mailing list
 * can combine different sequencing technologies
 * de Bruijn-based


 * Cons

SOAP de novo
is an all purpose genome assembler. It was used to assemble the giant panda genome. See here for a tutorial.


 * Pros
 * SOAP de novo uses a medium amount of RAM
 * SOAP de novo is relatively fast (probably the fastest free assembler)
 * SOAP de novo contains a scaffolder and a read-corrector
 * SOAP de novo is relatively modular (read-corrector, assembly, scaffold, gap-filler)
 * SOAP de novo works well with very short reads


 * Cons
 * potentially somewhat confusing way in which contigs are built.
 * Relatively large amount of RAM needed; BGI states ca. 150GB (less than ALLPATHS, though)

SPAdes
is a single-cell genome assembler.


 * Pros
 * SPAdes works well with highly non-uniform coverage (e.g. after using Multiple Displacement Amplification)
 * SPAdes uses a medium amount of RAM
 * SPAdes is relatively fast
 * SPAdes includes the error correction software BayesHammer
 * SPAdes has a scaffolder (version 2.3+)


 * Cons
 * SPAdes is well tested only on bacterial genomes
 * SPAdes works with Illumina reads only

Velvet
See here for a tutorial on creating an assembly with Velvet.


 * Pros
 * Easy to install, stable
 * Easy to run
 * Fast (multithreading)
 * Can take in long and short reads, works with SOLiD colorspace reads
 * Can use a reference genome to anchor reads which normally map to repetitive regions (Columbus module)
 * Cons
 * Velvet might need large amounts of RAM for large genomes, potentially > 512 GB for a human genome, if such an assembly is feasible at all. This is based on an approximation formula derived by Simon Gladman from smaller genomes: -109635 + 18977*ReadSize + 86326*GenomeSize (in Mb) + 233353*NumReads (in millions) - 51092*Kmersize
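Plugging numbers into Gladman's approximation is straightforward. The sketch below assumes, as the formula is commonly cited, that the result is in kilobytes; since the formula was derived from smaller genomes, extrapolations to large genomes should be treated with caution.

```python
def velvet_ram_estimate(read_size, genome_size_mb, num_reads_millions, kmer):
    """Simon Gladman's empirical Velvet RAM formula.

    genome size in megabases, read count in millions; the result is
    commonly quoted in kilobytes (an assumption here, as the original
    statement of the formula does not spell out the output unit).
    """
    return (-109635 + 18977 * read_size + 86326 * genome_size_mb
            + 233353 * num_reads_millions - 51092 * kmer)

# e.g. 100 bp reads, a 5 Mb bacterial genome, 5 million reads, k = 31
kb = velvet_ram_estimate(100, 5, 5, 31)
print(round(kb / 1024 / 1024, 1), "GB")  # rough estimate, ~1.7 GB
```

For a bacterial genome this lands comfortably within a desktop machine's RAM, which is consistent with Velvet's reputation for small and medium genomes.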

Minia
is a de Bruijn graph assembler optimized for very low memory usage.


 * Pros
 * Assembles very large genomes quickly on modest resources
 * Easy to install, run


 * Cons
 * Illumina data only
 * Does not perform any scaffolding
 * Some steps are I/O-intensive, i.e. a local hard disk should be used rather than a network drive

CLC cell
The CLC assembly cell is a commercial assembler released by CLC. It is based on a de Bruijn graph approach.


 * Pros
 * CLC uses very little RAM
 * CLC is very fast
 * CLC contains a scaffolder (version 4.0+)
 * CLC can assemble data from most common sequencing platforms.
 * Works on Linux, Mac and Windows.


 * Cons
 * CLC is not free
 * CLC might be a bit more liberal in collapsing repeats, based on our own plant data.

Newbler
is an assembler released by the Roche company.


 * Pros
 * Newbler has been used in many assembly projects
 * Newbler seems to be able to produce good N50 values
 * Newbler is often relatively precise
 * Newbler can usually be obtained free of charge


 * Cons
 * Newbler is tailored to (mostly) 454 data. Since Ion Torrent PGM data has a similar error profile (a predominance of miscalled homopolymer repeats), it may be a good choice there as well. Whilst Newbler can accommodate a limited amount of Illumina data, as described by bioinformatician Lex Nederbragt, this is not possible for larger data sets. The fire ant genome project added ~40x Illumina coverage to ~15x 454 coverage in the form of "fake" 454 reads: the Illumina data was first assembled using SOAPdenovo, the resulting contigs were chopped into overlapping 300bp reads, and these fake 454 reads were fed to Newbler alongside the real 454 data.
 * As Newbler at least partly uses the OLC approach, large assemblies can take time
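The "fake read" trick described above can be sketched in a few lines of Python. The 300 bp read length follows the fire ant project; the 150 bp step (i.e. 50% overlap) is an illustrative assumption, as the original step size is not stated here.

```python
def chop_contig(contig, read_len=300, step=150):
    """Chop a contig into overlapping pseudo-reads.

    read_len of 300 bp follows the fire ant project description;
    the step size (and hence the overlap) is an illustrative choice.
    """
    if len(contig) <= read_len:
        return [contig]
    reads = [contig[i:i + read_len]
             for i in range(0, len(contig) - read_len + 1, step)]
    # make sure the contig's 3' tail is covered as well
    if (len(contig) - read_len) % step:
        reads.append(contig[-read_len:])
    return reads
```

Each pseudo-read would then be written out in a format Newbler accepts, alongside the real 454 data.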

Decision Helper
This is based both on personal experience as well as on published studies. Please note however that genomes are different and software packages are constantly evolving.

A Nature news report on an Assemblathon challenge, which used a synthetic diploid genome, named SOAPdenovo, ABySS and ALLPATHS-LG the winners.

However, a talk on the Assemblathon website names SOAPdenovo, sanger-sga and ALLPATHS-LG as consistently amongst the best performers for this synthetic genome.

I want to assemble: (For large genomes these recommendations are based on the fact that not many assemblers can deal with large genomes, and on the Assemblathon outcome. For 454 data they are based on Newbler's good general performance, and on MIRA's versatility, its different outputs and the theoretical consideration that de Bruijn based approaches might fare worse.)
 * Mostly 454 or Ion Torrent data
 * small genome => MIRA, Newbler
 * all others use Newbler
 * Mixed data (454 and Illumina)
 * small genome => MIRA, but try other ones as well
 * medium genome => no clear recommendation
 * large genome, assemble Illumina data with ALLPATHS-LG and SOAP, add in other reads or use them for scaffolding
 * Mostly Illumina (or Colorspace)
 * small genome => MIRA, velvet
 * medium genome => no clear recommendation
 * large genome, assemble Illumina data with ALLPATHS-LG and SOAP, add in other reads or use them for scaffolding
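The decision list above can be encoded directly as a lookup table, e.g.:

```python
# A direct restatement of the decision list above.
# None means "no clear recommendation".
RECOMMENDATION = {
    ("454/IonTorrent", "small"):  ["MIRA", "Newbler"],
    ("454/IonTorrent", "medium"): ["Newbler"],
    ("454/IonTorrent", "large"):  ["Newbler"],
    ("mixed",          "small"):  ["MIRA"],   # but try other ones as well
    ("mixed",          "medium"): None,
    ("mixed",          "large"):  ["ALLPATHS-LG", "SOAPdenovo"],
    ("illumina",       "small"):  ["MIRA", "Velvet"],
    ("illumina",       "medium"): None,
    ("illumina",       "large"):  ["ALLPATHS-LG", "SOAPdenovo"],
}

def recommend(data_type, genome_size):
    """Look up the assemblers suggested in this chapter."""
    return RECOMMENDATION.get((data_type, genome_size))
```

As stressed above, genomes differ and software evolves, so such a table is only a starting point, not a definitive answer.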

Post assembly you might want to try the SEQuel software to improve the assembly quality.

I want to start a large genome project for the least cost (This recommendation is based on the Assemblathon outcome, the original ALLPATHS publication, as well as a publication that used ALLPATHS for the assembly of Arabidopsis genomes.)
 * Use Illumina reads following the ALLPATHS-LG specification (i.e. overlapping paired reads); the reads will work in e.g. SOAPdenovo as well

Each software package has its particular strengths; if you have specific requirements, the results from the Assemblathon will guide you. Another comparison effort, GAGE, has also released its results. The QUAST tool is also available for assessing genome assembly quality.

Further Reading Material

 * Background
 * Genome Sequence Assembly Primer
 * Paszkiewicz and Studholme, 2010 Some general background
 * Nagarajan and Pop, 2010
 * Imelfort and Edwards, 2009 Sequencing of plant genomes
 * Pop 2009


 * Original publications
 * Simpson et al., 2009 ABySS
 * Zerbino and Birney, 2008 Velvet
 * Gnerre et al., 2011 ALLPATHS-LG
 * Li et al., 2010 SOAP denovo
 * Chevreux et al., 2004 MIRA (EST assembly)
 * Chaisson et al., 2009 EULER-USR
 * The CLC Assembly Cell Whitepaper includes a comparison with ABySS for a human genome and with velvet for a bacterial genome


 * Comparisons
 * Ye et al., 2011 Comparison of Sanger/PCAP; 454/Roche and Illumina/SOAP assemblies. Illumina/SOAP had lower substitution, deletion and insertion rates but lower contig and scaffold N50 sizes than 454/Newbler.
 * Paszkiewicz et al., 2010 General review about short read assemblers
 * Zhang et al., 2011 In-depth comparison of different genome assemblers on simulated Illumina read data. Unfortunately only genomes up to medium size were tested. For eukaryotic genomes, SOAPdenovo is suggested for short reads and ALLPATHS-LG for longer reads.
 * Chapman JA et al., 2011 introduce the new assembler Meraculous and gather literature data on the assembly of E. coli K12 MG1655 for Allpaths 2, SOAPdenovo, Velvet, Euler-SR, Euler, Edena, ABySS and SSAKE. Allpaths 2 had by far the largest contig and scaffold N50 and was, apart from Meraculous, the only assembly free of misassemblies; Meraculous was even shown to contain no errors at all.
 * Liu et al., 2011 benchmark their new assembler PASHA against SOAPdenovo (v1.04), Velvet (1.0.17) and ABySS (1.2.1) using three bacterial data sets. Whilst PASHA usually had the largest NG50 and NG80 (N50 and N80 calculated using the true genome size), SOAPdenovo produced the highest number of contigs and sometimes worse NG50 and NG80 values. However, for one data set SOAPdenovo showed the best genome coverage.
 * The Assemblathon comparing de novo genome assemblies of many different teams based on a synthetic genome. The Assemblathon 1 competition is now published in Genome Research by Earl et al.

ENA
See here for more information.

The European Nucleotide Archive (ENA) has a three-tiered data architecture. It consolidates information from:
 * EMBL-Bank
 * the European Trace Archive: containing raw data from electrophoresis-based sequencing machines
 * the Sequence Read Archive: containing raw data from next-generation sequencing platforms

SRA
See SRA for more information.

The Sequence Read Archive (SRA) is:
 * the Primary archival repository for next generation sequencing reads and alignments (BAM)
 * Expanding to manage other high-throughput data including sequence variations (VCF)
 * Will shortly also accept capillary sequencing reads
 * Globally comprehensive through INSDC data exchange with NCBI and DDBJ
 * Part of European Nucleotide Archive (ENA)
 * Data owned by submitter and complement to publication
 * Data expected to be made public and freely available; no access/use restrictions permitted
 * Pre-publication confidentiality supported
 * Controlled access data submitted to EGA
 * Active in the development of sequence data storage and compression algorithms/technologies

SRA Metadata Model

 * Study: sequencing study description
 * Sample: sequenced sample description
 * Experiment/Run: primary read and alignment data
 * Analysis: secondary alignment and variation data
 * Project: groups studies together
 * EGA DAC: Data Access Committee
 * EGA Policy: Data Access Policy
 * EGA Dataset: Dataset controlled by Policy and DAC

IGV
IGV is the Integrative Genomics Viewer, developed at the Broad Institute. IGV allows for easy navigation of large-scale genomic datasets and supports the integration of genomic data types such as aligned sequence reads, mutations, copy number, RNA interference screens, gene expression, methylation, and genomic annotations. Users can zoom into specific regions down to individual base pairs, or scroll through an entire genome. It can be used to visualize and share whole genomes/reference genomes, alignments, variants, and regions of interest, as well as to filter, sort, and group genomic data.