Next Generation Sequencing (NGS)/Alignment

Introduction
Alignment, also called mapping, of reads is an essential step in re-sequencing. Having sequenced an organism of a species before, and having constructed a reference sequence, re-sequencing more organisms of the same species allows us to see the genetic differences to the reference sequence, and, by extension, to each other. Alignments of data from these re-sequenced organisms is a relatively simple method of detecting variation in samples. There are certain instances (such as new genes in the sequenced sample that are not found in the existing reference sequence) that can not be detected by alignment alone; however, while other approaches, such as de novo assembly, are potentially more powerful, they are also much harder or, for some organisms, impossible to achieve with current sequencing methods.

Next-generation sequencing generally produces short reads or short read pairs, meaning short sequences of <~200 bases (as compared to long reads by Sanger sequencing, which cover ~1000 bases). To compare the DNA of the sequenced sample to its reference sequence, we need to find the corresponding part of that sequence for each read in our sequencing data. This is called aligning or mapping the reads against the reference sequence. Once this is done, we can look for variation (e.g. SNPs) within the sample. This poses a number of problems:
 * The short reads do not come with position information, that is, we do not know what part of the genome they came from; we need to use the sequence of the read itself to find the corresponding region in the reference sequence.
 * The reference sequence can be quite long (~3 billion bases for human), making it a daunting task to find a matching region.
 * Since our reads are short, there may be several, equally likely places in the reference sequence from which they could have been read. This is especially true for repetitive regions.
 * If we were only looking for perfect matches to the reference, we would never see any variation. Therefore, we need to allow some mismatches and small structural variation (InDels) in our reads.
 * Any sequencing technology produces errors. Similar to the "real" variation, we need to tolerate a low level of sequencing errors in our reads, and separate them from the "real" variation later.
 * We need to do that for each of the millions of reads in our sequencing data.

Short reads
Raw short reads often come in (or can be converted into) a file format called FASTQ. It is a plain text format, containing the sequence and quality scores for every read, where each single read normally occupies four consecutive lines: @Read_id_1 CTGATGTGCCGCCTCACTTCGGTGGT + @@@DDDDDH8( @Read_id_2 TGATGTGCCGCCTCACTACGGTGGTG + FHHHHHJIJIJIJIIIJJIIJGIGII @Read_id_3 ... The four lines are:
 * 1) The name/ID of the read, preceded by a "@". For read pairs, there will be two entries with that name, either in the same or a second FASTQ file.
 * 2) The sequence of the read.
 * 3) A "+" sign. In very old FASTQ files, this is followed by the read name from the first line. Today, this line is present for historical reasons backwards compatibility only.
 * 4) The quality scores of the bases from line 2. The scores are generated by the sequencing machine, and encoded as ASCII (33+score) characters. The line should have the same length as line 2, as there is one quality score per base.

Alignment
For each of the short reads in the FASTQ file, a corresponding location in the reference sequence (or that no such region exists) needs to be determined. This is achieved by comparing the sequence of the read to that of the reference sequence. A mapping algorithm will try to locate a (hopefully unique) location in the reference sequence that matches the read, while tolerating a certain amount of mismatch to allow subsequence variation detection. Reads aligned (mapped) to a reference sequence will look like this:

GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG Reference sequence CTGATGTGCCGCCTCACTTCGGTGGT       Short read 1 TGATGTGCCGCCTCACT A CGGTGGTG      Short read 2 GATGTGCCGCCTCACTTCGGTGGTGA     Short read 3 GCTGATGTGCCGCCTCACT A CGGTG         Short read 4 GCTGATGTGCCGCCTCACT A CGGTG         Short read 5

You can see the reference sequence on the top row, and five short reads stacked below; this is called a pileup. While two of the reads are a perfect match to the reference, the three other reads show a mismatch each, highlighted in red ("A" in the read, instead of "T" in the reference). Since there are multiple reads showing the mismatch, at the same position, with the same difference, one could conclude that it is an actual genetic difference (point mutation or SNP), rather than a sequencing error or mismapping.

Mapping algorithms
There are several alignment algorithms in existence; you can find an (incomplete) list further down in software packages. Some notes on mapping algorithms:
 * The reference sequence, the short reads, or both, are often pre-processed into an indexed form for rapid searching. (See BWT)

Sources of errors
There are several potential sources for errors in an alignment, including (but not limited to):
 * Polymerase Chain Reaction artifacts (PCR artifacts). Many NGS methods involve one or multiple PCR steps. PCR errors will show as mismatches in the alignment, and especially errors in early PCR rounds will show up in multiple reads, falsely suggesting genetic variation in the sample. A related error would be PCR duplicates, where the same read pair occurs multiple times, skewing coverage calculations in the alignment.
 * Sequencing errors. The sequencing machine can make an erroneous call, either for physical reasons (e.g. oil on an Illumina slide), or due to properties of the sequenced DNA (e.g., homopolymers). As sequencing errors are often random, they can be filtered out as singleton reads during variant calling.
 * Mapping errors. The mapping algorithm can map a read to the wrong location in the reference. This often happens around repeats or other low-complexity regions.

Alignment types
Alignments can be used for different purposes:
 * Whole-genome sequencing. This would be the "default" use; sequence all DNA from an organism and map it to the appropriate reference sequence, to find genetic variation.
 * Exome sequencing. For large genomes (e.g., human), capture just the exomic DNA before sequencing. This will return sequencing data for most of the genes, at a fraction of the cost.
 * Transcriptome sequencing (RNA-Seq). Sequencing of the transcriptome, that is, of the RNA present in the sample. This can show which genes are transcribed in the sample, and help fine-tune gene annotation (exon boundaries etc.). Mapping can be done either to the full reference sequence, or to a special "transcriptome reference".
 * ChIP-Seq (Protein-DNA interaction).

The SAM/BAM format
The SAM/BAM format has emerged as the de facto standard format for short read alignments. SAM is the plain-text version of the binary, compressed BAM format. They can be converted into one another by the name-giving samtools command-line tool. BAM (without alignment position data) is increasingly used as a space-saving alternative to FASTQ files for containing the short raw read data, and all current alignment software can generate SAM/BAM as an output format. Once in BAM format, the file can be indexed, giving quick access to any region of the reference sequence. Subsequently, using samtools or other software, BAM files can be analysed (e.g. for quality control), modified (removal of PCR duplicates, local realignment, base quality recomputation), or used to call variation, either small (SNPs, short InDels) or large (inversions, tandem duplications, deletions, translocations). BAM files can be visualised using tools like Artemis, ACT, or LookSeq. Last but not least, alignments in BAM format can be used to "morph" the reference sequence to correspond to the short read data with ICORN ; this can be useful to get an actual DNA sequence for a sample, or to construct a new reference sequence based on a closely related species.

Other useful SAM/BAM-related software includes:
 * Picard
 * Bio-SamTools

SAM/BAM tools examples

 * Convert SAM to BAM format:
 * samtools view -bS aln.sam > aln.bam
 * Sort BAM file according to position on reference sequence
 * samtools sort aln.bam aln_sorted.bam
 * Index BAM file (needed for visualising the alignment)
 * samtools index aln_sorted.bam aln_sorted.bai
 * Extract the header information from a BAM file
 * samtools view –h aln_sorted.bam > aln.sam
 * Generate a FASTQ file from a BAM file
 * bam2fastq -o aligned.fastq --no-unaligned aln.bam

Software packages
An exhaustive list of NGS aligner is maintained at HTS mapper

Based on personal experience and prevalence and based on literature data on the performance , a quick primer on when to use which software:
 * If only speed matters use bowtie.
 * BWA is a bit slower but more sensitive.
 * If sensitivity and specificity is needed, try NextGenMap, Stampy, Novoalign, or SHRiMP 2.

Furthermore a framework (Teaser) was proposed that automatically benchmarks several mappers within minutes.