SAMtools

The Sequence Alignment/Map format and SAMtools.
Original author(s) Heng Li
Developer(s) John Marshall and Petr Danecek et al [1]
Initial release 2009
Stable release 1.2 / 2015-02-02
Development status Active
Operating system UNIX-based
Available in C
Type Bioinformatics
License BSD, MIT
Website http://www.htslib.org

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion.[2] SAM files can be very large (tens of gigabytes is common), so compression is used to save space. SAM files are human-readable text files, and BAM files are their binary equivalent, whilst CRAM files are a restructured, column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file without having to uncompress the whole file. Additionally, since the SAM/BAM format is somewhat complex (containing reads, references, alignments, quality information, and user-specified annotations), SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.
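To make the "human-readable text" point concrete, the following sketch labels the eleven mandatory tab-separated fields of each SAM alignment line. The field names come from the SAM specification; the helper name label_sam_fields and the sample record are made up for illustration.

```shell
# Hypothetical helper: read SAM text on stdin and print each mandatory
# field of every alignment line next to its SAM specification name.
label_sam_fields() {
    awk -F '\t' '
        /^@/ { next }    # skip header lines
        {
            n = split("QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL", name, " ")
            for (i = 1; i <= n; i++) printf "%s=%s\n", name[i], $i
        }'
}

# Made-up record: one 8-base read aligned at position 10 of chr1.
printf 'read1\t0\tchr1\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' | label_sam_fields
```

Any optional annotation tags that follow the eleventh field are ignored by this sketch.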

Usage and commands

Like many Unix commands, SAMtools commands follow a stream model, where data runs through each command as if carried on a conveyor belt. This allows multiple commands to be combined into a data processing pipeline. Although the final output can be very complex, only a limited number of simple commands are needed to produce it. If input and output files are not specified, the standard streams (stdin, stdout, and stderr) are assumed. Data sent to stdout are printed to the screen by default but are easily redirected to a file using the normal Unix redirectors (> and >>), or to another command via a pipe (|).
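As a sketch of this stream model, the pipeline below counts alignment records per reference sequence by chaining small filters. The helper name reads_per_reference is hypothetical; it expects SAM text on standard input, such as the output of samtools view.

```shell
# Hypothetical helper: count alignment records per reference name.
# Input: SAM text on stdin. Output: "<reference><TAB><count>" lines.
reads_per_reference() {
    grep -v '^@' |          # drop header lines, if any
        cut -f 3 |          # keep the RNAME column
        LC_ALL=C sort |     # group identical names together
        uniq -c |           # count each group
        awk '{ print $2 "\t" $1 }'
}

# Typical use (assumes samtools is installed and sample.bam exists):
#   samtools view sample.bam | reads_per_reference > counts.txt
```

Each stage reads from the previous one through a pipe, so no intermediate files are written.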

SAMtools commands

SAMtools provides the following commands, each invoked as "samtools some_command".

view 
The view command filters SAM or BAM formatted data. Its options and arguments specify which data to select (possibly all of it), and only that data is passed through. Input is usually a SAM or BAM file specified as an argument, but could be SAM or BAM data piped from any other command. Possible uses include extracting a subset of data into a new file, converting between BAM and SAM formats, and just looking at the raw file contents. The order of extracted reads is preserved.
sort 
The sort command sorts the alignments in a BAM file by their position in the reference. The sort key for each read is the reference element it aligns to, together with its leftmost aligned coordinate on that element. With the command form shown in the examples below, the sorted output is written to a new file named after the given prefix, although it can be sent to stdout instead (using the -o option). As sorting is memory intensive and BAM files can be large, the -m option limits the amount of memory used; when the data do not fit, sorted blocks are written to temporary files, which are then merged to produce a complete sorted BAM file.
index  
The index command creates an index file that allows fast look-up of data in a coordinate-sorted BAM file. Like an index on a database, the generated *.bam.bai file allows programs that can read it to work more efficiently with the data in the associated file.
tview  
The tview command starts an interactive ASCII-based viewer that can be used to visualize how reads are aligned to specified small regions of the reference genome. Compared to a graphics-based viewer like IGV,[3] it has few features. Within the viewer, it is possible to jump to different positions along reference elements (using 'g') and display help information ('?').
mpileup  
The mpileup command produces a pileup format (or BCF) file giving, for each genomic coordinate, the overlapping read bases and indels at that position in the input BAM file(s). This can be used for SNP calling, for example.
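flagstat
The flagstat command prints summary counts of the alignments in a file, grouped by the categories encoded in the SAM FLAG field (for example mapped, paired, or duplicate reads).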
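The flagstat command tallies reads according to bits in the FLAG field of each record. The sketch below shows the idea for three of those bits on SAM text; it is not a reimplementation of flagstat and does not reproduce its report. The helper name flag_summary is hypothetical, and the bit values (0x1 paired, 0x4 unmapped, 0x400 duplicate) are taken from the SAM specification.

```shell
# Hypothetical helper: tally reads by selected SAM FLAG bits.
# A bit with value B is set when int(FLAG / B) is odd.
flag_summary() {
    awk -F '\t' '
        /^@/ { next }                              # skip header lines
        {
            total++
            if (int($2 / 1) % 2)    paired++       # 0x1: read is paired
            if (int($2 / 4) % 2)    unmapped++     # 0x4: read is unmapped
            if (int($2 / 1024) % 2) duplicate++    # 0x400: duplicate
        }
        END {
            printf "total=%d mapped=%d paired=%d duplicate=%d\n",
                   total, total - unmapped, paired, duplicate
        }'
}

# Typical use (assumes samtools is installed and sample.bam exists):
#   samtools view sample.bam | flag_summary
```

Plain division and modulo are used instead of bitwise operators so that the sketch does not depend on a particular awk implementation.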

Examples

view

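Convert a BAM file into a SAM file. The -h option includes the header in the output; without it, only the alignment records are written.

samtools view -h sample.bam > sample.sam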

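Convert a SAM file into a BAM file. The -b option writes the output in BAM format, and the -S option declares that the input is SAM (needed by older releases; newer releases detect the input format automatically).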

samtools view -bS sample.sam > sample.bam

Extract all the reads aligned to the range specified, which are those that are aligned to the reference element named chr1 and cover its 10th, 11th, 12th or 13th base. The results are printed to stdout in SAM format, without the header. An index of the input file, as created by samtools index, is required for extracting reads according to their mapping position in the reference genome.

samtools view sample_sorted.bam "chr1:10-13"
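To illustrate what "covering" a range means, here is a minimal sketch that applies the same overlap test to SAM text with awk. It scans every record rather than using an index, so it only mirrors the selection logic, not how samtools performs the look-up. The helper name overlaps_region is hypothetical; the rule that the M, D, N, = and X CIGAR operations consume reference bases is from the SAM specification.

```shell
# Hypothetical helper: print SAM records that overlap NAME:START-END
# (1-based, inclusive). Usage: overlaps_region chr1 10 13 < reads.sam
overlaps_region() {
    awk -F '\t' -v ref="$1" -v start="$2" -v end="$3" '
        /^@/ { next }                      # skip header lines
        $3 != ref || $6 == "*" { next }    # other reference, or no alignment
        {
            # Reference bases consumed = sum of M, D, N, = and X lengths.
            cigar = $6; span = 0
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                len = substr(cigar, 1, RLENGTH - 1) + 0
                op  = substr(cigar, RLENGTH, 1)
                if (index("MDN=X", op)) span += len
                cigar = substr(cigar, RLENGTH + 1)
            }
            last = $4 + span - 1           # rightmost aligned position
            if ($4 <= end && last >= start) print
        }'
}
```

A read starting at position 8 with CIGAR 3M ends at position 10 and is therefore reported for chr1:10-13, while a read starting at position 7 with CIGAR 3S3M ends at position 9 and is not, because soft-clipped bases do not consume the reference.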

Extract the same reads as above, but instead of displaying them, write them to a new BAM file, tiny_sorted.bam. The -b option writes the output in BAM format and the -h option causes the SAM headers to be output as well. These headers include a description of the reference that the reads in sample_sorted.bam were aligned to and will be needed if tiny_sorted.bam is to be used with some of the more advanced SAMtools commands. The order of extracted reads is preserved.

samtools view -h -b sample_sorted.bam "chr1:10-13" > tiny_sorted.bam
tview
samtools tview sample_sorted.bam

Start an interactive viewer to visualize a small region of the reference, the reads aligned, and mismatches. Within the viewer, jump to a new location by pressing g and entering a location, like chr1:10,000,000. If the reference element name and following colon are replaced with =, the current reference element is used; for example, if =10,000,200 is entered after the previous "goto" command, the viewer jumps to the region 200 base pairs further along chr1. Typing ? brings up help information.

sort
samtools sort unsorted_in.bam sorted_out

Read the specified unsorted_in.bam as input, sort it by aligned read position, and write it out to sorted_out.bam, the BAM file whose name (without extension) was specified.

samtools sort -m 5000000 unsorted_in.bam sorted_out

Read the specified unsorted_in.bam as input and sort it using at most about 5 million bytes (roughly 5 MB) of memory at a time. Blocks that do not fit in memory are written to temporary BAM files named sorted_out.0000.bam, sorted_out.0001.bam, etc. Each temporary file is sorted internally, and the files are then merged into the final sorted_out.bam.
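The block-wise strategy (sort bounded chunks, then merge them) can be sketched with standard Unix tools on headerless SAM text. This is an illustration of the general technique only, not of samtools internals: the helper name chunked_coordinate_sort is hypothetical, the chunk size is given in lines rather than bytes, and reference names are ordered lexically, which may differ from the reference order recorded in a BAM header.

```shell
# Hypothetical helper: sort headerless SAM records from stdin by reference
# name (column 3), then by leftmost position (column 4), holding at most
# $1 lines per block. Usage: chunked_coordinate_sort 1000000 < reads.sam
chunked_coordinate_sort() {
    chunk_lines=$1
    tab=$(printf '\t')
    tmpdir=$(mktemp -d) || return 1
    # 1. Split the input into blocks small enough to sort in memory.
    split -l "$chunk_lines" - "$tmpdir/block."
    # 2. Sort each block independently.
    for block in "$tmpdir"/block.*; do
        [ -e "$block" ] || continue
        LC_ALL=C sort -t "$tab" -k 3,3 -k 4,4n -o "$block" "$block"
    done
    # 3. Merge the already-sorted blocks into one sorted stream.
    if ls "$tmpdir"/block.* >/dev/null 2>&1; then
        LC_ALL=C sort -m -t "$tab" -k 3,3 -k 4,4n "$tmpdir"/block.*
    fi
    rm -rf "$tmpdir"
}
```

Because each block is sorted separately, no single block holds all the early reads; the global order only emerges in the merge step.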

index
samtools index sorted.bam

Creates an index file, sorted.bam.bai, for the sorted.bam file.
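Since region queries depend on the index, a script may want to build it only when it is missing or older than the BAM file. The following is a minimal sketch under that assumption; the helper name ensure_bam_index is hypothetical, and it relies on the sorted.bam.bai naming described above.

```shell
# Hypothetical helper: make sure "<bam>.bai" exists and is not older than
# the BAM file, rebuilding it with "samtools index" when necessary.
# Prints the index path on success.
ensure_bam_index() {
    bam=$1
    index="$bam.bai"
    if [ ! -e "$index" ] || [ "$bam" -nt "$index" ]; then
        samtools index "$bam" || return 1
    fi
    printf '%s\n' "$index"
}

# Typical use (assumes samtools is installed and sorted.bam exists):
#   ensure_bam_index sorted.bam && samtools view sorted.bam "chr1:10-13"
```

Checking the modification time guards against querying with a stale index after the BAM file has been regenerated.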

References

  1. http://sourceforge.net/mailarchive/forum.php?thread_name=2F0E69A8-A2DD-4D6E-9EDE-2A9C0506DA0F%40sanger.ac.uk&forum_name=samtools-devel
  2. Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.; 1000 Genome Project Data Processing Subgroup (2009). "The Sequence Alignment/Map format and SAMtools". Bioinformatics 25 (16): 2078–2079. doi:10.1093/bioinformatics/btp352. PMC 2723002. PMID 19505943.
  3. IGV
