Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format

Introduction

Bambino is a graphical viewer and variant detector for next-generation sequencing files in SAM/BAM format. The program is written in Java (version 6 or later required) and uses the Picard library for SAM/BAM parsing, MySQL Connector/J to communicate with the UCSC annotation databases, and Swing for the GUI.

An index of next-generation datasets is available, and the viewer is integrated into certain projects in the Cancer Genome Workbench.

Launching Bambino

Click one of the links below to run Bambino using Java Web Start (to download and run locally, see utilities documentation). Allowing the program to use 1 GB of RAM is recommended, but it may work with less, depending on how much data you want to load into the viewer.

Maximum memory allocation    Link
1 GB (recommended)           launch program
512 MB                       launch program
768 MB                       launch program
1.5 GB                       launch program

If you see a message like "could not create the Java virtual machine", try running a lower-memory version of the program. If while running the program you receive an "out of memory" error message, try running the program with additional memory, or viewing narrower regions of the genome.

Note that at the moment the program uses a self-signed certificate, so you will have to click through a security warning to run it. This is necessary to allow the program to open .bam files on your computer.

Below is a screenshot of the launch screen:

The launcher contains three tabs:

Data files

This tab lets you specify the .bam files you want to view and how to handle the reference sequence.

Annotation database

This tab lets you configure the UCSC genome annotation database to use, if annotations are desired. By default it uses the public UCSC annotation server for human build hg18, but it may be configured to use a different database or a locally installed mirror.

Options

Specifies miscellaneous options to start the program:

Example data / walkthrough

UCSC has an example .bam file you can use with Bambino:
  1. Download the following 2 files and save them in the same directory on your computer: bamExample.bam (data file) and bamExample.bam.bai (index file).
  2. Launch Bambino.
  3. Set the ".bam file(s)" field to point to your downloaded copy of bamExample.bam.
  4. If you have downloaded a copy of the hg18 reference sequence, specify its location on your computer; otherwise, uncheck the "use reference sequence" checkbox.
  5. Click "Start viewer".
  6. Once the viewer starts, enter "chr21:33035000-33040000" in the "Jump:" field and hit enter.
  7. The assembly will refresh, showing the reads aligned in this region. Enough reads are present to demonstrate some of the viewer's features: you can see aligned dbSNP entries and reference sequences, and the SNP detector can find a few SNPs, including a novel C->T variant at base 33035351 and a dbSNP A->T entry at base 33038426 (rs2186271).
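The "Jump:" field in step 6 takes a region written as chromosome:start-end. As a minimal sketch of how such a string breaks down, the Python below splits the example region and checks that both SNP positions from step 7 fall inside it. The `Region` type and `parse_region` helper are hypothetical illustrations, not part of Bambino, and the viewer may accept other formats that this sketch does not cover.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic interval written as chrom:start-end (hypothetical helper type)."""
    chrom: str
    start: int
    end: int


# Chromosome name, a colon, then two integers separated by a hyphen.
_REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text: str) -> Region:
    """Split a 'chrom:start-end' string into its three parts."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {text!r}")
    start = int(match.group("start"))
    end = int(match.group("end"))
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return Region(match.group("chrom"), start, end)


region = parse_region("chr21:33035000-33040000")

# Both SNP positions named in the walkthrough lie inside the region.
snp_positions = [33035351, 33038426]
all_inside = all(region.start <= pos <= region.end for pos in snp_positions)
```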

Viewer display overview

Below is a screenshot of the viewer:
[Screenshot: viewer display]

Navigation

There are several ways to move around the assembly:

Context menu

Right-clicking within the assembly will bring up a context-sensitive popup menu with several options which use the sample ID and/or location currently under the mouse cursor:

Variant detection

Click the "SNPs" button in the toolbar to bring up the variant detection controls:
[Screenshot: variant detection controls]
The following settings control the variant detection process:
Each setting is listed with its label, command-line switch, description, and default value.

Minimum nucleotide quality
    Command-line switch: -min-quality
    Description: Minimum quality score for including a particular nucleotide in any SNP calculation. Applies to both reference and alternative alleles.
    Default value: 10

Minimum mapping quality
    Command-line switch: -min-mapq
    Description: Minimum read mapping quality required to use a read in variant detection (see MAPQ field in SAM specification, section 2.2.1).
    Default value: 1

Minimum coverage
    Command-line switch: -min-coverage
    Description: The minimum number of reads passing all quality filters required at a variant site. Only reads associated with the variant are counted; that is, for a putative G/T SNP, reads showing other calls are not included.
    Default value: 4

Minimum frequency of alternative allele
    Command-line switch: -min-minor-frequency
    Description: Minimum frequency (non-reference read count divided by total read count) to consider any putative variant. A value of 0 disables this check. Note that the frequency is computed on the pooled set of data. Take, for example, a run on 2 BAM files, one normal and one tumor, containing perfectly distributed genomic data (unlikely!). A variant which is homozygous for the reference allele in normal and heterozygous for the alternative allele in tumor would have a minor allele frequency of 0.25 (25%).
    Default value: 5%

Minimum observations of alternative allele
    Command-line switch: -min-alt-allele-count
    Description: Required minimum number of reads supporting the non-reference allele.
    Default value: 3

Minimum unique read names supporting alternative allele
    Command-line switch: -min-unique-alt-reads
    Description: Required minimum number of unique read names supporting the variant allele. Set this to a number greater than one to ensure the variant is observed in more than a single mate pair. This can be a concern in low-coverage regions where both reads in a mate pair amplify the same region.
    Default value: 1

Minimum unique read mapping start positions supporting alternative allele
    Command-line switch: -min-unique-alt-read-start
    Description: Requires the set of reads supporting the variant allele to show a minimum number of unique alignment start positions. Setting this value higher helps avoid monoclonal effects by increasing the mapping diversity of the supporting reads, but requires higher read coverage.
    Default value: 2

Minimum observations of alternative allele to enable uniqueness filters
    Command-line switch: -unique-filter-coverage
    Description: Sets a minimum number of observations of the alternative allele before enforcing two filters: (1) minimum number of reads with flanking sequence and (2) minimum unique start positions for the alternative allele. Prevents these filters from discarding variants in low-coverage areas.
    Default value: 6

Minimum alternative reads with flanking sequence
    Command-line switches: -min-alt-flanking-reads [count], -min-alt-flanking-reads-window [window_size]
    Description: Requires a minimum number of reads showing the non-reference allele to have flanking sequence of a particular size.
    Default value: at least 1 sequence with 10 or more nt of flanking sequence

Minimum quality of flanking sequence
    Command-line switches: -min-flanking-quality [quality], -min-flanking-quality-window [window_size]
    Description: Requires all reads to have flanking sequence of a certain minimum quality. Not enforced if the site is near the end of a read. If you are trying to detect low-frequency variants, or variants in regions with low coverage, a lower quality value may be appropriate.
    Default value: quality 15+ for 5 flanking nt

Maximum allowable high-quality mismatches to reference sequence
    Command-line switches: -mmf-max-hq-mismatches [count], -mmf-min-hq-quality [quality]
    Description: A sequence is entirely disqualified for SNP-calling purposes if it has more than the specified number of mismatches of the specified or better sequence quality. Mismatches corresponding to known dbSNP SNPs are not included in this count.
    Default value: 3 mismatches of quality 15+

Maximum allowable low-quality mismatches to reference sequence
    Command-line switches: -mmf-max-lq-mismatches [count], -mmf-min-lq-quality [quality]
    Description: A sequence is entirely disqualified for SNP-calling purposes if it has more than the specified number of low-quality mismatches. Mismatches corresponding to known dbSNP SNPs are not included in this count.
    Default value: 6 mismatches of quality 3+

Mismap filter: max ratio of suspicious mismatches to usable ones
    Command-line switch: -mismap-frequency [value]
    Description: When the mismatch filter rejects a read for having too many disagreements with the consensus, the positions of its high-quality mismatches are recorded. When a putative SNP is evaluated, a ratio is computed of the count of these suspicious base calls to the count of usable reads showing the SNP allele. If the ratio is above the specified level, the SNP is rejected as likely being in a mismapped region. For example, take a candidate SNP where the reference is a G and the alternative allele is a T. Evaluation of the reads found 6 cases where a T at this position occurred in a read with an unacceptably high level of mismatches, and 10 cases where it occurred in acceptable reads (which possibly didn't fully overlap the problematic region). The ratio would be 6/10 or 0.6; under the default settings the SNP would be rejected as a possible artifact of mismapped reads.
    Default value: 0.5

Ignore reads with non-primary alignments
    Command-line switch: -skip-non-primary or -allow-non-primary
    Description: Whether to skip reads that have the "alignment is not primary" flag set (see SAM specification, section 2.2.2, flag 0x0100).
    Default value: don't use non-primary alignments

Read-end mismatch filter
    Command-line switch: TBD
    Description: Ignores high-quality mismatches clustered near the starts or ends of reads, which tend to cause false positive SNP calls. By default the filter ignores 2 or more mismatches within the first/last 6 nucleotides, and 3 or more mismatches within the first/last 10 nucleotides.
    Default value: enabled

Mismapped deletion filter
    Command-line switch: TBD
    Description: When a read contains a deletion near the start or end of the sequence, alignment software can struggle to map it properly because there aren't enough bases available on the other side of the deletion. This leads to short stretches of disagreement with the reference sequence, which can produce false positive SNP calls. This filter examines putative SNPs falling within called deletions, discarding reads that disagree with the reference sequence where the disagreement occurs within 10 nt of the end of a read.
    Default value: enabled

Mate pair disagreement filter
    Command-line switch: TBD
    Description: If multiple reads for a mate pair are present at a SNP site, checks whether all their allele calls agree. If they don't, the reads are excluded from consideration in the SNP call calculation.
    Default value: enabled

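The count-based settings above (minimum coverage, minimum observations of the alternative allele, and minimum frequency) can be pictured as a short series of threshold checks. The sketch below is an illustration only, not Bambino's implementation. The function names are hypothetical, the read counts are made up, and the frequency is taken as alternative reads divided by all reads counted at the site, an assumption chosen because it reproduces the 0.25 figure in the pooled normal/tumor example.

```python
def alt_allele_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of counted reads showing the alternative allele (assumed definition)."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0


def passes_count_filters(
    ref_reads: int,
    alt_reads: int,
    min_coverage: int = 4,        # documented default for -min-coverage
    min_alt_count: int = 3,       # documented default for -min-alt-allele-count
    min_frequency: float = 0.05,  # documented default for -min-minor-frequency (5%)
) -> bool:
    """Apply the three count-based thresholds to one candidate site."""
    # Only reads showing the reference or the alternative allele are counted.
    if ref_reads + alt_reads < min_coverage:
        return False
    if alt_reads < min_alt_count:
        return False
    # A minimum frequency of 0 disables the frequency check.
    if min_frequency > 0 and alt_allele_frequency(ref_reads, alt_reads) < min_frequency:
        return False
    return True


# Pooled example: homozygous reference in normal, heterozygous in tumor,
# with equal (made-up) coverage of 20 reads in each file.
normal = {"ref": 20, "alt": 0}
tumor = {"ref": 10, "alt": 10}
pooled_ref = normal["ref"] + tumor["ref"]
pooled_alt = normal["alt"] + tumor["alt"]
pooled_frequency = alt_allele_frequency(pooled_ref, pooled_alt)
pooled_passes = passes_count_filters(pooled_ref, pooled_alt)
```

With these counts the pooled frequency is 10 of 40 reads, so the site clears the default 5% threshold even though the variant is present in only one of the two files.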
Clicking the "Find SNPs/indels" button then proceeds with the analysis. The results may then be browsed using the spinner control directly to the right of the "SNPs" button in the toolbar. Iterating through the selections centers the display on each selected variant, highlighting it with a blue vertical line in the display. Other SNP sites also visible onscreen will be highlighted with a green line.
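The worked example given for the mismap filter setting (6 suspicious T calls against 10 usable ones) reduces to a single ratio test. The following sketch restates that arithmetic under stated assumptions: the function names are hypothetical, this is not Bambino's code, and the handling of a site with no usable reads is a guess made only so the function is total.

```python
def mismap_ratio(suspicious_calls: int, usable_calls: int) -> float:
    """Ratio of alternative-allele calls in rejected reads to those in usable reads."""
    if usable_calls == 0:
        # Assumption: with no usable reads, any suspicious call makes the ratio unbounded.
        return float("inf") if suspicious_calls else 0.0
    return suspicious_calls / usable_calls


def rejected_by_mismap_filter(
    suspicious_calls: int,
    usable_calls: int,
    max_ratio: float = 0.5,  # documented default for -mismap-frequency
) -> bool:
    """A site is rejected when the ratio is above the configured level."""
    return mismap_ratio(suspicious_calls, usable_calls) > max_ratio


# The documented example: 6 T calls in reads rejected for excess mismatches,
# 10 T calls in acceptable reads.
example_ratio = mismap_ratio(6, 10)
example_rejected = rejected_by_mismap_filter(6, 10)
```

Because the text says the SNP is rejected when the ratio is above the specified level, a ratio exactly equal to the limit is treated as acceptable in this sketch.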

Note that the interactive variant detector will only work within the region loaded in the viewer. To detect variants in an entire dataset (or target region), invoke the variant detector from the command line.

Command-line usage / utilities

Bambino contains various utility routines (including the variant detector) which may be run offline. These are described in the BAM utilities documentation.

Citation

An article about Bambino was published in Bioinformatics in January 2011 (abstract / PDF / PubMed).

License

Bambino is distributed under the following license terms (text, .doc).

Change log

2013-02-01 2012-03-09 2012-01-11 2011-09-26 2011-05-12 2011-02-15
Questions, comments, and suggestions to: Michael Edmonson (Michael.Edmonson@stjude.org)
Questions on accessing Bambino at NIH: Richard Finney (finneyr@mail.nih.gov)