Wednesday, March 26, 2014

Predicting Full-Length Ribosomal Gene Sequences

Introduction

The 16S ribosomal RNA gene has been used extensively in biology to assess how closely species are related.  This gene has regions of DNA that are highly conserved across bacteria and archaea, and other regions with high sequence variability.  The conserved regions are ideal for designing PCR primers that can amplify DNA from many different organisms.  The variable regions amplified with these conserved primers can then be used to determine the relatedness between two or more organisms, because closely related species typically have much more similar DNA sequences than distantly related species.

Typically, PCR is used to amplify a portion of the 16S ribosomal gene for sequencing.  However, whole-genome and whole-metagenome sequencing data also contain short DNA reads originating from the 16S gene.  These reads can be separated from the pool of other genomic reads and assembled into the entire 16S gene.  EMIRGE (Miller et al. 2011) is an algorithm for reconstructing full-length ribosomal genes from short-read DNA sequences.


EMIRGE

EMIRGE reconstructs full-length ribosomal genes from short-read DNA sequences.  It first maps reads to a database of known 16S genes, such as the SILVA or Greengenes database.  After the initial mapping, EMIRGE estimates the probability that a given read was generated from the reference to which it mapped.  Based on these probability estimates, the reference sequences are changed to reflect the 16S sequences that are likely to be represented by the set of reads.  Reads are then remapped to the adjusted 16S sequence database, and the process is repeated until an equilibrium is reached.  The resulting database of 16S sequences reflects the likely 16S genes represented by the input set of short reads.
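
The iterative loop described above can be illustrated with a toy sketch.  This is not EMIRGE's code.  It assumes a fixed, made-up table of read-versus-reference likelihoods and only re-estimates the relative abundance of each reference in an expectation-maximization style loop, whereas EMIRGE also rewrites the reference sequences and remaps the reads on every round.  The function name and the numbers are hypothetical.

```python
def estimate_abundances(likelihoods, n_iter=50):
    """Toy EM: likelihoods[r][j] = P(read r | reference j), assumed given."""
    n_refs = len(likelihoods[0])
    # Start from uniform abundances over the candidate references.
    abundances = [1.0 / n_refs] * n_refs
    for _ in range(n_iter):
        # E-step: probability that each read came from each reference.
        totals = [0.0] * n_refs
        for row in likelihoods:
            weighted = [a * p for a, p in zip(abundances, row)]
            norm = sum(weighted)
            if norm == 0:
                continue  # this read matches no reference; skip it
            for j, w in enumerate(weighted):
                totals[j] += w / norm
        # M-step: abundances become the normalized expected read counts.
        total = sum(totals)
        if total == 0:
            break  # no read could be assigned; keep the current estimate
        abundances = [t / total for t in totals]
    return abundances


# Hypothetical likelihoods: three reads, two candidate references.
reads = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]
print(estimate_abundances(reads))
```

In this toy input the first reference explains the reads better, so its estimated abundance grows with each iteration while the second one shrinks.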

EMIRGE was primarily built to infer the set of 16S genes from whole-metagenome reads.  However, it can also be used to infer the 16S gene from the genomic reads of a single isolate.  Full-length 16S genes are difficult to assemble even when only reads from a single genome are considered.

In the Dangl lab, we use EMIRGE to predict full-length 16S genes from reads generated from a single bacterial genome.  An example of the EMIRGE command we use is:

emirge.py my_output_dir -1 fwd_reads.fastq -2 rev_reads.fastq -b SSURef_NR99_115_tax_silva_formated -f SSURef_NR99_115_tax_silva_formated.fasta -i 600 -s 1000 -l 250

The descriptions of each parameter are below:

my_output_dir: the output and working directory for EMIRGE.

-1: the forward or single-end genomic sequencing reads

-2: the reverse genomic sequencing reads (paired-end data only)

-b: the bowtie index of the 16S sequence database

-f: the fasta file of the 16S sequence database

-i: mean insert size of paired-end reads

-s: standard deviation of insert size for paired-end reads

-l: max length of reads
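
When the same command has to be run for many isolates, it can be convenient to assemble it in a small script.  The sketch below is one possible way to do that.  The helper name build_emirge_command and its argument names are hypothetical, it only uses the flags described above, and it needs Python 3.8 or later for shlex.join.

```python
import shlex


def build_emirge_command(output_dir, fwd_reads, rev_reads, bowtie_index,
                         fasta_db, insert_mean, insert_stddev, max_read_length):
    """Return the emirge.py argument list for a paired-end run."""
    return [
        "emirge.py", output_dir,
        "-1", fwd_reads,
        "-2", rev_reads,
        "-b", bowtie_index,
        "-f", fasta_db,
        "-i", str(insert_mean),
        "-s", str(insert_stddev),
        "-l", str(max_read_length),
    ]


cmd = build_emirge_command(
    "my_output_dir", "fwd_reads.fastq", "rev_reads.fastq",
    "SSURef_NR99_115_tax_silva_formated",
    "SSURef_NR99_115_tax_silva_formated.fasta",
    insert_mean=600, insert_stddev=1000, max_read_length=250,
)
# Print the command for inspection instead of running it.
print(shlex.join(cmd))
# To actually launch EMIRGE: subprocess.run(cmd, check=True)
```

Keeping the arguments as a list rather than a single string avoids shell quoting problems if a file name contains spaces.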


Other Details

EMIRGE uses bowtie to map reads to the reference database.  To build the bowtie index of the reference database, we used the following command:

bowtie-build SSURef_NR99_115_tax_silva_formated.fasta SSURef_NR99_115_tax_silva_formated

Also, the database downloaded from SILVA had to be reformatted using this Perl script.  This script requires BioUtils.
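
Before starting an EMIRGE run, it can help to confirm that bowtie-build produced a complete index.  The sketch below assumes the original bowtie (version 1), which writes six .ebwt files next to the index prefix.  Very large indexes use the .ebwtl extension instead, which this check does not cover.  The helper name missing_index_files is hypothetical.

```python
import os

# Suffixes written by bowtie-build (bowtie 1) for a standard small index.
EBWT_SUFFIXES = [".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
                 ".rev.1.ebwt", ".rev.2.ebwt"]


def missing_index_files(index_prefix):
    """Return the expected index files that are not present on disk."""
    expected = [index_prefix + suffix for suffix in EBWT_SUFFIXES]
    return [path for path in expected if not os.path.isfile(path)]


missing = missing_index_files("SSURef_NR99_115_tax_silva_formated")
if missing:
    print("Missing index files:", ", ".join(missing))
else:
    print("Bowtie index looks complete.")
```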
