tag:blogger.com,1999:blog-58695575217568837402024-03-05T00:51:02.481-08:00The in silico lensAnonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.comBlogger20125tag:blogger.com,1999:blog-5869557521756883740.post-52128241206810160982015-05-13T14:53:00.000-07:002015-05-13T15:00:36.509-07:00Categorizing musicians by genre using artificial neural networks<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Background</b><br />
Genres are an important classification scheme for organizing music and other forms of art. However, genre classifications are often ambiguous. For example, undergraduate students asked to classify a given set of songs into 10 genres had only a 72% concordance with recording companies genre classifications (<a href="http://faculty-web.at.northwestern.edu/music/gjerdingen/Papers/PubPapers/Scanning.pdf">D. Perrot and R. Gjerdigen, 1999</a>). Some of the most sophisticated computer algorithms can classify songs into genres with about 70% - 80% accuracy using song features such as tempo, time, instruments, etc. The accuracy of genre classification is also dependent on the number of genres an algorithm attempts to delineate. From the MIREX 2005 audio genre classification contest, the highest accuracy for 6 genre classifications was 87% but dropped to 75% when classifying 10 genres (<a href="http://www.music-ir.org/evaluation/mirex-results/">http://www.music-ir.org/evaluation/mirex-results/</a>).<br />
<br />
An alternative to classifying songs into genres is classify musicians into genres. Categorizing musicians into genres is also a difficult problem because individual musicians can be members of several different, but related, genres. Musicians can be described using a set of terms (i.e. genres and sub-genres) representing their genre membership. For example, <a href="http://en.wikipedia.org/wiki/Florida_Georgia_Line">Florida Georgia Line</a>, can be classified using the terms: bro-country, country rock, country pop, and country rap. This multi-class membership makes it difficult to make general comparisons between genres. For example, it would be interesting to test if the word "truck" is more frequently mentioned in country songs (i.e. by country musicians) than in rock songs (i.e by rock musicians).<br />
<br />
<i>The first objective in this series of posts is to categorize musicians into a set of 20 pre-defined genres based on a list of terms associated with each musician using an artificial neural network model</i>. <br />
<br />
<b>Artificial Neural Network Background</b><br />
<a href="http://en.wikipedia.org/wiki/Artificial_neural_network">Artificial neural networks</a> (ANN) are a class of machine learning models frequently used to solve classification problems. ANNs contain a set of input, hidden, and output nodes connected by weighted edges. Input nodes correspond to features in the training dataset. Output nodes correspond to classification types. Input nodes are connected by weighted edges to hidden nodes, and hidden nodes are connected by weighted edges to output nodes. Hidden nodes and weighted edges are the medium by which the input features are transformed into a classification prediction. For example, if a given musician, M<span style="font-size: xx-small;">1</span>, is defined by features f<span style="font-size: xx-small;">1</span>, f<span style="font-size: xx-small;">2</span>, and f<span style="font-size: xx-small;">3</span>, by "activating" input nodes f<span style="font-size: xx-small;">1</span>, f<span style="font-size: xx-small;">2</span>, and f<span style="font-size: xx-small;">3</span> the network redirects the flow of weight towards the output nodes. After all the weight has been redirected, the output node with the highest weight has the highest probability of being the correct classification for musician M<span style="font-size: xx-small;">1</span>. <br />
<br />
<b>The Training Dataset</b><br />
The task of building or training an ANN is equivalent to assigning weights to each edge such that the classification accuracy of the model is maximized. The overly simplistic intuition behind this operation is as follows. First, edge weights are randomly initialized. Then features of a single training entry are allowed to flow through the network and the final output for that entry is noted. Using the final output and known output, an error value can be calculated which measures the level of correctness of the final output. The network is then <a href="http://en.wikipedia.org/wiki/Backpropagation">backpropagated</a> and weights are redistributed to minimize the error for the that feature set. This process is repeated for each training entry.<br />
<br />
In this project, an ANN will be used to classify musicians based on a set of terms describing each musician. Training data was manually gathered and curated. First, for each genre, <i>G</i>, google searches for "popular musicians of genre <i>G</i>" generated a list of popular musicians and their matching genre. Terms describing each musician, <i>M</i>, were gathered from the <a href="http://the.echonest.com/">Echo Nest database</a>. The "popular musician" lists were supplemented with more musicians obtained from Wikipedia pages for each genre. The complete training dataset consisted of 20 genres where each genre contained 80 musicians and each musician is represented by a list of terms. This training dataset, like nearly all training datasets for song classification, is likely to have incorrect annotations. Making more accurate training sets is a difficult task but substantially improves classification accuracy (<a href="http://users.cis.fiu.edu/~lli003/Music/cla/21.pdf">Cory McKay and Ichiro Fujinaga, 2006</a>). <br />
<br />
<b>Feature Selection</b><br />
Estimating model parameters (i.e. weights) can be computationally expensive. Training sets having a large number of features take much longer to estimate model parameters than those with fewer features. Here the initial training dataset had over 3,000 features (i.e. terms). Removing zero and near-zero variance features can reduce the required computational resources without having a substantial impact on model accuracy. Additionally, removing correlated features will also reduce runtime and will improve model accuracy. For these data, terms associated with fewer than 1% of musicians were deemed near-zero variance and removed. Also, terms having a <a href="http://en.wikipedia.org/wiki/Phi_coefficient">Phi coefficient</a> (a measure of similarity) higher than .9 were removed. These filtering measures reduced the number of features (i.e. terms) to 412. The final training dataset contained 1,544 musicians linked to 20 genres and described by 412 terms. Note that 56 musicians are linked to more than one genre. <br />
<br />
<b>Building the Model</b><br />
The ANN model was built using the <a href="http://topepo.github.io/caret/index.html"><i>caret</i> package</a> and principles outlined in <a href="http://appliedpredictivemodeling.com/">Applied Predictive Modeling</a> by Max Kuhn and Kjell Johnson. ANN models have two main parameters: decay and size. After each backpropagation, weights are multiplied by the decay parameter, a number less than one, to prevent weights from growing too large. Excessively large weights cause a model to be overly specific to the given training dataset (i.e. not generalizable to other input data). The second parameter, size, is the number of hidden nodes in the model. Arbitrarily choosing these parameters may lead to suboptimal models. A method called model tuning builds several models based on a range of parameter values and selects the most accurate as the final model. Accuracy for each tuning model built in the project is shown in Figure 1. The final model used parameters decay=0.5 and size=25. Further adjustments to these parameters are not likely to significantly improve model accuracy for the given training data.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_WsrZ-wuNoJj4ir67gR9w1Kn5WOKTeQ3fjs8_Vj_LuX9I5AjUxBMFuwmXXp_ME3QtsZF8SrLtWWuPKHPU-spDCWcP84h98bIK1yp_vRFk_VkncHtDOZZBkvV35COYOtIthChRDW9Tdm1t/s1600/model_accuracy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_WsrZ-wuNoJj4ir67gR9w1Kn5WOKTeQ3fjs8_Vj_LuX9I5AjUxBMFuwmXXp_ME3QtsZF8SrLtWWuPKHPU-spDCWcP84h98bIK1yp_vRFk_VkncHtDOZZBkvV35COYOtIthChRDW9Tdm1t/s1600/model_accuracy.png" width="640" /></a></div>
<span style="font-size: x-small;"> Figure 1: Model accuracy for a range of decay and size parameter values. The most accurate model was built using decay=0.5 and size=25. Bars indicate standard deviation. </span><br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
Accuracy of statistical models can be measure using a technique called cross validation. In cross validation the training data are split into two groups: a training group and a test group. The model is built using the training group and subsequently evaluated using the test group. The process is repeated multiple times using different subsets of training and test groups to ensure that a chance good (or bad) split between the training and test groups does not incorrectly estimate model accuracy. Here the final model correctly classified about 60% the test cases (Figure 1). <br />
<br />
Model accuracy can also be evaluated using a <a href="http://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a>. A confusion matrix shows the number of observed against the expected classifications for each category. Figure 2 shows the percentages of data observed in each category from the cross validation test cases. The vast majority of observed musician's genres matched the expected genre. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEz9SFG0dIbU3a-qu9Uks3SdzxWOE9EQZ9b6iX5H854C8-NO8oCqhuJT0XAyLGEBIyZZZc72ghPgyMl3XSPqokrG3qnrjfAyaTfTgQFA3V2XXO-SjimLiLkWGyd44wZKNf6hpMFPwa4Quy/s1600/confusion_matrix.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEz9SFG0dIbU3a-qu9Uks3SdzxWOE9EQZ9b6iX5H854C8-NO8oCqhuJT0XAyLGEBIyZZZc72ghPgyMl3XSPqokrG3qnrjfAyaTfTgQFA3V2XXO-SjimLiLkWGyd44wZKNf6hpMFPwa4Quy/s640/confusion_matrix.png" width="640" /></a></div>
<br />
<span style="font-size: x-small;">Figure 2: Confusion matrix built using all cross validation test cases. Values of each cell indicate the percentage of total musicians observed in each category for a given expected genre. If the classification model were perfect there would be a blue diagonal and all other boxes would be red.</span><br />
<br />
<br />
Genres having musicians that are frequently classified incorrectly into a single alternative genre are likely similar to that alternative genre. A <a href="http://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchically clustering</a> of the confusion matrix percentages (Figure 3) shows that some genres are similar to other genres (i.e. classic_rock and soft rock; disco and funk; etc.). These similarities match logical expectations. These similarities correspond to how accurately the model can distinguish between genres. For example, the model is more likely to incorrectly classify a "classic rock" musician as "soft rock" than "reggae." <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9NJEd4yAsu-ISf4RzI-E7NMaqPwYSyZ0-lvAu0hT2NOR6m7YMFxG-Dewzwm0huPBH4goUSUWcfrGN2MbrJVbJJb0eb94Kz7Y6psRWpyJveVMMAaj9AjuDXqD_KcPObwbosafCCrH9PDTs/s1600/genre_denro.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9NJEd4yAsu-ISf4RzI-E7NMaqPwYSyZ0-lvAu0hT2NOR6m7YMFxG-Dewzwm0huPBH4goUSUWcfrGN2MbrJVbJJb0eb94Kz7Y6psRWpyJveVMMAaj9AjuDXqD_KcPObwbosafCCrH9PDTs/s640/genre_denro.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-size: x-small;">Figure 3: Dendrogram clustering of confusion matrix values. This shows the similarity between genres based on the number of misclassification instances shared between genres. As expected, genres know to be similar cluster together (i.e. classic_rock and soft_rock). </span><br />
<br />
<br />
<b>Conclusion</b><br />
Artificial neural networks are an appropriate model for classification tasks. Here an ANN was built to classify musicians based on a list of terms associated with each musician. Considering accuracy of other models used in classifying songs, this model is sufficiently accurate for classifying musicians into genres. Future analyses will use classifications from this model to make general comparisons between genres. <br />
<br />
<br />
<br />
<b>Additional Notes</b><br />
<ul style="text-align: left;">
<li>The code (which needs to be cleaned and organized) can be found <a href="https://gist.github.com/anonymous/62c9295b2a07fb28dafe">here</a>.</li>
<li><b>Disclaimer:</b> I am not a statistician. This is a learning exercise for me. I'm sure there are plenty mistakes and things I overlooked. If anyone has suggestions on how to improve the correctness of this analysis I am interested in hearing them. I want to learn.</li>
<li>For a fantastic primer on using <a href="http://en.wikipedia.org/wiki/Artificial_neural_network">artificial neural networks</a> for regression and classification see the following posts by Brian Dolhansky: </li>
<ul>
<li><a href="http://briandolhansky.com/blog/artificial-neural-networks-linear-regression-part-1">Part 1 </a></li>
<li><a href="http://briandolhansky.com/blog/2013/7/11/artificial-neural-networks-linear-classification-part-2">Part 2 </a></li>
<li><a href="http://briandolhansky.com/blog/2013/9/23/artificial-neural-nets-linear-multiclass-part-3">Part 3</a></li>
</ul>
<li>These chapters by Michael Nielsen describing ANNs are also a great resource.</li>
<ul>
<li><a href="http://neuralnetworksanddeeplearning.com/chap1.html">Chapter 1</a></li>
<li><a href="http://neuralnetworksanddeeplearning.com/chap2.html">Chapter 2</a></li>
</ul>
<li><a href="http://www.ee.columbia.edu/~dliang/files/FINAL.pdf">Liang <i>et al.</i></a> use music features and lyrics to classify genres. However, many popular genres are missing (e.g. country) and there may be mistakes.</li>
<li>Other interesting references:</li>
<ul>
<li><a href="http://art.uniroma2.it/research/musicIR/BasSeraStel_ISMIR04.pdf">Basili, et al, 2004</a></li>
<li><a href="http://www.academia.edu/7705835/Statistical_Method_for_Music_Genre_Classification">Matos</a></li>
</ul>
</ul>
<ul style="text-align: left;">
<ul>
</ul>
</ul>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com1tag:blogger.com,1999:blog-5869557521756883740.post-90871889510140834962015-04-29T14:02:00.000-07:002015-04-29T14:02:20.555-07:002015 JGI Users Meeting Notes<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Interested in watching any of these talks? See <a href="https://www.youtube.com/playlist?list=PLkxZMDuKlaKtVOKisVBz0R7faYvWJrjuv">this webpage</a>!</b><br />
<b><br /></b>
<b><br /></b>
<b>Jack Gilbert </b>-- Genome-Enabled Flux Balance Metabolic Networks Form Periodically Flooded Soils<br />
<ul style="text-align: left;">
<li>Presented an array of studies across many different microbial communities</li>
<li>environmental microbes have roughly a 35 year lab to major climate changes</li>
<li>Cyanobacteria can metabolize nitrogen into amonium which has been show to be useful in sphagnum moss</li>
<li>bacteria can protect against allergies. Adding clostridium into mouse will alleviate allergenic symptoms</li>
<li>added clostridium to a young man who had really bad allergies. Comparisons to other family members without allergies is in progress</li>
<li>also doing this experiment on dolphins because it is easier to control their environment</li>
<li>microbes can contribute to you being fat or skinny</li>
<li>circadian rhythms in the gut genes and microbes looks to be very important</li>
<li>microbes in roots also have circadian cycles </li>
<li>your house takes on your microbiome</li>
<li>dogs increase the similarity of couple's microbiomes</li>
</ul>
<br />
<b>Francis Martin </b>-- Harnessing Genomics for Understanding Tree-Microbe Interactions in Forest Ecosystems<br />
<ul style="text-align: left;">
<li>we know very little about fungi affect the carbon cycle</li>
<li>4 major groups of fungi in most forest ecosystems: white rotters, brown rotters, litter soul decayers, and ectomycorrizal</li>
<li>he studies mycorrhizal fungi and their symbiotic toolkit</li>
<li>mycorrhizal symbiosis has evolved independently several time!</li>
<li>ectomycorrizal fungi (EMF) have reduced complement of plant cell wall degrading genes compared to ancestors</li>
<li>small proteins from EMF are secreted and land on plant cells</li>
<li>Look at Platt paper for JAZ interaction (PNAS). Missp7 binds to JAZ and prevents the immune response in the plant</li>
<li>each symbiosis event has developed a unique set of effectors</li>
</ul>
<br />
<b>Joan Bennett</b> -- Do Fungi have a 'Volatome'?<br />
<br />
<ul style="text-align: left;">
<li><a href="http://en.wikipedia.org/wiki/Aflatoxin">aflotoxins</a> can cause cancer in small doses</li>
<li><a href="http://en.wikipedia.org/wiki/Mycotoxin">mycotoxins</a> may cause <a href="http://en.wikipedia.org/wiki/Sick_building_syndrome">sick building syndrome</a></li>
<li>volatile organic compounds (VOCs) are things that we can smell and are frequently made by fungi</li>
<li>using arabidopsis and flies as models for testing the effects of these compounds</li>
<li>flies exposed to c-8 VOCs acted similarly to model flies for Parkinson's disease</li>
<li>VOCs in the fungal microbiome of humans are likely responsible for attracting mosquitoes</li>
</ul>
<br />
<br />
<b>Susanna Theroux</b> -- Marsh Madness: Microbial Communities Driving Greenhouse Gas Cycling in Coastal Wetlands<br />
<br />
<ul style="text-align: left;">
<li>wants to link wetland microbes with carbon emissions</li>
<li>methanogens break down carbon into methane</li>
<li>wetlands produce a lot of methane</li>
<li>microbes might be useful in minimizing methane production compared to carbon storage</li>
<li>more methanogens yield more methane</li>
</ul>
<br />
<br />
<b>Antonis Rokas</b> -- Evolution of Fungal Chemodiversity<br />
<br />
<ul style="text-align: left;">
<li>fungal metabolism</li>
<li>asks how did chemo diversity originate and why is chemo diversity clustered</li>
<li>toxicity clusters are likely driven by genetic linkage (ie butterflies)</li>
<li>mined metabolic databases for genes that are clustered and genes that are not and measured toxicity. Clustered genes are more toxic</li>
<li>tissue specific expression in humans and mammals is the equivalent of clustering in fungi</li>
<li>this implies that the position of two genes in the fugal genome give information about how they interact in humans (ie those two genes are likely to be expressed in the same tissue)</li>
</ul>
<br />
<br />
<b>Rotem Sorek</b> -- The Immune System of Bacteria<br />
<br />
<ul style="text-align: left;">
<li>crisper is the immune response in bacteria for phages</li>
<li>the cas proteins find phage DNA in the cell and insert it into the next spacer</li>
<li>crisper is in only ~40% of bacteria. How do other bacteria fight phage? The only other known mechanism is restriction enzymes. Can we find new immune systems in bacteria?</li>
<li>immune system signatures: rapidly evolving, high horizontal gene transfer</li>
<li>found the BREZ system in B. cereus! Found in 10% of bacteria. We don't understand the mechanisms yet</li>
<li>phages have anti-defense systems. Ergo it would be best for bacteria to have multiple defense systems</li>
</ul>
<br />
<br />
<b>Phil Hugenholtz</b> -- Back from the Dead: The Curious Tale of the Predatory Cyanobacterium Vampirovibrio chlorellavorus<br />
<br />
<ul style="text-align: left;">
<li>these bacteria suck host cells dry!</li>
<li>falls in Cyanobacteria clade</li>
<li>contains a type-IV secretion system partially found on plasmids</li>
<li>non photosynthetic (unlike most Cyanobacteria)</li>
</ul>
<br />
<br />
<b>Stephen Wright</b> -- Comparative and Population Genomics in the Brassicaceae: Understanding Genome-Wide Natural Selection<br />
<br />
<ul style="text-align: left;">
<li>evolution can go backwards. For example, Y chromosome degeneration in humans.</li>
<li>many plants that undergo whole genome duplication revert to diploid</li>
<li>there is evidence in Arabidopsis that a long time ago it had a whole genome duplication event</li>
<li>possible explanations: passive constituency of redundancy, inefficient selection, differential adaptation</li>
<li>there is limited evidence contrary to popular theory that selfing or limited recombination leads to a high density of deleterious mutations and evolutionary dead-ends</li>
<li>capsella rubells is a model for this phenomenon</li>
<li>ploidy effects the efficiency of natural selection</li>
<li>higher ploidy can weaken efficacy of natural selection</li>
<li>ploidy combined with transition to selfing increases rate of deleterious mutation accumulation</li>
<li>plant sex chromosomes are younger than mammalian sex chromosomes</li>
</ul>
<br />
<br />
<b>Steve Briggs</b> -- Protein Regulatory Networks<br />
<br />
<ul style="text-align: left;">
<li>see Walley et al. PNAS 2013!</li>
</ul>
<br />
<br />
<b>Susan Lynch</b> -- The Microbiome--A New Frontier in Human Health<br />
<br />
<ul style="text-align: left;">
<li>increase in asthma cases in children in US and Australia</li>
<li>asthma is mostly an environmental disease (ie not so much genetic)</li>
<li>now people have far less environmental exposure</li>
<li>Americans spend about 90% of their time indoors</li>
<li>vaginal born children have different microbiome than c-section born children. And they are less likely to have asthma.</li>
<li>lactobacillus keeps airways of mice challenged with dust open like normal airways</li>
<li>see Fuijmura et al PNAS January 2014</li>
<li><a href="http://www.henryford.com/body.cfm?id=58545">WHEALS</a> study</li>
<li>allergenic children often develop asthma</li>
</ul>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-14563959485636616612015-01-26T13:03:00.001-08:002015-01-26T13:03:20.234-08:00The iChip: A tool for high-throughput microbial culturing<div dir="ltr" style="text-align: left;" trbidi="on">
It is commonly accepted that only ~1% of naturally occurring microbes are culturable using standard culturing techniques. Until recently, culturing microbes has been the only way to investigate their presence and effects on various environments (i.e. human gut, soil, plant roots, oceans, etc). Metagenomics, sequencing DNA from a pool of microbes, has broadened our understanding of microbial communities by circumventing the need for culturing. However, culturing remains an important aspect of investigating microbial communities particularly for environments where the microbial diversity is so great that generating complete genome sequences via metagenomics is impractical (i.e. soil). <br />
<br />
In 2010, a study from the Epstein and Lewis groups (<a href="http://aem.asm.org/content/76/8/2445.full">Nichols et al., 2010</a>) describes a new technology which can increase culturing success from ~1% to nearly 50%! This technology, the isolation chip or iChip, contains 384 chambers. A single microbial cell is deposited in each chamber. Then a semi-permeable membrane is used to cover all the chambers and the iChip is placed into the environment from which the microbes originate. The membrane allows cells access to nutrients and growth factors found in its natural environment. <br />
<br />
The iChip has been successfully used to study novel secondary metabolites (<a href="http://www.nature.com/ja/journal/v63/n8/full/ja201087a.html">Lewis et al., 2010</a>) and discover novel antibiotics (<a href="http://www.nature.com/nature/journal/v517/n7535/full/nature14098.html">Ling et al., 2015</a>; <a href="http://www.nature.com/nrd/journal/v12/n5/full/nrd3975.html">Lewis, 2013</a>). However, a possible reason why high-throughput culturing techniques have not gained much traction is the emphasis on metagenomic applications and studies which do not suffer from culturing bias. High-throughput culturing and metagenomic sequencing each have a unique set of strengths and weaknesses. Perhaps more thought should be put into designing methods using a combination of high-throughput culturing and metagenomic sequencing to leverage both sets of strengths and mitigate their weaknesses. </div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-48780055062654173772014-11-21T06:17:00.000-08:002014-11-21T06:17:07.630-08:00Literature Review: Selection on soil microbiomes reveals reproducible impacts on plant function<div dir="ltr" style="text-align: left;" trbidi="on">
I recently found an intriguing microbiome paper, and I would like to post my thoughts about it.<br />
<br />
<b>Citation</b><br />
Panke-Buisse et al., <a href="http://www.nature.com/ismej/journal/vaop/ncurrent/full/ismej2014196a.html">Selection on soil microbiomes reveals reproducible impacts on plant function</a>, ISME Journal 2014, doi: 10.1038/ismej.2014.196.<br />
<br />
<b>Review</b><br />
The main hypothesis in this paper tests if phenotypes from a soil microbiome connected with a single plant genotype can be recapitulated in other plant genotypes. First, early- and late-flowering time associated soil microbiome were generated. Ten generations of <i>Arabidopsis thaliana</i> Col-0 were grown and at each generation soils from the four earliest and latests flowering pots were collected as the inoculum for the subsequent generation. Using soils from generation 10, final early- and late-flowering time associated inoculums were derived. These inoculums were then administered to four <i>Arabidopsis thaliana </i>ecotypes (Rld, Ler, Be, and Col-0) and a close relative, <i>Brasica rapa</i>. Flowering time, plant biomass, and nitrogen mineralization enzyme activity for each of the genotypes were statistically different between early- and late-flowering associated inoculum treatments except Ler (Fig 4). Microbes specifically associated with early- and late- flowering time treatments were found in low abundance suggesting that low-abundance microbes can contribute significantly to interesting phenotypes.<br />
<b><br /></b>
<b><br />Positives</b>
In general, I found this paper to be very insightful for the following reasons:<br />
<ul style="text-align: left;">
<li>Soil Microbial phenotypes can persist in novel hosts. This is a really cool result and could have direct implications for building field-useful microbiome treatments. However, flowering time is a well conserved phenotype so this result may not apply to other specific phenotypes of interest such as crop yield, disease resistance, or with greater host genetic divergence (i.e. corn flowering time vs arabidopsis).</li>
<li>Low abundance members of a community can drive an important phenotype. A common assumption in metagenomics is that the most abundant members of a community are the most important. Perhaps more focus should be geared toward linking microbes to phenotypes regardless of abundance.</li>
<li>This artificial selection design doesn't require detailed knowledge of genetic mechanisms in order to produce a desired phenotype. It is similar to the old dogma of plant breading where two good plants were crossed to produce an even better plant with no underlying knowledge of the genetic mechanisms. While understanding the mechanisms is important for maximizing the phenotypic benefits, it is not required to generate a positive effect.</li>
</ul>
<br />
<b>Critiques and Questions</b><br />
<ul style="text-align: left;">
<li>The methods section seems incomplete.</li>
<ul>
<li>Why are there so few OTUs in the heatmap and ternary plot? Were OTUs filtered out? I would expect to see hundreds or even thousands of OTUs from soil-derived microbiota.</li>
<li>Were the wild soils combined at equal ratios?</li>
<li>What was their measure of flowering time? I think there are well defined methods in the literature for observing flowering time.</li>
</ul>
<li>Why are the EF associated samples in the 10 generation experiment actually flowering at a later time in later generations (supp fig 1)?</li>
<li>Why are the control samples not included in supp fig 1? </li>
<li>Even though control samples were included as a covariate to generate figure 4, I would have liked to have seen them graphed there as well.</li>
<li>Perhaps a better way of generating the control samples would have been to randomly pick pots to generate a control inoculum. This was what they did in <a href="http://www.pnas.org/content/97/16/9110.full">Swenson et al., 2000</a></li>
</ul>
<b><br /></b>
<b>Future Work</b><br />
This experimental design could be used to address the following intriguing hypotheses:<br />
<ul style="text-align: left;">
<li>Selection on soil microbiomes changes the root endophyte microbial community.</li>
<ul>
<li>Redo the experiment with the same design but also 16S profile endophyte microbes. </li>
<li>It would also be interesting to see if the microbes specific to each treatment are also found inside the root. This would suggest a direct microbial effect on the plant phenotype.</li>
</ul>
<li>Selection for soil microbiome associated phenotypes is driven by multiple mechanisms.</li>
<ul>
<li>Redo the experiment but don't combine replicates such that the end inocula include four independent lines of selection for both early- and late-flowering time soil microbiomes. Compare microbes among treatment groups looking for different microbial profiles that express similar phenotypes. </li>
</ul>
</ul>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-44434774184638643262014-10-08T16:54:00.001-07:002014-10-08T16:54:54.609-07:00MiSeq Flow Cell Edges Correlate with Low Quality Scores<div dir="ltr" style="text-align: left;" trbidi="on">
Upon receiving a FASTQ file fresh off the MiSeq, the first question I ask myself is: "Did the sequencing work?" On several occasions I open these files to discover the first few pages of sequence reads are littered with N's and have low quality scores. However, when I run the full set of reads through a QC pipeline (e.g. FastQC) an overwhelming majority are high quality. <br />
<br />
<i>So why is it that the reads at the beginning of the FASTQ file have such poor quality?</i><br />
<i><br /></i>
Reads generated using Illumina technology are ordered by their cluster's y-coordinate on the flow cell. This led me to hypothesize that clusters near the edge of the flow cell are more likely to have low quality scores. To test this hypothesis I took a subset of reads (932,709) from a MiSeq run, calculated their average quality score, and graphed that against the distance from the closest edge. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXGYdTpLISXBZz_ak74OS2IlJq9rgPSNUWFIM9yqyuxp4ulhm2DGH4mQhndWmrZEjADb0yW6vGg6Fz3feNgXgsOGXdiiua3tvppTkmEtsTCO6-Z25Eih04g5Ox50x2G9EiZK4lWu-261Pn/s1600/flow_cell_error_cor.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXGYdTpLISXBZz_ak74OS2IlJq9rgPSNUWFIM9yqyuxp4ulhm2DGH4mQhndWmrZEjADb0yW6vGg6Fz3feNgXgsOGXdiiua3tvppTkmEtsTCO6-Z25Eih04g5Ox50x2G9EiZK4lWu-261Pn/s1600/flow_cell_error_cor.tiff" height="340" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The splines function in R was used to model the relationship between average quality score and distance from the closest flow cell edge (red line). Clusters closer to the edge of the flow cell have significantly lower quality scores. However, the good news is that the vast majority of clusters do not fall close to the edge (light blue heat). </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In practical terms this issue is insignificant because so few reads are affected, but it does explain the high concentration of low quality reads at the beginning (and end) of Illumina generated FASTQ files.</div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-65034513917414905132014-09-30T11:21:00.000-07:002014-09-30T11:34:30.525-07:00When is my genome finished? <div dir="ltr" style="text-align: left;" trbidi="on">
This is a common question for anyone doing genome sequencing and assembly. <br />
<br />
<b>The short answer:</b> never.<br />
<br />
<b>The slightly longer answer: </b> it depends.<br />
<br />
<b>The long answer: </b>Finishing a genome means the order of all nucleotide bases have been correctly resolved. Even for simple genomes this is <i>extremely</i> difficult. <a href="http://www.genome.gov/11006943">Billions</a> of dollars have been spent to sequence the human genome, and it is currently estimated that 92% of the human genome has greater than 99.99% accuracy (<a href="http://www.nature.com/nature/journal/v429/n6990/abs/nature02390.html">Schmutz et al., 2004</a>). In total, 99% of the genome has been assembled, and the remaining 1% are likely to be highly repetitive regions with little or no gene content (remember that 1% of 3 billion total bases means there are about 30 million unresolved bases). Using the current technology, it is impossible to reach 100% accuracy across 100% of the human genome. For simpler genomes such as some viral and bacterial genomes it is possible to completely resolve the entire sequence. However, because these genomes are much more dynamic (i.e. change at a faster rate), generating a completely finished genome may not be worth the cost. <br />
<br />
Finishing a genome requires a substantial amount of time, work, and money. However, getting a genome to draft status (i.e. incomplete but usable) can be done with minimal costs and resources. So perhaps the most important question is: <b><i>when is my genome usable</i>? </b>The answer to this question depends on the research questions being considered. For example, when studying the evolutionary history of genomes with a high propensity for genomic rearrangements, generating long contigs/scaffolds is important for determining the orientation and lineage of genomic rearrangements. Alternatively, some research questions are more concerned about the completeness of the gene content for making functional comparisons between genomes. For such a question, generating long, continuous contig/scaffolds is less important.<br />
<br />
Here are some possible ways to estimate completeness (with what I think are the best methods near the top):<br />
<ul style="text-align: left;">
<li>Compare the gene content of highly conserved genes</li>
<ul>
<li>Eukaryote: <a href="http://korflab.ucdavis.edu/Datasets/genome_completeness/">CEGMA</a></li>
<li>Bacterial: <a href="https://github.com/Ecogenomics/CheckM/wiki">CheckM</a></li>
<li>Archaeal: A <a href="file:///Users/Scott/Downloads/journal.pone.0061692.s006.pdf">table</a> of 53 conserved COGs</li>
</ul>
<li>Check for possible errors using <a href="http://genomebiology.com/2013/14/5/R47/">REAPR</a></li>
<li>Calculate assembly metrics like contig number, N50, coverage, and assembly size. </li>
<li>Compare the size of your assembled genome to a related (or set of related) genome(s). </li>
</ul>
<div>
See these papers for interesting discussion on evaluating assemblies: </div>
<div>
<ul style="text-align: left;">
<li><a href="http://genome.cshlp.org/content/early/2012/01/12/gr.131383.111.full.pdf+html">Salzberg et al., 2011</a>.</li>
<li><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0031002">Vezzi et al., 2012</a>.</li>
<li><a href="http://bioinformatics.oxfordjournals.org/content/early/2013/03/09/bioinformatics.btt086.long">Gurevich et al., 2013</a>.</li>
<li><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0069890">Shangguan et al., 2013</a>.</li>
<li><a href="http://www.ncbi.nlm.nih.gov/pubmed/24330235">Pan et al., 2014</a>.</li>
</ul>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com2tag:blogger.com,1999:blog-5869557521756883740.post-49635352893902382302014-08-22T14:14:00.000-07:002014-08-22T14:14:02.029-07:00How to Order Contigs Using a Reference Genome<div dir="ltr" style="text-align: left;" trbidi="on">
<b><span style="font-size: large;">Background</span></b><br />
In genomics complete (i.e. finished) genomes provide an excellent resource for future sequencing projects. However, generating finished genomes is an expensive and laborious endeavor. In some cases, the returns of finishing a genome are not worth the cost particularly when draft genomes can provide enough information for robust hypotheses testing. Draft genomes contain a large number of unordered sequences of various lengths called contigs. Contigs can be joined into more contiguous sequences called scaffolds using paired-end reads. While a set of contigs/scaffolds can be useful for a wide variety of projects (gene annotation, gene expression profiling, etc.), contig order provides vital information in comparative genomics and analyses of specific genomic regions. Typically, a rough ordering of contigs can be accomplished by mapping contigs to a closely related reference genome.<br />
<br />
<b><span style="font-size: large;">ABACAS</span></b><br />
Recently, I discovered a nice software package called ABACAS which not only orders contigs but creates a contiguous sequence representing the connected contigs. Gaps between contigs are represented by N's and overlapping regions are resolved (although I could not find information on exactly how). Because ABACAS requires <a href="http://mummer.sourceforge.net/">MUMer</a>, its outputs integrate will with other <a href="http://mummer.sourceforge.net/">MUMer</a> scripts such as those for visualizing alignments (Figure 1, i.e. <a href="http://mummer.sourceforge.net/manual/#mummerplot">mummerplot</a>). ABACAS is hosted by <a href="https://sourceforge.net/">SourceForge</a> and is well <a href="http://abacas.sourceforge.net/Manual.html">documented</a>. It was published in <i><a href="http://bioinformatics.oxfordjournals.org/content/25/15/1968.long">Bioinformatics</a> </i>in 2009.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxEaIqrZzIs-v2rjozXl0PkMKgmeNxH0Wl8XnQI1APmLCJp55xFIFflO2AbSDD6MVNSgD-RcS51A8F8Wy-UJ8yWT72CyRHipMDt_LMJU8gyL3JE6RTwA_99AO8iKm4PFscjkGSM_P1twBk/s1600/Slide1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxEaIqrZzIs-v2rjozXl0PkMKgmeNxH0Wl8XnQI1APmLCJp55xFIFflO2AbSDD6MVNSgD-RcS51A8F8Wy-UJ8yWT72CyRHipMDt_LMJU8gyL3JE6RTwA_99AO8iKm4PFscjkGSM_P1twBk/s1600/Slide1.jpg" height="287" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<b><span style="font-size: large;">Notes</span></b><br />
Of course, more closely related draft and reference genomes will generate the most correct ordering. However, in cases where the target genome is VERY closely related to the reference genome it may be better to map raw reads to the reference genome instead of build a <i>de novo</i> assembly. Deviations from the reference genome can be discovered using SNP detection software. Read mapping experiments are generally cheaper because they require fewer reads than <i>de novo</i> assembly but can still detect biologically significant differences when compared to the reference genome.</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-74310520648737347962014-07-11T12:08:00.001-07:002015-01-12T12:02:34.206-08:00Cool Unix Commands<div dir="ltr" style="text-align: left;" trbidi="on">
I will add to this list as I discover new ones. If you have a favorite or useful command feel free to include it in a comment on this post.<br />
<br />
<br />
<div style="text-align: left;">
<i><b>Convert a FASTQ file to FASTA (originally posted <a href="http://stackoverflow.com/questions/1542306/converting-fastq-to-fasta-with-sed-awk">here</a>):</b></i><br />
<ul style="text-align: left;">
</ul>
<div>
sed -n '1~4s/^@/>/p;2~4p' </div>
<div>
<br /></div>
<div>
NOTE: this assumes that each FASTQ entry spans only four lines as is customary.<br />
<br />
<br />
<br />
<i><b>Convert a SAM file to FASTA</b></i><br />
<i><br /></i>
awk '{OFS=""}{print $1, "\n", $10; }' file.sam > file.fasta<br />
<br />
NOTE: You will loose a lot of information in the sam file. You can save more of that info by adding column variables to the print statement. Also, you may have to change the column variable numbers depending on your sam file format. This is just a general example.<br />
<br />
<br />
<br />
<i><b>Replace spaces in file names with underscore (originally posted <a href="http://superuser.com/questions/295994/how-do-i-rename-files-with-spaces-using-the-linux-shell">here</a>)</b></i><br />
<i><br /></i>
rename ' ' '_' *<br />
<br />
NOTE: do NOT put spaces in file names!! This is so annoying!<br />
<br />
<br />
<br />
<i><b>Get a histogram of sequence lengths from FASTA/Q files (from Surge Biswas)</b></i><br />
<i><br /></i>
FASTQ: cat <fastq file> | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c<br />
FASTA: cat <fasta file> | awk '{if(NR%4==0) print length($1)}' | sort -n | uniq -c<br />
<br />
<br />
<br />
<i><b>Do arithmetic operations on the bash command line</b></i><br />
<i><br /></i>echo $((1 + 1))<br />
echo $((1 - 1))<br />
echo $((1 * 1))<br />
echo $((1 / 1))<br />
echo $(((1+3) / (1+1)))<br />
<br />
For floating point operations you can use the bc tool. For example<br />
<br />
echo "scale=1; 1/2" | bc<br />
<br />
<br />
<br />
<i><b>Add a comment to a bash command on the command line</b></i><br />
<i><b><br /></b></i>
<command>; # this is a comment line<br />
<br />
A practical example: mv file1 old_file1; # there is now a new file1 is a more recent version<br />
<br />
NOTE: Do you ever have a long and complex command for which you would like to save a simple note? You can use this little trick and the note will be saved along side your command in your history. The next time you look through your history to rerun the command you will also see the associated note.<br />
<br />
<br />
<br />
<b><i>Count the number of bases in a FASTA file</i></b><br />
<b><i><br /></i></b>
grep -v ">" file.fasta | wc | awk '{print $3 - $1}'<br />
<br />
(from martinghunt on <a href="http://seqanswers.com/forums/showthread.php?t=36160">SEQanswers</a>)<br />
<i><br /></i>
<i><br /></i></div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-5152881669689794382014-06-26T12:14:00.003-07:002015-04-04T05:57:11.030-07:00How to Learn Bioinformatics<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-size: large;"><b>Introduction</b></span><br />
<br />
At least once a month someone asks me for help learning bioinformatics. I love it when this happens because it usually means they want to take control of their own analysis thereby freeing up my time for problems that interest me. This post is a collection of tips and resources for people wanting to learn how to do bioinformatics. <br />
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">Keep These Things in Mind:</span></b><br />
<ul style="text-align: left;">
<li><b>Learning the basics of bioinformatics is easy. </b>The basics as described in this post are often taught in high school. However, don't get frustrated if you don't understand everything all at once. Learning anything new takes time and practice no matter its difficulty. </li>
<li> <b>A little bit goes a long way. </b>I estimate that nearly 90% of my work is occupied by simple routine procedures. Learning how to do these tasks will substantially expand your ability to analyze and interpret your data.</li>
<li><b>Google it. </b>Google is the best resource for learning new techniques and trouble shooting problems. If you have a question type exactly what you would say to a person into the google search bar. When you take questions to your bioinformatics friends it's likely they won't know the answer offhand and will google it anyway.</li>
<li><b>Try it. </b>If you're not sure about something try it and see what happens. Generally, there is very little danger is just trying a command to see if and how it works. That being said it's a good idea to backup important files and data just incase something goes very wrong. Every Unix programmer that I know has deleted a really important file using the <i>rm</i> command (which is one of the few irreversible Unix commands). It's going to happen to you too so make a backup.</li>
</ul>
<span style="font-size: large;"><b><br /></b></span>
<span style="font-size: large;"><b>Learn the Unix Basics</b></span><br />
<ul style="text-align: left;">
<li><b>Get on a Unix machine.</b> Doing is the most important aspect of learning Unix. You will never fully understand the basic concepts if you only read about them. Mac users have it easy because OSX is build on a unix shell. Simply open the terminal application and you are ready to start with an online tutorial. For non-mac users I recommend finding an old computer and installing a Linux/Unix operating system like <a href="http://www.ubuntu.com/">Ubuntu</a>. A slightly more difficult approach would be to partition the drive of an existing computer to <a href="https://help.ubuntu.com/community/WindowsDualBoot">dual boot</a> a Linux/Unix OS along with the existing OS.</li>
<li><b>Complete an online tutorial</b></li>
<ul>
<li><a href="http://www.ee.surrey.ac.uk/Teaching/Unix/">Unix Tutorial for Beginners</a></li>
<li><a href="http://cnx.org/content/m13327/1.1/">Introduction to Unix for Biologists</a></li>
<li><a href="http://korflab.ucdavis.edu/Unix_and_Perl/index.html">Unix and Perl Primer for Biologists</a></li>
</ul>
</ul>
<ul style="text-align: left;">
<li><b>Buy a book</b> if you are a book learner. However, the basic can pretty much all be learned using online materials. My favorite Unix book is O'Reilly's <a href="http://shop.oreilly.com/product/9780596003302.do">Unix Power Tools</a>.</li>
</ul>
<br />
<b><span style="font-size: large;">Learn a Scripting Language</span></b><br />
<ul style="text-align: left;">
<li><b>Pick a scripting language.</b> Scripting languages are computer languages that are not compiled (i.e. they are interpreted by the computer on the fly). The two most popular bioinformatics scripting languages are <a href="http://www.perl.org/">Perl</a> and <a href="https://www.python.org/">Python</a>. Both languages have their strengths and weakness, but I personally prefer Perl.</li>
<li><b>Complete an online tutorial for your language.</b></li>
<ul>
<li><a href="http://learn.perl.org/tutorials/">Perl tutorial</a></li>
<li><a href="http://korflab.ucdavis.edu/Unix_and_Perl/">Perl and Unix for Biologists</a></li>
<li><a href="https://docs.python.org/2/tutorial/">Python tutorial</a></li>
<li><a href="http://pythonforbiologists.com/index.php/introduction-to-python-for-biologists/">Python for Biologists</a></li>
</ul>
<li><b>Buy a book</b><b style="font-style: italic;">. </b>My favorite Perl book is <a href="http://shop.oreilly.com/product/9780596001735.do">Perl Best Practices</a> by Damian Conway. This book is a must have for all Perl programmers! I don't have much experience with Python books, so I would recommend looking at book reviews before making a purchase.</li>
</ul>
<br />
<b><span style="font-size: large;">Learn a Statistical/Graphing Language</span></b><br />
<ul style="text-align: left;">
<li><b>Pick a language for doing statistical operations and building figures.</b> Languages like <a href="http://www.r-project.org/">R</a> and <a href="http://www.mathworks.com/products/matlab/">Matlab</a> are prime choices for both statistics and graphics. Both languages have their strengths and weaknesses, but I personally prefer R. If you choose R I highly recommend using the <a href="http://ggplot2.org/">ggplot2</a> library for building figures. </li>
<li><b>Complete an online tutorial for your language.</b></li>
<ul>
<li><a href="http://www.cyclismo.org/tutorial/R/">R tutorial</a> (<a href="http://rstudio-pubs-static.s3.amazonaws.com/2176_75884214fc524dc0bc2a140573da38bb.html">ggplot tutorial</a>) (<a href="https://www.datacamp.com/courses/free-introduction-to-r">another fantastic R tutorial by DataCamp</a>)</li>
<li><a href="http://www.mathworks.com/academia/student_center/tutorials/launchpad.html">Matlab tutorial</a> </li>
</ul>
<li><b>Give up Excel.</b> Excel is a powerful program but lacks the flexibility of computer languages like R and Matlab. While there is a steeper learning curve for R and Matlab, you will substantially enhance your ability to do statistical analyses and build graphics by getting away from Excel.</li>
</ul>
<br />
<b><span style="font-size: large;">Learn Basic Bioinformatics Procedures and Corresponding Software Tools</span></b><br />
<br />
For example:<br />
<ul style="text-align: left;">
<li>Quality control</li>
<ul>
<li><a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/">fastQC</a></li>
</ul>
<li><a href="http://en.wikipedia.org/wiki/Sequence_assembly">Sequence assembly</a></li>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Velvet_assembler">Velvet</a>,</li>
<li><a href="http://bioinf.spbau.ru/spades">SPAdes</a> (my preference, also see <a href="http://scottmyourstone.blogspot.com/2013/11/new-spades-assembler-used-on-miseq.html">this post</a>)</li>
</ul>
<li><a href="http://en.wikipedia.org/wiki/Sequence_assembly#De-novo_vs._mapping_assembly">Sequence mapping</a></li>
<ul>
<li><a href="http://bio-bwa.sourceforge.net/">BWA</a></li>
<li><a href="http://bowtie-bio.sourceforge.net/index.shtml">Bowtie</a></li>
</ul>
<li>Sequence searching</li>
<ul>
<li><a href="http://en.wikipedia.org/wiki/BLAST">BLAST</a></li>
</ul>
<li><a href="http://genomebiology.com/2013/14/9/R95">Differential expression</a></li>
<ul>
<li><a href="http://www.bioconductor.org/packages/release/bioc/html/edgeR.html">edgeR</a></li>
<li><a href="http://bioconductor.org/packages/release/bioc/html/DESeq.html">DESeq</a></li>
</ul>
<li>Variant calling</li>
<ul>
<li><a href="http://www.broadinstitute.org/gatk/">GATK</a></li>
</ul>
<li>etc</li>
</ul>
This is only a small list of procedures and tools primarily focusing on DNA sequence analysis. For a more comprehensive list see <a href="http://omictools.com/">OMICtools</a>.<br />
<br />
<br />
<b><span style="font-size: large;">Find a problem</span></b><br />
<br />
I strongly encourage new bioinformaticians to find some real data to do meaningful science using the above principles and skills. If you don't personally have data I recommend downloading data from a public repository (i.e. Genbank). A similar alternative would be to choose a paper that uses a procedure you are interested in learning and recapitulate the results. </div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com2tag:blogger.com,1999:blog-5869557521756883740.post-39782901026912977102014-03-26T11:25:00.000-07:002014-03-26T11:25:40.263-07:00Predicting Full-Length Ribosomal Gene Sequences<div dir="ltr" style="text-align: left;" trbidi="on">
<b><span style="font-size: large;">Introduction</span></b><br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
The 16S ribosomal gene has been used extensively in biology for distinguishing relatedness between species. This gene has regions of DNA that are highly conserved among almost all living organisms and other regions that have high DNA sequence variability. The conserved regions are ideal for building <a href="http://en.wikipedia.org/wiki/Polymerase_chain_reaction">PCR</a> primers that can amplify DNA from many different organism. The variable regions that are amplified using these conserved primers can be used to determine the relatedness between two or more organisms. Closely related species typically have much more similar DNA sequences than distantly related species. <br />
<br />
Typically PCR is used to amplify a portion of the <a href="http://en.wikipedia.org/wiki/16S_ribosomal_RNA">16S ribosomal gene</a> for sequencing. However, whole genome sequences or whole metagenome sequences also contain short DNA reads originating from the 16S gene. These reads can be separated from the pool of other genomic reads and assembled into the entire 16S gene. <a href="http://genomebiology.com/2011/12/5/r44">EMIRGE</a> (Miller, et al. 2011) is an algorithm for reconstructing full-length ribosomal genes from short read DNA sequences. <br />
<br />
<br />
<b><span style="font-size: large;">EMIRGE</span></b><br />
<b><br /></b>
EMIRGE reconstructs full-length ribosomal genes from short read DNA sequences. It first maps reads to a database of known 16S genes such as the <a href="http://www.arb-silva.de/">SILVA</a> or g<a href="http://greengenes.lbl.gov/cgi-bin/nph-index.cgi">reengenes</a> database. After the initial mapping, EMIRGE estimates the probability that a given read was generated from the reference to which it mapped. Based on these probability estimates, reference sequences are changed to reflect the 16S sequences that are likely to be represented by the set of reads. Reads are then remapped to the adjusted 16S sequence database and the processes is repeated until an equilibrium is achieved. The resulting database of 16S sequences reflect the likely 16S genes represented by the input set of short reads.<br />
<br />
This software was primarily built to infer the set of 16S genes from whole metagenome reads. However, it can also be used to infer the single 16S gene from genomic sequences from a single isolate. Full-length 16S genes are difficult to assemble even when only reads from a single genome are considered. <br />
<br />
In the Dangl lab, we use EMIRGE to predict full-length 16S genes from reads generated from a single genome of bacteria. An example of the EMIRGE command we use is:<br />
<br />
emirge.py my_output_dir -1 fwd_reads.fastq -2 rev_reads.fastq -b SSURef_NR99_115_tax_silva_formated -f SSURef_NR99_115_tax_silva_formated.fasta -i 600 -s 1000 -l 250<br />
<div class="p1">
<br />
The descriptions of each parameter are below:<br />
<br /></div>
<div class="p1">
my_output_dir: the output and working directory for EMIRGE.</div>
<div class="p1">
<br /></div>
<div class="p1">
-1: the forward or single-end genomic sequencing reads</div>
<div class="p1">
<br /></div>
<div class="p1">
-2: the reverse end genomic sequencing reads</div>
<div class="p1">
<br /></div>
<div class="p1">
-b: the <a href="http://bowtie-bio.sourceforge.net/manual.shtml">bowtie</a> index of the 16S sequence database</div>
<div class="p1">
<br /></div>
<div class="p1">
-f: the fasta file of the 16S sequence database</div>
<div class="p1">
<br /></div>
<div class="p1">
-i: insert size of paired-end reads</div>
<div class="p1">
<br /></div>
<div class="p1">
-s: standard deviation of insert size for paired-end reads</div>
<div class="p1">
<br /></div>
<div class="p1">
-l: max length of reads<br />
<br />
<br />
<b><span style="font-size: large;">Other Details</span></b><br />
<b><br /></b>EMIRGE uses bowtie to map reads to the reference database. To build the bowtie index of the reference database the following command was used:<br />
<br />
<div class="p1">
bowtie-build SSURef_NR99_115_tax_silva_formated.fasta SSURef_NR99_115_tax_silva_formated</div>
<div class="p1">
<br /></div>
<div class="p1">
Also, the database downloaded from SILVA had to be reformatted using <a href="https://gist.github.com/anonymous/9789621">this</a> Perl script. This script requires <a href="https://sourceforge.net/projects/bioutilsperllib/?source=directory">BioUtils</a>.</div>
</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-25224016398587692512014-03-24T13:52:00.000-07:002014-03-24T13:52:31.658-07:002014 JGI Users Meeting Notes<div dir="ltr" style="text-align: left;" trbidi="on">
Here are some notes from a few of the speakers at the JGI Users meeting in California. In general the speakers were fantastic. Some general themes of the conference include: single-cell genomics, synthetic biology, fungal metagenomics, and metabolics. A person take-home message for me was the need for creative biological solutions to common issues that the human race currently faces or will face in the near future. <br />
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Mark Ackermann (opening keynote) – A
Single Cell Perspective on Bacterial Interactions<o:p></o:p></b></div>
<div class="MsoNormal">
- Focused on phenotypic heterogeneity, when identical cells
have different functional profiles.<o:p></o:p></div>
<div class="MsoNormal">
- Most genes don’t have clonal variation but in the ones
that do how is that heterogeneity important for the community.<o:p></o:p></div>
<div class="MsoNormal">
- Salmonella is an example of phenotypic heterogeneity.<span style="mso-spacerun: yes;"> </span>One cell type causes inflammation and one
uses the inflammation response to reproduce and cause full infection.<o:p></o:p></div>
<div class="MsoNormal">
- Different cell types survive better in different
environmental conditions.<o:p></o:p></div>
<div class="MsoNormal">
- Another example of phenotypic heterogeneity is in alpine
lakes where there are generally large amounts of ammonium that bacteria can use
as a nitrogen source.<span style="mso-spacerun: yes;"> </span>However, there are
some cells that fix their own nitrogen in the event that ammonium runs out.<o:p></o:p></div>
<div class="MsoNormal">
- preliminary data show that neighboring cells are more
likely to be of the same cell type.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Mary Berbee –
Pectinases link Early Fungal Evolution to the Land Plant Lineage</b><o:p></o:p></div>
<div class="MsoNormal">
- Sequenced early divergent fungal groups.<o:p></o:p></div>
<div class="MsoNormal">
- The relationship between the early branching groups is
still poorly resolved.<o:p></o:p></div>
<div class="MsoNormal">
- Showed some cool trees where she had overlaid two trees to
highlight difference between the two.<span style="mso-spacerun: yes;"> </span>I
would like to know what software she used to do this.<o:p></o:p></div>
<div class="MsoNormal">
- Her trees were based on whole genomes but I’m not sure how
she built them.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Rytas Vilgalys –
Understanding the Forest Microbiome: <span style="mso-spacerun: yes;"> </span>A
Fungal Perspective</b><o:p></o:p></div>
<div class="MsoNormal">
- Oak and pine share many fungi while populus has more
different fungi.<o:p></o:p></div>
<div class="MsoNormal">
- Soils from the same region are likely to share the same
fungi.<o:p></o:p></div>
<div class="MsoNormal">
- Populus of different genotypes do not assembly different
fungi.<span style="mso-spacerun: yes;"> </span>At least not nearly as different
as fungi from different regions.<o:p></o:p></div>
<div class="MsoNormal">
- They have isolated ~1,800 fungal isolates.<span style="mso-spacerun: yes;"> </span>These isolate represent only ~15% of the
isolates that are likely populus endophytes.<o:p></o:p></div>
<div class="MsoNormal">
- Many fungal isolates stimulate plant growth.<o:p></o:p></div>
<div class="MsoNormal">
- They are re-inoculating these isolates to confirm they are
endophytic.<o:p></o:p></div>
<div class="MsoNormal">
- Mortierella elongata is an isolate that stimulates plant
growth in populus and Arabidopsis thaliana.<o:p></o:p></div>
<div class="MsoNormal">
- M. elongata also harbors bacterial symbionts
(Glomeribacter which are known to affect lipid fermentation and is a sister to
Burkholderia.<span style="mso-spacerun: yes;"> </span>These bacteria cannot be
cultured possibly because they rely so heavily on the host for nutrients).<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
- M. elongata migrate to the roots.<o:p></o:p></div>
<div class="MsoNormal">
- Different genes are expressed in M. elongata grown in
culture than those sampled from the rhizosphere.<o:p></o:p></div>
<div class="MsoNormal">
- Different genes are expressed in M. elongata inoculated on
different hosts.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Eddy Rubin</b><o:p></o:p></div>
<div class="MsoNormal">
- Bacterial genes are typically ~900bp.<o:p></o:p></div>
<div class="MsoNormal">
- In a couple of sequenced genomes they saw average
bacterial gene lengths as low as 200bp.<span style="mso-spacerun: yes;"> </span>However,
when they adjust the codon table by replacing one of the stop codons to code
for a glycine predicted genes have an average length of 900bp!<span style="mso-spacerun: yes;"> </span>Some bacteria use different codon
translations!<span style="mso-spacerun: yes;"> </span><span style="background: yellow; mso-highlight: yellow;"><o:p></o:p></span></div>
<div class="MsoNormal">
- Natalia Ivanova is a gene annotation specialist they
consulted for help in this analysis.<o:p></o:p></div>
<div class="MsoNormal">
- They found evidence of recoding in lots of other bacteria
by looking at sequenced isolates.<o:p></o:p></div>
<div class="MsoNormal">
- Didn’t find evidence of recoding in archea.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
- They show that phages which use different codon profiles
can circumvent host cell machinery to match their codon profile!<o:p></o:p></div>
<div class="MsoNormal">
- CRISPR regions in bacterial cells often contain phage
elements that correspond to different codon profiles.<span style="mso-spacerun: yes;"> </span>This is further evidence that phages with
different codon profiles can infect cells with canonical codon profiles.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Nicole Dublier
–Metagenomics and Metaproteomic Analyses of Symbioses between Bacteria and
Gutless Marine Worms</b><o:p></o:p></div>
<div class="MsoNormal">
- Bacteria can use hydrogen to produce more energy than
methane.<span style="mso-spacerun: yes;"> </span>Nature 2011<o:p></o:p></div>
<div class="MsoNormal">
- They discovered key genes able to metabolize hydrogen.<o:p></o:p></div>
<div class="MsoNormal">
- The second half of the talk was about gutless worms living
in shallow water.<span style="mso-spacerun: yes;"> </span>They completely
dependent on bacterial symbionts for feeding and waste excretion.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
- There are species specific symbionts.<o:p></o:p></div>
<div class="MsoNormal">
- Her proteomics data yield more obvious features than
comparative genomics.<span style="mso-spacerun: yes;"> </span>As an example she
shows how one isolate contains a protein that does the function of 3 different
proteins in the canonical Calvin Cycle.<span style="mso-spacerun: yes;">
</span>DNA sequencing confirmed this observation but would have been a
“needle-in-a-haystack” for a comparative genomics project.<span style="mso-spacerun: yes;"> </span>This work published in PNAS.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Erin Nuccio – Mapping
Soil Carbon from Cradle to Grave:<span style="mso-spacerun: yes;"> </span>Using
Omics and Isotope Analyses to Identify the Microbial Blueprint for
Root-enhanced Decomposition of Organic Matter.</b><o:p></o:p></div>
<div class="MsoNormal">
- The general question is how do microbes transform and
stabilize root carbon in soil.<o:p></o:p></div>
<div class="MsoNormal">
- Carbon can affect nitrogen rates.<o:p></o:p></div>
<div class="MsoNormal">
- Plants fix carbon for microbes in the soil.<o:p></o:p></div>
<div class="MsoNormal">
- Looking at the rhizosphere over time it gradually deviates
from bulk soil in carbon levels at time points of 3, 6, 9, and 12 weeks.<o:p></o:p></div>
<div class="MsoNormal">
- Some preliminary data show that bacteria prefer carbon
excreted by plant over as an energy source over nitrogen liter material (ie
material artificially added to the system).<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Michael Fischbach – A
Gene-to-Molecule Approach to the Discovery and Characterization of Natural
Products</b><o:p></o:p></div>
<div class="MsoNormal">
- Discovers natural gene products.<span style="mso-spacerun: yes;"> </span>By gene products I think he means functional
protein units.<o:p></o:p></div>
<div class="MsoNormal">
- Undiscovered gene products are often coded by clusters of
genes.<o:p></o:p></div>
<div class="MsoNormal">
- Has some type of algorithm to computationally discover
these clusters that may produce unknown gene products.<o:p></o:p></div>
<div class="MsoNormal">
- Lots of his most interesting clusters were found on human
associated microbes.<o:p></o:p></div>
<div class="MsoNormal">
- Discovered several oligosaccharide clusters.<span style="mso-spacerun: yes;"> </span>These bacteria were very difficult to work
with but these clusters and the functions they provide to the human host are of
high interest.<o:p></o:p></div>
<div class="MsoNormal">
- The general observation of this study was that microbes in
our gut are making products for which we have no idea what they are or how they
function.<span style="mso-spacerun: yes;"> </span>It’s like taking several
prescription drugs for your entire life!<span style="mso-spacerun: yes;">
</span>We need to figure out what is going on in there.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Kelly Matzen –
Genetic Control of Mosquitoes</b><o:p></o:p></div>
<div class="MsoNormal">
- In the 50’s DDT was used to control mosquito populations
and subsequently mosquito born disease such as dengue.<span style="mso-spacerun: yes;"> </span>However, DDT is know to be detrimental to the
environment in several ways and therefore is being used much less.<span style="mso-spacerun: yes;"> </span>We are starting to see diseases like dengue
make a comeback in places like Florida and of course in places like Central and
South America.<o:p></o:p></div>
<div class="MsoNormal">
- Right now the most effective control is pesticides.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
- They are releasing massive numbers of sterile male
mosquitoes to control (ie reduce) mosquito populations.<span style="mso-spacerun: yes;"> </span>This technique has been successfully used
before in the United States to control populations of other insects many years
ago.<o:p></o:p></div>
<div class="MsoNormal">
- This technique seems to be working in the small field
studies they have been conducting.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
- There is some push back from legislators but in general it
seems like good solution.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Cameron Coates –
Characterization of Cyanobacterial Hydrocarbon composition and Distribution of
Biosynthetic Pathways</b><o:p></o:p></div>
<div class="MsoNormal">
- Cyanobacteria produce over 30% of the earth’s oxygen.<o:p></o:p></div>
<div class="MsoNormal">
- They are very diverse and live in all sorts of habitats on
earth.<o:p></o:p></div>
<div class="MsoNormal">
- They can produce hydrocarbons where are relevant of use of
biofuels.<span style="mso-spacerun: yes;"> </span>However, they don’t produce
large amounts of hydorcarbons.<o:p></o:p></div>
<div class="MsoNormal">
- They looked at the evolution of cyanobacteria hydrocarbon
pathways.<span style="mso-spacerun: yes;"> </span>There are two main pathways.<span style="mso-spacerun: yes;"> </span>Several clades have both pathways suggesting
a large amount of horizontal gene transfer.<span style="mso-spacerun: yes;">
</span><o:p></o:p></div>
<div class="MsoNormal">
- This work was published in PLOS ONE.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">June Medford – Making
Better Plants:<span style="mso-spacerun: yes;"> </span>Synthetic Approaches in
Plant Engineering</b><o:p></o:p></div>
<div class="MsoNormal">
- They created a biological input/output system.<span style="mso-spacerun: yes;"> </span>This allows for some external factor to cause
a reaction that can be observed in the plant.<o:p></o:p></div>
<div class="MsoNormal">
- They use a pariplasmic binding protein as the input signal
because it can quickly defuse through the cell wall and are then translocated
to the nucleus to transcriptionally regulate some response.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
- They can theoretically use this system as a flag for
pollutants or other dangers that we currently use very expensive technology to
detect.<o:p></o:p></div>
<div class="MsoNormal">
- They are currently developing a system to detect TNT where
the response signal of the plant is to turn white.<span style="mso-spacerun: yes;"> </span>This system can detect traces of TNT 10x
smaller than a dog!<span style="mso-spacerun: yes;"> </span>There are still some
kinks to work through like response time.<span style="mso-spacerun: yes;">
</span>But looks like a very promising system.<span style="mso-spacerun: yes;">
</span>This idea has countless unexplored applications!<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b style="mso-bidi-font-weight: normal;">Kankshita Swaminathan
– Genome Biology of Miscanthus</b><o:p></o:p></div>
<div class="MsoNormal">
- Miscanthus is in the same clade as sugar cane, corn, and
sorghum.<span style="mso-spacerun: yes;"> </span>These plants have been amenable
to breading.<o:p></o:p></div>
<div class="MsoNormal">
- The genomic sequence of sorghum is very close to
Miscanthus except that Miscanthus has had a whole genome duplication event.<o:p></o:p></div>
<div class="MsoNormal">
- In the winter all the nutrients migrate to the rhizome
leaving only the stalk above ground.<span style="mso-spacerun: yes;"> </span>The
stalk is the most important element for biofuels and can be harvested without
significantly depleting soil nutrients.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b>Annalee Newitz (closing keynote) - How Humans Will Survive a Mass Extinction</b></div>
<div class="MsoNormal">
- Humans have a very good chance of surviving a mass extinction because we are very adaptable. However, our focus should be how we can preserve the diversity of the earth as it is now.</div>
<div class="MsoNormal">
- A mass extinction is when greater than 70% of the earth's species are killed.</div>
<div class="MsoNormal">
- Five mass extinctions have occurred in the history of the earth. Perhaps the largest was caused by cyanobacteria because they released large amounts of oxygen into the atmosphere. Close to 90% of species became extinct as a result.</div>
<div class="MsoNormal">
- Climate change is inevitable regardless of wither or not humans are the cause.</div>
<div class="MsoNormal">
- The questions we should be asking are: how can we respond to these changing climates and what can we do to preserve the world as we know it.</div>
<div class="MsoNormal">
- Space travel seems like an important step in human survival. </div>
<div class="MsoNormal">
<br /></div>
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Revision>0</o:Revision>
<o:TotalTime>0</o:TotalTime>
<o:Pages>1</o:Pages>
<o:Words>1300</o:Words>
<o:Characters>7410</o:Characters>
<o:Company>University of North Carolina at Chapel Hill</o:Company>
<o:Lines>61</o:Lines>
<o:Paragraphs>17</o:Paragraphs>
<o:CharactersWithSpaces>8693</o:CharactersWithSpaces>
<o:Version>14.0</o:Version>
</o:DocumentProperties>
<o:OfficeDocumentSettings>
<o:AllowPNG/>
</o:OfficeDocumentSettings>
</xml><![endif]-->
<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:TrackMoves/>
<w:TrackFormatting/>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:DoNotPromoteQF/>
<w:LidThemeOther>EN-US</w:LidThemeOther>
<w:LidThemeAsian>JA</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
<w:DontGrowAutofit/>
<w:SplitPgBreakAndParaMark/>
<w:EnableOpenTypeKerning/>
<w:DontFlipMirrorIndents/>
<w:OverrideTableStyleHps/>
<w:UseFELayout/>
</w:Compatibility>
<m:mathPr>
<m:mathFont m:val="Cambria Math"/>
<m:brkBin m:val="before"/>
<m:brkBinSub m:val="--"/>
<m:smallFrac m:val="off"/>
<m:dispDef/>
<m:lMargin m:val="0"/>
<m:rMargin m:val="0"/>
<m:defJc m:val="centerGroup"/>
<m:wrapIndent m:val="1440"/>
<m:intLim m:val="subSup"/>
<m:naryLim m:val="undOvr"/>
</m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="276">
<w:LsdException Locked="false" Priority="0" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Normal"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="heading 1"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 2"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 3"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 4"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 5"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 6"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 7"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 8"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 9"/>
<w:LsdException Locked="false" Priority="39" Name="toc 1"/>
<w:LsdException Locked="false" Priority="39" Name="toc 2"/>
<w:LsdException Locked="false" Priority="39" Name="toc 3"/>
<w:LsdException Locked="false" Priority="39" Name="toc 4"/>
<w:LsdException Locked="false" Priority="39" Name="toc 5"/>
<w:LsdException Locked="false" Priority="39" Name="toc 6"/>
<w:LsdException Locked="false" Priority="39" Name="toc 7"/>
<w:LsdException Locked="false" Priority="39" Name="toc 8"/>
<w:LsdException Locked="false" Priority="39" Name="toc 9"/>
<w:LsdException Locked="false" Priority="35" QFormat="true" Name="caption"/>
<w:LsdException Locked="false" Priority="10" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Title"/>
<w:LsdException Locked="false" Priority="1" Name="Default Paragraph Font"/>
<w:LsdException Locked="false" Priority="11" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtitle"/>
<w:LsdException Locked="false" Priority="22" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Strong"/>
<w:LsdException Locked="false" Priority="20" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Emphasis"/>
<w:LsdException Locked="false" Priority="59" SemiHidden="false"
UnhideWhenUsed="false" Name="Table Grid"/>
<w:LsdException Locked="false" UnhideWhenUsed="false" Name="Placeholder Text"/>
<w:LsdException Locked="false" Priority="1" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="No Spacing"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 1"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 1"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 1"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 1"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 1"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 1"/>
<w:LsdException Locked="false" UnhideWhenUsed="false" Name="Revision"/>
<w:LsdException Locked="false" Priority="34" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="List Paragraph"/>
<w:LsdException Locked="false" Priority="29" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Quote"/>
<w:LsdException Locked="false" Priority="30" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Quote"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 1"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 1"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 1"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 1"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 1"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 1"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 1"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 2"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 2"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 2"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 2"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 2"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 2"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 2"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 2"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 2"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 2"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 2"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 2"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 2"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 2"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 3"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 3"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 3"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 3"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 3"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 3"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 3"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 3"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 3"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 3"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 3"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 3"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 3"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 3"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 4"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 4"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 4"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 4"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 4"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 4"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 4"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 4"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 4"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 4"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 4"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 4"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 4"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 4"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 5"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 5"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 5"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 5"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 5"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 5"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 5"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 5"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 5"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 5"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 5"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 5"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 5"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 5"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 6"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 6"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 6"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 6"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 6"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 6"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 6"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 6"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 6"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 6"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 6"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 6"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 6"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 6"/>
<w:LsdException Locked="false" Priority="19" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Emphasis"/>
<w:LsdException Locked="false" Priority="21" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Emphasis"/>
<w:LsdException Locked="false" Priority="31" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Reference"/>
<w:LsdException Locked="false" Priority="32" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Reference"/>
<w:LsdException Locked="false" Priority="33" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Book Title"/>
<w:LsdException Locked="false" Priority="37" Name="Bibliography"/>
<w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
</w:LatentStyles>
</xml><![endif]-->
<!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-priority:99;
mso-style-parent:"";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin:0in;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:Cambria;
mso-ascii-font-family:Cambria;
mso-ascii-theme-font:minor-latin;
mso-hansi-font-family:Cambria;
mso-hansi-theme-font:minor-latin;}
</style>
<![endif]-->
<!--StartFragment-->
<!--EndFragment--><br />
<div class="MsoNormal">
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-88418398450920039862014-02-25T11:55:00.000-08:002014-03-31T11:15:00.620-07:00The Pan/Core/Accessory Genome<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Introduction</b><br />
The term "pan genome" was coined in 2005 by <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1216834/?report=classic">Tettelin</a> in a paper describing the genomes of eight pathogenic Streptococcus strains. The pan genome is the set of all unique genes from a set of genomes (ie gene union). The core genome is the set of genes found in each genome (ie gene intersection). The accessory genome is the genes unique to a particular genome (ie strain specific genes). <br />
<br />
<b>Previous Studies</b><br />
<a href="http://www.nature.com/nature/journal/v499/n7457/full/nature12221.html">Read et al. (2012)</a> discusses the pan and core genome of phytoplancton. They estimate the size of the <i>E. huxleyi</i> pan genome to be large because there are several thousand genes in the reference that are missing from all of the three well-sequenced isolates. This is definitely more of a core genome paper (the analysis of which is easy when you have a reference). Perhaps a better way of showing the diversity of the pan genome is rarefaction curves on the number of homologous genes or perhaps k-mer content. The lead investigator, Igor Grigoriev, is the fungal genomics lead investigator at the JGI.<br />
<br />
<b>Pan and Core Genome Dynamics</b><br />
In general, as more genomes are added to the analysis set, the core-genome shrinks and the pan-genome grows. <a href="http://mbe.oxfordjournals.org/content/29/11/3413.full">Collins</a> (2012) describe these dynamics in their Molecular Biology and Evolution publication by using an infinitely many genes (<a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342869/">IMG</a>) model. In summary, the Collins IMG model is based on the idea of three types of gene classes: core, shell, and cloud genes. Core genes are those found in all genomes, shell genes are gained and lost from genomes at a relatively slow rate, and cloud genes are rapidly gained and lost from genomes. Empirical data from a set of Bacillaceae genomes support the Collins IMG model. The Collins IMG model can be used to predict the size of core- and pan-genomes. In the Dangl lab this has become a question of substantial interest for determining which relevant clades require more isolate genomes for more robust functional genomics analyses.<br />
<br />
<b>Core vs Accessory</b><br />
Given a set of genes from various genomes what gene set is most interesting? Of course this question will most heavily depend on previous knowledge of the genomes in question. In the Dangl lab we are interested in microbes that inhabit the endosphere (inner plant root). Because the set of microbes living inside these roots exhibit a different profile than surrounding soil, one could hypothesize that a single or small set of genes are responsible for a microbes ability to inhabit the inner root. Under this hypothesis the core-genome would be of particular interest because it should contain these genes. However, in practice the core genomes is primarily composed of common cellular functions ubiquitous to all bacteria. <br />
<br />
Because the core-genome is likely to reveal nothing of specific interest the accessory-genome seems most interesting. The genes in this set alludes to what makes a particular genome functionally distinct and interesting. They get at questions like: "What functions does a particular bacteria provide to the community."<br />
<br />
<b>Software for pan-genome analysis</b><br />
The primary aim in any pan-genome analysis is grouping orthologous genes from different genomes. To do this many pan-genome pipelines utilize algorithms and databases such as <a href="http://www.geneontology.org/">GO</a>, <a href="http://www.ncbi.nlm.nih.gov/books/NBK21090/">COG</a>, <a href="http://www.genome.jp/kegg/kegg1.html">KEGG</a>, <a href="http://eggnog.embl.de/version_4.0.beta/">eggNOG</a>, <a href="http://pfam.sanger.ac.uk/">Pfam</a>, etc. <br />
<br />
Here are some notes on various pipelines developed for pan-genome analyses. <br />
<ul style="text-align: left;">
<li><a href="http://aem.asm.org/content/79/24/7696.long">GET_HOMOLOGUES</a> (my recommendation)</li>
<ul>
<li>This is my program of choice for pan/core/accessory genome analyses. </li>
<li>Fantastic documentation</li>
<li>Options for bidirectional blast hit (BDBH), COGtriangle, and/or orthoMCL algorithms for building clusters of orthologous groups.</li>
<li>Builds clear figures</li>
<li>Several options for powerful downstream analyses. </li>
<li>Parallelization options</li>
</ul>
<li><a href="http://www.biomedcentral.com/1471-2105/11/461">Panseq (Laing, 2010)</a></li>
<ul>
<li>Nice <a href="http://www.webcitation.org/query.php?url=http://76.70.11.198/panseq&refdoi=10.1186/1471-2105-11-461">web-based interface</a>. </li>
<li>Seems to work (as opposed to some of the following programs)</li>
<li>Output formats are not as user friendly or as concise get_homologues</li>
<li>Job completion email never comes in. Be sure to save the link somewhere.</li>
<li>More suited for small, quick analyses</li>
</ul>
<li><a href="http://www.ncbi.nlm.nih.gov/pubmed/21765097">PGAT (Brittnacher, 2011)</a></li>
<ul>
<li>Nice <a href="http://tools.nwrce.org/pgat/index.html">web-based interface and documentation</a></li>
<li>Limited to comparisons between only a small number of genomes already stored in their database. </li>
</ul>
<li><a href="http://bioinformatics.oxfordjournals.org/content/28/3/416.long">PGAP (Zhao, 2011)</a> </li>
<ul>
<li>Did not install because it requires the old version of blast (blastall)</li>
</ul>
<li><a href="https://f1000research.com/articles/2-265/v1">PanFunPro (Lukjancenko, 2013)</a></li>
<ul>
<li>Still in the development stage. I had a quick look at the source code and there were some things didn't make sense. I'll give the developers a little more time to work out the kinks. </li>
<li>The installation can take some time because of some large dependencies (eg. InterProScan). Furthermore, the installation for the PanFunPro Perl scripts could be streamlined using a tool like Module::Build. However, the installation instructions for it's dependencies are well written making installation remarkable easy. </li>
</ul>
<li><a href="http://nar.oxfordjournals.org/content/40/22/e172.long">PanOCT (Fouts, 2012)</a></li>
<ul>
<li>Primarily an algorithm for determining homology between a set of genes from 2 or more eukaryote genomes. </li>
<li>Considers conservation of neighboring genomic regions for determining homology. The basic idea is that two genes are truly homologous (as opposed to paralogous) will be situated in the same genomic location.</li>
</ul>
</ul>
<div>
<br /></div>
<div>
<ul style="text-align: left;">
</ul>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-531719286631597302014-02-12T19:23:00.000-08:002014-02-13T07:51:42.773-08:00A Brief Introduction to Sequence Assembly<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Assembly Background</b><br />
<div>
Sequence assembly is one of the overarching challenges in bioinformatics. To understand the assembly problem it helps to understand some basics of DNA sequencing. Consider a bacterium having a genome comprised of a single 5 megabase (5 million base pairs) chromosome. Ideally, sequencing machines would start at the beginning of the chromosome and read each of the 5 million base pairs until arriving at the end. Unfortunately, the current technology is limited to reading sequences between 30 and ~10,000+ bases. The assembly problem is to take these short segments of DNA called reads and overlap them in such a way to recreate the original 5Mb chromosome.</div>
<div>
<br /></div>
<div>
To illustrate this consider the set of character strings below that come from a quote by Theodore Roosevelt (spaces have been replaced with "_" for clarity). Can you put the pieces together to find out what it says? <br />
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs7RZu_GNG3bMZPdNnctKmQwiNZmXDzav56UYFkhCQm8A15E_8xaf8dPxvBkt6LC5iYo54UsZoVMRlDt9mgx9A1g4xdoFJX8cerGtzDLvwn3H53AWEzM7lizzDiTxQ2XJQCR2-p0jQA6gr/s1600/assembly_example_word_bank.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs7RZu_GNG3bMZPdNnctKmQwiNZmXDzav56UYFkhCQm8A15E_8xaf8dPxvBkt6LC5iYo54UsZoVMRlDt9mgx9A1g4xdoFJX8cerGtzDLvwn3H53AWEzM7lizzDiTxQ2XJQCR2-p0jQA6gr/s1600/assembly_example_word_bank.jpg" /></a></div>
<div>
<br />
You should end up with something that looks like this:<br />
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGKcykZYe-EowFeIer7-ombtZLi5iOTP0NfyMVTFbaz0lj4ESmdf37r3OcP9PdsIzZvZhRxJw1un4LsCX6TJh5vqPxfugUpKQHRmhEp3lW1QNPajPNRr48VUJaVjbX2G1OyMIS4jwnRLJb/s1600/assembly_example_answer.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGKcykZYe-EowFeIer7-ombtZLi5iOTP0NfyMVTFbaz0lj4ESmdf37r3OcP9PdsIzZvZhRxJw1un4LsCX6TJh5vqPxfugUpKQHRmhEp3lW1QNPajPNRr48VUJaVjbX2G1OyMIS4jwnRLJb/s320/assembly_example_answer.jpg" height="169" width="320" /></a></div>
<div>
<br /></div>
<div>
This is more or less what assembly programs attempt to do with DNA. Some things to notice in the above example:</div>
<div>
<ul>
<li>Repeats can be problematic during assembly. Notice that the word "you" is used twice in this sentence. Looking at the two character strings "Believe_yo" and "you're_ha" you may have incorrectly merged them to form the character string "Believe_you're_ha" which is an incorrect assembly. DNA repeats are common in genomes and can fragment assemblies or cause assembly mistakes.</li>
<li>Longer reads can help with the repeat problem. For example, given only two long character strings "Believe_you_can_and_you're" and "can_and_you're_halfway_there" it is much easier to unambiguously assembly the quotation despite the fact that the word "you" is used twice.</li>
<li>Sequencing errors complicate assembly. For example, if the character string "halfway" was sequenced as "calfway" there would be no way to finish the assembly correctly because "you're_ha" does not overlap with "calfway." </li>
<li>Coverage (i.e. the number of times a character is represented in the assembly) helps distinguish sequencing errors. For example, if we sample from the quote (or in the genome in the case of DNA) many times and see the character string "halfway" 100 times and the character string "calfway" only once we can assume that "calfway" was incorrectly sequenced. </li>
</ul>
<div>
<b><br /></b></div>
<div>
<b>De Novo vs Mapping Assembly</b></div>
<div>
De novo is a latin expression meaning "from the beginning" (<a href="http://en.wikipedia.org/wiki/De_novo">Wikipedia</a>). De novo sequence assemblies are build with no external information beyond the raw sequencing reads. First pass de novo assembles are called "draft" assemblies because the genome remains fragmented (i.e. discontinuous) and may contain assembly errors. Extensive resequencing and curation is generally required to "complete" or "finish" a genome assembly. <br />
<br />
Alternatively, mapping assembly uses a reference sequence as an anchor to orient sequenced reads. After reads are ordered based on their location in the reference sequence a consensus sequence is generated from all the mapped reads. The consensus sequence can differ from the reference sequence but differences are generally single base differences scattered throughout the genome. Mapping assembly is most useful when the reference sequence and sequenced organism are closely related.<br />
<br />
In some cases (e.g. metagenomics), a combination of de novo and mapping assemblies may be advantageous. However, hybrid assembly algorithms and protocols are not well explored.</div>
<div>
<br /></div>
<div>
<b>De Novo Assembly Algorithm Classes</b></div>
</div>
<div>
There are two main classes of assembly algorithms used in de novo assembly: overlap-layout-consensus (OLC) and <a href="http://en.wikipedia.org/wiki/De_Bruijn_graph">de bruijn graph</a> (DBG). Similar to the above example, OLC first finds reads with overlapping ends, builds a layout graph based on these overlaps, and lastly generates a consensus sequence as the graph is traversed. OLC was the first assembly method developed and works well with long-read, low-coverage sequencing technologies like <a href="http://en.wikipedia.org/wiki/Sanger_sequencing">Sanger</a> (and possibly <a href="http://www.pacificbiosciences.com/">PacBio</a>).</div>
<div>
<br /></div>
<div>
DBG based assemblers convert the set of reads into a set of k-mers (i.e. short DNA sequences of length k). These k-mers are then used to build a <a href="http://en.wikipedia.org/wiki/De_Bruijn_graph">de bruijn graph</a> from which the genomic sequence is inferred. DBG assemblers work well with high-coverage sequencing methods like Illumina and Ion Torrent. </div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b>Useful Links</b></div>
<div>
<ul style="text-align: left;">
<li>For more information and comparisons between OLC and DBG see <a href="http://bfg.oxfordjournals.org/content/11/1/25.full?sid=49a81f2a-8fe6-4a0b-bf6a-33a83853dd90">this review</a>.</li>
<li><a href="http://genome.cshlp.org/content/12/5/669.full">What is finished and why do we care?</a> (Genome Research 2002).</li>
<li>A <a href="http://www.sciencedirect.com/science/article/pii/S0888754310000492">review of assembly algorithms</a> including: <span style="background-color: white; color: #2e2e2e; font-family: 'Arial Unicode MS', 'Arial Unicode', Arial, 'URW Gothic L', Helvetica, Tahoma, sans-serif; font-size: 13.600000381469727px; line-height: 20px; text-align: justify; word-spacing: -1.0568554401397705px;">SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo (Genomics 2010).</span></li>
<li style="text-align: justify;"><span style="color: #2e2e2e; font-family: Arial Unicode MS, Arial Unicode, Arial, URW Gothic L, Helvetica, Tahoma, sans-serif;"><span style="background-color: white; font-size: 14px; line-height: 20px; word-spacing: -1.0568554401397705px;"><a href="http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html">How to apply de bruijn graphs to genome assembly</a> (Nature 2011).</span></span></li>
</ul>
</div>
<div>
<ul style="text-align: left;">
</ul>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com1tag:blogger.com,1999:blog-5869557521756883740.post-62747942069644179742013-11-15T06:24:00.000-08:002013-11-15T06:24:16.488-08:00New SPAdes Assembler Used on MiSeq Reads for 6 Burkholderia Genomes<div dir="ltr" style="text-align: left;" trbidi="on">
<b><span style="font-size: large;">Introduction</span></b><br />
<br />
This post introduces a practical assembly strategy using the new SPAdes assembler with Illumina MiSeq reads. Utilizing the sequence analysis tools sickle, FLASH, SPAdes, and in-house Perl scripts we assemble 6 <i>Burkholderia</i> genomes to draft status using both paired-end (PE) and mate-pair (MP) reads. This exercise demonstrates the practicality of using the Illumina MiSeq for small scale assembly projects.<br />
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">The SPAdes Assembler</span></b><br />
<b><br /></b>Many popular <i><a href="http://en.wikipedia.org/wiki/Sequence_assembly#De-novo_vs._mapping_assembly" target="_blank">de novo</a></i> assemblers, including SPAdes, rely on a computational data structure called a <a href="http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html" target="_blank">de Bruijn</a> graph. SPAdes uses a multisized de Bruijn graph to balance trade-offs between small and large k-mer sizes. A smaller k causes more repeat regions to collapse into the same node tangling the graph. However, larger k values can fragment the graph in low coverage regions. Multisized de Bruijn graphs allow the size of k to vary based on regional coverage depth. SPAdes also has a unique method for handling PE reads. Typically PE information is incorporated after the initial assembly by mapping PE reads back to assembled contigs. When corresponding pairs map to the ends of different contigs those contigs can be merged into a single scaffold. SPAdes improves on this by incorporating PE distance information directly into the de Bruijn graph. Because of these features SPAdes can be run with any number of single-end (SE), PE, or MP read files.<br />
<br />
<b><span style="font-size: large;">Assembling <i>Burkhoderia</i> Genomes with SPAdes</span></b><br />
<br />
<div>
<i>Burkholderia</i> is a genus of bacteria with strains having a variety of environmental effects ranging from plant growth promoting activity to human and plant pathogenesis. Because our lab is particularly interested in phenotypes of association between bacteria and plants, we sequenced 6 strains of <i>burkholderia</i> isolated from the endophyte compartment (i.e. inner root) of <i>A. thaliana</i> using two runs on our Illumina MiSeq machine. The first run was done using a standard PE library, and the second was done using a MP library.</div>
<div>
<b><br /></b></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1ycP-g3zYZxP9_KoTyzabWxCmLAYL5cResPrE58uD4H0RQVGR45wNmEHwac3yRW_FhllTYgLv68fSKiMeGoYF1hXLKclZElEJVVeXiID-PN350tzICzXsi_RM-LflX4N8yDWnMTWW3C0T/s1600/Slide1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1ycP-g3zYZxP9_KoTyzabWxCmLAYL5cResPrE58uD4H0RQVGR45wNmEHwac3yRW_FhllTYgLv68fSKiMeGoYF1hXLKclZElEJVVeXiID-PN350tzICzXsi_RM-LflX4N8yDWnMTWW3C0T/s400/Slide1.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
</div>
<div>
The pipeline used for assembly is outlined and described below. </div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJpvScaZjy5TchVpPVj_lKJZBU5buRn3PaONrOVlyFflCoKtocaeA4Djb3lgD3lhYTK5TckmGnS05sqgH1vEbT1GqxCzyyi0wNWMfsYKkgCF-TGRU1MdeRLdEqW0ZWAGxpi2HKArWPrQas/s1600/Slide2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJpvScaZjy5TchVpPVj_lKJZBU5buRn3PaONrOVlyFflCoKtocaeA4Djb3lgD3lhYTK5TckmGnS05sqgH1vEbT1GqxCzyyi0wNWMfsYKkgCF-TGRU1MdeRLdEqW0ZWAGxpi2HKArWPrQas/s400/Slide2.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
</div>
<div>
<b><br /></b></div>
<div>
<b>FLASH</b></div>
<div>
FLASH is a software package the <a href="http://scottmyourstone.blogspot.com/2013/10/merging-overlapping-paired-end-reads.html">merges PE reads</a> into a single read. In PE sequencing there is a distribution of DNA fragment lengths from which ends are sequenced (Supp Fig 1). As reads continue to lengthen more PE reads can potentially overlap. The main advantage to merging the overlapping reads is the error correction step. Furthermore, longer reads tend to yield better assemblies because of their ability to span repetitive regions that fragment assemblies. While longer merged reads may span repetitive regions it is still possible that spanned repeats are incorrectly assembled. In terms of draft assembly this is a minor point.</div>
<div>
<br /></div>
<div>
After experimenting with various combinations of parameters I decided on the following FLASH parameters:</div>
<div>
<br /></div>
<div>
-m 25 -r 250 -f 500 -s 100<br />
<br />
<b>Adapter Trimming</b><br />
A preprocessing step is required to trim MP reads at the Illumina junction adapter. For details on how to do this and why it is necessary see <a href="http://scottmyourstone.blogspot.com/2013/11/difference-between-paired-end-and-mate.html" target="_blank">this post</a>. </div>
<div>
<b><br /></b></div>
<div>
<b>Sickle</b></div>
<div>
Graph assemblers can be sensitive to erroneous k-mers produced from sequencing errors which convolute the graph with false nodes and connections. Rigorous quality filtering prior to assembly reduces the number of false k-mers thereby improving assembly. <a href="https://github.com/najoshi/sickle" target="_blank">Sickle</a> is a software package for quality filtering which relies on a sliding window algorithm over the read quality scores. If the average quality of a window drops below a given minimum the remaining bases are trimmed off. Any trimmed or untrimmed reads shorter than a given length are removed. Sickle also produces a file containing high quality reads where the corresponding pair failed the quality check. These reads can be used in downstream analysis as single-end reads. </div>
<div>
<br /></div>
<div>
The parameter setting that I used when running Sickle are:</div>
<div>
<br /></div>
<div>
-t sanger -l 127 -q 20</div>
<div>
<br /></div>
<div>
The '-t' parameter specifies the quality value type. The '-l' parameter specifies the minimum read length and '-q' is the minimum average window quality score. At this stage it is important to consider the '-l' parameter if running SPAdes is your next step. The SPAdes parameter '-k' takes a list of k-mers from which to build a multi-sized graph. For long Illumina reads the SPAdes manual advises '-k' be set to '21,33,55,77,99,127.' Because the longest k-mer is 127bp reads in you dataset should be at least that long. To allow for shorter reads when running Sickle simply lower the '-l' parameter and ensure the largest '-k' value is shorter than or equal to that value. Because many of the MP reads are already short I ran sickle with the -l parameter set to 75 and the max SPAdes -k parameter set to 75.<br />
<br />
I should note that SPAdes has a built-in correction algorithm--<a href="http://bix.ucsd.edu/projects/hammer/" target="_blank">hammer</a>. Hammer looks at k-mer frequency to identify potential false k-mers. However, I recommend running Sickle first for 2 reasons:<br />
<ol style="text-align: left;">
<li>Sickle is a much faster correction algorithm than hammer. Hence, limiting the number of false k-mers that hammer must identify speeds up SPAdes.</li>
<li>Sickle uses quality values. While quality values are not perfect, they do correlate with the correctness of bases. If reads have any low-quality k-mers it is possible by chance that some of these k-mers will be the same causing hammer to classify them as a true k-mer. However, it is likely that one error inside a high quality k-mer will not be trimmed by Sickle, but will be identified by hammer. This intuition says that a combination of Sickle and hammer is the best regime for quality filtering.</li>
</ol>
</div>
<div>
<b>SPAdes</b></div>
<div>
I ran the SPAdes assembler twice for each genome--first with both PE and MP reads and then with only the PE reads. The second run shows that improvements were made when MP reads are included. The following command was used to run SPAdes with both PE and MP reads:<br />
<br />
<i>spades.py -k 21,33,55,75 -t 5 --careful -s merged_PE_reads.fastq -1 fwd_PE_sickled_reads.fastq -2 rev_PE_sickled_reads.fastq -s PE_sickled_single_reads.fastq --mp1-1 fwd_MP_sickled_reads.fastq --mp1-2 rev_MP_sickled_reads.fastq --mp1-s singles_MP_sickled_reads.fastq --mp1-rf -o spades_out</i><br />
<br />
Using only PE reads (i.e. the merged reads, fwd high quality reads, reverse high quality reads, and single high quality reads) SPAdes was run using the following command:<br />
<br />
<i>spades.py -k 21,33,55,77,99,127 -t 5 --careful -s out.extendedFrags.fastq -1 out.notCombined_1_qc.fastq -2 out.notCombined_2_qc.fastq -s out.notCombined_singles_qc.fastq -o spades_out</i><br />
<b><br /></b></div>
<div>
<b>QUAST</b><br />
When the assembly is complete <a href="http://bioinf.spbau.ru/QUAST" target="_blank">QUAST</a> can be used to compare and view assembly results. For examples see the Results section of this post.<br />
<br />
<br />
<b><span style="font-size: large;">Results</span></b><br />
<br />
<b>Assembly</b><br />
All 6 genomes assembled very well when both PE and MP reads were used.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxAlYHYXC5bXtcOSalkmlopUHnd5d0K4Wibbjg_Srj5HxBFIK9dcUFKVDKKNJIUxxYSfxexUb2zk0H3rhAT5Azk-wfq4pPeqWkJKVNmDp2dIvw8S-ao1jwhPmd6WXl3aEfHAnLXULJZjxh/s1600/MP_PE_assembly.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxAlYHYXC5bXtcOSalkmlopUHnd5d0K4Wibbjg_Srj5HxBFIK9dcUFKVDKKNJIUxxYSfxexUb2zk0H3rhAT5Azk-wfq4pPeqWkJKVNmDp2dIvw8S-ao1jwhPmd6WXl3aEfHAnLXULJZjxh/s1600/MP_PE_assembly.png" /></a></div>
<b><br /></b>
<b>Does merging PE reads help?</b><br />
<div>
To illustrate how merging reads prior to assembly can improve assembly, I assembled one <i>Burkholderia</i> genome using three different input sets:</div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: left; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<ol style="font-weight: normal; text-align: left;">
<li>Merged and unmerged reads</li>
<li>Raw PE reads with no merging</li>
<li>Only reads that merge</li>
</ol>
<div>
<div style="font-weight: normal; margin: 0px;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLYiRoOkTFt4aLlxA9MNKTj6HNHzS_mg0WWYWPapZvvj-YRsbovoknbYns8QrY-V3f8hIRuBQk94dMQ0qNcv7RAiniHQrxk8nrdlWqmQhbNu2XtyhflR88YAokJJfP4ylBCefLyK1bg07I/s1600/Merge_test.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLYiRoOkTFt4aLlxA9MNKTj6HNHzS_mg0WWYWPapZvvj-YRsbovoknbYns8QrY-V3f8hIRuBQk94dMQ0qNcv7RAiniHQrxk8nrdlWqmQhbNu2XtyhflR88YAokJJfP4ylBCefLyK1bg07I/s640/Merge_test.png" width="640" /></a></div>
</div>
<div style="font-weight: normal; margin: 0px;">
<br /></div>
<div style="font-weight: normal; margin: 0px;">
Under the assumption that improved assembly statistics (e.g. number of contigs, N50, etc.) indicate a better assembly, a combination of merged and unmerged reads (option 1) yields the best assembly.</div>
<div style="font-weight: normal; margin: 0px;">
<br /></div>
<div style="margin: 0px;">
<b>Does adding MP reads help?</b><br />
When assembling with only PE reads the resulting assemblies are worse than when MP reads are included.</div>
</div>
</div>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieZ06bQY_3JRxsgBPuRV-gG2eptZpDuggPYeCs03qK77IeZN2l2sJ_6HpSm1WZXRrRhjZc4dAvE7rb1_k-s2QLXKZ9nqS_hZKYCohdc4AYeNSegMgJkU5UsAUAjgHcdmLm7CFPlgTXV0Ei/s1600/PE_only_assembly.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieZ06bQY_3JRxsgBPuRV-gG2eptZpDuggPYeCs03qK77IeZN2l2sJ_6HpSm1WZXRrRhjZc4dAvE7rb1_k-s2QLXKZ9nqS_hZKYCohdc4AYeNSegMgJkU5UsAUAjgHcdmLm7CFPlgTXV0Ei/s1600/PE_only_assembly.png" /></a></div>
<br />
To test if the assembly improvements were primarily a factor of increased coverage or information incorporated in MP reads, I reassembled the PE+MP reads for B1/BMP1 treating the MP reads as SE reads. Despite MP reads substantial increase in coverage (52 to 103), when treated as SE reads only minor assembly improvements are observed. This suggests that distance information provided by MP reads can have a greater effect on improving assemblies than additional coverage.<br />
<b><br /></b>
<b><br /></b>
<b><span style="font-size: large;">Conclusions</span></b><br />
<b><br /></b>
Using a combination of FLASH, Sickle, SPAdes, and in-house Perl scripts we assemble 6 <i>Burkholderia</i> genomes to draft status. This exercise demonstrates the practicality of the Illumina MiSeq for small scale sequencing projects consisting of a handful of bacterial genomes. A single PE MiSeq run may be sufficient to assembly a small number of bacterial genomes, however the additional coverage and pairing information of MP reads can improve PE assemblies. Merging overlapping PE reads and quality trimming before assembly can increase N50 and other assembly statistics.<br />
<br />
<br />
<b><span style="font-size: large;">Supplemental Figures</span></b><br />
Figure 1 -- PE insert size distribution for one sample (B1)<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipCgeVVdqyP9mK-z5DXAA4qGEbp9ECO9QbdsM2Dw7rHI7G0usZXhejG6qWa5MpbuceRncPmTZiMc3ANGJiwg0H8mGaWlB0cmJcgZFIpqoJha5QC4kMQb23w98aeaRjjrQVx1-TlmO6a3Cx/s1600/Slide6.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="235" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipCgeVVdqyP9mK-z5DXAA4qGEbp9ECO9QbdsM2Dw7rHI7G0usZXhejG6qWa5MpbuceRncPmTZiMc3ANGJiwg0H8mGaWlB0cmJcgZFIpqoJha5QC4kMQb23w98aeaRjjrQVx1-TlmO6a3Cx/s320/Slide6.jpg" width="320" /></a></div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com4tag:blogger.com,1999:blog-5869557521756883740.post-4989977424843260172013-11-02T07:24:00.000-07:002015-05-07T07:27:30.819-07:00Difference Between Paired-End and Mate-Pair Reads<div dir="ltr" style="text-align: left;" trbidi="on">
In DNA sequencing lingo the words "paired-end" (PE) and "mate-pair" (MP) are frequently used interchangeably. While the underlying principles between PE and MP reads have strong similarities, there are inherent differences that are crucial to understand. <br />
<br />
The similarities between PE and MP reads include:<br />
<ul style="text-align: left;">
<li>Reads come in pairs</li>
<li>Pairs come from the ends of the same DNA strand</li>
</ul>
<div>
<br /></div>
<div>
The differences between PE and MP reads include:</div>
<div>
<ul style="text-align: left;">
<li>Library preparation protocols -- In short, PE protocols attach an adapter, SP1, to the fwd end and another adapter, SP2, to the reverse end. The first sequencing step is started by targeting SP1 to generate the forward read. The second sequencing step targets SP2 to generate the reverse read. For MP protocols longer DNA sequences are circularized using biotinylated adapters. During the circularization process the DNA strand ends are connected with the biotinylated adapter between them. Circularized DNA are sheared and the biotinylated adapters connecting stand ends are pulled down. These reads can then be sequenced using the same SP1-SP2 adapter protocols used in PE sequencing. </li>
<li>Insert size -- The insert size refers to the distance between the pairs. PE reads generally have a smaller insert size (< 1kp) than MP (2-5 kb). The difference in insert size stems from the difference in protocols. Depending on the length of your reads it is possible for PE reads to have <a href="http://scottmyourstone.blogspot.com/2013/10/merging-overlapping-paired-end-reads.html" target="_blank">overlapping ends</a>.</li>
<li>Read orientation -- PE reads come in forward-reverse (FR) orientation where read 1 is the forward read and read 2 is the reverse read. Because of the circularization step MP reads com in reverse-forward (RF) orientation where read 1 is the reverse read and read 2 is the forward read. These differences are especially important to understand for assembly algorithms and projects.</li>
<li>Read Trimming -- Theoretically, PE reads require no trimming before sequence analysis. However, in practice it is recommended that low quality portions of the read be trimmed using tools like <a href="https://github.com/najoshi/sickle" target="_blank">Sickle</a>. Alternatively, MP reads <b>require</b> trimming because biotinylated adapters are often present in the middle of one or both MP reads. Adapter trimming software generally remove adapters and any sequence beyond the adapter. Software options for adapter trimming include <a href="https://code.google.com/p/cutadapt/" target="_blank">cutadapt</a>, <a href="http://www.usadellab.org/cms/?page=trimmomatic" target="_blank">Trimmomatic</a>, and <a href="https://code.google.com/p/ea-utils/wiki/FastqMcf" target="_blank">FastqMcf</a>. For more reading on adapter trimming see this <a href="http://onetipperday.blogspot.com/2012/08/three-ways-to-trim-adaptorprimer.html" target="_blank">post</a> and this <a href="http://bioscholar.com/genomics/tools-remove-adapter-sequences-next-generation-sequencing-data/" target="_blank">post</a>.</li>
</ul>
</div>
<div>
<br /></div>
<div>
For more details on the sequencing protocols see the Illumina documentation for <a href="http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn" target="_blank">PE</a> and <a href="http://res.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf" target="_blank">MP</a> sequencing.</div>
<div>
<br /></div>
<div>
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com10tag:blogger.com,1999:blog-5869557521756883740.post-54191404734659701492013-10-22T09:33:00.000-07:002014-10-08T15:45:50.738-07:00Read simulators review with an emphasis on metagenomics<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
<h2 style="text-align: left;">
<b>Why Read Simulation?</b></h2>
Simulations are an important aspect of bioinformatics that can be used for testing and benchmarking algorithms, optimizing parameters, and generating optimal study design. The following are examples of how read simulations have been successfully utilized in each of these instances.<br />
<br />
<h4 style="text-align: left;">
<b>Testing Algorithms</b></h4>
After writing a new algorithm how do we ensure that it works? The most effective way is to run the algorithm on a set of data where the answers are known. This is a perfect application for simulated data. <a href="http://www.biomedcentral.com/1471-2164/10/520/" target="_blank">Katharina Hoff</a> use DNA sequence simulations to test the effects of sequencing error from several sequencing technologies on specialized metagenomic gene prediction algorithms. She concludes that gene prediction accuracy is poor in the presence of sequencing errors using these algorithms. Furthermore, gene prediction algorithms not specialized for metagenomic data perform as well or better than their specialize counterparts. This suggests that metagenomic gene predictors could be improved by being more robust to sequence error.<br />
<br />
<h4 style="text-align: left;">
<b>Benchmarking Algorithms</b></h4>
Benchmarking is used to compare accuracy, precision, technical requirements, and other attributes of algorithms. Simulated reads provide a common benchmark for comparing assembly algorithms in the Assemblathon competition [<a href="http://genome.cshlp.org/content/early/2011/09/16/gr.126599.111.abstract" target="_blank">Earl, et al.</a>], and have also been used to benchmark read mapping algorithms such as Bowtie, Soap, and Pass [<a href="http://bib.oxfordjournals.org/content/11/2/181.full" target="_blank">Horner</a>].<br />
<br />
<h4 style="text-align: left;">
<b>Optimizing Parameters</b></h4>
<div style="text-align: left;">
Researchers often ignore the detailed parameters of algorithms and programs. This can frequently cause problems by violating assumptions made in the software development process. Furthermore, the effects of parameters on results are largely unknown yet can significantly effect results and conclusions. To better understand the effects of parameters used by the algorithm FLASH, <a href="http://bioinformatics.oxfordjournals.org/content/early/2011/09/07/bioinformatics.btr507.full.pdf" target="_blank">Magoc et al.</a> use simulated reads to build <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic" target="_blank">ROC</a> curves illustrating the trade-offs between correctly and incorrectly merged paired-end reads using<span style="color: red;"> </span>different values for the mismatch and minimum overlap parameters (Figures 5 and 6). <br />
<br /></div>
<h4 style="text-align: left;">
<b>Optimizing Study Design</b></h4>
Prior to sequencing, it is common to ask questions like, "how much coverage do we need," "what read length should we sequence at," "what sequencing platform is best for our project," "should we use paired-end (PE) or single-end (SE) reads," etc. These and other similar questions can be answered at least in part by sequence simulation. For example, the <a href="http://www.nature.com/nature/journal/v467/n7319/full/nature09534.html" target="_blank">1000 Genomes Project Consortium</a> used <a href="http://bioinformatics.oxfordjournals.org/content/28/4/593.full" target="_blank">ART</a> to test the effects of read length and PE insert size on a reads ability to map to the human genome. They conclude that longer reads substantial increase mappability especially for SE reads. Furthermore, increasing insert size also marginally improves mappability.<br />
<b><br /></b>
Furthermore, <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0031386" target="_blank">Mende et al.</a> use sequence simulations to test several aspects of study design for whole metagenome sequencing. They conclude that quality control measures such as quality filtering and quality trimming have a substantial impact on assembly by improving accuracy and extending contig lengths. They also evaluate assembly quality on Sanger, pyrosequencing, and Illumina platforms with communities of low (10 genomes), medium (100 genomes), and high (400 genomes) complexity. For the low complexity community all platforms were more or less equal in terms of assembly and accurately represented the functional aspects of the community. For the medium complexity community Illumina produced the best assembly and most accurately represented the functional elements. With the high complexity community none of the platforms were particularly good, however, because of the longer reads, Sanger was still able to represent much of the functional composition. <b><br /></b><br />
<br />
<h2 style="text-align: left;">
<b>Things to look for in a simulator?</b></h2>
<ul>
<li>Good/appropriate error model</li>
<li>Models read lengths</li>
<li>Models coverage bias</li>
<li>Includes quality values</li>
<li>Single-end and Paired-end reads</li>
<li>Multiple sequencing platform capabilities (e.g. Illumna, 454, etc)</li>
<li>Easy to install</li>
<li>Easy to use</li>
<li>Good documentation</li>
</ul>
<br />
<ul>
</ul>
<div>
<h2 style="text-align: left;">
<b>What are the simulators our there for...</b></h2>
</div>
<div>
<h4 style="text-align: left;">
<b>Genomic DNA sequences</b></h4>
<b><br /></b>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig_HmxdTogmdcoiNART4zgUBnNgdNls8XAXTj5bPE7EVyIJicKZtDodH6a_2QoQdZJikP3oaSohcu0RKvQMu4iw5nUa1kegBGdE6St5JBTiGAWDGBiT2cWHSw4YUWFZEfNmbTicKpvLtLd/s1600/Slide1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig_HmxdTogmdcoiNART4zgUBnNgdNls8XAXTj5bPE7EVyIJicKZtDodH6a_2QoQdZJikP3oaSohcu0RKvQMu4iw5nUa1kegBGdE6St5JBTiGAWDGBiT2cWHSw4YUWFZEfNmbTicKpvLtLd/s400/Slide1.jpg" height="251" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
<div>
<ul>
<li><b>wgsim</b> relies on a simple uniform substitution error model. The uniform error model does not reflect error patterns as accurately as the models and algorithms described below. Another major weakness of wgsim its inability to generate INDEL errors. Wgsim seems like the most basic read simulator.</li>
<li><b><a href="http://www.ebi.ac.uk/goldman-srv/simNGS/" target="_blank">simNGS</a></b> models the processes in an Illumina sequencing run such as cluster generation and cluster intensity. The model must be input as the "runfile" where each line contains parameters for each cycle. The number of cycles contained in the "runfile" corresponds to the number of bases simNGS can model.</li>
<li><a href="http://maq.sourceforge.net/" style="font-weight: bold;" target="_blank">MAQ</a> is primarily an assembly algorithm (hence the large number of citations). It also includes a uniform reference sequence mutation algorithm. To simulate reads it uses a first-order markov chain to model quality at each cycle. Using that model it generates quality values and then bases based on those quality values. Documentation is lacking, so I'm not entirely sure all the values in the table are correct.</li>
<li><a href="http://www.biomedcentral.com/1471-2164/13/74" style="font-weight: bold;" target="_blank">GemSIM</a> takes as input a SAM file and FASTQ file for empirical error model generation. This is advantageous because it can easily be extended to new sequencing technologies or upgrades as they are released. It can also simulate metagenome reads based on given abundance proportions. </li>
<li><a href="http://bioinformatics.oxfordjournals.org/content/28/4/593.full" style="font-weight: bold;" target="_blank">ART</a> uses quality score data from real Illumina reads to model substitution rates. Quality scores are generally very informative for Illumina reads, but may not be perfect. They map real Illumina reads to a reference to model INDEL rates. I would prefer they do that with substitutions as well. Currently the longest Illumina read length model can generate 75bp reads. This is becoming less useful as read lengths continue to grow. I am less familiar with 454 and Sanger reads so I won't comment on those models. </li>
<li><a href="http://publications.mi.fu-berlin.de/962/2/mason201009.pdf" style="font-weight: bold;" target="_blank">Mason</a> was build using models empirically derived from fly and yeast reads. According to the technical report Mason can model 100bp Illumina reads from the GAII, however it appears that it may be possible to build a more up-to-date model. Mason uses previously published models for both 454 and Sanger reads. Mason was developed in C++ and is easily extendable making it useful for incorporating into your personal code.</li>
<li><b><a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2935434/" target="_blank">FlowSim</a></b> is specifically for 454 Pyrosequences. FlowSim builds an empirical model derived using quality trimmed <i>E. coli</i> K-12 and <i>D. Labrax</i> reads. Because the 454 platform interprets reads first as flow grams, FlowSim models these flow grams, specifically the homopolymer distributions. During the simulation process, flow grams are generated and are subsequently interpreted into base calls and corresponding quality scores. The number of cycles indicates roughly the number of bases in each read.</li>
<li><b><a href="https://github.com/jstjohn/SimSeq" target="_blank">SimSeq</a></b> has limited documentation! One of the useful attributes of SimSeq is its model for mate-pair chimeras.</li>
<li><a href="http://bioinformatics.oxfordjournals.org/content/early/2012/04/12/bioinformatics.bts187.full.pdf" style="font-weight: bold;" target="_blank">pIRS</a> builds an error model based on SOAP2 mapping results or a given SAM file. It uses a combination of two matrices to generate bases and quality values. These matrices are based of empirical training data and a first-order markov chain (similar to MAQ) of quality scores. This seems like a very good model for both bases and quality values. One of the defining features of pIRS is its simulation of coverage bias across a genome based on %GC content. It also includes an option for mutating the given genome to fabricate reference sequence heterogeneity.</li>
</ul>
</div>
<div>
<h4 style="text-align: left;">
<b>Metagenomic Sequences</b></h4>
<b><br /></b>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5Ha8Cv0w6Ga1gjMXIEfkusJ2dycPJgizKIOv5yQc5tJh5p_HsOAsB-j8AaiQBsjeVhd9-gipNbhCbs1eh3Ltx4Zz_xYl_dlTWQyluhWDY8u5LAed5g8YK3h842oVeYzhND6H5j23RdPUo/s1600/Slide2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5Ha8Cv0w6Ga1gjMXIEfkusJ2dycPJgizKIOv5yQc5tJh5p_HsOAsB-j8AaiQBsjeVhd9-gipNbhCbs1eh3Ltx4Zz_xYl_dlTWQyluhWDY8u5LAed5g8YK3h842oVeYzhND6H5j23RdPUo/s400/Slide2.jpg" height="168" width="400" /></a></div>
</div>
<div>
<ul>
<li><b><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0075448#close" target="_blank">NeSSM</a> </b>can simulate coverage bias but relies on the mapping distributions from real metagenome sequences to the reference database. This is less useful approach because it cannot be applied to model coverage bias of previously unsequenced metagenomes. A very positive feature of NeSSM is its GPU capabilities. GPU processing is much faster which facilitates simulating large scale sequencing runs from platforms like the Illumina HiSeq. The download link in the paper was not working, so I was unable to download and test this software.</li>
<li><b><a href="http://nar.oxfordjournals.org/content/early/2012/03/19/nar.gks251.abstract" target="_blank">Grinder</a></b> -- Can also simulate PCR dependent amplicon sequences for any amplicon based on given primers or a database of reference amplicons suca! For amplicon simulations it additionally produces chimeric reads and gene copy number biases. Grinder allows users to input abundance profiles, model alpha and beta diversity. Quality scores in Grinder are assigned a single "high" quality value for all correct bases and a single "low" quality value for bases designated as errors. Grinder includes a graphical user interface (GUI) and command line interface (CLI). Grinder can also be run on the web through the <a href="https://usegalaxy.org/" target="_blank">Galaxy</a> interface.</li>
<li><b><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0003373" target="_blank">MetaSim</a></b> -- Stores user defined genomes in a database. MetaSim can simulate reads from any of the genomes in the database using various empirically defined models or a user defined custom model. Abundance measures for each genome allow users to simulate communities of variable abundance. To simulate heterogeneity in the community, MetaSim can mutate the selected reference genomes based on a phylogenetic tree input describing the degree of mutation. MetaSim also includes a GUI, however more advanced options can be accessed via the CLI. One of the short-comings of MetaSim is the lack of simulated quality values. As an increasing number of algorithms rely on these values this highlights an essential feature that is missing.</li>
<li><b><a href="https://www.google.com/search?q=metagenomics+read+simulation&oq=metagenomics+read&aqs=chrome.0.69i59j69i57j0l3.3220j0j7&sourceid=chrome&espv=210&es_sm=119&ie=UTF-8#" target="_blank">Bear</a></b> is still under development. </li>
<li><b>GemSIM</b> see above.</li>
</ul>
<div>
As a disclaimer I should say that I have only tried a few of these simulators. There may be special features that I have missed.<br />
<br /></div>
</div>
<h2 style="text-align: left;">
<b>Discussion</b></h2>
<div>
Simulations are an important and under utilized method for testing algorithms and planning experiments. It can be tempting to haphazardly start sequencing or push your data through a collaborators favorite algorithm using the default parameters. These temptations can lead to failed experiments, unusual results, or a misunderstanding of data. A more systematic approach would be to build informative simulations to anticipate and address problematic algorithm assumptions and experimental design flaws. This approach requires a large up-front cost, however it can save time and money in the long run. </div>
<div>
<br /></div>
<div>
As a cautionary note, simulations are not biology. Sequence simulations are an imperfect image of a very complicated world. Conclusions drawn from simulation experiments must be taken with a grain of salt. In some cases it is vital to design biological experiments to validate such conclusions. The utility of sequence simulations is to perform a cheap, first-pass test of algorithms and experimental designs. </div>
<div>
<br /></div>
<div>
While sequence simulators are not expected to perfectly represent real sequences, the closer they can get to that point the better. Most simulators carefully model quality of bases, but there are only a couple that model biases imposed by sequencing platforms and sample preparation protocols such as GC content bias. This is one area where sequence simulators have room to improve. However, building such models is challenging due to the large number of variables that contribute to such biases. For example, there are hundreds of available sample prep protocols for hundreds of biological techniques. Building a universal model to account for biases from each of these protocols is an enormous task especially in a world where protocols can have a life expectance rate of only a few months. Furthermore, unique biases can be introduced by the set of DNA itself. Sequencing human DNA will produce different biases with different levels of effect than sequencing a single bacterial strain. </div>
<div>
<br /></div>
<div>
In conclusion, sequence simulators can be extremely useful for testing and benchmarking algorithms, optimizing parameters, and designing experiments. There are a number of simulators available each with its own strengths and weaknesses. Carfull consideration should be used when picking a simulator and interpreting simulation experiments. <br />
<br /></div>
<h2 style="text-align: left;">
<b>Other interesting reads</b></h2>
<ul style="text-align: left;">
<li><a href="http://www.springerreference.com/docs/html/chapterdbid/331359.html" target="_blank">Lessons learned from simulated metagenomics datasets</a></li>
<li><a href="http://www.ncbi.nlm.nih.gov/pubmed/20377454" target="_blank">Zagordi</a>, et al. Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction. 2012.</li>
<li><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0019984" target="_blank">Pignatelli</a>, et al. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. 2011.</li>
<li><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0010209" target="_blank">Morgan</a>, et al. Metagenomic Sequencing of an <i>In Vitro</i>-Simulated Microbial Community. 2010.</li>
<li><a href="http://www.nature.com/nmeth/journal/v4/n6/abs/nmeth1043.html" target="_blank">Mavromatis</a>, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. 2007.</li>
</ul>
<div>
<b><br /></b></div>
<div>
<b><span style="font-size: large;">UPDATE - Oct 8, 2014</span></b></div>
<div>
<ul style="text-align: left;">
<li>The new <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168713/?report=classic">BEAR</a> simulator seems like a good option as well.</li>
</ul>
</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com2tag:blogger.com,1999:blog-5869557521756883740.post-78271791322694675532013-10-09T13:02:00.001-07:002013-10-09T19:23:28.945-07:00Merging Overlapping Paired-End Reads<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
<b><u>Paired-End Sequencing Overview</u></b><br />
<br />
Amplicon sequencing is a technique in which PCR primers are used to amplify a small portion of a genome. For example, if a gene of interest is preceded by a unique DNA sequence, S<span style="font-size: xx-small;">1</span>, and followed by another unique DNA sequence, S<span style="font-size: xx-small;">2</span>, PCR amplification using S<span style="font-size: xx-small;">1</span> and S<span style="font-size: xx-small;">2</span> as primers will exponentially copy the DNA segment flanked by S<span style="font-size: xx-small;">1</span> and S<span style="font-size: xx-small;">2</span> (see this <a href="http://www.dnatube.com/video/7087/Polymerase-chain-reaction-cycle-PCR" target="_blank">video</a>). The resulting DNA can be sequenced using a paired-end (PE) sequencing protocol where DNA molecules in a sample are sequenced from both ends. In some cases amplicons (i.e. the portion of DNA flanked by the two primers) are short enough such that the forward read overlaps the reverse read. It can be advantageous to merge these reads into a single sequence for the following reason:</div>
<div style="text-align: left;">
<br /></div>
<ul style="text-align: left;">
<li>Sequenced reads generally have a higher density of errors near the end of the read. The extra information gained by overlapping the tail end of the forward read with the tail end of the reverse read all for correction of most of these errors.</li>
</ul>
<br />
<ul style="text-align: left;">
</ul>
<div style="text-align: left;">
<b><u>Merging PE Reads with FLASH</u></b></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
While there are several bioinformatics tools and algorithms that would be appropriate for merging overlapping PE reads, I prefer to use the assembly tool <a href="http://bioinformatics.oxfordjournals.org/content/early/2011/09/07/bioinformatics.btr507.full.pdf" target="_blank">FLASH</a>. FLASH is freely available under the <a href="http://gplv3.fsf.org/" target="_blank">GPLv3</a> license and can be downloaded via <a href="https://sourceforge.net/projects/flashpage/?source=navbar" target="_blank">SourceForge</a>. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Reasons to use FLASH include:</div>
<div style="text-align: left;">
<ul style="text-align: left;">
</ul>
<div>
<ul style="text-align: left;">
<li>Easy to download and install</li>
<li>Fast!</li>
<li>Uses quality scores to correct possible sequencing errors in the overlapping region. This is by far the most important reason to use FLASH. Correcting potential sequencing errors can lead to much less noise and more definitive results. </li>
</ul>
<div>
Reasons you might want to find another tool:</div>
</div>
</div>
<div style="text-align: left;">
<ul style="text-align: left;">
<li>FLASH is not able to look for gapped alignments between read pairs. Ungapped alignments are appropriate when using Illumina reads, however they can be problematic for other sequencing platforms such as 454.</li>
</ul>
<div>
FLASH is easy to use if you are familiar with the Unix command line interface. First, navigate to where you installed flash and type the command:</div>
</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
./flash -h</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
This will ensure that FLASH has been installed correctly and prints the FLASH help page to your terminal screen. To run FLASH you will need two FASTQ formatted files representing the forward and reverse reads to merge. These files should have the same number of reads and should be ordered such that the first read in the forward file is paired with the first read in the reverse file. <br />
<br />
Similar to other bioinformatics applications, FLASH has several parameters that can be adjusted to maximize the algorithm's effectiveness. The FLASH help page (./flash -h) give a full explanation of each parameter. To run FLASH using all default parameters type:<br />
<br />
./flash <forward reads file path> <reverse reads file path><br />
<br />
The following are a list of useful parameters that I often use:<br />
<br />
-m, --min-overlap=NUM: The minimum required overlap between the fwd and rev read. Beware if this number is too small spurious overlaps will be formed. I recommend at least and overlap of 10bp. See the FLASH paper for more details.<br />
<br />
-M, --max-overlap: The expected overlap length of the fwd and rev reads. This parameter can also be calculated using the -r, -f, and -s parameters.<br />
<br />
-x, --max-mismatch-density=NUM: The ratio of number of mismatched bases and overlap length. This parameter controls the stringency for which pairs are considered to be mergable. A high ratio of number of mismatches to overlap length will merge more pairs, but some of these pairs may be false merges. Alternatively, a low ratio will merge fewer pairs but ensures that false merges are excluded. <br />
<br />
-r, --read-len=LEN: The average length of the raw reads. This number is used in calculations for the -M parameter (--max-overlap).<br />
<br />
-f, --fragment-len=LEN: The expected amplicon length. <br />
<br />
-s, --fragment-len-stddev=LEN: The expected standard deviation of amplicon lengths. <br />
<br />
-d, --output-directory=DIR: The output directory.<br />
<br />
-o, --output-prefix=PREFIX: The output prefix. For example if "-o my_output" is used all output files will have the prefix "my_output." This parameter can be useful for organizing different FLASH runs.<br />
<br />
Refer to the FLASH help page for explanations about other parameters.<br />
<br />
<br />
<b><u>Quality of Merged PE Reads</u></b><br />
<br />
During the merging process of overlapping PE reads, discrepancies between bases in the overlapping region can be corrected by choosing the base with the highest quality score to represent the consensus base. To illustrate quality improvements from merging PE reads I use a subset of recently published data where we cloned a known 16S gene into a plasmid and sequenced it using our typical 16S metagenome profiling protocol for 2x250 bp reads on the Illumina MiSeq. The primed V4 amplicon is roughly 253bp. After removing extraneous bases added by our protocol (e.g. molecule tags, primers), the PE reads are expected to have an overlapping region of ~183bp. Because the overlapping region is large the lower quality tails of both the forward and reverse read overlap in high quality regions in the corresponding read.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip759QJ15S9z53g2-i_Q3hZmMm-5UpirAVo1-iOd8oAi5GA6q7WDdZqxTKhZnkBP6DHUIt2UpdEDP78bqvN8X9V9aYIN-EhcYccpZjp2AkqPlEXdqDxFGCjLuBuh2vyo_54udRkHrAb4A_/s1600/overlapping_pe_reads.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip759QJ15S9z53g2-i_Q3hZmMm-5UpirAVo1-iOd8oAi5GA6q7WDdZqxTKhZnkBP6DHUIt2UpdEDP78bqvN8X9V9aYIN-EhcYccpZjp2AkqPlEXdqDxFGCjLuBuh2vyo_54udRkHrAb4A_/s400/overlapping_pe_reads.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
The distribution of reads based on the errors per base (EPB) in each read is as follows.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBtQeiteaan7VHuUcCizw1zTKc-MoHvC2bHQASQ3Fg6KDKDagLU2G0oUPyJRL-By6qeJ_mTEIWq6ZeF6OVArEx4XpLhbtwGEP8ZoUOMJCyfkYFlb2Ww9wnoUumY1YpzQ-iLK1n9iM9oAId/s1600/epb_dist.jpg" imageanchor="1"><img alt="" border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBtQeiteaan7VHuUcCizw1zTKc-MoHvC2bHQASQ3Fg6KDKDagLU2G0oUPyJRL-By6qeJ_mTEIWq6ZeF6OVArEx4XpLhbtwGEP8ZoUOMJCyfkYFlb2Ww9wnoUumY1YpzQ-iLK1n9iM9oAId/s400/epb_dist.jpg" title="" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The green line (merged reads) has shifted toward the y-axis indicating that there are more reads with lower EPB compared to forward reads and especially reverse reads. Notice that the y-axis is in a log scale, so while the differences between groups seems small it is actually quite substantial. The mean EPB for all merged reads is 0.001903164 while the forward and reverse reads are 0.003959693 and 0.00590493, respectively. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
One caveat to this investigation is the quality filtering done prior to analysis. Naturally, low quality reads are less likely to merge causing merged reads to have artificially higher quality. To compensate for this effect we rigorously filtered raw reads before calculating EPB. Consequently, the divergence between merged reads and unfiltered fwd/rev raw reads will be more pronounced than is reflected in the figure strengthening the conclusion that merged reads have higher quality than fwd/rev raw reads. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Yet another caveat to consider when designing an overlapping PE sequencing experiment is the expected overlap length. In this study we use a small amplicon which allows the error prone tails to overlap in regions of higher quality in the corresponding read. With some amplicons this will not be possible, however even when low quality regions overlap each other corrections can still be made but at a lower confidence. Also, the V4 amplicon has very little variability in terms of length. Again, this is not the case with all amplicons and can effect merging success. For example, the ITS region is another amplicon used in metagenome profiling, but its length variability is substantially higher than the 16S V4 amplicon. Therefore, reads that fail to overlap in an ITS experiment cannot be assumed to be low quality sequence and should be incorporated in downstream analysis as non-overlapping PE reads.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In conclusion, overlapping PE reads can easily be merged using tools like FLASH. Furthermore, merging these reads yields higher quality merged sequences because sequencing errors in the overlapped portion can be corrected based on read quality scores. Higher quality sequences reduce noise introduced by sequencing error and consequently lead to more robust downstream analysis and conclusions. </div>
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-6685688605403476302013-09-20T09:56:00.000-07:002014-09-27T05:20:23.692-07:00Installing Perl Modules<div dir="ltr" style="text-align: left;" trbidi="on">
One of the great advantages of Perl is the substantial amount of code developed by the Perl community. However, installing and using external code can be a minor annoyance for experienced Perl programmers and a major headache for Perl beginners. For personal future reference and to help ease the pain of this learning curve, I am writing this post to outline the main installation strategies for Perl modules and give some tips for getting things working smoothly.<br />
<br />
There are two main approaches to installing Perl modules--<a href="http://www.cpan.org/">CPAN</a> and manual installation via ExtUtils::MakeMaker or Module::Build. CPAN is a fantastic resource. My preference for installing Perl modules is to use the <a href="http://search.cpan.org/~miyagawa/App-cpanminus-1.7001/bin/cpanm">cpanm</a> script (which automates the CPAN download, build, and install process) coupled with local::lib. I also recently learned that cpanm can take a URL argument to a git or SourceForge developers release. It will also automatically install any dependencies declared in the build script. The command would look something like this:<br />
<br />
cpanm http://sourceforge.net/projects/bioutilsperllib/files/latest/download?source=directory<br />
<br />
<a href="http://search.cpan.org/~haarg/local-lib-2.000014/lib/local/lib.pm">local::lib</a> is an important part of the installation process. Many instructional websites illustrate the installation process with sudo commands. If you are a system administrator or you know that several users will need the Perl module you are installing then it is best to use the sudo command to install into the public set of Perl modules. However, for personal use it is best to install in a local directory. The Perl local::lib module can take care of this very nicely for you. Note that there are other ways to trick CPAN into installing a Perl module in a local directory, but these are not nearly as clean as using local::lib. local::lib is surprising easy to set up by simply following the instructions on the CPAN documentation page <a href="http://search.cpan.org/~ether/local-lib-1.008018/lib/local/lib.pm" target="_blank">here</a>. Once local::lib is setup cpanm will recognize these setting and automatically install into your local Perl module set.<br />
<br />
The second option for installing Perl modules is a manual installation via <a href="http://search.cpan.org/~leont/Module-Build-0.4204/lib/Module/Build.pm">Module::Build</a> or <a href="http://search.cpan.org/~bingos/ExtUtils-MakeMaker-6.86/lib/ExtUtils/MakeMaker.pm">ExtUtils::MakeMaker</a>. If you are a serious Perl programmer I would highly recommend becoming very familiar with one of these two installation modules. I personally like Module::Build, but I admit that I have had limited experience with ExtUtils::MakeMaker. To simply use these for installing Perl modules only a basic understanding of a handful of commands is required. <br />
<br />
First, the module package must be downloaded to the machine where you plan to install. I recommend having a directory where you keep all the module you download. The directory I have designated for this is $HOME/build. Once you have the module downloaded to your build directory you will have to unpackage it with the unzip command (or a comparable command). This will create a directory where you unzipped the module. Navigate into the directory. If you see a file called Build.PL you know the module was built using Module::Build. Use the following commands to build, test and install the module:<br />
<br />
perl Build.PL<br />
./Build<br />
./Build test<br />
./Build install<br />
<br />
If the module fails to install because required dependencies are missing you can try the command:<br />
<br />
./Build installdeps<br />
<br />
If you do not see the Build.PL file then you should see a file named Makefile.PL. This indicates that the module was built using ExtUtils::MakeMaker. To install such a module use the following commands:<br />
<br />
perl Makefile.PL<br />
make<br />
make test<br />
make install<br />
<br />
For details on the wide range of capabilities for both these modules carefully read the documentation pages (<a href="http://search.cpan.org/~leont/Module-Build-0.4007/lib/Module/Build.pm" target="_blank">Module::Build</a>, <a href="http://perldoc.perl.org/ExtUtils/MakeMaker.html" target="_blank">ExtUtils::MakeMaker</a>).<br />
<br />
For more information on installing Perl modules see the following:<br />
<br />
<ul style="text-align: left;">
<li><a href="http://www.cpan.org/modules/INSTALL.html" target="_blank">CPAN instructions</a></li>
<li><a href="http://www.thegeekstuff.com/2008/09/how-to-install-perl-modules-manually-and-using-cpan-command/" target="_blank">The Geek Stuff</a></li>
<li>s<a href="http://stackoverflow.com/questions/540640/how-can-i-install-a-cpan-module-into-a-local-directory" target="_blank">tackoverflow</a></li>
<li><a href="http://www.perlmonks.org/?node_id=128077" target="_blank">PerlMonks</a></li>
</ul>
<br />
As a side note the following command is useful for determining if a module is already installed and where it is located:<br />
<div>
<br />
perldoc -l <module name></div>
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-55992142268884072932013-04-23T18:35:00.002-07:002013-04-23T18:36:12.444-07:00BioUtils vs. BioPerl<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">Perl is a popular language used in bioinformatics. Its popularity is based primarily on its easiness to learn and plethora of open source libraries. BioPerl is an example of one such library. As a bioinformatician, I appreciate BioPerl because it quickly and easily allows me to perform common, menial tasks such as parsing fasta files and manipulating basic DNA sequence objects. However, BioPerl's speed and memory scale poorly and are becoming limiting factors in large-scale sequencing projects. In order to accommodate many diverse data types, the object hierarchy and object attributes of BioPerl have become somewhat convoluted thereby increasing memory and runtime requirements.</span><br />
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
To address memory and runtime limitations of BioPerl, I developed BioUtils, a collection of simple Perl modules for FASTA/Q formated DNA sequences. BioUtils can:</div>
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
<ul>
<li style="list-style-position: outside; list-style-type: circle;">store FASTA/Q sequences (including quality values)</li>
<li style="list-style-position: outside; list-style-type: circle;">Read/write FASTA/Q files</li>
<li style="list-style-position: outside; list-style-type: circle;">Perform simple quality control on FASTA/Q sequences</li>
<li style="list-style-position: outside; list-style-type: circle;">Provide summary info for a set of FASTA/Q sequences</li>
<li style="list-style-position: outside; list-style-type: circle;">Build a consensus sequence from 2 or more FASTQ sequences</li>
</ul>
Because of its simplicity, BioUtils requires much less memory and drastically reduces runtime of the above operations compared to BioPerl. The following figure displays the CPU time for reading and writing FASTQ sequences from files with increasing numbers of sequences.</div>
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUX335vz9AlSyOif-HNZYX_XI1WOB9qVN6m7zioPgoazhZgcaPfqYRr2ZUC05DucdUQXjlubiAbdNevPjE8wyU6oOWxMVNw_N35xTCgilg4w5iXneSz5yoHx-Ki2qJrQhhs-FydW-1HQca/s1600/time_comparison.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUX335vz9AlSyOif-HNZYX_XI1WOB9qVN6m7zioPgoazhZgcaPfqYRr2ZUC05DucdUQXjlubiAbdNevPjE8wyU6oOWxMVNw_N35xTCgilg4w5iXneSz5yoHx-Ki2qJrQhhs-FydW-1HQca/s320/time_comparison.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
Currently, a full Illumina MiSeq run can produce ~6 million paired-end reads (i.e. ~12 million raw reads). Extrapolating from the figure above, the projected runtime of BioPerl for a complete MiSeq dataset would take nearly 2 and a half hours compared to only <span style="color: black;">5 minutes</span> using BioUtils. </div>
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
Because BioUtils is implemented using inside-out objects described in Perl Best Practices (Damian Conway), it is difficult to measure memory usage. However, a peek at the source code will confirm that BioUtils stores less metadata and thus requires less memory than BioPerl.</div>
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
More information and a BioUtils download can be found on sourceforge: <a href="https://sourceforge.net/projects/bioutilsperllib/?source=directory" rel="nofollow" style="color: #8da1ad;">https://sourceforge.net/projects/bioutilsperllib/?source=directory</a></div>
<br />
<div style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0tag:blogger.com,1999:blog-5869557521756883740.post-23813479603136345602013-04-23T18:30:00.000-07:002013-04-23T18:30:05.661-07:00Becoming Googlable<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: white; color: #666154; font-family: Arial, Verdana, sans-serif; font-size: 13px;">This week has been a very "progressive" week in my life with regards to my involvement in social media. I have never subscribed to social media because I have seen how its addictive nature leads to many lost hours with little or no benefit (which I still believe it true for sites like Facebook and Twitter). However, I have begun to see the advantages of being "googlable" and have consequently decided to take the plunge into social media by having a personal website and blog. A growing number of scientific researchers have personal websites to advertise their research and blogs to quickly communicate novel discoveries and ideas. I hope to follow that model by using my personal website to deploy some of my useful bioinformatics tools and by blogging about interesting scientific ideas. Hopefully this experimentation with social media will be productive both to myself and others. </span></div>
Anonymoushttp://www.blogger.com/profile/01081006207025463719noreply@blogger.com0