Genome survey of resistance gene analogs in sugarcane: genomic features and differential expression of the innate immune system from a smut-resistant genotype

New manuscript in BMC Genomics.

Link here!


Resistance genes composing the two-layer immune system of plants are regarded as important markers for breeding pathogen-resistant crops. Many attempts have been made to relate the genomic content of Resistance Gene Analogs (RGAs) in modern sugarcane cultivars to their degrees of resistance to diseases such as smut. However, due to the highly polyploid and heterozygous nature of the sugarcane genome, large-scale RGA prediction is challenging.


We predicted RGAs, searched for their orthologs, and investigated their genomic features within a recently released sugarcane elite cultivar genome, alongside the genomes of sorghum, one sugarcane ancestor (Saccharum spontaneum), and a collection of de novo transcripts generated for six modern cultivars. In addition, transcriptomes from two sugarcane genotypes were obtained to investigate the roles of differentially expressed RGAs (RGADE) in their distinct degrees of resistance to smut. Sugarcane references lack RGAs from the TNL class (Toll-Interleukin receptor (TIR) domain associated with nucleotide-binding site (NBS) and leucine-rich repeat (LRR) domains) and harbor an elevated content of membrane-associated RGAs. Up to 39% of RGAs were organized in clusters, and 40% of those clusters shared synteny. Of the predicted NBS-encoding genes, 79% are located on a few chromosomes. S. spontaneum chromosome 5 harbors most RGADE orthologs responsive to smut in modern sugarcane. Resistant sugarcane had an increased number of differentially expressed RGAs from both the RLK (receptor-like kinase) and RLP (receptor-like protein) classes compared with the smut-susceptible genotype. Tandem duplications have largely contributed to the expansion of both RGA clusters and the predicted clades of RGADEs.


Most smut-responsive RGAs in modern sugarcane potentially originated on chromosome 5 of the ancestral S. spontaneum genotype. Smut-resistant and smut-susceptible genotypes of sugarcane have distinct patterns of RGADE. The TM-LRR family (transmembrane domains followed by LRR) was the most responsive to the early stage of pathogen infection in the resistant genotype, suggesting the relevance of an innate immune system. This work can help outline strategies for further understanding allele and paralog expression of RGAs in sugarcane, and the results should help develop a more applied procedure for selecting resistant sugarcane plants.

Posted in Local Tools | Leave a comment

Making OrthoMCL easier to use

“OrthoMCL is an algorithm for grouping proteins into ortholog groups based on their sequence similarity.”

With more than 3,000 citations, OrthoMCL finds orthologs, co-orthologs, and in-paralogs in protein FASTA files. If all you need is to find in-paralogs in a set of sequences from a single target species, provide a single FASTA file as input to OrthoMCL. To find orthologs and co-orthologs, provide one FASTA file per species.

The OrthoMCL user guide describes thirteen steps, from installing the software dependencies to running the complete algorithm and obtaining the four output files: 1) coorthologs.txt, 2) inparalogs.txt, 3) orthologs.txt, and 4) groups.txt.
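As a rough illustration of what can be done with the last of those files, here is a minimal Python sketch that parses groups.txt and counts how many members each species contributes to each group. It assumes the usual OrthoMCL layout of one group per line (group ID, a colon, then space-separated taxon|protein members). The function names and the example lines (CLU1000, spA, prot1…) are made up for illustration and are not part of OrthoMCL or of the pipeline below.

```python
from collections import Counter


def parse_groups(lines):
    """Map each group ID to a list of (taxon, protein_id) tuples."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # "CLU1000: spA|prot1 spB|prot9" -> "CLU1000", "spA|prot1 spB|prot9"
        group_id, _, members = line.partition(":")
        parsed = []
        for member in members.split():
            taxon, _, protein_id = member.partition("|")
            parsed.append((taxon, protein_id))
        groups[group_id.strip()] = parsed
    return groups


def taxa_per_group(groups):
    """Count, for each group, how many members come from each taxon."""
    return {
        group_id: Counter(taxon for taxon, _ in members)
        for group_id, members in groups.items()
    }


# Hypothetical content; a real run would read the lines of groups.txt instead.
example = [
    "CLU1000: spA|prot1 spA|prot2 spB|prot9",
    "CLU1001: spB|prot3 spB|prot4",
]

groups = parse_groups(example)
counts = taxa_per_group(groups)

# Groups whose members all come from one species (in-paralogs only)
single_species = [gid for gid, c in counts.items() if len(c) == 1]
print(single_species)
```

With a real file, `parse_groups(open("groups.txt"))` would replace the example list.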

In the following pipeline, all software and dependencies (MCL, Perl, CPAN, MySQL…) are set up to provide the proper environment (Ubuntu) for running OrthoMCL v2.0.9.

After cloning the Git repository,

git clone

all you need is to edit and set the variables at the beginning of the file:


mysqlpass="user123" # SET root password
dependenciesinstall="no" # SET yes to install software and dependencies
installorthomcl="no" # SET yes to install MCL software
fastainput="/path/to/fasta/dir/" # SET your input directory with n FASTA files
clusteracro="CLU" # SET an acronym for the groups
blastAVAfile="" # LEAVE empty if you don't have the BLASTp all-vs-all file (will run STEP 7)

Then, to run the pipeline:




NJ trees for multiple FASTA files using the Phangorn R package

This script iterates over the multiple-sequence alignment (MSA) FASTA files in a directory and creates a Neighbor-Joining (NJ) tree for each of them. For this, we will use R and the Phangorn package.

Phangorn is described as a package for phylogenetic analysis in R. It contains methods for estimating phylogenetic trees and networks using Maximum Likelihood, Maximum Parsimony, distance methods, and Hadamard conjugation. It also allows comparing trees and selecting models, and offers visualizations for trees and split networks.

The R function list.files produces a character vector with the names of the files in a directory. Then, inside a loop over those files, two new variables hold the input and output file names, which are subsequently used by three Phangorn commands to generate an NJ tree. Finally, the script writes the generated tree in Newick format.


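library(phangorn)

# List the MSA FASTA files in the input directory
myfiles <- list.files(path = "/path/to/msa/fasta/files/", pattern = NULL, all.files = FALSE,
           full.names = FALSE, recursive = FALSE,
           ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
for (fastafile in myfiles) {
  # Build the input path and the output file name (fastafile.nwk)
  infile <- paste0("/path/to/msa/fasta/files/", fastafile)
  outfile <- paste0(fastafile, ".nwk")
  print(infile)
  # Read the protein alignment
  alignment <- read.phyDat(infile, format = "fasta", type = "AA")
  # Pairwise distance matrix; the original call was lost here, so dist.ml() is an assumed choice
  dm <- dist.ml(alignment)
  treeNJ <- NJ(dm)

  # Write the tree in Newick format
  write.tree(treeNJ, file = outfile)
}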

fasta2NJnewick.R Git link


Brief comparison of NGS platforms

  • Short reads (SBL and SBS types)
    • SBL – Sequencing by ligation type
      • SOLiD [50-75 bp] (80-320 Gb)
      • BGISEQ [50-100 bp] (8-200 Gb)
    • SBS – Sequencing by synthesis type (CRT Cycle Reversible Termination)
      • Illumina [25-300 bp] (540 Mb – 900 Gb)
      • Qiagen [NA] (12 Genes, 1250 mutations)
    • SBS – Sequencing by synthesis type (SNA Single Nucleotide Addition)
      • 454 (discontinued) [400-1000 bp] (35-700 Mb)
      • Ion Torrent [200-400 bp] (30 Mb – 15 Gb)
  • Long reads (SM and SA types)
    • SM – Single Molecule
      • PacBio [8-20 Kb] (500 Mb – 7 Gb)
      • ONT MinION [up to 200 Kb] (1.5 Gb)
    • SA – Synthetic Approaches
      • Illumina Synthetic Long Read Sequencing Platform [~100 Kb] (64-500 Gb)
      • 10x Genomics Emulsion-based System [up to 100 Kb] (64-500 Gb)
Posted in Blog | Leave a comment

Evolutionary history of the cobalamin-independent methionine synthase gene family across the land plants

Plants are successful paleopolyploids. The wide diversity of land plants is driven strongly by their gene duplicates undergoing distinct evolutionary fates after duplication. We used genomic resources from 35 model plant species to unravel the evolutionary fate of gene copies (paralogs) of the cobalamin-independent methionine synthase (metE) gene family across the land plants. To explore genealogical relationships and characterize positive selection as a driving force in the evolution of metE paralogs within a single species, we carried out complementary analyses on genomic data of 32 genotypes of soybean. The size of the metE gene family remained small across the land plants; most of the studied species possessed 1–6 paralogs. Gene products were either cytosolic or chloroplastic; this dual subcellular distribution arose early during the divergence of the land plants and reached all extant lineages. Biased gene loss and gene retention events took place multiple times; recurrent evolution remodeled redundant metE paralogs to recover and maintain the dual subcellular distribution of MetE. Shared whole-genome duplication events gave rise to the metE paralogs of both soybean and Medicago truncatula. In soybean, the ancestral paralog pair GlymaPP2A encoded a cytosolic isoform of MetE, was under strong purifying selection, and retained high levels of expression across eight RNA-seq expression libraries. The daughters GlymaPP1 and GlymaPP2B showed accelerated rates of evolution, accumulated many sites predicted to be under positive selection, and possessed low levels of expression. Our results suggest that the metE paralogs of soybean follow Ohno’s neofunctionalization model of gene duplicate evolution.

Full text:


I Curso de Inverno em Bioinformática UNIFESP

The first Winter School of Bioinformatics of the Institute of Science and Technology of the Federal University of São Paulo (ICT-Unifesp) will be held from July 10 to 12. The speakers are linked to the Biocomputational Project, funded by CAPES, one of the biggest projects for sugarcane breeding in Brazil.


Applied Computational Genomics Course at UU, Aaron Quinlan

Professor Aaron Quinlan (University of Utah), author of BEDtools (link1, link2), has published his “Applied Computational Genomics Course at UU: Spring 2017”.

Highly recommended!
