Making OrthoMCL easier to use

“OrthoMCL is an algorithm for grouping proteins into ortholog groups based on their sequence similarity.”

With more than 3K citations, OrthoMCL elegantly finds orthologs, co-orthologs, and in-paralogs in protein FASTA files. If all you need is to find in-paralogs within a set of sequences from one target species, you only need to provide a single FASTA file as input to OrthoMCL. To find orthologs and co-orthologs, you provide one FASTA file per species.

The OrthoMCL user guide describes thirteen steps, from installing the software dependencies to running the complete algorithm and obtaining the four output files: 1) coorthologs.txt, 2) inparalogs.txt, 3) orthologs.txt, and 4) groups.txt.
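For a quick look at the result, group sizes can be tallied straight from groups.txt. The sketch below is a hypothetical helper, not part of OrthoMCL. It assumes each line has the form `ID: taxon|protein taxon|protein ...`, which is worth checking against your own output first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OrthoMCL).
# Assumption: each line of groups.txt looks like
#   CLU1000: sp1|prot1 sp2|prot7 sp1|prot9

group_sizes() {
    # Print "<group id><TAB><member count>" for each line of the file in $1.
    awk '{
        id = $1
        sub(/:$/, "", id)               # drop the trailing colon from the ID
        printf "%s\t%d\n", id, NF - 1   # every remaining field is one member
    }' "$1"
}
```

Something like `group_sizes groups.txt | sort -k2,2nr | head` would then list the largest groups first.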

In the following pipeline, all the software and dependencies (MCL, Perl, CPAN, MySQL…) are set up to provide the proper environment (Ubuntu) for running OrthoMCL v2.0.9.

After cloning the Git repository,

git clone

all you need is to edit and set the variables at the beginning of the file:


mysqlpass="user123" # SET root password
dependenciesinstall="no" # SET yes to install software and dependencies
installorthomcl="no" # SET yes to install MCL software
fastainput="/path/to/fasta/dir/" # SET your input directory with n FASTA files
clusteracro="CLU" #SET an acronym for the groups
blastAVAfile="" #LEAVE empty if you don't have the BLASTp all-vs-all file (will run STEP 7)
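
Before launching, it can help to confirm that the input directory really holds the FASTA files you expect. This is a minimal sketch of such a check, not part of the pipeline. The function name is hypothetical, and the extensions `.fasta`, `.fa`, and `.faa` are an assumption about how your files are named.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the pipeline).
# Assumption: input files end in .fasta, .fa or .faa.

count_fasta() {
    # Print how many FASTA-like files sit directly inside directory $1.
    local dir="$1"
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.fasta' -o -name '*.fa' -o -name '*.faa' \) | wc -l | tr -d ' '
}
```

Calling `count_fasta "$fastainput"` with the variable set above prints the number of files found, or fails with a message if the path is not a directory.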

Then, to run the pipeline:



Posted in Local Tools | Leave a comment

NJ trees for multiple FASTA files using Phangorn R package

This script iterates over the multiple-sequence alignment (MSA) FASTA files in a directory and creates a Neighbor-Joining (NJ) tree for each of them. For this, we will use R and the package Phangorn.

Phangorn is described as a package for phylogenetic analysis in R. It contains methods for estimating phylogenetic trees and networks using maximum likelihood, maximum parsimony, distance methods, and Hadamard conjugation. It also allows comparing trees, performing model selection, and visualizing trees and split networks.

The R function list.files produces a character vector with the names of the files in a directory. Then, inside a loop, two variables build the input and output file names for each file, which are subsequently used by three Phangorn commands to generate an NJ tree. Finally, the script writes each generated tree in Newick format.


library(phangorn) # also attaches ape, which provides write.tree

# names of the MSA FASTA files in the input directory
myfiles <- list.files(path = "/path/to/fasta/msa/files/", pattern = NULL, all.files = FALSE,
           full.names = FALSE, recursive = FALSE,
           ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

for (fastafile in myfiles) {

  infile <- paste(c("/path/to/fasta/msa/files/",fastafile),collapse="")
  outfile <- paste(c(fastafile,".nwk"),collapse="")

  print(infile)

  # read the alignment, compute pairwise distances, build the NJ tree
  mytree <- read.phyDat(infile,format="fasta")
  dm <- dist.ml(mytree) # assumption: the distance call is missing here; dist.ml() is a common Phangorn choice
  treeNJ <- NJ(dm)

  #write tree
  write.tree(treeNJ, file=outfile) #fastafile.nwk
}

fasta2NJnewick.R Git link
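
After a run over many alignments, it is easy to miss a file that failed to produce a tree. The following shell sketch, a hypothetical helper that is not part of the script, lists the alignments with no matching `<name>.nwk` file. It assumes the trees were written to a single known directory, which for the script above is the R working directory.

```shell
#!/usr/bin/env bash
# Hypothetical post-run check (not part of fasta2NJnewick.R).
# Assumption: every tree is saved as "<alignment file name>.nwk" in one directory.

missing_trees() {
    # $1 = directory with the MSA FASTA files
    # $2 = directory where the .nwk files were written
    local msa_dir="$1" out_dir="$2" f name
    for f in "$msa_dir"/*; do
        [ -f "$f" ] || continue          # skip sub-directories and empty globs
        name=$(basename "$f")
        [ -f "$out_dir/$name.nwk" ] || echo "$name"
    done
}
```

For example, `missing_trees /path/to/fasta/msa/files/ .` prints one file name per line for each alignment that still lacks a tree, and prints nothing when every tree is present.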


Brief comparison of NGS platforms

  • Short reads (SBL and SBS types)
    • SBL – Sequencing by ligation type
      • SOLiD [50-75 bp] (80-320 Gb)
      • BGISEQ [50-100 bp] (8-200 Gb)
    • SBS – Sequencing by synthesis type (CRT Cycle Reversible Termination)
      • Illumina [25-300 bp] (540 Mb – 900 Gb)
      • Qiagen [NA] (12 Genes, 1250 mutations)
    • SBS – Sequencing by synthesis type (SNA Single Nucleotide Addition)
      • 454 (discontinued) [400-1000 bp] (35-700 Mb)
      • Ion Torrent [200-400 bp] (30 Mb – 15 Gb)
  • Long reads (SM and SA types)
    • SM – Single Molecule
      • PacBio [8-20 Kb] (500 Mb – 7 Gb)
      • ONT MinION [up to 200 Kb] (1.5 Gb)
    • SA – Synthetic Approaches
      • Illumina Synthetic Long Read Sequencing Platform [~100 Kb] (64-500 Gb)
      • 10x Genomics Emulsion-based System [up to 100 Kb] (64-500 Gb)

Evolutionary history of the cobalamin-independent methionine synthase gene family across the land plants

Plants are successful paleopolyploids. The wide diversity of land plants is driven strongly by their gene duplicates undergoing distinct evolutionary fates after duplication. We used genomic resources from 35 model plant species to unravel the evolutionary fate of gene copies (paralogs) of the cobalamin-independent methionine synthase (metE) gene family across the land plants. To explore genealogical relationships and characterize positive selection as a driving force in the evolution of metE paralogs within a single species, we carried out complementary analyses on genomic data of 32 genotypes of soybean. The size of the metE gene family remained small across the land plants; most of the studied species possessed 1–6 paralogs. Gene products were either cytosolic or chloroplastic; this dual subcellular distribution arose early during the divergence of the land plants and reached all extant lineages. Biased gene loss and gene retention events took place multiple times; recurrent evolution remodeled redundant metE paralogs to recover and maintain the dual subcellular distribution of MetE. Shared whole-genome duplication events gave rise to the metE paralogs of both soybean and Medicago truncatula. In soybean, the ancestral paralog pair GlymaPP2A encoded a cytosolic isoform of MetE, was under strong purifying selection, and retained high levels of expression across eight RNA-seq expression libraries. The daughters GlymaPP1 and GlymaPP2B showed accelerated rates of evolution, accumulated many sites predicted to be under positive selection, and possessed low levels of expression. Our results suggest that the metE paralogs of soybean follow Ohno’s neofunctionalization model of gene duplicate evolution.

Full text:


I Curso de Inverno em Bioinformática UNIFESP

The first Winter School of Bioinformatics of the Institute of Science and Technology of the Federal University of São Paulo (ICT-Unifesp) will be held next July 10 to 12. The speakers are linked to the Biocomputational Project, funded by CAPES, one of the biggest projects for sugarcane breeding in Brazil.


Applied Computational Genomics Course at UU, Aaron Quinlan

Professor Aaron Quinlan (University of Utah), author of BEDtools (link1, link2), has published his “Applied Computational Genomics Course at UU: Spring 2017”.

Highly recommended!


Both mechanism and age of duplications contribute to biased gene retention patterns in plants

In general, transcription factor (GO:0003700) paralogs tend to be overrepresented among ancient (Ks > 1) duplicates, regardless of the mechanism of duplication.

