Find introns correspondent to CDS sequences using genomic sequences and annotation from PLAZA database

I was working with coding-DNA sequences (CDS) data from PLAZA and, at certain point, a co-worker needed not only the exons – which the CDS is constituted – but also the sequences of introns, which is any nucleotide sequence within a gene that is removed by RNA splicing during maturation of the final RNA product.

PLAZA provide an annotation file containing coordinates for the genes (exons plus introns) correspondent to each CDS. Based on the CDS file that you need to provide as an input, the Python script bellow read the correspondent genomic coordinates in the annotation file, and retrieve the correspondent gene sequences from the genome file (scaffold sequences). Finally, the script uses MUSCLE to align each CDS to its gene sequence. For each CDS sequences, two output files are generated, and named with CDS and Scaffold. The two outputs have extensions “.fas” – the retrieved sequences –  and “.aln” – the alignment between the two sequences; where intron regions correspond to dashes in CDS alignment sequence.

Dependency:

sudo apt-get install muscle

Download example of input files from A. thaliana:

Or…

Visit the download page from PLAZA 2.5

And the script…

#!/usr/bin/python
#Find correspondent introns to CDS sequences using genomic sequences and annotation from PLAZA database
#usage: python find_introns_plaza.py scaffoldfile.fas annotationfile.csv cdsfastasequences.fas

import sys
import subprocess

filescaf = sys.argv[1]	#Scaffold file in fasta format containing genomic sequences
fileanno = sys.argv[2]	#Annotation file in csv format
filelist = sys.argv[3]	#CDS Fasta file containing sequences to obtain the respective genomic sequences

#open files
try:
    fgen = open(filescaf)
    flist = open(filelist)
except IOError:
    print ("Scaffold and FASTA files doesn't exist!")

#open annotation file and creates an set for it
set_anno = set(line.strip() for line in open (fileanno, 'r'))

#Create a dictionary with CDS Fasta sequences
fastalist={}

for line in flist:
    line = line.rstrip()
    if line[0] == '>':
        words=line.split()
        name=words[0][1:]
        fastalist[name]=''
    else:
        fastalist[name] = fastalist[name] + line

#Create a dictionary with annotation correspondent to all the CDS sequences in previous dictionary
cds_anno = dict()           #value in this dictionary will be a tupple
listscaffolds = []			#criates an list of scaffolds corresponding to cds's

for i in fastalist.keys():  #read keys from CDS fasta dictionary
	for line in set_anno:   #read lines in the annotation set
		if i in line:
			line = line.split('";"')
			start_cds = line[4]	#start coordinate
			stop_cds = line[5]	#stop coordinate
			scaffold = line[9]	#scaffold name
			cds_anno[i] = start_cds,stop_cds,scaffold	#add values to dictionary where the key is the CDS id
			listscaffolds += scaffold.splitlines()

#Create dictionary for all the Scaffolds in the genomic fasta file
seqs={}
parsing = 0
for i in fgen:
	i = i.rstrip()

	if ">" in i:
		scafid = i.split()[0][1:]

	if parsing > 0 and ">" not in i:          #Second 2: if parsing is greater than 0, than this is the first line of fasta belonging to a scaffold selected in the first step. When it finds a line with ">", then parsing will be set to 0 and everything starts begin
		seqs[scafid] = seqs[scafid] + i
	else:
		parsing = 0

	if ">" in i and scafid in listscaffolds:   #First 1: if id scaffold in my list of scaffolds, then parsing will be set 1
		seqs[scafid] = ''                      #create a dictionary with the scaffold header
		parsing += 1

#Compare all three files based in dictionary cds_anno containing the coordinates of CDS correspondent genomic sequence in scaffolds
for i in cds_anno.items():

	cdsid = i[0]
	start_cds = list(i[1])[0]
	stop_cds = list(i[1])[1]
	scaffold = list(i[1])[2]

	if scaffold in seqs.keys():
		if cdsid in fastalist.keys():

			genomic = str(seqs[scaffold])[int(start_cds)+1:int(stop_cds)+1]
			output = open(cdsid+"_"+scaffold+".fasta", "w")       #open output for writing
			output.write(">"+scaffold+"\n"+str(genomic)+"\n"+">"+cdsid+"\n"+fastalist[cdsid]);   #write FASTA output containing the CDS and Genomic sequences
			output.close()
			doalignment = subprocess.call("muscle \-in"+" "+str(cdsid+"_"+scaffold+".fasta")+" "+"\-out"+" "+str(cdsid+"_"+scaffold+".aln")  ,shell = True)  #call shell subprocess using Muscle to align the CDS sequence to its correspondent Genomic sequence 

Posted in Local Tools | Leave a comment

Shining the dark matter

Genes are just the infinitesimal part of the story. This is a reasonable conclusion after reading “The dark side of the human genome“.

Major part of superior species genomes are composed by ‘non-coding’ DNA – known as dark matter – which has no apparent function in the genomes. However, fresh studies are pinpointing that, in fact, Human genome has less genes than previously predicted – then 25 thousand and now 19 thousand genes -, and more regulatory elements (15 million predicted).

After the sequencing of Human Genome, the Encyclopedia of DNA Elements (ENCODE) Project was launched, in 2003, by the National Human Genome Research Institute (NHGRI). The project aims “to identify all functional elements in the human genome sequence”, or in other words, to bring the dark side of the human genome to the spotlight.

Evolutionary biologists have increasingly data to pore over.

ENCODE links:

 

Posted in Blog | Leave a comment

Neutral Theory: The Null Hypothesis of Molecular Evolution

In the decades since its introduction, the neutral theory of evolution has become central to the study of evolution at the molecular level, in part because it provides a way to make strong predictions that can be tested against actual data. The neutral theory holds that most variation at the molecular level does not affect fitness and, therefore, the evolutionary fate of genetic variation is best explained by stochastic processes. This theory also presents a framework for ongoing exploration of two areas of research: biased gene conversion, and the impact of effective population size on the effective neutrality of genetic variants.

http://www.nature.com/scitable/topicpage/neutral-theory-the-null-hypothesis-of-molecular-839

Posted in Blog | Leave a comment

Capture fasta sequences from file using a list of headers using Python

If you have a fasta file and want to capture some specific sequences based on a list of headers you could only use some short bash “grep” command. However, if the fasta file contains sequences in multiple lines this would be a trick situation for a simple grep.

The Python code below gets the job done.

#!/usr/bin/python
#usage: python script.py file.fasta filelist.txt
import sys

filefasta = sys.argv[1]
filelist = sys.argv[2]

try:
    f = open(filefasta)
except IOError:
    print ("File doesn't exist!")

set_list = set(line.strip() for line in open (filelist, 'r'))

seqs={}
for line in f:
    line = line.rstrip()
    if line[0] == '>':
        words=line.split()
        name=words[0][1:]
        seqs[name]=''
    else:
        seqs[name] = seqs[name] + line

for line1 in set_list:
    for i in seqs.items():
        header= i[0]
        seq = i[1]

        if line1 in header:
            print ">"+line1+"\n"+seq+"\n"

In the code above, all the sequences in the Fasta file are firstly stored in the dictionary “seqs”. Then, the headers you’re looking for are read in a loop and all the matches in “seqs” consequently printed in Fasta format. This works fine if you’re only working with small Fasta files. However, this process may require high memory usage – and therefore not be very useful – if you are working with large Fasta files, such as genomic files containing scaffold sequences.

In the code bellow, only the sequences corresponding to the headers in a list are stored in the dictionary, reducing the memory usage and making the work with large files feasible.

#!/usr/bin/python
#usage: python script.py file.fasta filelist.txt

import sys

filefast = sys.argv[1]
filelist = sys.argv[2]

#open files
try:
    fgen = open(filefast)
except IOError:
    print ("Fasta file doesn't exist!")

set_list = set(line.strip() for line in open (filelist, 'r')) #creates an set with your list of headers

#Create dictionary containing only the sequences with headers matching to the list
seqs={}
parsing = 0
for i in fgen:
	i = i.rstrip()

	if ">" in i:
		scafid = i.split()[0][1:]

	if parsing > 0 and ">" not in i:          #Second 2: if parsing is greater than 0, than this is the first line of fasta belonging to a scaffold selected in the first step. When it finds a line with ">", then parsing will be set to 0 and everything starts begin
		seqs[scafid] = seqs[scafid] + i
	else:
		parsing = 0

	if ">" in i and scafid in set_list:   #First 1: if id scaffold in my list of scaffolds, then parsing will be set 1
		seqs[scafid] = ''                      #create a dictionary with the scaffold header
		parsing += 1

#prints the sequences in dictionary
for i in seqs.items():
	header= i[0]
	seq = i[1]
	print ">"+header+"\n"+seq+"\n"
Posted in Local Tools | Leave a comment

Google teaches you Python

The Google’s Python Class is a free class for people with a little bit of programming experience who want to learn Python. The class includes written materials, lecture videos, and lots of code exercises to practice Python coding.

https://developers.google.com/edu/python/

Posted in Blog | Leave a comment

ORF Finder for DNA sequence Fasta files using Python

Succinctly, an Open Reading Frame (ORF) is a part of DNA sequence in certain frame with the the potential to code for a protein. Because amino acids are coded by triplets of nucleotides, there are three possible frames to look for an ORF in each DNA strand. Considering the forward and reverse DNA strands, six frames are possible. However, in the script below I only took into consideration the forward strand of DNA. For instance, there are several online ORF predictors, and also some already built Python modules that could be used but here I intended to use hands free.

orf

The simplest way I found was to play a game of numbers considering the index positions of each start (ATG) and stop codons (TAA, TAG, or TGA) present in the DNA sequence. Observing the figure above we can find that a DNA sequence has an ORF if:

  • the number of stop codons and start condos are greater than or equal to 1;
  • the index of at least one stop codon is greater than the index of any start codon;

But the DNA sequence might have more than one ORF. To solve this task we need additional statements. One of the possible interpretations is: if you look up to the figure above and keep the index positions of the first stop codon, we could ask:

  • Is there in the sequence any start codon with the index less than the first stop codon?

If yes, in this case is true, we have to take into consideration the start codon with the smallest index, as explained in the figure.
Now, looking up to the next stop codon we need to ask:

  • Is there in the sequence any start codon with the index greater than the previous start codon, greater than the previous stop codon and less than the current stop codon?

If yes, in this case is true, we have found a second ORF in the DNA sequence. Because there’s no more stop codons we stop looking for ORF’s in this sequence.

Please, use and improve it:


#!/usr/bin/python
#usage: python script.py file.fasta
import sys

filefasta = sys.argv[1]

try:
    f = open(filefasta)
except IOError:
    print ("File doesn't exist!")


#FUNCTION START
def orfFINDER(dna,frame):
    
    stop_codons = ['tga', 'tag', 'taa']
    start_codon = ['atg']
    start_positions = []
    stop_positions = []
    num_starts=0
    num_stops=0
    
    
    for i in range(frame,len(dna),3):
        codon=dna[i:i+3].lower()
        if codon in start_codon:
            start_positions += str(i+1).splitlines()
        if codon in stop_codons:
            stop_positions += str(i+1).splitlines()
    
    for line in stop_positions:
        num_stops += 1
    
    for line in start_positions:
        num_starts += 1
    
    orffound = {}
    
    if num_stops >=1 and num_starts >=1: #first statment: the number of stop codons and start condos are greater than or equal to 1;
        
        orfs = True
        stop_before = 0
        start_before = 0
        
        if num_starts > num_stops:
            num_runs = num_starts
        if num_stops > num_starts:
            num_runs = num_stops
        if num_starts == num_stops:
            num_runs = num_starts
            
        position_stop_previous = 0
        position_start_previous = 0
        counter = 0
        
        for position_stop in stop_positions:
            
            position_stop = int(position_stop.rstrip()) + 2
            
            for position_start in start_positions:
                
                position_start = position_start.rstrip()
                
                if int(position_start) < int(position_stop) and int(position_stop) > int(position_stop_previous) and int(position_start) > int(position_stop_previous):
                    
                    counter += 1
                    nameorf = "orf"+str(counter)
                    position_stop_previous += int(position_stop) - int(position_stop_previous)
                    position_start_previous += int(position_start) - int(position_start_previous)
                    sizeorf = int(position_stop) - int(position_start) + 1
                    
                    orffound[nameorf] = position_start,position_stop,sizeorf,frame
                        
                else:
                    
                    pass
    
    else:
        
        orfs = False

    return orffound
#FUNCTION END


#READ FASTA FILE AND SAVE HEADERS AND SEQUENCES IN A DICTIONARY
seqs={}

for line in f:
    line = line.rstrip()
    if line[0] == '>':
        words=line.split() 
        name=words[0][1:]
        seqs[name]=''
    else:
        seqs[name] = seqs[name] + line
        
#DEFINE FRAME TO FIND ORF
#if frame = 0, start from the first position in the sequence
frame=0

#EXECUTE THE ORFFINDER FUNCTION
for i in seqs.items():
    header= i[0]
    seq = i[1]
    orf = orfFINDER(seq,frame)
    
    for i in orf.items():
        numorf=i[0]
        startorf=orf[numorf][0]
        stoporf=orf[numorf][1]
        lengthorf=orf[numorf][2]
        frameorf=orf[numorf][3]
        print header,numorf,"start",startorf,"stop",stoporf,"length",lengthorf,"frame",frameorf

f.close()

Posted in Local Tools | Leave a comment

Highlighting SNPs for better presentation of your sequence alignment

Imagine the following FASTA alignment:

>header1
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTA
>header2
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTA
>header3
ATGGATTTATCTGCTCTTCGCGTTGAAGGAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTA
>header4
ATGGATTTATCTGTTCTTCGCGTTGAAGGAGTACAAAATGTCACTAATGCTATGCAGAAAATCTTA

Based only in the above alignment, is difficult to instantly identify where the SNPs are. Set dots where we have alignment matches is a very intuitive way to spotlight the SNPs.

header4 ATGGATTTATCTGTTCTTCGCGTTGAAGGAGTACAAAATGTCACTAATGCTATGCAGAAAATCTTA
header2 .............C..............A..............T......................
header3 .............C.............................T......................
header1 .............C..............A..............T......................

Changing similar nucleotides or amino acids by dots is the most simple way of arranging your sequence alignment in order to highlight the regions of inequality.

Here’s how it could be done in Python.

#!/usr/bin/python
#usage: python script.py file.fasta
import sys

filefasta = sys.argv[1]

try:
    f = open(filefasta)
except IOError:
    print ("File doesn't exist!")

seqs={}

for line in f:
    line = line.rstrip()
    if line[0] == '>':
        words=line.split()
        name=words[0][1:]
        seqs[name]=''
    else:
        seqs[name] = seqs[name] + line

first_header = seqs.keys()[-1]
first_sequence = list(seqs.values()[-1])

print first_header,''.join(first_sequence)

comparison = {}

for i in seqs.items():
    header= i[0]
    seq = i[1]
    if header != first_header:
        comparison[header] = ''
        for i in range(0,len(seq)):
            letter=seq[i]
            base=first_sequence[i]
            if letter == base:
                comparison[header] = comparison[header] + '.'
            else:
                comparison[header] = comparison[header] + letter

for i in comparison.items():
    header=i[0]
    seq=i[1]
    print header,seq

Posted in Blog | Leave a comment