Find introns correspondent to CDS sequences using genomic sequences and annotation from PLAZA database

I was working with coding-DNA sequences (CDS) data from PLAZA and, at certain point, a co-worker needed not only the exons – which the CDS is constituted – but also the sequences of introns, which is any nucleotide sequence within a gene that is removed by RNA splicing during maturation of the final RNA product.

PLAZA provide an annotation file containing coordinates for the genes (exons plus introns) correspondent to each CDS. Based on the CDS file that you need to provide as an input, the Python script bellow read the correspondent genomic coordinates in the annotation file, and retrieve the correspondent gene sequences from the genome file (scaffold sequences). Finally, the script uses MUSCLE to align each CDS to its gene sequence. For each CDS sequences, two output files are generated, and named with CDS and Scaffold. The two outputs have extensions “.fas” – the retrieved sequences –  and “.aln” – the alignment between the two sequences; where intron regions correspond to dashes in CDS alignment sequence.


sudo apt-get install muscle

Download example of input files from A. thaliana:


Visit the download page from PLAZA 2.5

And the script…

#Find correspondent introns to CDS sequences using genomic sequences and annotation from PLAZA database
#usage: python scaffoldfile.fas annotationfile.csv cdsfastasequences.fas

import sys
import subprocess

filescaf = sys.argv[1]	#Scaffold file in fasta format containing genomic sequences
fileanno = sys.argv[2]	#Annotation file in csv format
filelist = sys.argv[3]	#CDS Fasta file containing sequences to obtain the respective genomic sequences

#open files
    fgen = open(filescaf)
    flist = open(filelist)
except IOError:
    print ("Scaffold and FASTA files doesn't exist!")

#open annotation file and creates an set for it
set_anno = set(line.strip() for line in open (fileanno, 'r'))

#Create a dictionary with CDS Fasta sequences

for line in flist:
    line = line.rstrip()
    if line[0] == '>':
        fastalist[name] = fastalist[name] + line

#Create a dictionary with annotation correspondent to all the CDS sequences in previous dictionary
cds_anno = dict()           #value in this dictionary will be a tupple
listscaffolds = []			#criates an list of scaffolds corresponding to cds's

for i in fastalist.keys():  #read keys from CDS fasta dictionary
	for line in set_anno:   #read lines in the annotation set
		if i in line:
			line = line.split('";"')
			start_cds = line[4]	#start coordinate
			stop_cds = line[5]	#stop coordinate
			scaffold = line[9]	#scaffold name
			cds_anno[i] = start_cds,stop_cds,scaffold	#add values to dictionary where the key is the CDS id
			listscaffolds += scaffold.splitlines()

#Create dictionary for all the Scaffolds in the genomic fasta file
parsing = 0
for i in fgen:
	i = i.rstrip()

	if ">" in i:
		scafid = i.split()[0][1:]

	if parsing > 0 and ">" not in i:          #Second 2: if parsing is greater than 0, than this is the first line of fasta belonging to a scaffold selected in the first step. When it finds a line with ">", then parsing will be set to 0 and everything starts begin
		seqs[scafid] = seqs[scafid] + i
		parsing = 0

	if ">" in i and scafid in listscaffolds:   #First 1: if id scaffold in my list of scaffolds, then parsing will be set 1
		seqs[scafid] = ''                      #create a dictionary with the scaffold header
		parsing += 1

#Compare all three files based in dictionary cds_anno containing the coordinates of CDS correspondent genomic sequence in scaffolds
for i in cds_anno.items():

	cdsid = i[0]
	start_cds = list(i[1])[0]
	stop_cds = list(i[1])[1]
	scaffold = list(i[1])[2]

	if scaffold in seqs.keys():
		if cdsid in fastalist.keys():

			genomic = str(seqs[scaffold])[int(start_cds)+1:int(stop_cds)+1]
			output = open(cdsid+"_"+scaffold+".fasta", "w")       #open output for writing
			output.write(">"+scaffold+"\n"+str(genomic)+"\n"+">"+cdsid+"\n"+fastalist[cdsid]);   #write FASTA output containing the CDS and Genomic sequences
			doalignment ="muscle \-in"+" "+str(cdsid+"_"+scaffold+".fasta")+" "+"\-out"+" "+str(cdsid+"_"+scaffold+".aln")  ,shell = True)  #call shell subprocess using Muscle to align the CDS sequence to its correspondent Genomic sequence 

This entry was posted in Local Tools. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s