Capture fasta sequences from file using a list of headers using Python

If you have a fasta file and want to capture some specific sequences based on a list of headers you could only use some short bash “grep” command. However, if the fasta file contains sequences in multiple lines this would be a trick situation for a simple grep.

The Python code below gets the job done.

#!/usr/bin/python
#usage: python script.py file.fasta filelist.txt
import sys

filefasta = sys.argv[1]
filelist = sys.argv[2]

try:
    f = open(filefasta)
except IOError:
    print ("File doesn't exist!")

set_list = set(line.strip() for line in open (filelist, 'r'))

seqs={}
for line in f:
    line = line.rstrip()
    if line[0] == '>':
        words=line.split()
        name=words[0][1:]
        seqs[name]=''
    else:
        seqs[name] = seqs[name] + line

for line1 in set_list:
    for i in seqs.items():
        header= i[0]
        seq = i[1]

        if line1 in header:
            print ">"+line1+"\n"+seq+"\n"

In the code above, all the sequences in the Fasta file are firstly stored in the dictionary “seqs”. Then, the headers you’re looking for are read in a loop and all the matches in “seqs” consequently printed in Fasta format. This works fine if you’re only working with small Fasta files. However, this process may require high memory usage – and therefore not be very useful – if you are working with large Fasta files, such as genomic files containing scaffold sequences.

In the code bellow, only the sequences corresponding to the headers in a list are stored in the dictionary, reducing the memory usage and making the work with large files feasible.

#!/usr/bin/python
#usage: python script.py file.fasta filelist.txt

import sys

filefast = sys.argv[1]
filelist = sys.argv[2]

#open files
try:
    fgen = open(filefast)
except IOError:
    print ("Fasta file doesn't exist!")

set_list = set(line.strip() for line in open (filelist, 'r')) #creates an set with your list of headers

#Create dictionary containing only the sequences with headers matching to the list
seqs={}
parsing = 0
for i in fgen:
	i = i.rstrip()

	if ">" in i:
		scafid = i.split()[0][1:]

	if parsing > 0 and ">" not in i:          #Second 2: if parsing is greater than 0, than this is the first line of fasta belonging to a scaffold selected in the first step. When it finds a line with ">", then parsing will be set to 0 and everything starts begin
		seqs[scafid] = seqs[scafid] + i
	else:
		parsing = 0

	if ">" in i and scafid in set_list:   #First 1: if id scaffold in my list of scaffolds, then parsing will be set 1
		seqs[scafid] = ''                      #create a dictionary with the scaffold header
		parsing += 1

#prints the sequences in dictionary
for i in seqs.items():
	header= i[0]
	seq = i[1]
	print ">"+header+"\n"+seq+"\n"
This entry was posted in Local Tools. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s