## ORF Finder for DNA sequence Fasta files using Python

Succinctly, an Open Reading Frame (ORF) is a stretch of DNA sequence, read in a given frame, with the potential to code for a protein. Because amino acids are coded by triplets of nucleotides, there are three possible frames in which to look for an ORF on each DNA strand. Considering the forward and reverse strands, six frames are possible. However, the script below only takes the forward strand into consideration. There are several online ORF predictors, and also some ready-made Python modules that could be used, but here I wanted to do it by hand. The simplest way I found was to play a game of numbers with the index positions of each start codon (ATG) and stop codon (TAA, TAG, or TGA) present in the DNA sequence. Looking at the figure above, we can see that a DNA sequence has an ORF if:

• the number of stop codons and the number of start codons are both greater than or equal to 1;
• the index of at least one stop codon is greater than the index of at least one start codon in the same frame.
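
The two conditions above can be checked directly once the codon positions of one frame are collected. The following is only a minimal sketch under stated assumptions: the helper names `codon_indexes` and `has_orf` are hypothetical and are not part of the script below, and the indexes here are 0-based.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def codon_indexes(dna, frame=0):
    """Return 0-based indexes of start and stop codons in one forward frame."""
    starts, stops = [], []
    dna = dna.upper()
    # Step through the frame three bases at a time, complete codons only
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == START_CODON:
            starts.append(i)
        elif codon in STOP_CODONS:
            stops.append(i)
    return starts, stops


def has_orf(dna, frame=0):
    """Apply the two conditions: both codon types exist, and a stop follows a start."""
    starts, stops = codon_indexes(dna, frame)
    return bool(starts) and bool(stops) and max(stops) > min(starts)


# Made-up example sequence: ATG at index 2 and TGA at index 8, both in frame 2
print(has_orf("CCATGAAATGACC", frame=2))
```

Comparing the largest stop index with the smallest start index is enough here, because any stop that follows any start satisfies the second condition.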

But the DNA sequence might have more than one ORF, and handling that case requires additional conditions. One possible approach: looking at the figure above and keeping the index position of the first stop codon in mind, we can ask:

• Is there any start codon in the sequence with an index less than that of the first stop codon?

If yes, as is the case here, we take the start codon with the smallest index, as explained in the figure.
Now, moving on to the next stop codon, we need to ask:

• Is there any start codon in the sequence with an index greater than the previous start codon, greater than the previous stop codon, and less than the current stop codon?

If yes, as is the case here, we have found a second ORF in the DNA sequence. Because there are no more stop codons, we stop looking for ORFs in this sequence.
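
The questions above amount to pairing each stop codon with the first start codon that lies after the previous ORF's stop. Here is a small sketch of that pairing on its own, assuming the start and stop indexes of a single frame are already available as sorted lists. The function name `pair_orfs` and the index values are hypothetical, chosen only for illustration.

```python
def pair_orfs(starts, stops):
    """Pair sorted start/stop codon indexes of one frame into (start, stop) ORFs."""
    orfs = []
    previous_stop = -1  # no ORF found yet
    for stop in stops:
        for start in starts:
            # Take the first start after the previous ORF's stop and before this stop
            if previous_stop < start < stop:
                orfs.append((start, stop))
                previous_stop = stop
                break
    return orfs


# Hypothetical index positions, all in the same frame
starts = [3, 12, 30]
stops = [21, 45]
print(pair_orfs(starts, stops))  # [(3, 21), (30, 45)]
```

The start codon at index 12 is skipped because it falls inside the first ORF, so only the start at index 30 can open a second one.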

```
#!/usr/bin/python
# usage: python script.py file.fasta
import sys

# The FASTA file name is the first command-line argument
filefasta = sys.argv[1]

try:
    f = open(filefasta)
except IOError:
    print("File doesn't exist!")
    sys.exit(1)

#FUNCTION START
def orfFINDER(dna, frame):

    stop_codons = ['tga', 'tag', 'taa']
    start_codon = ['atg']
    start_positions = []
    stop_positions = []

    # Scan the chosen frame codon by codon and keep the 1-based
    # position of the first base of every start and stop codon
    for i in range(frame, len(dna), 3):
        codon = dna[i:i+3].lower()
        if codon in start_codon:
            start_positions.append(i + 1)
        if codon in stop_codons:
            stop_positions.append(i + 1)

    num_starts = len(start_positions)
    num_stops = len(stop_positions)

    orffound = {}

    # First statement: the number of stop codons and start codons
    # are both greater than or equal to 1
    if num_stops >= 1 and num_starts >= 1:

        position_stop_previous = 0
        counter = 0

        for position_stop in stop_positions:

            # Move to the last base of the stop codon
            position_stop = position_stop + 2

            for position_start in start_positions:

                # Take the first start codon located after the previous
                # ORF's stop codon and before the current stop codon
                if position_stop_previous < position_start < position_stop:

                    counter += 1
                    nameorf = "orf" + str(counter)
                    position_stop_previous = position_stop
                    sizeorf = position_stop - position_start + 1

                    orffound[nameorf] = position_start, position_stop, sizeorf, frame
                    break

    return orffound
#FUNCTION END

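seqs = {}

# Read the FASTA file: header lines start with '>', all other lines are sequence
for line in f:
    line = line.rstrip()
    if line.startswith('>'):
        words = line.split()
        # Sequence name: first word of the header, without the '>'
        name = words[0][1:]
        seqs[name] = ''
    else:
        seqs[name] = seqs[name] + line

f.close()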

#DEFINE FRAME TO FIND ORF
#if frame = 0, start from the first position in the sequence
frame=0

#EXECUTE THE ORFFINDER FUNCTION
for header, seq in seqs.items():
    orf = orfFINDER(seq, frame)

    # Each value is a tuple: (start, stop, length, frame)
    for numorf in orf:
        startorf = orf[numorf][0]
        stoporf = orf[numorf][1]
        lengthorf = orf[numorf][2]
        frameorf = orf[numorf][3]