I have a list with three values (strings) and a substring.
Each string in the list needs to be searched for the given substring between positions 20 and 50, and printed out if there are more than 5 occurrences of the substring in that string.
If a string lacks the substring, a message should be printed saying that the substring is missing (for each such list item).
The output should be (considering my code below):
1 Enriched with SP1 binding sites
3 Contains no SP1 binding sites
seq_list = ["GGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGG", "GGGCGG", "BBBBBBB"]
binding_site = "GGGCGG"
for count, value in enumerate(seq_list, start=1):
    if binding_site in value:
        sumSP = int(sum(s.count('GGCGG') for s in seq_list))
        if sumSP >20:
            print(count, "enriched with SP1 binding sites")
    else:
        print(count,"No binding sites found.")
So I’ve got two problems. First, I’ve scoured the internet for a simple way to search each string between positions 20 and 50, but I only managed to find how to slice positions of the whole list.
The second problem is that my sumSP code doesn’t work: it gives true for my second string, which should be false, since only the first value in my list holds more than 5 binding sites.
The following code closely follows your code snippet. It uses two calls to str.find(): one to check whether the binding site occurs at all, and one to check whether it occurs between positions 20 and 50.
seq_list = ["GGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGG", "GGGCGG", "BBBBBBB"]
binding_site = "GGGCGG"
for count, value in enumerate(seq_list, start=1):
    if value.find(binding_site) != -1:
        if value.find(binding_site, 20, 50) != -1:
            sumSP = value.count('GGCGG')
            if sumSP >= 5:
                print(count, "enriched with SP1 binding sites")
        else:
            print(count,"No binding sites found.")
Output:
1 enriched with SP1 binding sites
2 No binding sites found.
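If the count itself should also be limited to positions 20 to 50, note that str.count() accepts the same optional start and end arguments as str.find(), so no slicing is needed. Here is a minimal sketch, assuming 0-based positions with an exclusive end, the full binding_site as the search string, and a strict "more than 5" threshold. The helper name count_in_window and the constants START, END and MIN_SITES are mine, not from your code.

```python
seq_list = [
    "GGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGG",
    "GGGCGG",
    "BBBBBBB",
]
binding_site = "GGGCGG"
START, END = 20, 50   # assumption: 0-based positions, END is exclusive
MIN_SITES = 5         # "more than 5" means the count must exceed this


def count_in_window(seq, sub, start, end):
    """Count non-overlapping occurrences of sub that lie entirely in seq[start:end]."""
    return seq.count(sub, start, end)


for number, seq in enumerate(seq_list, start=1):
    if binding_site not in seq:
        print(number, "Contains no SP1 binding sites")
        continue
    hits = count_in_window(seq, binding_site, START, END)
    if hits > MIN_SITES:
        print(number, "Enriched with SP1 binding sites")
    else:
        print(number, f"Only {hits} SP1 binding site(s) between {START} and {END}")
```

With your sample data the window of the first sequence holds 3 complete sites, so under these assumptions nothing is reported as enriched. Lower MIN_SITES or widen the window if that is not what you intend.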
The code below is what I think you want, and it can easily be modified. It uses a regular expression as a simple way to count substring occurrences, and it shows how to search a portion of a string.
import re
seq_list = ["GGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGG", "GGGCGG", "BBBBBBB"]
binding_site = "GGGCGG"
search_for = "GGCGG"
START = 20
FINISH = 50
for i, seq in enumerate(seq_list):
    if binding_site not in seq:
        print(f"seq {i} No binding sites found.")
    elif len(seq) < FINISH:
        print(f"seq {i} length {len(seq)} less than search size {FINISH}")
    else:
        num = len(re.findall(search_for, seq[START:FINISH]))
        print(f"seq {i} has {num} found - enriched with SP1 binding sites")
which gives:
seq 0 has 3 found - enriched with SP1 binding sites
seq 1 length 6 less than search size 50
seq 2 No binding sites found.
Note that because Python is zero-indexed, START = 20 refers to index 20, which is the 21st character, and so on. This may or may not be what you want.
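If the positions are instead meant in the 1-based, inclusive sense (character 20 through character 50), only the start of the slice needs to shift by one. A small sketch of that conversion follows. The helper name window_1based is hypothetical, and the short string is just for illustration.

```python
def window_1based(seq, first, last):
    """Return characters first..last of seq, counted from 1, both ends included."""
    # Python slices are 0-based with an exclusive end, so only the start moves.
    return seq[first - 1:last]


letters = "abcdefghij"
print(window_1based(letters, 2, 4))  # characters 2, 3 and 4 -> bcd
print(letters[2:4])                  # plain 0-based slice -> cd
```

Under that reading, the search in the code above would use seq[START - 1:FINISH] rather than seq[START:FINISH].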
Please do not upload images of code/data/errors.
do you know about the KMP algorithm?
Overlapping occurrences or not?
The KMP algorithm feels like a bit of overkill at the moment, I’m just taking a short basic course in Python 🙂 There shouldn’t be any overlapping occurrences.
@qwr I don’t see how that would help.