Ambiguity Codes for Nucleotides

Code | Nucleotides |
---|---|
M | {A, C} |
R | {A, G} |
W | {A, T} |
S | {C, G} |
Y | {C, T} |
K | {G, T} |
V | {A, C, G} |
H | {A, C, T} |
D | {A, G, T} |
B | {C, G, T} |
N | {A, C, G, T} |
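The table can be read as a lookup from each code to the set of bases it admits. The Python sketch below is only an illustration of that reading and is not part of PatScan; the names `AMBIGUITY`, `code_matches` and `pattern_matches` are hypothetical, and it assumes upper- or lower-case DNA letters only.

```python
# Ambiguity codes from the table, plus the four unambiguous bases.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def code_matches(code: str, base: str) -> bool:
    """Return True if pattern character `code` admits sequence character `base`."""
    return base.upper() in AMBIGUITY[code.upper()]


def pattern_matches(pattern: str, segment: str) -> bool:
    """Exact comparison (no mismatches) of a coded pattern with an equal-length segment."""
    return len(pattern) == len(segment) and all(
        code_matches(p, s) for p, s in zip(pattern, segment)
    )
```

For instance, `pattern_matches("AGYGGT", "AGCGGT")` holds because Y admits C, while `pattern_matches("AGYGGT", "AGAGGT")` does not because Y excludes A.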
(X | Y)
where X and Y are pattern units. The construct (X | Y) will match successfully if either X matches or Y matches.
NOTE: The spaces before and after the vertical bar (" | ") are necessary.
name=X
where "name" is one of {p1,p2,p3,...} and X is a basic pattern unit. When a named simple pattern unit successfully matches a section of a sequence, that section can later be referred to in constructs such as
p1 p2[0,1,0] ~p3
and so forth (see below). The "name" saves the value of the matched substring.
name=complements
where name is one of {r1,r2,r3,...} and complements is a set defining what is meant by "complement" under the named rule. For example,
r1={au,ua,gc,cg,gu,ug,ga,ag}
r2={au,ua,gc,cg}
shows two complementation rule pattern units defining two specialized notions of "complement".
Explicitly defined complementation rules are useful when scanning for helices in nucleotide sequences, especially when unusual constraints exist for specific positions.
Normally, one uses the standard complementation rule, i.e. the set
{at,ta,cg,gc}.
PatScan assumes this to be the default rule, and it does not need to be defined explicitly.
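One way to picture a complementation rule is as a set of allowed two-character pairs that is consulted position by position while one strand is read backwards. The Python sketch below follows that reading; it is not PatScan's implementation, the names `STANDARD`, `R1`, `R2` and `is_reverse_complement` are hypothetical, and treating each pair as (character from the first strand, character from the second strand) is an assumption.

```python
# Rules written as sets of two-character pairs, mirroring the r1/r2 notation.
STANDARD = {"at", "ta", "cg", "gc"}  # default rule stated in the text
R1 = {"au", "ua", "gc", "cg", "gu", "ug", "ga", "ag"}
R2 = {"au", "ua", "gc", "cg"}


def is_reverse_complement(left: str, right: str, rule=STANDARD) -> bool:
    """True if `right`, read backwards, pairs with `left` at every position under `rule`.

    Assumption: a pair "xy" in the rule means x on the left strand may pair with y
    on the right strand. No mismatches or bulges are tolerated in this sketch.
    """
    if len(left) != len(right):
        return False
    return all((a + b).lower() in rule for a, b in zip(left, reversed(right)))
```

With these definitions, `is_reverse_complement("GGAU", "AUUC", R1)` holds because r1 admits the g-u pair, whereas the same call with `R2` fails.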
examples: AGYGGT YCXXGA TATAA[1,0,0]
example: 3...8
examples: ~p2 ~p3[1,1,0] r2~p4
examples: p1 p2[1,0,0]
example: any(IV)
example: notany(IVL)
example: {(80,0,20,0),(0,100,0,0),(20,40,40,0)} > 119
example: length(p1+p2+p3) < 12
[Mismatches,Deletions,Insertions]
For example, a string pattern of the form
RNYRNYRNYRNY[1,0,0]
would match 12 characters in a nucleotide string, allowing at most one mismatch and no deletions or insertions (where R stands for a purine and Y stands for a pyrimidine; the standard ambiguity codes are used for nucleotides, but only X is allowed as an ambiguous character for proteins). A "deletion" is a character in the pattern for which no character in the matched sequence corresponds, while an "insertion" is a character in the sequence which does not correspond to any character in the string pattern unit.
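The mismatch part of the qualifier can be sketched as a sliding-window comparison that counts positions where the sequence character falls outside the set admitted by the pattern character. The Python below is a simplified illustration, not PatScan's algorithm: it handles mismatches only (deletions and insertions are deliberately left out), and the names `AMBIGUITY`, `count_mismatches` and `find_with_mismatches` are hypothetical.

```python
# Nucleotide ambiguity codes, plus the four unambiguous bases.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def count_mismatches(pattern: str, segment: str) -> int:
    """Count positions where the segment character is not admitted by the pattern code."""
    return sum(1 for p, s in zip(pattern, segment) if s not in AMBIGUITY[p])


def find_with_mismatches(pattern: str, sequence: str, max_mismatches: int):
    """Yield (start, mismatches) for every window within the mismatch budget.

    Only substitutions are considered, i.e. this corresponds to a qualifier of
    the form [max_mismatches,0,0].
    """
    n = len(pattern)
    for start in range(len(sequence) - n + 1):
        mismatches = count_mismatches(pattern, sequence[start:start + n])
        if mismatches <= max_mismatches:
            yield start, mismatches
```

Calling `list(find_with_mismatches("RNYRNYRNYRNY", "AAGGTTAGCGAT", 1))` reports the single window at position 0 with one mismatch (the G facing the first Y); with a budget of 0 the same call reports nothing.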
Min...Max
which indicates that it will match any subsequence with length between Min and Max.
~p1
matches the reverse complement of whatever p1 represents. If a special rule of complementation is required, it precedes the ~; thus
r1~p2
says "match the reverse complement of whatever p2 matched, where complementation is defined by complementation rule r1". You can also add a match qualifier; thus,
~p2[1,0,1]
would allow a single mismatch and a single character bulge in the helix.
p1=3...3 p1 p1
would match a 9-character string made up of a 3-mer repeated three times. You can qualify the matches; for example,
p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0] p1[1,0,0]
matches a 15-character string which might be thought of as 5 repetitions of a 3-mer that have experienced a few mutations.
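The repeat examples can be mimicked by taking the first unit at each start position and comparing every later unit against it. The following Python is a rough sketch under that assumption rather than PatScan's own search; `find_tandem_repeats` and its parameters are hypothetical names, and only mismatches (no deletions or insertions) are allowed in the later copies.

```python
def find_tandem_repeats(sequence: str, unit_len: int = 3, copies: int = 3,
                        max_mismatches: int = 0):
    """Yield start positions where a unit is followed by copies-1 near-copies of itself.

    Each later copy is compared against the first unit and may differ from it
    in at most `max_mismatches` positions.
    """
    total = unit_len * copies
    for start in range(len(sequence) - total + 1):
        unit = sequence[start:start + unit_len]
        ok = True
        for k in range(1, copies):
            copy = sequence[start + k * unit_len:start + (k + 1) * unit_len]
            if sum(a != b for a, b in zip(unit, copy)) > max_mismatches:
                ok = False
                break
        if ok:
            yield start
```

Here `find_tandem_repeats(seq)` plays the role of `p1=3...3 p1 p1`, and `find_tandem_repeats(seq, copies=5, max_mismatches=1)` approximates the five-copy pattern with one mismatch allowed per repeated copy.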
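Base | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
A | 16 | 57 | 0 | 95 | 0 | 18 | 0 | 0 |
C | 0 | 10 | 80 | 0 | 100 | 60 | 0 | 50 |
G | 84 | 29 | 0 | 0 | 0 | 20 | 100 | 50 |
T | 0 | 4 | 20 | 5 | 0 | 2 | 0 | 0 |

One could use the following pattern unit to search for inexact matches related to such a "weight matrix":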
{(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5), (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
This pattern unit will attempt to match exactly eight characters. For each character in the sequence, the entry in the corresponding tuple is added to an accumulated sum. If the sum is greater than 450, the match succeeds; otherwise it fails. For protein sequences, you must use 20-tuples (with the entries corresponding to the amino acids in alphabetical order). This will be used only by the most serious aficionados.
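The scoring rule described here (pick one entry per tuple according to the sequence character, sum the entries, compare with the threshold) can be written out directly. The Python below is an illustrative sketch, not PatScan code; `WEIGHTS`, `BASES`, `score_window` and `find_weight_matrix_hits` are hypothetical names, and it assumes each tuple is ordered (A, C, G, T) and that the sequence contains only those four upper-case letters.

```python
# One (A, C, G, T) tuple per pattern position, copied from the pattern unit.
WEIGHTS = [
    (16, 0, 84, 0), (57, 10, 29, 4), (0, 80, 0, 20), (95, 0, 0, 5),
    (0, 100, 0, 0), (18, 60, 20, 2), (0, 0, 100, 0), (0, 50, 50, 0),
]
BASES = "ACGT"  # assumed tuple order


def score_window(window: str, weights=WEIGHTS) -> int:
    """Sum, over positions, the tuple entry selected by the character at that position."""
    return sum(column[BASES.index(ch)] for column, ch in zip(weights, window))


def find_weight_matrix_hits(sequence: str, weights=WEIGHTS, threshold: int = 450):
    """Yield (start, score) for each window whose score is strictly above the threshold."""
    n = len(weights)
    for start in range(len(sequence) - n + 1):
        score = score_window(sequence[start:start + n], weights)
        if score > threshold:
            yield start, score
```

The highest attainable score with this matrix is 84 + 57 + 80 + 95 + 100 + 60 + 100 + 50 = 626 (for example, the window GACACCGC), so a threshold of 450 leaves room for a few weaker positions.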
p1=5...5 p2=1...5 p3=3...7 ~p1[1,0,0] p4=3...8 ~p3 length(p2+p4) < 10
would match a pseudo-knot like structure, setting a maximum size on the two unpaired internal subsequences.
TTF $
matches TTF at the end of the database sequence.
^ TTF
matches TTF at the beginning of the database sequence.
p1=4...4 <p1
matches all of the following:
AGGD DGGA
FAFL LFAF
GSAP PASG
SAPR RPAS
That is, it matches any four characters, followed by its reverse. This is the actual palindrome, not the biologically common meaning of "reverse complement".
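The difference from a reverse complement is that the second half is simply the first half read backwards, with no base pairing involved. A minimal Python sketch of that check follows; it is not PatScan's implementation, and `find_reverse_repeats` is a hypothetical name.

```python
def find_reverse_repeats(sequence: str, unit_len: int = 4):
    """Yield start positions where `unit_len` characters are followed by their exact reverse."""
    total = 2 * unit_len
    for start in range(len(sequence) - total + 1):
        unit = sequence[start:start + unit_len]
        # Plain string reversal: works for protein or nucleotide letters alike.
        if sequence[start + unit_len:start + total] == unit[::-1]:
            yield start
```

For example, `list(find_reverse_repeats("XXFAFLLFAFXX"))` reports position 2, where FAFL is followed by LFAF.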