What Is a PatScan Pattern?

Basic Rules

  1. PatScan uses the standard codes:

  2. Upper and Lower case are equivalent.

  3. T and U can be used interchangeably.

  4. Ambiguity Codes may be used to represent unknown characters in a pattern.
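
As a rough illustration of Basic Rules 2 to 4, the following Python sketch folds case, treats U as T, and expands ambiguity codes. It is not PatScan's own code: the table is the usual IUPAC nucleotide set (assumed here to be what "standard codes" means), the function names are hypothetical, and an ambiguity code appearing in the sequence itself is simply treated as an ordinary character.

```python
# Usual IUPAC nucleotide ambiguity codes (an assumption about PatScan's table).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def normalize(seq):
    """Upper-case the text and treat U as T (Basic Rules 2 and 3)."""
    return seq.upper().replace("U", "T")


def code_matches(pattern_char, seq_char):
    """True if one pattern character accepts one sequence character."""
    p = normalize(pattern_char)
    s = normalize(seq_char)
    # Unknown pattern characters fall back to a literal comparison.
    return s in IUPAC.get(p, p)


def matches_exactly(pattern, seq):
    """True if every pattern character accepts the aligned sequence character."""
    return len(pattern) == len(seq) and all(
        code_matches(p, s) for p, s in zip(pattern, seq)
    )
```

For instance, matches_exactly("RNY", "acu") holds because R accepts A, N accepts C, and Y accepts U once it has been folded to T.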

Rules for Patterns

  1. A pattern is a sequence of pattern units.

  2. A pattern unit is either a simple pattern unit or of the form (X | Y) where X and Y are pattern units. The construct (X | Y) will match successfully if either X matches or Y matches.

    NOTE: The spaces before and after the vertical bar " | " are necessary.

  3. A simple pattern unit is either a

  4. A named pattern unit is of the form
    	name=X
    
    where "name" is one of {p1,p2,p3,...} and X is a basic pattern unit. When a named pattern unit successfully matches a section of a sequence, that section can later be referred to in constructs such as
    	p1
    	p2[0,1,0]
    	~p3
    
    and so forth (see below). The "name" saves the value of the matched substring.

  5. A complementation rule pattern unit is of the form
    	name=complements
    
    where name is one of {r1,r2,r3,...} and complements is a set defining what is meant by "complement" under the named rule. For example,
    	r1={au,ua,gc,cg,gu,ug,ga,ag} r2={au,ua,gc,cg}
    
    shows two complementation rule pattern units defining two specialized notions of "complement".

    Explicitly defined complementation rules are useful when scanning for helices in nucleotide sequences, especially when unusual constraints exist for specific positions.

    Normally, one uses the standard complementation rule, i.e. the set

    	{at,ta,cg,gc}.
    
    PatScan assumes this to be the default rule, and it does not need to be defined explicitly.
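
    To make the idea of a rule as a set of allowed pairs concrete, here is a minimal Python sketch that parses the rule syntax shown above and tests a single pair against it. The names parse_rule, is_complement and DEFAULT_RULE are hypothetical, and folding U into T before comparison is an assumption carried over from the Basic Rules rather than a description of PatScan's internals.

```python
# The default rule from the text: {at,ta,cg,gc}.
DEFAULT_RULE = frozenset({("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")})


def _norm(text):
    """Upper-case and treat U as T so RNA and DNA spellings compare equal."""
    return text.upper().replace("U", "T")


def parse_rule(text):
    """Parse a string such as 'r1={au,ua,gc,cg}' into (name, set of pairs)."""
    name, body = text.split("=", 1)
    pairs = set()
    for item in body.strip().strip("{}").split(","):
        item = _norm(item.strip())
        if len(item) != 2:
            raise ValueError(f"expected a two-letter pair, got {item!r}")
        pairs.add((item[0], item[1]))
    return name.strip(), frozenset(pairs)


def is_complement(a, b, rule=DEFAULT_RULE):
    """True if character a may pair with character b under the given rule."""
    return (_norm(a), _norm(b)) in rule
```

    With r1 parsed from the example above, is_complement("g", "u", r1) is true (a G-U wobble pair), while the same pair fails under the default rule.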

  6. A basic pattern unit can be any of the following:

  7. A string pattern unit is a string of characters, optionally followed by a match qualifier of the form
    	[Mismatches,Deletions,Insertions] 
    
    For example, a string pattern of the form
    	RNYRNYRNYRNY[1,0,0]
    
    would match 12 characters in a nucleotide string while tolerating a single mismatch (R stands for a purine and Y for a pyrimidine). The standard ambiguity codes are used for nucleotides, but only X is allowed as an ambiguous character for proteins. A "deletion" is a character in the pattern to which no character in the matched sequence corresponds, while an "insertion" is a character in the sequence that does not correspond to any character in the string pattern unit.
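
    The three budgets in a match qualifier can be illustrated with a small Python sketch. It checks whether a pattern can be aligned to the whole of a candidate substring within the given numbers of mismatches, deletions and insertions, using the definitions above. The function approx_match is hypothetical, compares literal characters only (ambiguity codes are not expanded), and is not meant to reflect how PatScan searches.

```python
from functools import lru_cache


def approx_match(pattern, text, mismatches=0, deletions=0, insertions=0):
    """True if pattern aligns to all of text within the three budgets."""
    pattern = pattern.upper()
    text = text.upper()

    @lru_cache(maxsize=None)
    def go(i, j, m, d, n):
        # i, j: positions in pattern and text; m, d, n: remaining budgets.
        if i == len(pattern) and j == len(text):
            return True
        if i < len(pattern) and j < len(text):
            if pattern[i] == text[j]:
                if go(i + 1, j + 1, m, d, n):
                    return True
            elif m > 0 and go(i + 1, j + 1, m - 1, d, n):
                return True  # spend one mismatch
        # Deletion: a pattern character with no corresponding text character.
        if i < len(pattern) and d > 0 and go(i + 1, j, m, d - 1, n):
            return True
        # Insertion: a text character with no corresponding pattern character.
        if j < len(text) and n > 0 and go(i, j + 1, m, d, n - 1):
            return True
        return False

    return go(0, 0, mismatches, deletions, insertions)
```

    For example, approx_match("ACGT", "ACT", deletions=1) succeeds because the G in the pattern is left unmatched, whereas approx_match("ACGT", "ACGGT", deletions=1) fails because the extra G in the text would need an insertion.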

  8. A range pattern unit is of the form
    	Min...Max,
    
    which indicates that it will match any subsequence whose length is between Min and Max, inclusive (so 3...3 matches exactly three characters).

  9. A complement pattern unit is used to match against the complement of a string previously matched by a named pattern unit. Thus,
    	~p1
    
    matches the reverse complement of whatever p1 represents. If a special rule of complementation is required, it precedes the ~, thus
    	r1~p2
    
    says "match the reverse complement of whatever p2 matched, where complementation is defined by complementation rule r1". You can also add a match qualifier; thus,
    	~p2[1,0,1]
    
    would allow a single mismatch and a single character bulge in the helix.
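
    A minimal Python sketch of the reverse-complement test follows. Because a custom rule may allow several partners for one base, the check is done pair by pair rather than by building a single complement string. The name pairs_with_reverse is hypothetical, only the mismatch budget is modelled (no insertions or deletions), and the orientation of each pair as (saved character, candidate character) is an assumption that does not matter for the symmetric rules shown in this document.

```python
def _pairs(spec):
    """Build a rule from a comma-separated list such as 'au,ua,gc,cg'."""
    return frozenset(
        (p[0], p[1]) for p in spec.upper().replace("U", "T").split(",")
    )


DEFAULT_RULE = _pairs("at,ta,cg,gc")
R1 = _pairs("au,ua,gc,cg,gu,ug,ga,ag")  # rule r1 from the example above


def pairs_with_reverse(saved, candidate, rule=DEFAULT_RULE, max_mismatches=0):
    """True if candidate is a reverse complement of saved under rule."""
    saved = saved.upper().replace("U", "T")
    candidate = candidate.upper().replace("U", "T")
    if len(saved) != len(candidate):
        return False
    # The first saved base pairs with the last candidate base, and so on.
    bad = sum((a, b) not in rule for a, b in zip(saved, reversed(candidate)))
    return bad <= max_mismatches
```

    If p1 matched GGAC, then GTCC satisfies ~p1, GTCA satisfies only ~p1[1,0,0], and GUCU satisfies r1~p1 because r1 permits the G-U pair.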

  10. A repeat pattern unit is used to match against the value saved in a previously matched named pattern unit. Thus,
    	p1=3...3 p1 p1
    
    would match a 9-character string made up of a 3-mer repeated three times. You can qualify the matches; for example,
    	p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0] p1[1,0,0]
    
    matches a 15-character string which might be thought of as 5 repetitions of a 3-mer that have experienced a few mutations.
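
    The repeat examples can be mimicked with a short Python sketch that slides over a sequence, saves the first unit, and compares each following copy against it with a per-copy mismatch budget. The function tandem_repeat_hits and its parameters are hypothetical; it handles only fixed-length units and mismatches, which is a simplification of what the pattern language allows.

```python
def tandem_repeat_hits(seq, unit_len=3, copies=3, max_mismatches=0):
    """Return start positions where a unit is followed by copies-1 repeats."""
    seq = seq.upper()
    total = unit_len * copies
    hits = []
    for start in range(len(seq) - total + 1):
        unit = seq[start:start + unit_len]  # plays the role of p1
        ok = True
        for k in range(1, copies):
            copy = seq[start + k * unit_len:start + (k + 1) * unit_len]
            # Each later copy is compared with the saved unit, not its neighbour.
            if sum(a != b for a, b in zip(unit, copy)) > max_mismatches:
                ok = False
                break
        if ok:
            hits.append(start)
    return hits
```

    With the defaults this corresponds to p1=3...3 p1 p1; passing copies=5 and max_mismatches=1 corresponds to the qualified five-copy example.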

  11. An any-of pattern unit is used in constructing protein patterns, since we do not allow a rich set of ambiguity codes for amino acids. It has the form any(AAs) where AAs is a string of acceptable characters.

  12. A not-any-of pattern unit is similar to the any-of pattern unit, except that it matches any character not in the designated set.

  13. A weight pattern unit is a somewhat clumsy way to represent a standard weight matrix (it was intended that patterns be generated by programs, and so we deemed the syntax minimally acceptable). The form of the pattern unit is
    	{ List of N-tuples } > MinValue
    
    Suppose that you wanted to match a sequence of eight characters. The "consensus" of these eight characters is GRCACCGS, but the actual "frequencies of occurrence" are given in the matrix below. Thus, the first character is an A 16% of the time and a G 84% of the time. The second is an A 57% of the time, a C 10% of the time, a G 29% of the time, and a T 4% of the time.
    
    	     C1    C2    C3    C4    C5    C6    C7    C8
    	A    16    57     0    95     0    18     0     0
    	C     0    10    80     0   100    60     0    50
    	G    84    29     0     0     0    20   100    50
    	T     0     4    20     5     0     2     0     0
    
    One could use the following pattern unit to search for inexact matches related to such a "weight matrix":
    	{(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
    	 (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
    
    This pattern unit will attempt to match exactly eight characters. Each tuple holds the weights for one position, in the order A, C, G, T. For each character in the sequence, the entry in the corresponding tuple is added to an accumulated sum. If the sum is greater than 450, the match succeeds; otherwise it fails. For protein sequences, you must use 20-tuples (with the entries corresponding to the amino acids in alphabetical order). This will be used only by the most serious aficionados.
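
    The scoring step can be written out as a short Python sketch using the tuples from the example. The names weight_score and weight_match are hypothetical, and the sketch covers the nucleotide case only; characters outside A, C, G, T (after U is folded to T) raise an error rather than being scored.

```python
# Tuples copied from the example pattern unit; entries are in the order A, C, G, T.
WEIGHTS = [
    (16, 0, 84, 0), (57, 10, 29, 4), (0, 80, 0, 20), (95, 0, 0, 5),
    (0, 100, 0, 0), (18, 60, 20, 2), (0, 0, 100, 0), (0, 50, 50, 0),
]
ORDER = "ACGT"


def weight_score(window, weights=WEIGHTS):
    """Sum, position by position, the weight of each character in window."""
    window = window.upper().replace("U", "T")
    if len(window) != len(weights):
        raise ValueError("window length must equal the number of tuples")
    # ORDER.index raises ValueError for characters other than A, C, G, T.
    return sum(w[ORDER.index(ch)] for w, ch in zip(weights, window))


def weight_match(window, min_value=450, weights=WEIGHTS):
    """True if the accumulated sum is strictly greater than min_value."""
    return weight_score(window, weights) > min_value
```

    For the window GACACCGC the sum is 84 + 57 + 80 + 95 + 100 + 60 + 100 + 50 = 626, which exceeds 450, so the unit matches; TTTTTTTT sums to 31 and does not.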

  14. A length-limit pattern unit puts a bound on the sum of the lengths matched by previous named pattern units (which probably named a sequence matched by a range pattern unit). It does not "consume" any of the sequence; rather, it just succeeds or fails. Thus,
    p1=5...5 p2=1...5 p3=3...7 ~p1[1,0,0] p4=3...8 ~p3 length(p2+p4) < 10
    
    would match a pseudo-knot like structure, setting a maximum size on the two unpaired internal subsequences.

Additional Rules

  1. End of string
    For example:
    	TTF $
    
    matches TTF at the end of the database sequence.

  2. Beginning of string
    For example:
    	^ TTF
    
    matches TTF at the beginning of the database sequence.

  3. Palindrome sequence
    For example:
    	p1=4...4 <p1
    
    matches all of the following:
    	AGGD DGGA
    	FAFL LFAF
    	GSAP PASG
    	SAPR RPAS
    
    That is, it matches any four characters followed by the same four characters in reverse order. This is the literal palindrome, not the biologically common meaning of "reverse complement".
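
    The literal-reversal rule is simple enough to sketch in a few lines of Python. The function reversed_repeat_hits is a hypothetical helper that assumes the two halves are directly adjacent in the sequence; it is offered only as an illustration of the rule.

```python
def reversed_repeat_hits(seq, n=4):
    """Return start positions where n characters are followed by their reverse."""
    seq = seq.upper()
    return [
        i
        for i in range(len(seq) - 2 * n + 1)
        if seq[i + n:i + 2 * n] == seq[i:i + n][::-1]
    ]
```

    Note that no complementation is involved: AGGDDGGA is reported at position 0 purely because DGGA is AGGD read backwards.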