Friday, May 25, 2012

Recurring numbers and finite state automata

Today we shall be talking about recurring numbers and we'll be explaining them using finite state automata.

A recurring number is obtained by dividing two whole numbers and getting a decimal whose digits after the point go on forever, repeating the same sequence. For example, if we divide 1 by 3, we'll get 0.333333333... Remember how long division works: at each step we divide the current digit (with any carry in front of it) by the divisor, write the whole number result as the next digit of the answer, and carry the remainder onto the next digit, like so:

1 can be written as 1.00000..., which is what we'll do.

 \------------
3/1 . 0 0 0 0

1 divided by 3 is 0 remainder 1, so we write 0 above and put 1 as a carry on the next digit.

  0
 \-----------
3/1 . 10 0 0

10 divided by 3 is 3 remainder 1, so we write 3 above and put 1 as a carry on the next digit.

  0 . 3
 \------------
3/1 . 10 10 0

10 divided by 3 is 3 remainder 1, so we write 3 above and put 1 as a carry on the next digit.

  0 . 3  3
 \-------------
3/1 . 10 10 10

and so on.
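
The steps above can be sketched in code. This is a minimal sketch of the long-division process (my own, not from the post): at each step the carry is put in front of a zero by multiplying it by 10, the divisor gives the next digit, and the remainder becomes the new carry.

```python
def decimal_digits(numerator, divisor, n):
    """Generate the first n digits after the point of numerator/divisor
    (for numerator < divisor), paired with the remainder each digit leaves."""
    digits = []
    remainder = numerator % divisor
    for _ in range(n):
        remainder *= 10                                 # carry in front of the next 0
        digits.append((remainder // divisor, remainder % divisor))
        remainder %= divisor                            # the remainder is the new carry
    return digits

# 1/3 gives 0.333... and every digit leaves remainder 1
print(decimal_digits(1, 3, 4))   # [(3, 1), (3, 1), (3, 1), (3, 1)]
# 1/7 gives 0.142857... with remainders 3, 2, 6, 4, 5, 1
print(decimal_digits(1, 7, 6))   # [(1, 3), (4, 2), (2, 6), (8, 4), (5, 5), (7, 1)]
```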

But recurring numbers can sometimes be less simple, such as 1/7 which is 0.142857142857142857...:

1 can be written as 1.00000..., which is what we'll do.

 \------------
7/1 . 0 0 0 0

  0
 \-----------
7/1 . 10 0 0

  0 . 1
 \------------
7/1 . 10 30 0

...

  0 . 1  4  2  8  5  7
 \-------------------------
7/1 . 10 30 20 60 40 50 10

...

  0 . 1  4  2  8  5  7  1  4  2  8  5  7
 \-------------------------------------------
7/1 . 10 30 20 60 40 50 10 30 20 60 40 50 10

So why does this happen? Why do the digits stop changing and instead start repeating a sequence? It's because each digit depends only on the remainder left by the previous digit, so if a digit leaves a remainder which some earlier digit already left, then the new digit and new remainder will be the same as that earlier digit's. For example, in this case, the remainder was a 1, and the very first remainder was also a 1, so the sequence will repeat itself:

  0 . 1  4  2  8  5  7
 \-------------------------
7/1 . 10 30 20 60 40 50 10

  0 . 1  4  2  8  5  7  1
 \-------------------------
7/1 . 10 30 20 60 40 50 10

  0 . 1  4  2  8  5  7  1  4
 \---------------------------
7/1 . 10 30 20 60 40 50 10 30

...

  0 . 1  4  2  8  5  7  1  4  2  8  5  7
 \-------------------------------------------
7/1 . 10 30 20 60 40 50 10 30 20 60 40 50 10

More technically, this is because the digits are generated using a finite state, the remainder, and when you try to generate an infinite sequence using a finite state, eventually you must get repetition, since you must reuse a previously used state at some point in the infinite sequence.

We can draw this idea using a finite state automaton where the states are the remainders left by a digit and the transitions give the next digit. For example, this is the finite state automaton for 1/7:

[Figure: state diagram for division by 7 - one circle per remainder, with arrows labelled by the digits they produce]

Remember, the numbers in the circles are remainders, which are used as carries in front of zeros, and the numbers on the arrows are digits of the result. We start from a remainder of 1, which is what we get when we start dividing 1 by 7, and jump from remainder to remainder. Either we end up with a remainder of 0, in which case the result terminates and is not recurring, or we jump to a remainder which has already been visited, in which case we enter an infinite loop, which gives a recurring number. As the transitions show, when dividing by 7 the second case is inevitable, regardless of which non-zero remainder we start with.
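
Since the diagram isn't reproduced here, the automaton can also be written out as a table. A sketch of how to build it (my own reconstruction, not the author's code): the arrow out of state r carries the digit (10·r) ÷ 7 and leads to state (10·r) mod 7.

```python
def automaton(divisor):
    """Map each remainder state to (digit emitted, next remainder state)."""
    return {r: ((10 * r) // divisor, (10 * r) % divisor)
            for r in range(divisor)}

for state, (digit, nxt) in automaton(7).items():
    print(f"remainder {state} --digit {digit}--> remainder {nxt}")
# remainder 1 --digit 1--> remainder 3, remainder 3 --digit 4--> remainder 2, ...
# and remainder 0 loops on itself emitting 0s (the terminating case).
```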

Since dividing by a given number can only leave one of a finite set of remainders, the result must either terminate or recur; it cannot keep changing forever. Furthermore, dividing by n can only leave the remainders 0 to n-1 (x/2 can only leave either 0 or 1 as remainders, x/3 can only leave either 0, 1 or 2 as remainders, etc), so there are only n possible remainders. Since a remainder of 0 would make the number terminate, the length of the recurring part of a number cannot be longer than one less than the denominator of that number. For example, for all x, x/7 cannot have a recurring part longer than 6 digits. Likewise, for all x, x/100 cannot have a recurring part longer than 99 digits.
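
This bound is easy to check numerically. A sketch (my own, with a hypothetical helper name): follow the remainders of x/n until one repeats or hits 0; the length of the cycle can never exceed n - 1.

```python
def period_length(numerator, divisor):
    """Length of the recurring part of numerator/divisor (0 if it terminates)."""
    remainder = numerator % divisor
    seen = {}                      # remainder -> position where it first appeared
    position = 0
    while remainder != 0 and remainder not in seen:
        seen[remainder] = position
        remainder = (remainder * 10) % divisor
        position += 1
    return 0 if remainder == 0 else position - seen[remainder]

print(period_length(1, 3))   # 1, i.e. 0.(3)
print(period_length(1, 7))   # 6, i.e. 0.(142857)
print(period_length(1, 4))   # 0, since 0.25 terminates
# the recurring part of 1/n is never longer than n - 1
assert all(period_length(1, n) <= n - 1 for n in range(2, 200))
```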

Notice that the finite state automaton as used here only works once we start adding zeros after the point during division. For example, if we divide the following:

\--------------------
7/1 . 1  2  3  0  0  0

then the finite state automaton can only be used once we reach the zeros at the end, since the numbers in the automaton assume that the remainders are carried onto zeros. Even if the dividend is itself a recurring number, that recurring number can be converted to a fraction, and since the quotient of two fractions is another fraction, the finite state automaton can still be used on recurring numbers. For example, if we divide the following:

\--------------------
7/0 . 3  3  3  3  3  3 ...

then we can convert 0.3333... into 1/3, so the division becomes the fraction (1/3)/7, which simplifies to 1/(3x7) = 1/21. Now we can use the finite state automaton for 1/21 and reason about when the result should start recurring.
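
This conversion can be checked with Python's fractions module (my own check, not part of the original post):

```python
from fractions import Fraction

# 0.3333... is exactly 1/3, and dividing it by 7 gives 1/21
x = Fraction(1, 3) / 7
print(x)                      # 1/21

# expand 1/21 by long division: the digits recur as 0.(047619)
digits, remainder = [], 1
for _ in range(12):
    remainder *= 10
    digits.append(remainder // 21)
    remainder %= 21
print(digits)                 # [0, 4, 7, 6, 1, 9, 0, 4, 7, 6, 1, 9]
```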

Finally, we see that if we divide an irrational number then the result can also be irrational, since the state is no longer finite but keeps changing with every digit of the dividend.

This is the basis for the finite state automata's pumping lemma, which says that if the number of letters (or digits, in this case) generated by a finite state automaton exceeds the number of states in the automaton, then some previously visited state must be reused, so the letters cannot change arbitrarily forever but must follow a regular pattern from that point on. In this case we have an even stronger pumping lemma: the digits don't just follow a regular pattern after the number of states is exceeded, they repeat the same sequence forever. This is because there is only one transition going out of each state, since only one remainder can be left when dividing a number, so once a state is revisited there is no other path to follow except the previously visited path.
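
The stronger claim can be demonstrated directly: once a remainder reappears, the digits from its first appearance onward repeat verbatim, and any digits before that point form a non-recurring prefix. A small sketch of this (my own, with a hypothetical function name):

```python
def recurring_part(numerator, divisor):
    """Split the digits of numerator/divisor into (prefix, recurring block)."""
    seen, digits = {}, []
    remainder = numerator % divisor
    while remainder != 0 and remainder not in seen:
        seen[remainder] = len(digits)   # remember where this remainder first occurred
        remainder *= 10
        digits.append(remainder // divisor)
        remainder %= divisor
    if remainder == 0:
        return digits, []               # terminating: no recurring block
    start = seen[remainder]             # the cycle restarts here
    return digits[:start], digits[start:]

print(recurring_part(1, 7))    # ([], [1, 4, 2, 8, 5, 7])
print(recurring_part(1, 6))    # ([1], [6]), i.e. 0.1666...
print(recurring_part(1, 4))    # ([2, 5], []), i.e. 0.25 terminates
```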

Tuesday, May 1, 2012

Summary of research paper "Chargaff’s “Grammar of Biology”: New Fractal-like Rules" by Yamagishi & Herai

This is a summary of the research paper http://arxiv.org/ftp/arxiv/papers/1112/1112.1528.pdf. Although it has "grammar of biology" in its title, it's basically about patterns in the frequencies of the 4 nucleotides in DNA: Adenine, Thymine, Cytosine and Guanine.

In case you don't know about it, DNA is a chain of these 4 types of molecules called nucleotides. Each nucleotide is represented by a single letter, the first letter of its name. So "TATA" can be a part of a DNA sequence, for example. DNA is made of 2 strands twisted around each other into a double helix. The two strands must be "complementary": if the first nucleotide on one strand is an "A" then the other strand must start with a "T", if it is a "T" then the other must be an "A", if it is a "C" then the other must be a "G" and if it is a "G" then the other must be a "C", and the same goes for each nucleotide position in each strand.

Erwin Chargaff had discovered 2 simple patterns regarding the frequencies of the nucleotides in most organisms.

Chargaff's first parity rule says that if you count the number of each nucleotide on each strand, you'll find that the number of A nucleotides on one strand is equal to the number of T nucleotides on the other strand and so on for each complementary nucleotide. This is because of what was said in the previous paragraph.

The second parity rule says that the first rule also applies within a single strand, but only approximately. So the number of A nucleotides on a strand is approximately equal to the number of T nucleotides on the same strand, and likewise for each complementary pair. It is less well understood why this should happen. The problem is that if it were because the nucleotides are uniformly random, then not only would the numbers of A and T nucleotides and the numbers of G and C nucleotides be approximately equal, the two pairs would also be approximately equal to each other, which is not the case. So the number of A nucleotides is approximately equal to the number of T nucleotides but not to the number of C or G nucleotides.

The second parity rule was later discovered to be a special case of a more general rule which says that on a given strand of DNA, the frequency of a sequence of nucleotides is approximately equal to the frequency of the reverse of the complement of that sequence. For example, the number of times that "TAG" appears in a strand of DNA is approximately equal to the number of times that "CTA" appears in the same strand.

The research paper being discussed adds some new rules about the frequencies of the nucleotides.

Let's start from some notation.
  • Let w be a nucleotide sequence such as "ATCGAT", S be a DNA strand and S' be the other strand of the DNA.
  • |p| is the length of the string p (could be a word or a strand). For example |"ATA"| = 3.
  • F(w, S) is the fraction of the number of times w occurs in S over |S|-|w|+1. In other words, it is calculated using a sliding window and the number of times the word is found is divided by the number of times the window is moved. For example F("AA", "AATAAA") = 3/5. (This was confirmed in private conversation with the authors of the paper)
  • C(w) is the complementary word of w nucleotide to nucleotide. For example C("ATCG") = "TAGC".
  • R(w) is the reverse of w. For example R("ATCG") = "GCTA".
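
The notation above can be pinned down in code. A sketch (my own implementation of the definitions, not the authors' code), using exact fractions so the sliding-window frequency comes out as stated:

```python
from fractions import Fraction

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def C(w):
    """Complement of w, nucleotide by nucleotide."""
    return "".join(COMPLEMENT[n] for n in w)

def R(w):
    """Reverse of w."""
    return w[::-1]

def F(w, S):
    """Sliding-window frequency: occurrences of w over |S| - |w| + 1 windows."""
    windows = len(S) - len(w) + 1
    count = sum(S[i:i + len(w)] == w for i in range(windows))
    return Fraction(count, windows)

print(C("ATCG"))             # TAGC
print(R("ATCG"))             # GCTA
print(F("AA", "AATAAA"))     # 3/5
```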

Using this notation, Chargaff's rules can be represented as:
1: F(w, S) = F(C(w), S')
2: F(w, S) ≈ F(R(C(w)), S) (generalized version)
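
Rule 1 is exact rather than approximate: complementing both the word and the strand preserves every occurrence, since each position matches before exactly when it matches after. A quick check on a made-up example strand (my own sketch, not from the paper):

```python
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def count(w, S):
    # sliding-window count of occurrences of w in S
    return sum(S[i:i + len(w)] == w for i in range(len(S) - len(w) + 1))

S = "ATTACGATTA"                     # toy strand, made up for illustration
S_other = S.translate(COMPLEMENT)    # the complementary strand: TAATGCTAAT

# every occurrence of w in S lines up with an occurrence of C(w) in S'
for w in ["ATTA", "TA", "CG"]:
    assert count(w, S) == count(w.translate(COMPLEMENT), S_other)
print("rule 1 holds exactly")
```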

What the authors of the paper did was organize all possible words of a certain length into a table which they called a "Math table", such that each row contains a new word, its complement, its reverse and its reverse complement, but each word can only be used once in the table. Since each element of the table is unique, the words in the first column, which generate the other columns, are called a "generating set" or "G", and each word in "G" is labelled "g". According to the authors of the paper, there is no special way to determine which words go into G.

For example here is a Math table for words of length 2:
g    C(g)   R(g)   C(R(g))
AA   TT     -      -
AT   TA     -      -
CC   GG     -      -
CG   GC     -      -
AC   TG     CA     GT
AG   TC     GA     CT
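
The table can be generated mechanically. A sketch (my own, not the authors' procedure): take words in alphabetical order as generators and blank out any entry that has already appeared, so every word occurs exactly once. The rows come out in a different order than the table above, which fits the authors' remark that there is no special way to choose G.

```python
from itertools import product

def C(w):
    return w.translate(str.maketrans("ATCG", "TAGC"))

def math_table(length):
    """Group all words of the given length into rows (g, C(g), R(g), C(R(g))),
    blanking repeated entries so each word occurs exactly once in the table."""
    used, rows = set(), []
    for w in ("".join(p) for p in product("ACGT", repeat=length)):
        if w in used:
            continue                       # already placed in an earlier row
        row = []
        for word in (w, C(w), w[::-1], C(w[::-1])):
            row.append(word if word not in used else "-")
            used.add(word)
        rows.append(tuple(row))
    return rows

for row in math_table(2):
    print(row)
# 6 rows covering all 16 words of length 2
```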

By summing the frequencies of each column in the table it was determined that:
Eq1: SUM(F(g, S) for g in G) ≈ SUM(F(C(g), S) for g in G)
Eq2: SUM(F(R(g), S) for g in G) ≈ SUM(F(C(R(g)), S) for g in G)

Note: I'm not sure if the way G is chosen makes a difference or if the blanks in the Math table affect the summation.

The authors of the paper say that other identities were discovered using the Math table, but don't mention them.

Since SUM(F(g, S) + F(C(g), S) + F(R(g), S) + F(C(R(g)), S) for g in G) = 100% (assuming R(g) returns blank, and hence F(R(g), S) and F(C(R(g)), S) give 0, if R(g) is blank in the Math table),

SUM(F(g, S) + F(C(g), S) for g in G) + SUM(F(R(g), S) + F(C(R(g)), S) for g in G) = 100%

Using Eq1 and Eq2, we can turn this equation into

SUM(2F(g, S) for g in G) + SUM(2F(R(g), S) for g in G) ≈ 100%
2SUM(F(g, S) for g in G) + 2SUM(F(R(g), S) for g in G) ≈ 100%
Eq3: SUM(F(g, S) for g in G) + SUM(F(R(g), S) for g in G) ≈ 50%
Or
Eq4: SUM(F(C(g), S) for g in G) + SUM(F(C(R(g)), S) for g in G) ≈ 50%
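
The starting identity, that the four columns' frequencies sum to 100%, holds exactly: every sliding window contains exactly one of the possible words, and each word appears exactly once across the table's columns, so summing the columns is the same as summing F over all words. A quick check of that on a made-up strand (my own sketch):

```python
from fractions import Fraction
from itertools import product

def F(w, S):
    # sliding-window frequency of w in S
    windows = len(S) - len(w) + 1
    return Fraction(sum(S[i:i + len(w)] == w for i in range(windows)), windows)

S = "ATTACGATTAGGC"   # arbitrary example strand, made up for illustration
total = sum(F("".join(p), S) for p in product("ACGT", repeat=2))
print(total)          # 1, i.e. 100%
```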

Eq3 was confirmed on the DNA of a variety of different species.