Why is DNA read in triplets?

Combinatorially, using three DNA letters for one amino acid makes the most sense. Once the DNA double helix had been discovered, the next big challenge was to work out how the four letters of DNA could code for each of the twenty amino acids that make up proteins. The first question was: how many DNA letters code for each amino acid?

If one DNA letter coded for one amino acid, you could code for a maximum of only four amino acids. Two letters in every possible combination (4 × 4) could code for up to sixteen amino acids: still not enough. But three DNA letters provide 4 × 4 × 4 = 64 combinations, more than enough to code for all twenty amino acids.
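The counting argument can be checked directly. This is a minimal sketch, assuming the four-letter DNA alphabet A, C, G, T; the helper name `codon_count` is illustrative and not taken from the text.

```python
from itertools import product

BASES = "ACGT"      # the four DNA letters
AMINO_ACIDS = 20    # number of amino acids that must be encoded


def codon_count(word_length):
    """Number of distinct words of the given length over the DNA alphabet."""
    return len(list(product(BASES, repeat=word_length)))


for n in (1, 2, 3):
    total = codon_count(n)
    verdict = "enough" if total >= AMINO_ACIDS else "not enough"
    print(f"{n} letter(s): {total} combinations, {verdict}")
```

Three is the smallest word length that covers twenty amino acids, and the surplus of 64 over 20 leaves room for several codons per amino acid and for stop signals.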

So three was the answer: it was a triplet code.

Figure: A codon chart. Codon charts are used to identify the amino acid encoded by a particular three-letter combination of nucleotides.

Since triplet periodicity (TP) in a DNA sequence is unlikely to be changed by a small number of base substitutions [17], a shift between the TP and the reading frame (RF) will persist for a long period of time.
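Returning to the codon chart mentioned above, the lookup it describes can be sketched in a few lines. This assumes the standard genetic code written in the conventional T, C, A, G chart order; the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "TCAG"  # conventional row/column order of a codon chart
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Map every three-letter codon to its one-letter amino acid symbol.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding DNA sequence codon by codon, stopping at a stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing incomplete codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))  # prints MAK
```

Reading the same string one base later would yield entirely different codons, which is why the reading frame matters in everything that follows.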

The presence of such a shift between the TP of a nucleotide sequence and its RF may indicate an RF shift in the gene concerned.

Figure: Influence of a single nucleotide deletion on the TP of a nucleotide sequence. Numbers above sequence S1 show the positions of nucleotides in the RF.

The twenty-fifth base has been deleted from sequence S1. Consequently, sequence S1 can be presented as two sequences, S2 and S3, with TP matrices (TPMs) M2 and M3. This means that the first column of matrix M2 corresponds to the third column of matrix M3, the second column of M2 corresponds to the first column of M3, and the third column of M2 corresponds to the second column of M3.

If sequences S2 and S3 are simply joined to form sequence S4, the TPM M4 results from adding the first column of M2 to the first column of M3, and so on. This merges non-identical columns and considerably decreases the statistical significance of the TP in S4. By accounting for deletions, we instead get a sequence S4 whose TPM M4 merges M2 and M3 with the cyclic permutation taken into account.
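The column correspondence described above can be reproduced on a toy sequence. This is a sketch under stated assumptions: the sequence `"ACG" * 20` is artificial and perfectly periodic, and the helpers `tp_matrix`, `rotate_columns` and `add` are illustrative names, not the authors' code.

```python
BASES = "ACGT"


def tp_matrix(seq):
    """Return a TPM as {base: [count at period position 1, 2, 3]}."""
    m = {b: [0, 0, 0] for b in BASES}
    for i, base in enumerate(seq):
        m[base][i % 3] += 1
    return m


def rotate_columns(m, k):
    """Cyclically move column j to column (j + k) % 3."""
    out = {b: [0, 0, 0] for b in BASES}
    for b in BASES:
        for j in range(3):
            out[b][(j + k) % 3] = m[b][j]
    return out


def add(m_a, m_b):
    """Element-wise sum of two TPMs."""
    return {b: [x + y for x, y in zip(m_a[b], m_b[b])] for b in BASES}


s1 = "ACG" * 20              # toy sequence with perfect triplet periodicity
s2, s3 = s1[:24], s1[25:]    # the 25th base (index 24) is deleted
m2, m3 = tp_matrix(s2), tp_matrix(s3)

naive_m4 = tp_matrix(s2 + s3)                  # deletion ignored: columns are mixed
corrected_m4 = add(m2, rotate_columns(m3, 1))  # M3 shifted by one column before summing

print("naive    :", naive_m4)
print("corrected:", corrected_m4)
```

In the naive matrix each base is spread over two columns, whereas the corrected matrix puts every base back in a single column, which is what restores the significance of the periodicity.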

Finding and accounting for deletions considerably increases the statistical significance of the TP in sequence S4. Several methods have been developed that reveal TP by using the regularity of symbol preferences over the different triplet positions in a DNA sequence. As their mathematical apparatus, they use Fourier transformation, hidden Markov chains and other statistical methods based on position-dependent preferences for nucleotides in coding sequences.

In a TP matrix, columns represent period positions and rows represent nucleotides. In the current work, two problems were set. First, we wanted to find all genes in which RF shifts can be identified by using TP. For each analysed gene from the KEGG databank [25], we extracted the region whose TP had maximal statistical significance, calculated by information decomposition without allowing any deletions or insertions of nucleotides.

Then we searched for a statistically significant extension of the TP region in the same gene, allowing insertions and deletions of nucleotides, by using modified profile analysis (Fig.). More than genes contained a statistically significant shift between TP and ORF, which points to the presence of mutations in these genes originating from an RF shift.

We performed this check for the genes that had mismatches between the gene's RF and its TP. We then chose coordinates L1 and L2, starting from the beginning of the nucleotide sequence, and filled the matrix M(4, 3) for the selected subsequence.

The element m(k, j) of the matrix shows how many times symbol A_k in the nucleotide subsequence from L1 to L2 coincides with the number j in the artificial periodic sequence U. From this matrix we calculated the mutual information I. All analysed sequences represent the coding regions (CDS) of genes without introns. In other words, matrix M is linked to the RF that exists in the analysed gene. This allows us to estimate the statistical significance of the periodicity found.
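The matrix filling and the mutual information can be sketched as follows. The source's own equation is not available here, so the formula in the code is an assumption: the usual contingency-table mutual information in natural-log units, scaled by the sequence length. The function names are illustrative.

```python
from math import log

BASES = "ACGT"


def fill_matrix(seq, period=3):
    """m[k][j]: how often base k coincides with period position j (0-based)."""
    m = [[0] * period for _ in BASES]
    for i, base in enumerate(seq):
        m[BASES.index(base)][i % period] += 1
    return m


def mutual_information(m):
    """Assumed form: I = sum m ln m - sum x ln x - sum y ln y + L ln L,
    where x are row sums, y are column sums and L is the total count."""
    def xlogx(v):
        return v * log(v) if v > 0 else 0.0

    row_sums = [sum(row) for row in m]
    col_sums = [sum(col) for col in zip(*m)]
    total = sum(row_sums)
    return (sum(xlogx(v) for row in m for v in row)
            - sum(xlogx(x) for x in row_sums)
            - sum(xlogx(y) for y in col_sums)
            + xlogx(total))


periodic = mutual_information(fill_matrix("ACG" * 20))
uniform = mutual_information(fill_matrix("A" * 60))
print(f"periodic: {periodic:.2f}, homopolymer: {uniform:.2f}")
```

With this form, a sequence whose bases are independent of the period position gives I close to zero, while a perfectly periodic sequence of length L over three bases gives I = L ln 3.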

The value of I can be reduced to a standard normal distribution. We produced a set of random nucleotide sequences for each length in the range from 30 to nucleotides by using a random number generator. Each of these sets contained 10 sequences. Thereafter, the mutual information was calculated for each sequence from each set. For each set, a histogram showing the distribution of the 2I value was also built.
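A small Monte Carlo experiment of this kind can be sketched as follows. Two assumptions are made that the text above does not confirm: that 2I for random sequences is approximately chi-square distributed with (4 - 1)(3 - 1) = 6 degrees of freedom, and that Fisher's approximation Z = sqrt(2·χ²) - sqrt(2n - 1) is used to map it to a standard normal variable. The sequence length (300) and set size (500) are illustrative only.

```python
import random
from math import log, sqrt

BASES = "ACGT"


def two_i(seq):
    """2I for the 4x3 base-by-position count matrix of seq."""
    m = [[0, 0, 0] for _ in BASES]
    for i, base in enumerate(seq):
        m[BASES.index(base)][i % 3] += 1
    rows = [sum(r) for r in m]
    cols = [sum(c) for c in zip(*m)]
    total = len(seq)
    info = 0.0
    for k in range(4):
        for j in range(3):
            if m[k][j]:
                info += m[k][j] * log(m[k][j] * total / (rows[k] * cols[j]))
    return 2.0 * max(info, 0.0)  # clamp tiny negative rounding errors


def z_score(two_i_value, dof=6):
    """Fisher's normal approximation to a chi-square variable (an assumption here)."""
    return sqrt(2.0 * two_i_value) - sqrt(2.0 * dof - 1.0)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [two_i("".join(rng.choice(BASES) for _ in range(300)))
           for _ in range(500)]
mean_2i = sum(samples) / len(samples)
print(f"mean 2I over random sequences: {mean_2i:.2f} (chi-square with 6 dof has mean 6)")
print(f"Z for a perfectly periodic sequence: {z_score(two_i('ACG' * 100)):.1f}")
```

Comparing the histogram of simulated 2I values with the assumed distribution is what justifies reading Z as a number of standard deviations.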

All sequences with TP found in the current work were longer than 60 bp. Let us refer to a nucleotide sequence found in this way as T.

Thereafter, we saved the maximal sequence found for the given gene, its coordinates in the gene and the periodicity matrix M, which shows the type of TP found. To choose a threshold value for Z, we generated a set of random DNA sequences with the same size and sequence length distribution as the genes from the 29th release of the KEGG databank. The point is that a gene's TP can be split by insertions and deletions into several sections, each of which may have a rather low level of Z that is nevertheless greater than 5.

However, the matrices M for such sections in a gene will be identical or very similar, but cyclically shifted against each other (Fig.). In this case, joining these sections into a single one can considerably increase the statistical significance of the joined region, which can be found by making an alignment against matrix M (see Section 2).

Therefore, using a relatively low threshold value of Z ensures that TP regions split into several sections by insertions and deletions are not missed.

We applied this algorithm to those DNA sequences in which TP without insertions and deletions had been revealed by the method of information decomposition. Let the coordinates of the start and end of the local alignment R found be r1 and r2. To estimate its statistical significance, we used the Monte Carlo method. We generated a set Q of random nucleotide sequences in which the region T was left unchanged and the regions of sequence S from 1 to t1 and from t2 to L were randomly shuffled.
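Generating such a randomised set can be sketched as below. The function name `shuffle_flanks`, the toy sequence and the set size of 1000 are illustrative assumptions; the sketch also treats t1 and t2 as 1-based inclusive bounds of T, so only the bases strictly outside T are shuffled.

```python
import random


def shuffle_flanks(seq, t1, t2, rng):
    """Shuffle seq outside the 1-based inclusive region [t1, t2]; keep T intact."""
    left = list(seq[:t1 - 1])
    core = seq[t1 - 1:t2]
    right = list(seq[t2:])
    rng.shuffle(left)    # flank before T, shuffled on its own
    rng.shuffle(right)   # flank after T, shuffled on its own
    return "".join(left) + core + "".join(right)


rng = random.Random(1)                        # fixed seed for repeatability
s = "TTGACA" + "ACGACGACGACG" + "GGCATA"      # toy sequence; T occupies positions 7..18
q = [shuffle_flanks(s, 7, 18, rng) for _ in range(1000)]
print(q[0])
```

Shuffling each flank separately preserves its base composition, so any score obtained on the set reflects chance extensions of T rather than a change in nucleotide content.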

The set Q contained 10^6 sequences.

Figure: Example of an alignment against the weight matrix obtained from the TPM. Here, t1 and t2 are the coordinates of the region with continuous TP found by information decomposition (the T region), and r1 and r2 are the coordinates of the extended region with TP found by dynamic programming (the R region). Points r1 and r2 on the optimal alignment path have the coordinates (i0, j0) and (im, jm) in the matrix, respectively (see Section 2).

We also estimated the statistical significance Z_R of alignment R by the Monte Carlo method, in the way described earlier. G is the similarity function for the global alignment R, and it is calculated in the same way as function F. We then determined the value of Z_R. Using the matrix M, we built a corrected position-specific matrix of base weights, W, as we suggested earlier [28].

For each alignment, we estimated Z_T, the statistical significance of the alignment of the T region, by the Monte Carlo method, similarly to Section 2. To this end, for each sequence from the QV set we generated a set of sequences QT in which the region T was randomly shuffled, and then determined the value of Z_T. The sum in Equation 7 is taken over the N sequences of the QV set. For further calculations we used the value of C at which X_C is maximal.

We selected the value of C for each matrix M. The transition to a weight matrix assigns a higher weight to infrequent bases when they have a high frequency at a given position of the profile and, vice versa, a lower weight to such bases when they have a low frequency at that position. Thereby the correlation in the formation of adjacent insertions or deletions is taken into account. On the basis of the introduced weights, we can find the optimal alignment between the analysed sequence and the profile.

Let us create a profile matrix q(i, j) of size L. To find the local optimal alignment of the sequence against the profile q(i, j), we applied the method of dynamic programming. Here, the index i stands for the nucleotide in sequence s(i) and the index j for the column number in the profile matrix q. Initial values are specified for the similarity function F. For calculations on the triplet matrix, d equals 2.

While building the local alignment, we determined the maximal value of the similarity function F and the coordinates (im, jm) corresponding to this maximal value.

Then we determined the path from point (im, jm) back to point (i0, j0), where the value of the similarity function becomes zero for the first time. According to this path, we built the alignment between sequence S and the profile matrix q.
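The local alignment and traceback described above can be sketched with a Smith-Waterman-style recurrence. This is not the source's Equation 9, which is not available here: the sketch assumes a linear gap penalty d = 2, toy weights of +1 and -1 instead of the corrected weight matrix W, and a profile whose columns repeat with period 3. All names are illustrative.

```python
def local_align(seq, weights, profile_len, d=2.0):
    """Local alignment of seq against a period-3 profile.

    weights[base][c] scores a base at period column c (0, 1, 2); profile
    column j (1-based) uses period column (j - 1) % 3.
    Returns the best score and the path of (i, j) cells from start to end.
    """
    n = len(seq)
    F = [[0.0] * (profile_len + 1) for _ in range(n + 1)]
    best, end = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, profile_len + 1):
            w = weights[seq[i - 1]][(j - 1) % 3]
            F[i][j] = max(0.0,
                          F[i - 1][j - 1] + w,  # base i aligned to column j
                          F[i - 1][j] - d,      # extra base in the sequence
                          F[i][j - 1] - d)      # profile column left empty
            if F[i][j] > best:
                best, end = F[i][j], (i, j)
    # Trace back from the maximum until the score drops to zero.
    path = []
    i, j = end
    while i > 0 and j > 0 and F[i][j] > 0:
        path.append((i, j))
        w = weights[seq[i - 1]][(j - 1) % 3]
        if F[i][j] == F[i - 1][j - 1] + w:
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            i -= 1
        else:
            j -= 1
    return best, path[::-1]


# Toy weights: +1 where the base is the preferred one for that column, -1 otherwise.
PREFERRED = "ACG"
weights = {b: [1.0 if b == PREFERRED[c] else -1.0 for c in range(3)]
           for b in "ACGT"}

seq = "ACGACG" + "CG" + "ACGACG"   # an 'A' has been deleted after the second codon
score, path = local_align(seq, weights, profile_len=15)
print(score, path[0], path[-1])    # prints 12.0 (1, 1) (14, 15)
```

The alignment pays one gap penalty to skip a profile column at the deletion and then continues in the restored frame, which is how a region with TP can be extended across a frame shift.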

We used Equation 9 for carrying out the global alignment, but without the zero on the right-hand side of the equation. All other parameters were the same as in the case of the local alignment. Initial values are likewise specified for the similarity function F for global alignment. At the same time, it was important that the alignment be able to find the region T revealed by the method of information decomposition (Section 2).

Index i varies from 1 to 4. This algorithm was used to generate the random sequences contained in QV. We then made alignments for the set of random sequences QV and determined the number of sequences having insertions or deletions within the region from t1 to t2.

Alignments of the sequences from set QV were also rebuilt using these new parameters. We found regions having continuous TP in genes. These results agree with earlier works on TP detection that used either informational methods or other techniques. This means that the distance from the left and right edges of the TP region to the start and end of the gene was more than 30 bp.

This criterion was satisfied for TP regions. Then we aligned the nucleotide sequences of the corresponding genes against the TPM (see Section 2). General information describing all these sequences can be found in the Supplementary data section. Let us consider those genes in which the region of continuous TP was extended by taking into account nucleotide insertions and deletions (see Section 2).

Further, we will discuss the nucleotide sequences with coordinates from r1 to t1 and from t2 to r2 (Fig.). This is the reason why we found a statistically significant alignment from r1 to r2 in sequence S against the weight matrix constructed on the basis of TPM M.

We suppose that the TP found in sequences T1 and T2 is a trace of some ancient RF that existed in these nucleotide sequences earlier. The first column of matrix M corresponds to the first codon base in sequence S, whereas, owing to insertions and deletions of nucleotides, in subsequences T1 and T2 matrix M corresponds to an alternative ancient RF, which may not match the actual RF there. This assumption is based on the idea that if a gene responsible for the same genetic function exists in several genomes, then an insertion or deletion of nucleotides in this gene within one genome need not lead to analogous changes in another genome.

It is important that these sequences are already known and have not accumulated many evolutionary alterations. This will allow us to see their similarity. We conducted such an investigation within the scope of the present work.


