Mouse gene

2) Unprocessed pseudogenes arise either from the duplication of a gene during DNA replication or from genes that have degenerated, become inactive and are no longer under selection.

How do we recognise pseudogenes? Well, processed pseudogenes are all exons and no introns. And both types accumulate mutations neutrally, including such things as multiple frame-shifts and premature stop codons. There is another measure (which will also be important when we look at proteins in a future article): the ratio of the non-synonymous to the synonymous mutation rate, so let's look at it now. Each amino acid in a protein is coded by a sequence of three bases called a triplet codon. Now there are 20 amino acids that make up proteins, but 64 different codons (4 possible bases to select from at each of the three positions, which gives 4³ = 64 combinations). That means that most amino acids can be coded by several different codons.

Triplet Codons and Corresponding Amino Acids

	U		C		A		G
U	UUU	Phe	UCU	Ser	UAU	Tyr	UGU	Cys
	UUC		UCC		UAC		UGC
	UUA	Leu	UCA		UAA	STOP	UGA	STOP
	UUG		UCG		UAG		UGG	Trp
C	CUU	Leu	CCU	Pro	CAU	His	CGU	Arg
	CUC		CCC		CAC		CGC
	CUA		CCA		CAA	Gln	CGA
	CUG		CCG		CAG		CGG
A	AUU	Ile	ACU	Thr	AAU	Asn	AGU	Ser
	AUC		ACC		AAC		AGC
	AUA		ACA		AAA	Lys	AGA	Arg
	AUG	Met	ACG		AAG		AGG
G	GUU	Val	GCU	Ala	GAU	Asp	GGU	Gly
	GUC		GCC		GAC		GGC
	GUA		GCA		GAA	Glu	GGA
	GUG		GCG		GAG		GGG

The table of mRNA triplet codons and the amino acids they code for. Note that T (thymine) in DNA is transcribed as U (uracil) in RNA. mRNA is transcribed from the anti-sense strand of DNA.
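The counting argument above (64 codons, 20 amino acids, 3 stop codons) can be checked with a short Python sketch. The one-letter amino acid string below is the standard genetic code written in the same U, C, A, G order as the table; the variable names are my own and purely illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order ("*" = STOP).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet (UUU, UUC, UUA, ...) to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons code for each amino acid (the redundancy of the code).
degeneracy = Counter(CODON_TABLE.values())
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]

print(len(CODON_TABLE))    # 4**3 = 64 codons in total
print(len(sense_codons))   # 61 codons that code for an amino acid
print(degeneracy["L"], degeneracy["F"], degeneracy["M"])  # Leu 6, Phe 2, Met 1
```

Leu has six codons while Met has only one, which is why a random change to a codon's third base often leaves the amino acid unchanged.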

Now for a gene under selection, synonymous mutations (ie mutations which substitute one codon for another coding for the same amino acid) produce the same protein and are not acted on by natural selection. For example, Phe (phenylalanine) is coded by both UUU and UUC, and some amino acids, such as Leu, are coded by six different codons. Non-synonymous mutations produce a different amino acid and hence a modified protein, and are acted on by natural selection. So if we look at fixed synonymous versus non-synonymous mutations in an active gene we will see different rates for the two types of mutation. A pseudogene is not translated into protein, so there should be no difference between its synonymous and non-synonymous mutation rates (the ratio of the non-synonymous to the synonymous mutation rate is known as the K_a/K_s ratio). However, very recent pseudogenes are quite difficult to spot by accumulated mutations, as they would have been acted on by natural selection for all the time they were active genes. Nevertheless, the K_a/K_s ratio of a sequence is strong evidence for whether it is an active gene or a pseudogene.
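To make the distinction concrete, here is a minimal Python sketch that compares two aligned coding sequences codon by codon and counts the synonymous and non-synonymous differences. It is only an illustration: the function name and the example sequences are made up, and the raw counts it returns are not K_a/K_s itself, since a real estimate also divides by the number of synonymous and non-synonymous sites and corrects for repeated mutations at the same site.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order ("*" = STOP).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, non_synonymous) codon differences.

    Both arguments are assumed to be gap-free mRNA coding sequences of the
    same length, already aligned and in the same reading frame.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1       # same amino acid, invisible to selection
        else:
            non_synonymous += 1   # different amino acid (or a stop codon)
    return synonymous, non_synonymous


# Hypothetical example: UUU -> UUC keeps Phe, GGU -> GAU changes Gly to Asp.
print(count_codon_differences("UUUCUAGGU", "UUCCUAGAU"))  # (1, 1)
```

In an active gene we would expect the synonymous count to dominate once the number of available sites is taken into account; in a pseudogene the two kinds of change should accumulate at about the same rate.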

One extreme case of pseudogenes is the Gapdh gene. Mouse has one functional Gapdh gene but 400 pseudogenes scattered about many of the mouse's chromosomes (note that this is an exceptional number – don't run away with the idea that all genes have that many pseudogenes – in fact the average is likely to be around 1 pseudogene per gene. About 18,000 pseudogenes were found altogether). Of the 400, nearly 300 are easily identified as pseudogenes by the methods above, but 100 are recent enough that they needed to be identified as pseudogenes by careful manual inspection. But the fact that we now have mouse and human genomes gives us another line of attack: the pseudogenes on the mouse genome do not have a corresponding homologous gene in the same syntenic position in humans whereas the active gene does.

By looking suspiciously and closely at predicted mouse genes that fail to have a human homologue in a syntenic location, the researchers found 4,000 that were actually pseudogenes rather than real genes. The average number of exons in these pseudogenes was less than half that in actual genes (as many have been deleted once the gene has become inactive), and this is just as predicted. Of the total of 18,000 pseudogenes found (14,000 clearly such, plus the 4,000 previously classified as genes), more than half are processed pseudogenes (they have no introns). There are probably a good many more pseudogenes that haven't been identified because they are ancient and have decayed so far, owing to neutral mutation over millions of years, that they are unrecognisable – see the article on repeat sequences.

Now, having identified which sequences are pseudogenes and having removed them from the gene catalogue, it is possible to do a comparison of the mouse and human gene sets. At the time of publication of the draft mouse genome, the headline writers in popular publications came up with sensational and unjustified claims such as "Mouse 99% same as Human" and other misleading statements. This is what was actually determined: 99% of mouse genes have homologues in man (the actual protein similarity is much less than 99%. See article on mouse proteins.) Of these, 96% are in the same syntenic location in man as in mouse. 80% of mouse genes that have a match on the same syntenic region in man are also the best match for that human gene. These are called 1:1 orthologues, ie not just similar genes but genes that have descended and diverged from a common ancestor.
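The "best match in both directions" idea behind 1:1 orthologues can be sketched in a few lines of Python. This is only an illustration of the reciprocal-best-match logic, not the procedure the researchers actually ran: the gene names and similarity scores are invented, and the synteny check is left out.

```python
def reciprocal_best_matches(scores):
    """Return (mouse_gene, human_gene) pairs that are each other's best match.

    `scores` is a hypothetical dict mapping (mouse_gene, human_gene) to a
    similarity score, higher meaning more similar. Ties keep the first pair seen.
    """
    best_for_mouse = {}
    best_for_human = {}
    for (mouse, human), score in scores.items():
        if mouse not in best_for_mouse or score > scores[(mouse, best_for_mouse[mouse])]:
            best_for_mouse[mouse] = human
        if human not in best_for_human or score > scores[(best_for_human[human], human)]:
            best_for_human[human] = mouse
    return sorted(
        (mouse, human)
        for mouse, human in best_for_mouse.items()
        if best_for_human[human] == mouse
    )


# Invented scores: m2 is most similar to h1, but h1's best match is m1,
# so m2 has a homologue without having a 1:1 orthologue.
example_scores = {
    ("m1", "h1"): 0.95,
    ("m1", "h2"): 0.60,
    ("m2", "h1"): 0.70,
    ("m2", "h2"): 0.65,
    ("m3", "h2"): 0.90,
}
print(reciprocal_best_matches(example_scores))  # [('m1', 'h1'), ('m3', 'h2')]
```

A gene like m2 in this toy example is the sort of case that separates "has a homologue" (the 99% figure) from "has a 1:1 orthologue" (the 80% figure).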

The fewer than 1% of mouse genes (118) with no homologues in humans do have homologues in other species. So we can explain them as follows: either the corresponding gene has been deleted from the human genome, or it is rodent specific (unlikely since they are all known in other organisms), or the corresponding gene has not been found in humans yet, or the genes might be evolving so rapidly in one or other lineage that they are unrecognisable as homologues.

A completely different method for predicting genes (not based on looking for sequences which code for known proteins) was also used. This identifies genes by looking for statistical properties of coding regions, TATA boxes, UTRs, splice sites, introns etc. This process is enhanced by applying it to two genomes simultaneously, and it was applied to the human and mouse sequences. This technique found a possible further 12,000 exons beyond the existing catalogue. By sampling a subset and checking the predictions experimentally, it seems that about 6,000 of these are actual active exons, yielding about 1,000 additional genes.

How many genes does a mammal have? Well, the current count of predicted genes in human and mouse is about 23,000, with 190,000 exons. But there is a database of complementary DNA from mammals (cDNA is made from the mRNA present in different mouse tissues using reverse transcriptase, and corresponds to the exons in genes). 79% of known mouse cDNAs are in the exons predicted from the mouse sequence, so we are missing about 21%. Taking that, together with the fact that not all cDNAs have been identified and that some predictions are false positives, gives an exon count of about 225,000 – 250,000. From other data, we know that there are on average 8.3 exons per mouse gene, and that would give 27,000 – 30,000 genes in the mammalian genome. Although the number has fluctuated wildly in the last 3 years, we seem to be homing in on a number around 30,000.
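The arithmetic behind that estimate is easy to reproduce. The Python sketch below uses only the figures quoted above; the simple scaling by 79% is my own back-of-envelope step and ignores the further adjustments (unidentified cDNAs, false positives) that went into the published 225,000 – 250,000 range.

```python
PREDICTED_EXONS = 190_000
CDNA_COVERAGE = 0.79          # fraction of known cDNAs found in predicted exons
EXONS_PER_GENE = 8.3          # average for mouse genes
EXON_RANGE = (225_000, 250_000)


def naive_exon_estimate(predicted_exons, coverage):
    """Scale the predicted exon count up by the fraction we seem to detect."""
    return predicted_exons / coverage


def gene_estimate(exon_count, exons_per_gene):
    """Convert an exon count into a gene count via the average exons per gene."""
    return exon_count / exons_per_gene


naive = naive_exon_estimate(PREDICTED_EXONS, CDNA_COVERAGE)
low, high = (gene_estimate(n, EXONS_PER_GENE) for n in EXON_RANGE)

print(round(naive))             # about 240,500 exons, inside the quoted range
print(round(low), round(high))  # about 27,100 and 30,100 genes
```

So the naive correction alone already lands inside the quoted exon range, and dividing by 8.3 exons per gene gives the 27,000 – 30,000 figure.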

However, if there are small single exon genes not strongly expressed they would not be detected and would not be included in the 30,000.

RNA Genes

Finally the researchers looked at RNA genes. These genes do not code for proteins but for RNA, including the tRNA used for transferring amino acids to the polypeptide chain in the ribosome. The human catalogue had 518 tRNA genes and 118 pseudogenes. It is much more difficult to identify tRNA genes in mouse because mouse has an active SINE (repeat sequence - see earlier post) that is derived from tRNA and leaves debris scattered about the genome that looks like tRNA genes. At first pass the researchers found 2,764 tRNA genes and 22,000 pseudogenes, but the vast majority were masked out as SINEs. That left 498 possibles. But we expect active tRNA genes to be extremely highly conserved across species. If we include only genes with 95% sequence identity we find 335 in mouse and 345 in man, of which about 250 are absolutely identical. That set includes all the 46 expected anti-codons. Only 46 anti-codons are needed to translate the 61 sense codons because of the famous Crick wobble rules, which state that the anti-codon base that pairs with the third position of a codon (the first, or 5', base of the anti-codon) can pair with more than one base, and so a single anti-codon can translate more than one codon. This occurs without loss of information since, you will remember, the set of codons has redundancy. So the 61 codons are translated into the 20 amino acids via 46 anti-codons.
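The wobble rules can be written down as a small lookup table. The Python sketch below lists the codons a given anti-codon can read, assuming Crick's original pairings and writing both codon and anti-codon 5' to 3', so that the wobble base is the first letter of the anti-codon and pairs with the third letter of the codon. The function name is my own, "I" stands for inosine, and real tRNAs carry other modified bases that this sketch ignores.

```python
# Ordinary Watson-Crick pairing, used at the first two codon positions.
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Crick's wobble pairings: anti-codon 5' base -> codon third bases it can read.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}


def codons_read_by(anticodon):
    """List the mRNA codons (5'->3') that an anti-codon (5'->3') can translate.

    Codon and anti-codon pair antiparallel, so the anti-codon's third base
    pairs with the codon's first base, and its first base is the wobble base.
    Inosine ("I") is only handled in the wobble position.
    """
    wobble_base, middle, last = anticodon
    stem = WATSON_CRICK[last] + WATSON_CRICK[middle]
    return [stem + third for third in WOBBLE[wobble_base]]


print(codons_read_by("GAA"))  # ['UUC', 'UUU'], both Phe
print(codons_read_by("IGC"))  # ['GCU', 'GCC', 'GCA'], all Ala
print(codons_read_by("CAU"))  # ['AUG'], Met only
```

Because the codons that share their first two bases usually code for the same amino acid, one anti-codon reading two or three of them costs nothing in accuracy.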

Would any of this be possible without common descent? The increased richness that is possible by comparing the mammalian genomes rather than just looking at one is astonishing. We learn a great deal by comparing the genomes and the value of doing that relies entirely on the relationship and common descent of human and mouse.

1. For an excellent, very detailed review of pseudogenes and their implications for the evolution/creation debate, see Edward Max, Plagiarized Errors and Molecular Genetics.