3. Comparison of mouse and human coding genes

Alec MacAndrew

 

The draft mouse genome was published on 6th December 2002: Waterston et al, Nature 420, 520 - 562.

Note that this is a 43 page paper (Nature averages 2 - 3 pages per paper) with around 200 authors and 330 references. This is all new to science, and the volume of material is more than a very fat text book if one includes the references. The detail is published not in a single paper, but in about six related papers occupying more than half of the super fat 6th December issue of Nature.

 Genes and Pseudogenes

The first part of this review focuses on the protein encoding genes. The current human gene catalogue contains just under 23,000 predicted genes in just under 200,000 exons. (Remember that genes are not continuous blocks of code in the genome but are interrupted by long stretches of non-coding sequence. The coding sections are called exons, averaging about nine per gene, and the interrupting sections are called introns.) It is known that this is an incomplete set of genes – see below.

The mouse gene catalogue contains just over 22,000 predicted genes in 191,000 exons. It is important to remember that the catalogues, although getting better, are still incomplete as the sequences themselves are incomplete. There will be some genes missing or incorrectly described and some predicted genes are likely to be pseudogenes. (1)

The fact that we now have both human and mouse sequences allows some of these errors to be addressed. Let's look first at pseudogenes. Pseudogenes originate as real genes, but they have lost their functionality: they no longer produce a working protein and they are not under selection. There are two types of pseudogene:

1) Processed pseudogenes arise from an error that follows transcription. The DNA in a gene is transcribed to nuclear RNA. The introns are excised and the exons are spliced to form messenger RNA, which is a continuous sequence of code. So far this is the normal process. But it is possible at this stage for the mRNA to be reverse-transcribed back into the genomic DNA at a more or less random location. This creates what is called a processed pseudogene, which, you will realise, contains no introns as they have been excised in an earlier step.

2) Unprocessed pseudogenes are either duplicate copies of a gene, created by an error in DNA replication, or degenerated genes that have become inactive and are no longer under selection.

How do we recognise pseudogenes? Well, processed pseudogenes are all exons and no introns. And both types accumulate neutral mutations, including things such as multiple frame-shifts and premature stop codons. There is another measure (that will also be important when we look at proteins in a future article), which is the ratio of the non-synonymous to the synonymous mutation rate, so let's look at it now. Each amino acid in a protein is coded by a sequence of three bases called a triplet codon. Now there are 20 amino acids that make up proteins, but 64 different codons (4 possible bases to select from at each of the three positions, which is 4^3 = 64 combinations). That means that most amino acids can be coded by several different codons.

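Triplet Codons and Corresponding Amino Acids

(Rows give the first base of the codon, columns give the second base.)

First base   U            C            A            G

U            UUU Phe      UCU Ser      UAU Tyr      UGU Cys
             UUC Phe      UCC Ser      UAC Tyr      UGC Cys
             UUA Leu      UCA Ser      UAA STOP     UGA STOP
             UUG Leu      UCG Ser      UAG STOP     UGG Trp

C            CUU Leu      CCU Pro      CAU His      CGU Arg
             CUC Leu      CCC Pro      CAC His      CGC Arg
             CUA Leu      CCA Pro      CAA Gln      CGA Arg
             CUG Leu      CCG Pro      CAG Gln      CGG Arg

A            AUU Ile      ACU Thr      AAU Asn      AGU Ser
             AUC Ile      ACC Thr      AAC Asn      AGC Ser
             AUA Ile      ACA Thr      AAA Lys      AGA Arg
             AUG Met      ACG Thr      AAG Lys      AGG Arg

G            GUU Val      GCU Ala      GAU Asp      GGU Gly
             GUC Val      GCC Ala      GAC Asp      GGC Gly
             GUA Val      GCA Ala      GAA Glu      GGA Gly
             GUG Val      GCG Ala      GAG Glu      GGG Gly
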
The table of mRNA triplet codons and the amino acids they code for. Note that T (thymine) in DNA is replaced by U (uracil) in RNA. mRNA is transcribed from the anti-sense strand of DNA.
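The redundancy of the code can be checked by building the table programmatically. The following is a minimal Python sketch, not anything from the paper: the names `BASES`, `AMINO_ACIDS` and `CODON_TABLE` are illustrative, amino acids are written in one-letter form (F = Phe, L = Leu and so on) and `*` stands for STOP.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the 64 codons, listed with the first base
# varying slowest and the third base fastest (UUU, UUC, UUA, UUG, UCU, ...).
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet, e.g. "UUU", to its amino acid.
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons code for each amino acid (or for STOP).
degeneracy = Counter(CODON_TABLE.values())
sense_codons = sum(n for aa, n in degeneracy.items() if aa != "*")

print(len(CODON_TABLE))   # total number of codons: 4 ** 3
print(sense_codons)       # codons that specify an amino acid
print(degeneracy["L"])    # codons for leucine
print(degeneracy["*"])    # stop codons
```

Counting this way gives 64 codons, of which 61 specify one of the 20 amino acids and 3 are stops; leucine, serine and arginine each have six codons, while methionine and tryptophan have only one.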

Now for a gene under selection, synonymous mutations (ie mutations which substitute one codon for another coding for the same amino acid) leave the protein unchanged and so are largely invisible to natural selection. For example, Phe (phenylalanine) is coded by both UUU and UUC, and some amino acids, such as Leu, are coded by six different codons. Non-synonymous mutations produce a different amino acid and hence a modified protein, and are acted on by natural selection. So if we look at fixed synonymous versus non-synonymous mutations in an active gene, we will see a different rate for the two types of mutation. A pseudogene is not translated into protein, so there should be no difference between its synonymous and non-synonymous mutation rates (the ratio between the non-synonymous and synonymous mutation rates is known as the Ka/Ks ratio; it is well below 1 for most active genes and close to 1 for a sequence that is not under selection). However, very recent pseudogenes are quite difficult to spot by accumulated mutations, as they were acted on by natural selection for all the time they were active genes. Nevertheless, the Ka/Ks ratio of a sequence is strong evidence for whether it is an active gene or a pseudogene.
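To make the distinction concrete, here is a rough Python sketch that classifies the codon differences between two aligned coding sequences as synonymous or non-synonymous. It is a simplification and not the method used in the paper: a real Ka/Ks calculation also normalises by the number of synonymous and non-synonymous sites and corrects for multiple hits, whereas this only counts differing codons. The function name and the example sequences are invented for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, first base varying slowest; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_differences(seq_a, seq_b):
    """Count (synonymous, non_synonymous) codon differences between two
    aligned, in-frame mRNA sequences of equal length."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be in frame and of equal length")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue  # identical codons carry no information here
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # same amino acid, protein unchanged
        else:
            non_synonymous += 1  # different amino acid, protein altered
    return synonymous, non_synonymous


# Invented example: Met-Phe-Leu-Gly versus Met-Phe-Leu-Glu.
counts = classify_differences("AUGUUUCUUGGA", "AUGUUCCUAGAA")
print(counts)
```

In this toy pair, UUU/UUC and CUU/CUA are synonymous changes and GGA/GAA is non-synonymous, so the function reports two synonymous differences and one non-synonymous difference.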

One extreme case of pseudogenes is the Gapdh gene. Mouse has one functional Gapdh gene but 400 pseudogenes scattered about many of the mouse's chromosomes. (Note that this is an exceptional number. Don't run away with the idea that all genes have that many pseudogenes; in fact the average is likely to be around 1 pseudogene per gene. About 18,000 pseudogenes were found altogether.) Of the 400, nearly 300 are easily identified as pseudogenes by the methods above, but 100 are recent enough that they needed to be identified as pseudogenes by careful manual inspection. But the fact that we now have mouse and human genomes gives us another line of attack: the pseudogenes in the mouse genome do not have a corresponding homologous gene in the same syntenic position in humans, whereas the active gene does.

By looking suspiciously and closely at predicted mouse genes that fail to have a human homologue in a syntenic location, the researchers found 4,000 that were actually pseudogenes rather than real genes. The average number of exons in these pseudogenes was less than half that in actual genes, just as predicted, since many exons are deleted once a gene has become inactive. Of the total of 18,000 pseudogenes found (14,000 clearly such, plus the 4,000 previously classified as genes), more than half are processed pseudogenes (they have no introns). There are probably a good many more pseudogenes that haven't been identified because they are ancient and have decayed so far, owing to neutral mutation over millions of years, that they are unrecognisable – see the article on repeat sequences.
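Two of the sequence-based signs of decay mentioned earlier, premature stop codons and frame-shifts, are easy to check mechanically. The Python sketch below is only an illustration under simple assumptions (the sequence is an mRNA string starting at the first codon, and a length that is not a multiple of three is taken as a crude hint of a frame-shift); the function names and example sequences are made up.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def premature_stops(mrna):
    """Return the 0-based indices of in-frame stop codons that occur
    before the final codon of the sequence."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return [index for index, codon in enumerate(codons[:-1])
            if codon in STOP_CODONS]


def length_suggests_frameshift(mrna):
    """A coding sequence whose length is not a multiple of three has
    gained or lost bases relative to an intact reading frame."""
    return len(mrna) % 3 != 0


intact = "AUGUUUGGAUAA"      # Met Phe Gly STOP
decayed = "AUGUAGGGAUAA"     # a stop codon has appeared at codon 1
shifted = "AUGUUGGAUAA"      # one base deleted from the intact sequence

print(premature_stops(intact), length_suggests_frameshift(intact))
print(premature_stops(decayed), length_suggests_frameshift(decayed))
print(premature_stops(shifted), length_suggests_frameshift(shifted))
```

A recent pseudogene may pass both checks, which is why the roughly 100 young Gapdh copies needed manual inspection and why the syntenic comparison with human is so useful.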

Comparison of Mouse and Human gene sets

Now, having identified which sequences are pseudogenes and having removed them from the gene catalogue, it is possible to do a comparison of the mouse and human gene sets. At the time of publication of the draft mouse genome, the headline writers in popular publications came up with sensational and unjustified claims such as "Mouse 99% same as Human" and other misleading statements. This is what was actually determined: 99% of mouse genes have homologues in man (the actual protein similarity is much less than 99%.  See article on mouse proteins.) Of these, 96% are in the same syntenic location in man as in mouse. 80% of mouse genes that have a match on the same syntenic region in man are also the best match for that human gene. These are called 1:1 orthologues, ie not just similar genes but genes that have descended and diverged from a common ancestor.
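The idea of a 1:1 orthologue, where each gene is the other's best match, can be sketched in a few lines of Python. This is a toy illustration only: the gene names and similarity scores are invented, and the check for syntenic location that the researchers also applied is left out.

```python
def reciprocal_best_matches(scores):
    """Given {(mouse_gene, human_gene): similarity_score}, return the
    sorted list of pairs in which each gene is the other's best match."""
    best_for_mouse = {}
    best_for_human = {}
    for (mouse, human), score in scores.items():
        current = best_for_mouse.get(mouse)
        if current is None or score > scores[(mouse, current)]:
            best_for_mouse[mouse] = human
        current = best_for_human.get(human)
        if current is None or score > scores[(current, human)]:
            best_for_human[human] = mouse
    return sorted(
        (mouse, human)
        for mouse, human in best_for_mouse.items()
        if best_for_human[human] == mouse
    )


# Hypothetical scores: m3 matches h2 best, but h2 matches m2 better.
toy_scores = {
    ("m1", "h1"): 95,
    ("m1", "h2"): 60,
    ("m2", "h2"): 90,
    ("m3", "h2"): 70,
    ("m3", "h1"): 40,
}
print(reciprocal_best_matches(toy_scores))
```

Here m1/h1 and m2/h2 come out as 1:1 pairs, while m3 is a homologue of h2 without being its 1:1 orthologue, which is the situation the remaining 20% of matched mouse genes are in.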

The fewer than 1% of mouse genes (118) with no homologues in humans do have homologues in other species. So we can explain them as follows: either the corresponding gene has been deleted from the human genome, or it is rodent specific (unlikely, since they are all known in other organisms), or the corresponding gene has not been found in humans yet, or they might be evolving so rapidly in one or other lineage that they are unrecognisable as homologues.

A completely different method for predicting genes (not based on looking for sequences which code known proteins) was also used. This identifies genes by looking for statistical properties of coding regions, TATA boxes, UTRs, splice sites, introns etc. This process is enhanced by applying it to two genomes simultaneously and it was applied to the human and mouse sequences. This technique found a possible further 12,000 exons beyond the existing catalogue. By sampling a subset and checking the predictions experimentally it seems that about 6,000 of these are actual active exons yielding about 1000 additional genes.

Number of genes in the mammalian genome

How many genes does a mammal have? Well, the current count of predicted genes in human and mouse is about 23,000, with 190,000 exons. But there is a database of complementary DNA from mammals (cDNA is made from the mRNA present in different mouse tissues using reverse transcriptase, and so corresponds to the exons in genes). 79% of known mouse cDNAs are found in the mouse exons predicted from the sequence, so we are missing about 21%. Taking that together with the facts that not all cDNAs have been identified and that some predictions are false positives gives an exon count of about 225,000 to 250,000. From other data, we know that there are on average 8.3 exons per mouse gene, and that would give 27,000 to 30,000 genes in the mammalian genome. Although the number has fluctuated wildly in the last 3 years, we seem to be homing in on a number around 30,000.
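The arithmetic behind these estimates can be reproduced in a few lines of Python. This is just a back-of-envelope check using the figures quoted above; the variable names are illustrative, and the step from the coverage-corrected exon count to the 225,000 to 250,000 range involves the further adjustments described in the text, which are not modelled here.

```python
predicted_exons = 190_000   # exons in the current catalogue
cdna_coverage = 0.79        # fraction of known cDNAs found in predicted exons
exons_per_gene = 8.3        # average exons per mouse gene

# Correcting the exon count for the 21% of cDNAs that were missed.
exons_after_coverage = predicted_exons / cdna_coverage

# Gene counts implied by the quoted exon range.
genes_low = 225_000 / exons_per_gene
genes_high = 250_000 / exons_per_gene

print(round(exons_after_coverage))          # roughly 240,500
print(round(genes_low), round(genes_high))  # roughly 27,100 and 30,100
```

The coverage correction alone lands inside the quoted exon range, and dividing the ends of that range by 8.3 gives the 27,000 to 30,000 genes stated above.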

However, if there are small single exon genes not strongly expressed they would not be detected and would not be included in the 30,000.

RNA Genes

Finally the researchers looked at RNA genes. These genes do not code for proteins but for RNA, including the tRNA used for transferring amino acids to the growing polypeptide chain in the ribosome. The human catalogue had 518 tRNA genes and 118 pseudogenes. It is much more difficult to identify tRNA genes in mouse because mouse has an active SINE (repeat sequence - see earlier post) that is derived from tRNA and leaves debris scattered about the genome that looks like tRNA genes. At first pass the researchers found 2,764 tRNA genes and 22,000 pseudogenes, but the vast majority were masked out as SINEs. That left 498 possibles. But we expect active tRNA genes to be extremely highly conserved across species. If we include only genes with 95% sequence identity we find 335 in mouse and 345 in man, of which about 250 are absolutely identical. That set includes all of the 46 expected anti-codons. Only 46 anti-codons are needed to translate the 61 sense codons because of the famous Crick wobble rules, which state that the base in the first (5') position of an anti-codon can pair with more than one base in the third position of a codon, and so a single anti-codon can translate more than one codon. This occurs without loss of information since, you will remember, the set of codons has redundancy. So 61 codons are translated into the 20 amino acids via 46 anti-codons.
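The wobble rules can be written down as a small lookup. The Python sketch below uses Crick's original pairings for the first (5') base of the anti-codon (C reads G; A reads U; U reads A or G; G reads C or U; inosine, written I, reads U, C or A). It illustrates the principle only and does not reproduce the particular set of 46 anti-codons found in the genome; the function and dictionary names are illustrative.

```python
# Standard Watson-Crick pairing for the other two positions.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Crick's wobble rules: anti-codon 5' base -> codon 3' bases it can read.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}


def codons_read_by(anticodon):
    """Return the codons (written 5'->3') that an anti-codon (also written
    5'->3') can translate. The strands pair antiparallel, so the first
    anti-codon base sits opposite the third codon base."""
    first, second, third = anticodon
    # Codon positions 1 and 2 pair strictly with anti-codon positions 3 and 2.
    prefix = COMPLEMENT[third] + COMPLEMENT[second]
    # Codon position 3 is the wobble position.
    return [prefix + base for base in WOBBLE[first]]


print(codons_read_by("GAA"))  # both phenylalanine codons
print(codons_read_by("IGC"))  # three of the four alanine codons
print(codons_read_by("CAU"))  # the single methionine codon
```

One anti-codon beginning with G covers both UUC and UUU, which is how fewer than 61 anti-codons can serve all 61 sense codons.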

Conclusion

Would any of this be possible without common descent? The increased richness that is possible by comparing the mammalian genomes rather than just looking at one is astonishing.  We learn a great deal by comparing the genomes and the value of doing that relies entirely on the relationship and common descent of human and mouse.


 

1. Go here for an excellent very detailed review of pseudogenes and their implications for the evolution/creation debate: Edward Max, Plagiarized Errors and Molecular Genetics


 
