Recombinant DNA Technology
How to find one gene in a large genome? A gene might be 1/1,000,000 of the genome. Three basic approaches:
1. cell-based molecular cloning: create and isolate a bacterial strain that replicates a copy of your gene.
2. hybridization: make DNA single stranded, allow double strands to re-form using a labeled (e.g. radioactive) version of your gene to make it easy to detect.
3. Polymerase chain reaction (PCR). Make many copies of a specific region of the DNA.
Molecular Cloning. The original recombinant DNA technique: 1973 by Cohen and Boyer. Several key players:
1. restriction enzymes. Cut DNA at specific sequences: e.g. EcoRI cuts at GAATTC and BamHI cuts at GGATCC. Used by bacteria to destroy invading DNA: their own DNA is protected because it has been modified (methylated) at the corresponding sequences by a methylase.
2. Plasmids: independently replicating DNA circles (only circles replicate in bacteria). In nature, they spread antibiotic resistance genes. To make one into a cloning vector: remove extraneous stuff, leave a selectable marker (resistance gene), origin of replication (ori), and restriction sites to put new DNA into. The selectable marker is necessary to kill off bacteria that don't have the plasmid.
3. DNA ligase. Attaches 2 pieces of DNA together. Especially works well if the two ends have some complementary single stranded ends: "sticky ends".
4. transformation: DNA manipulated in vitro can be put back into the living cells by a simple process (calcium chloride plus heat shock for E. coli). The transformed DNA replicates and expresses its genes.
Once the recombinant plasmid has been transformed into the bacteria, you can isolate individual colonies and see if they have what you want in them. One common way is to isolate the plasmid DNA from several colonies, digest it with restriction enzymes that you know should cut the DNA at specific places, then run the DNA on an electrophoresis gel to see if the pieces are the proper sizes. For larger scale: need hybridization assays (below).
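The diagnostic digest described above can be sketched in code. This is a minimal illustration, assuming a toy 300 bp circular plasmid invented for the example (not a real vector); the function name digest_circular and its cut_offset parameter are hypothetical. It finds every recognition site on the circle and reports the fragment sizes you would expect to see as bands on the gel.

```python
import re

def digest_circular(seq, site, cut_offset):
    """Return fragment lengths (largest first) from cutting a circular DNA.

    cut_offset is where the top strand is cut inside the recognition
    site (1 for EcoRI, which cuts G^AATTC).
    """
    seq = seq.upper()
    n = len(seq)
    # Append the first few bases so a site spanning the origin is found.
    wrapped = seq + seq[:len(site) - 1]
    pattern = f"(?={re.escape(site)})"  # lookahead allows overlapping sites
    cuts = sorted({(m.start() + cut_offset) % n
                   for m in re.finditer(pattern, wrapped)})
    if not cuts:
        return [n]  # no site: the circle stays uncut
    # Distance from each cut to the next one, wrapping around the circle.
    frags = [(cuts[(i + 1) % len(cuts)] - c) % n or n
             for i, c in enumerate(cuts)]
    return sorted(frags, reverse=True)

# Toy plasmid (hypothetical): two EcoRI sites placed 100 bp apart.
plasmid = "GAATTC" + "A" * 94 + "GAATTC" + "C" * 194
print(digest_circular(plasmid, "GAATTC", 1))  # [200, 100]
```

A plasmid with a single site is linearized into one full-length band, so comparing the predicted band sizes with the gel tells you whether the insert is present and in the expected place.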
Source of DNA to clone: 1. genomic: cut up whole genome and clone small pieces. Advantage is, you get everything. Disadvantage is, a lot of it is junk. 2. cDNA: DNA copy of mRNA, made with reverse transcriptase. Advantage: you just get the expressed genes. Disadvantages: you don't get control sequences or introns, and frequency depends on level of expression.
Libraries: a large number of clones, often pooled together (so you have to fish out the one you want), but sometimes ordered. Genomic library vs. cDNA. Genomic uses enough input DNA to cover the genome 5-10x, so chance fluctuations don't prevent all sequences from being cloned. Repeat sequence DNA is a problem. cDNA libraries are usually made from single tissues: expression varies between tissues. Large difference in expression levels, often compensated for by normalizing the library: trying to equalize copy number of different sequences.
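The 5-10x coverage figure can be made concrete with a little arithmetic. The sketch below assumes clones are drawn at random from the genome, which is an idealization (repeat sequences and cloning bias break it). It uses the standard Clarke-Carbon estimate N = ln(1 - P) / ln(1 - f), where f is insert size divided by genome size; the function names and the example genome and insert sizes are chosen only for illustration.

```python
import math

def clones_needed(genome_bp, insert_bp, p_present):
    """Clarke-Carbon estimate: clones needed so that a given sequence
    is in the library with probability p_present (random cloning assumed)."""
    f = insert_bp / genome_bp
    return math.ceil(math.log(1 - p_present) / math.log(1 - f))

def p_missing(coverage):
    """Poisson approximation: chance that a given sequence is in no clone
    when the library covers the genome `coverage` times over."""
    return math.exp(-coverage)

# Illustrative numbers: 3 billion bp genome, 100 kb inserts, 99% confidence.
n = clones_needed(3_000_000_000, 100_000, 0.99)
print(n, "clones, about", round(n * 100_000 / 3_000_000_000, 1), "x coverage")
for c in (1, 5, 10):
    print(f"{c}x coverage: P(missing) = {p_missing(c):.5f}")
```

Under these assumptions, 1x coverage still misses about a third of all sequences, while 5x misses well under 1%, which is why libraries are built with several-fold redundancy.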
Specialized cloning vectors:
1. large scale vectors: plasmids are good to maybe 10 kb.
Next larger are phage lambda vectors. Lambda is about 50 kb long, and the central 20 kb is only used for lysogeny; it can be replaced by foreign DNA. Ligation of arms with insert is done in vitro, then the DNA is packaged in vitro using extracts from cells that contain pieces of the phage heads. Then, use these phage to infect new E. coli. 20 kb or so can be cloned.
Cosmids are similar to phage vectors: use lambda, but remove all but the ends (cos sites), ori, and selectable marker. Package in vitro--becomes a large (50 kb) plasmid in the E. coli.
Bacterial artificial chromosomes (BACs, PACs): several forms, based on F factor or phage P1. Essentially like plasmid vectors otherwise. Up to 300 kb can be cloned. Most commonly used vectors for genomic sequencing these days. Low copy number, so recombination between copies rarely happens (recombination is a real problem with the repeat sequence DNA common in eukaryotes); recombination-deficient mutant E. coli strains are also used. Get them into the cell by electroporation: high voltage pulse opens up temporary pores in E. coli membrane.
Yeast Artificial Chromosomes (YACs): Can do up to 2 Mb of DNA. A linear chromosome, has centromere, telomeres, ARS (autonomously replicating sequence), selectable marker for yeast (uracil or tryptophan biosynthesis genes usually). Also has E. coli ori and selectable marker: grow the vector itself in E. coli, then purify it, ligate in foreign DNA, transform into yeast.
2. expression vectors: It is possible to just get mRNA product (no intron splicing though), using promoter sequences from bacteriophages (which are very active) and terminator sequences. For protein product, need promoter to get RNA, ribosome binding site (translation start) to get protein. Usually fuse foreign cDNA to the start of an E. coli gene: the E. coli portion can be used as a tag: antibody binding. Or, a group of histidines attached to the end makes it easy to purify the protein, as a special chromatography resin can bind it.
3. shuttle vectors (work in 2 different species: E. coli and human, e.g.). Need to have a selectable marker and an origin of replication for BOTH species: most prokaryotic ones won't work in eukaryotes and vice versa. Often have complicated expression systems attached: inducible promoters, localization signals, signals to get the protein on the cell surface or secreted, put it into a virus, etc.
Hybridization. The basic idea is that if DNA is made single stranded (melted), it will pair up with another DNA (or RNA) with the complementary sequence. If one of the DNA molecules is labeled, you can detect the hybridization.
Basic applications: Southern blot (or Northern: which uses RNA on the gel instead of DNA), in situ hybridization, and colony hybridization.
Southern blot. Cut DNA with restriction enzymes: gives a set of DNAs with distinct sizes. Then, run on electrophoresis gel: small pieces run faster, and migration is only affected by DNA size (not true of proteins). Blot the DNA onto a membrane that DNA sticks to, then bathe in hybridization solution, wash, do autoradiography.
In situ hybridization: done on sections of tissue on a microscope slide: see chromosomes or cellular locations. Same basic technique, but no blotting. FISH: fluorescent in situ hybridization: done to find gene locations on chromosomes. Fluorescent tag better for small objects, as it doesn't get fuzzy or take much time to expose.
Colony hybridization: for use with bacterial colonies (or phage plaques). Very similar to Southern: blot colonies onto a filter, lyse them to expose the DNA and make it single stranded, do the hybridization and autoradiography.
Labeling: use 32P-labelled dNTPs (dNTP = any deoxyribonucleotide). Use single stranded DNA as template, short random oligonucleotides as primers, DNA polymerase to make a copy of the DNA that incorporates the label. Can also label RNA, use non-radioactive labels (often a small molecule that labeled antibodies bind to, or a fluorescent tag), use other labeling methods.
Hybridization. All the DNA must be single stranded (melt at high temp or with NaOH). Occurs in a high salt solution at say 60 degrees C. Complementary DNAs find each other and stick. Need to wash off non-specific binding. Stringency: less than perfect matches will occur at lower stringency (e.g. between species). Increase stringency by increasing temp and decreasing salt concentration. Rate of hybridization depends on DNA concentration and time (Cot), as well as GC content and DNA strand length.
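The effect of salt and GC content on stringency can be illustrated with a melting temperature estimate. The formula in the sketch below is a commonly quoted empirical approximation for long DNA duplexes and is an assumption brought in for illustration, not something taken from these notes; published versions differ in the length-correction term, so treat the numbers as rough. The function name melting_temp and the probe sequence are hypothetical.

```python
import math

def melting_temp(seq, na_molar):
    """Approximate Tm (degrees C) of a long DNA duplex.

    ASSUMPTION: uses the empirical rule
        Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 675/length
    which is only a rough guide and is not from the source notes.
    """
    seq = seq.upper()
    gc_percent = 100 * (seq.count("G") + seq.count("C")) / len(seq)
    return (81.5 + 16.6 * math.log10(na_molar)
            + 0.41 * gc_percent - 675 / len(seq))

probe = "ACGT" * 125  # hypothetical 500 nt probe, 50% GC
for salt in (1.0, 0.1, 0.01):
    print(f"[Na+] = {salt} M: Tm is roughly {melting_temp(probe, salt):.1f} C")
```

Each tenfold drop in salt lowers the estimated Tm by 16.6 degrees in this approximation. At a fixed wash temperature, mismatched hybrids (which melt below the perfect-match Tm) come apart first, which is why lowering salt or raising temperature increases stringency.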
Autoradiography. Put the labeled DNA next to X-ray film; the radiation fogs the film.
Applications: 1. Finding clones that match your probe. Colony hybridization. The blot takes some of the colony but leaves some live bacteria behind, so you can match the radioactive spot with the live colony you want.
2. Detection of mutant genes. Hb-S (sickle cell) is caused by a change of A to a T (Glutamic acid -> Valine) in the beta-globin gene. Occurs in a restriction site for MstII, destroying the site. Hb-A gives 2 bands on a Southern blot at 1.2 and 0.2 kb; Hb-S gives a single 1.4 kb band, when using a probe for beta-globin. Also, many mutations are due to deletions of part or all of the gene: easy to see on a Southern.
3. RFLP mapping. Restriction Fragment Length Polymorphisms. Use many probes from many different chromosomal locations, often in unknown genes. Lots of variation in restriction sites, especially in introns and other non-coding areas. Can be used as genetic markers: the alleles are the presence or absence of bands of particular sizes. Co-dominant, which is very useful in mapping.
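The co-dominance of restriction-site markers can be shown with the beta-globin band sizes quoted above. This is a small sketch, assuming the probe detects every fragment listed; the table BANDS_KB and the function southern_bands are names made up for the example.

```python
# Band sizes (kb) per allele, taken from the Hb-A / Hb-S example above.
BANDS_KB = {"A": (1.2, 0.2), "S": (1.4,)}

def southern_bands(genotype):
    """Bands expected on the blot for a genotype such as 'AA', 'AS' or 'SS'.

    Each allele contributes its own fragments; identical bands from the
    two chromosomes overlap, so they are listed once.
    """
    bands = set()
    for allele in genotype:
        bands.update(BANDS_KB[allele])
    return sorted(bands, reverse=True)

for genotype in ("AA", "AS", "SS"):
    print(genotype, southern_bands(genotype))
# AA [1.2, 0.2]
# AS [1.4, 1.2, 0.2]
# SS [1.4]
```

The heterozygote shows the bands of both alleles, so all three genotypes are distinguishable on the blot. That is what makes such markers co-dominant.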
Polymerase Chain Reaction (PCR)
Based on DNA polymerase creating a second strand of DNA. Needs a template strand (single stranded DNA) and two primers that flank the region to be amplified. Only short regions (up to 2 kbp) can be amplified. DNA polymerase adds new bases to the 3' ends of the primers to create the new second strand.
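How two primers define the amplified region can be sketched as a simple string search. This is a toy model, assuming exact primer matches on a single short template; real primers tolerate mismatches and annealing depends on temperature. The names revcomp and amplicon and the sequences are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (uppercase A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def amplicon(template, fwd_primer, rev_primer):
    """Return the amplified region (top strand, primers included), or None.

    The forward primer matches the top strand directly. The reverse
    primer anneals to the top strand, so its reverse complement is what
    appears in the top-strand sequence, downstream of the forward primer.
    """
    start = template.find(fwd_primer)
    if start == -1:
        return None
    rev_site = revcomp(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]

# Hypothetical template: flank + forward site + target + reverse site + flank.
template = "CCCC" + "ATGCCG" + "T" * 10 + "GGATAC" + "AAAA"
product = amplicon(template, "ATGCCG", "GTATCC")
print(product, len(product))  # ATGCCGTTTTTTTTTTGGATAC 22
```

Both primers are written 5' to 3', and polymerase extends each from its 3' end toward the other, so only the stretch between (and including) the two primer sites is copied.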
PCR is based on a cycle of 3 steps that occur at different temperatures. Each cycle doubles the number of DNA molecules: 25-35 cycles produces enough DNA to see on an electrophoresis gel. Each step takes about 1 minute to complete.
1. Denature the DNA (make it single stranded) at 94°C.
2. Hybridize the primers to the single strands. Temp varies with primer, around 50°C.
3. Build the second strands with DNA polymerase: 72°C.
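The doubling per cycle is easy to check numerically. The sketch below assumes ideal amplification by default; the efficiency parameter is an added assumption to show what happens when each cycle copies fewer than all of the molecules, and the function name copies_after is made up for the example.

```python
def copies_after(cycles, start_copies=1, efficiency=1.0):
    """Copies present after a number of PCR cycles.

    efficiency=1.0 means every molecule is copied each cycle (perfect
    doubling); lower values are a hypothetical model of imperfect cycles.
    """
    return start_copies * (1 + efficiency) ** cycles

for n in (10, 20, 30):
    print(n, "cycles:", int(copies_after(n)), "copies per starting molecule")

# Three steps of about 1 minute each, as stated above.
print("30 cycles take roughly", 30 * 3, "minutes")
```

Thirty cycles of perfect doubling give 2**30, a bit over a billion copies of each starting molecule, which is why 25-35 cycles are enough to make a visible band.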
A key element in PCR is a special form of DNA polymerase from Thermus aquaticus, a bacterium that lives in nearly boiling water in the Yellowstone National Park hot springs. This enzyme, Taq polymerase, can withstand the temperature cycle of PCR, which would kill DNA polymerase from E. coli.
PCR is cheaper, faster, and easier to perform than either cloning or hybridization, so it has become very widely used for examining the structure of genes.
Applications of PCR:
1. Microsatellites (Simple Sequence Repeats: SSRs). Used for mapping the human genome--the main marker system used today. SSRs are short (2-5 bases) sequences that are repeated several times in tandem: TGTGTGTGTGTG is 6 tandem repeats of TG. SSRs are found in and near many genes throughout the genome--they are quite common and easy to find. During normal replication of the DNA in the nucleus, DNA polymerase sometimes slips and creates extra copies or deletes a few copies of the repeat. This happens rarely enough that most people inherit the same number of repeats that their parents had (i.e. SSRs are stable genetic markers), but often enough that numerous variant alleles exist in the population. Mapping SSRs is a matter of having PCR primers that flank the repeat region, then examining the PCR products on an electrophoresis gel and counting the number of repeats. SSRs are co-dominant markers: both alleles can be detected in a heterozygote.
If an SSR is a 3 base repeat within the coding region of a gene, it will create a tandem array of some amino acid. Certain genetic diseases, most notably Huntington's Disease, are caused by an increase in the number of repeats: once the number gets high enough the protein functions abnormally, causing neural degeneration. Such SSRs are called "tri-nucleotide repeats" or TNRs.
2. Allele-specific PCR. If a disease-causing mutation is a single base change, create primers which have the mutant nucleotide at their 3' end. A primer that matches the wild type sequence will not pair at its 3' end with DNA from the mutant allele (and vice versa), so no amplification will occur. If you have separate primers for both mutant and wild type alleles you can detect heterozygotes as well.
3. RT-PCR. Reverse transcriptase coupled with PCR. Reverse transcriptase creates DNA from RNA, usually messenger RNA. Used to detect expression of genes even when the level of mRNA is very low--many interesting regulatory genes have low expression levels.
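Scoring an SSR allele (application 1 above) comes down to counting tandem copies of the repeat unit in the PCR product. A minimal sketch, assuming the product sequence is known; on a real gel you infer the count from product length instead. The function name and the flanking sequences are hypothetical.

```python
import re

def longest_repeat_run(seq, unit):
    """Largest number of back-to-back copies of `unit` found in seq."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

# Two hypothetical alleles with the same flanks but 6 vs 8 copies of TG.
allele_6 = "AAC" + "TG" * 6 + "CCA"
allele_8 = "AAC" + "TG" * 8 + "CCA"
print(longest_repeat_run(allele_6, "TG"))  # 6
print(longest_repeat_run(allele_8, "TG"))  # 8
print(len(allele_8) - len(allele_6))       # 4
```

Because the primers sit in the constant flanking sequence, the two products differ in length only by the repeat difference (here 4 bp), so a heterozygote gives two distinct bands.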
Sequencing DNA
Originally 2 methods were invented around 1976, but only one is widely used: invented by Fred Sanger. Uses DNA polymerase to synthesize a second DNA strand that is labeled. Also uses chain terminator nucleotides: dideoxy nucleotides (ddNTPs), which lack the -OH group on the 3' carbon of the deoxyribose. When DNA polymerase inserts one of these ddNTPs into the growing DNA chain, the chain terminates, as nothing can be added to its 3' end.
Sequencing is done by having 4 separate reactions, one for each DNA base. All 4 reactions contain the 4 normal dNTPs, but each reaction also contains one of the ddNTPs. In each reaction, DNA polymerase starts creating the second strand; when it reaches a base for which some ddNTP is present, the chain will either terminate if a ddNTP is added, or continue if the corresponding dNTP is added. However, all the second strands in, say, the A tube will end at some A base: you get a collection of DNAs that end at each of the A's in the region being sequenced.
The newly synthesized DNA from the 4 reactions is then run (in separate lanes) on an electrophoresis gel. The DNA bands fall into a ladder-like sequence, spaced one base apart. The actual sequence can be read from the bottom of the gel up. Automated sequencers use 4 different fluorescent dyes as tags and run all 4 reactions in the same lane of the gel.
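The logic of the four reactions and the gel read can be simulated in a few lines. This is a simplified sketch: it assumes every possible terminated chain is produced and resolved, and it counts lengths from the first new base, ignoring the primer. The function names are made up for the example.

```python
def sanger_ladder(new_strand):
    """Lengths of the terminated chains in each of the four reactions.

    In the 'A' reaction a chain can stop wherever the new strand has an A,
    so that lane holds one fragment per A, and likewise for C, G and T.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes

def read_gel(lanes):
    """Read the gel from the bottom up: shortest fragment first."""
    bands = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)

lanes = sanger_ladder("GATTACA")
print(lanes)            # {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
print(read_gel(lanes))  # GATTACA
```

Each fragment length occurs in exactly one lane, so stepping up the ladder one base at a time and noting which lane holds the band recovers the sequence of the newly made strand.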
Sequencing reactions usually produce about 500 bp of good sequence. To assemble a 3 billion bp genome from these pieces is a major challenge in computer science. Usually it is done in two steps: first break the chromosomes up into a series of BAC clones (about 100 kb each) and map their positions. Then, sequence each BAC separately.
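The scale of the assembly problem, and the basic operation behind it, can be illustrated as follows. This is a toy sketch, not how production assemblers work: it assumes error-free reads and merges two reads on an exact suffix-prefix overlap. The function names and the short reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b
    (0 if the best overlap is shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a, b, min_len=3):
    """Join two reads on their overlap, or return None if they don't overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

print(overlap("ACGTTGCA", "TGCAGGT"))  # 4
print(merge("ACGTTGCA", "TGCAGGT"))    # ACGTTGCAGGT

# Reads needed to cover a 3 billion bp genome once at 500 bp per read.
print(3_000_000_000 // 500)            # 6000000
```

Six million reads are needed just for 1x coverage, and repeated sequences produce misleading overlaps, which is one reason for first dividing the genome into mapped BAC clones and assembling each of those separately.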