BIOS 546 Assignment
I want you to write a program to calculate several properties of a protein based on its amino acid sequence: the amino acid composition; molecular weight; net charge at pH 5, 7, and 9; hydrophobicity; and absorbance at 190 and 205 nanometers. You should do this for the whole protein and also for the protein divided into thirds: 1st third, 2nd third, and 3rd third, without copying and pasting code: use subroutines that you pass sequences or coordinates to.
The amino acid sequence of a protein is usually written in the one-letter code. You will see a great deal of this code in bioinformatics, and you really should memorize it (although that isn't necessary for this course).
The data you will need for this assignment is in the file /home/bios546/amino2.txt. Your program should read this file, skip the first several comment lines, then create a series of hashes with the single-letter codes as the keys and the data in the columns as values. Thus, you should have one hash for the full names, another for the pK's, another for hydrophobicity, etc. Each hash will have the same keys but different values. Notice that the amino (N) and carboxyl (C) termini of the protein also have entries, for use in the net charge calculations. They also need to be entered into the hashes. After it finishes reading the file, your program should close it. To get the individual data items on a line into an array, use: @fields = split /\s+/, $line;
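The parsing step might look something like the sketch below. The column order (code, name, weight, pK, hydrophobicity), the "#" comment marker, and all sample values are assumptions made for illustration; check the header of the real amino2.txt and adjust the field indices to match.

```perl
use strict;
use warnings;

# Stand-in for /home/bios546/amino2.txt. The column order and the values
# below are ASSUMED for illustration; they are not the course data.
my $sample = <<'END';
# code name weight pK hydrophobicity
A Alanine 71.08 -- 0.50
D Aspartic_acid 115.09 3.9 3.64
K Lysine 128.17 10.5 2.80
END

my (%name, %weight, %pK, %hydro);

# Reading from an in-memory string here; for the real file use
#   open(my $fh, '<', '/home/bios546/amino2.txt') or die "...: $!";
open(my $fh, '<', \$sample) or die "Cannot open data: $!";
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^#/;       # assumed comment marker
    next if $line =~ /^\s*$/;    # skip blank lines
    my @fields = split /\s+/, $line;
    my $code = $fields[0];
    $name{$code}   = $fields[1];
    $weight{$code} = $fields[2];
    # Always-neutral residues have "--" instead of a pK: leave them out.
    $pK{$code}     = $fields[3] unless $fields[3] eq '--';
    $hydro{$code}  = $fields[4];
}
close $fh;

print "$_: $name{$_}, pK ", ($pK{$_} // 'none'), "\n" for sort keys %name;
```

Leaving the "--" entries out of %pK means the charge calculation can later loop over keys %pK and see only the groups that can actually be charged.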
The protein sequences are in several files that end in ".pep" in the /home/bios546 directory. Your program should prompt the user for the name of the file to be analyzed (using STDIN), then open that file. The peptide sequence is on several lines, so you will need to read in each line, chomp it (to remove the "\n"), then concatenate it to the previous lines to get the entire peptide sequence into a single string. After the file has been read in, close it.
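Here is one way the reading step could be written, as a sketch rather than a required design. The subroutine name read_sequence is hypothetical, and an in-memory string with a made-up sequence stands in for a real .pep file so the example runs anywhere.

```perl
use strict;
use warnings;

# Read a multi-line peptide from an open filehandle into one string.
sub read_sequence {
    my ($fh) = @_;
    my $sequence = '';
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\s+//g;    # added precaution: drop stray spaces or "\r"
        $sequence .= $line;
    }
    return $sequence;
}

# In the assignment the file name comes from the user, for example:
#   print "File to analyze: ";
#   chomp(my $file = <STDIN>);
#   open(my $fh, '<', $file) or die "Cannot open $file: $!";
# Here an in-memory string stands in for a .pep file (made-up sequence).
my $fake_file = "MKTAYIAK\nQRQISFVK\nSHFSR\n";
open(my $fh, '<', \$fake_file) or die "Cannot open: $!";
my $sequence = read_sequence($fh);
close $fh;

print "$sequence\n";
```

Remember to chomp the file name read from STDIN as well, or open will look for a name ending in a newline.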
Once you have the peptide as a single string, you will need to split it into an array of individual amino acids, using the "split" command: the statement @amino_acids = split //, $sequence; will do the job.
Now you need to do various calculations.
1. For amino acid composition, count the number of each amino acid, then report it in a table with the amino acids in alphabetical order. For example:

    Amino acid    Number
    Alanine           15
    Arginine           4
    Asparagine         7
    etc.
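The counting and the table can be sketched as follows. The subroutine name composition and the three-entry %name hash are hypothetical stand-ins; in your program the names come from the data file.

```perl
use strict;
use warnings;

# Count each one-letter code in a sequence; returns a hash reference.
sub composition {
    my ($sequence) = @_;
    my %count;
    $count{$_}++ for split //, $sequence;
    return \%count;
}

# Hypothetical subset of the full-name hash built from the data file.
my %name = (A => 'Alanine', R => 'Arginine', N => 'Asparagine');

my $count = composition('ARNAANRA');

print "Amino acid   Number\n";
# Sort the codes by full name so the table is alphabetical by name.
for my $code (sort { $name{$a} cmp $name{$b} } keys %name) {
    printf "%-12s %6d\n", $name{$code}, $count->{$code} // 0;
}
```

Sorting on $name{$a} rather than on the code itself matters: the one-letter codes are not in the same alphabetical order as the full names (R for arginine, N for asparagine).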
2. For molecular weight, just add up the weights of the individual amino acids. Similarly, just add the absorbances and hydrophobicities.
Hydrophobicity is a measure of the free energy resulting from moving the amino acid in question from water to n-octanol, which simulates the interior of a membrane. If the overall score is greater than 0, the protein prefers an aqueous medium; a score less than 0 indicates a protein that prefers the interior of a membrane. In practice, this score is usually calculated for small groups of amino acids using a sliding window, to find hydrophobic membrane-spanning regions.
The absorbance numbers are the molar extinction coefficients of the various amino acids. Most of the absorbance is due to the aromatic amino acids tryptophan, tyrosine, and phenylalanine, but other amino acids also contribute. The ratio of absorbances at different wavelengths can be used to identify proteins and to determine the purity of a sample.
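Because weight, absorbance, and hydrophobicity are all simple sums, a single subroutine that takes the sequence and a reference to the relevant hash can handle all of them. The sketch below uses a hypothetical name, sum_property, and placeholder numbers that are not the course data.

```perl
use strict;
use warnings;

# Sum one per-residue property (weight, absorbance, hydrophobicity, ...)
# over a sequence. $table is a reference to a hash keyed by one-letter code.
sub sum_property {
    my ($sequence, $table) = @_;
    my $total = 0;
    for my $aa (split //, $sequence) {
        die "No value for residue '$aa'\n" unless exists $table->{$aa};
        $total += $table->{$aa};
    }
    return $total;
}

# Placeholder numbers, NOT the course data.
my %weight = (A => 71,  G => 57,  K => 128);
my %hydro  = (A => 0.5, G => 1.0, K => 3.0);

my $peptide = 'GAKA';
printf "Weight: %d\n",           sum_property($peptide, \%weight);
printf "Hydrophobicity: %.1f\n", sum_property($peptide, \%hydro);
```

Passing the hash by reference (\%weight) is what lets one subroutine serve every property, and the same call works unchanged on a third of the protein.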
3. The net charge on the protein determines how it will migrate in an electrophoresis gel. Only some amino acids are charged: they have pK values listed, while the ones that are always neutral have "--" in that column. A big distinction needs to be drawn between amino acids that are charged (+1) at low pH but uncharged at high pH (histidine, lysine, arginine, and the N-terminus of the protein) and those that are uncharged at low pH but charged (-1) at high pH (cysteine, aspartic acid, glutamic acid, tyrosine, and the C-terminus of the protein). Low pH means a high hydrogen ion concentration, [H+]: the switch between charged and uncharged is due to the loss of an H+ from the amino acid as the pH increases. For example, lysine goes from a charged -NH3+ group to an uncharged -NH2 group as the pH increases. Another example: aspartic acid goes from an uncharged -COOH group to a charged -COO- group as the pH increases.
There is an equation that can be used to calculate the proportion of amino acids of a given type that have an H on them: fraction with H = [H+] / ([H+] + K_D). Here [H+] is the hydrogen ion concentration, 10^(-pH), and K_D is the dissociation constant, 10^(-pK). At each pH (5, 7, 9), use this equation for each of the amino acids (plus the N and C termini) that have a pK listed to determine the fraction that have H. Convert that fraction to a charge, multiply by the number of that amino acid (use 1 for each of the N and C termini), then add the results up for the whole protein to get its net charge. When you do this, take into account whether the amino acid is charged or uncharged when it has H on it, and whether the charge is +1 or -1! For a group that is charged (+1) when it has H, the charge per group is +(fraction with H); for a group that is charged (-1) only after losing H, it is -(1 - fraction with H).
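The charge calculation can be sketched as below. The subroutine names, the 'Nterm' and 'Cterm' keys, the %sign hash that records whether a group is +1 with H or -1 without it, and the pK values are all assumptions for illustration; use the entries from the data file in your own program.

```perl
use strict;
use warnings;

# Fraction of a titratable group that still carries its H at a given pH.
sub fraction_with_h {
    my ($pH, $pK) = @_;
    my $h  = 10 ** (-$pH);    # [H+]
    my $kd = 10 ** (-$pK);    # K_D
    return $h / ($h + $kd);
}

# Net charge of a sequence at one pH. %$pK maps group => pK; %$sign maps
# group => +1 (charged when it has H) or -1 (charged when it has lost H).
# 'Nterm' and 'Cterm' are hypothetical keys for the two termini.
sub net_charge {
    my ($sequence, $pH, $pK, $sign) = @_;
    my %count = (Nterm => 1, Cterm => 1);
    $count{$_}++ for split //, $sequence;

    my $charge = 0;
    for my $group (keys %$pK) {
        my $n = $count{$group} or next;    # group absent from this sequence
        my $f = fraction_with_h($pH, $pK->{$group});
        $charge += $sign->{$group} > 0 ? $n * $f : -$n * (1 - $f);
    }
    return $charge;
}

# Illustrative pK values only, NOT the course data.
my %pK   = (K => 10.5, D => 3.9, Nterm => 8.0, Cterm => 3.1);
my %sign = (K => +1,   D => -1,  Nterm => +1,  Cterm => -1);

printf "pH %d: %+.2f\n", $_, net_charge('KKD', $_, \%pK, \%sign) for 5, 7, 9;
```

A quick sanity check: when the pH equals a group's pK, fraction_with_h returns 0.5, so that group contributes exactly half a charge.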
Once you can do all these things for the whole protein, develop a way to repeat the calculations for each third of the protein separately. Do this using subroutines, and NOT just cutting and pasting large segments of code! Report all the results in a neat and clear fashion. Be sure your program is well documented.
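One possible way to organize the thirds is sketched below. The names thirds and report are hypothetical, report is only a stand-in for your real analysis routine, and giving the leftover residues to the last piece is just one reasonable choice when the length is not divisible by 3.

```perl
use strict;
use warnings;

# Split a sequence into three consecutive pieces. When the length is not
# divisible by 3, the last piece takes the remainder (one possible choice).
sub thirds {
    my ($sequence) = @_;
    my $size = int(length($sequence) / 3);
    return (
        substr($sequence, 0, $size),
        substr($sequence, $size, $size),
        substr($sequence, 2 * $size),
    );
}

# Stand-in for the real analysis routine, which would report composition,
# weight, charge, and so on for whatever sequence it is handed.
sub report {
    my ($label, $sequence) = @_;
    printf "%-12s length %3d  %s\n", $label, length($sequence), $sequence;
}

my $protein = 'MKTAYIAKQR';    # made-up 10-residue example
my @parts   = thirds($protein);

report('Whole', $protein);
my @labels = ('First third', 'Second third', 'Third third');
report($labels[$_], $parts[$_]) for 0 .. 2;
```

Because report takes the sequence as an argument, the same code handles the whole protein and each third with no duplication.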