Mutual Information

HIVGENOME


October 17, 2006

After calculating Mutual Information in the HCV genome, it was time to see what challenges could be found in the HIV genome. Many challenges!

The HIV genome encodes multiple overlapping proteins in different reading frames, which makes it difficult to provide a true amino acid alignment representative of the proteins. The sequences for the various elements of HIV can be found at the Los Alamos HIV web site. I sent a support email to the HIV database group, and they pointed me to GeneCutter, which is designed to solve this problem. They suggested that I submit the 600+ HIV genomes, and GeneCutter would then return the amino acid sequences representative of the multiple reading frames. I then wrote some code to append each protein sequence end to end, constructing a flat, simplified version of the protein sequences without the multiple reading frames. This gave me an aligned set of amino acid sequences that I could then use to calculate Mutual Information. I used quicktree to build a phylogenetic tree for the sequences, which allowed me to use the RPE method to detect co-evolving pairs. With 600+ sequences at a length of 3000+ positions, it took about 30 hours to process on an AMD 64 1900.
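As a rough illustration of the quantity being calculated, the sketch below computes plain mutual information (in bits) between two columns of an aligned set of sequences. It is not the code used for this analysis, and it does not implement the RPE method or any tree-based correction. The names `mutual_information` and `column` and the toy alignment are made up for the example, and gap characters are simply treated as one more symbol.

```python
from collections import Counter
from math import log2


def column(alignment, index):
    """Return one alignment column as a list of residues."""
    return [seq[index] for seq in alignment]


def mutual_information(col_a, col_b):
    """Plain mutual information, in bits, between two alignment columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_a)
    count_a = Counter(col_a)               # residue counts in column A
    count_b = Counter(col_b)               # residue counts in column B
    count_ab = Counter(zip(col_a, col_b))  # joint counts of residue pairs
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi += (n_ab / n) * log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Toy alignment (hypothetical): the two columns vary together perfectly.
toy_alignment = ["AK", "AK", "GE", "GE"]
print(mutual_information(column(toy_alignment, 0), column(toy_alignment, 1)))  # prints 1.0
```

Running this over every pair of columns is quadratic in the alignment length, which gives some feel for why an alignment of 3000+ positions takes a long time to process.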

Once the mutual information pairs are identified for the multiple sequence alignment, they are programmatically mapped to a single reference sequence with no inserts. For this example 97BL006_AF193275 was used because it was first in the list. Each MI pair is then checked to see whether its two positions are found in different protein sequences. The following boundaries were used, based on the 97BL006_AF193275 sequence that represents the sequence without multiple reading frames.

  • GAG START=0 END=492
  • POL START=493 END=1483
  • VIF START=1484 END=1671
  • VPR START=1672 END=1765
  • TAT START=1766 END=1865
  • REV START=1866 END=1987
  • VPU START=1988 END=2068
  • ENV START=2069 END=2905
  • NEF START=2906 END=3109
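The boundary list above can be turned into a small lookup that assigns a reference position to its protein and checks whether an MI pair spans two different proteins. This is only a sketch: it assumes END is inclusive (the ranges above are contiguous, e.g. 492 then 493), and the names `gene_for_position` and `is_cross_protein` are hypothetical, not taken from the original code.

```python
from bisect import bisect_right

# Boundaries copied from the list above (positions in the flattened reference).
GENE_BOUNDARIES = [
    ("GAG", 0, 492),
    ("POL", 493, 1483),
    ("VIF", 1484, 1671),
    ("VPR", 1672, 1765),
    ("TAT", 1766, 1865),
    ("REV", 1866, 1987),
    ("VPU", 1988, 2068),
    ("ENV", 2069, 2905),
    ("NEF", 2906, 3109),
]
STARTS = [start for _, start, _ in GENE_BOUNDARIES]


def gene_for_position(pos):
    """Return the gene name containing pos, or None if pos is out of range."""
    i = bisect_right(STARTS, pos) - 1  # last gene whose START is <= pos
    if i < 0:
        return None
    name, start, end = GENE_BOUNDARIES[i]
    return name if start <= pos <= end else None  # END assumed inclusive


def is_cross_protein(pos_a, pos_b):
    """True when both positions map to a gene and the genes differ."""
    gene_a, gene_b = gene_for_position(pos_a), gene_for_position(pos_b)
    return gene_a is not None and gene_b is not None and gene_a != gene_b


print(gene_for_position(500))       # prints POL
print(is_cross_protein(100, 2100))  # prints True (GAG vs ENV)
```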

An example of the genes found in a typical sequence with multiple reading frames

http://hiv-web.lanl.gov/content/hiv-db/CRFs/CRFs.html

Using the yEd graph editor I began connecting nodes and was puzzled by the patterns. It then occurred to me that, because of the multiple reading frames, the overlapping positions would be detected as co-evolving pairs: the same nucleotides encode both amino acids, so a mutation changes them together. Seeing this, I felt very confident that the calculations performed over the last 30 hours were correct. The initial graph is shown, with a sample of the detected co-evolving pairs from different genes that have the same genome position but a different reading frame.

Strong signal for co-evolving pairs that share the same sequence position but a different reading frame

To filter for this effect, each occurrence of an amino acid in the mutual information pairs was counted. If an amino acid was found in more than one co-evolving pair, it was listed as a co-evolving pair for graphing. This helped filter the data set to something a little more interesting. The remaining layout was done by hand using the yEd graph tool, and where possible amino acids from the same genome sequence position were not included. For the first attempt it was easier to do this by hand than to write code to filter the bad pairs. Future improvements will provide a range of overlaps, which will allow automatic filtering.
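The counting step could look something like the sketch below. It assumes each amino acid is identified by its position in the reference sequence and that a pair is kept when at least one of its two positions appears in more than one pair. Both are my reading of the description, and `filter_shared_positions` is a hypothetical name.

```python
from collections import Counter


def filter_shared_positions(pairs):
    """Keep pairs in which at least one position occurs in more than one pair."""
    counts = Counter()
    for pos_a, pos_b in pairs:
        counts[pos_a] += 1
        counts[pos_b] += 1
    return [(a, b) for a, b in pairs if counts[a] > 1 or counts[b] > 1]


# Hypothetical positions: 10 appears in two pairs, so both of its pairs are kept.
example_pairs = [(10, 500), (10, 2100), (30, 40)]
print(filter_shared_positions(example_pairs))  # prints [(10, 500), (10, 2100)]
```

This does not remove the reading-frame overlaps themselves; that still needs the overlap ranges mentioned above.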

Additional improvements were made to include, for each position, the amino acid that contributes the most information to the co-evolving pair. This same amino acid is used to determine the color of the node, where Hydrophobic=Brown, Positive=Red, Negative=Black, Polar=White and Proline=Yellow. In cases where a single sequence position had different physicochemical properties depending on the other sequence position, an additional node was added and the nodes were then grouped. Each co-evolving pair expresses information that may be different from other co-evolving pairs that have one sequence position in common. If it was possible that a sequence position was the result of a different reading frame, it was circled in green as an indicator that it may not be a true co-evolving pair for all combinations. I was doing this by hand, so the graph may have a few extra nodes that should be eliminated.
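The color scheme can be written down as a simple lookup. The class-to-color mapping comes from the text above, but the assignment of individual amino acids to classes is an assumption on my part (one common grouping, not necessarily the one used for the graph), and `node_color` is a hypothetical helper.

```python
# Class-to-color mapping as described in the text.
CLASS_COLORS = {
    "hydrophobic": "brown",
    "positive": "red",
    "negative": "black",
    "polar": "white",
    "proline": "yellow",
}

# ASSUMPTION: which residues belong to which class; the original page does not say.
CLASS_MEMBERS = {
    "hydrophobic": "AVLIMFW",
    "positive": "KRH",
    "negative": "DE",
    "polar": "STNQYCG",
    "proline": "P",
}

# Invert the grouping into a one-letter-code -> class lookup.
RESIDUE_CLASS = {
    residue: cls for cls, members in CLASS_MEMBERS.items() for residue in members
}


def node_color(residue):
    """Return the node color for a one-letter amino acid code, or None if unknown."""
    cls = RESIDUE_CLASS.get(residue.upper())
    return CLASS_COLORS[cls] if cls is not None else None


print(node_color("L"))  # prints brown
print(node_color("K"))  # prints red
```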

MI graph between proteins with minimal occurrence of multiple reading frames

A high-res version of this picture for printing is included below: HIV-protein-protein-graph-noCRF-2048.png




This page last changed on 18-Oct-2006 21:02:29 EDT by 68.233.52.209.


For questions or comments please contact willishf@ufl.edu