I will work in the group on two projects:
- Hole filling in metabolic pathways.
- Multi-omics analysis of Mycobacterium tuberculosis.
Before exploring the related research articles, here is a brief introduction to "Network based annotation", a method published earlier by the CompBio group.
- Network based annotation
Network-based annotation predicts the function of a protein in a real or predicted protein interaction network from its neighboring proteins. For each protein in the network, the method compiles a list of the GO categories of every protein connected to it by a predicted or experimental interaction. A neighborhood score is calculated for each category in the list, and the list is then sorted by this score. The top-ranking categories are used as predictions. The method is integrated into the Bioverse pipeline. Bioverse is available as a web service and includes data for more than 50 organisms.
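The ranking idea can be sketched in a few lines of Python. This is only an illustration: the actual neighborhood score used in Bioverse is not given in these notes, so the score below (the fraction of neighbors annotated with a category) is an assumption, and the function name, protein IDs and annotations are made up.

```python
from collections import Counter

def predict_go_terms(protein, interactions, annotations, top_n=3):
    """Rank GO categories by the fraction of neighbours that carry them.

    The scoring rule is an assumed stand-in for the published neighborhood score.
    """
    neighbours = interactions.get(protein, set())
    if not neighbours:
        return []
    counts = Counter()
    for nb in neighbours:
        counts.update(annotations.get(nb, set()))
    scored = [(go, n / len(neighbours)) for go, n in counts.items()]
    # Highest score first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]

# Toy network: P1 is unannotated and has three annotated neighbours.
interactions = {"P1": {"P2", "P3", "P4"}}
annotations = {
    "P2": {"GO:0006096", "GO:0005524"},
    "P3": {"GO:0006096"},
    "P4": {"GO:0005524", "GO:0006096"},
}
ranked = predict_go_terms("P1", interactions, annotations)
print(ranked)
```

In this toy case the category shared by all three neighbors ranks first, and the one shared by two of them ranks second.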
Pathway hole filler
The algorithm is implemented as part of the Pathway Tools software, a program for creating, editing and querying pathway/genome databases (PGDBs). The steps of the algorithm are as follows:
- Given a visible hole in a metabolic pathway, the first step is to retrieve from SwissProt and PIR the sequences of enzymes that catalyze the same reaction in other organisms.
- BLAST each of these sequences against the genome of interest to identify the candidate sequences.
- Determine the probability that each candidate has the activity required by the missing reaction.
The BLAST hits obtained by aligning each of the sequences against the organism of interest are compiled using a shotgun-based data consolidation analysis. A list of parameters is used as evidence for the Bayes classifier: has function, Shotgun score, Best E-value, Average rank, Average fraction aligned, Pathway direction and Adjacent reactions. The authors use a Bayesian network structure for the candidate evaluation step. Each candidate is evaluated by assigning a probability that the sequence encodes the desired function, based on operon, homology and pathway data. The Bayesian network has nodes for the E-value, alignment length and rank of the candidate protein. The main node is the 'has function' node, and every other node is conditionally dependent on it. The probability calculation requires the following data:
- the evidence for a particular candidate
- the probability of finding that evidence if the candidate has the desired function, and if it does not
- the prior probability that any candidate has the desired function
A conditional probability distribution for each node in the network is calculated from the known reactions in the PGDB, giving P(has function|evidence). This is followed by 5-fold cross-validation and statistical evaluation using McNemar's test at different false positive rates. The method is evaluated on three PGDBs: CauloCyc (Caulobacter crescentus), MtbRvCyc (Mycobacterium tuberculosis, strain H37Rv) and VchoCyc (Vibrio cholerae). One example pathway used to demonstrate the results is the pyridine nucleotide biosynthesis pathway from CauloCyc. For detailed results, see the paper. The interesting part of this paper is what the authors say about future work. They mention that including expression data and phylogenetic profiles might enhance the accuracy of predictions and may allow the identification of candidates with little or no homology to known sequences. This is exactly what I will try to do in my project: integrate information from various kinds of functional evidence to fill metabolic pathway holes. I also want to take into account ortholog information from databases such as eggNOG and STRING.
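To make the probability calculation concrete, here is a minimal sketch that combines a prior with per-evidence likelihoods using Bayes' rule. It assumes the evidence items are conditionally independent given 'has function' (a naive Bayes simplification, which may not match the authors' exact network), and every number in it is invented for illustration.

```python
def posterior_has_function(prior, likelihoods):
    """Return P(has function | evidence).

    likelihoods: list of (P(e | has function), P(e | does not have function)),
    one pair per observed evidence item, assumed conditionally independent.
    """
    p_true = prior
    p_false = 1.0 - prior
    for p_e_given_true, p_e_given_false in likelihoods:
        p_true *= p_e_given_true
        p_false *= p_e_given_false
    return p_true / (p_true + p_false)

# Invented numbers: a 10% prior and two pieces of evidence
# (for example a good best E-value and a consistent pathway direction).
prior = 0.1
evidence = [(0.8, 0.1), (0.6, 0.3)]
posterior = posterior_has_function(prior, evidence)
print(round(posterior, 2))
```

With these made-up values the two pieces of evidence raise the probability from 0.1 to 0.64, which shows how weak individual signals can add up.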
Comparative genomic approaches
About 20-60% of the proteins in most genomes have no assigned function. Genome-context-based methods are useful for predicting the function of proteins involved in related cellular processes. These methods include:
- Clustering of genes on the chromosome i.e. Chromosomal Clustering.
- Protein fusion (Rosetta Stone proteins)
- Phylogenetic profiles (occurrence of proteins across a variety of species)
- Gene neighborhood or shared regulatory sites.
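As an illustration of one of these methods, phylogenetic profiles can be compared with a simple similarity measure. The sketch below uses the Jaccard index on presence/absence vectors. The choice of measure, the gene names and the profiles are all assumptions made for the example.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two presence/absence profiles (lists of 0/1)."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0

# Each position is one genome: 1 = gene present, 0 = gene absent (made-up data).
profiles = {
    "geneA": [1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 1, 0],
    "geneC": [0, 1, 1, 0, 1],
}
sim_ab = jaccard(profiles["geneA"], profiles["geneB"])
sim_ac = jaccard(profiles["geneA"], profiles["geneC"])
print(sim_ab, sim_ac)
```

Genes with near-identical profiles (geneA and geneB here) are candidates for being functionally linked, whereas geneC shares only one genome with geneA.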
The three basic steps to identify missing genes in an organism are:
- Define the case or problem, i.e. the absence of a particular gene in a particular pathway.
- Evidence accumulation and analysis.
- Experimental verification
Missing genes in a genome are described as either globally missing (functions that have no representative sequenced gene in any organism) or locally missing (functions previously connected to one sequenced form of a gene in one group of species but expected to occur in another form in another group of species). The identification of missing genes can start from biochemical information in the literature, in books, or in web resources and databases such as KEGG, ERGO, PGDB and the Biochemical Pathways chart from ExPASy. A matrix of function versus species can be prepared using data from resources such as the COG database. For each type of evidence listed above, the review goes into detail with a specific example and points to useful references, which should be read for an in-depth overview of genome-context-based annotation methods using comparative genomics.
Homomorphisms: Metabolic pathway mapping
The paper contains a lot of mathematics, which I am still working through. On the biological side, it says that pathway holes fall into two types: visible pathway holes and hidden pathway holes. A visible pathway hole is a partial EC number assigned to a particular metabolic reaction, resulting from ambiguity in identifying the gene. A hidden pathway hole, on the other hand, is an enzyme that is missing entirely from a given pathway description, which happens when the gene encoding the enzyme has not been identified in the organism's genome. These holes are filled using pathway mapping, which identifies homomorphisms using an enzyme matching cost based on EC notation. Similar pathways are mapped between two related organisms, and a visible hole is filled if there is information for it in databases such as SwissProt and TrEMBL. Hidden pathway holes are also identified through the homomorphism-based mapping and are filled based on the enzyme's presence in other closely related species. The authors conclude that the mapping tool can be used to identify pathway holes and propose it as a framework for finding and filling these holes based on pathway mapping and database search.
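One simple way to picture an enzyme matching cost based on EC notation is to count how many leading EC levels two enzymes share. The sketch below is my own simplification and is not the cost function defined in the paper. The function name and the "4 minus shared levels" rule are assumptions.

```python
def ec_match_cost(ec_a, ec_b):
    """Cost of matching two EC numbers: 0 for identical, 4 for different classes.

    Counts the leading levels that agree; a '-' (unspecified level) stops the count.
    """
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a == "-" or b == "-" or a != b:
            break
        shared += 1
    return 4 - shared

print(ec_match_cost("1.1.1.95", "1.1.1.95"))  # identical EC numbers
print(ec_match_cost("1.1.1.95", "1.1.1.-"))   # partial EC number, same sub-subclass
print(ec_match_cost("1.1.1.95", "2.6.1.52"))  # different enzyme classes
```

Under this rule a partial EC number (a visible hole) still matches cheaply against complete EC numbers in the same sub-subclass, which is the intuition behind using pathway mapping to suggest candidates for it.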
CanOE: Fishing Candidate genes for Orphan Enzymes
This is a recent method that exploits genomic and metabolic contextual information using a graph-based algorithm. (Reading it, will update later)
Genome annotation errors due to partial EC numbers
The presence of partial EC numbers in databases such as KEGG, VIMSS and IMG makes it difficult to interpret the function of many proteins. Using such data to train any kind of computational model is also known to be error-prone, since the model may be inaccurate and its errors will then propagate into the predictions. It is therefore recommended to use a different hierarchical classification scheme when assigning EC numbers to a novel set of proteins, as suggested here. The authors propose a change in how these EC numbers are specified, a sensible argument for reducing the future effects of the semantic ambiguity in partial EC numbers. A partial EC number can have two meanings. In the first, the exact activity of the enzyme is not known, so the fourth number is not specified and a '-' is assigned instead. In the second, the exact activity of the enzyme is known, but the NC-IUBMB, the only body officially authorized to assign EC numbers, has not yet assigned a serial number, so a '-' is put in its place. The proposal is that instances of the first case be indicated with a ‘?’ in the fourth position, e.g. EC 2.3.4.?, meaning ‘unknown’, while instances of the second case be indicated with an ‘n’ in the fourth position, e.g. EC 2.3.4.n, meaning ‘not available yet’.
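The proposed notation is easy to handle programmatically. The following sketch classifies an EC string according to the scheme described above. The category labels and the function name are my own, and it only covers the plain '?', 'n' and '-' cases.

```python
def classify_ec(ec):
    """Classify an EC number string under the proposed '?' / 'n' notation."""
    fields = ec.replace("EC", "").strip().split(".")
    if len(fields) != 4:
        raise ValueError(f"not a four-level EC number: {ec!r}")
    if all(f.isdigit() for f in fields):
        return "complete"
    if fields[3] == "?":
        return "activity unknown"
    if fields[3] == "n":
        return "activity known, number not assigned yet"
    if "-" in fields:
        return "ambiguous partial"
    raise ValueError(f"unrecognised EC number: {ec!r}")

for ec in ["EC 2.3.4.5", "EC 2.3.4.?", "EC 2.3.4.n", "EC 2.3.4.-"]:
    print(ec, "->", classify_ec(ec))
```

Only the old '-' notation ends up in the "ambiguous partial" category, which is the ambiguity the proposal is meant to remove.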
Metabolic networks and their analysis
Most biological networks are scale-free, meaning that they follow a power-law degree distribution. The power law can be written as $P(k) = \alpha k^{-\Gamma}$, where Γ, known as the degree exponent, is (up to sign) the slope of the linear approximation of the curve on a log-log plot and determines many properties of the network. The smaller the value of Γ, the more important the hubs are in the network. Scale-free networks differ from random networks, which follow a Poisson degree distribution. Some of the most useful network properties are centrality, modularity, extreme pathways and elementary flux modes.
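The degree exponent can be estimated from a list of node degrees by fitting a straight line to the degree distribution on a log-log scale. The sketch below does this with ordinary least squares. It is a crude estimator (maximum-likelihood fitting is generally more reliable), and the degree list is synthetic, constructed to follow an exact power law with exponent 2.

```python
import math
from collections import Counter

def fit_power_law(degrees):
    """Estimate (alpha, gamma) in P(k) = alpha * k**(-gamma) by log-log regression."""
    counts = Counter(k for k in degrees if k > 0)
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(counts[k] / total) for k in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic degrees: 64 nodes of degree 1, 16 of degree 2, 4 of degree 4, 1 of degree 8.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8]
alpha, gamma = fit_power_law(degrees)
print(round(alpha, 3), round(gamma, 3))
```

On real metabolic networks the points will scatter around the line, so the fitted exponent should be read as a rough summary and not as a precise parameter.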
Centrality helps to identify important vertices in the network. Degree centrality identifies hubs, the vertices with the highest number of connections; betweenness centrality highlights the vertices with the highest number of shortest paths passing through them; and closeness centrality distinguishes vertices in the central part of the network from those in the periphery.
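Two of these centrality measures can be sketched in plain Python for a small undirected graph. Betweenness is left out to keep the example short, and the graph itself is a made-up toy and not a real metabolic network.

```python
from collections import deque

def degree_centrality(graph):
    """Fraction of the other vertices each vertex is directly connected to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

def closeness_centrality(graph):
    """(Reachable vertices) / (sum of shortest-path distances), via BFS."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

# Toy undirected graph stored as adjacency sets.
graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print(max(degree, key=degree.get), min(closeness, key=closeness.get))
```

Here vertex B comes out as the hub (highest degree and closeness), while E, at the end of a chain, is the most peripheral vertex.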
Researchers have shown that metabolic networks are organised into small but highly connected modules that combine in a hierarchical manner into larger units. Decomposing metabolic networks into modules with network decomposition methods can help us to better understand the organisation of complex biological metabolic networks. The modularity coefficient can be used to measure the degree of network decomposition, and it can also be used as a parameter for clustering.
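As a worked example of a modularity coefficient, the sketch below computes Newman's modularity Q for a given partition of an undirected graph. I am assuming this is the kind of coefficient meant here, and the graph (two triangles joined by one edge) is made up.

```python
def modularity(edges, communities):
    """Newman modularity Q = sum_c [ e_c / m - (d_c / 2m)^2 ] for an undirected graph.

    e_c: number of edges inside community c; d_c: sum of degrees in c; m: total edges.
    """
    m = len(edges)
    membership = {node: idx for idx, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(e / m - (d / (2 * m)) ** 2 for e, d in zip(internal, degree_sum))

# Two triangles (a-b-c and d-e-f) joined by a single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
q_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
print(round(q_split, 4), q_single)
```

Splitting the graph into its two triangles gives a clearly positive Q, while putting everything into one module gives Q = 0, so Q can be used to compare alternative decompositions.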
- Extreme Pathway (EP) and Elementary flux modes (EFM)
An EFM is defined as a minimal set of enzymes that can operate at steady state with all irreversible reactions proceeding in the appropriate direction. Minimal means that complete inhibition of even one of these enzymes would result in the cessation of any steady-state flux in the system. EPs are similar to EFMs and can be used to understand the regulatory mechanisms of metabolic networks in detail.
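The steady-state and irreversibility conditions in this definition can be checked directly from a stoichiometric matrix. The sketch below tests whether a flux vector satisfies both conditions. It does not test minimality, so it is only a partial check for an EFM, and the three-reaction chain is an invented example.

```python
def is_steady_state_mode(stoichiometry, fluxes, irreversible, tol=1e-9):
    """Check S.v = 0 for every internal metabolite and v_i >= 0 for irreversible reactions.

    stoichiometry: rows = internal metabolites, columns = reactions.
    irreversible: set of column indices of irreversible reactions.
    """
    for row in stoichiometry:
        if abs(sum(s * v for s, v in zip(row, fluxes))) > tol:
            return False
    return all(fluxes[i] >= -tol for i in irreversible)

# Invented linear chain: r1 imports A, r2 converts A to B, r3 exports B.
S = [
    [1, -1, 0],   # metabolite A
    [0, 1, -1],   # metabolite B
]
irreversible = {0, 1, 2}
print(is_steady_state_mode(S, [1, 1, 1], irreversible))     # balanced flux
print(is_steady_state_mode(S, [1, 1, 0], irreversible))     # B accumulates
print(is_steady_state_mode(S, [-1, -1, -1], irreversible))  # wrong direction
```

In this chain, blocking any one reaction leaves no non-zero steady-state flux, which is the intuition behind the minimality requirement.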
How to find metabolic holes
There can be different approaches to find what is missing in a metabolic pathway as described below:
Based on KEGG pathway maps
One way to start is to look at the organism-specific metabolic pathway maps and identify the visible missing functions (displayed as partial EC numbers in the metabolic maps). The BRITE hierarchy file* gives the details of the hierarchical structure of metabolism and the proteins involved. A single grep for the “.-” pattern gives a list of partial EC numbers that have been assigned an Rv ID but whose function is so far unknown. In addition, there are enzymes that are not known in an organism-specific pathway. These appear as white boxes in KEGG metabolic maps, indicating that the reaction is known to be catalyzed by this enzyme but that the enzyme has not been reported in the organism to which the map belongs.
- The KEGG BRITE database is a collection of BRITE hierarchy files, called htext (hierarchical text) files, which are manually created. The "ko" hierarchy file is manually created for the functional classification of genes and proteins using K numbers. Organism-specific hierarchy files are then computationally generated by converting K numbers to gene identifiers in each organism. I used the organism-specific 'mtu' file to identify the visible holes. A brief summary of the analysis for all the metabolic pathways of M.tb is given in the table below.
|Group of genes/EC/KO||Number|
|Total number of metabolic pathways in M.tb||163|
|Metabolic pathways with no protein information||62|
|Total number of unique Rv IDs involved in these pathways||814|
|Total number of KOs to which these proteins map||573|
|Total number of EC numbers to which these proteins map||478|
|Total number of proteins mapped to partial EC numbers||123|
|Total number of unique partial EC numbers to which these proteins map||29|
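The grep step described above can also be done in Python, which makes it easy to collect the Rv IDs together with their partial EC numbers. This is a sketch only: I am not reproducing the exact htext line layout, the sample lines and Rv IDs below are invented, and the two regular expressions are my own assumptions about how gene IDs and EC numbers appear in a line.

```python
import re

# EC numbers whose last level is '-', e.g. 1.1.1.- or 2.3.-.-
PARTIAL_EC = re.compile(r"\d+\.(?:\d+|-)\.(?:\d+|-)\.-")
# M.tb H37Rv locus tags such as Rv1234 or Rv1234c (assumed pattern)
RV_ID = re.compile(r"\bRv\d{4}[A-Za-z]?\b")

def partial_ec_by_gene(lines):
    """Map each Rv ID to the set of partial EC numbers found on the same line."""
    hits = {}
    for line in lines:
        ecs = PARTIAL_EC.findall(line)
        if not ecs:
            continue
        for gene in RV_ID.findall(line):
            hits.setdefault(gene, set()).update(ecs)
    return hits

# Invented lines that only imitate a gene entry with an EC annotation.
sample_lines = [
    "D  Rv9991  example entry [EC:1.1.1.-]",
    "D  Rv9992c example entry [EC:2.7.1.40]",
    "D  Rv9993  example entry [EC:2.3.-.- 3.1.3.-]",
]
hits = partial_ec_by_gene(sample_lines)
unique_partial = set().union(*hits.values())
print(len(hits), len(unique_partial))
```

Counting the keys of `hits` and the size of `unique_partial` corresponds to the last two rows of the table, once the function is run on the real 'mtu' file instead of the invented lines.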
Based on available data for M.tb
Two groups do much of the work in the area of metabolic reconstruction:
- The first is Palsson's group at UCSD, where constraint-based models are used extensively to identify missing information in various organisms. They have metabolic reconstructions for species such as Bacillus subtilis, Escherichia coli, Homo sapiens, Haemophilus influenzae, Helicobacter pylori, Methanosarcina barkeri, Mouse Cardiomyocyte, Mycobacterium tuberculosis, Saccharomyces cerevisiae (baker's yeast) and Staphylococcus aureus.
- The other is Peter Karp's group at SRI International, Bioinformatics Center, USA. They work with non-constraint-based models to find and fill gaps in partially reconstructed models, and they use genome-context-based methods on a large scale.
The current statistics from both these resources are as follows:
|Information||Number and availability|
|Model name||M.tb iNJ661, published in 2007|
|Metabolic reconstruction format||SBML format and Excel sheet format|
|Network maps||svg and jpeg format|
|Total number of genes in the model||661|
|Total number of reactions in the model||939|
|Model name||M.tb H37Rv strain, version 17.1|
|Total number of genes||3966|
|Total number of protein genes||3916|
|Total number of RNA genes||50|
|Total number of pathways||201|
|Total number of pathway holes||197|
|Total number of pathway holes filled||0|
As shown in Table 3, no pathway holes have been filled so far for M.tb H37Rv, whereas 197 holes are still present in the known pathways of the pathogen. This provides a dataset for the hole-filling analysis. These 197 enzymes (holes) were cross-checked to see whether any annotation is available for them; a table showing this comparison is here: File:Comparison BioCyc and KEGG.pdf
Based on Reaction database
Another way is to use a reaction database. Ma and Zeng developed a reaction database in 2003 using KEGG LIGAND. The database was updated and published in 2011 and is an important resource for metabolic reconstruction. It is based on KEGG LIGAND (release 44.0; KEGG is now in its 67th release) and also uses BRENDA to obtain experimental data about reactions, such as reversibility information. It includes 6851 reactions, of which 4304 are irreversible and 2547 are reversible, and 3535 EC numbers (2943 of them complete). A connection database has also been developed, which makes it possible to represent the structure of the metabolic network as a graph in a physiologically more meaningful way.
- EE Graph
The idea here is to generate an enzyme-enzyme dependency graph from this connection database, which will serve as the superset graph.
Step 1: Prepare a separate file holding the connection information: reactant, product, and the enzyme catalyzing the reaction.
Step 2: Prepare a metabolic graph in which the nodes are compounds and the edges are enzymes/reactions.
Step 3: Generate an enzyme-enzyme dependency graph that has enzymes as nodes, with an edge between two enzymes catalyzing two different reactions if the product of one reaction is used by the other.
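The three steps can be sketched as follows, assuming the connection information is available as (reactant, product, enzyme) triples. The function name, the directed-edge convention (producer to consumer) and the toy triples with placeholder enzyme labels are my own choices for illustration.

```python
from collections import defaultdict

def build_ee_graph(connections):
    """Build a directed enzyme-enzyme graph from (reactant, product, enzyme) triples.

    An edge e1 -> e2 means that a product of e1's reaction is a reactant of e2's.
    """
    producers = defaultdict(set)   # compound -> enzymes that produce it
    consumers = defaultdict(set)   # compound -> enzymes that consume it
    for reactant, product, enzyme in connections:
        consumers[reactant].add(enzyme)
        producers[product].add(enzyme)
    graph = defaultdict(set)
    for compound, upstream in producers.items():
        for e1 in upstream:
            for e2 in consumers.get(compound, ()):
                if e1 != e2:
                    graph[e1].add(e2)
    return dict(graph)

# Toy connection list with placeholder compound and enzyme labels.
connections = [
    ("cpdA", "cpdB", "E1"),
    ("cpdB", "cpdC", "E2"),
    ("cpdC", "cpdD", "E3"),
    ("cpdB", "cpdE", "E4"),
]
ee_graph = build_ee_graph(connections)
print(ee_graph)
```

Because the compound cpdB is consumed by two reactions, the enzyme E1 gets two downstream neighbors, E2 and E4, in the dependency graph.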
- List of EC from a particular organism
Take a list of all the enzymes for an organism of interest from KEGG or, if it is a newly sequenced genome, run our enzyme profiles (ModEnzA; a link to its website is here) to identify a list of enzymes.
- Identify neighbors
For each EC number in the list, identify its neighbors in the EE graph. If a neighbor is present in the EC list, there is nothing to worry about; if it is not present, it might be a missing function in the organism. This neighbor could be a partial EC number or a complete one. The result can then be validated against KEGG pathways.
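The neighbor check can be sketched as a set lookup over the EE graph. The adjacency dictionary, the placeholder enzyme labels and the function name below are hypothetical, and the sketch only follows outgoing edges of the graph it is given.

```python
def candidate_holes(ee_graph, organism_ecs):
    """Return neighbours of the organism's enzymes that are absent from its EC list.

    The result maps each candidate missing enzyme to the present enzymes
    that point to it in the EE graph.
    """
    present = set(organism_ecs)
    holes = {}
    for ec in present:
        for neighbour in ee_graph.get(ec, ()):
            if neighbour not in present:
                holes.setdefault(neighbour, set()).add(ec)
    return holes

# Toy superset graph and a toy organism that lacks E2 and E4.
ee_graph = {"E1": {"E2"}, "E2": {"E3"}, "E3": {"E4"}}
organism_ecs = ["E1", "E3"]
holes = candidate_holes(ee_graph, organism_ecs)
print(holes)
```

Keeping track of which present enzymes point to each candidate gives a rough measure of support: a candidate flagged by several neighbors is more likely to be a real hole than one flagged by a single neighbor.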
How to fill holes in metabolic pathways
A case study of serine biosynthesis in M.tb, where an enzyme (18.104.22.168) is predicted to be missing by the BioCyc database, is used to evaluate how to score candidate proteins for the best possible functional hits. A brief overview of how to do this using the available functional association scores is outlined here: File:Scoring scheme.pdf
Rough draft of initial results
I wrote the background as a rough first draft but will improve it through iterations. Referencing is incomplete. File:Paper draft0.pdf