The FuncPatch Server

1. Introduction

FuncPatch infers conserved functional patches in protein tertiary structures. Unlike many other web servers designed for the same purpose, FuncPatch explicitly combines information from a protein alignment and a protein tertiary structure to infer clusters of conserved sites accurately. Given a protein alignment and a representative protein tertiary structure, FuncPatch estimates the site-specific substitution rate at each amino acid site. The estimated site-specific substitution rates measure the conservation levels of amino acid sites: a very low site-specific substitution rate implies that the corresponding site has been highly conserved over its evolutionary history and may be functionally important.

2. Methodology

FuncPatch assumes that site-specific substitution rates may be spatially correlated in protein tertiary structures. This assumption is based on the observation that functional sites tend to be physically close to each other in protein tertiary structures and form functional patches. Therefore, closely located sites are more likely to have similar substitution rates than distantly located sites. To model the spatial correlation of substitution rates, FuncPatch first uses a parsimony method to estimate the most parsimonious number of substitutions for each site in the user-provided alignment. FuncPatch then estimates the site-specific substitution rates using the algorithm described in the original FuncPatch paper. The detailed pipeline is described below.

FuncPatch Pipeline

Figure 1. The pipeline of the FuncPatch server. The red ellipses correspond to the inputs and output of FuncPatch; the yellow rectangles correspond to the intermediate steps; the yellow diamonds correspond to the decisions. Details of the input formats and options are described in the following section. The core algorithm of FuncPatch (the step marked with a "★") combines a Gaussian prior distribution and a Poisson likelihood function to infer the site-specific substitution rates. More details of the algorithm can be found in the original FuncPatch paper.
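
To make the two ingredients of the core step more concrete, the following is a minimal sketch, not the FuncPatch implementation. It assumes a squared-exponential covariance between sites (parameterized by a length scale and a signal standard deviation), a prior placed on log-rates, and a hypothetical `exposure` factor standing in for the amount of evolutionary time. None of these details are specified on this page; please refer to the original paper for the actual model.

```python
import math

def sq_exp_cov(coords, length_scale, signal_sd):
    """Prior covariance between sites from their 3D coordinates.

    ASSUMPTION: a squared-exponential kernel; the kernel actually used by
    FuncPatch is not stated here.
    """
    n = len(coords)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between site i and site j.
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            cov[i][j] = signal_sd ** 2 * math.exp(-d2 / (2.0 * length_scale ** 2))
    return cov

def poisson_log_lik(counts, log_rates, exposure=1.0):
    """Sum of Poisson log-probabilities of per-site substitution counts.

    ASSUMPTION: the expected count at a site is exposure * exp(log_rate).
    """
    total = 0.0
    for k, z in zip(counts, log_rates):
        mean = exposure * math.exp(z)
        total += k * math.log(mean) - mean - math.lgamma(k + 1)
    return total
```

In this sketch, sites that are close in space receive a covariance near `signal_sd ** 2`, while distant sites receive a covariance near zero, which is one way to encode the idea that nearby sites tend to have similar rates.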

3. Input data and options

A good input dataset is probably the most important thing in bioinformatics analyses. The input data should contain enough information to answer the question you are asking. No computational method can give you meaningful results based on nonsensical data (garbage in, garbage out). In this section, we describe how to prepare a good input dataset and the details of the input options.

(1) Protein alignment

You must submit a protein alignment. Currently, the FuncPatch server supports only alignments in FASTA format, and the sequence IDs must NOT contain white space. The quality of the alignment is critical to the success of the analyses. To maximize the power of FuncPatch and to avoid misleading results, we strongly recommend that you check the protein alignment carefully before uploading it. First, ensure that the proteins in the alignment are homologs with similar functions and/or biochemical activities. At the same time, the sequences in the alignment should not be too similar to each other: if they are, the alignment contains few substitutions, and neither FuncPatch nor any similar method can infer site-specific substitution rates reliably.
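
The checks above can be run locally before uploading. The following is a minimal sketch, not part of FuncPatch; the function names are hypothetical, and it assumes that "-" is the only gap character. It reports IDs that contain white space and sequences of unequal length, and computes the mean pairwise identity as a rough indication of how similar the sequences are. No identity threshold is given on this page, so the interpretation of that number is left to you.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (sequence_id, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], []])
        elif records:
            records[-1][1].append(line)
    return [(name, "".join(parts)) for name, parts in records]

def check_alignment(records):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    ids = [name for name, _ in records]
    for name in ids:
        if not name or any(ch.isspace() for ch in name):
            problems.append(f"ID is empty or contains white space: {name!r}")
    if len(set(ids)) != len(ids):
        problems.append("duplicate sequence IDs")
    if len({len(seq) for _, seq in records}) > 1:
        problems.append("sequences differ in length, so this is not an alignment")
    return problems

def mean_pairwise_identity(records):
    """Average fraction of identical residues over all sequence pairs.

    Columns with a gap ("-") in either sequence are ignored.
    """
    seqs = [seq for _, seq in records]
    total, pairs = 0.0, 0
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            cols = [(a, b) for a, b in zip(seqs[i], seqs[j]) if a != "-" and b != "-"]
            if cols:
                total += sum(a == b for a, b in cols) / len(cols)
                pairs += 1
    return total / pairs if pairs else float("nan")
```

A mean identity close to 1.0 suggests that the alignment may contain too few substitutions to be informative.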

(2) Phylogenetic tree

A phylogenetic tree describes the evolutionary relationships among a set of sequences. FuncPatch uses a phylogenetic tree to guide the estimation of site-specific substitution rates. If you are comfortable building a phylogenetic tree yourself, we recommend that you do so: click "Yes" in Input (2) and upload your tree to the FuncPatch server. If you are not familiar with phylogenetic software, click "No" in Input (2) and FuncPatch will construct a phylogenetic tree automatically. Note that the automatically reconstructed tree is generally of lower quality than a tree built with state-of-the-art phylogenetic software. If you decide to upload your own tree, please ensure that it is a binary tree in Newick format and that the names of its leaves exactly match the sequence IDs in the protein alignment.
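
If you upload your own tree, the two requirements above can be checked beforehand. The following is a minimal sketch with hypothetical function names, not the check performed by the server. It handles only unquoted labels and optional branch lengths, and it interprets "binary" as exactly two children at every internal node, including the root; whether the server also accepts other forms is not stated here.

```python
def parse_newick(text):
    """Parse a Newick string into nested (label, children) tuples."""
    text = text.strip()
    if text.endswith(";"):
        text = text[:-1]
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        # Drop the branch length (the part after ":"), if any.
        label = text[start:pos].split(":")[0].strip()
        return (label, children)

    root = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing text in Newick string")
    return root

def leaf_names(node):
    """Collect the labels of all leaves below a node."""
    label, children = node
    if not children:
        return [label]
    return [name for child in children for name in leaf_names(child)]

def is_binary(node):
    """True if every internal node has exactly two children."""
    _, children = node
    if not children:
        return True
    return len(children) == 2 and all(is_binary(child) for child in children)

def tree_matches_alignment(newick, alignment_ids):
    """True if the tree is binary and its leaves equal the alignment IDs."""
    tree = parse_newick(newick)
    return is_binary(tree) and sorted(leaf_names(tree)) == sorted(alignment_ids)
```

For example, `tree_matches_alignment("((A:0.1,B:0.2):0.05,(C:0.3,D:0.1):0.02);", ["A", "B", "C", "D"])` passes both checks, whereas a tree with a three-way split or a misspelled leaf name does not.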

(3) PDB file or PDB ID

A protein tertiary structure is essential to FuncPatch, because FuncPatch assumes that site-specific substitution rates are spatially correlated in the protein tertiary structure. FuncPatch accepts either a PDB file or a PDB ID. After you provide a PDB file or ID, a list of the chains in the PDB structure will appear on the web page. Choose the chain that corresponds to a sequence in your protein alignment.

(4) Query sequence ID

The query sequence ID in Input (4) corresponds to a sequence in the alignment which is effectively identical to the sequence of the selected PDB chain. Note that this input option is not shown in Figure 1 for simplicity.
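
One quick way to see whether a candidate query sequence is close to the selected chain is to remove the gap characters from the aligned sequence and compare it with the chain sequence. The following is a minimal sketch with hypothetical function names. It assumes a simple position-by-position comparison of equal-length sequences, which may be stricter than the matching performed by the server.

```python
def ungapped(aligned_seq, gap_chars="-."):
    """Remove alignment gap characters and convert to upper case."""
    return "".join(ch for ch in aligned_seq if ch not in gap_chars).upper()

def identity_to_chain(aligned_query, chain_seq):
    """Fraction of identical positions between the query and the PDB chain.

    Returns None if the query is empty or the ungapped lengths differ,
    because a position-by-position comparison is then not meaningful.
    """
    query = ungapped(aligned_query)
    chain = chain_seq.upper()
    if not query or len(query) != len(chain):
        return None
    return sum(a == b for a, b in zip(query, chain)) / len(query)
```

A value of 1.0 means that the ungapped query is identical to the chain sequence.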

(5) Genetic code option

FuncPatch uses the PROTPARS program in PHYLIP to infer the most parsimonious number of substitutions for each site. To do so, you must specify the genetic code that corresponds to the protein alignment. FuncPatch supports five genetic codes:
• Standard (NCBI transl_table=1)
• Universal mitochondrial (NCBI transl_table=4)
• Vertebrate mitochondrial (NCBI transl_table=2)
• Invertebrate mitochondrial (NCBI transl_table=5)
• Yeast mitochondrial (NCBI transl_table=3)

Please make sure that the genetic code is the correct one for your alignment.

(6) Surface residue option

You can perform the analyses on either all residues in the PDB chain or only the surface residues. Restricting the analyses to surface residues can be useful because surface residues are more likely to be functional. Using only surface residues can also speed up the analyses. A residue is considered to be on the protein surface if its relative solvent accessibility is greater than 20%.
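
The surface criterion can be expressed as a simple filter. The following is a minimal sketch, assuming that the relative solvent accessibility of each residue has already been computed with a tool of your choice and is given as a fraction between 0 and 1. The function name and the residue labels are hypothetical.

```python
def surface_residues(rsa_by_residue, cutoff=0.20):
    """Keep residues whose relative solvent accessibility exceeds the cutoff."""
    return [res for res, rsa in rsa_by_residue.items() if rsa > cutoff]
```

Because the comparison is strict, a residue with a relative solvent accessibility of exactly 20% is treated as buried in this sketch.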

4. Output format

After you successfully submit your job to the FuncPatch server, you will be redirected to a waiting page automatically. Please bookmark this page if you want to return to the result page later. A single dataset typically takes from seconds to minutes to analyze. When the computation is finished, the waiting page redirects to the output page, whose contents are described below.

The output consists of two sections. The first section (Inputs and User Options) shows the input files and the key options you specified. The second section contains the main results: one figure and two tables. Figure 1 on the result page visualizes the most conserved sites predicted by FuncPatch. By default, the 10% most slowly evolving sites are highlighted in blue. You can change the number of highlighted sites in this figure. Table 1 shows the estimated hyperparameters and the log Bayes factor. FuncPatch has two hyperparameters: the characteristic length scale and the signal standard deviation. The log Bayes factor is used to test whether the spatial correlation is significant. If it is large, e.g. greater than 8, the spatial correlation of substitution rates may be significant in the dataset. Table 2 lists all sites with their estimated site-specific substitution rates and 50% credible intervals. The 50% credible intervals indicate how reliable the estimated substitution rates are: a narrower credible interval suggests that the estimated substitution rate at the corresponding site is more reliable.
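
If you copy the values from Table 1 and Table 2 into your own scripts, the highlighting rule and the log Bayes factor guideline can be reproduced roughly as follows. This is a minimal sketch with hypothetical function names and site labels. How the server rounds the number of highlighted sites is not stated here, so the rounding below is an assumption.

```python
def slowest_sites(rates, fraction=0.10):
    """Return the site labels with the lowest estimated substitution rates.

    `rates` maps a site label to its estimated rate.
    ASSUMPTION: the number of sites kept is rounded to the nearest integer,
    with at least one site kept.
    """
    n_keep = max(1, round(fraction * len(rates)))
    ranked = sorted(rates, key=rates.get)
    return ranked[:n_keep]

def spatial_correlation_supported(log_bayes_factor, threshold=8.0):
    """Apply the rule of thumb that a log Bayes factor above 8 is large."""
    return log_bayes_factor > threshold
```

The threshold of 8 is the example value given above and should be read as a guideline rather than a strict cutoff.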