TOOLS FOR PREDICTION AND ANALYSIS OF PROTEIN-CODING GENE STRUCTURE

GeneBuilder: An integrated computing system for protein-coding gene prediction





Motivation

         The GeneBuilder system is based on the prediction of functional signals and coding regions by different approaches, combined with similarity searches in protein and EST databases. Potential gene structure models are obtained with a dynamic programming method. The program offers several parameters for gene structure prediction and refinement. During gene model construction, the accuracy of the prediction can be improved by selecting different exon homology levels with a protein chosen from a list of homologous proteins. When homology is low, GeneBuilder is still able to predict the gene structure. The GeneBuilder system has been tested on the standard set (Burset and Guigo, 1996), with the following performance at the nucleotide level: Sensitivity 0.89 and Specificity 0.91. The total Correlation Coefficient is 0.88.
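
         The three accuracy figures can be computed from a per-nucleotide comparison of annotated and predicted coding regions. The sketch below assumes the definitions conventionally used with the Burset and Guigo benchmark, in which Specificity is TP / (TP + FP). The function name and the toy masks are illustrative and are not part of GeneBuilder.

```python
import math

def nucleotide_accuracy(actual, predicted):
    """Compare two equal-length coding masks (True = coding nucleotide).

    Returns (sensitivity, specificity, correlation_coefficient).
    """
    if len(actual) != len(predicted):
        raise ValueError("masks must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # coding, predicted coding
    tn = sum(1 for a, p in pairs if not a and not p)  # non-coding, predicted non-coding
    fp = sum(1 for a, p in pairs if not a and p)      # non-coding, predicted coding
    fn = sum(1 for a, p in pairs if a and not p)      # coding, predicted non-coding
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sensitivity, specificity, cc

# Toy example: 10 nucleotides, 4 of them annotated as coding.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [False, True, True, True, True, False, False, False, False, False]
sn, sp, cc = nucleotide_accuracy(actual, predicted)
```

         In this toy case TP = 3, FN = 1, FP = 1 and TN = 5, so Sensitivity and Specificity are both 0.75 and the Correlation Coefficient is 14/24, about 0.58.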



GeneBuilder description

         GeneBuilder is composed of many modules, each of which is executed independently. The output combines the results obtained by the automatic execution of the different programs and is shown in a separate browser window.

1. Organism (Human, Mouse, Fugu, Drosophila, C. elegans, Arabidopsis and Aspergillus).
         The GeneBuilder system can be used for human, mouse, fugu, Drosophila, C. elegans, Arabidopsis and Aspergillus sequences. This option is very important because functional signal prediction, dicodon statistics and repeated element searches are organism-specific.

2. Mode (Gene, Exon).
         The GENE option is used for predicting the full gene model. The set of Potential Coding Fragments (PCFs) is used to construct the gene models with the maximum coding potential by dynamic programming. It is also possible to use one homologous protein, selected from the list of potential homologous proteins, to refine the predicted gene model. The EXON option selects only the exons with the best scores. Two score levels are used: excellent and good. The EXON option can be useful for long genomic sequences with an unknown number of potential genes, since there is very little overprediction.
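
         As an illustration of the dynamic programming idea, the sketch below chains non-overlapping fragments so that the summed score is maximal. It is a simplified stand-in, not GeneBuilder's algorithm: it ignores reading-frame and splice-site compatibility, and the (start, end, score) tuples and the function name are hypothetical.

```python
from bisect import bisect_left

def best_gene_model(pcfs):
    """Pick the highest-scoring chain of non-overlapping fragments.

    pcfs: list of (start, end, score) with inclusive coordinates.
    Returns (total_score, chain), the chain being ordered along the sequence.
    """
    pcfs = sorted(pcfs, key=lambda f: f[1])  # order by end position
    ends = [f[1] for f in pcfs]
    # best[i] holds the best (score, chain) using only the first i fragments.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(pcfs):
        # j = number of earlier fragments that end strictly before this start.
        j = bisect_left(ends, start, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])  # skipping this fragment is at least as good
    return best[-1]

# Hypothetical fragments: the second overlaps both the first and the third.
fragments = [(10, 50, 4.0), (40, 90, 5.0), (60, 120, 3.5), (130, 200, 2.0)]
total, chain = best_gene_model(fragments)
```

         Here the overlapping fragment with the single highest score (40-90) is dropped, because the first and third fragments together score more; the selected chain has a total score of 9.5.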

3. Strand (Direct, Complement).
         By default the analysis is performed on the direct strand. If the complement option is selected, the analysis will be executed on the complementary DNA strand.
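
         A minimal sketch of what analysing the complementary strand involves, assuming that the complementary strand is read in its own 5' to 3' direction (that is, as the reverse complement of the input). The function name is illustrative.

```python
# Translation table for the four bases in upper and lower case;
# any other symbol (for example N) is left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

complement_strand = reverse_complement("ATGCCN")  # "NGGCAT"
```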

4. Sequencing error correction (Disable, Error report, Automatic correction).
         With the Automatic correction option selected, potential sequencing errors caused by frame-shifts and by substitutions in stop codons are found and corrected automatically. The predicted gene model can be substantially improved if these errors are eliminated. The Error report option generates only the report, without correcting the sequence under analysis.

5. Splice sites prediction (All, Excellent only).
         Splice sites are predicted by classification analysis combined with the weight matrix technique. With the Excellent only option selected, the program finds 95% of real splice sites, and about 15% of pseudosites are predicted as splicing signals. With the All option selected, the program finds 98% of the real splice sites, but 30-35% of all predicted sites will be false (Milanesi et al., 1993).
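
         The weight matrix part of this approach can be sketched as a log-odds scan along the sequence. The example covers only that part (not the classification analysis), and the matrix values, the uniform background and the thresholds are invented for illustration; they are not GeneBuilder's parameters.

```python
import math

# Hypothetical base frequencies for a 4-nt window around a donor site
# (columns 2 and 3 model the near-invariant "GT" at the start of the intron).
TOY_DONOR = [
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform background frequency

def window_score(window, matrix=TOY_DONOR):
    """Sum of per-position log2 odds (site model vs. background)."""
    return sum(math.log2(col[base] / BACKGROUND)
               for col, base in zip(matrix, window))

def scan(seq, threshold, matrix=TOY_DONOR):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if set(window) <= set("ACGT"):  # skip windows with ambiguous bases
            score = window_score(window, matrix)
            if score >= threshold:
                hits.append((i, score))
    return hits

strict = scan("CCGGTACC", threshold=3.0)     # fewer sites, fewer false positives
relaxed = scan("CCGGTACC", threshold=-20.0)  # more sites, more false positives
```

         Raising the threshold trades sensitivity for fewer false sites, which is the same kind of trade-off the Excellent only and All options expose.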

6. Potential coding regions (All, Good, Excellent, Key protein similarity).
         Potential coding regions are found by combining the protein coding potential, calculated from the dicodon statistic, with the splicing signals. With the All option selected, all potential coding exons (with and without similarity to key proteins) are used for gene reconstruction. In this mode, and with the GENE option, GeneBuilder will try to reconstruct a potential gene. This mode is very useful as the first step of sequence analysis, when no information about gene content is available for a query sequence. As a second step, proteins found by searching the predicted peptides against the protein database can be used to obtain a more accurate gene structure. With the Good option selected, only exons of "good" quality are used for gene reconstruction. With the Excellent option selected, only exons of "excellent" quality are used. Finally, with the Key protein similarity option selected, only exons with similarity to a selected homologous protein are used for gene reconstruction.
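
         One common way to turn dicodon (hexamer) frequencies into a coding potential is a log-likelihood ratio between coding and non-coding training sequences. Whether GeneBuilder uses exactly this form is not stated here, so the sketch below is an assumption, with illustrative function names and tiny toy training sets.

```python
import math
from collections import Counter

def dicodon_counts(seqs):
    """Count in-frame hexamers (pairs of adjacent codons) in training sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts

def coding_potential(seq, coding, noncoding, pseudo=1.0):
    """Mean log-ratio of coding vs. non-coding dicodon frequency in frame 0.

    A pseudocount is added to each of the 4096 possible dicodons so that
    unseen dicodons do not produce a division by zero.
    """
    n_cod = sum(coding.values()) + pseudo * 4096
    n_non = sum(noncoding.values()) + pseudo * 4096
    total, n = 0.0, 0
    for i in range(0, len(seq) - 5, 3):
        d = seq[i:i + 6]
        p_cod = (coding[d] + pseudo) / n_cod
        p_non = (noncoding[d] + pseudo) / n_non
        total += math.log(p_cod / p_non)
        n += 1
    return total / n if n else 0.0

# Toy training data; real tables would come from organism-specific gene sets.
coding = dicodon_counts(["ATGGCCGCCATGGCCGCC"])
noncoding = dicodon_counts(["TTTTTTTTTTTTTTTTTT"])
score = coding_potential("ATGGCCGCC", coding, noncoding)
```

         A positive score means the fragment looks more like the coding training set than the non-coding one. The need for organism-specific tables is the reason the Organism option matters for this step.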

7. First and last coding exons (Disable, Exons with high protein homology).
         With the Exons with high protein homology option selected, the first and last exons are determined more stringently (potential genes must start and finish with well-confirmed coding exons). This option is very important when several genes are present in a query sequence and the gene localization can be confirmed by homology with a chosen protein.

8. Sequence segment for coding region predictions (Start, End).
         These are the first and last positions of the segment used for coding region prediction. This option is particularly useful when analysing very long sequences containing several genes. The default values are the first and last positions of the query sequence, but they can be changed depending on the researcher's evaluation of the predicted features.

9. Complete gene model (Yes, No).
         With the Yes option selected, the program reports only the models with a complete potential gene structure, including the first and last exons. With No selected, all gene models, including partial ones, are reported.

10. Repeated element mapping (Yes, No).
         With the Yes option selected, the program predicts the repeated elements present in the sequence and masks them before the EST homology search in GeneBuilder.
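
         Masking itself is a simple operation once the repeats have been located. The sketch below assumes the repeats are given as 1-based inclusive (start, end) coordinates and replaces them with N; the coordinate convention, the mask character and the function name are assumptions, and finding the repeats is not shown.

```python
def mask_repeats(seq, repeats, mask_char="N"):
    """Replace every position covered by a repeat interval with mask_char.

    repeats: iterable of (start, end), 1-based and inclusive. Intervals may
    overlap, and parts that fall outside the sequence are ignored.
    """
    chars = list(seq)
    for start, end in repeats:
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)

masked = mask_repeats("ACGTACGTAC", [(3, 5)])  # "ACNNNCGTAC"
```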

11. EST mapping (Yes, No).
         With the Yes option selected, GeneBuilder performs a homology search against the EST database and reports the positions of the homologous EST sequences relative to the query sequence. The repeated elements can be automatically masked before the homology search. This information is used only when the similarity between the EST sequences and the query sequence is greater than 95%. The ESTMAP module is also able to predict introns in the DNA by comparing ESTs with the query sequence, based on the method described by Mott (1997).
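
         The 95% cut-off can be pictured as a filter on pairwise alignments. The sketch below assumes similarity is measured as percent identity over all alignment columns (gaps counted as mismatches); the text does not specify the measure, and the function names and hit tuples are hypothetical.

```python
def percent_identity(aligned_query, aligned_est):
    """Percent of alignment columns where both sequences carry the same base.

    Both strings must be the two rows of one pairwise alignment, with "-"
    marking gaps. Gap columns count as mismatches.
    """
    if len(aligned_query) != len(aligned_est):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        return 0.0
    matches = sum(1 for q, e in zip(aligned_query, aligned_est)
                  if q == e and q != "-")
    return 100.0 * matches / len(aligned_query)

def usable_est_hits(hits, min_identity=95.0):
    """Keep the names of hits whose identity is strictly above min_identity.

    hits: iterable of (name, aligned_query, aligned_est).
    """
    return [name for name, q, e in hits
            if percent_identity(q, e) > min_identity]

hits = [
    ("est_a", "ACGT" * 10, "ACGT" * 9 + "ACGA"),  # 39 of 40 columns match
    ("est_b", "ACGT" * 5, "ACGT" * 4 + "AC-T"),   # 19 of 20 columns match
]
kept = usable_est_hits(hits)
```

         With these toy alignments only est_a is kept: 97.5% identity passes, while exactly 95.0% does not, because the condition is "greater than 95%".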

12. TATA box prediction (Good, Marginal).
         This module is based on the Hamming-Clustering method (Milanesi et al., 1996) for TATA-box prediction. With the Good option, only the better matches are presented as potential TATA-boxes. With the Marginal option, all potential TATA-boxes are reported. Because the precise location of this signal is difficult to find, it is better to predict the general model of the gene first and then to determine the true TATA-box position upstream of the first CDS.

13. POLY-A site prediction (PolyA pattern length).
         This module is based on the Hamming-Clustering method (Milanesi et al., 1996) for poly-A signal prediction. The option PolyA pattern length is used for increasing or decreasing the pattern discrimination. For this signal we suggest predicting the general model of the gene and then determining the true Poly-A position downstream from the last CDS.

14. Search for potential binding sites of transcription factors (Vertebrates, Fungi, Insects, Plants, Miscellaneous, All).
         By selecting the Vertebrates, Fungi, Insects, Plants or Miscellaneous option, it is possible to search the input sequence with the appropriate group of weight matrices. The matrix group for vertebrate genomes is used by default. It is also possible to choose individually the matrices to be used.

15. Mail to (Yes, No).
         GeneBuilder results in text format can be sent by e-mail. This is useful when analysing long sequences or when working over a poor network connection. In this case the e-mail address is mandatory.

Availability

WEBGENE





References



  • Milanesi L., D'Angelo D., Rogozin I.B. GeneBuilder: interactive in silico prediction of gene structure. Bioinformatics, 1999, Jul; 15 (7):612-621.

  • Milanesi L. and Rogozin I.B. Prediction of human gene structure. In: Guide to Human Genome Computing (2nd ed.) (Ed. M.J.Bishop) Academic Press, Cambridge, 1998, 215-259.