What Is Genome Annotation?

Genome annotation is the high-throughput assignment of biological functions to all genes in a genome using bioinformatics methods and tools. It is an active area of functional genomics research.

1.1.1 Development Environment
The system runs on a PC under Linux. The test machine was a dual-CPU Pentium III (550 MHz) computer with 1 GB of memory, running RedHat Linux 7.1.0. MySQL serves as the database management system, Apache as the web server, and the application programming interface is written in Perl. The system can also run on a single-CPU machine with at least 512 MB of memory. All system and application software is freely available on the Internet.
1.1.2 Test Data
The test data was the largest contig obtained from the preliminary assembly of the genome of the cyanobacterium Synechococcus sp. PCC 7002, 303,247 bp in total.
1.1.3 MGAP's Genome Annotation System
The genome annotation system is the core of MGAP. It integrates commonly used software for gene identification and protein function prediction, including GeneMarkS, IPRsearch, BLASTPGP, and FASTA3, together with several databases: the non-redundant protein sequence database (NR), the database of protein sequences with known three-dimensional structures (PDBSeq), the InterPro integrated resource of protein families, domains, and sites [6], and the Clusters of Orthologous Groups database (COG). Modules were written to run each tool automatically and to import the results of each step into the database. MGAP's general modules can be applied directly to any other microbial genome, and individual laboratories can add modules or data to suit their research, for example the protein sequence library of the cyanobacterium Anabaena sp. strain PCC 7120.
Gene identification is the first step in MGAP. The system performs gene prediction with GeneMarkS, which is among the most authoritative tools for gene identification in microbial genomes. The contig test sequence (303,247 bp) was submitted through the website http://opal.biology.gatech.edu/GeneMark/genemarks.cgi. With default parameters, GeneMarkS predicted 279 genes.
MGAP's data loading module (Loaddata) then imports the prediction results into the ORF table.
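
The loading step can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration: the tab-separated input layout, the orf table schema, and the function names are hypothetical, Python's built-in sqlite3 stands in for MySQL, and the real Loaddata module is written in Perl.

```python
import sqlite3

# Hypothetical tab-separated prediction lines: gene id, strand, start, end.
# The real GeneMarkS output format differs; this layout is assumed.
PREDICTIONS = "1\t+\t120\t1019\n2\t-\t1100\t1867\n3\t+\t2001\t2450\n"


def parse_predictions(text):
    """Yield (gene_id, strand, start, length) tuples from the assumed format."""
    for line in text.splitlines():
        if not line.strip():
            continue
        gene_id, strand, start, end = line.split("\t")
        start, end = int(start), int(end)
        # Coordinates are treated as 1-based and inclusive.
        yield int(gene_id), strand, start, end - start + 1


def load_orfs(conn, rows):
    """Create the (hypothetical) orf table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orf ("
        "gene_id INTEGER PRIMARY KEY, strand TEXT, start INTEGER, length INTEGER)"
    )
    conn.executemany("INSERT INTO orf VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
load_orfs(conn, parse_predictions(PREDICTIONS))
orf_count = conn.execute("SELECT COUNT(*) FROM orf").fetchone()[0]
print(orf_count)
```

Storing the start position, length, and strand for each ORF keeps exactly the fields that the circular map and ORF pages later need.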
1.1.4 MGAP User Interface
The user interface displays the annotation results and provides a platform that is easy to operate and convenient for analysis. It is web-based: users access the genome annotation system through a browser to view the circular genome map, view the distribution of genes and ORFs on the chromosome, and retrieve annotation information. The circular gene distribution map is built from the following information for each predicted gene: its start position and length, the strand (positive or negative) that encodes it, and its predicted functional category.
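
As a rough illustration of how a start position, length, and strand can be turned into a position on a circular map, the sketch below converts one gene into an angular span. The function name, the track labels, and the choice of degrees measured from position 1 are assumptions for this example, not details taken from MGAP.

```python
def gene_arc(start, length, strand, genome_length):
    """Map one gene to (track, start_angle, end_angle) in degrees.

    start is 1-based; angles are measured from the map origin (position 1).
    """
    start_angle = 360.0 * (start - 1) / genome_length
    end_angle = 360.0 * (start - 1 + length) / genome_length
    track = "positive" if strand == "+" else "negative"
    return track, start_angle, end_angle


# Hypothetical genes on a 1,000 bp sequence.
print(gene_arc(1, 250, "+", 1000))    # ('positive', 0.0, 90.0)
print(gene_arc(501, 250, "-", 1000))  # ('negative', 180.0, 270.0)
```

A drawing layer would then render each span on the track for its strand and color it by functional category.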
2 Results
Figure 1 shows the MGAP annotation results for the PCC 7002 contig test sequence. A is the gene display map, and B is the ORF display page. From the inside out, the rings in A are: (1) genes encoded on the positive strand; (2) genes encoded on the negative strand; (3) GC content; (4) GC skew. The circular genome map built by the system displays the coding genes on the positive and negative strands, with functional categories indicated by color. The system follows the classic protein functional classification method [8], in which all genes of a microbial genome are divided into 16 categories by function and further subdivided into 113 subclasses. GC content and GC skew statistics have also been added. A 200 bp sliding window was used to calculate the GC content, and a 13 kb sliding window was used to calculate the GC skew. The GC skew measures the difference between the G and C content and is defined as (G - C) / (G + C) [9]. Clicking on the circular genome map in A opens the local ORF display page B for that region of the genome. Clicking an ORF on that page brings up all of its annotation information: its position, length, strand, and nucleic acid and protein sequences, as well as the search results against the NR protein library, the COG database, InterPro, and the PDBSeq database. All results link back to the original databases.
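
The two sliding-window statistics can be sketched as follows. This is a minimal illustration, not MGAP's code: the function names are hypothetical, the step between windows is an assumption (the text gives only window sizes), and a toy sequence with a 4-base window replaces the 200 bp and 13 kb windows.

```python
def gc_content(window):
    """Fraction of G and C bases in a window (0.0 for an empty window)."""
    window = window.upper()
    if not window:
        return 0.0
    return (window.count("G") + window.count("C")) / len(window)


def gc_skew(window):
    """GC skew, (G - C) / (G + C); 0.0 when the window has no G or C."""
    window = window.upper()
    g, c = window.count("G"), window.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding(seq, size, step, func):
    """Apply func to each full window of `size` bases, advancing by `step`."""
    return [func(seq[i:i + size]) for i in range(0, len(seq) - size + 1, step)]


seq = "GGGCATATCCCG"  # toy sequence: windows GGGC, ATAT, CCCG
print(sliding(seq, 4, 4, gc_content))  # [1.0, 0.0, 1.0]
print(sliding(seq, 4, 4, gc_skew))     # [0.5, 0.0, -0.5]
```

For the test contig, the same helper would be called with a window size of 200 for GC content and 13000 for GC skew, with whatever step the plot resolution calls for.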
3 Discussion
Functional annotation of newly sequenced genomes is an important part of genomic research. MGAP integrates annotation software with public databases, automates the annotation process, stores the results in a database system, and presents them through a friendly interface. The result is a practical microbial genome annotation system suited to small and medium-sized laboratories, one that reduces manual work and improves annotation efficiency. The system takes into account the practical circumstances of typical small and medium-sized laboratories in China: it is built on inexpensive PCs and the free Linux, MySQL, Apache, and Perl software.
It must be pointed out that no current computational annotation can be guaranteed to be completely accurate. MGAP relies to some extent on the annotation in existing databases, which for various reasons contains some errors, and these errors will inevitably be carried into the new annotation. For this reason, MGAP integrates several annotation methods that complement one another. For example, if an ORF has similar sequences in the NR database by BLASTP search, has corresponding functional sites in the InterPro protein motif library, and also has a high-scoring COG match, its annotation is more reliable. In addition, manual annotation where necessary can avoid or correct the errors of automatic annotation. For example, a frameshift or deletion caused by a sequencing error can split a gene into two segments, and such errors can currently only be corrected manually. Genome annotation is a complicated and tedious process that requires extensive biological knowledge, and detailed, accurate annotation ultimately requires rigorous biological experiments. Many genes in the annotation of the test sequence still have unknown functions, and the annotation needs to be expanded and updated as new data become available. The next version of MGAP will add an interactive user annotation module to further extend the system's annotation capabilities.
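
The idea that agreement among independent searches makes an annotation more reliable can be sketched as a simple evidence count. The function name and the "high", "medium", "low", and "unknown" labels are hypothetical; the text describes the principle but not a scoring scheme.

```python
def support_level(blast_hit, interpro_hit, cog_hit):
    """Label an ORF by how many independent sources support its annotation."""
    n = sum([bool(blast_hit), bool(interpro_hit), bool(cog_hit)])
    # Hypothetical labels: more agreeing sources means more confidence.
    return {3: "high", 2: "medium", 1: "low", 0: "unknown"}[n]


# An ORF with an NR BLASTP hit, an InterPro match, and a COG match.
print(support_level(True, True, True))     # high
# An ORF supported only by a BLASTP hit.
print(support_level(True, False, False))   # low
```

ORFs labeled "low" or "unknown" under such a scheme would be natural candidates for manual review.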
