What Is Bioinformatics Analysis?
Bioinformatics is a discipline that studies the collection, processing, storage, dissemination, analysis, and interpretation of biological information. It is a new interdisciplinary field that has emerged from the rapid development of life science and computer science, and it uses biology, computer science, and information technology together to reveal the biological meaning hidden in large, complex biological data sets.
- Book title: Bioinformatics
- Author: Hodgeman
- ISBN: 978-7-03-028873-8
- List price: 52.00 yuan
- Publisher: Science Press
- Publication date: September 1, 2010
- Format: 16K
- Pre-genomic era (before the 1990s): this stage focused on establishing sequence comparison algorithms, building biological databases, developing search tools, and analyzing DNA and protein sequences.
- Genomic era (1990s to 2001): this stage focused on large-scale genome sequencing, gene identification and discovery, the systematic construction of networked databases, and the development of interactive interface tools.
- Post-genomic era (2001 to present): with the completion of the human genome sequence and of genome sequencing for various model organisms, biological science entered the post-genomic era, and the focus of genomics research shifted from the structure of the genome to the function of genes. An important hallmark of this shift is the emergence of functional genomics; earlier work on genomics accordingly came to be called structural genomics. [2]
Introduction to Bioinformatics
- Bioinformatics [1] is a science that uses the computer as a tool to store, retrieve, and analyze biological information in life-science research. It is one of the major frontiers of the life sciences and natural sciences today and will be one of the core areas of natural science in the 21st century. Its research focus lies mainly in genomics and proteomics: specifically, starting from nucleic acid and protein sequences, it analyzes the structural and functional biological information expressed in those sequences.
Definition of Bioinformatics
- 1. Bioinformatics is a new subject concerned with the collection, analysis, and dissemination of genetic data to the research community. (Dr. Hwa A. Lim, 1987)
- 2. Bioinformatics refers to database-like activities, involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time. (Dr. Hwa A. Lim, 1994)
- 3. Bioinformatics is conceptual biology in the realm of macromolecules; it applies the techniques of informatics, including methods drawn from applied mathematics, computer science, and statistics, to understand and organize the information associated with biological macromolecules on a large scale. (Luscombe, 2001)
- Specifically, as a new subject area, bioinformatics takes the analysis of genomic DNA sequence information as its starting point: once the protein-coding regions are identified, the spatial structure of the protein is simulated and predicted, and drugs are then designed according to the function of the specific protein. Genomic informatics, protein spatial-structure simulation, and drug design constitute three important components of bioinformatics. In terms of research content, bioinformatics comprises three main parts: research on new algorithms and statistical methods; analysis and interpretation of various types of data; and development of new tools for the effective use and management of data.
- Bioinformatics is a discipline that studies the laws of biological systems using computer technology.
- Bioinformatics is, at its base, a combination of molecular biology and information technology, especially Internet technology. Its research material and results are biological data of various kinds; its research tool is the computer; and its research methods include searching (collecting and screening), processing (editing, organizing, managing, and displaying), and using (computing and simulating) biological data.
- Since the 1990s, with the progress of various genome sequencing programs, breakthroughs in molecular-structure determination technology, and the spread of the Internet, hundreds of biological databases have sprung up like mushrooms. This poses a serious challenge to bioinformatics workers: what information is contained in the billions of A, C, G, T bases? How does the information in the genome control the development of an organism? How has the genome evolved?
- Another challenge in bioinformatics is to predict the structure of a protein from its amino acid sequence. This problem has baffled theoretical biologists for more than half a century, and the need for an answer is becoming increasingly urgent. Nobel laureate W. Gilbert pointed out in 1991: "The traditional way of solving problems in biology is experimental. Now, on the basis that all genes will be known and will reside in databases in an electronically operable form, the starting point of the new model of biological research should be theoretical. A scientist will begin with theoretical conjecture and only then return to experiment to follow up or verify these theoretical hypotheses."
- The main research directions of bioinformatics run from genomics to proteomics, systems biology, and comparative genomics. In 1989, an international conference on computer modeling of biochemical systems and biomathematics was held in the United States; since then bioinformatics has developed into the era of computational biology and computational systems biology.
- Leaving aside lengthy definitions of bioinformatics, its core application can be explained in plain language: with the landmark progress of genome sequencing projects, including the Human Genome Project, biological data have been growing at an unprecedented rate, roughly doubling every 14 months. At the same time, with the spread of the Internet, hundreds of biological databases have sprung up like mushrooms. These developments, however, only provide raw biological information; this is the initial stage of the bioinformatics industry, in which most bioinformatics companies make a living by selling biological databases. Celera, best known for sequencing the human genome, is a successful example of this stage.
- The advanced stage of the bioinformatics industry goes beyond this: humanity has since entered a post-genomic era centered on bioinformatics, and new drug-discovery projects that incorporate bioinformatics are a typical application at this stage.
Development Stages of Bioinformatics
Introduction to Bioinformatics Development
- Bioinformatics is based on molecular biology, so to understand bioinformatics one must first have a basic picture of how molecular biology developed. The study of the structure and function of the biological macromolecules of the cell began very early. In 1866 Mendel, on the basis of his experiments, put forward the hypothesis that genetic factors exist as discrete biological components. Until Avery and McCarty demonstrated in 1944 that DNA is the genetic material of living organisms, chromosomal proteins were still thought to carry the genes, with DNA playing a secondary role. Around 1950, Chargaff discovered the famous Chargaff rules: in DNA, the amount of guanine always equals that of cytosine, and the amount of adenine equals that of thymine. Meanwhile, Wilkins and Franklin used X-ray diffraction to study the structure of DNA fibers. In 1953 James Watson and Francis Crick proposed the three-dimensional double-helix structure of DNA in the journal Nature: two sugar-phosphate backbones form a double helix, with the bases on the deoxyribose forming base pairs between the strands according to the Chargaff rules. This model shows that DNA has a self-complementary structure; by the base-pairing principle, the genetic information stored in DNA can be copied accurately. Their theory laid the foundation of molecular biology. The double-helix model already anticipated the rules of DNA replication: Kornberg isolated DNA polymerase I from E. coli in 1956, an enzyme that links the four dNTPs into DNA, and DNA replication requires a DNA template; Meselson and Stahl (1958) showed experimentally that DNA replication is semiconservative. In 1957 Crick proposed the law of genetic information transfer: DNA is the template for synthesizing RNA, and RNA is the template for synthesizing protein. This is called the Central Dogma, and it has played an extremely important guiding role in the later development of molecular biology and bioinformatics. Through the efforts of Nirenberg, Matthaei, and others, the genetic code for the 20 amino acids was deciphered by the mid-1960s. The discovery of restriction enzymes and the cloning of recombinant DNA laid the technical foundation for genetic engineering. It is precisely because molecular biology has so strongly driven the development of the life sciences that the emergence of bioinformatics became inevitable. In February 2001, the completion of the draft human genome sequence brought bioinformatics to a climax. With the rapid development of automated DNA sequencing, the amount of public nucleic acid sequence data in DNA databases is growing at a rate of about 10^6 bp per day, and biological information is rapidly expanding into a sea of data. There is no doubt that we are moving from an era of accumulating data to an era of interpreting data, and this huge accumulation of data often contains the seeds of potentially breakthrough discoveries. Bioinformatics is the interdisciplinary field that has emerged on this premise. Roughly speaking, its core content is to study how to understand DNA sequence, structure, evolution, and their relationship to biological function through statistical computation and analysis of DNA sequences. Its research topics involve molecular biology, molecular evolution and structural biology, statistics, computer science, and many other fields.
Bioinformatics is a very rich subject whose core is genomic informatics, including the acquisition, processing, storage, distribution, and interpretation of genomic information. The key to genomic informatics is to "read" the nucleotide sequence of the genome, that is, the exact position of every gene on the chromosomes and the function of each DNA segment; at the same time, once new gene information has been discovered, the spatial structure of the corresponding protein is simulated and predicted, and drugs are designed according to the function of the specific protein. Understanding the mechanisms that regulate gene expression is also an important part of bioinformatics, as is describing the diagnosis and treatment of human disease in terms of the roles biomolecules play in gene regulation. Its research goal is to reveal "the complexity of the genome's information structure and the fundamental laws of the genetic language", and to interpret the genetic language of life. Bioinformatics has become an important part of the development of the life sciences as a whole and stands at the forefront of life-science research.
Research directions of bioinformatics
- In just over a decade, bioinformatics has formed a number of research directions. The following briefly introduces some of the main research focuses.
Bioinformatics sequence alignment
- The basic problem of sequence alignment is to compare the similarity or dissimilarity of two or more symbol sequences. From a biological point of view, this problem covers the following tasks: reconstructing a complete DNA sequence from overlapping sequence fragments; determining physical and genetic maps from probe data obtained under various experimental conditions; traversing and comparing DNA sequences in databases; comparing the similarity of two or more sequences; searching databases for related sequences and subsequences; finding recurring patterns of nucleotides; and identifying informative components in protein and DNA sequences. Sequence alignment takes into account the biological characteristics of DNA sequences, such as local insertions and deletions (together referred to as indels) and substitutions; the objective function is to obtain the minimum weighted distance, or the maximum similarity score, over the set of mutations between sequences. Alignment methods include global alignment, local alignment, and alignment with gap penalties. Pairwise alignment usually uses a dynamic programming algorithm, which is suitable when the sequences are short; for very long sequences (such as human DNA sequences of up to 10^9 bp) this approach is impractical, and even algorithms of linear complexity struggle. Heuristic methods are therefore unavoidable, and the well-known BLAST and FASTA algorithms and their subsequent improvements arose from this need. A minimal dynamic-programming sketch follows.
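To make the dynamic-programming approach concrete, here is a minimal global-alignment sketch (Needleman-Wunsch) in Python. The scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions rather than values taken from the text.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of two sequences by dynamic programming."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCT"))  # score for two toy sequences
```

The quadratic time and memory of the table is exactly why this approach breaks down at chromosome scale, which is what motivates heuristics such as BLAST and FASTA.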
Bioinformatics protein structure alignment
- The basic problem is to compare the similarity or dissimilarity of the spatial structures of two or more protein molecules. Structure and function are closely related, and it is generally believed that proteins with similar functions have similar structures. Proteins are long chains of amino acids (AA), typically 50 to 1000-3000 residues long, and serve many functions: enzymes, storage and transport of substances, signal transduction, antibodies, and so on. The amino acid sequence intrinsically determines the three-dimensional structure of the protein, and proteins are generally described at four levels of structure. The reasons for studying and predicting protein structure are: in medicine, to understand biological function and to find targets for drug docking; in agriculture, to obtain better genetically engineered crops; and in industry, to use enzymes for synthesis. The reason for aligning protein structures directly is that the three-dimensional structure of a protein is more conserved in evolution than its primary structure and contains more information than the amino acid sequence alone. The premise of studying three-dimensional protein structure is that the amino acid sequence corresponds to a three-dimensional structure (not necessarily completely true) that can be explained physically as a minimum-energy state. Starting from the structural regularities observed in proteins of known structure, the structures of unknown proteins are predicted; homology modeling and threading both fall into this category. Homology modeling is used for highly similar proteins (more than 30% sequence identity), while threading compares proteins across different evolutionary families. However, the state of protein structure prediction still falls far short of actual needs. A structural-superposition sketch follows.
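As a small illustration of direct structural comparison, the sketch below superposes two sets of C-alpha coordinates with the Kabsch algorithm (optimal rotation via SVD) and reports the RMSD. The coordinates are synthetic; real use would read them from PDB files.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)            # center both structures
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                        # covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid an improper rotation (reflection)
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # optimal rotation
    diff = P @ R.T - Q
    return np.sqrt((diff ** 2).sum() / len(P))

# Toy example: a rotated, slightly perturbed copy of the same "fold".
rng = np.random.default_rng(0)
P = rng.normal(size=(20, 3))
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Q = P @ Rz.T + rng.normal(scale=0.05, size=P.shape)
print(round(kabsch_rmsd(P, Q), 3))  # small RMSD: essentially the same structure
```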
Bioinformatics gene recognition analysis
- The basic problem of gene recognition is, given a genomic sequence, to correctly identify the extent of the genes and their precise positions in the genome. Non-coding regions consist of introns, which are generally discarded once the protein has been formed; yet if the non-coding regions are removed experimentally, gene replication cannot be completed. Clearly, DNA as a genetic language carries information both in coding regions and in non-coding sequences, and there is as yet no general method for analyzing non-coding DNA. In the human genome, not every sequence is coding, that is, serves as a template for some protein; the coding portion accounts for only 3-5% of the total human genome sequence. Obviously, manually searching such a large sequence for genes is out of the question. Methods for detecting coding regions include measuring codon frequencies, first- and second-order Markov chains, ORF (open reading frame) detection, promoter recognition, HMMs (hidden Markov models), GENSCAN, spliced alignment, and so on. A toy ORF scan follows.
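Below is a toy open-reading-frame (ORF) scan, one of the simplest coding-region signals mentioned above. As a deliberate simplification it looks only at the three forward-strand reading frames.

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=10):
    """Return (start, end, frame) for ORFs on the forward strand only."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == START:
                j = i + 3
                while j + 3 <= len(seq):
                    if seq[j:j + 3] in STOPS:
                        if (j + 3 - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, frame))
                        break
                    j += 3
                i = j              # continue scanning after this ORF
            i += 3
    return orfs

print(find_orfs("CCATGAAACCCGGGTTTAAACCCGGGTTTAAATAGCC", min_codons=3))
```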
Bioinformatics molecular evolution
- Molecular evolution uses the similarities and differences of the same gene sequence across species to study the evolution of organisms and to build evolutionary trees. Either DNA sequences or the amino acid sequences they encode can be used, and molecular evolution can even be studied through structural alignment of related proteins, on the assumption that similar species are genetically similar. By comparison at the genome level, one can discover which features are shared between species and which differ. Early studies often used external traits, such as size, skin color, or number of limbs, as the basis for inferring evolution; with the completion of genome sequencing for many model organisms, molecular evolution can now be studied from the perspective of the entire genome. When matching genes of different species, three situations generally arise: orthologs, genes with the same function in different species; paralogs, genes with different functions within the same species; and xenologs, genes transferred between organisms by other means, for example genes injected by a virus. The method commonly used in this field is the construction of phylogenetic trees, using character-based methods (specific positions of bases in DNA or of amino acids in proteins), distance-based methods (alignment scores), and some traditional clustering methods (such as UPGMA); a small UPGMA sketch follows.
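Distance-based tree building with UPGMA amounts to hierarchical clustering with average linkage. A minimal sketch, assuming SciPy is available and using an invented distance matrix for four taxa:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

taxa = ["human", "chimp", "mouse", "chicken"]
# Hypothetical pairwise evolutionary distances (e.g., derived from alignments).
D = np.array([
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.5, 0.9],
    [0.5, 0.5, 0.0, 0.8],
    [0.9, 0.9, 0.8, 0.0],
])

# UPGMA = agglomerative clustering with average linkage on the condensed distances.
tree = linkage(squareform(D), method="average")
print(tree)  # each row: the two clusters merged, their distance, and the new cluster size

# dendrogram(tree, labels=taxa)  # uncomment (with matplotlib) to draw the tree
```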
Bioinformatics sequence contig assembly
- With current sequencing technology, each reaction can read only a sequence of roughly 500 base pairs. Shotgun sequencing of human DNA, for example, produces a large number of such short fragments (contigs), which are gradually stitched together into longer and longer contigs until the complete sequence is obtained; this is called contig assembly. At the algorithmic level, sequence assembly is an NP-complete problem. A greedy sketch follows.
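Here is a minimal greedy sketch of the stitching step: repeatedly merge the pair of fragments with the longest exact suffix-prefix overlap. Real assemblers must handle sequencing errors, repeats, and reverse complements; this toy version assumes error-free, same-strand reads.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    k = overlap(frags[i], frags[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]          # stitch the best pair
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags[0]

reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # reconstructs ATTAGACCTGCCGGAATAC
```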
Bioinformatics genetic code
- Research on the genetic code has usually held that the relationship between codons and amino acids arose from an accidental event in the history of biological evolution, was fixed in the common ancestor of modern organisms, and has persisted to this day. In contrast to this "frozen accident" theory, others have proposed selection-optimization, chemical, and historical theories to explain the genetic code. With the completion of genome sequencing for various organisms, new material has become available for studying the origin of the genetic code and testing these theories.
Bioinformatics drug design
- One of the goals of human genome research is to understand the structures, functions, and interactions of the roughly 100,000 proteins in the human body and their relationships to various human diseases, and to seek methods of treatment and prevention, including drug therapy. Drug design based on the structures of biological macromolecules and small molecules is an extremely important research area of bioinformatics. To inhibit the activity of certain enzymes or proteins, inhibitor molecules can be designed on the computer as candidate drugs using molecular alignment algorithms, starting from the known tertiary structure of the protein. The aim of this field is to discover new gene-based medicines, which promise great economic benefit. A toy similarity-based screening sketch follows.
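As a toy illustration of computational screening of candidate molecules, the sketch below ranks candidates by Tanimoto similarity of their bit-string "fingerprints" to a known inhibitor. The fingerprints here are invented bit sets; real work would derive them from molecular structures with a cheminformatics toolkit.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

known_inhibitor = {1, 4, 7, 9, 15, 22}
candidates = {
    "cand_A": {1, 4, 7, 9, 15, 23},
    "cand_B": {2, 5, 11, 30},
    "cand_C": {1, 4, 9, 15, 22, 28, 31},
}

# Rank candidates by similarity to the known inhibitor (a crude virtual screen).
ranked = sorted(candidates.items(), key=lambda kv: tanimoto(known_inhibitor, kv[1]), reverse=True)
for name, fp in ranked:
    print(name, round(tanimoto(known_inhibitor, fp), 2))
```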
Bioinformatics systems biology
- With the development of large-scale experimental technologies and the accumulation of data, studying and analyzing biological systems at the global, system level and revealing their laws of development has become another research hotspot of the post-genomic era: systems biology. Its current research content includes the simulation of biological systems (Curr Opin Rheumatol, 2007, 463-70), analysis of system stability (Nonlinear Dynamics Psychol Life Sci, 2007, 413-33), and analysis of system robustness (Ernst Schering Res Found Workshop, 2007, 69-88). Modeling languages represented by SBML (Bioinformatics, 2007, 1297-8) are developing rapidly, and methods such as Boolean networks (PLoS Comput Biol, 2007, e163), differential equations (Mol Biol Cell, 2004, 3841-62), stochastic processes (Neural Comput, 2007, 3262-92), and discrete dynamic event systems (Bioinformatics, 2007, 336-43) have been applied to system analysis. Many models borrow from the modeling of circuits and other physical systems, and many studies try to address the complexity of the system from macroscopic viewpoints such as information flow, entropy, and energy flow (Anal Quant Cytol Histol, 2007, 296-308). Of course, building theoretical models of biological systems will take a long time: although experimental observations have increased enormously, the data required to identify models of biological systems far exceed what can currently be produced. For time-series microarray data, for example, the number of sampling points is insufficient for conventional time-series modeling, and the enormous experimental cost is the main difficulty in system modeling. Methods for describing and modeling systems also need groundbreaking development. A minimal differential-equation example is sketched below.
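As a tiny example of the differential-equation modeling mentioned above, the sketch below integrates a hypothetical two-component negative-feedback loop (mRNA and protein) with SciPy; the rate constants are invented for the example.

```python
import numpy as np
from scipy.integrate import odeint

def negative_feedback(state, t, k_m=1.0, k_p=0.5, d_m=0.2, d_p=0.1, K=1.0, n=2):
    """dm/dt: Hill-repressed transcription; dp/dt: translation minus degradation."""
    m, p = state
    dm = k_m / (1.0 + (p / K) ** n) - d_m * m   # the protein represses its own mRNA
    dp = k_p * m - d_p * p
    return [dm, dp]

t = np.linspace(0, 100, 500)
traj = odeint(negative_feedback, [0.0, 0.0], t)
print(traj[-1])  # approximate steady-state mRNA and protein levels at t = 100
```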
Bioinformatics technologies and methods
- Bioinformatics is not merely an arrangement of biological knowledge or a simple application of existing knowledge from mathematics, physics, and information science. The massive data and complex background have driven rapid development of machine learning, statistical data analysis, and system-description methods in the bioinformatics setting. The huge computational load, complex noise patterns, and large quantities of time-varying data pose great difficulties for traditional statistical analysis and call for more flexible techniques such as nonparametric statistics (BMC Bioinformatics, 2007, 339) and cluster analysis (Qual Life Res, 2007, 1655-63). Analysis of high-dimensional data requires feature-space compression techniques such as partial least squares (PLS). In developing computer algorithms, the time and space complexity of the algorithm must be fully considered, and technologies such as parallel computing and grid computing can be used to extend what is practically computable. A small PLS sketch follows.
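Here is a minimal sketch of feature-space compression with partial least squares (PLS), assuming scikit-learn is available; the "expression" matrix and response are random stand-ins for real data.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))                               # 40 samples, 500 noisy features
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=40)    # response driven by 5 features

pls = PLSRegression(n_components=3)   # compress 500 features into 3 latent components
pls.fit(X, y)
scores = pls.transform(X)             # (40, 3) latent representation
print(scores.shape, round(pls.score(X, y), 3))  # R^2 on the training data
```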
Other areas of bioinformatics
- Areas such as gene expression profile analysis, metabolic network analysis, gene chip design, and proteomics data analysis have gradually become important emerging research fields in bioinformatics. In terms of disciplines, bioinformatics has given rise to structural genomics, functional genomics, comparative genomics, proteomics, pharmacogenomics, traditional Chinese medicine genomics, oncogenomics, molecular epidemiology, and environmental genomics, and has become an important research method of systems biology. From this development it is clear that genetic engineering has entered the post-genomic era. We should also be clear about issues closely tied to bioinformatics, such as machine learning, and about the ways its mathematics can mislead.
Bioinformatics research methods
- Data(base)-centered research:
- 1. Establishment of biological databases
- 2. Retrieval of biological data (see the sketch after this list)
- 3. Processing of biological data
- 4. Use of biological data: computational biology
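As an example of step 2 (retrieval), the sketch below fetches one nucleotide record from NCBI GenBank with Biopython's Entrez module. Biopython, network access, and the placeholder e-mail address and accession number are assumptions of the example.

```python
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"   # NCBI asks for a contact address

# Fetch one nucleotide record in GenBank format (the accession is just an example).
handle = Entrez.efetch(db="nucleotide", id="NM_000518", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id, len(record.seq), record.description)
```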
Bioinformatics machine learning
- Large-scale biological data have brought new problems and challenges to data mining and call for new ideas. Conventional computer algorithms can still be applied to biological data analysis, but they are increasingly ill-suited to sequence analysis problems, because of the inherent complexity of biological systems and the lack of a complete theory of the organization of life at the molecular level. Simon once defined learning as a change in a system that makes the system more effective the next time it does the same job. The purpose of machine learning is to obtain the underlying theory automatically from data; by using inference, model fitting, and learning from examples, it is particularly suited to domains that lack a general theory, have "noisy" patterns, and involve large-scale data sets. Machine learning has therefore become a feasible approach complementary to conventional methods. It makes it possible to use computers to extract useful knowledge from vast amounts of biological information, and it plays an ever more important role in large-sample, multivariate data analysis; processing large gene databases requires automatic identification and annotation by computers to avoid time-consuming and costly manual work. The early scientific method of observation and hypothesis can no longer rely on human perception alone in the face of the volume of data, the speed of data acquisition, and the demand for objective analysis, so the combination of bioinformatics and machine learning has become inevitable. The most basic theoretical framework of machine learning is probabilistic; in a sense it is a continuation of statistical model fitting, with the purpose of extracting useful information. Machine learning is closely related to pattern recognition and statistical inference; learning methods include data clustering, neural-network classifiers, nonlinear regression, and so on, and hidden Markov models are also widely used to predict the gene structure of DNA. Research focuses include: 1) Observing and exploring interesting phenomena. A focus of machine-learning research is how to visualize and explore high-dimensional vector data, generally by reducing them to a low-dimensional space, as in principal component analysis (PCA), kernel principal component analysis (KPCA), independent component analysis, and locally linear embedding. 2) Generating hypotheses and formal models to explain phenomena [6]. Most clustering methods can be viewed as fitting the vector data to a mixture of simple distributions. In bioinformatics, clustering methods have been applied to microarray data analysis, cancer-type classification, and other problems, and machine learning is also used to obtain explanations of phenomena from gene databases. Machine learning has accelerated progress in bioinformatics, but it brings problems of its own. Most machine-learning methods assume that the data follow a relatively fixed model, whereas in general, and especially in bioinformatics, the structure of the data is variable; it is therefore necessary to develop general methods that can find the internal structure of a data set without relying on assumptions about that structure. Secondly, machine-learning methods often involve "black box" operations, such as neural networks and hidden Markov models, where the internal mechanism by which a particular solution is obtained remains unclear. A small dimensionality-reduction and clustering sketch follows.
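Below is a brief sketch of the two machine-learning steps just described: project high-dimensional "expression" vectors to a low-dimensional space with PCA, then cluster them. scikit-learn is assumed and the data are simulated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Simulate two groups of samples with different mean expression over 200 genes.
group_a = rng.normal(loc=0.0, size=(30, 200))
group_b = rng.normal(loc=1.0, size=(30, 200))
X = np.vstack([group_a, group_b])

low_dim = PCA(n_components=2).fit_transform(X)        # exploration / visualization step
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(low_dim)

print(labels[:30].sum(), labels[30:].sum())  # the two groups should fall into different clusters
```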
Mathematical Problems in Bioinformatics
- Mathematics accounts for a large proportion of bioinformatics. Statistics, including multivariate statistics, is one of its mathematical foundations; probability theory and stochastic processes, such as the hidden Markov model (HMM), have important applications; other examples include operations research for sequence alignment, optimization theory for protein structure prediction and molecular docking, the topology of DNA supercoiling, and group theory applied to the genetic code and to symmetries of DNA sequences. In short, various mathematical theories have played a role, greater or lesser, in biological research. However, not every mathematical method can be carried over into bioinformatics without question; the following takes statistics and metric spaces as examples.
The Paradox of Bioinformatics Statistics
- The development of mathematics has always been accompanied by paradoxes. In the study of evolutionary trees and of clustering, the most prominent paradox concerns the mean: there are data sets in which two categories cannot be separated by conventional mean-based methods, which shows that the mean does not capture enough of the geometric properties of the data. If the data follow such an atypical distribution, the usual evolutionary-tree and clustering algorithms (such as K-means) will often lead to wrong conclusions. Statistical traps usually result from a lack of understanding of the overall structure of the data. A simple illustration is sketched below.
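Here is a quick numeric illustration of this "mean paradox": two ring-shaped classes share the same centroid, so mean-based K-means cannot separate them. scikit-learn's synthetic make_circles data is used purely as an example.

```python
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: the true classes have (almost) the same centroid.
X, y_true = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(round(adjusted_rand_score(y_true, y_pred), 2))  # close to 0: K-means fails here
```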
Bioinformatics Metric Space Hypothesis
- In bioinformatics, building evolutionary trees and clustering genes requires introducing the concept of a metric: for example, genes that are close or similar in distance are assumed to have the same function, and sequences that achieve the minimum score in an evolutionary tree are assumed to share ancestry. The premise of such a metric space is that the metric holds in a global sense; is this premise universally valid? A general description can be given as follows. Suppose the two vectors are A and B, with components a_i and b_i. Under the assumption that the dimensions are linearly independent, the metric between the two vectors can be defined as d(A, B) = sqrt( Σ_i (a_i − b_i)^2 ). This yields a Euclidean metric space, invariant under the group of orthogonal motions, which is also the description most often used in bioinformatics; that is, it assumes the variables are linearly independent. However, this assumption generally cannot describe the nature of the metric correctly, especially for high-dimensional data sets, where nonlinear correlations between the variables clearly cannot be ignored. A more general measurement formula can therefore be written, using the Einstein summation convention, as d^2(A, B) = g_ij (a_i − b_i)(a_j − b_j), where the metric tensor g_ij describes the relationships among the variables and reduces to the Euclidean case when g_ij is the identity; it is thus a more general description. The remaining question, which we are still studying, is how to describe the nonlinear correlation between variables accurately. A numeric sketch follows.
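Below is a small numeric sketch of the two formulas above: the Euclidean distance assumes an identity metric tensor, while a general (Mahalanobis-style) metric g_ij can account for correlation between variables. The covariance values are invented for the example.

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))     # g_ij = identity

def general_metric(a, b, g):
    d = a - b
    return np.sqrt(d @ g @ d)                # d^2 = g_ij (a_i - b_i)(a_j - b_j)

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])

# A hypothetical metric tensor for two strongly correlated variables:
# the inverse of their covariance matrix (Mahalanobis distance).
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
g = np.linalg.inv(cov)

print(round(euclidean(a, b), 3), round(general_metric(a, b, g), 3))
```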
Statistical Learning in Bioinformatics
- The data sets and databases faced in bioinformatics are very large, while the relevant objective functions are generally hard to define clearly. The difficulty facing bioinformatics can be described as a contradiction between the scale of the problems and their ill-posed definition; from a mathematical point of view, introducing a regularization term to improve performance is then unavoidable [7]. Statistical learning theory, Kolmogorov complexity [8], and the BIC (Bayesian information criterion) [9], all based on this idea, are briefly introduced below together with their problems. The support vector machine (SVM) is a popular method whose background is Vapnik's statistical learning theory; it classifies by maximizing the minimum margin between two data sets, and for nonlinear problems it uses a kernel function to map the data into a high-dimensional space without explicitly describing the data's properties there. Its advantage over neural networks is that the choice of hidden-layer parameters is reduced to the choice of a kernel function, which has attracted wide attention, and it has begun to receive attention in bioinformatics as well. However, selecting the kernel function is itself a rather difficult problem; at this level, choosing the optimal kernel may be only an ideal, and the SVM, like the neural network before it, may turn out to be just another bubble in machine-learning research. Kolmogorov complexity and statistical learning theory describe the nature of learning from different perspectives: the former from the viewpoint of coding, the latter from consistent convergence based on finite samples. Kolmogorov complexity is not computable, so the MDL (minimum description length) principle was derived from it; originally applicable only to discrete data, it has been extended to continuous data sets and tries to obtain the minimum description of model parameters from the coding perspective. Its drawback is that the modeling complexity is too high, making it difficult to apply to large data sets. The BIC criterion approaches the issue from the perspective of model complexity: it imposes a large penalty on high model complexity and a small one otherwise, implicitly reflecting the principle of Occam's razor, and it is widely used in bioinformatics. Its main limitations are sensitivity to the assumed parametric model and the choice of prior, and slow processing of large amounts of data. There is therefore still much room for exploration in this respect. A small kernel-SVM sketch follows.
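Here is a compact sketch of the kernel-SVM idea discussed above: a radial-basis-function kernel separates classes that are not linearly separable. scikit-learn and its synthetic two-moons data are assumptions of the example, not methods from the cited works.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: linearly inseparable in the original space.
X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # the kernel choice is the key (and hard) decision
clf.fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 3))      # typically well above 0.9 on this toy data
```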
Summary of Bioinformatics Discussion
- Human understanding of genes has progressed from studying individual genes to examining the organization and information structure of genes at the level of the whole genome, and to examining the interrelationships among genes in position, structure, and function. This requires bioinformatics to change some of its basic conceptions. This section discusses and reflects on these issues.
Bioinformatics heuristics
- Simon pointed out in his work on human cognition that when people solve problems they generally do not seek the optimal method but only a satisfactory one, because even for the simplest problems the optimal (fewest-step, most effective) solution is very hard to find. The optimal method and a satisfactory method differ greatly in difficulty: the latter does not depend on the full problem space, does not require an exhaustive search, and only needs to reach an acceptable level of solution. As mentioned earlier, when facing large-scale sequence and protein-structure data sets, good global results often cannot be obtained even with algorithms of linear complexity. It is therefore necessary to transform the solution space, or to obtain a satisfactory solution without relying on the problem's full solution space. To do so, bioinformatics still needs the further understanding of the human brain provided by artificial intelligence and cognitive science, from which better heuristic methods can be drawn.
- Dealing with problems of different scales: Marvin Minsky pointed out in his artificial intelligence research that when processing is scaled up from small to large data volumes, what is needed is often not an improved algorithm but a more fundamental change. It is like a person climbing a tree who gets a little higher every day: to reach the moon, entirely different methods must be used. In molecular biology, traditional experimental methods are no longer suited to the rapidly growing mass of data; likewise, in computer processing, existing data-mining problems cannot be solved by relying only on the original algorithms. In sequence alignment, for example, dynamic programming can be used for small-scale data, but for large-scale sequences heuristic methods such as BLAST and FASTA have to be introduced (a seed-and-extend sketch follows).
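Below is a toy "seed-and-extend" sketch of the heuristic idea behind BLAST and FASTA: index short k-mers of the database sequence, then look up the query's k-mers instead of aligning everything. Only the seeding step is shown, with invented sequences.

```python
from collections import defaultdict

def kmer_index(seq, k=4):
    """Map every k-mer to the positions where it occurs in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query, index, k=4):
    """Return (query_pos, db_pos) hits that an aligner would then extend."""
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds

database = "TTGACGTAGGCTAGCTAGGACGTTTACGGATCGATCG"
query = "GGCTAGCTAGG"

idx = kmer_index(database)
print(find_seeds(query, idx))  # exact k-mer matches anchor the (skipped) extension step
```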
Hidden Worries Behind the Optimism in Bioinformatics
- Bioinformatics is an emerging discipline that began in the 1990s and has now entered the "post-genomic era", and researchers in the field are generally optimistic. Are there hidden worries behind this optimism? In the early history of artificial intelligence, around 1960, Simon believed that within ten years humans could, much as they would complete the moon landing, complete the simulation of the human mind and build robots whose intelligent behavior was exactly like ours. That promise is still far from being fulfilled. Although the results of AI research have penetrated many fields, our understanding of human thought and behavior is far from complete. In essence this is because early AI research was wrongly positioned: it failed to see the nature of intelligence from an epistemological perspective, and reducing intelligent behavior to formal languages and rules does not fully describe human behavior; expecting the successes of the physical sciences to carry over to AI research is not realistic. By comparison, the purpose of bioinformatics is to unravel the mysteries of all organisms from their genetic sequences and to derive the physiological mechanisms of life from structure, which philosophically amounts to expecting that all human behavior, function, and disease can be explained at the molecular level. This resembles the optimism expressed in the early development of artificial intelligence, and it likewise stems from early achievements in molecular biology, biophysics, and biochemistry. In essence, however, like AI research, it hopes to reduce the mystery of life to the function of an isolated gene sequence or a single protein, with little emphasis on the regulatory role of gene sequences or proteomes acting as a whole in living organisms. We therefore have to ask whether the final results of such research can support our optimism in bioinformatics; it may be too early to answer yes.
Bioinformatics Summary
- In summary, it is not hard to see that bioinformatics is not a field that justifies unlimited optimism. The reason is that it is a new discipline formed at the intersection of molecular biology and several other disciplines, and at present it still largely appears as a simple stacking of those disciplines, not yet tightly integrated. There is no effective general method for processing large-scale data, and the internal mechanisms that generate such data are not fully understood, which makes breakthrough results in bioinformatics unlikely in the short term. In the end, a real solution cannot come from computer science alone; it may still have to draw its essential motivation from biology itself and from new mathematical ideas. There is no doubt that, as Dulbecco said in 1986, "human DNA sequences are the true essence of human beings, and everything that happens in this world is closely related to these sequences"; but to decipher this sequence and what it implies completely, we still have a long way to go.