Bioinformatics: Its role in Drug Discovery and Development

Tuesday, April 5, 2011


INTRODUCTION:

Bioinformatics is the discipline of quantitative analysis of information relating to biological macromolecules with the aid of computers. The development of bioinformatics as a field is the result of advances in both molecular biology and computer science over the past 30–40 years.
The earliest bioinformatics efforts can be traced back to the 1960s, although the word bioinformatics did not exist then. Probably the first major bioinformatics project was undertaken by Margaret Dayhoff, who in 1965 developed the first protein sequence database, called the Atlas of Protein Sequence and Structure. Subsequently, in the early 1970s, the Brookhaven National Laboratory established the Protein Data Bank for archiving three-dimensional protein structures. At its onset, the database stored fewer than a dozen protein structures, compared to more than 30,000 structures today. The first sequence alignment algorithm was developed by Needleman and Wunsch in 1970. This was a fundamental step in the development of the field of bioinformatics, and it paved the way for the routine sequence comparisons and database searching practiced by modern biologists.

  The fundamental reason that bioinformatics gained prominence as a discipline was the advancement of genome studies that produced unprecedented amounts of biological data. The explosion of genomic sequence information generated a sudden demand for efficient computational tools to manage and analyze the data. The development of these computational tools depended on knowledge generated from a wide range of disciplines including mathematics, statistics, computer science, information technology, and molecular biology. The merger of these disciplines created an information-oriented field in biology, which is now known as bioinformatics.

  As per the National Center for Biotechnology Information (NCBI), bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline.

  The science of Bioinformatics, which is the merging of molecular biology with computer science, is essential to the use of genomic information in understanding human diseases and in the identification of new molecular targets for drug discovery. In the area of drug discovery, bioinformatics is being increasingly used to support target validation by providing functionally predictive information mined from databases and experimental datasets using a variety of tools. The predictive power of these tools becomes strongest when information from several techniques is combined, including experimental confirmation of predictions.
IMPORTANT TOOLS OF BIOINFORMATICS USED IN DDD:
A number of bioinformatics tools aid the process of drug discovery. The tools most relevant to drug discovery fall into the following classes:

A.    Methods for Sequence Alignment
B.     Methods for Structure Prediction
C.     Phylogenetic Analysis

Sequence Alignment Methods:

There are three different types of sequence alignment:
·         Global alignment
·         Local alignment
·         Multiple sequence alignment



GLOBAL ALIGNMENT:

This method gives the best alignment over the entire length of two sequences. The Needleman-Wunsch algorithm is the standard way to compute an optimal global alignment.
This algorithm involves three steps:
1.      Initialization
2.      Matrix Fill
3.      Trace Back
This algorithm is based on the DYNAMIC PROGRAMMING method. The score of the resulting alignment gives a simple measure of the similarity between the two sequences.
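The three steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not a production aligner: the scoring values (match = 1, mismatch = -1, gap = -1) and the function name needleman_wunsch are assumptions chosen for the example, and real tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). The scoring scheme is an
    illustrative assumption, not a recommended parameter set.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # 1. Initialization: first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # 2. Matrix fill: each cell takes the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # 3. Trace back from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling needleman_wunsch("GATTACA", "GATACA") returns the optimal score together with the two gapped strings. When several alignments share the same score, this sketch reports only one of them.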

LOCAL ALIGNMENT:
The Needleman-Wunsch algorithm creates a global alignment. That is, it tries to take all of one sequence and align it with all of a second sequence. Short and highly similar subsequences may be missed in the alignment because they are outweighed by the rest of the sequence. Hence, one would like to create a locally optimal alignment. The Smith-Waterman algorithm finds the pair of subsequences that gives the maximum degree of similarity between the two original sequences. This means that parts of the sequences may be left unaligned.
Only minimal changes to the Needleman-Wunsch algorithm are required. These are:
  • A negative score/weight must be given to mismatches.
  • Zero must be the minimum score recorded in the matrix.
  • The beginning and end of an optimal path may be found anywhere in the matrix - not just the last row or column.
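The three changes listed above can be sketched as follows. This is a minimal illustration under assumed scoring values (match = 2, mismatch = -1, gap = -1); the function name smith_waterman is chosen for the example only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of strings a and b.

    Returns (best_score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair_score(i, j):
        # Mismatches receive a negative score.
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            # Zero is the minimum score recorded in the matrix.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            # The optimal path may end anywhere, so track the best cell.
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With two sequences that share only the stretch GATTACA, for example smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC"), the flanking residues are left out of the reported alignment.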
A very powerful tool for the local alignment of sequences is the Basic Local Alignment Search Tool (BLAST), which is maintained by the NCBI.
There are various types of BLAST programs:
S.No.   Type      Remarks
01.     blastn    Search a nucleotide database using a nucleotide query
02.     blastp    Search a protein database using a protein query
03.     blastx    Search a protein database using a translated nucleotide query
04.     tblastn   Search a translated nucleotide database using a protein query
05.     tblastx   Search a translated nucleotide database using a translated nucleotide query
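The choice among the five programs follows directly from the type of the query and the type of the database. The lookup below is only a sketch of that decision; choose_blast_program is a hypothetical helper written for this post and is not part of the BLAST software.

```python
def choose_blast_program(query, database, translated=False):
    """Pick a BLAST program name from the query and database types.

    query and database are "nucleotide" or "protein". The translated flag
    only matters for nucleotide-versus-nucleotide searches, where it
    selects tblastx (both sides translated) instead of blastn.
    Hypothetical helper for illustration only.
    """
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if translated else "blastn"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    try:
        return programs[(query, database)]
    except KeyError:
        raise ValueError(f"unknown combination: {query!r}, {database!r}")
```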


MULTIPLE SEQUENCE ALIGNMENT:

This involves the simultaneous alignment of more than two sequences. A very important tool for this purpose is ClustalW2, which is provided by the EBI.
It can be accessed freely at www.ebi.ac.uk/clustalw


Protein Structure Prediction Methods:

Determining Protein Structure

Traditionally, a protein's structure was determined using one of two techniques: X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.
The Advent of Computational Modeling
Researchers have been working for decades to develop procedures for predicting protein structure that are not so time consuming and that are not hindered by size and solubility constraints. To do this, researchers have turned to computers for help in predicting protein structure from gene sequences, a concept called homology modeling. The complete genomes of various organisms, including humans, have now been decoded and allow researchers to approach this goal in a logical and organized fashion.
Before going into more detail about the process, it is essential to understand the key terms used here:
  • Folding motifs are independent folding units, or particular structures, that recur in many molecules.
  • Domains are the building blocks of a protein and are considered elementary units of molecular function.
  • Families are groups of proteins that demonstrate sequence homology or have similar sequences.
  • Superfamilies consist of proteins that have similar folding motifs but do not exhibit sequence similarity.
It is theorized that proteins that share a similar sequence generally share the same basic structure. Therefore, by experimentally determining the structure for one member of a protein family, called a target, researchers have a model on which to base the structure of other proteins within that family. Moving a step further, by selecting a target from each superfamily, researchers can study the universe of protein folds in a systematic fashion and outline a set of sequences associated with each folding motif. Many of these sequences may not demonstrate a resemblance to one another, but their identification and assignment to a particular fold is essential for predicting future protein structures using homology modeling.
The scientific basis for these theories is that a strong conservation of protein three-dimensional shape across large evolutionary distances—both within single species, between species, and in spite of sequence variation—has been demonstrated again and again. Although most scientists choose high-priority structures as their targets, this theory provides the option to choose any one of the proteins within a family as the target, rather than trying to achieve experimental results using a protein that is particularly difficult to work with using crystallographic or NMR techniques.
Specific tasks must be carried out to maximize results when determining protein structure using homology modeling.
First, protein sequences must be organized in terms of families, preferably in a searchable database, and a target must be selected. Protein families can be identified and organized by comparing protein sequences derived from completely sequenced genomes. Targets may be selected for families that do not exhibit apparent sequence homology to proteins with a known three-dimensional structure.
Next, researchers must generate a purified protein for analysis of the chosen target and then experimentally determine the target's structure, either by X-ray crystallography and/or NMR. Target structures determined experimentally may then be further analyzed to evaluate their similarity to other known protein structures and to determine possible evolutionary relationships that are not identifiable from protein sequence alone. The target structure will also serve as a detailed model for determining the structure of other proteins within that family. In favorable cases, just knowing the structure of a particular protein may also provide considerable insight into its possible function.

PDB: The Protein Data Bank
The PDB was the first "bioinformatics" database ever built and is designed to store complex three-dimensional data. The PDB was originally developed and housed at the Brookhaven National Laboratory but is now managed and maintained by the Research Collaboratory for Structural Bioinformatics (RCSB). The PDB is a collection of all publicly available three-dimensional structures of proteins, nucleic acids, carbohydrates, and a variety of other complexes experimentally determined by X-ray crystallography and NMR.
Protein Modeling at NCBI
The Molecular Modeling Database

NCBI's Molecular Modeling Database (MMDB), an integral part of the Entrez information retrieval system, is a compilation of all of the PDB three-dimensional structures of biomolecules. The difference between the two databases is that the MMDB records reorganize and validate the information stored in the database in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction.

NCBI has also developed a three-dimensional structure viewer, called Cn3D, for easy interactive visualization of molecular structures from Entrez. Cn3D serves as a visualization tool for sequences and sequence alignments. What sets Cn3D apart from other software is its ability to correlate structure and sequence information. For example, using Cn3D, a scientist can quickly locate the residues in a crystal structure that correspond to known disease mutations or conserved active site residues from a family of sequence homologues, or sequences that share a common ancestor. Cn3D displays structure-structure alignments along with the corresponding structure-based sequence alignments to emphasize those regions within a group of related proteins that are most conserved in structure and sequence. Cn3D also features custom labeling options, high-quality graphics, and a variety of file exports that together make Cn3D a powerful tool for literature annotation.

Phylogenetic Analysis:

Phylogenetic analysis can be done efficiently with the tool ClustalW2, which is operated by EMBL-EBI.

CLUSTALW

Multiple alignments of protein sequences are important tools in studying sequences. The basic information they provide is the identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins and in identifying new members of protein families.
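As a rough illustration of how conserved regions are read off a multiple alignment, the sketch below reports the columns in which every sequence carries the same residue. The function name and the toy alignment are made up for the example, and real analyses usually score conservation by degree rather than requiring perfect identity.

```python
def conserved_columns(alignment):
    """Return 0-based indices of fully conserved, gap-free columns.

    alignment is a list of equal-length strings that use "-" for gaps.
    Illustrative sketch only.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    first = alignment[0]
    return [
        col
        for col in range(length)
        if first[col] != "-" and all(seq[col] == first[col] for seq in alignment)
    ]


# Toy alignment (hypothetical sequences).
toy = ["MKT-LV", "MKSALV", "MRT-LV"]
print(conserved_columns(toy))
```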
The program ClustalW2 can be used for two purposes:

1. It can be used to produce a multiple sequence alignment. Using the web form the user need only input or upload a file of the sequences that they want to align in an accepted format. The other options on the form are set to the default values for producing a multiple alignment. The user can use the defaults or they can make some changes on the form to customise their run. A multiple sequence alignment of the sequences submitted will be returned to the user (.aln file).

2. It can be used to produce a true phylogenetic tree. In order to use this option, the user must input or upload a multiple alignment of sequences in one of the standard multiple alignment formats (.aln file). Then, in the phylogenetic tree section of the form, they must choose one of the tree type options: NJ, Phylip or Dist. These are programs for drawing phylogenetic trees. This time the user will retrieve a .ph file (always) and .dst and/or .nj files (depending on the options chosen), which will contain the phylogenetic trees. By default, the form is set to produce a multiple alignment.

Phylogram and Cladogram:
A phylogram is a branching diagram (tree) that is assumed to be an estimate of a phylogeny. The branch lengths are proportional to the amount of inferred evolutionary change. A cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny where the branches are of equal length. Therefore, cladograms show common ancestry, but do not indicate the amount of evolutionary "time" separating taxa. It is possible to see the tree distances by clicking on the diagram to get a menu of options. The options available allow you to do things like changing the colours of lines and fonts and showing the distances.
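The difference between the two diagrams can be shown on a tree written in Newick notation, where each branch length follows a colon. The sketch below assumes the tree is available as a simple Newick string without quoted labels; the function names are chosen for this example. Dropping the branch lengths leaves only the branching order, which is the information a cladogram conveys.

```python
import re

# A colon followed by a number, e.g. ":0.05" or ":1e-3".
_BRANCH_LENGTH = re.compile(r":([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def branch_lengths(newick):
    """Return the branch lengths of a Newick string, in order of appearance."""
    return [float(value) for value in _BRANCH_LENGTH.findall(newick)]


def to_cladogram(newick):
    """Strip branch lengths, keeping only the tree topology."""
    return _BRANCH_LENGTH.sub("", newick)


tree = "(A:0.1,(B:0.2,C:0.3):0.05);"  # hypothetical three-taxon tree
print(branch_lengths(tree))
print(to_cladogram(tree))
```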

BIOINFORMATICS IN DRUG DISCOVERY AND DEVELOPMENT:
Drug discovery and development through bioinformatics is one of the most actively pursued areas of research. The basic process of drug discovery can be divided into four steps:
§  Target Identification
§  Target Validation
§  Lead Identification
§  Lead Optimization
Bioinformatics has an important role in the Target Validation process and to a lesser extent in the other stages of the drug discovery process.
ROLE OF BIOINFORMATICS IN TARGET VALIDATION
  The importance of bioinformatics in target validation is justified because a rational and efficient mining of the information that integrates knowledge about genes and proteins is necessary for linking targets to biological information.
The validation of a drug target involves demonstrating the relevance of the target protein. This essential step combines data from molecular biology, cell biology, bioinformatics, and in-vitro and in-vivo experiments. Although experimental work is the key driver in target validation, bioinformatics plays an important supporting role, because biological knowledge must be mined from the numerous existing databases of DNA sequences, protein structures, pathways, organisms and diseases to uncover disease links and provide clues to biological function.

Predicting function from sequence and structure:
The most commonly used approach to assigning function to proteins is sequence similarity, but this approach has its limitations. Attention has therefore focused on extending it with complementary methods for function prediction that use sequence and structural information.

Sequence-based approaches
The identification of signatures of domains and functional sites in amino acid sequences has played an important and complementary role to similarity searching methods in the functional characterization of proteins. InterPro, which is maintained by the EBI, plays an important role here. In an extension of this approach, the prediction of sequence motifs associated with post-translational modifications and subcellular localization of proteins makes it possible to transfer functional information between sequences that are unrelated at the primary sequence or evolutionary level. The key principle here is that functionally related proteins will have similar post-translational modifications and sorting signals even if they are unrelated at the sequence level. For this purpose, the ProtFun method may be used, which integrates 14 individual attributes (e.g. glycosylation, phosphorylation, signal peptides) to predict functional categories. Proteome Analyst may also be used; it predicts subcellular location using database text annotations from homologues in addition to sequence information. Another tool is the Eukaryotic Linear Motif (ELM) server, a resource for investigating short linear peptide motifs that are used for cell compartment targeting, protein–protein interaction, regulation by phosphorylation, acetylation, glycosylation and a range of other post-translational modifications. Scansite is yet another important tool; it identifies short sequence motifs within query proteins that regulate protein–protein interactions in cell signaling, and can be used to generate biochemical tools that enable the identification of interaction partners. The availability of the sequenced genomes of a wide range of organisms has facilitated the development of protein function prediction methods based on viewing this data in an evolutionary context.
Phylogenomic profiling focuses on how proteins became similar in sequence through evolution rather than on the sequence similarity itself. In this approach, the evolutionary history of genes is used to predict the function of uncharacterized genes. The Resampled Inference of Orthologues (RIO) web server has been developed to automate phylogenomic analysis.
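To make the idea of scanning a sequence for a short motif concrete, the sketch below searches a protein sequence for the well-known N-glycosylation consensus, written N-{P}-[ST]-{P} in PROSITE notation (asparagine, any residue except proline, serine or threonine, any residue except proline). It is a plain regular-expression scan and does not reproduce the methods of ProtFun, ELM or Scansite; the function name and the test sequence are made up for the example.

```python
import re

# Lookahead so that overlapping occurrences are all reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (0-based start, matched text) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


# Hypothetical peptide: only the first asparagine sits in a valid motif.
print(find_motif("MNGTAANPSQNLSP"))
```

A pattern match of this kind only marks a candidate site; whether the site is actually modified in the cell has to be confirmed experimentally.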

Structure-based approaches
Protein structure plays a central role in the understanding and use of sequence data because of the tight relationship that exists between structure and function. As structures are more highly conserved than sequences during evolution, they can also be used to detect more distant homologues.
In addition, knowing the structure of a protein allows the in-silico design of targeted libraries of small molecule compounds which can be used as probes of cellular function and as possible drug leads in chemogenomics approaches. For this purpose, the Relibase and Chematica™ database systems have been designed to facilitate the retrieval of protein–ligand related information. Despite efforts towards high-throughput production of protein structures, molecular modeling methods are increasingly being used to bridge the gap between the number of protein sequences in the databases and the number of experimentally determined structures. Although homology modeling produces the most accurate models, it requires homologous proteins with a known structure and a high percentage of sequence identity with the target protein. Alternatively, fold recognition (threading) methods are applied when homologous template structures are unavailable. When no template structures are available at all, ab initio prediction methods are applied. Such methods usually generate low-resolution structures, which might be sufficient for functional annotation of the protein sequences. The detection of remote homologues that have poor sequence homology but good structural homology can be improved by the automatic extraction of SWISSPROT annotations in combination with PSI-BLAST.
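Since the choice of modeling method depends on the percentage sequence identity between target and template, it helps to see how that number is obtained from a pairwise alignment. The sketch below uses one common convention, counting identical residues over the columns where neither sequence has a gap; other conventions divide by the full alignment length or by the shorter sequence, so the values are not always directly comparable. The function name is chosen for this example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over gap-free alignment columns.

    Both arguments are rows of the same pairwise alignment, using "-"
    for gaps. The choice of denominator is an assumption; conventions vary.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"
    ]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)
```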

Predicting function from protein–protein interactions:

Protein–protein interactions are fundamental to most cellular processes. As a result, such interactions are being increasingly used to assign functions to uncharacterized proteins on the principle that interactions with proteins of known function will be a strong indicator of the function of proteins of unknown function. In addition, protein interaction data also provides the basis for the reconstruction of cellular pathways.  A range of experimental techniques have been used to detect protein–protein interactions with computational methods playing an important role in this process by expanding the scope of experimental data and increasing the confidence of protein–protein interaction pairs. There are a number of publicly accessible protein interaction databases, many of which combine experimental data with curated literature information. In addition, tools to aid data integration and visualization of the protein networks generated have been developed. Clustering of protein interactions into networks provides information on the biological context of proteins, an important step towards identification of its functional role. The networks of relationships between proteins are visualized as interaction maps. Parallel with experiments to determine protein–protein interactions, the genomic-context approach uses conservation of gene order in different genomes to predict functional association of proteins. This approach exploits the fact that genes of functionally interacting proteins tend to be associated with each other on genomes. The STRING database and Predictome tool both make use of genomic context information. Interaction mining uses experimentally derived interactions in one organism to infer the structure of an interaction network in a related organism.
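The principle that interaction partners of known function indicate the function of an uncharacterized protein can be sketched as a simple majority vote over its neighbours in the interaction network. The protein names, annotations and function name below are invented for illustration, and published methods weight the evidence far more carefully than a raw vote.

```python
from collections import Counter


def predict_function(protein, interactions, annotations):
    """Guess a protein's function from its annotated interaction partners.

    interactions is a list of (protein_a, protein_b) pairs and annotations
    maps protein names to a function label. Returns the most common label
    among the partners, or None if no partner is annotated.
    """
    neighbours = set()
    for a, b in interactions:
        if a == protein:
            neighbours.add(b)
        elif b == protein:
            neighbours.add(a)
    # Sort so that ties are resolved the same way on every run.
    votes = Counter(
        annotations[n] for n in sorted(neighbours) if n in annotations
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical network: X is uncharacterized, A, B and C are annotated.
network = [("X", "A"), ("B", "X"), ("X", "C"), ("A", "B")]
known = {"A": "kinase", "B": "kinase", "C": "transporter"}
print(predict_function("X", network, known))
```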


Predicting function from gene expression:

Microarray experiments allow the gene expression profiles of thousands of genes to be measured and compared with each other in cells and tissues under a range of experimental conditions, as well as between healthy and diseased states. Computational approaches are important in analyzing the large datasets produced and in prioritizing the resulting (long) lists of differentially expressed genes for target validation. The key challenges from a drug discovery perspective are: (1) Filtering the gene lists to create shortlists of targets that are most likely to be directly involved in the disease process;
(2) Deducing functional relationships from the datasets.
Clusters of genes that have similar expression profiles are often inferred to be functionally associated, although this is not always the case. However, the accuracy of such correlations can be improved by examining intraspecies and interspecies conservation of gene expression. To identify functionally related genes that do not have similar expression profiles, a method for the identification of ‘‘transitive’’ genes whose expression correlates with the expression of these genes has been described. An alternate approach to cluster analysis is to view the datasets at the level of biological processes or pathways, so that an overview of the main biological themes can be ascertained and this information can then be used as the basis for focusing in on particular groups of genes. Several tools have been developed to facilitate this process, which make use of the common vocabulary for biological function put together by the Gene Ontology (GO) Consortium. The ontology describes gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner, facilitating comparison of the functional features of proteins. Several tools like the MAPPfinder, GoMiner, FATIGO and EASE can be used to link gene expression data to the GO hierarchy.
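The notion of genes with similar expression profiles is usually made quantitative with a correlation measure. The sketch below computes the Pearson correlation between profiles and lists the gene pairs above a cut-off. The gene names, expression values and the 0.9 threshold are assumptions made for the example, and a real analysis would also normalise the data and correct for multiple testing.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above threshold."""
    genes = sorted(profiles)
    return [
        (g1, g2)
        for i, g1 in enumerate(genes)
        for g2 in genes[i + 1:]
        if pearson(profiles[g1], profiles[g2]) >= threshold
    ]


# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(coexpressed_pairs(profiles))
```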
As a complement to the tools developed for the analysis of microarray data, automatic pipelines for the analysis of open reading frames stemming from cDNA sequencing projects, like the LIFE database have also been developed.

Predicting function from linking proteins to pathways and signaling networks:

As most diseases result from the perturbation of a signal transduction pathway, a full insight into the function of proteins, particularly their relevance to disease, requires information on the pathway in which a putative target participates. A number of databases that contain large amounts of curated pathway information, as well as tools for pathway construction, modeling and analysis, have been developed. The detailed biochemical knowledge about metabolic pathways is reflected in the extensive nature of metabolic pathway databases such as KEGG, and has great potential for the validation of targets in pathogenic organisms.



Predicting function from text mining of the biological literature:

Much of the experimental information pertinent to a target’s biological function and potential links with a particular disease is hidden in the free text of the 11,000,000 journal articles in MEDLINE, the most widely used biomedical literature database. Literature evidence has always played a crucial part in target selection and validation and continues to be pivotal in accepting or rejecting associations derived from experimental or computational approaches. Careful curation of the literature is used as the basis for the manual annotation of entries in databases such as UniProt, DIP and BIND. Because of the large datasets obtained from transcriptomic and proteomic experiments, there is increasing interest in automating the identification of relevant MEDLINE articles (information retrieval) and finding facts and relationships in the unstructured text of these articles (information extraction). From a target validation perspective, the goals of text mining in bioinformatics are to identify and define the functional relationships between genes or proteins and to use this information to predict specific biological functions, links to disease pathology and/or pathway relationships. No single tool can currently perform all the required tasks and this is reflected in the range of text mining tools available. Starting from a nucleic acid or protein sequence, MedBlast is a tool for the identification of relevant literature references, which are either cited by the sequence annotation or cite the sequence (direct references), or contain gene symbols of the given sequence (indirect references). AbXtract is a web tool for the automatic extraction of biological information from collections of MEDLINE abstracts in which keywords, sentences and abstracts relevant to protein function are automatically extracted and displayed to the user.
XplorMed summarizes MEDLINE search results and extracts the main associations between words so users can select abstracts of interest for further analysis as well as interactively examine the context of keywords in abstracts. PubMatrix uses two lists of keywords to generate an HTML matrix table of pair-wise comparisons so that users can quickly identify interesting combinations of terms and access the relevant MEDLINE abstracts. When used with lists of gene names and function, PubMatrix can be used to annotate and analyse the gene lists produced by proteomic and transcriptomic experiments. Using information retrieved from MEDLINE articles related to function, diseases and related genes, the Gene Information System predicts positive, cooperative or negative relationships between pairs of genes in a two-phase text mining process. The Medical Knowledge Explorer system uses GO and LocusLink  as the lexicons for constructing function name and gene/gene product name indices and extracts information from articles using a sentence alignment and classification algorithm. BITOLA is a tool for finding new relationships between genes and disease in which a discovery algorithm operates on a knowledge base of relations between biomedical concepts extracted from MEDLINE.
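The pair-wise keyword matrix described for PubMatrix can be imitated on a small scale by counting, for each pair of terms, the abstracts in which both occur. The sketch below is not PubMatrix's implementation: it works on a local list of strings rather than on MEDLINE, uses crude case-insensitive substring matching, and the example abstracts are invented.

```python
def cooccurrence_matrix(abstracts, terms_a, terms_b):
    """Count abstracts mentioning both terms for every (term_a, term_b) pair.

    Returns a dict keyed by (term_a, term_b). Matching is a simple
    case-insensitive substring test, which is an illustrative shortcut.
    """
    lowered = [text.lower() for text in abstracts]
    return {
        (a, b): sum(1 for text in lowered if a.lower() in text and b.lower() in text)
        for a in terms_a
        for b in terms_b
    }


# Invented abstracts standing in for MEDLINE search results.
abstracts = [
    "BRCA1 mutations are linked to breast cancer.",
    "TP53 is mutated in many tumours, including breast cancer.",
    "BRCA1 and TP53 both act in DNA repair.",
    "Loss of BRCA1 impairs DNA repair.",
]
matrix = cooccurrence_matrix(abstracts, ["BRCA1", "TP53"], ["breast cancer", "DNA repair"])
for pair, count in sorted(matrix.items()):
    print(pair, count)
```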

 CONCLUSION:

Bioinformatics plays a key role in the drug discovery process. By predicting the biological function and potential disease-related roles of targets, it generates hypotheses in silico that can then be tested in vitro and in vivo. Although experimental results will remain the deciding factors in the progress of drug discovery, the information provided by bioinformatics helps to facilitate the process. With newer methods for the automated extraction of biological information from the scientific literature, newer databases of molecular interactions and pathways will emerge, further aiding drug discovery. Future emphasis will be mainly on better methods for modeling function from structure, enabling the researcher to rationally design or modify protein and ligand alike and thereby produce significant numbers of specifically designed therapeutic proteins.