Multiple DNA sequence alignment programs




















Clustal [1] has been part of the Sequencher family of plugins since version 4.

It is a widely used multiple sequence alignment program that works in three stages: it first computes all pairwise alignments on a set of sequences, then constructs a dendrogram grouping the sequences by approximate similarity, and finally performs the multiple alignment using the dendrogram as a guide.
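The first two stages of such a progressive scheme can be sketched in a few lines. This is only an illustration under simplifying assumptions, not Clustal's actual implementation: the pairwise score is a unit-cost edit distance, the dendrogram is built by plain average-linkage clustering, and the final profile alignment stage is omitted. The function names `edit_distance` and `guide_tree` are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Unit-cost global alignment distance (Levenshtein) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # gap in b
                            curr[j - 1] + 1,             # gap in a
                            prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = curr
    return prev[-1]


def guide_tree(seqs: dict) -> tuple:
    """Build a guide tree (nested tuples of names) by average-linkage clustering."""
    # Stage 1: all pairwise distances, normalised by the longer sequence.
    dist = {}
    for x, y in combinations(seqs, 2):
        d = edit_distance(seqs[x], seqs[y]) / max(len(seqs[x]), len(seqs[y]))
        dist[frozenset((x, y))] = d

    # Each cluster is a pair: (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Stage 2: repeatedly merge the two closest clusters.
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy input: s1/s2 and s3/s4 are the two closest pairs.
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTCCGT", "s4": "TTGTCCGA"}
tree = guide_tree(seqs)
print(tree)  # (('s1', 's2'), ('s3', 's4'))
```

In a full progressive aligner, the nested tuple would then dictate the order in which sequences and sub-alignments are merged.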

MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure.

A wide range of molecular analyses rely on multiple sequence alignments (MSA). In all these studies, the initial MSA can strongly impact conclusions and biological interpretations [5].

As a consequence, MSA is a richly developed area of bioinformatics and computational biology. A coding sequence can be considered either at the nucleotide (NT) or amino acid (AA) level.

Because of the redundancy of genetic codes, different codons encode the same AA. The NT sequence is thus less conserved but more informative than its AA translation. Since they are more informative, NT sequences should be able to provide equally good or even better alignments than their sole AA translation.
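A small example makes the redundancy argument concrete. The sketch below uses the standard genetic code (other codes exist, so this choice is an assumption) and two made-up sequences, `seq_a` and `seq_b`, that differ at most NT positions yet translate to the same AA sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt: str) -> str:
    """Translate in frame 0, ignoring any trailing incomplete codon."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


seq_a = "CTGTCTCGT"  # Leu Ser Arg
seq_b = "TTAAGCAGA"  # Leu Ser Arg, written with synonymous codons

matches = sum(x == y for x, y in zip(seq_a, seq_b))
print(translate(seq_a), translate(seq_b))   # LSR LSR
print(f"{matches}/{len(seq_a)} identical nucleotides")  # 2/9 identical nucleotides
```

The AA level sees two identical sequences, while the NT level retains the substitutions that distinguish them.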

These interruptions result from (i) the insertion or deletion of a number of consecutive nucleotides that is not a multiple of 3, either of which induces a frameshift that leads to transient or irreversible aberrant translation of the downstream AA sequence; and (ii) the substitution of an in-frame nucleotide that creates an unexpected, premature stop codon and shortens the AA sequence.
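Both kinds of interruption are easy to reproduce on a toy sequence. The following sketch assumes the standard genetic code and an invented 18-nucleotide reference; it inserts a single nucleotide to cause a frameshift and substitutes one nucleotide to create a premature stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt: str) -> str:
    """Translate in frame 0, ignoring any trailing incomplete codon."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


reference = "ATGGCTCCTGAAAAATGG"                      # ATG GCT CCT GAA AAA TGG

# (i) Insertion of one nucleotide (not a multiple of 3): downstream codons shift.
frameshifted = reference[:5] + "A" + reference[5:]

# (ii) In-frame substitution G -> T turns GAA (Glu) into TAA (stop).
nonsense = reference[:9] + "T" + reference[10:]

print(translate(reference))      # MAPEKW
print(translate(frameshifted))   # MAS*KM
print(translate(nonsense))       # MAP*KW
```

After the insertion, every downstream residue changes and a stop codon appears by chance, whereas the substitution leaves the flanking residues intact.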

These events may have either artefactual or biological causes. First of all, experimental errors may occur. Sequencing errors are frequent with the new sequencing technologies, with elevated error rates in homopolymers when using GS-FLX [6] and at the ends of short reads with the Illumina Genome Analyzer [7].

Secondly, gene inactivation during the course of evolution leads to pseudogenes that exhibit disruption(s) of their original ORFs and whose identification has proven computationally difficult [9].

Thirdly, programmed frameshift mutations that are tolerated during translation have been widely documented [10], and their role in the evolution of novel gene function has been reported [11]. To achieve higher NT alignment quality and detect ORF interruptions, the AA translation should be taken into account during the alignment process.

Ignoring it would mean omitting fundamental information. Yet, frameshifts and premature stop codons hamper the correct AA-guided alignment of NT sequences.

However, when dealing with protein-coding sequences, these methods do not take into account the corresponding AA translations.

As a result, a protein-coding sequence containing an insertion of 2 nucleotides followed by a downstream insertion of 7 nucleotides will incur the same gap-related penalties as the more realistic scenario of an insertion of 3 nucleotides followed by another insertion of 6.
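The arithmetic behind this observation can be checked directly. The sketch below assumes a conventional affine gap model (an opening cost plus a per-nucleotide extension cost); the penalty values and function names are illustrative, not those of any particular aligner.

```python
def affine_gap_cost(gap_lengths, gap_open=10.0, gap_extend=1.0):
    """Total affine cost: each gap pays one opening plus one extension per nucleotide."""
    return sum(gap_open + gap_extend * n for n in gap_lengths)


def frameshifting_gaps(gap_lengths):
    """Number of gaps whose length is not a multiple of 3."""
    return sum(1 for n in gap_lengths if n % 3 != 0)


for scenario in ([2, 7], [3, 6]):
    print(scenario, affine_gap_cost(scenario), frameshifting_gaps(scenario))
# [2, 7] 29.0 2
# [3, 6] 29.0 0
```

Both scenarios open two gaps and span nine nucleotides, so a purely NT-level cost cannot tell them apart even though only the first one breaks the reading frame.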

To overcome these problems, one common strategy consists of a three-step approach: coding NT sequences are first translated into AA, these AA sequences are then aligned, and lastly the resulting protein alignment is used to derive the NT one.
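The third step, deriving the NT alignment from the protein one, can be sketched as follows. The protein alignment is assumed to be given (here it is written by hand), and `thread_codons` is a hypothetical helper name rather than a function of any existing tool.

```python
def thread_codons(aa_aligned: str, nt: str) -> str:
    """Map an ungapped coding NT sequence onto its gapped AA alignment row.

    Each residue is replaced by its codon and each AA gap by a three-nucleotide gap.
    """
    codons = [nt[i:i + 3] for i in range(0, len(nt), 3)]
    n_residues = sum(1 for c in aa_aligned if c != "-")
    if len(nt) % 3 != 0 or len(codons) != n_residues:
        raise ValueError("NT sequence does not match the aligned protein")
    remaining = iter(codons)
    return "".join("---" if c == "-" else next(remaining) for c in aa_aligned)


# Hand-made protein alignment of two toy sequences.
protein_alignment = {"seq1": "MAPK", "seq2": "M-PK"}
coding_sequences = {"seq1": "ATGGCTCCTAAA", "seq2": "ATGCCCAAG"}

nt_alignment = {name: thread_codons(protein_alignment[name], coding_sequences[name])
                for name in protein_alignment}
for name, row in nt_alignment.items():
    print(name, row)
# seq1 ATGGCTCCTAAA
# seq2 ATG---CCCAAG
```

The length check in this sketch rejects any sequence whose length is not a multiple of 3, which is one simple way such a pipeline can fail on frameshifted input.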

Moreover, it can either consider the full DNA sequence as coding, or search for its longest reading frame. The main drawback of this three-step approach is its inability to handle unexpected frameshifting substitutions. The AA translation that follows such events is no longer the correct one. In other cases, the translated AA sequence will look like a highly divergent, orphan sequence at the protein level and will induce a partly aberrant DNA alignment.

Such cases seem to be frequently encountered even in benchmark alignment datasets [22]. In contrast to the vast literature on sequence alignment, few studies have focused on AA-aware NT sequence alignment. One of the first works on this subject was by Hein [23]. He then considered a special case where the two costs are simply summed and sequence evolution is idealized to involve only nucleotide substitutions and AA indels (no frameshift is allowed).

An algorithm has been proposed to align two sequences of lengths n and m under this model [23]. A solution to the same problem under affine gap costs was then described, running in O(nm) time, by Arvestad [24] and Pedersen et al. [25].

These improvements seemed promising, as this algorithm reached the same asymptotic complexity as classical DNA alignment methods. However, the authors acknowledged that the constant factor masked by the O notation may be limiting in practice [25]. Indeed, to obtain a pairwise alignment, their method needs to compute so many table entries that its use in the MSA context is precluded.

An alternative approach that was recently proposed [26] consists of scoring the alignment according to a weighted sum of four costs: the NT alignment cost plus those of its three possible AA alignment translations. To make the algorithm simpler and faster, no specific cost is associated with indels that induce frameshifts. Here, frameshifting indels are supposed to be penalized by the AA mismatch they will induce.

Considering all three reading frames may appear surprising since often only one is relevant, but this tool was specifically developed for handling viral genomes which may use overlapping reading frames [26].

In a slightly different context, an algorithm has been proposed to detect frameshift errors in newly determined NT sequences by comparison with AA sequences in public databases [27]. The algorithm generalizes the classical Smith-Waterman pairwise algorithm [28] so that the three reading frames are considered. An explicit frameshift cost is used to penalize frameshifts.

This method provides an elegant solution for evaluating sequence proximity but cannot be extended to MSA, since the underlying alignment cannot be displayed by the classical matrix representation used in MSA algorithms. Indeed, although pairwise solutions have existed for almost two decades, MACSE is the first MSA program able to align coding sequences based on their AA translations while accounting for frameshifts.

We illustrate the relevance and usefulness of the MACSE program on biological case studies aimed at (1) computing MSA of protein-coding genes containing non-functional, pseudogene sequences, (2) aligning high-throughput sequencing reads against reference coding sequences and (3) detecting undocumented frameshifts in published sequences. MACSE is an efficient solution to detect errors in coding sequences and the first automatic solution to align pseudogenes while taking into account their potential AA translation and preserving their codon structure.

Numerous evolutionary studies of individual genes or gene families involved in morphological adaptations require quantifying variation in selective pressure. Such analyses of molecular evolution based on codon models typically require aligning both functional and non-functional (pseudogene) sequences while respecting the underlying codon structure at the nucleotide level [4], [29], [30].

In this case, standard MSA programs that consider nucleotide sites independently disrupt the coding structure, while those that rely on AA translation are hampered by the presence of multiple frameshifts and premature stop codons. As a first biological case, we show how MACSE can align multiple heterogeneous sequences from the ambn gene coding for ameloblastin. This enamel constitutive protein has been lost in whales whose teeth have been replaced by keratinous baleen [31].

In these species, the relaxation of selective constraints has allowed the accumulation of mutations leading to the occurrence of frameshifts and stop codons. Although no longer coding for a functional protein, the ghost of selection past acting on these pseudogenes nevertheless left traces of their former codon structure [32].

Using MACSE with the option that adjusts frameshift and stop codon costs in pseudogenes made it possible to incorporate non-functional sequences in a codon-based alignment of functional orthologs of this gene Fig.

Here, MACSE suggests the occurrence of three frameshifts, the positions of which are indicated by exclamation marks. In the first two cases they pinpoint the insertion of an additional nucleotide in several pseudogenes Fig.

Three situations are illustrated in which frameshifts detected by MACSE are indicated by exclamation marks. The 7 pseudogene sequences are boxed.

Case 1: To maintain the reading frame, two exclamation marks are introduced in the Balaena and Eubalaena sequences. This pinpoints the occurrence of an extra C inserted in these three pseudogenes.

Case 2: A similar situation in the three Balaenoptera sequences, with an extra T.

Case 3: To maintain the reading frame, one exclamation mark is introduced in the Eschrichtius sequence. This pinpoints a single nucleotide deletion in this pseudogene.

MACSE default parameters were used.

As a second example, we considered more divergent sequences from bird olfactory receptor genes. In this case, ecological differences among species have shaped the olfactory gene repertoires through gene duplication and pseudogenization events [29]. Here, we used MACSE to align 93 functional sequences with 18 pseudogenes from the brown kiwi (Apteryx australis) and domestic chicken (Gallus gallus) olfactory repertoires.

The codon alignment highlights the occurrence of multiple stop codons Fig. Stars and exclamation marks in the corresponding AA alignment respectively emphasize these events, which disrupt the coding frame while maintaining the correct translation. Note also that some functional sequences of these olfactory receptor genes share large in-frame deletions that are handled by MACSE.

The same alignment region is displayed at the NT (left) and AA (right) levels. The 18 pseudogene sequences are boxed. Stop codons (stars in amino acid sequences) occurring at sites 1 and 2, and frameshifts (exclamation marks) inferred by MACSE at sites 3 and 4, are circled.

Such analyses allow estimating where along the gene and when along the phylogeny pseudogenization events have occurred [4]. Note that other software e. Hence, no matter which of the three possible reading frames is used, the resulting translation contains stop codons. Indeed, pseudogene sequences should not be translated using a single reading frame, as done by revTrans, transAlign or TranslatorX, but using the three reading frames alternately, switching from one to the other at each frameshift.
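One way to picture this frame switching is through the exclamation-mark convention described above: once frameshift positions are padded with "!" in the aligned NT row, reading the row codon by codon amounts to changing frame at each frameshift. The sketch below is an assumption-laden illustration, not MACSE code: it uses the standard genetic code, invented toy sequences, and a hypothetical `translate_aligned` helper that reports a codon containing "!" as "!" and a full gap codon as "-".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_aligned(nt_aligned: str) -> str:
    """Translate an aligned NT row in which '!' pads codons hit by a frameshift."""
    residues = []
    usable = len(nt_aligned) - len(nt_aligned) % 3
    for i in range(0, usable, 3):
        codon = nt_aligned[i:i + 3]
        if "!" in codon:
            residues.append("!")      # frameshift marker carried to the AA level
        elif codon == "---":
            residues.append("-")      # in-frame gap
        else:
            residues.append(CODON_TABLE.get(codon, "X"))  # X for any other codon
    return "".join(residues)


functional = "ATGGCTCCTGAAAAATGG"
deletion = "ATGGCTC!TGAAAAATGG"       # one nucleotide lost from the third codon
insertion = "ATGGCTCCTC!!GAAAAATGG"   # one extra C padded with two '!'

print(translate_aligned(functional))  # MAPEKW
print(translate_aligned(deletion))    # MA!EKW
print(translate_aligned(insertion))   # MAP!EKW
```

In both pseudogene rows the residues downstream of the frameshift are recovered, which a single-frame translation of the unpadded sequence would not achieve.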

The DIALIGN option searching for the longest reading frame is not satisfactory, since sequences are truncated at the first encountered stop codon. Other DIALIGN options, including those based on AA translation, result in alignments that disrupt the codon structure by introducing numerous frameshifts and stop codons even in functional sequences. By explicitly modeling frameshifting events and allowing distinct alignment penalties for different sets of sequences, MACSE has a major advantage over existing alignment tools: it is able to infer frameshift positions and propose more relevant alignments when non-functional sequences are sampled.

This greatly facilitates subsequent analyses of molecular evolution based on codon models. With the exponentially growing DNA data generated by new high-throughput technologies, it has become particularly important to correctly align sequencing reads or contigs with the corresponding reference markers.

Despite the high genome coverage generated by these approaches, the mapping and alignment tasks are complicated by the fact that GS-FLX or Illumina reads may suffer from sequencing errors [6], [7].

Alignment-based methods have recently been proposed to correct sequencing errors in next-generation sequencing reads [33]. Since numerous phylogenomics and molecular evolution studies rely on expressed sequence tag (EST) data [34], MACSE can help computational biologists to align reads with their corresponding coding sequences. As a second proof-of-concept example, we therefore illustrate the use of MACSE to align reads obtained from a transcriptomic approach among rodents.

There are five model rodents for which complete genome resources are available. Here, we focus on the transcriptome of a non-model rodent species, the jerboa (Jaculus jaculus), belonging to the Dipodidae, a family that is closely related to Muridae, which includes mouse and rat [35].


