Sequence Alignment

Gene Molecular Evolution

Create a multiple sequence alignment and a phylogenetic tree, and calculate gene conservation.

Version 1.1.2

Use Cases

  • Perform molecular evolutionary analysis.
    • Conduct phylogenetic and conservation analysis of a gene family curated by a user
    • Conduct phylogenetic and conservation analysis of a gene family in publicly available curated ortholog datasets
    • Identify homologs and create an MSA of a gene family identified by a database search
    • Identify genes whose evolutionary rates shift in association with change in a trait

Summary and Methods

Phylogenetic analysis is a scientific method used to study the evolutionary relationships between genes, species, or groups of organisms. It aims to reconstruct the evolutionary history, or phylogeny, by analyzing and comparing genetic characteristics.

One commonly used approach in phylogenetic analysis is Multiple Sequence Alignment (MSA). MSA involves aligning the DNA, RNA, or protein sequences of different organisms to identify regions of similarity and difference. This alignment helps to infer evolutionary relationships and to identify the changes that have occurred over time. MSA is typically performed using algorithms that optimize an alignment score, rewarding matching residues and penalizing mismatches and gaps (which represent insertions or deletions).
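To make "regions of similarity and difference" concrete, the sketch below scores each column of an already aligned toy MSA by the fraction of sequences that share the most common residue. The function name `column_identity` and the toy sequences are illustrative assumptions; this is not the conservation measure used by the workflow, only a minimal example of the idea.

```python
from collections import Counter

def column_identity(alignment):
    """Per-column conservation: share of sequences carrying the most common non-gap residue.

    `alignment` is a list of equal-length aligned strings; '-' marks a gap.
    (Hypothetical helper for illustration only.)
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # a column made only of gaps carries no signal
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Divide by the number of sequences so gaps lower the score.
        scores.append(top_count / len(column))
    return scores

# Toy alignment (made-up sequences).
toy_msa = [
    "ATG-CA",
    "ATGACA",
    "ATGTCT",
]
scores = column_identity(toy_msa)
for position, score in enumerate(scores, start=1):
    print(position, round(score, 2))
```

Columns that score 1.0 are fully conserved in this toy example; lower values mark positions where the sequences differ or contain gaps.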

Once the MSA is obtained, it serves as the basis for constructing a phylogenetic tree. A phylogenetic tree is a graphical representation of the evolutionary relationships between different genes, species, or groups. It depicts the branching patterns that connect common ancestors and descendant lineages. In the tree, the tips (leaves) represent the sampled sequences or species, the branches represent lineages, and the internal nodes represent hypothetical common ancestors.

Phylogenetic tree construction involves various methods, such as maximum likelihood (ML), maximum parsimony (MP), and Bayesian inference. These methods use the information from the MSA to estimate the tree that best explains the observed sequence similarities and differences under their respective criteria: the highest likelihood for ML, the fewest changes for MP, and the highest posterior probability for Bayesian inference. The resulting tree is the best-supported estimate of the evolutionary history given the available data.
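As an illustration of how a tree can be scored against an MSA, the sketch below applies the Fitch small-parsimony algorithm to count the minimum number of changes a fixed rooted binary tree requires. It only scores candidate trees and does not search tree space, and it is not the method the workflow itself runs. The nested-tuple tree encoding, the function names, and the toy sequences are assumptions made for this example.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    `tree` is a leaf name (str) or a pair (left, right); `states` maps leaf name -> residue.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed here.
        return shared, left_cost + right_cost
    # Children disagree: one change is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_length(tree, alignment):
    """Sum the Fitch change counts over all columns of an aligned dict name -> sequence."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total

# Toy data (made-up sequences) and two candidate topologies.
alignment = {"human": "ACGT", "chimp": "ACGT", "mouse": "ACTT", "rat": "GCTT"}
tree_a = (("human", "chimp"), ("mouse", "rat"))
tree_b = (("human", "mouse"), ("chimp", "rat"))

print(parsimony_length(tree_a, alignment))  # 2 changes
print(parsimony_length(tree_b, alignment))  # 3 changes
```

Under the parsimony criterion, `tree_a` is preferred for this toy data because it explains the alignment with fewer changes.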

The internal nodes of a phylogenetic tree can be further classified as either bifurcating (binary) or multifurcating (polytomy). A bifurcating node represents a split into two distinct lineages, which in a species tree indicates a speciation event. Polytomies occur when there is insufficient information to resolve the exact branching pattern, representing either uncertainty or rapid diversification.
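The distinction can be checked mechanically by counting the children of each internal node. The sketch below assumes a tree encoded as nested tuples with leaf names as strings; that encoding and the function name `classify_nodes` are illustrative choices, not a format used by the workflow.

```python
def classify_nodes(tree):
    """Count bifurcating and multifurcating internal nodes in a nested-tuple tree."""
    counts = {"bifurcating": 0, "multifurcating": 0}

    def visit(node):
        if isinstance(node, str):  # a leaf has no children to classify
            return
        if len(node) == 2:
            counts["bifurcating"] += 1
        elif len(node) > 2:
            counts["multifurcating"] += 1
        for child in node:
            visit(child)

    visit(tree)
    return counts

# The root and the (A, B) clade are resolved; (C, D, E) is an unresolved polytomy.
example_tree = (("A", "B"), ("C", "D", "E"))
node_counts = classify_nodes(example_tree)
print(node_counts)
```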

Phylogenetic trees can provide insights into various aspects of evolution, including the divergence times between species, the order of speciation events, and the patterns of evolutionary change. They are widely used in fields such as evolutionary biology, systematics, comparative genomics, and ecology to study the relationships and evolutionary history of organisms.

Click the toggles below to learn more about the different starting points of this workflow.

Curated Gene Family

Curated Orthologs

Run Orthofinder

Create MSA from Database Sequence Search

Inputs

Outputs

Workflow Walkthrough

Results Walkthrough

Citations

Sequence Similarity Search

Find gene sequences that are significantly similar to known query sequences.

Version 1.1.1

Use Case

Identify sequences in a database that are similar to a query, where similarity is determined by the statistical significance of the alignment score.
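One common way to express the statistical significance of an alignment score is the expected number of chance hits (E-value). The sketch below uses the standard relation between a normalized bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n is the total database length. It is a simplification: the function name is hypothetical, effective-length corrections are ignored, and each search tool in this workflow computes its own statistics.

```python
def evalue_from_bit_score(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2^(-S').

    Assumes raw (not effective) lengths, so values will differ somewhat
    from those reported by real search tools.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

# The same bit score is less significant against a larger database.
small_db = evalue_from_bit_score(50.0, query_len=300, db_len=1_000_000)
large_db = evalue_from_bit_score(50.0, query_len=300, db_len=1_000_000_000)
print(small_db, large_db)
```

A lower E-value means the hit is less likely to have arisen by chance; each additional bit of score halves the E-value.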

Summary

This workflow uses BLAST [1, 2], Diamond [3], SSEARCH, FASTA36 [4], or Miniprot [5] for sequence comparison and similarity searches in biological databases.

  1. BLAST (Basic Local Alignment Search Tool): BLAST is a widely used algorithm for comparing biological sequences, such as DNA, RNA, or protein sequences, against a large database. It employs a heuristic approach to find regions of local similarity between sequences. BLAST provides a measure of sequence similarity, identifies regions of conservation, and helps infer functional and evolutionary relationships between sequences.
  2. Diamond: Diamond is a sequence alignment tool designed for comparing protein sequences, or translated DNA sequences, against protein sequence databases. It uses a fast and sensitive seed-and-extend alignment algorithm. Diamond is known for its high speed and is often used for large-scale protein sequence analysis.
  3. SSEARCH: SSEARCH is a sequence comparison tool that performs rigorous local sequence alignment using the Smith-Waterman algorithm. Because it computes optimal local alignments rather than relying on heuristics, it is sensitive in detecting distant homologs. SSEARCH is commonly used in protein sequence analysis and is particularly effective when comparing sequences with low similarity.
  4. FASTA36: FASTA36 is a versatile and widely used program for comparing protein and nucleotide sequences against sequence databases. It employs the FASTA algorithm, a heuristic method based on local alignment. FASTA36 is known for its sensitivity in identifying distant homologs and can be used for both database searches and pairwise alignments.
  5. Miniprot: Miniprot is a fast protein-to-genome aligner developed by Heng Li, the developer of minimap2. It outputs alignments in PAF (Pairwise mApping Format) and GTF (Gene Transfer Format).
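Several of the tools above rely on local alignment. The sketch below computes the optimal Smith-Waterman local alignment score for two sequences, to show what "local" means in practice: cell scores are never allowed to drop below zero, so the best-scoring region is found regardless of the flanking sequence. The scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions for this example; SSEARCH itself uses substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a simple match/mismatch score and a linear gap penalty
    (illustrative defaults, not any tool's actual parameters).
    """
    cols = len(b) + 1
    prev = [0] * cols  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 restarts the alignment, which makes it local.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# The shared core "ACG" is found despite unrelated flanking sequence.
print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6
```

Heuristic tools such as BLAST, Diamond, and FASTA36 avoid filling this full matrix for every database sequence, trading a small amount of sensitivity for speed.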

Methods

This analysis was performed using the Sequence Similarity Search workflow on the Form Bio platform. This workflow takes an input FASTA file and performs a sequence similarity search with BLAST [1, 2], Diamond [3], SSEARCH, FASTA36 [4], or Miniprot [5].

Inputs

Outputs

Runtime Estimates

Average = 1 hour 19 minutes

Workflow Walkthrough

Citations

Genome Coordinate Conversion

Convert the locations of a set of genomic features, such as genes, transcription factor binding sites, or promoters, from one genome to another.

Version 1.0.1

Use Cases

  • Create a Genome Coordinate Conversion File between two genomes
  • Map the location of a set of genomic features (e.g. genes, transcription factor binding sites, promoters) from the target genome to the query genome
  • Filter the genomic features converted to the query genome for overlap with a second set of genomic features whose locations are given in the query genome

Summary

This workflow is designed to help the user convert the locations of a set of genomic features, such as genes, transcription factor binding sites, or promoters, from one genome to another. The user provides as input a query genome to convert to and a target genome to convert from. If a Genome Coordinate Conversion File is not provided, one may be generated from the two input FASTA files. The user receives as output the locations of the genomic features on the query genome and, if requested, a Genome Coordinate Conversion File.

Methods

This analysis was performed using the Genome Coordinate Conversion workflow on the Form Bio platform. The Genome Coordinate Conversion File is a chain file that describes a whole-genome alignment between the target genome and the query genome. Only features that lie in regions of homology between the two genomes are mapped. If the Genome Coordinate Conversion File does not yet exist, this workflow can generate one from a pair of genome FASTA files using either LastZ [1] or SegAlign [2] (a GPU-optimized version of LastZ). CrossMap [3] uses this chain file to convert the coordinates of genomic features from the target genome to the query genome. The input file of genomic features in the target (reference) genome can be in BED, VCF, BAM, or MAF format. CrossMap outputs a BED file with the locations of these genomic features in the query genome. If a second BED file of genomic features in the query genome is provided, the workflow filters the converted features for overlap with it [4].
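The core idea of the conversion and filtering steps can be sketched as follows: a feature is shifted into query coordinates when it falls inside an aligned block, dropped when it does not, and then optionally kept only if it overlaps a second feature set. This is a deliberately simplified model, not CrossMap's implementation. The `Block` type and function names are hypothetical, blocks are assumed to be ungapped and on the same strand, and features that span multiple blocks are simply discarded.

```python
from typing import NamedTuple

class Block(NamedTuple):
    """One ungapped aligned block; coordinates are 0-based, half-open (as in BED)."""
    t_chrom: str
    t_start: int
    q_chrom: str
    q_start: int
    size: int

def convert_interval(blocks, chrom, start, end):
    """Map a target-genome interval to the query genome, or return None if unmapped."""
    for b in blocks:
        if b.t_chrom == chrom and b.t_start <= start and end <= b.t_start + b.size:
            offset = b.q_start - b.t_start
            return (b.q_chrom, start + offset, end + offset)
    return None  # the feature is not fully inside a region of homology

def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

# Toy alignment blocks and features (made-up coordinates).
blocks = [
    Block("chr1", 100, "chrA", 1000, 50),
    Block("chr1", 200, "chrA", 1200, 100),
]
target_features = [("chr1", 110, 120), ("chr1", 160, 170), ("chr1", 250, 260)]
query_filter = [("chrA", 1255, 1300)]

converted = [convert_interval(blocks, *f) for f in target_features]
mapped = [c for c in converted if c is not None]
filtered = [c for c in mapped if any(overlaps(c, q) for q in query_filter)]
print(converted)
print(filtered)
```

In this toy example the second feature lies between the two blocks and is not mapped, and only the third converted feature overlaps the filter set.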

Inputs

Outputs

Workflow Walkthrough

Results Walkthrough

Citations
