
Protein Engineering

Protein Structure and Function


Get information about protein sequences including identifying protein functional domains, predicting gene ontology and EC numbers, and predicting protein structure.

Version 2.1.0

Use Cases

Predict 3D structure and functional information of proteins, DNA, RNA, and complexes from FastA files, text sequences, or protein accession numbers

Summary

This workflow is designed to help the user determine the structure and function of protein sequences. This workflow is capable of identifying protein functional domains, predicting gene ontology and EC numbers, and predicting protein structure. The user will provide the sequences of interest as a FastA file, text sequence, or protein accession number, and will receive as output functional and structural information on the sequences.

Protein Functional Domains

Protein domains are functional and structural units within proteins that can be conserved across different proteins and species. Identifying protein domains is important for understanding protein function and evolution. One method for finding protein domains is to use a tool called Reverse PSI-BLAST (RPS-BLAST) to search against the Conserved Domain Database (CDD), a collection of well-characterized protein domains. The RPS-BLAST algorithm uses a profile-based approach to identify domains in protein sequences. It compares the query protein sequence to a database of protein domain profiles, searching for regions of the sequence that match a particular domain profile. These domain profiles are constructed from alignments of sequences that share a particular domain or motif and are therefore highly conserved.

When RPS-BLAST searches the CDD, the query protein sequence is compared against the CDD’s library of domain profiles, and the program returns a list of domains that match the query along with statistical measures of the significance of each match, such as an E-value and a bit score. Once the protein domains have been identified, researchers can use this information to make predictions about the protein’s function, interactions, and evolutionary history. For example, if a protein contains a domain that is commonly found in enzymes, the protein is likely to be an enzyme as well, and its specific enzymatic activity can often be inferred from the domain. Similarly, a domain that is found in many different species has been conserved throughout evolution, which suggests that it is important for the protein’s function.
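The filtering step described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the search results are in the BLAST+ tabular format (`-outfmt 6`) with its default twelve columns; the rpsblast.out file produced by this workflow may use a different layout. The `DomainHit` type, the `parse_domain_hits` function, and the 0.01 E-value cutoff are illustrative choices, not part of the workflow.

```python
from typing import NamedTuple

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


class DomainHit(NamedTuple):
    """One query-to-domain match (hypothetical helper type)."""
    query: str
    domain: str
    q_start: int
    q_end: int
    evalue: float
    bitscore: float


def parse_domain_hits(tabular_text: str, max_evalue: float = 0.01) -> list[DomainHit]:
    """Parse tab-separated hits and keep those at or below max_evalue."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.split("\t")
        if len(columns) < len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        row = dict(zip(FIELDS, columns))
        hit = DomainHit(
            query=row["qseqid"],
            domain=row["sseqid"],
            q_start=int(row["qstart"]),
            q_end=int(row["qend"]),
            evalue=float(row["evalue"]),
            bitscore=float(row["bitscore"]),
        )
        if hit.evalue <= max_evalue:
            hits.append(hit)
    # Most significant matches (smallest E-value) first.
    return sorted(hits, key=lambda h: h.evalue)
```

A lower E-value means a match is less likely to have arisen by chance, so tightening `max_evalue` trades sensitivity for fewer spurious domain assignments.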

Predicting Protein Function

Annotating a protein for function means identifying the specific biological activity it carries out. This information helps clarify the protein’s role in biological systems and can point to potential targets for therapeutic intervention. One tool for functional annotation is DeepFRI (Deep Functional Residue Identification), a deep learning tool that predicts gene ontology (GO) terms and enzyme commission (EC) numbers from a protein’s sequence or structure. To annotate a protein with DeepFRI, the protein sequence or structure is provided to the tool as input. DeepFRI then uses a graph convolutional network to analyze the input and predict the protein’s functions across the three GO categories: biological process, molecular function, and cellular component.

In addition to GO terms, DeepFRI can predict EC numbers, a four-level classification of enzymes by the reactions they catalyze. Predicting the catalytic activity of a protein can help identify potential drug targets and inform drug development efforts. Once the GO terms and EC numbers have been predicted, researchers can use them to better understand the protein’s function and its role in biological systems, to identify other proteins with similar functions, or to investigate potential drug targets based on the predicted enzyme activity. In this way DeepFRI can help accelerate research in fields ranging from basic biology to drug discovery.
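To make the EC classification concrete, the sketch below splits an EC number into its hierarchy levels and looks up the top-level enzyme class. The seven class names are the standard top-level EC classes; the function name `describe_ec` and the returned dictionary layout are illustrative assumptions and do not reflect DeepFRI's own output format.

```python
# The seven top-level classes of the Enzyme Commission system.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def describe_ec(ec_number: str) -> dict:
    """Split an EC number such as '3.4.21.4' into its hierarchy levels.

    A '-' marks a level that is left unspecified, as in '1.1.-.-'.
    """
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"EC numbers have four fields: {ec_number!r}")
    if not parts[0].isdigit() or int(parts[0]) not in EC_CLASSES:
        raise ValueError(f"unknown EC class in {ec_number!r}")

    levels = []
    for index, part in enumerate(parts):
        if part == "-":
            break  # remaining levels are unspecified
        if not part.isdigit():
            raise ValueError(f"unexpected field {part!r} in {ec_number!r}")
        levels.append(".".join(parts[: index + 1]))

    return {"class_name": EC_CLASSES[int(parts[0])], "levels": levels}
```

Grouping predictions by a shared prefix (for example, everything under `3.4`) is one simple way to compare proteins whose predicted activities agree only at the coarser levels.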

Protein Structure

Predicting the structure of a protein is an important step in a thorough analysis of its sequence. This workflow offers three structure prediction tools: RaptorX, RoseTTAFoldNA, and AlphaFold.

  • RaptorX is a protein structure prediction server that uses a variety of methods to predict the structure of a given protein sequence. It employs advanced machine learning techniques, such as deep neural networks, to accurately predict protein structures even in the absence of experimental data. RaptorX uses homology modeling, threading, and ab initio modeling to generate protein structure predictions and has been shown to be highly accurate in blind tests.
  • RoseTTAFoldNA is a structure prediction tool developed in the Baker lab at the University of Washington. It extends the RoseTTAFold three-track neural network, which reasons jointly over sequence, residue-pair, and 3-D coordinate information, so that it can model nucleic acids as well as proteins. In addition to protein structures, it can predict the 3-D structures of DNA, RNA, and mixed nucleotide/amino acid complexes.
  • AlphaFold is a deep learning-based protein structure prediction system developed by the artificial intelligence research company DeepMind. Its neural network was trained on a large database of experimentally determined protein structures, and it has outperformed other methods in the CASP protein structure prediction challenges. AlphaFold relies on attention mechanisms to reason about the relationships between different parts of a protein, which allows it to predict protein structures with a high degree of accuracy.

Together, these tools help researchers better understand the structure and function of proteins, which is critical for developing new drugs and therapies.

Methods

This analysis was completed using the Protein Structure and Function workflow on the Form Bio platform. If the input is a protein sequence accession number, the relevant databases are searched to retrieve the sequence as a FASTA file. Otherwise, the input is either a FASTA file or an amino acid sequence entered as text, which is converted into a FASTA file. The FASTA file is then formatted and split into individual protein sequences that run in parallel through the rest of the workflow.

If the sequences are monomers, the workflow splits into two main routes. In the first route, each sequence is searched with RPS-BLAST against the Conserved Domain Database (CDD) [1] and then run through DeepFRI [2] for structure-based protein function prediction. In the second route, each sequence is run through the selected structure prediction algorithm (AlphaFold [3] or RaptorX [4]), which predicts and ranks multiple 2-D or 3-D structures. After both routes finish, the results are consolidated into a final report in HTML format.

If the sequences are multimers, the workflow follows a single route in which AlphaFold-Multimer [5] predicts the 3-D multimer structure. Multiple structures are predicted, ranked, and relaxed using Amber force fields. The result is a set of ranked PDB files and a final HTML report containing a fully interactive plot of the top-ranked model’s 3-D structure, predicted aligned error plots, the predicted LDDT per position for the top 5 predictions, and a sequence coverage plot. Details of the run and a random hash are also added to make the predictions reproducible and easy to track.

Lastly, if Nucleic Acid Structures is selected, RoseTTAFoldNA [6] is run to predict the 3-D structures of RNA, DNA, or mixed protein/nucleic acid complexes. In this mode, a design file specifies the FASTA files to use, their order in the chain structure being predicted, and whether each one is a DNA, RNA, protein, or paired protein/RNA chain.
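The step that splits a FASTA file into individual sequences can be illustrated with a short Python sketch. This is not the platform's implementation; the function name `split_fasta` is hypothetical, and the sketch assumes standard FASTA text in which each record begins with a `>` header line.

```python
def split_fasta(text: str) -> list[tuple[str, str]]:
    """Split FASTA text into (header, sequence) records.

    Sequence lines belonging to one record are joined, so wrapped
    sequences are returned as a single string.
    """
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Each returned record is independent of the others, which is what allows the downstream domain search, function prediction, and structure prediction steps to run in parallel per sequence.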

Inputs

Monomer

  • Fasta File, Accession Number, or Sequence as Text Input
  • Algorithm selection

Example FASTA file text:

    >1pazA
    ENIEVHMLNKGAEGAMVFEPAYIKANPGDTVTFIPVDKGHNVESIKDMIP

Multimer

  • Fasta File, Accession Number, or Sequence as Text Input
  • Example sequence input (for text input, the chains of a multimer are separated by a colon, :):

    GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRYKEAAEENRALAKLHHELAIVED:MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH

    Example FASTA file text (in a FASTA file, each chain of the multimer is a separate entry):

    >T1083
    GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRYKEAAEENRALAKLHHELAIVED
    >T1084
    MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH
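The two multimer input styles carry the same information, and converting the colon-separated text form into FASTA entries is straightforward. The sketch below is illustrative only: the function name `chains_to_fasta` and the generated header names (`chain1`, `chain2`, and so on) are assumptions, since the names the workflow assigns to chains entered as text are not documented here.

```python
def chains_to_fasta(sequence_text: str, prefix: str = "chain") -> str:
    """Convert 'SEQ1:SEQ2:...' into FASTA text with one entry per chain."""
    chains = [chain.strip().upper() for chain in sequence_text.split(":")]
    if any(not chain for chain in chains):
        raise ValueError("empty chain found; check for stray or doubled colons")

    lines = []
    for number, chain in enumerate(chains, start=1):
        lines.append(f">{prefix}{number}")  # hypothetical header name
        lines.append(chain)
    return "\n".join(lines) + "\n"
```

For a homodimer, the same sequence would simply appear twice, once on each side of the colon.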

RoseTTAFoldNA

  • Input folder
  • Design file labeling which inputs to use, in what order, and whether each is P (protein), R (RNA), D (DNA), or PR (paired protein/RNA)
  • Example input folder structure:

    protein1.fa
    protein2.fasta
    rna.fa

    Example design file:

    SequenceType    FileName
    P               protein1
    R               rna
    P               protein2

Example protein1.fa:

>prot
TRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKM

Example rna.fa:

>RNA
GAGAGAGAAGTCAACCAGAGAAACACACCAACCCATTGCACTCCGGGTTGGTGGTATATTACCTGGTACGGGGGAAACTTCGTGGTGGCCGGCCACCTGACA
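Before launching a run, it can be useful to check that every entry in the design file has a valid sequence type and a matching file in the input folder. The sketch below assumes the design file is a whitespace-separated, two-column table with a `SequenceType FileName` header row, and that each `FileName` is the input file's name without its extension; both are assumptions based on the example above rather than a documented specification. The function name `parse_design` is hypothetical.

```python
from pathlib import Path

# Sequence type codes listed in the input description.
VALID_TYPES = {"P", "R", "D", "PR"}


def parse_design(design_text: str, available_files: list[str]) -> list[tuple[str, str]]:
    """Return (sequence_type, file_name) pairs in chain order."""
    # Map each file's stem (name without extension) to its full name.
    by_stem = {Path(name).stem: name for name in available_files}

    chains = []
    for line in design_text.splitlines():
        row = line.split()
        if not row or row == ["SequenceType", "FileName"]:
            continue  # skip blank lines and the header row
        if len(row) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        seq_type, stem = row
        if seq_type not in VALID_TYPES:
            raise ValueError(f"unknown sequence type {seq_type!r}")
        if stem not in by_stem:
            raise ValueError(f"no input file found for {stem!r}")
        chains.append((seq_type, by_stem[stem]))
    return chains
```

With the example folder and design file, this would resolve `protein1` to protein1.fa, `rna` to rna.fa, and `protein2` to protein2.fasta, in that chain order.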

Outputs

  • PDB file of the predicted structure
  • rpsblast.json and rpsblast.out files from the RPS-BLAST search
  • Protein function JSON and CSV files from DeepFRI
  • HTML file with information about the protein from vizapp or from AlphaFold
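The PDB output can also be inspected programmatically. AlphaFold writes its per-residue confidence score (pLDDT) into the B-factor column of the PDB files it produces, so the values plotted in the HTML report can be read back from the file. The sketch below relies on the fixed-column layout of PDB ATOM records; the function names are hypothetical, and PDB files from the other predictors may not store a confidence score in this column.

```python
def plddt_per_residue(pdb_text: str) -> list[tuple[str, int, float]]:
    """Return (chain_id, residue_number, b_factor) for each C-alpha atom.

    PDB ATOM records use fixed columns: atom name in columns 13-16,
    chain ID in column 22, residue number in columns 23-26, and the
    B-factor in columns 61-66 (1-based).
    """
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain_id = line[21]
            residue_number = int(line[22:26])
            b_factor = float(line[60:66])
            scores.append((chain_id, residue_number, b_factor))
    return scores


def mean_plddt(pdb_text: str) -> float:
    """Average the per-residue values read from the B-factor column."""
    scores = plddt_per_residue(pdb_text)
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(score for _, _, score in scores) / len(scores)
```

Reading one value per residue from the C-alpha atom is sufficient when every atom in a residue carries the same score, which is how AlphaFold writes it.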

Workflow Walkthrough

  1. Navigate to the Protein Structure and Function launcher card. You can also find this workflow using the search tool in the top-right corner or by using the “Protein Engineering” filter on the left-hand side.
  2. Select the version from the dropdown menu in the top-right corner, and click “Run Workflow” when ready to begin analysis.
  3. Select the type of sequence you are analyzing (monomer, multimer, or nucleic acid structure). Select the type of data you will input (FASTA file, text sequence, or protein accession number), and upload the data.
  4. Select the protein structure prediction algorithm to run (RaptorX or AlphaFold) and whether to predict 2-D structure only or both 2-D and 3-D structure (the default).
  5. Give your workflow a unique name, and review the workflow parameters and inputs. When satisfied and ready to run, click “Run Workflow”.

Results Walkthrough

  1. To view the results of your Protein Structure and Function workflow run, first find and open your corresponding workflow run in the Activity tab.
  2. Navigate to the Files tab. In the output folder, you can navigate, open, and download all workflow results files.

Citations

  1. Lu, S. et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Research 48, D265–D268 (2020).
  2. Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nature Communications 12, 3168 (2021).
  3. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
  4. Peng, J. & Xu, J. RaptorX: Exploiting structure information for protein alignment by statistical inference. Proteins 79, 161–171 (2011).
  5. Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv 2021.10.04.463034 (2022) doi:10.1101/2021.10.04.463034.
  6. Baek, M. et al. Accurate prediction of nucleic acid and protein-nucleic acid complexes using RoseTTAFoldNA. bioRxiv (2022).


Protein Design


Predict new protein sequences that adopt a structure similar to that of a known protein.

Version 1.0.0

Use Cases

This workflow can be used to redesign a protein sequence while maintaining its original structure.

Summary

This workflow takes a protein PDB file as input and uses ProteinMPNN [1] to redesign its sequence while maintaining the same structure. Parameters can be set to add amino acid biases, avoid certain amino acids, or redesign only certain parts of the protein. Both monomers and multimers are accepted.
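One way to sanity-check a redesign is to compare it with the original sequence: how much of the sequence changed, and whether the positions that were meant to stay constant really did. The sketch below is a generic check, not part of the workflow; the function names are hypothetical, positions are assumed to be 1-based, and both sequences are assumed to have the same length, as is the case when the backbone is held fixed.

```python
def sequence_identity(original: str, redesign: str) -> float:
    """Fraction of positions at which the two sequences agree."""
    if not original or len(original) != len(redesign):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(original, redesign))
    return matches / len(original)


def changed_fixed_positions(original: str, redesign: str,
                            fixed_positions: set[int]) -> list[int]:
    """Return the 1-based fixed positions whose residue differs."""
    if len(original) != len(redesign):
        raise ValueError("sequences must be of equal length")
    return sorted(
        position for position in fixed_positions
        if original[position - 1] != redesign[position - 1]
    )
```

An empty list from `changed_fixed_positions` indicates that the constant regions were respected, while the identity value gives a quick measure of how far the redesign has diverged from the original sequence.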

Inputs

Protein PDB File

Outputs

FASTA File of Redesigned Sequences

Workflow Walkthrough

  1. Navigate to the Protein Design workflow on the Form Bio platform. You can locate this workflow by using the search bar in the top-right corner, or by using the Protein Engineering filter on the left-hand side.
  2. Select the version from the dropdown menu in the top-right corner. You may view information about the workflow here, including use cases and a methods summary. When ready to begin, click “Run Workflow”.
  3. Upload a PDB file containing the protein(s) of interest.
  4. Tune the parameters of the workflow run, including the number of redesigns to calculate and which sections of the protein, if any, should remain constant.
  5. Give your workflow a unique name, and review the workflow inputs and parameters. When ready to submit your run, click “Run Workflow”.

Citations

  1. Dauparas, J. et al. Robust deep learning based protein sequence design using ProteinMPNN. bioRxiv (2022) doi:10.1101/2022.06.03.494563.
