Form Bio Platform Documentation

Power Tools

  • DeepSomatic
  • DeepTrio
  • DeepVariant
  • Download Public Data Files
  • Extract Sequences from Genome
  • FastQC
  • Kraken 2
  • Sequence Similarity Search
  • Sequencer Raw Data to FastQ
‣

DeepSomatic

image

This workflow identifies single-nucleotide variants, indels, and structural variants in genomic resequencing projects for diploid species by comparison to a reference genome.

Version 1.7.0

Use Cases

  • Identify variants in DNA samples relative to a reference genome, including single-nucleotide variants (SNVs), insertions, and deletions
    • Somatic Variant Calling
  • Supported sequencing platforms include Illumina, PacBio, and Oxford Nanopore (ONT)

Summary

This workflow is designed to run DeepSomatic with BAM files.

Methods

Variants are detected with joint calling using DeepSomatic to produce VCF files. Variant effects are determined using SnpEff [1].

‣

Inputs

  • Run Name: This is a unique name for each run of pipelines in your project
  • Organism: Reference Genome used for alignment
  • Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
  • Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
  • Sample Description File
    • This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
    • RunID should be part of the fastq file names.
    • GroupID denotes samples to be analyzed jointly
    • File Format
      RunID       SampleID       GroupID    SampleType
      SRR994739   SAMEA9454349   Tumor A    Tumor
      SRR994740   SAMEA9454349   Normal A   Normal
      SRR994741   SAMEA9454341   Tumor B    Tumor
      SRR994742   SAMEA9454342   Family B   Normal
  • Capture Bedfile
    • The intervals in the capture BED file indicate regions where alignments are expected, based on the target capture kit.
    • Make sure the file contains no column names (header row).
    • The fourth column can give a region name, which is used to identify poorly captured regions.
    • SeqName  Start    End      Name
      chr1     1787293  1787413  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1
      chr1     1787353  1787473  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2
      chr1     1789040  1789160  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_1
      chr1     1789160  1789280  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_2
      chr1     1790375  1790495  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_1
      chr1     1790495  1790615  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_2
      chr1     1793187  1793307  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_1
      chr1     1793247  1793367  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_2
      chr1     1804380  1804500  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_1
      chr1     1804500  1804620  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_2
      chr1     1806416  1806536  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_1
      chr1     1806476  1806596  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_2
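
The Sample Description File rule above (runs that share a SampleID are merged) can be previewed before launching a run. The following is a minimal sketch, assuming the file is tab-delimited with the four column names shown above; the delimiter and the helper name `runs_by_sample` are assumptions for illustration, not part of the platform.

```python
import csv
import io
from collections import defaultdict

# Example sheet using the rows from the File Format table above.
# Assumption: columns are separated by tabs.
SAMPLE_SHEET = """RunID\tSampleID\tGroupID\tSampleType
SRR994739\tSAMEA9454349\tTumor A\tTumor
SRR994740\tSAMEA9454349\tNormal A\tNormal
SRR994741\tSAMEA9454341\tTumor B\tTumor
SRR994742\tSAMEA9454342\tFamily B\tNormal
"""


def runs_by_sample(text):
    """Map each SampleID to the RunIDs that share it (hypothetical helper)."""
    merged = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        merged[row["SampleID"]].append(row["RunID"])
    return dict(merged)


groups = runs_by_sample(SAMPLE_SHEET)
for sample_id, run_ids in groups.items():
    print(sample_id, run_ids)
```

A SampleID that lists more than one RunID is one whose sequence data would be merged, so this is a quick way to catch an unintended duplicate SampleID.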
‣

Outputs

  • Variants (genomevariants)
    • DeepVariant VCF
    • MAF
‣

Workflow Walkthrough

  1. Navigate to the DeepSomatic launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics tags to find the workflow card.
  2. image
  3. Select the version from the dropdown versioning menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  4. image
  5. Select the sequencer platform that was used to generate the data, Illumina, PacBio, or Oxford Nanopore. Also provide the directory containing the files to be analyzed as well as a file containing RunIDs, LibraryIDs and SampleIDs (this table can be created within the workflow itself).
  6. image
  7. On the next tab, provide a BED file containing the genomic regions to be analyzed. You may also optionally upload a BED file detailing genomic regions of note.
  8. image
  9. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.
  10. image
‣

Results Walkthrough

  1. To view results for your DeepSomatic workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. image
  3. Upon selection, results from your workflow run are summarized in the Results tab.
  4. image
  5. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.
  6. image
‣

Citations

  1. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).

Built with

image
image
‣

DeepTrio

image

This workflow identifies single-nucleotide variants, insertions, and deletions in probands and their parents by comparison to a reference genome, for genomic resequencing projects in diploid species.

Version 1.7.0

Use Cases

  • Identify variants in DNA samples relative to a reference genome, including single-nucleotide variants (SNVs), insertions, and deletions
    • Germline Variant Calling
  • Supported sequencing platforms include Illumina, PacBio, and Oxford Nanopore (ONT)

Summary

This workflow is designed to run DeepTrio with BAM files from probands and their parents.

Methods

Variants are detected with joint calling using DeepTrio to produce VCF files. Variant effects are determined using SnpEff [1].

‣

Inputs

  • Run Name: This is a unique name for each run of pipelines in your project
  • Organism: Reference Genome used for alignment
  • Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
  • Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
  • Sample Description File
    • This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
    • RunID should be part of the fastq file names.
    • GroupID denotes samples to be analyzed jointly
    • File Format
      RunID       SampleID       GroupID    SampleType
      SRR994739   SAMEA9454349   Family A   Proband
      SRR994740   SAMEA9454349   Family A   Parent
      SRR994741   SAMEA9454341   Family A   Parent
  • Capture Bedfile
    • The intervals in the capture BED file indicate regions where alignments are expected, based on the target capture kit.
    • Make sure the file contains no column names (header row).
    • The fourth column can give a region name, which is used to identify poorly captured regions.
    • SeqName  Start    End      Name
      chr1     1787293  1787413  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1
      chr1     1787353  1787473  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2
      chr1     1789040  1789160  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_1
      chr1     1789160  1789280  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_2
      chr1     1790375  1790495  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_1
      chr1     1790495  1790615  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_2
      chr1     1793187  1793307  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_1
      chr1     1793247  1793367  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_2
      chr1     1804380  1804500  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_1
      chr1     1804500  1804620  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_2
      chr1     1806416  1806536  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_1
      chr1     1806476  1806596  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_2
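
Because GroupID denotes samples that are analyzed jointly, it can help to confirm that each group forms a complete trio before submitting. The sketch below assumes a full trio means exactly one row of SampleType Proband and two of SampleType Parent per GroupID; that rule and the helper name `check_trios` are assumptions for illustration and are not stated by the platform.

```python
from collections import Counter, defaultdict

# Rows from the File Format table above: (RunID, SampleID, GroupID, SampleType).
ROWS = [
    ("SRR994739", "SAMEA9454349", "Family A", "Proband"),
    ("SRR994740", "SAMEA9454349", "Family A", "Parent"),
    ("SRR994741", "SAMEA9454341", "Family A", "Parent"),
]


def check_trios(rows):
    """Return {GroupID: SampleType counts} for groups that are not full trios.

    Assumption: a full trio is one Proband plus two Parent rows.
    """
    counts = defaultdict(Counter)
    for _run_id, _sample_id, group_id, sample_type in rows:
        counts[group_id][sample_type] += 1
    problems = {}
    for group_id, by_type in counts.items():
        if by_type["Proband"] != 1 or by_type["Parent"] != 2:
            problems[group_id] = dict(by_type)
    return problems


print(check_trios(ROWS))       # an empty dict means every group is a full trio
print(check_trios(ROWS[:2]))   # a group with one parent missing is reported
```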
‣

Outputs

  • Variants (genomevariants)
    • DeepVariant VCF
    • MAF
‣

Workflow Walkthrough

  1. Navigate to the DeepTrio launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics tags to find the workflow card.
  2. image
  3. Select the version from the dropdown menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  4. image
  5. Select the sequencer platform that was used to generate the data, Illumina, PacBio, or Oxford Nanopore. Choose whether or not to split large fastq files into smaller files by checking the “Whole Genome Sequencing Parallelization” box (warning: may be more costly). Also provide the directory containing the files to be analyzed as well as a file containing RunIDs, LibraryIDs and SampleIDs (this table can be created within the workflow itself).
  6. image
  7. On the next tab, select a reference genome to which the input data will be compared. You may also optionally upload a BED file detailing genomic regions of note.
  8. image
  9. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.
  10. image
‣

Results Walkthrough

  1. To view results for your DeepTrio workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab.
  3. image
  4. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.
  5. image
‣

Citations

  1. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).

Built with

image
image
‣

DeepVariant

image

This workflow runs DeepVariant on BAM files.

Version 1.7.0

Use Cases

  • Identify variants in DNA samples relative to a reference genome, including single-nucleotide variants (SNVs), insertions, deletions, and structural variants
    • Germline Variant Calling
  • Identify variants in DNA samples relative to a custom reference genome for small or synthetic genomes
    • Plasmid
    • Virus
    • Bacteria
    • Synthetic Genome
  • Supported sequencing platforms include Illumina, PacBio, and Oxford Nanopore (ONT)

Summary

This workflow is designed to run DeepVariant with BAM files. It can be run either with Parabricks or with native open-source tools (NOST).

Methods

Variants are detected with joint calling using DeepVariant [1] to produce gVCF files. Genotyping of gVCF files is performed using GLnexus [2]. Variant effects are determined using SnpEff [3].

‣

Inputs

  • Run Name: This is a unique name for each run of pipelines in your project
  • Organism: Reference Genome used for alignment
  • Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
  • Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
  • Sample Description File
    • This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
    • RunID should be part of the fastq file names.
    • GroupID denotes samples to be analyzed jointly
    • File Format
      RunID       SampleID       GroupID
      SRR994739   SAMEA9454349   Family A
      SRR994740   SAMEA9454349   Family A
      SRR994741   SAMEA9454341   Family A
      SRR994742   SAMEA9454342   Family B
  • Capture Bedfile
    • The intervals in the capture BED file indicate regions where alignments are expected, based on the target capture kit.
    • Make sure the file contains no column names (header row).
    • The fourth column can give a region name, which is used to identify poorly captured regions.
    • SeqName  Start    End      Name
      chr1     1787293  1787413  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1
      chr1     1787353  1787473  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2
      chr1     1789040  1789160  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_1
      chr1     1789160  1789280  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_2
      chr1     1790375  1790495  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_1
      chr1     1790495  1790615  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_2
      chr1     1793187  1793307  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_1
      chr1     1793247  1793367  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_2
      chr1     1804380  1804500  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_1
      chr1     1804500  1804620  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_2
      chr1     1806416  1806536  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_1
      chr1     1806476  1806596  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_2
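
The Capture Bedfile requirements above (no column names, interval columns, optional name in the fourth column) can be checked locally before upload. This is a rough sketch, assuming tab-separated columns and that start must be smaller than end; `validate_bed` is a hypothetical helper, not a platform function.

```python
def validate_bed(lines):
    """Return (line_number, message) pairs for rows that look malformed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            problems.append((number, "fewer than 3 tab-separated columns"))
            continue
        start, end = fields[1], fields[2]
        # A header row such as "SeqName Start End Name" fails this check.
        if not (start.isdigit() and end.isdigit()):
            problems.append((number, "start/end are not integers (header row?)"))
            continue
        if int(start) >= int(end):
            problems.append((number, "start is not less than end"))
    return problems


bed = [
    "chr1\t1787293\t1787413\tGNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1",
    "chr1\t1787353\t1787473\tGNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2",
]
print(validate_bed(bed))
print(validate_bed(["SeqName\tStart\tEnd\tName"]))
```

The second call shows how a file that still contains its column names would be flagged on line 1.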
‣

Outputs

  • Variants (genomevariants)
    • DeepVariant VCF
    • MAF
‣

Workflow Walkthrough

  1. Navigate to the DeepVariant launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics tags to find the workflow card.
  2. image
  3. Select the version from the dropdown versioning menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  4. image
  5. Select the sequencer platform that was used to generate the data, Illumina, PacBio, or Oxford Nanopore. Choose which algorithm to run based on the input data, Parabricks DeepVariant, DeepVariant for RNAseq, or DeepVariant. Choose whether or not to split large fastq files into smaller files by checking the “Whole Genome Sequencing Parallelization” box (warning: may be more costly).
  6. image
  7. On the same tab, name the variant result output. Also provide the directory containing the files to be analyzed as well as a file relating SampleIDs to GroupIDs (this table can be created within the workflow itself).
  8. image
  9. On the next tab, select a reference genome to which the input data will be compared. You may also optionally upload a BED file detailing genomic regions of note.
  10. image
  11. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.
  12. image
‣

Results Walkthrough

  1. To view results for your DeepVariant workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. image
  3. Upon selection, results from your workflow run are summarized in the Results tab.
  4. image
  5. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.
  6. image
‣

Citations

  1. Yun, T. et al. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. (2020) doi:10.1101/2020.02.10.942086.
  2. Lin, M. F. et al. GLnexus: Joint variant calling for large cohort sequencing. (2018) doi:10.1101/343970.
  3. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).

Built with

image
image
image
‣

Download Public Data Files

image

Download publicly available Sequence Read Archive (SRA), Gene Expression Omnibus (GEO), or Recount3 data from their respective databases, files from a URL, or gene sequences.

Version 1.0.1

Use Cases

Access and download data from a variety of sources

  • Sequence Read Archive (SRA) data from the NCBI and EMBL
  • Gene Expression Omnibus (GEO) data from the NCBI
  • Recount3 data: a public RNA-seq project of human and mouse samples

Summary and Methods

This workflow downloads data from a variety of sources, including the Sequence Read Archive (SRA), gene sequences from supported genomes, the Gene Expression Omnibus (GEO), and Recount3. Click the toggles below to learn more about how the workflow accesses data from each source.

‣

Sequence Read Archive (SRA) Data

Summary

This workflow is designed to help the user download Sequence Read Archive (SRA) data from the NCBI and EMBL. The user provides a list of SRA IDs to retrieve and receives the associated SRA data.

Methods

This analysis was performed using the Download Data Files workflow on the Form Bio platform. The user provided a list of Sequence Read Archive ID numbers as a file input. The workflow retrieved any associated SRA data from the NCBI and EMBL as FastQ files.

‣

Gene Expression Omnibus (GEO)

Summary

This workflow is designed to help the user download Gene Expression Omnibus (GEO) data from the NCBI. The user will provide as input a list of GEO IDs to retrieve. The user will retrieve any information associated with the input GEO IDs.

Methods

This analysis was performed using the Download Data Files workflow on the Form Bio platform. The user provided a list of Gene Expression Omnibus (GEO) IDs as an input file. The workflow returned any information associated with the input GEO IDs as FastQ files.

‣

Recount3: Gene Expression Data in Mouse/Human

Summary

This workflow is designed to help the user access data from the Recount3 project. The user will provide as input a list of Recount3 project IDs. The user will receive as output any data associated with the input Recount3 project IDs.

Methods

This analysis was performed using the Download Data Files workflow on the Form Bio platform. The user provided a list of Recount3 project IDs as an input file. The workflow then returned any information associated with the input IDs as a FastQ file.

‣

Inputs

  • Run Name
    • This is a unique name for each run of pipelines in your project
  • SRAList
    • File with SRA Run, Sample, or Project IDs: SRR/ERR, SAM, PRJ
  • File with GEO Sample IDs: GSM
  • Recount3 Project ID
    • Search Project IDs
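
The ID prefixes listed above indicate which kind of record each entry in an input file refers to. As a small sketch, the following sorts IDs by prefix before they are placed into an input file; the prefix-to-type mapping comes from the list above, while the function name `classify` and the "unknown" fallback are assumptions for illustration.

```python
# Prefixes taken from the Inputs list above.
PREFIXES = [
    ("SRR", "SRA run"),
    ("ERR", "SRA run"),
    ("SAM", "SRA sample"),
    ("PRJ", "SRA project"),
    ("GSM", "GEO sample"),
]


def classify(accession):
    """Return the record type implied by an ID's prefix (hypothetical helper)."""
    acc = accession.strip().upper()
    for prefix, kind in PREFIXES:
        if acc.startswith(prefix):
            return kind
    return "unknown"


for acc in ["SRR994739", "SAMEA9454349", "GSM0000000", "not-an-id"]:
    print(acc, "->", classify(acc))
```

The ID `GSM0000000` is a made-up placeholder used only to exercise the GSM prefix.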
‣

Outputs

  • For SRA FastQ Files, 1 per RunID
    • RunID.fastq.gz
    • Sample to Run IDs
      RunID       SampleID
      SRR994739   SAMEA9454349
      SRR994740   SAMEA9454349
      SRR994741   SAMEA9454341
      SRR994742   SAMEA9454348
      SRR994743   SAMEA9454348
      SRR994744   SAMEA9454342
  • For Recount3 and GEO, there are a variety of files available depending on sample/project.
‣

Workflow Walkthrough

  1. Navigate to the Download Public Data Files workflow. You can use the search bar at the top-right corner to find the workflow, or use the External Data or Power Tools filter on the left-hand side.
  2. Select version from the dropdown menu in the top right corner. When ready to begin analysis, click the “Run Workflow” button.
  3. image
  4. To start, select the type of data you wish to retrieve: Sequence Read Archive (SRA) data, Gene Expression Omnibus (GEO) data, or Recount3 data. Depending on the resource you wish to access, you will be asked to provide a file containing the IDs of the data to retrieve: SRA IDs, GEO IDs, or the Recount3 project ID, respectively.
  5. image
  6. Give the workflow a unique name, and review the workflow inputs and parameters. When ready to submit, click “Run Workflow”.
  7. image
‣

Results Walkthrough

  1. To begin, find the workflow run in the Activity tab. Select your workflow.
  2. On this page, you can view an array of information about your workflow run. To find your downloaded files, select the Files tab.
  3. image
  4. In the Files tab, select the Output folder to view all output files.
  5. image
  6. Alternatively, you can find retrieved files in the pipeline-outputs folder.
  7. image
  8. Once in the folder, select getncbi.
  9. image
  10. Find the folder corresponding to your workflow run, then select output to view files.

Built with

image
image
‣

Extract Sequences from Genome

image

This workflow can be used to extract gene or region sequences.

Version 1.0.1

Use Cases

Access and download data from a variety of sources

  • Gene sequences from a supported genome
  • Genome regions from a supported genome

Summary

This workflow is designed to help the user extract sequences from a genome.

‣

Inputs

  • Run Name
    • This is a unique name for each run of pipelines in your project
  • List File
    • List of Gene Symbols
  • Genome Regions File
    • List of Genomics Regions (remember that chromosome names must match)
‣

Outputs

  • Sequences from the gene or genomic region
  • Gene sequences include the genomic sequence (with the window), the CDS, and the protein sequence
‣

Workflow Walkthrough

  1. Navigate to the Extract Sequences from Genome launcher card. You can use the search bar at the top right corner, or use the Power Tool tag to find the workflow card.
  2. image
  3. Select the version from the dropdown menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  4. image
  5. Select the type of data source (gene sequence or gene region), then select a reference genome and assembly version. Next, set the gene window size, which is the length upstream and downstream of the gene to include in the extracted sequence. Finally, provide a file containing the genes to be extracted.
  6. image
  7. Give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.
  8. image
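
The gene window size in the launcher step above extends the extracted region on both sides of the gene. The following is a minimal sketch of that arithmetic, assuming 0-based coordinates and clipping at the chromosome boundaries; the function name `windowed_region` and the clipping behavior are assumptions, since the workflow's internal handling is not documented here.

```python
def windowed_region(start, end, window, chrom_length=None):
    """Extend [start, end) by `window` bases on each side (hypothetical helper).

    Assumptions: 0-based coordinates; the region is clipped at 0 and, when
    given, at the chromosome length.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    new_start = max(0, start - window)
    new_end = end + window
    if chrom_length is not None:
        new_end = min(new_end, chrom_length)
    return new_start, new_end


print(windowed_region(1787293, 1787413, 1000))
print(windowed_region(500, 800, 1000, chrom_length=1200))
```

The second call shows a gene close to both ends of a short sequence, where the window is truncated rather than extended past the sequence.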
‣

Results Walkthrough

  1. To view results for your Extract Sequences from Genome workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab. Previews of the output files can be viewed here.
  3. image
  4. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.
  5. image

Built with

image
‣

FastQC

image

FastQC is a quality control tool for high throughput sequence data. It reads in sequence data in a variety of formats and can either provide an interactive application to review the results or create an HTML report summarizing the statistics.

Version 1.0.0

Use Cases

  • Assess Sequence Quality

Summary

FastQC [1] is used to assess sequence quality.

‣

Workflow Walkthrough

  1. Navigate to the FastQC launcher card. You can use the search bar at the top right corner, or use the Power Tool or Genomics tags to find the workflow card.
  2. image
  3. Select the version from the dropdown menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  4. image
  5. Select the directory where the FastQ files for your analysis are located. Accepted file formats are .fq.gz or .fastq.gz.
  6. image
  7. Give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.
  8. image
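
Since only gzip-compressed FastQ files with the two extensions named in the walkthrough are accepted, it can be useful to list which files in a folder would qualify. A brief sketch, assuming the match is made on the file name ending alone; `accepted_fastqs` is a hypothetical helper.

```python
# Extensions named in the walkthrough above.
ACCEPTED_SUFFIXES = (".fq.gz", ".fastq.gz")


def accepted_fastqs(names):
    """Return the file names that end in an accepted extension, sorted."""
    return sorted(name for name in names if name.endswith(ACCEPTED_SUFFIXES))


folder_listing = ["a.fastq.gz", "b.fq.gz", "c.fastq", "d.bam"]
print(accepted_fastqs(folder_listing))
```

Uncompressed files such as `c.fastq` would need to be gzip-compressed before they match.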
‣

Results Walkthrough

  1. To view results for your FastQC workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab. HTML output files can be previewed or opened from here.
  3. image
  4. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.
  5. image
‣

Citations

  1. Andrews, S. et al. FastQC. (2012).

Built with

image
‣

Kraken 2

image

Kraken 2 is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Kraken uses a k-mer-based approach to achieve a high level of accuracy and fast classification speeds.

Version 1.0.0

Use Cases

  • Classify sequences by taxonomy

Summary

Kraken 2 [1] is run with a confidence threshold of 0.5, using either Kraken’s precompiled database or a custom database.
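
For readers who want to reproduce the setting outside the platform, the sketch below assembles a comparable `kraken2` command line. The flags shown (`--db`, `--confidence`, `--gzip-compressed`, `--report`, `--output`) are standard Kraken 2 options, but the database path, file names, and the exact command the workflow runs are assumptions.

```python
import shlex


def kraken2_command(database, fastq, confidence=0.5):
    """Build a kraken2 argument list (illustrative; paths are placeholders)."""
    return [
        "kraken2",
        "--db", database,                 # precompiled or custom database
        "--confidence", str(confidence),  # threshold described above
        "--gzip-compressed",              # input is .fq.gz / .fastq.gz
        "--report", "kraken2_report.txt",
        "--output", "kraken2_output.txt",
        fastq,
    ]


cmd = kraken2_command("/path/to/kraken2_db", "sample.fastq.gz")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` on a machine where Kraken 2 and a database are installed.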

‣

Workflow Walkthrough

  1. Navigate to the Kraken 2 launcher card. You can use the search bar at the top right corner, or use the Power Tool or Genomics tags to find the workflow card.
  2. image
  3. Select the version from the dropdown menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  4. image
  5. Select the directory where the FastQ files for your analysis are located. Accepted file formats are .fq.gz or .fastq.gz. Choose how you want any contaminants to be filtered.
  6. image
  7. Give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.
  8. image
‣

Results Walkthrough

  1. To view results for your Kraken 2 workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab. HTML output files can be previewed or opened from here.
  3. image
  4. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.
  5. image
‣

Citations

  1. Wood, D. E. & Salzberg, S. L. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15, R46 (2014).

Built with

image
‣

Sequence Similarity Search

image

Find gene sequences that are significantly similar to known query sequences.

Version 1.1.1

Use Case

Identify Similar Sequences using a Database where similarity is determined by statistical significance of the alignment score

Summary

This workflow uses Blast [1, 2], Diamond [3], SSEARCH, FASTA36 [4], or Miniprot [5] for sequence comparison and similarity searches in biological databases.

  1. Blast (Basic Local Alignment Search Tool): Blast is a widely used algorithm for comparing biological sequences, such as DNA, RNA, or protein sequences, against a large database. It employs a heuristic approach to find regions of local similarity between sequences. Blast provides a measure of sequence similarity, identifies regions of conservation, and predicts functional and evolutionary relationships between sequences.
  2. Diamond: Diamond is a sequence alignment tool specifically designed for comparing protein sequences against protein sequence databases. It utilizes a fast and sensitive algorithm based on the concept of seed-and-extend alignment. Diamond is known for its high speed and is often used for large-scale protein sequence analysis.
  3. SSEARCH: SSEARCH is a sequence comparison tool that performs rigorous local sequence alignment using the Smith-Waterman algorithm. It is known for its sensitivity in detecting distant homologs by finding optimal local alignments. SSEARCH is commonly used in protein sequence analysis and is particularly effective when comparing sequences with low similarity.
  4. FASTA36: FASTA36 is a versatile and widely used program for comparing protein and nucleotide sequences against sequence databases. It employs the FASTA algorithm, which is based on local alignment. FASTA36 is known for its sensitivity in identifying distant homologs and can be used for both database searches and pairwise alignments.
  5. Miniprot: Miniprot is an extremely fast protein-to-genome aligner developed by Heng Li, the developer of minimap2. It outputs alignments in PAF (Pairwise mApping Format) and GTF (Gene Transfer Format).
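
To make the idea of an optimal local alignment concrete, here is a minimal sketch of Smith-Waterman scoring, the algorithm SSEARCH is based on. It assumes simple match/mismatch scores and a linear gap penalty chosen for illustration; SSEARCH itself uses substitution matrices and affine gap penalties, so this is not its implementation.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only: fixed match/mismatch values and a linear
    gap penalty (assumptions, not SSEARCH defaults).
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are floored at 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

In the second call only the shared `ACG` core contributes to the score, which illustrates why local alignment can find a conserved region inside otherwise dissimilar sequences.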

Methods

This analysis was performed using the Sequence Similarity Search workflow on the Form Bio platform. This workflow takes an input FastA file and performs a sequence similarity search with BLAST [1, 2], Diamond [3], SSEARCH, FASTA36 [4], or Miniprot [5].

‣

Inputs

  • FastA file
  • Algorithm
  • Database or Genome
‣

Outputs

Sequence Alignment Output

Runtime Estimates

Average = 1 hour 19 minutes

image
‣

Workflow Walkthrough

  1. Navigate to the Sequence Similarity Search workflow. You can use the search tool at the top right corner, or find it using the Sequence Alignment filter on the left side.
  2. Select the version from the dropdown menu in the top right corner. You can view use cases, a summary, and inputs/outputs on this page. When ready to begin analysis, click “Run Workflow”.
  3. image
  4. Launcher Tabs
    1. Provide a FastA file containing the query sequence(s). Then, select the type of sequence search (nucleotide to nucleotide, nucleotide to protein, etc.), the search algorithm, and the database to search.
    2. image
    3. Choose the output file format (XML, HTML, etc.), and tune additional search parameters depending on your chosen algorithm.
    4. image
    5. Review workflow parameters and inputs, and give the workflow run a unique name. When ready to submit the job, click “Run Workflow”.
    6. image
‣

Results Walkthrough

  1. To view the results of your Sequence Similarity Search workflow run, first find and select your workflow run from the Activity tab.
  2. Navigate to the Files tab, then find workflow outputs under the output folder.
  3. image
‣

Citations

  1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990).
  2. Pearson, W. R. & Lipman, D. J. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America 85, 2444–2448 (1988).
  3. Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18, 366–368 (2021).
  4. Pearson, W. R. FASTA36. GitHub repository (wrpearson/fasta36).
  5. Li, H. Protein-to-genome alignment with miniprot. Bioinformatics 39, btad014 (2023).
‣

Sequencer Raw Data to FastQ

image

Converts Sequencer Data to FastQ; includes DeepConsensus for PacBio.

Version 0.0.4

Use Cases

  • The user has completed PacBio HiFi sequencing
  • The user has completed Illumina sequencing
  • The user has completed Oxford Nanopore sequencing

Summary

This workflow converts raw data from a sequencer into FastQ format.

Methods

If the input data is PacBio subread uBAMs or a sequencer run folder, consensus reads are created using circular consensus sequencing (CCS) [1]. Optionally, DeepConsensus can be used to improve base calls [2]. If the input data is ONT fast5 data, basecalling is performed using Dorado [3]. If the input data is an Illumina sequencer run folder, bcl2fastq is run to create fastq files and demultiplex them using sample barcodes [4].
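
The branching described above can be summarized as a lookup from input type to the tools that run. This is a schematic sketch only: the input-type keys and the function name `conversion_steps` are hypothetical labels, and the tool names are taken from the paragraph above rather than from the workflow's source.

```python
# Hypothetical input-type labels mapped to the tools named in Methods.
STEPS = {
    "pacbio_subreads": ["CCS"],
    "ont_fast5": ["Dorado"],
    "illumina_run_folder": ["bcl2fastq"],
}


def conversion_steps(input_type, use_deepconsensus=False):
    """Return the ordered tool names for an input type (illustrative only)."""
    if input_type not in STEPS:
        raise ValueError(f"unsupported input type: {input_type}")
    steps = list(STEPS[input_type])
    # DeepConsensus is optional and applies only to the PacBio branch.
    if input_type == "pacbio_subreads" and use_deepconsensus:
        steps.append("DeepConsensus")
    return steps


print(conversion_steps("pacbio_subreads", use_deepconsensus=True))
print(conversion_steps("ont_fast5"))
print(conversion_steps("illumina_run_folder"))
```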

‣

Inputs

Mandatory

High-Throughput Sequence Data

  • Input data folder containing PacBio BAMs or ONT fast5 files, or a run folder from PacBio (includes subread XML) or Illumina
  • BAMs
    • PacBio CCS BAM - Sequence file from a PacBio machine run in AAV mode, or run through recall adapter and CCS to create consensus sequences
    • PacBio Subreads BAM - Files from a PacBio machine without any preprocessing (not recommended)
  • Data Folder
    • PacBio Run Folder, includes subreads, XML, etc
  • FastQ
    • FastQ files generated from CCS BAM files

Optional Inputs

  • Barcode Sequence
    • used to separate multiplexed samples by sequence barcodes or indexes
  • MASseq Adapters
    • used in masseq applications
‣

Outputs

FastQ files per Sample

‣

Workflow Walkthrough

  1. Navigate to the Sequencer Raw Data to FastQ launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics tags to find the workflow card.
  2. image
  3. Select the version from the dropdown versioning menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  4. image
  5. Select the sequence input type from the drop-down menu. Also provide the directory containing the files to be analyzed as well as a file containing the barcode sequence for each sample (this table can be created within the workflow itself).
  6. image
  7. On the next tab, choose the data assay type (or ONT basecalling model) based on the sequence input type that was selected on the first tab. Configure any additional parameters. These parameters will differ for Illumina vs PacBio vs ONT fast5 input types:
  8. Illumina parameter interface:

    image

    PacBio parameter interface:

    image

    ONT fast5 parameter interface:

    image
  9. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.
  10. image
‣

Results Walkthrough

  1. To view results for your Sequencer Raw Data to FastQ workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. image
  3. Upon selection, results from your workflow run are summarized in the Results tab.
  4. image
  5. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.
  6. image
‣

Citations

  1. Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research 38, e159–e159 (2010).
  2. Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nature Biotechnology 41, 232–238 (2023).
  3. Chris Seymour, Joyjit Daw, Mike Vella & Mark Bicknell. Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nanopore reads.
  4. Bcl2fastq.

Built with

image
image
image