Assess the completeness and contamination of PacBio-sequenced adeno-associated virus (AAV) constructs by examining alignment coverage across sequences and within specific regions, including the promoter and CDS.
Version 2.3.0
Use Cases
The user has completed PacBio HiFi sequencing for AAV constructs and wishes to characterize the quality of the sequencing results
The user has completed Illumina sequencing and wishes to detect variants in the output data
Summary
This is a quality control workflow that can be used to characterize PacBio adeno-associated virus (AAV) products by examining alignment coverage across sequence regions of interest. The user provides either BAM files from a PacBio sequencer run in AAV mode or the raw PacBio run data folder as a tar.gz archive that includes subreads and XML data. The user may optionally provide Illumina sequencing data for variant detection. For each run analyzed, the user receives a report of the alignment statistics.
Methods
If masking is selected, the vector sequence and the packaging plasmid sequences can be used to mask the human genome using MUMmer [1]. If the input data are PacBio uBAMs that have not been run in AAV mode, consensus reads are created using circular consensus sequencing (CCS) [2]. Reads are aligned to the reference sequences using Minimap2 [3]. A custom report of alignment statistics is generated using a workflow developed at PacBio. The resulting alignments are filtered for quality, keeping only primary alignments with mapping quality scores greater than 10. Counts and lengths of alignments to regions of interest are determined from the alignment files using Bedtools [4]. If Illumina data are provided, reads are trimmed using TrimGalore [5] to remove low-quality (quality < 25) read ends and to discard reads shorter than 35 bp. Trimmed reads are aligned to a reference genome using Minimap2. Duplicate reads can optionally be marked using Picard MarkDuplicates [6]. BAMs from the same sample generated by multiple runs are merged using Samtools [7]. Replication errors can be detected using MuTect2 [8] and Freebayes [9]. Finally, a report is generated with relevant quality metrics.
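The alignment quality filter described above can be illustrated with a minimal Python sketch. It is not the workflow's actual implementation. It assumes plain-text SAM records and treats "primary" as neither secondary nor supplementary, following the standard SAM flag bits; the function names are hypothetical.

```python
# Illustrative sketch only; the workflow's real filter is not shown in this document.
UNMAPPED = 0x4         # SAM flag bit: read is unmapped
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def passes_filter(sam_line, min_mapq=10):
    """Return True for a mapped primary alignment with MAPQ greater than min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
    mapq = int(fields[4])  # column 5 of a SAM record is MAPQ
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    return mapq > min_mapq


def filter_sam(lines, min_mapq=10):
    """Keep header lines (starting with '@') and alignments that pass the filter."""
    return [ln for ln in lines if ln.startswith("@") or passes_filter(ln, min_mapq)]
```

Note that the threshold is strict: a read with a mapping quality of exactly 10 is dropped, matching "greater than 10" in the description.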
Tips and Tricks
The genomic regions file must be in the form of a BED file with no header and three mandatory columns: chrom (name of chromosome), chromStart, and chromEnd (the starting and ending positions of the feature in the chromosome). The file also takes 9 additional optional columns, including exon count and size as well as strand. More information can be found here.
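As a rough pre-upload check, the BED requirements above could be tested with a short Python sketch like the one below. It assumes a tab-delimited file and a hypothetical helper name, `parse_bed_line`; the platform's own validation may differ.

```python
# Illustrative sketch only; assumes tab-delimited BED lines.
def parse_bed_line(line):
    """Parse one BED line into (chrom, chromStart, chromEnd, optional_columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED line needs at least chrom, chromStart and chromEnd")
    if len(fields) > 12:
        raise ValueError("BED line has more than 3 mandatory + 9 optional columns")
    chrom = fields[0]
    # int() raises ValueError on a header row such as 'chrom<TAB>chromStart<TAB>chromEnd'
    start, end = int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError("chromStart must be >= 0 and chromEnd must be >= chromStart")
    return chrom, start, end, fields[3:]
```

Because the coordinates are converted with `int()`, a file that accidentally includes a header row fails on its first line instead of being silently accepted.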
‣
Inputs
Mandatory
‣
High-Throughput Sequence Data
BAMs
PacBio CCS BAM - Sequence file from PacBio Machine run in AAV Mode OR run through recall adapter and CCS to create consensus sequences
PacBio Subreads BAM - Files from PacBio Machine without any preprocessing (not recommended)
Data Folder
PacBio Run Folder, includes subreads, XML, etc
FastQ
FastQ files generated from CCS BAM files
‣
Reference Data
Construct FastA
this sequence will be concatenated with its reverse complement if the sample is self-complementary
Plasmid Sequences
other sequences used in AAV replication, including plasmids
Target Bedfile
annotated regions of the construct, including ITR, promoter and CDS regions
Must have either:
multiple ITR regions, e.g. itr5 and itr3
a region called “vector” or “transfer” that spans itr to itr
Host Genome
Mask Genome: the genome can optionally be masked in the region of the gene in the construct
Name for output VCF (used with Illumina analysis)
AAV Serotype
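Two of the reference data rules above, the self-complementary concatenation and the Target Bedfile naming requirement, can be sketched in Python as follows. This is a hedged illustration, not the workflow's code. It assumes the construct is joined directly to its reverse complement in that order with no spacer, and that ITR regions are recognized by names beginning with "itr" (case-insensitive). All function names are hypothetical.

```python
# Illustrative sketch only; join order and name matching are assumptions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def self_complementary_reference(seq):
    """Concatenate the construct with its reverse complement (assumed order)."""
    return seq + reverse_complement(seq)


def has_required_regions(region_names):
    """Check for at least two ITR regions, or a 'vector' or 'transfer' region."""
    names = [name.lower() for name in region_names]
    itr_count = sum(1 for name in names if name.startswith("itr"))
    return itr_count >= 2 or any(name in ("vector", "transfer") for name in names)
```

For example, `has_required_regions(["itr5", "promoter", "CDS", "itr3"])` is true under these assumptions, while a file that annotates only a single ITR and no vector or transfer region would not pass.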
‣
Options
Barcode Sequence
Illumina Reads
Optional folder of fastqs for determining sequence variants using Illumina
Alignment options for the PacBio Technology
Parameters
Raw Data (check if not using PacBio HiFi BAM files as main input)
AAV Serotype (AAV2 or skip)
‣
Outputs
‣
Reference Genome and Annotation Files
genomefa.tar.gz
genome.bed
pacBioSOP.tar.gz
seq.bed
pbsopstats/sopregions.bed
‣
Report
SampleID.results.html
‣
Alignment Files
pbsopbams/SampleID.bam
pbsopbams/SampleID.pbsop.bam
Filtered for Regional Analysis
bams/SampleID.qual.bam
bams/SampleID.qual.bam.bai
‣
Raw Stats Files
pbsopstats/SampleID.Rdata
pbsopstats/SampleID.alignments.tsv
pbsopstats/SampleID.nonmatch_stat.csv.gz
pbsopstats/SampleID.per_read.csv
pbsopstats/SampleID.readsummary.tsv
pbsopstats/SampleID.sequence-error.tsv
pbsopstats/SampleID.summary.csv
pbsopstats/SampleID_AAV_report.pdf
stats/SampleID.reads2regions.bed
stats/SampleID.regionposcts.txt
featurects/SampleID.bedout.txt
featurects/SampleID.bedtools.cov.txt
Runtime Estimates
Average: 2 hours 18 minutes based on 15 test runs
‣
Workflow Walkthrough
Navigate to the AAV PacBio Quality Control workflow on the Form Bio platform. You can use the Gene Therapy or Candidate Validation filters to help you locate the launcher.
Select the version from the dropdown tab in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
Launcher Tabs
Set up sequence files for analysis.
Choose an analysis mode. PacBio SOP is based on this protocol outline; FormBio Enhanced includes additional host contamination analysis and an interactive report.
Choose an alignment computational mode. Sentieon is faster than Open Source.
Select the type of sequence input.
Provide a file containing the barcode sequence for each sample (this table can be created within the workflow itself).
Lastly, provide the directory containing the files to be analyzed (may have to scroll down).
Configure the AAV design.
Select the AAV Serotype (the base genome used for Flip Flop analysis).
Indicate the format of the Plasmid, Vector or Construct FastA sequence, upload the sequence and a file describing genomic regions.
A Packaging FastA file should be uploaded by default. Changing this file is optional.
Select the reference genome and genome annotation version.
Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.
‣
Results Walkthrough
To view results for your AAV PacBio QC workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
Upon selection, results from your workflow run are summarized in the Results tab. This view may vary based on the type of analysis run. HTML output files can be previewed or opened from here.
Under the All Files tab, you can view the final HTML file, which is nested in the output folder. You may view or download these files. These files can also be found in the File Explorer.
[5] Krueger, F., James, F., Ewels, P., Afyounian, E. & Schuster-Boeckler, B. FelixKrueger/TrimGalore: V0.6.7 - DOI via Zenodo. (2021) doi:10.5281/ZENODO.5127899.
[6] Thomer, A. K., Twidale, M. B., Guo, J. & Yoder, M. J. Picard Tools. in Conference on Human Factors in Computing Systems - Proceedings (2016).