Rigorous Alignment Curation: Cleanup Of Outliers and Noise
Raccoon is a lightweight toolkit for post-consensus genomic QC and phylogenetic quality control. It provides modular tools for sequence metadata harmonization, alignment curation, and phylogenetic tree assessment. Raccoon identifies problematic sequences and sites (e.g., clustered SNPs, SNPs near Ns/gaps, frame‑breaking indels, long branches, and convergent mutations) and produces detailed reports, mask files, and curated datasets for downstream analyses.
Rationale: Quality assessment and curation of genomic sequence data are essential for robust phylogenetic inference. By systematically evaluating sequence quality, alignment accuracy, and tree topology, raccoon helps researchers identify and address data issues that could compromise epidemiological or evolutionary conclusions before they proceed to downstream analysis.
Running best-practice phylogenetics can be challenging. With raccoon, a simple alignment and phylogenetic workflow can be customised with data quality in mind.
A) Input files
B) raccoon seq-qc
Outputs:
C) alignment
Multiple sequence alignment is a key step prior to running phylogenetics. It is the scaffold upon which we can begin to reconstruct the evolutionary relationships between different sequences in the tree. We will run alignment using MAFFT, which is a popular software tool for creating multiple sequence alignments.
Output:
D) raccoon aln-qc
A high-quality alignment is crucial to generating a good phylogenetic tree. Being able to accurately assess whether there are issues with your multiple sequence alignment is a key skill that we will cover today.
The alignment is checked for issues that may affect the quality of the phylogenetic inference. Several classes of SNP (clustered SNPs, N-adjacent SNPs, gap-adjacent SNPs) are flagged because they may point to problems with the alignment or with a given sequence. If a sequence accumulates many flags (default >20), it is flagged for removal from the analysis. A flagged SNP is not necessarily wrong; it may reflect genuine biological variation. However, these sites should be investigated closely.
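To make the idea of a clustered SNP concrete, here is a minimal Python sketch. It is not raccoon's implementation: the function names, the window of 10 columns, and the minimum of 3 SNPs are assumptions chosen for illustration, and only the removal threshold of 20 comes from the text above.

```python
def find_snps(reference, query):
    """Return 0-based alignment columns where query differs from reference.

    Columns where either sequence has a gap or an N are not counted as SNPs.
    """
    ignore = set("-Nn")
    return [
        i
        for i, (r, q) in enumerate(zip(reference, query))
        if r not in ignore and q not in ignore and r.upper() != q.upper()
    ]


def clustered_snps(snp_positions, window=10, min_count=3):
    """Return SNP positions that sit in a run of >= min_count SNPs within `window` columns."""
    positions = sorted(snp_positions)
    flagged = set()
    start = 0
    for end in range(len(positions)):
        # Advance the left edge until the run fits inside the window.
        while positions[end] - positions[start] >= window:
            start += 1
        if end - start + 1 >= min_count:
            flagged.update(positions[start:end + 1])
    return sorted(flagged)


# Toy example: three SNPs close together and one isolated SNP.
reference = "ACGT" * 8
query = list(reference)
for pos, base in [(2, "A"), (4, "T"), (6, "C"), (25, "T")]:
    query[pos] = base
query = "".join(query)

snps = find_snps(reference, query)
flagged = clustered_snps(snps, window=10, min_count=3)
# Hypothetical per-sequence decision, using the >20 default mentioned above.
flag_for_removal = len(flagged) > 20
```

In this sketch the isolated SNP is left alone, which mirrors the point that only suspicious patterns, not every difference from the reference, are flagged.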
Output:
E) tree estimation
Tree building is run using IQ-TREE. The substitution model is configurable and an outgroup can optionally be included. If an outgroup is included, ancestral state reconstruction is run during tree building to provide additional checks on the tree, and the outgroup sequence is pruned from the final tree. In this case, as we are not yet familiar with the data, we will not select an outgroup because it is not clear what an appropriate outgroup would be.
Key output:
F) raccoon tree-qc
Output:
A typical raccoon workflow progresses through four main stages:
Start by harmonizing sequence headers and combining multiple FASTA files:
raccoon seq-qc -f samples_batch1.fasta samples_batch2.fasta \
-m metadata.csv \
--metadata-id-field sample_id \
--metadata-location-field location \
--metadata-date-field collection_date \
-o combined_sequences.fasta
Input: One or more FASTA files and a metadata CSV
Output: combined_sequences.fasta with structured headers (e.g., sample_id|location|date)
Align sequences using MAFFT (or another aligner):
mafft --auto combined_sequences.fasta > alignment.fasta
Input: Combined FASTA file
Output: Multiple sequence alignment in FASTA format
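Before moving on to alignment QC, it can help to confirm that the aligner really produced an alignment, meaning every sequence has the same length. The following Python sketch is an independent illustration with hypothetical helper names; it is not part of raccoon or MAFFT.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def alignment_length(records):
    """Return the shared alignment length, or raise if sequences differ in length."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return lengths.pop()


# Toy alignment held in memory; in practice the lines would come from open("alignment.fasta").
example = [
    ">a|UK|2024-01-01", "ACG-", "TT",
    ">b|UK|2024-01-02", "ACGA", "TT",
]
records = read_fasta(example)
length = alignment_length(records)
```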
Assess alignment quality and identify suspect sites:
raccoon aln-qc alignment.fasta -d alignment_qc_results \
--reference-id reference_seq_id
Input: Alignment FASTA file
Output: QC reports and a mask file (mask_sites.csv) in alignment_qc_results
Exclude flagged sites from downstream analysis:
raccoon mask alignment.fasta \
--mask-file alignment_qc_results/mask_sites.csv \
-d alignment_qc_results \
-o alignment.masked.fasta
Input: Alignment FASTA file and mask CSV
Output: Masked alignment with flagged sites replaced by mask character
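Conceptually, masking replaces flagged alignment columns with a placeholder character so that tree building ignores them. The Python sketch below illustrates this under stated assumptions: sites are given as 1-based column numbers and are masked in every sequence. The layout of raccoon's mask CSV is not described here, so the function takes a plain list of positions rather than parsing that file.

```python
def mask_alignment(records, sites, mask_character="?"):
    """Replace the given 1-based alignment columns with mask_character in every sequence."""
    columns = {site - 1 for site in sites}
    masked = {}
    for name, seq in records.items():
        masked[name] = "".join(
            mask_character if i in columns else base for i, base in enumerate(seq)
        )
    return masked


# Toy alignment with columns 3 and 6 flagged.
records = {"ref": "ACGTAC", "s1": "ACTTAC"}
masked = mask_alignment(records, [3, 6])
```

The alignment length is unchanged, so site coordinates in later reports still refer to the same columns.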
Build phylogeny (using IQ-TREE or similar):
iqtree -s alignment.masked.fasta -m GTR+G -bb 1000 -alrt 1000 -asr
Output: .treefile (phylogeny) and .state (ancestral state reconstruction, written by IQ-TREE when -asr is set)
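The .state file can be inspected directly if you want to see what the ancestral reconstruction contains. The Python sketch below assumes the usual IQ-TREE layout: comment lines starting with #, then a tab-separated table with Node, Site, and State columns followed by per-state probabilities. The function name is hypothetical and this is not how raccoon itself reads the file.

```python
import csv
import io
from collections import defaultdict


def read_asr_states(handle):
    """Return {node_name: reconstructed_sequence} from an IQ-TREE style .state table."""
    # Skip comment lines; the first remaining line is the column header.
    rows = csv.DictReader(
        (line for line in handle if not line.startswith("#")), delimiter="\t"
    )
    states = defaultdict(dict)
    for row in rows:
        states[row["Node"]][int(row["Site"])] = row["State"]
    # Join states in site order to rebuild one sequence per internal node.
    return {
        node: "".join(site_states[site] for site in sorted(site_states))
        for node, site_states in states.items()
    }


# Toy table held in memory; in practice pass open("alignment.masked.fasta.state").
example = io.StringIO(
    "# Ancestral state reconstruction (toy example)\n"
    "Node\tSite\tState\tp_A\tp_C\tp_G\tp_T\n"
    "Node1\t1\tA\t0.97\t0.01\t0.01\t0.01\n"
    "Node1\t2\tC\t0.01\t0.97\t0.01\t0.01\n"
    "Node2\t1\tA\t0.97\t0.01\t0.01\t0.01\n"
    "Node2\t2\tT\t0.01\t0.01\t0.01\t0.97\n"
)
ancestral = read_asr_states(example)
```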
Assess tree topology and identify problematic sequences:
raccoon tree-qc --tree alignment.masked.fasta.treefile \
--alignment alignment.masked.fasta \
--asr-state alignment.masked.fasta.state \
-d tree_qc_results \
--run-adar --adar-window 300 --adar-min-count 3 \
--run-apobec
Input: Tree file, alignment FASTA, and ASR state file
Output: Interactive HTML report and flagged sequence list
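The APOBEC3 and ADAR checks look for mutation patterns typical of host editing enzymes. As a rough illustration only, the Python sketch below compares a parent and child sequence on one branch, classifies substitutions by the commonly described signatures (TC>TT or GA>AA for APOBEC3, A>G or T>C for ADAR), and applies a window and minimum count like those passed above. The function names and exact criteria are assumptions, not raccoon's documented logic.

```python
def branch_substitutions(parent, child):
    """List (position, parent_base, child_base) for A/C/G/T differences; positions are 0-based."""
    bases = set("ACGT")
    return [
        (i, p, c)
        for i, (p, c) in enumerate(zip(parent, child))
        if p != c and p in bases and c in bases
    ]


def is_apobec3_like(parent, pos, ref_base, alt_base):
    """True for C>T preceded by T (TC>TT) or G>A followed by A (GA>AA)."""
    if ref_base == "C" and alt_base == "T":
        return pos > 0 and parent[pos - 1] == "T"
    if ref_base == "G" and alt_base == "A":
        return pos + 1 < len(parent) and parent[pos + 1] == "A"
    return False


def is_adar_like(ref_base, alt_base):
    """True for A>G or its reverse-complement equivalent T>C."""
    return (ref_base, alt_base) in {("A", "G"), ("T", "C")}


def has_adar_cluster(positions, window=300, min_count=3):
    """True if at least min_count positions lie within `window` bp of each other."""
    positions = sorted(positions)
    for i in range(len(positions) - min_count + 1):
        if positions[i + min_count - 1] - positions[i] <= window:
            return True
    return False


# Toy branch: the parent would come from the ASR state file, the child from the alignment.
parent = "TCAGATTCAA"
child = "TTGAACTCGA"
subs = branch_substitutions(parent, child)
apobec_sites = [pos for pos, ref, alt in subs if is_apobec3_like(parent, pos, ref, alt)]
adar_sites = [pos for pos, ref, alt in subs if is_adar_like(ref, alt)]
branch_flagged = has_adar_cluster(adar_sites, window=300, min_count=3)
```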
For complete phylogenetic quality-control workflows, raccoon-nf integrates raccoon’s modular tools with alignment and tree-building software (MAFFT, IQ-TREE) in a production-ready Nextflow pipeline. The raccoon-nf pipeline coordinates all QC steps in sequence:
raccoon-nf can be run through the EPI2ME desktop interface for users without command-line expertise. See the tutorial for a complete walkthrough.
With pip:
pip install artic-raccoon
Show help:
raccoon --help
seq-qc
Basic usage:
raccoon seq-qc -f a.fasta b.fasta -o combined.fasta
With metadata-driven headers:
raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
-m metadata.csv other_metadata.csv \
--metadata-id-field sample \
--metadata-location-field location \
--metadata-date-field date \
--header-separator '|'
With a custom header template:
raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
-m metadata.csv --header-fields "{id}|{country}|{date}"
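The two header styles can be pictured as two ways of turning a metadata row into a string. The Python sketch below is an illustration with a hypothetical build_header helper. It assumes that template placeholders such as {id} are looked up by name in the metadata row, which the text above does not confirm.

```python
import csv
import io


def build_header(row, template=None, fields=("sample", "location", "date"), separator="|"):
    """Build a FASTA header from one metadata row (a dict of column name to value)."""
    if template is not None:
        # Assumption: each {placeholder} names a key in the row.
        return template.format(**row)
    return separator.join(row[field] for field in fields)


# Toy metadata held in memory; in practice this would be read from metadata.csv.
metadata = io.StringIO("sample,location,date\nsampleA,UK,2024-03-01\n")
row = next(csv.DictReader(metadata))
default_header = build_header(row)

# Template style, using a row keyed by the placeholder names.
template_row = {"id": "sampleA", "country": "UK", "date": "2024-03-01"}
template_header = build_header(template_row, template="{id}|{country}|{date}")
```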
Key options:
- -f, --fasta: input FASTA files (one or more) (required)
- -o, --outfile: output FASTA file (default: combined.fasta; use - for stdout)
- -m, --metadata: metadata CSV file(s) for header harmonisation
- --metadata-delimiter: metadata delimiter (default ,; .tsv auto-detected)
- --metadata-id-field: metadata ID column (default: sample)
- --metadata-location-field: metadata location column (default: location)
- --metadata-date-field: metadata date column (default: date)
- --header-fields: template for custom headers (e.g. {id}|{country}|{date})
- --header-separator: separator used for non-template harmonised headers (default: |)
- --seq-id-delimiter: delimiter for parsing IDs from input headers (default: |)
- --seq-id-field-index: 0-based field index for parsed sequence ID (default: 0)
- --min-length: minimum sequence length to keep
- --max-n-content: maximum N-content proportion to keep
aln-qc
Basic usage:
raccoon aln-qc <alignment.fasta> -d outdir
With GenBank reference for frame-break checks:
raccoon aln-qc <alignment.fasta> -d outdir \
--genbank <reference.gb> --reference-id <ref_id>
Disable selected flag classes:
raccoon aln-qc <alignment.fasta> -d outdir \
--no-flag-n-adjacent --no-flag-gap-adjacent
Key options:
- alignment (positional): input alignment FASTA file (required)
- -d, --outdir: output directory (default: .)
- -t, --sequence-type: sequence type, nt or aa (default: nt)
- --genbank: GenBank file for frame-breaking indel checks
- --reference-id: reference sequence ID in alignment (for GenBank features)
- --max-n-content: N-content threshold for flagging
- --cluster-window: window size (bp) for clustered SNP detection
- --cluster-count: minimum SNPs in-window to mark as clustered
- --no-flag-clustered: skip clustered SNP flagging
- --no-flag-n-adjacent: skip N-adjacent SNP flagging
- --no-flag-gap-adjacent: skip gap-adjacent SNP flagging
- --no-flag-frame-break: skip frame-breaking indel flagging
- --flag-removal-threshold: mark sequence for removal above this flagged-site count
mask
raccoon mask <alignment.fasta> \
--mask-file results/alignment_qc/mask_sites.csv \
-d results/alignment_qc
Key options:
- --mask-file: mask CSV file from aln-qc
- --mask-character: character to use for masking (default: ?)
- -o, --outfile: output masked alignment file name
- -d, --outdir: output directory
- -t, --sequence-type: nt or aa (default: nt)
tree-qc
Basic usage:
raccoon tree-qc --tree <treefile> -d outdir \
--alignment <alignment.fasta> --asr-state <treefile>.state \
--run-adar --adar-window 300 --adar-min-count 3
Key options:
- -t, --tree: input phylogeny file (required)
- -d, --outdir: output directory (default: .)
- --tree-format: auto, newick, or nexus (default: auto)
- --alignment: alignment FASTA used with ASR state file
- --asr-state: ancestral state reconstruction file in IQ-TREE format
- --assembly-refs: assembly/reference FASTA used for mapping
- --outgroup-ids: comma-separated outgroup sequence IDs
- --mask-file: optional mask CSV with sites to ignore
- --tip-fields: template for parsing tip-label fields
- --tip-field-delimiter: delimiter used for tip field parsing
- --tip-date-field: field name treated as date in tip parsing
- --long-branch-sd: SD threshold for long-branch flagging (default: 3.0)
- --midpoint-root: midpoint-root tree for report visualisation (applied only when --asr-state is not provided)
- --run-apobec: run APOBEC3 checks
- --run-adar: run ADAR checks
- --adar-window: max distance (bp) for ADAR cluster window (default: 300)
- --adar-min-count: min ADAR sites in window to flag branch (default: 3)
- --height: optional figure height
See full CLI details in docs/cli.md.
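The --long-branch-sd option expresses a threshold in standard deviations. One plausible reading, sketched below in Python, is to flag tips whose branch length exceeds the mean by more than that many standard deviations. Which branch lengths raccoon actually uses, and how it computes the spread, are not stated here, so treat the function and its inputs as assumptions.

```python
from statistics import mean, stdev


def flag_long_branches(branch_lengths, sd_threshold=3.0):
    """Return tip names whose branch length exceeds mean + sd_threshold * SD."""
    lengths = list(branch_lengths.values())
    if len(lengths) < 2:
        return []
    cutoff = mean(lengths) + sd_threshold * stdev(lengths)
    return sorted(name for name, length in branch_lengths.items() if length > cutoff)


# Toy data: twenty short terminal branches and one much longer one.
branch_lengths = {f"tip{i}": 0.001 for i in range(20)}
branch_lengths["outlier"] = 0.05
long_branches = flag_long_branches(branch_lengths, sd_threshold=3.0)
```

With a rule of this kind, very small trees can never exceed a 3 SD cutoff, because a single outlier inflates the standard deviation it is measured against.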
seq-qc outputs
- Combined FASTA (default: combined.fasta): Sequences with harmonized headers, upper-case, single-line format
aln-qc outputs
Generated in the specified output directory (default: .):
- mask_sites.csv: Tab-separated file listing flagged sites with flag types (clustered_snps, N_adjacent, gap_adjacent, frame_break)
- alignment_flags.tsv: Detailed per-sequence report showing all flagged sites for each sequence
- alignment_qc_report.html: Interactive HTML report with:
- alignment_summary.txt: Text summary of flagging statistics
mask outputs
- Masked alignment (e.g. alignment.masked.fasta): Original alignment with flagged sites replaced by mask character (default: ?)
tree-qc outputs
Generated in the specified output directory (default: .):
- tree_qc_report.html: Interactive HTML report featuring:
- flagged_sequences.txt: List of sequences recommended for removal with justification
- convergent_mutations.csv (if ASR provided): Convergent mutations detected
- tree_summary.txt: Text summary of tree QC findings
Mask output uses the following note values:
| Note | Meaning |
|---|---|
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites that break the CDS frame length. |
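As an illustration of the frame_break note, the Python sketch below finds gap runs inside a coding region whose length is not a multiple of three. It is a simplified stand-in: it looks only at gaps in one aligned sequence over a CDS slice you supply, whereas raccoon takes CDS coordinates from the GenBank reference. The function name is hypothetical.

```python
import re


def frame_breaking_gaps(cds_sequence):
    """Return (start, length) for gap runs whose length is not a multiple of three.

    start is 0-based within the CDS slice of the aligned sequence.
    """
    return [
        (match.start(), len(match.group()))
        for match in re.finditer(r"-+", cds_sequence)
        if len(match.group()) % 3 != 0
    ]


# Toy CDS slice: a 3-base deletion keeps the frame, a 1-base deletion breaks it.
cds = "ATGAAA---CCC-GGGTAA"
breaks = frame_breaking_gaps(cds)
```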
The examples folder includes a constructed alignment and GenBank reference suitable for quick testing:
A comprehensive tutorial covering sequence metadata harmonisation, multiple sequence alignment, alignment curation, phylogenetic inference, and tree assessment is available at artic.network/tutorials/raccoon.nf. The tutorial includes:
The tutorial is suitable for both guided workshop delivery and self-paced learning.