raccoon


Rigorous Alignment Curation: Cleanup Of Outliers and Noise

Raccoon is a lightweight toolkit for post-consensus genomic QC and phylogenetic quality control. It provides modular tools for sequence metadata harmonization, alignment curation, and phylogenetic tree assessment. Raccoon identifies problematic sequences and sites (e.g., clustered SNPs, SNPs near Ns/gaps, frame‑breaking indels, long branches, and convergent mutations) and produces detailed reports, mask files, and curated datasets for downstream analyses.

Rationale: Quality assessment and curation of genomic sequence data are essential for robust phylogenetic inference. By systematically evaluating sequence quality, alignment accuracy, and tree topology, raccoon helps researchers identify and address data issues that could compromise epidemiological or evolutionary conclusions before they proceed with downstream analysis.


Contents

Use cases

Sequence QC

Alignment QC

Masking

Phylogenetic QC

Quickstart

Typical Workflow

Running best-practice phylogenetics can be challenging. With raccoon, a simple alignment and phylogenetic workflow can be customised with data quality in mind.


A) Input files

B) raccoon seq-qc

Outputs:

C) alignment

Multiple sequence alignment is a key step prior to running phylogenetics. It is the scaffold upon which we can begin to reconstruct the evolutionary relationships between different sequences in the tree. We will run alignment using MAFFT, which is a popular software tool for creating multiple sequence alignments.

Output:

D) raccoon aln-qc

A high-quality alignment is crucial to generating a good phylogenetic tree. Being able to accurately assess whether there are issues with your multiple sequence alignment is a key skill that we will cover today.

The alignment is checked for issues that may affect the quality of phylogenetic inference. Several kinds of SNP are flagged (clustered SNPs, N-adjacent SNPs, gap-adjacent SNPs) because they may point to problems with the alignment or with a given sequence. If a sequence accumulates many flags (default: more than 20), that sequence is flagged for removal from the analysis. A flagged SNP is not necessarily an error and may reflect genuine biological variation, but these sites should be investigated closely.
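To make the flagging idea concrete, here is a minimal Python sketch of one way clustered SNPs could be detected against a reference. It is not raccoon's implementation: the function names, the 10-column window and the minimum of 3 SNPs per cluster are illustrative assumptions. Only the removal threshold of more than 20 flags per sequence comes from the description above.

```python
def snp_positions(ref, seq):
    """Return 0-based alignment columns where seq differs from ref by an unambiguous base."""
    bases = set("ACGT")
    return [i for i, (r, s) in enumerate(zip(ref.upper(), seq.upper()))
            if r in bases and s in bases and r != s]


def clustered_snps(positions, window=10, min_snps=3):
    """Return SNP positions that sit in a run of >= min_snps SNPs spanning < window columns.

    window and min_snps are hypothetical values chosen for illustration.
    """
    flagged = set()
    positions = sorted(positions)
    for i in range(len(positions) - min_snps + 1):
        group = positions[i:i + min_snps]
        if group[-1] - group[0] < window:
            flagged.update(group)
    return sorted(flagged)


def flag_for_removal(n_flags, max_flags=20):
    """A sequence is flagged for removal when it carries more than max_flags flagged sites."""
    return n_flags > max_flags


ref = "ACGTACGTACGTACGTACGTACGTACGTAC"
seq = "ACGAACCTAGGTACGTACGTACGTACGTAA"
snps = snp_positions(ref, seq)
print(snps)                  # [3, 6, 9, 29]
print(clustered_snps(snps))  # [3, 6, 9]
```

In this toy example the three SNPs at columns 3, 6 and 9 fall within one window and are flagged together, while the isolated SNP at column 29 is left alone.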

Output:

E) tree estimation

Tree building is run using IQ-TREE. The substitution model used is configurable and an outgroup can optionally be included. If an outgroup is included, ancestral state reconstruction will be run during the tree building process to provide additional checks on the tree, and the outgroup sequence will be pruned off from the final tree. In this case, as we are not yet familiar with the data, we will not select an outgroup as it is not clear what an appropriate outgroup would be.

Key output:

F) raccoon tree-qc

Output:

A typical raccoon workflow progresses through six main stages:

  1. Sequence QC – Combine and harmonize sequence metadata across multiple input files
  2. Alignment – Generate a multiple sequence alignment (using external tools like MAFFT)
  3. Alignment QC – Flag problematic sites and generate a mask file
  4. Optional masking – If the flagged sites need to be removed from the sequences, the alignment can be masked in this step
  5. Tree estimation – Estimate a maximum likelihood phylogeny (using external tools like IQ-TREE)
  6. Phylogenetic QC – Assess tree quality and identify outlier sequences

Step 1: Sequence Quality Control

Start by harmonizing sequence headers and combining multiple FASTA files:

raccoon seq-qc -f samples_batch1.fasta samples_batch2.fasta \
  -m metadata.csv \
  --metadata-id-field sample_id \
  --metadata-location-field location \
  --metadata-date-field collection_date \
  -o combined_sequences.fasta

Input:

Step 2: Multiple Sequence Alignment

Align sequences using MAFFT (or another aligner):

mafft --auto combined_sequences.fasta > alignment.fasta

Input: Combined FASTA file
Output: Multiple sequence alignment in FASTA format
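
Before running alignment QC it can be useful to confirm that the aligner output is a valid alignment, i.e. that every record has the same length. The following is a minimal sketch using only the Python standard library; `read_fasta` and `check_alignment` are hypothetical helper names and are not part of raccoon or MAFFT.

```python
import io


def read_fasta(handle):
    """Parse FASTA text from a file-like object into a dict of {header: sequence}."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def check_alignment(records):
    """Return the shared alignment length, or raise ValueError if lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return lengths.pop()


example = io.StringIO(">a\nACGT-A\n>b\nACGTTA\n")
print(check_alignment(read_fasta(example)))  # 6
```

For a real file, the same check would be run with `open("alignment.fasta")` in place of the in-memory example.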

Step 3: Alignment Quality Control

Assess alignment quality and identify suspect sites:

raccoon aln-qc alignment.fasta -d alignment_qc_results \
  --reference-id reference_seq_id

Input:

Step 4: Apply Mask (Optional)

Exclude flagged sites from downstream analysis:

raccoon mask alignment.fasta \
  --mask-file alignment_qc_results/mask_sites.csv \
  -d alignment_qc_results \
  -o alignment.masked.fasta

Input: Alignment FASTA file and mask CSV
Output: Masked alignment with flagged sites replaced by mask character
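
Conceptually, masking replaces every flagged alignment column with a mask character in all sequences. The sketch below illustrates that idea in Python, assuming 1-based site positions and `N` as the mask character. Both are assumptions made for illustration, and the actual layout of mask_sites.csv is not reproduced here.

```python
def mask_alignment(records, sites, mask_char="N"):
    """Return a copy of records with the given 1-based columns replaced by mask_char.

    records: dict of {header: aligned sequence}
    sites:   iterable of 1-based alignment positions to mask (assumed convention)
    """
    columns = {s - 1 for s in sites}
    masked = {}
    for name, seq in records.items():
        masked[name] = "".join(
            mask_char if i in columns else base for i, base in enumerate(seq)
        )
    return masked


alignment = {"ref": "ACGTACGT", "sample1": "ACGAACCT"}
print(mask_alignment(alignment, sites=[4, 7]))
# {'ref': 'ACGNACNT', 'sample1': 'ACGNACNT'}
```

Because whole columns are masked, the alignment length is unchanged and downstream coordinates stay valid.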

Step 5: Phylogenetic Inference

Build phylogeny (using IQ-TREE or similar):

iqtree -s alignment.masked.fasta -m GTR+G -bb 1000 -alrt 1000 -asr

Output: .treefile (phylogeny) and .state (ancestral state reconstruction, if using IQ-TREE)

Step 6: Phylogenetic Quality Control

Assess tree topology and identify problematic sequences:

raccoon tree-qc --tree alignment.masked.fasta.treefile \
  --alignment alignment.masked.fasta \
  --asr-state alignment.masked.fasta.state \
  -d tree_qc_results \
  --run-adar --adar-window 300 --adar-min-count 3 \
  --run-apobec

Input: Tree file, alignment FASTA, and ASR state file
Output: Interactive HTML report and flagged sequence list
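
One common symptom of a problematic sequence is an unusually long terminal branch. The sketch below shows a simple way to spot such tips in a Newick tree using only the Python standard library. It is an illustrative heuristic, not raccoon's method: the function names and the cutoff of 5 times the median tip branch length are assumptions.

```python
import re
from statistics import median


def tip_branch_lengths(newick):
    """Extract {tip_name: terminal branch length} from a Newick string.

    Tips are labels that directly follow '(' or ','. Internal node labels
    (e.g. support values) follow ')' and are skipped. Quoted labels are not handled.
    """
    pattern = re.compile(r"(?<=[(,])([^(),:;]+):([0-9.eE+-]+)")
    return {name.strip(): float(length) for name, length in pattern.findall(newick)}


def long_branch_tips(newick, factor=5.0):
    """Return tips whose terminal branch exceeds factor x the median tip branch length.

    factor is a hypothetical threshold chosen for illustration.
    """
    lengths = tip_branch_lengths(newick)
    cutoff = factor * median(lengths.values())
    return sorted(name for name, length in lengths.items() if length > cutoff)


tree = "((A:0.01,B:0.012)95:0.003,(C:0.011,D:0.2)88:0.004,E:0.009);"
print(long_branch_tips(tree))  # ['D']
```

A tip flagged this way is only a candidate for inspection, since a long branch can also reflect genuine divergence.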

Integrated workflows

raccoon-nf: End-to-end Nextflow pipeline

For complete phylogenetic quality-control workflows, raccoon-nf integrates raccoon’s modular tools with alignment and tree-building software (MAFFT, IQ-TREE) in a production-ready Nextflow pipeline. The raccoon-nf pipeline coordinates all QC steps in sequence:

  1. Sequence QC – harmonise headers and filter sequences
  2. Alignment – run MAFFT on combined sequences
  3. Alignment QC – assess alignment quality and flag problematic sites
  4. Tree estimation – build phylogenetic tree with IQ-TREE
  5. Tree QC – evaluate tree topology and identify outliers

raccoon-nf can be run through the EPI2ME desktop interface for users without command-line expertise. See the tutorial for a complete walkthrough.

Standalone installation

With pip:

pip install artic-raccoon

CLI usage

Show help:

raccoon --help

Sequence QC (seq-qc)

Basic usage:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta

With metadata-driven headers:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
  -m metadata.csv other_metadata.csv \
  --metadata-id-field sample \
  --metadata-location-field location \
  --metadata-date-field date \
  --header-separator '|'

With a custom header template:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
  -m metadata.csv --header-fields "{id}|{country}|{date}"
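
The effect of a header template can be pictured as filling placeholders from each sample's metadata row. The Python sketch below shows that idea, assuming placeholders correspond to metadata column names. `build_headers` is a hypothetical helper, the metadata rows are made-up example data, and raccoon's own template handling may differ.

```python
import csv
import io


def build_headers(metadata_rows, template="{id}|{country}|{date}", id_field="id"):
    """Map each sample ID to a header built by filling template from its metadata row."""
    return {row[id_field]: template.format(**row) for row in metadata_rows}


# Made-up metadata for illustration only.
metadata_csv = io.StringIO(
    "id,country,date\n"
    "sampleA,Kenya,2023-04-01\n"
    "sampleB,Ghana,2023-05-17\n"
)
headers = build_headers(csv.DictReader(metadata_csv))
print(headers["sampleA"])  # sampleA|Kenya|2023-04-01
```

A row that lacks one of the template's columns raises a `KeyError` in this sketch, which makes missing metadata visible early.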

Key options:

Alignment QC (aln-qc)

Basic usage:

raccoon aln-qc <alignment.fasta> -d outdir

With GenBank reference for frame-break checks:

raccoon aln-qc <alignment.fasta> -d outdir \
  --genbank <reference.gb> --reference-id <ref_id>

Disable selected flag classes:

raccoon aln-qc <alignment.fasta> -d outdir \
  --no-flag-n-adjacent --no-flag-gap-adjacent
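
The two flag classes disabled above concern SNPs that sit next to Ns or gaps. The Python sketch below shows one way such SNPs could be labelled. The function names and the 2-column window are assumptions for illustration and do not describe raccoon's actual defaults or code.

```python
def adjacent_to_run(seq, pos, chars, window=2):
    """True if any character in chars occurs within window columns of 0-based pos."""
    start = max(0, pos - window)
    neighbourhood = seq[start:pos] + seq[pos + 1:pos + 1 + window]
    return any(c in chars for c in neighbourhood.upper())


def classify_snps(ref, seq, window=2):
    """Label each SNP as 'N_adjacent', 'gap_adjacent' or 'ok' (window is an assumed value)."""
    labels = {}
    for i, (r, s) in enumerate(zip(ref.upper(), seq.upper())):
        # Skip matching columns and columns where either sequence has an N or a gap.
        if r == s or s in "N-" or r in "N-":
            continue
        if adjacent_to_run(seq, i, "N", window):
            labels[i] = "N_adjacent"
        elif adjacent_to_run(seq, i, "-", window):
            labels[i] = "gap_adjacent"
        else:
            labels[i] = "ok"
    return labels


ref = "ACGTACGTACGTACGT"
seq = "ACGAANNTAC-CACTT"
print(classify_snps(ref, seq))
# {3: 'N_adjacent', 11: 'gap_adjacent', 14: 'ok'}
```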

Key options:

Apply mask (mask)

raccoon mask <alignment.fasta> \
  --mask-file results/alignment_qc/mask_sites.csv \
  -d results/alignment_qc

Key options:

Phylogenetic QC (tree-qc)

Basic usage:

raccoon tree-qc --tree <treefile> -d outdir \
  --alignment <alignment.fasta> --asr-state <treefile>.state \
  --run-adar --adar-window 300 --adar-min-count 3
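
ADAR editing converts adenosine to inosine, which is read as G, so it appears as A>G changes (or T>C on the complementary strand). One plausible reading of the window and minimum-count options is "at least 3 such changes within 300 bases". The Python sketch below implements that reading only. It is an assumption about what the options mean, not a description of raccoon's code, and the function names are hypothetical.

```python
def adar_like_sites(ancestor, descendant):
    """Return 0-based positions of A>G or T>C changes between two aligned sequences."""
    pairs = {("A", "G"), ("T", "C")}
    return [i for i, (a, d) in enumerate(zip(ancestor.upper(), descendant.upper()))
            if (a, d) in pairs]


def has_adar_cluster(sites, window=300, min_count=3):
    """True if at least min_count sites fall within any span shorter than window bases."""
    sites = sorted(sites)
    for i in range(len(sites) - min_count + 1):
        if sites[i + min_count - 1] - sites[i] < window:
            return True
    return False


print(adar_like_sites("AATTCG", "GACTCG"))   # [0, 2]
print(has_adar_cluster([120, 250, 390]))     # True
print(has_adar_cluster([120, 250, 1390]))    # False
```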

Key options:

See full CLI details in docs/cli.md.

Output Descriptions

seq-qc outputs

aln-qc outputs

Generated in the specified output directory (default: .):

mask outputs

tree-qc outputs

Generated in the specified output directory (default: .):

Mask notes

Mask output uses the following note values:

| Note | Meaning |
| --- | --- |
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites that break the CDS frame length. |
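
As an illustration of the frame_break note, a gap run inside a coding region breaks the reading frame when its length is not a multiple of three. The Python sketch below checks for that, assuming 0-based, end-exclusive CDS coordinates in alignment space. The function name and coordinate convention are assumptions, and insertions relative to the reference are not covered.

```python
import re


def frame_breaking_gaps(seq, cds_start, cds_end):
    """Return (start, length) of gap runs inside a CDS whose length is not a multiple of 3.

    cds_start and cds_end are 0-based, end-exclusive alignment coordinates (assumed convention).
    """
    breaks = []
    for match in re.finditer(r"-+", seq[cds_start:cds_end]):
        length = match.end() - match.start()
        if length % 3 != 0:
            breaks.append((cds_start + match.start(), length))
    return breaks


seq = "ATGGCC---AAAT--GGGTAA"
print(frame_breaking_gaps(seq, 0, len(seq)))  # [(13, 2)]
```

The 3-base gap at column 6 keeps the frame intact and is ignored, while the 2-base gap at column 13 is reported.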

Example data

The examples folder includes a constructed alignment and GenBank reference suitable for quick testing:

Tutorial

A comprehensive tutorial covering sequence metadata harmonisation, multiple sequence alignment, alignment curation, phylogenetic inference, and tree assessment is available at artic.network/tutorials/raccoon.nf. The tutorial includes:

The tutorial is suitable for both guided workshop delivery and self-paced learning.