Exploring The Workflow of The Oxford Nanopore Technologies (ONT) from pod5 to Polished Assembly - A Note To Myself
By Ken Koon Wong in ont oxford nanopore technology minion epi2me dorado samtools flye
September 21, 2025
Hands-on with Oxford Nanopore workflow: pod5 → BAM → assembly! Processed M. tuberculosis from raw signals to complete genome using dorado + flye. Fascinating how thousands of reads became one 4M+ bp sequence. Polishing question: when is it worth the compute time?
Motivations:
ONT MinION is really cool! I have been interested in this technology for a few months now and think it has great potential. In case we get an opportunity to use it, why not get familiar with the workflow now, so that we're prepared to process the sequence data? Let's explore! After some reading, it seems that ONT produces a raw signal file called pod5, and that file needs to be basecalled and converted to fasta or fastq format.
TL;DR Workflow
Basecall pod5 to BAM (dorado) > Convert BAM to fastq (samtools) > Assemble (flye) > Align (dorado) & Index (samtools) > Polish (dorado)
Objectives:
Find a pod5 file
If I'm not mistaken, an ONT sequencer outputs pod5 files. The way they capture the sequence is very interesting! Apparently, as each nucleotide passes through the pore, it changes the ionic current, and pod5 basically records those signals. Each signal reflects multiple nucleotides. There is a Python package called pod5Viewer that lets us view a pod5 file.
Now let's find one and download it. I found a Mycobacterium tuberculosis pod5 dataset from The University of Melbourne. Interestingly, there are 45 pod5 files for a single isolate. Very cool! Most of the other pod5 datasets I saw were just 1 big file. FYI, the zip file of pod5 files was about 7 GB! Let's view one of the files.
pod5viewer
# install
conda create -n p5v python=3.10
conda activate p5v
pip install pod5Viewer
# run
pod5Viewer
You can open the pod5 file and inspect the raw signal like so.
Interesting looking thing! Now, on to our first step of our workflow, basecalling!
Workflow
Basecall pod5 to BAM
We can use dorado to basecall the pod5 files and write the result in BAM format, like so.
# installation
curl "https://cdn.oxfordnanoportal.com/software/analysis/dorado-1.1.1-osx-arm64.zip" -o dorado-1.1.1-osx-arm64.zip
unzip dorado-1.1.1-osx-arm64.zip
export PATH="/path/to/dorado-1.1.1-osx-arm64/bin:$PATH"
# basecall
dorado basecaller hac --device metal /path/to/your/pod5_files/ > mycobacterium_basecalled.bam
The BAM file will be about 600 MB.
Turn BAM into fastq
# install
wget https://github.com/samtools/samtools/releases/download/1.22.1/samtools-1.22.1.tar.bz2
# unpack the tarball before building
tar -xjf samtools-1.22.1.tar.bz2
cd samtools-1.22.1
./configure --prefix=/where/to/install
make
make install
export PATH=/where/to/install/bin:$PATH
# convert
samtools fastq mycobacterium_basecalled.bam > mtb.fastq
Wow, the fastq is 1.4 GB. Let's take a look at the fastq!
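A quick way to sanity-check the fastq before going further is to count the reads and bases. The sketch below is not part of the original workflow: fastq_stats is a hypothetical helper name, and it assumes the four-lines-per-record layout that samtools fastq writes.

```shell
# fastq_stats FILE
# Hypothetical helper, not part of samtools or dorado.
# Assumes every record is exactly 4 lines: header, sequence, "+", qualities.
fastq_stats() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }   # line 2 of each record is the sequence
         END {
             mean = (reads > 0) ? bases / reads : 0
             printf "reads=%d bases=%.0f mean_len=%.1f\n", reads, bases, mean
         }' "$1"
}
```

Calling fastq_stats mtb.fastq prints one summary line, which is an easy way to confirm that the file holds many separate reads and not assembled contigs.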
Wow! So many reads! Hmm, can we mlst this? Let's try.
mlst mtb.fastq
Wow, this already works pretty well! It has already identified Mycobacterium. Because there are so many repeats, we're seeing repeated loci profiles listed, interestingly some with different loci profiles. Alright, our next step is to assemble and see if we can make them into 1 longgggg sequence.
Assemble with Flye
# install
conda install flye
# assemble
flye --nano-hq mtb.fastq \
--threads 10 \
--out-dir mtb_assembly
That took a few minutes. Great! Let's go into our mtb_assembly folder and read assembly.fasta.
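Instead of scrolling through the file, the contigs and their lengths can be listed directly. This is a minimal sketch, assuming a plain (uncompressed) FASTA file; fasta_lengths is a hypothetical helper name, not a flye or samtools command.

```shell
# fasta_lengths FILE
# Hypothetical helper: prints "name<TAB>length" for each sequence in a FASTA file.
fasta_lengths() {
    awk '/^>/ {
             # a new header: flush the previous record, then start counting again
             if (name != "") printf "%s\t%.0f\n", name, len
             name = substr($1, 2); len = 0
             next
         }
         { len += length($0) }
         END { if (name != "") printf "%s\t%.0f\n", name, len }' "$1"
}
```

Running fasta_lengths mtb_assembly/assembly.fasta should print one line per contig, so a single output line means a single contig.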
Wow, look at that! One longgg contig! Hurray! Now let's run mlst again and see what pops up.
mlst mtb_assembly/assembly.fasta
Pretty good. Same thing! Mycobacterium tuberculosis. Let's blast it against whole genomes and see what pops up.
blastn -query mtb_assembly/assembly.fasta -db /blast_db/ref_prok_rep_genomes -outfmt 6 -num_threads 10 -out mtb_wgs.txt
library(tidyverse)
colnames <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")
df <- read_tsv("mtb_wgs.txt", col_names = colnames)
sseqid_vec <- df |>
  arrange(desc(bitscore)) |>
  head(10) |>
  pull(sseqid)
system(paste0("blastdbcmd -db blast_db/ref_prok_rep_genomes -outfmt '%a %t' -entry ",paste(sseqid_vec, collapse = ",")))
We're looking at the top 10 results by descending bitscore. Nine of the 10 are Mtb, which is great! Wait a minute, there is something called Mycobacterium canettii. What is that!? Well, at least it's in the Mtb complex! Let's dive in a bit.
seq_to_compare <- c("NC_000962.3","NC_015848.1")
df |>
  filter(sseqid %in% seq_to_compare) |>
  group_by(sseqid) |>
  mutate(total_length = sum(length)) |>
  distinct(sseqid, total_length)
Wow, looking at the total alignment length for both reference sequences, they have very wide coverage, NC_000962.3 (Mtb) being the highest, but M. canettii isn't too bad either at 5047243!
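One caveat on those totals: summing the length column adds up every hit, so regions of the assembly covered by overlapping hits are counted more than once. That is probably why a total can exceed the 4M+ bp assembly itself. A minimal sketch of an alternative, assuming the same outfmt 6 file as above, merges overlapping query intervals before summing. query_coverage is a hypothetical helper name.

```shell
# query_coverage BLAST_TSV SUBJECT_ID
# Hypothetical helper: number of query bases covered by at least one hit
# to SUBJECT_ID, using columns 2 (sseqid), 7 (qstart) and 8 (qend) of outfmt 6.
query_coverage() {
    awk -F'\t' -v s="$2" '$2 == s {
            a = $7 + 0; b = $8 + 0
            if (a > b) { t = a; a = b; b = t }   # normalise so start <= end
            print a "\t" b
        }' "$1" |
    sort -k1,1n -k2,2n |
    awk -F'\t' '
        NR == 1      { lo = $1; hi = $2; next }
        $1 <= hi + 1 { if ($2 > hi) hi = $2; next }        # overlaps or abuts: extend
                     { total += hi - lo + 1; lo = $1; hi = $2 }   # gap: close interval
        END { if (NR > 0) total += hi - lo + 1; printf "%.0f\n", total }'
}
```

For example, query_coverage mtb_wgs.txt NC_000962.3 and query_coverage mtb_wgs.txt NC_015848.1 give numbers that can be compared against the assembly length. With a single-contig assembly there is only one query, so intervals do not need to be grouped by qseqid; a multi-contig assembly would need that extra grouping.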
Alright, next is to polish! But before we can polish, we have to align & index.
Align & Index
# align and create new aligned bam
dorado aligner mtb_assembly/assembly.fasta mycobacterium_basecalled.bam | samtools sort > aligned_mycobacterium.bam
# Index the new alignment
samtools index aligned_mycobacterium.bam
This should create 2 files, aligned_mycobacterium.bam and aligned_mycobacterium.bam.bai. Alright, so far so good! This step should take less than a minute. Next, we polish!
Polish
dorado polish --device auto aligned_mycobacterium.bam mtb_assembly/assembly.fasta > polished_assembly.fasta
To my knowledge, the current polish command does not support metal as a device. Maybe that will change. Still, it wasn't bad: polishing took about 1 hour. And we ended up with 35 additional nucleotides in polished_assembly.fasta.
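For reference, the length difference can be computed as in the sketch below. fasta_bases and length_delta are hypothetical helper names. Keep in mind that a net change in length only reflects insertions minus deletions: substitutions, and insertions that are balanced by deletions elsewhere, leave the length unchanged, so this number may understate how many positions polishing actually touched.

```shell
# fasta_bases FILE
# Hypothetical helper: total number of sequence characters in a FASTA file.
fasta_bases() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# length_delta DRAFT POLISHED
# Prints polished length minus draft length, in bases.
length_delta() {
    echo $(( $(fasta_bases "$2") - $(fasta_bases "$1") ))
}
```

Usage would be length_delta mtb_assembly/assembly.fasta polished_assembly.fasta.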
Which begs the question, why do we need to polish this assembly? 🤔
Let’s see if this changes anything in terms of blastn with whole genome.
blastn -query polished_assembly.fasta -db /blast_db/ref_prok_rep_genomes -outfmt 6 -num_threads 10 -out mtb_wgs_polished.txt
colnames <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")
df <- read_tsv("mtb_wgs_polished.txt", col_names = colnames)
sseqid_vec <- df |>
  arrange(desc(bitscore)) |>
  head(10) |>
  pull(sseqid)
system(paste0("blastdbcmd -db blast_db/ref_prok_rep_genomes -outfmt '%a %t' -entry ",paste(sseqid_vec, collapse = ",")))
No change! Bitscore didn't change either (not shown). Very interesting. I'm not exactly sure when we need this post-assembly step, or what makes the result “more” accurate.
Other Methods
Epi2me has a great workflow for automated selection with additional perks as well, including using resfinder to assess antimicrobial resistance genes etc. I've tried the command line version, which needs minimal coding. It does require installation of Java, Nextflow, and Docker. If you're using a Mac and get a macOS malware alert, click on this link to find out how to disable it. There is also a GUI version of epi2me that you can check out as well, but it requires registration and login. I personally have not used that.
On a side note, epi2me uses medaka for polishing and it's extremely slow on my computer. I'm unsure why; it's probably user error. Hence my workflow of using dorado for most of the tasks. I could essentially write an R script to automate all of the above. A project for the near future!
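As a starting point for that automation, here is a rough sketch that chains the commands used in this post. It is written in shell instead of R, simply because every step is already a shell command. ont_pipeline, the DRY_RUN and DEVICE variables, and the output file names are all hypothetical choices for illustration; the tool invocations themselves are the ones shown earlier.

```shell
# ont_pipeline POD5_DIR PREFIX [THREADS]
# Hypothetical wrapper around the commands used in this post.
# DRY_RUN=1 prints each command instead of running it.
# Caveats: paths containing spaces are not handled, and in the piped
# align step only the exit status of samtools sort is checked.
ont_pipeline() {
    pod5_dir=$1
    prefix=$2
    threads=${3:-10}
    device=${DEVICE:-metal}   # the post basecalled with metal on a Mac

    step() {
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf '%s\n' "$1"
        else
            sh -c "$1"
        fi
    }

    step "dorado basecaller hac --device $device $pod5_dir > $prefix.bam" &&
    step "samtools fastq $prefix.bam > $prefix.fastq" &&
    step "flye --nano-hq $prefix.fastq --threads $threads --out-dir ${prefix}_assembly" &&
    step "dorado aligner ${prefix}_assembly/assembly.fasta $prefix.bam | samtools sort > $prefix.aligned.bam" &&
    step "samtools index $prefix.aligned.bam" &&
    step "dorado polish --device auto $prefix.aligned.bam ${prefix}_assembly/assembly.fasta > $prefix.polished.fasta"
}
```

Running DRY_RUN=1 ont_pipeline /path/to/your/pod5_files/ mtb first shows the six commands without executing anything, which is a cheap way to check paths before committing an hour to polishing.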
Opportunities for improvement
- Definitely looking forward to adding a resfinder workflow to assess AMR. Since it doesn't have complete mycobacterial AMR sequences, we may have to add another database that contains them, such as AMRfinder
- Explore prokka for annotation
- Need to learn quality control
- Need to learn what threshold is acceptable for blastn when we're using it to identify an isolate
- Need to learn when to use polish and when not to
Final Thoughts
That was fun, going from exploring the raw ONT sequence to constructing a workflow that turns raw signal into a polished assembly. Though this use case is probably not ideal for identification, because less laborious technologies such as PCR can already identify Mtb, it might be helpful for phylogenetic analysis and tree construction. It might also be helpful in assessing NTM. Since I couldn't find a raw sequence of NTM, I wasn't able to look into that… but it was fun going through a known MTB, MLST, and blast!
Lessons learnt
- learnt pod5Viewer, dorado, flye, samtools
- learnt to process raw ONT sequence to a polished assembly
- explored a bit how we could use resfinder or other methods to assess AMR. This will potentially be helpful
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on BlueSky, twitter, GitHub or Mastodon
- if you would like to collaborate, please feel free to contact me