Learning About Assemblying DNA, BLAST, and Multiple Loci Sequence Testing (MLST) - A Note To Myself

By Ken Koon Wong in r R assemblying dna blast mlst pubmlst neis serogroup neisseria meningitidis bioinformatics

September 15, 2025

We learnt how to assemble DNA fragments from a recently published N. meningitidis outbreak study using SKESA 🧬, practiced using BLAST for species identification and MLST for strain typing, and explored how serogroups are determined using specialized Python tools for capsule characterization 🔬 - such a fun and exciting hands-on introduction to genomic analysis!

Motivations

We’ve learnt a few of the basics of AMR gene detection, Phylogenetic analysis, and the fundamental of alignment for the past month or so. Lately there is a great article on Outbreak of Neisseria meningitidis Conjunctivitis in Military Trainees — Texas, February–May 2025 by SJ Ching et al, with senior author, Joseph E Marcus, where they found 41 cases of N. meningitidis conjunctivitis caused by an unencapsulated (nongroupable) strain identified by whole genome sequencing (WGS) occurred in young, healthy military trainees living in a communal setting. They also uploaded WGS fasta to NCBI and PubMLST. Let’s take a look!

Disclaimer:

I am not a bioinformatician and do not work with genes directly. These are notes to myself as I learn the basics. Please take this with a grain of salt. Verify the information presented. If you noted some error in this article, please let me know so that I can learn! Also, these codes were not run during rmarkdown knitting because of significant delay for each knitting, however, the results posted here should be reprodicuble. Please again let me know if they are not

Game Plan

This was our first learning thought process.

Inspect the sequence from NCBI
Run 16S rRNA to find the closest genus and species and see if we can identify it.
Learn to run BLAST and examine results
Learn to run MLST and examine results

And eventually, we met lots of obstacles and learnt a lot from the process. Below are the actual learning objectives.

Objectives

NCBI Sequence Read Archive
Assemble DNA Sequence
BLAST it!
What is MLST?
- Using it in Terminal
- Using it in R
Let’s Look At Encapsulation Gene & penA
Acknowledgements
Opportunity for improvement
Lessons Learnt

NCBI Sequence Read Archive

Alright, let’s inspect the sequence from NCBI

Now, let’s download the fasta. Wow, such a big file! 1.1 Gbases which translates to about 1.41 Gbytes 😵‍💫 Why is it such a large file? Let’s explore.

(nm1 <- Biostrings::readDNAStringSet("/path/to/SRR35265268.fasta"))

Wow! 3682904 in length and 301 in width !?!?! Now I know I’m new to this, but this looks very odd from the whole genome chromosomes we got from NCBI. Why is that? Let’s explore.

Looks like this was from Illumina MiSeq and their max read length is 300bp? Oh interesting!

Alright, will our usual workflow even work? Let’s see if we can extract 16s rRNA sequence from this raw sequence.

system("barrnap -threads 90 SRR35265268.fasta")

OK, that makes sense. Typical 16S rRNA sequence is around 1500bp and each of our contig is only 300bp. No wonder barrnap can’t find 5s, 16s, or 23s. Now what? Do we then need to assemble this raw sequence !?! 🤔 But, we’ve never done this before. Alright, let’s give it a few go!

Assemble DNA Sequence

What is it?

DNA sequence assembly is the computational process of reconstructing complete genomes from millions of short, overlapping DNA fragments produced by sequencing technologies. Since sequencing machines can only read small pieces of DNA at a time, assembly algorithms analyze overlaps between these fragments to piece them together like a jigsaw puzzle, ultimately creating longer contiguous sequences that represent the original DNA molecule.

How can we do this?

There are a few DNA assemblers available and we can give them a try. These are spades, megahit, skesa. After multiple trials, skesa is our winner! 🥇 We tried spades, but it seemed to take a long time. Tried megahit and couldn’t get it running. And ultimately skesa ran smoothly and fast, blazing fast within seconds.

SKESA (Strategic Kmer Extension for Scrupulous Assemblies) is a de novo genome assembler developed by NCBI.

Terminal

conda install bioconda::skesa
skesa --reads SRR35265268.fasta --contigs_out SRR35265268_assemble.fasta

Wow, that took less than a minute! Awesome, let’s inspect!

R

(nm2 <- Biostrings::readDNAStringSet("/path/to/SRR35265268_assemble.fasta"))

Look at that! From 3682904 in length to 117 and 301 in width to some with 5 digits! Great! Now here is the part that I’m not quite sure how to evaluate the quality control where this is acceptable in the genome realm. Very interested in this question. If you know the answer and have a reference, please let me know!

Extract 16s rRNA from Assembled DNA

OK, let’s assess this with barrnap and then see if we can use NCBI website to briefly give us an idea of what species and genus this is.

system("barrnap -threads 90 SRR35265268_assemble.fasta > SRR35265268_16s.gff")

(nm_16s_df <- ape::read.gff("SRR35265268_16s.gff") |>
      as_tibble() |>
      filter(str_detect(attributes, "16S"))) 

seq_nm2 <- subseq(nm2[109], start = nm_16s_df[[1, "start"]], end = nm_16s_df[[1, "end"]])
(reverseComplement(seq_nm2))

Note that the above, we used barrnap to extract 16s and output to gff file. then use ape to read gff file and we extract the actual DNA sequence from the assembled DNA. Note that because the strand is -, hence we have to reverse it. Also note that because the extracted 16s was found on contig_109 hence we used nm2[109]. Alright, let’s copy and paste this with clipr::write_clip(reverseComplement(seq_nm2) |> as.character()), where it will copy the character of the reversed sequence into system, so we don’t have to use mouse to highlight then copy. This is very helpful when we have to copy very long sequence that is already in our variable.

BLAST it!

Now that we have our 16s, let’s BLAST it!

We go to NCBI Blast, select blastn since we’re looking at nucleotides. Paste the 16s sequence on the query. Select 16S, then blast!

Side note, BLAST stands for Basic Local Alignment Search Tool. Notice that previously, we looked into Global alignment, but this is Local alignment. We’ll get into the method next time, it’s really fascinating, similar to global method but instead of starting from right lower corner, we start from the highest score and traceback!

Then wait for result, and we sort by percent identity. Yes! Neisseria meningitidis as our first! very cool!

Now Let’s Use Some Command Lines

This is BLAST manual

Terminal

# Installation
brew install blast

# Show all database available to download
update_blastdb.pl --showall

# Download database
mkdir blast_db
cd blast_db

## Download 16s ribosome db
update_blastdb.pl --decompress 16S_ribosomal_RNA

## Download partial nucleotide of prokaryote
update_blastdb.pl --decompress nt_prok 

## Download whole genome of prokaryotes
update_blastdb.pl --decompress ref_prok_rep_genomes

The above database downloads are based on your preference. Let’s get all of them and see what we get for each database.

Using BLAST in Terminal

blastn \
  -query SRR35265268_assemble.fasta \
  -db blast_db/16S_ribosomal_RNA \
  -outfmt 6 \
  -out result_16s.txt

R

library(tidyverse)

colnames <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")

df <- read_tsv("/path/to/result_16s_2.txt", col_names = colnames) |>
  arrange(desc(pident))

top10seqid <- df |> head(10) |> pull(sseqid)

system(paste0("blastdbcmd -entry \"",paste0(top10seqid,collapse=","),"\" -db /path/to/blast_db/16S_ribosomal_RNA -outfmt \"%a %t\""))

This is what df looks like:

This is what blastdbcmd retrieved: Awesome! Same result as the one we ran on NCBI website. Good that it matches. Now, let’s evaluate nucleotide and whole genome.

Nucleotide

Code

system("blastn -query /path/to/SRR35265268_assemble.fasta -db /path/to/blast_db/nt_prok -outfmt 6 -out result_nt.txt")

df <- read_tsv("/path/to/result_nt.txt", col_names = colnames) 

df_summarize <- df |>
  group_by(sseqid) |>
  summarize(n = n(),
            percent_contig = n_distinct(qseqid)/length(nm2)) |>
  arrange(desc(percent_contig)) 

top10seqid <- df_summarize |> head(10) |> pull(sseqid)

result_blastdbcmd <- system(paste0("blastdbcmd -entry \"",paste0(top10seqid,collapse=","),"\" -db /path/to/blast_db/nt_prok -outfmt \"%a %t\""), intern=T)

lookup_table <- tibble(sseqid = top10seqid, name = result_blastdbcmd) 

df_summarize |>
  left_join(lookup_table, by = "sseqid")

Wow, all Neisseria meningitidis, and whole genome too! Notice that the percent_contig coverage is also REALLY high! What this means is that, when grouped by the ssqeid the unique nucleotide from NCBI, the percentage of distinct contigs from the study sample passed the BLAST detection threshold. 1 means 100%.

Whole Genome

Code

system("blastn -query /path/to/SRR35265268_assemble.fasta -db /path/to/blast_db/ref_prok_rep_genomes -outfmt 6 -out result_wgs.txt")

df <- read_tsv("result_wgs.txt", col_names = colnames) 

df_summarize <- df |>
  group_by(sseqid) |>
  summarize(n = n(),
            percent_contig = n_distinct(qseqid)/length(nm2)) |>
  arrange(desc(percent_contig)) 

top10seqid <- df_summarize |> head(10) |> pull(sseqid)

result_blastdbcmd <- system(paste0("blastdbcmd -entry \"",paste0(top10seqid,collapse=","),"\" -db /path/to/blast_db/ref_prok_rep_genomes -outfmt \"%a %t\""), intern=T)

lookup_table <- tibble(sseqid = top10seqid, name = result_blastdbcmd) 

df_summarize |>
  left_join(lookup_table, by = "sseqid")

Very intersting, isn’t it? The only one with high percentage contig coverage >90% is N meningitidis. But other species with the same genus came up as well. Now the curios thing is, what is the difference between nt and ref_prok_rep_genomes? I was told that nt contains all nucleotide sequences in GenBank, including partial sequences, genes, and complete genomes, and ref_prok_rep_genomes is a curated subset containing only complete reference genomes for prokaryotes (one representative per species typically). But which genome will be chosen to be the representative? 🤷‍♂️

Next, let’s venture to something I’ve never heard of until this article, called MLST! Apparently this is a method to characterize isolates of bacterial species using the sequences of internal fragments of multiple (usually seven) housekeeping genes. Let’s go!

What is MLST?

MLST stands for Multi-Locus Sequence Typing. It is a method used to characterize bacterial strains based on the sequences of multiple housekeeping genes. Housekeeping genes are essential genes that are typically conserved across different strains of a species and are involved in basic cellular functions. PubMLST is a great database for MLST. This is a great technique for identifying clonal group. The 7 housekeeping genes for N meningitidis are abcZ, adk, aroE, fumC, gdh, pdhC, pgm. You can use the website and upload a fasta and it can give you more insight as to how closely related your sequence is to the database, but you can also use command line wiht bioconda::mlst.

Terminal

# installation
conda install bioconda::mlst

# run mlst
mlst SRR35265268_assemble.fasta --scheme "neisseria"

Look at that! abcZ(8) adk(10) aroE(5) fumC(~218) gdh(5) pdhC(3) pgm(8). What does the ~ mean? It means >= --minid which is set at 95 default. Now we know that from this the fumC is 1457, a novel gene! Initially the sequence typing (ST) was missing , with the help of Joseph Marcus et al, looking at search typing by profiles we’re able to see that it’s closely related to ST-32 complex which is now listed as such.

R

current_path <- Sys.getenv("PATH")
new_path <- paste("/path/to/miniforge3/bin:/path/to/miniforge3/condabin", current_path, sep=":")
Sys.setenv(PATH = new_path)
(nm_mlst <- system("mlst SRR35265268_assemble.fasta --scheme \"neisseria\"", intern = T) |> 
  str_split("\t") |> 
  unlist() |> 
  _[4:10])

Now, with the above we can probably craete some function and have it run in R and capture the output! We won’t do that right now. The above is a note for me that sometimes when we install via conda, the path may not be the same in R compared to terminal, hence it’s a good idea to go to terminal and echo $PATH, then fill in the gaps in R like the above, if R cannot recognize the application.

Alright, let’s run an isoalte from PubMLST and see if we can get the same MLST! this one

Code

(nm_mlst <- system("mlst new.fasta --scheme \"neisseria\"", intern = T) |> 
  str_split("\t") |> 
  unlist() |> 
  _[4:10])

Alright, it matches! Awesome! A note to myself, to find the typing of the isolate, click on typing a subfolder of all loci to reveal the MLST.

Let’s look at Encapsulation Gene & penA

Since the article stated that the isolates were nongroupable, without the presence of csaB, csb, csc, csw, and csy genes associated with encapsulation, and the presence of penA. Let’s see if we can replicate that result!

I tried looking for these capsule genes on NCBI but i’m having no luck. I wonder if it’s because the nomenclature used are slightly different. With that in mind, let’s look at PubMLST and choose a few invasive strains and check their capsule loci.

Alright, it looks like for serogroup B, C, W, Y, we need to get those sequences. We need to also go to a serogroup A to get the sequence that encode those. Then download these fasta and see if we can find it on the current assembled sequence.

library(RCurl)

# Serogroup 
serogroup_A <- c("NEIS2157", "NEIS2158", "NEIS2159", "NEIS2160")
serogroup_B <- c("NEIS0054", "NEIS0053", "NEIS0052", "NEIS2161", "NEIS0049", "NEIS0055")
serogroup_C <- c("NEIS0054", "NEIS0053", "NEIS0052", "NEIS0051", "NEIS0050", "NEIS0049")
serogroup_W <- c("NEIS0054", "NEIS0053", "NEIS0052", "NEIS2162", "NEIS2164", "NEIS0049")
serogroup_Y <- c("NEIS0054", "NEIS0053", "NEIS0052", "NEIS2163", "NEIS2164", "NEIS0049")
all_loci <- c(serogroup_A, serogroup_B, serogroup_C, serogroup_W, serogroup_Y)

# download fasta
download_locus <- function(locus) {
  url <- paste0("https://rest.pubmlst.org/db/pubmlst_neisseria_seqdef/loci/", locus, "/alleles_fasta")
  fasta <- getURL(url)
  writeLines(fasta, paste0(locus, ".fasta"))
  cat("Downloaded:", locus, "\n")
}

for (locus in all_loci) {
  download_locus(locus)
  Sys.sleep(0.5)
}

# Merge all files
all_fasta <- character()
for (locus in all_loci) {
  fasta_content <- readLines(paste0(locus, ".fasta"))
  all_fasta <- c(all_fasta, fasta_content)
}

writeLines(all_fasta, "capsule.fasta")

# create database
system("makeblastdb -in capsule.fasta -dbtype nucl -out /path/to/blast_db/capsule_db")

# run blast
system("blastn -query SRR35265268_assemble.fasta -db blast_db/capsule_db -fmtout 6 -o result_mlst.txt")

Very intersting. It looks like NEIS0049, NEIS0050, NEIS0053 were detected, matches 3 of 6 loci of serotype C. But we know that this is a non-serotypable. Not sure what the threshold is to be considered as a serotype? How were they were able to classify this as capsule null locus (cnl). Wait a minute… why do I see this so common?

Ohhhhhh… this is a python package created by Nadav Topaz where we can predict the serogroup!!! Awesome!!! And I thought I had to hand curate all these fasta and write condition and threshold to predict the serogroup. Awesome!

Terminal

git clone https://github.com/ntopaz/characterize_neisseria_capsule.git
cd characterize_neisseria_capsule
python build_neisseria_dbs.py

# transfer the fasta file to a new fasta directory in characterize_neisseria_capsule
python characterize_neisseria_capsule.py -d fasta -o results
cat results/serogroup/serogroup_results.json

Alright! NG non-groupable, and cnl! Looking at the python script, it looks like the coverage needs to be at least 30%, otherwise it’s considered as false positive. What’s converage? It’s length of nucleotide matched divided by the length of the reference sequence (e.g., If sample gene is ACTACA, reference gene is ACTCAGGGAAAC, that would be 5/12). And also all the essential genes (of the serogroup) need to be detected in order to be classified as that serogroup i think. Let’s pick another invasive isolate and see if that’s true.

Let’s take a look at this that has serogroup Y.

Alright! All 6 essential genes are present!

Let’s look at another one on PubMLST, this, go all the way down and click on capsule.

Wait a minute, why does this only have 5 of 6 detected and still classified as W !?!?! 🤔 Let’s download fasta and inspect locally.

It is true that NEIS2164 which is cssF is not detected, but this package labeled it as non-essential. Maybe classification has changed !? 🤷‍♂️ If you know the answer, please let me know!

Let’s look at a few other ones chosen! Let’s choose all serogroups and including cnl.

This is really amazing! Also interesting that this package makes it quite strict where if there is stop codon in the sequence, it will list it as NG.

Next, let’s look at penA. It’s interesting that it was challenging for me to search for these genes or common namees on NCBI. Eventually I still have to go back to PubMLST. When we look at their antibiotic resistance scheme for N meningitidis, we see these.

Loci	Common Name	Product	Comment
gyrA			Target gene for cipro susceptiblity
NEIS0123		DNA-directed RNA polymerase subunit a
NEIS0414	ponA	Penicillin binding protein (PBP) 1
NEIS1320		DNA gyrase subunit A
NEIS1525	parC	DNA topoisomerase IV subunit A
NEIS1600	parC	DNA topoisomerase IV subunit B
NEIS1609	folP	dihydropteroate synthetase	?sulfonamide MIC
NEIS1635	mtrR	transcriptional regulator
NEIS1753	penA	PBP2
NEIS3240	blaROB-1	Beta-lactamase
penA
rpoB			?rifampin

Alright, let’s inspect penA and NEIS1753 locally.

Code

download_locus("penA")
download_locus("NEIS1753")

# create blast database with your reference fasta
system("makeblastdb -in \"penA.fasta NEIS1753.fasta\" -dbtype nucl -out /path/to/blast_db/penA")

# blast
system("blastn -query /path/to/SRR35265268_assemble.fasta -db /path/to/blast_db/penA -outfmt 6 -out pena_result.txt") 
(result_pena <- read_tsv("pena_result.txt", col_names = colnames))

wow, 100% match with NEIS1753 1494!!! 🙌

Looking at the study’s isolate, it also seems like it contains other antimicrobial resistance genes.

Acknowledgement and Final Thought

I’dl like to thank both Joseph Marcus and Jonathan Ryder in humoring my excitement when the fasta was available for downloading! Learnt a lot from this hands-on assessment. There you have it! We assembled DNA, learnt BLAST & MLST, inspected quite a few serogroup N meningitidis. Have a better understanding on how all these serogroup gets typed! ❤️

Opportunity for improvement

what is a proper/accepted thresholds of GC content, width of contig, etc for quality control?
probably a good idea to create a searchable/interactive table of NEIS**** loci, common names, aliases, so it’s easier to search! I had a hard to finding these nomenclature.
there are R packages for mlst, can probably use that out of the box instead of using system. but good to use the original tools
learn the common thresholds for each species/bacteria
learn and document BLAST algorithms
learn kraken2
learn SRST2
learn other packages for assemblying DNA

Lessons Learnt

clipr::write_clip() is so helpful in copying whatever value in R to system. For example seq1 <- "your long contig you want to copy where hightlighting the entire contigs is a nightmare" then clipr::write_clip(seq1). Now you can paste!
learnt about _[1]
make sure path of terminal is the same as path in R
learnt skesa, blastn, makeblastdb, blastdbcmd
learnt to navigate PubMLST
learnt NEIS*** loci and its common names

If you like this article:

please feel free to send me a comment or visit my other blogs
please feel free to follow me on BlueSky, twitter, GitHub or Mastodon
if you would like collaborate please feel free to contact me

Posted on:: September 15, 2025

Length:: 14 minute read, 2936 words

Categories:: r R assemblying dna blast mlst pubmlst neis serogroup neisseria meningitidis bioinformatics

Tags:: r R assemblying dna blast mlst pubmlst neis serogroup neisseria meningitidis bioinformatics

See Also:: Bias, Variance, and Doubly Robust Estimation: Testing The Promise of TMLE in Simulated Data; My Attempt To Reproduce Stanford HIVdb Sequence and Mutation Analysis From Scratch; Learning And Exploring The Workflow of RNA-Seq Analysis - A Note To Myself