My Attempt To Reproduce Stanford HIVdb Sequence and Mutation Analysis From Scratch
By Ken Koon Wong in r R stanford hivdb hiv genotype resistance mutation bioconductor DECIPHER sierrapy sierra
October 10, 2025
Ever wondered what M184V, K65R actually mean? I learnt it from rebuilding Stanford’s HIV resistance algorithm from scratch to find out. Spoiler: it took tons of code to match their 3-line tool. But the lesson was worth it
Motivations:
We’ve all learnt and memorize what those letters and numbers mean when it comes to antiretroviral resistance. Since we’ve been exploring genomics lately, let’s take another look at HIV genotype. Stanford University HIVdb is an amazing resource! I’ve always been confused with all these letters and have difficulty understanding how to even check for genotype resistance because all these numbers and letters are quite confusing and intimidating. Let’s put on our bioconductor hat and revisit this topic and see if we can at least get a better understanding on what these letters and numbers mean. Better yet, use this opportunity to try to reproduce the algorithm that tells us the susceptibility of the ART! Hang tight on this one, it’s going to be a bumpy road!
Objectives
- Find HIV reference gene
- Find Resistance
- Workflow
- Final Thoughts
- Opportunities for improvement
- Lessons Learnt
Find HIV Reference Gene
From most of the searching, the reference gene appears to be HXB2. We know that pol gene encodes protease (PR), reverse transcriptase (RT) and integrase (IN). Below is directly from NCBI. source. You can click on it and hover over those regions to see what they are.
Take note that all these genes reside in specific position. So the trick is to use a reference, in our case HXB2, locate these 3 genes (RT, PR, IN), then extract them and make them into a database. Some reference is pretty good at letting you know where these positions are, some don’t! As for our reference, it didn’t really say either! We’ll have to look around reference and see if we can get those position.
library(Biostrings)
library(DECIPHER)
library(tidyverse)
## We downloaded a bunch of HIV genomes, including the reference known as K03455 (aka HXB2)
(all_hiv_genome <- readDNAStringSet("all_hiv_whole_genome.fasta"))
Wow, look at that. HIV genome is only around 9000bp! Whereas bacteria such as Ecoli it was about 4 Mb.
## load reference
hxb2_idx <- grep("K03455", names(all_hiv_genome), ignore.case = TRUE)
hxb2_genome <- all_hiv_genome[hxb2_idx]
## locate the genes
rt_sequence <- subseq(hxb2_genome, start = 2550, end = 4229)
pi_sequence <- subseq(hxb2_genome, start = 2253, end = 2549)
int_sequence <- subseq(hxb2_genome, start = 4230, end = 5093)
# Rename sequences with informative names
names(pi_sequence) <- "HIV_PR"
names(rt_sequence) <- "HIV_RT"
names(int_sequence) <- "HIV_INT"
# Combine all three into one DNAStringSet
pol_regions <- c(pi_sequence, rt_sequence, int_sequence)
# Write to a single FASTA file
writeXStringSet(pol_regions, "hiv_pol_regions.fasta")
# make it into blast database
system("makeblastdb -in hiv_pol_regions.fasta -dbtype nucl -out /path/to/hiv/hiv_pol_db -parse_seqids")
With the above we’re basically setting up a databse for the reference pol gene of PR, RT, and IN. We will then find a genome of interest and use blast to identify where these genes are located on the sample sequence like so.
## use blast to find where these genes positions are
system('blastn -query all_hiv_whole_genome.fasta -db /path/to/hiv/hiv_pol_db -word_size 7 -evalue 0.0000001 -outfmt 6 -out hiv_sample_blast_results.txt')
## read it
colnames <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")
(all_hiv_sample <- read_tsv("hiv_sample_blast_results.txt", col_names = colnames))
Notice sseqid
section, we have HIV_RT
, HIV_INT
, HIV_PR
. We just need to filter these genes, extract the genes via their positions qstart
and qend
. Let’s take a look at U63632.1
sample1 <- all_hiv_sample |>
filter(str_detect(qseqid, "U63632.1")) |>
filter(sseqid == "HIV_RT") |>
select(qseqid, qstart, qend)
OK, how do we go from here to knowing what’s mutated?
Find The Resistance
All those letters, M184V
, K65R
etc, the letters and numbers must mean something right? You’re right! The numbers are position, but what about the letters? They don’t really look like ATCG, do they? They’re amino acids! For example, the M
on M184
stands for Methionine, the V
stands for Valine. So M184V
means at position 184, Methionine has mutated to Valine. Similarly, K65R
means at position 65, Lysine has mutated to Arginine.
Amino acids and their letters:
A
: Alanine, C
: Cysteine, D
: Aspartic Acid, E
: Glutamic Acid, F
: Phenylalanine, G
: Glycine, H
: Histidine, I
: Isoleucine, K
: Lysine, L
: Leucine, M
: Methionine, N
: Asparagine, P
: Proline, Q
: Glutamine, R
: Arginine, S
: Serine, T
: Threonine, V
: Valine, W
: Tryptophan, Y
: Tyrosine
his also means that our translated RT reference HIV gene HXB2 at position 184, we should expect the amino acid to be M. Let’s take a look
(rt_t <- translate(rt_sequence))
Wow! We’re used to seeing DNA sequence colors but now take a look at the different amino acid colors! How pretty! Now let’s look at position 184.
subseq(rt_t, 184,184)
Alright! M
indeed! Now, let’s go through the workflow of using this information to run through our sample
Workflow
TL;DR Extract region -> Align Translation of AA with Reference & Assess Mutation -> Calculate Mutation Score
Identify/Extract Regions
sample1 <- all_hiv_sample |>
filter(str_detect(qseqid, "U63632.1")) |>
filter(sseqid == "HIV_RT") |>
select(qseqid, qstart, qend)
sample_rt <- subseq(
all_hiv_genome[str_detect(names(all_hiv_genome), sample1$qseqid)],
start = sample1$qstart,
end = sample1$qend
)
We’ve done this process above, but for other genes, we’d need to write a function to change HIV_RT
to the others.
Translate
Just for demonstration purposes:
#### Take note, this entire code chunk is not needed !!!
seq_length <- sample1$qend - sample1$qstart + 1
adjusted_length <- seq_length - (seq_length %% 3)
sample_rt <- subseq(
all_hiv_genome[str_detect(names(all_hiv_genome), sample1$qseqid)],
start = sample1$qstart,
end = sample1$qstart + adjusted_length - 1
)
(sample_rt_t <- translate(sample_rt, if.fuzzy.codon = "solve"))
The reason we had to adjust length, sometimes, is that the sequence may not be perfectly divisible by 3. Since codons are groups of 3 nucleotides, any extra nucleotides that don’t form a complete codon would lead to an incomplete translation. By adjusting the length to be divisible by 3, we ensure that the translation process can proceed without any issues. During translation process, we could also use
solve
to handle any fuzzy codons that may arise due to sequencing errors or ambiguities in the nucleotide sequence. This ensures that the translation process can still proceed even if there are some uncertainties in the input sequence.
But in reality, we can align them directly from our DNA sequence with the help of DECIPHER, it will align the translated AA!
Align Translation & Assess Mutation
# this automatically translate aligned seq into aligned AA, sweet !!!
alignseq <- AlignTranslation(c(rt_sequence,sample_rt), type="AAStringSet")
# turn alignment into matrix
align_matrix <- as.matrix(alignseq)
# extract alignment on both ref and sample
ref_seq <- align_matrix[1,]
sample_seq <- align_matrix[2,]
# find position where there is mutation
mutation_positions <- which(ref_seq != sample_seq & ref_seq != "-" & sample_seq != "-")
# load into dataframe
mutations <- data.frame(
position = mutation_positions,
reference = ref_seq[mutation_positions],
sample = sample_seq[mutation_positions],
mutation = paste0(ref_seq[mutation_positions], mutation_positions, sample_seq[mutation_positions])
) |>
mutate(position_replace = paste0(position,sample))
mutations$mutation
Lots of mutations. HIV viruses are notorious for SNPs which may or may not be clinically significant. And with just a single nucleotide mutation, as you can see the translated amino acid could be different from reference. OK, since we know how to change for mutations, how do we know if this is the same as Stanford HIVdb? Let’s copy and paste the entire genome of this sample and paste it here
## easy way of copying to system so we can paste on Stanford website (link above)
all_hiv_genome[str_detect(names(all_hiv_genome),"U63632.1")] |> as.character() |> clipr::write_clip()
Not too shabby! The one mutation that is significant is M348I
. We were able to capture that, not too shabby! Now, how do we go from here to the inferred susceptibility of NRTIs and NNRTIs? In come the mutation score algorithm!
Calculate Mutation Score
This part is very interesting. My initial trial was using their [genotype-phenotype DRMcv model]((
https://hivdb.stanford.edu/download/GenoPhenoDatasets/DRMcv.R) but I couldn’t reproduce most of the results. And realized that
Stanford HIVdb Sequance Analysis uses
another algorithm. And this algorithm, at least for the final product, is more straight forward than trying to use a model to reproduce a prediction, which was what the DRMcv was using glmnet
and lasso
. There is also
extensive documentation. All of the csv
files below were obtained from the links provided from the documentation. Let’s code!
code
load_hivdb <- function(dataset, mutations){
if (dataset=="NRTI") {
mut_score <- read_csv("hivdb_nrti_single.csv")
mut_score_combo <- read_csv("hivdb_nrti_combo.csv")
n <- 2
m <- 4
}
if (dataset=="NNRTI") {
mut_score <- read_csv("hivdb_nnrti_single.csv")
mut_score_combo <- read_csv("hivdb_nnrti_combo.csv")
n <- 2
m <- 2
}
if (dataset=="PI") {
mut_score <- read_csv("hivdb_pi_single.csv")
mut_score_combo <- read_csv("hivdb_pi_combo.csv")
n <- 2
m <- 3
}
if (dataset=="INSTI") {
mut_score <- read_csv("hivdb_insti_single.csv")
mut_score_combo <- read_csv("hivdb_insti_combo.csv")
n <- 2
m <- 3
}
mut_interest <- c()
for (mut_i in mut_score |> pull(Rule)) {
add_mut <- F
add_mut <- mut_i %in% (mutations |> pull(mutation))
if (add_mut) {
mut_interest <- c(mut_interest, mut_i)
}
}
if (length(mut_interest)==0) {
return(mut_score |>
slice_sample(n = 0) |> # Get 0 rows but keep structure
select(-1) |>
summarise(across(everything(), ~0)) |>
pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
mutate(levels = case_when(
score <= 9 ~ 1,
score >= 10 & score <= 14 ~ 2,
score >= 15 & score <= 29 ~ 3,
score >= 30 & score <= 59 ~ 4,
score >= 60 ~ 5
)) |>
mutate(interpretation = case_when(
levels == 1 ~ "susceptible",
levels == 2 ~ "potential low-level resistance",
levels == 3 ~ "low-level resistance",
levels == 4 ~ "intermediate resistance",
levels == 5 ~ "high-level resistance"
)))
}
mut_combo_vec <- mut_score_combo |>
pull(combination_rule) |>
paste(collapse = " + ") |>
str_split(" \\+ ") |>
unlist() |>
unique()
mut_combo_idx <- mut_combo_vec %in% (mutations |> pull(mutation))
# Function to sort mutations by position number
sort_mutations <- function(mutation_string) {
# Split the string by " + "
positions <- str_extract(mutation_string, "\\d+") |> as.numeric()
# Sort mutations by their positions
sorted_mutations <- mutation_string[order(positions)]
return(sorted_mutations)
}
mut_combo_seq <- sort_mutations(mut_combo_vec[mut_combo_idx])
# Function to create combinations of a given size
create_combinations <- function(mutations_, size) {
combn(mutations_, size, FUN = function(x) paste(x, collapse = " + "), simplify = TRUE)
}
if (length(mut_combo_seq)==1) {
mut_score_sum <- mut_score |>
filter(Rule %in% mut_interest) |>
select(-1) |>
summarize(across(everything(), sum)) |>
pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
mutate(levels = case_when(
score <= 9 ~ 1,
score >= 10 & score <= 14 ~ 2,
score >= 15 & score <= 29 ~ 3,
score >= 30 & score <= 59 ~ 4,
score >= 60 ~ 5
)) |>
mutate(interpretation = case_when(
levels == 1 ~ "susceptible",
levels == 2 ~ "potential low-level resistance",
levels == 3 ~ "low-level resistance",
levels == 4 ~ "intermediate resistance",
levels == 5 ~ "high-level resistance"
))
} else {
# Generate combinations of size 2, 3, and 4
all_combinations <- map(n:m, ~create_combinations(mut_combo_seq, .x)) |>
unlist()
# filter from mut_score_combo
mut_score_combo_df <- mut_score_combo |>
filter(combination_rule %in% all_combinations) |>
rename(Rule = combination_rule)
# sum single + combo
mut_score_sum <- mut_score |>
filter(Rule %in% mut_interest) |>
bind_rows(mut_score_combo_df) |>
select(-1) |>
summarize(across(everything(), sum)) |>
pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
mutate(levels = case_when(
score <= 9 ~ 1,
score >= 10 & score <= 14 ~ 2,
score >= 15 & score <= 29 ~ 3,
score >= 30 & score <= 59 ~ 4,
score >= 60 ~ 5
)) |>
mutate(interpretation = case_when(
levels == 1 ~ "susceptible",
levels == 2 ~ "potential low-level resistance",
levels == 3 ~ "low-level resistance",
levels == 4 ~ "intermediate resistance",
levels == 5 ~ "high-level resistance"
))
}
return(mut_score_sum)
}
(nrti <- load_hivdb("NRTI", mutations))
(nnrti <- load_hivdb("NNRTI", mutations))
Wow, long code but we were able to reproduce the same score as Stanford HIVdb! Not too shabby. Take note that if certain mutations exist together, there is another table that shows additional penalty score! For example, if our sequence contain M41L + T215FY
mutation, the total mutation score for TDF is 5+10+10=25
, it’s not just 5+10
.
see here for full table.
Now, we just need to repeat this for PR and IN, write it in a pipeline and there you have it! Let’s look at their official web service via sierrapy
and then see if we can reproduce with our algorithm!
Stanford Sierrapy
Let’s explore a much simpler and expert-driven approach. Let’s try Stanford HIVdb’s SierraPy, their python equivalent. This uses their web service, hence will need internet and also need to send sequence as well. Very fast.
# install
pip install sierrapy
# write fasta of sample of interest, let's take a look at KJ849778.1
all_hiv_genome[all_hiv_genome |> names() |> str_detect("KJ849778.1")] |> writeXStringSet("hiv_test.fasta")
# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")
# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
Alright. 🤔 with 3 lines of code after you have an assembled HIV whole genome, Sierra web service via
Sierrapy
will provide you the susceptibility interpretation within seconds! The json also provides you with the alignments as well! You do need internet for this thought, if you’re looking for local interpretation then can look into
Sierra Local
. There are lots of ways to get to this without having the have heavy codes like we did! lol, but it’s great to know how it works under the hood!
Now, since we know what Stanford HIVdb shows, let’s see if our algorithm will return the same result!
code
extract_align <- function(sample,class) {
if (!exists("hxb2_genome")) { stop("you need load/enter hxb2_genome") }
sample1 <- all_hiv_sample |>
filter(str_detect(qseqid, sample)) |>
filter(sseqid == class) |>
select(qseqid, qstart, qend)
sample_rt <- subseq(
all_hiv_genome[str_detect(names(all_hiv_genome), sample1$qseqid)],
start = sample1$qstart,
end = sample1$qend
)
## locate the genes
if (class == "HIV_RT") { rt_sequence <- subseq(hxb2_genome, start = 2550, end = 4229) }
if (class == "HIV_PR") { rt_sequence <- subseq(hxb2_genome, start = 2253, end = 2549) }
if (class == "HIV_INT") { rt_sequence <- subseq(hxb2_genome, start = 4230, end = 5093) }
# this automatically translate aligned seq into aligned AA, sweet !!!
# alignseq <- AlignTranslation(c(rt_sequence,sample_rt), type="AAStringSet", verbose = F)
alignseq_nt <- AlignSeqs(c(rt_sequence,sample_rt),verbose=F)
ref_pos <- c()
align_ref_seq <- as.matrix(alignseq_nt) |> _[1,]
for (i in 1:length(align_ref_seq)) {
if (align_ref_seq[i] %in% LETTERS) { ref_pos <- c(ref_pos, i) }
}
align_ref_seq2 <- align_ref_seq[ref_pos] |> paste(collapse="") |> DNAStringSet()
align_sample_seq <- as.matrix(alignseq_nt) |> _[2, ref_pos] |> str_replace("-","G") |> paste(collapse="") |> DNAStringSet()
alignseq <- AlignTranslation(c(align_ref_seq2,align_sample_seq), verbose=F, type = "AAStringSet")
# which(align_ref_seq[ref_pos] != align_sample_seq)
# turn alignment into matrix
align_matrix <- as.matrix(alignseq)
# extract alignment on both ref and sample
for (i in 1:length(align_matrix[1,])) {
if (align_matrix[1,][i] %in% LETTERS) {
start_seq <- i
break
}
}
ref_seq <- align_matrix[1,start_seq:length(align_matrix[1,])]
sample_seq <- align_matrix[2,start_seq:length(align_matrix[1,])]
# find position where there is mutation
mutation_positions <- which(ref_seq != sample_seq & ref_seq != "-" & sample_seq != "-")
# load into dataframe
mutations <- data.frame(
position = mutation_positions,
reference = ref_seq[mutation_positions],
sample = sample_seq[mutation_positions],
mutation = paste0(ref_seq[mutation_positions], mutation_positions, sample_seq[mutation_positions])
) |>
mutate(position_replace = paste0(position,sample))
return(mutations)
}
load_hivdb <- function(dataset, mutations){
if (dataset=="NRTI") {
mut_score <- read_csv("hivdb_nrti_single.csv")
mut_score_combo <- read_csv("hivdb_nrti_combo.csv")
n <- 2
m <- 4
}
if (dataset=="NNRTI") {
mut_score <- read_csv("hivdb_nnrti_single.csv",show_col_types = FALSE)
mut_score_combo <- read_csv("hivdb_nnrti_combo.csv",show_col_types = FALSE)
n <- 2
m <- 2
}
if (dataset=="PI") {
mut_score <- read_csv("hivdb_pi_single.csv",show_col_types = FALSE)
mut_score_combo <- read_csv("hivdb_pi_combo.csv",show_col_types = FALSE)
n <- 2
m <- 3
}
if (dataset=="INSTI") {
mut_score <- read_csv("hivdb_insti_single.csv",show_col_types = FALSE)
mut_score_combo <- read_csv("hivdb_insti_combo.csv",show_col_types = FALSE)
n <- 2
m <- 3
}
mut_interest <- c()
for (mut_i in mut_score |> pull(Rule)) {
add_mut <- F
add_mut <- mut_i %in% (mutations |> pull(mutation))
if (add_mut) {
mut_interest <- c(mut_interest, mut_i)
}
}
if (length(mut_interest)==0) {
return(mut_score |>
slice_sample(n = 0) |> # Get 0 rows but keep structure
select(-1) |>
summarise(across(everything(), ~0)) |>
pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
mutate(levels = case_when(
score <= 9 ~ 1,
score >= 10 & score <= 14 ~ 2,
score >= 15 & score <= 29 ~ 3,
score >= 30 & score <= 59 ~ 4,
score >= 60 ~ 5
)) |>
mutate(interpretation = case_when(
levels == 1 ~ "susceptible",
levels == 2 ~ "potential low-level resistance",
levels == 3 ~ "low-level resistance",
levels == 4 ~ "intermediate resistance",
levels == 5 ~ "high-level resistance"
)))
}
mut_combo_vec <- mut_score_combo |>
pull(combination_rule) |>
paste(collapse = " + ") |>
str_split(" \\+ ") |>
unlist() |>
unique()
mut_combo_idx <- mut_combo_vec %in% (mutations |> pull(mutation))
# Function to sort mutations by position number
sort_mutations <- function(mutation_string) {
# Split the string by " + "
positions <- str_extract(mutation_string, "\\d+") |> as.numeric()
# Sort mutations by their positions
sorted_mutations <- mutation_string[order(positions)]
return(sorted_mutations)
}
mut_combo_seq <- sort_mutations(mut_combo_vec[mut_combo_idx])
# Function to create combinations of a given size
create_combinations <- function(mutations_, size) {
combn(mutations_, size, FUN = function(x) paste(x, collapse = " + "), simplify = TRUE)
}
if (length(mut_combo_seq)==1) {
mut_score_sum <- mut_score |>
filter(Rule %in% mut_interest) |>
select(-1) |>
summarize(across(everything(), sum)) |>
pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
mutate(ART = case_when(
ART == "ATV_r" ~ "ATV/r",
ART == "DRV_r" ~ "DRV/r",
ART == "LPV_r" ~ "LPV/r",
TRUE ~ ART
)) |>
mutate(levels = case_when(
score <= 9 ~ 1,
score >= 10 & score <= 14 ~ 2,
score >= 15 & score <= 29 ~ 3,
score >= 30 & score <= 59 ~ 4,
score >= 60 ~ 5
)) |>
mutate(interpretation = case_when(
levels == 1 ~ "susceptible",
levels == 2 ~ "potential low-level resistance",
levels == 3 ~ "low-level resistance",
levels == 4 ~ "intermediate resistance",
levels == 5 ~ "high-level resistance"
))
} else {
# Generate combinations of size 2, 3, and 4
all_combinations <- map(n:m, ~create_combinations(mut_combo_seq, .x)) |>
unlist()
# filter from mut_score_combo
mut_score_combo_df <- mut_score_combo |>
filter(combination_rule %in% all_combinations) |>
rename(Rule = combination_rule)
# sum single + combo
mut_score_sum <- mut_score |>
filter(Rule %in% mut_interest) |>
bind_rows(mut_score_combo_df) |>
select(-1) |>
summarize(across(everything(), sum)) |>
pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
mutate(levels = case_when(
score <= 9 ~ 1,
score >= 10 & score <= 14 ~ 2,
score >= 15 & score <= 29 ~ 3,
score >= 30 & score <= 59 ~ 4,
score >= 60 ~ 5
)) |>
mutate(interpretation = case_when(
levels == 1 ~ "susceptible",
levels == 2 ~ "potential low-level resistance",
levels == 3 ~ "low-level resistance",
levels == 4 ~ "intermediate resistance",
levels == 5 ~ "high-level resistance"
))
}
return(mut_score_sum)
}
hiv_genotype <- function(sample="KJ849778.1") {
class_group <- c("HIV_RT","HIV_PR","HIV_INT")
for (class in class_group) {
mutations <- extract_align(sample, class)
if (class == "HIV_RT") {
print(load_hivdb("NRTI",mutations))
print(load_hivdb("NNRTI",mutations))
}
if (class == "HIV_PR") {
print(load_hivdb("PI",mutations))
}
if (class == "HIV_INT") {
print(load_hivdb("INSTI",mutations))
}
}
}
WHAT !?!?! My algorithm failed 💔 !!! ❌❌❌ Noooo…!!! Ths iis very odd. Let’s inspect!
Let’s pick RT and inspect.
sample <- "KJ849778.1"
mutations <- extract_align(class = "HIV_RT", sample = sample)
mutations |> pull(mutation)
Let’s look at Stanford’s
Do you see what i’m seeing? Our aligned AA has a lot of
X
’s and these X’s coincide with Stanford’s. For example M184X
on ours is Stanford’s M184MI
. Our G190X
is their G190EKR
. 🤔 And some of the mutations are the same, so it’s not a frame issue. Oh wait !!! All X
’s because we cannot assess what exactly the amino acid is, we can’t tell if there is mutation at all, hence we assume it could be any !?! That total makes sense! That means we’ll have to incorporate this into our algorithm! And also, not shown here, if there is missing, algorithm should also choose the max mutation score as penalty. Now let’s implement that!
code
load_hivdb <- function(dataset, mutations){
if (dataset=="NRTI") {
mut_score <- read_csv("hivdb_nrti_single.csv")
mut_score_combo <- read_csv("hivdb_nrti_combo.csv")
n <- 2
m <- 4
}
if (dataset=="NNRTI") {
mut_score <- read_csv("hivdb_nnrti_single.csv",show_col_types = FALSE)
mut_score_combo <- read_csv("hivdb_nnrti_combo.csv",show_col_types = FALSE)
n <- 2
m <- 2
}
if (dataset=="PI") {
mut_score <- read_csv("hivdb_pi_single.csv",show_col_types = FALSE)
mut_score_combo <- read_csv("hivdb_pi_combo.csv",show_col_types = FALSE) |>
rename(`ATV/r`=ATV_r,`DRV/r`=DRV_r,`LPV/r`=LPV_r)
n <- 2
m <- 3
}
if (dataset=="INSTI") {
mut_score <- read_csv("hivdb_insti_single.csv",show_col_types = FALSE)
l74 <- mut_score |>
filter(str_detect(Rule,"L74")) |>
select(-Rule)
l74_2 <- tibble(Rule = c("L74M","L74F")) |>
bind_cols(l74)
mut_score <- mut_score |>
filter(!str_detect(Rule, "L74")) |>
bind_rows(l74_2)
mut_score_combo <- read_csv("hivdb_insti_combo.csv",show_col_types = FALSE)
n <- 2
m <- 3
}
mut_interest <- c()
# check if there is X
if (sum(mutations |> pull(mutation) |> str_detect("X$")) > 0) {
mutations_vec <- str_replace(mutations |> pull(mutation), pattern = "X", replacement = "")
} else {
mutations_vec <- mutations |> pull(mutation)}
for (mut_i in mut_score |> pull(Rule)) {
add_mut <- F
add_mut <- str_detect(mut_i, paste(mutations_vec,collapse="|"))
if (add_mut) {
mut_interest <- c(mut_interest, mut_i)
}
}
if (length(mut_interest)==0) {
return(mut_score |>
slice_sample(n = 0) |> # Get 0 rows but keep structure
select(-1) |>
summarise(across(everything(), ~0)) |>
pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
mutate(ART = case_when(
ART == "ATV_r" ~ "ATV/r",
ART == "DRV_r" ~ "DRV/r",
ART == "LPV_r" ~ "LPV/r",
TRUE ~ ART
)) |>
mutate(levels = case_when(
score <= 9 ~ 1,
score >= 10 & score <= 14 ~ 2,
score >= 15 & score <= 29 ~ 3,
score >= 30 & score <= 59 ~ 4,
score >= 60 ~ 5
)) |>
mutate(interpretation = case_when(
levels == 1 ~ "susceptible",
levels == 2 ~ "potential low-level resistance",
levels == 3 ~ "low-level resistance",
levels == 4 ~ "intermediate resistance",
levels == 5 ~ "high-level resistance"
)))
}
mut_combo_vec <- mut_score_combo |>
pull(combination_rule) |>
paste(collapse = " + ") |>
str_split(" \\+ ") |>
unlist() |>
unique()
mut_combo_idx <- mut_combo_vec %in% (mutations |> pull(mutation))
# Function to sort mutations by position number
sort_mutations <- function(mutation_string) {
# Split the string by " + "
positions <- str_extract(mutation_string, "\\d+") |> as.numeric()
# Sort mutations by their positions
sorted_mutations <- mutation_string[order(positions)]
return(sorted_mutations)
}
mut_combo_seq <- sort_mutations(mut_combo_vec[mut_combo_idx])
# Function to create combinations of a given size
create_combinations <- function(mutations_, size) {
combn(mutations_, size, FUN = function(x) paste(x, collapse = " + "), simplify = TRUE)
}
#sum(mutations |> pull(mutation) |> str_detect("X$")) > 0
if (length(mut_combo_seq)==1) {
mut_score_sum <- mut_score |>
filter(Rule %in% mut_interest) |>
mutate(Rule_x = str_extract(Rule, "^[A-Z]{1}[0-9]+")) |>
# distinct(Rule_x, .keep_all = T) |>
group_by(Rule_x) |>
summarize(across(everything(),max)) |>
select(-Rule,-Rule_x) |>
summarize(across(everything(), sum)) |>
pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
mutate(ART = case_when(
ART == "ATV_r" ~ "ATV/r",
ART == "DRV_r" ~ "DRV/r",
ART == "LPV_r" ~ "LPV/r",
TRUE ~ ART
)) |>
mutate(levels = case_when(
score <= 9 ~ 1,
score >= 10 & score <= 14 ~ 2,
score >= 15 & score <= 29 ~ 3,
score >= 30 & score <= 59 ~ 4,
score >= 60 ~ 5
)) |>
mutate(interpretation = case_when(
levels == 1 ~ "susceptible",
levels == 2 ~ "potential low-level resistance",
levels == 3 ~ "low-level resistance",
levels == 4 ~ "intermediate resistance",
levels == 5 ~ "high-level resistance"
))
} else {
# Generate combinations of size 2, 3, and 4
if (m > length(mut_combo_seq)) { m <- length(mut_combo_seq) }
if (n > length(mut_combo_seq)) { n <- length(mut_combo_seq) }
all_combinations <- map(n:m, ~create_combinations(mut_combo_seq, .x)) |>
unlist()
# filter from mut_score_combo
mut_score_combo_df <- mut_score_combo |>
filter(combination_rule %in% all_combinations) |>
rename(Rule = combination_rule)
# sum single + combo
mut_score_sum <- mut_score |>
filter(Rule %in% mut_interest) |>
bind_rows(mut_score_combo_df) |>
select(-1) |>
summarize(across(everything(), sum)) |>
pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
mutate(ART = case_when(
ART == "ATV_r" ~ "ATV/r",
ART == "DRV_r" ~ "DRV/r",
ART == "LPV_r" ~ "LPV/r",
TRUE ~ ART
)) |>
mutate(levels = case_when(
score <= 9 ~ 1,
score >= 10 & score <= 14 ~ 2,
score >= 15 & score <= 29 ~ 3,
score >= 30 & score <= 59 ~ 4,
score >= 60 ~ 5
)) |>
mutate(interpretation = case_when(
levels == 1 ~ "susceptible",
levels == 2 ~ "potential low-level resistance",
levels == 3 ~ "low-level resistance",
levels == 4 ~ "intermediate resistance",
levels == 5 ~ "high-level resistance"
))
}
return(mut_score_sum)
}
hiv_genotype()
Not too shabby! Now our NNRTI and PI look exactly the same! But our INSTI is not great! Mainly because there are a quite a few deletions that our alignment and their alignments don’t agree.
Let’s try another genome and see if our algorithm agrees with Stanford’s
Trial 1 ✅
Our code
sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)
Stanford’s
all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")
# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")
# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
Alright! Not too shabby! That was
AY900571.2
accession.
Trial 2 ✅
Let’s try another one.
Our code
code
sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)
Stanford’s
code
all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")
# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")
# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
Awesome! No resistance at all! That was FM877777.1
. Let’s give another a go!
Trial 3 ✅
Our code
code
sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)
Stanford’s
code
all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")
# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")
# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
There you go! Our algorithm works! That was EU448295.1
. Let’s do another one!
Trial 4 ✅
Our code
code
sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)
Stanford’s
code
all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")
# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")
# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
There you go! Our algorithm works again! That was EU735539.1
. Let’s do our last one.
Trial 5 ✅
Our code
code
sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)
Stanford’s
code
all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")
# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")
# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
That was AF443091.1
. And 5 of 5 correct ✅ ! Hurray !!!!
We did rewrite some of the functions (not shown here), compared our algorithm and Sierra’s of all the downaloded HIV genome fastas with complete RT, PR, INT positions from blast (n=494). Found 96.2% full agreement (n=475), meaning 0 deviation of level of any ART from both algorithm on the same genome sample. Not too shabby at all! I suspect the 3.4% is mainly alignment issues. I had initially used AlignTranslation
prior to sequence alignemtn and it only gave me 89% agreement. Found out that I had to create my own function to align it properly (align DNA sequence first, then use reference sequence repositioning and extract sample sequence, then aligntranslate). I also found out that INSTI mutation score table is missing something.
Investigation
Final Thoughts
Hands down using SierraPy
or Sierra
-type app is the way to go, if we want reproducibility. There are options for local apps out there (see below). But learning to reproduce this really helps me understand the labeling of mutation and the way to get to susceptibility scoring better! It was a bumpy road, lots of trials, and lots more error and failure, but it ultimately it was a great learning experience. My respect for all these developers and researchers who have worked on this for years, really goes up! ❤️ It’s not easy at all! Even with the help of LLM!
Opportunities for Improvement
- Need to rewrite the interpretation code into a function so that we don’t have to paste 3 different times or modify 3 different times on the
load_hivdb
function. - Learn the exact algorithm sierra uses for alignment, especially when it comes to deletion/insertion
- Look at
sierra-local
to see how they usenucamino
to align and find mutations/indels - prettify the results with
gt
- include more ARTs
- look into Stanford HIVdb NGS analysis
Lessons Learnt
- We did use genotype-phenotype DRMcv but found the results not reproducible and also does not match the mutation interpretation results, probably user error.
- After multiple attempts to align seq, map to reference, then translate, just found out that
DECIPHER::AlignTranslation()
does a better job at this! Don’t useAlignSeq
for aligning amino acids! - learnt that the position of M184V, K65R etc are RT, PR, IN gene specific!
- learnt that Stanford HIVdb assumes mutation of different variants if
X
. For example, ifM184X
then it could beM184I
orM184V
, assume the worst case scenario and assign the max mutation score for that position.
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on BlueSky, twitter, GitHub or Mastodon
- if you would like collaborate please feel free to contact me