From Math to Code: Building GAM with Penalty Functions From Scratch
Enjoyed learning the penalized GAM math. Built penalty matrices, optimized λ using GCV, and implemented our own GAM function. Confusing? Yes! Rewarding? Oh yes!
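For anyone curious what that looks like in practice, here is a minimal sketch of the idea, assuming a B-spline basis from splines::bs(), a second-order difference penalty, and a grid search over λ by GCV (illustrative code, not the post’s actual implementation):

```r
library(splines)

set.seed(1)
n <- 200
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

B <- bs(x, df = 20, intercept = TRUE)      # B-spline basis
D <- diff(diag(ncol(B)), differences = 2)  # second-order difference matrix
S <- crossprod(D)                          # penalty matrix t(D) %*% D

gcv <- function(lambda) {
  A    <- crossprod(B) + lambda * S
  beta <- solve(A, crossprod(B, y))
  H    <- B %*% solve(A, t(B))             # hat matrix for this lambda
  rss  <- sum((y - B %*% beta)^2)
  n * rss / (n - sum(diag(H)))^2           # GCV criterion
}

lambdas <- 10^seq(-6, 2, length.out = 50)
lambdas[which.min(sapply(lambdas, gcv))]   # lambda chosen by GCV
```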
I finally understood B-splines by working through the Cox-de Boor algorithm step by step, discovering they’re just weighted combinations of basis functions that make non-linear regression linear. What surprised me is that going through Bayesian statistics really helped me understand the engine behind the model! Will try this again in the future!
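As a flavor of the recursion, here is a toy Cox-de Boor implementation in base R (my own illustrative version, not the post’s code), building one B-spline basis function at a time from a knot vector:

```r
# Cox-de Boor recursion: degree-k basis functions built from degree-0 indicators
bspline <- function(x, i, k, knots) {
  if (k == 0) {
    return(as.numeric(knots[i] <= x & x < knots[i + 1]))
  }
  d1 <- knots[i + k] - knots[i]
  d2 <- knots[i + k + 1] - knots[i + 1]
  w1 <- if (d1 > 0) (x - knots[i]) / d1 else 0
  w2 <- if (d2 > 0) (knots[i + k + 1] - x) / d2 else 0
  w1 * bspline(x, i, k - 1, knots) + w2 * bspline(x, i + 1, k - 1, knots)
}

knots <- c(0, 0, 0, 0.3, 0.6, 1, 1, 1)
x     <- seq(0, 0.99, length.out = 100)
basis <- sapply(1:5, function(i) bspline(x, i, 2, knots))  # quadratic basis
matplot(x, basis, type = "l", ylab = "B_i(x)")             # overlapping bumps
```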
We learned to derive the Newton-Raphson algorithm from a Taylor series approximation and implemented it for logistic regression in R. We’ll show how the second-order Taylor expansion leads to the Newton-Raphson update formula, then compare individual parameter updates with using the full Fisher information matrix for faster convergence.
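A compact sketch of that update (illustrative code, using the usual score t(X)(y − p) and Fisher information t(X) W X):

```r
set.seed(42)
n <- 500
X <- cbind(1, rnorm(n))                        # design matrix with intercept
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1.2)))  # simulated binary outcome

beta <- c(0, 0)
for (iter in 1:25) {
  p     <- as.vector(plogis(X %*% beta))
  score <- t(X) %*% (y - p)                    # gradient of the log-likelihood
  W     <- diag(p * (1 - p))                   # weights from the second derivative
  info  <- t(X) %*% W %*% X                    # Fisher information matrix
  step  <- solve(info, score)
  beta  <- beta + step                         # Newton-Raphson update
  if (max(abs(step)) < 1e-8) break
}
cbind(newton = beta, glm = coef(glm(y ~ X[, 2], family = binomial)))
```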
Refreshed my rusty calculus skills lately! 🤓 Finally understand what happens during complete separation and why those coefficient SEs get so extreme. The math behind maximum likelihood estimation makes more sense now! The chain rule, quotient rule, and matrix inversion are all crucial!
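A tiny illustration of the complete-separation problem (my own toy data, not from the post): glm() still returns numbers, but the estimates and standard errors explode.

```r
# Outcome is perfectly separated by x: the MLE does not exist, estimates diverge
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

fit <- glm(y ~ x, family = binomial)  # expect a "fitted probabilities 0 or 1" warning
summary(fit)$coefficients             # huge estimates, enormous standard errors
```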
In my simulations of Response Adaptive Randomization, I discovered it performs comparably to fixed 50-50 allocation in identifying treatment effects. The adaptive approach does appear to work! However, with only 10 trials, I’ve merely scratched the surface. Important limitations exist - temporal bias risks, statistical inefficiency, and complex multiplicity adjustments in Bayesian frameworks.
RSQLite With DBI: A Note To Myself
I messed around with DBI and RSQLite and learned it’s actually pretty simple to use in R - just connect, write tables, and use SQL queries without all the complicated server stuff. Thanks to Alec Wong for suggesting this!
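The whole workflow really is just a few calls. A minimal sketch with an in-memory database (illustrative):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # no server needed

dbWriteTable(con, "mtcars", mtcars)               # write a data frame as a table
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mean_mpg
                 FROM mtcars GROUP BY cyl")       # plain SQL back as a data frame

dbDisconnect(con)
```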
Plumber and JavaScript
Tried out plumber and a bit of JavaScript to build a simple local API for logging migraine events 🧠💻. Just a quick tap on my phone now records the time to a CSV—pretty handy! 📱✅
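A rough sketch of what such an endpoint could look like (hypothetical file and path names, not the post’s actual API), appending a timestamp to a CSV on each POST:

```r
# api.R -- a hypothetical plumber endpoint for logging an event
#* Log a migraine event with the current time
#* @post /migraine
function() {
  entry <- data.frame(time = format(Sys.time(), "%Y-%m-%d %H:%M:%S"))
  write.table(entry, "migraine_log.csv", sep = ",",
              append = file.exists("migraine_log.csv"),
              col.names = !file.exists("migraine_log.csv"), row.names = FALSE)
  list(status = "logged", time = entry$time)
}

# In a separate R session:
# plumber::pr("api.R") |> plumber::pr_run(port = 8000)
```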
🙈 Made a hilariously redundant R package for simple OpenAI calls, but the real win was finally learning how to build an R package! 🛠️ Is it efficient? Absolutely not! Was it worth the time and experience? Yes! Will I do it again? Yes! Will it break? Yes! 🤣
How do we identify relevant articles in our domains? This project takes example journal RSS feeds with abstracts, uses LLMs to extract points of interest, and shares the insights on Bluesky—stimulating curiosity.
I found Polars syntax is quite similar to dplyr’s. And the way that we can chain the functions makes it even more familiar! It was fun learning the nuances; now it’s time to put them into practice! Wish me luck! 🍀
"Fascinating" describes my journey with Stable Diffusion 3. It’s deepened my appreciation for original art and masterpieces. Understanding how to generate quality art is just the beginning—it drives me to explore the underlying structure. Join me in exploring SD3 in R!
Overall, I am quite impressed with the responses! With minimal prompt engineering and document cleaning, it was able to return accurate responses, and even separated different conditions and provided appropriate treatment options. It was also able to return the correct response for tricky questions that our RAG was not able to. It definitely has potential!
Wow, what a journey, and more to come! We learned how to perform simple RAG with an LLM and even ventured into LangChain territory. It wasn’t as scary as some people said! The documentation is fantastic. Best of all, we did it ALL in R with Reticulate, without leaving RStudio! Not only can we read the IDSA Guidelines, we can use an LLM to assist us with retrieving information!
MCAR, MAR, MNAR, all so confusing. But with DAG, oh so amusing! Many technical words, I don’t understand, but with simulation, I am a fan! Join me in exploring missing mechanisms, learn I will with great optimism.
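A small simulation sketch (mine, purely for illustration) of the three mechanisms: missingness that ignores the data (MCAR), depends on an observed variable (MAR), or depends on the unobserved value itself (MNAR):

```r
set.seed(7)
n   <- 1000
age <- rnorm(n, 50, 10)
bmi <- rnorm(n, 27, 4)

# MCAR: missingness is pure chance
miss_mcar <- rbinom(n, 1, 0.2)

# MAR: probability of missing BMI depends on observed age
miss_mar  <- rbinom(n, 1, plogis(-3 + 0.05 * age))

# MNAR: probability of missing BMI depends on the (unseen) BMI value itself
miss_mnar <- rbinom(n, 1, plogis(-6 + 0.2 * bmi))

bmi_mar <- ifelse(miss_mar == 1, NA, bmi)
c(complete = mean(bmi), observed_under_MAR = mean(bmi_mar, na.rm = TRUE))
```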
The SUTVA, Positivity, Identifiability, Consistency, and Exchangeability of Causal Inference: the essential ingredients that help us bring out the true flavor of the causal model. Here is my understanding of each assumption (main course) with examples (side dish), accompanied by simulation (paired with beverages). Bon Appétit!
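As a taste of the paired beverage, a toy simulation (illustrative only) of why exchangeability matters: randomized treatment recovers the true effect, confounded assignment does not.

```r
set.seed(123)
n <- 10000
severity <- rnorm(n)                   # a confounder
y0 <- 2 * severity + rnorm(n)          # potential outcome without treatment
y1 <- y0 + 1                           # true causal effect = 1

t_rand <- rbinom(n, 1, 0.5)                    # exchangeable: coin-flip treatment
t_conf <- rbinom(n, 1, plogis(2 * severity))   # not exchangeable: sicker -> more treated

y_rand <- ifelse(t_rand == 1, y1, y0)  # consistency links potential and observed outcomes
y_conf <- ifelse(t_conf == 1, y1, y0)

mean(y_rand[t_rand == 1]) - mean(y_rand[t_rand == 0])  # ~1, unbiased
mean(y_conf[t_conf == 1]) - mean(y_conf[t_conf == 0])  # biased away from 1
```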
I’ve struggled with differentiating between total, direct, and indirect effects, so this blog/note serves as a personal reference to solidify my understanding and make future amendments as needed. While there are comprehensive articles available, this is a simplified explanation for myself and potentially others.
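For future me, a linear toy example (illustrative) where the decomposition is transparent, with total = direct + indirect:

```r
set.seed(1)
n <- 5000
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)             # a path: x -> m
y <- 0.3 * x + 0.8 * m + rnorm(n)   # direct effect 0.3, b path 0.8

a      <- coef(lm(m ~ x))["x"]      # x -> m
b      <- coef(lm(y ~ x + m))["m"]  # m -> y, holding x fixed
total  <- coef(lm(y ~ x))["x"]      # ~0.3 + 0.5 * 0.8 = 0.7
direct <- coef(lm(y ~ x + m))["x"]  # ~0.3

c(total = unname(total), direct = unname(direct),
  indirect = unname(a * b), direct_plus_indirect = unname(direct + a * b))
```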
It was enjoyable to visualize the non-linear relationship with interaction and observe the corresponding changes in CATE. If one understands the underlying equation, it’s possible to easily obtain the ATE using calculus. Lastly, adopting Richard McElreath’s Owl framework as a documented procedure ensures quality assurance! 🙌
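A small illustration (mine, not the post’s code) of reading CATE and ATE straight off a model with an interaction: if y = β0 + β1·t + β2·w + β3·t·w, then CATE(w) = β1 + β3·w and ATE = β1 + β3·E[w].

```r
set.seed(2)
n <- 5000
w <- runif(n, 0, 10)                       # effect modifier
t <- rbinom(n, 1, 0.5)                     # randomized treatment
y <- 1 + 2 * t + 0.5 * w + 0.3 * t * w + rnorm(n)

b <- coef(lm(y ~ t * w))

cate <- function(w) b["t"] + b["t:w"] * w  # conditional average treatment effect
cate(c(0, 5, 10))                          # the effect grows with w
b["t"] + b["t:w"] * mean(w)                # ATE ~ 2 + 0.3 * 5 = 3.5
```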
I’m now more confident in my understanding of the 95% confidence interval, but less certain about confidence intervals in general, knowing that we can’t be sure if our current interval includes the true population parameter. On a brighter note, if we have the correct confidence interval, it could still encompass the true parameter even when it’s not statistically significant. I find that quite refreshing.
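A quick coverage simulation (illustrative) of that textbook statement: about 95% of such intervals catch the true mean across repeated sampling, but any single interval either does or does not.

```r
set.seed(99)
true_mean <- 10
covers <- replicate(10000, {
  x  <- rnorm(30, mean = true_mean, sd = 3)  # one study's sample
  ci <- t.test(x)$conf.int                   # its 95% confidence interval
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covers)  # close to 0.95 over many studies, but unknowable for any single one
```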
We learned how to convert the pooled odds ratio from a random-effects model and subsequently calculate the number needed to treat (NNT) or harm (NNH). It’s important to understand that without knowing the event proportions in either the treatment or control groups, we cannot accurately estimate the absolute risk reduction for an individual study or for a meta-analysis. Fascinating indeed! Every day is a school day! 🙌
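The key back-calculation in a few lines (my sketch of the standard formulas, with a made-up pooled OR and control event proportion):

```r
# Assumed inputs, purely hypothetical
or <- 0.65   # pooled odds ratio (treatment vs control)
p0 <- 0.20   # control-group event proportion

odds1 <- or * p0 / (1 - p0)    # treatment odds = OR * control odds
p1    <- odds1 / (1 + odds1)   # convert back to a probability
arr   <- p0 - p1               # absolute risk reduction
nnt   <- 1 / arr               # number needed to treat

c(p1 = p1, arr = arr, nnt = nnt)
```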
Here, we have demonstrated three different methods for calculating NNT with meta-analysis data. I learned a lot from this experience, and I hope you find it enjoyable and informative as well. Thank you, @wwrighID, for initiating the discussion and providing a pivotal example by using the highest weight control event proportion to back-calculate ARR and, eventually, NNT. I also want to express my gratitude to @DrToddLee for contributing a brilliant method of pooling a single proportion from the control group for further estimation. Special thanks to @MatthewBJane, the meta-analysis maestro, for guiding me toward the correct equation to calculate event proportions, with weight estimated by the random effect model. 🙏
What an incredible journey it has been! I’m thoroughly enjoying working with Stan code, even though I don’t yet grasp all the intricacies. We’ve already tackled simple linear and logistic regressions and delved into the application of Bayes’ theorem. Now, let’s turn our attention to the fascinating world of Mixed-Effect Models, also known as Hierarchical Models.
Diving into this, we’re exploring how using numbers to express our certainty/uncertainty, especially with medical results, can help sharpen our estimated ‘posterior value’ and offer a solid base for learning and discussions. We often talk about specifics like sensitivity without the nitty-gritty math, but crafting our own priors and using a dash of Bayes and visuals can really spotlight how our initial guesses shift. Sure, learning this takes patience, but once it clicks, it’s a game-changer – continuous learning for the win!
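The core update in numbers (my own sketch, with hypothetical test characteristics): prior probability plus sensitivity and specificity in, posterior probability out.

```r
post_prob <- function(prior, sens, spec) {
  # P(disease | positive test) via Bayes' theorem
  (sens * prior) / (sens * prior + (1 - spec) * (1 - prior))
}

post_prob(prior = 0.10, sens = 0.90, spec = 0.80)  # a mediocre test: 10% -> ~33%

# How the posterior shifts as our initial guess (the prior) changes
curve(post_prob(x, sens = 0.90, spec = 0.80), from = 0, to = 1,
      xlab = "prior probability", ylab = "posterior after a positive test")
```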
I learned a great deal throughout this journey. In the second part, I gained knowledge about implementing logistic regression in Stan. I also learned the significance of data type declarations for obtaining accurate estimates, how to use the posterior to predict new data, and what the generated quantities block in Stan is for. Moreover, having a friend who is well-versed in Bayesian statistics proves invaluable when delving into the Bayesian realm! Very fun indeed!
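For the posterior-prediction piece, a generic R sketch (not the post’s Stan code), assuming we already have posterior draws of an intercept and slope from a logistic model:

```r
# Pretend these are posterior draws extracted from a fitted Stan logistic model
set.seed(3)
draws <- data.frame(alpha = rnorm(4000, -1.0, 0.1),
                    beta  = rnorm(4000,  0.8, 0.1))

x_new <- 1.5                                       # a new covariate value
p_new <- plogis(draws$alpha + draws$beta * x_new)  # one predicted probability per draw

mean(p_new)                      # posterior mean prediction
quantile(p_new, c(0.025, 0.975)) # 95% credible interval for the prediction
```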
There is a lot to learn about Bayesian statistics, but it’s fun, exciting, and flexible! I thoroughly enjoyed the beginning of this journey. There will be learning curves, but there are so many great people and resources out there to help us get closer to understanding the Bayesian way.
Sending key presses to another device using software that emulates a keyboard, but isn't a physical keyboard, is a fascinating concept. We understand that in the Linux/Unix environment and with Python, this can be accomplished through low-level programming. But can the R programming language achieve the same feat? If it can, then how does it work?
Interaction adventures through simulations and gradient boosting trees using the S-learner approach. I hadn’t realized that lightGBM and XGBoost could reveal interaction terms without explicit specification. Quite intriguing!
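A minimal S-learner sketch with the xgboost R package (illustrative, on hypothetical simulated data): treatment goes in as just another feature, and the fitted trees pick up the interaction on their own.

```r
library(xgboost)

set.seed(4)
n <- 4000
x <- rnorm(n)
t <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * x + 1.5 * t * (x > 0) + rnorm(n)   # treatment helps only when x > 0

# S-learner: one model, treatment included as an ordinary feature
fit <- xgboost(data = as.matrix(cbind(x, t)), label = y,
               nrounds = 200, max_depth = 3, eta = 0.1,
               objective = "reg:squarederror", verbose = 0)

# Predict under t = 1 and t = 0 for everyone; the difference estimates the effect
tau_hat <- predict(fit, as.matrix(cbind(x, t = 1))) -
           predict(fit, as.matrix(cbind(x, t = 0)))
tapply(tau_hat, x > 0, mean)   # near 0 when x <= 0, near 1.5 when x > 0
```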
I’m delighted that R users can have access to the incredible Hugging Face pre-trained models. In this demonstration, we provide a straightforward example of how to utilize them for sentiment analysis using GPT-generated synthetic data from evaluation comments. Let’s go!
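One possible route, sketched here with reticulate and the Python transformers pipeline (this assumes a Python environment with transformers installed; illustrative only, not the post’s exact code):

```r
library(reticulate)

# Assumes a Python environment with the 'transformers' package available
transformers <- import("transformers")
classifier   <- transformers$pipeline("sentiment-analysis")  # downloads a default model

comments <- c("The lecture was engaging and well organized.",
              "I was confused for most of the session.")
classifier(comments)   # returns a label and a score for each comment
```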
The PyWhy Causal-learn Discord community is fantastic! The package documentation is equally impressive, making experiential learning both fun and informative. Truly, it’s another exceptional tool for causal discovery at our fingertips!
Get ready for a thrill ride in causal discovery! We’re diving into gCastle, a Python package, right in R to amp up our skills. Let’s orchestrate our prior knowledge and nail that true DAG. 🔥
Simulating a binary dataset, coupled with an understanding of the logit link and the linear formula, is truly fascinating! However, we must exercise caution regarding our adjustments, as they can potentially divert us from the true findings. I advocate for transparency in Directed Acyclic Graphs (DAGs) and emphasize the sequence: causal model -> estimator -> estimand.
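A bare-bones version of that kind of simulation (illustrative): put the linear predictor on the log-odds scale, push it through the inverse logit, and draw Bernoulli outcomes.

```r
set.seed(5)
n <- 5000
age    <- rnorm(n, 60, 10)
smoker <- rbinom(n, 1, 0.3)

# True model on the log-odds scale
logit_p <- -8 + 0.1 * age + 0.9 * smoker
y <- rbinom(n, 1, plogis(logit_p))               # inverse logit, then Bernoulli draws

coef(glm(y ~ age + smoker, family = binomial))   # recovers roughly (-8, 0.1, 0.9)
```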
Saving can be enjoyable! If you’re planning to cut down on takeout orders, why not use past data to simulate your savings? Let it inspire and motivate your future dining-in decisions! 👍
Beware of what we adjust. As we have demonstrated, adjusting for a collider variable can lead to a false estimate in your analysis. If a collider is included in your model, relying solely on AIC/BIC for model selection may provide misleading results and give you a false sense of achievement.
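A compact demonstration of the trap (my own toy example): x and y are truly unrelated, but conditioning on their common effect manufactures an association, and AIC still prefers the wrong model.

```r
set.seed(6)
n <- 5000
x <- rnorm(n)
y <- rnorm(n)              # truly independent of x
c_ <- x + y + rnorm(n)     # collider: caused by both x and y

coef(lm(y ~ x))            # ~0, the truth
coef(lm(y ~ x + c_))       # x now looks strongly (and spuriously) related to y

AIC(lm(y ~ x), lm(y ~ x + c_))   # the collider model "wins" on AIC anyway
```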
Front-door adjustment: a superhero method for handling unobserved confounding by using mediators (if present) to estimate causal effects accurately.
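A linear toy example of the idea (illustrative): with U unobserved, the naive regression of Y on X is confounded, but the product of the X→M path and the M→Y path (adjusting for X) recovers the true effect.

```r
set.seed(8)
n <- 20000
u <- rnorm(n)                      # unobserved confounder
x <- 0.8 * u + rnorm(n)
m <- 0.6 * x + rnorm(n)            # mediator: the only path from x to y
y <- 0.9 * m + 1.2 * u + rnorm(n)  # true effect of x on y = 0.6 * 0.9 = 0.54

coef(lm(y ~ x))["x"]                             # biased by u
coef(lm(m ~ x))["x"] * coef(lm(y ~ m + x))["m"]  # front-door estimate, ~0.54
```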
I had the opportunity to share our journey to data science in medical education.
Which strategy is optimal for dollar cost averaging? Let’s play with data!
Bring a textbook to life by using a simple natural language processing method (Ngram) to guide focused reading and build a robust differential diagnosis.
I didn’t want to read the textbook in sequence. Hence, I figured that if I read a paragraph a day in a random chapter, I might be able to benefit from random learning!
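On the n-gram idea itself, a bare-bones bigram count in base R (toy text; the post may well use a proper NLP package):

```r
# Count the most frequent word pairs (bigrams) in a chunk of text, base R only
text  <- "chest pain and shortness of breath suggests acute coronary syndrome
          chest pain with fever suggests pneumonia"
words <- strsplit(tolower(text), "\\s+")[[1]]

bigrams <- paste(head(words, -1), tail(words, -1))   # consecutive word pairs
sort(table(bigrams), decreasing = TRUE)[1:5]         # "chest pain" tops this toy list
```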
R you doing it?
Just Because in a true sense :D
How to solve this… 2 ? 1 ? 6 ? 6 ? 200 ? 50 = 416.56
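One way to attack it: brute-force the operator slots in R. This sketch (mine) assumes standard operator precedence and tries +, -, *, /, and ^:

```r
ops  <- c("+", "-", "*", "/", "^")
nums <- c(2, 1, 6, 6, 200, 50)

# Every combination of five operators for the five slots
grid <- expand.grid(rep(list(ops), 5), stringsAsFactors = FALSE)

for (i in seq_len(nrow(grid))) {
  o <- unlist(grid[i, ])
  expr <- paste(nums[1], o[1], nums[2], o[2], nums[3], o[3],
                nums[4], o[4], nums[5], o[5], nums[6])
  if (isTRUE(all.equal(eval(parse(text = expr)), 416.56))) print(expr)
}
```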
Brief introduction: the 100 prisoners problem is a problem in probability theory and combinatorics. In this challenge, 100 numbered prisoners must each find their own number in one of 100 drawers in order to survive. Rules: we have 100 prisoners, labeled 1, 2, …, 100 on their clothes; a room filled with 100 boxes, labeled 1, 2, …, 100 on the outside; inside each box there is a number from 1, 2, …, 100; only one prisoner may enter the room at a time; each prisoner may open at most 50 boxes and cannot communicate with the other prisoners; if a prisoner finds his/her/their number, he/she/they will exit the room and not be able to talk to the other prisoners.
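For anyone who wants to poke at it numerically, a small simulation sketch (mine) comparing random guessing with the well-known cycle-following strategy:

```r
set.seed(10)

one_trial <- function(strategy = c("cycle", "random")) {
  strategy <- match.arg(strategy)
  boxes <- sample(100)                       # box i contains the number boxes[i]
  all(vapply(1:100, function(prisoner) {
    if (strategy == "random") {
      opened <- sample(100, 50)              # 50 boxes at random
    } else {
      opened <- integer(50)
      nxt <- prisoner                        # start at the box with your own number
      for (k in 1:50) {
        opened[k] <- boxes[nxt]              # open the box, then go to the box it names
        nxt <- boxes[nxt]
      }
    }
    prisoner %in% opened                     # did this prisoner find their number?
  }, logical(1)))                            # TRUE only if every prisoner succeeds
}

mean(replicate(2000, one_trial("cycle")))    # ~0.31: everyone survives surprisingly often
mean(replicate(2000, one_trial("random")))   # essentially never
```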