Using SortMeRNA

Using SortMeRNA for removing rRNA

Looking online, there seem to be some opinions that removing rRNA reads is not required when poly-A selection is performed during library preparation, as preferentially selecting poly-A tails automatically creates a bias towards mRNA (see this article and this article).
However, some suggest that removing rRNA would save computation time and lead to a cleaner assembly (see here).
This tutorial/paper is a good resource - it says that if GC content is abnormal, it could be a sign of high rRNA content (GC content of over 50% is typical for rRNA) and then SortMeRNA should be used. They also said “if there is any doubt, this step should be performed.”
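As a quick way to check the GC content mentioned above before deciding on SortMeRNA, here is a minimal sketch in bash/awk. It assumes standard 4-line FASTQ records; the function name `gc_percent` and the file name in the usage comment are hypothetical, and FastQC/MultiQC report the same number in more detail.

```bash
# gc_percent: print the overall GC percentage of the reads in a FASTQ stream.
# Assumption: standard 4-line FASTQ records (sequence on line 2 of each record).
gc_percent() {
  awk 'NR % 4 == 2 {
         seq = toupper($0)
         total += length(seq)
         gc += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases removed
       }
       END { if (total > 0) printf "%.1f\n", 100 * gc / total }'
}

# Usage (hypothetical file name):
#   zcat sample.fastq.gz | gc_percent
```

A value well above what is expected for the species would be a hint, not proof, that rRNA reads are abundant.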

Notes on Pstrigosa bioinformatics data analysis

Dr. Brad Weiler’s new paper on diel transcriptomics of Pseudodiploria strigosa was just posted on bioRxiv here, and his pipeline is now available on GitHub here.

Redoing trimming of reciprocal transplant seqs

I realized that I didn’t follow the Lexogen guidelines for trimming sequences from samples that were prepped for cDNA libraries using QuantSeq.

denovo Transcriptome Assembly for Pstr

I don’t have a reference genome or transcriptome to align the Chapter 3 reads of P. strigosa to for the reciprocal transplant study, so I need to generate a de novo transcriptome using the reads from the study.

Addressing reviewer comments for stress-hardening paper

A reviewer had a concern about the intra-genotypic variation in the Acer and Pcli stress-hardening manuscript that I submitted to Ecology and Evolution (see figure below).

Using gitignore file for GitHub public repository

I recently had to make all of my files for the temperaturevariabilityAcerPcli repository publicly available, because I was making the repository public for publication, and I needed to transfer a bunch of the files from one computer to the other.

Downloading sequences from Box

I tried some different methods for downloading the Chapter 3 Reciprocal Transplant experiment .fastq.gz sequences from Box, and realized that there isn’t a way to download/upload the raw sequences from Box without locally downloading them to my computer (which would take 100s of GBs of space). So, I think my best bet is to use the computer in the NTK lab to download them locally, then upload them to Pegasus.

Downloading sequences from Basespace and uploading sequences to NCBI

NOTE: There is a great tutorial and overview from Danielle Becker in Dr. Hollie Putnam’s lab on their GitHub lab notebook here.

Notes on GLMs

After reviewing some great tutorials, I think I understand the best steps for generalized linear models with fixed and random effects.

Acer CCC KEGG Pathway analysis issues

I presented my data chapters 2 and 3 to NTK’s lab on Friday, and she pointed out something I hadn’t thought of: even though the most significant KEGG pathway in my Acer CCC vs. Nursery analysis was estrogen signaling, the DEGs with KO terms that annotate to this KEGG pathway could also be involved in many other pathways. Looking at the full list of genes with the KEGG terms here, you can see that a lot of the genes are involved in many potential pathways. So why did the KEGG analysis give me the estrogen signaling pathway specifically?

Take 2 with CRW DHW MMM

I am revisiting this post from last year, where I was trying to download data from NOAA’s CRW to calculate MMM and DHW for my reef sites. Post here

2024-02-12-noteaboutannotationfilesfromMS.md

I realized that when Michael re-annotated the Locatelli unpublished Acer genome in December, he updated the names of everything (which no longer match my counts matrix). Basically when we first started this pipeline, all the Acropora gene names were “Acropora_” before the gene number. But in the recent re-annotation, Michael named everything “Acervicornis” with no underscore. I ran into issues when I was trying to run the GO_MWU scripts because I was using the “Acervicornis_iso2go.tab” file but trying to match the “Acropora_” gene names (which none of them matched of course).
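One way to reconcile the two naming schemes is to rewrite the gene IDs in the counts matrix. This is only a sketch: it assumes the old IDs start each line as `Acropora_<number>`, that the new IDs are `Acervicornis<number>`, and that the gene numbers themselves did not change in the re-annotation, which would need to be confirmed first. The function name and file names are hypothetical.

```bash
# rename_gene_ids: swap the old "Acropora_" prefix for "Acervicornis"
# at the start of every line (i.e. in the gene ID column).
# Assumption: only the prefix changed, not the gene numbers.
rename_gene_ids() {
  sed 's/^Acropora_/Acervicornis/'
}

# Usage (hypothetical file names):
#   rename_gene_ids < counts_matrix.tsv > counts_matrix_renamed.tsv
```

Checking a few IDs from both files with `head` before and after is a cheap way to confirm the names really line up.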

Issue with FvFm code

I was looking at the full dataset of all the IPAM data to see if there were any issues, because when I started trying to remove NAs from the GLM for FvFm treatment data, I started getting errors.

Fv/Fm GLMMs

I am revisiting my Fv/Fm code for my Chapter 2 stress-hardening analysis. When I first started this, I tried to follow code from Cunning et al. 2021, who assessed A. cervicornis outplants using CBASS across Florida. However, I found other publications that analyzed this kind of data in more straightforward ways (I think because Cunning et al. 2021 had a lot more variation to account for due to the different sites across space and time, and also the authors added corrections for light parameters in each CBASS tank and for use of a Diving PAM). I started playing around with other code and eventually got really deep in the weeds of this.

installing and using reefmapmaker to create coral reef maps

I discovered a GitHub repo for a conda package that helps produce high-quality coral reef maps using online data. The package is called reefmapmaker.

Acer gene counts using samtools

Finally getting the gene counts for Acer samples from Chapter 2

Annotating new Acer transcriptome

I’m going to be following Michael Studivan’s code for annotating the new Acer transcriptome.

Downloading SST Data from CoralWatch

Asking ChatGPT to define everything I need:

UTR gff parser

I’m going to try to run this again because Nick said he had no issues doing it. I think maybe if I try to run gtf_advanced_parser.py first on the genomic.gff file downloaded from NCBI, then maybe I can see if that works and I can find the “three_prime_UTR” annotations in the gff file.

Libro et al. 2013 Acer genome alignment attempt

new Acer genome continued

Previous ways I tried to download it:

```{bash}
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/032/359/415/GCA_032359415.1_NEU_Acer_K2/GCA_032359415.1_NEU_Acer_K2_genomic.gtf.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/032/359/415/GCA_032359415.1_NEU_Acer_K2/GCA_032359415.1_NEU_Acer_K2_genomic.fna.gz
gunzip GCA_032359415.1_NEU_Acer_K2_genomic.gtf.gz
gunzip GCA_032359415.1_NEU_Acer_K2_genomic.fna.gz
```

Conda Environment on Pegasus, new Acer genome, and bowtie2

There are a lot of sections here, so here are links to specific topics:

Trouble with Fv/Fm Data

I’m revisiting the IPAM data from the Chapter 2 stress-hardening experiment, and I’m having a really hard time wrapping my head around it.

DESeq2 design formula

I have started working through Dr. Michael Studivan’s DESeq code, but I am having trouble figuring out how best to include all of my explanatory variables in the DESeq model. In his study, he has time and treatment as explanatory variables. He created a column that combined the two: time.treatment. See screenshot of data table with design:

PCR duplicates

I met with Dr. Michael Studivan last week to discuss plans for DNA and RNA extractions of the Ch 3 Pstr samples. During this meeting, we talked about the results from the Ch 2 RNA-seq data, and he was concerned with how high the number of reads passing the filtering step was (see the multiqc report). He said that when he has run TagSeq analysis, he usually gets around 60% loss of reads when going from raw reads to trimmed reads, whereas I got >95% retention of reads passing the fastp filter. So what’s happening here? Did fastp not work?

Ch4_AcerCCC read counts

I’m writing up a summary for the Ch4 samples now, and I want to generate a summary table with the average and standard deviation of raw reads, trimmed reads, and percent alignment rates.
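A minimal sketch of how the mean and standard deviation could be pulled from a tab-delimited summary file with awk. The file layout (one header row, one sample per row) and the function name `col_mean_sd` are assumptions for illustration; it reports the sample standard deviation (n - 1 in the denominator).

```bash
# col_mean_sd FILE COLUMN: print "mean<TAB>sd" for one numeric column of a
# tab-delimited file with a single header row. Uses the sample SD (n - 1).
col_mean_sd() {
  awk -F '\t' -v col="$2" '
    NR > 1 { n++; sum += $col; sumsq += $col * $col }
    END {
      if (n < 2) exit 1                      # SD is undefined for fewer than 2 rows
      mean = sum / n
      var = (sumsq - n * mean * mean) / (n - 1)
      if (var < 0) var = 0                   # guard against rounding error
      printf "%.2f\t%.2f\n", mean, sqrt(var)
    }' "$1"
}

# Usage (hypothetical file; column 2 = raw reads):
#   col_mean_sd read_summary.tsv 2
```

Running it once per column (raw reads, trimmed reads, percent alignment) gives the rows of the summary table.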

pre-filtering samples based on million reads

I have been writing up a summary of the gene expression analysis in advance of my committee meeting, but I realized something that is inconsistent between different people’s codes and I’m not sure what the baseline or “correct” way is.

Ch2 Acer DESeq2

So I started on the code to get through DESeq2 for the ch2 temperature variability samples.

redo Fastp for Ch2 samples

I need to redo the Fastp step because in comparing the fastqc reports of the raw reads versus the “trimmed” reads, they are the exact same and it appears that nothing was trimmed. I also think Fastp didn’t work correctly because there is only one fastp.html and fastp.json file for the entire list of samples, and it corresponds to Pcli-148 (the last sample of the batch). So I think it overwrote things and maybe didn’t process correctly.
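If the reports were being overwritten, one likely fix is to give every sample its own report names. The sketch below assumes single-end reads named `<sample>.fastq.gz`; the function name, directory layout, and the `FASTP` override (handy for a dry run with `FASTP=echo`) are my own additions, not part of anyone's pipeline. It uses only fastp's `--in1`, `--out1`, `--html`, and `--json` options, with no extra trimming settings.

```bash
# run_fastp_per_sample IN_DIR OUT_DIR: run fastp on each single-end FASTQ
# and write a separate HTML/JSON report per sample so nothing is overwritten.
# Set FASTP=echo to print the commands instead of running them.
run_fastp_per_sample() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for fq in "$in_dir"/*.fastq.gz; do
    [ -e "$fq" ] || continue                 # skip if the glob matched nothing
    sample=$(basename "$fq" .fastq.gz)
    ${FASTP:-fastp} \
      --in1 "$fq" \
      --out1 "$out_dir/${sample}.trimmed.fastq.gz" \
      --html "$out_dir/${sample}.fastp.html" \
      --json "$out_dir/${sample}.fastp.json"
  done
}

# Usage (hypothetical directories):
#   run_fastp_per_sample raw_reads trimmed_reads
```

With one JSON file per sample, MultiQC can then summarize the before/after read counts for the whole batch.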

Tximport for salmon quant Pcli

So I had a lot of issues in trying to import the “quant.sf” files into R so that I could generate a counts matrix of transcripts.

DESeq2 for Ch4 CCC samples (both location and genotype in dds model)

I imported the quantified reads by using the multiqc general stats text file from the STAR alignment of trimmed samples and the readsPerGene.out.tab files generated from STAR alignment.

Pcli Salmon transcriptome

So I installed Salmon locally onto my scratch space by downloading the Linux binary package:

using Pcli transcriptome

So while the stringtie assembly is running for Acer, I am going to try to figure out how to use the Pcli file I downloaded (see previous blog post).

stringtie2 for Ch2 temp variability

StringTie is “a fast and efficient assembler of RNA-Seq alignments to assemble and quantitate full-length transcripts. For our use, we will input short mapped reads from HISAT2. StringTie’s output can be used to identify DEGs in programs such as DESeq2 and edgeR” (from Sam’s GitHub)

STAR index and alignment for Pcli

Ok so since I have STAR already installed locally on my scratch space, I’m going to try to do the genome index and alignment using STAR for the Pcli transcriptome. ChatGPT says you can just run it the same way as if you were running a genome index.

renaming trimmed files

Following the fastp trimming code, all the file names ended in “.clean.processed” and that prevented them from being recognized as fastq.gz files for the next step in the pipeline. So I needed to write a for-loop to remove those suffixes from all file names.
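A for-loop along these lines would do the renaming. This is a sketch rather than the exact loop from the post: it assumes the unwanted suffix sits at the very end of each file name, and the function name `strip_suffix` is hypothetical.

```bash
# strip_suffix DIR SUFFIX: rename every file in DIR that ends in SUFFIX
# so that the suffix is removed (e.g. x.fastq.gz.clean.processed -> x.fastq.gz).
strip_suffix() {
  dir=$1
  suffix=$2
  for f in "$dir"/*"$suffix"; do
    [ -e "$f" ] || continue                  # skip if the glob matched nothing
    mv -- "$f" "${f%"$suffix"}"              # ${f%...} drops the trailing suffix
  done
}

# Usage (hypothetical directory):
#   strip_suffix trimmed_reads ".clean.processed"
```

Replacing `mv` with `echo mv` for a first pass shows what would be renamed without touching any files.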

HISAT2 alignment

So, following the next step of the pipeline, Ariana, Kevin, Sam, and Zoe all use HISAT2 for alignment. Here is an explanation of all the flags from Sam’s GitHub.

updated stringtie code

Following Sam’s code, I used these arguments below for stringtie:

troubleshooting STAR index (again)

Yesterday I had Natalia look at my code and we discovered something in the STAR index output (from the original STAR index and the updated gff3 file STAR index after following Jill’s code).

STAR align updated to counts matrix

So following the updated STAR alignment with the updated gff3 file for the Acer CCC samples, I re-tried the code from Natalia and got different results which I think means it worked!

Pcli transcriptome

Michael Studivan created a Google Drive of different species’ transcriptomes, and the one for Pclivosa comes from Avila-Magana et al. 2021. I’m not sure if Michael separated out everything from the metatranscriptome, but the file I have downloaded is called “Dip_Host.fna.gz”. So it is just the coral host.

Ch4_AcerCCC updated gff STAR alignment results

The STAR alignment job finished, so I quickly ran multiqc to compare the original alignment, the first round of updated annotations, and the most recent updated gff3 file. What’s interesting is all their alignment rates look the same…

md5 check sums script

So I followed the first script from Sam and Ariana to do the “md5 check sums” thing. I guess the purpose of it is to check the integrity of the files after downloading them from Basespace. Idk why it’s necessary but everyone in Hollie’s lab seems to do it, so it must be important! However, I submitted this job on Pegasus yesterday for the Ch2_tempvariability2023 raw fastq.gz files and the job had still not yet started today:
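For what it's worth, the point of the checksum step is that a file truncated or corrupted during transfer produces a different MD5 value, so comparing checksums before and after a move catches silent damage. A minimal sketch with GNU `md5sum` (the function names and the `checksums.md5` file name are assumptions, and this is not Sam and Ariana's script):

```bash
# write_checksums DIR: record an MD5 checksum for every fastq.gz file in DIR.
write_checksums() {
  (cd "$1" && md5sum -- *.fastq.gz > checksums.md5)
}

# verify_checksums DIR: re-compute the checksums and compare them with the
# recorded ones; returns non-zero if any file no longer matches.
verify_checksums() {
  (cd "$1" && md5sum -c --quiet checksums.md5)
}

# Usage (hypothetical directory):
#   write_checksums raw_reads      # before transferring
#   verify_checksums raw_reads     # after transferring, on the destination copy
```

The check is most useful when the first list comes from the source of the files and the second run happens on the copy that landed on Pegasus.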

download fastp to Pegasus and running fastp

The next step of the pipeline is to run fastp to conduct trimming and cleaning of sequence files.

Starting over

I think I need to start from the beginning and go through one pipeline that has been proven for that person (or lab group) to work time and time again, rather than try to frankenstein pieces of people’s codes together to get something to work (which is what I have tried thus far and haven’t been successful). It also helps to find a person with well-annotated code. Thankfully, Hollie Putnam’s lab has tried-and-true methods that go back to pipelines of Dr. Sam Barr and Dr. Ariana Huffmyer.

Acer CCC versus Acer Stress-Hardening Samples

So, the major difference between the CCC samples and the stress-hardening samples is that the former were sequenced using Lexogen 3’ QuantSeq, and the latter were sequenced following the TagSeq protocol, which is a “specialty” library prep specifically tested and refined with coral samples. So although both projects were sequenced on the same type of machine with the same goal for output (NovaSeq S2 SE100), the results are vastly different.

Subread and FeatureCounts

I am trying to see if I can still get gene counts on the Acer CCC samples without using StringTie, since that has been difficult for me to get to work. I adapted this featureCounts code from my wound-healing project and submitted the job to Pegasus:

stringtie on SH samples

I am now going to try the stringtie scripts on the stress-hardening Acer samples and see if I get better results than the Acer CCC ones.

Troubleshooting STAR alignment

This morning to try to troubleshoot the “missing gene IDs” issue with my STAR alignment, I followed one suggestion of ChatGPT to use the Integrative Genomics Viewer app to visualize the alignment of one of the sample BAM files with the Acer genome.

STAR output to gene counts

I can’t get stringtie to work so I want to try Natalia’s method instead, where she took the STAR read counts and somehow turned that into a gene count matrix.
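I don't know exactly how Natalia's code does it, but the general idea can be sketched with awk. STAR's per-sample `ReadsPerGene.out.tab` files (written when gene counting is switched on) start with four summary rows (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`), followed by one row per gene with the gene ID in column 1 and counts in columns 2 to 4 (unstranded, forward-stranded, reverse-stranded). The function below is an assumption-laden sketch: the name `build_count_matrix` is hypothetical, files are assumed to end in `ReadsPerGene.out.tab`, and which count column is appropriate depends on the library's strandedness.

```bash
# build_count_matrix DIR [COLUMN]: merge all *ReadsPerGene.out.tab files in DIR
# into one tab-delimited gene x sample matrix printed to stdout.
# COLUMN selects the count column (2 = unstranded, 3 or 4 = stranded); default 2.
build_count_matrix() {
  dir=$1
  col=${2:-2}
  awk -F '\t' -v col="$col" '
    FNR == 1 {                                # first line of each file: register sample
      name = FILENAME
      sub(/.*\//, "", name)                   # drop the directory part
      sub(/_?ReadsPerGene\.out\.tab$/, "", name)
      samples[++ns] = name
    }
    FNR <= 4 { next }                         # skip the four N_* summary rows
    {
      if (!($1 in seen)) { seen[$1] = 1; genes[++ng] = $1 }
      count[$1, ns] = $col
    }
    END {
      printf "gene_id"
      for (s = 1; s <= ns; s++) printf "\t%s", samples[s]
      printf "\n"
      for (g = 1; g <= ng; g++) {
        printf "%s", genes[g]
        for (s = 1; s <= ns; s++) printf "\t%d", count[genes[g], s]
        printf "\n"
      }
    }' "$dir"/*ReadsPerGene.out.tab
}

# Usage (hypothetical directory):
#   build_count_matrix star_output 2 > gene_count_matrix.tsv
```

Comparing the column totals against the `N_noFeature` rows is a reasonable sanity check on which strand column actually holds the counts.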

stringtie code for Acer CCC and rerunning STAR with updated annotation file

1) Need to install stringtie and gffcompare to local programs folder on pegasus.

samtools for STAR alignment results

So I went down the path of using samtools after STAR alignment because it seems that it is used as a quality-checking tool to see if alignment worked. I thought multiQC did that but idk what the difference is. So I tried first running Danielle’s code (https://github.com/daniellembecker/DanielleBecker_Lab_Notebook/blob/master/_posts/2021-04-14-Molecular-Underpinnings-RNAseq-Workflow.md) but then realized her sequences are paired-end reads and that’s why the code wasn’t working.

following HBC pipeline - qualimap and salmon

I keep returning to this GitHub page time and time again, and it continues to be the best resource I can find to explain to me why certain tools are used, as well as provide citations for tested tools that improve accuracy and precision over others. Here is the link to the overview: https://github.com/hbctraining/DGE_workshop_salmon_online/blob/master/lessons/01a_RNAseq_processing_workflow.md

interpreting STAR alignment output files

Each sample comes with these output files following STAR alignment script:

Aligning raw reads versus trimmed reads results for Acer CCC samples

The results of the STAR alignment for the trimmed reads were a little strange looking (especially in comparison to the stress-hardening samples, which look way better), so I tried aligning the raw reads to see if I got better results (higher alignment and fewer unmapped reads). Interestingly, it looks like trimming improved alignment rates (trimmed samples on left, raw reads on right):

stringtie articles

I just wanted to put all these somewhere for when I need them. Stringtie is what Jill Ashey used following STAR alignment. In the past I used Subread (I think?) for Pdam.

SUCCESS STAR alignment Acer genome

This was the most recent code I tried that didn’t work:

STAR index Acer genome

The first step when aligning reads to a genome with STAR is to create an index based on the most up-to-date genome assembly. I downloaded a 2019 draft assembly from Baums et al. (https://usegalaxy.org/published/history?id=1f8678b27ae56467 and https://usegalaxy.org/published/page?id=2f8d21c73f8501e2 for file descriptions based on Apal files).

STAR alignment troubleshooting

I’m looking at Young et al. 2020 methodology because they used Iliana Baums’ Apal genome, which is also available on Galaxy, so maybe the code required will be similar.

Which samples to use based on minimum library size

I have decided which aligner to use, but before I begin that step, I need to determine which samples have made the cut based on the read trimming and QC.

initial attempt STAR alignment codes

Here is the script Natalia used to set up the STAR genome index (https://github.com/China2302/SCTLD_RRC/blob/main/hpc/STAR_index.sh):

```{bash}
#!/bin/bash
#BSUB -J star_index
#BSUB -q general
#BSUB -P c_transcriptomics
#BSUB -n 8
#BSUB -o /scratch/projects/c_transcriptomics/Genomes/Ofav/star_index%J.out
#BSUB -e /scratch/projects/c_transcriptomics/Genomes/Ofav/star_index%J.err
#BSUB -u nxa945@miami.edu
#BSUB -N
```

Acer genomes vs transcriptomes

I wanted to make a separate post dedicated to Acer because it seems particularly confusing.

notes on alignment tools

In the past, for the Pdamicornis wound healing paper, I used STAR to align the 3’ RNAseq reads to the genome, not the transcriptome.

The notes I’m relying on for this round of analysis are from Dr. Natalia Andrade (https://github.com/China2302/transcriptomics-workflow) and Dr. Michael Studivan (https://github.com/mstudiva/tag-based_RNAseq/blob/master/). This is because I sequenced the Acer CCC samples with Natalia’s SCTLD Mote and Nova samples, while the temperature variability 2023 Acer and Pcli samples were sequenced following Michael’s protocols.

So far, because Michael’s stuff is all in python, I have had more success with Natalia’s pipeline. However, the QC and trimming steps are not super specific, and they both used cutadapt to trim Illumina adapters. Michael’s code had more specifics for trimming (i.e. specific base pair sequences) that are potentially more likely to be found in TagSeq generated reads. Natalia specifically trimmed the polyA tails because that is a problem in 3’ RNA seq.

But for alignment, there are many different possible tools to use, and there is also the question of whether to align to a genome or a transcriptome.

The UM CCS student mentors made a pros and cons list for aligning to a genome versus a transcriptome (https://github.com/ccsstudentmentors/tutorials/tree/master/RNA-Seq/Quantifying-RNA-Expression). I think it also depends on what is available for the species you’re analyzing.

Transcriptome – then use Bowtie.

Genome – use STAR.

After talking to Dr. Kevin Wong, he said to just pick whichever gives the best alignment rate.

In terms of which tool to use, STAR seems to be the superior aligner:

https://www.biostars.org/p/353946/

some wins against Pegasus

After meeting with Anthony Bonacolta, I finally got some help as to why my job submissions to Pegasus weren’t working. First, the alias “compute” that I created includes “bsub -P and_transcriptomics” and I was trying to do compute .sh every time I submitted a job. When I did it this way, it also wasn’t recognizing when I put something in the bigmem queue and would put it in the general queue.

WGCNA for wound healing manuscript pt. 2

I’ve been feeling like the heatmap I created doesn’t quite make sense, because I feel like the only thing that should be on the x axis is condition x hour, not each individual like I have it currently:

WGCNA for wound healing manuscript

I’m currently working through the WGCNA tutorial for the wound healing dataset (following https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html and I specifically did the step-by-step network construction, not the automatic one).

Helpful articles for quality control and pre-processing of RNA-seq data

Quality Control and Preprocessing of Sequencing Reads: https://bio-protocol.org/exchange/protocoldetail?id=4454&type=1

Trimmomatic vs. Cutadapt

Before even diving into the benefits of Trimmomatic versus Cutadapt on 3’ RNA-seq data, I remembered a 2020 publication which said that adapter and low-quality base trimming was actually not even necessary before alignment, and that the “soft-clipping” removal of these short sequences during the alignment stage using Subread was more successful at not having false positives (and removing relevant data) while still removing 94% of adapter sequences. Here is the link to the paper: https://academic.oup.com/nargab/article/2/3/lqaa068/5901066

writing a script to create multiple jobs at once for trimming

This is the code I have right now for the trimming script (note: the parallel flag didn’t work, so it ran each file individually one at a time on Pegasus in the general queue, which was very slow):

creating .txt file with just sample names using awk/sed

I want to create a .txt file that just has the sample names listed, no file extensions. This worked for me (ran directly in terminal in folder with fastq sample files):
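For reference, here is one way to get that list. It is a sketch and not necessarily the awk/sed command from the post: it assumes the files are named `<sample>.fastq.gz` and uses `basename` to strip the extension; the function name is hypothetical.

```bash
# list_sample_names DIR: print one sample name per line for every
# <sample>.fastq.gz file in DIR, with the .fastq.gz extension removed.
list_sample_names() {
  for f in "$1"/*.fastq.gz; do
    [ -e "$f" ] || continue                  # skip if the glob matched nothing
    basename "$f" .fastq.gz
  done
}

# Usage (run in the folder with the fastq files):
#   list_sample_names . > samples.txt
```

The resulting `samples.txt` can then drive a `while read` loop in later trimming or alignment scripts.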

installing cutadapt and trim_galore on Pegasus

trim_galore is a wrapper that requires cutadapt and fastQC (https://github.com/FelixKrueger/TrimGalore)

installing MultiQC on pegasus

I want to install multiqc so I can look at all the fastQC results at once and compare them to one another.

initial FastQC results for Acer CCC samples (untrimmed)

So I got the multiqc report for the CCC Acer samples to work, and it seems like they are all over the place.

Uploading files to Pegasus

After trying several different codes, this is what I got to work:

Revisiting Pegasus (UM Supercomputer) after 4 years

What has changed with Pegasus since 2019? Time to find out.

QC scripts for Pegasus Stress-Hardening RNA-Seq Experiment

These scripts are what I want to use for QC once I get access to Pegasus and can move all my sequences onto the project space.

Downloading stress-hardening sequences from BaseSpace

I think I found the way to use the BaseSpace Downloader GUI - I got two sample sequences to download successfully (Acer_005 and Acer_019). I think trying to download the project itself doesn’t work for whatever reason, as it keeps failing repeatedly. It could be because I’m trying to download it straight to an external hard drive, but it’s 49GB so I can’t download that to my computer.

Adding gene lists to GO tables

I want to figure out which specific genes are corresponding to each GO term – right now it’s just number of genes reported in the “Significant” column in the results table.

Optimizing command line on new laptop for bash scripts for downloading raw sequences

Today I received the sequences from UT Austin GSAF via Illumina Basespace. Now I need to download them.

Creating Polygons for PCAs


Re-running TopGO with Up vs. Downregulated genes split up

I just checked and I did use Fisher’s Exact Test (re: last post), so I don’t need to redo that. But I need to redo a couple of things. First, I need to split up the up- and downregulated gene lists and run TopGO separately. Next, I need to make sure that I filtered by L2FC < or > 0, instead of 2, which I did for the volcano plots. A cutoff of 2 is very, very strict, and all the papers I’m reading count differential expression as p-adjusted < 0.05 and L2FC < or > 0. My annotated gene lists that I saved from each DESeq2 object for each hour are filtered based on p-adjusted < 0.05, but no specific L2FC. So that’s good; it means the only thing I would need to change is the volcano plots. Although I did a lot of manual editing in Illustrator, so maybe I can just leave those as is. I just need to make sure that for TopGO, I count any upregulated genes as L2FC > 0, and downregulated genes as L2FC < 0.
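To make the filtering rule concrete, here is a rough sketch of the split outside of R. It assumes a DESeq2 results table exported as a tab-delimited file with the gene ID in the first column and columns named `log2FoldChange` and `padj`; the function name and output file names are hypothetical, and the same filter is easy to write in R instead.

```bash
# split_degs RESULTS_TSV PREFIX: write PREFIX_up.txt (padj < 0.05, L2FC > 0)
# and PREFIX_down.txt (padj < 0.05, L2FC < 0), one gene ID per line.
# Rows with NA in padj or log2FoldChange are skipped.
split_degs() {
  infile=$1
  prefix=$2
  awk -F '\t' -v up="${prefix}_up.txt" -v down="${prefix}_down.txt" '
    NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }   # map header names to columns
    $idx["padj"] == "NA" || $idx["log2FoldChange"] == "NA" { next }
    $idx["padj"] + 0 < 0.05 {
      lfc = $idx["log2FoldChange"] + 0
      if (lfc > 0) print $1 > up
      else if (lfc < 0) print $1 > down
    }' "$infile"
}

# Usage (hypothetical file names):
#   split_degs hour1_deseq_results.tsv hour1
```

The two gene lists can then be fed to TopGO separately, one run for upregulated genes and one for downregulated genes.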

TopGO = what does it all mean?????

I am not sure how to interpret the results of the topGO analysis.

Separating Up- and Down-Regulated Genes for GO Analysis?

I have returned to the GO analysis for each hour, as I was talking with Kevin and he suggested separating the up- and downregulated gene lists so that you know which pathways are being upregulated versus downregulated. But then I ended up in a deep-dive of GO enrichment analysis, and whether it’s appropriate to split up the DEG list by directionality. Turns out this is a contested question and ultimately depends on your question (I think).

RNA Sequencing Contract with Dr. Michael Studivan

Prepping metadata from RNA extractions for 2022 Stress-Hardening Experiment and 2022-2023 Urban Coral Reciprocal Transplant (Carly+Rich Experiment)

More notes on GLMMs (with Dr. Kevin Wong!)

We looked through my PAM_stresshardening.Rmd code that I wrote for running glmmTMB (fvfm ~ Treatment + (1|Colony) + (1|Tank/Date), family = list(family="beta", link="logit")).

GLMMS for Stress-Hardening IPAM Data

Based on the PPT from Kevin, I think my Fv/Fm data falls into a Percentage/Proportion data type, where numerical data is bounded between 0 and 1. The assumption for this is that there is overdispersion (the mean is < the variance). The code provided in the PPT as an example was: glmmTMB::glmmTMB(survival ~ treatment + (1|sample), family = list(family="beta", link="logit"))

Figuring out why R code for importing PAM data stopped working

I realized that there were some issues with the RStudio on my AOML desktop versus my laptop. First, I had updated the RStudio on my computer, so it messed up one of the packages that was needed for Ross Cunning’s IPAMtoR custom program. I tried to install devtools to install this package (called “joeyr”) but AOML wasn’t letting me do that for some reason, or something else needed to be updated or something. I got it to work on my laptop though. Then, once I actually got it to work, I have been having issues importing the csv files that were downloaded from the IPAM software.

Figuring out Linear Mixed Models using Stress-Hardening Data

I am revisiting the 2022 stress-hardening experiment to finally tackle the unknown, which is how to incorporate so many different variables into one statistical test (tank replicates, biological replicates, species, genotypes, treatments, time points).

GO Analysis

In doing the gene ontology enrichment analysis for biological processes (comparing control vs. wounded at each time point), we don’t get any meaningful GO terms that pop up.

Venn Diagrams

Today’s goal is to make a Venn diagram showing overlap of any significant DEGs between the time points. Similar to something like this from Traylor-Knowles et al. 2021 and Connelly et al. 2020

DEGPatterns Gene Clustering

I have been trying to recreate this figure but I’m having a lot of trouble getting it.

PCA plot significance

I have made my PCA plots for each hour of the differential expression analysis (Figure below).

topGO Analysis

I am working on understanding the topGO R package (Alexa et al. 2006) to test for gene ontology enrichment from the Pocillopora damicornis wound healing transcriptomics study.

Hello! This is a test!

Next you can update your site name, avatar and other options using the _config.yml file in the root of your repository (shown below).