Downloading sequences from Basespace
NOTE: There is a great tutorial and overview from Danielle Becker in Dr. Hollie Putnam’s lab on their GitHub lab notebook here.
PhD from the University of Miami Rosenstiel School of Marine, Atmospheric, and Earth Science
After reviewing some great tutorials, I think I finally understand the best steps for generalized linear models with fixed and random effects.
I presented my data chapters 2 and 3 to NTK’s lab on Friday, and she pointed out something I hadn’t thought of - that even though the most significant KEGG pathway in my Acer CCC vs. Nursery analysis was estrogen signaling, the DGEs themselves that had KO terms which annotate to this KEGG pathway could also be involved in many other pathways. Looking at the full list of genes with the KEGG terms here, you can see that a lot of the genes are involved in many potential pathways. So why did the KEGG analysis give me the estrogen signaling pathway specifically?
I am revisiting this post from last year, where I was trying to download data from NOAA’s CRW to calculate MMM and DHW for my reef sites. Post here
I realized that when Michael re-annotated the Locatelli unpublished Acer genome in December, he updated the names of everything (which no longer match my counts matrix). Basically when we first started this pipeline, all the Acropora gene names were “Acropora_” before the gene number. But in the recent re-annotation, Michael named everything “Acervicornis” with no underscore. I ran into issues when I was trying to run the GO_MWU scripts because I was using the “Acervicornis_iso2go.tab” file but trying to match the “Acropora_” gene names (which none of them matched of course).
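The rename fix boils down to a one-line substitution on the first column of the counts matrix. A minimal sketch with a made-up two-row demo file (the real matrix and file names differ):

```bash
# Demo input: two gene IDs in the old "Acropora_" naming (values are made up)
printf 'Acropora_000001\t5\nAcropora_000002\t8\n' > counts_demo.txt

# Swap the old "Acropora_" prefix for the new "Acervicornis" prefix
# (note: the re-annotation dropped the underscore)
sed 's/^Acropora_/Acervicornis/' counts_demo.txt > counts_renamed.txt

cat counts_renamed.txt
```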
I was looking at the full dataset of all the IPAM data to see if there were any issues, because when I started trying to remove NAs from the GLM for FvFm treatment data, I started getting errors.
I am revisiting my Fv/Fm code for my Chapter 2 stress-hardening analysis. When I first started this, I tried to follow code from Cunning et al. 2021, who assessed A.cervicornis outplants using CBASS across Florida. However, I found other publications that applied different ways of analyzing this data that were more straightforward (I think because Cunning et al. 2021 had a lot more variation to account for due to the different sites across space and time, and also the authors added corrections for light parameters in each CBASS tank and for use of a Diving PAM). I started playing around with other codes and eventually got really deep in the weeds of this.
I discovered a GitHub repo for a conda package that helps produce high-quality coral reef maps using online data. The package is called reefmapmaker.
I’m going to be following Michael Studivan’s code for annotating the new Acer transcriptome.
Asking ChatGPT to define everything I need:
I’m going to try to run this again because Nick said he had no issues doing it. I think maybe if I try to run gtf_advanced_parser.py first on the genomic.gff file downloaded from NCBI, then maybe I can see if that works and I can find the “three_prime_UTR” annotations in the gff file.
Previous ways I tried to download it:

```{bash}
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/032/359/415/GCA_032359415.1_NEU_Acer_K2/GCA_032359415.1_NEU_Acer_K2_genomic.gtf.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/032/359/415/GCA_032359415.1_NEU_Acer_K2/GCA_032359415.1_NEU_Acer_K2_genomic.fna.gz
gunzip GCA_032359415.1_NEU_Acer_K2_genomic.gtf.gz
gunzip GCA_032359415.1_NEU_Acer_K2_genomic.fna.gz
```
There are a lot of sections here, so here are links to specific topics:
I’m revisiting the IPAM data from the Chapter 2 stress-hardening experiment, and I’m having a really hard time wrapping my head around it.
I have started working through Dr. Michael Studivan’s DESeq code, but I am having trouble figuring out how best to include all of my explanatory variables in the DESeq model. In his study, he has time and treatment as explanatory variables. He created a column that combined the two: time.treatment. See screenshot of data table with design:
I met with Dr. Michael Studivan last week to discuss plans for DNA and RNA extractions of the Ch 3 Pstr samples. During this meeting, we talked about the results from the Ch 2 RNA-seq data, and he was concerned by how high the number of reads passing the filtering step was (see the multiqc report). He said that when he has run TagSeq analyses, he usually sees around 60% loss of reads going from raw reads to trimmed reads. Whereas for mine, I got >95% retention of reads that passed the fastp filter. So what's happening here? Did fastp not work?
I’m writing up a summary for the Ch4 samples now, and I want to generate a summary table with the average and standard deviation of raw reads, trimmed reads, and percent alignment rates.
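As a sketch, the mean and standard deviation can be pulled straight from a column of the multiqc general stats table with awk. The demo numbers below are made up; in practice the column would be cut from the real multiqc text file:

```bash
# Demo column of raw-read counts (hypothetical values)
printf '1000000\n1200000\n800000\n' > reads_demo.txt

# Mean and sample standard deviation of a single numeric column
awk '{s+=$1; ss+=$1*$1; n++}
     END {m=s/n; printf "mean=%.1f sd=%.1f\n", m, sqrt((ss-n*m*m)/(n-1))}' reads_demo.txt
```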
I have been writing up a summary of the gene expression analysis in advance of my committee meeting, but I realized something that is inconsistent between different people’s codes and I’m not sure what the baseline or “correct” way is.
So I started on the code to get through DESeq2 for the ch2 temperature variability samples.
I need to redo the Fastp step because in comparing the fastqc reports of the raw reads versus the “trimmed” reads, they are the exact same and it appears that nothing was trimmed. I also think Fastp didn’t work correctly because there is only one fastp.html and fastp.json file for the entire list of samples, and it corresponds to Pcli-148 (the last sample of the batch). So I think it overwrote things and maybe didn’t process correctly.
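For the redo, one fix would be giving fastp per-sample report names with the -h and -j flags, so each iteration doesn't overwrite the previous sample's fastp.html/fastp.json. A hedged sketch, not my actual script (sample names and file patterns here are hypothetical):

```bash
# Sketch: per-sample fastp reports so nothing gets overwritten
# (sample list and file naming are placeholders)
for sample in Pcli-147 Pcli-148; do
    fastp \
        -i "${sample}.fastq.gz" \
        -o "${sample}.trimmed.fastq.gz" \
        -h "${sample}.fastp.html" \
        -j "${sample}.fastp.json"
done
```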
So I had a lot of issues in trying to import the “quant.sf” files into R so that I could generate a counts matrix of transcripts.
I imported the quantified reads by using the multiqc general stats text file from the STAR alignment of trimmed samples and the readsPerGene.out.tab files generated from STAR alignment.
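A shell sketch of how ReadsPerGene.out.tab files can be stitched into one counts matrix. Tiny made-up files stand in for the real STAR outputs; STAR's leading summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) need skipping, and the demo uses just one of them:

```bash
# Demo ReadsPerGene.out.tab files: geneID <tab> unstranded <tab> fwd <tab> rev
# (counts are made up)
printf 'N_unmapped\t10\t10\t10\ngene1\t5\t5\t0\ngene2\t8\t8\t0\n' > sampleA_ReadsPerGene.out.tab
printf 'N_unmapped\t12\t12\t12\ngene1\t3\t3\t0\ngene2\t9\t9\t0\n' > sampleB_ReadsPerGene.out.tab

# Gene IDs from the first file, then the unstranded-count column (column 2)
# from each sample, skipping STAR's N_* summary rows
awk '$1 !~ /^N_/ {print $1}' sampleA_ReadsPerGene.out.tab > genes.txt
for f in sampleA sampleB; do
    awk '$1 !~ /^N_/ {print $2}' "${f}_ReadsPerGene.out.tab" > "${f}.counts"
done
paste genes.txt sampleA.counts sampleB.counts > counts_matrix.txt
cat counts_matrix.txt
```

Which strand column to keep (2, 3, or 4) depends on the library's strandedness, so that's worth double-checking against the protocol.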
So I installed Salmon locally onto my scratch space by downloading the Linux binary package:
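For reference, a no-root install from the prebuilt binary generally looks like the sketch below. The version number, asset name, and extracted directory name are assumptions, so the salmon GitHub releases page should be checked for the current tarball:

```bash
# Sketch: user-level install of the prebuilt Linux binary (no admin rights).
# Version and paths are assumptions -- check the releases page.
cd ~/scratch/programs
wget https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz
tar -xzf salmon-1.10.0_linux_x86_64.tar.gz

# The extracted directory name varies by version -- check it after untarring
export PATH="$HOME/scratch/programs/salmon-latest_linux_x86_64/bin:$PATH"
salmon --version
```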
So while the stringtie assembly is running for Acer, I am going to try to figure out how to use the Pcli file I downloaded (see previous blog post).
Stringtie is "a fast and efficient assembler of RNA-Seq alignments to assemble and quantitate full-length putative transcripts. For our use, we will input short mapped reads from HISAT2. StringTie's output can be used to identify DEGs in programs such as DESeq2 and edgeR" (from Sam's github)
Ok so since I have STAR already installed locally on my scratch space, I’m going to try to do the genome index and alignment using STAR for the Pcli transcriptome. ChatGPT says you can just run it the same way as if you were running a genome index.
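A sketch of what that index command might look like when pointed at a transcriptome fasta instead of a genome (paths are hypothetical). One wrinkle: for small references, the STAR manual recommends lowering --genomeSAindexNbases to roughly min(14, log2(reference length)/2 - 1):

```bash
# Sketch: STAR index built from the Pcli host transcriptome fasta
# (directory and file names are placeholders)
STAR --runMode genomeGenerate \
     --genomeDir Pcli_star_index \
     --genomeFastaFiles Dip_Host.fna \
     --genomeSAindexNbases 11 \
     --runThreadN 8
```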
Following the fastp trimming code, all the file names ended in ".clean.processed", which prevented them from being recognized as fastq.gz files in the next step of the pipeline. So I needed to write a for-loop to remove those suffixes from all the file names.
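The loop boils down to a parameter-expansion rename. A self-contained demo (file names are made up):

```bash
# Demo files carrying the leftover suffix (names are hypothetical)
touch Acer-001.fastq.gz.clean.processed Acer-002.fastq.gz.clean.processed

# Strip the ".clean.processed" suffix so downstream tools see .fastq.gz files;
# ${f%.clean.processed} removes that exact suffix from each name
for f in *.clean.processed; do
    mv "$f" "${f%.clean.processed}"
done

ls Acer-001.fastq.gz Acer-002.fastq.gz
```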
So, following the next step of the pipeline, Ariana, Kevin, Sam, and Zoe all use HISAT2 for alignment. Here is an explanation of all the flags from Sam’s GitHub.
Following Sam’s code, I used these arguments below for stringtie:
Yesterday I had Natalia look at my code and we discovered something in the STAR index output (from the original STAR index and the updated gff3 file STAR index after following Jill’s code).
So following the updated STAR alignment with the updated gff3 file for the Acer CCC samples, I re-tried the code from Natalia and got different results which I think means it worked!
Michael Studivan created a Google Drive of different species’ transcriptomes, and the one for Pclivosa comes from Avila-Magana et al. 2021. I’m not sure if Michael separated out everything from the metatranscriptome, but the file I have downloaded is called “Dip_Host.fna.gz”. So it is just the coral host.
The STAR alignment job finished, so I quickly ran multiqc to compare the original alignment, the first round of updated annotations, and the most recent updated gff3 file. What’s interesting is all their alignment rates look the same…
So I followed the first script from Sam and Ariana to do the “md5 check sums” thing. I guess the purpose of it is to check the integrity of the files after downloading them from Basespace. Idk why it’s necessary but everyone in Hollie’s lab seems to do it, so it must be important! However, I submitted this job on Pegasus yesterday for the Ch2_tempvariability2023 raw fastq.gz files and the job had still not yet started today:
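For the record, the check itself is just md5sum run in verify mode: after a Basespace download you compare the freshly computed checksums against the list provided with the delivery, and any corrupted or truncated file shows up as FAILED. A toy demo where I make and verify my own list:

```bash
# Toy demo: compute a checksum list for a file, then verify it.
# (In the real pipeline the .md5 list comes with the sequence delivery.)
printf 'ACGT\n' > sample_demo.fastq
md5sum sample_demo.fastq > checksums.md5

# Verify -- prints "sample_demo.fastq: OK" if the file is intact
md5sum -c checksums.md5
```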
The next step of the pipeline is to run fastp to conduct trimming and cleaning of sequence files.
I think I need to start from the beginning and go through one pipeline that has been proven for that person (or lab group) to work time and time again, rather than try to frankenstein pieces of people’s codes together to get something to work (which is what I have tried thus far and haven’t been successful). It also helps to find a person with well-annotated code. Thankfully, Hollie Putnam’s lab has tried-and-true methods that go back to pipelines of Dr. Sam Barr and Dr. Ariana Huffmyer.
So, the major difference between the CCC samples and the stress-hardening samples is that the first were sequenced using Lexogen 3’ QuantSeq, and the second were sequenced following the TagSeq protocol, which is a “specialty” library prep specifically tested and refined with coral samples. So although both projects were sequenced on the same type of machine with the same goal for output (NovaSeq S2 SE100), the results are vastly different.
I am trying to see if I can still get gene counts on the Acer CCC samples without using StringTie, since that has been difficult for me to get to work. I adapted this featureCounts code from my wound-healing project and submitted the job to pegasus:
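A hedged sketch of that kind of featureCounts call for single-end STAR BAMs. The paths are placeholders, and the -t/-g values have to match whatever feature type and attribute name the Acer annotation actually uses (worth checking the gff3 before running):

```bash
# Sketch: gene-level counts from sorted single-end BAMs
# (annotation path, attribute names, and BAM pattern are assumptions)
featureCounts \
    -T 8 \
    -t gene \
    -g ID \
    -a Acer_annotation.gff3 \
    -o Acer_gene_counts.txt \
    *.Aligned.sortedByCoord.out.bam
```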
I am now going to try the stringtie scripts on the stress-hardening Acer samples and see if I get better results than the Acer CCC ones.
This morning to try to troubleshoot the “missing gene IDs” issue with my STAR alignment, I followed one suggestion of ChatGPT to use the Integrative Genomics Viewer app to visualize the alignment of one of the sample BAM files with the Acer genome.
I can’t get stringtie to work so I want to try Natalia’s method instead, where she took the STAR read counts and somehow turned that into a gene count matrix.
1) Need to install stringtie and gffcompare to local programs folder on pegasus.
So I went down the path of using samtools after STAR alignment because it seems that it is used as a quality-checking tool to see if alignment worked. I thought multiQC did that but idk what the difference is. So I tried first running Danielle’s code (https://github.com/daniellembecker/DanielleBecker_Lab_Notebook/blob/master/_posts/2021-04-14-Molecular-Underpinnings-RNAseq-Workflow.md) but then realized her sequences are paired-end reads and that’s why the code wasn’t working.
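For single-end BAMs the samtools sanity check reduces to something like the sketch below (file name is hypothetical). The difference from multiQC is that samtools inspects the BAM itself, while multiQC just aggregates the reports other tools have already written:

```bash
# Sketch: post-alignment checks on one single-end STAR BAM (name is a placeholder).
# quickcheck exits non-zero if the BAM is truncated or malformed;
# flagstat summarizes mapped/unmapped read counts.
samtools quickcheck Acer-005.Aligned.sortedByCoord.out.bam
samtools flagstat Acer-005.Aligned.sortedByCoord.out.bam
```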
I keep returning to this GitHub page time and time again, and it continues to be the best resource I can find to explain to me why certain tools are used, as well as provide citations for tested tools that improve accuracy and precision over others. Here is the link to the overview: https://github.com/hbctraining/DGE_workshop_salmon_online/blob/master/lessons/01a_RNAseq_processing_workflow.md
Each sample comes with these output files following STAR alignment script:
The results of the STAR alignment for the trimmed reads were a little strange looking (especially in comparison to the stress-hardening samples, those look way better), so I tried aligning the raw reads to see if I got better results (higher alignment and fewer unmapped reads). Interestingly, it looks like trimming improved alignment rates (trimmed samples on left, raw reads on right):
I just wanted to put all these somewhere for when I need them. Stringtie is what Jill Ashey used following STAR alignment. In the past I used Subread (I think?) for Pdam.
This was the most recent code I tried that didn’t work:
The first step when aligning reads to the genome when using STAR is to first create an index based on the most up-to-date genome assembly. I downloaded a 2019 draft assembly from Baums et al. (https://usegalaxy.org/published/history?id=1f8678b27ae56467 and https://usegalaxy.org/published/page?id=2f8d21c73f8501e2 for file descriptions based on Apal files).
I’m looking at Young et al. 2020 methodology because they used Iliana Baums’ Apal genome, which is also available on Galaxy, so maybe the code required will be similar.
I have decided which aligner to use, but before I begin that step, I need to determine which samples have made the cut based on the read trimming and QC.
Here is the script Natalia used to set up the STAR genome index (https://github.com/China2302/SCTLD_RRC/blob/main/hpc/STAR_index.sh):

```{bash}
#!/bin/bash
#BSUB -J star_index
#BSUB -q general
#BSUB -P c_transcriptomics
#BSUB -n 8
#BSUB -o /scratch/projects/c_transcriptomics/Genomes/Ofav/star_index%J.out
#BSUB -e /scratch/projects/c_transcriptomics/Genomes/Ofav/star_index%J.err
#BSUB -u nxa945@miami.edu
#BSUB -N
```
I wanted to make a separate post dedicated to Acer because it seems particularly confusing.
In the past, for the Pdamicornis wound healing paper, I used STAR to align the 3’ RNAseq reads to the genome, not the transcriptome.
The notes I’m relying on for this round of analysis are from Dr. Natalia Andrade (https://github.com/China2302/transcriptomics-workflow) and Dr. Michael Studivan (https://github.com/mstudiva/tag-based_RNAseq/blob/master/). The reason for this is because I sequenced the Acer CCC samples with Natalia’s SCTLD Mote and Nova samples, while the temperature variability 2023 Acer and Pcli samples were sequenced following Michael’s protocols.
So far, because Michael’s stuff is all in python, I have had more success with Natalia’s pipeline. However, the QC and trimming steps are not super specific, and they both used cutadapt to trim Illumina adapters. Michael’s code had more specifics for trimming (i.e. specific base pair sequences) that are potentially more likely to be found in TagSeq generated reads. Natalia specifically trimmed the polyA tails because that is a problem in 3’ RNA seq.
But for alignment, there are many different possible tools to use, and there is also the question of whether to align to a genome or a transcriptome.
The UM CCS student mentors made a pros and cons list for aligning to a genome versus a transcriptome (https://github.com/ccsstudentmentors/tutorials/tree/master/RNA-Seq/Quantifying-RNA-Expression). I think it also depends on what is available for the species you’re analyzing.
Transcriptome – use Bowtie.
Genome – use STAR.
After talking to Dr. Kevin Wong, he said to just pick whichever gives the best alignment rate.
In terms of which tool to use, STAR seems to be the superior aligner:
After meeting with Anthony Bonacolta, I finally got some help as to why my job submissions to Pegasus weren't working. First, the alias "compute" that I created includes "bsub -P and_transcriptomics" and I was trying to do compute .sh every time I submitted a job. When I did it this way, it also wasn't recognizing when I put something in the bigmem queue and would put it in the general queue.
I’ve been feeling like the heatmap I created doesn’t quite make sense, because I feel like the only thing that should be on the x axis is condition x hour, not each individual like I have it currently:
I’m currently working through the WGCNA tutorial for the wound healing dataset (following https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html and I specifically did the step-by-step network construction, not the automatic one).
Before even diving into the benefits of Trimmomatic versus Cutadapt on 3’ RNA-seq data, I remembered a 2020 publication which said that adapter and low-quality base trimming was actually not even necessary before alignment, and that the “soft-clipping” removal of these short sequences during the alignment stage using Subread was more successful at not having false positives (and removing relevant data) while still removing 94% of adapter sequences. Here is the link to the paper: https://academic.oup.com/nargab/article/2/3/lqaa068/5901066
This is the code I have right now for the trimming script (note: the parallel flag didn’t work, so it ran each file individually one at a time on Pegasus in the general queue, which was very slow):
I want to create a .txt file that just has the sample names listed, no file extensions. This worked for me (ran directly in terminal in folder with fastq sample files):
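One version of that kind of one-liner, with made-up demo files standing in for the real fastq samples (the sed pattern would need adjusting for a different naming scheme):

```bash
# Demo files (hypothetical sample names)
touch Acer-005.fastq.gz Acer-019.fastq.gz

# Strip the .fastq.gz extension to get bare sample names, one per line
ls *.fastq.gz | sed 's/\.fastq\.gz$//' > samples.txt
cat samples.txt
```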
trim_galore is a wrapper that requires cutadapt and fastQC (https://github.com/FelixKrueger/TrimGalore)
I want to install multiqc so I can look at all the fastQC results at once and compare them to one another.
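multiqc is a Python package, so a user-level pip install usually works on a cluster without admin rights (conda works too; the fastQC output path below is a placeholder):

```bash
# Sketch: user-level install, then aggregate all fastQC reports in a directory
pip install --user multiqc
multiqc path/to/fastqc_results/
```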
So I got the multiqc report for the CCC Acer samples to work, and it seems like they are all over the place.
After trying several different codes, this is what I got to work:
What has changed with Pegasus since 2019? Time to find out.
These scripts are what I want to use for QC once I get access to Pegasus and can move all my sequences onto the project space.
I think I found the way to use the BaseSpace Downloader GUI - I got two sample sequences to download successfully (Acer_005 and Acer_019). I think trying to download the project itself doesn’t work for whatever reason, as it keeps failing repeatedly. It could be because I’m trying to download it straight to an external hard drive, but it’s 49GB so I can’t download that to my computer.
I want to figure out which specific genes are corresponding to each GO term – right now it’s just number of genes reported in the “Significant” column in the results table.
Today I received the sequences from UT Austin GSAF via Illumina Basespace. Now I need to download them.
I just checked and I did use Fisher's Exact Test (re: last post), so I don't need to redo that. But I need to redo a couple of things. First, I need to split up the Up and Downregulated gene lists and run TopGO separately. Next, I need to make sure that I filtered L2FC < or > 0, instead of 2 which I did for the volcano plots. 2 is very, very strict, and all the papers I'm reading count differential expression as p-adjusted < 0.05 and L2FC < or > 0. My annotated gene lists that I saved from each DESeq2 object for each hour are filtered based on p-adjusted < 0.05, but no specific L2FC. So that's good, that means the only thing I need to change is the volcano plots. Although I did a lot of manual editing in Illustrator, so maybe I can just leave that as is. I just need to make sure that for TopGO, I count any upregulated genes as L2FC > 0, and downregulated genes as L2FC < 0.
I am not sure how to interpret the results of the topGO analysis.
I have returned to the GO analysis for each hour, as I was talking with Kevin and he mentioned to separate the up and down-regulated gene lists so that you know which pathways are being upregulated versus downregulated. But then I ended up in a deep-dive of GO enrichment analysis, and whether it’s appropriate to split up the DEG list by directionality. Turns out this is a contested debate and ultimately depends on your question (I think).
Prepping metadata from RNA extractions for 2022 Stress-Hardening Experiment and 2022-2023 Urban Coral Reciprocal Transplant (Carly+Rich Experiment)
We looked through my PAM_stresshardening.Rmd code that I wrote for running glmmTMB (fvfm ~ Treatment + (1 | Colony) + (1 | Tank/Date), family = list(family="beta", link="logit")).
Based on the PPT from Kevin, I think my Fv/Fm data falls into a Percentage/Proportion data type, where numerical data is bound from 0-1. The assumption for this is that there is overdispersion (the mean is less than the variance). The code provided in the PPT as an example was: glmmTMB::glmmTMB(survival ~ treatment + (1|sample), family = list(family="beta", link="logit"))
I realized that there were some issues with the RStudio on my AOML desktop versus my laptop. First, I had updated RStudio on my computer, which broke one of the packages needed for Ross Cunning's IPAMtoR custom program. I tried to install devtools to install this package (called "joeyr"), but AOML wasn't letting me do that for some reason, or something else needed to be updated first. I got it to work on my laptop though. Then, once I actually got it working, I have been having issues importing the csv files that were downloaded from the IPAM software.
I am revisiting the 2022 stress-hardening experiment to finally tackle the unknown, which is how to incorporate so many different variables into one statistical test (tank replicates, biological replicates, species, genotypes, treatments, time points).
In doing the gene ontology enrichment analysis for biological processes (comparing control vs. wounded at each time point), we don’t get any meaningful GO terms that pop up.
Today’s goal is to make a venn diagram showing overlap of any significant DGEs between the time points. Similar to something like this from Traylor-Knowles et al. 2021 and Connelly et al. 2020
I have been trying to recreate this figure but I’m having a lot of trouble getting it.
I have made my PCA plots for each hour of the differential expression analysis (Figure below).
I am working on understanding the topGO R package (Alexa et al. 2006) to test for gene ontology enrichment from the Pocillopora damicornis wound healing transcriptomics study.