Acer CCC KEGG Pathway analysis issues

I presented my data chapters 2 and 3 to NTK’s lab on Friday, and she pointed out something I hadn’t thought of - that even though the most significant KEGG pathway in my Acer CCC vs. Nursery analysis was estrogen signaling, the DGEs themselves that had KO terms which annotate to this KEGG pathway could also be involved in many other pathways. Looking at the full list of genes with the KEGG terms here, you can see that a lot of the genes are involved in many potential pathways. So why did the KEGG analysis give me the estrogen signaling pathway specifically?

Read More

2024-02-12-noteaboutannotationfilesfromMS.md

I realized that when Michael re-annotated the Locatelli unpublished Acer genome in December, he updated the names of everything (which no longer match my counts matrix). Basically when we first started this pipeline, all the Acropora gene names were “Acropora_” before the gene number. But in the recent re-annotation, Michael named everything “Acervicornis” with no underscore. I ran into issues when I was trying to run the GO_MWU scripts because I was using the “Acervicornis_iso2go.tab” file but trying to match the “Acropora_” gene names (which none of them matched of course).

Read More

Issue with FvFm code

I was looking at the full dataset of all the IPAM data to see if there were any issues, because when I started trying to remove NAs from the GLM for FvFm treatment data, I started getting errors.

Read More

Fv/Fm GLMMs

I am revisiting my Fv/Fm code for my Chapter 2 stress-hardening analysis. When I first started this, I tried to follow code from Cunning et al. 2021, who assessed A.cervicornis outplants using CBASS across Florida. However, I found other publications that applied different ways of analyzing this data that were more straightforward (I think because Cunning et al. 2021 had a lot more variation to account for due to the different sites across space and time, and also the authors added corrections for light parameters in each CBASS tank and for use of a Diving PAM). I started playing around with other codes and eventually got really deep in the weeds of this.

Read More

UTR gff parser

I’m going to try to run this again because Nick said he had no issues doing it. I think maybe if I try to run gtf_advanced_parser.py first on the genomic.gff file downloaded from NCBI, then maybe I can see if that works and I can find the “three_prime_UTR” annotations in the gff file.

Read More

new Acer genome continued

Previous ways I tried to download it: ```{bash} wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/032/359/415/GCA_032359415.1_NEU_Acer_K2/GCA_032359415.1_NEU_Acer_K2_genomic.gtf.gz wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/032/359/415/GCA_032359415.1_NEU_Acer_K2/GCA_032359415.1_NEU_Acer_K2_genomic.fna.gz gunzip GCA_032359415.1_NEU_Acer_K2_genomic.gtf.gz gunzip GCA_032359415.1_NEU_Acer_K2_genomic.fna.gz

Read More

Trouble with Fv/Fm Data

I’m revisiting the IPAM data from the Chapter 2 stress-hardening experiment, and I’m having a really hard time wrapping my head around it.

Read More

DESeq2 design formula

I have started working through Dr. Michael Studivan’s DESeq code, but I am having trouble figuring out how best to include all of my explanatory variables in the DESeq model. In his study, he has time and treatment as explanatory variables. He created a column that combined the two: time.treatment. See screenshot of data table with design:

Read More

PCR duplicates

I met with Dr. Michael Studivan last week to discuss plans for DNA and RNA extractions of the Ch 3 Pstr samples. During this meeting, we talked about the results from the Ch 2 RNA-seq data, and he was concerned with how high the number of passed reads there were post-filtering step (see the multiqc report). He said that when he has run TagSeq analysis, he usually get around 60% loss of reads when going from raw reads to trimmed reads. Whereas for mine, I got >95% retention of reads which passed the fastp filter. So what’s happening here? Did fastp not work?

Read More

Ch4_AcerCCC read counts

I’m writing up a summary for the Ch4 samples now, and I want to generate a summary table with the average and standard deviation of raw reads, trimmed reads, and percent alignment rates.

Read More

pre-filtering samples based on million reads

I have been writing up a summary of the gene expression analysis in advance of my committee meeting, but I realized something that is inconsistent between different people’s codes and I’m not sure what the baseline or “correct” way is.

Read More

redo Fastp for Ch2 samples

I need to redo the Fastp step because in comparing the fastqc reports of the raw reads versus the “trimmed” reads, they are the exact same and it appears that nothing was trimmed. I also think Fastp didn’t work correctly because there is only one fastp.html and fastp.json file for the entire list of samples, and it corresponds to Pcli-148 (the last sample of the batch). So I think it overwrote things and maybe didn’t process correctly.

Read More

stringtie2 for Ch2 temp variability

Stringtie is “a fast and efficient assembler of RNA-Seq alignments to assemble and quantitate full-length transcripts. putative transcripts. For our use, we will input short mapped reads from HISAT2. StringTie’s ouput can be used to identify DEGs in programs such as DESeq2 and edgeR” (from Sam’s github)

Read More

STAR index and alignment for Pcli

Ok so since I have STAR already installed locally on my scratch space, I’m going to try to do the genome index and alignment using STAR for the Pcli transcriptome. ChatGPT says you can just run it the same way as if you were running a genome index.

Read More

renaming trimmed files

Following the fastp trimming code, all the file names ended in “.clean.processed” and that prevented them from being recognized as fastq.gz files for the next step in the pipeline. So I needed to write a for-loop to remove those suffices from all file names.

Read More

Pcli transcriptome

Michael Studivan created a Google Drive of different species’ transcriptomes, and the one for Pclivosa comes from Avila-Magana et al. 2021. I’m not sure if Michael separated out everything from the metatranscriptome, but the file I have downloaded is called “Dip_Host.fna.gz”. So it is just the coral host.

Read More

Ch4_AcerCCC updated gff STAR alignment results

The STAR alignment job finished, so I quickly ran multiqc to compare the original alignment, the first round of updated annotations, and the most recent updated gff3 file. What’s interesting is all their alignment rates look the same…

Read More

md5 check sums script

So I followed the first script from Sam and Ariana to do the “md5 check sums” thing. I guess the purpose of it is to check the integrity of the files after downloading them from Basespace. Idk why it’s necessary but everyone in Hollie’s lab seems to do it, so it must be important! However, I submitted this job on Pegasus yesterday for the Ch2_tempvariability2023 raw fastq.gz files and the job had still not yet started today:

Read More

Starting over

I think I need to start from the beginning and go through one pipeline that has been proven for that person (or lab group) to work time and time again, rather than try to frankenstein pieces of people’s codes together to get something to work (which is what I have tried thus far and haven’t been successful). It also helps to find a person with well-annotated code. Thankfully, Hollie Putnam’s lab has tried-and-true methods that go back to pipelines of Dr. Sam Barr and Dr. Ariana Huffmyer.

Read More

Acer CCC versus Acer Stress-Hardening Samples

So, the major difference between the CCC samples and the stress-hardening samples is that the first were sequenced using Lexogen 3’ QuantSeq, and the second were sequenced following the TagSeq protocol, which is a “specialty” library prep specifically tested and refined with coral samples. So although both projects were sequenced on the same type of machine with the same goal for output (NovaSeq S2 SE100), the results are vastly different.

Read More

Subread and FeatureCounts

I am trying to see if I can still get gene counts on the Acer CCC samples without using StringTie, since that has been difficult for me to get to work. I adapted this featureCounts code from my wound-healing project and submitted the job to pegasus:

Read More

stringtie on SH samples

I am now going to try the stringtie scripts on the stress-hardening Acer samples and see if I get better results than the Acer CCC ones.

Read More

STAR output to gene counts

I can’t get stringtie to work so I want to try Natalia’s method instead, where she took the STAR read counts and somehow turned that into a gene count matrix.

Read More

samtools for STAR alignment results

So I went down the path of using samtools after STAR alignment because it seems that it is used as a quality-checking tool to see if alignment worked. I thought multiQC did that but idk what the difference is. So I tried first running Danielle’s code (https://github.com/daniellembecker/DanielleBecker_Lab_Notebook/blob/master/_posts/2021-04-14-Molecular-Underpinnings-RNAseq-Workflow.md) but then realized her sequences are paired-end reads and that’s why the code wasn’t working.

Read More

following HBC pipeline - qualimap and salmon

I keep returning to this GitHub page time and time again, and it continues to be the best resource I can find to explain to me why certain tools are used, as well as provide citations for tested tools that improve accuracy and precision over others. Here is the link to the overview: https://github.com/hbctraining/DGE_workshop_salmon_online/blob/master/lessons/01a_RNAseq_processing_workflow.md

Read More

Aligning raw reads versus trimmed reads results for Acer CCC samples

The results of the STAR alignment for the trimmed reads were a little strange looking (especially in comparison to the stress-hardening samples, those look way better), so I tried aligning the raw reads and seeing if I got better results (higher alignment and less unmapped reads). Interestingly, it looks like trimming improved alignment rates (trimmed samples on left, raw reads on right):

Read More

stringtie articles

I just wanted to put all these somewhere for when I need them. Stringtie is what Jill Ashey used following STAR alignment. In the past I used Subread (I think?) for Pdam.

Read More

STAR index Acer genome

The first step when aligning reads to the genome when using STAR is to first create an index based on the most up-to-date genome assembly. I downloaded a 2019 draft assembly from Baums et al. (https://usegalaxy.org/published/history?id=1f8678b27ae56467 and https://usegalaxy.org/published/page?id=2f8d21c73f8501e2 for file descriptions based on Apal files).

Read More

STAR alignment troubleshooting

I’m looking at Young et al. 2020 methodology because they used Iliana Baums’ Apal genome, which is also available on Galaxy, so maybe the code required will be similar.

Read More

initial attempt STAR alignment codes

Here is the script Natalia used to set up the STAR genome index (https://github.com/China2302/SCTLD_RRC/blob/main/hpc/STAR_index.sh): ```{bash} #!/bin/bash #BSUB -J star_index #BSUB -q general #BSUB -P c_transcriptomics #BSUB -n 8 #BSUB -o /scratch/projects/c_transcriptomics/Genomes/Ofav/star_index%J.out #BSUB -e /scratch/projects/c_transcriptomics/Genomes/Ofav/star_index%J.err #BSUB -u nxa945@miami.edu #BSUB -N

Read More

notes on alignment tools

In the past, for the Pdamicornis wound healing paper, I used STAR to align the 3’ RNAseq reads to the genome, not the transcriptome.

The notes I’m relying on for this round of analysis are from Dr. Natalia Andrade (https://github.com/China2302/transcriptomics-workflow) and Dr. Michael Studivan (https://github.com/mstudiva/tag-based_RNAseq/blob/master/). The reason for this is because I sequenced the Acer CCC samples with Natalia’s SCTLD Mote and Nova samples, while the temperature variability 2023 Acer and Pcli samples were sequenced following Michael’s protocols.

So far, because Michael’s stuff is all in python, I have had more success with Natalia’s pipeline. However, the QC and trimming steps are not super specific, and they both used cutadapt to trim Illumina adapters. Michael’s code had more specifics for trimming (i.e. specific base pair sequences) that are potentially more likely to be found in TagSeq generated reads. Natalia specifically trimmed the polyA tails because that is a problem in 3’ RNA seq.

But for alignment, there are many different possible tools to use, and there is also the question of whether to align to a genome or a transcriptome.

The UM CCS student mentors made a pros and cons list for aligning to a genome versus a transcriptome (https://github.com/ccsstudentmentors/tutorials/tree/master/RNA-Seq/Quantifying-RNA-Expression). I think it also depends on what is available for the species you’re analyzing.

Transcriptome – then use Bowtie.

Genome – use STAR.

After talking to Dr. Kevin Wong, he said to just pick whichever gives the best alignment rate.

In terms of which tool to use, STAR seems to be the superior aligner:

  1. https://www.biostars.org/p/353946/
  2. Screen Shot 2023-06-15 at 8 17 33 AM
Read More

some wins against Pegasus

After meeting with Anthony Bonacolta, I finally got some help as to why my job submissions to Pegasus weren’t working. First, the alias “compute” that I created includes “bsub -P and_transcriptomics” and I was trying to do compute .sh every time I submitted a job. When I did it this way, it also wasn’t recognizing when i put something in the bigmem queue and would put it in the general queue.

Read More

WGCNA for wound healing manuscript pt. 2

I’ve been feeling like the heatmap I created doesn’t quite make sense, because I feel like the only thing that should be on the x axis is condition x hour, not each individual like I have it currently:

Read More

WGCNA for wound healing manuscript

I’m currently working through the WGCNA tutorial for the wound healing dataset (following https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html and I specifically did the step-by-step network construction, not the automatic one).

Read More

Trimmomatic vs. Cutadapt

Before even diving into the benefits of Trimmomatic versus Cutadapt on 3’ RNA-seq data, I remembered a 2020 publication which said that adapter and low-quality base trimming was actually not even necessary before alignment, and that the “soft-clipping” removal of these short sequences during the alignment stage using Subread was more successful at not having false positives (and removing relevant data) while still removing 94% of adapter sequences. Here is the link to the paper: https://academic.oup.com/nargab/article/2/3/lqaa068/5901066

Read More

Downloading stress-hardening sequences from BaseSpace

I think I found the way to use the BaseSpace Downloader GUI - I got two sample sequences to download successfully (Acer_005 and Acer_019). I think trying to download the project itself doesn’t work for whatever reason, as it keeps failing repeatedly. It could be because I’m trying to download it straight to an external hard drive, but it’s 49GB so I can’t download that to my computer.

Read More

Adding gene lists to GO tables

I want to figure out which specific genes are corresponding to each GO term – right now it’s just number of genes reported in the “Significant” column in the results table.

Read More

Re-running TopGO with Up vs. Downregulated genes split up

I just checked and I did use Fisher’s Exact Test (re: last post), so I don’t need to redo that. But I need to redo a couple things. First, I need to split up the Up and Downregulated gene lists and run TopGO separately. Next, I need to make sure that i filtered L2FC < or > 0, instead of 2 which I did for the volcano plots. 2 is very, very strict, and all the papers I’m reading count differential expression as p-adjusted < 0.05 and L2FC < or > 0. My annotated gene lists that I saved from each DESeq2 object for each hour are filtered based on p-adjusted < 0.05, but no specific L2FC. So that’s good, that means the only thing I need to change is the volcano plots. Although I did a lot of manual editing in Illustrator, so maybe I can just leave that as is. I just need to make sure that for TopGO, I count any upregulated genes as L2FC >0, and downregulated genes as L2FC < 0.

Read More

Separating Up- and Down-Regulated Genes for GO Analysis?

I have returned to the GO analysis for each hour, as I was talking with Kevin and he mentioned to separate the up and down-regulated gene lists so that you know which pathways are being upregulated versus downregulated. But then I ended up in a deep-dive of GO enrichment analysis, and whether it’s appropriate to split up the DEG list by directionality. Turns out this is a contested debate and ultimately depends on your question (I think).

Read More

GLMMS for Stress-Hardening IPAM Data

Based on the PPT from Kevin, I think my Fv/Fm data falls into a Percentage/Proportion data type, where numerical data is bound from 0-1. The assumption for this is that there is overdispersion (the mean is < the variance). The code provided in the PPT as an example was: glmmTMB::glmmTMB(survival ~ treatment + (1|sample), family= list(family=”beta”, link=”logit”)

Read More

Figuring out why R code for importing PAM data stopped working

I realized that there were some issues with the RStudio on my AOML desktop versus my laptop. First, I had updated the RStudio on my computer, so it messed up one of the packages that was needed for Ross Cunning’s IPAMtoR custom program. I tried to install devtools to install this package (called “joeyr”) but AOML wasn’t letting me do that for some reason, or something else needed to be updated or something. I got it to work on my laptop though. Then, once I actually got it to work, I have been having issues importing the csv files that were downloaded from the IPAM software.

Read More

GO Analysis

In doing the gene ontology enrichment analysis for biological processes (comparing control vs. wounded at each time point), we don’t get any meaningful GO terms that pop up.

Read More

topGO Analysis

I am working on understanding the topGO R package (Alexa et al. 2006) to test for gene ontology enrichment from the Pocillopora damicornis wound healing transcriptomics study.

Read More

Hello! This is a test!

Next you can update your site name, avatar and other options using the _config.yml file in the root of your repository (shown below).

Read More