following HBC pipeline - qualimap and salmon
I keep returning to this GitHub page time and time again, and it continues to be the best resource I can find to explain to me why certain tools are used, as well as provide citations for tested tools that improve accuracy and precision over others. Here is the link to the overview: https://github.com/hbctraining/DGE_workshop_salmon_online/blob/master/lessons/01a_RNAseq_processing_workflow.md
So although samtools and stringtie could be good things to use, I’m just going to try to follow the HBC training from now on because it seems to be in agreement with the current bioinformatics literature. STAR keeps coming up as a good option for genomic alignment of reads, and Salmon is the top performer in terms of quantifying expression using lightweight alignment. So I’ll move forward with that (unless it doesn’t work at all on Pegasus).
This is directly from the training:
“5. Quality control of aligned sequence reads (STAR/Qualimap) As mentioned above, the differential gene expression analysis will use transcript/gene pseudocounts generated by Salmon. However, to perform some basic quality checks on the sequencing data, it is important to align the reads to the whole genome. Either STAR or HiSAT2 are able to perform this step and generate a BAM file that can be used for QC.
A tool called Qualimap explores the features of aligned reads in the context of the genomic region they map to, hence providing an overall view of the data quality (as an HTML file). Various quality metrics assessed by Qualimap include:
- DNA or rRNA contamination
- 5’-3’ biases
- Coverage biases”
So, I have the BAM files from using STAR to perform read alignment to the whole genome. So next lets install qualimap (http://qualimap.conesalab.org/).
wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.3.zip
unzip qualimap_v2.3.zip
cd qualimap_v2.3/
source qualimap # this didn't work
make qualimap # this didn't work
there is no source file for qualimap like there was for STAR so maybe I don’t need to compile it? I’ll just try running it and see how it goes.
cd /scratch/projects/and_transcriptomics/Allyson_CCC/aligned
/scratch/projects/and_transcriptomics/programs/qualimap_v2.3/qualimap
I tried running the /scratch/projects/and_transcriptomics/programs/qualimap_v2.3/qualimap line directly in the command line after moving to the Allyson_CCC/aligned folder, and it came out with this error: “Can’t connect to X11 window server using ‘:0.0’ as the value of the DISPLAY variable.”
I think actually maybe what’s happening is that qualimap is a GUI (graphical user interface) so when I try to run it in Pegasus it can’t export it to a desktop for me to view it. So I think I actually need to download the .bam files so that they’re local and then run qualimap on my computer.
but which .bam do I use? Aligned.sortedByCoord.out.bam versus Aligned.toTranscriptome.out.bam?
Ok wait I found this thread that explains how by default Qualimap will try to open a GUI. so you need to run the “unset DISPLAY” command in the terminal. I also need to add the location of Qualimap to the PATH variable. When I open my bash profile, I think I have the PATH variable updated to include everything in the programs folder because I have this line: “export PATH=$PATH:/scratch/projects/and_transcriptomics/programs”.
Ok, when I do that I still get the same error. But on the HBC training link it said to specify “rnaseq” when running qualimap, and when I do that I get a different error message so that’s progress I guess. I’m going to try to submit a job similar to what they have.
Ok so for using qualimap rnaseq you specifically need a GTF file, not a GFF file . So I am going to try using bamqc instead to get general region quality results.
#!/bin/bash
#BSUB -J qualimap
#BSUB -q general
#BSUB -P and_transcriptomics
#BSUB -n 8
#BSUB -o qualimap%J.out
#BSUB -e qualimap%J.err
#BSUB -u and128@miami.edu
#BSUB -N
and="/scratch/projects/and_transcriptomics"
cd "/scratch/projects/and_transcriptomics/Allyson_CCC/aligned"
data=($(ls *Aligned.sortedByCoord.out.bam))
for sample in ${data[@]} ;
do \
${and}/programs/qualimap_v2.3/qualimap bamqc \
-outdir ${and}/Allyson_CCC/aligned \
-bam ${and}/Allyson_CCC/aligned/${sample} \
-gff ${and}/genomes/Acer/Acerv_assembly_v1.0.gff3 \ ; \
done
I keep getting java errors, maybe I need to load a java module before running.
#!/bin/bash
#BSUB -J qualimap
#BSUB -q general
#BSUB -P and_transcriptomics
#BSUB -n 8
#BSUB -o qualimap%J.out
#BSUB -e qualimap%J.err
#BSUB -u and128@miami.edu
#BSUB -N
and="/scratch/projects/and_transcriptomics"
cd "/scratch/projects/and_transcriptomics/Allyson_CCC/aligned"
data=($(ls *Aligned.sortedByCoord.out.bam))
module load java/7
module load R/4.1.0
for sample in ${data[@]} ;
do \
${and}/programs/qualimap_v2.3/qualimap bamqc \
-bam ${and}/Allyson_CCC/aligned/${sample} \
-outdir ${and}/Allyson_CCC/aligned \
-gff ${and}/genomes/Acer/Acerv_assembly_v1.0.gff3 \ ; \
done
Ok I think I got it to work!
A couple things to note: 1) the code I wrote overwrites each output file because I didn’t specify file names in the -outdir flag. 2) the qualimap bamqc report doesn’t give me any useful info that multiqc didn’t already give me. so idk why this is useful.
do i need to do qualimap or samtools????? if I had a GTF file, qualimap rnaseq looks like it would be helpful because it provides info about exons vs introns, etc. but for whatever reason the gff file doesn’t work in qualimap.
Going back to the HBC training:
“Having completed the alignment, the first thing we want to know is how well did our reads align to the reference. Rather than looking at each read alignment, it can be more useful to evaluate statistics that give a general overview for the sample. One of the output files from the STAR aligner contains mapping statistics, let’s take a closer look at one of those files.”
Ok, so look at each Log.final.out file and look for the mapping rates and the number of multimappers. (it also gives this in the multiqc report…)
They say that “at least 75% of the reads should be uniquely mapped. Once values start to drop lower than 60% it’s advisable to start troubleshooting. The lower the number of uniquely mapping reads means the higher the number of reads that are mapping to multiple locations. It is best to keep this number low because multi-mappers are not included when we start counting reads NOTE: The thresholds suggested above will vary depending on the organism that you are working with. Much of what is discussed here is in the context of working with human or mouse data. For example, 75% of mapped reads holds true only if the genome is good or mature. For badly assembled genomes we may not observe a high mapping rate, even if the actual sequence sample is good.”
Based on my alignmnet scores from STAR multiqc report, there’s a lot of variability with the samples.
But it doesn’t exactly say what to do with it. Like do I remove these files? I already have so few samples.
Regardless, I the tutorial also has this good section of text that explains how the field has moved towards the use of STAR then Salmon: “To determine where on the human genome our reads originated from, we will align our reads to the reference genome using STAR (Spliced Transcripts Alignment to a Reference). STAR is an aligner designed to specifically address many of the challenges of RNA-seq data mapping using a strategy to account for spliced alignments. We will not go into detail about how STAR works, but if you are interested in undertanding the alignment strategy we have some materials linked here.
NOTE: Until recently, the standard approach for RNA-seq analysis has been to map our reads using a splice-aware aligner (i.e STAR) and then use the resulting BAM files as input to counting tools like featureCounts and htseq-count to obtain our final expression matrix. The field has now moved towards using lightweight alignment tools like Salmon as standard practice, and so we only use STAR for generating a BAM file. If you are interested in knowing more about the standard approach we have some materials linked here.”
Ok well I can’t even use Salmon anyways because I don’t have a reference transcriptome. So I guess that’s why Stringtie is used…
i’m gonna keep creating memes to deal with the pain