Using SortmeRNA
Using SortMeRNA for removing rRNA
- Looking online, there seems to be some thoughts that removing rRNA reads is not required when Poly-A selection is performed during sequencing, as preferentially selecting polyA tails automatically creates a bias towards mRNA (see this article and this article).
- However, some suggest that removing rRNA would save computation time and lead to a cleaner assembly (see here).
- This tutorial/paper is a good resource - it says that if GC content is abnormal, it could be a sign of high rRNA content (GC content of over 50% is typical for rRNA) and then SortMeRNA should be used. They also said “if there is any doubt, this step should be performed.”
Based on the multiQC report of my trimmed reads, it looks like the average GC content is not normally distributed and is higher for some samples. To be safe, I’ll follow Brad’s code and run SortMeRNA on it.
Update
So I ran SortMeRNA. Now what? I don’t want to do the BBSplit step because I’m not looking to see what aligns to members of the microbiome. I think I just want to skip to Trinity. But what files from SortMeRNA do I use to do that?
When I look at Brad’s next steps for BBSplit and Trinity, it looks like he renamed his fastq files from SortMeRNA.
The results of SortMeRNA are directories for each sample.
According to the SortMeRNA user manual:
- “kvdb” = “key-value datastore for alignment results”
- idx = “index database”
- “out” = “output files like aligned.blast”
It doesn’t mention the other folder I have, which is “readb”, and that’s the one I found .fq.gz files in, which look like what I would work with?
Ok I found on this website more info: it says that “readb” means “pre-processed reads/index”.
ChatGPT said that my code isn’t working because I didn’t assign it a name for the aligned versus unaligned reads. Idk… Working on trying a new code it suggested now:
#!/usr/bin/env bash
# Define project directories and paths
and="/scratch/projects/and_transcriptomics"
project="and_transcriptomics"
projdir="${and}/reciprocaltransplant"
readsdir="${projdir}/raw_seq_files/trimmed_cutadapt"
logdir="${projdir}/logs/sortmeRNA"
cd "${readsdir}"
# Loop through R1 files
for r1 in *_R1.trimmed.fastq.gz; do
# Extract sample base name (e.g., "Pstr-Dec2022-202")
base="${r1%_R1.trimmed.fastq.gz}"
# Submit the job directly
bsub <<EOF
#!/usr/bin/env bash
#BSUB -P ${project}
#BSUB -J ${base}_sortmeRNA
#BSUB -e ${logdir}/${base}_sortmeRNA.err
#BSUB -o ${logdir}/${base}_sortmeRNA.out
#BSUB -q bigmem
#BSUB -n 10
cd "${readsdir}"
${and}/programs/sortmerna-4.3.6-Linux/bin/sortmerna \\
--ref ${and}/programs/sortmerna-4.3.6-Linux/database/smr_v4.3_default_db.fasta \\
--reads ${readsdir}/${base}_R1.trimmed.fastq.gz \\
--reads ${readsdir}/${base}_R2.trimmed.fastq.gz \\
--aligned ${base}_rRNA \\
--other ${base}_non_rRNA \\
--fastx \\
--out2 \\
--sout \\
--log \\
--workdir ${projdir}/raw_seq_files/sortmerna/${base}
EOF
done
Update: this script sat in the job queue for 2 days. I don’t think it will work.
I also realized that I don’t think it’s using the files that sortmeRNA already made. I’m having chatGPT write a few versions to try to see what will work.
Sept 17 2025 Update:
-
I realized in the original error files for sortmeRNA, I got “run time limit reached” flags for each job. So I think what happened was that they all timed out at around 6 hours. In an unrelated note, but may be related, the Pegasus technicians told me that I needed to start using pegasus2 rather than pegasus. I discovered this when I tried to check the status of my jobs, and the LSF “daemon” wasn’t responding.
-
So I have done that, and I’m basically trying to re-run the original script that I got to work, but that timed out. I added the specific file names for –aligned and –other in the hopes that it would help. I also added a longer time limit request in the job submission, so hopefully that helps.
The job I ran most recently is called “6_sortmerna.sh” on Pegasus and looks like this:
#!/usr/bin/env bash
# Define project directories and paths
and="/scratch/projects/and_transcriptomics"
project="and_transcriptomics"
projdir="${and}/reciprocaltransplant"
readsdir="${projdir}/raw_seq_files/trimmed_cutadapt"
logdir="${projdir}/logs/sortmeRNA"
cd "${readsdir}"
# Loop through R1 files
for r1 in *_R1.trimmed.fastq.gz; do
# Extract sample base name (e.g., "Pstr-Dec2022-202")
base="${r1%_R1.trimmed.fastq.gz}"
# Write the job script
cat <<EOF > "${projdir}/scripts/sortmeRNA/${base}_sortmeRNA.job"
#!/usr/bin/env bash
#BSUB -P ${project}
#BSUB -J ${base}_sortmeRNA
#BSUB -e ${logdir}/${base}_sortmeRNA.err
#BSUB -o ${logdir}/${base}_sortmeRsNA.out
#BSUB -q bigmem
#BSUB -n 10
#BSUB -W 48:00
#BSUB -R "rusage[mem=6000]"
cd "${readsdir}"
${and}/programs/sortmerna-4.3.6-Linux/bin/sortmerna --ref ${and}/programs/sortmerna-4.3.6-Linux/database/smr_v4.3_default_db.fasta \
--reads ${readsdir}/${base}_R1.trimmed.fastq.gz \
--reads ${readsdir}/${base}_R2.trimmed.fastq.gz \
--aligned --other --fastx --blast --out2 --sout \
--workdir ${projdir}/raw_seq_files/sortmerna/${base}
EOF
# Submit the job
bsub < "${projdir}/scripts/sortmerna/${r1}_sortmerna.job"
done
but that last line about submitting the jobs doesn’t work, so I’ll need to manually submit all the jobs.
I ran this in the command line:
for f in *.job; do
bsub < "$f"
done
Ok, the “bigmem” queue ones never submitted and were still pending a day later, so I used bjobs -p
to see why, and it said this: “Not enough processors to meet the job’s spanning requirement: 1 host;”
So I changed the parameters to “normal” queue and -8 processors, and it ran. But then all the jobs exited out because of the following issue: “Please, ensure the directory “/scratch/projects/and_transcriptomics/reciprocaltransplant/raw_seq_files/sortmerna/Pstr-Nov2023-198/kvdb” is Empty prior running ‘sortmerna’”.
According to chatGPT, having a pre-filled “kvdb” folder is the only one that sortmerna cares about and will refuse to rewrite over.
So now I’m running this script to erase those subfolders for all samples in advance, then trying again.
#!/usr/bin/env bash
# Define project directories and paths
and="/scratch/projects/and_transcriptomics"
project="and_transcriptomics"
projdir="${and}/reciprocaltransplant"
readsdir="${projdir}/raw_seq_files/trimmed_cutadapt"
logdir="${projdir}/logs/sortmeRNA"
cd "${readsdir}"
# Loop through R1 files
for r1 in *_R1.trimmed.fastq.gz; do
# Extract sample base name (e.g., "Pstr-Dec2022-202")
base="${r1%_R1.trimmed.fastq.gz}"
# Write the job script
cat <<EOF > "${projdir}/scripts/sortmeRNA/${base}_sortmeRNA.job"
#!/usr/bin/env bash
#BSUB -P ${project}
#BSUB -J ${base}_sortmeRNA
#BSUB -e ${logdir}/${base}_sortmeRNA.err
#BSUB -o ${logdir}/${base}_sortmeRsNA.out
#BSUB -q normal
#BSUB -n 8
#BSUB -W 48:00
#BSUB -R "rusage[mem=6000]"
# Clean up old SortMeRNA outputs for this sample
rm -rf ${projdir}/raw_seq_files/sortmerna/${base}/kvdb
cd "${readsdir}"
${and}/programs/sortmerna-4.3.6-Linux/bin/sortmerna --ref ${and}/programs/sortmerna-4.3.6-Linux/database/smr_v4.3_default_db.fasta \
--reads ${readsdir}/${base}_R1.trimmed.fastq.gz \
--reads ${readsdir}/${base}_R2.trimmed.fastq.gz \
--aligned --other --fastx --blast --out2 --sout \
--workdir ${projdir}/raw_seq_files/sortmerna/${base}
EOF
# Submit the job
bsub < "${projdir}/scripts/sortmerna/${r1}_sortmerna.job"
done
Sept 22 update
Just checked the log files from the most recent sortmerna run, and now there’s some fun new errors.
[write_a_read:215] ESC[0;31mERRORESC[0m: Failed deflating readstring:
[write_a_read:215] ESC[0;31mERRORESC[0m: zlib status: -1
[append:328] ESC[0;31mERRORESC[0m: Failed deflating readstring:
[append:328] ESC[0;31mERRORESC[0m: zlib status: -1
ChatGPT gave some bs reasons why but I honestly am just starting to give up on this program.
Ok, I discovered that a subset of samples did work (24 out of 96):
Ok, I’m not sure why the rest didn’t work, I’m just going to rerun them now:
#!/usr/bin/env bash
# Define directories
and="/scratch/projects/and_transcriptomics"
project="and_transcriptomics"
projdir="${and}/reciprocaltransplant"
readsdir="${projdir}/raw_seq_files/trimmed_cutadapt"
logdir="${projdir}/logs/sortmeRNA"
sortdir="${projdir}/raw_seq_files/sortmerna"
scriptdir="${projdir}/scripts/sortmerna"
# Make sure scriptdir and logdir exist
mkdir -p "${scriptdir}" "${logdir}"
# Loop through R1 trimmed reads
cd "${readsdir}" || exit
for r1 in *_R1.trimmed.fastq.gz; do
base="${r1%_R1.trimmed.fastq.gz}"
# Check if the SortMeRNA out folder exists and has files
out_folder="${sortdir}/${base}/out"
if [ -d "$out_folder" ] && [ "$(ls -A "$out_folder" 2>/dev/null)" ]; then
echo "Skipping ${base} (output already exists)"
continue
fi
# Write the LSF job script
cat <<EOF > "${scriptdir}/${base}_sortmeRNA.job"
#!/usr/bin/env bash
#BSUB -P ${project}
#BSUB -J ${base}_sortmeRNA
#BSUB -e ${logdir}/${base}_sortmeRNA.err
#BSUB -o ${logdir}/${base}_sortmeRNA.out
#BSUB -q normal
#BSUB -n 8
#BSUB -W 48:00
#BSUB -R "rusage[mem=6000]"
# Clean up old SortMeRNA outputs for this sample
rm -rf ${projdir}/raw_seq_files/sortmerna/${base}/kvdb
cd "${readsdir}"
${and}/programs/sortmerna-4.3.6-Linux/bin/sortmerna \
--ref ${and}/programs/sortmerna-4.3.6-Linux/database/smr_v4.3_default_db.fasta \
--reads ${readsdir}/${base}_R1.trimmed.fastq.gz \
--reads ${readsdir}/${base}_R2.trimmed.fastq.gz \
--aligned ${sortdir}/${base}/${base}_rRNA \
--other ${sortdir}/${base}/${base}_non_rRNA \
--fastx --blast --out2 --sout \
--workdir ${sortdir}/${base}
EOF
# Submit the job
bsub < "${scriptdir}/${base}_sortmeRNA.job"
done
Ok, this didn’t work again. This time the error was “std::out_of_range / stoi”. Apparently, ChatGPT is saying that version 4.3.6 of SortMeRNA is known to have a bug to cause crashes if there are any differences between reads 1 and 2 that it can’t parse, I guess even if the length differ by 1 base pair. I checked my MultiQC report and I don’t think that’s the issue. But ChatGPT recommended removing these flags --blast, --out2, --sout
and trying again. Otherwise, I may need to downgrade to a previous version of SortMeRNA.
Sept 22
Ok, I think now what’s happening is I maxed out the space quota on my Pegasus account. I was at 2 T of files. I deleted the Trim Galore folder because I am using the cutadapt trimmed version files now, and that freed up ~500 GB of space. Now I’m re-running sortmeRNA, but I’m having it also delete all the subfolders for the samples that didn’t work, because there could potentially be issues with having partial files written already in there.
Sept 23
I discovered this open thread on the sortmerna GitHub that has the error I keep getting. It looks like there is a bug. If the file is larger than 2 GB, it stops working.
Ok, I discovered anything greater than 20 million reads is not working in sortmerna. This equates to around 2 GB (see screenshot). Also, why is R2 larger than R1 if they have the same amount of reads?
Here are the other threads in sortmerna issues on GitHub:
- https://github.com/sortmerna/sortmerna/issues/429
- https://github.com/sortmerna/sortmerna/issues/445
- https://github.com/sortmerna/sortmerna/issues/421
- https://github.com/sortmerna/sortmerna/issues/419
- https://github.com/sortmerna/sortmerna/issues/424
- https://github.com/sortmerna/sortmerna/issues/340
- https://github.com/sortmerna/sortmerna/issues/305
- https://github.com/sortmerna/sortmerna/issues/405
- https://github.com/sortmerna/sortmerna/issues/326
The last few are more related to resource allocation on HPCs.
So far, the only suggestion I see is to update to use sortmerna v4.3.7.
Someone talked about running it in parallel on several split up files: https://github.com/sortmerna/sortmerna/issues/413
On this thread they seem to have made a good script for doing the split: https://github.com/sortmerna/sortmerna/issues/336
One flag i saw that could be helpful is the “–no-best” function, since I’m just trying to see if any sequence aligns to any rRNA, not what specific rRNA it is (see this thread: https://github.com/sortmerna/sortmerna/issues/437).
Also I saw this flag get suggested: -threads 8
And, I’m going to put this job back in bigmem instead of general because that could be an issue too.
Ok now nothing is working, not even the original scripts. It actually looks like I may have pre-emptively killed the jobs that were running, because it did look like things were happening in some of the folders.
In the original run where I got the samples with under 2 GB of data to work:
- the script had the following flags: –fastx –blast –out2 –sout
- they have /out directories
- they have “aligned.blast.gz”, “aligned_paired_fwd.fq.gz”, and paired_rev, singleton, other_paired and other_singleton
- their .out files (which have the weird RsNA in the name of the file) have something like “[writeReports:268] === done Reports in 2594.41 sec ===”
- they have an aligned.log that has a date of completion at the very end
In my latest run, which it appears some worked:
- I had removed –blast –out2 –sout.
- they do not have /out directories
- their .out files have [writeReports:268] === done Reports in 4303.29 sec ===
- they only have “_non_rRNA.fq.gz” and “_rRNA.fq.gz”
- they don’t have an aligned.log, they have an _rRNA.log
Based on Brad’s code for next steps, he uses the other_paired_fwd.fq.gz and other_paired_rev files. So essentially i need the flags –out2 and -sout. I’m not sure if I need the –blast one. but the latest run i was doing only gave a _non_rRNA one.
Note:
I’m using the default database that sortmerna recommended for alignment: smr_v4.3_default_db.fasta
- Original source databases (clustering parameters given below):
- Silva 138 SSURef NR99 (16S, 18S)
- Silva 132 LSURef (23S, 28S)
- RFAM v14.1 (5S, 5.8S) bac-16S 90%, 5S & 5.8S seeds, rest 95% (benchmark accuracy: 99.899%)
nf-core
This program is pre-built with scripts and uses sortmerna. There are a couple of versions:
- rnaseq: https://github.com/nf-core/rnaseq/tree/3.21.0
- denovotranscript: https://github.com/nf-core/denovotranscript/tree/main
If you go into their GitHub and look at the modules (in the main.nf doc), you’ll see the sortmerna parameters they use.
For denovotranscript:
sortmerna \\
${'--ref '+fastas.join(' --ref ')} \\
$reads_input \\
--threads $task.cpus \\
--workdir . \\
--aligned rRNA_reads \\
--fastx \\
--other non_rRNA_reads \\
$paired_cmd \\
$out2_cmd \\
$args
For rnaseq:
"""
sortmerna \\
${'--ref '+fastas.join(' --ref ')} \\
$refs_input \\
$reads_input \\
--threads $task.cpus \\
--workdir . \\
$reads_args \\
$paired_cmd \\
$out2_cmd \\
$args
This tells me that you don’t need sortmerna to separate the paire-end reads. You can just get the output file of “_non_rRNA.fq.gz” and “_rRNA.fq.gz” and then Trinity can use those.
only the nf denovotranscript one uses trinity, but it doesn’t give a format for the input file. I would assume its using “reads” as the _non_rRNA fq file. But I need to make sure.
Ok it looks like the sortmerna commands for paired end reads require: –fastx –aligned rRNA_reads –other non_rRNA_reads –paired_in –out2
Then, Trinity will take the _non_rRNA_reads files and use that for assembly.
Also note that nf-core denovotranscript uses fastQC after sortmeRNA to check them.
So in summary, I need to re-run my files because the most recent run did not separate forward and reverse reads. But I think I can keep the originals (<2 GB files) that did work, becuase they have _other_paired fwd and rev files. I would assume that means non rRNA reads.
Ok, I got this script to start running (after removing all the base folders from the samples that didn’t work initially):
#!/usr/bin/env bash
# Define directories
and="/scratch/projects/and_transcriptomics"
project="and_transcriptomics"
projdir="${and}/reciprocaltransplant"
readsdir="${projdir}/raw_seq_files/trimmed_cutadapt"
logdir="${projdir}/logs/sortmeRNA"
sortdir="${projdir}/raw_seq_files/sortmerna"
scriptdir="${projdir}/scripts/sortmerna"
# Loop through R1 trimmed reads
cd "${readsdir}" || exit
for r1 in *_R1.trimmed.fastq.gz; do
base="${r1%_R1.trimmed.fastq.gz}"
# Check if the SortMeRNA out folder exists and has files
out_folder="${sortdir}/${base}/out"
if [ -d "$out_folder" ] && [ "$(ls -A "$out_folder" 2>/dev/null)" ]; then
echo "Skipping ${base} (output already exists)"
continue
fi
# Write the LSF job script
cat <<EOF > "${scriptdir}/${base}_sortmeRNA.job"
#!/usr/bin/env bash
#BSUB -P ${project}
#BSUB -J ${base}_sortmeRNA
#BSUB -e ${logdir}/${base}_sortmeRNA.err
#BSUB -o ${logdir}/${base}_sortmeRNA.out
#BSUB -q bigmem
#BSUB -n 10
#BSUB -W 48:00
cd "${readsdir}"
${and}/programs/sortmerna-4.3.6-Linux/bin/sortmerna \
--ref ${and}/programs/sortmerna-4.3.6-Linux/database/smr_v4.3_default_db.fasta \
--reads ${readsdir}/${base}_R1.trimmed.fastq.gz \
--reads ${readsdir}/${base}_R2.trimmed.fastq.gz \
--aligned ${sortdir}/${base}_rRNA \
--other ${sortdir}/${base}_non_rRNA \
--fastx --out2 \
--workdir ${sortdir}
EOF
# Submit the job
bsub < "${scriptdir}/${base}_sortmeRNA.job"
done
update:
the above script only worked for the first sample because there was no subdirectory for the sample folder name………….
and also then I tried to run it with sortmerna 4.3.7 and it failed. so it has to be 4.3.6
so close to abandoning this