Intro#
A research paper published on biorxiv determined a new coronavirus subgenus, I would like to figure out is there any changes on protease. However, the sequence data has not been publish.
Fortunately, the similar sequence is do available on NCBI, unfortunately, only RNA-seq data is available.
So I need to assemble the RNA-seq reads first, and BLAST the sequence I need with the assembled data.
TL;DR#
-
Setup the environment with conda:
conda create -n sra_env -y conda activate sra_env conda install -c bioconda -y sra-tools trinity transdecoder blast fastp fastqc
-
Fetch the data:
prefetch SRR11301086 fasterq-dump SRR11301086 mkdir -p analysis_results/SRR11301086/{raw_fastqc,clean_fastqc,fastp,trinity,transdecoder,blast_results}
-
Data quality check
fastqc SRR11301086_1.fastq SRR11301086_2.fastq \ -o analysis_results/SRR11301086/raw_fastqc \ -t 28 \ 2>&1 | tee analysis_results
-
Quality control using fastp
fastp -i SRR11301086_1.fastq \ -I SRR11301086_2.fastq \ -o analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \ -O analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \ --qualified_quality_phred 20 \ --length_required 50 \ --thread 28 \ --html analysis_results/SRR11301086/fastp/SRR11301086_fastp.html \ --json analysis_results/SRR11301086/fastp/SRR11301086_fastp.json \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_fastp.log
-
Data quality check (post-cleaning data)
fastqc analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \ analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \ -o analysis_results/SRR11301086/clean_fastqc \ -t 28 \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_fastqc_cleaned.log
-
Assemble with Trinity
nohup Trinity --seqType fq \ --left analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \ --right analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \ --CPU 28 --max_memory 48G \ --output analysis_results/SRR11301086/trinity \ 2>&1 > analysis_results/SRR11301086/logs/SRR11301086_trinity.log & mv analysis_results/SRR11301086/trinity.Trinity.fasta* analysis_results/SRR11301086/trinity/
-
Check the Trinity result:
TrinityStats.pl analysis_results/SRR11301086/trinity/trinity.Trinity.fasta \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_trinity_stats.txt
-
BLAST sequence of interest
-
Put your sequence in query.fasta.
vi query.fasta
-
Make BLAST database and run:
makeblastdb -in analysis_results/SRR11301086/trinity/trinity.Trinity.fasta -dbtype nucl -out analysis_results/SRR11301086/trinity/SRR11301086_nuc_db tblastn -query query.fasta \ -db analysis_results/SRR11301086/trinity/SRR11301086_nuc_db \ -out analysis_results/SRR11301086/blast_results/SRR11301086_nuc_blast.txt \ -evalue 1e-5 \ -word_size 2 \ -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq" \ -num_threads 28
-
-
Check the BLAST result:
cat analysis_results/SRR11301086/blast_results/SRR11301086_nuc_blast.txt
-
Extract the sequence from
trinity.Trinity.fasta
# Build index samtools faidx analysis_results/SRR11301086/trinity/trinity.Trinity.fasta # Extract the sequence samtools faidx analysis_results/SRR11301086/trinity/trinity.Trinity.fasta TRINITY_DN284_c0_g1_i7 > analysis_results/SRR11301086/blast_results/SRR11301086_TTRINITY_DN284_c0_g1_i7.fasta
Tail#
-
You can also blast with the Predicted sequence:
TransDecoder.LongOrfs -t analysis_results/SRR11301086/trinity/trinity.Trinity.fasta\ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_transdecoder_long.log TransDecoder.Predict -t analysis_results/SRR11301086/trinity/trinity.Trinity.fasta \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_transdecoder_predict.log mv *.transdecoder* analysis_results/SRR11301086/transdecoder/
-
Make BLAST database and run:
makeblastdb -in analysis_results/SRR11301086/transdecoder/trinity.Trinity.fasta.transdecoder.pep \ -dbtype prot \ -out analysis_results/SRR11301086/transdecoder/SRR11301086_db \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_makeblastdb.log blastp -query query.fasta \-db analysis_results/SRR11301086/transdecoder/SRR11301086_db \ -out analysis_results/SRR11301086/blast_results/SRR11301086_prot_blast.txt \ -evalue 1e-5 \ -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq" \ -num_threads 28 \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_blast.log
此文由 Mix Space 同步更新至 xLog
原始链接为 https://xxu.do/posts/academic/De-novo-assemble-RNA-seq-sequence