Jayden

Jayden

De novo assemble RNA-seq sequence

Intro#

A research paper published on biorxiv determined a new coronavirus subgenus, I would like to figure out is there any changes on protease. However, the sequence data has not been publish.

Fortunately, the similar sequence is do available on NCBI, unfortunately, only RNA-seq data is available.

So I need to assemble the RNA-seq reads first, and BLAST the sequence I need with the assembled data.

TL;DR#

  1. Setup the environment with conda:

    conda create -n sra_env -y
    conda activate sra_env
    conda install -c bioconda -y sra-tools trinity transdecoder blast fastp fastqc
    
  2. Fetch the data:

    prefetch SRR11301086
    fasterq-dump SRR11301086
    mkdir -p analysis_results/SRR11301086/{raw_fastqc,clean_fastqc,fastp,trinity,transdecoder,blast_results}
    
  3. Data quality check

    fastqc  SRR11301086_1.fastq SRR11301086_2.fastq \
            -o analysis_results/SRR11301086/raw_fastqc \
            -t 28 \
            2>&1 | tee analysis_results
    
  4. Quality control using fastp

    fastp -i SRR11301086_1.fastq \
          -I SRR11301086_2.fastq \
          -o analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \
          -O analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \
          --qualified_quality_phred 20 \
          --length_required 50 \
          --thread 28 \
          --html analysis_results/SRR11301086/fastp/SRR11301086_fastp.html \
          --json analysis_results/SRR11301086/fastp/SRR11301086_fastp.json \
          2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_fastp.log
    
  5. Data quality check (post-cleaning data)

    fastqc analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \
           analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \
           -o analysis_results/SRR11301086/clean_fastqc \
           -t 28 \
           2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_fastqc_cleaned.log
    
  6. Assemble with Trinity

    nohup Trinity --seqType fq \
            --left analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \
            --right analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \
            --CPU 28 --max_memory 48G \
            --output analysis_results/SRR11301086/trinity \
            2>&1 > analysis_results/SRR11301086/logs/SRR11301086_trinity.log &
    
    mv analysis_results/SRR11301086/trinity.Trinity.fasta* analysis_results/SRR11301086/trinity/
    
  7. Check the Trinity result:

    TrinityStats.pl analysis_results/SRR11301086/trinity/trinity.Trinity.fasta \
    2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_trinity_stats.txt
    
  8. BLAST sequence of interest

    1. Put your sequence in query.fasta.

      vi query.fasta
      
    2. Make BLAST database and run:

      makeblastdb -in analysis_results/SRR11301086/trinity/trinity.Trinity.fasta -dbtype nucl -out analysis_results/SRR11301086/trinity/SRR11301086_nuc_db
      
      tblastn -query query.fasta \
              -db analysis_results/SRR11301086/trinity/SRR11301086_nuc_db \
              -out analysis_results/SRR11301086/blast_results/SRR11301086_nuc_blast.txt \
              -evalue 1e-5 \
              -word_size 2 \
              -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq" \
              -num_threads 28
      
  9. Check the BLAST result:

    cat analysis_results/SRR11301086/blast_results/SRR11301086_nuc_blast.txt
    
  10. Extract the sequence from trinity.Trinity.fasta

    # Build index
    samtools faidx analysis_results/SRR11301086/trinity/trinity.Trinity.fasta
    # Extract the sequence
    samtools faidx analysis_results/SRR11301086/trinity/trinity.Trinity.fasta TRINITY_DN284_c0_g1_i7 > analysis_results/SRR11301086/blast_results/SRR11301086_TTRINITY_DN284_c0_g1_i7.fasta
    

Tail#

  1. You can also blast with the Predicted sequence:

    TransDecoder.LongOrfs -t analysis_results/SRR11301086/trinity/trinity.Trinity.fasta\
        2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_transdecoder_long.log
    
    TransDecoder.Predict -t analysis_results/SRR11301086/trinity/trinity.Trinity.fasta \
        2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_transdecoder_predict.log
    
    mv *.transdecoder* analysis_results/SRR11301086/transdecoder/
    
  2. Make BLAST database and run:

    makeblastdb -in analysis_results/SRR11301086/transdecoder/trinity.Trinity.fasta.transdecoder.pep \
                -dbtype prot \
                -out analysis_results/SRR11301086/transdecoder/SRR11301086_db \
                2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_makeblastdb.log
    
    blastp -query query.fasta \-db analysis_results/SRR11301086/transdecoder/SRR11301086_db \
           -out analysis_results/SRR11301086/blast_results/SRR11301086_prot_blast.txt \
           -evalue 1e-5 \
           -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq" \
           -num_threads 28 \
           2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_blast.log
    

此文由 Mix Space 同步更新至 xLog
原始链接为 https://xxu.do/posts/academic/De-novo-assemble-RNA-seq-sequence


Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.