08 January 1941

This post shows how to use STAR and HTSeq together to estimate gene expression from RNA-seq data:

How do you install the latest version of HTSeq (htseq-count)? HTSeq is a Python package (Python 2.7, or 3.4 and later).

  • Python requirements:
pip install deeptoolsintervals
pip install matplotlib
pip install numpydoc
pip install plotly
pip install py2bit
pip install pyBigWig
pip install scipy
  • and then
pip install HTSeq
  • Here I assume you use STAR to map the RNA-seq FASTQ files to the human genome (hg19).
  • The --outSAMstrandField intronMotif option adds an XS attribute to the spliced alignments in the BAM file, which Cufflinks requires for unstranded RNA-seq data.
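To make the STAR step concrete, here is a minimal sketch that assembles a STAR command line with the option described above. The index directory, FASTQ names, output prefix and thread count are placeholders I made up for illustration, and the helper name build_star_command is hypothetical; only the STAR flags themselves are real. Nothing is executed unless you pass the list to subprocess yourself.

```python
import shlex


def build_star_command(genome_dir, fastq_files, out_prefix, threads=8):
    """Return the argv list for a STAR mapping run (nothing is executed here)."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        # Adds the XS strand attribute to spliced alignments
        # (needed by Cufflinks for unstranded data).
        "--outSAMstrandField", "intronMotif",
        "--outFileNamePrefix", out_prefix,
    ]
    # Gzipped FASTQ input has to be decompressed on the fly.
    if all(name.endswith(".gz") for name in fastq_files):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


if __name__ == "__main__":
    cmd = build_star_command(
        genome_dir="hg19_star_index",  # placeholder index directory
        fastq_files=["sample_R1.fastq.gz", "sample_R2.fastq.gz"],  # placeholders
        out_prefix="sample_",
    )
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

With the prefix sample_, STAR would name the sorted BAM sample_Aligned.sortedByCoord.out.bam.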
  • Before running htseq-count, download a human GTF annotation file, for example:
    wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/GRCh37_mapping/gencode.v32lift37.annotation.gtf.gz
    gunzip gencode.v32lift37.annotation.gtf.gz
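  • Now run htseq-count on the BAM files created by STAR. It reports raw read counts per gene (not FPKM) on standard output, so redirect them to a file; the -o option only writes annotated SAM records. For a coordinate-sorted BAM, also add -r pos.
    htseq-count -m intersection-nonempty -t exon -i gene_id -f bam STAR.output.bam gencode.v32lift37.annotation.gtf > output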
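Since htseq-count writes raw counts, FPKM values have to be derived afterwards. The sketch below is one way to do it, under two assumptions of mine: gene lengths in bp (for example total exon length per gene) are supplied separately, and the library size is taken as the sum of reads assigned to genes. The function names are hypothetical, not part of HTSeq.

```python
import io


def read_htseq_counts(handle):
    """Parse htseq-count output (gene_id<TAB>count) into a dict.

    Summary rows whose names start with '__' (e.g. __no_feature) are skipped.
    """
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        gene_id, value = line.split("\t")
        if gene_id.startswith("__"):
            continue
        counts[gene_id] = int(value)
    return counts


def fpkm(counts, gene_lengths):
    """FPKM = count * 1e9 / (gene length in bp * total assigned reads)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {
        gene: count * 1e9 / (gene_lengths[gene] * total)
        for gene, count in counts.items()
    }


if __name__ == "__main__":
    # Made-up numbers, only to show the call pattern.
    example = "geneA\t100\ngeneB\t300\n__no_feature\t50\n"
    counts = read_htseq_counts(io.StringIO(example))
    print(fpkm(counts, {"geneA": 1000, "geneB": 2000}))
```

To build a matrix, apply the same two functions to each sample's count file and collect the results by gene ID.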
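  • If you do not want to use HTSeq, you can also try Cufflinks. Note that --outFilterIntronMotifs RemoveNoncanonical is a STAR option (it removes non-canonical junctions, which is recommended for Cufflinks runs), so it belongs on the STAR command line rather than here:

    cufflinks --library-type fr-firststrand STAR.output.bam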
  • How can you infer the RNA-seq library type with salmon if you don’t have that information?

  • Before running salmon, download the transcript FASTA file, as in the following example:
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/GRCh37_mapping/gencode.v32lift37.transcripts.fa.gz
gunzip gencode.v32lift37.transcripts.fa.gz
  • Another tool, RSeQC, is also powerful and much easier to use for inferring RNA-seq strandedness. You can download the required knowngene.hg19.bed12 from UCSC or from my AnnotationDatabase repository.
 pip install RSeQC
 wget https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/knowngene.hg19.bed12
 infer_experiment.py -s 2000000 -r knowngene.hg19.bed12 -i UW040LPS_ATTCCT_L002Aligned.sortedByCoord.out.bam
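infer_experiment.py prints two "Fraction of reads explained by" lines, and the larger one tells you which strand setting to use downstream. Here is a small sketch that turns that report into a suggested htseq-count -s value. The 0.8 and 0.2 cutoffs are my own assumptions, not RSeQC recommendations, and suggest_htseq_strand is a hypothetical helper name.

```python
import re

# Matches lines such as: Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
_FRACTION = re.compile(r'Fraction of reads explained by "([^"]+)":\s*([0-9.]+)')


def suggest_htseq_strand(report, threshold=0.8, balance=0.2):
    """Map an infer_experiment.py report to 'yes', 'reverse', 'no' or 'undetermined'."""
    sense = antisense = None
    for pattern, value in _FRACTION.findall(report):
        first = pattern.split(",")[0]
        if first in ("1++", "++"):      # read (or read 1) on the gene's strand
            sense = float(value)
        elif first in ("1+-", "+-"):    # read (or read 1) on the opposite strand
            antisense = float(value)
    if sense is None or antisense is None:
        raise ValueError("could not find both strand fractions in the report")
    if sense >= threshold:
        return "yes"
    if antisense >= threshold:
        return "reverse"
    if abs(sense - antisense) <= balance:
        return "no"
    return "undetermined"


if __name__ == "__main__":
    # Made-up report text in the format infer_experiment.py prints.
    report = (
        "This is PairEnd Data\n"
        "Fraction of reads failed to determine: 0.0172\n"
        'Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903\n'
        'Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925\n'
    )
    print(suggest_htseq_strand(report))
```

A result of "reverse" corresponds to fr-firststrand in Cufflinks terms, and "yes" to fr-secondstrand.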

  • More GTF files for hg38 and hg19 can be found at these links:
hg38 GTFs: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/
hg19 GTFs: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/

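  • Now you can run salmon to infer the RNA-seq library type. With -l A, salmon detects the library type and records it in salmon_quant/lib_format_counts.json. Note that the alignment-based mode (-a) expects reads aligned to the transcriptome (with STAR, the Aligned.toTranscriptome.out.bam file produced by --quantMode TranscriptomeSAM), not a genome-coordinate BAM:
salmon quant -t gencode.v32lift37.transcripts.fa -l A -a UW040LPS_ATTCCT_L002Aligned.toTranscriptome.out.bam -o salmon_quant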
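After the run, salmon stores the detected library type in the expected_format field of lib_format_counts.json inside the output directory. The sketch below reads that field and maps the most common codes to an htseq-count -s value. The mapping table covers only a few library types and the function name is hypothetical; anything not in the table comes back as None.

```python
import json

# Common salmon library types and the matching htseq-count -s setting.
SALMON_TO_HTSEQ = {
    "U": "no", "IU": "no",
    "SF": "yes", "ISF": "yes",
    "SR": "reverse", "ISR": "reverse",
}


def library_type_from_salmon(json_text):
    """Return (salmon library type, matching htseq-count -s value or None)."""
    info = json.loads(json_text)
    lib_type = info["expected_format"]
    return lib_type, SALMON_TO_HTSEQ.get(lib_type)


# Usage with the output directory from the command above:
#     with open("salmon_quant/lib_format_counts.json") as handle:
#         print(library_type_from_salmon(handle.read()))
```

ISR is the salmon name for what Cufflinks calls fr-firststrand, and ISF corresponds to fr-secondstrand.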
  • Finally

