This post shows how to use STAR and HTSeq together to estimate gene expression from RNA-seq data:
How to install the latest version of HTSeq (htseq-count)? HTSeq is a Python-based tool (it needs Python 2.7, or 3.4 or later).
- Python requirements:
pip install deeptoolsintervals
pip install matplotlib
pip install numpydoc
pip install plotly
pip install py2bit
pip install pyBigWig
pip install scipy
- Then install HTSeq itself:
pip install HTSeq
- Here I assume you use STAR to map the RNA-seq FASTQ files to the human genome (hg19).
- The --outSAMstrandField intronMotif option adds an XS strand attribute to the spliced alignments in the BAM file, which Cufflinks requires for unstranded RNA-seq data.
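The STAR command itself is not shown here, so the following is only a minimal sketch of a mapping call that includes the option above. The index directory, FASTQ names, output prefix, and thread count are placeholders I made up, and the wrapper only prints the command unless DRY_RUN=0 is set.

```shell
# Sketch only: index path, sample names and thread count are hypothetical.
map_with_star() {
  # $1 = read 1 FASTQ (gzipped), $2 = read 2 FASTQ (gzipped), $3 = output prefix
  set -- STAR --runThreadN 4 \
    --genomeDir "${GENOME_DIR:-star_index_hg19}" \
    --readFilesIn "$1" "$2" \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMstrandField intronMotif \
    --outFileNamePrefix "$3"
  echo "$*"                                    # always show the command
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi  # run it only on request
}

map_with_star sample_R1.fastq.gz sample_R2.fastq.gz sample_
```

With the prefix sample_, STAR names the sorted output sample_Aligned.sortedByCoord.out.bam.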
- Before running htseq-count, download a human GTF annotation file, for example:
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/GRCh37_mapping/gencode.v32lift37.annotation.gtf.gz
gunzip gencode.v32lift37.annotation.gtf.gz
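Since htseq-count will be told to count exon features and group them by gene_id, it can be worth a quick sanity check that the downloaded GTF actually has both. This is a small sketch; gtf_check is a helper name I made up, not part of HTSeq.

```shell
# Count exon records in a GTF and how many of them carry a gene_id attribute.
gtf_check() {
  awk -F '\t' '
    $0 !~ /^#/ && $3 == "exon" {
      n++
      if ($9 ~ /gene_id "[^"]+"/) g++
    }
    END { printf "exons=%d with_gene_id=%d\n", n, g }
  ' "$1"
}

# Usage (file name taken from the download step):
# gtf_check gencode.v32lift37.annotation.gtf
```

If the two numbers differ, or both are zero, the -t and -i options passed to htseq-count probably need to change.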
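- Now run htseq-count on the BAM file created by STAR to generate per-gene read counts. Note that htseq-count reports raw counts, not FPKM, and prints them to standard output; its -o option only writes the annotated alignments to a SAM file, so redirect the output to keep the counts:
htseq-count -m intersection-nonempty -t exon -i gene_id -f bam STAR.output.bam gencode.v32lift37.annotation.gtf > htseq_counts.txt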
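The htseq-count table ends with a few summary rows whose names start with two underscores (for example __no_feature and __ambiguous), and these should not be treated as genes. Here is a rough sketch of how to separate them, assuming the counts were saved to a tab-separated file; summarize_counts and the file name in the usage line are my own placeholders.

```shell
# Summarize an htseq-count table: reads assigned to genes, reads that fell
# into the trailing "__" categories, and the number of genes with any reads.
summarize_counts() {
  awk -F '\t' '
    /^__/ { special += $2; next }                    # summary rows, not genes
    { assigned += $2; if ($2 > 0) expressed++ }
    END {
      printf "assigned=%d unassigned=%d expressed_genes=%d\n",
             assigned, special, expressed
    }
  ' "$1"
}

# Usage:
# summarize_counts htseq_counts.txt
```

A very low assigned number compared to unassigned is often a hint that the strand setting is wrong for the library.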
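- If you do not want to use HTSeq, you can also try Cufflinks. Note that --outFilterIntronMotifs RemoveNoncanonical is a STAR option, not a Cufflinks one; set it at the mapping step to remove non-canonical junctions before running Cufflinks:
cufflinks --library-type fr-firststrand STAR.output.bam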
How to infer the RNA-seq library type with salmon if you don’t have that information?
- Before running salmon, download the FASTA file of the transcripts, as in the following example:
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/GRCh37_mapping/gencode.v32lift37.transcripts.fa.gz
gunzip gencode.v32lift37.transcripts.fa.gz
- Another tool, RSeQC, is also powerful and makes it much easier to infer the strandedness of RNA-seq data. You can download the required knowngene.hg19.bed12 from UCSC or from my annotation database.
pip install RSeQC
wget https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/knowngene.hg19.bed12
infer_experiment.py -s 2000000 -r knowngene.hg19.bed12 -i UW040LPS_ATTCCT_L002Aligned.sortedByCoord.out.bam
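infer_experiment.py prints the fraction of reads explained by each strand configuration, and that can be turned into a value for the -s option of htseq-count. The sketch below reads that output from standard input; the 0.8 cutoff is an assumption of mine rather than an RSeQC rule, and strand_from_rseqc is a made-up helper name.

```shell
# Map infer_experiment.py output to an htseq-count -s value (yes/reverse/no).
strand_from_rseqc() {
  awk -F ': ' '
    # read 1 (or the single read) on the same strand as the gene
    /1\+\+,1--,2\+-,2-\+/ || /"\+\+,--"/ { fwd = $2 }
    # read 1 (or the single read) on the opposite strand
    /1\+-,1-\+,2\+\+,2--/ || /"\+-,-\+"/ { rev = $2 }
    END {
      if (fwd >= 0.8)      print "yes"
      else if (rev >= 0.8) print "reverse"
      else                 print "no"
    }'
}

# Usage:
# infer_experiment.py -s 2000000 -r knowngene.hg19.bed12 -i sample.bam | strand_from_rseqc
```

When both fractions are close to 0.5 the library is most likely unstranded, which is why the fallback is no.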
- More GTFs for hg38 and hg19 can be found at these links:
hg38 GTFs: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/
hg19 GTFs: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/
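- Now you can run salmon to infer the RNA-seq library type. Note that the alignment-based mode of salmon (-a) expects reads aligned to the transcriptome, not a genome-coordinate BAM, so use the Aligned.toTranscriptome.out.bam file that STAR writes when run with --quantMode TranscriptomeBAM:
salmon quant -t gencode.v32lift37.transcripts.fa -l A -a UW040LPS_ATTCCT_L002Aligned.toTranscriptome.out.bam -o salmon_quant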
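After a run with -l A, salmon records the library type it settled on in lib_format_counts.json inside the output directory, under the expected_format key. A small sketch for reading it back and translating it into an htseq-count -s value follows; both function names are my own, and the mapping covers only the stranded (…SF, …SR) and unstranded (…U) codes.

```shell
# Print the library type salmon detected, given its output directory.
salmon_libtype() {
  sed -n 's/.*"expected_format"[[:space:]]*:[[:space:]]*"\([A-Z]*\)".*/\1/p' \
    "$1/lib_format_counts.json"
}

# Translate a salmon library type into an htseq-count -s value.
htseq_strand_for() {
  case "$1" in
    *SF) echo yes ;;       # read 1 comes from the forward strand
    *SR) echo reverse ;;   # read 1 comes from the reverse strand
    *U)  echo no ;;        # unstranded
    *)   echo unknown ;;
  esac
}

# Usage:
# htseq_strand_for "$(salmon_libtype salmon_quant)"
```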
- Finally