Most Important Database for Human Biomedical Research

1934-02-28 00:00:00 +0000

  • Cancer Dependency Map, 2020, Identifying all dependencies in every cancer cell, https://depmap.org/portal/
  • Open Targets is an innovative, large-scale, multi-year, public-private partnership that uses human genetics and genomics data for systematic drug target identification and prioritisation.
  • VnD: a structure-centric database of disease-related SNPs and drugs: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013797/
  • Drugbank-databases: http://www.drugbank.ca/w/databases
  • PharmGKB: https://www.pharmgkb.org/

How to get genomic coordinates for all protein domains

1933-02-28 00:00:00 +0000

Here is the best solution:

git clone https://github.com/lindenb/jvarkit.git
cd jvarkit
./gradlew mapuniprot
wget  ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz
java -jar ~/hpc/tools/jvarkit/dist/mapuniprot.jar  -R ~/hpc/db/hg19/hg19.fa  -u ~/hpc/uniprot_sprot.xml.gz -k knownGene.txt.gz -o uniprot_sprot.bed

How to Understand Human Cancer

1931-02-28 00:00:00 +0000

Immune escape mechanisms

  • up-regulation of immune-inhibitory molecules
  • down-regulation of antigen display
  • recruitment of suppressive cell populations

BRAFV600E-induced constitutive activation of the BRAF-MAPK pathway

How to Submit NGS Seqeuencing data to SRA

1930-02-28 00:00:00 +0000

Step 1: creat Bioproject (receive PRJNA ID to be used in step 2: PRJNA605250)

  • https://submit.ncbi.nlm.nih.gov/subs/bioproject/

Step 2: creat Biosample (receive SAMN ID to be used in step 3: SAMN14052692)

  • https://submit.ncbi.nlm.nih.gov/subs/biosample/
  • https://submit.ncbi.nlm.nih.gov/biosample/template/
  • Please check the example of PTPN22 dataset, BioSampleObjects-PTPN22.txt

Step 3: use SRA Submission Portal Wizard to submit fastq.gz files

  • fill the data and sample information file and will be upload in the next step
  • upload data with this link: https://submit.ncbi.nlm.nih.gov/subs/sra/

Resources for flow cytometry bioinformatics analysis

1929-02-28 00:00:00 +0000

Resources for flow cytometry bioinformatics analysis

Community portals

  • ISAC https://isac-net.org
  • FlowCAP http://flowcap.flowsite.org

Data standards

  • ISAC standards page https://isac-net.org/page/Data-Standards
  • FCS 3.1 https://onlinelibrary.wiley.com/doi/10.1002/cyto.a.20825
  • Gating-ML https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4874733
  • MIFlowCyt https://isac-net.org/page/MIFlowCyt
  • Classification Results File Format (CLR) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4874736

Parsers, core data access

  • CFCS library http://sourceforge.net/projects/flowcyt
  • flowCore http://bioconductor.org/packages/release/bioc/html/flowCore.html
  • Python FCM https://pythonhosted.org/fcm

Unsupervised clustering/visualization

  • t-SNE https://lvdmaaten.github.io/tsne
  • VISNE https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4076922
  • FLOCK http://sourceforge.net/projects/immportflock
  • SPADE https://github.com/nolanlab/spade
  • Wanderlust https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4045247/
  • ACCENSE http://www.cellaccense.com
  • Phenograph https://github.com/jacoblevine/PhenoGraph

Semi/supervised analysis/visualization

  • FLAME https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682540
  • x-Cyt http://www.broadinstitute.org/mpg/xcyt
  • Citrus https://github.com/nolanlab/citrus

Other R packages

  • Bioconductor Flow Cytometry view: http://master.bioconductor.org/packages/release/BiocViews.html#___FlowCytometry
  • Illustrative highthroughputassays package: http://bioconductor.org/help/workflows/highthroughputassays
  • Cytofkit https://www.bioconductor.org/packages/3.6/bioc/html/cytofkit.html (status in Bioconductor is now ‘deprecated’)
  • OpenCyto https://github.com/RGLab/openCyto

Integrated applications

  • FlowJo http://www.flowjo.com
  • CytoBank http://cytobank.org
  • Astrolabe https://www.astrolabediagnostics.com
  • FCS Express https://www.denovosoftware.com
  • Kaluza http://www.beckman.com/coulter-flow-cytometry/software/kaluza-analysis
  • GemStone http://www.vsh.com/products/GemStone/index.asp

Utilities

  • BD flow tool index http://www.bdbiosciences.com/research/multicolor/tools/index.jsp
  • GenePattern Flow Cytometry Suite (registration required) https://cloud.genepattern.org Browse Modules > Flow Cytometry

Mailing Lists:

  • Purdue cytometry discussion list http://cyto.purdue.edu/hmarchiv

Data repositories:

  • https://flowrepository.org

  • scDataviz: single cell dataviz and downstream analyses: https://github.com/kevinblighe/scDataviz

How to use plink1.9/plink2 to do human genetics research

1928-02-28 00:00:00 +0000

how do I prune second-degree-related samples? —rel-cutoff is obsolete and see —king-cutoff

plink --bfile RA1000 --bmerge RA500 --make-bed --out RA3000
plink --bfile RA3000 --impute-sex --make-bed --out RA3000.R1
grep PROBLEM RA3000.R1.sexcheck | awk '{print $2}' > sexcheck.exclude.txt
plink --bfile RA3000 --impute-sex --exclude --make-bed --out RA3000.R1
plink2 —bfile ... —king-cutoff 0.088 --maf 0.01 —make-bed --out myplink
### Script to check plink .bim files against HRC/1000G for strand, id names, positions, alleles, 
### ref/alt assignment, William Rayner 2015, wrayner@well.ox.ac.uk, Version 4.2.7
cd /mnt/sas0/AD/sguo234/asa
wget http://www.well.ox.ac.uk/~wrayner/tools/HRC-1000G-check-bim-v4.2.7.zip
wget http://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
wget https://www.well.ox.ac.uk/~wrayner/tools/1000GP_Phase3_combined.legend.gz
gunzip 1000GP_Phase3_combined.legend.gz
unzip HRC-1000G-check-bim-v4.2.7.zip
gunzip HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
wget http://qbrc.swmed.edu/zhanxw/software/checkVCF/checkVCF-20140116.tar.gz
tar xzvf checkVCF-20140116.tar.gz
## compare bim to HRC European Population(HEP)
perl HRC-1000G-check-bim.pl -b RA3000.R3.bim -f RA3000.R3.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
sh Run-plink.sh
## compare bim to 1000G East-Asian Population(EAP)
perl HRC-1000G-check-bim.pl -b RA3000.R3.bim -f RA3000.R3.frq -r 1000GP_Phase3_combined.legend -g -p EAS
sh Run-plink.sh
#### How to add rs ID to replace illumina array probes (exm) with bcf annotate (use -m-both to extent multi-allic SNPs to multiple row)

bcftools norm dbSNP153.hg19.vcf.gz --threads 48 -m-both -Oz -o dbSNP153.norm.hg19.vcf.gz
tabix -p vcf dbSNP153.norm.hg19.vcf.gz
bcftools annotate --threads 48 -c ID -a ~/db/dbSNP153/dbSNP153.norm.hg19.vcf.gz RA3000.R4.vcf.gz -Oz -o RA3000.R4.RS.vcf.gz
plink --bfile RA3000.R5 --maf 0.01 --hwe 0.01 --extract fstl1.bed --range --pheno RA3000.mphen --mpheno 1 --logistic --adjust --ci 0.95 --out FSTL1-RA-CTR
wget https://raw.githubusercontent.com/Shicheng-Guo/Gscutility/master/addfakegenotype.pl
perl addfakegenotype.pl > dbSNP153.hg19.plink.vcf
plink --vcf dbSNP153.hg19.plink.vcf --make-bed --allow-extra-chr --out dbSNP153.hg19

How to output rs ID with bcftools

bcftools query -f '%ID\n' gnomad.genomes.r2.1.sites.chr16.rec.hsa.gff3.sort.rmdup.biallelic.vcf.bgz

How to output sample ID with bcftools

bcftools query -l gnomad.genomes.r2.1.sites.chr16.rec.hsa.gff3.sort.rmdup.biallelic.vcf.bgz

How to convert vcf to bed file: vcf2bed

for i in `ls gnomad.genomes.r2.1.sites.chr*.rec.hsa.gff3.sort.rmdup.biallelic.vcf.bgz`
do 
bcftools query -f '%CHROM\t%POS\t%ID\n' $i | awk '{print $1,$2-1,$2,$3}' OFS="\t"
done

Recently, colleagues always ask me what’s the best solution to replace SNP ids in plink map files. Actually, there are tens of different solutions and you only need to mask one of these and then apply it for all the times. Now let me show you several different solutions.

bcftools annotate

plink --bfile ROI --recode vcf --out ROI
bcftools view ROI.vcf -Oz -o ROI.vcf.gz
tabix -p vcf ROI.vcf.gz
bcftools annotate -a ~/hpc/db/hg19/dbSNP/All_20180423.hg19.vcf.gz -c ID ROI.vcf.gz -Oz -o ROI.hg19.vcf.gz

You need prepare the old id and new id relationship or mapping file before you use this function. However, the best solution is to replace the old id with chr:pos and then download chr:pos, new id from ucsc.

plink --bfile ROI --recode vcf --out ROI
bcftools view ROI.vcf -Oz -o ROI.vcf.gz
tabix -p vcf ROI.vcf.gz
bcftools annotate -a ~/hpc/db/hg19/dbSNP/All_20180423.hg19.vcf.gz -c ID ROI.vcf.gz -Oz -o ROI.hg19.vcf.gz
plink --file <input_prefix> --recode HV --snps-only just-acgt --out <output_prefix>

How to apply ANNOVAR to annotate big VCF files (-G -> AWK -> table_annovar)


wget http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz
scp root@101.133.145.142:/root/tools/annovar.latest.tar.gz ./
tar xzvf annovar.latest.tar.gz 
cd/home/mxiong/tools/annovar
annotate_variation.pl -downdb -buildver hg19 cytoBand humandb
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar tfbsConsSites humandb/ 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar targetScanS humandb/ 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar wgRna humandb/ &
annotate_variation.pl -downdb -buildver hg19 -webfrom ucsc gnomad_genome humandb/ 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar dbnsfp35a humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar ljb23_all humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar exac03 humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar esp6500siv2 humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar clinvar_20160302 humandb   
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar cosmic70 humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar icgc21 humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar nci60 humandb
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar dann humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar gwasCatalog humandb
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp150 humandb
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar kaviar_20150923 humandb
bcftools view --threads 48 -G RA3000.R4.RS.vcf.gz -Oz -o RA3000.R4.RG.vcf.gz
zcat RA3000.R4.RG.vcf.gz| awk '{print $1,$2,$2,$4,$5,$3}\' OFS="\t"| grep -v '#' > RA3000.R4.avinput
table_annovar.pl RA3000.R4.avinput ~/tools/annovar/humandb/ -buildver hg19 -out RA3000 -remove -protocol refGene,cytoBand,avsnp150,dbnsfp35a -operation gx,r,f,f -nastring . -csvout -polish -xref ~/tools/annovar/humandb/gene_fullxref.txt

How to apply rvtest to do gene-based association study

wget https://github.com/zhanxw/rvtests/releases/download/v1.8.6/rvtests-20150104.tar.gz
tar xzvf rvtests-20150104.tar.gz
rvtest --inVcf RA3000.R5 --pheno RA3000.mphen --mpheno 1 --out RA3000-CTR --single wald,score
rvtest --inVcf RA3000.R5 --pheno RA3000.mphen --mpheno 2 --out RA3000-ILD --single wald,score

GenomeAsia100K

# GenomeAsia100K data are available at the links below. Users can download GA100K data in compressed Variant Call Format (VCF) file.
mkdir ~/hpc/db/GenomeAsia100K
cd ~/hpc/db/GenomeAsia100K
for i in {1..22}
do
wget --no-check-certificate https://browser.genomeasia100k.org/service/web/download_files/$i.substitutions.annot.cont_withmaf.vcf.gz &
done
# Round 1: run merging and try to find dupliates SNPs and then remove them 
rm mergelist.txt
for i in {1..22}
do
echo chr$i >> mergelist.txt
done
plink --merge-list mergelist.txt --make-bed --out 1000plink
# Round 2: remove duplicates SNPs which stored in `1000plink-merge.missnp`
mkdir temp
for i in {1..22}
do
echo $i
echo \#PBS -N $i > $i.job
echo \#PBS -l nodes=1:ppn=1 >> $i.job
echo \#PBS -M Guo.shicheng\@marshfieldresearch.org >> $i.job
echo \#PBS -m abe  >> $i.job
echo \#PBS -o $(pwd)/temp/ >>$i.job
echo \#PBS -e $(pwd)/temp/ >>$i.job
echo \#PBS -m abe  >> $i.job
echo cd $(pwd) >> $i.job
echo plink --bfile chr$i --alleleACGT --snps-only just-acgt --exclude 1000plink-merge.missnp --make-bed --out chr$i.uni >> $i.job
qsub  $i.job
done
# Round 3: Try to merge again
rm mergelist.txt
for i in {1..22}
do
echo chr$i >> mergelist.txt
done
plink --merge-list mergelist.txt --make-bed --out 1000plink

How to prepare PCA plot for your own data and 1000 Genome data.

awk '{print $4}' hsa.gff3.hg19.bed  | sort -u > hsa.gff3.hg19.snp
plink --bfile /gpfs/home/guosa/hpc/db/hg19/1000Genome/plink/G1000plink --extract hsa.gff3.hg19.snp --make-bed --out G1000.miRNA
plink --bfile ROI.RA3000.dbsnp --bmerge G1000.miRNA --allow-no-sex --make-bed --out ./PCA/ROI
plink --bfile ROI --threads 31 --cluster --mds-plot 2
plink --bfile ROI --threads 31 --pca 2 'header' --out ROI

hapmap2<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/hapmap2.pop",head=F)
hapmap3<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/hapmap3.pop",head=T)
G1000Sam <-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/1000G/1000GenomeSampleInfo.txt",head=F,as.is=T)
G1000Super<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/1000G/superpopulation.txt",head=F,sep="\t")
G1000Sam$superpop<-G1000Super[match(G1000Sam$V3,G1000Super$V1),]$V2
write.table(G1000Sam,file="1000GenomeSampleInfo.txt",quote=F,sep="\t",col.names=F,row.names=F)

eigenvec<-read.table("plink.eigenvec",head=T)
head(eigenvec)
pop<-as.character(G1000Sam[match(eigenvec[,2],G1000Sam[,2]),3])
super<-as.character(G1000Sam[match(eigenvec[,2],G1000Sam[,2]),5])
pop[is.na(pop)]<-"GHRA"
super[is.na(super)]<-"GHRA"
eigenvec$pop=pop
eigenvec$super=super
eigenvec$col=as.numeric(as.factor(super))
eigenvec$pch=as.numeric(as.factor(super))

set<-unique(data.frame(pch=eigenvec$pch,col=eigenvec$col,legend=eigenvec$super))

for(i in 1:15){
jpeg(paste("pca.super",i,".jpg",sep=""))
plot(eigenvec[,3],eigenvec[,4],pch=16,col=eigenvec$col+i,xlab="principle component 1",ylab="principle componment 2",cex.axis=1.5,cex.lab=1.5,cex=1)
legend("topright",pch=16,legend=set$legend,col=set$col+i,bty="n",cex=1)
dev.off()
}

Plink2 common operation (Step 1, decpress all the zst files, especially *pgen.zst)

for i in `ls *zst | rev | cut -c 5- | rev | uniq`
do
echo $i
plink2 --zst-decompress $i.zst > $i
done
cp phase3_corrected.psam all_phase3.psam

How to remove all the multiallelic SNPs in plink2 with –exclude

awk '$5~/,/{print}' all_phase3.pvar | awk '{print $3}' > multiallelic
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP --exclude TEMP2.rmdup.mismatch --make-pgen --out TEMP2 --threads 24
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP2 --rm-dup --make-bpgen --out all_phase3 --threads 24
rm TEMP*

How to remove all the multiallelic SNPs in plink2 with –max-alleles

#### Processing 1000 Genome data downloaded from plink2 website
for i in `ls *zst | rev | cut -c 5- | rev | uniq`
do
echo $i
plink2 --zst-decompress $i.zst > $i
done
cp phase3_corrected.psam all_phase3.psam

#### method 1 with awk and --exclude option
awk '$5~/,/{print}' all_phase3.pvar | awk '{print $3}' > multiallelic
plink2 --pfile all_phase3 --exclude multiallelic.txt --make-pgen --out TEMP --threads 24
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP --exclude TEMP2.rmdup.mismatch --make-pgen --out TEMP2 --threads 24
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP2 --rm-dup --make-bed --out all_phase3 --threads 24
rm TEMP*
#### method 2 with --max-alleles 2
plink2 --pfile all_phase3 --max-alleles 2 --make-pgen --out TEMP --threads 24
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP --exclude TEMP2.rmdup.mismatch --make-pgen --out TEMP2 --threads 24
plink2 --pfile TEMP2 --rm-dup --make-bed --out all_phase3 --threads 24
rm TEMP*
#### pgen to --make-bed
wget https://www.cog-genomics.org/static/bin/plink/glist-hg19 -O glist-hg19
wget https://www.cog-genomics.org/static/bin/plink/glist-hg38 -O glist-hg38
plink --bfile all_phase3 --allow-extra-chr --fst --within all_phase3.clst --out all_phase3 --threads 24
grep -w -f candidate.list glist-hg19 > extract.txt
plink --bfile all_phase3 --allow-extra-chr --extract range extract.txt --fst --within all_phase3.clst --out 2019ncov --threads 24
awk '$5>0.1' 2019ncov.fst
#### extract 
grep -f top6.txt /mnt/sas0/AD/sguo234/db/1000Genome/all_phase3.fst

#### X, Y, MT, PAR1, PAR2 to 23,24,25,26,27
sed -i 's/X/23/g' all_phase3.bim
sed -i 's/Y/24/g' all_phase3.bim
sed -i 's/MT/25/g' all_phase3.bim
sed -i 's/PAR1/26/g' all_phase3.bim
sed -i 's/PAR2/27/g' all_phase3.bim
#### calculate Fst
plink --bfile all_phase3 --filter-females --allow-extra-chr --fst --within all_phase3.clst --out all_phase3.chrX --threads 24
plink --bfile all_phase3 --freq

Genome-wide Complex Trait Analysis (GCTA)

cd ~/tools/
# https://cnsgenomics.com/software/gcta/#Download
wget https://cnsgenomics.com/software/gcta/bin/gcta_1.93.0beta.zip
unzip gcta_1.93.0beta.zip
plink --bfile EUR.QC --clump-p1 1 --clump-r2 0.1 --clump-kb 250 --clump Height.QC.Transformed --clump-snp-field SNP --clump-field P --out EUR

Utilize zstd to enhance compression

git clone https://github.com/facebook/zstd.git
cd zstd
make
awk '{print $1}' all_phase3.bim | uniq 
# compare gzip and zstd
gzip -c all_phase3.rsid > all_phase3.rsid.gz &
zstd -19 all_phase3.rsid -o all_phase3.rsid.zst &

ACE2, located in chrX

#### male 
plink --bfile all_phase3 --allow-extra-chr --fst --within all_phase3.clst --out all_phase3.chrX --threads 24

Cheat Sheet for Default Settings:

You might wonder about the '25'. Non-autosomal chromosomes can also be identified by numeric code: if there are n autosomes, n+1 is the X chromosome, n+2 is Y, n+3 is XY, and n+4 is MT.

Reference:

perl HRC-1000G-check-bim.pl -b RA3000.R3.bim -f RA3000.R3.frq -r 1000GP_Phase3_combined.legend -g -p EAS

Position Matches
 ID matches 1000G 0
 ID Doesn't match 1000G 625863
 Total Position Matches 625863
ID Match
 Different position to 1000G 14
No Match to 1000G 67125
Skipped (X, XY, Y, MT) 27712
Total in bim file 738980
Total processed 720714

Indels (ignored in r1) 11268

SNPs not changed 101110
SNPs to change ref alt 501088
Strand ok 602197
Total Strand ok 602198


Strand to change 1
Total checked 625877
Total checked Strand 602198
Total removed for allele Frequency diff > 0.2 1394
Palindromic SNPs with Freq > 0.4 983

Non Matching alleles 22696
ID and allele mismatching 22694; where 1000G is . 0
Duplicates removed 6998

perl HRC-1000G-check-bim.pl -b MCRI20000.bim -f MCRI20000.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h

Position Matches
 ID matches HRC 0
 ID Doesn't match HRC 408295
 Total Position Matches 408295
ID Match
 Different position to HRC 0
No Match to HRC 115190
Skipped (X, XY, Y, MT) 2030
Total in bim file 525515
Total processed 525515

Indels (ignored in r1) 0

SNPs not changed 83800
SNPs to change ref alt 322754
Strand ok 406423
Total Strand ok 406554

Strand to change 220
Total checked 408295
Total checked Strand 406643
Total removed for allele Frequency diff > 0.2 1902
Palindromic SNPs with Freq > 0.4 433

Non Matching alleles 1219
ID and allele mismatching 1219; where HRC is . 516
Duplicates removed 0

related database

PLINK http://pngu.mgh.harvard.edu/purcell/plink/
• MaCH http://csg.sph.umich.edu//abecasis/mach/tour/imputation.html
• HAPIUR https://code.google.com/p/hapi-ur/
• All the annotation sources
– 1000Genomes http://www.1000genomes.org/
– CADD http://cadd.gs.washington.edu/
– ClinVar http://www.ncbi.nlm.nih.gov/clinvar/
– Conservation http://mendel.stanford.edu/SidowLab/downloads/gerp/ , http://compgen.cshl.edu/phast/helppages/phyloP.txt , http://compgen.cshl.edu/phast/help-pages/phastCons.txt
– ESP6500 http://evs.gs.washington.edu/EVS/
– ExAC http://exac.broadinstitute.org/
– FunctionalMutation http://varianttools.sourceforge.net/Annotation/DbNSFP
– GRASP2 http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:GRASP2
– GTExEqtl http://www.gtexportal.org/home/
– GWAVA https://www.sanger.ac.uk/sanger/StatGen_Gwava
– Haploreg http://www.broadinstitute.org/mammals/haploreg/haploreg.php
– Interpro http://www.ebi.ac.uk/interpro/
– RegulomeDB http://regulomedb.org/

RNA-seq data to reveal novel response mechanism to bacterial

1928-02-28 00:00:00 +0000

bacterial RNA-seq analysis with Rockhopper: RNA-seq data to reveal novel response mechanism to bacterial within host wound tissues

Here is a small pipeline to do microbial RNA-seq analysis with [Rockhopper][1], Here I suppose you have pair-end RNA-seq data and single-end reads will be much easier for which I will not show it in this post:  

Step 1: Download Rockhopper including both Windows and Java versions. Please remember download Rockhopper.exe is important since it can be used to create reference much easier. You can download the reference from windows system and copy them to your Linux server. 
    wget https://cs.wellesley.edu/~btjaden/Rockhopper/download/current/Rockhopper.exe
    wget http://cs.wellesley.edu/~btjaden/Rockhopper/download/current/Rockhopper.jar
Step 2: create reference: 
Step 3: prepare the running script with Perl and bash
ls *.fastq.gz | paste - - > input.txt
perl -p -i -e 's/\s/\&/'  input.txt
paste 
#########################################################################################################
##### RNA-seq data to reveal novel response mechanism to bacterial within host wound tissues ###########
#########################################################################################################
## 02/04/2020
wget http://cs.wellesley.edu/~btjaden/Rockhopper/download/current/Rockhopper.jar
cd ~/hpc/project/RnaseqBacterial/extdata/rnaseq
genome_DIR1=~/hpc/project/RnaseqBacterial/extdata/rnaseq/Rockhopper_Results/genomes/Staphylococcus_aureus_subsp__aureus_USA300_FPR3757
genome_DIR2=

mkdir temp
for i in $(ls *.fastq.gz | rev | cut -c 17- | rev | uniq)
do
echo $i
echo \#PBS -N $i  > $i.job
echo \#PBS -l nodes=1:ppn=12 >> $i.job
echo \#PBS -M Guo.shicheng\@marshfieldresearch.org >> $i.job
echo \#PBS -m abe  >> $i.job
echo \#PBS -o $(pwd)/temp/ >>$i.job
echo \#PBS -e $(pwd)/temp/ >>$i.job
echo cd $(pwd) >> $i.job
echo java -Xmx1200m -cp Rockhopper.jar Rockhopper -g $genome_DIR1 $i\_R1_001.fastq.gz%$i\_R2_001.fastq.gz -o ./Rockhopper_Results/$i >> $i.job
qsub  $i.job
done

How do download nr.faa for diamond and samsa2

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
diamond makedb --in nr.gz -d nr

Here is R script to summarize microbial dual-transcriptome data


setwd("//mcrfnas2/bigdata/Genetic/Projects/shg047/project/RnaseqBacterial/extdata/rnaseq/Rockhopper_Results")

perl -p -i -e "s/\'/-/g" gene.tab.matrix.v4.txt
perl -p -i -e "s/\(/-/g" gene.tab.matrix.v4.txt
perl -p -i -e "s/\)/-/g" gene.tab.matrix.v4.txt
perl -p -i -e "s/\,/-/g" gene.tab.matrix.v4.txt
perl -p -i -e "s/\:/-/g" gene.tab.matrix.v4.txt
perl -p -i -e "s/\;/-/g" gene.tab.matrix.v4.txt

data<-read.table("gene.tab.matrix.v4.txt",head=T,row.names=1,sep="\t",as.is=T,check.names = F)
head(data)
dim(data)

data<-data[order(apply(data[,1:24],1,function(x) sum(x>0)),decreasing = T),]
data<-data[,c(order(as.numeric(colnames(data[1:24]))),25,26)]

data$pc<-percent(apply(data[,1:24],1,function(x) sum(x>0))/24)
data$mean<-round(apply(data[,1:24],1,function(x) mean(x)),0)
data$sd<-round(apply(data[,1:24],1,function(x) sd(x)),0)
data$cv<-round(apply(data[,1:24],1,function(x) cv(x)),2)

head(data)
write.csv(data,file="gene.tab.matrix.v6.csv",quote=F,row.names = T)

GOI1<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/RnaseqBacterial/master/extdata/interestlist1.txt")
GOI2<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/RnaseqBacterial/master/extdata/interestlist2.txt")

x1<-tolower(GOI1[,1])
x2<-tolower(GOI1[,1])
x3<-tolower(data$N1)

x1[which( ! x1 %in% x3)]
x2[which( ! x2 %in% x3)]

out1<-na.omit(data[match(tolower(GOI1[,1]),tolower(data$N1)),])
out2<-na.omit(data[match(tolower(GOI2[,1]),tolower(data$N1)),])
out1
out2
write.csv(out1,file="gene.tab.matrix.table3.csv",quote=F)
write.csv(out2,file="gene.tab.matrix.table4.csv",quote=F)

RNA-seq data with metaphlan2 to reveal novel response mechanism to bacterial

1928-02-28 00:00:00 +0000

  • metaphlan2 is based on python 2.7 and don’t require to download NR database since it use its own annotated database. if it is first time to comply py2, maybe you need restart the ternimal before install metaphlan2
wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh

conda create --name py2 python=2.7
conda activate py2

conda install -c bioconda metaphlan2
metaphlan2.py --help

EAS and SAS data in GenomeAsia100K panel

1926-02-28 00:00:00 +0000

  • STEP1: download GenomeAsia100K panel data
    # GenomeAsia100K data are available at the links below. Users can download GA100K data in compressed Variant Call Format (VCF) file.
    mkdir ~/hpc/db/GenomeAsia100K
    cd ~/hpc/db/GenomeAsia100K
    for i in {1..22}
    do
    wget --no-check-certificate https://browser.genomeasia100k.org/service/web/download_files/$i.substitutions.annot.cont_withmaf.vcf.gz &
    done
    
  • STEP2: filter EAS and SAS common SNPs: INFO/AF_SAS>0.01 \&INFO/AF_SEA>0.01
    cd ~/hpc/rheumatology/RA/meta3000/MIR
    panel="hsa.gff3"
    mkdir temp
    for i in {1..22} X Y
    do
    echo \#PBS -N $i  > $i.job
    echo \#PBS -l nodes=1:ppn=1 >> $i.job
    echo \#PBS -M Guo.shicheng\@marshfieldresearch.org >> $i.job
    echo \#PBS -o $(pwd)/temp/ >>$i.job
    echo \#PBS -e $(pwd)/temp/ >>$i.job
    echo cd $(pwd) >> $i.job
    echo \# bcftools norm -m \+ ~/hpc/db/GenomeAsia100K/$i.substitutions.annot.cont_withmaf.vcf.gz -Oz -o $i.substitutions.annot.cont_withmaf.norm.vcf.gz >> $i.job
    echo \# tabix -p vcf $i.substitutions.annot.cont_withmaf.norm.vcf.gz >> $i.job
    echo bcftools view -v snps -f PASS -i \'INFO/AF_SAS\>0.01 \&INFO/AF_SEA\>0.01\' -T $panel.hg19.bed $i.substitutions.annot.cont_withmaf.norm.vcf.gz -Oz -o  gnomad.genomes.r2.1.sites.chr$i.rec.$panel.vcf.gz >>$i.job
    echo bcftools sort $i.substitutions.annot.cont_withmaf.norm.vcf.gz -Oz -o $i.substitutions.annot.cont_withmaf.norm.sort.vcf.gz >> $i.job
    echo bcftools norm -d $i.substitutions.annot.cont_withmaf.norm.sort.vcf.gz -Oz -o $i.substitutions.annot.cont_withmaf.sort.vcf.gz >> $i.job
    echo bcftools view -m2 -M2 -v snps $i.substitutions.annot.cont_withmaf.sort.vcf.gz -Oz -o $i.substitutions.annot.cont_withmaf.$panel.sort.vcf.gz >>$i.job
    qsub $i.job
    done
    

Genomics, Genomics and Related Disciplines

1925-02-28 00:00:00 +0000

  • Comparative genomics
  • Functional genomics
  • Population genomics
  • Structural genomics
  • Epigenomics
  • Metagenomics
  • Genomic medicine
  • Synthetic biology and bioengineering
  • Computational genomics
  • Pathogenomics
  • Personal genomics
  • Transcriptomics
  • Psychogenomics