28 February 1928

how do I prune second-degree-related samples? —rel-cutoff is obsolete and see —king-cutoff

plink --bfile RA1000 --bmerge RA500 --make-bed --out RA3000
plink --bfile RA3000 --impute-sex --make-bed --out RA3000.R1
grep PROBLEM RA3000.R1.sexcheck | awk '{print $2}' > sexcheck.exclude.txt
plink --bfile RA3000 --impute-sex --exclude --make-bed --out RA3000.R1
plink2 —bfile ... —king-cutoff 0.088 --maf 0.01 —make-bed --out myplink
### Script to check plink .bim files against HRC/1000G for strand, id names, positions, alleles, 
### ref/alt assignment, William Rayner 2015, wrayner@well.ox.ac.uk, Version 4.2.7
cd /mnt/sas0/AD/sguo234/asa
wget http://www.well.ox.ac.uk/~wrayner/tools/HRC-1000G-check-bim-v4.2.7.zip
wget http://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
wget https://www.well.ox.ac.uk/~wrayner/tools/1000GP_Phase3_combined.legend.gz
gunzip 1000GP_Phase3_combined.legend.gz
unzip HRC-1000G-check-bim-v4.2.7.zip
gunzip HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
wget http://qbrc.swmed.edu/zhanxw/software/checkVCF/checkVCF-20140116.tar.gz
tar xzvf checkVCF-20140116.tar.gz
## compare bim to HRC European Population(HEP)
perl HRC-1000G-check-bim.pl -b RA3000.R3.bim -f RA3000.R3.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
sh Run-plink.sh
## compare bim to 1000G East-Asian Population(EAP)
perl HRC-1000G-check-bim.pl -b RA3000.R3.bim -f RA3000.R3.frq -r 1000GP_Phase3_combined.legend -g -p EAS
sh Run-plink.sh
#### How to add rs ID to replace illumina array probes (exm) with bcf annotate (use -m-both to extent multi-allic SNPs to multiple row)

bcftools norm dbSNP153.hg19.vcf.gz --threads 48 -m-both -Oz -o dbSNP153.norm.hg19.vcf.gz
tabix -p vcf dbSNP153.norm.hg19.vcf.gz
bcftools annotate --threads 48 -c ID -a ~/db/dbSNP153/dbSNP153.norm.hg19.vcf.gz RA3000.R4.vcf.gz -Oz -o RA3000.R4.RS.vcf.gz
plink --bfile RA3000.R5 --maf 0.01 --hwe 0.01 --extract fstl1.bed --range --pheno RA3000.mphen --mpheno 1 --logistic --adjust --ci 0.95 --out FSTL1-RA-CTR
wget https://raw.githubusercontent.com/Shicheng-Guo/Gscutility/master/addfakegenotype.pl
perl addfakegenotype.pl > dbSNP153.hg19.plink.vcf
plink --vcf dbSNP153.hg19.plink.vcf --make-bed --allow-extra-chr --out dbSNP153.hg19

How to output rs ID with bcftools

bcftools query -f '%ID\n' gnomad.genomes.r2.1.sites.chr16.rec.hsa.gff3.sort.rmdup.biallelic.vcf.bgz

How to output sample ID with bcftools

bcftools query -l gnomad.genomes.r2.1.sites.chr16.rec.hsa.gff3.sort.rmdup.biallelic.vcf.bgz

How to convert vcf to bed file: vcf2bed

for i in `ls gnomad.genomes.r2.1.sites.chr*.rec.hsa.gff3.sort.rmdup.biallelic.vcf.bgz`
do 
bcftools query -f '%CHROM\t%POS\t%ID\n' $i | awk '{print $1,$2-1,$2,$3}' OFS="\t"
done

Recently, colleagues always ask me what’s the best solution to replace SNP ids in plink map files. Actually, there are tens of different solutions and you only need to mask one of these and then apply it for all the times. Now let me show you several different solutions.

bcftools annotate

plink --bfile ROI --recode vcf --out ROI
bcftools view ROI.vcf -Oz -o ROI.vcf.gz
tabix -p vcf ROI.vcf.gz
bcftools annotate -a ~/hpc/db/hg19/dbSNP/All_20180423.hg19.vcf.gz -c ID ROI.vcf.gz -Oz -o ROI.hg19.vcf.gz

You need prepare the old id and new id relationship or mapping file before you use this function. However, the best solution is to replace the old id with chr:pos and then download chr:pos, new id from ucsc.

plink --bfile ROI --recode vcf --out ROI
bcftools view ROI.vcf -Oz -o ROI.vcf.gz
tabix -p vcf ROI.vcf.gz
bcftools annotate -a ~/hpc/db/hg19/dbSNP/All_20180423.hg19.vcf.gz -c ID ROI.vcf.gz -Oz -o ROI.hg19.vcf.gz
plink --file <input_prefix> --recode HV --snps-only just-acgt --out <output_prefix>

How to apply ANNOVAR to annotate big VCF files (-G -> AWK -> table_annovar)


wget http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz
scp root@101.133.145.142:/root/tools/annovar.latest.tar.gz ./
tar xzvf annovar.latest.tar.gz 
cd/home/mxiong/tools/annovar
annotate_variation.pl -downdb -buildver hg19 cytoBand humandb
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar tfbsConsSites humandb/ 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar targetScanS humandb/ 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar wgRna humandb/ &
annotate_variation.pl -downdb -buildver hg19 -webfrom ucsc gnomad_genome humandb/ 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar dbnsfp35a humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar ljb23_all humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar exac03 humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar esp6500siv2 humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar clinvar_20160302 humandb   
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar cosmic70 humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar icgc21 humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar nci60 humandb
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar dann humandb 
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar gwasCatalog humandb
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp150 humandb
annotate_variation.pl -buildver hg19 -downdb -webfrom annovar kaviar_20150923 humandb
bcftools view --threads 48 -G RA3000.R4.RS.vcf.gz -Oz -o RA3000.R4.RG.vcf.gz
zcat RA3000.R4.RG.vcf.gz| awk '{print $1,$2,$2,$4,$5,$3}\' OFS="\t"| grep -v '#' > RA3000.R4.avinput
table_annovar.pl RA3000.R4.avinput ~/tools/annovar/humandb/ -buildver hg19 -out RA3000 -remove -protocol refGene,cytoBand,avsnp150,dbnsfp35a -operation gx,r,f,f -nastring . -csvout -polish -xref ~/tools/annovar/humandb/gene_fullxref.txt

How to apply rvtest to do gene-based association study

wget https://github.com/zhanxw/rvtests/releases/download/v1.8.6/rvtests-20150104.tar.gz
tar xzvf rvtests-20150104.tar.gz
rvtest --inVcf RA3000.R5 --pheno RA3000.mphen --mpheno 1 --out RA3000-CTR --single wald,score
rvtest --inVcf RA3000.R5 --pheno RA3000.mphen --mpheno 2 --out RA3000-ILD --single wald,score

GenomeAsia100K

# GenomeAsia100K data are available at the links below. Users can download GA100K data in compressed Variant Call Format (VCF) file.
mkdir ~/hpc/db/GenomeAsia100K
cd ~/hpc/db/GenomeAsia100K
for i in {1..22}
do
wget --no-check-certificate https://browser.genomeasia100k.org/service/web/download_files/$i.substitutions.annot.cont_withmaf.vcf.gz &
done
# Round 1: run merging and try to find dupliates SNPs and then remove them 
rm mergelist.txt
for i in {1..22}
do
echo chr$i >> mergelist.txt
done
plink --merge-list mergelist.txt --make-bed --out 1000plink
# Round 2: remove duplicates SNPs which stored in `1000plink-merge.missnp`
mkdir temp
for i in {1..22}
do
echo $i
echo \#PBS -N $i > $i.job
echo \#PBS -l nodes=1:ppn=1 >> $i.job
echo \#PBS -M Guo.shicheng\@marshfieldresearch.org >> $i.job
echo \#PBS -m abe  >> $i.job
echo \#PBS -o $(pwd)/temp/ >>$i.job
echo \#PBS -e $(pwd)/temp/ >>$i.job
echo \#PBS -m abe  >> $i.job
echo cd $(pwd) >> $i.job
echo plink --bfile chr$i --alleleACGT --snps-only just-acgt --exclude 1000plink-merge.missnp --make-bed --out chr$i.uni >> $i.job
qsub  $i.job
done
# Round 3: Try to merge again
rm mergelist.txt
for i in {1..22}
do
echo chr$i >> mergelist.txt
done
plink --merge-list mergelist.txt --make-bed --out 1000plink

How to prepare PCA plot for your own data and 1000 Genome data.

awk '{print $4}' hsa.gff3.hg19.bed  | sort -u > hsa.gff3.hg19.snp
plink --bfile /gpfs/home/guosa/hpc/db/hg19/1000Genome/plink/G1000plink --extract hsa.gff3.hg19.snp --make-bed --out G1000.miRNA
plink --bfile ROI.RA3000.dbsnp --bmerge G1000.miRNA --allow-no-sex --make-bed --out ./PCA/ROI
plink --bfile ROI --threads 31 --cluster --mds-plot 2
plink --bfile ROI --threads 31 --pca 2 'header' --out ROI

hapmap2<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/hapmap2.pop",head=F)
hapmap3<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/hapmap3.pop",head=T)
G1000Sam <-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/1000G/1000GenomeSampleInfo.txt",head=F,as.is=T)
G1000Super<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/1000G/superpopulation.txt",head=F,sep="\t")
G1000Sam$superpop<-G1000Super[match(G1000Sam$V3,G1000Super$V1),]$V2
write.table(G1000Sam,file="1000GenomeSampleInfo.txt",quote=F,sep="\t",col.names=F,row.names=F)

eigenvec<-read.table("plink.eigenvec",head=T)
head(eigenvec)
pop<-as.character(G1000Sam[match(eigenvec[,2],G1000Sam[,2]),3])
super<-as.character(G1000Sam[match(eigenvec[,2],G1000Sam[,2]),5])
pop[is.na(pop)]<-"GHRA"
super[is.na(super)]<-"GHRA"
eigenvec$pop=pop
eigenvec$super=super
eigenvec$col=as.numeric(as.factor(super))
eigenvec$pch=as.numeric(as.factor(super))

set<-unique(data.frame(pch=eigenvec$pch,col=eigenvec$col,legend=eigenvec$super))

for(i in 1:15){
jpeg(paste("pca.super",i,".jpg",sep=""))
plot(eigenvec[,3],eigenvec[,4],pch=16,col=eigenvec$col+i,xlab="principle component 1",ylab="principle componment 2",cex.axis=1.5,cex.lab=1.5,cex=1)
legend("topright",pch=16,legend=set$legend,col=set$col+i,bty="n",cex=1)
dev.off()
}

Plink2 common operation (Step 1, decpress all the zst files, especially *pgen.zst)

for i in `ls *zst | rev | cut -c 5- | rev | uniq`
do
echo $i
plink2 --zst-decompress $i.zst > $i
done
cp phase3_corrected.psam all_phase3.psam

How to remove all the multiallelic SNPs in plink2 with –exclude

awk '$5~/,/{print}' all_phase3.pvar | awk '{print $3}' > multiallelic
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP --exclude TEMP2.rmdup.mismatch --make-pgen --out TEMP2 --threads 24
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP2 --rm-dup --make-bpgen --out all_phase3 --threads 24
rm TEMP*

How to remove all the multiallelic SNPs in plink2 with –max-alleles

#### Processing 1000 Genome data downloaded from plink2 website
for i in `ls *zst | rev | cut -c 5- | rev | uniq`
do
echo $i
plink2 --zst-decompress $i.zst > $i
done
cp phase3_corrected.psam all_phase3.psam

#### method 1 with awk and --exclude option
awk '$5~/,/{print}' all_phase3.pvar | awk '{print $3}' > multiallelic
plink2 --pfile all_phase3 --exclude multiallelic.txt --make-pgen --out TEMP --threads 24
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP --exclude TEMP2.rmdup.mismatch --make-pgen --out TEMP2 --threads 24
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP2 --rm-dup --make-bed --out all_phase3 --threads 24
rm TEMP*
#### method 2 with --max-alleles 2
plink2 --pfile all_phase3 --max-alleles 2 --make-pgen --out TEMP --threads 24
plink2 --pfile TEMP --rm-dup --make-bpgen --out TEMP2 --threads 24
plink2 --pfile TEMP --exclude TEMP2.rmdup.mismatch --make-pgen --out TEMP2 --threads 24
plink2 --pfile TEMP2 --rm-dup --make-bed --out all_phase3 --threads 24
rm TEMP*
#### pgen to --make-bed
wget https://www.cog-genomics.org/static/bin/plink/glist-hg19 -O glist-hg19
wget https://www.cog-genomics.org/static/bin/plink/glist-hg38 -O glist-hg38
plink --bfile all_phase3 --allow-extra-chr --fst --within all_phase3.clst --out all_phase3 --threads 24
grep -w -f candidate.list glist-hg19 > extract.txt
plink --bfile all_phase3 --allow-extra-chr --extract range extract.txt --fst --within all_phase3.clst --out 2019ncov --threads 24
awk '$5>0.1' 2019ncov.fst
#### extract 
grep -f top6.txt /mnt/sas0/AD/sguo234/db/1000Genome/all_phase3.fst

#### X, Y, MT, PAR1, PAR2 to 23,24,25,26,27
sed -i 's/X/23/g' all_phase3.bim
sed -i 's/Y/24/g' all_phase3.bim
sed -i 's/MT/25/g' all_phase3.bim
sed -i 's/PAR1/26/g' all_phase3.bim
sed -i 's/PAR2/27/g' all_phase3.bim
#### calculate Fst
plink --bfile all_phase3 --filter-females --allow-extra-chr --fst --within all_phase3.clst --out all_phase3.chrX --threads 24
plink --bfile all_phase3 --freq

Genome-wide Complex Trait Analysis (GCTA)

cd ~/tools/
# https://cnsgenomics.com/software/gcta/#Download
wget https://cnsgenomics.com/software/gcta/bin/gcta_1.93.0beta.zip
unzip gcta_1.93.0beta.zip
plink --bfile EUR.QC --clump-p1 1 --clump-r2 0.1 --clump-kb 250 --clump Height.QC.Transformed --clump-snp-field SNP --clump-field P --out EUR

Utilize zstd to enhance compression

git clone https://github.com/facebook/zstd.git
cd zstd
make
awk '{print $1}' all_phase3.bim | uniq 
# compare gzip and zstd
gzip -c all_phase3.rsid > all_phase3.rsid.gz &
zstd -19 all_phase3.rsid -o all_phase3.rsid.zst &

ACE2, located in chrX

#### male 
plink --bfile all_phase3 --allow-extra-chr --fst --within all_phase3.clst --out all_phase3.chrX --threads 24

Cheat Sheet for Default Settings:

You might wonder about the '25'. Non-autosomal chromosomes can also be identified by numeric code: if there are n autosomes, n+1 is the X chromosome, n+2 is Y, n+3 is XY, and n+4 is MT.

Reference:

perl HRC-1000G-check-bim.pl -b RA3000.R3.bim -f RA3000.R3.frq -r 1000GP_Phase3_combined.legend -g -p EAS

Position Matches
 ID matches 1000G 0
 ID Doesn't match 1000G 625863
 Total Position Matches 625863
ID Match
 Different position to 1000G 14
No Match to 1000G 67125
Skipped (X, XY, Y, MT) 27712
Total in bim file 738980
Total processed 720714

Indels (ignored in r1) 11268

SNPs not changed 101110
SNPs to change ref alt 501088
Strand ok 602197
Total Strand ok 602198


Strand to change 1
Total checked 625877
Total checked Strand 602198
Total removed for allele Frequency diff > 0.2 1394
Palindromic SNPs with Freq > 0.4 983

Non Matching alleles 22696
ID and allele mismatching 22694; where 1000G is . 0
Duplicates removed 6998

perl HRC-1000G-check-bim.pl -b MCRI20000.bim -f MCRI20000.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h

Position Matches
 ID matches HRC 0
 ID Doesn't match HRC 408295
 Total Position Matches 408295
ID Match
 Different position to HRC 0
No Match to HRC 115190
Skipped (X, XY, Y, MT) 2030
Total in bim file 525515
Total processed 525515

Indels (ignored in r1) 0

SNPs not changed 83800
SNPs to change ref alt 322754
Strand ok 406423
Total Strand ok 406554

Strand to change 220
Total checked 408295
Total checked Strand 406643
Total removed for allele Frequency diff > 0.2 1902
Palindromic SNPs with Freq > 0.4 433

Non Matching alleles 1219
ID and allele mismatching 1219; where HRC is . 516
Duplicates removed 0

related database

PLINK http://pngu.mgh.harvard.edu/purcell/plink/
• MaCH http://csg.sph.umich.edu//abecasis/mach/tour/imputation.html
• HAPIUR https://code.google.com/p/hapi-ur/
• All the annotation sources
– 1000Genomes http://www.1000genomes.org/
– CADD http://cadd.gs.washington.edu/
– ClinVar http://www.ncbi.nlm.nih.gov/clinvar/
– Conservation http://mendel.stanford.edu/SidowLab/downloads/gerp/ , http://compgen.cshl.edu/phast/helppages/phyloP.txt , http://compgen.cshl.edu/phast/help-pages/phastCons.txt
– ESP6500 http://evs.gs.washington.edu/EVS/
– ExAC http://exac.broadinstitute.org/
– FunctionalMutation http://varianttools.sourceforge.net/Annotation/DbNSFP
– GRASP2 http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:GRASP2
– GTExEqtl http://www.gtexportal.org/home/
– GWAVA https://www.sanger.ac.uk/sanger/StatGen_Gwava
– Haploreg http://www.broadinstitute.org/mammals/haploreg/haploreg.php
– Interpro http://www.ebi.ac.uk/interpro/
– RegulomeDB http://regulomedb.org/


blog comments powered by Disqus