Today, I will give a talk about “Next generation protocol to bcftools in medical genetics research” in MCRI research hub meeting. As we know, bcftools, vcftools, plink2, GATK4 have been widely used in medical genetics and population genetics research. The usage of these tools require lots of experiences. However, the original protocols are quite limited espeically lacking of real-data example. Here, I will provide the real-data examples and solution to most frequently problem we meet in the usage of these tools.

bcftools view

bcftools view is the most frequent command to use for SNPs filtering, sample filtering, format changing.

bcftools view -i '(IMP=1 & R2>0.6)|IMPUTED=0' chr$i.dose.dbSNP.hg19.vcf.gz |  bcftools annotate -x ^FORMAT/GT -Oz -o chr$i.dose.dbSNP.clean.hg19.vcf.gz

bcftools view -i \'R2\>0.6\|TYPED=1\|TYPED_ONLY=1\' -Oz chr$i.dose.vcf.gz -Oz -o chr$i.dose.filter.vcf.gz

bcftools annotation

# collect gene regions. don't forget to extend regions with -5K to +5K to region regions to cover promoter and enhancer SNPs
awk '{print $1,$2-5000,$3+5000,$4}' OFS="\t"  MUC.hg19.bed | bedtools sort -i > MUC.hg19.sort.bed

bcftools annotate -a ~/hpc/db/hg19/dbSNP152/dbSNP152.chr$i.hg19.vcf.gz -c ID  chr$i.dose.contig.vcf.gz -Oz -o chr$i.dose.dbSNP.hg19.vcf.gz >>$i.job

# merge MUC genotypes from chr1 to chr22
cd /gpfs/home/guosa/hpc/rheumatology/RA/he2020/impute/R3
ls chr*.dose.MUC.clean.hg19.vcf.gz > MUC.vcf.txt
bcftools concat -f MUC.vcf.txt -Oz -o MUC.hg19.vcf.gz
bcftools annotate -a ~/hpc/db/hg19/dbSNP152/dbSNP152.chr$i.hg19.vcf.gz -c ID  chr$i.dose.contig.vcf.gz -Oz -o MUC.hg19.vcf.gz

# https://github.com/Shicheng-Guo/AnnotationDatabase/blob/master/hg19/refGene.hg19.VCF.sort.bed.gz
# https://github.com/Shicheng-Guo/AnnotationDatabase/blob/master/hg19/refGene.hg19.VCF.sort.bed.gz.tbi

bcftools annotate -a ~/hpc/db/hg19/refGene.hg19.VCF.sort.bed.gz -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') MUC.hg19.vcf.gz -Oz -o MUC.anno.hg19.vcf.gz

# review dbSNPs from vcf.gz file
bcftools view -i '%iD=="rs35705950"' MUC.anno.hg19.vcf.gz | less -S 
bcftools view -i '%iD=="rs79920422"' MUC.anno.hg19.vcf.gz | less -S 

How to fix this strand flips for Michigan Imputation Server?

Error: More than 100 obvious strand flips have been detected. Please check strand. Imputation cannot be started! ``` chr2.1kg.phase3.v5a.dedup.norm.EUR.vcf.gz zcat chr2.1kg.phase3.v5a.dedup.norm.EUR.vcf.gz | awk ‘{print $1,$2,$3,$4,$5}’ OFS=”\t”| grep -v ‘.;’ > keep.txt bcftools view -T keep.txt chr2.1kg.phase3.v5a.dedup.norm.EUR.vcf.gz -Oz -o chr2.1kg.phase3.v5a.dedup.norm.clean.EUR.vcf.gz

you need remove the following SNPs from chr2.1kg.phase3.v5a.dedup.EUR.vcf.gz

^2 18004148 .;rs541466601 AGAGCCCA AGAGCCCG ^2 33172927 .;rs554173463 GC GA ^2 40484170 .;rs537996337 AAATAAATA AAATAAATG ^2 44234879 rs550419970;.;rs533534296 ATAAA ATAAATAAA ^2 45515468 .;rs544467622 CTTTTGT CTTTTAT ^2 47982412 .;rs536566627 GG GC ^2 74559847 .;rs564494041 TGT TGC ^2 77195037 .;rs548983836 TTAC TCAC ^2 84495638 rs573607330;.;rs569169043 AATAA AATAAATAA ^2 109192726 .;rs539410502 GTTTTGTTTTG GTTTTGTTTTC ^2 179073443 .;rs538697958 TTTTCC TTTGCC ^2 206417755 rs574286642;.;rs576039119 AGAA AGAAGAA




### vcftools

zcat dbSNP152.chr1.hg19.vcf.gz | vcf-sort -p 16 -t ./temp/ | bgzip -c > dbSNP152.chr1.hg19.sort.vcf.gz & zcat dbSNP152.chr7.hg19.vcf.gz | vcf-sort -p 16 -t ./temp/ | bgzip -c > dbSNP152.chr7.hg19.sort.vcf.gz & zcat dbSNP152.chr8.hg19.vcf.gz | vcf-sort -p 16 -t ./temp/ | bgzip -c > dbSNP152.chr8.hg19.sort.vcf.gz & zcat dbSNP152.chr9.hg19.vcf.gz | vcf-sort -p 16 -t ./temp/ | bgzip -c > dbSNP152.chr9.hg19.sort.vcf.gz &

#### GATK
Some of my colleagues meet lots of GATK bugs. Please be sure GATK requires `Java 1.8` other Java will have some unexpected errors. On the other side, please download the database from [GATK Resource Bundle ftp server](https://software.broadinstitute.org/gatk/download/bundle) rather than other database. 

gatk CreateSequenceDictionary -R hg19.fa -O hg19.dict

#### SnpSift 

#### Example 1. How to build vcf annotation database for bcftools annotate

hg19 dbSNP

wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz.md5 wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz.tbi

hg38 dbSNP

wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz.md5 wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz.tbi

copy files from UW-Madison to MCRI

scp nu_guos@submit-1.chtc.wisc.edu:/home/nu_guos/All_20180423* ~/hpc/db/hg19/dbSNP

split with chrosome name

SNP orders are not correct, sort chr1,chr7,chr8,chr9

mkdir ./temp/chr1/ mkdir ./temp/chr7/ mkdir ./temp/chr8/ mkdir ./temp/chr9/ zcat dbSNP152.chr1.hg19.vcf.gz | vcf-sort -p 16 -t ./temp/ | bgzip -c > dbSNP152.chr1.hg19.sort.vcf.gz & zcat dbSNP152.chr7.hg19.vcf.gz | vcf-sort -p 16 -t ./temp/ | bgzip -c > dbSNP152.chr7.hg19.sort.vcf.gz & zcat dbSNP152.chr8.hg19.vcf.gz | vcf-sort -p 16 -t ./temp/chr8 | bgzip -c > dbSNP152.chr8.hg19.sort.vcf.gz & zcat dbSNP152.chr9.hg19.vcf.gz | vcf-sort -p 16 -t ./temp/chr9 | bgzip -c > dbSNP152.chr9.hg19.sort.vcf.gz & mv dbSNP152.chr8.hg19.sort.vcf.gz dbSNP152.chr8.hg19.sort.vcf.gz mv dbSNP152.chr9.hg19.sort.vcf.gz dbSNP152.chr9.hg19.sort.vcf.gz tabix -p vcf dbSNP152.chr8.hg19.sort.vcf.gz & tabix -p vcf dbSNP152.chr9.hg19.sort.vcf.gz & ```

Footnote:

human refGene hg19 TSS: https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/hg19/refGene_hg19_TSS.bed
human refGene hg38 TSS: https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/hg38/refGene.hg38.TSS.bed
All the figures are only used for non-profit education. reminding me if infrigement happens