On 01/19/2020, I will give a talk on “UK Biobank Whole-Exome Sequence Binary Phenome Analysis with Robust Region-Based Rare-Variant Test” (The American Journal of Human Genetics, December 19, 2019) at the Marshfield Clinic Research Institute. Please check the PPT I prepared. I have implemented the pipeline on the Marshfield Clinic SuperServer HPC cluster. The figure I prepared for this post shows GWAS results for colon cancer (CRC), RA and ESCC based on the UKBB 50K exome-sequencing data. Several genes look very interesting, for example SCL21A2.
Continue reading...
Here, I want to summarize epigenome research on human cancers using WGBS.
Continue reading...
Here, I want to summarize population genetics and allele frequencies in East Asians.
Continue reading...
Recently, I needed to give a talk about genetic and epigenetic intra-tumor heterogeneity (ITH).
Continue reading...
Recently, I needed to give a talk about electronic health/medical record (EHR/EMR) data analysis. Why do we need to analyze EHR and EMR data? What are the benefits and the challenges? How do we apply text mining and data mining to unstructured or semi-structured data? How do we deal with compliance and HIPAA? How do we deal with multiple EMR systems and differing formats? How important is genetic/genomic information in health care management? How important is the patient’s health history? Is there a mature pipeline to extract important clinical information? What about cancer real-world data (RWD) analysis, or re-learning from NCBI PubMed data? How do we connect bioinformatics teams with data scientists and medical informatics scientists?
Continue reading...
How should artificial intelligence be applied in medical diagnosis and medical devices under US FDA policy? More and more AI-based patents are appearing in medical diagnosis and medical services; however, the FDA has its own policy on the use of AI in medical products. Here, we discuss the best approach to building machine learning and AI into a production strategy. Usually, we have a training dataset and a test dataset in machine learning modeling. Here, I will use an example to introduce the best way to apply AI in your production design.
Continue reading...
“Artificial intelligence in risk prediction and Mendelian disease diagnosis”: Today, I want to give a talk about the application of AI in risk prediction and disease diagnosis with deep sequencing and deep phenotyping. As the technology has matured, we can now generate deep sequencing data within a short time; however, deep phenotyping is not easy, which has limited the development of genetic association studies. In fact, we can use **molecular phenotypes** in place of “clinical or external phenotypes”.
Continue reading...
Today, I submitted the letter of intent for the internal grant application at the Marshfield Clinic Research Institute as the PI. The proposal I submitted this year is titled: Deep Learning Prediction of Chemotherapy Response using Multi-Omics Features. As with other grants, the first step of the application is to submit a letter of intent. You need to prepare the title, PI/co-PIs, and a research abstract covering background, hypothesis, aims, strategy and significance. You don’t need signatures from all the co-PIs at this step. Usually the research abstract should be under 500 words. You should also suggest 2 potential reviewers for your abstract.
Continue reading...
Today, I will give a talk about “Genome-wide cell-free DNA fragmentation in patients with cancer” at the MCRI postdoc club meeting. The paper was published in Nature recently. See the slides here or download them here for more details.
Continue reading...
Recently, our collaborative project with Dr. Dongyi He on the effects of (5R)-5-Hydroxytriptolide on epigenetic changes of LFS from RA was accepted by Scientific Reports.
Continue reading...
Recently (07/09/2019), dbSNP was updated from dbSNP152 to dbSNP153. However, NCBI only provides dbSNP153 for hg38 (GRCh38), with no source for an hg19 version. Here, I prepared an approach to generate dbSNP153 in hg19.
Continue reading...
Recently, our collaborative project with Dr. Jiucun Wang, titled “A gene-based recessive diplotype exome scan discovers FGF6, a novel hepcidin-regulating iron-metabolism gene”, was accepted by Blood.
Continue reading...
Today, I will give a talk about “Next generation protocol to bcftools in medical genetics research” at the MCRI research hub meeting. As we know, bcftools, vcftools, plink2 and GATK4 are widely used in medical genetics and population genetics research. Using these tools well requires a lot of experience. However, the original protocols are quite limited, especially in their lack of real-data examples. Here, I will provide real-data examples and solutions to the most frequent problems we meet when using these tools.
Continue reading...
One of the most painful things at the Marshfield Clinic Research Institute (MCRI) is the software/package installation request. Any software, even RStudio, R or Python, is reviewed by the ITS department for security issues. Furthermore, the review process usually takes 2-3 months, which makes research quite difficult. The good news is that you can submit a request as early as possible if you think the software might be needed in the coming months. Finally, ITS doesn’t provide a list of software that has already been approved. Therefore, it is worth listing them here, and I think it will be helpful for future research fellows at MCRI.
Continue reading...
deepTools is a suite of Python tools developed for the efficient analysis of high-throughput sequencing data, such as ChIP-seq, RNA-seq or MNase-seq. deepTools has been widely applied in BAM/bigWig data analysis. Here, I show some examples of how to use deepTools in MBD-seq and MeDIP-seq methylation data analysis. MACS usage will also be shown in this post.
Continue reading...
Today, I will give a talk about how to analyze multiple bedGraph files from ChIP-seq or MBD-seq data with Intervene. Intervene is a tool for intersection and visualization of multiple genomic region and gene sets (or lists of items). It provides an easy and automated interface for effective intersection and visualization of genomic region sets or lists of items, thus facilitating their analysis and interpretation.
Continue reading...
Recently, colleagues have often asked me what the best way is to update HapMap3 from hg18 to hg19 and hg38. Here, I try to give some solutions.
Continue reading...
Here, characterizing populations with principal component analysis
Continue reading...
Here, Data Science in Population Genetics and Medical Genetics
Continue reading...
Here, I want to summarize the genetic and epigenetic differences between rheumatoid arthritis (RA) and osteoarthritis (OA). OA and RA share certain similarities, such as joint damage.
Continue reading...
The classification decisions made by machine learning models are usually difficult - if not impossible - to understand by our human brains. The complexity of some of the most accurate classifiers, like neural networks, is what makes them perform so well - often with better results than achieved by humans. But it also makes them inherently hard to explain, especially to non-data scientists.
Continue reading...
Recently, I was on Gran Canaria for a vacation. So, what better way to keep up the holiday spirit a while longer than to visualize all the places we went in R!?
Continue reading...
In my last post, where I shared the code that I used to produce an example analysis to go along with my webinar on building meaningful models for disease prediction, I mentioned that it is advised to consider over- or under-sampling when you have unbalanced data sets. Because my focus in this webinar was on evaluating model performance, I did not want to add an additional layer of complexity and therefore did not further discuss how to specifically deal with unbalanced data.
Continue reading...
Today, I want to show how I use Thomas Lin Pedersen’s awesome ggraph package to plot decision trees from Random Forest models.
Continue reading...
Last week I showed how to build a deep neural network with h2o and rsparkling. As we could see there, it is not trivial to optimize the hyper-parameters for modeling. Hyper-parameter tuning with grid search allows us to test different combinations of hyper-parameters and find one with improved accuracy.
Continue reading...
Last week, I introduced how to run machine learning applications on Spark from within R, using the sparklyr package. This week, I am showing how to build feed-forward deep neural networks or multilayer perceptrons. The models in this example are built to classify ECG data into being either from healthy hearts or from someone suffering from arrhythmia. I will show how to prepare a dataset for modeling, setting weights and other modeling parameters and finally, how to evaluate model performance with the h2o package via rsparkling.
Continue reading...
This week I want to show how to run machine learning applications on a Spark cluster. I am using the sparklyr package, which provides a handy interface to access Apache Spark functionalities via R.
Continue reading...
When running an analysis, I am usually combining functions from multiple packages. Most of these packages come with their own plotting functions. And while they are certainly convenient in that they allow me to get a quick glance at the data or the output, they all have their own style. If I want to prepare a report, proposal or a paper though, I want all my plots to come from a single cast so that they give a consistent feel to the story I want to tell with my data.
Continue reading...
Today, I want to share my analysis of the World Gender Statistics dataset.
Continue reading...
In my last post, I built a shiny app to explore World Gender Statistics.
Continue reading...
This week I explored the World Gender Statistics dataset. You can look at 160 measurements over 56 years with my Shiny app here.
Continue reading...
I’m an avid R user and rarely use anything else for data analysis and visualisations. But while R is my go-to, in some cases, Python might actually be a better alternative.
Continue reading...
Machine learning uses so called features (i.e. variables or attributes) to generate predictive models. Using a suitable combination of features is essential for obtaining high precision and accuracy. Because too many (unspecific) features pose the problem of overfitting the model, we generally want to restrict the features in our models to those, that are most relevant for the response variable we want to predict. Using as few features as possible will also reduce the complexity of our models, which means it needs less time and computer power to run and is easier to understand.
Continue reading...
It’s no secret that Google Big Brothers most of us. But at least they allow us to access quite a lot of the data they have collected on us. Among this is the Google location history.
Continue reading...
With the upcoming holidays, I thought it fitting to finally explore the ttbbeer package. It contains data on beer ingredients used in US breweries from 2006 to 2015 and on the (sin) tax rates for beer, champagne, distilled spirits, wine and various tobacco items since 1862.
Continue reading...
This app is based on the gwascat R package and its ebicat38 database and shows trait-associated SNP locations of the human genome. You can visualize and compare the genomic locations of up to 8 traits simultaneously.
Continue reading...
In my last post I created a gene homology network for human genes. In this post I want to extend the network to include edges for other species.
Continue reading...
Edited on 20 December 2016
Continue reading...
In last week’s post I explored whether machine learning models can be applied to predict flu deaths from the 2013 outbreak of influenza A H7N9 in China. There, I compared random forests, elastic-net regularized generalized linear models, k-nearest neighbors, penalized discriminant analysis, stabilized linear discriminant analysis, nearest shrunken centroids, single C5.0 tree and partial least squares.
Continue reading...
Edited on 26 December 2016
Continue reading...
Last week’s post showed how to create a Gilmore Girls character network.
Continue reading...
With the impending (and by many - including me - much awaited) Gilmore Girls Revival, I wanted to take a somewhat different look at our beloved characters from Stars Hollow.
Continue reading...
When working with any type of genome data, we often look for annotation information about genes, e.g. what’s the gene’s full name, what’s its abbreviated symbol, what ID it has in other databases, what functions have been described, how many and which transcripts exist, etc.
Continue reading...
I created the R package exprAnalysis designed to streamline my RNA-seq data analysis pipeline. Below you find the vignette for installation and usage of the package.
Continue reading...
wget https://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.38.gz
wget https://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.38.gz.md5
wget https://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.38.gz.tbi
wget https://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.38.gz.tbi.md5
wget https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/GCF_000001405.38_GRCh38.p12_assembly_report.txt
gawk -v RS="(\r)?\n" 'BEGIN { FS="\t" } !/^#/ { if ($10 != "na") print $7,$10; else print $7,$5 }' GCF_000001405.38_GRCh38.p12_assembly_report.txt > dbSNP-to-UCSC-GRCh38.p12.map
perl -p -i -e '{s/chr//}' dbSNP-to-UCSC-GRCh38.p12.map
bcftools annotate --rename-chrs dbSNP-to-UCSC-GRCh38.p12.map GCF_000001405.38.gz | gawk '/^#/ && !/^##contig=/ { print } !/^#/ { if( $1!="na" ) print }' | bgzip -c > GCF_000001405.38.dbSNP153.GRCh38p12b.GATK.vcf.gz
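The gawk and perl steps above build the rename map that bcftools reads. The same logic can be sketched in Python, which may be easier to adapt. This is only a minimal sketch: it assumes the assembly report keeps the tab-separated layout that the gawk command relies on (column 5 GenBank accession, column 7 RefSeq accession, column 10 UCSC-style name). The function name `build_rename_map` is hypothetical, and the two sample rows are illustrative, not copied from the real report.

```python
def build_rename_map(report_lines):
    """Return (RefSeq accession, new contig name) pairs from assembly-report lines."""
    pairs = []
    for line in report_lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and the "#" header/comment lines
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        genbank, refseq, ucsc = fields[4], fields[6], fields[9]
        # Prefer the UCSC-style name; fall back to the GenBank accession when it is "na"
        new_name = ucsc if ucsc != "na" else genbank
        # Mirror the perl one-liner: drop the first "chr" so names read 1, 2, ..., X
        pairs.append((refseq, new_name.replace("chr", "", 1)))
    return pairs


# Illustrative rows only (hypothetical accessions in the second row)
sample_rows = [
    "# Sequence-Name\tSequence-Role\t...",
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\tNC_000001.11\tPrimary Assembly\t248956422\tchr1",
    "scaffoldX\tunplaced-scaffold\tna\tna\tKI000000.1\t=\tNW_000000000.1\tPrimary Assembly\t1000\tna",
]

for old, new in build_rename_map(sample_rows):
    print(old, new)
```

Writing each pair as one space-separated line gives a file in the same two-column form as dbSNP-to-UCSC-GRCh38.p12.map.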
git config --global user.name "Shicheng-Guo"
git config --global user.email "Shicheng.Guo@hotmail.com"
git config --global color.ui true
git config --global core.editor emacs
ssh-keygen -t rsa -C "Shicheng.Guo@hotmail.com"
less ~/.ssh/id_rsa.pub
ssh -T git@github.com
Hi username! You've successfully authenticated, but Github does not provide shell access.
This post records the environment settings for my previous workstations. The best way would be to record the installation steps for all these tools; however, they are updated constantly, so here I only list them.
Continue reading...
This post shows how to use STAR and HTSeq together to estimate gene expression from RNA-seq data:
Continue reading...
Book
Continue reading...
DNA methylation
Continue reading...
How to apply RIblast to predict ncRNA interaction targets
```
cd ~/hpc/tools/RIblast/extdata
wget http://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
wget http://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/mrna.fa.gz -O Homo_sapiens.GRCh38.mrna.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.transcripts.fa.gz
```
Continue reading...
Here is the best solution:
Continue reading...
Immune escape mechanisms
Continue reading...
Step 1: create a BioProject (you will receive a PRJNA ID to be used in step 2: PRJNA605250)
Continue reading...
How do I prune second-degree-related samples? --rel-cutoff is obsolete; see --king-cutoff
plink --bfile RA1000 --bmerge RA500 --make-bed --out RA3000
plink --bfile RA3000 --impute-sex --make-bed --out RA3000.R1
grep PROBLEM RA3000.R1.sexcheck | awk '{print $1,$2}' > sexcheck.exclude.txt
plink --bfile RA3000 --impute-sex --remove sexcheck.exclude.txt --make-bed --out RA3000.R1
plink2 --bfile ... --king-cutoff 0.088 --maf 0.01 --make-bed --out myplink
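The sex-check filtering step can also be sketched in Python when the grep/awk one-liner needs more control. This is a minimal sketch, assuming the .sexcheck file has the usual PLINK 1.9 columns (FID, IID, PEDSEX, SNPSEX, STATUS, F). PLINK's --remove expects one FID and IID pair per line, so the sketch keeps both IDs. The function names and the example rows are hypothetical.

```python
def problem_samples(sexcheck_lines):
    """Return (FID, IID) pairs whose STATUS column is PROBLEM."""
    flagged = []
    for line in sexcheck_lines:
        fields = line.split()
        # Assumed column order: FID IID PEDSEX SNPSEX STATUS F
        if len(fields) >= 5 and fields[4] == "PROBLEM":
            flagged.append((fields[0], fields[1]))
    return flagged


def to_remove_file_text(pairs):
    """Format the pairs as the two-column text that plink --remove reads."""
    return "".join(f"{fid} {iid}\n" for fid, iid in pairs)


# Made-up example rows, for illustration only
example = [
    "FID IID PEDSEX SNPSEX STATUS F",
    "F1 S1 1 1 OK 0.98",
    "F2 S2 2 1 PROBLEM 0.95",
]
print(to_remove_file_text(problem_samples(example)), end="")
```

Matching on the STATUS column instead of grepping the whole line avoids false hits when a sample ID happens to contain the word PROBLEM.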
```
cd /mnt/sas0/AD/sguo234/asa
wget http://www.well.ox.ac.uk/~wrayner/tools/HRC-1000G-check-bim-v4.2.7.zip
wget http://ngs.sanger.ac.uk/production/hrc/HRC.r1-1/HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
wget https://www.well.ox.ac.uk/~wrayner/tools/1000GP_Phase3_combined.legend.gz
gunzip 1000GP_Phase3_combined.legend.gz
unzip HRC-1000G-check-bim-v4.2.7.zip
gunzip HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
wget http://qbrc.swmed.edu/zhanxw/software/checkVCF/checkVCF-20140116.tar.gz
tar xzvf checkVCF-20140116.tar.gz
perl HRC-1000G-check-bim.pl -b RA3000.R3.bim -f RA3000.R3.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
sh Run-plink.sh
perl HRC-1000G-check-bim.pl -b RA3000.R3.bim -f RA3000.R3.frq -r 1000GP_Phase3_combined.legend -g -p EAS
sh Run-plink.sh
Bacterial RNA-seq analysis with Rockhopper: using RNA-seq data to reveal novel response mechanisms to bacteria within host wound tissues
Continue reading...
# GenomeAsia100K data are available at the links below. Users can download GA100K data in compressed Variant Call Format (VCF) file.
mkdir -p ~/hpc/db/GenomeAsia100K
cd ~/hpc/db/GenomeAsia100K
for i in {1..22}
do
wget --no-check-certificate https://browser.genomeasia100k.org/service/web/download_files/$i.substitutions.annot.cont_withmaf.vcf.gz &
done
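Script A: convert final report to ped (fr2ped.pl)
#!/usr/bin/perl
# fr2ped.pl: convert one sample's FinalReport into a single PED line.
use strict;
use warnings;
my $file = shift @ARGV or die "Usage: perl fr2ped.pl <FinalReport.txt>\n";
open my $fh, '<', $file or die "Cannot open $file: $!\n";
my $i = 1;
while (<$fh>) {
    next if !/ZS/;    # keep only lines containing "ZS"
    my ($snp, $sam, $rs, $gc, $chr, $pos, $a1, $a2, undef) = split /\s+/;
    if ($i == 1) {
        # first genotype: print the six leading PED columns before the alleles
        print "$sam $sam 0 0 0 0 $a1 $a2";
    } else {
        print " $a1 $a2";
    }
    $i++;
}
close $fh;
print "\n";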
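Script B: convert final report to map (fr2map.pl)
#!/usr/bin/perl
# fr2map.pl: convert a FinalReport into MAP lines (chromosome, SNP ID, 0, position).
use strict;
use warnings;
my $file = shift @ARGV or die "Usage: perl fr2map.pl <FinalReport.txt>\n";
open my $fh, '<', $file or die "Cannot open $file: $!\n";
while (<$fh>) {
    next if !/ZS/;    # keep only lines containing "ZS"
    my ($snp, $sam, $rs, $gc, $chr, $pos, $a1, $a2, undef) = split /\s+/;
    print "$chr $rs 0 $pos\n";
}
close $fh;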
Step 3.0: run the script to do the job
```
# Strip the 16-character "_FinalReport.txt" suffix to get each sample prefix
for i in $(ls *.txt | rev | cut -c 17- | rev | uniq)
do
echo $i
perl ./fr2ped.pl ${i}_FinalReport.txt > $i.ped
done
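For readers who prefer Python, the two Perl scripts can be combined into one function. This is a minimal sketch, not a drop-in replacement: it assumes the column order implied by the Perl variable names (SNP name, sample ID, rs ID, GC score, chromosome, position, allele 1, allele 2) and that data rows contain the string "ZS", as the Perl filter does. The function name and the example rows are hypothetical.

```python
def final_report_to_ped_map(lines, marker="ZS"):
    """Return (ped_line, map_rows) built from FinalReport lines of one sample."""
    ped_parts = []
    map_rows = []
    for line in lines:
        # Same filter as the Perl scripts: ignore lines without the marker
        if marker not in line:
            continue
        snp, sam, rs, gc, chrom, pos, a1, a2 = line.split()[:8]
        if not ped_parts:
            # Six leading PED columns: FID IID father mother sex phenotype
            ped_parts.append(f"{sam} {sam} 0 0 0 0")
        ped_parts.append(f"{a1} {a2}")
        map_rows.append(f"{chrom} {rs} 0 {pos}")
    return " ".join(ped_parts), map_rows


# Made-up rows for illustration only
report = [
    "SNP Name\tSample ID\tRsID\tGC Score\tChr\tPosition\tAllele1\tAllele2",
    "snp1 ZS001 rs1 0.90 1 12345 A G",
    "snp2 ZS001 rs2 0.85 2 500 C C",
]
ped_line, map_rows = final_report_to_ped_map(report)
print(ped_line)
print("\n".join(map_rows))
```

Building both outputs in one pass keeps the PED genotype order and the MAP row order in step, which is what PLINK requires.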
Sometimes you may want to merge more than 7000 VCF files/samples into one big VCF file with bcftools merge; for example, PMRP has 20,000 samples/VCF files:
bcftools merge -l merge.txt -Oz -o merge.vcf.gz
If the number of samples/files is below 1021, everything is fine. However, if it is 1021 or more, bcftools merge reports:
[E::hts_idx_load3] Could not load local index file '229209.fstl1.vcf.gz.tbi'
Failed to open 229209.fstl1.vcf.gz: could not load index
Okay. Here is my final solution, developed based on WouterDeCoste’s post. I hope it is helpful. One of my friends told me his computer allowed merging 7000 VCF files at one time. I am not sure whether that is caused by a specific open-file setting of the operating system.
```
ls *.vcf.gz | split -l 500 - subset_vcfs
```
Continue reading...
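The idea of merging in batches can be sketched in Python as a command planner. This is a minimal sketch that only builds the bcftools command lines and does not run them. The batch size of 500 follows the split -l 500 step, while the function name and the subset_NNN.vcf.gz file names are hypothetical. The 1021 threshold is consistent with a per-process open-file limit of 1024 minus stdin, stdout and stderr, but that is an inference and is not confirmed here. The sketch also assumes every batch holds at least two files, since bcftools merge normally expects more than one input.

```python
def batch_merge_commands(vcf_paths, batch_size=500, prefix="subset"):
    """Plan a two-stage bcftools merge: per-batch merges, indexing, then a final merge."""
    commands = []
    batch_outputs = []
    for start in range(0, len(vcf_paths), batch_size):
        batch = vcf_paths[start:start + batch_size]
        out = f"{prefix}_{start // batch_size:03d}.vcf.gz"  # hypothetical naming scheme
        commands.append(["bcftools", "merge", "-Oz", "-o", out] + batch)
        batch_outputs.append(out)
    # Each intermediate file needs an index before it can be merged again
    commands += [["bcftools", "index", "-t", out] for out in batch_outputs]
    # Final merge of the intermediate files (assumes more than one batch)
    commands.append(["bcftools", "merge", "-Oz", "-o", "merge.vcf.gz"] + batch_outputs)
    return commands


# Placeholder file names, for illustration only
paths = [f"sample{i}.vcf.gz" for i in range(1200)]
for cmd in batch_merge_commands(paths):
    # Print a shortened view of each planned command
    print(" ".join(cmd[:6]), "..." if len(cmd) > 6 else "")
```

Each planned command could then be passed to `subprocess.run(cmd, check=True)` once the batch size has been checked against `ulimit -n` on the machine in question.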
Here, I summarize published work on automatic GWAS and post-GWAS analysis pipelines:
Continue reading...
1. Download ANNOVAR: http://annovar.openbioinformatics.org/en/latest/user-guide/download/
Continue reading...
Here, I list all the annotations used in my previous publications:
Continue reading...
Plan-A
Continue reading...
02/27/2021: Johnson & Johnson COVID-19 Vaccine Authorized by U.S. FDA For Emergency Use - First Single-Shot Vaccine in Fight Against Global Pandemic
Continue reading...
Also check out R-bloggers for lots of cool R stuff!