My Blog
Genome-wide multiple bedgraph data analysis with Intervene
Today, I will give a talk about how to do multiple bedgraph data analysis with Intervene for ChIP-seq or MBD-seq data with Intervene. Intervene is a tool for intersection and visualization of multiple genomic region and gene sets (or lists of items). Intervene provides an easy and automated interface for effective intersection and visualization of genomic region sets or lists of items, thus facilitating their analysis and interpretations.
conda install -c bioconda intervene
for i in 2019032901 2019032903 2019040901 2019051703 2019052301 2019053101 2019053102
do
intervene venn -i /gpfs/home/guosa/hpc/methylation/pancrease/medip/venn/$i*.bed --project $i
done
Here, you can find the venn diagrams which are quite fancy. We can prepare venn for each sample.
How to update hapmap2 and hapmap3 from hg18 to hg19 or hg38
Recently, colleagues always ask me what’s the best solution to update hapmap3 from hg18 to hg19 and hg38. Here, I try to give certain solutions.
- Update hapmap2 from hg18 to hg19 or hg38 with liftOver and plink
wget http://zzz.bwh.harvard.edu/plink/dist/hapmap_r23a.zip
wget http://zzz.bwh.harvard.edu/plink/dist/hapmap.pop
unzip hapmap_r23a.zip
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
wget https://raw.githubusercontent.com/Shicheng-Guo/GscPythonUtility/master/liftOverPlink.py
# rebuild plink file to avoid chromsome-miss-order problem
plink --bfile hapmap_r23a --make-bed --out hapmap_r23a.tab
# space to tab to generate bed files for liftOver from hg18 to hg19
plink --bfile hapmap_r23a.sort --recode tab --out hapmap_r23a.tab
# apply liftOverPlink.py to update hg18 to hg19 or hg38
./liftOverPlink.py -m hapmap_r23a.tab.map -p hapmap_r23a.tab.ped -o hapmap_r23a.hg19 -c hg18ToHg19.over.chain.gz -e ./liftOver
./liftOverPlink.py -m hapmap_r23a.tab.map -p hapmap_r23a.tab.ped -o hapmap_r23a.hg38 -c hg19ToHg38.over.chain.gz -e ./liftOver
plink --file hapmap_r23a.hg19 --make-bed --allow-extra-chr --out hapmap2.hg19
plink --file hapmap_r23a.hg38 --make-bed --allow-extra-chr --out hapmap2.hg38
- Update hapmap3 from hg18 to hg19 and hg38 with liftOver and plink
# download database and script
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
wget https://raw.githubusercontent.com/Shicheng-Guo/GscPythonUtility/master/liftOverPlink.py
# download hapmap3 data in plink format
wget https://www.broadinstitute.org/files/shared/mpg/hapmap3/hapmap3_r1_b36_fwd_consensus.qc.poly.recode.ped.bz2
wget https://www.broadinstitute.org/files/shared/mpg/hapmap3/hapmap3_r1_b36_fwd_consensus.qc.poly.recode.map.bz2
wget https://www.broadinstitute.org/files/shared/mpg/hapmap3/relationships_w_pops_051208.txt
bzip2 -d hapmap3_r1_b36_fwd_consensus.qc.poly.recode.ped.bz2
bzip2 -d hapmap3_r1_b36_fwd_consensus.qc.poly.recode.map.bz2
# convert from hg18 to hg19 plink file
plink --bfile hapmap3_r1_b36_fwd_consensus.qc.poly.recode --recode --out hapmap3.hg18
./liftOverPlink.py -m hapmap3.hg18.map -p hapmap3.hg18.ped -o hapmap3.hg19 -c hg18ToHg19.over.chain.gz -e ./liftOver
./liftOverPlink.py -m hapmap3.hg18.map -p hapmap3.hg18.ped -o hapmap3.hg38 -c hg19ToHg38.over.chain.gz -e ./liftOver
# update plink to binary mode
plink --file hapmap3.hg19 --make-bed --allow-extra-chr --out hapmap3.hg19
plink --file hapmap3.hg38 --make-bed --allow-extra-chr --out hapmap3.hg38
# hapmap3 data cleaning and filtering
plink --bfile hapmap3.hg19 --missing
plink --file hapmap3.hg19 --maf 0.01 --make-bed --indep 50 5 2 --out hapmap3.hg19
plink --bfile hapmap3.hg19 --extract hapmap3.hg19.prune.in --genome --min 0.185
perl ./run-IBD-QC.pl plink
- Merge your data with hapmap2 dataset and prepare the PCA plot
### merge personal data with hapmap2 and hapmap3
/home/guosa/hpc/db/hapmap3/hapmap3.hg19.deking
/home/guosa/hpc/db/hapmap2/hapmap2.hg19.deking
cd /home/guosa/hpc/db/hapmap2/
plink2 --bfile ~/hpc/db/hapmap2/hapmap2.hg19.deking --exclude RA2020hapmap2-merge.missnp --max-alleles 2 --make-bed --out hapmap2.hg19
plink2 --bfile ../RA2020-B8.dbsnp --max-alleles 2 --exclude RA2020hapmap2-merge.missnp --make-bed --out RA2020.hg19
plink --bfile hapmap2.hg19 --bmerge RA2020.hg19 --make-bed --out RA2020hapmap2
plink --bfile RA2020hapmap2 --maf 0.01 --geno 0.1 --make-bed --out RA2020hapmap2Update
plink --bfile RA2020hapmap2Update --make-bed --out RA2020hapmap2UpdatePCA
plink2 --bfile RA2020hapmap2UpdatePCA --pca --threads 31
hapmap2<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/hapmap2.pop",head=F)
head(hapmap2)
head(hapmap3)
eigenvec<-read.table("plink2.eigenvec",head=F)
head(eigenvec)
pop<-as.character(hapmap2[match(eigenvec[,2],hapmap2[,2]),3])
pop[is.na(pop)]<-"Guanghua"
eigenvec$pop=pop
eigenvec$col=as.numeric(as.factor(pop))
head(eigenvec)
pdf("pca.pdf")
plot(eigenvec[,3],eigenvec[,4],pch=16,col=as.factor(eigenvec$pop),xlab="principle component 1",ylab="principle componment 2",cex.axis=1.5,cex.lab=1.5)
legend("topright",pch=16,legend=unique(as.factor(pop)),col=unique(as.factor(eigenvec$pop)),bty="n",cex=1.5)
dev.off()
subset(eigenvec,pop=="Guanghua" & V3<0) # RA478
- Merge hapmap3 data with your own data and prepare the PCA plot
cd ~/hpc/rheumatology/RA/he2020/hapmap3
plink --bfile ~/hpc/db/hapmap3/hapmap3.hg19.deking --bmerge ../RA2020-B8.dbsnp --make-bed --out RA2020hapmap3
plink2 --bfile ~/hpc/db/hapmap3/hapmap3.hg19.deking --exclude RA2020hapmap3-merge.missnp --max-alleles 2 --make-bed --out hapmap3.hg19
plink2 --bfile ../RA2020-B8.dbsnp --max-alleles 2 --exclude RA2020hapmap3-merge.missnp --make-bed --out RA2020.hg19
plink --bfile hapmap3.hg19 --bmerge RA2020.hg19 --make-bed --out RA2020.hapmap3
plink --bfile RA2020.hapmap3 --maf 0.01 --geno 0.1 --make-bed --out RA2020.hapmap3.update
plink --bfile RA2020.hapmap3.update --make-bed --out RA2020.hapmap3.update.pca
plink2 --bfile RA2020.hapmap3.update.pca --pca --threads 31
### R
hapmap3<-read.table("https://raw.githubusercontent.com/Shicheng-Guo/AnnotationDatabase/master/hapmap3.pop",head=T)
head(hapmap3)
eigenvec<-read.table("plink2.eigenvec",head=F)
head(eigenvec)
pop<-as.character(hapmap3[match(eigenvec[,2],hapmap3[,2]),7])
pop[is.na(pop)]<-"SGH"
eigenvec$pop=pop
head(eigenvec)
pdf("Hapmap3.Guanghua.PCA.pdf")
plot(eigenvec[,3],eigenvec[,4],col=as.factor(eigenvec$pop),pch=as.numeric(as.factor(eigenvec$pop)),xlab="principle component 1",ylab="principle componment 2",cex.axis=1.5,cex.lab=1.5)
legend("topright",legend=unique(as.factor(pop)),pch=unique(as.numeric(as.factor(eigenvec$pop))),col=unique(as.factor(eigenvec$pop)),bty="n",cex=1)
dev.off()
subset(eigenvec,pop=="SGH" & V3<0) # RA478
pdf("Hapmap3.Guanghua.PCA.fine.pdf")
plot(eigenvec[,3],eigenvec[,4],col=as.factor(eigenvec$pop),pch=as.numeric(as.factor(eigenvec$pop)),xlab="principle component 1",ylab="principle componment 2",cex.axis=1.5,cex.lab=1.5,xlim=c(0.006,0.01),ylim=c(-0.0075,0.0025))
legend("topright",legend=unique(as.factor(pop)),pch=unique(as.numeric(as.factor(eigenvec$pop))),col=unique(as.factor(eigenvec$pop)),bty="n",cex=1)
dev.off()
subset(eigenvec,pop=="SGH" & V3<0.0075) # RA478
write.table(subset(eigenvec,pop=="SGH" & V3<0.0075),file="GH.txt",sep="\t",col.names=T,row.names=F,quote=F)
write.table(subset(eigenvec,pop=="SGH" & V3<0.0075)[,1:2],file="pca.iid.remove",sep="\t",col.names=F,row.names=F,quote=F)
Characterizing population with principle componment analysis
Here, Characterizing population with principle componment analysis
Data Science in Population Genetics and Medical Genetics
Here, Data Science in Population Genetics and Medical Genetics
Geneics and epigeneitcs of rheumatoid arthritis and osteoarthritis
Here, I want to summarized the genetic and epigeneitc difference between rheumatoid arthritis(RA) and osteoarthritis(OA). OA and RA have certain similarity such as joint demage..Genetics/Epigenetics
Genetics and epigenetics of pancreatic and cholangiocarcinoma
- Pancreatic cancer and cholangiocarcinoma
- All the figures are only used for non-profit education. reminding me if infrigement happens.
Genetic and epigenetic of esophageal squamous cell carcinoma
- Esophageal squamous cell carcinoma
- All the figures are only used for non-profit education. reminding me if infrigement happens.
Advances in Genetic and epigenetic of Hepatocellular Carcinoma
- Advances in Genetic and epigenetic of Hepatocellular Carcinoma
- All the figures are only used for non-profit education. reminding me if infrigement happens
How to apply R to prepare fancy figures in the manuscript
- All the figures are only used for non-profit education. reminding me if infrigement happens.
Explaining complex machine learning models with LIME
The classification decisions made by machine learning models are usually difficult - if not impossible - to understand by our human brains. The complexity of some of the most accurate classifiers, like neural networks, is what makes them perform so well - often with better results than achieved by humans. But it also makes them inherently hard to explain, especially to non-data scientists.
Especially, if we aim to develop machine learning models for medical diagnostics, high accuracies on test samples might not be enough to sell them to clinicians. Doctors and patients alike will be less inclined to trust a decision made by a model that they don’t understand.
Therefore, we would like to be able to explain in concrete terms why a model classified a case with a certain label, e.g. why one breast mass sample was classified as “malignant” and not as “benign”.
Local Interpretable Model-Agnostic Explanations (LIME) is an attempt to make these complex models at least partly understandable. The method has been published in
lime is able to explain all models for which we can obtain prediction probabilities (in R, that is every model that works with predict(type = "prob")). It makes use of the fact that linear models are easy to explain because they are based on linear relationships between features and class labels: The complex model function is approximated by locally fitting linear models to permutations of the original training set.
On each permutation, a linear model is being fit and weights are given so that incorrect classification of instances that are more similar to the original data are penalized (positive weights support a decision, negative weights contradict them). This will give an approximation of how much (and in which way) each feature contributed to a decision made by the model.
The code for lime has originally been made available for Python but the awesome Thomas Lin Pedersen has already created an implementation in R. It is not on CRAN (yet, I assume), but you can install it via Github:
devtools::install_github("thomasp85/lime")
The data I am using is the World Happiness Data from my last post. So, let’s train a neural network on this data to predict three classes of the happiness scores: low, medium and high.
load("data_15_16.RData")
# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
library(caret)
set.seed(42)
index <- createDataPartition(data_15_16$Happiness.Score.l, p = 0.7, list = FALSE)
train_data <- data_15_16[index, ]
test_data <- data_15_16[-index, ]
set.seed(42)
model_mlp <- caret::train(Happiness.Score.l ~ .,
data = train_data,
method = "mlp",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = FALSE))
The explain function
The central function of lime is lime() It creates the function that is used in the next step to explain the model’s predictions.
We can give a couple of options. Check the help ?lime for details, but the most important to think about are:
- Should continuous features be binned? And if so, into how many bins?
Here, I am keeping the default bin_continuous = TRUE but specify 5 instead of 4 (the default) bins with n_bins = 5.
library(lime)
explain <- lime(train_data, model_mlp, bin_continuous = TRUE, n_bins = 5, n_permutations = 1000)
Now, let’s look at how the model is explained. Here, I am not going to look at all test cases but I’m randomly choosing three cases with correct predictions and three with wrong predictions.
pred <- data.frame(sample_id = 1:nrow(test_data),
predict(model_mlp, test_data, type = "prob"),
actual = test_data$Happiness.Score.l)
pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
Beware that we need to give our test-set data table row names with the sample names or IDs to be displayed in the header of our explanatory plots below.
library(tidyverse)
pred_cor <- filter(pred, correct == "correct")
pred_wrong <- filter(pred, correct == "wrong")
test_data_cor <- test_data %>%
mutate(sample_id = 1:nrow(test_data)) %>%
filter(sample_id %in% pred_cor$sample_id) %>%
sample_n(size = 3) %>%
remove_rownames() %>%
tibble::column_to_rownames(var = "sample_id") %>%
select(-Happiness.Score.l)
test_data_wrong <- test_data %>%
mutate(sample_id = 1:nrow(test_data)) %>%
filter(sample_id %in% pred_wrong$sample_id) %>%
sample_n(size = 3) %>%
remove_rownames() %>%
tibble::column_to_rownames(var = "sample_id") %>%
select(-Happiness.Score.l)
The explain function from above can now be used with our test samples. Further options we can specify are:
- How many features do we want to use in the explanatory function?
Let’s say we have a big training set with 100 features. Looking at all features and trying to understand them all could be more confusing than helpful. And very often, a handful of very important features will be enough to predict test samples with a reasonable accuracy (see also my last post on OneR). So, we can choose how many features we want to look at with the n_features option.
- How do we want to choose these features?
Next, we specify how we want this subset of features to be found. The default, auto, uses forward selection if we chose n_features <= 6 and uses the features with highest weights otherwise. We can also directly choose feature_select = "forward_selection", feature_select = "highest_weights" or feature_select = "lasso_path". Again, check ?lime for details.
In our example dataset, we only have 7 features and I want to look at the top 5.
I also want to have explanation for all three class labels in the response variable (low, medium and high happiness), so I am choosing n_labels = 3.
explanation_cor <- explain(test_data_cor, n_labels = 3, n_features = 5)
explanation_wrong <- explain(test_data_wrong, n_labels = 3, n_features = 5)
It will return a tidy tibble object that we can plot with plot_features():
plot_features(explanation_cor, ncol = 3)

plot_features(explanation_wrong, ncol = 3)

The information in the output tibble is described in the help function ?lime and can be viewed with
tibble::glimpse(explanation_cor)
So, what does this tell us, now? Let’s look at case 22 (the first row of our plot for correctly predicted classes): This sample has been correctly predicted to come from the medium happiness group because it
- has a dystopia value between 2.03 & 2.32,
- a trust/government corruption score below 0.05,
- a GDP/economy score between 1.06 and 1.23 and
- a life expectancy score between 0.59 and 0.7.
From the explanation for the label “high” we can also see that this case has a family score bigger than 1.12, which is more representative of high happiness samples.
pred %>%
filter(sample_id == 22)
## sample_id low medium high actual prediction correct
## 1 22 0.02906327 0.847562 0.07429938 medium medium correct
The explanatory function named dystopia the most strongly supporting feature for this prediction. Dystopia is an imaginary country that has the world’s least-happy people. The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables […]
The explanatory plot tells us for each feature and class label in which range of values a representative data point would fall. If it does, this gets counted as support for this prediction, if it does not, it gets scored as contradictory. For case 22 and the feature dystopia, the data point 2.27 falls within the range for medium happiness (between 2.03 and 2.32) with a high weight.
When we look at where this case falls on the range of values for this feature, we can see that is indeed very close to the median of medium training cases and further away from the medians for high and low training cases. The other supportive features show us the same trend.
train_data %>%
gather(x, y, Economy..GDP.per.Capita.:Dystopia.Residual) %>%
ggplot(aes(x = Happiness.Score.l, y = y)) +
geom_boxplot(alpha = 0.8, color = "grey") +
geom_point(data = gather(test_data[22, ], x, y, Economy..GDP.per.Capita.:Dystopia.Residual), color = "red", size = 3) +
facet_wrap(~ x, scales = "free", ncol = 4)

An overview over the top 5 explanatory features for case 22 is stored in:
as.data.frame(explanation_cor[1:9]) %>%
filter(case == "22")
## case label label_prob model_r2 model_intercept
## 1 22 medium 0.84756196 0.05004205 0.5033729
## 2 22 medium 0.84756196 0.05004205 0.5033729
## 3 22 medium 0.84756196 0.05004205 0.5033729
## 4 22 medium 0.84756196 0.05004205 0.5033729
## 5 22 medium 0.84756196 0.05004205 0.5033729
## 6 22 high 0.07429938 0.06265119 0.2293890
## 7 22 high 0.07429938 0.06265119 0.2293890
## 8 22 high 0.07429938 0.06265119 0.2293890
## 9 22 high 0.07429938 0.06265119 0.2293890
## 10 22 high 0.07429938 0.06265119 0.2293890
## 11 22 low 0.02906327 0.07469729 0.3528088
## 12 22 low 0.02906327 0.07469729 0.3528088
## 13 22 low 0.02906327 0.07469729 0.3528088
## 14 22 low 0.02906327 0.07469729 0.3528088
## 15 22 low 0.02906327 0.07469729 0.3528088
## feature feature_value feature_weight
## 1 Dystopia.Residual 2.27394 0.14690100
## 2 Trust..Government.Corruption. 0.03005 0.06308598
## 3 Economy..GDP.per.Capita. 1.13764 0.02944832
## 4 Health..Life.Expectancy. 0.66926 0.02477567
## 5 Generosity 0.00199 -0.01326503
## 6 Family 1.23617 0.13629781
## 7 Generosity 0.00199 -0.07514534
## 8 Trust..Government.Corruption. 0.03005 -0.07574480
## 9 Dystopia.Residual 2.27394 -0.07687559
## 10 Economy..GDP.per.Capita. 1.13764 0.07167086
## 11 Family 1.23617 -0.14932931
## 12 Economy..GDP.per.Capita. 1.13764 -0.12738346
## 13 Generosity 0.00199 0.09730858
## 14 Dystopia.Residual 2.27394 -0.07464384
## 15 Trust..Government.Corruption. 0.03005 0.06220305
## feature_desc
## 1 2.025072 < Dystopia.Residual <= 2.320632
## 2 Trust..Government.Corruption. <= 0.051198
## 3 1.064792 < Economy..GDP.per.Capita. <= 1.275004
## 4 0.591822 < Health..Life.Expectancy. <= 0.701046
## 5 Generosity <= 0.123528
## 6 1.119156 < Family
## 7 Generosity <= 0.123528
## 8 Trust..Government.Corruption. <= 0.051198
## 9 2.025072 < Dystopia.Residual <= 2.320632
## 10 1.064792 < Economy..GDP.per.Capita. <= 1.275004
## 11 1.119156 < Family
## 12 1.064792 < Economy..GDP.per.Capita. <= 1.275004
## 13 Generosity <= 0.123528
## 14 2.025072 < Dystopia.Residual <= 2.320632
## 15 Trust..Government.Corruption. <= 0.051198
In a similar way, we can explore why some predictions were wrong.
If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog.
sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_0.5.0 purrr_0.2.2 readr_1.1.0
## [4] tidyr_0.6.1 tibble_1.3.0 tidyverse_1.1.1
## [7] RSNNS_0.4-9 Rcpp_0.12.10 lime_0.1.0
## [10] caret_6.0-73 ggplot2_2.2.1 lattice_0.20-35
## [13] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3
##
## loaded via a namespace (and not attached):
## [1] lubridate_1.6.0 assertthat_0.2.0 glmnet_2.0-5
## [4] rprojroot_1.2 digest_0.6.12 psych_1.7.3.21
## [7] R6_2.2.0 plyr_1.8.4 backports_1.0.5
## [10] MatrixModels_0.4-1 stats4_3.3.3 evaluate_0.10
## [13] httr_1.2.1 hrbrthemes_0.1.0 lazyeval_0.2.0
## [16] readxl_0.1.1 minqa_1.2.4 SparseM_1.76
## [19] extrafontdb_1.0 car_2.1-4 nloptr_1.0.4
## [22] Matrix_1.2-8 rmarkdown_1.4 labeling_0.3
## [25] splines_3.3.3 lme4_1.1-12 extrafont_0.17
## [28] stringr_1.2.0 foreign_0.8-67 munsell_0.4.3
## [31] hunspell_2.3 broom_0.4.2 modelr_0.1.0
## [34] mnormt_1.5-5 mgcv_1.8-17 htmltools_0.3.5
## [37] nnet_7.3-12 codetools_0.2-15 MASS_7.3-45
## [40] ModelMetrics_1.1.0 grid_3.3.3 nlme_3.1-131
## [43] jsonlite_1.4 Rttf2pt1_1.3.4 gtable_0.2.0
## [46] DBI_0.6-1 magrittr_1.5 scales_0.4.1
## [49] stringi_1.1.5 reshape2_1.4.2 xml2_1.1.1
## [52] tools_3.3.3 forcats_0.2.0 hms_0.3
## [55] pbkrtest_0.4-7 yaml_2.1.14 colorspace_1.3-2
## [58] rvest_0.3.2 knitr_1.15.1 haven_1.0.0
## [61] quantreg_5.29