{"id":251,"date":"2022-01-08T21:40:37","date_gmt":"2022-01-08T13:40:37","guid":{"rendered":"https:\/\/linguopeng.top\/?p=251"},"modified":"2022-01-08T23:11:24","modified_gmt":"2022-01-08T15:11:24","slug":"figure-2-random-forest-model-detects-bacterial-taxa-that-accurately-predict-indica-and-japonica","status":"publish","type":"post","link":"https:\/\/linguopeng.top\/?p=251","title":{"rendered":"Figure 2. Random-forest model detects bacterial taxa that accurately predict indica and japonica"},"content":{"rendered":"\n<pre class=\"wp-block-code\"><code>---\ntitle: \"Figure 2. Random-forest model detects bacterial taxa that accurately predict indica and japonica subspeciation.\"\nauthor: \"Yong-Xin Liu\"\ndate: \"2019\/2\/20\"\noutput: html_document\n---\n\n```{r setup, include=FALSE}\nknitr::opts_chunk$set(echo = TRUE)\n# Clean workspace\nrm(list=ls()) \n# Load setting and functions\nsource(\"..\/script\/stat_plot_functions.R\")\n# Set output directory\noutput_dir=\".\/\"\n```\n\n\n## a. Top features\n\n(a) The top 18 bacterial families were identified by applying random-forest classification of the relative abundance of the root microbiota in field II against indica and japonica rice. Biomarker taxa are ranked in descending order of importance to the accuracy of the model. The inset represents 10-fold cross-validation error as a function of the number of input families used to differentiate indica and japonica root microbiota in order of variable importance. \n\nLoad RandomForest model for plotting\n\n```{r load}\nload(\"model.RData\")\n```\n\n\n### a1. Cross validation\n\nggplot2 visualize result of cross validation\n\n```{r cv_ggplot2, echo=TRUE}\n\nn.var = error.cv$num\nerror.cv = error.cv&#91;,2:6]\ncolnames(error.cv) = paste('err',1:5,sep='.')\nerr.mean = apply(error.cv,1,mean)\nallerr = data.frame(num=n.var,err.mean=err.mean,error.cv)\n# number of features selected\noptimal = 18\n\nwrite.table(allerr, file = \"family_rfcv.txt\", sep = \"\\t\", quote = F, row.names = T, col.names = T)\n\np = ggplot() + \n  geom_line(aes(x = allerr$num, y = allerr$err.1), colour = 'grey') + \n  geom_line(aes(x = allerr$num, y = allerr$err.2), colour = 'grey') + \n  geom_line(aes(x = allerr$num, y = allerr$err.3), colour = 'grey') + \n  geom_line(aes(x = allerr$num, y = allerr$err.4), colour = 'grey') + \n  geom_line(aes(x = allerr$num, y = allerr$err.5), colour = 'grey') + \n  geom_line(aes(x = allerr$num, y = allerr$err.mean), colour = 'black') + \n  geom_vline(xintercept = optimal, colour='black', lwd=0.36, linetype=\"dashed\") + \n#  geom_hline(yintercept = min(allerr$err.mean), colour='black', lwd=0.36, linetype=\"dashed\") + \n  coord_trans(x = \"log2\") +\n  scale_x_continuous(breaks = c(1, 2, 5, 10, 20, 30, 50, 100, 200)) + # , max(allerr$num)\n  labs(title=paste('Training set (n = ', dim(otutab_t)&#91;1],')', sep = ''), \n       x='Number of families ', y='Cross-validation error rate') + \n  annotate(\"text\", x = optimal, y = max(allerr$err.mean), label=paste(\"optimal = \", optimal, sep=\"\")) + \n  main_theme\nggsave(p, file = \"family_rfcv.pdf\", width = 89, height = 50, unit = 'mm')\np\n```\n\n\n### a2. Features barplot with taxonomy in phylum\n\n\n```{r feature_imp, echo=TRUE}\nimp = read.table(\"family_imp_phylum.txt\", header=T, row.names= 1, sep=\"\\t\") \nimp = tail(imp, n = optimal)\nimp$Family = factor(rownames(imp), levels = rownames(imp))\n\np = ggplot(imp, aes(x = Family, y = MeanDecreaseAccuracy, fill = Phylum)) + \n  geom_bar(stat = \"identity\") + \n  coord_flip() + main_theme\nggsave(paste(\"family_top_feautre\",\".pdf\", sep=\"\"), p, width=89 * 1.5, height=50 * 1.5, unit='mm')\np\n```\n\n## b. Feature abundance barplot\n\n(B) Biomarker families with higher relative abundance in the root microbiome of indica (n=15) and japonica (n=3) rice. Error bars represent standard deviations. \n\nData prepare and draw all features abundance in two subspecies\n\n```{r feature_bar_all, echo=TRUE}\n# select randomForest top features\notu_bar = sub_otutab&#91;rownames(imp),] \nmean = data.frame(id = rownames(otu_bar), mean=rowMeans(otu_bar))\n# Decreasing by mean\nmean = arrange(mean, desc(mean))  \notu_bar$Family = rownames(otu_bar)\notu_bar = melt(otu_bar, id.vars = \"Family\")\n# head(otu_bar)\ndesign$sampleID = rownames(design)\notu_bar = merge(otu_bar, design&#91;,c(\"sampleID\",\"subspecies\")], by.x=\"variable\", by.y = \"sampleID\", all.x = T)\n# head(otu_bar)\n\notu_error_bar = summarySE(otu_bar, measurevar=\"value\", groupvars=c(\"Family\",\"subspecies\"))\n# head(otu_error_bar)\n\notu_error_bar$Family = factor(otu_error_bar$Family, levels = mean$id)\n\n# p = ggplot(otu_error_bar, aes(x=Family, y=value, fill=subspecies)) + \n#   geom_bar(position=position_dodge(), stat=\"identity\") +\n#   geom_errorbar(aes(ymin=value-ci, ymax=value+ci),\n#                 width=.5,                    # Width of the error bars\n#                 position=position_dodge(.9)) + main_theme\n# p=p+theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))\n# p\n# ggsave(paste(\"family_errorbar\",\".pdf\", sep=\"\"), p, width=89 * 1.5, height=50 * 2, unit='mm')\n\nDA_list = read.table(\"f_HTEJ-HIND_all.txt\", header=T, row.names=1, sep=\"\\t\") # , comment.char=\"\"\nDA_list = DA_list&#91;rownames(imp) ,1:5]\nDA_list$level = ifelse(DA_list$logFC&gt;0, \"Enriched\",\"Depleted\")\notu_error_bar = merge(otu_error_bar, DA_list, by.x=\"Family\", by.y = \"row.names\", all.x = T)\n```\n\n### b1. Plot japonica (TEJ) enriched family\n\n```{r tej_enriched, echo=TRUE}\nTEJ = rownames(DA_list&#91;DA_list$level == \"Enriched\",])\np = ggplot(otu_error_bar&#91;otu_error_bar$Family %in% TEJ, ], aes(x=Family, y=value, fill=subspecies)) + \n  geom_bar(position=position_dodge(), stat=\"identity\") +\n  geom_errorbar(aes(ymin=value-ci, ymax=value+ci),\n                width=.5,                    # Width of the error bars\n                position=position_dodge(.9)) + main_theme\np=p+theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))\nggsave(paste(\"Family_barplot_TEJ\",\".pdf\", sep=\"\"), p, width=89 * 0.6, height=50 * 1.5, unit='mm')\np\n```\n\n### b2. Plot indica (IND) enriched family\n\n```{r ind_enriched, echo=TRUE}\nIND = rownames(DA_list&#91;DA_list$level == \"Depleted\",])\np = ggplot(otu_error_bar&#91;otu_error_bar$Family %in% IND, ], aes(x=Family, y=value, fill=subspecies)) + \n  geom_bar(position=position_dodge(), stat=\"identity\") +\n  geom_errorbar(aes(ymin=value-ci, ymax=value+ci),\n                width=.5,                    # Width of the error bars\n                position=position_dodge(.9)) + main_theme\np=p+theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))\nggsave(paste(\"Family_barplot_IND\",\".pdf\", sep=\"\"), p,  width=89 * 1.5, height=50 * 1.5, unit='mm')\np\n```\n\n## c. Test on Field I\n\n(c) The results of prediction of indica and japonica rice in field I according to the random-forest model. Each square represents an individual plant from 68 indica and 27 japonica varieties. The characteristics of the varieties are shown on the left. The predicted characteristics are shown on the right. Indica varieties are shown in red; japonica varieties are shown in blue. \n\nFiled I (LN) as test set\n\n```{r test1, echo=TRUE}\n#  Select by manual set group\nif (TRUE){\n  sub_design = subset(design, group %in% c(\"LTEJ\",\"LIND\")) \n  sub_design$group  = factor(sub_design$group, levels=c(\"LTEJ\",\"LIND\")) \n}\nidx = rownames(sub_design) %in% colnames(otutab)\nsub_design = sub_design&#91;idx,]\nsub_otutab = otutab&#91;,rownames(sub_design)]\n# head(sub_otutab)&#91;,1:10]\n\notutab_t = as.data.frame(t(sub_otutab))\notutab_t$group = factor(design&#91;rownames(otutab_t),]$subspecies, levels= c(\"TEJ\",\"IND\"))\n\n# apply model on test set\nset.seed(315)\nlibrary(randomForest)\notutab.pred = predict(otutab_t.rf, otutab_t )  \npre_tab = table(observed=otutab_t&#91;,\"group\"], predicted=otutab.pred) \npre_tab\n\n# save prediction result\npredict = data.frame(group = otutab_t&#91;,\"group\"], predicted=otutab.pred)\nwrite.table(predict, file = \"RF_prediction_FieldILN.txt\", quote = F, row.names = T, col.names = T, sep = \"\\t\")\nsystem(\"sed -i 's\/group\/ID\\tsubspecies\/;s\/IND\/indica\/g;s\/TEJ\/japonica\/g' RF_prediction_FieldILN.txt\")\n\n# visualize prediction result by heatmap\npredict$result = ifelse(predict$group == predict$predicted, 1, 0)\npredict$predict = ifelse(predict$predicted == \"IND\", 1, 2)\ncolumn = 17\n\nIND = predict&#91;predict$group==\"IND\",]$predict\nrow = round(length(IND)\/column + 0.5)\ni = column * row - length(IND)\nIND = c(IND, rep(NA, i))\nmatrix = matrix(IND, nrow = row, ncol = column, byrow = T)\npheatmap(matrix, color = c(\"#F9766E\", \"#00BFC4\") , cluster_rows = F, cluster_cols = F, cellwidth = 15, cellheight = 12,\n         filename = \"family_test_IND.pdf\")\npheatmap(matrix, color = c(\"#F9766E\", \"#00BFC4\") , cluster_rows = F, cluster_cols = F, cellwidth = 15, cellheight = 12)\n\nTEJ = predict&#91;predict$group==\"TEJ\",]$predict\nrow = round(length(TEJ)\/column + 0.5)\ni = column * row - length(TEJ)\nTEJ = c(TEJ, rep(NA, i))\nmatrix = matrix(TEJ, nrow = row, ncol = column, byrow = T)\npheatmap(matrix, color = c(\"#F9766E\", \"#00BFC4\") , cluster_rows = F, cluster_cols = F, cellwidth = 15, cellheight = 12,\n         filename = \"family_test_TEJ.pdf\")\npheatmap(matrix, color = c(\"#F9766E\", \"#00BFC4\") , cluster_rows = F, cluster_cols = F, cellwidth = 15, cellheight = 12)\n```\n\n\n## d. Test on new varieties grown in new locations\n\n(d) The results of prediction of indica and japonica varieties that were not included in the training set and were grown in different geographical locations, according to the random-forest model. Each square represents an individual plant from tested varieties. The number of biological replicates in this figure is as follows: in field I, indica (n = 201), japonica (n = 80); in field II, indica (n = 201), japonica (n = 81).\n\n```{r test2, echo=TRUE}\n# Select IR24(indica) and ZH11(japonica) grow in Changping and Shangzhuang Farm (include Ln and Hn two field).\ngrouplist = c(\"IR24HnCp7\", \"IR24LnCp7\", \"IR24HnSz7\", \"IR24LnSz7\", \"ZH11HnCp7\", \"ZH11LnCp7\", \"ZH11HnSz7\", \"ZH11LnSz7\")\ndesign_sub = subset(design, groupID %in% grouplist)\n\npredict = matrix(0, ncol = 8, nrow = 17)\ncolnames(predict) = grouplist\ncolumn = 17\n\nfor(i in 1:length(grouplist)) {\n  #i = 1\n  design_sub2 = subset(design_sub, groupID %in% grouplist&#91;i])\n  idx = rownames(design_sub2) %in% colnames(otutab)\n  design_sub2 = design_sub2&#91;idx,]\n  otutab_sub = otutab&#91;,rownames(design_sub2)]\n  otutab_sub=as.data.frame(t(otutab_sub))\n  #otutab_sub$group = as.vector(design_sub2$subspecies)\n  set.seed(315)\n  otutab.pred = predict(otutab_t.rf, otutab_sub)\n  otutab.pred\n  levels(otutab.pred) = c(2, 1)\n  j = column - length(otutab.pred)\n  predict&#91;, i]= c(as.vector(otutab.pred), rep(NA, j))\n}\nmatrix = matrix(as.numeric(predict), ncol = 8, nrow = 17)\ncolnames(matrix) = grouplist\nmatrix = t(matrix)\ncolnames(matrix) = paste(\"rep\", 1:17, sep = \"\")\n\nwrite.table(matrix, file = \"RF_predict_test2.txt\", quote = F, row.names = T, col.names = T, sep = \"\\t\")\nsystem(\"sed -i 's\/rep1\/ID\\trep1\/;s\/\\t1\/\\tindica\/g;s\/\\t2\/\\tjaponica\/g' RF_predict_test2.txt\")\n\n\n# pheatmap(matrix, color = c(\"#F9766E\", \"#00BFC4\") , cluster_rows = F, cluster_cols = F, cellwidth = 15, cellheight = 12)\npheatmap(matrix&#91;1:4, 1:16], color = c(\"#F9766E\", \"#00BFC4\") , cluster_rows = F, cluster_cols = F, cellwidth = 15, cellheight = 12,\n         filename = \"family_test_nrt_IND.pdf\")\npheatmap(matrix&#91;1:4, 1:16], color = c(\"#F9766E\", \"#00BFC4\") , cluster_rows = F, cluster_cols = F, cellwidth = 15, cellheight = 12)\n\npheatmap(matrix&#91;5:8,], color = c(\"#F9766E\", \"#00BFC4\") , cluster_rows = F, cluster_cols = F, cellwidth = 15, cellheight = 12,\n         filename = \"family_test_nrt_TEJ.pdf\")\npheatmap(matrix&#91;5:8,], color = c(\"#F9766E\", \"#00BFC4\") , cluster_rows = F, cluster_cols = F, cellwidth = 15, cellheight = 12)\n```<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><div class='fancybox-wrapper lazyload-container-unload' data-fancybox='post-images' href='https:\/\/linguopeng.top\/wp-content\/uploads\/2022\/01\/family_top_feautre.jpg'><img class=\"lazyload lazyload-style-1\" src=\"data:image\/svg+xml;base64,PCEtLUFyZ29uTG9hZGluZy0tPgo8c3ZnIHdpZHRoPSIxIiBoZWlnaHQ9IjEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgc3Ryb2tlPSIjZmZmZmZmMDAiPjxnPjwvZz4KPC9zdmc+\"  loading=\"lazy\" decoding=\"async\" width=\"1050\" height=\"589\" data-original=\"https:\/\/linguopeng.top\/wp-content\/uploads\/2022\/01\/family_top_feautre.jpg\" src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB\/AAffA0nNPuCLAAAAAElFTkSuQmCC\" alt=\"\" class=\"wp-image-252\"\/><\/div><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><div class='fancybox-wrapper lazyload-container-unload' data-fancybox='post-images' href='https:\/\/linguopeng.top\/wp-content\/uploads\/2022\/01\/family_rfcv.jpg'><img class=\"lazyload lazyload-style-1\" src=\"data:image\/svg+xml;base64,PCEtLUFyZ29uTG9hZGluZy0tPgo8c3ZnIHdpZHRoPSIxIiBoZWlnaHQ9IjEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgc3Ryb2tlPSIjZmZmZmZmMDAiPjxnPjwvZz4KPC9zdmc+\"  loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"392\" data-original=\"https:\/\/linguopeng.top\/wp-content\/uploads\/2022\/01\/family_rfcv.jpg\" src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB\/AAffA0nNPuCLAAAAAElFTkSuQmCC\" alt=\"\" class=\"wp-image-265\"\/><\/div><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><div class='fancybox-wrapper lazyload-container-unload' data-fancybox='post-images' href='https:\/\/linguopeng.top\/wp-content\/uploads\/2022\/01\/Family_barplot_IND.jpg'><img class=\"lazyload lazyload-style-1\" src=\"data:image\/svg+xml;base64,PCEtLUFyZ29uTG9hZGluZy0tPgo8c3ZnIHdpZHRoPSIxIiBoZWlnaHQ9IjEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgc3Ryb2tlPSIjZmZmZmZmMDAiPjxnPjwvZz4KPC9zdmc+\"  loading=\"lazy\" decoding=\"async\" width=\"1050\" height=\"589\" data-original=\"https:\/\/linguopeng.top\/wp-content\/uploads\/2022\/01\/Family_barplot_IND.jpg\" src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB\/AAffA0nNPuCLAAAAAElFTkSuQmCC\" alt=\"\" class=\"wp-image-253\"\/><\/div><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><div class='fancybox-wrapper lazyload-container-unload' data-fancybox='post-images' href='https:\/\/linguopeng.top\/wp-content\/uploads\/2022\/01\/Family_barplot_TEJ.jpg'><img class=\"lazyload lazyload-style-1\" src=\"data:image\/svg+xml;base64,PCEtLUFyZ29uTG9hZGluZy0tPgo8c3ZnIHdpZHRoPSIxIiBoZWlnaHQ9IjEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgc3Ryb2tlPSIjZmZmZmZmMDAiPjxnPjwvZz4KPC9zdmc+\"  loading=\"lazy\" decoding=\"async\" width=\"420\" height=\"589\" data-original=\"https:\/\/linguopeng.top\/wp-content\/uploads\/2022\/01\/Family_barplot_TEJ.jpg\" src=\"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB\/AAffA0nNPuCLAAAAAElFTkSuQmCC\" alt=\"\" class=\"wp-image-257\"\/><\/div><\/figure>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-251","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/linguopeng.top\/index.php?rest_route=\/wp\/v2\/posts\/251","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/linguopeng.top\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/linguopeng.top\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/linguopeng.top\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/linguopeng.top\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=251"}],"version-history":[{"count":2,"href":"https:\/\/linguopeng.top\/index.php?rest_route=\/wp\/v2\/posts\/251\/revisions"}],"predecessor-version":[{"id":266,"href":"https:\/\/linguopeng.top\/index.php?rest_route=\/wp\/v2\/posts\/251\/revisions\/266"}],"wp:attachment":[{"href":"https:\/\/linguopeng.top\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=251"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/linguopeng.top\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=251"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/linguopeng.top\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=251"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}