如何使用BioConductor分析芯片数据芯片哪个图爆的多

温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!&&|&&
Java/PHP/js/python/R等web软件开发, 生物信息学, 数据挖掘与机器学习,各种CMS, 基因组, 转录组, NGS高通量数据分析, 生物数据挖掘, 肿瘤免疫基因组与蛋白组分析
LOFTER精选
网易考拉推荐
用微信&&“扫一扫”
将文章分享到朋友圈。
用易信&&“扫一扫”
将文章分享到朋友圈。
芯片便宜、标准化做的比较好,结论比测序更准确,其价值正在被重新发现!从最近看的几篇文章可见一斑。Bindea, Gabriela, et al. "Spatiotemporal dynamics of intratumoral immune cells reveal the immune landscape in human cancer."&Immunity&39.4 (2013): 782-795.Affymetrix MOE430 V2 arrayBai, Ling, et al. "An Integrated Genome-Wide Systems Genetics Screen for Breast Cancer Metastasis Susceptibility Genes."&PLoS Genet&12.4 (2016): e1005989.芯片分析方法学概述:http://agetouch./blog/static/2/代码部分:&http://bioinformatics.knowledgeblog.org//analysing-microarray-data-in-bioconductor/图片部分:http://bioinformatics.knowledgeblog.org//analysing-microarray-data-in-bioconductor-figures-and-code/Volcano plots of microarray data火山图http://bioinformatics.knowledgeblog.org//volcano-plots-of-microarray-data/使用免费软件做后续分析:http://bioinformatics.knowledgeblog.org//downstream-functional-analysis-of-an-omics-experiment-using-freely-available-tools/我的笔记:目的:安装R和&包;下载GEO数据集;标准化;做一些质量控制检查;分析发现差异表达基因;目录:1.linux1604安装R2.R中安装bioconductor3.获取数据4.Describing the experiment 实验描述&5.Loading and normalising the data 加载和标准化数据&6.Quality control checks 质量控制检查7.Filtering data 过滤数据8.Finding differentially expressed probesets 找到差异表达基因探针集合 &9.Annotating the results with associated gene symbols 注释基因与相关基因符号10.后续:生物显著性分析和火山图可视化----------------------------------------1.linux1604安装RRhttps://cloud.r-project.org/bin/linux/ubuntu/trusty/下载Rhttps://cloud.r-project.org/bin/linux/ubuntu/trusty/r-base-core-dbg_3.3.1-1trusty0_amd64.deb安装sudo dpkg -i linuxqq_v1.0-preview3_i386.deb报错r-base-core-dbg depends on r-base-core (= 3.3.1-1trusty0); however:& Version of r-base-core on system is 3.2.3-4.&r-base-core-dbg depends on r-base-dev (= 3.3.1-1trusty0); however:& Version of r-base-dev on system is 3.2.3-4.&&更新系统:sudo apt-get update下载依赖包&r-base-core &https://cloud.r-project.org/bin/linux/ubuntu/trusty/r-base-core_3.3.1-1trusty0_amd64.debr-base-devhttps://cloud.r-project.org/bin/linux/ubuntu/trusty/r-base-dev_3.3.1-1trusty0_all.deb只安装r-base-core_3.3.1sudo dpkg -i r-base-core_3.3.1-1trusty0_amd64.deb运行R$ RR version 3.3.1 () -- "Bug in Your Hair"Copyright (C) 2016 The R Foundation for Statistical ComputingPlatform: x86_64-pc-linux-gnu (64-bit)安装依赖包,报错wangjl@Ubuntu1404:~$ sudo apt-get install libxml2-dev&[sudo] password for wangjl:&Reading package lists... DoneBuilding dependency tree & & &&Reading state information... Donelibxml2-dev is already the newest version (2.9.3+dfsg1-1ubuntu0.1).libxml2-dev set to manually installed.You might want to run 'apt-get -f install' to correct these:The following packages have unmet dependencies:&r-base-core-dbg : Depends: r-base-dev (= 3.3.1-1trusty0) but 3.2.3-4 is to be installedE: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).重新安装r-base-dev_3.3.1-1sudo dpkg -i r-base-dev_3.3.1-1trusty0_all.deb重新安装依赖包$ sudo apt-get install libxml2-dev&$ sudo apt-get install libcurl4-openssl-dev&$ sudo apt-get install curl &--------------------------------------------------2.R中安装bioconductor设置bioconductor进入root,在进入R$ suPassword:&root@Ubuntu1404:/home/wangjl# R&source("http://bioconductor.org/biocLite.R")Bioconductor version 3.2 (BiocInstaller 1.20.3), ?biocLite for helpbiocLite() #下载、安装一堆包``````````````下载失败的包:trying URL '/src/contrib/DBI_0.5-1.tar.gz'Content type 'application/x-gzip' length 258716 bytes (252 KB)======downloaded 32 KBERROR: dependencies ‘DBI’, ‘RSQLite’ are not available for package ‘AnnotationDbi’Error in download.file(url, destfile, method, mode = "wb", ...) :&& cannot download all filesIn addition: Warning messages:1: In install.packages(pkgs = doing, lib = lib, ...) :& installation of package ‘AnnotationDbi’ had non-zero exit status2: In download.file(url, destfile, method, mode = "wb", ...) :& downloaded length 0 != reported length 79427Warning in download.packages(pkgs, destdir = tmpd, available = available, &:& download of package ‘markdown’ failed``````````````重新启动R,安装包依旧出现& source("http://bioconductor.org/biocLite.R")Bioconductor version 3.2 (BiocInstaller 1.20.3), ?biocLite for help& biocLite()BioC_mirror: https://bioconductor.orgUsing Bioconductor 3.2 (BiocInstaller 1.20.3), R 3.3.1 ().Installing package(s) ‘AnnotationDbi’also installing the dependencies ‘DBI’, ‘RSQLite’trying URL '/src/contrib/DBI_0.5-1.tar.gz'===================================那就手工下载包:https://cran.r-project.org/src/contrib/DBI_0.5-1.tar.gzhttps://cran.r-project.org/src/contrib/RSQLite_1.0.0.tar.gz然后怎么安装?//install-r-packages/1. 自动安装在R的控制台,输入install.packages("stepNorm",contriburl="http://www.your.url" , dependencies = TRUE ) 若要指定安装目录 (e.g. "mydir"),则输入 install.packages("stepNorm",contriburl="http://www.biostat.ucsf.edu/jean/software",lib="mydir")&这种方法可能找不到需要的package,那么可以用方法2&2. 手动安装&Windows:&下载package.zip文件&打开R的菜单栏-&Packages-&"Install package from local zip file..." 选择package.zip文件&win7也可以进入R根目录,使用cmd安装下载好的包:D:\Program Files\R\R-3.3.1\bin\x64&Rcmd INSTALL C:\Users\Administrator\AppData\Local\Temp\RtmpmOGrZw\downloaded_packages\data.table_1.9.6.zip* installing to library 'D:/Program Files/R/R-3.3.1/library''data.table'MD5D:\Program Files\R\R-3.3.1\bin\x64&RcmdUsage: Rcmd command argswhere 'command' is one of:
Install add-on packages
Remove add-on packages
Make a DLL for use with dynload
Run R in batch mode
Build add-on packages
Check add-on packages
Post process R profiling files
Convert Rd format to various other formats
difference R output files
Convert Rd format to PDF
Convert Rd format to pretty text
Extract S/R code from vignette
Process vignette documentation
Obtain configuration information about R
Open a file via Windows file associations
Process a latex fileUse
Rcmd command --helpfor usage information for each command.注意Rcmd后面的INSTALL必须大写!Linux:&下载package.tar.gz文件&在Shell终端(注意不是R)输入: sudo R CMD INSTALL package.tar.gz&有人说要用 sudo R CMD INSTALL --build package.tar.gz 没试过。。===================================Old packages: 'boot', 'cluster', 'codetools', 'lattice'更新。&install.packages('RCurl',lib="")指定位置后安装。# install the GEO librariesbiocLite("GEOquery")--------------------------------------------------3.获取数据(53M)& library(GEOquery)& getGEOSuppFiles("GSE20986")我们可以看到两个文件:$ ls GSE20986/filelist.txt GSE20986_RAW.tarThe GSE20986_RAW.tar file is a compressed archive of the CEL files (the Affymetrix native file format).&反正怎么都下载失败,各种报错:&downloaded length 6901136 != reported length &&zzu这乌龟网速,下载完,地老天荒http://agetouch./blog/static/9/?newFollowBlog最后用迅雷下载的移动到GSE20986,接着解压数据& untar("GSE20986/GSE20986_RAW.tar", exdir="data")& cels &- list.files("data/", pattern = "[gz]")& sapply(paste("data", cels, sep="/"), gunzip)& cels--------------------------------------------------4.Describing the experiment 实验描述&We are now ready to start analysis, however in order to do this we need to capture the experimental information.&我们已经开始分析前,还需要获取实验信息。This is just a text file which describes the chip names, and the source of the biological samples hybridised to them.就是一个描述chip名字的txt文件,包含和他们杂交的生物样本的来源。To load the data into R using simpleaffy we need to create a file to represent this information. Open a new terminal window and type:使用R语言载入数据,我们需要一个描述该信息的simpleaffy(https://www.bioconductor.org/packages/devel/bioc/vignettes/simpleaffy/inst/doc/simpleAffy.pdf)打开终端,输入$ ls data/*.CEL & data/phenodata.txt规则:1.共分为三列:Name FileName Target2.每一列用tab键分开3.name可以和filename相同,也可以提供更易于理解的、人性化的名字。4.target来源于GEO页,这定义了实验重复。phenodata.txt最终格式如下:Name FileName TargetGSM524662.CEL GSM524662.CEL irisGSM524663.CEL GSM524663.CEL retinaGSM524664.CEL GSM524664.CEL retinaGSM524665.CEL GSM524665.CEL irisGSM524666.CEL GSM524666.CEL retinaGSM524667.CEL GSM524667.CEL irisGSM524668.CEL GSM524668.CEL choroidGSM524669.CEL GSM524669.CEL choroidGSM524670.CEL GSM524670.CEL choroidGSM524671.CEL GSM524671.CEL huvecGSM524672.CEL GSM524672.CEL huvecGSM524673.CEL GSM524673.CEL huvec注:博主把第一列去掉.CEL,发现报错,& celfiles &- read.affy(covdesc="phenodata.txt", path="data")Error: the following are not valid files:& & data/GSM524662& &data/GSM524663无奈,又重新恢复原样。--------------------------------------------------5.Loading and normalising the data 加载和标准化数据&The simpleaffy package provides routines for handling CEL files including normalisation and loading data with sample information.&simpleaffy包提供处理cel文件的例行方法,包括标准化和载入带样本信息的数据。We are going to load the data into an R object called ‘celfiles’我们加载数据到‘celfiles’的R对象。发现需要提前安装,library(simpleaffy)#install.packages('simpleaffy')不行。&biocLite('simpleaffy')#又安装很多包后,终于可以用了。& library(simpleaffy)& celfiles &- read.affy(covdesc="phenodata.txt", path="data")You can confirm this has worked, and download the chip annotations at the same time (if this is your first run) by typing:你可以确认正常工作了,如果这是你首次运行,你可以同时用如下命令下载chip注释信息:& celfilesAffyBatch objectsize of arrays= features (12 kb)cdf=HG-U133_Plus_2 (54675 affyids)number of samples=12number of genes=54675annotation=hgu133plus2notes=我的提示是这样的:& celfilestrying URL 'https://bioconductor.org/packages/3.2/data/annotation/src/contrib/hgu133plus2cdf_2.18.0.tar.gz'Content type 'application/x-gzip' length 4353300 bytes (4.2 MB)==================================================downloaded 4.2 MB* installing *source* package ‘hgu133plus2cdf’ ...** R** data** preparing package for lazy loadingWarning: replacing previous import ‘AnnotationDbi::tail’ by ‘utils::tail’ when loading ‘hgu133plus2cdf’Warning: replacing previous import ‘AnnotationDbi::head’ by ‘utils::head’ when loading ‘hgu133plus2cdf’** help*** installing help indices** building package indices** testing if installed package can be loadedWarning: replacing previous import ‘AnnotationDbi::tail’ by ‘utils::tail’ when loading ‘hgu133plus2cdf’Warning: replacing previous import ‘AnnotationDbi::head’ by ‘utils::head’ when loading ‘hgu133plus2cdf’* DONE (hgu133plus2cdf)The downloaded source packages are in ‘/tmp/Rtmpc0uxta/downloaded_packages’AffyBatch objectsize of arrays= features (21 kb)cdf=HG-U133_Plus_2 (54675 affyids)number of samples=12number of genes=54675annotation=hgu133plus2notes=Warning messages:1: replacing previous import ‘AnnotationDbi::tail’ by ‘utils::tail’ when loading ‘hgu133plus2cdf’&2: replacing previous import ‘AnnotationDbi::head’ by ‘utils::head’ when loading ‘hgu133plus2cdf’&首尾和原文不同,首次运行会下载注释文件,尾巴多了个warning信息。无法判断结果会怎么样,继续搞吧。这个啥意思呢?annotation=hgu133plus2 难道是新片名称?去GEO(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20986)看了平台名字:
GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array发现真的是芯片系列的名字。Now we can normalise the data. As with the data deposited in GEO we are going to use the GC-RMA algorithm.&现在你可以标准化数据了。对于存放在GEO中的数据,我们打算使用GC-RMA 算法。If this is the first time you have run it, it will download additional files during the process.如果这是首次运行它,期间会下载额外文件。博主注:这里标准化方法为什么是这个呢?还有哪些?怎么选择?/bbs/thread/83961文献中:Comparison of Affymetrix GeneChip expression measures提到的方法多如牛毛,但是具体选择提到了ACCURACY和PRECISION,盼解释?总之一点,如何选择呢?For more information about GC-RMA you can read this paper.&&/nbt/journal/v22/n6/full/nbtb.html& celfiles.gcrma &- gcrma(celfiles)Adjusting for optical effect............Done.Computing affinitiesLoading required package: AnnotationDbi.Done.Adjusting for non-specific binding............Done.NormalizingCalculating Expression我的提示是这样的:前面又下载了一次hgu133plus2probe_2.18.0.tar.gz& celfiles.gcrma &- gcrma(celfiles)Adjusting for optical effect............Done.Computing affinities[1] "Checking to see if your internet connection works..."trying URL 'https://bioconductor.org/packages/3.2/data/annotation/src/contrib/hgu133plus2probe_2.18.0.tar.gz'Content type 'application/x-gzip' length 8505171 bytes (8.1 MB)==================================================downloaded 8.1 MB* installing *source* package ‘hgu133plus2probe’ ...** R** data** preparing package for lazy loading** help*** installing help indices** building package indices** testing if installed package can be loaded* DONE (hgu133plus2probe)The downloaded source packages are in ‘/tmp/Rtmpc0uxta/downloaded_packages’Loading required package: AnnotationDbiLoading required package: stats4Loading required package: IRangesLoading required package: S4VectorsAttaching package: ‘IRanges’The following object is masked from ‘package:simpleaffy’:& & members.Done.Adjusting for non-specific binding............Done.NormalizingCalculating ExpressionIf you want to see the data associated with the normalised object you can –&如果你想看并且看到了到与标准化的对象相关的数据,you will notice this is no longer an AffyBatch object,&but an ExpressionSet object, which holds normalised values.你会注意到这已经不是要给AffyBatch对象了,而是一个ExpressionSet对象,保存标准化后的值。& celfiles.gcrmaExpressionSet (storageMode: lockedEnvironment)assayData: 54675 features, 12 samples&& element names: exprs&protocolData& sampleNames: GSM524662.CEL GSM524663.CEL ... GSM524673.CEL (12 total)& varLabels: ScanDate& varMetadata: labelDescriptionphenoData& sampleNames: GSM524662.CEL GSM524663.CEL ... GSM524673.CEL (12 total)& varLabels: sample FileName Target& varMetadata: labelDescriptionfeatureData: noneexperimentData: use 'experimentData(object)'Annotation: hgu133plus2&--------------------------------------------------6.Quality control checks 质量控制检查Before we analyse the data it is imperative that we do some quality control checks to make sure there are no issues with the dataset.&The first thing we can do is check the effects of the GC-RMA normalisation, by plotting a boxplot of probe intensities before and after normalisation.分析之前,我们很有必要做一些质控检查,确保数据集没有问题。首先,我们先检查GC-RMA标准化效果,通过画标准化前后的探针强度箱线图。& biocLite('RColorBrewer')& # load colour libraries& library(RColorBrewer)& # set colour palette& cols &- brewer.pal(8, "Set1")& # plot a boxplot of unnormalised intensity values& boxplot(celfiles, col=cols)&如何保存图像?(1)/xiaoji/blog/static//pdf(01before.pdf) #打开一个pdf文档,并命名为picture1.pdf#plot(1:10); &#画图boxplot(celfiles, col=cols)dev.off(); & #关闭当前设备(graphics.off()关闭当前所有打开的图形图像)(2)我是直接截图的。Ubuntu1604下,shift+PrintScreen& # plot a boxplot of normalised intensity values, affyPLM is required to interrogate celfiles.gcrma& biocLite('affyPLM')& library(affyPLM)& boxplot(celfiles.gcrma, col=cols)&& # the boxplots are somewhat skewed by the normalisation algorithm 箱线图某种程度被标准化算法扭曲了& # and it is often more informative to look at density plots 通常看密度图信息量更丰富。& # Plot a density vs log intensity histogram for the unnormalised data 画一个标准化之前的 密度vs log强度 直方图。& hist(celfiles, col=cols)&& # Plot a density vs log intensity histogram for the normalised data 画一个标准化之后的 密度vs log强度 直方图。& hist(celfiles.gcrma, col=cols)&From these plots we can conclude that there are no major deviations amongst the 12 chips,&and normalisation has brought the intensities from all of the chips into distributions with similar characteristics.&从这些点图我们可以看出,这12张新片没有重大偏差,并且标准化已经让来自所有芯片的强度值的分布呈现类似特征。To take a closer look at the situation on a per-chip level we can use affyPLM.&affyPLM allows us to visualise statistical characteristics of the CEL files.我们可以通过affyPLM进一步查看每个芯片的水平。affyPLM允许我们可视化CEL文件的统计特征。& # Perform probe-level metric calculations on the CEL files: 在CEL文件上进行芯片水平矩阵计算。& celfiles.qc &- fitPLM(celfiles)& # Create an image of GSM24662.CEL: &创建一个GSM24662.CEL的图形& image(celfiles.qc, which=1, add.legend=TRUE)&& # Create an image of GSM524665.CEL 创建一个GSM524665.CEL文件&& # There is a spatial artifact present 有一个空间艺术感& image(celfiles.qc, which=4, add.legend=TRUE)&& # affyPLM also provides more informative boxplots &|affyPLM 也提供信息更丰富的箱线图。& # RLE (Relative Log Expression) plots should have |RLE (Relative Log Expression)散点图应该有接近于0的值。& # values close to zero. GSM524665.CEL is an outlier |GSM524665是另一个异常值。& RLE(celfiles.qc, main="RLE")&& # We can also use NUSE (Normalised Unscaled Standard Errors). | 我们也可以使用NUSE(标准化的非缩放误差)& # The median standard error should be 1 for most genes. | 对于大多数基因,中位数标准误应该是0& # GSM524665.CEL appears to be an outlier on this plot too | GSM524665.CEL 在这个图中似乎是一个异常。& NUSE(celfiles.qc, main="NUSE")&We can also look at the relationships between the samples using heirarchical clustering:我们也可以使用分层聚类看样品间的关系。& eset &- exprs(celfiles.gcrma)& distance &- dist(t(eset),method="maximum")& clusters &- hclust(distance)& plot(clusters)&This suggests our HUVEC samples are very much a separate group compared to the eye tissues,&which show some evidence of clustering by tissue type.&这表明,HUVEC样品相对眼睛组织是一个更分离的组,表明通过组织类型聚类的证据。GSM524665.CEL does not appear to be a particular outlier in this view.GSM524665.CEL在这个视图中不像是一个特殊的异常。--------------------------------------------------7.Filtering data 过滤数据Now we have looked at the data, we can go on to analyse it.&现在我们可以继续分析了。The first stage of analysis is to filter out uninformative data such as control probesets and other internal controls as well as removing genes with low variance,&that would be unlikely to pass statistical tests for differential expression, or are expressed uniformly close to background detection levels.&分析的第一个阶段是过滤掉无信息的数据,比如质控探针和其他内部质控,去掉其他低变异、不太能通过差异表达统计检测的基因,或者接近背景噪声的值。The modifiers to nsFilter below tell nsFilter not to remove probesets without Entrez Gene identifiers, or have duplicated Entrez Gene identifiers.如下对nsFilter的修改,是为了让nsFilter不要移除没有Entrez基因识别号的探针集合,或者有重复Entrez基因识别号的探针集合。& celfiles.filtered &- nsFilter(celfiles.gcrma, require.entrez=FALSE, remove.dupEntrez=FALSE)& # What got removed and why?& celfiles.filtered$filter.log$numLowVar[1] 27307&$feature.exclude[1] 62From this we conclude 27,307 probesets have been removed for reasons of low variance, and 62 control probesets have been removed as well.我们得出结论 27,307个探针集合因为低变异被移除,62个质控基因也被移除了。--------------------------------------------------8.Finding differentially expressed probesets 找到差异表达基因探针集合 &Now we have a filtered dataset, we can send the information to limma for differential gene expression analysis.&现在我们有一个过滤过的数据集,我们可以把信息发送到limma做差异基因表达分析。First of all we need to extract information about the samples:首先我们需要提取样品的信息。& samples &- celfiles.gcrma$Target& # check the results of this&& samples[1] "iris" "retina" "retina" "iris" "retina" "iris" "choroid"[8] "choroid" "choroid" "huvec" "huvec" "huvec"& # convert into factors 转化为因子&& samples &- as.factor(samples)& # check factors have been assigned& samples[1] iris retina retina iris retina iris choroid choroid choroid[10] huvec huvec huvecLevels: choroid huvec iris retina& # set up the experimental design 建立实验设计&& design &- model.matrix(~0 + samples)& colnames(design) &- c("choroid", "huvec", "iris", "retina")& # inspect the experiment design& designchoroid huvec iris retina1 0 0 1 02 0 0 0 13 0 0 0 14 0 0 1 05 0 0 0 16 0 0 1 07 1 0 0 08 1 0 0 09 1 0 0 010 0 1 0 011 0 1 0 012 0 1 0 0attr(,"assign")[1] 1 1 1 1attr(,"contrasts")attr(,"contrasts")$samples[1] "contr.treatment"At this point we have normalised filtered data, and a description of the data and the samples and experimental design. This can be fed into limma for analysis.这时,我们有了标准化、过滤过的数据,数据、样本和实验设计的描述。这就可以发送给limma做分析了。& biocLite('limma')& library(limma)Attaching package: ‘limma’The following object is masked from ‘package:BiocGenerics’:& & plotMA& # fit the linear model to the filtered expression set& fit &- lmFit(exprs(celfiles.filtered$eset), design)&& # set up a contrast matrix to compare tissues v cell line& contrast.matrix &- makeContrasts(huvec_choroid = huvec - choroid, huvec_retina = huvec - retina, huvec_iris &- huvec - iris, levels=design)& # check the contrast matrix& contrast.matrixContrastsLevels huvec_choroid huvec_retina huvec_irischoroid -1 0 0huvec 1 1 1iris 0 0 -1retina 0 -1 0& # Now the contrast matrix is combined with the per-probeset linear model fit.& huvec_fits &- contrasts.fit(fit, contrast.matrix)博主看到如下提示:Warning message:In contrasts.fit(fit, contrast.matrix) :& row names of contrasts don't match col names of coefficients&&&&& huvec_ebFit &- eBayes(huvec_fits)& # return the top 10 results for any given contrast& # coef=1 is huvec_choroid, coef=2 is huvec_retina& topTable(huvec_ebFit, number=10, coef=1)& & & & & & & & logFC &AveExpr & & & & t & & &P.Value & &adj.P.Val & & & &B204779_s_at &7...669e-15 8. 20.25563207016_s_at &6...060e-14 5. 19.44681209631_s_at &5...337e-13 1. 18.96282242809_at & &6...404e-13 1. 18.70767205893_at & &4...534e-12 6. 17.75331227377_at & &3...854e-12 1. 17.01960204882_at & -5...318e-12 2. 16.7719638149_at & &-5...263e-11 5. 16.08854205576_at & &6...131e-11 5. 16.05990205453_at & &3...110e-11 5. 15.92277To impose a fold change cut off, and see how many genes are returned you can use the lfc modifier for topTable,&为了加一个fold change cut off,看看多少基因被返回,我们使用topTable的lfc修饰器。here we show the results for fold changes of 5,4,3 and 2 in terms of the number of probesets.这里展示了fold change分别是5,4,3,2时探针集合的数量。& nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=5))[1] 88& nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=4))[1] 194& nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=3))[1] 386& nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=2))[1] 1016& # Get a list for probesets with a four fold change or more | 获得一个大于等于4倍FC的探针列表& probeset.list &- topTable(huvec_ebFit, coef=1, number=10000, lfc=4)--------------------------------------------------9.Annotating the results with associated gene symbols 注释基因与相关基因符号In order to annotate the probesets into gene symbols we need to install and load the associated database package and the annotate package,&为了注释探针和基因符号,我们需要安装和爱在相关数据库包,以及注释包,then we can extract the probeset ID’s from the topTable results, and match the symbols然后我们可以从topTable结果中提取探针ID,然后和符号match。& biocLite("hgu133plus2.db")& library(hgu133plus2.db)& library(annotate)& #gene.symbols &- getSYMBOL(probeset.list$ID, "hgu133plus2")上面语句报错,google后[1]/questions//microarray-limma-package-in-toptable-function-dont-assign-id-for-probsets-colu[2]https://support.bioconductor.org/p/62400/The reason why Daniel Swan's code doesn't run exactly as is because the limma topTable now (in response to user requests)&puts the probe IDs into the rownames of the output table rather than into a column called "ID".&Daniel Swan的代码不运行的原因是因为limma的topTable现在(响应用户请求)把探针ID放到output表的rownames中,而不是叫做ID的一列。You can easily see this for yourself just by looking at the top table. So you just have to use rownames(probe.list) instead of probe.list$ID.你可以很容易的自己查看顶部表格。所以你仅仅使用 rownames(probe.list)而不是 probe.list$ID.可以替换成下面:& gene.symbols &- getSYMBOL(rownames(probeset.list), "hgu133plus2")& results &- cbind(probeset.list, gene.symbols)& head(results)& & & & & & & & logFC &AveExpr & & & & t & & &P.Value & &adj.P.Val & & & &B204779_s_at &7...669e-15 8. 20.25563207016_s_at &6...060e-14 5. 19.44681209631_s_at &5...337e-13 1. 18.96282242809_at & &6...404e-13 1. 18.70767205893_at & &4...534e-12 6. 17.75331204882_at & -5...318e-12 2. 16.77196& & & & & & gene.symbols204779_s_at & & & &HOXB7207016_s_at & & & & &NA&209631_s_at & & & & &NA&242809_at & & & & IL1RL1205893_at & & & & &NLGN1204882_at & & & ARHGAP25& write.table(results, "results.txt", sep="\t", quote=FALSE)--------------------------------------------------10.后续:生物显著性分析和火山图可视化There is a follow-on article from Simon Cockell on analyising the biological significance of results and&接下来有个Simon Cockell的结果的生物显著性分析。http://bioinformatics.knowledgeblog.org//downstream-functional-analysis-of-an-omics-experiment-using-freely-available-tools/another article by Colin Gillespie on visualising data from expression analysis as volcano plots.另一篇 Colin Gillespie的关于表达谱火山图的可视化分析。http://bioinformatics.knowledgeblog.org//volcano-plots-of-microarray-data/火山图常用于展示感兴趣的基因。log[fold change]作为x轴,-log10[p-value]作为y轴。使用GSE20986数据集,x轴展示了HUVEC质控细胞系和三个处理组内皮细胞系之一间的倍数变化。运行了上一次的分析,我们构建了含有fold change和p-value变化的表格。##Complete list of genes with p-values and fold change##Coef=1, so we are just looking at huvec_choroidgene_list &- topTable(huvec_ebFit, coef=1, number=1000000, sort.by="logFC")The relevant columns can be obtained via the “$” operator, viz:相关列可以通过$符号获取:##The head command prints the first few values of a vectorhead(gene_list$logFC)head(gene_list$P.Value)The volcano plot can then be generated using the standard plot command:火山图可以使用标准plot命令生成:##The par command sets "nice" graphical defaultspar(mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01)plot(gene_list$logFC, -log10(gene_list$P.Value),& & &xlim=c(-10, 10), ylim=c(0, 15), #Set limits& & &xlab="log2 fold change", ylab="-log10 p-value")#Set axis labels& & & &We can use R to determine how many genes are above certain cut-offs.&我们使用R来计算有多少基因在cut-off值之上。For example, the number of genes that have an absolute fold change greater than 2 and a p-value less than 0.001 can be found using the command:比如,计算多少基因的fold change值大于2且p-value小于0.001,可以使用如下命令:##Gives 717sum(abs(gene_list$logFC) & 2 & gene_list$P.Value & 0.001)&#[1] 670&A more principled way of choosing a p-value cut-off is to use a multiple testing rule.&一个更合理的原则是使用多重检验选择p-value截断值。Common rules are Bonferroni and FDR. When we carry out a large number of statistical tests,&the Bonferroni cut-off is approximately 0.05/#tests. So通常是设置Bonferroni 和 FDR。当我们处理大量统计检测时,Bonferroni截断值大概等于0.05/test数量。所以:##no_of_genes=27,306no_of_genes = dim(probeset.list)[1]##Gives 203 genessum(abs(gene_list$logFC) & 2 & gene_list$P.Value & 0.05/no_of_genes)#[1] 528&If we wanted to use the false discovery rate as a cut-off (FDR), then we would use the adjusted p-value:如果我们使用FDR作为截断值,我们可以使用adjusted p-value:##Gives 933sum(abs(gene_list$logFC) & 2 & gene_list$adj.P.Val & 0.05)&#[1] 881In many microarray studies, the FDR cut-off is used.&在很多芯片研究中,使用了FDR截断值。However, when we have a large number of statistically significant genes, as in this example, a more conservative rule can be useful.然而,当我们有很多显著性基因时,比如这个例子,使用更加保守的规则更有效。Using ggplot2 for volcano plots 使用ggplot2画火山图查看都安装了哪些包:&library()如果没有安装ggplot2,需要先安装install.packages("ggplot2")然后画图:library(ggplot2)##Highlight genes that have an absolute fold change & 2 and a p-value & Bonferroni cut-offgene_list$threshold = as.factor(abs(gene_list$logFC) & 2 & gene_list$P.Value & 0.05/no_of_genes)&##Construct the plot objectg = ggplot(data=gene_list, aes(x=logFC, y=-log10(P.Value), colour=threshold)) +& geom_point(alpha=0.4, size=1.75) +& # opts(legend.position = "none") +&& xlim(c(-10, 10)) + ylim(c(0, 15)) +& xlab("log2 fold change") + ylab("-log10 p-value")g#Error: could not find function "opts"这个错误不知道为什么?去掉就可以了。&By using the “alpha” and “size” options in ggplot2, we can control the transparency and size of the points.通过使用ggplot2的alpha和size选项,我们可以控制点的透明度和大小。To add text to the plot, we use geom_text. For example,给点添加文字,我们使用geom_text。比如:##Graph not showngene1=g + geom_text(aes(x=gene1$logFC, y=-log10(gene1$P.Value), label=gene1$gene.symbols, size=1.2), colour="black")g + geom_text(aes(x=gene1$logFC, y=-log10(gene1$P.Value), label=gene1[1,]), colour="black", size=1.2)You can also pass vectors of text labels.我们也可以给text标签传递向量。最后这个没有画图成功。时间有限,就先在linux下保存数据了,并邮件保存文本文件。write.table(gene_list,file='gene_list-write.txt')R变量与文件写入和载入save()、load()记录这两个变量(results,gene_list)到文件使用命令:Save R Objects:&例如:save(x, y, file = "xy.RData")Reload Saved Datasetsload(file, envir = parent.frame(), verbose = FALSE)## save all dataxx &- pi # to ensure there is some datasave(list = ls(all = TRUE), file= "all.RData")rm(xx)## restore the saved values to the current environmentlocal({& &load("all.RData")& &ls()})save(a,file="a.txt") #文件乱码write.table(a,"a1.txt") #文件正常&=============================================之后在win7下RStudio继续画图。发现个问题:no_of_genes该怎么算?画的ggplot图中p值分界线都不一样。使用的R版本3.3.1& version
x86_64-w64-mingw32
x86_64, mingw32
version.string R version 3.3.1 ()nickname
Bug in Your Hair
文件从邮箱下载后放到了 D:/R_code/ggplot/gene_list-write.txt###############################################The downloaded binary packages are in#C:\Users\Administrator\AppData\Local\Temp\RtmpOWJnCb\downloaded_packagessetwd('D:/R_code')#基因注释# source("http://bioconductor.org/biocLite.R")source("biocLite.R")biocLite("hgu133plus2.db")library(hgu133plus2.db)## biocLite("XML")#install.packages("XML")biocLite("annotate")library(annotate)# 如果header=TRUE且第一行表头少了一列,则第一列默认为行名row.names&# http://bbs.pinggu.org/thread--1.html 及help F1gene_list=read.csv('ggplot/gene_list-write.txt',header=TRUE,sep=" ")gene_list=gene_list[,-7]str(gene_list)head(gene_list)#这个基因数该怎么选? 多重检验的值no_of_genes=nrow(gene_list) &# 选出一部分基因:FC大且p小的基因dd_text = gene_list[(abs(gene_list$logFC) & 2) & (gene_list$P.Value & 0.05/no_of_genes),]# 添加注释列gene.symbols &- getSYMBOL(rownames(dd_text), "hgu133plus2")results &- cbind(dd_text, gene.symbols)head(results)& & & & & & & & logFC &AveExpr & & & & &t & & &P.Value & &adj.P.Val & & & & B gene.symbols211796_s_at &8..138687 & 9.. 2. &5.589419 & & & & &NA&207526_s_at &7.... 1. 14.843452 & & & IL1RL1202437_s_at -7.... 9. &6.883201 & & & CYP1B1201438_at & -7..966922 &-9.. 2. &5.647887 & & & COL6A3204779_s_at &7.... 8. 20.255633 & & & &HOXB7207016_s_at &6.... 5. 19.446814 & & & & &NA&#画黑白图par(mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01)plot(gene_list$logFC, -log10(gene_list$P.Value),& & &xlim=c(-10, 10), ylim=c(0, 15), #Set limits& & &xlab="log2 fold change", ylab="-log10 p-value")#Set axis labels&# ggplot2 彩图library(ggplot2)gene_list$threshold = as.factor(abs(gene_list$logFC) & 2 & gene_list$P.Value & 0.05/no_of_genes)head(gene_list)g = ggplot(data=gene_list, aes(x=logFC, y=-log10(P.Value), colour=threshold)) +& geom_point(alpha=0.4, size=1.5) +& # opts(legend.position = "none") +&& xlim(c(-10, 10)) + ylim(c(0, 15)) +& xlab("log2 fold change") + ylab("-log10 p-value")g&#添加文字-基因名g + geom_text(data=results, aes(x=logFC, y=-log10(P.Value), label=gene.symbols),& & & & & & & colour="black")&#Warning message:# &Removed 18 rows containing missing values (geom_text).&#因为有些基因无法注释,可以去掉NA值nrow(results) # 去掉之前 190results=na.omit(results)nrow(results) # 去掉之后 172g + geom_text(data=results, aes(x=logFC, y=-log10(P.Value), label=gene.symbols),& & & & & & & colour="black")图形没有变化!只是不出警告了。&# g还是基本图形g------------------------------------
阅读(612)|
用微信&&“扫一扫”
将文章分享到朋友圈。
用易信&&“扫一扫”
将文章分享到朋友圈。
历史上的今天
在LOFTER的更多文章
loftPermalink:'',
id:'fks_',
blogTitle:'如何使用BioConductor分析芯片数据——图文配代码',
blogAbstract:'概述芯片便宜、标准化做的比较好,结论比测序更准确,其价值正在被重新发现!从最近看的几篇文章可见一斑。Bindea, Gabriela, et al. \"Spatiotemporal dynamics of intratumoral immune cells reveal the immune landscape in human cancer.\"&Immunity',
blogTag:'',
blogUrl:'blog/static/',
isPublished:1,
istop:false,
modifyTime:8,
publishTime:1,
permalink:'blog/static/',
commentCount:0,
mainCommentCount:0,
recommendCount:0,
bsrk:-100,
publisherId:0,
recomBlogHome:false,
currentRecomBlog:false,
attachmentsFileIds:[],
groupInfo:{},
friendstatus:'none',
followstatus:'unFollow',
pubSucc:'',
visitorProvince:'',
visitorCity:'',
visitorNewUser:false,
postAddInfo:{},
mset:'000',
remindgoodnightblog:false,
isBlackVisitor:false,
isShowYodaoAd:false,
hostIntro:'Java/PHP/js/python/R等web软件开发, 生物信息学, 数据挖掘与机器学习,各种CMS, 基因组, 转录组, NGS高通量数据分析, 生物数据挖掘, 肿瘤免疫基因组与蛋白组分析',
hmcon:'0',
selfRecomBlogCount:'0',
lofter_single:''
{list a as x}
{if x.moveFrom=='wap'}
{elseif x.moveFrom=='iphone'}
{elseif x.moveFrom=='android'}
{elseif x.moveFrom=='mobile'}
${a.selfIntro|escape}{if great260}${suplement}{/if}
{list a as x}
推荐过这篇日志的人:
{list a as x}
{if !!b&&b.length>0}
他们还推荐了:
{list b as y}
转载记录:
{list d as x}
{list a as x}
{list a as x}
{list a as x}
{list a as x}
{if x_index>4}{break}{/if}
${fn2(x.publishTime,'yyyy-MM-dd HH:mm:ss')}
{list a as x}
{if !!(blogDetail.preBlogPermalink)}
{if !!(blogDetail.nextBlogPermalink)}
{list a as x}
{if defined('newslist')&&newslist.length>0}
{list newslist as x}
{if x_index>7}{break}{/if}
{list a as x}
{var first_option =}
{list x.voteDetailList as voteToOption}
{if voteToOption==1}
{if first_option==false},{/if}&&“${b[voteToOption_index]}”&&
{if (x.role!="-1") },“我是${c[x.role]}”&&{/if}
&&&&&&&&${fn1(x.voteTime)}
{if x.userName==''}{/if}
网易公司版权所有&&
{list x.l as y}
{if defined('wl')}
{list wl as x}{/list}}

我要回帖

更多关于 数据芯片哪个图爆的多 的文章

更多推荐

版权声明:文章内容来源于网络,版权归原作者所有,如有侵权请点击这里与我们联系,我们将及时删除。

点击添加站长微信