Affymetrix microarray data normalization and quality assessment
Denis Puthier and Jacques van Helden
This tutorial is a brief tour of the language capabilities and is intended to give some clues for getting started with the R programming language. For a more detailed overview, see R for beginners (E. Paradis).
Bioconductor
From Wikipedia:
Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.
Most Bioconductor components are distributed as R packages, which are add-on modules for R. Initially most of the Bioconductor software packages focused on the analysis of single channel Affymetrix and two or more channel cDNA/Oligo microarrays. As the project has matured, the functional scope of the software packages broadened to include the analysis of all types of genomic data, such as SAGE, X-seq data (RNA-Seq, ChIP-Seq, ...), or SNP data.
The broad goals of the project are to:
- Provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
- Facilitate the inclusion of biological metadata in the analysis of genomic data, e.g. annotation data from UCSC or GO database.
- Provide a common software platform that enables the rapid development and deployment of plug-able, scalable, and interoperable software.
- Further scientific understanding by producing high-quality documentation and reproducible research.
Some areas covered by the Bioconductor project, with representative packages:
- Affymetrix GeneChip analysis: affy, simpleaffy
- Affymetrix exon arrays: xmapcore, xps
- Probe Metadata: annotate, hgu133aprobe, hgu95av2probe, ABPkgBuilder
- Microarray data filtering: genefilter
- Statistical analysis of microarrays: samr, siggenes, multtest, DEDS, pickgene
- Tiling arrays: AffyTiling, tilingArray
- CGH array analysis: CGHbase, snapCGH
- NGS quality control/filtering: ShortRead
- RNA-Seq: easyRNASeq, DESeq
- ChIP-Seq: chipseq
- High level plotting functions: geneplotter
- Functional enrichment analysis: GO, GOstats, goCluster, geneplotter
- Genome coordinates: GenomicFeatures, genomeIntervals, GenomeGraphs, GenomicRanges
- Graphs: graph, Rgraphviz, biocGraph
- Flow cytometry: flowCore, flowViz
- Variant calling: VariantTools
- Proteomics: MassSpecWavelet
- Image analysis: EBImage
Installing Bioconductor
To install Bioconductor, you need to retrieve the biocLite function from the Bioconductor web site. We will also check that some annotation packages for Affymetrix GeneChips are available on your computer. To use the command below, start R (type R within a terminal).
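The installation command itself is not reproduced in this text. As a hedged sketch: Bioconductor releases from 3.8 onwards install packages through the BiocManager package from CRAN, which replaced the older biocLite script. The package list below (affy, hgu133a.db) is an assumption based on the libraries used later in this tutorial.

```r
# Packages used later in this tutorial (an assumption about what is needed).
needed <- c("affy", "hgu133a.db")

# Identify the packages that are not installed yet.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!installed]

if (length(to_install) > 0) {
  # BiocManager is the CRAN package that replaced the biocLite() script.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(to_install)
}
```

Nothing is installed when both packages are already present, so the block can be re-run safely.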
S4 objects
We have seen several classes of objects so far (vector, factor, matrix, data.frame...). In R, one can also create custom classes of objects in order to store and interrogate more complex objects.
Let us say one needs to store an experiment related to two-color microarrays. We then have to store values from the red and green channels for both foreground and background signals. We could also be interested in storing the symbols of the genes measured, the kind of microarray platform used, a description of the experiment (...). The interesting point is that a single instance of such a class would store all the information related to the experiment, making it easier to manipulate and share.
Let us design such a class, which we will call microarrayBatch. We will use the setClass function, which stores the class definition.
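A minimal sketch of such a class definition follows. Only the class name microarrayBatch and the slot G appear in this text; the other slot names (R, Rb, Gb, genes, platform, description) are illustrative assumptions chosen to match the description above.

```r
library(methods)  # S4 machinery (attached by default in interactive R)

# One slot per piece of information described above.
# Slot names other than G are assumptions made for this sketch.
setClass(
  "microarrayBatch",
  representation(
    R = "matrix",            # red channel, foreground
    G = "matrix",            # green channel, foreground
    Rb = "matrix",           # red channel, background
    Gb = "matrix",           # green channel, background
    genes = "character",     # gene symbols
    platform = "character",  # microarray platform
    description = "character"
  )
)

# Check that the class definition has been stored.
existsClass("microarrayBatch")
```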
Now that the class is defined, we can create an instance of this class (an object). Inside R, this object is viewed as an S4 object of class 'microarrayBatch'. Like any classical S4 object, it contains a set of slots whose names can be accessed with the slotNames function.
The type of object stored in each slot can be accessed using the getClassDef function.
Let us store an artificial data set with two microarrays, each containing 10 genes.
As with every S4 object, each slot can be accessed using the @ operator:
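The steps above can be sketched as follows. The object name myMA, the slot names other than G, and the simulated intensity values are all assumptions made for illustration.

```r
library(methods)

# Class definition repeated so that this block runs on its own
# (slot names other than G are assumptions).
setClass("microarrayBatch",
         representation(R = "matrix", G = "matrix", Rb = "matrix", Gb = "matrix",
                        genes = "character", platform = "character",
                        description = "character"))

myMA <- new("microarrayBatch")   # an empty instance of the class
slotNames(myMA)                  # names of the slots
getClassDef("microarrayBatch")   # slot names together with their types

# Artificial data: 10 genes (rows) measured on 2 microarrays (columns).
set.seed(1)
genes <- paste0("gene", 1:10)
fake_channel <- function() {
  matrix(round(runif(20, min = 100, max = 5000)), nrow = 10,
         dimnames = list(genes, c("array1", "array2")))
}
myMA@R <- fake_channel()
myMA@G <- fake_channel()
myMA@Rb <- fake_channel()
myMA@Gb <- fake_channel()
myMA@genes <- genes

myMA@G        # content of slot G, accessed with the @ operator
dim(myMA@G)   # 10 genes x 2 microarrays
```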
We can link functions (called methods) to this object. For instance, we can define a method getGreen() for the class microarrayBatch. This will retrieve the data stored in slot G (green channel of the two-color microarray).
Now let us call this function. We can check that it returns, as expected, the content of slot G of our microarrayBatch object.
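A hedged sketch of the method definition and the check: the generic is created with setGeneric and the method is attached with setMethod. The argument name object, the instance name myMA and the toy matrices are assumptions.

```r
library(methods)

# Class definition repeated so that this block runs on its own
# (slot names other than G are assumptions).
setClass("microarrayBatch",
         representation(R = "matrix", G = "matrix", Rb = "matrix", Gb = "matrix",
                        genes = "character", platform = "character",
                        description = "character"))

# Declare a generic function, then the method for our class.
setGeneric("getGreen", function(object) standardGeneric("getGreen"))
setMethod("getGreen", "microarrayBatch", function(object) object@G)

# A small instance with toy values in the red and green channels.
myMA <- new("microarrayBatch",
            R = matrix(1:20, nrow = 10),
            G = matrix(21:40, nrow = 10))

getGreen(myMA)                      # call the method
identical(getGreen(myMA), myMA@G)   # same content as slot G
```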
As shown in this example, we can easily define new objects and methods within R. This S4 formalism is used throughout the Bioconductor project.
The dataset from Den Boer (2009)
Here we will use a subset of the GSE13425 experiment, which can be retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemia (ALL), which is characterized by an abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.
Data were produced using Affymetrix GeneChips (Affymetrix Human Genome U133A Array, HGU133A). Information related to this platform is available on the GEO website under identifier GPL96.
- Go to the GEO website to get information about the experiment GSE13425.
- What kind of tumor types were analyzed ?
- What does HGU133A stand for ?
Reading Affymetrix data
Retrieving data
- Open a terminal
- Create a directory GSE13425.
- Move to this directory.
- Download the subset of the affymetrix raw files : GSE13425_sub.tar
- Uncompress the files
- Download data related to sample phenotypes phenoData_sub.txt.
- Have a look at the phenoData_sub.txt file.
Note: we won't perform pre-processing of the full dataset due to memory and time issues.
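The steps above are meant for a terminal. An equivalent sketch from within R is given below, assuming that GSE13425_sub.tar and phenoData_sub.txt have already been downloaded into the current working directory (the download address is not given in this text, so it is not reproduced). Unlike the terminal steps, the sketch unpacks the archive into the GSE13425 directory without changing the working directory.

```r
# Assumption: both files were downloaded by hand into the working directory.
data_dir <- "GSE13425"
dir.create(data_dir, showWarnings = FALSE)   # create the directory if needed

archive <- "GSE13425_sub.tar"
if (file.exists(archive)) {
  untar(archive, exdir = data_dir)           # unpack the raw files
}
list.files(data_dir)                         # files now available

pheno_file <- "phenoData_sub.txt"
if (file.exists(pheno_file)) {
  print(readLines(pheno_file, n = 5))        # peek at the first lines
}
```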
Loading data into R
- Launch R
- Load the affy library
- Using the ReadAffy function, load the raw data and assign the result to an object named affy.s13 (Note: the object name is arbitrary; we chose it to indicate that the object is of class 'AffyBatch' and contains the intensity values for 13 selected samples of the Den Boer dataset).
- Print the affy.s13 object.
- What is the class of this object ?
- What slots does this object contain ?
- How many probes does the microarray contain ?
- Ask for help about the corresponding class
- What does the assayData slot contain ?
- Have a look at the methods associated with this class
- What does the exprs method return ? What are the dimensions ?
- Does the expression matrix contain as many rows as the number of cells on the array ?
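The questions above can be explored with a sketch like the one below. It assumes that the CEL files sit in a directory named GSE13425 (the variable cel_dir is hypothetical) and that the affy package is installed; the guard simply skips the analysis when either is missing.

```r
cel_dir <- "GSE13425"   # assumption: directory holding the unpacked CEL files
cel_files <- list.files(cel_dir, pattern = "\\.cel(\\.gz)?$",
                        ignore.case = TRUE, full.names = TRUE)

if (length(cel_files) > 0 && requireNamespace("affy", quietly = TRUE)) {
  library(affy)
  affy.s13 <- ReadAffy(filenames = cel_files)

  print(affy.s13)               # summary of the object
  print(class(affy.s13))        # class of the object
  print(slotNames(affy.s13))    # slots it contains
  print(dim(exprs(affy.s13)))   # rows = cells of the array, columns = samples

  # Interactive help on the class and its methods:
  # help("AffyBatch-class")
  # showMethods(classes = "AffyBatch")
}
```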
Loading phenotypic data
By default, the ReadAffy function does not load phenotypic data. These data can be loaded using the read.AnnotatedDataFrame function, which returns an object of class AnnotatedDataFrame.
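As a hedged illustration, the sketch below writes a small toy phenotype table to a temporary file and reads it back with read.AnnotatedDataFrame from the Biobase package. The column layout of the real phenoData_sub.txt is not described here, so the toy columns (sample, group) are assumptions.

```r
# Toy phenotype table written to a temporary file (stand-in for phenoData_sub.txt).
pheno_file <- tempfile(fileext = ".txt")
toy <- data.frame(sample = c("s1.CEL", "s2.CEL"),
                  group = c("groupA", "groupB"))
write.table(toy, pheno_file, sep = "\t", quote = FALSE, row.names = FALSE)

if (requireNamespace("Biobase", quietly = TRUE)) {
  # The first column provides the sample names (row names).
  pd <- Biobase::read.AnnotatedDataFrame(pheno_file, sep = "\t",
                                         header = TRUE, row.names = 1)
  print(class(pd))
  print(Biobase::pData(pd))

  # With an AffyBatch named affy.s13 whose sample names match these row names:
  # phenoData(affy.s13) <- pd
}
```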
Indexing an AffyBatch object
The indexing operator '[' (which is in fact a function) is also redefined in the source code of the affy library. The code stipulates that the indexing function will always return an AffyBatch object. In the following example, when selecting two microarrays, we select both the expression values and the corresponding phenotypic data.
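A minimal sketch of this indexing, under the same assumptions as before (CEL files in a GSE13425 directory, affy installed); the name affy.sub is hypothetical.

```r
cel_dir <- "GSE13425"   # assumption: directory holding the unpacked CEL files
cel_files <- list.files(cel_dir, pattern = "\\.cel(\\.gz)?$",
                        ignore.case = TRUE, full.names = TRUE)

if (length(cel_files) >= 2 && requireNamespace("affy", quietly = TRUE)) {
  library(affy)
  affy.s13 <- ReadAffy(filenames = cel_files)

  affy.sub <- affy.s13[, 1:2]     # keep the first two microarrays
  print(class(affy.sub))          # still an AffyBatch
  print(dim(exprs(affy.sub)))     # expression values for two arrays only
  print(pData(affy.sub))          # phenotypic data subset accordingly
}
```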
Affy library: graphics
The image function
- Generate a pseudo image of the first and second arrays using the image function.
The barplot.ProbeSet() function
The probeSet names can be accessed through the geneNames function.
Note: the method geneNames() returns probeset identifiers rather than actual 'gene names'.
Given one or several probeSet IDs, the probeset method allows one to extract the corresponding probe expression values.
- Use the function barplot.ProbeSet() to visualize the intensity values for the perfect match probes (PM) and mismatch probes (MM) of the probeSet with identifier '200000_s_at'.
- Do the same for the probesets '221798_x_at' and '209380_s_at'.
- What can we conclude about the PM and MM values for these probesets ?
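The graphics exercises of this section can be sketched as follows, again assuming CEL files in a GSE13425 directory and an installed affy package. The probeset method returns a list of ProbeSet objects, one per requested identifier.

```r
cel_dir <- "GSE13425"   # assumption: directory holding the unpacked CEL files
cel_files <- list.files(cel_dir, pattern = "\\.cel(\\.gz)?$",
                        ignore.case = TRUE, full.names = TRUE)
ids <- c("200000_s_at", "221798_x_at", "209380_s_at")

if (length(cel_files) >= 2 && requireNamespace("affy", quietly = TRUE)) {
  library(affy)
  affy.s13 <- ReadAffy(filenames = cel_files)

  image(affy.s13[, 1])              # pseudo-image of the first array
  image(affy.s13[, 2])              # pseudo-image of the second array

  print(head(geneNames(affy.s13)))  # probeset identifiers

  # Extract PM and MM values for the requested probesets and plot each one.
  ps <- probeset(affy.s13, genenames = ids)
  for (p in ps) {
    barplot.ProbeSet(p)
  }
}
```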
Quality control of raw data
Descriptive statistics
- Create an object named affyLog2, which will contain the expression values transformed in logarithmic scale (base 2).
- Display the distribution of the first array using the hist function (use the affyLog2 object).
- Use the plotDensity function to display microarray distributions (use the affyLog2 object).
- Use the boxplot function to display microarray distributions (use the affyLog2 object and pch='.' as argument).
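The sketch below illustrates these descriptive plots on simulated intensities, so that it runs without the raw files. With the real data, affyLog2 would be computed as log2(exprs(affy.s13)), and affy::plotDensity(affyLog2) would draw the density curves directly; the simulated matrix and its dimensions are assumptions.

```r
# Simulated raw intensities (1000 probes x 4 arrays), a stand-in for exprs(affy.s13).
set.seed(42)
raw <- matrix(2^rnorm(4000, mean = 8, sd = 2), ncol = 4,
              dimnames = list(NULL, paste0("array", 1:4)))
affyLog2 <- log2(raw)   # logarithmic scale, base 2

# Distribution of the first array.
hist(affyLog2[, 1], breaks = 50, main = colnames(affyLog2)[1],
     xlab = "log2 intensity")

# One density curve per array, drawn with base graphics.
dens <- lapply(seq_len(ncol(affyLog2)), function(k) density(affyLog2[, k]))
y_max <- max(sapply(dens, function(d) max(d$y)))
plot(dens[[1]], ylim = c(0, y_max), main = "Densities", xlab = "log2 intensity")
for (k in seq_along(dens)[-1]) {
  lines(dens[[k]], col = k)
}

# Box plots, one per array.
boxplot(affyLog2, pch = ".", ylab = "log2 intensity")
```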
AffyRNAdeg
The box plots and histograms generated above show the global distribution of intensity values for all probes. A well-known pitfall of Affymetrix technology is the RNA degradation effect: for a given gene, intensities tend to decrease from the 3'-most probes to the 5'-most probes. The affy library implements a specific quality control criterion, which plots the change in mean intensity from 5' to 3' probes (AffyRNAdeg function).
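A short sketch of this quality control step, assuming CEL files in a GSE13425 directory and an installed affy package; the name deg is hypothetical.

```r
cel_dir <- "GSE13425"   # assumption: directory holding the unpacked CEL files
cel_files <- list.files(cel_dir, pattern = "\\.cel(\\.gz)?$",
                        ignore.case = TRUE, full.names = TRUE)

if (length(cel_files) > 0 && requireNamespace("affy", quietly = TRUE)) {
  library(affy)
  affy.s13 <- ReadAffy(filenames = cel_files)

  deg <- AffyRNAdeg(affy.s13)      # mean intensity by probe position (5' to 3')
  plotAffyRNAdeg(deg)              # one line per array
  print(summaryAffyRNAdeg(deg))    # slope and p-value per array
}
```

Arrays whose slope differs markedly from the others deserve a closer look.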
Present/absent calls
It is generally important to select a set of genes that are above the background in at least a given number of samples. The Affymetrix reference method computes, for each probeSet, an Absent/Marginal/Present call (A/M/P). Note, however, that this method relies on comparing the PM and MM signals, and that MM signals tend to follow the PM signal. The method is implemented in the mas5calls function (as it was originally part of the MAS5 normalization algorithm).
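A hedged sketch of the call computation and of a simple filter follows. The threshold of two samples is arbitrary and only serves as an example; the names calls, present_n and keep are hypothetical.

```r
cel_dir <- "GSE13425"   # assumption: directory holding the unpacked CEL files
cel_files <- list.files(cel_dir, pattern = "\\.cel(\\.gz)?$",
                        ignore.case = TRUE, full.names = TRUE)
min_present <- 2        # arbitrary threshold chosen for illustration

if (length(cel_files) > 0 && requireNamespace("affy", quietly = TRUE)) {
  library(affy)
  affy.s13 <- ReadAffy(filenames = cel_files)

  calls <- exprs(mas5calls(affy.s13))   # matrix of "A", "M" or "P" calls
  print(table(calls[, 1]))              # calls for the first array

  # Keep probesets called Present in at least min_present samples.
  present_n <- rowSums(calls == "P")
  keep <- names(present_n)[present_n >= min_present]
  print(length(keep))
}
```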
Data normalization
Numerous methods have been proposed for Affymetrix data normalization (mas5, PLIER, Li-Wong, rma, gcrma, ...). These methods rely on elaborate treatments, including inter-sample normalization. A detailed description and comparison of these methods is beyond the scope of this course. For this practical, we will use the rma() function. Note that RMA normalization includes a log2 transformation of the raw data.
- What object is returned by rma ?
- Which slots does the object contain ?
- Ask for some help about the class of this object.
- Use the smoothScatter function (library geneplotter) to compare normalized values from the first and second microarray.
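The normalization step and the comparison of two arrays can be sketched as below, assuming CEL files in a GSE13425 directory and an installed affy package. In current R versions smoothScatter is also provided by the base graphics package, which is what this sketch relies on.

```r
cel_dir <- "GSE13425"   # assumption: directory holding the unpacked CEL files
cel_files <- list.files(cel_dir, pattern = "\\.cel(\\.gz)?$",
                        ignore.case = TRUE, full.names = TRUE)

if (length(cel_files) >= 2 && requireNamespace("affy", quietly = TRUE)) {
  library(affy)
  affy.s13 <- ReadAffy(filenames = cel_files)

  eset <- rma(affy.s13)        # background correction, normalization, summarization
  print(class(eset))           # class of the returned object
  print(slotNames(eset))       # slots it contains

  norm <- exprs(eset)          # log2 expression values, one row per probeset
  smoothScatter(norm[, 1], norm[, 2],
                xlab = colnames(norm)[1], ylab = colnames(norm)[2])
  abline(0, 1, col = "red")    # identity line
}
```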
The ExpressionSet object
The ExpressionSet class is central to BioC as lots of packages converge to produce ExpressionSet instances. This simple object is intended to store normalized data from various technologies.
Checking the normalization results
Relative Log Expression (RLE)
One can use classical diagrams to visualize the normalization results. Another way to check the normalization of an expression matrix is the Relative Log Expression (RLE) plot.
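As a hedged illustration, an RLE plot is commonly built by subtracting, for each gene, its median log expression across samples, then drawing one box plot per sample; well-normalized arrays give boxes centered on zero. The sketch uses a simulated matrix in which the fourth array is deliberately shifted; with the real data one would use exprs(eset).

```r
# Simulated log2 expression matrix (500 probesets x 4 arrays).
set.seed(7)
expr <- matrix(rnorm(2000, mean = 8, sd = 1.5), ncol = 4,
               dimnames = list(paste0("probeset", 1:500), paste0("array", 1:4)))
expr[, 4] <- expr[, 4] + 0.8   # array4 is shifted to mimic a poorly normalized array

# Relative log expression: deviation from the per-gene median.
gene_median <- apply(expr, 1, median)
rle_mat <- sweep(expr, 1, gene_median, "-")

boxplot(rle_mat, pch = ".", ylab = "Relative log expression")
abline(h = 0, lty = 2)

apply(rle_mat, 2, median)   # per-array medians, expected close to 0 after normalization
```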
MA plot diagram
One popular diagram in DNA chip analysis is the M versus A plot (MA plot). In this diagram:
M is the log intensity ratio calculated for any gene.
A is the average log intensity which corresponds to an estimate of the gene expression level.
If the data were perfectly normalized, M values would not depend on A values. To draw the MA plot, we will first compute values for a pseudo-microarray that will serve as the reference. This pseudo-microarray will be highly representative of the series, as it will contain the median expression value of each gene.
$$M_g = I_{g,1} - I_{g,\mathrm{ref}}$$
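A sketch of the MA plot on simulated data follows. M is computed as in the formula above; for A, the sketch takes the mean of the array and reference log intensities, which is the usual convention but is an assumption here since the formula for A is not given. With the real data, expr would be exprs(eset).

```r
# Simulated log2 expression matrix (1000 probesets x 3 arrays).
set.seed(11)
expr <- matrix(rnorm(3000, mean = 8, sd = 1.5), ncol = 3,
               dimnames = list(paste0("probeset", 1:1000), paste0("array", 1:3)))

ref <- apply(expr, 1, median)   # pseudo-microarray: median value of each gene
M <- expr[, 1] - ref            # log ratio of array 1 versus the reference
A <- (expr[, 1] + ref) / 2      # average log intensity (assumed definition)

plot(A, M, pch = ".", main = "MA plot: array1 versus median reference")
abline(h = 0, col = "red")            # expected position of M after normalization
lines(lowess(A, M), col = "blue")     # trend of M as a function of A
```

A trend line that stays close to the horizontal line indicates that M does not depend on A.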
Probe annotations
As you have probably noticed, gene names are available neither in the AffyBatch object nor in the eset object. Each Affymetrix microarray has its own annotation library that can be used to link probesets to gene symbols and to retrieve additional information about genes. Here we need to load the hgu133a.db library. If it is not already installed, use the biocLite function.
This library gives access to a set of annotation sources that can be listed using the hgu133a function.
The following commands can be used to retrieve gene Symbols for the hgu133a geneChip.
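A hedged sketch of such commands is given below; it only runs when hgu133a.db is installed. The three probeset identifiers are those used in the graphics section, and the name sym is hypothetical.

```r
ids <- c("200000_s_at", "221798_x_at", "209380_s_at")

if (requireNamespace("hgu133a.db", quietly = TRUE)) {
  library(hgu133a.db)

  hgu133a()   # list the annotation sources provided by the library

  # Map probeset identifiers to gene symbols (NA when no symbol is known).
  sym <- unlist(mget(ids, hgu133aSYMBOL, ifnotfound = NA))
  print(sym)

  # The same information, with gene names, through the select interface.
  print(AnnotationDbi::select(hgu133a.db, keys = ids,
                              columns = c("SYMBOL", "GENENAME"),
                              keytype = "PROBEID"))
}
```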
Writing data onto disk
R objects can be saved using the save function and subsequently reloaded using the load function. For tab-delimited file output, one may use the write.table function.
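A minimal sketch of both approaches, using a small simulated matrix and temporary files so that nothing is written to the working directory (file names and the matrix are assumptions):

```r
# Small simulated expression matrix (4 probesets x 3 arrays).
set.seed(5)
expr <- matrix(round(rnorm(12, mean = 8, sd = 1.5), 2), ncol = 3,
               dimnames = list(paste0("probeset", 1:4), paste0("array", 1:3)))
original <- expr

# Binary R format: save, remove, then restore the object under the same name.
rdata_file <- tempfile(fileext = ".RData")
save(expr, file = rdata_file)
rm(expr)
load(rdata_file)   # restores 'expr'

# Tab-delimited text file; col.names = NA leaves a blank header cell above row names.
txt_file <- tempfile(fileext = ".txt")
write.table(expr, file = txt_file, sep = "\t", quote = FALSE, col.names = NA)

# Read the text file back and compare with the original matrix.
back <- as.matrix(read.delim(txt_file, row.names = 1))
all.equal(back, original)
```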
Additional exercises
Using box plots and density plots, compare the effect on raw PM data of quantile normalization versus median centering, median centering and scaling, and median centering and scaling with the MAD.
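One possible starting point for this exercise is sketched below on simulated log2 PM values; with the real data, pm_log would be log2(pm(affy.s13)). The helper quantile_normalize is a hypothetical base R implementation written for this sketch, and "scaling" is taken to mean division by the standard deviation, which is an assumption.

```r
# Simulated log2 PM values for 3 arrays with different locations and spreads.
set.seed(3)
pm_log <- sapply(c(7.5, 8, 8.5), function(m) rnorm(1000, mean = m, sd = m / 6))
colnames(pm_log) <- paste0("array", 1:3)

# Median centering, then scaling by the standard deviation or by the MAD.
med_centered <- sweep(pm_log, 2, apply(pm_log, 2, median), "-")
scaled_sd <- sweep(med_centered, 2, apply(pm_log, 2, sd), "/")
scaled_mad <- sweep(med_centered, 2, apply(pm_log, 2, mad), "/")

# Quantile normalization: every array receives the same set of values,
# namely the mean of the sorted columns, assigned according to rank.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}
qn <- quantile_normalize(pm_log)

# Compare the four results with box plots.
old_par <- par(mfrow = c(2, 2))
boxplot(med_centered, pch = ".", main = "Median centering")
boxplot(scaled_sd, pch = ".", main = "Centering + scaling (sd)")
boxplot(scaled_mad, pch = ".", main = "Centering + scaling (MAD)")
boxplot(qn, pch = ".", main = "Quantile normalization")
par(old_par)
```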
References
- Den Boer et al. A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol (2009) vol. 10 (2) pp. 125-34.
The Names of an Object
Functions to get or set the names of an object.
- Keywords
- attribute
Usage
names(x)
names(x) <- value
Arguments
x: an R object.
value: a character vector of up to the same length as x, or NULL.
Details
names is a generic accessor function, and names<- is a generic replacement function. The default methods get and set the 'names' attribute of a vector (including a list) or pairlist.
For an environment env, names(env) gives the names of the corresponding list, i.e., names(as.list(env, all.names = TRUE)), which are also given by ls(env, all.names = TRUE, sorted = FALSE). If the environment is used as a hash table, names(env) are its “keys”.
If value is shorter than x, it is extended by character NAs to the length of x.
It is possible to update just part of the names attribute via the general rules: see the examples. This works because the expression there is evaluated as z <- 'names<-'(z, '[<-'(names(z), 3, 'c2')).
The name "" is special: it is used to indicate that there is no name associated with an element of a (atomic or generic) vector. Subscripting by "" will match nothing (not even elements which have no name).
A name can be character NA, but such a name will never be matched and is likely to lead to confusion.
Both are primitive functions.
Value
For names, NULL or a character vector of the same length as x. (NULL is given if the object has no names, including for objects of types which cannot have names.) For an environment, the length is the number of objects in the environment but the order of the names is arbitrary.
For names<-, the updated object. (Note that the value of names(x) <- value is that of the assignment, value, not the return value from the left-hand side.)
Note
For vectors, the names are one of the attributes with restrictions on the possible values. For pairlists, the names are the tags and converted to and from a character vector.
For a one-dimensional array the names attribute really is dimnames[[1]].
Formally classed aka “S4” objects typically have slotNames() (and no names()).
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
See Also
slotNames, dimnames.
Aliases
- names
- names.default
- names<-
- names<-.default
Examples
library(base)
# print the names attribute of the islands data set
names(islands)
# remove the names attribute
names(islands) <- NULL
islands
rm(islands) # remove the copy made

z <- list(a = 1, b = 'c', c = 1:3)
names(z)
# change just the name of the third element.
names(z)[3] <- 'c2'
z

z <- 1:3
names(z)
## assign just one name
names(z)[2] <- 'b'
z
Community examples
A better way to remove the `names` attribute is to use [`unname()`](https://www.rdocumentation.org/packages/base/topics/unname).

```{r}
unname(islands)
```

The main advantage of having names is that it gives you an easy-to-read way of subsetting.

```{r}
islands[c('South America', 'Southampton')]
```

Or, more fancily, you can use a regular expression to extract all islands with names beginning with `'A'`, for example.

```{r}
islands[grepl('^A', names(islands))]
```

Lists can also have names.

```{r}
(l <- list(a = 1, b = letters[1:5], c = list(d = 1:3)))
names(l) # only the top level element names, not 'd'
names(unlist(l)) # unlist gives a name for every element
```

You can overwrite all the names.

```{r}
(l <- list(a = 1, b = letters[1:5], c = list(d = 1:3)))
names(l) <- LETTERS[1:3]
l
```

… or just some of them.

```{r}
(l <- list(a = 1, b = letters[1:5], c = list(d = 1:3)))
names(l)[1:2] <- c('Alpha', 'Beta')
l
```

Setting names on an object, then returning that object can be done in a single step using [`setNames()`](https://www.rdocumentation.org/packages/stats/topics/setNames).

```{r}
(l <- list(a = 1, b = letters[1:5], c = list(d = 1:3)))
setNames(l, c('Alef', 'Bet', 'Gimel'))
```

If an object has no names, then the `names()` function returns `NULL`.

```{r}
v <- 1:3
names(v)
```

If an object has some names, then the names function returns a character vector with missing values where there are no names.

```{r}
v <- 1:3
names(v)[2] <- '2nd'
names(v)
v
```