Prediction Analysis with Microarrays (PAM)
June 2008
Course "Analysis of microarray gene expression data"
Judith Boer and Lodewyk Wessels
PAM version used: Add-in for Excel 2.13

Classification by "nearest shrunken centroid"
The PAM software uses a modification of the nearest centroid method, which computes a standardized centroid for each class in the training set. This is the average gene expression for each gene in each class divided by the within-class standard deviation for that gene. Nearest centroid classification takes the gene expression profile of a new sample and compares it to each of these class centroids. The class whose centroid is closest, in squared distance, is the predicted class for that new sample.

Nearest shrunken centroid classification makes one important modification to the standard method. It "shrinks" each of the class centroids toward the overall centroid for all classes by an amount we call the threshold. This shrinkage consists of moving the standardized centroid towards zero by the threshold: the threshold is subtracted from positive values and added to negative values, and any value that would cross zero is set to zero. For example, if the threshold was 2.0, a centroid of 3.2 would be shrunk to 1.2, a centroid of -3.4 would be shrunk to -1.4, and a centroid of 1.2 would be shrunk to zero. The amount of shrinkage is determined by cross-validation.

After shrinking the centroids, the new sample is classified by the usual nearest centroid rule, but using the shrunken class centroids. The shrinkage has two advantages: (1) it can make the classifier more accurate by reducing the effect of noisy genes; (2) it performs automatic gene selection for genes that characterize the classes.

Example datasets
PAM comes with a couple of example data files, including cDNA microarray expression data of small round blue cell tumors (SRBCT) of childhood (Khan et al., 2001). This data set consists of Training Data and Test Data with normalized, log-transformed expression measurements on 2,308 genes (after filtering).
The tumors are classified as Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB), or rhabdomyosarcoma (RMS). A total of 63 training samples and 25 test samples are provided, although five of the latter were not SRBCTs. The aim is to correctly classify the test samples to one of the four tumor classes defined in the training set.

The second data set consists of RMA-normalized Affymetrix HG-U133A expression data of 33 muscle samples from 2 different groups: 19 Duchenne muscular dystrophy patients (DMD) and 14 controls. In principle, all Affymetrix probesets (22K) could be used; here we use a filtered set of 1663 to shorten the calculation time. This data set only contains a Training set.

Obtaining PAM (for installation on your own computer)
PAM can be freely downloaded from http://www-stat.stanford.edu/~tibs/PAM/. Please note that the Excel front-end is an addition to PAM for R. Therefore, new users must download both the PAM for R package and the PAM for Excel package.

Loading a dataset
Start Excel. You will see that the speed buttons have been extended with two items: PAM and PAM Controller. These will be used to start PAM and to steer the computations and the production of tables and graphs. Open khan.xls and start with the Khan Data worksheet containing the Training data.
- Select all data in the worksheet (click on cell A1, hold down the Ctrl and Shift keys and press the right arrow key twice, then the down arrow key twice).
- Now click on the button PAM. A form appears, on which you can specify details of the data. Do the following:
- Put 2 in the field labeled Class Labels in selection row. This says that in row 2 of the selected data the labels of the classes are written.
- Put 1 in the field labeled Sample Labels in selection row. This says that the labels of the arrays (which might be codes of the patients or biological samples) are found in row 1.
- Leave the field labeled Batch Labels in selection row empty. There are no batch indicators in this data set.
- Put 3 in the field labeled Expression Data starts in selection row. This says that the expressions are found in row 3 and further down.
- Click on the button OK. PAM will start initial computations. After some time another form will appear.
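The row numbers entered on the form are counted from the top of the selection, starting at 1. The following Python sketch illustrates how such a selection could be split into labels and expression values. It is only an illustration, not part of PAM: the function name `split_selection`, its parameters, the toy cell values, and the assumption that the first two selected columns hold gene identifiers and gene names are all made up for this example.

```python
def split_selection(rows, sample_row=1, class_row=2, expr_start_row=3, n_id_cols=2):
    """Split a selected block of cells using 1-based row numbers, as on the PAM form.

    rows: list of rows, each a list of cell values (strings).
    n_id_cols: number of leading gene-identifier columns (an assumption of this sketch).
    """
    sample_labels = rows[sample_row - 1][n_id_cols:]
    class_labels = rows[class_row - 1][n_id_cols:]
    data_rows = rows[expr_start_row - 1:]
    gene_ids = [row[0] for row in data_rows]
    expression = [[float(cell) for cell in row[n_id_cols:]] for row in data_rows]
    return sample_labels, class_labels, gene_ids, expression


# Made-up toy selection: 2 genes measured on 3 arrays.
rows = [
    ["", "", "s1", "s2", "s3"],            # row 1: sample labels
    ["", "", "EWS", "EWS", "BL"],          # row 2: class labels
    ["g1", "gene one", "0.5", "0.7", "-1.2"],  # row 3 onwards: expression data
    ["g2", "gene two", "-0.3", "0.1", "2.0"],
]

samples, classes, genes, expression = split_selection(rows)
print(samples)
print(classes)
print(genes)
print(expression)
```

With the values 1, 2 and 3 used above, the first selected row supplies the sample labels, the second the class labels, and everything from the third row down is treated as expression data.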
Controlling computations and output
The form named PAM Menu presents a number of buttons and a field to fill in (which becomes visible later on). Initially only a few buttons are active. You can click on Exit Menu to make the form disappear. You can always call it back by clicking the speed button PAM Controller.

First click Train. This is always the first step and produces a classifier without a threshold. You can judge its performance by clicking Plot Training Error. The plot will appear on the worksheet PAM Plots. It shows how the classification error for the training data (Y-axis) changes with the threshold (X-axis). As the threshold, which is the amount of shrinkage, is increased, genes are eliminated from the class prediction. The upper axis shows the number of active genes used in the classification.

Move plots to new worksheet
PAM puts all plots on one sheet. This is not optimal for studying them. Create a new worksheet via the Excel menu Insert > Worksheet and give it a logical name, such as "Plot Training Error". Click on a plot with the right mouse button. One option is Location, where you can specify that a plot should become a separate graph sheet. This does not work for the plot of the centroids, which is produced by R and copied onto the Excel worksheet. You can, however, copy it to the clipboard, paste it onto a new worksheet and resize it there. Generally the gene labels are hard to read on this plot, because they are too small or overlap each other, but they are in the same order as on the list of genes. It is also possible to copy a graph to the clipboard and paste it into a Word document.

Cross-validation and threshold selection
Cross-validation imitates how the classifier would work on new data. The available data are split into ten parts. Nine of them are used to train the classifier and the remaining one is used for “prediction”.
This is done ten times (each of the ten parts is used in turn for prediction) and the results are combined to compute the overall error. You start the computations by clicking Cross validate on the Cross Validation tab of the PAM Menu. The result can be presented as a graph by clicking Plot CV curves. Perform cross-validation and choose a threshold that minimizes the misclassification error. Enter this value in the field labeled Threshold and press Enter.

Results for a chosen threshold
The results can be obtained by clicking Plot CV Probabilities. The plot shows for each array (X-axis) the computed probability (Y-axis) that it belongs to each of the possible classes. Try two different thresholds to see the effect on the probabilities. Now go back to the Training tab of the PAM Menu. The shrunken centroids for each class, per gene, are plotted when you click on Plot Centroids. The horizontal units are log ratios of gene expression. A list of genes that “survive” the threshold can be obtained by clicking List Gene Scores. The list appears in a separate worksheet. You can also obtain a confusion matrix by clicking on the button Display Confusion Matrix. This matrix is a cross-table of actual class against predicted class. When all predictions are correct, you will find non-zero counts only on the diagonal. Do NOT click on the Estimate FDR button; this gives an error!

Test set prediction
The Khan dataset contains a second set of arrays on the worksheet named Khan Test Data. Take a look at this data set. We will try to predict the class of these new samples, using the centroids found in the training set after cross-validation. Go to the Test Set Prediction tab on the PAM Menu. Click on the Predict Test Set button. A new form will appear, with fields to enter the rows in the Test data sheet where Class, Sample, and Expression data are found. The entries are the same as in the training set.
Follow the instructions on the form to select the data columns (not the gene id columns) in the Test Data sheet. Start the prediction by clicking on OK. Plot Test Error will give you an error graph similar to the one for the training set. You will see that the chosen threshold is good for the test data as well. Plot Test Probabilities and Output Prediction Info will give you a visualization and a table, respectively, of the predicted classes and probabilities.

Questions
- How many samples are in the training and test sets, respectively? How many genes?
- What threshold did you choose for centroid shrinkage?
- How many genes contributed to the classification at this threshold?
- What were the prediction results on the training set? And on the test set?

Analysis of DMD dataset
Run PAM on the DMD data in DMD_PAM_filtered.xls (1663 probesets). This data set only contains a training set, so you cannot do the Test set prediction part.
- Are the groups well-predicted? Why do you think that is the case?
- Are the predictive genes mostly over- or underexpressed in the DMD group?
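For readers who would like to see the method outside Excel, the shrinkage rule, the nearest centroid rule and the cross-validated error described in this handout can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the PAM implementation: it omits refinements that PAM applies (for example class prior probabilities), it assigns samples to folds by position instead of at random, and the function names `soft_threshold`, `fit`, `predict` and `cv_error`, the small constant `eps` and the toy data are all made up for this example.

```python
from statistics import mean


def soft_threshold(value, threshold):
    """Move value toward zero by threshold; return 0.0 if it would cross zero."""
    if abs(value) <= threshold:
        return 0.0
    return value - threshold if value > 0 else value + threshold


def fit(X, y, threshold, eps=1e-8):
    """X: list of samples (lists of gene values); y: one class label per sample."""
    classes = sorted(set(y))
    n_genes = len(X[0])
    overall = [mean(row[g] for row in X) for g in range(n_genes)]
    means = {k: [mean(row[g] for row, lab in zip(X, y) if lab == k)
                 for g in range(n_genes)] for k in classes}
    # Pooled within-class standard deviation per gene (eps avoids division by zero).
    scale = []
    for g in range(n_genes):
        ss = sum((row[g] - means[lab][g]) ** 2 for row, lab in zip(X, y))
        scale.append((ss / (len(y) - len(classes))) ** 0.5 + eps)
    # Shrink each standardized difference from the overall centroid toward zero.
    centroids = {}
    for k in classes:
        centroids[k] = [
            overall[g] + scale[g] * soft_threshold((means[k][g] - overall[g]) / scale[g], threshold)
            for g in range(n_genes)
        ]
    return centroids, scale


def predict(model, sample):
    """Return the class whose shrunken centroid is nearest in standardized squared distance."""
    centroids, scale = model

    def dist(k):
        return sum(((x - c) / s) ** 2 for x, c, s in zip(sample, centroids[k], scale))

    return min(centroids, key=dist)


def cv_error(X, y, threshold, n_folds=10):
    """Fraction of held-out samples misclassified; sample i goes to fold i % n_folds."""
    wrong = 0
    for f in range(n_folds):
        train = [i for i in range(len(y)) if i % n_folds != f]
        test = [i for i in range(len(y)) if i % n_folds == f]
        model = fit([X[i] for i in train], [y[i] for i in train], threshold)
        wrong += sum(predict(model, X[i]) != y[i] for i in test)
    return wrong / len(y)


# The worked example from the text: threshold 2.0 applied to 3.2, -3.4 and 1.2.
print([soft_threshold(v, 2.0) for v in (3.2, -3.4, 1.2)])

# Made-up toy data: 6 samples x 3 genes; only the first gene separates the classes.
X = [[5.0, 1.0, 0.1], [5.2, 0.9, -0.1], [4.8, 1.1, 0.0],
     [1.0, 1.0, 0.1], [1.2, 1.1, -0.1], [0.8, 0.9, 0.0]]
y = ["A", "A", "A", "B", "B", "B"]

model = fit(X, y, threshold=2.0)
centroids, _ = model
# A gene is "active" if its shrunken centroids still differ between classes.
active = [g for g in range(3) if len({centroids[k][g] for k in centroids}) > 1]
print("active genes:", active)
print("prediction:", predict(model, [4.5, 1.0, 0.0]))
print("CV error:", cv_error(X, y, threshold=2.0, n_folds=3))
```

Genes whose standardized differences are shrunk to zero in every class have identical centroids across classes, so they contribute equally to every class distance and drop out of the prediction. This is the gene selection effect that the upper axis of the PAM error plots reports as the number of active genes.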
References
Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., and Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nat Med 7, 673-9.
Pescatori, M., Broccolini, A., Minetti, C., Bertini, E., Bruno, C., D'amico, A., Bernardini, C., Mirabella, M., Silvestri, G., Giglio, V., Modoni, A., Pedemonte, M., Tasca, G., Galluzzi, G., Mercuri, E., Tonali, P. A., Ricci, E. (2007). Gene expression profiling in the early phases of DMD: a constant molecular signature characterizes DMD muscle from early postnatal life throughout disease progression, FASEB J. 21(4), 1210-26.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proc Natl Acad Sci U S A 99, 6567-72.