Data analysis --- normalization

Edited by Chang Zhu
1. Normalization scales data so that different arrays can be compared.

2. Usually it requires that the arrays be identical (e.g., cDNA arrays with the same clones printed). Although possible, it is difficult to normalize data across arrays of different types and/or with different gene lists.

3. The two frequently used normalization options are:

  • A. Unit Column Mean --- Suggested for data on the unlogged scale. The data for each array (column) are adjusted so that the column mean is 1. This is suggested for two-channel ratio data and is also applicable to Affymetrix data.

  • B. Zero Column Mean --- Suggested for data (cDNA or Affymetrix) that have been transformed to log scale. The data for each array (column) are adjusted so that the column mean is 0. Note that for ratio data, log scale is the "natural" scale, since "up" and "down" changes are symmetric in log scale.
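The two options can be sketched in NumPy (a minimal illustration; the matrix values are made up and the code is not part of MicroHelper):

```python
import numpy as np

# Hypothetical expression matrix: rows = genes, columns = arrays.
data = np.array([[1.2, 0.8, 2.0],
                 [0.5, 1.5, 1.0],
                 [2.3, 0.7, 3.0]])

# A. Unit Column Mean: divide each column by its mean so every
# column mean becomes 1 (for unlogged ratio or Affymetrix data).
unit_mean = data / data.mean(axis=0)

# B. Zero Column Mean: subtract each column's mean so every
# column mean becomes 0 (for log-transformed data).
log_data = np.log2(data)
zero_mean = log_data - log_data.mean(axis=0)
```

Both adjustments act per column (per array), leaving the relative pattern of genes within an array unchanged.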

4. Normalization relies on genes whose expression does not change to align different arrays. Filtration, however, selects for changes across arrays. These goals conflict yet are intertwined, so care should be taken in the order and extent of filtration and normalization.

5. Users are cautioned not to re-normalize data after extensive filtering. For example, suppose a dataset contains 20 normal and 20 tumor samples, and the filtering condition requires at least 15 samples to show a minimum 2-fold change relative to the row mean. The majority of genes in the resulting dataset will then have either higher or lower expression in normal samples than in tumor samples, and renormalization will erase these differences.
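The effect can be demonstrated on a toy log2-scale dataset (invented for illustration; not from a real experiment): after such filtering, subtracting each column's mean forces every array back to mean 0 and wipes out the normal-vs-tumor difference the filter selected for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log2-scale data: 100 genes, 20 normal + 20 tumor samples;
# the first 50 genes are shifted up by 2 log2 units in tumor samples.
normal = rng.normal(0.0, 0.5, size=(100, 20))
tumor = rng.normal(0.0, 0.5, size=(100, 20))
tumor[:50] += 2.0
data = np.hstack([normal, tumor])

# Filter: keep genes where at least 15 samples differ >= 2-fold
# (|log2 difference| >= 1) from the row mean.
dev = np.abs(data - data.mean(axis=1, keepdims=True))
filtered = data[(dev >= 1).sum(axis=1) >= 15]

# Group difference survives filtering...
diff_before = filtered[:, 20:].mean() - filtered[:, :20].mean()

# ...but zero-column-mean renormalization erases it: every column
# (array) is forced back to mean 0, tumor and normal alike.
renorm = filtered - filtered.mean(axis=0)
diff_after = renorm[:, 20:].mean() - renorm[:, :20].mean()
```

Here `diff_before` is large (the filter enriched for the shifted genes), while `diff_after` is zero by construction, illustrating the caution above.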

6. The above normalization options are provided in MicroHelper.

7. For more sophisticated normalization options, consult a statistician.