Star Republic: Guide for Biologists

Data analysis --- filtration

Edited by Chang Zhu
At the current state of the art, microarray data are very noisy. The noise needs to be filtered out before any meaningful statistical analysis can be done.

Filtering criteria usually involve comparing data across arrays, and the data must be normalized before such comparisons are meaningful. Frequently, the data are first normalized in the presence of excess noise, then filtered to reduce the noise, and then normalized a second time at the reduced noise level. MicroHelper provides a range of filtering options.

Criteria used for filtering

1. "Ceiling" and "flooring": These limit the impact of outliers (data points whose values are either too small or too large). For example, Affymetrix data may contain negative values, and users might want to change them to a small positive value. For arrays using two channels, the ratio can be very large or very small if one of the channels is missing a signal. Some researchers set a limit on the ratio; for example, the fold change in either the up or down direction is capped at 50 or 100.
2. Overall Variation Filtering: Variation is compared to the mean over all features (spots or genes, one per row) and all arrays. Note that the fold change should be given in "unlog" scale even if your data are in log scale.
3. Row Variation Filtering: Variation is compared to the ROW mean over all arrays, i.e., for each feature (row) a mean is calculated and variation is compared to that row mean. The fold change should be given in "unlog" scale even if your data are in log scale.
4. Missing Data Filtering: Filters out rows whose percentage of missing data is greater than a user-selected threshold. You may also fill in missing data with estimates by using the missing data fill-in tool explained later.

These options can be selected individually or in combination. When combined, they are executed in order from (1) to (4).
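
Criteria (2) and (3) differ only in the reference value each data point is compared against. The following is a minimal sketch in Python with NumPy, not MicroHelper's actual implementation. It assumes the data are positive, unlogged values in a features-by-arrays matrix, that missing values are stored as NaN, and that the "mean" is computed in log scale (a geometric mean), which the text does not specify. The function names are hypothetical.

```python
import numpy as np

def fold_vs_overall_mean(data):
    """Fold change of every value relative to the mean over all features and arrays."""
    log_data = np.log2(data)
    # One reference value for the whole matrix (assumption: mean taken in log scale).
    return 2.0 ** (log_data - np.nanmean(log_data))

def fold_vs_row_mean(data):
    """Fold change of every value relative to the mean of its own row (feature)."""
    log_data = np.log2(data)
    # One reference value per row; keepdims lets it broadcast across the arrays.
    row_mean = np.nanmean(log_data, axis=1, keepdims=True)
    return 2.0 ** (log_data - row_mean)

# Two features measured on two arrays, in unlog scale.
data = np.array([[1.0, 4.0],
                 [16.0, 64.0]])
print(fold_vs_overall_mean(data))
print(fold_vs_row_mean(data))
```

In this small example both rows vary 2-fold around their own row mean, but relative to the overall mean the same values range from 1/8 to 8, which shows why the two criteria can select different features.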

A typical filtering process looks like the following (here we assume cDNA arrays):

1. If the Green/Red ratio is greater than 50, it is set to 50. If the ratio is smaller than 1/50, it is set to 1/50.
2. At least 20% of samples must show at least a 4-fold change, up or down. Otherwise the "gene" is considered to have stationary expression and is removed from the data set.
This filter is powerful, since it removes many large chance fluctuations in the data. The percentage must be smaller than the fraction of any known or expected subtype. For example, if you expect 5 of a total of 50 samples to belong to a cancer subtype, the percentage must be set below 10. Otherwise, genes that are high or low in the subtype compared to the rest of the samples will also be filtered out.
3. A "gene" cannot have more than 10% missing data.
Missing data frequently signal problems with the clone used for printing (e.g., low DNA, too dirty, etc.); if possible, keep missing data to a minimum.
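
The three steps above can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions, not MicroHelper's code: the input is a genes-by-samples matrix of unlogged Green/Red ratios, missing values are stored as NaN, and the fold change in step 2 is measured against a ratio of 1 (the text does not say which reference is used). The function name and its parameters are hypothetical.

```python
import numpy as np

def filter_ratios(ratios, cap=50.0, min_fold=4.0, min_fraction=0.20, max_missing=0.10):
    """Return (filtered_ratios, keep_mask) for a genes-by-samples ratio matrix."""
    # Step 1: ceiling and flooring. NaN (missing) values stay NaN.
    clipped = np.clip(ratios, 1.0 / cap, cap)

    # Step 2: fraction of samples with at least min_fold change, up or down.
    # Assumption: fold change is measured against a ratio of 1.
    changed = np.abs(np.log2(clipped)) >= np.log2(min_fold)  # NaN compares as False
    frac_changed = changed.mean(axis=1)

    # Step 3: fraction of missing values per gene.
    frac_missing = np.isnan(clipped).mean(axis=1)

    keep = (frac_changed >= min_fraction) & (frac_missing <= max_missing)
    return clipped[keep], keep

# Four hypothetical genes measured on ten samples.
ratios = np.ones((4, 10))
ratios[1, :2] = 8.0        # 20% of samples changed 8-fold
ratios[2, 0] = 200.0       # extreme ratio, capped at 50
ratios[2, 1] = 0.001       # extreme ratio, floored at 1/50
ratios[3, :3] = 8.0        # enough change, but...
ratios[3, 3:5] = np.nan    # ...20% of its values are missing

filtered, keep = filter_ratios(ratios)
print(keep)
```

Here the first gene is dropped as stationary and the last is dropped for missing data, while the two middle genes are kept. For the subtype example above (5 samples out of 50), `min_fraction` would have to be set below 0.10.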