README file for Quality Threshold Clustering
Theory: How Quality Threshold Clustering Works
In the QT clustering algorithm, each row or element is compared pairwise
with every other row or element, and a correlation coefficient is computed
for each pair. Elements are clustered together such that every element
within a cluster is correlated with a single central element at a level
above the input quality threshold. The algorithm first computes the
largest cluster that it can create using the input quality threshold.
It then removes those rows or columns from consideration and computes
the next largest cluster, and so on until all of the elements that can
be clustered together within the threshold limits are clustered.
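The procedure described above can be sketched in a few lines of Python. This is only an illustration of the description, not the QT software's own code. The function name qt_cluster, the use of Pearson correlation through NumPy, and the rule that a cluster needs at least two members are all assumptions made for the example.

```python
import numpy as np


def qt_cluster(data, threshold):
    """Greedy clustering of the rows of `data` (hypothetical helper).

    Returns a list of clusters, found largest first. Each cluster is a
    list of row indices: the central row first, then every remaining row
    whose correlation with the central row exceeds `threshold`.
    """
    corr = np.corrcoef(data)                 # pairwise Pearson correlations (assumed)
    remaining = set(range(data.shape[0]))    # rows not yet assigned to a cluster
    clusters = []
    while remaining:
        best = []
        # Try every remaining row as the central element and keep the
        # candidate cluster with the most members.
        for center in sorted(remaining):
            members = [center] + [
                j for j in sorted(remaining)
                if j != center and corr[center, j] > threshold
            ]
            if len(members) > len(best):
                best = members
        if len(best) < 2:                    # nothing left that can be clustered
            break
        clusters.append(best)
        remaining -= set(best)               # remove clustered rows from consideration
    return clusters
```

Calling qt_cluster(data, 0.5) on a genes-by-samples array clusters the rows; passing data.T clusters the columns instead. Rows with constant values give undefined correlations and are never clustered in this sketch, which is one reason prefiltering matters.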
The user must address two key inputs when using the QT clustering method: data prefiltering and the level of the quality threshold. There are many methods for prefiltering data to remove genes that are uninformative and unchanging, or that are unusually noisy and may contain outlier data. Software to automatically prefilter microarray data is currently under development and not available at this time (1/29/02).
To use the QT software, the user must input a properly formatted data file and the quality threshold (QT). The quality threshold is essentially a correlation coefficient cutoff, much like the relevance threshold used in relevance networks, and ranges from 0 to 1, where 0 represents no correlation and 1 represents perfect correlation. Empirical experience has indicated that, while the QT must be greater than zero, the QT necessary for valid clustering drops as the number of elements considered increases. Thus, for clustering of samples with thousands of genes being used for classification, a QT of 0.1-0.2 is often sufficient. However, if only tens or hundreds of elements are being used for classification, a QT of 0.3-0.7 is more successful. QT values greater than 0.7 often exclude true associations, and values of zero or below are not statistically significant.
A critically necessary improvement to the QT clustering software is an objective, automated method for choosing a statistically significant QT. One potential approach to computing a significant QT, to be implemented in Aim 1, is random permutation of the data as described in Butte et al. This approach is currently under development.
The advantages of QT clustering are that it is not sensitive to the order of the data, that all of the information in the dataset is considered, and that the number of clusters is not specified a priori.
Another advantage of this approach is that it can easily identify and rank (in order of decreasing correlation) the set of genes most highly correlated with a given phenomenon such as tumor type, cell cycle phase, or clinical outcome. This feature can be extremely useful for determining gene weightings for tumor classification using artificial neural networks (see Options: Custom class vectors).
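One way the random-permutation idea mentioned in this section might look in practice is sketched below. It is not the method of Butte et al. as published, nor anything implemented in the QT software. It is a minimal illustration that assumes Pearson correlation, shuffles each row independently to destroy real associations, and takes a high percentile of the resulting correlations as a candidate threshold. The function name and all parameters are hypothetical.

```python
import numpy as np


def permutation_threshold(data, n_permutations=100, percentile=99.0, seed=0):
    """Estimate a correlation cutoff from shuffled data (hypothetical helper).

    Each row of `data` is shuffled independently, so any correlation that
    remains between rows arises by chance. The returned value is the
    requested percentile of those chance correlations.
    """
    rng = np.random.default_rng(seed)
    upper = np.triu_indices(data.shape[0], k=1)   # each row pair once, no diagonal
    null_values = []
    for _ in range(n_permutations):
        shuffled = np.array([rng.permutation(row) for row in data])
        null_values.append(np.corrcoef(shuffled)[upper])
    return float(np.percentile(np.concatenate(null_values), percentile))
```

With few samples per row, chance correlations are large and the estimate will be correspondingly high; the percentile chosen here is arbitrary and would need to be justified for real data.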
Data Format: What the QT software expects
The QT software expects a plain-text tab-delimited gene expression
data file. The first row of the data set should contain the names
of the various experiments or samples. The first column should contain
the accession ID numbers for the genes. The second column should
contain gene names or descriptions. Spaces are OK in the gene names,
descriptions, and column names, but the accession IDs themselves must not
contain any spaces. The rest of the table should
contain the actual expression data. A partial sample data file is
shown below:
To view a larger sample data file for testing the QT software, click here, and then save the dataset as a text file to your desktop. Then submit that file to the QT software.
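To check a file against the layout described in this section before submitting it, a small reader like the following may help. It is a sketch, not part of the QT software. The function name, the two-row sample text, and the header labels used for the first two columns are invented for illustration.

```python
import csv
import io

# Invented example data in the described layout (not a real dataset).
SAMPLE = (
    "Accession\tDescription\tSample 1\tSample 2\tSample 3\n"
    "ACC0001\thypothetical gene A\t1.2\t0.8\t-0.3\n"
    "ACC0002\thypothetical gene B\t-0.5\t0.1\t2.4\n"
)


def read_expression_table(handle):
    """Parse a tab-delimited expression table (hypothetical helper).

    Returns (sample_names, accession_ids, descriptions, values), where
    `values` is a list of rows of floats.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    sample_names = header[2:]                 # first two columns hold ID and description
    ids, descriptions, values = [], [], []
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue                          # skip blank lines
        if len(row) != len(header):
            raise ValueError(f"line {line_number}: expected {len(header)} columns")
        if " " in row[0]:
            raise ValueError(f"line {line_number}: accession ID contains a space")
        ids.append(row[0])
        descriptions.append(row[1])
        values.append([float(x) for x in row[2:]])
    return sample_names, ids, descriptions, values


names, ids, descriptions, values = read_expression_table(io.StringIO(SAMPLE))
```

For a file on disk, pass an open file object instead of the StringIO wrapper, for example open("mydata.txt", newline="").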
If you choose to cluster rows, the QT software will cluster together those genes with the most highly correlated expression patterns that meet the quality threshold cutoff. You will then be presented with graphical output of the gene expression patterns for each cluster. If you choose to cluster columns, samples with similar expression profiles will be clustered together, but the output will be in text format only.
Options: Single Cluster vs. Cluster All
If you choose to calculate only a single cluster, you must then select
the gene or sample around which the software should form the cluster.
If you choose to cluster all the genes or samples, there is no need to
select a gene or sample around which to form clusters.
If you want to find the genes inversely correlated with the expression of a single gene, select yes for the inverse correlation option. The single cluster option must also be set to yes to find inverse correlation to a single gene.
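The single cluster and inverse correlation options can be pictured with the short sketch below. It is an illustration under assumptions, not the QT software's code: the function name is hypothetical, Pearson correlation is assumed, and inverse correlation is modeled simply by flipping the sign of each correlation before comparing it with the threshold.

```python
import numpy as np


def single_cluster(data, center, threshold, inverse=False):
    """Rows of `data` that cluster around row `center` (hypothetical helper).

    With inverse=False, returns rows whose correlation with the central row
    exceeds `threshold`. With inverse=True, returns rows whose correlation
    is below -threshold. Results are ordered strongest first.
    """
    corr = np.corrcoef(data)[center]          # correlation of every row with the center
    score = -corr if inverse else corr        # sign flip is an assumed model of "inverse"
    hits = [j for j in range(len(score)) if j != center and score[j] > threshold]
    return sorted(hits, key=lambda j: score[j], reverse=True)
```

For example, single_cluster(data, 0, 0.5, inverse=True) lists the rows most strongly anti-correlated with row 0.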
If you choose to enter your own custom class vector, this is essentially
like adding an extra artificial gene to the dataset and then forming a
single cluster around it. You must choose the Single Cluster option
for this approach to work. Enter the data as a series of numbers, one
for each sample, separated by commas. This approach is useful
if you wish to find genes with low expression levels in, for example, tumor
samples 1-4, but high expression in normal samples 5-8. To form such
a cluster, one might enter a custom class vector such as:
-1,-1,-1,-1,1,1,1,1
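The effect of a custom class vector can be sketched as follows: the comma-separated text is parsed into one number per sample, and each gene is then ranked by its correlation with that vector. This is an illustration only. The function name is hypothetical, Pearson correlation is assumed, and the three-gene dataset in the usage lines is invented.

```python
import numpy as np


def cluster_around_class_vector(data, class_vector_text, threshold):
    """Rank rows of `data` by correlation with a class vector (hypothetical helper).

    `class_vector_text` is a comma-separated string with one number per
    sample (column). Returns (row index, correlation) pairs for rows whose
    correlation exceeds `threshold`, in order of decreasing correlation.
    """
    vector = np.array([float(x) for x in class_vector_text.split(",")])
    if vector.size != data.shape[1]:
        raise ValueError("class vector needs one value per sample")
    # Correlate each gene (row) with the artificial "gene" defined by the vector.
    corr = np.array([np.corrcoef(row, vector)[0, 1] for row in data])
    order = np.argsort(-corr)                 # strongest positive correlation first
    return [(int(i), float(corr[i])) for i in order if corr[i] > threshold]


# Invented data: 3 genes measured in tumor samples 1-4 and normal samples 5-8.
genes = np.array([
    [1, 1, 2, 1, 5, 6, 5, 6],   # low in tumor, high in normal
    [5, 6, 5, 6, 1, 1, 2, 1],   # high in tumor, low in normal
    [1, 5, 1, 5, 1, 5, 1, 5],   # unrelated to sample class
], dtype=float)
ranked = cluster_around_class_vector(genes, "-1,-1,-1,-1,1,1,1,1", 0.5)
```

Reversing the signs of the class vector would instead select genes that are high in the tumor samples and low in the normal samples.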
For column clustering, the results are output in text format only. For clustering rows, or genes, the results are presented in both text and graphical format. Simply click on the link to view the results for any particular cluster. IF YOU ARE PERFORMING MULTIPLE ANALYSES AND USE THE SAME NAME REPEATEDLY, YOU MAY NEED TO CLICK THE RELOAD/REFRESH BUTTON OR POSSIBLY CLEAR THE CACHE FROM YOUR WEB BROWSER TO VIEW CHANGES IN FILES WITH THE SAME NAME. To empty the cache in Netscape, go to Edit > Preferences > Advanced > Cache and click on Clear Disk Cache Now. For Explorer, go to Edit > Preferences > Web Browser > Advanced and click on Empty Now in the Cache section.
QT Software version 1.0
Developed by Carlos S. Moreno