README file for Quality Threshold Clustering
 
 
The Quality Threshold (QT) clustering algorithm was first described by Heyer et al. and was used to cluster gene expression patterns from renal carcinomas in Young et al.


Theory: How Quality Threshold Clustering Works
 
In the QT clustering algorithm, each row or element is compared pairwise with every other row or element, and a correlation coefficient is computed for each pair.  Elements are clustered such that every element in a cluster is correlated with a single central element more strongly than the input quality threshold.  The algorithm first finds the largest cluster it can form under that threshold.  It then removes those rows or columns from consideration and finds the next largest cluster, and so on until every element that can be clustered within the threshold limits has been clustered.
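The procedure described above can be sketched in a few lines of Python.  This is an illustration written for this README, not the code of the QT software itself.  The function names (pearson, qt_cluster), the use of the Pearson coefficient, the strict "greater than" comparison, and the rule that ties go to the first candidate center are all assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return sxy / sqrt(sxx * syy)

def qt_cluster(rows, threshold):
    """Greedy clustering around central elements (hypothetical sketch).

    rows maps an element ID to its list of expression values.
    Returns a list of (center_id, member_ids); the center is listed first,
    followed by the other members in order of decreasing correlation.
    """
    ids = list(rows)
    # Compute every pairwise correlation once.
    corr = {a: {b: pearson(rows[a], rows[b]) for b in ids} for a in ids}
    remaining = list(ids)
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for center in remaining:
            members = [e for e in remaining
                       if e != center and corr[center][e] > threshold]
            if len(members) > len(best_members):
                best_center, best_members = center, members
        if best_center is None:
            break  # no remaining element clusters within the threshold
        ordered = sorted(best_members, key=lambda e: -corr[best_center][e])
        cluster = [best_center] + ordered
        clusters.append((best_center, cluster))
        remaining = [e for e in remaining if e not in cluster]
    return clusters

# Made-up profiles: g1-g3 rise together, g4-g5 fall together, g6 oscillates.
rows = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [1, 2, 3, 5],
    "g4": [4, 3, 2, 1],
    "g5": [8, 6, 4, 2.5],
    "g6": [1, -1, 1, -1],
}
clusters = qt_cluster(rows, 0.9)
```

With these made-up profiles and a threshold of 0.9, the largest cluster forms around g1, the next around g4, and g6 is left unclustered because nothing correlates with it above the threshold.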

The user must address two key inputs when using the QT clustering method.  The first is data prefiltering and the second is the level of the quality threshold.  Many prefiltering methods exist to remove genes that are uninformative and unchanging, or that are unusually noisy and may contain outlier data.  Software to prefilter microarray data automatically is under development and is not available at this time (1/29/02).

To use the QT software, the user must input a properly formatted data file and the quality threshold.  The quality threshold is essentially a correlation coefficient cutoff, much like the relevance threshold used in relevance networks, and ranges from 0 to 1, where 0 represents no correlation and 1 represents perfect correlation.  Empirical experience has indicated that while the QT must be greater than zero, the QT needed for valid clustering drops as the number of elements considered increases.  Thus, when clustering samples with thousands of genes used for classification, a QT of 0.1-0.2 is often sufficient.  However, if only tens or hundreds of elements are used for classification, a QT of 0.3-0.7 is more successful.  QT values greater than 0.7 often exclude true associations, and values less than zero are not statistically significant.  A critically necessary improvement to the QT clustering software is an objective, automated method for choosing a statistically significant QT.  One potential approach, to be implemented in Aim 1, is random permutation of the data as described in Butte et al.  This approach is currently under development.
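The general idea of a permutation-based threshold can be sketched as follows.  This is not the procedure of Butte et al. and not part of the QT software; it is only one plausible reading of "random permutation of the data".  Each row is shuffled independently to break any real association, and a high quantile of the resulting pairwise correlations is taken as the cutoff.  The function name permutation_threshold and the parameters n_permutations, quantile, and seed are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return sxy / sqrt(sxx * syy)

def permutation_threshold(rows, n_permutations=20, quantile=0.99, seed=0):
    """Estimate a QT from correlations seen in shuffled data (sketch)."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    null = []
    for _ in range(n_permutations):
        # Shuffle the values within each row independently.
        shuffled = [rng.sample(r, len(r)) for r in rows]
        for i in range(len(shuffled)):
            for j in range(i + 1, len(shuffled)):
                null.append(pearson(shuffled[i], shuffled[j]))
    null.sort()
    index = min(len(null) - 1, int(quantile * len(null)))
    # The QT is defined on 0..1, so never return a negative cutoff.
    return max(0.0, null[index])

# Made-up data: five rows measured over six samples.
example_rows = [
    [0.1, 0.4, -0.2, 1.3, 0.8, -0.5],
    [1.0, 0.2, 0.3, -0.7, 0.9, 0.1],
    [-0.3, 0.6, 1.1, 0.2, -0.9, 0.4],
    [0.5, -1.2, 0.7, 0.3, 0.0, 0.8],
    [0.9, 0.1, -0.4, 0.6, -0.1, 1.2],
]
estimated_qt = permutation_threshold(example_rows)
```

With only a handful of samples, chance correlations are large, so a cutoff estimated this way will be high.  How such a cutoff behaves on real datasets is not something this sketch can show.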

The advantages of QT clustering are that it is not sensitive to the order of the data, all of the information in the dataset is considered, and the number of clusters is not specified a priori.   Another advantage of this approach is that it can easily identify and rank (in order of decreasing correlation) the set of genes most highly correlated with a given phenomenon such as tumor type, cell cycle phase, or clinical outcome.  This feature can be extremely useful for determining gene weightings for tumor classification using artificial neural networks (see Options: Custom class vectors). 


Data Format: What the QT software expects
 
The QT software expects a plain-text, tab-delimited gene expression data file.  The first row of the data set should contain the names of the experiments or samples.  The first column should contain the accession ID numbers for the genes.  The second column should contain gene names or descriptions.  Spaces are OK in the gene names/descriptions and in the column names, but the accession IDs themselves must not contain any spaces.  The rest of the table should contain the actual expression data.  A partial sample data file is shown below:
 
CloneID  NAME   Conventional 1   Conventional 2   Conventional 3   Conventional 4
1518328 "integrin, alpha 6 {Incyte PD:1518328}"  1  0.137503524  -0.485426827 -0.137503524
2515389  apolipoprotein E {Incyte PD:2515389}  0.678071905   -1.432959407  -1.584962501  -1.137503524
629769    "calbindin 1, (28kD) {Incyte PD:629769}" -0.485426827 -2.070389328  -1.201633861 -2
2820985   dopa decarboxylase (aromatic L-amino acid decarboxylase) {Incyte PD:2820985}  0.263034406    -1 -0.263034406 -0.765534746 
Sample Data File
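
For readers who want to check a file before submitting it, the layout described above can be read with Python's standard csv module.  This is a minimal sketch, not the parser used by the QT software; the function name read_expression_table is hypothetical, and the example text is an abbreviated copy of the sample rows with explicit tab characters.

```python
import csv
import io

def read_expression_table(handle):
    """Read a tab-delimited expression table (hypothetical helper).

    Returns (samples, ids, names, data):
      samples  column names taken from the first row, after the two ID columns
      ids      accession IDs from the first column
      names    gene names or descriptions from the second column
      data     one list of floats per gene
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]
    ids, names, data = [], [], []
    for record in reader:
        if not record:
            continue  # skip blank lines
        ids.append(record[0].strip())
        names.append(record[1].strip())
        data.append([float(value) for value in record[2:]])
    return samples, ids, names, data

# Abbreviated copy of the sample data, with tabs written as \t.
example = (
    "CloneID\tNAME\tConventional 1\tConventional 2\n"
    '1518328\t"integrin, alpha 6 {Incyte PD:1518328}"\t1\t0.137503524\n'
    "2515389\tapolipoprotein E {Incyte PD:2515389}\t0.678071905\t-1.432959407\n"
)
samples, ids, names, data = read_expression_table(io.StringIO(example))
```

The csv module strips the double quotes that some spreadsheet programs place around descriptions containing commas, so quoted and unquoted descriptions are read the same way.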
 
A larger sample data file is available for experimenting with and testing the QT software.  Save the dataset as a text file to your desktop, then submit that file to the QT software.

Options: Columns vs. Rows
 
If you choose to cluster rows, the QT software will cluster together those genes with the most highly correlated expression patterns that meet the quality threshold cutoff.  You will then be presented with graphical output of the gene expression patterns for each cluster.  If you choose to cluster columns, samples with similar expression profiles will be clustered together, but the output will be in text format only.
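
One way to think about the two choices is that clustering columns is the same operation as clustering rows of the transposed table.  Whether the QT software does it this way internally is an assumption; the helper name transpose is hypothetical.

```python
def transpose(table):
    """Swap rows and columns of a rectangular list-of-lists table."""
    return [list(column) for column in zip(*table)]

# Two genes (rows) measured in three samples (columns).
by_gene = [
    [1.0, 0.14, -0.49],
    [0.68, -1.43, -1.58],
]
# Three samples (rows), each holding its value for the two genes.
by_sample = transpose(by_gene)
```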

Options: Single Cluster vs. Cluster All
 
If you choose to calculate only a single cluster, you must then select the gene or sample around which the software should form the cluster.  If you choose to cluster all the genes or samples, there is no need to select a gene or sample around which to form clusters.

Options: Inverse Correlation

If you want to find the genes inversely correlated with the expression of a single gene, select yes for the inverse correlation option.  The single cluster option must also be set to yes to find genes inversely correlated with a single gene.
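
The single cluster and inverse correlation options can be illustrated together.  The sketch below assumes that "inversely correlated" means a correlation coefficient below the negative of the quality threshold; that reading, and the function name single_cluster, are assumptions rather than a description of the QT software's code.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return sxy / sqrt(sxx * syy)

def single_cluster(rows, center_id, threshold, inverse=False):
    """Return the IDs that cluster around center_id, strongest first.

    With inverse=True the sign of each correlation is flipped, so the
    result holds the genes most strongly anti-correlated with the center.
    """
    sign = -1.0 if inverse else 1.0
    center = rows[center_id]
    scored = []
    for gene_id, values in rows.items():
        if gene_id == center_id:
            continue
        score = sign * pearson(center, values)
        if score > threshold:
            scored.append((score, gene_id))
    scored.sort(key=lambda pair: -pair[0])
    return [gene_id for _, gene_id in scored]

# Made-up profiles: g2 and g3 track g1, g4 and g5 move opposite to it.
rows = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [1, 2, 3, 5],
    "g4": [4, 3, 2, 1],
    "g5": [8, 6, 4, 2.5],
}
direct = single_cluster(rows, "g1", 0.9)
inverse = single_cluster(rows, "g1", 0.9, inverse=True)
```

Here direct holds g2 and g3, while inverse holds g4 and g5.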


Options: Custom class vectors
 
 
If you choose to enter your own custom class vector, this is essentially like adding an extra artificial gene to the dataset and then forming a single cluster around it.  You must choose the Single Cluster option for this approach to work.  Data should be entered as a series of numbers, one for each sample, separated by commas.  This approach is useful if you wish to find genes with low expression levels in, for example, tumor samples 1-4, but high expression in normal samples 5-8.  To form such a cluster, one might enter a custom class vector such as:

-1,-1,-1,-1,1,1,1,1
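
The effect of a class vector like this one can be sketched as ranking genes by their correlation with the vector.  The function names parse_class_vector and rank_by_class_vector and the made-up expression values are assumptions for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return sxy / sqrt(sxx * syy)

def parse_class_vector(text):
    """Turn '-1,-1,1,1' into [-1.0, -1.0, 1.0, 1.0]."""
    return [float(value) for value in text.split(",")]

def rank_by_class_vector(rows, class_vector, threshold):
    """Return (gene_id, correlation) pairs above threshold, best first."""
    ranked = []
    for gene_id, values in rows.items():
        if len(values) != len(class_vector):
            raise ValueError("class vector needs one value per sample")
        r = pearson(class_vector, values)
        if r > threshold:
            ranked.append((gene_id, r))
    ranked.sort(key=lambda item: -item[1])
    return ranked

# Made-up data: samples 1-4 are tumors, samples 5-8 are normal tissue.
rows = {
    "geneA": [0.1, 0.2, 0.1, 0.2, 2.0, 2.1, 2.0, 2.1],  # low in tumor
    "geneB": [2.0, 2.1, 2.0, 2.2, 0.1, 0.2, 0.1, 0.1],  # high in tumor
    "geneC": [1, -1, 1, -1, 1, -1, 1, -1],              # unrelated to class
}
class_vector = parse_class_vector("-1,-1,-1,-1,1,1,1,1")
ranked = rank_by_class_vector(rows, class_vector, 0.5)
```

Only geneA passes the threshold here.  geneB is strongly anti-correlated with the vector and would be found by reversing the signs of the class vector instead.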


Results: Viewing and Saving
 
 
For column clustering, the results are output in text format only.  For clustering rows, or genes, the results are presented both in text format and in graphical format.  Simply click on the link to view the results for any particular cluster.  IF YOU ARE PERFORMING MULTIPLE ANALYSES AND USE THE SAME NAME REPEATEDLY, YOU MAY NEED TO CLICK THE RELOAD/REFRESH BUTTON OR POSSIBLY CLEAR THE CACHE FROM YOUR WEB BROWSER TO VIEW CHANGES IN FILES WITH THE SAME NAME.  To empty the cache in Netscape, go to Edit > Preferences > Advanced > Cache and click on Clear Disk Cache Now.  For Explorer, go to Edit > Preferences > Web Browser > Advanced and click on Empty Now in the Cache section.

 
 
QT Software version 1.0
Developed by

Carlos S. Moreno
Assistant Professor
Department of Pathology & Laboratory Medicine
Emory University School of Medicine
cmoreno@emory.edu