Description: Non-negative Matrix Factorization Consensus Clustering
Authors: Pablo Tamayo (Broad Institute) with contributions from Jean-Philippe Brunet(Broad Institute), Kenneth Yoshimoto (San Diego Supercomputing Center) and Ted Liefeld (University of California San Diego). Parallel NMF Implementation from https://github.com/bioinfo-cnb/bionmf-gpu
Contact: Geneattern Help Forum
Non-negative matrix factorization (NMF) is an unsupervised learning algorithm [1] that has been shown to identify molecular patterns when applied to gene expression data [2]. Rather than separating gene clusters based on distance computation, NMF detects contextdependent patterns of gene expression in complex biological systems.
The basic principle of dimensionality reduction via matrix factorization operates as follows: given an N x M data matrix A with non-negative entries, the NMF algorithm iteratively computes an approximation, A ~ WH, where W is an N x k matrix, H is a k x M matrix, and both are constrained to have positive entries. For DNA microarrays, N, the number of genes, is typically in the thousands. M, the number of experiments, rarely exceeds a hundred, while k, the number of classes to be determined depends on the heterogeneity of the dataset. The algorithm starts with randomly initialized matrices of the appropriate size, W and H. These are iteratively updated to minimize the Euclidean distance between V and WH or a divergence norm [3]. The program also computes row and column factor memberships according to maximum amplitudes. This membership information is also used to sort the output matrices according the row and column membership (the row and columns are then relabeled: <name>_f<NMF factor>.
Name | Description | Default Value |
---|---|---|
dataset.filename * | Input dataset (gct) | |
k.initial * | Initial value of K. | 2 |
k.final * | Final value of K. | 5 |
num.clusterings * | Number of clusterings to perform for each value of K. | 20 |
max.num.iterations * | The maximum number of iterations to perform for each clustering run for each value of K. This number may not be reached depending on the stability of the clustering solution and the settings of stop convergence and stop frequency. | 2000 |
random.seed * | Random seed used to initialize W and H matrices by the random number generator. e.g. 4585, 4567, 5980. This may be set to provide repeatable results for given parameter inputs even though the algorithm is properly random. | 123456789 |
output.file.prefix * | Prefix to prepend to all output file names. | |
stop.convergence * | How many “no change” checks are needed to stop NMF iterations before max iterations is reached (convergence). Iterations will stop after this many “no change” checks report no changes. | 40 |
stop.frequency * | Frequency of “no change” checks. NMFConsensus will check for changes every ‘stop frequency’ iterations. | 10 |
* required
Input:
BRCA_DESeq2_normalized_20783x40.gct
Output:
This version only runs on the San Diego Super Computer Expanse cluster.
NMFClustering
is distributed under a modified BSD license available at https://github.com/genepattern/NMFClustering/blob/develop/LICENSE
Version | Release Date | Description |
---|---|---|
1 | December 7, 2021 | Initial version for team use. |