My goal here is to collect a list of links to sites, projects, facilities, etc. dealing with microarray data analysis.
There seem to be quite a few different data formats used to store microarray data. Because the field is so young, each new software package tends to introduce its own proprietary data format. However, many packages can import data in a variety of formats. What follows is some information about the different data formats out there. Dr. Jacques van Helden's web site provided a starting point. The page lists some common and widely used microarray data formats.
MAML is an XML format used to mark up gene expression data from microarrays. The Microarray Gene Expression Database Group (MGED) replaced it with MAGE-ML (see below), which MGED developed after combining efforts with Rosetta in 2001.
This format was originally proposed by Rosetta as a data format for microarray and gene expression data. It was submitted to OMG in response to an RFP dealing with handling large-scale microarray experiments. Eventually Rosetta and MGED combined efforts in 2001 and developed MAGE-ML (XML) and MAGE-OM (UML Object Model).
MAGE-ML grew out of a combined effort by MGED and Rosetta. It is an XML format that describes both the information about a particular microarray experiment and the data that result from it. The goal is to give third parties enough information for the microarray data to be useful. MAGE-ML was automatically derived from the Microarray Gene Expression Object Model (MAGE-OM), a specification of the Object Management Group, Inc. (OMG).
This is more of a standard for microarray experiments than a data format. It dictates the minimal amount of information that must accompany the data to ensure that the results of the experiment(s) can be properly and accurately interpreted.
BioConductor is an open source and open development software initiative aimed at the Bioinformatics world. According to the BioConductor Web site, the main goals of the project are:
BioConductor is built upon the open source R Project for Statistical Computing. It seems to be principally aimed at biostatistics folks.
"CAGED is a program for the analysis of temporal profiles of gene expression data. The algorithms behind CAGED are based on Bayesian Clustering by Dynamics. This methodology will identify the most likely number of clusters based on the time series data. The algorithms/techniques are actually implemented in a CAGED v. 1.0 freely available to academic institutions. See this paper for more detailed information.
ArrayAssist can be used to analyze GeneChip® microarray data using clustering, statistical analysis and different visualization tools. It is fully compatible with Affymetrix. Statistical capabilities include support for the t-test, f-test and ANOVA. Analyses can be exported to Excel spreadsheets. Visualization capabilities include hierarchical and K-means clustering, Principal Component Analysis, scatter plots and 2D and 3D line graphs. A free evaluation copy is available for download.
Cluster performs various clustering analyses on microarray data. Currently, the list of clustering algorithms includes hierarchical clustering, self-organizing maps (SOMs), k-means clustering, and principal component analysis. This software is freely available, as is the source code.
TreeView allows the user to browse the results from Cluster with a graphical interface. It supports both tree-based and image-based browsing of hierarchical trees. TreeView is freely available from the link above.
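Since k-means turns up in nearly every package on this page, here is a minimal sketch of the textbook version (Lloyd's algorithm) in plain Python. It is not taken from Cluster's source code; the function name, parameters, and toy data are all made up for illustration.

```python
import random

def kmeans(profiles, k, iterations=20, seed=0):
    """Cluster expression profiles (lists of floats) into k groups."""
    rng = random.Random(seed)
    # Start from k randomly chosen profiles as the initial centroids.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    assignment = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(profiles):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical toy data: four genes measured under three conditions.
genes = [[0.1, 0.2, 0.1], [0.0, 0.3, 0.2], [2.1, 1.9, 2.2], [2.0, 2.2, 1.8]]
labels, centroids = kmeans(genes, k=2)
```

On this toy data the first two genes end up in one cluster and the last two in the other. Real packages add things this sketch leaves out, such as multiple random restarts and a choice of distance measures.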
See also, XCluster
GeneSight is a commercial product but an evaluation copy is available. It is a fairly comprehensive microarray data analysis package. It is compatible with the Affymetrix GeneChip® platform and also supports Universal data file import. Several data visualizations are incorporated such as 2D SOMs, K-Means clustering, Time Series Analysis, Hierarchical Clustering, and 2D and 3D Principal Component Analysis.
GeneSpring is a commercial product but has a trial version. Statistics-based tools include t-tests, 2-way ANOVA tests and 1-way post-hoc tests for identifying differentially expressed genes. Clustering abilities include hierarchical clustering, experiment trees (??), SOMs, k-means, PCA and QT clustering.
According to its SourceForge and web pages, Expression Profiler is an "open, extensible web-based collaborative platform for microarray gene expression, sequence and PPI analysis, exposing distinct chainable components for clustering, pattern discovery, statistics (thru R), machine-learning algorithms and visualization." This project is still under development.
It seems to be "one-stop shopping" for analysis of expression data from microarrays. It allows the user to upload their data in a few different formats and then to perform various analyses such as clustering using that data.
Cyber-T is a statistical analysis program based on the R system. The analyses consist of simple t-tests or regularized t-tests that use a Bayesian estimate of variance. It can also estimate experiment-wide false positives and negatives based on p-value distributions.
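To give a feel for what a "regularized" t-test does, here is a rough sketch in plain Python. It shrinks each gene's sample variance toward a background variance before computing a t-like statistic. This is only the general idea; the exact weighting Cyber-T uses is not reproduced here, and `background_var` and `prior_df` are names I made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def regularized_variance(values, background_var, prior_df):
    """Shrink a sample variance toward a background variance.

    prior_df is a hypothetical weight: how many pseudo-observations
    the background estimate is worth.
    """
    n = len(values)
    return (prior_df * background_var + (n - 1) * variance(values)) / (prior_df + n - 1)

def regularized_t(control, treated, background_var, prior_df=10):
    """Two-sample t-like statistic computed from the shrunken variances."""
    v_c = regularized_variance(control, background_var, prior_df)
    v_t = regularized_variance(treated, background_var, prior_df)
    se = sqrt(v_c / len(control) + v_t / len(treated))
    return (mean(treated) - mean(control)) / se

# Hypothetical gene with suspiciously tight replicates.
control = [1.0, 1.0, 1.01]
treated = [2.0, 2.0, 2.01]
t_plain = regularized_t(control, treated, background_var=0.25, prior_df=0)
t_shrunk = regularized_t(control, treated, background_var=0.25, prior_df=10)
```

With `prior_df=0` the statistic only sees the gene's own (tiny) variance and comes out very large. With shrinkage it is much more moderate, which is the point when there are only a handful of replicates per gene.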
dChip implements model-based expression analysis of oligonucleotide arrays. This model-based approach enables the user to do analysis at the probe level on multiple arrays. Doing this analysis across multiple arrays allows the user to determine the standard errors for the expression indexes. dChip also has some capabilities for comparative analysis and hierarchical clustering.
Cleaver is a web-based analysis tool that allows the user to upload their data. The analysis of the data happens server side. Cleaver provides the ability to do classification, clustering (K-means), and visualization (Principal Components Analysis). The site says that uploaded data is kept strictly confidential (one of the issues biologists have).
SNOMAD (Standardization and NOrmalization of MicroArray Data) is based on the R statistical language. It is a collection of algorithms for normalization and standardization of microarray data. Most of the work here is directed toward refining paired microarray data (???). This software doesn't seem to perform any analysis of microarray data such as clustering, etc.
The MAExplorer (MicroArray Explorer for Data Mining Gene Expression Patterns) project is also available from SourceForge. It is a "java-based data-mining facility for microarray databases". MAExplorer has four main use cases. First, it allows expression analysis of individual genes of interest. Second, MAExplorer allows the expression analysis of gene clusters and families of genes. Third, it provides the ability to compare expression patterns and outliers. Lastly, it provides an interface to other genome databases (network/Internet accessible). For analysis, it allows the user to produce scatter plots, histograms, and expression profile plots, and to run K-means clustering, K-median clustering, and hierarchical clustering. It also provides the ability to create MAEPlugins using the Open Java API.
"BRB ArrayTools is an integrated package for the visualization and statistical analysis of DNA microarray gene expression data." BRB ArrayTools uses Microsoft Excel as its front end, and input data is expected to be in a format that is compatible with Excel. The actual tools (analytic and visualization) were developed using R, C, Fortran, and Java. VBA is used as the glue.
GeneMaths XT is a commercial product. It allows the user to perform sophisticated data analysis on microarray data. It provides means to do unsupervised learning through the use of traversal clustering, hierarchical k-means partitioning, PCA, Discriminant Analysis, and SOMs. It also can perform supervised learning in the form of neural nets, support vector machines and k-nearest neighbor analysis. The documentation states that it provides special abilities to perform time-course experiments (???). It doesn't seem that there is a demo available. See this brochure for more information.
This is commercial software. DecisionSite has powerful abilities for data import. Sources of data can be text files, Excel spreadsheets, or Oracle databases. A query interface for Affy expression databases and other expression formats is supported. Analysis methods include hierarchical and K-means clustering, expression profile searches and PCA. A connector can be purchased that allows the user to integrate R scripts (???). Nice-looking interface from the screen shots.
MeV (MultiExperiment Viewer) is one component of a suite of tools from TIGR for complete analysis of microarray data. It can handle several types of data input such as Affymetrix .txt files as well as files that are the output of other components of the software suite. Algorithms implemented include bootstrapping, jackknifing, k-means, self-organizing trees, self-organizing maps, figures of merit, etc. MeV is freely available from TIGR.
GEPAS (Gene Expression Pattern Analysis Suite) is a suite of web-based analysis tools. The tools allow the user to upload data and perform normalization and pre-processing steps such as log transformation, replicate handling and missing value imputation. It supports hierarchical clustering and SOMs for data clustering.
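The pre-processing steps mentioned for GEPAS are simple enough to sketch in a few lines of plain Python. This is my own illustration, not GEPAS code: the function names are invented, and filling gaps with the row mean is just one assumed imputation strategy (the tool itself may well use something more sophisticated).

```python
from math import log2

def log_transform(row):
    """log2-transform positive intensities; keep None for missing values."""
    return [log2(x) if x is not None and x > 0 else None for x in row]

def merge_replicates(rows):
    """Average replicate rows for one gene, column by column, ignoring gaps."""
    merged = []
    for col in zip(*rows):
        observed = [x for x in col if x is not None]
        merged.append(sum(observed) / len(observed) if observed else None)
    return merged

def impute_row_mean(row):
    """Replace missing values with the mean of the observed values in the row."""
    observed = [x for x in row if x is not None]
    if not observed:
        return row
    fill = sum(observed) / len(observed)
    return [fill if x is None else x for x in row]

# Hypothetical gene spotted twice on the array, with one gap in each replicate.
replicates = [[8, None, 2], [8, 4, None]]
logged = [log_transform(r) for r in replicates]
gene = impute_row_mean(merge_replicates(logged))  # [3.0, 2.0, 1.0]
```

Here the two replicates fill each other's gaps, so the final imputation step has nothing left to do. With a gap in the same column of every replicate it would fall back to the row mean.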
Partek Discover is part of the Partek Pro suite of tools. It provides the ability to perform PCA, multidimensional scaling, correspondence analysis, cluster analysis (k-means, fuzzy C-means, divisive and agglomerative hierarchical clustering, etc.) and similarity matching. This is a commercial package.
Pathways 4 is based upon the JAVA architecture. It includes the ability to import data in various formats such as text, spreadsheet, GEML, and others. Included with Pathways 4 are k-means, hierarchical and SOM clustering techniques. The software also provides powerful visualization techniques such as scatter plots and bar charts. Numerical and statistical filters are implemented such as the t-test and ANOVA.
SilicoCyte is a commercial software product that provides "...research organizations with a comprehensive, integrated, data management and analysis tool that will help them realize the true potential of microarray data analysis." Data analysis features include time series analysis, hierarchical clustering, k-means clustering, as well as other algorithms. The package also contains features for image analysis and annotation.
GEDA (Gene Expression Data Analysis) is a web-based tool allowing the user to upload their data files and apply various pre-processing steps and analyses to them. Clustering (classification) algorithms included are k-means, naive Bayes, average linking, and maximum linking.
GeneCluster 2 builds on the analysis abilities of GeneCluster 1, which provided data filtering and SOM support. GeneCluster 2 implements weighted voting and k-nearest neighbors algorithms. The ability to perform batch SOM clustering was also added.
GenePattern is an advanced analysis tool for expression analysis. It is based on a client-server architecture. It has powerful preprocessing and analysis capabilities built in such as K-nearest neighbor classification, SOM clustering, hierarchical clustering, PCA, and others. GenePattern also has the unique ability to define analysis pipelines. This allows the user to chain different analysis modules together so that the same system of analysis can be applied to different data sets. It can be thought of as a macro that relieves the user from having to apply each step of an analysis each time it is performed. GenePattern also allows the user to add his/her own analysis techniques through the use of scripting, R, and other programming languages.
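The pipeline idea is easy to picture as function chaining. The sketch below is plain Python and has nothing to do with GenePattern's actual interface; `make_pipeline` and the two toy modules are hypothetical names I chose just to show how one chain of steps can be reused across data sets.

```python
from functools import reduce

def make_pipeline(*steps):
    """Chain steps so the output of each one is the input of the next."""
    def run(data):
        return reduce(lambda value, step: step(value), steps, data)
    return run

# Hypothetical modules; each takes and returns a list of expression rows.
def drop_flat_genes(rows, min_range=1.0):
    """Discard genes whose expression barely changes across conditions."""
    return [r for r in rows if max(r) - min(r) >= min_range]

def center_rows(rows):
    """Subtract each gene's mean so profiles are comparable in shape."""
    return [[x - sum(r) / len(r) for x in r] for r in rows]

pipeline = make_pipeline(drop_flat_genes, center_rows)

# The same pipeline applied to two different (made-up) data sets.
dataset_a = [[1.0, 1.2, 1.1], [0.0, 2.0, 4.0]]
dataset_b = [[5.0, 7.0, 9.0], [3.0, 3.1, 3.0]]
result_a = pipeline(dataset_a)
result_b = pipeline(dataset_b)
```

Both results contain a single centered profile, because each data set has one flat gene that the first step filters out. Defining the chain once and calling it per data set is the "macro" behaviour described above.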
"GeneX-Lite is a client application that provides an interface to a relational database management system." It allows the user of the client application to manage and analyze microarray data. Can't find particular information about what analysis and visualzation tools are built into the client application.
S+ Arrayanalyzer is a commercial product that allows statistical analysis of microarray data. In the way of analysis, S+ Arrayanalyzer can perform two sample and paired t-test, one way and two way ANOVA, Wilcoxon text, hierarchical clustering, k-means clustering, pam clustering, as well as others. It is integrated with S-PLUS which allows extensions to be written in the S programming language.
Acuity supports Oracle and Microsoft SQL Server as data repositories. Data can be in the form of Affymetrix CEL, CHP, CDF and DAT files. Analysis support includes ANOVA, PCA, Hierarchical clustering, SOMs, K-means, K-median, gap staticstic, gene shaving, and more.
AMIADA (Analyzing MIcroArray Data) is a expression analysis packaged with an MS EXCEL-like user interface. The software performs data transformations, PCA, as well as various clustering techniques (specifics not listed in the documentation).
ArrayStat is a statistically-based software package aimed at microarray data analysis. Performs measures such as standard error, p-values, confidence intervals and others. The software also has the ability to perform outlier detection and power analysis for false negatives.
Data input can come from multiple formats such as CEL/CDF datasets as well as the output of several different scanners. Analysis capabilities include various statistics (t-test, Mann-Whitney test, 2-way ANOVA, etc.), PCA, clustering (K-means, hierarchical, eigenvalue, SOM, random walk, and PCA, with distance functions such as Euclidean, squared Euclidean, Manhattan, etc.), and classification and prediction (support vector machines, decision trees, neural networks, and naive Bayesian). Avadis is a commercial product with 3 versions available.
Engene includes various filter and normalization capabilities and can handle missing data. Clustering/analysis algorithms include k-means, HAC, fuzzy c-means, kernel c-means, SOMs, PCA, and others.
BioSieve allows the user to import tab-delimited data and Affymetrix data formats. The software has support for multiple similarity measures such as Euclid, Chebyshev, StreetBlock, Minkowski and various Pearson measures. Clustering algorithms included in BioSieve are hierarchical, K-means, PCA, SOM, SVM, profile search, and Min Cover. Various visualizations are available based on these clusterings. A lite version is freely available for download.
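Several packages here advertise the same handful of distance/similarity measures, so a quick sketch of what they compute may help. This is plain Python written for illustration, not code from BioSieve or any other tool. I am assuming "StreetBlock" means the city-block (Manhattan) distance, and the Pearson variant shown (one minus the correlation) is only one of several in use.

```python
from math import sqrt

def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan (city block), p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    """Largest absolute difference over all conditions."""
    return max(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - Pearson correlation: perfectly correlated profiles are at distance 0."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / norm

# Two hypothetical expression profiles with the same shape but different scale.
gene_x = [1.0, 2.0, 3.0]
gene_y = [2.0, 4.0, 6.0]
euclid = minkowski(gene_x, gene_y, 2)
shape = pearson_distance(gene_x, gene_y)
```

The two profiles are far apart by Euclidean distance but at distance zero by the Pearson measure, which is why the choice of measure matters so much for clustering: one groups genes by magnitude, the other by the shape of the profile.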
GeneLinker comes in two versions - Gold and Platinum. Statistical analysis possibilities include f-test ANOVA and Kruskal-Wallis non-parametric ANOVA. Partitional clustering algorithms available include SOMs, K-means and Jarvis-Patrick. Hierarchical clustering and PCA are also implemented. Distance/similarity measures include Euclidean, Euclidean Squared, Manhattan, Pearson Correlation, Pearson Squared and Spearman Rank Correlation. GeneLinker can use MySQL, IBM DB2, or Oracle 9i as a backend. The Platinum version includes all the capabilities of the Gold version and additionally implements IBIS (Integrated Bayesian Inference System), SLAM (Sub-Linear Association Mining), Committee of Neural Networks, and Support Vector Machines (see web page for more detail on these advanced functions).
Genesis is implemented in JAVA and reads and writes tab-delimited flat files. Clustering of data can be performed with k-means, SOM, PCA, and correspondence analysis using various distance measures. One-way ANOVA is available for detecting differentially expressed genes. SVMs are also supported. Genesis looks to be freely available for download. A client/server version is also available.
J-Express is a JAVA-based software package. Clustering support includes PCA, SOM, MDS, etc.