GHMM: General Hidden Markov Model library: HMMEd

The Graphical Query Language:
A GHMM-based tool for querying and clustering Gene-Expression time-course data

GQL is a suite of tools for analyzing time-course experiments. Currently, it is adapted to gene expression data. The two main tools are GQLQuery, for querying data sets, and GQLCluster, which computes groupings using several methods (model-based clustering with HMMs as cluster models, and estimation of a mixture of HMMs).

Availability

GQL is freely available under the GPL. The first public release of GQL is available in source form and in binary form for Windows or Mac.

Note that downloading and using GQL implies acceptance of the GPL. GQL is not freeware, nor public domain and the copyright will be enforced. As the GPL has strong consequences for any work derived from GQL, commercial entities can inquire about non-exclusive licenses.

If you use GQL in your research, please cite the papers listed under Documentation.

GQLQuery: Querying time-courses

The GUI has been ported to Python using Tkinter and the brand-new Python bindings for GHMM. It runs on all Linux/Unix boxes. Executable binaries for Mac and Windows are provided.

GQLCluster: Finding groups in time-courses

Linux & Unix Version

Prerequisites

You will need to install the following packages before you install GHMM. Version 1.0 of GQL only works with the most recent version of GHMM.

Tk (as in Tcl/Tk) in a version newer than 8.3.x. There are binary packages for most Linux distributions. You can find sources at www.tcl.tk/software/tcltk/. Tk provides the GUI toolkit.
Python version 2.6.x. This is the interpreter for Python, the language GQL is written in. Again, most Linux distributions have binary packages for Python. Make sure to select Tkinter support (Tkinter is Python's interface to the Tk framework). Sources etc. can be found at www.python.org. Note that GQL does not run with version 3.x of Python.
GSL, the GNU Scientific Library, is at www.gnu.org/software/gsl/gsl.html. It provides a vast number of mathematical functions, of which we use only a tiny fraction.
PyGSL allows us to call the GSL from Python. You can find the sources at pygsl.sourceforge.net.

Installation

Download GQL-Unix-1.0.zip.
Uncompress and untar the file you downloaded.
Set your PYTHONPATH to include the directory where ghmm.py has been installed (typically {prefix}/lib/python2.X/site-packages).
Note that your LD_LIBRARY_PATH must contain the directory where libghmm etc. reside.
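
As a rough aid for the last two steps, the following sketch checks whether a path-style environment variable lists a given directory. The helper names and the example directories are hypothetical and not part of GQL; PYTHONPATH is the variable name the Python interpreter itself reads, and the directories to check depend on the prefix you chose when installing GHMM.

```python
import os


def path_entries(value):
    """Split a PATH-style string into normalized directory names."""
    return [os.path.normpath(p) for p in value.split(os.pathsep) if p]


def contains_dir(value, directory):
    """Return True if `directory` is one of the entries in `value`."""
    return os.path.normpath(directory) in path_entries(value)


def check_environment(checks, environ=None):
    """Map each variable name to True/False: does it list its directory?

    `checks` maps a variable name to the directory it should contain.
    `environ` defaults to the environment of the running process.
    """
    environ = os.environ if environ is None else environ
    return dict(
        (var, contains_dir(environ.get(var, ""), directory))
        for var, directory in checks.items()
    )


if __name__ == "__main__":
    # Hypothetical install locations; replace with your own {prefix}.
    wanted = {
        "PYTHONPATH": "/usr/local/lib/python2.6/site-packages",
        "LD_LIBRARY_PATH": "/usr/local/lib",
    }
    for var, ok in sorted(check_environment(wanted).items()):
        print("%s: %s" % (var, "ok" if ok else "directory not listed"))
```

A variable reported as "directory not listed" is not necessarily a problem if the directory is already on the default search path of your system.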

Troubleshooting

Start the Python interpreter by typing 'python' in your shell. The following commands should produce no output (apart from the window opened by Tkinter.Tk()).
- import Tkinter
- t = Tkinter.Tk() # A window should pop-up
- import pygsl
- import ghmm
If you see error messages, check whether you used the correct Python version for installing (pygsl should be in site-packages in the lib/python2.X directory), and whether your PATH and your LD_LIBRARY_PATH variables include the directories which contain the files you are trying to use. Also, if you did not install stuff yourself, check the file permissions.
Messages in the console: GQL still outputs diagnostics and warnings. This is still a feature.
Why does GQL not have feature X? Feel free to implement it.
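
The import checks listed at the start of this section can also be run as a small script. This is a minimal sketch, not something shipped with GQL: the function name check_modules is made up for illustration, and the script only tests whether the modules import. It does not open a Tk window.

```python
from __future__ import print_function


def check_modules(names):
    """Try to import each module; return {name: None or an error string}."""
    results = {}
    for name in names:
        try:
            __import__(name)
            results[name] = None  # import succeeded
        except Exception as exc:  # ImportError, missing shared library, ...
            results[name] = "%s: %s" % (type(exc).__name__, exc)
    return results


if __name__ == "__main__":
    # Module names taken from the troubleshooting steps above.
    report = check_modules(["Tkinter", "pygsl", "ghmm"])
    for name, error in sorted(report.items()):
        print("%-8s %s" % (name, "ok" if error is None else error))
```

Run it with the same Python interpreter you intend to use for GQL, since a different interpreter may have a different site-packages directory.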

Mac OS

Prerequisites

You will need Mac OS 10.3 or a later version.

Installation

Download GQL-MAC-1.0.zip and GQLCluster-MAC-1.0.zip.
Uncompress each file you downloaded and double-click the result!

Windows

Prerequisites

You will need Windows 98 or a later version.

Installation

Download GQL-Win-1.0.zip.
Uncompress the file you downloaded and double-click GQLCluster or GQLQuery!

See readme.txt, available in the Unix/Linux version, for a non-binary installation on Windows.

Documentation

The papers below describe the methods implemented in GQL.

A. Schliep, A. Schönhuth, C. Steinhoff. Using Hidden Markov Models to Analyze Gene Expression Time Course Data. Proceedings of the ISMB 2003. Bioinformatics, Jul 2003; 19 Suppl 1: I255-I263.

A. Schliep, C. Steinhoff, A. Schönhuth. Robust inference of groups in gene expression time-courses using mixtures of HMM. Proceedings of the ISMB 2004. Bioinformatics, Aug 2004; 20 Suppl 1: I283-I289.

A. Schliep, I. G. Costa, C. Steinhoff, A. Schönhuth. Analysing gene expression time-courses. IEEE Transactions on Computational Biology and Bioinformatics, to appear.

I. G. Costa, A. Schönhuth, A. Schliep. The Graphical Query Language: a tool for analysis of gene expression time-courses. Bioinformatics, 2005; 21(10): 2544-2545.

File formats:

Both tools support the GHMM file formats for input data and model descriptions (see GHMM). They also read input files in the standard tab-separated format used by most gene expression analysis tools. In this format, each line represents a gene. The first column holds the gene identifier, the second column any type of annotation of the gene, and the remaining columns the measured time points. Missing values should be encoded either as 'NaN' or by leaving the position empty. Sample files of all formats are provided in examples.



YHR124W	 meiosis                                -0.377685	-0.427071	-0.479749	 0.175438
YGR072W	 mRNA decay, nonsense-mediated unknown  -0.067600	-0.664033	-0.412644	 0.090134
YGR145W	 unknown                       	         0.266238	-0.854138	-0.103595	 0.371387
YIR031C  allantoin utilization                  -0.017010	 0.650807	 0.461851	-0.146432
YJR010W	 methionine biosynthesis                      NaN	 0.847968	 0.078140	-0.137952
YMR172W	 osmotic stress response                -0.734039	-0.258823	-0.135069	 0.127290
YIR032C  ureidoglycolate hydrolase              -0.287924	 0.701009	 0.464117	-0.160077
YHR053C	 metallothionein                        -0.263116	 0.780098	-0.363840	-0.396216

Example of a gene expression file with 4 time points. The second column holds functional annotation of the genes.
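
To make the layout concrete, here is a minimal sketch of a reader for this tab-separated format. It is not GQL's own parser: the function name parse_expression_lines is hypothetical, and the sketch assumes that columns are separated by single tab characters and that a missing value is either an empty cell or some spelling of 'NaN'. The third sample row (GENE_X) is invented to show an empty cell.

```python
import math


def parse_expression_lines(lines):
    """Parse rows into (gene_id, annotation, values) tuples.

    Column 1 is the gene identifier, column 2 the annotation, and every
    further column one time point. Missing values become float('nan').
    """
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        gene_id = fields[0].strip()
        annotation = fields[1].strip() if len(fields) > 1 else ""
        values = []
        for cell in fields[2:]:
            cell = cell.strip()
            if cell == "" or cell.lower() == "nan":
                values.append(float("nan"))  # missing measurement
            else:
                values.append(float(cell))
        records.append((gene_id, annotation, values))
    return records


sample = [
    "YHR124W\tmeiosis\t-0.377685\t-0.427071\t-0.479749\t0.175438",
    "YJR010W\tmethionine biosynthesis\tNaN\t0.847968\t0.078140\t-0.137952",
    "GENE_X\thypothetical row\t0.1\t\t0.3\t0.4",  # empty cell = missing
]
records = parse_expression_lines(sample)
missing = sum(1 for _, _, vals in records for v in vals if math.isnan(v))
print("%d genes, %d missing values" % (len(records), missing))
```

To read a real file, pass an open file object instead of the sample list, for example parse_expression_lines(open(path)), where path is the location of your own data file.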

GQL also uses tab-separated files for partial labels. These files have only two columns: the first contains the gene id and the second a numerical label (from 1 to n).



YHR124W  1
YGR072W  1
YGR145W  1
YIR031C  2 
YJR010W	 2 
YMR172W	 2 
YIR032C  2
YHR053C	 2
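
A label file like the one above can be read with a few lines of Python. This is only an illustrative sketch, not GQL's own code: parse_partial_labels and group_by_label are hypothetical names, and splitting on any whitespace assumes that gene ids contain no spaces.

```python
def parse_partial_labels(lines):
    """Return {gene_id: label} from two-column label rows."""
    labels = {}
    for line in lines:
        fields = line.split()  # tolerate tabs, spaces and trailing blanks
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError("expected 2 columns, got %r" % (line,))
        label = int(fields[1])
        if label < 1:
            raise ValueError("labels run from 1 to n, got %d" % label)
        labels[fields[0]] = label
    return labels


def group_by_label(labels):
    """Invert {gene_id: label} into {label: sorted list of gene ids}."""
    groups = {}
    for gene_id, label in labels.items():
        groups.setdefault(label, []).append(gene_id)
    for members in groups.values():
        members.sort()
    return groups


sample = ["YHR124W\t1", "YGR072W  1", "YIR031C  2 ", "", "YJR010W\t 2 "]
labels = parse_partial_labels(sample)
groups = group_by_label(labels)
print(groups)
```

Genes that do not appear in the file simply have no entry in the dictionary, which matches the idea of partial labels.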

Release Notes

Version 1.0:
We have had this version in heavy use for the last few months. There are still some missing features and bugs:

Way too many log messages written to terminal
GO Specificity x Entropy Threshold not yet included in the GUI.
Priors on parameters needed to avoid overfitting (Just limit the number of Baum-Welch steps for the moment)
More options needed for estimation; update clustering view periodically (right now, do a one-step computation, view the result and estimate again)
k-Means still not integrated
