FROM DATA TO KNOWLEDGE (FDK)

RESEARCH UNIT

NATIONAL CENTRE OF EXCELLENCE 2002-2007 ACADEMY OF

BIENNIAL REPORT

2002 - 2003

Department of Computer Science Laboratory of Computer and Information Science Helsinki University of Technology

Table of contents

Preface
1. Progress of research work
1.1. Data mining and machine learning
1.2. Computational methods in medical genetics and systems biology
1.3. Combinatorial pattern matching and information retrieval
1.4. Computational Structural Biology
1.5. PhD Theses
2. Changes in research strategy and research plans
3. Personnel
3.1. Summary
3.2. Prizes and scientific honours received by researchers of the unit in 2002 - 2003
3.3. International positions of trust held by researchers of the unit in 2002 - 2003
3.4. Domestic positions of trust held by researchers of the unit in 2002 - 2003
3.5. Mobility of researchers
4. Publications and other outcomes
4.1. Articles in international scientific journals with referee practice
4.2. Articles in international edited works and conference proceedings with referee practice
4.3. Articles in Finnish scientific journals with referee practice
4.4. Articles in Finnish edited works and conference proceedings with referee practice
4.5. Scientific monographs published abroad
4.6. Other scientific publications
4.7. Patents
4.8. Computer programs (and algorithms)
4.10. Lectures and visiting lectures
4.11. Radio and television programmes and articles popularising science
4.12. Other outcomes: international conferences
4.13. Degrees
5. Funding of the center 2002-2003
APPENDIX: List of personnel 2003

Preface

The From Data to Knowledge (FDK) research unit was granted the status of a national Center-of-Excellence from January 1, 2002. The activities of the unit have, however, a much longer history, dating back at least to the early 1990s, when some key researchers of the unit started to collaborate. The vision has from the beginning been to build on our core competence in the algorithmics of combinatorial pattern matching and data mining, and to apply it to novel problems in data analysis. With the prestigious new status our development and expansion have been even stronger than expected. The research activity as well as the size of the personnel has grown rapidly. The unit had about 40 members at the beginning of 2002, while the current number is approaching 60. The establishment of the new Basic Research Unit of the Helsinki Institute of Information Technology (HIIT/BRU) in 2002 has also made our environment much stronger and more attractive. The HIIT/BRU, located in the same building as the FDK, is directed by Professor Heikki Mannila, who is also a member of our unit. In 2004, Sami Kaski will start in the new data analysis professor position at our host department. We will also all move to the new Exactum Building at Kumpula Campus during 2004. I expect these events to further strengthen the FDK, too. The present report summarizes the results and new plans of the unit from its first two years of activity.

Helsinki, February 15th 2004.

Esko Ukkonen

Contact information:

Prof. Esko Ukkonen
Department of Computer Science
P.O.Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki, Finland
Tel. +358 9 191 44172
Fax +358 9 191 44441
E-mail: [email protected]
www.cs.helsinki.fi/research/fdk/

1. Progress of research work

In the original plan the research profile of the FDK unit was summarized as follows: Collection of raw data has in many areas of industry and research become easier than previously. Molecular biology produces long sequences of biological information, environmental satellites provide a wealth of data, process monitoring gives heaps of measurements, and the Internet gives easy access to a wide variety of data sources. Similar advances in the methods that provide useful information or knowledge from the data have not matched this overwhelming increase in the availability of data. The “From Data to Knowledge” (FDK) research unit develops methods for forming useful knowledge from large masses of data. The unit operates in a multidisciplinary fashion, integrating in its research groups excellence in discrete algorithms, statistical techniques and application sciences. The major methodological tools of the research unit are combinatorial pattern matching and data mining. The combination of these two is unique in the world. The work combines conceptual advances, algorithmic, statistical and analytical methods, and empirical work: theory and practice go hand in hand. The results of the unit have been applied in, e.g., molecular biology, process industry, telecommunications, genetics, ecology, and natural language processing. The results have attracted wide international attention. Many concepts created by the group are in use in the scientific community, and they are presented in textbooks. Software that incorporates methods invented at the unit has been commercialized in several countries. The main themes in the planned activity of the FDK unit are efficient algorithms, data mining and combinatorial pattern matching, and the analysis of sequential and many-dimensional data, as well as applications in computational molecular biology, bioinformatics, telecommunications and natural language technology.

The research of the unit can be viewed as an intertwined combination of four major projects or themes: I: Data mining and machine learning; II: Computational methods in medical genetics and systems biology; III: Combinatorial pattern matching and information retrieval; IV: Computational structural biology.

The projects are highly connected: basic research of computational methods for some specific applications occurs in each of the four. Similarly, the topics of discrete algorithms and probabilistic approaches occur repeatedly. The projects also share many researchers. Below, the subgroups belonging to each of the themes I-IV report briefly on the results and new plans. The subgroups are led by the senior members of the unit: Helena Ahonen-Myka, Tapio Elomaa, Jaakko Hollmén, Heikki Mannila, Hannu Toivonen and Esko Ukkonen. Two of our post-docs, Kjell Lemström and Juho Rousu, also have their own subgroups.

1.1. Data mining and machine learning

Group Ahonen-Myka

Members: Helena Ahonen-Myka, Lili Aunimo, Martin Fluch (-06/03), Oskari Heinonen, Kaisa Kostiainen (-05/03), Reeta Kuuskoski, Miro Lehtonen, Greger Lindén, Juha Makkonen, Renaud Petit, Jussi Piitulainen, Andrei Popescu (-07/03), Marko Salmenkivi, Otso Virtanen (-12/02)

We have studied the information retrieval related problems of first story detection and topic tracking [47, 71, 106, 107, 108, and 133]. The first story detection task is about spotting new, previously unreported real-life events in an online news feed, while topic tracking attaches a document to a previously detected event. We have addressed these problems by extracting locations, proper names, temporal expressions and normal terms from documents, and assigning weights to these semantic classes. The weights are learned from a training set that contains pre-classified documents. We have also proposed new similarity measures based on semantic classes. In our experiments on a Finnish online news-stream corpus, we have found that the use of semantic classes improves the performance significantly. We have also started to experiment with commonly used American English collections, which contain news articles from newspapers and news transcribed from spoken TV and radio broadcasts. We have developed a question answering system for a company helpdesk [17]. The incoming natural language questions are analyzed and the best answer candidates are retrieved from the old question-answer pairs. Varying parts, e.g. product prices, can be recognized and replaced with up-to-date values from a database. Our emphasis in this area is moving to extracting answers from ordinary text documents. Our approach analyses the semantic structures of the questions, launches an information retrieval query, and finally filters the query results using the semantic knowledge acquired from the question. The retrieval part can also utilize semantic clustering of words and other linguistic analysis that we are studying. In another context we also consider augmenting the retrieval part with ontological information and other application-specific information.
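The class-weighted comparison of documents described above can be illustrated with a minimal sketch. It is not the group's actual similarity measure: the class names, the fixed weights (which in the report are learned from training data), the use of cosine similarity within each class, and the decision threshold are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical class weights; in the report these are learned from training data.
CLASS_WEIGHTS = {"locations": 0.3, "names": 0.3, "temporal": 0.1, "terms": 0.3}


def cosine(a, b):
    """Cosine similarity of two bags of words given as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def class_similarity(doc_a, doc_b, weights=CLASS_WEIGHTS):
    """Weighted sum of per-class cosine similarities (assumed combination rule)."""
    return sum(
        w * cosine(Counter(doc_a.get(c, [])), Counter(doc_b.get(c, [])))
        for c, w in weights.items()
    )


def is_first_story(doc, earlier_docs, threshold=0.4):
    """Flag a document as a first story if no earlier document is similar enough.
    The threshold value is an arbitrary illustrative choice."""
    return all(class_similarity(doc, old) < threshold for old in earlier_docs)
```

Here each document is assumed to be a dictionary mapping a semantic class name to the list of expressions extracted for that class. Topic tracking would use the same similarity in the other direction, attaching a document to the most similar previously detected event.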

Group Elomaa

Members: Ilkka Autio, Tapio Elomaa (-7/2003), Jaakko Haapasalo, Ilkka Koskenniemi, Teemu Kurppa, Matti Kääriäinen, Jussi Lindgren, Tuomo Malinen, Mika Miettinen, Juho Rousu

Our general aim is to study well-founded machine learning algorithms and to apply modern algorithmic analysis to them. As a secondary research interest we have machine vision and image understanding (in robotics), in particular applications of machine learning methods therein. In 2002 we continued to work on the problem of optimal numerical range discretization with respect to known evaluation functions [2, 22, and 61]. We have been able to find even faster methods than before for one important evaluation function [60] and to apply the same line of analysis to Bayesian classifiers. In the past we have also studied pruning methods for learning structured knowledge representations. This year we have obtained NP-completeness and inapproximability results for the pruning of DAG-formed classifiers using an algorithm that is efficient in the case of trees [59]. Our new research directions have been theoretically well-motivated sampling in learning classifiers [58], algorithms for grammatical inference [61], and on-line learning [57]. In machine vision and image understanding research we have studied numerous different general approaches and have paid special attention to human detection, tracking, and identification methods. Under the learning from images theme, we also took on a short contract research project, where microscopy image analysis of bread was studied using machine learning methods [142]. The models learned using Haralick feature extraction combined with model tree induction and boosting matched the predictive accuracy of human experts. Tapio Elomaa left the research unit as he was appointed Professor at Tampere University of Technology in 2003. The other researchers continue their work in other groups of the unit.

Group Mannila

Members: Ella Bingham, Floris Geerts, Aristides Gionis, Bart Goethals, Jaakko Hollmén, Manfred Jaeger, Jaripekka Juhala, Mikko Koivisto, Kalle Korpiaho, Heikki Mannila, Taneli Mielikäinen, Anne Patrikainen, Jouni K. Seppänen, Evimaria Terzi

The work is characterized by combinations of pattern discovery and probabilistic modeling in data mining: pattern discovery aims at finding local phenomena, while modeling often aims at global analysis. Pattern discovery techniques can be very efficient in finding frequently occurring patterns from large masses of data. One of the basic questions is how much the collection of frequent patterns tells us about the underlying distribution. We have analyzed the use of maximum entropy approaches to inferring distributions from frequent pattern collections and obtained excellent empirical results [34]. Another major question is finding structure in large collections of 0-1 data: the results include a simple model of topics in 0-1 data, and simple algorithms for finding the topic structure [52, 120]. In industrial cooperation projects we have recently developed simple and efficient algorithms for on-line clustering. The combination of probabilistic and algorithmic techniques is also visible in several new themes. One major new theme in the work is finding good segmentations for sequences. The (k,h)-segmentation problem and algorithms [92] show how one can locate recurrent sources from sequences; the approach applies to basically any probabilistic model for the generation of points in the sequences. We have also looked at the question of finding fragments of total orders from unordered data [93], which seems to be a fruitful approach. We are also investigating different approaches to subspace clustering [Patrikainen Mannila manuscript], [Seppänen Mannila manuscript]. In the pure pattern discovery area, topics include approximation of frequent set collections [111] [Afrati Gionis Mannila manuscript]. The work on combining local and global analysis in data mining will continue. Potential new themes include spectral clustering, the interplay of probabilistic clustering and frequent sets, and word discovery from sequences.
The work has many connections to applications, e.g., in paleontology and in genomics.
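As background for the segmentation theme, the sketch below shows the classical dynamic program for plain k-segmentation of a numeric sequence under squared error. It is a simplification and an assumption on our part: it does not implement the (k,h)-segmentation algorithms of [92], which additionally require the k segments to come from only h recurrent sources, and the piecewise constant model with squared error is just one possible choice.

```python
def k_segmentation(seq, k):
    """Split seq into k contiguous segments minimising total squared error
    around each segment's mean. Returns (error, list of segment start indices)."""
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(seq)")
    # Prefix sums give the squared error of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(seq):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i, j):  # squared error of seq[i:j], with j > i
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal error of splitting seq[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j], back[m][j] = c, i
    # Recover the segment boundaries by walking the back pointers.
    starts, j = [], n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    return best[k][n], starts[::-1]
```

The running time is O(n^2 k). One natural heuristic for the recurrent-source variant is to cluster the resulting segment means into h groups, but that step is not shown here.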

Group Toivonen

Members: Kari Laasonen (1/2003-), Mika Raento (1/2003-), Hannu Toivonen, Kari Vasko (-12/2003)

We work on data mining and machine learning, and apply novel methods to problems in ubiquitous computing and ecology. Our work on ubiquitous computing started at the beginning of 2003. The current topic is the learning of contexts. In mobile and ubiquitous systems, adapting to user context can be extremely important: changes in the user's situation are rapid and they are strongly reflected in the user's needs and preferences. We have developed methods for recognizing individually important locations from cellular data and for predicting user movements [Laasonen et al, to appear in Pervasive 2004]. The methods have been implemented in a mobile phone with limited resources and without support from the network infrastructure. Future work will address other learning and mining problems with streaming data and limited computational resources, as well as system design and user experience testing. For several years, we have developed a general purpose tool, Bassist, for automating the implementation of complex Bayesian models [Toivonen et al, manuscript]. We have also built application-specific Bayesian models in joint projects with ecologists. During the period we continued developing Bayesian full probability models for organism-based environmental reconstructions, and applied them to arctic temperatures during the last 10000 years, in joint work with the Department of Ecology and Systematics [5]. A different problem, analysis of metapopulations and the conditions for population survival, was also attacked using Bayesian modeling and the Bassist tool. The Bayesian approach to this problem was introduced jointly with the Rolf Nevanlinna Institute and the Department of Ecology and Systematics, and it was shown to make major improvements over previous models [10]. A recent research topic is segmentation of time series. In paleoecological applications, segments are likely to correspond to relatively stable periods, and they are useful in the discovery of major climatic events.
Another application example is context recognition in adaptive mobile devices. We developed a novel method for estimating the number of statistically significant segments in time series [78, 134]. Another time series problem we addressed was automated detection of epidemics using web log data of a physicians' on-line handbook [96]. The method is based on deriving a smoothed time series, on using a flexible selection of data for comparison, and on applying randomization statistics to estimate the significance of the findings. Experiments with real data show that the simple method can provide accurate and early detection of epidemics. Using the so-called ROC space (true positive rate vs. false positive rate) is a useful way of assessing classifiers under different misclassification costs and class distributions. We showed how the chi-square test can be used to provide a third dimension to the analysis, indicating the statistical significance of points in the ROC space. Within an international research group, we also applied the methodology to the 113 submissions to the Predictive Toxicology Challenge, with the interesting but unfortunate result that, with only 2 or 3 exceptions, the submissions were no better than random guesses [39]. In the future we aim to concentrate on mining and learning for ubiquitous and proactive applications.
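The idea of attaching a significance value to a point in the ROC space can be sketched as follows. This is only one plausible reading of the approach: the code places a classifier in ROC space from its confusion matrix and computes the standard Pearson chi-square statistic for the 2x2 table, without continuity correction. The function names and the 5 % significance level are assumptions, and the exact procedure of [39] may differ.

```python
def roc_point(tp, fn, fp, tn):
    """Return (false positive rate, true positive rate) of a classifier."""
    return fp / (fp + tn), tp / (tp + fn)


def chi_square_2x2(tp, fn, fp, tn):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for independence of predicted and actual class."""
    n = tp + fn + fp + tn
    denom = (tp + fn) * (fp + tn) * (tp + fp) * (fn + tn)
    if denom == 0:
        return 0.0  # degenerate table: some row or column is empty
    return n * (tp * tn - fn * fp) ** 2 / denom


# Chi-square critical value for 1 degree of freedom at the 5 % level.
CRITICAL_5_PERCENT = 3.841

fpr, tpr = roc_point(tp=40, fn=10, fp=20, tn=30)
significant = chi_square_2x2(40, 10, 20, 30) > CRITICAL_5_PERCENT
```

A classifier lying on the diagonal of the ROC space (true positive rate equal to false positive rate) yields a statistic of zero, which corresponds to random guessing.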

1.2. Computational methods in medical genetics and systems biology

Group Hollmén

Members: Jaakko Hollmén, Mikko Katajanmaa, Heikki Mannila, Anne Patrikainen, Antti Rasinen, Salla Ruosaari, Jouni K. Seppänen

The group works on data mining techniques for analyzing discrete data. Our recent interest has been in combining local and global analysis techniques [97], and in knowledge discovery in inductive databases [53, 85]. We are also involved in application-oriented data analysis collaboration with cancer researchers. In a collaborative setting, we have analyzed gene expression data in biological and medical investigations concerning various cancer types [8, 15, 18, 45]. We have also investigated data quality fundamentals in the context of microarray measurements using image analysis techniques to filter artifacts of poor quality measurements [76]. Currently, the work continues with method development in the probabilistic framework to combine several sources of data, and to draw improved inferences based on the joint data set. The immediate application area is found in our collaborative cancer research: we are working on a project aiming at finding tumor markers of work-related lung cancers. Existing measurements include gene expression data from the microarray platform, copy number alteration measurements along the chromosome, characterization of the patient material, and gene annotation databases.

Group Mannila

Members: Teemu Kivioja, Mikko Koivisto, Heikki Mannila, Kimmo Palin, Pasi Rastas, Marko Salmenkivi, Esko Ukkonen

The work has concentrated on finding small and large scale structure in genomes. We have developed methods for segmenting genomes by the use of piecewise constant intensity models and Markov chain Monte Carlo techniques [14]. For small scale structure, the research has concentrated on methods for finding haplotype blocks in genome data. We have developed an MDL-based technique for finding blocks and for estimating the strength of block boundaries [Koivisto et al (manuscript); 77]. The method has attracted considerable interest, and has yielded intricate algorithmic questions. The research also includes a study on the use of population risk estimates for inferring disease models using MCMC techniques. Results include efficient algorithms for computing sum-product expressions under certain conditions, and techniques for estimating the possible models for certain traits. New plans: The work on haplotype blocks will continue, both in block discovery and in the reconstruction of ancestral haplotypes. On genome segmentation, recent work [92] has led to interesting issues in approximation algorithms. The major topics will be in understanding the inter- and intraspecies variation in genomes.

Group Rousu

Members: Veli Mäkinen, Esa Pitkänen, Ari Rantanen, Juho Rousu, Katja Saarela, Esko Ukkonen

We concentrate on developing methods for metabolic flux estimation which allow unified treatment of different kinds of incomplete data arising from isotopic-tracing experiments. The current methodology is based on modeling the measurement data in vector spaces spanned by different labeling patterns of the metabolites. Computation involves computing intersections of certain subspaces, projecting measurement information onto them, and solving linear equations to uncover the fluxes [119]. The computational support for tandem mass-spectrometric analysis of isotope-labeled metabolites is another topic of interest. The research has resulted in the PIDC software [13, 75], the first version of which is available from http://www.cs.helsinki.fi/research/icomic/software.html. An improved version capable of handling overlaps in the daughter-spectra has also been developed [35]. Plans in metabolic flux estimation include integration with gene expression data, better treatment of bidirectional reactions and cell compartments, and investigating higher-order balance equations and incremental versions of the algorithms. In metabolic network synthesis, we continue to study different search algorithms and evaluation criteria to guide the search. Moreover, we plan to develop well-founded methods for handling so-called hypothetical reactions in metabolic networks.

Group Toivonen

Members: Lauri Eronen (8/2002-), Floris Geerts (8/2002-), Bart Goethals (9/2003-), Petteri Hintsanen (8/2002-), Heikki Mannila, Päivi Onkamo, Petteri Sevon, Evimaria Terzi (8/2002-12/2003), Hannu Toivonen

We investigate relationships and patterns in genotypes and haplotypes, consisting of genetic markers, and in phenotypes, and develop novel algorithms for their automated analysis. Results include new methods for gene mapping, haplotype reconstruction, and haplotype similarity and clustering. Research on linkage disequilibrium-based gene mapping methods for complex diseases has been continued on three fronts: reconstruction and use of plausible gene genealogies in gene mapping [50], mapping of quantitative traits, and mapping from genotype data. We generalized our haplotype pattern based gene mapping method HPM to work with both quantitative traits and covariates [11]. These extensions significantly widen the scope of the haplotype pattern based approach. The scope has been extended to genotype data, too: many methods assume haplotype data, whereas genotypes are often all that is available from a lab. The general framework of HPM and its variants is described in [121]. Haplotypes are valuable for genetic studies because they capture information about regions descended from ancestral chromosomes. We have developed novel methods for haplotype reconstruction from genotype data [90, 164]. We use a class of Markov chain models to estimate the haplotype distribution in the population, and use them to infer the most likely haplotype pair for each genotype. The models are aimed specifically at long marker maps, where linkage disequilibrium between markers may vary and be relatively weak. Such maps are ultimately used in chromosome or genome-wide association studies. Experimental validation of the Markov chain methods on both a wide range of simulated data and real data shows that they clearly outperform previous methods on genetically long marker maps and are highly competitive with short maps, too. Another recent research topic without an immediate goal of gene localization is the analysis of haplotypes across individuals: we have developed similarity measures as well as clustering methods for sparse haplotype data [124]. The goal is to be able to find similarities and structure in the genomes of several individuals, even when not directly related to a disease. For dense haplotype data, haplotype blocks are a popular topic in genetics. A research topic we have been studying is marker selection for known blocks: given the different haplotypes over a large set of markers that do occur in a block, find a small subset of markers sufficient to reliably identify the haplotype. A new variant considers marker selection in a more general context, without a block structure. This problem can be critical in cutting down the costs of future association studies or diagnostic tests.
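The marker selection problem for a known block can be made concrete with a small sketch. It is an exhaustive search written only to illustrate the problem statement; it is not the group's algorithm, it ignores genotyping errors and haplotype frequencies, and its running time grows exponentially with the number of markers. The encoding of haplotypes as equal-length strings of allele symbols is an assumption.

```python
from itertools import combinations


def select_markers(haplotypes):
    """Return a smallest tuple of marker positions whose alleles tell all
    given haplotypes apart (exhaustive search over subsets by size)."""
    if not haplotypes:
        return ()
    distinct = sorted(set(haplotypes))
    n_markers = len(distinct[0])
    for size in range(n_markers + 1):
        for subset in combinations(range(n_markers), size):
            # Project every haplotype onto the candidate marker subset.
            projected = {tuple(h[i] for i in subset) for h in distinct}
            if len(projected) == len(distinct):
                return subset
    return tuple(range(n_markers))


block = ["00110", "00101", "11110", "11001"]
chosen = select_markers(block)
```

For the four haplotypes in `block`, two markers are enough, since four haplotypes over two-allele markers cannot be separated by a single marker. In practice a heuristic or approximation algorithm would replace the exhaustive loop.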

Group Ukkonen

Members: Teemu Kivioja, Margus Lukk, Kimmo Palin, Pasi Rastas, Esko Ukkonen

The group works on computational problems arising in the analysis of genome structure, gene regulatory networks and gene expression. In gene expression analysis, perhaps the most interesting result has been the solution of the probe selection problem for a novel expression measurement technique. This technique, based on a new way of utilizing a standard DNA sequencer, is under development at VTT Biotechnology (Espoo, Finland). We have developed an entire software system for the bioinformatics of this technique. It includes an algorithm and its implementation to find the so-called hybridization probes and to group them into a small number of pools. The method is able to analyze entire genomes. It was successfully applied to the yeast genome. The computational problem is NP-hard but has a good approximation algorithm that is based on matching techniques [43]. Together with VTT Biotechnology we have also developed computational methods for optimizing cDNA-AFLP (cDNA Amplified Fragment Length Polymorphism) experiments. This expression profiling method is based on dividing a complex cDNA mixture into small subsets using restriction enzymes and selective PCR. The aim of our work is to select restriction enzymes and selective primers such that as many genes as possible can be profiled in an experiment (manuscript in preparation). As far as we know, we are the first to study the problem as a rigorously defined optimization problem [manuscript]. Another study continued our work (in co-operation with A. Brazma and J. Vilo at EBI, Hinxton) on finding gene regulatory relations and regulatory networks from RNA expression data [12, 36]. We developed a general method to find correlations between observed and predicted effects of the so-called knockout experiments with yeast. The SPEXS tool for finding regulatory patterns and several examples of its applications were presented in J. Vilo's doctoral thesis [248] in 2002.
In a joint project with the Haartman Institute of the University we have analyzed human gene expression data to classify different stages of melanoma, with some unexpected findings (unpublished). In 2003 we also started collaboration with J. Taipale's group (Biomedicum, University of Helsinki) on transcription factor binding affinities and the effect of SNPs on them. So far we have developed a new biophysically justified scoring function for aligning regulatory patterns [manuscript]. Currently we work on generalizations for multiple alignments, to be able to compare several genomes simultaneously in order to find weaker regulatory signals. In genome analysis, we have worked on algorithms for finding haplotype blocks and mosaics (jointly with H. Mannila's group) and for inverting recombinations. Some of the algorithms have been implemented as a web server. A novel Hidden Markov Model for haplotype analysis is under development.

1.3. Combinatorial pattern matching and information retrieval

Group Ahonen-Myka

Members: Helena Ahonen-Myka, Antoine Doucet, Martin Fluch (-06/03), Kai Hendry (09/02-11/03), Saara Huhmarniemi (08/02-04/03), Greger Lindén, Kaisa Kostiainen (-05/03), Juha Makkonen, Andrei Popescu (-07/03), Marko Salmenkivi

Our previous work on finding all maximal frequent word sequences in text has been extended in two ways: 1) finding word sequences considering also their position in the hierarchical (XML) structure [56], and 2) finding sequences in texts annotated by morphological features [51]. We have also developed a method for finding frequent word sequences in large collections. The method first clusters the document collection into smaller clusters, then finds frequent word sequences separately within each cluster, and finally combines the results. The result is always an approximation of the real set of maximal frequent word sequences, but it is probably sufficient in most applications. We have experimented with the use of the discovered word sequences in XML document fragment retrieval. We have participated in the international "Initiative for the Evaluation of XML retrieval" (INEX) project [87], the aim of which is to create a large testbed and scoring methods for XML retrieval. In 2002 we participated in the assessment process and workshop only [23], but in 2003 we also constructed our own retrieval system and submitted its results [88]. The performance of our system compared to the other systems was very satisfactory. In addition to using word sequences as part of document descriptors, our emphasis was on finding accurate result fragments and on iteratively expanding queries based on the best answers received. We have also developed a working environment for a journalist that consists of a news story editor (an XML editor) and a retrieval interface to a collection of news stories [131]. The retrieval interface includes a live search option, i.e., the text the user is currently writing is every now and then formalized as a query and sent to the retrieval engine. The work on finding frequent word sequences will be continued by developing more efficient implementations.
Moreover, we will further evaluate the quality of the sequences as document content descriptors in some applications. In particular, we will develop similarity measures for comparing documents whose content descriptors include word sequences. We will also continue participating in the INEX project in order to test our methods in a standard setting.

Group Lemström

Members: Kjell Lemström, Tuomo Malinen, Veli Mäkinen, Anna Pienimäki, Mika Turkia, Esko Ukkonen

Our project, called C-BRAHMS (Content-Based Retrieval and Analysis of Harmony and other Music Structures), aims at designing and developing efficient methods for computational problems arising from music comparison, retrieval, and analysis. The project concentrates on retrieving polyphonic music in large-scale music databases containing symbolically encoded music. Since the start of the project in January 2002, the project outcome includes, for instance, the following results. 1) A method that finds musically meaningful patterns has been developed [7, 74]. Such patterns may subsequently be put in an indexing structure to allow fast access to the meaningful patterns. 2) Novel versions of geometry-oriented online algorithms have been introduced. In the Content-Based Music Retrieval (CBMR) application, such algorithms have several advantages over the conventional string matching algorithms: they deal naturally with polyphonic music (i.e. music where more than one note is played simultaneously); it is easy to include rhythmic information and note durations in the encoding; and the number of dimensions may be arbitrary (allowing several features to be considered) [81]. More recently, we have improved these geometry-oriented algorithms [125, 126]. 3) The first version of a CBMR query engine prototype has been set up [104]. The future plans contain both theoretical and practical work. On the theoretical side, the possibilities of the novel geometry-oriented approach need to be comprehensively studied, for instance, in order to make the algorithms robust enough. Moreover, as we would like to be able to find all musically meaningful patterns in large-scale music databases, some kind of compromise between indexing and online approaches has to be found. To avoid a combinatorial explosion, this leads also to a music analysis problem: within a polyphonic surface, which are the (monophonic) passages that are musically meaningful?
The practical work is associated with the query engine. The subsequent versions of the engine are planned to be freely available (under the GNU General Public License) and all the important findings and implementations are going to be embedded in the engine. Naturally such a connection of theory and practice will have an effect on both sides of the project. For instance, once we have an audio-digital front-end plugged into the engine (one that converts, e.g., hummed queries to symbolic form), we can assess the level of error tolerance required by the error-prone conversion process and use it to improve the robustness of the searching algorithms.
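The geometric view of polyphonic music retrieval mentioned above can be illustrated with a minimal sketch. Notes are treated as points (onset time, pitch), and the query pattern is searched for under translation, i.e. a shift in time and a transposition in pitch. The representation with integer onsets and the simple set-based test are assumptions chosen for brevity; the algorithms of [81, 125, 126] are more refined and also handle partial matches and note durations.

```python
def find_translations(pattern, score):
    """Return all (time shift, pitch shift) pairs under which every pattern
    note coincides with some note of the score. Notes are (onset, pitch)."""
    if not pattern:
        return []
    score_set = set(score)
    anchor_t, anchor_p = pattern[0]
    matches = []
    # Try aligning the first pattern note with every score note in turn.
    for t, p in sorted(score_set):
        dt, dp = t - anchor_t, p - anchor_p
        if all((pt + dt, pp + dp) in score_set for pt, pp in pattern):
            matches.append((dt, dp))
    return matches


# A polyphonic score: two notes sound together at onset 0 and at onset 4.
score = [(0, 60), (0, 64), (1, 62), (2, 64), (4, 67), (5, 69), (6, 71), (4, 55)]
pattern = [(0, 0), (1, 2), (2, 4)]  # three rising notes, two semitones apart
occurrences = find_translations(pattern, score)
```

Because the score is a point set rather than a string, simultaneous notes need no special handling, and further dimensions could be added to each point without changing the search. This sketch takes expected time proportional to the product of the pattern and score sizes, after sorting the score.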

Group Ukkonen

Members: Kimmo Fredriksson (-12/2002), Shunsuke Inenaga (6/2003-), Juha Kärkkäinen, Kjell Lemström, Veli Mäkinen, Hellis Tamm, Esko Ukkonen

This subgroup works on the basic research on combinatorial pattern matching algorithms. For one-dimensional strings we have been interested in approximate matching over compressed texts [27, 33] as well as on using so-called super alphabets to speed-up the search [62]. The most significant result on research along these lines is an average-optimal multiple approximate string matching algorithm (CPM 2003). This is joint work with G. Navarro, Univ. Chile.

Also, we have obtained improved results on the theoretical properties and practical use of the generalized q-gram filters for approximate string searches (jointly with S. Burkhardt, MPI Saarbrücken) [20]. Perhaps the strongest single result was a new algorithm for constructing suffix arrays in linear time (directly, not via a suffix tree). This result appeared at the ICALP 2003 conference [84, 100], fortuitously during the same week as two competing algorithms for the same problem appeared at CPM 2003.

We have also studied parameterized string matching problems, where a global transformation is applied to the strings (joint work with G. Navarro, Univ. Chile). We noticed a fundamental connection between translation-invariant approximate string matching and sparse dynamic programming, which yielded very efficient algorithms for this problem [117, 118]. This particular problem is motivated by the retrieval of monophonic music. We generalized the method to the polyphonic case in [103]. Music can be modeled more robustly using line segments (pitch-duration pairs). We developed geometric matching algorithms under this setting [125, 126].

In the two-dimensional case we have obtained promising initial results on matching that allows continuous local transformations. This work is motivated by two-dimensional electrophoresis. Several results in this area appeared in the PhD Thesis of Veli Mäkinen in 2003 [73, 249]. We have also studied global transformations in the two-dimensional case; we obtained an optimal algorithm for exact matching under rotations in [63].

We have also studied the optimization of certain multitape automata arising in string database systems and found a novel sufficient condition for the minimality of (nondeterministic) multitape automata [123].

The work will be continued along the above lines. Utilizing our new ideas on 2-dimensional matching and q-gram filters will be the main objectives. We are also currently developing practical versions of space-efficient full-text indexes (a continuation of the work on the compact suffix array in [32]). A new subproject starting in 2004 will develop a general-purpose software library for string matching algorithms.
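As background for the suffix array results mentioned above, the sketch below shows what a suffix array is and how it supports exact pattern search. It deliberately uses a naive construction (sorting all suffixes), not the linear-time algorithm of [84, 100]; the function names `build_suffix_array` and `find_occurrences` are assumptions made for this example.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: sort suffix start positions by the suffix they begin.

    This is far from linear time; it only illustrates the data structure.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return all start positions of pattern in text using binary search on sa."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "ana"))
```

All suffixes that start with the pattern are adjacent in the sorted order, so two binary searches delimit the block of occurrences.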

1.4. Computational Structural Biology

Group Ukkonen Members: Juan Carlos Borras, Kimmo Fredriksson (-12/2002), Peter Lamberg (-09/2002), Taneli Mielikäinen, Tuomas Ojamies (1/2003-), Janne Ravantti, Esko Ukkonen

The goal of this project is to develop, in collaboration with D. Bamford's Laboratory (Biocenter, University of Helsinki), computational methods and software tools for electron microscopy in structural virology. We work on algorithms for reconstructing 3-dimensional density models of viruses from their 2-dimensional density projections as seen on electron micrographs. Because of the very high noise level of the data, the reconstruction is an extremely tough problem. Finding the projection angles using the so-called common-lines techniques (implied by the Radon Theorem) does not seem to work reliably enough. An additional difficulty is that, according to our results, the search problem of finding consistent common lines is NP-hard in typical cases [accepted]. However, we have developed a promising new, theoretically justified technique to reduce noise using common lines [accepted]. For the reconstruction itself our new algorithm works satisfactorily [manuscript].

Comparing 3-dimensional models, represented explicitly as density values of voxels, is another challenging problem with many uses in structural biology. We have developed an all-against-all comparison algorithm for this problem, based on geometric hashing. The method is invariant under rotations and translations and also has some robustness against differences in density value scalings, which can even be nonlinear. A prototype implementation shows that the method sometimes performs extremely well. Its sensitivity and speed in particular still need improvement. Our goal is to develop our software into a server that performs searches and comparisons of a new density distribution against a database of such distributions.

Somewhat separately from the above, we also study images taken of human retinas, to classify certain spots (so-called microaneurysms) using support vector machines. The classification is useful in the medical diagnosis of diabetes. The work on all these topics will continue.

1.5. PhD Theses

2002

V. Ollikainen: Simulation Techniques for Disease Gene Localization in Isolated Populations. PhD Thesis. University of Helsinki, Department of Computer Science, Report A-2002-2. Abstract: In this thesis we present new simulation techniques and algorithms for localizing disease genes for human multifactorial diseases, which are disorders controlled by several genetic, environmental, and stochastic risk factors. From a computational point of view, gene localization is a complex problem. We introduce a population simulator software package for exploring the complex relationships between the models of multifactorial disease and localization outcomes. The population simulator consists of components for simulating pedigree generation according to the population history specification given by the user, the inheritance of chromosomal segments in the pedigrees, the emergence of complex disease phenotypes, and the collection of samples. The simulator can be used to estimate the significance of a positive localization outcome, to estimate the power of the proposed study setup, to analyze the connection between disease models and population history, and to perform comparative studies on different methods and measures used in disease gene localization. We concentrate on haplotype association methods that take advantage of linkage disequilibrium (nonrandom association of alleles in nearby loci) and describe a method for using the population simulator in analyzing the effect of different characteristics of population history (such as population age, size, and substructure) on the amount of linkage disequilibrium present in the data. We also describe a computational framework and a software implementation for finding the optimal gene mapping strategy for a genome-wide search for genes of complex diseases in an isolated population. The procedure is applied for different population histories, disease models, and measures of linkage disequilibrium.
We combine the developed simulation framework with three established gene localization methods (Haplotype relative risk, Haplotype pattern mining and Transmission/disequilibrium tests), and analyze their applicability in the initial stage of a genome-wide search.

J. Vilo: Pattern discovery from biosequences. PhD Thesis. University of Helsinki, Department of Computer Science, Report A-2002-3. Abstract: In this thesis we have developed novel methods for analyzing biological data, the primary sequences of the DNA and proteins, the microarray based gene expression data, and other functional genomics data. The main contribution is the development of the pattern discovery algorithm SPEXS, accompanied by several practical applications for analyzing real biological problems. For performing these biological studies that integrate different types of biological data we have developed a comprehensive web-based biological data analysis environment Expression Profiler (http://ep.ebi.ac.uk/). Biosequences, i.e., the primary sequences of DNA, RNA, and protein molecules, represent the most basic type of biological information. Features of these sequences that are reused by nature help us to understand better the basic mechanisms of gene structure, function, and regulation. The SPEXS algorithm has been developed for the discovery of the biologically relevant features that can be represented in the form of sequence patterns. SPEXS is a fast exhaustive search algorithm for the class of generalized regular patterns. This class is essentially the same as used in the PROSITE pattern database, i.e. it allows patterns to consist of fixed character positions, group character positions (ambiguities), and wildcards of variable lengths. The biological relevance of the patterns can be estimated according to several different mathematical criteria, which have to be chosen according to the application. We have used SPEXS for the analysis of real biological problems, where we have been able to find biologically meaningful patterns in a variety of different applications. For example, we have studied gene regulation mechanisms by a systematic prediction of transcription factor binding sites or other signals in the DNA.
In order to find genes that potentially share common regulatory mechanisms, we have used microarray based gene expression data for extracting sets of coexpressed genes. We have also demonstrated that it is possible to predict the type of interaction between the G-protein coupled receptors (GPCR) and their respective G-proteins, the mechanism widely used by cells for signaling pathways. That prediction, although the GPCRs have been studied for decades, primarily for their immense value for the pharmaceutical industry, had been thought to be unlikely from the primary sequence of GPCR alone. The tools developed for various practical analysis tasks have been integrated into a web-based data mining environment Expression Profiler hosted at the European Bioinformatics Institute EBI. With the tools in Expression Profiler it is possible to analyze a range of different types of data like sequences, numerical gene expression data, functional annotations, or protein-protein interaction data, as well as to combine these analyses.

2003

V. Mäkinen: Parameterized approximate string matching and local-similarity-based point-pattern matching. PhD Thesis. University of Helsinki, Department of Computer Science, Report A-2003-6. 2003. Abstract: This thesis studies matching problems for different one- and two-dimensional objects like numerical sequences representing music, one-dimensional point-patterns representing tree-ring series, and two-dimensional point-patterns representing spots in protein electrophoresis experiments. The one-dimensional problems studied here are generalizations of the classical string matching problem. The goal has been to find as general and efficient algorithms as possible for different variations of the classical problem. While most of the variants have motivations in applications, the study is more general; as an example, a wide spectrum of gradually more difficult matching problems under the label "restricted gaps" is covered. The basic scheme behind the efficient solutions is the use of geometric data structures to speed up dynamic programming. Especially semi-dynamic range minimum queries are shown to be useful in this context. One variation of the classical edit distance measure studied here is the transposition-invariant edit distance for numerical strings. A music piece can be modeled as a sequence of pitch levels, thus giving a numerical string. If the same music piece is recorded in different tones, a straightforward comparison between these two pitch sequences does not reveal the obvious similarity. A distance that measures the dissimilarity between two music pieces should therefore be transposition-invariant, i.e. it should allow an arbitrary shift in the tones without any cost. We show the perhaps surprising result that including transposition-invariance into edit distance only adds some poly-logarithmic multiplicative factors into the known upper bounds for the original problem. Another variation related to music information retrieval is matching of run-length encoded strings. The problems in this class take both the pitch and the duration information into account when comparing two music pieces. We give efficient algorithms for matching problems in this class. The concept of local-similarity-based matching is introduced in the context of point-pattern matching. The one-dimensional case has an application in matching tree-ring sequences in dendrochronology. A number of algorithms are developed for this problem. The two-dimensional case is more difficult. Consider two point sets A and B of equal cardinality in two-dimensional Euclidean space.
The task is to find a one-to-one matching between these point sets such that local similarity is preserved in the matching, i.e., the matching of a point is consistent with the matchings of its neighboring points. A distance measure framework for measuring the consistency of a matching is introduced and the problem of finding the best matching under this framework is studied. It is shown that several natural instances of this framework lead to problems that are computationally infeasible; they are NP-hard to approximate within any constant factor. A relaxed model for the problem is also studied leading to a heuristic solution based on minimum weight matching. Experiments with point-patterns extracted from two-dimensional protein electrophoresis gels show that the heuristic works reasonably well.
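The transposition-invariant edit distance described in the abstract above can be illustrated with a naive baseline. The sketch below assumes unit-cost edit operations and integer pitch sequences, and simply tries every transposition that aligns at least one pair of pitches; it is not the thesis's algorithm, which reaches the bounds stated in the abstract through sparse dynamic programming. The function names are assumptions made for this example.

```python
from typing import List


def edit_distance(a: List[int], b: List[int]) -> int:
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def ti_edit_distance(a: List[int], b: List[int]) -> int:
    """Naive transposition-invariant edit distance.

    Only shifts of the form b[j] - a[i] can produce a match, so it suffices
    to minimise over those shifts. The cost is the number of distinct shifts
    times one dynamic programming run, far above the bounds in the thesis.
    """
    if not a or not b:
        return max(len(a), len(b))
    shifts = {y - x for x in a for y in b}
    return min(edit_distance([x + t for x in a], b) for t in shifts)


melody = [60, 62, 64]       # C D E as MIDI pitch numbers
transposed = [65, 67, 69]   # the same melody five semitones higher
print(edit_distance(melody, transposed), ti_edit_distance(melody, transposed))
```

A plain comparison reports three substitutions for the transposed melody, while the transposition-invariant distance is zero.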

2004

M. Koivisto: Sum-Product Algorithms for the Analysis of Genetic Risks. PhD Thesis. University of Helsinki, Department of Computer Science, Report A-2004-1. Abstract: This work is motivated by a genetic data analysis task: recurrence risk analysis. Recurrence risks in various types of relatives, e.g., offspring and siblings, characterize the inheritance pattern of a genetic disorder. Given observations of the population prevalence and different recurrence risks, the task is to infer the number of underlying genes and the frequencies and effects of different variants of the genes. Due to a complex relationship of relatively simple data and a large number of model parameters, this problem is challenging both statistically and computationally. Straightforward application of existing techniques is not sufficient. In the first part of this thesis, we study three general methodological issues. First, we review the Bayesian paradigm for statistical inference. Special emphasis is on certain fundamental difficulties of practical Bayesian methods. Second, we study the sum-product problem, that is, marginalization of a multidimensional function that factorizes into a product of low-dimensional functions. We introduce a novel algorithm that improves the well-known variable elimination method by employing fast matrix multiplication techniques. A special type of sum-product problem, called transformation problems, is studied. We generalize a technique known as the Yates algorithm and show how the Möbius transformation on a subset lattice can be computed efficiently. Third, we consider the problem of integrating a multidimensional function. We describe a sophisticated method based on a tempering technique and Metropolis-coupled Markov chain Monte Carlo simulation. We argue that this method is particularly suitable to computations required in Bayesian data analysis. Connections to related methods are also explored. The second part is devoted to the genetic application. We introduce a genetic model called an epistatic Mendelian model. For computing recurrence risks under a fully specified model, we present a method that employs the Yates algorithm. Based on a different problem representation, we also give another algorithm that uses the fast Möbius transform. To integrate over the model parameters we present a version of the Metropolis-coupled Markov chain Monte Carlo method. Finally, we report experimental results that support two main conclusions. First, Bayesian analysis of recurrence risks is computationally feasible. Second, recurrence risk data provides interesting information concerning competitive genetic hypotheses, but the amount of information varies depending on the data set.
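The abstract above refers to computing the Möbius transformation on a subset lattice efficiently. As a minimal sketch of the standard Yates-style technique, assuming subsets of an n-element set are encoded as bitmasks, the code below computes the subset-sum (zeta) transform and its Möbius inverse in O(n 2^n) operations instead of the naive O(3^n). The function names are assumptions made for this example, and the code is not taken from the thesis.

```python
from typing import List


def zeta_transform(f: List[int], n: int) -> List[int]:
    """Return g with g[S] = sum of f[T] over all subsets T of S (S, T as bitmasks)."""
    g = list(f)
    for i in range(n):           # process one element of the ground set at a time
        bit = 1 << i
        for s in range(1 << n):
            if s & bit:
                g[s] += g[s ^ bit]   # add the value for S without element i
    return g


def moebius_transform(g: List[int], n: int) -> List[int]:
    """Inverse of zeta_transform: recover f from its subset sums."""
    f = list(g)
    for i in range(n):
        bit = 1 << i
        for s in range(1 << n):
            if s & bit:
                f[s] -= f[s ^ bit]
    return f


# Ground set {0, 1}; index 0b00 = {}, 0b01 = {0}, 0b10 = {1}, 0b11 = {0, 1}.
f = [1, 2, 3, 4]
g = zeta_transform(f, 2)
print(g, moebius_transform(g, 2))
```

Each pass handles one element of the ground set, which is the same one-coordinate-at-a-time idea that underlies the Yates algorithm.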

2. Changes in research strategy and research plans

While we regard our basic approach as vital and relevant, the future activity of the unit faces several challenges. In general, finding a fruitful balance between the basic research of computational methods and the applications in different areas needs continuous attention. Keeping our theoretical toolbox up-to-date is perhaps the main challenge here. Finding application partners who can offer novel and significant computational problems in their area also needs care. There is a danger of splitting our activity into subprojects that are too small, which, through loss of focus, may hamper the impact of the research unit.

In the previous report we listed the following short-term plans:
• Recruitment of new post-docs, possibly from abroad and in co-operation with HIIT/BRU, and helping some post-docs to build up their own group.
• Participation in the new research programs of the Academy of Finland and the Technology Development Agency of Finland (TEKES). For example, the Academy programs in ubiquitous computing and in systems biology are of interest to the unit, as is the planned TEKES technology program in information technology solutions for networked economy.
• Participation in the forthcoming EU projects. The unit was a member of several EU consortia in the preliminary round in June 2002.

There has been significant progress with all these goals. Our new post-docs include Dr. Aristides Gionis (Stanford) and Dr. Juha Kärkkäinen (returned from Max-Planck-Institut, Saarbrücken). Both are now building their own groups. The unit participates quite strongly in the new systems biology program of the Academy. We have also obtained new funding from TEKES for computational biology research. The unit is a member of several new EU projects, including the networks of excellence BIOSAPIENS (bioinformatics) and Pascal (machine learning).

New projects starting in 2004 include the development of a general-purpose software library for string matching. Such a component is missing from well-known algorithm libraries such as LEDA. Another software project, funded by TEKES, will develop a web server software system for the analysis of metabolic networks and metabolic fluxes. We have also started a strategic planning process for the second three-year term of the unit. Our purpose is to strengthen our focus on selected core areas, 'sequences' probably being one of them.

3. Personnel

3.1. Summary

Personnel of the unit by personnel group is given in the table below. Appendix 1 contains more detailed information.

                                                                2002              2003
Personnel group                                               m    f  tot      m    f  tot

Finnish personnel
 1. Professors and associate professors                       5    1    6      4    1    5
 2. Other senior researchers                                  1    -    1      2    -    2
 3. Postdoctoral researchers and other young researchers¹     6    1    7      6    2    8
 4. Ministry of Education graduate school students²          15    5   20     14    6   20
 5. Other postgraduate students                               1    3    4      2    2    4
 6. Other academic personnel                                  9    -    9     10    1   11
 7. Auxiliary personnel                                       -    -    -      -    -    -
    (office, technical, other ancillary personnel)

Foreign personnel
 8. Professors and associate professors                       -    -    -      -    -    -
 9. Other senior researchers                                  -    -    -      -    -    -
10. Postdoctoral researchers and other young researchers¹     -    -    -      4    -    4
11. Ministry of Education graduate school students²           1    -    1      1    -    1
12. Other postgraduate students                               2    1    3      2    1    3
13. Other academic personnel                                  3    -    3      1    -    1
14. Auxiliary personnel

TOTAL                                                        43   11   54     46   13   59

¹ A maximum of five years elapsed since defending the doctoral thesis
² Includes all graduate school students, not only positions in salary grade A18

3.2. Prizes and scientific honours received by researchers of the unit in 2002 - 2003

M. Kääriäinen: Department's young teacher of the year, University of Helsinki, Department of Computer Science, 2003.

K. Lemström: Department's senior researcher of the year award, University of Helsinki, Department of Computer Science, 2003.

H. Mannila: ACM SIGKDD Innovation Award, 2003.

T. Mielikäinen: ICDM'03 Student Travel Award provided by IBM Research, 2003.

3.3. International positions of trust held by researchers of the unit in 2002 - 2003

Ahonen-Myka, Helena

Journal Reviews
- IEEE Transactions on Knowledge and Data Engineering
- IEEE Systems, Man and Cybernetics

Scientific program committee memberships
- Twenty-Fifth Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2002)
- Sixth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02)
- ICML-2002 Workshop on Text Learning (TextML'2002)
- Workshop on XML in Digital Media (XMLinDM'02), in connection with International Conference on Distributed Multimedia Systems (DMS'2002)
- 5th International Conference on Enterprise Information Systems (ICEIS 2003)
- 26th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2003)
- 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'03)

Other

- Chair of the Organization Committee of the joint conferences 13th European Conference on Machine Learning (ECML'02) and 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02)

Elomaa, Tapio

Journal Reviews
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- Journal of Intelligent Information Systems

Scientific program committee memberships
- 19th International Conference on Machine Learning, Sydney, Australia, 2002
- 13th European Conference on Machine Learning, Helsinki, Finland, 2002
- 6th European Conference on Principles and Practice of Knowledge Discovery in Databases, Helsinki, Finland, 2002
- 13th International Symposium on Methodologies for Intelligent Systems, Lyon, France, 2002
- International Conference on Machine Learning and Applications, Las Vegas, USA, 2002

Hollmén, Jaakko

Scientific program committee memberships
- Member of the program committee of the Workshop on Self-Organizing Maps (WSOM'03)

Lemström, Kjell

Journal Reviews
- The Computer Journal
- Colombian Journal of Computation
- Journal of New Music Research

Scientific program committee memberships
- Member of the ISMIR steering committee
- Third International Conference on Music Information Retrieval (ISMIR 2002)
- Fourth International Conference on Music Information Retrieval (ISMIR 2003)

Other
- Reviewer for the 26th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 2003)
- Reviewer for the 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003)

Mannila, Heikki

Journal editor
- Editor-in-Chief, Data Mining and Knowledge Discovery
- Associate Editor, ACM Transactions on Internet Technology
- Area Editor, IEEE Transactions on Knowledge and Data Engineering

Scientific program committee memberships
- 29th International Colloquium on Automata, Languages and Programming (ICALP 2002)
- Second SIAM International Conference on Data Mining 2002 (program co-chair)
- Discovery Science 2002
- Eighth SIGKDD Conference on Data Mining and Knowledge Discovery (KDD'02)
- Sixth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2002) (co-chair)
- Workshop on Discrete Mathematics and Data Mining (DMDM)
- Machine Learning: ECML 2002 - 13th European Conference on Machine Learning (co-chair)
- First International Workshop on Knowledge Discovery in Inductive Databases (KDID'02)
- 2002 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'2002)
- The 2002 IEEE International Conference on Data Mining (ICDM 2002)
- 28th International Conference on Very Large Data Bases (VLDB 2002)
- International Conference on Database Theory (ICDT 2003)
- ACM Symposium on Management of Data (SIGMOD 2003)
- ACM SIGMOD 2003
- IEEE International Conference on Data Mining (ICDM 2003)
- International Conference on Machine Learning (ICML 2003)
- 15th International Conference on Scientific and Statistical Database Management (SSDBM 03)
- 7th European Conference on Principles and Practice of Knowledge Discovery in Databases
- SIGKDD 2004 - Tenth SIGKDD Conference on Data Mining and Knowledge Discovery (KDD 2004)
- 14th European Conference on Machine Learning 2004

Other
- Evaluator for a professorship in KTH, Stockholm
- Member of the evaluation panel for Computer Science of Vetenskapsrådet (Sweden)
- Program director for Academy of Finland Research Program on Proactive Computing
- Evaluator for a professorship in the University of Oulu
- Member of the evaluation panel for the PhD Thesis of Alexander Hinneburg (University of Halle)
- Member of Technical Advisory Board of Verity Inc.
- Member of the ACM SIGKDD Curriculum Committee
- Reviewer for numerous journals
- External reviewer for numerous Ph.D. theses

Mäkinen, Veli

Referee
- Nordic Journal of Computing
- International Symposium on Music Information Retrieval (ISMIR 2003)
- SIAM International Conference on Data Mining (SDM 2004)

Rousu, Juho

Scientific program committee memberships
- 13th European Conference on Machine Learning (ECML'02), Helsinki, Finland
- Program committee member of the 14th European Conference on Machine Learning (ECML-2003)

Toivonen, Hannu T.T.

Journal Reviews
- ACM Transactions on Database Systems
- IEEE Transactions on Knowledge and Data Engineering
- Data Mining and Knowledge Discovery
- Journal of the ACM
- Knowledge and Information Systems
- Bioinformatics
- European Journal of Human Genetics
- Journal of Artificial Intelligence Research

Scientific program committee memberships
- Program committee co-chair: Thirteenth European Conference on Machine Learning (ECML'02)
- Program committee co-chair: Sixth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02)
- Program committee vice-chair: IEEE International Conference on Data Mining (ICDM-2002)
- Program committee co-chair: Workshop on Data Mining in Bioinformatics (BIOKDD 2002), in conjunction with KDD-2002
- Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002)
- Second SIAM International Conference on Data Mining (SDM 2002)
- Nineteenth International Conference on Machine Learning (ICML-2002)
- 6th Pacific Area Knowledge Discovery and Data Mining Conference (PAKDD-2002)
- 4th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 02), September 2002, Marseille, France
- Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV, in SPIE's 16th Annual International Symposium on Aerospace/Defence Sensing, Simulation and Controls (AeroSense)
- Workshop on Multi-Relational Data Mining (MRDM 2002), in conjunction with KDD 2002
- First International Workshop on Knowledge Discovery in Inductive Databases (KDID'02), in conjunction with ECML/PKDD 2002
- ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003)
- European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003)
- IEEE International Conference on Data Mining (ICDM 2003)
- International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2003)
- International Conference on Scientific and Statistical Database Management (SSDBM 2003)
- SIAM International Conference on Data Mining (SDM 2003)
- Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2003)
- ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD 2003), program co-chair
- ECML-PKDD International Workshop on Knowledge Discovery in Inductive Databases (KDID 2003)
- ICDM Workshop on Frequent Itemset Mining Implementations (FIMI 2003)

Other
- Reviewer for chapters in books “Data Mining and Database Technology for Data Mining” (Springer) and “New Generation of Data Mining Applications” (Wiley)
- Reviewer for Fund for Scientific Research – Flanders (Belgium)
- Reviewer for the Research Council of Norway
- Member of PhD jury of Artur Bykowski, INSA Lyon, France
- Member of PhD jury of Bart Goethals, University of Limburg, Belgium
- Referee for full professor promotion, Simon Fraser University, Canada

Ukkonen, Esko

Journal Editor
- Editor-in-Chief of the Nordic Journal of Computing
- Journal of Universal Computer Science (JUCS)
- Associate Editor of the IEEE-ACM Transactions on Computational Biology and Bioinformatics (a new journal starting to appear in 2004)

Scientific program committee memberships
- First European Conference on Computational Biology (ECCB 2002), Saarbrücken, Germany
- Sixth International Colloquium on Grammatical Inference (ICGI 2002), Amsterdam, The Netherlands
- European Conference on Computational Biology (ECCB 2003), Paris, France
- Fundamentals of Computation Theory (FCT 2003), Malmö, Sweden

Other
- Refereeing of manuscripts for some conferences and a publishing house
- External member of the PhD committee of Stefan Burkhardt, University of Saarland, Germany
- Reviewer for the Natural Sciences and Engineering Research Council of Canada
- Reviewer and panel member for the Research Council of Norway
- Reviewer for the German-Israeli Foundation for Scientific Research & Development
- Reviewer and panel member (bioinformatics) for DFG (Bonn)

3.4. Domestic positions of trust held by researchers of the unit in 2002 – 2003

Ahonen-Myka, Helena

- Member of the Board of the national language technology education network
- Member of the Board of the graduate schools: Helsinki Graduate School in Computer Science and Engineering (HeCSE) and Graduate School of Language Technology in Finland (KIT Graduate School)

Hollmén, Jaakko

- Member of the Management Board in The Knowledge Discovery Network of Excellence (KDNet), supported by EU project No. IST-2001-33086
- Member of the management board of the Department of Computer Science and Engineering, Helsinki University of Technology
- Member of the management board of the research project “Smart information system for waste treatment (iWaste)” in the STREAMS Technology project funded by TEKES (National Technology Agency)
- Program Committee member, The 10th Finnish Artificial Intelligence Conference (SteP 2002) in Oulu, Finland, December 2002

Leino, Antti

- Secretary for the Finnish Heraldic Association

Mannila, Heikki

- Member of the Management Board in the Finnish Genome Center (Suomen Genomikeskuksen johtokunnan jäsen)
- Member of the management board in The Knowledge Discovery Network of Excellence (KDNet), supported by EU project No. IST-2001-33086

Rousu, Juho

- Coordinator of the research theme “Analysis of measurement data” in the national technology programme “On-Line Measurements in the Process Industry”, 1999-2002, funded by the National Technology Agency (TEKES)

Ukkonen, Esko

- Reviewer for the Academy of Finland
- Chairman or member of some administrative boards within the University of Helsinki
- Member of the Board of the Helsinki Institute for Information Technology (HIIT)
- PhD committee member (opponent) for Heikki Hyyrö, University of Tampere

3.5. Mobility of researchers

Visits to the Center in 2002

Visitor                             Institution                                                         PMoW³

Arita, Masanori, PhD.               Computational Biology Research Center, AIST and PRESTO, JST, Japan  0.5
Brazma, Alvis, PhD.                 Team Leader at the European Bioinformatics Institute                0.5
                                    (EMBL outstation at Cambridge, UK)
Das, Sandip, PhD.                   Indian Statistical Institute                                        1
De Raedt, Luc, PhD, Prof.           Albert-Ludwigs-Universität, Freiburg, Germany                       0.5
Halldorsson, Magnus, PhD, Prof.     University of Iceland, Iceland                                      0.25
Jaeger, Manfred, PhD.               Max-Planck-Institut für Informatik, Saarbrücken, Germany            3
Navarro, Gonzalo, PhD, Prof.        Department of Computer Science, University of Chile, Chile          0.75
Rung, Johan                         European Bioinformatics Institute (EBI), Hinxton, Cambridge, UK     0.25
Schlitt, Thomas                     European Bioinformatics Institute (EBI), Hinxton, Cambridge, UK     0.25
Vilo, Jaak                          European Bioinformatics Institute (EBI), Hinxton, Cambridge, UK     0.5

³ Person months of work: 2 weeks equals 0.5 PMoW

Visits to the Center in 2003

Visitor                             Institution                                                         PMoW⁴

Jean-Francois Boulicaut, Ph.D.      INSA Lyon, France                                                   0.25
Stefan Burkhardt, Ph.D.             Max-Planck-Institut, Saarbrücken, Germany                           0.25
Bruno Crémilleux, Ph.D.             University of Caen, France                                          0.25
Luc De Raedt, Ph.D., Prof.          Albert-Ludwigs-Universität, Freiburg, Germany                       0.5
Alexander Hinneburg                 The Institute of Computer Science of the Martin-Luther University,  0.5
                                    Halle/Wittenberg, Germany
Eric Rivals, Ph.D.                  University of Montpellier, France                                   0.5
Pierre-Yves Rolland                 Université d’Aix-Marseille, France                                  0.25
Marie-France Sagot, Ph.D.           INRIA, Rhône-Alpes, Lyon, France                                    0.25
Mohammed Zaki, Ph.D., Assoc. Prof.  Rensselaer Polytechnic Institute, Troy, New York, USA               1.25

⁴ Person months of work: 2 weeks equals 0.5 PMoW

Five young trainees (Stefan Immesberger, Matthias Berg, Lennart Heinzerling, Susanne Pfeifer and Marko Jung) from Germany also visited FDK in August-September 2003.

Visits from the Center in 2002

Visitor Institution PMoW

Doucet, Antoine University of Caen, France 3

Kärkkäinen, Juha, PhD. Max-Planck-Institut, Saarbrücken, Germany 12

Sevon, Petteri Karolinska Institutet, Stockholm, Sweden 6

Ukkonen, Esko, PhD, Prof. Schloss Dagstuhl, International Computer Science Center, Germany 0.5

Visits from the Center in 2003

Visitor Institution PMoW

Antoine Doucet University of Caen, France 4

Juha Kärkkäinen Max-Planck-Institut für Informatik, Saarbrücken, Germany 12

Juha Muilu European Bioinformatics Institute, Hinxton, UK 0.75

Juha Muilu IBM e-business solutions center, La Gaude, France 0.75

Veli Mäkinen University of Chile, Santiago, Chile 1

Evimaria Terzi University of Milan, Italy 0.5

Juho Rousu: Marie Curie Individual Fellowship 7/2003-6/2005, Royal Holloway University of London, United Kingdom.

4. Publications and other outcomes

4.1. Articles in international scientific journals with referee practice

2002

1. A. Amir, G. M. Landau, and E. Ukkonen: Online time stamped text indexing. Information Processing Letters 82 (5): 253-259, 2002.

2. T. Elomaa, and J. Rousu: Linear-time preprocessing in optimal numerical range partitioning. Journal of Intelligent Information Systems 18, 1: 55-70, 2002.

3. J. Han, R.B. Altman, V. Kumar, H. Mannila, and D. Pregibon: Emerging Scientific Applications in Data Mining. Communications of the ACM 45, 8: 54-58, August 2002.

4. T. Kivioja, M. Arvas, K. Kataja, M. Penttilä, H. Söderlund, and E. Ukkonen: Assigning probes into a small number of pools separable by electrophoresis. Bioinformatics 18 Suppl. 1 (ISMB 2002 special issue): 199-206, 2002.

5. A. Korhola, K. Vasko, H.T.T. Toivonen, and H. Olander: Holocene temperature changes in northern Fennoscandia reconstructed from chironomids using Bayesian modeling. Quaternary Science Reviews 21(16-17): 1841 - 1860, 2002.

6. H. Mannila, A. Patrikainen, J. K. Seppänen, and J. Kere: Long-range control of expression in yeast. Bioinformatics 18: 482-483, 2002.

7. D. Meredith, K. Lemström, and G. Wiggins: Algorithms for Discovering Repeated Patterns in Multidimensional Representations of Polyphonic Music. Journal of New Music Research, 31 (4): 321-345, 2002.

8. T. Niini, K. Vettenranta, J. Hollmén, M. L. Larramendy, Y. Aalto, H. Wikman, B. Nagy, J. K. Seppänen, A. Ferrer Salvador, H. Mannila, U. M. Saarinen-Pihkala, and S. Knuutila: Expression of myeloid-specific genes in childhood acute lymphoblastic leukemia -- a cDNA array study. Leukemia 16 (11): 2213-2221, 2002.

9. M. Nykänen, and E. Ukkonen: The exact path length problem. Journal of Algorithms 42: 41-53, 2002.

10. R.B. O'Hara, E. Arjas, H.T.T. Toivonen, and I. Hanski: Bayesian analysis of metapopulation data. Ecology 83 (9): 2408-2415, 2002.

11. P. Onkamo, V. Ollikainen, P. Sevon, H.T.T. Toivonen, H. Mannila, and J. Kere: Association analysis for quantitative traits by data mining: QHPM. Annals of Human Genetics 66: 419-429, 2002.

12. K. Palin, E. Ukkonen, A. Brazma, and J. Vilo: Correlating gene promoters and expression in gene disruption experiments. Bioinformatics 18, Supplement 2 (ECCB 2002 Proceedings): 172-180, 2002.

13. J. Rousu, A. Rantanen, R. Ketola, J. Kokkonen, and V. Tarkiainen: Computing Positional Isotopomer Distributions from Tandem Mass Spectrometric Data. Metabolic Engineering 4 (2002), pp. 285-294.

14. M. Salmenkivi, J. Kere, and H. Mannila: Genome Segmentation using Piecewise Constant Intensity Models and Reversible Jump MCMC. Bioinformatics 18, Supplement 2 (ECCB 2002 Proceedings): 211-218, 2002.

15. H. Wikman, E. Kettunen, J. K. Seppänen, A. Karjalainen, J. Hollmén, S. Anttila, and S. Knuutila: Identification of differentially expressed genes in pulmonary adenocarcinoma by using a cDNA array. Oncogene 21 (37): 5804-5813, 2002.

16. N. Woolley, P. Holopainen, V. Ollikainen, K. Mustalahti, M. Mäki, J. Kere, and J. Partanen: A new locus for coeliac disease mapped to chromosome 15 in a population isolate. Human Genetics, 111: 40-45, 2002.

17. M.J. Zaki, J.T.L. Wang, and H.T.T. Toivonen: BIOKDD01: Workshop on Data Mining in Bioinformatics. SIGKDD Explorations 3 (2): 71-73, January 2002.

18. Y. Zhu, J. Hollmén, R. Räty, Y. Aalto, B. Nagy, E. Elonen, J. Kere, H. Mannila, K. Franssila, and S. Knuutila: Investigatory and analytical approaches to differential gene expression profiling in mantle cell lymphoma. British Journal of Haematology, 119(4): 905-915, 2002.

2003

19. I. Autio, and T. Elomaa: Flexible view recognition for indoor navigation based on Gabor filters and support vector machines. Pattern Recognition vol. 36, issue 12 (2003), pp. 2769-2779.

20. S. Burkhardt, and J. Kärkkäinen: Better filtering with gapped q-grams. Fundamenta Informaticae 56, 1-2(2003), pp. 51-70.

21. M. Datar, T. Feder, A. Gionis, R. Motwani, and R. Panigrahy: A combinatorial algorithm for MAX CSP. Information Processing Letters 85(6): 307-315 (2003).

22. T. Elomaa, and J. Rousu: Necessary and Sufficient Preprocessing in Numerical Range Discretization. Knowledge and Information Systems 5, 2 (2003), pp. 162-182.

23. M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim: XTRACT: Learning Document Type Descriptors from XML Document Collections. Data Mining and Knowledge Discovery 7(1): 23-56 (2003).

24. F. Geerts: Expressing the box cone radius in the relational calculus with real polynomial constraints. Discrete and Computational Geometry 30, 4 (2003), pp. 607-622.

25. G. Grahne, R. Hakli, M. Nykänen, H. Tamm, and E. Ukkonen: Design and implementation of a string database query language. Information Systems 28 (2003), pp. 347-369.

26. D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R.S. Sharma: Discovering all most specific sentences. ACM Transactions on Database Systems 28 (2): 140-174, 2003.

27. J. Kärkkäinen, G. Navarro, and E. Ukkonen: Approximate String Matching over Ziv-Lempel Compressed Text. Journal of Discrete Algorithms 1, 3/4 (2003), pp. 313-338.

28. M. Kääriäinen, R. Nock, and T. Elomaa: Reduced Error Pruning of Branching Programs Cannot Be Approximated to within a Logarithmic Factor, Information Processing Letters 87, 2 (2003), pp. 73-78.

29. K. Lemström, and L. Hella: Approximate Pattern Matching and Transitive Closure Logics. Theoretical Computer Science, 299 1-3(2003), pp. 387-412.

30. K. Lemström, and J. Tarhio: Transposition Invariant Pattern Matching for Multi-Track Strings. Nordic Journal of Computing, 10, 3 (2003), pp. 185-205.

31. J.E. Litton, J. Muilu, A. Bjorklund, A. Leinonen, and N.L. Pedersen: Data modeling and data communication in GenomEUtwin. Twin Res. 2003 October 6(5): 383-90.

32. V. Mäkinen: Compact Suffix Array - A Space-Efficient Full-Text Index. Fundamenta Informaticae, Special Issue - Computing Patterns in Strings, 56(1-2): 191-210, 2003.

33. V. Mäkinen, G. Navarro, and E. Ukkonen: Approximate Matching of Run-length Compressed Strings. Algorithmica 35 (4): 347-369, 2003.

34. D. Pavlov, H. Mannila, and P. Smyth: Beyond independence: probabilistic methods for query approximation on binary transaction data. IEEE Transactions on Knowledge and Data Engineering 15 (6): 1409-1421, 2003.

35. J. Rousu, L. Flander, M. Suutarinen, K. Autio, A. Rantanen, and P. Kontkanen: Novel Computational Tools in Bakery Process Data Analysis: a Comparative Study. Journal of Food Engineering 57, 1 (2003), pp. 45-56.

36. Th. Schlitt, K. Palin, J. Rung, S. Diekmann, M. Lappe, E. Ukkonen, and A. Brazma: From gene networks to gene function. Genome Research 13 (2003), pp. 2568-2576.

37. T.A. Thanaraj, S. Stamm, F. Clark, J.J. Riethoven, V. Le Texier, and J. Muilu: ASD: the Alternative Splicing Database. Nucleic Acids Res. 2004 Jan 1;32(1): D64-9.

38. T.A. Thanaraj, F. Clark, and J. Muilu: Conservation of human alternative splice events in mouse. Nucleic Acids Res. 2003 May 15; 31(10): 2544-52.

39. H.T.T. Toivonen, A. Srinivasan, R.D. King, S. Kramer, and C. Helma: Statistical evaluation of the predictive toxicology challenge 2000-2001. Bioinformatics 19 (10): 1183 - 1193, 2003.

40. A. Vakali, E. Terzi, E. Bertino, and A.K. Elmagarmid: Hierarchical data placement for navigational multimedia applications. Data Knowledge Eng. 44(1): 49-80 (2003).

41. M. Zaki, J.T.L. Wang, and H.T.T. Toivonen: BIOKDD 2002: Recent Advances in Data Mining for Bioinformatics. SIGKDD Explorations 4(2): 112-114, 2003.

Accepted for publication

42. T. Elomaa, and M. Kääriäinen: The difficulty of reduced error pruning of leveled branching programs. Annals of Mathematics and Artificial Intelligence. To appear.

43. T. Elomaa, and J. Rousu: Efficient multisplitting revisited: Optima-preserving elimination of partition candidates. Data Mining and Knowledge Discovery. To appear.

44. F. Geerts, and B. Kuijpers: Topological formulation of termination properties of iterates of functions. Information Processing Letters, Elsevier. To appear.

45. E. Kettunen, S. Anttila, J.K. Seppänen, A. Karjalainen, H. Edgren, I. Lindström, R. Salovaara, A-M Nissén, J. Salo, K. Mattson, J. Hollmén, S. Knuutila, and H. Wikman: Differentially expressed genes in non-small cell lung cancer (NSCLC) – The expression profiling of cancer-related genes in squamous cell lung cancer. Cancer Genetics and Cytogenetics. To appear.

46. S. Koskenmies, E. Widén, P. Onkamo, M. Zucchelli, P. Sevón, H. Julkunen, and J. Kere: Haplotype associations define target regions for susceptibility loci in systemic lupus erythematosus. Eur J Hum Genetics. To appear.

47. J. Makkonen, H. Ahonen-Myka, and M. Salmenkivi: Simple Semantics in Topic Detection and Tracking. Information Retrieval. To appear.

48. H. Mannila, and M. Salmenkivi: Using Markov chain Monte Carlo Methods and Dynamic Programming for Event Sequence Data. Knowledge and Information Systems. To appear.

49. T. Mielikäinen, and E. Ukkonen: The complexity of maximum matroid-greedoid intersection and weighted greedoid maximization. Discrete Applied Mathematics. To appear.

50. P. Sevon, V. Ollikainen, and H.T.T. Toivonen: Tree Pattern Mining for Gene Mapping. Information Sciences. To appear.

4.2. Articles in international edited works and conference proceedings with referee practice

2002

51. H. Ahonen-Myka: Discovery of frequent word sequences in text. The ESF Exploratory Workshop on Pattern Detection and Discovery in Data Mining, Imperial College, London, 16-19 September, 2002. Lecture Notes in Artificial Intelligence 2447, Springer, 2002.

52. E. Bingham, H. Mannila, and J. K. Seppänen: Topics in 0-1 data. In D. Hand, D. Keim, and R. Ng, (eds.), Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, July 2002.

53. A. Bykowski, J. K. Seppänen, and J. Hollmén: Model-independent bounding of the supports of Boolean formulae in binary data. In M. Klemettinen, R. Meo, F. Giannotti and L. De Raedt, (eds.), Knowledge Discovery in Inductive Databases (KDID’02), First International Workshop, University of Helsinki Department of Computer Science Series of Publications B, Report B-2002-7, pages 20-31, 2002.

54. L. De Raedt, M. Jaeger, S. D. Lee, and H. Mannila: A Theory of Inductive Query Answering. In Proceedings of the Second IEEE International Conference on Data Mining, pp. 123-130, 2002.

55. I. Días, and J. Hollmén: Residuals generation and visualization for understanding novel process conditions. In Proceedings of the IEEE 2002 International Joint Conference on Neural Networks (IJCNN’02), volume 3, pages 2070-2075. IEEE Press, 2002.

56. A. Doucet: Améliorer les descripteurs de documents semi-structurés en utilisant les informations contextuelles. INFORSID 2002, Nantes, France, June 4-7, 2002, pp. 401-402. ISBN: 2-906855-18-9.

57. T. Elomaa, and J. Lindgren: Experiments with projection learning. Discovery Science, Proc. Fifth International Conference, DS '02 (Lübeck, Germany). Lecture Notes in Artificial Intelligence 2534: 127-140. Springer-Verlag, Berlin Heidelberg, 2002.

58. T. Elomaa, and M. Kääriäinen: Progressive Rademacher sampling. Proc. Eighteenth National Conference on Artificial Intelligence, AAAI-2002 (Edmonton, Canada). AAAI Press, Menlo Park, CA & MIT Press, Cambridge, MA, 2002, pp. 140-145.

59. T. Elomaa, and M. Kääriäinen: The difficulty of reduced error pruning of leveled branching programs. Proc. Seventh International Symposium on Artificial Intelligence and Mathematics, AMAI 2002 (Fort Lauderdale, FL). Will be published in Annals of Mathematics and Artificial Intelligence in 2004.

60. T. Elomaa, and J. Rousu: Fast minimum error discretization. In C. Sammut and A. Hoffmann (eds.), Proc. Nineteenth International Conference on Machine Learning, ICML'02 (Sydney, Australia). Morgan Kaufmann, San Francisco, CA, 2002, pp. 131-138.

61. T. Elomaa: Partition-refining algorithms for learning finite state automata. In M.-S. Hacid, Z. W. Ras, D. A. Zighed, and Y. Kodratoff (eds.), Foundations of Intelligent Systems, Proc. Thirteenth International Symposium, ISMIS'02 (Lyon, France). Lecture Notes in Artificial Intelligence 2366. Springer-Verlag, Berlin Heidelberg, 2002, pp. 232-243.

62. K. Fredriksson: Faster string matching with super-alphabets. In Proceedings of SPIRE'2002, Lecture Notes in Computer Science 2476, pages 44-57, Springer-Verlag, Berlin, 2002.

63. K. Fredriksson, G. Navarro, and E. Ukkonen: Optimal Exact and Fast Approximate Two Dimensional Pattern Matching Allowing Rotations. In Proceedings of CPM'2002, Lecture Notes in Computer Science 2373, pages 235-248, Springer-Verlag, Berlin, 2002.

64. K. Fredriksson, G. Navarro, and E. Ukkonen: Faster than FFT: Rotation invariant combinatorial template matching. In: S.G. Pandalai (ed.), Recent Research Developments in Pattern Recognition, Vol. 3: 75-112, Transworld Research Network, 2002.

65. C. Iliopoulos, K. Lemström, M. Niyad, and Y. Pinzon: Evolution of Musical Motifs in Polyphonic Passages. In Proc: AISB'2002; Symposium on AI and Creativity in Arts and Science, pp. 67-75, London, United Kingdom, April 2-5, 2002.

66. L. Kovacs, and H. Ahonen-Myka: Algorithm for maximal frequent sequences in document clustering. In the 3rd International Symposium of Hungarian Research on Computational Intelligence, Budapest, November 14-15, 2002, pp. 165-176. ISBN: 9-637154-12-4.

67. J. Kärkkäinen, and S. Burkhardt: One-gapped q-gram filters of Levenshtein distance. Proc. CPM 2002, LNCS 2373, pp. 225-234, Springer-Verlag 2002.

68. M. Lehtonen, R. Petit, O. Heinonen, and G. Lindén: A Dynamic User Interface for Document Assembly. In Furuta, Maletic, and Munson (eds.), proceedings of the ACM Document Engineering (DocEng) '02, November 8-9, 2002, McLean, Virginia, USA, pp. 134-141.

69. K. Lemström: Content-Based Retrieval of Symbolic Music. In Proc. FSKD'02; 1st International Conference on Fuzzy Systems and Knowledge Discovery, pp. 401-405, Singapore, November 18-22, 2002.

70. C.K. Leung, R. Ng, and H. Mannila: OSSM: A Segmentation Approach to Optimize Frequency Counting. In Proceedings of the 18th International Conference on Data Engineering (ICDE 2002), pp. 583-593, 2002.

71. J. Makkonen, H. Ahonen-Myka, and M. Salmenkivi: Applying Semantic Classes in Event Detection and Tracking. In the proceedings of the International Conference on Natural Language Processing (ICON-2002), December 18-21, 2002, Mumbai, India, pp. 175-183.

72. H. Mannila: Global and local methods in data mining: basic techniques and open problems. In ICALP 2002, 29th International Colloquium on Automata, Languages, and Programming, Malaga, Spain, July 2002; Springer-Verlag.

73. V. Mäkinen, and E. Ukkonen: Local Similarity Based Point-Pattern Matching. In Proc. 13th Annual Symposium on Combinatorial Pattern Matching (CPM 2002), Springer-Verlag LNCS 2373, pp. 115-132, Fukuoka, Japan, July 2002.

74. A. Pienimäki: Indexing Music Databases Using Automatic Extraction of Frequent Phrases. In Third International Conference on Music Information Retrieval (ISMIR 2002), Paris, France, October 13-17, 2002, pp. 25-30.

75. A. Rantanen, J. Rousu, J. T. Kokkonen, V. Tarkiainen, and R. A. Ketola: Computing Positional Isotopomer Distributions from Tandem Mass Spectrometric Data. Metabolic Engineering 4: 285-294, 2002.

76. S. Ruosaari, and J. Hollmén: Image analysis for classifying faulty spots from microarray images. In Proceedings of the 5th International Conference on Discovery Science, Lecture Notes in Artificial Intelligence. Springer, 2002.

77. E. Ukkonen: Finding founder sequences from a set of recombinants. In: Algorithms in Bioinformatics (WABI-2002), Lect. Notes in Computer Science 2452, pp. 277-286, Springer-Verlag 2002.

78. K. Vasko, and H.T.T. Toivonen: Estimating the Number of Segments in Time Series Data Using Permutation Tests. In: Proceedings of IEEE International Conference on Data Mining 2002 (ICDM'02), pp. 466-473. ISBN: 0-7695-1754-4.

79. J. Vesanto, and J. Hollmén: Recent Advances in Intelligent Paradigms, chapter An Automated Report Generation Tool for the Data Understanding Phase. Studies in Fuzziness and Soft Computing. Physica (Springer) Verlag, 2002.

80. J. Vesanto, and J. Hollmén: An automated report generation tool for the data understanding phase. In A. Abraham and M. Koeppen, (eds.), Hybrid Information Systems, pages 611-626. Physica-Verlag (Springer), Heidelberg, 2002. Proceedings of the First International Workshop on Hybrid Intelligent Systems (HIS’01).

81. G. Wiggins, K. Lemström, and D. Meredith: SIA(M)ESE: An Algorithm for Transposition Invariant, Polyphonic Content-Based Music Retrieval. In Proc: ISMIR'02; Third International Conference on Music Information Retrieval, pp. 283-284, Paris, France, October 13-17, 2002.

2003

82. S. Agrawal, S. Chaudhuri, G. Das, and A. Gionis: Automated Ranking of Database Query Results. In Proceedings (electronic) of the First Biennial Conference on Innovative Data Systems Research (CIDR 2003), Asilomar, CA, USA, 2003.

83. L. Aunimo, O. Heinonen, R. Kuuskoski, J. Makkonen, R. Petit, and O. Virtanen: Question Answering System for Incomplete and Noisy Data - Methods and Measures for its Evaluation. In Proceedings of 25th European Conference on Information Retrieval Research (ECIR 2003), April 2003, Pisa, Italy, Lecture Notes in Computer Science 2633, pp. 193-206, Springer 2003.

84. S. Burkhardt, and J. Kärkkäinen: Fast lightweight suffix array construction and checking. In proceedings of the 14th Symposium on Combinatorial Pattern Matching (CPM 2003), June 2003, Morelia, Mexico. Lecture Notes in Computer Science 2676, Springer, 2003, pp. 55-69.

85. A. Bykowski, J. Seppänen, and J. Hollmén: Model-independent bounding of the supports of Boolean formulae in binary data. In P. Lanzi, and R. Meo (eds.): Database technologies for data mining. Springer-Verlag, 2003. To appear.

86. T. Calders, and B. Goethals: Minimal k-free Representations of Frequent Sets. In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'03). Lecture Notes in Artificial Intelligence, Volume 2838, Springer-Verlag, pp. 71-82. September 22-26, 2003, Cavtat, Croatia.

87. A. Doucet, and H. Ahonen-Myka: Naïve clustering of a large XML document collection. In Proceedings of the First Annual Workshop of the Initiative for the Evaluation of XML retrieval (INEX), Schloss Dagstuhl, Germany, 9-11 December 2002, pp. 81-87. European Research Consortium for Informatics and Mathematics (ERCIM) Workshop Proceedings, 2003.

88. A. Doucet, L. Aunimo, M. Lehtonen, and R. Petit: Accurate retrieval of XML document fragments using EXTIRP. To appear in the Proceedings of the Second Annual Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2003), December 15-17, 2003, Schloss Dagstuhl, Germany, 2003.

89. T. Elomaa, and J. Rousu: On Decision Boundaries of Naive Bayes in Continuous Domains. Principles of Data Mining and Knowledge Discovery, PKDD-2003, Lecture Notes in Computer Science 2838 (2003), pp. 144-155.

90. L. Eronen, F. Geerts, and H.T.T. Toivonen: A Markov chain approach to reconstruction of long haplotypes. Pacific Symposium on Biocomputing (PSB2004), 104-115, Hawaii, USA, January 2004. World Scientific.

91. F. Geerts, and B. Kuijpers: Deciding termination of query-evaluation in transitive closure logics for constraint databases. In Proceedings of the 9th International Conference on Database Theory (ICDT 2003), January 2003, Siena, Italy, pp. 190-206.

92. A. Gionis, and H. Mannila: Finding recurrent sources in sequences. In the 7th Annual International Conference on Research in Computational Molecular Biology - RECOMB 2003. In: W. Miller, M. Vingron, S. Istrail, P. Pevzner, and M. Waterman (eds.); pp 123-130.

93. A. Gionis, T. Kujala, H. Mannila: Fragments of orders. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), Washington, DC, USA, August 2003, pp. 129-136.

94. B. Goethals, and M.J. Zaki: Advances in Frequent Itemset Mining Implementations, FIMI03. In Proceedings of the FIMI'03 Workshop on Frequent Itemset Mining Implementations. November 19, 2003, Melbourne, Florida, USA.

95. F. Geerts, B. Goethals, and T. Mielikäinen: What you store is what you get (extended abstract). In Jean-Francois Boulicaut and Saso Dzeroski (eds.): Proceedings of the 2nd International Workshop on Knowledge Discovery in Inductive Databases, pages 60-69. 2003.

96. J. Heino and H.T.T. Toivonen: Automated Detection of Epidemics from the Usage Logs of a Physicians' Reference Database. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2003), 180-191, Cavtat-Dubrovnik, Croatia, September 2003. Springer.

97. J. Hollmén, J. Seppänen, and H. Mannila: Mixture models and frequent sets: combining global and local methods for 0-1 data. In D. Barbara, and C. Kamath (eds.): Proceedings of the Third SIAM International Conference on Data Mining, pages 289-293. Society for Industrial and Applied Mathematics, 2003.

98. M. Koivisto, M. Perola, T. Varilo, W. Hennah, J. Ekelund, M. Lukk, L. Peltonen, E. Ukkonen, and H. Mannila: An MDL method for finding haplotype blocks and for estimating the strength of haplotype block boundaries. In Pacific Symposium on Biocomputing 2003 (PSB'03), R.B. Altman, A.K. Dunker, L. Hunter, T.A. Jung, and T.E. Klein, eds., World Scientific 2002, pp. 502-513.

99. J. Kärkkäinen, and S. S. Rao: Full-Text Indexes in External Memory. Chapter 7 in U. Meyer, P. Sanders, J. Sibeyn (eds.), Algorithms for Memory Hierarchies. Lecture Notes in Computer Science 2625, Springer 2003, pp. 149-170.

100. J. Kärkkäinen, and P. Sanders: Simple linear work suffix array construction. In Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP 2003), June-July 2003, Eindhoven, The Netherlands. Lecture Notes in Computer Science 2719, Springer, 2003, pp. 943-955.

101. M. Kääriäinen, and T. Elomaa: Rademacher penalization over decision tree prunings. In N. Lavrac, D. Gamberger, H. Blockeel & L. Todorovski (eds.), Machine Learning: ECML 2003, Proc. 14th European Conf. (pp. 193-204). LNAI 2837. Springer, 2003.

102. A. Leino, H. Mannila, and R-L Pitkänen: Rule discovery and probabilistic modeling for onomastic data. In Knowledge discovery in databases: the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) 2003, pp. 291-302. Springer. ISBN 3-540-20085-1.

103. K. Lemström, and V. Mäkinen: On Finding Minimum Splitting of Pattern in Multi-Track String Matching. In Proceedings of 14th Annual Symposium on Combinatorial Pattern Matching (CPM 2003), Springer-Verlag LNCS 2676, pp. 237-253, Morelia, Mexico, June, 2003.

104. K. Lemström, V. Mäkinen, A. Pienimäki, M. Turkia, and E. Ukkonen: The C-BRAHMS project. In Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR 2003), October 2003, Baltimore, Maryland, USA, pp. 237-238.

105. K. Lemström, and G. Navarro: Flexible and efficient bit-parallel techniques for transposition invariant approximate matching in music retrieval. In Proceedings of 10th International Symposium on String Processing and Information Retrieval (SPIRE'2003) LNCS 2857, October 2003, Manaus, Brazil, pp. 224-237.

106. J. Makkonen, and H. Ahonen-Myka: Utilizing Temporal Expressions in Topic Detection and Tracking. In Proceedings of 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL03), August 2003, Trondheim, Norway, pp. 393-404.

107. J. Makkonen: Investigations on Event Evolution in TDT. In Proceedings of HLT-NAACL 2003 Student Workshop, May 2003, Edmonton, Canada, pp. 43-48.

108. J. Makkonen, H. Ahonen-Myka, and M. Salmenkivi: Topic Detection and Tracking with Spatio-temporal Evidence. In Proceedings of the 25th European Conference on Information Retrieval Research (ECIR 2003), April 2003, Pisa, Italy, Lecture Notes in Computer Science 2633, pp. 251-256, Springer 2003.

109. H. Mannila, and M. Salmenkivi: Intensity Modeling of genome Data. To appear in: J. Wang, D. Shasha, H.T.T. Toivonen, and M. Zaki (eds.): Data Mining in Bioinformatics, Springer-Verlag, London, 2003.

110. T. Mielikäinen: Frequency-based views to pattern collections. In Peter L. Hammer (ed.): Proceedings of the IFIP/SIAM Workshop on Discrete Mathematics and Data Mining, SIAM International Conference on Data Mining (2003), May 1-3, 2003, San Francisco, CA, USA. SIAM, 2003.

111. T. Mielikäinen, and H. Mannila: The pattern ordering problem. In Nada Lavrac, Dragan Gamberger, Ljupco Todorovski, and Hendrik Blockeel (eds.): Knowledge Discovery in Databases: PKDD 2003 - 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, September 22-26, 2003, Proceedings, Volume 2838 of Lecture Notes in Artificial Intelligence, pp. 327-338. Springer, 2003.

112. T. Mielikäinen: Finding all occurring sets of interest. In Jean-Francois Boulicaut and Saso Dzeroski (eds.): Proceedings of the 2nd International Workshop on Knowledge Discovery in Inductive Databases, pp. 97-106. 2003.

113. T. Mielikäinen: Chaining patterns. In Gunter Grieser, Yuzuru Tanaka, and Akihiro Yamamoto (eds.): Discovery Science - 6th International Conference, DS 2003, Sapporo, Japan, October 17-19, 2003, Proceedings, Volume 2843 of Lecture Notes in Artificial Intelligence, pp. 233-244. Springer, 2003.

114. T. Mielikäinen: Change profiles. In Xindong Wu and Alex Tuzhilin (eds.): Proceedings of the 2003 IEEE International Conference on Data Mining (ICDM 2003), November 19-22, 2003, Melbourne, Florida, USA, pp. 219-226. IEEE Computer Society, 2003.

115. T. Mielikäinen: On inverse frequent set mining. In Wenliang Du and Chris Clifton (eds.): Proceedings of the 2nd Workshop on Privacy Preserving Data Mining (PPDM), pp. 18-23. IEEE Computer Society, 2003.

116. T. Mielikäinen: Intersecting data to closed sets with constraints. In Bart Goethals and Mohammed J. Zaki (eds.): Proceedings of the FIMI'03 Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, USA, November 19, 2003. Volume 90 of CEUR Workshop Proceedings, ISSN 1613-0073, online CEUR-WS.org/Vol-90/, 2003.

117. V. Mäkinen, G. Navarro, and E. Ukkonen: Matching Numeric Strings under Noise. In Proceedings of Prague Stringology Conference (PSC'03), Czech Technical University, Prague, September, 2003, pp. 99-110.

118. V. Mäkinen, G. Navarro, and E. Ukkonen: Algorithms for Transposition Invariant String Matching. In Proceedings of 20th International Symposium on Theoretical Aspects of Computer Science (STACS 2003), Springer-Verlag LNCS 2607, pp. 191-202, Berlin, February, 2003.

119. J. Rousu, A. Rantanen, H. Maaheimo, E. Pitkänen, K. Saarela, and E. Ukkonen: A method for estimating metabolic fluxes from incomplete isotopomer information. International Workshop on Computational Methods in Systems Biology, Lecture Notes in Computer Science 2602 (2003), pp. 88-103.

120. J. Seppänen, E. Bingham, and H. Mannila: A simple algorithm for topic identification in 0-1 data. In 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’03), Dubrovnik, Croatia, September 2003.

121. P. Sevon, H. Toivonen, and P. Onkamo: Gene Mapping by Pattern Discovery. To appear in J. Wang et al (eds.), Data Mining in Bioinformatics. Springer.

122. M. Sulkava, and J. Hollmén: Finding profiles of forest nutrition by clustering of the self-organizing map. In Proceedings of the Workshop on Self-organizing Maps (WSOM’03), pages 243-248, Hibikino, Kitakyushu, Japan, September 2003.

123. H. Tamm, and E. Ukkonen: Bideterministic Automata and Minimal Representations of Regular Languages. In Proceedings of 8th International Conference on Implementation and Application of Automata (CIAA 2003), July 16-18, 2003, Santa Barbara, CA, USA, pp. 61-71.

124. H.T.T. Toivonen, P. Onkamo, P. Hintsanen, E. Terzi, and P. Sevon: Data mining for gene mapping. To appear in J. Zurada and M. Kantardzic (eds.), New Generation of Data Mining Applications. IEEE Press.

125. E. Ukkonen, K. Lemström, and V. Mäkinen: Geometric Algorithms for Transposition Invariant Content-Based Music Retrieval. In Proceedings of 4th International Conference on Music Information Retrieval (ISMIR 2003), pp. 193-199, Baltimore, Maryland, USA, October, 2003.

126. E. Ukkonen, K. Lemström, and V. Mäkinen: Sweepline the Music! In Computer Science in Perspective (LNCS 2598), R. Klein, H.-W. Six, L. Wegner (eds.), 2003, pp. 330-342.

127. J. Vesanto, and J. Hollmén: An automated report generation tool for the data understanding phase. In A. Abraham, and L. Jain (eds.): Innovations in Intelligent Systems: Design, Management and Applications, Studies in Fuzziness and Soft Computing, chapter 5. Springer (Physica) Verlag, 2003.

128. J. Vesanto, M. Sulkava, and J. Hollmén: On the decomposition of the self-organizing map distortion measure. In Proceedings of the Workshop on Self-Organizing Maps (WSOM'03), pages 11-16, Hibikino, Kitakyushu, Japan, September 2003.

4.3. Articles in Finnish scientific journals with referee practice

2003

129. R. Pulkkinen, M. Salmenkivi, A. Leino, and H. Mannila: Louhi ja naistenmaa - kalevalaisten runojen Pohjolan sijainti peruskartan nimistön valossa. Sananjalka: Suomen kielen seuran vuosikirja 45 (2003). ISSN 0558-4639.

4.4. Articles in Finnish edited works and conference proceedings with referee practice

2002

130. A. Doucet: Extracting More Relevant Document Descriptors using Hierarchical Information. In the Proceedings of XML Finland 2002, October 21-22, 2002, Helsinki, HIIT Publications 2002-03, pp. 136-147.

131. M. Fluch, G. Lindén, and A. Popescu: A journalist's tool for writing and retrieving news stories. In the Proceedings of XML Finland 2002, October 21-22, 2002, Helsinki, HIIT Publications 2002-03, pp. 96-108.

132. K. Lemström: Polyfonisen musiikin haku sisällön perusteella (Content-based retrieval of polyphonic music). In: Tietojenkäsittelytiede, (17), 48-65, 2002.

133. M. Salmenkivi, J. Makkonen, and H. Ahonen-Myka: Topic detection and tracking based on extracting words with meaning of the same type. In the Proceedings of Suomen Tekoälypäivät (STeP'02), Finnish AI Conference, December 16-17, 2002, Oulu.

134. K. Vasko, and H.T.T. Toivonen: A statistical stopping criterion for top down segmentation of time series. In proceedings of STeP 2002 (The 10th Finnish Artificial Intelligence Conference, Finnish AI Society), pp. 226-233. ISBN: 951-96735-4-7.

2003

135. A. Pienimäki: Symbolisen musiikkidatan kaksitasoinen klusterointi. In Tietojenkäsittelytieteen päivät 2003, TKO-A39/03, May 2003, Espoo, Finland, pp. 37-40.

4.5. Scientific monographs published abroad

2002

136. T. Elomaa, H. Mannila, and H. Toivonen (eds.): Machine Learning: ECML 2002, Proc. 13th European Conference (Helsinki, Finland). Lecture Notes in Artificial Intelligence 2430. Springer-Verlag, Berlin Heidelberg, 2002.

137. T. Elomaa, H. Mannila, and H. Toivonen (eds.): Principles of Data Mining and Knowledge Discovery, Proc. 6th European Conference, PKDD 2002 (Helsinki, Finland). Lecture Notes in Artificial Intelligence 2431. Springer-Verlag, Berlin Heidelberg, 2002.

138. R. Grossman, J. Han, V. Kumar, H. Mannila, and R. Motwani: Proceedings of the Second SIAM International Conference on Data Mining. SIAM 2002; ISBN 0-89871-517-2 (Edited book).

2003

139. B. Goethals, M.J. Zaki (eds.): Proceedings of the FIMI'03 Workshop on Frequent Itemset Mining Implementations. November 19, 2003, Melbourne, Florida, USA.

140. D. Hand, H. Mannila, and P. Smyth: Principles of Data Mining. MIT Press, 2001. ISBN: 0-262-98290-X. Second printing, 2002. Chinese Edition, China Machine Press, 2003. ISBN: 7-111-11577-5. Polish Edition, to appear, 2004.

141. M. Zaki, J.T.L. Wang, and H. Toivonen (eds.): Proceedings of BIOKDD'03, 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics. Washington DC, August 2003. Report No. 03-11, Rensselaer Polytechnic Institute, Troy, NY. 2003.

4.6. Other scientific publications

2002

142. J.T. Lindgren, and J. Rousu: Microscopy image analysis of bread using machine learning methods. Report C-2002-68, Department of Computer Science, University of Helsinki, 2002.

143. V. Mäkinen, G. Navarro, and E. Ukkonen: Algorithms for Transposition Invariant String Matching, Technical Report TR/DCC-2002-5, Department of Computer Science, University of Chile, July 2002.

144. J. K. Seppänen, J. Hollmén, E. Bingham, and H. Mannila: Nonnegative matrix factorization on gene expression data. Bioinformatics 2002, poster 49. Bergen, April 2002.

145. J. Rousu, A. Rantanen, and E. Ukkonen: Flux Estimation from Incomplete Isotopomer Information. Report C-2002-55, Department of Computer Science, University of Helsinki, 2002.

2003

146. P. Floréen, and G. Lindén (eds.): Context-Aware Methods. Course on Context-Aware Computing 2003. Report C-2003-71, Department of Computer Science, University of Helsinki, December 2003.

147. P. Floréen, and G. Lindén (eds.): Context-Aware Scenarios. Course on Context-Aware Computing 2003. Report C-2003-70, Department of Computer Science, University of Helsinki, December 2003.

148. F. Geerts: Frequent Knot Discovery. Manuscript 2003.

149. M. Katajamaa, and J. Hollmén: Simulation model for gene expression data. In R. Spang, P. Beziat, and M. Vingron (eds.): Currents in Computational Molecular Biology 2003, pages 249-250, April 2003. Poster presentation at the Seventh Annual International Conference on Research in Computational Molecular Biology, RECOMB 2003.

150. M. Lehtonen: Utilizing a Multipurpose Collection of Documents. In the Proceedings of the Finnish Data Processing Week (FDPW 2001-2002), University of Petrozavodsk, Russia, 2003, pp. 138-146.

151. J. Makkonen: News-feed categorization. In the Proceedings of the Finnish Data Processing Week (FDPW 2001-2002), University of Petrozavodsk, Russia, 2003, pp. 78-91.

152. E. Pitkänen: Reconstruction of metabolic networks. Bioinformatics Research and Education Workshop, Bielefeld, Germany, April 2003.

153. A. Rantanen, J. Rousu, E. Pitkänen, K. Saarela, and E. Ukkonen: Estimating the fluxes of a metabolic network from incomplete isotopomer measurements. Poster in RECOMB'2003, Berlin, Germany, April 2003.

154. J. Rousu, and A. Rantanen: Improved Computation of 13C-Isotopomer Distributions from Tandem-MS Data. Report C-2003-9, Department of Computer Science, University of Helsinki, 2003.

155. J. Rousu: Optimal Multivariate Discretization for Naive Bayesian Classifiers is NP-hard. Report C-2003-8, Department of Computer Science, University of Helsinki, 2003.

156. S. Ruosaari, and J. Hollmén: Identifying Differentially Expressed Genes with Bootstrap-Based Testing. Poster presentation at Bioinformatics 2003, May 2003, Helsinki, Finland.

157. S. Ruosaari, and J. Hollmén: Identifying Differentially Expressed Genes. In TICSP Workshop on Computational Systems Biology, June 2003, Tampere, Finland. A poster.

158. J. Saarela, J. Hollmén, D. Chen, P. Tainola, A. Jokiaho, A. Palotie, H. Mannila, and L. Peltonen: Inheritance of expression profiles: A family based analysis reveals higher similarity in the gene expression profiles of related individuals. Poster presentation at the Annual meeting of the American Society of Human Genetics (ASHG), November 2003.

159. K. Saarela: Towards a reaction distance metric. Bioinformatics Research and Education Workshop, Bielefeld, Germany, April 2003.

160. H. Wikman, K. Salmenkivi, J. Seppänen, E. Kettunen, K. Vainio-Siukola, J. Hollmén, A. Karjalainen, S. Knuutila, and S. Anttila: Down-regulation of caveolin 1 and caveolin 2 in lung cancer revealed by the combined use of cDNA and tissue microarrays. Poster presentation at the ESF Program in Functional Genomics: 1st European Conference, Prague, Czech Republic, May 2003.

161. M.J. Zaki, J.T.L. Wang, and H.T.T. Toivonen: BIOKDD 2002: Recent Advances in Data Mining for Bioinformatics. SIGKDD Explorations 4 (2): 112-114, January 2003.

4.7. Patents

2002

162. H.T.T. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H. Mannila, and J. Kere: A method for gene mapping from genotype and phenotype data (patent application).

4.8. Computer programs (and algorithms)

2002

163. A. Rantanen, and J. Rousu: PIDC (Positional Isotopomer Distribution Calculator). Available via http://www.cs.helsinki.fi/research/icomic/software.html

2003

164. L. Eronen, F. Geerts, and H.T.T. Toivonen: HaploRec. A Markov chain approach to reconstruction of long haplotypes. Available at http://www.cs.helsinki.fi/group/genetics/haplotyping.html

165. A. Gionis: Prototype for approximating collections of frequent sets (for PODS'04 submission).

166. M. Koivisto, H. Mannila, M. Lukk, K. Sood, and E. Ukkonen: MDLBlockFinder -- A Minimum Description Length -based method for the analysis of haplotype blocks (c) 2003. University of Helsinki and HIIT/BRU. Available at http://www.cs.helsinki.fi/u/mkhkoivi/software/blocks/MDLBlockFinder.html

167. J. Makkonen: A tool for extracting English natural language temporal expressions. In: J. Makkonen, H. Ahonen-Myka: Utilizing Temporal Expressions in Topic Detection and Tracking. In Proceedings of 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL03), August 2003, Trondheim, Norway, pp. 393-404.

168. V. Mäkinen: Prototype implementation of the Compact Suffix Array (http://www.cs.helsinki.fi/u/vmakinen/software/)

169. A. Rantanen: Prototype of Friendly Metabolome Browser software: a tool for visualising metabolic networks.

170. A. Rantanen, J. Rousu, J. Kokkonen, V. Tarkiainen, and R. Ketola: PIDC: Implements the algorithm described in Computing Positional Isotopomer Distributions from Tandem Mass Spectrometric Data. Metabolic Engineering 4 (2002), 285-294. Downloadable from http://www.cs.helsinki.fi/research/icomic/software.html

171. P. Rastas: Haplovisual software for analysis and visualization of haplotype data. Available at: http://www.cs.helsinki.fi/u/prastas/haplovisual/

4.10. Lectures and visiting lectures

2002

172. H.T.T. Toivonen: Gene mapping as pattern discovery. Invited lecture at University of Limburg, December 2002.

2003

173. H. Ahonen-Myka: Extraction of temporal expressions from Finnish news-feed. Presentation in the 14th Nordic Conference on Computational Linguistics, Reykjavik, Iceland, May 2003.

174. H. Ahonen-Myka: Text analysis by discovering frequent phrases. Invited lecture at the University of Petrozavodsk, Russia, June 2003.

175. H. Ahonen-Myka: Utilizing temporal and spatial expressions in topic detection and tracking. Visiting lecture at the University of Caen, France, December 2003.

176. P. Floréen: Tilanteen huomioon ottavat kännykät. Suomen Akatemian Tiedekatselmuksen Tiede03 yleisöpäivät, Heureka, Vantaa, Finland, November 2003.

177. F. Geerts: What You Store Is What You Get. 2nd International Workshop on Knowledge Discovery in Inductive Databases (KDID 2003), Cavtat, Croatia, September 2003.

178. F. Geerts: Conjunctive query mining for multiphenotype gene mapping. EU-project meeting, Application of Probabilistic ILP, London, UK, April 2003.

179. B. Goethals: SQL: An alternative to Mine Rule. Invited lecture. Meeting of the consortium on discovering knowledge with Inductive Queries, Turin, Italy, February 2003.

180. B. Goethals: Mining queries in an inductive database framework. Meeting of the consortium on discovering knowledge with Inductive Queries, Turin, Italy, February 2003.

181. B. Goethals: Inductive Databases: the global picture? Meeting of the consortium on discovering knowledge with Inductive Queries, Ljubljana, Slovenia, April 2003.

182. B. Goethals: Combinatorial upper bounds on the number of candidate itemsets. FDK seminar, University of Helsinki, Finland, February 2003.

183. J. Hollmén: Data analysis of 0-1 data by combining frequent sets and mixture models. Teoriapäevad Pedasel (Theoretical computer science days), Pedase, Estonia, October 2003.

184. J. Hollmén: Bioinformatics - something for the computer scientists? Nordic University Computer Club Conference (NUCCC 2003), Espoo, Finland, March 2003.

185. J. Hollmén: Analysis of microarray data: current research and future challenges. Microarray data analysis developers' days, CSC - Scientific Computing, Espoo, Finland, May 2003.

186. J. Hollmén: Analysis of microarray data. Graduate school on microarrays, Helsinki Biomedical Graduate School, Helsinki, Finland, May 2003.

187. J. Hollmén: Beyond clustering: case studies and possibilities in gene expression data analysis. Microarray Bioinformatics Seminar, The Joint Bioinformatics Lab. of Turku Centre for Computer Science and Turku Centre for Biotechnology, Turku, Finland, May 6-7, 2003.

188. J.T. Kokkonen, V. Tarkiainen, R.A. Ketola, A. Rantanen, J. Rousu, and H. Maaheimo: Isotopomer Analysis by Mass Spectrometry and Mathematical Algorithm. Talk in 16th International Mass Spectrometry Conference, Edinburgh, Scotland, August 2003.

189. J. Kärkkäinen: Fast lightweight suffix array construction and checking. Talk at the 14th Symposium on Combinatorial Pattern Matching (CPM 2003), Morelia, Mexico, June 2003.

190. J. Kärkkäinen: A tale of three algorithms: linear time suffix array construction. Visiting lecture at the Gaspard-Monge Institute of Electronics and Computer Science, University of Marne-la-Vallee, France, December 2003.

191. A. Leino: Ahvenlammen vieressä on yleensä Haukilampi - näkökulma lähekkäisten vedenkokoumien nimeämiseen. XXX Kielitieteen päivät, Joensuu, Finland, May 2003.

192. A. Leino: Spatial data mining as an onomastic tool. United Nations Group of Experts on Geographical Names, Norden Division meeting, Helsinki, Finland, October 2003.

193. K. Lemström: Introduction to Music IR as research area. The Royal School of Library and Information Science, Copenhagen, Denmark, April 2003.

194. K. Lemström: Sweepline the music! University Aix-Marseille, Aix-en-Provence, France, April 2003.

195. K. Lemström: Introduction to Music Information Retrieval. Uppsala University, Sweden, December 2003.

196. K. Lemström: Introduction to Music Information Retrieval. Helsinki University Library, Finland, December 2003.

197. G. Lindén: PROACT: French-Finnish Proactive Computing. Building up ERA in ICT: Finnish Approach – seminar, Brussels, Belgium, April 9, 2003.

198. G. Lindén: Proactive Computing and PROACT. Tales of the Disappearing Computer – conference, Santorini, Greece, June 1, 2003.

199. G. Lindén: PROACT: A French-Finnish Joint Research Programme on Proactive Computing. ERA-NET National Contact Points Meeting – meeting, Brussels, Belgium, July 17, 2003.

200. G. Lindén: PROACT: A French-Finnish Joint Research Programme on Proactive Computing. Working Group Strengthening the ERA in IST domains – meeting, Brussels, Belgium, September 19, 2003.

201. H. Maaheimo, K. Ylönen, P. Jouhten, J. Rousu, A. Rantanen, R. Ketola, J. Kokkonen, E. Ukkonen, and M. Penttilä: Metabolomics for systems biology: A metabolic flux determination platform for yeast based on 13C tracer experiments. Talk in 1st Int'l Workshop on Systems Biology of Yeast, St. Louis, USA, November 2003.

202. J. Makkonen, and H. Ahonen-Myka: Extraction of Temporal Expressions from Finnish Newsfeed. A presentation at the 14th Nordic Conference of Computational Linguistics (NoDaLiDa 2003), Reykjavik, Iceland, May 2003.

203. T. Mielikäinen: Frequency-Based Views to Pattern Collections. SIAM/IFIP Workshop on Discrete Mathematics & Data Mining, San Francisco, California, USA, May 2003.

204. T. Mielikäinen: Finding All Occurring Sets of Interest. 2nd International Workshop on Knowledge Discovery in Inductive Databases, Cavtat-Dubrovnik, Croatia, September 2003.

205. T. Mielikäinen: The Pattern Ordering Problem. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, September 2003.

206. T. Mielikäinen: The Pattern Ordering Problem. 6th International Conference on Discovery Science, Sapporo, Japan, October 2003.

207. T. Mielikäinen: Chaining Patterns. 6th International Conference on Discovery Science, Sapporo, Japan, October 2003.

208. T. Mielikäinen: On Inverse Frequent Set Mining. Helsinki University of Technology, Laboratory for Theoretical Computer Science, Special Course on Cryptology: Privacy-Preserving Data Mining, Espoo, Finland, November 2003.

209. T. Mielikäinen: Intersecting Data to Closed Sets with Constraints. Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, USA, November 2003.

210. T. Mielikäinen: On Inverse Frequent Set Mining. 2nd Workshop on Privacy Preserving Data Mining, Melbourne, Florida, USA, November 2003.

211. T. Mielikäinen: Change Profiles. Third IEEE International Conference on Data Mining, Melbourne, Florida, USA, November 2003.

212. J. Muilu: Integration of value-added annotation to AS data. Workshop on Alternative Splicing, European Bioinformatics Institute, Hinxton, Cambridge, UK, July 2003.

213. J. Muilu: GenomEUtwin. Genotype data and data transfer. The 1st International Conference on Bio-data Interoperability, Tokyo, Japan, November 2003.

214. V. Mäkinen: Searching Music: Transposition Invariant String Matching. University of Chile, Department of Computer Science, Chile, March 28, 2003.

215. V. Mäkinen: Transposition Invariant String Matching. Max-Planck-Institut für Informatik, Saarbrücken, Germany, April 25, 2003.

216. J. Rousu: Isotopomer distribution computation from tandem mass spectrometric data with applications to metabolic flux estimation. Second International Conference on Biomedical Spectroscopy, London, United Kingdom, July 5-8, 2003.

217. M. Salmenkivi: Laskentaintensiivisten tilastollisten menetelmien soveltaminen kansanuskon tutkimuksessa. Teologian tohtori Nilla Outakosken muistoseminaari, Helsinki, Finland, May 2003.

218. M. Salmenkivi: Hiisi, pyhä and kalma - applying computational methods to the study of folk worldview, Method & Theory in the Study of Folk Religion. 40th Anniversary Symposium of the Finnish Society for the Study of Religion, Turku, Finland, October 2003.

219. E. Ukkonen: What are algorithms needed for. Tieteen päivät 2003, Helsinki, Finland, January 2003.

220. E. Ukkonen: Analyzing haplotype data. Fourteenth Seminar: Algorithms and Combinatorics in Biology, Lyon, France, April 2003.

221. E. Ukkonen: Analyzing haplotype data. TICSP Workshop on Computational Systems Biology, Tampere, Finland, June 2003.

222. E. Ukkonen: Discovering patterns in haplotype data. Fifteenth Int. School “Algorithmics for data mining and pattern discovery”, Lipari, Italy, July 2003.

223. E. Ukkonen: Two lectures in the Workshop on Combinatorics, Algorithms and Applications, Ubatuba, Brazil, September 2003.

224. E. Ukkonen: A method for estimating metabolic fluxes from incomplete isotopomer information. Mathematical Aspects of Systems Biology, Göteborg, November 2003.

4.11. Radio and television programmes and articles popularising science

2002

225. E. Ukkonen: FDK-huippuyksikkö – tietoa datasta. (FDK – centre-of-excellence – Knowledge from data). Tietojenkäsittelytiede Dec. 2001, pp. 21-23.

226. E. Ukkonen: A general talk on algorithmics and bioinformatics, presented five times in various domestic seminars (Helsinki University of Technology, University of Helsinki).

227. E. Ukkonen: The role of computer science in bioinformatics. Invited seminar talk, Chalmers/ Göteborg University, Dec. 2002.

2003

228. E. Ukkonen: Mihin algoritmeja tarvitaan (What are algorithms needed for?) Tieteessä tapahtuu 7, 2003, 19-22.

4.12. Other outcomes: international conferences

In August 2002, FDK organized two colocated scientific conferences in Helsinki: the 13th European Conference on Machine Learning (ECML'02) and the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02). Both are among the best international conferences in their respective fields. FDK was responsible for both the scientific program (chairs: T. Elomaa, H. Mannila, H.T.T. Toivonen) and the local organization (chair: H. Ahonen-Myka) of the joint conference. The fully reviewed conferences received 218 submissions in total, each of which was reviewed by at least three members of the international program committees. Of these, 80 high-quality papers were accepted, 31 of them conditionally, for presentation and publication in the proceedings, published by Springer-Verlag.

In addition to the two main conferences, 6 workshops and 6 tutorials were organized as part of the five-day conference program, which attracted 300 participants from all over the world.

4.13. Degrees

4.13.1 Master’s theses

2002

229. L. Aunimo: Recognition of semantic similarity between text fragments. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

230. I. Hautamaa: Knowledge management and knowledge bases as an aid for technical support. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

231. S. Huhmarniemi: Implementation of a minimalistic universal grammar. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

232. R. Kuuskoski: Question answering systems. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

233. M. Kääriäinen: The generalization error of machine learning methods: dependency on the data and on the algorithm. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

234. J. Lindgren: Efficient use of attributes in learning. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

235. T. Mielikäinen: Orientation of electron micrographs. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

236. A. Mikkonen: Identification of persons from video. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

237. O. Orrainen: Localizing defects in elevators using control state data. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

238. A. Patrikainen: Projected clustering of high-dimensional binary data. Master’s thesis. Helsinki University of Technology. 2002.

239. A. Pienimäki: Indexing music using maximal phrases. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

240. A. Verkhovsky: Semiautomatic particle detection from electron micrographs using thresholding. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

241. O. Virtanen: Machine learning methods in text categorization and their use in a question answering system. Master’s thesis. University of Helsinki, Department of Computer Science. 2002.

2003

242. P. Harjula: Email filtering using content-based methods. Master’s thesis. University of Helsinki, Department of Computer Science. 2003.

243. H. Hiisilä: Finding Components In Discrete Biosequences. Master’s thesis. Helsinki University of Technology, Department of Computer Science and Engineering. 2003.

244. N. Laine: Learning information extraction rules. Master’s thesis. University of Helsinki, Department of Computer Science. 2003.

245. A. Pienimäki: Data Mining Techniques for Algorithmic Music Analysis. Master’s thesis. University of Helsinki, Institute for Art Research. 2003.

246. L. Rinne: Utilizing web usage mining methods in companies. Master’s thesis. University of Helsinki, Department of Computer Science. 2003.

4.13.2. PhD theses

2002

247. V. Ollikainen: Simulation Techniques for Disease Gene Localization in Isolated Populations. PhD Thesis. University of Helsinki, Department of Computer Science, Report A-2002-2.

248. J. Vilo: Pattern discovery from biosequences. PhD Thesis. University of Helsinki, Department of Computer Science, Report A-2002-3.

2003

249. V. Mäkinen: Parameterized approximate string matching and local-similarity-based point-pattern matching. PhD Thesis. University of Helsinki, Department of Computer Science, Report A-2003-6. 2003.

2004

250. M. Koivisto: Sum-Product Algorithms for the Analysis of Genetic Risks. PhD Thesis. University of Helsinki, Department of Computer Science, Report A-2004-1. 2004.

5. Funding of the center 2002-2003

Total funding in EUR

Financier                                                    2002        2003

A. Domestic funding
1. Own basic funding (i.e. host institution)⁵
   - University of Helsinki                               285 920     325 000
   - Helsinki University of Technology                     45 822      50 000
2. Academy of Finland⁶                                     60 548     256 420
3. National Technology Agency (TEKES)                     110 085      61 200
4. Other domestic financiers
   - Academy of Finland⁷                                  537 030     355 199
   - University of Helsinki⁸                               75 685      80 000
   - Helsinki University of Technology                     45 000      70 000
   - University of Helsinki⁹                               35 320
   - Ministry of Education¹⁰                              140 000     288 000
   - Enterprises
     * APPA (Sonera)                                      180 000
     * TYTTI (Alma Media, Lingsoft, Nokia, Vaisala)        50 457
     * 4M (Nokia, Fujitsu Invia, Lingsoft)                              3 400
     * ALTTI (GeneOS, Jurilab, Cyberell)                               25 010
   - VTT                                                   10 000
B. Foreign funding
   - Max-Planck Institut                                   40 000      40 000
   - EU, Member of the following Networks-of-excellence¹¹
     * KDNet
     * NeuroCOLT II

TOTAL                                                   1 615 867   1 554 229

⁵ Including rent of space
⁶ Comprising the funding for the centre-of-excellence
⁷ Including xx eur funding from year 2002
⁸ Funds granted by Ministry of Education due to status of the centre-of-excellence
⁹ Funds for project: Autonomisen robotin älykäs ohjaus
¹⁰ Graduate schools ComBi and HecSe
¹¹ No money spent during the year 2002

APPENDIX: List of personnel 2003

Name                         Sex     Year of   Personnel   Estimated
(Family name, given names)   (m/f)   birth     group       working time¹²

Ahonen-Myka, Helena          f       1963      1           0,8
Hollmén, Jaakko              m       1970      1           1
Mannila, Heikki              m       1960      1           1
Toivonen, Hannu              m       1967      1           1
Ukkonen, Esko                m       1950      1           1
Lindén, Greger               m       1963      2           1
Muilu, Juha                  m       1960      2           0,2
Geerts, Floris               m       1973      10          1
Gionis, Aristides            m       1972      10          1
Goethals, Bart               m       1975      10          1
Hyvönen, Saara               f       1967      3           1
Inenaga, Shunsuke            m       1978      10          0,3
Koivisto, Mikko              m       1974      3           1
Kärkkäinen, Juha             m       1968      3           0,7
Lemström, Kjell              m       1968      3           1
Mäkinen, Veli                m       1975      3           1
Onkamo, Päivi                f       1971      3           1
Rousu, Juho                  m       1970      3           1
Salmenkivi, Marko            m       1967      3           1
Aunimo, Lili                 f       1973      4           1
Autio, Ilkka                 m       1973      4           1
Doucet, Antoine              m       1976      11          1
Heinonen, Oskari             m       1970      4           1
Kääriäinen, Matti            m       1978      4           1
Kivioja, Teemu               m       1972      4           1
Kuuskoski, Reeta             f       1974      4           1
Lehtonen, Miro               m       1975      4           1
Lindgren, Jussi              m       1977      4           0,5
Makkonen, Juha               m       1974      4           1
Mielikäinen, Taneli          m       1978      4           1
Palin, Kimmo                 m       1978      4           1
Patrikainen, Anne            f       1979      4           1
Pienimäki, Anna              f       1975      4           1
Rantanen, Ari                m       1972      4           1
Ravantti, Janne              m       1964      4           0,3
Ruosaari, Salla              f       1976      4           1
Saarela, Katja               f       1976      4           1
Seppänen, Jouni              m       1976      4           1
Sevon, Petteri               m       1971      4           1
Vasko, Kari                  m       1974      4           0,8
Borras, Juan-Carlos          m       1973      12          1
Heino, Jaana                 f       1973      5           0,8
Hintsanen, Petteri           m       1978      5           1
Kujala, Teija                f       1961      5           0,8
Leino, Antti                 m       1965      5           1
Lukk, Margus                 m       1977      12          0,8
Tamm, Hellis                 f       1965      12          1
Haiminen, Niina              f       1979      6           0,8
Juhala, Jaripekka            m       1983      6           0,2
Korpiaho, Kalle              m       1982      6           0,2
Malinen, Tuomo               m       1976      6           0,4
Muhonen, Juho                m       1980      6           1
Ojamies, Tuomas              m       1978      6           1
Petit, Renaud                m       1979      13          1
Pitkänen, Esa                m       1978      6           1
Rasinen, Antti               m       1979      6           1
Rastas, Pasi                 m       1978      6           1
Tatti, Nikolaj               m       1982      6           1
Turkia, Mika                 m       1972      6           0,6

¹² Estimated working time = FTE (full-time equivalent) means a minimum of 36 hours/week. One FTE means working 52 weeks/year including paid holidays. Six months of work corresponds to 0,5 FTE.