TR01-0001 (February)

Title
Conformance and mirroring for timed systems

Author(s)
Bin Zhou, Tomohiro Yoneda and Bernd-Holger Schlingloff

Contact person
Tomohiro Yoneda (yoneda@cs.titech.ac.jp)

Abstract
Conformance has been used as a correctness criterion for asynchronous circuits. Although the definition of conformance does not directly lend itself to implementation in a verification procedure, there is an equivalent property which can be implemented effectively: in the case of untimed systems, conformance of an implementation to a specification is equivalent to failure-freeness between the implementation and the mirror of the specification. For bounded-delay systems, this property does not hold in general. In this paper, we define various notions of failure and examine whether the above property holds or not. We then discuss alternative effective algorithms for conformance checking of timed systems.
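
In notation, the untimed equivalence stated above can be paraphrased (with \parallel denoting parallel composition; these are not necessarily the paper's symbols) as

    I \text{ conforms to } S \;\Longleftrightarrow\; I \parallel \mathrm{mirror}(S) \text{ is failure-free} \qquad \text{(untimed case)},

and the paper's starting point is that this equivalence fails in general for bounded-delay (timed) systems.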


TR01-0002 (March)

Title
Improving automatic query expansion by using heterogeneous thesauri

Author(s)
Rila Mandala

Contact person
Tokunaga Takenobu (take@cl.cs.titech.ac.jp)

Abstract
Information retrieval is concerned with locating documents relevant to a user's information need from a collection of documents. The user describes his/her information need using a query, which consists of a number of words. Information retrieval systems compare the query with the documents in the collection and return the documents that are likely to satisfy the information need. One major problem in information retrieval is the difficulty of describing a user's information need in terms of a query such that a system can accurately distinguish between documents that are relevant and irrelevant to the query. In fact, the user's original query statement will usually consist of just a few terms related to the subject of interest.

In this thesis, we address the word mismatch problem through automatic query expansion, a technique used in information retrieval to remedy this problem. A query is expanded by adding other terms that are closely related to the original query terms. Expansion terms can be selected by referring to thesauri or by consulting users through the relevance feedback technique. Past research has verified the effectiveness of relevance feedback, but it places a certain burden on users. Furthermore, if a user is not familiar with the vocabulary of a document collection, it is difficult to obtain good expansion terms unless the system can suggest terms to the user.

Many researchers have found that query expansion using a thesaurus gives no improvement, or only a very limited improvement, compared to the relevance feedback method. This thesis analyses why query expansion using a thesaurus shows only limited performance and, based on this analysis, proposes a method to improve the performance of automatic query expansion by using heterogeneous thesauri. Experimental results show that our method can improve retrieval performance significantly.

Analysis of the results shows that the performance of queries that contain multiple aspects is degraded by our method. We investigated several methods, both manual and automatic, to overcome this problem, and experiments show that these methods can successfully increase the performance of those queries. Further analysis also shows that queries containing negation statements are degraded by our method. By eliminating the negation statements we found that our method also improves the performance of those queries.

We compared our results with relevance feedback and found that retrieval performance using our method is comparable to that of a retrieval system using relevance feedback, and better than that of a retrieval system using pseudo-relevance feedback. Further, we propose a simple combination of query expansion using our method and the pseudo-relevance feedback method. The performance of this combination is better than that of either method alone.
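
To make the thesaurus-based expansion step concrete, below is a minimal, illustrative Python sketch of combining several heterogeneous thesauri; the thesauri, weights and scoring are assumptions made for illustration, not the weighting scheme developed in the thesis.

    # Minimal sketch of query expansion with several (heterogeneous) thesauri.
    # The thesauri, weights and scoring below are illustrative assumptions,
    # not the weighting scheme used in the thesis.

    def expand_query(query_terms, thesauri, top_k=10):
        """Add terms that several thesauri agree are related to the query."""
        scores = {}
        for term in query_terms:
            for weight, thesaurus in thesauri:      # e.g. [(0.5, cooccurrence), ...]
                for related in thesaurus.get(term, []):
                    if related not in query_terms:
                        scores[related] = scores.get(related, 0.0) + weight
        expansion = sorted(scores, key=scores.get, reverse=True)[:top_k]
        return list(query_terms) + expansion

    # Toy usage with dictionary-backed "thesauri".
    synonyms = {"car": ["automobile", "vehicle"]}
    cooccur  = {"car": ["engine", "driver"], "insurance": ["policy"]}
    print(expand_query(["car", "insurance"], [(1.0, synonyms), (0.5, cooccur)]))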


TR01-0003 (March)

Title
Making lexical sense of Japanese-English machine translation: A disambiguation extravaganza

Author(s)
Timothy Baldwin

Contact person
Timothy Baldwin (tim@cl.cs.titech.ac.jp)

Abstract
This dissertation is concerned with lexical disambiguation in the context of Japanese-English machine translation, in the form of the tasks of translation retrieval, Japanese relative clause construction (RCC) interpretation and verb sense disambiguation.

In translation retrieval, disambiguation is based on determining, for a given input, the translation in the ``translation memory'' (a database of source/target language string pairs) which will be of maximum practical use in translating the input. Here, we look at the effects of segment (word) order, segmentation and segment contiguity on translation retrieval performance. In extensive experimentation, character-based indexing (where each string is split up into constituent characters) was shown to be superior to word-based indexing (where each string is split up into words with a segmentation module), and bag-of-words methods were found to be roughly equivalent to segment order-sensitive methods. Modelling local segment contiguity in the form of segment N-grams was found to be beneficial in terms of both retrieval speed and accuracy. We also tested a number of static and dynamic segment weighting methods, but found them to have little effect.

For Japanese RCC interpretation, each RCC is described by way of a vector of morphological and shallow semantic features, and supervised learning is used to classify RCC inputs according to a taxonomy aimed at Japanese-English machine translation. We focus particularly on feature selection and construction in an attempt to attain the maximum achievable classification accuracy. Feature selection/construction is carried out by way of backwards sequential search through the feature space, in increasing order of estimated feature relevance, and the impact of each feature on overall classification accuracy is evaluated by way of ``nested cross-validation''.

Finally, verb sense disambiguation was performed over a pattern-based valency dictionary, and linked extensively to both the ``argument status'' (complementhood) and the selectional restriction annotation of each case slot. Rather than simply evaluating the satisfaction of selectional restrictions in a binary fashion, the proposed method returns a score on the quality of the match, including provision for penalised ``backing-off''. Argument status features in the penalisation of the backing-off of selectional restrictions, in the penalisation of the non-alignment of case slots within the case frame, and in the determination of the scope for case marking alternation.
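
For the translation retrieval task described in the first paragraph, the following Python sketch illustrates character-level indexing of a translation memory; the bigram indexing and Dice-style scoring are assumptions for illustration and do not reproduce the exact indexing and weighting configurations compared in the dissertation.

    # Illustrative sketch of character N-gram retrieval from a translation memory.
    # Bigram indexing and the Dice coefficient are assumed here for illustration.

    def char_ngrams(s, n=2):
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def retrieve(source, memory, n=2):
        """Return the (source, target) pair whose source side best matches `source`."""
        query = char_ngrams(source, n)
        def score(entry):
            cand = char_ngrams(entry[0], n)
            if not query or not cand:
                return 0.0
            return 2.0 * len(query & cand) / (len(query) + len(cand))  # Dice coefficient
        return max(memory, key=score)

    memory = [("これは本です", "This is a book."),
              ("これはペンです", "This is a pen.")]
    print(retrieve("これは辞書です", memory))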


TR01-0004 (March)

Title
Research on extended generalized LR method

Author(s)
Yuko Oda

Contact person
Hozumi Tanaka (tanaka@cl.cs.titech.ac.jp)

Abstract
The generalized LR (GLR) method is one of the fast algorithms for parsing natural language, and it is based on context-free grammar (CFG). Several extended notations of CFG have already been proposed so that humans can describe various grammatical constraints easily. This paper proposes a new method to handle extended notations of CFG in the GLR parsing framework.


TR01-0005 (April)

Title
The DeltaUp constraint solver: correctness proof

Author(s)
Tetsuya Suzuki and Takehiro Tokuda

Contact person
Tetsuya Suzuki (tetsuya@tt.cs.titech.ac.jp)

Abstract
The DeltaBlue algorithm is known as a fast incremental constraint solver based on local propagation and is widely used for constructing graphical user interfaces and algorithm animations. The primary task of DeltaBlue's planning phase is method selection. In this paper, we prove that DeltaUp, a modification of DeltaBlue, minimizes the number of method selections in every planning phase.
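
The following Python sketch illustrates local propagation with method selection in its simplest form; it omits constraint hierarchies and incremental re-planning, so it is not the DeltaBlue or DeltaUp algorithm itself, only the kind of planning step the proof is about.

    # Minimal illustration of local propagation with method selection
    # (a drastic simplification: no constraint hierarchy, no incremental
    # re-planning, so this is NOT DeltaBlue/DeltaUp itself).

    # Each constraint is a list of candidate methods; each method names the
    # variable it writes, the variables it reads, and a function computing it.
    constraints = [
        # a + b = c, satisfiable by computing any one of the three variables
        [("c", {"a", "b"}, lambda v: v["a"] + v["b"]),
         ("a", {"b", "c"}, lambda v: v["c"] - v["b"]),
         ("b", {"a", "c"}, lambda v: v["c"] - v["a"])],
    ]

    def plan(constraints, inputs):
        """Greedily select one method per constraint whose inputs are already
        determined and whose output is not yet determined by anything else."""
        determined, selected = set(inputs), []
        for methods in constraints:
            for out, needs, fn in methods:
                if out not in determined and needs <= determined:
                    determined.add(out)
                    selected.append((out, fn))
                    break
        return selected

    def execute(selected, values):
        for out, fn in selected:
            values[out] = fn(values)
        return values

    values = {"a": 2, "b": 3}                            # "a" and "b" act as inputs
    print(execute(plan(constraints, values), values))    # -> {'a': 2, 'b': 3, 'c': 5}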


TR01-0006 (May)

Title
Fast EM learning of a family of PCFGs

Author(s)
Taisuke Sato, Yoshitaka Kameya, Shigeru Abe and Kiyoaki Shirai

Contact person
Taisuke Sato (sato@mi.cs.titech.ac.jp)

Abstract
The primary purpose of this report is to optimize the Inside-Outside algorithm for PCFGs and realize fast EM learning of various elaborations of PCFGs. The key points of our optimization are the introduction of a new data structure called "support graphs", which hierarchically represents the set of parse trees, and a new EM learning algorithm called the "graphical EM (gEM) algorithm" that runs on them. Learning experiments with PCFGs using two corpora with contrasting characteristics, ATR and EDR, showed that the graphical EM algorithm can learn the parameters of PCFGs orders of magnitude faster than the Inside-Outside algorithm. We also experimentally show that the gEM algorithm requires at most about twice as much time to learn the parameters of elaborated PCFGs as of pure PCFGs when ambiguity is not explosive. Hence, learning experiments with a large corpus to develop new types of PCFGs are now much more feasible, thereby indirectly contributing to the solution of the first problem.
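
As background for the comparison above, the Python sketch below shows the inside-probability recurrence for a PCFG in Chomsky normal form, which is the core computation behind the Inside-Outside baseline; the support graphs and the graphical EM algorithm of the report are not reproduced, and the toy grammar is purely illustrative.

    # Inside probabilities for a PCFG in Chomsky normal form (toy grammar).
    # This is the recurrence underlying the Inside-Outside baseline, not the
    # support-graph / graphical EM machinery of the report.
    from collections import defaultdict

    lexical = {("N", "time"): 0.4, ("N", "flies"): 0.2, ("V", "flies"): 1.0}
    binary  = {("S", "N", "V"): 1.0, ("N", "N", "N"): 0.4}   # A -> B C : prob

    def inside(words):
        n = len(words)
        beta = defaultdict(float)            # beta[(i, j, A)] = P(A =>* w_i..w_j)
        for i, w in enumerate(words):
            for (A, word), p in lexical.items():
                if word == w:
                    beta[(i, i, A)] += p
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span - 1
                for k in range(i, j):
                    for (A, B, C), p in binary.items():
                        beta[(i, j, A)] += p * beta[(i, k, B)] * beta[(k + 1, j, C)]
        return beta

    beta = inside(["time", "flies"])
    print(beta[(0, 1, "S")])                 # probability of the sentence under "S"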


TR01-0007 (June)

Title
Automatic reconfiguration of an autonomous disk cluster

Author(s)
Daisuke Ito and Haruo Yokota

Contact person
Haruo Yokota (yokota@cs.titech.ac.jp)

Abstract
Recently, storage-centric configurations, such as network attached storage (NAS) and storage area network (SAN) architectures, have attracted considerable attention in advanced data processing. For these configurations, high scalability, flexibility, and availability are key features, and central control is unsuitable for realizing them. We propose autonomous disks to enable distributed control in storage-centric configurations. Autonomous disks form a cluster in a network and implement the above key features using Event-Condition-Action rules with a distributed directory. The user can change the control strategy by modifying the rules. The combination of rules and the distributed directory also manages load distribution automatically and has the capability to reconfigure the cluster automatically. In this paper, we focus on the cluster-reconfiguration function of the autonomous disks, used to handle disk failures or to modify management strategies. To add disks to or detach disks from a cluster without a special server, autonomous disks use dynamic coordination and a multi-phase synchronization protocol similar to that used in distributed transaction commitment. They allow the usual operations to be accepted during reconfiguration. We are now implementing an experimental system using Java on PCs to demonstrate the effectiveness of the approach. The results of preliminary experiments indicate that the synchronization cost is acceptable.
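
The Python sketch below illustrates the Event-Condition-Action rule style mentioned above; the events, conditions and actions are invented placeholders, not the actual rule set or distributed-directory operations of the autonomous disks.

    # Illustrative Event-Condition-Action (ECA) rule dispatcher.  The events,
    # conditions and actions are hypothetical placeholders.

    def on_disk_failure(disk):    # hypothetical action
        print(f"re-replicating data held by {disk} onto the remaining disks")

    def on_high_load(disk):       # hypothetical action
        print(f"migrating part of {disk}'s directory entries to lighter disks")

    rules = [
        # (event name, condition on the event payload, action)
        ("heartbeat_lost", lambda e: e["missed"] >= 3, lambda e: on_disk_failure(e["disk"])),
        ("load_report",    lambda e: e["load"] > 0.9,  lambda e: on_high_load(e["disk"])),
    ]

    def dispatch(event_name, payload):
        for name, condition, action in rules:
            if name == event_name and condition(payload):
                action(payload)

    dispatch("heartbeat_lost", {"disk": "disk-3", "missed": 3})
    dispatch("load_report",    {"disk": "disk-1", "load": 0.95})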


TR01-0008 (June)

Title
A theory of pseudoframes for subspaces with applications

Author(s)
Shidong Li and Hidemitsu Ogawa

Contact person
Hidemitsu Ogawa (ogawa@og.cs.titech.ac.jp)

Abstract
We define and characterize a frame-like stable decomposition for subspaces of a general separable Hilbert space. We call it pseudoframes for subspaces (PFFS). Properties of PFFS are discussed. A necessary and sufficient characterization of PFFSs is provided. Analytical formulae for the construction of PFFSs are derived. Examples and applications of PFFSs are discussed.
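
In rough LaTeX notation, the frame-like decomposition referred to above can be paraphrased as follows (a paraphrase, not the paper's exact definition): sequences \{x_n\} and \{x_n^*\} in H act as a pseudoframe for a closed subspace X of H when every x \in X admits the stable expansion

    x = \sum_{n} \langle x, x_n^{*} \rangle \, x_n ,

where, unlike for an ordinary frame for X, the elements x_n and x_n^{*} themselves need not belong to X.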


TR01-0009 (July)

Title
Experimental evaluation of EM learning for PCFGs and their extensions

Author(s)
Takashi Mori, Yoshitaka Kameya and Taisuke Sato

Contact person
Taisuke Sato (sato@cs.titech.ac.jp)

Abstract
Probabilistic context-free grammars (PCFGs) are widely known as statistical language models. Recently, Kameya et al. proposed a new framework for efficient EM learning of PCFGs and their extensions using the graphical EM (gEM) algorithm, assuming the underlying CFG is given. They experimentally showed that the gEM algorithm for PCFGs significantly outperforms the Inside-Outside algorithm. In this paper, we give two experimental results about EM learning, in terms of efficiency and accuracy. First, we show that, with the ATR corpus and a hand-crafted Japanese grammar, the gEM algorithm for several extensions of PCFGs achieves almost the same efficiency as that for plain PCFGs. Second, we measure the accuracy of PCFGs and their extensions learned by the gEM algorithm. Although it is common wisdom that learning from unannotated data with the EM algorithm yields poor results, by using a grammar with low ambiguity we achieved high accuracy, comparable to that of models learned by counting. Our results therefore show the usefulness of the EM algorithm for PCFGs and their extensions.


TR01-0010 (August)

Title
Exploitation of existing databases for text retrieval from MEDLINE

Author(s)
Tuan Nam Tran and Masayuki Numao

Contact person
Tuan Nam Tran (tt-nam@nm.cs.titech.ac.jp)

Abstract
Biomedical literature databases such as MEDLINE are a valuable source of information for biology research tasks such as the classification of proteins into functional groups, the extraction of protein-protein interaction facts, and the discovery of new functional relationships. For this reason, it is important to retrieve, with high precision, the MEDLINE abstracts required for such research. We apply text classification techniques to this problem by exploiting an existing protein database. The results show that we can obtain text retrieval results with high precision compared with MEDLINE's current retrieval system.
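
A minimal Python sketch of the retrieval-as-classification idea, using scikit-learn; the classifier choice, features and toy data are assumptions for illustration, not the setup evaluated in the report.

    # Illustrative sketch of retrieving MEDLINE abstracts via text classification.
    # The classifier, features and toy data are assumptions, not the report's setup.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Positive examples could come from abstracts already linked to entries of an
    # existing protein database; negatives from unrelated MEDLINE records.
    train_texts  = ["yeast protein kinase interacts with ...",
                    "clinical trial of a new analgesic ..."]
    train_labels = [1, 0]                     # 1 = relevant to the protein task

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_texts)
    clf = LogisticRegression().fit(X, train_labels)

    # Rank unseen abstracts by predicted relevance and keep the top ones.
    new_texts = ["two-hybrid screen reveals protein-protein interaction ..."]
    scores = clf.predict_proba(vectorizer.transform(new_texts))[:, 1]
    print(sorted(zip(scores, new_texts), reverse=True))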


TR01-0011 (August)

Title
Text mining from biomedical literature using term extraction and association rule mining

Author(s)
Tuan Nam Tran and Masayuki Numao

Contact person
Tuan Nam Tran (tt-nam@nm.cs.titech.ac.jp)

Abstract
Recently there has been a great deal of research aimed at exploiting the rich information in biological literature databases. Much of this research has focused on using Natural Language Processing techniques to extract information from biological literature sources such as MEDLINE. However, there has been little research that applies text data mining techniques for discovering novel knowledge. In this paper, we apply an association rule-based algorithm to the task of text data mining from MEDLINE. By combining the association rule-based algorithm with Natural Language Processing techniques such as term extraction, we have obtained interesting results on a set of MEDLINE records pertaining to S. cerevisiae. Moreover, the obtained association rules may be useful for consistency checking and error detection in the annotation of MeSH terms in MEDLINE records.
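
The Python sketch below illustrates support/confidence-based association rule mining over extracted terms; the records, thresholds and restriction to pairwise rules are simplifications for illustration, not the configuration used in the paper.

    # Minimal association-rule sketch over extracted terms (pairwise rules only).
    from itertools import combinations
    from collections import Counter

    # Each "transaction" is the set of terms extracted from one MEDLINE record.
    records = [{"cell cycle", "CDC28", "kinase"},
               {"cell cycle", "CDC28"},
               {"kinase", "phosphorylation"},
               {"cell cycle", "CDC28", "phosphorylation"}]

    min_support, min_confidence = 0.5, 0.8
    n = len(records)

    item_count = Counter(t for r in records for t in r)
    pair_count = Counter(frozenset(p) for r in records for p in combinations(sorted(r), 2))

    for pair, cnt in pair_count.items():
        if cnt / n >= min_support:
            a, b = tuple(pair)
            for lhs, rhs in ((a, b), (b, a)):
                confidence = cnt / item_count[lhs]
                if confidence >= min_confidence:
                    print(f"{lhs} => {rhs}  (support {cnt/n:.2f}, confidence {confidence:.2f})")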


TR01-0012 (August)

Title
Active learning with model selection --- Simultaneous optimization of sample points and models for trigonometric polynomial models

Author(s)
Masashi Sugiyama and Hidemitsu Ogawa

Contact person
Masashi Sugiyama (sugi@og.cs.titech.ac.jp)

Abstract
In supervised learning, the selection of sample points and models is crucial for acquiring a higher level of generalization capability. So far, the problems of active learning and model selection have been studied independently. If sample points and models are optimized simultaneously, a higher level of generalization capability can be expected. We call this problem active learning with model selection. However, it cannot generally be solved by simply combining existing active learning and model selection techniques because of the active learning / model selection dilemma: the model should be fixed for selecting sample points and, conversely, the sample points should be fixed for selecting models. In spite of this dilemma, the problem of active learning with model selection can be solved straightforwardly if there is a set of sample points that is optimal for all models under consideration. Based on this idea, we give a procedure for active learning with model selection in trigonometric polynomial models. The effectiveness of the proposed procedure is demonstrated through computer simulations.
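
For reference, the model family named in the title can be written (in generic LaTeX notation, not necessarily the paper's symbols) as the trigonometric polynomials of order n,

    f_n(x) = a_0 + \sum_{m=1}^{n} \bigl( a_m \cos(mx) + b_m \sin(mx) \bigr),

so that model selection amounts to choosing the order n while active learning chooses the sample points; the procedure above exploits sample points that are optimal for every candidate order simultaneously.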


TR01-0013 (August)

Title
Optimal design of regularization term and regularization parameter by subspace information criterion

Author(s)
Masashi Sugiyama and Hidemitsu Ogawa

Contact person
Masashi Sugiyama (sugi@og.cs.titech.ac.jp)

Abstract
The problem of designing the regularization term and regularization parameter for linear regression models is discussed. We derive an approximation to the generalization error called the subspace information criterion (SIC) for regularization learning. SIC gives an unbiased estimate of the generalization error with finite samples under certain conditions. SIC is used for (a) choosing the optimal regularization term and regularization parameter from given candidates, and (b) obtaining the closed form of the optimal regularization parameter for a fixed regularization term. The effectiveness of SIC is demonstrated through computer simulations with artificial and real data.
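
For concreteness, one common form of regularization learning for linear regression, written in LaTeX (the report's exact formulation and symbols may differ), is

    \hat{\theta} = \arg\min_{\theta} \, \| A\theta - y \|^{2} + \lambda\, \theta^{\top} T \theta
                 = (A^{\top} A + \lambda T)^{-1} A^{\top} y ,

where A is the design matrix, y the vector of training targets, T a positive semi-definite matrix defining the regularization term, and \lambda \ge 0 the regularization parameter; in this setting SIC serves to choose T and \lambda from candidates, or to give \lambda in closed form for a fixed T.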


TR01-0014 (August)

Title
Optimization of linear image restoration filters by subspace information criterion

Author(s)
Masashi Sugiyama and Hidemitsu Ogawa

Contact person
Masashi Sugiyama (sugi@og.cs.titech.ac.jp)

Abstract
Image restoration from degraded images lies at the foundation of image processing, pattern recognition, and computer vision, so it has been extensively studied. A large number of image restoration filters have been devised so far. It is known that a certain filter works excellently for a certain type of original image or degradation. However, the same filter may not be suitable for other images, so the selection of filters is exceedingly important in practice. Moreover, if a filter includes adjustable parameters such as the regularization parameter or threshold, its restoration performance relies heavily on the choice of the parameter values. In this paper, we therefore propose a unified method for determining the filter type and parameter values. The proposed method is based on the subspace information criterion (SIC), which is an unbiased estimator of the expected squared error for any linear filter. Thanks to the useful property of SIC, the proposed method allows one to optimize the filter type and parameter values in a consistent fashion. Furthermore, we apply SIC to the moving-average filter and derive an analytic form of the optimal parameter values. Computer simulations demonstrate the effectiveness of the proposed method.
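
An illustrative Python sketch of a moving-average filter with a tunable window size; real images are two-dimensional and the report selects the parameter with SIC, whereas here a plain reconstruction-error comparison on synthetic one-dimensional data stands in for that.

    # 1-D moving-average restoration with a tunable window size (illustration only;
    # the report works on images and selects the parameter with SIC).
    import numpy as np

    def moving_average(signal, window):
        kernel = np.ones(window) / window
        return np.convolve(signal, kernel, mode="same")

    rng = np.random.default_rng(0)
    clean = np.sin(np.linspace(0, 4 * np.pi, 200))
    noisy = clean + 0.3 * rng.standard_normal(clean.shape)

    # Try several window sizes and keep the one with the smallest squared error
    # against the clean signal (possible here only because the data is synthetic).
    errors = {w: np.mean((moving_average(noisy, w) - clean) ** 2) for w in (3, 5, 9, 15)}
    best = min(errors, key=errors.get)
    print(best, errors[best])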


TR01-0015 (August)

Title
Framework of timed trace theoretic verification revisited

Author(s)
Bin Zhou, Tomohiro Yoneda and Chris Myers

Contact person
Tomohiro Yoneda (yoneda@cs.titech.ac.jp)

Abstract
This paper develops a framework to support trace theoretic verification of timed circuits and systems. A theoretical foundation for classifying timed traces as either successes or failures is developed. The concept of the semimirror is introduced to allow conformance checking, thus supporting hierarchical verification of timed circuits and systems. Finally, we relate our framework to those previously proposed for timing verification.


TR01-0016 (December)

Title
Subspace information criterion for infinite dimensional hypothesis spaces

Author(s)
Masashi Sugiyama and Klaus-Robert Mueller

Contact person
Masashi Sugiyama (sugi@og.cs.titech.ac.jp)

Abstract
A central problem in learning is to select an appropriate model. This is typically done by estimating the unknown generalization errors of a set of models to be selected from and by then choosing the model with minimal generalization error estimate. In this article, we discuss the problem of model selection and generalization error estimation in the context of kernel regression models, e.g., kernel ridge regression, kernel subset regression or Gaussian process regression.
Previously, a non-asymptotic generalization error estimator called the subspace information criterion (SIC) was proposed, which could be successfully applied to finite dimensional subspace models. SIC is an unbiased estimator of the generalization error for the finite sample case under the condition that the learning target function belongs to a specified reproducing kernel Hilbert space (RKHS) H with dim H less than the number M of training samples.
In this paper, we extend the range of applicability of SIC and show that, even if dim H > M, SIC is an unbiased estimator of an essential part of the generalization error. Our extension allows us to make use of infinite dimensional RKHSs, i.e., the richer function classes commonly used in Gaussian processes, support vector machines or boosting. We further show that when dim H > M, SIC can be expressed in a much simpler form, making its computation highly efficient. In computer simulations on ridge parameter selection with real and artificial data sets, SIC compares favorably with other standard model selection techniques such as leave-one-out cross-validation and an empirical Bayesian method.
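
A minimal Python sketch of kernel ridge regression, one of the kernel regression models mentioned above; the Gaussian kernel, toy data and fixed ridge parameter are illustrative, and SIC itself is not implemented here.

    # Minimal kernel ridge regression sketch (illustrative; SIC is not implemented).
    import numpy as np

    def gaussian_kernel(X1, X2, width=1.0):
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * width ** 2))

    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(30, 1))
    y = np.sinc(X[:, 0]) + 0.1 * rng.standard_normal(30)

    lam = 0.1                                  # ridge parameter to be selected (e.g. by SIC)
    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # f(x) = sum_i alpha_i k(x, x_i)

    X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
    print(gaussian_kernel(X_test, X) @ alpha)              # predictions at test points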