TR98-0001 (January)

Title
MT R&D in Asia

Author(s)
Hozumi Tanaka and Virach Sornlertlamvanich

Contact person
Hozumi Tanaka (tanaka@cs.titech.ac.jp)

Abstract
MT R&D in the Asian region has shifted considerably as a result of the many large-scale projects conducted over the past ten years. The Multi-lingual Machine Translation (MMT) project is a significant R&D project that has increased both the number of NLP-related researchers and the quantity of research activity undertaken by research institutes in recent years. The project has provided considerable insight through collaborative research across languages, and it is hoped that it will catalyze future MT R&D in the region. Though MT systems are still far from the ultimate goal of perfect translation, many are already being applied to support information retrieval from the Internet.


TR98-0002 (February)

Title
Large-scale Document Clustering for Associative Document Search

Author(s)
Makoto Iwayama and Takenobu Tokunaga

Contact person
Takenobu Tokunaga (take@cs.titech.ac.jp)

Abstract
Approximate algorithms for clustering large-scale document collections are proposed and evaluated in the context of cluster-based document retrieval (i.e., associative document search). These algorithms use a precise clustering algorithm as a subroutine to construct a stratified structure of cluster trees. An experiment showed a speedup in CPU time of more than 100 times at best. Through experiments on self retrieval and topic assignment, we confirmed sufficient search performance on cluster trees constructed by the approximate algorithms. In particular, top-down construction achieved over 99% self-retrieval accuracy, performance comparable to exhaustive search. Top-down construction also offered promising performance in topic assignment, that is, better recall/precision than that obtained by exhaustive search. All of the results for cluster-based retrieval were obtained by simple and efficient binary tree search.
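
A minimal sketch of the general scheme described above, assuming a two-way k-means as the "precise" base clusterer and inner-product similarity over document vectors; the report's actual algorithms and parameters may differ.

    # Hypothetical sketch: top-down construction of a cluster tree using a
    # precise clusterer as a subroutine, then binary-tree-style search.
    import numpy as np

    def base_cluster(docs, vectors, k=2, iters=10):
        # Stand-in "precise" clusterer: plain k-means on this subset.
        pts = vectors[docs]
        centres = pts[np.random.choice(len(pts), k, replace=False)]
        for _ in range(iters):
            assign = np.argmax(pts @ centres.T, axis=1)
            for j in range(k):
                if np.any(assign == j):
                    centres[j] = pts[assign == j].mean(axis=0)
        return [[d for d, a in zip(docs, assign) if a == j] for j in range(k)]

    class Node:
        def __init__(self, docs, centroid):
            self.docs, self.centroid, self.children = docs, centroid, []

    def build_tree(docs, vectors, leaf_size=10):
        node = Node(docs, vectors[docs].mean(axis=0))
        if len(docs) > leaf_size:
            groups = [g for g in base_cluster(docs, vectors) if g]
            if len(groups) > 1:            # guard against degenerate splits
                node.children = [build_tree(g, vectors, leaf_size) for g in groups]
        return node

    def search(node, query):
        # Follow the most similar child at each level (binary tree search).
        while node.children:
            node = max(node.children, key=lambda c: float(query @ c.centroid))
        return node.docs                   # candidates for associative search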


TR98-0003 (March)

Title
Corpus-based word sense disambiguation

Author(s)
Atsushi Fujii

Contact person
Atsushi Fujii (fujii@cs.titech.ac.jp)

Abstract
Resolution of lexical ambiguity, commonly termed ``word sense disambiguation'', is expected to improve analytical accuracy for tasks which are sensitive to lexical semantics. Such tasks include machine translation, information retrieval, parsing, natural language understanding and lexicography. Reflecting the growth in the utilization of machine readable texts, word sense disambiguation techniques have been explored variously in the context of corpus-based approaches. In one corpus-based framework, the similarity-based method, systems use a database in which example sentences are manually annotated with correct word senses. Given an input, the system searches the database for the example most similar to the input. The lexical ambiguity of a word contained in the input is resolved by selecting the sense annotation of the retrieved example.
In this research, we apply this method to the resolution of verbal polysemy, in which the similarity between two examples is computed as the weighted average of the similarities between the complements governed by a target polysemous verb. We explore similarity-based verb sense disambiguation focusing on the following three methods. First, we propose a weighting scheme for each verb complement in the similarity computation. Second, in similarity-based techniques the overhead for manual supervision and for searching a large database can be prohibitive. To resolve this problem, we propose a method to select a small number of effective examples for system usage. Finally, the efficiency of our system depends heavily on the similarity computation used. To maximize efficiency, we propose a method which integrates the advantages of previous methods for similarity computation.
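
The following minimal sketch illustrates the similarity-based disambiguation step only; the weighting scheme, example selection, and similarity computation are this research's contributions and are merely stubbed here. Names such as word_sim and weights are assumptions.

    # Hypothetical sketch of similarity-based verb sense disambiguation.
    # `word_sim` and `weights` are placeholders for a thesaurus-based word
    # similarity and the per-complement weighting scheme discussed above.

    def example_similarity(inp, example, weights, word_sim):
        # Weighted average of complement similarities over shared case slots.
        shared = set(inp) & set(example)
        if not shared:
            return 0.0
        total = sum(weights[s] * word_sim(inp[s], example[s]) for s in shared)
        return total / sum(weights[s] for s in shared)

    def disambiguate(inp, database, weights, word_sim):
        # Return the sense annotation of the most similar stored example.
        comps, sense = max(database,
                           key=lambda ex: example_similarity(inp, ex[0],
                                                             weights, word_sim))
        return sense

    # e.g. a database entry: ({"ga": "kare", "o": "pan"}, "eat/ingest")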


TR98-0004 (March)

Title
Stochastic-based Integrated Natural Language Analysis (in Japanese)

Author(s)
Kiyoaki Shirai

Contact person
Kiyoaki Shirai (kshirai@cs.titech.ac.jp)

Abstract
This paper describes several methods for overcoming problems in natural language analysis using statistics learned from language resources.
Disambiguation is one of the major problems in analyzing sentences. This paper proposes a new framework of statistical language modeling that integrates various statistics for disambiguation. The model consists of three submodels. The first is a syntactic model, which reflects syntactic statistics such as structural preferences. The second is a lexical model, which reflects lexical statistics such as the occurrence of each word and word collocations. The third is a semantic model, which reflects statistics of word sense. One significant characteristic of this model is that it learns each type of statistics separately, whereas many previous models have learned them simultaneously. Learning each submodel separately enables us to use different language resources for the different submodels, and makes each submodel's behavior much easier to understand. For experimental evaluation, the model was applied to the disambiguation of dependency structures in Japanese sentences. The experiments showed that, in the proposed framework, the contribution of lexical statistics to disambiguation was as great as that of syntactic statistics.
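
As a hedged sketch of the integration step only: assuming the three submodels are combined multiplicatively (log-additively) over a candidate analysis, the ranking looks like the following; the report's exact factorization may differ.

    # Hypothetical sketch of scoring under the integrated model.
    # Submodel probabilities are assumed already smoothed (non-zero).
    import math

    def score(candidate, syntactic_p, lexical_p, semantic_p):
        # Log-probability of a candidate analysis under three submodels.
        return (math.log(syntactic_p(candidate))
                + math.log(lexical_p(candidate))
                + math.log(semantic_p(candidate)))

    def disambiguate(candidates, syntactic_p, lexical_p, semantic_p):
        # Pick the analysis ranked highest by the integrated model.
        return max(candidates,
                   key=lambda c: score(c, syntactic_p, lexical_p, semantic_p))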
In learning a language model, smoothing methods are generally applied to overcome the data sparseness problem. Maximum entropy methods, which estimate a probabilistic model from training data, are suitable for smoothing in natural language analysis, but their computational cost is prohibitively large. This paper aims at suppressing this overhead by avoiding repetition of the same calculations in the estimation process, and by proposing a new method of selecting useful features for estimating the probabilistic model. The probabilistic model estimated by the proposed methods was as good as that estimated by existing methods, with much less computing time.
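
For reference, a maximum entropy model over contexts $x$ and outcomes $y$ has the standard log-linear form (general background, not notation quoted from this report):

    p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),
    \qquad
    Z(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big),

where the $f_i$ are (typically binary) features and the parameters $\lambda_i$ are fitted iteratively so that the model's expected feature counts match those observed in the training data. Both proposals above attack this estimation loop: caching avoids recomputing the inner sums, and feature selection keeps the set of $f_i$ small.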
In order to analyze various types of sentences, a context-free grammar covering a broad range of linguistic phenomena is required. This paper describes a new method of acquiring a grammar from bracketed corpora. The characteristic of this method is that it suppresses the computational cost of grammar acquisition by relying on linguistically based heuristics. The acceptance rate of the grammar acquired using the proposed method was 92.4%, demonstrating its broad coverage.
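
As an illustration of the basic extraction step only, assuming fully labelled brackets: context-free productions can be read directly off a bracketed tree as below. The report's linguistically based heuristics for keeping acquisition cheap are not reproduced here.

    # Hypothetical sketch of reading CFG rules off a labelled bracketed
    # corpus. A tree is a (label, children) pair; a leaf is a plain string.

    def extract_rules(tree, rules=None):
        # Collect the productions appearing in one bracketed tree.
        if rules is None:
            rules = set()
        label, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rules.add((label, rhs))
        for c in children:
            if not isinstance(c, str):
                extract_rules(c, rules)
        return rules

    # e.g. extract_rules(("S", [("NP", ["taroo"]), ("VP", ["hashiru"])]))
    # -> {("S", ("NP", "VP")), ("NP", ("taroo",)), ("VP", ("hashiru",))}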


TR98-0005 (August)

Title
Probabilistic Language Modeling for Generalized LR Parsing

Author(s)
Virach Sornlertlamvanich

Contact person
Virach Sornlertlamvanich (virach@cs.titech.ac.jp)

Abstract
In this thesis, we introduce probabilistic models to rank the likelihood of resultant parses within the GLR parsing framework. Probabilistic models can also reduce the search space, provided they supply prefix probabilities for partial parses. In devising the models, we carefully observe the nature of GLR parsing, one of the most efficient parsing algorithms in existence, and formalize two probabilistic models that make appropriate use of the parsing context. The context in GLR parsing is provided by the constraints afforded by the context-free grammar in generating an LR table (global context), and the constraints of adjoining pre-terminal symbols (local n-gram context).
In this research, we first conduct both model analyses and quantitative evaluation on the ATR Japanese corpus to evaluate the performance of the probabilistic models. Ambiguities arising from multiple word segmentation and part-of-speech candidates in parsing a non-segmenting language are taken into consideration. We demonstrate the effectiveness of combining contextual information to determine word sequences in the word segmentation process, assign parts-of-speech to words in the tagging process, and choose between possible constituent structures, in single-pass morphosyntactic parsing. Second, we show by empirical evaluation that the performance of the probabilistic GLR parsing model (PGLR) using an LALR table is in no way inferior to that using a CLR table, despite the states in a CLR table providing more precise context than those in an LALR table. Third, we propose a new node-driven parse pruning algorithm based on the prefix probability of PGLR, which is effective in beam-search-style parsing. The pruning threshold is estimated from the number of state nodes up to the current parsing stage. The algorithm yields significant reductions in both parsing time and computational resources. Finally, a further PGLR model is formalized which overcomes some problematic issues by increasing the context used in parsing.
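
As an illustration of the pruning step only, here is a hedged sketch of beam pruning driven by prefix probabilities. The node-driven algorithm above estimates its threshold from the number of state nodes at the current stage; that estimation is simplified here to a fixed beam ratio.

    # Hypothetical sketch: discard partial parses whose PGLR prefix
    # probability falls too far below the best one at the current stage.
    def prune(stage, beam_ratio=1e-3):
        # stage: list of (prefix_probability, stack_node) pairs, where each
        # stack_node is a node of the GLR graph-structured stack.
        best = max(p for p, _ in stage)
        return [(p, node) for p, node in stage if p >= best * beam_ratio]

Pruning after each shift keeps both parsing time and stack space bounded, at the cost of occasionally discarding the eventual best parse if the ratio is too aggressive.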


TR98-0006 (December)

Title
EPBOBs (Extended Pseudo Biorthogonal Bases) for Signal Recovery

Author(s)
Hidemitsu Ogawa and Nasr-Eddine Berrached

Contact person
Hidemitsu Ogawa (ogawa@cs.titech.ac.jp)

Abstract
This paper deals with the problem of recovering a signal from a noisy version of it; one example is restoring old images degraded by noise. The recovery solution is given within the framework of series expansion, and we show that in the general case the recovery functions have to be elements of an extended pseudo biorthogonal basis (EPBOB) in order to efficiently suppress the corrupting noise. After discussing the different noise situations, we provide methods for constructing the optimal EPBOB to deal with each of them.
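
As background for the series-expansion framework (a standard setting, assumed here rather than quoted from the report): if the observed signal is $g = f + n$ and the recovery is taken in the form

    \hat{f} = \sum_{k} \langle g, \psi_k \rangle \varphi_k
            = \sum_{k} \langle f, \psi_k \rangle \varphi_k
            + \sum_{k} \langle n, \psi_k \rangle \varphi_k,

then the pair $\{\varphi_k\}, \{\psi_k\}$ controls two things at once: the first sum should reproduce $f$ (a biorthogonal pair, $\langle \varphi_j, \psi_k \rangle = \delta_{jk}$, does so for $f$ in the spanned space), while the second sum is the propagated noise. EPBOBs relax the exact biorthogonality conditions so that the recovery functions $\psi_k$ can also be chosen to suppress this noise term, which is the optimization carried out in the paper.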


TR98-0007 (December)

Title
The Family of Projection Learnings

Author(s)
Akira Hirabayashi and Hidemitsu Ogawa

Contact person
Akira Hirabayashi (hira@cs.titech.ac.jp)

Abstract
For feed-forward neural networks, projection learning, partial projection learning, and averaged projection learning have been proposed as means of obtaining good generalization ability. The collection of learning methods that involve projections of the original function, including these three, is called {\em the family of projection learnings}. We propose a new and natural definition of the family of projection learnings which, unlike previous ones, has a concrete and clear physical meaning. Comparison studies show that the previous definitions, though lacking physical meaning, also describe the complete family of projection learnings and are equivalent to the new definition.
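
This line of work is usually phrased in operator notation; a minimal sketch of that setting, stated as assumed background rather than as a definition taken from this report:

    y = A f + n, \qquad \hat{f} = X y,

where $f$ is the true function in a Hilbert space $H$, $A$ is the sampling operator mapping $f$ to the vector $y$ of training sample values, $n$ is noise, and $X$ is the learning operator to be designed. Roughly speaking, a learning method belongs to the projection family when, in the noise-free case, the composite operator $XA$ acts as a projection of $f$ onto some subspace of $H$, so that $\hat{f} = XAf$ is a projected version of the original function.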


TR98-0008 (December)

Title
Properties of the Family of Projection Learnings

Author(s)
Akira Hirabayashi and Hidemitsu Ogawa

Contact person
Akira Hirabayashi (hira@cs.titech.ac.jp)

Abstract
In the previous paper we proposed a new and natural definition of the family of projection learnings. Based on the new definition, in this paper we derive a general form of projection learning operators. Properties of these projection learnings, such as noise suppression capability, are analyzed. We also show that the physical interpretation of the projection learnings is directly reflected in the general form of the learning operators. General forms of projection learning, partial projection learning, and averaged projection learning are also obtained; these differ from the general forms derived previously.


TR98-0009 (December)

Title
Applicability of Memorization Learning to the Family of Projection Learnings

Author(s)
Akira Hirabayashi and Hidemitsu Ogawa

Contact person
Akira Hirabayashi (hira@cs.titech.ac.jp)

Abstract
One of the major goals of learning in feed-forward neural networks is to obtain good generalization capability. Most learning methods, however, can be classified as memorization learning, which reduces training error only. Memorization learning does not guarantee good generalization capability and is known to cause over-learning. On the other hand, this learning method is useful from an engineering viewpoint because of its ease of use. In this paper, we discuss to what extent memorization learning can be utilized to realize projection learnings, and conclude that it can be applied to the entire family of projection learning methods when proper training data is provided.
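
In the operator notation sketched after TR98-0007 above (again background rather than a quotation), memorization learning chooses the learning result $\hat{f}$ so as to reduce only the training error

    J_M(\hat{f}) = \sum_{i=1}^{m} \big( \hat{f}(x_i) - y_i \big)^2,

which constrains $\hat{f}$ only at the sample points $x_i$, whereas a projection learning constrains the whole operator $XA$. The question addressed here is, roughly, when minimizing $J_M$ over suitably chosen training data also realizes the desired projection.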


TR98-0010 (December)

Title
Applicability of Memorization Learning to the Family of Projection Learnings in the Presence of Noise

Author(s)
Akira Hirabayashi and Hidemitsu Ogawa

Contact person
Akira Hirabayashi (hira@cs.titech.ac.jp)

Abstract
In feed-forward neural networks, most learning methods are memorization learning, which only reduces training error. It is used frequently because, first, memorization learning does not need much a priori knowledge and, second, its ease of use makes it valuable from an engineering viewpoint. The authors have already shown that, in the absence of noise, memorization learning can achieve the same generalization ability as the entire family of projection learnings when proper training data is provided. In this paper, we discuss the case where the training data contains noise. We show that, depending on the properties of the noise, in some cases memorization learning can be applied to the entire family of projection learnings, while in other cases it can be applied to only one projection learning.


TR98-0011 (December)

Title
Incremental Projection Learning for Optimal Generalization

Author(s)
Masashi Sugiyama and Hidemitsu Ogawa

Contact person
Masashi Sugiyama (sugi@cs.titech.ac.jp)

Abstract
When new training data is added after the learning process has been completed, incremental learning, in which the posterior result is built upon the prior result, is generally preferred because it is computationally efficient. In this paper, we present an incremental projection learning method for use in the presence of noise. The memory and computational complexity required for incremental projection learning are far less than those required for batch projection learning, while the incremental method provides exactly the same result as batch projection learning. Moreover, we derive a condition for identifying redundant training data that has no effect on the generalization ability, which makes the computation still more efficient.
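
The incremental-equals-batch idea can be illustrated in the familiar linear least-squares setting (this is an analogy, not the report's projection learning update equations): each new sample updates the running solution by a rank-one correction, yet the result always equals the batch (ridge-regularized) solution over all samples seen so far.

    # Hypothetical illustration: recursive least squares via the
    # Sherman-Morrison rank-one update.
    import numpy as np

    class IncrementalLS:
        def __init__(self, dim, reg=1e-8):
            self.P = np.eye(dim) / reg   # running inverse of (X^T X + reg*I)
            self.w = np.zeros(dim)       # running solution

        def add(self, x, y):
            # Fold one new sample (x, y) into the solution.
            Px = self.P @ x
            gain = Px / (1.0 + x @ Px)
            self.w += gain * (y - x @ self.w)
            self.P -= np.outer(gain, Px)

After any number of add() calls, self.w matches the batch solution (X^T X + reg*I)^{-1} X^T y over the same samples, while only a dim-by-dim matrix is kept in memory. Detecting that a new sample leaves self.w unchanged is the analogue of the redundancy condition mentioned above.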


TR98-0012 (December)

Title
Active Learning for Noise Suppression

Author(s)
Masashi Sugiyama and Hidemitsu Ogawa

Contact person
Masashi Sugiyama (sugi@cs.titech.ac.jp)

Abstract
If we choose training data that provides the optimal generalization ability, learning can be carried out effectively. In the presence of noise, it is also important to suppress the influence of that noise. In this paper, we give a training data selection method that minimizes the noise variance.


TR98-0013 (December)

Title
Training Data Selection for Optimal Generalization in a Trigonometric Polynomial Model

Author(s)
Masashi Sugiyama and Hidemitsu Ogawa

Contact person
Masashi Sugiyama (sugi@cs.titech.ac.jp)

Abstract
A necessary and sufficient condition for a set of training data to provide the optimal generalization capability in a trigonometric polynomial model is derived. In addition to providing the optimal generalization capability, training sets which satisfy the condition also reduce the memory usage and computational complexity required for learning. Since there are infinitely many training sets which satisfy the condition, a selection method for training sets which further reduce both memory usage and computational complexity is presented. Finally, the effectiveness of the proposed method is confirmed through computer simulations.
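
For reference, by a trigonometric polynomial model of order $N$ we mean the standard family (a background definition, assumed rather than quoted from the report):

    f(x) = a_0 + \sum_{n=1}^{N} \big( a_n \cos nx + b_n \sin nx \big),

which has $2N + 1$ free parameters. The derived condition characterizes which sets of sample points $\{x_i\}$ allow a learning method to attain the optimal generalization over this family.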


TR98-0014 (December)

Title
A Theory of Pseudoframes for Subspaces with Applications

Author(s)
Shidong Li and Hidemitsu Ogawa

Contact person
Hidemitsu Ogawa (ogawa@cs.titech.ac.jp)

Abstract
We define and characterize a frame-like stable decomposition for subspaces of a general separable Hilbert space, which we call a {\it pseudoframe for subspaces} (PFFS). Properties of PFFSs are discussed. A necessary and sufficient characterization of PFFSs is provided. Analytical formulae for the construction of PFFSs are derived. An example of a PFFS for a band-limited subspace is constructed. Potential applications of PFFSs are discussed.
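
A hedged sketch of the reconstruction property behind such a decomposition, consistent with the frame-like decomposition mentioned above (the precise definition and conditions are in the report): two sequences $\{x_n\}$ and $\{\tilde{x}_n\}$ in a Hilbert space $H$ act as a pseudoframe for a closed subspace $X \subset H$ when

    f = \sum_n \langle f, x_n \rangle \, \tilde{x}_n \qquad \text{for every } f \in X.

The point is that, unlike an ordinary frame for $X$, neither sequence is required to lie in $X$ itself, and this extra freedom is what the constructions and applications exploit.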