TR98-0001 (January)

Title
MT R&D in Asia

Author(s)
Hozumi Tanaka and Virach Sornlertlamvanich

Contact person
Hozumi Tanaka (tanaka@cs.titech.ac.jp)

Abstract
MT R&D in the Asian region has shifted considerably as a result of the many large-scale projects conducted over the past ten years. The Multi-lingual Machine Translation (MMT) project is a significant R&D project that has increased both the number of NLP-related researchers and the quantity of research activity undertaken by research institutes in recent years. The project has provided considerable insight through collaborative research across languages, and it is hoped that it will catalyze future MT R&D in the region. Though MT systems are still far from the ultimate goal of perfect translation, many are already being applied to support information retrieval from the Internet.


TR98-0002 (February)

Title
Large-scale Document Clustering for Associative Document Search

Author(s)
Makoto Iwayama and Takenobu Tokunaga

Contact person
Takenobu Tokunaga (take@cs.titech.ac.jp)

Abstract
Approximate algorithms for clustering large-scale document collections are proposed and evaluated in the context of cluster-based document retrieval (i.e., associative document search). These algorithms use a precise clustering algorithm as a subroutine to construct a stratified structure of cluster trees. An experiment showed a speedup in CPU time of more than 100 times at best. Through experiments on self retrieval and topic assignment, we confirmed sufficient search performance on cluster trees constructed by the approximate algorithms. In particular, top-down construction achieved over 99% self-retrieval accuracy, performance comparable to exhaustive search. Top-down construction also offered promising performance in topic assignment, that is, better recall/precision than that obtained by exhaustive search. All of the results for cluster-based retrieval were obtained by simple and efficient binary tree search.
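
A minimal sketch of the general scheme described above, assuming a two-way k-means as the "precise" base clusterer and inner-product similarity over document vectors; the report's actual algorithms and parameters may differ.

    # Hypothetical sketch: top-down construction of a cluster tree using a
    # precise clusterer as a subroutine, then binary-tree-style search.
    import numpy as np

    def base_cluster(docs, vectors, k=2, iters=10):
        # Stand-in "precise" clusterer: plain k-means on this subset.
        pts = vectors[docs]
        centres = pts[np.random.choice(len(pts), k, replace=False)]
        for _ in range(iters):
            assign = np.argmax(pts @ centres.T, axis=1)
            for j in range(k):
                if np.any(assign == j):
                    centres[j] = pts[assign == j].mean(axis=0)
        return [[d for d, a in zip(docs, assign) if a == j] for j in range(k)]

    class Node:
        def __init__(self, docs, centroid):
            self.docs, self.centroid, self.children = docs, centroid, []

    def build_tree(docs, vectors, leaf_size=10):
        node = Node(docs, vectors[docs].mean(axis=0))
        if len(docs) > leaf_size:
            groups = [g for g in base_cluster(docs, vectors) if g]
            if len(groups) > 1:            # guard against degenerate splits
                node.children = [build_tree(g, vectors, leaf_size) for g in groups]
        return node

    def search(node, query):
        # Follow the most similar child at each level (binary tree search).
        while node.children:
            node = max(node.children, key=lambda c: float(query @ c.centroid))
        return node.docs                   # candidates for associative search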


TR98-0003 (March)

Title
Corpus-based word sense disambiguation

Author(s)
Atsushi Fujii

Contact person
Atsushi Fujii (fujii@cs.titech.ac.jp)

Abstract
Resolution of lexical ambiguity, commonly termed ``word sense disambiguation'', is expected to improve analytical accuracy for tasks which are sensitive to lexical semantics. Such tasks include machine translation, information retrieval, parsing, natural language understanding and lexicography. Reflecting the growth in the utilization of machine readable texts, word sense disambiguation techniques have been explored variously in the context of corpus-based approaches. In one corpus-based framework, the similarity-based method, systems use a database in which example sentences are manually annotated with correct word senses. Given an input, the system searches the database for the example most similar to the input. The lexical ambiguity of a word contained in the input is resolved by selecting the sense annotation of the retrieved example.
In this research, we apply this method to the resolution of verbal polysemy, in which the similarity between two examples is computed as the weighted average of the similarities between the complements governed by a target polysemous verb. We explore similarity-based verb sense disambiguation focusing on the following three methods. First, we propose a weighting scheme for each verb complement in the similarity computation. Second, in similarity-based techniques the overhead for manual supervision and for searching a large database can be prohibitive. To resolve this problem, we propose a method to select a small number of effective examples for system usage. Finally, the efficiency of our system depends heavily on the similarity computation used. To maximize efficiency, we propose a method which integrates the advantages of previous methods for similarity computation.
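
The following minimal sketch illustrates the similarity-based disambiguation step only; the weighting scheme, example selection, and similarity computation are this research's contributions and are merely stubbed here. Names such as word_sim and weights are assumptions.

    # Hypothetical sketch of similarity-based verb sense disambiguation.
    # `word_sim` and `weights` are placeholders for a thesaurus-based word
    # similarity and the per-complement weighting scheme discussed above.

    def example_similarity(inp, example, weights, word_sim):
        # Weighted average of complement similarities over shared case slots.
        shared = set(inp) & set(example)
        if not shared:
            return 0.0
        total = sum(weights[s] * word_sim(inp[s], example[s]) for s in shared)
        return total / sum(weights[s] for s in shared)

    def disambiguate(inp, database, weights, word_sim):
        # Return the sense annotation of the most similar stored example.
        comps, sense = max(database,
                           key=lambda ex: example_similarity(inp, ex[0],
                                                             weights, word_sim))
        return sense

    # e.g. a database entry: ({"ga": "kare", "o": "pan"}, "eat/ingest")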


TR98-0004 (March)

Title
Stochastic-based Integrated Natural Language Analysis (in Japanese)

Author(s)
Kiyoaki Shirai

Contact person
Kiyoaki Shirai (kshirai@cs.titech.ac.jp)

Abstract
This paper describes several methods for overcoming problems in natural language analysis using statistics learned from language resources.
Disambiguation is one of the major problems in analyzing sentences. This paper proposes a new framework of statistical language modeling that integrates various statistics for disambiguation. The model consists of three submodels. The first is a syntactic model, which reflects syntactic statistics such as structural preferences. The second is a lexical model, which reflects lexical statistics such as the occurrence of each word and word collocations. The third is a semantic model, which reflects statistics of word sense. One significant characteristic of this model is that it learns each type of statistics separately, whereas many previous models have learned them simultaneously. Learning each submodel separately enables us to use different language resources for the different submodels, and makes each submodel's behavior much easier to understand. For experimental evaluation, the model was applied to the disambiguation of dependency structures in Japanese sentences. The experiments showed that, in the proposed framework, the contribution of lexical statistics to disambiguation was as great as that of syntactic statistics.
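
As a hedged sketch of the integration step only: assuming the three submodels are combined multiplicatively (log-additively) over a candidate analysis, the ranking looks like the following; the report's exact factorization may differ.

    # Hypothetical sketch of scoring under the integrated model.
    # Submodel probabilities are assumed already smoothed (non-zero).
    import math

    def score(candidate, syntactic_p, lexical_p, semantic_p):
        # Log-probability of a candidate analysis under three submodels.
        return (math.log(syntactic_p(candidate))
                + math.log(lexical_p(candidate))
                + math.log(semantic_p(candidate)))

    def disambiguate(candidates, syntactic_p, lexical_p, semantic_p):
        # Pick the analysis ranked highest by the integrated model.
        return max(candidates,
                   key=lambda c: score(c, syntactic_p, lexical_p, semantic_p))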
In learning a language model, smoothing methods are generally applied to overcome the data sparseness problem. Maximum entropy methods, which estimate a probabilistic model from training data, are suitable for smoothing in natural language analysis, but their computational cost is prohibitively large. This paper aims at suppressing this overhead by avoiding repetition of the same calculations in the estimation process, and by proposing a new method of selecting useful features for estimating the probabilistic model. The probabilistic model estimated by the proposed methods was as good as that estimated by existing methods, with much less computing time.
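
For reference, a maximum entropy model over contexts $x$ and outcomes $y$ has the standard log-linear form (general background, not notation quoted from this report):

    p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),
    \qquad
    Z(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big),

where the $f_i$ are (typically binary) features and the parameters $\lambda_i$ are fitted iteratively so that the model's expected feature counts match those observed in the training data. Both proposals above attack this estimation loop: caching avoids recomputing the inner sums, and feature selection keeps the set of $f_i$ small.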
In order to analyze various types of sentences, a context-free grammar covering a broad range of linguistic phenomena is required. This paper describes a new method of acquiring a grammar from bracketed corpora. The characteristic of this method is that it suppresses the computational cost of grammar acquisition by relying on linguistically based heuristics. The acceptance rate of the grammar acquired using the proposed method was 92.4%, demonstrating its broad coverage.
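
As an illustration of the basic extraction step only, assuming fully labelled brackets: context-free productions can be read directly off a bracketed tree as below. The report's linguistically based heuristics for keeping acquisition cheap are not reproduced here.

    # Hypothetical sketch of reading CFG rules off a labelled bracketed
    # corpus. A tree is a (label, children) pair; a leaf is a plain string.

    def extract_rules(tree, rules=None):
        # Collect the productions appearing in one bracketed tree.
        if rules is None:
            rules = set()
        label, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rules.add((label, rhs))
        for c in children:
            if not isinstance(c, str):
                extract_rules(c, rules)
        return rules

    # e.g. extract_rules(("S", [("NP", ["taroo"]), ("VP", ["hashiru"])]))
    # -> {("S", ("NP", "VP")), ("NP", ("taroo",)), ("VP", ("hashiru",))}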


TR98-0005 (August)

Title
Probabilistic Language Modeling for Generalized LR Parsing

Author(s)
Virach Sornlertlamvanich

Contact person
Virach Sornlertlamvanich (virach@cs.titech.ac.jp)

Abstract
In this thesis, we introduce probabilistic models to rank the likelihood of resultant parses within the GLR parsing framework. Probabilistic models can also reduce the search space, provided they supply prefix probabilities for partial parses. In devising the models, we carefully observe the nature of GLR parsing, one of the most efficient parsing algorithms in existence, and formalize two probabilistic models that make appropriate use of the parsing context. The context in GLR parsing is provided by the constraints afforded by the context-free grammar in generating an LR table (global context), and the constraints of adjoining pre-terminal symbols (local n-gram context).
In this research, we first conduct both model analyses and quantitative evaluation on the ATR Japanese corpus to evaluate the performance of the probabilistic models. Ambiguities arising from multiple word segmentation and part-of-speech candidates in parsing a non-segmenting language are taken into consideration. We demonstrate the effectiveness of combining contextual information to determine word sequences in the word segmentation process, assign parts-of-speech to words in the tagging process, and choose between possible constituent structures, in single-pass morphosyntactic parsing. Second, we show by empirical evaluation that the performance of the probabilistic GLR parsing model (PGLR) using an LALR table is in no way inferior to that using a CLR table, despite the states in a CLR table providing more precise context than those in an LALR table. Third, we propose a new node-driven parse pruning algorithm based on the prefix probability of PGLR, which is effective in beam-search-style parsing. The pruning threshold is estimated from the number of state nodes up to the current parsing stage. The algorithm yields significant reductions in both parsing time and computational resources. Finally, a further PGLR model is formalized which overcomes some problematic issues by increasing the context used in parsing.
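
As an illustration of the pruning step only, here is a hedged sketch of beam pruning driven by prefix probabilities. The node-driven algorithm above estimates its threshold from the number of state nodes at the current stage; that estimation is simplified here to a fixed beam ratio.

    # Hypothetical sketch: discard partial parses whose PGLR prefix
    # probability falls too far below the best one at the current stage.
    def prune(stage, beam_ratio=1e-3):
        # stage: list of (prefix_probability, stack_node) pairs, where each
        # stack_node is a node of the GLR graph-structured stack.
        best = max(p for p, _ in stage)
        return [(p, node) for p, node in stage if p >= best * beam_ratio]

Pruning after each shift keeps both parsing time and stack space bounded, at the cost of occasionally discarding the eventual best parse if the ratio is too aggressive.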


TR98-0006 (December)

Title
EPBOBs (Extended Pseudo Biorthogonal Bases) for Signal Recovery

Author(s)
Hidemitsu Ogawa and Nasr-Eddine Berrached

Contact person
Hidemitsu Ogawa (ogawa@cs.titech.ac.jp)

Abstract
This paper deals with the problem of recovering a signal from a noisy version of it; one example is restoring old images degraded by noise. The recovery solution is given within the framework of series expansion, and we show that in the general case the recovery functions have to be elements of an extended pseudo biorthogonal basis (EPBOB) in order to efficiently suppress the corrupting noise. After discussing the different noise situations, we provide methods for constructing the optimal EPBOB to deal with each of them.
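
As background for the series-expansion framework (a standard setting, assumed here rather than quoted from the report): if the observed signal is $g = f + n$ and the recovery is taken in the form

    \hat{f} = \sum_{k} \langle g, \psi_k \rangle \varphi_k
            = \sum_{k} \langle f, \psi_k \rangle \varphi_k
            + \sum_{k} \langle n, \psi_k \rangle \varphi_k,

then the pair $\{\varphi_k\}, \{\psi_k\}$ controls two things at once: the first sum should reproduce $f$ (a biorthogonal pair, $\langle \varphi_j, \psi_k \rangle = \delta_{jk}$, does so for $f$ in the spanned space), while the second sum is the propagated noise. EPBOBs relax the exact biorthogonality conditions so that the recovery functions $\psi_k$ can also be chosen to suppress this noise term, which is the optimization carried out in the paper.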


TR98-0007 (December)

Title
The Family of Projection Learnings

Author(s)
Akira Hirabayashi and Hidemitsu Ogawa

Contact person
Akira Hirabayashi (hira@cs.titech.ac.jp)

Abstract
For feed-forward neural networks, projection learning, partial projection learning, and averaged projection learning have been proposed as means of obtaining good generalization ability. The collection of learning methods that involve projections of the original function, including these three, is called {\em the family of projection learnings}. We propose a new and natural definition of the family of projection learnings which, unlike previous ones, has a concrete and clear physical meaning. Comparison studies show that the previous definitions, though lacking physical meaning, also describe the complete family of projection learnings and are equivalent to the new definition.
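
This line of work is usually phrased in operator notation; a minimal sketch of that setting, stated as assumed background rather than as a definition taken from this report:

    y = A f + n, \qquad \hat{f} = X y,

where $f$ is the true function in a Hilbert space $H$, $A$ is the sampling operator mapping $f$ to the vector $y$ of training sample values, $n$ is noise, and $X$ is the learning operator to be designed. Roughly speaking, a learning method belongs to the projection family when, in the noise-free case, the composite operator $XA$ acts as a projection of $f$ onto some subspace of $H$, so that $\hat{f} = XAf$ is a projected version of the original function.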


TR98-0008 (December)

Title
Properties of the Family of Projection Learnings

Author(s)
Akira Hirabayashi and Hidemitsu Ogawa

Contact person
Akira Hirabayashi (hira@cs.titech.ac.jp)

Abstract
In the previous paper we proposed a new and natural definition of the family of projection learnings. Based on the new definition, in this paper we derive a general form of projection learning operators. Properties of these projection learnings, such as noise suppression capability, are analyzed. We also show that the physical interpretation of the projection learnings is directly reflected in the general form of the learning operators. General forms of projection learning, partial projection learning, and averaged projection learning are also obtained; these differ from the general forms derived previously.


TR98-0009 (December)

Title
Applicability of Memorization Learning to the Family of Projection Learnings

Author(s)
Akira Hirabayashi and Hidemitsu Ogawa

Contact person
Akira Hirabayashi (hira@cs.titech.ac.jp)

Abstract
One of the major goals of learning in feed-forward neural networks is to obtain good generalization capability. Most learning methods, however, can be classified as memorization learning, which reduces training error only. Memorization learning does not guarantee good generalization capability and is known to cause over-learning. On the other hand, this learning method is useful from an engineering viewpoint because of its ease of use. In this paper, we discuss to what extent memorization learning can be utilized to realize projection learnings, and conclude that it can be applied to the entire family of projection learning methods when proper training data is provided.
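
In the operator notation sketched after TR98-0007 above (again background rather than a quotation), memorization learning chooses the learning result $\hat{f}$ so as to reduce only the training error

    J_M(\hat{f}) = \sum_{i=1}^{m} \big( \hat{f}(x_i) - y_i \big)^2,

which constrains $\hat{f}$ only at the sample points $x_i$, whereas a projection learning constrains the whole operator $XA$. The question addressed here is, roughly, when minimizing $J_M$ over suitably chosen training data also realizes the desired projection.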


TR98-0010 (December)

Title
Applicability of Memorization Learning to the Family of Projection Learnings in the Presence of Noise

Author(s)
Akira Hirabayashi and Hidemitsu Ogawa

Contact person
Akira Hirabayashi (hira@cs.titech.ac.jp)

Abstract
In feed-forward neural networks, most learning methods are memorization learning, which only reduces training error. It is used frequently because, first, memorization learning does not need much a priori knowledge and, second, its ease of use makes it valuable from an engineering viewpoint. The authors have already shown that, in the absence of noise, memorization learning can achieve the same generalization ability as the entire family of projection learnings when proper training data is provided. In this paper, we discuss the case where the training data contains noise. We show that, depending on the properties of the noise, in some cases memorization learning can be applied to the entire family of projection learnings, while in other cases it can be applied to only one projection learning.


TR98-0011 (December)

Title
Incremental Projection Learning for Optimal Generalization

Author(s)
Masashi Sugiyama and Hidemitsu Ogawa

Contact person
Masashi Sugiyama (sugi@cs.titech.ac.jp)

Abstract
When new training data is added after the learning process has been completed, incremental learning, in which the posterior result is built upon the prior result, is generally preferred because it is computationally efficient. In this paper, we present an incremental projection learning method for use in the presence of noise. The memory and computational complexity required for incremental projection learning are far less than those required for batch projection learning, while the incremental method provides exactly the same result as batch projection learning. Moreover, we derive a condition for identifying redundant training data that has no effect on the generalization ability, which makes the computation still more efficient.
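
The incremental-equals-batch idea can be illustrated in the familiar linear least-squares setting (this is an analogy, not the report's projection learning update equations): each new sample updates the running solution by a rank-one correction, yet the result always equals the batch (ridge-regularized) solution over all samples seen so far.

    # Hypothetical illustration: recursive least squares via the
    # Sherman-Morrison rank-one update.
    import numpy as np

    class IncrementalLS:
        def __init__(self, dim, reg=1e-8):
            self.P = np.eye(dim) / reg   # running inverse of (X^T X + reg*I)
            self.w = np.zeros(dim)       # running solution

        def add(self, x, y):
            # Fold one new sample (x, y) into the solution.
            Px = self.P @ x
            gain = Px / (1.0 + x @ Px)
            self.w += gain * (y - x @ self.w)
            self.P -= np.outer(gain, Px)

After any number of add() calls, self.w matches the batch solution (X^T X + reg*I)^{-1} X^T y over the same samples, while only a dim-by-dim matrix is kept in memory. Detecting that a new sample leaves self.w unchanged is the analogue of the redundancy condition mentioned above.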


TR98-0012 (December)

Title
Active Learning for Noise Suppression

Author(s)
Masashi Sugiyama and Hidemitsu Ogawa

Contact person
Masashi Sugiyama (sugi@cs.titech.ac.jp)

Abstract
If we choose training data that provides the optimal generalization ability, learning can be carried out effectively. In the presence of noise, it is also important to suppress the influence of that noise. In this paper, we give a training data selection method that minimizes the noise variance.


TR98-0013 (December)

Title
Training Data Selection for Optimal Generalization in a Trigonometric Polynomial Model

Author(s)
Masashi Sugiyama and Hidemitsu Ogawa

Contact person
Masashi Sugiyama (sugi@cs.titech.ac.jp)

Abstract
A necessary and sufficient condition for a set of training data to provide the optimal generalization capability in a trigonometric polynomial model is derived. In addition to providing the optimal generalization capability, training sets which satisfy the condition also reduce the memory usage and computational complexity required for learning. Since there are infinitely many training sets which satisfy the condition, a selection method for training sets which further reduce both memory usage and computational complexity is presented. Finally, the effectiveness of the proposed method is confirmed through computer simulations.
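
For reference, by a trigonometric polynomial model of order $N$ we mean the standard family (a background definition, assumed rather than quoted from the report):

    f(x) = a_0 + \sum_{n=1}^{N} \big( a_n \cos nx + b_n \sin nx \big),

which has $2N + 1$ free parameters. The derived condition characterizes which sets of sample points $\{x_i\}$ allow a learning method to attain the optimal generalization over this family.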


TR98-0014 (December)

Title
A Theory of Pseudoframes for Subspaces with Applications

Author(s)
Shidong Li and Hidemitsu Ogawa

Contact person
Hidemitsu Ogawa (ogawa@cs.titech.ac.jp)

Abstract
We define and characterize a frame-like stable decomposition for subspaces of a general separable Hilbert space, which we call a {\it pseudoframe for subspaces} (PFFS). Properties of PFFSs are discussed. A necessary and sufficient characterization of PFFSs is provided. Analytical formulae for the construction of PFFSs are derived. An example of a PFFS for a band-limited subspace is constructed. Potential applications of PFFSs are discussed.
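
A hedged sketch of the reconstruction property behind such a decomposition, consistent with the frame-like decomposition mentioned above (the precise definition and conditions are in the report): two sequences $\{x_n\}$ and $\{\tilde{x}_n\}$ in a Hilbert space $H$ act as a pseudoframe for a closed subspace $X \subset H$ when

    f = \sum_n \langle f, x_n \rangle \, \tilde{x}_n \qquad \text{for every } f \in X.

The point is that, unlike an ordinary frame for $X$, neither sequence is required to lie in $X$ itself, and this extra freedom is what the constructions and applications exploit.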