- Title
- MT R&D in Asia
- Author(s)
- Hozumi Tanaka and Virach Sornlertlamvanich
- Contact person
- Hozumi Tanaka (tanaka@cs.titech.ac.jp)
- Abstract
- There has been a substantial shift in MT R&D in the Asian region
following the many large-scale projects conducted over the past ten
years. The Multi-lingual Machine Translation (MMT) project is a
significant R&D effort that has increased both the number of
NLP-related researchers and the quantity of research activities
undertaken by research institutes in recent years. The project has
provided considerable insight through collaborative research across
languages, and it is hoped that it will catalyze future MT R&D in the
region. Though MT systems are still far from the ultimate goal of
perfect translation, many are already being applied to support
information retrieval from the Internet.
- Title
- Large-scale Document Clustering for Associative Document
Search
- Author(s)
- Makoto Iwayama and Takenobu Tokunaga
- Contact person
- Takenobu Tokunaga (take@cs.titech.ac.jp)
- Abstract
- Approximated algorithms for clustering large-scale document
collections are proposed and evaluated in the context of
cluster-based document retrieval (i.e., associative document
search). These algorithms use a precise clustering algorithm as a
subroutine to construct a stratified structure of cluster trees. An
experiment showed a speedup in CPU time of more than 100 times at
best. Through experiments on self-retrieval and topic assignment, we
confirmed sufficient search performance on cluster trees constructed
by the approximated algorithms. In particular, top-down construction
achieved over 99% accuracy in self-retrieval, performance comparable
to exhaustive search. Top-down construction also offered promising
performance in topic assignment, namely better recall/precision than
that obtained by exhaustive search. All of the results for
cluster-based retrieval were obtained by simple and efficient binary
tree search.
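As a rough sketch of associative search over a binary cluster tree
(the similarity measure and data layout here are illustrative
assumptions, not the paper's implementation):

    # Descend the cluster tree, at each node following the child whose
    # centroid is more similar to the query; a leaf holds document ids.
    import math

    def cosine(u, v):
        """Cosine similarity of two sparse term-weight vectors (dicts)."""
        dot = sum(u[t] * v[t] for t in u if t in v)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    class Node:
        def __init__(self, centroid, docs=None, left=None, right=None):
            self.centroid = centroid  # mean vector of the cluster
            self.docs = docs          # document ids (leaves only)
            self.left, self.right = left, right

    def search(node, query):
        """Binary tree search: O(log n) similarity comparisons."""
        while node.left and node.right:
            l = cosine(query, node.left.centroid)
            r = cosine(query, node.right.centroid)
            node = node.left if l >= r else node.right
        return node.docs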
- Title
- Corpus-based word sense disambiguation
- Author(s)
- Atsushi Fujii
- Contact person
- Atsushi Fujii (fujii@cs.titech.ac.jp)
- Abstract
- Resolution of lexical ambiguity, commonly termed ``word sense
disambiguation'', is expected to improve the analytical accuracy for
tasks which are sensitive to lexical semantics. Such tasks include
machine translation, information retrieval, parsing, natural language
understanding and lexicography. Reflecting the growth in the
utilization of machine-readable texts, word sense disambiguation
techniques have been explored in a variety of corpus-based
approaches. Within one corpus-based framework, the similarity-based
method, systems use a database in which example sentences are
manually annotated with correct word senses. Given an input, systems
search the database for the example most similar to the input. The
lexical ambiguity of a word contained in the input is resolved by
selecting the sense annotation of the retrieved example. In this
research, we apply this method to the resolution of verbal polysemy,
in which the similarity between two examples is computed as the
weighted average of the similarities between the complements governed
by the target polysemous verb. We explore similarity-based verb sense
disambiguation focusing on the following three methods. First, we
propose a weighting schema for each verb complement in the similarity
computation. Second, in similarity-based techniques, the overhead for
manual supervision and for searching the large database can be
prohibitive. To resolve this problem, we propose a method to select a
small number of effective examples for system usage. Finally, the
efficiency of our system is highly dependent on the similarity
computation used. To maximize efficiency, we propose a method which
integrates the advantages of previous methods for similarity
computation.
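As a minimal sketch of the similarity-based selection step (the word
similarity function, the weights, and the data layout are hypothetical
placeholders; the paper derives its own weighting schema):

    # Pick the sense of the database example most similar to the input,
    # where example similarity is a weighted average of per-complement
    # similarities between case-slot fillers of the target verb.

    def example_similarity(input_comps, db_comps, weights, word_sim):
        """input_comps/db_comps map case slots to filler words."""
        shared = [c for c in input_comps if c in db_comps]
        if not shared:
            return 0.0
        num = sum(weights[c] * word_sim(input_comps[c], db_comps[c])
                  for c in shared)
        return num / sum(weights[c] for c in shared)

    def disambiguate(input_comps, database, weights, word_sim):
        """database: list of (complements, sense) pairs for the verb."""
        comps, sense = max(database,
                           key=lambda ex: example_similarity(
                               input_comps, ex[0], weights, word_sim))
        return sense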
- Title
- Stochastic-based Integrated Natural Language Analysis (in Japanese)
- Author(s)
- Kiyoaki Shirai
- Contact person
- Kiyoaki Shirai (kshirai@cs.titech.ac.jp)
- Abstract
- This paper describes several methods
to overcome problems in natural language analysis
using statistics learned from language resources.
Disambiguation is one of the major problems in analyzing sentences.
This paper proposes
a new framework of statistical language modeling,
integrating various statistics for disambiguation.
This model consists of three submodels.
The first submodel is a syntactic model,
which reflects syntactic statistics such as structural preferences.
The second submodel is a lexical model,
which reflects lexical statistics
such as the occurrence of each word and word collocations.
The third submodel is a semantic model,
which reflects statistics of word sense.
One of the significant characteristics of this model
is that it learns each type of statistics separately,
whereas many previous models have learned them simultaneously.
Learning each submodel separately enables us to
use different language resources for different submodels,
and makes each submodel's behavior
much easier to understand.
This model was applied to the disambiguation
of dependency structures of Japanese sentences
for experimental evaluation.
Experiments showed that,
in the proposed framework,
the contribution of lexical statistics to disambiguation
was as great as that of syntactic statistics.
In learning a language model,
smoothing methods are generally applied
to overcome the data sparseness problem.
Maximum entropy methods, which estimate
a probabilistic model from training data,
are suitable for smoothing in natural language analysis,
but their computational cost is prohibitively large.
This paper aims at suppressing this overhead,
by avoiding repetition of the same calculation
in the estimation process,
and by proposing a new method of selecting useful features
for estimating the probabilistic model.
The probabilistic model estimated by the proposed methods
was as good as that estimated by existing methods,
with much less computing time.
In order to analyze various types of sentences,
a context-free grammar covering a broad range of linguistic phenomena
is required.
This paper describes a new method of acquiring a grammar
from bracketed corpora.
The characteristic of this method is
that it suppresses the computational cost of grammar acquisition
through reliance on linguistically based heuristics.
The acceptance rate of the grammar acquired
by the proposed method was 92.4%,
demonstrating its broad coverage.
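As a rough sketch of how independently trained submodels can be
combined to rank analyses (the product-of-probabilities combination
and the function names are our illustrative assumptions; the paper's
actual decomposition and features are its own):

    # Rank parse candidates by the product of the three submodel
    # probabilities, computed in the log domain for stability.
    # Each submodel is assumed to return a nonzero probability.
    import math

    def integrated_score(parse, syntactic, lexical, semantic):
        return (math.log(syntactic(parse))
                + math.log(lexical(parse))
                + math.log(semantic(parse)))

    def best_parse(candidates, syntactic, lexical, semantic):
        return max(candidates,
                   key=lambda p: integrated_score(
                       p, syntactic, lexical, semantic))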
- Title
- Probabilistic Language Modeling for Generalized LR Parsing
- Author(s)
- Virach Sornlertlamvanich
- Contact person
- Virach Sornlertlamvanich (virach@cs.titech.ac.jp)
- Abstract
- In this thesis, we introduce probabilistic models to rank the
likelihood of resultant parses within the GLR parsing framework.
Probabilistic models can also reduce the search space, provided they
allow prefix probabilities to be computed for partial parses. In
devising the models, we carefully observe the nature of GLR parsing,
one of the most efficient parsing algorithms in existence, and
formalize two probabilistic models with the appropriate use of the
parsing context. The context in GLR parsing is provided by the
constraints afforded by context-free grammars in generating an LR
table (global context), and the constraints of adjoining pre-terminal
symbols (local n-gram context).
In this research, firstly, we conduct both model analyses and
quantitative evaluation on the ATR Japanese corpus to evaluate the
performance of the probabilistic models. Ambiguities arising from
multiple word segmentation and part-of-speech candidates in parsing a
non-segmenting language are taken into consideration. We demonstrate
the effectiveness of combining contextual information to determine
word sequences in the word segmentation process, define
parts-of-speech for words in the part-of-speech tagging process, and
choose between possible constituent structures, in single-pass
morphosyntactic parsing. Secondly, we apply empirical evaluation to
show that the performance of the probabilistic GLR parsing model
(PGLR) using an LALR table is in no way inferior to that using a CLR
table, even though the states in a CLR table provide more precise
context than those in an LALR table. Thirdly, we propose a new
node-driven parse pruning algorithm based on the prefix probability
of PGLR, which is effective in beam-search-style parsing. The pruning
threshold is estimated from the number of state nodes up to the
current parsing stage. The algorithm yields significant reductions in
both parsing time and consumption of computational resources.
Finally, a further PGLR model is formalized which overcomes some
problematic issues by increasing the context used in parsing.
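As a minimal sketch of prefix-probability beam pruning (the fixed
beam ratio below is a placeholder; the paper estimates the threshold
from the number of state nodes at the current parsing stage):

    # Keep only the partial parses whose prefix probability is within
    # a multiplicative beam of the best one at this parsing stage.
    def prune(partial_parses, beam=1e-3):
        """partial_parses: list of (parse_state, prefix_probability)."""
        best = max(p for _, p in partial_parses)
        return [(s, p) for s, p in partial_parses if p >= best * beam]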
- Title
- EPBOBs (Extended Pseudo Biorthogonal Bases) for Signal Recovery
- Author(s)
- Hidemitsu Ogawa and Nasr-Eddine Berrached
- Contact person
- Hidemitsu Ogawa (ogawa@cs.titech.ac.jp)
- Abstract
- This paper deals with the problem of recovering a signal from its
noisy version; one example is restoring old images degraded by
noise. The recovery solution is given within the framework of series
expansion, and we show that in the general case the recovery
functions have to be elements of an extended pseudo-biorthogonal
basis (EPBOB) in order to efficiently suppress the corrupting
noise. After discussing the different noise situations, we provide
methods to construct the optimal EPBOB for each of them.
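In rough outline, and in our notation rather than the paper's: given
a noisy observation $g = f + n$, a series-expansion recovery takes
the form
\[
  \hat{f} \;=\; \sum_{m=1}^{M} \langle g, \phi_m \rangle\, \psi_m ,
\]
and the claim is that the recovery functions $\{\psi_m\}$ must in
general be drawn from an EPBOB for the noise term to be suppressed
efficiently.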
- Title
- The Family of Projection Learnings
- Author(s)
- Akira Hirabayashi and Hidemitsu Ogawa
- Contact person
- Akira Hirabayashi (hira@cs.titech.ac.jp)
- Abstract
- In feed-forward neural networks, projection learning, partial
projection learning, and averaged projection learning have been
proposed to obtain good generalization ability. The collection of
learning methods that involve projections of the original function,
including these three, is called {\em the family of projection
learnings}. We propose a new and natural definition of the family of
projection learnings which, unlike previous ones, has a concrete and
clear physical meaning. Comparison studies show that the previous
definitions, though lacking physical meaning, also describe the
complete family of projection learnings and are equivalent to the new
definition.
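As a minimal operator-level sketch, assuming the standard
formulation of this framework (the symbols are ours): training data
are modeled as $y = Af + n$, where $f$ is the true function, $A$ a
sampling operator, and $n$ noise, and a learning operator $X$ maps
$y$ to the learning result $f_0 = Xy$. A method belongs to the family
when, in the absence of noise, the result is a projection of the
original function:
\[
  XA = P , \qquad \text{so that} \quad f_0 = Pf ,
\]
for some orthogonal projection operator $P$ on the function space.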
- Title
- Properties of the Family of Projection Learnings
- Author(s)
- Akira Hirabayashi and Hidemitsu Ogawa
- Contact person
- Akira Hirabayashi (hira@cs.titech.ac.jp)
- Abstract
- We proposed a new and natural definition of the family of projection
learnings in the previous paper. Based on the new definition, we
derive a general form of projection learning operators in this
paper. Properties of these projection learnings, such as their noise
suppression capability, are analyzed. We also show that the physical
interpretation of the projection learnings is directly reflected in
the general form of the learning operators. General forms of
projection learning, partial projection learning, and averaged
projection learning are also obtained which differ from the general
forms previously derived.
- Title
- Applicability of Memorization Learning to the Family of Projection Learnings
- Author(s)
- Akira Hirabayashi and Hidemitsu Ogawa
- Contact person
- Akira Hirabayashi (hira@cs.titech.ac.jp)
- Abstract
- One of the major goals of learning in feed-forward neural networks
is to obtain good generalization capability. Most learning methods,
however, can be classified as memorization learning, which reduces
training errors only. Memorization learning does not guarantee good
generalization capability and is known to cause over-learning. On the
other hand, this learning method is useful from the engineering point
of view due to its ease of use. In this paper, we discuss to what
extent memorization learning can be utilized to realize projection
learnings, and conclude that it can be applied to the entire family
of projection learning methods when proper training data is provided.
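For concreteness (our notation): memorization learning selects the
learning result $f_0$ purely by minimizing the training error
\[
  J(f_0) \;=\; \sum_{i=1}^{m} \bigl| f_0(x_i) - y_i \bigr|^2 ,
\]
a criterion that constrains $f_0$ only at the training points
$x_1, \ldots, x_m$, which is why good generalization off those points
is not guaranteed.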
- Title
- Applicability of Memorization Learning to the Family of Projection Learnings in the Presence of Noise
- Author(s)
- Akira Hirabayashi and Hidemitsu Ogawa
- Contact person
- Akira Hirabayashi (hira@cs.titech.ac.jp)
- Abstract
- In feed-forward neural networks, most learning methods are
memorization learning, which only reduces training errors. It is used
frequently because, first, memorization learning does not need much a
priori knowledge, and, second, its ease of use makes it valuable from
the engineering point of view. The authors have already shown that
memorization learning can achieve the same generalization ability as
the entire family of projection learnings when, in the absence of
noise, proper training data is provided. In this paper, we discuss
the case where the training data contains noise. We show that,
depending on the properties of the noise, in some cases memorization
learning can be applied to the entire family of projection learnings,
while in other cases it can be applied to only one projection
learning.
- Title
- Incremental Projection Learning for Optimal Generalization
- Author(s)
- Masashi Sugiyama and Hidemitsu Ogawa
- Contact person
- Masashi Sugiyama (sugi@cs.titech.ac.jp)
- Abstract
- In the case where new training data is added after the learning
process has been completed, incremental learning, in which the
posterior result is built upon the prior results, is generally
preferred because it is computationally efficient. In this paper, we
present an incremental projection learning method for the case where
noise is present. The memory and computational complexity required
for incremental projection learning are far less than those required
for batch projection learning. Note that incremental projection
learning provides exactly the same result as batch projection
learning. Moreover, we derive a condition for identifying redundant
training data that has no effect on the generalization ability, so
that computation becomes more efficient.
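The batch-equals-incremental property can be illustrated with
ordinary least squares, where a rank-one (Sherman-Morrison) update
reproduces the batch solution exactly. This is an analogy in a
simpler setting, not the paper's operator-level derivation:

    # Incremental least squares: adding one sample (x, y) updates
    # P = (X^T X)^{-1} and the weights w in O(d^2), and w remains
    # exactly equal to the batch solution, provided P and w were
    # exact for the data seen so far.
    import numpy as np

    def incremental_update(P, w, x, y):
        """P: inverse Gram matrix (d x d); w: weights (d,); x: (d,)."""
        Px = P @ x
        k = Px / (1.0 + x @ Px)       # gain vector
        w_new = w + k * (y - x @ w)   # correct toward the new sample
        P_new = P - np.outer(k, Px)   # rank-one update of the inverse
        return P_new, w_new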
- Title
- Active Learning for Noise Suppression
- Author(s)
- Masashi Sugiyama and Hidemitsu Ogawa
- Contact person
- Masashi Sugiyama (sugi@cs.titech.ac.jp)
- Abstract
- If we choose training data that provide optimal generalization
ability, we can carry out learning effectively. In the presence of
noise, it is also important to suppress the influence of the noise.
In this paper, we give a training data selection method that
minimizes the noise variance.
- Title
- Training Data Selection for Optimal Generalization in a Trigonometric Polynomial Model
- Author(s)
- Masashi Sugiyama and Hidemitsu Ogawa
- Contact person
- Masashi Sugiyama (sugi@cs.titech.ac.jp)
- Abstract
- A necessary and sufficient condition on a set of training data that
provides optimal generalization capability in a trigonometric
polynomial model is derived. In addition to providing optimal
generalization capability, training sets which satisfy the condition
also reduce the memory usage and computational complexity required
for learning. Since there are infinitely many training sets which
satisfy the condition, a method of selecting training sets that
further reduce both memory usage and computational complexity is
presented. Finally, the effectiveness of the proposed method is
confirmed through computer simulations.
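For reference, the model class in question is the standard
trigonometric polynomial of some order $N$ (our notation):
\[
  f(x) \;=\; a_0 + \sum_{n=1}^{N} \bigl( a_n \cos nx + b_n \sin nx \bigr),
\]
so learning amounts to estimating the coefficients $a_n$ and $b_n$
from sampled values of $f$.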
- Title
- A Theory of Pseudoframes for Subspaces with Applications
- Author(s)
- Shidong Li and Hidemitsu Ogawa
- Contact person
- Hidemitsu Ogawa (ogawa@cs.titech.ac.jp)
- Abstract
- We define and characterize a frame-like stable decomposition for
subspaces of a general separable Hilbert space, which we call {\it
pseudoframes for subspaces} (PFFSs). Properties of PFFSs are
discussed. A necessary and sufficient characterization of PFFSs is
provided. Analytical formulae for the construction of PFFSs are
derived. An example of a PFFS for a band-limited subspace is
constructed. Potential applications of PFFSs are discussed.
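In rough form (our notation): sequences $\{\xi_n\}$ and $\{\xi_n^*\}$
in a Hilbert space $H$ constitute a pseudoframe for a closed subspace
$X \subset H$ when every element of the subspace admits the
frame-like expansion
\[
  x \;=\; \sum_{n} \langle x, \xi_n^* \rangle\, \xi_n
  \qquad \text{for all } x \in X ,
\]
while neither $\{\xi_n\}$ nor $\{\xi_n^*\}$ is required to lie in $X$
itself.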