In this thesis, we address the word mismatch problem through automatic query expansion, a technique used in information retrieval to remedy it. A query is expanded by adding terms that are closely related to the original query terms. Expansion terms can be selected by consulting thesauri or by consulting users through relevance feedback. Past research has verified the effectiveness of relevance feedback, but it places a certain burden on users. Furthermore, if a user is not familiar with the vocabulary of a document collection, it is difficult to obtain good expansion terms unless the system can suggest terms to the user.
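As a minimal sketch of thesaurus-based expansion as described above, the following adds the most closely related terms to a query with reduced weights. The toy thesaurus, the similarity values, and the parameters `top_k` and `min_sim` are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict

# Hypothetical toy thesaurus: term -> {related term: similarity in [0, 1]}.
TOY_THESAURUS = {
    "car": {"automobile": 0.9, "vehicle": 0.7, "road": 0.3},
    "price": {"cost": 0.8, "value": 0.5},
}

def expand_query(query_terms, thesaurus, top_k=2, min_sim=0.4):
    """Return {term: weight}: original terms at 1.0 plus up to top_k related terms."""
    scores = defaultdict(float)
    for term in query_terms:
        for related, sim in thesaurus.get(term, {}).items():
            # Skip terms already in the query and weakly related terms.
            if related not in query_terms and sim >= min_sim:
                scores[related] += sim
    # Rank candidates by accumulated similarity (ties broken alphabetically).
    best = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    expanded = {term: 1.0 for term in query_terms}
    for term, score in best:
        # Assumed weighting: average similarity to the whole query.
        expanded[term] = score / len(query_terms)
    return expanded

expanded = expand_query(["car", "price"], TOY_THESAURUS)
print(expanded)
```

Averaging similarity over all query terms is one simple way to favour expansion terms that relate to the query as a whole rather than to a single term; other weighting schemes are equally plausible.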
Many researchers have found that query expansion using a thesaurus yields no improvement, or only very limited improvement, compared with relevance feedback. This thesis analyses why thesaurus-based query expansion achieves only limited performance and, based on this analysis, proposes a method for improving automatic query expansion by using heterogeneous thesauri. Experimental results show that our method can significantly improve retrieval performance.
Analysis of the results shows that the performance of queries containing multiple aspects is degraded by our method. We investigated several methods to overcome this problem, both manual and automatic, and experiments show that these methods successfully increase the performance of those queries. Further analysis shows that the performance of queries containing negation statements is also degraded by our method. After eliminating the negation statements, we found that our method improves the performance of those queries as well.
We compared our results with relevance feedback and found that retrieval performance using our method is comparable to that of a system using relevance feedback, and better than that of a system using pseudo-relevance feedback. Further, we propose a simple combination of our query expansion method and pseudo-relevance feedback. This combination performs better than either method alone.
In translation retrieval, disambiguation consists of determining, for a given input, the translation in the ``translation memory'' (a database of source/target language string pairs) that will be of maximum practical use in translating the input. Here, we look at the effects of segment (word) order, segmentation and segment contiguity on translation retrieval performance. In extensive experimentation, character-based indexing (where each string is split up into its constituent characters) was shown to be superior to word-based indexing (where each string is split up into words by a segmentation module), and bag-of-words methods were roughly equivalent to segment order-sensitive methods. Modelling of local segment contiguity in the form of segment N-grams was found to be beneficial in terms of both retrieval speed and accuracy. We also tested a number of both static and dynamic segment weighting methods, but found them to have little effect.
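The idea of character-based indexing with segment N-grams can be illustrated with a short sketch. It assumes character bigrams and the Dice coefficient as the similarity measure; the similarity measure, the toy translation memory, and the placeholder target strings are assumptions made for illustration and are not necessarily those used in the experiments.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Multiset of character n-grams; no word segmentation module is needed."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def dice(a, b):
    """Dice coefficient between two n-gram multisets (order-insensitive)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    overlap = sum((a & b).values())  # multiset intersection
    return 2.0 * overlap / total

def retrieve(query, memory, n=2):
    """Return the (source, target) pair whose source best matches the query."""
    q = char_ngrams(query, n)
    return max(memory, key=lambda pair: dice(q, char_ngrams(pair[0], n)))

# Hypothetical translation memory with placeholder target strings.
TOY_MEMORY = [
    ("press the red button", "TARGET-1"),
    ("open the front cover", "TARGET-2"),
    ("replace the ink cartridge", "TARGET-3"),
]

best = retrieve("press the green button", TOY_MEMORY)
print(best)
```

The comparison itself is a bag-of-N-grams match, so global segment order is ignored, while each bigram still captures local contiguity.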
With Japanese RCC interpretation, each RCC is described by way of a vector of morphological and shallow semantic features, and supervised learning is used to classify RCC inputs according to a taxonomy aimed at Japanese-English machine translation. We focus particularly on feature selection and construction in an attempt to attain the maximum achievable classification accuracy. Feature selection/construction is carried out by way of backwards sequential search through the feature space, in increasing order of estimated feature relevance, and the impact of each feature on overall classification accuracy is evaluated by way of ``nested cross-validation''.
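The backwards sequential search can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the `evaluate` callable stands in for the nested cross-validation accuracy estimate, and the feature names, relevance scores, and toy evaluator are hypothetical.

```python
def backward_search(features, relevance, evaluate):
    """Try dropping features from least to most relevant.

    A feature is dropped only if the estimated accuracy does not fall.
    """
    selected = list(features)
    best = evaluate(selected)
    for feat in sorted(features, key=lambda f: relevance[f]):
        if len(selected) == 1:
            break  # always keep at least one feature
        candidate = [f for f in selected if f != feat]
        score = evaluate(candidate)
        if score >= best:
            selected, best = candidate, score
    return selected, best

# Hypothetical toy evaluator standing in for nested cross-validation:
# two features help, and every other feature adds a little noise.
USEFUL = {"head_noun_class": 0.30, "verb_class": 0.25}

def toy_evaluate(feats):
    gain = sum(USEFUL.get(f, 0.0) for f in feats)
    noise = 0.02 * sum(1 for f in feats if f not in USEFUL)
    return 0.40 + gain - noise

FEATURES = ["head_noun_class", "verb_class", "string_length", "char_count"]
RELEVANCE = {"string_length": 0.01, "char_count": 0.02,
             "verb_class": 0.5, "head_noun_class": 0.6}

selected, best = backward_search(FEATURES, RELEVANCE, toy_evaluate)
print(selected, round(best, 2))
```

In a real setting, `evaluate` would train and test a classifier on inner cross-validation folds, with an outer loop held out to estimate the accuracy of the final feature set.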
Finally, verb sense disambiguation was performed over a pattern-based valency dictionary, and linked extensively to both the ``argument status'' (complementhood) and selectional restriction annotation of each case slot. Rather than simply evaluating the satisfaction of selectional restrictions in a binary fashion, the proposed method returns a score for the quality of the match, including provision for penalised ``backing-off''. Argument status plays a role in the penalisation of backing-off of selectional restrictions, the penalisation of the non-alignment of case slots within the case frame, and the determination of the scope for case marking alternation.
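A graded match with penalised backing-off might look like the sketch below. It assumes that backing-off means relaxing a selectional restriction to its parent class, and that each relaxation step multiplies the score by a penalty that depends on argument status. The class hierarchy, the penalty values, and this particular scoring scheme are hypothetical and only illustrate the general idea.

```python
# Hypothetical toy is-a hierarchy (class -> parent); not taken from the thesis.
PARENT = {
    "human": "animate",
    "animal": "animate",
    "animate": "concrete",
    "tool": "concrete",
    "concrete": "top",
}

# Assumed per-step penalties: relaxing a complement's restriction costs more
# than relaxing an adjunct's.
STEP_PENALTY = {"complement": 0.5, "adjunct": 0.8}

def is_a(cls, target):
    """True if target is cls itself or one of its ancestors."""
    while cls is not None:
        if cls == target:
            return True
        cls = PARENT.get(cls)
    return False

def match_score(filler, restriction, status="complement"):
    """Score in [0, 1]: 1.0 for a direct match, lower for each back-off step."""
    score, current = 1.0, restriction
    while current is not None:
        if is_a(filler, current):
            return score
        current = PARENT.get(current)    # back off to a more general class
        score *= STEP_PENALTY[status]    # and pay the penalty for doing so
    return 0.0  # filler is not in the hierarchy at all

print(match_score("human", "human"))              # direct match
print(match_score("animal", "human"))             # one back-off step
print(match_score("animal", "human", "adjunct"))  # milder penalty for adjuncts
```

Compared with a binary check, a score of this kind lets competing verb senses be ranked even when no case frame is satisfied exactly.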