Page numbers prefixed I- refer to volume LNCS 7181; page numbers prefixed II- refer to volume LNCS 7182. Papers in the complementary proceedings volumes have not yet been assigned page numbers.
Springer LNCS 7181
Paper | Page
NLP System Architecture
Invited paper:
Thinking Outside the Box for Natural Language Processing
Abstract:
Natural Language Processing systems are often composed of a sequence of
transductive components that transform their input into an output with
additional syntactic and/or semantic labels. However, each component in this
chain is typically error-prone and the error is magnified as the processing
proceeds down the chain. In this paper, we present details of two systems,
first, a speech driven question answering system and second, a dialog
modeling system, both of which reflect the theme of tightly incorporating
constraints across multiple components to improve the accuracy of their
tasks.
| I-1 |
Lexical Resources
A graph-based method to improve WordNet Domains
Abstract:
WordNet Domains (WND) is a lexical resource in which synsets have been semi-automatically annotated with one or more domain labels from a set of 165 hierarchically organized domains. Among other uses, WND makes it possible to reduce the degree of polysemy of words by grouping the senses that belong to the same domain. However, the semi-automatic method used to develop this resource was far from perfect. By cross-checking the content of the Multilingual Central Repository (MCR) it is possible to find errors and inconsistencies; many are very subtle, while others leave no doubt. Moreover, it is very difficult to quantify the number of errors in the original version of WND. This paper presents a novel semi-automatic method to propagate domain information through the MCR. Comparing the two labellings (the original and the new one) allows us to detect anomalies in the original WND labels. We also compare the quality of the two resources on a common Word Sense Disambiguation task. The results show that the new labelling outperforms the original one by a large margin.
| I-17 |
Best paper award, 2nd place:
Corpus-Driven Hyponym Acquisition for Turkish
Abstract:
In this study, we propose a method for the acquisition of hyponymy relations for Turkish. The integrated method relies on both lexico-syntactic patterns and semantic similarity: the model first extracts candidate items using the patterns and then applies similarity-based elimination of incorrect ones in order to increase precision. We show that an algorithm based on a particular lexico-syntactic pattern for Turkish can retrieve a large number of hyponymy relations, and that elimination based on semantic similarity gives promising results. We also discuss how to measure the similarity between concepts, the aim being better relevance and more precise results. We observed that this approach gives successful results with high precision.
| I-29 |
Automatic Taxonomy Extraction in Different Languages Using Wikipedia and Minimal Language-Specific Information
Abstract:
Knowledge bases extracted from Wikipedia are particularly useful for various NLP and Semantic Web applications due to their coverage, timeliness and multilingualism. This has led to many approaches for automatic knowledge base extraction from Wikipedia. Most of these approaches rely on the English Wikipedia as it is the largest Wikipedia version. However, each Wikipedia version contains socio-cultural knowledge, i.e. knowledge with relevance for a specific culture or language. In this work, we describe a method for extracting a large set of hyponymy relations from the Wikipedia category system that can be used to acquire taxonomies in multiple languages. More specifically, we describe a set of 20 features that can be used for hyponymy detection without using additional language-specific corpora. Finally, we evaluate our approach on Wikipedia in four different languages and compare the results with the WordNet taxonomy and with a multilingual approach based on interwiki links of Wikipedia.
| I-42 |
Ontology-driven Construction of Domain Corpus with Frame Semantics Annotations
Abstract:
Semantic Role Labeling (SRL) plays a key role in many text mining applications. The development of SRL systems for the biomedical domain is hampered by the lack of large domain-specific corpora labeled with semantic roles. In this paper we propose a method for building corpora that are labeled with semantic roles for the domain of biomedicine. The method is based on the theory of frame semantics and relies on domain knowledge provided by ontologies. Using the method, we have built a corpus for transport events strictly following the domain knowledge provided by the GO biological process ontology. We compared one of our frames to BioFrameNet and also examined the gaps between the semantic classification of the target words in this domain-specific corpus and in FrameNet and PropBank/VerbNet data. The successful corpus construction demonstrates that ontologies, as a formal representation of domain knowledge, can guide and ease all the tasks involved in building this kind of corpus. Furthermore, ontological domain knowledge leads to well-defined semantics exposed in the corpus, which will be very valuable in text mining applications.
| I-54 |
Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank
Abstract:
This work aims at the development of a representative treebank for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and the development of a reliable treebank for Urdu will have a significant impact on the state of the art in Urdu language processing. For the URDU.KON-TB treebank described here, a POS tagset, a syntactic tagset and a functional tagset have been proposed. The construction of the treebank is based on an existing corpus of 19 million words for the Urdu language. Part-of-speech (POS) tagging and annotation of a selected set of sentences from different sub-domains of this corpus are being carried out manually, and the work performed to date is presented here. The hierarchical annotation scheme we adopted combines a phrase structure (PS) and a hybrid dependency structure (HDS).
| I-66 |
Morphology and Syntax
A Morphological Analyzer Using Hash Tables in Main Memory (MAHT) and a Lexical Knowledge Base
Abstract:
This paper presents a morphological analyzer for the Spanish language (MAHT). The system is mainly based on the storage of words and their morphological information, leading to a lexical knowledge base of almost five million words. The lexical knowledge base covers practically the whole range of morphological phenomena of the Spanish language. However, the analyzer handles prefixes and enclitic pronouns with simple rules, since the words that can include these elements are numerous and some of them are neologisms. MAHT reaches an average processing speed of over 275,000 words/second. This is possible because it uses hash tables in main memory. MAHT has been designed to isolate the data from the algorithms that analyze words, even irregular ones. This design is very important for an irregular and highly inflectional language like Spanish, as it simplifies the insertion of new words and the maintenance of program code.
| I-80 |
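For illustration only, the following minimal Python sketch (toy data, hypothetical lexicon entries and enclitic list, not taken from the paper) shows the general idea described above: full word forms and their analyses are stored in an in-memory hash table, and a simple rule strips enclitic pronouns when a form is not found.

```python
# Illustrative sketch of dictionary-based morphological lookup via a hash table.
# The lexicon entries, tags and enclitic list below are invented examples.
LEXICON = {
    "casas": [("casa", "NOUN fem plur")],
    "da": [("dar", "VERB imperative 2sg")],
}

ENCLITICS = ("melo", "me", "lo", "la", "se")  # hypothetical, incomplete list


def analyze(word):
    """Return stored analyses for a word; if the full form is not in the
    hash table, fall back to a simple enclitic-stripping rule."""
    analyses = LEXICON.get(word)
    if analyses is not None:
        return analyses
    for clitic in ENCLITICS:
        if word.endswith(clitic):
            base = LEXICON.get(word[: -len(clitic)])
            if base:
                return [(lemma, tags + " +clitic:" + clitic) for lemma, tags in base]
    return []


print(analyze("casas"))   # stored form: [('casa', 'NOUN fem plur')]
print(analyze("damelo"))  # resolved via the enclitic rule (accents ignored here)
```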
Optimal Stem Identification in Presence of Suffix List
Abstract:
Stemming is considered crucial in many NLP and IR applications. In the absence of any linguistic information, stemming is a challenging task, whereas stemming with a language's suffixes available as linguistic information is comparatively easier. In this work we consider stemming as the process of reducing the lexicon of an unannotated corpus using a suffix set. We prove that the exact lexicon reduction problem is NP-hard and give a polynomial-time approximation. A probabilistic model that minimizes the stem distributional entropy is also proposed for stemming. The performance of these models is analyzed on an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family.
| I-92 |
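As a point of reference, the stem distributional entropy mentioned above can be written in its generic Shannon form; the notation below is assumed for illustration and is not taken from the paper (p(s) stands for the relative frequency of stem s produced by a candidate suffix segmentation of the lexicon).

```latex
\[
  H(S) \;=\; -\sum_{s \in S} p(s)\,\log p(s)
\]
```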
On the Adequacy of Three POS Taggers and a Dependency Parser
Abstract:
A POS tagger can be used in front of a parser to reduce the number of candidate dependency trees, the majority of which are spurious analyses. In this paper we compare the results of adding three morphological taggers to the parser of the CDG Lab. The experimental results show that these models perform better than the model that does not use a morphological tagger, at the cost of losing some correct analyses. In fact, the adequacy of these solutions mainly depends on the compatibility between the lexical units defined by the taggers and the dependency grammar.
| I-104 |
Will the Identification of Reduplicated Multiword Expression (RMWE) Improve the Performance of SVM Based Manipuri POS Tagging?
Abstract:
Reduplicated Multiword Expressions (RMWEs) are abundant in Manipuri, a highly agglutinative Indian language. A Part-of-Speech (POS) tagger for Manipuri using Support Vector Machines (SVM) has been developed and evaluated, and has then been updated with identified RMWEs as an additional feature. The performance of the SVM-based POS tagger before and after adding RMWEs as a feature has been compared: the tagger achieves an F-score of 77.67%, which increases to 79.61% with RMWEs as an additional feature. Thus the performance of the POS tagger improves after adding RMWEs as an additional feature.
| I-117 |
On a Formalisation of Word Order Properties
Abstract:
This paper contains an attempt to formalize the degree of word order freedom for natural languages. It concentrates on the crucial distinction between word order complexity (how difficult it is to parse sentences with more complex word order) and word order freedom (to which extent it is possible to change the word order without causing a change of individual word forms, their morphological characteristics and/or their surface dependency relations). We exemplify this distinction with a pilot study on Czech sentences with clitics.
| I-130 |
Core-periphery organization of graphemes in written sequences: Decreasing positional rigidity with increasing core order
Abstract:
The positional rigidity of graphemes (as well as words considered as single units) in written sequences has been analyzed in this paper using complex network methodology. In particular, the information about adjacent co-occurrence of graphemes in a corpus has been used to construct a network, where the nodes represent the distinct signs used. Core-periphery structure of this network has been uncovered using k-core decomposition technique suitably generalized for directed networks. This allows identification of a core signary or “graphem-ome” of the corresponding writing system, i.e., the group of frequently co-occurring graphemes. The distribution of the frequency with which such signs occur at different positions in a sequence (e.g., at the beginning or at the end or in the middle) shows that while signs belonging to the periphery often occur only at specific positions, those in the innermost cores may occur at many different positions. This is quantified by using a positional entropy measure that shows a systematic increase with core order for the different databases used in this study (corpus of English, Chinese and Sumerian sentences as well as a database of Indus civilization inscriptions).
| I-142 |
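The positional entropy measure mentioned in the abstract above can be illustrated with a short sketch; the exact definition used in the paper is not reproduced here, this is simply the Shannon entropy of a sign's distribution over positional categories (with hypothetical labels 'initial', 'medial', 'final').

```python
import math
from collections import Counter


def positional_entropy(positions):
    """Shannon entropy (bits) of the distribution of positions at which a
    sign occurs; higher entropy means the sign is less positionally rigid."""
    counts = Counter(positions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A peripheral sign that only ever occurs sequence-finally ...
print(positional_entropy(["final"] * 10))                                   # 0.0
# ... versus a core sign occurring in several positions.
print(positional_entropy(["initial", "medial", "final", "medial", "initial"]))
```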
Discovering linguistic patterns using sequence mining
Abstract:
In this paper, we present a method based on data mining techniques to automatically discover linguistic patterns matching appositive qualifying phrases. We develop an algorithm that mines sequential patterns made of itemsets with gap and linguistic constraints. The itemsets allow several kinds of information to be associated with one word; the advantage is the extraction of linguistic patterns with more expressiveness than the usual sequential patterns. In addition, the constraints make it possible to automatically prune irrelevant patterns. In order to manage the set of generated patterns, we propose a solution based on a partial ordering, so that a human user can easily validate them as relevant linguistic patterns. We illustrate the efficiency of our approach on two newspaper corpora.
| I-154 |
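To make the notion of sequential patterns of itemsets with gap constraints concrete, here is a small illustrative Python sketch (toy sentence annotation and a hypothetical gap limit, not the authors' algorithm): a pattern matches if each of its itemsets is included in some token's itemset, with at most max_gap skipped tokens between consecutive matches.

```python
def matches(pattern, sequence, max_gap=2):
    """Check whether `pattern` (a list of itemsets) occurs in `sequence`
    (one itemset of word-level annotations per token), allowing at most
    `max_gap` skipped tokens between consecutive pattern itemsets."""
    def search(p_idx, s_idx):
        if p_idx == len(pattern):
            return True
        # try to place pattern itemset p_idx at positions s_idx .. s_idx + max_gap
        for pos in range(s_idx, min(s_idx + max_gap + 1, len(sequence))):
            if pattern[p_idx] <= sequence[pos]:          # itemset inclusion
                if search(p_idx + 1, pos + 1):
                    return True
        return False
    return search(0, 0)


# Hypothetical annotated sentence: each token carries its word form and POS tag.
sentence = [{"the", "DET"}, {"president", "NOUN"}, {",", "PUNCT"},
            {"elected", "VERB"}, {"in", "ADP"}, {"2005", "NUM"}]
# Pattern: a noun followed (within the gap limit) by a verb form.
print(matches([{"NOUN"}, {"VERB"}], sentence))   # True
```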
What About Sequential Data Mining Techniques to Identify Linguistic Patterns for Stylistics?
Abstract:
In this paper, we study the use of data mining techniques for stylistic analysis, from a linguistic point of view, by considering emerging sequential patterns. First, we show that mining sequential patterns of words with gap constraints gives new relevant linguistic patterns with respect to patterns built on state-of-the-art n-grams. Then, we investigate how sequential patterns of itemsets can provide more generic linguistic patterns. We validate this approach from a linguistic point of view by conducting experiments on three corpora of various types of French texts (poetry, letters, and fiction). By considering more particularly poetic texts, we show that characteristic linguistic patterns can be identified using data mining techniques. We also discuss how our current approach based on sequential pattern mining can be improved to be used more efficiently for linguistic analyses.
| I-166 |
Resolving Syntactic Ambiguities in NL Specification of Constraints Using UML Class Model
Abstract:
Translating constraint specifications expressed in a natural language (NL) into a formal constraint language is currently limited to 65%-70% accuracy. In the NL2OCL project, we aim to translate English specifications of constraints into formal constraints such as OCL (Object Constraint Language). Our semantic analyzer uses the output of the Stanford POS tagger and the Stanford Parser, employed for syntactic analysis of the English specification. However, in a few cases, the Stanford POS tagger and parser are not able to handle particular syntactic ambiguities in the English specification. In this paper, we highlight the identified cases of syntactic ambiguity and present a novel technique to resolve them automatically. By addressing these cases, we can generate more accurate and complete formal specifications such as OCL.
| I-178 |
A Computational Grammar of Sinhala
Abstract:
A computational grammar for a language is a very useful resource for carrying out various language processing tasks for that language, such as grammar checking, machine translation and question answering. As is the case in most South Indian languages, Sinhala is a highly inflected language with three gender forms and two number forms, among other grammatical features. While piecemeal descriptions of Sinhala grammar are reported in the literature, no comprehensive effort has been made to develop a context-free grammar (CFG) that accounts for any significant coverage of the language. This paper describes the development of a feature-based CFG for non-trivial sentences in Sinhala. The resulting grammar covers a significant subset of Sinhala as described in a well-known grammar book. A parser for producing the appropriate parse tree(s) of input sentences was also developed using the NLTK toolkit. The grammar also detects, and therefore rejects, ungrammatical sentences. Two hundred sample sentences taken from primary grade Sinhala grammar books were used to test the grammar, which covered 60% of these sentences.
| I-188 |
Automatic Identification of Persian Light Verb Constructions
Abstract:
Automatic identification of multiword expressions (MWEs) is important for the development of many Natural Language Processing (NLP) applications, such as machine translation. Light verb constructions (LVCs) are a type of verb-based MWEs, in which a semantically-light basic verb is combined with another word to form a complex predicate, e.g., take a walk. Our focus here is on the automatic detection of Persian LVCs. This is a particularly challenging and important problem since (i) LVCs are very common and highly productive in Persian, and they greatly outnumber simple verbs; and (ii) to our knowledge there has not been much work on their automatic identification.
| I-201 |
Word Sense Disambiguation and Named Entity Recognition
A Cognitive Approach to Word Sense Disambiguation
Abstract:
In this paper, an unsupervised, knowledge-based, parametric approach to Word Sense Disambiguation is proposed, based on the well-known cognitive architecture ACT-R. A Spreading Activation Network is built from the chunks and their relations in the declarative memory system of ACT-R, and the lexical representation is achieved by integrating WordNet with the cognitive architecture. The target word is disambiguated based on the surrounding context words using an accumulator model of word sense disambiguation, realized by incorporating RACE/A (Retrieval by ACcumulating Evidence in an architecture) with ACT-R 6.0. The implementation is a partial representation of human semantic processing in a computer system, as it supports the local processing assumption of the extended theory by Collins et al. (1975) added to the basic theory proposed by Quillian (1962, 1967). The resulting Word Sense Disambiguation system is evaluated on the test data set of the English Lexical Sample task of Senseval-2, and the overall accuracy of the proposed algorithm outperforms that of all participating Word Sense Disambiguation systems.
| I-211 |
A graph-based approach to WSD using Relevant Semantic Trees and N-Cliques model
Abstract:
In this paper we propose a new graph-based approach to solve semantic ambiguity using a built in semantic net based on WordNet. Our proposal uses an adaptation of the Clique Partitioning Technique to extract sets of strongly related senses. The initial graph is obtained from senses of WordNet combined with the information of several semantic categories from different resources: WordNet Domains, SUMO and WordNet Affect. In order to obtain the most relevant concepts in a sentence we use the Relevant Semantic Trees method. The evaluation of the results has been conducted using the test data set of Senseval-2 obtaining promising results.
| I-225 |
Using Wiktionary to Improve Lexical Disambiguation in Multiple Languages
Abstract:
This paper proposes using linguistic knowledge from Wiktionary to improve lexical disambiguation in multiple languages, focusing on part-of-speech tagging in selected languages with various characteristics including English, Vietnamese, and Korean. Dictionaries and subsumption networks are first automatically extracted from Wiktionary. These linguistic resources are then used to enrich the feature set of training examples. A first-order discriminative model is learned on training data using Hidden Markov-Support Vector Machines. The proposed method is competitive with related contemporary works in the three languages. In English, our tagger achieves 96.37% token accuracy on the Brown corpus, with an error reduction of 2.74% over the baseline.
| I-238 |
Organization Name Disambiguity
Abstract:
Twitter is a widespread social medium which has rapidly gained worldwide popularity. Addressing the problem of finding tweets related to a given organization, we propose supervised and semi-supervised methods. This is a challenging task due to potential organization name ambiguity, and tweets and organization descriptions contain little information. Moreover, the organizations in the training data are different from those in the test data, which means we cannot train a classifier for a particular organization. Therefore, we introduce external resources to enrich the information about each organization. Supervised and semi-supervised methods are adopted in two stages to classify the tweets, in an attempt to utilize both training and test data for this specific task. Our experimental results on WePS-3 are preliminary but encouraging; they show that the proposed techniques are effective for this task.
| I-249 |
Optimizing CRF-based Model for Proper Name Recognition in Polish Texts
Abstract:
In this paper we present some optimizations of a Conditional Random Fields-based model for proper name recognition in Polish running text. The proposed optimizations concern word-level segmentation problems, gazetteer extension, gazetteer- and wordnet-based features, feature construction and selection, and finally the combination of general knowledge sources with the statistical model. The problem of proper name recognition is limited to the recognition of person first names and surnames, and names of countries, cities and roads. The evaluation is performed in two ways: a single-domain evaluation using cross-validation on a Corpus of Stock Exchange Reports and a part of InfiKorp, and a cross-domain evaluation on a Corpus of Economic News. The combination of the proposed optimizations improved the final result from 94.53% to ca. 96.00% F-measure for the single-domain evaluation and from 70.86% to 79.30% for the cross-domain evaluation.
| I-258 |
Methods of Estimating the Number of Clusters for Person Cross Document Coreference Tasks
Abstract:
Knowing the number of different individuals carrying the same name may improve the overall accuracy of a Person Cross Document Coreference system, which processes large corpora and clusters name mentions according to the individuals carrying them. In this paper we present a series of methods for estimating this number. In particular, an estimation method based on name perplexity, which brings a large improvement over the baseline given by the gap statistic, is instrumental in reaching accurate clustering results: not only can it predict the number of clusters with very good confidence, but it can also indicate which type of clustering method works best for each particular name.
| I-270 |
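For context, perplexity (the notion underlying the name-perplexity estimate mentioned above) is the standard quantity below; how the distribution p over the mentions of a name is estimated in the paper is not reproduced here.

```latex
\[
  \mathrm{PP}(p) \;=\; 2^{H(p)} \;=\; 2^{-\sum_{i} p_i \log_2 p_i}
\]
```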
Coreference Resolution using Tree-CRF
Abstract:
Coreference resolution is the task of identifying which noun phrases or mentions refer to the same real-world entity in a text or a dialogue. It is an essential task in many NLP applications such as information extraction, question answering, summarization, machine translation and information retrieval. Coreference resolution is traditionally treated as a pairwise classification problem, with different classification techniques used to make local classification decisions. We instead use a Tree-CRF for this task, which makes a joint prediction of the anaphor and the antecedent. Tree-based Reparameterization (TRP) is used for approximate inference during parameter learning; TRP performs exact computation over spanning trees of the full graph, which helps in learning long-distance dependencies, and the approximate inference converges well. We use the parse trees from OntoNotes, released for the CoNLL 2011 shared task, and derive features from them. We experiment with data from different genres, and the results are encouraging.
| I-285 |
Detection of Arabic Entity Graphs using Morphology, Finite State Machines, and Graph Transformations
Abstract:
Research on automatic recognition of named entities from Arabic text uses techniques that work well for Latin-based languages, such as local grammars, statistical learning models, pattern matching, and rule-based techniques. These techniques boost their results by using application-specific corpora, parallel language corpora, and morphological stemming analysis. We propose a method for extracting entities, events, and relations among them from Arabic text using a hierarchy of finite state machines driven by morphological features such as part-of-speech and gloss tags, together with graph transformation algorithms. We evaluated our method on three natural language processing applications. We automated the extraction of narrators and narrator relations from several corpora of Islamic narration books (hadith). We automated the extraction of genealogical family trees from Biblical texts. Finally, we automated locating and extracting individual biographies from historical biography books. In all applications, our method reports high precision and recall and learns lemmas about phrases that improve the results.
| I-297 |
Integrating Rule-based System with Classification for Arabic Named Entity Recognition
Abstract:
Named Entity Recognition (NER) is a subtask of information extraction that seeks to recognize and classify named entities in unstructured text into predefined categories such as the names of persons, organizations, locations, etc.
The majority of researchers have used machine learning, while a few have used handcrafted rules to solve the NER problem. We focus here on NER for the Arabic language (NERA), an important language with its own distinct challenges. This paper proposes a simple method for integrating machine learning with rule-based systems and implements this proposal using a state-of-the-art rule-based system for NERA. Experimental evaluation shows that our integrated approach increases the F-measure by 8 to 14% when compared to the original (pure) rule-based system and the (pure) machine learning approach, and the improvement is statistically significant across different datasets. More importantly, our system outperforms the state-of-the-art machine learning system for NERA on a benchmark dataset.
| I-311 |
Semantics and Discourse
Space projections as distributional models for semantic composition
Abstract:
The representation of word meaning in texts using empirical distributional methods is a central problem in Computational Linguistics, and it has become increasingly popular in cognitive science. Several approaches account for the meaning of syntactic structures by combining words according to algebraic operators (e.g. tensor product) acting over lexical vectors. In this paper, a novel approach to semantic composition based on space reduction techniques (e.g. Singular Value Decomposition) over geometric lexical representations is proposed. In line with Frege's context principle, the meaning of a phrase or a sentence is modeled in terms of the subset of properties shared by the co-occurring words. In the geometric perspective pursued here, syntactic bi-grams or tri-grams are projected into the so-called Support Subspace, in order to emphasize the common semantic features that better capture phrase-specific aspects of the involved lexical meanings. The capability of this model to capture compositional semantic information is confirmed by the state-of-the-art results achieved in a phrase similarity task used as a benchmark for this class of methods.
| I-323 |
Distributional Models and Lexical Semantics in Convolution Kernels
Abstract:
The representation of word meaning in texts is a central problem in Computational Linguistics. Geometrical models represent lexical semantic information in terms of the basic co-occurrences that words establish with each other in large-scale text collections. As recent work has already pointed out, the definition of methods able to express the meaning of phrases or sentences as operations on lexical representations is a complex problem and still a largely open issue. In this paper, a perspective centered on Convolution Kernels is discussed, and the formulation of a Partial Tree Kernel that integrates the syntactic information of sentences with the generalization of their lexical information is presented. The interaction of such information and the role of different geometrical models is investigated on the question classification task, where the state-of-the-art result is achieved.
| I-336 |
Multiple Level of Referents in Information State
Abstract:
As we strive for sophisticated machine translation and reliable information extraction, we have launched a subproject pertaining to the practical elaboration of “intensional” levels of discourse referents in the framework of a representational dynamic discourse semantics, the DRT-based [12] ReALIS [2], and the implementation of resulting representations within a complete model of communicating interpreters’ minds as it is captured formally in ReALIS by means of functions alpha, sigma, lambda and kappa [4]. We show analyses of chiefly Hungarian linguistic data, which range from revealing complex semantic contribution of small affixes through pointing out the multiply intensional nature of certain (pre)verbs to studying the embedding of whole discourses in information state. An outstanding advantage of our method, due to our theoretical basis, is that not only sentences / discourses are assigned semantic representations but relevant factors of speakers’ information states can also be revealed and implemented.
| I-349 |
Inferring the Scope of Negation in Biomedical Documents
Abstract:
In the last few years, negation detection systems for biomedical texts have been developed successfully. In this paper we present a system that finds and annotates the scope of negation in English sentences. It infers which words are affected by negations by browsing dependency syntactic structures: first, a greedy algorithm detects negation cues (a negation cue being the lexical marker that expresses negation), such as no or not; second, the scope of these negation cues is computed. We tested the system on the BioScope corpus, annotated with negation, obtaining better results than other published approaches.
| I-363 |
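As a rough illustration of scope detection over dependency structures (a generic subtree heuristic, not the algorithm of the paper), the Python sketch below takes the dependency subtree rooted at a negation cue's head as the cue's scope; the toy sentence and head indices are invented.

```python
# Generic illustration only: approximate a negation cue's scope by the
# dependency subtree of the cue's head.  Toy sentence, invented head indices.
def scope_of_cue(heads, tokens, cue_index):
    """heads[i] is the index of token i's head (-1 for the root)."""
    head = heads[cue_index]
    scope = set()

    def collect(node):
        scope.add(node)
        for child, h in enumerate(heads):
            if h == node:
                collect(child)

    if head >= 0:
        collect(head)
    scope.discard(cue_index)          # the cue itself is usually excluded
    return [tokens[i] for i in sorted(scope)]


tokens = ["The", "drug", "did", "not", "reduce", "the", "symptoms"]
#           0      1       2      3       4        5        6
heads  = [   1,     4,      4,     4,     -1,       6,       4   ]
print(scope_of_cue(heads, tokens, cue_index=3))
```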
LDA-Frames: an Unsupervised Approach to Generating Semantic Frames
Abstract:
In this paper we introduce a novel approach to identifying semantic frames from semantically unlabelled text corpora. There are many frame formalisms, but most of them suffer from the problem that all frames must be created manually and the set of semantic roles must be predefined. The LDA-frames approach, based on Latent Dirichlet Allocation, avoids both of these problems by employing statistics over a syntactically tagged corpus. The only information that must be given is the number of semantic frames and the number of semantic roles to be identified. The power of LDA-frames is first shown on a small sample corpus and then on the British National Corpus.
| I-373 |
Unsupervised Acquisition of Axioms to Paraphrase Noun Compounds and Genitives
Abstract:
A predicate is usually omitted from text when it is highly predictable from the context. This omission is due to the effort optimization that humans perform during the language generation process: authors omit the information that they know the addressee is able to recover effortlessly. Most noun-noun structures, including genitives and compounds, are the result of this process. The goal of this work is to generate, automatically and without supervision, the paraphrases that make explicit the omitted predicate in these noun-noun structures. The method is general enough to also address the cases where the components are Named Entities. The resulting paraphrasing axioms are necessary for recovering the semantics of a text and are therefore useful for applications such as Question Answering.
| I-388 |
Age-Related Temporal Phrases in Spanish and Italian
Abstract:
This paper reports research on temporal expressions. The analyzed phrases include a common temporal expression for a period of years reinforced by an adverb of time. Some of those phrases are age-related expressions. We analyzed samples obtained from the Internet for Spanish and Italian to determine appropriate annotations for marking up text and possible translations. We present the results for a group of selected classes.
| I-402 |
Can Modern Statistical Parsers Lead to Better Natural Language Understanding for Education?
Abstract:
We use state-of-the-art parsing technology to build GeoSynth – a system that can automatically solve word problems in geometric constructions. Through our experiments we show that even though off-the-shelf parsers perform poorly on texts containing specialized vocabulary and long sentences, appropriate preprocessing of text before applying the parser and use of extensive domain knowledge while interpreting the parse tree can together help us circumvent parser errors and build robust domain specific natural language understanding modules useful for various educational applications.
| I-415 |
Exploring Classification Concept Drift on a Large News Text Corpus
Abstract:
Concept drift has regained research interest in recent years, as many applications use data sources that change over time. We study the classification task using logistic regression on a large news collection of 248K texts spanning a period of 13 years. We present extrinsic methods of concept drift detection and quantification using training set formation with different windowing techniques. On our corpus, we characterize concept drift and show how classifier performance is overestimated if it is neglected. We lay out paths for future work, where we plan to refine the characterization methods and investigate the drifting of learning parameters when few examples are available.
| I-428 |
An Empirical Study of Recognizing Textual Entailment in Japanese Text
Abstract:
Recognizing Textual Entailment (RTE) is a fundamental task in Natural Language Understanding: deciding whether the meaning of one text can be inferred from that of another. In this paper, we conduct an empirical study of the RTE task for Japanese, adopting a machine-learning-based approach. We quantitatively analyse the effects of various entailment features and the impact of RTE resources on the performance of an RTE system. This paper also investigates the use of Machine Translation for the RTE task and determines whether Machine Translation can be used to improve the performance of our RTE system. Experimental results on benchmark data sets show that our machine-learning-based RTE system outperforms a baseline based on lexical matching, and suggest that the Machine Translation component can be utilized to improve the performance of the RTE system.
| I-438 |
Best paper award, 1st place:
Automated Detection of Local Coherence in Short Essays Based on Centering Theory
Abstract:
We describe in this paper an automated method for assessing the local coherence of short essays. Our analysis relies on one analytical feature of essay quality, Continuity, which is meant to measure the local coherence of essays based on human judgements. We then use the notions of local and global coherence from Grosz and Sidner's (1986) theory of discourse and Grosz, Joshi, and Weinstein's (1995) Centering Theory to automatically measure local coherence and compare it to the human judgments of Continuity. Center concepts for each paragraph are detected, and a decision is made whether the paragraph is locally coherent based on the number of dominant center concepts it has. Experiments on a corpus of 184 essays, approximately equally distributed among three essay prompts, show promising results.
| I-450 |
A Symbolic Approach for Automatic Detection of Nuclearity and Rhetorical Relations among Intra-sentence Discourse Segments in Spanish
Abstract:
Automatic discourse analysis is currently a very prominent research topic, since it is useful for developing several applications, such as automatic summarization, automatic translation and information extraction. Rhetorical Structure Theory (RST) is the most widely employed theory; nevertheless, there are not many studies on this subject for Spanish. In this paper we present the first system assigning nuclearity and rhetorical relations to intra-sentence discourse segments in Spanish texts. To carry out the research, we analyze the learning corpus of the RST Spanish Treebank, a corpus of manually annotated specialized texts, in order to build a list of lexical and syntactic patterns marking rhetorical relations. To implement the system, we use this list of patterns and a discourse segmenter called DiSeg. To evaluate the system, we apply it to the test corpus of the RST Spanish Treebank and compare the automatic and the manual rhetorical analyses of each sentence by means of recall and precision, obtaining positive results.
| I-462 |
Sentiment Analysis, Opinion Mining, and Emotions
Feature Specific Sentiment Analysis for Mixed Product Reviews
Abstract:
In this paper, we present a novel approach to identifying feature-specific expressions of opinion in product reviews with different features and mixed emotions. The objective is realized by identifying a set of potential features in the review and extracting opinion expressions about those features by exploiting their associations. Taking advantage of the view that more closely associated words come together to express an opinion regarding a certain feature, dependency parsing is used to identify relations between the opinion expressions. The system learns the set of significant relations to be used by dependency parsing and a threshold parameter which allows us to merge closely associated opinion expressions. The data requirement is minimal, as this is a one-time learning of the domain-independent parameters. The associations are represented in the form of a graph, which is partitioned to finally retrieve the opinion expression describing the user-specified feature. We show that the system achieves high accuracy across all domains and performs on par with state-of-the-art systems despite its data limitations.
| I-475 |
Biographies or Blenders: Which Resource is Best for Cross-Domain Sentiment Analysis?
Abstract:
Domain adaptation is usually discussed from the point of view of new algorithms that minimise performance loss when applying a classifier trained on one domain to another. However, finding pertinent data similar to the test domain is equally important for achieving high accuracy in a cross-domain task. This study proposes an algorithm for automatic estimation of performance loss in the context of cross-domain sentiment classification. We present and validate several measures of domain similarity specifically designed for the sentiment classification task. We also introduce a new characteristic, called domain complexity, as another independent factor influencing performance loss, and propose various functions for its approximation. Finally, a linear regression for modeling accuracy loss is built and tested in different evaluation settings. As a result, we are able to predict the accuracy loss with an average error of 1.5% and a maximum error of 3.4%.
| I-488 |
A Generate-and-Test Method of Detecting Negative-Sentiment Sentences
Abstract:
Sentiment analysis requires human effort to construct clue lexicons and/or annotations for machine learning, which are considered domain-dependent. This paper presents a sentiment analysis method in which clues are learned automatically from minimal training data at the sentence level. The main strategy is to learn and weight sentiment-revealing clues by first generating a maximal set of candidates from the annotated sentences for maximum recall, and then learning a classifier using linguistically motivated composite features at a later stage for higher precision. The proposed method is geared toward detecting negative-sentiment sentences, as they are not appropriate for suggesting contextual ads. We show how clue-based sentiment analysis can be done without having to assume the availability of a separately constructed clue lexicon. Our experimental work with both Korean and English news corpora shows that the proposed method outperforms word-feature-based SVM classifiers. The result is encouraging, especially because this relatively simple method can be used for documents in new domains and time periods for which sentiment clues may vary.
| I-500 |
Role of Event Actors and Sentiment Holders in Identifying Event-Sentiment Association
Abstract:
In this paper, we propose an approach for identifying the roles of event actors and sentiment holders from the perspective of event-sentiment relations within the TimeML framework. A bootstrapping algorithm has been used for identifying the association between event and sentiment expressions on the event-annotated TempEval-2 dataset. The algorithm consists of lexical keyword spotting and co-reference approaches, and identifies the association between event and sentiment expressions in the same or different text segments. Guided by the classical definitions of events in the TempEval-2 shared task, a manual evaluation was attempted to distinguish the sentiment events from the factual events, and the agreement was satisfactory. In order to computationally estimate the different sentiments associated with different events, we propose to incorporate the knowledge of event actors and sentiment holders. To identify the roles of event actors and sentiment holders, we have developed two different systems. In the case of event actor identification, the baseline model is developed based on the subject information of the dependency-parsed event sentences. Next, to improve the performance of the baseline model, we use an open-source Semantic Role Labeler (SRL), and then develop an unsupervised syntax-based model. The syntactic model is based on the relationship of the event verbs with their argument structures, extracted from the head information of the chunks in the parsed sentences. Similarly, the baseline model for identifying sentiment holders is developed based on the subject information of the dependency-parsed sentences. An unsupervised syntax-based model with Named Entity (NE) clues is prepared based on the relationship of the verbs with their argument structures extracted from the dependency-parsed sentences. Additionally, we have used the open-source JavaRAP (Resolution of Anaphora Procedure) tool for identifying the anaphoric presence of the event actors and sentiment holders. This tool takes the sentences of the TempEval-2 corpus as input and generates a list of anaphora-antecedent pairs as output, as well as an in-place annotation or substitution of the anaphors with their antecedents. We deduce chains of event actors and sentiment holders based on their anaphoric presence in text sentences. From our experiments, we observed that lexical equivalence between event and sentiment expressions easily identifies the entities that serve as both event actors and sentiment holders. If the event and sentiment expressions occupy different text segments, the identification of their corresponding event actors and sentiment holders needs the knowledge of parsed dependency relations, Named Entities, cause-effect and other rhetorical relations, along with the anaphors. The manual evaluation produces satisfactory results on the test documents of the TempEval-2 shared task for identifying the many-to-many associations between event actors and sentiment holders for a specific event.
| I-513 |
Applying Sentiment and Social Network Analysis in User Modeling
Abstract:
The idea of applying a combination of sentiment and social network analysis to improve the performance of applications has attracted little attention. In widely used online shopping websites, customers can provide reviews about a product, and a number of relations like friendship, trust and similarity between products or users are formed. In this paper a combination of sentiment analysis and social network analysis is employed for extracting classification rules for each customer. These rules represent the customer's preferences for each cluster of products and can be seen as a user model. The combination helps the system to classify products based on the customer's interests. We compared the results of our proposed method with a base method using no social network analysis. The results show an improvement of 8 percent in precision and 5 percent in recall over the base system in the accuracy of the classification rules.
| I-526 |
The 5W Structure for Sentiment Summarization-Visualization-Tracking
Abstract:
In this paper we address the Sentiment Analysis problem from the end user's perspective. An end user might desire an automated at-a-glance presentation of the main points made in a single review, or of how opinion changes over time across multiple documents. To meet this requirement we propose a relatively generic 5W structurization of opinions, which is further used for textual and visual summarization and tracking. The 5W task seeks to extract the semantic constituents of a natural language sentence by distilling it into the answers to the 5W questions: Who, What, When, Where and Why. The visualization system lets users generate sentiment tracking with a textual summary and a sentiment-polarity graph based on any dimension or combination of dimensions they want, i.e. "Who" are the actors and "What" is their sentiment regarding any topic, "When" and "Where" the sentiment changed, and "Why" it changed.
| I-540 |
Naive Bayes Classifiers in opinion mining applications. In search of the best feature set
Abstract:
The paper focuses on how Naive Bayes classifiers work in opinion mining applications. The first question is which feature sets to choose when training such a classifier in order to obtain the best results when the classifier is later used to classify other objects (in this case, texts). The second question is whether combining the results of Naive Bayes classifiers trained on different feature sets affects the final results in a positive way. (All tests were made on two databases consisting of negative and positive movie reviews: one used for training the classifiers and the other for testing them.)
| I-556 |
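The second question above (combining Naive Bayes classifiers trained on different feature sets) can be sketched as follows; this is a toy illustration with invented data and scikit-learn, not the paper's experimental setup. Here the two feature sets are simply unigrams and bigrams, and the combination is an average of class probabilities.

```python
# Minimal sketch: two Naive Bayes classifiers on different feature sets,
# combined by averaging per-class probabilities.  Toy stand-in data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np

train_texts = ["great acting and a moving story", "dull plot and terrible acting",
               "a wonderful, touching film", "boring and predictable"]
train_labels = np.array([1, 0, 1, 0])          # 1 = positive, 0 = negative
test_texts = ["a wonderful story", "terrible and boring plot"]

models = []
for ngram_range in [(1, 1), (2, 2)]:           # feature set 1: unigrams, 2: bigrams
    vec = CountVectorizer(ngram_range=ngram_range)
    clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
    models.append((vec, clf))

# Combine the classifiers by averaging their predicted class probabilities.
probs = np.mean([clf.predict_proba(vec.transform(test_texts))
                 for vec, clf in models], axis=0)
print(probs.argmax(axis=1))                    # predicted sentiment per test text
```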
A Domain Independent Framework to Extract and Aggregate Analogous Features in Reviews
Abstract:
Extracting and detecting user-mentioned features in reviews without domain knowledge is a challenge. Moreover, people express their opinions on the same feature of a product or service in various lexical forms, so it is also important to identify terms expressing similar features and opinions in order to build an effective feature-based opinion mining system. In this paper, we present a novel framework to automatically detect, extract and aggregate semantically related features of reviewed products and services. Our approach combines a double-propagation-based algorithm to detect candidate features in reviews with a probabilistic generative model to aggregate ratable features. The model uses sentence-level syntactic and lexical information to detect candidate feature words, and corpus-level co-occurrence statistics to group semantically similar features, obtaining high-precision feature detection. The results of our model outperform existing state-of-the-art probabilistic models. Our model also shows a distinct advantage over double propagation by grouping similar features together, making them easy and quick for users to perceive. We evaluate our model in two completely unrelated domains, restaurants and cameras, to verify its domain independence.
| I-568 |
Learning Lexical Subjectivity Strength for Chinese Opinionated Sentence Identification
Abstract:
This paper presents a fuzzy set based approach to automatically learn lexical subjectivity strength for Chinese opinionated sentence identification. To approach this task, the log-linear probability is employed to extract a set of subjective words from opinionated sentences, and three fuzzy sets, namely low-strength subjectivity, medium-strength subjectivity and high-strength subjectivity, are then defined to represent their respective classes of subjectivity strength. Furthermore, three membership functions are built to indicate the degrees of subjective words in different fuzzy sets. Finally, the acquired lexical subjective strength is exploited to perform subjectivity classification. The experimental results on the NTCIR-7 MOAT data demonstrate that the introduction of lexical subjective strength is beneficial to opinionated sentence identification.
| I-580 |
Building Subjectivity Lexicon(s) From Scratch For Essay Data
Abstract:
While there are a number of subjectivity lexicons available for research purposes, none can be used commercially. We describe the process of constructing subjectivity lexicon(s) for recognizing sentiment polarity in essays written by test-takers, to be used within a commercial essay-scoring system. We discuss ways of expanding a manually built seed lexicon using dictionary-based, distributional in-domain and out-of-domain information, as well as using Amazon Mechanical Turk to help "clean up" the expansions. We show the feasibility of constructing a family of subjectivity lexicons from scratch using a combination of methods to attain competitive performance with state-of-the-art research-only lexicons. Furthermore, this is, to our knowledge, the first use of a paraphrase generation system for expanding a subjectivity lexicon.
| I-591 |
Emotion Ontology Construction From Chinese Knowledge
Abstract:
An emotion ontology is created from a Chinese dictionary for human-machine interaction. The construction of the emotion ontology includes affective word annotation and semi-automatic extraction of an emotion predicate hierarchy. More than 50 affective categories are extracted, and about 5,000 nouns and adjectives and 2,000 verbs are categorized into the predicate hierarchy.
| I-603 |
Springer LNCS 7182
Natural Language Generation
Exploring Extensive Linguistic Feature Sets in Near-synonym Lexical Choice
Abstract:
In the near-synonym lexical choice task, the best alternative out of a set of near-synonyms is selected to fill a lexical gap in a text. Lexical choice is an important subtask in natural language generation systems, such as machine translation and question-answering. In this paper we experiment with an extensive set of over 650 linguistic features to represent the context of a word, and with a range of machine learning approaches to the lexical choice task. We extend previous work by experimenting with unsupervised and semi-supervised methods, and use automatic feature selection to handle the problems arising from the rich feature set. It is natural to think that linguistic analysis of the word context would yield almost perfect performance in the task, but we show in this paper that too many features, even linguistic ones, introduce noise into the classification and make the task too difficult for unsupervised and semi-supervised methods. We also show that purely syntactic features play the biggest role in the performance, but that certain semantic and morphological features are needed as well.
| II-1 |
Abduction in Games for a Flexible Approach to Document Planning
Abstract:
We propose a new approach to document planning that considers this process as strategic interaction between the generation component and a user model. The core task of the user model is abductive reasoning about the usefulness of rhetorical relations for the document plan with respect to the user's information requirements. Since the different preferences of the generation component and the user model are defined by parameterized utility functions, we achieve a highly flexible approach to the generation of document plans for different users. We apply this approach to the generation of reports on performance data. A questionnaire-based evaluation corroborates the assumptions made in the model.
| II-13 |
Machine Translation and Multilingualism
Invited paper:
Document-Specific Statistical Machine Translation for Improving Human Translation Productivity
Abstract:
We present two long-term studies of the productivity of human translators obtained by augmenting an existing Translation Memory system with Document-Specific Statistical Machine Translation. While the MT post-editing approach represents a significant change to the current practice of human translation, the two studies demonstrate a significant increase in the productivity of human translators: on the order of 50% in the first study and 68% in the second study, conducted a year later. Both studies used a pool of 15 translators and concentrated on English-Spanish translation of IBM content in a production Translation Services Center.
| II-25 |
Minimum Bayes Risk Decoding With Enlarged Hypothesis Space in System Combination
Abstract:
This paper describes a new system combination strategy for Statistical Machine Translation. Tromble et al. introduced the evidence space into Minimum Bayes Risk decoding in order to quantify the relative performance within lattice or n-best output with regard to the 1-best output. In contrast, our approach enlarges the hypothesis space in order to incorporate the combinatorial nature of MBR decoding. In this setting, we run experiments on the ES-EN, JP-EN and FR-EN language pairs. The improvement of our approach on ES-EN JRC-Acquis was 0.50 BLEU points absolute and 1.9% relative compared to standard confusion-network-based system combination without hypothesis expansion, and 2.16 BLEU points absolute and 9.2% relative compared to the best single system. The improvement for JP-EN was 0.94 points absolute and 3.4% relative, and for FR-EN it was 0.30 points absolute and 1.3% relative.
| II-40 |
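For reference, the standard Minimum Bayes Risk decision rule that this line of work builds on is shown below in generic notation (hypothesis space H, evidence space E, loss L such as 1 - BLEU, source sentence f); the paper's specific enlargement of H is not reproduced here.

```latex
\[
  \hat{e} \;=\; \operatorname*{arg\,min}_{e' \in \mathcal{H}}
                \sum_{e \in \mathcal{E}} L(e, e')\, P(e \mid f)
\]
```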
Phrasal Syntactic Category Sequence Model for Phrase-based MT
Abstract:
Incorporating target syntax into phrase-based machine translation (PBMT) can generate syntactically well-formed translations. We propose a novel phrasal syntactic category sequence (PSCS) model which allows a PBMT decoder to prefer more grammatical translations. We parse all the sentences on the target side of the bilingual training corpus, assign a syntactic category to each phrase pair in the standard phrase pair extraction procedure, and build a PSCS model from the parallel training data. Then, we log-linearly incorporate the PSCS model into a standard PBMT system. Our method is very simple and yields a 0.7 BLEU point improvement over the baseline PBMT system.
| II-52 |
Integration of a Noun Compound Translator tool with Moses for English-Hindi Machine Translation and Evaluation
Abstract:
Noun compounds are a frequently occurring multiword expression in written English texts. English noun compounds are translated into varied syntactic constructs in Hindi. The performance of existing translation systems makes it clear that there is no satisfactorily efficient English-to-Hindi noun compound translation tool, although the need for one is unprecedented in the context of machine translation. In this paper we integrate Noun Compound Translator (Mathur & Paul 2009), a statistical tool for noun compound translation, with the state-of-the-art machine translation tool Moses (Koehn et al. 2007). We evaluate the integrated system on test data of 300 source-language sentences which contain noun compounds and were translated manually into Hindi. A gain of 29% in BLEU score and 27% in human evaluation has been observed on the test data.
| II-60 |
Neoclassical compound alignment from comparable corpora
Abstract:
The paper deals with the automatic compilation of a bilingual dictionary from specialized comparable corpora. We concentrate on a method to automatically extract and align neoclassical compounds in two languages from comparable corpora, assuming that neoclassical compounds translate compositionally into neoclassical compounds from one language to another. The method covers the two main forms of neoclassical compounds and is split into three steps: extraction, generation, and selection. Our program takes as input a list of aligned neoclassical elements and a bilingual dictionary in the two languages. We also align neoclassical compounds through a pivot-language approach, based on the hypothesis that the neoclassical element remains stable in meaning across languages. We experiment with four languages (English, French, German, and Spanish) using corpora in the domain of renewable energy, and obtain a precision of 96%.
| II-72 |
QAlign: A new method for bilingual lexicon extraction from comparable corpora
Abstract:
This paper presents a new way of looking at the problem of bilingual lexicon extraction from comparable corpora, mainly inspired by the information retrieval (IR) domain and more specifically by question-answering systems (QAS). By analogy with QAS, we consider a term to be translated as part of a question extracted from the source language, and we try to find the right translation assuming that it should be contained in the right answer extracted from the target language. The methods traditionally dedicated to the task of bilingual lexicon extraction from comparable corpora tend to represent the whole set of contexts of a term in a single vector and thus give a general representation of all the contexts. We believe that a local representation of the context of a term, given by each query, is more appropriate, as we give more importance to local information that could be swallowed up in the volume if represented and treated in a single context vector. We show that the empirical results obtained are competitive with the standard approach traditionally dedicated to this task.
| II-83 |
Aligning the un-alignable --- a pilot study using a noisy corpus of nonstandardized, semi-parallel texts
Abstract:
We present the outline of a robust, precision-oriented alignment method that deals with a corpus of comparable texts without standardized spelling or sentence boundary marking. The method identifies comparable sequences across a source and target text using a bilingual dictionary, uses various methods to assign a confidence score, and only keeps the highest-scoring sequences. For comparison, a conventional alignment is done with heuristic sentence splitting beforehand. Both methods are evaluated over transcriptions of two historical documents in different Early New High German dialects, and the method developed is found to outperform the competing one by a large margin.
| II-97 |
Parallel corpora for WordNet construction: machine translation vs. automatic sense tagging
Abstract:
In this paper we present a methodology for WordNet construction based on the exploitation of parallel corpora with semantic annotation on the English source text. We are using this methodology for the enlargement of the Spanish and Catalan WordNet 3.0, but it can be used for other languages. As large parallel corpora with semantic annotation are not usually available, we explore two strategies to overcome this problem: using monolingual sense-tagged corpora and machine translation, on the one hand; and using parallel corpora and automatic sense tagging on the source text, on the other. With these resources, the problem of acquiring a WordNet from a parallel corpus can be seen as a word alignment task. Fortunately, this task is well known and alignment algorithms are freely available.
| II-110 |
Method to Build a Bilingual Lexicon for Speech-to-Speech Translation Systems
Abstract:
Noun dropping and mis-translations occasionally occur in Machine Translation (MT) output. These errors can cause communication problems between system users. Some MT architectures are able to incorporate bilingual noun lexica, which can improve the translation quality of sentences that include nouns. In this paper, we propose an automatic method that enables a monolingual user to add new words to the lexicon. In the experiments, we compare the proposed method to three other methods. According to the experimental results, the proposed method gives the best performance in terms of both Character Error Rate (CER) and Word Error Rate (WER). The improvement over using only a transliteration system is very large: about 13 points in CER and 37 points in WER.
| II-122 |
Text Categorization and Clustering | |
A Fast Subspace Text Categorization Method using Parallel Classifiers
Abstract:
In today's world, the number of electronic documents made available to us is increasing day by day. It is therefore important to look at methods which speed up document search and reduce classifier training times. The data available to us is frequently divided into several broad domains with many sub-category levels. Each of these domains of data constitutes a subspace which can be processed separately. In this paper, separate classifiers of the same type are trained on different subspaces and a test vector is assigned to a subspace using a fast novel method of subspace detection. This parallel classifier architecture was tested with a wide variety of basic classifiers and the performance compared with that of a single basic classifier on the full data space. It was observed that the improvement in subspace learning was accompanied by a very significant reduction in training times for all types of classifier used.
| II-132 |
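The following sketch shows the general shape of such a parallel subspace architecture: one classifier per subspace plus a fast routing step, here approximated by a nearest-centroid test, which is an assumption and not the paper's subspace detection method. Any classifier object with a scikit-learn-style fit/predict interface could be plugged in.

    import numpy as np

    class SubspaceEnsemble:
        """Train one classifier per subspace (domain) and route each test
        vector to the subspace whose centroid is closest."""

        def __init__(self, make_classifier):
            self.make_classifier = make_classifier   # factory: () -> clf with fit/predict
            self.centroids = {}
            self.classifiers = {}

        def fit(self, X_by_subspace, y_by_subspace):
            for name, X in X_by_subspace.items():
                X = np.asarray(X, dtype=float)
                self.centroids[name] = X.mean(axis=0)
                clf = self.make_classifier()
                clf.fit(X, y_by_subspace[name])       # each classifier sees only its subspace
                self.classifiers[name] = clf

        def predict(self, x):
            x = np.asarray(x, dtype=float)
            name = min(self.centroids,
                       key=lambda n: np.linalg.norm(x - self.centroids[n]))
            return name, self.classifiers[name].predict([x])[0]

    class CentroidClassifier:
        """Tiny stand-in classifier with a scikit-learn-like fit/predict interface."""
        def fit(self, X, y):
            X, y = np.asarray(X, dtype=float), list(y)
            self.means = {lab: X[[i for i, l in enumerate(y) if l == lab]].mean(axis=0)
                          for lab in set(y)}
        def predict(self, X):
            return [min(self.means, key=lambda lab: np.linalg.norm(x - self.means[lab]))
                    for x in np.asarray(X, dtype=float)]

    ensemble = SubspaceEnsemble(CentroidClassifier)
    ensemble.fit(
        {"sports": [[1, 0], [0.9, 0.1]], "science": [[0, 1], [0.1, 0.9]]},
        {"sports": ["football", "tennis"], "science": ["physics", "biology"]})
    print(ensemble.predict([0.98, 0.02]))   # ('sports', 'football')

Because each classifier is trained only on the vectors of its own subspace, the per-classifier training sets are much smaller, which is where the reported reduction in training time comes from.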
Research on Text Categorization Based on a Weakly-Supervised Transfer Learning Method
Abstract:
This paper presents a weakly-supervised transfer learning based text categorization method, which does not need newly tagged training documents when facing classification tasks in a new area. Instead, we can make use of already tagged documents from other domains to accomplish the automatic categorization task. By extracting linguistic information such as part-of-speech, semantics and co-occurrence of keywords, we construct a domain-adaptive transfer knowledge base. Experiments show that the presented method improves the performance of text categorization on a traditional corpus, and that our results were only about 5% lower than the baseline on cross-domain classification tasks, thus demonstrating the effectiveness of our method.
| II-144 |
Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering
Abstract:
Document representation is an essential step in web page clustering. Web pages are usually written in HTML, offering useful information to select the most important features to represent them. In this paper we investigate the use of nonlinear combinations of criteria by means of a fuzzy system to find those important features. We start our research from a term weighting function called Fuzzy Combination of Criteria (fcc) that relies on term frequency, document title, emphasis and term positions in the text. Next, we analyze its drawbacks and explore the possibility of adding contextual information extracted from inlink anchor texts, proposing an alternative way of combining criteria based on our experimental results. Finally, we apply a statistical test of significance to compare the original representation with our proposal.
| II-157 |
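For illustration only, the snippet below combines several per-term criteria into a single weight with a crude max-of-rules scheme; the membership functions and rule base of the actual fcc function are not reproduced here, and all values shown are assumptions.

    def fuzzy_weight(tf_norm, in_title, emphasized, first_position, in_anchor):
        """Combine per-term criteria, all normalised to [0, 1], into a single
        weight using a simple max-of-rules scheme (illustrative only)."""
        rules = [
            min(tf_norm, 1.0),                # frequent in the page
            0.9 if in_title else 0.0,         # appears in the page title
            0.6 if emphasized else 0.0,       # appears in bold/em/headers
            0.5 * (1.0 - first_position),     # appears early in the text
            0.7 if in_anchor else 0.0,        # appears in inlink anchor text
        ]
        return max(rules)

    # Example: a term that is moderately frequent, in the title, and in anchor text.
    print(fuzzy_weight(tf_norm=0.4, in_title=True, emphasized=False,
                       first_position=0.1, in_anchor=True))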
Clustering Short Text and its Evaluation
Abstract:
Recently there has been an increase in interest in clustering short text because it could be used in many NLP applications. Depending on the application, a variety of short texts can be defined, mainly in terms of their length (e.g. sentences, paragraphs) and type (e.g. scientific papers, newspapers). Finding a clustering method that is able to cluster short text in general is difficult. In this paper, we cluster 4 different corpora with different types of text and varying length and evaluate them against the gold standard solution. On the basis of these corpora we try to show the effect of different similarity measures, clustering methods, and cluster evaluation methods. We discuss four existing corpus-based similarity methods (cosine similarity, Latent Semantic Analysis, Short-text Vector Space Model, and Kullback-Leibler distance), four well-known clustering methods (Complete Link, Single Link and Average Link hierarchical clustering, and Spectral clustering), and three evaluation methods (clustering F-measure, adjusted Rand Index, and V). Our experiments show that corpus-based similarity measures do not affect the clustering methods significantly and that spectral clustering performs better than hierarchical agglomerative clustering. We also show that the values given by the evaluation methods do not always represent the usability of the clusters.
| II-169 |
Information Extraction and Text Mining | |
Information Extraction from Webpages Based on DOM Distances
Abstract:
Retrieving information from the Internet is a difficult task, as demonstrated by the lack of real-time tools able to extract information from webpages. The main cause is that most webpages on the Internet are implemented in plain (X)HTML, a language that lacks structured semantic information. For this reason, much of the effort in this area has been directed to the development of techniques for URL extraction, a field that has produced good results implemented by modern search engines. In contrast, extracting information from a single webpage has produced poor results or very limited tools. In this work we define a novel technique for information extraction from single webpages or collections of interconnected webpages. The technique is based on DOM distances to retrieve information, which allows it to work with any webpage and, thus, to retrieve information online. Our implementation and experiments demonstrate the usefulness of the technique.
| II-181 |
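One simple way to make the notion of a DOM distance concrete (an assumption, not necessarily the paper's definition) is the number of tree edges between two nodes, as sketched below on a toy XHTML fragment; real pages would require a tolerant HTML parser.

    import xml.etree.ElementTree as ET

    def node_paths(root):
        """Map each element to its path (list of ancestors from the root, inclusive)."""
        paths = {}
        def walk(node, path):
            paths[node] = path + [node]
            for child in node:
                walk(child, path + [node])
        walk(root, [])
        return paths

    def dom_distance(paths, a, b):
        """Number of edges between two elements in the DOM tree."""
        pa, pb = paths[a], paths[b]
        common = 0
        for x, y in zip(pa, pb):
            if x is y:
                common += 1
            else:
                break
        return (len(pa) - common) + (len(pb) - common)

    root = ET.fromstring(
        "<html><body><div><p>main text</p><p>more text</p></div>"
        "<ul><li>menu</li></ul></body></html>")
    paths = node_paths(root)
    ps = root.findall(".//p")
    li = root.find(".//li")
    print(dom_distance(paths, ps[0], ps[1]))  # small: same <div>
    print(dom_distance(paths, ps[0], li))     # larger: different branch

Nodes at small distances from a reference node tend to belong to the same content block, which is the intuition such a distance-based extractor exploits.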
Combining Flat and Structured Approaches for Temporal Slot Filling or: How Much to Compress?
Abstract:
In this paper, we present a hybrid approach to the Temporal Slot Filling (TSF) task. Our method decomposes the task into two steps: temporal classification and temporal aggregation. As in many other NLP tasks, a key challenge lies in capturing relations between text elements separated by a long context. We have observed that features derived from a structured text representation can help compress the context and reduce ambiguity. On the other hand, surface lexical features are more robust and work better in some cases. Experiments on the KBP2011 temporal training data set show that both the surface and the structured approach outperform a baseline bag-of-words classifier and that the proposed hybrid method can further improve the performance significantly. Our system achieved the top performance in the KBP2011 evaluation.
| II-194 |
Event Annotation Schemes and Event Recognition in Spanish Texts
Abstract:
This paper presents an annotation scheme for events in Spanish texts, based on TimeML for English. This scheme is contrasted with different proposals, all of them based on TimeML, for various Romance languages: Italian, French and Spanish. Two manually annotated corpora are now available. While manual annotation is far from trivial, we obtained very good event identification agreement (93% of events were identically identified by both annotators). Part of the annotated text was used as a training corpus for the automatic recognition of events. In the experiments conducted so far (SVM and CRF), our best results are at the state of the art for this task (80.3% F-measure).
| II-206 |
Automatically generated noun lexicons for event extraction
Abstract:
In this paper, we propose a method for automatically creating weighted lexicons of event names. Almost all names of events are ambiguous in context (i.e. they can be interpreted in an eventive or non-eventive reading); therefore, weights representing the relative "eventness" of a noun can help disambiguate event detection in texts. Our method has been applied to both French and English corpora. We show with a basic machine learning evaluation that using weighted lexicons can be a good way to improve event extraction. We also propose a study of the corpus size necessary to create a valuable lexicon.
| II-219 |
Invited paper: | |
Lexical Acquisition for Clinical Text Mining Using Distributional Similarity
Abstract:
We describe experiments into the use of distributional similarity for acquiring lexical information from clinical free text, in particular notes typed by primary care physicians (general practitioners). We also present a novel approach to lexical acquisition from ‘sensitive’ text, which does not require the text to be manually anonymised – a very expensive process – and therefore allows much larger datasets to be used than would normally be possible.
| II-232 |
Developing an Algorithm for Mining Semantics in Texts
Abstract:
We discuss an algorithm for identifying semantic arguments of a verb, word senses of a polysemous word, and noun phrases in a sentence. The heart of the algorithm is a probabilistic graphical model. In contrast with existing graphical models, such as Naive Bayes models, CRFs, HMMs, and MEMMs, this model determines a sequence of optimal class assignments among M choices for a sequence of N input symbols without using dynamic programming, running in O(MN) time and taking O(M) memory space. Experiments conducted on standard data sets show encouraging results.
| II-247 |
Mining Market Trend from Blog Titles Based on Lexical Semantic Similarity
Abstract:
Today blogs have become an important medium for people to post their ideas and share new information, and the Up/Down trend of market prices always draws people’s attention. In this paper, we make a thorough study of mining market trends from blog titles in the fields of the housing market and the stock market, based on lexical semantic similarity. We focus our attention on the automatic extraction and construction of a Chinese Up/Down verb lexicon, using both Chinese and Chinese-English bilingual semantic similarity. The experimental results show that verb lexicon extraction based on semantic similarity is of great use in the task of predicting market trends, and that the performance of applying English similar words to Chinese verb lexicon extraction compares well with that of using Chinese similar words.
| II-261 |
Information Retrieval and Question Answering | |
Ensemble Approach for Cross Language Information Retrieval
Abstract:
Cross-language information retrieval (CLIR) is a subfield of information retrieval (IR) which deals with the retrieval of content in one language (source language) for a search query expressed in another language (target language) on the Web. CLIR evolved as a field because the majority of the content on the web is in English, so there is a need for dynamic translation of web content for queries expressed in a native language. The biggest problem is the ambiguity of the query expressed in the native language. This ambiguity is typically not a problem for human beings, who can infer the appropriate word sense or meaning from context, but search engines cannot usually overcome this limitation. Hence, methods and mechanisms to provide native-language access to information from the web are needed: there is a need not only to retrieve the relevant results but also to present the content behind the results in a user-understandable manner. Research in the domain has so far focused on techniques that make use of support vector machines, the suffix tree approach, Boolean models, and iterative result clustering. This work focuses on a methodology for personalized, context-based cross-language information retrieval using an ensemble-learning approach. The source language for this research is English and the target language is Telugu. The methodology has been tested for various queries and the results are shown in this work.
| II-274 |
Web Image Annotation Using an Effective Term Weighting
Abstract:
The number of images on the World Wide Web has been increasing tremendously. Providing search services for images on the web has been an active research area. Web images are often surrounded by different associated texts like ALT text, surrounding text, image filename, html page title etc. Many popular internet search engines make use of these associated texts while indexing images and give higher importance to the terms present in ALT text. But, a recent study has shown that around half of the images on the web have no ALT text. So, predicting the ALT text of an image in a web page would be of great use in web image retrieval. We propose an approach to ALT text prediction and compare our approach to already existing ones. Our results show that our approach and previous approaches produce almost the same results. We analyze both the methods and describe the usage of the methods in different situations. We also build an image annotation system on top of our proposed approach and compare the results with the image annotation system built on top of previous approaches.
| II-286 |
Metaphone pt_BR: the phonetic importance on search and correction of textual information
Abstract:
The increasing automation of communication among systems produces a volume of information beyond the human administrative capacity to deal with it in time. Mechanisms to find inconsistent information and to facilitate decision making by the parties responsible for the information are required. The use of a phonetic algorithm (Metaphone) adapted to Brazilian Portuguese proved to be valuable in searching name and address fields, supporting automatic decisions in information retrieval beyond what regular database queries can obtain.
| II-297 |
Robust and Fast Two-pass Search Method for Lyric Search Covering Erroneous Queries due to Mishearing
Abstract:
This paper proposes a robust and fast lyric search method for music information retrieval (MIR). The effectiveness of lyric search systems based on full-text retrieval engines or web search engines is highly compromised when the queries of lyric phrases contain incorrect parts due to mishearing. Though several previous studies proposed phonetic pattern matching techniques to identify the songs that the misheard lyric phrases refer to, a real-time search algorithm has yet to be realized. This paper proposes a fast phonetic string matching method using a two-pass search algorithm. It consists of pre-selecting the probable candidates by a rapid index-based search in the first pass and executing a dynamic-programming-based search process with an adaptive termination strategy in the second pass. Experimental results show that the proposed search method reduces processing time by more than 87% compared with the conventional methods, without loss of search accuracy.
| II-306 |
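A toy sketch of the two-pass idea: an n-gram index pre-selects candidate songs, and a dynamic-programming edit distance re-ranks only those candidates. The adaptive termination strategy and the phonetic representation of the paper are not reproduced; plain characters stand in for phonemes.

    from collections import defaultdict

    def ngrams(s, n=3):
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def build_index(lyrics):
        """First pass index: character trigrams -> song ids."""
        index = defaultdict(set)
        for song_id, phrase in lyrics.items():
            for g in ngrams(phrase):
                index[g].add(song_id)
        return index

    def edit_distance(a, b):
        """Standard DP edit distance (second pass scoring)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def search(query, lyrics, index, top_candidates=10):
        # Pass 1: rank songs by trigrams shared with the (possibly misheard) query.
        votes = defaultdict(int)
        for g in ngrams(query):
            for song_id in index.get(g, ()):
                votes[song_id] += 1
        candidates = sorted(votes, key=votes.get, reverse=True)[:top_candidates]
        # Pass 2: re-rank only the few pre-selected candidates by full DP matching.
        return sorted(candidates, key=lambda s: edit_distance(query, lyrics[s]))

    lyrics = {1: "excuse me while i kiss the sky",
              2: "excuse me while i kiss this guy",   # a classic mishearing
              3: "we will rock you"}
    index = build_index(lyrics)
    print(search("scuse me while i kiss this guy", lyrics, index))  # song 2 ranks first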
Bootstrap-based Equivalent Pattern Learning for Collaborative Question Answering
Abstract:
Semantically similar questions are submitted to collaborative question answering systems repeatedly, even though a best answer has already been given. To address this, we propose a precise approach that automatically finds an answer to such questions by identifying “equivalent” questions that have already been submitted and answered. Our method is based on a new pattern generation method, T-IPG, to automatically extract equivalent question patterns. Taking these patterns as seed patterns, we further propose a bootstrap-based learning method to learn additional equivalent patterns from training data. The resulting patterns can be applied to match a new question to an equivalent one that has already been answered, and thus suggest potential answers automatically. We experimented with this approach over a large collection of more than 200,000 real questions drawn from the Yahoo! Answers archive, automatically acquiring over 16,991 question equivalence patterns. These patterns allow our method to obtain over 57% recall and over 54% precision when suggesting an answer automatically to new questions, significantly improving over baseline methods.
| II-318 |
How to answer yes/no spatial questions using qualitative reasoning?
Abstract:
We present a method of answering yes/no spatial questions for the purpose of an open-domain Polish question answering system based on news texts. We focus on questions which refer to a certain qualitative spatial relation (e.g. Was Baruch Lumet born in the United States?). In order to answer such questions we add qualitative spatial reasoning to our state-of-the-art question processing mechanisms, using Region Connection Calculus (namely RCC-5) in the reasoning process. In this paper we describe our algorithm for finding the answer to yes/no spatial questions. We propose a method for the evaluation of the algorithm and report the results obtained on a hand-made question set. Finally, we give some suggestions for possible extensions of our methods.
| II-330 |
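For intuition, RCC-5 relations can be derived directly when regions are represented explicitly as sets of atomic places; this is a simplification of the calculus used in the paper, which reasons over relations without such extensional representations, but the relation names match RCC-5.

    def rcc5(a, b):
        """RCC-5 relation between two regions given as sets of atomic places."""
        a, b = set(a), set(b)
        if a == b:
            return "EQ"    # identical regions
        if a < b:
            return "PP"    # a is a proper part of b
        if a > b:
            return "PPi"   # b is a proper part of a
        if a & b:
            return "PO"    # partial overlap
        return "DR"        # discrete: no common part

    usa = {"New York", "Texas", "California"}
    nyc_area = {"New York"}
    europe = {"France", "Poland"}

    # "Was X born in the United States?" style check: the birthplace region
    # must be a (proper) part of, or equal to, the USA region.
    birthplace = nyc_area
    print(rcc5(birthplace, usa) in ("PP", "EQ"))   # True  -> answer "yes"
    print(rcc5(europe, usa))                       # "DR"  -> answer "no"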
Question Answering and Multi-Search Engines in Geo-Temporal Information Retrieval
Abstract:
In this paper we present a complete system for the treatment of the geographical and temporal dimensions in text and its application to information retrieval. The system has been evaluated in the GeoTime task of the 8th and 9th NTCIR workshops, making it possible to compare it with other current approaches to the topic. In order to participate in this task we added the temporal dimension to our GIR system. The system proposed here has a modular architecture so that features can be added or modified. In the development of this system we have followed a QA-based approach as well as multi-search engines to improve the system performance.
| II-342 |
Document Summarization | |
Using Graph Based Mapping of Co-Occurring Words and Closeness Centrality Score for Summarization Evaluation
Abstract:
The use of predefined phrase patterns such as N-grams (N>=2), longest common subsequences or predefined linguistic patterns does not give any credit to non-matching or smaller-size useful patterns and thus may result in loss of information, while the use of a 1-gram based model results in several noisy matches. Additionally, because a summary may contain more than one topic with different levels of importance, we treat summarization evaluation as topic-based evaluation of information content. In the first stage, we identify the topics covered in a given model/reference summary and calculate their importance. In the next stage, we calculate the information coverage of the test (machine-generated) summary w.r.t. every identified topic. We introduce a graph-based mapping scheme and the concept of the closeness centrality measure to calculate the information depth and sense of the co-occurring words in every identified topic. Our experimental results show that the devised system works better than existing systems on the TAC 2011 AESOP dataset.
| II-353 |
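A small stand-in for the graph component described above: build a co-occurring-word graph from a model summary and score words by closeness centrality computed with breadth-first search. The window size and the graph construction are assumptions for illustration.

    from collections import defaultdict, deque

    def cooccurrence_graph(sentences, window=2):
        """Undirected graph linking words that co-occur within a small window."""
        graph = defaultdict(set)
        for sent in sentences:
            words = sent.lower().split()
            for i, w in enumerate(words):
                for v in words[i + 1:i + 1 + window]:
                    if v != w:
                        graph[w].add(v)
                        graph[v].add(w)
        return graph

    def closeness(graph, node):
        """Closeness centrality: (reachable - 1) / sum of BFS distances."""
        dist = {node: 0}
        queue = deque([node])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) <= 1:
            return 0.0
        return (len(dist) - 1) / sum(d for d in dist.values() if d > 0)

    model_summary = ["heavy rain caused floods in the city",
                     "the city declared a flood emergency"]
    g = cooccurrence_graph(model_summary)
    print(sorted(((closeness(g, w), w) for w in list(g)), reverse=True)[:3])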
Combining Syntax and Semantics for Automatic Extractive Single-document Summarization
Abstract:
The goal of automated summarization is to combat the “Information Overload” problem by communicating the most important contents of a document in a compressed form. Due to the difficulty that single-document summarization has in beating a standard baseline, especially for news articles, most efforts are currently focused on multi-document summarization. The goal of this study is to reconsider the importance of single-document summarization by introducing a new approach and its implementation. This approach essentially combines syntactic, semantic, and statistical methodologies, and reflects psychological findings that determine specific human selection patterns as humans construct summaries. Successful summary evaluation results and baseline out-performance are demonstrated when our system is executed on two separate datasets: the Document Understanding Conference (DUC) 2002 data set and a cognitive experiment article set. These results have implications not only for extractive and abstractive single-document summarization, but could also be leveraged in multi-document summarization.
| II-366 |
Combining Summaries using Unsupervised Rank Aggregation
Abstract:
We model the problem of combining multiple summaries of a given document into a single summary in terms of the well-known rank aggregation problem. Treating sentences in the document as candidates and summarization algorithms as voters, we determine the winners in an election where each voter selects and ranks k candidates in order of its preference. Many rank aggregation algorithms are supervised: they discover an optimal rank aggregation function from a training dataset where each "record" consists of a set of candidate rankings and a model ranking. But significant disagreement between model summaries created by human experts, as well as the high cost of creating them, makes it interesting to explore the use of unsupervised rank aggregation techniques. We use the well-known Condorcet methodology, including a new variation to improve its suitability. As voters, we include summarization algorithms from the literature and two new ones proposed here: the first is based on keywords and the second is a variant of the lexical-chain based algorithm in [1]. We experimentally demonstrate that the combined summary is often very similar (when compared using different measures) to the model summary produced manually by human experts.
| II-378 |
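A minimal, Copeland-style stand-in for Condorcet aggregation over sentence rankings (the paper's specific variation is not reproduced): candidates are compared pairwise across voters, and a candidate not ranked by a voter is treated as ranked below that voter's k choices.

    from itertools import combinations

    def condorcet_aggregate(rankings, k):
        """Aggregate several rankings of sentence ids (each a list, best first).
        Returns candidates sorted by Copeland score (pairwise wins minus losses)."""
        candidates = {c for r in rankings for c in r}
        def position(r, c):
            return r.index(c) if c in r else len(r)   # unranked -> worst position
        score = {c: 0 for c in candidates}
        for a, b in combinations(candidates, 2):
            a_wins = sum(position(r, a) < position(r, b) for r in rankings)
            b_wins = sum(position(r, b) < position(r, a) for r in rankings)
            if a_wins > b_wins:
                score[a] += 1; score[b] -= 1
            elif b_wins > a_wins:
                score[b] += 1; score[a] -= 1
        return sorted(candidates, key=lambda c: score[c], reverse=True)[:k]

    # Three hypothetical summarizers each rank their top-3 sentence ids.
    rankings = [[2, 5, 1], [5, 2, 7], [2, 7, 5]]
    print(condorcet_aggregate(rankings, k=3))   # [2, 5, 7]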
Using Wikipedia Anchor Text and Weighted Clustering Coefficient to Enhance the Traditional Multi-Document Summarization
Abstract:
Similar to the traditional approach, we consider the task of summarization as the selection of top-ranked sentences from ranked sentence clusters. To achieve this goal, we rank the sentence clusters using the importance of words calculated with the PageRank algorithm on a reverse directed word graph of sentences. Next, to rank the sentences in every cluster we introduce the use of the weighted clustering coefficient, computed from the PageRank scores of words. Finally, an important issue is the presence of many noisy entries in the text, which degrade the performance of most text mining algorithms. To solve this problem, we introduce a Wikipedia anchor text based phrase mapping scheme. Our experimental results on the DUC-2002 and DUC-2004 datasets show that our system performs better than existing novel systems in this area.
| II-390 |
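The word-importance step can be pictured with plain PageRank over a directed word graph built from sentences, as below; the exact reverse-graph construction and the weighted clustering coefficient step are only indicated, and the damping factor is the usual default assumption.

    from collections import defaultdict

    def word_graph(sentences, reverse=True):
        """Directed word graph; with reverse=True each word points to the word
        that precedes it (a stand-in for a reverse directed word graph)."""
        edges = defaultdict(set)
        for sent in sentences:
            words = sent.lower().split()
            for prev, cur in zip(words, words[1:]):
                if reverse:
                    edges[cur].add(prev)
                else:
                    edges[prev].add(cur)
        return edges

    def pagerank(edges, d=0.85, iterations=50):
        nodes = set(edges) | {v for targets in edges.values() for v in targets}
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new = {n: (1.0 - d) / len(nodes) for n in nodes}
            for u, targets in edges.items():
                if targets:
                    share = d * rank[u] / len(targets)
                    for v in targets:
                        new[v] += share
            # redistribute the mass of dangling nodes uniformly
            dangling = d * sum(rank[n] for n in nodes if not edges.get(n))
            for n in nodes:
                new[n] += dangling / len(nodes)
            rank = new
        return rank

    sents = ["the storm damaged the old bridge",
             "engineers closed the bridge after the storm"]
    pr = pagerank(word_graph(sents))
    print(sorted(pr, key=pr.get, reverse=True)[:3])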
Best software award: | |
Extraction of Relevant Figures and Tables for Multi-document Summarization
Abstract:
We propose a system that extracts the most relevant figures and tables from a set of topically related source documents. These are then integrated into the extractive text summary produced from the same set. The proposed method is domain independent. It predominantly focuses on the generation of a ranked list of relevant candidate units (figures/tables), in order of their computed relevancy. The relevancy measure is based on local and global scores that include direct and indirect references. In order to test the system performance, we have created a test collection of document sets which do not adhere to any specific domain. Evaluation experiments show that the system-generated ranked list is in statistically significant correlation with the human evaluators’ ranking judgments. The feasibility of the proposed system for summarizing a document set which contains figures/tables as its salient units is made clear in our concluding remarks.
| II-402 |
Best paper award, 3rd place: | |
Towards automatic generation of catchphrases for legal case reports
Abstract:
This paper presents the challenges and possibilities of a novel summarisation task: automatic generation of catchphrases for legal documents. Catchphrases are meant to present the important legal points of a document with respect to identifying precedents. Automatically generating catchphrases for legal case reports could greatly assist in searching for legal precedents, as many legal texts do not have catchphrases attached. We developed a corpus of legal (human-generated) catchphrases (provided with the submission), which lets us compute statistics useful for automatic catchphrase extraction. We propose a set of methods to generate legal catchphrases and evaluate them on our corpus. The evaluation shows a recall comparable to humans while still showing a competitive level of precision, which is very encouraging. Finally, we introduce a novel evaluation method for catchphrases for legal texts based on the well-known ROUGE measure for evaluating summaries of general texts.
| II-414 |
Applications | |
Invited paper: | |
A Dataset for the Evaluation of Lexical Simplification
Abstract:
Lexical Simplification is the task of replacing individual words of a text with words that are easier to understand, so that the text as a whole becomes easier to comprehend, e.g. by people with learning disabilities or by children who are learning to read. Although this seems like a straightforward task, the evaluation of methods for Lexical Simplification is not so trivial. The problem is how to build a dataset in which words are replaced with their simpler synonyms, and how to evaluate the value of this dataset given that different annotators might use different replacement words. In this paper we reuse existing resources for a similar problem, that of Lexical Substitution, and transform them into a dataset for Lexical Simplification. This new dataset contains 430 sentences, each with one word marked. For that word, a list of words that can replace it, sorted by their difficulty, is provided. The paper reports on how this dataset was created based on the annotations of different persons, and on their agreement. In addition we provide several metrics for computing the similarity between ranked lexical substitutions, which are used to assess the value of the different annotations, but which can also be used to compare the lexical simplifications suggested by a machine with a ground truth model.
| II-426 |
Text Content Reliability Estimation in Web Documents: A New Proposal
Abstract:
This paper illustrates how a combination of information retrieval, machine learning, and NLP corpus annotation techniques was applied to the problem of text content reliability estimation in Web documents. Our proposal for text content reliability estimation is based on a model in which reliability is a similarity measure between the content of the documents and a knowledge corpus. The proposal includes a new representation of text which uses entailment-based graphs. We then use the graph-based representations as training instances for a machine learning algorithm, allowing us to build a reliability model. Experimental results illustrate the feasibility of our proposal through a comparison with a state-of-the-art method.
| II-438 |
Fine-grained Certainty Level Annotations Used for Coarser-grained E-health Scenarios -- Classification of Diagnoses in Swedish Clinical Text
Abstract:
An important task in information access methods is distinguishing factual information from speculative or negated information. Fine-grained certainty levels of diagnostic statements in Swedish clinical text have been annotated in a corpus from a medical university hospital. The corpus has a model of two polarities (positive and negative) and three certainty levels (certain, probable and possible). However, in the domain of e-health, there are many scenarios where such fine-grained certainty levels are not practical for information extraction. Instead, more coarse-grained groups are needed. We present three scenarios: adverse event surveillance, decision support alerts and automatic summaries. For each scenario, we collapse the fine-grained certainty level classifications into coarser-grained groups. We build automatic classifiers for each scenario, using Conditional Random Fields and simple local context features, and analyze the results quantitatively through precision, recall and F-score. Annotation discrepancies are analyzed qualitatively through manual corpus analysis. Our main finding is that it is feasible to use a corpus of fine-grained certainty level annotations to build classifiers for coarser-grained real-world scenarios: 0.89, 0.91 and 0.80 F-score (overall average). However, the class label distributions for each scenario are skewed, and reflect problems in the distinction between probably negative and certainly negative diagnosis statements. This border needs to be further defined in the fine-grained classification model, for instance by refining the guidelines for this task, as there are discrepancies and inconsistencies in the annotations.
| II-450 |
Combining Confidence Score and Mal-rule Filters for Automatic Creation of Bangla Error Corpus: Grammar Checker Perspective
Abstract:
This paper describes a novel approach for the automatic creation of a Bangla error corpus for the training and evaluation of grammar checker systems. The procedure begins with the automatic creation of a large number of erroneous sentences from a set of grammatically correct sentences. A statistical Confidence Score Filter has been implemented to select proper samples from the generated erroneous sentences, such that sentences with less probable word sequences get a lower confidence score and vice versa. A rule-based mal-rule filter with an HMM-based semi-supervised POS tagger has been used to collect the sentences having improper tag sequences. The combination of these two filters ensures the robustness of the proposed approach, such that no valid construction is selected for the synthetically generated error corpus. Though the present work focuses on the most frequent grammatical errors in Bangla written text, a detailed taxonomy of grammatical errors in Bangla is also presented, with the aim of increasing the coverage of the error corpus in the future. The proposed approach is language independent and can easily be applied to create similar corpora in other languages.
| II-462 |
Best student paper award: | |
Predictive Text Entry for Agglutinative Languages using Morphological Segmentation and Phonological Restrictions
Abstract:
Systems for predictive text entry on ambiguous keyboards typically rely on dictionaries with word frequencies which are used to suggest the most likely words matching user input. This approach is insufficient for agglutinative languages, where morphological phenomena increase the rate of out-of-vocabulary words. We propose a method for text entry which circumvents the problem of out-of-vocabulary words by replacing the dictionary with a Markov chain on morph sequences combined with a third-order hidden Markov model (HMM) mapping key sequences to letter sequences, and phonological constraints for pruning suggestion lists. We evaluate our method by constructing text entry systems for Finnish and Turkish and comparing our systems with published text entry systems and the text entry systems of three commercially available mobile phones. Measured using the keystrokes per character ratio (KPC) [8], we achieve superior results. For training, we use corpora which are segmented using unsupervised morphological segmentation.
| II-478 |
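The KPC figure of merit is simply the total number of key presses divided by the number of characters produced; the sketch below assumes a simplified logging format, not the exact protocol of the cited reference [8].

    def kpc(sessions):
        """Keystrokes per character over a set of text entry sessions.
        Each session is a pair (keystrokes_used, text_produced)."""
        total_keys = sum(k for k, _ in sessions)
        total_chars = sum(len(t) for _, t in sessions)
        return total_keys / total_chars

    # Hypothetical log: (number of key presses, resulting text)
    sessions = [(7, "hello"),      # 5 letters plus 2 presses to pick a suggestion
                (12, "goodbye")]   # 7 letters plus 5 extra presses
    print(round(kpc(sessions), 3))   # 19 / 12 = 1.583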
Comment Spam Classification in Blogs through Comment Analysis and Blog Post-Comment Relationships
Abstract:
Spamming refers to the process of providing unwanted and irrelevant information to the users. It is a widespread phenomenon that is often noticed in e-mails, instant messages, blogs and forums. In our paper, we consider the problem of spamming in blogs. In blogs, spammers usually target commenting systems which are provided by the authors to facilitate interaction with the readers. Unfortunately, spammers abuse these commenting systems by posting irrelevant and unsolicited content in the form of spam comments. Thus, we propose a novel methodology to classify comments into spam and non-spam using previously-undescribed features including blog post-comment relationships. Experiments conducted using our methodology produced a spam detection accuracy of 94.82% with a precision of 96.50% and a recall of 95.80%.
| II-490 |
Detecting Players Personality Behavior with any Effort of Concealment
Abstract:
We introduce a novel natural language processing component using machine learning techniques for the prediction of personality behaviors of players in a serious game, Land Science, which aims at developing an environment in which players can improve their conversation skills during group interaction in written natural language. Our model learns vector space representations for various feature extraction. To apply this framework, input excerpts must be classified into one of six possible personality classes. We approached this personality classification task with several machine learning algorithms, such as Naive Bayes, Support Vector Machines, and Decision Trees. Training is performed on a dataset of manually annotated excerpts. By combining these feature spaces from psychology and computational linguistics, we perform and evaluate our approaches to detecting personality, and eventually develop a classifier that is nearly 83% accurate on our dataset. Based on the feature analysis of our models, we add several theoretical contributions, including revealing a relationship between different personality behaviors in players' writing.
| II-502 |
Complementary proceedings | |
Performance Analysis of Pedestrian Detection at Night Time with different Classifiers
Abstract:
Pedestrian detection is one of the most important components in driver-assistance systems. A performance analysis is carried out with various classifiers (AdaBoost, Neural Network and SVM) and the behavior of the system is analyzed. As there is large intra-class variability in the pedestrian class, a two-stage classifier is used. A review of different pedestrian detection systems is also given in the paper. Classifiers are arranged based on HAAR-like and HOG features in a coarse-to-fine manner. AdaBoost gives better performance.
| |
Summarizing Public Opinions in Tweets
Abstract:
The objective of Sentiment Analysis is to identify any clue of positive or negative emotion in a piece of text reflective of the author's opinions on a subject. When performed on large aggregations of user-generated content, Sentiment Analysis may be helpful in extracting public opinions. We use Twitter for this purpose and build a classifier which classifies a set of tweets. Often, Machine Learning techniques are applied to Sentiment Classification, which requires a labeled training set of considerable size. We introduce the approach of using words with sentiment value as noisy labels in a distant supervised learning environment. We created a training set of such tweets and used it to train a Naive Bayes classifier. We test the accuracy of our classifier using a hand-labeled set. Finally, we check whether applying a combination of a minimum word frequency threshold and Categorical Proportional Difference as the feature selection method enhances the accuracy.
| |
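A compact sketch of the distant-supervision setup described above: tweets are labeled by seed sentiment words, those words are removed from the features, and a Naive Bayes classifier is trained on the noisy labels. The seed lexicons and data are toy placeholders.

    from collections import Counter
    from math import log

    POSITIVE = {"love", "great", "happy"}
    NEGATIVE = {"hate", "awful", "sad"}     # tiny illustrative seed lexicons

    def noisy_label(tweet):
        words = set(tweet.lower().split())
        if words & POSITIVE and not words & NEGATIVE:
            return "pos"
        if words & NEGATIVE and not words & POSITIVE:
            return "neg"
        return None                          # ambiguous / left unlabeled

    def train_nb(tweets):
        counts = {"pos": Counter(), "neg": Counter()}
        totals = Counter()
        for tweet in tweets:
            label = noisy_label(tweet)
            if label is None:
                continue
            totals[label] += 1
            for w in tweet.lower().split():
                if w not in POSITIVE | NEGATIVE:   # strip the noisy-label words
                    counts[label][w] += 1
        return counts, totals

    def classify(tweet, counts, totals, alpha=1.0):
        vocab = set(counts["pos"]) | set(counts["neg"])
        best, best_lp = None, float("-inf")
        for label in ("pos", "neg"):
            lp = log(totals[label] / sum(totals.values()))
            denom = sum(counts[label].values()) + alpha * len(vocab)
            for w in tweet.lower().split():
                lp += log((counts[label][w] + alpha) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

    tweets = ["love this new phone", "great battery life, very happy",
              "awful screen", "i hate the battery", "sad about the camera"]
    counts, totals = train_nb(tweets)
    print(classify("this phone is very nice", counts, totals))   # -> 'pos'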
N-gram approach to transliteration
Abstract:
Machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. It plays an important role in natural language applications such as information retrieval and machine translation, especially for handling proper nouns and technical terms, and is a crucial factor in CLIR and MT when the languages do not use the same scripts. This paper addresses the issue of machine transliteration from English to Punjabi. An n-gram approach is used for transliteration from English to Punjabi using Moses, a statistical machine translation tool. After applying transliteration rules, the transliteration accuracy rate (TAR) comes out to be 63.31%.
| |
From Co-occurrence to Lexical Cohesion for Automatic Query Expansion
Abstract:
Designing a good Information Retrieval System (IRS) is still a big challenge and an open research problem. To overcome some of the problems of information retrieval systems, researchers have investigated query expansion (QE) techniques to help users formulate better queries and hence improve the efficiency of an information retrieval system. The scope of this work is limited to pseudo-relevance-feedback based query expansion. Most of the work done in pseudo-relevance based automatic query expansion has selected terms using co-occurrence based measures, which have some inherent limitations. In this paper we focus on these limitations and, keeping them in view, explore the utility of lexical measures for expanding the query. The paper investigates the use of query expansion based on lexical links and provides an algorithm for it. Based on theoretical justification and intensive experiments on a TREC data set, we suggest that lexical measures are at least as good as co-occurrence based measures and in some cases may work better. Depending on the nature of the query, lexical measures have great potential for improving the performance of an information retrieval system.
| |
Mapping Synsets in WordNet to Chinese
Abstract:
WordNet is a large lexical database which has an important influence on many computational-linguistics applications, but unfortunately it cannot be used for languages other than English. This paper presents an automatic method to map WordNet synsets to Chinese and thereby generate a homogeneous Chinese WordNet. The proposed approach is grounded on the viewpoint that most cognitive concepts are language independent and can be mapped from one language to another unambiguously. Firstly, we utilize offline/online English-Chinese lexicons and a term translation system to translate the words in WordNet; one English word is translated to multiple Chinese words, and one synset is translated to a group of Chinese words. Secondly, we cluster these Chinese words into synonym sets according to their senses. Finally, we select the right synonym set for a given synset. We regard the proper word-set selection process as a classification problem and put forward 9 classifying features based on relations in WordNet, Chinese morphology, and translation intersections. In addition, a heuristic rule based on lexico-syntactic patterns is used to increase recall. Experimental results on WordNet 3.0 show that the overall synset translation coverage of our method is 85.12%, with a precision of 81.37%.
| |
Combined Inverted - Bigram – Phrase Index Enriched With Named Entities and Coreferences
Abstract:
In this paper, a three-way index based on information extraction is proposed. Three types of combined inverted - bigram - phrase indexes enriched with named entities and coreferences are proposed. The first type of index, called Joined Entities Index, stores each named entity as a single token by joining the words that compose it. The second type of index, called One Term of Entity Index, stores the named entities as N-grams and, if there is a named entity next to a stopword, the index stores the bigram composed of the stopword and the adjacent word, which can be the first or the last word of the entity. The third type of index, called Named Entity Combined Index, also stores the named entities as N-grams and, if there is a named entity next to a stopword, the index stores the N-gram composed of the stopword and the named entity. The proposed indexes were implemented on Lucene, using the newswire corpus Reuters RCV1. Experiments show that the Named Entity Combined Index requires less search time than a traditional combined inverted bigram partial phrase index.
| |
Comparing Sanskrit Texts for Critical Editions: the sequences move problem.
Abstract:
A critical edition takes into account different versions of the same text in order to show the differences between two distinct versions, in terms of words missing, changed, omitted or displaced. Traditionally, Sanskrit is written without blanks, and the word order can be changed without changing the meaning of a sentence. This paper describes the Sanskrit characteristics which make Sanskrit text comparison a specific matter, and then presents two different comparison methods for Sanskrit texts which can be used for the elaboration of computer-assisted critical editions. The first one uses the L.C.S., while the second one uses the global alignment algorithm. Comparing them, we see that the second method provides better results, but that neither of these methods can detect when a word or a sentence fragment has been moved. We then present a method based on N-grams that can detect such a movement when it is not too far from its original location. We show how the method behaves on several examples and look at possible future developments.
| |
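The two word-level baselines mentioned above can be sketched as follows: longest common subsequence and Needleman-Wunsch global alignment over token sequences. The N-gram-based detection of moved fragments is not reproduced, and the tokenisation into separate words is assumed.

    def lcs_length(a, b):
        """Length of the longest common subsequence of two token lists."""
        prev = [0] * (len(b) + 1)
        for x in a:
            cur = [0]
            for j, y in enumerate(b, 1):
                cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
            prev = cur
        return prev[-1]

    def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
        """Needleman-Wunsch global alignment score for two token lists."""
        prev = [j * gap for j in range(len(b) + 1)]
        for i, x in enumerate(a, 1):
            cur = [i * gap]
            for j, y in enumerate(b, 1):
                diag = prev[j - 1] + (match if x == y else mismatch)
                cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
            prev = cur
        return prev[-1]

    v1 = "evam maya shrutam ekasmin samaye".split()
    v2 = "evam shrutam maya ekasmin samaye".split()   # two words swapped
    print(lcs_length(v1, v2), global_alignment(v1, v2))

Neither score distinguishes a moved word from an ordinary deletion plus insertion, which is why a separate move-detection step is needed.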
Content Extraction Using DOM Structures
Abstract:
Content extraction is a research area of wide interest due to its many applications. It basically consists in the detection of the main content in a web document. This is useful, e.g., to show webpages on small screens such as PDAs, and also to enhance processing and indexing tasks by avoiding the treatment of irrelevant content such as menus and advertisements. In this paper we present a new technique for content extraction which is based on the hierarchical relations of the DOM structure of a webpage. This information provides the technique with the ability to extract the main content with high recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us very precise information regarding the related components in a block, thus producing very cohesive blocks. Our implementation and experiments demonstrate the usefulness of the technique.
| |
Abstract:
This paper describes an automatic method for harvesting and sense-annotating domain-independent data from the web. As a proof of concept, this method has been applied to German, a language for which sense-annotated corpora are still in short supply. The sense inventory is taken from the German wordnet GermaNet. The web-harvesting relies on an existing mapping of GermaNet to the German version of the web-based dictionary Wiktionary. The data obtained by this method for German have resulted in the WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses) resource which currently represents the largest sense-annotated corpus available for this language. While the present paper focuses on one particular language, the method as such is language-independent.
| |
Improving Finite-State Spell-Checker's Corrections with POS Tagger's Context N-Grams
Abstract:
In this paper we demonstrate a finite-state implementation of context-aware spell checking that utilises an n-gram based POS tagger to rerank the suggestions of a simple edit-distance based spell checker. We demonstrate context-aware spell checking for English and Finnish and suggest the modifications that are necessary to make traditional n-gram models work for morphologically more complex languages, such as Finnish.
| |
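Not a finite-state implementation, but the same reranking idea in plain Python: generate edit-distance-1 candidates from a word list and order them by a POS bigram score around the error position. The lexicon and tag model below are toy placeholders.

    def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
        """All strings at edit distance 1 from word."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
        inserts = [l + c + r for l, r in splits for c in alphabet]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        return set(deletes + replaces + inserts + transposes)

    # Toy lexicon with POS tags and a toy POS bigram score (higher is better).
    LEXICON = {"cat": "NOUN", "can": "AUX", "car": "NOUN", "eat": "VERB"}
    POS_BIGRAM = {("DET", "NOUN"): 0.9, ("DET", "AUX"): 0.05, ("DET", "VERB"): 0.05}

    def suggest(misspelled, prev_pos):
        candidates = [w for w in edits1(misspelled) if w in LEXICON]
        return sorted(candidates,
                      key=lambda w: POS_BIGRAM.get((prev_pos, LEXICON[w]), 0.0),
                      reverse=True)

    # "the ca ran away" -> the previous token "the" is a determiner.
    print(suggest("ca", prev_pos="DET"))
    # The nouns 'cat' and 'car' are ranked above the auxiliary 'can'.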
Arabic Temporal Entity Extraction using Morphological Analysis
Abstract:
The detection of temporal entities within natural language texts is an interesting information extraction problem. Temporal entities help to estimate authorship dates, enhance information retrieval capabilities, detect and track topics in news articles, and augment electronic news reader experience. Research has been performed on the detection, normalization and annotation guidelines for Latin temporal entities. However, research in Arabic lags behind and is restricted to commercial tools. This paper presents a temporal entity detection technique for the Arabic language using morphological analysis and a finite state transducer. It also augments an Arabic lexicon with 550 tags that identify 12 temporal morphological categories. The technique reports a temporal entity detection success of 94.6% recall and 84.2% precision, and a temporal entity boundary detection success of 89.7% recall and 90.8% precision.
| |
A Flexible Table Parsing Approach
Abstract:
This paper introduces a novel table parsing approach. In contrast to recent approaches to table extraction that utilize spatial reasoning over the positional information of table cells and headers, we do not assume tables to be encoded in HTML or even to have perfectly aligned columns or rows. Given that tables are often copied from a structured environment such as web pages and spreadsheets into text where the formatting is not maintained correctly, we propose a parsing technique that uses two simple heuristics: table headers are to the left of and above a data cell.
Generally, tables can be difficult to parse because of the different ways information can be encoded in them. Our approach starts by finding the data cells (i.e., bid/ask prices) in emails and pulls out all tokens associated with the respective data cell. Basically, the approach "flattens" the table by pulling out sequences of tokens that have scope over a data cell.
We also propose a clustering and classification method for finding prices reliably in the data set we used. This method is transferable to other data cell types and can be applied to other table content.
Given the open-ended nature of the table extraction problem, we also developed a confidence metric that identifies tables from which we can confidently extract correct information and separates them from tables that are too complex to reliably extract any information from.
| |
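A minimal sketch of the flattening heuristic named above, assuming the table arrives as rows of whitespace-separated cells and that a data cell can be recognised as a number; the price-specific clustering and the confidence metric are not reproduced.

    def is_data_cell(token):
        """Toy data-cell test: the cell parses as a number (e.g. a price)."""
        try:
            float(token.replace(",", ""))
            return True
        except ValueError:
            return False

    def flatten_table(rows):
        """For each data cell, collect the header to its left and the headers
        above it in the same column (the two heuristics: left of and above)."""
        records = []
        for r, row in enumerate(rows):
            for c, cell in enumerate(row):
                if not is_data_cell(cell):
                    continue
                left = [tok for tok in row[:c] if not is_data_cell(tok)]
                above = [rows[r2][c] for r2 in range(r)
                         if c < len(rows[r2]) and not is_data_cell(rows[r2][c])]
                records.append({"value": cell, "row_headers": left,
                                "column_headers": above})
        return records

    rows = [["", "bid", "ask"],
            ["EURUSD", "1.0712", "1.0715"],
            ["GBPUSD", "1.2401", "1.2404"]]
    for rec in flatten_table(rows):
        print(rec)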
Subsymbolic Semantic Named Entity Recognition
Abstract:
In this paper, we present the novel application of our subsymbolic semantic TSR approach to named entity recognition and classification (NERC) in order to demonstrate the generic utility of the TSR approach beyond word sense disambiguation and language identification. Experimental results support our hypothesis that TSR techniques can successfully recognize named entities of different types in several languages. These experiments were based on a common framework infrastructure but used different sets of features on two German and two English corpora.
| |
The Quantum of Language. Metaphoric Self Reflection as a Key to Mind Uncover
Abstract:
Our research builds on comparative findings from cognitive and computational linguistics, philosophical and phenomenological theories of mind, discoveries in modern physics, and natural language semantics dealing with meanings at the quantum level, in a pilot study referred to as “The Elegant Mind” project. The project was created to implement metaphoric reasoning algorithms in artificial intelligence. We have particularly studied expressions describing the realm of mind in Czech and English. The data have been acquired through an online Free Association Experiment (FEA) in order to visualize the latent connections between particular states of mind. A multilingual semantic dependency network of mind, founded on a knowledge-based system of authentic association data, has been generated with Gephi dynamic graphs.
| |
Corpus Materials for Constructing Learner Corpus Compiling Speaking, Writing, Listening, and Reading Data
Abstract:
Learner corpora, which are defined as collections of texts produced by learners of a second or foreign language, have contributed to the advancement of research on second language learning and teaching by providing texts for analyzing what sorts of linguistic items, such as vocabulary and grammar, learners use adequately or inadequately. Some learner corpora are annotated with information tags on the errors that learners made, so we can directly analyze learners' errors or compare the errors across learners of different proficiency levels. In addition to this use, learner corpora can also serve as a language resource for constructing computer-based language learning or teaching systems with machine learning algorithms. The construction of a learner corpus follows three steps: design, data collection, and analysis of the collected data. The design step determines the variables of a corpus, such as language-related variables, task-related variables and learner-related variables. The data collection step literally collects raw texts and the information to be annotated with the texts, such as learner information and error information. The analysis step carries out basic analyses, such as a descriptive statistical analysis or a qualitative analysis, in order to confirm the validity of the collected data. Although most learner corpora consist of texts that reflect learners' proficiency in writing or in speaking, some consist of texts that reflect learners' proficiency in multiple modalities. Wen et al. (2008) constructed a learner corpus consisting of texts that reflect learners' proficiency in speaking and writing: the speaking data include sound recordings and transcriptions of what learners said in speaking exercises, and the writing data include texts of learners' essays. Meurers et al. (2010) constructed a learner corpus consisting of texts that reflect learners' reading and writing proficiency; the data include texts written by learners as answers to comprehension questions in reading exercises. Kotani et al. (2011) constructed a learner corpus, called the I(ntegrated)-learner corpus, consisting of texts that reflect learners' speaking (focusing on pronunciation), writing, reading and listening proficiency. According to Kotani et al. (2011), the goal of this corpus is to provide a language resource for the analysis of learners' language use based on the four skills. The speaking (pronunciation) data include sound recordings and texts from when learners pronounced sentences. The writing data include texts from when learners wrote sentences in writing exercises. The reading data include texts which learners read. The listening data include texts that were read aloud and that learners listened to. The reading and listening texts were annotated with information showing how learners read or listened to the sentences, such as the reading time, the comprehension rate, and a subjective judgment score showing the difficulty of each sentence in reading or listening. The speaking and writing texts are also annotated with information on learners' pronunciation and writing, such as the pronunciation time, the writing time, and a subjective judgment score showing whether a sentence is understandable as an English sentence or to what extent a sentence is difficult to pronounce.
In constructing a learner corpus, a prerequisite is to prepare corpus materials that properly reveal learners' second language ability. Thus, the previous studies used corpus materials taken from linguistic exercises such as essay writing exercises, and exercises of language tests. However, we consider that corpus materials of the I-Learner corpus should be examined in more detail, because this corpus includes reading data and listening data, which have not been compiled in the previous learner corpora. Therefore, this paper describes the corpus materials of the I-learner corpus regarding how and why the materials were chosen as well as the types and quantity of the corpus materials.
| |
Harnessing Wordnet Senses to Unify Sentiment Across Languages
Abstract:
Cross-lingual sentiment analysis (SA) refers to performing sentiment analysis of a language using training corpora from another language. Existing approaches to cross-lingual SA are limited by machine translation (MT), since MT-based techniques have been reported to be the best so far; however, MT systems do not exist for most pairs of languages and, even if they do, their translation accuracy is low. Our approach to cross-lingual SA uses word senses as features for a supervised sentiment classifier. We develop a universal feature space based on linked WordNet synsets and train sentiment classifiers belonging to different languages on it. This novel approach does not rely on MT. We present our results on two widely spoken Indian languages: Hindi (450 million speakers) and Marathi (72 million speakers). Since MT between the two languages is not available, we present a lexical transfer-based approach that produces a naive translation to act as the baseline. We show that the sense-based approach gives a classification accuracy of 72% and 84% for Hindi and Marathi sentiment classification respectively. An improvement of 14%-15% over the approach that uses the naive translation underlines the utility of our approach where good MT systems are not available.
| |
Hindi Subjective Lexicon Generation using WordNet Graph Traversal
Abstract:
With the adoption of the UTF-8 Unicode standard, web content in the Hindi language is increasing at a rapid pace. There is a great opportunity to mine this content and gain insight into the sentiments and opinions expressed by people and various communities. In this paper, we present a graph-based method to build a subjective lexicon for the Hindi language using WordNet as a resource. Our method takes a pre-annotated seed list and expands it into a full lexicon using synonym and antonym relations. We show two different evaluation strategies to validate the Hindi lexicon built. The main contributions of our work are 1) a subjective lexicon of adjectives developed using Hindi WordNet, and 2) an annotated corpus of Hindi reviews.
| |
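The graph-traversal expansion can be sketched with a generic synonym/antonym graph: polarity propagates unchanged along synonym edges and flips along antonym edges. A tiny English stand-in replaces Hindi WordNet here.

    from collections import deque

    def expand_lexicon(seeds, synonyms, antonyms):
        """Breadth-first propagation of polarity from seed words:
        synonyms keep the polarity, antonyms flip it."""
        polarity = dict(seeds)                 # word -> +1 / -1
        queue = deque(seeds)
        while queue:
            word = queue.popleft()
            for nbr in synonyms.get(word, []):
                if nbr not in polarity:
                    polarity[nbr] = polarity[word]
                    queue.append(nbr)
            for nbr in antonyms.get(word, []):
                if nbr not in polarity:
                    polarity[nbr] = -polarity[word]
                    queue.append(nbr)
        return polarity

    # Toy English stand-in for Hindi WordNet synonym/antonym relations.
    synonyms = {"good": ["fine", "nice"], "bad": ["poor"]}
    antonyms = {"good": ["bad"]}
    print(expand_lexicon({"good": 1}, synonyms, antonyms))
    # {'good': 1, 'fine': 1, 'nice': 1, 'bad': -1, 'poor': -1}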
Using the ILCI Annotation Tool for POS Annotation: A Case of Hindi
Abstract:
In the present paper, we present an annotation tool, ILCIANN (Indian Languages Corpora Initiative Annotation Tool), which could potentially be used for crowd-sourcing the annotation task and creating language resources for use in NLP. This tool is expected to be especially helpful in creating annotated corpora for less-resourced languages. ILCIANN is a server-based web application which can be used for any kind of word-level annotation task in any language. The paper describes the architecture of the tool, its functionality, its application in the ILCI (Indian Languages Corpora Initiative) project for POS annotation of Hindi data, and the extent to which it increases the efficiency and accuracy of the annotators. It reports the results of an experiment conducted to measure the increase in efficiency (in terms of time spent on annotation) and accuracy (in terms of inter-annotator agreement) obtained with the tool compared to manual annotation.
| |
A corpus based method for product feature ranking for interactive question answering systems
Abstract:
Choosing a product is not an easy task because customers need to consider many features before they can reach a decision. Interactive question answering (IQA) systems can help customers in this process by answering questions about products and initiating a dialogue with the customer when their needs are not clearly defined. For this purpose we propose a corpus-based method for weighting the importance of product features depending on how likely they are to be of interest to a user. By using this method, we hope that users can select the desired product in an optimal way. The system is developed for mobile phones using various corpora. Experiments are carried out with a corpus of user opinions, the assumption being that the features mentioned there are more likely to be important to a potential buyer. An opinion mining system is also used to distinguish between features mentioned in a positive and a negative context.
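A minimal sketch of the kind of corpus-based weighting described above is given below: features are weighted by how often they are mentioned in an opinion corpus, and an opinion-polarity function splits the mentions into positive and negative contexts. The feature term list and the polarity function are assumptions; the paper's actual weighting scheme may differ.

    # Illustrative frequency-based weighting of product features from an
    # opinion corpus, with a positive/negative split per feature.
    from collections import Counter

    def rank_features(opinions, feature_terms, polarity):
        """Weight each feature by mention frequency; report its positive share."""
        counts, pos_counts = Counter(), Counter()
        for sentence in opinions:
            label = polarity(sentence)            # +1 or -1, from an opinion miner
            for feat in feature_terms:
                if feat in sentence.lower():
                    counts[feat] += 1
                    if label > 0:
                        pos_counts[feat] += 1
        total = sum(counts.values()) or 1
        return sorted(((f, counts[f] / total, pos_counts[f] / counts[f])
                       for f in counts), key=lambda x: -x[1])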
| |
Puzzle Out the Semantic Web Search
Abstract:
In this paper we propose a Cooperative Question Answering System that takes natural language queries as input and is able to return a cooperative answer based on Semantic Web resources, more specifically DBpedia, represented in OWL/RDF, as the knowledge base, and WordNet to build similar questions. Our system resorts to ontologies not only for reasoning but also to find answers, and does not require prior knowledge of the semantic resources from the user. The natural language question is translated into its semantic representation and then answered by consulting the semantic sources of information. The system is able to resolve problems of ambiguity and helps find the path to the correct answer. If there are multiple answers to the question posed (or to the similar questions for which DBpedia contains answers), they are grouped according to their semantic meaning, providing a more cooperative answer to the user.
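To make the DBpedia consultation step concrete, the snippet below issues the kind of query such a system might produce for a question like "Who wrote The Old Man and the Sea?". The query, the public endpoint and the SPARQLWrapper client are illustrative choices, not the paper's actual semantic translation.

    # Illustrative DBpedia lookup of the kind a cooperative QA system could
    # issue after translating a question into a structured query.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?author WHERE {
          <http://dbpedia.org/resource/The_Old_Man_and_the_Sea> dbo:author ?author .
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["author"]["value"])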
| |
How Do Users Express Acceptance: An Ontology-Based Analysis of Blog Comments
Abstract:
Studying technology acceptance requires the survey and analysis of user opinions to identify acceptance-relevant factors. In addition to surveys, Web 2.0 offers a huge collection of user discussions and comments regarding different technologies. Blog comments provide semi-structured data with unstructured comment content. Extracting acceptance-relevant factors and user opinions from these comments requires the application of Natural Language Processing (NLP) methods, by which the unstructured data is transformed into suitable representations. Due to the language used in blogs, NLP results suffer from high error rates. In this paper, we present a user-specific study of blog comments to analyze the relation between blog language and the performance of NLP methods. Applying the proposed approach improves POS-tagging and lemmatization quality. Furthermore, we present an ontology-based corpus generation tool to improve the identification of topic- and user-specific blog comments. Using the generation tool, we analyze the identified comments and, as an example, transform them into structured datasets.
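The corpus generation step above can be imagined, in a heavily simplified form, as filtering comments by the labels of concepts in a technology ontology. The sketch below is only that simplification; the tool described in the paper is considerably richer.

    # Very small sketch of ontology-driven comment selection: keep blog
    # comments that mention a label of a concept from an assumed ontology.
    def select_comments(comments, ontology_labels):
        labels = {label.lower() for label in ontology_labels}
        selected = []
        for comment in comments:
            tokens = set(comment.lower().split())
            if tokens & labels:
                selected.append(comment)
        return selected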
| |
String distances for near-duplicate detection
Abstract:
Near-duplicate detection is important when dealing with large, noisy databases in data mining tasks. In this paper, we present the results of applying the Rank distance and the Smith-Waterman distance, alongside more popular string similarity measures such as the Levenshtein distance, combined with a disjoint-set data structure, to the problem of near-duplicate detection.
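As an illustration of how a string distance and a disjoint-set structure combine for this task, the sketch below links records whose normalized Levenshtein distance falls under a threshold and merges the links with union-find. The threshold and the brute-force pairwise comparison are assumptions for illustration only.

    # Near-duplicate grouping: pair strings whose normalized Levenshtein
    # distance is below a threshold, then merge pairs with union-find.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,            # deletion
                               cur[j - 1] + 1,         # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def near_duplicate_groups(strings, threshold=0.2):
        parent = list(range(len(strings)))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]          # path halving
                x = parent[x]
            return x
        for i in range(len(strings)):
            for j in range(i + 1, len(strings)):
                d = levenshtein(strings[i], strings[j])
                if d / max(len(strings[i]), len(strings[j]), 1) <= threshold:
                    parent[find(i)] = find(j)          # union the two clusters
        groups = {}
        for i in range(len(strings)):
            groups.setdefault(find(i), []).append(strings[i])
        return list(groups.values())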
| |
Extracting Emotive Patterns for Languages with Rich Morphology
Abstract:
This paper describes a new method of acquiring emotive patterns for morphologically rich languages, thus extending the recall of automatically generated sentiment lexical resources. The proposed bootstrapping algorithm is resource-lean, requiring only a small corpus with morphosyntactic annotations to discover new patterns. The approach, which involves rule mining and contrast sets, has been demonstrated and evaluated for Polish.
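One round of the bootstrapping loop can be sketched as follows: collect morphosyntactic patterns that frequently co-occur with seed emotive lemmas, then use the reliable patterns to propose new candidate lemmas. The pattern shape (tag of the current and following token) and the support threshold are assumptions; the paper's rule mining with contrast sets is far more elaborate.

    # One bootstrapping round in miniature over morphosyntactically tagged text.
    from collections import Counter

    def bootstrap_round(tagged_sents, seed_lemmas, min_support=3):
        pattern_hits = Counter()
        for sent in tagged_sents:                        # sent: list of (lemma, tag)
            for i, (lemma, tag) in enumerate(sent[:-1]):
                if lemma in seed_lemmas:
                    pattern_hits[(tag, sent[i + 1][1])] += 1
        good_patterns = {p for p, c in pattern_hits.items() if c >= min_support}
        candidates = set()
        for sent in tagged_sents:
            for i, (lemma, tag) in enumerate(sent[:-1]):
                if (tag, sent[i + 1][1]) in good_patterns and lemma not in seed_lemmas:
                    candidates.add(lemma)                # proposed new emotive lemma
        return candidates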
| |
Knowledge Vertices in XUNL
Abstract:
This paper addresses some lexical issues in the development of XUNL, a knowledge representation language descended from, and an alternative to, the Universal Networking Language (UNL). We present the current structure and the role of Universal Words (UWs) in UNL and claim that the syntax and the semantics of UWs demand a thorough revision in order to meet the requirements of language, culture and human independence. We draw some guidelines for XUNL and argue that its vertices should be represented by Arabic numerals; should be equivalent to sets of synonyms; should consist of generative lexical roots; should correspond to the elementary particles of meaning; and should not bear any non-relational meaning.
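A toy rendering of the proposal above is a vertex that is nothing but a numeric identifier bound to a set of synonymous lexical roots, carrying no other meaning of its own. The field names and sample values below are invented for illustration only.

    # Toy data structure for a knowledge vertex as proposed for XUNL:
    # an Arabic-numeral identifier plus a synonym set, and nothing else.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Vertex:
        ident: int                                              # numeric identifier
        roots: frozenset = field(default_factory=frozenset)     # synonymous lexical roots

    v = Vertex(101, frozenset({"book", "volume", "tome"}))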
| |
Translog-II: a Program for Recording User Activity Data for Empirical Translation Process Research
Abstract:
This paper presents a novel implementation of Translog-II, a Windows-oriented program to record and study reading and writing processes on a computer. In our research, it is an instrument for acquiring objective, digital data on human translation processes. Like its predecessors, Translog 2000 and Translog 2006, Translog-II consists of two main components: Translog-II Supervisor and Translog-II User. The two components are interdependent: Translog-II User requires a project file created in Translog-II Supervisor, and some of the main functions of Translog-II Supervisor use the log files created by Translog-II User. This paper gives an overview of the tool and its data visualization options.
| |
Exploring self-training and co-training for dependency parsing
Abstract:
In this paper, we explore the effect of self-training and co-training on Hindi dependency parsing. We use the Malt parser, the state-of-the-art Hindi dependency parser, and apply self-training using a large unannotated corpus. For co-training, we use the MST parser, which has accuracy comparable to that of the Malt parser. Experiments are performed using two types of raw corpora: one from the same domain as the test data and another that is out of domain with respect to the test data. Through these experiments, we compare the impact of self-training and co-training on Hindi dependency parsing.
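The self-training setup above can be summarized by the schematic loop below: train on the annotated data, parse the raw corpus, and retrain on the union. The train and parse callables are placeholders standing in for invocations of an external parser such as MaltParser, and the "use all auto-parsed sentences" selection policy is an assumption.

    # Schematic self-training loop for dependency parsing.
    def self_train(train_fn, parse_fn, labelled, unlabelled, rounds=1):
        model = train_fn(labelled)
        for _ in range(rounds):
            auto_parsed = [parse_fn(model, sentence) for sentence in unlabelled]
            model = train_fn(labelled + auto_parsed)     # retrain on combined data
        return model

For co-training, the auto-parsed sentences would instead come from the other parser, for example sentences parsed by the MST parser added to the Malt parser's training data, and vice versa.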
| |
On the Adequacy of Three POS Taggers and a Dependency Parser
Abstract:
A POS tagger can be used in front of a parser to reduce the number of possible dependency trees, the majority of which are spurious analyses. In this paper we compare the results of adding three morphological taggers to the parser of the CDG Lab. The experimental results show that these models perform better than the model that does not use a morphological tagger, at the cost of losing some correct analyses. In fact, the adequacy of these solutions mainly depends on the compatibility between the lexical units defined by the taggers and the dependency grammar.
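The tag-then-parse pipeline evaluated above amounts to committing to one morphological reading per token before parsing, which shrinks the parser's search space at the risk of discarding the correct reading. The sketch below is schematic; tagger and parser are placeholders, not the CDG Lab interfaces.

    # Schematic tag-then-parse pipeline.
    def tag_then_parse(tokens, tagger, parser):
        readings = [tagger(tok) for tok in tokens]   # one lexical unit per token
        return parser(list(zip(tokens, readings)))   # parser sees the reduced space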
| |
Using Continuations to Account for Plural Quantification and Anaphora Binding
Abstract:
We give in this paper an explicit formal account of plural semantics in the framework of continuation semantics introduced in [1] and extended in [4]. We deal with aspects of plural dynamic semantics such as plural quantification, plural anaphora, conjunction and disjunction, distributivity and maximality conditions. These phenomena need no extra stipulations to be accounted for in this framework, because continuation semantics provides a unified account of scope-taking.
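For readers unfamiliar with the framework, the lines below show the textbook-style continuation treatment of quantifiers that such accounts build on: a quantificational noun phrase denotes a function over its own continuation k. This is a generic illustration, not an excerpt from the paper's plural extension.

    % Illustrative continuation-style denotations (not taken from the paper):
    % a quantificational NP applies to its continuation k.
    \[
      [\![\textit{every student}]\!] \;=\; \lambda k.\ \forall x\,(\mathit{student}(x) \rightarrow k(x))
      \qquad
      [\![\textit{some book}]\!] \;=\; \lambda k.\ \exists y\,(\mathit{book}(y) \wedge k(y))
    \]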
| |
A Computational Implementation of Symmetric and Asymmetric Verbal Coordination
Abstract:
Of the coordination structures in Korean, the symmetric and asymmetric properties of verbal coordination have challenged both theoretical and computational approaches. This paper shows how a typed feature structure grammar, HPSG, together with the notions of 'type hierarchy' and 'constructions', can provide a robust basis for parsing (un)tensed verbal coordination as well as the pseudo-coordination found in the language. We show that the analysis sketched here, and computationally implemented in the existing resource grammar for Korean, the Korean Resource Grammar (KRG), can yield robust syntactic structures as well as enriched semantic representations for real-time applications such as machine translation.
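A type hierarchy of the kind mentioned above can be pictured as a partial order in which coordination constructions specialize more general phrase types. The sketch below uses invented type names purely for illustration; it does not reproduce the KRG's actual hierarchy.

    # Toy type hierarchy in the spirit of typed feature structure grammars,
    # with a simple subsumption check.  Type names are illustrative only.
    HIERARCHY = {
        "sym-v-coord": "v-coord",
        "asym-v-coord": "v-coord",
        "pseudo-coord": "asym-v-coord",
        "v-coord": "phrase",
        "phrase": None,
    }

    def subsumes(general, specific):
        """True if `general` is `specific` or one of its ancestors."""
        while specific is not None:
            if specific == general:
                return True
            specific = HIERARCHY.get(specific)
        return False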
|