CICLing 2016 Accepted Papers with Abstracts

LNCS

Santanu Pal, Sudip Naskar and Josef van Genabith. Forest to String Based Statistical Machine Translation with Hybrid Word Alignments
Abstract: Forest to String Based Statistical Machine Translation (FSBSMT) is a forest-based tree sequence to string translation model for syntax-based statistical machine translation. The model automatically learns tree sequence to string translation rules from a given word alignment estimated on a source-side parsed bilingual parallel corpus. This paper proposes a hybrid method which combines different word alignment methods and integrates them into an FSBSMT system. The hybrid word alignment provides the most informative alignment links to the state-of-the-art FSBSMT system. We show that hybrid word alignment integrated into various experimental settings of FSBSMT provides considerable improvement over state-of-the-art Hierarchical Phrase-based SMT (HPBSMT). The research also demonstrates that integration of prior alignment of Named Entities (NEs) and Example based Machine Translation (EBMT) phrases into the proposed system brings about further considerable performance improvements over the hybrid FSBSMT system. We apply our hybrid model to a distant language pair, English–Bengali. The proposed system provides a 78.5% relative (9.84 BLEU points) improvement over the baseline HPBSMT.
Goutam Majumder, Dr. Partha Pakray, Zoramdinthara Khiangte and Alexander Gelbukh. Literature Survey: Multiword Expressions (MWE) for Mizo Language
Abstract: In this paper, we examine the formation of Multi Word Expressions (MWEs) and reduplicated words in the Mizo language from a news corpus. To understand the structure of reduplication, this paper follows the lexical as well as morphological approaches that have been used for other Indian languages such as Manipuri, Bengali, Odia and Marathi. We also try to show their effect on natural language processing tasks for Mizo in comparison with these languages. After identification, the MWEs and reduplicated words were verified by linguistic experts of the Mizo language.
Wuying Liu and Lin Wang. Fast-Syntax-Matching-based Japanese-Chinese Limited Machine Translation
Abstract: Limited machine translation (LMT) is an unliterate automatic translation based on a bilingual dictionary and sentence bank, and related algorithms can be widely used in natural language processing applications. This paper addresses the Japanese-Chinese LMT problem, proposes two syntactic hypotheses about the Japanese language, and designs a fast-syntax-matching-based Japanese-Chinese (FSMJC) LMT algorithm. In this algorithm, the fast syntax matching function, a modified version of the Levenshtein function, approximates syntactic similarity by efficiently calculating the formal similarity between two Japanese sentences. The experimental results show that the FSMJC LMT algorithm can achieve preferable performance with greatly reduced time costs, and prove that our two syntactic hypotheses are effective on Japanese text.
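The abstract sketches the matching function only at a high level. Below is a minimal illustration, in Python, of how a token-level Levenshtein distance can be turned into a formal-similarity score between two sentences; the whitespace tokenisation and the normalisation are illustrative assumptions, not the paper's actual modified Levenshtein function.

```python
# Hedged sketch: a token-level edit distance normalised into a similarity score.
# The paper's modified Levenshtein function and its Japanese-specific handling are
# not reproduced here; tokenisation and normalisation below are illustrative choices.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over token sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def formal_similarity(sent_a, sent_b):
    """Similarity in [0, 1] derived from the normalised token-level edit distance."""
    tok_a, tok_b = sent_a.split(), sent_b.split()  # a real system would use a Japanese tokeniser
    dist = levenshtein(tok_a, tok_b)
    return 1.0 - dist / max(len(tok_a), len(tok_b), 1)
```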
Arjun Mukherjee. Extracting Aspect Specific Sentiment Expressions implying Negative Opinions
Abstract: Subjective expression extraction is a central problem in fine-grained sentiment analysis. Most existing works focus on generic subjective expression extraction as opposed to aspect specific opinion phrase extraction. Given the ever-growing product reviews domain, extracting aspect specific opinion phrases is important as it yields the key product issues that are often mentioned via phrases (e.g., “signal fades very quickly,” “had to flash the firmware often”). In this paper, we solve the problem using a combination of generative and discriminative modeling. The generative model performs a first level of processing, facilitating (1) discovery of potential head aspects containing issues, (2) generation of a labeled dataset of issue phrases, and (3) feeding latent semantic features to subsequent discriminative modeling. We then employ discriminative large-margin and sequence modeling with pivot features for issue sentence classification and issue phrase boundary extraction. Experimental results using real-world reviews from Amazon.com demonstrate the effectiveness of the proposed approach.
Geli Fei, Arjun Mukherjee, Zhiyuan Chen and Bing Liu. Discovering Correspondence of Sentiment Words and Aspects
Abstract: Extracting aspects and sentiments is a key problem in sentiment analysis. Existing models rely on joint modeling with supervised aspect and sentiment switching. This paper explores unsupervised models by exploiting a novel angle – correspondence of sentiments with aspects via topic modeling under two views. The idea is to split documents into two views and model the topic correspondence across the two views. We propose two new models that work on a set of document pairs (documents with two views) to discover their corresponding topics. Experimental results show that the proposed approach significantly outperforms strong baselines.
Mohamed Dermouche, Julien Velcin, Rémi Flicoteaux, Sylvie Chevret and Namik Taright. Supervised Topic Models for Diagnosis Code Assignment to Discharge Summaries
Abstract: Mining medical data has gained significant interest in recent years thanks to advances in the data mining and machine learning fields. In this work, we focus on a challenging issue in medical data mining: automatic diagnosis code assignment to discharge summaries, i.e., characterizing a patient's hospital stay (diseases, symptoms, treatments, etc.) with a set of codes usually derived from the International Classification of Diseases (ICD). We cast the problem as a machine learning task and we experiment with some recent approaches based on probabilistic topic models. We demonstrate the efficiency of these models in terms of high predictive scores and ease of result interpretation. As such, we show how topic models enable gaining insights into this field and provide new research opportunities for possible improvements.
Adiel Mittmann, Aldo von Wangenheim and Alckmar Dos Santos. Aoidos: A System for the Automatic Scansion of Poetry Written in Portuguese
Abstract: Scansion is the ancient activity of determining the patterns that give verses their poetic character. In Portuguese, this means discovering the number of syllables that the verses in a poem possess and fitting all verses to this measure, while attempting to place syllables so that an adequate stress pattern is produced. This article presents Aoidos, a rule-based system that takes a poem written in the Portuguese language and performs scansion automatically, further providing an analysis of rhymes. The system works by making a phonetic transcription of a poem, determining the number of poetic syllables that verses in the poem should possess, fitting all verses according to this measure and looking for verses that rhyme. Experiments show that the system attains a high accuracy rate (above 98%).
Batuer Aisha. Uyghur Shallow Parsing using Part-of-Speech Features
Abstract: In this paper, we introduce a novel model for shallow parsing of Uyghur text with rich morphological information. A new Uyghur text shallow parsing algorithm is proposed based on conditional random fields (CRFs) with part-of-speech (POS) features. The experimental results show that full use of the morphological feature information can improve the accuracy of Uyghur text chunk identification. The algorithm achieves an impressive accuracy of 93.1% in terms of the F-score.
Matīss Rikters and Inguna Skadiņa. Combining machine translated sentence chunks from multiple MT systems
Abstract: This paper presents a hybrid machine translation (HMT) system that performs syntactic analysis to acquire phrases of source sentences, translates the phrases using multiple online machine translation (MT) system application program interfaces (APIs) and generates output by combining translated chunks to obtain the best possible translation. The aim of this study is to improve the translation quality of English–Latvian texts over each of the individual MT APIs. The selection of the best translation hypothesis is done by calculating the perplexity of each hypothesis using an n-gram language model (LM). The result is a phrase-based multi-system machine translation (ChunkMT) system that improves MT output compared to the individual online MT systems and the baseline (best translation hypothesis) system. Results show improvements of up to +1.48 in BLEU and -0.015 in TER compared to the baselines and related research projects.
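As one way to picture the perplexity-based selection step described above, the sketch below scores candidate chunk combinations with a toy add-one-smoothed bigram model and keeps the lowest-perplexity hypothesis. The training function, smoothing and sentence markers are assumptions made for illustration; the actual system presumably relies on a full n-gram LM toolkit.

```python
import math
from collections import Counter

# Hedged sketch of perplexity-based hypothesis selection. A toy add-one-smoothed
# bigram model stands in for the n-gram LM used in the paper; corpus, smoothing
# and sentence markers are illustrative assumptions.

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def perplexity(sentence, unigrams, bigrams):
    toks = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(unigrams)
    log_prob = 0.0
    for prev, curr in zip(toks, toks[1:]):
        p = (bigrams[(prev, curr)] + 1) / (unigrams[prev] + vocab)  # add-one smoothing
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(toks) - 1))

def best_hypothesis(hypotheses, unigrams, bigrams):
    """Return the combined translation with the lowest LM perplexity."""
    return min(hypotheses, key=lambda h: perplexity(h, unigrams, bigrams))
```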
Aneeta Niazi. Morphological Analysis of Urdu Verbs
Abstract: The acquisition of knowledge about word characteristics is a basic requirement for developing natural language processing applications for a particular language. In this paper, we present a detailed analysis of the morphology of Urdu verbs. During our analysis, we have observed that Urdu verbs can have 47 different types of inflections. The different inflected forms of 975 Urdu verbs have been analyzed and the details of the analysis are presented. We propose a new classification scheme for Urdu verbs, based on morphology. The morphological rules proposed for each class have been tested by simulation with a 2-layer morphological analyzer based on finite state transducers. The analysis and generation of surface forms have been successfully carried out, indicating the robustness of the proposed methodology.
Souvick Ghosh, Dr. Dipankar Das and Tanmoy Chakraborty. Determining sentiment in citation text and analyzing its impact on the proposed ranking index
Abstract: Whenever human beings interact with each other, they exchange or express opinions, emotions and sentiments. These opinions can be expressed in text, speech or images. Analysis of these sentiments is one of the popular research areas of present day researchers. Sentiment analysis, also known as opinion mining, tries to identify or classify these sentiments or opinions into two broad categories – positive and negative. Much work on sentiment analysis has been done on social media conversations, blog posts, newspaper articles and various narrative texts. However, identifying emotions in scientific papers is difficult due to the implicit and hidden nature of the opinions or emotions expressed. As citation instances are considered inherently positive in emotion, popular ranking and indexing paradigms often neglect the opinion present while citing. Therefore, in the present paper, we deployed a system of citation sentiment analysis to achieve three major objectives. First, we identified sentiments in the citation text and assigned a score to each of the instances, using a supervised classifier for this purpose. Secondly, we have proposed a new index (we shall refer to it hereafter as the M-index) which takes into account both quantitative and qualitative factors while scoring a paper. Finally, we developed a ranking of research papers based on the M-index. We have also shown the impact of the M-index on the ranking of scientific papers.
Ella Rabinovich, Shuly Wintner and Ofek Luis Lewinsohn. A Parallel Corpus of Translationese
Abstract: We describe bilingual English-French and English-German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research on translationese and its applications to (human and machine) translation; specifically, they can be used for the task of translationese identification, a research direction that has enjoyed growing interest in recent years. To validate the quality and reliability of the corpora, we replicated previous results of supervised and unsupervised identification of translationese, and further extended the experiments to additional datasets and languages.
Elnaz Davoodi, Leila Kosseim, Felix-Herve Bachand, Majid Laali and Emmanuel Argollo. Classification of Textual Genres using Discourse Information
Abstract: This paper aims to measure the influence of textual genre on the usage of discourse relations and discourse markers. Specifically, we wish to evaluate to what extent the use of certain discourse relations and discourse markers is correlated to textual genre and consequently can be used to predict textual genre. To do so, we have used the British National Corpus and compared a variety of discourse-level features on the task of genre classification.

The results show that individually, discourse relations and discourse markers do not outperform the standard bag-of-words approach even when the number of features is reduced. However, discourse features do provide a significant increase in performance when they are used to augment the bag-of-words approach. Using discourse relations and discourse markers allowed us to increase the F-measure of the bag-of-words approach from 0.796 to 0.878.
Fériel Ben Fraj Trabelsi, Chiraz Ben Othmane Zribi and Saoussen Mathlouthi. Arabic Anaphora resolution using Markov decision process
Abstract: Anaphora resolution is one of the attractive problems of the NLP field. In this paper, we treat the problem of resolving pronominal anaphora, which are very abundant in Arabic texts. Our approach includes a set of steps, namely: identification of anaphoric pronouns, removal of non-referential ones, identification of the lists of candidates from the context surrounding the anaphora, and choice of the best candidate for each anaphoric pronoun. The last two steps can be seen as a dynamic and probabilistic process that consists of a sequence of decisions and can be modeled using a Markov Decision Process (MDP). In addition, we have opted for a reinforcement learning approach because it is an effective method for learning in an uncertain and stochastic environment like ours and can solve MDPs. In order to evaluate the proposed approach, we have developed an interactive system that gives encouraging results. The resolution accuracy reaches up to 80%.
Calkin Suero Montero, Hatem Haddad, Maxim Mozgovoy and Chedi Bechikh Ali. Detecting the Likely Causes behind the Emotion Spikes of Influential Twitter Users
Abstract: Understanding the causes of spikes in the emotion flow of influential social media users is a key component when analyzing the diffusion and adoption of opinions and trends. Hence, in this work we focus on detecting the likely reasons or causes of spikes within influential Twitter users’ emotion flow. To achieve this, once an emotion spike is identified we use linguistic and statistical analyses on the tweets surrounding the spike in order to reveal the spike’s likely explanations or causes in the form of keyphrases. Experimental evaluation on emotion flow visualization, emotion spike identification and likely cause extraction for several influential Twitter users shows that our method is effective for pinpointing interesting insights behind the causes of the emotion fluctuation. Implications of our work are highlighted by relating emotion flow spikes to real-world events and by the transversal application of our technique to other types of timestamped text.
Mourad Gridach. Deep Learning Approach for Arabic Named Entity Recognition
Abstract: Inspired by recent work in Deep Learning that has achieved excellent performance on difficult problems such as computer vision and speech recognition, we introduce a simple and fast model for Arabic named entity recognition based on Deep Neural Networks (DNNs). Named Entity Recognition (NER) is the task of classifying or labelling atomic elements in text into categories such as Person, Location or Organization. The unique characteristics and the complexity of the Arabic language make the extraction of named entities a challenging task. Most state-of-the-art systems use a combination of various Machine Learning algorithms or rely on handcrafted engineering features and the output of other NLP tasks such as part-of-speech (POS) tagging, text chunking, prefixes and suffixes, as well as a large gazetteer. In this paper, we present an Arabic NER system based on DNNs that automatically learns features from data. The experimental results show that our approach outperforms a model based on Conditional Random Fields by 11.97 points in F-measure. Moreover, our model outperforms the state-of-the-art by 5.18 points in accuracy and achieves very close results in F-measure. Most importantly, our system can be easily extended to recognize other named entities without any additional rules or handcrafted engineering features.
Jie Yang, Zhiyang Teng, Meishan Zhang and Yue Zhang. Combining Discrete and Neural Features for Sequence Labeling
Abstract: Neural network models have recently received heated research attention in the natural language processing community. Compared with traditional models with discrete features, neural models have two main advantages. First, they take low-dimensional, real-valued embedding vectors as inputs, which can be trained over large raw data, thereby addressing the issue of feature sparsity in discrete models. Second, deep neural networks can be used to automatically combine input features, including non-local features that capture semantic patterns that cannot be expressed using discrete indicator features. As a result, neural network models have achieved competitive accuracies compared with the best discrete models for a range of NLP tasks.

On the other hand, manual feature templates have been carefully investigated for most NLP tasks over decades and typically cover the most useful indicator patterns for solving the problems. Such information can be complementary to the features automatically induced from neural networks, and therefore combining discrete and neural features can potentially lead to better accuracies compared with models that leverage discrete or neural features only.

In this paper, we systematically investigate the effect of discrete and neural feature combination for a range of fundamental NLP tasks based on sequence labeling, including word segmentation, POS tagging and named entity recognition for Chinese and English, respectively. Our results on standard benchmarks show that state-of-the-art neural models can give accuracies comparable to the best discrete models in the literature for most tasks, and that combining discrete and neural features consistently yields better results.
Mustafa Aksan, Umut Demirhan and Yeşim Aksan. Corpus Frequency and Affix Ordering in Turkish
Abstract: Suffix sequences in agglutinative languages derive complex structures. Based on frequency information from corpus data, this study will present an analysis of emerging multi-morpheme sequences in Turkish. Morphgrams formed by incorporating voice suffixes with other verbal suffixes from finite and non-finite templates are analyzed on the basis of the cited morpheme orders in the corpus. Statistical analyses are conducted on permissible combinations of these suffixes. The findings of the study have implications for further studies on morphological processing in agglutinative languages.
İlknur Dönmez and Eşref Adalı. Turkish Document Classification with Coarse-grained Semantic Matrix
Abstract: In this paper, we present a novel method for document classification that uses a semantic matrix representation of Turkish sentences by concentrating on the sentence phrases and their concepts in text. Our model has been designed to find phrases in a sentence, identify their relations with specific concepts, and represent the sentences as a coarse-grained semantic matrix. Predicate features and semantic class type are also added to the coarse-grained semantic matrix representation. The highest success rate in Turkish document classification, 97.12%, is obtained by adding the coarse-grained semantic matrix representation to the data on which the previous highest result had been obtained and by using the Naive Bayes algorithm.
Zijian Győző Yang, László János Laki and Borbála Siklósi. Quality Estimation for English-Hungarian with Optimized Semantic Features and the HuQ corpus
Abstract: Quality estimation at run-time for machine translation systems is an important task. The standard automatic evaluation methods that use reference translations cannot evaluate in real time, and the correlation between the results of these methods and human evaluation is very low in the case of translations from English to Hungarian. The approach to solving this problem is called quality estimation. These methods address the task by estimating the quality of translations as a prediction task for which features are extracted from only the source and translated sentences. In this study, we implement quality estimation for English-Hungarian. First, a corpus is created which contains Hungarian human judgements. Using these human evaluation scores, different quality estimation models are described, evaluated and optimized. We created a corpus for English-Hungarian quality estimation, developed 27 new semantic features using WordNet and a word embedding model, and then created feature sets optimized for Hungarian, which produced better results than the baseline feature set.
Balázs Indig and István Endrédy. Gut, Besser, Chunker -- Selecting the best models for text chunking with voting
Abstract: The CoNLL-2000 dataset is the de-facto standard dataset for measuring chunkers on the task of chunking base noun phrases (NP) or arbitrary phrases. The state-of-the-art tagging method utilises TnT, an HMM-based Part-of-Speech (POS) tagger, with simple majority voting on different representations and fine-grained classes created by lexicalising tags.
In this paper the state-of-the-art English phrase chunking method was deeply investigated, re-implemented and evaluated with several modifications. We also investigate a less studied side of phrase chunking, i.e. the voting between different currently available taggers, the checking of invalid sequences, and how the state-of-the-art method can be adapted to morphologically rich, inflecting languages.

We propose a new, mild level of lexicalisation and a better combination of representations and taggers for English. The final architecture outperformed the state-of-the-art for arbitrary phrase identification and NP chunking.
Joan Byamugisha, C. Maria Keet and Langa Khumalo. Pluralising Nouns in isiZulu and Related Languages
Abstract: There are compelling reasons for a Controlled Natural Language of isiZulu in software applications, which requires pluralising nouns. Only `canonical' singular/plural pairs exist, however, which are insufficient for computational use of isiZulu.
Starting from these rules, we take an experimental approach as a virtuous spiral to refine the rules by repeatedly testing two test sets against successive versions of the refined pluralisation rules. This resulted in the elucidation of additional pluralisation rules not included in typical isiZulu textbooks and grammar resources, and motivated design choices for algorithm development. We assessed the potential for reuse of the approach and the types of deviations with Runyankore, which demonstrated encouraging results.
Ameur Douib, David Langlois and Kamel Smaïli. Genetic-based decoder for statistical machine translation
Abstract: In the community of statistical machine translation, MOSES is the open-source decoder most used by researchers. It is based on a beam-search algorithm, where from a source sentence f the algorithm incrementally builds a set of complete translations, starting from the empty translation. For each translation hypothesis, and at each step of the building process, a new phrase is added to the translation. Consequently, the algorithm produces a large number of partial translations. To reduce the number of partial translations, a pruning process is applied, where the n-best translations are retained for the next step. To select these n-best partial translations, the language model and translation model are used to score each translation. Finally, from the set of complete translations, the one which has the highest score is chosen as the final translation e. This algorithm and its variants give good results, but present two significant risks. The first risk is in the search space exploration: it is impossible to change a decision which has been taken at a previous step, so it is possible to lose a good translation or even a partial solution which could lead to a good final translation. The second risk is in the decision making: at each step, MOSES keeps some translations and eliminates others, depending on the scores of partial translations and not complete translations.

In this paper, we present an alternative to MOSES's decoder based on an Evolutionary Algorithm (EA). The main argument for this choice is to work with complete solutions from the very first step of the process, rather than partial ones as in MOSES. Another reason is that this kind of algorithm imposes no constraints on the underlying structure of the solution.

The approach implemented in this paper is the Genetic Algorithm. In this algorithm we start with an initial population of complete translations (chromosomes). After that, in an iterative process, the algorithm improves the quality of the population (translations) by producing new chromosomes. The new chromosomes are obtained by applying two kinds of functions. The crossover function selects two chromosomes ("parents") for breeding and merges some information of the parents, producing two new chromosomes ("child" solutions). The second kind of function is mutation; in this case one chromosome is taken as a parent and the function applies some modification at the phrase level to produce a new chromosome as a child. We define an objective function to score chromosomes and select which of them will be saved for the next iteration. This selection ensures the evolution of the population. At the end of the process, the best translation e is the one which has the best score.

Using BLEU and TER metrics, we evaluate the translation quality of our decoder, and compare these results with MOSES results. Our decoder achieves promising results.
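The evolutionary loop described in this abstract can be pictured with the generic skeleton below. The crossover and mutation operators and the objective function are placeholders supplied by the caller: in the paper they work at the phrase level and score chromosomes with translation and language model features, none of which is reproduced here.

```python
import random

# Hedged sketch of a genetic decoding loop. Crossover, mutation and the objective
# are caller-supplied placeholders; the paper's phrase-level operators and model
# scores are not reproduced here.

def evolve(initial_population, objective, crossover, mutate,
           generations=100, population_size=50, mutation_rate=0.2):
    population = list(initial_population)
    for _ in range(generations):
        offspring = []
        for _ in range(population_size // 2):
            p1, p2 = random.sample(population, 2)  # select two parent chromosomes
            offspring.extend(crossover(p1, p2))    # merge information, two children
        offspring = [mutate(c) if random.random() < mutation_rate else c
                     for c in offspring]           # phrase-level modifications
        # keep the best-scoring chromosomes for the next iteration
        population = sorted(population + offspring,
                            key=objective, reverse=True)[:population_size]
    return max(population, key=objective)          # the best translation e
```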
Mohamed Amine Menacer, Abdelfetah Boumerdas, Chahnez Zakaria and Kamel Smaili. A new language model based on possibility theory
Abstract: Modeling language is a very important step in several applications of NLP. Most language models used today are based on probabilistic methods. In this paper, we describe a new approach to modeling language based on possibility theory. Our goal is to propose a method for estimating the possibility of a sequence of words and to test our new approach in a machine translation system.
We propose a word-sequence possibilistic measure, which can be estimated from a corpus. Our proposal is evaluated in two ways: the aim of the first one is to study the behavior of our approach compared to the existing work. In the second test, we compare our new approach with the probabilistic language model used in statistical MT systems.
The results, in terms of the METEOR metric, show that the possibilistic language model is better than the probabilistic one. However, the probabilistic model remains better than the possibilistic one in terms of BLEU and TER scores.
Marwa Naili, Anja Habacha Chaibi and Henda Hajjami Ben Ghezala. Parameters driving effectiveness of LSA on topic segmentation
Abstract: Latent Semantic Analysis (LSA) is an efficient statistical technique for extracting semantic knowledge from large corpora. One of the major problems of this technique is the identification of the most efficient parameters of LSA and the best combination between them. Therefore, in this paper, we propose a new topic segmenter to study in depth the different parameters of LSA for topic segmentation. The aim of this study is to analyze the effect of these different parameters on the quality of topic segmentation and to identify the most efficient parameters. Based on extensive experiments, we show that the choice of LSA parameters is very sensitive and has an impact on the quality of topic segmentation. More importantly, according to this study, we are able to propose appropriate recommendations for the selection of parameters in the field of topic segmentation.
Viet Tran Hong, Huyen Vu Thuong, Vinh Nguyen Van and Minh Nguyen Le. A Classifier-based Preordering Approach for English-Vietnamese Statistical Machine Translation
Abstract: Reordering is a problem of essential importance for phrase-based statistical machine translation (SMT). In this paper, we propose an approach to automatically learn reordering rules as a preprocessing step based on a dependency parser in phrase-based statistical machine translation from English to Vietnamese. We use dependency parsing and rules extracted by training feature-rich discriminative classifiers for reordering source-side sentences. We evaluated our approach on English-Vietnamese machine translation tasks and showed that it outperforms the baseline phrase-based SMT system.
Mahdi Mohseni, Javad Ghofrani and Heshaam Faili. Persianp: a Persian Text Processing Toolbox
Abstract: This paper describes the Persianp Toolbox, an integrated Persian text processing system that is easily used in other software applications. The toolbox, which provides fundamental Persian text processing steps, includes several modules. In developing some modules of the toolbox, such as the normalizer, tokenizer, sentencizer, stop word detector and part-of-speech tagger, previous studies are applied. In other modules, i.e., the Persian lemmatizer and NP chunker, new ideas in preparing the required training data and/or applying new techniques are presented. Experimental results show the strong performance of the toolbox in each part. The accuracies of the tokenizer, the POS tagger, the lemmatizer and the NP chunker are 97%, 95.6%, 97% and 97.2%, respectively.
Daniela Gîfu and Radu Simionescu. Tracing Language Variation for Romanian
Abstract: This paper illustrates a pilot study on two collections of publications, written from the middle of the 19th century onwards in two countries, Romania and the Republic of Moldova. The corpus includes articles from the most important Romanian and Bessarabian publications, categorized in three periods of time: 1840-1917, 1918-1940, and 1941-1991. The research conducted on these resources focuses on the lexical evolution of words. We use a machine learning approach to explore the patterns that govern the lexical differences between two lexicons. The model is used for automatically correlating different forms of a word. The approach is suitable for bootstrapping, in order to increase the quantity and quality of the training data. The presented approach is language independent. By using the contemporary language as a pivot, the data is analyzed and compared from various perspectives.
Ugur Sopaoglu and Gonenc Ercan. Evaluation of Semantic Relatedness Measures for Turkish Language
Abstract: The problem of quantifying the level of semantic relatedness of two words is a fundamental sub-task for many natural language processing systems. While there is a large body of research on measuring semantic relatedness in the English language, the literature lacks detailed analysis of these methods for agglutinative languages. In this article, evaluation resources for the Turkish language are constructed. An extensive set of experiments involving multiple tasks: word association, semantic categorization, and automatic WordNet relationship discovery are performed to evaluate different semantic relatedness measures in the Turkish language. Our experiments compare the performance of distributional similarity based semantic relatedness measures. The morphological processing component defines what to observe in distributional similarity based algorithms. For languages with rich morphological variation and productivity, methods ranging from simple stemming strategies to morphological disambiguation exist. In our experiments, different morphological processing methods for the Turkish language are evaluated in three different semantic relatedness tasks.
Francisco Manuel Rangel Pardo, Paolo Rosso and Marc Franco Salvador. A Low Dimensionality Representation for Language Variety Identification
Abstract: Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with the state-of-the-art and show an increase in accuracy of ∼35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality — and increasing the big data suitability — to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.
Tuba Parlar, Selma Ayse Ozel and Fei Song. Interactions between term weighting and feature selection methods for Sentiment Analysis of Turkish reviews
Abstract: Term weighting methods assign appropriate weights to the terms in a document so that more important terms receive higher weights for the text representation. In this study, we consider four term weighting methods and investigate their effects on the sentiment analysis of Turkish reviews. We also try several feature selection methods and examine how those term weighting methods respond to the reduced text representation. Experiments are conducted on five Turkish review datasets so that we can establish baselines and compare the performance of these term weighting methods. Furthermore, we tested these techniques on the English review datasets as well so that their differences could be compared with the Turkish review datasets.
William Léchelle and Philippe Langlais. An informativeness approach to Open IE evaluation
Abstract: Open Information Extraction (OIE) systems extract relational tuples from text without requiring the relations of interest to be specified in advance. Systems perform well on widely used metrics such as precision and yield, but a close look at system outputs shows a general lack of informativeness in facts deemed correct. We propose a new evaluation protocol that is closer to text understanding and end-user needs. Extracted information is judged upon its capacity to automatically answer questions about the source text. We devise a small corpus of question/answer pairs and use it to evaluate available state-of-the-art OIE systems. Our results are in line with previous findings. We will distribute our annotated data and automatic evaluation program.
Aslı Eyecioğlu Özmutlu and Bill Keller. Constructing A Turkish Corpus for Paraphrase Identification and Semantic Similarity
Abstract: The paraphrase identification (PI) task has practical importance for work in Natural Language Processing (NLP) because of the problem of linguistic variation. Accurate methods for PI should help improve the performance of key NLP applications. This paper describes the construction of a paraphrase corpus for Turkish. The corpus comprises pairs of sentences with semantic similarity scores based on human judgments of similarity, permitting experimentation with both PI and semantic similarity. We believe this is the first such corpus for Turkish and that it should be of value to other researchers. The methodology used to construct the corpus is described and we report initial PI experiments with the corpus using 'knowledge lean' methods (i.e. no use of manually constructed knowledge bases or processing tools that rely on these). We have previously achieved excellent results using such techniques on the Microsoft Research Paraphrase Corpus (MSRPC), and state-of-the-art performance on the Twitter Paraphrase Corpus (TPC).
Lyndon White, Roberto Togneri, Wei Liu and Mohammed Bennamoun. Generating Bags of Words from the Sums of their Word Embeddings
Abstract: Many methods have been proposed to generate sentence vector representations, such as recursive neural networks, latent distributed memory models, and the simple sum of word embeddings (SOWE). However, very few methods demonstrate the ability to reverse the process -- recovering sentences from sentence embeddings. Amongst the many sentence embeddings, SOWE has been shown to maintain semantic meaning, so in this paper we introduce a method for moving from the SOWE representation back to the bag of words (BOW) of the original sentence. This is a partial step towards recovering the whole sentence and has useful theoretical and practical applications of its own. This is done using a greedy algorithm to convert the vector to a bag of words. To our knowledge this is the first such work. It demonstrates qualitatively the ability to recreate the words of sentences from a large corpus based on their sentence embeddings.

As well as having practical applications in allowing classical information retrieval methods to be combined with more recent methods based on sums of word embeddings, the success of this method has theoretical implications for the degree of information maintained by the sum-of-embeddings representation. This lends some credence to the consideration of SOWE as a dimensionality-reduced, and meaning-enhanced, data manifold for the bag of words.
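The greedy conversion from a sum-of-word-embeddings vector back to a bag of words could look roughly like the sketch below: at each step, the word whose embedding brings the running residual closest to zero is added to the bag. The stopping criterion and candidate scoring are assumptions; the paper's actual algorithm may differ in both.

```python
import numpy as np

# Hedged sketch of greedy bag-of-words recovery from a sum-of-word-embeddings vector.
# vocab_embeddings is a (V, d) matrix; the stopping rule below is an assumption.

def greedy_bow(target, vocab_embeddings, vocab_words, max_words=30):
    """Greedily pick words whose embeddings best explain the target sum vector."""
    residual = np.asarray(target, dtype=float).copy()
    bag = []
    for _ in range(max_words):
        # norm of the residual after hypothetically subtracting each candidate embedding
        dists = np.linalg.norm(vocab_embeddings - residual, axis=1)
        best = int(np.argmin(dists))
        if dists[best] >= np.linalg.norm(residual):
            break                                   # no word brings us closer to the target
        bag.append(vocab_words[best])
        residual = residual - vocab_embeddings[best]
    return bag
```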
Sachin Pawar, Pushpak Bhattacharyya and Girish Palshikar. End-to-End Relation Extraction using Markov Logic Networks
Abstract: The task of end-to-end relation extraction consists of two sub-tasks: i) identifying entity mentions along with their types and ii) recognizing semantic relations among the entity mention pairs. It has been shown that for better performance, it is necessary to address these two sub-tasks jointly [22, 13]. We propose an approach for simultaneous extraction of entity mentions and relations in a sentence, by using inference in Markov Logic Networks (MLN) [21]. We learn three different classifiers: i) local entity classifier, ii) local relation classifier and iii) "pipeline" relation classifier which uses predictions of the local entity classifier. Predictions of these classifiers may be inconsistent with each other. We represent these predictions along with some domain knowledge using weighted first-order logic rules in an MLN and perform joint inference over the MLN to obtain a global output with minimum inconsistencies. Experiments on the ACE (Automatic Content Extraction) 2004 dataset demonstrate that our approach of joint extraction using MLNs outperforms the baselines of individual classifiers. Our end-to-end relation extraction performance also outperforms the best result reported previously on the ACE 2004 dataset.
Megala Uthayakumar, Pranavan Theivendiram, Nilusija Nadarasamoorthy, Mokanarangan Thayaparan, Sanath Jayasena, Gihan Dias and Surangika Ranathunga. Named-Entity-Recognition (NER) for Tamil Language Using Margin-Infused Relaxed Algorithm (MIRA)
Abstract: Named-Entity-Recognition (NER) is widely used as a foundation for Natural Language Processing (NLP) applications. There have been few previous attempts at building generic NER systems for the Tamil language. These attempts were based on machine-learning approaches such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), Support Vector Machines (SVM) and Conditional Random Fields (CRF). Among them, CRF has been proven to be the best with respect to the accuracy of NER in Tamil. This paper presents a novel approach to building a Tamil NER system using the Margin-Infused Relaxed Algorithm (MIRA). We also present a comparison of the performance of the MIRA and CRF algorithms for Tamil NER. When the gazetteer, POS tags and orthographic features are used with the MIRA algorithm, it attains an F1-measure of 81.38% on the Tamil BBC news data, whereas the CRF algorithm shows only an F1-measure of 79.13% for the same set of features. Our NER system outperforms all the previous NER systems for the Tamil language.
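The abstract names MIRA but not the exact variant used. For orientation, the sketch below shows the standard single-constraint MIRA weight update over feature vectors; the feature extraction, decoding step and loss function are placeholders and not the paper's.

```python
import numpy as np

# Hedged sketch of a single-constraint MIRA update, the core of the margin-infused
# relaxed algorithm. Features, decoding and the loss are placeholders.

def mira_update(weights, gold_feats, pred_feats, loss, C=0.01):
    """Smallest weight change that scores the gold above the prediction by `loss`."""
    delta = gold_feats - pred_feats
    margin = weights.dot(delta)
    if margin >= loss:
        return weights                         # constraint already satisfied
    norm_sq = delta.dot(delta)
    if norm_sq == 0.0:
        return weights                         # identical feature vectors, nothing to do
    tau = min(C, (loss - margin) / norm_sq)    # clipped step size (C caps aggressiveness)
    return weights + tau * delta
```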
Rudra Murthy and Pushpak Bhattacharyya. A Complete Deep Learning Solution to Named Entity Recognition
Abstract: Identifying Named Entities is vital for many Natural Language Processing (NLP) applications. Much of the earlier work on identifying named entities focused on using handcrafted features and knowledge resources (feature engineering). This is crucial for resource-scarce languages for which many resources are not readily available. Recently, Deep Learning techniques have been proposed for many NLP tasks requiring little/no hand-crafted features and knowledge resources; instead, the features are learned from the data. Many deep learning solutions for Named Entity Recognition (NER) still rely on feature engineering as opposed to feature learning. However, it is not clear whether the deep learning architecture or the engineered features are responsible for the positive results reported. This is in contrast with the goal of deep learning systems, i.e., to learn the features from the data itself. In this study, we examine whether a feature-learning deep learning system is a viable solution to the NER task. We test our deep learning system on the CoNLL English NER dataset. Our system is able to give results comparable to existing state-of-the-art feature-engineered systems. We report the best performance of 89.27 F-Score when comparing with systems which do not use any handcrafted features or knowledge resources. Evaluation of our trained system on out-of-domain data indicates that the results are promising. Our system, when tested on Spanish NER, achieves improvements, indicating its applicability to other languages.
Gozde Gul Sahin. Verb Sense Annotation For Turkish PropBank via Crowdsourcing
Abstract: In order to extract meaning representations from sentences, a corpus annotated with semantic roles is obligatory. Unfortunately, building such a corpus requires a tremendous amount of manual work for creating semantic frames and annotating the corpus. We have therefore divided the annotation task into two microtasks, verb sense annotation and argument annotation, and employed crowd intelligence to perform these microtasks. In this paper, we present our approach and the challenges of crowdsourcing the verb sense disambiguation task, and introduce the resource with 5855 annotated verb senses with 83.15% annotator agreement.
Sachin Pawar, Nitin Ramrakhiyani, Swapnil Hingmire and Girish Palshikar. Topics and Label Propagation: Best of Both Worlds for Weakly Supervised Text Classification
Abstract: We propose a Label Propagation based algorithm for weakly supervised text classification. We construct a graph where each document is represented by a node and edge weights represent similarities among the documents. Additionally, we discover underlying topics using Latent Dirichlet Allocation (LDA) and enrich the document graph by including the topics in the form of additional nodes. The edge weights between a topic and a text document represent level of "affinity" between them. Our approach does not require document level labelling, instead it expects manual labels only for topic nodes. This significantly minimizes the level of supervision needed as only a few topics are observed to be enough for achieving sufficiently high accuracy. The Label Propagation Algorithm is employed on this enriched graph to propagate labels among the nodes. Our approach combines the advantages of Label Propagation (through document-document similarities) and Topic Modelling (for minimal but smart supervision). We demonstrate the effectiveness of our approach on various datasets and compare with state-of-the-art weakly supervised text classification approaches.
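A rough picture of the propagation step on the enriched graph, under the usual label propagation recipe: W is a symmetric affinity matrix over document and topic nodes, and only the topic nodes carry seed labels, which are clamped at every iteration. The normalisation and iteration count are assumptions about the paper's exact setup.

```python
import numpy as np

# Hedged sketch of label propagation over a joint document-topic affinity graph.
# Only topic nodes are seeded with manual labels; documents receive labels by propagation.

def propagate_labels(W, seed_labels, seed_mask, num_classes, iters=50):
    n = W.shape[0]
    row_sums = np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    P = W / row_sums                           # row-normalised transition matrix
    Y = np.zeros((n, num_classes))
    Y[seed_mask] = np.eye(num_classes)[seed_labels[seed_mask]]
    F = Y.copy()
    for _ in range(iters):
        F = P.dot(F)                           # spread label mass along weighted edges
        F[seed_mask] = Y[seed_mask]            # clamp the labelled topic nodes
    return F.argmax(axis=1)                    # predicted class for every node
```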
Vladislav Kubon, Marketa Lopatkova and Jiří Mírovský. Analysis of Word Order in Multiple Treebanks
Abstract: This paper gives an overview of the results of automatic analysis of word order in 23 dependency treebanks. These treebanks have been collected in the framework of the HamleDT project, whose main goal is to provide universal annotation for dependency corpora; this also makes it possible to use identical queries for all corpora.
The analysis concentrates on the basic characteristic of word order: the mutual order of the three main constituents, the predicate, the subject and the object. The quantitative analysis is performed separately for main clauses and subordinate clauses, because in many languages subordinate clauses have a slightly different order of words than main clauses.
Marta R. Costa-Jussà and José A. R. Fonollosa. Combining Phrase and Neural-based Machine Translation: what worked and did not
Abstract: Phrase-based machine translation assumes that all words are at the same distance and translates them using feature functions that approximate the probability at different levels. On the other hand, neural machine translation performs a word embedding and translates these word vectors using a neural model. At the moment, both approaches co-exist and are being intensively investigated.

This paper, to the best of our knowledge, is the first work that both compares and combines these two systems by: using the phrase-based output to solve unknown words in the neural machine translation output; using the neural alignment in the phrase-based system; comparing how the popular strategy of pre-reordering affects both systems; and combining both translation outputs. Improvements are achieved in Catalan-to-Spanish and German-to-English.
Begoña Altuna, María Jesús Aranzabe and Arantza Díaz de Ilarraza. Adapting TimeML to Basque: Event annotation
Abstract: In this paper we present an event annotation effort following EusTimeML, a temporal mark-up language for Basque based on TimeML. For this, we first describe events and their main ontological and grammatical features. We base our analysis on Basque grammars and TimeML mark-up language classification of events. Annotation guidelines have been created to address the event information annotation for Basque and an annotation experiment has been conducted. A first round has served to evaluate the preliminary guidelines and decisions on event annotation have been taken according to annotations and inter-annotator agreement results. Then a guideline tuning period has followed. In the second round, we have created a manually-annotated gold standard corpus for event annotation in Basque. Event analysis and annotation experiment are part of a complete temporal information analysis and corpus creation work.
Vigneshwaran Muralidaran and Dipti Sharma. Construction Grammar Approach for Tamil Dependency Parsing
Abstract: Syntactic parsing in NLP is the task of working out the grammatical structure of sentences. Some purely formal approaches to parsing, such as phrase structure grammar and dependency grammar, have been successfully employed for a variety of languages. While phrase structure based constituent analysis is possible for fixed order languages such as English, dependency analysis between grammatical units has been suitable for many free word order languages. All these parsing approaches rely on identifying the linguistic units based on their formal syntactic properties and establishing the relationships between such units in the form of a tree. Instead, we characterize every morphosyntactic unit as a mapping between form and function along the lines of Construction Grammar, and parsing as the identification of dependency relations between such conceptual units. Our approach to parser annotation shows an average MALT LAS score of 82.21% on a Tamil gold annotated corpus of 935 sentences in a five-fold validation experiment.
Agnivo Saha and Sudeshna Sarkar. Enhancing Neural Network based Dependency Parsing Using Morphological Information for Hindi
Abstract: In this paper, we propose a way of incorporating morphological resources for enhancing the performance of neural network based dependency parsing. We conduct our experiments on Hindi, which is a morphologically rich language. We report our results on two well known Hindi dependency parsing datasets. We show an improvement of both Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS) compared to previous state-of-the-art Hindi dependency parsers using only word embeddings, POS tag embeddings and arc-label embeddings as features. Using morphological features, such as number, gender, person and case of words, we achieve an additional improvement of both LAS and UAS. We find that many of the erroneous sentences contain Named Entities. We propose a treatment for Named Entities which further improves both the UAS and LAS of our Hindi dependency parser.
Suman Dowlagar and Radhika Mamidi. A Karaka Dependency based Dialog Act Tagging for Telugu using combination of LM's and HMM
Abstract: The main goal of this paper is to perform dialog act (DA) tagging for a Telugu corpus. Annotation of utterances with dialog acts is necessary to recognize the intent of the speaker in dialog systems. The English language follows strict subject-verb-object (SVO) syntax, whereas Telugu is a free word order language. The n-gram DA tagging methods proposed for English will not work for free word order languages like Telugu. In this paper, we propose a method to perform DA tagging for the Telugu corpus using advanced machine learning techniques combined with karaka dependency relation modifiers. In other words, we use syntactic features obtained from karaka dependencies and apply a combination of language models (LMs) at the utterance level with a Hidden Markov Model (HMM) at the context level for DA tagging. The use of karaka dependencies for free word order languages like Telugu helps in extracting the modifier-modified relationships between words or word clusters for an utterance. These modifier-modified relationships remain fixed even though the word order in an utterance changes, and they appear similar to n-grams. Statistical machine learning methods such as the combination of LMs and HMM are applied to predict the DA for an utterance in a dialog. The proposed method is compared with several baseline tagging algorithms.
Kovida Nelakuditi, Divya Sai Jitta and Radhika Mamidi. Part-of-Speech Tagging for Code mixed English-Telugu Social media data
Abstract: Part-of-Speech tagging is a primary and important step for many Natural Language Processing applications. POS taggers have reported high accuracies on grammatically correct monolingual data. This paper reports work on annotating code-mixed English-Telugu data collected from the social media site Facebook and creating automatic POS taggers for this corpus. POS tagging is visualised as a classification problem and we use different classifiers like SVMs, CRFs and Multinomial Bayes with different combinations of features which capture both the context of the word and its internal structure. We also report our work on experimenting with combining monolingual POS taggers for POS tagging of this code-mixed English-Telugu data.
Lukáš Svoboda and Tomáš Brychcín. New word analogy corpus for exploring embeddings of Czech words
Abstract: Word embedding methods have been proven to be very useful in many NLP (Natural Language Processing) tasks. Much has been investigated about word embeddings of English words and phrases, but little attention has been dedicated to other languages.

Our goal in this paper is to explore the behavior of state-of-the-art word embedding methods on Czech, the language that is characterized by very rich morphology.
We introduce a new corpus for the word analogy task that inspects syntactic, morphosyntactic and semantic properties of Czech words and phrases. We experiment with the Word2Vec and GloVe algorithms and discuss the results on this corpus. The corpus is available for the research community.
Seniz Demir, Murat Tan and Berkay Topcu. Turkish Normalization Lexicon for Social Media
Abstract: Social media has its own ever-growing language and distinct characteristics. Although social media is shown to be of great utility to research studies, the varying quality of written texts degrades the performance of existing NLP tools. Normalization of texts, transforming informal texts into well-written ones, appears to be a reasonable preprocessing step to adapt tools trained on different domains to social media. In this study, we compile the first Turkish normalization lexicon, which sheds light on the kinds of lexical variations observed in social media texts. A graphical representation acquired from a text corpus is used to model contextual similarities between normalization equivalences, and the lexicon is automatically generated by performing random walks on this graph. The underlying framework not only enables different lexicons to be generated from the same corpus but also produces lexicons that are tuned to specific genres. Evaluation studies demonstrated the effectiveness of the induced lexicon in normalizing Turkish texts.
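The lexicon induction step might be pictured as below: repeated short random walks are started at a noisy word form on the contextual-similarity graph, and the well-formed words the walks most often reach become its normalization candidates. The graph construction, walk length and scoring are assumptions; the paper's actual procedure is not reproduced here.

```python
import random
from collections import Counter

# Hedged sketch of normalization-candidate extraction by random walks on a
# contextual-similarity graph. `graph` maps each word form to weighted neighbours;
# how the paper builds this graph from the corpus is not reproduced here.

def random_walk_candidates(graph, noisy_form, canonical_vocab,
                           walks=200, max_steps=5, top_k=3):
    hits = Counter()
    for _ in range(walks):
        node = noisy_form
        for _ in range(max_steps):
            neighbours = graph.get(node)
            if not neighbours:
                break
            words, weights = zip(*neighbours.items())
            node = random.choices(words, weights=weights, k=1)[0]
            if node in canonical_vocab:        # the walk reached a well-formed word
                hits[node] += 1
                break
    return [w for w, _ in hits.most_common(top_k)]  # most frequently reached candidates
```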
Chahira Lhioui, Anis Zouaghi and Mounir Zrigui. Knowledge Extraction with NooJ Using a syntactico-Semantic Approach for the Arabic Utterances Understanding
Abstract: Knowledge extraction is a current research topic with regard to the amelioration of the Natural Language Processing (NLP) field. The need for such improvement through NLP techniques has become both necessary and interesting. Hence, in the general context of the construction of an Arabic touristic corpus equivalent to those of the European projects MEDIA and LUNA, and given the lack of Arabic electronic resources, we had the opportunity to expand the EL-DicAr of Slim Mesfar (2008) with knowledge hinging on Touristic Information and Hotel Reservations (TIHR). Thus, in the same manner as Mesfar (2008), we have developed local grammars for the recognition of essential knowledge in our study field. This task greatly facilitates the subsequent work of understanding user utterances when interacting with a dialogue system.
Sobha Lalitha Devi and Pattabhi Rk Rao. Mining of Social Networks from Literary Texts of Resource Poor Languages
Abstract: We describe our work on the automatic identification of social events and the mining of social networks from literary texts in Tamil. Tamil belongs to the Dravidian language family and is a morphologically rich language. It is a resource-poor language; sophisticated resources for document processing such as parsers and phrase structure tree taggers are not available. In our work we have used shallow parsing for document processing. Conditional Random Fields (CRFs), a machine learning technique, is used for the automatic identification of social events. We have obtained an F-measure of 62% on social event identification. Social networks are mined by forming triads of the actors in the social events. The social networks are evaluated using a graph comparison technique. The system-generated social networks are compared with the gold network. We have obtained a very encouraging similarity score of 0.75.
Braja Gopal Patra, Soumadeep Mazumdar, Dr. Dipankar Das, Paolo Rosso and Sivaji Bandyopadhyay. A Multilevel Approach to Sentiment Analysis of Figurative Language in Twitter
Abstract: A commendable amount of work has been attempted in the field of Sentiment Analysis or Opinion Mining from natural language texts and Twitter texts. One of the main goals in such tasks is to assign polarities (positive or negative) to a piece of text. But, at the same time, one of the important as well as difficult issues is how to assign the degree of positivity or negativity to certain texts. The answer becomes more complex when we perform a similar task on figurative language texts collected from Twitter. Figurative language devices such as irony and sarcasm contain an intentional secondary or extended meaning hidden within the expressions. In this paper, we present a novel approach to identify the degree of sentiment (fine grained on an 11-point scale) for figurative language texts. We used several semantic features such as sentiment and intensifiers, and we introduced sentiment abruptness, which measures the variation of sentiment from positive to negative or vice versa, to train our systems at multiple levels to achieve the best performance of 82.3% in terms of cosine similarity.
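The abstract does not define sentiment abruptness formally; one plausible reading, sketched below, sums the magnitude of polarity flips along a sequence of per-token or per-segment sentiment scores, so that a tweet swinging from clearly positive to clearly negative (typical of irony) receives a high value.

```python
# Hedged sketch of a "sentiment abruptness" feature: it accumulates the size of
# polarity flips along consecutive sentiment scores. This is one plausible reading
# of the feature named in the abstract, not the authors' exact definition.

def sentiment_abruptness(scores):
    """scores: sentiment values in [-1, 1] for consecutive tokens or segments."""
    abruptness = 0.0
    for prev, curr in zip(scores, scores[1:]):
        if prev * curr < 0:                  # polarity flipped between neighbours
            abruptness += abs(curr - prev)   # weight the flip by its size
    return abruptness

print(sentiment_abruptness([0.8, 0.6, -0.7, -0.5]))  # ~1.3: one large positive-to-negative swing
```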
Burcu Can, Ahmet Üstün and Murathan Kurfalı. Turkish PoS Tagging by Reducing Sparsity with Morpheme Tags in Small Datasets
Abstract: Sparsity is one of the major problems in natural language processing tasks. The problem becomes more severe in agglutinating languages that are highly prone to inflection. The sparsity problem may arise at any level of a natural language processing task (i.e. from the syntactic level to the semantic level). In this paper, we deal with sparsity at the syntactic level by adopting morphological features in Turkish part-of-speech tagging, since sparsity is severe in Turkish due to agglutination. We learn morpheme tags (i.e. for both inflectional and derivational morphemes) in Turkish by using conditional random fields (CRF) and we employ these morpheme tags in part-of-speech (PoS) tagging to mitigate the sparsity in a pipeline framework. The results show that using morpheme tags helps alleviate the sparsity, especially in emission probabilities. Our model outperforms other hidden Markov model (HMM) based PoS tagging models for small training datasets in Turkish. We obtain an accuracy of 94.1% in morpheme tagging and 89.2% in PoS tagging on a 5K training dataset.
Vasiliki Simaki, Iosif Mporas and Vasileios Megalooikonomou. Age Identification of Twitter Users: Classification Methods and Sociolinguistic Analysis
Abstract: In this article, we address the problem of age identification of Twitter users based on their online text. We used a set of text mining, sociolinguistic-based and content-related text features, and we evaluated a number of well-known and widely used machine learning algorithms for classification, in order to examine their appropriateness for this task. The experimental results showed that the Random Forest algorithm offered superior performance, achieving an accuracy equal to 61%. We ranked the classification features according to their informativeness, using the ReliefF algorithm, and we analyzed the results in terms of the sociolinguistic principles of age-related linguistic variation.
Borbála Siklósi. Using embedding models for lexical categorization in morphologically rich languages
Abstract: Neural-network-based semantic embedding models are relatively new but popular tools in the field of natural language processing. It has been shown that continuous embedding vectors assigned to words provide an adequate representation of their meaning in the case of English. However, morphologically rich languages have not yet been the subject of experiments with these embedding models. In this paper, we investigate the performance of embedding models for Hungarian, trained on corpora with different levels of preprocessing. The models are evaluated on various lexical categorization tasks. They are used for enriching the lexical database of a morphological analyzer with semantic features automatically extracted from the corpora.
Ladislav Lenc and Pavel Král. Deep Neural Networks for Czech Multi-label Document Classification
Abstract: This paper is focused on automatic multi-label document classification of Czech text documents. Current approaches usually use some pre-processing, which can have negative effects (loss of information, additional implementation work, etc.). Therefore, we would like to omit it by using deep neural nets. This choice was motivated by their successful use in many other machine learning fields. Two different nets are compared: the first one is a standard multi-layer perceptron, while the second is a popular convolutional network. Experiments on a Czech newspaper corpus show that both nets significantly outperform a baseline method which uses a rich set of features with a maximum entropy classifier. We further show that the convolutional network gives the best results.
Md Shad Akhtar, Asif Ekbal and Pushpak Bhattacharyya. Aspect Based Sentiment Analysis: Category Detection and Sentiment Classification for Hindi
Abstract: E-commerce markets in developing countries (e.g. India) have recently witnessed a tremendous amount of user interest. Product reviews are now being generated daily in huge amounts. Classifying the sentiment expressed in a user-generated text/review into certain categories of interest, for example positive or negative, is famously known as sentiment analysis, whereas aspect based sentiment analysis (ABSA) deals with the sentiment classification of a review towards some aspects, attributes or features. In this paper we propose an efficient method for aspect category detection and its sentiment classification for the Hindi language. An aspect category can be seen as the generalization of various features or aspects that have been discussed in a review. To the best of our knowledge, this is the very first attempt at this specific task in Hindi. The key contributions of the present work are two-fold, viz. providing a benchmark platform by creating an annotated dataset for aspect category detection and sentiment classification, and developing supervised approaches for these two tasks that can be treated as baseline models for further research.
Anupam Jamatia, Björn Gambäck and Amitava Das. Collecting and Annotating Indian Social Media Code-Mixed Corpora
Abstract: The pervasiveness of social media in the present digital era has empowered the `netizens' to be more creative and interactive, and to generate content using free language forms that often are closer to spoken language and hence show phenomena previously mainly analysed in speech. One such phenomenon is code-mixing, which occurs when multilingual persons switch freely between the languages they have in common. Code-mixing presents many new challenges for language processing and the paper discusses some of them, taking as a starting point the problems of collecting and annotating three corpora of code-mixed Indian social media text: one corpus with English-Bengali Twitter messages and two corpora containing English-Hindi Twitter and Facebook messages, respectively. We present statistics of these corpora, discuss part-of-speech tagging of the corpora using both a coarse-grained and a fine-grained tag set, and compare their complexity to several other code-mixed corpora based on a Code-Mixing Index.
Silpa Kanneganti, Himani Chaudhry and Dipti Mishra Sharma. Comparative Error Analysis Of Parser Outputs On Telugu Dependency Treebank Data
Abstract: In this paper we present a comparative error analysis of two parsers, MALT and MST, on Telugu Dependency Treebank data. MALT and MST are currently two of the most dominant data-driven dependency parsers. We discuss the performance of both parsers in relation to the Telugu language. We also talk in detail about the algorithmic issues of the parsers as well as the language-specific constraints of Telugu. The purpose is to better understand how to help the parsers deal with complex structures, make sense of implicit language-specific cues and build a more informed treebank.
Malek Lhioui, Kais Haddar and Laurent Romary. Algebraic specification for interoperability between data formats: Application on Arabic lexical data
Abstract: Linguistic data formats (LDF) have become, over the years, more and more complex and heterogeneous due to the diversity of linguistic needs. Communication between these linguistic data formats is impossible since they are increasingly multi-platform and multi-provider. LDF suffer from several communication issues and therefore have to face several interoperability issues in order to guarantee consistency and avoid redundancy. In this interoperability resolution context, we establish a method based on algebraic specifications to resolve interoperability among data formats. The proposed categorical method consists of constructing a unified language. In order to compose this unified language, we apply the co-limit of the algebraic specification category for each data format. With this method, we establish a complex grid between existing data formats allowing the mapping to the unifier using algebraic specification. We then apply our approach to Arabic lexical data and experiment with it using the Specware software.
Łukasz Kobyliński and Witold Kieraś. Part of Speech Tagging for Polish: State of the Art and Future Perspectives
Abstract: In this paper we discuss the intricacies of Polish Part of Speech tagging, present the current state of the art by comparing available taggers in detail, and show the main obstacles that limit the accuracy of Polish POS tagging to 91% of correctly tagged word segments. As this result is lower not only than that of English taggers, but also than those for other highly inflective languages, such as Czech and Slovene, we try to identify the main weaknesses of the taggers, their underlying algorithms, the training data, or difficulties inherent to the language, in order to explain this difference. For this purpose we analyze the errors made individually by each of the available Polish POS taggers, by an ensemble of the taggers, and also by the publicly available, well-known OpenNLP tagger adapted to the Polish tagset. Finally, we propose further steps that should be taken to narrow the gap between Polish and English POS tagging performance.
Marco Dinarelli and Isabelle Tellier. New Recurrent Neural Network Variants for Sequence Labeling
Abstract: In this paper we study different architectures of Recurrent Neural Networks (RNN) for sequence labeling tasks. We propose two new variants of RNN and we compare them to the more traditional RNN architectures of Elman and Jordan. We explain in detail the advantages of the new RNN variants with respect to the Elman and Jordan RNNs. We evaluate all models, new and traditional variants of RNN, on three different tasks: POS tagging of the French Treebank, and two tasks of Spoken Language Understanding (SLU), namely ATIS and MEDIA. The results we obtain show clearly that the new variants of RNN are more effective than the traditional ones.
Shiva Taslimipoor, Ruslan Mitkov, Gloria Corpas Pastor and Afsaneh Fazly. Bilingual Contexts from Comparable Corpora to Mine for Translations of Collocations
Abstract: Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. In addition, finding translations for collocations is different from translating other arbitrary sequences of words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.
Wafa Wali, Bilel Gargouri and Abdelmajid Ben Hamadou. Using sentence semantic similarity to improve LMF standardized Arabic dictionary quality
Abstract: This paper presents a novel algorithm to measure semantic similarity between sentences. It introduces a method that takes into account not only semantic knowledge but also syntactico-semantic knowledge, notably semantic predicates, semantic classes and thematic roles. Firstly, semantic similarity between sentences is derived from word synonymy. Secondly, syntactico-semantic similarity is computed from the common semantic class and thematic role of words in the sentence; indeed, this information is related to the semantic predicate. Finally, the overall similarity is computed as a combination of lexical similarity, semantic similarity and syntactico-semantic similarity using supervised learning. The proposed algorithm is applied to detect information redundancy in the LMF Arabic dictionary, especially in the definitions and examples of lexical entries. Experimental results show that the proposed algorithm reduces redundant information and thus improves the content quality of the dictionary.
Souha Mezghani Hammami and Lamia Hadrich Belguith. Arabic Pronominal Anaphora Resolution Based on New Set of Features
Abstract: In this paper, we present a machine learning approach for Arabic pronominal anaphora resolution. This approach resolves anaphoric pronouns without using linguistic or domain knowledge, nor deep parsing. It relies on features which are widely used in the literature for other languages such as English. In addition, we propose new features specific to the Arabic language. We provide a practical implementation of this approach which has been evaluated on three data sets (a technical manual, newspaper articles and educational texts). The evaluation results show that our approach provides good performance for resolving Arabic pronominal anaphora, with F-measures of 86.2% for the technical manual genre, 84.5% for newspaper articles and 72.1% for the literary texts.
Wafa Neifar, Thierry Hamon, Pierre Zweigenbaum, Mariem Ellouze Khemakhem and Lamia Hadrich Belguith. Adaptation of a term extractor to Arabic specialised texts: first experiments and limits
Abstract: In this paper, we present the adaptation to Modern Standard Arabic of a French and English term extractor. The goal of this work is to reduce the lack of resources and NLP tools for the Arabic language in specialised domains. The adaptation first focuses on describing an extraction process similar to those already defined for French and English, while considering the morpho-syntactic specificities of Arabic. Then, the agglutination phenomenon is taken into account in the term extraction process. The evaluation has been performed on a medical text corpus. Results show that among the 400 maximal candidate terms we analysed, 288 are correct (72%). The term extraction errors are due to Part-of-Speech tagging and non-diacritised texts, but also to agglutination phenomena.
Rihab Bouchlaghem, Aymen Elkhelifi and Rim Faiz. Sentiment analysis in Arabic Twitter posts using supervised methods with combined features
Abstract: With the huge amount of daily generated social network posts, reviews, ratings, recommendations and other forms of online expression, Web 2.0 has turned into a crucial opinion-rich resource. Since others' opinions seem to be determinant when making a decision at both the individual and the organizational level, several research efforts are currently devoted to sentiment analysis.
In this paper, we deal with sentiment analysis of Arabic-written Twitter posts. Our proposed approach leverages a rich set of multilevel features, including syntactic, surface-form, tweet-specific and linguistically motivated features. Sentiment features are also applied, mainly inferred from both novel general-purpose and tweet-specific sentiment lexicons for Arabic words.
Several supervised classification algorithms (Support Vector Machines, Naive Bayes and Random Forest) were applied to our data, focusing on modern standard Arabic (MSA) tweets. The experimental results using the proposed resources and methods indicate high performance levels, given the challenge imposed by the particularities of the Arabic language.
Ameni Bouaziz, Christel Dartigues, Célia Da Costa Pereira and Frédéric Precioso. Introducing Semantics in Short Text Classification
Abstract: To overcome the issues due to the shortness and sparseness of texts in short text classification, an enrichment process is classically proposed: topics (word clusters) are, for example, extracted from external sources of knowledge using Latent Dirichlet Allocation, and all the words associated with topics that encompass the short text's words are added to the initial short text content. Here, we propose an explicit representation of a two-level enrichment method in which the enrichment may be considered either with respect to each single word in the text or with respect to the global semantic meaning of the entire short text. We demonstrate the validity of our enrichment method with classifiers such as Random Forest, MaxEnt, SVM and Naive Bayes.
Rakesh Verma, Vasanthi Vuppuluri, Arjun Mukherjee, An Nguyen, Ghita Mammar, Reed Armstrong and Shahryar Baki. Mining the Web for Collocations: IR Models of Term Associations
Abstract: Automatic collocation recognition has attracted considerable attention from researchers in diverse fields, since it is one of the fundamental tasks in NLP and feeds into several other tasks (e.g., parsing, idioms, summarization, etc.). Despite this attention, the problem has remained a "daunting challenge." As others have observed before, existing approaches based on frequencies and statistical information have limitations. An even bigger problem is that they are restricted to bigrams, and as yet there is no consensus on how to extend them to trigrams and higher-order n-grams. This paper presents encouraging results based on novel angles of general collocation extraction leveraging statistics and the Web. In contrast to existing work, our algorithms are applicable to n-grams of arbitrary order and are directional. Experiments across several datasets, including a gold-standard benchmark dataset that we created, demonstrate the effectiveness of the proposed methods.
Shonosuke Ishiwatari, Naoki Yoshinaga, Masashi Toyoda and Masaru Kitsuregawa. Instant Translation-Model Adaptation by Projecting Word Semantic Representations
Abstract: In statistical machine translation (SMT), it is well known that a difference between the domains of the training and test data will result in poor translations. Though there have been many studies focusing on domain adaptation of language models and translation models, most of them require supervised in-domain language resources such as parallel corpora for training and tuning the models. The necessity of supervised data has made such methods difficult to adapt to practical SMT systems. We propose a novel method that adapts translation models without in-domain parallel corpora. Our method infers the translation candidates of out-of-vocabulary words by projecting their semantic representations into the semantic space of the target language. In our experiment on out-of-domain translation from Japanese to English, our method gave an improvement of 0.5-1.5 BLEU points.
Pierre Marchal and Thierry Poibeau. A Continuous-based Model of Lexical Acquisition
Abstract: The automatic acquisition of verbal constructions is an important issue for natural language processing. In this paper, we have a closer look at two fundamental aspects of the description of the verb: the notion of lexical item and the distinction between arguments and adjuncts. Following up on studies in natural language processing and linguistics, we embrace the double hypothesis:
- i) of a continuum between ambiguity and vagueness,
- and ii) of a continuum between arguments and adjuncts.
We provide a complete approach to lexical knowledge acquisition of verbal constructions from an untagged news corpus. The approach is evaluated through the analysis of a sample of the 7,000 Japanese verbs automatically described by the system.
João Sequeira, Teresa Gonçalves, Paulo Quaresma, Amália Mendes and Iris Hendrickx. Using syntactic and semantic features for classifying modal values in the Portuguese language
Abstract: This paper presents a study in a field poorly explored for the Portuguese language: modality and its automatic tagging. Our main goal was to find a set of attributes for the creation of automatic taggers with improved performance over the bag-of-words (bow) approach. Performance was measured using precision, recall and F1. Because it is a relatively unexplored field, the study covers the creation of the corpus (composed of eleven verbs), the use of a parser to extract syntactic and semantic information from the sentences, and a machine learning approach to identify modality values. Based on three different sets of attributes, from the trigger itself, the trigger's path (from the parse tree) and the context, the system creates a tagger for each verb, achieving (for almost every verb) an improvement in F1 when compared to the traditional bow approach.
Duc-Thuan Vo and Ebrahim Bagheri. Clause-based Open Information Extraction with Grammatical Structure Reformation
Abstract: Within the context of Open Information Extraction (OIE), relation extraction is oriented toward identifying a variety of relation phrases and their arguments in arbitrary sentences. In the plethora of research that focuses on the use of syntactic and dependency parsing for the purpose of detecting relations, there has been increasing evidence of incoherent and uninformative extractions. The extracted relations have even been erroneous at times and failed to provide a meaningful interpretation. In this paper, we propose refinements to the grammatical structures obtained from syntactic and dependency parsing. To this end, we use the English clause structure and clause types in an effort to generate propositions that can be deemed extractable relations. As a result, our approach outperforms existing state-of-the-art systems such as ReVerb, OLLIE and ClausIE. In particular, our work shows improvements of up to 15% in comparison with the aforementioned OIE systems on three benchmark datasets.
Hanen Ameur, Salma Jamoussi and Abdelmajid Ben Hamadou. A New Emotional Vector Representation For Sentiment Analysis
Abstract: With the advent of Web 2.0, social networks (like Twitter and Facebook) offer users a different writing style that is close to SMS language. This language is characterized by the presence of emotion symbols (emoticons, acronyms and exclamation words). These often manifest the sentiments expressed in the comments and bring an important contextual value to determining the general sentiment of the text. Moreover, these emotion symbols are considered multilingual and universal symbols. This fact has inspired us to do research in the area of automatic sentiment classification. In this paper, we present a new vector representation of text which can faithfully convey the sentimental orientation of text, based on emotion symbols. We use Support Vector Machines to show that our emotional vector representation significantly improves accuracy for the sentiment analysis problem compared with well-known bag-of-words vector representations, using a dataset derived from Facebook.
Duc-Thuan Vo and Ebrahim Bagheri. Relation Extraction using Clause Patterns and Self-Training
Abstract: Bootstrapping techniques utilized for relation extraction have been shown to be effective in terms of interactively expanding a set of initial relations. Such tasks are primarily carried out through semi-supervised classification approaches. Considering that choosing the most efficient seeds is pivotal to the success of the bootstrapping process, these methods depend on a reliable set of seeds or rules that incorporate domain knowledge. In this paper, we propose clause-based pattern extraction with self-training for unsupervised relation extraction. Accordingly, we extract patterns based on a clause-based approach that strives to consider all possible clause types that may contain a relation. The proposed self-training algorithm relies on the clause-based approach to extract a small set of seed instances in order to identify and derive new patterns. A fundamental distinction between our proposed method and other prominent approaches is that we automatically and iteratively extract seeds based on high-confidence patterns that are identified through the clause-based approach. In our experiments, we show that our approach improves upon the performance of current state-of-the-art systems such as DARE by up to 26.88% and 14.13% in F-measure on the Nobel and MUC-6 datasets, respectively.
Bahar Karaoğlan, Tarik Kisla and Senem Kumova Metin. Description of Turkish Paraphrase Corpus Structure and Generation Method
Abstract: The very first step, and the most tedious one, before starting natural language processing tasks is to develop a corpus if there isn't one. Since this requires a long time and a lot of human effort, it is desirable to make the corpus as resourceful as possible: rich in coverage, flexible, multipurpose and expandable. Here we describe the steps we took in the development of a Turkish paraphrase corpus, the factors we considered, the problems we faced and how we dealt with them. Currently our corpus contains nearly 4,000 sentences with a ratio of 60% paraphrase and 40% non-paraphrase sentence pairs. The sentence pairs are annotated on a 5-point scale: paraphrase, encapsulating, encapsulated, non-paraphrase and opposite. The corpus is formulated in a database structure integrated with a Turkish dictionary. The structure consists of 7 relational database tables set up for the documents, the sentences, similarity between sentences, the words, the relations between the words (e.g. synonym, antonym), the dictionary and word meanings. The sources we have used so far are news texts from the Bilcon 2005 corpus, a set of professionally translated sentence pairs from the MSRP corpus, and multiple Turkish translations from different languages involved in the Tatoeba corpus. We hope to reach at least 6,000 sentences including human-generated paraphrases.
Prasha Shrestha, Arjun Mukherjee and Thamar Solorio. Large Scale Authorship Attribution of Online Reviews
Abstract: Traditional authorship attribution methods focus on the scenario of a limited number of authors writing long pieces of text. These methods are engineered to work on a small number of authors and generally do not scale well to a corpus of online reviews where the candidate set of authors is large. However, attribution of online reviews is important as they are replete with deception and spam. We evaluate a new large-scale approach for predicting authorship via the task of verification on online reviews. Our evaluation considers the largest number of possible candidate authors seen to date. Our results show that multiple verification models can be successfully combined to associate reviews with their correct author more than 78% of the time. We propose that our approach can be used to slow down or deter deceptive reviews in the wild.
Nabil Khoufi, Chafik Aloulou and Lamia Hadrich Belguith. A Corpus Based System for Language Resource Construction and Syntactic Analysis: Case of Arabic
Abstract: Linguistic resources such as grammars or dictionaries are very important to any natural language processing application. Unfortunately, the manual construction of these resources is laborious and time-consuming. The use of annotated corpora as a knowledge database might be a solution for the fast construction of a grammar for a given language. In this paper, we present our system for automatically inducing a syntactic grammar from an Arabic annotated corpus (the Penn Arabic Treebank), a probabilistic context-free grammar in our case. The developed system allows the user to build a probabilistic context-free grammar from the syntactic trees of the annotated corpus. It also offers the possibility of parsing Arabic sentences using the generated resource. Finally, we present evaluation results.
Saket Kumar and Omar El Ariss. Word Sense Disambiguation Using Swarm Intelligence: A Bee Colony Optimization Approach
Abstract: Word Sense Disambiguation (WSD) is the problem of figuring out the correct sense of a word in a given context. We introduce an unsupervised knowledge-source approach for word sense disambiguation using a bee colony optimization algorithm that is constructive in nature. Our algorithm, using WordNet, optimizes the search space by globally disambiguating a document, constructively determining the sense of a word using the previously disambiguated words. Heuristic methods for unsupervised word sense disambiguation mostly give little importance to the context words while determining the sense of the target word. In this paper, we put more emphasis on the context and the part of speech of a word while determining its correct sense. We make use of a modified simplified Lesk algorithm as a relatedness measure. Our approach is then compared with recent unsupervised heuristics such as ant colony optimization, genetic algorithms, and simulated annealing, and shows promising results. We finally introduce a voting strategy to our algorithm that further improves our results.
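As an illustration of the relatedness measure mentioned above, the following is a minimal sketch of plain simplified Lesk (gloss-context overlap) over NLTK's WordNet; the paper's modified measure, part-of-speech weighting and bee colony search are not reproduced here:

    # Simplified Lesk: pick the sense whose gloss/examples overlap most with the context
    # (illustrative only; not the paper's modified measure).
    from nltk.corpus import wordnet as wn

    def simplified_lesk(word, context_words, pos=None):
        context = {w.lower() for w in context_words}
        best_sense, best_overlap = None, -1
        for synset in wn.synsets(word, pos=pos):
            gloss = set(synset.definition().lower().split())
            for example in synset.examples():
                gloss.update(example.lower().split())
            overlap = len(gloss & context)
            if overlap > best_overlap:
                best_sense, best_overlap = synset, overlap
        return best_sense

    # e.g. simplified_lesk("bank", "she deposited the money in her bank account".split())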
Miran Shahine and Mohamed Sakre. Hybrid Feature Selection Approach for Arabic Named Entity Recognition
Abstract: The Named Entity Recognition (NER) task has drawn great attention in the research field in the last decade, as it plays an important role in Natural Language Processing (NLP) applications. In this paper, we investigate the effectiveness of a hybrid feature subset selection approach for Arabic Named Entity Recognition (NER), which combines a filtering approach with an optimized genetic algorithm. The genetic algorithm is parallelized at the level of the fitness computation in order to reduce the time needed to search for the most appropriate and informative combination of features for classification. A Support Vector Machine (SVM) is used as the machine learning classifier to evaluate the accuracy of Arabic NER with the proposed approach. ANER and AQMAR are the datasets used in our experiments, represented by both language-independent and language-specific features for Arabic NER. Experimental results show the effectiveness of the feature subsets obtained by the proposed hybrid approach, which are smaller and more effective than the original feature set and lead to a considerable increase in classification accuracy.
Ahmad Musleh, Nadir Durrani, Irina Temnikova, Preslav Nakov, Stephan Vogel and Osama Alsaad. Enabling Medical Translation for Low-Resource Languages
Abstract: We present research towards bridging the language gap between migrant workers in Qatar and medical staff. In particular, we present the first steps towards the development of a real-world Hindi-English machine translation system for doctor-patient communication. As this is a low-resource language pair, especially for speech and for the medical domain, our initial focus has been on gathering suitable training data from various sources. We applied a variety of methods ranging from fully automatic extraction from the Web to manual annotation of test data. Moreover, we developed a method for automatically augmenting the training data with synthetically generated variants, which yielded a very sizable improvement of up to 1.66 BLEU points absolute.
Thierry Hamon and Natalia Grabar. Adaptation of cross-lingual transfer methods for the building of medical terminology in Ukrainian
Abstract: The increasing availability of parallel bilingual corpora and of automatic methods and tools makes it possible to build linguistic and terminological resources for low-resourced languages. We propose to exploit corpora available in several languages for building bilingual and trilingual terminologies. Typically, terminological information extracted in better-resourced languages is associated with the corresponding units in lower-resourced languages thanks to multilingual transfer. The method is applied to corpora involving the Ukrainian language. According to the experiments, the precision of term extraction varies between 0.454 and 0.966, while the quality of the interlingual relations varies between 0.309 and 0.965. The resource built contains 4,588 medical terms in Ukrainian and their 34,267 relations with French and English terms.
Fahad Al-Obaidli, Stephen Cox and Preslav Nakov. Bi-Text Alignment of Movie Subtitles for Spoken English-Arabic Statistical Machine Translation
Abstract: We describe efforts towards obtaining better resources for English-Arabic machine translation of spoken text. In particular, we look at movie subtitles as a unique, rich resource, as subtitles in one language often get translated into other languages. Movie subtitles are not new as a resource and have been explored in previous research; however, here we create a much larger bi-text (the biggest to date), and we further generate better-quality alignment for it. Given the subtitles for the same movie in different languages, a key problem is how to align them at the fragment level. Typically, this is done using length-based alignment, but for movie subtitles there is also time information. Here we exploit this information to develop an original algorithm that outperforms the current best subtitle alignment tool, subalign. The evaluation results show that adding our bi-text to the IWSLT training bi-text yields an improvement of over two BLEU points absolute.
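The abstract does not give the algorithm's details, but the general idea of exploiting time information can be illustrated by a rough sketch that greedily pairs fragments by maximal temporal overlap, assuming each subtitle is a (start, end, text) triple in seconds; this is not the authors' algorithm:

    # Greedy time-overlap matching of subtitle fragments (illustrative only).
    def overlap(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    def align_by_time(src_subs, tgt_subs, min_overlap=0.5):
        pairs = []
        for s in src_subs:
            best = max(tgt_subs, key=lambda t: overlap(s, t), default=None)
            if best is not None and overlap(s, best) >= min_overlap:
                pairs.append((s[2], best[2]))   # keep the aligned text fragments
        return pairs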
Hiram Calvo and Omar Juárez Gambino. Cascading Classifiers for Twitter Sentiment Analysis with Emotion Lexicons
Abstract: Many different attempts have been made to determine sentiment polarity in tweets, using emotion lexicons and different NLP techniques with machine learning. In this paper we focus on using emotion lexicons and machine learning only, avoiding the use of additional NLP techniques. We present a scheme that is able to outperform other systems that use both natural language processing and distributional semantics. Our proposal consists of using a cascading classifier on lexicon features to improve accuracy. We evaluate our results on the TASS 2015 corpus, reaching an accuracy only 0.07 below the top-ranked system for task 1, 3 levels, whole test corpus. The cascading method we implemented consists of using the results of a first-stage classification with Multinomial Naïve Bayes as additional columns for a second-stage classification using a Naïve Bayes Tree classifier with feature selection. We tested at least 30 different classifiers, and this combination yielded the best results.
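A minimal sketch of the cascading idea, assuming a dense matrix of non-negative lexicon features; scikit-learn has no Naive Bayes Tree, so a plain decision tree stands in for the second stage, and the feature selection step is omitted:

    # Stage 1 posteriors are appended as extra columns for the stage 2 classifier
    # (DecisionTreeClassifier is only a stand-in for the paper's NBTree).
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier

    def train_cascade(X_lexicon, y):
        stage1 = MultinomialNB().fit(X_lexicon, y)
        X_augmented = np.hstack([X_lexicon, stage1.predict_proba(X_lexicon)])
        stage2 = DecisionTreeClassifier().fit(X_augmented, y)
        return stage1, stage2

    def predict_cascade(stage1, stage2, X_lexicon):
        X_augmented = np.hstack([X_lexicon, stage1.predict_proba(X_lexicon)])
        return stage2.predict(X_augmented)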
Adèle Désoyer, Frédéric Landragin, Isabelle Tellier, Anaïs Lefeuvre, Jean-Yves Antoine and Marco Dinarelli. Coreference Resolution for French Oral Data: Machine Learning Experiments with ANCOR
Abstract: We present CROC (Coreference Resolution for Oral Corpus), the first machine learning system for coreference resolution in French. One specific aspect of the system is that it has been trained on data that come exclusively from transcribed speech, namely ANCOR (ANaphora and Coreference in ORal corpus), the first large-scale French corpus with anaphoric relation annotations. In its current state, the CROC system requires pre-annotated mentions. We detail the features used for the learning algorithms, and we present a set of experiments with these features. The scores we obtain are close to those of state-of-the-art systems for written English.
Alawya Alawami. Aspect Terms Extraction of Arabic Dialects for Opinion Mining Using Conditional Random Fields
Abstract: While English opinion mining has been studied extensively, Arabic fine-grained opinion mining has not received much attention. This paper looks at employing conditional random fields as a supervised method to extract aspect terms, which can then be employed for fine-grained opinion mining. Despite the lack of Arabic dialect NLP tools, which limited the amount of improvement that could be added to the algorithm, our analysis shows a level of precision and recall comparable to what has been achieved for English.
Svetlana Toldova and Max Ionov. Features for discourse-new referent detection in Russian
Abstract: The paper concerns discourse-new mention detection in Russian. Detecting the mention of an entity newly introduced into discourse can be helpful for different NLP applications such as coreference resolution, protagonist identification, summarization and various information extraction tasks. In our work, we are dealing with the Russian language, which has no grammatical devices, such as articles, for overtly marking a newly introduced referent. Our aim is to check the impact of various features on this task. The focus is on the specific devices for introducing a new discourse-prominent referent in Russian described in theoretical studies. We conduct a pilot study of the features' impact and provide a series of experiments on detecting the first mention of a referent in a non-singleton coreference chain, drawing on linguistic insights about how the introduction of a prominent entity into discourse is affected by structural, morphological and lexical features.
Orna Almogi, Lena Dankin, Nachum Dershowitz, Yair Hoffman, Dimitri Pauls, Dorji Wangchuk and Lior Wolf. Stemming and Segmentation for Classical Tibetan
Abstract: Tibetan is a monosyllabic language for which computerized language tools are largely lacking. We describe the development of a syllable stemmer for Tibetan. The stemmer is based on a set of rules that help identify the vowel, the core letter of the syllable, and then the other parts. We demonstrate the value of the stemmer with two applications: determining the semantic similarity of two syllables and word segmentation. Our stemmer is being made available as an open source tool and the word segmentation tool is freely available as an online tool.
Rajiv Bajpai, Danyuan Ho, Soujanya Poria and Erik Cambria. Singlish Sentiment Analysis
Abstract: We present Singlish SenticNet, a concept-level knowledge base for sentiment analysis that associates multiword expressions with a set of emotion labels and a polarity value. Unlike many other sentiment analysis resources, SenticNet is not built by manually labeling pieces of knowledge coming from general NLP resources such as WordNet or DBPedia. Instead, it is automatically constructed by applying graph-mining and multidimensional scaling techniques to affective common-sense knowledge collected from three different sources. This knowledge is represented redundantly at three levels: semantic network, matrix, and vector space. Subsequently, semantics and sentics are calculated through the ensemble application of spreading activation, neural networks and an emotion categorization model. The SenticNet construction framework merges all these techniques and models together in order to generate a knowledge base of common-sense concepts and a set of semantics, sentics, and polarity for each of them. The current version of the knowledge base contains 300 concepts.

CyS

Chenggang Mi, Yating Yang, Xi Zhou, Lei Wang, Xiao Li and Tonghai Jiang. Exploiting Bishun to Predict the Pronunciation of Chinese
Abstract: Learning to pronounce Chinese characters is usually considered a very hard part of studying Chinese for foreigners. At the beginning, Chinese learners must memorize thousands of Chinese characters, including their pronunciation, meanings, Bishun (order of strokes), etc., which is very time-consuming and boring. In this paper, we propose a novel method based on a translation model to predict the pronunciation of Chinese characters automatically. We first convert each Chinese character into its Bishun; then, we train the pronunciation prediction model (a translation model) on Bishun and the corresponding Pinyin sequences. To make our model practical, we also introduce some error-tolerant strategies. Experimental results show that our method can predict the pronunciation of Chinese characters effectively.
Sandipan Dandapat and Andy Way. Improved Named Entity Recognition using Machine Translation-based Cross-lingual Information
Abstract: In this paper, we describe a technique to improve named entity recognition in a resource-poor language (Hindi) by using cross-lingual information. We use an on-line machine translation system and a separate word alignment phase to find the projection of each Hindi word into the translated English sentence. We estimate the cross-lingual features using an English named entity recognizer and the alignment information. We use these cross-lingual features in a support vector machine-based classifier. The use of cross-lingual features improves the F1 score by 2.1 points absolute (2.9% relative) over a good-performing baseline.
Kwang-Yong Jeong and Kyung-Soon Lee. Follower Behavior Analysis via Influential Transmitters on Social Issues in Twitter
Abstract: A follower can be classified as a supporter, non-supporter, or neutral according to the follower's intention towards a target user. Even if a follower is identified as a supporter, their opinions may not be positive towards the target user. In this paper, we propose a method to classify a follower as a supporter, non-supporter or neutral. To expand the information about a follower, influential transmitters who support a target user are detected using a modified HITS algorithm. To detect a follower's specific opinions, social issues are extracted from the tweets of influential transmitters. The thread tweets are clustered into social issues using Latent Dirichlet Allocation. Then, sentiment analysis is conducted on the clusters of a follower. To see the effectiveness of our method, a Korean tweet collection was constructed. As a result, we found that many supporting followers show opposite opinions depending on particular issues.
Tomas Hercig, Tomas Brychcin, Lukas Svoboda, Michal Konkol and Josef Steinberger. Unsupervised methods to improve aspect-based sentiment analysis in Czech
Abstract: We investigate the effectiveness of several unsupervised methods for latent semantics discovery as features for aspect-based sentiment analysis (ABSA). We use the shared task definition from SemEval 2014.

In our experiments we use labeled and unlabeled corpora within the restaurants domain for two languages: Czech and English. We show that our models improve the ABSA performance and prove that our approach is worth investigating. Moreover, we achieve new state-of-the-art results for Czech.

Another important contribution of our work is that we created two new Czech corpora within the restaurant domain for the ABSA task: one labeled for supervised training, and the other (considerably larger) unlabeled for unsupervised training. The corpora are available to the research community.
Nouha Othman and Rim Faiz. Question Answering Passage Retrieval and Re-ranking Using N-grams and SVM
Abstract: Over the last few years, with the meteoric rise of Information Technology, Question Answering (QA) has attracted more attention and has been extensively explored. Indeed, several QA systems are based on a passage retrieval engine which aims to deliver a set of passages that are most likely to contain a relevant response to a question stated in natural language. In an attempt to enhance the performance of existing QA systems by increasing the number of generated correct answers and ensuring their relevance, we propose a novel approach for retrieving and re-ranking passages based on n-grams and SVM models. Our passage retrieval module relies on the dependency degree of the question's n-gram words in the passage. The retrieved passages are then re-ranked using an SVM-based model incorporating various lexical, syntactic and semantic similarity measures. We validate our approach through the development of the PreRank system, which has outperformed other existing ones.
Ahmed Nabhan and Khaled Shaalan. A Graph-based Approach to Text Genres Analysis
Abstract: Genre characterization can be achieved by a variety of methods that employ lexical, syntactic, and presentation features of text to highlight key domain differences and stylistic preferences. However, these traditional methods cannot uncover some important macro-structural features that are embedded in text. Representing text as a word graph can enable effective frameworks for analysis and identification of key topological features that characterize text genres. In this study, we investigated graph features such as clustering coefficients, centralization, diameter, and average path lengths for eight text genres. The findings indicated key patterns that vary from one genre to another according to the stylistic differences in the text. Furthermore, evidence of subgenres was found through graph features such as the number of connected components and node heterogeneity.
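A rough sketch of how such topological features might be computed from an undirected word graph with networkx; the paper's exact graph construction and feature definitions are not given here, so the heterogeneity and centralization formulas below are common textbook choices rather than the authors':

    # Topological features of a word graph (illustrative approximations).
    import networkx as nx

    def graph_features(G):
        degrees = [d for _, d in G.degree()]
        n = G.number_of_nodes()
        mean_k = sum(degrees) / n
        feats = {
            "avg_clustering": nx.average_clustering(G),
            "n_components": nx.number_connected_components(G),
            # <k^2>/<k>^2 as a simple node (degree) heterogeneity index
            "degree_heterogeneity": (sum(d * d for d in degrees) / n) / (mean_k ** 2),
            # Freeman-style degree centralization
            "centralization": sum(max(degrees) - d for d in degrees) / ((n - 1) * (n - 2)),
        }
        # Diameter and path length need a connected graph: use the giant component.
        giant = G.subgraph(max(nx.connected_components(G), key=len))
        feats["diameter"] = nx.diameter(giant)
        feats["avg_path_length"] = nx.average_shortest_path_length(giant)
        return feats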
Yu Zhao, Sheng Gao, Patrick Gallinari and Jun Guo. A Novel Multimodal Deep Neural Network Framework for Extending Knowledge Base
Abstract: A knowledge base is a very important database for knowledge management and is very useful for question answering, query expansion and other AI tasks. However, due to the fast-growing information on the web, and because not all common knowledge expressed in text is explicit, knowledge bases always suffer from incompleteness. Recently, many researchers have been trying to solve the problem with link prediction using only the existing knowledge base; however, this is just knowledge base completion, without adding new entities that emerge from unstructured text and are not in the existing knowledge base. In this paper, we propose a multimodal deep neural network framework that tries to learn new entities from unstructured text and to extend the knowledge base. Experiments demonstrate its excellent performance.
Cătălina Mărănduc, Cenel-Augusto Perez and Radu Simionescu. Social Media – Processing and Discourse Analysis
Abstract: In order to obtain a balanced treebank for Romanian, a sub-corpus of 2,500 sentences illustrating contemporary social media language has been added to the Dependency Treebank for Romanian. The texts were taken from online chat. The subject of this paper is the processing of these non-standard texts with a hybrid POS tagger for Romanian and with a MALT parser, both trained on standard language data written in different styles of communication. We obtained results comparable with those of tools for other languages trained on similar corpora. In addition, we extend our resource, the Dependency Treebank for Romanian, not only in size, by doubling its dimension, but also by adding additional layers of annotation. A semantic layer and a discursive annotation will be added, allowing the study of discursive and conversational particularities. The paper contains examples illustrating discursive particularities of chat communication.
Jin-Xia Huang, Kyung Soon Lee, Key-Sun Choi and Young-Kil Kim. Extract Reliable Relations from Wikipedia Texts for Practical Ontology Construction
Abstract: A feature-based relation classification approach is presented in this paper to extract relation candidates from Wikipedia texts. A probabilistic and a semantic relatedness feature are employed together with other linguistic information for this purpose. The experiments show that relation classification using the proposed relatedness features with surface information like words and part-of-speech tags is competitive with, or even outperforms, classification using deep syntactic information. Meanwhile, an approach is proposed to distinguish reliable relation candidates from others, so that these reliable results can be accepted for knowledge building without human verification. The experiments show that, with the relation classification approach presented in this paper, more than 40% of the classification results are reliable, which means that at least 40% of the human and time costs can be saved in practice.
Kunal Chakma and Amitava Das. CMIR: A Corpus for Evaluation of Code Mixed Information Retrieval of Hindi-English Tweets
Abstract: Social media has become almost ubiquitous in present times. Such proliferation creates a need for automatic information processing and poses various challenges. Social media content is mostly informal in nature. Additionally, in Indian social media, users often prefer to use Roman transliterations of their native languages with English embedding. Therefore, Information Retrieval (IR) on such Indian social media data is a challenging and difficult task when the documents and the queries are a mixture of two or more languages written in either the native scripts and/or in Roman transliterated form. In this paper, we emphasize issues related to Information Retrieval (IR) for code-mixed Indian social media texts, particularly texts from Twitter. We describe a corpus collection process, report limitations of available state-of-the-art IR systems on such data, and formalize the problem of Code-Mixed Information Retrieval (CMIR) on informal texts.
Katrin Prikrylova, Vladislav Kubon and Katerina Veselovska. The Role of Conjunctions in Adjective Polarity Analysis in Czech
Abstract: Adjectives very often determine the polarity of an expression. This paper studies the role of conjunctions in the analysis of polarity. The study is performed for the Czech language on the basis of an existing algorithm for English. The algorithm has been modified in order to reflect the differences between these two typologically different languages. The results of the original and the modified algorithm are compared and discussed. The paper also contains a thorough discussion of exceptions and special cases, supported by a number of examples from a large corpus of Czech.
Braja Gopal Patra, Dr. Dipankar Das and Sivaji Bandyopadhyay. Multimodal Mood Classification Framework for Hindi Songs
Abstract: Several aspects of music information retrieval, including music mood classification, have gained huge importance during the last decade. Mainstream research on music mood classification has been conducted mainly on Western music, based on audio, on lyrics, or on a combination of both. Fortunately, due to the rapid growth of digitized resources in the context of Indian music, research on Hindi music mood classification has started in full swing, though solely on audio data. Therefore, in the present work, we propose a mood taxonomy suitable for Hindi songs and describe the process of developing a multimodal dataset (including features of both audio and lyrics) for classifying the moods of Hindi songs. We observed differences in mood for several instances of Hindi songs when annotating the audio of such songs in contrast to their corresponding lyrics. Finally, we developed a mood classification framework for Hindi songs that consists of three systems based on the features of audio, lyrics, and both. The audio- and lyrics-based mood classification systems achieved maximum F-measures of 58.2% and 55.1% respectively, and the multimodal system (audio and lyrics) achieved a maximum F-measure of 68.6%.
Paheli Bhattacharya, Pawan Goyal and Sudeshna Sarkar. Using Word Embeddings for Query Translation for Hindi to English Cross Language Information Retrieval
Abstract: Cross-Language Information Retrieval (CLIR) has become an important problem to solve in recent years due to the growth of content in multiple languages on the Web. One of the standard methods is to use query translation from the source to the target language. In this paper, we propose an approach based on word embeddings, a method that captures contextual clues for a particular word 'w' in the source language and gives as translations those words that occur in a similar environment to 'w' in the target language.
Once we obtain the word embeddings of the source and target language pairs, we learn a projection from source to target word embeddings, making use of a dictionary with word translation pairs. We then propose various methods of query translation and aggregation. The advantage of this approach is that it does not require the corpora to be aligned (which is difficult to obtain for resource-scarce languages); a dictionary with word translation pairs is enough to train the word vectors for translation.

We experiment with the Forum for Information Retrieval and Evaluation (FIRE) 2008 and 2012 datasets for Hindi to English CLIR, training word vectors for each of these languages and performing the query translation and retrieval task. The proposed word embedding based approach outperforms the basic dictionary-based approach by 67%, and when the word embeddings are combined with the dictionary, the hybrid approach beats the baseline dictionary-based method by 76%. It outperforms the English monolingual baseline by 16% when combined with the translations obtained from Google Translate and the dictionary. One interesting observation is that our approach could yield semantically related translations for most of the queries.
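A minimal sketch of the projection step under simple assumptions: source and target embeddings are given as word-to-vector dictionaries, the mapping is learned by least squares on seed dictionary pairs (in the spirit of a linear translation matrix), and a query word is translated by the nearest neighbours of its projected vector; the paper's query aggregation strategies are not shown:

    # Learn a linear map from source (e.g. Hindi) to target (e.g. English) embedding space
    # (illustrative; not the authors' exact training setup).
    import numpy as np

    def learn_projection(src_vecs, tgt_vecs, seed_pairs):
        X = np.vstack([src_vecs[s] for s, t in seed_pairs])
        Y = np.vstack([tgt_vecs[t] for s, t in seed_pairs])
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solves X @ W ~= Y
        return W

    def translate(word, W, src_vecs, tgt_vecs, k=5):
        v = src_vecs[word] @ W
        words = list(tgt_vecs)
        M = np.vstack([tgt_vecs[w] for w in words])
        sims = (M @ v) / (np.linalg.norm(M, axis=1) * np.linalg.norm(v) + 1e-9)
        return [words[i] for i in np.argsort(-sims)[:k]]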
Irvin Vargas-Campos and Fernando Alva-Manchego. SciEsp: Structural Analysis of Abstracts Written in Spanish
Abstract: SciEsp is a tool for scientific writing in Spanish. Its objective is to help students when writing abstracts of scientific texts, such as a thesis or a dissertation. The tool identifies the different components of an abstract's structure according to the guidelines of "good writing" proposed in the literature. Each sentence in the abstract is classified into one of six rhetorical categories (background, gap, purpose, methodology, result, or conclusion), warning the writer of a possible missing component of the "optimal" structure. We manually annotated a corpus of abstracts from computer science theses and dissertations and used it to train a Naive Bayes classifier that achieves an F-measure of 0.65. We expect SciEsp to become a starting point for further projects in the area of supporting technologies for scientific writing in Spanish.
Sandeep Kumar Dash, Dr. Partha Pakray and Alexander Gelbukh. Natural Language Text to Virtual Action
Abstract: This paper describes our proposed framework of research for virtualizing documented physiotherapy instructions. Our approach will try to bridge the gap between human understanding and written manuals of physiotherapy instructions. Natural Language Processing techniques involving semantic and spatial information processing are important in this approach. We also explain the physiotherapeutic considerations that we will take up in the initial phase of the research.
Joe Cheri Ross, Aditya Joshi and Pushpak Bhattacharyya. A Framework That Uses the Web for Named Entity Class Identification: Case Study for Indian Classical Music Forums
Abstract: Identification of the named entity (NE) class (semantic class) is crucial for NLP problems like coreference resolution, where semantic compatibility between the entity mentions is imperative to the coreference decision. Short and noisy text containing the entity makes it challenging to extract the NE class of the entity through the context. We introduce a framework for named entity class identification for a given entity, using the web, when the entity boundaries are known. The proposed framework will be beneficial for specialized domains where data and class label challenges exist. We demonstrate the benefit of our framework through a case study of Indian classical music forums. Apart from person and location, included in standard semantic classes, here we also consider raga, song, instrument and music concept. Our baseline approach follows a heuristic-based method making use of Freebase, a structured web repository. The search engine based approaches acquire context from the web for an entity and perform named entity class identification. This approach shows improvement compared to the baseline performance, and it is further improved with the hierarchical classification we introduce. In summary, our framework is a first-of-its-kind validation of the viability of the web for NE class identification.
Malek Hajjem and Chiraz Latiri. Thematic Clustering For Comparable Tweet corpora Building
Abstract: This paper deals with comparable corpus building from Twitter. In particular, we focus on the relevance evaluation process for tweets. As the Twitter microblog is very popular, tweets can be considered a new data source for comparable corpora. A possible way to build comparable corpora from Twitter is thus to extract tweets in two selected languages that share a specific topic, in order to construct a multilingual corpus.
However, mining relevant tweets poses a real challenge: how to extract only the most relevant tweets for a specific topic from the huge number of collected tweets?
In this respect, we propose in this paper an unsupervised machine learning based approach to improve the quality of the collected textual data, in order to identify which messages, i.e. tweets, address the specific topic. Several tweet representations are explored to filter the extracted messages. The main goal of this relevance estimation process is to improve the degree of comparability between the bilingual extracted tweet corpora.
Laurent Jakubina and Philippe Langlais. A Comparison of Methods for Identifying the Translation of Words in a Comparable Corpus: Recipes and Limits
Abstract: Identifying translations in comparable corpora is a challenge that has attracted many researchers for a long time. It has applications in several fields, including Machine Translation and Cross-lingual Information Retrieval. In this study we compare three state-of-the-art approaches for this task: the so-called context-based projection method, the projection of word embeddings, and a method dedicated to identifying translations of rare words. We carefully expose the meta-parameters of each method and measure their impact on the task of identifying the translation of English words into French in Wikipedia. Contrary to standard practice, we designed a test case where we do not resort to heuristics in order to pre-select the target vocabulary among which to find the translation, thus pushing each method to its limits. We show that all the approaches we tested have a clear bias toward frequent words. In fact, the best approach we tested could identify the translations of a third of a set of frequent test words, while it could only translate around 10% of the rare words.
Luz Marina Sierra, Carlos Cobos and Juan Carlos Corrales. Tokenizer adapted for the Nasa Yuwe language
Abstract: The Colombian government recognizes ethnic and cultural diversity as part of social rights [1], which is expressed, among other aspects, in the many indigenous languages that have been kept alive for centuries. However, efforts towards the conservation and preservation of these languages have not been sufficient; this is the case for the Nasa Yuwe language spoken by the Nasa indigenous community, which is endangered [2]. In this situation, technology offers a strategic opportunity for the adaptation, appropriation and development of the social and cultural environment of the Nasa people, including the use of computational techniques that allow the exchange of information through information retrieval (IR) activities [3]. These open up different possibilities for the Nasa people to interact in Nasa Yuwe, so it is necessary to adapt the stages of the IR process to the Nasa Yuwe language. This paper specifically presents the process of adapting a tokenizer for texts written in Nasa Yuwe, which involves the use of the Precision-Recall curve as a measure for evaluation and comparison. The results allow the appreciation of 1) every stage in the process of adapting the Nasa tokenizer, 2) the Nasa tokenizer and its results on texts written in Nasa Yuwe, and 3) the analysis of the baseline Precision-Recall curves in contrast to those of the Nasa tokenizer.


References
[1] Asamblea Nacional Constituyente, República de Colombia, «Banco de la República,» 1991. [En línea]. Available: http://www.banrep.gov.co/regimen/resoluciones/cp91.pdf. [Último acceso: 18 Octubre 2012].
[2] Universidad del Cauca, CRIC-PEBIl-Comisión General de Lenguas, «Estudio Sociolingüistico Fase preliminar. Base de datos - CRIC 01/2007 Lengua Nasa Yuwe y Namtrik. Popayàn, Cauca, Colombia,» CRIC, Popayán - Colombia, 2008.
[3] R. Baeza-Yates, «Challenges in the Interaction of Information Retrieval and Natural Language Processing,» de Computational Linguistics and Intelligent Text Processing, vol. Volume 2945, Springer Berlin Heidelberg, 2004, pp. 445-456.
Edmundo Pavel Soriano Morales, Julien Ah-Pine and Sabine Loudcher. Using a Heterogeneous Linguistic Network for Word Sense Induction and Disambiguation
Abstract: Linguistic networks are structures that allow us to model the characteristics of human language through a graph-like schema. This kind of modeling has proven useful for natural language processing tasks. In this paper, we first present and discuss the state of the art of recent semantic relatedness methods from a network-centric point of view; that is, we are interested in the types of networks used to solve practical semantic tasks. In order to address some of the shortcomings in the studied approaches, we propose a hybrid linguistic structure that takes into account lexical and syntactic language information.
Specifically, by joining more than two vertices per edge, a hypergraph permits deeper relations across words.
We show our model's practicality with a proof of concept: we set out to solve word sense disambiguation and induction using the presented network schema. Our model aims to shed light on ways of combining distinct types of linguistic information in order to take advantage of each component's unique characteristics.

Polibits

Nouri Nouha and Talel Ladhari. An Efficient Iterated Greedy Algorithm for the Makespan Blocking Flow Shop Scheduling Problem
Abstract: We propose in this paper a Blocking Iterated Greedy (BIG) algorithm that balances two key stages, destruction and construction, to solve the blocking flow shop scheduling problem and minimize the maximum completion time (makespan). The greedy algorithm starts from an initial solution generated by a well-known heuristic. Solutions are then improved through the above-mentioned stages until a stopping condition is met. The effectiveness and efficiency of the proposed technique are demonstrated by experimental results on both small randomly generated instances and Taillard's benchmark, in comparison with state-of-the-art methods.
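To make the destruction-construction loop concrete, here is a minimal Python sketch of a generic iterated greedy pass on a toy permutation flow shop instance; the makespan evaluation ignores the blocking constraint, and the instance data, parameters, and the crude initial ordering are illustrative assumptions rather than the authors' BIG implementation.

    import random

    def makespan(seq, proc):
        """Completion time of the last job on the last machine for a
        (non-blocking) permutation flow shop -- a deliberate simplification."""
        m = len(proc[0])
        comp = [0.0] * m
        for job in seq:
            comp[0] += proc[job][0]
            for k in range(1, m):
                comp[k] = max(comp[k], comp[k - 1]) + proc[job][k]
        return comp[-1]

    def iterated_greedy(proc, d=2, iters=200, seed=0):
        rng = random.Random(seed)
        jobs = list(range(len(proc)))
        best = sorted(jobs, key=lambda j: -sum(proc[j]))  # crude NEH-like start
        best_val = makespan(best, proc)
        cur, cur_val = best[:], best_val
        for _ in range(iters):
            # destruction: remove d random jobs from the current sequence
            partial = cur[:]
            removed = [partial.pop(rng.randrange(len(partial))) for _ in range(d)]
            # construction: greedily reinsert each removed job at its best position
            for job in removed:
                cands = [(makespan(partial[:i] + [job] + partial[i:], proc), i)
                         for i in range(len(partial) + 1)]
                _, pos = min(cands)
                partial.insert(pos, job)
            val = makespan(partial, proc)
            if val <= cur_val:          # simple acceptance criterion
                cur, cur_val = partial, val
                if val < best_val:
                    best, best_val = partial[:], val
        return best, best_val

    # toy instance: 5 jobs x 3 machines of processing times
    proc = [[3, 2, 4], [2, 5, 1], [4, 1, 3], [1, 4, 2], [3, 3, 3]]
    print(iterated_greedy(proc))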
Zhixuan Yang, Chong Ruan, Caihua Li and Junfeng Hu. Optimize Hierarchical Softmax with Word Similarity Knowledge
Abstract: Hierarchical softmax is widely used to accelerate the training of word embedding models and neural language models. Different kinds of hierarchical softmax trees have been proposed, including trees extracted from language resources, trees obtained by repeatedly partitioning words into a binary tree, and the Huffman tree used directly. However, no work has analyzed how the tree structure influences the quality of the resulting model. In this paper, we analyze the structure of the hierarchical softmax tree theoretically by treating it as a parameter of the training objective function. As a result, we show that the Huffman tree maximizes the training objective when word embeddings are random. Following this, we propose SemHuff, a new tree construction scheme based on adjusting the Huffman tree with word similarity knowledge. Experimental results show that the optimized hierarchical softmax improves word embeddings on various evaluation tasks.
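As a point of reference, the following is a small Python sketch of the standard Huffman tree construction over word frequencies that hierarchical softmax commonly relies on; the toy corpus and function names are illustrative, and the SemHuff adjustment itself is not reproduced here.

    import heapq
    from collections import Counter

    def build_huffman_codes(freqs):
        """Build Huffman codes (root-to-leaf bit strings) from word frequencies.
        In hierarchical softmax, each bit corresponds to one inner-node decision."""
        heap = [(f, i, w, None, None) for i, (w, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        uid = len(heap)
        while len(heap) > 1:
            f1, _, w1, l1, r1 = heapq.heappop(heap)
            f2, _, w2, l2, r2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, uid, None,
                                  (f1, w1, l1, r1), (f2, w2, l2, r2)))
            uid += 1
        codes = {}
        def walk(node, prefix):
            f, w, left, right = node
            if w is not None:          # leaf: a vocabulary word
                codes[w] = prefix or "0"
                return
            walk(left, prefix + "0")
            walk(right, prefix + "1")
        f, _, w, left, right = heap[0]
        walk((f, w, left, right), "")
        return codes

    corpus = "the cat sat on the mat the cat".split()
    codes = build_huffman_codes(Counter(corpus))
    print(codes)   # frequent words receive shorter codes, i.e. shorter softmax paths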
Nattapong Sanchan, Ahmet Aker and Kalina Bontcheva. Understanding Human Preferences for Summary Designs in Online Debates Domain
Abstract: Research on automatic text summarization has primarily focused on summarizing news, web pages, scientific papers, etc. While in some of these text genres it is intuitively clear what constitutes a good summary, the issue is much less clear-cut in social media scenarios such as online debates, product reviews, etc., where summaries can be presented in many ways. As yet, there is no analysis of which summary representation readers favour. In this work, we empirically analyse this question and elicit readers' preferences for the different designs of summaries for online debates. Seven possible summary designs in total were presented to 60 participants via an online study. Participants were asked to read and assign preference scores to each summary design. The results indicate that the combination of a Chart Summary and a Side-By-Side Summary is the most preferred summary design. This finding is important for future work in the automatic text summarization of online debates.
Henning Wold, Linn Vikre, Özlem Özgöbek and Jon Atle Gulla. Hybrid Entity Driven News Detection on Twitter
Abstract: In recent years, Twitter has become one of the most popular microblogging services on the Internet. People sharing their thoughts and feelings, as well as the events happening around them, makes Twitter a promising source of the most recent news, received directly from observers. However, detecting newsworthy tweets is a challenging task. In this paper we propose a new hybrid method for detecting real-time news on Twitter using locality-sensitive hashing (LSH) and named-entity recognition (NER). The method is tested on 72,000 tweets from the San Francisco area and yields a precision of 0.917.
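To illustrate the locality-sensitive hashing ingredient, here is a rough Python sketch that buckets near-duplicate tweets via banded MinHash signatures; the shingle size, number of hash functions, band layout, and example tweets are illustrative assumptions, not the configuration used in the paper.

    import random
    import re
    from collections import defaultdict

    def shingles(text, n=3):
        tokens = re.findall(r"\w+", text.lower())
        return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

    def make_hashers(k, seed=42):
        # salted wrappers around Python's built-in hash, kept simple for the sketch
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(k)]
        return [lambda s, salt=salt: hash((salt, s)) for salt in salts]

    def minhash_signature(shingle_set, hashers):
        return tuple(min(h(s) for s in shingle_set) for h in hashers)

    def lsh_buckets(texts, num_hashes=12, bands=4):
        hashers = make_hashers(num_hashes)
        rows = num_hashes // bands
        buckets = defaultdict(list)
        for idx, text in enumerate(texts):
            sig = minhash_signature(shingles(text), hashers)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets[key].append(idx)
        # candidate near-duplicate tweets are those sharing at least one bucket
        return {key: ids for key, ids in buckets.items() if len(ids) > 1}

    tweets = ["Fire reported downtown near the old bridge",
              "fire reported downtown near old bridge tonight",
              "Great coffee at the new place on 5th street"]
    print(lsh_buckets(tweets))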
Andrea Vanzo, Danilo Croce, Emanuele Bastianelli, Roberto Basili and Daniele Nardi. Robust Spoken Language Understanding for House Service Robots
Abstract: Service robotics has grown significantly in recent years, leading to several research results and a number of consumer products. One of the essential features of these robotic platforms is the ability to interact with users through natural language. Spoken commands can be processed by a Spoken Language Understanding chain in order to obtain the desired behavior of the robot. The entry point of such a process is the Automatic Speech Recognition (ASR) system, which provides a list of transcriptions for a given spoken utterance. Although several well-performing ASR engines are available off-the-shelf, they operate in a general-purpose setting and may therefore not be well suited to recognizing utterances given to robots in specific domains. In this work, we propose a practical yet robust strategy to re-rank lists of transcriptions. This approach improves the quality of ASR systems in situated scenarios, i.e., the transcription of robotic commands. The proposed method relies on evidence derived from a semantic grammar with semantic actions, designed to model typical commands expressed in scenarios specific to human service robotics. The outcomes of an experimental evaluation show that the approach effectively outperforms the ASR baseline, obtained by selecting the first transcription suggested by the ASR.
Attila Novák. Improving corpus annotation quality using word embedding models
Abstract: Web-crawled corpora contain a significant amount of noise. Automatic corpus annotation tools introduce even more noise by performing erroneous language identification or encoding detection, introducing tokenization and lemmatization errors, and adding erroneous tags or analyses to the original words. Our goal with the methods presented in this article is to use word embedding models to reveal such errors and to provide correction procedures. The evaluation focuses on analyzing and validating noun compounds, identifying bogus compound analyses, recognizing and concatenating fragmented words, detecting erroneously encoded text, restoring accents, and handling combinations of these errors in a Hungarian web-crawled corpus.
King Ip Lin and Yang Zhou. Twitter Sentiment Analysis -- Comparison between Lexicon Based vs. Machine Learning Based approaches
Abstract: We explore the task of analyzing sentiment on Twitter. While Twitter posts are an excellent source of sentiment, their short length and more free-form nature pose challenges. In this paper we examine two classes of methods that extract sentiment from Twitter: a lexicon-based approach, where n-grams corresponding to sentiments (either from an existing lexicon or derived from a corpus) are recognized; and a machine-learning-based approach, where features of each tweet (textual as well as structural) are extracted and learned via machine learning methods. Furthermore, we introduce a combination of both methods that produces even better results.
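A toy Python illustration of the lexicon-based side of such a comparison is given below: tweets are scored by summing the polarities of matched unigrams and bigrams from a tiny hand-made lexicon; the lexicon entries and the scoring rule are purely illustrative, not the resources used in the paper.

    import re

    # tiny illustrative polarity lexicon of unigrams and bigrams
    LEXICON = {"love": 1, "great": 1, "not bad": 1,
               "hate": -1, "terrible": -1, "not good": -1}

    def lexicon_score(tweet):
        """Sum polarities of matched unigrams/bigrams; bigrams take precedence."""
        tokens = re.findall(r"\w+", tweet.lower())
        score, i = 0, 0
        while i < len(tokens):
            bigram = " ".join(tokens[i:i + 2])
            if bigram in LEXICON:
                score += LEXICON[bigram]
                i += 2
            else:
                score += LEXICON.get(tokens[i], 0)
                i += 1
        return score

    for t in ["I love this phone, battery is great",
              "not good at all, I hate the screen"]:
        s = lexicon_score(t)
        print(t, "->", "positive" if s > 0 else "negative" if s < 0 else "neutral")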
Sondes Bannour, Laurent Audibert and Henry Soldano. Interactive learning of information extraction rules
Abstract: In the information extraction (IE) field, there is a current tendency, especially in the commercial world, to reconsider rule-based IE systems, because rules are easy to manipulate to cope with errors and are interpretable by humans. In the context of designing an interactive rule-based IE learning system that assists the user both in writing IE rules and in annotating examples from which such rules are automatically inferred, we first chose an expressive rule language and a rule learning algorithm based on this same language to ensure the comprehensibility of the rules. We then proposed and evaluated an active learning module that helps the user annotate relevant examples and thus accelerates the convergence of rule learning.
Mohammad Golam Sohrab, Makoto Miwa and Yutaka Sasaki. IN-DEDUCTIVE and DAG-Tree Approaches for Large-Scale Extreme Multi-label Hierarchical Text Classification
Abstract: This paper presents a large-scale extreme multi-label hierarchical text classification method that constructs a large-scale hierarchical inductive learning and deductive classification (IN-DEDUCTIVE) system from different efficient classifiers, and a DAG-Tree that refines the given hierarchy by eliminating nodes and edges to generate a new hierarchy. We target the standard hierarchical text classification datasets prepared for the PASCAL Challenge on Large-Scale Hierarchical Text Classification (LSHTC). We compare several classification algorithms on LSHTC, including DCD-SVM, SVM-perf, Pegasos, SGD-SVM, and Passive-Aggressive. Experimental results show that IN-DEDUCTIVE systems with DCD-SVM, SGD-SVM, and Pegasos are promising and outperformed other learners as well as the top systems that participated in the LSHTC3 challenge on the Wikipedia medium dataset. Furthermore, the DAG-Tree-based hierarchy is effective especially for very large datasets, since the DAG-Tree exponentially reduces the amount of computation necessary for classification. Our IN-DEDUCTIVE system with the DAG-Tree approach outperformed the top systems that participated in the LSHTC4 challenge on the Wikipedia large dataset.
Jiaqiang Chen, Niket Tandon, Charles Darwis Hariman and Gerard de Melo. WebBrain: Joint Neural Learning of Large-Scale Common Sense
Abstract: Massive volumes of text are now more easily available for knowledge harvesting, opening up the possibility of machine reading to acquire simple forms of commonsense knowledge. Still, we observe that many important facts about our everyday world are not frequently expressed in a particularly explicit way. To address this, we present WebBrain, a new approach for harvesting commonsense knowledge that relies on joint learning from Web-scale data to fill gaps in the knowledge acquisition. We train a neural network model that not only learns word2vec-style vector representations of words but also commonsense knowledge about them. This joint model allows general semantic information to aid in generalizing beyond the extracted commonsense relationships. Experiments show that we can obtain word embeddings that reflect word meanings, yet also allow us to capture conceptual relationships and commonsense knowledge about them.
Rajendra Prasath and Sudeshna Sarkar. Cross Language Information Retrieval with Incorrect Query Translations
Abstract: In this paper, we present a Cross-Language Information Retrieval (CLIR) approach using corpus-driven query suggestion. We use corpus statistics as a clue for selecting the right query terms when the translation of a specific query is missing or incorrect. The derived set of queries is ranked, and the top-ranked queries are used for query formulation. Using the re-formulated weighted query, we perform cross-language information retrieval. The results are compared with those of a CLIR system using Google translations of user queries and of CLIR with the proposed query suggestion approach. We used the English and Tamil corpora of the FIRE 2012 dataset and analyzed the effects of the proposed approach. The experimental results show that the proposed approach performs well even with incorrect translations of the queries.
Doru Anastasiu Popescu, Nicolae Bold and Daniel Nijloveanu. A method based on genetic algorithms for generating assessment tests used for learning
Abstract: Tests are used in a variety of contexts in everyday, everywhere learning. They are a specific method in the process of assessment (evaluation), which is an important part of educational activity. Setting up an optimized sequence of tests (SOT) drawn from a group of tests on the same subject, with restrictions corresponding to the evaluator's wishes, can be a slow and time-consuming task, because the restrictions can be varied and the number of tests can be high. This paper therefore presents a method for generating optimized sequences of tests within a battery of tests using a genetic algorithm. We associate a number of representative keywords with each test. The user expresses the restrictions by setting a number of keywords that best approximate the subject to be tested. The genetic algorithm finds the optimized solutions while using a small amount of hardware resources.
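The following compact Python sketch conveys the general shape of such a genetic algorithm: candidate sequences are bit strings selecting tests, and fitness rewards keyword coverage while penalizing length; the test data, fitness weights, and operators are illustrative assumptions, not the authors' encoding.

    import random

    TESTS = [{"kw": {"loops", "arrays"}}, {"kw": {"recursion"}},
             {"kw": {"arrays", "sorting"}}, {"kw": {"graphs"}},
             {"kw": {"sorting", "recursion"}}]
    TARGET = {"arrays", "sorting", "recursion"}   # the evaluator's keywords

    def fitness(chrom):
        covered = set().union(*(TESTS[i]["kw"] for i, g in enumerate(chrom) if g)) \
                  if any(chrom) else set()
        return len(covered & TARGET) - 0.3 * sum(chrom)  # coverage minus size penalty

    def evolve(pop_size=20, gens=50, pmut=0.1, seed=1):
        rng = random.Random(seed)
        pop = [[rng.randint(0, 1) for _ in TESTS] for _ in range(pop_size)]
        for _ in range(gens):
            pop.sort(key=fitness, reverse=True)
            parents = pop[:pop_size // 2]                  # elitist selection
            children = []
            while len(children) < pop_size - len(parents):
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, len(TESTS))
                child = a[:cut] + b[cut:]                  # one-point crossover
                child = [1 - g if rng.random() < pmut else g for g in child]  # mutation
                children.append(child)
            pop = parents + children
        best = max(pop, key=fitness)
        return [i for i, g in enumerate(best) if g], fitness(best)

    print(evolve())   # indices of selected tests and their fitness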

IJCLA

Aytuğ Onan, Serdar Korukoğlu and Hasan Bulut. LDA-based topic modelling in text sentiment classification: an empirical analysis
Abstract: Sentiment analysis is the process of identifying the subjective information in source materials towards an entity. It is a subfield of text and web mining. The web is a rich and progressively expanding source of information. Sentiment analysis can be modeled as a text classification problem, and text classification suffers from high-dimensional feature spaces and feature sparseness. The use of conventional representation schemes to represent text documents can be extremely costly, especially for large text collections. In this regard, data reduction techniques are viable tools for representing document collections. Latent Dirichlet allocation (LDA) is a popular generative probabilistic model for collections of discrete data. This paper therefore examines the performance of LDA in text sentiment classification. In the empirical analysis, five classification algorithms (Naïve Bayes, support vector machines, logistic regression, radial basis function network and K-nearest neighbor) and five ensemble methods (Bagging, AdaBoost, Random Subspace, voting and stacking) are evaluated on four sentiment datasets. To the best of our knowledge, this is the first empirical analysis of LDA-based feature representation in conjunction with classifier ensembles in text sentiment classification.
Asma Ben Abacha and Dina Demner-Fushman. Meta-Learning with Selective Data Augmentation for Medical Entity Recognition
Abstract: With the increasing number of annotated corpora for supervised Named Entity Recognition, it becomes interesting to study the combination and augmentation of these corpora for the same annotation task. In this paper, we particularly study the combination of heterogeneous corpora for Medical Entity Recognition by using a meta-learning classifier that combines the results of individual CRF models trained on different corpora. We propose selective data augmentation approaches and compare them with several meta-learning algorithms and baselines. We evaluate our approach using four sub-classifiers trained on four heterogeneous corpora. We show that despite the high disagreements between the individual models on the four test corpora, our selective data augmentation approach improves performance on all test corpora and outperforms the combination of all training corpora.
Masaki Murata, Shunsuke Tsudo, Masato Tokuhisa and Qing Ma. Correcting Redundant Japanese Sentences Using Patterns and Machine Learning for the Development of Writing Support Systems
Abstract: In this study, we propose a method to automatically correct redundant sentences using patterns and machine learning, and we propose methods that combine pattern-based and machine-learning approaches. We conducted experiments to correct redundant sentences containing ``kanou'' (possible or possibility), ``toiu'' (``that is'' or called), and ``surukoto'' (to do). The results demonstrate that the proposed method can correct redundant sentences with an accuracy of 0.6 and estimate corrected expressions for redundant parts with an accuracy of 0.7. Furthermore, we created a method to support a user's writing, in which a system displays redundant parts and provides candidate expressions.
Shirley Anugrah Hayati, Alfan Farizki Wicaksono and Mirna Adriani. Short Text Classification on Complaint Documents
Abstract: The Indonesian government has proposed a system for citizens to voice their aspirations and complaints, which are then stored in the form of short documents. Unfortunately, the existing system employs human annotators to manually categorize these short documents, which is very expensive and time-consuming. Automatically classifying the short documents into their correct topics will therefore reduce manual work and clearly increase the efficiency of the task. In this paper, we propose several approaches to automatically classify these complaint documents using various features, such as unigrams, bigrams, and their combination. Moreover, we demonstrate the use of information gain and Latent Dirichlet Allocation (LDA) for selecting discriminative features.
Majid Laali and Leila Kosseim. Disambiguation of French Discourse Connectives
Abstract: Discourse connectives (e.g., however, because) are terms that can explicitly convey a discourse relation within a text. While discourse connectives have been shown to be an effective clue for automatically identifying discourse relations, they are not always used to convey such relations; they should therefore first be disambiguated between discourse usage and non-discourse usage. In this paper, we investigate the applicability of features proposed for the disambiguation of English discourse connectives to French. Our results with the French Discourse Treebank (FDTB) show that syntactic and lexical features developed for English texts are as effective for French and allow the disambiguation of French discourse connectives with an accuracy of 94.5%.
Mahran Farhat and Gammoudi Mohamed Mohsen. Enhanced metric for comparability analysis of multilingual documents
Abstract: The main goal of this paper is to automatically derive a score that indicates the comparability level of documents based on shared features. The work of [5] uses lexical information, document structure, keywords and named entities to define a comparability metric for documents across different languages. Furthermore, in [7], the authors observe that keyphrases provide a brief description of a document's content and can be viewed as semantic metadata summarizing the document. Building on the contributions of [5, 7], we propose integrating keyphrases into the computation of the comparability score. We retain keywords and named entities, and expect to benefit from the richness of keyphrases in order to better compute the comparability of documents across different languages. We evaluate the reliability of the proposed comparability metric using a standard document collection, and we compare its outcomes with those of an existing comparability metric [5]. Experimental results show that our comparability metric obtains better results than the metric proposed by [5]. In addition, these results indicate the accuracy of the proposed metric in measuring the comparability of documents across different languages.
Daniela Gifu, Mihai Dascălu, Ștefan Trăușan-Matu and Laura Allen. Time Evolution of Writing Styles in Romanian Language
Abstract: This paper presents a diachronic analysis centered on the exploration of differences between the writing styles of journalistic texts in Romanian and Moldavian languages. Both languages have a common origin, but are spoken in two different, adjacent regions and have important historical differences. Our aim is to examine these language differences based on corpora of historical and contemporary texts. To this end, we employ the ReaderBench framework to calculate a number of textual complexity indices that can be reliably used to characterize writing style. These analyses are conducted on two independent corpora for each of the two languages, covering the following time periods: 1941-1991, when Bessarabia was separated from Romania, and after July 1991, when Bessarabia became an independent state, Republic of Moldavia. The results of our analyses highlight the lexical and cohesive textual complexity indices that best reflect the differences in writing style, ranging from sentence and paragraph structure, to word entropy and cohesion measured in terms of Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
Achraf Ben Romdhane, Salma Jamoussi, Abdelmajid Ben Hamadou and Kamel Smaili. Phrase-Based Language Model in Statistical Machine Translation
Abstract: As one of the most important modules in statistical machine translation (SMT), the language model measures whether one translation hypothesis is more grammatically correct than others. Current state-of-the-art SMT systems use standard word n-gram models, whereas the translation model is phrase-based. In this paper, we focus on one of the most important components of SMT: the language model. The idea is to use a phrase-based language model. To this end, the target portions of the translation table are retrieved and used to rewrite the training corpus and to estimate a phrase n-gram language model. In this work, we perform experiments with two language models, word-based and phrase-based. The different SMT systems are trained with three optimization algorithms: MERT, MIRA and PRO. The phrase-based systems are then compared to the baseline system in terms of BLEU and TER. The experimental results show that the use of a phrase-based language model in SMT can improve results and is especially able to reduce the error rate.
Mohamed Mouine, Diana Inkpen, Pierre-Olivier Charlebois and Tri Ho. Identifying multiple topics in texts
Abstract: We present in this paper an innovative method for multi-label text classification. Our method uses Lucene to classify text, assigning one or more classes to a new text based on its similarity to an annotated corpus. For finer granularity, we split the text into phrases and focus on the noun phrases. Instead of classifying the entire text, we classify each noun phrase. The classification result for the text is then assembled as the set of classes allocated to its noun phrases.
Ilnar Salimzianov and Ozlem Cetinoglu. Dependency-based Sentence Simplification for Large-scale LFG Parsing: Selecting Simplified Candidates for Efficiency and Coverage
Abstract: Large-scale LFG grammars achieve high coverage on corpus data, yet can fail to give a full analysis for each sentence. One approach proposed to obtain at least the argument structure of those failed sentences is to simplify them by deleting subtrees from their dependency structure (provided by a more robust statistical dependency parser). The simplified versions are then re-parsed to receive a full analysis. However, the number of simplified sentences this approach generates is infeasible for parsing. As a solution, only a subset of candidates is selected based on a metric. In this work we apply the so-called parsability metric (van Noord, 2004), introduced as an error-mining technique for grammar writing, to selecting among simplified candidates to be parsed, and show that we improve over previous results that use sentence length as the selection metric.
Cédric Maigrot, Sandra Bringay and Jérôme Azé. Concept drift vs. suicide: How can one help prevent the other?
Abstract: Suicide has long been a troublesome problem for society and is an event that has far-reaching consequences. Health organizations such as the World Health Organization (WHO) and the French National Observatory of Suicide (ONS) have pledged to reduce the number of suicides by 10% in all countries by 2020.
While suicide itself is a very marked event, there are often behaviors and words that can act as early signs of a predisposition to suicide. The objective of this application is to develop a system that semi-automatically detects these markers in social networks.

Previous work has proposed classifying tweets using vocabulary from topics related to suicide: sadness, psychological injuries, mental state, depression, fear, loneliness, proposed suicide method, anorexia, insults, and cyberbullying. In this work, we add a new dimension, time, to reflect changes in the status of monitored people.
We implemented this with different learning methods, including an original concept drift method, and successfully applied it to synthetic and real datasets derived from the Facebook platform.
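As a rough illustration of concept drift monitoring in general (not the authors' original method), the Python sketch below flags drift when the recent error rate of a classifier rises well above its historical level, in the spirit of DDM-style detectors; the window size, thresholds, and synthetic error stream are illustrative assumptions.

    import random
    from collections import deque

    def drift_monitor(stream_errors, window=30, warn=1.5, alarm=2.0):
        """Flag drift when the recent error rate exceeds the historical mean
        by `alarm` standard errors (a rough DDM-style heuristic)."""
        history, recent, alerts = [], deque(maxlen=window), []
        for t, err in enumerate(stream_errors):
            history.append(err)
            recent.append(err)
            if t < window:
                continue
            mean = sum(history) / len(history)
            var = sum((e - mean) ** 2 for e in history) / len(history)
            std = var ** 0.5 or 1e-9
            recent_rate = sum(recent) / len(recent)
            z = (recent_rate - mean) / (std / len(recent) ** 0.5)
            if z > alarm:
                alerts.append((t, "drift"))
            elif z > warn:
                alerts.append((t, "warning"))
        return alerts

    # synthetic stream: error rate jumps from ~10% to ~50% halfway through
    rng = random.Random(0)
    errors = [1 if rng.random() < 0.1 else 0 for _ in range(100)] + \
             [1 if rng.random() < 0.5 else 0 for _ in range(100)]
    print(drift_monitor(errors)[:5])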
Cristian Cardellino and Laura Alonso Alemany. The Impact of Word Embeddings for Supervised and Semi-supervised Word Sense Disambiguation in Spanish and English
Abstract: The following work presents our research on the impact of word embeddings in Word Sense Disambiguation (WSD), for both Spanish and English. We are particularly interested in Spanish verbal sense disambiguation, using the SenSem lexicon and corpus [1]. To assess the reliability of our method in a more general context, we evaluate its performance on an all-words English word sense disambiguation task, using the Semeval dataset [2].

We want to assess the impact of word embeddings, an unsupervised technique, in both a supervised and a semi-supervised approach to learning word senses. We show that word embeddings improve the performance of a supervised approach to word sense disambiguation over a bag-of-words approach, both for Spanish and for English. We also assess the impact of the studied techniques in a semi-supervised approach with self-taught learning, using a bootstrapping technique to extend WSD coverage with unsupervised resources.

[1] Alonso, L., J.A. Capilla, I. Castellón, A. Fernández, G. Vázquez, 2005. The SenSem Project: Syntactico-Semantic Annotation of Sentences in Spanish. Recent Advances in Natural Language Processing IV. Selected Papers from RANLP 2005.

[2] Pradhan, S. S., Loper, E., Dligach, D., Palmer, M. 2007. SemEval-2007 Task 17: English Lexical Sample, SRL and All Words. Proceedings of the 4th International Workshop on Semantic Evaluations.
Goutam Majumder, Dr. Partha Pakray and Alexander Gelbukh. Semantic Textual Similarity Based on Unigram Language Model and Lexical Taxonomy
Abstract: In this paper, we present an extensive survey on taxonomy-based semantic similarity and contribute a method for textual similarity using a unigram language model and a lexical taxonomy. The proposed method uses WordNet synsets for the lexical relationships between nodes/words. A unigram language model is estimated over a large corpus to assign information content values to the nodes of the graph. Finally, a similarity score is generated between two text passages. The system is evaluated on the SemEval 2015 training dataset.
Vandan Mujadia, Palash Gupta and Dipti Sharma. Pronominal reference type identification and event anaphora resolution for Hindi
Abstract: In this paper, we describe a hybrid approach to pronominal reference type (abstract or concrete) identification and event anaphora resolution for Hindi. Reference type identification is one of the crucial steps for any anaphora resolution system, as it helps the resolver with optimal feature selection. We use language-specific rules and a set of classifiers (ensemble learning) with various features for pronominal type identification. We also discuss event-referring anaphors (pronouns) and their resolution using Paninian dependency grammar, the proximity of events, and the cognitive load-carrying ability of humans. We achieve around 90% accuracy in pronoun reference type identification and around 77% F-score in event anaphora resolution.
Diana Inkpen, T. Sima Paribakht, Farahnaz Faez and Ehsan Amjadian. Term Evaluator: A Tool for Terminology Annotation and Evaluation
Abstract: There are several methods and available tools for terminology extraction, but the quality of the extracted terms is not always high. Hence, an important consideration in terminology extraction is to assess the quality of the extracted terms. In this paper, we propose a tool for annotating the correctness of terms extracted by three term-extraction tools. This tool facilitates term annotation by using a domain-specific dictionary, a set of filters, and an annotation memory, and allows for post-hoc evaluation analysis. We present a study in which two human judges used the developed tool for term annotation. Their annotations were then analyzed to determine the efficiency of term extraction tools by measures of precision, recall, and F-score, and to calculate the inter-annotator agreement rate.
Attila Novák and Borbála Siklósi. Grapheme-to-phoneme transcription in Hungarian
Abstract: A crucial component of text-to-speech systems is the one responsible for transcribing written text into its phonemic representation. Though the complexity of the relation between the written and spoken forms of languages varies, most languages have both regular and irregular phonological rules. In this paper, we present a system for the phonemic transcription of Hungarian. Besides implementing transcription rules, the tool incorporates the knowledge of a Hungarian morphological analyzer in order to detect morpheme and compound boundaries. It is shown that the system performs well even on texts containing a high number of foreign names, which could not be achieved by a lexicon-based method.
Fernando Antônio Asevedo Nóbrega and Thiago Pardo. Improving Content Selection for Update Summarization with Subtopic-Enriched Sentence Ranking Functions
Abstract: Update summarization aims to produce summaries under the assumption that the reader already has some knowledge of the topic from the source texts. We explore subtopic representation for update summarization. Subtopics are coherent textual segments consisting of one or more consecutive sentences. The results of our experiments show that our text representation improves summary recall and the performance of traditional summarization methods.
Roy Khristopher Bayot and Teresa Gonçalves. Multilingual author profiling using SVMs and linguistic features
Abstract: This paper describes various experiments investigating author profiling of tweets in four different languages: English, Dutch, Italian, and Spanish. Profiling consists of age and gender classification, as well as regression on five personality dimensions: extroversion, stability, agreeableness, openness, and conscientiousness. Different sets of features were tested: bag-of-words, text n-grams, and POS n-grams. SVM was used as the classifier. TF-IDF worked best for most English tasks, while for most tasks in the other languages the combination of the best features worked better.
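A minimal scikit-learn sketch of a TF-IDF plus linear SVM classification setup of the kind described is shown below; the toy texts, labels, and parameters are placeholders, not the paper's data or tuned configuration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    # toy author-profiling data: tweets labelled with gender (illustrative only)
    texts = ["loving the new season of this show so much",
             "just benched a new personal record at the gym",
             "shopping with the girls all afternoon",
             "watching the match with the lads tonight"]
    labels = ["F", "M", "F", "M"]

    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),  # word uni/bigrams
        LinearSVC()
    )
    model.fit(texts, labels)
    print(model.predict(["grabbing coffee with the girls later"]))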
Sapan Shah, Dhwani Vora, B P Gautham and Sreedhar Reddy. A Domain Specific Search Engine for Material Science Literature
Abstract: Knowledge of material properties as a function of material composition and manufacturing process parameters is of significant interest to materials scientists and engineers. A large amount of information of this nature is available in publications in the form of experimental measurements, simulation results, etc. However, getting to the right information of this kind, relevant to a given problem at hand, is a non-trivial task. First, an engineer has to go through a large collection of documents to select the right ones. Then the engineer has to scan through these selected documents to extract the relevant pieces of information. Our goal is to help automate some of these steps. Traditional search engines are not of much help here, as they are keyword-centric and weak on relation processing. In this paper, we present a domain-specific search engine that processes relations to significantly improve search accuracy. The engine pre-processes material publication repositories to extract entities such as material compositions, material properties, manufacturing processes, process parameters and their values, and builds an index over these entities and values. The engine then uses this index to process user queries and retrieve relevant publication fragments. It provides a domain-specific query language with relational and logical operators to compose complex queries. We compare the results of our search engine with those of a keyword-based search engine.
Jan Motl and Wei Nie. What Makes a Fairy Tale
Abstract: Traditionally, fairy tales have been analyzed by their plots; however, this approach has been criticized since it omits tone, mood, character and other attributes that further differentiate one fairy tale from another. To find the characteristic style of fairy tales written in English, factor analysis was applied to the extracted adjectives. The analysis gave rise to five unique factors describing the characteristic style of fairy tales.

ACLing

Suhaib Al-Rousan and Ahmad Al-Taani. Arabic Multi-Document Text Summarization
Abstract: As the availability and variety of electronic documents rapidly grow, the topic of Automatic Text Summarization (ATS) has gained a lot of interest. ATS makes the process of getting the needed information much easier: it transforms large text documents into a shorter form while maintaining the most significant information in the original documents. The summarization process plays a major role in many fields where the produced summary helps users decide on the relevance of a document to their search without the need to view the whole original document. Many systems, such as web portals (with their different services in news, e-mail, entertainment, and stock quotes) and search engines, also manage to perform their tasks effectively and efficiently using summarization.
ATS systems are therefore important in many fields. High-quality text summarization enhances document searching and browsing. Moreover, Information Retrieval systems present an automatically built summary in their result lists; this helps the user decide rapidly which documents are interesting and worth opening for a closer look, similar to the snippets shown in Google's search results. ATS tasks can be categorized into single-document and multi-document summarization. A single-document summarizer summarizes only one document, while a multi-document summarizer summarizes an entire collection of documents into one summary. Both single- and multi-document summarization are difficult tasks by themselves, because of problems such as the temporal dimension, redundancy, co-reference and sentence ordering. Multi-document summarization poses additional challenges, mainly due to the multiple sources from which information is extracted, such as the risk of more redundant information than would be found in a single text. Moreover, generating a cohesive summary from a set of documents is a non-trivial task.
Summarization methods can be classified into abstractive and extractive summarization. Extractive summarization assigns a saliency measure (e.g., a score) to units of the documents (e.g., sentences, paragraphs) and then extracts those with the highest scores to include in the summary. Abstractive summarization, on the other hand, is a complex problem that requires deeper analysis of the source documents and concept-to-text generation. Currently, most research and commercial systems in ATS are extractive. As for the generality of summaries, two types can be distinguished: generic and query-driven summaries. The first attempts to represent all related topics of a source text, while the second concentrates on the user's preferred query keywords or topics. ATS has proved to be a significant and rich field of study in natural language processing, which has recently witnessed great development, and a wide variety of approaches have been proposed to tackle it.
In this paper, we present an Arabic multi-document summarization approach based on extractive methods and the K-means clustering algorithm, with a heuristic function to predict the optimal value of K. To evaluate the proposed approach, the TAC 2011 MultiLing Pilot dataset is used. Experimental results showed the effectiveness of the proposed approach for extracting summaries from Arabic multi-documents. The results also showed the advantage of the proposed approach over many state-of-the-art approaches.
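To give a concrete flavour of clustering-based extractive summarization (illustrated here on English rather than Arabic), the Python sketch below clusters TF-IDF sentence vectors with K-means, chooses K with a silhouette-score heuristic, and keeps the sentence closest to each centroid; both the heuristic and the toy sentences are illustrative stand-ins for the paper's approach.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    sentences = [
        "The flood damaged hundreds of homes in the valley.",
        "Rescue teams evacuated residents from the flooded areas.",
        "The government announced emergency funding for flood victims.",
        "Officials said the funding will cover temporary housing.",
        "Weather services forecast more heavy rain next week.",
    ]

    X = TfidfVectorizer().fit_transform(sentences)

    # heuristic: pick K with the best silhouette score (one possible choice)
    best_k, best_score = 2, -1.0
    for k in range(2, min(5, len(sentences))):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score

    km = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
    summary = []
    for c in range(best_k):
        idx = np.where(km.labels_ == c)[0]
        # representative = sentence closest to the cluster centroid
        dists = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[c], axis=1)
        summary.append(sentences[idx[dists.argmin()]])
    print("\n".join(summary))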
Ertuğrul Yilmaz, İlknur Durgar El-Kahlout and Coşkun Mermer. Effects of Pre/Post-processing Techniques on Improving Egyptian Arabic to English SMT
Abstract: This paper presents statistical machine translation (SMT) systems for Egyptian Arabic to English in the SMS, Chat and CTS (Conversational Telephony Speech) genres. Pre-processing the available data and post-processing the system outputs in these domains are crucial for improving the overall translation quality. In this paper, we propose cleaning, pre/post-processing, and enriching the available data, as well as incorporating some advanced SMT techniques into our Egyptian Arabic to English translation systems. We achieve improvements of up to 6.5 BLEU points compared to the baseline translation systems.
Ahmad Abd Al-Aziz, Mervat Gheith and Ahmed Sharaf Eldin. Toward Building Arabizi Sentiment Lexicon based on Orthographic Variants Identification
Abstract: Arabizi is an encoding of Arabic into the Roman alphabet and Arabic numerals. Arabizi is used to write both Modern Standard Arabic (MSA) and Arabic dialects. It is commonly used in online communications: tweets, blogs, chats, etc. Most existing NLP tools for the Arabic language are designed for processing formal, scripted MSA. Arabizi has no orthographic standard; an Arabizi word may therefore be written in many possible inconsistent spellings, which makes building a lexicon of Arabizi words very challenging. This problem is not yet addressed in the literature on Arabizi orthographic variants. To solve it, we apply a spelling-correction approach to normalize the inconsistencies in Arabizi words. In this paper, we build a sentiment lexicon using the phonetic specificities of Arabizi to address the problem of identifying orthographic variants. Our algorithm is based on the classical Soundex algorithm. Our approach is tested on data written in Arabizi to detect sentiment words with different orthographic forms. The lexicon should be useful for analyzing sentiment in text written in Arabizi.
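The sketch below conveys the underlying idea of Soundex-style keys collapsing spelling variants onto one lexicon entry; the digit-to-letter normalization and the classical English Soundex coding shown here are illustrative simplifications, not the Arabizi-specific adaptation proposed in the paper.

    from collections import defaultdict

    # a few common Arabizi digit-to-letter conventions (illustrative, not exhaustive)
    ARABIZI_DIGITS = {"7": "h", "3": "a", "2": "a", "5": "k", "9": "s"}

    def normalize(word):
        return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in word.lower())

    def soundex(word):
        """Classical Soundex key: first letter plus digits for consonant classes."""
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = "".join(ch for ch in word if ch.isalpha())
        if not word:
            return ""
        key, prev = word[0].upper(), codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                key += code
            prev = code
        return (key + "000")[:4]

    # spelling variants of the same Arabizi word collapse onto one key
    variants = ["7abibi", "habibi", "habeebi", "7abeby"]
    groups = defaultdict(list)
    for v in variants:
        groups[soundex(normalize(v))].append(v)
    print(dict(groups))   # -> {'H110': ['7abibi', 'habibi', 'habeebi', '7abeby']}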
Hussein Khalil, Taha Osman, Paul Bowden and Mohammad Mlidan Mlitam. Extracting Arabic Composite Names Using a Knowledge-Driven Approach
Abstract: Named Entity Recognition (NER) is a basic prerequisite for using Natural Language Processing (NLP) in Information Retrieval. Arabic NER is especially challenging, as the language is phonologically rich and has short vowels and no capitalisation convention. In this paper, we present a novel rule-based approach that uses linguistic grammar-based techniques to extract Arabic composite names from Arabic text. Our approach uniquely exploits the genitive rules of Arabic grammar, in particular the rules regarding the identification of definite nouns (معرفة) and indefinite nouns (نكرة), to support the process of extracting composite names. Furthermore, the approach presented here does not place any constraints on the length of the Arabic composite name; it utilises domain knowledge to formalise a set of syntactic rules and linguistic patterns in order to extract proper composite names from unstructured text. Initial experimentation demonstrated higher recall and precision when applying our NER algorithm to a corpus from the financial domain.
Mallat Souheyl, Emna Hkiri, Mohsen Maraoui, Anis Zouaghi and Zrigui Mounir. Statistical Approach to Semantic Indexing in Multilingual Documents
Abstract: In this paper, we present a statistical approach to the semantic indexing of multilingual text documents based on a conceptual network formalism. We propose to use this formalism as an indexing language in order to represent the significant terms that capture the content of a given document. Our contribution focuses on two aspects: first, we propose an approach for relevant term extraction using EuroWordNet, a multilingual lexical resource.
Second, we propose to enhance the semantic indexing using the association rules model, which aims to discover non-taxonomic (contextual) relations between the terms of a document. The latter are latent relations, buried in the text and carried by the semantic context of word co-occurrence in the document. The proposed approach can be applied to several languages because it combines linguistic and statistical processing. The approach is validated by a set of experiments and a comparison with other indexing methods on a corpus from the TREC 2001 and 2002 ad hoc task evaluation campaigns. We show that the proposed indexing approach provides encouraging results.
Mallat Souheyl, Emna Hkiri, Maraoui Mohsen, Anis Zouaghi and Mounir Zrigui. Hybrid model of query lexical expansion and translation for multilingual information retrieval on the web
Abstract: Queries on the web are often short and ambiguous. The development of storage media and the amount of multilingual documents available on the web call for an approach to improve the performance of multilingual information retrieval systems. In this article, we explore query expansion by including terms belonging to the query's concepts. We distinguish explicit and implicit concepts. We propose a hybrid model that integrates the expansion terms issued from these two kinds of concepts into the initial query. Second, we use contextual query expansion for cross-language IR (ar-en) and (ar-ang). This requires query translation based on a word sense disambiguation method. At this level, the second part of our model extends the IR to cover the multilingual domain. The experimental results, obtained on a multilingual collection from the TREC 2001 and 2002 campaigns, confirm the relevance of our idea and show that the hybrid model greatly improves the performance of IR systems on the web in terms of precision, recall and result ranking.
Jihene Younes, Emna Souissi and Hadhémi Achour. A Hidden Markov Model for Automatic Transliteration of Romanized Tunisian Dialect
Abstract: The Tunisian dialect is the language naturally spoken in Tunisia and, unlike MSA (Modern Standard Arabic), it is an informal, non-written variant of Arabic. However, with the growing use of ICT and especially the internet, a written form of the Tunisian dialect is emerging: the electronic Tunisian dialect (ETD). Originally used in SMS messages on mobile phones, the ETD is becoming increasingly present on the web through social networks. It may be written in both the Arabic and the Latin alphabet.
In this work, we address the problem of automatically transliterating Romanized ETD into Arabic writing. The difficulty of this task lies in the ambiguity that characterizes most of the Latin characters used, which can match several possible Arabic transcriptions.
In the proposed approach, we treat this issue as a sequential labeling problem, where a single Arabic character should be assigned to each Latin character of the ETD word to transliterate, and we propose a solution based on a first-order Hidden Markov Model. The results of our experiments are presented and discussed.
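A tiny Python sketch of the sequence-labeling view is given below: a first-order HMM whose hidden states are Arabic letters is decoded with standard Viterbi; the states, probabilities, and the Romanized example word are invented for illustration, since real parameters would be estimated from aligned training data.

    import math

    def viterbi(obs, states, start_p, trans_p, emit_p):
        """Standard first-order Viterbi decoding in log space."""
        V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-9))
              for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            V.append({})
            back.append({})
            for s in states:
                best_prev, best_score = max(
                    ((p, V[t - 1][p] + math.log(trans_p[p].get(s, 1e-9)))
                     for p in states), key=lambda x: x[1])
                V[t][s] = best_score + math.log(emit_p[s].get(obs[t], 1e-9))
                back[t][s] = best_prev
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path))

    # toy model: Latin characters of a Romanized word emit Arabic letters
    # (all probabilities are invented for the sketch)
    states = ["ح", "ه", "ب", "ي"]
    start_p = {"ح": 0.4, "ه": 0.4, "ب": 0.1, "ي": 0.1}
    trans_p = {s: {t: 0.25 for t in states} for s in states}
    emit_p = {"ح": {"7": 0.9, "h": 0.4},
              "ه": {"h": 0.6, "7": 0.1},
              "ب": {"b": 0.95},
              "ي": {"i": 0.8, "y": 0.8, "e": 0.2}}
    print("".join(viterbi(list("7bibi"), states, start_p, trans_p, emit_p)))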
Fatma Ben Mesmia, Kais Haddar, Denis Maurel and Nathalie Friburger. Recognition and TEI annotation of Arabic Events Using Transducers
Abstract: The recognition of Arabic Named Entities (ANE) is an important task allowing the identification and classification of relevant entities into predefined categories in textual resources. ANEs of the category Event have become a new challenge in NLP applications, as their appearance is closely related to the evolution of the web, which regularly generates new event articles in free resources such as Wikipedia. Nevertheless, their recognition and annotation require a powerful formalism and standard in order to obtain structured output. In this paper, we propose a method to recognize and annotate ANE events. The proposed method is based on finite-state transducers and uses the TEI recommendation. These transducers are grouped in a cascade generated by the CasSys tool available in the Unitex linguistic platform. Our corpora are extracted from the Arabic Wikipedia through the Kiwix tool. The results obtained are satisfactory according to the calculated measures.
Aymen Trigui, Naim Terbeh, Mohsen Maraoui and Mounir Zrigui. Statistical approach for spontaneous Arabic Speech understanding based on stochastic speech recognition module
Abstract: This work is part of a large research project named "Oreillodule" aimed at developing tools for automatic speech recognition, translation, and synthesis for the Arabic language. In this paper, our attention is mainly focused on presenting the semantic analyzer developed for the automatic comprehension of standard spontaneous Arabic speech. We present a model of an Arabic speech understanding system in which both the speech recognition module and the semantic decoding module are based on a statistical approach. We present and evaluate the speech recognition module, and we explain the principle of the Arabic speech understanding module.
Chihebeddine Ammar, Kais Haddar and Laurent Romary. TMF Normalization of Arabic Technical and Scientific Terms
Abstract: Multilingual terminological data are an essential component for many industries, and the development of these resources is very expensive. In addition, exchange and data fusion are important aspects in the construction of most terminological databases (TDBs). We have noticed a lack of robust and rigorous terminological databases for the Arabic language, and even fewer standardized ones. In this context, we present in this paper a process for organizing the collection of terminological resources for the development of an Arabic multidisciplinary standardized terminology (according to the TMF ISO 16642 standard), based on a thorough study of the typology and characteristics of Arabic technical and scientific terms.
Nadia Ghezaiel Hammouda and Kais Haddar. A Rule-Based Lexical Disambiguation for Arabic Corpora
Abstract: Lexical disambiguation for Arabic corpora is important at different levels of analysis, because it facilitates the Arabic parsing process and reduces execution time. In this context, our objective is to propose a rule-based method for resolving lexical ambiguities and to implement a tool for Arabic with the NooJ platform through transduction on the text automaton. To do so, we identify and classify specific lexical rules for Arabic. We then implement these rules in the NooJ platform and call the NooJ syntactic grammars in the appropriate order to remove ambiguities existing in the Text Annotation Structures (TAS), especially in the NooJ cascade. The resulting NooJ implementation will be used to construct an automatic annotation tool.
Nadia Soudani, Ibrahim Bounhas and Yahya Slimani. Experimenting with an Unsupervised LSA-based Approach for Arabic Query Semantic Disambiguation
Abstract: This paper deals with problems of semantic ambiguity in the Arabic language. The meaning of words is context-dependent: the meaning of a word can change according to the context in which it is used. To address these problems, we propose an unsupervised LSA-based approach to Arabic semantic disambiguation in an information retrieval context. We discuss the deficiencies of current systems, which do not consider semantics during the search process. The proposed approach is experimented with and evaluated against a traditional vector-based model, aiming to demonstrate the adequacy of the semantic space model and its contribution to Natural Language Processing (NLP), especially word sense disambiguation. Approaches based on this framework are developed to show how a semantic search that considers the semantic meaning of query terms can improve the relevance of search results. Tests are carried out on an Arabic dictionary and the Zad test collection.
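The following short scikit-learn sketch illustrates the LSA machinery such an approach builds on: a TF-IDF term-document matrix is reduced with truncated SVD and query-document similarity is computed in the latent space; the English toy documents and the number of components are illustrative stand-ins for the Arabic dictionary and test collection.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the bank approved the loan for the new house",
            "the river bank was flooded after the storm",
            "interest rates at the bank rose again this month"]
    query = ["loan interest at the bank"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    lsa = TruncatedSVD(n_components=2, random_state=0)   # latent semantic space
    X_lsa = lsa.fit_transform(X)
    q_lsa = lsa.transform(vec.transform(query))

    sims = cosine_similarity(q_lsa, X_lsa)[0]
    for doc, s in sorted(zip(docs, sims), key=lambda p: -p[1]):
        print(f"{s:.3f}  {doc}")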
Mohamed Aly Fall Seideh, Héla Fehri and Kais Haddar. Toward building a bilingual lexicon for Arabic and French named entities of Herbalism
Abstract: Information Retrieval Interlingua (IRI) (e.g., multilingual information extraction) and machine translation (MT) (e.g., language learning) are important domains of Natural Language Processing (NLP). Bilingual dictionaries are important for these areas, among others, but they are very expensive to enrich manually, especially specialized dictionaries (for a particular domain). In our work, we deal with herbal medicine, also known as herbalism or botanical medicine. Herbal medicine is a medical system based on the use of plants or plant extracts. In recent years, interest in herbal medicine has become significant. Many international studies have shown that plants are capable of treating disease and improving health, often without significant side effects.
Identifying herbalism named entities is not an easy task, as the list of named entities is extensible and their structures are not precise. In addition, the hierarchy of named entities is unstable and approaches to their identification are varied. The proposed approach requires a two-phase process: recognition of Arabic and French named entities in the herbalism domain, and elaboration of different mapping criteria between the recognized named entities. Each phase requires the construction of its own resources (transducers and dictionaries). These resources are built using the NooJ platform.
Lin Kassem, Caroline Sabty, Nada Sharaf, Menna Bakry and Slim Abdennadher. tashkeelWAP: A Game With A Purpose For Digitizing Arabic Diacritics
Abstract: Diacritics in the Arabic language are signs placed above or under Arabic letters. Their main purpose is to provide phonetic aid to readers and to allow them to understand the Arabic text in its intended and correct context. The presence of a diacritical mark can entirely change the meaning of Arabic text. Current Optical Character Recognition (OCR) systems face accuracy difficulties when trying to read Arabic letters with diacritics, which affects the quality of the digitized Arabic text. We introduce "tashkeelWAP", a web application with two Games With A Purpose (GWAPs) that allow the digitization of Arabic text by outsourcing it to native Arabic-speaking players. As a by-product of playing the games, we collect possible digitizations of Arabic words with diacritics that were not recognized by OCR systems.
Ayoub Kadim, Azzeddine Lazrek and Yahya Ould Mohamed Elhadj. An improved version of the Nemlar Arabic written corpus used in a developed Arabic POS Tagger
Abstract: Corpora are the foundation of most statistical approaches to Natural Language Processing. Their various applications are based on learning statistical models, especially Hidden Markov Models, the most reputed machine learning technique behind most stochastic Part-of-Speech (POS) taggers. Nevertheless, the form in which a corpus is designed and the richness of its content have a significant impact on the learning process. As part of our work to develop an Arabic POS tagger, we worked on one of the most important annotated Arabic corpora, the Nemlar corpus, to improve its structure and enrich its content. These improvements are based on a set of corpus requirements that are presented in this paper, along with the various operations performed on the original corpus, such as unifying multiple tags, separating affixes, adding new tags for untagged words, and other additions and modifications. To validate the usability of this new structure of the Nemlar corpus, we first present experiments evaluating the new word recognition rate and then the developed Arabic POS tagger. Finally, we discuss the pros and cons of the new corpus version.
Caroline Sabty, Mirna Yacout, Mohamed Sameh and Slim Abdennadher. Gamified Collection of Arabic Named Entity Recognition Data
Abstract: Named Entity Recognition (NER) is one of the most challenging tasks in Natural Language Processing (NLP); it is responsible for entity identification. Most NER techniques rely on extracting patterns, dictionaries or pre-classified training corpora. There are many corpora available for the English language; however, due to the complexity of the Arabic language, very few corpora and only small online dictionaries are available for Arabic. In addition, most of the available approaches are based on artificial intelligence and machine learning techniques. In this paper, a new technique is introduced that uses human computation to create a large and diverse corpus. "3arosty" is a prototype of a two-player Game With A Purpose (GWAP) that aims to collect Arabic words (entities) along with their categories from users in a fun and interactive way.
Mourad Mars. Spell Checking Arabic text: From error detection to error correction
Abstract: Spell checking is the process of detecting misspelled words in a written text and recommending alternative spellings. The first stage consists of detecting errors in a given text; the second stage consists of error correction. In this paper we propose a novel method for spell checking Arabic text. Our system is a sequential combination of approaches, including lexicon-based, rule-based, and statistics-based methods. The experimental results show that the proposed method achieves good performance in terms of recall and precision in error detection and correction compared to other systems.
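A bare-bones Python sketch of the detect-then-correct pipeline is shown below: tokens missing from a lexicon are flagged, and in-lexicon candidates within edit distance one are ranked by corpus frequency; the tiny English lexicon and the single statistical criterion are illustrative simplifications of the combined lexicon-, rule- and statistics-based system described.

    import re
    from collections import Counter

    CORPUS = "the cat sat on the mat and the cat ran to the mat".split()
    FREQ = Counter(CORPUS)
    LEXICON = set(FREQ)

    def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
        """All strings within edit distance one (deletes, swaps, replaces, inserts)."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
        inserts = [a + c + b for a, b in splits for c in alphabet]
        return set(deletes + swaps + replaces + inserts)

    def correct(token):
        if token in LEXICON:                  # detection: in-lexicon tokens pass through
            return token
        candidates = edits1(token) & LEXICON  # correction: closest known words
        return max(candidates, key=FREQ.get, default=token)  # rank by corpus frequency

    print([correct(t) for t in re.findall(r"\w+", "teh cat sta on teh mat")])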
Karim Sayadi, Marcus Liwicki, Marc Bui and Rolf Ingold. Sentiment analysis on the Tunisian dialect: a case study of the Tunisian election context
Abstract: With the growth in use of social media platforms, sentiment analysis methods have become more and more popular. While many datasets are available for Classical Arabic, most Arab countries' dialects suffer from very limited resources for training an automated system. In this article, we study sentiment analysis applied to the Tunisian dialect and Modern Standard Arabic by providing a manually annotated dataset. On this dataset, we perform feature selection and train six classifiers to provide a benchmark for future studies.
Haithem Afli, Walid Aransa, Pintu Lohar and Andy Way. From Arabic User-Generated Content to Machine Translation: Integrating Automatic Error Correction
Abstract: With the widespread use of social media and online forums, individual users have been able to actively participate in the generation of online content in different languages and dialects. Arabic is one of the fastest growing languages on the Internet, and dialects (such as Egyptian and Saudi Arabian) have a big share of the Arabic online content. There are many differences between Dialectal Arabic and Modern Standard Arabic, which pose many challenges for the machine translation of informal Arabic. In this paper, we investigate the use of an automatic error correction method to improve the quality of Arabic user-generated texts and their automatic translation. Our experiments show that the new system with the automatic correction module outperforms the baseline system by nearly 22.59% relative improvement.
Abdulrahman Alosaimy and Eric Atwell. Ensemble Morphosyntactic Analyser for Classical Arabic
Abstract: The field of Arabic Natural Language Processing (NLP) has received a lot of contributions recently. Many analysers handle the morphological richness of Modern Standard Arabic (MSA) text, and there are at least seven morphological analysers (MAs) available. Several Part-of-Speech (POS) taggers use these MAs to improve their accuracy. However, the choice between these analysers is challenging, and none of them is designed for Classical Arabic. Several morphological analysers have been studied and combined in order to be evaluated on a common ground. The goal of our language resource is to build a freely accessible multi-component toolkit (named SAWAREF) for part-of-speech tagging and morphological analysis that can provide a comparative evaluation, standardise the outputs of each component, combine different solutions, and analyse and vote for the best candidates. This paper describes the research method and design, and discusses the key issues and obstacles.


Khaled Dahmri, Hassina Aliane and Hamid Azzoune. Enhancing HeidelTime for Time expression Annotation in Arabic News
Abstract: Temporal expressions are key information in many tasks in the modern information retrieval field. This paper presents a rule-based approach for the identification of temporal entities in Arabic text. Our approach is founded on the HeidelTime tool, in which lists of regular expressions are used. Starting from some of those lists, and in order to cover more temporal expressions in the Arabic language, we add regular expressions to enhance our lists. The approach relies on linguistic preprocessing consisting of POS tagging, a list of regular expressions which makes entity identification easy and efficient, and a set of manually developed rules employed to analyze the recognized temporal entities in a sentence. The approach achieves an F-measure of 0.93 when tested on a news corpus.
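As a rough illustration of rule-based matching of the kind described above, the following Python sketch compiles a regular expression over a small list of Arabic month names to recognize simple date expressions; the pattern and the month list are illustrative assumptions, not the authors' rule set.

    # Hedged sketch of regex-based temporal expression matching (not the
    # authors' actual rules): recognizes patterns like "12 يناير 2016".
    import re

    MONTHS = ["يناير", "فبراير", "مارس", "أبريل", "مايو", "يونيو",
              "يوليو", "أغسطس", "سبتمبر", "أكتوبر", "نوفمبر", "ديسمبر"]
    DATE_RE = re.compile(r"\d{1,2}\s+(?:%s)\s+\d{4}" % "|".join(MONTHS))

    def find_dates(text):
        # Returns (start, end, surface form) for each matched date expression.
        return [(m.start(), m.end(), m.group(0)) for m in DATE_RE.finditer(text)]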
Yousef Alotaibi, Yasser M. Seddiq, Ali Meftah, Sid-Ahmed Selouani and Mansour M. Alghamdi. Distinctive Phonetic Features of Arabic Dialects: Comparative Study
Abstract: This paper reports the work of defining the distinctive phonetic feature (DPF) elements that describe the phonemes of Modern Standard Arabic (MSA). This work is based on reviewing several views on Arabic DPFs in the classical and modern literature. Deviations of phonemes and DPF elements with respect to regional dialects and foreign-accented speech are also investigated. DPF elements for all MSA phonemes are defined and presented in a single DPF table unifying the different views. Observations on dialectal varieties with respect to MSA are also detailed.
Asmaa Mountassir, Houda Benbrahim and Ilham Berrada. Building and Annotating an Arabic Corpus for Opinion Mining (ACOM)
Abstract: Opinion mining and sentiment analysis is a hot research area which aims at the processing and analysis of opinionated texts. Though this field has a rich literature, Arabic opinions have received less interest from researchers despite the importance of the Arabic language. This is mainly due to the lack of linguistic resources on the one hand, and the lack of available corpora on the other hand. In this paper, we present the Arabic Corpus for Opinion Mining (ACOM) that we have built internally and annotated manually. Our corpus consists of three data sets that we collected from Arabic online forums. For the task of annotation, we present our annotation scheme and the inter-annotator agreement study that we have conducted. Then we describe the structure of each data set.
Rami Ayadi. A survey of Arabic Text Representation and Classification Methods
Abstract: In this paper we present a brief state of the art of Arabic text representation and classification methods. First, we describe some algorithms applied to the classification of Arabic text. Secondly, we cite the major works comparing classification algorithms applied to Arabic text. After this, we mention authors who propose new classification methods, and finally we investigate the impact of preprocessing on Arabic text classification.

TurCLing

Batuer Aisha. A Uyghur Lemmatization and Part-Of-Speech Tagging
Abstract: Lemmatization and part-of-speech (POS) analysis are very important for Uyghur syntax processing. In this paper we propose a new statistical method for Uyghur morphological analysis. The experimental results demonstrate that the proposed method is effective: the F-measure of lemmatization reaches 92.2% and that of POS tagging reaches 92.8% in the open test; notably, the method is easy to extend to other Altaic languages.
Aida Sundetova and Ualsher Tukeyev. Automatic detection of the type of “chunks” in extracting chunker translation rules from parallel corpora
Abstract: This paper describes a method for the automatic detection of the type of "chunks" generated in the methodology presented by Sánchez-Cartagena et al. (Computer Speech & Language 32:1 (2015) 46–90). The proposed automatic detection of chunk types improves this methodology for extracting grammatical translation rules from bilingual corpora. The improvement allows the output phrases of the extracted "chunk" translation rules to be used in the subsequent "interchunk" stage of the machine translation system, improving translation quality. Experiments are carried out for the English–Kazakh language pair using the free/open-source rule-based machine translation (MT) platform Apertium and bilingual English–Kazakh corpora.
Zhenisbek Assylbekov, Jonathan N. Washington, Francis M. Tyers, Assulan Nurkas, Aida Sundetova, Aidana Karibayeva, Balzhan Abduali and Dina Amirova. A free/open-source hybrid morphological disambiguation tool for Kazakh
Abstract: This paper presents the results of developing a morphological disambiguation tool for Kazakh. Starting with a previously developed rule-based disambiguation tool, we tried to cope with the complex morphology of Kazakh by breaking up lexical forms across their derivational boundaries into inflectional groups and modeling their behavior with statistical methods. A hybrid rule-based/statistical approach appears to benefit morphological disambiguation, demonstrating a per-token accuracy of 91% on running text.
Gozde Gul Sahin. Framing of Verbs for Turkish PropBank
Abstract: In this work, we present our method for framing the verbs of Turkish PropBank and discuss the incorporation of crowd intelligence to increase the quality and coverage rate of the annotated frames. The framing process is the first and most important step in the creation of a corpus annotated with semantic roles. Therefore, we have followed a two-pass approach to guarantee the quality of the created resource. First, the frames are manually created with the help of publicly available dictionaries, corpora and guiding morphosemantic features such as case markers. Later, a verb sense disambiguation task, where the verb senses correspond to annotated frames, is crowdsourced. Finally, the results of the verb sense disambiguation task are used to increase the coverage rate and quality of the created linguistic resource. In conclusion, a new linguistic resource of Turkish verb frames with 759 annotated verb roots and 1262 annotated senses is constructed.
Kadir Yalcin and Ilyas Cicekli. PlagDS: A Plagiarism Detection System Based on Document Similarity
Abstract: In this paper, we present a desktop plagiarism detection system for Turkish texts, called PlagDS, based on document similarity. PlagDS aims to detect both verbatim plagiarism and semantic similarities between texts. Our system compares two sentences to determine their similarity. Since Turkish is a morphologically rich language, stems of words are used in word comparisons. We also use synonyms of words in order to determine semantic similarities between sentences. Since there is no test corpus for Turkish plagiarism studies, we have built a test corpus which consists of two different Turkish translations of 25 texts from world classic books. We compare the results of our plagiarism detection system with other existing plagiarism detection applications. The performance results indicate that our system produces better results in the detection of semantic similarities between Turkish texts.
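As a rough illustration of stem-level sentence comparison with synonym expansion in the spirit of the description above, the following Python sketch computes a Jaccard-style overlap between expanded stem sets; the stems, synonym table and example are hypothetical, not the authors' resources.

    # Hedged sketch: compare two sentences via their stem sets, expanded with
    # synonyms so that semantically similar words can match. Toy data only.
    def expand(stems, synonyms):
        expanded = set(stems)
        for s in stems:
            expanded.update(synonyms.get(s, ()))
        return expanded

    def sentence_similarity(stems_a, stems_b, synonyms):
        a, b = expand(stems_a, synonyms), expand(stems_b, synonyms)
        return len(a & b) / len(a | b) if a | b else 0.0

    # Hypothetical Turkish stems and a tiny synonym table.
    syn = {"hızlı": {"çabuk"}}
    print(sentence_similarity({"araba", "hızlı"}, {"araba", "çabuk"}, syn))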
Ümit Mersinli and Yeşim Aksan. A Methodology for Multi-word Unit Extraction in Turkish
Abstract: Multi-word Unit (MWU) extraction in Turkish has its own challenges due to the agglutinative nature of the language and the lack of reliable tools and reference datasets. The aim of this study is to share hands-on experience with MWU extraction in on-going projects using the Turkish National Corpus (TNC) as the data source. Since Turkish still does not have a reference MWU set with which to evaluate the performance of any extraction tool or technique, the primary purpose of these projects is to form a reference MWU dictionary of Turkish. The techniques and suggestions compiled in this paper may provide an overall proposal for other Turkish-specific computational or statistical work. The linguistic perspective underlying the choice of a valid methodology is described in the first part of the study. In the second part, the important steps of the ongoing project are discussed through real examples from the TNC. In the conclusion, considerations for an interdisciplinary approach and a proposal for a hybrid methodology are presented.
Dilara Torunoğlu Selamet, Eren Bekar, Tugay İlbay and Gülşen Eryiğit. Exploring Spelling Correction Approaches for Turkish
Abstract: The spelling correction of morphologically rich languages is hard to solve with traditional approaches, since in these languages words may have hundreds of different surface forms which do not occur in a dictionary. Turkish is an agglutinative language with a very complex morphology and lacks annotated language resources. In this study, we explore the impact of different spelling correction approaches for Turkish and ways to eliminate the training data scarcity. We test seven different spelling correction approaches, four of which are introduced in this study. As a result of this preliminary work, we propose a new automatic training data collection process where existing spelling correctors help to develop an error model for a better system. Our best performing model uses a unigram language model together with this error model, and improves the performance scores by almost 20 percentage points over the widely used baselines. As a result, our study reveals the top performance achievable with the proposed approach and gives directions for a better future implementation plan.
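The combination of a unigram language model with an error model corresponds to the classic noisy-channel formulation; a minimal Python sketch under that assumption (with toy counts and a placeholder error model, not the authors' trained models) is:

    # Hedged noisy-channel sketch: score a candidate correction c for an
    # observed word w as P(c) * P(w | c). Counts and error model are toy.
    import math
    from difflib import SequenceMatcher

    UNIGRAM = {"geliyorum": 120, "geliyordum": 40}   # hypothetical counts
    TOTAL = sum(UNIGRAM.values())

    def language_model(candidate):
        return UNIGRAM.get(candidate, 0.5) / TOTAL   # crude smoothing

    def error_model(observed, candidate):
        # Placeholder channel model: similarity of the two surface forms.
        return SequenceMatcher(None, observed, candidate).ratio()

    def best_correction(observed, candidates):
        return max(candidates,
                   key=lambda c: math.log(language_model(c)) +
                                 math.log(error_model(observed, c) + 1e-9))

    print(best_correction("gelyorum", ["geliyorum", "geliyordum"]))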
Yeşim Aksan, S. Ayşe Özel, Hakan Yılmazer and Umut Demirhan. The Turkish National Corpus (TNC): Comparing the Architectures of v1 and v2
Abstract: The Turkish National Corpus (TNC), whose first version was released in 2012, is the first large-scale (50 million words), web-based and publicly available free resource of contemporary Turkish. It is designed to be a well-balanced and representative reference corpus for Turkish. With 48 million words coming from its written part, the untagged TNC v1 represents 4438 different data sources over 9 domains and 34 different genres. The morphologically annotated, 50-million-word TNC v2, with 5412 different documents compiled from written and spoken Turkish, is planned for release in 2016 and offers new query options for linguistic analyses. This paper presents a comparison of the architectures of TNC v1 and v2 in terms of a variety of queries performed on both versions. It is shown that TNC v2 performs better and faster than TNC v1 due to its in-memory inverted index structure.
Serkan Kumyol, Burcu Can and Cem Bozşahin. Using Allomorphs in Turkish Morphological Segmentation Reduces Sparsity
Abstract: Turkish is an agglutinating language with heavy affixation. During affixation, morphophonemic operations change the surface forms of morphemes, leading to allomorphy. This paper explores the use of Turkish allomorphs in the morphological segmentation task. The results show that aggregating morphemes into allomorph sets and treating them as the same morpheme decreases the sparsity in morphological segmentation, leading to higher accuracy. The source of this supervision can be syntax, in particular the syntactic category of morphemes and their logical form. We further investigate the dependency of Turkish morphemes on each other, using unigram and bigram morpheme models, by adopting a non-parametric Bayesian model based on a Dirichlet process. The bigram morpheme model outperforms the unigram morpheme model.
Kubra Adali, Tutkum Dinç, Memduh Gökırmak and Gülşen Eryiğit. Comprehensive Annotation of Multiword Expressions for Turkish
Abstract: Multiword expressions (MWEs) are commonly used in morphologically rich languages (MRLs); however, they raise a challenging issue, primarily due to their specific characteristics and the scarcity of annotated language resources, whose creation requires a noteworthy effort. In order to put MWE annotation on a healthy footing and make a clear beginning for MWE annotation in Turkish, we determine and define the scope of the types of MWEs and Named Entities (NEs). Under this scheme, we present two Turkish treebanks whose MWEs are fully annotated with these types of MWEs and subcategories of NEs. With this study, we introduce 10 newly redefined types of MWEs and the categories of NEs for Turkish, and describe the sources which provide a direction for future studies of MWE annotation in Turkish. To the best of our knowledge, this annotation work is the first study which renders a benchmark and sheds light on further comprehensive approaches for MWE annotation and extraction in Turkish.
Umut Sulubacak, Gülşen Eryiğit and Tuğba Pamay. A Revisited Turkish Dependency Treebank
Abstract: In this paper, we present a critical analysis of the dependency annotation framework used in the METU-Sabancı Treebank (MST), and propose new annotation schemes that would alleviate the issues we have identified. Later, we describe our attempt at reannotating the treebank from the ground up using the proposed schemes, and then compare the consistencies of the two versions via cross-validation using a dependency parser. According to our experiments, the reannotated version of the original treebank, which we call the X Treebank (XT), demonstrates a labeled attachment score of 75.3% and an unlabeled attachment score of 83.7%, surpassing the corresponding scores of 65.9% and 76.0% for MST by a very large margin.
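For reference, the labeled and unlabeled attachment scores reported above can be computed as in the following minimal Python sketch; the token representation used here is an assumption, not the authors' evaluation code.

    # Hedged sketch of attachment scores: UAS counts tokens whose predicted
    # head matches the gold head; LAS also requires the label to match.
    def attachment_scores(gold, predicted):
        # gold / predicted: lists of (head_index, dependency_label) per token.
        assert len(gold) == len(predicted)
        n = len(gold)
        uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
        las = sum(g == p for g, p in zip(gold, predicted)) / n
        return uas, las

    gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
    pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
    print(attachment_scores(gold, pred))  # (1.0, 0.666...)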
Çağrı Çöltekin. (When) do we need inflectional groups?
Abstract: Inflectional groups (IGs) are sub-word units that have become a de facto standard in Turkish natural language processing (NLP). Despite their prominence in Turkish NLP, similar units are seldom used for other languages; theoretical or psycholinguistic studies on such units are virtually nonexistent; they are typically overused in most existing work; and there are no clear standards defining when a word should or should not be split into IGs. This paper argues for the need for sub-word syntactic units in Turkish NLP, followed by an explicit proposal listing a small set of morphosyntactic contexts in which these units should be introduced.
Dilara Torunoğlu Selamet, Tuğba Pamay and Gülşen Eryiğit. Simplification of Turkish Sentences
Abstract: Text simplification is the process of transforming existing natural language text into a new form, aiming to reduce its syntactic or lexical complexity while preserving its meaning. Long and complicated sentences may pose multiple problems, especially for elementary school children. In this paper, we focus on Turkish, a morphologically rich language, examine sentences from an elementary school textbook to extract complex structures, and propose a sentence simplification system to automatically generate simpler versions of the sentences. Thereby, sentences become easier for children to understand, particularly children with difficulties in reading comprehension. Our system automatically applies simplification operations, namely splitting, dropping, reordering, and substitution.

RCS

Heba Ismail, Saad Harous and Boumediene Belkhouche. A Comparative Analysis of Machine Learning Classifiers for Twitter Sentiment Analysis
Abstract: Twitter's popularity has grown considerably in the last few years, influencing the social, political and business aspects of life. Therefore, sentiment analysis research has put special focus on Twitter. Tweet data have many peculiarities related to the use of informal language, slogans, and special characters. Furthermore, training machine learning classifiers on tweet data often faces the data sparsity problem, primarily due to the large variety of tweets expressed in only 140 characters. In this work we evaluate the performance of various classifiers commonly used in sentiment analysis to show their effectiveness in sentiment mining of Twitter data under different experimental setups. For the purpose of the study, the Stanford Testing Sentiment dataset (STS) is used. Results of our analysis show that multinomial Naïve Bayes outperforms other classifiers in Twitter sentiment analysis and is less affected by data sparsity.
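As a rough illustration of the multinomial Naïve Bayes setup evaluated above, the following Python sketch uses scikit-learn with toy tweets standing in for the STS data; it is not the authors' exact configuration.

    # Hedged sketch: bag-of-words multinomial Naive Bayes for tweet polarity.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    tweets = ["love this phone", "worst service ever", "great game tonight"]
    labels = ["pos", "neg", "pos"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(tweets, labels)
    print(model.predict(["this service is great"]))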
C. Alberto Ochoa-Zezatti. Identifying consumption patterns in Twitter using text mining to classify trends in shopping
Abstract: Twitter stands out as a social network because users interact with each other to share preferences and, therefore, consumption patterns. However, the network itself only shows topic trends through tags and content analysis; despite the huge amount of information flowing through publications, user preference data over time and place often lack a classification that would facilitate analysis and market research. Analysis can be performed on collected Twitter posts reflecting the places visited by users in order to learn the consumption patterns of a given region. Twitter provides a set of developer tools that allow searching publications by keyword, and some publications contain geographic coordinates indicating where the publication was made along with the name of the venue. The developer tools provided by the social network limit the number of user requests, which hinders analysis by text mining; therefore, a distributed data mining system is proposed. In this research we propose a model to gather consumption patterns according to time and place in a region through Twitter postings, which will be applicable to market research.
Sukhada and Dipti Sharma. Analyzing English Phrases from Paninian Perspective
Abstract: This paper explores Paninian Grammar (PG) as an information processing device in terms of 'how', 'how much' and 'where' languages encode information. PG was based on a morphologically rich language, Sanskrit. We apply PG to English and see how the Paninian perspective would explain it from an information-theoretic point of view, and assess its effectiveness in machine translation.

We analyze English phrases defining 'sup' (nominal inflections) and 'ting' (finite verb inflections) and compare them with the notion of 'pada' (an inflected word form) and 'samasta-pada' (compound) in Sanskrit.

Sanskrit encodes relations between nouns and adjectives, and between nouns in apposition, through agreement in gender, number and case markers, whereas English encodes them through positions; as a result, constituents are formed. It appears that an English phrase contains more than one 'pada' and hence cannot be similar to a 'pada'. However, we show the linguistic similarities between a 'pada', a 'samasta-pada' and a 'phrase'.
M’hamed Mataoui, Omar Zelmati and Madiha Boumechache. A proposed Lexicon-Based Sentiment Analysis Approach for the Vernacular Algerian Arabic
Abstract: Nowadays, sentiment analysis research is widely applied in a variety of domains such as marketing and politics. Several studies on Arabic sentiment analysis have been carried out in recent years. These studies mainly focus on Modern Standard Arabic, and only a few have investigated Arabic dialects, namely Egyptian, Jordanian, and Khaliji. In this paper, we propose a new lexicon-based sentiment analysis approach to address the specific aspects of the vernacular Algerian Arabic widely used in social networks. A manually annotated dataset and three Algerian Arabic lexicons have been created to explore the different phases of our approach.
Rishabh Srivastava and Soma Paul. Hindi Question Answering system using PurposeNet-based Ontology
Abstract: In this paper, we propose a question answering cum dialog system in Hindi for the MMTS (Multi-Modal Transport System) and recipe domains, using an ontology developed on an architecture based on PurposeNet. Apart from retrieving answers from the knowledge base, this paper focuses on effectively removing the linguistic gap between the input query and the knowledge base, which is in another language. In addition to answering traditional factoid (single-word and list-based) questions, we discuss methods to answer why and how questions as well. The system is primarily built for the MMTS domain and is extended to the recipe domain to validate its usefulness.
Ahmed Raof Nasser, Kıvanç Dinçer and Hayri Sever. Investigation of Feature Selection Problem for Sentiment Analysis in Arabic Language
Abstract: Sentiment analysis, also known as opinion mining or sentiment classification, can be defined as the automatic detection of emotions in textual content by means of computers. These emotions describe the feelings or ideas of the author about a certain subject. In this study we investigate document-level supervised sentiment analysis in the Arabic context. We use three different feature generation methods based on unigrams, bigrams and trigrams to generate three different datasets from the Opinion Corpus for Arabic (OCA). We use three standard classification methods (Support Vector Machines, K-Nearest Neighbor and Decision Trees), known for their effectiveness on such datasets, to build our supervised sentiment analysis system. We also present an approach that finds the optimal number of features needed to reach the best time performance for supervised sentiment analysis systems. Two feature ranking methods (Information Gain based and Chi-Square based) were used to calculate the score of each feature with respect to the class labels. This feature ranking step is helpful for removing irrelevant features and keeping only the relevant ones, and helps to increase classification performance since it eliminates unnecessary processing due to irrelevant features. In our study the SVM classifier showed superior classification performance compared to the other two classifiers. Our experimental results also show the effectiveness of the two feature selection methods in reducing the feature space of the generated datasets as well as in providing higher classification performance.
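As a rough illustration of Chi-Square feature ranking followed by SVM classification, as in the pipeline described above, the following scikit-learn sketch can be considered; the documents, n-gram settings and value of k are illustrative assumptions, not the paper's configuration.

    # Hedged sketch: chi-square feature selection feeding a linear SVM.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["خدمة ممتازة", "تجربة سيئة للغاية", "منتج رائع"]   # toy documents
    labels = ["pos", "neg", "pos"]

    pipeline = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),   # unigram and bigram counts
        SelectKBest(chi2, k=2),                # keep the k highest-scoring features
        LinearSVC())
    pipeline.fit(docs, labels)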
Zeineb Neji, Lamia Belguith and Marieme Ellouze. Question Answering Based on Temporal Inference
Abstract: Inference approaches in Arabic question answering are in their first steps compared with other languages. In this paper, we focus on question answering in Arabic and propose an approach which can improve the performance of traditional Arabic question answering systems by handling temporal inference. Evidently, any user is interested in obtaining a specific and precise answer to a specific question. Therefore, the challenge of developing a system capable of obtaining a relevant and concise answer is obviously of great benefit.
We have implemented the proposed approach in a question answering system entitled IQAS: Inference Question Answering System for handling temporal inference.
Athira U and Sabu M. Thampi. A Psychometric Approach to Authorship Analysis of Short Documents
Abstract: Authorship analysis aims to resolve the problem of identifying the authors of a document by scrutinizing the writing style involved in it. The area has gained significance in the current era, where online communication has become enormously popular. The major challenge in this area is the limited amount of document content available for analysis. The proposed method accomplishes the task of analysis by benefiting from the psycholinguistic characteristics of an author. The author's individual way of expressing emotional and sociolinguistic aspects is identified to obtain an author style pattern, which in turn helps in the attribution of authorship. The proposed method shows an improvement in the accuracy of authorship attribution of short texts in comparison with existing methods.
Sinan Polat, Merve Selcuk-Simsek and Ilyas Cicekli. A Modified Earley Parser for Huge Natural Language Grammars
Abstract: For almost half a century the Earley parser has been used for parsing context-free grammars, and it is considered a touchstone algorithm in the history of parsing algorithms. On the other hand, it is also known for being expensive in terms of time and memory usage. For huge context-free grammars its performance is poor, since its time complexity also depends on the number of rules in the grammar. The time complexity of the original Earley parser is O(R^2 N^3), where N is the string length and R is the number of rules. In this paper, we aim to improve the time and memory usage of the Earley parser for grammars with a large number of rules. In our approach, we prefer a radix tree representation for rules instead of the list representation used in the original Earley parser. We have tested our algorithm with rule sets of different sizes, up to 200,000 rules, all learned by an example-based machine translation system. According to our evaluation results, our modified parser has a time bound of O(log(R) N^3) and uses 20% less memory than the original Earley parser.
Malarkodi C.S., Elisabeth Lex and Sobha Lalitha Devi. Named Entity Recognition for the Agricultural Domain
Abstract: Agricultural data play a major role in the planning and success of rural development activities. Agriculturalists, planners, policy makers, government officials, farmers and researchers require relevant information to trigger decision-making processes. This paper presents our approach to extracting named entities from real-world agricultural data covering different areas of agriculture using Conditional Random Fields (CRFs). Specifically, we have created a Named Entity tagset consisting of 19 fine-grained tags. To the best of our knowledge, no specific tag set or annotated corpus is available for the agricultural domain. We have performed several experiments using different combinations of features and obtained encouraging results. Most of the issues observed in an error analysis have been addressed by post-processing heuristic rules, which resulted in a significant improvement of our system's accuracy.
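As a rough illustration of CRF-based named entity tagging over word-level features, the following Python sketch uses the sklearn-crfsuite package; the feature template, example sentence and tag names are assumptions, not the authors' feature set or tagset.

    # Hedged sketch: token feature extraction plus CRF training with
    # sklearn-crfsuite. The features, sentence and tags are illustrative only.
    import sklearn_crfsuite

    def token_features(sent, i):
        word = sent[i]
        return {
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "suffix3": word[-3:],
            "prev.lower": sent[i - 1].lower() if i > 0 else "BOS",
            "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "EOS",
        }

    train_sents = [["Wheat", "yield", "fell", "in", "Punjab"]]
    train_tags = [["B-CROP", "O", "O", "O", "B-LOC"]]   # hypothetical tags

    X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, train_tags)
    print(crf.predict(X))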
Sindhuja Gopalan, Paolo Rosso and Sobha Lalitha Devi. Discourse Connective - A Marker for Identifying Featured Articles in Biological Wikipedia
Abstract: Wikipedia is a free-content internet encyclopedia that can be edited by anyone who accesses it. As a result, Wikipedia contains both featured and non-featured articles: featured articles are high-quality articles, while non-featured articles are of poor quality. Given the exponential growth of Wikipedia articles, identifying featured Wikipedia articles has become indispensable in order to provide quality information to users. As very few attempts have been made in the biology domain of English Wikipedia, we present our study on automatically measuring the information quality of biological Wikipedia articles. Since coherence reflects the representational information quality of a text, we use a discourse connective count measure in our study. We compare this novel measure with two other popular approaches, the word count measure and the explicit document model method, which have been successfully applied to the task of quality measurement in Wikipedia articles. We organized the Wikipedia articles into a balanced and an unbalanced set: the balanced set contains featured and non-featured articles of equal length, and the unbalanced set contains randomly selected featured and non-featured articles. The best result for the balanced set, an F-measure of 83.2%, is obtained using a Support Vector Machine classifier with a 4-gram representation and a Term Frequency-Inverse Document Frequency weighting scheme. Meanwhile, the best result for the unbalanced corpus is obtained using the discourse connective count measure, with an F-measure of 98.06%.
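As a rough illustration of the SVM baseline mentioned above (TF-IDF weighted 4-grams feeding a linear SVM), the following scikit-learn sketch can be considered; the abstract does not state whether the 4-grams are word- or character-level, so character 4-grams are assumed here, and the documents and labels are toy examples.

    # Hedged sketch: TF-IDF over 4-grams with a linear SVM classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    articles = ["The mitochondrion is the powerhouse of the cell ...",
                "cool animal. lives in water. swims."]
    labels = ["featured", "non-featured"]

    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4)),  # assumed char 4-grams
        LinearSVC())
    model.fit(articles, labels)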
Firas Hmida, Emmanuel Morin and Beatrice Daille. Aligned Knowledge-Rich Contexts from Specialized Comparable Corpora
Abstract: During the specialized translation process, a revision phase is necessary to validate the initial translation proposed by the translator. This phase, which ensures the consistency of the document produced, requires the preparation of terminological information accessible through glossaries and dedicated management tools. In this work, we propose a methodology to build a bilingual concordancer providing not parallel contexts but aligned Knowledge-Rich Contexts (KRCs) from specialized comparable corpora. These contexts share bilingually conserved properties between the source and target languages within the comparable corpus. KRCs are intended to assist in verifying the usage of the term to be translated and of its proposed translation. The assessment of the proposed tool shows that the obtained KRCs are acceptable for assisting terminological revision.
Biswanath Barik, Erwin Marsi and Pinar Öztürk. Event Causality Extraction from Natural Science Literature
Abstract: We aim to develop a text mining framework capable of extracting changing variables and their causal dependencies from scientific publications in the cross-disciplinary field of climate science, marine science and environmental science. The extracted knowledge can be used to infer new knowledge/hypotheses through reasoning, which forms the basis of a knowledge discovery support system. Automatic identification and extraction of causal relations from text is a challenging task. Generally, methods for causal relation extraction proposed in the literature target specific domains such as news text or biomedical publications; however, these models may not be directly applicable to other or new domains. In this paper, we review the state of the art in causal knowledge extraction from text and carefully select the methods and resources most likely to be applicable to our domain.
Václav Rajtmajer and Pavel Král. Event Detection in Czech Twitter
Abstract: The main goal of this paper is to create a novel experimental system for the Czech News Agency (ČTK) able to monitor the current data flow on Twitter, analyze it and extract relevant events. The detected events are then presented to users in an accessible form. A novel event detection approach adapted to Czech Twitter is thus proposed. It uses user lists to discover potentially interesting tweets, which are further clustered into groups based on their content; the final decision is based on thresholding. We experimentally show that the proposed approach is useful because it detects a significant number of potential events. It is worth noting that this approach is domain independent.
Ivan Garrido Marquez, Jorge Garcia Flores, François Lévy and Adeline Nazarenko. Blog annotation: from corpus analysis to automatic tag suggestion
Abstract: Nowadays, some blogs reach a large audience and have become part of mainstream media. They contain information on diverse topics, personal opinions, and discussions with readers. In order to improve searching within blogs, to enhance navigation and to increase the visibility of blogs on the web, bloggers annotate posts with tags and categories. Unfortunately, these annotations are often made on subjective grounds and not in a systematic way. Although there are currently several tools to help bloggers annotate their posts, these tools take into account neither the information pre-existing inside the blogs nor the evolution of their topics. Through the analysis of blog textual data, we try to characterize the practices of blog annotation and evaluate annotation tools with respect to bloggers' requirements. This paper presents an analysis of a corpus of blogs in French and an evaluation of blog annotation tools. From these results, we explain what advanced annotation functionalities blog platforms should offer.
Noureddine Loukil and Kais Haddar. Extracting HPSG Lexicon from Arabic VerbNet
Abstract: This paper presents the construction of an HPSG lexicon of Arabic verbal entities, automatically inferred from the Arabic VerbNet (AVN), a large-coverage verb lexicon where verbs are classified using syntactic alternations. We discuss the main verb specification along with the relation between the syntactic and semantic levels of representation within the HPSG framework. Extensive analysis of the Arabic VerbNet classes has led to the adoption of a finite set of mapping rules between AVN classes and HPSG subcategorization and semantic descriptions, covering the majority of the verbal tokens. We employ the adopted mapping rules to extract the syntactic and semantic data from AVN, and finally we describe the resulting TDL descriptions in which the lexicon has been encoded. The generated resources have been evaluated against a small gold lexicon to assess the precision and coverage of the extracted data.
Tereza Pařilová, Filip Mrváň, Bruno Mižík and Eva Hladká. Emerging Technology Enabling Dyslexia Users To Read and Perceive Written Text Correctly
Abstract: Dyslexia is treated by many specialists as a cognitive impairment rooted in a visual attention deficit [1]. It may cause letters to appear spatially rotated or overlapping. Both children and adults suffer from this condition, differing in individual needs and severity. With a growing amount of information being distributed digitally, there is a need to adapt online text for dyslexic users. However, with different operating systems, web browsers and the highly individual nature of dyslexia, it is not easy to fully automate such adaptations. We develop an extension for the Chrome browser based on our previous cognitive research and empirical data. The extension allows a user with dyslexia to adapt web content with a special fragmentation sign that demonstrably suppresses the reading problems caused by dyslexia.
Sarra Zrigui, Anis Zouaghi, Rami Ayadi, Salah Zrigui and Mounir Zrigui. An opinions analysis system for Arabic
Abstract: Today, the need to automatically process and analyze opinions is strongly felt. It is in this context that we situate this work, whose objective is to contribute to the implementation of an opinion analysis system enabling binary classification of a set of textual data. For this, we studied and evaluated several methods, Support Vector Machines (SVM) and Naïve Bayes (NB), on a corpus composed of 500 film reviews. These models were not satisfactory; to improve the results we introduced a preprocessing phase before classifying the corpus, and this phase improved the quality of the classification.
Khaireddine Bacha. Contribution to the achievement of a spellchecker for Arabic
Abstract: The objective of this work is to build a spell-checking tool that analyzes the input text in search of possible misspellings. This tool suggests possible corrections for each misspelled word in the text. This work requires the presence of a reference dictionary of words in the Arabic language. These objectives were accomplished with appropriate resources, effective methods and approaches. First experimental results on real data are encouraging and provide evidence of the validity of the design choices. They also help to highlight the difficulty of the task and suggest possible developments.
Octavio Augusto Sánchez Velázquez and Gerardo Eugenio Sierra Martínez. Let's agree to disagree: Measuring agreement between annotators for opinion mining task
Abstract: There is a need to know to what degree humans can agree when classifying a sentence as carrying some sentiment orientation. However, little research has been done on assessing the agreement between annotators for the different opinion mining tasks. In this work we present an assessment of the agreement between two human annotators. The task was to manually classify newspaper sentences into one of three classes. To assess the level of agreement, Cohen's kappa coefficient was computed. Results show that annotators agree more on negative classes than on positive or neutral ones. We observed that annotators may agree up to a level of 0.58 in the best case and 0.30 in the worst.
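For reference, Cohen's kappa as used above corrects the observed agreement for the agreement expected by chance; a minimal Python sketch with illustrative label sequences is:

    # Hedged sketch: Cohen's kappa for two annotators over the same items.
    from collections import Counter

    def cohens_kappa(ann1, ann2):
        n = len(ann1)
        observed = sum(a == b for a, b in zip(ann1, ann2)) / n
        c1, c2 = Counter(ann1), Counter(ann2)
        expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
        return (observed - expected) / (1 - expected)

    a = ["neg", "neg", "pos", "neu", "pos"]
    b = ["neg", "neg", "pos", "pos", "neu"]
    print(cohens_kappa(a, b))  # 0.375 for these toy annotations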
Víctor Mijangos, Gerardo Sierra and Abel Herrera. A word embeddings model for sentence similarity
Abstract: Currently, word embeddings (Bengio et al., 2003; Mikolov et al., 2013) are enjoying a major boom due to their performance in different Natural Language Processing tasks, surpassing many conventional methods in the literature. From the obtained embedding vectors, we can obtain a good grouping of words and surface elements. It is common to represent higher-level elements such as sentences using the idea of composition (Baroni et al., 2014), through vector sums, vector products, or by defining a linear operator representing the composition. Here, we propose representing a sentence as a matrix containing the word embedding vectors of that sentence. However, this requires a distance between matrices; to obtain one, we use the Frobenius inner product. We show that this sentence representation outperforms traditional composition methods.
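A minimal Python sketch of this kind of representation, assuming random vectors in place of trained embeddings, zero-padding to a fixed length, and a normalized Frobenius inner product as the similarity score, is:

    # Hedged sketch: a sentence is a matrix of word embedding rows; two
    # sentences are compared with a normalized Frobenius inner product.
    import numpy as np

    def sentence_matrix(words, embeddings, dim, max_len):
        mat = np.zeros((max_len, dim))
        for i, w in enumerate(words[:max_len]):
            mat[i] = embeddings.get(w, np.zeros(dim))
        return mat

    def frobenius_similarity(a, b):
        # <A, B>_F / (||A||_F * ||B||_F), a cosine-like score between matrices.
        return np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    dim, max_len = 50, 10
    emb = {w: np.random.randn(dim) for w in "the cat sat on mat dog".split()}
    s1 = sentence_matrix("the cat sat".split(), emb, dim, max_len)
    s2 = sentence_matrix("the dog sat".split(), emb, dim, max_len)
    print(frobenius_similarity(s1, s2))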
Amit Mishra and Sanjay Kumar Jain. Computing Sentiment Polarity of Questioners Asking Why Type Questions in Opinion Question Answering
Abstract: Opinion question answering systems (OQAS) search for answers in public opinions available on the social web. Why-questions asked in OQAS expect answers that incorporate reasons and explanations for the questioners' sentiments expressed in the questions. Sentiment analysis has recently been used for determining the sentiment polarity of why-questions so as to find the intention with which a user is looking for information related to products. In our recent research [14, 15], we address complex comparative why-type questions and propose an approach to perform sentiment analysis of the questioners. For example, for the question "I need a mobile with a good camera and nice sound quality. Why should I go for buying Nokia over Samsung?", we determine the main focused product (Nokia) with respect to the questioner's perspective, which shows a positive intention of buying a mobile. That work does not deal with questions that carry mixed emotions, such as "Why are Dells ok, HPs aren't that good, but Macs are fantastic?". Moreover, it does not perform feature-specific (camera and sound quality) sentiment analysis of questioners.

In this paper, we perform feature-based sentiment analysis of questioners. We also address complex questions that carry mixed emotions towards different products. We examine the semantic structures of questions and propose an approach for sentiment analysis of questioners on product review sites. Finally, we conduct experiments which show better results compared to existing baseline systems.
Raheem Sarwar and Sarana Nutanong. The Key Factors and Their Influence in Authorship Attribution: Systematic Literature Review Protocol and Preliminary Results
Abstract: Authorship Attribution (AA) has a very long history, dating back to the 18th century. During the last decade, this area has developed significantly due to advances in natural language processing and machine learning. This paper presents a Systematic Literature Review (SLR) protocol to address some key research questions, together with preliminary results; the detailed implementation of this protocol is in progress. The results of this study show that the selection of stylometric features affects the accuracy of the AA task, and that the selection of an appropriate computational methodology for a specific feature set also affects the accuracy of AA. Furthermore, texts of different genres require different sets of stylometric features to obtain satisfactory results. Genre-neutral stylometric features can be used for AA, but they do not produce results as good as genre-dependent stylometric features. Finally, long texts produce more accurate results than short texts with the same set of features; however, satisfactory results can be obtained for short texts by selecting appropriate features. The final study based on the proposed SLR protocol will provide a comprehensive overview of AA, compare the existing solutions and reveal their benefits and drawbacks in a comprehensive way. It will also recap the existing techniques and identify the research gaps which still need to be covered in the future.
Seifeddine Mechti, Maher Jaoua, Rim Faiz, Heni Bouhamed and Lamia Hadrich Belguith. Author profiling: Age prediction based on advanced Bayesian networks
Abstract: In this study, we present a new method for profiling the author of an anonymous English text. The aim of author profiling is to determine demographic (age, gender, region, education level) and psychological (personality, mental health) properties of the authors of a text, especially authors of user generated content in social media. To obtain the best classification, authors resort to machine learning methods.
Focusing on works which use Bayesian networks, most of those methods apply naïve Bayesian classifiers, which do not yield the best results. Therefore, we propose a method based on advanced Bayesian networks for age prediction to overcome this problem. We obtained promising results by relying on the English PAN@CLEF 2013 corpus. The obtained results are comparable to the ones obtained by the best state-of-the-art methods.
Rajendra Prasath and Pinar Ozturk. Learning to Extract Information from Scientific Articles - A Case Based Reasoning Approach
Abstract: In this paper, we present an information extraction approach, based on case-based reasoning, that identifies the different parts of the text content of scientific articles published by different publishers. First, we crawl scientific articles from different publishers and create a corpus of open-access scientific articles. Then, we apply our learning approach, which extracts text content from scientific articles using the content portion and other metadata of the article. We build a topic model using a known collection of relevant scientific articles from the Marine Science literature collected from Nature abstracts, and then apply these topics for a first-level filtering of related documents. The basic idea is the following: if a document taken from the open-access collection covers many of these topics, it is probably an article of interest for knowledge discovery tasks, such as identifying variables and whether they increase or decrease. This approach makes the observed evidence computable, and the underlying model does not require any linguistic or semantic expertise to extract the right content from different publishers. Experimental results on the open-access scientific article collection show that the proposed approach effectively captures the relevant information needed for knowledge discovery.
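As a rough illustration of topic-based filtering of the kind described above, the following Python sketch trains a gensim LDA model on a few known relevant token lists and keeps documents that cover several topics; the corpus, topic count and threshold are toy assumptions, not the paper's configuration.

    # Hedged sketch: LDA topic model used as a relevance filter.
    from gensim import corpora, models

    relevant = [["ocean", "temperature", "plankton", "increase"],
                ["sea", "ice", "extent", "decline"],
                ["nutrient", "concentration", "bloom", "growth"]]
    dictionary = corpora.Dictionary(relevant)
    corpus = [dictionary.doc2bow(doc) for doc in relevant]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

    def covers_many_topics(tokens, min_topics=2, min_prob=0.2):
        # Keep a document if several topics exceed the probability threshold.
        bow = dictionary.doc2bow(tokens)
        topics = lda.get_document_topics(bow, minimum_probability=min_prob)
        return len(topics) >= min_topics

    print(covers_many_topics(["ocean", "ice", "bloom", "temperature"]))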
Alibek Barlybayev. Development of an automated recognition system of Kazakh speech in Smart University
Abstract: Currently, we are developing an intelligent e-learning system. The Intelligent Education System consists of the following subsystems: 1) the Intelligent Electronic University (IEU), considered in this work; 2) the Intelligent Electronic College; 3) the Intelligent Electronic School.
The Intelligent Electronic University (Smart University) is a new paradigm of higher education. If static electronic resources are replaced by dynamic smart tutors, e-learning by smart education, and computer testing by intelligent knowledge assessment, the quality of e-learning will certainly rise. The IEU has the following properties: an adaptive interface; knowledge representation and processing; self-learning; self-verifiability; a smart interface, i.e. an online reference agent capable of written and spoken dialogue with users; and a smart tutor that can examine and evaluate students' knowledge independently of a human tutor.
This research is needed for the smart interface to understand speech and respond to users' questions. Without speech recognition, the smart interface cannot understand spoken language; it must be able to convert the audio signal into written text. Natural language processing is then required, but that has already been described in other papers.
We have analyzed the technologies used in speech recognition and described a research technique based on hidden Markov models. We used the Baum-Welch method to adjust the parameters of the acoustic model, calculated a context-dependent acoustic model, and compiled the acoustic material for estimating it. We also describe the architecture of the computer system for calculating the acoustic model and have built a list of questions for the construction of the triphone tree.
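For reference, the forward algorithm below computes the observation likelihood that Baum-Welch re-estimation is built on; the tiny discrete HMM is illustrative only, whereas an acoustic model would use continuous emission densities (e.g., Gaussian mixtures) over speech features.

    # Hedged sketch: forward-algorithm likelihood for a toy discrete HMM.
    import numpy as np

    A = np.array([[0.7, 0.3],       # state transition probabilities
                  [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1],  # emission probabilities per state
                  [0.1, 0.3, 0.6]])
    pi = np.array([0.6, 0.4])       # initial state distribution

    def forward_likelihood(observations):
        alpha = pi * B[:, observations[0]]
        for obs in observations[1:]:
            alpha = (alpha @ A) * B[:, obs]
        return alpha.sum()

    print(forward_likelihood([0, 1, 2, 1]))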
Seifeddine Mechti, Maher Jaoua, Rim Faiz and Lamia Hadrich Belguith. An Analysis Framework for Hybrid Authorship Verification
Abstract: Given a set of candidate authors for whom some texts of undisputed authorship exist, attributing texts of unknown authorship to one of the candidates is called author verification. This problem has acquired great attention due to its new applications in forensic analysis, e-commerce and plagiarism detection. The author verification task is of great help in the plagiarism detection process: indeed, the probability of plagiarism increases when two parts of a document are not assigned to the same author. This paper introduces an original hybrid method that amalgamates linguistic and statistical features for author verification. In fact, the proposed method takes advantage of a large set of linguistic features to fully address the identification of a document's author. These features are used to build a machine-learning process. We obtained promising results by relying on the PAN@CLEF 2014 English literature corpus.
Bagdat Myrzakhmetov and Aibek Makazhanov. Initial Experiments on Russian to Kazakh SMT
Abstract: We present our initial experiments on Russian to Kazakh phrase-based statistical machine translation. Following a common approach to SMT between morphologically rich languages, we perform basic morphological processing, namely lemmatization. Given the rather humble-sized parallel corpus at hand, we also put some effort into data cleaning and investigate the impact of the data quality vs. quantity trade-off on overall performance. Although our experiments mostly focus on source-side pre-processing, we achieve a substantial relative improvement over the baseline that operates on raw, unprocessed data.
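As a rough illustration of source-side lemmatization of Russian text before SMT training, the following Python sketch uses the pymorphy2 analyzer; whether the authors used this particular tool is not stated in the abstract.

    # Hedged sketch: replace each Russian token by the normal form of its
    # most probable morphological analysis (pymorphy2).
    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def lemmatize(sentence):
        return " ".join(morph.parse(tok)[0].normal_form for tok in sentence.split())

    print(lemmatize("мы видели новые книги"))  # roughly: "мы видеть новый книга"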
Lena Zhetkenbay, Altynbek Sharipbay, Gulmira Bekmanova and Unzila Kamanur. The ontological model of adjectives for Kazakh-Turkish machine translation system
Abstract: In this work, we discuss the structure of an ontological model of adjectives in the Kazakh and Turkish languages for a machine translation system. This model gives us an opportunity to compare the general similarities and differences between the languages. It can be used in information retrieval, machine translation, automatic report generation, dialogue and other systems.

Poster

Samar Anbarkhan. Extracting semantic associations between obesity, weight loss and herbal medicine features
Abstract: Obesity has become one of the most serious global medical conditions in public health. In the last few decades, obesity has increased at an alarming rate in the Eastern Mediterranean Region. Overweight and obesity rates for adults in this region are estimated at 30.4% and 12%, respectively, reaching as high as 66% and 31.5% in countries of the Gulf Cooperation Council.
Natural products form the basis of numerous systems of traditional medicine and provide sources of new drugs. The complementary therapies of Traditional Arabic Medicine (TAM) play an important role in obesity treatment. Alternative treatments such as the use of herbal medicine for obesity have received much attention in recent years. Although ancient Arabic medicine has contributed to modern Western medicine, it is underexploited and has not emerged as a comprehensive alternative treatment, despite 200 to 250 plants being acknowledged as medicinal herbs.
The main purpose of this study is to investigate the rise of obesity among Saudi adults and to study the relationships between obesity, weight loss and traditional Arabic medicine using a text mining approach. This research addresses the increasing need to replace pharmaceutical drugs with alternative herbal medicine by developing a novel text mining method to explore potential tacit associations between Arabic herbal medicine and the treatment of obesity in particular. It is proposed to use a semantic graph to elicit the associations between obesity, weight loss and herbal medicine features.