Using Comparable Corpora for Under-Resourced Areas of Machine Translation

Author:   Inguna Skadina, Robert Gaizauskas, Bogdan Babych, Nikola Ljubesic
Publisher:   Springer International Publishing AG
Edition:   1st ed. 2019
ISBN:   9783319990033
Pages:   323
Publication Date:   22 February 2019
Format:   Hardback
Availability:   Manufactured on demand
We will order this item for you from a manufactured on demand supplier.

Our Price:   $284.60

Overview

This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains. It presents a wealth of methods and open tools for building comparable corpora from the Web, evaluating comparability and extracting parallel data that can be used for the machine translation task. It is divided into several sections, each covering a specific task such as building, processing, and using comparable corpora, focusing particularly on under-resourced language pairs and domains. The book is intended for anyone interested in data-driven machine translation for under-resourced languages and domains, especially for developers of machine translation systems, computational linguists and language workers. It offers a valuable resource for specialists and students in natural language processing, machine translation, corpus linguistics and computer-assisted translation, and promotes the broader use of comparable corpora in natural language processing and computational linguistics.

Full Product Details

Author:   Inguna Skadina, Robert Gaizauskas, Bogdan Babych, Nikola Ljubesic
Publisher:   Springer International Publishing AG
Imprint:   Springer International Publishing AG
Edition:   1st ed. 2019
Weight:   0.664kg
ISBN:   9783319990033
ISBN 10:   3319990039
Pages:   323
Publication Date:   22 February 2019
Audience:   Professional and scholarly, Professional & Vocational
Format:   Hardback
Publisher's Status:   Active
Availability:   Manufactured on demand
We will order this item for you from a manufactured on demand supplier.

Table of Contents

1 Introduction
2 Cross-language comparability and its applications for MT
2.1 Introduction: Definition and use of the concept of comparability
2.2 Development and calibration of comparability metrics on parallel corpora
2.2.1 Application of corpus comparability: Selecting coherent parallel corpora for domain-specific MT training
2.2.2 Methodology
2.2.2.1 Description of calculation method
2.2.2.2 Symmetric vs. asymmetric calculation of distance
2.2.2.3 Calibrating the distance metric
2.2.3 Validation of the scores: cross-language agreement for source vs. target sides of TMX files
2.2.4 Discussion
2.3 Exploration of comparability features in document-aligned comparable corpora: Wikipedia
2.3.1 Overview: Wikipedia as a source of comparable corpora
2.3.2 Previous work on using Wikipedia as a linguistic resource
2.3.3 Methodology
2.3.3.1 Document pre-processing
2.3.3.2 Similarity measures
2.3.3.3 Eliciting human judgments
2.3.4 Results and analysis
2.3.4.1 Responses to the questionnaire
2.3.4.2 Inter-assessor agreement
2.3.4.3 Correlation of similarity measures to human judgments
2.3.4.4 Classification task
2.3.5 Discussion
2.3.5.1 Features of 'Similar' articles
2.3.5.2 Measuring cross-language similarity
2.3.6 Section conclusions
2.4 Metrics for identifying comparability levels in non-aligned documents
2.4.1 Using parallel and comparable corpora for MT
2.4.2 Related work
2.4.3 Comparability metrics
2.4.3.1 Lexical mapping based metric
2.4.3.2 Keyword based metric
2.4.3.3 Machine translation based metrics
2.4.4 Experiments and evaluation
2.4.4.1 Data sources
2.4.4.2 Experimental results
2.4.5 Metric application to equivalent extraction
2.4.6 Discussion
2.4.6.1 Advantages and disadvantages of the metrics
2.4.6.2 Using semi-parallel equivalents in MT systems
2.4.7 Conclusion
3 Collecting comparable corpora
3.1 Introduction
3.2 Previous work in collecting comparable corpora
3.2.1 Web crawling
3.2.2 Identifying comparable text
3.3 ACCURAT techniques to collect comparable documents
3.3.1 Comparable corpora collection from Wikipedia
3.3.1.1 Extracting comparable articles
3.3.1.2 Measuring similarity in inter-language linked documents
3.3.2 Comparable corpora collection from news articles
3.3.3 Comparable corpora collection from narrow domains
3.3.3.1 Acquiring comparable documents
3.3.3.2 Aligning comparable document pairs
4 Extracting data from comparable corpora
4.1 Introduction
4.2 Term extraction, tagging, and mapping for under-resourced languages
4.2.1 Related work
4.2.2 Term Extraction, tagging, and mapping with the ACCURAT toolkit
4.2.2.1 Term candidate extraction with CollTerm
4.2.2.1.1 Linguistic filtering
4.2.2.1.2 Minimum frequency filter
4.2.2.1.3 Statistical ranking
4.2.2.1.4 Cut-off method
4.2.2.2 Term tagging in documents
4.2.2.3.1 Term tagging evaluation for Latvian and Lithuanian
4.2.2.3.2 Term tagging evaluation for Croatian
4.2.2.3 Term mapping
4.2.2.4 Comparable corpus term mapping task
4.2.2.5 Discussion
4.2.3 Experiments with English and Romanian term extraction
4.2.3.1 Single-word term extraction
4.2.3.2 Multi-word term extraction
4.2.3.3 Experiments and results
4.2.4 Multi-word term extraction and context-based mapping for English-Slovene
4.2.4.1 Resources and tools used
4.2.4.1.1 Comparable corpus
4.2.4.1.2 Seed lexicon
4.2.4.1.3 LUIZ
4.2.4.1.4 ccExtractor
4.2.4.2 Experimental setup
4.2.4.2.1 Term extraction
4.2.4.2.2 Term mapping
4.2.4.2.3 Extension of the Seed lexicon
4.2.4.3 Evaluation of the results
4.2.4.3.1 Evaluation of term extraction
4.2.4.3.2 Evaluation of term mapping
4.2.4.4 Discussion
4.3 Named entity recognition using TildeNER
4.3.2 Annotated corpora
4.3.3 System design
4.3.3.1 Feature function selection
4.3.3.2 Data pre-processing
4.3.3.3 NER model bootstrapping
4.3.3.4 Refinement methods
4.3.4 Evaluation
4.3.4.1 Non-comparative evaluation
4.3.4.2 Experimental comparative evaluation
4.3.5 Discussion
4.4 Lexica extraction
4.4.1 Related work
4.4.2 Experiments on bilingual lexicon extraction
4.4.2.1 Experimenting with key parameters
4.4.2.2 Corpus size and comparability
4.4.2.3 Seed lexicons
4.4.2.4 Vector building and comparison
4.4.2.5 Evaluation of results
4.4.3 Bilingual lexicon extraction for closely related languages
4.4.3.1 Building the comparable corpus
4.4.3.2 Building the Seed lexicon
4.4.3.3 Extending the Seed lexicon with cognates
4.4.3.4 Extending the Seed Lexicon with first translations
4.4.3.5 Combining cognates and first translations of the most frequent words to extend the Seed lexicon
4.4.3.6 Re-ranking of translation candidates with cognate clues
4.4.3 Discussion
5 Mapping and aligning units from comparable corpora
5.1 Introduction
5.2 Related work
5.3 Document alignment in comparable corpora
5.3.1 EMACC
5.3.2 EMACC evaluations
5.4 Parallel sentence mining from comparable corpora
5.4.1 LEXACC
5.4.1.1 Indexing target sentences
5.4.1.2 Finding translation candidates for source sentences
5.4.1.3 Filtering
5.4.1.4 The PEXACC translation similarity measure
5.4.1.4.1 Features
5.4.1.4.2 Learning the optimal weights
5.4.1.5 Evaluations
5.4.1.5.1 Experimental setting
5.4.1.5.2 Search engine efficiency
5.4.1.5.3 Filtering efficiency
5.4.1.5.4 Translation similarity efficiency
5.4.1.5.5 SMT experiments
5.4.2 PEXACC
5.4.2.1 The algorithm
5.4.2.2 Evaluations
5.4.2.2.1 Computing P, R, and F1
5.4.2.2.2 Comparison with the state of the art
5.4.2.2.3 Running PEXACC on real-world data
5.5 Parallel phrase mining from comparable corpora
5.5.1 Parallel phrase mining with SVM
5.5.1.1 Phrase pair generation
5.5.1.1.1 Training example extraction
5.5.1.1.2 Test instance generation
5.5.1.2 SVM classifier
5.5.1.2.1 Cognate-based methods for translation purposes
5.5.1.3 Experiments
5.5.1.3.1 Data sources
5.5.1.3.2 Phrase extraction for classifier training and testing
5.5.1.3.3 Phrase extraction from comparable corpora
5.5.1.3.4 Results
5.5.2 Parallel phrase mining with PEXACC
6 Training, enhancing, evaluating and using MT systems with comparable data
6.1 Introduction
6.2 Enriching general domain SMT systems with data from comparable corpora
6.2.1 Data used for experiments
6.2.2 Methodology
6.2.3 Experiments with data extracted from comparable corpora
6.2.4 Staggered experiments
6.3 Human evaluation of MT output
6.3.1 Evaluation methodology and the interface
6.3.2 Experiment set-up
6.3.3 Human evaluation results
6.4 MT adaptation for under-resourced domains
6.4.1 Initial extraction and alignment of terms and named entities
6.4.2 Comparable corpora collection
6.4.3 Extraction of term pairs from comparable corpus
6.4.4 Baseline system training
6.4.5 SMT system adaptation
6.5 MT adaptation to a narrow domain in case of resource-rich languages
6.5.1 Evaluation objects: Narrow-domain-tuned MT systems
6.5.2 Evaluation data
6.5.3 Evaluation methodology
6.5.5 Evaluation tools
6.5.6 Evaluation results
6.5.7 Conclusion
6.6 Application of Machine Translation in web authoring
6.6.1 The role of translation and MT in web authoring
6.6.2 Characteristics and requirements for translation in web authoring
6.6.3 MT systems enhanced with comparable corpora in web authoring - a use case
6.6.4 Conclusion
6.7 Systems for computer aided translation
6.7.1 Collecting and processing a comparable corpus
6.7.2 Building SMT systems
6.7.3 Automatic and comparative evaluation
6.7.4 Evaluation in localisation task
6.7.5 Discussion
7 New areas of application of comparable corpora
7.1 Automatic dictionary expansion using a Seed lexicon and non-parallel corpora
7.1.1 Motivation: Improving algorithms and boosting performance via cross-language transitivity
7.1.2 Approach
7.1.3 Language resources
7.1.4 Results
7.1.5 Discussion
7.2 Identifying word translations from comparable documents without a Seed lexicon
7.2.1 Motivation
7.2.2 Approach
7.2.2.1 Pre-processing steps
7.2.2.2 Alignment steps
7.2.2.3 Vocabularies
7.2.3 Evaluation setup
7.2.4 Results and evaluation
7.2.4.1 Comparison with other work
7.2.4.2 Application to other languages
7.2.5 Discussion
7.3 Chinese-Japanese parallel sentence extraction from quasi-comparable and comparable corpora
7.3.1 Motivation
7.3.2 Parallel sentence extraction system
7.3.3 Binary classification of parallel sentence identification
7.3.3.1 Training and testing
7.3.3.2 Features
7.3.4 Experiments
7.3.4.1 Data
7.3.4.2 Classification experiments
7.3.4.3 Extraction and translation experiments on quasi-comparable corpora
7.3.4.4 Extraction experiments on comparable corpora
7.3.5 Related work
7.3.6 Conclusion and future work
8 Appendices
8.1 Introduction
8.2 Tools for building a comparable corpus from the web
8.2.1 A workflow based corpora crawler
8.2.2 Focussed Monolingual Crawler (FMC)
8.2.3 Wikipedia retrieval tool
8.2.4 News information downloader using RSS feeds
8.2.5 News text crawler and RSS feed gatherer
8.2.6 News article alignment and downloading tool
8.3 Parallel data mining workflow
8.3.1 Tools to identify comparable documents and to extract parallel sentences and/or phrases from them
8.3.1.1 ComMetric: A toolkit for measuring comparability of comparable documents
8.3.1.2 DictMetric: A toolkit for measuring comparability of comparable documents
8.3.1.3 Features extractor and document pair classifier
8.3.1.4 EMACC: A textual unit aligner for comparable corpora using Expectation-Maximisation
8.3.2 Tools to extract parallel sentences and/or phrases from comparable documents
8.3.2.1 PEXACC: A parallel phrase extractor from comparable corpora
8.3.2.2 LEXACC: Fast parallel sentence mining from comparable corpora
8.4 The workflow for named entity and terminology extraction and mapping
8.4.1 Tools for named entity recognition
8.4.1.1 TildeNER
8.4.1.2 OpenNLP wrapper
8.4.1.3 NERA1: Named entity recognition for English and Romanian
8.4.2 Tools for terminology extraction
8.4.2.1 CollTerm - A tool for term extraction
8.4.2.2 Tilde's wrapper system for CollTerm
8.4.2.3 KEA wrapper
8.4.2.4 Terminology extraction for English and Romanian
8.4.3 Tools for named entity and terminology mapping
8.4.3.1 Multi-lingual named entity and terminology mapper
8.4.3.2 NERA2: Language-independent named entity mapper
8.4.3.3 A language-independent terminology aligner
8.4.3.4 P2G: A tool to extract term candidates from aligned phrases
8.5 Sisyphos-II: MT-Evaluation tools
8.6 Conclusions and related information

Author Information

Prof. Inguna Skadina has been working on language technologies for over 25 years. Her research interests are in machine translation, human-computer interaction, and language resources and tools for under-resourced languages. She has coordinated and participated in many national and international projects related to human language technologies, and has authored or co-authored more than 60 peer-reviewed research papers.

Bogdan Babych is an Associate Professor of Translation Studies at the University of Leeds, UK. He holds a PhD in machine translation and in Ukrainian linguistics. Dr. Babych was a coordinator of the EU FP7 Marie Curie project HyghTra, and received a Leverhulme Early Career Fellowship for his project Translation Strategies in Comparable Corpora. He previously worked as a computational linguist at L&H Speech Products, Belgium.

Robert Gaizauskas is a Professor of Computer Science and head of the Natural Language Processing group, Department of Computer Science, University of Sheffield, UK. His research interests are in computational semantics, information extraction, text summarization and machine translation. He holds a DPhil from the University of Sussex, UK (1992), and has published more than 150 papers in peer-reviewed journals and conference proceedings.

Nikola Ljubesic is an Assistant Professor at the Department of Information Science, University of Zagreb, Croatia, and a researcher at the Jozef Stefan Institute in Ljubljana, Slovenia. His main research interests are in language technologies for South Slavic languages, linguistic processing of non-standard texts, author profiling and social media analytics.

Prof. Dan Tufis, director of RACAI and full member of the Romanian Academy, has been active in computational and corpus linguistics for more than 30 years. His expertise is in tagging, word alignment, multilingual WSD, SMT, QA in open domains, lexical ontologies, language resource annotation and encoding. He has authored or co-authored more than 250 peer-reviewed papers, book chapters and books.

Andrejs Vasiljevs is a co-founder and chairman of the board of Tilde, a leading European language technology and localization company. His expertise is in terminology management, machine translation and human-computer interaction. He initiated and coordinated the ACCURAT project as well as several other international research and innovation projects. He holds a PhD in computer sciences from the University of Latvia and a Dr.h. from the Latvian Academy of Sciences.

Countries Available

All regions