Submissions/Reciprocal Enrichment between Wikipedia and Machine Translators


Information

This is an open submission for Wikimania 2010.


Title of the submission
Reciprocal Enrichment between Wikipedia and Machine Translators
Type of submission (workshop, tutorial, panel, presentation)
Presentation
Author of the submission
Mikel Iturbe
Unai Fdz. de Betoño
Galder Gonzalez
Arkaitz Zubiaga
Iñaki Alegria
Gorka Labaka
Kepa Sarasola
E-mail address or username (if username, please confirm email address in Special:Preferences)
Janfri
Theklan
Unai Fdz. de Betoño
TXiKi
Country of origin
Spain
Affiliation, if any (organization, company etc.)
University of the Basque Country
Personal homepage or blog
Abstract

One of Wikipedia's strengths is the large number of languages in which it is available. Nonetheless, the rapid growth of the English Wikipedia is leaving most other languages behind, especially minority languages, which have far fewer contributors. Partially automated processes that rely on tasks such as machine translation therefore offer a new way to ease article creation in many languages. At present, however, the output of existing machine translation systems is riddled with inaccuracies. It is more useful for understanding the meaning of the source text than for obtaining a correct translation, because the subsequent post-editing requires considerable human effort.

In this context, we present 'OpenMT-2: Hybrid Machine Translation and advanced evaluation', a project in which Basque Wikipedia contributors collaborate with the University of the Basque Country (EHU) and the Polytechnic University of Catalonia (UPC), funded by the Spanish Ministry of Science and Innovation (TIN2009-14675-C03-01). Within this project, a set of 100 long articles from the Spanish Wikipedia will be selected and then translated into Basque using Matxin-Opentrad, an open source rule-based machine translation system. The authors have presented their Spanish-to-Basque translation approaches in previous work[1].

The automatically translated articles will contain many errors. A group of users from Basque Wikipedia will therefore review them and correct the errors they find; this process is known as post-editing. The changes these users make will be logged, and the corrected articles will be added to Basque Wikipedia.
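As an illustration of what logging post-editing changes could look like, the following minimal sketch compares a machine-translated sentence with its post-edited version word by word and records each difference. It is not the project's actual logging tool: the function name `log_post_edits`, the whitespace tokenization, and the record fields are all assumptions made for this example, and it relies only on Python's standard `difflib` module.

```python
import difflib


def log_post_edits(mt_output: str, post_edited: str) -> list[dict]:
    """Return the word-level changes that turn mt_output into post_edited.

    Hypothetical helper for illustration only; tokens are split on
    whitespace, which is a simplifying assumption.
    """
    mt_tokens = mt_output.split()
    pe_tokens = post_edited.split()
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)

    log = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged stretches are not logged
        log.append(
            {
                "op": op,  # "replace", "delete" or "insert"
                "mt": " ".join(mt_tokens[i1:i2]),
                "post_edited": " ".join(pe_tokens[j1:j2]),
            }
        )
    return log


if __name__ == "__main__":
    # Placeholder strings, not real translation data.
    for change in log_post_edits("a b c", "a x c"):
        print(change)
```

Records of this kind, collected over many articles, are one possible input for the analysis of post-editing logs described in the next paragraph.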

Researchers from the aforementioned universities will analyze the resulting post-editing logs. With these logs they can improve their machine translation process, either by manually refining the individual modules of their MT system or by implementing an automated statistical post-editing process[2], which is also expected to improve translation accuracy for the Spanish-Basque language pair[3].

At the moment, the researchers are examining several alternatives for building a human post-editing interface suited to translating Wikipedia content by adapting existing free and open source software: (1) OmegaT, a free translation memory application, appears suitable for the task; (2) the World Wide Lexicon (WWL) Translator is a Firefox add-on that makes browsing foreign-language sites easy and automatic: when a URL is opened, it detects the page's language and translates it using human and machine translations, and it lets users view and create translations for any website. However, its post-editing interface does not yet work reliably; (3) the Google Translator Toolkit provides specific support for translating Wikipedia content, but it is not free and open source software.

At Wikimania 2010, we would like to present the details of the OpenMT-2 project and show the benefits of collaboration between Wikipedia and universities, with the aim of increasing the resources available for information processing and generation.

Slides

References

  1. Alegria I., Arregi X., Díaz de Ilarraza A., Labaka G., Lersundi M., Mayor A., Sarasola K. 2008. Strategies for sustainable MT for Basque: incremental design, reusability, standardization and open-source. Proceedings of the IJCNLP-08, pp. 235-243. Hyderabad, India.
  2. Simard M., Ueffing N., Isabelle P., Kuhn R. 2007. Rule-based translation with statistical phrase-based post-editing. Proceedings of the Second Workshop on Statistical Machine Translation, pp. 203-206. Prague, Czech Republic.
  3. Díaz de Ilarraza A., Labaka G., Sarasola K. 2008. Statistical Post-Editing: A Valuable Method in Domain Adaptation of RBMT Systems. MATMT2008 workshop: Mixing Approaches to Machine Translation, pp. 35-40.
Track (People and Community/Knowledge and Collaboration/Infrastructure)
Knowledge and Collaboration
Will you attend Wikimania if your submission is not accepted?
Probably not
Slides or further information (optional)


Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).

  1. Jodi.a.schneider 18:38, 13 May 2010 (UTC)
  2. Azor1
  3. bevcorwin
  4. --Ravidreams 06:39, 20 May 2010 (UTC)
  5. --Gomà 19:40, 20 May 2010 (UTC)
  6. Siebrand 10:42, 30 May 2010 (UTC)
  7. Kocio 22:16, 2 June 2010 (UTC)
  8. Mn6230 14:16, 4 June 2010 (UTC)
  9. Waldir 07:54, 16 June 2010 (UTC)
  10. Amir E. Aharoni 06:37, 22 June 2010 (UTC)