Submissions/Interlanguage links in Wikipedia: current problems and future development/draft

Draft text written by Incnis Mrsi /'ɪnknis mrsi/ with the help of several ru.WP users at a small wikimeetup (Qkowlew, Claymore, LEMeZza, Kalan); it will be presented by Amir E. Aharoni. Includes ideas from the m:Fine interwiki and m:A newer look at the interlanguage link essays. The running paragraphs that follow each slide are the intended spoken text.

 


Wikimania 2010
Interlanguage links in Wikipedia:
current problems and future development
Special thanks to
VasilievVV and Drbug for the struggle for Schengen visas.



slide 0.2
(placeholder for image)
He is listening to us, but...
Incnis Mrsi ►

Do not enter!
The video broadcast imposes a delay of about 30 seconds. If you want your questions to reach Incnis Mrsi more quickly, please use the IRC channel #wikipedia-interwiki on the Freenode network.

(Explain the situation in your own words, as it looks from there.)


slide 0.3
What shall we talk about?
  • Concerns with the current system.
  • Possible improvements: entropy reduction and structural extension. Some implementations require changes in the engine, and some require (almost) none.
  • A time for standards.
  • Coordination issues, human and technical.
Warning: these hyperlinks do not work in my slides.

(tell something about it in ≈ 30 s)


Why these links

slide 1.1

Technically, any interwiki link is just a specially formatted hyperlink to another wiki project. From here on we discuss interlanguage links, which connect similar items (most notably, articles) in different languages. "Interwiki" is the more general term, but in this presentation it always means interlanguage links.

Interlanguage links are metadata.
These links are metadata of all Wikipedia, not of each language wiki separately.

We consider interlanguage links to be metadata, like categories, but in some sense even more so. Categories belong to a particular wiki project; interwikis do not. Inbound interwikis evidently do not belong to a wiki, but in Wikipedia even outbound interlanguage links should obey some constraints because they are processed automatically.

We shall talk about the Wikipedia experience, although it is applicable to other projects. These wikis form a system in which interlanguage links are the second most important metadata structure after categories. #####

There are interests of readers and interests of editors.

End users and editors use interlanguage links in slightly different ways. Readers want more informative links. Editors want more regularity and a more predictable system.


What do we have now

slide 2.1
Concerns about interwiki
  1. Poor traceability
  2. Lack of editors’ attention
    This includes, but is not limited to, omitting interwikis in newly created articles, mistaken links (human and bot errors), and moving a page and then tampering with the page left under the original name.
  3. Technical and organizational flaws
    This includes poor algorithms in bots, lack of management, etc.
  4. Conceptual flaws.

With several million articles, the interwiki structure faces many challenges. The traditional approach is to build clusters of articles in different languages, that is, to establish an equivalence relation between articles. But the current technical implementation is based on links, each from one page in one language to another page in another language. There is no easy way to check #####, and there are many cases where the current approach misbehaves.
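The gap between "clusters" and "links" can be illustrated with a minimal sketch. It assumes that the outbound links of every page have already been collected into one dictionary (page names, data and function names are invented for illustration); it groups pages into connected components and reports components that contain two pages in the same language, which cannot form an equivalence class.

```python
from collections import defaultdict

# Hypothetical sample data: pages are written "lang:Title",
# values are the outbound interlanguage links of that page.
links = {
    "en:Apple": {"de:Apfel", "fr:Pomme"},
    "de:Apfel": {"en:Apple"},
    "fr:Pomme": {"en:Apple", "de:Apfelbaum"},
    "de:Apfelbaum": set(),
}

def clusters(links):
    """Group pages into connected components, ignoring link direction."""
    neighbours = defaultdict(set)
    for src, targets in links.items():
        neighbours[src]  # make sure pages without links still appear
        for dst in targets:
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(neighbours[page] - component)
        seen |= component
        result.append(component)
    return result

def language_conflicts(component):
    """Return the languages that occur more than once in one cluster."""
    by_lang = defaultdict(list)
    for page in component:
        by_lang[page.split(":", 1)[0]].append(page)
    return {lang: sorted(pages) for lang, pages in by_lang.items() if len(pages) > 1}
```

In the sample data a single stray link from fr:Pomme pulls de:Apfelbaum into the same cluster as de:Apfel. Note that the check needs the links of all languages at once, which is exactly what is expensive today.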


slide 2.2
There are automated tools to change interlanguage links across languages.
 

interwiki.py is a component of Pywikipediabot.

It follows redirects indiscriminately.


  
 

Because of the enormous volume of this task, bots make most changes to interlanguage links. The most widely used platform for bots is Pywikipedia. (Say something good about bots, if desired.) Unfortunately, in some situations the bot itself makes mistakes. For example, Pywikipedia replaces every redirect with its target, but it does not check why the redirect exists. This is correct if an article was moved, but almost always wrong in other circumstances.
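A more careful rule could look like the sketch below. This is not Pywikipedia code: the Page record, the created_by_move flag and resolve_target are hypothetical, and the sketch simply assumes that a bot can find out whether a redirect was left behind by a page move.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    title: str
    redirect_to: Optional[str] = None  # target title if this page is a redirect
    created_by_move: bool = False      # hypothetical flag: redirect left behind by a move

def resolve_target(pages, title):
    """Return the title an interlanguage link should point to.

    A redirect is followed only if it was left behind by a page move.
    Any other redirect is kept as the target and left for human review.
    """
    seen = set()
    while title in pages and title not in seen:
        seen.add(title)  # guards against redirect loops
        page = pages[title]
        if page.redirect_to is None or not page.created_by_move:
            return title
        title = page.redirect_to
    return title

# Invented example pages.
pages = {
    "Old name": Page("Old name", redirect_to="New name", created_by_move=True),
    "New name": Page("New name"),
    "Subtopic": Page("Subtopic", redirect_to="Broad article"),
    "Broad article": Page("Broad article"),
}
```

Here a link to "Old name" would be updated to "New name", while a link to "Subtopic" would stay untouched, because that redirect expresses an editorial decision rather than a rename.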


Possible solutions

slide 3.1

Generally, there are two ways to improve: reducing the entropy and extending the structure. The idea of the former is that a simpler system is less likely to fail. The idea of the latter is to make the system do more than was originally intended.

Ways to improve the system
(table with two axes: structural extension and entropy reducing)

Entropy reducing

slide 3.2
   
A diagram by HenkvD for N=7
illustrates the main idea of this approach
 
 
 
N·(N−1) links.
N edits to add a new language and N−1 edits to remove one.
  Two edits per operation: add/change/remove link in the article and one change in the central database.
Queries like http://toolserver.org/~vvv/sulutil.php?user=Amire80 run slowly because they touch hundreds of databases. Current and possible future interwiki tools suffer from the same problem. A central database could make some operations simpler.

The entropy-reducing approach restricts the possible configurations to clusters that contain no more than one article in any language. This gives a unique (injective) correspondence between articles in any pair of languages. It usually assumes some central interwiki database, as in the 2008 proposal at Meta-wiki, which is now being discussed at the strategy wiki.

Another advantage is technical: querying a centralized database produces less overhead. We know that tools such as the Single Unified Login utility are extremely slow because they make about a thousand SQL queries to different wiki databases.
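The cluster restriction can be sketched as one record per topic in a central store. The class below is only an illustration under that assumption; the Cluster name, its methods and the sample key are hypothetical and do not describe any existing proposal in detail.

```python
class Cluster:
    """One record in a hypothetical central database: at most one article per language."""

    def __init__(self, key):
        self.key = key
        self.articles = {}  # language code -> article title

    def add(self, lang, title):
        """Attach an article; refuse a second, different article in the same language."""
        if lang in self.articles and self.articles[lang] != title:
            raise ValueError(
                f"{lang} already has {self.articles[lang]!r} in cluster {self.key!r}"
            )
        self.articles[lang] = title

    def links_from(self, lang):
        """Interlanguage links to show on the article in `lang`, derived from the record."""
        return {l: t for l, t in self.articles.items() if l != lang}

    def derived_link_count(self):
        """Number of page-to-page links that the N stored entries replace: N*(N-1)."""
        n = len(self.articles)
        return n * (n - 1)
```

With seven languages the record holds 7 entries and replaces 7·6 = 42 page-to-page links, and adding a language is one change to the record instead of an edit on every existing page.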


Semantic keys

slide 3.3

§aÇ6íkZ.[¨OĽB”ę$ or type: human readable ID?

We should persuade developers not to use random or sequential IDs, except internally in the central database itself.

Will be without
a proper structure

  ~  
  ~ §

Must be:

  ~     CAS: nnnnn
  ~     fouentium barium
  ~  
  ~     ? born: 1973/12/06...
  ~     IMDB: nnnnn

According to Incnis Mrsi, this is the way we should go for some types of articles.

What keys should we use: just sequential numbers or random data? Could these keys be semantically significant for some types of articles? Persons have birth dates, organizations are registered under official names, films and books have at least a year of release, and astronomical and geographical objects can in some cases be identified by approximate coordinates. We should also use existing (external) keys as much as possible, such as Latin names for species, the IMDB index for films, and CAS registry numbers for substances. We should obtain not just a piece of syntactic sugar for interwiki, but a semantically useful system. If carefully designed, it will deter some mistakes, such as inaccurate translation or disambiguation. Of course, programmers prefer numbers because they are simpler and faster. But this is not a case where everything should be cut away with Occam's razor.
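As a minimal sketch of what a typed, human-readable key could look like, the code below parses keys of the form "type: value", as on the slide. The list of key types and the names SemanticKey and parse_key are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical set of key types, loosely based on the examples in the text.
KNOWN_TYPES = {"CAS", "IMDB", "species", "born", "coord", "country"}

@dataclass(frozen=True)
class SemanticKey:
    type: str
    value: str

    def __str__(self):
        return f"{self.type}: {self.value}"

def parse_key(text):
    """Parse 'type: value' into a SemanticKey, rejecting untyped or unknown keys."""
    type_, sep, value = text.partition(":")
    type_, value = type_.strip(), value.strip()
    if not sep or not value:
        raise ValueError(f"not a typed key: {text!r}")
    if type_ not in KNOWN_TYPES:
        raise ValueError(f"unknown key type: {type_!r}")
    return SemanticKey(type_, value)
```

A typed key lets the system apply checks specific to the type, for example comparing a birth date with the article text, which an opaque number cannot offer.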


slide 3.4
Yes, instead of always using the name in the first language, this is another possibility: use the "most basic" name if possible. Plants and animals could go by the Latin name, chemicals by their chemical formula or CAS number, astronomical objects by their catalogue number, people and places by their name in their language (though the last two may be a point of contention).
Nikola Smolenski, 22 June 2008 [1]

(tell something about it)

 
?
Identification (record keys) at the centre

Many systems are possible even for the same type of articles. Choose one? Translate between several?

ישראל, country: IL or 29°25’–33°20’ N × 34°30’–35°40’ E:

Many systems are possible even for the same object type. How should we identify countries, for example the state of Israel: by its Hebrew name, or as the territory between 29½° and 33½° northern latitude and between 34½° and 36° eastern longitude? Or maybe the international country code would be the best solution?
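If several identification systems are kept side by side, the centre has to translate between them. The sketch below assumes one possible design, in which every external key maps to an internal record number that is never shown to users; CentralIndex and its methods are hypothetical names.

```python
class CentralIndex:
    """Maps keys from several identification systems to one internal record (hypothetical)."""

    def __init__(self):
        self._record_of = {}  # (system, value) -> internal record number
        self._next_id = 1

    def register(self, *keys):
        """Register (system, value) keys that identify the same object; return its record."""
        existing = {self._record_of[k] for k in keys if k in self._record_of}
        if len(existing) > 1:
            raise ValueError(f"keys refer to different records: {sorted(existing)}")
        if existing:
            record = existing.pop()
        else:
            record = self._next_id
            self._next_id += 1
        for k in keys:
            self._record_of[k] = record
        return record

    def translate(self, system, value, target_system):
        """Return the key of the same record in `target_system`, or None if unknown."""
        record = self._record_of.get((system, value))
        if record is None:
            return None
        for (s, v), r in self._record_of.items():
            if r == record and s == target_system:
                return v
        return None

index = CentralIndex()
index.register(("country", "IL"), ("name:he", "ישראל"))
```

With this data, looking up the Hebrew name and asking for the "country" system yields IL, so editors can use whichever system they find most natural.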


Extended structure

 
slide 3.5

But there are many cases where an equivalence-like interwiki structure on articles is not possible. This may be because some required articles do not exist yet, or sometimes it seems to be impossible in principle because the languages are semantically incompatible. In such cases we should not reduce the complexity.

Causes

Sometimes an article that refers to a specific concept may correspond to two or more articles in another language. It is not just a problem of translation but of differences in culture, habits, laws, and implementations. In such cases it is necessary to disambiguate when moving from one Wikipedia to another. A central hub would help, working as a disambiguation page.

(tell something about it)

Of course, this is not only a matter of translation.

Non-classical links

slide 3.6
What | Currently | Extended
Article section (#) as interwiki source | Technically impossible | {{Section-links}}, experimental
Article section (#) as interwiki destination | Possible, but some bots delete it | Should be avoided in favor of redirects
Redirect as interwiki source | Never | Disputable
Redirect as interwiki target | Considered overhead and followed by bots | RwP should be a legitimate interwiki target

There are some possible structural extensions that require no changes in MediaWiki, or almost none. Two types of MediaWiki objects, redirects and sections, are currently underestimated in the context of interlanguage linking.


slide 3.7
Are redirects aliases?
 
This is a translation. Such a redirect page should not serve as an interwiki target (arrival point).

  
 
  versus  
 
  They see no difference


Redirect pages (or, colloquially, just redirects) are often regarded as aliases for page names. For example, the connectivity project does not see any difference between a direct link and a link through a redirect.



slide 3.8

It may be simple:    
... or ...
... points to a section:    
A redirect to section linked from another language. {{section-links}} should be placed in the section.

How to design {{section-links}} visually?

A question for usability experts.


Conclusion

Go back to slide 3.1 and move the pointer to the upper right corner of the table. Some structural extensions are indeed compatible with entropy reduction. For example, we should have links from the central hub to redirects, and links from redirects and sections to the central hub.



Standards and technical questions

A time for standards
slide 4.1
                                                     
Optimal
 
Acceptable
 
Errors
banned by the engine?

It is time to establish some standards for interlanguage links in Wikipedia. First of all, which configurations may be considered acceptable? What is better than acceptable? Which structures should be considered incorrect, and how should they be repaired?

#####

Some configurations

slide 4.2
 
Should it link back?

If a non-section redirect with possibilities is linked from another language, should it have a corresponding interlanguage link itself? In the case of a redirect to a section, there should be a link from the section to give end users access, and a link from the redirect page would obviously be redundant.


slide 4.3
Should we ban "hooks"?
(placeholder for diagram)
If the terminal article "includes" the initial one, then it is OK. The links lead from the specific to the general.

But it is not OK otherwise, and we should fix it.

These are called "triangles" in the presentation "Analyzing Interlanguage links" at Wikimania 2008.

A "hook" is a situation in which a path of interwiki links returns to the same language, but at a different page. #####
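Hooks can be found mechanically once the links are available as a graph. The sketch below is an illustration only: the page names are invented, and the `includes` set of (general, specific) pairs is a hypothetical, manually maintained list, since nothing in the current system records that one article covers another.

```python
from collections import deque

def lang_of(page):
    """Language prefix of a page written as 'lang:Title'."""
    return page.split(":", 1)[0]

def find_hooks(links, start):
    """Pages in the language of `start`, other than `start`, reachable via forward links."""
    seen, queue, hooks = {start}, deque([start]), set()
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target in seen:
                continue
            seen.add(target)
            queue.append(target)
            if lang_of(target) == lang_of(start):
                hooks.add(target)
    return hooks

def classify_hooks(links, includes, start):
    """Split hooks into acceptable ones (terminal article includes the initial one) and the rest."""
    ok, to_fix = set(), set()
    for end in find_hooks(links, start):
        (ok if (end, start) in includes else to_fix).add(end)
    return ok, to_fix

# Invented example data.
links = {
    "en:Oak leaf": {"de:Eiche"},
    "de:Eiche": {"en:Oak"},
    "en:Mercury (planet)": {"de:Merkur"},
    "de:Merkur": {"en:Mercury (element)"},
}
includes = {("en:Oak", "en:Oak leaf")}
```

The first hook leads from the specific to the general and would be accepted; the second one would be reported for repair.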



slide 4.4
No interwiki links, or more than one interwiki link
×
Are there conditions under which an article (not a disambiguation page) should have no interlanguage links at all?

Some users think that there are.

Are there conditions under which a page should have more than one outbound link to the same language, not counting {{section-links}}?

Some users think that this is an error.
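Whatever the answers are, both conditions are easy to detect for a single page. A minimal sketch, assuming the outbound links of the page are given as a list of "lang:Title" strings (the function name is hypothetical):

```python
from collections import Counter

def audit_outbound(page_links):
    """Return (has_no_links, languages_linked_more_than_once) for one page."""
    counts = Counter(target.split(":", 1)[0] for target in page_links)
    duplicated = sorted(lang for lang, n in counts.items() if n > 1)
    return (not counts, duplicated)
```

For a page linking to two German articles and one French article, the function reports "de" as duplicated. Whether that is an error is the policy question raised on the slide, not something the code decides.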



Formal model

 
slide 4.5
Do we need a formal model?

Should we restrict ourselves to graphs made of the elements mentioned above, or can we check whether possibly more complicated graphs satisfy some conditions?

For example: any page in the same language that is reachable by following interlanguage links forward must be, in some sense, a top page for the origin. This is a weakened DAG condition.

We can consider the structure of a directed graph whose edges are interlanguage links, redirects, and inclusions (a section "links" to its article). It defines a binary relation (reachability), which is transitive but not symmetric. This directed graph carries more information than an equivalence relation.
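A minimal sketch of such a model follows. Edges are (source, target, kind) triples, where the kind labels and the page names are invented for illustration, and reachability is computed by a plain graph search.

```python
def reachable(edges, start):
    """All nodes reachable from `start` along directed edges of any kind."""
    out = {}
    for src, dst, _kind in edges:
        out.setdefault(src, set()).add(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in out.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Invented example: a redirect to a section, the section's inclusion in its
# article, and ordinary interlanguage links.
edges = [
    ("de:Eichenblatt", "en:Oak leaf", "interlanguage"),
    ("en:Oak leaf", "en:Oak#Leaves", "redirect"),
    ("en:Oak#Leaves", "en:Oak", "inclusion"),
    ("en:Oak", "de:Eiche", "interlanguage"),
    ("de:Eiche", "en:Oak", "interlanguage"),
]
```

In this example de:Eiche is reachable from de:Eichenblatt, but not the other way round, which is the information an equivalence relation would throw away.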


Improved traceability

slide 4.6
What should bots report in edit summaries?

Should bots report the cause of every interwiki change? How can we trace the edit that produced a given condition?



Automatic error deterrence

 
slide 4.7

Can we, and should we, have the engine forbid some incorrect (or dangerous) configurations?


Coordination

slide 5.1
Tasks for the interwiki community
  1. Develop a standard for the links themselves.
  2. Separate tasks that can be done purely automatically from those that imply some responsibility of a user or bot owner.
  3. Educate editors about the complexity of this problem, and that some changes cannot be easily undone.
  4. Support a chat (an IRC channel; we propose #wikipedia-interwiki).

In conclusion, let us speak about tasks for the interwiki community. How can we bring interested editors together to discuss these problems? How can we establish standards accepted by all language Wikipedias? How can we teach local users (most notably, patrollers and reviewers) that interlanguage links are a complex matter? They must not insert garbage data into outbound links or blindly revert any change they do not understand, because with the current system this may lead to grave consequences.


Thank you
for your attention!


 

Wikimedia Commons provided many icons and diagrams which I did not draw myself.

 

Mozilla Firefox was used as a presentation environment.