Titles and sources

This part analyses how much the code stored under the same title differs.

That is about 10% of the total amount, which is quite a lot, judging by the numbers themselves.

Can we take a superficial look at the code differences between functions with the same name?


So there are only 21 unique values, huh?

Feeling so much pain because of the same code...

But still, there are a few points to make:

  1. require is used to call one module from another. (This is probably recorded in templatelinks.)
  2. The 'Copied from the English Wikipedia:' comment also refers to already existing code. Is it recorded in iwlinks?
  3. Even where the code is not literally the same (this is a simple module in terms of implementation logic), it has the same structure: look at the 'or' statements, which can sit all in one row or be spread over multiple rows.
  4. Most of the variants are modifications of other modules, with localized answer variants added.

Points 3 and 4 mean that the common copy detection algorithms used in academia should work fine here. Many of them (at least the ones I have heard of) use tree-based detection, and the parse tree should come out the same for the layout differences mentioned in point 3. They also handle included code, which covers point 4.
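To make point 3 concrete, here is a minimal sketch of a much cruder stand-in for tree-based detection: it only strips single-line comments and collapses whitespace, so that an 'or' chain written in one row and the same chain spread over several rows compare equal. The function name `normalize_lua` and both Lua snippets are made up for illustration; this is not a real parser and not one of the academic algorithms mentioned above.

```python
import re


def normalize_lua(source: str) -> str:
    """Collapse layout differences so that only the token order is compared.

    Assumption: only single-line '--' comments are stripped. Block comments
    and '--' inside string literals are NOT handled by this sketch.
    """
    # Drop everything from '--' to the end of the line.
    without_comments = re.sub(r"--[^\n]*", "", source)
    # Collapse every whitespace run, including line breaks, into one space.
    return " ".join(without_comments.split())


# Hypothetical snippets: the same 'or' chain in one row and in several rows.
one_row = "return v == 'yes' or v == 'y' or v == 'true'"
multi_row = """return v == 'yes'  -- first accepted answer
    or v == 'y'
    or v == 'true'"""

same_structure = normalize_lua(one_row) == normalize_lua(multi_row)
print(same_structure)
```

A real tree-based comparison would go further and also ignore things like renamed local variables, which this sketch still treats as a difference.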


It took me some time, but I realized that require is used to fix the problem of some people writing 'yesno' in full lowercase... Still, it would be good to see how exactly all this transclusion works and whether this require shows up in the database tables.

That's confusing - the module is not shown as transcluded and is not mentioned in any other kind of link either. But it still has to work, so it needs to be recorded somewhere - but where exactly?

'copied from'

So this module is really linked to enwiki, which is cool.


Let's try to fetch and look at the templatelinks for Module:Category_handler from enwiki, as its code has a few require(module-name) instances.

Category handler is transcluded to Category handler (row 10)? What does that even mean?

At least all the modules named in the require calls here are listed in templatelinks.
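A check like this one can be sketched roughly as follows: pull the quoted names out of the require calls with a regular expression and compare them with the titles that came back from templatelinks. The helper names, the sample Lua source and the `linked` set are all assumptions for illustration; the sketch also assumes the linked titles are available as full 'Module:...' strings, which may not match how the table actually stores them.

```python
import re

# Matches require('Module:Name') and require("Module:Name"); the parentheses
# are optional because Lua also allows require 'Module:Name'.
REQUIRE_RE = re.compile(r"""require\s*\(?\s*(['"])(.+?)\1""")


def normalize_title(title: str) -> str:
    # Underscores and spaces are interchangeable in MediaWiki titles.
    return title.replace("_", " ").strip()


def required_modules(source: str) -> set[str]:
    """Return the normalized titles named in literal require calls."""
    return {normalize_title(m.group(2)) for m in REQUIRE_RE.finditer(source)}


def missing_from_links(source: str, linked_titles: set[str]) -> set[str]:
    """Required modules that do not appear among the linked titles."""
    linked = {normalize_title(t) for t in linked_titles}
    return required_modules(source) - linked


# Hypothetical module source and a hypothetical set of linked titles.
sample_source = """
local yesno = require('Module:Yesno')
local getArgs = require("Module:Arguments").getArgs
"""
linked = {"Module:Yesno"}

print(missing_from_links(sample_source, linked))  # {'Module:Arguments'}
```

Only literal strings are caught here. A name that is built at runtime, or data loaded in some other way than require, would need separate handling.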


ltc-pron is a pronunciation module, connected to Middle Chinese pronunciation. It would most likely return different info every time, as it is likely to translate different strings every time.

So, here's a theoretical idea - can we just omit the pron modules, as they deal only with pronunciation? (IPA modules also seem to be connected, based on this link)

We can probably broaden this guess - if there's a module whose name is used a lot, we can look at the ratio of unique sourcecodes to all entries. If this number is small, the module is probably reused a lot; a big number is more common for modules like this one.

Unique sourcecode ratio

Let's define the sourcecode uniqueness coefficient (uniq) as the ratio of the number of unique sourcecodes to the total number of functions under this title. When it tends to zero, it means there's one original script variant, and all the others are its copies.
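As a minimal sketch of this definition, assuming the data is available as (title, sourcecode) pairs with one pair per wiki entry: the function name `uniqueness_ratio` and the toy rows below are made up for illustration and are not taken from the real dataset.

```python
from collections import defaultdict


def uniqueness_ratio(rows):
    """Map each title to (number of unique sourcecodes) / (number of entries).

    rows: iterable of (title, sourcecode) pairs, one pair per wiki entry.
    """
    sources_by_title = defaultdict(list)
    for title, source in rows:
        sources_by_title[title].append(source)
    return {
        title: len(set(sources)) / len(sources)
        for title, sources in sources_by_title.items()
    }


# Toy data with hypothetical titles: 'Module:Copied' has 4 entries but only
# 2 distinct sources, 'Module:Varied' has 2 entries that both differ.
toy_rows = [
    ("Module:Copied", "return 1"),
    ("Module:Copied", "return 1"),
    ("Module:Copied", "return 1"),
    ("Module:Copied", "return 2"),
    ("Module:Varied", "return 'a'"),
    ("Module:Varied", "return 'b'"),
]

uniq = uniqueness_ratio(toy_rows)
print(uniq)  # {'Module:Copied': 0.5, 'Module:Varied': 1.0}
```

Note that uniq never reaches zero exactly: for a title with n entries the smallest possible value is 1/n, so a title with a single entry always gets 1.0, and small groups are worth treating separately.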

Fun fact - the TNT function, whose description says 'Please do NOT rename this module - it has to be identical on all wikis', has 3 variants in total, so there are at least 2 wikis where this module is different from the one on MediaWiki.

Let's remove the excessively big amounts to take a closer look at the main part of the plot.

The distribution is closer to uniform than to anything else I know.

Let's explore low-unique functions a bit more.

Most of these modules really look like language-related ones. As for the ones that are clearly not linguistic (for example, Module:headword or Module:Test), I was either unable to find them on enwiki or they had been deleted there, so their content and 'theme' can't be analysed.