Looks like my current limitation is the slow speed of order by rand(), so I really need to test it on "real" data.

No, wrong: it looks like 100 is already a bit too much for Levenshtein. 60 is still fast, so the limit is somewhere in between (at least on PAWS; Toolforge might be faster or slower, so the limit might change).
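A rough way to see why the jump from 60 to 100 hurts: if every script is compared with every other one, the number of Levenshtein calls grows quadratically, and each call is itself proportional to the product of the two text lengths. The sketch below is a plain-Python illustration of that, not the code actually used here; `levenshtein` and `pair_count` are hypothetical helper names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pair_count(n: int) -> int:
    """Number of unordered pairs among n scripts (assumes all-vs-all comparison)."""
    return n * (n - 1) // 2
```

Under the all-vs-all assumption, `pair_count(60)` is 1770 and `pair_count(100)` is 4950, so going from 60 to 100 scripts is almost three times as many comparisons.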

Note: a good eps depends on the amount of data. The more source codes there are, the smaller eps should be, otherwise awful artefacts start to appear.

eps     amount of scripts
0.2     30
0.1     60
0.03    100
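For scripted runs, the table can be turned into a small lookup. This is only a sketch: treating each row as an upper bound on the number of scripts is an assumption, nothing was measured between or beyond these three sizes, and `EPS_BY_SIZE` and `pick_eps` are hypothetical names.

```python
# (max number of scripts, eps) pairs copied from the table above.
EPS_BY_SIZE = [(30, 0.2), (60, 0.1), (100, 0.03)]


def pick_eps(n_scripts: int) -> float:
    """Return the eps of the first table row that covers n_scripts."""
    for max_size, eps in EPS_BY_SIZE:
        if n_scripts <= max_size:
            return eps
    # Beyond the largest tested size: fall back to the smallest eps (untested).
    return EPS_BY_SIZE[-1][1]
```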
return {
{ "s08067" },
}


Sometimes DBSCAN returns 1 as the number of clusters. If this happens, it means that each source code is considered to belong to its own unique group.
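If the clustering is done with scikit-learn's DBSCAN (an assumption, these notes do not say which implementation is used), points that belong to no cluster get the label -1. A count of 1 distinct label can then mean "everything is noise" just as well as "everything is in one cluster". A small sketch to tell the two apart; `summarize_labels` is a hypothetical helper.

```python
from collections import Counter


def summarize_labels(labels):
    """Split DBSCAN-style labels into real clusters and noise (label -1)."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)  # -1 is the noise label in scikit-learn
    return {
        "clusters": len(counts),                        # real clusters only
        "noise": noise,                                 # points left ungrouped
        "distinct_labels": len(counts) + (1 if noise else 0),
    }
```

With labels `[-1, -1, -1]` this gives 0 clusters, 3 noise points and 1 distinct label, which matches the "each source code is on its own" case.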