Detecting near-duplicate documents in real time
To be clear, our problem is not to find all near duplicates. We just wish to find near duplicates among the articles we serve, but it must be very fast. We might return 100 articles in one set. Comparing every article with every other one takes on the order of 100 × 100 = 10K comparisons (about 5K if each pair is checked only once).
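To make the arithmetic concrete, here is a small sketch of how the number of comparisons grows with the size of the result set. The function name `comparison_counts` is made up for this illustration.

```python
from itertools import combinations


def comparison_counts(n):
    """Return (ordered, unique) comparison counts for n articles.

    ordered: every article against every article, n * n
    unique:  each unordered pair checked once, n * (n - 1) / 2
    """
    ordered = n * n
    # Count the unordered pairs explicitly rather than by formula.
    unique = sum(1 for _ in combinations(range(n), 2))
    return ordered, unique


if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(size, comparison_counts(size))
```

Either way the cost is quadratic in the size of the set, so each single comparison has to be very cheap.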
Some of the conventional methods I know of for detecting near duplicates are shingling and matching term-frequency vectors. Shingling is great and the most accurate, but it is expensive. You can't take all the articles and compare them, not to mention keeping them all in memory. Creating the shingles takes time, and a large document may produce many of them. Term-frequency vectors might be less accurate for this purpose and have similar caveats.
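For readers who have not met these two techniques, here is a minimal sketch of both. It assumes word-level shingles of length 3 compared with Jaccard similarity, and raw term counts compared with cosine similarity. The tokenizer, the shingle length, and all function names are choices made for this illustration, not a description of any particular system.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows)."""
    words = tokens(text)
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine_tf(text_a, text_b):
    """Cosine similarity of the raw term-frequency vectors of two texts."""
    ta, tb = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm = (math.sqrt(sum(v * v for v in ta.values()))
            * math.sqrt(sum(v * v for v in tb.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "on the mat the cat sat"
    print(jaccard(shingles(a), shingles(b)))  # sensitive to word order
    print(cosine_tf(a, b))                    # ignores word order
```

The two sample sentences contain exactly the same words in a different order. The term-frequency vectors are identical, so the cosine score cannot tell them apart, while the shingle sets overlap only partly. The shingle set also grows with the length of the document, which is where the memory cost comes from.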
I went most of my professional life without ever hearing about shingling, and suddenly I have heard about it twice in one week. An artifact of growing computing power?