Duplicate finder

Tags: duplicates, build, cut and paste, code, simian, tools

A while back (2007) I was challenged to make a duplicate code detector for C#. Simian did a valuable job, but it cost money. And, as someone said to me, how hard could it be? So I wrote one and put it here. I wrote about it here and here.

I learned these things about the problem of detecting duplicate code:


1) It's a puzzle. By that I mean that, like a Sudoku solver and unlike most commercial software, it's not something that can be solved incrementally, one feature at a time. There is one main feature: detecting duplicate groups of lines in files. You need a flash of insight into how to crack that core problem, and only then can you incrementally layer a program around it.

2) Performance is the other problem. It's fairly easy to detect duplicate lines. It's hard to do it fast. Sadly, the solution that I have is prohibitively slow on large codebases. It compares every line in every file to every other line in every other file, so the performance is probably O(n^2), where n is the number of lines of code in all files. A useful solution would have to be O(n) or O(n log(n)).
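To make the O(n) target concrete, here is a minimal sketch of one common way to avoid comparing every line with every other line: index each run of consecutive lines in a dictionary and report the runs that turn up more than once. This is not how my detector works. The function name `find_duplicate_blocks`, the `window` parameter, and the choice to trim whitespace before comparing are all assumptions made for illustration. For a fixed window size, the expected work grows roughly in proportion to the total number of lines.

```python
from collections import defaultdict


def find_duplicate_blocks(files, window=3):
    """Group identical runs of `window` consecutive lines across files.

    `files` maps a file name to its list of lines (a hypothetical input
    shape chosen for this sketch). Returns a dict mapping each repeated
    block to the (file name, 1-based start line) places where it occurs.
    """
    index = defaultdict(list)
    for name, lines in files.items():
        # Trim whitespace so indentation differences do not hide a match.
        stripped = [line.strip() for line in lines]
        for start in range(len(stripped) - window + 1):
            block = tuple(stripped[start:start + window])
            # Skip windows made up only of blank lines.
            if not any(block):
                continue
            # One dictionary lookup replaces a scan over all other lines.
            index[block].append((name, start + 1))
    # Keep only the blocks that were seen in more than one place.
    return {block: places for block, places in index.items() if len(places) > 1}
```

Overlapping windows mean that a long duplicated region is reported as several adjacent blocks, so a real tool would still need a step that merges neighbouring matches into one result.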

There are cases in which you can cheat: for instance, comparing the code at the cursor to code elsewhere in the codebase is O(n). But as a task to run during a Continuous Integration build, you want to compare all code to all other code.
My detector is also defeated by very minor changes. It works on text and doesn't understand the code, so changing a variable name means that the code won't match as a duplicate any more.
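As a rough illustration of why a rename defeats a purely textual comparison, and of one crude way around it, the sketch below replaces every identifier with a placeholder before lines are compared. Everything in it is an assumption made for the example: the `normalise` function, the `ID` placeholder, and the deliberately tiny keyword set. A real tool would more likely use a proper lexer for the language, since this regex approach also rewrites words inside string literals and comments.

```python
import re

# A tiny, illustrative subset of C# keywords; not the full list.
KEYWORDS = {"int", "return", "if", "else", "for", "while", "var", "new"}

# Matches words that start with a letter or underscore.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def normalise(line):
    """Replace every non-keyword identifier in `line` with the placeholder ID."""
    def swap(match):
        word = match.group(0)
        return word if word in KEYWORDS else "ID"

    return IDENTIFIER.sub(swap, line.strip())


original = "int total = count + 1;"
renamed = "int sum = n + 1;"

# As raw text the two lines differ, so a text-only detector misses them.
print(original == renamed)
# After normalisation both lines have the same shape.
print(normalise(original) == normalise(renamed))
```

The trade-off is more false positives: lines that merely share a shape, such as two unrelated assignments, now compare as equal.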

So, in the absence of anything else, I'd still recommend having a look at Simian.
