Friday, March 28, 2008

Find Duplicate Files - dups.py

I have a lot of images imported at different times and from different sources, and I wanted a quick way to find duplicates. Not finding a satisfactory (read: free) solution (though I admittedly didn't do a very exhaustive search), I took this opportunity to learn Python and came up with dups.py. Note that the file displays within a frame, so you might have to view frame source to get to the actual code.

Without arguments, dups.py checks the current directory, recursively:
$ dups.py
Duplicates found:
./Data/2004/05_4/015_12A.jpg
./Data/2004/2004.09.29 Grandma/015_12A.jpg
Duplicates found:
./Data/2002/19/uvs021219-008.jpg
./Data/2006/01_2/uvs040430-006.jpg
...
This has been tested on Mac OS X and Cygwin, and should also work with Python for Windows.
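For anyone curious how a scan like this can work, here is a minimal sketch. It is not the dups.py source. It assumes duplicates are detected by comparing MD5 digests (which matches what I say about the script in the comments below), and the names md5_of and find_duplicates are made up for illustration. The default filters for hidden and empty files are left out for brevity.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root="."):
    """Group regular files under root by MD5 digest (illustrative helper).

    Returns a list of path lists, one per digest shared by 2+ files.
    """
    by_hash = defaultdict(list)
    # os.walk does not follow symbolic links to directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symbolic links and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_hash[md5_of(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling find_duplicates(".") returns one list of paths per set of matching files, which could then be printed under a "Duplicates found:" heading as in the output above.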

There are lots of nerdy options, like filtering by file size and following symbolic links. Try dups.py -h to see them all:
usage: dups.py [options] [<file_or_directory> ...]

Find duplicate files in the given path(s). Defaults to searching files recursively,
except for hidden files (beginning with "."), empty files, and symbolic links.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         verbose

  Exclusion Options:
    -f, --flat          do not scan directories recursively
    -g n, --greater-than=n
                        only scan files of size greater than n bytes
    -l n, --less-than=n
                        only scan files of size less than n bytes

  Inclusion Options:
    -L, --follow-links  follow symbolic links (warning: beware of infinite
                        loops)
    -H, --hidden-files  include hidden files
    -z, --zero-files    include empty files

  Miscellaneous:
    -D, --delete        delete subsequent duplicates (files are scanned in
                        argument-list order)
    -c, --create-rel-links
                        replace subsequent duplicates with relative links
                        (non-Windows only)
    -C, --create-abs-links
                        same as "-c", but links are absolute
    -s, --special-hidden
                        changes meaning of "hidden files" (-H) depending on
                        platform: cygwin - uses Windows file attributes
                        (warning: slow); win32 - files with names starting
                        with "." considered hidden
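The exclusion and inclusion options above boil down to a yes/no test applied to each file. Here is a rough sketch of such a test, assuming the behaviour described in the help text. The function wanted and its keyword names are hypothetical stand-ins for the command-line flags, not code from dups.py, and the platform-specific -s handling is not covered.

```python
import os


def wanted(path, greater_than=None, less_than=None,
           follow_links=False, hidden_files=False, zero_files=False):
    """Decide whether a file should be scanned (hypothetical helper)."""
    name = os.path.basename(path)
    if not hidden_files and name.startswith("."):
        return False  # hidden files are skipped unless -H is given
    if not follow_links and os.path.islink(path):
        return False  # symbolic links are skipped unless -L is given
    try:
        size = os.path.getsize(path)
    except OSError:
        return False  # unreadable file or dangling link
    if not zero_files and size == 0:
        return False  # empty files are skipped unless -z is given
    if greater_than is not None and size <= greater_than:
        return False  # -g n keeps only sizes strictly greater than n
    if less_than is not None and size >= less_than:
        return False  # -l n keeps only sizes strictly less than n
    return True
```

For example, wanted(path, greater_than=1024) would correspond to running with -g 1024.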

P.S. I hacked together a way to detect Windows hidden files from Cygwin, but it's ugly and slow.

4/6/08 update: I added the ability to delete duplicates (-D), and create relative (-c) or absolute (-C) symbolic links.
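To give an idea of what replacing a duplicate with a link involves, here is a minimal sketch for non-Windows systems. It is not the code from dups.py. The helper name replace_with_link and the temporary ".tmp-link" suffix are made up for illustration. The link is created under a temporary name and then renamed over the duplicate, so the duplicate is only removed once the link exists.

```python
import os


def replace_with_link(original, duplicate, relative=True):
    """Replace `duplicate` with a symbolic link to `original`.

    Hypothetical helper: returns the link target that was written.
    """
    original = os.path.abspath(original)
    duplicate = os.path.abspath(duplicate)
    if relative:
        # A relative target is resolved from the directory holding the link.
        target = os.path.relpath(original, os.path.dirname(duplicate))
    else:
        target = original
    tmp = duplicate + ".tmp-link"  # assumed not to exist already
    os.symlink(target, tmp)
    os.replace(tmp, duplicate)  # rename the link over the duplicate file
    return target
```

With relative=True this roughly corresponds to -c, and with relative=False to -C.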

3 comments:

Brendan Hemens said...

Hi,

I applaud (and appreciate) your efforts, but this did not function as expected on my WinXP system. I ran it under Python 2.4 on a directory full of tiffs (and some other file types) that were band images from IKONOS. It detected many duplicates among bands, i.e., for a given tile, there are 4 bands, and it would detect three as duplicates of the first. Not all the time, but much of the time.

I checked the results with the DOS comp command, and it says they are, in fact, different. In short, it doesn't seem to work.

Vic said...

Brendan, are you saying the results were different for different runs on the same data?

Would it be possible for you to send me some of the files that were reported duplicates?

The script actually reports whether the MD5 hashes of the files match. I suppose with diverse enough data there could be some false positives, but it's pretty unlikely. I can add a final comparison check to eliminate these, if this is what is actually happening.
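A final comparison check of that kind could look something like the sketch below. It is only an illustration, not code from the script: it assumes the check is a byte-for-byte comparison done with the standard library's filecmp module, and the name confirm_duplicates is made up.

```python
import filecmp


def confirm_duplicates(paths):
    """Split a list of same-hash paths into groups of truly identical files.

    Hypothetical helper: files are compared byte for byte, and only
    groups with two or more members are returned.
    """
    confirmed = []
    for path in paths:
        for group in confirmed:
            # shallow=False forces a comparison of the file contents.
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            confirmed.append([path])
    return [group for group in confirmed if len(group) > 1]
```

Any files that only happened to share a hash would drop out of the reported groups at this step.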

Angel Ikaz said...

I would suggest trying DuplicateFilesDeleter; it can help resolve duplicate file issues.