Friday, March 28, 2008

Find Duplicate Files - dups.py

I have a lot of images imported at different times and from different sources, and I wanted a quick way to find duplicates. I couldn't find a satisfactory (read: free) solution, though admittedly I didn't search very exhaustively, so I took this opportunity to learn Python and came up with dups.py. Note that the file displays within a frame, so you might have to view the frame source to get to the actual code.

Without arguments, dups.py checks the current directory, recursively:
$ dups.py
Duplicates found:
./Data/2004/05_4/015_12A.jpg
./Data/2004/2004.09.29 Grandma/015_12A.jpg
Duplicates found:
./Data/2002/19/uvs021219-008.jpg
./Data/2006/01_2/uvs040430-006.jpg
...
This has been tested on Mac OS X and Cygwin, and should also work with Python for Windows.
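For anyone curious how a duplicate finder of this kind can work, here is a minimal sketch. It is not the code from dups.py: the function names (file_digest, find_duplicates), the choice of SHA-256, and the "group by size first, then hash" strategy are all assumptions made for illustration. It mirrors the defaults described in the help text by skipping hidden files, empty files, and symbolic links.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root="."):
    """Return a list of groups (sorted lists of paths) with identical contents."""
    # Pass 1: bucket files by size; files of different sizes cannot match.
    by_size = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if name.startswith("."):
                continue  # hidden file
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symbolic link
            size = os.path.getsize(path)
            if size > 0:  # ignore empty files
                by_size[size].append(path)

    # Pass 2: hash only the files that share a size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling find_duplicates(".") and printing each group under a "Duplicates found:" line would give output shaped like the listing above. Comparing sizes first means most files never need to be read at all.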

There are lots of nerdy options, like filtering by file size and following symbolic links. Try dups.py -h to see them all:
usage: dups.py [options] [<file_or_directory> ...]

Find duplicate files in the given path(s). Defaults to searching files recursively,
except for hidden files (beginning with "."), empty files, and symbolic links.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         verbose

  Exclusion Options:
    -f, --flat          do not scan directories recursively
    -g n, --greater-than=n
                        only scan files of size greater than n bytes
    -l n, --less-than=n
                        only scan files of size less than n bytes

  Inclusion Options:
    -L, --follow-links  follow symbolic links (warning: beware of infinite
                        loops)
    -H, --hidden-files  include hidden files
    -z, --zero-files    include empty files

  Miscellaneous:
    -D, --delete        delete subsequent duplicates (files are scanned in
                        argument-list order)
    -c, --create-rel-links
                        replace subsequent duplicates with relative links
                        (non-Windows only)
    -C, --create-abs-links
                        same as "-c", but links are absolute
    -s, --special-hidden
                        changes meaning of "hidden files" (-H) depending on
                        platform: cygwin - uses Windows file attributes
                        (warning: slow); win32 - files with names starting
                        with "." considered hidden

P.S. I hacked together a way to detect Windows hidden files from Cygwin, but it's ugly and slow.

4/6/08 update: I added the ability to delete duplicates (-D), and create relative (-c) or absolute (-C) symbolic links.
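As a rough illustration of what replacing a duplicate with a relative link involves, here is a small sketch. It is not taken from dups.py; the function name replace_with_relative_link and the temporary ".tmp-link" suffix are hypothetical. The link is created under a temporary name and then renamed over the duplicate, so the duplicate is not lost if creating the link fails.

```python
import os


def replace_with_relative_link(original, duplicate):
    """Replace `duplicate` with a relative symlink pointing at `original`.

    Returns the relative target stored in the link.
    """
    link_dir = os.path.dirname(os.path.abspath(duplicate))
    # The target is expressed relative to the directory that holds the link.
    target = os.path.relpath(os.path.abspath(original), start=link_dir)

    # Create the link next to the duplicate, then rename it into place.
    tmp = duplicate + ".tmp-link"  # hypothetical temporary name
    os.symlink(target, tmp)
    os.replace(tmp, duplicate)
    return target
```

For an absolute link (the -C behavior), the same sketch would pass os.path.abspath(original) to os.symlink instead of the relative path. Relative links have the advantage of surviving a move of the whole directory tree.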

Saturday, March 1, 2008

Hacking Lite - Evading Coffee Shop Banners

This is mostly a note to myself and not intended to express approval of the behavior described ;-)

Occasionally I like to bring my laptop to a nearby coffee shop to get some work done without all of the distractions of my apartment. My favorite place has been a Tanner's Coffee Company within walking distance of my place. It's a little noisy sometimes, and the food isn't the freshest, but the drinks are decent and I seem to get a lot done whenever I'm there.

Their wireless offering injects an ad banner at the top of every page. This alone would not be prohibitively annoying since adblock successfully strips the ads, leaving only the banner, but what does tend to dampen the customer experience is that it breaks some sites, Google Reader in particular. Because of this, I started to do a little tinkering...

I figured they didn't inject banners into all internet traffic, since I'm able to ssh without problems. Maybe they detect requests to servers on port 80? I toyed with the idea of using a local proxy server, blah blah blah...

Turns out, they actually filter on the user agent field within HTTP requests! This means that if you're using Firefox or Safari (or, I imagine, Internet Explorer), the banner will be injected; Opera, however, is ad-free. This also means that simply changing the user agent field that your browser declares in its HTTP requests sets you (ad) free as well.

In Firefox there are a number of ways to do this: install a Firefox extension, or simply add a string value to about:config named:

general.useragent.override

with a value like

Opera/9.26 (Macintosh; Intel Mac OS X; U; en)

as described here. It's probably a good idea to stick with a realistic user agent string as opposed to something arbitrary, since websites like Gmail may switch to less functional versions if they don't recognize your browser.
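The same idea applies outside the browser: any HTTP client can declare whatever user agent it likes. Below is a minimal Python sketch using the standard library's urllib.request. The helper name build_request and the constant OPERA_UA are made up for this example, and whether a given hotspot actually keys on this header is an assumption based only on what I saw at this one shop.

```python
import urllib.request

# The Opera user agent string quoted above.
OPERA_UA = "Opera/9.26 (Macintosh; Intel Mac OS X; U; en)"


def build_request(url, user_agent=OPERA_UA):
    """Build an HTTP request that declares the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Passing the result to urllib.request.urlopen(...) sends the request with that header in place of Python's default user agent.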

A quick way to determine your browser's user agent is to enter javascript:document.write(navigator.userAgent) in the address bar.

The service responsible for the ads at this particular Tanner's (I think they're all independently owned) seems to be a company named AnchorFree. Chances are, this technique could work for ad-injection schemes used by other wi-fi spots.

Done and done. Back to high-quality coffee shop web surfing!

5/1/08 update: Okay, I'm dumb. A much easier way to do this is to add the filter

*.anchorfree.*

in AdBlock Plus. This solves the problem much more elegantly and doesn't run into issues with sites not supporting your supposed user agent.
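To see which URLs a wildcard filter like this would catch, here is a simplified sketch of the matching. It only models "*" as "any run of characters" and treats the rest of the filter as a literal substring; real AdBlock Plus filters have more syntax than this, and the function names and sample URLs are hypothetical.

```python
import re


def wildcard_filter_to_regex(filter_text):
    """Convert a filter containing '*' wildcards into a compiled regex."""
    # Escape each literal piece, then join the pieces with "match anything".
    parts = [re.escape(part) for part in filter_text.split("*")]
    return re.compile(".*".join(parts))


def is_blocked(url, filter_text="*.anchorfree.*"):
    """Return True if the URL matches the wildcard filter anywhere."""
    return wildcard_filter_to_regex(filter_text).search(url) is not None


# Hypothetical URLs for illustration only.
print(is_blocked("http://ads.anchorfree.com/banner.js"))
print(is_blocked("http://www.google.com/reader"))
```

Under this simplified model the first URL matches because it contains ".anchorfree.", and the second does not.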