IgnorantGuru's Blog

Linux software, news, and tips

Script: rmdupe


Download Links:
Script: download | browse | authenticate | instructions
Debian/Ubuntu: packages | PPA
Arch Linux: AUR
Description: Removes duplicate files from specified folders
Recommended For: Linux
Requires:
License: GNU GPL v3     * SEE DISCLAIMER *
Related: sedname
Feedback: comments | issues

Overview

rmdupe uses standard Linux commands to search within specified folders for duplicate files, regardless of filename or extension. Before duplicate candidates are removed, they are compared byte-for-byte. rmdupe can also check duplicates against one or more reference folders, can trash files instead of removing them, allows a custom removal command, and can limit its search to files of a specified size. rmdupe also includes a simulation mode, which reports what a given command would do without actually removing any files.

rmdupe --help

Usage: rmdupe [OPTIONS] FOLDER [...]
Removes duplicate files in specified folders.  By default, newest duplicates
 are removed.
Options:
-R, -r              search specified folders recursively
--ref FOLDER        also search FOLDER recursively for copies but don't
                    remove any files from here (multiple --ref allowed)
                    Note: files may be removed from a ref folder if that
                    folder is also a specified folder
--trash FOLDER      copy duplicate files to FOLDER instead of removing
--sim               simulate and report duplicates only - no removal
--quiet             minimize output (disabled if used with --sim)
--verbose           detailed output
--old               remove oldest duplicates instead of newest
--minsize SIZE      limit search to duplicate files SIZE MB and larger
--maxsize SIZE      limit search to duplicate files SIZE MB and smaller
--rmcmd "RMCMD"     execute RMCMD instead of rm to remove copies
                    (may contain arguments, eg: "rm -f" or "shred -u")
--xdev              don't descend to other filesystems when recursing
                    specified or ref folders
Notes: do not use wildcards; symlinks are not followed except on the
       command line; zero-length files are ignored

Any time --sim appears anywhere on the command line, rmdupe enters simulation mode, which only reports and doesn’t remove. This can be used to preview which files a given command will remove, or simply to search for duplicates.

Examples of usage:

# remove dupes in /user/test but not subfolders
rmdupe /user/test

# remove dupes from /user/test and subfolders
rmdupe -r /user/test

# remove dupes from /user/test1 and /user/test2 and subfolders
rmdupe -r /user/test1 /user/test2

# trash dupes from /user/test
rmdupe --trash /user/trash /user/test

# only remove dupes larger than 50MB
rmdupe --minsize 50 /user/test

# shred dupes before removing
rmdupe --rmcmd "shred -u" /user/test

# remove dupes from /user/test using /user/keep as a reference
rmdupe -r /user/test --ref /user/keep
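Since --sim never removes anything, it can be combined with any of the above to preview the result first (the paths here are illustrative):

```shell
# report which dupes the recursive run WOULD remove, without removing them
rmdupe --sim -r /user/test1 /user/test2
```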

rmdupe will always remove the newest duplicates, preserving the oldest copy of each file, unless the --old option is used, which reverses this.

All specified folders are searched as a group. When rmdupe is finished, they will collectively contain only one copy of each file.

Reference folders are folders you want checked for duplicates but not cleared of them. For example, if copies of file “A” in the specified folders also exist in a reference folder, the copies in the specified folders will be removed. The copy in the reference folder is never removed unless that folder is also given on the command line as a non-reference folder. The reference-folder method is used to check for duplicates against a collection of files which you don’t want removed. You may specify more than one reference folder by using multiple --ref options. Reference folders are always searched recursively.

Note that rmdupe may take some time if there are large numbers of files of exactly the same size. If files are the same size, this triggers rmdupe to do a byte-by-byte compare (cmp) on the files.
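The size-then-compare idea can be illustrated with a minimal sketch (an assumption for illustration, not rmdupe's actual code; the `find_dupes` name is hypothetical): list files ordered by size, and run cmp only on files whose sizes match.

```shell
# Hypothetical sketch of size-grouping: sort files by size, then cmp only
# neighbors with identical sizes. (rmdupe itself compares every same-size
# candidate pair; this sketch only checks adjacent files.)
# Requires GNU find for -printf.
find_dupes() {
    find "$1" -type f -size +0c -printf '%s %p\n' | sort -n |
    while read -r size path; do
        if [ "$size" = "$prev_size" ] && cmp -s "$prev_path" "$path"; then
            printf 'duplicate: %s (copy of %s)\n' "$path" "$prev_path"
        fi
        prev_size=$size
        prev_path=$path
    done
}
```

Because cmp stops at the first differing byte, same-size files that are not copies are usually rejected quickly; the expensive case is many files that really are identical.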

The --trash option allows specification of a trash folder to be used. Each duplicate is moved to the trash folder, using a unique filename if needed. If a move to the trash folder fails, rmdupe halts with an error.
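The unique-filename behavior might look something like this sketch (an assumption for illustration; `trash_file` and the `-N` suffix scheme are hypothetical, not rmdupe's actual code):

```shell
# Hypothetical sketch: move a duplicate into a trash folder, appending
# -1, -2, ... to the name until it no longer collides. mv's nonzero exit
# status propagates to the caller (rmdupe halts with an error in that case).
trash_file() {
    src=$1
    trash=$2
    dest=$trash/$(basename "$src")
    n=0
    while [ -e "$dest" ]; do
        n=$((n + 1))
        dest=$trash/$(basename "$src")-$n
    done
    mv "$src" "$dest"
}
```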

Normally rmdupe only reports files it’s removing. For more detailed feedback as it’s running, use the --verbose option.

 

Installation Instructions


Follow the standard Script Installation Instructions. Alternatively, for Debian and Ubuntu a deb package and a PPA repository are available. On Arch Linux, rmdupe can be installed automatically using the AUR.

 

16 Comments

  1. This script is precisely what I was looking for. I migrated to Ubuntu a few months ago, and the only major thing that I missed was a program called duplic8.exe that searches for duplicates.

    Other tools (like fdupes) are not very precise about which files you want to keep. The --ref option is excellent!

    I tried fslint, which has similar functionality, but the performance was very poor. Your use of built-in linux functions should render very good performance.

    Comment by John | February 5, 2010

    • Glad to hear this is useful to you. Linux has great functionality in core programs that have been optimized over the years by many contributors, so scripts can make use of that huge library.

      Comment by igurublog | February 6, 2010

      • I just finished a large task with rmdupe helping me sort through 3 TB of image archives that had gotten out of sync. I wanted to assure that I had a complete set without losing any versions of images by blindly deleting files with the same name. I had a master directory (which you would call --ref), and then various working and archive directories.

        Sorting them out with rmdupe was a breeze. It is substantially faster than the alternatives of fslint or duplic8.exe (running under wine) and offers much more precise functionality.

        Had I not found this site, I’d still have been waiting for my computer to finish the task now!

        Thanks again!

        Comment by John | February 6, 2010

  2. How is this better than liten[1], duff[2], or fdupes[3] ?

    [1]:http://aur.archlinux.org/packages.php?ID=27335
    [2]:http://www.archlinux.org/packages/community/i686/duff/
    [3]:http://www.archlinux.org/packages/community/i686/fdupes/

    Comment by aeosynth | February 17, 2010

    • Maybe it’s not – have a look at the features of each and see what best suits your needs.

      Comment by igurublog | March 5, 2010

    • None of those programs supports dir-by-dir comparison, i.e. using one of the directories as a reference.

      Comment by Tilo | May 17, 2010

  3. Hi,

    I stumbled across this script and just wanted to leave a note to thank you for making it available as it appears to do what I want and my quick tests have found it very useful. So thanks!

    Comment by Steve | November 14, 2011

  4. Same comment as Steve, as I have been using your script for years (the --ref option is particularly great). Thanks for the great work.

    Comment by Anonymous | June 29, 2012

  5. I haven’t used this yet. I was going to write something similar in Python, but using hashing (such as MD5) to compare files, i.e. identical files (byte for byte) will produce the same MD5. There’s no need to look at the entire file at first, just the first 8k or so; if the two are the same that far, then do a fuller compare (full-file MD5), otherwise they aren’t the same file anyway.
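    The two-stage check described here could be sketched as follows (hypothetical code for illustration, not part of rmdupe; `maybe_same` and the 8 KB cutoff are the commenter's idea, not the script's method):

```shell
# Hypothetical sketch of the two-stage MD5 idea (rmdupe uses cmp instead):
# hash only the first 8 KB of each file, and compute a full-file hash
# only when those quick hashes agree.
maybe_same() {
    q1=$(head -c 8192 "$1" | md5sum | cut -d ' ' -f 1)
    q2=$(head -c 8192 "$2" | md5sum | cut -d ' ' -f 1)
    [ "$q1" = "$q2" ] || return 1   # files differ within the first 8 KB
    f1=$(md5sum < "$1" | cut -d ' ' -f 1)
    f2=$(md5sum < "$2" | cut -d ' ' -f 1)
    [ "$f1" = "$f2" ]
}
```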

    I’ll give this a try, after 7 years of keeping backups of all my systems, I know I am wasting a lot of space ;) I came here because SpaceFM is great and the Gnome3 blog really got my attention (I agree with the blog and most comments).

    BTW, I work with another, h2, on a script to view system information, http://www.inxi.org. I was looking to create a GUI version and was searching for a gui library to use, glad I read the blog when I did, looks like gtk3 is out.

    Thanks for the software, keep it up!

    Comment by trash80 | January 6, 2013

    • Thanks for the feedback. Not sure this will run on a trash80. ;) I deliberately didn’t take the MD5 approach with this due to the unlikely but possible collision potential, and also since it’s smart enough not to compare the same two files twice, the cost is reduced. A little more drive access and a little less CPU. It’s only an issue when file sizes are identical.

      I wouldn’t rule out gtk yet, but it does look to be heading in an ugly direction. Maybe it will fork away from Red Hat’s control. Looks like a nice info script. Also note that SpaceFM Dialog provides much of GTK’s interface for scripts to use – saves you having to get into the dirt of GTK, and is also compatible with both GTK2 and 3.

      Comment by IgnorantGuru | January 7, 2013

  6. Interesting. I developed a similar program called dupekill; it’s written in Python and achieves mostly the same results. I guess great minds think alike! I didn’t think of checking two copies and keeping the older one; that’s a great idea! Would you have any qualms with me borrowing an idea or two from rmdupe? dupekill’s free software (available at https://github.com/sporkbox/dupekill ), so I believe it’s compatible with your license.

    Comment by sporkbox | March 17, 2013

    • By all means – that’s why I share my scripts. I consider it a compliment if someone ‘steals’ an idea. And I steal them too. I must say rmdupe was fun to write – an interesting recursive puzzle to solve, efficiently.

      What, no README for dupekill? I am just supposed to ‘know’? ;)

      Not sure your DWTFYW license is compatible with the GPL. :) Be careful of granting too many rights or you can give its free nature away – a license should restrict some things to ensure free use. If I can do anything with your software, that means I can copyright it, declare exclusive ownership of it, restrict its use, etc.

      Comment by IgnorantGuru | March 17, 2013

  7. Your script really helped me a lot. I hope there could be a feature or option to also compare files by name when checking for duplicates. This should help reduce comparison times if we already know the supposed duplicates have the exact same filename.

    Comment by Seaweed | March 24, 2013

    • Names are irrelevant to the way rmdupe works – it identifies duplicates based on content, and compares all files whose size matches.

      Comment by IgnorantGuru | February 4, 2014

  8. I have had this running for almost 24 hours using a 30 GB ref and running it on an 80 GB directory. Is this normal?

    Comment by Simon | February 3, 2014

    • Yes, that can be normal, especially if you have many files that are the same size. Also, the --sim option is slower than actual use. You can use --verbose to see where rmdupe is working. In general just let it run for however long it takes.

      Comment by IgnorantGuru | February 4, 2014

