
Script: rmdupe


Download Links:
Script: download | browse | authenticate | instructions
Debian/Ubuntu: packages | PPA
Arch Linux: AUR
Description: Removes duplicate files from specified folders
Recommended For: Linux
Requires:
License: GNU GPL v3     * SEE DISCLAIMER *
Related: sedname
Feedback: comments | issues

Overview

rmdupe uses standard Linux commands to search within specified folders for duplicate files, regardless of filename or extension. Before duplicate candidates are removed they are compared byte-for-byte. rmdupe can also check duplicates against one or more reference folders, can trash files instead of removing them, allows for a custom removal command, and can limit its search to files of specified size. rmdupe includes a simulation mode which reports what will be done for a given command without actually removing any files.

rmdupe --help

Usage: rmdupe [OPTIONS] FOLDER [...]
Removes duplicate files in specified folders.  By default, newest duplicates
 are removed.
Options:
-R, -r              search specified folders recursively
--ref FOLDER        also search FOLDER recursively for copies but don't
                    remove any files from here (multiple --ref allowed)
                    Note: files may be removed from a ref folder if that
                    folder is also a specified folder
--trash FOLDER      copy duplicate files to FOLDER instead of removing
--sim               simulate and report duplicates only - no removal
--quiet             minimize output (disabled if used with --sim)
--verbose           detailed output
--old               remove oldest duplicates instead of newest
--minsize SIZE      limit search to duplicate files SIZE MB and larger
--maxsize SIZE      limit search to duplicate files SIZE MB and smaller
--rmcmd "RMCMD"     execute RMCMD instead of rm to remove copies
                    (may contain arguments, eg: "rm -f" or "shred -u")
--xdev              don't descend to other filesystems when recursing
                    specified or ref folders
Notes: do not use wildcards; symlinks are not followed except on the
       command line; zero-length files are ignored

Anytime --sim is included anywhere on the command line, rmdupe goes into simulation mode, which only reports and doesn’t remove anything. This can be used to see what files a given command will remove, or simply to search for duplicates.
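
For example, to preview a recursive clean of /user/test without removing anything:

# report duplicates only - no files are removed
rmdupe --sim -r /user/test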

Examples of usage:

# remove dupes in /user/test but not subfolders
rmdupe /user/test

# remove dupes from /user/test and subfolders
rmdupe -r /user/test

# remove dupes from /user/test1 and /user/test2 and subfolders
rmdupe -r /user/test1 /user/test2

# trash dupes from /user/test
rmdupe --trash /user/trash /user/test

# only remove dupes larger than 50MB
rmdupe --minsize 50 /user/test

# shred dupes before removing
rmdupe --rmcmd "shred -u" /user/test

# remove dupes from /user/test using /user/keep as a reference
rmdupe -r /user/test --ref /user/keep

rmdupe will always remove the newest duplicates, preserving the oldest copy of a file, unless the --old option is used, which reverses this.
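
For example:

# keep the newest copy of each file, removing older duplicates
rmdupe --old -r /user/test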

All specified folders are searched as a group. When rmdupe is finished, they will collectively contain only one copy of each file.

Reference folders are folders you want checked for duplicates but never cleaned of them. For example, if copies of file “A” in the specified folders also exist in a reference folder, the copies in the specified folders will be removed; the copy in the reference folder is never removed, unless the reference folder is also given on the command line as a non-reference folder. Use reference folders to check for duplicates against a collection of files which you don’t want removed. You may specify more than one reference folder with multiple --ref options. Reference folders are always searched recursively.
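
For example, to clean /user/test against two collections that must stay intact (folder names are illustrative):

# copies in /user/keep1 and /user/keep2 are never removed
rmdupe -r /user/test --ref /user/keep1 --ref /user/keep2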

Note that rmdupe may take some time when large numbers of files share exactly the same size: whenever two files are the same size, rmdupe performs a byte-by-byte compare (cmp) on them.
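
Conceptually, that pass resembles the following simplified shell sketch (an illustration only, not the script’s actual code; it assumes GNU find and filenames free of tabs and newlines):

# list sizes of non-empty regular files, pair up adjacent same-size files,
# then confirm each candidate pair byte for byte with cmp
find /user/test -maxdepth 1 -type f -size +0c -printf '%s\t%p\n' | sort -n |
awk -F'\t' '$1 == prev { print prevfile "\t" $2 } { prev = $1; prevfile = $2 }' |
while IFS=$'\t' read -r a b; do
    cmp -s -- "$a" "$b" && echo "duplicate: $b (matches $a)"
done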

The --trash option allows specification of a trash folder to be used. Each duplicate is moved to the trash folder, using a unique filename if needed. If a move to the trash folder fails, rmdupe halts with an error.

Normally rmdupe only reports files it’s removing. For more detailed feedback as it’s running, use the --verbose option.

Installation Instructions

Follow the standard Script Installation Instructions. Alternatively, for Debian and Ubuntu a deb package and a PPA repository are available. On Arch Linux, rmdupe can be installed automatically using the AUR.

Comments

  1. This script is precisely what I was looking for. I migrated to Ubuntu a few months ago, and the only major thing that I missed was a program called duplic8.exe that searches for duplicates.

    Other tools (like fdupes) are not very precise about which files you want to keep. The --ref option is excellent!

    I tried fslint, which has similar functionality, but the performance was very poor. Your use of built-in Linux commands should deliver very good performance.

    Comment by John | February 5, 2010 | Reply

    • Glad to hear this is useful to you. Linux has great functionality in core programs that have been optimized over the years by many contributors, so scripts can make use of that huge library.

      Comment by igurublog | February 6, 2010 | Reply

      • I just finished a large task with rmdupe, which helped me sort through 3 TB of image archives that had gotten out of sync. I wanted to make sure I had a complete set without losing any versions of images by blindly deleting files with the same name. I had a master directory (which you would call --ref), and then various working and archive directories.

        Sorting them out with rmdupe was a breeze. It is substantially faster than the alternatives of fslint or duplic8.exe (running under wine) and offers much more precise functionality.

        Had I not found this site, I’d still be waiting for my computer to finish the task now!

        Thanks again!

        Comment by John | February 6, 2010 | Reply

  2. How is this better than liten[1], duff[2], or fdupes[3]?

    [1]:http://aur.archlinux.org/packages.php?ID=27335
    [2]:http://www.archlinux.org/packages/community/i686/duff/
    [3]:http://www.archlinux.org/packages/community/i686/fdupes/

    Comment by aeosynth | February 17, 2010 | Reply

    • Maybe it’s not – have a look at the features of each and see what best suits your needs.

      Comment by igurublog | March 5, 2010 | Reply

    • None of those programs supports dir-by-dir comparison, i.e. using one of the directories as a reference.

      Comment by Tilo | May 17, 2010 | Reply

  3. Hi,

    I stumbled across this script and just wanted to leave a note to thank you for making it available as it appears to do what I want and my quick tests have found it very useful. So thanks!

    Comment by Steve | November 14, 2011 | Reply

  4. Same comment as Steve, as I have been using your script for years (the --ref option is particularly great). Thanks for the great work.

    Comment by Anonymous | June 29, 2012 | Reply

  5. I haven’t used this yet. I was going to write something similar in Python but use hashing (such as MD5) to compare files, i.e. identical files (byte for byte) produce the same MD5. No need to hash the entire file at first, just the first 8k or so; if the two match that far, then do a fuller compare (a full-file MD5); otherwise, they aren’t the exact same file anyhow. (See the sketch after this comment.)

    I’ll give this a try; after 7 years of keeping backups of all my systems, I know I am wasting a lot of space ;) I came here because SpaceFM is great and the Gnome3 blog really got my attention (I agree with the blog and most comments).

    BTW, I work with another developer, h2, on a script to view system information, http://www.inxi.org. I was looking to create a GUI version and was searching for a GUI library to use; glad I read the blog when I did, as it looks like GTK3 is out.

    Thanks for the software, keep it up!

    Comment by trash80 | January 6, 2013 | Reply
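
    The partial-hash idea described in this comment could be sketched in shell like this (a hypothetical illustration, not rmdupe code; fileA and fileB stand in for any two candidate files):

    # hash only the first 8 KB; do a full-file hash only when the quick hashes match
    quick_hash() { head -c 8192 "$1" | md5sum | cut -d' ' -f1; }
    full_hash()  { md5sum "$1" | cut -d' ' -f1; }
    if [ "$(quick_hash fileA)" = "$(quick_hash fileB)" ] &&
       [ "$(full_hash fileA)" = "$(full_hash fileB)" ]; then
        echo "fileA and fileB are duplicates (barring an MD5 collision)"
    fi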

    • Thanks for the feedback. Not sure this will run on a trash80. ;) I deliberately didn’t take the MD5 approach with this due to the unlikely but possible collision potential, and also, since it’s smart enough not to compare the same two files twice, the cost is reduced. A little more drive access and a little less CPU. It’s only an issue when file sizes are identical.

      I wouldn’t rule out gtk yet, but it does look to be heading in an ugly direction. Maybe it will fork away from Red Hat’s control. Looks like a nice info script. Also note that SpaceFM Dialog provides much of GTK’s interface for scripts to use – saves you having to get into the dirt of GTK, and is also compatible with both GTK2 and 3.

      Comment by IgnorantGuru | January 7, 2013 | Reply

  6. Interesting. I developed a similar program called dupekill; it’s written in Python and achieves mostly the same results. I guess great minds think alike! I didn’t think of checking two copies and keeping the older one; that’s a great idea! Would you have any qualms with me borrowing an idea or two from rmdupe? dupekill is free software (available at https://github.com/sporkbox/dupekill), so I believe it’s compatible with your license.

    Comment by sporkbox | March 17, 2013 | Reply

    • By all means – that’s why I share my scripts. I consider it a compliment if someone ‘steals’ an idea. And I steal them too. I must say rmdupe was fun to write – an interesting recursive puzzle to solve, efficiently.

      What, no README for dupekill? I am just supposed to ‘know’? ;)

      Not sure your DWTFYW license is compatible with the GPL. :) Be careful of granting too many rights or you can give its free nature away – a license should restrict some things to ensure free use. If I can do anything with your software, that means I can copyright it, declare exclusive ownership of it, restrict its use, etc.

      Comment by IgnorantGuru | March 17, 2013 | Reply

  7. Your script really helped me a lot. I hope a feature or option is added to also compare files by name when checking for duplicates. This would help reduce comparison time when we already know the supposed duplicates have exactly the same filename.

    Comment by Seaweed | March 24, 2013 | Reply

    • Names are irrelevant to the way rmdupe works – it identifies duplicates based on content, and compares all files whose size matches.

      Comment by IgnorantGuru | February 4, 2014 | Reply

      • I’m not sure this is a correct statement? It appears to me that rmdupe first compares filenames; if they match, THEN it does a byte-for-byte comparison, but if no file with the same filename is found, it does not do the comparison. The reason I say this is that you see ‘skipping’ and it only takes a fraction of a second, but then you see ‘Comparing to’ and it can take a while depending on the size. I’m wondering if the files also have to have the same modification date to initiate the comparison?

        fdupes (which, as was stated here, is much slower) is slower because it actually does a hash comparison even if the filenames and file modification times are different. After all, you can have a file that you load into an app, don’t change, then do a Save As with a different filename. So you would have a duplicate file with a different filename and a different modification time.

        Comment by Jeff | December 10, 2016 | Reply

  8. I have had this running for almost 24 hours using a 30 GB ref and running it on an 80 GB directory. Is this normal?

    Comment by Simon | February 3, 2014 | Reply

    • Yes, that can be normal, especially if you have many files that are the same size. Also, the --sim option is slower than actual use. You can use --verbose to see where rmdupe is working. In general, just let it run for however long it takes.

      Comment by IgnorantGuru | February 4, 2014 | Reply

  9. I kept getting “Error: relative folder spec not permitted for safety”. Does this mean that only absolute paths are accepted? Why are relative paths unsafe?

    Comment by Nick | April 12, 2015 | Reply

    • Yes, you are required to use absolute paths. Because rmdupe can remove large numbers of files, this safety measure ensures clarity on what folders are specified. If you don’t want this, you’ll need to remove that check from the script, and there may be other adjustments required.

      Comment by IgnorantGuru | July 20, 2015 | Reply

      • Ok thanks.

        Comment by Nick | August 25, 2015 | Reply

  10. I would suggest you try DuplicateFilesDeleter; it can help resolve duplicate file issues.

    Comment by angel ikaz | February 25, 2016 | Reply

  11. Love love love rmdupe, but I was wondering: would it be possible to add a progress meter? I’m not a coder, so I have no idea what that would entail. Anyway, I was running rmdupe against a directory with 40k files, and it took a couple of days. I could tell it was working, since I occasionally saw new files in the trash and my I/O was higher than normal, but it would have been even better to know how much longer it guesstimated it would take to finish.

    Comment by C. Dale | March 31, 2016 | Reply

  12. I’d consider adding an option to skip the compare step. I call the script from a bash loop that runs it on each folder in turn (see the sketch after this comment). I know that any files with the same size and filename within a given folder are duplicates, so they do not need to be compared byte for byte. It would REALLY speed up the process if I could tell rmdupe not to do the comparison before deleting.

    Comment by Jeff | December 10, 2016 | Reply
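
    The per-folder loop described above might look something like this (illustrative; /data is a placeholder, and note that rmdupe requires absolute paths):

    # run rmdupe independently on each subfolder of /data
    for dir in /data/*/; do
        rmdupe "$dir"
    done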

  13. Fantastic tool and just what I needed, thanks for sharing.

    One little ask: would it be possible for the trash option to maintain paths? I note that at the moment, if more than one duplicate is found, it does keep them all (renamed xxx.copy1.xxx etc.), but maintaining the relative paths would make it a lot easier to revert an rmdupe run if that were ever needed.

    Comment by Mark Rogers | March 17, 2018 | Reply

  14. Quick request – an option to delete files that match AND have the same basename (possibly also preferentially deleting common duplicate names like "base (int).ext" or "base.ext~int~" or simply "base.ext~").

    Alternately (and likely a better generalized solution): pass the reference file that is not being deleted to the --rmcmd command via an environment variable, or replace a %o token in the --rmcmd string with the not-to-be-deleted file. That would let anybody craft a command that preferentially deletes based on specific custom logic, using the name and path of the “original” file (meaning the one that is not going to be deleted, be it in a --ref directory or another file in the delete path) and the name and path of the “duplicate” file that would normally be deleted.

    Comment by Evan Edwards | August 15, 2019 | Reply

  15. This looks like a great script. I just need to get it running on my Mac (High Sierra).
    I’m getting the error:
    stat: illegal option -- c
    I’ve read online how to change the stat command, but since that’s within rmdupe, I don’t know how to do that.
    Is there a solution?
    Also, I’m trying to flag only identical files with the same filename.
    Thanks.

    Comment by ericlindell | January 20, 2021 | Reply
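
    For reference, that error arises because macOS ships BSD stat, which has no -c option (that flag belongs to GNU stat on Linux). One possible workaround, untested against this script, is to install GNU coreutils and point rmdupe at its g-prefixed tools:

    # GNU:  stat -c %s file   prints the file size in bytes
    # BSD:  stat -f %z file   is the rough macOS equivalent
    brew install coreutils    # provides gstat, gmd5sum, etc.
    # then edit rmdupe, replacing its stat calls with gstat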

  16. Looking for a way to flag only files with identical names as duplicates.

    Comment by ericlindell | January 24, 2021 | Reply

