Delete large numbers of duplicate emails from Mail.app
Jan 28, '10 07:30:00AM
Contributed by: unsubscriber
This workaround, using Thunderbird, allowed me to successfully remove 30,000 duplicate emails (from a collection of about 80,000 emails) in OS X's Mail.app. I spent a lot of time researching this question, and this is the only solution I found that worked.
My Mail.app emails got rather out of hand; I won't bore you with how. I had most emails at least twice, and some up to five times. I tried Andreas Amann's Mail Script for this, but, even though it was working OK, it only found about 200 duplicates in three hours, and was cooking my CPU at about 95%. There was no way that this could go on, so I cancelled the process. (Thanks anyway, AA.)
I looked at importing into Entourage, because there are some scripts to eliminate duplicates from there, but for reasons I shall not bore you with here, this proved a dead end.
The solution turned out to be the amazing add-on for Mozilla's Thunderbird called Remove Duplicate Messages (ALTERNATE). I had to use version 3.0 of Thunderbird, not the current version 3.0.1 (so thanks to those reviewers who reported that version 0.3.3 of the add-on was failing with Thunderbird 3.0.1). I found the older version on the Thunderbird releases page. Below is the process I used. (It took several hours, because of the number of emails. Ironically, the actual analysis of duplicates is the one quick stage; every other stage takes ages. This makes using Thunderbird permanently quite tempting.)
Note: The following process needs to be done for each folder that resides in the On My Mac section of Mail. (I will have more to say about that soon.)
- Use Mailbox » Archive Mailbox in Mail.app to create a proper mbox export file of the mail folder you want to de-duplicate. (The mailboxes that Mail keeps in the Mail folder inside your user's Library, with their extension .mbox, are not real mboxes.)
- Follow these instructions to import your newly-created mbox file quickly and easily into Thunderbird. I think my use of Path Finder (instead of Finder) helped a bit here. After I did this, I did wait a long time for the Spotlight indexing in Thunderbird to finish -- not sure whether this was necessary, but I suspect it probably was.
- Install the add-on mentioned above. Set its prefs for email matching criteria (in Thunderbird » Prefs » Manage Add-Ons) according to what works for you. I ran some tests with a small collection of dupes until I had these right. What worked for me was ticking Message ID plus a few others, but unticking Size, Lines, CC and Body. This did an excellent job of correctly collating the two to five copies of each email.
- Move the dupes to a chosen folder (e.g. trash). With each of my folders of about 40,000 emails, I had to wait roughly 30 seconds for the dialog box to appear. Then it took just two minutes to move about 13,000 dupes to the 'unneeded duplicates' folder I created. Amazing.
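If you want to sanity-check an exported mbox before (or after) running the add-on, the Message ID matching idea can be sketched in a few lines of Python using the standard `mailbox` module. This is only an illustration of matching on Message-ID alone, not what the add-on does internally; the function name `find_duplicates` and the choice to skip messages without a Message-ID are my assumptions.

```python
import mailbox
from collections import defaultdict


def find_duplicates(mbox_path):
    """Return {message_id: [keys]} for every Message-ID that occurs more than once.

    Hypothetical helper: it matches on the Message-ID header only, which is
    a simplification of the add-on's configurable criteria.
    """
    groups = defaultdict(list)
    # create=False raises an error instead of silently creating an empty file
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key, msg in box.iteritems():
            msg_id = msg.get("Message-ID")
            if msg_id is None:
                continue  # no Message-ID header: cannot be matched this way
            groups[str(msg_id).strip()].append(key)
    finally:
        box.close()
    # Keep only the IDs that appear at least twice
    return {mid: keys for mid, keys in groups.items() if len(keys) > 1}


def count_extra_copies(duplicates):
    """Number of messages that could go if one copy of each is kept."""
    return sum(len(keys) - 1 for keys in duplicates.values())
```

Calling `count_extra_copies(find_duplicates("/path/to/export.mbox"))` (the path is a placeholder) gives a rough figure to compare against the number of dupes the add-on reports. The sketch only reads the file; it does not delete anything.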
I did hit one snag, because I had 80,000 emails in a single folder in Mail. (I had moved them into one folder in the hope that I could run Andreas Amann's script all night. Like the Thunderbird add-on, the Mail script searches for dupes inside one folder at a time.)
Mail.app (on both my attempts) only created an archive mbox of about 4.3GB, which comprised some 43,000 emails; the remaining 36,000+ didn't make it into the archive! Luckily, I found when I had imported these into Thunderbird that they were in date order, and that no emails were dated earlier than 27 Oct 2007. So I went back to Mail.app and created a new folder, then dragged all the pre-27 Oct 2007 emails (the remaining 36,000+) into it, and created another Archive mbox. I then repeated the import process and everything worked.
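A quick way to spot this kind of truncated archive without importing it anywhere is to count the messages in the mbox file and look at the earliest date it contains. The following is a minimal Python sketch under my own assumptions: `summarize_mbox` is a made-up name, and messages with a missing or unparseable Date header are counted but ignored when working out the earliest date.

```python
import mailbox
import os
from datetime import datetime, timezone
from email.utils import mktime_tz, parsedate_tz


def summarize_mbox(mbox_path):
    """Return message count, file size in bytes, and earliest Date header (UTC)."""
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    stamps = []
    try:
        for msg in box:
            count += 1
            raw = msg.get("Date")
            if raw is None:
                continue
            parsed = parsedate_tz(str(raw))  # returns None if unparseable
            if parsed is None:
                continue
            stamps.append(mktime_tz(parsed))  # seconds since the epoch, UTC
    finally:
        box.close()
    earliest = (
        datetime.fromtimestamp(min(stamps), tz=timezone.utc) if stamps else None
    )
    return {
        "messages": count,
        "bytes": os.path.getsize(mbox_path),
        "earliest": earliest,
    }
```

If the message count is well short of what Mail.app shows for the folder, or the earliest date is later than your oldest email, the archive is probably incomplete and the folder needs to be split as described above.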
Interestingly, it took close to half an hour in Mail.app for my MacBook Pro (Core2Duo at 2.2GHz) to even select those 36,000 emails -- be patient while the wheel spins! It then took nearly another hour to move them to the new folder and index them and their 6,000 attachments. (Again, I'm not sure that I really needed to wait for that indexing, but whatever.) Still, it was all wonderfully stable.
So, as the last step, I have reimported everything into Mail.app (File » Import mailboxes » files in mbox format). But I'm thinking of experimenting with Thunderbird as well, given the third-party geniuses who write extensions for it. It seems very zippy.
Finally, I get tired of reading blogs that tell me I don't need past emails. In my job (I'm a senior high school English teacher), I need them all the time. I write substantive replies to student X's questions, then rehash them for student Y, maybe years later. (It's much more complicated than that even, but you see my point!) One day I will delete the thousands I don't need, too.
Mac OS X Hints
http://hints.macworld.com/article.php?story=20100126104840894