Okay, was looking for email corpi (corpuses? No, apparently corpora) to run some tests, and found these...
- Test emails for a Ruby lib called Mail
- A single example .eml file in the FlexConfirmMail for Thunderbird GitHub repo
- EDRM doesn't seem to have Enron any more (see below), but it does have its EDRM Public Micro Dataset, which contains "4 email boxes with shared correspondence, threads and attachments" (and a lot more that's not email, but you have to grab the whole zip).
- Enron data, perhaps the most famous, large-ish email corpus.
- Its state is... complicated.
- EnronData.org sounds promising, but links are broken.
- FoundationDB claims to have it but that's from Archive.org.
- Carnegie Mellon hosts it, but has removed messages at the request of the people affected.
Prior versions of the dataset are no longer being distributed. If you are using the March 2, 2004 Version; the August 21, 2009 Version; or the April 2, 2011 Version of this dataset for your work, you are requested to replace it with the newer version of the dataset below, or make the appropriate changes to your local copy.
- It's still plenty for testing
- The Library of Congress (!!) has a copy with everyone*, but no attachments.
- Okay, they say they don't have everyone either, though I found a few at LoC that were not at Carnegie Mellon...
The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., recipient is specified in some parse-able format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.
- Carnegie Mellon says that EDRM has it with attachments ("A version of the dataset with all attachments is available from EDRM."), but that URL is dead.
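If you end up testing against the Enron messages, the no_address@enron.com placeholder mentioned in the quoted notes is worth checking for, since it marks messages that had no usable recipient. Here's a minimal sketch using Python's standard `email` module. The function names and the sample message (including the jane.doe address) are made up for illustration, and it assumes each message is available as a plain RFC 822-style string, which I haven't verified against every copy listed above.

```python
from email import message_from_string
from email.utils import getaddresses

# Placeholder address named in the dataset notes quoted above.
PLACEHOLDER = "no_address@enron.com"


def recipient_addresses(raw_message: str) -> list[str]:
    """Return the lowercased addresses found in the To, Cc, and Bcc headers."""
    msg = message_from_string(raw_message)
    headers = []
    for name in ("To", "Cc", "Bcc"):
        # get_all returns every header with that name, or the default if none exist.
        headers.extend(msg.get_all(name, []))
    # getaddresses yields (display_name, address) pairs; skip empty addresses.
    return [addr.lower() for _display, addr in getaddresses(headers) if addr]


def has_placeholder_recipient(raw_message: str) -> bool:
    """True if any recipient is the no_address@enron.com placeholder."""
    return PLACEHOLDER in recipient_addresses(raw_message)


# Hypothetical sample message, not taken from the corpus.
sample = (
    "From: jane.doe@enron.com\n"
    "To: no_address@enron.com\n"
    "Subject: test\n"
    "\n"
    "Body text.\n"
)

if __name__ == "__main__":
    print(has_placeholder_recipient(sample))
```

Filtering these out (or counting them) before a test run gives a rough idea of how many messages in your copy had their recipients rewritten.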
Check the licenses for each and enjoy.