Enron Corpus

The Enron Corpus is a large database of over 600,000 emails generated by 158 employees^[1] of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.^[2]

History

The Enron data was originally collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by Joe Bartling,^[3] a litigation support and data analysis contractor working for Aspen Systems (now Lockheed Martin), whom the Federal Energy Regulatory Commission (FERC) had hired to preserve and collect the vast amounts of data in the wake of the Enron bankruptcy in December 2001. In addition to the Enron employee emails, all of Enron's enterprise database systems,^[4] hosted in Oracle databases on Sun Microsystems servers, were captured and preserved, including its online energy trading platform, EnronOnline.

Once collected, the Enron emails were processed and hosted in the litigation platform Concordance, and then iCONECT, for review by investigators from the Federal Energy Regulatory Commission, the Commodity Futures Trading Commission, and the Department of Justice. At the conclusion of the investigation, and upon the issuance of the FERC staff report,^[5] the emails and information collected were deemed to be in the public domain, to be used for historical research and academic purposes. The email archive was made publicly available and searchable via the web using iCONECT 24/7, but its sheer volume, over 160 GB of email, made it impractical to use. Copies of the collected emails and databases were made available on hard drives.

A copy of the email database was subsequently purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst.^[6] He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer analysis of language.

Legacy

The corpus is one of the few publicly available mass collections of real emails, as such collections are typically bound by numerous privacy and legal restrictions which render them prohibitively difficult to access.^[6] Jitesh Shetty and Jafar Adibi from the University of Southern California processed the corpus in 2004, released a MySQL version^[8] of it, and published link analysis results based on it.^[9] In 2010, EDRM.net published a revised version 2 of the corpus.^[7] This expanded corpus, containing over 1.7 million messages, is now available on Amazon S3 for easy access by the research community.

References

[1] Klimt, Bryan; Yang, Yiming. "The Enron Corpus: A New Dataset for Email Classification Research". CiteSeerX: 10.1.1.61.1645.
[2] "The Enron Email Corpus". Retrieved March 5, 2011.
[3] Bartling, Joe (September 3, 2015). "The Enron Data Set - Where Did It Come From?". Bartling Forensic and Advisory. Retrieved September 3, 2015.
[4] "FERC: Industries - Enron's Energy Trading Business Process and Databases". www.ferc.gov. Retrieved September 2, 2015.
[5] FERC Staff Report - Price Manipulation in Western Markets - Findings at a Glance (March 26, 2003).
[6] Markoff, John. "Armies of Expensive Lawyers, Replaced by Cheaper Software". New York Times. March 5, 2011. p. A1.
[7] Socha, George. "EDRM Enron Email Data Set v2 Now Available". www.edrm.net.
[8] "Enron processed database".
[9] Shetty, Jitesh; Adibi, Jafar (2005). "Discovering important nodes through graph entropy the case of Enron email database". pp. 74–81. doi:10.1145/1134271.1134282.

External links

Nuix data set cleansed of PII (requires registration)
Tutorial on data modeling with the Enron Corpus
Shetty Adibi's enron email dataset download on S3 (178 MB)

This article is issued from Wikipedia, version of Friday, October 09, 2015. The text is available under the Creative Commons Attribution/Share Alike license, but additional terms may apply for the media files.