Does Social Media really matter to eDiscovery?

18 07 2011

This post, and future posts, will be hosted on the official Hudson Legal Blog, but this one is re-posted here in case anyone is still following this feed.

I’m speaking next week at the Marcus Evans Social Media Legal Risk and Strategy conference in San Francisco. I’m excited and honored to have the opportunity to share my thoughts on the impact of social media on e-discovery alongside a very impressive group of attorneys, who will be speaking on a variety of legal topics related to social media.

In e-discovery circles, social media has recently become one of the hottest topics, and with seemingly good reason. It’s without question that communication is shifting to social media at an incredible clip, and it may not be long before email is surpassed as the primary means for written electronic communication. It’s also without question that this hasn’t happened yet for business communications.

A distinct pattern, not immediately obvious, began to emerge as I read through the surprisingly large number of social media e-discovery decisions in preparation for my presentation. Yes, there’s a solid amount of precedent already, but the central questions of fact in those cases almost universally revolved around a single individual’s physical condition or state of mind.

This makes intuitive sense. While people are likely to share personal info on a social networking site – e.g., to describe their physical health or mental state – it’s far less likely that important or sensitive business material would appear here. Most businesses do have a social media presence, but its content is calculated and intentionally public. In other words, a company’s social media content is probably the last place you’d find a smoking gun.

Part of this depends on how one defines social media. There are internal corporate environments under the umbrella of what’s being called “Enterprise 2.0” with social features, such as internal blogs and wikis, and arguably SharePoint. But from an e-discovery perspective, these environments are not dissimilar to other shared resources under corporate control. And where things are complicated by the fact that this type of service is stored in the cloud, I’d call that a general cloud storage issue and not a social media-specific issue.

Think about the types of cases where e-discovery is really important. The ones with huge volumes and incredible costs, without which the need for innovation would not have been so great and the entire industry might not have existed. These are the antitrust second requests, the bet-the-company patent litigations, the multidistricts and the government investigations. They’re not the personal injury, employment discrimination, and defamation cases in the social media e-discovery caselaw to date.

In the near future, we may see more business content intertwined with the social media space. Google Plus, for example, promises to not only provide for private workspaces alongside personal social content, but also to deeply integrate social and web content. Things are going to get a lot more complicated. But for the moment, social media is more of an e-discovery novelty that’s fun to think about than a serious e-discovery problem for corporations.


The Document Family Circus

1 04 2011

Of the three listed in my last post, a family is the most well-defined and consistent relationship used in e-discovery.  “Family” is the most commonly used and universally recognized term, but the same type of relationship is also referred to as a “message unit” or “message attachment group”.  When it comes to families, every single document falls into one of three categories: (1) parent, (2) attachment, or (3) standalone.

  1. Parent:  There are only single-parent families in e-discovery – each family has one and only one parent document.  Most often, the parent is an email with attachments, but other types of documents are also treated as parents.  A document is the parent of any documents embedded within it, such as a Word document with an Excel file inside it represented by an icon.  Often .zip container files are treated as parents, with the files inside considered attachments.
  2. Attachment:   Any document that’s part of a family is an attachment, unless it’s the parent.  If an email is attached to another email, it is considered an attachment even if it has its own attachments.  While some review tools will let you navigate to a document’s immediate sub-parent, all documents under the topmost parent are still identified as part of just one single family.
  3. Standalone:  A standalone document has no family members at all.  Most files collected from non-email sources and emails without attachments are standalone documents.

In most cases, a unique metadata field in the review database will have an identifying number common to all members of a family.  In a Concordance load file, this may be “BegAttach” (described in the Production Metadata Checklist).  It may also be a special “Family ID” or “Message Unit ID”.  When looking at a new database, one of the first things you should do is figure out how to identify members of a family using document metadata. Any data export should include this field, so you’ll always be able to see the association.
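Here is a minimal sketch of that first step, assuming the family field holds the parent's own identifier and that standalone documents carry their own ID in the same field. The field name `DocID` and the sample values are hypothetical; your database may name and populate these fields differently.

```python
from collections import defaultdict

def classify_family_roles(records, family_field="BegAttach", id_field="DocID"):
    """Label each record as 'parent', 'attachment', or 'standalone'."""
    # Collect the member IDs that share each family identifier.
    families = defaultdict(list)
    for rec in records:
        families[rec[family_field]].append(rec[id_field])

    roles = {}
    for family_id, members in families.items():
        if len(members) == 1:
            roles[members[0]] = "standalone"
            continue
        for doc_id in members:
            # Assumption: the family identifier equals the parent's own ID.
            roles[doc_id] = "parent" if doc_id == family_id else "attachment"
    return roles

records = [
    {"DocID": "ABC0001", "BegAttach": "ABC0001"},  # email with attachments
    {"DocID": "ABC0002", "BegAttach": "ABC0001"},  # attached email
    {"DocID": "ABC0003", "BegAttach": "ABC0001"},  # nested attachment, same family
    {"DocID": "ABC0004", "BegAttach": "ABC0004"},  # loose file
]
roles = classify_family_roles(records)
```

Note that the nested attachment lands in the topmost parent's family, which matches the single-family rule described above.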

The great debate with respect to families is whether in initially reviewing the documents, it’s better to conduct a “contextual review” or a “four-corners review”.

  • Contextual Review means that members of the same family are reviewed together.  This has the obvious advantage of providing context to the documents, which can be particularly important for making privilege determinations.  For example, a memo describing a competitor’s product may be tagged “responsive” when reviewed alone, but properly tagged “privileged” when reviewed alongside a parent email from outside counsel explaining that she wrote the memo to show why the competitor should be sued for patent infringement.
  • Four-Corners Review refers to reviewing each individual document as if it were a standalone document.  Typically this is only done as a “first-pass” review, and must be supplemented with additional review to avoid issues like the one described above.  The main advantage of four-corners review is that it allows for more effective use of deduplication and technologies that exploit document similarities, without the burden of considering family members.
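One way to picture the supplemental review that a four-corners first pass needs: flag every family whose members received inconsistent privilege calls, and route those families to contextual review. This is only a sketch under assumptions. The tag values and the two lookup dictionaries are hypothetical stand-ins for whatever your review tool exports.

```python
from collections import defaultdict

def families_needing_contextual_review(first_pass_tags, family_of):
    """Return family IDs whose members got inconsistent privilege calls."""
    calls = defaultdict(set)
    for doc_id, tag in first_pass_tags.items():
        # Record whether each family member was tagged privileged or not.
        calls[family_of[doc_id]].add(tag == "privileged")
    # A family with both True and False needs a second, contextual look.
    return sorted(fam for fam, seen in calls.items() if len(seen) > 1)

family_of = {"EMAIL1": "EMAIL1", "MEMO1": "EMAIL1", "LOOSE1": "LOOSE1"}
first_pass_tags = {
    "EMAIL1": "privileged",   # cover email from outside counsel
    "MEMO1": "responsive",    # memo tagged on its own four corners
    "LOOSE1": "responsive",
}
flagged = families_needing_contextual_review(first_pass_tags, family_of)
```

In the memo example above, the family headed by the outside-counsel email would be flagged, so the memo gets looked at again alongside its parent.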

We’ll get into where the law stands on contextual vs. four-corners review in an upcoming post.

The Three Types of Document Relationships

25 03 2011

Documents in a collection of ESI, like people in a society, are connected to one another through different types of relationships.  There are three categories of document relationships:

  1. Family Relationships are obvious in the documents as they are kept in the ordinary course of business.  The most common family consists of an email and its attachments, but other situations – such as a .zip file and the files contained within it, or a Word document with other documents embedded within it – also constitute a family relationship.  Even before collection or any processing, if you look at a document you can see its family members too.


  2. Similarity Relationships range from exact duplicates (identical files) to “near duplicates” (based on proprietary technical algorithms) to conceptually similar documents (based on linguistic or other automated analysis of each document).  Some will argue that these are three wildly different things, but I group them all together as similarity relationships because they all share a purpose of identifying documents with similar contents or ideas.
  3. Email Thread Relationships are unique to emails and describe how they relate to each other in a conversation.  Two simple commands – reply and forward – create a complex web of conversation without an intrinsic trail.  At the processing step, email thread relationships can be identified by combining information such as subject lines, header information, metadata created by email clients such as Outlook or Gmail, and analysis of the email text itself.  Some advanced tools can recognize when emails within a chain are not present in a collection.
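To make the threading idea concrete, here is a simplified sketch that uses only one of the signals mentioned above: header information, in the form of the standard Message-ID and In-Reply-To email headers. The dictionary keys `message_id` and `in_reply_to` are hypothetical names for already-parsed header values. Real processing tools combine this with subject lines, client metadata, and text analysis, so treat this as an illustration and not as how any particular product works.

```python
def find_root(message_id, by_id):
    """Walk in_reply_to links upward until they run out."""
    seen = set()
    current = message_id
    while current in by_id and current not in seen:
        seen.add(current)
        parent = by_id[current].get("in_reply_to")
        if parent is None:
            return current
        current = parent
    # Either a message missing from the collection, or a cycle guard.
    return current

def build_threads(emails):
    """Return (threads keyed by root ID, referenced-but-missing IDs)."""
    by_id = {e["message_id"]: e for e in emails}
    threads = {}
    for e in emails:
        root = find_root(e["message_id"], by_id)
        threads.setdefault(root, []).append(e["message_id"])
    missing = sorted({
        e["in_reply_to"] for e in emails
        if e.get("in_reply_to") and e["in_reply_to"] not in by_id
    })
    return threads, missing

emails = [
    {"message_id": "<a@x>", "in_reply_to": None},
    {"message_id": "<b@x>", "in_reply_to": "<a@x>"},
    {"message_id": "<d@x>", "in_reply_to": "<c@x>"},  # <c@x> was never collected
]
threads, missing = build_threads(emails)
```

The `missing` list is a crude version of what the more advanced tools do when they recognize that an email in a chain is not present in the collection.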

The next several posts will take a closer look at each of the three types of document relationships, how they interact and interfere with each other, and legal obligations in discovery with respect to document relationships.  This is an area of e-discovery that’s rarely handled well in practice, and worthy of deep exploration.  Anyone who thinks they have an answer to the wide array of issues created by document relationships probably hasn’t thought about the topic enough!

Meet the Metadata: Nat’l Day Laborer Point 5

17 03 2011

Judge Scheindlin’s National Day Laborer opinion was a wonderful early Valentine’s Day gift to practitioners, as she provided a list of metadata fields which can now be considered the standard for production of ESI.  Today’s entry in this series of practice points from the opinion discusses those fields, and a handy checklist can be downloaded here.

The court orders defendants to include “load files that contain the following fields, which apply to all forms of ESI” with all future productions.  More significantly, footnote 41 includes this sentence: “I believe that these are the minimum fields of metadata that should accompany any production of a significant collection of ESI.”

There is legitimate debate about whether Scheindlin intended these fields to become a broadly-applicable standardized production format.  For example, Bill Belt observes that in footnote 44, Scheindlin writes “I am certainly not suggesting that the Proposed Protocol should be used as a standard production protocol in all cases,” and concludes that the court did not intend to create a master protocol for all to follow.


I think it’s also possible to read footnote 44 as only disclaiming the capitalized “Proposed Protocol” offered by the plaintiffs and described earlier in the opinion.  If so, footnote 41 still stands as an undiminished endorsement of the listed metadata fields for all future ESI productions in any case.

So if it’s my call, I’m including a load file with the exact fields in National Day Laborer.  Let the other side make the argument that it shouldn’t be the standard.  I’m in good company: Ralph Losey’s robo-lawyer uses the same strategy at his virtual 26(f) conference (the case is referred to as “n-DEE-l-o-n…VEE ice” – check out this instant classic if you haven’t already).

Let’s look at the required fields. There are three sections, for (1) all ESI, (2) additional fields for email, and (3) fields for paper documents.

Download the checklist here!

Fields for all ESI

  1. Identifier
  2. File Name
  3. Custodian
  4. Source Device
  5. Source Path
  6. Production Path
  7. Modified Date
  8. Modified Time
  9. Time Offset Value

Additional Fields for Email

  1. To
  2. From
  3. CC
  4. BCC
  5. Date Sent
  6. Time Sent
  7. Subject
  8. Date Received
  9. Time Received
  10. Attachments

Fields for Paper Documents:

  1. Bates_Begin
  2. Bates_End
  3. Attach_Begin
  4. Attach_End

In this supplemental order dated Feb. 14, Judge Scheindlin clarifies that full text should also be included as a metadata field for all ESI.  I discussed the full-text field in detail in my previous post on redactions.
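If you want a quick sanity check before a production goes out, something like the following sketch compares a load file's column headers against the fields listed above, plus the full-text field from the supplemental order. The exact column spellings in a real load file are an assumption here, so the comparison ignores case and surrounding whitespace.

```python
ALL_ESI = ["Identifier", "File Name", "Custodian", "Source Device", "Source Path",
           "Production Path", "Modified Date", "Modified Time", "Time Offset Value"]
EMAIL = ["To", "From", "CC", "BCC", "Date Sent", "Time Sent", "Subject",
         "Date Received", "Time Received", "Attachments"]
PAPER = ["Bates_Begin", "Bates_End", "Attach_Begin", "Attach_End"]
FULL_TEXT = ["Full Text"]  # added by the Feb. 14 supplemental order

def missing_fields(header):
    """Return the listed fields that do not appear in the load file header."""
    present = {name.strip().lower() for name in header}
    required = ALL_ESI + EMAIL + PAPER + FULL_TEXT
    return [field for field in required if field.lower() not in present]

# Hypothetical header row with two fields left out.
header = [f for f in ALL_ESI + EMAIL + PAPER if f != "BCC"]
gaps = missing_fields(header)
```

An empty result means every listed field has a column. It says nothing about whether the columns are actually populated.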

One interesting question is whether it’s important for paper documents to be distinguished from ESI in the production.  One of the plaintiffs’ complaints was that “paper and electronic records were indiscriminately merged together in one PDF file.”  The court later stated that defendants did not comply with the July 23rd email (discussed in a previous post) by “merging paper with electronic records”.

This suggests that it is important to distinguish paper from ESI in a production, although Scheindlin does not provide a metadata field that explicitly states the source format.  Fields such as “Source Device” and “Source Path” will be blank for paper documents and should be populated for all ESI, so there should be a way to tell.  Given the apparent importance of making the distinction, I’d suggest negotiating a separate “Source Format” field, or better yet have a separate prefix for paper documents.  This will also make electronic production easier since you won’t have to worry about synching up with the bates numbers from paper documents.
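The blank-field heuristic is easy to sketch. The record layout below (a dictionary keyed by field name) is an assumption for illustration only.

```python
def infer_source_format(record):
    """Guess 'ESI' or 'paper' from whether the source fields are populated."""
    has_source = any(
        (record.get(field) or "").strip()
        for field in ("Source Device", "Source Path")
    )
    return "ESI" if has_source else "paper"

esi_doc = {"Identifier": "ABC0001", "Source Device": "Laptop-01",
           "Source Path": "C:\\Users\\jdoe\\memo.docx"}
paper_doc = {"Identifier": "PAP0001", "Source Device": "", "Source Path": None}

formats = [infer_source_format(esi_doc), infer_source_format(paper_doc)]
```

The weakness is obvious: an ESI record whose source fields were left blank by mistake would be misread as paper, which is one more reason to negotiate an explicit field or a separate prefix.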

So without further ado, the rest of the information I’ll provide on production metadata comes in the form of this (hopefully) useful Production Metadata Checklist.  It should make a great starting point for discussing production details per the National Day Laborer (NDLON) standards, and is worth glancing at before your next 26(f) conference.

Please feel free to download this and share freely.  Would also welcome comments or contact if anyone has suggestions or would like to collaborate on updated versions.

Production Metadata Checklist

Excel in its Native State: Nat’l Day Laborer Point 4

7 03 2011

The court in National Day Laborer held that Microsoft Excel spreadsheets should be produced as native files, which brings us to this fourth practice point in a series stemming from the opinion.

Aside from the fact that plaintiffs requested Excel files in native, Judge Scheindlin doesn’t give much reasoning for why she orders it.  It’s tempered by footnote 35, and later where the court writes: “the Government may produce the spreadsheets in TIFF format with load files containing the applicable metadata fields, if it can demonstrate why native production of spreadsheets would inevitably reveal exempt information.”  However, there are plenty of reasons why producing Excel files as TIFFs should be considered a failure to meet discovery obligations, not the least of which is that they don’t scale well to the printed page in a way that makes it clear what’s happening on the spreadsheet.

In litigation, documents are reviewed before they are produced for two main reasons:

  • Privilege: ensure no privileged information is being produced and no privilege is waived
  • Evidentiary: gain familiarity with the substance of the outgoing documents, so you: 
    • don’t get blindsided at a deposition by a document you weren’t aware existed
    • can begin preparing your case
    • get key information in the hands of your expert witnesses early
    • anticipate arguments from the other side, both on the sufficiency of your production and on the merits of your case

Native Excel files can be dangerous for both of these reasons, but more likely the second than the first.  The biggest problem with a spreadsheet from the review perspective is that what you see on the screen rarely shows you everything you need to know about the document before producing it.

First, there’s the hidden rows and columns issue, which is well-known.  While text or other information can be hidden in most office documents, it’s much more common for Excel users to hide cells.  Most vendors will offer to “un-hide” all rows and columns before review.  This is safer, because you don’t risk missing important or privileged information lurking in the hidden cells.  But “un-hiding” might also alter the document in a way that makes it less clear – for example if intermediate steps in a calculation were hidden in unlabeled columns, seeing the spreadsheet with all of the columns unhidden could be confusing – these should be re-hidden.  If the cells were hidden as the document was kept in the ordinary course of business, shouldn’t that be how it is produced?  Here’s the first practice point for producing Excel files in native:

  1. It’s crucial to look at the information in hidden rows and columns, but the document should be produced in the state it was originally found.  This can be achieved by keeping an un-altered copy of every Excel file for production purposes, but having a version that the reviewer can safely tinker with in order to uncover hidden info.

Second, the formulas in the spreadsheet may be more important evidence than the data that appears in the cells.  This is as good an argument as any for native production of Excels.  The data just shows results; formulas show the calculations, and can shed light on the spreadsheet creator’s theory and intent.  There is a setting to show formulas in Excel, so one could technically overcome this issue by producing two versions of the TIFF, one showing the formulas, but due to the length of formulas those TIFFs would be even more difficult to review.  It makes a lot more sense to just have the ability to move through the spreadsheet and examine the formula bar, and you should:

  2. Always review (or have your expert witness review) the formulas in a spreadsheet and understand what information you’re producing by turning over those formulas.

Third, realize that if you’re giving it up in native format, a spreadsheet is a machine, more of a “thing” under Rule 34 than a “document”.  The custodian of an Excel file has the ability to change information in the spreadsheet to create scenarios and test the process behind the results.  The way that a spreadsheet functions may even be the most central issue in an intellectual property or commercial litigation.  In turning an Excel file over to a requesting party, you’re potentially giving that party a powerful tool that can be used to make an effective case in a way that a static document cannot:

  3. Before you produce, learn how each spreadsheet functions and anticipate how it might be used to make the other side’s case, as well as your own.  This includes analyzing charts, pivot tables, macros, how protection is used, internal and external linking, and any other advanced functionality.  Understand that this review may require both an Excel expert and a subject matter expert.

Fourth, unlike Microsoft Word documents, headers and footers in Excel files are not visible in the views used most often.  These may contain important information, including the author’s privilege intentions.  Check it out before you produce:

  4. Use the “print preview” or “page layout” views to examine the headers and footers of native Excel files.
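Several of the practice points above can be pre-screened automatically without touching the original file. A modern .xlsx file is a zip archive of XML parts, and each worksheet part records hidden rows, hidden column ranges, cell formulas, and header/footer text. The sketch below reads one worksheet's XML and reports those items. It assumes you have already pulled the sheet XML out of the archive (typically a part named like `xl/worksheets/sheet1.xml`), it ignores older .xls files, macros, charts, and links, and the sample XML is made up for illustration. It is a triage aid, not a substitute for opening the file.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

def inspect_sheet_xml(sheet_xml):
    """Report hidden rows/columns, formulas, and header/footer text."""
    root = ET.fromstring(sheet_xml)
    hidden = ("1", "true")
    rows = root.findall(".//m:sheetData/m:row", NS)
    cells = root.findall(".//m:sheetData/m:row/m:c", NS)
    return {
        "hidden_rows": [int(r.get("r")) for r in rows if r.get("hidden") in hidden],
        # Column ranges are 1-based (min, max) pairs, so (2, 3) means B through C.
        "hidden_cols": [(int(c.get("min")), int(c.get("max")))
                        for c in root.findall(".//m:cols/m:col", NS)
                        if c.get("hidden") in hidden],
        # Shared formulas may appear with empty text on dependent cells.
        "formulas": {c.get("r"): c.find("m:f", NS).text
                     for c in cells if c.find("m:f", NS) is not None},
        "headers": [el.text for el in root.findall(".//m:headerFooter/*", NS)],
    }

SAMPLE = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <cols><col min="2" max="3" width="9" hidden="1"/></cols>
  <sheetData>
    <row r="1"><c r="A1"><v>10</v></c></row>
    <row r="2" hidden="1"><c r="A2"><f>A1*2</f><v>20</v></c></row>
  </sheetData>
  <headerFooter><oddHeader>&amp;CPrivileged and Confidential</oddHeader></headerFooter>
</worksheet>"""

report = inspect_sheet_xml(SAMPLE)
```

Because this only reads the XML, the file you produce stays in the state it was found. The `&C` prefix in the header text is Excel's code for the center section of the header.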

Most of the more commonly used office documents are just means for communication, but spreadsheets are the exception.  TIFFs of Excel files – more so than with emails, Word documents, or PowerPoints – really don’t tell the whole story or anything close to it.  The court should have omitted the option to produce in TIFF and mandated native production of Excel files in all cases. The “exempt information” the court was concerned would be revealed in this FOIA case would have been personal info – privacy concerns may have motivated the leniency in the opinion.

With native format now the production default for spreadsheets, knowing how to get information from Excel files is an important skill on both sides of an electronic discovery request.  The steps described above can provide a great advantage to savvy counsel on the receiving end of a production from an opponent unaware of the depth of information and utility in their native Excel files.

Do you have a bad (or good) experience with the production of native Excel files?  I’m sure I didn’t cover everything and would love to hear your thoughts on the topic – please comment.

Redacting is just Dacting Again: Nat’l Day Laborer Point 3

4 03 2011

Welcome to part three (or four, depending on whether you count the introductory “Quick Takeaways” post) of a series on practice points from National Day Laborer.

Although “native format is often the best form of production,” the court stopped well short of requiring it as the standard, and even went so far as to say “it is not feasible where a significant amount of information must be redacted.”  Footnote 34 addresses the redaction/metadata problem, something I’ve come across often and would like to give a little background on.

Document Text in the Review Tool

If you’ve used any review tool, you know that you can usually view each document a number of ways.  These usually include the native document itself, a built-in viewer that provides a native-like rendering of the document, a TIFF or PDF image of each page in the document, and if all else fails, the “Plain Text” of the document.

To be able to run searches on a database of documents, there must be some body of text associated with each document in the database.  This searchable text is created when documents are indexed and processed, and is usually included in one of the fields in the database so it’s visible in the review tool.

  • OCR Text

One way to create the searchable text is to create an image (e.g., TIFF or PDF) of the document and then use optical character recognition (“OCR”) to “read” the text from the images.  When dealing with paper documents, scan-and-OCR is the only way to get the text of a document, but OCR is also used for electronic documents where text cannot be obtained by other means.  The OCR process will usually introduce a few errors into the text because character recognition isn’t perfect and sometimes has problems with text that’s formatted certain ways.

  • Extracted Text

The other way to get the text is to “extract” it directly from an electronic document.  For example, when an email gets processed, the email sender might go into the “Sender” field in the review database, the recipient into the “Recipient” field, and the body of the email directly into the “Body” field.  When another type of document is processed, such as a Microsoft Word doc, all of the text in the document might go into the “Body” field.  Extracted text, coming directly from the document file, is generally more accurate and complete than OCR text from an image of the document.  Extraction doesn’t work perfectly for every type of document, but when electronic documents are loaded into a review tool, directly-extracted text is almost always used to populate the database and make it searchable.

UPDATE: In a supplemental order on Feb. 14, Judge Scheindlin specifically describes “Full Text” as a metadata field that must be provided in every ESI production.  She describes the field as “The body text of a document, attachment, or email message, extracted directly from the electronic source or derived by optical character recognition of scanned paper documents.”

Redactions in a Review Tool

If a document is redacted in the review tool, an image is made of the document (if there isn’t one already), and then a box redacting out a portion is drawn on the image.  While this makes the text invisible in the image, it doesn’t change the extracted text.  The searchable text produced to the other side comes from what’s in the review tool, and it will still show the text that got redacted unless something is done about it.

When redacted documents will be produced, a new version of the searchable text must be created specifically for production – it’s done by OCR’ing the redacted images, so only the text remaining in the image gets produced. This is an essential step in any production of redacted documents, and fortunately one that most e-discovery vendors are fully aware of.  The need for this common fix is part of what Judge Scheindlin is alluding to in the last sentence of footnote 34, “both static images and the metadata would have to be redacted”.
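The logic of that fix fits in a few lines. In this sketch, the dictionary keys `redacted`, `extracted_text`, and `ocr_of_redacted_image` are hypothetical names for values your vendor's workflow would supply. The second function is a simple quality-control check for redacted text leaking into the production text.

```python
def production_text(doc):
    """Pick the searchable text that is safe to hand to the other side."""
    if doc["redacted"]:
        # Never ship the original extracted text for a redacted document.
        return doc["ocr_of_redacted_image"]
    return doc["extracted_text"]

def leaked_terms(doc, redacted_terms):
    """Return any redacted terms that still appear in the production text."""
    text = production_text(doc).lower()
    return [term for term in redacted_terms if term.lower() in text]

doc = {
    "redacted": True,
    "extracted_text": "Call Jane Roe at 555-0100 about the merger.",
    "ocr_of_redacted_image": "Call [REDACTED] about the merger.",
}
safe_text = production_text(doc)
leaks = leaked_terms(doc, ["Jane Roe", "555-0100"])
```

A non-empty `leaks` list means the redacted images and the production text are out of sync, and the production should be held until that is fixed.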

Redactions and Native File Metadata


Footnote 34 mainly elaborates on why the court allows a non-native production especially when there are a lot of redactions, and most of the footnote is likely referring to the concern that metadata in the native file can potentially reveal text that has been deleted from a document.

This article out of Alabama does a great job of explaining the dangers of producing metadata, which go well beyond just the extracted text issue I described above: Metadata: What It Is and Why You Should Care, by Susan J. Silvernail, Alabama Association for Justice, August 11, 2007.

Clearly if someone were to just delete text in a native file and type in “Redacted”, there would be a number of problems in addition to the metadata ghosts.  The document just wouldn’t be the same document anymore.  I’m guessing this is why Judge Scheindlin, and people like the smart guys at Logik, say it’s impossible to redact a native file.

Still, I’m not 100% convinced.  The need for redaction just feels like a lame excuse for getting out of native production, which should be the standard. After all, Rule 45 used to just say documents should be produced “as they are kept in the ordinary course of business” – that’s what we should be aiming for, even if there is now a “reasonably usable form” exception for ESI.  The discussion has been on for a while, and this whitepaper from Christine Musil (with input from Craig Ball and George Socha) is a great place to start if you’re interested in the topic.

For now though, National Day Laborer gives you a nice opinion to support your TIFF production, especially if you can point to a heavy need for redactions.

“Best Practices” are Not a Substitute for Legal Research: Nat’l Day Laborer Point 2

3 03 2011

This is the second in a series of practice points derived from the recent National Day Laborer decision out of the SDNY.  The full list is here in the “Quick Takeaways” post.

A question that comes to mind while reading Judge Scheindlin admonish the Government in National Day Laborer is this: did they really intend to obscure information by producing in this format?  That’s the key here – in reading this opinion there’s almost a presumption that the Government was intentionally hiding information or purposely making it difficult by not producing metadata.  This would have been a terribly silly strategy.  I just don’t believe that the U.S. Attorney meant to act in bad faith.

This is a speculative leap, but it’s much more likely that the person responsible for creating this production had produced documents the same way before, and honestly thought it would be sufficient.  Once the documents were produced in this lousy format, the Government was left having to make up “rhetorically nuanced” (i.e., poor) arguments to defend its production, and Judge Scheindlin seized the opportunity to slap them around a bit.

I can say from first-hand experience with litigation support groups at many law firms that it’s not uncommon for the production standard to be “this is how we usually do it”.  There’s nothing wrong with that if the practices have been based on a legal analysis of what’s required, but these decisions shouldn’t be made by non-lawyers.  Paralegals, litigation support professionals, e-discovery consultants, and vendor project managers need to make sure that the attorneys who are ultimately responsible for the production are the ones making not only the substantive calls, but also the procedural legal decisions about discovery.  And those attorneys, even when they have far less e-discovery experience than the support staff advising them (which is often the case), should not blindly rely on the advice of non-lawyers or any “best practices” without doing the legal research necessary to ensure they are meeting their discovery obligations.