Predictive Discovery: Taking Predictive Coding out of the “Black Box”
Transcript:
Woman: [0:11] Ladies and gentlemen, we are about to begin. Angela, please go ahead.
Angela Navarro: [0:19] Thank you. Hello, everyone. My name is Angela Navarro. Welcome to today's webcast, titled "Predictive Discovery: Taking Predictive Coding out of the Black Box." This event is brought to you by InsideCounsel and sponsored by FTI Technology.
[0:34] I will help moderate this event, but before we get to the topic, let's get some simple housekeeping items out of the way. If you have a question for one of our speakers, please enter it in the Q&A widget on your console.
[0:47] We will endeavor to answer your questions throughout the presentation. We invite you to ask away. If we don't get to your question, you may receive an email response.
[0:58] In addition, there are some other customizable functions to be aware of. Every window you see, from the slide window to the Q&A panel, can either be enlarged or collapsed. If you want to change the look and feel of your console, please go right ahead.
[1:14] Now, let's meet today's speakers. Our first speaker is Joe Looby, Senior Managing Director at FTI Technology.
[1:22] Joe's background ranges from participation in the National Institute of Standards and Technology's Text Retrieval Conference to co-chairing the Sedona Conference's first panel proposing the use of statistics to defend reasonable e-discovery efforts, and co-authoring the Sedona Conference's "Best Practices on the Use of Search and Information Retrieval Methods in E-Discovery" and "Achieving Quality in the E-Discovery Process."
[1:46] Joe is a regular speaker, author, and consultant to law firms and corporations. A former US Navy JAG Lieutenant, an experienced regulator, and published software developer, Joe has appeared before regulatory agencies and provided expert testimony.
[2:01] Welcome, Joe. Thanks for joining us.
Joe Looby: [2:04] Thank you.
Angela: [2:05] Our next speaker is Jason Baron. Jason Baron is the Director of Litigation for the National Archives and Records Administration and is an internationally recognized speaker and author on the subject of the preservation of electronic records.
[2:20] He is the 2011 recipient of the prestigious Emmett Leahy Award and past co-chair and current Steering Committee member of the Sedona Conference working group on electronic document retention and production.
[2:33] He has been a founding co-coordinator of the US National Institute of Standards and Technology TREC Legal Track, a founding co-organizer of the Discovery of ESI workshop series devoted to search issues, and a Visiting Scholar at the University of British Columbia.
[2:50] During his time in public service, Mr. Baron has received numerous awards. His degrees are from Wesleyan University and the Boston University School of Law.
[3:01] Finally, it is a pleasure to introduce, and welcome, Professor Daniel Slottje, an applied economist and statistician at the Southern Methodist University in Dallas, Texas. He is also a senior managing director in FTI Consulting's economic consulting services practice.
[3:21] Professor Slottje is a testifying expert, having given approximately 200 depositions, and testified live at trial, or at international arbitration proceedings approximately a hundred times over the past 25 years. Professor Slottje has published over 150 books and journal articles.
[3:41] He was named to the Applied Econometrician Hall of Fame in 1999, and ranked in the top three in the world. Dr. Slottje has been called upon to testify in some of the nation's highest-stakes litigations in intellectual property matters, statistical and labor employment cases, healthcare litigation and anti-trust matters.
[4:01] Welcome to our speakers. Now, I would like to turn the call over to Joe Looby. Joe, please go ahead.
Joe: [4:10] Thank you, Angela. In a survey conducted earlier this year, we found only about half of corporate in-house and external lawyers using predictive coding, but even companies using the technology, we found, were largely experimenting with it. We asked ourselves, "Why has the uptake been so slow?"
[4:29] Our study and experience suggest two primary reasons -- a reluctance to invest in something that the courts might not support, and a lack of understanding of how predictive coding works.
[4:40] Two recent court cases are helping eliminate the first concern. In February, the parties in Da Silva Moore agreed to e-discovery by predictive coding. This was the first validation of the use of predictive coding by the courts.
[4:54] In April, the Virginia State Court in Global Aerospace permitted the defendant to use predictive coding to search two million electronic documents. In his approval, the judge cited studies noting the benefits of predictive coding, including higher accuracy and lower cost than other approaches.
[5:11] The second concern, though, is yet to be overcome. Our survey found that one of the biggest impediments to the adoption is that many lawyers perceive this technology as a "black box." Today, we explain how predictive discovery, at least as we call it, works.
[5:30] In commercial litigation, the process of discovery can place a huge burden on defendants. Often they have to review millions, or even tens of millions, of documents, to find those that are relevant to a case. The search and review costs can spiral quickly. Traditional methods such as manual reviews are time consuming and expensive, and keyword search is notoriously inaccurate, because a handful of keywords is a blunt tool for extracting relevant documents from a huge pool.
[5:57] Predictive coding is a different approach. It's a machine-based method that can rapidly classify which documents in a large pool are relevant to a lawsuit by taking an expert attorney's judgments about the relevance of a sample set of documents and building a computer model that extrapolates that expertise to the rest.
[6:15] In a typical case, a plaintiff might ask a company to produce all the documents regarding some specifics about Project X. The company prepares a list of all its employees involved in the project, and from all the places where teams of employees working on Project X store documents, the firm secures all the potentially relevant emails and documents.
[6:37] It can easily total a collection of one million or more items. These documents are then broken into two subsets -- the training set and the prediction set. To have a high level of confidence in the sample, the training set can be around 17,000 documents.
[6:54] One or more expert attorneys code the training set's documents as responsive or non-responsive with respect to the litigation. The result of this exercise will predict with reasonable accuracy the total number of responsive documents in the collection.
[7:11] For example, a sample of 17,000 documents is sufficient to ensure that our answer will be accurate to plus or minus one percent at the 99 percent confidence level. If the number of responsive documents in the sample set is 3,900 -- that is, 23 percent of 17,000 are responsive -- we can say we are 99 percent confident that 23 percent, plus or minus one percent, of the one million documents are responsive.
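As a rough illustration of where a sample of that size comes from, here is a minimal Python sketch of the standard sample-size formula for a proportion; the z-value of 2.576 for 99 percent confidence and the worst-case proportion of 0.5 are textbook conventions, not figures from the webcast.

```python
import math

def sample_size(z, margin_of_error, p=0.5):
    """Documents to review to estimate a proportion to within a given margin of error.

    Standard formula n = z^2 * p * (1 - p) / E^2, where z = 2.576 corresponds to
    99 percent confidence and p = 0.5 is the conservative worst case.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

print(sample_size(2.576, 0.01))  # ~16,590, in line with the roughly 17,000-document training set
print(sample_size(2.576, 0.03))  # ~1,844, the figure quoted later for a three percent margin
```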
[7:38] That defines our goal, which is to find the estimated 230,000 responsive documents in the million. So, let's turn to the computer model to see how it can help us achieve that goal.
[7:55] A computer model has three parts -- inputs, inner logic, and outputs.
[8:01] Inputs are what a machine uses to learn. From the sample set of documents, the software creates a list of all the features of every document in the sample set. These features are single words or strings of consecutive words, up to three words long, called unigrams, bigrams, and trigrams. Or, to put it another way, one-, two-, or three-word phrases, as shown.
[8:25] So, a document containing the sentence, "Skilling's abrupt departure will raise suspicions of accounting improprieties and evaluation issues," gets converted into 33 features -- unigrams, bigrams, and trigrams -- that we load into the computer model, and the inner logic does its work.
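As an illustration of that feature-extraction step, here is a minimal Python sketch that turns a sentence into unigram, bigram, and trigram features. The simple whitespace tokenization is an assumption for illustration, not a description of FTI's actual software.

```python
def ngram_features(text, max_n=3):
    """Convert text into unigram, bigram, and trigram features (one- to three-word phrases)."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

sentence = ("Skilling's abrupt departure will raise suspicions of "
            "accounting improprieties and evaluation issues")
print(len(ngram_features(sentence)))  # 12 words -> 12 unigrams + 11 bigrams + 10 trigrams = 33
```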
[8:49] The inner logic is how the computer model learns, and it's a process of trial and error. The model picks a document from the training set and tries to guess if it's responsive. To do this, it sums the importance rates of the words in the document to give it a score.
[9:06] If the score is positive, the model guesses that the document is responsive. This is the trial. If the model guesses wrong, in other words the model and attorney disagree, the model adjusts the importance rates of the words recorded in the model's weight table. This is, of course, the error.
[9:24] Gradually, through this trial and error process, as the model iterates through the 17,000 documents, the importance rates become better indicators of responsiveness. Let's look at an example.
[9:38] For simplicity, let's assume that we have a document with just one feature, number 32, called improprieties evaluation. At the start, the importance rate of feature 32 is zero. The attorney expert marked this document as responsive, which we'll signify in the program with the number positive one.
[9:59] The computer sums the features' importance rates for the document; the result is zero. So, the model guesses that this document is not responsive. During training, positive scores equate to responsive, and negative or zero document scores equate to non-responsive.
[10:18] The model guessed wrong. It did not match the attorney expert's code, so it recalculates feature number 32's importance rate using the simple mathematical formula shown.
[10:29] On the second pass through the documents, when feature 32 in this document is encountered, the model scores the document by summing its features. Feature 32's weight is now one, so the document's score is one, and the model guesses correctly.
[10:45] To recap, the model tried. It erred, so it reset the feature weight, and then it got it right.
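To make that trial-and-error loop concrete, here is a minimal Python sketch of a perceptron-style learner like the one described. The specific update rule -- adding the attorney's +1 or -1 code to each feature's weight after a wrong guess -- is the textbook perceptron rule, offered here as an illustrative assumption rather than FTI's exact formula.

```python
from collections import defaultdict

def train(training_set, passes=3):
    """Learn a weight table by trial and error over attorney-coded documents.

    training_set: list of (features, label) pairs, with label +1 for responsive
    and -1 for non-responsive as coded by the attorney expert.
    """
    weights = defaultdict(float)  # importance rate for each feature, starting at zero
    for _ in range(passes):
        for features, label in training_set:
            score = sum(weights[f] for f in features)  # the trial: sum the importance rates
            guess = 1 if score > 0 else -1             # zero or negative counts as non-responsive
            if guess != label:                         # the error: model and attorney disagree
                for f in features:
                    weights[f] += label                # adjust the importance rates
    return weights

# The one-feature example from the webcast: feature 32 starts at zero, the model guesses
# wrong (score zero means non-responsive), bumps the weight to +1, and gets it right next pass.
weights = train([({"improprieties evaluation"}, +1)])
print(weights["improprieties evaluation"])  # 1.0
```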
[10:54] The third component of the computer model is the output, the all important frame of reference, the criteria or values in which we can make judgments about the large collections of documents in the case.
[11:10] Pretty soon, the computer has created a weight table with scores of importance and unimportance for the words, phrases, and metadata in the training set, an excerpt of which might look like this.
[11:21] As the software reviews more documents, the weight table becomes an increasingly better reflection of the attorney's expert judgment. This process is called machine learning. The computer can take several passes through the sample set of 17,000 documents until the weight table is as good as it needs to be.
[11:40] Once the weight table is ready, the program is run on the full sample to generate a score for every document according to the final weight table. The model simply adds together the features and weights in a document to calculate a document score.
[11:53] In this illustration, we show a document where the sum of the features gives a score of -5.3, probably a non-responsive document, and a document scored positive 8.7, probably a responsive document.
[12:10] We turn now to the issue of defensibility, or as we call it, testing one, two, three. If predictive coding is to be used to select the highly responsive documents from the collection of one million, and to discard the highly non-responsive documents, a line has to be drawn.
[12:30] The line is the minimum score for documents that will be considered responsive. Everything with a higher score, in other words, above the line, will, subject to review, comprise the responsive set.
[12:43] In our 17,000-document sample of which the expert found 3,900 documents to be responsive, there might be 7,650 documents that have a score of greater than zero. Of these, 3,519 are documents the expert coded responsive, and 4,131 are documents that the expert did not code as responsive.
[13:07] Therefore, with the line drawn at zero, we will achieve a recall of 90 percent. Recall means the model found 90 percent of the 3,900 responsive documents, and will achieve 46 percent precision.
[13:20] Precision means that for every 100 documents the model returned, 46 were responsive, but, unfortunately, 54 were non-responsive. Recall and precision are the key performance indicators, or KPIs, of discovery.
[13:37] By drawing a line higher, one can increase the ratio of precision to recall. For example, if the line is drawn at 0.3, the precision may be significantly higher, say 70 percent, but with lower recall, for example, 80 percent, because some of the responsive documents between zero and 0.3 are missed when we draw the line higher.
[14:00] Typically there's a tradeoff. As precision goes up and the model delivers fewer non-responsive documents, recall declines and more of the relevant documents are left behind. However, compared to other techniques, the tradeoff can be managed much more precisely.
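Using the round numbers from the example above, here is a minimal Python sketch of the recall and precision arithmetic at the zero cutoff; the counts are the ones quoted in the webcast.

```python
# Figures from the example: 3,900 expert-coded responsive documents in the 17,000 sample,
# 7,650 documents scoring above zero, 3,519 of which the expert coded responsive.
responsive_in_sample = 3_900
above_line = 7_650
true_positives = 3_519

recall = true_positives / responsive_in_sample  # share of responsive documents the model found
precision = true_positives / above_line         # share of retrieved documents that are responsive
print(f"recall {recall:.0%}, precision {precision:.0%}")  # recall 90%, precision 46%
```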
[14:22] Test two is a better indicator of the model's estimated performance in the prediction set because we use the model to predict on documents that the model did not train on.
[14:33] To do this, we break the original 17,000-document reference set into 10 buckets, each containing 1,700 documents. We train the model on nine of the buckets, or 15,300 documents, and predict on the 10th. The 10th bucket is one that we hold back from training.
[14:49] By comparing the model's performance in the 10th bucket to the attorney expert, we see the model achieved 82 percent recall and 72 percent precision.
[14:59] This is just another illustration of the prior screen. We train the model on 9 buckets, and we predict on the 10th. The difference is we can use the computer to keep doing this over and over and over again, each time changing the 15,300 documents we train on and the 1,700 documents in the 10th bucket that we hold back and predict on.
[15:21] This gives us a range of the model's performance on documents it did not train on. For example, 75 to 88 percent recall, and 68 to 86 percent precision.
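Here is a minimal Python sketch of that hold-one-bucket-out procedure, known as k-fold cross-validation. The train_fn and evaluate_fn arguments are hypothetical stand-ins for whatever training and scoring routines are actually used.

```python
import random

def cross_validate(reference_set, train_fn, evaluate_fn, k=10, seed=0):
    """Split the coded reference set into k buckets, train on k-1, test on the held-out bucket.

    reference_set: list of attorney-coded documents.
    train_fn(docs) -> model; evaluate_fn(model, docs) -> (recall, precision).
    Returns one (recall, precision) pair per held-out bucket, giving a range of
    performance on documents the model did not train on.
    """
    docs = list(reference_set)
    random.Random(seed).shuffle(docs)
    buckets = [docs[i::k] for i in range(k)]
    results = []
    for held_out in range(k):
        training = [doc for i, bucket in enumerate(buckets) if i != held_out for doc in bucket]
        model = train_fn(training)
        results.append(evaluate_fn(model, buckets[held_out]))
    return results
```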
[15:33] Using this test, the model can estimate the range of precision and recall for any line. For example, if we draw the line at greater than zero, greater than 0.5, and so on.
[15:44] Short of our attorney expert reviewing more documents, this is the best indicator we have of how well the model will perform at correctly classifying documents in our 983,000 prediction set. But we're not done.
[15:58] Once the tradeoff between recall and precision has been made and the line determined, the full set of documents can be scored and classified as responsive or non-responsive. The outcomes will generally be close to those predicted by tests one and two. However, there will usually be some degradation because the sample set never perfectly represents the whole.
[16:19] As a final test, samples will be taken from both above and below the line to make sure results meet expectations. In this example, samples of 2,000 are selected so we can be accurate to plus or minus three percent at the 99 percent confidence level.
[16:34] Experts will review these random samples against the computer rating. They should use the results to confirm that the model is working as predicted and that the client's desired goals have been achieved.
[16:45] Here, by drawing the line at 0.3, the model found 83 percent of the responsive documents, and 7 of every 10 documents it found were responsive with a margin of error plus or minus 3 percent.
[16:57] This is quite remarkable given that our attorney expert only had to review 21,000 documents -- the 17,000-document training or reference set and two 2,000-document validation samples, which we call samples A and B -- to achieve this result.
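As a check on the validation-sample arithmetic, here is a minimal Python sketch of the margin-of-error formula for a proportion; the worst-case p = 0.5 is the same conservative convention used for the training sample, assumed here rather than taken from the webcast.

```python
import math

def margin_of_error(n, z=2.576, p=0.5):
    """Margin of error for estimating a proportion from a sample of size n.

    z = 2.576 corresponds to 99 percent confidence; p = 0.5 is the conservative worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

print(f"{margin_of_error(2_000):.1%}")   # ~2.9%, i.e. roughly plus or minus three percent
print(f"{margin_of_error(17_000):.1%}")  # ~1.0% for the 17,000-document reference set
```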
[17:13] Attorneys have been using search terms for e-discovery for many years, but the biggest drawback of search is that people determine the terms on which to search. A set of one million documents such as this may have 100,000 distinct words, and while people can easily come up with some likely terms, for example, the name of the project under investigation and the key players, this is a primitive way of mining a complex set of documents.
[17:38] With predictive discovery, attorneys begin with no preconceptions about the most relevant words. They let a computer figure that out, considering every word, and every combination of two and three consecutive words without constraint. Further, the computer can use the fact that certain words will indicate against responsiveness.
[17:57] To be sure, predictive discovery isn't right for every e-discovery process. In particular, cases with many documents of which only a handful are responsive -- searches for needles in a haystack -- should be approached differently, with different tools and technologies.
[18:12] Smart people, processes, and a variety of technologies are required to address these varying research goals. But for most companies that have millions of documents to review in e-discovery, predictive discovery, and this process, is too good a solution not to be used a great deal more than it is today.
[18:34] With those introductory remarks, I'd like to pose our first question to the panel. Jason, how have courts reacted to requests to use predictive coding techniques in litigation?
Jason Baron: [18:47] Thanks, Joe. There's a wave of cases that have changed the game for all of us coming in 2012 on predictive coding. If I could step back, there really are two waves. One was post-2006, a growing recognition on the part of some of the leading judges in the e-discovery area -- Judge Facciola in the O'Keefe case, Judge Grimm in Victor Stanley, Judge Peck in the Gross Construction case -- that there are limitations to keyword searching. That was foundational to what we have now seen, which I would say is a new set of case law that challenged us to think about advanced search techniques of the type that we've been discussing.
[19:39] What we have now seen, given advances in the technology as you've described, is a set of cases that have emerged in 2012, starting with the Da Silva Moore case, where Judge Peck reacted quite well to an attempt by both parties to come to terms on a protocol that would go forward for the litigation.
[20:05] Now, there were differences between the parties on exactly what methods would be employed. There is a protocol that I'm going to discuss in a moment when we get to the next question.
[20:17] The interesting point, though, is that in other cases, you don't just have cooperation among the parties; you have predictive coding being used as both a sword and a shield -- a shield in the Global Aerospace case in Virginia, where one party essentially wanted to use it and asked the court to issue an order blessing the use of the new technology because it was novel. The court did so in a short order, with some explanation in the briefing behind the scenes and in the hearings that took place.
[20:59] Judge Nolan in the Kleen Products case, out of the Northern District of Illinois, was faced with one party wanting to force the other party to basically do a do-over after it had done keyword searching against a certain set of custodians. One party wanted, originally, to use predictive coding.
[21:19] There were a couple of days of evidentiary hearings with the result being essentially no formal opinion because the parties agreed to go forward without using predictive coding and settling for other methods being used at least for the time being.
[21:38] What we have is at least three cases, and there are others, such as In re Actos in Louisiana. There's an emerging cottage industry of cases where courts are receptive to parties either stipulating by way of protocols for these methods or taking action by motion practice.
[22:00] What is clear to anyone in the space is that a year from now, two years from now, we're going to have a dozen more, dozens of cases. The natural question is, how do we seize the day here and use what is going to be an increasingly accepted set of techniques in future litigation?
Joe: [22:27] Those are great points. Can you follow up and tell us about some of these emerging best practices in this area of predictive coding?
Jason: [22:34] It is well worth it, as Joe has heard me say recently, to do a side-by-side of the Da Silva Moore protocol and the In re Actos protocol. These two cases are both easily found on the Web. The protocols are extremely detailed, where the parties have stipulated to a regime of sampling with iterations built in and, in the case of In re Actos, with examination by experts, in ways that really are novel and have not been contemplated before.
[23:16] The protocol in Da Silva Moore which Judge Peck approved required an initial review of a sample set of 2,399 non-privileged documents. There were eight issue codes to be tagged. Those documents were shared with the other side, with plaintiff, for purposes of looking at what was considered to be relevant and irrelevant documents.
[23:40] With that feedback, there were then a series of seven iterations of groups of 500 documents at a time provided between the defendant and the plaintiff for a review in this feedback loop to determine what is relevant and non-relevant to train the system.
[24:01] This is extraordinary. Even perhaps more extraordinary, in the In re Actos litigation in Louisiana, is a requirement that the parties agreed to where each side nominated three people as experts to work on a collaborative basis to review non-privileged documents to train the system, where those experts were bound by their own terms not to reveal to counsel in the case what their initial judgments were.
[24:38] These are very creative solutions in the space. What is emerging as best practices is a set of commonalities in these protocols that one needs to consider: how you seed documents, how you sample, how you use iteration, and what kind of quality control measures you're going to take, both on the front end and the back end and throughout the process. In studying these protocols, it's very useful to your own practice to see what elements make sense to you and your client for using these technologies.
Joe: [25:22] It's very interesting that some of these cases are approaching discovery the way discovery was approached in the [indecipherable 0:25:27] research projects. What are some of the open issues that remain to be worked out with these new techniques?
Jason: [25:37] I can think of a whole variety of issues that are open terrain. Some are even provocative or controversial to contemplate. The most provocative issue that seems to be embedded within these protocols is the sharing of irrelevant documents.
[26:00] I'm on record as supporting a transparent approach in this area because in my view, based on my talks with experts in information retrieval, search algorithms like support vector machines need to have examples that are difficult to discern to optimize the algorithm that's in the black box.
[26:29] It's not enough just to show cases of clearly relevant documents. You need to show the system irrelevant documents and marginally irrelevant documents that are close to the line that the algorithm is creating in a hyperspace mode.
[26:51] We, as lawyers, need to think about both ourselves and our clients embracing techniques where you're saving cost, but you're having more transparency than previously one would be comfortable with in sharing documents that are non-privileged but irrelevant. The parties in these two litigations that I've talked about have gone about doing that. Judge Peck has blessed it.
[27:21] I was at a conference last week, the EDI Summit in Fort Lauderdale, where it was very clear that Judge Peck is fully comfortable with this new approach. That's one issue. There's a second issue about courts setting a bar that's an absolute metric in this area. You're exactly right, Joe, that there are metrics in this area, recall and precision, that are very important for best practices.
[27:55] But coming out of the TREC experience, I am wary of a court saying, based on the state of practice, that there should be 82.7 percent recall in all cases, or whatever number -- 75 percent was a number thrown out in Global Aerospace. Cases are heterogeneous and differ so much across practice areas that it would not be fair to set some absolute standard that all courts need to live by.
[28:28] On the other hand, the parties could well stipulate, based on their own sampling and their own experience with the data set, as to what goals could be reached in a particular litigation with respect to recall and precision. That is something that was done in these cases, based on the experience of the data set. That is supportable. It is an open issue as to what level of tolerance you have.
[28:55] Third, there's a modest issue about [indecipherable 0:28:59] and whether one uses judgmental sampling with keywords to begin with or a completely random approach at the beginning of these methods. That's something that different software vendors will tell one the pros and cons of. It remains to be worked out.
[29:18] There are lots more. I'll stop there. Of course, in this emerging area, we're only right at the dawn of using these techniques in reported cases and learning from what others have done. It is clear that there are some very interesting issues to work out.
Joe: [29:41] Jason, I've got to ask you; it's a timely topic. Say you decide to use predictive coding on White House emails that need to be produced, and you do that. What if you're challenged? How do you defend the use of predictive coding if you're challenged?
Jason: [30:04] Of course, let me first sidestep the question for the moment, which is that I hope it wouldn't be challenged. The ideal here is to work out with the other side a protocol that makes sense, and that you're cooperating, that you're being transparent, living up to the Sedona Cooperation Proclamation. All that said, we as lawyers know that cooperation is an elusive goal. It's like unicorns and rainbows.
[30:31] In the real world, there are going to be challenges -- and hopefully not in litigation involving White House email, but we'll never know. The first thing that I would say is an obvious point given what I have just said: if you decide to go on your own with your client, and you have determined that the methods Joe has described will save the client cost and produce a reasonable result under the federal rules, then one could do no better than modeling your workflow on cases that have already had that workflow accepted.
[31:26] Even if you're in an environment where you're being challenged, if you can point to the fact that you have done due diligence by using a predictive coding process that is modeled on Da Silva Moore or modeled on In re Actos, then I think you have a leg up in the litigation, because it really does represent a very thoughtful and granular approach. Second, you need to document what you're doing. What gets lawyers and clients in trouble is a failure to be able to reconstruct whatever was done in this area, whatever final protocol you have unilaterally employed.
[32:13] That was the hang-up for Judge Grimm in the Victor Stanley case: the parties standing before him, in particular one party, could not explain what exactly had transpired with respect to keyword searching. Now, we've moved on, four years later, to an environment in which there really is a black box to this.
[32:36] One needs to be as articulate as one can about the process that has been employed. A judge will not -- Judge Peck has made this clear, and other judges have been clear -- they do not want an evidentiary hearing, whether it's Daubert or something else, on all the particulars of what's inside the black box.
[32:56] But they do want a comfort level that the process is producing reasonable results in terms of high recall, high precision, and the fact that you're getting large numbers of relevant documents. If you can document that process and follow what has already been adopted, then I think you have a major leg up in terms of defensibility.
Joe: [33:21] Thanks, Jason. Dan, I've got a question for you. How does random sampling ensure that a sample is representative of the population?
Professor Daniel Slottje: [33:34] Thank you for the question. The one thing that you know, when you're picking a sample from a population, is that that sample may or may not exactly mirror the characteristics of the overall population at a very broad level.
[33:49] You start from the premise that you expect there to be something called sampling error. You think that you want this to be the best representation. You want it to be accurate. You want it to be precise.
[34:02] But you know that because it is a sample, that particular sample you pick, even if you pick it in a random manner -- if it's simple random sampling, meaning that every single element in the population has an equal probability of being included in the sample -- you start from the premise that there is still going to be some sampling error in whatever the characteristic is: in our case, which particular documents are responsive, and what proportion is responsive relative to the population.
[34:27] You start from that premise that there will be sampling error. The scientific elegance and beauty of using scientific sampling methods is that we can control those parameters. What I mean by that is, Joe was talking earlier about the notion of being 99 percent confident that we had a sample with a margin of error of plus or minus one percent.
[34:53] Anybody that's been watching the elections knows that this election, particularly, has been about polling and about the results. You keep hearing people talk about statistical ties. The focus is on what they call the margin of error.
[35:07] The margin of error is nothing more than that sampling error that we're getting when we sit, and we pick a sample, and we don't know exactly how close that is to being the actual value in the population.
[35:19] The beauty of it is we can control that. We control that by the size of the sample that we pick, under the premise that we want to be pretty confident, whether it's 95 percent or 99 percent confident, that the sample estimate we get is within the range of the true underlying population value.
[35:39] You pick those parameters before you start. By picking those parameters before you start, for both the margin of error and for the level of confidence that you want to have in the sample, you're guaranteeing that in repeated samples -- in 99 out of 100 cases, or 95 out of 100 cases, or 999 out of 1,000 cases -- you would have an estimate that actually falls within the margin of error.
[36:05] If it's one percent, it's within one percent of the true value of that population.
[36:10] The other way you can think of it is, if you ever see a kid driving around in a car or it's a student driver, -- and I don't know about you, but it makes me a little nervous when I see those kids, especially on country roads, where I live -- I know that there's another guy next to him who can take control of that and control the parameters, how fast he's going, in what direction he or she is going.
[36:33] That's what you're doing with scientific sampling. You're setting the parameters before you begin to guarantee that you're going to have estimates that are within a certain level of accuracy. That's the elegance and the beauty of probability sampling that we're talking about here.
Joe: [36:50] Thanks. Do you perform any statistical test or analyses to validate that the sample is representative?
Professor Slottje: [36:55] There are a number of tests that you can do to see that a particular sample is representative. The best and easiest way to do it is to pick out some characteristics. If we're talking about documents as an example, you can think of a characteristic like the time frame of the documents.
[37:12] If we know a priori that, as in our example, we have a million documents, and we know that maybe there are 15 or 20 different sources of documents, and we do have a count -- because we have to have what's called the sample frame, a count of how many documents there are, and that's generally going to be done -- then let's say we know that in March of 2000, or, in the example that you were giving earlier, in September of 2012, between certain days, there were 75 emails that were sent. We can figure out the proportion of those.
[37:48] We can compare the proportion in the overall population to the proportion in our sample. That's a very simple statistical test we can use to see if they're statistically equal to each other. That's one way that we can validate, based on a particular characteristic, that we do indeed have a representative sample.
[38:04] We can look at some very broad parameters that are characteristic for the documents that allow us to do that.
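A minimal Python sketch of one such check: a one-sample proportion z-test comparing the sample's share of documents with a given characteristic (say, a date range) against the known share in the overall population. The counts are made-up placeholders, and the z-test is a standard choice rather than necessarily the exact test Professor Slottje has in mind.

```python
import math

def one_sample_proportion_z(successes, n, p0):
    """z statistic for testing whether a sample proportion matches a known population proportion p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

# Hypothetical figures: 7.5 percent of the million documents fall in a given date range,
# and 1,300 of the 17,000 sampled documents do.
z = one_sample_proportion_z(successes=1_300, n=17_000, p0=0.075)
print(abs(z) < 1.96)  # True -> the sample's share is statistically consistent with the population's
```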
Joe: [38:11] This is a tough one. Is your sample the same size regardless of whether the population is 1 million or 10 million?
Professor Slottje: [38:20] No, it's not. But it may not be that different. I know that's a concept a lot of lawyers, including my wife, have a hard time -- I'm not a lawyer like you guys -- getting a grip on: that if your population is a million or your population is 10 million, if you pick a certain confidence interval, and you pick a certain margin of error, the sample size that you select may not seemingly be that different if you're only looking at the confidence interval.
[38:55] The thing that you've got to remember, though, is that the margin of error is what you can basically think of as the fine-tuning. In my experience -- I'm not working in the same domain that you gentlemen work in, but in my domain -- 95 percent is pretty much the common standard. There's no such thing as an official statistical standard.
[39:16] But 95 percent, in science and in most of the legal disputes that I've been involved with, there's not much argument among statistical experts about that. Where you get into battles is, "What's a reasonable margin of error?"
[39:32] The margin of error, as we'll show later on in the appendix, can vary greatly, depending on what the sample size is ultimately going to be based on what the ultimate population size is. If the population is a hundred million, or the population is 2,000, and you say that, "We want to be 95 percent confident in whatever estimate we get," you can get great differences.
[39:58] As an example, if you want to be 99 percent confident for a population of a million, the margin of error of one percent that we've been talking about, as Joe mentioned, requires a sample of roughly 17,000. If you're willing to tolerate a margin of error of three percent, the sample size you would pick would only be around 1,844.
[40:22] It's the margin of error where the fine-tuning comes in -- how precise you want to be -- remembering that the objective of the sampling in the first place is that we don't have infinite resources. We know that costs are significant in undergoing the discovery process. That's where you've got to make the call.
[40:41] No, it's not going to be the same. But it's really the margin of error that's going to be how much error you're willing to tolerate in your estimates.
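To illustrate the point that the required sample barely changes between a population of one million and ten million once the confidence level and margin of error are fixed, here is a minimal Python sketch that adds a finite-population correction to the sample-size formula; the calculation is a standard textbook one, not a figure from the webcast.

```python
import math

def sample_size(z, margin, population, p=0.5):
    """Sample size for a proportion at confidence level z and a given margin of error,
    with a finite-population correction."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

for population in (1_000_000, 10_000_000):
    # 99 percent confidence: +/- 1% needs roughly 16,000+ documents; relaxing to +/- 3% needs ~1,800.
    print(population, sample_size(2.576, 0.01, population), sample_size(2.576, 0.03, population))
```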
Joe: [40:48] Two really important questions. Where do you draw the line? What's the acceptable margin of error?
Professor Slottje: [40:54] I have been in that debate in many, many lawsuits. There are experts that will tell you that 10 percent is reasonable. My personal belief is that three percent is the highest amount. The problem, as you can imagine, is if you're willing...As an example, let's talk about the polling again. You can see what the problem is.
[41:18] There is a poll that just came out that I saw, a Rasmussen poll, that said that right now, among likely voters, President Obama had 45 percent of the vote and Governor Romney had 50 percent, and that they are now talking about...This is a great increase for Governor Romney. But the margin of error on that is plus or minus three.
[41:43] If you think of a worst-case scenario for both guys, if you subtract three from Romney, you get down to 47. If you add three to President Obama, you get 48. Guess what? That's still a statistical tie.
[41:58] In that particular case, going to the time and the expense of polling -- where people pay attention to the results of these polls but the poll doesn't allow us to distinguish between the candidates -- is not of any value.
[42:12] The same thing happens in pharmaceutical trials. You have to have a margin of error that is small enough that you can definitively say whether a trial was successful or not. If the margin of error is too broad, then it becomes relatively useless to us.
Joe: [42:27] What if the population underlying the sample is unknown?
Professor Slottje: [42:31] Another one of the beauty points of using statistical sampling is, in most cases that it's being used, the underlying population size is quite likely to be unknown. What we do know is that we have a notion of what the question is we're trying to look at.
[42:50] As the example in the documents, if a document is responsive or not responsive, we're looking for the proportion of documents that are responsive and not responsive. Using some fairly elementary statistical techniques, we can pick a sample size that assumes the worst-case scenario in terms of the variability in the documents so that the variance of the documents is as broad as possible.
[43:15] We can put those parameters into the equation that we used to pick the sample, and we adjust for it that way. Again, the underlying statistical process that we use -- there's nothing mysterious or magical about this, and it's just based on sound mathematics and statistics -- will allow us to account in a very straightforward way for the fact that the underlying population is unknown.
Joe: [43:40] Similar to the last question, how will a court know that your estimates of population parameters are accurate?
Professor Slottje: [43:52] You've got to go back to the first question.
[43:55] The only thing that the court is going to know without taking a census of every single document -- and if we're talking about looking at all million, that means having expert attorneys read all million -- is that if a particular sampling process is followed, and Jason alluded to this earlier, it's not necessarily true that it's simple random sampling where you just simply pick a sample.
[44:23] But there are stratified sampling methods and cluster sampling methods that are all intended to make the estimates of the characteristic of the population more precise. We know, before we even begin, that we can set the parameters to tell the court with certainty that we know there is some uncertainty, but we can quantify what that uncertainty is.
[44:48] We know that we can't tell them exactly what the value is going to be, but we can tell them, around a confidence interval, that, like you've been talking about, we're 99 percent certain that, whatever values we come up with for the proportion of documents, we're only off by one percent at the most.
[45:10] It gets back to the same as the first question. The beauty of probability sampling is that it allows you to regulate and to fine-tune how much inaccuracy or sampling error you're willing to tolerate.
Joe: [45:24] That's great. We're turning to questions from the audience now. Jason, the first question is up for you. Do we know how many predictive coding cases have been presented and completed to date, both in the US and globally?
Jason: [45:36] Yes, I saw that question. I don't know the answer. But we can only track reported decisions and those that have been newsworthy or blogworthy. I can always count on my good friend, Ralph Losey, to apprise us in his column and in many places on the Web where there are decisions that are contemplating this.
[46:03] There's a handful of reported cases. Kleen Products did not end up being a reported case. We will provide citations to the Da Silva Moore and In re Actos as part of the slide set that's available to everyone. We can all look for what happens in the future.
Joe: [46:25] Thanks. The next question up, I'm going to take this one. It's addressed to me. What is the basis for the assertion that predictive coding is inappropriate for low prevalence data sets? What other methods do you consider appropriate for such data sets? That's a great question.
[46:40] If we think back to the example where we used predictive coding to draw a line, we split the population of documents into a big chunk of 300,000 potentially responsive documents and then a dark pool of 700,000 largely non-responsive documents. In the example we showed, there are only an estimated 42,000 responsive documents in that 700,000 -- a very low prevalence.
[47:09] Instead of approaching that problem with predictive coding, at FTI, we use our Mines technology. We've got technology that can data mine those 700,000 documents by clusters of key concepts. Attorneys can navigate the clusters of documents by the concepts. They can find the hidden pockets of responsive documents.
[47:32] The second tool and technology that's really well suited to that is our Cubes technology, where you can pivot the 700,000 documents by concept, keyword, custodian, or date. You create these pivot tables on top of the 700,000 documents. You can drill down to see the who, what, when, where, and how. You can slice and dice that with search terms, and concepts, and the like. Different tools for different research objectives are needed.
[48:04] We have a question for Dan. Are you quoting the three percent margin of error in test three as the margin of error on recall because that is the margin of error of each sample, or are you further applying a propagation-of-error variance estimate that handles the uncertainty of each sample?
Professor Slottje: [48:21] The three percent that we were talking about was in the initial selection of the sample. As you go through a reiteration, you can obviously change that. In fact, you don't have to sample at all using traditional sampling methods. You can use something that Efron came up with -- and I'm sure whoever asked the question is aware of this -- the bootstrap and the jackknife.
[48:46] You can sit and take the data that you've already pulled. You can get an estimate of the variance using the data that's already been pulled, without having to worry at all about the margin of error. It's a nonparametric way. You don't have to impose that. You can resample from the data that you've already got. These are called resampling methods, and they will allow you to do that.
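A minimal Python sketch of the bootstrap idea: resample, with replacement, from the validation documents already reviewed and look at the spread of the recall estimate across resamples. The simulated review data and the 95 percent interval are illustrative assumptions, not figures from the webcast.

```python
import random

def bootstrap_recall(reviewed, n_resamples=1_000, seed=0):
    """Bootstrap the recall estimate from an already-reviewed validation sample.

    reviewed: list of (expert_responsive, model_retrieved) booleans per document.
    Resampling with replacement gives a spread for the estimate without drawing
    new samples or relying on a parametric margin-of-error formula.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(reviewed) for _ in reviewed]
        responsive = [doc for doc in resample if doc[0]]
        found = sum(1 for doc in responsive if doc[1])
        estimates.append(found / len(responsive) if responsive else 0.0)
    estimates.sort()
    return estimates[int(0.025 * n_resamples)], estimates[int(0.975 * n_resamples)]

# Simulated review of 2,000 documents: ~23% are responsive, and the model retrieved ~83% of those.
rng = random.Random(1)
flags = [rng.random() < 0.23 for _ in range(2_000)]
reviewed = [(responsive, responsive and rng.random() < 0.83) for responsive in flags]
low, high = bootstrap_recall(reviewed)
print(f"bootstrap 95% interval for recall: {low:.0%} to {high:.0%}")
```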
Joe: [49:11] Jason, are there any questions, as you look in the queue, that you'd like to pick up and address?
Jason: [49:19] I think there is some confusion about what the various technologies really are. There's a question about what distinguishes concept-based learning, support vector machine learning, and the like, and a good place to find a glossary.
[49:38] There are a number of places to go. One is a set of articles in the "Richmond Journal" from 2011.
[49:55] My article, "Law in the Age of Exabytes," does set out some primer on this. There is an article by Maura Grossman and Gordon Cormack that talks about technology-assisted review and discusses various methods. There are Sedona commentaries -- the search commentary and the achieving quality commentary -- that one can use as a primer, and both are being updated by Sedona.
[50:25] There is a literature on information retrieval that is out there on the Web. The TREC website, the TREC Legal Track, if you type in T-R-E-C Legal Track, get the home page, and there's a bibliography with all sorts of resources that are in this area.
[50:48] Ultimately, though, my view on this topic is that we, as lawyers, are never going to be the kind of experts and have the expertise that we see here on this webinar from Dan as a statistical expert or in the information retrieval expert community.
[51:09] If you're in litigation that, from a proportionality perspective, merits the expense, it really is a good idea to partner with someone who has expertise, a Ph.D. or otherwise, in information retrieval, and to get a handle on the various methods. You can go and approach a set of legal actors in the marketplace, vendors and others, and ask for white papers and studies, whatever.
[51:45] I think that is something that it's worth doing as well. But I like sort of the gold standard in this area. If you're in litigation, you really want to try to approach people who have already stated expertise.
Joe: [52:04] Thanks. I'll take that. There's a question here. Can you discuss what the adoption rate of predictive coding might look like? I think there are a couple of forces in play. One is increasing sizes of collections of data by multinational corporations. There are really three approaches.
[52:27] You can address that with keywords, if the other party will agree. But there have been some studies clearly indicating that keywords are probably ineffective at finding responsive documents in a large collection. It's questionable as to whether parties will even agree to that technique in the future, as you look at what happened in the Kleen decision.
[52:50] Secondly, you can approach it with manual review, which is very expensive, time consuming. There have been some studies indicating that may not be the gold standard.
[53:03] Our third is sort of a predictive coding approach that we discussed on today's webinar, or a combination, using a variety of tools and technologies and processes to address it.
[53:14] The other thing, though, that just came out of our survey of the 24 corporate counsel and [indecipherable 0:53:23] 100 law firms is that 50 percent of those folks had used predictive coding, but more than 90 percent of them said they had a positive experience with it.
[53:36] I think that folks are having a positive experience with this new technology, this new approach and that measured scientific approach to discovery that we discussed today. I think that's the future and it's going to take off.
[53:59] We have a question here. It was either for Joe or Jason. After you leverage other technologies, what is your recommendation for the remainder set you haven't looked at? I think I spoke a bit to that. FTI has three other technologies; I spoke about two of them, Mines and Cubes. We also have the Document Mapper technology.
[54:23] If we draw the line and we have the 300,000 document set, that review...If our sampling says that there's 70 percent precision in that set, seven out of every 10 of those 300,000 documents are estimated to be responsive, what we're going to do is we're going to go ahead and review that.
[54:44] We're going to identify the privileged documents so we can hold them back. We're going to learn about the case. It's still important for outside counsel to read the documents, to learn about the case so they can defend it or prosecute it. We're also going to weed out some false positives.
[54:59] We actually use another machine learning technology, something totally different than the predictive discovery technology we described in today's webinar. That technology is Document Mapper. It's a different type of machine learning that we use for that process.
[55:18] I think we probably have time for one more question. Jason, do you see any questions you'd like to address?
Jason: [55:25] Let me just expand on that last point. This is a sea change in litigation. Ten years ago, no one would be doing quality control. We'd be doing keyword search. Whatever we get as hits, that's fine. Forget about the rest of the universe.
[55:41] We have moved to a much more sophisticated level in practice. The techniques that you've outlined here, that we see in the case law, really do address quality control, big time. I will say that, I mentioned Ralph once. Ralph Losey has a new website, www.edbp, E-Discovery Best Practices, which is on this point about quality control, and others, in terms of a workflow diagram. I would check it out. Ralph and I have both been very keen on thinking about what constitutes best practices in the space.
[56:27] We're seeing this emergence of these methods. We all, in our own litigation, are going to be pressed to defend these methods with a set of judges that are not Judge Peck. We all just need to be sort of up to speed, on what is said today, in the webinar.
[56:50] It is a new world. It is an uncomfortable world for many of us lawyers of a certain age, but we need to embrace it. In fact, the methods that we're talking about demand that a senior attorney give their perspective in a holistic way to what is going on in the litigation.
[57:13] It's not good enough to just use junior people to train the machines on predictive coding, in many instances, because they don't have a complete grasp. We, as more senior lawyers, need to embrace what is going on in the black box, and understand how we can advance the interests of our clients, in terms of saving money, saving time and effort by using these methods, and putting them in a way that will be acceptable to a range of judges, both in state and federal court.
[57:51] I am certainly devoted to being a cheerleader for advanced search methods like these advanced document review methods, going forth in the future.
Joe: [58:01] Thank you, Jason. That concludes it. Before I turn it over, back to Angela, Dan, any final comments or thoughts?
Daniel Slottje: [58:11] I think that you've covered most of the main points. One point that is important, from the statistical perspective, and you touched upon it as one of the issues, and that's to make sure that you have a common population, to begin with, because you've got to make sure that it makes sense to sample, in the first place.
[58:28] That comes up a lot, in my work, because frequently people will just assume that it's appropriate. If you don't have a common population, then you need to be very careful in what you do for their next step after that. That's my final point.
Joe: [58:42] Great comment. Thank you, Jason. Thank you, Dan. Thank you everyone for participating today. Angela?
Angela: [58:50] Thank you, Joe Looby, to you as well, and to all of our participants, we appreciate your time, and thank you for attending. For more information on predictive discovery topics, we invite you to check out FTITechnology.com.
[59:03] As mentioned earlier, this presentation has been recorded, and a copy of the recording and slide deck will be sent to all registrants.
[59:11] With that, thank you again to Professor Daniel Slottje and Jason Baron with the National Archives and Records Administration. This now concludes today's presentation.