Video
What Does Counsel Really Think About Predictive Coding?
While the promise of predictive coding is alluring, many questions remain for corporations and law firms. Where does the software end and the importance of workflow begin? What can lawyers do to effectively defend its use? Are companies using it successfully? How much money can it save?
FTI Technology commissioned an interdisciplinary survey of law firm leaders and senior corporate counsel to identify key trends and perspectives on the emergence of predictive coding. The interviews covered everything from high-profile court rulings and cost savings estimates to adoption inhibitors.
Transcript:
[0:00] [silence]
Angela Navarro: [0:36] ...Welcome to today's webcast titled, "Survey Results: What Does Counsel Really Think about Predictive Coding?" This event is brought to you by InsideCounsel and sponsored by FTI Technology.
[0:48] My name is Angela Navarro with FTI Technology, and I will be your moderator for this event.
[0:53] Let's get some simple housekeeping items out of the way first. If you have a question for one of our speakers, please enter it in the Q&A widget on your console. We will get to your question during the Q&A section at the end. If we don't get to your question, you may receive an email response.
[1:12] In addition, there are some other customizable functions to be aware of. Every window you currently see, from the slide window to the Q&A panel, can either be enlarged or collapsed. So if you want to change the look and feel of your console, please go right ahead. Now let's take a look at today's agenda.
[1:34] After we introduce today's presenters, we'll receive a brief overview of predictive coding and then launch right into the survey methodology that was used to gather feedback from counsel on this topic.
[1:46] Our speakers will then cover adoption trends with predictive coding, types of cases where it has been used, and some of the top benefits and concerns to be aware of when considering using predictive coding.
[1:59] Our speakers will conclude this section on key findings by discussing some of the cost considerations and suggestions on when to use experts with this topic. We'll talk about looking ahead to the future and reserve time at the end for your questions. Now let's meet today's speakers.
[2:19] We are pleased to welcome our first speaker, Barry Murphy, founder of eDiscovery Journal and founding principal of Murphy Insights.
[2:27] Previously, Barry was director of product marketing at Mimosa Systems, and prior to that, Barry was a principal analyst covering e-discovery, records management, and content archiving with Forrester Research.
[2:40] Barry has spoken at numerous industry events and has been quoted in publications including the "Wall Street Journal," "KM World," "Red Herring," "Computer World," and "Intelligent Enterprise," and has appeared as an industry expert on outlets such as CNBC.
[2:56] His educational background includes a BS from the State University of New York and an MBA from the University of Notre Dame.
[3:04] We are also fortunate to have Ari Kaplan, principal of Ari Kaplan Advisors, on today's presentation. After practicing in large law firms for nine years in Manhattan, Ari became one of the leading copywriters and industry analysts in the legal community.
[3:19] The author of over 200 articles, Ari has been recognized in the Wall Street Journal law blog, the Chicago Tribune, the Miami Herald, the New York Post, the ABA Journal, Above the Law, the National Jurist, the Chicago Lawyer, and the California Recorder, among other publications.
[3:37] Named a Law Star by LawCrossing, Ari provides law-related writing for a number of companies, firms, and individuals in the legal industry.
[3:46] He also provides consulting to individuals and organizations interested in creating deeper connections with law students, lawyers, legal administrators, and other legal professionals.
[3:57] Finally, it is always a pleasure to have Joe Looby, Senior Managing Director at FTI Technology, as a presenter on our webcast.
[4:05] Joe's background ranges from participation in the National Institute of Standards and Technology's Text Retrieval Conference to co-chairing the Sedona Conference's first panel proposing the use of statistics to defend reasonable e-discovery efforts, and co-authoring the Sedona Conference's best-practices commentaries on the use of search and information retrieval methods in e-discovery and on achieving quality in the e-discovery process.
[4:29] Joe is a regular speaker, author, and consultant to law firms and to corporations. A former US Navy JAG lieutenant, an experienced regulator, and published software developer, Joe has appeared before regulatory agencies and provided expert testimony.
[4:45] Thank you again to all of our speakers. I now would like to turn the call over to Joe. Joe, please go ahead.
Joe Looby: [4:53] Thank you, Angela.
[4:56] Discovery in United States litigation has become an immense industry and quite a challenge for corporations.
[5:03] Anecdotally, at the start of the last decade, average litigation cases involved perhaps tens to hundreds of thousands of documents. Today, it's not uncommon for a case to involve millions to tens of millions of documents.
[5:16] This trend will likely continue as digitized information increases, and it presents a significant opportunity for attorneys, experts, technology and statistical process to address the big data of e-discovery.
[5:31] In August of 2012, FTI issued the results of a survey in which 24 leading corporate and law firm counsel executives were interviewed about the prospects for predictive coding. The insights of these practitioners will be presented on today's webinar.
[5:47] While the counsel interviewed expressed some concerns, there was widespread interest in predictive coding. Many counsel reported successful predictive coding projects.
[5:58] There was much optimism about the potential for predictive coding to provide cost savings over other approaches to address the big data of e-discovery.
[6:08] We've put the definition up here as a gating principle of what predictive coding is.
[6:14] Predictive coding as described on this webinar involves the use of a classifier, also known as a computer model, that takes in a training set of data that's been coded by an expert. It uses that training set to make predictions on a larger collection of documents.
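To make that definition concrete, here is a minimal sketch of the train-then-predict workflow Joe describes, assuming a generic scikit-learn-style text classifier; the library, model choice, sample documents, and 0.5 cutoff are illustrative assumptions, not the speakers' actual tooling:

```python
# Minimal sketch of the train-then-predict workflow described above.
# Assumes scikit-learn; the model and 0.5 cutoff are illustrative choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Training set: documents an expert has already coded.
train_docs = ["merger draft attached for review", "weekly cafeteria menu"]
train_labels = [1, 0]  # 1 = responsive, 0 = non-responsive

# Larger collection the classifier will make predictions on.
collection = ["board minutes regarding the merger", "holiday party invitation"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

classifier = LogisticRegression()
classifier.fit(X_train, train_labels)  # replicate the expert's coding behavior

# Score each unreviewed document: estimated probability of responsiveness.
scores = classifier.predict_proba(vectorizer.transform(collection))[:, 1]
predicted_responsive = [d for d, s in zip(collection, scores) if s >= 0.5]
```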
[6:32] We're not talking about concept clustering. We're not talking about keyword search. With that, I'll turn it over to Ari.
Ari Kaplan: [6:43] Thanks, Joe. I have to start by saying it's a privilege to be associated with Joe and Barry on this call.
[6:51] For a number of years, FTI and I have been collaborating on this research, and we have had the tremendous good fortune of gaining insights from some of the most extraordinary in-house lawyers and law firm practitioners around the country.
[7:07] In this case, we spoke with 13 in-house lawyers and 11 law firm lawyers, 10 of whom were partners.
[7:13] It's important to note that of the law firm partners, 100 percent of them, every single one, was a leader of e-discovery in their firm.
[7:24] Two-thirds of them were the chair or co-chair of their e-discovery practice. 100 percent of them recommended e-discovery solutions to their clients.
[7:34] These were folks on the ground at the largest institutions: two-thirds of them were at firms with 1,000 to 5,000 total employees.
[7:41] Of the in-house representatives, there were four from the manufacturing industry, two from insurance, and two from life sciences, and then of course representatives from energy, financial services, retail, technology, and transportation.
[7:54] The reason for this was to have dedicated conversations with a real cross-section, as opposed to just those people who might use this technology or just those who would never use it. We really wanted to find a good cross-section of views to share with you.
[8:12] These folks really spent a lot of time discussing this. 92 percent of the in-house counsel were from companies with revenues over $10 billion and more than 10,000 employees.
[8:23] What you see in this slide is that their responsibilities are significant in terms of recommending and developing processes, and in dealing with substantial litigation: as we're going to see, over a third of the organizations had more than a thousand litigation events. The goal was to really provide some strong information.
[8:44] These individuals were speaking directly in terms of their experience and in terms of the applicability of this technology. We were so fortunate to have a chance to connect with them.
[8:59] I thought I would share this. One of the comments that was particularly telling about this technology was that one of the lawyers in-house described it as auto-magical. That really captures some of the adoption trends that you're seeing and that are really well-captured on the slide.
[9:20] You see here that 21 percent of the respondents are talking about the need for expert use of this technology. You see that Da Silva Moore is having a real impact. 100 percent obviously have heard of it.
[9:40] What's particularly compelling here was that 69 percent of the in-house lawyers that participated said that they were positively impacted while 31 percent of the law firms said that they were positively impacted.
[9:54] You see an interesting distinction, because the law firms had been using, or considering using, this technology regardless of whether the courts were going to approve it or not.
[10:08] Frankly, someone said the dam was going to break as soon as the judge issued a holding establishing the propriety of using the technology.
[10:15] It's important that the overall theme was, as someone said, "I can't see how we're not going to be using something like predictive coding universally within the next five years," given the exploding data volumes that Joe mentioned.
[10:30] In fact, another lawyer said, "Predictive coding is eventually going to get mainstreamed into normal IT and data management practices." The trends that we're seeing are not going away.
[10:42] In fact, there is a real interest in seeing what will happen over the course of the next few years. It's a short-term transition and transformation rather than something long-term.
[10:56] I know that Barry has been similarly doing some research. I'll allow him some time to share with you what he found.
Barry Murphy: [11:05] Thanks, Ari. I agree with you on the issue of the increased volume actually being one of the major drivers. We're also seeing an increase in regulation and compliance investigations that are just forcing people to say, "We need a better way to do the reviews that we've been doing."
[11:25] We did a study earlier in the year and found very similar numbers. In January, it looked as though just over a third of organizations had experimented with predictive coding.
[11:38] By the end of the year, adoption would be over 50 percent. We were predicting that it would be a fairly mainstream activity by the end of 2012. That seems to be playing out.
[11:50] One of the reasons for that is that the people that have experimented with predictive coding are getting value out of it. Almost 90 percent of respondents in our survey indicate that they're going to increase their usage of predictive coding. Only a couple of respondents indicated that they may decrease usage.
[12:10] Everyone that's using it is getting value out of it which is leading to the lower cost of review. From a corporate perspective, it's the potential for lowering or avoiding discovery cost.
[12:24] From a law firm perspective, the value is there as well because it's about being able to do higher margin reviews and not necessarily have to worry as much about sending things out to contract review attorneys.
[12:38] You mentioned it earlier: people are beginning to be able to prioritize things. Being able to do better reviews at lower cost is continuing to drive the adoption of predictive coding.
Ari: [12:57] Barry, it's fantastic that you found that 88 percent plan to increase their usage, because we found that 91 percent of those who used it described it as successful.
[13:13] What we're seeing overall is that a lot of folks are saying e-discovery is a symptom of a larger problem: corporate America still is not sure what it knows. That theme seems to resonate in terms of future use, and also in terms of ownership of the tool.
[13:29] Who's going to own it? Is it going to be owned by in-house counsel, by law firm lawyers? It's interesting to see what your research is finding.
Barry: [13:35] I would concur, and I would say the other thing that leads to adoption is that you have these cases out there like Da Silva Moore, like Kleen Products.
[13:46] There's a lot happening in the news that is bringing up all these questions around defensibility. That ship has sailed to a certain extent, because in general, our respondents tend to lean toward predictive coding being defensible.
[14:01] We'll talk a little bit more about how that defensibility can come into play when we talk about some of the experts needed. There's not a barrier of "this isn't going to work." There is a desire for more prescriptive case law or rules, which I don't think we're going to see.
[14:24] We see in the judiciary a desire for not having to rule on what's behind the curtains in predictive coding but rather having the parties work together. Because we're seeing a lot more of this adoption, we are seeing people realize "I have to do this."
[14:45] Ari, you have said that that's what you're hearing from people when you talk to them, is that "I have to do this. I can't simply wait and be behind the curve as this happens." Joe actually was going to talk about some of the types of cases where this comes in.
Ari: [15:05] Barry, I just want to mention something with respect to the defensibility.
[15:10] One of the things that we found was that there was a certain ownership of the responsibility associated with using the technology. Folks were less concerned about defensibility because they had confidence in their documentation process.
[15:25] A lot of the individuals that I spoke to said, "Listen, we were less concerned about defensibility because we have a fairly well-documented process. As long as we can document exactly where the information is going, what we're doing with it, how it's being processed, we feel very confident with it."
Joe: [15:40] That's correct. The documentation covers the statistical process as well. The challenge is that we use the term predictive coding, but that's an umbrella term.
[15:48] Under that umbrella, there's the technology. There's the attorney expertise. There's the subject matter expert. There's the statistical expert to validate the output from the model. There's the documentation, the workflow, and the project management.
[16:03] These cases are more complex than some e-discovery cases handled several years ago.
[16:13] Turning to Barry's point about the types of cases, the survey respondents were pretty clear. They indicated that there appears to be a dividing line for when it makes sense to use predictive coding. The themes that we heard had to do with size, production deadlines, custodians, and complexity.
[16:33] If a case has more than 100,000 documents, if the production deadline is more than two weeks away, if the number of custodians is more than 50, those are good candidate projects for predictive coding.
[16:47] That's because there is a startup cost, and training is required to teach the computer model, or classifier, and to test and validate it to assure that it's generating output that meets the research objective of the case.
[17:03] The other thing that was really interesting was that there were certain types of matters where the respondents said predictive would work. There were certain types of matters where they said it wouldn't work.
[17:14] Interestingly, some respondents noted that in multilingual matters, predictive coding was tremendously helpful, because they were able to cull out non-responsive documents and eliminate high foreign-language review or translation costs.
[17:34] Especially if we look at the globalization of commerce and cross-border discovery, predictive coding has great potential in these multilingual matters.
Ari: [17:47] Joe, the other thing that I thought was interesting in our findings was that 100 percent of the respondents were using it for litigation. That was a given, but there was a real trend underlying a lot of these answers in terms of the misalignment of law firms and corporations.
[18:04] Predictive coding in some ways allows that alignment to reappear. There's this pressure on law firms to be more efficient.
[18:14] In some ways, there's an attractiveness to being more experimental. 27 percent of law firm lawyers who participated tested it on internal investigations, 18 percent on Foreign Corrupt Practices Act matters, and nine percent on multi-district litigation.
[18:31] What you're seeing is this idea that we can align better with our clients by leveraging this new technology in the same way that we've leveraged technology over the years to be better and more efficient. This is in some ways proof of that.
[18:47] In fact, there were so many comments. Someone said, "It could not have been done without technology-assisted review. Whatever the case was, it couldn't have been done."
[18:55] In some ways, they were trying to justify it and saying, "Look, a normal review, we've generally been asking the least knowledgeable individual to evaluate the documents at the outset as opposed to really getting sophisticated insight right away."
[19:09] I think that's something that really was compelling about this particular research, and the combination of getting perspectives both from in-house lawyers and the lawyers at the firms who represent them.
Barry: [19:22] Joe, you mentioned cases of a certain size. One thing that comes up a lot is the metrics around predictive coding. What types of cases does it make sense for?
[19:34] One corporate practitioner that I talked to spoke to his finding that it was really all about trying to eliminate the roadblocks to using predictive coding.
[19:47] What they did was buy essentially an all-you-can-eat capacity for their internal process. So even if a case is only 50,000 documents, instead of debating whether to use predictive coding in that case because of the startup cost, they basically don't have any barriers to it, because of the way they purchased it.
[20:11] I think that for practitioners, one thing you'll want to think about is how you're sourcing this. Is it all-you-can-eat? Is it volume-based? That will really play into the decision-making process.
Joe: [20:23] And there are a lot of ways to use it. You can use it to cross check your human reviewers. You can use it to prioritize. You can use it to cull.
[20:34] To Ari's point, it really is bringing corporate counsel and outside counsel back together, because what we see is small teams of expert attorneys, whether it's a subject matter expert, in-house counsel, or a litigation expert from outside counsel. Those are the teams of individuals.
[20:57] It's a small team reviewing the 10,000 to 20,000 documents required to train the model, which then goes out and predictively codes against a million documents, for example. It's really a watershed event. It's a change.
[21:16] The types of cases...the attorneys also indicated that there are some matters where predictive coding is believed to be maybe less effective. We should maybe take a minute and highlight some of those.
[21:28] Counsel, again, appears to take a cost-benefit approach, because predictive coding is a scalable solution, meaning the cost to address 100,000 versus 1,000,000 documents, or 1,000,000 versus 10,000,000, is not a tenfold increase in cost. The startup cost remains about the same.
[21:47] But the return on investment in a 1,000,000-document case versus a 100,000-document case is much greater. Once again, it's much greater still in a 10,000,000-document case versus a 1,000,000-document case.
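As a back-of-the-envelope illustration of that scaling point, all dollar figures below are hypothetical assumptions, not numbers from the survey; the fixed startup cost gets amortized over more documents as the collection grows:

```python
# Hypothetical numbers to illustrate why ROI grows with collection size.
# None of these figures come from the survey; they are assumptions.
STARTUP = 50_000        # expert training, validation, project setup
LINEAR_PER_DOC = 1.00   # assumed cost of linear first-pass review per document
MACHINE_PER_DOC = 0.05  # assumed cost of machine classification per document

for n_docs in (100_000, 1_000_000, 10_000_000):
    linear_cost = n_docs * LINEAR_PER_DOC
    pc_cost = STARTUP + n_docs * MACHINE_PER_DOC
    print(f"{n_docs:>10,} docs: linear ${linear_cost:>11,.0f}  "
          f"predictive ${pc_cost:>10,.0f}  savings ${linear_cost - pc_cost:>11,.0f}")
```

Under these assumed rates, the savings grow from $45,000 on a 100,000-document case to $9.45 million on a 10,000,000-document case, which is the shape of the argument Joe is making.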
[22:01] I think the other thing that we want to call out here is that stakeholders should be aware that predictive coding is mostly used today for alphanumeric text: Word documents, spreadsheets, email, text messages. It's not used for multimedia data such as photos, voice, and video.
[22:18] There's no inherent barrier; the technology can certainly be applied, and in other industries, it is being applied to photos and multimedia. But at least in discovery, a gotcha to watch out for is that electronic documents with little or no text still require manual review.
Ari: [22:40] So, for the lawyers who are thinking of using it, one of the things that came up a bunch of times, and Barry alluded to this point, is essentially that predictive coding could be appropriate as a technique for any matter, but not as the only technique on any matter.
[22:59] When we're talking about the distinction between types of cases and whether we should use it or shouldn't use it, one of the lawyers made sort of a comical point and said, "Look, there's no type of matter to which this is not suited. If it pays a buck, a buck is a buck."
[23:16] Again, he was thinking about demonstrating to his or her client that cost control is what we're working on. Efficiency, doing the best possible, most effective, and most creative evaluation of this material so that we can successfully represent you, is a key point.
[23:36] Obviously there are certain cases where we found people thought, "No, we're not going to deal with a lot of images and stuff." But some of them were really willing to try it whenever it was possible and gave really practical reasoning for doing so.
Joe: [23:56] That's a great point.
Barry: [23:59] Well, I think if you step back, the reason people are using predictive coding is that there are a lot of benefits to it.
[24:09] Clearly, as we've talked about, there's the potential for cost reduction, for higher-margin review, and for the elimination of some of that lower-margin review. But there are other reasons that the survey respondents are really looking at predictive coding.
[24:28] 92 percent listed prioritization of documents as one of the top reasons to use predictive coding. It's about using review resources more effectively and efficiently, and about the ability to eliminate the irrelevant documents.
[24:45] I think it is about getting to the heart of the matter faster, but it's about getting those documents that really matter to the case, getting to them earlier and faster. And being able to make those decisions on a case much more quickly.
[25:03] Respondents mentioned being able to test the results of human review, because there's never going to be perfect precision and recall. I've also heard, in talking to people who utilize predictive coding, that there's the ability to figure out who the problem reviewers are.
[25:21] If the system is consistently predicting that documents will be relevant or privileged, and a human reviewer is consistently getting that wrong, you can find that reviewer, train them better, and understand what's really going on.
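A minimal sketch of the reviewer cross-check Barry describes, assuming you have each document's human call alongside the model's prediction; the record layout and the 20 percent flag threshold are assumptions for illustration:

```python
# Flag reviewers who frequently disagree with the model's predictions.
# The record layout and the 20% disagreement threshold are assumptions.
from collections import defaultdict

# (reviewer, human_call, model_prediction); 1 = responsive, 0 = not
review_log = [
    ("alice", 1, 1), ("alice", 0, 0), ("alice", 1, 1),
    ("bob",   0, 1), ("bob",   0, 1), ("bob",   1, 1),
]

totals = defaultdict(int)
disagreements = defaultdict(int)
for reviewer, human, model in review_log:
    totals[reviewer] += 1
    disagreements[reviewer] += int(human != model)

for reviewer in totals:
    rate = disagreements[reviewer] / totals[reviewer]
    if rate > 0.20:  # candidate for retraining or a second look
        print(f"{reviewer}: {rate:.0%} disagreement with the model")
```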
[25:38] At the end of the day, finding more responsive documents is a key benefit. This becomes really important when you're getting your opposing party's production. And you need to really be able to get through that set quickly.
[25:57] There are a lot of key benefits to using predictive coding, which is really driving its adoption toward the mainstream as we see it.
[26:09] That said, there are also a lot of concerns.
Ari: [26:14] Barry, I just wanted to point something out, to dovetail with the fact that you're highlighting prioritization, which is a key point.
[26:24] Essentially, a lot of folks were talking about not just finding the key documents, but putting them in a place where the people reviewing the material are much fresher in evaluating them and can give a more credible, more immediate response.
[26:42] A lot of this is the result of changed client expectations. Clients are used to having answers given to them more quickly. There are different stakeholders within the corporate law department now.
[26:54] There's a piece I wrote in the current issue of the ACC Docket about the shift of power in the general counsel's office. You're seeing different stakeholders who want more of this information and want it more quickly.
[27:06] When we're talking about key findings and benefits, there are some drivers that are shifting the adoption of technology as a result of the shift in culture in the way organizations and the way individuals within the organizations are trying to get their information and process it.
Joe: [27:28] Just as there are key benefits, there are key concerns. There are some of the top reasons that the respondents identified.
[27:36] The first one was black-box technology. Without an explanation of the technology and how it works, respondents expressed discomfort in trusting the predictive coding output.
[27:48] In the third part of this webinar series, we're going to discuss the inner logic of the predictive coding computing model, how it works, and several tests that can be used to validate the results.
[27:59] The second highest concern was that it's not for finding needles in a haystack. We observed that there have been some studies of the quantity of responsive documents that predictive coding can find, but perhaps a harder question to answer is the quality of the documents that predictive coding finds or misses. Ari and Barry alluded to this.
[28:21] Depending on the needs of the case, counsel should still consider other approaches to supplement the predictive coding process, especially when there's a need to find essentially all the needles or hot documents in the case.
[28:33] Predictive coding is a great new tool in the toolbox, but there are other information retrieval methods that are perhaps better tools for different jobs. We actually think that combining several different tools and technologies together can create approaches that are greater than the sum of their parts.
[28:51] The third was fear of inadvertent [inaudible 28:53] , and the fourth was respondents do not want to be early adopters.
[29:00] I think some of these concerns are humming a few bars of the same tune: "We don't understand how it works. It's a black box. And if we can't understand it, how can we trust it, rely upon it, and defend it in court?"
Ari: [29:16] Joe, on that point, it was interesting, when we were doing the research, that 88 percent of the people, both in-house counsel and law firm lawyers, wanted to learn more about measuring and understanding the effectiveness of predictive coding. 83 percent wanted to learn more about training the system.
[29:34] But when we asked about really understanding the nuances and just the technical details behind the tools, they said, "You know what? With respect to understanding different predictive coding algorithms and classifiers, there are people smarter than me."
[29:52] Another person said, "I don't know that it's worth my time to understand the technology at the level of algorithms."
[29:59] I frankly think people are afraid of the word algorithm. It scares them. Nobody is naming their kid algorithm. I think that there's a concern among practitioners that they think, "Gosh, algorithm? I don't know math. I went to law school."
[30:11] But I'm guessing you'll see, as the technology is simplified, that some of these concerns will fade. It was interesting to see that distinction: "Yes, I want to learn more, but no, I don't know if it's worth my time."
Joe: [30:24] That's what happened early in e-discovery. And then, some judges said, "Well, it's now your job to understand these processes." I think we'll start to see that.
[30:35] When people start looking at the inner logic of how these predictive coding computer models work, I think they're going to find some ways actually to really improve the process and make it better. And there's going to be some surprises. We're going to dig into them a little deeper on our next few webinars.
[30:55] But great points, yes.
[31:06] Turning to cost. Of course, the cost effectiveness of these new technologies is always a top concern. We spoke a bit about the upfront cost to have an attorney or subject matter expert train the classifier, and they have to get it right.
[31:24] One interviewee referenced garbage in, garbage out, and that is quite accurate in this context. Errors in the training set get amplified by the classifier in the prediction set.
[31:36] Counsel that we interviewed did report a range of savings. Some of these reports were quite significant.
Ari: [31:47] You know, it's funny. I think it's still a struggle, though. You have over 70 percent that can provide some kind of a range, and yet only one of them was actually able to give an exact number.
[32:00] We find this year after year as we talk to corporate counsel about measuring the effectiveness of these things. Yes, we saved money; there's an order of magnitude. Someone said that 30 percent is the right order of magnitude, because we probably review 70 percent fewer documents.
[32:16] But it's harder to come up with exact numbers, even as someone said that over the next few years, this is going to become an essential tool.
[32:25] My prediction is that relating the great point Barry made about metrics to a determination of exact, or certainly much more exact, costs will really yield more findings in terms of cost.
[32:44] Someone said it pays to run the numbers because otherwise you are essentially just making it up as you go along.
[32:54] Again, there are a series of leading individuals in-house who are responsible for determining: Is this effective? Should we invest more? Should we buy it ourselves? Should we use it through our outside counsel?
[33:08] All these decisions are being made, and as Barry and Joe said, in many ways they're trying to make determinations based on numbers. If those numbers are too general, then you're going to see some shifts.
Joe: [33:22] It's a challenge. Counsel now have three doors that they can walk through, just for the use case of trying to identify the responsive documents. They can use linear first-pass review. They can use keywords, or they can use predictive coding.
[33:36] Ordinarily, they just pick one door, open it, and go through it, say the predictive coding door.
[33:43] Based upon some sampling and some modeling you can estimate how effective you would have been if you had used a search term approach followed by linear first-pass review.
[33:52] Or if you picked door number one, which was just straight linear first-pass review, you can model it. But unless you actually open each of those three doors and go down those roads, you're not going to know the cost of the road not taken.
Ari: [34:08] The other thing is there's a greater receptiveness to the organizations that are working together to try to help determine what the return is on a lot of these investments.
[34:20] As I do this research both here and abroad, and in different industries, you see that there is this partnership between the organizations and the vendors that are partnering with them, to say, "Look, this is the kind of information we need."
[34:34] The savviest, the most successful companies that are working together are helping each other figure this information out. There's a benefit to everyone to understand why and how and where the value is.
Barry: [34:51] You mentioned a partnership, and I think that's one of the key takeaways that this is not something that is done without a team in place. There are a lot of different roles.
[35:07] One of the key findings here is that a lot of different expertise is needed. In addition to excellent legal skills, which I think sometimes get overlooked in the discussion, there is this other expertise that's needed.
[35:24] On the one hand, the predictive coding experts need to be able to explain how it works. They need to be able to show and explain in court how the classifier actually works. Someone needs to, as Ari said, understand the algorithm.
[35:38] It does not necessarily have to be the review team and the legal experts. Although I do think that gaining a familiarity with it is going to be important for lawyers out there who want to practice predictive coding. But the team really needs to have someone who can explain how the classifier works.
[36:02] The predictive coding experts need to be able to collaborate with the review team to be able to statistically explain how the process works and how it's statistically valid. Excuse me, I'm stumbling over the word there.
[36:20] Someone needs to be able to explain the math behind how the sampling worked and how the precision and recall worked so that the processes can be deemed defensible.
[36:35] That needs to be monitored and tested throughout the process. There needs to be someone who can explain how it works and walk through the math behind it.
[36:47] Then there needs to be some elements of process design and workflow and technical expertise as to how that process is going to run and how the technology behind it is going to actually make the process move.
[37:03] Then finally, there needs to be an expert witness who can testify to all of these things put together to explain why the process is defensible. I wrote a blog post a couple of months back about how one of the biggest winners in predictive coding would be expert witnesses.
[37:22] There are not a whole lot of people who have all of the legal review skills, the math skills, and the technology skills in one package.
[37:29] But rather, some of those skills that are not traditional to the legal field, such as stats or process design and technology, become really important parts of the team. Those need to be sourced and brought to court to explain how this all works. It's going to take a huge team.
Ari: [37:51] I think this is a great point. In the same way that you see now chairs of their e-discovery departments at firms, or e-discovery leaders within corporate legal departments, you will see that there will be folks who people can come to, and so it's to everyone's advantage.
[38:11] Certainly, for every single person on this call and on the next couple of calls, it's to their advantage to gain even a working knowledge of this information, so that they are the person the organization comes to, and so that they are someday the chair of that department.
[38:28] Because there's so much opportunity in just leveraging this technology, and then all of the different things that Joe talked about in terms of what the technology makes possible beyond review. That is what this idea of expertise, this new elite technological professional, will really represent.
Barry: [38:47] It's such early days I think that people are just becoming aware of the opportunity to set themselves apart, career-wise, by becoming experts in some of these things.
[39:01] I was talking to a law firm practitioner who essentially said that she went out and learned statistics because she wasn't comfortable not understanding how that worked. Just for her own practice of this.
[39:16] She's been able to set herself apart in her firm by showcasing her intellectual curiosity. Because she can now walk into a meeting and not feel overwhelmed that she doesn't understand that topic.
[39:30] Even though it may not relate quote-unquote to the practice of law as we think about it, one of the quotes that I do often hear is that predictive coding is going to let litigators get back to doing what they do, which is litigate. As opposed to focusing on all the aspects of e-discovery that were time wasters before.
[39:51] The more you can get comfortable with technology and how software works, and show an interest in the statistics and, to a certain extent, the algorithms, the better. I wouldn't say it's necessarily about being able to explain the black box.
[40:07] But for someone who can do that, they're going to set their career apart. It's early days, and so now is the opportunity to really gain a leadership position.
Joe: [40:19] That's really interesting. In some of the meetings we've been involved with, and if you read the case law, a lot of this centers around statistics. Whether it's Global Aerospace or Actos or Da Silva Moore, statistics is the big front-runner. That's consistent as we look ahead.
[40:48] Barry, Ari, just as you indicated, the predictive coding expertise will be in high demand. But it's not really predictive coding expertise. It's attorney expertise, technology expertise, statistics expertise, project management, workflow expertise.
[41:05] It's a cross-disciplinary team that needs to get put together to deliver on one of these projects.
[41:11] A small team with this technology and with statistics can do the work of a much, much larger team approaching the same problem several years ago.
[41:27] We think predictive coding will be less black-box. The technology and the statistics need to be clearly explained.
[41:35] If someone can't sit down and explain how this works, I think it's caveat emptor. Be skeptical. We would caution folks to shop around until you can find someone who can explain the process to your satisfaction.
[41:54] In my career, I've spent a lot of time looking at different computer models and complicated processes, working with smart attorneys who have insights. Once the process is explained, they know how to use the tool and fashion it to their specific litigation or investigation objective better than a consultant ever would.
[42:19] The third thing is the concept of multiple classifiers for multiple cases. There are a number of different classifiers or different technologies that can be used for predictive coding in the market today, and this is going to grow.
[42:36] From an information retrieval perspective, predicting privilege is a very different project than predicting responsiveness. Effectively handling foreign languages can add additional complexity.
[42:51] There are different ways to set the parameters on these technologies, and there also are different technologies or classifiers that are better at certain research goals.
[43:04] It's just a caution we wanted to raise. In the future, it may be that multiple classifiers are run on a given training set, and whichever one does best is the one that wins and gets applied to the specific research project.
[43:22] That wraps up our slide presentation. I'd like to turn it over to Angela Navarro, our moderator, to open up the Q&A.
Angela: [43:31] Thank you, Joe, and thank you again to all of our speakers. We are now ready to move into the Q&A portion of today's call. We invite you to enter your questions using the Q&A tab on your screen. If we are not able to get to your question with the time we have allotted, we will follow up with you via email.
[43:49] While you're doing that, we'd like to remind you of parts two and three in the three-part webcast series on predictive coding that we're hosting in conjunction with InsideCounsel.
[43:59] Part two is called "Predictive Coding and the Meet and Confer: What Every Attorney Needs to Know." That will be hosted on Tuesday, October 2nd and will feature speakers David Horgan, Ed Rigby, and Joe Looby once again.
[44:12] Part three is titled "Predictive Discovery: Taking Predictive Coding Out of the Black Box," and will be held on October 24th, featuring speakers Jason Barrett and, again, Joe Looby.
[44:23] Thank you again to our speakers. I'll now go to our first question, which is for Joe Looby. Joe, the question reads, "Would you be able to give us a scenario or example, in more detail, of how predictive coding has been used?"
Joe: [44:40] Sure, can do. Let's walk through a hypothetical. Say we've got a million documents.
[44:47] If we picked a random sample of 17,000 documents out of that million, and say our expert reviews it and finds that 23 percent of those 17,000 are responsive, based upon statistics we're 99 percent confident plus or minus 1 percent that 23 percent of the million documents are in fact responsive.
[45:10] What we've done is we've put a fence around the research objective, which is to find those 220,000 to 240,000 responsive documents out of the million. We can take the 17,000 documents that have been coded by the expert.
[45:25] We can use that as the input to our predictive coding computer model. It's going to read through those documents and it's going to decide the words and phrases in the documents that are indicators of responsiveness, and the words and phrases that are indicators of nonresponsiveness.
[45:44] The model's going to train. It's going to try to replicate the behavior of our expert in coding those 17,000 documents.
[45:52] What we can then do is we can take that trained model and we could apply it to the remaining 983,000 documents. It's going to classify those documents essentially through a process of scoring those documents as either potentially responsive or potentially nonresponsive.
[46:11] There are a number of statistical tests that we can perform to validate and gain assurance that the model is in fact working on the 983,000 documents as we need it to based upon our research objective.
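The roughly 17,000-document sample in Joe's hypothetical is consistent with the textbook sample-size calculation for estimating a proportion at 99 percent confidence with a plus-or-minus 1 percent margin, assuming the most conservative p = 0.5. This sketch shows the standard formula, not necessarily FTI's exact methodology:

```python
# Standard sample-size calculation for estimating a proportion.
# Assumes worst-case p = 0.5; a sketch, not necessarily FTI's exact method.
z = 2.576          # z-score for 99% confidence
e = 0.01           # +/- 1% margin of error
p = 0.5            # most conservative assumed proportion of responsive docs

n_infinite = z**2 * p * (1 - p) / e**2              # ~16,589 documents
N = 1_000_000
n_corrected = n_infinite / (1 + (n_infinite - 1) / N)  # ~16,318 with correction
print(round(n_infinite), round(n_corrected))  # rounding up gives roughly 17,000
```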
[46:28] One last point. What's the research objective? If you're coming to this for the first time, you should go to Wikipedia or somewhere and read about recall and precision, because those are the two key performance indicators of information retrieval, predictive coding, and discovery generally.
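For reference, the standard definitions are: recall = TP / (TP + FN) and precision = TP / (TP + FP), where TP is the number of responsive documents correctly retrieved, FN is the number of responsive documents missed, and FP is the number of non-responsive documents retrieved by mistake.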
[46:47] The other thing I'd point to is the Global Aerospace decision, where Judge Chamblin, out of the state court in Virginia, does a really good job summarizing some of the research in this area.
[47:00] Basically he says in his opinion that if the parties in that case achieve a 75 percent recall, which means you're finding 75 percent of the responsive documents using your predictive coding technology, that's good enough in his court. Ari or Barry, any other thoughts?
Barry: [47:24] I think that's a good example. I think you're right. That everyone should go out and be aware of what precision and recall are and how those are going to be important in the practice.
Ari: [47:39] There should also be a universal understanding within the team so that everyone is similarly consistent in their interpretation. That came up a lot in terms of, well, what does this mean, what does this mean?
[47:54] You'll find people in different organizations and counsel on different sides of an issue defining certain terms in a way that tends to confuse rather than clarify.
Angela: [48:05] Thank you, gentlemen. The next question will go to Barry Murphy first. We've had a couple of questions on this topic. Barry, where does one go to gain the statistics knowledge required to create a defensible predictive coding process?
Barry: [48:21] That's an excellent question. We're doing research on education and certification programs that are available in general in e-discovery.
[48:31] I don't think that we're close to a certification in the e-discovery space that has gained traction or critical mass. I do think that some of the organizations out there that offer training will have some training on predictive coding and how it works.
[48:51] But within the industry, there's no real certification board saying, "OK, this is the stats education that you need." Rather, what I've seen people do is just go study an old statistics textbook that they may have from college or grad school, or pick one up and start learning it that way.
[49:15] I don't think that there's necessarily a go-to program right now in the market, although I think you'll see some start to emerge from some of the really good educators out there. I don't think you're going to see this come up in law school per se as a major, but some of the organizations out there will certainly be offering a refresher in statistics.
Ari: [49:42] You're also seeing some of these primers offered by the Organization of Legal Professionals, and you're seeing other opportunities available simply online, in terms of iTunes U and things on YouTube, searching for certain kinds of statistics. It's amazing how much of this information you can actually acquire.
[50:06] What's also amazing is how little you need to acquire to demonstrate some level of expertise within your organization. Barry made that point earlier. If you take a statistics course or you have some familiarity with the topic and can apply it in an artful way, you're the expert in your organization. People trust you.
[50:27] With that trust comes greater responsibility, and you'll learn more. So I encourage people to take those steps. OLP.org is the Organization of Legal Professionals. But there are many.
Joe: [50:42] There are great publications. Wiley, the publisher, has tremendous books on statistics, and a lot of organizations have PhD statisticians who have been working in the litigation and investigation environment for many, many years.
[51:00] They went to school for this. They got their PhDs. Reach out to them; they're a tremendous help.
Angela: [51:09] Thank you. Our next question will go directly to Joe first, and then of course it's open to Ari and Barry for comment afterwards. The question reads, "Craig Ball has suggested that a seed document or documents be created to train the software. What do you think about that?"
Joe: [51:29] Could you repeat that? I'm not sure I understand the question.
Angela: [51:31] Sure, yeah. Craig Ball has suggested a seed document be created to train the software. What do you think about that?
Joe: [51:42] I'm going to have to guess that he's saying create a seed document that's basically like a list of search terms, maybe, as an indicator of responsiveness.
Angela: [51:53] Perhaps the [inaudible 51:52] said.
Joe: [51:55] Yeah. I don't know if I really understand what the question is. I'll caveat that.
Barry: [52:06] I think the question is really about seed document sets being used to train the software. That's exactly how predictive coding works: you have a sample of a document corpus that you go through and start marking for relevance or privilege or whatever you're looking for.
[52:29] Then you take that seed set and you use it to train the software. That's essentially how the process of predictive coding works, on its most basic level.
Joe: [52:39] That clarifies it. I understand, OK. There are two ways. There's something called active learning, where you start with a small random sample of the collection, and then the computer identifies documents in the collection that it is having trouble predicting on.
[53:03] It actively sends those to your expert for him or her to review and code. That's the active learning seed set process.
[53:13] The other process is just to take a large random sample of the collection, one that is sufficiently representative. In the example that I gave earlier of 17,000 documents, a sample of that size from a population of a million documents means that 99 out of 100 such samples will reflect the population's rate of responsiveness within plus or minus one percent.
[53:48] Once again, just two approaches. There's a large random sample or there's the active learning approach to creating a seed set for a predictive coding classifier.
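Here is a minimal sketch of the uncertainty-sampling step behind the active learning approach Joe describes; the model choice, batch size, and the "closest to 0.5" selection rule are illustrative assumptions, not a description of any specific product:

```python
# Sketch of one active learning round via uncertainty sampling.
# Model choice, batch size, and the selection rule are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pick_uncertain(model: LogisticRegression, X_unlabeled, batch_size: int = 50):
    """Return indices of the documents the model is least certain about."""
    probs = model.predict_proba(X_unlabeled)[:, 1]
    uncertainty = np.abs(probs - 0.5)       # 0.5 means maximally uncertain
    return np.argsort(uncertainty)[:batch_size]

# Usage sketch: train on a small coded seed, send the most uncertain
# documents to the expert, fold the new labels in, and repeat until the
# model's predictions stabilize on a held-out validation sample.
```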
Angela: [53:58] Thanks, Joe. Just a reminder, we have time for just a couple more questions. If we are not able to answer your question on today's presentation, we will follow up with you via email. Also, the slide deck and recording of this presentation will be provided to all registrants.
[54:15] The next question is for Ari. Ari, the question reads, "Are you seeing corporations trying to incorporate predictive coding in the collection and preservation stage, or just when dealing with the traditional discovery phase?"
Ari: [54:32] I am seeing them consider both. What I'm hearing from corporate counsel is that, with the acceptance of the technology based on Da Silva Moore and the other cases we've been talking about, there is a slight openness. They are definitely more interested in having the discussion. They understand some of the terms.
[54:59] The point at which they use it is up for discussion. Again, it brings their outside counsel back into this very important discussion, because they're relying on that insight to try to figure out whether or not this makes sense for a particular case.
[55:19] I think Joe articulated nicely the range of matters for which they're finding it appropriate.
[55:28] You're going to see within the next 12, 18 months this expansion of how organizations are using it. Certainly you may see it sooner, but just in terms of a lifecycle of making decisions that these folks will start using it in a lot of different ways in addition to simply early or mid-stage discovery.
Angela: [55:53] Thank you. I think we have time for one more question. Joe, this one will go to you first. What is your recommended threshold on the size of the training set based on a million documents? Joe, are you there?
Joe: [56:13] Yes. I am thinking. [inaudible 56:15] example. It depends upon what you're trying to do. It depends upon what your information retrieval goal is. I'll say if you're trying to defensibly address two important issues. One is, how many responsive documents are there in the million?
[56:46] If you want to pick the sample that's the largest that gives you 99 percent confidence plus or minus one percent, then you want to pick a sample of roughly 17,000 documents out of the million. That's overly conservative, but it's very defensible.
Angela: [57:09] Thanks, Joe.
Joe: [57:10] It depends on what you want to do, though. And that's not the end of the statistical process. If you think about a model, there's the input, there's the inner logic, and there's the output. That 17,000-document training set is on the input side. There are things that you do to understand the inner logic, and there are also statistical tests you do on the output.
[57:33] I don't want people to walk away and think that that's the only statistical test you perform. There are actually a number of statistical tests you perform throughout the process to gain assurance that you're achieving the recall and precision that you need in the matter.
Angela: [57:53] That's good to know. We do have time for one more question. This will be the last one. Again, if we're not able to get to your question, we will follow-up with email. Barry, this one is for you. The question reads, "What seems to be the tipping point for effective usage for predictive coding based on case files?"
Barry: [58:17] There's a modifier here which no one likes to hear, which is "it depends." It depends partly on the way you're purchasing the solution. If you do an all-you-can-eat purchase, where you can run any amount of data through for one cost, there is no tipping point.
[58:38] Other tipping points that I've heard out there in the market are 50,000 documents or 100,000 documents.
[58:48] We're in the process of developing real metrics for predictive coding, because so much of the usage right now is still in that learning phase of: Is this returning my investment? What are the costs here? How often can I use it down the road?
[59:09] The two that I've heard the most are 50,000 and 100,000 documents. For cases larger than that, it will tend to make sense to use predictive coding. Smaller than that, it may not pay off. Again, it also depends on how you judge it. Is it just review cost, or is it review quality?
[59:29] There's a discussion of quality that's coming up that maybe our industry hadn't had before. If it's about raising quality, then the tipping point may go way, way down.
Angela: [59:41] Thank you, Barry. Thank you again to Joe Looby and Ari Kaplan, as well as Barry Murphy for joining us on today's presentation. We would also like to thank the corporations and law firm executives who shared their insights and expertise for the whitepaper.
[59:59] I would like to thank all of you, our attendees, for participating today in the first of our three-part series on predictive coding. Also, thank you very much to InsideCounsel. This now concludes today's presentation. Thanks very much and have a great day.