Proper Use of Predictive Coding Technology

01/07/14

Best Practice Predictive Coding Workflow from a Review Manager Perspective


Predictive coding technology is limited by the math on which it is based and the level to which the documents adhere to the math. It is further limited by the data on which it relies to make the coding. Assuming the algorithms themselves have justified a sufficient level of confidence, the “seed documents” must sufficiently provide a framework for coding the entire database.



Well-used and well-understood tools such as Venio, Equivio and Concept clustering tools have already proven the basic viability of applying Technology Assisted Review (TAR) tools to classify documents. As statistical confidence in a particular predictive coding tool is established, such tools may eventually be a viable alternative to human document review, but not a means of eliminating humans from the review process. Some questions have already arisen regarding the ability of predictive coding algorithms to properly address terse documents (documents that do not contain abundant text for language-based analysis, such as spreadsheets or short documents) and “novel content” documents. Additionally, human input is needed both before and after the predictive coding has occurred before the review can be considered “complete”.



The results of the predictive coding algorithms will need to establish confidence in the automated coding by relying on sufficient amounts of data input prior to the predictive coding, a high level of confidence that the seed documents are correctly coded, and a high level of confidence that the seed documents are truly representative of the entire database. The results will further need to stand up statistically to manual quality control (QC) processes. If significant errors are found and/or incorrect coding patterns are identified during the QC process, they must either be corrected manually or the predictive coding engine must be re-run and re-verified to eliminate those errors.



Further, for a manual quality control process to be effective, it is critical that the people performing it be “subject matter experts” in both the protocol and the documents. Mere familiarity with the protocol and case on a general level will often be insufficient, as the nuances of the factual scenarios, terminologies, and overall environment in which the actors are conversing will need to be understood to properly identify the error patterns as well as the subtleties of individual documents. In a traditional review, deeper insight and understanding are gained during the review process by the attorneys performing the review, a selection of the most accurate of whom typically make up the QC team. The QC team for automated review will still need to have that knowledge, but will have to gain it through other means.



By using a smaller team of attorneys, the analysis of the documents comprising the seed set can be controlled more carefully to ensure their accuracy. By using document review specialists or review managers, rather than a large team of contract attorneys, patterns can be quickly identified, understood, and adapted into the coding of the seed set and the identification of any additional seed documents that may need to be found and utilized.


Determining Sample Size


The determination of a proper seed set for predictive coding is not a simple “sample size” calculation based on traditional statistical analysis. That approach assumes too much about a very complex field. When a poll is taken, the answers of the sample set can be used to statistically predict the answers other people would provide if asked the same question. However, in document review, each document is asking a question and each question is slightly different. The goal here is not to predict the answer to the same question, but to predict which of several multiple choice answers an attorney will give to a series of questions, some of which are similar but only a handful of which are identical. In statistics, controls are necessary to account for differences in the administration. In document review, in place of control groups, a more robust seed set combining random documents and “targeted diversity” documents is necessary.
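For reference, the kind of traditional “sample size” calculation referred to above can be sketched as follows. This is a minimal illustration of the standard polling formula (Cochran's formula with a finite population correction), not the method of any particular predictive coding tool; the function name and default values are assumptions chosen for the example.

```python
import math


def poll_sample_size(population: int, margin: float = 0.05,
                     z: float = 1.96, p: float = 0.5) -> int:
    """Classical sample size for estimating ONE proportion in a population.

    z=1.96 corresponds to roughly 95% confidence; p=0.5 is the most
    conservative assumption about the proportion being estimated.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    for size in (10_000, 100_000, 1_000_000):
        print(size, poll_sample_size(size))
```

The result barely changes as the population grows, because the formula only measures how well a sample answers one identical question. It says nothing about whether the sample covers the variety of documents in the database, which is the gap the combined random and targeted seed set is meant to fill.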


The Ideal Workflow for the use of Predictive Coding Technology


The process of review can be started during the process of filtering the document population. During this process the review team (in the form of the review manager or firm attorney) can gain a broad understanding of the document population requiring review, utilize traditional tools such as Venio to eliminate portions of the population, and apply categorization to other portions.


Once a reasonable review set has been established, a seed set can then be gathered for manual review. The ideal seed set should not be strictly a “random” sample, nor a targeted sample but a combination of both. An experienced review manager working with the firm attorney assigned to the case, working with some of the traditionally accepted similarity, batching, and clustering tools should be able to identify a variety of documents that will need to be included in the seed set based on the overall database. This will ensure that a variety of different documents can be included in the seed set to diversify the coding as much as possible and increase the data input for the predictive tool. It will also ensure the seed set does not contain too many similar documents but does contain some similar documents so that subtleties can be accounted for by the tool. This set can then be supplemented by a broader random sample to add more layers of input beyond what human eyes were able to identify up front.
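The combination of targeted and random selection described above can be sketched in code. This is only an illustration under stated assumptions: the cluster labels are taken to be the output of some clustering tool, the per-cluster picks stand in for the judgment of the review manager and firm attorney (which in practice would not be random), and the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def build_seed_set(doc_clusters, per_cluster=2, random_extra=5, seed=0):
    """Return (targeted, supplement) lists of document IDs.

    doc_clusters: dict mapping document ID -> cluster label
                  (hypothetical output of a clustering tool).
    per_cluster:  documents taken from every cluster, so each kind of
                  document is represented by a few similar examples.
    random_extra: size of the broader random supplement.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for doc_id, cluster in doc_clusters.items():
        by_cluster[cluster].append(doc_id)

    # Targeted diversity: a few documents from every cluster.
    targeted = []
    for cluster in sorted(by_cluster):
        members = sorted(by_cluster[cluster])
        targeted.extend(rng.sample(members, min(per_cluster, len(members))))

    # Random supplement drawn from everything not already chosen.
    chosen = set(targeted)
    remaining = sorted(d for d in doc_clusters if d not in chosen)
    supplement = rng.sample(remaining, min(random_extra, len(remaining)))
    return targeted, supplement


if __name__ == "__main__":
    docs = {f"DOC{i:03d}": i % 4 for i in range(40)}
    targeted, supplement = build_seed_set(docs)
    print("targeted:", targeted)
    print("random supplement:", supplement)
```

Capping the picks per cluster keeps the seed set from being dominated by near-duplicates while still including a few similar documents from each group.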


Prior to any manual review of seed documents in the Review Set, a clear protocol should be established that identifies all of the various coding needs (typically including: what is and is not responsive, the known issues to be coded, any “hot topics”, any topics that will require redaction, and the privilege protocol). In cases with heightened privacy concerns that will require redaction of information such as personal healthcare information, social security numbers, etc., that information will need to be sufficiently defined so that the predictive coding tool will be able to identify it. Whether the confidence level in the identification of these special documents is sufficient, or whether additional human review of the responsive population will be necessary, is another factor to keep in mind.


The need for the well-defined protocol cannot be overlooked or overemphasized because a large team of contract attorneys will not be used for the bulk of the review process. Therefore, the protocol will be needed to determine what coding will be placed on the seed set. However, it may also be necessary to properly gather a sufficient seed set. Any time more than one person will be making decisions with regard to the coding either before or after the predictive tool is utilized, a clear protocol is the only means of being confident the documents are properly coded. 



At this point, if not sooner, it may be advisable to establish a broad search for a wide spectrum of potentially privileged terms and topics. These documents should be pre-identified as potentially privileged and considered carefully as to whether they need higher levels of attention for review.
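A broad privilege screen of this kind can be approximated with a simple term search. The sketch below is illustrative only: the term list is a hypothetical placeholder that would in practice come from the case's privilege protocol (attorney names, firm domains, and so on), and the matching is deliberately over-inclusive, since the goal is to flag documents for closer attention rather than to make a privilege call.

```python
import re

# Hypothetical starting list; a real list comes from the privilege protocol.
PRIVILEGE_TERMS = [
    "attorney-client",
    "privileged",
    "legal advice",
    "counsel",
    "work product",
]


def flag_potentially_privileged(documents, terms=PRIVILEGE_TERMS):
    """Return the set of document IDs whose text contains any term.

    documents: dict mapping document ID -> extracted text.
    Matching is case-insensitive and substring-based on purpose.
    """
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return {doc_id for doc_id, text in documents.items() if pattern.search(text)}


if __name__ == "__main__":
    sample = {
        "A": "Per counsel's advice, hold the shipment.",
        "B": "Lunch at noon?",
        "C": "PRIVILEGED and confidential",
    }
    print(sorted(flag_potentially_privileged(sample)))
```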



Once the seed set has been identified and the protocol established, then the review of the seed set can commence with a small, thoroughly trained team of attorneys, using either or both of the law firm attorneys working on the case or a team of document review specialists or review managers. These same attorneys must be retained both for this portion and any QC that is run after the predictive coding tool is used.



The review team will need to code a sufficient number of documents within the seed set to provide the tool with the information it needs. This decision should not be a simple analysis by the tool based on its algorithms. Nor should review begin immediately when a sufficient number is reached.



Whether a sufficient set of documents has been coded should be determined in a two-step approach. First, the predictive coding tool will need to have sufficient statistical data to supply the algorithms with data for coding the remaining review set. This decision, in large part, will be made by the tool itself and/or the technical project manager administering the tool, in conjunction with the firm attorney. Second, the firm attorney alone (or in conference with the review manager) should determine whether there is a sufficient comfort level with the seed set population, in terms of its representation of the larger review set, accurate application of the protocol, and whether sufficient understanding of the documents exists in the human team members to proceed.



In determining “accurate application of the protocol” and “sufficient understanding”, even if a sufficient number and representation appear present, a QC process must be engaged at this point. Utilizing the gained knowledge from the review, the team should engage in a statistical random sample QC by reviewer and by each of the coding aspects, as well as an overall perusal of the database to identify individual or overall error patterns and confirm the coding meets the protocol. Identified error patterns must be corrected by a targeted re-review prior to utilizing the predictive coding tool to prevent duplication of the errors by the algorithms.
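The random sample QC by reviewer and by coding aspect can be sketched as two small steps: drawing a sample for each reviewer, then tallying the error rate for each reviewer and coding field once the QC team has checked the sample. The record layout, field names, and sample size below are assumptions made for illustration, and the sketch does not replace the overall perusal of the database described above.

```python
import random
from collections import defaultdict


def draw_qc_sample(coded_docs, per_reviewer=3, seed=0):
    """Pick a random QC sample of document IDs for every reviewer.

    coded_docs: list of dicts with 'doc_id' and 'reviewer' keys
                (hypothetical export from the review platform).
    """
    rng = random.Random(seed)
    by_reviewer = defaultdict(list)
    for doc in coded_docs:
        by_reviewer[doc["reviewer"]].append(doc["doc_id"])
    return {
        reviewer: rng.sample(sorted(ids), min(per_reviewer, len(ids)))
        for reviewer, ids in sorted(by_reviewer.items())
    }


def error_rates(qc_findings):
    """Compute the error rate per (reviewer, coding field).

    qc_findings: iterable of (reviewer, coding_field, is_error) tuples
                 recorded by the QC team.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reviewer, field, is_error in qc_findings:
        totals[(reviewer, field)] += 1
        errors[(reviewer, field)] += int(is_error)
    return {key: errors[key] / totals[key] for key in totals}


if __name__ == "__main__":
    coded = [{"doc_id": f"D{i}", "reviewer": "R1" if i % 2 else "R2"}
             for i in range(10)]
    print(draw_qc_sample(coded))
    findings = [("R1", "responsive", True), ("R1", "responsive", False),
                ("R2", "privilege", False)]
    print(error_rates(findings))
```

A reviewer or coding field with a noticeably higher rate than the others is a candidate for the targeted re-review before the tool is run.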



If, based on the QC results, the firm attorney, review manager, and technical project manager believe more documents should be identified before the predictive coding tool is utilized, they should be identified, reviewed, and QC’d consistent with the above approaches.



Once the seed set has been reviewed and QC’d sufficiently, the predictive coding tool is ready to run. Whatever the approach of the tool, there may be some need for input by the firm attorney and/or review manager during this process and they should remain involved with the technical project manager to address any concerns that may impact the results of the predictive coding. Any documents the tool is not equipped to handle should be marked separately so that they can be re-addressed as needed.



After the tool has successfully completed its operation, a complete review set should now be ready with coding complete. A repeat of the QC process should now be performed on the complete review set. Error patterns found will either need to be addressed through manual, targeted corrections, or a re-application of the predictive coding tool. Any re-application of the predictive coding tool should be followed by an additional QC process.
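One way to judge whether the post-coding QC sample supports confidence in the whole review set is to put an interval around the observed error rate. The sketch below uses the Wilson score interval, a standard statistical technique chosen here for illustration; the text above does not prescribe a specific method, and what counts as an acceptable error rate remains a decision for the firm attorney and review manager.

```python
import math


def wilson_interval(errors: int, sample_size: int, z: float = 1.96):
    """Wilson score interval for an error rate observed in a QC sample.

    errors:      number of miscoded documents found in the sample.
    sample_size: number of documents checked.
    z:           1.96 corresponds to roughly 95% confidence.
    Returns (lower, upper) bounds on the true error rate.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= errors <= sample_size:
        raise ValueError("errors must be between 0 and sample_size")
    p = errors / sample_size
    denom = 1 + z ** 2 / sample_size
    center = (p + z ** 2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    low, high = wilson_interval(errors=10, sample_size=400)
    print(f"estimated error rate between {low:.3f} and {high:.3f}")
```

If the upper bound is higher than the team is willing to accept, that points toward targeted corrections or a re-application of the tool followed by another QC pass.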



Specific QC review may be required of any document subset that still requires attention (such as the terse documents) to ensure those concerns are alleviated even if the overall review set appears correct. Among the concern sets at this stage should be any documents that were coded as responsive and are also among the potentially privileged set (best practices require that such documents get some level of human eye review prior to being released and potentially waiving privilege). Any documents identified as requiring or potentially requiring redaction will likely need to be addressed through manual review at this stage as well (some auto-redaction tools are now on the market and may be applied here as well).
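The concern sets described above amount to a few simple filters over the coded review set. The following sketch shows one possible way to route documents into human review queues; the field names and the word-count threshold for a “terse” document are hypothetical assumptions, not values taken from any tool.

```python
def route_for_human_review(docs, terse_word_limit=20):
    """Sort coded documents into queues that still need human eyes.

    docs: list of dicts with hypothetical keys 'doc_id', 'responsive',
          'potentially_privileged', 'needs_redaction', and 'text'.
    terse_word_limit: assumed cut-off below which a document counts as terse.
    A document may land in more than one queue.
    """
    queues = {"privilege": [], "redaction": [], "terse": []}
    for doc in docs:
        # Responsive AND potentially privileged: check before release.
        if doc["responsive"] and doc["potentially_privileged"]:
            queues["privilege"].append(doc["doc_id"])
        if doc["needs_redaction"]:
            queues["redaction"].append(doc["doc_id"])
        if len(doc["text"].split()) < terse_word_limit:
            queues["terse"].append(doc["doc_id"])
    return queues


if __name__ == "__main__":
    coded = [
        {"doc_id": "A", "responsive": True, "potentially_privileged": True,
         "needs_redaction": False, "text": "word " * 50},
        {"doc_id": "B", "responsive": False, "potentially_privileged": True,
         "needs_redaction": True, "text": "word " * 50},
        {"doc_id": "C", "responsive": True, "potentially_privileged": False,
         "needs_redaction": False, "text": "Q3 totals"},
    ]
    print(route_for_human_review(coded))
```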



Notably, in a truly ideal situation, all of the above will have been performed on only a subset of documents, particularly in a large review. Traditionally, tools such as Venio or Equivio are best applied to the largest population possible for best results. This is because they are designed more around organizing the document population to increase the efficiency of manual review. However, because the review is automated, by utilizing a smaller set and completing all steps through final QC, including privilege QC, the entire predictively coded review set can now be used as a larger, more comprehensive seed set for remaining population subsets, each of which can be exponentially larger than the previous set. If the QC is performed at each level, the total amount of QC will be substantially reduced and the overall success of the tool in identifying all content and subtleties will be increased. The dangers of “novel content” will be reduced, the level of understanding of the documents and subject matter of the review/QC team will be maintained as necessary for QC, and the firm attorneys on the case will be able to gain a deep understanding of the document population.
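The staged approach, in which each fully QC’d subset becomes the seed set for a larger one, can be pictured as a schedule of growing tiers. The sketch below only illustrates the arithmetic; the first tier size and the growth multiplier are assumptions to be set by the review team, not recommended values.

```python
def tier_sizes(total_docs: int, first_tier: int, growth: int = 4):
    """Split a population into review tiers that grow by a fixed multiple.

    total_docs: size of the full document population.
    first_tier: size of the first subset taken through final QC.
    growth:     assumed multiplier between consecutive tiers.
    The last tier simply takes whatever documents remain.
    """
    if total_docs < 0 or first_tier <= 0 or growth < 1:
        raise ValueError("need total_docs >= 0, first_tier > 0, growth >= 1")
    sizes = []
    remaining = total_docs
    size = first_tier
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size *= growth
    return sizes


if __name__ == "__main__":
    print(tier_sizes(1_000_000, first_tier=10_000, growth=4))
```

With these example values, a population of one million documents is covered in five tiers, so the full QC cycle is repeated only a handful of times.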



Using the above steps, the reliability of predictive coding can be ensured, while still allowing the work of hundreds of contract attorneys to be done by a dozen firm attorneys, review managers, and document review specialists. The overall quality of the review should be enhanced, through the consistency of the automated tool and the use of a smaller number of attorneys.