Creating Searchable PDF Documents for Review
The searchable PDF (portable document format) is becoming increasingly popular and important for lawyers and litigation teams in discovery, litigation and related legal matters. Several factors are driving PDF’s adoption in legal matters:
1) PDF’s ubiquitous popularity in the business world.
2) Court rules in many jurisdictions requiring that pleadings and motions be filed in PDF.
3) Availability of low-cost scanners and multi-function copier/scanners that allow law offices to inexpensively create PDFs.
4) Release of new versions of Acrobat Professional from Adobe that now support Bates numbering and legal redaction.
When scanning paper documents to PDF, we offer the following recommendations:
Choose the ‘Text-Under-Image’ Option: When scanning a document, you may be presented with different options for types of PDF files. If available, you will usually want to choose the option that applies optical character recognition (OCR) to make the document text searchable. Depending on your specific hardware and software, this may appear as a “make searchable (apply OCR)” option, or as a “text-under-image” or “searchable PDF” file type. With this option, your scanned document will be text searchable within the Acrobat viewer and many other programs designed to search PDF files. The other type of PDF you could choose is called an “image-only PDF”, which is not text searchable. When viewing a PDF file, you can tell whether it is searchable by looking for the ‘select tool’ on the top bar in Acrobat Reader; its presence indicates that the file is text searchable.
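For larger batches, the same check can be scripted. The following is a minimal sketch, not a definitive test: it assumes the third-party pypdf package is installed, and the function names and the thresholds (20 characters per page, half of all pages) are illustrative choices rather than any standard.

```python
def looks_searchable(page_texts, min_chars_per_page=20, min_page_fraction=0.5):
    """Guess whether a PDF is searchable from the text extracted from each page.

    page_texts is a list with one string per page. Both thresholds are
    arbitrary assumptions; tune them for your own documents.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts
        if len((text or "").strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_page_fraction


def extract_page_texts(pdf_path):
    """Read the text layer of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

A call such as looks_searchable(extract_page_texts("scan.pdf")), where "scan.pdf" is a placeholder file name, returns False for an image-only PDF because no text can be extracted from its pages. A True result only shows that a text layer exists; it says nothing about how accurate the OCR is.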
Get the Resolution Right: When scanning images to PDF for litigation purposes, 300 dpi (dots per inch) is a safe option. Scanning at a lower resolution (e.g. 200 dpi) can work well and will produce a smaller file, but legibility can suffer with smaller fonts (e.g. 6 pt. in financial documents), and OCR quality can suffer as well. The trade-off is that higher scan resolutions result in larger file sizes. Scanning above 300 dpi usually does not appreciably increase the readability of a document or its OCR quality.
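The arithmetic behind this trade-off can be sketched as follows. This is an illustration only: it assumes a US Letter page (8.5 x 11 inches) and uses the nominal point size of the font (1 point = 1/72 inch), so the actual height of individual letters will be somewhat smaller.

```python
LETTER_INCHES = (8.5, 11.0)  # assumed page size: US Letter
POINTS_PER_INCH = 72


def pixel_dimensions(dpi, page_inches=LETTER_INCHES):
    """Width and height of the scanned page in pixels."""
    width_in, height_in = page_inches
    return round(width_in * dpi), round(height_in * dpi)


def font_height_pixels(point_size, dpi):
    """Nominal height of a font of the given point size, in pixels."""
    return point_size / POINTS_PER_INCH * dpi


for dpi in (200, 300, 600):
    width, height = pixel_dimensions(dpi)
    print(f"{dpi} dpi: {width} x {height} pixels, "
          f"6 pt text is about {font_height_pixels(6, dpi):.1f} pixels tall")
```

At 200 dpi a 6 pt font has only about 17 pixels of nominal height to work with, compared with 25 at 300 dpi, which is one way to see why small print degrades first. Doubling the resolution from 300 to 600 dpi quadruples the number of pixels on the page.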
Scan to B&W, Grayscale or Color: For litigation review purposes, ‘Black & White’ is often a good option, particularly with good quality originals, and creates a much smaller file than a grayscale or color scan. Color or grayscale may be required for photos (which do not display well with a ‘black and white’ setting). For some documents, a color scan may be critical to understanding the document, such as some charts (e.g. in PowerPoint presentations) or CAD (computer-aided design) documents. Color and grayscale scans will be larger than B&W scans.
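A rough way to see the size difference is to compare raw, uncompressed image data for each mode. The sketch below assumes a US Letter page and the common bit depths of 1 bit per pixel for black and white, 8 for grayscale and 24 for color. Real PDF files are compressed, so actual sizes will be much smaller and will vary with the content; only the general ordering is meant to carry over.

```python
# Assumed bit depths; scanners may offer other settings.
BITS_PER_PIXEL = {"black_and_white": 1, "grayscale": 8, "color": 24}


def uncompressed_bytes(dpi, mode, page_inches=(8.5, 11.0)):
    """Approximate raw size of one scanned page, before any compression."""
    width_px = round(page_inches[0] * dpi)
    height_px = round(page_inches[1] * dpi)
    return width_px * height_px * BITS_PER_PIXEL[mode] // 8


for mode in BITS_PER_PIXEL:
    size_mb = uncompressed_bytes(300, mode) / 1_000_000
    print(f"{mode}: about {size_mb:.1f} MB per page at 300 dpi")
```

Before compression, a color page carries 24 times as much data as a black and white page at the same resolution, and a grayscale page 8 times as much.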
Watch the Other Settings: Scanners will often have a number of other settings that can help improve scan quality and OCR. These include ‘deskew’ (rotates any page that is not square with the sides of the scanner bed, to make the PDF page align vertically), ‘background removal’ (whitens nearly white areas of grayscale and color input), and ‘edge shadow removal’ (removes dark streaks that occur at the edges of scanned pages, where the scanner light is shadowed by the paper edge). ‘Deskew’ will help with OCR accuracy, while ‘background removal’ and ‘edge shadow removal’ can improve readability, but can sometimes impair OCR accuracy. For important documents, it’s best to run some tests.
Get a Quality OCR Program: Not all OCR is created equal. The quality of optical character recognition varies substantially with the quality of the program and the settings chosen when running it. Programs often have a ‘fast’ and a ‘slow’ mode, with the slow mode usually delivering better quality OCR. Some programs will auto-rotate pages when necessary; others will not, and will produce OCR errors as a result.
Pay Special Attention to the Numbers: One secret of OCR programs is that they routinely rely on dictionaries to help recognize text. This works pretty well with words (if they are in the dictionary), but doesn’t help with numbers or other arbitrary strings of characters that are not in a dictionary. Expect to see lower quality OCR in financial reports and other number-intensive documents.
Make Sure Your Litigation Support Software Really Supports PDF: Many legacy litigation software systems were designed around files saved as TIFFs (tagged image file format), an older file format that, unlike PDF, does not support integrated text as part of the file. These older software systems usually have added some support for PDFs, but the integration is often incomplete and some features are not supported with PDF files.
Do Redactions the Right Way: Redactions can be tricky in PDF, and this has been a primary reason why TIFF has survived as a popular format in legal matters. A trap for the unwary is that it is possible in a PDF file to redact text on the image of a document and still have the redacted text be searchable! In a text-under-image PDF file, the redaction must be done on both the text and the image. This problem has been fixed in the latest versions of Acrobat (Acrobat Professional 8 or 9), and this program can be used for PDF redactions. Third-party redaction tools are available as well. Many practitioners play it safe with redacted documents by printing, marking out by hand, and rescanning. This method is manual but foolproof, and works well if the number of documents to be redacted is limited.
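One sanity check before producing redacted PDFs is to search the text layer for the terms that were supposed to be removed. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and page_texts is assumed to be a list of strings, one per page, obtained from whatever PDF text extraction tool you use.

```python
def find_leaked_terms(page_texts, redacted_terms):
    """Return (page_number, term) pairs where a redacted term still
    appears in the extracted text layer.

    Matching ignores case and collapses runs of whitespace.
    """
    leaks = []
    for page_number, text in enumerate(page_texts, start=1):
        normalized_text = " ".join((text or "").lower().split())
        for term in redacted_terms:
            normalized_term = " ".join(term.lower().split())
            if normalized_term and normalized_term in normalized_text:
                leaks.append((page_number, term))
    return leaks


# Example with made-up page text: the name survives on page 1.
pages = ["Payment to John  Doe of $5,000", "No names on this page"]
print(find_leaked_terms(pages, ["john doe"]))
```

Any hit means the redaction covered the image but not the underlying text. An empty result is not proof that the redaction is sound, since OCR errors or unusual spacing can hide a term from a simple text match, so a manual spot check remains advisable.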
Be Specific in Discovery Requests: Litigators are increasingly asking that documents produced in response to discovery requests be provided in electronic form as PDFs. If you do this, be specific as to the matters above. In particular, be sure to specify that the scan resolution be 300 dpi and that OCR be applied. You may also wish to ask what OCR software is used and what settings will be applied. To be non-specific is to invite an adversary to return documents scanned at 150 dpi without OCR, which may be unsearchable, illegible and unintelligible!
Lexbe eDiscovery Suite fully supports searchable PDF files in a web-based review and analysis format. All documents uploaded into our eDiscovery platform automatically have OCR applied and a PDF version created and associated with the original document.
- The Portable Document Format (PDF) is increasingly used in eDiscovery review.
- Some jurisdictions require PDF for pleadings and motions.
- You can identify searchable PDF documents by looking for the ‘select tool’ on the top bar in Acrobat Reader.
- Not all litigation document management platforms fully support PDF files.
- The searchability of a PDF document depends on the quality of the OCR applied.