It’s easy to de-prioritize dealing with that room full of paper that has accumulated over the years – the many reports, letters, books, press clippings, images, etc. But there are some important reasons why it’s worth giving some attention to your paper-based archive. This article explains the importance of digitizing human rights archives, and will walk you through how to plan for a digitization project: from preparation, to how to select the right equipment, to how much to budget.
A typical digitization process will most likely include these steps:
- Define the goal and scope
- Assess what is in your archive
- Determine how documents will be accessed
- Identify and locate the human resources and equipment
- Test a sample of your documents
- Scan, store and index documents
Some useful definitions:
- Digitizing essentially means taking a digital snapshot of a document.
- Digitizing may go one step further by recognising the information inside the document and storing it digitally (optical character recognition, or OCR).
- Archiving means storing and organising a set of documents so they are secure and can be retrieved.
Define the goal and scope
There are four main reasons to digitize: to improve accessibility, to enhance security, to save space, and to allow for sharing.
- Improving accessibility: finding the right document in a roomful of paper is most likely a challenge for you. Digitization can help in two ways. First, you can place your documentation on an online or offline server, or in a private or public digital library, making it easily accessible to staff or the public (more on this below). Second, digitization will inevitably require you to re-organize your documents in a better way: using a file naming convention to rename documents, making an inventory, grouping them into collections, and deciding on keywords.
- Improving security: do you run the risk that malevolent groups may seek to destroy or confiscate your documents? Do your documents contain sensitive information on sources or witnesses that needs to be protected? Or are your documents subject to storage conditions (humidity, insects, rodents) that make digital preservation a necessity? If the answer is yes to any of these questions, it will be worth the time to digitize your documents, or at least part of them.
- Saving space: if your offices are overflowing with boxes and folders, digitizing may be a way of saving office space. (Note: it may not necessarily be the more cost-effective solution compared to physical storage of paper documents, and in many cases, simply weeding out unnecessary documents will also save a lot of space.)
- Allowing for sharing: your work may benefit from being able to share documents with colleagues in your own and other organizations, or with a wider audience. For example, if your organization does litigation and receives court or client correspondence on paper, these documents may need to be scanned so they can be shared by a workgroup, especially if the organization has offices in different locations.
The goal is a short definition of the purpose of the project. Is it to protect sensitive information? To share litigation correspondence? To preserve valuable information on violations collected decades ago, for a future truth commission? To make key documents available to the public in a digital library? This statement of goal should be agreed upon and supported by key stakeholders at the outset of the project.
You do not necessarily need to digitize all your documents. To identify the scope of this project, it is important to define what you need to digitize and why, based on the purposes listed above. For example you may want to digitize:
- Only documents that contain sensitive information.
- A selection of documents that you need to publish on your website or in a digital library.
- A collection of documents of historical value that need to be preserved.
- All testimonies and interviews, so that they can be stored in a digital vault, and the originals destroyed.
- Court decisions from your older litigation cases so that a complete collection can be published on your website.
- Only information related to the fundraising and final report from an event your organization organized.
Assess what is in your archive
Start with a practical understanding of what documents you have. This is the kind of information about your archive that you will want to explore:
- Quantity: If you have piles of papers, one way to estimate how many pages you have is to count how many meters or boxes of documents you have, and then sample some shelves or boxes to find the average number of pages per meter or box. Examples:
- Your documents are stored in 50 archiving boxes, with an average of 2000 double-sided pages per box. You therefore have 50 x 2000 = 100,000 pages.
- Your documents are stored vertically on 10 meters of shelving, with approximately 100 single-sided pages per centimetre. You therefore have 10 x 100 x 100 = 100,000 pages (10 meters x 100 centimetres per meter x 100 pages per centimetre).
- Classification system: It is helpful to understand the existing classification system for the paper documents. For example the documents may be organized by year, then by region, then by event. Or by theme, then by year, then by investigation. Or by year, then by case.
- Physical quality: Review a sample of the documentation (at least 10%) to determine its physical state: for example, what portion was damaged by humidity or by rodents?
- Types of documents: Are these documents letters, thematic reports, annual reports, interviews, testimonies, books, investigation files, litigation case files, periodicals, grey literature, press clippings…?
- Ownership: Were the documents created by, and are they the property of, the organization which holds them, or was part of the collection obtained from other organizations? In the latter case, you will have to investigate whether any conditions were attached with regard to the publication and confidentiality of the material.
- Document retention requirements: Are there documents which should be retained for legal reasons? For example, national law may require you to keep original (paper) signed documents from your litigation cases, or financial records for a period of 10 years.
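The quantity estimates described in the list above are simple enough to do by hand, but a small script makes it easy to redo them as you sample more boxes or shelves. This is only a minimal sketch: the function names and the sample counts are illustrative assumptions, and the figures are the ones used in the two examples.

```python
from statistics import mean


def pages_from_boxes(num_boxes: int, avg_pages_per_box: float) -> int:
    """Estimate total pages from a box count and a sampled average per box."""
    return round(num_boxes * avg_pages_per_box)


def pages_from_shelving(shelf_meters: float, pages_per_cm: float) -> int:
    """Estimate total pages from shelf length (meters) and pages per centimetre."""
    centimetres = shelf_meters * 100  # 100 centimetres per meter
    return round(centimetres * pages_per_cm)


# Hypothetical page counts from three sampled boxes (illustrative only).
sampled_boxes = [1900, 2100, 2000]
avg_per_box = mean(sampled_boxes)

print(pages_from_boxes(50, avg_per_box))   # 50 boxes
print(pages_from_shelving(10, 100))        # 10 meters of shelving
```

Both calls reproduce the 100,000-page figure from the examples. Sampling more boxes or shelves simply changes the average that goes in.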
It is important to actually physically work with the documentation to make this assessment. You may be surprised at what has accumulated over the years… or what is missing and needs to be located. These findings should be described in a concise but very precise document, which will be a very useful basis for further decisions. We are sharing a few examples of our own archive assessment documents so you can see the kind of detail that’s helpful:
- This is a document we prepared for the International Commission of Jurists for their digitization project (2012). HURIDOCS assisted the ICJ by scanning and digitizing over 800 publications from the period 1952 to 2007.
- This is a short assessment we prepared for KontraS (The Commission for “the Disappeared” and Victims of Violence) archive of human rights documents (2012). The KontraS archive was made available online in March 2016.
Determine how the documents will be accessed
Once all the documents are scanned, you’ll need a way for people to access them – whether that is just for your team, or for the public. Lucky for you, we’ve developed a platform for just this purpose! Uwazi (Swahili for “openness”) is an open-source solution for building and publishing document collections. Using Uwazi to publish your documents will allow you to: upload and convert your documents (making them searchable), create and add your own metadata, cross-reference other related content, and more.
If you don’t need to publish your documents, but instead want to archive them for future use, you may want to explore working with these institutional archives that host documents related to human rights:
- The Open Society Archives
- Duke University Human Rights Archive
- Human Rights Web Archive at Columbia University
- University of Texas Libraries’ Human Rights Documentation Initiative
Whether your documents will be accessed publicly, or limited to your team, you will want to develop a file naming convention for all of these documents that can be implemented as they are scanned. For some best practices and guidance, read our article on developing a file naming convention.
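To make the idea of a file naming convention concrete, here is a minimal sketch of how names could be generated consistently at scanning time. The pattern used here (date, collection, short title) is purely a hypothetical example and not a convention prescribed by this article; adapt the fields and their order to whatever your team agrees on.

```python
import re
import unicodedata
from datetime import date


def slugify(text: str) -> str:
    """Lower-case the text, strip accents, and replace other characters with hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def build_file_name(doc_date: date, collection: str, title: str, ext: str = "pdf") -> str:
    """Build a name like 1998-05-14_annual-reports_rapport-annuel-1997.pdf.

    The field order (date, collection, title) is an assumption for illustration.
    """
    return f"{doc_date.isoformat()}_{slugify(collection)}_{slugify(title)}.{ext}"


print(build_file_name(date(1998, 5, 14), "Annual Reports", "Rapport annuel 1997"))
```

Generating names from a function rather than typing them by hand keeps spelling, separators and date formats identical across everyone doing the scanning.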
Identify and secure the human resources and equipment you’ll need
Now that you have a better understanding of the files in your physical archive, and you know your goal and scope for this project, you can begin to identify what resources it will take to carry out the project.
Select and purchase the right scanner for your project. What to look for in a scanner:
- It should have a high duty cycle, meaning you should be able to scan all day, five days a week, without the scanner overheating.
- It should have a feeder tray, allowing you to scan 50 or 100-page documents in one go.
- It should be duplex, meaning it can scan double-sided documents.
- It should be fast – meaning at least 30 pages per minute.
Will your project require optical character recognition (OCR) software? OCR software will help your computer analyse printed text and translate it into something it can process. It is particularly helpful for digital archives of documents because it allows you to search for text within scanned documents. Some scanners come with a version of this software, but your project might require something more robust.
Based on your scanner and software, you can determine the number of pages your scanner can handle and in what timeframe.
- You may need two persons per scanner (one to feed the documents, and one to name the digital files).
- You will need a fast and modern computer for each scanner. There’s little point in connecting a good scanner to a bad computer.
- If possible, you will also want an experienced documentalist as project manager, for supervision of performance and quality control.
- Your budget might limit your ability to purchase equipment and other resources.
It may be necessary to draw up a project budget to ensure that you have the necessary means available. A typical budget for a four-month project involving a 5-person team might look like this:
| Item and Item Type | Unit Cost (USD) | Number of Units | Total (USD) |
|---|---|---|---|
| Short-term clerks, monthly salary | 600 | 16 | 9600 |
| Shredder, rental 1 month | 200 | 4 | 800 |
| Admin support & overhead | 1000 | 4 | 4000 |
With this example budget, the point we are making is that you are undertaking a project which needs to be planned, managed, and budgeted, like any other organizational activity. You may reduce costs by renting equipment instead of buying it, or calling on volunteers rather than paid clerks. But there will always be costs!
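The arithmetic behind each budget line is unit cost multiplied by number of units. The sketch below recomputes the totals for the three rows shown in the example table; the `BudgetLine` class is an illustrative assumption, and a real budget would add further lines (for instance equipment) in the same way.

```python
from dataclasses import dataclass


@dataclass
class BudgetLine:
    item: str
    unit_cost: int  # USD per unit
    units: int      # e.g. person-months or rental months

    @property
    def total(self) -> int:
        return self.unit_cost * self.units


# Rows taken from the example table above.
budget = [
    BudgetLine("Short-term clerks, monthly salary", 600, 16),
    BudgetLine("Shredder, rental 1 month", 200, 4),
    BudgetLine("Admin support & overhead", 1000, 4),
]

for line in budget:
    print(f"{line.item}: {line.unit_cost} x {line.units} = {line.total} USD")

# Subtotal of the listed rows only, not a complete project budget.
subtotal = sum(line.total for line in budget)
print(f"Subtotal of listed rows: {subtotal} USD")
```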
If you need to fundraise for this project, your project proposal should describe and underline the value of your collection, and why it is necessary and important.
Note: we have only included the costs for capturing the documents: scanning, OCR, file-naming, weeding. It does not include more advanced processing such as cataloguing (this means: applying keywords or other descriptive information to each document) in a database. As a rule of thumb, you should calculate that cataloguing will take twice the time that it takes to capture documents.
Test a sample of your documents
Given all the information you have collected so far, you may be able to make a prediction. For example, if you have 200,000 pages to scan and your scanner can realistically scan about 2000 pages per day, then it will take 100 days with one team and one scanner/computer. If you work with two teams, each with a scanner/computer, it will take 50 days.
But this example is only a guess and it assumes optimal conditions. The real amount of time may be much higher. It will depend on the quality of the paper, the quality of the print, the speed not only of the scanner but also of the computer, and the speed and accuracy of your staff. For example, non-standard documents may need to be scanned manually, page by page, while normal A4 paper can be more easily processed via the feeder tray. At the end of the day, only a pilot test with a sample of your documents will tell you how long it will really take.
Note: this concerns only the digitization of the documents, meaning the scanning and naming. It does not include the time required for indexing the documents in a database, which will be done at a later stage.
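The estimate in this section can be sketched as a short calculation that you rerun once the pilot test gives you a measured scanning rate. The function name and the pilot figures below are hypothetical; only the 200,000 pages and 2000 pages per day come from the example above.

```python
import math


def scanning_days(total_pages: int, pages_per_day_per_team: float, teams: int = 1) -> int:
    """Working days needed to scan everything, rounded up to whole days."""
    return math.ceil(total_pages / (pages_per_day_per_team * teams))


# Optimistic estimate from the example: 2000 pages per day per scanner.
print(scanning_days(200_000, 2000))           # one team
print(scanning_days(200_000, 2000, teams=2))  # two teams

# Hypothetical pilot result: 3000 pages scanned and named in 2 working days.
pilot_rate = 3000 / 2  # measured pages per day per team
print(scanning_days(200_000, pilot_rate))
```

With the example figures this gives 100 days for one team and 50 days for two. With the hypothetical pilot rate of 1500 pages per day, the same archive would take 134 days, which shows how much a measured rate can shift the plan.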
Scan, store and index documents
Now you’re ready to get started! As you begin the scanning phase, here are a few things to keep in mind:
Naming and filing of documents
- Remember to give each document a systematic name, using the file naming convention that you developed. Assign the name at the time of scanning the document. For suggestions on how to develop an appropriate file naming system, see the HURIDOCS blog post https://www.huridocs.org/2016/07/file-naming-conventions-why-you-want-them-and-how-to-create-them/
- Check whether there are pages with text on recto and verso (that is, on both the front and the back of the page)
- Check whether there are pages on particularly thin paper – these should be scanned one by one
- Check whether there are pages in a format bigger than A4 – if so, reduce the format (either the scanner has this option, or otherwise use a photocopier)
- When you are scanning books, check whether there are still multiple copies available. If so, you can remove the binding from one copy and scan using the feeder tray, which is much quicker than scanning page by page.
- Ensure that the documents are re-stapled correctly, immediately after scanning
- Ensure that documents are returned to the appropriate folder shortly after scanning
- Ensure that folders are returned to the correct shelf shortly after scanning
Additional resources related to digitizing human rights archives
- The International Council on Archives (ICA) has a collection of toolkits, guides, articles and more in their Online Resource Centre.
- Archives and Dealing with the Past project hosts a collection of resources that includes reports, manuals, guidelines, and videos. They also offer a listserv!
- The Activists’ Guide to Archiving Video, from WITNESS (recently updated!)
- What we’re learning about digitizing and preserving documents, from the Engine Room
- Collection of resources on archiving from the Responsible Data Forum
- New Tactics online discussion on archiving for human rights advocacy, justice and memory
Written by Daniel D’Esposito, Bert Verstappen and Kristin Antin