Document Management

Latest revision as of 18:58, 4 January 2023

Updated May 1, 2012

Scope

  • Document Management Library System (DMLS) Project for scanning, storing, indexing, and sharing books.
  • The DMLS project is limited to scanned documents. It could be expanded to include any digitally stored content, but that is outside the scope of this project.
  • One purpose of this project is to record the content of books so that they no longer need to be kept in a physical library, which frees up shelf space. With the involvement of more people, the project could be extended to include non-destructive, more labor-intensive scanning that preserves the original bindings.

Ownership and Copyright Issues

  • Access to all copyrighted content is restricted to Accounts in the DMLS. The DMLS Accounts are the Owners of the content.
  • Contemporary or possibly contested materials will require additional Check In/Out protocol and content expiration for checked out documents.
  • Content explicitly requiring only a single owner is owned by the Account that checks out the document. The Check-out Account is the document's owner from check-out until check-in or content expiration.
  • Anonymous access can only be granted to the overall Titles catalog and to documents whose licenses allow public access (e.g. Creative Commons, GPL).
  • This DMLS is not to be used to illegally store or transmit content.
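
The check-out rule for single-owner content can be sketched in a few lines. This is only an illustration of the rule described above, not the project's implementation: the class name `SingleOwnerDocument`, its fields, and the 14-day loan period used in the usage note are assumptions, and Python is used here only because the DMLS is not platform specific.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SingleOwnerDocument:
    """Hypothetical record for content that may have only one owner at a time."""
    title: str
    checked_out_by: Optional[str] = None
    expires_at: Optional[datetime] = None

    def current_owner(self, now: datetime) -> Optional[str]:
        """Return the check-out account while the check-out has not expired."""
        if self.checked_out_by and self.expires_at and now < self.expires_at:
            return self.checked_out_by
        return None

    def check_out(self, account: str, now: datetime, loan: timedelta) -> bool:
        """Grant ownership unless another account currently holds the document."""
        if self.current_owner(now) not in (None, account):
            return False
        self.checked_out_by = account
        self.expires_at = now + loan
        return True

    def check_in(self, account: str) -> None:
        """Release ownership; only the current holder can check the document in."""
        if self.checked_out_by == account:
            self.checked_out_by = None
            self.expires_at = None
```

For example, if one account checks a document out for 14 days, a second account's `check_out` call returns `False` until the first account checks it in or the 14 days pass.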

Software Platforms Selection Criteria

Technically, the DMLS is not platform specific.

After reviewing concerns regarding copyright issues, and considering his limited experience, this article's author has selected the Microsoft software platform: Windows 7/Windows Server 2008, Visual Studio 2010, SQL Server 2008 Express, IIS, and Adobe Acrobat (X preferred). This platform is most within the author's existing experience and capabilities to complete a proof of concept implementation that includes document lifecycle security in a reasonable amount of time.

The documents repository can be hosted by any file sharing server platform. The repository should support multiple connections so that multiple software platforms can access the documents. This enables anyone at the space to develop software to manage the documents using any software platform.

This author encourages others to get involved in any development that can use and build on the scanner and documents repository.

Proof of Concept

Physical Requirements

Components

Document Scanner | Auto Doc Feed (ADF), Duplex, and Flatbed | May 1, 2012 | Donated
Binding Paper Cutter | 300 pages capacity Guillotine Cutter | Band Saw can be used for concept or Stack Paper Cutters | Band Saw w/New Blade or Cutter
Belt Sander | 220 Grit? | For use if Band Saw burr or other separation issues (spilled beer etc) |
Scanner PC Workstation | 2.5GHz, 2GB RAM, Firewire or SCSI for Epson scanner, Windows 7, 32 bit, Adobe Acrobat X Prof, 1GbEth | 1280x1024 monitor or touch screen, keyboard, mouse, SW Dev Tools |
Input Operational Footprint | A sturdy induction workspace, 3'x5' minimum | 1/4 for Terminal, 1/4 for Document Prep, 1/2 for Scanner |
DMLS Documents Repository Server VM | LAMP or WIMP, 500GB drive space (dynamic VHD is OK) | Software Development Tools |
DMLS Documents Repository Backupcache VM | Windows or Linux (FreeNAS?), 500GB drive space (dynamic VHD is OK) | Different Physical Host from DMLS server |
Public Application Interface Server VM | LAMP or WIMP minimal for web application to access the documents over the Internet | Can run on the same server as the DMLS Server (if properly secured) |
Subversion Edge Server VM | LAMP, Collab.net Subversion Edge, Ubuntu 12.04 LTS (server), 120GB (VHD expanding OK), 2GB RAM (1GB will work if RAM is a constraint) | Can run on the DMLS server and be used for other software development projects at UAS |

System Constraints

System constraints are requirements or conditions that are expected to limit the overall performance of the system. In this case, the system constraints limit the rate at which new documents can be added to the DMLS. Exploiting these constraints, that is, keeping the constrained steps as efficient as possible, will maximise the throughput of the DMLS.

  • Document Preparation

The ADF requires that the binding be removed and the pages manually separated prior to loading.

  • Document Scanning

An operator must load the ADF, operate the scanner, and monitor its operation. The operator must be skilled enough to correct misfeeds and jams.

  • Document Management

The scanner operator must identify the document and enter that information into the DMLS. The operator must then associate the scanned document file with that entry in the DMLS.

Software

The architecture of the DMLS will be Client-Server. The Software Platform described here is selected based on the Author's capabilities. Other software developers interested in the project can select a different software platform, such as a FOSS stack.

DMLS client workstation performs document and data input

  • Windows 7/32bit operating system (32bit for easier scanner driver support)
  • Acrobat X (current version) Professional version with SDK for COM automation
  • Visual Studio Professional (MSDN) for VB and C# DMLS application development
  • SQL Server Express 2008 Management Studio Express
  • LibreOffice for system documentation
  • Subversion Client for software source code version control

DMLS server provides a central database server and documents repository

  • SQL Server 2008 Express for storing Document Indexing, User Account, and other DMLS information
  • Repository File Server for storing PDF documents

Backup cache file server

  • Maintain a backup of the documents

Because human operation is a constraint, as soon as a document is committed to the repository, a copy should be transmitted to a backup to minimise the possibility that a document would need to be re-scanned.

  • Maintain a backup of the database
  • Maintain a backup of the source code repository.
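
The "back up as soon as it is committed" idea can be illustrated with a short sketch. It assumes the repository and the backup cache are both reachable as directories (for example, mounted network shares). The function names `sha256_of` and `commit_document` are hypothetical, and the checksum comparison is an added safeguard rather than something the project specifies.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large scanned PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit_document(scan: Path, repository: Path, backup: Path) -> str:
    """Copy a scanned PDF into the repository, then immediately into the backup cache.

    Both directories are assumptions for this sketch; each copy is verified
    against the checksum of the original scan.
    """
    expected = sha256_of(scan)
    for target_dir in (repository, backup):
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / scan.name
        shutil.copy2(scan, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch after copying to {target}")
    return expected
```

Copying to the backup in the same step as the commit means a re-scan is only needed if both copies are lost, which matters because operator time is the constraint.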

Public facing server(s)

  • Custom Web Application through which DMLS accounts access the documents
  • Subversion Edge for software developers' version controlled source code repository

Note

  • For proof of concept, some of the above servers can run on the same machine. Only the backup cache should be on a separate physical machine from the main data.
  • Technically, the scanner workstation does not store any critical data, so it does not need to be backed up, other than perhaps an image backup to ease rebuilding it after a hardware failure or a catastrophic software update.

Scanner PC Client Software

General Requirements

  • Connection to Network Shared File Storage for the PDF documents.
  • Scanner Interface Drivers (e.g. TWAIN or vendor proprietary)

Optional Support Software

  • Software Development Tools for the Scanner Application
  • Database Management Tools

Adobe Acrobat Manual PDF Creation

  • Import from Scanner Feature

Custom Application

  • Automate Acrobat's Import from Scanner method to scan documents
  • Transfer scanned documents to the server
  • Associate Scanned Documents with Database Records
  • Manage Document Lifecycle properties (e.g. expiration)
  • Manage Document Security attributes (e.g. ownership, security level, access control)

DMLS Server Software

  • Store and secure document files
  • Database for indexing documents
  • Provide a Public Interface for authenticating users and processing their requests for documents.
  • Assign Security to documents and transmit them
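
As a rough sketch of how the public interface might decide whether to transmit a document, the function below applies the access rules from the Ownership and Copyright Issues section. The record fields and the name `may_download` are hypothetical. Restricting single-owner content to the account that currently has it checked out is one reading of those rules, not a confirmed design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentRecord:
    """Hypothetical subset of the security attributes stored for a document."""
    title: str
    public_license: bool = False          # e.g. Creative Commons or GPL
    single_owner: bool = False            # contemporary or contested material
    checked_out_by: Optional[str] = None  # account currently holding the document


def may_download(doc: DocumentRecord, account: Optional[str]) -> bool:
    """Return True if the request may receive the file; account is None when anonymous."""
    if doc.public_license:
        return True                       # publicly licensed documents are open to everyone
    if account is None:
        return False                      # anonymous users see only the catalog
    if doc.single_owner:
        return doc.checked_out_by == account
    return True                           # any DMLS account may access other content
```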



IMPLEMENTATION - PHASE 1

Prepare Scanner Workstation and Document Server for Development

(TBD) = To Be Determined

Scanner Workstation Setup

  • 1. PC Workstation, Windows 7 Pro/32, install updates, test
  • 2. Document Scanner, FireWire or SCSI, install drivers, test
  • 3. Install Acrobat, test Import from Scanner

Scanner Workstation Development Environment Setup

  • 1. Install Microsoft .NET 4 extended (from the optional updates)
  • 2. Install Visual Studio 2010 Pro
  • 3. Install SQL Server Express 2008 Management Studio Express

Document Server Setup

  • 1. Server VM, Windows Server 2008 R2 Std, install updates, test
  • 2. Install File Services Feature, Configure Share for Documents Repository

Note: The only systems that should be able to access the Repository are the Server and Scanner Workstation.

  • 3. Configure Scanner Workstation to access the Document Repository Share, test
  • 4. Enable Remote Desktop for administrative and developer access through RDP client.
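
The "test" part of step 3 could be scripted on the Scanner Workstation. The sketch below is a minimal example that assumes the share is reachable as an ordinary path (a mapped drive or UNC path). The function name `share_is_writable` and the probe file name are made up for illustration.

```python
import uuid
from pathlib import Path


def share_is_writable(share: Path) -> bool:
    """Write, read back, and delete a small probe file on the repository share."""
    probe = share / f"dmls-probe-{uuid.uuid4().hex}.tmp"
    payload = b"dmls share test"
    try:
        probe.write_bytes(payload)
        return probe.read_bytes() == payload
    except OSError:
        return False                      # share missing, read-only, or access denied
    finally:
        try:
            probe.unlink()                # clean up the probe if it was created
        except OSError:
            pass
```

Running the same check from any other machine should return `False` if the share is restricted to the Server and Scanner Workstation as intended.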

Document Server Development Environment Setup

  • 1. Install Microsoft .NET 4 extended (from the optional updates)
  • 2. Install IIS Role with ASP.NET extension and other options (TBD)
  • 3. Install Visual Studio 2010 Pro with SQL Server Express 2008. Include Visual Basic, C#, and C
  • 4. Install SQL Server Express 2008 Management Studio Express
  • 5. Install Acrobat



IMPLEMENTATION - PHASE 2

Start Scanning Documents and Developing Custom Software

Document Preparation

  • 1. Develop criteria for identifying which documents should be scanned
  • 2. Develop procedure(s) for preparing and queuing the documents to be scanned
  • 3. Try various methods for preparing documents: Band Saw/Belt Sander, Knife, etc

Scan Documents

  • 1. Try various document capture techniques.
    • Scanner with Auto Document Page Feeder (ADF)
    • Photographic Book Scanners DIY Book Scanner
  • 2. Develop procedure for using Acrobat to Import from Scanner, Save, and Upload to Server
  • 3. Develop schema for indexing and recording document meta-data
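
One possible starting point for the indexing schema in step 3 is sketched below. It uses Python's built-in `sqlite3` module purely as a stand-in for SQL Server 2008 Express. The table names, column names, and the helper `register_scan` are all assumptions made for illustration, not a schema the project has defined.

```python
import sqlite3

# Hypothetical schema: one row per title in the catalog, one row per scanned PDF.
SCHEMA = """
CREATE TABLE IF NOT EXISTS title (
    title_id    INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    author      TEXT,
    license     TEXT
);
CREATE TABLE IF NOT EXISTS document (
    document_id INTEGER PRIMARY KEY,
    title_id    INTEGER NOT NULL REFERENCES title(title_id),
    file_path   TEXT NOT NULL UNIQUE,
    page_count  INTEGER,
    scanned_on  TEXT
);
"""


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the index database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def register_scan(conn: sqlite3.Connection, name: str, file_path: str,
                  page_count: int, scanned_on: str) -> int:
    """Create the title if needed and associate the scanned PDF with it."""
    row = conn.execute("SELECT title_id FROM title WHERE name = ?", (name,)).fetchone()
    if row is None:
        title_id = conn.execute("INSERT INTO title (name) VALUES (?)", (name,)).lastrowid
    else:
        title_id = row[0]
    cur = conn.execute(
        "INSERT INTO document (title_id, file_path, page_count, scanned_on) "
        "VALUES (?, ?, ?, ?)",
        (title_id, file_path, page_count, scanned_on),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping titles and document files in separate tables lets the Titles catalog be exposed anonymously while the file paths stay behind account checks.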

Workstation Automation

  • 1. Develop custom software to provide an application that enables the operator to efficiently scan and associate documents with database records.

Document Management Library Services

  • 1. Define document sharing protocols and security attributes
  • 2. Develop application service(s) to apply security attributes to documents prior to transmitting them to users
  • 3. Develop a web application that enables administrators to manage the documents' meta-data and security attributes
  • 4. Develop a web application that enables end users to access the documents.



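Other Related Links and References

  • Bookeye Scanners: http://www.imageaccess.com/bookeye4.shtml#fragment-12
  • Google Book Scanning: http://www.npr.org/blogs/library/2009/04/the_granting_of_patent_7508978.html
  • Photographic Book Scanners - DIY Book Scanner: http://www.diybookscanner.org
  • YouTube videos: http://www.youtube.com/watch?v=gjm6dBNlPug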