DA Readme Page 3

Appendix A: Export Format Notes

Export Naming Conventions:

Exported files can be named using any combination of the following:

%ProjectID%	Three letter project ID
%FileID%	Internal file ID
%TITLE%	Original file name (includes parent if zip/msg/eml)
%SHORT_TITLE%	Guaranteed 32 char unique name
%EXT%	Original file extension
%BATESSTART%	Starting bates sequence for file
%BATESEND%	Ending bates sequence for file
%PAGE%	Page number
%BATES%	Bates number for page
%DOCID%	User assigned document ID
ASCII_STRING	Any ASCII string
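
The token substitution described above can be sketched in a few lines of Python. This is an illustration only, not Discovery Assistant's implementation: the function name expand_name and the sample values are hypothetical, and exact-case matching of token names is an assumption.

```python
import re

# Hypothetical values for one exported file; the keys mirror the tokens
# listed above, the values are made up for illustration.
SAMPLE_VALUES = {
    "ProjectID": "TST",
    "FileID": "0000038",
    "EXT": "zip",
}

def expand_name(template: str, values: dict) -> str:
    """Replace each %TOKEN% with its value; unknown tokens are left as-is."""
    def substitute(match):
        token = match.group(1)
        return str(values.get(token, match.group(0)))
    return re.sub(r"%([A-Za-z_]+)%", substitute, template)

# Literal ASCII text ("_" and ".") passes through unchanged.
print(expand_name("%ProjectID%_%FileID%.%EXT%", SAMPLE_VALUES))
```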

Export File Formats:

Export Directory Structure Options:

If we take a set of source files,

Source Files	Assigned Name
custodian1\list.doc	file1
custodian2\Folder1\sample.pdf	file2
custodian2\Folder2\sales.xls	file3
custodian3\Box1\Folder3\january.doc	file4
custodian3\Box1\Folder3\february.doc	file5
custodian3\Box1\Folder3\march.doc	file6

we get the following directory exports:

Flat:

---------

|---Output

| File1.tif

| File2.tif

| File3.tif

| File4.tif

| File5.tif

| File6.tif

| File1.txt

| File2.txt

| File3.txt

| File4.txt

| File5.txt

| File6.txt

|---Source

File1.doc

File2.pdf

File3.xls

File4.doc

File5.doc

File6.doc

Mirror:

---------

|---Custodian1

| |---source

| | File1.doc

| File1.tif

| File1.txt

|---Custodian2

| |---Folder1

| | |---source

| | | File2.pdf

| | file2.tif

| | file2.txt

| |

| |---Folder2

| |---source

| | File3.xls

| file3.tif

| file3.txt

|---Custodian3

|---Box1

|---Folder3

|---source

| File4.doc

| File5.doc

| File6.doc

File4.tif

File5.tif

File6.tif

File4.txt

File5.txt

File6.txt

Bates:

----OUTPUT

| ----Bates_file1

| | File1.tif

| ----Bates_file2

| | File2.tif

| ----Bates_file3

| | File3.tif

| ----Bates_file4

| | File4.tif

| ----Bates_file5

| | File5.tif

| ----Bates_file6

| File6.tif

----SOURCE

| ----Bates_file1

| | list.doc

| ----Bates_file2

| | sample.pdf

| ----Bates_file3

| | sales.xls

| ----Bates_file4

| | january.doc

| ----Bates_file5

| | february.doc

| ----Bates_file6

| | march.doc

----TEXT

| ----Bates_file1

| | File1.txt

| ----Bates_file2

| | File2.txt

| ----Bates_file3

| | File3.txt

| ----Bates_file4

| | File4.txt

| ----Bates_file5

| | File5.txt

| ----Bates_file6

File6.txt

Vol/Box:

----VOL0001

----BOX0001

| |----source

| | File1.doc

| | File2.pdf

| File1.tif

| File2.tif

| File1.txt

| File2.txt

|---BOX0002

| |----source

| | File3.xls

| | File4.doc

| File3.tif

| File4.tif

| File3.txt

| File4.txt

|---BOX0003

| |----source

| | File5.doc

| | File6.doc

| File5.tif

| File6.tif

| File5.txt

| File6.txt
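
As a rough illustration of how the layouts above map a source file to export locations, here is a minimal Python sketch. The function names are hypothetical, and the two-files-per-box rule is only inferred from the Vol/Box example; the real box size may be configurable.

```python
from pathlib import PureWindowsPath

def flat_paths(source: str, name: str) -> dict:
    """Flat: images and text under Output, native files under Source."""
    ext = PureWindowsPath(source).suffix
    return {
        "image": str(PureWindowsPath("Output", name + ".tif")),
        "text": str(PureWindowsPath("Output", name + ".txt")),
        "source": str(PureWindowsPath("Source", name + ext)),
    }

def mirror_paths(source: str, name: str) -> dict:
    """Mirror: keep the original folder, natives go in a 'source' subfolder."""
    src = PureWindowsPath(source)
    folder = src.parent
    return {
        "image": str(folder / (name + ".tif")),
        "text": str(folder / (name + ".txt")),
        "source": str(folder / "source" / (name + src.suffix)),
    }

def vol_box_folder(file_index: int, files_per_box: int = 2) -> str:
    """Vol/Box: zero-based file index -> box folder (box size is assumed)."""
    box = file_index // files_per_box + 1
    return str(PureWindowsPath("VOL0001", f"BOX{box:04d}"))

print(mirror_paths(r"custodian2\Folder1\sample.pdf", "file2")["source"])
print(vol_box_folder(4))
```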

Summation DII notes:

Classifications of DII Files

Summation created a batch load file format and protocol that service bureaus can use to facilitate the processing and delivery of eDiscovery that will be loaded into a Summation case. Service bureaus can provide eDiscovery using three different types of DII files:

* Class I DII file - This class is geared toward traditional paper discovery service bureaus that scan paper documents and use Optical Character Recognition (OCR) technology on the resulting imaged documents. Also, in this model, e-mail messages and electronic documents (received in either paper or native electronic format) are converted or petrified by a service bureau to TIFF or PDF image formats, and the text and metadata are extracted. When loaded into a Summation case, the image information is loaded into the ImgInfo table, the full-text is loaded into the ocrBase, and generated metadata is loaded into the Core Database. The difference between a Class I DII file and a DII file prepared for previous versions of Summation is the ability of the Class I DII file to more easily maintain the parent/child relationships of compound documents.

* Class II DII file - This file is geared toward forensic-oriented service bureaus that extract or parse metadata and e-mail message information for loading into designated Summation Core Database fields. Native electronic files are copied to the eDocs repository specified in the case directory structure. Once the files are copied and the data loaded, the user can take advantage of Summation's multi-file format index, search, and retrieval functions to produce electronic documents in their native formats. These Class II DII file attributes will allow users to narrow or winnow down a collection of electronic data, such as e-mail messages, to only disclose relevant non-privileged data to the requesting party. The Class II DII file also facilitates the preservation of the parent/child relationships of compound documents.

* Class III DII file - This file is a combination of the Classes I and II DII file formats.

The above DII load file classes give Summation users the ultimate flexibility for applying the varying formats and protocols used to acquire, process, deliver, and deploy digital information underlying litigation, regulatory compliance, and risk management.

Note: The above DII load file formats are also acceptable formats to deliver electronic data that will be loaded into CaseVault, the litigation hosting service and subsidiary of Summation Legal Technologies. CaseVault can be used as a winnowing platform for cases that include large volumes of electronic data. Once the set is culled and reduced, the electronic data can be loaded into a Summation system for additional review and case preparation.

Note:

Tokens can be longer than 8 characters, but field names cannot be. For example, the @ATTACHRANGE token is 11 characters long, but it populates the ATTRANGE field, whose name is only 8. Custom tokens must be 8 characters or fewer because the field names they populate are limited to 8 characters.

ImageMAKER custom defined additional fields in the Summation Export DII file:

@C FILENAME calendar.zip

@C FILEPATH Z:\Web_test_files\calendar.zip

@C ISDUP True

@C DUPPATHS C:\test\test.HTM; C:\test\testcopy.htm.

@C PGCOUNT 10

Details:

FILENAME - name of file at time of conversion.

FILEPATH - original source path for file (when being converted).

PGCOUNT - number of pages in the converted file.

Default is 1 if the record is not defined in the data set, or the last value defined if it is not defined in a FileID record.

If files are exported single page per file, then this value indicates total number of exported pages for the source file.

PgCount is already defined as a custom data field in the Summation database.

ISDUP - defines whether the record has any other duplicates in the exported data set.

This information is used when reviewing the data - and indicates that there are other copies of the same information elsewhere in the data set. (Field name lengths are limited to 8 chars).

Supported values are 'True' and 'False'

DUPPATHS - lists the 'filePath' source file names that are in the duplicate set.

This value lists source filenames of the duplicate files, not DocIDs, and gives an immediate indication as to where the duplicate data is stored. FilePaths are separated by a '; ' character pair (Semicolon/space).

If there are no duplicates, then the character string 'NA' is required.
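
The rules for these custom fields can be sketched as follows. This is a hedged illustration rather than the exporter's actual code: the function name custom_field_lines is hypothetical, and deriving ISDUP from whether the duplicate list is empty is an assumption.

```python
def custom_field_lines(filename, filepath, page_count, duplicate_paths):
    """Build the '@C' lines described above for one record."""
    # DUPPATHS: paths joined by '; ' (semicolon/space), or 'NA' when empty.
    dup_value = "; ".join(duplicate_paths) if duplicate_paths else "NA"
    fields = [
        ("FILENAME", filename),
        ("FILEPATH", filepath),
        ("PGCOUNT", str(page_count)),
        ("ISDUP", "True" if duplicate_paths else "False"),
        ("DUPPATHS", dup_value),
    ]
    for name, _ in fields:
        # Field names are limited to 8 characters.
        if len(name) > 8:
            raise ValueError(f"field name too long: {name}")
    return [f"@C {name} {value}" for name, value in fields]

for line in custom_field_lines("calendar.zip",
                               r"Z:\Web_test_files\calendar.zip", 1, []):
    print(line)
```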

Sample DII File:

; Summation DII Class I File

; Created on 7/20/2005 2:55:29 PM

; Created by DiscoveryAssistant version 3.2 build 1095

;

; Machine Name: BLAISE

; Project Path: F:\Work\TEST.xml

; Project Name: TEST

; Project ID: TM

@FULLTEXT DOC

@T 0000038

@DOCID 0000038

@MEDIA eDoc

@APPLICATION WinZip File

@C FILENAME calendar.zip

@C FILEPATH Z:\Web_test_files\calendar.zip

@C PGCOUNT 1

@C ISDUP False

@C DUPPATHS NA

@ATTACH 0000039; 0000040; 0000041; 0000043; 0000044; 0000045; 0000046; 0000047; 0000048; 0000049; 0000050; 0000051; 0000052; 0000053

@ATTACHCOUNT 14

@DATESAVED 7/21/2005

@DATECREATED 7/21/2005

@D @I\

0000038.tif

@T 0000039

@DOCID 0000039

@MEDIA eMail

@MSGID

@C PGCOUNT 1

@C ISDUP True

@C DUPPATHS Z:\Web_test_files\calendar.zip\calendar.pst\Personal Folders\Tasks\a second task request.msg;C:\imgmaker\temp1\a second task request.msg

@SUBJECT a second task request

@EMAIL-BODY separate task item in a separate task list.

@EMAIL-END

@ATTACHCOUNT 0

@PARENTID 0000038

@D @I\

0000039.tif
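
To make the record structure of the sample concrete, the following Python sketch groups DII lines into records, starting a new record at each @T line. It is a simplification for illustration: split_dii_records is a hypothetical helper, lines before the first @T (such as @FULLTEXT) are dropped, and multi-line values such as @EMAIL-BODY are kept as raw lines rather than interpreted.

```python
def split_dii_records(text: str) -> list:
    """Group DII lines into records; each record starts at an '@T' line."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and ';' header comments
        if line.startswith("@T "):
            current = []          # '@T ' (with space) does not match @TO etc.
            records.append(current)
        if current is not None:   # ignore lines before the first record
            current.append(line)
    return records

sample = """; Summation DII Class I File
@FULLTEXT DOC
@T 0000038
@DOCID 0000038
@T 0000039
@PARENTID 0000038
"""
for record in split_dii_records(sample):
    print(record[0], len(record))
```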

Available MetaData Fields for Summation:

@C BEGDOC: Export file title of first page

@C ENDDOC: Export file title of last page

@APPLICATION: Name of creating application

@C ATTCOUNT: Count of attachments

@ATTACH: List of export file titles of attachments

@ATTACHRANGE: Range of export file titles of attachments

@C GROUPRANGE: Range of export file titles that belong as a group. e.g. an email and its attachments or a zip file and its contents

@C BATESGROUPRANGE: Range of Bates Numbers that belong as a group. e.g. an email and its attachments or a zip file and its contents

@C BEGATTACH: Export file title of first page of group. e.g. an email and its attachments or a zip file and its contents

@C ENDATTACH: Export file title of last page of group. e.g. an email and its attachments or a zip file and its contents

@C ATTTITLE: File title of attachment

@FROM: Document author

@BATESBEG: Beginning Bates number

@BATESEND: Ending Bates number

@C BATESGBEG: Beginning Bates number for group. e.g. an email and its attachments or a zip file and its contents

@C BATESGEND: Ending Bates number for group. e.g. an email and its attachments or a zip file and its contents

@BCC: Blind Carbon Copy recipient

@CC: Carbon Copy recipient

@C DACOMMNT: Discovery Assistant PassThru comment

@DATECREATED: Source document creation date

@TIMECREATED: Source document creation time

@DATERCVD: Email received date

@TIMERCVD: Email received time

@DATESAVED: Source document modified date

@TIMESAVED: Source document modified time

@DATESENT: Email sent date

@TIMESENT: Email sent time

@C DATEACC: Source Document Last Access Date

@C TIMEACC: Source Document Last Access Time

@C DOCTITLE: Document Title

@C DUPPATHS: Source document paths of duplicate items

@EMAIL-BODY: Body of email

@C FILEEXT: Source file extension

@C FILEPATH: Source file path

@C XSFPATH: Exported source file path

@C FTITLE: Source file title

@C FILENAME: Source file name (including extension)

@C FTYPENAME: Source file type name

@FOLDERNAME: Email parent folder name

@FROM: Email From address

@C HASHCODE: MD5 hash code value for source document

@C ISDUP: True/False is duplicate

@C ITEMID: Discovery Assistant file ID

@MSGID: Email message ID

@C PGCOUNT: Output file page count

@PARENTID: Export file title of parent item

@C SFTITLE: Short file title

@C SIZEDISK: Source file size on disk

@STOREID: Message store identifier

@C STORNAME: Message store source file name

@SUBJECT: Email subject

@TO: Email To address

@C ITEMINDX: Item Index

@C INETHDR: Internet Header

@C DOCID: Document ID

@C ALTRCALW: Alternate Recipient Allowed

@C AUTOFWD: Auto Forwarded

@C BILLINFO: Billing Information

@C CATEGOR: Categories

@C COMPNIES: Companies

@C DATEDFDL: Deferred Delivery Date

@C TIMEDFDL: Deferred Delivery Time

@C DELAFSUB: Delete After Submit

@C DATEEXP: Expiry Date

@C TIMEEXP: Expiry Time

@MULTILINE HTMLBODY: HTML Message Body

@C IMPRTNCE: Importance

@C MSGCLASS: Message Class

@C MSGMLG: Message Mileage

@C NOAGING: No Aging

@C DLVRPTRQ: Originator Delivery Report Requested

@C OLINTVER: Outlook Internal Version

@C OLVER: Outlook Version

@C RDRECREQ: Read Receipt Requested

@C RCVBYNAM: Received By Name

@C RCVBENAM: Received On Behalf Of Name

@C RCPREPRO: Recipient Reassignment Prohibited

@MULTILINE REPRECIP: Reply Recipients

@C SAVED: Saved

@C SENSI: Sensitivity

@C SENT: Sent

@C SNTBENAM: Sent On Behalf Of Name

@C SUBMTTED: Submitted

@READ: Message read y/n?

@C UNREAD: UnRead

@C VOTOPT: Voting Options

@C VOTRESP: Voting Response

@C GLBLPRM: 'Yes' if this is the first occurrence of this item in the global table.

@C GLBLCNT: Count of occurrences of this item in the Global Project table.

@C SRCCUSTOD: Source Custodian. Obtained from third to last directory name in source file path.

@C SRCBOX: Source Box. Obtained from second to last directory name in source file path.

@C SRCFOLDER: Source Folder. Obtained from last directory name in source file path.

@C DATEPRNT: Source Document Last Print Date

@C TIMEPRNT: Source Document Last Print Time
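
The SRCCUSTOD, SRCBOX and SRCFOLDER derivation listed above (third-to-last, second-to-last and last directory name of the source path) can be sketched as below. The helper name source_location is hypothetical, and returning empty strings for paths with fewer than three directories is an assumption, since that case is not specified here.

```python
from pathlib import PureWindowsPath

def source_location(file_path: str) -> dict:
    """Derive custodian/box/folder from the last three directory names."""
    parent = PureWindowsPath(file_path).parent
    # Drop the drive/root component (e.g. 'C:\\') if the path has one.
    dirs = parent.parts[1:] if parent.anchor else parent.parts
    padded = ("", "", "") + tuple(dirs)  # pad so short paths do not fail
    return {
        "SRCCUSTOD": padded[-3],
        "SRCBOX": padded[-2],
        "SRCFOLDER": padded[-1],
    }

print(source_location(r"custodian3\Box1\Folder3\january.doc"))
```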

Concordance Export File Format:

Source documents are to be generated into single page TIFF files, single page TXT files, and a meta-data file.

Meta data and the single page TXT file are then combined to create a single DAT file per page for import. Each data file is assigned a unique ID (Bates Number).

Concordance imports all the DAT files from a given directory into the database.

The list of image files is listed in the .LOG file. There is a unique TIFF file for each DAT file created. The image files are imported all at the same time through the Opticon Viewer interface.

Detailed Requirements:

Create the following files:

1. multi-line .DAT files containing information for each page of each file.

2. multi-line .LOG file containing a list of tiff images (Opticon load images) that are associated with each defined page.

.DAT File Description:

The .DAT file contains file meta data, with the exported text as the last field.

Export fields for the data are defined in the 'export fields' section (below).

Sample data are also provided in the 'sample data' data section (below).

The .DAT file contains a single delimited list of fields.

Rather than the common comma-separated notation

"field1","field2","field3"

fields are delimited by substituting decimal 20 for ',' and decimal 254 for '"'.

Decimal 20 and decimal 254 are explicitly defined to NOT occur in any imported text.

Newline values in the imported text are modified to be decimal 174.

.DAT File Sample:

The sample data:

to:Ken Davies

from:Sales

Subject:The year ahead

Text: A long discussion about the year ahead.

Looking forward to your comments.

Call me if you want to do lunch.

becomes:

(254)Ken Davies(254)(20)(254)Sales(254)(20)(254)The year ahead(254)(20)(254)A long discussion about the year ahead.(174) Looking forward to your comments.(174) Call me if you want to do lunch.(174)(254)

where the values in brackets (254) (20) (174) are decimal byte values in the data stream.

The data fields in this example are pre-defined to be "to","from","subject","text".
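
The delimiter substitution can be sketched in Python as follows, using decimal 254 as the quote, decimal 20 as the separator and decimal 174 as the newline replacement. The function name encode_dat_record is hypothetical, and writing the result as Latin-1 (one byte per character) is an assumption about the file encoding.

```python
FIELD_SEP = chr(20)   # stands in for ','
QUOTE = chr(254)      # stands in for '"'
NEWLINE = chr(174)    # stands in for a line break inside a field

def encode_dat_record(fields):
    """Quote each field with 254, join with 20, map newlines to 174."""
    encoded = []
    for value in fields:
        value = value.replace("\r\n", "\n").replace("\n", NEWLINE)
        encoded.append(QUOTE + value + QUOTE)
    return FIELD_SEP.join(encoded)

record = encode_dat_record([
    "Ken Davies",
    "Sales",
    "The year ahead",
    "A long discussion about the year ahead.\nLooking forward to your comments.",
])
data = record.encode("latin-1")  # assumed encoding: each delimiter is one byte
print(list(data[:3]), list(data[-1:]))
```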

.DAT file fields:

Field Name	Sample Data	Populated
STARTPAGE	00010002	YES
ENDPAGE	00010002	YES
DATE	20041219	YES [Date Accessed/Sent Date]
DOCTYPE	Doc extension	YES [SourceFile Ext]
TITLE	Untitled	YES [Title from MetaData]
AUTHOR	Simmons;RC / McMurrian;HP	YES [Author/From: from MetaData]
AUTHORORG	Cole Evans and Peterson	NO
RECIPIENT	McCorman;SL	YES [To: from MetaData]
RECIPORG	Cowco	NO
CC	“”	YES [Cc: from MetaData]
SUMMARY	“”	NO
CONDITION	“”	NO
ATTACH_TYPE	“”	NO
LEAD_DOC	“”	NO
ATTACHMENTS	“”	NO
PRIMARYDATE	19831220	YES [Date Created]
PAGES	3	YES
CCORG	“”	NO
ATT	“”	NO
ATTORG	“”	NO
OCR1	* 0010002 ** …. contents of page…	NO
OCR2	“”	NO
OCR3	“”	NO
OCR4	“”	NO
OCR5	“”	NO
RENUMBER	161	NO
ISSUE	“”	NO
DISC_STATUS	“”	NO
SOURCE_FILE_NAME	C:\fname.doc	YES
SOURCE_FILE_SIZE	104456	YES

Hyperlinked Source documents:

XSPATHNAME .\SOURCE\TST00002.msg

TIFF file destination:

XIPATHNAME OUTPUT

XIFILENAME TST00002.tif

.LOG file Sample:

00010001,Data,E:\DATABASE\COWCO\001\00010001.TIF,Y,,,

00010002,Data,E:\DATABASE\COWCO\001\00010002.TIF,,,,

00010003,Data,E:\DATABASE\COWCO\001\00010003.TIF,,,,

00010004,Data,E:\DATABASE\COWCO\001\00010004.TIF,Y,,,

00010005,Data,E:\DATABASE\COWCO\001\00010005.TIF,,,,

00010006,Data,E:\DATABASE\COWCO\001\00010006.TIF,,,,

00010007,Data,E:\DATABASE\COWCO\001\00010007.TIF,Y,,,

00010008,Data,E:\DATABASE\COWCO\001\00010008.TIF,Y,,,

00010009,Data,E:\DATABASE\COWCO\001\00010009.TIF,Y,,,

00010010,Data,E:\DATABASE\COWCO\001\00010010.TIF,,,,

00010011,Data,E:\DATABASE\COWCO\001\00010011.TIF,,,,

.LOG file fields:

Field 1: "Production Number" -- This is a text field which contains the "Production" or "Control" or Bates number for that page of the document. It is a unique value and is the load file "key".

Field 2: "Volume ID" -- This is also a text field. It should contain the Volume ID of the CD on which the images are delivered.

Field 3: "Full DOS Path" -- This is a text field containing the full DOS path to the image file.

Field 4: "Document Break" -- This is a text field. If this particular image is the first page of a document, this field should contain a "Y" (Yes).

Field 5: "Folder Break" -- This is a text field. It is rarely used, but if used it is intended to work just like Document Break, i.e. it would contain a "Y" if this is the first page of a new folder.

Field 6: "Box Break" -- This is a text field. Also rarely used but intended to work like Doc and Folder Break...would contain a "Y" if this is the first page of a new box.

Field 7: "Pages" -- This is a text field although it contains numeric data. If this is the first page of a new document, "Document Break" will contain a "Y" and this field will show the number of pages for the document. (This field is a "nice to have" as after the images are loaded, Opticon will calculate the number of pages based on the database.)
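
A minimal sketch of writing .LOG lines with the seven fields described above follows. The function name log_lines and its input shape (one list of page keys per document) are hypothetical; the folder break, box break and pages fields are left empty, as in the sample.

```python
def log_lines(documents, volume, image_dir):
    """Build one .LOG line per page; 'Y' marks the first page of a document."""
    lines = []
    for pages in documents:
        for index, key in enumerate(pages):
            doc_break = "Y" if index == 0 else ""
            path = f"{image_dir}\\{key}.TIF"
            # Fields: key, volume, path, doc break, folder break, box break, pages
            lines.append(",".join([key, volume, path, doc_break, "", "", ""]))
    return lines

docs = [["00010001", "00010002", "00010003"], ["00010004"]]
for line in log_lines(docs, "Data", r"E:\DATABASE\COWCO\001"):
    print(line)
```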

Contents of import directory for an 11 page file:

00010001.dat

00010001.tif

00010002.dat

00010002.tif

00010003.dat

00010003.tif

00010004.dat

00010004.tif

00010005.dat

00010005.tif

00010006.dat

00010006.tif

00010007.dat

00010007.tif

00010008.dat

00010008.tif

00010009.dat

00010009.tif

00010010.dat

00010010.tif

00010011.dat

00010011.tif

images.opt

IPRO LFP Export File Format:

(source: http://www.ediscovery.org/litigation-support/technical-standards_4_02_IPRO.htm)

To convert from Opticon format, download iConvert from http://www.IproCorp.com. (free)

Example 1: Single Page .TIF files

IM,MSC00014,D,0,@MSC001;IMAGES\00\00;MSC00014.TIF;2

IM,MSC00015,,0,@MSC001;IMAGES\00\00;MSC00015.TIF;2

IM,MSC00016,D,0,@MSC001;IMAGES\00\00;MSC00016.TIF;2

IM,MSC00017,,0,@MSC001;IMAGES\00\00;MSC00017.TIF;2

Example 2: Multi Page .TIF file

IM,MSC00014,D,1,@MSC001;IMAGES\00\00;MSC00014.TIF;2

IM,MSC00015,,2,@MSC001;IMAGES\00\00;MSC00014.TIF;2

IM,MSC00016,D,1,@MSC001;IMAGES\00\00;MSC00016.TIF;2

Note: Because the files are multi-page, the entire bates range (or image key range) must point to the same .TIF file. For example, MSC00014.TIF contains both pages "14" and "15". Therefore, to view page 15, the computer must display MSC00014.TIF.

The following provides a breakdown of the fields:

IM	Import code identifier (Importing New Page/Image database record)
MSC00014	The image key/document id number
D	Document designation; only designate the first page of each document.
0	Offset to the TIFF file. Always 0 for single page TIFF files. When creating multi-page TIFF files, this number increments for the pages within the file. (If there is an 11 page document, the offset would start at 1 and end at 11, and the next TIFF file would start over at 1.)
@MSC001	CD volume name
IMAGES\00\00	Directory path on the CD for the image
MSC00014.TIF	Filename for the image.
2	Tells IPRO the Types* of image file, e.g. tiff, PDF

*Supported Image Types and their specification in the LFP file are:

1. Type 1 is for IPRO Tech image from DOS-Based version, still supported (.IMG)

2. Type 2 is for Standard single and multiple page black & white or color TIFF (.TIF)

3. Type 3 is for IPRO Tech stacked TIFF (.STF)

4. Type 4 is for Color image (.BMP, .PCX, .JPEG or .PNG)

5. Type 5 is for black & white .PDF

6. Type 6 is for Color .PDF

7. Type 7 is to Auto-detect the .PDF type, e.g. Color or Black & White
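
The two LFP examples can be reproduced with a short Python sketch. The function name lfp_lines and its parameters are hypothetical, and this covers only the IM record shown in the examples; it is not a full description of the LFP format.

```python
def lfp_lines(documents, volume, directory, image_type=2, multi_page=False):
    """Build IM lines; documents is a list of page-key lists, one per document."""
    lines = []
    for pages in documents:
        for index, key in enumerate(pages):
            flag = "D" if index == 0 else ""      # mark first page of a document
            if multi_page:
                offset = index + 1                # 1-based page within the TIFF
                filename = pages[0] + ".TIF"      # all pages share one file
            else:
                offset = 0                        # always 0 for single-page TIFF
                filename = key + ".TIF"
            lines.append(
                f"IM,{key},{flag},{offset},@{volume};{directory};{filename};{image_type}"
            )
    return lines

docs = [["MSC00014", "MSC00015"], ["MSC00016"]]
for line in lfp_lines(docs, "MSC001", r"IMAGES\00\00", multi_page=True):
    print(line)
```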

RINGTAIL Support

Exporting to Ringtail:

1. Export to Ringtail from Discovery Assistant.

2. Load the CSV file into the Ringtail Flat File converter to convert to MDB, then run it through the Validator.

Reference Docs: (these seem to overlap)

CaseBook_Data_Standards_Manual_v602r5.pdf

Ringtail Legal Data Standards Manual v2[1].1.2.pdf

Tools Provided by FTI Ringtail

1. Data Standards Manual: outlines the Ringtail load file

2. Flat File Converter: a tool used to convert a flat-file database to a Ringtail load file; and

3. Validator: a tool used to verify the integrity of a Ringtail load file.

NO load file should be loaded to Ringtail without first being run through this free tool.

To access these free tools, browse to our support website http://support.ftiringtail.com . From there, click the button to LOGIN AS GUEST, then access the Downloads tab.

Ringtail Flat file converter Notes:

The validator does not understand Office 2007. You need to run on an Office 2003 machine. Time fields are not supported in Ringtail. Any time fields should be set to TEXT. Boolean fields are TEXT. We don't currently convert to T/F.

MAIN tab:

ImageMAKER field name Ringtail

----------------------------------

Main_Document_ID Document_ID User assigned Document ID

Main_Document_Date Document_Date Source document create date, otherwise received date, otherwise sent date (in that order)

Main_Document_Time ??? Source document create time, otherwise received time, otherwise sent time (in that order)

Main_Document_Type Document_Type Source file type name

Main_Title_docTitle Title Document Title

Main_Title_DocSubject Description Email/Document subject

Main_Host_Reference Host_Reference Export file title of parent item

0 Estimated

Notes:

use "0" for Estimated (all dates are exact)

There are no time fields in Ringtail

PAGES tab:

ImageMAKER field name Ringtail

----------------------------------

Pages_Page_Start Page_Start Export file title of first page

Pages_Page_End Page_End Export file title of last page

Pages_Image_File_Name ?? Export file name [image] with extension.

.tif Page_Extension

Pages_Num_Pages Total_Number_of_Pages

??? Page_Range

Notes:

choose 'Use Page Range' (not 'Use Image_File_Name') when matching fields.

use ".tif" for Page_Extension.

Missing:

no values for Page_Range. Suggest using Pages_Num_Pages.

PARTIES tab:

ImageMAKER field name Ringtail type: to, from, between, cc, bcc, userDefined

----------------------------------

Parties_People_From_Author Document author

Parties_People_From_LastAuthor Last Document author

Parties_People_From_Sender Email From address

Parties_People_To Email To address

Parties_People_CC Carbon Copy recipient

Parties_People_BCC Blind Carbon Copy recipient

Notes:

assigned to 'people'

one to many

delimiter is the ';' character (semicolon).

no concatenate string

LEVELS tab:

ImageMAKER field name Ringtail

----------------------------------

Levels_Levels Fields Level Fields [1-10] Export file path (image)

EXTRAS tab:

ImageMAKER field name Ringtail (BOOL DATE NUMB PICK TEXT MEMO UTEXT UMEMO)

----------------------------------

Extras_ALTRCPALLOW TEXT(T/F) Alternate Recipient Allowed

Extras_APPLICATION_NAME TEXT Name of creating application

Extras_ATTACHLIST TEXT List of export file titles of attachments

Extras_ATTACHMENTRANGE TEXT Range of export file titles of attachments

Extras_ATTACHMENTSCOUNT NUMB Count of attachments

Extras_ATTACHTITLE TEXT File title of attachment

Extras_AUTOFWD TEXT(T/F) Auto Forwarded

Extras_BATESBEG TEXT Beginning Bates number

Extras_BATESBEGGROUP TEXT Beginning Bates number for group. e.g. an email and its attachments or a zip file and its contents

Extras_BATESEND TEXT Ending Bates number

Extras_BATESENDGROUP TEXT Ending Bates number for group. e.g. an email and its attachments or a zip file and its contents

Extras_BATESGROUPRANGE TEXT Range of Bates Numbers that belong as a group. e.g. an email and its attachments or a zip file and its contents

Extras_BEGATTACH TEXT Export file title of first page of group. e.g. an email and its attachments or a zip file and its contents

Extras_BILLINFO TEXT Billing Information

Extras_BODY MEMO Body of email

Extras_CATEGOR TEXT Categories

Extras_CNVINDEX TEXT Conversation Index

Extras_CNVTOPIC TEXT Conversation Topic

Extras_COMPANIES TEXT Companies

Extras_DACOMMENT TEXT Discovery Assistant PassThru comment

Extras_DEFDLVDATE DATE Deferred Delivery Date

Extras_DEFDLVTIME TEXT(HMS) Deferred Delivery Time

Extras_DELAFTSUB TEXT(T/F) Delete After Submit

Extras_DLVRPTREQ TEXT(T/F) Originator Delivery Report Requested

Extras_DOCTEXT MEMO Document Text

Extras_DUPPATHS TEXT Source document paths of duplicate items

Extras_ENDATTACH TEXT Export file title of last page of group. e.g. an email and its attachments or a zip file and its contents

Extras_EXPIRYDATE DATE Expiry Date

Extras_EXPIRYTIME TEXT(HMS) Expiry Time

Extras_EXPORTDATE DATE Export start date

Extras_EXPORTEDSOURCEFILEPATHNAME TEXT Exported source file path

Extras_EXPORTTIME TEXT(HMS) Export start time

Extras_FILEACCESSDATE DATE Source document Last Access Date

Extras_FILEACCESSTIME TEXT(HMS) Source document Last Access Time

Extras_FILECREATIONDATE DATE Source document creation date

Extras_FILECREATIONTIME TEXT(HMS) Source document creation time

Extras_FILEDISPLAYNAME TEXT Source file title

Extras_FILEEXTENSION TEXT Source file extension

Extras_FILEMODIFYDATE DATE Source document modified date

Extras_FILEMODIFYTIME TEXT(HMS) Source document modified time

Extras_FILENAME TEXT Source file name (including extension)

Extras_FILEPATHNAME TEXT Source file path

Extras_FILEPRINTDATE DATE Source document Last Print Date

Extras_FILEPRINTTIME TEXT(HMS) Source document Last Print Time

Extras_GLOBALCOUNT NUMB Count of occurrences of this item in the Global Project table.

Extras_GLOBALPRIMARY TEXT(T/F) 'Yes' if this is the first occurrence of this item in the global table.

Extras_GROUPRANGE TEXT Range of export file titles that belong as a group. e.g. an email and its attachments or a zip file and its contents

Extras_HASHCODE TEXT MD5 hash code value for source document

Extras_HTMLBODY MEMO HTML Message Body

Extras_IMPORTANCE TEXT Importance

Extras_INETHEADER TEXT Internet Header

Extras_ISDUP TEXT(T/F) True/False is duplicate

Extras_ITEMID TEXT Discovery Assistant file ID

Extras_ITEMINDEX NUMB Item Index

Extras_LASTSAVEDDATE DATE Source document Last Saved date

Extras_LASTSAVEDTIME TEXT(HMS) Source document Last Saved time

Extras_MSGCLASS TEXT Message Class

Extras_MSGID TEXT Email message ID

Extras_MSGMLG TEXT Message Mileage

Extras_NOAGING TEXT(T/F) No Aging

Extras_OBJECTSIZE NUMB Source file size on disk

Extras_OLINTVER TEXT Outlook Internal Version

Extras_OLVER TEXT Outlook Version

Extras_PAGECOUNT NUMB Number of pages in TIFF file

Extras_PARENT TEXT Email parent folder name

Extras_PARENTCREATIONDATE DATE Parent document create date

Extras_PARENTCREATIONTIME TEXT(HMS) Parent document create time

Extras_PARENTMODIFYDATE DATE Parent document modified date

Extras_PARENTMODIFYTIME TEXT(HMS) Parent document modified time

Extras_PARENTRECEIVEDDATE DATE Parent email received date

Extras_PARENTRECEIVEDTIME TEXT(HMS) Parent email received time

Extras_PARENTSENTDATE DATE Parent email sent date

Extras_PARENTSENTTIME TEXT(HMS) Parent email sent time

Extras_RCPREASSPROHIB BOOL Recipient Reassignment Prohibited

Extras_RCVBYNAME TEXT Received By Name

Extras_RCVONBEHALFNAME TEXT Received On Behalf Of Name

Extras_RDRECREQ TEXT(T/F) Read Receipt Requested

Extras_READ TEXT(Y/N) Message read y/n?

Extras_RECEIVEDDATE DATE Email received date

Extras_RECEIVEDTIME TEXT(HMS) Email received time

Extras_REPLRECIPS TEXT Reply Recipients

Extras_REVNUM TEXT Last Document author

Extras_SAVED BOOL Saved

Extras_SENSITIVITY TEXT Sensitivity

Extras_SENT TEXT(T/F) Sent

Extras_SENTDATE DATE Email sent date

Extras_SENTTIME TEXT(HMS) Email sent time

Extras_SHORTFILETITLE TEXT Short file title

Extras_SNTONBEHALFNAME TEXT Sent On Behalf Of Name

Extras_SOURCELABEL TEXT Source volume label

Extras_SOURCEPAGECOUNT TEXT Source document page count

Extras_SRCBOX TEXT Source Box. Obtained from second to last directory name in source file path.

Extras_SRCCUSTOD TEXT Source Custodian. Obtained from third to last directory name in source file path.

Extras_SRCFOLDER TEXT Source Folder. Obtained from last directory name in source file path.

Extras_STOREID TEXT Message store identifier

Extras_STORENAME TEXT Message store source file name

Extras_SUBMITTED TEXT(T/F) Submitted

Extras_UNREAD TEXT(T/F) UnRead

Extras_VOTINGOPT TEXT Voting Options

Extras_VOTINGRESP TEXT Voting Response

Notes:

All fields are one-to-one

dtSearch: Notes on Searching using dtSearch

One problem with using dtSearch is that it cannot search NSF files. A second problem is how to extract the responsive files from a PST while keeping all the metadata and parent/child relationships intact.

Current solution is to:

1. Load files into Discovery Assistant, use the COPY button to write files back out numbered by FileID, then dtSearch the fileset.

2. Use the 'mark' and 'select' buttons

3. Use the 'user field' button to keep track of what search strings were used to find these files.

Alternatively, to search the converted files:

1. Convert all the files

2. Search the 'projectname.cnvt' directory TXT files

3. Use the 'mark', 'select' and 'user field' buttons to track responsive files.

To Download and install dtSearch:

http://www.dtsearch.com/download.html file: dtSearchEval750.exe

cost: $200 to buy, 1 month free evaluation.

Quick guide to converting and searching:

1. SETUP: import files into Discovery Assistant.

2. SEARCH SOURCE: dtSearch the source files.

3. SEARCH TIFF/TEXT: dtSearch the converted project files.

4. EXPORT: load dtSearch selection set, and export msg files that contain search items.

SETUP: Import files into Discovery Assistant

a. Create a Discovery Assistant project, and add in one or more NSF/PST/Folder directories. Contents of imported email and documents are enumerated. Global and Local duplicates are identified at this point.

b. If you want to search source files, you can do so by exporting a 'copy' of each file (using the 'Copy' button) to a separate search directory. Copied files are identified by their fileID.

c. If you want to search converted files, you can do so by queuing the files for conversion, then converting.

d. When converting files, user options should be: skip local duplicates, don't skip children unless parent is skipped.

e. On completion of conversion, remove the NSF and PST records from the Converted tab. You can queue these for re-conversion to get them out of the way. (Note: Don't delete them from the project.) [This restriction will be removed at a later date.]

f. If your files contain images, and you want text from those images, select OCR, and in the dialog, select 'OCR only those items without text'.

Note: requires that 'Microsoft Office Imaging' 2003 or 2007 is installed. We use the Microsoft provided OCR engine to do the text extraction. (Can install this from the Office installation disks - under Tools). Re-save project.

g. Sort on FileID, assign Document IDs (string: %COUNT1%) and save the project.

h. If your files contain spreadsheets, there is a good chance there are blank pages that should be removed. To remove blank pages, select: DeBlank. Re-save project.

SEARCH SOURCE: dtSearch the source files.

a. From the All Files tab, select 'copy' All. Select a destination directory using the browse button. Best to choose somewhere that has a lot of available space.

Copied files are named with their FileID, plus the proper extension.

b. Use dtSearch to search the source files. See comments below (Search Tiff/Text) for how to proceed. Basic idea is to generate a list of files to be queued for conversion, without having to convert all the other files.

SEARCH TIFF/TEXT: dtSearch the converted project files.

a. Use dtSearch to index the project.CNVT directory, *.TXT files only. (Exclude .mtf, .tif, and .log files.)

b. Enter one or more search terms in dtSearch to create individual search results. Enable stemming, phonic searching, and fuzzy search to find similar words. (You can check results using the Browse Words button.)

For individual search terms: save each result as a project_searchterm.CSV.

For all search terms: save 'all strings' search result as project_all.CSV.

Save search results by choosing "File / Save As" - choose CSV format.

Generate a report by choosing "Search / search report".

c. When done, open the project_all.CSV file, select Column E (display name), and copy to clipboard

d. Open Notepad, paste the clipboard into Notepad, then do a search and replace:

Replace [abc] (the first 3 letters) with nothing [].

Replace [F.tif.txt] with nothing [].

Delete the header line and the blank line at the end of the file.

Save as project_all.txt in the project
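Steps c and d can also be scripted instead of done by hand in Notepad. The following is a minimal Python sketch, not part of Discovery Assistant or dtSearch. It assumes that the CSV has one header line, that Column E is the fifth column, and that display names look like abc000123F.tif.txt. The sample rows and the function name are made up for illustration.

```python
import csv
import io

SUFFIX = "F.tif.txt"   # suffix removed in step d
PREFIX_LEN = 3         # "first 3 letters" removed in step d


def display_names_to_file_ids(csv_text):
    """Return the cleaned IDs from Column E of a saved search result."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)                      # delete the header line
    ids = []
    for row in rows:
        if len(row) < 5 or not row[4].strip():
            continue                      # skip blank or short lines
        name = row[4].strip()
        name = name[PREFIX_LEN:]          # drop the first 3 letters
        if name.endswith(SUFFIX):
            name = name[: -len(SUFFIX)]   # drop the F.tif.txt suffix
        ids.append(name)
    return ids


# Hypothetical sample: column layout and names are invented for illustration.
sample = (
    "A,B,C,D,Display name\n"
    "x,x,x,x,abc000123F.tif.txt\n"
    "x,x,x,x,abc000456F.tif.txt\n"
    "\n"
)
file_ids = display_names_to_file_ids(sample)
project_all_txt = "\n".join(file_ids)    # text to save as project_all.txt
```

Check a few lines of the result against the Notepad method before relying on it, since the real column layout of your CSV may differ from the one assumed here.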

Notes on using dtSearch:

dtSearch evaluation copy can be downloaded from: http://www.dtsearch.com/download.html

Stemming: searches grammatical variations of the words in your search request. For example, with stemming enabled a search for apply would also find applies.

Phonic: search finds words that sound similar to words in your request, like Smith and Smythe.

Fuzzy search: sifts through scanning and typographical errors. Fuzziness adjusts from 1 to 10 depending on the degree of misspellings. (Try starting with 3.)
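To get a feel for what a small amount of fuzziness tolerates, here is a rough Python sketch based on plain edit distance (the number of single-character changes between two words). This is only a stand-in for illustration: it is not dtSearch's algorithm, and the max_edits parameter is an assumption that does not correspond to the dtSearch fuzziness scale of 1 to 10.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_matches(term, words, max_edits):
    """Return the words within max_edits edits of term (case-insensitive)."""
    return [w for w in words
            if edit_distance(term.lower(), w.lower()) <= max_edits]


# Typical scanning errors: 'c' read as 'e', 'o' read as '0'.
words = ["contract", "contraet", "c0ntract", "contact", "context"]
hits = fuzzy_matches("contract", words, max_edits=1)
```

With one edit allowed, the scanning errors are found, but so is the unrelated word "contact". That is the trade-off to keep in mind when raising the fuzziness setting.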

Synonym search: tells dtSearch to use a thesaurus to find synonyms of words in your search request.

dtSearch provides three ways to perform synonym searching:

§ Check the User thesaurus box to find synonyms that you have defined in your own thesaurus.

§ Check the WordNet thesaurus box to find synonyms using the WordNet concept network included with dtSearch.

§ Check the WordNet related words box to find related words from the WordNet concept network.

EXPORT: Load dtSearch selection set, and export files:

a. Go back to Discovery Assistant, open the same project, go to the Converted tab, choose 'Select / by FileIDList', and select the project_all.txt file.

b. From the Converted tab, do the following:

Select / Parent of selected items

Select / Children of selected items

Select Mark / Selected

You can choose 'User Fields' to assign a text string to selected items. One use for this feature is to define what search term was used to select the record.

Save project.

This ensures that we are exporting any file that matches a string (has the string in it) PLUS its parent, PLUS any siblings of that file.

At any point from now on, you can choose 'Select / Marked Items' and get back the items to export as a selection set.

At any point, you can also 'sort' on the left hand column (marked) to see what items are marked.

If for whatever reason you have incorrectly marked items and want to start over, choose Select / Marked Items, then Toggle Mark / Selected. This will clear all marked items.
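The effect of the parent / children selection in step b can be pictured with a short Python sketch. This is an illustration only, not Discovery Assistant code. It assumes each selection step adds to the current selection, and the parent_of mapping and FileIDs are hypothetical.

```python
def expand_selection(matched_ids, parent_of):
    """Grow a set of matched FileIDs to include parents and siblings.

    parent_of maps FileID -> parent FileID (None for top-level files).
    """
    selected = set(matched_ids)
    # Select / Parent of selected items
    selected |= {parent_of[f] for f in selected
                 if parent_of.get(f) is not None}
    # Select / Children of selected items (brings in the siblings)
    selected |= {child for child, parent in parent_of.items()
                 if parent in selected}
    return selected


# Hypothetical project: one message with two attachments, one loose file.
parent_of = {
    "100": None,    # parent, e.g. an email message
    "101": "100",   # attachment that matched the search
    "102": "100",   # sibling attachment (no match)
    "200": None,    # unrelated file
}
marked = expand_selection({"101"}, parent_of)
```

Here only file 101 matched, but 100 (its parent) and 102 (its sibling) are marked as well, while 200 is left out. The sketch walks one level up and one level down only.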

c. To export the selected items:

Choose 'Select / Marked Items', sort on Document ID, and then Export / Selected.

Naming convention is "%ProjectID%.%DOCID%.%PAGE%"

Other settings are:

Destination - location that files are going to be exported to.

Format - choose Summation DII Class I

Note: Press Options to choose metadata fields to export.

Directory Structure - flat is recommended

Other files to include - select Text files.
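As an illustration of how the naming convention in step c expands, here is a small Python sketch. It is not the Discovery Assistant implementation. The values (project ID "abc", document ID "000123", page "0001") and any zero padding are assumptions made up for the example.

```python
import re


def expand_name(template, values):
    """Replace %TOKEN% placeholders; unknown tokens are left untouched."""
    def substitute(match):
        token = match.group(1)
        return str(values.get(token, match.group(0)))
    return re.sub(r"%([A-Za-z0-9_]+)%", substitute, template)


# Hypothetical values for a single exported page.
name = expand_name(
    "%ProjectID%.%DOCID%.%PAGE%",
    {"ProjectID": "abc", "DOCID": "000123", "PAGE": "0001"},
)
```

Under these assumptions the page would be exported as abc.000123.0001, plus whatever extension the export format adds.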

Whew! You are now done....

Internal notes: ImageMAKER optimizations.

We will be making code changes to remove the following steps:

Load: item 3: will not have to remove NSF or PST

Search: item 3 and 4: will not have to create a TXT file (will use CSV directly)

Export: item 2: will simplify the selection and marking functionality.

Next step will be to integrate dtSearch engine directly into Discovery Assistant.

Embedded Files: XML, PDF, and OLE linking and Embedded files support:

Discovery Assistant extracts embedded files from OLE containers (DOC, XLS, PPT), XML containers (DOCX, XLSX, PPTX), RTF files, and, in development, PDF files (early January 2008).

Supported Microsoft Office formats include: Office 95, Office 97, Office 2000, Office XP, Office 2003, and Office 2007.

Linked files are noted (in the warnings), but not extracted or enumerated.

Basic extraction logic is as follows:

· determine if the file is an XML or OLE container type, RTF, or PDF.

· do a quick check to see if there are embedded files.

· if there are embedded files, attempt to extract the files from the native document.

· if there is a failure condition, convert the document to Office 2007 format (zipped XML) using the Office 2007 migration tool, and then re-attempt to extract.
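The logic above can be sketched in Python as follows. This is a simplified illustration, not Discovery Assistant's actual code. It assumes the container type can be told from the first bytes of the file, and that a zipped XML document keeps its embedded files under an "embeddings" folder (as in word/embeddings/). The extract and convert arguments are hypothetical callbacks standing in for the native extractor and the migration tool.

```python
import io
import zipfile

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file signature


def container_type(data):
    """Determine the container type from the leading bytes."""
    if data.startswith(OLE_MAGIC):
        return "OLE"
    if data.startswith(b"PK\x03\x04"):
        return "XML"          # zipped XML (DOCX, XLSX, PPTX)
    if data.startswith(b"{\\rtf"):
        return "RTF"
    if data.startswith(b"%PDF"):
        return "PDF"
    return "unknown"


def embedded_parts(data):
    """Quick check on a zipped XML document: list the embedded entries."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [n for n in zf.namelist() if "/embeddings/" in n]


def extract_with_fallback(data, extract, convert):
    """Try native extraction; on failure, convert and re-attempt."""
    try:
        return extract(data)
    except Exception:
        return extract(convert(data))


# Build a tiny stand-in for a DOCX with one embedded object.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<doc/>")
    zf.writestr("word/embeddings/oleObject1.bin", b"\x00")
sample_docx = buf.getvalue()

kind = container_type(sample_docx)
parts = embedded_parts(sample_docx)

# Simulate the failure path: OLE bytes cannot be opened as a zip,
# so the (hypothetical) converter supplies the zipped XML version.
fake_doc = OLE_MAGIC + b"legacy"
retried = extract_with_fallback(fake_doc, embedded_parts,
                                lambda d: sample_docx)
```

The fallback only re-attempts once, matching the single convert-and-retry step described in the list.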

Discovery Assistant uses two tools provided by Microsoft to help with extraction:

Microsoft Office 2007 Compatibility Pack: http://www.microsoft.com/downloads/details.aspx?FamilyId=941b3470-3ae9-4aee-8f43-c6bb74cd1466&displaylang=en

Microsoft Office 2007 Migration Tool: http://www.microsoft.com/downloads/details.aspx?familyid=13580cd7-a8bc-40ef-8281-dd2c325a5a81&displaylang=en

These tools must be installed in order for everything to work correctly. The Options / Embedded tab contains links to both of these tools.

When downloading and installing the MigrationPlanningManager.exe tool, you need to specify an installation directory. After installation, specify that same directory under Options / Embedded tab / Settings.

Other notes:

In the Options / Embedded / Settings tab, you can also specify the prefix used for all extracted files. The current defaults are EMB_1, EMB_2, and EMB_3 (representing different types of embedded files). After loading your file set into Discovery Assistant, if you sort on name, you should be able to group all the extracted embedded files.

You can conditionally turn file handling off for certain file types by selecting the file type from the Settings dialog, then hit the modify button.

Speed / Size of files.

For optimum speed and size, best to convert everything to B&W G4 TIFF.

When exporting to different file types, here are some of the speed/size metrics.

46,462 pages, 4592 tiff files, exported as:

TIFF (G4)	1.1 GB	20 minutes (doesn’t require reading/writing the files)
Scanned PDF	1.5 GB	2 hours (uses 8-bit Flate compression)
24 bit LZW	3.0 GB	4.5 hours