Appendix A: Export Format Notes
Token | Description
%ProjectID% | Three letter project ID
%FileID% | Internal file ID
%TITLE% | Original file name (includes parent if zip/msg/eml)
%SHORT_TITLE% | Guaranteed 32 char unique name
%EXT% | Original file extension
%BATESSTART% | Starting bates sequence for file
%BATESEND% | Ending bates sequence for file
%PAGE% | Page number
%BATES% | Bates number for page
%DOCID% | User assigned document ID
ASCII_STRING | Any ASCII string
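As a rough illustration of how such tokens could be expanded into an export name, here is a minimal Python sketch. The function name `expand_tokens`, the template string, and the sample values are hypothetical and are not taken from the product; the sketch only assumes that a token is a name wrapped in '%' characters and that any other text is passed through unchanged.

```python
import re


def expand_tokens(template: str, values: dict) -> str:
    """Replace each %NAME% token in the template with values[NAME].

    Text outside the tokens (the ASCII_STRING parts) is copied as-is.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for token %{name}%")
        return values[name]

    return re.sub(r"%([A-Za-z0-9_]+)%", substitute, template)


# Hypothetical naming template and values, for illustration only.
name = expand_tokens(
    "%ProjectID%%DOCID%_%PAGE%",
    {"ProjectID": "TST", "DOCID": "00002", "PAGE": "1"},
)
print(name)  # TST00002_1
```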
Export Directory Structure Options:
Given the following set of source files:
Source Files | Assigned Name
custodian1\list.doc | file1
custodian2\Folder1\sample.pdf | file2
custodian2\Folder2\sales.xls | file3
custodian3\Box1\Folder3\january.doc | file4
custodian3\Box1\Folder3\february.doc | file5
custodian3\Box1\Folder3\march.doc | file6
each directory structure option produces the following export layout:
Flat:
---------
|---Output
| File1.tif
| File2.tif
| File3.tif
| File4.tif
| File5.tif
| File6.tif
|
| File1.txt
| File2.txt
| File3.txt
| File4.txt
| File5.txt
| File6.txt
|---Source
File1.doc
File2.pdf
File3.xls
File4.doc
File5.doc
File6.doc
Mirror:
---------
|---Custodian1
| |---source
| | File1.doc
| File1.tif
| File1.txt
|
|---Custodian2
| |---Folder1
| | |---source
| | | File2.pdf
| | file2.tif
| | file2.txt
| |
| |---Folder2
| |---source
| | File3.xls
| file3.tif
| file3.txt
|
|---Custodian3
|---Box1
|---Folder3
|---source
| File4.doc
| File5.doc
| File6.doc
File4.tif
File5.tif
File6.tif
File4.txt
File5.txt
File6.txt
Bates:
----OUTPUT
| ----Bates_file1
| | File1.tif
| ----Bates_file2
| | File2.tif
| ----Bates_file3
| | File3.tif
| ----Bates_file4
| | File4.tif
| ----Bates_file5
| | File5.tif
| ----Bates_file6
| File6.tif
----SOURCE
| ----Bates_file1
| | list.doc
| ----Bates_file2
| | sample.pdf
| ----Bates_file3
| | sales.xls
| ----Bates_file4
| | january.doc
| ----Bates_file5
| | february.doc
| ----Bates_file6
| | march.doc
----TEXT
| ----Bates_file1
| | File1.txt
| ----Bates_file2
| | File2.txt
| ----Bates_file3
| | File3.txt
| ----Bates_file4
| | File4.txt
| ----Bates_file5
| | File5.txt
| ----Bates_file6
File6.txt
Vol/Box:
----VOL0001
----BOX0001
| |----source
| | File1.doc
| | File2.pdf
| File1.tif
| File2.tif
| File1.txt
| File2.txt
|
|---BOX0002
| |----source
| | File3.xls
| | File4.doc
| File3.tif
| File4.tif
| File3.txt
| File4.txt
|
|---BOX0003
| |----source
| | File5.doc
| | File6.doc
| File5.tif
| File6.tif
| File5.txt
| File6.txt
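The Flat and Mirror layouts above can be sketched in Python as a mapping from a source path and assigned name to export paths. This is an illustration inferred from the example trees only; the function name `export_paths` and the dictionary keys are hypothetical, and the Bates and Vol/Box layouts are not covered because their grouping rules are not spelled out here.

```python
from pathlib import PureWindowsPath


def export_paths(source: str, assigned: str, layout: str) -> dict:
    """Return image, text and source export paths for one file.

    layout is "flat" or "mirror", following the example trees above.
    """
    src = PureWindowsPath(source)
    if layout == "flat":
        # Everything goes into two fixed directories.
        image_dir = PureWindowsPath("Output")
        source_dir = PureWindowsPath("Source")
    elif layout == "mirror":
        # Keep the original folder; sources go into a 'source' subfolder.
        image_dir = src.parent
        source_dir = src.parent / "source"
    else:
        raise ValueError(f"unsupported layout: {layout}")
    return {
        "image": str(image_dir / (assigned + ".tif")),
        "text": str(image_dir / (assigned + ".txt")),
        "source": str(source_dir / (assigned + src.suffix)),
    }


for layout in ("flat", "mirror"):
    print(layout, export_paths(r"custodian2\Folder1\sample.pdf", "file2", layout))
```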
Classifications of DII Files
Summation created a batch load file format and protocol that service bureaus can use to facilitate the processing and delivery of eDiscovery that will be loaded into a Summation case. Service bureaus can provide eDiscovery using three different types of DII files:
* Class I DII file - This class is geared toward traditional paper discovery service bureaus that scan paper documents and use Optical Character Recognition (OCR) technology on the resulting imaged documents. Also, in this model, e-mail messages and electronic documents (received in either paper or native, electronic format) are converted or petrified by a service bureau to TIFF or PDF image formats, and the text and metadata are extracted. When loaded into a Summation case, the image information is loaded into the ImgInfo table, the full-text is loaded into the ocrBase, and generated metadata is loaded into the Core Database. The difference between a Class I DII file and a DII file prepared for previous versions of Summation is the ability of the Class I DII file to more easily maintain the parent/child relationships of compound documents.
* Class II DII file - This file is geared toward forensic-oriented service bureaus that extract or parse metadata and e-mail message information for loading into designated Summation Core Database fields. Native electronic files are copied to the eDocs repository specified in the case directory structure. Once the files are copied and the data loaded, the user can take advantage of Summation's multi-file format index, search, and retrieval functions to produce electronic documents in their native formats. These Class II DII file attributes will allow users to narrow or winnow down a collection of electronic data, such as e-mail messages, to only disclose relevant non-privileged data to the requesting party. The Class II DII file also facilitates the preservation of the parent/child relationships of compound documents.
* Class III DII file - This file is a combination of the Classes I and II DII file formats.
The above DII load file classes give Summation users the ultimate flexibility for applying the varying formats and protocols used to acquire, process, deliver, and deploy digital information underlying litigation, regulatory compliance, and risk management.
Note: The above DII load file formats are also acceptable formats to deliver electronic data that will be loaded into CaseVault, the litigation hosting service and subsidiary of Summation Legal Technologies. CaseVault can be used as a winnowing platform for cases that include large volumes of electronic data. Once the set is culled and reduced, the electronic data can be loaded into a Summation system for additional review and case preparation.
Note:
Tokens can be longer than 8 characters, but field names cannot. For example, the @ATTACHRANGE token is 11 characters long, but it populates the ATTRANGE field, whose name is only 8 characters. Custom tokens must be 8 characters or fewer because the fields they populate are limited to 8 characters.
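A quick way to catch over-long custom field names before building a load file is a length check such as the Python sketch below. The helper name `too_long` is hypothetical, and the sketch assumes the limit is the 8-character field width (names such as FILENAME and DUPPATHS are exactly 8 characters long).

```python
MAX_FIELD_NAME = 8  # assumed limit: the 8-character field width


def too_long(field_names):
    """Return the field names that exceed the 8-character limit."""
    return [name for name in field_names if len(name) > MAX_FIELD_NAME]


print(too_long(["FILENAME", "DUPPATHS", "ATTACHRANGE"]))  # ['ATTACHRANGE']
```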
ImageMAKER custom defined additional fields in the Summation Export DII file:
@C FILENAME calendar.zip
@C FILEPATH Z:\Web_test_files\calendar.zip
@C ISDUP True
@C DUPPATHS C:\test\test.HTM; C:\test\testcopy.htm
@C PGCOUNT 10
Details:
FILENAME - name of the file at the time of conversion.
FILEPATH - original source path of the file at the time of conversion.
PGCOUNT - number of pages in the converted file.
The default is 1 if the record is not defined in the data set; if it is not defined in a FileID record, it defaults to the last value defined.
If files are exported as a single page per file, this value indicates the total number of exported pages for the source file.
PgCount is already defined as a custom data field in the Summation database.
ISDUP - indicates whether the record has any duplicates in the exported data set.
This information is used when reviewing the data, and indicates that there are other copies of the same information elsewhere in the data set. (Field name lengths are limited to 8 characters.)
Supported values are 'True' and 'False'.
DUPPATHS - lists the FILEPATH values of the source files in the duplicate set.
This value lists the source file names of the duplicate files, not their DocIDs, and gives an immediate indication of where the duplicate data is stored. File paths are separated by a '; ' character pair (semicolon, space).
If there are no duplicates, the character string 'NA' is required.
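The rules above can be sketched as a small Python helper that emits the five custom lines for one record. The function name `custom_field_lines` and its parameters are hypothetical; the sketch only assumes the '@C NAME value' line shape shown in the examples, the 'True'/'False' values for ISDUP, and the '; ' separator with an 'NA' fallback for DUPPATHS.

```python
def custom_field_lines(filename, filepath, page_count, duplicate_paths):
    """Build the ImageMAKER custom @C lines for one exported record."""
    is_dup = bool(duplicate_paths)
    # 'NA' is required when there are no duplicates.
    dup_value = "; ".join(duplicate_paths) if is_dup else "NA"
    return [
        f"@C FILENAME {filename}",
        f"@C FILEPATH {filepath}",
        f"@C PGCOUNT {page_count}",
        f"@C ISDUP {'True' if is_dup else 'False'}",
        f"@C DUPPATHS {dup_value}",
    ]


for line in custom_field_lines(
    "calendar.zip", r"Z:\Web_test_files\calendar.zip", 1, []
):
    print(line)
```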
Sample DII File:
; Summation DII Class I File
; Created on 7/20/2005 2:55:29 PM
; Created by DiscoveryAssistant version 3.2 build 1095
; Copyright © 2004,2005 ImageMaker Development Inc.
;
; Machine Name: BLAISE
; Project Path: F:\Work\TEST.xml
; Project Name: TEST
; Project ID: TM
@FULLTEXT DOC
@T 0000038
@DOCID 0000038
@MEDIA eDoc
@APPLICATION WinZip File
@C FILENAME calendar.zip
@C FILEPATH Z:\Web_test_files\calendar.zip
@C PGCOUNT 1
@C ISDUP False
@C DUPPATHS NA
@ATTACH 0000039; 0000040; 0000041; 0000043; 0000044; 0000045; 0000046; 0000047; 0000048; 0000049; 0000050; 0000051; 0000052; 0000053
@ATTACHCOUNT 14
@DATESAVED 7/21/2005
@DATECREATED 7/21/2005
@D @I\
0000038.tif
@T 0000039
@DOCID 0000039
@MEDIA eMail
@MSGID
@C PGCOUNT 1
@C ISDUP True
@C DUPPATHS Z:\Web_test_files\calendar.zip\calendar.pst\Personal Folders\Tasks\a second task request.msg;C:\imgmaker\temp1\a second task request.msg
@SUBJECT a second task request
@EMAIL-BODY separate task item in a separate task list.
@EMAIL-END
@ATTACHCOUNT 0
@PARENTID 0000038
@D @I\
0000039.tif
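For checking an exported file, a sample like the one above can be read back with a few lines of Python. This is a minimal sketch based only on the sample shown here, not on the Summation specification: it assumes that ';' starts a comment, that '@T' starts a new record, that '@C NAME value' carries a custom field, that '@EMAIL-BODY' runs until '@EMAIL-END', and that any non-token line inside a record is an image file name. The function name `parse_dii` and the dictionary layout are hypothetical.

```python
def parse_dii(text):
    """Split DII text into a list of record dictionaries."""
    records = []
    current = None
    in_body = False
    for raw in text.splitlines():
        line = raw.rstrip()
        if in_body:
            # Collect body lines until the closing token.
            if line.startswith("@EMAIL-END"):
                in_body = False
            else:
                current["EMAIL-BODY"] += "\n" + line
            continue
        if not line or line.startswith(";"):
            continue  # blank line or comment
        if line.startswith("@T "):
            current = {"T": line[3:].strip(), "custom": {}, "images": []}
            records.append(current)
            continue
        if current is None:
            continue  # file-level tokens such as @FULLTEXT come before any record
        if line.startswith("@C "):
            name, _, value = line[3:].partition(" ")
            current["custom"][name] = value.strip()
        elif line.startswith("@EMAIL-BODY"):
            current["EMAIL-BODY"] = line[len("@EMAIL-BODY"):].strip()
            in_body = True
        elif line.startswith("@"):
            token, _, value = line[1:].partition(" ")
            current[token] = value.strip()
        else:
            current["images"].append(line)
    return records
```

Each record then exposes, for example, `record["custom"]["ISDUP"]` and `record["images"]`, which is enough to cross-check page counts or duplicate flags against the project.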
@C BEGDOC: Export file title of first page
@C ENDDOC: Export file title of last page
@APPLICATION: Name of creating application
@C ATTCOUNT: Count of attachments
@ATTACH: List of export file titles of attachments
@ATTACHRANGE: Range of export file titles of attachments
@C GROUPRANGE: Range of export file titles that belong as a group. e.g. an email and its attachments or a zip file and its contents
@C BATESGROUPRANGE: Range of Bates Numbers that belong as a group. e.g. an email and its attachments or a zip file and its contents
@C BEGATTACH: Export file title of first page of group. e.g. an email and its attachments or a zip file and its contents
@C ENDATTACH: Export file title of last page of group. e.g. an email and its attachments or a zip file and its contents
@C ATTTITLE: File title of attachment
@FROM: Document author
@BATESBEG: Beginning Bates number
@BATESEND: Ending Bates number
@C BATESGBEG: Beginning Bates number for group. e.g. an email and its attachments or a zip file and its contents
@C BATESGEND: Ending Bates number for group. e.g. an email and its attachments or a zip file and its contents
@BCC: Blind Carbon Copy recipient
@CC: Carbon Copy recipient
@C DACOMMNT: Discovery Assistant PassThru comment
@DATECREATED: Source document creation date
@TIMECREATED: Source document creation time
@DATERCVD: Email received date
@TIMERCVD: Email received time
@DATESAVED: Source document modified date
@TIMESAVED: Source document modified time
@DATESENT: Email sent date
@TIMESENT: Email sent time
@C DATEACC: Source Document Last Access Date
@C TIMEACC: Source Document Last Access Time
@C DOCTITLE: Document Title
@C DUPPATHS: Source document paths of duplicate items
@EMAIL-BODY: Body of email
@C FILEEXT: Source file extension
@C FILEPATH: Source file path
@C XSFPATH: Exported source file path
@C FTITLE: Source file title
@C FILENAME: Source file name (including extension)
@C FTYPENAME: Source file type name
@FOLDERNAME: Email parent folder name
@FROM: Email From address
@C HASHCODE: MD5 hash code value for source document
@C ISDUP: True/False is duplicate
@C ITEMID: Discovery Assistant file ID
@MSGID: Email message ID
@C PGCOUNT: Output file page count
@PARENTID: Export file title of parent item
@C SFTITLE: Short file title
@C SIZEDISK: Source file size on disk
@STOREID: Message store identifier
@C STORNAME: Message store source file name
@SUBJECT: Email subject
@TO: Email To address
@C ITEMINDX: Item Index
@C INETHDR: Internet Header
@C DOCID: Document ID
@C ALTRCALW: Alternate Recipient Allowed
@C AUTOFWD: Auto Forwarded
@C BILLINFO: Billing Information
@C CATEGOR: Categories
@C COMPNIES: Companies
@C DATEDFDL: Deferred Delivery Date
@C TIMEDFDL: Deferred Delivery Time
@C DELAFSUB: Delete After Submit
@C DATEEXP: Expiry Date
@C TIMEEXP: Expiry Time
@MULTILINE HTMLBODY: HTML Message Body
@C IMPRTNCE: Importance
@C MSGCLASS: Message Class
@C MSGMLG: Message Mileage
@C NOAGING: No Aging
@C DLVRPTRQ: Originator Delivery Report Requested
@C OLINTVER: Outlook Internal Version
@C OLVER: Outlook Version
@C RDRECREQ: Read Receipt Requested
@C RCVBYNAM: Received By Name
@C RCVBENAM: Received On Behalf Of Name
@C RCPREPRO: Recipient Reassignment Prohibited
@MULTILINE REPRECIP: Reply Recipients
@C SAVED: Saved
@C SENSI: Sensitivity
@C SENT: Sent
@C SNTBENAM: Sent On Behalf Of Name
@C SUBMTTED: Submitted
@READ: Message read y/n?
@C UNREAD: UnRead
@C VOTOPT: Voting Options
@C VOTRESP: Voting Response
@C GLBLPRM: 'Yes' if this is the first occurrence of this item in the global table.
@C GLBLCNT: Count of occurrences of this item in the Global Project table.
@C SRCCUSTOD: Source Custodian. Obtained from third to last directory name in source file path.
@C SRCBOX: Source Box. Obtained from second to last directory name in source file path.
@C SRCFOLDER: Source Folder. Obtained from last directory name in source file path.
@C DATEPRNT: Source Document Last Print Date
@C TIMEPRNT: Source Document Last Print Time
Source documents are converted into single-page TIFF files, single-page TXT files, and a metadata file.
The metadata and the single-page TXT file are then combined to create a single DAT file per page for import. Each data file is assigned a unique ID (Bates number).
Concordance imports all the DAT files from a given directory into the database.
The image files are listed in the .LOG file. There is a unique TIFF file for each DAT file created. The image files are all imported at the same time through the Opticon viewer interface.
Create the following files:
1. Multi-line .DAT files containing information for each page of each file.
2. A multi-line .LOG file containing a list of TIFF images (Opticon load images) that are associated with each defined page.
The .DAT file contains file metadata, with the exported text as the last field.
Export fields for the data are defined in the 'export fields' section (below).
Sample data are also provided in the 'sample data' section (below).
The .DAT file contains a single delimited list of fields.
However, rather than using the common comma-and-quote notation
"field1","field2","field3"
fields are delimited by substituting decimal 20 for ',' and decimal 254 for '"'.
Decimal 20 and decimal 254 are explicitly defined to NOT occur in any imported text.
Newline values in the imported text are replaced with decimal 174.
The sample data:
to:Ken Davies
from:Sales
Subject:The year ahead
Text: A long discussion about the year ahead.
Looking forward to your comments.
Call me if you want to do lunch.
becomes:
(254)Ken Davies(254)(20)(254)Sales(254)(20)(254)The year ahead(254)(20)(254)A long discussion about the year ahead.(174) Looking forward to your comments.(174) Call me if you want to do lunch.(174)(254)
where the values in brackets (254) (20) (174) are decimal byte values in the data stream.
The data fields in this example are pre-defined to be "to","from","subject","text".
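The delimiter substitution can be sketched in Python as follows. This is an illustration rather than the exporter's actual code: the function name `encode_record` is hypothetical, Latin-1 is assumed as the byte encoding, and decimal 254 is used for the quote character as stated in the delimiter description. Values that already contain one of the three reserved bytes are rejected, since those bytes must not occur in imported text.

```python
FIELD_SEP = bytes([20])   # replaces ','
QUOTE = bytes([254])      # replaces '"'
NEWLINE = bytes([174])    # replaces line breaks inside a field


def encode_record(values):
    """Encode a list of field values as one Concordance-style DAT record."""
    encoded = []
    for value in values:
        text = value.replace("\r\n", "\n").replace("\r", "\n")
        raw = text.encode("latin-1")  # assumed single-byte encoding
        if FIELD_SEP in raw or QUOTE in raw or NEWLINE in raw:
            raise ValueError("field contains a reserved delimiter byte")
        encoded.append(QUOTE + raw.replace(b"\n", NEWLINE) + QUOTE)
    return FIELD_SEP.join(encoded)


record = encode_record([
    "Ken Davies",
    "Sales",
    "The year ahead",
    "A long discussion about the year ahead.\nCall me if you want to do lunch.\n",
])
print(list(record[:12]))
```

Rejecting reserved bytes, rather than silently stripping them, is a design choice of this sketch; a real export may prefer to substitute or drop them.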
.DAT file fields:
Field Name | Sample Data | Populated
STARTPAGE | 00010002 | YES
ENDPAGE | 00010002 | YES
DATE | 20041219 | YES [Date Accessed/Sent Date]
DOCTYPE | Doc extension | YES [SourceFile Ext]
TITLE | Untitled | YES [Title from MetaData]
AUTHOR | Simmons;RC / McMurrian;HP | YES [Author/From: from MetaData]
AUTHORORG | Cole Evans and Peterson | NO
RECIPIENT | McCorman;SL | YES [To: from MetaData]
RECIPORG | Cowco | NO
CC | | YES [Cc: from MetaData]
SUMMARY | | NO
CONDITION | | NO
ATTACH_TYPE | | NO
LEAD_DOC | | NO
ATTACHMENTS | | NO
PRIMARYDATE | 19831220 | YES [Date Created]
PAGES | 3 | YES
CCORG | | NO
ATT | | NO
ATTORG | | NO
OCR1 | *** 0010002 **** . contents of page | NO
OCR2 | | NO
OCR3 | | NO
OCR4 | | NO
OCR5 | | NO
RENUMBER | 161 | NO
ISSUE | | NO
DISC_STATUS | | NO
SOURCE_FILE_NAME | C:\fname.doc | YES
SOURCE_FILE_SIZE | 104456 | YES
Hyperlinked Source documents:
XSPATHNAME .\SOURCE\TST00002.msg
TIFF file destination:
XIPATHNAME OUTPUT
XIFILENAME TST00002.tif
00010001,Data,E:\DATABASE\COWCO\001\00010001.TIF,Y,,,
00010002,Data,E:\DATABASE\COWCO\001\00010002.TIF,,,,
00010003,Data,E:\DATABASE\COWCO\001\00010003.TIF,,,,
00010004,Data,E:\DATABASE\COWCO\001\00010004.TIF,Y,,,
00010005,Data,E:\DATABASE\COWCO\001\00010005.TIF,,,,
00010006,Data,E:\DATABASE\COWCO\001\00010006.TIF,,,,
00010007,Data,E:\DATABASE\COWCO\001\00010007.TIF,Y,,,
00010008,Data,E:\DATABASE\COWCO\001\00010008.TIF,Y,,,
00010009,Data,E:\DATABASE\COWCO\001\00010009.TIF,Y,,,
00010010,Data,E:\DATABASE\COWCO\001\00010010.TIF,,,,
00010011,Data,E:\DATABASE\COWCO\001\00010011.TIF,,,,
Field 1: "Production Number" -- This is a text field which contains the "Production" or "Control" or Bates number for that page of the document. It is a unique value and is the load file "key".
Field 2: "Volume ID" -- This is also a text field. It should contain the Volume ID of the CD on which the images are delivered.
Field 3: "Full DOS Path" -- This is a text field containing the full DOS path to the image file.
Field 4: "Document Break" -- This is a text field. If this particular image is the first page of a document, this field should contain a "Y" (Yes).
Field 5: "Folder Break" -- This is a text field. It is rarely used, but when used it works just like Document Break, i.e. it contains a "Y" if this is the first page of a new folder.
Field 6: "Box Break" -- This is a text field. It is also rarely used, but works like Document Break and Folder Break, i.e. it contains a "Y" if this is the first page of a new box.
Field 7: "Pages" -- This is a text field although it contains numeric data. If this is the first page of a new document, "Document Break" will contain a "Y" and this field will show the number of pages for the document. (This field is a "nice to have" as after the images are loaded, Opticon will calculate the number of pages based on the database.)
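The seven fields described above can be read with Python's standard csv module. The sketch below is illustrative only; the names `OptRow`, `parse_opt`, and `count_documents` are hypothetical, and short lines are padded so that missing trailing fields are treated as empty.

```python
import csv
from typing import NamedTuple


class OptRow(NamedTuple):
    production_number: str
    volume_id: str
    image_path: str
    document_break: bool
    folder_break: bool
    box_break: bool
    pages: str


def parse_opt(text):
    """Parse load-file text into a list of OptRow tuples."""
    rows = []
    for fields in csv.reader(text.splitlines()):
        if not fields:
            continue  # skip blank lines
        f = (fields + [""] * 7)[:7]  # pad short lines to seven fields
        rows.append(OptRow(f[0], f[1], f[2], f[3] == "Y", f[4] == "Y", f[5] == "Y", f[6]))
    return rows


def count_documents(rows):
    """A document starts at every row whose Document Break field is 'Y'."""
    return sum(1 for row in rows if row.document_break)


sample = (
    "00010001,Data,E:\\DATABASE\\COWCO\\001\\00010001.TIF,Y,,,\n"
    "00010002,Data,E:\\DATABASE\\COWCO\\001\\00010002.TIF,,,,\n"
    "00010003,Data,E:\\DATABASE\\COWCO\\001\\00010003.TIF,,,,\n"
    "00010004,Data,E:\\DATABASE\\COWCO\\001\\00010004.TIF,Y,,,\n"
)
print(count_documents(parse_opt(sample)))  # 2
```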
00010001.dat
00010001.tif
00010002.dat
00010002.tif
00010003.dat
00010003.tif
00010004.dat
00010004.tif
00010005.dat
00010005.tif
00010006.dat
00010006.tif
00010007.dat
00010007.tif
00010008.dat
00010008.tif
00010009.dat
00010009.tif
00010010.dat
00010010.tif
00010011.dat
00010011.tif
images.opt
(source: http://www.ediscovery.org/litigation-support/technical-standards_4_02_IPRO.htm)
To convert from Opticon format, download the free iConvert utility from http://www.IproCorp.com.
Example 1: Single Page .TIF files
IM,MSC00014,D,0,@MSC001;IMAGES\00\00;MSC00014.TIF;2
IM,MSC00015,,0,@MSC001;IMAGES\00\00;MSC00015.TIF;2
IM,MSC00016,D,0,@MSC001;IMAGES\00\00;MSC00016.TIF;2
IM,MSC00017,,0,@MSC001;IMAGES\00\00;MSC00017.TIF;2
Example 2: Multi Page .TIF file
IM,MSC00014,D,1,@MSC001;IMAGES\00\00;MSC00014.TIF;2
IM,MSC00015,,2,@MSC001;IMAGES\00\00;MSC00014.TIF;2
IM,MSC00016,D,1,@MSC001;IMAGES\00\00;MSC00016.TIF;2
Note: Because the files are multi-page, the entire Bates range (or image key range) must point to the same .TIF file. For example, MSC00014.TIF contains both page "14" and page "15". Therefore, to view page 15, the computer must display MSC00014.TIF.
The following provides a breakdown of the fields:
IM
Import code identifier (Importing New Page/Image database record)
MSC00014
The image key/document id number
D
Document designation; only designate the first page of each document.
0
Offset to the TIFF file. Always 0 for single-page TIFF files. When creating multi-page TIFF files, this number increments for the pages within the file. (For an 11-page document, the offset would start at 1 and end at 11, and the next TIFF file would start over at 1.)
@MDEMO
CD volume name
IMAGES\00\00
Directory path on the CD for the image
MSC00014.TIF
Filename for the image.
;2
Tells IPRO the type* of image file, e.g. TIFF or PDF
*Supported Image Types and their specification in the LFP file are:
1. Type 1 is for IPRO Tech image from DOS-Based version, still supported (.IMG)
2. Type 2 is for Standard single and multiple page black & white or color TIFF (.TIF)
3. Type 3 is for IPRO Tech stacked TIFF (.STF)
4. Type 4 is for Color image (.BMP, .PCX, .JPEG or .PNG)
5. Type 5 is for black & white .PDF
6. Type 6 is for Color .PDF
7. Type 7 is to Auto-detect the .PDF type, e.g. Color or Black & White
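The field breakdown above can be sketched as a small Python parser for one IM line. The function name `parse_lfp_line` and the dictionary keys are hypothetical, and the sketch assumes the comma and semicolon layout shown in the examples; the type labels are shortened from the list above.

```python
IMAGE_TYPES = {
    1: "IPRO Tech image (.IMG)",
    2: "TIFF (.TIF)",
    3: "IPRO Tech stacked TIFF (.STF)",
    4: "Color image (.BMP, .PCX, .JPEG or .PNG)",
    5: "Black & white PDF",
    6: "Color PDF",
    7: "Auto-detect PDF type",
}


def parse_lfp_line(line):
    """Split one 'IM' load-file line into named fields."""
    code, key, doc_flag, offset, location = line.strip().split(",", 4)
    if code != "IM":
        raise ValueError(f"not an IM record: {code}")
    volume, directory, filename, image_type = location.split(";")
    return {
        "key": key,
        "first_page": doc_flag == "D",  # 'D' marks the first page of a document
        "offset": int(offset),          # 0 for single-page TIFF files
        "volume": volume.lstrip("@"),
        "directory": directory,
        "filename": filename,
        "image_type": int(image_type),
    }


row = parse_lfp_line(r"IM,MSC00015,,2,@MSC001;IMAGES\00\00;MSC00014.TIF;2")
print(row["key"], row["filename"], row["offset"], IMAGE_TYPES[row["image_type"]])
```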
Exporting to RingTail:
1. Export to Ringtail from Discovery Assistant.
2. Load the CSV file into the Ringtail Flat File converter to convert to MDB, then run it through the Validator.
Reference Docs: (these seem to overlap)
CaseBook_Data_Standards_Manual_v602r5.pdf
Ringtail Legal Data Standards Manual v2[1].1.2.pdf
Tools Provided by FTI Ringtail
1. Data Standards Manual: outlines the Ringtail load file
2. Flat File Converter: a tool used to convert a flat-file database to a Ringtail load file; and
3. Validator: a tool used to verify the integrity of a Ringtail load file.
NO load file should be loaded to Ringtail without first being run through this free tool.
To access these free tools, browse to our support website http://support.ftiringtail.com . From there, click the button to LOGIN AS GUEST, then access the Downloads tab.
Ringtail Flat file converter Notes:
The validator does not understand Office 2007. You need to run on an Office 2003 machine. Time fields are not supported in Ringtail. Any time fields should be set to TEXT. Boolean fields are TEXT. We don't currently convert to T/F.
MAIN tab:
ImageMAKER field name Ringtail
----------------------------------
Main_Document_ID Document_ID User assigned Document ID
Main_Document_Date Document_Date Source document create date, otherwise received date, otherwise sent date (in that order)
Main_Document_Time ??? Source document create time, otherwise received time, otherwise sent time (in that order)
Main_Document_Type Document_Type Source file type name
Main_Title_docTitle Title Document Title
Main_Title_DocSubject Description Email/Document subject
Main_Host_Reference Host_Reference Export file title of parent item
0 Estimated
Notes:
use "0" for Estimated (all dates are exact)
There are no time fields in Ringtail
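The date fallback order described for Main_Document_Date (create date, otherwise received date, otherwise sent date) and the fixed "0" for Estimated can be sketched in Python as below. The function names and the shape of the returned row are hypothetical and serve only to illustrate the ordering.

```python
def main_document_date(created=None, received=None, sent=None):
    """Return the first available date: created, then received, then sent."""
    for candidate in (created, received, sent):
        if candidate:
            return candidate
    return None


def main_row(document_id, created=None, received=None, sent=None):
    """Build a minimal MAIN-tab row; Estimated is always '0' (dates are exact)."""
    return {
        "Document_ID": document_id,
        "Document_Date": main_document_date(created, received, sent),
        "Estimated": "0",
    }


print(main_row("TST00002", received="7/21/2005", sent="7/20/2005"))
```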
PAGES tab:
ImageMAKER field name Ringtail
----------------------------------
Pages_Page_Start Page_Start Export file title of first page
Pages_Page_End Page_End Export file title of last page
Pages_Image_File_Name ?? Export file name [image] with extension.
.tif Page_Extension
Pages_Num_Pages Total_Number_of_Pages
??? Page_Range
Notes:
choose 'Use Page Range' (not 'Use Image_File_Name') when matching fields.
use ".tif" for Page_Extension.
Missing:
no values for Page_Range. Suggest using Pages_Num_Pages.
PARTIES tab:
ImageMAKER field name Ringtail type: to, from, between, cc, bcc, userDefined
----------------------------------
Parties_People_From_Author Document author
Parties_People_From_LastAuthor Last Document author
Parties_People_From_Sender Email From address
Parties_People_To Email To address
Parties_People_CC Carbon Copy recipient
Parties_People_BCC Blind Carbon Copy recipient
Notes:
assigned to 'people'
one to many
delimiter is the ';' character (semicolon).
no concatenate string
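Because the party fields are one-to-many with ';' as the delimiter, a single exported value can expand to several people. A minimal Python sketch of that split follows; the function name `split_people` and the sample addresses are hypothetical.

```python
def split_people(value):
    """Split a ';'-delimited party field into individual, trimmed entries."""
    return [part.strip() for part in value.split(";") if part.strip()]


print(split_people("sales@example.com; ken@example.com"))
```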
LEVELS tab:
ImageMAKER field name Ringtail
----------------------------------
Levels_Levels Fields Level Fields [1-10] Export file path (image)
EXTRAS tab:
ImageMAKER field name Ringtail (BOOL DATE NUMB PICK TEXT MEMO UTEXT UMEMO)
----------------------------------
Extras_ALTRCPALLOW TEXT(T/F) Alternate Recipient Allowed
Extras_APPLICATION_NAME TEXT Name of creating application
Extras_ATTACHLIST TEXT List of export file titles of attachments
Extras_ATTACHMENTRANGE TEXT Range of export file titles of attachments
Extras_ATTACHMENTSCOUNT NUMB Count of attachments
Extras_ATTACHTITLE TEXT File title of attachment
Extras_AUTOFWD TEXT(T/F) Auto Forwarded
Extras_BATESBEG TEXT Beginning Bates number
Extras_BATESBEGGROUP TEXT Beginning Bates number for group. e.g. an email and its attachments or a zip file and its contents
Extras_BATESEND TEXT Ending Bates number
Extras_BATESENDGROUP TEXT Ending Bates number for group. e.g. an email and its attachments or a zip file and its contents
Extras_BATESGROUPRANGE TEXT Range of Bates Numbers that belong as a group. e.g. an email and its attachments or a zip file and its contents
Extras_BEGATTACH TEXT Export file title of first page of group. e.g. an email and its attachments or a zip file and its contents
Extras_BILLINFO TEXT Billing Information
Extras_BODY MEMO Body of email
Extras_CATEGOR TEXT Categories
Extras_CNVINDEX TEXT Conversation Index
Extras_CNVTOPIC TEXT Conversation Topic
Extras_COMPANIES TEXT Companies
Extras_DACOMMENT TEXT Discovery Assistant PassThru comment
Extras_DEFDLVDATE TEXT(T/F) Deferred Delivery Date
Extras_DEFDLVTIME TEXT(T/F) Deferred Delivery Time
Extras_DELAFTSUB TEXT(T/F) Delete After Submit
Extras_DLVRPTREQ TEXT(T/F) Originator Delivery Report Requested
Extras_DOCTEXT MEMO Document Text
Extras_DUPPATHS TEXT Source document paths of duplicate items
Extras_ENDATTACH TEXT Export file title of last page of group. e.g. an email and its attachments or a zip file and its contents
Extras_EXPIRYDATE DATE Expiry Date
Extras_EXPIRYTIME TEXT(HMS) Expiry Time
Extras_EXPORTDATE DATE Export start date
Extras_EXPORTEDSOURCEFILEPATHNAME TEXT Exported source file path
Extras_EXPORTTIME TEXT(HMS) Export start time
Extras_FILEACCESSDATE DATE Source document Last Access Date
Extras_FILEACCESSTIME TEXT(HMS) Source document Last Access Time
Extras_FILECREATIONDATE DATE Source document creation date
Extras_FILECREATIONTIME TEXT(HMS) Source document creation time
Extras_FILEDISPLAYNAME TEXT Source file title
Extras_FILEEXTENSION TEXT Source file extension
Extras_FILEMODIFYDATE DATE Source document modified date
Extras_FILEMODIFYTIME TEXT(HMS) Source document modified time
Extras_FILENAME TEXT Source file name (including extension)
Extras_FILEPATHNAME TEXT Source file path
Extras_FILEPRINTDATE DATE Source document Last Print Date
Extras_FILEPRINTTIME TEXT(HMS) Source document Last Print Time
Extras_GLOBALCOUNT NUMB Count of occurrences of this item in the Global Project table.
Extras_GLOBALPRIMARY TEXT(T/F) 'Yes' if this is the first occurrence of this item in the global table.
Extras_GROUPRANGE TEXT Range of export file titles that belong as a group. e.g. an email and its attachments or a zip file and its contents
Extras_HASHCODE TEXT MD5 hash code value for source document
Extras_HTMLBODY MEMO HTML Message Body
Extras_IMPORTANCE TEXT Importance
Extras_INETHEADER TEXT Internet Header
Extras_ISDUP TEXT(T/F) True/False is duplicate
Extras_ITEMID TEXT Discovery Assistant file ID
Extras_ITEMINDEX NUMB Item Index
Extras_LASTSAVEDDATE DATE Source document Last Saved date
Extras_LASTSAVEDTIME TEXT(HMS) Source document Last Saved time
Extras_MSGCLASS TEXT Message Class
Extras_MSGID TEXT Email message ID
Extras_MSGMLG TEXT Message Mileage
Extras_NOAGING TEXT(T/F) No Aging
Extras_OBJECTSIZE NUMB Source file size on disk
Extras_OLINTVER TEXT Outlook Internal Version
Extras_OLVER TEXT Outlook Version
Extras_PAGECOUNT NUMB Number of pages in TIFF file
Extras_PARENT TEXT Email parent folder name
Extras_PARENTCREATIONDATE DATE Parent document create date
Extras_PARENTCREATIONTIME TEXT(HMS) Parent document create time
Extras_PARENTMODIFYDATE DATE Parent document modified date
Extras_PARENTMODIFYTIME TEXT(HMS) Parent document modified time
Extras_PARENTRECEIVEDDATE DATE Parent email received date
Extras_PARENTRECEIVEDTIME TEXT(HMS) Parent email received time
Extras_PARENTSENTDATE DATE Parent email sent date
Extras_PARENTSENTTIME TEXT(HMS) Parent email sent time
Extras_RCPREASSPROHIB BOOL Recipient Reassignment Prohibited
Extras_RCVBYNAME TEXT Received By Name
Extras_RCVONBEHALFNAME TEXT Received On Behalf Of Name
Extras_RDRECREQ TEXT(T/F) Read Receipt Requested
Extras_READ TEXT(Y/N) Message read y/n?
Extras_RECEIVEDDATE DATE Email received date
Extras_RECEIVEDTIME TEXT(HMS) Email received time
Extras_REPLRECIPS TEXT Reply Recipients
Extras_REVNUM TEXT Last Document author
Extras_SAVED BOOL Saved
Extras_SENSITIVITY TEXT Sensitivity
Extras_SENT TEXT(T/F) Sent
Extras_SENTDATE DATE Email sent date
Extras_SENTTIME TEXT(HMS) Email sent time
Extras_SHORTFILETITLE TEXT Short file title
Extras_SNTONBEHALFNAME TEXT Sent On Behalf Of Name
Extras_SOURCELABEL TEXT Source volume label
Extras_SOURCEPAGECOUNT TEXT Source document page count
Extras_SRCBOX TEXT Source Box. Obtained from second to last directory name in source file path.
Extras_SRCCUSTOD TEXT Source Custodian. Obtained from third to last directory name in source file path.
Extras_SRCFOLDER TEXT Source Folder. Obtained from last directory name in source file path.
Extras_STOREID TEXT Message store identifier
Extras_STORENAME TEXT Message store source file name
Extras_SUBMITTED TEXT(T/F) Submitted
Extras_UNREAD TEXT(T/F) UnRead
Extras_VOTINGOPT TEXT Voting Options
Extras_VOTINGRESP TEXT Voting Response
Notes:
All fields are one-to-one
One problem with using dtSearch is that it does not handle NSF files. A second problem is how to extract the responsive files from a PST while keeping all the metadata and parent/child relationships intact.
The current solution is to:
1. Load files into Discovery Assistant, use the COPY button to write the files back out numbered by FileID, then dtSearch the fileset.
2. Use the 'mark' and 'select' buttons
3. Use the 'user field' button to keep track of what search strings were used to find these files.
OR
1. Convert all the files
2. Search the 'projectname.cnvt' directory TXT files
3. Use the 'mark', 'select' and 'user field' buttons to track responsive files.
To Download and install dtSearch:
http://www.dtsearch.com/download.html file: dtSearchEval750.exe
cost: $200 to buy, 1 month free evaluation.
Quick guide to converting and searching:
1. SETUP: import files into Discovery Assistant.
2. SEARCH SOURCE: dtSearch the source files.
3. SEARCH TIFF/TEXT: dtSearch the converted project files.
4. EXPORT: load dtSearch selection set, and export msg files that contain search items.
SETUP: Import files into Discovery Assistant
a. Create a Discovery Assistant project, and add in one or more NSF/PST/Folder directories. Contents of imported email and documents are enumerated. Global and Local duplicates are identified at this point.
b. If you want to search source files, you can do so by exporting a 'copy' of each file (using the 'Copy' button) to a separate search directory. Copied files are identified by their fileID.
c. If you want to search converted files, you can do so by queuing the files for conversion, then converting.
d. When converting files, user options should be: skip local duplicates, don't skip children unless parent is skipped.
e. [will remove this restriction at a later date]
On completion of conversion, remove the NSF and PST records from the converted tab. Can queue these for re-conversion to get them out of the way. (Note: Don't delete from project).
f. If your files contain images, and you want text from those images, select OCR, and in the dialog, select 'OCR only those items without text'.
Note: this requires that 'Microsoft Office Imaging' 2003 or 2007 is installed. We use the Microsoft-provided OCR engine to do the text extraction. (It can be installed from the Office installation disks, under Tools.) Re-save the project.
g. Sort on FileID, assign Document IDs (string: %COUNT1%) and save the project.
h. If your files contain spreadsheets, there is a good chance there are blank pages that should be removed. To remove blank pages, select: DeBlank. Re-save the project.
SEARCH SOURCE: dtSearch the source files.
a. From the All Files tab, select 'copy' All. Select a destination directory using the browse button. Best to choose somewhere that has a lot of available space.
Copied files are named same as the FileId, with the proper extension.
b. Use dtSearch to search the source files. See comments below (Search Tiff/Text) for how to proceed. Basic idea is to generate a list of files to be queued for conversion, without having to convert all the other files.
SEARCH TIFF/TEXT: dtSearch the converted project files.
a. Use dtSearch to index the project.CNVT directory - *.TXT files only. (You need to exclude .mtf, .tif, and .log files.)
b. Enter one or more search terms in dtSearch to create individual search results. Enable stemming, phonic search, and fuzzy search to find similar words. (You can check results using the Browse Words button.)
For individual search terms: save each result as a project_searchterm.CSV.
For all search terms: save 'all strings' search result as project_all.CSV.
Save search results by choosing "File / Save As" - choose CSV format.
Generate a report by choosing "Search / search report".
c. When done, open the project_all.CSV file, select Column E (display name), and copy to clipboard
d. Open Notepad, paste the clipboard into Notepad, then do a search and replace:
[abc] first 3 letters replaced with nothing [].
[F.tif.txt] replaced with nothing [].
delete header line, and blank line at end of file.
Save as project_all.txt in the project
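The manual Notepad clean-up in steps c and d could also be scripted. The Python sketch below is only an illustration: it assumes that Column E is the fifth CSV column, that each display name starts with three letters to be dropped and ends in 'F.tif.txt', and that the first row is a header. The function name `file_ids_from_csv` and the sample display names are hypothetical.

```python
import csv
import io


def file_ids_from_csv(csv_text, prefix_len=3, suffix="F.tif.txt"):
    """Extract FileIDs from column E of a saved search-results CSV."""
    ids = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # drop the header line
    for row in reader:
        if len(row) < 5 or not row[4].strip():
            continue  # skip blank or short lines
        name = row[4].strip()[prefix_len:]  # remove the first 3 letters
        if name.endswith(suffix):
            name = name[: -len(suffix)]     # remove the trailing F.tif.txt
        ids.append(name)
    return ids


sample_csv = (
    "A,B,C,D,Display Name\n"
    "1,2,3,4,abc0000038F.tif.txt\n"
    "\n"
    "1,2,3,4,abc0000040F.tif.txt\n"
)
print("\n".join(file_ids_from_csv(sample_csv)))
```

The resulting lines can be written to project_all.txt, one FileID per line.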
Notes on using dtSearch:
dtSearch evaluation copy can be downloaded from: http://www.dtsearch.com/download.html
Stemming: searches grammatical variations of the words in your search request. For example, with stemming enabled a search for apply would also find applies.
Phonic: search finds words that sound similar to words in your request, like Smith and Smythe.
Fuzzy search: sifts through scanning and typographical errors. Fuzziness adjusts from 1 to 10 depending on the degree of misspellings. (Try starting with 3.)
Synonym search: tells dtSearch to use a thesaurus to find synonyms of words in your search request.
dtSearch provides three ways to perform synonym searching:
· Check the User thesaurus box to find synonyms that you have defined in your own thesaurus.
· Check the WordNet thesaurus box to find synonyms using the WordNet concept network included with dtSearch.
· Check the WordNet related words box to find related words from the WordNet concept network.
EXPORT: Load dtSearch selection set, and export files:
a. Go back to Discovery Assistant and open the same project. On the Converted tab, choose 'Select / by FileIDList' and select the project_all.txt file.
b. From the Converted tab, do the following:
Select / Parent of selected items
Select / Children of selected items
Select Mark / Selected
You can choose 'User Fields' to assign a text string to selected items. One use for this feature is to define what search term was used to select the record.
Save project.
This ensures that we are exporting any file that matches a string (has the string in it) PLUS its parent, PLUS any siblings of that file.
At any point from now on, you can choose 'Select / Marked Items' and get back the items to export as a selection set.
At any point, you can also 'sort' on the left hand column (marked) to see what items are marked.
If for whatever reason you have incorrectly marked items and want to start over, choose Select / Marked Items, then Toggle Mark / Selected. This will clear all marked items.
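The effect of the two selection steps in item b (parents first, then children) can be sketched as set operations. This is an illustration only: it assumes each select step adds to the current selection, that the hierarchy is one level deep, and that parent links are available as a plain mapping. `expand_selection` and `parent_of` are hypothetical names, not Discovery Assistant features.

```python
def expand_selection(hits, parent_of):
    """Return the hits, plus their parents, plus the children of everything selected.

    parent_of maps a FileID to its parent's FileID; top-level files are absent.
    """
    selected = set(hits)
    # Step 1: 'Select / Parent of selected items'
    selected |= {parent_of[f] for f in hits if f in parent_of}
    # Step 2: 'Select / Children of selected items'. It runs after step 1, so
    # the children of a newly added parent (the siblings of a hit) come along.
    selected |= {child for child, parent in parent_of.items() if parent in selected}
    return selected
```

For example, if files 2 and 3 are attachments of message 1 and only file 2 matched the search, the expanded set contains 1, 2, and 3.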
c. To export the selected items:
Choose 'Select / Marked Items', sort on Document ID, and then Export / Selected.
Naming convention is "%ProjectID%.%DOCID%.%PAGE%"
Other settings are:
Destination - location that files are going to be exported to.
Format - choose Summation DII Class I
Note: Press options to choose metadata fields to export.
Directory Structure - flat is recommended
Other files to include - select Text files.
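To see what a naming convention such as "%ProjectID%.%DOCID%.%PAGE%" produces, the token substitution can be sketched as below. The sample values are made up, and the real exporter may pad or format numbers differently; `expand_name` is a hypothetical helper.

```python
import re

# A token is a name wrapped in percent signs, e.g. %DOCID%.
TOKEN = re.compile(r"%([A-Za-z_0-9]+)%")


def expand_name(template, values):
    """Replace each %TOKEN% in template with values[TOKEN] (case-insensitive)."""
    lookup = {key.upper(): str(val) for key, val in values.items()}

    def substitute(match):
        key = match.group(1).upper()
        if key not in lookup:
            raise KeyError(f"no value supplied for %{match.group(1)}%")
        return lookup[key]

    return TOKEN.sub(substitute, template)
```

With the assumed values `{"ProjectID": "abc", "DOCID": "1", "PAGE": 3}`, the convention above expands to `abc.1.3`.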
Whew! You are now done....
Internal notes: ImageMAKER optimizations.
We will be making code changes to remove the following steps:
Load: item 3: will not have to remove NSF or PST
Search: item 3 and 4: will not have to create a TXT file (will use CSV directly)
Export: item 2: will simplify the selection and marking functionality.
Next step will be to integrate dtSearch engine directly into Discovery Assistant.
Discovery Assistant extracts embedded files from OLE containers (DOC, XLS, PPT), XML containers (DOCX, XLSX, PPTX), RTF files, and, in development, PDF files (early January 2008).
Supported Microsoft Office formats include: Office 95, Office 97, Office 2000, Office XP, Office 2003, and Office 2007.
Linked files are noted (in the warnings), but not extracted or enumerated.
Basic extraction logic is as follows:
· determine if the file is an XML or OLE container type, RTF, or PDF.
· do a quick check to see if there are embedded files.
· if there are embedded files, attempt to extract the files from the native document.
· if there is a failure condition, convert the document to Office 2007 format (zipped XML) using the Office 2007 migration tool, and then re-attempt the extraction.
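The first two steps of that logic (identify the container type, then do a quick check for embedded files) can be sketched with file signatures. This is not Discovery Assistant's code. It relies on the well-known leading bytes of OLE compound files, ZIP archives, RTF, and PDF, and on the `embeddings` folder that Office 2007 packages use for embedded objects. The function names are hypothetical.

```python
import zipfile

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file header


def container_type(path):
    """Classify a file as 'ole', 'xml', 'rtf', 'pdf', or 'unknown' by its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(OLE_MAGIC):
        return "ole"
    if head.startswith(b"PK\x03\x04"):
        # ZIP signature: zipped XML (DOCX, XLSX, PPTX). Any other ZIP archive
        # matches too, so this is only a first-pass classification.
        return "xml"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith(b"%PDF"):
        return "pdf"
    return "unknown"


def embedded_parts(path):
    """Quick check: list the embedded-object parts of a zipped XML document."""
    with zipfile.ZipFile(path) as archive:
        return [n for n in archive.namelist() if "/embeddings/" in n]
```

An empty list from `embedded_parts` means there is nothing to extract from that document. The OLE, RTF, and PDF cases need format-specific parsing that is outside this sketch.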
Discovery Assistant uses two tools provided by Microsoft to help with extraction:
Microsoft Office 2007 Compatibility Pack: http://www.microsoft.com/downloads/details.aspx?FamilyId=941b3470-3ae9-4aee-8f43-c6bb74cd1466&displaylang=en
Microsoft Office 2007 Migration Tool: http://www.microsoft.com/downloads/details.aspx?familyid=13580cd7-a8bc-40ef-8281-dd2c325a5a81&displaylang=en
These tools must be installed in order for everything to work correctly. The Options / Embedded tab contains links to both of these tools.
When downloading and installing the MigrationPlanningManager.exe tool, you need to specify an installation directory. After installation, enter that directory under Options / Embedded tab / Settings.
Other notes:
In the Options / Embedded / Settings tab, you can also specify the prefix used for all extracted files. The current defaults are EMB_1, EMB_2, and EMB_3 (representing different types of embedded files). After loading your file set into Discovery Assistant, sorting on name groups all the extracted embedded files together.
You can conditionally turn file handling off for certain file types by selecting the file type from the Settings dialog, then hit the modify button.
Speed / Size of files.
For optimum speed and size, best to convert everything to B&W G4 TIFF.
When exporting to different file types, here are some of the speed/size metrics.
46,462 pages, 4592 tiff files, exported as:
Format | Size | Time
TIFF (G4) | 1.1 GB | 20 minutes (doesn't require reading/writing the files)
Scanned PDF | 1.5 GB | 2 hours (uses 8 bit Flat compression)
24 bit LZW | 3.0 GB | 4.5 hours
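To compare these formats per page, the figures above can be turned into average size and throughput. The short calculation below uses only the reported numbers and takes 1 GB as 1024 * 1024 KB, which is an assumption about how the sizes were measured.

```python
PAGES = 46_462

# (format, size in GB, export time in minutes) as reported above
RUNS = [
    ("TIFF (G4)", 1.1, 20),
    ("Scanned PDF", 1.5, 120),
    ("24 bit LZW", 3.0, 270),
]


def per_page_metrics(pages, size_gb, minutes):
    """Return (KB per page, pages per minute), taking 1 GB = 1024 * 1024 KB."""
    kb_per_page = size_gb * 1024 * 1024 / pages
    pages_per_minute = pages / minutes
    return kb_per_page, pages_per_minute


for name, size_gb, minutes in RUNS:
    kb, rate = per_page_metrics(PAGES, size_gb, minutes)
    print(f"{name:12s} {kb:6.1f} KB/page {rate:7.0f} pages/min")
```

On these numbers, G4 TIFF averages roughly 25 KB per page at about 2,300 pages per minute, against roughly 68 KB per page at about 170 pages per minute for 24 bit LZW.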
For more information ImageMAKER Development Inc. 416 Sixth Street, Suite 102 New Westminster, BC Canada V3L 3B2 http://www.imgmaker.com Copyright © 2004-2008 |
To contact us from overseas: Sales: 1.604.525.2170 Local (Pacific) time: GMT-8 |
ImageMAKER Development Inc. Sales: toll free (866) 525-2170 or (604) 525-2170 Support: (604) 525-2108 Fax: (604) 520-0029 Email: sales@imgmaker.com support@imgmaker.com |