The connector will crawl content from the File System Staging Repository.

Features


The File System Staging Repository

The file system staging repository uses a standard file system to store the data published to it. It works on both Linux and Windows. Although it does not specifically support remote file systems, data can be stored on remote storage using any operating system capability (for instance NFS or Windows shares).

This repository supports all of the functionality noted above and allows for optional compression and encryption of the data stored to the disk. A transaction log is used to determine the items currently in the repository.

The transaction log and items published to the repository are saved to disk under a directory configured for the repository. This directory must already exist; otherwise the repository will report errors.

Item Storage

Items stored in the repository are placed on disk under the store directory beneath the directory configured for the repository. This store directory will be created if it does not exist. The next directory level indicates the content source to which the content belongs, and the level under that indicates the data owner. If the data owner is not given, the directory is named default. Under the owner directory, a structure based on the MD5 hash of the document id is used. The hash of the document id is split into three-character pieces, and these build up a directory hierarchy. The use of three-character directory names prevents a large number of files from accumulating in a single directory. At the bottom level, files are named using the complete MD5 hash of the document id, with a different extension to indicate the file content. The extensions used are shown below:

  • .props
    • Holds the properties for the document in the store – the id of the stored item, whether it is compressed and whether it is encrypted
  • .meta
    • The JSON representation of the document. This file will be encrypted and/or compressed as per the properties file
  • .stream
    • The original stream associated with the document (if stored). This file will be encrypted and/or compressed as per the properties file

Most directory levels in the store hierarchy may contain up to 4096 directories (three hexadecimal characters from the MD5 hash give 16^3 possibilities). The lowest directory in the store hierarchy may contain up to 256 items (two hexadecimal characters of the MD5 hash are left, giving 16^2 possibilities), each of which could have two or three files associated with it (the properties and metadata, or the properties, metadata and stream).
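The layout described above can be sketched in a few lines of Python. This is only an illustration, not the repository's own code: the function name item_paths and its parameters are hypothetical, and it assumes the hash is the lowercase hexadecimal MD5 digest of the UTF-8 encoded document id.

```python
import hashlib
from pathlib import PurePosixPath


def item_paths(repo_dir, content_source, doc_id, owner=None):
    """Return the assumed .props/.meta/.stream paths for a document id.

    Assumption: the hash is the lowercase hex MD5 of the UTF-8 document id.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()  # 32 hex characters
    # The first 30 characters form ten three-character directory levels;
    # the last two characters only distinguish items in the lowest directory.
    pieces = [digest[i:i + 3] for i in range(0, 30, 3)]
    base = PurePosixPath(repo_dir, "store", content_source, owner or "default", *pieces)
    # Files are named with the complete hash plus an extension per content type.
    return {ext: base / f"{digest}.{ext}" for ext in ("props", "meta", "stream")}


if __name__ == "__main__":
    for path in item_paths("/data/repo", "TestToRepo", "file:/c:/testdata/doc1.pdf").values():
        print(path)
```

A 32-character hash yields ten three-character levels (30 characters) with two characters left over, which is where the 16^3 = 4096 and 16^2 = 256 limits come from.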

When items are deleted from the store, the files are deleted, but the directories will remain. When the repository is cleared, the appropriate set of items (either all, all for the content source, or all for the given owner for a content source) will be deleted along with the directories.

Transaction Storage

The transaction history for the repository is stored under the transactions directory beneath the directory configured for the repository. This directory will be created if it does not exist. A number of pieces of information are stored under the transactions directory. The two main files are transaction.log and transaction.idx. The transaction.log file holds a sequential list of the transactions in the log in the following form:

###############################################################################################
#
# Created: 2014-12-01T17:23:45Z
#
# DO NOT EDIT THIS FILE
#
0:clear:1417454625451:TestToRepo:default
1:add:1417454625549:TestToRepo:default:file%3A%2Fc%3A%2Ftestdata%2Fmixed%2Fdoc%2FProposed%2520New%2520FBI%2520Retrievalware%25208%25201%2520Architecture%2520August%25202005.doc
2:add:1417454625601:TestToRepo:default:file%3A%2Fc%3A%2Ftestdata%2Fmixed%2Fdoc%2FQuery%2520Performance%2520Tuning.doc
3:add:1417454625635:TestToRepo:default:file%3A%2Fc%3A%2Ftestdata%2Fmixed%2Fpdf%2Fheadlight_switch_OVH.pdf
4:add:1417454625674:TestToRepo:default:file%3A%2Fc%3A%2Ftestdata%2Fmixed%2Fpdf%2Fheadrelay.pdf
5:add:1417454625745:TestToRepo:default:file%3A%2Fc%3A%2Ftestdata%2Fmixed%2Fdoc%2FLDAP%2520Installation.doc
6:add:1417454626047:TestToRepo:default:file%3A%2Fc%3A%2Ftestdata%2Fmixed%2Fdoc%2FRWTechNote_XX.doc
7:add:1417454626134:TestToRepo:default:file%3A%2Fc%3A%2Ftestdata%2Fmixed%2Fpdf%2FHg-catalyst-review-ES-T-2006.pdf

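Each line contains colon-separated fields: the transaction id, the action, a timestamp in milliseconds since the Unix epoch, the content source, the owner and the document id (if applicable). In the sample above, document ids are URL-encoded.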

NOTE: Do not edit this file. Doing so will corrupt the transaction log.
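For read-only inspection, the line format shown in the sample can be parsed with a short Python sketch. The Transaction class and parse_line function are hypothetical names introduced for illustration. The sketch assumes that fields are separated by colons, that the third field is a timestamp in milliseconds since the Unix epoch, and that document ids are URL-encoded, as the sample suggests.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import unquote


@dataclass
class Transaction:
    tx_id: int
    action: str
    timestamp: datetime
    content_source: str
    owner: str
    doc_id: Optional[str]  # None for actions such as "clear"


def parse_line(line):
    """Parse one transaction.log line; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # At most six fields; the document id is URL-encoded, so it holds no raw ':'.
    fields = line.split(":", 5)
    if len(fields) < 5:
        raise ValueError(f"unexpected transaction line: {line!r}")
    tx_id, action, millis, source, owner = fields[:5]
    ms = int(millis)
    # Integer arithmetic avoids floating-point rounding of the milliseconds.
    stamp = datetime.fromtimestamp(ms // 1000, tz=timezone.utc) + timedelta(milliseconds=ms % 1000)
    doc_id = unquote(fields[5]) if len(fields) == 6 else None
    return Transaction(int(tx_id), action, stamp, source, owner, doc_id)
```

The sketch only reads lines; it never writes to the log, in keeping with the note above.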

The transaction.idx file holds an index of transaction id against the position of the transaction in the log file, allowing quick access. Other files in the transactions directory hold the first transactions for the content sources and owners (the .fst files) and the last transactions for given document ids (the .lst files). These files are used when iterating over the repository so that transactions that are not relevant to the current crawl can be skipped quickly.
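The idea of indexing transaction ids against log positions can be illustrated with the Python sketch below. It does not reproduce the on-disk format of transaction.idx, which is not described here. Instead, it builds an in-memory mapping from transaction id to byte offset, and the names build_offset_index and read_transaction are hypothetical.

```python
import io


def build_offset_index(log):
    """Map each transaction id to the byte offset of its line in a binary stream."""
    index = {}
    log.seek(0)
    while True:
        offset = log.tell()
        line = log.readline()
        if not line:
            break  # end of file
        if line.startswith(b"#") or not line.strip():
            continue  # skip header comments and blank lines
        tx_id = int(line.split(b":", 1)[0])
        index[tx_id] = offset
    return index


def read_transaction(log, index, tx_id):
    """Jump straight to a transaction without scanning the preceding lines."""
    log.seek(index[tx_id])
    return log.readline().decode("utf-8").rstrip("\r\n")


if __name__ == "__main__":
    sample = io.BytesIO(b"# header\n0:clear:1:Src:default\n1:add:2:Src:default:doc%2F1\n")
    idx = build_offset_index(sample)
    print(read_transaction(sample, idx, 1))
```

With such a mapping, a reader can seek directly to any transaction instead of reading the log from the start.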

Transaction Log Locking

By default, the file staging repository uses memory locking to ensure the transaction log remains consistent. If you wish to use the file staging repository from several Java virtual machines at the same time, you may turn on file locking, where the repository gains an exclusive lock on a file on disk before committing transactions.
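The general principle of cross-process locking can be sketched in Python as follows. This is not the repository's mechanism: it substitutes a simple lock-file technique (atomic creation with O_CREAT | O_EXCL) for whatever file lock the repository takes, and the name exclusive_lock and its parameters are hypothetical.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path, timeout=5.0, poll=0.05):
    """Hold a lock file for the duration of the with-block.

    O_CREAT | O_EXCL makes creation fail if the file already exists,
    so only one process at a time can create (and thus hold) the lock.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock {lock_path}")
            time.sleep(poll)  # another process holds the lock; retry shortly
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)  # release the lock for other processes
```

A writer would wrap each commit in `with exclusive_lock(path):` so that two processes never append to the log at the same time. Unlike an operating system file lock, a lock file left behind by a crashed process must be removed by hand.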




