Example Parent AspireObject for Scanning

The following example shows an AspireObject after the connector source information has been fetched from the RDB. This is the parent document that is supplied to a scanner.

<doc>    
   <!-- data for the source from the RDB -->
   <id>93</id>
   <action>start</action>
   <actionProperties>full</actionProperties>
   <crawlId>1</crawlId>
   <displayName>Smith Plant Fileshare</displayName>
   <TYPE>fileshare</TYPE>
   <ACTIVE>1</ACTIVE>
   <properties>... {RAW XML PROPERTY DATA GOES HERE} ... </properties>
   
   <!-- Copied from the PROPERTIES data -->
   <connectorSource>
     <fileNamePatterns/>
     <username>user</username>
     <domain>search</domain>
     <password>encrypted:6A2B871F3F30D3B5BF8D406B9C185FAF</password>
     <url>smb://server/SmithPlant/</url>
     
     <!-- Custom connector properties go here -->
   </connectorSource>
 </doc>
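
As a rough illustration of how a scanner might read its settings from this parent document, the following sketch parses a trimmed copy of the example with Python's standard library. The helper name read_connector_source is hypothetical and is not part of Aspire; the encrypted password is passed through untouched.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the parent document shown above.
PARENT_DOC = """
<doc>
  <id>93</id>
  <action>start</action>
  <actionProperties>full</actionProperties>
  <displayName>Smith Plant Fileshare</displayName>
  <connectorSource>
    <fileNamePatterns/>
    <username>user</username>
    <domain>search</domain>
    <password>encrypted:6A2B871F3F30D3B5BF8D406B9C185FAF</password>
    <url>smb://server/SmithPlant/</url>
  </connectorSource>
</doc>
"""


def read_connector_source(doc_xml):
    """Return the <connectorSource> children as a dict of tag -> text.

    Hypothetical helper for illustration only. Empty elements such as
    <fileNamePatterns/> map to an empty string.
    """
    root = ET.fromstring(doc_xml)
    source = root.find("connectorSource")
    if source is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in source}


settings = read_connector_source(PARENT_DOC)
print(settings["url"])  # smb://server/SmithPlant/
```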

List of Standard Fields

AspireObject field name	Type	Description	Publisher Use
title	String	The extracted title from the document.	Mapped to the presentation title. Displayed to the user in the search results as the title of the document.
content	String	The full content of the document.	Used for full-text search over the entire contents of the document. Hits in this field are typically considered the weakest (i.e., weaker than hits in any of the grank# fields).
fetchUrl	URL	The original URL used to fetch the content from the remote server.	Used as the unique ID of the document if id does not exist. If displayUrl does not exist, fetchUrl is also the URL displayed to the user in the search results, and the URL the user is sent to when the result is clicked.
id	String	A unique identifier of the document (usually the URL of the document).	Used as the unique ID of the document.
displayUrl	URL	The URL used by a user to access the document.	Used as the URL which is displayed in the search results to the user. If it is not defined, fetchUrl is used for this purpose.
url	URL	The URL used during the scan process to uniquely identify a document.	Used as the unique ID of the document if neither id nor fetchUrl is present.
lastModified	String Date (ISO8601)	The last modified date of the document.	Used as a sortable field in the search results.
dataSize	long	Size of the document.	Used as a description field for the document in the search results.
owner	String	Username/name of the owner of the document. This is the person to refer to when additional information about the document is required.	Used as a description/sortable field for the document in the search results.
modifiedBy	String	Username/name of the last user to modify the document.	Used as a description/sortable field for the document in the search results.
createdBy	String	Username/name of the user who created the document.	Used as a description/sortable field for the document in the search results.
action	String	Action to perform (add/update/delete).	Determines which action the publishers perform on the document.
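
The fallback rules in the table (id, then fetchUrl, then url for the unique ID; displayUrl, then fetchUrl for the displayed URL) can be sketched as follows. This is only an illustration: the document is modeled as a plain dict, and both function names are hypothetical rather than part of any publisher API.

```python
def resolve_unique_id(doc):
    """Pick the unique ID: id first, then fetchUrl, then url."""
    for field in ("id", "fetchUrl", "url"):
        value = doc.get(field)
        if value:
            return value
    return None


def resolve_display_url(doc):
    """Pick the URL shown in search results: displayUrl, else fetchUrl."""
    return doc.get("displayUrl") or doc.get("fetchUrl")


doc = {"fetchUrl": "file://server/share/a.txt", "url": "smb://server/share/a.txt"}
print(resolve_unique_id(doc))    # file://server/share/a.txt
print(resolve_display_url(doc))  # file://server/share/a.txt
```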

Example Child AspireObject

As Processed by Fetching and Metadata Processing

The following document is produced by the scanner and is the input to the fetcher.

<doc>
  
  <!-- *** Common connector properties go here *** -->
    
    <!-- The item to fetch by the fetcher goes in <fetchUrl> -->
    <fetchUrl>file://stsd-dev-docu.sharepointlab.net/PerformanceTestData/BILLS</fetchUrl>
    <!-- The URL to present to the user goes in <displayUrl> -->
    <displayUrl>file://stsd-dev-docu.sharepointlab.net/PerformanceTestData/BILLS</displayUrl>
    <action>add</action>
    <docType>container</docType>
    <lastModified>2011-07-06T18:45:14Z</lastModified>
    <dataSize>0</dataSize>
    <owner>AD\klarkin</owner>
    <createdBy>AD\klarkin</createdBy>
    <modifiedBy>AD\klarkin</modifiedBy>
    
      
  <!-- *** Connector source information goes here (copied from parent job) *** -->
  
    <connectorSource type="cifs">
      <displayName>Smith Plant Fileshare</displayName>
      <url>smb://smith.mydomain.com/dir1/dir2</url>
      
      <!-- Custom connector properties go here -->
      
    </connectorSource>
  
  <!-- *** custom properties produced by the connector scanner for the connector fetcher go here *** -->
  
    <connectorSpecific type="filesystem">
      <field name="smbUrl">smb://stsd-dev-docu.sharepointlab.net/PerformanceTestData/BILLS/</field>
      <!-- *** Connector-specific properties go here *** -->
    </connectorSpecific>
  
  <!-- *** Security ACL Information *** -->
  <!-- Note that the ACLs are in order, with earlier ACLs taking precedence over later ACLs -->
  <!--   @access attribute can be "allow" or "deny" -->
  <!--   @name is the base user or group name -->
  <!--   @domain is the domain name, typically the Active Directory name -->
  <!--   @fullname is the name stored in the search engine and used as the security token -->
  <!--   @entity is either "user" or "group" -->
  <!--   @scope is either "server" or "global"; ACLs with "server" scope may not be stored in the indexes -->
  
  <!--   other attributes and ACL contents are allowed, as required by the individual connector -->
  <!--   examples include:  @sid and @sidType below -->
    
    <acls>
        <acl name="Administrator" domain="STSD-DEV-DOCU" fullname="STSD-DEV-DOCU\Administrator" 
             access="allow" inherited="true" entity="user" scope="server"
             sidType="user" sid="S-1-5-21-1830488795-1236199006-3848916169-500"/>
        <acl name="Administrators" domain="BUILTIN" fullname="BUILTIN\Administrators"
             access="allow" inherited="true" entity="group" scope="server"
             sidType="localGroup" sid="S-1-5-32-544"/>
        <acl name="SYSTEM" domain="NT AUTHORITY" fullname="NT AUTHORITY\SYSTEM"
             access="allow" inherited="true" entity="group" scope="server"
             sidType="localGroup" sid="S-1-5-18"/>
        <acl name="GSASecurityDemo" domain="SHAREPOINTLAB" fullname="SHAREPOINTLAB\GSASecurityDemo"
             access="allow" inherited="true" entity="group" scope="global"
             sidType="domainGroup" sid="S-1-5-21-3009121436-1049919257-1970785288-1621"/>
    </acls>
   
   
  <!-- *** Hierarchy Browsing Information *** -->
  <!-- The following holds data necessary to implement hierarchy browsing of 
          original content hierarchies.
   
          <hierarchy> - Is the holder for all site map information.
          <item> - Identifies the entry in the hierarchy where this document or folder exists. 
                   If the entry is contained within multiple hierarchies, then there may be multiple <item> tags (typically, however, all documents only contain a single <item> tag)
          <ancestors> - The holder for all ancestors (parent, grandparent, great-grandparent, etc.) of the <item>
          <ancestor> - Holds the details for each ancestor.

          All <item> and <ancestor> tags will have the following attributes:
              @level - The level within the hierarchy where the node exists. Higher numbers indicate deeper nesting within the hierarchy. 0 = root node.
              @id - A simple, searchable ID token for the item. This is typically a hex representation of the MD5 of the displayUrl.
                       Ideally, this id should be searchable by most search engines as a single token. This usually means that the token is made up only of digits and upper-case letters.
              @name - A user-friendly name for the hierarchy node. For example, the folder name, document file name, etc.
                       Not the full path name, just the item name itself.
              @type - The type of the node, used to show an icon for the object in the user interface. This is usually a normalized mime type, or a "special" mime type which starts with "aspire/" for special Aspire types.
              @url - The URL for the hierarchy node. This is the "display URL" which will open up the folder or document in the native application.
              @parent - (<ancestor> only) Set to "true" on the ancestor which is the direct parent of the current document. Treated as false if not present.

  -->
  
  
    <hierarchy>
      <item level="3"
            id="145C6BEB96DB4BAB3C67ABD14386F46A" 
            name="BILLS" 
            type="aspire/folder" 
            url="smb://smithplant/TestFolder/PerformanceTestData/BILLS/">
        <ancestors>
          <ancestor 
              parent="true"
              level="2" 
              id="C6DF3927E8B6104218AD44D4038E8337" 
              name="PerformanceTestData"
              type="aspire/folder"
              url="smb://smithplant/TestFolder/PerformanceTestData/" />
          <ancestor 
              level="1" 
              id="78DFE189EC1A7DDF1A732878AB109A9C" 
              name="Smith Plant Fileshare"
              type="aspire/fileshare"
              url="smb://smithplant/TestFolder/" />
        </ancestors>
      </item>
    </hierarchy>
    <content>BILLS</content>
  </doc>
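
The ACL comments in the example above state that ACLs are listed in order, with earlier entries taking precedence. One way to read that rule is "first matching ACL wins". The sketch below works under that assumption, and additionally assumes that access is denied when no ACL matches; neither detail is confirmed by this page. The function name and the sample groups are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical ACL list; raw string so the backslashes stay literal.
ACLS_XML = r"""
<acls>
  <acl name="Contractors" domain="EXAMPLE" fullname="EXAMPLE\Contractors"
       access="deny" entity="group" scope="global"/>
  <acl name="Staff" domain="EXAMPLE" fullname="EXAMPLE\Staff"
       access="allow" entity="group" scope="global"/>
</acls>
"""


def is_allowed(acls_xml, user_tokens):
    """Return True if the first ACL matching one of the user's tokens allows access.

    Assumptions: first match wins, and no match means no access.
    user_tokens is a set of @fullname values for the user and the user's groups.
    """
    root = ET.fromstring(acls_xml)
    for acl in root.findall("acl"):
        if acl.get("fullname") in user_tokens:
            return acl.get("access") == "allow"
    return False


# A user in both groups is denied because the deny entry comes first.
print(is_allowed(ACLS_XML, {r"EXAMPLE\Staff", r"EXAMPLE\Contractors"}))  # False
print(is_allowed(ACLS_XML, {r"EXAMPLE\Staff"}))                          # True
```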

The <ancestors> tags are nested within an <item> tag to allow for multiple hierarchies, using multiple <item> tags. This might be needed by a production environment for multiple hierarchies within multiple product lines.
Only one <ancestor> tag should have parent="true" to avoid duplication in the metadata.
All <item> and <ancestor> tags share a consistent set of attributes.
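
To make the hierarchy attributes more concrete, here is a small sketch that builds an @id token as the upper-case hex MD5 of a display URL (the "typical" scheme described in the comments above), finds the direct parent, and assembles a breadcrumb from the @level values. The function names and the placeholder id values are hypothetical, and UTF-8 encoding of the URL before hashing is an assumption.

```python
import hashlib
import xml.etree.ElementTree as ET


def hierarchy_id(display_url):
    """Upper-case hex MD5 of the display URL (UTF-8 encoding is an assumption)."""
    return hashlib.md5(display_url.encode("utf-8")).hexdigest().upper()


def direct_parent(item):
    """Return the <ancestor> marked parent="true", or None if there is none."""
    for ancestor in item.iter("ancestor"):
        if ancestor.get("parent") == "true":
            return ancestor
    return None


def breadcrumb(item):
    """Names from the shallowest ancestor down to the item itself."""
    ancestors = sorted(item.iter("ancestor"), key=lambda a: int(a.get("level")))
    return [a.get("name") for a in ancestors] + [item.get("name")]


# Placeholder ids; real ids would be tokens such as those produced by hierarchy_id().
HIERARCHY_XML = """
<hierarchy>
  <item level="3" id="ITEM-ID" name="BILLS" type="aspire/folder"
        url="smb://smithplant/TestFolder/PerformanceTestData/BILLS/">
    <ancestors>
      <ancestor parent="true" level="2" id="PARENT-ID" name="PerformanceTestData"
                type="aspire/folder" url="smb://smithplant/TestFolder/PerformanceTestData/"/>
      <ancestor level="1" id="ROOT-ID" name="Smith Plant Fileshare"
                type="aspire/fileshare" url="smb://smithplant/TestFolder/"/>
    </ancestors>
  </item>
</hierarchy>
"""

item = ET.fromstring(HIERARCHY_XML).find("item")
print(breadcrumb(item))                 # ['Smith Plant Fileshare', 'PerformanceTestData', 'BILLS']
print(direct_parent(item).get("name"))  # PerformanceTestData
```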

Additional elements expected for document metadata and content

 <doc>
    .
    .  (all other fields from above)
    .
  
  <!-- *** Document Content and Metadata *** -->
  <!-- Data comes from the aspire-extract-text stage. -->
  <!-- Fields can change based on the type of file and what metadata is available for that type -->
  
    <content source="ExtractText">BILLS</content>
    <title source="ExtractText">PEP 2010 Fan Blade Improvement Process</title>
    <author source="ExtractText">Katy Larkin</author>
  </doc>
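
The source attribute makes it possible to tell which stage produced a field. As a minimal sketch, a downstream consumer could collect all top-level fields written by a given source like this. The function name is hypothetical, and the sample document is a cut-down version of the example above.

```python
import xml.etree.ElementTree as ET

SAMPLE_DOC = """
<doc>
  <action>add</action>
  <content source="ExtractText">BILLS</content>
  <title source="ExtractText">PEP 2010 Fan Blade Improvement Process</title>
  <author source="ExtractText">Katy Larkin</author>
</doc>
"""


def fields_from_source(doc_xml, source):
    """Return top-level fields whose @source matches, as a dict of tag -> text."""
    root = ET.fromstring(doc_xml)
    return {el.tag: (el.text or "") for el in root if el.get("source") == source}


extracted = fields_from_source(SAMPLE_DOC, "ExtractText")
print(sorted(extracted))  # ['author', 'content', 'title']
```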

Current list of types

cifs - a Common Internet File System (CIFS) file share source
documentum - a Documentum source
sharepoint - a SharePoint source
spsite - a SharePoint site
splibrary - a SharePoint library
splist - a SharePoint list
folder - any folder from any source
item - the default type; it should always be overwritten by a MIME type once the MIME type is known

Plus, any MIME type can be a type.
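
The rule for the default type can be sketched as below. This assumes that only the default "item" type is replaced and that explicit types such as "folder" are left alone; the page does not say how other types interact with MIME types, so treat that as an assumption. The function name is hypothetical.

```python
DEFAULT_TYPE = "item"


def resolve_type(current_type, mime_type):
    """Replace the default 'item' type with the MIME type once it is known.

    Assumption: any non-default type (for example 'folder') is kept as-is.
    """
    if current_type == DEFAULT_TYPE and mime_type:
        return mime_type
    return current_type


print(resolve_type("item", "application/pdf"))  # application/pdf
print(resolve_type("item", None))               # item
```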

[Open question: can custom applications have their own types? For example, "customer:wma" for a customer's Work Management Application. This needs further research.]

Some expected types for the future:

archive - for ZIP files, TAR files, GZIP files, etc.
