This page documents the standard metadata that all connectors include in the AspireObject. This metadata is expected to be used by post-xml (or another search engine indexing component) to communicate document data and metadata to the search engine.



Example Parent AspireObject for Scanning


 The following shows an example of an AspireObject after fetching the connector source information from the RDB. This is a parent document which is supplied to a scanner.

Code Block
languagexml
<doc>    
   <!-- data for the source from the RDB -->
   <id>93</id>
   <action>start</action>
   <actionProperties>full</actionProperties>
   <crawlId>1</crawlId>
   <displayName>Smith Plant Fileshare</displayName>
   <TYPE>fileshare</TYPE>
   <ACTIVE>1</ACTIVE>
   <properties>... {RAW XML PROPERTY DATA GOES HERE} ... </properties>
   
   <!-- Copied from the PROPERTIES data -->
   <connectorSource>
     <fileNamePatterns/>
     <username>user</username>
     <domain>search</domain>
     <password>encrypted:6A2B871F3F30D3B5BF8D406B9C185FAF</password>
     <url>smb://server/SmithPlant/</url>
     
     <!-- Custom connector properties go here -->
   </connectorSource>
 </doc>

List of Standard Fields


 

| AspireObject field name | Type | Description | Publisher Use |
|---|---|---|---|
| title | string | The extracted title from the document. | Mapped to the presentation title. Displayed to the user in the search results as the title of the document. |
| content | string | The full content of the document. | Used for full-text search over the entire contents of the document. Typically considered to be the weakest hits (i.e., weaker than any of the grank# fields). |
| fetchUrl | URL | The original URL used to fetch the content from the remote server. | Used as the unique ID of the document if id does not exist. If displayUrl does not exist, fetchUrl is also the URL displayed to the user in the search results, and the URL the user is sent to when clicking the result. |
| id | string | A unique identifier of the document (usually the URL of the document). | Used as the unique ID of the document. |
| displayUrl | URL | The URL used by a user to access the document. | Used as the URL which is displayed in the search results to the user. If it is not defined, fetchUrl is used for this purpose. |
| url | URL | The URL used during the scan process to uniquely identify a document. | Used as the unique ID of the document if neither fetchUrl nor id is present. |
| lastModified | string date (ISO 8601) | The last modified date of the document. | Used as a sortable field in the search results. |
| dataSize | long | Size of the document. | Used as a description field for the document in the search results. |
| owner | string | Username/name of the owner of the document. The person to refer to when additional information about the document is required. | Used as a description/sortable field for the document in the search results. |
| modifiedBy | string | Username/name of the last user to modify the document. | Used as a description/sortable field for the document in the search results. |
| createdBy | string | Username/name of the user who created the document. | Used as a description/sortable field for the document in the search results. |
| action | string | Action to perform (add/update/delete). | Used by the publishers to determine which action to perform on the document. |
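The fallback rules described for id, fetchUrl, url, and displayUrl can be sketched as follows. This is only an illustrative Python sketch, not Aspire or publisher code: the function names resolve_unique_id and resolve_display_url are hypothetical, and treating an empty element the same as a missing one is an assumption.

```python
import xml.etree.ElementTree as ET


def _text(doc, tag):
    """Return the stripped text of a direct child element, or None if absent or empty."""
    value = doc.findtext(tag)
    if value is None:
        return None
    value = value.strip()
    return value or None


def resolve_unique_id(doc):
    """Hypothetical helper: pick the unique ID using the order id, fetchUrl, url."""
    for tag in ("id", "fetchUrl", "url"):
        value = _text(doc, tag)
        if value:
            return value
    return None


def resolve_display_url(doc):
    """Hypothetical helper: prefer displayUrl and fall back to fetchUrl."""
    return _text(doc, "displayUrl") or _text(doc, "fetchUrl")


# Made-up sample document with neither <id> nor <displayUrl>.
doc = ET.fromstring("""
<doc>
  <fetchUrl>file://server/share/report.doc</fetchUrl>
  <url>smb://server/share/report.doc</url>
</doc>
""")

print(resolve_unique_id(doc))    # file://server/share/report.doc
print(resolve_display_url(doc))  # file://server/share/report.doc
```

Because the sample has no id element, fetchUrl serves both as the unique ID and as the URL shown to the user.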

Example Child AspireObject


 As Processed by Fetching and Metadata Processing

Document produced by the Scanner, which is input to the fetcher.

Code Block
languagexml
<doc>
  
  <!-- *** Common connector properties go here *** -->
    
    <!-- The item to fetch by the fetcher goes in <fetchUrl> -->
    <fetchUrl>file://stsd-dev-docu.sharepointlab.net/PerformanceTestData/BILLS</fetchUrl>
    <!-- The URL to present to the user goes in <displayUrl> -->
    <displayUrl>file://stsd-dev-docu.sharepointlab.net/PerformanceTestData/BILLS</displayUrl>
    <action>add</action>
    <docType>container</docType>
    <lastModified>2011-07-06T18:45:14Z</lastModified>
    <dataSize>0</dataSize>
    <owner>AD\klarkin</owner>
    <createdBy>AD\klarkin</createdBy>
    <modifiedBy>AD\klarkin</modifiedBy>
    
      
  <!-- *** Connector source information goes here (copied from parent job) *** -->
  
    <connectorSource type="cifs">
      <displayName>Smith Plant Fileshare</displayName>
      <url>smb://smith.mydomain.com/dir1/dir2</url>
      
      <!-- Custom connector properties go here -->
      
    </connectorSource>
  
  <!-- *** custom properties produced by the connector scanner for the connector fetcher go here *** -->
  
    <connectorSpecific type="filesystem">
      <field name="smbUrl">smb://stsd-dev-docu.sharepointlab.net/PerformanceTestData/BILLS/</field>
      <!-- *** connector-specific properties go here -->
    </connectorSpecific>
  
  <!-- *** Security ACL Information *** -->
  <!-- Note that the ACLs are in order, with earlier ACLs taking precedence over later ACLs -->
  <!--   @access attribute can be "allow" or "deny" -->
  <!--   @name is the base user or group name -->
  <!--   @domain is the domain name, typically the Active Directory name -->
  <!--   @fullname is the name stored in the search engine and used as the security token -->
  <!--   @entity is either "user" or "group" -->
  <!--   @scope is either "server" or "global"; ACLs which are "server" scope may not be stored in the indexes -->
  
  <!--   other attributes and ACL contents are allowed, as required by the individual connector -->
  <!--   examples include:  @sid and @sidType below -->
    
    <acls>
        <acl name="Administrator" domain="STSD-DEV-DOCU" fullname="STSD-DEV-DOCU\Administrator" 
             access="allow" inherited="true" entity="user" scope="server"
             sidType="user" sid="S-1-5-21-1830488795-1236199006-3848916169-500"/>
        <acl name="Administrators" domain="BUILTIN" fullname="BUILTIN\Administrators"
             access="allow" inherited="true" entity="group" scope="server"
             sidType="localGroup" sid="S-1-5-32-544"/>
        <acl name="SYSTEM" domain="NT AUTHORITY" fullname="NT AUTHORITY\SYSTEM"
             access="allow" inherited="true" entity="group" scope="server"
             sidType="localGroup" sid="S-1-5-18"/>
        <acl name="GSASecurityDemo" domain="SHAREPOINTLAB" fullname="SHAREPOINTLAB\GSASecurityDemo"
             access="allow" inherited="true" entity="group" scope="global"
             sidType="domainGroup" sid="S-1-5-21-3009121436-1049919257-1970785288-1621"/>
    </acls>
   
   
  <!-- *** Hierarchy Browsing Information *** -->
  <!-- The following holds data necessary to implement hierarchy browsing of 
          original content hierarchies.
   
          <hierarchy> - Is the holder for all site map information.
          <item> - Identifies the entry in the hierarchy where this document or folder exists. 
                   If the entry is contained within multiple hierarchies, then there may be multiple <item> tags (typically, however, all documents only contain a single <item> tag)
          <ancestors> - The holder for all ancestors (parent, grandparent, great-grandparent, etc.) of the <item>
          <ancestor> - Holds the details for each ancestor.

          All <item> and <ancestor> tags will have the following attributes:
              @level - The level within the hierarchy where the node exists. Higher numbers indicate deeper nesting within the hierarchy. 0 = root node.
              @id - A simple, searchable ID token for the item. This is typically a hex representation of the MD5 of the displayUrl.
                       Ideally, this id should be searchable by most search engines as a single token. This usually means that the token is made up only of digits and upper-case letters.
              @name - A user-friendly name for the hierarchy node. For example, the folder name, document file name, etc.
                       Not the full path name, just the item name itself.
              @type - The type of the node, used to show an icon for the object in the user interface. This is usually a normalized mime type, or a "special" mime type which starts with "aspire/" for special Aspire types.
              @url - The URL for the hierarchy node. This is the "display URL" which will open up the folder or document in the native application.
              @parent - Identifies the ancestor which is the direct parent of the current document, false if not present.

  -->
  
  
    <hierarchy>
      <item level="3"
            id="145C6BEB96DB4BAB3C67ABD14386F46A" 
            name="BILLS" 
            type="aspire/folder" 
            url="smb://smithplant/PerformanceTestData/BILLS/">
        <ancestors>
          <ancestor 
              parent="true"
              level="2" 
              id="C6DF3927E8B6104218AD44D4038E8337" 
              name="PerformanceTestData"
              type="aspire/folder"
              url="smb://smithplant/TestFolder/PerformanceTestData/" />
          <ancestor 
              level="1" 
              id="78DFE189EC1A7DDF1A732878AB109A9C" 
              name="Smith Plant Fileshare"
              type="aspire/fileshare"
              url="smb://smithplant/TestFolder/" />
        </ancestors>
      </item>
    </hierarchy>
    <content>BILLS</content>
  </doc>
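The ACL comments in the sample state that ACLs are ordered and that earlier entries take precedence. One way a consumer might honor that ordering is a first-match lookup, sketched below in Python. This is an assumption about how the ordering could be applied, not a description of how Aspire or any search engine evaluates ACLs. The function name check_access and the sample accounts are hypothetical, and matching is done on the exact @fullname token.

```python
import xml.etree.ElementTree as ET


def check_access(acls_element, user_tokens):
    """Return the @access value ("allow" or "deny") of the first <acl>, in
    document order, whose @fullname is one of the user's security tokens.
    Returns None when no entry matches."""
    for acl in acls_element.findall("acl"):
        if acl.get("fullname") in user_tokens:
            return acl.get("access")
    return None


# Made-up ACL list; the raw string keeps the backslashes literal.
acls = ET.fromstring(r"""
<acls>
  <acl name="Contractors" domain="AD" fullname="AD\Contractors"
       access="deny" entity="group" scope="global"/>
  <acl name="Staff" domain="AD" fullname="AD\Staff"
       access="allow" entity="group" scope="global"/>
</acls>
""")

print(check_access(acls, {r"AD\Staff", r"AD\Contractors"}))  # deny
print(check_access(acls, {r"AD\Staff"}))                     # allow
print(check_access(acls, {r"AD\Guest"}))                     # None
```

In the first call the user belongs to both groups, and the earlier deny entry wins because it appears first in the list.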


Note
  • The <ancestors> tags are nested within an <item> tag to allow for multiple hierarchies, using multiple <item> tags. This might be needed by a production environment for multiple hierarchies within multiple product lines.
  • Only one <ancestor> tag should have parent="true" to avoid duplication in the metadata.
  • All <item> and <ancestor> tags have a consistent set of attributes.
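The hierarchy comments describe @id as typically a hex representation of the MD5 of the displayUrl, and the note above asks for a single parent="true" ancestor. Both points can be sketched in Python as shown below. The helper names hierarchy_id and find_direct_parent are hypothetical, UTF-8 encoding of the URL before hashing is an assumption, and the sketch is not claimed to reproduce the IDs in the sample document.

```python
import hashlib
import xml.etree.ElementTree as ET


def hierarchy_id(display_url):
    """Hypothetical helper: upper-case hex MD5 of the URL (UTF-8 encoding assumed)."""
    return hashlib.md5(display_url.encode("utf-8")).hexdigest().upper()


def find_direct_parent(item):
    """Return the single <ancestor parent="true"> of an <item>, or None if there
    is none. Raises ValueError if more than one ancestor is flagged as parent."""
    parents = [a for a in item.findall("ancestors/ancestor")
               if a.get("parent") == "true"]
    if len(parents) > 1:
        raise ValueError("more than one <ancestor> has parent=\"true\"")
    return parents[0] if parents else None


# Made-up hierarchy entry with two ancestors.
item = ET.fromstring("""
<item level="2" name="BILLS" type="aspire/folder" url="smb://server/share/BILLS/">
  <ancestors>
    <ancestor parent="true" level="1" name="share"
              type="aspire/folder" url="smb://server/share/"/>
    <ancestor level="0" name="server"
              type="aspire/fileshare" url="smb://server/"/>
  </ancestors>
</item>
""")

parent = find_direct_parent(item)
print(parent.get("name"))             # share
print(hierarchy_id(item.get("url")))  # 32 upper-case hex characters
```

The resulting token contains only digits and upper-case letters, which matches the searchability guidance given for @id.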

Additional elements expected for document metadata and content

 

Code Block
languagexml
 <doc>
    .
    .  (all other fields from above)
    .
  
  <!-- *** Document Content and Metadata *** -->
  <!-- Data comes from the aspire-extract-text stage. -->
  <!-- Fields can change based on the type of file and what metadata is available for that type -->
  
    <content source="ExtractText">BILLS</content>
    <title source="ExtractText">PEP 2010 Fan Blade Improvement Process</title>
    <author source="ExtractText">Katy Larkin</author>
  </doc>

Current list of types

  • cifs - a Common Internet File System (CIFS) file share source
  • documentum - a Documentum source
  • sharepoint - a SharePoint source
  • spsite - a SharePoint site
  • splibrary - a SharePoint library
  • splist - a SharePoint list
  • folder - any folder from any source
  • item - the default type (should always be overwritten by a MIME type once the MIME type is known)

 

Plus, any MIME type can be a type.
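A consumer that needs to tell the listed types apart from MIME types might use a check like the Python sketch below. This is an illustration only: the function classify_type, the returned labels, and the simplified type/subtype pattern are all assumptions, not part of Aspire.

```python
import re

# The types listed on this page.
ASPIRE_TYPES = {
    "cifs", "documentum", "sharepoint", "spsite",
    "splibrary", "splist", "folder", "item",
}

# Simplified "type/subtype" check; an assumption, not a full media type grammar.
MIME_PATTERN = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*$"
)


def classify_type(value):
    """Hypothetical helper: label a type value as "aspire", "mime", or "unknown"."""
    if value in ASPIRE_TYPES:
        return "aspire"
    if MIME_PATTERN.match(value):
        return "mime"
    return "unknown"


print(classify_type("splist"))           # aspire
print(classify_type("application/pdf"))  # mime
print(classify_type("customer:wma"))     # unknown
```

Special types such as "aspire/folder" have the type/subtype shape, so this sketch labels them as "mime".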

Info

Can custom applications have their own types? For example: "customer:wma" for a customer's Work Management Application? (Needs further research.)

 

Some expected types for the future:

  • archive (for ZIP files, TAR files, GZIP files, etc.)