Aspire 3.3, the first Aspire release as part of Accenture, integrates Aspire into the Hadoop ecosystem as a Cloudera parcel with Accenture AIP. You can still run Aspire in stand-alone mode as well as a parcel inside Cloudera.

Find more information on the Cloudera parcel configuration at Aspire Parcel and Service for Cloudera.

A logical consequence of the integration into the Hadoop ecosystem is the support for HBase for crawl metadata and statistics (previously only MongoDB was supported). This facilitates the use of Aspire as part of Big Data solutions.

Review a relevant use case success here. Supporting HBase required refactoring the connector framework and improving on the previous Aspire version (3.1.1).

All configuration steps needed to configure HBase for crawl metadata can be found at HBase Settings.

Other features of interest are Licensing and User Roles. User roles improve security control by separating users into "Developers" and "Administrators" with different roles and permissions over the Aspire configuration.

New connectors:

New publishers:

See the Aspire 3.3 Release Notes for more technical information about this release.

Aspire 3.3 includes significant enhancements.

Note: This version requires MongoDB or HBase to be installed along with Aspire, depending on the client’s environment.

An important new feature is the release of a Google Cloud Search (GCS) publisher. (You can read a blog post about it here.)

Other enhancements include:

  • New headless browser for rendering and then crawling dynamically generated (JavaScript) pages
  • Numerous security improvements (MongoDB SCRAM authentication, logging of IP addresses, hashed IDs in the MongoDB crawl database, faster entitlements checking)
  • UI improvements (refresh of entitled components, crawl DB provider information)
  • Connector improvements (IBM Connections, SharePoint 2010, SharePoint Online)
  • Publisher improvements (Elasticsearch case-sensitive index names, HBase publisher, StageR publisher)
  • Other improvements (time zone normalization, Job Usage, ExtractText default configuration limits)
  • 75+ additional bug fixes across the board

You can refer to the Release Notes for information on bug fixes and enhancements addressed in this version.


Migrating from Aspire 3.x

When importing a content source from 3.x into 3.3, the following error may occur. The content source may show up with a red "Failed" status.

Error message: Unable to start appBundle: com.searchtechnologies.aspire:app-rap-connector
Caused by: com.searchtechnologies.aspire.services.AspireException: Failed to register components from appBundle: CONTENT_SOURCE_NAME (Parent: <null>)
	at com.searchtechnologies.aspire.application.AspireApplicationImpl.registerAppBundleComponents(AspireApplicationImpl.java:945)
	at com.searchtechnologies.aspire.application.AspireApplicationImpl.registerAppBundle(AspireApplicationImpl.java:980)
	at com.searchtechnologies.aspire.application.AspireApplicationComponent.loadApplication(AspireApplicationComponent.java:696)
	at com.searchtechnologies.aspire.application.AspireApplicationComponent.loadApplication(AspireApplicationComponent.java:692)
	at com.searchtechnologies.aspire.configuration.ConfigurationManager.reloadApplication(ConfigurationManager.java:697)
	at com.searchtechnologies.aspire.configuration.ContentSourcesModule.processSyncUnitUpdate(ContentSourcesModule.java:309)
	at com.searchtechnologies.aspire.configuration.SynchronizedModule.run(SynchronizedModule.java:289)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.searchtechnologies.aspire.services.AspireException: The value ("${waitForWfApps}") of element <waitForWfApps> is improperly formatted for a boolean - must be either "true" or "false"
	at com.searchtechnologies.aspire.framework.ComponentImpl.getBooleanFromConfig(ComponentImpl.java:2634)
	at com.searchtechnologies.aspire.connector.framework.controller.CrawlControllerImpl.initialize(CrawlControllerImpl.java:260)
	at com.searchtechnologies.aspire.framework.ComponentFactoryImpl.registerComponent(ComponentFactoryImpl.java:446)
	at com.searchtechnologies.aspire.application.ComponentManagerImpl.registerComponents(ComponentManagerImpl.java:328)
	at com.searchtechnologies.aspire.application.ComponentManagerImpl.initialize(ComponentManagerImpl.java:93)
	at com.searchtechnologies.aspire.application.PipelineManagerImpl.initialize(PipelineManagerImpl.java:75)
	at com.searchtechnologies.aspire.framework.ComponentFactoryImpl.registerComponent(ComponentFactoryImpl.java:446)
	at com.searchtechnologies.aspire.application.ComponentManagerImpl.registerComponents(ComponentManagerImpl.java:328)
	at com.searchtechnologies.aspire.application.ComponentManagerImpl.initialize(ComponentManagerImpl.java:93)
	at com.searchtechnologies.aspire.framework.ComponentFactoryImpl.registerComponent(ComponentFactoryImpl.java:446)
	at com.searchtechnologies.aspire.application.AspireApplicationImpl.registerAppBundleComponents(AspireApplicationImpl.java:941)


This can happen because Aspire 3.3 connectors include configuration options that the imported content source lacks, as with the unresolved "${waitForWfApps}" value in the trace above. To fix this error:

  1. Click on the content source to access the Configuration page.
  2. Click Save and Done.

Aspire generates the new options and saves them into the configuration files.
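
Before importing, it can help to check which options in an exported configuration still hold a literal placeholder like the one in the trace. The following is a minimal sketch, not an Aspire tool: the function name, the sample XML layout, and the <exampleOption> element are assumptions made for illustration; only the <waitForWfApps> element name is taken from the error message.

```python
import re
import xml.etree.ElementTree as ET

# Matches an element value that is still a literal ${name} placeholder.
PLACEHOLDER = re.compile(r"^\$\{([^}]+)\}$")


def unresolved_placeholders(xml_text):
    """Return (element tag, placeholder name) pairs for unresolved values."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        text = (element.text or "").strip()
        match = PLACEHOLDER.match(text)
        if match:
            found.append((element.tag, match.group(1)))
    return found


# Hypothetical configuration fragment; only <waitForWfApps> comes from the trace.
SAMPLE = """
<config>
  <waitForWfApps>${waitForWfApps}</waitForWfApps>
  <exampleOption>true</exampleOption>
</config>
"""

if __name__ == "__main__":
    for tag, name in unresolved_placeholders(SAMPLE):
        print(f"<{tag}> still holds the placeholder ${{{name}}}")
```

Any element reported by this check is a candidate for the "Save and Done" step above, which lets Aspire write a concrete value into the configuration.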

MongoDB Changes

Any migration from Aspire 3.x requires a full crawl of all content sources because the MongoDB provider component was substantially refactored. Specifically, the following collections changed:

Collection: audit
Fields in 3.x:
  • _id (ObjectId)
  • id (String)
  • crawlStart (Int64)
  • url (String)
  • type (String)
  • action (String)
  • batch (String or Null)
  • ts (Int64)
Fields in 3.3:
  • _id (ObjectId)
  • id (String)
  • ts (Int64)
  • action (String)
Compatible: Yes

Collection: errors
Fields in 3.x:
  • _id (ObjectId)
  • error (Object)
    • .@time (Int64)
    • .@crawlTime (Int64)
    • .@cs (String)
    • .@processor (String)
    • .@type (String)
    • ._$ (String)
Fields in 3.3:
  • _id (ObjectId)
  • message (String)
  • type (String)
  • crawlId (String)
  • time (Int64)
Compatible: No

Collection: hierarchy
Fields in 3.x:
  • _id (String)
  • itemType (String)
  • name (String)
  • ancestors (Object or Null)
    • ._id (String)
    • .name (String)
    • .ancestors (Object or Null)
Fields in 3.3:
  • _id (String)
  • itemType (String)
  • name (String)
  • ancestors (Object or Null)
    • ._id (String)
    • .itemType (String)
    • .name (String)
    • .ancestors (Object or Null)
Compatible: No

Collection: processQueue and scanQueue
Fields in 3.x:
  • _id (String)
  • metadata (Object)
  • type (String)
  • status (String)
  • action (String)
  • timestamp (Int64)
  • signature (String)
  • processor (String)
  • shouldScan (Boolean)
  • shouldProcess (Boolean)
  • crawlRetries (Int32) *
  • name (String)
  • isCrawlRootItem (Boolean)
  • hierarchyId (String)
  • inCrawlRetries (Int32) *
Fields in 3.3:
  • _id (String)
  • metadata (Object)
  • url (String)
  • type (String)
  • status (String)
  • action (String)
  • timestamp (Int64)
  • signature (String)
  • processor (String)
  • shouldScan (Boolean)
  • shouldProcess (Boolean)
  • crawlRetries (Int32)
  • name (String)
  • isCrawlRootItem (Boolean)
  • hierarchyId (String)
  • inCrawlRetries (Int32)
Compatible: Yes

Collection: snapshot
Fields in 3.x:
  • _id (String)
  • container (Boolean)
  • crawlId (Int64)
  • signature (String)
  • timestamp (Int64)
  • error (Boolean)
  • notFoundCount (Int32) *
Fields in 3.3:
  • _id (String)
  • id (String)
  • url (String)
  • fetchUrl (String)
  • itemType (String)
  • displayUrl (String)
  • container (Boolean)
  • crawlId (Int64)
  • signature (String)
  • timestamp (String)
  • error (Boolean)
  • notFoundCount (Int32)
Compatible: No

Collection: statistics
Fields in 3.x:
  • _id (String)
  • statistics (Object)
    • .@processor (String)
    • .@server (String)
    • .@status (String)
    • .@mode (String)
    • .@startTime (Int64)
    • .@endTime (Int64)
    • .@currentTime (Int64)
    • .@cs (String)
    • .queue (Object)
      • .scan (Object)
        • .@toScan (Int32)
        • .@scanning (Int32)
        • .@scanned (Int32)
        • .@total (Int32)
      • .process (Object)
        • .@toProcess (Int32)
        • .@processing (Int32)
        • .@processed (Int32)
        • .@total (Int32)
    • .inProgress (Object)
      • .@adding (Int32)
      • .@updating (Int32)
      • .@deleting (Int32)
      • .@total (Int32)
    • .processed (Object)
      • .@added (Int32)
      • .@updated (Int32)
      • .@deleting (Int32)
      • .@unchanged (Int32)
      • .@excluded (Int32)
      • .@terminated (Int32)
      • .@errored (Int32)
      • .@total (Int32)
    • .errors (Object)
      • .@batch (Int32)
      • .@scan (Int32)
      • .@document (Int32)
      • .@total (Int32)
Fields in 3.3:
  • _id (String)
  • @processor (String)
  • @server (String)
  • @status (String)
  • @mode (String)
  • @startTime (Int64)
  • @endTime (Int64)
  • @currentTime (Int64)
  • @cs (String)
  • queue (Object)
    • .scan (Object)
      • .@toScan (Int32)
      • .@scanning (Int32)
      • .@scanned (Int32)
      • .@total (Int32)
    • .process (Object)
      • .@toProcess (Int32)
      • .@processing (Int32)
      • .@processed (Int32)
      • .@total (Int32)
  • inProgress (Object)
    • .@adding (Int32)
    • .@updating (Int32)
    • .@deleting (Int32)
    • .@total (Int32)
  • processed (Object)
    • .@added (Int32)
    • .@updated (Int32)
    • .@deleting (Int32)
    • .@unchanged (Int32)
    • .@excluded (Int32)
    • .@terminated (Int32)
    • .@errored (Int32)
    • .@total (Int32)
  • errors (Object)
    • .@batch (Int32)
    • .@scan (Int32)
    • .@document (Int32)
    • .@total (Int32)
Compatible: No

Collection: status
Fields in 3.x:
  • _id (ObjectId)
  • connectorSource (Object)
  • @action (String)
  • @actionProperties (String)
  • @crawlId (String)
  • @normalizedCSName (String)
  • displayName (String)
  • @scheduler (String)
  • @scheduleId (String)
  • @jobNumber (String)
  • @sourceId (String)
  • @actionType (String)
  • @dbId (String)
  • crawlStart (Int64)
  • crawlStatus (String)
  • processDeletes (String)
  • processingDeletesStatus (String)
  • crawlEnd (Int64)
Fields in 3.3:
  • _id (String)
  • connectorSource (Object)
  • @action (String)
  • @actionProperties (String)
  • @crawlId (String)
  • @normalizedCSName (String)
  • displayName (String)
  • @scheduler (String)
  • @scheduleId (String)
  • @jobNumber (String)
  • @sourceId (String)
  • @actionType (String)
  • @dbId (String)
  • crawlStart (Int64)
  • crawlStatus (String)
  • processDeletes (String)
  • processingDeletesStatus (String)
  • crawlEnd (Int64)
Compatible: No

* These fields were available in Aspire 3.1
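
To see at a glance what changed for a collection, the field lists above can be compared programmatically. This is only an illustrative sketch: the dictionaries transcribe the top-level audit and snapshot fields from the lists above, and the diff_fields helper is a hypothetical name, not part of Aspire or MongoDB.

```python
# Top-level fields transcribed from the audit and snapshot lists above.
AUDIT_3X = {
    "_id": "ObjectId", "id": "String", "crawlStart": "Int64", "url": "String",
    "type": "String", "action": "String", "batch": "String or Null", "ts": "Int64",
}
AUDIT_33 = {"_id": "ObjectId", "id": "String", "ts": "Int64", "action": "String"}

SNAPSHOT_3X = {
    "_id": "String", "container": "Boolean", "crawlId": "Int64",
    "signature": "String", "timestamp": "Int64", "error": "Boolean",
    "notFoundCount": "Int32",
}
SNAPSHOT_33 = {
    "_id": "String", "id": "String", "url": "String", "fetchUrl": "String",
    "itemType": "String", "displayUrl": "String", "container": "Boolean",
    "crawlId": "Int64", "signature": "String", "timestamp": "String",
    "error": "Boolean", "notFoundCount": "Int32",
}


def diff_fields(old, new):
    """Hypothetical helper: list removed, added, and retyped top-level fields."""
    shared = set(old) & set(new)
    return {
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "retyped": sorted(name for name in shared if old[name] != new[name]),
    }


if __name__ == "__main__":
    print("audit:", diff_fields(AUDIT_3X, AUDIT_33))
    print("snapshot:", diff_fields(SNAPSHOT_3X, SNAPSHOT_33))
```

For snapshot, the comparison reports five added fields and a type change on timestamp. For audit, it reports only removed fields.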