Training Material

If you're interested in learning more, here's a recording of the Performance and Auditing Tech Talk, along with the presentation slides.

 


Monitoring Crawl Statistics


The connectors fetch content from the content sources while providing a way to monitor the status of each crawl. The basic information provided in each status is the number of added, updated, and deleted documents in the crawl. If an error occurs for any document, those errors are counted and displayed as well.

To see the statistics of the crawl, click on Statistics.

Clicking on Statistics will open the statistics window.

 

 

  1. Statistics Source
    • Select the source of the statistics. In distributed mode you will have a summary of all statistics, plus the IP address of each server with the statistics from only that Aspire instance.
  2. Refresh
    • Refresh the statistics to display the updated information
  3. Cancel
    • Close the statistics window
  4. View Historical Crawl Statistics
    • View a list of all the stored statistics from previous crawls
  5. View Audit Logs
    • Go to the Audit section, where you can track all actions completed for documents; see Audit Logs.

 

Historical Statistics


In this section you will see a list of statistics for previous crawls, sorted from newest to oldest (top to bottom, left to right), each identified by a number indicating the crawl. Clicking on the selected crawl will load all the information from that statistic into the window, the same way as in the first section.

  1. Sorted Statistics List
    • Clicking an item will load the information of the respective crawl
  2. Return
    • Return to the previous section
  3. Cancel
    • Close the statistics window

When loading a historical statistic, the Refresh button from the first section will be replaced by a Return button.

View Audit Logs


The Audit Logs feature tracks all actions completed for documents by a content source. You can track all ADDED, UPDATED, DELETED, and NO CHANGED documents during the crawl. The goal of these logs is to help the administrator identify differences between the content crawled by Aspire and what is indexed in the search engine.

You can also track WORKFLOW_ERRORS, which correspond to errors that occurred during the Workflow execution, and BATCH_ERRORS, which are problems that occurred when sending a batch of documents to a search engine.

The Aspire publishers can be configured to dump their indexes to file in the form of Audit Logs, which can then be compared to the content sources' Audit Logs to determine differences and possible crawl problems. For more detailed information about dumps and index comparisons, see Audit Logs.
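As a rough sketch of the comparison idea, here is what a diff between a content source's log and a publisher's index dump could look like. The Audit Log file format is not shown on this page, so the snippet assumes a hypothetical one-record-per-line layout (`<docId> <ACTION>`); adapt the parsing to the real format.

```python
# Sketch: find documents that were crawled but never indexed (and vice versa).
# Assumes a hypothetical audit-log line format: "<docId> <ACTION>".

def read_doc_ids(lines):
    """Collect IDs of documents that should end up in the index."""
    ids = set()
    for line in lines:
        parts = line.split()
        # Only ADDED/UPDATED documents are expected to be present in the index.
        if len(parts) == 2 and parts[1] in ("ADDED", "UPDATED"):
            ids.add(parts[0])
    return ids

def audit_diff(crawl_log_lines, index_dump_lines):
    """Compare the content source's audit log against the publisher's dump."""
    crawled = read_doc_ids(crawl_log_lines)
    indexed = read_doc_ids(index_dump_lines)
    return {
        "missing_from_index": sorted(crawled - indexed),
        "unexpected_in_index": sorted(indexed - crawled),
    }
```

Anything reported under `missing_from_index` would point to documents lost between the crawl and the search engine, which is exactly the kind of discrepancy these logs are meant to surface.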

Aspire Performance Reports


The Aspire Performance Reports feature is aimed at helping developers and administrators identify hot spots or bottlenecks in the execution of processing, extraction, or publisher stages.

The Performance Reports include information about job start and end times, as well as execution paths with timing information for:

  • Pipeline Manager
  • Pipelines
  • Stages
  • Workflow Rules
  • Scanner methods

How does it work?


Example

Given the following application.xml file:

<application name="PerformanceStatisticsExample">
  <components>
    <component name="StandardPipeManager" subType="pipeline" factoryName="aspire-application">
      <components>
        <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
        <component name="ExtractText" subType="default" factoryName="aspire-extract-text" />
        <component name="ExtractDomain" subType="default" factoryName="aspire-extract-domain" />
        <component name="PrintToFile" subType="printToError" factoryName="aspire-tools">
          <outputFile>log/${app.name}/exampleDebug.out</outputFile>
        </component>
      </components>
      <pipelines>
        <pipeline name="doc-process" default="true">
          <stages>
            <stage component="FetchUrl" />
            <stage component="ExtractText" />
            <stage component="ExtractDomain" />
            <stage component="PrintToFile" />
          </stages>
        </pipeline>
      </pipelines>
    </component>
  </components>
</application>

When a job is processed by that application, the following information will be generated.

<performanceStatistics name="root" process="true">
  <stats>
    <startTime>2014-08-19T22:27:57Z</startTime>
    <endTime>2014-08-19T22:28:08Z</endTime>
    <processingTime>10927</processingTime>
  </stats>
  <pipelineManager name="/PerformanceStatisticsExample/StandardPipeManager">
    <stats>
      <startTime>2014-08-19T22:27:57Z</startTime>
      <endTime>2014-08-19T22:28:08Z</endTime>
      <processingTime>10926</processingTime>
    </stats>
    <pipeline name="doc-process">
      <stats>
        <startTime>2014-08-19T22:27:57Z</startTime>
        <endTime>2014-08-19T22:28:08Z</endTime>
        <processingTime>10926</processingTime>
      </stats>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/FetchUrl">
        <stats>
          <startTime>2014-08-19T22:27:57Z</startTime>
          <endTime>2014-08-19T22:28:02Z</endTime>
          <processingTime>5595</processingTime>
        </stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/ExtractText">
        <stats>
          <startTime>2014-08-19T22:28:02Z</startTime>
          <endTime>2014-08-19T22:28:08Z</endTime>
          <processingTime>5330</processingTime>
        </stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/ExtractDomain">
        <stats>
          <startTime>2014-08-19T22:28:08Z</startTime>
          <endTime>2014-08-19T22:28:08Z</endTime>
          <processingTime>0</processingTime>
        </stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/PrintToFile">
        <stats>
          <startTime>2014-08-19T22:28:08Z</startTime>
          <endTime>2014-08-19T22:28:08Z</endTime>
          <processingTime>0</processingTime>
        </stats>
      </stage>
    </pipeline>
  </pipelineManager>
</performanceStatistics>

The processing time of a parent node is the sum of its children's times; sometimes it also includes a small overhead in addition to that sum.

 

The processingTime is given in milliseconds. A value of 0 means the step took less than 1 millisecond to process (time units smaller than a millisecond are not recorded).
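Since the report is plain XML, it lends itself to ad-hoc analysis. As a sketch (assuming the report has been saved to a file or captured as a string), the slowest stages can be ranked with Python's standard library:

```python
import xml.etree.ElementTree as ET

def slowest_stages(report_xml):
    """Return (stage name, processingTime in ms) pairs, slowest first."""
    root = ET.fromstring(report_xml)
    stages = []
    for stage in root.iter("stage"):
        ms = int(stage.findtext("stats/processingTime"))
        stages.append((stage.get("name"), ms))
    return sorted(stages, key=lambda s: s[1], reverse=True)
```

Run against the example report above, this would rank FetchUrl (5595 ms) first and ExtractText (5330 ms) second, flagging them as the candidates to investigate.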

For further information on how to enable and download the logs and reports go to: Performance Reports.

 
