The Hash Table Lookup stage loads an in-memory hash table used to look up data very quickly and add it to the document being processed. The hash table can be loaded automatically on start-up from a tabular file or from a relational database SELECT statement.

When used as a pipeline stage, the component takes the key from an existing XML element, looks up the corresponding entry in the hash table, and then maps the elements of the hash table value array to fields in the document being processed.

When used as an independent component, it supports the com.searchtechnologies.aspire.hashtable.AspireHashTable&lt;K, V&gt; interface (see AspireHashTable for more information).

The stage can also be used as a generic hash table resource, for example by other components or Groovy scripts which need to quickly look up values in the hash table.

Note that the component can act as a pipeline stage and as a hash table service at the same time. That is, once you have configured the component as a pipeline stage (inside a pipeline manager), you can also access the hash table directly.

Hash Table Lookup

Factory Name: com.searchtechnologies.aspire:aspire-hash-table
subType: default
Inputs: AspireObject (when used as a pipeline stage)
Outputs: Fields as specified by the metadata mapper (when used as a pipeline stage)

Configuration

Element (Type, Default): Description

initialSize (int, default: 10000): The estimated initial size of the hash table, used to set its initial capacity. It is best to set this value large enough to hold all of the expected entries; this prevents additional hash table allocations and rehashing.

initializeFromFiles (boolean, default: false): Set this flag to true if you are initializing the hash table from a tabular file (i.e. a comma-separated or tab-separated file).

file (XML node, default: none): (requires initializeFromFiles = true) Contains the file location and separator value. Multiple <file> nodes can be configured to load multiple files.

file/fileName (string, default: none): (requires initializeFromFiles = true) The name of the tabular file to load. A relative path is assumed to be relative to Aspire Home.

file/separator (string, default: tab): (requires initializeFromFiles = true) Either "comma", "tab", or a single character specifying the column separator used in the file. For a CSV file, use "comma". The tabular files follow the Microsoft Excel conventions for quoting data: entries with embedded commas or tabs should be surrounded by double quotes, and entries which contain double quotes should escape each double-quote character with a pair of double quotes. If you want some other separator (for example, the pipe character / vertical bar, |, is popular), specify that single character in the <separator> tag. A sample file illustrating these rules appears under Example Configurations below.

folder (XML node, default: none): (requires initializeFromFiles = true) Contains the folder location and separator value. Multiple <folder> nodes can be configured.

folder/folderName (string, default: none): (requires initializeFromFiles = true) The name of the folder where the tabular files are located. A relative path is assumed to be relative to Aspire Home.

folder/separator (string, default: tab): (requires initializeFromFiles = true) Either "comma", "tab", or a single character specifying the column separator used in the files. For CSV files, use "comma". The same quoting and separator rules described for file/separator apply here.

hasColumnLabels (boolean, default: false): (requires initializeFromFiles = true) Set this flag to true if the first row of the tabular file contains column labels.

keyColumn (string, default: column1): (requires initializeFromFiles = true) The name of the tabular file column which will be used as the hash table key. If <hasColumnLabels> is false, the columns are labeled by position starting with 1, as in "column1", "column2", "column3", etc. <keyColumn> is also used when loading the hash table from an RDB; see below.

valueMap (nested list of <column label=""/> tags, default: include all columns in the order in which they occur): (requires initializeFromFiles = true) The value map parent tag allows you to choose exactly which columns are stored in the hash table (controlling memory usage) and the order of the columns in the value array. Inside <valueMap>, list the desired columns with nested <column label=""/> tags. Only columns specified in the value map are stored in the hash table, and the values are stored in the same order as the <column> tags appear inside the value map. Column labels are either the labels specified in the file (if <hasColumnLabels> is true) or "column1", "column2", "column3", etc. otherwise.

initializeFromSQL (boolean, default: false): Set this flag to true if you are initializing the hash table from a SQL SELECT statement.

connectionPoolName (string, default: none): (requires initializeFromSQL = true) The Aspire component name of the RDBMS Connection component which maintains the pool of RDB connections for the database to be queried.

sqlQuery (string, default: none): (requires initializeFromSQL = true) The SQL query used to load the hash table from the RDBMS. The order of the columns in the query result is preserved in the list of values stored in the hash table.

keyColumn (string, default: none): (requires initializeFromSQL = true) The name of the SQL column from the sqlQuery result which will be used as the hash table key.

targetElement (string, default: none): (when used as a pipeline stage) The XML element in the document being processed whose value is used as the key to look up the entry in the hash table.

metadataMap (Metadata Mapper, default: none): (when used as a pipeline stage) Specifies the mapping of fields or columns from the hash table value array to fields in the document being processed.

Example Configurations

Initialized from a Tabular File - where the first row has column labels

 <component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
   <initializeFromFiles>true</initializeFromFiles>
   <file>
     <fileName>data/NormalizedAssigneeFile.csv</fileName>
     <separator>comma</separator>
   </file>
   <hasColumnLabels>true</hasColumnLabels>
   <keyColumn>hashName</keyColumn>
   <valueMap>
     <column label="uniqueAsgnId"/>
     <column label="name"/>
     <column label="normAsgnId"/>
     <column label="count"/>
   </valueMap>
 </component>
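
For reference, a hypothetical data/NormalizedAssigneeFile.csv matching the configuration above might begin as follows (the rows and values are purely illustrative). Note how the entry containing an embedded comma is surrounded by double quotes, per the quoting rules described under Configuration:

 hashName,uniqueAsgnId,name,normAsgnId,count
 acme-corp,1001,"Acme, Inc.",501,42
 globex,1002,Globex Corporation,502,17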

Initialized from a Tabular File - with no column labels

 <component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
   <initializeFromFiles>true</initializeFromFiles>
   <file>
     <fileName>data/NormalizedAssigneeFile.csv</fileName>
     <separator>comma</separator>
   </file>
   <keyColumn>column1</keyColumn>
   <valueMap>
     <column label="column3"/>
     <column label="column1"/>
     <column label="column8"/>
     <column label="column2"/>
   </valueMap>
 </component>
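
Because this file has no header row, columns are referenced by position: "column1" is the first field on each line, "column2" the second, and so on. A hypothetical line from such a file (values illustrative only) might be:

 acme-corp,1001,Acme Incorporated,501,42,PATENT,0,2019-06-01

Here <keyColumn>column1</keyColumn> selects "acme-corp" as the key, and the value array holds column3, column1, column8 and column2, in that order, per the value map above.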

Initialized from a SQL Select Statement

 <component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
   <initialSize>10000000</initialSize>
      
   <initializeFromSQL>true</initializeFromSQL>
   <connectionPoolName>/CPAAssigneeNorm/openRDBConnection</connectionPoolName>
   <sqlQuery><![CDATA[select Name, NormAsgnID
                from AssigneeNormalization.dbo.NormalizedAssignee]]></sqlQuery>
   <keyColumn>Name</keyColumn>
 </component>
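
With this configuration, the hash table keys are the Name values returned by the query, and each value array holds the query's columns in order (here, presumably Name followed by NormAsgnID). Below is a minimal sketch of a direct lookup from Groovy; the component path and the assignee name are hypothetical, and the variable-binding mechanism is the one shown in the Groovy examples further down this page:

 <component name="lookupExample" subType="default" factoryName="aspire-groovy">
   <variable name="normalizedAssigneeHashTable" component="/CPAAssigneeNorm/NormalizedAssigneeHashTable" />
   <script>
   <![CDATA[
     // Look up a (hypothetical) assignee name loaded by the SQL query above
     if(normalizedAssigneeHashTable.contains("Acme, Inc.")) {
       def values = normalizedAssigneeHashTable.get("Acme, Inc.");
       // Assuming the value array preserves the query column order: values[0] = Name, values[1] = NormAsgnID
       println "*** NormAsgnID for Acme, Inc.: " + values[1];
     }
   ]]>
   </script>
 </component>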

Used as a pipeline stage

 <component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
   <targetElement>documentId</targetElement>
      
   <metadataMap>
     <map from="name" to="title"/>
     <map from="subCategory" to="subCategory"/>
     <map from="geographicArea" to="geographicArea"/>
     <map from="searchKeywords1" to="searchKeywords1"/>
   </metadataMap>
       
   .
   .
 </component>
 

Note that in the above example, the metadata mapper @from attribute could refer to names such as "column1", "column2", etc., if the data comes from a tabular file with no column labels specified in the file.
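
As an illustration (the document structure and all values below are hypothetical; only the element names come from the configuration above): if an incoming document contains a <documentId> element whose value matches a hash table key, and that entry's value columns are labeled name, subCategory, geographicArea and searchKeywords1, the stage would add the mapped fields to the document, producing something like:

 <doc>
   <documentId>12345</documentId>
   <title>Acme Widget</title>
   <subCategory>widgets</subCategory>
   <geographicArea>EMEA</geographicArea>
   <searchKeywords1>acme widget gadget</searchKeywords1>
 </doc>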

Example use from within a Groovy scripting component

Note how, in the examples below, the hash table is referenced via a "component variable" declared in the component configuration and made available to the Groovy script. See Groovy Scripting for more details.

Reading from the hash table

 <component name="printPatentPubCount" subType="default" factoryName="aspire-groovy">
   <variable name="normalizedAssigneeHashTable" component="/CPAAssigneeNorm/NormalizedAssigneeHashTable" />
   <variable name="uniqueAssigneeHashTable" component="/CPAAssigneeNorm/UniqueAssigneeHashTable" />        
   <script>
   <![CDATA[
     println "*** Normalized Assignee hash table size: " + normalizedAssigneeHashTable.size();
     println "*** Unique Assignee hash table size: " + uniqueAssigneeHashTable.size();          
   ]]>
   </script>
 </component>

Reading and writing the hash table

 <component name="update" subType="default" factoryName="aspire-groovy">
   <variable name="uniqueAssigneeHashTable" component="/CPAAssigneeNorm/UniqueAssigneeHashTable" />
   <script>
   <![CDATA[
     use(groovy.xml.dom.DOMCategory) {
       .
       .
       dom.'variants'[0].each() {
         if(uniqueAssigneeHashTable.contains(it.getAttribute("hash")))
           normAsgnId = uniqueAssigneeHashTable.get(it.getAttribute("hash"))[2];
       }
         
       dom.'variants'[0].each() {
         .
         .
           
         // update uniqueAssigneeHashTable with the above UniqueAsgnID
         String[] values = [localUniqueID, assigneeName, normAsgnId, '1', 
                            "PATENT", isDocDB, patent, '0', '0', sdf.format(date) ,lang];
           
         def returnValue = uniqueAssigneeHashTable.put(hashName, values);
           
         // Check to see if the key was already in the hash table...
         if(returnValue != null) {
           .
           .
           .
         }
       }
     }
   ]]>
   </script>					
 </component>