Enterprise Search with ColdFusion Solr

Dan Sirucek, cf.Objective 2012, May 2012

About Me

• Senior Learning Technologist at WellPoint, Inc • Developer for 14 years • Developing in ColdFusion for 8 years • Started in SQL Server, ASP, ASP.NET, VB.NET • Also work in Flash Builder/Flex, Java, and C#

2 Where We’ve Been: Growth and Consolidation

WellPoint, Inc. was formed in 2004 as the result of a merger between Anthem, Inc. and WellPoint Health Networks, creating the nation’s largest health benefits company by membership

3 Where We Are: National Scale

• Nation’s Largest Insurer • ~34 million medical members • 1 out of 9 Americans is covered by WellPoint’s affiliated health plans

• Total Revenue • Nearly $60 billion

• Provider Network Advantage • ~94% Hospitals • ~82% Primary Physicians • ~84% Specialists

• Blue Licensee • 14 states

Note: Provider Network refers to BlueCard® PPO Network

4 Agenda

• Problem and Goal • Why Apache Solr for ColdFusion 9.01 • Solr Multi-core Overview • Replication Overview • Installation • Replication Configuration • Managing Collections on Multiple Solr Instances • Extending ColdFusion Solr Schema • Creating a Custom Search • Q & A • Resources

5 Problem and Goal

• Problem • Slow search response • Constant corruption issues • Verity wasn’t scalable • No redundancy • Goal • Improve search response • Create an enterprise scalable solution • Implement redundancy for high availability • Maintain compatibility with ColdFusion's built-in search tags (e.g. cfsearch)

6 Why Apache Solr for ColdFusion 9.01

• Performance • Fast, very fast • Optimized for high volume web traffic • Scalable • Distributed searches • Replication • Redundancy • Replication supports • Master • Slave • Repeater

7 Solution Architecture

8 Technologies Used

• Windows Server 2008 64 bit • IIS 7.0 • Application Request Routing • ColdFusion 9.01 Multi-server • Apache Tomcat 6 • Master instance • Apache Solr Standalone Installation for ColdFusion 9.01 • Slave instances • Java SE JDK 1.6.0_26 64-bit

9 Solr Multi-core Overview

• Solr core = ColdFusion collection • Multiple Cores • Single Solr instance • Each Solr core has its own configuration and index • Unified administration • Multi-core template • A template is used for creating a new core (collection) • The template contains a directory structure and the configuration files needed to create a new core • Location SolrInstallationDirectory\multicore\template

10 Solr Multi-core Template

• conf directory • Contains configuration files used when creating a new Solr core • Two key files: . schema.xml – Contains the details about which fields your index can contain – How those fields should be dealt with when adding documents to the index – How those fields should be dealt with when querying those fields . solrconfig.xml – Contains the configuration settings for the Solr core – Used to configure replication

11 Solr Multi-core Template Continued

• conf directory continued • Files referenced by schema.xml: . protwords.txt – Words that need protection from stemming – e.g. “maine” would otherwise be stemmed to “main” . stopwords.txt – Words to not index e.g. a, an, and . synonyms.txt – Synonym groups e.g. GB,gib,gigabyte,gigabytes – Mappings used for spelling corrections e.g. hippa => hipaa

12 Solr Multi-core Template Continued

• conf directory continued • Optional file: . solrcore.properties – User defined properties to be referenced within solrconfig.xml – Syntax – Property=Value – File is referenced by default when present in conf directory – Example:

• data directory • Empty directory • Solr will create the following directories the 1st time content is indexed . index . spellindex
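As a rough illustration of how values in solrcore.properties end up inside solrconfig.xml, the Python sketch below parses Property=Value lines and expands ${name} placeholders. It only mimics the substitution for demonstration purposes; Solr performs it internally when a core loads. The file content and the property names used here are hypothetical.

```python
import re

def parse_properties(text):
    """Parse simple Property=Value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        props[name.strip()] = value.strip()
    return props

def expand(template, props):
    """Replace ${name} placeholders; unknown names are left untouched."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: props.get(m.group(1), m.group(0)),
                  template)

# Hypothetical properties file and solrconfig.xml fragment.
props = parse_properties("# sample\nMASTER_HOST=search-master\nPOLL_TIME=00:05:00\n")
fragment = '<str name="pollInterval">${POLL_TIME}</str>'
print(expand(fragment, props))
```

Placeholders that are not defined in the file, such as the ${solr.core.name} system variable, pass through unchanged in this sketch.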

13 Solr Replication Overview

• Replication Features • Efficient and automated distribution of index additions, updates, and deletions • Pull strategy allows for easy addition of slaves • Configurable distribution interval allows tradeoff between timeliness and cache utilization - interval is set by the slave instance • Replication and automatic reloading of configuration files • Works over HTTP • Works across platforms with same configuration • Replication Modes • Master – optimized for indexing • Slave – optimized for searches • Repeater – used in WAN to reduce bandwidth between data centers

14 Solr Replication Considerations & Challenges

• Considerations • Replication is not a server level configuration • Replication is configured at the Solr core (search collection) level • New cores need to be created on all Solr instances • Challenges • Modify the multi-core template to implement replication when new cores are created • Automate the creation of a Solr core on all Solr instances • Create a consolidated view of cores on all instances

15 Solr Replication Requirements

• Basic Requirement • One master Solr instance • One or more slave Solr instances • Configuration of replication request handlers on master and slave instances • Replication Request Handler • Configuration is handled in solrconfig.xml • Replication is defined by adding a request handler using XML syntax • Settings are used to set the properties for the request handler • Master and slave instances are both configured using a request handler, but use different settings to define their roles

16 Master Replication Request Handler

• Replication request handler with all possible attributes • Screen shot

17 Required Master Settings

• replicateAfter • Configures when replication will be triggered • Valid values: startup, commit, optimize • If using startup option, it is necessary to have a commit/optimize entry also, if you want to trigger replication on future commits/optimizes. • Example:

18 Recommended Master Settings

• confFiles • Used to specify configuration files to be replicated • Comma delimited list of files to replicate • Can be configured to rename files on replication . Syntax – source_file_name.xml:destination_file_name.xml • Example:

19 Optional Master Settings

• backupAfter • Configures when a backup will be created • Valid values: optimize, startup, commit • maxNumberOfBackups • Maximum number of backups to retain • commitReserveDuration • Default 10 seconds • If commits are very frequent and network is slow, you can tweak this value
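The master settings above can be sketched as a solrconfig.xml fragment. The Python snippet below generates one possible master request handler with the standard library; the chosen replicateAfter events and the file list are assumptions for illustration, not the configuration shown in the talk.

```python
import xml.etree.ElementTree as ET

def master_handler(replicate_after, conf_files=(), backup_after=None):
    """Build a master replication request handler element."""
    handler = ET.Element("requestHandler",
                         {"name": "/replication", "class": "solr.ReplicationHandler"})
    master = ET.SubElement(handler, "lst", {"name": "master"})
    # One <str name="replicateAfter"> entry per triggering event.
    for event in replicate_after:
        ET.SubElement(master, "str", {"name": "replicateAfter"}).text = event
    if conf_files:
        # Comma delimited list; "source:destination" renames a file on the slave.
        ET.SubElement(master, "str", {"name": "confFiles"}).text = ",".join(conf_files)
    if backup_after:
        ET.SubElement(master, "str", {"name": "backupAfter"}).text = backup_after
    return handler

# Assumed example values: replicate on startup and on every commit.
handler = master_handler(["startup", "commit"],
                         ["solrconfig_slave.xml:solrconfig.xml", "schema.xml", "stopwords.txt"])
ET.indent(handler)  # requires Python 3.9+
print(ET.tostring(handler, encoding="unicode"))
```

Listing both startup and commit follows the note that a startup entry alone does not trigger replication on later commits.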

20 Slave Replication Request Handler

• Slave replication request handler with all possible settings

21 Required Slave Settings

• Configuration file • solrconfig.xml • masterUrl • Sets the url of the Solr master instance • ${solr.core.name} – system variable • pollInterval • Sets the polling interval of the slave to poll the master for changes • Considerations . Frequency of updates to index . Network Bandwidth . Acceptable latency

22 Optional Slave Settings

• httpConnTimeout • Sets connection timeout on the underlying HttpConnectionManager • Default value 5000ms • httpReadTimeout • Sets timeout when fetching index from master • Default value 10000ms • httpBasicAuthUser • Use if basic authentication is enabled on master • httpBasicAuthPassword • Use if basic authentication is enabled on master • Compression • Use only if your bandwidth is low

23 Slave Replication Configuration Examples

• Basic configuration example

• Using solrcore.properties configuration example
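Since the original slide examples are not reproduced here, the following Python sketch generates two plausible slave request handlers: one with literal values and one that defers to solrcore.properties. The host name, port, and timing values are placeholders, and the property names MASTER_CORE_URL and POLL_TIME are the ones introduced later in the deck.

```python
import xml.etree.ElementTree as ET

def slave_handler(master_url, poll_interval, **optional):
    """Build a slave replication request handler element."""
    handler = ET.Element("requestHandler",
                         {"name": "/replication", "class": "solr.ReplicationHandler"})
    slave = ET.SubElement(handler, "lst", {"name": "slave"})
    settings = {"masterUrl": master_url, "pollInterval": poll_interval, **optional}
    for name, value in settings.items():
        ET.SubElement(slave, "str", {"name": name}).text = str(value)
    return handler

def render(element):
    ET.indent(element)  # requires Python 3.9+
    return ET.tostring(element, encoding="unicode")

# Basic configuration: literal values (host name and port are placeholders).
basic = slave_handler("http://search-master:8080/solr/${solr.core.name}/replication",
                      "00:05:00", httpConnTimeout=5000, httpReadTimeout=10000)

# solrcore.properties configuration: values are resolved from the properties file.
from_properties = slave_handler("${MASTER_CORE_URL}/${solr.core.name}/replication",
                                "${POLL_TIME}")

print(render(basic))
print(render(from_properties))
```

The second form keeps the handler identical on every slave and moves the environment-specific values into one small file.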

24 Slave Solr Installation

• Slave Servers • Windows Server 2008 (64-bit, 8 GB RAM) • Install Java SE JDK 1.6.0_26 64-bit . Note location of installation directory – Example: D:\Apps\Java\jdk1.6.0_26 • Execute Apache Solr Standalone Installation for ColdFusion 9.01 installer . Change Java Home from default to: javaInstallationDirectory\jdk1.6.0_26\jre – Example: D:\Apps\Java\jdk1.6.0_26\jre

25 Master Solr Installation

• Master Solr Server • Windows Server 2008 (64-bit, 8 GB RAM) • Download Java JDK 1.6.0_26 64-bit • Download Apache Tomcat 6 32-bit/64-bit Windows Service Installer • Execute Java JDK Installer . Note installation directory . Example: E:\Apps\java • Execute the Tomcat 6 installer . Java JRE – specify the jre in the jdk 1.6.0_26 installation – Example: E:\Apps\Java\jdk1.6.0_26\jre . Select installation directory – Example: E:\Apps\tomcat6

26 Master Solr Installation Continued

• Master Solr Installation continued • Create a solr directory – example E:\Apps\solr • Copy the following from slave installation . solr.war to solr directory – installationDirectory\webapps\solr.war . Multi-core directory to solr directory – installationDirectory\multicore • Configure Tomcat service • Launch Configure Tomcat • Java tab • Set initial memory pool • Set maximum memory pool

27 Configure Tomcat for Solr

• Stop Apache Tomcat 6 service • Create solr context • A Context is what Tomcat calls a web application • Location: tomcatInstallDir\conf\Catalina\localhost\ • Create a solr.xml file • Edit solr.xml and define Solr context • Example:
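A context file of this kind typically points Tomcat at the solr.war file and sets the solr/home environment entry. The Python sketch below generates such a solr.xml; the two paths are assumptions derived from the E:\Apps\solr example directory used earlier and should be adjusted to the actual installation.

```python
import xml.etree.ElementTree as ET

def solr_context(war_path, solr_home):
    """Build a Tomcat <Context> element for a Solr web application."""
    context = ET.Element("Context",
                         {"docBase": war_path, "debug": "0", "crossContext": "true"})
    # solr/home tells Solr where its multicore directory (with solr.xml) lives.
    ET.SubElement(context, "Environment",
                  {"name": "solr/home", "type": "java.lang.String",
                   "value": solr_home, "override": "true"})
    return context

# Assumed locations, based on the example solr directory E:\Apps\solr.
context = solr_context(r"E:\Apps\solr\solr.war", r"E:\Apps\solr\multicore")
ET.indent(context)  # requires Python 3.9+
print(ET.tostring(context, encoding="unicode"))
```

Because the file is named solr.xml, Tomcat serves the application under the /solr context path.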

• Start Apache Tomcat 6 service • Launch Tomcat 6 - http://127.0.0.1:8080/manager/html • Navigate to solr application

28 Tomcat 6 Web Application Manager

29 Slave Configuration

• Apache Solr for ColdFusion 9.01 runs in a Jetty servlet container • Jetty Configuration • Configuration file location . SolrInstallationDirectory\etc\jetty.xml • Connector system properties . jetty.port – default = 8983 . jetty.host – default = not defined • Default configuration listens only on 127.0.0.1 • Add jetty.host system property to the connector setting . 0.0.0.0 = listen on all IPs . Example:

30 Slave Jetty Configuration Continued

• Default connector configuration

• After update

31 Slave Service Configuration

• Service start up configuration • Default Java maximum memory setting is 256 MB . Set in InstallationDirectory\solr.lax

• Adjust maximum memory setting -Xmx • Add a minimum memory setting -Xms • Example:

32 Master Solr Multi-core Template Configuration

• Create solrcore.properties • Create a text file named solrcore.properties in the Solr multicore template directory • Add two properties . MASTER_CORE_URL=http://masterHostnameUrl:masterPort/solr . POLL_TIME=hh:mm:ss • Example:

• Create solrconfig_slave.xml • Make a copy of solrconfig.xml in the master Solr multicore template directory • Name the file solrconfig_slave.xml

33 Master Solr Multi-core Template Configuration Continued

• Configure solrconfig.xml for replication • Add master and slave replication request handlers • solrconfig.xml

• solrconfig_slave.xml

34 Slave Solr Multi-core Template Configuration

• solrcore.properties • Copy solrcore.properties in template/conf directory on master to template/conf directory on slave • solrconfig.xml • Delete solrconfig.xml file in template/conf on slave • Copy solrconfig_slave.xml in template/conf directory on master to template/conf directory on slave • Rename solrconfig_slave.xml to solrconfig.xml on slave
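The file shuffle above can be scripted. This is a minimal Python sketch, assuming both template conf directories are reachable from the machine running it (for example over a file share); the function name is hypothetical.

```python
import shutil
from pathlib import Path

def prepare_slave_template(master_conf, slave_conf):
    """Copy replication files from the master template conf to the slave's."""
    master_conf, slave_conf = Path(master_conf), Path(slave_conf)
    # Same properties file on master and slave.
    shutil.copy2(master_conf / "solrcore.properties",
                 slave_conf / "solrcore.properties")
    # Overwriting solrconfig.xml covers the delete, copy, and rename steps at once.
    shutil.copy2(master_conf / "solrconfig_slave.xml",
                 slave_conf / "solrconfig.xml")
```

A call would look like prepare_slave_template(master_template_conf, slave_template_conf), with both arguments pointing at the respective template\conf directories.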

35 Creating New Collections

• Collections (cores) need to be created on all Solr instances • Use Solr API to create new cores • REST-like API • Create new core parameters . action – CREATE . name – name of new core . instanceDir – directory path for new instance . template – directory path for the core template . wt – writer type – Format of response – Options: json, javabin, xml – Default = xml . version = 1
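To make the parameter list concrete, here is a small Python sketch that assembles a core creation URL from the parameters listed above. It assumes the core admin handler is exposed at /admin/cores under the Solr base URL; the host, core name, and directory paths are placeholders.

```python
from urllib.parse import urlencode

def create_core_url(base_url, name, instance_dir, template, wt="json"):
    """Build the core admin URL that creates a new core from a template."""
    params = {
        "action": "CREATE",
        "name": name,
        "instanceDir": instance_dir,
        "template": template,
        "wt": wt,          # response format: json, javabin, or xml
        "version": 1,
    }
    return f"{base_url}/admin/cores?{urlencode(params)}"

# Placeholder host and paths.
print(create_core_url("http://search-master:8080/solr", "news",
                      "multicore/news", "multicore/template"))
```

urlencode takes care of escaping path separators and spaces, which matters for Windows directory paths.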

36 Creating New Collections Code

• In CF create an array of server instances • Define collection name

37 Creating New Collections Code Continued

• Loop over server instance array • Create collection on each instance

38 Collection Create Result Struct

• De-serialized file content (cfdump from previous slide) • core – collection name • responseHeader . QTime – query time milliseconds . status • saved . File path to multicore\solr.xml . multicore\solr.xml file is used to store core names and instance directory
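The talk's ColdFusion code is not reproduced here, so the loop is sketched in Python instead: iterate over the server instances, issue the CREATE request on each, and pick the core, responseHeader, and saved values out of the JSON response. Server URLs, directory paths, and the canned response are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def http_get(url):
    """Fetch a URL and return the body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")

def create_collection(servers, name, template="multicore/template", fetch=http_get):
    """Create the same core on every Solr instance; return a result per server."""
    results = {}
    for base_url in servers:
        query = urlencode({"action": "CREATE", "name": name,
                           "instanceDir": f"multicore/{name}",
                           "template": template, "wt": "json", "version": 1})
        reply = json.loads(fetch(f"{base_url}/admin/cores?{query}"))
        header = reply.get("responseHeader", {})
        results[base_url] = {"core": reply.get("core"),
                             "status": header.get("status"),
                             "qtime_ms": header.get("QTime"),
                             "saved": reply.get("saved")}
    return results

# Demonstration with a canned response instead of a live Solr instance.
def fake_fetch(url):
    return ('{"responseHeader": {"status": 0, "QTime": 31}, '
            '"core": "news", "saved": "multicore/solr.xml"}')

servers = ["http://search-master:8080/solr", "http://search-slave1:8983/solr"]
for server, result in create_collection(servers, "news", fetch=fake_fetch).items():
    print(server, result["status"], result["saved"])
```

Passing the fetch function as a parameter keeps the loop testable without a running server; omitting it makes real HTTP requests.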

39 Solr Admin Master Replication

• Core admin • Navigate to Replication

• Replication admin • Index version • Location • Size

40 Solr Admin Slave Replication

• Core admin • Navigate to Replication

• Replication admin • Master • Poll Interval • Local Index . Version & location . Replication status • Controls . Disable Poll . Replicate Now

41 Deleting Collections

• Collections (cores) should be deleted from all Solr instances • Use Solr API to delete cores • Delete core parameters . action – UNLOAD . core – name of core to delete . wt – writer type – Format of response – json, javabin, xml – Default = xml . version = 1

42 Delete Collections Code

• Loop over server instance array • Delete collection on each instance
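The delete loop mirrors the create loop. This Python sketch issues the UNLOAD action against every instance and reports the servers that did not return status 0; the server URLs and the canned responses are placeholders, and the core admin path /admin/cores is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def http_get(url):
    """Fetch a URL and return the body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")

def delete_collection(servers, name, fetch=http_get):
    """Unload the core on every instance; return servers reporting a non-zero status."""
    failed = []
    query = urlencode({"action": "UNLOAD", "core": name, "wt": "json", "version": 1})
    for base_url in servers:
        reply = json.loads(fetch(f"{base_url}/admin/cores?{query}"))
        if reply.get("responseHeader", {}).get("status") != 0:
            failed.append(base_url)
    return failed

# Demonstration with canned responses instead of live Solr instances.
def fake_fetch(url):
    status = 0 if "master" in url else 1
    return json.dumps({"responseHeader": {"status": status, "QTime": 12}})

servers = ["http://search-master:8080/solr", "http://search-slave1:8983/solr"]
print(delete_collection(servers, "news", fetch=fake_fetch))
```

Collecting failures rather than stopping at the first one makes it easier to see which instances still hold the core.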

43 Extend ColdFusion Solr Schema (cfcore)

• Reasons to extend/change default functionality • Change default operator . The default is OR • Enable delete by key capability • Enable case sensitivity on search • Possible changes to schema.xml • Default operator between words is OR . Changing default operator to AND will reduce number of results

44 Extend ColdFusion Solr Schema – Enable Delete by Key

• Enable delete by key • Default unique key is a system generated identifier • Possible use case . Use API to delete indexed content by the key value • Changes . Create a copy of schema.xml and name it schema_slave.xml . Update the replication confFiles setting to use schema_slave.xml:schema.xml, so slaves receive the unmodified copy as schema.xml . Changes to schema.xml – Change indexed attribute on key field to true – Change unique key from uid to key . Changing unique key on slave instances will break the cfsearch tag
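Once key is the unique key on the master, a document can be removed with a standard Solr XML delete message. The Python sketch below only builds the request object and does not send it; the core URL and key value are placeholders, and committing through the commit=true URL parameter is one option among several.

```python
from urllib.request import Request
from xml.sax.saxutils import escape

def delete_by_key_request(core_url, key):
    """Build a POST request that deletes one document by its unique key."""
    # <delete><id> matches against the schema's unique key field.
    body = f"<delete><id>{escape(key)}</id></delete>"
    return Request(f"{core_url}/update?commit=true",
                   data=body.encode("utf-8"),
                   headers={"Content-Type": "text/xml; charset=utf-8"},
                   method="POST")

# Placeholder master core URL and document key.
request = delete_by_key_request("http://search-master:8080/solr/news",
                                "/docs/benefits & plans.pdf")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Sending it would be a matter of passing the request to urllib.request.urlopen against the master instance; the slaves pick up the deletion on their next poll.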

45 Extend ColdFusion Solr Schema – Enable case sensitivity on search

• Enable case sensitivity on search • Default configuration uses a filter to change text to lower case • Possible use case . Search by title and retain case sensitivity • Schema Change . Comment out solr.LowerCaseFilterFactory

46 Creating a Custom Search

• Use case • Return category facet counts • Date range search • Solr Search API • Basic query parameters . q – search query . fq – filter query . qt – query type – name of the request handler in solrconfig.xml . start – start row . rows – number of rows to return in response . fl – comma delimited list of fields to include in response . wt – response writer type

47 Creating a Custom Search Continued

• Solr Search API continued • Highlight parameters . hl – enable highlighted snippets to be generated . hl.fragsize – the size in characters, of the snippets created by highlighter . hl.snippets – maximum number of snippets to generate per field . hl.simple.pre – text which appears before highlighted term . hl.simple.post – text which appears after highlighted term • Facet parameters . facet – enable facet counts in query response . facet.field – specify a field which should be treated as a facet . facet.mincount - minimum count to include facet in response
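Putting the basic, highlight, and facet parameters together, a search request might be assembled as in this Python sketch. The field names title, summary, category, and modified, the highlight markup, and the core URL are assumptions for illustration; the date range is expressed as a filter query.

```python
from urllib.parse import urlencode

def search_url(core_url, query, start=0, rows=10, date_from=None, date_to=None):
    """Build a select URL with highlighting, category facets, and an optional date range."""
    params = [
        ("q", query), ("start", start), ("rows", rows),
        ("fl", "title,summary,score"), ("wt", "json"),
        # Highlighting
        ("hl", "true"), ("hl.fragsize", 150), ("hl.snippets", 2),
        ("hl.simple.pre", "<b>"), ("hl.simple.post", "</b>"),
        # Facet counts per category, hiding empty categories
        ("facet", "true"), ("facet.field", "category"), ("facet.mincount", 1),
    ]
    if date_from and date_to:
        # Range syntax: field:[from TO to]
        params.append(("fq", f"modified:[{date_from} TO {date_to}]"))
    return f"{core_url}/select?{urlencode(params)}"

# Placeholder slave core URL; searches go to a slave instance.
print(search_url("http://search-slave1:8983/solr/news", "open enrollment",
                 date_from="2012-01-01T00:00:00Z", date_to="NOW"))
```

Using a list of pairs rather than a dict leaves room for repeated parameters, such as several facet.field or fq entries.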

48 Creating a Custom Search Continued

• JSON specific parameter • json.nl . Controls the output format of NamedList used for field faceting data . flat (default) – flat array – Example: [name1,val1, name2,val2] . map – JSON object – A JSON object is a hash, so it cannot represent repeated keys and may not preserve order . arrarr – an array of two element arrays – Example: [[name1,val1], [name2, val2], [name3,val3]]
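The difference between the three json.nl formats is easiest to see on the client side. This Python sketch converts the default flat array into the arrarr and map shapes; the facet values are made up for illustration.

```python
def flat_to_pairs(flat):
    """Convert [name1, val1, name2, val2] into [[name1, val1], [name2, val2]]."""
    return [[flat[i], flat[i + 1]] for i in range(0, len(flat) - 1, 2)]

def flat_to_map(flat):
    """Convert a flat array into a dict; repeated names keep only the last value."""
    return {name: value for name, value in flat_to_pairs(flat)}

# Made-up facet counts in the default json.nl=flat layout.
category_counts = ["benefits", 12, "claims", 7, "providers", 3]
print(flat_to_pairs(category_counts))
print(flat_to_map(category_counts))
```

Requesting json.nl=arrarr or json.nl=map from Solr directly avoids this client-side step.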

49 Creating a Custom Search Code

• Code Review

50 Custom Search User Interface Example

51 Q & A

Dan Sirucek • Sr. Learning Technologist, Learning Technologies and Content • 21555 Oxnard Dr, MS: CAAC08-088I, Woodland Hills, CA 91316 • Tel (818) 234-8017 • Mobile (323) 251-1236 • www.wellpoint.com

52 Resources

• Apache Tomcat 6 - http://tomcat.apache.org/download-60.cgi • Apache Solr Standalone Installer for ColdFusion 9.0.1 - http://www.adobe.com/support/coldfusion/downloads.html • Java JDK 1.6_26 download- http://www.oracle.com/technetwork/java/javase/downloads/jdk-6u26-download-400750.html • Apache Solr - http://lucene.apache.org/solr/ • Solr Wiki - http://wiki.apache.org/solr/FrontPage • Solr Replication - http://wiki.apache.org/solr/SolrReplication • Solr JSON Response Writer - http://wiki.apache.org/solr/SolJSON#JSON_Query_Response_Format • Solr Facet Parameters - http://wiki.apache.org/solr/SimpleFacetParameters • Solr Highlighting Parameters - http://wiki.apache.org/solr/HighlightingParameters
