Tuesday, January 29, 2013

Solr: An Open Source Search Platform

Searching is a basic requirement for almost any application in today's software world. With the emergence of Yahoo and Google, search technology has been revolutionized, with many kinds of information including books, videos, maps, and personal profiles becoming instantly searchable online. Although search technology has progressed as far as speech recognition and AI-assisted search, it has mostly remained proprietary to a few giant corporations, and there were very few ways to make your own website searchable other than embedding a Google or Yahoo custom search bar. With the advent of the Apache Lucene project, full-text indexing and fast, powerful search became possible in the open source community. Furthermore, Lucene's core approach of treating a document as a collection of text fields gives it the ability to search text in various file formats such as XML, HTML, plain text, PDF, MS Word, Open Office documents, etc.

Setting up Solr
First, Apache Tomcat must be downloaded and installed in order to set up Solr. We assume that a Java Runtime is already installed and configured on the system. In order to access the manager page of Apache Tomcat, edit $CATALINA_HOME/conf/tomcat-users.xml and add a user with an admin role associated with it.
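For example, a minimal tomcat-users.xml entry might look like the following (the username and password are placeholders, and the exact role names vary across Tomcat versions, e.g. manager-gui in Tomcat 7):

<tomcat-users>
  <role rolename="admin"/>
  <role rolename="manager"/>
  <user username="admin" password="changeme" roles="admin,manager"/>
</tomcat-users>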
   Now download the latest Apache Solr release. Extract the archive and copy the .war file to the webapps directory of the Tomcat installation. Then copy the Solr configuration from the example directory of the Solr distribution into $CATALINA_HOME:
cp -R /path/to/apache-solr-x.x.x/example/solr $CATALINA_HOME/solr

In order for Tomcat to know about the Solr webapp, we add a file named solr.xml in the Tomcat per-host configuration directory, i.e. $CATALINA_HOME/conf/Catalina/localhost. Now open the newly created solr.xml in any text editor and add the following configuration:
<Context docBase="/full/path/to/webapps/solr.war" debug="0" crossContext="true" allowLinking="true" privileged="true">
  <Environment name="solr/home" type="java.lang.String" value="/full/path/to/CATALINA_HOME/solr" override="true"/>
</Context>

After restarting the Tomcat server, the Solr admin interface is available at http://localhost:8080/solr/admin/.


Configuring Solr
Solr is configured using two main configuration files, solrconfig.xml and schema.xml. Both reside in the conf directory under the Solr home directory. solrconfig.xml configures the Solr server itself, while schema.xml specifies the fields that documents may contain; these fields are used for indexing documents and for querying to search them.
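For instance, a minimal schema.xml might declare fields like these (the field names and the text_general type here are illustrative; the example schema shipped with Solr defines many more):

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="name" type="text_general" indexed="true" stored="true"/>
  <field name="content" type="text_general" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>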

solrconfig.xml contains the following sets of configuration information (a combined sketch is shown after this list):
1) The lib directive specifies paths to Solr plugins so that they can be loaded. If there are dependencies, list the lowest-level dependency jar first. It also supports regular expressions to control which jars are loaded.
2) The dataDir directive specifies the location of the index data files, which are stored in the "data" directory under the Solr home by default.
3) The indexConfig section allows you to configure low-level behavior of the Lucene index writer, such as index sizing, index merging, index locks, and other parameters.
4) The updateHandler section relates to the low-level details of how updates are handled internally, such as the maximum number of uncommitted documents or the maximum time before an auto commit or soft auto commit is triggered, and whether a new searcher is opened on hard commits. It also defines listeners, such as RunExecutableListener (which executes external commands), for particular update events, namely postCommit and postOptimize, as well as the maxPendingDeletes parameter, which limits the number of deletions Solr will buffer during document deletion.
Data sent to Solr is not searchable until it has been committed to the index. The reason is that in some cases commits can be slow, and they should be done in isolation from other possible commit requests to avoid overwriting data. Hence, it is preferable to provide control over when data is committed, using the auto commit and soft auto commit options above. A soft commit, as opposed to a normal (hard) commit, does not guarantee that documents are in stable storage after the commit.
5) The query section controls everything related to search queries, such as the maximum number of clauses in a boolean query. It contains the caching section and the event listener section.
  • The caching section is used to configure the caching parameters depending on the size of the index. Solr caches are associated with a specific instance of an IndexSearcher, a specific view of an index that does not change during the lifetime of that searcher. As long as that IndexSearcher is in use, any items in its cache remain valid and available for reuse. When a new searcher is opened, the current searcher continues servicing requests while the new one auto-warms its cache, using the current searcher's cache to pre-populate its own. When the new searcher is ready, it is registered as the current searcher and begins handling all new search requests; the old searcher is closed once it has finished servicing all of its requests. Details of each cache are as follows:
    The filterCache is used by SolrIndexSearcher for filters and for unordered sets of all documents matching a query; for a new searcher, the filterCache is pre-populated with the most recently accessed items. The queryResultCache caches the results of previously run queries, while the documentCache caches Lucene Document objects, which contain the stored fields of a document. A generic user-defined cache can be defined and accessed via the SolrIndexSearcher methods getCache(), cacheLookup() and cacheInsert(). There are also optimizations to use a filter for a search and to enable use of the queryResultCache for a specific number of result items.
  • The listener section defines a set of listeners triggered by query-related events, to perform operations such as warming the cache for a new or first searcher.
6) The requestDispatcher section configures how Solr's RequestDispatcher handles HTTP requests, including whether it should handle "/select" URLs (the handleSelect attribute, which is kept for backward compatibility), HTTP request parsing, remote streaming support, the maximum multipart file upload size, etc.
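As a rough sketch, the corresponding pieces of solrconfig.xml might look like this. The values are illustrative, loosely based on the example configuration shipped with Solr; consult the example solrconfig.xml of your version rather than treating these as recommendations:

<config>
  <lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar"/>
  <dataDir>${solr.data.dir:./data}</dataDir>

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>   <!-- commit after this many uncommitted docs -->
      <maxTime>60000</maxTime>   <!-- or after this many milliseconds -->
      <openSearcher>false</openSearcher> <!-- do not open a new searcher on hard commit -->
    </autoCommit>
  </updateHandler>

  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
    <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
  </query>

  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048"/>
  </requestDispatcher>
</config>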


Indexing files
In order to index HTML and other files using post.jar, add the library files "apache-solr-core-x.x.x" and "apache-solr-solrj-x.x.x", as well as "slf4j-api", "commons-io", "httpcore", "httpmime", "httpclient" and others from the Solr_Setup/dist and Solr_Setup/dist/solrj-lib directories, to $CATALINA_HOME/webapps/solr/WEB-INF/lib. This avoids ClassNotFoundException errors while indexing the files.
  To index files we use the post.jar located in the Solr_Setup/example/exampledocs directory:
java -jar post.jar *.xml
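Note that post.jar targets http://localhost:8983/solr/update by default (the port used by the Jetty-based example). Since our Solr instance runs under Tomcat on port 8080, the target URL should be overridden; the -Durl property below is the one accepted by the SimplePostTool shipped with Solr's examples, but check the tool's usage output for your version:
java -Durl=http://localhost:8080/solr/update -jar post.jar *.xml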

Below are the required Maven dependencies for using the SolrJ library:
    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-solrj</artifactId>
      <version>${solr.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-core</artifactId>
      <version>${solr.version}</version>
    </dependency>

Below is sample code to index HTML pages using SolrJ:
  // Scan the directory for all Html files and get instance of Solr server to index those files
  public static void indexDirectory(File directory) throws Exception {

    // Port 8080 matches the Tomcat setup described above
    SolrServer solr = new HttpSolrServer("http://localhost:8080/solr");

    // Pattern to match all html files (note the escaped dot)
    String pattern = "^.*\\.html$";

    // FileUtils requires Apache Commons-IO library
    Collection<File> files = FileUtils.listFiles(directory, 
                                                 new RegexFileFilter(pattern), 
                                                 DirectoryFileFilter.DIRECTORY );
    for (File file : files) {
      indexFile(solr, file);
    }
  }


  // Add the file to the index of Solr and commit
  public static void indexFile(SolrServer solr, File file) throws Exception {

    // do not try to index files that cannot be read
    if (file.canRead()) {
      if (file.isDirectory()) {

        String[] files = file.list();
        // an IO error could occur

        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexFile(solr, new File(file, files[i]));
          }
        }
      } else {

        try {

         ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");

         // Use the last extension of the file name as the type; default to "text"
         String name = file.getName();
         int dot = name.lastIndexOf('.');
         String type = (dot > -1) ? name.substring(dot + 1) : "text";

         req.addFile(file, new MimetypesFileTypeMap().getContentType(file));
         req.setParam("literal.id", file.getAbsolutePath());
         req.setParam("literal.name", file.getName());
         req.setParam("literal.content_type", type);
         req.setAction(ACTION.COMMIT, true, true);
     
         solr.request(req); // submits one request at a time; committing per file is costly, so for large batches commit once after the loop instead
        }
        catch (FileNotFoundException fnfe) {
          fnfe.printStackTrace();
        }
      }
    }
  }
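A minimal driver for the two methods above might look like this; the directory path is a placeholder:

  public static void main(String[] args) throws Exception {
    // Placeholder path: point this at the directory containing the html files
    indexDirectory(new File("/path/to/html/docs"));
  }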

Moving on to searching the indexed files, below is a sample method to display the results of a query:
  public static void showResults(QueryResponse queryResponse) {

    System.out.println("Response Header = " + queryResponse.getHeader());
    System.out.println("Elapsed Time: " + queryResponse.getElapsedTime());
    System.out.println("Query Time: " + queryResponse.getQTime());
    System.out.println("Number Of Results: " + queryResponse.getResults().getNumFound());
    System.out.println("Results: \n");

    SolrDocumentList solrDocumentList = queryResponse.getResults();
    for (SolrDocument solrDocument : solrDocumentList) {
      Map<String, Object> fieldValueMap = solrDocument.getFieldValueMap();
      for (String key : fieldValueMap.keySet()) {
        if (key.equals("content")) {
          // Collapse runs of whitespace in the extracted content for readability
          String value = (String) fieldValueMap.get(key);
          value = value.replaceAll("\\s+", " ");
          System.out.println(key + " = " + value);
        } else {
          System.out.println(key + " = " + fieldValueMap.get(key));
        }
      }
    }
  }
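The QueryResponse passed to showResults() can be obtained roughly as follows, assuming the same SolrJ imports as above plus SolrQuery and QueryResponse; the query string and row count are placeholders, and the "content" field name assumes the extraction setup described earlier:

  public static void search(String queryString) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8080/solr");

    // Build the query, e.g. search("content:lucene")
    SolrQuery query = new SolrQuery(queryString);
    query.setRows(10); // number of results to return

    QueryResponse queryResponse = solr.query(query);
    showResults(queryResponse);
  }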
