Setting up Solr
To set up Solr, first download and install Apache Tomcat. We assume that a Java Runtime is installed and already configured on the system. To access the Apache Tomcat manager page, edit $CATALINA_HOME/conf/tomcat-users.xml and add a user with a manager/admin role, as sketched below.
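A minimal sketch of such an entry (the role name depends on the Tomcat version: recent versions use manager-gui, older ones plain manager; the username and password here are placeholders):
<tomcat-users>
  <role rolename="manager-gui"/>
  <user username="admin" password="changeme" roles="manager-gui"/>
</tomcat-users>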
Now download the latest Apache Solr release. Extract the archive and copy the .war file to the webapps directory of the Tomcat installation; then copy the Solr configuration from the example directory of the Solr setup into the $CATALINA_HOME directory.
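A sketch of the two copies, assuming the archive was extracted to /path/to/apache-solr-x.x.x and that the war ships in its dist directory (the exact war file name varies by release):
cp /path/to/apache-solr-x.x.x/dist/apache-solr-x.x.x.war $CATALINA_HOME/webapps/solr.war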
cp -R /path/to/apache-solr-x.x.x/example/solr $CATALINA_HOME/solr
In order for Tomcat to know about the Solr webapp, we add a file named solr.xml in the Tomcat per-host configuration directory, i.e. $CATALINA_HOME/conf/Catalina/localhost. Now we open the newly created solr.xml in a text editor and add the following configuration (replace the two paths with the actual locations):
<Context docBase="/full/path/to/webapps/solr.war" debug="0" crossContext="true" privileged="true" allowLinking="true">
  <Environment name="solr/home" type="java.lang.String" value="/full/path/to/CATALINA_HOME/solr" override="true" />
</Context>
After restarting the Tomcat server, the Solr admin interface is accessible at http://localhost:8080/solr/admin/.
Configuring Solr
Solr is configured using two files, solrconfig.xml and schema.xml, both of which reside in the conf directory under the Solr home directory. solrconfig.xml configures the Solr server itself, while schema.xml specifies the fields that documents may contain; these fields are used when indexing documents and when building queries to search them.
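As a quick illustration, a field declaration in schema.xml looks like the following sketch (the field name and type are examples and may not match the stock example schema):
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>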
solrconfig.xml contains the following configuration sections (a consolidated sketch follows the list):
1) The lib directive specifies paths to Solr plugins so that they can be loaded. If there are dependencies, list the lowest-level dependency jar first. Regular expressions are also supported to control which jars are loaded.
2) The dataDir directive specifies the location of the index data files, which are stored in the ./data directory under the Solr home by default.
3) The indexConfig section allows configuring low-level behavior of the Lucene index writer, such as index sizing, index merging, index locks and other parameters.
4) The updateHandler section relates to the low-level internal handling of updates, such as the maximum number of uncommitted documents, the maximum time before an auto commit or soft auto commit is triggered, and whether a new searcher is opened on hard commits. It also defines listeners, such as RunExecutableListener (which executes external commands), for the update events postCommit and postOptimize. Finally, its maxPendingDeletes parameter limits the number of deletions that Solr will buffer during document deletion.
Data sent to Solr is not searchable until it has been committed to the index. Because commits can be slow in some cases, and should be done in isolation from other commit requests to avoid overwriting data, it is preferable to have control over when data is committed, using the commit and soft-commit options above. A soft commit, as opposed to a normal (hard) commit, does not guarantee that documents are in stable storage after committing.
5) The query section controls everything related to search queries, such as the maximum number of clauses in a boolean query, and it contains the caching section and the event-listener section.
- The caching section is used to configure caching parameters depending on the size of the index. Solr caches are associated with a specific instance of an IndexSearcher, a specific view of an index that does not change during the lifetime of that searcher. As long as that IndexSearcher is in use, any items in its cache remain valid and available for reuse. When a new searcher is opened, the current searcher continues servicing requests while the new one auto-warms its cache, using the current searcher's cache to pre-populate its own. When the new searcher is ready, it is registered as the current searcher and begins handling all new search requests; the old searcher is closed once it has finished servicing its outstanding requests. The details of each cache are as follows:
The filterCache is used by SolrIndexSearcher for filters, that is, unordered sets of all documents matching a query; for a new searcher, the filterCache is pre-populated with the most recently accessed items. The queryResultCache caches the results of previously executed queries, while the documentCache caches Lucene document objects, which contain the stored fields of a document. Generic user-defined caches can be defined and accessed through the SolrIndexSearcher methods getCache(), cacheLookup() and cacheInsert(). There are also optimizations to use a filter for a sorted search (useFilterForSortedQuery) and to control how many result items the queryResultCache stores per query (queryResultWindowSize).
- The listener section defines a set of listeners triggered by query-related events, to perform operations such as preparing caches for a new or first searcher.
Finally, the handleSelect attribute in the requestDispatcher section is retained for backward compatibility: when set to true, requests to /select are dispatched to the request handler named by the qt parameter, as older clients expect.
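Putting these directives together, here is a hedged sketch of the relevant parts of solrconfig.xml; the element names follow the stock example configuration, but all values are illustrative and should be tuned per deployment:
<config>
  <!-- 1) plugin jars; the dir path and regex are illustrative -->
  <lib dir="../../contrib/extraction/lib" regex=".*\.jar" />

  <!-- 2) index data location; defaults to ./data under the Solr home -->
  <dataDir>${solr.data.dir:}</dataDir>

  <!-- 3) low-level Lucene index writer behavior -->
  <indexConfig>
    <ramBufferSizeMB>100</ramBufferSizeMB>
    <lockType>native</lockType>
  </indexConfig>

  <!-- 4) update handling: auto commits, update event listeners, buffered deletes -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>
    <maxPendingDeletes>100000</maxPendingDeletes>
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
    </listener>
  </updateHandler>

  <!-- 5) query-time settings: boolean clause limit, caches, searcher listeners -->
  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
    <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <useFilterForSortedQuery>true</useFilterForSortedQuery>
    <queryResultWindowSize>20</queryResultWindowSize>
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">solr</str></lst>
      </arr>
    </listener>
  </query>

  <requestDispatcher handleSelect="false" />
</config>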
Indexing files
In order to index HTML and other files using post.jar, we add the library files apache-solr-core-x.x.x and apache-solr-solrj-x.x.x, as well as slf4j-api, commons-io, httpcore, httpmime, httpclient and others from the Solr_Setup/dist and Solr_Setup/dist/solrj-lib directories, to $CATALINA_HOME/webapps/solr/WEB-INF/lib. This avoids ClassNotFoundException errors while indexing the files.
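A blunt but simple sketch of the copy (jar names vary by release; copying everything from both directories covers the dependencies):
cp Solr_Setup/dist/*.jar $CATALINA_HOME/webapps/solr/WEB-INF/lib/
cp Solr_Setup/dist/solrj-lib/*.jar $CATALINA_HOME/webapps/solr/WEB-INF/lib/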
To index files, we use post.jar, located in the Solr_Setup/example/exampledocs directory, with the following command:
java -jar post.jar *.xml
Below are the required Maven dependencies for using the SolrJ library (assuming a ${solr.version} property is defined in the POM):
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>${solr.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>${solr.version}</version>
</dependency>
Below is the sample code to index HTML pages using SolrJ:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Collection;

import javax.activation.MimetypesFileTypeMap;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.DirectoryFileFilter;
import org.apache.commons.io.filefilter.RegexFileFilter;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

// Scan the directory for all HTML files and get an instance of the Solr server to index those files
public static void indexDirectory(File directory) throws Exception {
    // Solr URL (port may differ from the default 8080 shown earlier)
    SolrServer solr = new HttpSolrServer("http://localhost:8090/solr");
    // Pattern to filter all html files (note the escaped dot)
    String pattern = "^.*\\.html$";
    // FileUtils requires the Apache Commons-IO library
    Collection<File> files = FileUtils.listFiles(directory,
            new RegexFileFilter(pattern), DirectoryFileFilter.DIRECTORY);
    for (File file : files) {
        indexFile(solr, file);
    }
}

// Add the file to the Solr index and commit
public static void indexFile(SolrServer solr, File file) throws Exception {
    // do not try to index files that cannot be read
    if (file.canRead()) {
        if (file.isDirectory()) {
            String[] files = file.list();
            // an IO error could occur
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    indexFile(solr, new File(file, files[i]));
                }
            }
        } else {
            try {
                ContentStreamUpdateRequest req =
                        new ContentStreamUpdateRequest("/update/extract");
                String[] parts = file.getName().split("\\.");
                String type = "text";
                if (parts.length > 1) {
                    // use the last segment so names like a.b.html yield "html"
                    type = parts[parts.length - 1];
                }
                req.addFile(file, new MimetypesFileTypeMap().getContentType(file));
                req.setParam("literal.id", file.getAbsolutePath());
                req.setParam("literal.name", file.getName());
                req.setParam("literal.content_type", type);
                req.setAction(ACTION.COMMIT, true, true);
                solr.request(req); // submits one request at a time
            } catch (FileNotFoundException fnfe) {
                fnfe.printStackTrace();
            }
        }
    }
}
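For completeness, a sketch of how indexDirectory might be invoked from the same class (the directory path is a placeholder):
public static void main(String[] args) throws Exception {
    // index every .html file under this directory tree
    indexDirectory(new File("/path/to/html/docs"));
}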
Moving on to searching the indexed files.
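To obtain the QueryResponse consumed by the showResults() method below, a query can be issued through SolrJ. A minimal sketch, reusing the server URL from the indexing code (the search method and its query string are illustrative placeholders):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public static void search(String queryString) throws SolrServerException {
    SolrServer solr = new HttpSolrServer("http://localhost:8090/solr");
    SolrQuery query = new SolrQuery(queryString); // e.g. "content:tomcat"
    query.setRows(10); // cap the number of returned documents
    QueryResponse response = solr.query(query);
    showResults(response);
}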
import java.util.Iterator;
import java.util.Map;

import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public static void showResults(QueryResponse queryResponse) {
    System.out.println("Response Header = " + queryResponse.getResponseHeader());
    System.out.println("Elapsed Time: " + queryResponse.getElapsedTime());
    System.out.println("Query Time: " + queryResponse.getQTime());
    System.out.println("Number Of Results: " + queryResponse.getResults().getNumFound());
    System.out.println("Results: \n\n");
    SolrDocumentList solrDocumentList = queryResponse.getResults();
    Iterator<SolrDocument> solrDocumentIterator = solrDocumentList.iterator();
    while (solrDocumentIterator.hasNext()) {
        SolrDocument solrDocument = solrDocumentIterator.next();
        Map<String, Object> fieldValueMap = solrDocument.getFieldValueMap();
        for (String key : fieldValueMap.keySet()) {
            if (key.equals("content")) {
                // collapse runs of whitespace in the extracted body text
                String value = (String) fieldValueMap.get(key);
                value = value.replaceAll("\\s+", " ");
                System.out.println(key + " = " + value);
            } else {
                System.out.println(key + " = " + fieldValueMap.get(key));
            }
        }
    }
}