The document discusses the open source search platform Solr, describing how it provides a RESTful web interface and Java client for full text search capabilities. It covers installing and configuring Solr, adding and querying data via its HTTP API, and using the SolrJ Java client library. The presentation also highlights key Solr features like faceting, filtering, and scaling for performance.
2. Tonight's Talk
- Tonight's talk should run about 1 1/2 hours
- About Solr
- Background & overview
- Installing & bringing up Solr
- REST interface & Java client
- Configuring Solr
3. Why Implement Search?
- Does your site need search?
- Do you need to implement it, or is Google enough?
- Just text or structured data?
- Do you need to control ranking?
4. What is Solr?
- Web application for text search
- A wrapper around Apache Lucene
- Lucene is a library (.jar file)
- Solr is a web app (.war file)
- Written at CNet, now at Apache
5. What is Lucene?
- Text search library in Java
- Fast, feature rich
- Written by Doug Cutting
- Active Apache development community
- Versions also in C++, C#, Ruby, Python, Delphi, Lisp, etc.
7. Solr Versions
- Current version is 1.2
- A year old
- 1.3 is coming "sometime"
- Large number of features in HEAD
- Use the latest from Subversion for new projects
8. Alternatives to Solr
- Just use Google
- Use Lucene
- Use your database
- Commercial libraries
- Write your own
9. What Solr is Not
- A replacement for a relational database
- An embedded database*
- Fully cross platform :-(
  - Replication depends on the Unix file system
  - Admin scripts are bash (minor)
10. Solr Sites
- CNet (Reviews & Products)
- Internet Archive (Collections)
- Netflix (Movies)
- Zvents (Events)
- StripSearch.ws (Comics)
- And many more
11. Features
Here's a quick look at some of the features of Solr, as implemented on Zvents.com
13. Faceted Navigation
- Groups the results by category
- Can do multiple facets at once
- Returns matching counts
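To make "groups the results by category" and "returns matching counts" concrete, here is a minimal sketch in plain Java. It does not use Solr at all: the class name, the `facetCounts` helper, and the sample documents are hypothetical, and it only illustrates the idea of counting matches per field value.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetCountSketch {

    /** Counts how many matching documents carry each value of the given field. */
    static Map<String, Integer> facetCounts(List<Map<String, String>> docs, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            String value = doc.get(field);
            if (value != null) {              // documents without the field are skipped
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical search results, each reduced to a flat field map.
        List<Map<String, String>> matches = List.of(
            Map.of("title", "Salsa night", "city", "paris"),
            Map.of("title", "Tango class", "city", "paris"),
            Map.of("title", "Swing dance", "city", "berlin"));
        System.out.println(facetCounts(matches, "city"));
    }
}
```

Calling `facetCounts` once per field ("city", "category", and so on) corresponds to doing multiple facets at once.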
18. Scaling Solr
- Master/slave architecture
- Writes to master, reads to slaves
- Replication: periodic transfers over rsync, not continuous
19. Updates
- Updates flush caches, which is bad for performance
- The master is therefore much slower than the slaves
- So send all queries to slaves
- Depends on your update rates
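One way to picture "writes to master, reads to slaves" is a small router that hands out the right base URL. This is only a sketch under assumptions: the class name, the method names, and the round-robin choice of slave are not from the talk, and a real deployment might use a load balancer instead.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class MasterSlaveRouterSketch {
    private final String masterUrl;
    private final List<String> slaveUrls;
    private final AtomicInteger next = new AtomicInteger();

    MasterSlaveRouterSketch(String masterUrl, List<String> slaveUrls) {
        if (slaveUrls.isEmpty()) {
            throw new IllegalArgumentException("need at least one slave");
        }
        this.masterUrl = masterUrl;
        this.slaveUrls = List.copyOf(slaveUrls);
    }

    /** Adds, deletes, commits and optimizes all go to the master. */
    String urlForUpdate() {
        return masterUrl;
    }

    /** Queries rotate over the slaves (round-robin is an assumption of this sketch). */
    String urlForQuery() {
        int i = Math.floorMod(next.getAndIncrement(), slaveUrls.size());
        return slaveUrls.get(i);
    }

    public static void main(String[] args) {
        // Hypothetical host names.
        MasterSlaveRouterSketch router = new MasterSlaveRouterSketch(
            "http://master:8080/solr",
            List.of("http://slave1:8080/solr", "http://slave2:8080/solr"));
        System.out.println("update -> " + router.urlForUpdate());
        System.out.println("query  -> " + router.urlForQuery());
        System.out.println("query  -> " + router.urlForQuery());
    }
}
```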
20. Solr's Data Model
- Solr maintains a collection of documents
- A document is a collection of fields & values
- A field can occur multiple times in a document
- Documents are immutable. They can be deleted, and a new version added, however.
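The data model above can be sketched in a few lines of plain Java. None of these classes are Solr's own: `Doc`, `Index`, and the use of an "id" field as the unique key are assumptions made for illustration. The point is that a field may hold several values and that "updating" means deleting the old document and adding a new one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DataModelSketch {

    /** A document: field name -> one or more values (hypothetical, not Solr's class). */
    static final class Doc {
        private final Map<String, List<String>> fields = new LinkedHashMap<>();

        /** Adding the same field name again appends another value. */
        Doc add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
            return this;
        }

        List<String> values(String name) {
            return Collections.unmodifiableList(fields.getOrDefault(name, List.of()));
        }
    }

    /** The collection of documents, keyed by an assumed unique "id" field. */
    static final class Index {
        private final Map<String, Doc> docs = new LinkedHashMap<>();

        /** No in-place edit is offered: a new version replaces the old one wholesale. */
        void replace(Doc doc) {
            String id = doc.values("id").get(0);
            docs.remove(id);      // delete the old version, if any
            docs.put(id, doc);    // add the new version
        }

        /** Read-only access to a stored document's field values. */
        List<String> values(String id, String field) {
            Doc doc = docs.get(id);
            return doc == null ? List.of() : doc.values(field);
        }

        int size() {
            return docs.size();
        }
    }

    public static void main(String[] args) {
        Index index = new Index();
        index.replace(new Doc().add("id", "strip.3136").add("tag", "data"));
        // "Changing" the document: send a complete new version under the same id.
        index.replace(new Doc().add("id", "strip.3136").add("tag", "data").add("tag", "search"));
        System.out.println(index.size() + " doc, tags=" + index.values("strip.3136", "tag"));
    }
}
```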
32. Getting Data Out
http://localhost:8080/comix/select/?q=data&indent=on

JSON format:

    {
      "responseHeader":{
        "status":0,
        "QTime":1,
        "params":{
          "wt":"json",
          "rows":["1", "1"],
          "start":"0",
          "indent":"on",
          "q":"data",
          "version":"2.2"}},
      "response":{"numFound":2,"start":0,"docs":[
        {
          "feature_id":"3",
          "release_date":"1992-05-07",
          "id":"strip.3136",
          "timestamp":"2008-02-28T10:06:01.682Z"}]
      }}
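A select URL like the one above is just a base path plus URL-encoded parameters. The following sketch shows one way to assemble it in plain Java. The class and helper names are hypothetical, and the `wt` and `rows` parameters are taken from the `params` block echoed in the response rather than from the URL shown on the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class SelectUrlSketch {

    /** Appends URL-encoded query parameters to <baseUrl>/select/. */
    static String selectUrl(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "/select/?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "data");
        params.put("wt", "json");     // ask for the JSON response writer
        params.put("indent", "on");
        params.put("rows", "1");
        System.out.println(selectUrl("http://localhost:8080/comix", params));
    }
}
```

Encoding matters as soon as the query contains characters such as `:` or spaces, for example a fielded query like `city:paris`.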
33. Debug Query Option
- Add &debugQuery=on to request params
- Returns parsed form of query

    <str name="rawquerystring">c.i.a</str>
    <str name="querystring">c.i.a</str>
    <str name="parsedquery">PhraseQuery(text:"c i a")</str>
    <str name="parsedquery_toString">text:"c i a"</str>
37. Solr in 3 minutes!
- Download Solr from Apache
- Untar
- "ant example"
- Start the example app
- Load data into Solr
- Query
38. Solr in Ten Minutes
- Copy Solr's example/solr dir to /var/solr
- Edit schema.xml and solrconfig.xml
- Load data into Solr

In $CATALINA_HOME/conf/Catalina/localhost/foo.xml:

    <Context docBase="/var/solr/apache-solr-1.2.0.war" debug="0" crossContext="true">
      <Environment name="solr/home" type="java.lang.String" value="/var/solr" override="true" />
    </Context>
40. Java Solr Client
- Called SolrJ
- Not in Solr 1.2; I grabbed it from HEAD in svn
- Works with Solr 1.2
- Supports Add/Delete/Query/Commit/Optimize
41. Adding Docs w/SolrJ
Given Map<String, String> fields;

    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    SolrInputDocument doc = new SolrInputDocument();
    for (Map.Entry<String, String> e : fields.entrySet()) {
        doc.addField(e.getKey(), e.getValue());
    }
    UpdateResponse res = server.add(doc);
42. Deleting Docs w/SolrJ

    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    UpdateResponse res;
    res = server.deleteById("100");
    res = server.deleteByQuery("city:paris");
43. Simple Query

    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    SolrQuery query = new SolrQuery();
    query.setQuery("dance");
    QueryResponse rsp = server.query(query);
44. More Interesting Query

    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    SolrQuery query = new SolrQuery();
    query.setQuery("dance");
    query.setFacet(true);
    query.addFacetField("city");
    query.setFacetMinCount(1);
    query.addSortField("price", SolrQuery.ORDER.asc);
    QueryResponse rsp = server.query(query);
45. Query Responses

    QueryResponse qr = server.query(query);
    SolrDocumentList docs = qr.getResults();
    List<FacetField> lf = qr.getFacetFields();
    for (FacetField ff : lf) {
        String fieldName = ff.getName();
        List<FacetField.Count> lc = ff.getValues();
        for (FacetField.Count c : lc) {
            String countName = c.getName();
            long count = c.getCount();
        }
    }
46. Other Commands
- Commit: server.commit()
- Optimize: server.optimize()
- Not too complicated!
47. Request Handlers
- Request handlers define how the query is processed
- Two main types
  - StandardRequestHandler
  - DisMaxRequestHandler
- You can implement your own
- Changing in Solr 1.3
49. DisMaxRequestHandler
- Recommended for user queries
- Allows simple user keywords to be applied to multiple fields, with weighting
- Boost functions
- Boost queries
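The per-field weighting is passed to DisMax as a list of `field^boost` pairs in its `qf` (query fields) parameter. The sketch below only formats such a value. The class name, the helper, and the field names `title` and `description` are hypothetical, and the weights are illustrative rather than recommended.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class DisMaxFieldsSketch {

    /** Formats field weights as space-separated "field^boost" pairs. */
    static String queryFields(Map<String, Double> weights) {
        StringJoiner qf = new StringJoiner(" ");
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            qf.add(e.getKey() + "^" + e.getValue());
        }
        return qf.toString();
    }

    public static void main(String[] args) {
        // Hypothetical schema: a match in the title counts twice as much as one in the description.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("title", 2.0);
        weights.put("description", 1.0);
        System.out.println("qf=" + queryFields(weights));
    }
}
```

The user's keywords stay a plain string in `q`; the weighting lives entirely in the handler's parameters.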
50. Boost Functions
- Allow you to influence scoring at run time
- Computationally expensive!
- Really useful for tuning scoring
- linear(x,2,4) returns 2*x+4, where x is a field
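As a worked example of the arithmetic only, `linear(x,2,4)` can be mirrored in Java like this. The class and method are hypothetical stand-ins; in Solr the function is evaluated per document, with x read from the named field.

```java
final class LinearBoostSketch {

    /** Mirrors linear(x, m, c) = m * x + c for a single field value x. */
    static double linear(double x, double m, double c) {
        return m * x + c;
    }

    public static void main(String[] args) {
        // Hypothetical field values for three documents.
        double[] fieldValues = {0.0, 1.0, 10.0};
        for (double x : fieldValues) {
            System.out.println("linear(" + x + ",2,4) = " + linear(x, 2, 4));
        }
    }
}
```

Because the function runs for every matching document, its cost grows with the size of the result set, which is one reason boost functions are expensive.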
51. The Solr Schema
- schema.xml
- Defines types used in this webapp
- Defines the fields and their types
- Defines "copyFields"
- READ THE EXAMPLE SCHEMA.XML
52. Types
- Types define processing for a field
- How the words are split (Whitespace? Punctuation? CIA != C.I.A.)
- Stemming
- Case folding, etc.
- Predefined date, int, float, etc.
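Why CIA != C.I.A. depends on how a type splits words. The sketch below contrasts two simplified strategies in plain Java. These are not Lucene's or Solr's analyzers, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AnalysisSketch {

    /** Splits on whitespace only; punctuation and case are kept. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Splits on anything that is not a letter or digit, then folds case. */
    static List<String> punctuationTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("C.I.A."));   // one token, punctuation kept
        System.out.println(punctuationTokens("C.I.A."));  // three single-letter tokens
        System.out.println(punctuationTokens("CIA"));     // one token
    }
}
```

With the second strategy, "C.I.A." becomes the three terms c, i, a while "CIA" becomes the single term cia, so neither query finds documents indexed from the other spelling unless the type handles that case.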
68. Other Presentations
- Yonik Seeley's Solr & Lucene presentations: http://people.apache.org/~yonik/presentations/
- Slideshare.net: search for solr, or search for lucene
69. Thanks!
Thanks for coming. Feel free to email me if you have questions about Solr.
Tom Hill [email_address]
70. Extra Slides
Things I didn't have time for in the presentation. Some of them are unfinished.
71. Search Engines are not the Same as Users
Search engines have different usage patterns than users
74. Data Matters
- GIGO (garbage in, garbage out)
- The better the data is, the better the search will be
75. Watch Your Caches
- Just like any other app, check your statistics
- What's the hit rate for your caches?
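The hit rate is simply hits divided by lookups. A small sketch, assuming you have read two counters from your cache statistics (the class name, method name, and sample numbers are hypothetical):

```java
final class CacheHitRateSketch {

    /** Fraction of lookups answered from the cache; 0.0 when nothing was looked up. */
    static double hitRate(long hits, long lookups) {
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }

    public static void main(String[] args) {
        // Hypothetical counters read from a statistics page.
        long hits = 750;
        long lookups = 1000;
        System.out.println("hit rate = " + hitRate(hits, lookups));
    }
}
```

A rate that stays low after the caches have had time to fill suggests the cache is too small for the query mix or is being flushed too often by updates.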
76. Setting Up Replication
- Run rsyncd on the master
- Run snapshot on the master at intervals
- Run snappuller on the slaves at (different) intervals
- Scripts don't print errors!
  - Check the logs
  - Use bash -xv
77. Autowarming
- Runs after an update to the index
- Updates flush caches
- Runs some queries to populate caches again
- Can be a problem with frequent updates
- Don't autowarm the master if you update a lot
80. Geographic Searching
- Local Lucene & Local Solr: http://locallucene.wiki.sourceforge.net
- There's also geolucene, but it's not being actively developed, as far as I can tell: http://www.gossamer-threads.com/lists/lucene/java-dev/53378