This document provides an overview of Solr, an open source enterprise search platform. It describes Solr's core functions, such as indexing, searching, and analyzing documents, and explains how to configure Solr for indexing, querying, highlighting search results, and more. Examples demonstrate Solr's query syntax and relevancy tuning options.
Add Powerful Full Text Search to Your Web App with Solr
2. What is Lucene
• High performance, scalable, full-text search library
• Focus: Indexing + Searching Documents
– “Document” is just a list of name+value pairs
• No crawlers or document parsing
• Flexible Text Analysis (tokenizers + token filters)
• 100% Java, no dependencies, no config files
3. What is Solr
• A full text search server based on Lucene
• XML/HTTP, JSON Interfaces
• Faceted Search (category counting)
• Flexible data schema to define types and fields
• Hit Highlighting
• Configurable Advanced Caching
• Index Replication
• Extensible Open Architecture, Plugins
• Web Administration Interface
• Written in Java5, deployable as a WAR
4. Basic App
[Architecture diagram]
• Indexer sends a Document to http://solr/update
– super_name: Mr. Fantastic
– name: Reed Richards
– category: superhero
– powers: elasticity
• Webapp sends a Query (powers:agility) to http://solr/select, receives the Query Response (matching docs), and renders HTML
• Solr runs in a Servlet Container on top of Lucene and exposes admin, update, and select
– Update handlers: XML Update Handler, CSV Update Handler
– Request handlers: Standard request handler, Custom request handler
– Response writers: XML response writer, JSON response writer
5. Indexing Data
HTTP POST to http://localhost:8983/solr/update
<add><doc>
<field name="id">05991</field>
<field name="name">Peter Parker</field>
<field name="supername">Spider-Man</field>
<field name="category">superhero</field>
<field name="powers">agility</field>
<field name="powers">spider-sense</field>
</doc></add>
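A client has to produce that XML message somehow. The following is a minimal sketch, assuming a document is held as a map from field name to a list of values; the class and method names (AddXmlSketch, toAddXml) are hypothetical and not part of Solr. It only builds the message body and does not send it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AddXmlSketch {
    // Escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build an <add><doc>...</doc></add> message; a multi-valued field
    // becomes one <field> element per value, as on the slide.
    // Field names are assumed to be plain identifiers and are not escaped.
    static String toAddXml(Map<String, List<String>> doc) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> e : doc.entrySet()) {
            for (String value : e.getValue()) {
                sb.append("<field name=\"").append(e.getKey()).append("\">")
                  .append(escape(value)).append("</field>");
            }
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("id", List.of("05991"));
        doc.put("name", List.of("Peter Parker"));
        doc.put("powers", List.of("agility", "spider-sense"));
        System.out.println(toAddXml(doc));
    }
}
```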
6. Indexing CSV data
Iron Man, Tony Stark, superhero, powered armor | flight
Sandman, William Baker|Flint Marko, supervillain, sand transform
Wolverine,James Howlett|Logan, superhero, healing|adamantium
Magneto, Erik Lehnsherr, supervillain, magnetism|electricity
http://localhost:8983/solr/update/csv?
fieldnames=supername,name,category,powers
&separator=,
&f.name.split=true&f.name.separator=|
&f.powers.split=true&f.powers.separator=|
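To make the effect of the split parameters concrete, here is a rough sketch of how one CSV row turns into single-valued and multi-valued fields. It only imitates the behaviour described on the slide; it is not Solr's CSV loader. The class name CsvSplitSketch and the decision to trim whitespace around values are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

class CsvSplitSketch {
    // Split a row on "," into columns, then split the columns named in
    // splitFields on "|" so they become multi-valued fields.
    static Map<String, List<String>> parseLine(String line, String[] fieldNames,
                                               Set<String> splitFields) {
        String[] columns = line.split(",", -1);
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length && i < columns.length; i++) {
            List<String> values = new ArrayList<>();
            if (splitFields.contains(fieldNames[i])) {
                // "|" is a regex metacharacter, so quote it.
                for (String part : columns[i].split(Pattern.quote("|"))) {
                    values.add(part.trim()); // trimming is an assumption of this sketch
                }
            } else {
                values.add(columns[i].trim());
            }
            doc.put(fieldNames[i], values);
        }
        return doc;
    }

    public static void main(String[] args) {
        String[] names = {"supername", "name", "category", "powers"};
        System.out.println(parseLine(
            "Wolverine,James Howlett|Logan, superhero, healing|adamantium",
            names, Set.of("name", "powers")));
    }
}
```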
7. Data upload methods
URL=http://localhost:8983/solr/update/csv
• HTTP POST body (curl, HttpClient, etc)
curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
• Multi-part file upload (browsers)
• Request parameter
?stream.body='Cyclops, Scott Summers,…'
• Streaming from URL (must enable)
?stream.url=file:///data/info.csv
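The HTTP POST body method can also be driven from Java. The sketch below uses the JDK's java.net.http API (Java 11 or later, so newer than the Solr release this deck describes) and mirrors the curl command above. The class and method names are hypothetical, and the request is only built, not sent, so no running Solr is needed to try it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class CsvUploadSketch {
    // Build a POST request that carries CSV text in the body,
    // with the same Content-Type header as the curl example.
    static HttpRequest buildCsvPost(String url, String csvBody) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "text/plain; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(csvBody, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildCsvPost(
                "http://localhost:8983/solr/update/csv",
                "Magneto, Erik Lehnsherr, supervillain, magnetism|electricity\n");
        System.out.println(request.method() + " " + request.uri());
        // To actually send it (requires a running Solr instance):
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```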
8. Indexing with SolrJ
// Solr’s Java Client API… remote or embedded/local!
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("supername", "Daredevil");
doc.addField("name", "Matt Murdock");
doc.addField("category", "superhero");
server.add(doc);      // send the document to Solr
server.commit();      // make it visible to searches
9. Deleting Documents
• Delete by Id, most efficient
<delete>
<id>05591</id>
<id>32552</id>
</delete>
• Delete by Query
<delete>
<query>category:supervillain</query>
</delete>
10. Commit
• <commit/> makes changes visible
– Triggers static cache warming in solrconfig.xml
– Triggers autowarming from existing caches
• <optimize/> same as commit, merges all index segments for faster searching
[Diagram: Lucene Index Segments, with files _0.fnm, _0.fdt, _0.fdx, _0.frq, _0.tis, _0.tii, _0.prx, _0.nrm, _0_1.del, _1.fnm, _1.fdt, _1.fdx, […]]
12. Response Format
• Add &wt=json for JSON formatted response
{"result": {"numFound":427, "start":0,
"docs": [
{"supername":"Spider-Man", "category":"superhero"},
{"supername":"Mystique", "category":"supervillain"}
]
}}
• Also Python, Ruby, PHP, SerializedPHP, XSLT
13. Scoring
• Query results are sorted by score descending
• VSM – Vector Space Model
• tf – term frequency: number of matching terms in field
• lengthNorm – number of tokens in field
• idf – inverse document frequency
• coord – coordination factor, number of matching terms
• document boost
• query clause boost
http://lucene.apache.org/java/docs/scoring.html
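The factors listed above can be written down as small functions. The sketch below assumes the formulas of Lucene's classic default similarity as I understand them (square-root tf, logarithmic idf, inverse-square-root length norm); the deck itself does not spell them out, so treat the exact formulas as an assumption and check the scoring page linked above. The class name is hypothetical.

```java
class ScoringSketch {
    // tf: grows with the number of times the term occurs in the field.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf: rarer terms (low docFreq relative to numDocs) score higher.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // lengthNorm: matches in shorter fields count for more.
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // coord: fraction of the query terms that the document matched.
    static double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    public static void main(String[] args) {
        // One term occurring 4 times in a 16-token field, found in 9 of 1000 docs,
        // for a two-term query of which this document matches one term.
        double termScore = tf(4) * idf(9, 1000) * lengthNorm(16);
        System.out.println(coord(1, 2) * termScore);
    }
}
```

Document and query clause boosts would enter as further multipliers on top of these factors.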
17. DisMax Query Syntax
• Good for handling raw user queries
– Balanced quotes for phrase query
– ‘+’ for required, ‘-’ for prohibited
– Separates query terms from query structure
http://solr/select?qt=dismax
&q=super man // the user query
&qf=title^3 subject^2 body // fields to query
&pf=title^2 body // fields to do phrase queries
&ps=100 // slop for those phrase q’s
&tie=.1 // multi-field match reward
&mm=2 // # of terms that should match
&bf=popularity // boost function
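When these parameters are sent from code, the spaces and ^ characters need URL encoding. A minimal sketch using only the JDK follows; the helper name buildUrl and the class name are hypothetical, and http://solr/select is the placeholder host from the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class DisMaxUrlSketch {
    // Append URL-encoded key=value pairs to the base URL, in insertion order.
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep)
              .append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
            sep = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("qt", "dismax");
        params.put("q", "super man");               // the raw user query
        params.put("qf", "title^3 subject^2 body"); // fields to query, with boosts
        params.put("pf", "title^2 body");           // fields for phrase queries
        params.put("ps", "100");                    // phrase slop
        params.put("tie", ".1");                    // multi-field match reward
        params.put("mm", "2");                      // terms that should match
        params.put("bf", "popularity");             // boost function
        System.out.println(buildUrl("http://solr/select", params));
    }
}
```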
18. DisMax Query Form
• The expanded Lucene Query:
+( DisjunctionMaxQuery( title:super^3 |
subject:super^2 | body:super)
DisjunctionMaxQuery( title:man^3 |
subject:man^2 | body:man)
)
DisjunctionMaxQuery(title:"super man"~100^2 |
body:"super man"~100)
FunctionQuery(popularity)
• Tip: set up your own request handler with default parameters to avoid clients having to specify them
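The role of the tie parameter is easier to see in numbers. The sketch below assumes the usual disjunction-max combination: the best per-field score for a term, plus tie times the sum of the remaining field scores. The per-field scores are made-up values and the class name is hypothetical.

```java
class DisMaxScoreSketch {
    // Score of one DisjunctionMaxQuery: the maximum sub-score, plus
    // tie * (sum of the other sub-scores).
    static double disMaxScore(double tie, double... fieldScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        // Hypothetical scores for "super" in title, subject, and body.
        double[] scores = {3.0, 2.0, 1.0};
        System.out.println(disMaxScore(0.0, scores)); // pure max: only the best field counts
        System.out.println(disMaxScore(0.1, scores)); // small reward for matching in more fields
        System.out.println(disMaxScore(1.0, scores)); // plain sum of all fields
    }
}
```

With tie=0 a document matching in three fields scores the same as one matching only in the best field; tie=.1 gives multi-field matches a small edge without letting them dominate.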
19. Function Query
• Allows adding function of field value to score
– Boost recently added or popular documents
• Current parser only supports function notation
• Example: log(sum(popularity,1))
• sum, product, div, log, sqrt, abs, pow
• scale(x, target_min, target_max)
– calculates min & max of x across all docs
• map(x, min, max, target)
– useful for dealing with defaults
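A rough sketch of what three of these functions compute, written as plain Java rather than Solr's own classes. It assumes log is base 10 and that map replaces values inside [min, max] with target and leaves other values unchanged; both are my reading of the functions, not something the slide states. The class and method names are hypothetical.

```java
class FunctionQuerySketch {
    // log(sum(popularity,1)): adding 1 keeps the argument positive when popularity is 0.
    static double logSum(double popularity) {
        return Math.log10(popularity + 1);
    }

    // map(x, min, max, target): values inside [min, max] become target.
    static double map(double x, double min, double max, double target) {
        return (x >= min && x <= max) ? target : x;
    }

    // scale(x, targetMin, targetMax): needs the min and max of x across all docs first.
    static double[] scale(double[] xs, double targetMin, double targetMax) {
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (double x : xs) {
            lo = Math.min(lo, x);
            hi = Math.max(hi, x);
        }
        double range = hi - lo;
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            out[i] = range == 0 ? targetMin
                    : targetMin + (xs[i] - lo) / range * (targetMax - targetMin);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(logSum(9));                    // popularity 9
        System.out.println(map(0, 0, 0, 1));              // treat a default of 0 as 1
        System.out.println(java.util.Arrays.toString(scale(new double[] {0, 5, 10}, 1, 2)));
    }
}
```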
20. Boosted Query
• Score is multiplied instead of added
– New local params {!...} syntax added
&q={!boost b=sqrt(popularity)}super man
• Parameter dereferencing in local params
&q={!boost b=$boost v=$userq}
&boost=sqrt(popularity)
&userq=super man
24. copyField
• Copies one field to another at index time
• Usecase #1: Analyze same field different ways
– copy into a field with a different analyzer
– boost exact-case, exact-punctuation matches
– language translations, thesaurus, soundex
<field name="title" type="text"/>
<field name="title_exact" type="text_exact" stored="false"/>
<copyField source="title" dest="title_exact"/>
• Usecase #2: Index multiple fields into single searchable field
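The second use case can be pictured as follows. This is a conceptual sketch only, not Solr's implementation: each rule appends the incoming values of a source field to a destination field before the document is indexed. The destination field name "text", the field "body", and the class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CopyFieldSketch {
    // Each rule is {source, dest}. Values are copied from the incoming
    // document, so a rule never sees the output of another rule.
    static Map<String, List<String>> applyCopyFields(Map<String, List<String>> doc,
                                                     String[][] rules) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : doc.entrySet()) {
            out.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        for (String[] rule : rules) {
            List<String> source = doc.get(rule[0]);
            if (source != null) {
                out.computeIfAbsent(rule[1], k -> new ArrayList<>()).addAll(source);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("title", List.of("Spider-Man"));
        doc.put("body", List.of("agility and spider-sense"));
        String[][] rules = {{"title", "text"}, {"body", "text"}};
        // "text" ends up holding both values and can serve as the single search field.
        System.out.println(applyCopyFields(doc, rules));
    }
}
```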
29. Filters
• Filters are restrictions in addition to the query
• Use in faceting to narrow the results
• Filters are cached separately for speed
1. User queries for memory, query sent to solr is
&q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
&q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
&q=memory&fq=inStock:true&fq=size:1GB
&fq=type:DDR2&…
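Why separate caching helps in this drill-down sequence can be sketched with bit sets: each fq is evaluated once, cached under its own key, and intersected with the query matches on every later request. This is an illustration of the idea, not Solr's filter cache; the class name, the evaluator callback, and the document ids are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class FilterCacheSketch {
    private final Map<String, BitSet> cache = new HashMap<>();
    private final Function<String, BitSet> evaluator; // computes the doc set for one fq
    int evaluations = 0;                               // how often a filter was really computed

    FilterCacheSketch(Function<String, BitSet> evaluator) {
        this.evaluator = evaluator;
    }

    // Return the cached doc set for a filter, computing it on first use only.
    BitSet filter(String fq) {
        return cache.computeIfAbsent(fq, k -> {
            evaluations++;
            return evaluator.apply(k);
        });
    }

    // Intersect the main query's matches with every filter.
    BitSet search(BitSet queryMatches, String... fqs) {
        BitSet result = (BitSet) queryMatches.clone();
        for (String fq : fqs) {
            result.and(filter(fq));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, BitSet> index = new HashMap<>();
        BitSet inStock = new BitSet(); inStock.set(0, 3); // docs 0, 1, 2
        BitSet oneGb = new BitSet(); oneGb.set(1); oneGb.set(3);
        index.put("inStock:true", inStock);
        index.put("size:1GB", oneGb);

        FilterCacheSketch searcher = new FilterCacheSketch(index::get);
        BitSet memory = new BitSet(); memory.set(0, 4);   // docs matching q=memory
        System.out.println(searcher.search(memory, "inStock:true"));
        System.out.println(searcher.search(memory, "inStock:true", "size:1GB"));
        System.out.println("filters computed: " + searcher.evaluations);
    }
}
```

In the second request the inStock:true filter comes from the cache; only size:1GB has to be computed.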
31. MoreLikeThis
• Selects documents that are “similar” to the documents matching the main query.
&q=id:6H500F0
&mlt=true&mlt.fl=name,cat,features
"moreLikeThis":{
"6H500F0":{"numFound":5,"start":0,
"docs": [
{"name":"Apple 60 GB iPod with Video Playback Black", "price":399.0,
"inStock":true, "popularity":10, […]
}, […]
]
[…]
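32. High Availability
[Architecture diagram]
• Appservers (dynamic HTML generation) send HTTP search requests through a Load Balancer to the Solr Searchers
• An Updater reads from the DB and sends updates to the Solr Master
• Index Replication copies the index from the Solr Master to the Solr Searchers
• An admin terminal sends admin queries and updates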