DDB Ch27
DDB Ch27
DDB Ch27
https://www.w3schools.com/sql/
Slide 27- 2
CHAPTER 27
Slide 27- 5
27.1 Information Retrieval (IR) Concepts
Information retrieval
Process of retrieving documents from a collection in
response to a query (search request)
Deals mainly with unstructured data
Example: homebuying contract documents
Unstructured information
Does not have a well-defined formal model
Based on an understanding of natural language
Stored in a wide variety of standard formats
Slide 27- 6
Information Retrieval (IR) Concepts (cont’d.)
Slide 27- 7
Information Retrieval (IR) Concepts (cont’d.)
Slide 27- 8
Information Retrieval (IR) Concepts (cont’d.)
Slide 9
Information Retrieval (IR) Concepts (cont’d.)
Characterizing an IR system
Types of users
Expert
Layperson
Types of data
Domain-specific
Types of information needs
Navigational search: locations
Informational search: information
Transactional search: transactions
Slide 27- 10
Information Retrieval (IR) Concepts (cont’d.)
Information Retrieval (IR) Concepts (cont’d.)
Slide 27- 12
Comparing Databases and IR Systems
Slide 27- 13
Generic IR approaches
1. Statistical approach
2. Semantic approaches
Slide 27- 14
Generic IR approaches
Statistical approach
Documents analyzed and broken down into
chunks of text
Each word or phrase is counted, weighted, and
measured for relevance or importance
Types of statistical approaches
1. Boolean
2. Vector space
3. Probabilistic
Slide 27- 15
Generic IR Pipeline (cont’d.)
Semantic approaches
Use knowledge-based retrieval techniques
Rely on syntactic, lexical, sentential, discourse-
based, and pragmatic levels of knowledge
understanding
Also apply some form of statistical analysis
Slide 27- 16
Figure 27.1 Generic IR framework Slide 27- 17
Figure 27.2 Simplified IR process pipeline
Slide 27- 18
Slide 27- 19
27.2 Retrieval Models
Boolean model
One of earliest and simplest IR models
Documents represented as a set of terms
Queries formulated using AND, OR, and NOT
Retrieved documents are an exact match
No notion of ranking of documents
Easy to associate metadata information and write
queries that match contents of documents
Slide 27- 20
27.2 Retrieval Models - Boolean model
plays
words
Slide 27- 21
27.2 Retrieval Models - Boolean model
Slide 27- 22
27.2 Retrieval Models - Boolean model
Result 1 0 0 1 0 0
Slide 27- 23
Retrieval Models (cont’d.)
Vector space model
Weighting, ranking, and determining relevance are possible
Uses individual terms as dimensions
Each document represented by an n-dimensional vector of
values.
Features
Subset of terms in a document set that are deemed most
Slide 27- 25
Retrieval Models (cont’d.)-
Vector space model
Retrieval Models (cont’d.)- Vector space model
Slide 27- 27
Retrieval Models (cont’d.)- Vector space model
Slide 27- 28
Retrieval Models (cont’d.)
Probabilistic model
Involves ranking documents by their estimated
probability of relevance with respect to the query
and the document
IR system must decide whether a document belongs
to the relevant set or nonrelevant set for a query
Calculate probability that document belongs to the
relevant set
BM25: a popular ranking algorithm
Slide 27- 30
Retrieval Models (cont’d.)-
Semantic model
Semantic model
Morphological analysis
Analyze roots and affixes to determine parts of
speech of search words
Syntactic analysis
Parse and analyze complete phrases in documents
Semantic analysis
Resolve word ambiguities and generate relevant
synonyms based on semantic relationships
Uses techniques from artificial intelligence and
expert systems.
Slide 27- 31
27.3 Types of Queries in IR Systems
Keyword queries
Simplest and most commonly used
Keyword terms implicitly connected by logical AND
Boolean queries
Allow use of AND, OR, NOT, and other operators
Exact matches returned
No ranking possible
Slide 27- 33
Types of Queries in IR Systems (cont’d.)
Phrase queries
Sequence of words that make up a phrase
Phrase enclosed in double quotes
Each retrieved document must contain at least one
instance of the exact phrase
Proximity queries
How close within a record multiple search terms
are to each other
Phrase search is most commonly used proximity
query
Slide 27- 34
Types of Queries in IR Systems (cont’d.)
Slide 27- 35
Types of Queries in IR Systems (cont’d.)
Wildcard queries
Supports regular expressions and pattern-based
matching
Example ‘data*’ would retrieve data, database,
dataset, etc.
Not generally implemented by Web search
engines.
Natural language queries
Definitions of textual terms or common facts
Semantic models can support
Slide 27- 36
Slide 27- 37
Figure 27.1 Generic IR framework Slide 27- 38
27.4 Text Preprocessing
Stopword removal must be performed before
indexing
Stopwords
Words that are expected to occur in 80% or more of
the documents of a collection
Examples: the, of, to, a, and, said, for, that
Do not contribute much to relevance
Queries preprocessed for stopword removal
before retrieval process
Many search engines do not remove stopwords
Slide 27- 39
Text Preprocessing (cont’d.)
Stemming
Trims suffix and prefix
Reduces the different forms of the word to a
common stem
Martin Porter’s stemming algorithm
Utilizing a thesaurus
Important concepts and main words that describe
each concept for a particular knowledge domain
Collection of synonyms
UMLS
Slide 27- 24
Figure 27.3 A portion of the UMLS Semantic Network: “Biologic Function” Hierarchy
Source: UMLS Reference Manual, National Library of Medicine
Slide 27-41
Text Preprocessing (cont’d.)
Other preprocessing steps
Digits
May or may not be removed during preprocessing
Hyphens and punctuation marks
Handled in different ways
Cases
Most search engines use case-insensitive search
Information extraction tasks
Identifying noun phrases, facts, events, people,
places, and relationships.
Slide 27-42
Slide 27- 43
27.5 Inverted Indexing
Inverted index structure
Vocabulary information
Set of distinct query terms in the document set
Document information: term frequency
Slide 27-44
Figure 27.2 Simplified IR process pipeline
Slide 27- 45
Inverted Indexing (cont’d.)
Construction of an inverted index
Break documents into vocabulary terms
Tokenizing, removing stopwords, stemming, and/or
using a thesaurus
Collect document statistics
Store statistics in document lookup table
Invert the document-term stream into a term-
document stream
Add additional information such as term frequencies,
term positions, and term weights
Slide 27-46
Figure 27.4 Example of an inverted index
Slide 27-47
Inverted Indexing (cont’d.)
Searching for relevant documents from an
inverted index
Vocabulary search
Document information retrieval
Manipulation of retrieved information
Slide 27-48
Introduction to Lucene
Lucene: open source indexing/search engine
Indexing is primary focus
Document composed of set of fields
Chunks of untokenized text
Series of processed lexical units called token
streams
Created by tokenization and filtering algorithms
Highly-configurable search API
Ease of indexing large, unstructured document
collections
Slide 27-49
Slide 27- 50
27.6 Evaluation Measures of Search
Relevance
Topical relevance
Measures result topic match to query topic
User relevance
Describes ‘goodness’ of retrieved result with
regard to user’s information need
Web information retrieval
No binary classification made for relevance or
nonrelevance
Ranking of documents
Slide 27-51
Evaluation Measures of Search Relevance
(cont’d.)
Recall
Number of relevant documents retrieved by a
search divided by the total number of actually
relevant documents existing in the database
Precision
Number of relevant documents retrieved by a
search divided by total number of documents
retrieved by that search
Slide 27- 52
Retrieved Versus Relevant Search Results
Slide 27-53
Information retrieval system evaluation
Slide 27- 54
Precision and recall
Slide 27- 55
Positive=retrieved
Negative=Not Retrieved
Slide 27- 56
Precision and recall
THE TRUTH
𝑻𝑻𝑻𝑻
P=
(𝑻𝑻𝑻𝑻+𝑭𝑭𝑭𝑭)
𝑻𝑻𝑻𝑻
R=
(𝑻𝑻𝑻𝑻+𝑭𝑭𝑭𝑭)
Evaluation Measures of Search Relevance
(cont’d.)
Average precision
Computed based on the precision at each
relevant document in the ranking
Recall/precision curve
Based on the recall and precision values at each
rank position
x-axis is recall and y-axis is precision
F-score
Harmonic mean of the precision (p) and recall (r)
values
Slide 27-59