DDB Ch27

Introduced by

Dr. Ebtsam Adel

Lecturer of Information Systems,
Information Systems department,
Faculty of computers and information,
Damanhour university
SQL Tutorial
https://www.w3schools.com/sql/

Slide 27- 2
CHAPTER 27
Introduction to Information Retrieval

and Web Search
Slide 27- 4
Information Retrieval
 Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers).
Slide 27- 5
27.1 Information Retrieval (IR) Concepts
 Information retrieval
 Process of retrieving documents from a collection in
response to a query (search request)
 Deals mainly with unstructured data
 Example: homebuying contract documents
 Unstructured information
 Does not have a well-defined formal model
 Based on an understanding of natural language
 Stored in a wide variety of standard formats
Slide 27- 6
Information Retrieval (IR) Concepts (cont’d.)
Data could be presented in three different types of data :

structured, semi-structured, and unstructured data.
Structured data as (database), semi-structured data (XML
files, JSON documents), and unstructured data such as
(documents, images, video, e-mails, and reports).
Slide 27- 7
Slide 27- 8
 Information retrieval field predates database field

 Academic programs in Library and Information
Science
 RDBMS vendors providing new capabilities to
support various data types
 Extended RDBMSs or object-relational database
management systems
 User’s information need expressed as free-form
search request
 Keyword search query
Slide 9
 Characterizing an IR system
 Types of users
 Expert
 Layperson
 Types of data
 Domain-specific
 Types of information needs
 Navigational search: locations
 Informational search: information
 Transactional search: transactions
Slide 27- 10
 Enterprise search systems

 Limited to an intranet
 Desktop search engines
 Searches an individual computer system
 Databases have fixed schemas
 IR system has no fixed data model
Slide 27- 12
Comparing Databases and IR Systems
Table 27.1 A comparison of databases and IR systems
Slide 27- 13
Generic IR approaches
1. Statistical approach
2. Semantic approaches
Slide 27- 14
Generic IR approaches
 Statistical approach
 Documents analyzed and broken down into
chunks of text
 Each word or phrase is counted, weighted, and
measured for relevance or importance
 Types of statistical approaches
1. Boolean
2. Vector space
3. Probabilistic
Slide 27- 15
Generic IR Pipeline (cont’d.)
 Semantic approaches
 Use knowledge-based retrieval techniques
 Rely on syntactic, lexical, sentential, discourse-
based, and pragmatic levels of knowledge
understanding
 Also apply some form of statistical analysis
Slide 27- 16
Figure 27.1 Generic IR framework Slide 27- 17
Figure 27.2 Simplified IR process pipeline
Slide 27- 18
Slide 27- 19
27.2 Retrieval Models
 Boolean model
 One of earliest and simplest IR models
 Documents represented as a set of terms
 Queries formulated using AND, OR, and NOT
 Retrieved documents are an exact match
No notion of ranking of documents
 Easy to associate metadata information and write
queries that match contents of documents
Slide 27- 20
27.2 Retrieval Models - Boolean model
plays
words
Slide 27- 21
 a vector for each term.

a) Brutus: 110100
b) Caesar: 110111
c) NOT Calpurnia: (complemented (1's complement) Calpurnia) 101111
 To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the
vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a
bitwise AND:
110100
110100 AND 110111 AND 101111 = 100100.
110111
101111
100100
Slide 27- 22
Result 1 0 0 1 0 0
Antony and Cleopatra , and Hamlet
Slide 27- 23
Retrieval Models (cont’d.)
 Vector space model
 Weighting, ranking, and determining relevance are possible
 Uses individual terms as dimensions
 Each document represented by an n-dimensional vector of
values.
 Features
 Subset of terms in a document set that are deemed most
relevant to an IR search for the document set.
Slide 27- 25
Retrieval Models (cont’d.)-
Vector space model
Retrieval Models (cont’d.)- Vector space model
 Vector space model (cont’d.)

 Different similarity assessment functions can be
used
 Term frequency-inverse document frequency
(TF-IDF)
 Statistical weight measure used to evaluate the
importance of a document word in a collection of
documents.
Slide 27- 27
Retrieval Models (cont’d.)- Vector space model
 Term frequency-inverse document frequency (TF-IDF)
Slide 27- 28
Retrieval Models (cont’d.)
 Probabilistic model
 Involves ranking documents by their estimated
probability of relevance with respect to the query
and the document
 IR system must decide whether a document belongs
to the relevant set or nonrelevant set for a query
 Calculate probability that document belongs to the
relevant set
 BM25: a popular ranking algorithm
Slide 27- 30
Retrieval Models (cont’d.)-
Semantic model
 Semantic model
 Morphological analysis
 Analyze roots and affixes to determine parts of
speech of search words
 Syntactic analysis
 Parse and analyze complete phrases in documents
 Semantic analysis
 Resolve word ambiguities and generate relevant
synonyms based on semantic relationships
 Uses techniques from artificial intelligence and
expert systems.
Slide 27- 31
27.3 Types of Queries in IR Systems
 Keyword queries
 Simplest and most commonly used
 Keyword terms implicitly connected by logical AND
 Boolean queries
 Allow use of AND, OR, NOT, and other operators
 Exact matches returned
 No ranking possible
Slide 27- 33
Types of Queries in IR Systems (cont’d.)
 Phrase queries
 Sequence of words that make up a phrase
 Phrase enclosed in double quotes
 Each retrieved document must contain at least one
instance of the exact phrase
 Proximity queries
 How close within a record multiple search terms
are to each other
 Phrase search is most commonly used proximity
query
Slide 27- 34
 Proximity queries (cont’d.)

 Specify order of search terms
 NEAR, ADJ (adjacent), or AFTER operators
 Sequence of words with maximum allowed
distance between them
 Computationally expensive
 Suitable for smaller document collections rather than
the Web.
Slide 27- 35
 Wildcard queries
 Supports regular expressions and pattern-based
matching
 Example ‘data*’ would retrieve data, database,
dataset, etc.
 Not generally implemented by Web search
engines.
 Natural language queries
 Definitions of textual terms or common facts
 Semantic models can support
Slide 27- 36
Slide 27- 37
Figure 27.1 Generic IR framework Slide 27- 38
27.4 Text Preprocessing
 Stopword removal must be performed before
indexing
 Stopwords
 Words that are expected to occur in 80% or more of
the documents of a collection
 Examples: the, of, to, a, and, said, for, that
 Do not contribute much to relevance
 Queries preprocessed for stopword removal
before retrieval process
 Many search engines do not remove stopwords
Slide 27- 39
Text Preprocessing (cont’d.)
 Stemming
 Trims suffix and prefix
 Reduces the different forms of the word to a
common stem
 Martin Porter’s stemming algorithm
 Utilizing a thesaurus
 Important concepts and main words that describe
each concept for a particular knowledge domain
 Collection of synonyms
 UMLS
Slide 27- 24
Figure 27.3 A portion of the UMLS Semantic Network: “Biologic Function” Hierarchy
Source: UMLS Reference Manual, National Library of Medicine
Slide 27-41
Text Preprocessing (cont’d.)
 Other preprocessing steps
 Digits
 May or may not be removed during preprocessing
 Hyphens and punctuation marks
 Handled in different ways
 Cases
 Most search engines use case-insensitive search
 Information extraction tasks
 Identifying noun phrases, facts, events, people,
places, and relationships.
Slide 27-42
Slide 27- 43
27.5 Inverted Indexing
 Inverted index structure
 Vocabulary information
 Set of distinct query terms in the document set
 Document information: term frequency
 Inverted index : Data structure that attaches distinct

terms with a list of all documents that contain the
term
Slide 27-44
Figure 27.2 Simplified IR process pipeline
Slide 27- 45
Inverted Indexing (cont’d.)
 Construction of an inverted index
 Break documents into vocabulary terms
 Tokenizing, removing stopwords, stemming, and/or
using a thesaurus
 Collect document statistics
 Store statistics in document lookup table
 Invert the document-term stream into a term-
document stream
 Add additional information such as term frequencies,
term positions, and term weights
Slide 27-46
Figure 27.4 Example of an inverted index
Slide 27-47
Inverted Indexing (cont’d.)
 Searching for relevant documents from an
inverted index
 Vocabulary search
 Document information retrieval
 Manipulation of retrieved information
Slide 27-48
Introduction to Lucene
 Lucene: open source indexing/search engine
 Indexing is primary focus
 Document composed of set of fields
 Chunks of untokenized text
 Series of processed lexical units called token
streams
 Created by tokenization and filtering algorithms
 Highly-configurable search API
 Ease of indexing large, unstructured document
collections
Slide 27-49
Slide 27- 50
27.6 Evaluation Measures of Search
Relevance
 Topical relevance
 Measures result topic match to query topic
 User relevance
 Describes ‘goodness’ of retrieved result with
regard to user’s information need
 Web information retrieval
 No binary classification made for relevance or
nonrelevance
 Ranking of documents
Slide 27-51
Evaluation Measures of Search Relevance
(cont’d.)
 Recall
 Number of relevant documents retrieved by a
search divided by the total number of actually
relevant documents existing in the database
 Precision
 Number of relevant documents retrieved by a
search divided by total number of documents
retrieved by that search
Slide 27- 52
Retrieved Versus Relevant Search Results
 TP: true positive

 FP: false positive
 TN: true negative
 FN: false negative
Figure 27.5 Retrieved versus relevant search results
Slide 27-53
Information retrieval system evaluation
 The standard approach to information retrieval system evaluation

revolves around the notion of relevant and non-relevant documents.
 Precision (P) is the fraction of retrieved documents that are relevant.

 Recall (R) is the fraction of relevant documents that are retrieved.
Slide 27- 54
Precision and recall
Precision (P) is the fraction of retrieved documents that are

relevant.
=#(relevant items retrieved )
Precision= = P(relevant|retrieved)
#(retrieved items)
 Recall (R) is the fraction of relevant documents that are

Retrieved.
#(relevant items retrieved)

Recall= = = P(retrieved|relevant)
#(relevant items)
Slide 27- 55
Positive=retrieved
Negative=Not Retrieved
Slide 27- 56
Precision and recall
THE TRUTH
Relevant Not Relevant

Retrieved true positives (TP) false positives
(FP)
Not retrieved false negatives (FN) true negatives
(TN)
𝑻𝑻𝑻𝑻
P=
(𝑻𝑻𝑻𝑻+𝑭𝑭𝑭𝑭)
𝑻𝑻𝑻𝑻
R=
(𝑻𝑻𝑻𝑻+𝑭𝑭𝑭𝑭)
(cont’d.)
 Recall can be increased by presenting more results

to the user
 May decrease the precision
Table 27.2 Precision and recall for ranked retrieval

Slide 27-58
(cont’d.)
 Average precision
 Computed based on the precision at each
relevant document in the ranking
 Recall/precision curve
 Based on the recall and precision values at each
rank position
 x-axis is recall and y-axis is precision
 F-score
 Harmonic mean of the precision (p) and recall (r)
values
Slide 27-59

DDB Ch27

Uploaded by

Copyright:

Available Formats

DDB Ch27

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DDB Ch27

Uploaded by

Copyright:

Available Formats

Introduced by

Dr. Ebtsam Adel

Introduction to Information Retrieval

Data could be presented in three different types of data :

 Information retrieval field predates database field

 Enterprise search systems

Table 27.1 A comparison of databases and IR systems

 a vector for each term.

Antony and Cleopatra , and Hamlet

relevant to an IR search for the document set.

 Vector space model (cont’d.)

 Term frequency-inverse document frequency (TF-IDF)

 Proximity queries (cont’d.)

 Inverted index : Data structure that attaches distinct

 TP: true positive

Figure 27.5 Retrieved versus relevant search results

 The standard approach to information retrieval system evaluation

 Precision (P) is the fraction of retrieved documents that are relevant.

Precision (P) is the fraction of retrieved documents that are

 Recall (R) is the fraction of relevant documents that are

#(relevant items retrieved)

Relevant Not Relevant

 Recall can be increased by presenting more results

Table 27.2 Precision and recall for ranked retrieval

You might also like