Explain Item Normalization?



o Item Normalization:
o The first step is to normalize the incoming items to a standard format.
Standardizing the input takes the different external formats of input
data and translates them into a format acceptable to the system. A
system may have a single format for all items or allow multiple formats.

o The next process is to parse the item into logical sub-divisions that
have meaning to the user. This process, called "Zoning," is visible to
the user and used to increase the precision of a search and optimize the
display. An item is subdivided into zones, which may be hierarchical
(Title, Author, Abstract, Main Text, Conclusion, and References). The
zoning information is passed to the processing token identification
operation to store the information, allowing searches to be restricted to
a specific zone.
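The zone-restricted search described above can be sketched as follows. This is a minimal illustration, assuming a document already split into named zones; the zone names and text are invented.

```python
# Minimal sketch of zoning: each processing token is stored with the
# zones it came from, so a search can be restricted to a specific zone.

def build_zoned_index(zones):
    """Map each token to the set of zones it appears in."""
    index = {}
    for zone_name, text in zones.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(zone_name)
    return index

doc = {
    "Title": "Information Retrieval Systems",
    "Abstract": "An overview of indexing and search",
    "Main Text": "Indexing transforms items into searchable structures",
}

index = build_zoned_index(doc)

def search(index, term, zone=None):
    """True if the term occurs (optionally restricted to one zone)."""
    zones = index.get(term.lower(), set())
    return bool(zones) if zone is None else zone in zones

print(search(index, "indexing"))                # anywhere -> True
print(search(index, "indexing", zone="Title"))  # restricted -> False
```

Restricting the search to the Title zone rejects a term that only occurs in the body, which is exactly how zoning increases precision.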
o Once standardization and zoning have been completed, the information
(i.e., words) used in the search process needs to be identified in
the item. The first step in identifying a processing token is
determining a word. Systems determine words by dividing input
symbols into three classes: valid word symbols, inter-word symbols,
and special processing symbols.

o A word is defined as a contiguous set of word symbols bounded by
inter-word symbols. Examples of word symbols are alphabetic
characters and numbers. Examples of possible inter-word symbols are
blanks, periods, and semicolons.
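The symbol-class definition above can be sketched directly. This is a simplified illustration: letters and digits are treated as word symbols and everything else as inter-word symbols, ignoring the special-processing class.

```python
# Word identification by symbol class: a word is a contiguous run of
# word symbols (letters/digits) bounded by inter-word symbols
# (blanks, periods, semicolons, and here any other non-alphanumeric).

def tokenize(text):
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():          # valid word symbol
            current.append(ch)
        else:                     # inter-word symbol bounds a word
            if current:
                tokens.append("".join(current))
                current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("Indexing, in short; covers 2 phases."))
# → ['Indexing', 'in', 'short', 'covers', '2', 'phases']
```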

o Next, a Stop List/Algorithm is applied to the list of potential
processing tokens. The objective of the Stop function is to save
system resources by eliminating from the set of searchable processing
tokens those that have little value to the system. Stop Lists are
commonly found in most systems and consist of words (processing
tokens) whose frequency and/or semantic use make them of no value
as a searchable token. Such words (e.g., "the") have no search value
and are not a useful part of a user's query.
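A minimal sketch of the Stop List function follows; the stop-word set here is a small invented sample, not any real system's list.

```python
# Stop List sketch: high-frequency, low-value tokens are removed
# before they become searchable processing tokens.

STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are"}

def apply_stop_list(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "design", "of", "the", "computer", "is", "complete"]
print(apply_stop_list(tokens))  # → ['design', 'computer', 'complete']
```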

o The next step in finalizing processing tokens is the identification of
any specific word characteristics. A characteristic is used in
systems to assist in disambiguating a particular word.
Morphological analysis of the processing token's part of speech is
included here. For example, for a word like "plane," the system could
determine whether it is being used as a verb, adjective, or noun. Other
characteristics, such as numbers and dates, are treated separately.
o Once the potential processing token has been identified and
characterized, most systems apply stemming algorithms to normalize
the token to a standard semantic representation. The decision to
perform stemming is a trade-off between the precision of a search (i.e.,
finding exactly what the query specifies) and standardization, which
reduces system overhead by expanding a search term to similar token
representations, with a potential increase in recall. Applying too much
stemming can lead to the retrieval of many non-relevant items.
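The trade-off can be seen with a deliberately crude suffix-stripping stemmer; the suffix list is invented for illustration, and real systems use far more careful rules (e.g., the Porter algorithm).

```python
# Crude suffix-stripping stemmer: strips the first matching suffix if
# at least a 3-character stem remains. Over-aggressive stripping shows
# how stemming trades precision for recall.

SUFFIXES = ["ations", "ation", "ings", "ing", "ers", "er", "es", "s"]

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["computing", "computers", "computation", "compute"]:
    print(w, "->", crude_stem(w))
# "computing", "computers", and "computation" all collapse to "comput",
# so a query on any one of them now matches items about the others.
```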

 Explain system capabilities?


 Search Capabilities
 The objective of the search capability is to allow a mapping between a
user's specified need and the items in the information database that will
answer that need. "Weighting" of search terms holds significant potential
for assisting in the location and ranking of relevant items. E.g., for the
query "Find articles that discuss data mining or data warehouses," a user
could weight "data mining" more heavily, so the system would recognize in
its importance ranking and item-selection process that items discussing
data mining are far more important than items discussing data warehouses.

 Boolean Logic:
Boolean logic allows a user to logically relate multiple concepts together to
define what information is needed. The typical Boolean operators are AND,
OR, and NOT. Placing portions of the search statement in parentheses overtly
specifies the order of Boolean operations (i.e., the nesting function). If
parentheses are not used, the system follows a default precedence ordering of
operations (e.g., NOT before AND before OR).

Use of Boolean Operators
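A sketch of Boolean retrieval over posting sets follows; the document IDs and postings are invented. AND intersects, OR unions, and NOT subtracts from the full document set.

```python
# Boolean retrieval sketch using Python set operations over postings.

all_docs = {1, 2, 3, 4, 5}
postings = {
    "data":      {1, 2, 3},
    "mining":    {2, 3},
    "warehouse": {3, 4},
}

# (data AND mining) OR warehouse — parentheses control nesting
result = (postings["data"] & postings["mining"]) | postings["warehouse"]
print(sorted(result))  # → [2, 3, 4]

# data AND NOT warehouse
result = postings["data"] & (all_docs - postings["warehouse"])
print(sorted(result))  # → [1, 2]
```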

 Proximity:
 Proximity is used to restrict the distance allowed within an item between two
search terms. The semantic concept is that the closer two terms are found in
a text the more likely they are related in the description of a particular
concept. Proximity is used to increase the precision of a search. If the terms
COMPUTER and DESIGN are found within a few words of each other then
the item is more likely to be discussing the design of computers than if the
words are paragraphs apart.
o The proximity operator has the form: TERM1 within "m units" of
TERM2. The distance operator "m" is an integer and the units are
characters, words, sentences, or paragraphs. A special case of the
Proximity operator is the Adjacent (ADJ) operator, which normally has
a distance of one and a forward-only direction. Another special case
sets the distance to zero, meaning within the same semantic unit.
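A word-unit proximity check can be sketched from token positions; the sentence used is invented. With a distance of one and a forward-only direction this reduces to the ADJ operator.

```python
# Proximity sketch: TERM1 within m words of TERM2, compared by the
# absolute difference of token positions.

def within(tokens, term1, term2, m):
    pos1 = [i for i, t in enumerate(tokens) if t == term1]
    pos2 = [i for i, t in enumerate(tokens) if t == term2]
    return any(abs(i - j) <= m for i in pos1 for j in pos2)

tokens = "the design of the computer was revised".split()
print(within(tokens, "computer", "design", 3))   # → True
print(within(tokens, "computer", "revised", 1))  # → False
```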
 Contiguous Word Phrases:
A Contiguous Word Phrase (CWP) is both a way of specifying a query term
and a special search operator. A Contiguous Word Phrase is two or more
words that are treated as a single semantic unit. An example of a CWP is
"United States of America." It is four words that specify a search term
representing a single specific semantic concept (a country) that can be
used with any of the operators discussed above. Thus a query could specify
"manufacturing" AND "United States of America", which returns any item
that contains the word "manufacturing" and the contiguous words "United
States of America". A contiguous word phrase also acts like a special
search operator that is similar to the proximity (Adjacency) operator but
allows for additional specificity.
 Fuzzy Searches:
o Fuzzy Searches provide the capability to locate spellings of words that
are similar to the entered search term. This function is primarily used
to compensate for errors in spelling of words. Fuzzy searching
increases recall at the expense of decreasing precision. A Fuzzy
Search on the term "computer" would automatically include the
following words from the information database: "computer”,
"compiter," "conputer," "computter," "compute." An additional
enhancement may lookup the proposed alternative spelling and if it is
a valid word with a different meaning, include it in the search with a
low ranking or not include it at all (e.g., "commuter"). In the process
of expanding a query term fuzzy searching includes other terms that
have similar spellings, giving more weight (in systems that rank
output) to words in the database that have similar word lengths and
position of the characters as the entered term.
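A fuzzy search can be sketched with a string-similarity measure; here the standard-library `difflib` stands in for a real spelling-similarity function, and the vocabulary is invented.

```python
# Fuzzy search sketch: return database terms whose spelling is similar
# to the query term, above a similarity cutoff.
import difflib

vocabulary = ["computer", "compiter", "conputer", "computter",
              "compute", "commuter", "database"]

def fuzzy_search(term, vocab, cutoff=0.8):
    return difflib.get_close_matches(term, vocab, n=10, cutoff=cutoff)

print(fuzzy_search("computer", vocabulary))
```

Note that a valid but differently-meaning word such as "commuter" may also fall within the cutoff, which is exactly why the text suggests a dictionary lookup to demote or drop such matches.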

o Term Masking:
 Term masking is the ability to expand a query term by masking a portion of
the term and accepting as valid any processing token that maps to the
unmasked portion of the term. The value of term masking is much higher in
systems that do not perform stemming or only provide a very simple
stemming algorithm. There are two types of search term masking: fixed
length and variable length. Fixed length masking is a single position mask.
It masks out any symbol in a particular position, or the lack of that
position in a word. Variable-length "don't cares" allow masking of any
number of characters within a processing token.
 Term Masking (Variable Length)
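Both mask types can be sketched by translating them to regular expressions. The '?' (single-position) and '*' (variable-length) mask syntax and the vocabulary are assumptions for illustration.

```python
# Term-masking sketch: '?' masks one position (fixed-length mask),
# '*' masks any number of characters (variable-length "don't care").
import re

def mask_to_regex(mask):
    pattern = re.escape(mask).replace(r"\?", ".").replace(r"\*", ".*")
    return re.compile("^" + pattern + "$")

vocab = ["color", "colour", "colors", "colonial", "collar"]

fixed = mask_to_regex("colo?r")     # exactly one symbol in the masked position
variable = mask_to_regex("col*r")   # any number of characters

print([w for w in vocab if fixed.match(w)])     # → ['colour']
print([w for w in vocab if variable.match(w)])  # → ['color', 'colour', 'collar']
```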

o Numeric and Date Ranges:
o Term masking is useful when applied to words, but does not work for
finding ranges of numbers or numeric dates. To find numbers larger
than "125", the term "125*" will not work: it finds only numbers that
begin with the digits "125". Instead, a user could enter inclusive
ranges (e.g., "125-425" or "4/2/93-5/2/95" for numbers and dates) or
open-ended ranges (">125", "<=233", representing "greater than" or
"less than or equal") as part of a query.
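Range restriction works when numbers are indexed as values rather than character strings, as in this sketch; the field values per document are invented.

```python
# Numeric range sketch: values are stored as numbers, so inclusive and
# open-ended range queries compare magnitudes instead of digit prefixes.

values = {"doc1": 100, "doc2": 125, "doc3": 300, "doc4": 425, "doc5": 500}

def range_query(values, low=None, high=None):
    return sorted(doc for doc, v in values.items()
                  if (low is None or v >= low) and (high is None or v <= high))

print(range_query(values, low=125, high=425))  # inclusive "125-425"
print(range_query(values, low=126))            # ">125" for integer values
```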

 Concept/Thesaurus Expansion:
 Associated with both Boolean and Natural Language Queries is the ability to
expand the search terms via a Thesaurus or Concept Class database reference
tool.  A Thesaurus is typically a one-level or two-level expansion of a term
to other terms that are similar in meaning.  A Concept Class is a tree
structure that expands each meaning of a word into potential concepts that
are related to the initial term.
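A one-level thesaurus expansion can be sketched as follows; the thesaurus entries are invented.

```python
# One-level thesaurus expansion sketch: each query term is kept and
# followed by its thesaurus entries, broadening the search.

THESAURUS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand(terms):
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(THESAURUS.get(term, []))  # unknown terms pass through
    return expanded

print(expand(["fast", "car", "race"]))
# → ['fast', 'quick', 'rapid', 'car', 'automobile', 'vehicle', 'race']
```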

 Natural Language Queries:


 Natural Language Queries allow a user to enter a prose statement that
describes the information that the user wants to find.  The longer the
prose, the more accurate the results returned. The most difficult logic
case associated with Natural Language Queries is the ability to specify
negation in the search statement and have the system recognize it as
negation.
 An example of a Natural Language Query is: Find for me all the items that
discuss databases and current attempts in database applications. Include all
items that discuss Microsoft trials in the development process. Do not
include items about relational databases.
 This usage pattern is important because sentence fragments make
morphological analysis of the natural language query difficult and may limit
the system's ability to perform term disambiguation (e.g., understand which
meaning of a word is meant). Natural language interfaces improve the recall
of systems with a decrease in precision when negation is required.

Explain how data structures are used to create indexing in an IRS?


 INDEXING:
• The transformation from a received item to a searchable data structure
is called indexing.
• The process can be manual or automatic.
• Indexing creates either a direct search of the document database or an
indirect search through index files.
• Concept-based representation: instead of transforming the input into a
searchable format, some systems transform the input into a different,
concept-based representation; searching then matches and returns items
against that representation.
• History of indexing: shows the dependency of information-processing
capabilities on manual and then automatic processing systems.
• Indexing was originally called cataloguing: the oldest technique for
identifying the contents of items to assist in retrieval.
• Items overlap between full item indexing and the public and private
indexing of files.
• Objectives: the public file indexer needs to consider the information
needs of all users of the library system.
• Users may use public index files as part of their search criteria to
increase recall.
• They can constrain their search by private index files.
• The primary objective is to represent the concepts within an item so as
to facilitate users finding relevant information.

Indexing process:
1. Decide the scope of indexing and the level of detail to be provided,
based on the usage scenarios of the users.
2. The second decision is whether to link index terms together in a single
index for a particular concept.

TEXT PROCESSING

Text processing phases:
1. Document Parsing. Documents come in all sorts of languages, character
sets, and formats; often, the same document may contain multiple languages
or formats, e.g., a French email with Portuguese PDF attachments. Document
parsing deals with the recognition and "breaking down" of the document
structure into individual components. In this pre-processing phase, unit
documents are created; e.g., emails with attachments are split into one
document representing the email and as many documents as there are
attachments.

2. Lexical Analysis. After parsing, lexical analysis tokenizes a document,
seen as an input stream, into words. Issues related to lexical analysis
include the correct identification of accents, abbreviations, dates, and
cases. The difficulty of this operation depends much on the language at
hand: for example, English has neither diacritics nor cases, French has
diacritics but no cases, and German has both diacritics and cases. The
recognition of abbreviations and, in particular, of time expressions would
deserve a separate chapter due to its complexity and the extensive
literature in the field.

3. Stop-Word Removal. A subsequent step optionally applied to the results
of lexical analysis is stop-word removal, i.e., the removal of
high-frequency words. For example, given the sentence "search engines are
the most visible information retrieval applications" and a classic
stop-word set such as the one adopted by the Snowball stemmer, the effect
of stop-word removal would be: "search engine most visible information
retrieval applications".

4. Phrase Detection. This step captures text meaning beyond what is
possible with pure bag-of-words approaches, thanks to the identification
of noun groups and other phrases. Phrase detection may be approached in
several ways, including rules (e.g., retaining terms that are not
separated by punctuation marks), morphological analysis, syntactic
analysis, and combinations thereof. For example, scanning our example
sentence "search engines are the most visible information retrieval
applications" for noun phrases would probably result in identifying
"search engines" and "information retrieval".

5. Stemming and Lemmatization. Following phrase extraction, stemming and
lemmatization aim at stripping down word suffixes in order to normalize
the word. In particular, stemming is a heuristic process that "chops off"
the ends of words in the hope of achieving the goal correctly most of the
time; a classic rule-based algorithm for this was devised by Porter.
According to the Porter stemmer, our example sentence "Search engines are
the most visible information retrieval applications" would result in:
"Search engin are the most visibl inform retriev applic". Lemmatization
is a process that typically uses dictionaries and morphological analysis
of words in order to return the base or dictionary form of a word, thereby
collapsing its inflectional forms. For example, our sentence would result
in "Search engine are the most visible information retrieval application"
when lemmatized according to a WordNet-based lemmatizer.

6. Weighting. The final phase of text pre-processing deals with term
weighting. As previously mentioned, words in a text have different
descriptive power; hence, index terms can be weighted differently to
account for their significance within a document and/or a document
collection. Such a weighting can be binary, e.g., assigning 0 for term
absence and 1 for presence.
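The binary weighting in phase 6 can be sketched as a 0/1 vector per document over a shared vocabulary; the two documents below are invented.

```python
# Binary term weighting sketch: weight 1 if the term occurs in the
# document, 0 otherwise, over a vocabulary built from the collection.

docs = {
    "d1": "search engines are visible information retrieval applications",
    "d2": "information retrieval systems index documents",
}

vocabulary = sorted({t for text in docs.values() for t in text.split()})

def binary_weights(text, vocabulary):
    tokens = set(text.split())
    return [1 if term in tokens else 0 for term in vocabulary]

for doc_id, text in docs.items():
    print(doc_id, binary_weights(text, vocabulary))
```

More refined schemes (e.g., frequency-based weights) follow the same shape, replacing the 0/1 entries with per-term scores.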

SCOPE OF INDEXING
• When indexing is performed manually, problems arise from two sources:
the author and the indexer.
• The vocabulary domain of the author and the indexer may be different.
• This results in different quality levels of indexing.
• The indexer must determine when to stop the indexing process.
• Two factors decide the level at which to index the concepts in an item:
how exhaustive and how specific the indexing is desired to be.
• Exhaustivity of indexing is the extent to which the different concepts
in the item are indexed. For example, if two sentences of a 10-page item
on microprocessors discuss on-board caches, should this concept be
indexed?
• Specificity relates to the preciseness of the index terms used in
indexing. For example, whether the term "processor", the term
"microcomputer", or the term "Pentium" should be used in the index of an
item is based upon the specificity decision.
• Indexing an item only on the most important concept in it, and using
general index terms, yields low exhaustivity and specificity.
• Another decision in indexing is what portion of an item is to be
indexed. The simplest case is to limit the indexing to the title and
abstract (conceptual) zones.
• General indexing leads to loss of precision and recall.

PREORDINATION AND LINKAGES
• Another decision in the linkage process is whether linkages are
available between index terms for an item.
• Linkages are used to correlate attributes associated with concepts
discussed in an item. This process is called preordination.
• When index terms are not coordinated at index time, the coordination
occurs at search time. This is called post-coordination, implemented by
"AND"ing index terms together.
• A factor that must be determined in the linkage process is the number
of terms that can be related.

AUTOMATIC INDEXING
