Explain Item Normalization?
Item Normalization:
o Normalize the incoming items to a standard format. Standardizing the
input takes the different external formats of the input data and
translates them into formats acceptable to the system. A system may
have a single format for all items or allow multiple formats.
o The next process is to parse the item into logical subdivisions that
have meaning to the user. This process, called "zoning," is visible to
the user and is used to increase the precision of a search and to
optimize the display. An item is subdivided into zones, which may be
hierarchical (Title, Author, Abstract, Main Text, Conclusion, and
References). The zoning information is passed to the processing token
identification operation, which stores it so that searches can be
restricted to a specific zone.
o Once standardization and zoning have been completed, the information
(i.e., words) used in the search process needs to be identified in
the item. The first step in identifying a processing token is
determining what constitutes a word. Systems determine words by dividing
input symbols into three classes: valid word symbols, inter-word symbols,
and special processing symbols.
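The three symbol classes can be illustrated with a minimal sketch. The class membership chosen here is an assumption made for illustration (letters and digits are valid word symbols, hyphen and apostrophe are special processing symbols, everything else is an inter-word symbol), and `split_words` is a hypothetical helper name, not part of any particular system.

```python
# Assumed special processing symbols; real systems define their own set.
SPECIAL = {"-", "'"}

def split_words(text):
    """Split text into words using three symbol classes."""
    words, current = [], []
    for ch in text:
        if ch.isalnum():
            # Valid word symbol: extend the current word.
            current.append(ch.lower())
        elif ch in SPECIAL and current:
            # Special symbol: kept only when it appears inside a word.
            current.append(ch)
        else:
            # Inter-word symbol: close the current word, if any.
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    # Drop special symbols left dangling at the end of a word.
    cleaned = [w.rstrip("-'") for w in words]
    return [w for w in cleaned if w]

print(split_words("The on-board cache, it's fast."))
# ['the', 'on-board', 'cache', "it's", 'fast']
```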
Boolean Logic:
Boolean logic allows a user to logically relate multiple concepts together to
define what information is needed. The typical Boolean operators are AND,
OR, and NOT. Parentheses around portions of the search statement explicitly
specify the order of Boolean operations (i.e., the nesting function). If
parentheses are not used, the system follows a default precedence ordering of
operations.
[Figure: Use of Boolean Operators]
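As a rough illustration, Boolean operators can be modeled as set operations over the item numbers in which each term occurs. The tiny index and the helper names `and_`, `or_`, and `not_` below are hypothetical; function nesting plays the role of parentheses in the search statement.

```python
# Hypothetical inverted index: term -> set of item numbers containing it.
index = {
    "computer": {1, 2, 4},
    "design": {2, 3},
    "processor": {3, 4},
}
all_items = {1, 2, 3, 4}

def and_(a, b):
    return a & b          # items containing both concepts

def or_(a, b):
    return a | b          # items containing either concept

def not_(a):
    return all_items - a  # items that do not contain the concept

# (computer OR processor) AND NOT design
result = and_(or_(index["computer"], index["processor"]), not_(index["design"]))
print(sorted(result))  # [1, 4]
```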
Proximity:
Proximity is used to restrict the distance allowed within an item between two
search terms. The semantic concept is that the closer two terms are found in
a text the more likely they are related in the description of a particular
concept. Proximity is used to increase the precision of a search. If the terms
COMPUTER and DESIGN are found within a few words of each other then
the item is more likely to be discussing the design of computers than if the
words are paragraphs apart.
o The typical format for proximity is: TERM1 within "m units" of TERM2.
The distance operator "m" is an integer and the units are Characters,
Words, Sentences, or Paragraphs. A special case of the Proximity
operator is the Adjacent (ADJ) operator, which normally has a distance
of one and a forward-only direction. Another special case is where the
distance is set to zero, meaning within the same semantic unit.
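A minimal sketch of the proximity test, assuming each term's occurrences are stored as integer positions measured in the chosen unit (word offsets here). The function name `within` and the `forward_only` flag are illustrative assumptions; ADJ is modeled as a distance of one with `forward_only=True`.

```python
def within(positions_a, positions_b, m, forward_only=False):
    """True if some occurrence of term A lies within m units of term B."""
    for a in positions_a:
        for b in positions_b:
            gap = b - a
            if forward_only:
                # Term B must follow term A by at most m units.
                if 0 < gap <= m:
                    return True
            elif abs(gap) <= m:
                # Either direction; m = 0 means the same unit.
                return True
    return False

words = "the design of a computer requires computer design skills".split()
computer = [i for i, w in enumerate(words) if w == "computer"]  # [4, 6]
design = [i for i, w in enumerate(words) if w == "design"]      # [1, 7]

print(within(computer, design, 1))                     # True
print(within(design, computer, 1, forward_only=True))  # False
```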
Contiguous Word Phrases:
A Contiguous Word Phrase (CWP) is both a way of specifying a query term
and a special search operator. A Contiguous Word Phrase is two or more
words that are treated as a single semantic unit. An example of a CWP is
"United States of America." It is four words that specify a search term
representing a single specific semantic concept (a country) that can be
used with any of the operators discussed above. Thus a query could specify
"manufacturing" AND "United States of America," which returns any item
that contains the word "manufacturing" and the contiguous words "United
States of America." A contiguous word phrase also acts like a special
search operator that is similar to the proximity (Adjacency) operator but
allows for additional specificity.
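The phrase-as-operator behavior can be sketched as a search for the phrase's words in consecutive positions of an item's word list. `contains_phrase` is a hypothetical helper, and lowercased whitespace tokenization is assumed for simplicity.

```python
def contains_phrase(tokens, phrase):
    """True if the words of phrase occur contiguously, in order, in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

item = "manufacturing in the united states of america grew".split()
cwp = "united states of america".split()

# "manufacturing" AND "United States of America"
print("manufacturing" in item and contains_phrase(item, cwp))  # True
print(contains_phrase(item, ["states", "united"]))             # False
```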
Fuzzy Searches:
o Fuzzy Searches provide the capability to locate spellings of words that
are similar to the entered search term. This function is primarily used
to compensate for errors in the spelling of words. Fuzzy searching
increases recall at the expense of decreasing precision. A Fuzzy
Search on the term "computer" would automatically include the
following words from the information database: "computer,"
"compiter," "conputer," "computter," "compute." An additional
enhancement may look up the proposed alternative spelling and, if it is
a valid word with a different meaning, include it in the search with a
low ranking or not include it at all (e.g., "commuter"). In the process
of expanding a query term, fuzzy searching includes other terms that
have similar spellings, giving more weight (in systems that rank
output) to words in the database that have similar word lengths and
character positions to the entered term.
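One possible way to measure "similar spelling" is edit distance (the number of single-character insertions, deletions, or substitutions). The notes do not name a specific similarity measure, so edit distance is an assumption here, as are the helper names and the threshold of one edit.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def fuzzy_expand(term, vocabulary, max_dist=1):
    """Return (word, distance) pairs within max_dist edits, closest first."""
    hits = [(w, edit_distance(term, w)) for w in vocabulary]
    return sorted((h for h in hits if h[1] <= max_dist),
                  key=lambda h: (h[1], h[0]))

vocabulary = ["computer", "compiter", "conputer", "computter",
              "compute", "commuter", "design"]
for word, dist in fuzzy_expand("computer", vocabulary):
    print(word, dist)
```

Note that "commuter" is also one edit away from "computer," which is why a system may check such candidates against a dictionary and rank them low or exclude them.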
Term Masking:
Term masking is the ability to expand a query term by masking a portion of
the term and accepting as valid any processing token that maps to the
unmasked portion of the term. The value of term masking is much higher in
systems that do not perform stemming or only provide a very simple
stemming algorithm. There are two types of search term masking: fixed
length and variable length. Fixed length masking is a single position mask.
It masks out any symbol in a particular position or the lack of that position in
a word. Variable length "don't cares" allow masking of any number of
characters within a processing token.
[Figure: Term Masking (Variable Length)]
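Both kinds of masking can be sketched by translating a masked term into a regular expression. The mask symbols are assumptions chosen for this example: `*` stands for a variable length "don't care" and `?` for a fixed length mask that matches any single symbol or its absence. The helper names are hypothetical.

```python
import re

def mask_to_regex(mask):
    """Translate a masked term into a compiled regular expression."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")   # variable length: any number of characters
        elif ch == "?":
            parts.append(".?")   # fixed length: one symbol or none
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def expand_mask(mask, tokens):
    """Return the processing tokens that match the masked term."""
    pattern = mask_to_regex(mask)
    return [t for t in tokens if pattern.fullmatch(t)]

tokens = ["computer", "computers", "compute", "minicomputer",
          "multinational", "multi-national"]
print(expand_mask("comput*", tokens))         # ['computer', 'computers', 'compute']
print(expand_mask("*computer*", tokens))      # ['computer', 'computers', 'minicomputer']
print(expand_mask("multi?national", tokens))  # ['multinational', 'multi-national']
```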
Concept/Thesaurus Expansion:
Associated with both Boolean and Natural Language Queries is the ability to
expand the search terms via a Thesaurus or Concept Class database reference
tool. A Thesaurus is typically a one-level or two-level expansion of a term
to other terms that are similar in meaning. A Concept Class is a tree
structure that expands each meaning of a word into potential concepts that
are related to the initial term.
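A one-level thesaurus expansion might look like the following sketch. The thesaurus entries and the name `expand_query` are invented for illustration; a real system would draw on a curated thesaurus or concept class database.

```python
# Hypothetical one-level thesaurus: term -> terms similar in meaning.
THESAURUS = {
    "computer": ["processor", "machine"],
    "design": ["architecture"],
}

def expand_query(terms, thesaurus):
    """Add each term's thesaurus entries, keeping order and skipping repeats."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["computer", "cache"], THESAURUS))
# ['computer', 'processor', 'machine', 'cache']
```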
SCOPE OF INDEXING
• When indexing is performed manually, problems arise from two sources: the author and the indexer.
• The vocabulary domains of the author and the indexer may differ.
• This results in different quality levels of indexing.
• The indexer must determine when to stop the indexing process.
• Two factors decide the level at which to index the concepts in an item: the exhaustivity and the specificity of the indexing desired.
• Exhaustivity of the index is the extent to which the different concepts in the item are indexed.
• For example, if two sentences of a 10-page item on microprocessors discuss on-board caches, should this concept be indexed?
• Specificity relates to the preciseness of the index terms used in indexing.
• For example, whether the term "processor," the term "microcomputer," or the term "Pentium" should be used in the index of an item is based upon the specificity decision.
• Indexing an item only on the most important concept in it and using general index terms yields low exhaustivity and specificity.
• Another decision on indexing is what portion of an item is to be indexed. The simplest case is to limit the indexing to the title and abstract (conceptual) zone.
• General indexing leads to loss of precision and recall.
PRECOORDINATION AND LINKAGES
• Another decision in the linkage process is whether linkages are available between index terms for an item.
• Linkages are used to correlate attributes associated with concepts discussed in an item. This process is called precoordination.
• When index terms are not coordinated at index time, the coordination occurs at search time. This is called postcoordination and is implemented by "AND"ing index terms.
• Factors that must be determined in the linkage process include the number of terms that can be related.
AUTOMATIC INDEXING