RESEARCH PROJECTS
Amanda Spink
School of Information Sciences
University of Pittsburgh
1
Research Areas
Human Information Behavior
Cognitive/Interactive Information Retrieval
Web Search
Charles Oppenheim & Amanda Spink (Eds.).
Handbook of Information Science and
Information Management. Sage.
2
Human Information Behavior Projects
Evolutionary Human Information Behavior
Integrated Human Information Behavior Framework
Multitasking Framework for Information Behavior
- Web search
- Public libraries study
- Nuclear power plant crews study
Amanda Spink & Charles Cole. (2005). New
Directions in Human Information Behavior.
Springer.
3
Cognitive Information Retrieval
Projects
Multitasking Framework for IR
Multitasking Web Search study
IR Evaluation Measures study – Monica Landoni
(Strathclyde University)
Amanda Spink & Charles Cole (2005). New
Directions in Cognitive Information Retrieval.
Springer.
4
Web Search
Vivisimo.com – clustering
InfoSpace, Inc. – meta-search
Alioplex – personalization
Reuters Ltd – search redevelopment
Amanda Spink & Jim Jansen (2nd Edition in 2006).
Web Search: Public Searching of the Web.
Springer.
5
Web Searching Trends: 1997-2004
Jim Jansen
IST, Penn State
Sherry Koshman
SIS, University of Pittsburgh
6
Web Search Book
Amanda Spink & Bernard J. Jansen (2004).
Web Search: Public Searching of the Web. Springer.
7
Web Search Engines
Large-scale Web transaction logs: 1997 – 2004
Excite.com
AlltheWeb.com
AltaVista.com
AskJeeves.com
Vivisimo.com
No Google or MSN! They haven’t been
collaborating with academics
8
Research Goals
Track Web search trends – user focus
Identify characteristics of Web searching - session
length, query length, and use of query operators.
Examine the distribution of query topics, terms,
queries, sources, and languages used on Web
search engines.
Implications for theoretical and user models, and
Web services, interfaces and systems design.
9
Web Search Engines
Search capabilities:
– Up to 10 terms per query; default OR
– Advanced search:
Boolean AND, OR, AND NOT & parentheses
“phrase” : must appear in answer
+ or - before term must or must not be in answer
– proprietary algorithms & concept linking method, but
follow basic information retrieval
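A minimal sketch of how operator use in logged queries might be detected, assuming the syntax listed above (uppercase Boolean keywords, quoted phrases, a leading + or -, parentheses). The function name and the regular expressions are illustrative assumptions, not the parsing rules of any of the engines studied.

```python
import re

# Illustrative patterns only; each engine's real syntax differed in detail.
BOOLEAN_RE = re.compile(r"\b(?:AND NOT|AND|OR)\b")   # uppercase keywords only (assumption)
PHRASE_RE = re.compile(r'"[^"]+"')                    # "quoted phrase"
PLUS_MINUS_RE = re.compile(r"(?:^|\s)[+-]\w")         # +term or -term at a word start


def query_operators(query):
    """Return the set of operator types that appear in one query string."""
    found = set()
    if PHRASE_RE.search(query):
        found.add("phrase")
    # Blank out quoted phrases so operators inside quotes are not counted.
    unquoted = PHRASE_RE.sub(" ", query)
    if BOOLEAN_RE.search(unquoted):
        found.add("boolean")
    if PLUS_MINUS_RE.search(unquoted):
        found.add("plus_minus")
    if "(" in unquoted and ")" in unquoted:
        found.add("parentheses")
    return found
```

Matching only uppercase AND/OR keeps ordinary words such as "and" in natural-language queries from being counted as Boolean use; a study that counts lowercase forms would need a different rule.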
10
Web Query Datasets
51,000 queries by 18,000 Excite users collected
in 1997
1 million query transaction logs from various Web
search engines – 1997 to 2004
Dataset of 20 million+ Web queries
11
Terms Per Query: 1997-2001
Terms 1997 1999 2001
Mean 2.32 2.4 2.35
1 term 31% 26% 26%
2 31% 31% 26%
3 18% 18% 15%
12
Terms Per Query: 2002-2004
Alta Vista (2002) – 2.9 terms per query
1 term=20.4% 2 terms=30.8% 3+ terms=48.5%
Vivisimo (2004) – 3.1 terms per query
1 term=20.3% 2 terms=29.4% 3+ terms=50.2%
Most queries include 2-3 terms
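The terms-per-query figures above can be computed from a list of raw query strings roughly as follows. This is a sketch that assumes whitespace tokenization; the function name and the 1 / 2 / 3+ grouping are illustrative choices, and the original studies may have tokenized differently.

```python
from collections import Counter


def terms_per_query_stats(queries):
    """Mean terms per query plus the share of 1-, 2- and 3+-term queries."""
    # Whitespace tokenization is an assumption; blank queries are skipped.
    lengths = [len(q.split()) for q in queries if q.strip()]
    if not lengths:
        raise ValueError("no non-empty queries")
    total = len(lengths)
    # Cap lengths at 3 so every longer query lands in the "3+ terms" group.
    groups = Counter(min(n, 3) for n in lengths)
    return {
        "mean": sum(lengths) / total,
        "1 term": groups[1] / total,
        "2 terms": groups[2] / total,
        "3+ terms": groups[3] / total,
    }
```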
13
Queries Per User: 1997-2001
Queries 1997 1999 2001
Mean 2.8 2.5 2
1 query 67% 48% 78%
2 19% 21% 13%
3 7% 11% 4%
14
Queries Per User: 2002 & 2004
Alta Vista (2002) – mean 2.9 queries per user
Vivisimo (2004) – mean 3.8 queries per user
Short user sessions, but growing in length
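A sketch of the queries-per-user measure, assuming each log record reduces to a (user id, query) pair and that the user id (for example an IP address) stands in for one user. Both the record layout and the function name are assumptions made for illustration.

```python
from collections import Counter


def queries_per_user(records):
    """Return (mean queries per user, share of users who entered one query).

    records: iterable of (user_id, query) pairs; user_id is assumed to
    identify a single user, which an IP address only approximates.
    """
    per_user = Counter(user for user, _ in records)
    if not per_user:
        raise ValueError("no records")
    n_users = len(per_user)
    mean = sum(per_user.values()) / n_users
    single = sum(1 for count in per_user.values() if count == 1) / n_users
    return mean, single
```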
15
Session Duration (Minutes): 2002 & 2004
Alta Vista (2002)
71.6% sessions less than 5 minutes
Vivisimo (2004)
50% sessions less than 1 minute
10% sessions 1-5 minutes
45% sessions longer than 5 minutes
Most sessions less than 5 minutes
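Session duration can be approximated as the time between a user's first and last logged event. The sketch below assumes one session per user id and (user id, timestamp) records; the bucket boundaries mirror the slide, but the session definition and the function name are assumptions.

```python
from datetime import datetime


def session_duration_shares(events):
    """Share of sessions under 1 minute, 1-5 minutes, and over 5 minutes.

    events: iterable of (user_id, datetime) pairs. Each user id is treated
    as one session (assumption); a single-event session has duration 0.
    """
    first, last = {}, {}
    for user, ts in events:
        first[user] = min(ts, first.get(user, ts))
        last[user] = max(ts, last.get(user, ts))
    if not first:
        raise ValueError("no events")
    buckets = {"<1 min": 0, "1-5 min": 0, ">5 min": 0}
    for user in first:
        minutes = (last[user] - first[user]).total_seconds() / 60
        if minutes < 1:
            buckets["<1 min"] += 1
        elif minutes <= 5:
            buckets["1-5 min"] += 1
        else:
            buckets[">5 min"] += 1
    n = len(first)
    return {label: count / n for label, count in buckets.items()}
```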
16
Use of Boolean Operators: 1997-2001
                  1997   1999   2001
In % of queries   >10%   >5%    20%
17
Use of Boolean Operators: 2002 & 2004
Alta Vista (2002) – 20% sessions included Boolean
operators
Vivisimo (2004) – 2.6% sessions included Boolean
operators
Varied use of Boolean operators, but use may be
incorrect
18
Query Reformulation: 2002 & 2004
Alta Vista (2002) – 52.4%
Vivisimo (2004) – 62%
More users modifying queries
19
Pages Viewed Per User: 1997-2001
Pages 1997 1999 2001
Mean 2.3 1.8 1.7
1 page 58% 29% 43%
2 19% 19% 21%
3 9% 14%
20
Pages Viewed Per User: 2002 & 2004
Alta Vista (2002) – mean 1 page viewed
1 page=72.8% 2 pages=13% 3+ pages=14.1%
Vivisimo (2004) – 1 page viewed
Limited page viewing – little click through studies
21
Term Distribution
22
Top 10 Query Terms: 1997-2001
97 99 01
sex sex sex
nude free Christmas
free nude nude
pictures pictures pictures
new university new
university pics pics
women chat music
chat adult university
gay women games
girls new porn
23
Top 10 Query Terms: Alta Vista (2002) & Vivisimo (2004)
Alta Vista 2002   Vivisimo 2004
Download          Free
Sex               New
Pictures          Software
New               Windows
Nude              Sex
Music             School
School            History
How               Online
Lyrics            Video
Home              What
24
Top 10 Co-Occurring Query
Terms: 1997-1999
97 99
free - pics new - york
university - of free - sex
new - york free - pics
free - sex university - of
real - estate pictures - of
home - page greeting - cards
free - nude britney - spears
pictures - of free - nude
free - pictures free - pictures
high - school real - estate
25
Top 10 Co-Occurring Vivisimo Query
Terms - 2004
and & and
free & download
for & the
for & sale
windows & xp
to & in
britney & spears
what & the
high & school
for & in
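Co-occurring term pairs like those listed above can be counted by pairing up the terms within each query. This sketch assumes lowercase, whitespace-split terms and unordered pairs; the function name is hypothetical and the original studies' exact counting rules are not given in the slides.

```python
from collections import Counter
from itertools import combinations


def top_cooccurring_pairs(queries, k=10):
    """Return the k most frequent unordered term pairs across queries."""
    pair_counts = Counter()
    for query in queries:
        # Deduplicate and sort terms so each pair is counted once per query
        # and always appears in the same (alphabetical) order.
        terms = sorted(set(query.lower().split()))
        pair_counts.update(combinations(terms, 2))
    return pair_counts.most_common(k)
```

Because terms are deduplicated per query, a pair made of one repeated term (such as "and & and") is never produced; replacing `sorted(set(...))` with `sorted(...)` would count such repeats.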
26
Query Subjects: 1997-1999
Subject Category 1997 1999 (1999 rank)
1. Entertainment, recreation 16.9% 7.5% (6)
2. Sex, porn, preferences 16.8% 7.5% (4)
3. Commerce, travel, economy 13.3% 24.4% (1)
4. Computers & the Internet 12.5% 10.9% (3)
5. Health & the sciences 9.5% 7.8% (5)
6. People, places, things 6.7% 20.3% (2)
7. Society, culture, religion 5.7% 4.2% (9)
8. Education & the humanities 5.6% 5.3% (8)
9. Performing & fine arts 5.4% 1.1% (11)
10. Government 3.4% 1.6% (10)
11. Incomprehensible 4.1% 6.8% (7)
27
Query Subjects – Alta Vista 2002 & Vivisimo 2004
Rank  Alta Vista 2002                 Vivisimo 2004
1.    People/Places          49.2%    Commerce, etc.         21%
2.    Commerce, etc.         12.5%    Indiscernible          19%
3.    Computers, etc.        12.4%    People/Places, etc.    15%
4.    Health/Sciences         7.4%    Computers/Internet     13%
5.    Education/Humanities    5%      Social/Culture          9%
6.    Entertainment, etc.     4.5%    Health/Sciences         6%
7.    Sex/Pornography         3.2%    Education/Humanities    5%
8.    Society/Culture, etc.   3.1%    Sex/Pornography         4%
9.    Government              1.5%    Performing/Fine Arts    3%
10.   Performing/Fine Arts    0.6%    Government              3%
11.                                   Entertainment, etc.     2%
28
Web Query Trends: 1997-2004
From 1997 to 1999 - shift from entertainment/sex
queries to e-commerce/people queries
Growth of non-English queries
Sex/pornography queries – U.S. less than 5% &
Europe 8%
29
Web Search Trends: 1997-2004
Users: not many queries per search
Terms: not many per query
– in traditional IR queries 3 to 7 times larger
Low Boolean use
30
Web Search Trends: 1997-2004
Users do not view many pages
Growing query reformulation
From 1997 to 2004 – many aspects of Web searching did
NOT change dramatically
31
Web Search Trends: 1997-2004
Frequency of use of terms is highly skewed
– terms that were used only once
– Web query language quite unique
Sex represents a small proportion of all categories
– great many other topics searched
– diversity of subjects searched very high
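One way to quantify how skewed term use is: measure the share of distinct terms that occur only once and the share of all term occurrences taken by the most frequent term. The sketch assumes lowercase, whitespace-split terms; the function name and the returned keys are illustrative.

```python
from collections import Counter


def term_skew(queries):
    """Summarize how unevenly query terms are used across a query log."""
    counts = Counter(t for q in queries for t in q.lower().split())
    if not counts:
        raise ValueError("no terms")
    unique_terms = len(counts)
    total_occurrences = sum(counts.values())
    used_once = sum(1 for c in counts.values() if c == 1)
    top_term, top_count = counts.most_common(1)[0]
    return {
        "unique_terms": unique_terms,
        "share_used_once": used_once / unique_terms,
        "top_term": top_term,
        "top_term_share": top_count / total_occurrences,
    }
```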
32
Web Search Trends: 1997-2004
Move to more complex searches – Alta Vista
Successive searches – related searches over time
Multitasking searches – multiple topic sessions
33
Vivisimo Project
34
Vivisimo Project
CMU spin off company June 2000
Meta-searching environment
Dynamically generates clusters
35
36
Research Questions
What are the general characteristics of Vivisimo
searching?
What are the characteristics of cluster–based
searching?
What is the extent of cluster expansion by users?
What is the distribution of clusters?
37
Vivisimo Transaction Log Data
March 28 - April 04, 2004
1,200,000 Records
193,277 IP Addresses
Quantitative and Qualitative Analyses
38
Vivisimo Queries 2004
Mean queries per day 135,304
80% of queries on weekdays
1 term (20.3%) 2 terms=29.4% 3+ terms (22.6%)
Mean sessions < 1 minute (45%)
39
Language and Source Requests
Indiscernible and Non-English 19%
Language: none specified (90%)
Source: Web (87.6%) GermanWeb (6.8%)
40
Cluster Analysis Transaction Log
April 25 to May 02, 2004
196,802 IP addresses
4,219,925 records
41
Queries Per IP Address
[Figure: bar chart. x-axis: Number of Queries (1-20); y-axis: IP Addresses (0-90,000)]
42
Log File Records
Frame Structure
44% List Records
29% Tree Records
Post Query Records
48.2% records show clusters clicked
43
Cluster Expansion
[Figure: bar chart. x-axis: Number of Clusters (1-26); y-axis: Percentage of Records (0.00%-2.50%)]
44
Vivisimo User Summary
Majority of searches on weekdays
Over 100,000 searches entered per day
Highest % of queries contained 2 terms
Web primary information source
45
Vivisimo Summary
Low use of language preferences
Infrequent cluster tree manipulation
50% - post-query records show cluster clicking
activity to view the results pages of a cluster.
46
47
Future Vivisimo Research
Cluster analysis on per query basis
Usability - including cluster label selection, cluster
depth, “find in clusters”
Comparative analysis with Clusty transaction log
48
Conclusions
Web search technology is changing, however many
user search characteristics are relatively stable
More successive and multitasking searches
Will the introduction of new Web search technology
(e.g. history, visualization) impact user search
characteristics?
49
Conclusions
Need for more comparison of Web search engine
performance
Comparison of single versus meta-search engines
Need for better user-based evaluation measures
Better usability testing of Web search engine
interfaces and techniques
50
Conclusions
Need for improved users and better Web technology
Technology not the complete answer – more user
awareness of search process
Many spelling and Boolean errors
Many Web search features need to be studied to
determine the way searchers use the Web (e.g.
Advanced Search?)
51
Ongoing Web Research Projects
Vivisimo.com – clustering
InfoSpace, Inc. – meta-search
Alioplex – personalization
Reuters Ltd – search redevelopment
Amanda Spink & Jim Jansen (2006). Web Search:
Public Searching of the Web. Springer.[2nd Edition]
52