Enhancing the User Experience by Capitalizing on Link Traversal
Presented at SIGIR 98.
Andrea Michalek
and Douglas Grundman
July 1998
Introduction
Hypertext information retrieval systems
interact with users through two principal mechanisms: a search
mechanism, wherein users give the system information describing document
attributes and the system responds with documents, and a traversal
mechanism, wherein users follow links that bind documents together in the
hypertext corpus. These two mechanisms were developed separately, and
historically have been treated independently.
More recently, with the advent of the World Wide Web, work
has been done to try to merge aspects of these two functions in order to
improve users’ experiences. Most of this work involves analyzing links and
the directed graph of documents that forms a hypertext corpus in order to
extract more information about the documents; this is typically done with an
eye toward improving searching.
In this paper, we take the opposite approach: we use
information gleaned from an analysis of documents and queries in order to
improve the experience of hypertext traversal. The simplest example is the
highlighting of query words in documents that result from a search, but there
are many others.
Search Information vs. Link Information
A user interacting with a hypertext
information retrieval system accesses documents through two main methods –
by searching for them and by following links to them. Accordingly, we view
such a system as having two distinct kinds of navigational information that
people can make use of. These are search information and link
information.
The searching method of accessing documents consists of
submitting a query and receiving a list of documents that satisfy it. Search
information is the information that is provided by a user in order to perform a
search over a corpus. People use search information by submitting it as a
query with a goal of accessing documents identified by that information. The
hypertext information retrieval system sees search information as a parameter
of a "search" function that operates over some set of documents. The
output of that search function is a collection of links to documents that meet
criteria defined by the query.
The link traversal method of accessing documents consists of
a user’s clicking on a link and receiving the document to which that link
refers. Link information is defined as references within documents to the same
or other documents – in other words, pointers within a hypertext data
structure. Link information is the input parameter to a "graph
traversal" function that operates over the directed graph defined by the
body of hypertext. The output of the traversal function is a document.
There is a duality between these two functions that underlie
the usage of hypertext systems: the search function takes nodes and
generates links; the traverse function takes links and generates
nodes. It follows that if link information can be used to improve a user’s
search experience, then search information should be exploitable to improve a
user’s traversal experience.
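The duality can be made concrete with a minimal sketch. The names `Link`, `Document`, `search`, and `traverse` are illustrative assumptions, and the substring match stands in for whatever ranking a real search engine would apply.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    target: str  # identifier of the document the link refers to


@dataclass
class Document:
    doc_id: str
    text: str
    links: list = field(default_factory=list)  # outgoing Link objects


def search(corpus, query):
    """Search function: takes nodes (documents) and returns links.

    Assumption: a document satisfies the query if it contains any query word.
    """
    terms = query.lower().split()
    return [Link(doc.doc_id) for doc in corpus.values()
            if any(term in doc.text.lower() for term in terms)]


def traverse(corpus, link):
    """Traversal function: takes a link and returns a node (document)."""
    return corpus[link.target]


corpus = {
    "a": Document("a", "Hypertext link traversal", [Link("b")]),
    "b": Document("b", "Query evaluation in retrieval systems"),
}
results = search(corpus, "traversal")           # nodes in, links out
first = traverse(corpus, results[0])            # link in, node out
```

The output of one function is a valid input to the other, which is what allows information to be carried from a search into the traversals that follow it.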
Improving the User Experience
Search information can be used to
assist a user when traversing links. Under the assumption that the user
submitting a query is interested in documents that are returned because of
their relevance to the query, the following improvements to the user experience
can be made:
- Exposing the reasons any particular document is
returned from the search engine. Documents in a result list are selected
by a search engine for various reasons. Displaying those reasons to the user
can be helpful in permitting a quick evaluation of the value of the document.
When displaying a result list, a single relevancy score per document is
commonly used to convey this information. This score can easily be expanded to
include per-query-term relevancy information. The concept of passing search
information to the link traversal function can be exploited to achieve the next
level of functionality along this path. When a user selects a document from
the result list, the original query terms can be called out to expose the same
hit information at an additional level of granularity – not only what terms
hit and how many times, but also where they hit.
- Showing the relevant part of a long document without
user intervention. Documents can be very long. If a system can locate the
paragraphs within a document that best match the given query and allow the user
to navigate to those positions quickly and easily, the user may be able to
extract what he or she finds to be the most relevant information from the
document very quickly.
- Illuminating related metadata. It can be very
useful to display metadata about a document that is relevant to the query that
led to the document. For example, it is possible to build a system that knows
about person names and about associations among them. If a document is located
because of a query that contains the name of a person, it might be very
interesting to highlight the names of other people related to the one named in
the query. The same concept can be extended to other types of
metadata.
- Propagating search information along subsidiary
links. If search information is useful in illuminating a given document,
it should also be useful for illuminating documents to which the first one
links. The value provided by the above three improvements can be extended
through the entire hypertext network that is rooted at the original
document.
As the user traverses the hypertext network originating at a
particular document, the search information that led to the first document
becomes less important and can be de-emphasized. Additional information can be
gleaned from the text of the link(s) being traversed, and after several hops
the original search information can be phased out.
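One way to picture this phasing out is a per-hop weight decay, sketched below. The decay factor, the cutoff, and the function name `propagate` are assumptions made for illustration; the text above does not prescribe a particular scheme.

```python
def propagate(terms, link_text, decay=0.5, cutoff=0.2):
    """Carry weighted search terms across one link traversal.

    terms: dict mapping each term to its current weight.
    Existing terms are de-emphasized by `decay` and dropped once they fall
    below `cutoff`; words from the traversed link's text enter at full weight.
    """
    carried = {}
    for term, weight in terms.items():
        faded = weight * decay
        if faded >= cutoff:
            carried[term] = faded
    for word in link_text.lower().split():
        carried[word] = max(carried.get(word, 0.0), 1.0)
    return carried


# Search information from the original query, then three hops.
info = {"jazz": 1.0}
hop1 = propagate(info, "Miles Davis")
hop2 = propagate(hop1, "discography")
hop3 = propagate(hop2, "albums")
```

With these assumed settings the original query term survives two hops at reduced weight and disappears on the third, while the text of each traversed link contributes fresh terms.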
Technologies that Enable Implementation
Straightforward techniques can be used
to exploit search information to improve the user’s traversal experience.
To accomplish this, the traversal function must be extended to handle any
search information that will be exposed. Typically, the traversal function is
implemented by simple hyperlinks to documents, but the dynamic nature of the
exposure of search information requires an active mechanism. This is most
easily provided on the Web by Common Gateway Interface (CGI) programs.
To accomplish the four enhancements described in the
previous section, static hyperlinks are wrapped by a CGI that implements the
desired functionality. Thus on the Web, instead of having a simple URL, each
hyperlink reference to a document is converted to a reference to a CGI that
accepts the original URL as well as the desired search information as
arguments. This CGI would fetch the referenced document and output HTML marked
up to display the additional search information. This amounts to passing the
search information to the traversal function (which is now implemented via the
CGI) by annotating the link. The traversal function (CGI) uses the annotation
to generate a marked-up representation of the document that displays the
enhancements to the user. This same approach can be used if the original URL
is a reference to a CGI as well.
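The link annotation can be sketched as a URL rewrite. The CGI path `/cgi-bin/view` and the parameter names `url` and `q` are hypothetical; only the idea of passing the original URL plus the search information as arguments comes from the description above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

VIEWER = "/cgi-bin/view"  # hypothetical path of the wrapping CGI program


def annotate_link(url, query_terms):
    """Rewrite a static hyperlink so it points at the CGI wrapper.

    The original URL and the search information travel as arguments.
    """
    return VIEWER + "?" + urlencode({"url": url, "q": " ".join(query_terms)})


def parse_annotation(annotated):
    """Inside the CGI: recover the original URL and the search terms."""
    params = parse_qs(urlsplit(annotated).query)
    return params["url"][0], params["q"][0].split()


link = annotate_link("http://example.com/a.html?x=1", ["link", "traversal"])
original_url, terms = parse_annotation(link)
```

Because the original URL is percent-encoded as a single argument, it may itself carry a query string, which covers the case where the wrapped link already refers to a CGI.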
Note that enhancing the traversal function in this way is
dual to the way that link information can be used to enhance the search
function. In both cases, the function is extended to accept additional
arguments necessary for the extended functionality.
The starting place for implementation must be a dynamically
generated HTML page. This requirement fits well with a traditional information
retrieval system. The first result list in response to a given query is
already dynamically generated by design.
Example Implementation – Calculating the Best Part of a Document
The Electric Library research product
(http://www.elibrary.com) takes advantage of this type of information passing
with its "Best Part" feature. The question the user asked to
generate a result list is passed via a link annotation to each document the
user views. The HTML document that the user will ultimately view is augmented
with data about how the user arrived at the document. That augmentation is
performed by the Electric Library’s "Best Part" module.
The Best Part module is handed a document and two sets of
words. One set of words contains the original query terms. The second set of
words contains other related words that may be of interest to the user. The
module’s primary function is to locate all places in the document where any
of those words (or permissible linguistic variants thereof) occur. We term all
such occurrences matches. Matches are termed hits if the words
they match are part of the query. The Best Part module’s second function is
to distinguish one hit as special. That hit is called the Best Part of
the document; it is the hit closest to the region of highest hit density in the
document.
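The following is a minimal sketch of this computation, not the Electric Library implementation. It assumes exact lowercase word matching (no linguistic variants), measures position in words, and takes "region of highest hit density" to mean the fixed-size word window containing the most hits; the window size and function names are assumptions.

```python
import re


def find_matches(text, query_terms, related_terms=()):
    """Return (word_position, word, is_hit) for every match in the text.

    A match is a hit when the matched word is one of the query terms.
    """
    query = {t.lower() for t in query_terms}
    related = {t.lower() for t in related_terms}
    matches = []
    for pos, m in enumerate(re.finditer(r"\w+", text)):
        word = m.group().lower()
        if word in query:
            matches.append((pos, word, True))
        elif word in related:
            matches.append((pos, word, False))
    return matches


def best_part(matches, window=20):
    """Return the position of the hit closest to the densest hit region."""
    hits = [pos for pos, _, is_hit in matches if is_hit]
    if not hits:
        return None
    # Try a window starting at each hit and keep the one holding most hits.
    best_start, best_count = hits[0], 0
    for start in hits:
        count = sum(1 for h in hits if start <= h < start + window)
        if count > best_count:
            best_start, best_count = start, count
    in_window = [h for h in hits if best_start <= h < best_start + window]
    center = sum(in_window) / len(in_window)
    return min(hits, key=lambda h: abs(h - center))


text = "jazz " + "filler " * 30 + "jazz trumpet jazz miles jazz"
matches = find_matches(text, ["jazz"], related_terms=["miles"])
best = best_part(matches)
```

In this example the isolated hit at the start of the text is ignored in favor of the cluster near the end, and the related word "miles" is recorded as a match but does not influence the choice.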
A separate module takes a document and the information
provided by the Best Part module, and post-processes the document’s HTML to
expose the Best Part information to the user. The hits are highlighted in the
document and the Best Part is further emphasized to make it stand out from the
other hits. A button is provided that contains an intra-document link to the
Best Part; this allows the user to jump directly to the section of the document
that is most relevant to the submitted query. In addition, intra-document
links are added that lead from each hit to the next (and to the previous); this
traversal mechanism allows the user to navigate from one hit to the next
easily.
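The markup step might look like the sketch below. It assumes plain-text input rather than existing HTML, and the tag choices, the `hitN` anchor names, and the function name `mark_up` are illustrative.

```python
import html
import re


def mark_up(text, query_terms, best_index=0):
    """Highlight hits, emphasize the Best Part, and chain hits with links."""
    query = {t.lower() for t in query_terms}
    tokens = re.split(r"(\w+)", text)  # keeps words and separators
    total = sum(1 for tok in tokens if tok.lower() in query)
    out, n = [], 0
    for tok in tokens:
        if tok.lower() not in query:
            out.append(html.escape(tok))
            continue
        tag = "strong" if n == best_index else "b"
        prev_link = f'<a href="#hit{n - 1}">&lt;</a>' if n > 0 else ""
        next_link = f'<a href="#hit{n + 1}">&gt;</a>' if n < total - 1 else ""
        out.append(f'{prev_link}<{tag} id="hit{n}">{html.escape(tok)}'
                   f'</{tag}>{next_link}')
        n += 1
    # The "button" is rendered here as a plain intra-document link.
    button = f'<a href="#hit{best_index}">Best Part</a>\n' if total else ""
    return button + "".join(out)


page = mark_up("Jazz & more jazz", ["jazz"], best_index=1)
```

Each hit carries its own anchor, so the jump to the Best Part and the hit-to-hit navigation are both ordinary intra-document links.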
In addition to query terms, matches resulting from
linguistic metadata analysis of documents may be used to further augment
document viewing. These words can result from analysis of the document in
light of the query and/or our preconceived notions of what might be important
(for example, names of people or places), or from other analyses. One type of
such data that we’ve found to be interesting is derived from a
"what’s common" analysis of the entire result set stemming from a
search. This type of metadata is called Recurring Themes™ on the
Electric Library products because it captures the prominent concepts that tie a
result set together. Recurring Themes™ such as people, places,
subjects, etc. are detected by another Electric Library module, and are used to
enhance the user experience in various ways; typically, users are given the
option of limiting the result set based on the inclusion of a user-selectable
Recurring Theme™. The selected Recurring Theme™ can
then be passed over document links to the Best Part and HTML markup modules for
appropriate treatment. Each match of a Recurring Theme™ word is
highlighted in the document, and included in the traversal mechanism provided
for hits. However, since matches from Recurring Themes™ are of
lesser importance than hits from the query terms, these matches are not
included in the Best Part computation. This allows the document to be
presented to the user in a way that relates to the initial question that was
asked as well as how the user limited his or her result set.
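As a rough illustration of a "what's common" analysis, the sketch below uses a naive stand-in: it ranks words by the number of result documents in which they appear. This is an assumption for illustration only; how the actual module detects themes is not described here, and the function names are hypothetical.

```python
import re
from collections import Counter


def recurring_themes(result_docs, stopwords=frozenset(), top=3):
    """Return the words shared by the most documents in a result set."""
    doc_freq = Counter()
    for text in result_docs:
        doc_freq.update(set(re.findall(r"\w+", text.lower())) - stopwords)
    # Keep only words that recur, i.e. occur in more than one document.
    common = [(word, n) for word, n in doc_freq.items() if n > 1]
    common.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in common[:top]]


def limit_by_theme(result_docs, theme):
    """Limit the result set to documents that include the selected theme."""
    return [text for text in result_docs
            if theme.lower() in re.findall(r"\w+", text.lower())]


docs = [
    "Miles Davis played trumpet",
    "Davis recorded in New York",
    "Coltrane joined Davis in New York",
]
themes = recurring_themes(docs, stopwords=frozenset({"in"}))
limited = limit_by_theme(docs, "york")
```

The selected theme would then travel over document links alongside the query terms, to be highlighted as a match without taking part in the Best Part computation.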
Conclusion
In a dynamic hypertext-based
information retrieval system, significant value can be added by capturing
relevant search-related data and using it to enhance the hypertext traversal
process by passing it along the links that a user traverses. The documents
themselves can behave as more than static pieces of content; each can be
dynamically enhanced based upon what information is carried over links leading
to it. This model can be used any place that a user is traversing a set of web
pages that result (possibly indirectly, through subsidiary hyperlinks) from a
search.