lynx-dev

lynx-dev LYNX: fancy experimental work on www-searching


From: David Combs
Subject: lynx-dev LYNX: fancy experimental work on www-searching
Date: Tue, 18 May 1999 06:57:13 -0700

I'm 3000 miles from these weekly seminars, but I got onto
the mailing list just for fun.

Anyway, maybe some on lynx-dev ARE nearby?

Sure looks neat.

David



----- Forwarded message from Theory Seminar <address@hidden> -----

Date: Tue, 18 May 1999 00:28:00 -0700 (PDT)
From: Theory Seminar <address@hidden>
To: address@hidden
Subject: Chakrabarti to talk on Focused Crawling

This Thursday, in the Algorithms Seminar, Soumen Chakrabarti will talk on

"Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery"

Thursday, 20 May at 4:15 PM
Gates Building 498

----------------------------------------------------------------------------

 Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery

                          Soumen Chakrabarti
                              IIT Mumbai


The next generation of Web databases will answer questions that combine
page contents, meta-data, and hyperlink structure in powerful ways, such
as ``find links from a finance page to an environmental protection page 
made over the last year.'' A key problem in answering such
queries is discovering the web resources related to the topics they
involve.  We demonstrate that for distributed hypertext, a
keyword-based ``find similar'' search based on a giant all-purpose crawler
is neither necessary nor adequate for resource discovery.  Instead, we
propose a system called a "Focused Crawler".  The goal of a focused
crawler is to selectively seek out pages that are relevant to a
pre-defined set of "topics".  The topics are specified not using
keywords, but using exemplary documents.  Rather than collecting and
indexing all accessible web documents to be able to answer all possible
ad-hoc queries, a focused crawler analyzes its crawl boundary to find
the links that are likely to be most relevant for the crawl, and avoids
irrelevant regions of the web.  This leads to significant savings in
hardware and network resources, and helps keep the crawl more
up-to-date.
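The crawl-boundary analysis described above can be pictured as a
relevance-guided frontier: links inherit the relevance score of the page
that contributed them, and links from irrelevant pages are never
expanded.  This is a hypothetical minimal sketch, not the authors'
implementation; fetch(), score(), and the threshold are assumptions for
illustration.

```python
import heapq

def focused_crawl(seed_urls, fetch, score, threshold=0.5, max_pages=100):
    """Minimal focused-crawl loop (illustrative sketch only).

    fetch(url) -> (text, outlinks); score(text) -> relevance in [0, 1].
    The frontier is a max-priority queue keyed on the relevance of the
    page that contributed each link, so promising regions of the web are
    explored first and irrelevant regions are pruned.
    """
    frontier = [(-1.0, url) for url in seed_urls]  # seeds get top priority
    heapq.heapify(frontier)
    seen, relevant = set(seed_urls), []
    while frontier and len(relevant) < max_pages:
        _, url = heapq.heappop(frontier)
        text, outlinks = fetch(url)
        r = score(text)
        if r < threshold:
            continue  # prune: do not expand links from an irrelevant page
        relevant.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-r, link))  # inherit parent's score
    return relevant
```

The pruning step is what yields the claimed savings: pages past an
irrelevant node are never fetched at all, rather than fetched and then
discarded at indexing time.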

To achieve such goal-directed crawling, we design two hypertext mining
programs that guide our crawler: a "classifier" that evaluates the
relevance of a hypertext document with respect to the focus topics, and
a "distiller" that identifies hypertext nodes that are great access
points to many relevant pages within a few links. We report on extensive
focused-crawling experiments using several topics at different levels of
specificity.  Focused crawling acquires relevant pages steadily while
standard crawling quickly loses its way, even though they are started
from the same root set.  Focused crawling is robust against large
perturbations in the starting set of URLs.  It discovers largely overlapping 
sets of resources in spite of these perturbations.  It is also capable of
exploring outward and discovering valuable resources that are dozens of
links away from the start set, while carefully pruning the millions of pages
that may lie within this same radius.  Our anecdotes suggest that
focused crawling is very effective for building high-quality collections of web
documents on specific topics, using modest desktop hardware.
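The distiller's job of finding good access points can be approximated by
a hub count in the HITS sense: a node scores highly if many of its
outlinks land on pages the classifier judged relevant.  This is a
simplified hypothetical sketch, not the authors' distiller; the link
graph, relevance map, and 0.5 cutoff are assumptions for illustration.

```python
def distill(link_graph, relevance, cutoff=0.5):
    """Rank nodes as access points to relevant pages (illustrative only).

    link_graph: node -> list of outlink targets.
    relevance:  node -> classifier score in [0, 1].
    A node's hub score is the number of its outlinks whose targets the
    classifier rated at or above the cutoff.
    """
    scores = {
        node: sum(1 for tgt in outlinks if relevance.get(tgt, 0.0) >= cutoff)
        for node, outlinks in link_graph.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

In a real crawler the distiller's top-ranked nodes would be revisited
periodically, since a good hub keeps yielding fresh relevant links.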

Joint work with Martin Van den Berg and Byron Dom. 
+--------------------------------------------------------------------------+
| Unless you see another box below this one, you got this message through  |
| the AFLB mailing list.  To have your name added to or removed from this  |
| mailing list, please send your request to address@hidden   |
| or see our web page at http://theory.stanford.edu/~aflb.                 |
+--------------------------------------------------------------------------+

----- End forwarded message -----
