The World Wide Web, or simply the Web, is rapidly becoming the world's
collective information store, containing everything from news, to
entertainment, to personal communications, to product descriptions. This
world information store is distributed across millions of computers, but it
is often important to gather significant parts of it at a single site. One
reason is to build content indices, such as Google's. Another reason is to
mine the cached Web, looking for trends or data correlations. A third reason
for gathering a Web copy is to create a historical record for Web sites that
are ephemeral or changing.
In this talk I will discuss how to build a repository of Web pages,
describing some of the technical challenges faced.
In doing so, I will draw on some of the work we have been doing in our
group at Stanford.