Mining the Web for Site Structure
Rochester University
September 1, 2000
1 Overview

Our goal was to discover interesting structural properties on the Web,
using a Connectivity Server 2 database built from a Mercator crawl. We
were particularly interested in using structure to group Webpages into
Websites. Such groupings are crucial to applications such as:
• Link-based authority measures (for determining which links are biased).
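As a purely illustrative sketch of why site groupings matter here (the grouping rule, edge-list format, and URLs below are assumptions, not the authors' method): a naive in-degree authority count can discount links whose endpoints fall on the same host, since such intra-site links are likely navigational and therefore biased.

```python
from urllib.parse import urlparse

def host(url):
    # Hypothetical grouping: treat the hostname as the "site".
    return urlparse(url).hostname

def authority_scores(edges):
    """Count inlinks per page, ignoring links whose endpoints
    share a host (likely navigational, hence biased)."""
    scores = {}
    for src, dst in edges:
        if host(src) != host(dst):
            scores[dst] = scores.get(dst, 0) + 1
    return scores

edges = [
    ("http://a.com/x", "http://a.com/y"),  # intra-site: ignored
    ("http://a.com/x", "http://b.org/"),   # cross-site: counted
    ("http://c.net/p", "http://b.org/"),   # cross-site: counted
]
print(authority_scores(edges))  # {'http://b.org/': 2}
```

A better grouping than "hostname" would change which links are discounted, which is exactly why the grouping problem matters for this application.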
In light of this complex but potentially very rewarding problem, our
approach was experimental: we wrote code to measure various properties
of the Web graph, hoping that they would reveal some as yet unseen structure.
Since hostname seems to at least partially correspond to the notion of
Website, many of our measures focused on grouping by hostname. Since we
believe that every Website has a homepage, some of our measures were intended
to reveal homepages. The actual properties we measured included:
• the number of local and remote links per host
• the size of the link neighborhood of each page (to support the hypothesis that homepages have large neighborhoods)
• grouping by the “significant part” of the hostname

Among the properties we observed:

• Even though in total there are more inlinks than outlinks on the Web, most hosts have more outlinks than inlinks. This implies that hosts with fewer pages tend to have more outlinks than inlinks, and that hosts with a very large number of pages tend to have more inlinks than outlinks.
• Exceptions to the above observation include www.geocities.com, which has a large number of pages but more outlinks than inlinks. The simplest explanation is that it hosts many distinct authors who each control relatively few pages; since these authors are independent of each other, there is little reason for them to link within www.geocities.com.
• A surprising number of hosts are nearly cliques (i.e., almost every possible link exists).
• A surprising number of hosts have about as many links as pages.
• Regarding the distribution of local strongly connected components: for hosts of every size, there were many components close to the size of the host and many very small components, but very few of intermediate size. A possible explanation is that many hosts really do consist of a single strongly connected component, but the crawl simply didn’t connect them; the small components would then be residue of an incomplete crawl. One way to test this hypothesis would be to pick some example hosts, crawl them completely, and match the resulting graphs against the components found in the original crawl.
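The per-host link counts mentioned above can be sketched as follows; the edge-list representation and hostname extraction are assumptions about how the connectivity data might be exposed, not the actual Connectivity Server 2 interface.

```python
from collections import defaultdict
from urllib.parse import urlparse

def link_tally(edges):
    """For each host, count local links (both endpoints on the host)
    and remote outlinks/inlinks (crossing a host boundary)."""
    tally = defaultdict(lambda: {"local": 0, "out": 0, "in": 0})
    for src, dst in edges:
        h_src = urlparse(src).hostname
        h_dst = urlparse(dst).hostname
        if h_src == h_dst:
            tally[h_src]["local"] += 1
        else:
            tally[h_src]["out"] += 1
            tally[h_dst]["in"] += 1
    return dict(tally)

edges = [
    ("http://a.com/1", "http://a.com/2"),  # local to a.com
    ("http://a.com/1", "http://b.org/"),   # a.com out, b.org in
    ("http://b.org/", "http://a.com/2"),   # b.org out, a.com in
]
print(link_tally(edges))
```

Comparing each host's "out" and "in" counts from such a tally is what underlies the inlink/outlink observation above.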
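The text does not define what the “significant part” of a hostname is; one plausible reading, used here only for illustration, is to strip a leading "www" label and a trailing generic top-level-domain label, so that, e.g., mirror hostnames of one site group together.

```python
def significant_part(hostname, suffixes=("com", "org", "net", "edu")):
    """Illustrative guess at the 'significant part' of a hostname:
    drop a leading 'www' label and a trailing generic TLD label.
    A real implementation would need a full public-suffix list."""
    labels = hostname.lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    if labels and labels[-1] in suffixes:
        labels = labels[:-1]
    return ".".join(labels)

print(significant_part("www.geocities.com"))  # geocities
```

Grouping pages by this key instead of the raw hostname is one way to test whether a coarser notion of "site" reveals more structure.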
2 Conclusion

Our experimental approach was necessarily broad and unfocused, but we see ways in which future research in this area can be more structured, based on what we have learned. We believe it is worthwhile to first consider a standard for evaluating the structure we discover. We mention three such standards.

Application-based structure. There are many motivating applications for this research. One way to measure the effectiveness of a grouping technique is to measure the performance of test applications that depend on an accurate grouping, such as spam detection or authority measurement. The downside of this approach is that it is not necessarily any easier to evaluate the performance of the test applications.

Example-based structure. Choose several specific example sites and try to determine what structural properties distinguish them. Proceed by running experiments to determine whether these properties generalize to similar sites or whether new types of example sites should be introduced.