I chose SRC for my internship this summer over Xerox PARC and FXPAL (both of which made me offers), mainly because I had talked to other grad students who had very positive summer experiences at SRC, and because the SRC project was most closely aligned with my interests. SRC also responded to my internship application very quickly -- less than two days after I submitted it -- which speaks well of their organization and preparation.
My host for this work was Krishna Bharat.
For future work, it would be interesting to scale the SPHINX
architecture to Web-wide crawlers, such as those used for search-engine
indexing or Web archiving, which retrieve and process millions of
pages. Such crawlers typically run on a farm of workstations, raising
interesting issues, such as how to divide the crawling workload fairly
and how much information cooperating crawlers must share. Also, the
platform-independence and safety of Java imply that SPHINX crawlers
could be moved around the network easily, to access the Web at the
most convenient point. Exploring the architecture and security
policies of server-side crawlers would be an interesting direction for
future work.
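One of the issues above, dividing the crawling workload fairly, can be made concrete with a small sketch. A common approach (not part of SPHINX; the class and method names below are hypothetical) is to partition URLs among cooperating crawlers by hashing each URL's host, so that all pages on a site go to the same crawler and per-site state such as robots.txt rules need not be shared between workers:

```java
import java.net.URI;

/**
 * Hypothetical sketch: assign each URL to one of N cooperating
 * crawlers by hashing its host name. Keeping a whole site on one
 * crawler means per-site state (robots.txt, politeness timers)
 * does not have to be shared across workers.
 */
public class CrawlPartitioner {
    private final int numWorkers;

    public CrawlPartitioner(int numWorkers) {
        this.numWorkers = numWorkers;
    }

    /** Returns the index of the worker responsible for this URL. */
    public int workerFor(String url) {
        String host = URI.create(url).getHost();
        // floorMod keeps the index non-negative even if hashCode() < 0
        return Math.floorMod(host.hashCode(), numWorkers);
    }

    public static void main(String[] args) {
        CrawlPartitioner p = new CrawlPartitioner(4);
        // Two pages on the same site always map to the same worker.
        System.out.println(p.workerFor("http://www.research.digital.com/a.html")
                        == p.workerFor("http://www.research.digital.com/b.html"));
        // prints "true"
    }
}
```

A simple hash like this does not balance load when sites vary greatly in size; a real Web-wide crawler would need something closer to consistent hashing, which is one of the open questions noted above.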
Solution
We built a system called SPHINX (Site-specific Processors for Html
INformation eXtraction). SPHINX is a user interface and Java class
library that supports developing and running crawlers from a user's
Web browser. For users, the SPHINX user interface offers a number of
advantages. One advantage is that common crawling operations, such as
saving, printing, or extracting data from multiple Web pages, can be
specified interactively, which makes simple crawlers simple to write.
Another is that the pages and links explored by the crawler can be displayed in several
visualizations, including a graph view and an outline view. For Java
programmers writing custom crawlers, the SPHINX library provides
multithreaded crawling, HTML parsing, pattern matching, and Web
visualization, along with the ability to configure and run the custom
crawler in the SPHINX user interface.

Observations and Future Work
The SPHINX user interface runs as a Java applet hosted by a Web
browser. Running SPHINX inside the browser as a privileged applet
turned out to be an effective strategy. As a
consequence, SPHINX crawlers are portable, require no special
configuration to access the Web, and see the Web exactly as the user
sees it. In particular, when a SPHINX crawler requests a page, it
uses the same cache, authentication, proxy, cookies, and user-agent as
the user, ensuring that it gets the same response back from the server
that the user would.
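The multithreaded crawling that the SPHINX library automates can be illustrated with a minimal, self-contained sketch. This is not the SPHINX API: the class below is hypothetical, and fetching and HTML parsing are stubbed out with an in-memory map from page URLs to their outgoing links, so that only the concurrency pattern remains.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Minimal multithreaded crawler sketch (not the SPHINX API).
 * Worker threads in a pool process pages, "fetch" them via an
 * in-memory link map, and submit any links not yet seen.
 */
public class MiniCrawler {
    private final Map<String, List<String>> web;   // stub for fetch + parse
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    public MiniCrawler(Map<String, List<String>> web) { this.web = web; }

    /** Crawls from root with the given number of threads; returns visited URLs. */
    public Set<String> crawl(String root, int nThreads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        AtomicInteger pending = new AtomicInteger(0);
        submit(root, pool, pending);
        // Wait until every submitted page has been processed.
        synchronized (pending) {
            while (pending.get() > 0) pending.wait(10);
        }
        pool.shutdown();
        return seen;
    }

    private void submit(String url, ExecutorService pool, AtomicInteger pending) {
        if (!seen.add(url)) return;        // already visited or queued
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                // "Fetch" the page and follow its links.
                for (String link : web.getOrDefault(url, List.of()))
                    submit(link, pool, pending);
            } finally {
                if (pending.decrementAndGet() == 0)
                    synchronized (pending) { pending.notifyAll(); }
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, List<String>> web = Map.of(
            "http://a/",  List.of("http://a/1", "http://a/2"),
            "http://a/1", List.of("http://a/2", "http://a/3"));
        MiniCrawler c = new MiniCrawler(web);
        System.out.println(c.crawl("http://a/", 4).size());  // prints 4
    }
}
```

In SPHINX itself this machinery is provided by the library, so a custom crawler need only supply the decisions: which links to follow and what to do with each page.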