
Web-Searching Agents

(a subtopic of Agents)

The World Wide Web has become a vast resource of information. The problem is that finding the information that an individual desires is often quite difficult, because of the complexity in organization and the quantity of information stored. - from Web Hunting: Design of a Simple Intelligent Web Search Agent


Introductory Readings

Text Parsing - Get a Job. Part of It's Alive! - From airport tarmacs to online job banks to medical labs, artificial intelligence is everywhere. By Jennifer Kahn. Wired (March 2002; 10.03). "The vast job bank Monster.com, for instance, uses an intelligent Web crawler called FlipDog to find new customers. Wandering the Web, the crawler develops a sense for which parts of sites are more likely to contain jobs, then parses the pages to pull out the relevant information (company, salary, kind of work, address for sending a resume) and files it in a database. The first time the crawler ran, it came back with more than half a million jobs. The real feat was not that FlipDog found the postings, but that it was able to organize them."

Web Hunting: Design of a Simple Intelligent Web Search Agent. By G. Michael Youngblood. ACM Crossroads Student Magazine (Summer 1999). "The goal of this article is to introduce the reader to the basic elements of an intelligent agent, and then apply those elements to a Web search agent to provide the framework for the construction of a simple intelligent Web search agent. An overview of typical artificial intelligence search algorithms will be presented and performance metrics will be discussed. This article presents a collection of ideas and pointers to resources that will hopefully provide some insight and basis for further inquiry into the subject matter."

Is There an Intelligent Agent in Your Future? By James A. Hendler. Nature Web Matters (March 11, 1999). "A good internet agent needs these same capabilities. It must be communicative: able to understand your goals, preferences and constraints. It must be capable: able to take options rather than simply provide advice. It must be autonomous; able to act without the user being in control the whole time. And it should be adaptive; able to learn from experience about both its tasks and about its users preferences. Let's look at each of these in turn...."

Introduction to the Special Issue: AI, Agents, and the Web. By James Hendler. IEEE Intelligent Systems (January/February 2006; Volume 21, Number 1). "[A]s we selected the articles for this issue, we realized that there was an 'emergent' theme of AI, agents, and the World Wide Web. The topics covered here --- the Semantic Web, content personalization, recommender systems, and personal agents --- fit together in a clear and exciting way. Together, they indicate a breakthrough in AI --- as our field explores new ways to utilize intelligent systems' powerful tools on the ever-expanding, continually changing information space that’s the World Wide Web."

Search engine spawned from antiterrorism efforts finds place in business - Fetch uses AI technology to extract data from 'deep Web.' By Heather Havenstein. Computerworld (March 13, 2007). "The Defense Advanced Research Projects Agency, the U.S. Air Force, the National Science Foundation and other agencies funded development of the Fetch technology by researchers at the University of Southern California's Information Sciences Institute during the 1990s. A group of computer science professors who developed the core AI algorithms behind the Fetch Agent Platform founded the company in 1999 to build a commercial product. ... 'We can go to places and extract information where Google and Yahoo can't,' [CEO Robert] Landes said.... To do that, Fetch builds an artificial intelligence agent to extract that particular data, not just to look for Web sites that may contain that data, he said. The strength of the system, added Fetch Chairman and CTO Steve Minton, emanates from the machine learning focus of the search engine's agent-based tools. The system can recognize types of data based on a pattern and can apply what is learned about that pattern to future searches, Minton said. In addition, the tool can mimic human behavior by automatically filling out a form without human intervention...."

  • Visit Fetch Technologies, Inc. to learn more about their Web agents: "a software program, or 'web robot', that enables online sources to be queried as if they were databases. In contrast to 'web spiders' that crawl an entire website, a Fetch agent can be configured to precisely target only certain parts of a web site and extract clean, structured data."

Smart Search. By David Pacchioli. Research|PennState (May 2003; Volume 24, Issue 2). "[Lee] Giles, the David Reese professor of information sciences and technology at Penn State, has devoted his career to finding better ways to get at information, to wring the most out of it, to marshal it efficiently. His background is in artificial intelligence, a field for which the processing of oceans of information is practically raison d'etre. ... Crawler-based engines, like Google, employ a software program -- called a crawler -- 'that goes out and follows links, grabs the relevant information, and brings it back to build your index,' Giles explains. 'Then you have an index engine that allows you to retrieve the information in some order, and an interface that allows you to see it. It’s all done automatically.' ... By limiting its crawling to a specific subject area, the niche engine can burrow deeper, providing more consistently useful information. A prime example is CiteSeer, a tool that Giles and Steve Lawrence created for the field of computer and information science. ... The ultimate goal, Giles says, is to create search engines that incorporate artificial intelligence."
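The pipeline Giles describes (a crawler that follows links, an index built from what it brings back, and a retrieval step) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the in-memory PAGES dictionary stands in for the Web, and the function names crawl, build_index and search are hypothetical, not taken from CiteSeer or any real engine.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/home": ("search engines index the web", ["a.example/ai", "a.example/jobs"]),
    "a.example/ai": ("intelligent agents search the web", ["a.example/home"]),
    "a.example/jobs": ("job postings and salary data", []),
}


def crawl(seed, pages):
    """Follow links breadth-first from seed and return {url: text}."""
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = pages[url]
        fetched[url] = text
        for link in links:
            # Skip links that leave the known pages or were already queued.
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(fetched):
    """Build an inverted index: word -> set of URLs containing it."""
    index = {}
    for url, text in fetched.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Calling search(build_index(crawl("a.example/home", PAGES)), "search web") returns the two pages that contain both words. A niche engine of the kind the article mentions would differ mainly in the crawl step, by declining to follow links that lead outside its subject area.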

Diving Deep Into The Web - Pair's search engine scours 'hidden' sites. By Michael Bazeley. The Mercury News (August 17, 2005; registration req'd.). "You think the Web is big? In truth, it's far bigger than it appears. The Web is made up of hundreds of billions of Web documents -- far more than the 8 billion to 20 billion claimed by Google or Yahoo. But most of these Web pages are largely unreachable by most search engines because they are stored in databases that cannot be accessed by Web crawlers. Now a San Mateo start-up called Glenbrook Networks says it has devised a way to tunnel far into the 'deep web' and extract this previously inaccessible information. ... Komissarchik and her father, Edward Komissarchik, say they have figured out how to analyze the forms on Web pages and understand the type of information the sites are looking for. Then, Glenbrook's Web crawlers use artificial intelligence to walk themselves through sometimes complex Web forms, answering questions, such as the location of their desired job, in the same way a human would."

The Semantic Web. By Tim Berners-Lee, James Hendler, and Ora Lassila. Scientific American (May 2001). "The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. ... The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation."

  • Also see:
    • Tiny Circuits: Tim Berners-Lee discusses the future of the Web. NPR Talk of the Nation: Science Friday With Ira Flatow. [Radio Interview; November 1, 2002]
    • The Semantic Web. From Semaview's "At-a-Glance" Illustration Series. "Designed as a one minute overview of the Semantic Web, this illustration discusses a half dozen key points in language that can be understood by managers and techies alike." Be sure to see #4: "Ontologies give the metadata meaning."
    • The Web's Father Expects a Grandchild - Tim Berners-Lee is working on the "Semantic Web," with its richer information links that unlock the power of "unplanned reuse of data." Interviewed by Andy Reinhardt. BusinessWeek online (October 22, 2004). "Q: You're working now on the Semantic Web, which will allow richer associations among data and, as the name implies, start to create a sense of "meaning" in online information. Where are things heading? A: The impact of the Semantic Web will be different from [today's] hypermedia Web. ... The Semantic Web is different. It's a space of data. It's all the information which is now in databases, spreadsheets, and application-specific files, like calendar files or photo metadata. What's exciting about the Semantic Web is its potential for serendipity, the unplanned reuse of data. The effect will be even more powerful for the Semantic Web because you won't have to be a person following the links. A machine will be able to follow links. Q: Can you give me an example? ..."
    • A Smarter Web - New technologies will make online search more intelligent--and may even lead to a "Web 3.0." By John Borland. Technology Review (March / April 2007 issue). "The Semantic Web community's grandest visions, of data-surfing computer servants that automatically reason their way through problems, have yet to be fulfilled. But the basic technologies that [Eric] Miller shepherded through research labs and standards committees are joining the everyday Web. They can be found everywhere--on entertainment and travel sites, in business and scientific databases--and are forming the core of what some promoters call a nascent 'Web 3.0.' ... Since 1998, researchers at W3C, led by [Tim] Berners-Lee, had been discussing the idea of a 'semantic' Web, which not only would provide a way to classify individual bits of online data such as pictures, text, or database entries but would define relationships between classification categories as well. Dictionaries and thesauruses called 'ontologies' would translate between different ways of describing the same types of data, such as 'post code' and 'zip code.' All this would help computers start to interpret Web content more efficiently. In this vision, the Web would take on aspects of a database, or a web of databases. ... In articles and talks, Berners-Lee and others began describing a future in which software agents would similarly skip across this 'web of data,' understand Web pages' metadata content, and complete tasks that take humans hours today. ... At the beginning of 2001, the effort to realize this vision became official. The W3C tapped Miller to head up a new Semantic Web initiative, unveiled at a conference early that year in Hong Kong."
    • Tim Berners-Lee on the Semantic Web (video: 8min 24sec). Technology Review Videos (March 2007). "The inventor of the World Wide Web explains how the Semantic Web works and how it will transform how we use and understand data."
    • The Semantic Web In Action - Corporate applications are well under way, and consumer uses are emerging. By Lee Feigenbaum, Ivan Herman, Tonya Hongsermeier, Eric Neumann and Susie Stephens. Scientific American (December 2007; subscription req'd). "Six years ago in this magazine, Tim Berners-Lee, James Hendler and Ora Lassila unveiled a nascent vision of the Semantic Web: a highly interconnected network of data that could be easily accessed and understood by any desktop or handheld machine. ... The enabling technologies have come of age. A vibrant community of early adopters has agreed on standards that have steadily made the Semantic Web practical to use. Large companies have major projects under way that will greatly improve the efficiencies of in-house operations and of scientific research. Other firms are using the Semantic Web to enhance business-to-business interactions and to build the hidden data-processing structures, or back ends, behind new consumer services. And like an iceberg, the tip of this large body of work is emerging in direct consumer applications, too."
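The role the Technology Review article assigns to ontologies, translating between different ways of describing the same data such as 'post code' and 'zip code', can be illustrated with a toy sketch. This is only a dictionary-based illustration under stated assumptions: the SHARED_CONCEPTS table, the concept names and the normalize function are hypothetical, and real Semantic Web ontologies are expressed in languages such as RDF and OWL rather than a Python dictionary.

```python
# Hypothetical mini-"ontology": each source vocabulary's field name is mapped
# to one shared concept, so 'post code' and 'zip code' are treated alike.
SHARED_CONCEPTS = {
    "post code": "postal_code",
    "zip code": "postal_code",
    "town": "city",
    "city": "city",
}


def normalize(record):
    """Rename a record's keys to shared concept names; keep unknown keys as-is."""
    return {
        SHARED_CONCEPTS.get(key.lower(), key): value
        for key, value in record.items()
    }


# Two records that describe the same kind of data in different vocabularies.
uk_record = {"Post Code": "SW1A 1AA", "Town": "London"}
us_record = {"Zip Code": "94301", "City": "Palo Alto"}
```

After normalize is applied, both records expose the same postal_code and city keys, so a program can compare or merge them without knowing which vocabulary each source used.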

Sony lab tips 'emergent semantics' to make sense of Web. By Junko Yoshida and R. Colin Johnson. EE Times (November 1, 2004). "Sony Computer Science Laboratory is positioning its 'emergent semantics' as a self-organizing alternative to the W3C's Semantic Web that does not require any recoding of the data currently available online. Based on successful experiments with communities of robots, emergent-semantic technology is built on the principles of human learning, representatives of the Sony lab said at an open house here last month. Much as these communities of 'agents' extract meaning (semantics) from the character of their interactions, emergent semantics extracts the meaning of Web documents from the manner in which people use them, the researchers said."

AI gets down to business. By Matthew Broersma. ZDNet UK. (January 23, 2001). "Web robots don't necessarily carry out tasks for one Web site. Many researchers envision a world of semi-autonomous 'agents', roaming the Web and carrying out various tasks for their owners. Present software such as the 'mobile agents' of Netherlands-based Tryllian could be the forerunner of intelligent bots making purchases and carrying out other business transactions without human intervention."

General Readings

Intelligent Systems and the Internet - A Special Issue of AI Magazine. 18(2), Summer 1997. "The articles describe a broad and diverse set of systems. The AI technologies used span the gamut from machine learning to natural language processing, from case-based reasoning to knowledge representation, and more. Applications include Web page filtering, a grant finder, a FAQ finder, a home page finder, a shopping assistant, and more." - from the Introduction, by Oren Etzioni.

AI think, therefore I am. Virtual agents feature - Computerised characters that look, sound, move and seemingly think like real people are emerging from the realms of science fiction into everyday life. Superguide by David Braue. apcmag.com (December 16, 2003). "Agents are all over the Internet, across which search engine 'spiders' interactively locate and index sites, and are also common in subscription news services. ... Many researchers believe such agents will become pervasive personal assistants, helping people keep up with a constant flood of information by proactively sorting, cataloguing and presenting it in a meaningful way."

Personalized and Focused Web Spiders. By Michael Chau and Hsinchun Chen. In Web Intelligence (February 2003, pp. 197-217; Springer-Verlag). N. Zhong, J. Liu, Y. Yao, editors. Abstract: "As the size of the Web continues to grow, searching it for useful information has become increasingly difficult. Researchers have studied different ways to search the Web automatically using programs that have been known as spiders, crawlers, Web robots, Web agents, Webbots, etc. In this chapter, we will review research in this area, present two case studies, and suggest some future research directions."

  • Check out the University of Arizona Artificial Intelligence Lab's Spiders are Us web site and their related publications. [The lab is part of the Management Information Systems Department and is headed by Dr. Hsinchun Chen.]

Weaving A Web of Ideas - Engines that search for meaning rather than words will make the Web more manageable. By Steven M. Cherry. IEEE Spectrum (September 2002). "What companies like Google, Autonomy, and Verity are doing, in other words, is figuring out better ways of doing what search engines have always tried to do: deliver the best documents the existing Web has on a given topic. The advocates of the Semantic Web, on the other hand, are looking beyond the current Web to one in which agent-like search engines will be able to not just deliver documents, but get at the facts inside them as well. ... Valuable as the Semantic Web might be, it won't replace regular Web searching. Peter Pirolli, a principal scientist in the user interface research group at the Palo Alto Research Center (PARC), notes that usually a Web querier's goal isn't an answer to a specific question. 'Seventy-five percent of the time, people are engaged in what we call sense-making,' Pirolli says. ... PARC researchers think there's plenty of room for improving Web searches. One method, which they call scatter/gather, takes a random collection of documents and gathers them into clusters, each denoted by a single topic word, such as 'medicine,' 'cancer,' 'radiation,' 'dose,' 'beam.' The user picks several of the clusters, and the software rescatters and reclusters them, until the user gets a particularly desirable set. ... For Autonomy, Bayesian networks are the starting point for improved searches. The heart of the company's technology, which it sells to corporations like General Motors and Ericsson, is a pattern-matching engine that distinguishes different meanings of the same term and so 'understands' them as concepts."

BIG: A Resource-Bounded Information Gathering Agent. By Victor Lesser, Bryan Horling, Frank Klassner, Anita Raja, Thomas Wagner and Shelley XQ. Zhang. 1998. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 539 - . Menlo Park, Calif.: AAAI Press. "Effective information gathering on the WWW is a complex task requiring planning, scheduling, text processing, and interpretation-style reasoning about extracted data to resolve inconsistencies and to refine hypotheses about the data. This paper describes the rationale, architecture, and implementation of a next generation information gathering system - a system that integrates several areas of AI research under a single research umbrella. The goal of this system is to exploit the vast number of information sources available today on the NII including a growing number of digital libraries, independent news agencies, government agencies, as well as human experts providing a variety of services. The large number of information sources and their different levels of accessibility, reliability and associated costs present a complex information gathering coordination problem. Our solution is an information gathering agent, BIG, that plans to gather information to support a decision process, reasons about the resource trade-offs of different possible gathering approaches, extracts information from both unstructured and structured documents, and uses the extracted information to refine its search and processing activities."

The Push for News Returns. By Kendra Mayfield. Wired News (March 30, 2002). "The University of Michigan is working on a similar service called NewsInEssence, which also uses natural language techniques to find and summarize multiple news articles on the Web." Also see: AI-Generated News Collections

Tax Takers Send in the Spiders. By Quinn Norton. Wired News (January 25, 2007). "Websites around the world are getting a new computerized visitor among the Googlebots and Yahoo web spiders: The taxman. A five-nation tax enforcement cartel has been quietly cracking down on suspected internet tax cheats, using a sophisticated web crawling program to monitor transactions on auction sites, and track operators of online shops, poker and porn sites. The 'Xenon' program.... Xenon, explained Marten den Uyl of Sentient, is in some ways the opposite of something like Google's web crawler, which traverses a tree of links and grabs a copy of everything it sees. Xenon is smart about link selection and context, and uses a 'slow search paradigm,' he said. ... Once the web pages are screen-scraped, Xenon's Identity Information Extraction Module interfaces with national databases containing information like street and city names. ... As illuminating as Xenon is for the tax man, the data-mining effort poses dangers to citizen privacy, said Par Strom, a noted privacy advocate in the world of Swedish IT."

Agent-Based Engineering, the Web, and Intelligence. By Charles J. Petrie, Stanford Center for Design Research. IEEE Expert, 11:6, pp. 24-29, (December 1996). "This article concerns Internet-based 'agents', about which there has been much hyperbole recently. There has been much discussion on the software agents email list about the defining nature of agents on the Internet. Some have tried to offer the general definition of agents as someone or something that acts on one's behalf, but that seems to cover all of computers and software. Other than such generalities, there has been no consensus on the essential nature of agents. This suggests that the word is overloaded for a variety of contexts. In this article I will survey the types and definitions of agents eventually focusing on those useful for engineering. Because it is simply silly to discuss software agents without distinguishing them from other known types of software, I will venture to offer a definition."

Going where no search engine has gone before - Connotate Technologies uses information agents to extract data from Deep Web. By Dibya Sarkar. FCW.com (May 30, 2005). "Google, one of the most popular search engines, at best can index and search about 4 billion to 5 billion Web pages, representing only 1 percent of the World Wide Web. But officials from Connotate Technologies, a company based in New Brunswick, N.J., said they have developed technology that can mine and extract data from the Deep Web, which contains an estimated 500 billion Web pages, and deliver it in any format and through any delivery mechanism. The Deep Web refers to content in databases that rarely shows up in Web searches. Through the use of intelligence-based software modules called information agents, corporate and government organizations can quickly and easily target specific unstructured data from intranets and password-protected Web sites on a continual basis. 'What the agents do is they automate time-consuming Web interaction,' said Bruce Molloy, the company's chief executive officer. 'So an agent can act on your behalf, type in information, search terms, can click on links, can know your password — but we would keep it protected — can automatically go to sites and bring back information, format and cut and paste results.' ... Connotate was formed in 1999 by three Rutgers University professors, whose Web-mining technology research was funded by the Defense Advanced Research Projects Agency and the university. ... 'It's a lot like showing something to a small child for the first time,' said Chris Giarretta, Connotate's customer relationship manager. Essentially, he said, the more you show what a user wants, the better the agent will get at finding it."

Intelligent Searching Agents on the Web. Search Engines column by Tracey Stanley. Ariadne (Issue 7; January 1997). "Intelligent agents can utilise the spider technology used by traditional web search engines, and employ this in new kinds of ways. Typically, these tools are spiders which can be trained by the user to search the web for specific types of information resources. The agent can be personalised by its owner so that it can build up a picture of individual likes, dislikes and precise information needs. An intelligent agent can also be autonomous - so that it is capable of making judgements about the likely relevance of material."

Related Resources

Aware, from Stottler Henke Associates, Inc. "is a new tool for searching the Internet that learns what the user is looking for and helps gather highly targeted results. Aware uses patent pending intelligent agent technology to analyze the terms and documents that are relevant to the user’s research area, enabling it to search more deeply and broadly than unaided users can."

Envisional. Check out their Discovery Engine: "This is an automated search system that can delve into the 'deep Internet' and probe the shady worlds of Internet relay chat channels, file-sharing networks, trading sites and secretive online communities. It uses intuitive, almost human, reasoning to uncover massive amounts of information, but selectively bring back just the hits you really need to know about. ... This is advanced, automated artificial intelligence...."

iVia: High Octane Software for Internet portal and Virtual Library Creation and Management. "The iVia system is an INFOMINE creation generously funded by the National Science Digital Library of the National Science Foundation, the National Leadership Grant Program of the U.S. Institute of Museum and Library Services, the Fund for the Improvement of Post-Secondary Education of the U.S. Department of Education and the Library of the University of California, Riverside." As explained on the New Technologies page: "iVia utilizes a range of programs known as crawlers to traverse the Web and identify new Internet resources. iVia's crawlers are used to help identify important academic resources on the Internet. The crawlers function as collection development tools."

InfoSpiders: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery. From Filippo Menczer and the Adaptive Agents Research Group, University of Iowa. "An artificial life - inspired multi-agent adaptive system for autonomous, scalable information search in the Web." In addition to the links you'll find on this page to related news articles, papers, and even narrated demos, there's one that invites you to give a troop of spiders their marching orders:

  • "MySpiders is a java applet that uses intelligent, autonomous, adaptive software agents to search the internet on behalf of the user for information about the user's query. MySpiders complement, rather than replace, traditional search engines, by locating recent documents that may not have been indexed by search engines yet."

"Letizia is a user interface agent that assists a user browsing the World Wide Web. As the user operates a conventional Web browser such as Netscape, the agent tracks user behavior and attempts to anticipate items of interest by doing concurrent, autonomous exploration of links from the user's current position. The agent automates a browsing strategy consisting of a best-first search augmented by heuristics inferring user interest from browsing behavior." From Henry Lieberman of the Media Laboratory at the Massachusetts Institute of Technology.
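The browsing strategy in the Letizia description, a best-first search over links guided by a heuristic estimate of user interest, can be sketched as follows. This is a minimal illustration and not Letizia's actual implementation: the LINKS graph, the TEXT contents, and the word-overlap heuristic in interest are all hypothetical, and the interest profile is passed in directly, whereas Letizia infers it from browsing behavior.

```python
import heapq

# Hypothetical link graph and page texts standing in for the Web.
LINKS = {
    "start": ["sports", "ai-news", "cooking"],
    "ai-news": ["agents", "robots"],
    "sports": ["scores"],
    "cooking": [], "agents": [], "robots": [], "scores": [],
}
TEXT = {
    "start": "portal page",
    "sports": "football scores today",
    "ai-news": "news about software agents and search",
    "cooking": "recipes",
    "agents": "intelligent web search agents",
    "robots": "robots and agents",
    "scores": "league scores",
}


def interest(page, profile):
    """Heuristic score: number of profile words appearing in the page text."""
    return len(profile & set(TEXT[page].split()))


def best_first(start, profile, limit):
    """Visit up to `limit` pages, always expanding the highest-scoring one."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-interest(start, profile), start)]
    seen, visited = {start}, []
    while frontier and len(visited) < limit:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for nxt in LINKS[page]:
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-interest(nxt, profile), nxt))
    return visited
```

With the profile {"agents", "search", "web"} and a limit of 4, the search leaves the portal page, goes to the AI news page, and then prefers the agents page over the unrelated sports and cooking links. A breadth-first crawler given the same limit would instead spend it on the three pages linked from the portal.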

The Semantic Web. From Cycorp, Inc. "The Semantic Web is an exciting vision for the future of information technology, but it is a vision that presupposes the ability to represent web content with efficiency and expressiveness. If a scalable way to add semantics to the World Wide Web (WWW) can be found, the Semantic Web will create a world where agents, search engines, and other programs can read semantic markup to decipher the real meaning of a web page. The Semantic Web-aware agents will be able to retrieve computer readable facts, integrate and reason about those facts, answer questions, solve problems, and generally bring a new level of intelligence to the WWW that is unimaginable with today’s technology. ... The key to harvesting this new semantic information will be the creation of the Semantic Web-aware agents that can cope with a diversity of meanings and inconsistencies across local ontologies. These agents will need the capability to interpret, understand, elaborate, and translate among the many heterogeneous local ontologies that will populate the Semantic Web."

  • Also see this article: Super Searches - IBM's WebFountain, a new Internet tool, helps companies spot online trends before they emerge. By Laura A. Locke. Time Magazine (November 8, 2004).

Softbots. Computer Science Department, University of Washington. You can read about softbot projects and view online demonstrations.

WebMate. From the Software Agents Group at Carnegie Mellon University. "WebMate, a personal digital assistant, is a promising solution to the problem of finding useful information among a sea of texts and other web documents."

Web Robots Pages. Includes a FAQ, a database of current webcrawlers, some online articles and a few related web sites.

"The World Wide Web Consortium (W3C) develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential. W3C is a forum for information, commerce, communication, and collective understanding. On this page, you'll find W3C news, links to W3C technologies and ways to get involved."

Other References Offline

Leonard, Andrew. 1997. Bots: The Origin of New Species. San Francisco: Hardwired. Surveys the vast spectrum of software agents--from bots that retrieve information to bots that chat--and compares them to evolving organisms.

Page last modified on December 14, 2008, at 05:18 AM