Philippe Martin and Peter Eklund
Griffith University,
School of Information Technology,
PMB 50 Gold Coast MC, QLD 9726 Australia
Tel: +61 7 5594 8271; Fax: +61 7 5594 8066;
E-mail: pm .@. phmartin dot info
Proceedings of the workshop "Virtual Documents, Hypertext Functionality and the Web" at the 8th International World Wide Web Conference.
Web search engines - such as Altavista1 or Infoseek2 - retrieve entire documents based on the keywords they include. They exploit undirected Web robots to periodically traverse and index internet/intranet documents. Directed Web robots - such as Harvest3, WebSQL4 and WebLog5 - apply string-matching and structure-matching commands (e.g. hypertext path expressions) to explore an intranet or a small subset of the internet and retrieve entire documents or parts of them. However, people are generally not looking for lists of documents but either for a precise answer to a precise query, or for a structured presentation of information related to a certain object such as a particular event, technique, software, idea or person. For example, someone looking for "large-scale deductive database systems" does not want a giant list of references to conferences, articles and courses on database systems, or home pages and user manuals of specific database systems. S/he first wants a classification of the features that such systems may have, and may then ask for a classification of existing tools according to some of these features, e.g. the kinds of query language, exploited techniques, API, memory and performance characteristics, multi-user support, reliability and license.
Though such precise information and comparisons are important for each person interested in using deductive database systems, collecting the information just by reading documents is a long and difficult task for that person. However, it is not necessarily difficult for each provider of information on an object to represent this information in a document or a shared knowledge repository so that it can be retrieved - and to a certain extent, merged or composed - via conceptual commands. As opposed to string-matching and structure-matching commands, conceptual commands rely on logical inferences (e.g. exploitation of subsumption relations between terms in the knowledge statements) and improve both precision and recall in information retrieval. They may also be combined with other commands within scripts or usual documents to create virtual documents.
The easiest way to express information is in natural languages. However, outside limited domains, these languages are too ambiguous for the semantic content of sentences to be automatically extracted. We argue in our article for the WWW8 conference6 that general and intuitive knowledge representation languages or derived simpler notations (e.g. a "controlled language", that is, a subset of natural language which eliminates sources of ambiguity) are preferable to metadata languages7 based on XML8 (e.g. RDF9 and OML10) for indexing Web documents and representing knowledge within them. Indeed, the retrieval of precise information is eased by a language designed to represent semantic content and support logical inference, and the readability of such a language eases its exploitation, presentation and direct insertion within a document (thus also avoiding information duplication). We advocate the use of Conceptual Graphs (CGs)11 and simpler notational variants that enhance knowledge readability (e.g. we propose a formalised English and structured text notations). To further ease the representation process, we propose (i) a technique allowing users to leave some knowledge terms undeclared, and (ii) a top-level ontology of 400 concept and relation types. We have implemented a knowledge-based directed Web robot named WebKB to parse and execute our notations, knowledge handling and retrieval commands, Web document handling commands, and script language (to combine groups of commands). This tool is accessible as a CGI server. The WebKB site12 provides HTML+Javascript interfaces.
Various kinds of applications of knowledge representation, indexation and queries are illustrated by examples in the WebKB site. Here is how some information on the Aditi database system could be represented in one of the structured text notations accepted by WebKB (this information is extracted from the "Catalog of free database systems"13). Relations between each term used in this knowledge statement and other terms may be similarly defined elsewhere (in other documents or shared knowledge repositories) by one or several other users. Then, for example, subsumption relations between terms may be exploited for conceptual retrieval.
[Aditi. isa: large-scale deductive database system; user interface: NU-Prolog, graphical interface (implemented with: Motif); index method: B-trees, multi-level signature files; ports: SunOS, IRIX; ](representation date: 1992/12/17; representation author: aditi@cs.mu.oz.au).
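As a rough illustration of how subsumption relations can support conceptual retrieval, here is a minimal Python sketch. It is not WebKB code: the supertype table, the second object `ExampleRDB` and the function names are assumptions made for the example, and only the `isa` relation of the statement above is modelled.

```python
# Hypothetical supertype links; in practice they would be declared by
# various users in documents or in a shared knowledge repository.
SUPERTYPES = {
    "large-scale deductive database system": ["deductive database system"],
    "deductive database system": ["database system"],
    "relational database system": ["database system"],
    "database system": ["software"],
}

# Hypothetical knowledge statements reduced to their "isa" relation:
# object name -> declared type. "ExampleRDB" is a made-up object.
STATEMENTS = {
    "Aditi": "large-scale deductive database system",
    "ExampleRDB": "relational database system",
}


def is_subsumed_by(term, ancestor):
    """Return True if `ancestor` is `term` itself or one of its supertypes."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(SUPERTYPES.get(current, []))
    return False


def retrieve(query_type):
    """Return the objects whose declared type is subsumed by `query_type`."""
    return sorted(
        name for name, declared in STATEMENTS.items()
        if is_subsumed_by(declared, query_type)
    )


print(retrieve("deductive database system"))
print(retrieve("database system"))
```

A plain keyword match on "deductive database system" would also find Aditi here, but only the subsumption-based search returns both objects for the more general query "database system", since neither statement contains that exact term as its declared type.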
It is handy for an information provider to store and structure knowledge inside Web documents, especially if the duplication of information into machine-readable statements and human-only readable statements can be avoided (e.g. by using a controlled language14 for sentences and a visual language15 for graphics) or at least reduced by the possibility of mixing and linking the two kinds of statements. To allow this, WebKB exploits the convention that each group of knowledge statements or commands in a document must be delimited by the two special HTML tags "<KR>" and "</KR>" or the strings "$(" and ")$". The knowledge representation language used in each group must be specified at its beginning, e.g.: "<KR language="CG">". Each group is visible unless the document's author hides it with HTML comment tags. Furthermore, various notations allow people to use knowledge statements for indexing any part of any Web document (not just parts which can be referred to by URLs). Thus, knowledge statements may be retrieved and handled via document-based commands, and conversely indexed parts of documents may be retrieved and handled via knowledge-based commands.
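The delimiting convention can be illustrated with a small Python sketch that extracts the groups from a document. This is an assumption-laden sketch, not the WebKB parser: the regular expressions, the function name `extract_groups` and the choice of returning `None` as the language of "$(" ... ")$" groups are ours.

```python
import re

# Groups delimited by <KR ...> and </KR>; the language attribute is optional
# in this sketch so that a bare <KR> tag is also recognised.
KR_TAG = re.compile(
    r'<KR(?:\s+language="([^"]*)")?\s*>(.*?)</KR>',
    re.DOTALL | re.IGNORECASE,
)
# Groups delimited by the strings "$(" and ")$".
KR_STRING = re.compile(r'\$\((.*?)\)\$', re.DOTALL)


def extract_groups(document):
    """Return (language, body) pairs in their order of appearance.

    The language is None when the delimiters do not carry it
    (an assumption of this sketch for the "$(" ... ")$" form).
    """
    found = []
    for match in KR_TAG.finditer(document):
        found.append((match.start(), match.group(1), match.group(2).strip()))
    for match in KR_STRING.finditer(document):
        found.append((match.start(), None, match.group(1).strip()))
    found.sort(key=lambda item: item[0])
    return [(language, body) for _, language, body in found]


page = (
    'Some text. <KR language="CG">[Cat]->(On)->[Mat]</KR> '
    'More text. $( some command )$ The end.'
)
for language, body in extract_groups(page):
    print(language, "|", body)
```

Because the search runs on the raw text, a group that the author has hidden inside HTML comment tags is still found, which matches the convention that hiding a group only affects what readers see.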
When a command sent to the WebKB CGI server requires it to "run" a Web document (referred to by a URL), the server retrieves the document and executes the knowledge statements and commands within it (some commands may be to run other Web documents). The results are sent back to the client and constitute a generated document (hence, a virtual document). Depending on a parameter, the WebKB server may or may not send back the human-only readable statements along with the results (if it does, the generated document is a copy of the original document with the command results in place of the commands). Similarly, the WebKB server may be used to exploit other CGI servers. Within HTML documents, dynamic linking may be achieved by using Javascript16 to associate a command with a hypertext link in such a way that the command is sent to the WebKB server when the link is activated.
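The generation of a virtual document from a "run" request might be sketched as follows in Python. This is only an illustration under stated assumptions: the document is passed as a string instead of being fetched from a URL, `execute` stands for whatever interpreter handles a group, and `run_document` and `keep_text` are hypothetical names, not WebKB commands or parameters.

```python
import re

# One pattern for both delimiting conventions: <KR ...>...</KR> or $(...)$.
GROUP = re.compile(r'<KR\b[^>]*>(.*?)</KR>|\$\((.*?)\)\$', re.DOTALL)


def run_document(document, execute, keep_text=True):
    """Execute every group of `document` and build the generated document.

    `execute` is a caller-supplied function mapping a group body to its
    result string. With keep_text=True the result replaces the group in a
    copy of the document; otherwise only the results are returned.
    """
    results = []

    def substitute(match):
        body = match.group(1) if match.group(1) is not None else match.group(2)
        result = execute(body.strip())
        results.append(result)
        return result

    merged = GROUP.sub(substitute, document)
    return merged if keep_text else "\n".join(results)


def toy_interpreter(command):
    """Stand-in interpreter that only echoes the command it received."""
    return "[result of: " + command + "]"


page = 'Before. <KR language="CG">query one</KR> Middle. $(query two)$ After.'
print(run_document(page, toy_interpreter, keep_text=True))
print(run_document(page, toy_interpreter, keep_text=False))
```

A command that runs another Web document would correspond to an `execute` function that calls `run_document` recursively on the retrieved text.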
However, as with any other directed Web robot, the scalability and efficiency of the current WebKB are limited by the facts that (i) the users must know which documents contain (or may contain) the knowledge to exploit, and (ii) these documents must be accessed and parsed each time their content has to be exploited. Pieces of knowledge, like Web documents, may be provided by all Web users, and need to be inter-related or integrated to allow each user to benefit from the knowledge of users they do not know. For that, cooperatively built knowledge repositories are necessary.
Some Web servers, called ontology servers, support shared knowledge
repositories, e.g. the
Ontolingua ontology server17 and
Ontosaurus18.
However, they are not usable for managing large quantities of knowledge and,
apart from AI-Trader19,
they do not allow the indexation and retrieval of parts of documents.
Finally, support of cooperation between the users is essentially limited to
consistency enforcement, annotations and structured dialogues, as in
APECKS20,
Co421
and Tadzebao22.
We are extending WebKB to handle a
cooperatively built knowledge repository
which addresses scalability via the following five points23:
(i) a scalable multi-user persistent object repository to support
the storage and exploitation of knowledge structures (we have chosen the
Shore24 system);
(ii) algorithms allowing the exploitation of large-scale dynamic
taxonomies efficiently (we have chosen
Fall's algorithms25);
(iii) visualisation techniques (mainly the handling of aliases for
terms and the generation of views) to avoid lexical conflicts and enable users
to focus on certain kinds of knowledge;
(iv) protocols to allow users to solve semantic conflicts via the
insertion of new terms and relations in the common ontology and, in some cases,
in the knowledge of other users;
(v) conventions for representing knowledge to improve the automatic
comparison of knowledge from different users and hence their consistency and
retrieval.
Though these five points permit the exploitation of a large knowledge
repository (which is essential for efficiency reasons and practical use),
it is also clear that for efficiency and reliability reasons, a single
server cannot be used to handle a universal knowledge repository for all
Web users. Knowledge has to be distributed and mirrored on various knowledge
servers. However, since there are no static conceptual schemas in knowledge
bases, the techniques of distributed database systems - such as
AlephWeb26,
Hermes27,
Infomaster28
and TSIMMIS29 -
cannot all be reused.
A first step to the distribution of a knowledge repository is to
duplicate it on several servers, with updates made on a server automatically
duplicated in other servers. Some servers may be dedicated to searches
and others to updates.
A second step is to have general servers and specialised servers.
A specialised server would store the same knowledge as general servers
plus knowledge related to a well-defined set of objects, e.g. knowledge
expressed with the subtypes of certain types. Since these sets of objects
are well-defined (extensively or via definitions), a general server would
store the URLs of these servers and, when answering a query, would
delegate the query to the relevant servers if more precision is required.
These sets of objects might be determined by the managers of specialised
servers, or according to the frequency of accesses to objects in knowledge
repositories. Whatever the specialised server a user updates, if the knowledge
the user enters is relevant to other servers (e.g. if the knowledge is expressed
with general terms), it should be automatically duplicated in these servers.
The rationale of all these duplications is to speed up searches and simplify the
query mechanisms by avoiding, whenever possible, parallel searches in various
servers followed by the composition of the results.
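The second step can be made more concrete with a small Python sketch of the delegation and duplication rules. Everything in it is an assumption made for illustration: the class and method names, the `example.org` URLs, the routing of updates through the general server, and the simplification that a query matches a type exactly (no subsumption) are not part of WebKB.

```python
class Server:
    """A knowledge server holding object -> type statements."""

    def __init__(self, url, scope=()):
        self.url = url
        self.scope = set(scope)  # types this server specialises in
        self.statements = {}

    def query(self, type_name):
        return sorted(o for o, t in self.statements.items() if t == type_name)


class GeneralServer(Server):
    """A general server that knows the specialised servers and their scopes."""

    def __init__(self, url, specialised):
        super().__init__(url)
        self.specialised = list(specialised)

    def relevant(self, type_name):
        return [s for s in self.specialised if type_name in s.scope]

    def update(self, obj, type_name):
        """Store a statement and return the URLs of the servers that got it."""
        targets = self.relevant(type_name)
        if not targets:
            # General knowledge is duplicated on every server.
            targets = [self] + self.specialised
        for server in targets:
            server.statements[obj] = type_name
        return [server.url for server in targets]

    def query(self, type_name):
        """Delegate to one relevant specialised server, else answer locally."""
        targets = self.relevant(type_name)
        if targets:
            return targets[0].query(type_name)
        return super().query(type_name)


db = Server("http://db.example.org", {"deductive database system"})
general = GeneralServer("http://general.example.org", [db])
general.update("Aditi", "deductive database system")
general.update("WebKB", "software")
print(general.query("deductive database system"))
print(general.query("software"))
```

Because general knowledge is copied to every server, the query on "software" is answered by a single server, and the query on the specialised type is answered by one delegation rather than by a parallel search whose results would then have to be composed.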
Other steps may be necessary, but what should be avoided in this
knowledge-based (hence precision-oriented) approach is to let the specialised
servers develop independently
of each other instead of being part of a unique consistent virtual
knowledge repository. Otherwise, conceptual queries and cooperation across
the repositories are no longer possible and, as in current traders, the most
relevant repository to answer a query has to be automatically "guessed".
Finally, knowledge servers should not be limited to the storage of knowledge statements: they should also allow the storage and handling of knowledge-based and document-based commands, similar to the storage and handling we described for documents.
The more precisely a piece of information is represented, the more adequately it can be retrieved and exploited. General and intuitive knowledge representation languages seem best adapted for that. WebKB allows the use of Conceptual Graphs and also of simpler notations when less expressivity or precision is needed. Ambiguities due to declared terms are partially solved according to the constraints in the ontologies used.
Storing knowledge within documents is handy but the scalability of this approach is limited. Ultimately, we believe a knowledge-based Web relies on scalable, distributed, cooperatively built knowledge repositories. We have proposed (and are working on) some directions towards that goal. In this view, knowledge-annotated documents can be used as isolated modules of knowledge on which a user can work before submitting them to a knowledge server for integration. A document including commands can also be sent to a knowledge server as a template for generating virtual documents. Of course, scripts of commands could also be stored in a repository handled by a knowledge server and referred to from a document. We are currently extending WebKB to allow these combinations.
In the same way as today we register a Web site, we will probably register knowledge representations (or documents including knowledge representations) and complement or refine each other's knowledge.