How knowledge is stored in WebKB-2


Table of contents

  1. Introduction
  2. The ontology: categories, category names, users
  3. The statements


Introduction

WebKB-2 was designed to permit users to store any kind of knowledge, in a precise, intuitive and normalized way into a large shared knowledge base ("normalized" means less alternative ways of representing something, and hence more graph matching possibilities). This implies expressive notations, and hence a rich data model. The notations and model of WebKB-2 are extensions of those of Conceptual Graphs (CGs) (click here for grammars and descriptions and here for an article comparing CGLF, CGIF, KIF, Frame-CG and Formalized-English). I will keep enriching the notations and data model of WebKB-2 as long as I encounter information (e.g. natural language sentences) that cannot yet be represented in an intuitive and normalized way. However, these notations and data model are now quite stable. They have been published, are public domain, and may be re-used in any way (including commercially).

As opposed to notations, data models/structures are implementation dependant, and much more goal/application dependant than notations. Knowledge exchange/entering require high-level expressive normalised notations; knowledge exploitation require tailored data structures. Hybrid solutions where the distinctions between notations and models are blurred, typically XML-based languages, may be sufficient for "data" but make "knowledge representation, comparison and exploitation" difficult to perform and debug. WebKB-2 is a large-scale KBMS oriented toward knowledge retrieval. As in DBMSs, the model used by WebKB-2 results from compromises between keeping the data structures small and easing the retrieval of the stored knowledge.

In many knowledge representation languages, especially in the terminological logics of the 1990s, the notations and the model were restricted to permit the inference engine to exploit all the entered knowledge in a sound and complete way. This approach of tailoring a language (notation and model) to the possibilities of a particular inference engine is quite restrictive and leads the knowledge providers to bias the knowledge to make it fit the model. Common knowledge representation tools, even terminological logics tools such as Loom, have features that do not permit the inference engine to use all the information and be sound and complete. With WebKB-2, the users are encouraged to represent knowledge as precisely as possible, and re-use, "correct" or annotate other users' knowledge. WebKB-2 does not exploit all the entered information for consistency checks and information retrieval, just the usual pieces of information: relation signatures, exclusion and subsumption links, graph structures (cycle detection, graph matching, etc.). Applications that require more information to be taken into account may re-use the same knowledge with more complete or more domain-dependant inference engines. This is not possible when the knowledge is biased to fit a restricted model.

WebKB-2 is written in C++ and re-uses the OODBMS FastDB (or Gigabase if the knowledge base exceeds 1 Gb). The data model (or "database schema") is specified by C++ classes (although Gigabase also has APIs for Java and Perl). Information about relations between the field descriptors or the way they should be indexed is described via macros within the classes (as explained below).

The data model of WebKB-2 is mainly composed of five classes: Term, TermName, User, Node and Relation. The first three classes permit the storage of the "ontology" of the knowledge, i.e. the names, identifiers and inter-relations of the objects (categories) in the statements. The last two classes permit the storage of the statements.



The ontology: categories, category names, users

WebKB-2's notions of category, category name, category identifier, user, link (between categories or between a category and a name), statement, node and relation are explained in the documentation home page of WebKB-2. A "term" is to be understood as a "formal term"; it is a synonym for a "category" (i.e. an individual or a concept/relation type). Below is the C++/FastDB description of the classes required for the ontology (without the declaration of the methods associated with the classes). The classes dbArray, dbDateTime and dbReference are FastDB classes for respectively a dynamic array, a date and an internal reference to an object in the database.
The macro TYPE_DESCRIPTOR permits to associate database descriptors to class fields, via the following macros:

typedef dbReference<User> dbrUser; typedef dbReference<TermName> dbrTermName; typedef dbReference<Term> dbrTerm; typedef dbReference<Node> dbrNode; //For readability reasons, the declaration of the sub-structures Partition, //NLinkInfo, SLinkInfo and LinkInfo are given after the declaration of Term class User { public: const char *name; //login/source name, e.g. "pm", "sowa", "cyc", "suo" const char *password; //encoded password dbrTerm uri; //term with an email/URL address for key dbArray<dbrTerm> createdTerms; TYPE_DESCRIPTOR(( KEY(name,HASHED|INDEXED), FIELD(password), FIELD(uri), RELATION(createdTerms,creator) )); }; class TermName { public: const char *name; dbArray<dbrTerm> terms; TYPE_DESCRIPTOR(( KEY(name,HASHED|INDEXED), RELATION(terms,names) )); }; class Term { public: const char *key; const char *alternativeKey; dbrUser creator; //null if the category comes from WordNet dbrTerm module; //null if the category has no associated module, //e.g. does not come from the loading of a Web file int1 kind; //'r': relation type, 'c': concept type, 'i': individual //'f': functional relation type, 'u': undefined //'a': artificial; follow the '=' link to the real category //Rel/fct. signature (see FO grammar), e.g. a) r(?,{#term},{#term}*) b) f({#term}+ ->#term) dbArray<dbrTerm> sigTypes; //a) {null,#term,#term} b) {null,#term} dbArray<int1> sigTemplates;//a) {'?', 'e', 's'} b) {'S','?'} //sigTemplates: '?':?, '*':*, '+':?+, 'e':{?}, 's':{?}*, 'S':{?}+ const char *annotation; //informal def. (plus may be hidden attributes) dbDateTime creationDate; //most often automatically added by WebKB dbArray<dbrTermName> names; dbArray<NLinkInfo> namesInfo; //in names[], a lowercase name follow each that hasToBeStoredLowered() dbArray<dbrTerm> supertypes; dbArray<SLinkInfo> supertypesInfo; dbArray<dbrTerm> subtypes; dbArray<SLinkInfo> subtypesInfo; dbArray<Partition> closedSubpartitions; dbArray<dbrTerm> exclusions; dbArray<LinkInfo> exclusionsInfo; //with "complement" ('/') links before the '!' links dbArray<dbrTerm> types; dbArray<SLinkInfo> typesInfo; dbArray<dbrTerm> instances; dbArray<SLinkInfo> instancesInfo; dbArray<dbrTerm> inLinks; dbArray<LinkInfo> inLinksInfo; dbArray<dbrTerm> outLinks; dbArray<LinkInfo> outLinksInfo; dbArray<dbrNode> directNodes; //nodes (directly) using this term dbArray<dbrNode> indirectNodes; //nodes using a subtype/instance of this term TYPE_DESCRIPTOR(( KEY(key,HASHED|INDEXED), FIELD(alternativeKey), INDEXED_RELATION(creator,createdTerms), FIELD(module), FIELD(kind), FIELD(sigTypes), FIELD(sigTemplates), FIELD(annotation), FIELD(creationDate), RELATION(names,terms), FIELD(namesInfo), RELATION(supertypes,subtypes), FIELD(supertypesInfo), RELATION(subtypes,supertypes), FIELD(subtypesInfo), FIELD(closedSubpartitions), RELATION(exclusions,exclusions), FIELD(exclusionsInfo), RELATION(types,instances), FIELD(typesInfo), RELATION(instances,types), FIELD(instancesInfo), FIELD(inLinks/*,outLinks*/), FIELD(inLinksInfo), FIELD(outLinks/*,inLinks*/), FIELD(outLinksInfo), RELATION(directNodes,term), FIELD(indirectNodes) )); }; //Sub-structures Partition, NLinkInfo, SLinkInfo and LinkInfo: class Partition { public: dbrUser creator; dbArray<dbrTerm> terms; TYPE_DESCRIPTOR(( FIELD(creator), FIELD(terms) )); }; class NLinkInfo //to store info on some links term->name { public: dbrTermName destName; //to retrieve the relevant link dbrUser creator; //rarely null: "this" not created if same creator const char *community; //as the term or if no community TYPE_DESCRIPTOR(( FIELD(destName), FIELD(creator), FIELD(community) )); }; class SLinkInfo //to store the link creator of some links between terms { public: dbrTerm destTerm; //to retrieve the relevant link dbrUser creator; //never null since "this" not created otherwise TYPE_DESCRIPTOR(( FIELD(destTerm), FIELD(creator) )); }; class LinkInfo //to store info on each inLinks/outLinks (see below) { public: int1 kind; //e.g. '<', '>', '=', 'p', 'm', 'u' dbrUser creator;//null if same as the term TYPE_DESCRIPTOR(( FIELD(kind), FIELD(creator) )); };



The statements

RDBMSs and most deductive DBMSs have relation-based models (and hence, at least one CG sytem has it too, Bernd Groh's RDBMS-based CG system; click here for his PhD thesis. Many first-order logic based systems have relation-based notations, e.g. KIF or CycL. The notations and models of frame-based systems and WebKB-2 are node-based (i.e. information such as quantifiers and relations are represented within the nodes). Theoretically, the two approaches are equivalent. However, for representing natural language sentences or similar general complex knowledge, I believe the node-based approach lead to much more intuitive linear notations (since most often, no variable or arc is needed to represent relations between nodes) and models easier to exploit (the information related to a node, e.g. its contexts, relations and quantifier, is directly accessible from it, instead of being distributed over many relations or small graphs). On the other hand, for specifying complex combinations of quantifications, the relation-based approach is easier to use. Hence, as a low-level interchange format (permitting to define the semantics of higher-level notations) KIF's lisp-like notation is adequate. However, for knowledge entering and exchange, a high-level format such as Frame-CG (FCG) is required. (More details in our comparison of CGLF, CGIF, KIF, Frame-CG and Formalized-English).

In the data model of WebKB-2, the relations are sub-structures of the concept nodes, and a graph is just a particular concept node. The data structures can be seen as trees (although each relation is stored in both source and destination nodes) and hence the scope of each quantifier (stored in each concept node) is well delimited (reminder: in FCG and Formalized-English (FE) the order of the concept nodes is used for defining the scope of the quantifiers). If a user provides a graph without specifying a unique identifier, i.e. if s/he does not encapsulate the graph into a concept node via representing an individual (graph), WebKB-2 automatically generates this individual and node before storing the graph (in such a case, the "head node" of the provided graph is the first of the sub-nodes of the generated node).

In WebKB-2, graph retrieval is done by first accessing candidate nodes and then checking that the relations of each of the candidate nodes match the relations of the selected query node (if it has relations). The alternative, accessing candidate relations matching a relation in the query (if there are relations) and then check to see if the nodes match, seems to be less efficient if relations types are much more re-used than concept types (this is likely to be the case, at least for what WebKB-2 is intended for).

WebKB-2 notations, models and conventions discourage the use of non-binary relations (for precision and normalization reasons). Strictly speaking, FCG and FE actually do not permit non-binary relations; however, FCG permits to enter "functions" (or functional relations) in their usual format and without restriction on their number of paramaters; this feature may be used for representing non-binary relations but is strongly discouraged. The graph-matching mechanisms of WebKB-2 do not yet compare normal relations with "functions".

class Relation { public: bool isOutRel; dbrNode destination; dbrTerm type; int1 kind; //'e':exists, 'c':can be, 'm':may be //'!':!=, '=':=, '<':<, 'l':<=, 'g':>=, '>':> //'i':=>, '~':<=>, 'I':<=, '?':? dbrUser creator; //null if same as for the node dbArray<dbrNode> embeddingNodes; //local contexts (unused yet) const char *annotation; TYPE_DESCRIPTOR(( FIELD(isOutRel), FIELD(destination), FIELD(type), FIELD(kind), FIELD(creator), FIELD(embeddingNodes), FIELD(annotation) )); }; class Node { public: bool isNegated; dbrUser creator; dbDateTime creationDate; const char *annotation; //what's inside the (^...^) node part const char *text; //if statement, text for asserting the statement //else application-dependant place of storage dbrTerm term; //type / individual / fct_or_n-aryRel / null //null: number/date/...(see qualifier)/undeclTerm/coll int1 relOrFct;//0: NO, 'r': n-ary relation (n!=2), 'f': function call //'D': term NSC, 'N': term NC, 'S': term SC ('T':term TC) //'F': function definition (body: embeddedNodes[0]) dbArray<dbrNode> parameters; //can be empty even if isRelOrFct const char *undeclaredTermName; //when term==null dbrTerm constraintType; //if not null, term must specialize it int1 quantifier;//0: none (individual), 'e': existential (default), //'u': universal, 'p': percentage //'U': local universal, 'P': local percentage real8 number; //meaning and range depends on above quantifier: //'u'-> Q_NONE(0), Q_FEW(20), Q_MOST(60), Q_ALL(100) //'U'-> RARELY(10), MOSTLY(80), ONLY(100) //'p'/'P' -> [0-100]: percentage value (or use neg numbers) //'e'-> (n>0): natural number of anonymous instances // 0: some (the default), // -1: a few, -3: several/many, // -10: tens, -12: dozens, -20: hundreds, // -30: thousands, -60:millions, -90:billions // 0 -> term==null: number/date... (see qualifier) int1 interval;// 0 : no interval (default; exact number/time/...), //'a': about/near, 'A': around/(very) approximatively //'l': at least, 'm': at most, 'b': between, '_': from-to real8 toNumber; //when interval=='b'|'_' int1 qualifier; //if (quantifier) //term!=null (-> qualitative degrees) // 0: none (default), 'c':certain, 'b':bad, 'B':Big, // 'g':good, 'G':Great, 'i':important, 's':small //else if (term==null) //number/graphId/date/period? // 0 : number (default), 'd': date, 'p': period, //else (term!=null) // 'g': graphId, 's': string (both pointed by term) dbArray<dbrNode> collElems; dbrNode embeddingCollection; int1 collKind; //0: not a coll, 'd':(at least partially) distributive, //'D':fully distributive,'c':collective, 's':cumulative int1 collType; //'s':set, 'b':bag, 'x'/'o'/'a':XOR/OR/AND-collection //'l':list (ordered bag), 'q': sequence (ordered set) //'B':closed bag, 'S': closed set, 'L': closed list, //'X'/'O'/'A': closed...-coll, 'Q': closed sequence int4 collNum; //collKind!=0 -> number of elems (>= collElems length) //collKind==0 -> place in a collection (1st, 2nd, ...) dbArray<dbrNode> embeddedNodes;//1st node at embeddedNodes.length()-1 dbrNode embeddingNode;//problem if embedding hierarchy is not a tree? bool isFirstNode; dbArray<Relation> relations; //dbArray<RelatedGraphNode> correctionOrInstantiations; //to add TYPE_DESCRIPTOR(( FIELD(isNegated), FIELD(creator), FIELD(creationDate), FIELD(annotation), FIELD(text), //FIELD(indexedDE), //FIELD(coref), FIELD(diffCoref), RELATION(term,directNodes), FIELD(relOrFct), FIELD(parameters), FIELD(undeclaredTermName), FIELD(constraintType), KEY(quantifier,INDEXED),FIELD(number),FIELD(interval), FIELD(toNumber),FIELD(qualifier), RELATION(collElems,embeddingCollection), RELATION(embeddingCollection,collElems), FIELD(collKind), FIELD(collType), FIELD(collNum), RELATION(embeddedNodes,embeddingNode), RELATION(embeddingNode,embeddedNodes), FIELD(isFirstNode), FIELD(relations) )); };



Dr. Philippe A. MARTIN