CGIF

CGIF versus CGLF
Extensions used for (re)presenting the ontology of WebKB-2
Our EBNF/Lex&Yacc grammars of CGIF

1 CGIF versus CGLF

CGIF (CG Interchange Format) is intended to be one of the standard notations for exchanging knowledge (click here for the current draft) and is the official textual notation for Conceptual Graphs (CGs).

The original textual notation for CGs is called CGLF (CG linear Form). It was (and still is) used as input/output of many CG tools but with some variations or extentions. These variations could have been integrated into a unique grammar (e.g. the CG grammar used by WebKB-1 includes the main variations for representing relations from a concept node). However, at ICCS'95, John Sowa (the creator of CGs) and other CG researchers decided that a new notation should be cooperatively designed by the members of the CG community (via the CG mailing list cg@cs.uah.edu) in order to involve everyone and define a minimal abstract syntax and notation that achieves the following
a) has a proper foundation with translations to KIF and permits extensions to be built upon;
b) permits fast interchange of CGs between CG tools.
The basis of the new notation was a syntax developed by Gerard Ellis for a rapid parsing (and hence transfer) of simple CGs. This notation is closer to KIF (or RDF) than CGLF: it uses a relation-based lisp-like notation with variables to connect concept nodes. Thus, except for simple graphs, CGIF is less readable than CGLF which is graph-oriented and hence reduces the need for variables.

The argued gain of speed over CGLF has been used as a rationale for the adoption of the new notation. However, this gain has not been demonstrated and even if it exists, it is unlikely to be sufficiently significant to be used as a rationale. On our machine, our Flex&Bison parser for CGIF (see below) takes 1.80s in average to parse the 277,666 lines of the WebKB-2 ontology in CGIF and 1.25s in average to parse the 210,382 lines of the same ontology represented in FT, a notation we derived from CGLF to (re)present links between categories. (In these tests, the "parsing" did not include semantic checks nor the construction of objects in the knowledge base: with FT, this takes 14 to 24s depending on the checking level).

2 Extensions used for (re)presenting the ontology of WebKB-2

By "ontology", we mean the list of categories (concept types, relation types and individuals) and links between them. The current CGIF draft does not have much detail about type declaration.
It specifies that TypeHierarchy constructs may be used for storing definitions and subsumption links between concept types, that RelationHierarchy constructs may be used for storing definitions and subsumption links between relation types, and that CatalogOfIndividuals constructs may be used for storing delaration/definitions of individuals. We assume that within these constructs, and only within them, categories that have not yet been declared (i.e. "forward references") are allowed.

For readability reasons and because categories in WebKB-2 may have many kinds of links (not just subsumption and equality links), we cannot re-use these constructs. The various kinds of links connected from a category have to be presented together, in a concise way, and as "links": separated type definitions for representing links other than subsumption links cannot be used (e.g. type definitions do not permit the presentation of links via clean indented lists and hence increase the complexity of the exploration of links in a large ontology and its design or update; use the "category search" tool to convince yourself of this). Furthermore, in WebKB-2, categories and links have additional information associated to them, e.g. (the identifiers of) their creators.

Thus, CGIF needs to be extended. The best way would be a special syntax for links between categories, and hence an adaptation, extension or replacement of the above cited constructs. FT, the syntax we designed for (re)presenting links between categories in WebKB-2, is an extension of the way that subsumption links can be represented in CGLF. Many people in the CG community would like to keep representing links this way, probably because it is clear and concise. A different notation for links also reflects the difference in the CG data model between the ontology and the (other) graphs, differences which make the exploitation of CGs much easier (click here if you want to see the data model of WebKB-2). Hence, we think the best extension would be to integrate FT into CGIF.

However, since FT is already proposed by WebKB-2 and not yet integrated in CGIF, we must adopt another way to offer an alternative to developers of CGIF parsers. Since we need a graph-like structure to replace the restrictive special constructs, the only alternative we found is to use second-order graphs (i.e. graphs with types in the referent fields) despite the fact that the CGIF current standard does not give details or examples using them.

For the concept types and relation types used in these second-order graphs, we could use the identifiers of second-order types in the ontology of WebKB-2. But then, a CGIF parser would have no (simple) way to know that these graphs are second-order graphs and that yet undeclared types (i.e. "forward references") should be allowed in the referent fields of concept nodes. For this reason, and also to improve readability, we have chosen to introduce reserved types. Here is an example.

//Original version in the FT notation:
pm#thing__top_concept_type (^type of things that are not relations^) 29/11/1999
 ^  rdfs#class,
 >  {(pm#situation pm#entity)} pm#thing_playing_some_role,
 !  pm#relation;
       //Reminder: '^' means "instance of", '!' means "has for subtype", '!' means "exclusive with",
       //          "{(...)}" represents a close partition of exclusive subtypes
       //          "{...}" represents an open partition of exclusive subtypes

//Translation in CGIF without reserved types:
[pm#type: pm#thing *x ;type of things that are not relations;]
   (pm#creator ?x spamOnly@phmartin.info)
   (pm#creation_date ?x 29/11/1999)
   (pm#name ?x "thing")  (pm#name ?x "top_concept_type")
   (pm#kind ?x rdfs#class)
   (pm#subtype ?x {pm#situation pm#entity})
   (pm#subtype ?x pm#thing_playing_some_role)
   (pm#exclusive_class ?x pm#relation)

//Translation in CGIF with reserved types:
[TYPE: pm#thing *x ;type of things that are not relations;]
   (CREATOR ?x spamOnly@phmartin.info) 
   (CREATION_DATE ?x 29/11/1999) 
   (NAME ?x "thing")  (NAME ?x "top_concept_type")
   (KIND ?x rdfs#class)
   (GT ?x {pm#situation pm#entity}) (GT ?x pm#thing_playing_some_role)
   (EXCL ?x pm#relation)

GT, LT and EQ are already reserved types in the standard but must be used within the above cited constructs. TYPE and KIND have always been used by John Sowa to denote the top-level of second-order types and the "instanceOf" relation. NAME is also usual. For the signature of relations, we use DOMAIN and RANGE, the terms used in RDF (DOMAIN and RANGE are always used when declaring relation types and only in that case). To indicate the category creator and its creation date, we use CREATOR and CREATION_DATE (note: for readability, WordNet is never specified as creator; it is the default creator).
Here are 4 other reserved types for common links (i.e. second-order relation) that we need to add: EXCL, REVERSE, COMPLEMENT, SIMILAR (these links are explained in the documentation of WebKB-2).

The ontology of WebKB-2 has other kinds of links (most are inherited from WordNet) but they are simple shortcuts for defining relations between instances of the connected types. The semantics of links in WordNet is not very consistent but generally, a link L from a type X to a type Y means that for most object x instance of X, x has a relation of type L to some individual of type Y. For example, #bird p #wing means that most birds have wings. (In FCG, this can be written as [most #bird, pm#part: a #wing]). The second-order relation types used in the part of WordNet included in the ontology of WebKB-2 (i.e. that part relative to categories representing the meanings of nouns) are: pm#location, pm#substance, pm#part, pm#wnMember, pm#wnObject. (pm#wnMember is a membership relation that can have anything as its source, not just a collection; pm#wnObject can be used to represent that some objects belong to a domain). WebKB-2 also has a link for pm#url, i.e. for connecting a category to a home page about the kind of object represented by the category. Other such "shortcut" links may be added in the future. Given their specificity, rather than introducing a reserved type for each of them, we prefer to introduce only one to cover all of them: LINK. Here are examples:

[TYPE: #table *x]
  (NAME ?x "table")
  (LT ?x #array)
  (GT ?x #correlation_table)  (GT ?x #periodic_table) //plus many others in WebKB-2
  (LINK pm#wnMember ?x #row/3)
  (LINK pm#wnMember ?x #column/3)

[TYPE: #table/2 *x]
  (LT ?x #furniture)
  (GT ?x #card_table)  (GT ?x #coffee_table) //plus many others in WebKB-2
  (LINK pm#part ?x #leg/3)
  (LINK pm#part ?x #tabletop)

[TYPE: pm#Web-scripting_language *x]  //just an example; more details in WebKB-2
   (CREATOR ?x spamOnly@phmartin.info) 
   (LT ?x #programming_language)
   (GT ?x pm#style_sheet_language) (GT ?x pm#Javascript)
   (LINK pm#url ?x http://www.phmartin.info/int3004/)

//In WebKB-2, pm#Javascript is a type (not an individual) and pm#Javascript_1.0 is one of its subtypes
//The Javascript parsed by the browser on the machine that you currently use
//may be considered as an instance.
//However, the WebKB-2 parsers accept graphs that use certain types as if they were individuals
//(the condition is that these types must be a subtype of pm#information_entity)

When a link has a creator that is different from the creator of its source category, there is a need to specify the creator of this link. The only sufficiently readable way we found was to use ternary relations with reserved types: NAME_BY, GT_BY, LT_BY, EQ_BY, TYPE_BY, KIND_BY, EXCL_BY, REVERSE_BY, COMPLEMENT_BY, SIMILAR_BY and LINK_BY. (Note: this does not go against the fact that, for knowledge use and re-use purposes, non-binary relations should not be used in WebKB-2 because these ternary relations can be translated back into adequate representation in WebKB-2).
For names, in addition to the creator, it is interesting to permit the specification of the linguistic community (e.g. #French, #English, #CG_English_terminology) since it cannot be guessed with certainty from information attached to the creator (the default community is #English). Examples:

[TYPE: sowa#object *x (^see http://users.bestweb.net/~sowa/ontology/toplevel.htm^)]
   (CREATOR ?x sowa@bestweb.net)
   (NAME ?x "object")  (NAME ?x "physical_independent_continuant")
   (NAME_BY ?x "independent_spatial_entity" spamOnly@phmartin.info)
   (LT_BY ?x pm#spatial_entity spamOnly@phmartin.info) 
   (LT ?x sowa#actuality)  (LT ?x sowa#continuant)

[TYPE: #object *x]
   (NAME ?x "object") (NAME ?x "physical_object") 
   (NAME_BY_IN ?x "objet" spamOnly@phmartin.info #French)
   (NAME_BY_IN ?x "chose" spamOnly@phmartin.info #French)
   (LT ?x #entity)
   (LT_BY ?x pm#physical_entity_that_cannot_be_alive spamOnly@phmartin.info #French)
   (GT ?x {#natural_object #artifact})  //plus many others in WebKB-2

Recall that we use a graph syntax for links only to give an alternative to a FT-like syntax. We assume that links are encoded differently from graphs in the data model (as is currently the case in CG workbenches). Thus, the above graphs would not be stored as such; the reserved link names would only have to be known to the CGIF parsers.

As the above examples show, for readability reasons, we have also extended CGIF in other ways: (i) there are no concept node around individual markers, strings, dates and partitions because there is no ambiguity about the types of these objects, (ii) we have kept the FT syntax for dates and for open/close subtype partitions. CGIF does not have a special format for dates and therefore no standard way to represent (and hence exchange) them. For partitions, if we had followed the current syntax of CGIF, we would have used [TYPE: {#natural_object #artifact}] instead of just {#natural_object #artifact}. But what would we have done if the partition had been closed: {(#natural_object #artifact)}. In CGIF, there is no standard way to distinguish between:
- open collection, close collection (meaning that all the elements have been given);
- bag, set (the usual set), xor-set, or-set, list (ordered set), ordered bag;
- distributive interpretation, collective interpretation (as with "together" in English).
Given the pervasiveness of collections in knowledge representation, the absence of distinctions implies that the parsing and exploitation of knowledge in CGIF is domain-dependant and source-dependant (which is what an interchange format is intended to avoid).
On the 23/08/2001, I sent an e-mail to the CG list about this problem and the lack of reserved keywords for distinguishing various kinds of quantifiers.

CGIF is intended to be basic but to permit the definition of extensions. However, how these extensions are to be made is not yet defined in the standard. Some extensions could be made via type definitions, rules or second-order graphs but each of these possibilities greatly complicates the parsing, checking, and exploitation of the knowledge, and presents theoretical problems in general cases. Furthermore, since many extensions are needed, we expect that each knowledge provider will define its own extensions and that the extensions defined by different people will be difficult to compare/combine automatically. Standard reserved types/keywords with well defined meanings constitute a simple way to avoid the use of ad-hoc extensions.

We welcome suggestions to improve the above proposed format, i.e. the way WebKB-2 presents links between categories in CGIF.

3 Our EBNF/Lex&Yacc grammars of CGIF

Though integrating FT in CGIF remains a better solution than the above "patches", the extension of the CGIF grammar to permit the omission of concept nodes around objects when there is no ambuiguity about their types seems important to improve readability (and hence knowledge representation/presentation/update and tool debugging). Objects such as dates and "robust" strings should also be added (by robust strings, we mean strings with unusual delimiters to permit the storage of most textual/binary data without need for escaping it). The syntax of identifiers should also be extended to permit import/export from various languages and ontologies (at least URLs and e-mail addresses should be accepted as identifiers). The distinction between annotations and comments should be made: annotations are stored in the knowledge base with the object they are associated to while comments are discarded by the parsers. Line and multi-line Java/C++ comments should be accepted.

We show the CGIF grammar in EBNF and then with the Lex&Yacc syntax, with the extensions we propose highlighted in bold.
Note: in WebKB-2, identifiers may begin by '#'. Given our extension, indexicals may and should be removed (below it is kept in the grammar but the lexical analyser interpret "#foo" as an identifier, not an indexical).

Extended BNF grammar for CGIF ("?" means 0 or 1 times, "*" means 0 to N times, "+" means 1 to N times)

CG            := (Concept | Relation | Actor | SpecialContext)+
SpecialContext:= "~[" CG "]"  |  "[" SpecialType ":"? CG "]"
SpecialType   := "if" | "then" | "either" | "or" | "sc"
Actor         :=  "<" Type(N) ArcConcept* "|" ArcConcept* Annotation? ">"

Concept       := "[" ConceptType? (CorefLinks? Referent?)+ Annotation? "]"
Relation      := "(" RelationType ArcConcept* Annotation? ")"

ConceptType   :=  Identifier | CT_expression
CT_expression := "(" "lambda" "(" CT_parameter ")" CG ")"  |  CT_disjuncts
CT_parameter  := ConceptType DefLabel?
CT_disjuncts  := "(" CT_conjuncts ("|" CT_conjuncts)* ")"
CT_conjuncts  := CT_term ("&" CT_term)*
CT_term       := "~"? ConceptType

RelationType  :=  Identifier | RT_expression
RT_expression := "(" "lambda" "(" RT_signature ")" CG ")"  |  RT_disjuncts
RT_signature  := RT_parameter ("," RT_parameter)*
RT_parameter  := RelationType DefLabel?
RT_disjuncts  := "(" RT_conjuncts ("|" RT_conjuncts)* ")"
RT_conjuncts  := RT_term ("&" RT_term)*
RT_term       := "~"? RelationType

CorefLink     := DefLabel | BoundLabel*
Referent      := ":" (Designator? Descriptor?)+
Descriptor    := Structure | CG
Structure     := StructID? Collection

Designator    := Literal | Locator | Quantifier
Literal       := Number | Quoted2_str
Quantifier    := "@" (UnsignedInt | Identifier Collection?)
Locator       := Name | IndividualMarker | Indexical | Identifier

Collection    := "{" ArcConcept* "}" | "{(" ArcConcept* ")}"
ArcConcept    := Concept | BoundLabel | Collection | Point_in_time | Time
               | Literal | Locator | "@" (Identifier Collection)

//Lexical elements (some LEX syntax is used):
StructID        := '%' Identifier
DefLabel        := "*" Identifier
BoundLabel      := "?" Identifier
IndividualMarker:= "#" UnsignedInt
Indexical       := "#" Identifier?
Digit           := [0-9]
UnsignedInt     := Digit+
Letter          := [a-zA-Z_]
Identifier      := IdLetter1 IdLetter*
IdLetter1       := Letter | "#"Letter | "#."
IdLetter        := Letter | Digit | "#" | "_" | "-" | "/" | "?" | "&" | "~" 
                 | "."?[a-z0-9?#~]  //"." may be within an identifier but not at the end
                 | "://"            //a URL may be an identifier
Number          := ("+"|"-")? Digit+ ("." Digit* )?
Point_in_time   := Date" "Time?
Date            := ([0-9][0-9]?"/")?[0-9][0-9]?"/""-"?[0-9][0-9]?[0-9]?[0-9]?
Time            := [0-9][0-9]?":"[0-9][0-9](":"[0-9][0-9])?

Name            := DelimitedStr("'")    //'...' (as in the standard)
Usual_string    := DelimitedStr('"')    //"..." (as QuotedStr in the standard)
Robust_string   := 
QuotedStr       := "$("([^\)]|(\)[^$]))*")$"    //$(...)$
QuotedStr       := Usual_string | Robust_string

Yacc + Lex grammar for CGIF

%token IDENTIFIER NUMBER QUOTED1_STR QUOTED2_STR ROBUST_STR INDIVIDUAL_MARKER
%token INDEXICAL DEF_LABEL BOUND_LABEL LAMBDA IF THEN EITHER OR SC
%token QUANTIF_INT QUANTIF_ID STRUCT_ID QUANTIF_ID_BRACKET QUANTIF_ID_CCB
%token ANNOTATION POINT_IN_TIME PERIOD CCB CCE
%start CGIF
%%
CGIF        : CG | CGIF CG;
CG          : CONCEPT | RELATION | ACTOR | SPECIAL_CTXT
            ;
ACTOR       : '<' RTYPE _ARCconcepts '|' _ARCconcepts _ANNOT '>'
            ;
SPECIAL_CTXT: '~' '[' CG ']'
            | '[' SPECIAL_TYPE ':' CG ']'
            | '[' SPECIAL_TYPE     CG ']';
SPECIAL_TYPE: IF | THEN | EITHER | OR | SC;

CONCEPT     : '[' _CTYPE _REFS _ANNOT ']';
RELATION    : '(' RTYPE _ARCconcepts _ANNOT ')';

_REFS       : | REF _REFS;
REF         : COREF_LINK | REFERENT;

_CTYPE      : | CTYPE;
CTYPE       : IDENTIFIER | CT_EXPR;
CT_EXPR     : '(' LAMBDA '(' CTYPE _DEF_LABEL ')' CG ')'
            | '(' CT_CONJUNCTS _CT_DISJS ')';
_CT_DISJS   : | '|' CT_CONJUNCTS _CT_DISJS;
CT_CONJUNCTS: CT_TERM _CT_CONJS;
_CT_CONJS   : | '&' CT_TERM _CT_CONJS;
CT_TERM     : CTYPE | '~' CTYPE;

RTYPE       : IDENTIFIER | RT_EXPR;
RT_EXPR     : '(' LAMBDA '(' RTYPE _DEF_LABEL _RT_PARAMS ')' CG ')'
            | '(' RT_CONJUNCTS _RT_DISJS ')';
_RT_PARAMS  : | ',' RTYPE _DEF_LABEL _RT_PARAMS;
_RT_DISJS   : | '|' RT_CONJUNCTS _RT_DISJS;
RT_CONJUNCTS: RT_TERM _RT_CONJS;
_RT_CONJS   : | '&' RT_TERM _RT_CONJS;
RT_TERM     : RTYPE | '~' RTYPE;

COREF_LINK  : DEF_LABEL | BOUND_LABEL;
_DEF_LABEL  : | DEF_LABEL;

REFERENT    : ':' _DESS;
_DESS       : | DESIG__DESCR _DESS;
DESIG__DESCR: DESIGNATOR | DESCRIPTOR;
DESCRIPTOR  : STRUCTURE | CG;
STRUCTURE   : _STRUCT_ID COLLECTION;

DESIGNATOR  : LITERAL | LOCATOR | QUANTIFIER;
LITERAL     : NUMBER | QUOTED2_STR | ROBUST_STR;
LOCATOR     : NAME | INDIVIDUAL_MARKER | INDEXICAL | IDENTIFIER;
NAME        : QUOTED1_STR;
QUANTIFIER  : QUANTIF_INT | QUANTIF_ID | QUANTIF_COLL;
QUANTIF_COLL: QUANTIF_ID_BRACKET _ARCconcepts '}'
            | QUANTIF_ID_CCB     _ARCconcepts CCE;

COLLECTION  : '{' _ARCconcepts '}' | CCB _ARCconcepts CCE;
_ARCconcepts: | ARCconcept _ARCconcepts;
ARCconcept  : CONCEPT | BOUND_LABEL | COLLECTION
            | LITERAL | LOCATOR | QUANTIF_COLL | POINT_IN_TIME | PERIOD;

_ANNOT      : | ANNOTATION;
_STRUCT_ID  : | STRUCT_ID;
%%

//lex code now: ANY (.|\n) DIGIT [0-9] ALPHA [a-z] ALNUM [a-z0-9] S ([ \t\r]|" ")+ IDENT0 ({ALNUM}|"_") IDENT1 ({IDENT0}|"#"({IDENT0}|".")) IDENTn ({IDENT0}|"-"|"/"|"|"|"?"|"&"|"@"|"#"|"~"|"'"|"."|"://") IDENT ({IDENT1}({IDENTn})*|"\$") U_INT {DIGIT}+ NUMERAL ("+"|"-")?{DIGIT}+("."{DIGIT}*)? TIME_STR ({DIGIT}{DIGIT}?":"{DIGIT}{DIGIT}(":"{DIGIT}{DIGIT})?) DATE_STR (({DIGIT}{DIGIT}?"/")?{DIGIT}{DIGIT}?"/""-"?{DIGIT}{DIGIT}?{DIGIT}?{DIGIT}?(" "{TIME_STR})?) ANNOT_ALL ";"([^\;]|(\;\;))*";" ANNOT_END ([^\;]|(\;\;))*";" QSTR1 '([^'\\]|(\\{ANY}))*' QSTR1_END ([^'\\]|(\\{ANY}))*' QSTR2 \"([^"\\]|(\\{ANY}))*\" QSTR2_END ([^"\\]|(\\{ANY}))*\" DELIM_STR "$("([^\)]|(\)[^$]))*")$" DELIM_END ([^\)]|(\)[^$]))*")$" %x C_COMM ANNOT1 DELIM1 STR1 STR2 %% "/*" BEGIN(C_COMM); /* much quicker than using yyinput() */ <C_COMM>"\n" LineNumber++; <C_COMM>"*/" BEGIN(0); <C_COMM>. ; {S} ; "\n" LineNumber++; "//"[^\n]* ;/* skip C++ line comments */ "" ; "<!" cgif_skipTag(); ";" BEGIN(ANNOT1); <ANNOT1>{ANNOT_END} {LineNumber+=nbLinesInStr(cgiftext); cgiftext[cgifleng-1]='\0';BEGIN(0);return ANNOTATION; } "$(" BEGIN(DELIM1); <DELIM1>{DELIM_END} {LineNumber+=nbLinesInStr(cgiftext); cgiftext[cgifleng-2]='\0'; BEGIN(0); return ROBUST_STR; } "'" BEGIN(STR1); <STR1>{QSTR1_END} {LineNumber+=nbLinesInStr(cgiftext); cgiftext[cgifleng-1]='\0'; BEGIN(0); return QUOTED1_STR; } \" BEGIN(STR2); /*comment closing the other \" of this line*/ <STR2>{QSTR2_END} {LineNumber+=nbLinesInStr(cgiftext); cgiftext[cgifleng-1]='\0'; BEGIN(0); return QUOTED2_STR; } {DATE_STR} return POINT_IN_TIME; {TIME_STR} return PERIOD; {NUMERAL} {strcpy(TmpBuff,cgiftext);return NUMBER;} "#"{U_INT} return INDIVIDUAL_MARKER; "@"{U_INT} return QUANTIF_INT; "@"{IDENT}"{(" return QUANTIF_ID_CCB; "@"{IDENT}"{" return QUANTIF_ID_BRACKET; "@"{IDENT} return QUANTIF_ID; "%"{IDENT} return STRUCT_ID; "*x" return DEF_LABEL; "?x" return BOUND_LABEL; "lambda" return LAMBDA; "IF" return IF; "THEN" return THEN; "EITHER" return EITHER; "OR" return OR; "SC" return SC; "{(" return CCB; /* Closed Collection Beginning */ ")}" return CCE; /* Closed Collection End */ {IDENT} {return IDENTIFIER;} . {return cgiftext[0];} %%

Philippe A. MARTIN