A guide to common mapping problems and their solutions

From:

Importing SGML data into CELLAR by means of architectural forms
Gary F. Simons
Summer Institute of Linguistics

Last revised: 13 December 1997


Find the problem you are trying to solve in this list:

  1. An SGML element corresponds to a CELLAR object or attribute
  2. An SGML element corresponds to nothing in CELLAR
  3. An SGML attribute corresponds to a CELLAR attribute
  4. An SGML attribute corresponds to nothing in CELLAR
  5. There is no SGML element or attribute for a CELLAR object or attribute
  6. Other issues


1. An SGML element corresponds to a CELLAR object or attribute

Mapping an element to a CELLAR object

In the GCAPAPER DTD, the document element (<GCAPAPER>) corresponds to an object of class Article in the target object model.  This mapping is expressed in the mapping DTD as follows:

<!ATTLIST GCAPAPER
     cellar      NAME  #FIXED object
     class       CDATA #FIXED Article  >

Mapping an element to a CELLAR attribute

In the GCAPAPER DTD, the <abstract> element corresponds to the abstract attribute of Article in the target object model.  This mapping is expressed in the mapping DTD as follows:

<!ATTLIST ABSTRACT
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED abstract   >

Since an Article is the object currently being built when <abstract> is encountered, this has the effect of putting the content of this element in the attribute of Article named abstract. See "Mapping the character data content to basic objects" if the element has PCDATA in its content model.

Mapping one element to both an object and an attribute

In the TEILITE DTD, the <docTitle> element embeds further elements which specify parts of the title.  This corresponds to a TitleGroup in the target object model.  The TitleGroup is placed in the title attribute of the parent object, which in this case is an Article.  This mapping is expressed in the mapping DTD as follows:

<!ATTLIST docTitle 
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED TitleGroup
     parentAttr  CDATA #FIXED title  >

Mapping one element to two objects

In the GCAPAPER DTD, the footnote element (<ftnote>) embeds paragraphs directly. In the target object model, a FootnoteMarker embeds a Note object in its note attribute. This Note in turn embeds paragraphs in its paragraphs attribute. This mapping would be expressed in the mapping DTD as follows:

<!ATTLIST FTNOTE
     cellar       NAME  #FIXED double
     class        CDATA #FIXED FootnoteMarker
     contentAttr  CDATA #FIXED note
     class2       CDATA #FIXED Note
     contentAttr2 CDATA #FIXED paragraphs >

Note that a case like this might also use the parentAttr attribute to specify the attribute of the parent object into which the first object should go. In this case, contentAttr has been declared on the parent object in order to specify the target attribute for the top-level object created here; see "Placing a sequence of elements in an implied CELLAR attribute."

See also "Wrapping a sequence of elements in an implied CELLAR object."

Mapping the character data content to basic objects

Whenever an element can have PCDATA content, its mapping rule needs to declare a value for contentAttr (or contentAttr2 in the case of <double>) in order to specify the target attribute for the PCDATA content.  It may also optionally declare a value for pcdataClass to specify what class of basic object should be created.  If no value is declared, then the architecture supplies String by default.

For instance, in the GCAPAPER DTD, <verbatim> is used for text (like a source code sample) in which the line breaks are to be preserved.  In the target object model this corresponds to an object of class LiteralText.  The verbatim text itself goes into the contents attribute and must be a basic object of class Text rather than String, since the former permits line breaks while the latter converts them to space. This mapping would be expressed in the mapping DTD as follows:

<!ATTLIST VERBATIM 
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED LiteralText
     contentAttr CDATA #FIXED contents
     pcdataClass CDATA #FIXED Text      >


2. An SGML element corresponds to nothing in CELLAR

Ignoring extraneous element markup

Both the GCAPAPER DTD and the TEILITE DTD have a <front> element to encode the front matter for the document itself. The target object model does not have a front matter object; rather, it treats front matter elements (like title, authors, abstract, and so on) as attributes of the document. In this case a mapping from the SGML model to the object model would need to ignore the <front> and </front> tags and just process the element content. This mapping would be expressed in the mapping DTD as follows:

<!ATTLIST FRONT
     cellar      NAME  #FIXED ignore    >

Using literal text in place of ignored markup

When a group of elements with PCDATA content all have their element markup ignored and are all mapped to the same attribute, the effect is that the character strings simply get appended.  A space is automatically inserted to separate concatenated strings, but often a more conventional treatment would be to use literal text delimiters.  For instance, in the GCAPAPER DTD each author is encoded in a separate <author> element; in the target object model, there is only an authorField attribute in which all the authors are listed in a single string.  The textBetween architectural attribute is used to specify a literal text string that is to be inserted between strings that go into the same attribute.  (More precisely, it is inserted before the current string if the attribute already has a value.)  In this case we want to use a comma followed by a space as the delimiter. Note that we must quote the literal string with two sets of quotes: the outer set is stripped off by the SGML parser, and the inner set is passed on as part of the value to the CELLAR parser, which in turn strips it off.  That is,

<!ATTLIST AUTHOR
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED authorField
     textBetween CDATA #FIXED "', '"      >
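
The appending rule described above can be sketched in Python. This is only an illustration of the stated behavior (the delimiter is inserted before the current string if the attribute already has a value, and a single space is used when no delimiter is declared). The function append_text and the dictionary standing in for the Article object are hypothetical and are not part of CELLAR.

```python
def append_text(obj, attr, text, text_between=" "):
    """Append text to obj[attr] (hypothetical helper, not a CELLAR API).

    If the attribute already has a value, text_between is inserted first.
    The default of one space mirrors the automatic separator.
    """
    current = obj.get(attr, "")
    if current:
        current += text_between
    obj[attr] = current + text
    return obj


# Three <author> elements, all mapped to the same authorField attribute.
article = {}
for name in ["Ann Author", "Bob Writer", "Cy Scribe"]:
    append_text(article, "authorField", name, text_between=", ")

print(article["authorField"])  # Ann Author, Bob Writer, Cy Scribe
```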

As another example of literal text delimiters, consider quotations.  The GCAPAPER DTD has a <quotation> element, but the target object model has nothing that corresponds. Instead, quotations are simply encoded by using quotation marks.  This can be achieved in the mapping by ignoring the element markup, and then using the textBefore and textAfter architectural attributes to attach quotation marks at the beginning and end of the PCDATA content.  That is,

<!ATTLIST QUOTATION
     cellar      NAME  #FIXED ignore
     textBefore  CDATA #FIXED "'&#34;'"
     textAfter   CDATA #FIXED "'&#34;'"    >

Note that the literal strings need two sets of quotes (as explained above) and so we must use a character reference (&#34;) to access the double quote character within the value of the string.  The SGML parser will pass the string to the CELLAR parser as '"'.
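
The two stages of quote stripping can be traced with a short Python sketch. The function names are hypothetical, and html.unescape from the Python standard library is used here only as a stand-in for the SGML parser's handling of numeric character references; neither SP nor the CELLAR parser is actually invoked.

```python
import html


def strip_quotes(value):
    """Remove one matching pair of surrounding quotes."""
    if len(value) < 2 or value[0] != value[-1] or value[0] not in "\"'":
        raise ValueError("value is not quoted: %r" % value)
    return value[1:-1]


def sgml_stage(literal):
    """Stage 1 (SGML parser): strip the outer quotes and resolve
    character references such as &#34;."""
    return html.unescape(strip_quotes(literal))


def cellar_stage(value):
    """Stage 2 (CELLAR parser): strip the inner quotes."""
    return strip_quotes(value)


declared = "\"'&#34;'\""           # as written in the mapping DTD
passed_on = sgml_stage(declared)   # what the CELLAR parser receives
literal = cellar_stage(passed_on)  # the text actually inserted

print(passed_on)  # '"'
print(literal)    # "
```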

Discarding unusable elements

In some cases when an element corresponds to nothing in the object model, it is also the case that the subelements in its content do not correspond either. In a case like this, we cannot just ignore the element markup and pass through to the content; rather, the element (with all its content) must be discarded. The <teiHeader> element of the TEILITE DTD provides an example of this. It is a complex element that can contain dozens of embedded elements, but the target object model has no equivalent. The parser that loads the architectural document into the CELLAR database recognizes the keyword DISCARD in place of an attribute name as an instruction to discard all of the embedded content. The mapping to discard the <teiHeader> would thus be expressed as follows in the mapping DTD:

<!ATTLIST teiHeader
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED DISCARD  >

Note that it is not necessary to define mappings for the elements in the embedded content, since a client document element that has no mapping to an architectural element is not retained in the architectural document (though its content is). In other words, this is the effect of the <ignore> architectural form. It could thus be argued that the latter is not needed; it is included, however, so that the mapping DTD can document in full how each client element is handled.

Note, too, that since the DISCARD function is implemented as a special attribute name (rather than as an architectural form), it can take part in conditional mappings.


3. An SGML attribute corresponds to a CELLAR attribute

Mapping an SGML attribute to a CELLAR attribute

The case of mapping an SGML attribute onto an object attribute uses a set of three architectural attributes to specify the name of the target attribute, its type, and its value. For instance, in the TEILITE DTD the division tags have an n attribute that stores a label for the division, as in:

<div1 n="3"><!-- Content of Section 3 --></div1>

In the target object model, the corresponding Section object has an attribute named label for storing this kind of information. The mapping for the <div1> element and its n attribute would be expressed in the mapping DTD as follows:

<!ATTLIST div1 
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED Section
     attrName    CDATA #FIXED label
     attrType    CDATA #FIXED String
     cellarNames CDATA #FIXED "attrValue n" >

The declarations for attrName (the name of the target attribute) and attrType (the type of object to create for the target attribute value) are straightforward. The tricky part is cellarNames. This attribute is declared in the architectural support attributes of the mapping DTD (see step 2 in the process) to be the "attribute renamer." Its value says, "Rename the attribute n in the client document to be attrValue in the architectural document." Thus when the architecture engine processes the above mapping declaration over the input <div1 n="3">, it produces:

<object class="Section" attrName="label" 
        attrType="String" attrValue="3">

Mapping IDREFs to pointers between objects

A special case of mapping an attribute to an attribute is the case of an IDREF. Rather than creating an object to set as the target attribute value, the system must find the object with the target ID and set a pointer to it. An example is a figure reference in the GCAPAPER DTD, for instance, <figref refloc="fig3">. This is a cross-reference to the figure which is tagged as <figure id="fig3">. In the target object model, a figure reference maps onto a CrossReference object and the link to the cross-referenced item goes in the reference attribute. This mapping would be expressed in the mapping DTD as follows:

<!ATTLIST FIGREF
     cellar      NAME  #FIXED object
     class       CDATA #FIXED CrossReference
     attrName    CDATA #FIXED reference
     attrType    CDATA #FIXED IDREF
     cellarNames CDATA #FIXED "attrValue refloc"   >

Note that this is just like the case of mapping an attribute to an attribute given above with one exception: the attrType is given as IDREF. The parser which reads the architectural document and converts it to the corresponding objects in CELLAR recognizes this keyword as a sign that it must resolve the string given in attrValue into a pointer to the object that was declared to have that ID. Note in the architectural DTD, id is declared to be an architectural attribute of the <object> architectural form. Thus the parser also knows to enter any object with an ID into a lookup table that it can use to resolve the IDREFs.
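
The lookup-table approach described above might look roughly like the following Python sketch. The Obj class and resolve_idrefs function are hypothetical stand-ins for illustration, not the actual CELLAR parser. Using CaptionedChunk as the class for the figure is an assumption, as is deferring resolution until every object has been read (which lets a reference precede its target).

```python
class Obj:
    """Minimal stand-in for a CELLAR object (hypothetical)."""

    def __init__(self, cls, obj_id=None):
        self.cls = cls
        self.id = obj_id
        self.attrs = {}


def resolve_idrefs(objects, pending):
    """Turn ID strings into pointers.

    objects: every object built from the architectural document
    pending: (object, attribute name, target ID) triples collected
             wherever attrType was IDREF
    """
    # Enter every object that has an ID into the lookup table.
    table = {o.id: o for o in objects if o.id is not None}
    for obj, attr_name, target_id in pending:
        obj.attrs[attr_name] = table[target_id]


figure = Obj("CaptionedChunk", obj_id="fig3")  # assumed class for <figure>
figref = Obj("CrossReference")
resolve_idrefs([figure, figref], [(figref, "reference", "fig3")])

print(figref.attrs["reference"] is figure)  # True
```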

Mapping multiple attributes of an SGML element

The architecture defines two sets of architectural attributes (attrName, attrType, attrValue and attrName2, attrType2, attrValue2) for handling two attributes of an SGML element.  If your application needs to be able to handle more attributes, it is straightforward to add attrName3, attrType3, attrValue3 and so on.  The following things must be done to add them:

Mapping SGML attribute values to new values in CELLAR

Clause 3.5.2, "Architectural attribute renamer," of the AFDR standard [ISO97] states that an architecture-to-client attribute name pair may be followed by triples beginning with #MAPTOKEN.  The second and third members of the triple give a mapping for attribute values: the second member is the attribute value to substitute in the architectural document if the client document has the third member as the value.

For instance, in the GCAPAPER DTD there is a <highlight> element with a style attribute that specifies a rendition property like bold, ital, and so on.  In the target object model, each of these styles corresponds to a different subclass of the Phrase object.  Thus the value of the style attribute provides the value for the class attribute in the architecture, with the additional requirement that the individual values for the style attribute be mapped onto the exact class names they correspond to.  For instance, style="bold" corresponds to class="BoldPhrase". This mapping would be expressed in the mapping DTD as follows:

<!ATTLIST HIGHLIGHT
     cellar      NAME  #FIXED object
     cellarNames CDATA #FIXED "class style 
                               #MAPTOKEN BoldPhrase       bold
                               #MAPTOKEN ItalicPhrase     ital
                               #MAPTOKEN BoldItalicPhrase bital
                               #MAPTOKEN UnderlinePhrase  under"
     contentAttr CDATA #FIXED contents >
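
To make the intended effect of the renamer value concrete, here is a minimal Python sketch that parses a value of the form shown above and applies it to the attributes of one client element. It models only name pairs and #MAPTOKEN triples as described in this section. It ignores the other keywords of the AFDR standard and simply drops attributes that have no rule, so it is an illustration rather than an implementation of the standard. The function names are hypothetical.

```python
def parse_renamer(spec):
    """Parse a renamer value into
    {client_name: (arch_name, {client_value: arch_value})}."""
    tokens = spec.split()
    rules = {}
    i = 0
    while i < len(tokens):
        arch_name, client_name = tokens[i], tokens[i + 1]
        i += 2
        value_map = {}
        # Each triple is: #MAPTOKEN architectural-value client-value
        while i < len(tokens) and tokens[i] == "#MAPTOKEN":
            value_map[tokens[i + 2]] = tokens[i + 1]
            i += 3
        rules[client_name] = (arch_name, value_map)
    return rules


def rename_attributes(client_attrs, spec):
    """Apply the renamer to one element's attributes (simplified)."""
    rules = parse_renamer(spec)
    result = {}
    for name, value in client_attrs.items():
        if name in rules:
            arch_name, value_map = rules[name]
            result[arch_name] = value_map.get(value, value)
    return result


spec = """class style
          #MAPTOKEN BoldPhrase       bold
          #MAPTOKEN ItalicPhrase     ital
          #MAPTOKEN BoldItalicPhrase bital
          #MAPTOKEN UnderlinePhrase  under"""

print(rename_attributes({"style": "bold"}, spec))     # {'class': 'BoldPhrase'}
print(rename_attributes({"n": "3"}, "attrValue n"))   # {'attrValue': '3'}
```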

Unfortunately, the #MAPTOKEN feature was a late addition to the AFDR standard and is not yet supported by the SP parser. Thus the mapping is coded as follows in the sample mapping DTD:

<!ATTLIST HIGHLIGHT
     cellar      NAME  #FIXED object
     class       CDATA #FIXED ItalicPhrase
     contentAttr CDATA #FIXED contents
 --  cellarNames CDATA #FIXED "class style 
                               #MAPTOKEN BoldPhrase       bold
                               #MAPTOKEN ItalicPhrase     ital
                               #MAPTOKEN BoldItalicPhrase bital
                               #MAPTOKEN UnderlinePhrase  under" -->

In other words, every <highlight> element is mapped to an ItalicPhrase for now.  The full rule with #MAPTOKEN is preserved in a comment. Alternatively, one could use the attribute renamer to rename style as class and then run a script over the output to change class="bold" to class="BoldPhrase", and so on.
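
A post-processing script of the kind suggested here could be as small as the following Python sketch. It assumes that the architectural document writes the attribute as class="..." in lower case with double quotes; the actual output of the architecture engine may differ (for example in name case), so the pattern would need adjusting. The function name fix_classes is hypothetical.

```python
import re

# Client style values and the class names they stand for.
STYLE_TO_CLASS = {
    "bold": "BoldPhrase",
    "ital": "ItalicPhrase",
    "bital": "BoldItalicPhrase",
    "under": "UnderlinePhrase",
}


def fix_classes(text):
    """Replace style values left in class="..." with real class names.
    Values that are not in the table are left unchanged."""
    def repl(match):
        value = match.group(1)
        return 'class="%s"' % STYLE_TO_CLASS.get(value, value)
    return re.sub(r'class="([^"]*)"', repl, text)


sample = '<object class="bold" contentAttr="contents">'
print(fix_classes(sample))
# <object class="BoldPhrase" contentAttr="contents">
```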


4. An SGML attribute corresponds to nothing in CELLAR

Discarding an SGML attribute

This is accomplished automatically.  If an attribute of the client document is not also an attribute in the architectural DTD, and if it is not renamed by invoking the attribute renamer, then the attribute is not passed on to the architectural document.


5. There is no SGML element or attribute for a CELLAR object or attribute

Wrapping a single element in an implied CELLAR object

Wrapping a sequence of elements in an implied CELLAR object

Sometimes an implicit object must embed more than one explicit one; this is an outstanding problem for the CELLAR architecture as currently defined. For instance, in the TEILITE DTD, a <list> may contain <item>s, or it may contain <label> plus <item> pairs. In the target object model, each such pair represents the attributes of an implicit LabeledListItem object. The best solution available, with the CELLAR architecture as currently formulated, is to generate separate LabeledListItems for both the <label> and the <item> and then to clean up the result by hand.

Placing a single element in an implied CELLAR attribute

Placing a sequence of elements in an implied CELLAR attribute

Generating an implied CELLAR attribute


6. Other issues

Making conditional mappings

Sometimes a given element in the client document is used in multiple contexts and what it corresponds to in the target object model depends on the context. The <title> element in the GCAPAPER DTD is an example of this. When it appears in a figure, it corresponds to the caption attribute of a CaptionedChunk object. When it appears in the front matter, it corresponds to the titleField attribute of an Article object. Otherwise, such as in a section, it corresponds to the heading attribute. This mapping would be expressed in the mapping DTD as follows:

<!ATTLIST TITLE
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED "if CaptionedChunk caption
                               if Article titleField
                               heading"   >

Note that this conditional expression is interpreted by the parser which converts the architectural document into objects in CELLAR, not by the architecture engine. Thus the context must be specified in terms of the class of object that is currently under construction; it cannot be in terms of the elements of the client document since that information is no longer available in the architectural document. Note, too, that the DISCARD keyword can be used as an attribute name in a conditional expression to set up a mapping in which an element should be used in some contexts and discarded in others.
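
The way such a conditional value might be evaluated can be sketched in Python. The function choose_attr is hypothetical. It assumes that the value consists of zero or more "if Class attribute" triples followed by an optional default, and that the class test is an exact name match; whether the real parser also matches subclasses is not stated here.

```python
def choose_attr(spec, current_class):
    """Pick the target attribute for the object under construction."""
    tokens = spec.split()
    i = 0
    while i < len(tokens) and tokens[i] == "if":
        cls, attr = tokens[i + 1], tokens[i + 2]
        if cls == current_class:
            return attr
        i += 3
    # Whatever remains is the default; None if there is no default.
    return tokens[i] if i < len(tokens) else None


spec = """if CaptionedChunk caption
          if Article titleField
          heading"""

print(choose_attr(spec, "CaptionedChunk"))  # caption
print(choose_attr(spec, "Article"))         # titleField
print(choose_attr(spec, "Section"))         # heading
```

Because DISCARD is just another attribute name, a value such as "if Article DISCARD heading" would return DISCARD when an Article is under construction and heading otherwise.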

Generating multilingual data

One of the distinctive features of CELLAR as an object database system is its focus on multilingual data processing [ST97]. Every string in the database is marked as to what language it is in and what system of data encoding is used for it (i.e. what character set is used and how the writing system is mapped onto those characters). The TEILITE DTD makes use of a lang attribute throughout to identify the language of the data. For instance, <foreign lang="LAT">fiat lux</foreign> identifies the embedded phrase as being Latin. The mapping for this element is expressed in the mapping DTD as follows:

<!ATTLIST foreign
     cellar      NAME  #FIXED ignore
     cellarNames CDATA #FIXED "encoding lang" >

Note that the element markup is ignored since we don't want to generate a special object for a foreign phrase; rather, we just want the embedded string to have its encoding set appropriately. To accomplish this we use the "attribute renamer" to make the value of the architectural encoding attribute be the value of the client lang attribute.


Document date: 12-Nov-1997