A guide to common mapping problems and their solutions
From:
Importing SGML data into CELLAR by means of architectural forms
Gary F. Simons
Summer Institute of Linguistics
Last revised: 13 December 1997
Find the problem you are trying to solve in this list:
- An SGML element corresponds to a CELLAR object or attribute
- An SGML element corresponds to nothing in CELLAR
- An SGML attribute corresponds to a CELLAR attribute
- An SGML attribute corresponds to nothing in CELLAR
- There is no SGML element or attribute for a CELLAR object or attribute
- N.B. This section is still under construction
- Wrapping a single element in an implied CELLAR object
- Wrapping a sequence of elements in an implied CELLAR object
- Placing a single element in an implied CELLAR attribute
- Placing a sequence of elements in an implied CELLAR attribute
- Generating an implied CELLAR attribute
- Other issues
1. An SGML element corresponds to a CELLAR object or attribute
Mapping an element to a CELLAR object
In the GCAPAPER DTD, the document element
(<GCAPAPER>) corresponds to an object of class Article
in the target object model. This mapping is expressed in the mapping
DTD as follows:
<!ATTLIST GCAPAPER
cellar NAME #FIXED object
class CDATA #FIXED Article >
Mapping an element to a CELLAR attribute
In the GCAPAPER DTD, the <abstract> element corresponds to the abstract attribute of Article in the target object model. This mapping is expressed in the mapping DTD as follows:
<!ATTLIST ABSTRACT
cellar NAME #FIXED attr
contentAttr CDATA #FIXED abstract >
Since an Article is the object currently being built when
<abstract> is encountered, this has the effect of putting
the content of this element in the attribute of Article named
abstract. See "Mapping the character data content
to basic objects" if the element has PCDATA in its content model.
Mapping one element to both an object and an attribute
In the TEILITE DTD, the <docTitle> element embeds
further elements which specify parts of the title. This corresponds
to a TitleGroup in the target object model. The TitleGroup is placed
in the title attribute of the parent object, which in this case is
an Article. This mapping is expressed in the mapping DTD as follows:
<!ATTLIST docTitle
cellar NAME #FIXED object
class CDATA #FIXED TitleGroup
parentAttr CDATA #FIXED title >
Mapping one element to two objects
In the GCAPAPER DTD, the footnote element (<ftnote>) embeds
paragraphs directly. In the target object model, a FootnoteMarker embeds
a Note object in its note attribute. This Note in turn embeds paragraphs
in its paragraphs attribute. This mapping would be expressed in the
mapping DTD as follows:
<!ATTLIST FTNOTE
cellar NAME #FIXED double
class CDATA #FIXED FootnoteMarker
contentAttr CDATA #FIXED note
class2 CDATA #FIXED Note
contentAttr2 CDATA #FIXED paragraphs >
Note that a case like this might also use the parentAttr attribute to specify the attribute of the parent object into which the first object should go. In this case, contentAttr has been declared on the parent object in order to specify the target attribute for the top-level object created here; see "Placing a sequence of elements in an implied CELLAR attribute."
See also "Wrapping a sequence of elements in an implied CELLAR object."
Mapping the character data content to basic objects
Whenever an element can have PCDATA content, its mapping rule needs to declare
a value for contentAttr (or contentAttr2 in the case of
<double>) in order to specify the target attribute for the
PCDATA content. It may also optionally declare a value for
pcdataClass to specify what class of basic object should be created.
If no value is declared, then the architecture supplies String by default.
For instance, in the GCAPAPER DTD, <verbatim> is used
for text (like a source code sample) in which the line breaks are to be
preserved. In the target object model this corresponds to an object
of class LiteralText. The verbatim text itself goes into the
contents attribute and must be a basic object of class Text rather
than String, since the former permits line breaks while the latter converts
them to space. This mapping would be expressed in the mapping DTD as follows:
<!ATTLIST VERBATIM
cellar NAME #FIXED object
class CDATA #FIXED LiteralText
contentAttr CDATA #FIXED contents
pcdataClass CDATA #FIXED Text >
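The difference between the two basic classes can be pictured with a minimal Python sketch. It is only an illustration of the behavior described above, not the CELLAR implementation; the function name make_basic_object and its parameters are hypothetical.

```python
import re


def make_basic_object(pcdata, pcdata_class="String"):
    """Return the character data as the named basic class would hold it.

    Assumption: String turns each line break into a single space and
    Text keeps the data unchanged, as the paragraph above describes.
    """
    if pcdata_class == "Text":
        return pcdata
    if pcdata_class == "String":
        return re.sub(r"\r\n|\r|\n", " ", pcdata)
    raise ValueError("unknown basic class: " + pcdata_class)


sample = "line one\nline two"
print(repr(make_basic_object(sample)))          # String is the default
print(repr(make_basic_object(sample, "Text")))  # line break preserved
```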
2. An SGML element corresponds to nothing in CELLAR
Ignoring extraneous element markup
Both the GCAPAPER DTD and the TEILITE DTD have a
<front> element to encode the front matter for the document
itself. The target object model does not have a front matter object; rather,
it treats front matter elements (like title, authors, abstract, and so on)
as attributes of the document. In this case a mapping from the SGML model
to the object model would need to ignore the <front> and
</front> tags and just process the element content. This
mapping would be expressed in the mapping DTD as follows:
<!ATTLIST FRONT
cellar NAME #FIXED ignore >
Using literal text in place of ignored markup
When a group of elements with PCDATA content all have their element markup ignored and are all mapped to the same attribute, the effect is that the character strings simply get appended. A space is automatically inserted to separate concatenated strings, but often a more conventional treatment would be to use literal text delimiters. For instance, in the GCAPAPER DTD each author is encoded in a separate <author> element; in the target object model, there is only an authorField attribute in which all the authors are listed in a single string. The textBetween architectural attribute is used to specify a literal text string that is to be inserted between strings that go into the same attribute. (Actually, it is inserted before the current string if the attribute already has a value.) In this case we want to use a comma followed by a space as the delimiter. Note that we must quote the literal string with two sets of quotes; the outer set is stripped off by the SGML parser and the inner set is passed on as part of the value to the CELLAR parser which in turn strips them off. That is,
<!ATTLIST AUTHOR
cellar NAME #FIXED attr
contentAttr CDATA #FIXED authorField
textBetween CDATA #FIXED "', '" >
As another example of literal text delimiters, consider quotations. The
GCAPAPER DTD has a <quotation> element, but the target
object model has nothing that corresponds. Instead, quotations are simply
encoded by using quotation marks. This can be achieved in the mapping
by ignoring the element markup, and then using the textBefore and
textAfter architectural attributes to attach quotation marks at the
beginning and end of the PCDATA content. That is,
<!ATTLIST QUOTATION
cellar NAME #FIXED ignore
textBefore CDATA #FIXED "'&#34;'"
textAfter CDATA #FIXED "'&#34;'" >
Note that the literal strings need two sets of quotes (as explained above)
and so we must use a character reference (&#34;) to access the double
quote character within the value of the string. The SGML parser will
pass the string to the CELLAR parser as '"'.
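The two levels of quoting and the "insert only if the attribute already has a value" rule for textBetween can be sketched in Python as follows. This is an illustration under stated assumptions, not the CELLAR parser's code: strip_quotes and append_text are hypothetical helpers, and the author names are made-up sample data.

```python
def strip_quotes(value):
    """Remove one pair of matching outer quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


def append_text(current, new_text, text_between=" "):
    """Append new_text to an attribute's string value.

    The delimiter goes before new_text only when the attribute already
    has a value. The default mirrors the automatically inserted space.
    """
    if current:
        return current + text_between + new_text
    return new_text


# The SGML parser has already removed the outer double quotes, so the
# value "', '" reaches this stage still wrapped in single quotes.
delimiter = strip_quotes("', '")

author_field = ""
for name in ["A. Author", "B. Writer"]:  # hypothetical sample data
    author_field = append_text(author_field, name, delimiter)
print(author_field)
```

The same strip_quotes step would apply to the textBefore and textAfter values before they are attached to the PCDATA content.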
Discarding unusable elements
In some cases when an element corresponds to nothing in the object model,
it is also the case that the subelements in its content do not correspond
either. In a case like this, we cannot just ignore the element markup and
pass through to the content; rather, the element (with all its content) must
be discarded. The <teiHeader> element of the TEILITE DTD
provides an example of this. It is a complex element that can contain dozens
of embedded elements, but the target object model has no equivalent. The
parser that loads the architectural document into the CELLAR database recognizes
the keyword DISCARD in place of an attribute name as an instruction to discard
all of the embedded content. The mapping to discard the
<teiHeader> would thus be expressed as follows in the
mapping DTD:
<!ATTLIST teiHeader
cellar NAME #FIXED attr
contentAttr CDATA #FIXED DISCARD >
Note that it is not necessary to define mappings for the elements in the
embedded content, since a client document element that has no mapping to
an architectural element is not retained in the architectural document (though
its content is). In other words, this is the effect of the
<ignore> architectural form. It could thus be argued that
the latter is not needed; it is included, however, so that the mapping DTD
can document in full how each client element is handled.
Note, too, that since the DISCARD function is implemented as a special attribute name (rather than as an architectural form), it can take part in conditional mappings.
3. An SGML attribute corresponds to a CELLAR attribute
Mapping an SGML attribute to a CELLAR attribute
The case of mapping an SGML attribute onto an object attribute uses a set of three architectural attributes to specify the name of the target attribute, its type, and its value. For instance, in the TEILITE DTD the division tags have an n attribute that stores a label for the division, as in:
<div1 n="3"><!-- Content of Section 3 --></div1>
In the target object model, the corresponding Section object has an attribute
named label for storing this kind of information. The mapping for
the <div1> element and its n attribute would be
expressed in the mapping DTD as follows:
<!ATTLIST div1
cellar NAME #FIXED object
class CDATA #FIXED Section
attrName CDATA #FIXED label
attrType CDATA #FIXED String
cellarNames CDATA #FIXED "attrValue n" >
The declarations for attrName (the name of the target attribute)
and for attrType (the type of object to create for the target attribute
value) are straightforward. The tricky part is cellarNames.
This attribute is declared in the architectural support attributes of the
mapping DTD (see step 2 in the process) to be
the "attribute renamer." Its value says, "Rename the attribute n in
the client document to be attrValue in the architectural document."
Thus when the architecture engine processes the above mapping declaration
over the input <div1 n="3">, it produces:
<object class="Section" attrName="label"
attrType="String" attrValue="3">
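To make the renaming step concrete, here is a minimal Python sketch of what the attribute renamer does with simple name pairs. It is not how the architecture engine is implemented; apply_renamer is a hypothetical function, and the sketch deliberately ignores #MAPTOKEN triples.

```python
def apply_renamer(client_attrs, cellar_names):
    """Rename client attributes according to an attribute-renamer value.

    cellar_names is a whitespace-separated list of name pairs:
    architectural name first, client name second.
    """
    tokens = cellar_names.split()
    if len(tokens) % 2 != 0:
        raise ValueError("renamer value must consist of name pairs")
    renamed = {}
    for arch_name, client_name in zip(tokens[0::2], tokens[1::2]):
        if client_name in client_attrs:
            renamed[arch_name] = client_attrs[client_name]
    return renamed


# The #FIXED architectural attributes from the mapping declaration.
fixed = {"class": "Section", "attrName": "label", "attrType": "String"}

# Client input <div1 n="3"> carries the single attribute n.
result = {**fixed, **apply_renamer({"n": "3"}, "attrValue n")}
print(result)
```

The resulting dictionary holds the same four attribute values as the architectural element shown above.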
Mapping IDREFs to pointers between objects
A special case of mapping an attribute to an attribute is the case of an
IDREF. Rather than creating an object to set as the target attribute value,
the system must find the object with the target ID and set a pointer to it.
An example is a figure reference in the GCAPAPER DTD, for instance,
<figref refloc="fig3">. This is a cross-reference to the
figure which is tagged as <figure id="fig3">. In the target
object model, a figure reference maps onto a CrossReference object and the
link to the cross-referenced item goes in the reference attribute.
This mapping would be expressed in the mapping DTD as follows:
<!ATTLIST FIGREF
cellar NAME #FIXED object
class CDATA #FIXED CrossReference
attrName CDATA #FIXED reference
attrType CDATA #FIXED IDREF
cellarNames CDATA #FIXED "attrValue refloc" >
Note that this is just like the case of mapping an attribute to an attribute
given above with one exception: the attrType
is given as IDREF. The parser which reads the architectural document and
converts it to the corresponding objects in CELLAR recognizes this keyword
as a sign that it must resolve the string given in attrValue
into a pointer to the object that was declared to have that ID. Note that in the
architectural DTD, id is declared to be an
architectural attribute of the <object> architectural
form. Thus the parser also knows to enter any object with an ID into a lookup
table that it can use to resolve the IDREFs.
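The lookup-table idea can be sketched in Python as below. This is an illustration only: the Obj class and resolve_idrefs function are hypothetical, the class name given to the figure is an assumption, and resolving all references after the objects are built is a choice made for this sketch rather than a documented property of the CELLAR parser.

```python
class Obj:
    """A stand-in for an object built from an <object> element."""

    def __init__(self, cls, obj_id=None):
        self.cls = cls
        self.id = obj_id
        self.attrs = {}


def resolve_idrefs(objects, pending):
    """Replace each pending ID string with a pointer to the object.

    pending holds (object, attribute name, target ID) triples recorded
    whenever attrType was IDREF.
    """
    table = {o.id: o for o in objects if o.id is not None}
    for obj, attr_name, target_id in pending:
        if target_id not in table:
            raise KeyError("no object declared with ID " + repr(target_id))
        obj.attrs[attr_name] = table[target_id]


figure = Obj("CaptionedChunk", obj_id="fig3")  # class name assumed
xref = Obj("CrossReference")
resolve_idrefs([figure, xref], [(xref, "reference", "fig3")])
print(xref.attrs["reference"] is figure)
```

Deferring resolution until every object has been entered in the table lets a reference point forward to a figure that appears later in the document.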
Mapping multiple attributes of an SGML element
The architecture defines two sets of architectural attributes (attrName, attrType, attrValue and attrName2, attrType2, attrValue2) for handling two attributes of an SGML element. If your application needs to be able to handle more attributes, it is straightforward to add attrName3, attrType3, attrValue3 and so on. The following things must be done to add them:
- Add the three new attributes to the definition of the <object> architectural form in cellar.dtd.
- In the main recursive function of the CELLAR parser:
  - Add the three new variables to the declaration list (var) at the beginning of the function definition.
  - Copy and paste the six lines that read attrName2, attrType2, attrValue2 from the ESIS file and edit them to change the suffixed digit.
  - In the code that implements the <object> form, copy and paste a call to the ESISattribute method and edit it to use the new suffixed digit.
Mapping SGML attribute values to new values in CELLAR
Clause 3.5.2, "Architectural attribute renamer," of the AFDR standard [ISO97] states that an architecture-to-client attribute name pair may be followed by triples beginning with #MAPTOKEN. The second and third members of each triple give a mapping for attribute values: the second member is the value to substitute in the architectural document when the client document has the third member as its value.
For instance, in the GCAPAPER DTD there is a
<highlight> element with a style attribute that
specifies a rendition property like bold, ital, and so on.
In the target object model, each of these styles corresponds to a different
subclass of the Phrase object. Thus the value of the style attribute
provides the value for the class attribute in the architecture, with
the additional requirement that the individual values for the style
attribute be mapped onto the exact class names they correspond to. For
instance, style="bold" corresponds to
class="BoldPhrase". This mapping would be expressed in the mapping
DTD as follows:
<!ATTLIST HIGHLIGHT
cellar NAME #FIXED object
cellarNames CDATA #FIXED "class style
#MAPTOKEN BoldPhrase bold
#MAPTOKEN ItalicPhrase ital
#MAPTOKEN BoldItalicPhrase bital
#MAPTOKEN UnderlinePhrase under"
contentAttr CDATA #FIXED contents >
Unfortunately, the #MAPTOKEN feature was a late addition to the AFDR standard and is not yet supported by the SP parser. Thus the mapping is coded as follows in the sample mapping DTD:
<!ATTLIST HIGHLIGHT
cellar NAME #FIXED object
class CDATA #FIXED ItalicPhrase
contentAttr CDATA #FIXED contents
-- cellarNames CDATA #FIXED "class style
#MAPTOKEN BoldPhrase bold
#MAPTOKEN ItalicPhrase ital
#MAPTOKEN BoldItalicPhrase bital
#MAPTOKEN UnderlinePhrase under" -->
In other words, every <highlight> element is mapped to
an ItalicPhrase for now. The full rule with #MAPTOKEN is preserved
in a comment. Alternatively, one could use the attribute renamer to rename
style as class and then run a script over the output to change
class="bold" to class="BoldPhrase", and so on.
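The script alternative might look like the following Python sketch. It assumes the architectural document is available as tagged text with double-quoted attribute values; if the output is in another form (for instance an ESIS stream), the pattern would need to change. The function name map_class_values is hypothetical, and the table simply restates the #MAPTOKEN pairs from the rule above.

```python
import re

# Client style value -> CELLAR class name, as in the #MAPTOKEN triples.
STYLE_TO_CLASS = {
    "bold": "BoldPhrase",
    "ital": "ItalicPhrase",
    "bital": "BoldItalicPhrase",
    "under": "UnderlinePhrase",
}

# Match only an attribute named exactly "class" (not class2, pcdataClass).
CLASS_ATTR = re.compile(r'\bclass="([^"]*)"')


def map_class_values(text):
    """Rewrite class="bold" as class="BoldPhrase", and so on.

    Values that are not in the table are left unchanged.
    """
    def substitute(match):
        value = match.group(1)
        return 'class="%s"' % STYLE_TO_CLASS.get(value, value)

    return CLASS_ATTR.sub(substitute, text)


sample = '<object class="bold" contentAttr="contents">'
print(map_class_values(sample))
```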
4. An SGML attribute corresponds to nothing in CELLAR
Discarding an SGML attribute
This is accomplished automatically. If an attribute of the client document is not also an attribute in the architectural DTD, and if it is not renamed by invoking the attribute renamer, then the attribute is not passed on to the architectural document.
5. There is no SGML element or attribute for a CELLAR object or attribute
Wrapping a single element in an implied CELLAR object
Wrapping a sequence of elements in an implied CELLAR object
Sometimes an implicit object must embed more than one explicit one; this
is an outstanding problem for the CELLAR architecture as currently defined.
For instance, in the TEILITE DTD, a <list> may contain
<item>s, or it may contain
<label> plus <item> pairs. In the target
object model, each such pair represents the attributes of an implicit
LabeledListItem object. The best solution available, with the CELLAR architecture
as currently formulated, is to generate separate LabeledListItems for both
the <label> and the <item> and then
to clean up the result by hand.
Placing a single element in an implied CELLAR attribute
Placing a sequence of elements in an implied CELLAR attribute
Generating an implied CELLAR attribute
6. Other issues
Making conditional mappings
Sometimes a given element in the client document is used in multiple contexts
and what it corresponds to in the target object model depends on the context.
The <title> element in the GCAPAPER DTD is an example
of this. When it appears in a figure, it corresponds to the caption
attribute of a CaptionedChunk object. When it appears in the front matter,
it corresponds to the titleField attribute of an Article object.
Otherwise, such as in a section, it corresponds to the heading attribute.
This mapping would be expressed in the mapping DTD as follows:
<!ATTLIST TITLE
cellar NAME #FIXED attr
contentAttr CDATA #FIXED "if CaptionedChunk caption
if Article titleField
heading" >
Note that this conditional expression is interpreted by the parser which converts the architectural document into objects in CELLAR, not by the architecture engine. Thus the context must be specified in terms of the class of object that is currently under construction; it cannot be in terms of the elements of the client document since that information is no longer available in the architectural document. Note, too, that the DISCARD keyword can be used as an attribute name in a conditional expression to set up a mapping in which an element should be used in some contexts and discarded in others.
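As a rough picture of how such a conditional value could be evaluated, consider the Python sketch below. The grammar is inferred from the single example above (zero or more "if Class attr" clauses followed by an optional default), and the exact-match test on the class name is an assumption; choose_attr is a hypothetical function, not part of the CELLAR parser.

```python
def choose_attr(expression, current_class):
    """Pick the target attribute for the object under construction.

    Assumed grammar: zero or more 'if <Class> <attr>' clauses followed
    by an optional default attribute name.
    """
    tokens = expression.split()
    i = 0
    while i < len(tokens) and tokens[i] == "if":
        if i + 2 >= len(tokens):
            raise ValueError("incomplete 'if' clause")
        class_name, attr_name = tokens[i + 1], tokens[i + 2]
        if class_name == current_class:
            return attr_name
        i += 3
    # No clause matched: fall back to the default, if one was given.
    return tokens[i] if i < len(tokens) else None


rule = "if CaptionedChunk caption if Article titleField heading"
for cls in ["CaptionedChunk", "Article", "Section"]:
    print(cls, "->", choose_attr(rule, cls))
```

Because DISCARD is just another attribute name at this level, a clause such as "if Note DISCARD" (a made-up example) would need no special handling in the selection step itself.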
Generating multilingual data
One of the distinctives of CELLAR as an object database system is its focus
on multilingual data processing [ST97]. Every
string in the database is marked as to what language it is in and what system
of data encoding is used for it (i.e. what character set is used and how
the writing system is mapped onto those characters). The TEILITE DTD makes
use of a lang attribute throughout to identify the language
of the data. For instance, <foreign lang="LAT">fiat
lux</foreign> identifies the embedded phrase as being Latin.
The mapping for this element is expressed in the mapping DTD as follows:
<!ATTLIST foreign
cellar NAME #FIXED ignore
cellarNames CDATA #FIXED "encoding lang" >
Note that the element markup is ignored since we don't want to generate a
special object for a foreign phrase; rather, we just want the embedded string
to have its encoding set appropriately. To accomplish this we use the "attribute
renamer" to make the value of the architectural encoding attribute
be the value of the client lang attribute.
Document date: 12-Nov-1997
