Tuesday, January 16, 2007

XML schema design for document authoring

By: Marie Bilde Rasmussen, 2006

This posting is a copy of my contribution to the poster session at the ExtremeMarkup conference 2006 in Montreal.
David Birnbaum (University of Pittsburgh) wrote on his conference blog: You want your students and clients to read the intelligent and lucid two pages of Marie Bilde Rasmussen’s “XML Schema Design for Document Authoring,” which provide clear, concise, and practical guidelines to useful, learnable, teachable, and maintainable design. If there were an award for the best overall poster of the conference, this would be the winner.

C. Michael Sperberg-McQueen (W3C) said in his closing remarks at the conference: I was very struck with something that Maria Bilde Rasmussen said in connection with her poster, about author-centered XML. The goal for an editing system for lexicographers is that when a lexicographer looks at the screen he should not say, “Oh, okay: this is an XML document that represents a lexicographic entry.” You want the lexicographer to look at the screen and say: “This is a lexicographic entry.” Period. If absolutely necessary it may be ok if they say, “This a lexicographic entry represented in XML” - but not “This is primarily an XML document.” You want them to see through the XML, to the information.

XML schema design for document authoring
using XML to support the process

Document authoring is sometimes the most time-consuming factor in the production cycle. Yet XML schemas tend to be designed only to describe the final state of the data: they represent a structure that allows later data processing, but they are not further customized to support the authoring process.

The schema should tell the author exactly which elements (attributes, etc.) are relevant to insert or use in a given structural context. Seen from this local point of view, the grammar will appear much simpler, and the author will be able to recognize the information types of the current text type. However, we cannot assume that he is an expert on XML or on markup in general.

Using the schema in this way will probably require a more complex set of rules. But schema complexity, and the resulting extra resources spent on schema design, are a good investment if the time spent on authoring is reduced and the data quality is increased. So, instead of only classifying XML vocabularies as either document oriented or data oriented, it might be a good idea to focus on the authoring process when designing the XML environment for (the authoring of) a given text type.

Taking the authoring process into account affects:

  1. schema design
  2. selection of relevant schema language(s)
  3. selection of an appropriate application

Assuming that the author uses an application that is schema aware (e.g. by making intensive use of the PSVI of a W3C schema), the schema can be designed to facilitate the authoring task. If the application furthermore provides means of data presentation in the editing view, we can actually help authors concentrate on their primary task: text production.

definitions

  • document authoring is simultaneous content production, editing and markup, performed by an author
  • an author is a person producing a piece of text
  • an application is a piece of software used for authoring

schema design goals

  • the vocabulary and the relations defined by the schema must be recognized by the author as a meaningful representation of the text type and the chosen analysis model
  • in any given structural context, the schema must allow the insertion/deletion/alteration of exactly the relevant elements/tree fragments as siblings or children
  • the author should recognize working with the XML environment as more beneficial than working without it. He should not feel reduced to some sort of technical encoder
  • the schema is the author's working tool, and the perfect tool excels by being inconspicuous (the schema language is the designer's tool). This means that if the schema is well designed, the author pays no attention to the XML-ness and is not distracted by it: he can focus on his text

schema design strategies

  • structures should be shallow in order to keep as much of the text on the screen at the same time and to prevent the text from being abruptly fragmented
  • depth should be dynamic and only be used when necessary (structures should not always be as deep as in the worst case)
  • bottom-up markup must be possible, i.e. coherent pieces of text can first be written and marked up afterwards
  • there should be a high degree of context sensitivity, in the sense that all the relevant substructures, and only those, are valid in a given context (in a W3C XSD this may result in a very large variety of global types in a Venetian blind or Garden of Eden approach)
  • element and attribute naming should be meaningful and take the distinction between content, form and function into account
  • mixed content should only be used in coherent "textual" contexts
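
To make the context-sensitivity strategy a little more concrete, here is a minimal sketch in Python of what a schema-aware application might compute when the author asks what can be inserted at the current position. It is not taken from the poster: the dictionary-entry vocabulary, the `CONTENT_MODEL` table and the `allowed_insertions` function are all hypothetical, and the sketch assumes a simple ordered content model in which every child is optional.

```python
# Hypothetical content model for a dictionary entry. All names are invented
# for illustration; a real schema would be written in a schema language.
# Each context maps to an ordered list of (child element, max occurrences),
# where None means "unbounded".
CONTENT_MODEL = {
    "entry": [("headword", 1), ("pronunciation", 1), ("sense", None)],
    "sense": [("definition", 1), ("example", None)],
    "example": [("quote", 1), ("source", 1)],
}


def allowed_insertions(context, existing_children):
    """Return the children an editor could offer at the end of `context`.

    A child is offered only if appending it keeps the children in the
    declared order and within their maximum number of occurrences.
    Assumes every name in `existing_children` is declared for `context`.
    """
    model = CONTENT_MODEL.get(context, [])
    order = [name for name, _ in model]
    # Position (in the declared order) of the latest child already present.
    last = max((order.index(child) for child in existing_children), default=0)
    offered = []
    for position, (name, limit) in enumerate(model):
        if position < last:
            continue  # appending this child would break the declared order
        if limit is not None and existing_children.count(name) >= limit:
            continue  # this child is already present as often as allowed
        offered.append(name)
    return offered


if __name__ == "__main__":
    print(allowed_insertions("entry", ["headword"]))
    print(allowed_insertions("sense", ["definition"]))
```

The point of the sketch is the local view: inside a sense the author is only ever offered the one or two elements that make sense there, even though the vocabulary as a whole is larger.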

Saturday, January 6, 2007

document-oriented XML vs. data-oriented XML: an awkward distinction

Whether you use the term document-oriented XML or narrative XML documents, and define this phenomenon as the opposite of something called data-oriented XML or record-oriented XML, it is assumed that the markup’s structure (or at least the very nature of the data structure) emerges from the text type in its own right, i.e. regardless of its interaction with the surrounding world.

It is also assumed that very complex and less ordered data are more likely to have been produced manually, whereas very simple, repetitive and maybe quite linear data are considered the typical expression of computer-produced XML. Rigidly maintaining this distinction and these assumptions makes life difficult for some XML content architects.

The variety of text types is far more complicated than this. Lexicographers author texts whose structures are VERY complex, yet very well ordered and carefully restricted. The structure of other texts may be very simple and repetitive, even though it is very difficult to analyse the content correctly in order to mark it up, and therefore the markup process is performed by human authors. Other texts again may be quite poorly structured, e.g. with plenty of mixed content, and are nevertheless marked up by computers.

In my view there are a number of parameters that must be taken into account if one wishes to arrive at the most adequate design of a data structure for some 'text':

  1. The text type itself. Are we dealing with
    • a medieval poem
    • a result of a database query
    • a scientific report
    • an entry in a phonebook/a dictionary/an encyclopedia
    • technical documentation
    • ...

  2. The content production process. Is the content
    • output from a computer process?
    • manually keyed in?
    • already there?
    • added here and there to an existing text, e.g. as part of a revision process?

  3. The markup process. Is the markup process
    • carried out manually?
    • handled by a computer process?
    • simultaneous with the writing of the text - perhaps even integrated with it?
    • applied to a preexisting text?
    • consisting of some structural rearrangement/transformation/extension of an already existing structure?

  4. The intended use of the marked-up text. Shall the text
    • only be published? (in print or electronically)
    • be read in a structured format?
    • be searchable by categories?
    • be subject to research and comparison with other similar texts?
    • be exchanged and read by automated processes?


Possibly we will add to this list of perspectives in order to decide which design strategy to follow in a given project involving XML authoring. But already at this point it should be clear that the nature of the data is not the only thing to consider during a schema design process. Authoring processes, the text type and the very purpose of marking up the text must not be ignored. In this light, the distinction between document-oriented and data-oriented XML appears at best to be insufficient.