Administrivia
Professor Erik Wilde
Office Hours: Tuesday 3:30pm-4:30pm Thursday 3:30pm-4:30pm 314 South Hall
TA Katrina Rhoads Lindholm
Email: krhoads@sims.berkeley.edu
Office Hours: Monday 12:30pm-2:00pm 210 South Hall
Course Description
Three hours of lecture, one hour of laboratory per week. The Extensible Markup Language (XML), with its ability to define formal structural and semantic definitions for metadata and information models, is the key enabling technology for information services and document-centric business models that use the Internet and its family of protocols. This course introduces XML syntax, styles and transformations, and schema languages. It balances conceptual topics with practical skills for designing and implementing conceptual models as XML schemas.
Course Information
Course Dates: August 29 to October 19, 2006
Lecture Schedule: Tuesday Thursday 2:00pm-3:30pm in 110 South Hall
Units: 2
Grading Option: Pass/Not Pass only
Course Text
Required
Learning XML, 2nd Edition, Erik T. Ray. O'Reilly, September 2003. ISBN: 0-596-00420-6
Course Work
August 29 : Tuesday
The Extensible Markup Language (XML) was introduced in 1998 to enable content providers to publish their content on the Web in an application-specific format. HTML was considered to convey too little semantics, since its only purpose was (and is) the preparation of content for Web-based publishing. XML was the first step towards machine-readable data formats for the Web, a trend that has since been taken further with the idea of the Semantic Web. XML appeared when the Web was in the steepest part of its success curve, and it has since become the globally accepted format for the exchange of machine-readable structured data.
Resources
August 31 : Thursday
The Extensible Markup Language (XML) defines a simple way of structuring data. The power and popularity of XML can be explained by its versatility, its platform independence, the standards and technologies leveraging it, and the number of tools and products supporting it. Understanding XML itself is rather simple because it depends only on a very small set of other technologies, of which Unicode and URIs are the most important. The XML specification defines two different things: a format for structured data, called XML documents, and a constraint language for XML documents, called Document Type Definition (DTD).
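As a small illustration of what "XML document" means in practice, the following sketch uses Python's standard xml.etree.ElementTree parser to read a well-formed document and to reject one whose tags do not nest properly. The element names and the helper is_well_formed are made up for this example; the parser only checks well-formedness and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents (not taken from the course material).
WELL_FORMED = "<course units='2'><title>XML Foundations</title></course>"
NOT_WELL_FORMED = "<course><title>XML Foundations</course>"  # <title> is never closed


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(WELL_FORMED)
print(root.tag, root.get("units"), root.find("title").text)
print(is_well_formed(NOT_WELL_FORMED))
```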
Required Readings
Chapters 1.3 (pp. 16-28) & 2.1-2.4 (pp. 49-66)
Resources
September 5 : Tuesday
Document Type Definition (DTD)
The XML specification defines a format for structured data (XML documents) and a grammar-based constraint language for these documents (DTD). In SGML-based systems, DTDs were often very complex and feature-rich constructs, which controlled much of the processing of SGML documents. XML greatly simplified DTDs, and de facto usage today has simplified them even further. In many systems today, DTDs are either not used at all or are generated from sample documents. This lecture argues that DTDs (or schemas, to be more general) should be taken seriously in any non-trivial XML application, because they are a representation of the underlying (and often underspecified) data model of the application.
Required Readings
Chapter 4-4.2 (pp. 108-132)
Resources
September 7 : Thursday
The Good, the Bad, and the Ugly
While XML is rather easy to understand and use, it is also rather easy to use XML in ways which either produce ugly XML, or which may lead to problems in components further processing the XML. The topic of this lecture thus is to look at design guidelines for XML schemas, leading to good XML. Some of the simpler topics cover basic questions of how to map a data model to XML markup (e.g., when to use elements or attributes). The next question is how data should be represented in XML so that applications can process it efficiently. We also look at what part of the markup an application will actually have access to, which is defined by the XML Information Set (Infoset), the specification underlying many XML technologies.
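To make the Infoset point concrete, here is a minimal sketch, assuming Python's standard xml.etree.ElementTree as the parser. The two hypothetical documents differ in attribute order and quote characters, which are details of the markup that an application does not get to see, so both yield the same parsed structure. The helper summary is an assumption of this example, not part of any specification.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents: same information, different surface markup.
a = ET.fromstring('<book id="b1" lang="en"><title>Learning XML</title></book>')
b = ET.fromstring("<book lang='en' id='b1'><title>Learning XML</title></book>")


def summary(elem):
    """Reduce an element to the information the application can access."""
    return (
        elem.tag,
        dict(elem.attrib),            # attributes are unordered
        (elem.text or "").strip(),    # character data directly inside the element
        [summary(child) for child in elem],
    )


print(summary(a) == summary(b))
```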
Required Readings
Resources
September 12 : Tuesday
Cascading Stylesheets (CSS) were designed as a language for better separating presentation-specific issues from the structuring of documents as provided by HTML. However, CSS can be applied to XML as well, either directly (by applying a CSS stylesheet to an XML document), or as a supplement to basic HTML layout structures generated from an XML document. CSS uses a simple model of selectors and declarations. Selectors specify to which elements of a document a set of declarations (each being a value assigned to a property) applies; in addition there is a model of how property values are inherited and cascaded. The biggest limitation of CSS is that it cannot change the structure of the displayed document.
Required Readings
Chapter 5 (pp. 164-204)
Resources
September 14 : Thursday
XML is successful because it can be used in many different scenarios, and because it is easy to define a schema (such as a DTD) for a new scenario, producing an XML data model tailored to it. This means that names in XML documents must be interpreted as belonging to a certain schema. As long as a document uses names from only one schema, this can be done rather easily. However, in many scenarios today documents combine names from different schemas, and XML Namespaces provide a mechanism by which the names in an XML document can be associated with a namespace.
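The following sketch shows the idea with Python's standard xml.etree.ElementTree, which reports each name together with its namespace URI in the form {uri}local. The document, the two example.com namespace URIs, and the prefixes are all hypothetical; the point is only that two elements named title remain distinguishable once each is associated with a namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing names from two vocabularies (assumed URIs).
DOC = """<order xmlns="http://example.com/ns/order"
       xmlns:addr="http://example.com/ns/address">
  <title>Book order</title>
  <addr:title>Dr.</addr:title>
</order>"""

root = ET.fromstring(DOC)

# ElementTree exposes namespaced names as "{namespace-uri}local-name".
names = [child.tag for child in root]
print(names)

# Prefixes used for lookup are local to this code; only the URIs matter.
ns = {"o": "http://example.com/ns/order", "a": "http://example.com/ns/address"}
order_title = root.find("o:title", ns).text
addr_title = root.find("a:title", ns).text
print(order_title, "/", addr_title)
```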
Required Readings
Resources
Due on September 19
September 19 : Tuesday
XML structures data into a rather small number of different constructs, most notably elements and attributes. The XML Path Language (XPath) defines a way to select parts of XML documents, so that they can be used for further processing. XPath's primary use is in XSL Transformations (XSLT), but other XML technologies use it as well, e.g. XML Schema. XPath is a very compact language with a syntax that resembles the path expressions which are well-known from file systems. These path expressions, however, are generalized and therefore much more powerful than the rather simple path expressions in file systems. Because of its use in different XML technologies, XPath is one of the most important XML core technologies.
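A few path expressions can be tried out with Python's standard xml.etree.ElementTree. Note the assumption: ElementTree implements only a limited subset of XPath (child and descendant steps, attribute predicates, positional predicates), not the full language. The schedule document is a made-up example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for trying out path expressions.
DOC = """<schedule>
  <lecture day="Tuesday"><title>XPath</title></lecture>
  <lecture day="Thursday"><title>XSLT Part I</title></lecture>
  <lecture day="Tuesday"><title>XSLT Part II</title></lecture>
</schedule>"""

root = ET.fromstring(DOC)

# Child steps, similar to a file system path.
all_titles = [t.text for t in root.findall("./lecture/title")]

# Descendant step plus an attribute predicate: this goes beyond file paths.
tuesday = [t.text for t in root.findall(".//lecture[@day='Tuesday']/title")]

# Positional predicate: the second lecture child (positions start at 1).
second = root.find("./lecture[2]/title").text

print(all_titles)
print(tuesday)
print(second)
```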
Required Readings
Resources
September 21 : Thursday
XML Transformations (XSLT) — Part I
Because XML can be used to represent any vocabulary (often defined by some schema), the question is how these different vocabularies can be processed and possibly transformed into something else. This something else may be another XML vocabulary (a common requirement in B2B scenarios), or it may be HTML (a common scenario for Web publishing). Using XSL Transformations (XSLT), mapping tasks can be implemented easily. XSLT leverages XPath's expressive power in a rather simple programming language. For easy tasks, XSLT mappings can be specified without much real programming, by simply specifying how components of the source markup are mapped to components of the target markup.
Required Readings
Resources
September 26 : Tuesday
XML Transformations (XSLT) — Part II
XSLT processes documents by matching nodes in the document tree to templates, which are then executed to process these nodes. This process of matching and executing templates is the core of XSLT's processing model. XSLT has built-in templates which complement the user-supplied templates, so that the XSLT processor always finds a template to execute. Templates can conflict, and it is then necessary to resolve this conflict by finding the best match among all matching templates. This conflict resolution process is also a very important component of the XSLT processing model.
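The matching idea can be imitated in a few lines of Python. This is a toy model and not XSLT: templates are plain functions looked up by element name only, there is no XPath matching and no conflict resolution, and all names (builtin, apply_templates, TEMPLATES) are assumptions of this sketch. What it does show is the fallback behavior: when no user-supplied template matches, a built-in rule copies text and continues with the children, so processing never gets stuck.

```python
import xml.etree.ElementTree as ET


def builtin(elem, templates):
    """Imitates the built-in rules: copy text, apply templates to children."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(apply_templates(child, templates))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


def apply_templates(elem, templates):
    """Pick the user template for this element name, else the built-in rule."""
    template = templates.get(elem.tag, builtin)
    return template(elem, templates)


def title_template(elem, templates):
    return "<h1>" + builtin(elem, templates) + "</h1>"


def para_template(elem, templates):
    return "<p>" + builtin(elem, templates) + "</p>"


# User-supplied "templates"; <doc> and <em> fall back to the built-in rule.
TEMPLATES = {"title": title_template, "para": para_template}

doc = ET.fromstring(
    "<doc><title>XSLT</title><para>Templates <em>match</em> nodes.</para></doc>"
)
html = apply_templates(doc, TEMPLATES)
print(html)
```

Because no template exists for em, its markup disappears from the output while its text is kept, which mirrors what an XSLT processor does when only the built-in templates apply.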
Required Readings
Resources
September 28 : Thursday
XML Transformations (XSLT) — Part III
Advanced XSLT processing includes better control of the input and output documents, which can be finely controlled in terms of how whitespace is treated. Another interesting feature of XSLT is keys, which allow shorthand notations for frequently used access paths to nodes, and provide XSLT processors with more information for performance optimizations. Instructions for creating all possible kinds of nodes in the output tree make it possible to write code which generates element or attribute names based on runtime evaluations.
Required Readings
Resources
October 3 : Tuesday
XML Schema is the most popular schema language for XML today. It was introduced to overcome some of the commonly observed limitations of DTDs, most notably the lack of typing. Simple types describe content which is not structured by XML markup, which means they describe attribute values and text-only element content. New simple types can be defined by deriving them from existing types using type restriction. Complex types describe elements that carry attributes and/or have content other than character data alone. Using XML Schema's type concepts, it is easier to represent model-level information in a schema, because type hierarchies can represent model-level specializations.
Required Readings
Chapters 4.3 & 4.4 (pp. 132-159)
Resources
October 5 : Thursday
XML Schema allows greater flexibility in defining constraints on intra-document references than the ID/IDREF construct of DTDs. XML Schema's Identity Constraints are scoped, typed, and can be used for elements or attributes. The second aspect of XML Schema discussed today is the derivation of complex types. Complex types can be derived by restriction or extension. Complex type restriction defines the derived type to be a more constrained version of the base type. Complex type extension makes it possible to extend the base type by adding attributes or content (the latter only by appending new content to the content model).
Required Readings
Resources
October 10 : Tuesday
While XML is very useful for representing and manipulating structured data, the question remains where these structures come from. They are usually some kind of encoding for a conceptual model, but there is no established and universally accepted way to connect the modeling world with XML markup. Some of the challenges and approaches to XML and modeling will be presented in this lecture. The goal of this lecture is to raise awareness of the current gap between models and markup, and of practical approaches to bridging that gap.
Required Readings
Resources
October 12 : Thursday
October 17 : Tuesday
Alternative Schema Languages — Schematron
While XML Schema is the most popular schema language in use today and for the foreseeable future, it is only one representative from a class of languages which are all designed for the purpose of testing whether some XML document satisfies a set of constraints. This test could of course also be conducted programmatically, but this is not portable and not easily maintainable. Schema languages thus often use a declarative approach to specifying how to conduct validation. A very simple yet very powerful language for this is Schematron, which uses the expressive power of XPath for testing whether a document satisfies a set of conditions. Schematron is rule-based in contrast to the more traditional grammar-based schema languages and complements these very well.
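The rule-based idea can be imitated in Python. This is not Schematron itself: each rule here pairs a context path with a Python function as the test, whereas Schematron expresses both context and test declaratively in XPath. The rules, messages, and the validate helper are assumptions of this sketch, and the paths use only the XPath subset supported by the standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Each rule: (context path, test on each context node, message if the test fails).
RULES = [
    (".//lecture",
     lambda e: e.find("title") is not None,
     "lecture must have a title"),
    (".//lecture",
     lambda e: e.get("day") in {"Tuesday", "Thursday"},
     "lecture day must be Tuesday or Thursday"),
]


def validate(root, rules):
    """Return the messages of all assertions that fail, in rule order."""
    failures = []
    for context, test, message in rules:
        for node in root.findall(context):
            if not test(node):
                failures.append(message)
    return failures


# Hypothetical document: the second lecture violates both rules.
doc = ET.fromstring(
    '<schedule>'
    '<lecture day="Tuesday"><title>XPath</title></lecture>'
    '<lecture day="Friday"/>'
    '</schedule>'
)
failures = validate(doc, RULES)
print(failures)
```

Unlike a grammar, the rules say nothing about the overall structure of the document; they only check the conditions that were written down, which is why rule-based and grammar-based validation complement each other.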
Required Readings
Chapter 4.5 (pp. 159-163)
Resources
October 19 : Thursday
XML is the most popular data format for exchanging data, but the majority of data within applications and closed systems is still stored in Relational Database Management Systems (RDBMS). This leads to two main issues. The first is how to move data between XML formats and RDBMS as easily and efficiently as possible. The second is how to map the data models between these two worlds. Relational data can easily be represented in XML, because tables can be easily represented in trees. Things can be more complicated in the other direction, because arbitrary XML can be hard to store in a relational database. For XML-centric scenarios, XML Database Management Systems (XDBMS) are an interesting alternative, which provide XML-specific query capabilities with XML Query (XQuery).
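The easy direction, from tables to trees, can be sketched with Python's standard sqlite3 and xml.etree.ElementTree modules. The table, its columns, and the row element name are assumptions of this example, and the mapping (one element per row, one child element per column) is only one of several possible designs.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table):
    """Map a table to a tree: one <row> per record, one child per column.

    The table name is interpolated into the SQL text, so it must come
    from trusted code, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        row_elem = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            ET.SubElement(row_elem, name).text = "" if value is None else str(value)
    return root


# Hypothetical in-memory table for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lecture (day TEXT, topic TEXT)")
conn.executemany(
    "INSERT INTO lecture VALUES (?, ?)",
    [("Tuesday", "XPath"), ("Thursday", "XSLT")],
)

xml_text = ET.tostring(table_to_xml(conn, "lecture"), encoding="unicode")
print(xml_text)
```

The reverse direction has no equally mechanical answer: mixed content, document order, and optional or repeated elements in arbitrary XML do not map onto fixed columns without design decisions.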
Required Readings
Resources
October 24 : Tuesday
XML is a very basic technology for representing trees using a standardized markup-based syntax. An increasing number of technologies are building on this foundation, creating an expanding field of XML-based technologies for interoperability in many different fields. Application-specific XML-based data formats are used in many different settings, and the best data format for a given scenario depends on the existing formats in this area and the exact requirements. More interestingly, generic XML technologies which can be applied in many different settings make it easier for developers and system integrators to achieve their goal of making systems interoperate.
Required Readings
Resources
last updated on 2006-07-18 by dret

