IS 242 : XML Foundations

Administrivia

Teaching Team 

Professor Erik Wilde

Email: dret@sims.berkeley.edu

Website: http://dret.net/netdret/

Office number: +1-510-6432253

Office Hours: Tuesday 3:30pm-4:30pm and Thursday 3:30pm-4:30pm, 314 South Hall

TA Katrina Rhoads Lindholm

Office Hours: Monday 12:30pm-2:00pm 210 South Hall

Course Description

Three hours of lecture, one hour of laboratory per week. The Extensible Markup Language (XML), with its ability to express formal structural and semantic definitions for metadata and information models, is the key enabling technology for information services and document-centric business models that use the Internet and its family of protocols. This course introduces XML syntax, styles and transformations, and schema languages. It balances conceptual topics with practical skills for designing and implementing conceptual models as XML schemas.

Course Information

SIMS INFOSYS 242

Course Dates: August 29 to October 19, 2006

Lecture Schedule: Tuesday and Thursday 2:00pm-3:30pm in 110 South Hall

Units: 2

Grading Option: Pass/Not Pass only

Course Text

Required

Learning XML, 2nd Edition, Erik T. Ray. O'Reilly, September 2003. ISBN: 0-596-00420-6

Course Work

August 29 : Tuesday

Overview and Introduction 

The Extensible Markup Language (XML) was introduced in 1998 to enable content providers to publish their content on the Web in an application-specific format. HTML was considered to convey too little semantics, since its only purpose was (and is) the preparation of content for Web-based publishing. XML was the first step towards machine-readable data formats for the Web, a trend that has since been taken to higher levels with the idea of the Semantic Web. XML appeared when the Web was in the steepest part of its success curve, and has since become the globally accepted format for the exchange of machine-readable structured data.

Resources

August 31 : Thursday

XML Basics 

The Extensible Markup Language (XML) defines a simple way of structuring data. The power and popularity of XML can be explained by its versatility, its platform independence, the standards and technologies that build on it, and the number of tools and products supporting it. Understanding XML itself is rather simple, because it depends on only a very small set of other technologies, of which Unicode and URIs are the most important. XML itself specifies two different things: a format for structured data, called XML documents, and a constraint language for XML documents, called Document Type Definition (DTD).
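As a small illustration of what "XML document" means in practice, the following sketch uses Python's standard xml.etree.ElementTree parser to check well-formedness and read an attribute and an element. The sample markup and the helper name is_well_formed are assumptions made up for this example; note that this parser does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are invented for illustration.
WELL_FORMED = '<course code="IS 242"><title>XML Foundations</title><units>2</units></course>'
# The <title> element is never closed, so this text is not an XML document.
NOT_WELL_FORMED = '<course><title>XML Foundations</course>'

def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

root = ET.fromstring(WELL_FORMED)
course_code = root.get("code")          # attribute value
course_title = root.findtext("title")   # character data of a child element
```

A parser either accepts a document as well-formed or rejects it outright; there is no partial result for the second string.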

Required Readings

Required

Chapters 1.3 (pp. 16-28) & 2.1-2.4 (pp. 49-66)

Resources

September 5 : Tuesday

Document Type Definition (DTD) 

The XML specification defines a format for structured data (XML documents) and a grammar-based constraint language for these (DTD). In SGML-based systems, DTDs were often very complex and feature-rich constructs, which controlled a lot of the processing of SGML documents. XML greatly simplified DTDs, and de facto usage of DTDs today has simplified them even further. In many systems today, DTDs are not used at all or are generated from sample documents. This lecture argues that DTDs (or schemas, to be more general) should be taken seriously in any non-trivial XML application, because they are a representation of the underlying (and often underspecified) data model of the application.

Required Readings

Required

Chapters 4-4.2 (pp. 108-132)

Resources

Assignment 1: Getting Started with XML and XML Editors assigned 

Due on September 12

Assignment details

September 7 : Thursday

The Good, the Bad, and the Ugly 

While XML is rather easy to understand and use, it is also rather easy to use XML in ways that either produce ugly XML or lead to problems in components that process the XML further. The topic of this lecture is therefore design guidelines for XML schemas that lead to good XML. Some of the simpler topics cover basic questions of how to map a data model to XML markup (e.g., when to use elements or attributes). The next question is how data should be represented in XML so that applications can process it efficiently. We also look at what part of the markup an application will actually have access to; this is defined by the XML Information Set (Infoset), the specification underlying many XML technologies.
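The point about what an application can actually see may be easier to grasp with a sketch. The two hypothetical documents below differ in attribute order and quote characters, neither of which is part of the Infoset; the helper summarize is an assumption of this example and reduces a parsed element to the information an application typically works with.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only in attribute order and quote style.
DOC_A = '<person id="p1" role="student"><name>Ada</name></person>'
DOC_B = "<person role='student' id='p1'><name>Ada</name></person>"

def summarize(element):
    """Reduce an element to name, attributes, text, and child summaries."""
    return (
        element.tag,
        dict(element.attrib),            # attributes form an unordered set
        (element.text or "").strip(),
        [summarize(child) for child in element],
    )

same_information = summarize(ET.fromstring(DOC_A)) == summarize(ET.fromstring(DOC_B))
```

A design that relies on attribute order or on the choice of quote character therefore encodes information that downstream components cannot be expected to receive.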

Required Readings

Resources

Assignment 2: Résumé XML and DTD assigned 

Due on September 14

Assignment details

September 12 : Tuesday

Cascading Style Sheets (CSS) 

Cascading Style Sheets (CSS) were designed as a language for better separating presentation-specific issues from the structuring of documents as provided by HTML. However, CSS can be applied to XML as well, either directly (by applying a CSS stylesheet to an XML document), or as a supplement to basic HTML layout structures generated from an XML document. CSS uses a simple model of selectors and declarations. Selectors specify to which elements of a document a set of declarations (each being a value assigned to a property) applies; in addition there is a model of how property values are inherited and cascaded. The biggest limitation of CSS is that it cannot change the structure of the displayed document.

Required Readings

Required

Chapter 5 (pp. 164-204)

Resources

Assignment 1: Getting Started with XML and XML Editors due 

September 14 : Thursday

XML Namespaces 

XML is successful because it can be used in many different scenarios, and because it is easy to define a schema (such as a DTD) for a new scenario, producing a tailored XML data model for that scenario. This means that names in XML documents must be interpreted as belonging to a certain schema. As long as a document uses names from only one schema, this can be done rather easily. However, in many scenarios today documents combine names from different schemas, and XML Namespaces provide a mechanism by which the names in an XML document can be associated with a namespace.
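A minimal sketch of the name-clash problem, assuming two made-up namespace URIs: both vocabularies define an element called title, and only the namespace tells them apart. Python's xml.etree.ElementTree exposes each name as an expanded name of the form {namespace-URI}local-name; the prefixes used in the query need not match those in the document.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are hypothetical identifiers chosen for this sketch.
DOC = """
<r:resume xmlns:r="http://example.org/ns/resume"
          xmlns:a="http://example.org/ns/address">
  <r:title>Software Engineer</r:title>
  <a:title>Dr.</a:title>
</r:resume>
"""

root = ET.fromstring(DOC.strip())

# The parser replaces prefixes with the namespace URI they are bound to.
expanded_names = [child.tag for child in root]

# Query-side prefixes are local to this mapping and independent of the document.
ns = {"res": "http://example.org/ns/resume", "addr": "http://example.org/ns/address"}
job_title = root.findtext("res:title", namespaces=ns)
honorific = root.findtext("addr:title", namespaces=ns)
```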

Required Readings

Resources

Assignment 2: Résumé XML and DTD due 

Assignment 3: CSS assigned 

Due on September 19

Assignment details

September 19 : Tuesday

XML Path Language (XPath) 

XML structures data into a rather small number of different constructs, most notably elements and attributes. The XML Path Language (XPath) defines a way to select parts of XML documents, so that they can be used for further processing. XPath's primary use is in XSL Transformations (XSLT), but other XML technologies use it as well, e.g. XML Schema. XPath is a very compact language with a syntax that resembles the path expressions which are well-known from file systems. These path expressions, however, are generalized and therefore much more powerful than the rather simple path expressions in file systems. Because of its use in different XML technologies, XPath is one of the most important XML core technologies.
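To give a feel for path expressions, the sketch below runs a few of them against a made-up schedule document. It relies on Python's xml.etree.ElementTree, which supports only a limited subset of XPath (child and descendant steps plus simple predicates), so it should be read as an approximation of the language rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are invented for this example.
DOC = """
<schedule>
  <lecture date="2006-09-19" topic="XPath"/>
  <lecture date="2006-09-21" topic="XSLT"><assignment n="4"/></lecture>
  <lecture date="2006-09-26" topic="XSLT"><assignment n="5"/></lecture>
</schedule>
"""
root = ET.fromstring(DOC.strip())

# Child step: every <lecture> directly below the root.
all_topics = [l.get("topic") for l in root.findall("./lecture")]
# Attribute predicate: only lectures whose topic attribute equals 'XSLT'.
xslt_dates = [l.get("date") for l in root.findall("./lecture[@topic='XSLT']")]
# Child-existence predicate: lectures that contain an <assignment>.
with_assignment = [l.get("date") for l in root.findall("./lecture[assignment]")]
# Descendant step: every <assignment> at any depth.
assignment_numbers = [a.get("n") for a in root.findall(".//assignment")]
```

The predicates in square brackets are what distinguishes these expressions from file system paths, which can only name a location.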

Required Readings

Resources

Assignment 3: CSS due 

September 21 : Thursday

XML Transformations (XSLT) — Part I 

Because XML can be used to represent any vocabulary (often defined by some schema), the question is how these different vocabularies can be processed and possibly transformed into something else. This something else may be another XML vocabulary (a common requirement in B2B scenarios), or it may be HTML (a common scenario for Web publishing). Using XSL Transformations (XSLT), mapping tasks can be implemented easily. XSLT leverages XPath's expressive power in a rather simple programming language. For easy tasks, an XSLT mapping can be specified without much real programming, by simply specifying how components of the source markup are mapped to components of the target markup.

Required Readings

Required

Resources

Assignment 4: XPath and Namespaces assigned 

Due on September 26

Assignment details

September 26 : Tuesday

XML Transformations (XSLT) — Part II 

XSLT processes documents by matching nodes in the document tree to templates, which are then executed to process these nodes. This process of matching and executing templates is the core of XSLT's processing model. XSLT has built-in templates which complement the user-supplied templates, so that the XSLT processor always finds a template to execute. Templates can conflict, and it is then necessary to resolve this conflict by finding the best match among all matching templates. This conflict resolution process is also a very important component of the XSLT processing model.
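The processing model can be imitated in a few lines of Python. The following is a toy model of the idea only, not an XSLT processor: templates are hypothetical (priority, predicate, handler) triples, the highest priority among the matching templates wins, and a built-in rule emits text and recurses into children when nothing matches. The priorities 0 and 0.5 mirror the XSLT default priorities for a plain name pattern and for a pattern with a predicate.

```python
import xml.etree.ElementTree as ET

def apply_templates(node, templates):
    """Run the best matching template for node, or fall back to the built-in rule."""
    matching = [t for t in templates if t[1](node)]
    if matching:
        # Conflict resolution: the matching template with the highest priority wins.
        _, _, handler = max(matching, key=lambda t: t[0])
        return handler(node, templates)
    return builtin(node, templates)

def builtin(node, templates):
    """Built-in rule: copy text and keep applying templates to child elements."""
    parts = [node.text or ""]
    for child in node:
        parts.append(apply_templates(child, templates))
        parts.append(child.tail or "")
    return "".join(parts)

templates = [
    # General rule: any <title> becomes a heading line.
    (0.0, lambda n: n.tag == "title", lambda n, t: "# " + builtin(n, t)),
    # More specific rule: titles marked as drafts; it also matches, and outranks, the rule above.
    (0.5, lambda n: n.tag == "title" and n.get("status") == "draft",
     lambda n, t: "# [draft] " + builtin(n, t)),
]

doc = ET.fromstring('<doc><title status="draft">Notes</title><p>Hello <em>world</em>!</p></doc>')
result = apply_templates(doc, templates)
```

No template matches doc, p, or em, so the built-in rule handles them and their text still reaches the output; only the title goes through a user-supplied template.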

Required Readings

Required

Resources

Assignment 4: XPath and Namespaces due 

Assignment 5: XML to HTML Transformation assigned 

Due on October 3

Assignment details

September 28 : Thursday

XML Transformations (XSLT) — Part III 

Advanced XSLT processing includes better control of the input and output documents, which can be finely controlled in terms of how whitespace is treated. Another interesting feature of XSLT is keys, which allow shorthand notations for frequently used access paths to nodes, and provide XSLT processors with more information for performance optimizations. Instructions for creating all possible kinds of nodes in the output tree make it possible to write code which generates element or attribute names based on runtime evaluations.

Required Readings

Resources

October 3 : Tuesday

XML Schema — Part I 

XML Schema is the most popular schema language for XML today. It was introduced to overcome some of the commonly observed limitations of DTDs, most notably the lack of typing. Simple types describe content which is not structured by XML markup, which means they describe attribute values and text-only element content. Simple types can be defined by deriving new types from existing types through type restriction. Complex types describe elements that carry attributes and/or have content other than character data alone. Using XML Schema's type concepts, it is easier to represent model-level information in a schema, because type hierarchies can represent model-level specializations.

Required Readings

Required

Chapters 4.3 & 4.4 (pp. 132-159)

Resources

Assignment 5: XML to HTML Transformation due 

October 5 : Thursday

XML Schema — Part II 

XML Schema allows greater flexibility in defining constraints on intra-document references than the ID/IDREF construct of DTDs. XML Schema's Identity Constraints are scoped, typed, and can be used for elements or attributes. The second aspect of XML Schema discussed today is the derivation of complex types. Complex types can be derived by restriction or extension. Complex type restriction defines the derived type to be a more constrained version of the base type. Complex type extension makes it possible to extend the base type by adding either attributes or content (only by appending new content to the content model).

Required Readings

Resources

Assignment 6: DTD to Schema assigned 

Due on October 12

Assignment details

October 10 : Tuesday

From Model to Markup 

While XML is very useful for representing and manipulating structured data, the question remains where these structures come from. They are usually some kind of encoding of a conceptual model, but there is no established and universally accepted way to connect the modeling world with XML markup. Some of the challenges and approaches to XML and modeling will be presented in this lecture. The goal of this lecture is to raise awareness of the current gap between models and markup, and of practical approaches to bridging that gap.

Required Readings

Required

Resources

October 12 : Thursday

Assignment 6: DTD to Schema due 

October 17 : Tuesday

Alternative Schema Languages — Schematron 

While XML Schema is the most popular schema language in use today and for the foreseeable future, it is only one representative from a class of languages which are all designed for the purpose of testing whether some XML document satisfies a set of constraints. This test could of course also be conducted programmatically, but this is not portable and not easily maintainable. Schema languages thus often use a declarative approach to specifying how to conduct validation. A very simple yet very powerful language for this is Schematron, which uses the expressive power of XPath for testing whether a document satisfies a set of conditions. Schematron is rule-based in contrast to the more traditional grammar-based schema languages and complements these very well.
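The rule-based idea can be sketched outside of Schematron itself. Real Schematron schemas are XML documents whose rules and assertions are written in XPath; the Python below only imitates that structure, using hypothetical (context path, test, message) triples and the XPath subset of xml.etree.ElementTree, against an invented sample document.

```python
import xml.etree.ElementTree as ET

# Each rule: (context path, assertion on a context node, message if it fails).
RULES = [
    (".//assignment", lambda a: a.get("due") is not None,
     "assignment needs a due date"),
    (".//assignment", lambda a: a.find("title") is not None,
     "assignment needs a title"),
    (".//lecture", lambda l: len(l.findall("assignment")) <= 1,
     "at most one assignment per lecture"),
]

def validate(root, rules):
    """Return the messages of all assertions that fail, in rule order."""
    failures = []
    for context, test, message in rules:
        for node in root.findall(context):
            if not test(node):
                failures.append(message)
    return failures

DOC = """
<course>
  <lecture><assignment due="2006-09-12"><title>Getting Started</title></assignment></lecture>
  <lecture><assignment><title>CSS</title></assignment></lecture>
</course>
"""
failures = validate(ET.fromstring(DOC.strip()), RULES)
```

Unlike a grammar, each rule is independent of the others, which is why rule-based checks combine well with a grammar-based schema that already fixes the overall structure.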

Required Readings

Required

Chapter 4.5 (pp. 159-163)

Resources

Assignment 7: XML to XML Transformation with CSS on generated HTML assigned 

Due on October 26

Assignment details

October 19 : Thursday

XML and Database Systems 

XML is the most popular data format for exchanging data, but the majority of data within applications and closed systems is still stored in Relational Database Management Systems (RDBMS). This leads to two main issues. The first is how to move data between XML formats and RDBMS as easily and efficiently as possible. The second is how to map the data models between these two worlds. Relational data can easily be represented in XML, because tables can be easily represented as trees. Things can be more complicated in the other direction, because arbitrary XML can be hard to store in a relational database. For XML-centric scenarios, XML Database Management Systems (XDBMS) are an interesting alternative, which provide XML-specific query capabilities with XML Query (XQuery).
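The easy direction, from tables to trees, can be sketched with Python's built-in sqlite3 module. The table, its columns, and the table/row element names are assumptions invented for this example; the mapping shown (one element per row, one child element per column) is just one of several possible conventions.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(connection, table):
    """Serialize every row of a table as a <row> child of a <table> element."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element("table", name=table)
    for values in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            # NULL becomes an empty element in this simple convention.
            ET.SubElement(row, column).text = "" if value is None else str(value)
    return root

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lecture (id INTEGER PRIMARY KEY, topic TEXT)")
conn.executemany("INSERT INTO lecture (id, topic) VALUES (?, ?)",
                 [(1, "XPath"), (2, "XSLT")])
tree = table_to_xml(conn, "lecture")
xml_text = ET.tostring(tree, encoding="unicode")
```

The reverse mapping has no such obvious convention: mixed content, arbitrary nesting, and document order do not correspond directly to rows and columns.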

Required Readings

Resources

October 24 : Tuesday

XML Trends & Developments 

XML is a very basic technology for representing trees using a standardized markup-based syntax. An increasing number of technologies are building on this foundation, creating an expanding field of XML-based technologies for interoperability in many different fields. Application-specific XML-based data formats are used in many different settings, and the best data format for a given scenario depends on the existing formats in this area and the exact requirements. More interestingly, generic XML technologies which can be applied in many different settings make it easier for developers and system integrators to achieve their goal of making systems interoperate.

Required Readings

Required

Resources

October 26 : Thursday

Assignment 7: XML to XML Transformation with CSS on generated HTML due 

last updated on 2006-07-18 by dret