XML Design Principles: How best to represent The World?

logantracyo · Dec 19, 2005

Hi! I'm actually trying to represent the NY State Learning Standards, but rather than trying to explain *those* to you in order to ask how to structure them, it seems more efficient to use the varying geopolitical divisions of Earth as an analogy that is more likely to be familiar to you:

In particular, I would like to avoid the kind of "force-fit" re-categorization that is usual -- for example, here in the US, 48 States are made up of Counties . . . but Louisiana is made up instead of Parishes, and Alaska uses Boroughs. Similarly, most States have Townships as geographical divisions, which are similar to Police Jury Wards in Louisiana, and which the six New England states call Towns (a term that refers to a population center in the other States).

This structure is similar to that of England, from which we borrowed it -- but other countries are divided into Provinces, Regions, etc. (all of which is, again, quite similar to the disparate structure I'm really modelling!)

In an RDBMS, this is usually handled generically based on the divisions familiar to the DB designer, by force-fitting those Parishes and Boroughs into a Counties table, the Towns and Wards into Townships, and so on.

With XML, it's *possible* to treat them non-generically, to have the US States made up of either Counties, Parishes, or Boroughs -- but does it make sense to do so? Here's an example:

Old RDBMS "force-fit" method, in XML:

Code:

<state name="New York">
    <county name="Monroe">
        <township name="Town of Gates" />
        <township name="Town of Webster" />
    </county>
</state>
<state name="Louisiana">
    <county name="Orleans Parish">
        <township name="Police Jury Ward 1" />
        <township name="Police Jury Ward 2" />
    </county>
</state>

New non-generic method:

Code:

<state name="New York">
    <county name="Monroe">
        <town name="Gates" />
        <town name="Webster" />
    </county>
</state>
<state name="Louisiana">
    <parish name="Orleans">
        <police-jury-ward name="1" />
        <police-jury-ward name="2" />
    </parish>
</state>

The second example just seems to capture "reality" so much better than the first one. But am I overlooking real-world consequences to doing it this way?

And what about DTDs or schemas? Obviously, this makes for more complexity there -- would providing namespaces for the different structures be a good way to deal with that?

Naturally, I'd appreciate any help you can provide -- even (or particularly?) just telling me "what this approach is called", suggesting some terminology I can use to search effectively; I've tried "variable schema" and "variable data structure" to no avail.

Thanks!

k5tm · Dec 19, 2005

logantracyo,

Well, this is all very interesting, but just what are you planning to use this document structure for? Transforming via XSLT? XQuery? [ponder]

Tom Morrison

http://www.liant.com

logantracyo · Dec 19, 2005

I suppose "Yes" is the most accurate (if useless!) answer!

Let me move out of the analogy into our actual usage: we're a consortium of educational institutions from all across NY State, providing resources (lesson plans, diagnostic tests, etc.) to educators within our member institutions.

All of these resources need to be tied to the NYS Learning Standards. For example, a lesson plan on how to teach multiplication might be "aligned" to the Performance Indicator "Students will be able to multiply and divide numbers up to 1,000".

So one primary use is to provide an ID (probably a UUID/GUID) that would be referred to both internally (using refID) and externally (local databases, external webservices, etc.)

Another is the reverse of that; a page that displays that lesson plan might also display the text of that Performance Indicator (and perhaps some related items, a hierarchy, etc.)

Yet another use is more of a browser, where an educator would search for particular keywords or drill down through various areas.

We've done this for many years using a database that migrated from Access to SQL Server, with the disparate structure of different Subjects force-fit into a single schema. As you can imagine, English teachers have a different idea of an appropriate structure than Math teachers do; as you may find it more difficult to imagine, NY State simply allowed the teachers in each of the seven subject areas to define their own structure, and made each of those available electronically -- but only as a PDF!

This year brings a new revision of some of those decade-old Standards, with (you guessed it!) an entirely different structure, which we've again force-fit into the old tables.

My goal is to create an XML storage file that can be used to generate the old database tables (with our new UUID in addition to the old integer identity columns) for backward compatibility -- as well as used directly by new applications (built in ColdFusion & PHP).
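To make the "generate the old tables" step concrete, here is a minimal Python sketch that walks an XML file and emits one flat row per node, carrying both the new UUID and the old integer key. The element names (`standards`, `subject`, `indicator`) and the `uuid` and `legacy-id` attributes are made up for illustration; the real Standards documents would certainly look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; every element and attribute name here is an assumption.
SAMPLE = """
<standards version="1">
  <subject uuid="11111111-1111-1111-1111-111111111111" legacy-id="1" name="Math">
    <indicator uuid="22222222-2222-2222-2222-222222222222" legacy-id="10"
               text="Students will be able to multiply and divide numbers up to 1,000" />
  </subject>
</standards>
"""

def flatten(element, parent_uuid=None):
    """Yield one (uuid, legacy_id, parent_uuid, kind, label) row per node that has a uuid."""
    uuid = element.get("uuid")
    if uuid is not None:
        label = element.get("name") or element.get("text") or ""
        # Assumes every node with a uuid also carries an integer legacy-id.
        yield (uuid, int(element.get("legacy-id")), parent_uuid, element.tag, label)
    for child in element:
        # Nodes without a uuid (such as the root) pass their parent's uuid through.
        yield from flatten(child, uuid or parent_uuid)

rows = list(flatten(ET.fromstring(SAMPLE)))
```

Each row could then be inserted into whichever legacy table matches its `kind`; the element name survives as data instead of being lost in the force-fit.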

And our eventual goal is to share this freely with the rest of the NYS education community (since the state itself has still not been able to provide the standards with that all-important ID field) -- and to eventually tie it into the similar work being done with the Dublin Core.

So rather than tying this to any one particular use, my (overly-lofty?) hope is to come up with a generically non-generic structure that can be transformed into more-specific formats as the need arises.

Am I out of my mind?

k5tm · Dec 19, 2005

Well, the answer to your final question is beyond the scope of this forum. [bigsmile]

This is a difficult area, for the reasons you have described very well. I tend to approach things from the aspect of XSLT conversion, which can be quite flexible, but also a bit arcane for the casual user. Absent some well-funded XML Schema definition work, it might be that you can have an XML document, organized by subject area, that would describe the subject-area specific structure (is that alliterative enough for you?) that can be used by the processor dynamically to provide structure information as you process one of the 'real' documents provided by the consortium. Given the almost free form requirement, this may be the best you can do.

Certainly one thing that I would do with any approach is to make a version attribute an absolute, so that as your definitions and processes evolve, you can accommodate material created for previous versions of your specifications in a seamless manner. Being able to advance gracefully without causing a lot of folks (who are probably not doing this as part of their real job) major annoyance will pay off the small cost of implementation many times over.
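A rough Python sketch of what the version attribute buys you: the reader looks at it first and picks a parser accordingly, so older files keep working. The `standards`, `subject`, and `indicator` names and the two layouts are invented for the example, not taken from any real specification.

```python
import xml.etree.ElementTree as ET

def parse_v1(root):
    # Hypothetical version 1 layout: indicators are direct children of the root.
    return [e.get("text") for e in root.findall("indicator")]

def parse_v2(root):
    # Hypothetical version 2 layout: indicators are grouped under <subject>.
    return [e.get("text") for e in root.findall("subject/indicator")]

# One parser per published version of the (assumed) format.
PARSERS = {"1": parse_v1, "2": parse_v2}

def load_indicators(xml_text):
    """Parse a document with whichever parser matches its version attribute."""
    root = ET.fromstring(xml_text)
    version = root.get("version")
    if version not in PARSERS:
        raise ValueError(f"unsupported or missing version: {version!r}")
    return PARSERS[version](root)
```

Rejecting a missing version outright is a design choice in this sketch; it forces every producer to declare what it wrote.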

Tom Morrison

http://www.liant.com

ChrisHunt · Dec 19, 2005

Not really an answer to your question, but another possibility to throw into the mix. If the hierarchy's what's important, and the nomenclature secondary, you can do something like this:

Code:

<region type="state" name="New York">
    <region type="county" name="Monroe">
        <region type="town" name="Gates" />
        <region type="town" name="Webster" />
    </region>
</region>
<region type="state" name="Louisiana">
    <region type="parish" name="Orleans">
        <region type="police jury ward" name="1" />
        <region type="police jury ward" name="2" />
    </region>
</region>

Of course it's possible to have other elements involved...

Code:

<region type="state">
   <name>Florida</name>
   <leader title="governor">Jeb Bush</leader>
   <subregions>
      <region type="county">
         <name>Dade</name>
      </region>
      <!-- ... etc ... -->
   </subregions>
</region>

I'm a sucker for recursive structures when it comes to representing hierarchies.
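For what it's worth, the two styles can be converted mechanically, so choosing one for storage doesn't rule out the other for processing. A minimal Python sketch, assuming the attribute-only form from the first example and that no source element already has a `type` attribute:

```python
import xml.etree.ElementTree as ET

def to_generic(element):
    """Return a <region> copy of element, moving its tag name into a type attribute."""
    # Assumes the source element has no attribute that is itself called "type".
    region = ET.Element("region", {"type": element.tag, **element.attrib})
    for child in element:
        region.append(to_generic(child))
    return region

specific = ET.fromstring(
    '<state name="Louisiana">'
    '<parish name="Orleans">'
    '<police-jury-ward name="1" />'
    '<police-jury-ward name="2" />'
    '</parish>'
    '</state>'
)
generic = to_generic(specific)
print(ET.tostring(generic, encoding="unicode"))
```

Note that the type value comes out as the element name (`police-jury-ward`), not the spaced form used above; mapping names to display labels would be a separate step.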

-- Chris Hunt
Webmaster & Tragedian (http://www.mcgonagall-online.org.uk)
Extra Connections Ltd (http://www.extraconnections.co.uk)

logantracyo · Dec 20, 2005

Tom, thanks for your ideas -- I particularly like the idea of versioning the schema, whatever I end up using; that's a great tip!

I'm also tempted by the idea of having multiple documents, one for each area (each with its own DTD); that matches Reality pretty well. It looks like using a Schema instead will allow multiple namespaces within one document, which might make for a cleaner approach. Any thoughts on DTD vs Schema?

Thanks!

logantracyo · Dec 20, 2005

Chris, thanks for your ideas as well! And by way of thanking you, may I use your suggestions as sort of "devil's advocate" placeholders, or perhaps windmills I can tilt against (to explain my thoughts in more detail)?

My first attempts at doing this looked very much like yours, with some standard elements making up a hierarchy. However, something just doesn't "feel" right about that, and it felt worse the deeper I got, and I realized it felt like I was force-fitting *elements* into attributes. So let me show you more of what I thought:

My analogy may have been a bad idea, because a "region" element makes quite a bit of sense in describing the World (since the similarities are less true in the NYS Learning Standards) -- but even in this case, <region type="continent"> has a very different connotation from <region type="state"> -- it's relatively easy to change the shape of a State (pretty much political changes), quite hard to do so with a Continent. Also, State has a meaning that changes; is it a state within the US? Or a nation state?

I think there *are* places where such exactly uniform elements work great -- a file system, perhaps, where each element is essentially the same as any other, but with different attributes:

Code:

<entry type="directory" name="/">
    <entry type="directory" name="usr">
        <entry type="directory" name="local">
            <entry type="file" name="asdf" />
        </entry>
    </entry>
</entry>

But as you note in your second example, as soon as you move to heterogeneous data, adding different types of elements makes a good deal more sense (region, leader): even though, technically, you could have <region type="governor" name="Jeb" />, it just doesn't feel right.

That's an extreme example of the "force-fit" problem -- but I think the same problem exists even at the state/continent level. The problem gets more complex when you try to set up a DTD or Schema to handle it -- you need rules like "Regions with a Type of State may have sub-regions whose type is either County or Parish", rather than just "State elements may consist of one or more of either County or Parish elements".
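To show what I mean by the first kind of rule: a DTD content model is keyed on element names, not attribute values, so with generic elements the check ends up somewhere else, for instance in application code. Here is a minimal Python sketch; the rule table is my own guess at the rules, not anything official.

```python
import xml.etree.ElementTree as ET

# Assumed rule table: which child types each region type may contain.
ALLOWED_CHILDREN = {
    "state": {"county", "parish", "borough"},
    "county": {"town", "township"},
    "parish": {"police jury ward"},
}

def violations(region, path=""):
    """Return a message for every child whose type the rule table does not allow."""
    here = f"{path}/{region.get('type')}:{region.get('name')}"
    # Types missing from the table are treated as leaves that allow no children.
    allowed = ALLOWED_CHILDREN.get(region.get("type"), set())
    problems = []
    for child in region.findall("region"):
        if child.get("type") not in allowed:
            problems.append(f"{here} may not contain type {child.get('type')!r}")
        problems.extend(violations(child, here))
    return problems
```

With distinct element names, the same constraint would simply be the content model of the State element, and the validator would enforce it for free.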

I hope it doesn't sound like I'm knocking your ideas -- I really appreciate the time and thought you and Tom have put into my problem!

It really seems like this is one of those "It Depends!" areas, full of shades of gray! I did find one strong reference on sorting some of this out (http://www-128.ibm.com/developerworks/xml/library/x-eleatt.html), and would love to talk about this in more detail if anyone has further thoughts on this.

Thanks!

k5tm · Dec 20, 2005

Without a lot more knowledge than I can afford to absorb right now, I think I would use a schema as opposed to a DTD.

If you are suitably creative, you can segment stylesheets in a manner that permits the development of the stylesheets (or stylesheet segments) in parallel with the various subject-specific document structures. This allows a top-level stylesheet to use xsl:include to pull in the stylesheet segments and then discover the type of transformation(s) required in a source document and apply the appropriate template(s).

Namespaces could be used for this, but it might lead to some truly arcane stylesheets. (But some would say there are no other kind!) All I can say is that if you decide to go this route, be very methodical so that your stylesheets do not end up a knot of special cases. (From your postings here, I am guessing methodical is your normal mode of operation.)

Tom Morrison

http://www.liant.com

ChrisHunt · Dec 21, 2005

As you say, it's an "it depends" type of issue. You need to home in on what's important to you, and ignore unimportant details. My little XML docs only really home in on the facts that "regions are subdivided into smaller regions" and "regions have different names". If there were other things I cared about, I'd have to add further elements or attributes to represent them.

My approach has two benefits. First, you always know what the elements are called. This is a major advantage over your "non-generic approach" when it comes to writing code. Second, the recursive nature of the structure means that it's not tied to a particular number of tiers. You can use it to represent the Vatican City (with - I suspect - no subdivisions), or the US (with lots of them).
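Both points show up in even a tiny piece of code. This is only a sketch in Python, using the attribute-only `region` form from my first example: one function handles any depth, because the only element name it needs to know is `region`.

```python
import xml.etree.ElementTree as ET

def outline(region, depth=0):
    """Return indented 'type: name' lines for a region and all of its descendants."""
    lines = [f"{'  ' * depth}{region.get('type')}: {region.get('name')}"]
    for child in region.findall("region"):
        lines.extend(outline(child, depth + 1))
    return lines

# A region with no subdivisions and one with two tiers below it.
vatican = ET.fromstring('<region type="country" name="Vatican City" />')
new_york = ET.fromstring(
    '<region type="state" name="New York">'
    '<region type="county" name="Monroe">'
    '<region type="town" name="Gates" />'
    '</region>'
    '</region>'
)
```

With the non-generic form, the equivalent walk would need to know every element name that can appear, or fall back to treating all child elements alike.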

Bringing this back to your actual problem, you seem to have two distinct things to represent:

1) The NYS Learning Standards, which are some kind of hierarchy of subjects and performance indicators.

2) Learning resources, which are associated with one or more of the above subjects/indicators.

The first of these is the critical one. If you can figure out how best to do that, the other should be trivial - each (relevant) point in the tree should have an ID, and you can assign an ID (or IDs) to each resource in your database.

Here are some issues to consider:

Can the same performance indicator appear in multiple places in the tree?

Can resources be attached to whole subject areas, or only to performance indicators?

Can a given resource be attached to more than one indicator/area?

These are more questions of database design than of XML schema, but one will somewhat reflect the other.
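As a rough illustration of how the ID ties the two together, here is a Python sketch that assumes the permissive answers to the last two questions: resources may attach to any node with an ID, and to more than one. The element names, IDs, second indicator, and resource names are all invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical standards tree; names, ids, and texts are placeholders.
STANDARDS = ET.fromstring(
    '<standards>'
    '<subject id="s1" name="Math">'
    '<indicator id="i1" text="Multiply and divide numbers up to 1,000" />'
    '<indicator id="i2" text="Estimate sums" />'
    '</subject>'
    '</standards>'
)

# Index every node that carries an id, whatever its element name.
BY_ID = {e.get("id"): e for e in STANDARDS.iter() if e.get("id") is not None}

# Hypothetical alignment rows, as they might come from a link table.
ALIGNMENTS = [("lesson-42", "i1"), ("lesson-42", "s1"), ("quiz-7", "i1")]

resources_for = defaultdict(list)
for resource_id, standard_id in ALIGNMENTS:
    if standard_id not in BY_ID:
        raise KeyError(f"unknown standard id: {standard_id}")
    resources_for[standard_id].append(resource_id)
```

If the answer to the first question is yes, and the same indicator appears in several places under one ID, a plain dictionary like `BY_ID` would silently keep only the last occurrence, so that case needs deciding before the index is built.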

-- Chris Hunt
Webmaster & Tragedian
Extra Connections Ltd
