XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition (720 page)

Places

Place names have an internal structure, but the structure is highly variable. In many cases, components of the place name may be missing, and the part that is missing may be the major part rather than the minor part. For example, you might know that someone was born in Wolverton, England, without knowing which of the three towns of that name it refers to. The GEDCOM schema allows the place name to be entered as unstructured text but also allows individual components of the name to be marked up using a <PlacePart> element, which can carry two attributes: Type, which can take values such as Country, City, or Parish to indicate what kind of place this is, and Level, which is a number that represents the relationship of this part of the place name to the other parts.

Personal Names

As with place names, personal names have a highly variable internal structure. The name can be written simply as a character string (within an <IndivName> element), or the separate parts can be tagged using <NamePart> elements. As with place names, these have a completely open-ended structure. The Type attribute can be used to identify the name part as, for example, a surname or generation suffix, and the Level attribute can be used to indicate its relative importance, for example when used as a key for sorting and indexing.

Creating a Data File

Our next task is to create an XML file containing the Kennedy family tree in the appropriate format. I started by entering the data in a genealogy package, taking the information from public sources such as the Web site of the Kennedy museum. The package I use is called The Master Genealogist, and like all such software it is capable of outputting the data in GEDCOM 5.5 format. This is a file containing records that look something like this (it's included in the downloads for this chapter as kennedy.ged):

0 @I1@ INDI
1 NAME John Fitzgerald/Kennedy/
1 SEX M
1 BIRT
2 DATE 29 MAY 1917
2 PLAC Brookline, MA, USA
1 DEAT
2 DATE 22 NOV 1963
2 PLAC Dallas, TX, USA
2 NOTE Assassinated by Lee Harvey Oswald.
1 NOTE Educated at Harvard University.
2 CONT Elected Congressman in 1945
2 CONT aged 29; served three terms in the House of Representatives.
2 CONT Elected Senator in 1952. Elected President in 1960, the
2 CONT youngest ever President of the United States.
1 FAMS @F1@
1 FAMC @F2@

This isn't XML, of course, but it is a hierarchic data file containing tagged data, so it is a good candidate for converting into XML that looks like the document below. This doesn't conform to the GEDCOM 6.0 data model or schema, but it's a useful starting point.


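<INDI ID="I1">
   <NAME>John Fitzgerald/Kennedy/</NAME>
   <SEX>M</SEX>
   <BIRT>
      <DATE>29 MAY 1917</DATE>
      <PLAC>Brookline, MA, USA</PLAC>
   </BIRT>
   <DEAT>
      <DATE>22 NOV 1963</DATE>
      <PLAC>Dallas, TX, USA</PLAC>
      <NOTE>Assassinated by Lee Harvey Oswald.</NOTE>
   </DEAT>
   <NOTE>Educated at Harvard University.
<CONT>Elected Congressman in 1945</CONT>
<CONT>aged 29; served three terms in the House of Representatives.</CONT>
<CONT>Elected Senator in 1952. Elected President in 1960, the</CONT>
<CONT>youngest ever President of the United States.</CONT>
   </NOTE>
   <FAMS REF="F1"/>
   <FAMC REF="F2"/>
</INDI>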

Each record in a GEDCOM file has a unique identifier (in this case I1: that's letter I, digit one), which is used to construct cross-references between records. Most of the information in this record is self-explanatory, except the <FAMS> and <FAMC> fields: <FAMS> is a reference to a <FAM> record representing a family in which this person is a parent, and <FAMC> is a reference to a family in which this person is a child.
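
As a rough illustration of how these identifiers act as cross-references, the sketch below (in Python rather than XSLT, so it can be run without an XSLT processor) keeps records in a dictionary keyed by identifier and follows the FAMS and FAMC pointers. The function name follow_refs and the dictionary layout are assumptions made for this example, and the two FAM records are empty placeholders rather than real data.

```python
# Records keyed by their GEDCOM identifier. The layout is hypothetical:
# pointer fields hold lists of identifiers, because a person can belong
# to more than one family.
records = {
    "I1": {"tag": "INDI", "FAMS": ["F1"], "FAMC": ["F2"]},
    "F1": {"tag": "FAM", "ID": "F1"},  # placeholder family record
    "F2": {"tag": "FAM", "ID": "F2"},  # placeholder family record
}


def follow_refs(records, record_id, field):
    """Return the records that `field` of record `record_id` points to."""
    targets = []
    for ref in records[record_id].get(field, []):
        if ref not in records:
            # A pointer with no matching record is a data error
            raise KeyError(f"dangling reference {ref} in record {record_id}")
        targets.append(records[ref])
    return targets


families_as_parent = follow_refs(records, "I1", "FAMS")
families_as_child = follow_refs(records, "I1", "FAMC")
```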

The first stage in processing data is to do this conversion into XML, a process which we will examine in the next section.

Converting GEDCOM Files to XML

The main purpose of XSLT is to convert one XML document into another. But that's not all it can do; it can also generate structured text as the output, and in XSLT 2.0, there are new facilities to accept structured text files as the input. That's exactly what we need to do here.

We'll do this in two stages (splitting a complex transformation into a series of simpler transformations arranged in a pipeline is always a good idea). Since GEDCOM 5.5 is a hierarchic format that uses level numbers to represent the nesting, we will start by converting this mechanically to an XML representation. Then in the second phase, we will convert this first cut XML into XML that conforms to the GEDCOM 6.0 specification.
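
To make the idea of level numbers representing nesting concrete, here is a minimal sketch of the mechanical conversion, written in Python rather than XSLT. It is not the technique used by the stylesheet: it uses a simple stack, and it assumes the lines have already been split into (level, tag, text) tuples and that levels never increase by more than one at a time. The function name nest_by_level and the wrapper element name records are hypothetical.

```python
import xml.etree.ElementTree as ET


def nest_by_level(rows):
    """Build nested elements from (level, tag, text) tuples."""
    root = ET.Element("records")  # hypothetical wrapper element
    stack = [root]                # stack[n + 1] is the latest element at level n
    for level, tag, text in rows:
        # Drop everything at this level or deeper; the new element
        # becomes a child of the latest element one level up.
        del stack[level + 1:]
        elem = ET.SubElement(stack[-1], tag)
        elem.text = text or None
        stack.append(elem)
    return root


rows = [
    (0, "INDI", ""),
    (1, "NAME", "John Fitzgerald/Kennedy/"),
    (1, "SEX", "M"),
    (1, "BIRT", ""),
    (2, "DATE", "29 MAY 1917"),
    (2, "PLAC", "Brookline, MA, USA"),
]
root = nest_by_level(rows)
xml_text = ET.tostring(root, encoding="unicode")
```

Each line only needs to know how deep it is; the stack remembers which open element it belongs to.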

The source document is thus a text file containing records like this:

0 @I1@ INDI
1 NAME John Fitzgerald/Kennedy/
1 SEX M
1 BIRT
2 DATE 29 MAY 1917
2 PLAC Brookline, MA, USA

which needs to be converted into XML like this:


<INDI ID="I1">
   <NAME>John Fitzgerald/Kennedy/</NAME>
   <SEX>M</SEX>
   <BIRT>
      <DATE>29 MAY 1917</DATE>
      <PLAC>Brookline, MA, USA</PLAC>
   </BIRT>
</INDI>

The stylesheet that does this (parse-gedcom.xsl) is in fact a micropipeline in its own right, written as a series of variable declarations, each one computing a new value from the value of the previous variable. It starts the usual way, and declares a parameter to accept the name of the input text document:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

<xsl:param name="input" as="xs:string" required="yes"/>

The file identified by this parameter is then read using the XSLT 2.0 unparsed-text() function:

<xsl:variable name="input-text" as="xs:string"
              select="unparsed-text($input, 'iso-8859-1')"/>

I've actually cheated here. GEDCOM requires files to be encoded in a character set called ANSEL, otherwise known as ANSI Z39.47-1985, which is used for almost no other purpose. If ANSEL were a mainstream character encoding, it could be specified in the second argument of the unparsed-text() function call. In practice, however, it is rather unlikely that any XSLT 2.0 processor would support this encoding natively. Therefore, the conversion from ANSEL to a mainstream character encoding needs some extra logic. If you use Saxon, you can write a custom UnparsedTextResolver in Java to take care of low-level interfacing issues like this. This class can invoke a custom character-code converter in the form of a Java Reader; an example called AnselInputReader is supplied in the downloads for this chapter. (For detailed instructions, see the Saxon documentation.)
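
The effect of this shortcut can be seen in a small sketch, written in Python rather than XSLT. Decoding as iso-8859-1 never fails, because every byte value maps to some character, so a file is always readable this way; any byte above 0x7F is simply taken as a Latin-1 character, which is not necessarily what an ANSEL-encoded file intended. The function name read_gedcom_text and the sample record are made up for this illustration.

```python
import os
import tempfile


def read_gedcom_text(path: str, encoding: str = "iso-8859-1") -> str:
    """Read a whole file as text, decoding its bytes with `encoding`."""
    # newline="" keeps the original \n or \r\n line endings untouched
    with open(path, encoding=encoding, newline="") as f:
        return f.read()


# Write a small sample file (invented data): ASCII plus one byte above 0x7F
with tempfile.NamedTemporaryFile(delete=False, suffix=".ged") as tmp:
    tmp.write(b"0 @I1@ INDI\r\n1 NAME Ren\xe9/Dupont/\r\n")
    sample_path = tmp.name

text = read_gedcom_text(sample_path)
os.remove(sample_path)
```

For files that contain only ASCII characters the choice of encoding makes no difference, which is why the shortcut works for simple data.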

We can now split the input file into lines by using the XPath 2.0 tokenize() function. We use a separator that matches both Unix and Windows line endings:

<xsl:variable name="lines" as="xs:string*"
              select="tokenize($input-text, '\r?\n')"/>
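
The same separator pattern can be tried outside XSLT. The following Python sketch (the function name split_lines is just for this example) splits on the pattern \r?\n, so that a carriage return before the newline is consumed as part of the separator rather than left at the end of each line.

```python
import re


def split_lines(text: str) -> list[str]:
    """Split text on Unix (\\n) or Windows (\\r\\n) line endings."""
    return re.split(r"\r?\n", text)


# Mixed line endings: the first line ends in \r\n, the second in \n
sample = "0 @I1@ INDI\r\n1 NAME John Fitzgerald/Kennedy/\n1 SEX M"
lines = split_lines(sample)
```

Note that if the text ends with a line ending, the result has a zero-length string as its last item; XPath's tokenize() behaves the same way when a separator occurs at the end of the input.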

The result is a sequence of strings (one for each line), and the next stage is to parse the individual lines. Each line in a GEDCOM file has up to five fields: a level number, an identifier, a tag, a cross-reference, and a value. We will create an XML <line> element representing the contents of the line, using attributes to represent each of these five components:


  

    

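<xsl:variable name="parsed-lines" as="element(line)*">
  <xsl:for-each select="$lines">
    <xsl:analyze-string select="." flags="x"
                        regex="^([0-9]+)\s*
                              (@([A-Za-z0-9]+)@)?\s*
                              ([A-Za-z]*)?\s*
                              (@([A-Za-z0-9]+)@)?
                              (.*)$">
      <xsl:matching-substring>
        <line level="{regex-group(1)}"
              ID="{regex-group(3)}"
              tag="{regex-group(4)}"
              REF="{regex-group(6)}"
              text="{regex-group(7)}"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:message>Non-matching line "<xsl:value-of select="."/>"</xsl:message>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:for-each>
</xsl:variable>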

This code creates a <line> element for each line of the input file. The attributes of each element are constructed by analyzing the text of the input line using a regular expression, where the five lines of the regex correspond to the five fields that may be present. The attribute flags="x" means that whitespace in the pattern is ignored, which allows the regex to be split into multiple lines for readability.

I describe this usage of <xsl:analyze-string> as a “single-match” usage, because the idea is that the regular expression matches the entire input string exactly once, and the <xsl:non-matching-substring> instruction is used only to catch errors. Within the <xsl:matching-substring> instruction, the content of the line is taken apart using the regex-group() function, which returns the part of the matching substring that matched the nth parenthesized subexpression within the regex. If the relevant part of the regex wasn't matched (for example, if the optional identifier was absent), then this returns a zero-length string, and our XSLT code then creates a zero-length attribute.
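
The single-match idea can be tried out with the same regular expression in Python, whose re.VERBOSE flag plays the role of flags="x". This is only a sketch for experimenting with the pattern, not part of the stylesheet; the function name parse_line and the dictionary result are choices made for this example. Python reports an unmatched group as None, so the code substitutes a zero-length string to mirror what regex-group() returns.

```python
import re

# The five lines of the pattern correspond to the five possible fields.
LINE_PATTERN = re.compile(
    r"""^([0-9]+)\s*              # group 1: level number
        (@([A-Za-z0-9]+)@)?\s*    # group 3: optional identifier
        ([A-Za-z]*)?\s*           # group 4: tag
        (@([A-Za-z0-9]+)@)?       # group 6: optional cross-reference
        (.*)$                     # group 7: remaining text
    """,
    re.VERBOSE,
)


def parse_line(line: str) -> dict[str, str]:
    """Split one GEDCOM line into its five fields."""
    match = LINE_PATTERN.match(line)
    if match is None:
        # Plays the role of the non-matching branch: report the bad line
        raise ValueError(f'Non-matching line "{line}"')
    return {
        "level": match.group(1),
        "ID": match.group(3) or "",
        "tag": match.group(4) or "",
        "REF": match.group(6) or "",
        "text": match.group(7) or "",
    }


header = parse_line("0 @I1@ INDI")
birth_place = parse_line("2 PLAC Brookline, MA, USA")
pointer = parse_line("1 FAMS @F1@")
```

Groups 2 and 5 exist only to make the surrounding @ signs optional together with the identifier, which is why the code reads groups 3 and 6 instead.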

So we now have a sequence of XML elements, each representing one line of the GEDCOM file and each containing attributes to represent the contents of the five fields in the input. It's useful when debugging to display the content of this intermediate variable, and I added a debugging template to the stylesheet to enable this. If you run the stylesheet with this as the entry point, it displays the structure:

<line level="0" ID="I1" tag="INDI" REF="" text=""/>
<line level="1" ID="" tag="NAME" REF="" text="John Fitzgerald/Kennedy/"/>
<line level="1" ID="" tag="SEX" REF="" text="M"/>
<line level="1" ID="" tag="BIRT" REF="" text=""/>








