6 Document Accessioning


All libraries, digital or not, have processes for formally accepting and including items into their collections, a process known as accessioning, and for removing items from their collections, known as deaccessioning. In this chapter we compare and contrast the accessioning methods of TE 1.0 and 2.0. We will see that, once again, the choice of XML vs. JSON, although not strictly a cause of the difference in accessioning approaches, almost naturally led to differences between the two architectures. The most significant of these differences is that whereas in TE 1.0 document editing and accessioning were two separate processes executed and controlled by different people, in TE 2.0 they became integrated into a single process executed by the document editor.

Authoring ≠ Editing

One way in which we can categorize digital libraries is into content collections and meta collections. In a meta collection, no actual content is kept; only meta data are kept. A good example of a meta collection is NSDL.org. NSDL (National Science Digital Library) maintains meta data for about 80 digital library collections and supports faceted searches over those 80 collections. The actual items themselves, however, are held by the various collections over which NSDL can search.

For meta collections, accessioning tends to be a relatively simple process, mostly because each item they represent —a so-called meta record— tends not to contain much data. In fact, in many cases this accessioning is fully or semi-automated, accommodated by Web services through which the various libraries allow their meta data to be collected by the meta collection. Of course, the main difficulty for meta collections is to keep them synchronized with the content collections they reference. Items newly added to the content collections must be referenced, without too long a delay, in the meta collection, and documents no longer available in the content collections must be dereferenced or removed from the meta collection.
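This synchronization task boils down to a set comparison between the identifiers held by the meta collection and those currently reported by a content collection. The following Python sketch illustrates the idea; the function and identifier names are our own inventions, not any actual harvesting interface.

```python
def sync_meta_collection(meta_ids, content_ids):
    """Compare the meta collection's record ids (meta_ids) with the
    ids currently reported by a content collection (content_ids).

    Returns (to_add, to_remove):
      to_add    -- newly added items that still need a meta record
      to_remove -- meta records whose items no longer exist
    """
    meta_ids = set(meta_ids)
    content_ids = set(content_ids)
    to_add = content_ids - meta_ids     # new in the content collection
    to_remove = meta_ids - content_ids  # no longer available
    return to_add, to_remove

# Example: since the last harvest, doc4 was added and doc1 removed
to_add, to_remove = sync_meta_collection(
    meta_ids={"doc1", "doc2", "doc3"},
    content_ids={"doc2", "doc3", "doc4"})
```

A real meta collection would obtain the content collection's identifiers through the harvesting Web services mentioned above rather than from a literal set.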

For content collections, however, accessioning tends to be more complicated, partly because the items to be accessioned are more complicated and partly because they often have to be reformatted.

TeachEngineering documents —in both TE 1.0 and TE 2.0— are typically submitted by their authors as text-processed documents; most often Microsoft Word documents. Their authors are neither asked nor required to maintain strict formatting rules, but they are required to provide specific types of information for specific types of content such as a summary, a title, grade levels, etc. Depending on the type of document, entire text sections are either mandatory or optional. TeachEngineering lessons, for instance, must have a Background section and activities must have an Activity Procedure section. TeachEngineering document editors work with the authors to rework their documents so that they comply with the required structure. Once done, however, the documents are still in text-processed form. Hence, as we have learned in the previous chapter, the first step of accessioning consists of converting them from text-processed form into the format required by the collection: XML for TE 1.0; JSON for TE 2.0. This conversion is done by special editing staff known internally as ‘taggers.’[1]

TE 1.0 Tagging: Word → XML

As discussed in the previous chapter, all TE 1.0 documents were stored as XML. Hence, conversion of their content as written by their authors to TE-specific XML was the main objective of the tagging process. This constituted a problem because the TE-XML specification was complex, and asking taggers to apply the proper tags to document content themselves would almost certainly have led to failure. Moreover, as mentioned in chapter 3, the TE XML contained both content and some formatting tags. This mixing of tag types and the myriad validation rules associated with these tags made it essentially impossible for student workers (the TE 1.0 tagging staff consisted mainly of student workers) to directly edit the documents in XML.

Of course, we at TeachEngineering were not the only ones facing this problem. With the rapidly increasing popularity of XML came a widely shared need to convert documents from one form or another into XML, a task that is not very human friendly.

Fortunately, Altova, a company specializing in XML technology, made available (for free) its Authentic tool for in-document, what-you-see-is-what-you-get (WYSIWYG) editing of XML documents. With Authentic, TE taggers could view and edit TE documents without having to know their XML, yet Authentic would save their documents in TE-XML format. Moreover, since Authentic kept track of the TE-XML schemas (recall that an XML schema is the specification of the rules of validity for a particular type of XML document), it protected editors from violating the schema, thereby guaranteeing that documents remained valid.

Figure 1: Editing a section of the Tug of War activity’s XML representation in Altova’s Authentic.

Figure 2: Section of Tug of War activity rendered in TeachEngineering

Figure 1 shows part of a TE 1.0 activity edit session using Authentic. Note how the look-and-feel of the activity as seen in Authentic is quite different from the look-and-feel of that same activity when rendered in TE 1.0 (Figure 2). There are two reasons for this difference. First and foremost, XML is (mostly) about content, and content can be rendered in many different ways. Second, because XML is (mostly) about content, no great effort was made, or needed, to precisely render the activity in Authentic as it would render in TeachEngineering. Still, in order to show a tagger the rendered version of the document, the TE 1.0 system offered a (password-protected) Web page where the tagger could test-render the document.

You might, at this point, wonder who determines the look-and-feel of the Authentic version of the document and how that look-and-feel is set up. This would be a good question, and it points to the cleverness of Altova’s business model. In a way, Altova’s business model for Authentic is the reverse of Adobe’s business model for its PDF Reader. Adobe gives away PDF Reader as a loss leader so that it can generate revenue from other PDF-generating and PDF-processing products; demand for these products would be low if few people could read what comes out of them. With Authentic we have the reverse situation. Altova sells tools for generating and validating XML schemas. One of the uses of those schemas is to edit XML documents which follow them. So Altova makes the Authentic XML editor available free of charge but generates revenue with the tools that produce the files (XSDs and Authentic WYSIWYG document layouts) with which documents can be edited in Authentic. Hence, in TE 1.0, the TE engineers used Altova tools to construct the document XSDs and to generate a layout for WYSIWYG editing in Authentic. TE taggers then used the free-of-charge Authentic tool to do the actual document editing and used a TeachEngineering test-rendering service to see the rendered version of the edited document.

TE 1.0 Document Ingestion and Rendering

Although XML editing of the document was the most labor-intensive step of the accessioning process, once we had an XML version of a TE document we were not quite there yet, as the document still had to be registered into the collection. In TE 1.0 this was done in three steps.

  1. Document check-in. The tagger would check the document into a central code repository system. Code repository systems such as CVS, Subversion, Team Foundation Version Control, and Git maintain a history and a copy of all code changes, allow reverting to previous versions of the code, track who made which change when, and can checkpoint whole collections of code in so-called code branches or releases. They are indispensable for code development, especially when more than one coder is involved.

    Although developed for managing program source code, these systems can of course be used for tracking and maintaining other types of electronic data sets as well, for instance XML or JSON documents. Hence, in TE 1.0, taggers, once a document had been converted to XML, checked that document into a central code repository system.

  2. Meta data generation. Once checked into the code repository system, a program run once or twice a day would extract data from the XML documents and generate meta data for them. Recall from chapter 3 that these meta data were served to third party users interested in TeachEngineering contents. One of those was NSDL.org whose data-harvesting programs would visit the TeachEngineering meta data web service monthly to inquire about the state of the collection. A side effect of this meta data generator, however, was additional quality control of the content of the document. As we have seen, XSDs are an impressive quality control tool as they can be used to check the validity of XML documents. Such validity checking, however, is limited to the syntax of the document. Hence, a perfectly valid XML document can nevertheless have lots of problems. For example, it may contain a link to a non-existing image or document or it may declare a link to a TE lesson whereas in fact the link points to an activity. One of the things that the meta data generator did, therefore, was to conduct another set of quality control checks on the documents. If it deemed a document to be in violation of one or more of its rules, no meta data would be generated for it and the document would not be ingested into the collection. It would, of course, remain in the code repository system from which the tagger could then check it out, fix the problem(s) flagged by the meta data generator and check it back in for the next round of meta data generation.
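A check of this kind, verifying that a link declared as pointing to a lesson indeed resolves to a lesson, can be sketched in a few lines of Python. The element and attribute names below are invented for illustration; they are not the actual TE-XML vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry of accessioned documents and their types;
# in TE 1.0 this information lived in the database.
DOCUMENT_TYPES = {
    "heart_to_heart": "lesson",
    "mighty_heart": "activity",
}

def check_links(xml_text):
    """Flag problems that XSD validation cannot catch: links that
    point to missing documents or to documents of the wrong type."""
    problems = []
    root = ET.fromstring(xml_text)
    for link in root.iter("link"):      # assumed element name
        target = link.get("target")
        declared = link.get("type")
        actual = DOCUMENT_TYPES.get(target)
        if actual is None:
            problems.append(f"{target}: does not exist")
        elif actual != declared:
            problems.append(f"{target}: declared {declared}, is {actual}")
    return problems

# A syntactically valid document that declares an activity as a lesson:
doc = '<lesson><link target="mighty_heart" type="lesson"/></lesson>'
```

A document failing such checks would, as described above, be held back from ingestion until the tagger fixed the problems and checked the file back in.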
  3. Document indexing. Twice daily TE 1.0 ran a process which would actually ingest newly created or modified documents. We named this process ‘the spider’ because its method of picking up documents for ingestion was very similar to that of so-called web crawlers, aka ‘web spiders.’ Such a crawler is a process which extracts from a document all the data it is looking for, after which it looks for references or links to other documents and then crawls those in turn. Whereas most modern crawlers are multi-threaded; i.e., they simultaneously crawl more than one document, the TE 1.0 spider was simple and processed only one document at a time. This was perfectly acceptable, however, because although the overall process would complete more quickly if crawls ran in parallel, we only had to complete the process once or twice a day. Figure 3 shows the process of spidering a TE 1.0 document; in its generic form as pseudo code (a) and as an example of spidering a TE 1.0 curricular unit on heart valves (b). Note how the process in (a) is recursive, i.e., the spider() method contains a call to spider().
    (a) spider (document)
          document.index_content();          // index the document
          doc_links = document.find_links(); // find links in the document
          foreach (doc in doc_links)         // spider all doc_links
            if (spidered(doc) == false)      // only spider when not yet spidered
              spider(doc);                   // recursive call
    (b) Curricular Unit: Aging Heart Valves:

    • Lesson: Heart to Heart:
      • Activity: The Mighty Heart
      • Activity: What’s with all the pressure?
    • Lesson: Blood Pressure Basics:
      • Activity: Model Heart Valves

    0. Spider curricular unit Aging Heart Valves:

      Index content of Aging Heart Valves

      Doc_links found:

        Lesson: Heart to Heart

        Lesson: Blood Pressure Basics

        1. Spider lesson: Heart to Heart:

          Index content of Heart to Heart

          Doc_links found:

            Activity: The Mighty Heart

            Activity: What’s with all the pressure?

            1a. Spider activity The Mighty Heart:

              Index content of The Mighty Heart

              Doc_links found: none

            1b. Spider activity What’s with all the pressure?:

              Index content of What’s with all the pressure?

              Doc_links found: none

        2. Spider lesson: Blood Pressure Basics:

          Index content of Blood Pressure Basics

          Doc_links found:

            Activity: Model Heart Valves

            2a. Spider activity Model Heart Valves:

              Index content of Model Heart Valves

              Doc_links found: none

    Figure 3: TE 1.0 document spidering. Generic algorithm (a) and Curricular Unit example (b).
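The pseudocode of Figure 3(a) translates almost directly into a runnable sketch. The Python below is our illustration, not the actual TE 1.0 spider; document links are represented as a simple dictionary keyed by document title.

```python
# Hypothetical document hierarchy from Figure 3(b): each document
# maps to the list of documents it links to.
DOC_LINKS = {
    "Aging Heart Valves": ["Heart to Heart", "Blood Pressure Basics"],
    "Heart to Heart": ["The Mighty Heart", "What's with all the pressure?"],
    "Blood Pressure Basics": ["Model Heart Valves"],
    "The Mighty Heart": [],
    "What's with all the pressure?": [],
    "Model Heart Valves": [],
}

indexed = []      # documents indexed so far, in spidering order
spidered = set()  # documents already visited

def spider(document):
    """Recursively index a document and everything it links to."""
    spidered.add(document)
    indexed.append(document)         # index the document
    for doc in DOC_LINKS[document]:  # spider all doc_links
        if doc not in spidered:      # only spider when not yet spidered
            spider(doc)              # recursive call

spider("Aging Heart Valves")
```

Running this reproduces the depth-first order of the walkthrough in Figure 3(b): the curricular unit first, then each lesson followed by its activities.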
  4. Document rendering. One last step in the document production chain in both TE 1.0 and 2.0 is the actual rendering of documents in users’ browsers. To a large extent this is the simplest of the production steps, although it too has its challenges. Rendering in TE 1.0 was accomplished in PHP, then and now one of the more popular languages for Web programming.

    Rendering a TE 1.0 document relied partly on information stored in the document’s XML content and partly on information stored in the database generated during document indexing. Whereas all of a document’s content could be rendered directly from its XML, some aspects of rendering required a database query. An example is a document’s ‘Related Curriculum’ (Figure 4). Whereas a document may have ‘children’ (e.g., a lesson typically has one or more activities), it does not contain information about its parents or grandparents. Thus, while a lesson typically refers to its activities, it does not contain information as to the curricular unit to which it belongs. A document’s complete lineage, however, can be constructed from the parent-child relationships stored in the database, and hence a listing of ‘Related Curriculum’ can be extracted from the database, yet not from the document’s XML.

    Figure 4: TE 1.0 rendering of The Mighty Heart activity’s ‘Related Curriculum.’
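Reconstructing a document’s lineage from parent-child rows can be sketched as repeated lookups, one per generation. The dictionary below is a hypothetical stand-in for TE 1.0’s relational table.

```python
# Hypothetical parent-child rows, keyed child -> parent, as they
# might sit in a relational table.
PARENT_OF = {
    "The Mighty Heart": "Heart to Heart",
    "Heart to Heart": "Aging Heart Valves",
}

def lineage(document):
    """Walk the parent-child relationships upward and return the
    document's ancestors, nearest first. In TE 1.0 each step would
    have been a database lookup."""
    ancestors = []
    while document in PARENT_OF:
        document = PARENT_OF[document]
        ancestors.append(document)
    return ancestors
```

For the activity The Mighty Heart this yields its lesson and then its curricular unit, the chain rendered as ‘Related Curriculum’ in Figure 4.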

    A second example of database-reliant document rendering in TE 1.0 concerns a document’s educational standards. Figure 5 shows the list of K-12 science and engineering standards to which the activity The Mighty Heart has been aligned.

    Figure 5: TE 1.0 rendering of The Mighty Heart activity’s aligned engineering and science educational standards.

    Because the relationship between educational standards and documents is a so-called many-to-many relationship (a standard can be related to multiple documents and one document can have multiple standards), in TE 1.0 standards were stored uniquely in the database and documents referred to those standards by standard ID. For The Mighty Heart activity the associated XML was as follows[2]:

        <edu_standard identifier="S11326BD"/>
        <edu_standard identifier="S11326BE"/>
        <edu_standard identifier="S11416DF"/>

    Hence, in order to show the information associated with a standard (text, grade level(s), issuing agency, date of issuance, etc.), it must be retrieved from the database rather than from the referring document.
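Rendering a standard thus amounts to a join: the identifier in the document’s XML is resolved against the standards stored in the database. A minimal Python sketch, with the standard’s details invented for illustration:

```python
# Hypothetical standards table keyed by the S* identifiers that
# appear in a document's XML (all details invented).
STANDARDS_DB = {
    "S11326BD": {"text": "Students understand structure and function.",
                 "grades": "3-5", "agency": "Example State Board"},
}

def render_standard(identifier):
    """Resolve a standard referenced by ID and format it for display;
    the referring document itself only carries the ID."""
    std = STANDARDS_DB.get(identifier)
    if std is None:
        return f"[unknown standard {identifier}]"
    return f"{std['text']} (grades {std['grades']}, {std['agency']})"
```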

TE 2.0 Tagging: Word → JSON

While the TE 1.0 tagging process served the TeachEngineering team well, it had a few notable downsides:

  • The Authentic software used to write (tag) the documents in XML needed to be installed on each editor’s computer, along with the XML schema for each type of curriculum document.
  • The editing workflow had a number of steps that required the editor to understand specialized software, including Authentic and the Subversion version control system.
  • Previewing a rendered document required the editor to upload the resulting XML file to the TE site.
  • The fact that the spider ran only twice a day limited how quickly new documents (and edits to existing documents) appeared on the site.

One of the goals of TE 2.0 was to streamline the tagging and ingestion process. Since TeachEngineering is a web site, the logical choice was to allow editors to add and edit documents from their web browser, with no additional software required. As such, TE 2.0 includes a web/browser-based document editing interface that is very similar to that of modern, more generalized content management systems such as WordPress (Figure 6).

The open-source JavaScript/HTML text editor TinyMCE, a tool specifically designed to integrate nicely with content management systems, was used as the browser-based editor. TinyMCE provides an interface that is very similar to that of a typical word processor.

Figure 6: TE 2.0 document editing interface for a Curricular Unit

Figure 6 shows an example of editing a document in TE 2.0. The interface provides a few options to support the editor’s workflow. The Save button saves the in-progress document to the (RavenDB) database. Documents that are in a draft state will not be visible to the public. The Preview button shows what the rendered version of the document will look like to end users. The Publish button changes the document’s status from draft to published, making it publicly visible. Any errors in the document, such as a missing required field, are called out by displaying a message and highlighting the offending field with a red border. A document with errors cannot be published.

As in TE 1.0, documents in TE 2.0 are hierarchically organized in that documents specify their children; e.g., a lesson specifies its child activities and a curricular unit its lessons. But whereas in TE 1.0 editors had to specify these children with a sometimes complex file path, in TE 2.0 they have a simple selection interface for specifying these relationships and are no longer required to know where documents are stored (Figure 7).

Figure 7: TE 2.0’s interface for specifying a document’s child documents

One other noteworthy difference between TE 1.0 and TE 2.0’s tagging processes is that with TE 1.0, content editors by necessity had to have some knowledge of the internal structure and working of TeachEngineering’s architecture. They had to create documents using Authentic, and check the resulting XML document into source control. With TE 2.0, editors edit documents using a familiar WYSIWYG interface. The software behind the scenes takes care of the technical details of serializing the documents to JSON and storing them in RavenDB.

TE 2.0 Document Ingestion and Rendering

With TE 2.0’s architecture, the document ingestion and rendering process is greatly simplified. Here we will revisit the ingestion and rendering steps from TE 1.0 and contrast them with the process in TE 2.0.

  1. Document check-in. In TE 2.0, there is no document check-in process; i.e., no process of moving the file from the local system into the TE repository of documents. When editors save the document they are editing in TE 2.0’s web interface, the document is immediately stored in RavenDB.
  2. Metadata generation. In TE 2.0, there is no separate metadata generation process. As noted in chapter 4, TE 2.0 neither generates nor stores metadata; the JSON representation of the curriculum document is the single version of the TE reality. Whereas TE 1.0 always generated and exposed its metadata for harvesting by meta collections such as the National Science Digital Library (NSDL), TE 2.0 no longer does this, mostly because support for and use of generic metadata harvesting protocols such as OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) have dwindled.
  3. Document indexing. There is no document indexing step in TE 2.0. Since documents are immediately saved to RavenDB, there is no need for a separate process to crawl and discover new or modified documents.
  4. Document rendering. At a high level, the document rendering process in TE 2.0 is quite similar to TE 1.0’s process, with a few key differences. For one thing, TE 2.0 was developed in C# as opposed to PHP.

    Whereas in TE 1.0 the hierarchical relationships between any pair of documents were stored as parent-child rows in a relational database table, in TE 2.0 the relationships between all of the curriculum documents are stored in RavenDB in a single JSON document. This tree-like structure is cached in memory, providing a fast way to find and render a document’s relatives (ancestors and descendants). For example, a lesson will typically have one parent curricular unit and one or more child activities. The following is an excerpt of the JSON document which describes the relationships between documents.

    {
      "CurriculumId": "cla_energyunit",
      "Title": "Energy Systems and Solutions",
      "Rank": null,
      "Description": null,
      "Collection": "CurricularUnit",
      "Children": [
        {
          "CurriculumId": "cla_lesson1_energyproblem",
          "Title": "The Energy Problem",
          "Rank": 1,
          "Description": null,
          "Collection": "Lesson",
          "Children": [
            {
              "CurriculumId": "cla_activity1_energy_intelligence",
              "Title": "Energy Intelligence Agency",
              "Rank": 1,
              "Description": "A short game in which students find energy facts among a variety of bogus clues.",
              "Collection": "Activity",
              "Children": []
            }
            … additional child activities are not shown here for brevity
          ]
        }
      ]
    }

    Here you can see that the unit titled Energy Systems and Solutions has a child lesson titled The Energy Problem, which itself has a child activity titled Energy Intelligence Agency. Since this structure represents the hierarchy explicitly, it is generally a lot faster to extract hierarchical relationships from it than from a table which represents the hierarchy implicitly by means of independent parent-child relationships.
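Because the tree is cached in memory, finding a document’s relatives is a single recursive walk with no database round trips. A Python sketch over a pared-down version of the JSON excerpt above:

```python
# Pared-down version of the cached curriculum tree shown above.
TREE = {
    "CurriculumId": "cla_energyunit",
    "Collection": "CurricularUnit",
    "Children": [
        {"CurriculumId": "cla_lesson1_energyproblem",
         "Collection": "Lesson",
         "Children": [
             {"CurriculumId": "cla_activity1_energy_intelligence",
              "Collection": "Activity",
              "Children": []}]}]}

def find_ancestors(node, curriculum_id, path=()):
    """Depth-first search for curriculum_id; returns the ids of its
    ancestors, outermost first, or None if the id is absent."""
    if node["CurriculumId"] == curriculum_id:
        return list(path)
    for child in node["Children"]:
        found = find_ancestors(child, curriculum_id,
                               path + (node["CurriculumId"],))
        if found is not None:
            return found
    return None
```

The actual TE 2.0 code is C#, but the principle is the same: once the hierarchy is explicit and in memory, ancestry queries are simple traversals.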

    Educational Standards are also handled differently in TE 2.0. As noted earlier in this chapter, curriculum documents in TE 1.0 only stored the identifiers of the standards to which the document was aligned. In TE 2.0, all of the properties necessary to render a standard alignment on a curriculum page are included in the JSON representation of the curriculum document. As discussed in chapter 4, it can sometimes be advantageous to de-normalize data in a database. This is an example of such a case. Since standards do not change once they are published by the standard’s creator, we do not need to worry about having to update the details of a standard in every document which is aligned to that standard. In addition, storing the standards with the curriculum document boosts performance by eliminating the need for additional queries to retrieve standard details. Whereas this implies a lot of duplication of standard data in the database, the significant speed gain in extracting the document-standard relationships is well worth the little bit of extra storage. The following is an example of the properties of a standard that are embedded in a curriculum document.

    "EducationalStandards": [
      {
        "Id": "http://asn.jesandco.org/resources/S2454426",
        "StandardsDocumentId": "http://asn.jesandco.org/resources/D2454348",
        "AncestorIds": [ … ],
        "Jurisdiction": "Next Generation Science Standards",
        "Subject": "Science",
        "ListId": null,
        "Description": [
          "Biological Evolution: Unity and Diversity",
          "Students who demonstrate understanding can:",
          "Construct an argument with evidence that in a particular habitat some organisms can survive well, some survive less well, and some cannot survive at all."
        ],
        "GradeLowerBound": 3,
        "GradeUpperBound": 3,
        "StatementNotation": "3-LS4-3",
        "AlternateStatementNotation": "3-LS4-3"
      }
    ]

While the document accessioning experience in TE 2.0 is more streamlined and user friendly, it does have a downside. In TE 1.0, if a property was added to a curriculum document, updating the XML schema was the only step needed to allow editors to utilize the new property. This was because the Authentic tool would recognize the schema change and the editing experience would automatically adjust. In TE 2.0, adding a field requires a developer to make code changes to the edit interface. On balance, however, since document schemas do not change that often, the advantages of a (much) more user-friendly document editing experience outweigh the occasional need for code changes.

  1. The term tagger stems from the TE 1.0 period during which document conversion consisted of embedding content in XML ‘tags.’
  2. The S* standard identifiers are maintained by the Achievement Standard Network project. They can be viewed using the following URL: http://asn.desire2learn.com/resources/S*_code_goes_here