Augmented Metadata and Annotations

From Metadata-Registry
Revision as of 12:14, 25 October 2005 by Diane (Talk | contribs)

Outline--a harvest scenario

  1. We publish two metadata formats designed to support annotations and augmented metadata
    • The key feature of these formats is two metadata elements, one (and only one) of which must be present, either of which may be repeated:
      • xxxxUniqueIdRef -- which contains a reference to an existing metadata record in the MR and must match an existing record
      • dcIdentifierRef -- which contains a reference to a URI that may or may not exist in the MR.
    • Augmented metadata must reference an existing URI in the MR. This could also be expressed as <reference type=xxxxUniqueId> or <reference type=dcIdentifier>
    • These are intended to be used to supply annotations and augmented metadata for harvest via OAI and perhaps a services interface.
  2. Annotation and augmentation suppliers wishing to supply metadata about a resource identified by a URI should first query or harvest the MR to get a list of metadata records that are about that URI.
  3. They create metadata about their annotation in the above format and serve it via OAI. This record may carry the actual annotation or it can simply contain a reference. In the case of metadata augmentation, each record served should be a self-contained, incomplete metadata record and should not reference another source of metadata.
  4. We harvest the records through a standard harvest -- all incoming records will have to be associated with a collection record
  5. The ingest process creates a unique mrec record for each incoming record
  6. References in the MR must always be mrec_ids, so in the case of dcIdentifierRef the ingest process retrieves all mrecs that reference each dcIdentifierRef.
  7. If a dcIdentifierRef references a URI that is not found, an mrec record is created for that URI and is queued for metadata generation by iVia (controversial)
  8. An entry is created in the link table for each mrec identified either directly or by reference. This will contain the mrec_id of the annotation record, the mrec_id of the mrec being annotated, a reference type, a datestamp, and a source mrec_id
    • Note that the link table will need an additional 'source' field that will, in the case of annotations and augmentations, contain the mrec_id of the annotation or augmentation metadata record that supplied the link.
    • Note also that reference type and datestamp are denormalized values that can be determined by reference to the source mrec_id if necessary.
  9. Output of augmented metadata is the tough thing -- it needs to be served both as a component part of the metadata format being augmented and as a distinct format, both within and without the mudball.
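Steps 5 through 8 of the outline can be sketched roughly as follows. This is a minimal in-memory illustration, not the MR's actual ingest code: the Registry class, the ingest function, the field names in the link rows, and the ivia_queue list are all hypothetical, and the "one and only one" rule is read here as "exactly one of the two reference elements, possibly repeated".

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Registry:
    """Toy in-memory stand-in for the MR; every name here is hypothetical."""
    mrecs: dict = field(default_factory=dict)        # mrec_id -> {"uri": ..., "stub": bool}
    links: list = field(default_factory=list)        # rows of the link table
    ivia_queue: list = field(default_factory=list)   # mrec_ids awaiting metadata generation
    next_id: int = 1

    def new_mrec(self, uri=None, stub=False):
        mrec_id = self.next_id
        self.next_id += 1
        self.mrecs[mrec_id] = {"uri": uri, "stub": stub}
        return mrec_id

    def mrecs_for_uri(self, uri):
        return [m for m, rec in self.mrecs.items() if rec["uri"] == uri]


def ingest(registry, unique_id_refs=(), dc_identifier_refs=(), ref_type="annotation"):
    """Ingest one harvested annotation/augmentation record (outline steps 5-8)."""
    # Exactly one of the two reference elements must be present; either may repeat.
    if bool(unique_id_refs) == bool(dc_identifier_refs):
        raise ValueError("supply xxxxUniqueIdRef or dcIdentifierRef, not both or neither")
    # A xxxxUniqueIdRef must match an existing record.
    for mrec_id in unique_id_refs:
        if mrec_id not in registry.mrecs:
            raise ValueError(f"xxxxUniqueIdRef {mrec_id} matches no existing record")

    source = registry.new_mrec()            # step 5: one mrec per incoming record
    targets = list(unique_id_refs)
    for uri in dc_identifier_refs:
        found = registry.mrecs_for_uri(uri)  # step 6: resolve the URI to mrec_ids
        if not found:                        # step 7: unknown URI, create a stub
            stub = registry.new_mrec(uri=uri, stub=True)
            registry.ivia_queue.append(stub)
            found = [stub]
        targets.extend(found)

    stamp = datetime.now(timezone.utc)
    for target in targets:                   # step 8: one link row per target mrec
        registry.links.append({
            "annotation_mrec_id": source,
            "target_mrec_id": target,
            "reference_type": ref_type,
            "datestamp": stamp,
            "source_mrec_id": source,
        })
    return source
```

In this sketch the annotation mrec_id and the source mrec_id are always the same record; they would differ only if links were also created by some other process, which the outline leaves open.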

Metadata Augmentation: Use cases

Use Case #1: field replacement or deprecation

The Repository receives a file of item records from the Whatsis provider. Each record contains a defaulted value "unknown" in the Coverage element. Based on the Repository policy to deprecate useless defaults, the element is marked as deprecated, and that assertion indicates the Repository Quality Improvement Service (RQIS) is its source. In addition, the dc:format value of "application/flash" is consistently misspelled. A second version of the dc:format element with the correctly spelled value is provided by RQIS and an error notification message is sent to the data provider. MR OAI format rqis_dc_plus will include both versions of dc:format; rqis_dc_gold will only show the correctly spelled one. Lastly, the DCMIType value of "InteractiveResource" is consistently misspelled by the provider in a dc:type field. A second version of the dc:type element with the correctly spelled value is provided by RQIS, and the encoding scheme of dct:DCMIType is added. An error notification message is sent to the data provider. rqis_dc_plus will include both versions of dc:type; rqis_dc_gold will only show the correctly spelled one, with its indicated encoding scheme.

Later, the Repository harvests updated item records from Whatsis. RQIS quality control routines are run on the updated metadata:

  1. <coverage>unknown</coverage> is provided again. The RQIS continues to keep the deprecation assertion and NOT serve this useless info to downstream users.
  2. <coverage>unknown</coverage> is no longer provided. The RQIS needs to remove the deprecation assertion because it no longer refers to an actively served statement.
  3. <coverage>unknown</coverage> is no longer provided BUT <coverage>Washington</coverage> is now provided. The RQIS removes the deprecation assertion because there is now a useful (!) value; the Repository must serve the new coverage info to downstream users.
  4. provider now serves newly misspelled "apprication/flash". Because we have a separate RQIS-provided element with the correct spelling, the new (incorrect) provider element replaces the old (incorrect) provider element, and the correct RQIS attributed element is left alone.
  5. correctly spelled "application/flash" is now provided. The Repository should now drop the RQIS-sourced correct element, as it is a duplicate of the provider sourced correct element. Or not -- the bottom line is to serve only ONE, rather than duplicate.
  6. newly misspelled "Inteactive Resource" is provided. Because the Repository has a separate element, with the correct spelling, that indicates encoding scheme, the newly incorrect provider element replaces the old (incorrect) provider element, and the correct one is retained.
  7. correctly spelled "InteractiveResource" is now provided. The MR should now either add the encoding scheme to the provider's newly correct element and drop the (duplicate) RQIS-sourced correct element, or the MR should keep both, with the encoding scheme only applied to the RQIS element. Or not -- the bottom line is to only serve ONE, rather than duplicate.
  8. provider no longer serves dc:type element. (orphaned field enhancement) Should the RQIS dc:type field be retained, or should it be discarded? If the Repository doesn't retain a connection from the RQIS assertion to the original provider assertion, then the RQIS dc:type element just remains (with what provenance?). [Alternatively, since this level of quality improvement is based on examination of metadata, not resources, the element is not retained.]
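The re-harvest scenarios above reduce to two rules: a deprecation survives only while the provider still supplies the deprecated value, and an RQIS-sourced value is dropped once the provider supplies the same value itself. A minimal sketch of those two rules for a single element follows; the function name and argument shapes are assumptions for illustration, and the orphan question in scenario 8 is not resolved here (the sketch simply keeps the RQIS value).

```python
def reconcile(provider_values, rqis_values, deprecated):
    """Reconcile one element after a re-harvest.

    provider_values: values the provider supplies now (list of str)
    rqis_values:     RQIS-sourced values already held for the element (list of str)
    deprecated:      provider values RQIS previously marked deprecated (set of str)
    Returns (kept_deprecations, kept_rqis, served).
    """
    # Scenarios 1-3: a deprecation stays only while the provider still asserts the value.
    kept_deprecations = {v for v in deprecated if v in provider_values}
    # Scenarios 5 and 7: drop an RQIS value that the provider now supplies itself.
    kept_rqis = [v for v in rqis_values if v not in provider_values]
    # Served values: provider values that are not deprecated, then surviving RQIS values.
    served = [v for v in provider_values if v not in kept_deprecations] + kept_rqis
    return kept_deprecations, kept_rqis, served
```

For scenario 4, reconcile(["apprication/flash"], ["application/flash"], set()) keeps the RQIS value and serves both versions, which corresponds to the rqis_dc_plus view; deciding which of the two is erroneous for rqis_dc_gold would still need an RQIS quality-control pass.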

The critical thing is who makes the assertion. For example, if the original metadata provider supplies a field with a typo, "texp/html", and RQIS corrects the typo to "text/html", the original metadata provider made the assertion. However, if the original metadata provider says a resource is an image, when it's really (or also) text, then the RQIS correction has a new assertion in it.

Use Case #N1: provider updates their metadata after it has been augmented

  1. ThatsUs provides rqis_dc to the Repository
  2. iVia augments the ThatsUs items with dc:subject fields with LCC values
  3. MR harvests updated rqis_dc from ThatsUs
    • ThatsUs' new rqis_dc has no dc:subject fields
    • ThatsUs' new rqis_dc has dc:subject fields with LCC values
  4. Q: Under what conditions do we trigger new iVia augmentations? Only if primary identifier changes?
  5. Q: (When do safe xforms happen? where are they in this sequence?)

Use Case #N2: augmentation service updates their provided augmentations

  1. ThatsUs provides oai_dc to the Repository
  2. iVia augments the ThatsUs items with dc:subject fields with LCC values
  3. iVia newly augments the ThatsUs items with new, improved dc:subject fields with LCC values
    • do we set up the process to assume augmentations supersede older versions of themselves?
  4. Q: (When do safe xforms happen? where are they in this sequence?)

Use Case #N3: auto-chosen/auto-gen item metadata is augmented by another service

  1. iVia does a crawl and provides item level metadata to the Repository as collection wowza.
  2. ENC augments the wowza items with dct:audience fields
  3. SDSC augments the wowza items with dc:format information and information about broken links.
  4. Q: (is there anything special about this case, or is it the same as N1?)
  5. Q: (When do safe xforms happen? where are they in this sequence?)

Use Case #2: multiple equivalent resources and their relationship to augmentations on output

ENC provides the Repository with metadata augmentations asserting that specified items in a number of collections relate to the Illinois third grade science standard for basic understanding of photosynthesis. ENC provides the Repository with metadata records identifying a Repository metadata record ID, a URL (providing an internal check as well as an additional identifier for the resource) and the DC refinement "conformsTo" specifying the particular standard to which the resource is related. This element contains the source ENC and is identified as human created data. The Repository Simple Equivalency Service (based on resource URLs) identifies three other items in other collections that this relationship assertion applies to, and the appropriate links are made, and the resource metadata records (aggregated version only) updated in the Repository OAI server.
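A rough sketch of how a URL-based equivalency lookup could extend one augmentation to equivalent records is shown below. The function names and the (augmentation, target, link kind) tuples are hypothetical, and equivalence is assumed to mean an exact URL string match; a real service would likely normalize URLs first.

```python
def equivalent_records(records, target_id):
    """records maps mrec_id -> resource URL; return other ids sharing the target's URL."""
    url = records[target_id]
    return sorted(m for m, u in records.items() if u == url and m != target_id)


def link_augmentation(records, aug_id, target_id):
    """Link an augmentation to its named target and to every URL-equivalent record."""
    links = [(aug_id, target_id, "direct")]
    links += [(aug_id, m, "equivalent") for m in equivalent_records(records, target_id)]
    return links
```

Keeping the link kind in each row lets the output side tell an assertion made directly by ENC from one the Repository inferred through equivalence.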

Use Case #3: Multiple providers of metadata and augmentation -- original metadata provider, RQIS (as augmenter), 3rd party augmenter, metadata served out in various flavors

The Whomever Collection supplies NSDL with 2233 item records described with oai_dc metadata. Based on routine normalization procedures, NSDL adds several encoding schemes to the records: "URI" to the identifier element (all values begin with "http") and "DCMIType" to most of the Type values which are valid DCMIType terms. In each of these cases, the source of the data continues to be identified as the original data provider. Several weeks later, the iVia staff harvest the metadata for the collection, and feed back to NSDL LCC classification and LCSH subject headings for the collection. This information is identified as originating with iVia and also as machine generated data.

The metadata is served out in a number of flavors:

  • native_oai_dc: metadata exactly how it came to us
  • rqis_dc: native_oai_dc plus safe xforms (was: as received, though normalized for errors and with added valid schemes)
  • rqis_dc_plus: rqis_dc plus augmentations (each safe xformed native record with any augmentations that apply, based on equivalence relationships)
  • rqis_dc_gold: rqis_dc, with erroneous values removed. Different from "rqis_dc_plus" because fields may be removed.
  • oai_dc: the RQIS's "dumbing down" of one of the above rqis_dc formats so we are compliant with OAI-PMH 2.0 (we must serve oai_dc)
  • "Mudball" (aggregation of all available metadata elements, with source, identified as being about a particular resource)
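One way to picture the flavors is as different filters over a single pool of sourced statements. The sketch below is only an illustration under stated assumptions: the Statement fields and the serve function are hypothetical, safe xforms are modeled as an added encoding scheme, and rqis_dc_gold follows the wording above literally (rqis_dc with erroneous values removed), although Use Case #1 suggests it might instead start from rqis_dc_plus. The oai_dc dumb-down is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    element: str
    value: str
    source: str                  # who made the assertion, e.g. "provider" or "iVia"
    scheme: str = ""             # encoding scheme added by a safe xform, if any
    augmentation: bool = False   # True if supplied by an augmentation service
    erroneous: bool = False      # True if quality control flagged the value

FLAVORS = {"native_oai_dc", "rqis_dc", "rqis_dc_plus", "rqis_dc_gold", "mudball"}


def serve(statements, flavor):
    """Return the statements visible in one output flavor."""
    if flavor not in FLAVORS:
        raise ValueError(f"unknown flavor: {flavor}")
    if flavor == "mudball":
        # Everything available about the resource, with its source.
        return [(s.element, s.value, s.scheme, s.source) for s in statements]
    # Only rqis_dc_plus carries augmentations in this reading.
    keep = [s for s in statements if flavor == "rqis_dc_plus" or not s.augmentation]
    if flavor == "rqis_dc_gold":
        keep = [s for s in keep if not s.erroneous]
    if flavor == "native_oai_dc":
        # As received: no schemes added by safe xforms.
        return [(s.element, s.value, "") for s in keep]
    return [(s.element, s.value, s.scheme) for s in keep]
```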

Use Case #4: focus on possible uses available to downstream users

ENC harvests "mudball" metadata records from the Repository to fulfil a number of specific requirements of their middle school portal:

  • They look for assertions of "conformsTo" relationships from a small number of sources that they consider reliable
  • They look for subject terms from controlled vocabularies on relevant resource metadata that they can use on their portal to provide topical navigation.
  • They look for annotations about middle school resources from teachers, librarians, and specific sources known by them to be reliable and appropriate for middle school audiences

MathForum re-harvests their metadata records from the Repository in the "rqis_dc_plus" flavor, looking for additional metadata added by others to provide additional value on their site. They also harvest the "mudball" records from other math collections to see if they can add some resources described by others to their site, making them available to their special services.
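Because every mudball statement carries its source, a downstream user such as ENC could select statements by element and by a list of sources it trusts. A minimal sketch, assuming each statement is a dict with "element", "value" and "source" keys (a hypothetical shape, not a defined mudball schema):

```python
def trusted_statements(mudball, element, trusted_sources):
    """Keep statements for one element whose source is in the caller's trusted set."""
    return [s for s in mudball
            if s["element"] == element and s["source"] in trusted_sources]
```

For example, trusted_statements(record, "dct:conformsTo", {"ENC"}) would return only the conformsTo assertions made by ENC.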

Use Case #5: when a resource or its metadata changes or is deleted, what happens to augmentations?

  • Deletion from specific providing collection: link moves to an equivalent resource metadata record? (or doesn't it matter, so long as there's another available Repository metadata record of that resource?)
  • Deletion of last Repository metadata record for that particular resource (perhaps it died?):
    • mark for deletion, but run occasional report to see if some can be revived?
    • point to Repository archived version of resource
  • Resource changes in ways that cannot be easily determined:
    • Augmentors notified to re-crawl or review
    • non-updated augmentations could be "sunsetted" after some passage of time?
  • how can we be sure disappearance is permanent vs. temporary?

Use Case #6: when an augmentation is changed or deleted, what happens?

  1. a metadata augmentation is changed
    • MR picks it up on regular harvest from aug service (b/c OAI datestamp of changed record is after our "from" date argument in the OAI harvest from the aug service)
    • augmentation metadata record is updated in MR
    • changes to augmentation metadata record are duly propagated through MR storage and affected XML served out of MR
  2. a metadata augmentation is deleted
    • MR harvests deletion on regular harvest from aug service (b/c OAI datestamp of deleted record is after our "from" date argument in the OAI harvest from the aug service) [Q: what if aug service doesn't do persistent OAI deletes (or transient deletes of a long period of time)?]
    • augmentation metadata record is marked deleted in MR
    • deletion of augmentation metadata record is duly propagated through MR storage and affected XML served out of MR.
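Both cases above can be handled by one pass over the harvested records. The sketch below assumes each harvested record is a dict with an "identifier", a UTC ISO 8601 "datestamp" string, an optional "status" of "deleted" (as in an OAI-PMH record header), and "metadata" when not deleted; the store layout and function name are hypothetical. In a real harvest the aug service applies the "from" filter itself, so the datestamp check here only mirrors it.

```python
def apply_harvest(store, harvested, from_date):
    """Apply changed and deleted augmentation records to the local store.

    Returns the identifiers whose served XML needs regenerating.
    """
    touched = []
    for rec in harvested:
        # Same-granularity UTC ISO 8601 strings compare correctly as text.
        if rec["datestamp"] < from_date:
            continue  # unchanged since the last harvest
        entry = store.setdefault(rec["identifier"], {"metadata": None, "deleted": False})
        if rec.get("status") == "deleted":
            entry["deleted"] = True   # mark as deleted rather than dropping the row
        else:
            entry["metadata"] = rec["metadata"]
            entry["deleted"] = False
        entry["datestamp"] = rec["datestamp"]
        touched.append(rec["identifier"])
    return touched
```

This only works for deletions the aug service actually reports; if it keeps no deleted-record information, a periodic full re-harvest and comparison would be needed instead.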

Use Case #7: a simple augmentation sequence

  1. Repository gets metadata record 1 from provider Q.
  2. Repository normalizes the metadata, creating record 1N.
  3. iVia harvests metadata record 1N from the MR's OAI server
  4. iVia uses IDs from harvested metadata to target resources for automated metadata creation, subject and classification assignment
  5. iVia exposes its metadata augmentations (not data harvested from original records) to the world as metadata record 1NiVia
  6. Repository harvests metadata record 1NiVia from iVia
  7. Repository normalizes or otherwise alters and stores the iVia aug record as record 1NiViaN
  8. Repository uses 1NiViaN as part of a rqis augmented/gold record

Use Case #8: a more complex augmentation sequence

  1. Repository gets metadata record 1 from provider Q.
  2. Repository normalizes the metadata, creating record 1N.
  3. iVia harvests metadata record 1N from the MR's OAI server
  4. iVia uses IDs from harvested metadata to target resources for automated metadata creation, subject and classification assignment
  5. iVia exposes its metadata augmentations (not statements harvested from the original record) to the world as metadata record 1NiVia
  6. Repository harvests metadata record 1NiVia from iVia
  7. Repository normalizes or otherwise alters and stores the iVia aug record as record 1NiViaN
  8. Repository uses 1NiViaN in record 1aug, an nsdl augmented/gold record.
  9. Repository search service harvests record 1N or record 1aug.
  10. Repository search service discovers that the dc:format value is wrong -- it's text, not an image.
  11. Repository search service provides a correction to the dc:format field
  12. Repository archive service harvests record 1N or record 1aug.
  13. Repository archive service discovers that the dc:format value is wrong -- it's text/xml, not an image.
  14. Repository archive service provides a correction to the dc:format field
  15. Repository harvests via OAI or otherwise gets the corrections from the search service.
  16. Repository harvests via OAI or otherwise gets the corrections from the archive service.
  17. Repository Rating Service determines what value(s) are exposed for dc:format in record 1aug, the nsdl augmented/gold record

Use cases: Native metadata harvest, crosswalking, safe and collection-specific transformations, NSDL gold metadata

Use Case 1: Metadata Provision, Evaluation and Normalization

Use Case 2: Collection-Specific Transformation

Use Case 3: Crosswalking Instance Metadata

Use Case 4: Transforming Metadata Values

Use Case 5: Machine-Generated Metadata Augmentation

A routine crawl for item metadata is initiated via the MMS after the completion of a collection record for a resource without available metadata. The iVia Service makes machine-generated metadata available for harvest by the Repository. Because a rights statement applying to all the resources on the site is available, but the iVia Service does not reflect that in the items, a collection-specific transform is initiated for the collection, and the appropriate statement is defaulted in the Rights element for the items.
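The collection-specific transform described here amounts to filling in a default where an element is missing. A minimal sketch, assuming item records are dicts of element name to a list of values and using "dc:rights" as the key for the Rights element (both are assumptions for illustration):

```python
def default_rights(items, rights_statement):
    """Return copies of the items with the rights statement added where none exists."""
    patched_items = []
    for item in items:
        patched = dict(item)            # shallow copy; the harvested record is not modified
        if not patched.get("dc:rights"):
            patched["dc:rights"] = [rights_statement]
        patched_items.append(patched)
    return patched_items
```

Items that already carry a rights value are left alone, so a later crawl that does pick up item-level rights is not overwritten by the collection default.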