Difference between revisions of "Augmented Metadata and Annotations"

From Metadata-Registry
(Outline--a harvest scenario)
(Tentative Listing of Annotation Types)
 
(23 intermediate revisions by one user not shown)
Line 1: Line 1:
==Outline--a harvest scenario==
+
==Management Use cases: Native metadata harvest, crosswalking, safe and collection-specific transformations, NSDL gold metadata==
  
#We publish two metadata formats designed to support annotations and augmented metadata
+
===Use Case 1: Metadata Provision, Evaluation and Normalization===
#*The key feature of these formats is two metadata elements, one (and only one) of which must be present, either of which may be repeated:
+
#**xxxxUniqueIdRef? -- which contains a reference to an existing metadata record in the MR and must match an existing record
+
#**dcIdentifierRef? -- which contains a reference to a URI that may or may not exist in the MR.
#*Augmented metadata must reference an existing URI in the MR. This could also be expressed as <reference type=xxxxUniqueId?> or <reference type=dcIdentifier>
+
#*These are intended to be used to supply annotations and augmented metadata for harvest via OAI and perhaps a services interface.
+
#Annotation and augmentation suppliers wishing to supply metadata about a resource identified by a URI should first query or harvest the MR to get a list of metadata records that are about that URI.
+
#They create metadata about their annotation in the above format and serve it via OAI. This record may carry the actual annotation or it can simply contain a reference. In the case of metadata augmentation, each record served should be a self-contained, incomplete metadata record and should not reference another source of metadata.
+
#We harvest the records through a standard harvest -- all incoming records will have to be associated with a collection record
+
#The ingest process creates a unique mrec record for each incoming record
+
#References in the MR must always be mrec_ids so in the case of dcIdentifierRef? the ingest process retrieves all mrecs that reference each dcIdentifierRef?.
+
#If a dcIdentifierRef? references a URI that is not found, an mrec record is created for that URI and is queued for metadata generation by iVia (controversial)
+
#An entry is created in the link table for each mrec identified either directly or by reference. This will contain the mrec_id of the annotation record, the mrec_id of the mrec being annotated, a reference type, a datestamp, and a source mrec_id
+
#*Note that the link table will need an additional 'source' field that will, in the case of annotations and augmentations, contain the mrec_id of the annotation or augmentation metadata record that supplied the link.
+
#*Note also that reference type and datestamp are denormalized values that can be determined by reference to the source mrec_id if necessary.
+
#Output of augmented metadata is the tough thing -- it needs to be served both as a component part of the metadata format being augmented and as a distinct format, both within and without the mudball.
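The link-table entries described in the steps above might be sketched roughly as follows. This is only an illustration: the <code>LinkEntry</code> class, its field names, the reference-type strings, and the <code>build_links</code> helper are all assumptions made for the example, not the actual MR schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LinkEntry:
    """One row of the link table (field names are hypothetical)."""
    annotation_mrec_id: int  # mrec_id of the annotation/augmentation record
    target_mrec_id: int      # mrec_id of the mrec being annotated
    reference_type: str      # assumed labels, e.g. "uniqueId" or "dcIdentifier"
    datestamp: datetime      # denormalized; recoverable via source_mrec_id
    source_mrec_id: int      # mrec_id of the record that supplied the link


def build_links(annotation_mrec_id, reference_type, target_mrec_ids, datestamp=None):
    """Create one link entry per target mrec identified directly or by reference."""
    stamp = datestamp or datetime.now(timezone.utc)
    return [
        # For annotations and augmentations, the source of the link is the
        # annotation record itself.
        LinkEntry(annotation_mrec_id, target, reference_type, stamp, annotation_mrec_id)
        for target in target_mrec_ids
    ]
```

For instance, a dcIdentifierRef that resolves to two existing mrecs would yield two entries, both naming the annotation record as their source.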
+
  
==Metadata Augmentation: Use cases==
+
Metadata : rqis_mudball and rqis_gold.
  
===Use Case #1: field replacement or deprecation===
+
===Use Case 2: Collection-Specific Transformation===  
  
: NSDL receives a file of item records from the Whatsis Collection.  Each record contains a defaulted value "unknown" in the Coverage element. Based on the NSDL decision to deprecate useless defaults, the element is marked as deprecated, and that assertion indicates NSDL is its source.  In addition, the dc:format value of <nowiki>"application/flash"</nowiki> is consistently misspelled. A second version of the dc:format element with the correctly spelled value is provided by NSDL and an error notification message is sent to the data provider.  MR OAI format nsdl_dc_plus will include both versions of dc:format; nsdl_dc_gold will only show the correctly spelled one.  Lastly, the DCMIType value of <nowiki>"InteractiveResource"</nowiki> is consistently misspelled by the provider in a dc:type field. A second version of the dc:type element with the correctly spelled value is provided by NSDL, and the encoding scheme of dct:DCMIType is added.  An error notification message is sent to the data provider.  nsdl_dc_plus will include both versions of dc:type;  nsdl_dc_gold will only show the correctly spelled one, with its indicated encoding scheme.
+
the MMS, creates an initial collection record and initiates a harvest of metadata. Because this is an initial harvest, the Repository Manager receives an email notification.  She updates the collection record, and quickly examines the csv files for two of the three harvested formats: qualified_dc and ieee_lom.  She notes that the crosswalk used by the data provider She calls up a form to create a collection-specific transform for that collection, to reverse the values in each element and add the appropriate encoding scheme.  Since no other serious errors appear, she invokes the safe transformation and approves the data for the MR. She sends a notification to the provider, pointing out the error, and asking him to inform her when the error is corrected so that she can pull the collection-specific transform when the data can be correctly harvested.
  
: Later, the NSDL harvests updated item records from Whatsis.  <font color="red">How do we ensure that the new information from the Whatsis updates doesn't step on the NSDL improvements?</font>
+
===Use Case 3: Crosswalking Instance Metadata===
## <coverage>unknown</coverage> is provided again.  The MR needs to keep the deprecation assertion and NOT serve this useless info to downstream users, such as the NSDL Search service.
+
## <coverage>unknown</coverage> is no longer provided.  The MR needs to remove the deprecation assertion because it no longer refers to an element.
+
## <coverage>unknown</coverage> is no longer provided BUT <coverage>Washington</coverage> is now provided.  The MR needs to remove the deprecation assertion because there is now a useful (!) value;  the MR must serve the new coverage info to downstream users.
+
## provider now gives us newly misspelled <nowiki>"apprication/flash"</nowiki>.  Because we have a separate NSDL-provided element with the correct spelling, the new (incorrect) provider element replaces the old (incorrect) provider element, and the correct NSDL attributed element is left alone.
+
## correctly spelled <nowiki>"application/flash"</nowiki> is now provided.  The MR should now drop the NSDL-sourced correct element, as it is a duplicate of the provider sourced correct element.  Or not -- the bottom line is to serve only ONE, rather than duplicate.
+
## newly misspelled <nowiki>"Inteactive Resource"</nowiki> is provided.  Because we have a separate element, with the correct spelling, that indicates encoding scheme, the newly incorrect provider element replaces the old (incorrect) provider element, and the correct one is still there.
+
## correctly spelled <nowiki>"InteractiveResource"</nowiki> is now provided.  The MR should now either add the encoding scheme to the provider's newly correct element and drop the (duplicate) NSDL-sourced correct element, or the MR should keep both, with the encoding scheme only applied to the NSDL element.  Or not -- the bottom line is to only serve ONE, rather than duplicate.
+
## provider no longer serves dc:type element.  <font color='red'>(orphaned field enhancement) Should the NSDL dc:type field be retained, or should it be discarded?</font>  If we don't retain a connection from the NSDL assertion to the original provider assertion, then our dc:type element just remains.  Diane feels retaining these element level connections  is too costly with too little return.  We don't know why their dc:type element disappeared ...
+
  
: (Diane sez the critical thing is <i>who makes the assertion. </i> For example, if the original metadata provider supplies a field with a typo, "texp/html", and NSDL corrects the typo to "text/html", the original metadata provider made the assertion.  However, if the original metadata provider says a resource is an image, when it's really (or also) text, then the NSDL correction has a new assertion in it.)
+
is identified by an approved Repository Recommender as an appropriate collection for the Repository.  The Recommender creates a collection record, and through the MMS initiates a a new metadata format specialized for science museums that the MR has not yet encountered.  The Metadata Manager receives notification that this new format has been harvested, and the schema provided allows the creation of a csv file so that the data can be reviewed. The schema also supports the creation of a crosswalk worksheet, allowing the Empress to set up a crosswalk from the richer format to qualified_dc.  When the crosswalk is completed, the data is transformed and made available through the NSDL OAI server, and the crosswalk itself is registered in the NSDL Registry, for specific reference in .
  
* <b> Use Case #N1: provider updates their metadata after it has been augmented </b>
+
===Use Case 4: Transforming Metadata Values===
  
## Shindy provides nsdl_dc to the NSDL
+
to ? as stored by the MMS. The Metadata Manager is notified of these changes? collection. One of the changes is the addition of Audience values. The Manager determines that the provider is not using the available standard vocabularies but a mix of other available vocabularies and unattributed terms. She sets up a collection-specific transform that crosswalks the non-standard vocabularies to standard vocabularies as well as a quick crosswalk from the unattributed terms to standard vocabulary.  She also sets up an rqis_gold profile for the provider, so that appropriate ratings are established for the range of terms available.
## iVia augments the Shindy items with dc:subject fields with LCC values
+
## MR harvests updated nsdl_dc from Shindy
+
*** Shindy's new nsdl_dc has no dc:subject fields
+
*** Shindy's new nsdl_dc has dc:subject fields with LCC values
+
:: Under what conditions do we trigger new iVia augmentations?  Only if primary identifier changes?
+
: (When do safe xforms happen? where are they in this sequence?)
+
  
* <b> Use Case #N2: augmentation service updates their provided augmentations </b>
+
===Use Case 5: Machine-Generated Metadata Augmentation===
  
## Shindy provides oai_dc to the NSDL
+
A routine crawl for item metadata is initiated via the MMS after the completion of a collection record for a resource without available metadata.  The iVia Service makes machine-generated metadata available for harvest by the Repository.  Because a rights statement applying to all the resources on the site is available, but the iVia Service does not reflect that in the items, a collection-specific transform is initiated for the collection, and the appropriate statement is defaulted in the Rights element for the items.
## iVia augments the Shindy items with dc:subject fields with LCC values
+
## iVia newly augments the Shindy items with new, improved dc:subject fields with LCC values 
+
*** do we set up the process to assume augmentations supersede older versions of themselves?
+
  
: (When do safe xforms happen?  where are they in this sequence?)
+
==Metadata Augmentation: Use Cases for Specific Situations==
  
* <b> Use Case #N3: auto-chosen/auto-gen item metadata is augmented by another service </b>
+
===Use Case #1: field replacement or deprecation===
  
## iVia does a crawl and provides item level metadata to the NSDL as collection wowza.
+
: The Repository receives a file of item records from the Whatsis provider.  Each record contains a defaulted value "unknown" in the Coverage element. Based on the Repository policy to deprecate useless defaults, the element is marked as deprecated, and that assertion indicates the Repository Quality Improvement Service (RQIS) is its source. In addition, the dc:format value of <nowiki>"application/flash"</nowiki> is consistently misspelled. A second version of the dc:format element with the correctly spelled value is provided by RQIS and an error notification message is sent to the data provider.  MR OAI format rqis_dc_plus will include both versions of dc:format;  rqis_dc_gold will only show the correctly spelled one.  Lastly, the DCMIType value of <nowiki>"InteractiveResource"</nowiki> is consistently misspelled by the provider in a dc:type field. A second version of the dc:type element with the correctly spelled value is provided by RQIS, and the encoding scheme of dct:DCMIType is added.  An error notification message is sent to the data provider.  rqis_dc_plus will include both versions of dc:type;  rqis_dc_gold will only show the correctly spelled one, with its indicated encoding scheme.
## ENC augments the wowza items with dct:audience fields
+
## SDSC augments the wowza items with dc:format information and information about broken links.
+
  
: (is there anything special about this case, or is it the same as N1?)
+
Later, the Repository harvests updated item records from Whatsis. RQIS quality control routines are run on the updated metadata:
: (When do safe xforms happen?  where are they in this sequence?)
+
# <coverage>unknown</coverage> is provided again.  The RQIS continues to keep the deprecation assertion and NOT serve this useless info to downstream users.
 +
# <coverage>unknown</coverage> is no longer provided.  The RQIS needs to remove the deprecation assertion because it no longer refers to an actively served statement.
 +
# <coverage>unknown</coverage> is no longer provided BUT <coverage>Washington</coverage> is now provided.  The RQIS removes the deprecation assertion because there is now a useful (!) value;  the Repository must serve the new coverage info to downstream users.
 +
# provider now serves newly misspelled <nowiki>"apprication/flash"</nowiki>.  Because we have a separate RQIS-provided element with the correct spelling, the new (incorrect) provider element replaces the old (incorrect) provider element, and the correct RQIS attributed element is left alone.
 +
# correctly spelled <nowiki>"application/flash"</nowiki> is now provided.  The Repository should now drop the RQIS-sourced correct element, as it is a duplicate of the provider sourced correct element.  Or not -- the bottom line is to serve only ONE, rather than duplicate.
 +
# newly misspelled <nowiki>"Inteactive Resource"</nowiki> is provided.  Because the Repository has a separate element, with the correct spelling, that indicates encoding scheme, the newly incorrect provider element replaces the old (incorrect) provider element, and the correct one is retained.
 +
# correctly spelled <nowiki>"InteractiveResource"</nowiki> is now provided.  The MR should now either add the encoding scheme to the provider's newly correct element and drop the (duplicate) RQIS-sourced correct element, or the MR should keep both, with the encoding scheme only applied to the RQIS element.  Or not -- the bottom line is to only serve ONE, rather than duplicate.
 +
# provider no longer serves dc:type element.  <font color='red'>(orphaned field enhancement) Should the RQIS dc:type field be retained, or should it be discarded?</font> If the Repository doesn't retain a connection from the RQIS assertion to the original provider assertion, then the RQIS dc:type element just remains (with what provenance?). [Alternatively, since this level of quality improvement is based on examination of metadata, not resources, the element is not retained.] 
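The re-harvest outcomes listed above could be sketched as two small decision functions. This is a minimal sketch under stated assumptions: the function names, the <code>USELESS_DEFAULTS</code> set, and the "provider"/"RQIS" source labels are hypothetical, and two open questions are resolved arbitrarily for the sake of the example (a duplicate RQIS element is dropped in favor of the provider's, and an orphaned RQIS element is not retained). Adding the encoding scheme to a newly correct provider element is not modeled.

```python
USELESS_DEFAULTS = {"unknown"}  # assumed list of defaults that RQIS deprecates


def reconcile_deprecation(new_values):
    """Decide what to serve for an element whose earlier value was deprecated.

    new_values: values the provider supplies for the element on re-harvest.
    Returns (values_to_serve, keep_deprecation_assertion).
    """
    useful = [v for v in new_values if v.lower() not in USELESS_DEFAULTS]
    # Keep the assertion only while a useless default is still being provided.
    still_useless = len(useful) < len(new_values)
    return useful, still_useless


def reconcile_correction(provider_value, rqis_value):
    """Decide which statement the gold output serves for a corrected element.

    Returns a list of (value, source) pairs.
    """
    if provider_value is None:
        # Orphaned enhancement: an open question in the notes; dropped here.
        return []
    if provider_value == rqis_value:
        # Provider is now correct, so serve only ONE statement.
        return [(provider_value, "provider")]
    # Provider is still (or newly) misspelled; the RQIS correction stands.
    return [(rqis_value, "RQIS")]
```

Under these assumptions, a re-harvested "apprication/flash" leaves the RQIS-sourced "application/flash" as the only value in the gold output, while a re-harvested "Washington" coverage value is served and the deprecation assertion is removed.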
  
===Use Case #2: multiple equivalent resources and their relationship to augmentations on output===
+
The critical thing is <i>who makes the assertion. </i> For example, if the original metadata provider supplies a field with a typo, "texp/html", and RQIS corrects the typo to "text/html", the original metadata provider made the assertion.  However, if the original metadata provider says a resource is an image, when it's really (or also) text, then the RQIS correction has a new assertion in it.
  
: ENC provides NSDL with metadata augmentations asserting that specified items in a number of collections relate to the Illinois third grade science standard for basic understanding of photosynthesis. ENC provides NSDL with metadata records identifying an NSDL metadata record ID, a URL (providing an internal check as well as an additional identifier for the resource) and the DC refinement <nowiki>"conformsTo"</nowiki> specifying the particular standard to which the resource is related.  This element contains the source ENC and is identified as human created data.  The NSDL Simple Equivalency Service (based on resource URLs) identifies three other items in other collections that this relationship assertion applies to, and the appropriate links are made, and the resource metadata records (aggregated version only) updated in the NSDL OAI server.
+
====Use Case #N1: provider updates their metadata after it has been augmented====
* <b> Use Case #3: Multiple providers of metadata and augmentation </b> -- original metadata provider, NSDL (as augmenter), 3rd party augmenter, nsdl served out in various flavors
+
  
: The Whomever Collection supplies NSDL with 2233 item records described with oai_dc metadata.  Based on routine normalization procedures, NSDL adds several encoding schemes to the records: "URI" to the identifier element (all values begin with "http") and "DCMIType" to most of the Type values which are valid DCMIType terms. In each of these cases, the source of the data continues to be identified as the original data provider.  Several weeks later, the iVia staff harvest the metadata for the collection, and feed back to NSDL LCC classification and LCSH subject headings for the collection. This information is identified as originating with iVia and also as machine generated data.
+
# ThatsUs provides rqis_dc to the Repository
 +
# iVia augments the ThatsUs items with dc:subject fields with LCC values
 +
# MR harvests updated rqis_dc from ThatsUs
 +
#* ThatsUs' new rqis_dc has no dc:subject fields
 +
#* ThatsUs' new rqis_dc has dc:subject fields with LCC values
 +
# Q: Under what conditions do we trigger new iVia augmentations? Only if primary identifier changes?
 +
# Q: (When do safe xforms happen?  where are they in this sequence?)
  
: The metadata is served out in a number of flavors:
+
====Use Case #N2: augmentation service updates their provided augmentations====
  
** native_oai_dc: metadata exactly how it came to us
+
# ThatsUs provides oai_dc to the Repository
** nsdl_dc: native_oai_dc plus safe xforms (was: as received, though normalized for errors and with added valid schemes)
+
# iVia augments the ThatsUs items with dc:subject fields with LCC values
** nsdl_dc_plus: nsdl_dc plus augmentations (each safe xformed native record with any augmentations that apply, based on equivalence relationships)
+
# iVia newly augments the ThatsUs items with new, improved dc:subject fields with LCC values   
** nsdl_dc_gold: nsdl_dc, with erroneous values removed. Different from "nsdl_dc_plus" because fields may be removed.
+
#* do we set up the process to assume augmentations supersede older versions of themselves?
** oai_dc: the NSDL's "dumbing down" of one of the above nsdl_dc formats so we are compliant with OAI-PMH 2.0 (we must serve oai_dc)
+
# Q: (When do safe xforms happen?  where are they in this sequence?)
** "Mudball" (aggregation of all available metadata elements, with source, identified as being about a particular resource)
+
*** Naomi asks: does this differ from nsdl_dc_plus?  If so, how?
+
** nsdl_all: each unrestricted metadata format, separately, as a big tarball.
+
*** Includes nsdl_dc, nsdl_dc_plus, nsdl_dc_gold, native, augRec11 from iVia, augRec28 from SDSC ...
+
** nsdl_search: all the metadata formats as a big tarball. 
+
*** Perhaps only nsdl_dc_gold, not ALL nsdl_dc formats?
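The difference between the "plus" and "gold" flavors listed above might be sketched as follows. This is a hedged illustration only: the <code>serve_flavors</code> function and the statement layout (dicts with "element", "value", "source", and "deprecated" keys) are assumptions for the example, and equivalence relationships between records are not modeled.

```python
def serve_flavors(statements):
    """Derive two output flavors from the statements held for one record.

    statements: list of dicts with keys "element", "value", "source", and
    "deprecated" (True when the value was judged erroneous or useless).
    """
    # "plus": every statement, from the provider and from augmenters alike.
    plus = [(s["element"], s["value"], s["source"]) for s in statements]
    # "gold": the same statements with erroneous or deprecated ones removed,
    # which is why fields may disappear relative to "plus".
    gold = [
        (s["element"], s["value"], s["source"])
        for s in statements
        if not s["deprecated"]
    ]
    return {"nsdl_dc_plus": plus, "nsdl_dc_gold": gold}
```

With a misspelled provider dc:format marked deprecated and a corrected NSDL-sourced dc:format alongside it, the "plus" output would carry both statements and the "gold" output only the corrected one.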
+
  
 +
====Use Case #N3: auto-chosen/auto-gen item metadata is augmented by another service====
  
===Use Case #4: focus on possible uses available to downstream users===
+
# iVia does a crawl and provides item level metadata to the Repository as collection wowza.
 +
# ENC augments the wowza items with dct:audience fields
 +
# SDSC augments the wowza items with dc:format information and information about broken links.
 +
# Q: (is there anything special about this case, or is it the same as N1?)
 +
# Q: (When do safe xforms happen?  where are they in this sequence?)
  
: ENC harvests "mudball" metadata records from NSDL to fulfil a number of specific requirements of their middle school portal:
+
===Use Case #2: multiple equivalent resources and their relationship to augmentations on output===
** They look for assertions of <nowiki>"conformsTo"</nowiki> relationships from a small number of sources that they consider reliable
+
** They look for subject terms from controlled vocabularies on relevant resource metadata that they can use on their portal to provide topical navigation.
+
** They look for annotations about middle school resources from teachers, librarians, and specific sources known by them to be reliable and appropriate for middle school audiences
+
  
: <nowiki>MathForum</nowiki> re-harvests their metadata records from NSDL in the "nsdl_dc_plus" flavor, looking for additional metadata added by others to provide additional value on their site.  They also harvest the "mudball" records from other math collections to see if they can add some resources described by others to their site, making them available to their special services.
+
ENC provides the Repository with metadata augmentations asserting that specified items in a number of collections relate to the Illinois third grade science standard for basic understanding of photosynthesis. ENC provides the Repository with metadata records identifying a Repository metadata record ID, a URL (providing an internal check as well as an additional identifier for the resource) and the DC refinement <nowiki>"conformsTo"</nowiki> specifying the particular standard to which the resource is related.  This element contains the source ENC and is identified as human created data.  The Repository Simple Equivalency Service (based on resource URLs) identifies three other items in other collections that this relationship assertion applies to, and the appropriate links are made, and the resource metadata records (aggregated version only) updated in the Repository OAI server.
  
===Use Case #5: when a resource or its metadata changes or is deleted, what happens to augmentations?===
+
===Use Case #3: Multiple providers of metadata and augmentation -- original metadata provider, RQIS (as augmenter), 3rd party augmenter, metadata served out in various flavors===
  
** Deletion from specific providing collection: link moves to an equivalent resource metadata record? (or doesn't it matter, so long as there's another available NSDL metadata record of that resource?)
+
The Whomever Collection supplies NSDL with 2233 item records described with oai_dc metadata.  Based on routine normalization procedures, NSDL adds several encoding schemes to the records: "URI" to the identifier element (all values begin with "http") and "DCMIType" to most of the Type values which are valid DCMIType terms. In each of these cases, the source of the data continues to be identified as the original data provider.  Several weeks later, the iVia staff harvest the metadata for the collection, and feed back to NSDL LCC classification and LCSH subject headings for the collection.  This information is identified as originating with iVia and also as machine generated data.  
** Deletion of last NSDL metadata record for that particular resource (perhaps it died?):
+
*** mark for deletion, but run occasional report to see if some can be revived?
+
*** point to NSDL archived version of resource
+
** Resource changes in ways that cannot be easily determined:  
+
*** Augmentors notified to re-crawl or review,
+
*** non-updated augmentations could be "sunsetted" after some passage of time?
+
** how can we be sure disappearance is permanent vs. temporary?
+
  
===Use Case #6: when an augmentation is changed or deleted, what happens?===
+
The metadata is served out in a number of flavors:
  
** a metadata augmentation is changed
+
* native_oai_dc: metadata exactly how it came to us
*** MR picks it up on regular harvest from aug service (b/c OAI datestamp of changed record is after our "from" date argument in the OAI harvest from the aug service)
+
* rqis_dc: native_oai_dc plus safe xforms (was: as received, though normalized for errors and with added valid schemes)
*** augmentation metadata record is updated in MR
+
* rqis_dc_plus: rqis_dc plus augmentations (each safe xformed native record with any augmentations that apply, based on equivalence relationships)
*** changes to augmentation metadata record are duly propagated through MR storage and affected XML served out of MR
+
* rqis_dc_gold: rqis_dc, with erroneous values removed.  Different from "rqis_dc_plus" because fields may be removed.
** a metadata augmentation is deleted
+
* oai_dc: the RQIS's "dumbing down" of one of the above rqis_dc formats so we are compliant with OAI-PMH 2.0 (we must serve oai_dc)
*** MR harvests deletion on regular harvest from aug service (b/c OAI datestamp of deleted record is after our "from" date argument in the OAI harvest from the aug service)
+
* "Mudball" (aggregation of all available metadata elements, with source, identified as being about a particular resource)
**** what if aug service doesn't do persistent OAI deletes?  (or transient deletes of a long period of time?)
+
*** augmentation metadata record is marked deleted in MR
+
*** deletion of augmentation metadata record is duly propagated through MR storage and affected XML served out of MR.
+
  
===Use Case #7: a simple augmentation sequence===
+
===Use Case #4: focus on possible uses available to downstream users===
:1. we get metadata record 1 from provider Q.
+
:2. we normalize the metadata, creating record 1N. 
+
::* it's possible that record 1N is also the nsdl_aug/gold at this point ...
+
:3. iVia harvests metadata record 1N from the MR's OAI server
+
:4. iVia does its automagic thang
+
:5. iVia exposes its metadata augmentations to the world as metadata record <nowiki>1NiVia</nowiki>
+
::* is this only the newly provided and/or altered fields, or is it all the old fields plus all the new fields, or what?
+
:6. we harvest metadata record <nowiki>1NiVia</nowiki> from iVia
+
:7. we normalize or otherwise alter and store the iVia aug record as record <nowiki>1NiViaN</nowiki>
+
::* do we store it locally if we don't change anything?
+
:8. we use <nowiki>1NiViaN</nowiki> in a nsdl augmented/gold/demented record.
+
  
===Use Case #8: a more complex augmentation sequence===
+
ENC harvests "mudball" metadata records from the Repository to fulfil a number of specific requirements of their middle school portal:
:1. we get metadata record 1 from provider Q.
+
* They look for assertions of <nowiki>"conformsTo"</nowiki> relationships from a small number of sources that they consider reliable
:2. we normalize the metadata, creating record 1N. 
+
* They look for subject terms from controlled vocabularies on relevant resource metadata that they can use on their portal to provide topical navigation.  
::* it's possible that record 1N is also the nsdl_aug/gold at this point ...
+
* They look for annotations about middle school resources from teachers, librarians, and specific sources known by them to be reliable and appropriate for middle school audiences
:3. iVia harvests metadata record 1N from the MR's OAI server
+
:4. iVia does its automagic thang
+
:5. iVia exposes its metadata augmentations to the world as metadata record <nowiki>1NiVia</nowiki>
+
::* is this only the newly provided and/or altered fields, or is it all the old fields plus all the new fields, or what?
+
:6. we harvest metadata record <nowiki>1NiVia</nowiki> from iVia
+
:7. we normalize or otherwise alter and store the iVia aug record as record <nowiki>1NiViaN</nowiki>
+
::* do we store it locally if we don't change anything?
+
:8. we use <nowiki>1NiViaN</nowiki> in record 1aug, a nsdl augmented/gold/demented record.
+
:9. search service harvests record 1N or record 1aug.
+
:10. search service discovers that the dc:format value is wrong -- it's text, not an image.
+
:11. search service provides a correction to the dc:format field
+
::* via same OAI mechanisms used by md augmentation services?
+
:12. archive service harvests record 1N or record 1aug.
+
:13. archive service discovers that the dc:format value is wrong -- it's text/xml, not an image.
+
:14. archive service provides a correction to the dc:format field
+
::* via same OAI mechanisms used by md augmentation services?
+
:15. we harvest via OAI or otherwise get the corrections from the search service.
+
:16. we harvest via OAI or otherwise get the corrections from the archive service.
+
:17. what value do we use for dc:format in record 1aug, the nsdl augmented/gold/demented record?
+
  
dih and nrd 4/27/04
+
<nowiki>MathForum</nowiki> re-harvests their metadata records from the Repository in the "rqis_dc_plus" flavor, looking for additional metadata added by others to provide additional value on their site.  They also harvest the "mudball" records from other math collections to see if they can add some resources described by others to their site, making them available to their special services.
  
==Use cases: Native metadata harvest, crosswalking, safe and collection-specific transformations, NSDL gold metadata==
+
===Use Case #5: when a resource or its metadata changes or is deleted, what happens to augmentations?===
  
===Use Case 1.===
+
* Deletion from specific providing collection: link moves to an equivalent resource metadata record? (or doesn't it matter, so long as there's another available Repository metadata record of that resource?)
 +
* Deletion of last Repository metadata record for that particular resource (perhaps it died?):
 +
** mark for deletion, but run occasional report to see if some can be revived?
 +
** point to Repository archived version of resource
 +
* Resource changes in ways that cannot be easily determined:
 +
** Augmentors notified to re-crawl or review,
 +
** non-updated augmentations could be "sunsetted" after some passage of time?
 +
* how can we be sure disappearance is permanent vs. temporary?
  
is , ?
+
===Use Case #6: when an augmentation is changed or deleted, what happens?===
  
===Use Case 2.===
+
#a metadata augmentation is changed
 +
#* MR picks it up on regular harvest from aug service (b/c OAI datestamp of changed record is after our "from" date argument in the OAI harvest from the aug service)
 +
#* augmentation metadata record is updated in MR
 +
#* changes to augmentation metadata record are duly propagated through MR storage and affected XML served out of MR
 +
#a metadata augmentation is deleted
 +
#* MR harvests deletion on regular harvest from aug service (b/c OAI datestamp of deleted record is after our "from" date argument in the OAI harvest from the aug service) [Q: what if aug service doesn't do persistent OAI deletes?  (or transient deletes of a long period of time?)]
 +
#* augmentation metadata record is marked deleted in MR
 +
#* deletion of augmentation metadata record is duly propagated through MR storage and affected XML served out of MR.
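The change and deletion handling above could be sketched roughly as below. This is a minimal sketch, not the MR ingest code: the <code>apply_harvest</code> function, the in-memory <code>store</code> dict, and the record layout are all hypothetical, and it assumes the augmentation service reports deletions at all, which is exactly the open question noted above.

```python
from datetime import datetime


def apply_harvest(store, harvested, from_date):
    """Apply records harvested from an augmentation service to a local store.

    store: dict mapping record id -> {"body": ..., "deleted": bool}
    harvested: iterable of dicts with "id", "datestamp", "deleted", "body"
    from_date: records stamped earlier than this datetime are ignored
    Returns the ids whose served XML would need to be regenerated.
    """
    touched = []
    for rec in harvested:
        if rec["datestamp"] < from_date:
            continue  # older than the harvest window, so nothing to apply
        if rec["deleted"]:
            if rec["id"] in store:
                store[rec["id"]]["deleted"] = True  # mark deleted, keep the row
                touched.append(rec["id"])
        else:
            store[rec["id"]] = {"body": rec["body"], "deleted": False}
            touched.append(rec["id"])
    return touched


# Example: one augmentation record reported as deleted since the last harvest.
store = {"aug1": {"body": "<old/>", "deleted": False}}
changed = apply_harvest(
    store,
    [{"id": "aug1", "datestamp": datetime(2004, 5, 2), "deleted": True, "body": None}],
    from_date=datetime(2004, 5, 1),
)
```

The returned ids stand in for the "propagated through MR storage and affected XML" step: whatever regenerates the served formats would work from that list.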
  
and initiates a harvest of metadata. Because this is an initial harvest, the long-suffering metadata specialist receives an email notification.  She updates the collection record, and quickly examines the csv files for two of the three harvested formats: nsdl_dc and ieee_lom.  She notes that the crosswalk used by the data provider to for that collection, to reverse the values in each element and add the appropriate encoding scheme.  Since no other serious errors appear, she invokes the safe transformation and approves the data for the MR. She sends a notification to the provider, pointing out the error, and asking him to inform her when the error is corrected so that she can pull the collection-specific transform when the data can be correctly harvested.
+
==Scenarios and Sequences==
 +
===A simple augmentation sequence===
 +
# Repository gets metadata record 1 from provider Q.
 +
# Repository normalizes the metadata, creating record 1N.   
 +
# iVia harvests metadata record 1N from the MR's OAI server
 +
# iVia uses IDs from harvested metadata to target resources for automated metadata creation, subject and classification assignment
 +
# iVia exposes its metadata augmentations (not data harvested from original records) to the world as metadata record <nowiki>1NiVia</nowiki>
 +
# Repository harvests metadata record <nowiki>1NiVia</nowiki> from iVia
 +
# Repository normalizes or otherwise alters and stores the iVia aug record as record <nowiki>1NiViaN</nowiki>
 +
# Repository uses <nowiki>1NiViaN</nowiki> as part of a rqis augmented/gold record
  
===Use Case 3.===
+
===A more complex augmentation sequence===
 +
# Repository gets metadata record 1 from provider Q.
 +
# Repository normalizes the metadata, creating record 1N. 
 +
# iVia harvests metadata record 1N from the MR's OAI server
 +
# iVia uses IDs from harvested metadata to target resources for automated metadata creation, subject and classification assignment
 +
# iVia exposes its metadata augmentations (not statements harvested from the original record) to the world as metadata record <nowiki>1NiVia</nowiki>
 +
# Repository harvests metadata record <nowiki>1NiVia</nowiki> from iVia
 +
# Repository normalizes or otherwise alters and stores the iVia aug record as record <nowiki>1NiViaN</nowiki>
 +
# Repository uses <nowiki>1NiViaN</nowiki> in record 1aug, a nsdl augmented/gold record.
 +
# Repository search service harvests record 1N or record 1aug.
 +
# Repository search service discovers that the dc:format value is wrong -- it's text, not an image.
 +
# Repository search service provides a correction to the dc:format field
 +
# Repository archive service harvests record 1N or record 1aug.
 +
# Repository archive service discovers that the dc:format value is wrong -- it's text/xml, not an image.
 +
# Repository archive service provides a correction to the dc:format field
 +
# Repository harvests via OAI or otherwise gets the corrections from the search service.
 +
# Repository harvests via OAI or otherwise gets the corrections from the archive service.
 +
# Repository Rating Service determines what value(s) exposed for dc:format in record 1aug, the nsdl augmented/gold record
 +
 
 +
===An Annotation and Augmentation Harvest Scenario===
 +
 
 +
# The Repository publishes two metadata formats designed to support annotations and augmented metadata
 +
#*The key feature of these formats is two metadata elements, one (and only one) of which must be present, either of which may be repeated:
 +
#**xxxxUniqueIdRef? -- which contains a reference to an existing metadata record in the MR and must match an existing record
 +
#**dcIdentifierRef? -- which contains a reference to a URI that may or may not exist in the MR. #*Augmented metadata must reference an existing URI in the MR. This could also be expressed as <reference type=xxxxUniqueId?> or <reference type=dcIdentifier>
 +
#*These are intended to be used to supply annotations and augmented metadata for harvest via OAI and perhaps a services interface.
 +
#Annotation and augmentation suppliers wishing to supply metadata about a resource identified by a URI should first query or harvest the MR to get a list of metadata records that are about that URI.
 +
#They create metadata about their annotation in the above format and serve it via OAI. This record may carry the actual annotation or it can simply contain a reference. In the case of metadata augmentation, each record served should be a self-contained, incomplete metadata record and should not reference another source of metadata.
 +
#We harvest the records through a standard harvest -- all incoming records will have to be associated with a collection record
 +
#The ingest process creates a unique mrec record for each incoming record
 +
#References in the MR must always be mrec_ids so in the case of dcIdentifierRef? the ingest process retrieves all mrecs that reference each dcIdentifierRef?.
 +
#If a dcIdentifierRef? references a URI that is not found, an mrec record is created for that URI and is queued for metadata generation by iVia (controversial)
 +
#An entry is created in the link table for each mrec identifed either directly or by reference. This will contain the mrec_id of the annotation record, the mrec_id of the mrec being annotated, a reference type, a datestamp, and a source mrec_id
 +
#*Note that the link table will need an additional 'source' field that will, in the case ofannotations and augmentations, contain the mrec_id of the annotation or augmenation metadata record that supplied the link.
 +
#*Note also that reference type and datestamp are denormalized values that can be determined by reference to the source mrec_id if necessary.
 +
#Output of augmented metadata is the tough thing -- it needs to be served both as a component part of the metadata format being augmented and as a distinct format, both within and without the mudball.
  
from the richer format to nsdl_dc.  When the crosswalk is completed, the data is transformed and made available through the NSDL OAI server, and the crosswalk itself is posted to the NSDL
+
==Annotation Documents from NSDL Annotation Services Planning==
  
===Use Case 4.===
+
===Tentative Listing of Annotation Types===
  
a by the CRS.  The Metadata Empress is notified of these changes, and takes a look at the csv files to see how these changes would affect access to the The ME determines that the provider is not using the available NSDL vocabularies but a mix of other available vocabularies and unattributed terms. She sets up a collection-specific transform that crosswalks the non-NSDL standard vocabularies to NSDL vocabularies as well as a quick crosswalk from the unattributed terms to NSDL vocabularyShe also sets up an nsdl_gold profile for the provider, so that appropriate ratings are established for the range of terms available.
+
This list is a collection of categorizations already used by annotaion services, and will provide the basis for a controlled vocabulary to be used for annotatoin records within the NSDLPlease contact Dian Hillmann or the workspace mailing list to add assitional possible types to this list.   
  
===Use Case 5.===
+
====Type termes: definition, if available(Source of term)====
  
A routine crawl for item metadata is initiated via the CRS after the completion of a collection record for a resource without available metadata.  The iVia Service makes machine-generated metadata available for harvest by the NSDL.  Because a rights statement applying to all the resources on the site is available, but the iVia Service does not reflect that in the items, a collection-specific transform is initiated for the collection, and the appropriate statement is defaulted in the Rights element for the items.
+
* Advice: A subcass of annotation representinng adice to the leader (Annotea)
 +
* Annotation: a super class describing the common features of annotations (Annotea)
 +
* Average scores of aggregated indices(DLESE)
 +
* Change: Annotions that document or porpose a change to the sourse document (Annotea)
 +
* Comment: A subclass of annotation describing annotation that are comments (Annotea)
 +
* Editor's summary (DLESE)                                                                                                                                   
 +
* Example: A subclass of Annotation representing examples (Annotea)

Latest revision as of 15:26, 18 April 2006

Contents

Management Use cases: Native metadata harvest, crosswalking, safe and collection-specific transformations, NSDL gold metadata

Use Case 1: Metadata Provision, Evaluation and Normalization

Use Case 2: Collection-Specific Transformation

Use Case 3: Crosswalking Instance Metadata

Use Case 4: Transforming Metadata Values

Use Case 5: Machine-Generated Metadata Augmentation

A routine crawl for item metadata is initiated via the MMS after the completion of a collection record for a resource without available metadata. The iVia Service makes machine-generated metadata available for harvest by the Repository. Because a rights statement applying to all the resources on the site is available, but the iVia Service does not reflect that in the items, a collection-specific transform is initiated for the collection, and the appropriate statement is defaulted in the Rights element for the items.

Metadata Augmentation: Use Cases for Specific Situations

Use Case #1: field replacement or deprecation

The Repository receives a file of item records from the Whatsis provider. Each record contains a defaulted value "unknown" in the Coverage element. Based on the Repository policy to deprecate useless defaults, the element is marked as deprecated, and that assertion indicates the Repository Quality Improvement Service (RQIS) is its source. In addition, the dc:format value of "application/flash" is consistently misspelled. A second version of the dc:format element with the correctly spelled value is provided by RQIS and an error notification message is sent to the data provider. MR OAI format rqis_dc_plus will include both versions of dc:format; rqis_dc_gold will only show the correctly spelled one. Lastly, the DCMIType value of "InteractiveResource" is consistently misspelled by the provider in a dc:type field. A second version of the dc:type element with the correctly spelled value is provided by RQIS, and the encoding scheme of dct:DCMIType is added. An error notification message is sent to the data provider. rqis_dc_plus will include both versions of dc:type; rqis_dc_gold will only show the correctly spelled one, with its indicated encoding scheme.

Later, the Repository harvests updated item records from Whatsis. RQIS quality control routines are run on the updated metadata:

  1. <coverage>unknown</coverage> is provided again. The RQIS continues to keep the deprecation assertion and NOT serve this useless info to downstream users.
  2. <coverage>unknown</coverage> is no longer provided. The RQIS needs to remove the deprecation assertion because it no longer refers to an actively served statement.
  3. <coverage>unknown</coverage> is no longer provided BUT <coverage>Washington</coverage> is now provided. The RQIS removes the deprecation assertion because there is now a useful (!) value; the Repository must serve the new coverage info to downstream users.
  4. provider now serves newly misspelled "apprication/flash". Because we have a separate RQIS-provided element with the correct spelling, the new (incorrect) provider element replaces the old (incorrect) provider element, and the correct RQIS attributed element is left alone.
  5. correctly spelled "application/flash" is now provided. The Repository should now drop the RQIS-sourced correct element, as it is a duplicate of the provider sourced correct element. Or not -- the bottom line is to serve only ONE, rather than duplicate.
  6. newly misspelled "Inteactive Resource" is provided. Because the Repository has a separate element, with the correct spelling, that indicates encoding scheme, the newly incorrect provider element replaces the old (incorrect) provider element, and the correct one is retained.
  7. correctly spelled "InteractiveResource" is now provided. The MR should now either add the encoding scheme to the provider's newly correct element and drop the (duplicate) RQIS-sourced correct element, or the MR should keep both, with the encoding scheme only applied to the RQIS element. Or not -- the bottom line is to only serve ONE, rather than duplicate.
  8. provider no longer serves dc:type element. (orphaned field enhancement) Should the RQIS dc:type field be retained, or should it be discarded? If the Repository doesn't retain a connection from the RQIS assertion to the original provider assertion, then the RQIS dc:type element just remains (with what provenance?). [Alternatively, since this level of quality improvement is based on examination of metadata, not resources, the element is not retained.]

The critical thing is who makes the assertion. For example, if the original metadata provider supplies a field with a typo, "texp/html", and RQIS corrects the typo to "text/html", the original metadata provider made the assertion. However, if the original metadata provider says a resource is an image, when it's really (or also) text, then the RQIS correction has a new assertion in it.
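
The update scenarios above can be read as a small set of decision rules. The following is a minimal sketch, not the RQIS implementation: the function names, the set of "useless" defaults, and the status labels are all assumptions made for illustration.

```python
from typing import Optional

# Assumption: the set of provider defaults that RQIS treats as useless.
USELESS_DEFAULTS = {"unknown"}


def deprecation_applies(provider_value: Optional[str]) -> bool:
    """Scenarios 1-3: keep the deprecation assertion only while the
    provider still serves a useless default for the element."""
    if provider_value is None:
        return False  # nothing is served, so there is nothing to deprecate
    return provider_value.strip().lower() in USELESS_DEFAULTS


def correction_status(provider_value: Optional[str], rqis_value: str) -> str:
    """Scenarios 4-8: decide what happens to an RQIS-sourced correction
    after the provider's element has been re-harvested."""
    if provider_value is None:
        return "orphaned"   # scenario 8: retention is an open question
    if provider_value == rqis_value:
        return "duplicate"  # scenarios 5 and 7: serve only ONE copy
    return "keep"           # scenarios 4 and 6: provider is still wrong
```

For example, `correction_status("apprication/flash", "application/flash")` returns `"keep"`, while a provider that starts serving the correct spelling yields `"duplicate"`, leaving the choice of which copy to drop to Repository policy.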

Use Case #N1: provider updates their metadata after it has been augmented

  1. ThatsUs provides rqis_dc to the Repository
  2. iVia augments the ThatsUs items with dc:subject fields with LCC values
  3. MR harvests updated rqis_dc from ThatsUs; either:
    • ThatsUs' new rqis_dc has no dc:subject fields, or
    • ThatsUs' new rqis_dc has dc:subject fields with LCC values
  4. Q: Under what conditions do we trigger new iVia augmentations? Only if primary identifier changes?
  5. Q: (When do safe xforms happen? where are they in this sequence?)

Use Case #N2: augmentation service updates their provided augmentations

  1. ThatsUs provides oai_dc to the Repository
  2. iVia augments the ThatsUs items with dc:subject fields with LCC values
  3. iVia newly augments the ThatsUs items with new, improved dc:subject fields with LCC values
    • do we set up the process to assume augmentations supersede older versions of themselves?
  4. Q: (When do safe xforms happen? where are they in this sequence?)

Use Case #N3: auto-chosen/auto-gen item metadata is augmented by another service

  1. iVia does a crawl and provides item level metadata to the Repository as collection wowza.
  2. ENC augments the wowza items with dct:audience fields
  3. SDSC augments the wowza items with dc:format information and information about broken links.
  4. Q: (is there anything special about this case, or is it the same as N1?)
  5. Q: (When do safe xforms happen? where are they in this sequence?)

Use Case #2: multiple equivalent resources and their relationship to augmentations on output

ENC provides the Repository with metadata augmentations asserting that specified items in a number of collections relate to the Illinois third grade science standard for basic understanding of photosynthesis. ENC provides the Repository with metadata records identifying a Repository metadata record ID, a URL (providing an internal check as well as an additional identifier for the resource) and the DC refinement "conformsTo" specifying the particular standard to which the resource is related. This element contains the source ENC and is identified as human created data. The Repository Simple Equivalency Service (based on resource URLs) identifies three other items in other collections that this relationship assertion applies to, and the appropriate links are made, and the resource metadata records (aggregated version only) updated in the Repository OAI server.
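
A URL-based equivalence check of this kind might be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes equivalence means an exact match on the resource URL string, with no URL normalization.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_url(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group metadata record IDs by the resource URL they describe.

    `records` is an iterable of (record_id, resource_url) pairs.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for record_id, url in records:
        groups[url].append(record_id)
    return groups


def equivalent_records(records: List[Tuple[str, str]], record_id: str) -> List[str]:
    """Return the other record IDs that describe the same URL as `record_id`."""
    url_of = dict(records)
    url = url_of[record_id]
    return [rid for rid in group_by_url(records)[url] if rid != record_id]
```

An augmentation that names one record ID can then be linked to every ID returned by `equivalent_records`, which is how the assertion would reach items in other collections.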

Use Case #3: Multiple providers of metadata and augmentation -- original metadata provider, RQIS (as augmenter), 3rd party augmenter, metadata served out in various flavors

The Whomever Collection supplies NSDL with 2233 item records described with oai_dc metadata. Based on routine normalization procedures, NSDL adds several encoding schemes to the records: "URI" to the identifier element (all values begin with "http") and "DCMIType" to most of the Type values which are valid DCMIType terms. In each of these cases, the source of the data continues to be identified as the original data provider. Several weeks later, the iVia staff harvest the metadata for the collection, and feed back to NSDL LCC classification and LCSH subject headings for the collection. This information is identified as originating with iVia and also as machine generated data.

The metadata is served out in a number of flavors:

  • native_oai_dc: metadata exactly how it came to us
  • rqis_dc: native_oai_dc plus safe xforms (was: as received, though normalized for errors and with added valid schemes)
  • rqis_dc_plus: rqis_dc plus augmentations (each safe xformed native record with any augmentations that apply, based on equivalence relationships)
  • rqis_dc_gold: rqis_dc, with erroneous values removed. Different from "rqis_dc_plus" because fields may be removed.
  • oai_dc: the RQIS's "dumbing down" of one of the above rqis_dc formats so we are compliant with OAI-PMH 2.0 (we must serve oai_dc)
  • "Mudball" (aggregation of all available metadata elements, with source, identified as being about a particular resource)
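
The difference between rqis_dc_plus and rqis_dc_gold can be sketched as two filters over the same set of sourced statements. This is a hedged illustration only: the `Statement` class, its `erroneous` flag, and the sample values are assumptions, not the Repository's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Statement:
    element: str             # e.g. "dc:format"
    value: str
    source: str              # e.g. "provider", "RQIS", "iVia"
    erroneous: bool = False  # assumption: set by quality-control routines


def rqis_dc_plus(statements: List[Statement]) -> List[Statement]:
    """Everything: safe-transformed native statements plus augmentations."""
    return list(statements)


def rqis_dc_gold(statements: List[Statement]) -> List[Statement]:
    """Same input, but statements flagged as erroneous are removed."""
    return [s for s in statements if not s.erroneous]


# Made-up sample data for illustration.
sample = [
    Statement("dc:format", "apprication/flash", "provider", erroneous=True),
    Statement("dc:format", "application/flash", "RQIS"),
    Statement("dc:subject", "QK882", "iVia"),
]
```

With this sample, `rqis_dc_plus(sample)` keeps both versions of dc:format, while `rqis_dc_gold(sample)` serves only the correctly spelled one.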

Use Case #4: focus on possible uses available to downstream users

ENC harvests "mudball" metadata records from the Repository to fulfil a number of specific requirements of their middle school portal:

  • They look for assertions of "conformsTo" relationships from a small number of sources that they consider reliable
  • They look for subject terms from controlled vocabularies on relevant resource metadata that they can use on their portal to provide topical navigation.
  • They look for annotations about middle school resources from teachers, librarians, and specific sources known by them to be reliable and appropriate for middle school audiences

MathForum re-harvests their metadata records from the Repository in the "rqis_dc_plus" flavor, looking for additional metadata added by others to provide additional value on their site. They also harvest the "mudball" records from other math collections to see if they can add some resources described by others to their site, making them available to their special services.

Use Case #5: when a resource or its metadata changes or is deleted, what happens to augmentations?

  • Deletion from specific providing collection: link moves to an equivalent resource metadata record? (or doesn't it matter, so long as there's another available Repository metadata record of that resource?)
  • Deletion of last Repository metadata record for that particular resource (perhaps it died?):
    • mark for deletion, but run occasional report to see if some can be revived?
    • point to Repository archived version of resource
  • Resource changes in ways that cannot be easily determined:
    • Augmentors notified to re-crawl or review,
    • non-updated augmentations could be "sunsetted" after some passage of time?
  • how can we be sure disappearance is permanent vs. temporary?

Use Case #6: when an augmentation is changed or deleted, what happens?

  1. a metadata augmentation is changed
    • MR picks it up on regular harvest from aug service (b/c OAI datestamp of changed record is after our "from" date argument in the OAI harvest from the aug service)
    • augmentation metadata record is updated in MR
    • changes to augmentation metadata record are duly propagated through MR storage and affected XML served out of MR
  2. a metadata augmentation is deleted
    • MR harvests deletion on regular harvest from aug service (b/c OAI datestamp of deleted record is after our "from" date argument in the OAI harvest from the aug service) [Q: what if aug service doesn't do persistent OAI deletes (or only transient deletes over a long period of time)?]
    • augmentation metadata record is marked deleted in MR
    • deletion of augmentation metadata record is duly propagated through MR storage and affected XML served out of MR.
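
The incremental harvest described above relies on two standard OAI-PMH features: the `from` argument of `ListRecords` and the `status="deleted"` attribute on record headers. The sketch below shows both using only the Python standard library; the base URL, identifiers, and function names are placeholders invented for the example, and no network request is made.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str, metadata_prefix: str, from_date: str) -> str:
    """Build a ListRecords request limited to records changed since from_date."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,
    })
    return f"{base_url}?{query}"


def split_changes(response_xml: str) -> Tuple[List[str], List[str]]:
    """Return (updated_ids, deleted_ids) found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    updated: List[str] = []
    deleted: List[str] = []
    for header in root.iter(f"{OAI_NS}header"):
        identifier = header.findtext(f"{OAI_NS}identifier")
        if header.get("status") == "deleted":
            deleted.append(identifier)   # mark deleted in the MR
        else:
            updated.append(identifier)   # update the stored record
    return updated, deleted


# Hypothetical response from an augmentation service.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:aug.example.org:1</identifier>
      <datestamp>2006-04-10</datestamp></header></record>
    <record><header status="deleted"><identifier>oai:aug.example.org:2</identifier>
      <datestamp>2006-04-12</datestamp></header></record>
  </ListRecords>
</OAI-PMH>"""
```

Note that this only works for deletions if the augmentation service keeps deleted-record headers; a service that does not track deletes would simply stop listing the record, which is the open question raised above.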

Scenarios and Sequences

A simple augmentation sequence

  1. Repository gets metadata record 1 from provider Q.
  2. Repository normalizes the metadata, creating record 1N.
  3. iVia harvests metadata record 1N from the MR's OAI server
  4. iVia uses IDs from harvested metadata to target resources for automated metadata creation, subject and classification assignment
  5. iVia exposes its metadata augmentations (not data harvested from original records) to the world as metadata record 1NiVia
  6. Repository harvests metadata record 1NiVia from iVia
  7. Repository normalizes or otherwise alters and stores the iVia aug record as record 1NiViaN
  8. Repository uses 1NiViaN as part of a rqis augmented/gold record

A more complex augmentation sequence

  1. Repository gets metadata record 1 from provider Q.
  2. Repository normalizes the metadata, creating record 1N.
  3. iVia harvests metadata record 1N from the MR's OAI server
  4. iVia uses IDs from harvested metadata to target resources for automated metadata creation, subject and classification assignment
  5. iVia exposes its metadata augmentations (not statements harvested from the original record) to the world as metadata record 1NiVia
  6. Repository harvests metadata record 1NiVia from iVia
  7. Repository normalizes or otherwise alters and stores the iVia aug record as record 1NiViaN
  8. Repository uses 1NiViaN in record 1aug, an NSDL augmented/gold record.
  9. Repository search service harvests record 1N or record 1aug.
  10. Repository search service discovers that the dc:format value is wrong -- it's text, not an image.
  11. Repository search service provides a correction to the dc:format field
  12. Repository archive service harvests record 1N or record 1aug.
  13. Repository archive service discovers that the dc:format value is wrong -- it's text/xml, not an image.
  14. Repository archive service provides a correction to the dc:format field
  15. Repository harvests via OAI or otherwise gets the corrections from the search service.
  16. Repository harvests via OAI or otherwise gets the corrections from the archive service.
  17. Repository Rating Service determines what value(s) are exposed for dc:format in record 1aug, the NSDL augmented/gold record

An Annotation and Augmentation Harvest Scenario

  1. The Repository publishes two metadata formats designed to support annotations and augmented metadata
    • The key feature of these formats is two metadata elements, one (and only one) of which must be present, either of which may be repeated:
      • xxxxUniqueIdRef? -- which contains a reference to an existing metadata record in the MR and must match an existing record
      • dcIdentifierRef? -- which contains a reference to a URI that may or may not exist in the MR.
    • Augmented metadata must reference an existing URI in the MR. This could also be expressed as <reference type=xxxxUniqueId?> or <reference type=dcIdentifier>
    • These are intended to be used to supply annotations and augmented metadata for harvest via OAI and perhaps a services interface.
  2. Annotation and augmentation suppliers wishing to supply metadata about a resource identified by a URI should first query or harvest the MR to get a list of metadata records that are about that URI.
  3. They create metadata about their annotation in the above format and serve it via OAI. This record may carry the actual annotation or it can simply contain a reference. In the case of metadata augmentation, each record served should be a self-contained, incomplete metadata record and should not reference another source of metadata.
  4. We harvest the records through a standard harvest -- all incoming records will have to be associated with a collection record
  5. The ingest process creates a unique mrec record for each incoming record
  6. References in the MR must always be mrec_ids so in the case of dcIdentifierRef? the ingest process retrieves all mrecs that reference each dcIdentifierRef?.
  7. If a dcIdentifierRef? references a URI that is not found, an mrec record is created for that URI and is queued for metadata generation by iVia (controversial)
  8. An entry is created in the link table for each mrec identified either directly or by reference. This will contain the mrec_id of the annotation record, the mrec_id of the mrec being annotated, a reference type, a datestamp, and a source mrec_id
    • Note that the link table will need an additional 'source' field that will, in the case of annotations and augmentations, contain the mrec_id of the annotation or augmentation metadata record that supplied the link.
    • Note also that reference type and datestamp are denormalized values that can be determined by reference to the source mrec_id if necessary.
  9. Output of augmented metadata is the tough thing -- it needs to be served both as a component part of the metadata format being augmented and as a distinct format, both within and without the mudball.
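
Steps 6 and 8 above (resolving a dcIdentifierRef to mrecs, then writing link rows) might look like the following sketch, using an in-memory SQLite database. The table layout and column names are assumptions derived from the description, not the actual MR schema, and the controversial step 7 (creating a stub mrec for an unknown URI) is deliberately left out.

```python
import sqlite3
from typing import List


def make_db() -> sqlite3.Connection:
    """Create a throwaway database with hypothetical mrec and link tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE mrec (mrec_id INTEGER PRIMARY KEY, dc_identifier TEXT);
        CREATE TABLE link (
            annotation_mrec_id INTEGER,
            target_mrec_id     INTEGER,
            reference_type     TEXT,
            datestamp          TEXT,
            source_mrec_id     INTEGER
        );
    """)
    return db


def link_by_identifier(db: sqlite3.Connection, annotation_mrec_id: int,
                       uri: str, datestamp: str) -> List[int]:
    """Link an annotation mrec to every mrec that references `uri`.

    Returns the target mrec_ids; an empty list means the URI is unknown.
    """
    rows = db.execute(
        "SELECT mrec_id FROM mrec WHERE dc_identifier = ? ORDER BY mrec_id",
        (uri,),
    ).fetchall()
    targets = [row[0] for row in rows]
    for target in targets:
        # The annotation record is also the source that supplied the link.
        db.execute(
            "INSERT INTO link VALUES (?, ?, ?, ?, ?)",
            (annotation_mrec_id, target, "dcIdentifierRef", datestamp,
             annotation_mrec_id),
        )
    return targets
```

Storing `reference_type` and `datestamp` on the link row mirrors the denormalization noted above: both could be looked up from the source mrec, but keeping them on the row avoids a join at output time.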

Annotation Documents from NSDL Annotation Services Planning

Tentative Listing of Annotation Types

This list is a collection of categorizations already used by annotation services, and will provide the basis for a controlled vocabulary to be used for annotation records within the NSDL. Please contact Diane Hillmann or the workspace mailing list to add additional possible types to this list.

Type terms: definition, if available (Source of term)

  • Advice: A subclass of Annotation representing advice to the reader (Annotea)
  • Annotation: A super class describing the common features of annotations (Annotea)
  • Average scores of aggregated indices (DLESE)
  • Change: Annotations that document or propose a change to the source document (Annotea)
  • Comment: A subclass of Annotation describing annotations that are comments (Annotea)
  • Editor's summary (DLESE)
  • Example: A subclass of Annotation representing examples (Annotea)