Difference between revisions of "Augmented Metadata and Annotations"

From Metadata-Registry
(Outline--a harvest scenario)
(Outline--a harvest scenario)
Line 17: Line 17:
 
#Output of augmented metadata is the tough thing -- it needs to be served both as a component part of the metadata format being augmented and as a distinct format, both within and without the mudball.
 
  
=== Metadata Augmentation: Brief use cases ===
+
==Metadata Augmentation: Use cases==
  
[http://metamanagement.comm.nsdl.org/cgi-bin/wiki.pl?MetaAugUseSample1 Metadata augmentation sequence for the XML samples to be shared with metadata augmentation providers (iVia and enc) on July 1, 2004]
+
===Use Case #1:  field replacement or deprecation===
 
+
* <b>Use Case #1:  field replacement or deprecation </b>
+
  
 
: NSDL receives a file of item records from the Whatsis Collection.  Each record contains a defaulted value "unknown" in the Coverage element. Based on the NSDL decision to deprecate useless defaults, the element is marked as deprecated, and that assertion indicates NSDL is its source.  In addition, the dc:format value of <nowiki>"application/flash"</nowiki> is consistently misspelled. A second version of the dc:format element with the correctly spelled value is provided by NSDL and an error notification message is sent to the data provider.  MR OAI format nsdl_dc_plus will include both versions of dc:format;  nsdl_dc_gold will only show the correctly spelled one.  Lastly, the DCMIType value of <nowiki>"InteractiveResource"</nowiki> is consistently misspelled by the provider in a dc:type field. A second version of the dc:type element with the correctly spelled value is provided by NSDL , and the encoding scheme of dct:DCMIType is added.  An error notification message is sent to the data provider.  nsdl_dc_plus will include both versions of dc:type;  nsdl_dc_gold will only show the correctly spelled one, with its indicated encoding scheme.
 
Line 65: Line 63:
 
: (When do safe xforms happen?  where are they in this sequence?)
 
  
* <b> Use Case #2:  multiple equivalent resources and their relationship to augmentations on output </b>
+
===Use Case #2:  multiple equivalent resources and their relationship to augmentations on output===
  
 
: ENC provides NSDL with metadata augmentations asserting that specified items in a number of collections relate to the Illinois third grade science standard for basic understanding of photosynthesis. ENC provides NSDL with metadata records identifying an NSDL metadata record ID, a URL (providing an internal check as well as an additional identifier for the resource) and the DC refinement <nowiki>"conformsTo"</nowiki> specifying the particular standard to which the resource is related.  This element contains the source ENC and is identified as human created data.  The NSDL Simple Equivalency Service (based on resource URLs) identifies three other items in other collections that this relationship assertion applies to, and the appropriate links are made, and the resource metadata records (aggregated version only) updated in the NSDL OAI server.  
 
Line 87: Line 85:
  
  
* <b> Use Case #4: focus on possible uses available to downstream users </b>
+
===Use Case #4: focus on possible uses available to downstream users===
  
 
: ENC harvests "mudball" metadata records from NSDL to fulfil a number of specific requirements of their middle school portal:
 
Line 96: Line 94:
 
: <nowiki>MathForum</nowiki> re-harvests their metadata records from NSDL in the "nsdl_dc_plus" flavor, looking for additional metadata added by others to provide additional value on their site.  They also harvest the "mudball" records from other math collections to see if they can add some resources described by others to their site, making them available to their special services.  
 
  
* <b> Use Case #5: when a resource or its metadata changes or is deleted, what happens to augmentations? </b>
+
===Use Case #5: when a resource or its metadata changes or is deleted, what happens to augmentations?===
  
 
** Deletion from specific providing collection: link moves to an equivalent resource metadata record? (or doesn't it matter, so long as there's another available NSDL metadata record of that resource?)
 
Line 107: Line 105:
 
** how can we be sure disappearance is permanent vs. temporary?
 
  
* <b> Use Case #6: when an augmentation is changed or deleted, what happens? </b>
+
===Use Case #6: when an augmentation is changed or deleted, what happens?===
  
 
** a metadata augmentation is changed
 
Line 119: Line 117:
 
*** deletion of augmentation metadata record is duly propagated through MR storage and affected XML served out of MR.
  
* <b> Use Case #7: a simple augmentation sequence </b>
+
===Use Case #7: a simple augmentation sequence===
 
:1. we get metadata record 1 from provider Q.
 
 
:2. we normalize the metadata, creating record 1N.   
 
Line 132: Line 130:
 
:8. we use <nowiki>1NiViaN</nowiki> in a nsdl augmented/gold/demented record.
 
  
* <b> Use Case #8: a more complex augmentation sequence </b>
+
===Use Case #8: a more complex augmentation sequence===
 
:1. we get metadata record 1 from provider Q.
 
 
:2. we normalize the metadata, creating record 1N.   
 
Line 157: Line 155:
  
 
dih and nrd 4/27/04
 
=== Metadata Augmentation: Brief use cases ===
 
  
[http://metamanagement.comm.nsdl.org/cgi-bin/wiki.pl?MetaAugUseSample1 Metadata augmentation sequence for the XML samples to be shared with metadata augmentation providers (iVia and enc) on July 1, 2004]
+
==Use cases: Native metadata harvest, crosswalking, safe and collection-specific transformations, NSDL gold metadata==
  
* <b>Use Case #1:  field replacement or deprecation </b>
+
===Use Case 1.===
  
: NSDL receives a file of item records from the Whatsis Collection.  Each record contains a defaulted value "unknown" in the Coverage element. Based on the NSDL decision to deprecate useless defaults, the element is marked as deprecated, and that assertion indicates NSDL is its source.  In addition, the dc:format value of <nowiki>"application/flash"</nowiki> is consistently misspelled. A second version of the dc:format element with the correctly spelled value is provided by NSDL and an error notification message is sent to the data provider.  MR OAI format nsdl_dc_plus will include both versions of dc:format;  nsdl_dc_gold will only show the correctly spelled one.  Lastly, the DCMIType value of <nowiki>"InteractiveResource"</nowiki> is consistently misspelled by the provider in a dc:type field. A second version of the dc:type element with the correctly spelled value is provided by NSDL, and the encoding scheme of dct:DCMIType is added.  An error notification message is sent to the data provider.  nsdl_dc_plus will include both versions of dc:type;  nsdl_dc_gold will only show the correctly spelled one, with its indicated encoding scheme.
+
Provider “F” is identified by NSDL Collection Developers as a relevant and useful addition to the library.  A collection record is created in the CRS and harvest is initiated.  Provider “F” exposes oai_dc and MARC records, and according to policy, NSDL harvests both formats for the MR. The long-suffering metadata specialist receives an email reporting the successful initial harvest and the location of csv files for the harvested metadata. Using Spotfire, she examines the files quickly and determines that the oai_dc files are very sparse, but the MARC files have exploitable information that would be useful to improve access to the resources of Provider “F.”  Using a standard MARC crosswalk stored for this purpose, she initiates a transformation of the MARC file into nsdl_dc and examines the results quickly using Spotfire to assure that the appropriate elements are populated and the values appropriate.  Lastly, she approves the loading of the newly transformed data into the MR. NSDL then exposes the additional nsdl_dc format, along with the provider-supplied oai_dc and MARC, via the NSDL OAI server.  Additionally, the nsdl_dc elements become available via the augmented formats: nsdl_mudball and nsdl_gold.
  
: Later, the NSDL harvests updated item records from Whatsis.  <font color="red">How do we ensure that the new information from the Whatsis updates doesn't step on the NSDL improvements?</font>
+
===Use Case 2.===
## <coverage>unknown</coverage> is provided again.   the MR needs to keep the deprecation assertion and NOT serve this useless info to downstream users, such as the NSDL Search service.
+
## <coverage>unknown</coverage> is no longer provided.  The MR needs to remove the deprecation assertion because it no longer refers to an element.
+
## <coverage>unknown</coverage> is no longer provided BUT <coverage>Washington</coverage> is now provided.  the MR needs to remove the deprecation assertion because there is now a useful (!) value;  the MR must serve the new coverage info to downstream users.
+
## provider now gives us newly misspelled <nowiki>"apprication/flash"</nowiki>.  Because we have a separate NSDL-provided element with the correct spelling, the new (incorrect) provider element replaces the old (incorrect) provider element, and the correct NSDL attributed element is left alone.
+
## correctly spelled <nowiki>"application/flash"</nowiki> is now provided.  The MR should now drop the NSDL-sourced correct element, as it is a duplicate of the provider sourced correct element.  Or not -- the bottom line is to serve only ONE, rather than duplicate.
+
## newly misspelled <nowiki>"Inteactive Resource"</nowiki> is provided.  Because we have a separate element, with the correct spelling, that indicates encoding scheme, the newly incorrect provider element replaces the old (incorrect) provider element, and the correct one is still there.
+
## correctly spelled <nowiki>"InteractiveResource"</nowiki> is now provided.  The MR should now either add the encoding scheme to the provider's newly correct element and drop the (duplicate) NSDL-sourced correct element, or the MR should keep both, with the encoding scheme only applied to the NSDL element.  Or not -- the bottom line is to only serve ONE, rather than duplicate.
+
## provider no longer serves dc:type element.  <font color='red'>(orphaned field enhancement) Should the NSDL dc:type field be retained, or should it be discarded?</font>  If we don't retain a connection from the NSDL assertion to the original provider assertion, then our dc:type element just remains.  Diane feels retaining these element level connections  is too costly with too little return.  We don't know why their dc:type element disappeared ...
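The re-harvest cases listed above can be read as one small rule for a single element. The sketch below is only one possible reading, in Python, and not a description of how the MR works: the function name, its arguments, and the `"provider"`/`"NSDL"` labels are all hypothetical. It keeps no element-level link between an NSDL correction and the provider value it corrected, which is the cheaper option Diane prefers, so an orphaned NSDL value simply remains.

```python
def serve_after_reharvest(provider_values, nsdl_values, deprecated):
    """Return (value, source) pairs to serve for one element after a re-harvest.

    provider_values: values found in the newly harvested provider record
    nsdl_values:     NSDL-sourced corrections already held for this element
    deprecated:      values NSDL has asserted to be useless defaults
    """
    served = []
    for value in provider_values:
        if value in deprecated:
            # Useless default supplied again: keep the assertion, do not serve it.
            continue
        served.append((value, "provider"))
    for value in nsdl_values:
        if value in provider_values:
            # Provider now supplies the correct value: serve only ONE copy.
            continue
        served.append((value, "NSDL"))
    return served


# A new misspelling replaces the old one; the NSDL correction is left alone.
print(serve_after_reharvest(["apprication/flash"], ["application/flash"], set()))
```

Because a deprecation only takes effect when the deprecated value is actually present, the case where "unknown" is no longer provided needs no separate cleanup step in this reading.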
+
  
: (Diane sez the critical thing is <i>who makes the assertion. </i> For example, if the original metadata provider supplies a field with a typo, "texp/html", and NSDL corrects the typo to "text/html", the original metadata provider made the assertion.  However, if the original metadata provider says a resource is an image, when it's really (or also) text, then the NSDL correction has a new assertion in it.)
+
the CRS, creates an initial collection record and initiates a harvest of metadata. Because this is an initial harvest, the long-suffering metadata specialist receives an email notification.  She updates the collection record, and quickly examines the csv files for two of the three harvested formats: nsdl_dc and ieee_lom.  She notes that the crosswalk used by the data provider ,, though correct, appear in the wrong element.  She calls up a form to create a collection-specific transform for that collection, to reverse the values in each element and add the appropriate encoding scheme.  Since no other serious errors appear, she invokes the safe transformation and approves the data for the MR. She sends a notification to the provider, pointing out the error, and asking him to inform her when the error is corrected so that she can pull the collection-specific transform when the data can be correctly harvested.
  
* <b> Use Case #N1: provider updates their metadata after it has been augmented </b>
+
===Use Case 3.===
 
+
## Shindy provides nsdl_dc to the NSDL
+
## iVia augments the Shindy items with dc:subject fields with LCC values
+
## MR harvests updated nsdl_dc from Shindy
+
*** Shindy's new nsdl_dc has no dc:subject fields
+
*** Shindy's new nsdl_dc has dc:subject fields with LCC values
+
:: Under what conditions do we trigger new iVia augmentations?  Only if primary identifier changes?
+
: (When do safe xforms happen?  where are they in this sequence?)
+
 
+
* <b> Use Case #N2: augmentation service updates their provided augmentations </b>
+
 
+
## Shindy provides oai_dc to the NSDL
+
## iVia augments the Shindy items with dc:subject fields with LCC values
+
## iVia newly augments the Shindy items with new, improved dc:subject fields with LCC values 
+
*** do we set up the process to assume augmentations supersede older versions of themselves?
+
 
+
: (When do safe xforms happen?  where are they in this sequence?)
+
 
+
* <b> Use Case #N3: auto-chosen/auto-gen item metadata is augmented by another service </b>
+
 
+
## iVia does a crawl and provides item level metadata to the NSDL as collection wowza.
+
## ENC augments the wowza items with dct:audience fields
+
## SDSC augments the wowza items with dc:format information and information about broken links.
+
 
+
: (is there anything special about this case, or is it the same as N1?)
+
: (When do safe xforms happen?  where are they in this sequence?)
+
 
+
* <b> Use Case #2:  multiple equivalent resources and their relationship to augmentations on output </b>
+
 
+
: ENC provides NSDL with metadata augmentations asserting that specified items in a number of collections relate to the Illinois third grade science standard for basic understanding of photosynthesis. ENC provides NSDL with metadata records identifying an NSDL metadata record ID, a URL (providing an internal check as well as an additional identifier for the resource) and the DC refinement <nowiki>"conformsTo"</nowiki> specifying the particular standard to which the resource is related.  This element contains the source ENC and is identified as human created data.  The NSDL Simple Equivalency Service (based on resource URLs) identifies three other items in other collections that this relationship assertion applies to, and the appropriate links are made, and the resource metadata records (aggregated version only) updated in the NSDL OAI server.
+
* <b> Use Case #3: Multiple providers of metadata and augmentation </b> -- original metadata provider, NSDL (as augmenter), 3rd party augmenter, nsdl served out in various flavors
+
 
+
: The Whomever Collection supplies NSDL with 2233 item records described with oai_dc metadata. Based on routine normalization procedures, NSDL adds several encoding schemes to the records: "URI" to the identifier element (all values begin with "http") and "DCMIType" to most of the Type values which are valid DCMIType terms. In each of these cases, the source of the data continues to be identified as the original data provider.  Several weeks later, the iVia staff harvest the metadata for the collection, and feed back to NSDL LCC classification and LCSH subject headings for the collection.  This information is identified as originating with iVia and also as machine generated data.
+
 
+
: The metadata is served out in a number of flavors:
+
 
+
** native_oai_dc: metadata exactly how it came to us
+
** nsdl_dc: native_oai_dc plus safe xforms (was: as received, though normalized for errors and with added valid schemes)
+
** nsdl_dc_plus: nsdl_dc plus augmentations (each safe xformed native record with any augmentations that apply, based on equivalence relationships)
+
** nsdl_dc_gold: nsdl_dc, with erroneous values removed.  Different from "nsdl_dc_plus" because fields may be removed.
+
** oai_dc: the NSDL's "dumbing down" of one of the above nsdl_dc formats so we are compliant with OAI-PMH 2.0 (we must serve oai_dc)
+
** "Mudball" (aggregation of all available metadata elements, with source, identified as being about a particular resource)
+
*** Naomi asks: does this differ from nsdl_dc_plus?  If so, how?
+
** nsdl_all: each unrestricted metadata format, separately, as a big tarball.
+
*** Includes nsdl_dc, nsdl_dc_plus, nsdl_dc_gold, native, augRec11 from iVia, augRec28 from SDSC ...
+
** nsdl_search: all the metadata formats as a big tarball. 
+
*** Perhaps only nsdl_dc_gold, not ALL nsdl_dc formats?
+
 
+
 
+
* <b> Use Case #4: focus on possible uses available to downstream users </b>
+
 
+
: ENC harvests "mudball" metadata records from NSDL to fulfil a number of specific requirements of their middle school portal:
+
** They look for assertions of <nowiki>"conformsTo"</nowiki> relationships from a small number of sources that they consider reliable
+
** They look for subject terms from controlled vocabularies on relevant resource metadata that they can use on their portal to provide topical navigation.
+
** They look for annotations about middle school resources from teachers, librarians, and specific sources known by them to be reliable and appropriate for middle school audiences
+
 
+
: <nowiki>MathForum</nowiki> re-harvests their metadata records from NSDL in the "nsdl_dc_plus" flavor, looking for additional metadata added by others to provide additional value on their site.  They also harvest the "mudball" records from other math collections to see if they can add some resources described by others to their site, making them available to their special services.
+
 
+
* <b> Use Case #5: when a resource or its metadata changes or is deleted, what happens to augmentations? </b>
+
 
+
** Deletion from specific providing collection: link moves to an equivalent resource metadata record? (or doesn't it matter, so long as there's another available NSDL metadata record of that resource?)
+
** Deletion of last NSDL metadata record for that particular resource (perhaps it died?):
+
*** mark for deletion, but run occasional report to see if some can be revived?
+
*** point to NSDL archived version of resource
+
** Resource changes in ways that cannot be easily determined:
+
*** Augmentors notified to re-crawl or review,
+
*** non-updated augmentations could be "sunsetted" after some passage of time?
+
** how can we be sure disappearance is permanent vs. temporary?
+
 
+
* <b> Use Case #6: when an augmentation is changed or deleted, what happens? </b>
+
 
+
** a metadata augmentation is changed
+
*** MR picks it up on regular harvest from aug service (b/c OAI datestamp of changed record is after our "from" date argument in the OAI harvest from the aug service)
+
*** augmentation metadata record is updated in MR
+
*** changes to augmentation metadata record are duly propagated through MR storage and affected XML served out of MR
+
** a metadata augmentation is deleted
+
*** MR harvests deletion on regular harvest from aug service (b/c OAI datestamp of deleted record is after our "from" date argument in the OAI harvest from the aug service)
+
**** what if aug service doesn't do persistent OAI deletes?  (or transient deletes of a long period of time?)
+
*** augmentation metadata record is marked deleted in MR
+
*** deletion of augmentation metadata record is duly propagated through MR storage and affected XML served out of MR.
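The change and deletion paths above come down to comparing each record's OAI datestamp with the "from" argument of the harvest. The following Python sketch illustrates that idea under stated assumptions: the record dictionaries, the `store` layout, and the function name are hypothetical, and a real harvest would let the aug service do the date filtering. In OAI-PMH the "from" bound is inclusive, so the sketch accepts datestamps equal to it.

```python
from datetime import datetime


def apply_harvest(store, harvested, from_date):
    """Apply changed or deleted augmentation records to the local store.

    store:     dict mapping record identifier -> {"metadata": ..., "deleted": bool}
    harvested: iterable of dicts with "identifier", "datestamp", and either
               "metadata" or "deleted": True
    from_date: lower bound used for the selective harvest
    """
    for rec in harvested:
        if rec["datestamp"] < from_date:
            continue  # unchanged since the last harvest
        ident = rec["identifier"]
        if rec.get("deleted", False):
            # Mark as deleted rather than dropping it, so the deletion
            # can be propagated to everything served out of the store.
            if ident in store:
                store[ident]["deleted"] = True
        else:
            store[ident] = {"metadata": rec["metadata"], "deleted": False}
    return store
```

This only works if the aug service keeps deleted records visible long enough to be harvested; the open question about services without persistent deletes is not addressed here.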
+
 
+
* <b> Use Case #7: a simple augmentation sequence </b>
+
:1. we get metadata record 1 from provider Q.
+
:2. we normalize the metadata, creating record 1N. 
+
::* it's possible that record 1N is also the nsdl_aug/gold at this point ...
+
:3. iVia harvests metadata record 1N from the MR's OAI server
+
:4. iVia does its automagic thang
+
:5. iVia exposes its metadata augmentations to the world as metadata record <nowiki>1NiVia</nowiki>
+
::* is this only the newly provided and/or altered fields, or is it all the old fields plus all the new fields, or what?
+
:6. we harvest metadata record <nowiki>1NiVia</nowiki> from iVia
+
:7. we normalize or otherwise alter and store the iVia aug record as record <nowiki>1NiViaN</nowiki>
+
::* do we store it locally if we don't change anything?
+
:8. we use <nowiki>1NiViaN</nowiki> in a nsdl augmented/gold/demented record.
+
 
+
* <b> Use Case #8: a more complex augmentation sequence </b>
+
:1. we get metadata record 1 from provider Q.
+
:2. we normalize the metadata, creating record 1N. 
+
::* it's possible that record 1N is also the nsdl_aug/gold at this point ...
+
:3. iVia harvests metadata record 1N from the MR's OAI server
+
:4. iVia does its automagic thang
+
:5. iVia exposes its metadata augmentations to the world as metadata record <nowiki>1NiVia</nowiki>
+
::* is this only the newly provided and/or altered fields, or is it all the old fields plus all the new fields, or what?
+
:6. we harvest metadata record <nowiki>1NiVia</nowiki> from iVia
+
:7. we normalize or otherwise alter and store the iVia aug record as record <nowiki>1NiViaN</nowiki>
+
::* do we store it locally if we don't change anything?
+
:8. we use <nowiki>1NiViaN</nowiki> in record 1aug, a nsdl augmented/gold/demented record.
+
:9. search service harvests record 1N or record 1aug.
+
:10. search service discovers that the dc:format value is wrong -- it's text, not an image.
+
:11. search service provides a correction to the dc:format field
+
::* via same OAI mechanisms used by md augmentation services?
+
:12. archive service harvests record 1N or record 1aug.
+
:13. archive service discovers that the dc:format value is wrong -- it's text/xml, not an image.
+
:14. archive service provides a correction to the dc:format field
+
::* via same OAI mechanisms used by md augmentation services?
+
:15. we harvest via OAI or otherwise get the corrections from the search service.
+
:16. we harvest via OAI or otherwise get the corrections from the archive service.
+
:17. what value do we use for dc:format in record 1aug, the nsdl augmented/gold/demented record?
+
 
+
dih and nrd 4/27/04
+
=== Use cases: Native metadata harvest, crosswalking, safe and collection-specific transformations, NSDL gold metadata ===
+
  
<b>Use Case 1. </b>Provider ? is identified by NSDL Collection Developers as a relevant and useful addition to the library.  A collection record is created in the CRS and harvest is initiated.  Provider ? exposes oai_dc and MARC records, and according to policy, NSDL harvests both formats for the MR. The long-suffering metadata specialist receives an email reporting the successful initial harvest and the location of csv files for the harvested metadata. Using Spotfire, she examines the files quickly and determines that the oai_dc files are very sparse, but the MARC files have exploitable information that would be useful to improve access to the resources of Provider ?.  Using a standard MARC crosswalk stored for this purpose, she initiates a transformation of the MARC file into nsdl_dc and examines the results quickly using Spotfire to assure that the appropriate elements are populated and the values appropriate.  Lastly, she approves the loading of the newly transformed data into the MR. NSDL then exposes the additional nsdl_dc format, along with the provider-supplied oai_dc and MARC, via the NSDL OAI server. Additionally, the nsdl_dc elements become available via the augmented formats: nsdl_mudball and nsdl_gold.
+
Provider ? is identified by an approved NSDL Recommender as an appropriate collection for the NSDL.  The Recommender creates a collection record, and through the CRS initiates a first harvest.  Provider ? exposes oai_dc and a new metadata format specialized for science museums that the MR has not yet encountered. The Metadata Empress receives notification that this new format has been harvested, and the schema provided allows the creation of a csv file so that the data can be reviewed. The schema also supports the creation of a crosswalk worksheet, allowing the Empress to set up a crosswalk from the richer format to nsdl_dc.  When the crosswalk is completed, the data is transformed and made available through the NSDL OAI server, and the crosswalk itself is posted to the NSDL approved crosswalks page, for specific reference in the crosswalked records and for use by others. The provider is also notified about the presence of the crosswalk, and invited to comment or suggest improvements.
  
<b>Use Case .
+
===Use Case 4.===
  
<b>Use Case 3:</b> Provider ? is identified by an approved NSDL Recommender as an appropriate collection for the NSDL.  The Recommender creates a collection record, and through the CRS initiates a first harvest.  Provider ? exposes oai_dc and a new metadata format specialized for science museums that the MR has not yet encountered.  The Metadata Empress receives notification that this new format has been harvested, and the schema provided allows the creation of a csv file so that the data can be reviewed. The schema also supports the creation of a crosswalk worksheet, allowing the Empress to set up a crosswalk from the richer format to nsdl_dc.  When the crosswalk is completed, the data is transformed and made available through the NSDL OAI server, and the crosswalk itself is posted to the NSDL approved crosswalks page, for specific reference in the crosswalked records and for use by others. The provider is also notified about the presence of the crosswalk, and invited to comment or suggest improvements.
+
Provider ? a long time NSDL as stored by the CRS.  The Metadata Empress is notified of these changes, and takes a look at the csv files to see how these changes would affect access to the Provider ? collection. One of the changes is the addition of Audience values.  The ME determines that the provider is not using the available NSDL vocabularies but a mix of other available vocabularies and unattributed terms. She sets up a collection-specific transform that crosswalks the non-NSDL standard vocabularies to NSDL vocabularies as well as a quick crosswalk from the unattributed terms to NSDL vocabulary.  She also sets up an nsdl_gold profile for the provider, so that appropriate ratings are established for the range of terms available.
  
<b>Use Case .
+
===Use Case 5.===
  
<b>Use Case 5.</b> A routine crawl for item metadata is initiated via the CRS after the completion of a collection record for a resource without available metadata.  The iVia Service makes machine-generated metadata available for harvest by the NSDL.  Because a rights statement applying to all the resources on the site is available, but the iVia Service does not reflect that in the items, a collection-specific transform is initiated for the collection, and the appropriate statement is defaulted in the Rights element for the items.
+
A routine crawl for item metadata is initiated via the CRS after the completion of a collection record for a resource without available metadata.  The iVia Service makes machine-generated metadata available for harvest by the NSDL.  Because a rights statement applying to all the resources on the site is available, but the iVia Service does not reflect that in the items, a collection-specific transform is initiated for the collection, and the appropriate statement is defaulted in the Rights element for the items.

Revision as of 08:40, 25 October 2005

Outline--a harvest scenario

  1. We publish two metadata formats designed to support annotations and augmented metadata
    • The key feature of these formats is two metadata elements, one (and only one) of which must be present, either of which may be repeated:
      • xxxxUniqueIdRef? -- which contains a reference to an existing metadata record in the MR and must match an existing record
      • dcIdentifierRef? -- which contains a reference to a URI that may or may not exist in the MR.
        • Augmented metadata must reference an existing URI in the MR. This could also be expressed as <reference type=xxxxUniqueId?> or <reference type=dcIdentifier>
    • These are intended to be used to supply annotations and augmented metadata for harvest via OAI and perhaps a services interface.
  2. Annotation and augmentation suppliers wishing to supply metadata about a resource identified by a URI should first query or harvest the MR to get a list of metadata records that are about that URI.
  3. They create metadata about their annotation in the above format and serve it via OAI. This record may carry the actual annotation or it can simply contain a reference. In the case of metadata augmentation, each record served should be a self-contained, incomplete metadata record and should not reference another source of metadata.
  4. We harvest the records through a standard harvest -- all incoming records will have to be associated with a collection record
  5. The ingest process creates a unique mrec record for each incoming record
  6. References in the MR must always be mrec_ids, so in the case of dcIdentifierRef the ingest process retrieves all mrecs that reference each dcIdentifierRef.
  7. If a dcIdentifierRef references a URI that is not found, an mrec record is created for that URI and is queued for metadata generation by iVia (controversial)
  8. An entry is created in the link table for each mrec identified either directly or by reference. This will contain the mrec_id of the annotation record, the mrec_id of the mrec being annotated, a reference type, a datestamp, and a source mrec_id
    • Note that the link table will need an additional 'source' field that will, in the case of annotations and augmentations, contain the mrec_id of the annotation or augmentation metadata record that supplied the link.
    • Note also that reference type and datestamp are denormalized values that can be determined by reference to the source mrec_id if necessary.
  9. Output of augmented metadata is the tough thing -- it needs to be served both as a component part of the metadata format being augmented and as a distinct format, both within and without the mudball.
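
Steps 6 and 8 of the outline can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Link` class, the function names, the reference type strings "uniqueId" and "dcIdentifier", and the two lookup dictionaries are hypothetical stand-ins for whatever the MR actually stores. The controversial step 7 is deliberately left out, so an unknown URI simply yields no links here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Link:
    """One row of the link table described in step 8 (field names are assumptions)."""
    annotation_mrec_id: int  # mrec of the annotation or augmentation record
    target_mrec_id: int      # mrec being annotated
    reference_type: str      # denormalized: "uniqueId" or "dcIdentifier"
    datestamp: datetime      # denormalized
    source_mrec_id: int      # mrec of the record that supplied the link


def resolve_targets(ref_type, ref_value, mrec_by_unique_id, mrecs_by_uri):
    """Turn one incoming reference into the mrec_ids it points at (step 6)."""
    if ref_type == "uniqueId":
        # A unique-id reference must match an existing record.
        if ref_value not in mrec_by_unique_id:
            raise KeyError(f"no metadata record with unique id {ref_value!r}")
        return [mrec_by_unique_id[ref_value]]
    if ref_type == "dcIdentifier":
        # A URI may match zero, one, or many existing records.
        return list(mrecs_by_uri.get(ref_value, []))
    raise ValueError(f"unknown reference type {ref_type!r}")


def build_links(annotation_mrec_id, references, mrec_by_unique_id, mrecs_by_uri, now=None):
    """Create one link per target mrec for a newly ingested annotation record."""
    now = now or datetime.now(timezone.utc)
    links = []
    for ref_type, ref_value in references:
        for target in resolve_targets(ref_type, ref_value, mrec_by_unique_id, mrecs_by_uri):
            links.append(Link(annotation_mrec_id, target, ref_type, now, annotation_mrec_id))
    return links
```

In this sketch the source mrec_id is the annotation record itself, which matches the note that the 'source' field holds the mrec_id of the record that supplied the link.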

Metadata Augmentation: Use cases

Use Case #1: field replacement or deprecation

NSDL receives a file of item records from the Whatsis Collection. Each record contains a defaulted value "unknown" in the Coverage element. Based on the NSDL decision to deprecate useless defaults, the element is marked as deprecated, and that assertion indicates NSDL is its source. In addition, the dc:format value of "application/flash" is consistently misspelled. A second version of the dc:format element with the correctly spelled value is provided by NSDL, and an error notification message is sent to the data provider. MR OAI format nsdl_dc_plus will include both versions of dc:format; nsdl_dc_gold will only show the correctly spelled one. Lastly, the DCMIType value of "InteractiveResource" is consistently misspelled by the provider in a dc:type field. A second version of the dc:type element with the correctly spelled value is provided by NSDL, and the encoding scheme of dct:DCMIType is added. An error notification message is sent to the data provider. nsdl_dc_plus will include both versions of dc:type; nsdl_dc_gold will only show the correctly spelled one, with its indicated encoding scheme.
Later, the NSDL harvests updated item records from Whatsis. How do we ensure that the new information from the Whatsis updates doesn't step on the NSDL improvements?
    1. <coverage>unknown</coverage> is provided again. The MR needs to keep the deprecation assertion and NOT serve this useless info to downstream users, such as the NSDL Search service.
    2. <coverage>unknown</coverage> is no longer provided. The MR needs to remove the deprecation assertion because it no longer refers to an element.
    3. <coverage>unknown</coverage> is no longer provided BUT <coverage>Washington</coverage> is now provided. The MR needs to remove the deprecation assertion because there is now a useful (!) value; the MR must serve the new coverage info to downstream users.
    4. provider now gives us newly misspelled "apprication/flash". Because we have a separate NSDL-provided element with the correct spelling, the new (incorrect) provider element replaces the old (incorrect) provider element, and the correct NSDL attributed element is left alone.
    5. correctly spelled "application/flash" is now provided. The MR should now drop the NSDL-sourced correct element, as it is a duplicate of the provider sourced correct element. Or not -- the bottom line is to serve only ONE, rather than duplicate.
    6. newly misspelled "Inteactive Resource" is provided. Because we have a separate element, with the correct spelling, that indicates encoding scheme, the newly incorrect provider element replaces the old (incorrect) provider element, and the correct one is still there.
    7. correctly spelled "InteractiveResource" is now provided. The MR should now either add the encoding scheme to the provider's newly correct element and drop the (duplicate) NSDL-sourced correct element, or the MR should keep both, with the encoding scheme only applied to the NSDL element. Or not -- the bottom line is to only serve ONE, rather than duplicate.
    8. provider no longer serves dc:type element. (orphaned field enhancement) Should the NSDL dc:type field be retained, or should it be discarded? If we don't retain a connection from the NSDL assertion to the original provider assertion, then our dc:type element just remains. Diane feels retaining these element level connections is too costly with too little return. We don't know why their dc:type element disappeared ...
(Diane sez the critical thing is who makes the assertion. For example, if the original metadata provider supplies a field with a typo, "texp/html", and NSDL corrects the typo to "text/html", the original metadata provider made the assertion. However, if the original metadata provider says a resource is an image, when it's really (or also) text, then the NSDL correction has a new assertion in it.)
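
The re-harvest rules in cases 1 through 7 above can be sketched for a single element as follows. This is a rough illustration, not the MR's actual logic: the function name `reconcile` and its three arguments are hypothetical, it covers only the "keep or drop the deprecation" and "serve only ONE" rules, and it leaves the open question in case 8 (orphaned NSDL elements) unresolved.

```python
def reconcile(provider_values, deprecated_values, nsdl_values):
    """Reconcile one element after a re-harvest.

    provider_values:   values in the newly harvested provider record
    deprecated_values: values NSDL previously marked as deprecated
    nsdl_values:       corrected values NSDL previously supplied
    Returns (served_values, kept_deprecations).
    """
    # Cases 1-3: a deprecation assertion survives only while the provider
    # still supplies the value it refers to.
    kept = {v for v in deprecated_values if v in provider_values}

    # Cases 4-7: provider and NSDL values are both available, but a value
    # supplied by both is served only once, and deprecated values are not served.
    served = []
    for value in list(provider_values) + list(nsdl_values):
        if value in kept or value in served:
            continue
        served.append(value)
    return served, kept
```

For example, `reconcile(["unknown"], {"unknown"}, [])` keeps the deprecation and serves nothing, while `reconcile(["Washington"], {"unknown"}, [])` drops the deprecation and serves the new value.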
  • Use Case #N1: provider updates their metadata after it has been augmented
    1. Shindy provides nsdl_dc to the NSDL
    2. iVia augments the Shindy items with dc:subject fields with LCC values
    3. MR harvests updated nsdl_dc from Shindy
      • Shindy's new nsdl_dc has no dc:subject fields
      • Shindy's new nsdl_dc has dc:subject fields with LCC values
Under what conditions do we trigger new iVia augmentations? Only if primary identifier changes?
(When do safe xforms happen? where are they in this sequence?)
  • Use Case #N2: augmentation service updates their provided augmentations
    1. Shindy provides oai_dc to the NSDL
    2. iVia augments the Shindy items with dc:subject fields with LCC values
    3. iVia newly augments the Shindy items with new, improved dc:subject fields with LCC values
      • do we set up the process to assume augmentations supersede older versions of themselves?
(When do safe xforms happen? where are they in this sequence?)
  • Use Case #N3: auto-chosen/auto-gen item metadata is augmented by another service
    1. iVia does a crawl and provides item level metadata to the NSDL as collection wowza.
    2. ENC augments the wowza items with dct:audience fields
    3. SDSC augments the wowza items with dc:format information and information about broken links.
(is there anything special about this case, or is it the same as N1?)
(When do safe xforms happen? where are they in this sequence?)

Use Case #2: multiple equivalent resources and their relationship to augmentations on output

ENC provides NSDL with metadata augmentations asserting that specified items in a number of collections relate to the Illinois third grade science standard for basic understanding of photosynthesis. ENC provides NSDL with metadata records identifying an NSDL metadata record ID, a URL (providing an internal check as well as an additional identifier for the resource) and the DC refinement "conformsTo" specifying the particular standard to which the resource is related. This element contains the source ENC and is identified as human created data. The NSDL Simple Equivalency Service (based on resource URLs) identifies three other items in other collections that this relationship assertion applies to, and the appropriate links are made, and the resource metadata records (aggregated version only) updated in the NSDL OAI server.
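
A URL-based equivalency check of the kind mentioned above might look something like the sketch below. The normalization rules (lowercasing the host, ignoring a trailing slash and any fragment) are assumptions made for illustration; the text does not say how the Simple Equivalency Service actually compares URLs, and the function names and record ids are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce a resource URL to a comparison key (assumed rules, see above)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query string as-is.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def equivalent_records(records, target_record_id):
    """Return ids of other records whose resource URL matches the target's.

    records: mapping of metadata record id -> resource URL
    """
    by_url = defaultdict(list)
    for rec_id, url in records.items():
        by_url[normalize_url(url)].append(rec_id)
    key = normalize_url(records[target_record_id])
    return [rec_id for rec_id in by_url[key] if rec_id != target_record_id]
```

An assertion such as the ENC "conformsTo" statement could then be linked to every record id returned for the record it names.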
  • Use Case #3: Multiple providers of metadata and augmentation -- original metadata provider, NSDL (as augmenter), 3rd party augmenter, nsdl served out in various flavors
The Whomever Collection supplies NSDL with 2233 item records described with oai_dc metadata. Based on routine normalization procedures, NSDL adds several encoding schemes to the records: "URI" to the identifier element (all values begin with "http") and "DCMIType" to most of the Type values which are valid DCMIType terms. In each of these cases, the source of the data continues to be identified as the original data provider. Several weeks later, the iVia staff harvest the metadata for the collection, and feed back to NSDL LCC classification and LCSH subject headings for the collection. This information is identified as originating with iVia and also as machine generated data.
The metadata is served out in a number of flavors:
    • native_oai_dc: metadata exactly how it came to us
    • nsdl_dc: native_oai_dc plus safe xforms (was: as received, though normalized for errors and with added valid schemes)
    • nsdl_dc_plus: nsdl_dc plus augmentations (each safe xformed native record with any augmentations that apply, based on equivalence relationships)
    • nsdl_dc_gold: nsdl_dc, with erroneous values removed. Different from "nsdl_dc_plus" because fields may be removed.
    • oai_dc: the NSDL's "dumbing down" of one of the above nsdl_dc formats so we are compliant with OAI-PMH 2.0 (we must serve oai_dc)
    • "Mudball" (aggregation of all available metadata elements, with source, identified as being about a particular resource)
      • Naomi asks: does this differ from nsdl_dc_plus? If so, how?
    • nsdl_all: each unrestricted metadata format, separately, as a big tarball.
      • Includes nsdl_dc, nsdl_dc_plus, nsdl_dc_gold, native, augRec11 from iVia, augRec28 from SDSC ...
    • nsdl_search: all the metadata formats as a big tarball.
      • Perhaps only nsdl_dc_gold, not ALL nsdl_dc formats?


Use Case #4: focus on possible uses available to downstream users

ENC harvests "mudball" metadata records from NSDL to fulfil a number of specific requirements of their middle school portal:
    • They look for assertions of "conformsTo" relationships from a small number of sources that they consider reliable
    • They look for subject terms from controlled vocabularies on relevant resource metadata that they can use on their portal to provide topical navigation.
    • They look for annotations about middle school resources from teachers, librarians, and specific sources known by them to be reliable and appropriate for middle school audiences
MathForum re-harvests their metadata records from NSDL in the "nsdl_dc_plus" flavor, looking for additional metadata added by others to provide additional value on their site. They also harvest the "mudball" records from other math collections to see if they can add some resources described by others to their site, making them available to their special services.

Use Case #5: when a resource or its metadata changes or is deleted, what happens to augmentations?

    • Deletion from specific providing collection: link moves to an equivalent resource metadata record? (or doesn't it matter, so long as there's another available NSDL metadata record of that resource?)
    • Deletion of last NSDL metadata record for that particular resource (perhaps it died?):
      • mark for deletion, but run occasional report to see if some can be revived?
      • point to NSDL archived version of resource
    • Resource changes in ways that cannot be easily determined:
      • Augmentors notified to re-crawl or review
      • non-updated augmentations could be "sunsetted" after some passage of time?
    • how can we be sure disappearance is permanent vs. temporary?

Use Case #6: when an augmentation is changed or deleted, what happens?

    • a metadata augmentation is changed
      • MR picks it up on regular harvest from aug service (b/c OAI datestamp of changed record is after our "from" date argument in the OAI harvest from the aug service)
      • augmentation metadata record is updated in MR
      • changes to augmentation metadata record are duly propagated through MR storage and affected XML served out of MR
    • a metadata augmentation is deleted
      • MR harvests deletion on regular harvest from aug service (b/c OAI datestamp of deleted record is after our "from" date argument in the OAI harvest from the aug service)
        • what if aug service doesn't do persistent OAI deletes? (or transient deletes of a long period of time?)
      • augmentation metadata record is marked deleted in MR
      • deletion of augmentation metadata record is duly propagated through MR storage and affected XML served out of MR.
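
The incremental harvest described above relies on two standard OAI-PMH features: the "from" argument of ListRecords and the status="deleted" attribute on a record header. A minimal sketch using only the Python standard library is shown below. The function names, the base URL, and the "nsdl_aug" metadataPrefix in the usage note are hypothetical, and resumption tokens and error responses are ignored.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, from_date):
    """Build a ListRecords request for records changed on or after from_date."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,
    })
    return f"{base_url}?{query}"


def split_changes(response_xml):
    """Return (updated_ids, deleted_ids) found in one ListRecords response."""
    root = ET.fromstring(response_xml)
    updated, deleted = [], []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            # The aug service reported this record as deleted.
            deleted.append(identifier)
        else:
            updated.append(identifier)
    return updated, deleted
```

A call such as `list_records_url("http://aug.example.org/oai", "nsdl_aug", "2005-10-01")` yields the request URL; the ids returned by `split_changes` would then drive the "updated in MR" and "marked deleted in MR" steps. If the aug service does not keep deleted records, nothing shows up in the second list, which is exactly the open question raised above.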

Use Case #7: a simple augmentation sequence

1. we get metadata record 1 from provider Q.
2. we normalize the metadata, creating record 1N.
  • it's possible that record 1N is also the nsdl_aug/gold at this point ...
3. iVia harvests metadata record 1N from the MR's OAI server
4. iVia does its automagic thang
5. iVia exposes its metadata augmentations to the world as metadata record 1NiVia
  • is this only the newly provided and/or altered fields, or is it all the old fields plus all the new fields, or what?
6. we harvest metadata record 1NiVia from iVia
7. we normalize or otherwise alter and store the iVia aug record as record 1NiViaN
  • do we store it locally if we don't change anything?
8. we use 1NiViaN in a nsdl augmented/gold/demented record.

Use Case #8: a more complex augmentation sequence

1. we get metadata record 1 from provider Q.
2. we normalize the metadata, creating record 1N.
  • it's possible that record 1N is also the nsdl_aug/gold at this point ...
3. iVia harvests metadata record 1N from the MR's OAI server
4. iVia does its automagic thang
5. iVia exposes its metadata augmentations to the world as metadata record 1NiVia
  • is this only the newly provided and/or altered fields, or is it all the old fields plus all the new fields, or what?
6. we harvest metadata record 1NiVia from iVia
7. we normalize or otherwise alter and store the iVia aug record as record 1NiViaN
  • do we store it locally if we don't change anything?
8. we use 1NiViaN in record 1aug, a nsdl augmented/gold/demented record.
9. search service harvests record 1N or record 1aug.
10. search service discovers that the dc:format value is wrong -- it's text, not an image.
11. search service provides a correction to the dc:format field
  • via same OAI mechanisms used by md augmentation services?
12. archive service harvests record 1N or record 1aug.
13. archive service discovers that the dc:format value is wrong -- it's text/xml, not an image.
14. archive service provides a correction to the dc:format field
  • via same OAI mechanisms used by md augmentation services?
15. we harvest via OAI or otherwise get the corrections from the search service.
16. we harvest via OAI or otherwise get the corrections from the archive service.
17. what value do we use for dc:format in record 1aug, the nsdl augmented/gold/demented record?

dih and nrd 4/27/04

Use cases: Native metadata harvest, crosswalking, safe and collection-specific transformations, NSDL gold metadata

Use Case 1.

Use Case 2.

Use Case 3.

Use Case 4.

Use Case 5.

A routine crawl for item metadata is initiated via the CRS after the completion of a collection record for a resource without available metadata. The iVia Service makes machine-generated metadata available for harvest by the NSDL. Because a rights statement applying to all the resources on the site is available, but the iVia Service does not reflect that in the items, a collection-specific transform is initiated for the collection, and the appropriate statement is defaulted in the Rights element for the items.