
Harvest steps for tab-delimited files for Spotfire Data analysis:

harvest to files

note: write one file for each resumption block (chunk)
note: filename should be [service_id]_[set_id]_[metadata-prefix]_[yyyy-mm-dd-hh-mm-ss (request time)]_[00000 (chunk number)].xml
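
A minimal sketch of the chunk filename pattern in the note above. The function name and the sample values (service "nsdl", set "set1") are hypothetical. It assumes the request time is the same for every chunk of one harvest and that the chunk number is zero-padded to five digits.

```python
from datetime import datetime


def chunk_filename(service_id, set_id, metadata_prefix, request_time, chunk_number):
    """Build the per-chunk file name following the pattern in the note."""
    # Request time rendered as yyyy-mm-dd-hh-mm-ss
    stamp = request_time.strftime("%Y-%m-%d-%H-%M-%S")
    # Chunk number zero-padded to five digits (00000, 00001, ...)
    return f"{service_id}_{set_id}_{metadata_prefix}_{stamp}_{chunk_number:05d}.xml"


# Hypothetical values, for illustration only
print(chunk_filename("nsdl", "set1", "oai_dc", datetime(2010, 3, 4, 15, 30, 0), 1))
```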

  • store harvest log for each harvest with
    • stat section
      • harvest stat
        • start time
        • end time
        • record count
        • http errors
        • http redirects
      • chunk stats (written at the top of each chunk file?)
        • chunk 01
          • start time
          • end time
          • record count
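
One possible shape for the harvest log, sketched with dataclasses. The class names, field names, and the choice of JSON as the log format are assumptions; the notes above only list which values to keep.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ChunkStat:
    chunk: int
    start_time: str
    end_time: str
    record_count: int


@dataclass
class HarvestStat:
    start_time: str
    end_time: str
    record_count: int = 0
    http_errors: int = 0
    http_redirects: int = 0
    chunks: List[ChunkStat] = field(default_factory=list)

    def add_chunk(self, chunk_stat):
        """Append a chunk's stats and keep the harvest record count in step."""
        self.chunks.append(chunk_stat)
        self.record_count += chunk_stat.record_count


def harvest_log_json(stat):
    """Serialize the stat section of the harvest log (JSON is an assumed format)."""
    return json.dumps({"stat": asdict(stat)}, indent=2)
```

Keeping the harvest-level record count as a running sum of the chunk counts means the total read later in the csv step always agrees with the chunk files.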

parse the files to create csv

  • get total record count -- x -- from harvest stats, and number of files -- z
  • get number of requested records -- y -- from csv convert command args
  • divide y by z and take that many random records from each file
  • take all records if y == 0
  • open csv file for write
    note: filename should be [service_id]_[set_id]_[yyyy-mm-dd-hh-mm-ss (request time)].csv
  • for each record
    • store record.header.identifier
    • store namespaces -- record.metadata.dc xmlns:dc, xmlns:oai_dc, xmlns:xsi
    • for each row in record
      • metadata record id == record.header.identifier (stored)
      • element namespace == record.metadata.dc xmlns:dc, xmlns:oai_dc, xmlns:xsi (stored)
      • element name == record.metadata.dc.[any element]
      • element value == record.metadata.dc.[any element].value
      • element type == record.metadata.dc.[any element].type.value
      • element lang == record.metadata.dc.[any element].lang.value
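
The parse steps above could be sketched roughly as follows, using only the Python standard library. All function names are hypothetical. It assumes each chunk file is an OAI-PMH ListRecords response carrying oai_dc metadata, and that "type" and "lang" refer to the xsi:type and xml:lang attributes. As a simplification, each row records the namespace URI of its own element, not the three stored xmlns declarations.

```python
import csv
import io
import random
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def sample_records(records, requested, file_count, rng=random):
    """All records when requested (y) is 0, else y // z random records."""
    records = list(records)
    if requested == 0:
        return records
    per_file = min(requested // file_count, len(records))
    return rng.sample(records, per_file)


def record_rows(record):
    """Yield one row per metadata element of a single <record>."""
    identifier = record.findtext(f"{OAI}header/{OAI}identifier", default="")
    metadata = record.find(f"{OAI}metadata")
    if metadata is None:  # deleted records carry a header only
        return
    for container in metadata:  # e.g. the oai_dc:dc wrapper
        for element in container:
            if element.tag.startswith("{"):
                namespace, _, name = element.tag[1:].partition("}")
            else:
                namespace, name = "", element.tag
            yield [identifier, namespace, name, (element.text or "").strip(),
                   element.get(XSI_TYPE, ""), element.get(XML_LANG, "")]


def chunk_to_csv(xml_text, out, requested=0, file_count=1, delimiter=",", rng=random):
    """Write rows for the sampled records of one chunk to an open text file."""
    root = ET.fromstring(xml_text)
    records = root.findall(f".//{OAI}record")
    writer = csv.writer(out, delimiter=delimiter)
    for record in sample_records(records, requested, file_count, rng):
        writer.writerows(record_rows(record))
```

Passing delimiter="\t" would produce tab-delimited output in place of comma-separated output. The caller is expected to open the output file once and call chunk_to_csv for each chunk file.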