# How CIIM Works

# Summary

CIIM (Collections Information Integration Middleware) is an open-source framework that integrates, links and manages distributed resources of all types. It synchronises metadata and digital assets from across multiple systems (e.g. DAMS and CMS), either extracting the data centrally (on a schedule or on demand), or receiving push updates via its API.

Next, CIIM parses the raw data into a flexible, schemaless model, preserving original source documents and identifying links both within systems and across system boundaries. During metadata processing, it enriches the data by resolving and validating the relationship graph (both internal and external), generating hierarchies, applying custom validation and privacy rules, and processing media assets (e.g., creating watermarked derivatives, zooms, etc.). Each record is then assigned a data status (e.g., 'Public', 'Invalid', 'Private'). Metadata can be further enhanced by passing it through pipelines of AI services.

During media processing, CIIM automatically processes media assets: it (optionally) applies watermarking, transcodes video, generates pyramid TIFFs for IIIF delivery and deep zooming, and extracts text from documents for content searching.

For publication, 'Public' records are indexed to Elasticsearch and their associated media synced to a public store. Finally, for content delivery, systems like Collections Online access the data through Elasticsearch (often via Rosetta) and media from a file server, with Kong acting as a secure gateway.

CIIM also handles asynchronous operations, including AI enhancements, zoom generation, and maintaining a full historical audit.

This highly scalable system comfortably manages over 80 million records and is adaptable for any sector dealing with large, disparate datasets.


# I. Metadata Synchronisation

This initial stage focuses on gathering metadata from various sources.

  • Multiple Source Systems: Data flows into and through CIIM from a broad spectrum of sources, including:
    • Digital Asset Management Systems (DAMS)
    • Collections Management Systems (CMS)
    • Library Management Systems (LMS)
    • Archival Management Systems
    • Syndication APIs like OAI (Open Archives Initiative) and IIIF (International Image Interoperability Framework).
    • Various other data authoring systems, and files (e.g., Dokuwiki, Wordpress, CSV, spreadsheets, historical data dumps)
  • Data Ingestion:
    • CIIM can pull data from these source systems.
    • Data can also be pushed into CIIM via its submission API.
  • Scheduling: Synchronisation typically runs on a configurable schedule, as frequently as required, or can be triggered on demand; both are managed via the user interface.
  • Efficient Update Strategy: CIIM does not typically synchronise all data every time. Instead, it performs rolling updates based on a delta of changes from the source systems, identified using modification dates. This ensures efficient processing by only transferring updated, added, or deleted records within a given time frame.
  • Source Document Preservation: Where possible, the original data format is preserved and stored as a 'Source Document' record so that changes can be made across large data sets without having to revisit the originating systems.
  • Exception Handling and Recovery: To ensure continuous operation and data integrity, CIIM incorporates robust exception handling and retry strategies for transient failures. This allows it to gracefully handle temporary outages, network issues, or timeouts when interacting with external APIs, minimising disruption and ensuring data synchronisation completes successfully (a minimal sketch combining delta retrieval with retry handling follows this list).
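
The delta-based update strategy and retry handling above can be pictured with a short Python sketch. This is a hedged illustration only: the endpoint path, the `modified_since` parameter, the `pull_delta` function name and the backoff policy are assumptions, not CIIM's actual connector API (each source system exposes its own delta mechanism).

```python
# Illustrative only: the endpoint, the 'modified_since' parameter and the
# backoff policy are assumptions, not CIIM's actual connector API.
import time
import requests

def pull_delta(base_url, since, retries=3, backoff=5):
    """Fetch records updated, added, or deleted since the last successful sync."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(
                f"{base_url}/records",
                params={"modified_since": since.isoformat()},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()
        except (requests.ConnectionError, requests.Timeout):
            # Transient failure: back off and retry rather than aborting the sync.
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)
```

Each successful run would record its completion time, which becomes the `since` value (a datetime here) for the next scheduled run.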

# II. Metadata Parsing

Once synchronised, the raw metadata is structured and prepared.

  • Parsing Process: Original data is parsed (from source or 'Source Document') into a record representing your required data model.
  • Schemaless Model: The CIIM model is flexible and schemaless but maintains consistent structures for common elements.
  • Exploded View: Parsing creates an "exploded view" of the data, preserving the separation of entities and their distinct (though unverified at this stage) relationships (see the sketch after this list).
  • Augmentation: Metadata may be augmented with informational messages or warnings.
  • Relationship Modelling: Parsing includes modelling relationships found within the same source.
  • Output: The result is stored as a 'Source Data' record in the database.
  • External References: References to metadata from other sources are identified for later resolution.
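
To make the parsing stage concrete, the sketch below turns a hypothetical source document into a schemaless 'Source Data' record as an exploded view: entities are kept separate, same-source relationships are modelled but not yet verified, and references to other sources are recorded for later resolution. The field names and structure are assumptions for illustration, not CIIM's internal model.

```python
# Illustrative only: field names and structure are assumptions, not CIIM's model.
source_document = {
    "object_number": "1978.412",
    "title": "Bronze figure",
    "maker": {"id": "person-204", "name": "Unknown maker"},
    "related_site_ref": "SITE-0042",  # reference into a different source system
}

def parse(doc):
    """Produce a schemaless 'Source Data' record as an exploded view."""
    return {
        "record": {
            "type": "object",
            "identifier": doc["object_number"],
            "title": doc["title"],
        },
        # Entities found within the same source are kept separate...
        "entities": [
            {"type": "person", "id": doc["maker"]["id"], "name": doc["maker"]["name"]},
        ],
        # ...with their relationships modelled but not yet verified.
        "relationships": [
            {"target": doc["maker"]["id"], "relation": "made_by", "resolved": False},
        ],
        # References to other sources are recorded for resolution in stage III.
        "external_references": [
            {"source": "sites", "id": doc["related_site_ref"]},
        ],
        "messages": [],  # informational messages or warnings added during parsing
    }

source_data = parse(source_document)
print(source_data["relationships"])
```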

# III. Metadata Processing

This critical stage enriches, validates, and prepares the metadata for publication.

  • Relationship Resolution: External relationships are resolved and validated.
  • Relationship Augmentation: Metadata is augmented with those relationships.
  • Hierarchy Generation: Full (poly-)hierarchies are generated (e.g., for place, thesaurus, archival data), with warnings for cyclical hierarchies.
  • Validation & Privacy:
    • Metadata is tagged with validation errors based on your validation criteria.
    • Metadata is annotated with any Privacy reasons as defined by your business logic.
    • Additional information messages or warnings may be added.
  • Data Status Assignment: Each record receives a 'data status' based on validation, privacy, relationships, and whether it is primary data (e.g., objects vs. authority data); a sketch of this logic appears after this list.
    • Public: Eligible for publication, not private, not invalid, is primary data or links to public primary data.
    • Unavailable: A deletion stub created when the record was deleted in the source system. Cannot be published.
    • Invalid: Failed CIIM validation rules. Cannot be published.
    • Retained: Not primary data and has no relationships to Public primary data. Cannot be published.
    • Recalled: Manually deleted via takedown. Cannot be published.
    • Private: Matched a CIIM privacy rule. Cannot be published.
    • Draft: Created via CIIM Management Interface, not yet submitted to CIIM Core.
  • Output: The result is stored as a 'Processed Data' record.
  • Further Augmentation: Metadata can be augmented with 'edits' or 'enhancements' (e.g., from analytics, AI services). Media metadata is augmented with details about generated zooms and derivatives.
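
The data status rules above can be pictured as a simple decision function. The sketch below is illustrative only: the record fields and the ordering of the checks are assumptions, and in practice the rules are driven by each installation's validation and privacy configuration ('Draft' is omitted because it only arises via the CIIM Management Interface).

```python
# Illustrative only: field names and the precedence of checks are assumptions.

def assign_data_status(record):
    """Return a data status roughly following the definitions above."""
    if record.get("deleted_in_source"):
        return "Unavailable"   # deletion stub from the source system
    if record.get("taken_down"):
        return "Recalled"      # manually removed via takedown
    if record.get("validation_errors"):
        return "Invalid"       # failed CIIM validation rules
    if record.get("privacy_reasons"):
        return "Private"       # matched a privacy rule
    if record.get("is_primary") or record.get("links_to_public_primary"):
        return "Public"        # eligible for publication
    # Authority data with no links to Public primary data is kept but not published.
    return "Retained"

print(assign_data_status({"is_primary": True}))                       # Public
print(assign_data_status({"validation_errors": ["missing title"]}))   # Invalid
```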

# IV. Media Processing

This stage of the pipeline checks for the existence of related media assets, resolves their location (facilitating reporting on missing assets and relationships), and applies the required processing criteria.

  • Artifact resolution: Digital media assets described by metadata records are retrieved.
  • Artifact generation: Different-sized artifacts/derivatives (e.g., thumbnails) are generated (using GraphicsMagick and VIPS) and stored locally (see the sketch after this list).
  • Watermarking and embedded metadata: Optional watermarking and metadata embedding can be applied (for example, with metadata derived from a linked object).
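
As one way to picture artifact generation, the sketch below uses pyvips (the Python binding for the VIPS library mentioned above) to produce derivatives at several sizes. The sizes, paths and output naming are assumptions, and watermarking and metadata embedding are omitted for brevity.

```python
# Illustrative only: sizes, paths and naming conventions are assumptions.
import pyvips

DERIVATIVE_WIDTHS = {"thumbnail": 256, "preview": 1024, "large": 2048}

def generate_derivatives(master_path, output_dir):
    """Create resized derivatives of a master image and store them locally."""
    for name, width in DERIVATIVE_WIDTHS.items():
        # thumbnail() loads and resizes in one pass, keeping memory use low.
        image = pyvips.Image.thumbnail(master_path, width)
        image.write_to_file(f"{output_dir}/{name}.jpg", Q=85)

generate_derivatives("masters/1978-412.tif", "derivatives/1978-412")
```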

# V. Publication

This stage makes the processed data accessible to external systems.

  • Elasticsearch Indexing: Typically, only Public status metadata records are indexed to one or more Elasticsearch endpoints, often on public-facing servers in a DMZ.
  • Media Synchronisation: Derivatives for Public status media records are synchronised to a public media store (via SSH, rsync, DeltaCopy).
  • Removal of Non-Public Records: Records that are no longer Public are removed from Elasticsearch (see the sketch after this list).
  • Metadata Merging: Partial or full authority metadata records can be merged onto 'primary data' records for convenience and searchability.
  • Transformations: Additional metadata transformations may be applied for content delivery.
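
Here is a minimal sketch of the publication step using the official Python Elasticsearch client: records with Public status are indexed, and records that are no longer Public are removed. The index name, document shape and connection details are assumptions.

```python
# Illustrative only: index name, document shape and host are assumptions.
from elasticsearch import Elasticsearch, NotFoundError

es = Elasticsearch("https://search.example.org:9200")

def publish(record):
    """Index Public records; remove records that are no longer Public."""
    if record["data_status"] == "Public":
        es.index(index="collections", id=record["id"], document=record["processed"])
    else:
        try:
            es.delete(index="collections", id=record["id"])
        except NotFoundError:
            pass  # never published, nothing to remove
```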

# VI. Content Delivery to Collections Online

The published data is indexed in a way that optimises search and content discovery at scale.

  • Data Retrieval: Collections Online retrieves data directly from public Elasticsearch or via Rosetta (a proxy layer that simplifies search and provides additional capabilities); a search sketch follows this list.
  • Media Retrieval: Collections Online retrieves media assets from the media file server.
  • Gateway: Kong serves as the gateway, sitting in front of Rosetta, Elasticsearch, and the media store.
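
To illustrate content delivery, the sketch below runs a simple full-text search against the public index through a gateway URL, in the way a front end such as Collections Online might. The host name, route and query shape are assumptions; in a real deployment the request would pass through Kong to Rosetta or Elasticsearch.

```python
# Illustrative only: host, route and query shape are assumptions.
import requests

GATEWAY = "https://api.example.org"  # Kong, fronting Rosetta / Elasticsearch / media

def search_collections(term, size=10):
    """Run a simple full-text search against the public index via the gateway."""
    query = {
        "query": {"multi_match": {"query": term, "fields": ["title", "description"]}},
        "size": size,
    }
    resp = requests.post(f"{GATEWAY}/collections/_search", json=query, timeout=10)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]

for record in search_collections("bronze figure"):
    print(record.get("title"))
```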

# VII. Auditing

Data ingestion, processing, and publication are all audited.

  • Extraction: Where source systems support it, the ingested data is regularly audited to ensure it remains up to date.
  • Processing: All internal CIIM processes are audited, so you can be confident that what you see in the CIIM interface is current.
  • Publication: Published metadata and media assets are audited to confirm that your public-facing data and media are where they should be and at the correct version (a reconciliation sketch follows this list).
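
Publication auditing amounts to reconciling the database (the source of truth) against the public stores. The sketch below compares the set of records that should be Public with the set an index actually reports; the inputs and function name are assumptions, and in practice the ID sets would be drawn from the CIIM database and the public Elasticsearch index or media store.

```python
# Illustrative only: inputs would come from the CIIM database and the
# public Elasticsearch index / media store respectively.

def audit_publication(expected_public_ids, indexed_ids):
    """Report records that should be public but aren't, and vice versa."""
    expected, indexed = set(expected_public_ids), set(indexed_ids)
    return {
        "missing_from_index": sorted(expected - indexed),   # needs (re)publication
        "unexpected_in_index": sorted(indexed - expected),  # needs removal
    }

report = audit_publication({"obj-1", "obj-2", "obj-3"}, {"obj-2", "obj-3", "obj-9"})
print(report)  # {'missing_from_index': ['obj-1'], 'unexpected_in_index': ['obj-9']}
```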

# VIII. Asynchronous Operations

Several operations run outside the main detailed flow, often triggering reprocessing events.

  • Zoom Artifact Generation: Pyramid TIFFs (for deep zoom and IIIF) are optionally generated, either kept as local copies or written directly to the public media server, potentially triggering reprocessing of impacted records (see the sketch after this list).
  • AI Service Enhancements: Metadata can pass through AI services for 'enhancements,' which may trigger reprocessing events.
  • Historical Audit: Since v7.8.1, historical versions of 'Source Document', 'Source Data', and 'Processed Data' are persisted to a separate archive database during 'system cleanup' for auditing.
  • Event-Triggered Auditing: Audit routines ensure Elasticsearch and Media stores remain in sync with the database (the source of truth).
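
As an example of zoom artifact generation, pyvips can write tiled, pyramidal TIFFs of the kind used by deep zoom and IIIF image servers. The tile size, compression settings and paths below are assumptions, not CIIM's configuration.

```python
# Illustrative only: tile size, compression and paths are assumptions.
import pyvips

def generate_pyramid_tiff(master_path, output_path):
    """Write a tiled, pyramidal TIFF for deep zoom / IIIF image servers."""
    image = pyvips.Image.new_from_file(master_path, access="sequential")
    image.tiffsave(
        output_path,
        tile=True,          # tiled layout so viewers can fetch regions on demand
        pyramid=True,       # embedded reduced-resolution layers for zooming out
        compression="jpeg",
        Q=85,
        tile_width=256,
        tile_height=256,
    )

generate_pyramid_tiff("masters/1978-412.tif", "zooms/1978-412.tif")
```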