CDStar is a package-oriented data-management framework for scientific and other data-driven applications. It enables the development of powerful tools and workflows against a simple and stable REST interface that hides the details and complexities of the actual storage back-end in use.

The CDStar storage API is organized into vaults, archives (or packages) and files: A vault can store any number of archives and distributes them transparently across different storage infrastructures. Each archive is identified by a unique ID and contains a list of named files. Within an archive, files can be organized in folder structures and annotated with searchable attributes. Archives themselves can also be annotated. The search integration indexes attributes and file content to allow near-real-time search across an entire vault.

Getting started

This guide is a step-by-step tutorial that shows how to install, configure, and use cdstar in a simple example setup. You will download and run cdstar locally, configure a single vault and store some files. All you need is a computer with Java (11 or newer) installed. This tutorial assumes you are running some flavor of Linux.

Installation

CDSTAR is written in Java and the "cdstar.jar" binary distribution runs on any platform with a compatible Java Runtime Environment (OpenJDK or Oracle Java 11 or newer). There are several ways to obtain a recent version of cdstar, described below.

Download binary release

wget https://cdstar.gwdg.de/release/dev/cdstar.jar

Older and stable releases are also available here: https://cdstar.gwdg.de/release/

Build from source

Building CDSTAR requires a Java JDK (Java 11 or newer) and Maven. The CDSTAR source distribution ships with a Maven wrapper script (./mvnw or ./mvnw.bat) that fetches the correct version of Maven and should be preferred over whatever Maven version is offered as a system package by your distribution.

Install build dependencies
sudo apt install git build-essential # for 'git' and 'make'
sudo apt install default-jdk-headless
Checkout source code
git clone https://gitlab.gwdg.de/cdstar/cdstar.git
cd cdstar
Build standalone server executable
make cdstar.jar
# or manually:
./mvnw -pl cdstar-cli -am -DskipTests=true -Pshaded clean package
cp cdstar-cli/target/cdstar-cli-*-shaded.jar cdstar.jar
Tip
The -DskipTests=true parameter will save you some time. Releases are always tested before they are published, so there is no point in running all tests again.

Configuration

CDStar can read configuration from YAML and JSON files, whichever you prefer. Here is a small example to get you started:

Example cdstar-demo.yaml
---
path.home: /tmp/cdstar-demo
vault.demo:
  create: True
  public: True
  pool.autotrim: True

realm.static:
  class: StaticRealm
  # This role can create, read and list archives in the 'demo' vault.
  role.demoRole: vault:demo:create, vault:demo:read, vault:demo:list
  # This group inherits all permissions from 'demoRole'.
  group.demoGroup: "demoRole"
  # This user has the password 'test' and belongs to the 'demoGroup'.
  # Password hashes can be computed using cdstar.jar:
  #   $ java -cp cdstar.jar de.gwdg.cdstar.auth.realm.StaticRealm
  user.test:
    password: "cGxhaW4=:FmtSc7NSX8fsjLTmpLpoqRLP4vqWFg/r5uy3EU6JsEs="
    groups: demoGroup
    # permissions: ...
    # roles: ...
Note
A secure password-hash can be generated with the java -cp cdstar.jar de.gwdg.cdstar.auth.realm.StaticRealm tool.

The only required parameter is path.home. Everything else is optional. See Configuration for details.

First run

$ java -jar cdstar.jar -c cdstar-demo.yaml run -p 8080
Test if server is running
curl http://localhost:8080/v3/

Command line parameters

Example: cdstar --help
Usage: cdstar [-h] [--version] [--log-config=<file>] [-c=<file>]...
              [-C=<key=value>]... [--debug=<logger>]... [COMMAND]
Run or manage CDSTAR instances
  -c, --config=<file>       Load configuration from this file(s). Prefix the
                              filename with '?' to mark it optional
  -C=<key=value>            Override individual configuration parameter. Use
                              'KEY=VALUE' to override or 'KEY+VALUE' to append
      --debug=<logger>      Increase logging for specific packages. The value
                              'ROOT' may be used as an alias for the root logger
  -h, --help                Print help and exit
      --log-config=<file>   Provide a custom log4j2.properties file
      --version             Show version string and exit
Commands:
  run     Start server instance
  config  Manage configuration
  vault   Manage vaults
Example: cdstar run --help
Usage: cdstar run [-h] [-b=<bind>] [-p=<port>]
Start server instance
  -b, --bind=<bind>   Override 'http.host' setting
  -h, --help          Print help and exit
  -p, --port=<port>   Override 'http.port' setting

Run as a service

CDStar can be compiled into a cdstar.war file and run within a servlet container, but this is not recommended and not officially supported. CDStar also does not offer any built-in daemonizing capabilities. If you want to run cdstar as a long-running background process, use proper system tools such as systemd or supervisord; fall back to traditional init.d scripts with start-stop-daemon only as a last resort.

Systemd: Example config
# /etc/systemd/system/cdstar.service
[Unit]
Description=CDStar Storage Service
After=syslog.target network.target remote-fs.target

[Service]
User=cdstar
ExecStart=/usr/bin/java -jar /path/to/cdstar.jar -c /etc/cdstar/cdstar.yaml run -p 8080

[Install]
WantedBy=multi-user.target
Systemd: Enable service
sudo systemctl enable cdstar.service

Tutorial

For this tutorial we are using the excellent Python requests library and assume that you already have an instance up and running on http://localhost:8080/ with an account that is allowed to create archives in a vault named demo.

Creating our first Archive

To begin, we import some helpful functions from the 'requests' module, define our API base URL and create our first archive.

Setup and create a new archive
>>> from requests import get, post, put, delete
>>> baseurl = 'http://test:test@localhost:8080/v3'
>>> r = post(baseurl + '/demo/')
>>> r.status_code
201
>>> r.headers['Location']
'/v3/demo/ab587f42c2570a884'
>>> r.json()
{
    'id': 'ab587f42c2570a884',
    'vault': 'demo',
    'revision': '0'
}

CDStar returns JSON most of the time, so we can use requests.Response.json() to parse the response directly into a Python dictionary. In this case, we are only interested in the id field of the response. This string identifies our archive within a vault and can be used to build the archive URL. Alternatively, we could just follow the Location header.

The archive is still empty. We can list its content with a simple GET request.

Show Archive Info
>>> get(baseurl + '/demo/ab587f42c2570a884').json()
{
  'id': 'ab587f42c2570a884',
  'vault': 'demo',
  'revision': '0',
  'created': '2016-12-20T13:59:37.160+0000',
  'modified': '2016-12-20T13:59:37.231+0000',
  'file_count': 0
}

As you can see, there are no files in this archive. Let’s change that and upload some files.

Upload Files

There are multiple ways to populate an archive. The simplest way is to send multipart/form-data POST requests to the archive URL. Each file upload with a name that starts with a slash (e.g. /example.txt) creates a new file in our archive.

Upload files
>>> files = {'/report.xls': open('report.xls', 'rb')}
>>> post(baseurl + '/demo/ab587f42c2570a884', files=files).json()
{
    'id': 'ab587f42c2570a884',
    'vault': 'demo',
    'revision': '1',
    'report': [ {
        'change': 'file',
        'file': {
            'name': 'report.xls',
            'type': 'application/vnd.ms-excel',
            'size': 65992,
            'created': '2016-12-20T13:59:37.217+0000',
            'modified': '2016-12-20T13:59:37.218+0000',
            'digests': {
                'md5': '1a79a4d60de6718e8e5b326e338ae533',
                'sha1': 'c3499c2729730a7f807efb8676a92dcb6f8a3f8f',
                'sha256': '50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c'
            }
        }
    } ]
}

The response is JSON again and contains a list of all files that changed during the last request. We can use this info to double-check if everything was uploaded correctly.

Annotate Archives and Files

Now we want to attach some meta attributes to our archive and the file we just uploaded. We simply send another POST request to the same URL, but this time we use form fields starting with meta: to define new meta attributes on the archive or a file within the archive.

Set metadata properties
>>> data = {
...  'meta:dc:title': 'My Report Archive',             (1)
...  'meta:dc:title:/report.xls': 'My Report',          (2)
...  'meta:dc:contributor': ['Alice', 'Bob'],          (3)
... }
>>> post(baseurl + '/demo/ab587f42c2570a884', data=data).json()
{
    'id': 'ab587f42c2570a884',
    'vault': 'demo',
    'revision': '2',
    'report': [ {
        'change': 'meta',
        'field': 'dc:title',
        'values': ['My Report Archive']
    }, {
        'change': 'meta',
        'field': 'dc:contributor',
        'values': ['Alice', 'Bob']
    }, {
        'change': 'meta',
        'field': 'dc:title',
        'file': 'report.xls',
        'values': ['My Report']
    } ]
}
  1. Meta form fields start with meta: followed by the field name.

  2. If a meta attribute should be set on a specific file instead of the archive, you can specify the file name after the field name, separated by a /.

  3. Some meta attributes accept more than a single value.

Just like the file upload example from above, we get back a report of everything that changed.

Tip
You can upload multiple files and set multiple meta-attributes with a single request. It is even possible to create a fully populated archive in a single step by submitting the POST request to the createArchive endpoint.
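
As a quick sketch (using the file and field names from above), a single POST to the vault URL can upload files and set metadata at once:

Create a populated archive in one request
>>> files = {'/report.xls': open('report.xls', 'rb')}
>>> data = {'meta:dc:title': 'My Report Archive'}
>>> r = post(baseurl + '/demo/', files=files, data=data)
>>> r.status_code
201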

List Files and Meta-Attributes

Let us have a look at our archive again and also request file and meta-attribute listings this time.

Show Archive Info
>>> get(baseurl + '/demo/ab587f42c2570a884?with=files,meta').json()
{
  'id': 'ab587f42c2570a884',
  'vault': 'demo',
  'revision': '2',
  'created': '2016-12-20T13:59:37.160+0000',
  'modified': '2016-12-20T13:59:37.231+0000',
  'file_count': 1,
  'meta': {
    'dc:title': ['My Report Archive']
  },
  'files': [ {
    'name': '/report.xls',
    'type': 'application/vnd.ms-excel',
    'size': 65992,
    'created': '2016-12-20T13:59:37.217+0000',
    'modified': '2016-12-20T13:59:40.114+0000',
    'digests': {
      'md5': '1a79a4d60de6718e8e5b326e338ae533',
      'sha1': 'c3499c2729730a7f807efb8676a92dcb6f8a3f8f',
      'sha256': '50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c'
    },
    'meta': {
      'dc:title': ['My Report']
    }
  } ]
}

The files and meta fields are hidden by default and only included if you add with=files,meta as a query parameter. For large archives, you can even filter and paginate the returned information. See getArchiveInfo for details.

Direct File API (CRUD)

Each file within an archive has its own URL, for example /demo/ab587f42c2570a884/some/file.txt. You can create, read, update or delete individual files by sending the respective PUT, GET, POST or DELETE requests to these URLs, which is sometimes a lot easier than working with the form-based API described earlier, especially from within scripts or programmable REST clients.

First, let’s upload a new file to the archive. Just PUT the raw file content to the file URL.

Example: Create or replace a file
>>> with open('example.txt', 'rb') as fp:
...     put(baseurl + '/demo/ab587f42c2570a884/some/example.txt', data=fp).json()
{
  'name': 'some/example.txt',
  'type': 'text/plain',                             (1)
  'id': '4e2cdf90ae00bff1e2bad79ffebdb63b',         (2)
  'size': 12,
  'created': '2017-07-25T11:08:02.558+0000',
  'modified': '2017-07-25T11:08:02.602+0000',
  'digests': {
    'sha256': '30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58',
    'sha1': '3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3',
    'md5': 'b6d81b360a5672d80c27430f39153e2c'
  }
}
  1. The type is auto-detected from the file name if you do not specify a Content-Type header.

  2. The id of a file does not change, even if you rename or modify it.

If you need more control over whether a file should be overwritten or not, you can add one of the following conditional headers to your request:

Table 1. Conditional headers for PUT requests
Header Description

If-None-Match: *

Create new file. If the file already exists, it is not modified.

If-Match: *

Update existing file. If the file does not exist, it is not created.

You should check for 412 Precondition Failed errors in your application if you use these headers.
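
For example, a create-only upload might look like this (a minimal sketch, reusing the file from above, which already exists):

Example: Create-only upload
>>> r = put(baseurl + '/demo/ab587f42c2570a884/some/example.txt',
...         data=b'new content', headers={'If-None-Match': '*'})
>>> r.status_code   # 412, because the file already exists
412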

Once the file is stored in the archive, you can retrieve it using the same URL.

Example: Download a file
>>> r = get(baseurl + '/demo/ab587f42c2570a884/some/example.txt', stream=True)
>>> with open("download.txt", 'wb') as fd:
...     for chunk in r.iter_content(chunk_size=1024*8):
...         fd.write(chunk)

This downloads the entire file and stores it locally. You can also request parts of the file (using Range headers) and make your request conditional (If-Match, If-None-Match, If-Modified-Since, If-Unmodified-Since and If-Range headers are fully supported).
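
A ranged request is a plain GET with a Range header (a sketch; here we fetch the first four bytes of the file from above):

Example: Partial download
>>> r = get(baseurl + '/demo/ab587f42c2570a884/some/example.txt',
...         headers={'Range': 'bytes=0-3'})
>>> r.status_code   # 206 Partial Content
206
>>> len(r.content)
4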

Instead of the actual file content, you can also request the file attributes or meta-attributes via the info and meta sub-resources.

Example: Get file attributes or meta-attributes
>>> get(baseurl + '/demo/ab587f42c2570a884/some/example.txt?info').json()
{
  'name': 'some/example.txt',
  'type': 'text/plain',
  'id': '4e2cdf90ae00bff1e2bad79ffebdb63b',
  'size': 12,
  'created': '2017-07-25T11:08:02.558+0000',
  'modified': '2017-07-25T11:08:02.602+0000',
  'digests': {
    'sha256': '30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58',
    'sha1': '3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3',
    'md5': 'b6d81b360a5672d80c27430f39153e2c'},
}
>>> get(baseurl + '/demo/ab587f42c2570a884/report.xls?meta').json()
{
  'dc:title': ['My Report']
}
Tip
Since meta is a sub-resource of info, you can fetch both at the same time via ?info&with=meta.

And finally: Deleting individual files is just a plain and simple DELETE request.

Example: Delete file from archive
>>> delete(baseurl + '/demo/ab587f42c2570a884/some/example.txt')

That's it for now. To be continued…

Configuration

CDStar is configured via configuration files (YAML or JSON), command-line arguments or environment variables, or a combination thereof. In any case, configuration is treated as a flat list of dot-separated keys and plain string values (e.g. key.name=value). File formats that support advanced data types and nesting (namely JSON and YAML) are flattened automatically when loaded. Arrays or multiple values for the same key are simply joined into a comma-separated list.

Example: Nested documents are flattened automatically.
---
# Nested document
path:
  home: "/mnt/vault"
vault.demo:
  create: True
---
# Flattened form
path.home: "/mnt/vault"
vault.demo.create: "True"
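
The flattening rule itself is simple; the following Python sketch (a hypothetical helper, not part of cdstar) makes it concrete:

def flatten(doc, prefix=''):
    """Flatten nested dicts into dot-separated keys; join lists by comma."""
    flat = {}
    for key, value in doc.items():
        full = prefix + '.' + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, full))
        elif isinstance(value, list):
            flat[full] = ','.join(str(v) for v in value)
        else:
            flat[full] = str(value)
    return flat

>>> flatten({'path': {'home': '/mnt/vault'}, 'vault.demo': {'create': True}})
{'path.home': '/mnt/vault', 'vault.demo.create': 'True'}
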
Value references

Values may contain references to other keys (e.g. ${path.home}) or environment variables (e.g. ${ENV_NAME}). The latter is recommended for sensitive information that should not appear in config files or command line arguments (e.g. passwords). A colon (:) is used to separate the reference from an optional default value.

For example, ${CDSTAR_HOME:/var/lib/cdstar} would be replaced by the content of the CDSTAR_HOME environment variable, or the default path if the environment variable is not defined.

Disk Storage

CDStar stores all its data and internal state on the file system. You usually only need to set path.home, as all other parameters default to subdirectories of the path.home directory.

path.home

This directory is used as the base directory for the other paths. (default: ${CDSTAR_HOME:/var/lib/cdstar/})

path.data

Storage location for archive-data and runtime information. CDStar creates a subdirectory for each vault and follows symlinks, which makes it easy to split the storage across several mounted disks. (default: ${path.home}/data)

path.var

Storage location for short-lived temporary data. Do NOT use a ramdisk or other volatile storage, as transaction and crash-recovery data will also be stored here. (default: ${path.home}/var)

path.lib

Plugins and extensions are searched for in this directory, if they are not found on the default java classpath. (default: ${path.home}/lib)

Transports

CDStar supports http and https transports out of the box. By default, only the unencrypted http transport is enabled and binds to localhost port 8080. The high port number allows CDStar to run as non-root, which is the recommended mode of operation.

External access should be encrypted and off-loaded to a reverse proxy (e.g. nginx) for security and performance reasons. Only enable the built-in https transport for testing or if you know what you are doing.

http.host

IP address to bind to. A value of 0.0.0.0 will bind to all available interfaces at the same time. (default: 127.0.0.1).

http.port

Network port to bind to. Ports below 1024 require root privileges (not recommended). A value of 0 will bind to a random free port. A value of -1 will disable this transport. (default: 8080)

https.host

IP address to bind to. (default: ${http.host})

https.port

Network port to listen on. (default: 8443)

https.certfile

Path to a *.pem file containing the certificate chain and private key. (required)

https.h2

Enable HTTP/2. This requires Java 9+ and should be considered experimental. (default: false)

Public REST API

The REST API is exposed over all configured transports.

api.dariah.enable

Enable or disable the dariah REST API. (default: False)

api.v2.enable

Enable or disable the legacy v2 REST API. (default: False)

api.v3.enable

Enable or disable the current v3 REST API. (default: True)

api.context

Provide the public service URL. This is required if cdstar runs behind a reverse proxy or load balancer and cannot detect its public URL automatically. (default: /)

Vaults

Vaults are usually created at runtime via the management API, but can also be bootstrapped from configuration. Statically configured vaults are created at startup if they do not exist, and ignored otherwise. It is not possible to change the parameters of a vault via configuration after it has been created.

vault.<name>.create

If true, create this vault on startup if it does not exist already.

vault.<name>.public

If true, allow public (non-authenticated) read access to this vault. Archive permissions are still checked.

Each vault is backed by a storage pool, which can be configured as part of the vault configuration. The default pool configuration looks like this, and may be overridden if needed (experimental, not recommended).

vault.<name>.pool.class

Storage pool class or name. Defaults to the NioPool class.

vault.<name>.pool.name

Storage pool name. Defaults to the vault name.

vault.<name>.pool.path

Data path for this storage pool. Defaults to ${path.data}/${name}.

Other StoragePool implementations may accept additional parameters.

Plugins may also read vault-level configuration to control vault-specific behavior. The DefaultPermissions feature for example controls the permissions defined on newly created archives and can be configured differently for each vault.

Realms

Realms manage authentication and authorization in CDStar. For a simple setup with only a handful of users, you usually only need a single 'default' realm (e.g. StaticRealm) with everything configured within the same config file. More complex scenarios (e.g. LDAP, JWT or SAML auth) are supported via specialized implementations of the Realm interface (e.g. StaticRealm, JWTRealm or LdapRealm) and can be combined in many ways.

realm.<name>.class

Realm implementation to use. Either a simple class name or a fully qualified java class. (required)

realm.<name>.<field>

Additional realm configuration.

See Realms for a list of available realm implementations and their configuration options.

Warning
If no realm is configured, cdstar adds an 'admin' user with a randomly generated password to the implicit 'system' realm. The password is logged to the console on startup and changes every restart.
Tip
Realms are no different from plugins. They are only configured in a separate realm.* name-space to avoid accidental misconfiguration.

Plugins and Extensions

CDSTAR can be extended with custom implementations for event listeners, storage pools, long-term storage adapters and many other interfaces. These can be referenced by name, simple class-name or fully qualified java class name.

plugin.<name>.class

Plugin to load. Either a name, a simple class name or a fully qualified Java class name.

plugin.<name>.<field>

Additional plugin configuration.

Example
plugin.ui:
   class: UIBlueprint
plugin.bagit:
   class: de.gwdg.cdstar.runtime.lts.bagit.BagitTarget
   path: ${path.home}/bagit/

API Basics

The cdstar HTTP API is the primary method for accessing CDStar instances. Requests are made via HTTP to one of the documented API Endpoints and responses are returned mostly as JSON documents for easy consumption by scripts or client software.

The current stable HTTP API is reachable under the /v3 path on a cdstar server. Other APIs (e.g. legacy-v2, dariah or S3) may be available under different paths on the same server, but these are not part of this chapter.

Basics

The cdstar HTTP API follows RESTful principles. The core concepts are described here. You can skip this section if you are already familiar with HTTP and REST.

HTTP Methods

CDStar API Endpoints make use of the following standard HTTP request methods:

Table 2. Standard HTTP methods
Method Description

GET

Retrieve a resource or sub-resource. This is a read-only operation and never changes the state of the resource or other resources.

HEAD

Same as GET, but does not return a response body. This can be used as a light-weight alternative to GET requests if only the status code or header values are of interest.

POST

Create or update a resource, or perform a modifying server-side operation.

PUT

Create or replace a resource with the content of the request.

DELETE

Remove a resource.

HTTP Method override

Some proxies restrict or lack support for certain HTTP methods, such as DELETE. In this case, a client may send a POST request with a non-standard X-HTTP-Method-Override header instead. The value of this header is used as a server-side override for the actual HTTP method.
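
With requests, the override is just an extra header on a POST (a sketch, using the tutorial's baseurl and archive from above):

Example: DELETE via method override
>>> post(baseurl + '/demo/ab587f42c2570a884/some/example.txt',
...      headers={'X-HTTP-Method-Override': 'DELETE'})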

HTTP Response Codes

Each of the API Endpoints defines a number of possible HTTP response status codes and their meaning. The following list summarizes all status codes used by this API and provides a general description.

200 OK: Request completed successfully. The response contains the requested resource.

201 Created: Resource created successfully. The location of the newly created resource can be found in the response Location header.

304 Not Modified: The requested resource has not changed since the client last requested it, given If-Modified-Since, If-None-Match or other conditional request headers were supplied.

400 Bad Request: The request violates the HTTP protocol or this API specification. A detailed error description is contained within the response.

401 Unauthorized: The requested resource requires Authentication.

403 Forbidden: The client is authenticated, but not authorized to access the requested resource or perform the requested operation.

404 Not Found: The requested resource does not exist, or the client is not allowed to know if it exists or not.

409 Conflict: The request could not be completed due to a conflict with the current state of the target resource. This code is used in situations where the user might be able to resolve the conflict and resubmit the request.

423 Locked: The requested resource is currently not available and additional steps are required to make it available again.

500 Internal Server Error: An error occurred on server side that cannot be fixed by the client. Try again later.

501 Not Implemented: The requested functionality is part of this API, but not implemented by the service.

503 Service Unavailable: The server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay.

507 Insufficient Storage: Storage or quota not sufficient to perform this operation.

Caution
Please note that some APIs may return 404 Not Found instead of 403 Forbidden or 401 Unauthorized if the client has insufficient permissions to access a resource. This is to prevent leakage of information to unauthorized users (e.g. the existence of a private archive or a file within an archive).

Parameter types

API Endpoints may accept request parameters of various types, either via the query string part of the request URL, or as fields within a multipart/form-data formatted POST request, or both. In any case, each parameter is associated with a value type and interpreted according to the following table:

Table 3. Parameter Types
Name Description

boolean

Either true or false (case-insensitive). If this parameter is present, but has no value (or an empty string value), it is considered true. The boolean parameters param, param= and param=true all evaluate to true.

int

A signed integer number between -2147483648 and 2147483647.

long

A 64-bit signed integer number between -9223372036854775808 and 9223372036854775807.

double

A decimal number in a format parseable by the Java Double.parseDouble(String) method. Examples: -2.4, .23, 1.0e-9, NaN.

string

An arbitrary UTF-8 encoded string value.

enum

A value out of a predefined set of possible values. The valid values and their meanings are listed in the parameter description.

list({type})

This parameter accepts multiple values of the enclosing type. Clients may repeat this parameter once for each value. Some parameters may also accept a comma separated list.

file

This parameter type is only supported as part of a multipart/form-data formatted POST request and refers to a file upload as it would result from an <input type="file"> HTML form element. The multipart part must define a Content-Disposition header with a filename property in order to be recognized as a file upload.

glob

A file-name matching glob pattern. See Glob syntax

Glob syntax

Glob patterns are a simple way to filter or match file-names within an archive against a specific pattern. There is no real standard for glob patterns and existing implementations differ slightly. This is why CDStar implements its own subset of the most commonly used rules:

Table 4. Glob Syntax
Pattern Description

?

Matches a single character within a path segment. Does not match the path separator / (forward slash).

*

Matches any number of characters within a path segment, including an empty string. Does not match the path separator.

**

Matches any number of characters, including the path separator.

If the whole pattern starts with the path separator / (forward slash), then the entire path is matched against the pattern. Otherwise, a partial match at the end of the path is sufficient. The pattern *.pdf for example would return all PDF files within an archive, but /*.pdf would only return PDF files located directly within the root folder.

As mentioned above, single wildcards only match within a path segment, which means both ? and * do not expand across path separators (/). The pattern docs/*.pdf would find /docs/file.pdf but not /docs/subfolder/file.pdf. Use two adjacent asterisks (e.g. docs/**.pdf) to include subfolders in your search. The table below shows some patterns and their regular expression equivalents; a small code sketch follows it.

Table 5. Glob Patterns as Regular Expressions

Glob Pattern      Regular Expression     Examples

*.pdf             [^/]*\.pdf$            /file.pdf (match)
                                         /file.tex (no match)
                                         /folder/subfolder/file.pdf (match)

/*.pdf            ^/[^/]*\.pdf$          /file.pdf (match)
                                         /file.tex (no match)
                                         /folder/subfolder/file.pdf (no match)

/folder/**.pdf    ^/folder/.*\.pdf$      /file.pdf (no match)
                                         /file.tex (no match)
                                         /folder/subfolder/file.pdf (match)

/201?/**.csv      ^/201[^/]/.*\.csv$     /2016/report.csv (match)
                                         /2017/draft/report.csv (match)
                                         /2007/report.csv (no match)
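
The following Python sketch (a hypothetical helper, not the server implementation) translates the rules from Table 4 into regular expressions and reproduces the mapping shown in Table 5:

import re

def glob_to_regex(pattern):
    # '**' matches across path separators; '*' and '?' stay within a segment.
    out, i = [], 0
    while i < len(pattern):
        if pattern[i:i+2] == '**':
            out.append('.*')
            i += 2
        elif pattern[i] == '*':
            out.append('[^/]*')
            i += 1
        elif pattern[i] == '?':
            out.append('[^/]')
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    body = ''.join(out) + '$'
    # A leading '/' anchors the whole path; otherwise a suffix match suffices.
    return '^' + body if pattern.startswith('/') else body

>>> glob_to_regex('/folder/**.pdf')
'^/folder/.*\\.pdf$'
>>> bool(re.search(glob_to_regex('*.pdf'), '/folder/subfolder/file.pdf'))
True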

Authentication

CDStar can be configured with one or more authentication realms, implementing various ways of authenticating and authorizing client requests against the service. From the HTTP API point of view, there are mostly two ways to authenticate:

Password Authentication

HTTP Basic Authentication is a stateless and simple authentication scheme most suitable for scripting or simple client applications. Username and password are transmitted with each request in cleartext, so this scheme should NOT be used over unencrypted connections.

$ curl -u username https://cdstar.gwdg.de/v3/...

Some realms may require a fully qualified username in the form of username@realm, but most realms also accept unqualified logins. If the username itself contains an @, then it MUST be qualified to avoid ambiguity.

Token Authentication

Token authentication is handled via an Authorization: bearer <token> header. Alternatively, the non-standard X-Token header or token query parameter can be used, but these are not recommended. Acquiring a token is not part of this API and depends heavily on the configured token realm (e.g. JWTRealm). For this example we assume that the client already obtained an access token.

curl -H "Authorization: bearer OAUTH-TOKEN" https://cdstar.gwdg.de/v3/...

In order to embed resources into HTML pages (e.g. images) or provide time-limited download links, a special token with limited access rights can be attached to the URL of GET requests via the token query parameter. As with access tokens, the method to obtain such tokens is not part of this API.

<img src="/v3/myVault/85a031d6e08d/image.png?token=READ-TOKEN" />

Authorization

CDStar implements a flexible authorization and permission system with fine-grained archive-level access control. The permission system is designed to be simple for the common case, but still powerful enough to support advanced requirements and responsibility models (e.g. groups and roles across multiple realms).

Note
The permission system may look complex at first glance, but remember that you only need a subset of this functionality for most common scenarios.

The core concept can be summarized as follows: 'Permissions' are granted to 'subjects' and affect a specific 'resource'. Subjects may be individual 'users' or 'groups' of users. A resource may be a single archive, a vault or the entire storage service. Subjects (both users and groups) are organized in 'realms'. A simple setup only requires a single realm, but multi-tenant instances can use realms to separate different authorities.

Subjects and Realms

Subjects are encoded as strings and matched against the current user context using the following subject qualifier syntax:

Table 6. Subject Qualifier
Subject Match Description

$any

Special subject that matches any user, authenticated or not.

$user

Special subject that matches authenticated users.

$owner

Special subject that matches the current owner of the affected resource. This is implemented for archive resources and matches against the owner field of an archive.

@{group}

Subjects starting with @ are interpreted as group names. They match if the current user is a member of that group.
Example: @admins, @customers@realm

{user}

Subjects that do not match any of the patterns above are tested against the identifier of the currently logged-in user.
Example: bob, alice@realm

Fully qualified subjects

If multiple realms are configured, then group and user names should be qualified with a realm name to avoid naming conflicts between realms. Unqualified names are still allowed, but they will match against any realm with a matching user or group.

Fully qualified names have the form name@realm. For example, alice from the ldap realm would be alice@ldap. Only the last occurrence of the @ character is recognized, so identifiers with @ in them (e.g. email addresses) are allowed. In fact, if the local part of a subject identifier contains an @, then the subject MUST be qualified with a realm to avoid ambiguity.

Vault Permissions

Permissions regarding a specific vault. If assigned globally, they have the form vault:{vaultName}:{permissionName}.

Table 7. Vault Permissions
Name Description

read

Open a vault. This is not required for public vaults, as these are visible and readable to anyone.

create

Create new archives within a vault.

list

List the archive IDs in a vault. Note that this allows a user to check if an archive exists independently of archive-level permissions.

Archive Permissions

Archives are protected by an access control list (ACL) which grants permissions to specific subjects (see Subjects and Realms). If assigned globally, they have the form archive:{vaultName}:{archiveId}:{permissionName}.

Note
Archive permissions are very fine-grained and most actions require more than one permission. For example, in order to retrieve a file from an archive, both read_files and load permissions are required. In most cases it is easier to assign Archive Permission Sets instead.
Table 8. Archive Permissions
Name Description

load

Check if an archive exists and read basic attributes (e.g. last-modified or number of files).

delete

Delete an archive and its history (destructive operation).

read_acl

Read the access control list (ACL).

change_acl

Grant or revoke permissions by modifying the ACL.

change_owner

Change the owner.

read_meta

Read meta-data attributes.

change_meta

Add, remove or replace meta-data attributes.

list_files

List files and their attributes (e.g. name, size, type, hash).

read_files

Read file content.

change_files

Create, modify or remove files.

trim

Explicitly compress or clean up an archive.

Archive Permission Sets

Archive permissions are very fine-grained and most actions require more than one permission. For example, a user with only read_files permission on an archive would not be able to read any files, because the load permission is also required to load the archive in the first place. To simplify access control for common use-cases, permission sets were introduced. Each set bundles a number of permissions that are usually granted together, and can be assigned just like normal permissions.

Permission sets have upper-case names to distinguish them from normal permissions. The following matrix shows all pre-defined permission sets and their corresponding permissions.

Table 9. Archive Permission Set Matrix

Permission/set  LIST  READ  WRITE  OWNER  MANAGE  ADMIN

load            yes   yes   yes    yes    yes     yes
delete                             yes            yes
read_acl                           yes    yes     yes
change_acl                         yes    yes     yes
change_owner                              yes     yes
read_meta             yes   yes    yes            yes
change_meta                 yes    yes            yes
list_files      yes   yes   yes    yes    yes     yes
read_files            yes   yes    yes            yes
change_files                yes    yes            yes

When to use MANAGE

The MANAGE set is intended for management and reporting jobs. These are usually only interested in the meta-data of an archive, not the content. The set therefore inherits LIST instead of READ or even WRITE to protect user data by default. While clients with this permission set would be able to grant more permissions to themselves, these changes would show up in audit logs and be accountable.

When to use OWNER

Vaults are usually configured to grant OWNER permissions to the $owner subject for new archives automatically. This allows the archive creator to work with the newly created archive and perform most actions, with the notable exception of changing the owner. Giving archives away is usually a task reserved for higher privilege accounts. This permission set is not limited or otherwise tied to the $owner subject, though. It can be given to other subjects, or revoked from the owner. Revoking permissions from the owner is a common pattern to make archives read-only after publishing.

Note
READ, WRITE and MANAGE resemble the permissions defined in cdstar version 2.

Transaction Management

CDStar focuses on data safety and consistency. All transactions are atomic, consistent, isolated and durable by default (ACID properties). In short, this guarantees that transactions either succeed or fail completely ("all or nothing"), you will never see inconsistent state (e.g. half-committed changes), transactions won't overlap or interfere with each other (isolation), and changes are persisted to disk before you get an OK back (durability).

Tip
ACID properties should be a core requirement for any kind of reliable storage service, but they are actually quite hard to find outside of traditional databases. Most modern web-based storage services (e.g. Amazon S3, CouchDB, MongoDB, most NoSQL databases) only provide "eventual consistency" or do not guarantee atomicity for operations affecting more than a single item. This makes it very hard or even impossible to implement certain workflows against these APIs in a reliable way, resulting in 'lost updates' or other consistency problems.

Each call to an API endpoint implicitly creates and commits a transaction by default. If a single operation is not enough, you can also create an explicit transaction, issue multiple API calls, and then commit or roll back all changes as a single atomic transaction. The non-standard X-Transaction header is used to associate HTTP calls with a running transaction.

$ curl -XPOST /v3/_tx
201 CREATED
{ "id": "d2ee7d6034e3", ... }

$ curl -H 'X-Transaction: d2ee7d6034e3' ...
...

$ curl -XPOST /v3/_tx/d2ee7d6034e3
204 No Content

The results of these HTTP calls are not visible to other transactions until they are committed, and you won’t see any changes made by other users while your transaction is active, either. This is called 'snapshot isolation' and works as if each transaction operates on a snapshot of the entire database from the exact moment the transaction was started.
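
In the Python tutorial style from above, an explicit transaction might look like this (a sketch; the file names are illustrative):

Example: Explicit transaction
>>> tx = post(baseurl + '/_tx').json()
>>> headers = {'X-Transaction': tx['id']}
>>> put(baseurl + '/demo/ab587f42c2570a884/a.txt', data=b'one', headers=headers)
>>> put(baseurl + '/demo/ab587f42c2570a884/b.txt', data=b'two', headers=headers)
>>> post(baseurl + '/_tx/' + tx['id']).status_code   # commit both changes atomically
204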

Error handling

Recoverable errors during an explicit transaction do not trigger a rollback. On one hand, this allows clients to recover from errors without losing too much progress. On the other hand, clients using explicit transactions MUST handle errors properly. Individual operations may fail and still have partial effects. For example, if a file upload fails mid-request, the client should either repeat or resume the failed upload. The client MUST make sure the transaction is in a clean state before committing.

Conflict resolution

Update conflicts (multiple transactions updating the same archive at the same time) are not resolved automatically, since CDStar cannot possibly know how to merge multiple changes into a consistent result. In this unfortunate case, the transaction committed first will succeed and all other transactions writing to the same archive will fail as soon as a commit is attempted.

Read-conflicts are allowed, though. If you only read from an archive and do not change it, and a different transaction changes the archive in the meantime and commits before you, your transaction won't fail. If you require a higher level of isolation (called 'serializability' in database theory), you can enable it via the isolation=full parameter when creating a new transaction.

Read-only transactions

Transaction management is expensive. Some transaction information must survive even a fatal server crash to allow reliable and automatic crash recovery. If you only need to 'read' from multiple archives in an isolated way, you can start the transaction with readonly=true and save a lot of server-side house-keeping.

Transaction Timeout

Explicit transactions expire after some time of inactivity. They never expire while a HTTP call is still in progress, and will extend their lifetime automatically after each HTTP call. You won’t have to worry about that in most cases. If you need a transaction to survive more than a couple of seconds of inactivity (e.g. while waiting for user input), you can specify a higher timeout when creating a transaction, or issue cheap HTTP calls (e.g. Renew Transaction) from time to time to prevent transactions from dying. Expired transactions are rolled back automatically.

API Endpoints

This chapter lists and describes all web service endpoints defined by the standard CDStar HTTP API. Requests are routed to the appropriate endpoint based on their HTTP method, content type and URI path. Some endpoints also require certain query parameters to be present. Path parameters (variable parts of the URL path) are marked with curly brackets.

Table 10. HTTP Endpoints: Overview
Title                           Method  URI Path

Instance APIs
  Service Info                  GET     /v3/
  Service Health                GET     /v3/_health

Vaults and Search
  List Vaults                   GET     /v3/
  Get Vault Info                GET     /v3/{vault}
  Search in Vault               GET     /v3/{vault}?q
  List all Archives in a Vault  GET     /v3/{vault}?scroll

Archives
  Create Archive                POST    /v3/{vault}/
  Get Archive Info              GET     /v3/{vault}/{archive}
  Export Archive                GET     /v3/{vault}/{archive}?export
  Update Archive                POST    /v3/{vault}/{archive}
  Delete Archive                DELETE  /v3/{vault}/{archive}

Files
  List files                    GET     /v3/{vault}/{archive}?files
  Download file                 GET     /v3/{vault}/{archive}/{filename}
  Get file info                 GET     /v3/{vault}/{archive}/{filename}?info
  Upload file                   PUT     /v3/{vault}/{archive}/{filename}
  Resume file upload            PATCH   /v3/{vault}/{archive}/{filename}
  Delete file                   DELETE  /v3/{vault}/{archive}/{filename}

Metadata
  Get Archive Metadata          GET     /v3/{vault}/{archive}?meta
  Set Archive Metadata          PUT     /v3/{vault}/{archive}?meta
  Get File Metadata             GET     /v3/{vault}/{archive}/{file}?meta
  Set File Metadata             PUT     /v3/{vault}/{archive}/{file}?meta

Access Control
  Get Archive ACL               GET     /v3/{vault}/{archive}?acl
  Set Archive ACL               PUT     /v3/{vault}/{archive}?acl

Data Import
  Import from ZIP/TAR           POST    /v3/{vault}/
  Update from ZIP/TAR           POST    /v3/{vault}/{archive}

Snapshots
  Create Snapshot               POST    /v3/{vault}/{archive}?snapshots
  Delete Snapshot               DELETE  /v3/{vault}/{archive}@{snapshot}
  List Snapshots                GET     /v3/{vault}/{archive}?snapshots

Transactions
  Begin Transaction             POST    /v3/_tx/
  Get Transaction Info          GET     /v3/_tx/{txid}
  Commit Transaction            POST    /v3/_tx/{txid}
  Renew Transaction             POST    /v3/_tx/{txid}?renew
  Rollback Transaction          DELETE  /v3/_tx/{txid}

Instance APIs

APIs to access instance-level functionality like metrics, health, capabilities and more. This is also the entry point for most plugins.

Service Info

GET /v3/ HTTP/1.1

Get basic information about the cdstar instance as well as a list of all vaults accessible by the current user.

Table 11. Response Codes
Status  Response       Description
200     [ServiceInfo]  No description

Service Health

GET /v3/_health HTTP/1.1
Warning
This endpoint is marked as unstable and is subject to change.

Return health and performance metrics about the service.

Table 12. Query Parameters
Name Type Description

with

list(enum)

Include additional information in the response.

metrics

Include detailed metrics (named numerical values) in a metrics sub-object.

health

Include detailed health information (named checks) in a health sub-object.

Table 13. Response Codes
Status  Response             Description
200     [ServiceHealthInfo]  No description

Vaults and Search

List and access vaults, search or enumerate archives within a vault.

List Vaults

GET /v3/ HTTP/1.1

List vaults accessible by the current user. This is the same as Service Info.

Get Vault Info

GET /v3/{vault} HTTP/1.1

Get information about a vault.

Table 14. Response Codes
Status  Response     Description
200     [VaultInfo]  No description

Search in Vault

GET /v3/{vault}?q HTTP/1.1

Perform a search over all archives and files within a vault using the configured search backend. Only results that are visible to the currently logged-in user are returned. A short usage sketch follows the tables below.

API Changes
  • Changed in v3.1: Added fields parameter.

Table 15. Query Parameters
Name Type Description

q

string

Search query using the Lucene query syntax or an alternative query syntax supported by the backing search index. Multiple plain search terms are usually OR linked and optional by default, but this may also depend on the search backend used.

Example:
Bananas or modified:[2017-01-01 TO 2017-12-31] AND dcTitle:"Master Thesis"

order

enum

Order results by score, modified, id or any of the fields supported by the search backend. Prefix the field name with a minus character to reverse the order. As an example, the default order -score will return results based on their relevance, ordered from highest relevance to lowest.

Multiple order fields can be specified as a comma separated list.

Default: "-score"

limit

int(0-max)

Limit the number of results. Values are automatically capped to an allowed maximum.

Default: 25

fields

list(string)

Request additional fields for each hit.

Search backends SHOULD support requesting index document fields by name (e.g. dcTitle or meta.dc:title) and return the corresponding value(s) for each hit. Unknown or unsupported fields should be silently ignored.

Search backends MAY support more complex field queries via a backend specific syntax. For example, requesting highlight(content) may return the relevant parts of the content field with the matched sections wrapped in HTML <em> tags. Requesting meta.dc:* may return all fields starting with meta.dc: as a single nested object. We discourage inventing a full mini-language here, though. Keep it simple.

The SearchHit data type contains a fields object that maps field queries to their value.

Multiple simple fields can be requested as a comma separated list.

Example:
fields=dcTitle,dcAuthor

scroll

string

When a search query matches more than limit results, you can use the scroll value from the last successful SearchResults response to skip all results already returned and fetch the next page of results from the search backend.

This works similar to the 'search_after' feature in Elasticsearch or the 'cursorMark' feature in Solr. The 'scroll' value in a SearchResults response is a stateless live cursor pointing to the last element returned in a result page. When repeating a search with a valid scroll cursor, all results that would be ordered lower or equal to this element are skipped.

Default: "none"

groups

list

Claim membership of additional user groups. This is useful if the realm of the user does not return all groups the user belongs to, and some search hits are not visible because of that. Each claim is checked against the realm, and if successful, hits visible to that group are included in the result.

Table 16. Response Codes
Status  Response       Description
200     SearchResults  No description
501     Error          Search functionality is disabled.
504     Error          Search functionality is enabled, but the search service did not respond in time.
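
Example: Search and scroll through results (a sketch in the tutorial style from above; the exact SearchResults fields are described in the data type reference)
>>> r = get(baseurl + '/demo', params={'q': 'dcTitle:"Master Thesis"', 'limit': 10})
>>> results = r.json()               # a SearchResults document
>>> next_page = get(baseurl + '/demo', params={
...     'q': 'dcTitle:"Master Thesis"', 'limit': 10,
...     'scroll': results['scroll']  # cursor from the previous response
... }).json()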

List all Archives in a Vault

GET /v3/{vault}?scroll HTTP/1.1

List IDs of archives stored in this vault.

Up to limit IDs are returned per request. IDs are ordered in a stable but otherwise implementation-specific way (usually lexicographical). If the scroll parameter is a non-empty string, then only IDs ordered after the given string are returned. This can be used to scroll through all IDs of a vault in an efficient manner, as shown in the sketch after the tables below.

By default, this API will return all IDs that were ever created in this vault, including IDs of archives that were removed or are not load-able by the current user. This mode requires list vault permission or the vault to be public.

In strict mode, archive manifests are actually loaded from storage and only IDs of archives that are load-able by the current user are returned. This mode is less efficient, but does not require list permissions on the vault. Use with caution.

This API is NOT transactional and may reflect changes made by other clients as soon as they happen.

Table 17. Query Parameters
Name Type Description

scroll

string

Required, but can be empty.

Start listing IDs greater than the given string, according to the implementation-defined ordering (usually lexicographical). For pagination, set scroll to the ID of the last result of the previous page to fetch the next page.

limit

int(0-max)

Limit the number of results. Values are automatically capped to an allowed maximum.

Default: 25

strict

boolean

If true, only IDs for archives that are actually load-able by the current user are returned.

Table 18. Response Codes
Status  Response       Description
200     ScrollResults  No description
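
Example: Enumerate all archive IDs (a sketch; the name of the ID list field in ScrollResults is assumed here)
>>> scroll = ''
>>> while True:
...     page = get(baseurl + '/demo', params={'scroll': scroll, 'limit': 100}).json()
...     ids = page['results']        # assumed field name holding the ID list
...     if not ids:
...         break
...     for archive_id in ids:
...         print(archive_id)
...     scroll = ids[-1]             # continue after the last ID seen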

Archives

No description

Create Archive

POST /v3/{vault}/ HTTP/1.1
Note
This endpoint consumes: multipart/form-data, application/x-www-form-urlencoded

Create a new archive, owned by the current user.

If the request body contains form data, the new archive is immediately populated according to Update Archive.

Table 19. Response Codes
Status  Response          Description
201     [ArchiveCreated]  Archive created

Get Archive Info

GET /v3/{vault}/{archive} HTTP/1.1
GET /v3/{vault}/{archive}@{snapshot} HTTP/1.1

Get information about an archive or list its content.

When accessing a snapshot, information that is not part of the snapshot (e.g. owner or ACL) will be read from the current archive state.

Table 20. Query Parameters
Name Type Description

with

list(enum)

Include additional information in the response. This can be used as a shortcut for individual requests to Get Archive ACL, List files, Get Archive Metadata or List Snapshots. If access restrictions do not allow reading a subresource, the flag is silently ignored.

acl

Include acl field with an AclInfo in the response.

files

Include files field with a list of FileInfo in the response. This is implicitly enabled if any of the file listing parameters are present.

meta

Include meta field with a MetaAttributes in the response. If files are listed, their FileInfo will also contain an additional meta field.

snapshots

Include a snapshots field with a list of available snapshots for this archive (SnapshotInfo).

include

list(glob)

Only list files that match any of these glob patterns. Implies with=files.

exclude

list(glob)

Only list files that do not match any of these glob patterns. Implies with=files.

order

enum

Order files by name, type, size, created, modified, hash or id. The id ordering is useful to get a stable ordering that is not affected by name changes. Implies with=files.

Default: "name"

reverse

boolean

Return files in reverse order. Implies with=files.

limit

int(0-max)

Limit the number of files listed. Values are automatically capped to an allowed maximum. Implies with=files.

Default: 25

offset

int(0-inf)

Skip this many files from the listing. Can be used for pagination of archives with more than limit files. Implies with=files.

Table 21. Response Codes
Status  Response     Description
200     ArchiveInfo  Archive found
400     Error        Invalid parameters
404     Error        Archive not found or not readable by current user

Export Archive

GET /v3/{vault}/{archive}?export HTTP/1.1
GET /v3/{vault}/{archive}@{snapshot}?export HTTP/1.1
Warning
This endpoint is marked as unstable and is subject to change.
Note
This endpoint produces: */*

Export (a subset of) all files in an archive as a single download.

The file format is specified by the export parameter. Currently only zip is implemented. More formats (e.g. BagIt, tar, tar.gz) are planned.

Table 22. Query Parameters
Name Type Description

export

list(enum)

Required parameter that specifies the export format. Currently only zip is supported.

zip

Export files as a zip archive.

include

list(glob)

Only export files that match any of these glob patterns.

exclude

list(glob)

Only export files that do not match any of these glob patterns.

Table 23. Response Codes
Status  Response  Description
200     bytes     The export format and Content-Type depend on the export query parameter.
404     Error     Archive not found or not readable by current user

Update Archive

POST /v3/{vault}/{archive} HTTP/1.1
POST /v3/{vault}/{archive}@{snapshot} HTTP/1.1
Note
This endpoint consumes: multipart/form-data, application/x-www-form-urlencoded

Update an existing archive or snapshot.

Request form data is interpreted as a list of commands and applied in order. The order is significant. For example, a file can be uploaded, then copied, then annotated with metadata, all in the same request. Make sure your HTTP client preserves the order of form fields when generating the request body. A sketch of such a combined request follows the parameter table below.

Commands that contain a {filename} placeholder operate on files within the archive. The filename must start with a slash (/) in order to be recognized. If the filename also ends with a slash, it usually affects all files with that prefix. Be careful.

File uploads only work with multipart/form-data and are not recommended for large files. Prefer Upload file for anything larger than a couple of MB. Uploading a large number of small files may be faster using this API, though. Your mileage may vary.

Snapshots are read-only, but setting a new profile is supported.

Table 24. Form Parameters
Name Type Description

{filename}

file

Upload a new file (multipart/form-data only).

If the filename ends with a slash, then the original (client-side) name of the file is appended. If the filetype is either application/x-autodetect or missing, cdstar will try to guess the correct content-type from the file name extension and default to application/octet-stream if that fails.

Example:
<input type="file" name="/folder/" />
$ curl --form "/filename.txt=@source.txt"
$ curl --form "/filename.txt=@source.txt;type=text/plain"
$ curl --form "/folder/=@source.txt;name=filename.txt"

copy:{filename}

string

Create a new file by copying the content of an existing file from the same archive.

Example:
$ curl --data copy:/target.txt=/source.txt

clone:{filename}

string

Create a new file by copying the content and metadata of an existing file from the same archive.

move:{filename}

string

Rename an existing file.

fetch:{filename}

uri

Create a new file by fetching an external resource. If {filename} ends in a slash (/) then the last path segment of the fetch URL is appended to the file name.

Supported URI schemes depend on installed plugins and not all URIs may be allowed. For example, fetching from http:// URLs may be limited to trusted domains, or disabled completely.

Example:
$ curl --data fetch:/bigfile.dat=http://example.com/bigfile.dat

delete:{filename}

string

Delete a file. The value is ignored. If {filename} ends with a slash (/), then all files under that directory are removed.

Example:
$ curl --data delete:/some/file.txt
$ curl --data delete:/some/folder/

type:{filename}

string

Change the content-type of an existing file. The value should follow the Content-Type header syntax (e.g. application/octet-stream). A special value of application/x-autodetect will cause cdstar to try to guess the correct content-type from the file name extension.

Example:
$ curl --data type:/some/file.txt=text/plain

meta:{attr}

list(string)

Set meta-attributes for the archive. See Metadata for a list of supported {attr} names.

Example:
$ curl --data meta:dc:creator=Alice

meta:{attr}:{filename}

list(string)

Set meta-attributes for a specific file within the archive. {filename} must correspond to an existing file.

Example:
$ curl --data meta:dc:creator:/thesis.pdf=Alice

acl:{subject}

list(enum)

Change the list of permissions granted to a {subject}. A subject can be an individual, an @ prefixed group or one of the special subjects $any, $user or $owner. The value should be a comma-separated list of permissions (lowercase) or permission-sets (uppercase). Any permissions previously granted to this exact subject are removed and the effective list of permissions is normalized automatically (sets are exploded, duplicates removed).

See Archive Permissions for a list of permission names.

Example:
$ curl --data acl:alice@gwdg=READ,change_meta
$ curl --data acl:@adminGroup=MANAGE
$ curl --data acl:\$any=READ # be careful to escape $ in a shell

profile

string

Set the desired storage profile for this archive or snapshot. Profile changes usually trigger background data migration and will take some time to have an effect. See Storage Profiles for details.

owner

string

Change the owner of this archive. This requires change_owner permissions, which are not included in the default OWNER permission set.
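
Example: Combined update in one request (a sketch in the tutorial style from above; the group name is illustrative, and a list of tuples keeps the command order)
>>> commands = [
...     ('copy:/backup/report.xls', '/report.xls'),          # copy first
...     ('meta:dc:title:/backup/report.xls', 'Report copy'), # then annotate the copy
...     ('acl:@reviewers', 'READ'),                          # then grant read access
... ]
>>> post(baseurl + '/demo/ab587f42c2570a884', data=commands).json()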

Delete Archive

DELETE /v3/{vault}/{archive} HTTP/1.1

Remove an existing archive and all snapshots. This requires delete permissions on the archive.

Table 25. Response Codes
Status  Response  Description
204     -         Archive removed (no content).

Files

No description

List files

GET /v3/{vault}/{archive}?files HTTP/1.1
GET /v3/{vault}/{archive}@{snapshot}?files HTTP/1.1

List files within an archive or snapshot. This endpoint supports the same parameters as Get Archive Info to filter or paginate the list of files.
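
A hypothetical request against a local instance on localhost:8080 and a vault named demo (placeholder values):

Example:
$ curl -u alice:secret "http://localhost:8080/v3/demo/ab587f42c257?files"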

Table 26. Response Codes
Status Response Description

200

FileList

No description

Download file

GET /v3/{vault}/{archive}/{filename} HTTP/1.1
GET /v3/{vault}/{archive}@{snapshot}/{filename} HTTP/1.1
Note
This endpoint produces: */*

Download a single file from an archive or snapshot.

This endpoint supports ranged requests and conditional headers such as If-(None-)Match, If-(Un)modified-Since, If-Range and Range, as well as HEAD requests. The ETag value is calculated from the file's digest hash, if known.

Frequently accessed files in publicly readable archives may be served from a different location (e.g. S3 or CDN). Clients should follow redirects (e.g. 307 Temporary Redirect) according to the HTTP standard.

During explicit transactions, and while a file upload is in progress, GET requests will fail with an "IncompleteWrite" error. HEAD requests are allowed, though. The Content-Length header will report the current upload size.
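
A hypothetical ranged download of the first kilobyte of a file, assuming a local instance on localhost:8080 and a vault named demo:

Example:
$ curl -u alice:secret -H "Range: bytes=0-1023" \
    http://localhost:8080/v3/demo/ab587f42c257/data.csv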

Table 27. Query Parameters
Name Type Description

inline

boolean

By default, files are returned with a Content-Disposition: attachment header, forcing a download dialog in most browsers. Set inline=true to disable this header and allow resources to be embedded in HTML pages or opened directly in a suitable application.

Some content-types cannot be inlined for security reasons. This parameter is silently ignored for these types, and the Content-Disposition: attachment header is sent regardless.

Table 28. Response Codes
Status Response Description

200

bytes

File exists, is readable and its content is returned with this response. The Content-Type matches whatever was defined on the file resource.

206

bytes

Same as 200, but only parts of the file are returned according to the Range header in the request.

304

-

File not modified.

307

-

Same as 200, but the file content is available under a different URL specified in the Location header.

409

-

Archive not available. This may happen for archives with a cold storage profile.

412

-

Precondition failed.

416

-

Requested range not satisfiable.

Get file info

GET /v3/{vault}/{archive}/{filename}?info HTTP/1.1
GET /v3/{vault}/{archive}@{snapshot}/{filename}?info HTTP/1.1

Get FileInfo for a single file. For multiple files, List files is usually faster.

Table 29. Query Parameters
Name Type Description

with

list(enum)

Return additional information about the file, embedded in the FileInfo document.

meta

Include MetaAttributes defined on this file.

Table 30. Response Codes
Status Response Description

200

FileInfo

No description

Upload file

PUT /v3/{vault}/{archive}/{filename} HTTP/1.1
Note
This endpoint consumes: */*
This endpoint produces: application/json

Directly upload a new file to an archive, or overwrite an existing file.

If a Content-Type header is missing or equals application/x-autodetect, then the media type is guessed from the filename extension.

The conditional headers If-Match: * or If-None-Match: * can be used to force update-only or create-only behavior.

Upload errors can only be detected properly if either the Content-Length header is set or Transfer-Encoding: chunked is used. If fewer than the expected number of bytes are transmitted, the file is considered incomplete and the transaction will fail.

During explicit transactions (see Transaction Management), failed uploads will leave the file in an incomplete state. The upload must be repeated or resumed before committing. See Resume file upload for details. Conflicting operations, for example reading the file content or fetching its info, will fail until the file has been completely uploaded or removed. HEAD requests to the file's URL are allowed, though.
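
A hypothetical upload, assuming a local instance on localhost:8080 and a vault named demo (curl sets the Content-Length header automatically for --data-binary):

Example:
$ curl -u alice:secret -X PUT -H "Content-Type: text/csv" \
    --data-binary @data.csv \
    http://localhost:8080/v3/demo/ab587f42c257/data.csv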

Table 31. Response Codes
Status Response Description

200

-

File updated.

201

-

File created.

412

-

Precondition (e.g. If-Match or If-None-Match) failed.

Resume file upload

PATCH /v3/{vault}/{archive}/{filename} HTTP/1.1
Note
This endpoint consumes: application/vnd.cdstar.resume
This endpoint produces: application/json

Resume a failed or aborted file upload.

After a failed Upload file request during an explicit transaction (see Transaction Management), the client may choose to resume the upload instead of uploading the entire file again or removing it.

To do so, send a PATCH request with Content-Type: application/vnd.cdstar.resume and a Range header with a single byte range, either bytes=startByte- or bytes=startByte-endByte (see RFC-2616). The startByte index must match the current remote file size, as returned by a HEAD request to the Download file API. The endByte index is optional, but recommended as an additional safeguard. It should match the target file size.

A file is considered complete once the PUT or PATCH request completes without errors. Within a single transaction, failing uploads can be resumed repeatedly until all data is transmitted or the transaction runs into a timeout.

Do not use this API to upload files in small chunks. A successful PUT or PATCH request will compute digests, which is an expensive operation. Always try to upload the entire file in one go, if possible.
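
A hypothetical resume request, picking up a failed upload at byte 1048576. The X-Transaction header used here to bind the request to the explicit transaction is an assumption for this sketch; see Transaction Management for the authoritative mechanism:

Example:
$ curl -u alice:secret -X PATCH \
    -H "X-Transaction: $TX" \
    -H "Content-Type: application/vnd.cdstar.resume" \
    -H "Range: bytes=1048576-" \
    --data-binary @bigfile.part2 \
    http://localhost:8080/v3/demo/ab587f42c257/bigfile.dat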

Table 32. Response Codes
Status Response Description

200

-

File updated.

Delete file

DELETE /v3/{vault}/{archive}/{filename} HTTP/1.1

Remove a single file from an archive. This requires change_files permissions on the archive.

Table 33. Response Codes
Status Response Description

204

-

File removed (no content).

Metadata

Archives and individual files within an archive can be annotated with custom metadata attributes. Both the name and values of an attribute are plain strings, but each attribute can have multiple values. Lists of strings are returned even if only a single value is set.

Attribute names are case-insensitive and limited to letters, digits and the underscore character, and must start with a letter.

Attribute names may be prefixed with a namespace identifier followed by a single colon character (e.g. dc:title for a Dublin Core title attribute). Namespaced attributes are subject to server-side validation and defined in a schema. Custom attributes should be either prefixed with the custom: namespace or no namespace at all.

The value of an attribute is an ordered list of plain strings. Empty strings are allowed, but a list with no values is equal to an undefined attribute.

Get Archive Metadata

GET /v3/{vault}/{archive}?meta HTTP/1.1
GET /v3/{vault}/{archive}@{snapshot}?meta HTTP/1.1

Return metadata attributes for an archive or snapshot. The same information can also be received as part of a Get Archive Info request by using the with=meta switch.

Table 34. Response Codes
Status Response Description

200

MetaAttributes

No description

Set Archive Metadata

PUT /v3/{vault}/{archive}?meta HTTP/1.1
Note
This endpoint consumes: application/json

Replace the metadata of an archive with a new MetaAttributes document. To clear all attributes, just send an empty document (e.g. {}).
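
A hypothetical request, assuming a local instance on localhost:8080 and a vault named demo:

Example:
$ curl -u alice:secret -X PUT -H "Content-Type: application/json" \
    -d '{"dc:creator": ["Alice"], "dc:title": ["Example thesis"]}' \
    "http://localhost:8080/v3/demo/ab587f42c257?meta"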

Table 35. Request Body (MetaAttributes)
Field Type Description

{schema:attr}

list(string)

A list of string values. The list is ordered and duplicates are allowed.

Table 36. Response Codes
Status Response Description

204

-

Metadata updated.

Get File Metadata

GET /v3/{vault}/{archive}/{file}?meta HTTP/1.1
GET /v3/{vault}/{archive}@{snapshot}/{file}?meta HTTP/1.1

Return metadata attributes for a single file within an archive or snapshot. The same information can also be received as part of a Get file info request by using the with=meta switch.

Table 37. Response Codes
Status Response Description

200

MetaAttributes

No description

Set File Metadata

PUT /v3/{vault}/{archive}/{file}?meta HTTP/1.1
Note
This endpoint consumes: application/json

Replace the metadata of a file within an archive with a new MetaAttributes document. To clear all attributes, just send an empty document (e.g. {}).

Table 38. Request Body (MetaAttributes)
Field Type Description

{schema:attr}

list(string)

A list of string values. The list is ordered and duplicates are allowed.

Table 39. Response Codes
Status Response Description

204

-

Metadata updated.

Access Control

The local access control list (ACL) of an archive can be used to grant permissions to individuals or groups. These permissions are checked before any external realm is consulted and stored as part of the archive. New permissions can be granted individually using the Update Archive endpoint, or in bulk via Set Archive ACL. The permissions read_acl or change_acl are required to read or change the access control list of an archive.

Note that the names of subjects (individuals or groups) can and should be qualified with the name of the authentication realm, especially if more than one realm is installed. A subject named alice would match any user with that name, across all authentication sources. Use qualified names (e.g. userName@realmName or @groupName@realmName) to prevent ambiguities.

Get Archive ACL

GET /v3/{vault}/{archive}?acl HTTP/1.1

Return the local access control list of this archive as an AclInfo document. The same information can also be received as part of a Get Archive Info request by using the with=acl switch.
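
A hypothetical request returning individual permissions instead of sets (placeholder host and vault):

Example:
$ curl -u alice:secret "http://localhost:8080/v3/demo/ab587f42c257?acl=explode"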

Table 40. Query Parameters
Name Type Description

acl

enum

group

Group permissions (lowercase) into permission-sets (uppercase) when possible. Permissions that do not fit into a complete group are returned individually.

explode

Return individual permissions and no permission sets.

Default: "group"

Table 41. Response Codes
Status Response Description

200

AclInfo

No description

Set Archive ACL

PUT /v3/{vault}/{archive}?acl HTTP/1.1
Note
This endpoint consumes: application/json

Replace all entries of the local access control list with entries from this AclInfo document.
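
A hypothetical request, assuming a local instance on localhost:8080, a vault named demo and a group named demoGroup (all placeholders):

Example:
$ curl -u alice:secret -X PUT -H "Content-Type: application/json" \
    -d '{"$owner": ["OWNER"], "@demoGroup": ["READ"]}' \
    "http://localhost:8080/v3/demo/ab587f42c257?acl"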

Table 42. Request Body (AclInfo)
Field Type Description

{subject}

list(string)

A list of permissions (lowercase) or permission-sets (uppercase) granted to this subject. {subject} can be an individual, an @ prefixed group or one of the special subjects $self, $any, $user or $owner.

Table 43. Response Codes
Status Response Description

200

-

Archive updated.

400

-

Invalid permission

Data Import

No description

Import from ZIP/TAR

POST /v3/{vault}/ HTTP/1.1
Note
This endpoint consumes: application/zip, application/x-tar

Create a new Archive from a ZIP or TAR file.

For compressed TAR files, make sure to provide a suitable Content-Encoding header. Supported algorithms include gz, bzip2, xz, and deflate.

Note that importing compressed ZIP or TAR archives requires a significant amount of server-side work after the upload has completed, which may cause some clients to time out before a response can be sent. Make sure to increase the read time-outs for your client before uploading large archives.
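
A hypothetical import of all PDF files from a ZIP file into a new archive (placeholder host and vault):

Example:
$ curl -u alice:secret -X POST -H "Content-Type: application/zip" \
    --data-binary @upload.zip \
    "http://localhost:8080/v3/demo/?prefix=/import/&include=*.pdf"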

Table 44. Query Parameters
Name Type Description

prefix

string

Import files into this folder.

Example:
prefix=/import/

include

list(glob)

Only import files that match any of these glob patterns.

Example:
include=*.pdf

exclude

list(glob)

Only import files that do not match any of these glob patterns.

Example:
exclude=.svn/**

Table 45. Response Codes
Status Response Description

201

-

Archive created.

Update from ZIP/TAR

POST /v3/{vault}/{archive} HTTP/1.1
Note
This endpoint consumes: application/zip, application/x-tar

Import files from a zip or tar file into an existing archive. See Import from ZIP/TAR for details.

Table 46. Query Parameters
Name Type Description

prefix

string

Import files into this folder.

Example:
prefix=/import/

include

list(glob)

Only import files that match any of these glob patterns.

Example:
include=*.pdf

exclude

list(glob)

Only import files that do not match any of these glob patterns.

Example:
exclude=.svn/**

Table 47. Response Codes
Status Response Description

200

-

Archive updated.

Snapshots

Archive Snapshots are an efficient way to preserve the current payload (files and metadata) of an archive without actually creating a copy. This can be used to implement versioning or prepare unmodifiable copies for publishing.

The preserved state of a snapshot can be accessed (read-only) just like normal archive state, by appending an @ and the snapshot name to the archive id in the request path. For example, GET /v3/ab587f42c257@v1/data.csv will return a file from archive ab587f42c257 as preserved by snapshot v1. This works for all endpoints documented as supporting snapshots.

Snapshots only preserve the payload of an archive, namely metadata and files. Administrative metadata such as owner or access control lists are not part of a snapshot. Only the profile can be changed on a snapshot via Update Archive. This means that the storage state and availability of a snapshot can differ from that of the archive. See Storage Profiles for details.

Create Snapshot

POST /v3/{vault}/{archive}?snapshots HTTP/1.1
Note
This endpoint consumes: application/x-www-form-urlencoded

Create a new snapshot.
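
A hypothetical request, assuming a local instance on localhost:8080 and a vault named demo:

Example:
$ curl -u alice:secret -X POST --data name=v1 \
    "http://localhost:8080/v3/demo/ab587f42c257?snapshots"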

Table 48. Form Parameters
Name Type Description

name

string

(required) Snapshot name. Must be unique per archive and only contain ASCII letters, digits, dashes or dots (a-z A-Z 0-9 - .).

Table 49. Response Codes
Status Response Description

201

SnapshotInfo

Snapshot created.

Delete Snapshot

DELETE /v3/{vault}/{archive}@{snapshot} HTTP/1.1

Delete a snapshot. This requires delete permissions on the archive and is irreversible. The name of a deleted snapshot cannot be used to create a new snapshot.

Table 50. Response Codes
Status Response Description

204

-

Snapshot removed

List Snapshots

GET /v3/{vault}/{archive}?snapshots HTTP/1.1

Get a list of snapshots that exist for this archive, ordered by creation date, then name.

Transactions

Transactions can be started, committed or rolled back explicitly using these endpoints. To learn more about transactions, see Transaction Management.

Begin Transaction

POST /v3/_tx/ HTTP/1.1
Note
This endpoint consumes: application/x-www-form-urlencoded

Start a new transaction. See Transaction Management for details.

Table 51. Form Parameters
Name Type Description

isolation

enum

Select an isolation level for this transaction. Supported modes are full and snapshot.

Transactions with 'snapshot' isolation work on a consistent snapshot of the entire database from the exact moment the transaction was started and only see their own changes. On a write-write conflict (the same resource modified by two overlapping transactions) only one of the transactions will be able to commit. This protects against 'lost updates' and is suitable for most scenarios.

Transactions with 'full' isolation (also called 'serializable isolation') will also fail on write-read conflicts. The transaction can only be committed if none of the affected resources (modified or not) were modified by an overlapping transaction.

Default: "snapshot"

readonly

boolean

If true, create a read-only transaction. These transactions cannot be committed (only rolled back).

timeout

integer

Timeout (in seconds) after which an unused transaction is automatically rolled back. User supplied timeouts are automatically capped to a server-defined maximum value.

Default: 60

Table 52. Response Codes
Status Response Description

201

TransactionInfo

Transaction created successfully.
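
A hypothetical end-to-end workflow (placeholder host, vault and credentials). The X-Transaction request header and the use of jq to extract the transaction ID are assumptions for this sketch; see Transaction Management for the authoritative way to bind requests to a transaction:

Example:
$ TX=$(curl -su alice:secret -X POST http://localhost:8080/v3/_tx/ | jq -r .id)
$ curl -u alice:secret -X PUT -H "X-Transaction: $TX" \
    --data-binary @data.csv http://localhost:8080/v3/demo/ab587f42c257/data.csv
$ curl -u alice:secret -X POST "http://localhost:8080/v3/_tx/$TX"  # commit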

Get Transaction Info

GET /v3/_tx/{txid} HTTP/1.1

Request information about a running transaction.

Table 53. Response Codes
Status Response Description

200

TransactionInfo

Transaction Info

404

Error

Transaction does not exist, has expired, or is not visible to the current user context.

Commit Transaction

POST /v3/_tx/{txid} HTTP/1.1

Commit a running transaction. All changes made with this transaction ID are persisted and new transactions will be able to see the changes. The commit may fail, in which case no changes are persisted at all. Partial commits never happen.

Table 54. Response Codes
Status Response Description

204

-

Transaction committed successfully.

404

Error

Transaction does not exist, has expired, or is not visible to the current user context.

409

Error

Transaction could not be committed because of unresolvable conflicts and was rolled back instead.

423

Error

Transaction could not be committed because of locked resources. It may still be possible to commit this transaction, so it is kept open. The client should either issue a rollback, or try again later.

Renew Transaction

POST /v3/_tx/{txid}?renew HTTP/1.1

Renew a running transaction. This resets the transaction timeout and ensures that the transaction is not rolled back automatically for the next TransactionInfo.ttl seconds.

Table 55. Response Codes
Status Response Description

200

TransactionInfo

Transaction renewed successfully. The response contains an updated timeout.

404

Error

Transaction does not exist, has expired, or is not visible to the current user context.

Rollback Transaction

DELETE /v3/_tx/{txid} HTTP/1.1

Close a running transaction by rolling it back. All changes made with this transaction ID are discarded.

API Data Structures

AclInfo

This object maps subjects (users, groups or special subjects) to lists of permissions (lowercase) or permission sets (uppercase). See Archive Permissions for possible values.

Permissions are grouped into permission sets by default. Only permissions that do not fit into a complete set are returned individually. Endpoints returning this structure usually also support a flag to return individual permissions instead of sets.

For most subjects, this listing only contains permissions that were explicitly granted on the archive itself. Authorization realms configured on the server may grant additional permissions when requested. Those are not listed here, as they cannot be known in advance.

Table 56. Field list for AclInfo
Field Type Description

{subject}

list(string)

A list of permissions (lowercase) or permission-sets (uppercase) granted to this subject. {subject} can be an individual, an @ prefixed group or one of the special subjects $self, $any, $user or $owner.

Example for AclInfo
{
  "$any": [
    "READ"
  ],
  "$owner": [
    "OWNER"
  ],
  "alice": [
    "READ"
  ],
  "@cronGorup": [
    "READ",
    "read_acl"
  ]
}

ArchiveInfo

Archive properties and content listing as returned by Get Archive Info. Some of the fields are optional or affected by query parameters. See Get Archive Info for a detailed description.

If this document represents an archive snapshot, additional fields are present. State that is not part of the snapshot (e.g. owner or ACL) is complemented from the archive state, if requested.

Table 57. Field list for ArchiveInfo
Field Type Description

id

string

Unique ID of this archive.

vault

string

Name of the containing vault.

revision

string

Archive revision. This is currently an incrementing counter, but the value should be treated as an arbitrary string.

profile

string

The name of the storage profile. If the archive is currently in a pending-* state, then this is the target profile the archive is migrating to.

state

enum

The current storage state of this archive or snapshot. The states are:

open

The archive is open for reading and writing.

locked

The archive is write-protected, but can be read.

archived

The archive is stored in an external location, cannot be modified, and file content may not be available. It needs recovery to be available again.

pending-recover

The archive is currently recovered from external storage and will change to open or locked once the recovery is complete.

pending-archive

The archive is currently migrating to external storage and will change to archived once the migration is complete.

Archives in pending- states have the same restrictions as archived. To change the state, change the storage profile and wait for the pending- state to clear.

created

date

Time this archive was created.

modified

date

Last time this archive, its meta-data or any of its files were modified. Note that changes to administrative meta-data (owner, ACL) do not update the modification time of an archive. If you need to track changes in administrative meta-data, always compare the actual values.

file_count

int

Total number of files in this archive. May be -1 to indicate that the actual number is not known. This may happen if the user does not have the permission to list the archive's content.

files

list(FileInfo)

List of files in this archive. May be incomplete or missing based on query parameters, permissions and server configuration. See Get Archive Info for details.

meta

MetaAttributes

Meta-Attributes defined on this archive. May be incomplete or missing based on query parameters and permissions.

acl

AclInfo

Access control list. May be incomplete or missing based on query parameters and permissions.

snapshots

list(SnapshotInfo)

List of snapshots created for this archive, if any. May be incomplete or missing based on query parameters. See Get Archive Info for details.

Example for ArchiveInfo
{
  "id": "ab587f42c2570a884",
  "vault": "myVault",
  "revision": "0",
  "profile": "default",
  "state": "open",
  "created": "2016-12-20T13:59:37.160+0000",
  "modified": "2016-12-20T13:59:37.231+0000",
  "file_count": 1,
  "files": [
    {
      "name": "/example.txt",
      "id": "aaf0cc5ab587",
      "type": "text/plain",
      "size": 7,
      "created": "2016-12-20T13:59:37.217+0000",
      "modified": "2016-12-20T13:59:37.218+0000",
      "digests": {
        "md5": "1a79a4d60de6718e8e5b326e338ae533",
        "sha1": "c3499c2729730a7f807efb8676a92dcb6f8a3f8f",
        "sha256": "50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c"
      },
      "meta": {
        "dc:title": [
          "This is an example file"
        ],
        "dc:date": [
          "2016-12-20T13:59:37.218+0000"
        ]
      }
    }
  ],
  "acl": {
    "$any": [
      "READ"
    ],
    "$owner": [
      "OWNER"
    ],
    "alice": [
      "READ"
    ],
    "@cronGorup": [
      "READ",
      "read_acl"
    ]
  }
}

Error

In case of an error, CDStar will return a JSON document with additional information.

Table 58. Field list for Error
Field Type Description

status

int

HTTP status code of this response

error

string

Short description. Suitable as a key for translations or error handling, as it does not contain any dynamic parts.

message

string

Long description. Suitable to be presented to the user.

detail

object

Additional information or metadata. (Optional field)

other

list(Error)

If more than one error occurred during a single request, the other errors are listed here. (Optional field)

Example for Error
{
  "status": 404,
  "error": "Not found",
  "message": "The requested archive does not exist or is not readable.",
  "detail": {
    "vault": "myVault",
    "archive": "ab587f42c2570a884"
  }
}

FileInfo

Properties and (optionally) meta-data about a single file within an archive.

Table 59. Field list for FileInfo
Field Type Description

id

string

A unique and immutable string identifier. Unlike the name attribute, the id will not change for the lifetime of the file and can be used to track individual files across name changes.

name

string

File name (unicode), always starting with a slash (/). The file name may actually represent a path and contain several path separators (slash, /).

type

string

User supplied or auto-detected media type. Defaults to application/octet-stream.

size

long

File size in bytes

created

date

Time the file was created.

modified

date

Last time the file content was modified.

digests

object

An object mapping digest algorithms to their hex value. The available algorithms (e.g. md5, sha1 or sha256) depend on server configuration, but at least one is always present.

This field is not available (null or missing) for incomplete files with running or aborted uploads in the same transaction.

meta

MetaAttributes

Meta attributes defined for this file. May be incomplete or missing based on query parameters and permissions.

Example for FileInfo
{
  "name": "/example.txt",
  "id": "aaf0cc5ab587",
  "type": "text/plain",
  "size": 7,
  "created": "2016-12-20T13:59:37.217+0000",
  "modified": "2016-12-20T13:59:37.218+0000",
  "digests": {
    "md5": "1a79a4d60de6718e8e5b326e338ae533",
    "sha1": "c3499c2729730a7f807efb8676a92dcb6f8a3f8f",
    "sha256": "50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c"
  },
  "meta": {
    "dc:title": [
      "This is an example file"
    ],
    "dc:date": [
      "2016-12-20T13:59:37.218+0000"
    ]
  }
}

FileList

A list of FileInfo objects, usually filtered and paginated. If count and total are not equal, then the result is incomplete and additional requests are required to get the complete list.

Table 60. Field list for FileList
Field Type Description

count

int

Number of results in this listing (size of the files array)

total

int

Total number of files matching the given include/exclude filters or query.

files

list(FileInfo)

List of FileInfo objects.

Example for FileList
{
  "count": 1,
  "total": 1,
  "files": [
    {
      "name": "/example.txt",
      "id": "aaf0cc5ab587",
      "type": "text/plain",
      "size": 7,
      "created": "2016-12-20T13:59:37.217+0000",
      "modified": "2016-12-20T13:59:37.218+0000",
      "digests": {
        "md5": "1a79a4d60de6718e8e5b326e338ae533",
        "sha1": "c3499c2729730a7f807efb8676a92dcb6f8a3f8f",
        "sha256": "50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c"
      },
      "meta": {
        "dc:title": [
          "This is an example file"
        ],
        "dc:date": [
          "2016-12-20T13:59:37.218+0000"
        ]
      }
    }
  ]
}

MetaAttributes

This object contains one key per non-empty meta-attribute defined on the resource. The keys are fully qualified attribute names (including schema prefix) and the values are always lists of strings, even if the attribute only allows a single value or has a different value type.

Table 61. Field list for MetaAttributes
Field Type Description

{schema:attr}

list(string)

A list of string values. The list is ordered and duplicates are allowed.

Example for MetaAttributes
{
  "dc:title": [
    "This is an example file"
  ],
  "dc:date": [
    "2016-12-20T13:59:37.218+0000"
  ]
}

ScrollResults

A page of results returned from a List all Archives in a Vault query.

Table 62. Field list for ScrollResults
Field Type Description

count

int

Number of results in this page.

limit

int

Maximum number of results per page. If limit is greater than count, then this is the last page.

results

list(String)

List of archive IDs

Example for ScrollResults
{
  "count": 2,
  "limit": 25,
  "results": [
    "ab587f42c2570a884",
    "ac2b39606a3a6e3b1"
  ]
}

SearchHit

A single element of a SearchResults listing.

Table 63. Field list for SearchHit
Field Type Description

id

string

Archive ID this hit belongs to.

type

string

Resource type of this hit (either archive or file)

name

string

Full file name (including path) of the matched file. Only present if type equals file.

score

float

Relevance score. May be 0 for queries or search backends that do not support relevance scoring.

fields

object(string, any)

Contains field query results requested during search or automatically provided by the search backend.

Each entry maps a field query to its result value, which is usually a simple type (e.g. number, string or list of strings), but can also take other forms for computed fields or errors.

Failed or unsupported individual field queries should map to an {'error': 'Reason'} object containing error details if possible, but may also be silently ignored and not included in the result at all.

Supported field queries and their return type depend on the search backend used.

Example for SearchHit
{
  "id": "ab587f42c2570a884",
  "type": "file",
  "name": "/folder/example.pdf",
  "score": 3.14,
  "fields": {
    "dcTitle": "Example Document Title",
    "highlight(content)": {
      "error": "UnsupportedFieldQuery"
    }
  }
}

SearchResults

A page of results returned from a search query.

Table 64. Field list for SearchResults
Field Type Description

count

int

Number of results in this page.

total

int

Total number of results in this result set (approximation)

scroll

string

A stateless cursor representing the last hit of this result page. It can be used to repeat the search and fetch the next page of a large result set.

hits

list(SearchHit)

List of search hits

Example for SearchResults
{
  "count": 1,
  "total": 1,
  "scroll": "WyJhYjU4N2Y0MmMyNTcwYTg4NDphYWYwY2M1YWI1ODciXQ==",
  "hits": [
    {
      "id": "ab587f42c2570a884",
      "type": "file",
      "name": "/folder/example.pdf",
      "score": 3.14,
      "fields": {
        "dcTitle": "Example Document Title",
        "highlight(content)": {
          "error": "UnsupportedFieldQuery"
        }
      }
    }
  ]
}

SnapshotInfo

Information about a single archive snapshot.

Table 65. Field list for SnapshotInfo
Field Type Description

name

string

Snapshot name

revision

string

Archive revision this snapshot refers to.

creator

string

User that created this snapshot.

created

string

Snapshot creation date

profile

string

Snapshot storage profile

Example for SnapshotInfo
{
  "name": "v1",
  "revision": 0,
  "creator": "user@domain",
  "created": "2020-05-26T12:02:45.301+0000",
  "profile": "default"
}

TransactionInfo

Information about a running transaction. See Transaction Management for details.

Table 66. Field list for TransactionInfo
Field Type Description

id

string

Transaction ID

isolation

enum

Isolation level (either full or snapshot)

readonly

boolean

Whether or not this transaction is in read-only mode. Read-only transactions cannot be committed (only rolled back) and do not allow modifying operations.

ttl

integer

Number of seconds left from the configured timeout. This counter is reset every time the transaction is used.

If this number is zero or negative, then the transaction already expired or may expire very soon.

timeout

integer

Number of seconds after which this transaction will expire if not used (see ttl).

Example for TransactionInfo
{
  "id": "091f8a6e-0fca-4771-a460-d2ee7d6034e3",
  "isolation": "snapshot",
  "readonly": false,
  "ttl": 59,
  "timeout": 60
}

Realms

Realms manage authentication and authorization in CDStar and are very flexible. There are different interfaces for authorization, authentication, group membership resolution, custom permission types and more. This list contains all available realm types that are either bundled with the core distribution or provided as officially supported plugins. Custom implementations can also be used.

StaticRealm

This realm provides authentication, authorization and groups from a static configuration file.

StaticRealm loads the entire user database (users, groups, roles and permissions) from a static configuration file (hence the name) and is the go-to solution for small instances with only a handful of users. No external database or server is required.

Configuration

The realm is configured directly in the cdstar main configuration. Here is an example showing most options:

Example cdstar-static-realm.yaml
realms:
  default:
    class: StaticRealm
    domain: static
    role:
      userRole:
      - "vault:demo:read"
      - "vault:demo:create"
      adminRole:
      - "vault:*:*"
      - "archive:*:*:*"
    group:
      customers:
      - userRole
      admins:
      - userRole
      - adminRole
    user:
      alice:
        password: "cGxhaW4=:FmtSc7NSX8fsjLTmpLpoqRLP4vqWFg/r5uy3EU6JsEs="
        groups:
        - customers
        permissions:
        - "vault:alice:*"
      admin:
        password: "..."
        roles:
        - adminRole
Table 67. Config properties
Param Description

class

Realm implementation class name. Always "StaticRealm"

file

Load additional configuration from an external yaml file (not implemented)

domain

Sets a default domain for this realm. (defaults to 'static')

user.<name>.password

Enables a user to authenticate against this realm. The password is stored in hashed form. These hashes can be created using the built-in command line tool (see below).

user.<name>.permissions

Grants string permissions directly to this user.

user.<name>.groups

Adds this user to a list of groups.

user.<name>.roles

Adds this user to a list of roles.

group.<name>

Defines a new group with a list of roles.

role.<name>

Defines a new role with a list of string permissions.

Unqualified groups and user-names are qualified with the configured default domain of the realm (e.g. alice is turned into alice@static). Fully qualified names (e.g. alice@otherRealm) are also accepted, even if the domain does not match the current realm.

Warning
Permissions, groups and roles configured for a qualified user will affect any session with a matching principal name and domain, even if the session was authenticated by a different realm.

If no password is defined for a user, then the user will not be able to authenticate against this realm. Permissions, roles and groups still apply.

Password hash

A secure password-hash can be generated with the java -cp cdstar.jar de.gwdg.cdstar.auth.realm.StaticRealm tool.

LDAP Realm

An LDAPRealm authenticates password credentials against an LDAP server. The realm first searches for the user according to a configurable search base and filter, then tries to bind to the LDAP server using the user's password. Successfully authenticated principals are cached to speed up repeated login requests for the same user.

Configuration

Example Configuration
realm:
    ldap:
       class: LDAPRealm
       name: "ldap"
       server: "ldaps://SERVER"
       search.user: "cn=USER,ou=users,dc=example,dc=com"
       search.password: "SECRET"
       search.base: "dc=example,dc=com"
       search.filter: "(|(uid={})(mail={}))"
       attr.uid: "uid"
       attr.domain: "ou"
Table 68. Config Parameters
Name Description

class

Plugin class name. Always LDAPRealm

name

The name of this realm. Defaults to the value of _name.

server

URL (either ldap:// or ldaps://) of the LDAP server.

search.user

Login DN for the search agent. The search agent must be able to search below the search.base tree to find the DNs matching a login request.

search.password

Password for the search agent.

search.base

Base DN for user records. Only records below this tree are considered for login requests.

search.filter

Search filter used to map a login request (e.g. user name or e-mail) to a qualified user DN. Every occurrence of {} within this filter is replaced by an escaped copy of the login request. Additional escaping is not required. For example, to allow login via common name, uid and email, provide a filter similar to: (|(cn={})(uid={})(mail={}))

attr.uid

The LDAP attribute used as the subject identifier. Note that subject identifiers must be unique and should not contain certain special characters. Defaults to uid.

attr.domain

Attribute to read the principal domain from. This allows a single LDAPRealm to represent multiple principal domains. If this config value is not set, or if the attribute is not found in the LDAP record, then the principal domain defaults to the realm name. (Optional)

cache.size

Number of recently authenticated principals to keep in memory to prevent unnecessary LDAP requests. Defaults to 1024. A cache size of 0 disables the cache.

cache.expire

Number of seconds after which a principal must be re-authenticated against LDAP. (default: 10 minutes)

Warning: cache.expire is enforced by the cache implementation, which might allow entries to survive longer than expected on Java 8 if the cache is mostly idle. If prompt expiration is important and the expiration time is very short, make sure to run on Java 9 or newer.

JWT Realm

This plugin adds support for JWT token based authentication and authorization.

Configuration

The JWTRealm class can be configured as a realm or regular plugin and allows users to authenticate via signed JWTs.

example.yaml
realm:
  jwt:
    class: JWTRealm
    default:
      hmac: c3VwZXJzZWNyZXQ= # base64("supersecret")
    my_issuer:
      iss: https://auth.example.com/my-realm/
      jwks: https://auth.example.com/my-realm/jwks.json
      domain: my_realm

This plugin supports multiple JWT issuers with different settings at the same time. Tokens are matched against configured issuers based on their iss claim. Tokens without an iss claim or with no matching issuer configuration will be matched against the default issuer, if defined.

Each issuer MUST define at least one of hmac, rsa, ecdsa or jwks to be able to verify signed tokens. Unsigned tokens are not supported and will be rejected.

Param

Description

class

Plugin class name. Always JWTRealm

<issuer>.iss

Expected value of the iss claim for tokens from this issuer. (default: <issuer>).

<issuer>.hmac

Base64 encoded secret. Required to verify HMAC based signatures.

<issuer>.rsa

RSA public key (X.509). Required to verify RSA based signatures. Keys are loaded from *.pem or *.der files, or directly from a base64 encoded string.

<issuer>.ecdsa

ECDSA public key (X.509). Required to verify ECDSA based signatures. Keys are loaded from *.pem or *.der files, or directly from a base64 encoded string.

<issuer>.jwks

Path or URL pointing to a JWKS (JSON Web Key Set) file to load signing keys from.

<issuer>.leeway

Number of seconds to add to or subtract from exp or nbf claims before a token is checked. This helps prevent errors for short-lived tokens if the server clocks are not perfectly synchronized. (default: 0).

<issuer>.domain

The realm domain of the resulting principal. (default: <issuer>).

<issuer>.trusted

(deprecated) If true, the issuer can dynamically grant additional permissions via private claims (see below). (default: false)

<issuer>.permit

A list of static StringPermissions given to all tokens created by this issuer.

<issuer>.groups

A list of static groups all token users are considered to be a member of.

<issuer>.subject

SpEL expression to derive a subject name from a token. Must evaluate to a string. (default: getString('sub'))

<issuer>.verify.<name>

SpEL expression (see below) to check token validity. All expressions must evaluate to true, or the token will be rejected. The rule name is just informal.

<issuer>.groups.<name>

SpEL expression (see below) to derive group memberships from a token. Each expression must evaluate to a string, a list of strings, or null. Null values or empty lists are ignored and will not add any groups. The expression name is just informal.

<issuer>.permit.<name>

SpEL expression (see below) to derive StringPermissions from a token. The expressions must evaluate to a string, a list of strings, or null. Null values or empty lists are ignored and will not add any permissions. The expression name is just informal.

Dynamic expression rules

Because JWT is a very loose standard and the available claims may differ a lot between token providers, this plugin allows verifying tokens and extracting information dynamically using SpEL expressions. Token claims are available as a claims map which maps claim names to com.auth0.jwt.interfaces.Claim instances, or via the hasClaim(name), getBool(name, default), getLong(name, default), getDouble(name, default), getString(name, default), getStringList(name), getClaim(name, type, default) and getClaimList(name, innerType) helper methods. These methods will return null or an empty list on any errors (missing claim, wrong type) and automatically convert between single and list claims. If a single value is requested for a list claim, the first value is returned.

dyn-example.yaml
realm.jwt:
    class: JWTRealm
    keycloak:
      iss: https://auth.example.com/realms/my_realm/
      jwks: https://auth.example.com/realms/my_realm/protocol/openid-connect/certs
      domain: my_realm
      subject: "getString('preferred_username') ?: getString('sub')"
      verify.aud: "getStringList('aud').contains('my_client_id')"
      groups.admin: "getBool('admin', false) ? 'admin_group' : null"
      permit.vaultUser: "getStringList('usable_vaults').!['vault:#{#this}:create']"

Trusted token claims (deprecated)

If the issuer is configured with trusted: true, then the following rules and expressions are automatically configured for a realm:

trusted.yaml
# Add `cdstar:groups` to list of groups.
groups._trusted: "getStringList('cdstar:groups')"

# Allow read access to all vaults in `cdstar:read`
permit._trusted_read: "getStringList('cdstar:read').!['vault:'+#this+':read']"

# Allow create+read access to all vaults in `cdstar:create`
permit._trusted_create:      "getStringList('cdstar:create').!['vault:'+#this+':create']"
permit._trusted_create_read: "getStringList('cdstar:create').!['vault:'+#this+':read']"

# Grant all vault and archive permissions in `cdstar:grant`
permit._trusted_grant: "getStringList('cdstar:grant').?[#this.startsWith('vault:') or #this.startsWith('archive:')]"

Plugins

Plugins are optional components that extend various parts of the cdstar runtime or REST API and can be enabled on demand. Some plugins are bundled with the core cdstar distribution, others must be downloaded and unpacked into the path.lib folder before they can be used. This chapter describes the official plugins that are tested and distributed with the core cdstar runtime and fully supported.

PushEventFilter

The PushEventFilter sends an HTTP request to a number of configured consumer URLs whenever an archive is modified. This can be used to update external services or keep external databases in sync with the actual data within cdstar.

Failed push requests are retried to compensate for busy or temporarily unavailable consumers. If a consumer goes down for an extended time period, any push requests that failed to be delivered are persisted to disk.

Configuration

Example Configuration
cdstar:
  plugin:
    push:
      class: PushEventFilter
      fail.log: "${path.var}/push-fail.log"
      retry.max: 3
      retry.delay: 1000
      retry.cooldown: 60000
      queue.size: 1000
      http.timeout: 60000
      url: http://localhost:8081/push
      url.alt: http://localhost:8082/push
      header.Authorization: Basic Y3VyaW91czpjYXQ=
      header.X-Push-Referrer: http://push:push@localhost:8080/v3/
Table 69. Config Parameters
Name Description

class

Plugin class name. Always PushEventFilter

fail.log

(optional, recommended) Path to a file where failed push requests are logged. If %s is part of the filename, it is replaced with the current unix epoch timestamp. If it is relative, it is created within the path.var directory of the CDStar instance. Missing directories are created automatically.

retry.max

(default: 3) Maximum number of attempts before a consumer is considered unresponsive.

retry.delay

(default: 1000) Number of milliseconds to wait between failed attempts.

retry.cooldown

(default: 60000) Number of milliseconds to wait after retry.max failed attempts.

http.timeout

(default: 60000) Number of milliseconds after which a request is aborted.

queue.size

(default: 1000) Number of queued events per consumer.

url

URL to send push requests to.

url.*

Additional URLs.

header.*

Additional HTTP headers to send with each request.

Push Event Consumer API

Events are sent to consumers synchronously and in the order they appear, which means that there is at most one HTTP connection per consumer at any given time. The service behind the configured URL should expect requests like the following:

Example PUSH request
POST /push  HTTP/1.1
Host: localhost:8081
Content-Type: application/json; charset=UTF-8
Content-Length: 167
X-Push-Retry: 0
X-Push-Queue: 12 1000 0
X-Push-Referrer: http://push:push@localhost:8080/v3/

{
  "vault" : "test",
  "archive" : "b5e83cd9658f7f33",
  "revision" : "0",
  "parent" : null,
  "ts" : 1491914254133,
  "tx" : "ded6b2d4-6983-48f6-9b1f-be8225dab136"
}
Table 70. Event Headers
Name Description

X-Push-Retry

(int) Number of previously failed attempts for this event.

X-Push-Referrer

(url) May be sent to tell consumers how to contact cdstar.

X-Push-Queue

Statistics about the event queue for this consumer. Contains three space-separated numbers:

  • (int) Number of events in waiting queue (not counting the current event)

  • (int) Maximum size of waiting queue

  • (int) Total number of dropped events since the service was last restarted

Example: 12 1000 0 means: Twelve events currently waiting in a queue limited to 1000 events. No events were dropped so far.

*

Additional headers can be configured with header.* properties.

Table 71. Event Attributes
Name Description

vault

Name of the vault.

archive

ID of the archive that changed.

revision

Revision of the changed archive, or null if the archive was deleted.

parent

Revision of the archive before the change, or null if this archive was just created.

ts

Timestamp of the change event (milliseconds since 1970-01-01T00:00:00GMT)

tx

ID of the transaction this change was part of.

A consumer may respond with 200 OK, 202 Accepted or 204 No Content to signal success. The response body should be empty and other headers (including cookies) are ignored.

Redirects with 30x response codes are followed according to normal HTTP client rules, but discouraged.

Consumers that are busy or unresponsive can answer with 503 Service Unavailable and request a cool-down time (in seconds) using the Retry-After header. This causes CDStar to pause the consumer and not send any more requests for the requested cool-down period. If the Retry-After header is missing, the default retry.cooldown is used.

Any other response as well as connection problems or timeouts are logged as warnings and the request is sent again after retry.delay milliseconds. If a request fails more than retry.max times in a row, it is logged as an error and the consumer is paused for retry.cooldown milliseconds. This gives the consumer a chance to recover and also reduces logging noise considerably. Note that failing events are not discarded, but simply sent again after the cool-down. Consumers MUST return a success status if they want to drop or ignore an event. Otherwise, they will receive the same event over and over again.

Slow consumers should queue and persist events locally and answer with 202 Accepted to prevent timeouts or events piling up too quickly. If a single request takes longer than http.timeout milliseconds, it is aborted and tried again. If the number of waiting events exceeds queue.size (per consumer), new events will be dropped and logged to a fail.log file.
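
A minimal consumer sketch using only the Python standard library (host, port and log format are assumptions). It acknowledges every event with 204 No Content so events are never re-sent:

Example consumer sketch
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class PushConsumer(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON event body.
        length = int(self.headers.get('Content-Length', 0))
        event = json.loads(self.rfile.read(length))
        print('archive %s changed (revision %s)' % (event['archive'], event['revision']))
        # Signal success, otherwise the same event is delivered again.
        self.send_response(204)
        self.end_headers()

HTTPServer(('127.0.0.1', 8081), PushConsumer).serve_forever()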

The fail.log file

The file configured with fail.log is used to store events that failed to be delivered. It contains one failed request per line, starting with the service URI, a single space, and the base64 encoded payload of the request. A timestamp is not logged since it can be easily recovered from the event payload itself.

Example fail.log entry
http://127.0.0.1:8081/push ewogICJ2YXVsd[...]IxMzYiLAp9Cg==

The PushEventFilter only appends to this file and there is no automatic clean-up. A warning is logged if this file is not empty at service start-up time, but there is no automatic recovery or re-queuing of events. This feature may be added in the future, though.

If you have consumers that are sensitive to lost events, make sure to check this file regularly. A short python script to re-submit events from a fail.log is shown here:

Example recovery script
import base64
import requests

headers = {'Content-Type': 'application/json'}
with open('/path/to/fail.log') as fp:
    for lineno, line in enumerate(fp):
        # Each line: "<target URI> <base64 encoded request payload>"
        target, payload = line.split(' ', 1)
        payload = base64.b64decode(payload)
        r = requests.post(target, data=payload, headers=headers)
        # Success codes as defined by the consumer API above.
        if r.status_code in (200, 202, 204):
            print("%d SUCCESS" % lineno)
        else:
            print("%d ERROR" % lineno)
            print(r)

RabbitMQSink

This plugin emits change events to a RabbitMQ message broker.

Warning
This plugin is experimental.

Configuration

Param

Type

Description

class

str

Always de.gwdg.cdstar.ext.rabbitmq.RabbitMQSink or RabbitMQSink

broker

URI

RabbitMQ transport URI to connect to, including authentication parameters and virtual node, if necessary.

exchange.name

str

Name of the exchange to publish to.

exchange.type

str

Type of the exchange (e.g. fanout). If not defined, then no exchange is declared and the exchange is assumed to already exist.

qsize

int

Size of the in-memory send-queue (default: 1024).

Reliability

Events are buffered in an in-memory send-queue and re-queued on any errors. This helps to compensate for short event bursts, temporary network failures or broker restarts.

Events that cannot be queued or re-queued are logged and dropped. This may happen during the shutdown phase or when the send-queue overflows.

Events are not (yet) part of the transaction logic. A forced shutdown or crash will lose all messages in the send-buffer. Also note that the broker itself may drop messages for various reasons, depending on its configuration. The possibility of losing events MUST be considered when using this plugin.

Embedded ActiveMQ Message Broker

This plugin emits change events to an embedded ActiveMQ message broker.

Warning
Embedding an ActiveMQ broker is fine for small to medium setups with low traffic and private networks. For production environments it is usually better to run a dedicated message broker with proper configuration and switch to the cdstar-activemq-sink or cdstar-rabbitmq-sink plugin.

Configuration

Param Type Description

transport.<name>

URI

Network transports to bind to. See ActiveMQ docs for available protocols and URI parameters. The <name> part is only used for logging and can be omitted for a single transport.

This plugin bundles all dependencies needed for OpenWire, AMQP, STOMP and MQTT. Transports with vm, tcp, amqp, stomp, mqtt and auto schemes as well as their +ssl or +nio variants can be used directly. Other protocols may need additional dependencies on the class path.

The auto transport accepts OpenWire, AMQP, STOMP and MQTT clients on the same network port and is recommended in setups with mixed clients.

Default: auto+nio://127.0.0.1:5671

topic

list(str)

Change events are sent to the given topics. (Default: cdstar)

queue

list(str)

Same as topic, but sends events to a queue. (Default: disabled)

buffer

int

Size of the send buffer. (Default: unbounded)

Change Event Sink: ActiveMQ

This plugin emits change events to an ActiveMQ message broker.

Configuration

Param Type Description

broker

URI

ActiveMQ transport URI to connect to, including authentication parameters, if necessary.

This plugin bundles all dependencies needed for OpenWire, AMQP, STOMP and MQTT. Transports with tcp, amqp, stomp, mqtt and auto schemes as well as their +ssl or +nio variants can be used directly. Other protocols may need additional dependencies on the class path.

topic

list(str)

Change events are sent to the given topics. (Default: cdstar)

queue

list(str)

Same as topic, but sends events to a queue. (Default: disabled)

qsize

int

Size of the send buffer. (Default: unbounded)

RedisSink

A dead simple plugin that emits change events to a redis server.

Configuration

Param

Type

Description

class

str

Always de.gwdg.cdstar.ext.redis.RedisSink or RedisSink

url

URI

A redis server or cluster URI (default: redis://localhost:6379/0)

key

string

Redis key or pub/sub channel to push events to. (default: cdstar.events)

mode

string

Push mode (see below). (default: RPUSH)

qsize

int

Maximum in-memory send-queue size. (default: 1024)

Push modes

  • RPUSH Right-push to a redis list. (default)

  • LPUSH Left-push to a redis list.

  • PUBLISH Publish to a redis pub/sub channel.

Reliability

This sink will buffer events in a bounded in-memory queue and send them out one by one as fast as it can. Any errors (network or redis errors, buffer queue overflow) will cause events to be logged and dropped (WARN level). On shutdown, the sink tries its best to send all remaining events, but will only do so for a couple of seconds. On a crash, all queued events are lost.

Or in other words: This sink is NOT reliable in any way. Network errors or crashes will cause events to be lost. On the plus side, this sink will not slow down cdstar if the redis server fails.

Search Proxy

This plugin installs a SearchProvider that forwards search requests to an external search gateway, using a simple HTTP protocol as described below.

To simplify gateway development and improve security, client credentials are NOT forwarded to the gateway. CDSTAR will authenticate and resolve client credentials before the search is forwarded, and only provide principal name and group memberships to the gateway. This enables user-specific searches without exposing client credentials to an external service.

Configuration

Example Configuration
plugin:
    search:
       class: ProxySearchPlugin
       target: "https://gateway.example.com/search"
       maxconn: 16
       header:
          X-Custom-Header: value
Table 72. Config Parameters
Name Description

class

Plugin class name. Always ProxySearchPlugin.

name

The name of this provider. Defaults to the value of _name.

target

URL to send search requests to. The target URL may contain authentication info.

maxconn

Maximum number of concurrent search requests (default: 10)

header.<name>

Additional HTTP headers to attach to each request.

Search gateway API

The search gateway should accept POST requests at the configured target URL with Content-Type: application/json and return results in the same format as the CDSTAR v3 search API. Search queries will be sent as JSON documents with the following fields:

Name Type Description

q

string

User provided search query.

fields

array(string)

An array of additional fields that should be returned with each hit. (optional)

order

array(string)

User provided order criteria as a list of field names to order by, each optionally prefixed with -. (optional)

limit

int

User provided limit for results per page. (optional)

scroll

string

User provided scroll handle. (optional)

vault

string

Name of the vault this search is performed on.

principal

object

Security context for this search request. If missing or None, assume an unauthenticated user.

principal.name

string

Name (including domain) of the user performing the search. (optional)

principal.groups

array(string)

List of groups the searching user belongs to. (optional)

principal.privileged

boolean

If true, assume the user can see all results. (default: false)

The q, fields, order, limit and scroll fields correspond to the (cleaned up) user provided search parameters as defined by the CDSTAR search API. vault and principal are added by CDSTAR. The search target should limit search results to entities visible to the specified principal. If no principal is present (null, missing or empty), the search should only return publicly visible results. If principal.privileged is true, the search should not filter by visibility and return all matching results.

Example Request
POST https://gateway.example.com/search
Content-Type: application/json
{
    "q": "search query",
    "order": ["-score"],
    "limit": 100,
    "fields": ["meta.dc:title"],
    "vault": "myVault",
    "principal": {
        "name": "alice@realm",
        "groups": ["users@realm"],
        "privileged": false
    }
}

Security considerations

Since the search gateway is not supposed to authenticate the searching user itself, but to trust the fields sent by CDSTAR, an attacker with direct access to the gateway could use it to perform searches on behalf of another user. Make sure that the gateway is only reachable from the CDSTAR instance, or is protected by HTTPS and some authentication mechanism (e.g. BASIC auth or secret headers).

Landing Page (UI)

The cdstar-ui plugin provides a very minimal browser-based UI (user interface) mounted at the /ui root path. This UI is targeted at humans and may require a modern JavaScript enabled browser to be fully usable. The URL scheme is not defined or stable, with one exception: /ui/<vault>/<archive> will always show (or redirect to) a human readable landing page for an archive. The user may be asked to log-in first for non-public archives.

Configuration

No configuration necessary, but this plugin honors the global api.context setting (default: /). This may be required if the service path cannot be detected automatically and assets are not loaded correctly.

example.yaml
plugin.ui.class: cdstar-ui

TusPlugin

The TusPlugin installs a tus.io compatible REST endpoint for uploading temporary files, along with a way for other APIs to reference these files via server-side data streams. This helps clients upload large files over unreliable network connections, or parallelize uploads of multiple files for the same archive.

Tip
TUS will NOT improve upload speed or throughput over stable network connections. The fastest and most efficient way to upload large files to cdstar is via Upload file. The best way to upload many small files to cdstar is via Update Archive. Only use TUS if uploads need to be resumable or you want to import the same file multiple times.

Configuration

Apart from the optional parameters listed below, there is no configuration for this plugin. Uploads will be placed into ${path.var}/tus/.

Example Configuration
plugin.tus.class: TusPlugin
Table 73. Config Parameters
Name    Description
class   Plugin class name. Always TusPlugin or de.gwdg.cdstar.rest.ext.tus.TusPlugin.
expire  Maximum number of milliseconds a TUS upload is kept on disk after the last byte was written. If the value has a suffix (S, M, H or D) it is interpreted as seconds, minutes, hours or days instead of milliseconds. (default: 24H)
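
For example, to keep completed uploads for 48 hours instead of the default 24:

Example: custom expire time
plugin.tus:
  class: TusPlugin
  expire: 48H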

Usage

The tus.io compatible REST endpoint is reachable under /tus at the root-level of the service (not /v3/tus but just /tus). After creating a TUS handle and uploading data following TUS protocol, the temporary file can be referenced as tus:<tusId>, where <tusId> is the last part of the TUS handle. For example, if your TUS handle was /tus/24e533e, then the internal reference to this resource would be tus:24e533e.

Currently, only Create Archive and Update Archive support server-side imports via the fetch:<target> functionality. For example, to import a completed TUS upload into an archive, you would send fetch:/path/to/target.file=tus:24e533e as a POST form parameter (see the sketch below). Note that the digests must still be computed, so a fetch may take just as long as uploading the file directly. TUS usually does not improve overall throughput, but may improve the reliability of large-file uploads over unreliable network connections. Use it wisely.
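
As a hedged sketch, using Python's requests library with a hypothetical host, vault name, archive ID, credentials and TUS handle:

Example: import a TUS upload (Python)
import requests

# Host, vault, archive ID, credentials and TUS handle are hypothetical.
resp = requests.post(
    "http://localhost:8080/v3/demo/ab587f42c257",
    auth=("test", "test"),
    data={"fetch:/path/to/target.file": "tus:24e533e"},
)
resp.raise_for_status()  # the fetch blocks until digests are computed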

Incomplete TUS handles that do not see any new data will expire after 2 hours. Once complete, the TUS handle can be referenced for another 24 hours before it expires. Handles that are not needed anymore can (and should) be deleted faster with a single DELETE request to the TUS handle.

Advanced topics

NioPool Storage

NioPool is the default StoragePool implementation for CDStar and provides transactional and robust persistence to a local or network-attached file system. It is usually bundled with the default distribution of CDStar and does not require any additional plugins.

Note
StoragePool is a low level interface and abstraction layer for the underlying physical storage. High level concepts (namely vaults, archives and files) map roughly to low level entities (pools, objects and resources) but should not be confused or mixed. The exact relations between high and low level concepts are described in a separate document (TODO).

This document describes the on-disk folder structure and index file format used by NioPool. The storage format is designed to be IO efficient and human-accessible at the same time: index files are human-readable and self-describing JSON files. In theory, all data and meta-data can be analyzed and recovered without prior knowledge or specialized software.

Folder structure

Storage objects are distributed into a directory tree with configurable depth, based on the first few character-pairs of the object ID. This reduces the maximum number of inodes per directory and helps keep file system metadata cache-friendly, even for large pools with millions of objects. For a depth of d, the lookup path is computed as follows: {poolName}/{id[0:2]}/…​/{id[(d-1)*2:d*2]}/{id}/. For example, given the default depth of d=2, an object with ID 0123456789abcdef would be stored in myPool/01/23/0123456789abcdef/.
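
The lookup path can be computed with a few lines of code, for example in Python:

Example: computing an object directory (Python)
def object_dir(pool_name, object_id, depth=2):
    # Split the first `depth` character pairs off the object ID.
    parts = [object_id[i * 2:i * 2 + 2] for i in range(depth)]
    return "/".join([pool_name, *parts, object_id])

assert object_dir("myPool", "0123456789abcdef") == "myPool/01/23/0123456789abcdef"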

Tip
NioPool follows symlinks, even across device borders. This makes it easy to split large repositories and distribute load across multiple file systems or storage devices.

All files related to a specific pool object are stored in the same folder. Each object folder contains at least a HEAD symlink pointing to the latest {revision}.json index file. This file describes the state and content of the object in human readable form (json). There will be an extra index file for each revision of the object. Binary resources are stored in separate {sha256}.bin files. If object packing is enabled, some index or resource files may be bundled into packs and must be unpacked before they can be used (see below).

Example: Pool object directory (empty)
cdstar-home/data/vaultName/dc/64/dc64abb808e0c227/
  ./HEAD -> ./e371ce6a077f88755c1155b507b757d5.json
  ./e371ce6a077f88755c1155b507b757d5.json
Example: Pool object directory (two revisions, one resource)
cdstar-home/data/vaultName/dc/64/dc64abb808e0c227/
  ./HEAD -> ./008f113ff1579f8aed9399bf7960118f.json
  ./008f113ff1579f8aed9399bf7960118f.json
  ./e371ce6a077f88755c1155b507b757d5.json
  ./30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58.bin
Example: Pool object directory (large object with many revisions and resources, packed)
cdstar-home/data/vaultName/dc/64/dc64abb808e0c227/
  ./HEAD -> ./2266dc2ff8067e68104607e750abb9d3.json
  ./2266dc2ff8067e68104607e750abb9d3.json
  ./15041131681337.pack.zip

Object index file format

Each time an object is modified, a new {revision}.json index file is created and the HEAD symlink is updated. These files contain a UTF-8 encoded JSON document describing the current state (contained resources, attributes and metadata) of the storage object in a human-readable form.

Warning
Fields with null or empty values may be skipped to save space, and additional fields may be added in future versions of this implementation. Keep that in mind if you plan to parse these files with custom tools.
Example for a {revision}.json index file with one resource
{
  "v" : 3,
  "id" : "dc64abb808e0c227",
  "rev" : "008f113ff1579f8aed9399bf7960118f",
  "parent" : "e371ce6a077f88755c1155b507b757d5",
  "type" : "application/x-cdstar;v=3",
  "ctime" : 1507979048000,
  "mtime" : 1507979048885,
  "x-cdstar:owner" : "test@static",
  "x-cdstar:mtime" : "2017-08-29T11:28:06.0722Z",
  "x-cdstar:acl:$owner" : "OWNER",
  "x-cdstar:rev" : "1",
  "resources" : [ {
    "id" : "8c5a29d5707b6927e8484e2cd5170749",
    "name" : "data/target.txt",
    "type" : "application/octet-stream",
    "size" : 1048576,
    "ctime" : 1507979048885,
    "mtime" : 1507979048885,
    "sha1" : "O3H0P/MPSxW1zYXdnpXrx+hOtaM=",
    "sha256" : "MOFJVevxNSJm3C/4Bn5oEEYH51CrudOzZYK4r5Cfy1g=",
    "md5" : "ttgbNgpWctgMJ0MPORU+LA=="
  } ]
}
Table 74. Object index properties
Name       Type    Description
v          int     Format version. Defaults to 3.
id         String  Pool object ID. Should be the same as the containing directory name.
rev        String  Revision string. Should match the file name.
parent     String  Revision string of the parent revision. This field can be used to traverse the revision history of an object. May be null or missing for the first revision of an object, which has no parent.
type       String  Application defined mime-type. May be null or missing.
ctime      long    Date and time of object creation (Unix epoch, millisecond resolution).
mtime      long    Date and time of last modification (Unix epoch, millisecond resolution).
x-{key}    String  Custom application defined key/value pairs.
resources  Array   Unordered list of resource records (see below). May be empty, null or missing.

Table 75. Resource record properties
Name     Type    Description
id       String  Unique resource identifier. This string is unique per object, not globally.
name     String  Application defined resource name. This should be unique per object, but uniqueness is not enforced. May be null or missing.
type     String  Application defined content-type. May be null or missing.
enc      String  Application defined content-encoding. May be null or missing.
size     Long    Size of the resource binary data in bytes.
ctime    long    Date and time of resource creation (Unix epoch, millisecond resolution).
mtime    long    Date and time of last modification (Unix epoch, millisecond resolution).
src      String  External location identifier for the resource binary content. May be null or missing, in which case the resource is either empty or stored in the default location (see below). If set, the data file may be removed by garbage collection and additional steps are required to recover the content of the resource.
md5      Base64  MD5 hash of the resource content as a base64 string. May be null or missing.
sha1     Base64  SHA-1 hash of the resource content as a base64 string. May be null or missing.
sha256   Base64  SHA-256 hash of the resource content as a base64 string.
x-{key}  String  Custom application defined key/value pairs.

Dates are stored as Unix epoch timestamps with millisecond resolution (signed long integer). While not directly human-readable, these are easily recognized and a very common exchange format for points in time. Most programming languages provide built-in tools to translate an epoch timestamp into a human-readable form.
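
For example, in Python:

Example: converting an index timestamp (Python)
from datetime import datetime, timezone

mtime = 1507979048885  # millisecond value from the example above
print(datetime.fromtimestamp(mtime / 1000, tz=timezone.utc))
# 2017-10-14 11:04:08.885000+00:00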

Resource default location

By default, the uncompressed binary content of each non-empty resource is stored in the object directory as a {sha256}.bin file named after the lower-case hex-encoded SHA-256 digest of its content. These files always end in .bin regardless of their actual content-type. If this file is missing, the resource may either have been packed (see "Object Packing") or externalized (see "External resources") and additional steps are required to recover the binary content of the resource.
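
Since the index files are self-describing, consistency checks can be scripted without specialized software. A hedged sketch in Python that verifies all default-location data files of a single object against its HEAD index:

Example: verifying resource files (Python)
import base64, hashlib, json, os

def check_object(obj_dir):
    with open(os.path.join(obj_dir, "HEAD")) as fp:  # follows the symlink
        index = json.load(fp)
    for res in index.get("resources") or []:
        if res.get("src") or not res.get("size"):
            continue  # packed, externalized or empty: no .bin file expected
        # File names are the lower-case hex form of the base64 digest.
        path = os.path.join(obj_dir, base64.b64decode(res["sha256"]).hex() + ".bin")
        digest = hashlib.sha256()
        with open(path, "rb") as data:
            for chunk in iter(lambda: data.read(1 << 20), b""):
                digest.update(chunk)
        assert base64.b64encode(digest.digest()).decode() == res["sha256"], path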

External resources

If the src field of a resource record is set, the corresponding {sha256}.bin resource file is subject to garbage-collection and may be removed at any time. In this case, the value of the src field should contain enough information to recover the resource file manually or with the help of an application-specific process. The src field MUST start with a prefix defined in this document, or with x- followed by an application defined location hint (e.g. a URI).

Object Packing (not implemented)

Resource files in an object directory may be bundled into one or more *.pack.zip files to save inodes and disk space. Compression can also help reduce IO pressure on the storage device in exchange for higher CPU usage during read access. This trade-off may be beneficial, in particular for rarely accessed objects or resources with highly compressible content.

Resources stored in a pack have a src value of pack:<pack-file-name> and follow default naming rules ({sha256}.bin) within the pack file.

Note
The zip format allows fast lookup and random access to individual files. Other common packaging formats (e.g. tar) require linear scans in order to find a specific file. The drawbacks of the zip format (e.g. low-resolution timestamps or file name limitations) are negligible, as this information is also present in the object index file.

Temporary data

NioPool may create temporary .tmp files or directories within an object directory. These may contain data required for recovery, so do not delete these files after an unclean shutdown or while the service is running. Temporary files that remain after an ordinary shutdown can be removed.

Locking, concurrency control and transactional storage

Any actor that creates or removes files other than *.tmp in an object directory, or intends to change the target of the HEAD symlink, MUST acquire a HEAD_NEXT file lock before doing so. The HEAD_NEXT file SHOULD be a symlink pointing to a (possibly not yet created) index file. To change the HEAD link, make sure that the HEAD_NEXT target exists and is synchronized to disk, then move-and-replace HEAD_NEXT to HEAD. Any error during this sequence should result in a dangling HEAD_NEXT symlink, protecting the object from further manipulation until manual or automatic recovery succeeds. In a disaster situation, either HEAD or HEAD_NEXT (or both) exists and the object can be rolled back or committed manually.

Tip
Some file systems do not implement an atomic move-and-replace operation. In this case, HEAD must be removed before HEAD_NEXT can be renamed. Clients may try to access HEAD in the short time span in which it does not exist. Robust implementations should simply retry a couple of times.
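
A sketch of this commit sequence in Python, assuming a POSIX file system with an atomic rename; writing the index file itself is elided:

Example: HEAD_NEXT commit sequence (Python)
import os

def commit(obj_dir, new_rev):
    head_next = os.path.join(obj_dir, "HEAD_NEXT")
    # Acquire the lock: raises FileExistsError if an earlier, unfinished
    # transaction left a HEAD_NEXT symlink behind.
    os.symlink(new_rev + ".json", head_next)
    # ... write and fsync the new {revision}.json index file here ...
    # Atomic move-and-replace of the HEAD symlink.
    os.replace(head_next, os.path.join(obj_dir, "HEAD"))
    # Make the rename durable by syncing the directory entry.
    dir_fd = os.open(obj_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)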

Configuration

StoragePool configuration is stored by CDSTAR in a vault.yaml file within the pool base directory and can be bootstrapped during vault creation with predefined parameters. NioPool supports the following configuration parameters:

Configuration Parameters
Name       Type    Description
path       String  Path to the vault base directory (required, default: ${path.data}/${vaultName}/).
cacheSize  int     Number of manifests to keep in an in-memory cache for faster load times.
autotrim   bool    If enabled, schedule a garbage collection run after each successful commit for each modified object.
digests    str     Comma separated list of digests to calculate. SHA-256 is always calculated. Defaults to: MD5,SHA-1,SHA-256.
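
As a hedged example, assuming pool parameters can be bootstrapped with a pool. prefix when a vault is created from the main configuration (vault name and values are placeholders):

Example: bootstrapping NioPool parameters
vault.myVault:
  create: True
  pool.autotrim: True
  pool.digests: "MD5,SHA-256"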

Storage Profiles

CDSTAR supports and integrates third-party long-term storage systems (LTS, e.g. tape libraries) via storage profiles. From the user's perspective, a storage profile defines where and how data should be stored. By assigning a storage profile to a CDSTAR archive, the user can control data migration to and from LTS in a coherent, safe and predictable way. The actual data migration happens in the background and is fully managed by CDSTAR.

Profile mode: HOT vs. COLD

Storage profiles can be either "hot" or "cold", which changes the way CDSTAR handles its local data.

Hot profiles cause CDSTAR to copy the archive content to external storage, but keep all data available in CDSTAR as well. While the profile is in effect, only administrative metadata (owner, ACLs, storage profile, …​) can be modified. The actual content (files and metadata) is write-protected to prevent stale LTS copies.

Cold profiles, on the other hand, allow CDSTAR to reclaim disk space by deleting archive files from disk after a copy was stored externally. Metadata is still kept available, but file content can no longer be accessed through CDSTAR. The profile needs to be changed to default or to a hot profile to make file content available again.

Hot profiles are meant to increase long-term availability or data integrity guarantees by storing important data in a second location. Cold profiles are mostly used to store large amounts of rarely accessed data in a more cost-effective way (e.g. on tape), while keeping metadata searchable and discoverable.

Profile configuration

Profiles can be configured globally and enabled or disabled per vault. A profile currently consists of just a name, a mode (hot or cold) and an associated LTS target, which is configured separately as a plugin. This allows multiple profiles to reference the same LTS target with different configurations.

profile:
  bagit-hot:
    lts.name: bagit
  bagit-cold:
    lts.name: bagit
    lts.mode: cold

LTS target configuration

Data migration from or to third-party LTS systems is highly dependent on the system in use. Multiple implementations are available and can be loaded via the CDSTAR plugin infrastructure. CDSTAR bundles a general purpose implementation that exports to BagIt directories and allows an external process to perform the actual LTS migration asynchronously.

plugin:
  bagit:
    class: BagitTarget
    path: /path/to/store/bagit/

LTS handlers are referenced by name, so special care must be taken when removing or renaming LTS handlers. Do not remove or rename an LTS target as long as there are archives still referencing it.

How not to lose data

Moving data out of the CDSTAR system, especially with cold profiles, bears some risks that should be well understood before enabling the LTS feature. Please read this chapter carefully.

After a successful migration to an LTS target, CDSTAR stores the LTS name and a unique location identifier (generated by the LTS) in non-public archive properties. These are used to recover missing files in case of a future profile change. Cold profiles allow CDSTAR to remove local copies of archived files after successfully copying these files to LTS. If the LTS goes away, for whatever reason, CDSTAR has no way to recover the missing files and the archive is stuck in cold state. File content will be unavailable and data migration after profile changes will fail.

  • Do not remove or rename an LTS target as long as there are archives still referencing it.

  • When updating LTS Plugins or changing configuration, ensure that existing location identifiers remain valid.

  • Monitor CDSTAR logs for failed migrations.

BagIt LTS Target

This LTS target exports archives into BagIt folders and is designed to work with external worker processes that perform the actual migration from/to LTS storage (e.g. tape).

The exporter will create a BagIt package in a temporary folder, then rename it to [name].bagit with a unique name. A worker process may check for these folders and copy or move data to LTS.

The importer will create a file named [name].want and start the import as soon as the [name].bagit folder can be found. A worker process should check for these [name].want files and recover the missing [name].bagit folder from LTS. Once complete, the importer will delete the [name].want file and the recovered [name].bagit folder can be cleaned up by the worker.

If the external copy is no longer needed, a [name].delete file is created. A worker process should watch for these files, remove the external copy (if any), remove the [name].bagit directory (if present), and then also remove the [name].delete file.

External workers are allowed to create additional files for their own state handling, as long as they do not interfere with the names defined here.
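
A hedged worker-loop sketch in Python, using a plain directory as a stand-in for the real LTS system; both paths and the polling interval are placeholders:

Example: BagIt worker loop (Python)
import os, shutil, time

EXPORT_DIR = "/path/to/store/bagit"  # 'path' of the BagitTarget plugin
LTS_DIR = "/mnt/lts"                 # stand-in for the real LTS system

def scan_once():
    for entry in os.listdir(EXPORT_DIR):
        name, ext = os.path.splitext(entry)
        bagit = os.path.join(EXPORT_DIR, name + ".bagit")
        lts = os.path.join(LTS_DIR, name + ".bagit")
        if ext == ".bagit" and not os.path.exists(lts):
            # Exports are renamed into place, so a .bagit folder is complete.
            shutil.copytree(bagit, lts)
        elif ext == ".want" and not os.path.exists(bagit):
            shutil.copytree(lts, bagit)  # recover the bag; the importer resumes
        elif ext == ".delete":
            shutil.rmtree(lts, ignore_errors=True)    # drop the external copy
            shutil.rmtree(bagit, ignore_errors=True)  # drop the local bag
            os.remove(os.path.join(EXPORT_DIR, entry))  # acknowledge the request

while True:
    scan_once()
    time.sleep(10)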

Archive Snapshots

Archive snapshots are an efficient way to preserve the current payload of an archive without actually creating a full copy. They can be used to implement versioning, tag important milestones or create immutable and citable releases for publishing.

From a user's perspective, snapshots are virtual read-only archives that represent the payload of their source archive at a specific point in time. The payload of a snapshot will not change if the source archive is modified. Other aspects however, most notably owner and access control information, are transparently inherited from the source archive and will change if the source archive changes. One exception is the storage profile, which can be changed on a per-snapshot basis independently of the source archive. See Storage Profiles for details.

Once created, most read-only operations that work on an archive are also available for snapshots. In the REST API, snapshots are referenced by the source archive name, followed by an @ character and the snapshot name. For example, GET /v3/somevault/ab587f42c257@v1/data.csv would fetch a file from the v1 snapshot instead of the current archive state. Details are explained in the REST API documentation.

Sparse Copies and Deduplication

On the storage level, snapshots live in separate storage objects, but are created in a way that allows them to share common data files with their source archive or other snapshots, if supported by the storage back-end. This ensures that snapshots take up only a minimal amount of additional storage space and are usually far more efficient than actually copying an entire archive. NioPool implements this at the file-system level by hard-linking files with the same content, and only creating a copy if content changes (copy-on-write semantics).
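
The effect of this hard-linking technique can be illustrated with a few lines of Python:

Example: shared inodes via hard links (Python)
import os, tempfile

base = tempfile.mkdtemp()
src = os.path.join(base, "source.bin")
snap = os.path.join(base, "snapshot.bin")
with open(src, "wb") as fp:
    fp.write(b"payload")
os.link(src, snap)
# Both names point to the same inode, so the data is stored only once:
print(os.stat(src).st_nlink)  # prints 2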