Documentation

CDStar is a package-oriented data-management framework for scientific and other data-driven applications. It enables the development of powerful tools and workflows against a simple and stable REST interface that hides away the details and complexities of the actual storage back-end in use.

The CDStar storage API is organized in vaults, archives (or packages) and files: A vault can store any number of archives and distributes them transparently across different storage infrastructures. Each archive is identified by a unique ID and contains a list of named files. Within an archive, files can be organized in folder structures and annotated with search-able attributes. Archives themselves can also be annotated. The search integration indexes attributes and file content to allow near-realtime search across an entire vault.

Getting started

This guide is a step-by-step tutorial which shows how to install, configure, and use cdstar in a simple example setup. You will download and run cdstar locally, configure a single vault and store some files. All you need is a computer with Java (11 or newer) installed. This tutorial assumes you are running some flavor of Linux.

Installation

CDSTAR is written in Java and the "cdstar.jar" binary distribution runs on any platform with a compatible Java Runtime Environment (OpenJDK or Oracle Java 11 or newer). There are several ways to obtain a recent version of cdstar, described here.

Download binary release

wget https://cdstar.gwdg.de/release/dev/cdstar.jar

Older and stable releases are also available here: https://cdstar.gwdg.de/release/

Build from source

Building CDSTAR requires a Java JDK (Java 11 or newer) and Maven. The CDSTAR source distribution ships with a Maven wrapper script (./mvnw or ./mvnw.bat) that fetches the correct version of Maven and should be preferred over whatever Maven version is offered as a system package by your distribution.

Install build dependencies

sudo apt install git build-essential # for 'git' and 'make'
sudo apt install default-jdk-headless

Checkout source code

git clone https://gitlab.gwdg.de/cdstar/cdstar.git
cd cdstar

Build standalone server executable

make cdstar.jar
# or manually:
./mvnw -pl cdstar-cli -am -DskipTests=true -Pshaded clean package
cp cdstar-cli/target/cdstar-cli-*-shaded.jar cdstar.jar

Tip	The `-DskipTests=true` parameter will save you some time. Releases are always tested before they are published, so there is no point in running all tests again.

Configuration

CDStar can read configuration from yaml and json files, whichever you prefer. Here is a small example to get you started:

Example cdstar-demo.yaml

---
path.home: /tmp/cdstar-demo
vault.demo:
  create: True
  public: True
  pool.autotrim: True

realm.static:
  class: StaticRealm
  # This role can create, read and list archives in the 'demo' vault.
  role.demoRole: vault:demo:create, vault:demo:read, vault:demo:list
  # This group inherits all permissions from 'demoRole'.
  group.demoGroup: "demoRole"
  # This user has the password 'test' and belongs to the 'demoGroup'.
  # Password hashes can be computed using cdstar.jar:
  #   $ java -cp cdstar.jar de.gwdg.cdstar.auth.realm.StaticRealm
  user.test:
    password: "cGxhaW4=:FmtSc7NSX8fsjLTmpLpoqRLP4vqWFg/r5uy3EU6JsEs="
    groups: demoGroup
    # permissions: ...
    # roles: ...

Note	A secure password-hash can be generated with the `java -cp cdstar.jar de.gwdg.cdstar.auth.realm.StaticRealm` tool.

The only required parameter is path.home. Everything else is optional. See Configuration for details.

First run

$ java -jar cdstar.jar -c cdstar-demo.yaml run -p 8080

Test if server is running

curl http://localhost:8080/v3/

Command line parameters

Example: cdstar --help

Usage: cdstar [-h] [--version] [--log-config=<file>] [-c=<file>]...
              [-C=<key=value>]... [--debug=<logger>]... [--trace=<logger>]...
              [COMMAND]
Run or manage CDSTAR instances
  -c, --config=<file>       Load configuration from this file(s). Prefix the
                              filename with '?' to mark it optional
  -C=<key=value>            Override individual configuration parameter. Use
                              'KEY=VALUE' to override or 'KEY+VALUE' to append
      --debug=<logger>      Increase loggin for specific packages. The value
                              'ROOT' may be used as an alias for the root logger
  -h, --help                Print help and exit
      --log-config=<file>   Provide a custom log4j2.properties file
      --trace=<logger>      Increase loggin for specific packages. The value
                              'ROOT' may be used as an alias for the root logger
      --version             Show version string and exit
Commands:
  run     Start server instance
  config  Manage configuration
  vault   Manage vaults

Example: cdstar run --help

Usage: cdstar run [-h] [-b=<bind>] [-p=<port>]
Start server instance
  -b, --bind=<bind>   Override 'http.host' setting
  -h, --help          Print help and exit
  -p, --port=<port>   Override 'http.port' setting

Run as a service

CDStar can be compiled into a cdstar.war file and run within a servlet container, but this is not recommended and not officially supported. CDStar also does not offer any built-in daemonizing capabilities. If you want to run cdstar as a long-running background process, use proper system tools like systemd, supervisord or traditional init.d scripts and start-stop-daemon as a last resort.

Systemd: Example config

# /etc/systemd/system/cdstar.service
[Unit]
Description=CDStar Storage Service
After=syslog.target network.target remote-fs.target

[Service]
User=cdstar
ExecStart=/usr/bin/java -jar /path/to/cdstar.jar -c /etc/cdstar/cdstar.yaml run -p 8080

[Install]
WantedBy=multi-user.target

Systemd: Enable service

sudo systemctl enable cdstar.service

Tutorial

For this tutorial we are using the excellent requests python library and assume that you already have an instance up and running on http://localhost:8080/ with an account that is allowed to create archives in a vault named demo.

Creating our first Archive

To begin, we import some helpful functions from the 'requests' module, define our API base URL and create our first archive.

Setup and create a new archive

>>> from requests import get, post, put, delete
>>> baseurl = 'http://test:test@localhost:8080/v3'
>>> r = post(baseurl + '/demo/')
>>> r.status_code
201
>>> r.headers['Location']
"/v3/demo/ab587f42c2570a884"
>>> r.json()
{
    'id': 'ab587f42c2570a884',
    'vault': 'demo',
    'revision': '0'
}

CDStar returns JSON most of the time, so we can use requests.Response.json() to parse the response directly into a python dictionary. In this case, we are only interested in the id field of the response. This string identifies our archive within a vault and can be used to build the archive URL. Alternatively, we could just follow the Location header.

The archive is still empty. We can list its content with a simple GET request.

Show Archive Info

>>> get(baseurl + '/demo/ab587f42c2570a884').json()
{
  "id": "ab587f42c2570a884",
  "vault": "demo",
  "revision": "0",
  "created": "2016-12-20T13:59:37.160+0000",
  "modified": "2016-12-20T13:59:37.231+0000",
  "file_count": 0
}

As you can see, there are no files in this archive. Let’s change that and upload some files.

Upload Files

There are multiple ways to populate an archive. The simplest way is to send multipart/form-data POST requests to the archive URL. Each file upload with a name that starts with a slash (e.g. /example.txt) creates a new file in our archive.

Upload files

>>> files = {'/report.xls': open('report.xls', 'rb')}
>>> post(baseurl + '/demo/ab587f42c2570a884', files=files).json()
{
    'id': 'ab587f42c2570a884',
    'vault': 'demo',
    'revision': '1',
    'report:': [ {
        'change': 'file',
        'file': {
            'name': 'report.xls',
            'type': 'application/vnd.ms-excel',
            'size': 65992,
            'created': '2016-12-20T13:59:37.217+0000',
            'modified': '2016-12-20T13:59:37.218+0000',
            'digests': {
                'md5': '1a79a4d60de6718e8e5b326e338ae533',
                'sha1': 'c3499c2729730a7f807efb8676a92dcb6f8a3f8f',
                'sha256': '50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c'
            }
        }
    } ]
}

The response is JSON again and contains a list of all files that changed during the last request. We can use this info to double-check if everything was uploaded correctly.
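
The double-check mentioned above can be sketched with Python's standard hashlib module. The helper names local_digests and upload_matches are hypothetical, and the snippet assumes that a report entry carries a 'file' object with 'size' and 'digests' fields as in the example response.

```python
import hashlib
import os


def local_digests(path, algorithms=("md5", "sha1", "sha256"), chunk_size=8192):
    """Compute hex digests of a local file, reading it in chunks."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def upload_matches(file_info, path):
    """Compare size and digests of a reported file entry with a local file.

    `file_info` is assumed to look like the 'file' object in the report above.
    Only the digest algorithms present in the report are computed locally.
    """
    reported = file_info.get("digests", {})
    local = local_digests(path, algorithms=tuple(reported))
    return os.path.getsize(path) == file_info["size"] and local == reported
```

With the response from the upload example stored in a variable, a call such as upload_matches(response['report:'][0]['file'], 'report.xls') would compare the server-side record with the local file.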

Annotate Archives and Files

Now we want to attach some meta attributes to our archive and the file we just uploaded. We send just another POST request to the same URL, but this time we use form-fields starting with meta: to define new meta attributes on the archive or a file within the archive.

Set metadata properties

>>> data = {
...  'meta:dc:title': 'My Report Archive',             # (1)
...  'meta:dc:title:/report.xls': 'My Report',         # (2)
...  'meta:dc:contributor': ['Alice', 'Bob'],          # (3)
... }
>>> post(baseurl + '/demo/ab587f42c2570a884', data=data).json()
{
    'id': 'ab587f42c2570a884',
    'vault': 'demo',
    'revision': '2',
    'report:': [ {
        'change': 'meta',
        'field': 'dc:title',
        'values': ['My Report Archive']
    }, {
        'change': 'meta',
        'field': 'dc:contributor',
        'values': ['Alice', 'Bob']
    }, {
        'change': 'meta',
        'field': 'dc:title',
        'file': 'report.xls',
        'values': ['My Report']
    } ]
}

(1) Meta form fields start with meta: followed by the field name.
(2) If a meta attribute should be set on a specific file instead of the archive, you can specify the file name after the field name, separated by a /.
(3) Some meta attributes accept more than a single value.

Just like the file upload example from above, we get back a report of everything that changed.

Tip	You can upload multiple files and set multiple meta-attributes with a single request. It is even possible to create a fully populated archive in a single step by submitting the POST request to the createArchive endpoint.
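
Building the form-field names by hand is error-prone, so a small helper can be useful. The following is only a sketch: meta_field is a hypothetical name, and the 'meta:<name>:/<file>' pattern is taken from the example request above rather than from a formal specification.

```python
def meta_field(name, file=None):
    """Build a form-field name for a meta attribute.

    Follows the pattern used in the example above:
    'meta:<name>' for the archive, 'meta:<name>:/<file>' for a single file.
    """
    field = "meta:" + name
    if file is not None:
        # Accept file names with or without a leading slash.
        field += ":/" + file.lstrip("/")
    return field


data = {
    meta_field("dc:title"): "My Report Archive",
    meta_field("dc:title", file="report.xls"): "My Report",
    meta_field("dc:contributor"): ["Alice", "Bob"],
}
```

The resulting dictionary has the same keys as the one in the example and can be passed to post(..., data=data) in the same way.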

List Files and Meta-Attributes

Let us have a look at our archive again and also request file and meta-attribute listings this time.

Show Archive Info

>>> get(baseurl + '/demo/ab587f42c2570a884?with=files,meta').json()
{
  "id": "ab587f42c2570a884",
  "vault": "demo",
  "revision": "2",
  "created": "2016-12-20T13:59:37.160+0000",
  "modified": "2016-12-20T13:59:37.231+0000",
  "file_count": 1,
  'meta': {
    'dc.title': ['My Report Archive']
  },
  'files': [ {
    'name': '/report.xls',
    'type': 'application/vnd.ms-excel',
    'size': 65992,
    'created': '2016-12-20T13:59:37.217+0000',
    'modified': '2016-12-20T13:59:40.114+0000',
    'digests': {
      'md5': '1a79a4d60de6718e8e5b326e338ae533',
      'sha1': 'c3499c2729730a7f807efb8676a92dcb6f8a3f8f',
      'sha256': '50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c'
    },
    'meta': {
        'dc.title': ['My Report']
    }
  } ]
}

The file and meta fields are hidden by default and only included if you add with=files,meta as a query parameter. For large archives, you can even filter and paginate the returned information. See getArchiveInfo for details.

Direct File API (CRUD)

Each file within an archive has its own URL, for example /myVault/ab587f42c2570a884/some/file.txt. You can create, read, update or delete individual files by sending the respective PUT, GET, POST or DELETE requests to these URLs, which is sometimes a lot easier than working with the form-based API described earlier, especially from within scripts or programmable REST clients.

First, let’s upload a new file to the archive. Just PUT the raw file content to the file URL.

Example: Create or replace a file

>>> with open('example.txt', 'rb') as fp:
...     put(baseurl + '/demo/ab587f42c2570a884/some/example.txt', data=fp).json()
{
  'name': 'some/example.txt',
  'type': 'text/plain',                             (1)
  'id': '4e2cdf90ae00bff1e2bad79ffebdb63b',         (2)
  'size': 12,
  'created': '2017-07-25T11:08:02.558+0000',
  'modified': '2017-07-25T11:08:02.602+0000',
  'digests': {
    'sha256': '30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58',
    'sha1': '3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3',
    'md5': 'b6d81b360a5672d80c27430f39153e2c'},
}

(1) The type is auto-detected from the file name if you do not specify a Content-Type header.
(2) The id of a file does not change, even if you rename or modify it.

If you need more control over whether a file should be overwritten or not, you can add one of the following conditional headers to your request:

Table 1. Conditional headers for PUT requests
Header	Description
`If-None-Match: *`	Create new file. If the file already exists, it is not modified.
`If-Match: *`	Update existing file. If the file does not exist, it is not created.

You should check for 412 Precondition Failed errors in your application if you use these headers.
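
As a rough sketch, the two headers from Table 1 can be wrapped in a helper so that the intent is visible at the call site. The function names and the mode strings are hypothetical; only the header names and the 412 status code come from the text above.

```python
def conditional_headers(mode):
    """Return request headers for a conditional PUT.

    mode='create'  -> only create the file if it does not exist yet
    mode='update'  -> only replace the file if it already exists
    mode='replace' -> unconditional create-or-replace (no extra header)
    """
    if mode == "create":
        return {"If-None-Match": "*"}
    if mode == "update":
        return {"If-Match": "*"}
    if mode == "replace":
        return {}
    raise ValueError("unknown mode: %r" % (mode,))


def precondition_failed(status_code):
    """True if the server rejected the request because the condition did not hold."""
    return status_code == 412
```

With the requests library from this tutorial, the headers would be passed as put(url, data=fp, headers=conditional_headers('create')), followed by a check of precondition_failed(r.status_code).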

Once the file is stored in the archive, you can retrieve it using the same URL.

Example: Download a file

>>> r = get(baseurl + '/demo/ab587f42c2570a884/some/example.txt', stream=True)
>>> with open("download.txt", 'wb') as fd:
...     for chunk in r.iter_content(chunk_size=1024*8):
...         fd.write(chunk)

This downloads the entire file and stores it locally. You can also request parts of the file (using Range headers) and make your request conditional (If-Match, If-None-Match, If-Modified-Since, If-Unmodified-Since and If-Range headers are fully supported).
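
A partial download only needs an additional Range header. The helper below is a minimal sketch using the standard HTTP byte-range syntax; range_header is a hypothetical name and not part of any CDStar client.

```python
def range_header(first, last=None):
    """Build a Range header for a byte range (zero-based, both offsets inclusive).

    range_header(0, 1023) asks for the first 1024 bytes,
    range_header(1024) asks for everything from byte 1024 to the end.
    """
    if first < 0 or (last is not None and last < first):
        raise ValueError("invalid byte range")
    end = "" if last is None else str(last)
    return {"Range": "bytes=%d-%s" % (first, end)}
```

It would be used as get(url, headers=range_header(0, 1023), stream=True). Which status code the server returns for partial content is not covered here, so check the endpoint documentation before relying on it.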

Instead of the actual file content, you can also request the file attributes or meta-attributes via the info and meta sub-resources.

Example: Get file attributes or meta-attributes

>>> get(baseurl + '/demo/ab587f42c2570a884/some/example.txt?info').json()
{
  'name': 'some/example.txt',
  'type': 'text/plain',
  'id': '4e2cdf90ae00bff1e2bad79ffebdb63b',
  'size': 12,
  'created': '2017-07-25T11:08:02.558+0000',
  'modified': '2017-07-25T11:08:02.602+0000',
  'digests': {
    'sha256': '30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58',
    'sha1': '3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3',
    'md5': 'b6d81b360a5672d80c27430f39153e2c'},
}
>>> get(baseurl + '/demo/ab587f42c2570a884/report.xls?meta').json()
{
  'dc:title': ['My Report']
}

Tip	Since `meta` is a sub-resource of `info`, you can fetch both at the same time via `?info&with=meta`.

And finally: Deleting individual files is just a plain and simple DELETE request.

Example: Delete file from archive

>>> delete(baseurl + '/demo/ab587f42c2570a884/some/example.txt')

That's it for now. To be continued …

Configuration

CDStar is configured via configuration files (YAML or JSON), command-line arguments or environment variables, or a combination thereof. In any case, configuration is treated as a flat list of dot-separated keys and plain string values (e.g. key.name=value). File formats that support advanced data types and nesting (namely JSON and YAML) are flattened automatically when loaded. Arrays or multiple values for the same key are simply joined into a comma-separated list.

Example: Nested documents are flattened automatically.

---
# Nested document
path:
  home: "/mnt/vault"
vault.demo:
  create: True
---
# Flattened form
path.home: "/mnt/vault"
vault.demo.create: "True"
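
The flattening rule can be illustrated with a few lines of Python. This is a sketch of the behavior described above, not CDStar's actual loader; in particular, converting values with str() is an assumption that happens to reproduce the "True" string from the example.

```python
def flatten(document, prefix=""):
    """Flatten a nested mapping into dot-separated keys with string values."""
    flat = {}
    for key, value in document.items():
        full_key = prefix + str(key)
        if isinstance(value, dict):
            # Nested mappings contribute their keys below the current prefix.
            flat.update(flatten(value, full_key + "."))
        elif isinstance(value, (list, tuple)):
            # Arrays are joined into a comma-separated list.
            flat[full_key] = ",".join(str(item) for item in value)
        else:
            flat[full_key] = str(value)
    return flat


nested = {"path": {"home": "/mnt/vault"}, "vault.demo": {"create": True}}
print(flatten(nested))
# {'path.home': '/mnt/vault', 'vault.demo.create': 'True'}
```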

Value references

Values may contain references to other keys (e.g. ${path.home}) or environment variables (e.g. ${ENV_NAME}). The latter is recommended for sensitive information that should not appear in config files or command line arguments (e.g. passwords). A colon (:) is used to separate the reference from an optional default value.

For example, ${CDSTAR_HOME:/var/lib/cdstar} would be replaced by the content of the CDSTAR_HOME environment variable, or the default path if the environment variable is not defined.
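
To make the reference syntax concrete, here is a small resolver sketch. It is not CDStar's implementation: the lookup order (configuration keys first, then environment variables, then the default) and the KeyError for unresolved references are assumptions, and cyclic references are not detected.

```python
import re

# Matches ${name} and ${name:default}; the default may itself contain colons.
_REFERENCE = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")


def resolve(value, config, env):
    """Replace ${name} and ${name:default} references in a config value."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in config:
            # Referenced values may contain references themselves.
            return resolve(config[name], config, env)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(name)
    return _REFERENCE.sub(substitute, value)


config = {
    "path.home": "${CDSTAR_HOME:/var/lib/cdstar}",
    "path.data": "${path.home}/data",
}
print(resolve(config["path.data"], config, env={}))
# /var/lib/cdstar/data
```

Passing os.environ as env would resolve against the real process environment.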

Disk Storage

CDStar stores all its data and internal state on the file system. You usually only need to set path.home, as all other parameters default to subdirectories under the path.home directory.

path.home: This directory is used as the base directory for the other paths. (default: ${CDSTAR_HOME:/var/lib/cdstar/})
path.data: Storage location for archive-data and runtime information. CDStar creates a subdirectory for each vault and follows symlinks, which makes it easy to split the storage across several mounted disks. (default: ${path.home}/data)
path.var: Storage location for short-lived temporary data. Do NOT use a ramdisk or other volatile storage, as transaction and crash-recovery data will also be stored here. (default: ${path.home}/var)
path.lib: Plugins and extensions are searched for in this directory, if they are not found on the default java classpath. (default: ${path.home}/lib)

Transports

CDStar supports http and https transports out of the box. By default, only the unencrypted http transport is enabled and binds to localhost port 8080. The high port number allows CDStar to run as non-root, which is the recommended mode of operation.

External access should be encrypted and off-loaded to a reverse proxy (e.g. nginx) for security and performance reasons. Only enable the built-in https transport for testing or if you know what you are doing.

http.host: IP address to bind to. A value of 0.0.0.0 will bind to all available interfaces at the same time. (default: 127.0.0.1).
http.port: Network port to bind to. Ports below 1024 require root privileges (not recommended). A value of 0 will bind to a random free port. A value of -1 will disable this transport. (default: 8080)
https.host: IP address to bind to. (default: ${http.host}).
https.port: Network port to listen to. (default: 8433)
https.certfile: Path to a *.pem file containing the certificate chain and private key. (required)
https.h2: Enable HTTP/2. This requires Java 9+ and should be considered experimental. (default: false)

Public REST API

The REST API is exposed over all configured transports.

api.dariah.enable: Enable or disable the dariah REST API. (default: False)
api.v2.enable: Enable or disable the legacy v2 REST API. (default: False)
api.v3.enable: Enable or disable the current v3 REST API. (default: True)
api.context: Provide the public service URL. This is required if cdstar runs behind a reverse proxy or load balancer and cannot detect its public URL automatically. (default: /)

Vaults

Vaults are usually created at runtime via the management API, but can also be bootstrapped from configuration. Statically configured vaults are created at startup if they do not exist, and ignored otherwise. It is not possible to change the parameters of a vault via configuration after it was created.

vault.<name>.create: If true, create this vault on startup if it does not exist already.
vault.<name>.public: If true, allow public (non-authenticated) read access to this vault. Archive permissions are still checked.

Each vault is backed by a storage pool, which can be configured as part of the vault configuration. The default pool configuration looks like this, and may be overwritten if needed (experimental, not recommended).

vault.<name>.pool.class: Storage pool class or name. Defaults to the NioPool class.
vault.<name>.pool.name: Storage pool name. Defaults to the vault name.
vault.<name>.pool.path: Data path for this storage pool. Defaults to ${path.data}/${name}.

Other StoragePool implementations may accept additional parameters.

Plugins may also read vault-level configuration to control vault-specific behavior. The DefaultPermissions feature for example controls the permissions defined on newly created archives and can be configured differently for each vault.

Realms

Realms manage authentication and authorization in CDStar. For a simple setup with only a handful of users, you usually only need a single 'default' realm (e.g. StaticRealm) with everything configured within the same config file. More complex scenarios (e.g. LDAP, JWT or SAML auth) are supported via specialized implementations of the Realm interface (e.g. StaticRealm, JWTRealm or LdapRealm) and can be combined in many ways.

realm.<name>.class: Realm implementation to use. Either a simple class name or a fully qualified java class. (required)
realm.<name>.<field>: Additional realm configuration.

See Realms for a list of available realm implementations and their configuration options.

Warning

If no realm is configured, cdstar adds an 'admin' user with a randomly generated password to the implicit 'system' realm. The password is logged to the console on startup and changes every restart.

Tip	Realms are no different from plugins. They are only configured in a separate `realm.*` name-space to avoid accidental misconfiguration.

Plugins and Extensions

CDSTAR can be extended with custom implementations for event listeners, storage pools, long-term storage adapters and many other interfaces. These can be referenced by name, simple class-name or fully qualified java class name.

plugin.<name>.class: Plugin to load. Either a name, java class name or a fully qualified java class path.
plugin.<name>.<field>: Additional plugin configuration.

Example

plugin.ui:
   class: UIBlueprint
plugin.bagit:
   class: de.gwdg.cdstar.runtime.lts.bagit.BagitTarget
   path: ${path.home}/bagit/

Instance Info

You can publish contact information or other metadata about your instance using the info.* namespace. All information below this namespace is publicly accessible. Values can only be strings, but more complex technical information can be stored as string-encoded JSON documents and later parsed by the client.

Clients should only display fields matching public.* to humans by default. Field names below this prefix may be interpreted as Dublin Core terms and used to display contact or other information to humans.

Any other field not under the public. namespace can be used however you see fit. Client software may use those fields for auto-discovery or configuration purposes. The cdstar explorer client for example could check the cdstar.explorer.v3. namespace for hints on how to configure itself. Check out the client documentation for supported fields.

API Basics

The cdstar HTTP API is the primary method for accessing CDStar instances. Requests are made via HTTP to one of the documented API Endpoints and responses are returned mostly as JSON documents for easy consumption by scripts or client software.

The current stable HTTP API is reachable under the /v3 path on a cdstar server. Other APIs (e.g. legacy-v2, dariah or S3) may be available under different paths on the same server, but these are not part of this chapter.

Basics

The cdstar HTTP API follows RESTful principles. The core concepts are described here. You can skip this section if you are already familiar with HTTP and REST.

HTTP Methods

CDStar API Endpoints make use of the following standard HTTP request methods:

Table 2. Standard HTTP methods
Method	Description
GET	Receive a resource or sub-resource. This is a read-only operation and never changes the state of the resource or other resources.
HEAD	Same as `GET`, but does not return a response body. This can be used as a light-weight alternative to `GET` requests if only the status code or header values are of interest.
POST	Create or update a resource, or perform a modifying server-side operation.
PUT	Create or replace a resource with the content of the request.
DELETE	Remove a resource.

HTTP Method override

Some proxies restrict or lack support for certain HTTP methods, such as DELETE. In this case, a client may send a POST request with a non-standard X-HTTP-Method-Override header instead. The value of this header is used as a server-side override for the actual HTTP method.
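
On the client side, the override boils down to swapping the method and adding one header. The helper below is a sketch with a hypothetical name; only the X-HTTP-Method-Override header itself is taken from the description above.

```python
def with_method_override(method, headers=None):
    """Tunnel an HTTP method through POST using X-HTTP-Method-Override.

    Returns the (method, headers) pair that should actually be sent.
    The caller's header dictionary is copied, not modified.
    """
    tunneled = dict(headers or {})
    tunneled["X-HTTP-Method-Override"] = method.upper()
    return "POST", tunneled


method, headers = with_method_override("delete")
print(method, headers)
# POST {'X-HTTP-Method-Override': 'DELETE'}
```

With the requests library, the pair can be passed on as requests.request(method, url, headers=headers).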

HTTP Response Codes

Each of the API Endpoints defines a number of possible HTTP response status codes and their meaning. The following list summarizes all status codes used by this API and provides a general description.

Code	Reason	Description
200	OK	Request completed successfully. The response contains the requested resource.
201	Created	Resource created successfully. The location of the newly created resource can be found in the response `Location` header.
304	Not Modified	The requested resource has not changed since the client last requested it, given `If-Modified-Since`, `If-None-Match` or other conditional request headers were supplied.
400	Bad Request	The request violates the HTTP protocol or this API specification. A detailed error description is contained within the response.
401	Unauthorized	The requested resource requires Authentication.
403	Forbidden	The client is authenticated, but not authorized to access the requested resource or perform the requested operation.
404	Not Found	The requested resource does not exist, or the client is not allowed to know if it exists or not.
409	Conflict	The request could not be completed due to a conflict with the current state of the target resource. This code is used in situations where the user might be able to resolve the conflict and resubmit the request.
423	Locked	The requested resource is currently not available and additional steps are required to make it available again.
500	Internal Server Error	An error occurred on server side that cannot be fixed by the client. Try again later.
501	Not Implemented	The requested functionality is part of this API, but not implemented by the service.
503	Service Unavailable	The server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay.
507	Insufficient Storage	Storage or quota not sufficient to perform this operation.

Caution

Please note that some APIs may return 404 Not Found instead of 403 Forbidden or 401 Unauthorized if the client has insufficient permissions to access a resource. This is to prevent leakage of information to unauthorized users (e.g. the existence of a private archive or a file within an archive).
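
Client code often reduces these status codes to a handful of reactions. The grouping below is one possible reading of the descriptions above, not an official recommendation; the category names are made up for this sketch.

```python
def classify_status(code):
    """Map a status code from the table above to a coarse client-side action."""
    if code in (200, 201, 304):
        return "ok"
    if code in (401, 403):
        return "check-credentials"
    if code == 404:
        # May also hide a permission problem, see the caution above.
        return "missing-or-hidden"
    if code in (409, 423):
        # The user may be able to resolve the situation and resubmit.
        return "resolve-and-resubmit"
    if code in (500, 503):
        return "retry-later"
    # 400, 501, 507 and anything not listed: repeating the same request
    # unchanged is unlikely to help.
    return "give-up"
```

Treating 404 as its own category keeps the caller from concluding that a resource does not exist when it may only be hidden from the current user.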

Parameter types

API Endpoints may accept request parameters of various types, either via the query string part of the request URL, or as fields within a multipart/form-data formatted POST request, or both. In any case, each parameter is associated with a value type and interpreted according to the following table:

Table 3. Parameter Types
Name	Description
boolean	Either `true` or `false` (case-insensitive). If this parameter is present, but has no value (or an empty string value), it is considered `true`. The boolean parameters `param`, `param=` and `param=true` all evaluate to `true`.
int	A signed integer number between `-2147483648` and `2147483647`.
long	A 64-bit signed integer number between `-9223372036854775808` and `9223372036854775807`.
double	A decimal number in a format parseable by the Java `Double.parseDouble(String)` method. Examples: `-2.4`, `.23`, `1.0e-9`, `NaN`.
string	An arbitrary `utf-8` encoded string value.
enum	A value out of a predefined set of possible values. The valid values and their meanings are listed in the parameter description.
list({type})	This parameter accepts multiple values of the enclosing type. Clients may repeat this parameter once for each value. Some parameters may also accept a comma separated list.
file	This parameter type is only supported as part of a `multipart/form-data` formatted `POST` request and refers to a file upload as it would result from a `<input type="file">` HTML form element. The multipart part must define a `Content-Disposition` header with a `filename` property in order to be recognized as a file upload.
glob	A file-name matching glob pattern. See Glob syntax.
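
The boolean rule is the least obvious entry in Table 3, so here is a short sketch of how a parser could apply it. parse_boolean is a hypothetical helper; raising ValueError for anything other than true or false is an assumption, since the table does not say how invalid values are handled.

```python
def parse_boolean(value):
    """Interpret a boolean request parameter.

    `value` is None when the parameter is present without '=' and '' when it
    is present with an empty value; both count as true.
    """
    if value is None or value == "":
        return True
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError("not a boolean: %r" % (value,))
```

This covers the three spellings param, param= and param=true from the table, as well as case variations such as TRUE or False.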

Glob syntax

Glob patterns are a simple way to filter or match file-names within an archive against a specific pattern. There is no real standard for glob patterns and existing implementations differ slightly. This is why CDStar implements its own subset of the most commonly used rules:

Table 4. Glob Syntax
Pattern	Description
`?`	Matches a single character within a path segment. Does not match the path separator `/` (forward slash).
`*`	Matches any number of characters within a path segment, including an empty string. Does not match the path separator.
`**`	Matches any number of characters, including the path separator.

If the whole pattern starts with the path separator / (forward slash), then the entire path is matched against the pattern. Otherwise, a partial match at the end of the path is sufficient. The pattern *.pdf for example would return all PDF files within an archive, but /*.pdf would only return PDF files located directly within the root folder.

As mentioned above, single wildcards only match within a path segment, which means both ? and * do not expand across path separators (/). The pattern docs/*.pdf would find /docs/file.pdf but not /docs/subfolder/file.pdf. Use two adjacent asterisks (e.g. docs/**.pdf) to include subfolders in your search.

Table 5. Glob Patterns as Regular Expressions
Glob Pattern	Regular Expression	Matches	Does not match
`*.pdf`	`[^/]*\.pdf$`	/file.pdf, /folder/subfolder/file.pdf	/file.tex
`/*.pdf`	`^/[^/]*\.pdf$`	/file.pdf	/file.tex, /folder/subfolder/file.pdf
`/folder/**.pdf`	`^/folder/.*\.pdf$`	/folder/subfolder/file.pdf	/file.pdf, /file.tex
`/201?/**.csv`	`^/201[^/]/.*\.csv$`	/2016/report.csv, /2017/draft/report.csv	/2007/report.csv
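
The translation shown in Tables 4 and 5 can be reproduced with a short function, which is handy for testing patterns on the client before sending them. This is a sketch derived from the two tables only; the server-side implementation may differ in edge cases that the tables do not cover.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob pattern into a regular expression (see Tables 4 and 5)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # any characters, including '/'
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # any characters within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # exactly one character, not '/'
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    # A leading '/' anchors the pattern at the start of the path;
    # otherwise a match at the end of the path is sufficient.
    anchor = "^" if pattern.startswith("/") else ""
    return anchor + "".join(parts) + "$"


def glob_matches(pattern, path):
    """True if `path` matches the glob `pattern`."""
    return re.search(glob_to_regex(pattern), path) is not None
```

For example, glob_matches('docs/*.pdf', '/docs/file.pdf') is true, while the same pattern does not match '/docs/subfolder/file.pdf'.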

Authentication

CDStar can be configured with one or more authentication realms, implementing various ways of authenticating and authorizing client requests against the service. From the HTTP API point of view, there are mostly two ways to authenticate:

Password Authentication

HTTP Basic Authentication is a stateless and simple authentication scheme most suitable for scripting or simple client applications. Username and password are transmitted with each request in cleartext, so this scheme should NOT be used over unencrypted connections.

$ curl -u username https://cdstar.gwdg.de/v3/...

Some realms may require a fully qualified username in the form of username@realm, but most realms also accept unqualified logins. If the username itself contains an @, then it MUST be qualified to avoid ambiguity.
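
HTTP clients usually build the Basic header for you, but it helps to see what is actually sent. The sketch below uses only the standard library; basic_auth_header and its realm parameter are hypothetical conveniences for the qualified username@realm form described above.

```python
import base64


def basic_auth_header(username, password, realm=None):
    """Build an HTTP Basic Authorization header.

    If `realm` is given, the login is sent in the qualified
    'username@realm' form.
    """
    login = username if realm is None else "%s@%s" % (username, realm)
    credentials = ("%s:%s" % (login, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}
```

Base64 is an encoding, not encryption, which is why the credentials count as cleartext and the scheme should only be used over encrypted connections.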

Token Authentication

Token authentication is handled via an Authorization: bearer <token> header. Alternatively, the non-standard X-Token header or token query parameter can be used, but these are not recommended. Acquiring a token is not part of this API and depends heavily on the configured token realm (e.g. JWTRealm). For this example we assume that the client already obtained an access token.

curl -H "Authorization: bearer OAUTH-TOKEN" https://cdstar.gwdg.de/v3/...

In order to embed resources into HTML pages (e.g. images) or provide time-limited download links, a special token with limited access rights can be attached to the URL of GET requests via the token query parameter. As with access tokens, the method to obtain such tokens is not part of this API.

<img src="/v3/myVault/85a031d6e08d/image.png?token=READ-TOKEN" />
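
A link like the one above could be assembled as follows. This is a minimal sketch with a made-up helper name (`embed_url`); it assumes the token has already been obtained elsewhere and only takes care of URL escaping.

```python
from urllib.parse import quote, urlencode


def embed_url(vault: str, archive: str, filename: str, token: str) -> str:
    """Build a relative download URL with a token query parameter."""
    path = "/v3/{}/{}/{}".format(
        quote(vault, safe=""),
        quote(archive, safe=""),
        quote(filename.lstrip("/")),  # slashes inside the file name are kept
    )
    return path + "?" + urlencode({"token": token})
```

For example, `embed_url("myVault", "85a031d6e08d", "image.png", "READ-TOKEN")` produces the URL used in the `<img>` tag above.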

Authorization

CDStar implements a flexible authorization and permission system with fine-grained archive-level access control. The permission system is designed to be simple for the common case, but still powerful enough to support advanced requirements and responsibility models (e.g. groups and roles across multiple realms).

Note	The permission system may look complex on first glance, but remember that you only need a subset of this functionality for most common scenarios.

The core concept can be summarized as follows: 'Permissions' are granted to 'subjects' and affect a specific 'resource'. Subjects may be individual 'users' or 'groups' of users. A resource may be a single archive, a vault or the entire storage service. Subjects (both users and groups) are organized in 'realms'. A simple setup only requires a single realm, but multi-tenancy instances can use realms to separate different authorities.

Subjects and Realms

Subjects are encoded as strings and matched against the current user context using the following subject qualifier syntax:

Table 6. Subject Qualifier
Subject Match	Description
`$any`	Special subject that matches any user, authenticated or not.
`$user`	Special subject that matches authenticated users.
`$owner`	Special subject that matches the current owner of the affected resource. This is implemented for archive resources and matches against the `owner` field of an archive.
`@{group}`	Subjects starting with `@` are interpreted as group names. They match if the current user is a member of that group. Example: `@admins`, `@customers@realm`
`{user}`	Subjects that do not match any of the patterns above are tested against the identifier of the currently logged-in user. Example: `bob`, `alice@realm`

Fully qualified subjects

If multiple realms are configured, then group and user names should be qualified with a realm name to avoid naming conflicts between realms. Unqualified names are still allowed, but they will match against any realm with a matching user or group.

Fully qualified names have the form name@realm. For example, alice from the ldap realm would be alice@ldap. Only the last occurrence of the @ character is recognized, so identifiers with @ in them (e.g. email addresses) are allowed. In fact, if the local part of a subject identifier contains an @, then the subject MUST be qualified with a realm to avoid ambiguity.
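
The qualifier rules can be illustrated with a small parser. This Python sketch is not taken from the cdstar code base; the names `Subject` and `parse_subject` are hypothetical, and the sketch only splits a subject string into its parts without checking it against any realm.

```python
from typing import NamedTuple, Optional


class Subject(NamedTuple):
    kind: str              # "special", "group" or "user"
    name: str
    realm: Optional[str]   # None for unqualified subjects


SPECIAL_SUBJECTS = {"$any", "$user", "$owner"}


def parse_subject(text: str) -> Subject:
    """Split a subject string into kind, name and optional realm."""
    if text in SPECIAL_SUBJECTS:
        return Subject("special", text, None)
    kind = "user"
    if text.startswith("@"):
        kind, text = "group", text[1:]
    # Only the last '@' separates the name from the realm.
    name, separator, realm = text.rpartition("@")
    if not separator:
        return Subject(kind, text, None)
    return Subject(kind, name, realm)
```

Parsing `bob@example.com@ldap` yields the name `bob@example.com` in the realm `ldap`. Without the realm suffix, the same rule would treat `example.com` as the realm, which is why identifiers containing an `@` must always be qualified.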

Vault Permissions

Permissions regarding a specific vault. If assigned globally, they have the form vault:{vaultName}:{permissionName}.

Table 7. Vault Permissions
Name	Description
`read`	Open a vault. This is not required for `public` vaults, as these are visible and readable to anyone.
`create`	Create new archives within a vault.
`list`	List the archive IDs in a vault. Note that this allows a user to check if an archive exists independently of archive-level permissions.

Archive Permissions

Archives are protected by an access control list (ACL) which grants permissions to specific subjects (see Subjects and Realms). If assigned globally, they have the form archive:{vaultName}:{archiveId}:{permissionName}.

Note	Archive permissions are very fine-grained and most actions require more than one permission. For example, in order to retrieve a file from an archive, both `read_files` and `load` permissions are required. In most cases it is easier to assign Archive Permission Sets instead.

Table 8. Archive Permissions
Name	Description
`load`	Check if an archive exists and read basic attributes (e.g. last-modified or number of files).
`delete`	Delete an archive and its history (destructive operation).
`read_acl`	Read the access control list (ACL).
`change_acl`	Grant or revoke permissions by modifying the ACL.
`change_owner`	Change the owner.
`read_meta`	Read meta-data attributes.
`change_meta`	Add, remove or replace meta-data attributes.
`list_files`	List files and their attributes (e.g. name, size, type, hash).
`read_files`	Read file content.
`change_files`	Create, modify or remove files.
`trim`	Explicitly compress or clean up an archive.

Archive Permission Sets

Archive permissions are very fine-grained and most actions require more than one permission. For example, a user with only the read_files permission on an archive would not be able to read any files, because the load permission is also required to load the archive in the first place. To simplify access control for common use-cases, permission sets were introduced. Each set bundles a number of permissions that are usually granted together, and can be assigned just like normal permissions.

Permission sets have upper-case names to distinguish them from normal permissions. The following matrix shows all pre-defined permission sets and their corresponding permissions.

Table 9. Archive Permission Set Matrix
Permission/set	LIST	READ	WRITE	OWNER	MANAGE	ADMIN
load	yes	yes	yes	yes	yes	yes
delete				yes		yes
read_acl				yes	yes	yes
change_acl				yes	yes	yes
change_owner					yes	yes
read_meta		yes	yes	yes		yes
change_meta			yes	yes		yes
list_files	yes	yes	yes	yes	yes	yes
read_files		yes	yes	yes		yes
change_files			yes	yes		yes
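
To make the matrix easier to work with in client code, it can be written down as a lookup table. The following Python sketch copies the sets from Table 9 and expands a comma-separated permission list the way the text describes (sets are exploded, duplicates removed). The names `PERMISSION_SETS` and `expand_permissions` are hypothetical, and the server performs its own normalization regardless of what a client does.

```python
# Permission sets as listed in the matrix above.
PERMISSION_SETS = {
    "LIST": {"load", "list_files"},
    "READ": {"load", "read_meta", "list_files", "read_files"},
    "WRITE": {"load", "read_meta", "change_meta", "list_files",
              "read_files", "change_files"},
    "OWNER": {"load", "delete", "read_acl", "change_acl", "read_meta",
              "change_meta", "list_files", "read_files", "change_files"},
    "MANAGE": {"load", "read_acl", "change_acl", "change_owner", "list_files"},
    "ADMIN": {"load", "delete", "read_acl", "change_acl", "change_owner",
              "read_meta", "change_meta", "list_files", "read_files",
              "change_files"},
}

# Individual permissions from Table 8. `trim` is not part of any set
# in the matrix, so it can only be granted explicitly in this sketch.
PERMISSIONS = PERMISSION_SETS["ADMIN"] | {"trim"}


def expand_permissions(value: str) -> list:
    """Expand e.g. "READ,change_meta" into a sorted list of permissions."""
    result = set()
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        if token in PERMISSION_SETS:
            result |= PERMISSION_SETS[token]
        elif token in PERMISSIONS:
            result.add(token)
        else:
            raise ValueError("unknown permission or set: " + token)
    return sorted(result)
```

Expanding `READ,change_meta` yields `change_meta`, `list_files`, `load`, `read_files` and `read_meta`.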

When to use MANAGE

The MANAGE set is intended for management and reporting jobs. These are usually only interested in the meta-data of an archive, not the content. The set therefore inherits LIST instead of READ or even WRITE to protect user data by default. While clients with this permission set would be able to grant more permissions to themselves, these changes would show up in audit logs and be accountable.

When to use OWNER

Vaults are usually configured to grant OWNER permissions to the $owner subject for new archives automatically. This allows the archive creator to work with the newly created archive and perform most actions, with the notable exception of changing the owner. Giving archives away is usually a task reserved for higher privilege accounts. This permission set is not limited or otherwise tied to the $owner subject, though. It can be given to other subjects, or revoked from the owner. Revoking permissions from the owner is a common pattern to make archives read-only after publishing.

Note	READ, WRITE and MANAGE resemble the permissions defined in cdstar version 2.

Transaction Management

CDStar focuses on data safety and consistency. All transactions are atomic, consistent, isolated and durable by default (ACID properties). In short, this guarantees that transactions either succeed or fail completely ("all or nothing"), you will never see inconsistent state (e.g. half-committed changes), transactions won’t overlap or interfere with each other (isolation), and changes are persisted to disk before you get an OK back (durability).

Tip

ACID properties should be a core requirement for any kind of reliable storage service, but they are actually quite hard to find outside of traditional databases. Most modern web-based storage services (e.g. Amazon S3, couchdb, mongodb, most NoSQL databases) only provide "eventual consistency" or do not guarantee atomicity for operations affecting more than a single item. This makes it very hard or even impossible to implement certain workflows against these APIs in a reliable way, resulting in 'lost updates' or other consistency problems.

Each call to an API endpoint implicitly creates and commits a transaction by default. If a single operation is not enough though, you can also create an explicit transaction, issue multiple API calls, and then commit or roll back all changes as a single atomic transaction. The non-standard X-Transaction header is used to associate HTTP calls with a running transaction.

$ curl -XPOST /v3/_tx
201 CREATED
{ "id": "d2ee7d6034e3", ... }

$ curl -H 'X-Transaction: d2ee7d6034e3' ...
...

$ curl -XPOST /v3/_tx/d2ee7d6034e3
204 OK

The results of these HTTP calls are not visible to other transactions until they are committed, and you won’t see any changes made by other users while your transaction is active, either. This is called 'snapshot isolation' and works as if each transaction operates on a snapshot of the entire database from the exact moment the transaction was started.
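
The begin, use, commit sequence maps naturally onto a context manager. The Python sketch below is only an outline: `request(method, path, headers)` is a hypothetical callable standing in for whatever HTTP client is used, and it is assumed to return the decoded JSON body of the response. The endpoints are the ones from the example above.

```python
from contextlib import contextmanager


@contextmanager
def transaction(request):
    """Run a block of API calls inside one explicit transaction.

    `request(method, path, headers)` is a hypothetical transport function
    that performs the HTTP call and returns the decoded JSON response.
    """
    tx = request("POST", "/v3/_tx/", {})           # begin transaction
    headers = {"X-Transaction": tx["id"]}
    try:
        yield headers                              # caller adds these to each call
    except Exception:
        request("DELETE", "/v3/_tx/" + tx["id"], {})   # roll back on failure
        raise
    else:
        request("POST", "/v3/_tx/" + tx["id"], {})     # commit on success
```

A caller would write `with transaction(request) as headers:` and pass `headers` along with every call inside the block. This sketch rolls back on any exception that escapes the block; a client that wants to recover from individual failures has to catch them inside the block instead.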

Error handling

Recoverable errors during an explicit transaction do not trigger a rollback. On one hand, this allows clients to recover from errors without losing too much progress. On the other hand, clients using explicit transactions MUST handle errors properly. Individual operations may fail and still have partial effects. For example, if a file upload fails mid-request, the client should either repeat or resume the failed upload. The client MUST make sure the transaction is in a clean state before committing.

Conflict resolution

Update conflicts (multiple transactions updating the same archive at the same time) are not resolved automatically, since CDStar cannot possibly know how to merge multiple changes into a consistent result. In this unfortunate case, the transaction committed first will succeed and all other transactions writing to the same archive will fail as soon as a commit is tried.

Read-conflicts are allowed, though. If you only read from an archive and do not change it, and a different transaction changes the archive in the meantime and commits before you, your transaction won’t fail. If you require a higher level of isolation (called 'serializability' in database theory) you can enable it via the isolation=full parameter when creating a new transaction.

Read-only transactions

Transaction management is expensive. Some transaction information must survive even a fatal server crash to allow reliable and automatic crash recovery. If you only need to 'read' from multiple archives in an isolated way, you can start the transaction with readonly=true and save a lot of server-side house-keeping.

Transaction Timeout

Explicit transactions expire after some time of inactivity. They never expire while an HTTP call is still in progress, and will extend their lifetime automatically after each HTTP call. You won’t have to worry about that in most cases. If you need a transaction to survive more than a couple of seconds of inactivity (e.g. while waiting for user input), you can specify a higher timeout when creating a transaction, or issue cheap HTTP calls (e.g. Renew Transaction) from time to time to prevent transactions from dying. Expired transactions are rolled back automatically.

API Endpoints

This chapter lists and describes all web service endpoints defined by the standard CDStar HTTP API. Requests are routed to the appropriate endpoint based on their HTTP method, content type and URI path. Some endpoints also require certain query parameters to be present. Path parameters (variable parts of the URL path) are marked with curly brackets.

Table 10. HTTP Endpoints: Overview
Title	Method	URI Path
Instance APIs
Service Info	`GET`	`/v3/`
Service Health	`GET`	`/v3/_health`
Vaults and Search
List Vaults	`GET`	`/v3/`
Get Vault Info	`GET`	`/v3/{vault}`
Search in Vault	`GET`	`/v3/{vault}?q`
List all Archives in a Vault	`GET`	`/v3/{vault}?scroll`
Archives
Create Archive	`POST`	`/v3/{vault}/`
Get Archive Info	`GET`	`/v3/{vault}/{archive}`
Export Archive	`GET`	`/v3/{vault}/{archive}?export`
Update Archive	`POST`	`/v3/{vault}/{archive}`
Delete Archive	`DELETE`	`/v3/{vault}/{archive}`
Files
List files	`GET`	`/v3/{vault}/{archive}?files`
Download file	`GET`	`/v3/{vault}/{archive}/{filename}`
Get file info	`GET`	`/v3/{vault}/{archive}/{filename}?info`
Upload file	`PUT`	`/v3/{vault}/{archive}/{filename}`
Resume file upload	`PATCH`	`/v3/{vault}/{archive}/{filename}`
Delete file	`DELETE`	`/v3/{vault}/{archive}/{filename}`
Metadata
Get Archive Metadata	`GET`	`/v3/{vault}/{archive}?meta`
Set Archive Metadata	`PUT`	`/v3/{vault}/{archive}?meta`
Get File Metadata	`GET`	`/v3/{vault}/{archive}/{file}?meta`
Set File Metadata	`PUT`	`/v3/{vault}/{archive}/{file}?meta`
Access Control
Get Archive ACL	`GET`	`/v3/{vault}/{archive}?acl`
Set Archive ACL	`PUT`	`/v3/{vault}/{archive}?acl`
Data Import
Import from ZIP/TAR	`POST`	`/v3/{vault}/`
Update from ZIP/TAR	`POST`	`/v3/{vault}/{archive}`
Snapshots
Create Snapshot	`POST`	`/v3/{vault}/{archive}?snapshots`
Delete Snapshot	`DELETE`	`/v3/{vault}/{archive}@{snapshot}`
List Snapshots	`GET`	`/v3/{vault}/{archive}?snapshots`
Transactions
Begin Transaction	`POST`	`/v3/_tx/`
Get Transaction Info	`GET`	`/v3/_tx/{txid}`
Commit Transaction	`POST`	`/v3/_tx/{txid}`
Renew Transaction	`POST`	`/v3/_tx/{txid}?renew`
Rollback Transaction	`DELETE`	`/v3/_tx/{txid}`

Instance APIs

APIs to access instance-level functionality like metrics, health, capabilities and more. This is also the entry point for most plugins.

Service Info

GET /v3/ HTTP/1.1

Get basic information about the cdstar instance as well as a list of all vaults accessible by the current user.

Table 11. Response Codes
Status	Response	Description
200	ServiceInfo	No description

Service Health

GET /v3/_health HTTP/1.1

Warning

This endpoint is marked as unstable and is subject to change.

Return health and performance metrics about the service.

Table 12. Query Parameters
Name	Type	Description
with	list(enum)	Include additional information in the response. `metrics`: Include detailed metrics (named numerical values) in a `metrics` sub-object. `health`: Include detailed health information (named checks) in a `health` sub-object.

Table 13. Response Codes
Status	Response	Description
200	ServiceHealthInfo	No description

Vaults and Search

List and access vaults, search or enumerate archives within a vault.

List Vaults

GET /v3/ HTTP/1.1

List vaults accessible by the current user. This is the same as Service Info.

Get Vault Info

GET /v3/{vault} HTTP/1.1

Get information about a vault.

Table 14. Response Codes
Status	Response	Description
200	VaultInfo	No description

Search in Vault

GET /v3/{vault}?q HTTP/1.1

Perform a search over all archives and files within a vault using the configured search backend. Only results that are visible to the currently logged in user are returned.

API Changes

Changed in v3.1: Added fields parameter.

Table 15. Query Parameters
Name	Type	Description
q	string	Search query using the lucene query syntax or an alternative query syntax supported by the backing search index. Multiple plain search terms are usually `OR` linked and optional by default, but this may also depend on the search backend used. Example: `Bananas` or `modified:[2017-01-01 TO 2017-12-31] AND dcTitle:"Master Thesis"`
order	enum	Order results by `score`, `modified`, `id` or any of the fields supported by the search backend. Prefix the field name with a minus character to reverse the order. As an example, the default order `-score` will return results based on their relevance, ordered from highest relevance to lowest. Multiple order fields can be specified as a comma separated list. Default: `"-score"`
limit	int(0-max)	Limit the number of results. Values are automatically capped to an allowed maximum. Default: `25`
fields	list(string)	Request additional fields for each hit. Search backends SHOULD support requesting index document fields by name (e.g. `dcTitle` or `meta.dc:title`) and return the corresponding value(s) for each hit. Unknown or unsupported fields should be silently ignored. Search backends MAY support more complex field queries via a backend specific syntax. For example, requesting `highlight(content)` may return the relevant parts of the `content` field with the matched sections wrapped in HTML `<em>` tags. Requesting `meta.dc:` may return all fields starting with `meta.dc:` as a single nested object. We discourage inventing a full mini-language here, though. Keep it simple. The SearchHit data type contains a `fields` object that maps field queries to their value. Multiple simple fields can be requested as a comma separated list. Example: `fields=dcTitle,dcAuthor`
scroll	string	When a search query matched more than `limit` results, you can use the `scroll` value from the last successful SearchResults response to skip all results already returned and fetch the next page of results from the search backend. This works similarly to the 'search_after' feature in elasticsearch or the 'cursorMark' feature in solr. The 'scroll' value in a SearchResults response is a stateless live cursor pointing to the last element returned in a result page. When repeating a search with a valid `scroll` cursor, all results that would be ordered lower or equal to this element are skipped. Default: `"none"`
groups	list	Claim membership of additional user groups. This is useful if the realm of the user does not return all groups the user belongs to, and some search hits are not visible because of that. Each claim is checked against the realm, and if successful, hits visible to that group are included in the result.

Table 16. Response Codes
Status	Response	Description
200	SearchResults	No description
501	Error	Search functionality is disabled.
504	Error	Search functionality is enabled, but the search service did not respond in time.
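
Search queries often contain quotes, colons and spaces, so the query string should be escaped properly. The Python sketch below builds a search URL from the parameters in Table 15 using the standard library. The helper name `search_url` is made up for this example, and the sketch does not send the request or interpret the response.

```python
from urllib.parse import quote, urlencode


def search_url(vault, q, order=None, limit=None, fields=None, scroll=None):
    """Build the URL for a search request against a single vault."""
    params = [("q", q)]
    if order:
        params.append(("order", ",".join(order)))    # e.g. ["-modified", "id"]
    if limit is not None:
        params.append(("limit", str(limit)))
    if fields:
        params.append(("fields", ",".join(fields)))  # comma separated list
    if scroll:
        params.append(("scroll", scroll))            # cursor from the last page
    return "/v3/{}?{}".format(quote(vault, safe=""), urlencode(params))
```

To fetch the next page, repeat the same call with `scroll` set to the `scroll` value of the previous SearchResults response.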

List all Archives in a Vault

GET /v3/{vault}?scroll HTTP/1.1

List IDs of archives stored in this vault.

Up to limit IDs are returned per request. IDs are ordered in a stable but otherwise implementation-specific way (usually lexicographical). If the scroll parameter is a non-empty string, then only IDs ordered after the given string are returned. This can be used to scroll through all IDs of a vault in an efficient manner.

By default, this API will return all IDs that were ever created in this vault, including IDs of archives that were removed or are not load-able by the current user. This mode requires list vault permission or the vault to be public.

In strict mode, archive manifests are actually loaded from storage and only IDs of archives that are load-able by the current user are returned. This mode is less efficient, but does not require list permissions on the vault. Use with caution.

This API is NOT transactional and may reflect changes made by other clients as soon as they happen.

Table 17. Query Parameters
Name	Type	Description
scroll	string	Required, but can be empty. Start listing IDs greater than the given string, according to the implementation-defined ordering (usually lexicographical). For pagination, set `scroll` to the ID of the last result of the previous page to fetch the next page.
limit	int(0-max)	Limit the number of results. Values are automatically capped to an allowed maximum. Default: `25`
strict	boolean	If true, only IDs for archives that are actually load-able by the current user are returned.

Table 18. Response Codes
Status	Response	Description
200	ScrollResults	No description
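
The pagination scheme described above boils down to a simple loop. In the Python sketch below, `fetch_page(scroll, limit)` is a hypothetical callable that performs the `GET /v3/{vault}?scroll=...&limit=...` request and returns the archive IDs of one page as a list. How the IDs are extracted from the ScrollResults document is left to that callable, since the layout of ScrollResults is not covered in this section.

```python
def iter_archive_ids(fetch_page, limit=25):
    """Yield all archive IDs of a vault, one page at a time.

    `fetch_page(scroll, limit)` is a hypothetical function that returns
    the list of IDs ordered after `scroll` (at most `limit` entries).
    """
    scroll = ""                    # required parameter, empty for the first page
    while True:
        ids = fetch_page(scroll, limit)
        if not ids:
            return                 # an empty page means nothing is left
        yield from ids
        scroll = ids[-1]           # continue after the last ID of this page
```

The loop stops on an empty page rather than on a short page, because the server may cap `limit` to a smaller maximum than requested.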

Archives

Create Archive

POST /v3/{vault}/ HTTP/1.1

Table 19. Response Codes
Status	Response	Description
201	ArchiveCreated	Archive created

Get Archive Info

GET /v3/{vault}/{archive} HTTP/1.1

Table 20. Query Parameters
Name	Type	Description
with	list(enum)	Include additional information in the response. This can be used as a shortcut for individual requests to Get Archive ACL, List files, Get Archive Metadata or List Snapshots. If access restrictions do not allow reading a subresource, the flag is silently ignored. `acl`: Include an `acl` field with an AclInfo in the response. `files`: Include a `files` field with a list of FileInfo in the response. This is implicitly enabled if any of the file listing parameters are present. `meta`: Include a `meta` field with a MetaAttributes in the response. If files are listed, their FileInfo will also contain an additional `meta` field. `snapshots`: Include a `snapshots` field with a list of available snapshots for this archive (SnapshotInfo).
include	list(glob)	Only list files that match any of these glob patterns. Implies `with=files`.
exclude	list(glob)	Only list files that do not match any of these glob patterns. Implies `with=files`.
order	enum	Order files by `name`, `type`, `size`, `created`, `modified`, `hash` or `id`. The `id` ordering is useful to get a stable ordering that is not affected by name changes. Implies `with=files`. Default: `"name"`
reverse	boolean	Return files in reverse order. Implies `with=files`.
limit	int(0-max)	Limit the number of files listed. Values are automatically capped to an allowed maximum. Implies `with=files`. Default: `25`
offset	int(0-inf)	Skip this many files from the listing. Can be used for pagination of archives with more than `limit` files. Implies `with=files`.

Table 21. Response Codes
Status	Response	Description
200	ArchiveInfo	Archive found
400	Error	Invalid parameters
404	Error	Archive not found or not readable by current user

Export Archive

GET /v3/{vault}/{archive}?export HTTP/1.1

Table 22. Query Parameters
Name	Type	Description
export	list(enum)	Required parameter that specifies the export format. Currently only `zip` is supported. `zip`: Export files as a zip archive.
include	list(glob)	Only export files that match any of these glob patterns.
exclude	list(glob)	Only export files that do not match any of these glob patterns.

Table 23. Response Codes
Status	Response	Description
200	bytes	The export format and `Content-Type` depends on the `export` query parameter.
404	Error	Archive not found or not readable by current user

Update Archive

POST /v3/{vault}/{archive} HTTP/1.1

Table 24. Form Parameters
Name	Type	Description
{filename}	file	Upload a new file (`multipart/form-data` only). If the filename ends with a slash, then the original (client-side) name of the file is appended. If the file type is either `application/x-autodetect` or missing, cdstar will try to guess the correct content-type from the file name extension and default to `application/octet-stream` if that fails. Example: `<input type="file" name="/folder/" />` `$ curl --form /filename.txt=@source.txt` `$ curl --form /filename.txt=@source.txt;type=text/plain` `$ curl --form /folder/=@source.txt;name=filename.txt`
copy:{filename}	string	Create a new file by copying the content of an existing file from the same archive. Example: `$ curl --data copy:/target.txt=/source.txt`
clone:{filename}	string	Create a new file by copying the content and metadata of an existing file from the same archive.
move:{filename}	string	Rename an existing file.
fetch:{filename}	uri	Create a new file by fetching an external resource. If `{filename}` ends in a slash (`/`) then the last path segment of the fetch URL is appended to the file name. Supported URI schemes depend on installed plugins and not all URIs may be allowed. For example, fetching from `http://` URLs may be limited to trusted domains, or disabled completely. Example: `$ curl --data fetch:/bigfile.dat=http://example.com/bigfile.dat`
delete:{filename}	string	Delete a file. The value is ignored. If `{filename}` ends with a slash (`/`), then all files under that directory are removed. Example: `$ curl --data delete:/some/file.txt` `$ curl --data delete:/some/folder/`
type:{filename}	string	Change the content-type of an existing file. The value should follow the `Content-Type` header syntax (e.g. `application/octet-stream`). A special value of `application/x-autodetect` will cause cdstar to try to guess the correct content-type from the file name extension. Example: `$ curl --data type:/some/file.txt=text/plain`
meta:{attr}	list(string)	Set a meta-attribute for the archive. See Metadata for a list of supported `{attr}` names. Example: `$ curl --data meta:dc:creator=Alice`
meta:{attr}:{filename}	list(string)	Set a meta-attribute for a specific file within the archive. `{filename}` must correspond to an existing file. Example: `$ curl --data meta:dc:creator:/thesis.pdf=Alice`
acl:{subject}	list(enum)	Change the list of permissions granted to a `{subject}`. A subject can be an individual, an `@` prefixed group or one of the special subjects `$any`, `$user` or `$owner`. The value should be a comma-separated list of permissions (lowercase) or permission-sets (uppercase). Any permissions previously granted to this exact subject are removed and the effective list of permissions is normalized automatically (sets are exploded, duplicates removed). See Archive Permissions for a list of permission names. Example: `$ curl --data acl:alice@gwdg=READ,change_meta` `$ curl --data acl:@adminGroup=MANAGE` `$ curl --data acl:\$any=READ # be careful to escape $ in a shell`
profile	string	Set the desired storage profile for this archive or snapshot. Profile changes usually trigger background data migration and will take some time to have an effect. See Storage Profiles for details.
owner	string	Change the owner of this archive. This requires `change_owner` permissions, which are not included in the default `OWNER` permission set.
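
Several of these form parameters can be combined into a single request body. The Python sketch below encodes a list of operations as `application/x-www-form-urlencoded` data with the standard library. The helper name `update_body` is made up, the sketch assumes the server decodes percent-encoded keys as usual for form data, and it only covers the form-encoded parameters, not `multipart/form-data` file uploads.

```python
from urllib.parse import urlencode


def update_body(operations):
    """Encode (parameter, value) pairs as a form-encoded request body."""
    return urlencode(list(operations))


# Example operations, taken from the parameter table above.
body = update_body([
    ("copy:/target.txt", "/source.txt"),
    ("type:/some/file.txt", "text/plain"),
    ("meta:dc:creator", "Alice"),
    ("acl:alice@gwdg", "READ,change_meta"),
    ("acl:$any", "READ"),   # no shell escaping needed outside of a shell
])
```

The resulting string would be sent as the body of a `POST` request to the archive URL with a `Content-Type: application/x-www-form-urlencoded` header.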

Delete Archive

DELETE /v3/{vault}/{archive} HTTP/1.1

Table 25. Response Codes
Status	Response	Description
204	-	Archive removed (no content).

Files

No description

List files

GET /v3/{vault}/{archive}?files HTTP/1.1

GET /v3/{vault}/{archive}@{snapshot}?files HTTP/1.1

List files within an archive or snapshot. This endpoint supports the same parameters as Get Archive Info to filter or paginate the list of files.

Table 26. Response Codes
Status	Response	Description
200	FileList	No description

Download file

GET /v3/{vault}/{archive}/{filename} HTTP/1.1

GET /v3/{vault}/{archive}@{snapshot}/{filename} HTTP/1.1

Note	This endpoint produces: `*/*`

Download a single file from an archive or snapshot.

This endpoint supports ranged requests and conditional headers such as If-(None-)Match, If-(Un)modified-Since, If-Range and Range, as well as HEAD requests. The ETag value is calculated from the file's digest hash, if known.

Highly accessed files in publicly readable archives may be served from a different location (e.g. S3 or CDN). Clients should follow redirects (e.g. 307 Temporary Redirect) according to the HTTP standard.

During explicit transactions, and while a file upload is currently in progress, GET requests will fail with an "IncompleteWrite" error. HEAD requests are allowed, though. The Content-Length header will report the current upload size.

Table 27. Query Parameters
Name	Type	Description
inline	boolean	By default, files are returned with a `Content-Disposition: attachment` header, forcing a download dialog in most browsers. This header can be disabled to allow resources to be embedded in HTML pages or opened directly in a suitable application. Some content-types cannot be inlined for security reasons. This parameter is silently ignored for these types, and the `Content-Disposition: attachment` header is sent regardless.

Table 28. Response Codes
Status	Response	Description
200	bytes	File exists, is readable and its content is returned with this response. The `Content-Type` matches whatever was defined on the file resource.
206	bytes	Same as `200`, but only parts of the file are returned according to the `Range` header in the request.
304	-	File not modified.
307	-	Same as `200`, but the file content is available under a different URL specified in the `Location` header.
409	-	Archive not available. This may happen for archives with a cold storage profile.
412	-	Precondition failed.
416	-	Requested range not satisfiable.

Get file info

GET /v3/{vault}/{archive}/{filename}?info HTTP/1.1

GET /v3/{vault}/{archive}@{snapshot}/{filename}?info HTTP/1.1

Get FileInfo for a single file. For multiple files, List files is usually faster.

Table 29. Query Parameters
Name	Type	Description
with	list(enum)	Return additional information about the file, embedded in the FileInfo document. meta Include MetaAttributes defined on this file.

Table 30. Response Codes
Status	Response	Description
200	FileInfo	No description

Upload file

PUT /v3/{vault}/{archive}/{filename} HTTP/1.1

Note	This endpoint consumes: `*/*`
Note	This endpoint produces: `application/json`

Directly upload a new file to an archive, or overwrite an existing file.

If a Content-Type header is missing or equals application/x-autodetect, then the media type is guessed from the filename extension.

The conditional headers If-Match: * or If-None-Match: * can be used to force update-only or create-only behavior.

Upload errors can only be detected properly if either a Content-Length header is set or Transfer-Encoding: chunked is used. If fewer than the expected number of bytes are transmitted, the file is considered incomplete and the transaction will fail.

During explicit transactions (see Transaction Management), failed uploads will leave the file in an incomplete state. The upload must be repeated or resumed before committing. See Resume file upload for details. Conflicting operations, for example reading the file content or fetching its info, will fail until the file has been completely updated or removed. HEAD requests to the file's URL are allowed, though.

Table 31. Response Codes
Status	Response	Description
200	-	File updated.
201	-	File created.
412	-	Precondition (e.g. `If-Match` or `If-None-Match`) failed.

Resume file upload

PATCH /v3/{vault}/{archive}/{filename} HTTP/1.1

Note	This endpoint consumes: `application/vnd.cdstar.resume`
Note	This endpoint produces: `application/json`

Resume a failed or aborted file upload.

After a failed Upload file request during an explicit transaction (see Transaction Management), the client may choose to resume the upload instead of uploading the entire file again or removing it.

To do so, send a PATCH request with Content-Type: application/vnd.cdstar.resume and a Range header with a single byte range, either bytes=startByte- or bytes=startByte-endByte (see RFC-2616). The startByte index must match the current remote file size, as returned by a HEAD request to the Download file API. The endByte index is optional, but recommended as an additional safeguard. It should match the target file size.

A file is considered complete once the PUT or PATCH request completes without errors. Within a single transaction, failing uploads can be resumed repeatedly until all data is transmitted or the transaction runs into a timeout.

Do not use this API to upload files in small chunks. A successful PUT or PATCH request will compute digests, which is an expensive operation. Always try to upload the entire file in one go, if possible.
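
The headers for a resume request follow directly from the description above. The Python sketch below only builds them; the helper name `resume_headers` is made up, and `remote_size` is assumed to come from the `Content-Length` of a previous HEAD request. The optional `end_byte` is passed through unchanged, so the caller decides which value to use.

```python
def resume_headers(remote_size, end_byte=None):
    """Build the request headers for resuming an interrupted upload."""
    if remote_size < 0:
        raise ValueError("remote_size must not be negative")
    if end_byte is None:
        byte_range = "bytes={}-".format(remote_size)
    else:
        if end_byte < remote_size:
            raise ValueError("end_byte must not be smaller than remote_size")
        byte_range = "bytes={}-{}".format(remote_size, end_byte)
    return {
        "Content-Type": "application/vnd.cdstar.resume",
        "Range": byte_range,
    }
```

The request body would then contain the remaining bytes of the local file, starting at offset `remote_size`.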

Table 32. Response Codes
Status	Response	Description
200	-	File updated.

Delete file

DELETE /v3/{vault}/{archive}/{filename} HTTP/1.1

Remove a single file from an archive. This requires change_files permissions on the archive.

Table 33. Response Codes
Status	Response	Description
204	-	File removed (no content).

Metadata

Archives and individual files within an archive can be annotated with custom metadata attributes. Both the name and values of an attribute are plain strings, but each attribute can have multiple values. Lists of strings are returned even if only a single value is set.

Attribute names are case-insensitive and limited to letters, digits and the underscore character, and must start with a letter.

Attribute names may be prefixed with a namespace identifier followed by a single colon character (e.g. dc:title for a Dublin Core title attribute). Namespaced attributes are subject to server-side validation and defined in a schema. Custom attributes should be either prefixed with the custom: namespace or no namespace at all.

The value of an attribute is an ordered list of plain strings. Empty strings are allowed, but a list with no values is equal to an undefined attribute.
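
A client can check attribute names before sending them. The Python sketch below encodes the naming rules from this section as a regular expression. It assumes that namespace identifiers follow the same character rules as attribute names, which this section does not state explicitly, and the helper name `split_attribute` is made up. Schema validation of namespaced attributes still happens on the server.

```python
import re

# Letters, digits and underscore, starting with a letter.
_NAME = r"[A-Za-z][A-Za-z0-9_]*"

# Optional namespace prefix followed by a single colon (assumed syntax).
_ATTRIBUTE = re.compile(r"(?:(?P<namespace>{0}):)?(?P<name>{0})".format(_NAME))


def split_attribute(attr):
    """Split e.g. "dc:title" into ("dc", "title"); reject invalid names."""
    match = _ATTRIBUTE.fullmatch(attr)
    if match is None:
        raise ValueError("invalid attribute name: " + repr(attr))
    return match.group("namespace"), match.group("name")
```

For an attribute without a namespace, such as `title`, the namespace part of the result is `None`.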

Get Archive Metadata

GET /v3/{vault}/{archive}?meta HTTP/1.1

GET /v3/{vault}/{archive}@{snapshot}?meta HTTP/1.1

Return metadata attributes for an archive or snapshot. The same information can also be received as part of a Get Archive Info request by using the with=meta switch.

Table 34. Response Codes
Status	Response	Description
200	MetaAttributes	No description

Set Archive Metadata

PUT /v3/{vault}/{archive}?meta HTTP/1.1

Note	This endpoint consumes: `application/json`

Replace the metadata of an archive with a new MetaAttributes document. To clear all attributes, just send an empty document (e.g. {}).

Table 35. Request Body (MetaAttributes)
Field	Type	Description
{schema:attr}	list(string)	A list of string values. The list is ordered and duplicates are allowed.

Table 36. Response Codes
Status	Response	Description
204	-	Metadata updated.
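As an illustration, the request for this endpoint can be assembled with the Python standard library. This is a sketch only: the server address, vault name and archive ID are made-up values, authentication is left out, and `build_set_meta_request` is a hypothetical helper.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v3"  # assumption: address of a local test instance


def build_set_meta_request(vault: str, archive: str, meta: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request that replaces all archive metadata."""
    # Every attribute value is a list of strings, even for a single value.
    for name, values in meta.items():
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise TypeError(f"{name!r} must map to a list of strings")
    return urllib.request.Request(
        url=f"{BASE_URL}/{vault}/{archive}?meta",
        data=json.dumps(meta).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_set_meta_request("myVault", "ab587f42c2570a884", {"dc:title": ["Example"]})
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; a 204 response means success.
```

Because the document replaces all existing attributes, a client that wants to change a single attribute should first fetch the current MetaAttributes, modify them, and send the complete result.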

Get File Metadata

GET /v3/{vault}/{archive}/{file}?meta HTTP/1.1

GET /v3/{vault}/{archive}@{snapshot}/{file}?meta HTTP/1.1

Return metadata attributes for a single file within an archive or snapshot. The same information can also be received as part of a Get file info request by using the with=meta switch.

Table 37. Response Codes
Status	Response	Description
200	MetaAttributes	No description

Set File Metadata

PUT /v3/{vault}/{archive}/{file}?meta HTTP/1.1

Note	This endpoint consumes: `application/json`

Replace the metadata of a file within an archive with a new MetaAttributes document. To clear all attributes, just send an empty document (e.g. {}).

Table 38. Request Body (MetaAttributes)
Field	Type	Description
{schema:attr}	list(string)	A list of string values. The list is ordered and duplicates are allowed.

Table 39. Response Codes
Status	Response	Description
204	-	Metadata updated.

Access Control

The local access control list (ACL) of an archive can be used to grant permissions to individuals or groups. These permissions are checked before any external realm is consulted and stored as part of the archive. New permissions can be granted individually using the Update Archive endpoint, or in bulk via Set Archive ACL. The permissions read_acl or change_acl are required to read or change the access control list of an archive.

Note that the names for subjects (individuals or groups) can and should be qualified with the name of the authentication realm, especially if more than one realm is installed. A subject named alice would match any user with that name, across all authentication sources. Use qualified names (e.g. userName@realmName or @groupName@realmName) to prevent ambiguities.
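The subject naming scheme can be made explicit with a small parser. This is a sketch under the assumption that user and group names do not themselves contain an `@` character; `Subject` and `parse_subject` are hypothetical names, and the special subjects are the ones listed for AclInfo documents.

```python
from typing import NamedTuple, Optional

SPECIAL_SUBJECTS = {"$self", "$any", "$user", "$owner"}


class Subject(NamedTuple):
    kind: str             # "special", "group" or "user"
    name: str
    realm: Optional[str]  # None: unqualified, matches the name in any realm


def parse_subject(subject: str) -> Subject:
    """Split an ACL subject into kind, name and optional realm."""
    if subject in SPECIAL_SUBJECTS:
        return Subject("special", subject, None)
    kind = "group" if subject.startswith("@") else "user"
    body = subject[1:] if kind == "group" else subject
    name, separator, realm = body.partition("@")
    return Subject(kind, name, realm if separator else None)


for example in ("alice", "alice@ldap", "@admins@ldap", "$owner"):
    print(example, "->", parse_subject(example))
```

A `realm` of `None` marks exactly the ambiguous case described above, so a client could use it to warn before granting permissions to an unqualified subject.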

Get Archive ACL

GET /v3/{vault}/{archive}?acl HTTP/1.1

Return the local access control list of this archive as an AclInfo document. The same information can also be received as part of a Get Archive Info request by using the with=acl switch.

Table 40. Query Parameters
Name	Type	Description
acl	enum	`group`: Group permissions (lowercase) into permission-sets (uppercase) when possible. Permissions that do not fit into a complete group are returned individually. `explode`: Return individual permissions and no permission sets. Default: `"group"`

Table 41. Response Codes
Status	Response	Description
200	AclInfo	No description

Set Archive ACL

PUT /v3/{vault}/{archive}?acl HTTP/1.1

Note	This endpoint consumes: `application/json`

Replace all entries of the local access control list with entries from this AclInfo document.

Table 42. Request Body (AclInfo)
Field	Type	Description
{subject}	list(string)	A list of permissions (lowercase) or permission-sets (uppercase) granted to this subject. `{subject}` can be an individual, an `@` prefixed group or one of the special subjects `$self`, `$any`, `$user` or `$owner`.

Table 43. Response Codes
Status	Response	Description
200	-	Archive updated.
400	-	Invalid permission
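Since an invalid permission is rejected with a 400 response, it can help to sanity-check an AclInfo document before sending it. The sketch below only checks the lowercase/uppercase convention, not whether a permission actually exists; `split_grants` and the subjects in the example are hypothetical.

```python
import json


def split_grants(grants):
    """Separate permission sets (uppercase) from individual permissions (lowercase)."""
    sets, permissions = [], []
    for grant in grants:
        if grant.isupper():
            sets.append(grant)
        elif grant.islower():
            permissions.append(grant)
        else:
            raise ValueError(f"mixed-case or empty grant: {grant!r}")
    return sets, permissions


acl = {
    "$owner": ["OWNER"],
    "alice@static": ["READ"],
    "@curators@static": ["READ", "read_acl"],
}
for subject, grants in acl.items():
    split_grants(grants)  # raises ValueError before anything is sent

body = json.dumps(acl).encode("utf-8")  # request body for PUT /v3/{vault}/{archive}?acl
print(body.decode("utf-8"))
```

Keep in mind that the request replaces all existing entries, so subjects missing from the document lose their locally granted permissions.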

Data Import

No description

Import from ZIP/TAR

POST /v3/{vault}/ HTTP/1.1

Note	This endpoint consumes: `application/zip`, `application/x-tar`

Create a new Archive from a ZIP or TAR file.

For compressed TAR files, make sure to provide a suitable Content-Encoding header. Supported algorithms include gz, bzip2, xz, and deflate.

Note that importing compressed ZIP or TAR archives requires a significant amount of work on server-side after the upload completed, which may cause some clients to time-out before a response can be sent. Make sure to increase the read time-outs for your client before uploading large archives.

Table 44. Query Parameters
Name	Type	Description
prefix	string	Import files into this folder. Example: `prefix=/import/`
include	list(glob)	Only import files that match any of these glob patterns. Example: `include=*.pdf`
exclude	list(glob)	Only import files that do not match any of these glob patterns. Example: `exclude=.svn/**`

Table 45. Response Codes
Status	Response	Description
201	-	Archive created.
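The following sketch packs a few files into an in-memory ZIP file and prepares the import request with the Python standard library. The server address and vault name are made up, authentication is omitted, and sending list parameters by repeating the parameter name is an assumption, not something this section specifies.

```python
import io
import urllib.parse
import urllib.request
import zipfile

BASE_URL = "http://localhost:8080/v3"  # assumption: address of a local test instance


def make_zip(files: dict) -> bytes:
    """Pack a {name: bytes} mapping into an in-memory ZIP file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def build_import_request(vault, payload, prefix=None, include=(), exclude=()):
    """Build (but do not send) a POST request that creates an archive from a ZIP."""
    params = {}
    if prefix:
        params["prefix"] = prefix
    if include:
        params["include"] = list(include)
    if exclude:
        params["exclude"] = list(exclude)
    # Assumption: list parameters are sent by repeating the parameter name.
    query = urllib.parse.urlencode(params, doseq=True)
    url = f"{BASE_URL}/{vault}/" + (f"?{query}" if query else "")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/zip"}, method="POST"
    )


payload = make_zip({"report.pdf": b"%PDF-1.4", "notes.txt": b"draft"})
request = build_import_request("myVault", payload, prefix="/import/", include=["*.pdf"])
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request, timeout=600) would send it with a generous time-out.
```

The explicit `timeout` in the last comment reflects the advice above: the server may need a while to unpack large archives before it can respond.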

Update from ZIP/TAR

POST /v3/{vault}/{archive} HTTP/1.1

Note	This endpoint consumes: `application/zip`, `application/x-tar`

Import files from a zip or tar file into an existing archive. See Import from ZIP/TAR for details.

Table 46. Query Parameters
Name	Type	Description
prefix	string	Import files into this folder. Example: `prefix=/import/`
include	list(glob)	Only import files that match any of these glob patterns. Example: `include=*.pdf`
exclude	list(glob)	Only import files that do not match any of these glob patterns. Example: `exclude=.svn/**`

Table 47. Response Codes
Status	Response	Description
200	-	Archive updated.

Snapshots

Archive Snapshots are an efficient way to preserve the current payload (files and metadata) of an archive without actually creating a copy. This can be used to implement versioning or prepare unmodifiable copies for publishing.

The preserved state of a snapshot can be accessed (read-only) just like normal archive state, by appending an @ and the snapshot name to the archive id in the request path. For example, GET /v3/ab587f42c257@v1/data.csv will return a file from archive ab587f42c257 as preserved by snapshot v1. This works for all endpoints documented as supporting snapshots.

Snapshots only preserve the payload of an archive, namely metadata and files. Administrative metadata such as owner or access control lists are not part of a snapshot. Only the profile can be changed on a snapshot via Update Archive. This means that the storage state and availability of a snapshot can differ from that of the archive. See Storage Profiles for details.

Create Snapshot

POST /v3/{vault}/{archive}?snapshots HTTP/1.1

Note	This endpoint consumes: `application/x-www-form-urlencoded`

Create a new snapshot.

Table 48. Form Parameters
Name	Type	Description
name	string	(required) Snapshot name. Must be unique per archive and only contain ASCII letters, digits, dashes or dots (`a-z A-Z 0-9 - .`).

Table 49. Response Codes
Status	Response	Description
201	SnapshotInfo	Snapshot created.
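The name restriction and the `@` path convention for snapshots can be captured in two small helpers. This is a sketch; `snapshot_form` and `snapshot_path` are hypothetical names, and the vault and archive values are examples.

```python
import re
import urllib.parse

# ASCII letters, digits, dashes or dots, at least one character.
_SNAPSHOT_NAME = re.compile(r"[A-Za-z0-9.-]+")


def snapshot_form(name: str) -> bytes:
    """Validate a snapshot name and encode it as a form body for Create Snapshot."""
    if not _SNAPSHOT_NAME.fullmatch(name):
        raise ValueError(f"invalid snapshot name: {name!r}")
    return urllib.parse.urlencode({"name": name}).encode("ascii")


def snapshot_path(vault: str, archive: str, snapshot: str, filename: str = "") -> str:
    """Build the read-only path of a snapshot, or of a file within it."""
    return f"/v3/{vault}/{archive}@{snapshot}" + (f"/{filename}" if filename else "")


print(snapshot_form("v1.0-rc"))
print(snapshot_path("myVault", "ab587f42c257", "v1", "data.csv"))
```

Uniqueness per archive cannot be checked locally, and the name of a deleted snapshot stays reserved, so the server may still reject a name that passes this check.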

Delete Snapshot

DELETE /v3/{vault}/{archive}@{snapshot} HTTP/1.1

Delete a snapshot. This requires delete permissions on the archive and is irreversible. The name of a deleted snapshot cannot be used to create a new snapshot.

Table 50. Response Codes
Status	Response	Description
204	-	Snapshot removed

List Snapshots

GET /v3/{vault}/{archive}?snapshots HTTP/1.1

Get a list of snapshots that exist for this archive, ordered by creation date, then name.

Transactions

Transactions can be started, committed or rolled back explicitly using these endpoints. To learn more about transactions, see Transaction Management.

Begin Transaction

POST /v3/_tx/ HTTP/1.1

Note	This endpoint consumes: `application/x-www-form-urlencoded`

Start a new transaction. See Transaction Management for details.

Table 51. Form Parameters
Name	Type	Description
isolation	enum	Select an isolation level for this transaction. Supported modes are `full` and `snapshot`. Transactions with 'snapshot' isolation work on a consistent snapshot of the entire database from the exact moment the transaction was started and only see their own changes. On a write-write conflict (the same resource modified by two overlapping transactions) only one of the transactions will be able to commit. This protects against 'lost updates' and is suitable for most scenarios. Transactions with 'full' isolation (also called 'serializability isolation') will also fail on write-read conflicts. The transaction can only be committed if none of the affected resources (modified or not) was modified by an overlapping transaction. Default: `"snapshot"`
readonly	boolean	If true, create a read-only transaction. These transactions cannot be committed (only rolled back).
timeout	integer	Timeout (in seconds) after which an unused transaction is automatically rolled back. User supplied timeouts are automatically capped to a server-defined maximum value. Default: `60`

Table 52. Response Codes
Status	Response	Description
201	TransactionInfo	Transaction created successfully.
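The difference between the two isolation levels can be illustrated with a toy model. This is not the server implementation, only a restatement of the conflict rules described for the `isolation` parameter; `can_commit` and its arguments are hypothetical.

```python
def can_commit(isolation, read, written, modified_by_others):
    """Decide whether a transaction may commit under the given isolation level.

    `read` and `written` are the resources this transaction touched;
    `modified_by_others` are resources changed by overlapping transactions
    that committed first.
    """
    if isolation == "snapshot":
        affected = set(written)              # only write-write conflicts matter
    elif isolation == "full":
        affected = set(read) | set(written)  # write-read conflicts matter too
    else:
        raise ValueError(f"unknown isolation level: {isolation!r}")
    return affected.isdisjoint(modified_by_others)


# This transaction read archive "a" and modified archive "b",
# while an overlapping transaction modified "a".
print(can_commit("snapshot", read={"a"}, written={"b"}, modified_by_others={"a"}))
print(can_commit("full", read={"a"}, written={"b"}, modified_by_others={"a"}))
```

In this scenario only the `full` transaction is refused, because it depends on a resource that changed after it was read.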

Get Transaction Info

GET /v3/_tx/{txid} HTTP/1.1

Request information about a running transaction.

Table 53. Response Codes
Status	Response	Description
200	TransactionInfo	Transaction Info
404	Error	Transaction does not exist, expired or is not visible to the current user context.

Commit Transaction

POST /v3/_tx/{txid} HTTP/1.1

Commit a running transaction. All changes made with this transaction ID are persisted and new transactions will be able to see the changes. The commit may fail, in which case no changes will be persisted at all. Partial commits never happen.

Table 54. Response Codes
Status	Response	Description
204	-	Transaction committed successfully.
404	Error	Transaction does not exist, expired or is not visible to the current user context.
409	Error	Transaction could not be committed because of unresolvable conflicts and was rolled back instead.
423	Error	Transaction could not be committed because of locked resources. It may still be possible to commit this transaction, so it is kept open. The client should either issue a rollback, or try again later.
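A client can map these status codes to follow-up actions. The sketch below retries only on 423, as suggested above. `send_commit` is a hypothetical callable that performs the POST and returns the HTTP status code, and the retry count and delay are arbitrary example values.

```python
import time
from enum import Enum


class CommitOutcome(Enum):
    COMMITTED = "committed"      # 204
    GONE = "gone"                # 404: unknown, expired or not visible
    ROLLED_BACK = "rolled back"  # 409: conflict, the server already rolled back
    LOCKED = "locked"            # 423: still open, retry or roll back


_OUTCOMES = {
    204: CommitOutcome.COMMITTED,
    404: CommitOutcome.GONE,
    409: CommitOutcome.ROLLED_BACK,
    423: CommitOutcome.LOCKED,
}


def classify_commit_status(status: int) -> CommitOutcome:
    """Translate a commit response code into an outcome."""
    if status not in _OUTCOMES:
        raise ValueError(f"unexpected status code: {status}")
    return _OUTCOMES[status]


def commit_with_retry(send_commit, attempts=3, delay=1.0, sleep=time.sleep):
    """Retry the commit while resources are locked, up to `attempts` times."""
    for attempt in range(attempts):
        outcome = classify_commit_status(send_commit())
        if outcome is not CommitOutcome.LOCKED:
            return outcome
        if attempt + 1 < attempts:
            sleep(delay)
    return CommitOutcome.LOCKED  # still locked: the caller should roll back


statuses = iter([423, 204])
print(commit_with_retry(lambda: next(statuses), sleep=lambda seconds: None))
```

While waiting between attempts, the transaction timeout keeps running, so long delays may require a renew request in between.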

Renew Transaction

POST /v3/_tx/{txid}?renew HTTP/1.1

Renew a running transaction. This resets the transaction timeout and ensures that the transaction is not rolled back automatically for the next TransactionInfo.ttl seconds.

Table 55. Response Codes
Status	Response	Description
200	TransactionInfo	Transaction renewed successfully. The response contains an updated `timeout`.
404	Error	Transaction does not exist, expired or is not visible to the current user context.

Rollback Transaction

DELETE /v3/_tx/{txid} HTTP/1.1

Close a running transaction by rolling it back. All changes made with this transaction ID are discarded.

API Data Structures

List of Types

AclInfo
ArchiveInfo
Error
FileInfo
FileList
MetaAttributes
ScrollResults
SearchHit
SearchResults
ServiceInfo
SnapshotInfo
TransactionInfo
VaultInfo

AclInfo

This object maps subjects (users, groups or special subjects) to lists of permissions (lowercase) or permission sets (uppercase). See Archive Permissions for possible values.

Permissions are grouped into permission sets by default. Only permissions that do not fit into a complete set are returned individually. Endpoints returning this structure usually also support a flag to return individual permissions instead of sets.

For most subjects, this listing only contains permissions that were explicitly granted on the archive itself. Authorization realms configured on the server may grant additional permissions when requested. Those are not listed here, as they cannot be known in advance.

Table 56. Field list for AclInfo
Field	Type	Description
{subject}	list(string)	A list of permissions (lowercase) or permission-sets (uppercase) granted to this subject. `{subject}` can be an individual, an `@` prefixed group or one of the special subjects `$self`, `$any`, `$user` or `$owner`.

Example for AclInfo

{
  "$any": [
    "READ"
  ],
  "$owner": [
    "OWNER"
  ],
  "alice": [
    "READ"
  ],
  "@cronGorup": [
    "READ",
    "read_acl"
  ]
}

ArchiveInfo

Archive properties and content listing as returned by Get Archive Info. Some of the fields are optional or affected by query parameters. See Get Archive Info for a detailed description.

If this document represents an archive snapshot, additional fields are present. State that is not part of the snapshot (e.g. owner or ACL) is complemented from the archive state, if requested.

Table 57. Field list for ArchiveInfo
Field	Type	Description
id	string	Unique ID of this archive.
vault	string	Name of the containing vault.
revision	string	Archive revision. This is currently an incrementing counter, but the value should be treated as an arbitrary string.
profile	string	The name of the storage profile. If the archive is currently in a `pending-*` state, then this is the target profile the archive is migrating to.
state	enum	The current storage state of this archive or snapshot. The states are: `open`: The archive is open for reading and writing. `locked`: The archive is write-protected, but can be read. `archived`: The archive is stored in an external location, cannot be modified and file content may not be available. It needs recovery to be available again. `pending-recover`: The archive is currently being recovered from external storage and will change to `open` or `locked` once the recovery is complete. `pending-archive`: The archive is currently migrating to external storage and will change to `archived` once the migration is complete. Archives in `pending-` states have the same restrictions as `archived`. To change the state, change the storage profile and wait for the `pending-` state to clear.
created	date	Time this archive was created.
modified	date	Last time this archive, its meta-data or any of its files were modified. Note that changes to administrative meta-data (owner, ACL) do not update the modification time of an archive. If you need to track changes in administrative meta-data, always compare the actual values.
file_count	int	Total number of files in this archive. May be `-1` to indicate that the actual number is not known. This may happen if the user does not have the permission to list the archive's content.
files	list(FileInfo)	List of files in this archive. May be incomplete or missing based on query parameters, permissions and server configuration. See Get Archive Info for details.
meta	MetaAttributes	Meta-Attributes defined on this archive. May be incomplete or missing based on query parameters and permissions.
acl	AclInfo	Access control list. May be incomplete or missing based on query parameters and permissions.
snapshots	list(SnapshotInfo)	List of snapshots created for this archive, if any. May be incomplete or missing based on query parameters. See Get Archive Info for details.

Example for ArchiveInfo

{
  "id": "ab587f42c2570a884",
  "vault": "myVault",
  "revision": "0",
  "profile": "default",
  "state": "open",
  "created": "2016-12-20T13:59:37.160+0000",
  "modified": "2016-12-20T13:59:37.231+0000",
  "file_count": 1,
  "files": [
    {
      "name": "/example.txt",
      "id": "aaf0cc5ab587",
      "type": "text/plain",
      "size": 7,
      "created": "2016-12-20T13:59:37.217+0000",
      "modified": "2016-12-20T13:59:37.218+0000",
      "digests": {
        "md5": "1a79a4d60de6718e8e5b326e338ae533",
        "sha1": "c3499c2729730a7f807efb8676a92dcb6f8a3f8f",
        "sha256": "50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c"
      },
      "meta": {
        "dc:title": [
          "This is an example file"
        ],
        "dc:date": [
          "2016-12-20T13:59:37.218+0000"
        ]
      }
    }
  ],
  "acl": {
    "$any": [
      "READ"
    ],
    "$owner": [
      "OWNER"
    ],
    "alice": [
      "READ"
    ],
    "@cronGorup": [
      "READ",
      "read_acl"
    ]
  }
}

Error

In case of an error, CDStar will return a JSON document with additional information.

Table 58. Field list for Error
Field	Type	Description
status	int	HTTP status code of this response
error	string	Short description. Suitable as a key for translations or error handling, as it does not contain any dynamic parts.
message	string	Long description. Suitable to be presented to the user.
detail	object	Additional information or metadata. (Optional field)
other	list(Error)	If more than one error occurred during a single request, the other errors are listed here. (Optional field)

Example for Error

{
  "status": 404,
  "error": "Not found",
  "message": "The requested archive does not exist or is not readable.",
  "detail": {
    "vault": "myVault",
    "archive": "ab587f42c2570a884"
  }
}
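On the client side, such a document can be parsed into a small data class. The sketch below also flattens the optional `other` list so that all errors of a request can be handled in one loop; `CDStarError` is a hypothetical name and not part of any official client library.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CDStarError:
    status: int
    error: str    # stable key, suitable for error handling or translations
    message: str  # human-readable text
    detail: dict = field(default_factory=dict)
    other: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, doc: dict) -> "CDStarError":
        """Build an error from a parsed JSON document; optional fields may be absent."""
        return cls(
            status=doc["status"],
            error=doc["error"],
            message=doc["message"],
            detail=doc.get("detail") or {},
            other=[cls.from_dict(item) for item in doc.get("other") or []],
        )

    def flatten(self) -> list:
        """Return this error followed by all nested errors."""
        result = [self]
        for nested in self.other:
            result.extend(nested.flatten())
        return result


raw = '{"status": 404, "error": "Not found", "message": "The requested archive does not exist or is not readable."}'
parsed = CDStarError.from_dict(json.loads(raw))
print(parsed.status, parsed.error)
```

Branching on `error` rather than `message` follows the field descriptions above: only `error` is free of dynamic parts.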

FileInfo

Properties and (optionally) meta-data about a single file within an archive.

Table 59. Field list for FileInfo
Field	Type	Description
id	string	A unique and immutable string identifier. Unlike the `name` attribute, the `id` will not change for the lifetime of the file and can be used to track individual files across name changes.
name	string	File name (unicode), always starting with a slash (`/`). The file name may actually represent a path and contain several path separators (slash, `/`).
type	string	User supplied or auto-detected media type. Defaults to `application/octet-stream`
size	long	File size in bytes
created	date	Time the file was created.
modified	date	Last time the file content was modified.
digests	object	An object mapping digest algorithms to their hex value. The available algorithms (e.g. `md5`, `sha1` or `sha256`) depend on server configuration, but at least one is always present. This field is not available (null or missing) for incomplete files with running or aborted uploads in the same transaction.
meta	MetaAttributes	Meta attributes defined for this file. May be incomplete or missing based on query parameters and permissions.

Example for FileInfo

{
  "name": "/example.txt",
  "id": "aaf0cc5ab587",
  "type": "text/plain",
  "size": 7,
  "created": "2016-12-20T13:59:37.217+0000",
  "modified": "2016-12-20T13:59:37.218+0000",
  "digests": {
    "md5": "1a79a4d60de6718e8e5b326e338ae533",
    "sha1": "c3499c2729730a7f807efb8676a92dcb6f8a3f8f",
    "sha256": "50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c"
  },
  "meta": {
    "dc:title": [
      "This is an example file"
    ],
    "dc:date": [
      "2016-12-20T13:59:37.218+0000"
    ]
  }
}
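The `size` and `digests` fields allow a client to verify downloaded content. The following sketch assumes that the digest names reported by the server match the algorithm names known to Python's `hashlib`, which holds for `md5`, `sha1` and `sha256`; `verify_digests` is a hypothetical helper.

```python
import hashlib


def verify_digests(content: bytes, file_info: dict) -> bool:
    """Check `content` against the size and all digests that hashlib can compute."""
    if len(content) != file_info["size"]:
        return False
    checked = 0
    for algorithm, expected in (file_info.get("digests") or {}).items():
        if algorithm not in hashlib.algorithms_available:
            continue  # not supported by this Python build, skip it
        actual = hashlib.new(algorithm, content).hexdigest()
        if actual != expected.lower():
            return False
        checked += 1
    # Without any comparable digest the content cannot be confirmed.
    return checked > 0


content = b"hello"
info = {"size": 5, "digests": {"sha256": hashlib.sha256(content).hexdigest()}}
print(verify_digests(content, info))
```

The function returns `False` when `digests` is missing, which is the case for incomplete files with running or aborted uploads.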

FileList

A list of FileInfo objects, usually filtered and paginated. If count and total are not equal, then the result is incomplete and additional requests are required to get the complete list.

Table 60. Field list for FileList
Field	Type	Description
count	int	Number of results in this listing (size of the `files` array)
total	int	Total number of files matching the given include/exclude filters or query.
files	list(FileInfo)	List of FileInfo objects.

Example for FileList

{
  "count": 1,
  "total": 1,
  "files": [
    {
      "name": "/example.txt",
      "id": "aaf0cc5ab587",
      "type": "text/plain",
      "size": 7,
      "created": "2016-12-20T13:59:37.217+0000",
      "modified": "2016-12-20T13:59:37.218+0000",
      "digests": {
        "md5": "1a79a4d60de6718e8e5b326e338ae533",
        "sha1": "c3499c2729730a7f807efb8676a92dcb6f8a3f8f",
        "sha256": "50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c"
      },
      "meta": {
        "dc:title": [
          "This is an example file"
        ],
        "dc:date": [
          "2016-12-20T13:59:37.218+0000"
        ]
      }
    }
  ]
}

MetaAttributes

This object contains one key per non-empty meta-attribute defined on the resource. The keys are fully qualified attribute names (including schema prefix) and values are always lists of strings, even if the attribute only allows a single value or has a different value type.

Table 61. Field list for MetaAttributes
Field	Type	Description
{schema:attr}	list(string)	A list of string values. The list is ordered and duplicates are allowed.

Example for MetaAttributes

{
  "dc:title": [
    "This is an example file"
  ],
  "dc:date": [
    "2016-12-20T13:59:37.218+0000"
  ]
}

ScrollResults

A page of results returned from a List all Archives in a Vault query.

Table 62. Field list for ScrollResults
Field	Type	Description
count	int	Number of results in this page.
limit	int	Maximum number of results per page. If `limit` is greater than `count`, then this is the last page.
results	list(string)	List of archive IDs

Example for ScrollResults

{
  "count": 2,
  "limit": 25,
  "results": [
    "ab587f42c2570a884",
    "ac2b39606a3a6e3b1"
  ]
}
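The last-page rule (`limit` greater than `count`) is enough to drive a paging loop. In the sketch below, `fetch_page` is a hypothetical callable that receives the previous page (or `None` for the first request) and returns the next ScrollResults document; how the next page is actually requested is described with the List all Archives in a Vault endpoint and not assumed here.

```python
def iter_archive_ids(fetch_page):
    """Yield archive IDs from consecutive pages until the last page is reached."""
    previous = None
    while True:
        page = fetch_page(previous)
        yield from page["results"]
        if page["limit"] > page["count"]:
            return  # fewer results than the limit: this was the last page
        previous = page


# Two fake pages standing in for real server responses.
pages = iter([
    {"count": 2, "limit": 2, "results": ["ab587f42c2570a884", "ac2b39606a3a6e3b1"]},
    {"count": 1, "limit": 2, "results": ["b7c11d2e9f0a4c6d3"]},
])
print(list(iter_archive_ids(lambda previous: next(pages))))
```

If the final full page happens to contain exactly `limit` results, the loop issues one more request and stops on the resulting empty page.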

SearchHit

A single element of a SearchResults listing.

Table 63. Field list for SearchHit
Field	Type	Description
id	string	Archive ID this hit belongs to.
type	string	Resource type of this hit (either `archive` or `file`)
name	string	Full file name (including path) of the matched file. Only present if `type` equals `file`.
score	float	Relevance score. May be 0 for queries or search backends that do not support relevance scoring.
fields	object(string, any)	Contains field query results requested during search or automatically provided by the search backend. Each entry maps a field query to its result value, which is usually a simple type (e.g. number, string or list of strings), but can also take other forms for computed fields or errors. Failed or unsupported individual field queries should map to an `{'error': 'Reason'}` object containing error details if possible, but may also be silently ignored and not included in the result at all. Supported field queries and their return type depend on the search backend used.

Example for SearchHit

{
  "id": "ab587f42c2570a884",
  "type": "file",
  "name": "/folder/example.pdf",
  "score": 3.14,
  "fields": {
    "dcTitle": "Example Document Title",
    "highlight(content)": {
      "error": "UnsupportedFieldQuery"
    }
  }
}
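Because failed field queries are reported inline, a client usually wants to separate usable values from errors. A minimal sketch, assuming that any object with an `error` key marks a failed field query (a computed field that legitimately returns such an object would be misclassified); `split_fields` is a hypothetical helper.

```python
def split_fields(hit: dict):
    """Split the `fields` of a SearchHit into (values, errors)."""
    values, errors = {}, {}
    for query, result in (hit.get("fields") or {}).items():
        if isinstance(result, dict) and "error" in result:
            errors[query] = result["error"]
        else:
            values[query] = result
    return values, errors


hit = {
    "id": "ab587f42c2570a884",
    "type": "file",
    "name": "/folder/example.pdf",
    "score": 3.14,
    "fields": {
        "dcTitle": "Example Document Title",
        "highlight(content)": {"error": "UnsupportedFieldQuery"},
    },
}
values, errors = split_fields(hit)
print(values)
print(errors)
```

Field queries that the backend silently ignored appear in neither mapping, so callers should look results up with a default instead of assuming every requested field is present.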

SearchResults

A page of results returned from a search query.

Table 64. Field list for SearchResults
Field	Type	Description
count	int	Number of results in this page.
total	int	Total number of results in this result set (approximation)
scroll	string	A stateless cursor representing the last hit of this result page. It can be used to repeat the search and fetch the next page of a large result set.
hits	list(SearchHit)	List of search hits

Example for SearchResults

{
  "count": 1,
  "total": 1,
  "scroll": "WyJhYjU4N2Y0MmMyNTcwYTg4NDphYWYwY2M1YWI1ODciXQ==",
  "hits": [
    {
      "id": "ab587f42c2570a884",
      "type": "file",
      "name": "/folder/example.pdf",
      "score": 3.14,
      "fields": {
        "dcTitle": "Example Document Title",
        "highlight(content)": {
          "error": "UnsupportedFieldQuery"
        }
      }
    }
  ]
}

ServiceInfo

General information about this cdstar instance

Table 65. Field list for ServiceInfo
Field	Type	Description
version	[VersionInfo]	Information about the currently running version. Some details are only shown to logged in users.
vaults	list(string)	List of vault names visible to the current user.
info	object	A flat map with meta information about this instance. Keys and values are strings. Keys starting with `public.*` are usually intended for humans, all other keys may contain technical data for client-side auto-configuration and should usually not be presented to humans in their raw form.
features	object	Plugins or optional features can advertise their presence and configuration here.

Example for ServiceInfo

{
  "version": {
    "cdstar": "3.1.0-SNAPSHOT",
    "api": "3.0",
    "java": "11.0.16+8",
    "source": {
      "branch": "412b92489ef24e338ee7ab8b11ad32ca3a1569d1",
      "commit": "412b92489ef24e338ee7ab8b11ad32ca3a1569d1",
      "date": "2023-05-31T14:17:35Z"
    }
  },
  "vaults": [
    "exampleVault"
  ],
  "features": {
    "tus": {
      "version": 3,
      "path": "/tus/"
    }
  },
  "info": {
    "public.title": "Awesome repository",
    "public.contact.email": "support@gwdg.de",
    "cdstar.explorer.v3": "{ ... json document as string ...}"
  }
}

SnapshotInfo

Information about a single archive snapshot.

Table 66. Field list for SnapshotInfo
Field	Type	Description
name	string	Snapshot name
revision	string	Archive revision this snapshot refers to.
creator	string	User that created this snapshot.
created	string	Snapshot creation date
profile	string	Snapshot storage profile

Example for SnapshotInfo

{
  "name": "v1",
  "revision": 0,
  "creator": "user@domain",
  "created": "2020-05-26T12:02:45.301+0000",
  "profile": "default"
}

TransactionInfo

Information about a running transaction. See Transaction Management for details.

Table 67. Field list for TransactionInfo
Field	Type	Description
id	string	Transaction ID
isolation	enum	Isolation level (either `full` or `snapshot`)
readonly	boolean	Whether or not this transaction is in read-only mode. Read-only transactions cannot be committed (only rolled back) and do not allow modifying operations.
ttl	integer	Number of seconds left from the configured `timeout`. This counter is reset every time the transaction is used. If this number is zero or negative, then the transaction already expired or may expire very soon.
timeout	integer	Number of seconds after which this transaction will expire if not used (see ttl).

Example for TransactionInfo

{
  "id": "091f8a6e-0fca-4771-a460-d2ee7d6034e3",
  "isolation": "snapshot",
  "readonly": false,
  "ttl": 59,
  "timeout": 60
}

VaultInfo

Information about a vault

Table 68. Field list for VaultInfo
Field	Type	Description
name	string	Vault name
public	boolean	Public vaults are accessible without authentication.
info	object	A flat map with meta information about this vault. Keys and values are strings. Keys starting with `public.*` are usually intended for humans, all other keys may contain technical data for client-side auto-configuration and should usually not be presented to humans in their raw form.

Example for VaultInfo

{
  "name": "exampleVault",
  "public": true,
  "info": {
    "public.title": "Awesome repository"
  }
}

Realms

Realms manage authentication and authorization in CDStar and are very flexible. There are different interfaces for authorization, authentication, group membership resolution, custom permission types and more. This list contains all available realm types that are either bundled with the core distribution or provided as officially supported plugins. Custom implementations can also be used.

StaticRealm

This realm provides authentication, authorization and groups from a static configuration file.

StaticRealm loads the entire user database (users, groups, roles and permissions) from a static configuration file (hence the name) and is the go-to solution for small instances with only a handful of users. No external database or server is required.

Configuration

The realm is configured directly in the cdstar main configuration. Here is an example showing most options:

Example cdstar-static-realm.yaml

realms:
  default:
    class: StaticRealm
    domain: static
    role:
      userRole:
      - "vault:demo:read"
      - "vault:demo:create"
      adminRole:
      - "vault:*:*"
      - "archive:*:*:*"
    group:
      customers:
      - userRole
      admins:
      - userRole
      - adminRole
    user:
      alice:
        password: "cGxhaW4=:FmtSc7NSX8fsjLTmpLpoqRLP4vqWFg/r5uy3EU6JsEs="
        groups:
        - customers
        permissions:
        - "vault:alice:*"
      admin:
        password: "..."
        roles:
        - adminRole

Table 69. Config properties
Param	Description
class	Realm implementation class name. Always `"StaticRealm"`
file	Load additional configuration from an external yaml file (not implemented)
domain	Sets a default domain for this realm. (defaults to 'static')
user.<name>.password	Enables a user to authenticate against this realm. The password is stored in hashed form. These hashes can be created using the built-in command line tool (see below).
user.<name>.permissions	Grants string permissions directly to this user.
user.<name>.groups	Adds this user to a list of groups.
user.<name>.roles	Adds this user to a list of roles.
group.<name>	Defines a new group with a list of roles.
role.<name>	Defines a new role with a list of string permissions.

Unqualified groups and user-names are qualified with the configured default domain of the realm (e.g. alice is turned into alice@static). Fully qualified names (e.g. alice@otherRealm) are also accepted, even if the domain does not match the current realm.
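The qualification rule amounts to a one-line transformation. The sketch below only illustrates the described behaviour and is not the StaticRealm implementation; `qualify` is a hypothetical helper and `static` is the default domain named in the table above.

```python
def qualify(name: str, default_domain: str = "static") -> str:
    """Append the realm's default domain to names that carry no domain yet."""
    if "@" in name:
        return name  # already qualified, kept even if the domain belongs to another realm
    return f"{name}@{default_domain}"


print(qualify("alice"))
print(qualify("alice@otherRealm"))
```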

Warning

Permissions, groups and roles configured for a qualified user will affect any session with a matching principal name and domain, even if the session was authenticated by a different realm.

If no password is defined for a user, then the user will not be able to authenticate against this realm. Permissions, roles and groups still apply.

Password hash

A secure password-hash can be generated with the java -cp cdstar.jar de.gwdg.cdstar.auth.realm.StaticRealm tool.

LDAP Realm

An LDAPRealm authenticates password credentials against an LDAP server. The realm first searches for the user according to a configurable search base and filter, then tries to bind to the LDAP using the user's password. Successfully authenticated principals are cached to speed up repeated login requests for the same user.

Configuration

Example Configuration

realm:
    ldap:
       class: LDAPRealm
       name: "ldap"
       server: "ldaps://SERVER"
       search.user: "cn=USER,ou=users,dc=example,dc=com"
       search.password: "SECRET"
       search.base: "dc=example,dc=com"
       search.filter: "(|(uid={})(mail={}))"
       attr.uid: "uid"
       attr.domain: "ou"

Table 70. Config Parameters
Name	Description
class	Plugin class name. Always `LDAPRealm`
name	The name of this realm. Defaults to the value of `_name`.
server	URL (either `ldap://` or `ldaps://`) of the LDAP server.
search.user	Login `DN` for the search agent. The search agent must be able to search below the `search.base` tree to find the `DN`s matching a login request.
search.password	Password for the search agent.
search.base	Base `DN` for user records. Only records below this tree are considered for login requests.
search.filter	Search filter used to map a login request (e.g. user name or e-mail) to a qualified user `DN`. Every occurrence of `{}` within this filter is replaced by an escaped copy of the login request. Additional escaping is not required. For example, to allow login via common name, uid and email, provide a filter similar to: `(|(cn={})(uid={})(mail={}))`
attr.uid	The LDAP attribute used as the subject identifier. Note that subject identifiers must be unique and should not contain certain special characters. Defaults to `uid`.
attr.domain	Attribute to read the principal domain from. This allows a single LDAPRealm to represent multiple principal domains. If this config value is not set, or if the attribute is not found in the ldap record, then the principal domain defaults to the realm name. (Optional)
cache.size	Number of recently authenticated principals to keep in memory to prevent unnecessary LDAP requests. Defaults to `1024`. A cache size of `0` disables the cache.
cache.expire	Number of seconds after which a principal must be re-authenticated against LDAP. (default: 10 minutes)

Warning: cache.expire is enforced by the cache implementation, which might allow entries to survive longer than expected on Java 8 if the cache is mostly idle. If prompt expiration is important and the expiration time is very short, make sure to run on Java 9 or newer.
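To illustrate what the `{}` substitution in `search.filter` means, the sketch below fills a filter template with an escaped login name. It is not the LDAPRealm implementation; it applies the filter value escaping from RFC 4515, and `escape_filter_value` and `build_filter` are hypothetical helpers.

```python
# Characters that must be escaped in LDAP filter values (RFC 4515).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}


def escape_filter_value(value: str) -> str:
    """Escape special characters so the value cannot alter the filter structure."""
    return "".join(_ESCAPES.get(char, char) for char in value)


def build_filter(template: str, login: str) -> str:
    """Replace every `{}` in the template with the escaped login request."""
    return template.replace("{}", escape_filter_value(login))


print(build_filter("(|(uid={})(mail={}))", "alice@example.com"))
print(build_filter("(uid={})", "a*)("))
```

The second call shows why escaping matters: without it, a login name containing parentheses or wildcards could change the meaning of the search filter.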

JWT Realm

This plugin adds support for JWT token based authentication and authorization.

Configuration

The JWTRealm class can be configured as a realm or regular plugin and allows users to authenticate via signed JWTs.

example.yaml

realm:
  jwt:
    class: JWTRealm
    default:
      hmac: c3VwZXJzZWNyZXQ= # base64("supersecret")
    my_issuer:
      iss: https://auth.example.com/my-realm/
      jwks: https://auth.example.com/my-realm/jwks.json
      domain: my_realm

This plugin supports multiple JWT issuers with different settings at the same time. Tokens are matched against configured issuers based on their iss claim. Tokens without an iss claim or with no matching issuer configuration will be matched against the default issuer, if defined.

Each issuer MUST define at least one of hmac, rsa, ecdsa or jwks to be able to verify signed tokens. Unsigned tokens are not supported and will be rejected.
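For testing an `hmac` issuer, a signed token can be produced with the Python standard library alone. The sketch below creates and verifies an HS256 token using the secret from the example configuration. The claim values are made up, and production setups would normally rely on a maintained JWT library instead of hand-written signing code.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url encoding without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Create a compact JWT signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256(token: str, secret: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return hmac.compare_digest(_b64url(expected), signature)


secret = base64.b64decode("c3VwZXJzZWNyZXQ=")  # the `hmac` value from the example config
token = sign_hs256({"sub": "alice"}, secret)   # no `iss` claim: matches the default issuer
print(token)
print(verify_hs256(token, secret))
```

The token carries no `iss` claim, so according to the matching rules above it would be checked against the `default` issuer.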

Name	Description
class	Plugin class name. Always JWTRealm.
<issuer>.iss	Expected value of the iss claim for tokens from this issuer. (default: <issuer>)
<issuer>.hmac	Base64 encoded secret. Required to verify HMAC based signatures.
<issuer>.rsa	RSA public key (X.509). Required to verify RSA based signatures. Keys are loaded from files (*.pem or *.der), or directly from a base64 encoded string.
<issuer>.ecdsa	ECDSA public key (X.509). Required to verify ECDSA based signatures. Keys are loaded from files (*.pem or *.der), or directly from a base64 encoded string.
<issuer>.jwks	Path or URL pointing to a JWKS (JSON Web Key Set) file to load signing keys from.
<issuer>.leeway	Number of seconds to add to or subtract from the exp and nbf claims before a token is checked. This helps prevent errors for short-lived tokens if the server clocks are not perfectly synchronized. (default: 0)
<issuer>.domain	The realm domain of the resulting principal. Interpreted as a SpEL expression (see below) if it looks like one. (default: <issuer>)
<issuer>.trusted	(deprecated) If true, the issuer can dynamically grant additional permissions via private claims (see below). (default: false)
<issuer>.permit	A list of static StringPermissions given to all tokens created by this issuer.
<issuer>.groups	A list of static groups all token users are considered to be a member of.
<issuer>.subject	SpEL expression to derive a subject name from a token. Must evaluate to a string. (default: getString('sub'))
<issuer>.verify.<name>	SpEL expression (see below) to check token validity. All expressions must evaluate to true, or the token will be rejected. The rule name is purely informational.
<issuer>.groups.<name>	SpEL expression (see below) to derive group memberships from a token. Each expression must evaluate to a string, a list of strings, or null. Null values or empty lists are ignored and will not add any groups. The expression name is purely informational.
<issuer>.permit.<name>	SpEL expression (see below) to derive StringPermissions from a token. Each expression must evaluate to a string, a list of strings, or null. Null values or empty lists are ignored and will not add any permissions. The expression name is purely informational.

Dynamic expression rules

Because JWT is a very loose standard and the available claims may differ a lot between token providers, this plugin allows you to verify tokens and extract information dynamically using SpEL expressions. Token claims are available as a claims map which maps claim names to com.auth0.jwt.interfaces.Claim instances, or via the hasClaim(name), getBool(name, default), getLong(name, default), getDouble(name, default), getString(name, default), getStringList(name), getClaim(name, type, default) and getClaimList(name, innerType) helper methods. These methods will return null or an empty list on any errors (missing claim, wrong type) and automatically convert between single and list claims. If a single value is requested for a list claim, the first value is returned.
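
The lenient behavior of these helpers can be modeled in a few lines of Python. This is only an approximation of the described semantics, using a plain dictionary in place of the claims map. The function names mirror getString and getStringList, and the exact edge-case handling (for example mixed-type lists) is an assumption.

```python
def get_string_list(claims, name):
    """Return a list of strings; missing or wrongly typed claims give an empty list."""
    value = claims.get(name)
    if isinstance(value, str):
        return [value]  # single claim converted to a list
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return list(value)
    return []


def get_string(claims, name, default=None):
    """Return a single string; for list claims the first value is used."""
    value = claims.get(name)
    if isinstance(value, str):
        return value
    if isinstance(value, list) and value and isinstance(value[0], str):
        return value[0]
    return default  # missing claim or wrong type


claims = {"sub": "alice", "aud": "my_client_id", "roles": ["a", "b"], "n": 5}
```

With these definitions, a rule such as verify.aud from the example below corresponds to `"my_client_id" in get_string_list(claims, "aud")`.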

dyn-example.yaml

realm.jwt:
    class: JWTRealm
    keycloak:
      iss: https://auth.example.com/realms/my_realm/
      jwks: https://auth.example.com/realms/my_realm/protocol/openid-connect/certs
      domain: "getString('org_id')?.toUpperCase() ?: 'OIDC'"
      subject: "getString('sub')"
      verify.aud: "getStringList('aud').contains('my_client_id')"
      groups.admin: "getBool('admin', false) ? 'admin_group' : null"
      permit.vaultUser: "getStringList('usable_vaults').!['vault:#{#this}:create']"

Trusted token claims (deprecated)

If the issuer is configured with trusted: true, then the following rules and expressions are automatically configured for a realm:

trusted.yaml

# Add `cdstar:groups` to list of groups.
groups._trusted: "getStringList('cdstar:groups')"

# Allow read access to all vaults in `cdstar:read`
permit._trusted_read: "getStringList('cdstar:read').!['vault:'+#this+':read']"

# Allow create+read access to all vaults in `cdstar:create`
permit._trusted_create:      "getStringList('cdstar:create').!['vault:'+#this+':create']"
permit._trusted_create_read: "getStringList('cdstar:create').!['vault:'+#this+':read']"

# Grant all vault and archive permissions in `cdstar:grant`
permit._trusted_grant: "getStringList('cdstar:grant').?[#this.startsWith('vault:') or #this.startsWith('archive:')]"
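
As a worked example, the following Python sketch evaluates the equivalent of these rules for a token with `cdstar:*` claims. It only mirrors the projections and the selection shown above; the function name and the sample claims are made up for illustration.

```python
def trusted_rules(claims):
    """Return (groups, permissions) that the trusted rules would derive from claims."""
    def string_list(name):
        value = claims.get(name)
        if isinstance(value, str):
            return [value]
        return [v for v in value if isinstance(v, str)] if isinstance(value, list) else []

    groups = string_list("cdstar:groups")
    permissions = []
    # permit._trusted_read
    permissions += ["vault:" + v + ":read" for v in string_list("cdstar:read")]
    # permit._trusted_create and permit._trusted_create_read
    permissions += ["vault:" + v + ":create" for v in string_list("cdstar:create")]
    permissions += ["vault:" + v + ":read" for v in string_list("cdstar:create")]
    # permit._trusted_grant: only vault and archive permissions pass the filter
    permissions += [p for p in string_list("cdstar:grant")
                    if p.startswith(("vault:", "archive:"))]
    return groups, permissions


groups, permissions = trusted_rules({
    "cdstar:groups": ["editors"],
    "cdstar:read": ["public"],
    "cdstar:create": ["inbox"],
    "cdstar:grant": ["vault:demo:read", "service:admin"],
})
```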

Plugins

Plugins are optional components that extend various parts of the cdstar runtime or REST API and can be enabled on demand. Some plugins are bundled with the core cdstar distribution, others must be downloaded and unpacked into the path.lib folder before they can be used. This chapter describes the official plugins that are tested and distributed with the core cdstar runtime and fully supported.

PushEventFilter

The PushEventFilter sends an HTTP request to a number of configured consumer URLs whenever an archive is modified. This can be used to update external services or keep external databases in sync with the actual data within cdstar.

Failed push requests are retried to compensate for busy or temporarily unavailable consumers. If a consumer goes down for an extended period, any push requests that could not be delivered are persisted to disk.

Configuration

Example Configuration

cdstar:
  plugin:
    push:
      class: PushEventFilter
      fail.log: "${path.var}/push-fail.log"
      retry.max: 3
      retry.delay: 1000
      retry.cooldown: 60000
      queue.size: 1000
      http.timeout: 60000
      url: http://localhost:8081/push
      url.alt: http://localhost:8082/push
      header.Authorization: Basic Y3VyaW91czpjYXQ=
      header.X-Push-Referrer: http://push:push@localhost:8080/v3/

Table 71. Config Parameters
Name	Description
class	Plugin class name. Always `PushEventFilter`
fail.log	(optional, recommended) Path to a file where failed push requests are logged. If `%s` is part of the filename, it is replaced with the current unix epoch timestamp. If it is relative, it is created within the `path.var` directory of the CDStar instance. Missing directories are created automatically.
retry.max	(default: `3`) Maximum number of attempts before a consumer is considered unresponsive.
retry.delay	(default: `1000`) Number of milliseconds to wait between failed attempts.
retry.cooldown	(default: `60000`) Number of milliseconds to wait after `retry.max` failed attempts.
http.timeout	(default: `60000`) Number of milliseconds after which a request is aborted.
queue.size	(default: `1000`) Number of queued events per consumer.
url	URL to send push requests to.
url.*	Additional URLs.
header.*	Additional HTTP headers to send with each request.

Push Event Consumer API

Events are sent to consumers synchronously and in the order they appear, which means that there is at most one HTTP connection per consumer at any given time. The service behind the configured URL should expect requests like the following:

Example PUSH request

POST /push HTTP/1.1
Host: localhost:8081
Content-Type: application/json; charset=UTF-8
Content-Length: 167
X-Push-Retry: 0
X-Push-Queue: 12 1000 0
X-Push-Referrer: http://push:push@localhost:8080/v3/

{
  "vault" : "test",
  "archive" : "b5e83cd9658f7f33",
  "revision" : "0",
  "parent" : null,
  "ts" : 1491914254133,
  "tx" : "ded6b2d4-6983-48f6-9b1f-be8225dab136"
}

Table 72. Event Headers
Name	Description
X-Push-Retry	(int) Number of previously failed attempts for this event.
X-Push-Referrer	(url) May be sent to tell consumers how to contact cdstar.
X-Push-Queue	Statistics about the event queue for this consumer. Contains three space-separated integers: (1) the number of events in the waiting queue (not counting the current event), (2) the maximum size of the waiting queue, and (3) the total number of events dropped since the service was last restarted. Example: `12 1000 0` means that twelve events are currently waiting in a queue limited to 1000 events, and no events have been dropped so far.
*	Additional headers can be configured with `header.*` properties.
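
A consumer that wants to monitor its backlog could parse the X-Push-Queue header along these lines. The `PushQueueStats` type and its field names are made up for this sketch.

```python
from typing import NamedTuple


class PushQueueStats(NamedTuple):
    waiting: int   # events waiting, not counting the current one
    capacity: int  # maximum size of the waiting queue
    dropped: int   # events dropped since the last service restart


def parse_push_queue(value: str) -> PushQueueStats:
    """Parse an X-Push-Queue header value such as '12 1000 0'."""
    waiting, capacity, dropped = (int(part) for part in value.split())
    return PushQueueStats(waiting, capacity, dropped)


stats = parse_push_queue("12 1000 0")
```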

Table 73. Event Attributes
Name	Description
vault	Name of the vault.
archive	ID of the archive that changed.
revision	Revision of the changed archive, or `null` if the archive was deleted.
parent	Revision of the archive before the change, or `null` if this archive was just created.
ts	Timestamp of the change event (milliseconds since 1970-01-01T00:00:00GMT)
tx	ID of the transaction this change was part of.

A consumer may respond with 200 OK, 202 Accepted or 204 No Content to signal success. The response body should be empty and other headers (including cookies) are ignored.

Redirects with 30x response codes are followed according to normal HTTP client rules, but discouraged.

Consumers that are busy or unresponsive can answer with 503 Service Unavailable and request a cool-down time (in seconds) using the Retry-After header. This causes CDStar to pause the consumer and not send any more requests for the requested cool-down period. If the Retry-After header is missing, the default retry.cooldown is used.

Any other response, as well as connection problems or timeouts, is logged as a warning and the request is sent again after retry.delay milliseconds. If a request fails more than retry.max times in a row, it is logged as an error and the consumer is paused for retry.cooldown milliseconds. This gives the consumer a chance to recover and also reduces logging noise considerably. Note that failing events are not discarded, but simply sent again after the cool-down. Consumers MUST return a success status if they want to drop or ignore an event. Otherwise, they will receive the same event over and over again.
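
The response handling described above can be summarized as a small decision function. This is a simplified reading of the text, not the actual implementation: the function name, the return values and the exact counting of failed attempts are assumptions.

```python
def next_action(status, retry_after, failures, conf):
    """Decide what the sender does after one delivery attempt.

    status      -- HTTP status code, or None for connection errors and timeouts
    retry_after -- value of the Retry-After header in seconds, or None
    failures    -- number of previously failed attempts for this event
    conf        -- dict with retry.max, retry.delay and retry.cooldown
    Returns (action, wait_ms), where action is "done", "retry" or "pause".
    """
    if status in (200, 202, 204):
        return ("done", 0)
    if status == 503:
        # The consumer asked for a cool-down; fall back to the configured default.
        if retry_after is not None:
            return ("pause", int(retry_after) * 1000)
        return ("pause", conf["retry.cooldown"])
    if failures + 1 > conf["retry.max"]:
        # Too many failures in a row: pause, then send the same event again.
        return ("pause", conf["retry.cooldown"])
    return ("retry", conf["retry.delay"])


conf = {"retry.max": 3, "retry.delay": 1000, "retry.cooldown": 60000}
```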

Slow consumers should queue and persist events locally and answer with 202 Accepted to prevent timeouts or events piling up too quickly. If a single request takes longer than http.timeout milliseconds, it is aborted and tried again. If the number of waiting events exceeds queue.size (per consumer), new events will be dropped and logged to a fail.log file.
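
A minimal consumer following these recommendations might look like the sketch below, written with the Python standard library only. It buffers events in memory and acknowledges them with 202 Accepted; a production consumer would persist events before acknowledging them. The class name, the buffer size and the 30 second cool-down are arbitrary choices for this example.

```python
import json
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

# Bounded local buffer for received events (in-memory only in this sketch).
EVENTS = queue.Queue(maxsize=10000)


class PushConsumer(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            event = json.loads(body)
        except ValueError:
            # Acknowledge unusable payloads, otherwise they are sent again and again.
            self._reply(204)
            return
        try:
            EVENTS.put_nowait(event)
        except queue.Full:
            # Ask the sender to pause this consumer for 30 seconds.
            self._reply(503, {"Retry-After": "30"})
            return
        self._reply(202)

    def _reply(self, status, headers=None):
        self.send_response(status)
        for name, value in (headers or {}).items():
            self.send_header(name, value)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(host="127.0.0.1", port=8081):
    """Create (but do not start) an HTTP server for push events."""
    return HTTPServer((host, port), PushConsumer)
```

Calling `make_server().serve_forever()` would start the consumer on the URL used in the example configuration above.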

The fail.log file

The file configured with fail.log is used to store events that failed to be delivered. It contains one failed request per line, starting with the service URI, a single space, and the base64 encoded payload of the request. A timestamp is not logged since it can be easily recovered from the event payload itself.

Example fail.log entry

http://127.0.0.1:8081/push ewogICJ2YXVsd[...]IxMzYiLAp9Cg==

The PushEventFilter only appends to this file and there is no automatic clean-up. A warning is logged if this file is not empty at service start-up time, but there is no automatic recovery or re-queuing of events. This feature may be added in the future, though.

If you have consumers that are sensitive to lost events, make sure to check this file regularly. A short Python script to re-submit events from a fail.log is shown here:

Example recovery script

import base64
import requests

headers = {'Content-Type': 'application/json'}

with open('/path/to/fail.log') as fp:
    for lineno, line in enumerate(fp):
        line = line.strip()
        if not line:
            continue
        # Each line contains the consumer URL, a space, and the base64 payload.
        target, payload = line.split(' ', 1)
        data = base64.b64decode(payload)
        r = requests.post(target, data=data, headers=headers)
        # Consumers signal success with 200, 202 or 204.
        if r.status_code in (200, 202, 204):
            print("%d SUCCESS" % lineno)
        else:
            print("%d ERROR" % lineno)
            print(r)

RabbitMQSink

This plugin emits change events to a RabbitMQ message broker.

Warning

This plugin is experimental.

Configuration

Name	Type	Description
class	str	Always de.gwdg.cdstar.ext.rabbitmq.RabbitMQSink or RabbitMQSink
broker	URI	RabbitMQ transport URI to connect to, including authentication parameters and virtual host, if necessary.
exchange.name	str	Name of the exchange to publish to.
exchange.type	str	Type of the exchange (e.g. fanout). If not defined, then no exchange is declared and the exchange is assumed to already exist.
qsize	int	Size of the in-memory send-queue (default: 1024).

Reliability

Events are buffered in an in-memory send-queue and re-queued on any errors. This helps to compensate for short event bursts, temporary network failures or broker restarts.

Events that cannot be queued or re-queued are logged and dropped. This may happen during shutdown phase or when the send-queue overflows.

Events are not part of the transaction logic (yet). A forced shutdown or crash will lose all messages in the send-buffer. Also note that the broker itself may drop messages for various reasons, depending on its configuration. The possibility of losing events MUST be considered when using this plugin.

Embedded ActiveMQ Message Broker

This plugin emits change events to an embedded ActiveMQ message broker.

Warning

Embedding an ActiveMQ broker is fine for small to medium setups with low traffic and private networks. For production environments it is usually better to run a dedicated message broker with proper configuration and switch to the cdstar-activemq-sink or cdstar-rabbitmq-sink plugin.

Configuration

Name	Type	Description
transport.<name>	URI	Network transports to bind to. See ActiveMQ docs for available protocols and URI parameters. The <name> part is only used for logging and can be omitted for a single transport. This plugin bundles all dependencies needed for OpenWire, AMQP, STOMP and MQTT. Transports with vm, tcp, amqp, stomp, mqtt and auto schemes as well as their +ssl or +nio variants can be used directly. Other protocols may need additional dependencies on the class path. The auto transport accepts OpenWire, AMQP, STOMP and MQTT clients on the same network port and is recommended in setups with mixed clients. Default: auto+nio://127.0.0.1:5671
topic	list(str)	Change events are sent to the given topics. (Default: cdstar)
queue	list(str)	Same as topic, but sends events to a queue. (Default: disabled)
buffer	int	Size of the send buffer. (Default: unbound)

Change Event Sink: ActiveMQ

This plugin emits change events to an ActiveMQ message broker.

Configuration

Name	Type	Description
broker	URI	ActiveMQ transport URI to connect to, including authentication parameters, if necessary. This plugin bundles all dependencies needed for OpenWire, AMQP, STOMP and MQTT. Transports with tcp, amqp, stomp, mqtt and auto schemes as well as their +ssl or +nio variants can be used directly. Other protocols may need additional dependencies on the class path.
topic	list(str)	Change events are sent to the given topics. (Default: cdstar)
queue	list(str)	Same as topic, but sends events to a queue. (Default: disabled)
qsize	int	Size of the send buffer. (Default: unbound)

RedisSink

A dead simple plugin that emits change events to a redis server.

Configuration

Name	Type	Description
class	str	Always de.gwdg.cdstar.ext.redis.RedisSink or RedisSink
url	URI	A redis server or cluster URI (default: redis://localhost:6379/0)
key	string	Redis key or pub/sub channel to push events to. (default: cdstar.events)
mode	string	Push mode (see below). (default: RPUSH)
qsize	int	Maximum in-memory send-queue size. (default: 1024)

Push modes

RPUSH Right-push to a redis list. (default)
LPUSH Left-push to a redis list.
PUBLISH Publish to a redis pub/sub channel.

Reliability

This sink will buffer events in a bounded in-memory queue and send them out one by one as fast as it can. Any errors (network or redis errors, buffer queue overflow) will cause events to be logged and dropped (WARN level). On shutdown, the sink tries its best to send all remaining events, but will only do so for a couple of seconds. On a crash, all queued events are lost.

Or in other words: This sink is NOT reliable in any way. Network errors or crashes will cause events to be lost. On the plus side, this sink will not slow down cdstar if the redis server fails.
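
The trade-off between reliability and isolation comes down to a non-blocking, bounded queue: producers never wait, and overflowing events are dropped. The following Python sketch illustrates the general idea only and is not the plugin's implementation. The class name and logger name are made up.

```python
import logging
import queue

log = logging.getLogger("sink-sketch")


class DroppingQueue:
    """Bounded buffer that never blocks the producer and drops events on overflow."""

    def __init__(self, qsize=1024):
        self._queue = queue.Queue(maxsize=qsize)
        self.dropped = 0

    def offer(self, event):
        """Enqueue an event without waiting; return False if it was dropped."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            log.warning("send-queue full, dropping event: %r", event)
            return False

    def poll(self):
        """Return the next event for the sender thread, or None if the queue is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None
```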

Search Proxy

This plugin installs a SearchProvider that forwards search requests to an external search gateway, using a simple HTTP protocol as described below.

To simplify gateway development and improve security, client credentials are NOT forwarded to the gateway. CDSTAR will authenticate and resolve client credentials before the search is forwarded, and only provide principal name and group memberships to the gateway. This enables user-specific searches without exposing client credentials to an external service.

Configuration

Example Configuration

plugin:
    search:
       class: ProxySearchPlugin
       target: "https://gateway.example.com/search"
       maxconn: 16
       header:
          X-Custom-Header: value

Table 74. Config Parameters
Name	Description
class	Plugin class name. Always `ProxySearchPlugin`.
name	The name of this provider. Defaults to the value of `_name`.
target	URL to send search requests to. The target URL may contain authentication info.
maxconn	Maximum number of concurrent search requests (default: 10)
header.<name>	Additional HTTP headers to attach to each request.

Search gateway API

The search gateway should accept POST requests at the configured target URL with Content-Type: application/json and return results in the same format as the CDSTAR v3 search API. Search queries will be sent as JSON documents with the following fields:

Name	Type	Description
q	string	User provided search query.
fields	array(string)	An array of additional fields that should be returned with each hit. (optional)
order	array(string)	User provided order criteria as a list of field names to order by, each optionally prefixed with -. (optional)
limit	int	User provided limit for results per page. (optional)
scroll	string	User provided scroll handle. (optional)
vault	string	Name of the vault this search is performed on.
principal	object	Security context for this search request. If missing or null, assume an unauthenticated user.
principal.name	string	Name (including domain) of the user performing the search. (optional)
principal.groups	array(string)	List of groups the searching user belongs to. (optional)
principal.privileged	boolean	If true, assume the user can see all results. (default: false)

The q, fields, order, limit and scroll fields correspond to the (cleaned up) user provided search parameters as defined by the CDSTAR search API. vault and principal are added by CDSTAR. The search target should limit search results to entities visible to the specified principal. If no principal is present (null, missing or empty), the search should only return publicly visible results. If principal.privileged is true, the search should not filter by visibility and return all matching results.

Example Request

POST https://gateway.example.com/search
Content-Type: application/json
{
    "q": "search query",
    "order": ["-score"],
    "limit": 100,
    "fields": ["meta.dc:title"],
    "vault": "myVault",
    "principal": {
        "name": "alice@realm",
        "groups": ["users@realm"],
        "privileged": false
    }
}
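
On the gateway side, the visibility rules could be applied roughly as in the following Python sketch. How visibility is stored is up to the gateway: the `public` and `readers` fields on each hit are hypothetical and only serve to illustrate the three cases (privileged, authenticated, anonymous).

```python
def visible_hits(request, hits):
    """Filter search hits according to the principal in a gateway request."""
    principal = request.get("principal") or {}
    if principal.get("privileged", False):
        return list(hits)  # no visibility filtering at all
    name = principal.get("name")
    groups = set(principal.get("groups") or [])
    result = []
    for hit in hits:
        readers = set(hit.get("readers", []))  # hypothetical per-hit access list
        if hit.get("public") or (name and name in readers) or (groups & readers):
            result.append(hit)
    return result


hits = [
    {"id": "a", "public": True},
    {"id": "b", "readers": ["alice@realm"]},
    {"id": "c", "readers": ["admins@realm"]},
]
request = {
    "q": "search query",
    "vault": "myVault",
    "principal": {"name": "alice@realm", "groups": ["users@realm"], "privileged": False},
}
```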

Security considerations

Since the search gateway is not supposed to authenticate the searching user and trusts the fields sent by CDSTAR, it could be used to perform searches on behalf of another user if an attacker can access it directly. Make sure that the gateway is only reachable from the CDSTAR instance or is protected by HTTPS and some authentication mechanism (e.g. BASIC auth or secret headers).

Landing Page (UI)

The cdstar-ui plugin provides a very minimal browser-based UI (user interface) mounted at the /ui root path. This UI is targeted at humans and may require a modern JavaScript enabled browser to be fully usable. The URL scheme is not defined or stable, with one exception: /ui/<vault>/<archive> will always show (or redirect to) a human readable landing page for an archive. The user may be asked to log in first for non-public archives.

Configuration

No configuration necessary, but this plugin honors the global api.context setting (default: /). This may be required if the service path cannot be detected automatically and assets are not loaded correctly.

example.yaml

plugin.ui.class: cdstar-ui

TusPlugin

The TusPlugin installs a tus.io compatible REST endpoint to upload temporary files, and a way for other APIs to reference these files via server-side data streams. This helps clients to upload large files over unreliable network connections, or parallelize uploads of multiple files for the same archive.

Tip

TUS will NOT improve upload speed or throughput over stable network connections. The fastest and most efficient way to upload large files to cdstar is via Upload file. The best way to upload many small files to cdstar is via Update Archive. Only use TUS if uploads need to be resumable or you want to import the same file multiple times.

Configuration

There is currently no configuration for this plugin. Uploads will be placed into ${path.var}/tus/.

Example Configuration

plugin.tus.class: TusPlugin

Table 75. Config Parameters
Name	Description
class	Plugin class name. Always `TusPlugin` or `de.gwdg.cdstar.rest.ext.tus.TusPlugin`.
expire	Maximum number of milliseconds a TUS upload is kept on disk after the last byte was written. If the value has a suffix (S,M,H or D) it is interpreted as seconds, minutes, hours or days instead of milliseconds. (default: `24H`)
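
The suffix rule for expire can be illustrated with a small Python helper. The function name is made up, and only the upper-case suffixes listed in the table are handled.

```python
_UNIT_MS = {"S": 1000, "M": 60 * 1000, "H": 60 * 60 * 1000, "D": 24 * 60 * 60 * 1000}


def parse_expire(value):
    """Convert an expire value such as '24H' or '5000' to milliseconds."""
    text = str(value).strip()
    unit = text[-1:]
    if unit in _UNIT_MS:
        return int(text[:-1]) * _UNIT_MS[unit]
    return int(text)  # no suffix: plain milliseconds
```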

Usage

The tus.io compatible REST endpoint is reachable under /tus at the root-level of the service (not /v3/tus but just /tus). After creating a TUS handle and uploading data following the TUS protocol, the temporary file can be referenced as tus:<tusId>, where <tusId> is the last part of the TUS handle. For example, if your TUS handle was /tus/24e533e, then the internal reference to this resource would be tus:24e533e.

Currently, only Create Archive and Update Archive support server-side imports via the fetch:<target> functionality. For example, to import a completed TUS upload into an archive, you would send fetch:/path/to/target.file=tus:24e533e as a POST form parameter. Note that the digests must still be computed, so a fetch may take just as long as uploading the file directly. TUS usually does not improve overall throughput, but may improve reliability of large-file uploads over unreliable network connections. Use it wisely.
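
The mapping from a TUS handle to the form parameter can be expressed in two small Python helpers. Both function names are made up for this sketch, and the actual HTTP request is left out.

```python
def tus_reference(handle):
    """Turn a TUS handle such as '/tus/24e533e' into a 'tus:<tusId>' reference."""
    return "tus:" + handle.rstrip("/").rsplit("/", 1)[-1]


def fetch_field(target_path, reference):
    """Return the (name, value) pair of the POST form parameter for a server-side import."""
    return ("fetch:" + target_path, reference)


field = fetch_field("/path/to/target.file", tus_reference("/tus/24e533e"))
```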

Incomplete TUS handles that do not see any new data will expire after 2 hours. Once complete, the TUS handle can be referenced for another 24 hours before it expires. Handles that are not needed anymore can (and should) be deleted faster with a single DELETE request to the TUS handle.

Advanced topics

NioPool Storage

NioPool is the default StoragePool implementation for CDStar and provides transactional and robust persistence to a local or network-attached file system. It is usually bundled with the default distribution of CDStar and does not require any additional plugins.

Note

StoragePool is a low level interface and abstraction layer for the underlying physical storage. High level concepts (namely vaults, archives and files) map roughly to low level entities (pools, objects and resources) but should not be confused or mixed. The exact relations between high and low level concepts are described in a separate document (TODO).

This document describes the on-disk folder structure and index file format used by NioPool. The storage format is designed to be IO efficient and human-accessible at the same time: index files are human-readable and self-describing JSON files. In theory, all data and meta-data can be analyzed and recovered without prior knowledge or specialized software.

Folder structure

Storage objects are distributed into a directory tree with configurable depth, based on the first few character-pairs of the object ID. This reduces the maximum number of inodes per directory and helps keeping file system metadata cache-friendly, even for large pools with millions of objects. For a depth of d, the lookup path would be computed as follows: {poolName}/{id[0:2]}/…/{id[(d-1)*2:d*2]}/{id}/. For example, given a default depth value of d=2, an object with ID 0123456789abcdef would be stored in myPool/01/23/0123456789abcdef/.
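
The lookup path computation can be reproduced with a short Python function, which may be handy when locating an object directory by hand. The function name is made up, and the path is built relative to the pool directory.

```python
def object_path(pool_name, object_id, depth=2):
    """Compute the directory of a pool object for a given tree depth."""
    if len(object_id) < depth * 2:
        raise ValueError("object ID too short for the requested depth")
    # One directory level per character pair: id[0:2], id[2:4], ...
    pairs = [object_id[i * 2:(i + 1) * 2] for i in range(depth)]
    return "/".join([pool_name, *pairs, object_id]) + "/"
```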

Tip	`NioPool` follows symlinks, even across device borders. This makes it easy to split large repositories and distribute load across multiple file systems or storage devices.

All files related to a specific pool object are stored in the same folder. Each object folder contains at least a HEAD symlink pointing to the latest {revision}.json index file. This file describes the state and content of the object in human readable form (json). There will be an extra index file for each revision of the object. Binary resources are stored in separate {sha256}.bin files. If object packing is enabled, some index or resource files may be bundled into packs and must be unpacked before they can be used (see below).

Example: Pool object directory (empty)

cdstar-home/data/vaultName/dc/64/dc64abb808e0c227/
  ./HEAD -> ./e371ce6a077f88755c1155b507b757d5.json
  ./e371ce6a077f88755c1155b507b757d5.json

Example: Pool object directory (two revisions, one resource)

cdstar-home/data/vaultName/dc/64/dc64abb808e0c227/
  ./HEAD -> ./008f113ff1579f8aed9399bf7960118f.json
  ./008f113ff1579f8aed9399bf7960118f.json
  ./e371ce6a077f88755c1155b507b757d5.json
  ./30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58.bin

Example: Pool object directory (large object with many revisions and resources, packed)

cdstar-home/data/vaultName/dc/64/dc64abb808e0c227/
  ./HEAD -> ./2266dc2ff8067e68104607e750abb9d3.json
  ./2266dc2ff8067e68104607e750abb9d3.json
  ./15041131681337.pack.zip

Object index file format

Each time an object is modified, a new {revision}.json index file is created and the HEAD symlink is updated. These files contain a UTF-8 encoded JSON document describing the current state (contained resources, attributes and meta-data) of the storage object in a human-readable form.

Warning

Fields with null or empty values may be skipped to save space, and additional fields may be added in future versions of this implementation. Keep that in mind if you plan to parse these files with custom tools.

Example for a {revision}.json index file with one resource

{
  "v" : 3,
  "id" : "dc64abb808e0c227",
  "rev" : "008f113ff1579f8aed9399bf7960118f",
  "parent" : "e371ce6a077f88755c1155b507b757d5",
  "type" : "application/x-cdstar;v=3",
  "ctime" : 1507979048000,
  "mtime" : 1507979048885,
  "x-cdstar:owner" : "test@static",
  "x-cdstar:mtime" : "2017-08-29T11:28:06.0722Z",
  "x-cdstar:acl:$owner" : "OWNER",
  "x-cdstar:rev" : "1",
  "resources" : [ {
    "id" : "8c5a29d5707b6927e8484e2cd5170749",
    "name" : "data/target.txt",
    "type" : "application/octet-stream",
    "size" : 1048576,
    "ctime" : 1507979048885,
    "mtime" : 1507979048885,
    "sha1" : "O3H0P/MPSxW1zYXdnpXrx+hOtaM=",
    "sha256" : "MOFJVevxNSJm3C/4Bn5oEEYH51CrudOzZYK4r5Cfy1g=",
    "md5" : "ttgbNgpWctgMJ0MPORU+LA=="
  } ]
}

Table 76. Object index properties
Name	Type	Description
v	int	Format version. Defaults to `3`.
id	String	Pool object ID. Should be the same as the containing directory name.
rev	String	Revision string. Should match the file name.
parent	String	Revision string of the parent revision. This field can be used to traverse the revision history of an object. May be `null` or missing for the first revision of an object which has no parent.
type	String	Application defined mime-type. May be `null` or missing.
ctime	long	Date and time of object creation (Unix epoch, millisecond resolution).
mtime	long	Date and time of last modification (Unix epoch, millisecond resolution).
x-{key}	String	Custom application defined key/value pairs.
resources	Array	Unordered list of resource records (see below). May be empty, `null` or missing.

Table 77. Resource record properties
Name	Type	Description
id	String	Unique resource identifier. This string is unique per object, not globally.
name	String	Application defined resource name. This should be unique per object, but uniqueness is not enforced. May be `null` or missing.
type	String	Application defined content-type. May be `null` or missing.
enc	String	Application defined content-encoding. May be `null` or missing.
size	long	Size of resource binary data in bytes.
ctime	long	Date and time of resource creation (Unix epoch, millisecond resolution).
mtime	long	Date and time of last modification (Unix epoch, millisecond resolution).
src	String	External location identifier for the resource binary content. May be `null` or missing, in which case the resource is either empty or stored in the default location (see below). If set, the data file may be removed by garbage-collection and additional steps are required to recover the content of the resource.
md5	Base64	MD5 hash of the resource content as a base64 string. May be `null` or missing.
sha1	Base64	SHA-1 hash of the resource content as a base64 string. May be `null` or missing.
sha256	Base64	SHA-256 hash of the resource content as a base64 string.
x-{key}	String	Custom application defined key/value pairs.

Dates are stored as unix epoch timestamps with millisecond resolution (signed long integer). While not directly human readable, these are easily recognized and a very common exchange format for points in time. Most programming languages provide built-in tools to translate an epoch timestamp into a human readable form.
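
For example, a millisecond timestamp from an index file can be converted to an ISO 8601 string in Python as follows. The helper name is made up; integer arithmetic via timedelta avoids floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_iso(ms):
    """Convert milliseconds since the Unix epoch to an ISO 8601 string in UTC."""
    return (_EPOCH + timedelta(milliseconds=ms)).isoformat(timespec="milliseconds")
```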

Resource default location

By default, the uncompressed binary content of non-empty resources is stored in the object directory as {sha256}.bin files named after the lower-case hex encoded sha256 digest of their content. These files always end in .bin regardless of their actual content-type. If this file is missing, the resource may either have been packed (see "Object Packing") or externalized (see "External resources") and additional steps are required to recover the binary content of the resource.
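
Note that the index file stores digests as base64 strings, while the file name uses the hex encoding. The following Python sketch (with a made-up function name) converts one into the other, using the sha256 value from the example index file above.

```python
import base64


def resource_file_name(sha256_b64):
    """Derive the default {sha256}.bin file name from the base64 sha256 field."""
    digest = base64.b64decode(sha256_b64)
    if len(digest) != 32:
        raise ValueError("not a SHA-256 digest")
    return digest.hex() + ".bin"


name = resource_file_name("MOFJVevxNSJm3C/4Bn5oEEYH51CrudOzZYK4r5Cfy1g=")
```

The result matches the .bin file shown in the directory listing with two revisions and one resource.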

External resources

If the src field of a resource record is set, the corresponding {sha256}.bin resource file is subject to garbage-collection and may be removed at any time. In this case, the value of the src field should contain enough information to recover the resource file manually or with the help of an application-specific process. The src field MUST start with a prefix defined in this document, or with x- followed by an application defined location hint (e.g. a URI).

Object Packing (not implemented)

Resource files in an object directory may be bundled into one or more *.pack.zip files to save inodes and disk space. Compression can also help reduce IO pressure on the storage device in exchange for higher CPU usage during read access. This trade-off may be beneficial, in particular for rarely accessed objects or resources with highly compressible content.

Resources stored in a pack have a src value of pack:<pack-file-name> and follow default naming rules ({sha256}.bin) within the pack file.

Note

The zip format allows fast lookup and random access to individual files. Other common packaging formats (e.g. tar) require linear scans in order to find a specific file. The drawbacks of the zip format (e.g. low resolution timestamps or file name limitations) are negligible as this information is also present in the object index file.

Temporary data

NioPool may create temporary .tmp files or directories within an object directory. These may contain data required for recovery, so do not delete these files after an unclean shutdown or while the service is running. Temporary files that remain after an ordinary shutdown can be removed.

Locking, concurrency control and transactional storage

Any actor that creates or removes files other than *.tmp in an object directory or intends to change the target of the HEAD symlink MUST acquire a HEAD_NEXT file lock before doing so. The HEAD_NEXT file SHOULD be a symlink pointing to a (possibly not yet created) index file. To change the HEAD link, make sure that the HEAD_NEXT target exists and is synchronized to disk, then move-and-replace HEAD_NEXT to HEAD. Any error during this sequence should result in a dangling HEAD_NEXT symlink protecting the object from further manipulation until manual or automatic recovery succeeds. In a disaster situation, either HEAD or HEAD_NEXT (or both) exists and the object can be rolled back or committed manually.
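
The commit sequence can be sketched in Python for a POSIX file system. This is an illustration of the protocol as described here, not NioPool's code. It assumes that atomically creating the HEAD_NEXT symlink is what acquires the lock, and the function name is made up.

```python
import os


def commit_revision(obj_dir, revision, index_bytes):
    """Write a new index file and switch HEAD to it via HEAD_NEXT."""
    head = os.path.join(obj_dir, "HEAD")
    head_next = os.path.join(obj_dir, "HEAD_NEXT")
    index_name = revision + ".json"

    # Step 1: take the lock. symlink() fails with FileExistsError if HEAD_NEXT
    # already exists, for example after a crash or during a concurrent commit.
    os.symlink("./" + index_name, head_next)

    # Step 2: create the index file and force it to disk.
    with open(os.path.join(obj_dir, index_name), "wb") as fp:
        fp.write(index_bytes)
        fp.flush()
        os.fsync(fp.fileno())

    # Step 3: move-and-replace HEAD_NEXT to HEAD (atomic on POSIX file systems).
    # Errors before this point deliberately leave HEAD_NEXT in place.
    os.replace(head_next, head)
```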

Tip	Some file systems do not implement an atomic move-and-replace operation. In this case, `HEAD` must be removed before `HEAD_NEXT` can be renamed. Clients may try to access `HEAD` in the short time span when it does not exist. Robust implementations should simply retry a couple of times.

Configuration

StoragePool configuration is stored by CDSTAR in a vault.yaml within the pool base directory and can be bootstrapped during vault creation with predefined parameters. NioPool supports the following configuration parameters:

Configuration Parameters

Name       Type    Description
path       String  Path to the vault base directory (required, default: ${path.data}/${vaultName}/)
cacheSize  int     Number of manifests to keep in an in-memory cache for faster load times.
autotrim   bool    If enabled, schedule a garbage collection run after each successful commit for each modified object.
digests    str     Comma separated list of digests to calculate. SHA-256 is always calculated. Defaults to: MD5,SHA-1,SHA-256.
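To illustrate how a digests value might be interpreted, here is a small Python sketch. It is not taken from CDSTAR: the function names are hypothetical, and it assumes that the configured names map to Python hashlib algorithms by dropping the hyphen (SHA-1 becomes sha1).

```python
import hashlib


def parse_digests(setting="MD5,SHA-1,SHA-256"):
    """Split the comma separated setting; SHA-256 is always included."""
    names = [part.strip().upper() for part in setting.split(",") if part.strip()]
    if "SHA-256" not in names:
        names.append("SHA-256")
    return names


def compute_digests(chunks, names):
    """Compute all configured digests in a single pass over the data."""
    hashers = {name: hashlib.new(name.replace("-", "").lower()) for name in names}
    for chunk in chunks:
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Feeding the data once to all hashers avoids reading large files several times, which is the main cost to keep in mind when enabling additional digests.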

Storage Profiles

CDSTAR supports and integrates third-party long-term storage systems (LTS, e.g. tape libraries) via storage profiles. From the user's perspective, a storage profile defines where and how data should be stored. By assigning a storage profile to a CDSTAR archive, the user can control data migration to and from LTS in a coherent, safe and predictable way. The actual data migration happens in the background and is fully managed by CDSTAR.

Profile mode: HOT vs. COLD

Storage profiles can be either "hot" or "cold", which changes the way CDSTAR handles its local data.

Hot profiles cause CDSTAR to copy the archive content to external storage, but keep all data available in CDSTAR as well. While the profile is in effect, only administrative metadata (owner, ACLs, storage profile, …) can be modified. The actual content (files and metadata) is write-protected to prevent stale LTS copies.

Cold profiles, on the other hand, allow CDSTAR to reclaim disk space by deleting archive files from disk after a copy has been stored externally. Metadata is still kept available, but file content can no longer be accessed through CDSTAR. The profile needs to be changed to default or a hot profile to make file content available again.

Hot profiles are meant to increase long-term availability or data integrity guarantees by storing important data in a second location. Cold profiles are mostly used to store large amounts of rarely accessed data in a more cost-effective way (e.g. on tape), while keeping metadata searchable and discoverable.

Profile configuration

Profiles can be configured globally, and enabled or disabled per vault. A profile currently consists of a name, a mode (hot or cold) and an associated LTS target, which is configured separately as a plugin. This allows multiple profiles to reference the same LTS target, but with different configuration.

profile:
  bagit-hot:
    lts.name: bagit
  bagit-cold:
    lts.name: bagit
    lts.mode: cold

LTS target configuration

Data migration from or to third-party LTS systems is highly dependent on the system in use. Multiple implementations are available and can be loaded via the CDSTAR plugin infrastructure. CDSTAR bundles a general purpose implementation that exports to BagIt directories and allows an external process to perform the actual LTS migration asynchronously.

plugin:
  bagit:
    class: BagitTarget
    path: /path/to/store/bagit/

LTS handlers are referenced by name, so special care must be taken when removing or renaming LTS handlers. Do not remove or rename an LTS target as long as there are archives still referencing it.
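Because profiles reference LTS targets by name only, a simple consistency check before deploying a configuration change can catch renamed or removed targets. The following Python sketch is not part of CDSTAR. It assumes the configuration has already been parsed into nested dictionaries that mirror the profile and plugin fragments shown above, and the function name is hypothetical.

```python
def find_dangling_profiles(config):
    """Return names of profiles whose lts.name matches no configured plugin."""
    plugins = config.get("plugin", {})
    dangling = []
    for profile_name, settings in config.get("profile", {}).items():
        if settings.get("lts.name") not in plugins:
            dangling.append(profile_name)
    return dangling


# Example configuration, mirroring the fragments above as plain dicts.
example = {
    "profile": {
        "bagit-hot": {"lts.name": "bagit"},
        "bagit-cold": {"lts.name": "bagit", "lts.mode": "cold"},
    },
    "plugin": {
        "bagit": {"class": "BagitTarget", "path": "/path/to/store/bagit/"},
    },
}
```

Note that this only checks profiles against plugins. It cannot tell whether archives still reference a target that is about to be removed, so it complements the rule above and does not replace it.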

How to not lose data

Moving data out of the CDSTAR system, especially with cold profiles, bears some risks that should be well understood before enabling the LTS feature. Please read this chapter carefully.

After a successful migration to an LTS target, CDSTAR stores the LTS name and a unique location identifier (generated by the LTS) into non-public archive properties. These are used to recover missing files in case of a future profile change. Cold profiles allow CDSTAR to remove local copies of archived files after successfully copying these files to LTS. If the LTS goes away, for whatever reason, then CDSTAR has no way to recover missing files and the archive is stuck in cold state. File content will be unavailable and data migration after profile changes will fail.

Do not remove or rename an LTS target as long as there are archives still referencing it.
When updating LTS plugins or changing configuration, ensure that existing location identifiers remain valid.
Monitor CDSTAR logs for failed migrations.

BagIt LTS Target

This LTS target exports archives into BagIt folders, and is designed to work with external worker processes for the actual migration from/to LTS storage (e.g. tape).

The exporter will create a BagIt package in a temporary folder, then rename it to [name].bagit with a unique name. A worker process may check for these folders and copy or move data to LTS.

The importer will create a file named [name].want and start the import as soon as the [name].bagit folder can be found. A worker process should check for these [name].want files and recover the missing [name].bagit folder from LTS. Once complete, the importer will delete the [name].want file and the recovered [name].bagit folder can be cleaned up by the worker.

If the external copy is no longer needed, a [name].delete file is created. A worker process should watch for these files, remove the external copy (if any), remove the [name].bagit directory (if present), and then also remove the [name].delete file.

External workers are allowed to create additional files for their own state handling, as long as they do not interfere with the names defined here.
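The file-based protocol above could be served by a worker along the following lines. This is a minimal Python sketch under several assumptions, not a reference implementation: the "LTS" is simulated by a plain directory, the worker moves (not copies) exported bags, it runs as a single process, and the function name as well as the .partial and .restore.tmp names are hypothetical worker-private names.

```python
import shutil
from pathlib import Path


def worker_pass(exchange_dir: Path, lts_dir: Path) -> None:
    """Run one polling pass over the BagIt exchange directory."""
    # 1. Deletion requests: drop the external copy and any local bag,
    #    then acknowledge by removing the [name].delete file.
    for marker in sorted(exchange_dir.glob("*.delete")):
        name = marker.name[: -len(".delete")]
        shutil.rmtree(lts_dir / name, ignore_errors=True)
        shutil.rmtree(exchange_dir / f"{name}.bagit", ignore_errors=True)
        marker.unlink()

    # 2. Bags on disk: store new exports, then clean up bags that are
    #    safely stored and not currently wanted by the importer.
    for bag in sorted(exchange_dir.glob("*.bagit")):
        name = bag.name[: -len(".bagit")]
        stored = lts_dir / name
        if not stored.exists():
            partial = lts_dir / f"{name}.partial"
            shutil.rmtree(partial, ignore_errors=True)
            shutil.copytree(bag, partial)
            partial.rename(stored)  # only complete copies become visible
        if not (exchange_dir / f"{name}.want").exists():
            shutil.rmtree(bag)

    # 3. Import requests: recover the bag under a private name, then
    #    rename it so the importer never sees a half-restored folder.
    for marker in sorted(exchange_dir.glob("*.want")):
        name = marker.name[: -len(".want")]
        bag = exchange_dir / f"{name}.bagit"
        stored = lts_dir / name
        if not bag.exists() and stored.exists():
            staging = exchange_dir / f"{name}.restore.tmp"
            shutil.rmtree(staging, ignore_errors=True)
            shutil.copytree(stored, staging)
            staging.rename(bag)
```

Calling worker_pass periodically, for example from a scheduler, is enough for this sketch. A real worker would replace the directory copies with calls to the actual LTS system and add error handling and logging.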

Archive Snapshots

Archive snapshots are an efficient way to preserve the current payload of an archive without actually creating a full copy. They can be used to implement versioning, tag important milestones or create immutable and citeable releases for publishing.

From a user's perspective, snapshots are virtual read-only archives that represent the payload of their source archive at a specific point in time. The payload of a snapshot will not change if the source archive is modified. Other aspects however, most notably owner and access control information, are transparently inherited from the source archive and will change if the source archive changes. One exception is the storage profile, which can be changed on a per-snapshot basis independent of the source archive. See Storage Profiles for details.

Once created, most read-only operations that work on an archive are also available for snapshots. In the REST API, snapshots are referenced by the source archive name, followed by an @ character and the snapshot name. For example, GET /v3/somevault/ab587f42c257@v1/data.csv would fetch a file from the v1 snapshot instead of the current archive state. Details are explained in the REST API documentation.
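The addressing scheme can be captured in a small helper. This Python sketch only builds the URL following the pattern of the example above. The function name and the base URL used in the usage note are hypothetical, and it assumes that the @ separator is sent unescaped.

```python
from urllib.parse import quote


def snapshot_file_url(base_url, vault, archive_id, snapshot, file_path):
    """Build the URL of a file inside a snapshot: <archive>@<snapshot>/<path>."""
    reference = f"{quote(archive_id, safe='')}@{quote(snapshot, safe='')}"
    return "/".join([
        base_url.rstrip("/"),
        "v3",
        quote(vault, safe=""),
        reference,
        quote(file_path.lstrip("/")),  # keeps "/" between folder names
    ])
```

For example, snapshot_file_url("https://cdstar.example.org", "somevault", "ab587f42c257", "v1", "data.csv") yields a URL ending in /v3/somevault/ab587f42c257@v1/data.csv.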

Sparse Copies and Deduplication

At the storage level, snapshots live in separate storage objects, but are created in a way that allows them to share common data files with their source archive or other snapshots, if supported by the storage back-end. This ensures that snapshots take up only a minimal amount of additional storage space and are usually far more efficient than copying an entire archive. NioPool implements this at the file-system level by hard-linking files with the same content, and only creating a copy if content changes (copy-on-write semantics).
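The hard-link technique can be demonstrated with plain directories. The Python sketch below illustrates the general idea only and does not reflect NioPool's actual on-disk layout: it assumes a flat directory of files on a file system that supports hard links, and the function names are hypothetical.

```python
import os
import tempfile


def create_snapshot(archive_dir, snapshot_dir):
    """Create a sparse copy: every file is hard-linked, no data is copied."""
    os.makedirs(snapshot_dir)
    for name in os.listdir(archive_dir):
        os.link(os.path.join(archive_dir, name), os.path.join(snapshot_dir, name))


def replace_file(archive_dir, name, new_content):
    """Change a file copy-on-write style, leaving linked snapshots untouched."""
    # Write the new content to a fresh file instead of modifying the
    # shared one in place, then swap it in. The snapshot keeps its link
    # to the old data.
    fd, tmp_path = tempfile.mkstemp(dir=archive_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as fh:
        fh.write(new_content)
    os.replace(tmp_path, os.path.join(archive_dir, name))
```

After create_snapshot, both directory entries refer to the same data on disk. Only a later replace_file call allocates new space, and only for the file that actually changed.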