CDStar is a package-oriented data-management framework for scientific and other data-driven applications. It enables the development of powerful tools and workflows against a simple and stable REST interface that hides away the details and complexities of the actual storage back-end in use.
The CDStar storage API is organized in vaults, archives (or packages) and files: A vault can store any number of archives and distributes them transparently across different storage infrastructures. Each archive is identified by a unique ID and contains a list of named files. Within an archive, files can be organized in folder structures and annotated with search-able attributes. Archives themselves can also be annotated. The search integration indexes attributes and file content to allow near-realtime search across an entire vault.
Getting started
This guide is a step-by-step tutorial that shows how to install, configure, and use cdstar in a simple example setup. You will download and run cdstar locally, configure a single vault and store some files. All you need is a computer with Java (11+) installed. This tutorial assumes you are running some flavor of Linux.
Installation
CDSTAR is written in Java and the "cdstar.jar" binary distribution runs on any platform with a compatible Java Runtime Environment (OpenJDK or Oracle Java 11 or newer). There are several ways to obtain a recent version of cdstar, described here.
Download binary release
wget https://cdstar.gwdg.de/release/dev/cdstar.jar
Older and stable releases are also available here: https://cdstar.gwdg.de/release/
Build from source
Building CDSTAR requires a Java JDK (Java 11 or newer) and Maven. The CDSTAR source distribution ships with a Maven wrapper script (./mvnw or ./mvnw.bat) that fetches the correct version of Maven and should be preferred over whatever Maven version is offered as a system package by your distribution.
sudo apt install git build-essential # for 'git' and 'make'
sudo apt install default-jdk-headless
git clone https://gitlab.gwdg.de/cdstar/cdstar.git
cd cdstar
make cdstar.jar
# or manually:
./mvnw -pl cdstar-cli -am -DskipTests=true -Pshaded clean package
cp cdstar-cli/target/cdstar-cli-*-shaded.jar cdstar.jar
Tip: The -DskipTests=true parameter will save you some time. Releases are always tested before they are published, so there is no point in running all tests again.
Configuration
CDStar can read configuration from yaml
and json
files, whichever you prefer. Here is a small example to get you started:
cdstar-demo.yaml
---
path.home: /tmp/cdstar-demo
vault.demo:
  create: True
  public: True
  pool.autotrim: True
realm.static:
  class: StaticRealm
  # This role can create, read and list archives in the 'demo' vault.
  role.demoRole: vault:demo:create, vault:demo:read, vault:demo:list
  # This group inherits all permissions from 'demoRole'.
  group.demoGroup: "demoRole"
  # This user has the password 'test' and belongs to the 'demoGroup'.
  # Password hashes can be computed using cdstar.jar:
  # $ java -cp cdstar.jar de.gwdg.cdstar.auth.realm.StaticRealm
  user.test:
    password: "cGxhaW4=:FmtSc7NSX8fsjLTmpLpoqRLP4vqWFg/r5uy3EU6JsEs="
    groups: demoGroup
    # permissions: ...
    # roles: ...
Note: A secure password hash can be generated with the java -cp cdstar.jar de.gwdg.cdstar.auth.realm.StaticRealm tool.
The only required parameter is path.home. Everything else is optional. See Configuration for details.
First run
$ java -jar cdstar.jar -c cdstar-demo.yaml run -p 8080
$ curl http://localhost:8080/v3/
Command line parameters
Usage: cdstar [-h] [--version] [--log-config=<file>] [-c=<file>]...
[-C=<key=value>]... [--debug=<logger>]... [COMMAND]
Run or manage CDSTAR instances
-c, --config=<file> Load configuration from this file(s). Prefix the
filename with '?' to mark it optional
-C=<key=value> Override individual configuration parameter. Use
'KEY=VALUE' to override or 'KEY+VALUE' to append
--debug=<logger> Increase loggin for specific packages. The value
'ROOT' may be used as an alias for the root logger
-h, --help Print help and exit
--log-config=<file> Provide a custom log4j2.properties file
--version Show version string and exit
Commands:
run Start server instance
config Manage configuration
vault Manage vaults
Usage: cdstar run [-h] [-b=<bind>] [-p=<port>]
Start server instance
-b, --bind=<bind> Override 'http.host' setting
-h, --help Print help and exit
-p, --port=<port> Override 'http.port' setting
Run as a service
CDStar can be compiled into a cdstar.war file and run within a servlet container, but this is not recommended and not officially supported. CDStar also does not offer any built-in daemonizing capabilities. If you want to run cdstar as a long-running background process, use proper system tools like systemd or supervisord, or traditional init.d scripts and start-stop-daemon as a last resort.
# /etc/systemd/system/cdstar.service
[Unit]
Description=CDStar Storage Service
After=syslog.target network.target remote-fs.target
[Service]
User=cdstar
ExecStart=/usr/bin/java -jar /path/to/cdstar.jar -c /etc/cdstar/cdstar.yaml run -p 8080
[Install]
WantedBy=multi-user.target
sudo systemctl enable cdstar.service
Tutorial
For this tutorial we use the excellent requests Python library and assume that you already have an instance up and running on http://localhost:8080/ with an account that is allowed to create archives in a vault named demo.
Creating our first Archive
To begin, we import some helpful functions from the 'requests' module, define our API base URL and create our first archive.
>>> from requests import get, post, put, delete
>>> baseurl = 'http://test:test@localhost:8080/v3'
>>> r = post(baseurl + '/demo/')
>>> r.status_code
201
>>> r.headers['Location']
"/v3/demo/ab587f42c2570a884"
>>> r.json()
{
'id': 'ab587f42c2570a884',
'vault': 'demo',
'revision': '0'
}
CDStar returns JSON most of the time, so we can use requests.Response.json() to parse the response directly into a Python dictionary. In this case, we are only interested in the id field of the response. This string identifies our archive within a vault and can be used to build the archive URL. Alternatively, we could just follow the Location header.
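Both ways of obtaining the archive URL can be sketched with the standard library alone. The helper names archive_url and resolve_location are made up for this illustration and are not part of CDStar or requests.

```python
from urllib.parse import urljoin


def archive_url(base_url, vault, archive_id):
    """Build an archive URL from the API base URL, vault name and archive ID."""
    return f"{base_url.rstrip('/')}/{vault}/{archive_id}"


def resolve_location(request_url, location):
    """Resolve a (possibly relative) Location header against the request URL."""
    return urljoin(request_url, location)


base = "http://localhost:8080/v3"
# Option 1: build the URL from the 'id' field of the JSON response.
print(archive_url(base, "demo", "ab587f42c2570a884"))
# Option 2: follow the Location header returned by the POST request.
print(resolve_location(base + "/demo/", "/v3/demo/ab587f42c2570a884"))
```

Both calls yield the same URL for the example archive used in this tutorial.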
The archive is still empty. We can list its content with a simple GET request.
>>> get(baseurl + '/demo/ab587f42c2570a884').json()
{
"id": "ab587f42c2570a884",
"vault": "demo",
"revision": "0",
"created": "2016-12-20T13:59:37.160+0000",
"modified": "2016-12-20T13:59:37.231+0000",
"file_count": 0
}
As you can see, there are no files in this archive. Let’s change that and upload some files.
Upload Files
There are multiple ways to populate an archive. The simplest way is to send multipart/form-data POST requests to the archive URL.
Each file upload with a name that starts with a slash (e.g. /example.txt) creates a new file in our archive.
>>> files = {'/report.xls': open('report.xls', 'rb')}
>>> post(baseurl + '/demo/ab587f42c2570a884', files=files).json()
{
'id': 'ab587f42c2570a884',
'vault': 'demo',
'revision': '1',
'report:': [ {
'change': 'file',
'file': {
'name': 'report.xls',
'type': 'application/vnd.ms-excel',
'size': 65992,
'created': '2016-12-20T13:59:37.217+0000',
'modified': '2016-12-20T13:59:37.218+0000',
'digests': {
'md5': '1a79a4d60de6718e8e5b326e338ae533',
'sha1': 'c3499c2729730a7f807efb8676a92dcb6f8a3f8f',
'sha256': '50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c'
}
}
} ]
}
The response is JSON again and contains a list of all files that changed during the last request. We can use this info to double-check if everything was uploaded correctly.
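One way to perform that double-check is to compute the digests of the local file and compare them with the digests entry of the reported file. The following is a minimal sketch using only hashlib. The function names are hypothetical, and it assumes that the report contains lowercase hex digests keyed by md5, sha1 and sha256, as in the sample response above.

```python
import hashlib


def local_digests(path, chunk_size=8192):
    """Compute md5, sha1 and sha256 hex digests of a local file in one pass."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def upload_matches(path, reported_file):
    """Return True if every digest in the report equals the local digest."""
    local = local_digests(path)
    return all(
        local.get(name) == value
        for name, value in reported_file["digests"].items()
    )
```

With the upload response stored in a variable such as resp, a check could look like upload_matches('report.xls', entry['file']) for each entry of the change report.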
Annotate Archives and Files
Now we want to attach some meta attributes to our archive and the file we just uploaded.
We send another POST request to the same URL, but this time we use form fields starting with meta: to define new meta attributes on the archive or a file within the archive.
>>> data = {
...     'meta:dc:title': 'My Report Archive',          # (1)
...     'meta:dc:title:/report.xls': 'My Report',      # (2)
...     'meta:dc:contributor': ['Alice', 'Bob'],       # (3)
... }
>>> post(baseurl + '/demo/ab587f42c2570a884', data=data).json()
{
'id': 'ab587f42c2570a884',
'vault': 'demo',
'revision': '2',
'report:': [ {
'change': 'meta',
'field': 'dc:title',
'values': ['My Report Archive']
}, {
'change': 'meta',
'field': 'dc:contributor',
'values': ['Alice', 'Bob']
}, {
'change': 'meta',
'field': 'dc:title',
'file': 'report.xls',
'values': ['My Report']
} ]
}
(1) Meta form fields start with meta: followed by the field name.
(2) If a meta attribute should be set on a specific file instead of the archive, you can specify the file name after the field name, separated by a /.
(3) Some meta attributes accept more than a single value.
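The naming scheme of these form fields can be wrapped in a small helper. This is only an illustrative sketch derived from the field names used in the request above. The function meta_field is hypothetical and not part of any CDStar client library.

```python
def meta_field(name, file=None):
    """Build a form-field name for a meta attribute.

    Without 'file' the attribute targets the archive itself, otherwise the
    file path (with a leading slash) is appended after the field name.
    """
    key = "meta:" + name
    if file is not None:
        if not file.startswith("/"):
            file = "/" + file
        key += ":" + file
    return key


# Same payload as in the tutorial request, built with the helper.
data = {
    meta_field("dc:title"): "My Report Archive",
    meta_field("dc:title", "/report.xls"): "My Report",
    meta_field("dc:contributor"): ["Alice", "Bob"],
}
print(sorted(data))
```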
Just like the file upload example from above, we get back a report of everything that changed.
Tip: You can upload multiple files and set multiple meta-attributes with a single request. It is even possible to create a fully populated archive in a single step by submitting the POST request to the createArchive endpoint.
List Files and Meta-Attributes
Let us have a look at our archive again and also request file and meta-attribute listings this time.
>>> get(baseurl + '/demo/ab587f42c2570a884?with=files,meta').json()
{
"id": "ab587f42c2570a884",
"vault": "demo",
"revision": "2",
"created": "2016-12-20T13:59:37.160+0000",
"modified": "2016-12-20T13:59:37.231+0000",
"file_count": 1,
'meta': {
'dc.title': ['My Report Archive']
},
'files': [ {
'name': '/report.xls',
'type': 'application/vnd.ms-excel',
'size': 65992,
'created': '2016-12-20T13:59:37.217+0000',
'modified': '2016-12-20T13:59:40.114+0000',
'digests': {
'md5': '1a79a4d60de6718e8e5b326e338ae533',
'sha1': 'c3499c2729730a7f807efb8676a92dcb6f8a3f8f',
'sha256': '50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c'
},
'meta': {
'dc.title': ['My Report']
}
} ]
}
The files and meta fields are hidden by default and only included if you add with=files,meta as a query parameter. For large archives, you can even filter and paginate the returned information. See getArchiveInfo for details.
Direct File API (CRUD)
Each file within an archive has its own URL, for example /demo/ab587f42c2570a884/some/file.txt. You can create, read, update or delete individual files by sending the respective PUT, GET, POST or DELETE requests to these URLs, which is sometimes a lot easier than working with the form-based API described earlier, especially from within scripts or programmable REST clients.
First, let’s upload a new file to the archive. Just PUT the raw file content to the file URL.
>>> with open('example.txt', 'rb') as fp:
...     put(baseurl + '/demo/ab587f42c2570a884/some/example.txt', data=fp).json()
{
'name': 'some/example.txt',
'type': 'text/plain', (1)
'id': '4e2cdf90ae00bff1e2bad79ffebdb63b', (2)
'size': 12,
'created': '2017-07-25T11:08:02.558+0000',
'modified': '2017-07-25T11:08:02.602+0000',
'digests': {
'sha256': '30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58',
'sha1': '3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3',
'md5': 'b6d81b360a5672d80c27430f39153e2c'},
}
(1) The type is auto-detected from the file name if you do not specify a Content-Type header.
(2) The id of a file does not change, even if you rename or modify it.
If you need more control over whether a file should be overwritten or not, you can add one of the following conditional headers to your request:
Header | Description |
---|---|
If-None-Match: * | Create new file. If the file already exists, it is not modified. |
If-Match: * | Update existing file. If the file does not exist, it is not created. |
You should check for 412 Precondition Failed errors in your application if you use these headers.
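As a rough sketch, a conditional upload request could be prepared like this. The example assumes the standard HTTP conditional headers (If-None-Match: * for "create only" and If-Match: * for "update only"). The function conditional_put and its mode argument are hypothetical, and the request is only constructed here, not sent.

```python
from urllib.request import Request


def conditional_put(url, body, mode=None):
    """Prepare a PUT request that optionally refuses to overwrite or create.

    mode="create": only succeed if the file does not exist yet.
    mode="update": only succeed if the file already exists.
    """
    headers = {}
    if mode == "create":
        headers["If-None-Match"] = "*"  # assumption: standard HTTP semantics
    elif mode == "update":
        headers["If-Match"] = "*"  # assumption: standard HTTP semantics
    elif mode is not None:
        raise ValueError("mode must be 'create', 'update' or None")
    return Request(url, data=body, headers=headers, method="PUT")


req = conditional_put(
    "http://localhost:8080/v3/demo/ab587f42c2570a884/some/example.txt",
    b"Hello World!",
    mode="create",
)
print(req.get_method(), req.header_items())
```

When such a request is actually sent, a 412 response indicates that the precondition did not hold and the file was left untouched.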
Once the file is stored in the archive, you can retrieve it using the same URL.
>>> r = get(baseurl + '/demo/ab587f42c2570a884/some/example.txt', stream=True)
>>> with open("download.txt", 'wb') as fd:
...     for chunk in r.iter_content(chunk_size=1024*8):
...         fd.write(chunk)
This downloads the entire file and stores it locally. You can also request parts of the file (using Range headers) and make your request conditional (If-Match, If-None-Match, If-Modified-Since, If-Unmodified-Since and If-Range headers are fully supported).
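A partial download only needs a Range header with a byte range. The helper below is a minimal sketch that builds such a header in the standard bytes=start-end form. The name range_header is made up for this example, and the resulting dictionary could be passed as the headers argument of a GET request.

```python
def range_header(start, end=None):
    """Build a Range header for the bytes start..end (inclusive).

    If 'end' is omitted, the range extends to the end of the file.
    """
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    last = "" if end is None else str(end)
    return {"Range": f"bytes={start}-{last}"}


print(range_header(0, 1023))   # first KiB of the file
print(range_header(1024))      # everything after the first KiB
```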
Instead of the actual file content, you can also request the file attributes or meta-attributes via the info and meta sub-resources.
>>> get(baseurl + '/demo/ab587f42c2570a884/some/example.txt?info').json()
{
'name': 'some/example.txt',
'type': 'text/plain',
'id': '4e2cdf90ae00bff1e2bad79ffebdb63b',
'size': 12,
'created': '2017-07-25T11:08:02.558+0000',
'modified': '2017-07-25T11:08:02.602+0000',
'digests': {
'sha256': '30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58',
'sha1': '3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3',
'md5': 'b6d81b360a5672d80c27430f39153e2c'},
}
>>> get(baseurl + '/demo/ab587f42c2570a884/report.xls?meta').json()
{
'dc:title': ['My Report']
}
Tip: Since meta is a sub-resource of info, you can fetch both at the same time via ?info&with=meta.
And finally: Deleting individual files is just a plain and simple DELETE request.
>>> delete(baseurl + '/demo/ab587f42c2570a884/some/example.txt')
That's it for now. To be continued …
Configuration
CDStar is configured via configuration files (YAML or JSON), command-line arguments or environment variables, or a combination thereof. In any case, configuration is treated as a flat list of dot-separated keys and plain string values (e.g. key.name=value). File formats that support advanced data types and nesting (namely JSON and YAML) are flattened automatically when loaded. Arrays or multiple values for the same key are simply joined into a comma-separated list.
---
# Nested document
path:
  home: "/mnt/vault"
vault.demo:
  create: True
---
# Flattened form
path.home: "/mnt/vault"
vault.demo.create: "True"
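The flattening rules can be illustrated with a few lines of Python. This is a sketch of the described behaviour, not CDStar's actual loader. The function flatten is hypothetical and operates on an already parsed document (a nested dict).

```python
def flatten(config, prefix=""):
    """Flatten a nested mapping into dot-separated keys with string values."""
    flat = {}
    for key, value in config.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Nested sections extend the key prefix.
            flat.update(flatten(value, full_key))
        elif isinstance(value, (list, tuple)):
            # Arrays are joined into a comma-separated list.
            flat[full_key] = ",".join(str(item) for item in value)
        else:
            flat[full_key] = str(value)
    return flat


nested = {"path": {"home": "/mnt/vault"}, "vault.demo": {"create": True}}
print(flatten(nested))
```

Applied to the nested document above, this produces the same two keys as the flattened form.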
Values may contain references to other keys (e.g. ${path.home}) or environment variables (e.g. ${ENV_NAME}). The latter is recommended for sensitive information that should not appear in config files or command line arguments (e.g. passwords). A colon (:) is used to separate the reference from an optional default value.
For example, ${CDSTAR_HOME:/var/lib/cdstar} would be replaced by the content of the CDSTAR_HOME environment variable, or the default path if the environment variable is not defined.
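The substitution rules can be sketched as follows. This is an illustration under assumptions, not CDStar's implementation: the lookup order (configuration keys first, then environment variables, then the default) is a guess, and circular references are not detected. The function resolve is hypothetical.

```python
import re

# ${name} or ${name:default}
_REF = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")


def resolve(value, config, env):
    """Replace ${...} references in 'value' using config keys and env vars."""

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in config:
            # Referenced values may contain references themselves.
            return resolve(config[name], config, env)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"unresolved reference: {name}")

    return _REF.sub(replace, value)


config = {"path.home": "${CDSTAR_HOME:/var/lib/cdstar}"}
print(resolve("${path.home}/data", config, env={}))
print(resolve("${path.home}/data", config, env={"CDSTAR_HOME": "/srv/cdstar"}))
```

In a real program, os.environ would typically be passed as env.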
Disk Storage
CDStar stores all its data and internal state on the file system. You usually only need to set path.home, as all other parameters default to subdirectories under the path.home directory.
- path.home
  This directory is used as the base directory for the other paths. (default: ${CDSTAR_HOME:/var/lib/cdstar/})
- path.data
  Storage location for archive data and runtime information. CDStar creates a subdirectory for each vault and follows symlinks, which makes it easy to split the storage across several mounted disks. (default: ${path.home}/data)
- path.var
  Storage location for short-lived temporary data. Do NOT use a ramdisk or other volatile storage, as transaction and crash-recovery data will also be stored here. (default: ${path.home}/var)
- path.lib
  Plugins and extensions are searched for in this directory if they are not found on the default Java classpath. (default: ${path.home}/lib)
Transports
CDStar supports http and https transports out of the box. By default, only the unencrypted http transport is enabled and binds to localhost port 8080. The high port number allows CDStar to run as non-root, which is the recommended mode of operation.
External access should be encrypted and off-loaded to a reverse proxy (e.g. nginx) for security and performance reasons. Only enable the built-in https transport for testing or if you know what you are doing.
- http.host
  IP address to bind to. A value of 0.0.0.0 will bind to all available interfaces at the same time. (default: 127.0.0.1)
- http.port
  Network port to bind to. Ports below 1024 require root privileges (not recommended). A value of 0 will bind to a random free port. A value of -1 will disable this transport. (default: 8080)
- https.host
  IP address to bind to. (default: ${http.host})
- https.port
  Network port to listen to. (default: 8433)
- https.certfile
  Path to a *.pem file containing the certificate chain and private key. (required)
- https.h2
  Enable HTTP/2. This requires Java 9+ and should be considered experimental. (default: false)
Public REST API
The REST API is exposed over all configured transports.
- api.dariah.enable
  Enable or disable the dariah REST API. (default: False)
- api.v2.enable
  Enable or disable the legacy v2 REST API. (default: False)
- api.v3.enable
  Enable or disable the current v3 REST API. (default: True)
- api.context
  Provide the public service URL. This is required if cdstar runs behind a reverse proxy or load balancer and cannot detect its public URL automatically. (default: /)
Vaults
Vaults are usually created at runtime via the management API, but can also be bootstrapped from configuration. Statically configured vaults are created at startup if they do not exist, and ignored otherwise. It is not possible to change the parameters of a vault via configuration after it was created.
- vault.<name>.create
  If true, create this vault on startup if it does not exist already.
- vault.<name>.public
  If true, allow public (non-authenticated) read access to this vault. Archive permissions are still checked.
Each vault is backed by a storage pool, which can be configured as part of the vault configuration. The default pool configuration looks like this, and may be overridden if needed (experimental, not recommended).
- vault.<name>.pool.class
  Storage pool class or name. Defaults to the NioPool class.
- vault.<name>.pool.name
  Storage pool name. Defaults to the vault name.
- vault.<name>.pool.path
  Data path for this storage pool. Defaults to ${path.data}/${name}.
Other StoragePool implementations may accept additional parameters.
Plugins may also read vault-level configuration to control vault-specific behavior. The DefaultPermissions feature, for example, controls the permissions defined on newly created archives and can be configured differently for each vault.
Realms
Realms manage authentication and authorization in CDStar.
For a simple setup with only a handful of users, you usually only need a single 'default' realm (e.g. StaticRealm) with everything configured within the same config file.
More complex scenarios (e.g. LDAP, JWT or SAML auth) are supported via specialized implementations of the Realm interface (e.g. StaticRealm, JWTRealm or LdapRealm) and can be combined in many ways.
- realm.<name>.class
  Realm implementation to use. Either a simple class name or a fully qualified Java class. (required)
- realm.<name>.<field>
  Additional realm configuration.
See Realms for a list of available realm implementations and their configuration options.
Warning: If no realm is configured, cdstar adds an 'admin' user with a randomly generated password to the implicit 'system' realm. The password is logged to the console on startup and changes on every restart.
Tip: Realms are no different from plugins. They are only configured in a separate realm.* name-space to avoid accidental misconfiguration.
Plugins and Extensions
CDSTAR can be extended with custom implementations for event listeners, storage pools, long-term storage adapters and many other interfaces. These can be referenced by name, simple class-name or fully qualified java class name.
- plugin.<name>.class
  Plugin to load. Either a name, a simple Java class name or a fully qualified Java class name.
- plugin.<name>.<field>
  Additional plugin configuration.
plugin.ui:
  class: UIBlueprint
plugin.bagit:
  class: de.gwdg.cdstar.runtime.lts.bagit.BagitTarget
  path: ${path.home}/bagit/
API Basics
The cdstar HTTP API is the primary method for accessing CDStar instances. Requests are made via HTTP to one of the documented API Endpoints and responses are returned mostly as JSON documents for easy consumption by scripts or client software.
The current stable HTTP API is reachable under the /v3 path on a cdstar server. Other APIs (e.g. legacy-v2, dariah or S3) may be available under different paths on the same server, but these are not part of this chapter.
Basics
The cdstar HTTP API follows RESTful principles. The core concepts are described here. You can skip this section if you are already familiar with HTTP and REST.
HTTP Methods
CDStar API Endpoints make use of the following standard HTTP request methods:
Method | Description |
---|---|
GET | Receive a resource or sub-resource. This is a read-only operation and never changes the state of the resource or other resources. |
HEAD | Same as GET, but without the response body. |
POST | Create or update a resource, or perform a modifying server-side operation. |
PUT | Create or replace a resource with the content of the request. |
DELETE | Remove a resource. |
Some proxies restrict or lack support for certain HTTP methods, such as DELETE. In this case, a client may send a POST request with a non-standard X-HTTP-Method-Override header instead. The value of this header is used as a server-side override for the actual HTTP method.
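A client-side sketch of such an override is shown below. It only constructs the request with the standard library and does not send it. The helper name override_method is made up for this example.

```python
from urllib.request import Request


def override_method(url, method, data=None):
    """Wrap an arbitrary HTTP method into a POST request.

    The intended method travels in the X-HTTP-Method-Override header.
    """
    return Request(
        url,
        data=data,
        method="POST",
        headers={"X-HTTP-Method-Override": method},
    )


req = override_method(
    "http://localhost:8080/v3/demo/ab587f42c2570a884/some/example.txt",
    "DELETE",
)
print(req.get_method(), req.header_items())
```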
HTTP Response Codes
Each of the API Endpoints defines a number of possible HTTP response status codes and their meaning. The following list summarizes all status codes used by this API and provides a general description.
Code | Reason | Description |
---|---|---|
200 | OK | Request completed successfully. The response contains the requested resource. |
201 | Created | Resource created successfully. The location of the newly created resource can be found in the response Location header. |
304 | Not Modified | The requested resource has not changed since the client last requested it. |
400 | Bad Request | The request violates the HTTP protocol or this API specification. A detailed error description is contained within the response. |
401 | Unauthorized | The requested resource requires Authentication. |
403 | Forbidden | The client is authenticated, but not authorized to access the requested resource or perform the requested operation. |
404 | Not Found | The requested resource does not exist, or the client is not allowed to know if it exists or not. |
409 | Conflict | The request could not be completed due to a conflict with the current state of the target resource. This code is used in situations where the user might be able to resolve the conflict and resubmit the request. |
423 | Locked | The requested resource is currently not available and additional steps are required to make it available again. |
500 | Internal Server Error | An error occurred on the server side that cannot be fixed by the client. Try again later. |
501 | Not Implemented | The requested functionality is part of this API, but not implemented by the service. |
503 | Service Unavailable | The server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay. |
507 | Insufficient Storage | Storage or quota not sufficient to perform this operation. |
Caution: Please note that some APIs may return 404 Not Found instead of 403 Forbidden or 401 Unauthorized if the client has insufficient permissions to access a resource. This is to prevent leakage of information to unauthorized users (e.g. the existence of a private archive or a file within an archive).
Parameter types
API Endpoints may accept request parameters of various types, either via the query string part of the request URL, or as fields within a multipart/form-data formatted POST request, or both. In any case, each parameter is associated with a value type and interpreted according to the following table:
Name | Description |
---|---|
boolean | Either |
int | A signed integer number between |
long | A bit signed integer number between |
double | A decimal number in a format parseable by the Java |
string | An arbitrary |
enum | A value out of a predefined set of possible values. The valid values and their meanings are listed in the parameter description. |
list({type}) | This parameter accepts multiple values of the enclosing type. Clients may repeat this parameter once for each value. Some parameters may also accept a comma separated list. |
file | This parameter type is only supported as part of a multipart/form-data request. |
glob | A file-name matching glob pattern. See Glob syntax. |
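The two ways of passing a list parameter can be sketched with urllib.parse. The helper names are hypothetical, and whether a specific parameter accepts the comma-separated form depends on the endpoint. The with parameter from the tutorial is used as an example.

```python
from urllib.parse import urlencode


def query_repeated(params):
    """Encode list values by repeating the parameter once per value."""
    return urlencode(params, doseq=True)


def query_joined(params):
    """Encode list values as a single comma-separated parameter."""
    joined = {
        key: ",".join(value) if isinstance(value, (list, tuple)) else value
        for key, value in params.items()
    }
    # Keep the comma readable instead of percent-encoding it.
    return urlencode(joined, safe=",")


params = {"with": ["files", "meta"]}
print(query_repeated(params))
print(query_joined(params))
```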
Glob syntax
Glob patterns are a simple way to filter or match file-names within an archive against a specific pattern. There is no real standard for glob patterns and existing implementations differ slightly. This is why CDStar implements its own subset of the most commonly used rules:
Pattern | Description |
---|---|
? | Matches a single character within a path segment. Does not match the path separator. |
* | Matches any number of characters within a path segment, including an empty string. Does not match the path separator. |
** | Matches any number of characters, including the path separator. |
If the whole pattern starts with the path separator / (forward slash), then the entire path is matched against the pattern. Otherwise, a partial match at the end of the path is sufficient. The pattern *.pdf for example would return all PDF files within an archive, but /*.pdf would only return PDF files located directly within the root folder.
As mentioned above, single wildcards only match within a path segment, which means both ? and * do not expand across path separators (/). The pattern docs/*.pdf would find /docs/file.pdf but not /docs/subfolder/file.pdf. Use two adjacent asterisks (e.g. docs/**.pdf) to include subfolders in your search.
Glob Pattern | Regular Expression | Examples |
---|---|---|
|
|
/file.pdf |
|
|
/file.pdf |
|
|
/file.pdf |
|
|
/2016/report.csv |
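The rules of this section can be approximated by translating a glob pattern into a regular expression. The following sketch is not the CDStar implementation. It assumes that a pattern without a leading slash may match any suffix of the path, and the function names are made up for this example.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # crosses path separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # stays within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # exactly one non-separator character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    body = "".join(parts)
    # A leading slash anchors the pattern at the start of the path,
    # otherwise a match at the end of the path is sufficient.
    prefix = "^" if pattern.startswith("/") else ""
    return re.compile(prefix + body + r"\Z")


def glob_match(pattern, path):
    """Return True if 'path' matches the glob 'pattern'."""
    return glob_to_regex(pattern).search(path) is not None


for pattern in ("*.pdf", "/*.pdf", "docs/*.pdf", "docs/**.pdf"):
    print(pattern, glob_match(pattern, "/docs/subfolder/file.pdf"))
```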
Authentication
CDStar can be configured with one or more authentication realms, implementing various ways of authenticating and authorizing client requests against the service. From the HTTP API point of view, there are mostly two ways to authenticate:
Password Authentication
HTTP Basic Authentication is a stateless and simple authentication scheme most suitable for scripting or simple client applications. Username and password are transmitted with each request in cleartext, so this scheme should NOT be used over unencrypted connections.
$ curl -u username https://cdstar.gwdg.de/v3/...
Some realms may require a fully qualified username in the form of username@realm, but most realms also accept unqualified logins. If the username itself contains an @, then it MUST be qualified to avoid ambiguity.
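For clients that build the header themselves, HTTP Basic Authentication boils down to a base64-encoded username:password pair. The sketch below also applies the qualification rule described above. The function basic_auth_header is a hypothetical helper and not part of CDStar.

```python
import base64


def basic_auth_header(username, password, realm=None):
    """Build an HTTP Basic Authorization header.

    If 'realm' is given, the login is qualified as username@realm.
    Unqualified usernames containing '@' are rejected as ambiguous.
    """
    if realm is not None:
        username = f"{username}@{realm}"
    elif "@" in username:
        raise ValueError("usernames containing '@' must be qualified with a realm")
    credentials = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(credentials).decode("ascii")}


print(basic_auth_header("test", "test"))
print(basic_auth_header("alice@example.org", "secret", realm="ldap"))
```

Remember that base64 is an encoding and not encryption, which is why this scheme should only be used over encrypted connections.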
Token Authentication
Token authentication is handled via an Authorization: bearer <token> header. Alternatively, the non-standard X-Token header or token query parameter can be used, but these are not recommended. Acquiring a token is not part of this API and depends heavily on the configured token realm (e.g. JWTRealm). For this example we assume that the client already obtained an access token.
curl -H "Authorization: bearer OAUTH-TOKEN" https://cdstar.gwdg.de/v3/...
In order to embed resources into HTML pages (e.g. images) or provide time-limited download links, a special token with limited access rights can be attached to the URL of GET requests via the token query parameter. As with access tokens, the method to obtain such tokens is not part of this API.
<img src="/v3/myVault/85a031d6e08d/image.png?token=READ-TOKEN" />
Authorization
CDStar implements a flexible authorization and permission system with fine-grained archive-level access control. The permission system is designed to be simple for the common case, but still powerful enough to support advanced requirements and responsibility models (e.g. groups and roles across multiple realms).
Note: The permission system may look complex at first glance, but remember that you only need a subset of this functionality for most common scenarios.
The core concept can be summarized as follows: 'Permissions' are granted to 'subjects' and affect a specific 'resource'. Subjects may be individual 'users' or 'groups' of users. A resource may be a single archive, a vault or the entire storage service. Subjects (both users and groups) are organized in 'realms'. A simple setup only requires a single realm, but multi-tenancy instances can use realms to separate different authorities.
Subjects and Realms
Subjects are encoded as strings and matched against the current user context using the following subject qualifier syntax:
Subject Match | Description |
---|---|
|
Special subject that matches any user, authenticated or not. |
|
Special subject that matches authenticated users. |
|
Special subject that matches the current owner of the affected resource. This is implemented for archive resources and matches against the the |
|
Subjects starting with |
|
Subject that do not match any of the patterns above are tested against the identifier of the currently logged-in user. |
If multiple realms are configured, then group and user names should be qualified with a realm name to avoid naming conflicts between realms. Unqualified names are still allowed, but they will match against any realm with a matching user or group.
Fully qualified names have the form name@realm. For example, alice from the ldap realm would be alice@ldap.
Only the last occurrence of the @ character is recognized, so identifiers with @ in them (e.g. email addresses) are allowed.
In fact, if the local part of a subject identifier contains an @, then the subject MUST be qualified with a realm to avoid ambiguity.
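The "last @ wins" rule maps directly onto a right-partition of the string. This is a sketch for plain user names only. The function split_subject is hypothetical, and special subjects or group prefixes are not handled.

```python
def split_subject(subject):
    """Split a subject into (name, realm); realm is None if unqualified.

    Only the last '@' separates the realm, so the name part may itself
    contain '@' characters (e.g. an email address).
    """
    name, separator, realm = subject.rpartition("@")
    if not separator:
        return subject, None
    return name, realm


print(split_subject("alice"))
print(split_subject("alice@ldap"))
print(split_subject("alice@example.org@ldap"))
```

The last call shows why an email-style identifier has to be qualified: without the trailing @ldap, example.org would be taken for the realm.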
Vault Permissions
Permissions regarding a specific vault. If assigned globally, they have the form vault:{vaultName}:{permissionName}.
Name | Description |
---|---|
read | Open a vault. This is not required for |
create | Create new archives within a vault. |
list | List the archive IDs in a vault. Note that this allows a user to check if an archive exists independently of archive-level permissions. |
Archive Permissions
Archives are protected by an access control list (ACL) which grants permissions to specific subjects (see Subjects and Realms). If assigned globally, they have the form archive:{vaultName}:{archiveId}:{permissionName}.
Note: Archive permissions are very fine-grained and most actions require more than one permission. For example, in order to receive a file from an archive, both read_file and load permissions are required. In most cases it is easier to assign Archive Permission Sets instead.
Name | Description |
---|---|
load | Check if an archive exists and read basic attributes (e.g. last-modified or number of files). |
delete | Delete an archive and its history (destructive operation). |
read_acl | Read the access control list (ACL). |
change_acl | Grant or revoke permissions by modifying the ACL. |
change_owner | Change the owner. |
read_meta | Read meta-data attributes. |
change_meta | Add, remove or replace meta-data attributes. |
list_files | List files and their attributes (e.g. name, size, type, hash). |
read_files | Read file content. |
change_files | Create, modify or remove files. |
 | Explicitly compress or clean up an archive. |
Archive Permission Sets
Archive permissions are very fine-grained and most actions require more than one permission. For example, a user with only read_file permission on an archive would not be able to read any files, because the load permission is also required to load the archive in the first place. To simplify access control for common use-cases, permission sets were introduced. Each set bundles a number of permissions that are usually granted together, and can be assigned just like normal permissions.
Permission sets have upper-case names to distinguish them from normal permissions. The following matrix shows all pre-defined permission sets and their corresponding permissions.
Permission/set | LIST | READ | WRITE | OWNER | MANAGE | ADMIN |
---|---|---|---|---|---|---|
load |
yes |
yes |
yes |
yes |
yes |
yes |
delete |
yes |
yes |
||||
read_acl |
yes |
yes |
yes |
|||
change_acl |
yes |
yes |
yes |
|||
change_owner |
yes |
yes |
||||
read_meta |
yes |
yes |
yes |
yes |
||
change_meta |
yes |
yes |
yes |
|||
list_files |
yes |
yes |
yes |
yes |
yes |
yes |
read_files |
yes |
yes |
yes |
yes |
||
change_files |
yes |
yes |
yes |
The MANAGE set is intended for management and reporting jobs. These are usually only interested in the meta-data of an archive, not the content. The set therefore inherits LIST instead of READ or even WRITE to protect user data by default. While clients with this permission set would be able to grant more permissions to themselves, these changes would show up in audit logs and be accountable.
Vaults are usually configured to grant OWNER permissions to the $owner subject for new archives automatically. This allows the archive creator to work with the newly created archive and perform most actions, with the notable exception of changing the owner. Giving archives away is usually a task reserved for higher privilege accounts. This permission set is not limited or otherwise tied to the $owner subject, though. It can be given to other subjects, or revoked from the owner. Revoking permissions from the owner is a common pattern to make archives read-only after publishing.
Note
|
READ, WRITE and MANAGE resemble the permissions defined in cdstar version 2. |
Transaction Management
CDStar focuses on data safety and consistency. All transactions are atomic, consistent, isolated and durable by default (ACID properties). In short, this guarantees that transactions either succeed or fail completely ("all or nothing"), you will never see inconsistent state (e.g. half-committed changes), transactions won’t overlap or interfere with each other (isolation), and changes are persisted to disk before you get an OK back (durability).
Tip
|
ACID properties should be a core requirement for any kind of reliable storage service, but they are actually quite hard to find outside of traditional databases. Most modern web-based storage services (e.g. Amazon S3, CouchDB, MongoDB, most NoSQL databases) only provide "eventual consistency" or do not guarantee atomicity for operations affecting more than a single item. This makes it very hard or even impossible to implement certain workflows against these APIs in a reliable way, resulting in 'lost updates' or other consistency problems. |
Each call to an API endpoint implicitly creates and commits a transaction by default. If a single operation is not enough though, you can also create an explicit transaction, issue multiple API calls, and then commit or roll back all changes as a single atomic transaction. The non-standard X-Transaction header is used to associate HTTP calls with a running transaction.
$ curl -XPOST /v3/_tx
201 CREATED
{ "id": "d2ee7d6034e3", ... }
$ curl -H 'X-Transaction: d2ee7d6034e3' ...
...
$ curl -XPOST /v3/_tx/d2ee7d6034e3
204 No Content
The results of these HTTP calls are not visible to other transactions until they are committed, and you won’t see any changes made by other users while your transaction is active, either. This is called 'snapshot isolation' and works as if each transaction operates on a snapshot of the entire database from the exact moment the transaction was started.
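The same begin, work and commit sequence can be sketched in Python. The example only builds the requests and does not send them. The base URL, the vault name demo and the helper name tx_request are assumptions for illustration; the X-Transaction header and the /v3/_tx paths are taken from the text above.

```python
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: address of a local test instance


def tx_request(method, path, tx_id=None, data=None):
    """Build (but do not send) a request, optionally bound to a transaction."""
    headers = {}
    if tx_id is not None:
        # Associates this call with a running explicit transaction.
        headers["X-Transaction"] = tx_id
    return urllib.request.Request(
        BASE_URL + path, data=data, method=method, headers=headers
    )


# 1. Begin: the transaction ID would be read from the "id" field of the response.
begin = tx_request("POST", "/v3/_tx")
tx_id = "d2ee7d6034e3"

# 2. Any number of calls carrying the header (here: a hypothetical file upload).
step = tx_request("PUT", "/v3/demo/ab587f42c257/data.csv", tx_id, b"a,b\n1,2\n")

# 3. Commit all changes as one atomic unit.
commit = tx_request("POST", "/v3/_tx/" + tx_id)
```

Passing each request to urllib.request.urlopen would execute it. Keeping request construction separate from sending makes it easier to repeat a failed step before committing.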
Recoverable errors during an explicit transaction do not trigger a rollback. On one hand, this allows clients to recover from errors without losing too much progress. On the other hand, clients using explicit transactions MUST handle errors properly. Individual operations may fail and still have partial effects. For example, if a file upload fails mid-request, the client should either repeat or resume the failed upload. The client MUST make sure the transaction is in a clean state before committing.
Update conflicts (multiple transactions updating the same archive at the same time) are not resolved automatically, since CDStar cannot possibly know how to merge multiple changes into a consistent result. In this unfortunate case, the transaction committed first will succeed and all other transactions writing to the same archive will fail as soon as a commit is tried.
Read-conflicts are allowed, though. If you only read from an archive and do not change it, and a different transaction changes the archive in the meantime and commits before you, your transaction won’t fail. If you require a higher level of isolation (called 'serializability' in database theory) you can enable it via the isolation=full parameter when creating a new transaction.
Transaction management is expensive. Some transaction information must survive even a fatal server crash to allow reliable and automatic crash recovery. If you only need to 'read' from multiple archives in an isolated way, you can start the transaction with readonly=true and save a lot of server-side house-keeping.
Explicit transactions expire after some time of inactivity. They never expire while an HTTP call is still in progress, and will extend their lifetime automatically after each HTTP call. You won’t have to worry about that in most cases. If you need a transaction to survive more than a couple of seconds of inactivity (e.g. while waiting for user input), you can specify a higher timeout when creating a transaction, or issue cheap HTTP calls (e.g. Renew Transaction) from time to time to prevent transactions from dying. Expired transactions are rolled back automatically.
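The isolation, readonly and timeout options mentioned above are passed as form fields when a transaction is created. The following sketch encodes such a form body. The helper name begin_tx_form and its validation rules are assumptions; only the three parameter names come from this documentation.

```python
from urllib.parse import urlencode


def begin_tx_form(isolation=None, readonly=False, timeout=None):
    """Encode the form body for a 'begin transaction' request."""
    fields = []
    if isolation is not None:
        fields.append(("isolation", isolation))  # e.g. "full"
    if readonly:
        fields.append(("readonly", "true"))
    if timeout is not None:
        if timeout <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        fields.append(("timeout", str(int(timeout))))
    return urlencode(fields)


# A read-only transaction that tolerates a minute of inactivity.
body = begin_tx_form(readonly=True, timeout=60)
```

Options that are not set are omitted, so the server-side defaults apply.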
API Endpoints
This chapter lists and describes all web service endpoints defined by the standard CDStar HTTP API. Requests are routed to the appropriate endpoint based on their HTTP method, content type and URI path. Some endpoints also require certain query parameters to be present. Path parameters (variable parts of the URL path) are marked with curly brackets.
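Title | Method | URI Path |
---|---|---|
Instance APIs | | |
Service Info | | /v3/ |
Service Health | | /v3/_health |
Vaults and Search | | |
List Vaults | | /v3/ |
Get Vault Info | | /v3/{vault} |
Search in Vault | | /v3/{vault}?q |
List all Archives in a Vault | | /v3/{vault}?scroll |
Archives | | |
Create Archive | | /v3/{vault}/ |
Get Archive Info | | /v3/{vault}/{archive} |
Export Archive | | /v3/{vault}/{archive}?export |
Update Archive | | /v3/{vault}/{archive} |
Delete Archive | | /v3/{vault}/{archive} |
Files | | |
List files | | /v3/{vault}/{archive}?files |
Download file | GET | /v3/{vault}/{archive}/{filename} |
Get file info | | /v3/{vault}/{archive}/{filename}?info |
Upload file | PUT | /v3/{vault}/{archive}/{filename} |
Resume file upload | PATCH | /v3/{vault}/{archive}/{filename} |
Delete file | | /v3/{vault}/{archive}/{filename} |
Metadata | | |
Get Archive Metadata | | /v3/{vault}/{archive}?meta |
Set Archive Metadata | | /v3/{vault}/{archive}?meta |
Get File Metadata | | /v3/{vault}/{archive}/{file}?meta |
Set File Metadata | | /v3/{vault}/{archive}/{file}?meta |
Access Control | | |
Get Archive ACL | | /v3/{vault}/{archive}?acl |
Set Archive ACL | | /v3/{vault}/{archive}?acl |
Data Import | | |
Import from ZIP/TAR | | /v3/{vault}/ |
Update from ZIP/TAR | | /v3/{vault}/{archive} |
Snapshots | | |
Create Snapshot | | /v3/{vault}/{archive}?snapshots |
Delete Snapshot | | /v3/{vault}/{archive}@{snapshot} |
List Snapshots | | /v3/{vault}/{archive}?snapshots |
Transactions | | |
Begin Transaction | POST | /v3/_tx/ |
Get Transaction Info | | /v3/_tx/{txid} |
Commit Transaction | POST | /v3/_tx/{txid} |
Renew Transaction | | /v3/_tx/{txid}?renew |
Rollback Transaction | | /v3/_tx/{txid} |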
Instance APIs
APIs to access instance-level functionality like metrics, health, capabilities and more. This is also the entry point for most plugins.
Service Info
/v3/
HTTP/1.1
Get basic information about the cdstar instance as well as a list of all vaults accessible by the current user.
Status | Response | Description |
---|---|---|
200 |
No description |
Service Health
/v3/_health
HTTP/1.1
Warning
|
This endpoint is marked as unstable and is subject to change. |
Return health and performance metrics about the service.
Name | Type | Description |
---|---|---|
with |
list(enum) |
Include additional information in the response.
|
Status | Response | Description |
---|---|---|
200 |
No description |
Vaults and Search
List and access vaults, search or enumerate archives within a vault.
List Vaults
/v3/
HTTP/1.1
List vaults accessible by the current user. This is the same as Service Info.
Get Vault Info
/v3/{vault}
HTTP/1.1
Get information about a vault.
Status | Response | Description |
---|---|---|
200 |
No description |
Search in Vault
/v3/{vault}?q
HTTP/1.1
Perform a search over all archives and files within a vault using the configured search backend. Only results that are visible to the currently logged in user are returned.
Changed in v3.1: Added fields parameter.
Name | Type | Description |
---|---|---|
q |
string |
Search query using the lucene query syntax or an alternative query syntax
supported by the backing search index. Multiple plain search terms are usually Example: |
order |
enum |
Order results by Multiple order fields can be specified as a comma separated list. Default: |
limit |
int(0-max) |
Limit the number of results. Values are automatically capped to an allowed maximum. Default: |
fields |
list(string) |
Request additional fields for each hit. Search backends SHOULD support requesting index document fields by name (e.g. Search backends MAY support more complex field queries via a backend specific syntax. For example, requesting
The SearchHit data type contains a Multiple simple fields can be requested as a comma separated list. Example: |
scroll |
string |
When a search query matched more than This works similar to the 'search_after' feature in elasticsearch or the 'cursorMark' feature in solr.
The 'scroll' value in a SearchResults response is a stateless live cursor pointing to the last
element returned in a result page. When repeating a search with a valid Default: |
groups |
list |
Claim membership of additional user groups. This is useful if the realm of the user does not return all groups the user belongs to, and some search hits are not visible because of that. Each claim is checked against the realm, and if successful, hits visible to that group are included in the result. |
Status | Response | Description |
---|---|---|
200 |
No description |
|
501 |
Search functionality is disabled. |
|
504 |
Search functionality is enabled, but the search service did not respond in time. |
List all Archives in a Vault
/v3/{vault}?scroll
HTTP/1.1
List IDs of archives stored in this vault.
Up to limit IDs are returned per request. IDs are ordered in a stable but otherwise implementation-specific way (usually lexicographical). If the scroll parameter is a non-empty string, then only IDs ordered after the given string are returned. This can be used to scroll through all IDs of a vault in an efficient manner.
By default, this API will return all IDs that were ever created in this vault, including IDs of archives that were removed or are not load-able by the current user. This mode requires list vault permission or the vault to be public.
In strict mode, archive manifests are actually loaded from storage and only IDs of archives that are load-able by the current user are returned. This mode is less efficient, but does not require list permissions on the vault. Use with caution.
This API is NOT transactional and may reflect changes made by other clients as soon as they happen.
Name | Type | Description |
---|---|---|
scroll |
string |
Required, but can be empty. Start listing IDs greater than the given string, according to the implementation-defined ordering (usually lexicographical). For pagination, set |
limit |
int(0-max) |
Limit the number of results. Values are automatically capped to an allowed maximum. Default: |
strict |
boolean |
If true, only IDs for archives that are actually load-able by the current user are returned. |
Status | Response | Description |
---|---|---|
200 |
No description |
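Scrolling through all IDs of a vault amounts to repeating the request with the last ID of the previous page as the new scroll value. The sketch below shows that loop. The fetch_page callable is a hypothetical stand-in for the actual HTTP call, and the in-memory fake_fetch only imitates the "IDs ordered after the given string" behavior described above.

```python
def iter_archive_ids(fetch_page, limit=100):
    """Yield all archive IDs, one page at a time.

    fetch_page(scroll, limit) is assumed to return a list of IDs that are
    ordered after `scroll`, at most `limit` of them.
    """
    scroll = ""  # the parameter is required, but may be empty for the first page
    while True:
        ids = fetch_page(scroll, limit)
        if not ids:
            return
        yield from ids
        scroll = ids[-1]  # continue after the last ID seen


# Stand-in for the server, used here only to exercise the loop.
ALL_IDS = sorted(["a1", "b2", "c3", "d4", "e5"])


def fake_fetch(scroll, limit):
    return [i for i in ALL_IDS if i > scroll][:limit]


collected = list(iter_archive_ids(fake_fetch, limit=2))
```

The loop stops on an empty page instead of a short page, because the server may cap limit to a lower value than the one requested.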
Archives
Create Archive
/v3/{vault}/
HTTP/1.1
Note
|
This endpoint consumes: multipart/form-data , application/x-www-form-urlencoded
|
Create a new archive, owned by the current user.
If the request body contains form data, the new archive is immediately populated according to Update Archive.
Status | Response | Description |
---|---|---|
201 |
Archive created |
Get Archive Info
/v3/{vault}/{archive}
HTTP/1.1
/v3/{vault}/{archive}@{snapshot}
HTTP/1.1
Get information about an archive or list its content.
When accessing a snapshot, information that is not part of the snapshot (e.g. owner or ACL) will be read from the current archive state.
Name | Type | Description |
---|---|---|
with |
list(enum) |
Include additional information in the response. This can be used as a shortcut for individual requests to Get Archive ACL, List files, Get Archive Metadata or List Snapshots. If access restrictions do not allow reading a subresource, the flag is silently ignored.
|
include |
list(glob) |
Only list files that match any of these glob patterns. Implies |
exclude |
list(glob) |
Only list files that do not match any of these glob patterns. Implies |
order |
enum |
Order files by Default: |
reverse |
boolean |
Return files in reverse order. Implies |
limit |
int(0-max) |
Limit the number of files listed. Values are automatically capped to
an allowed maximum. Implies Default: |
offset |
int(0-inf) |
Skip this many files from the listing. Can be used for pagination
of archives with more than |
Status | Response | Description |
---|---|---|
200 |
Archive found |
|
400 |
Invalid parameters |
|
404 |
Archive not found or not readable by current user |
Export Archive
/v3/{vault}/{archive}?export
HTTP/1.1
/v3/{vault}/{archive}@{snapshot}?export
HTTP/1.1
Warning
|
This endpoint is marked as unstable and is subject to change. |
Note
|
This endpoint produces: */*
|
Export (a subset of) all files in an archive as a single download.
The file format is specified by the export parameter. Currently only zip is implemented. More formats are planned (e.g. BagIt, tar, tar.gz and more).
Name | Type | Description |
---|---|---|
export |
list(enum) |
Required parameter that specifies the export format.
Currently only
|
include |
list(glob) |
Only export files that match any of these glob patterns. |
exclude |
list(glob) |
Only export files that do not match any of these glob patterns. |
Status | Response | Description |
---|---|---|
200 |
bytes |
The export format and |
404 |
Archive not found or not readable by current user |
Update Archive
/v3/{vault}/{archive}
HTTP/1.1
/v3/{vault}/{archive}@{snapshot}
HTTP/1.1
Note
|
This endpoint consumes: multipart/form-data , application/x-www-form-urlencoded
|
Update an existing archive or snapshot.
Request form data is interpreted as a list of commands and applied in order. The order is significant. For example, a file can be uploaded, then copied, then annotated with metadata, all in the same request. Make sure your HTTP client preserves the order of form fields when generating the request body.
Commands that contain a {filename} placeholder operate on files within the archive. The filename must start with a slash (/) in order to be recognized. If the filename also ends with a slash, it usually affects all files with that prefix. Be careful.
File uploads only work with multipart/form-data and are not recommended for large files. Prefer Upload file for anything larger than a couple of MB. Uploading a large number of small files may be faster using this API, though. Your mileage may vary.
Snapshots are read-only, but setting a new profile is supported.
Name | Type | Description |
---|---|---|
{filename} |
file |
Upload a new file ( If the filename ends with a slash, then the original (client-side) name of the file is appended.
If the filetype is either Example: |
copy:{filename} |
string |
Create a new file by copying the content of an existing file from the same archive. Example: |
clone:{filename} |
string |
Create a new file by copying the content and metadata of an existing file from the same archive. |
move:{filename} |
string |
Rename an existing file. |
fetch:{filename} |
uri |
Create a new file by fetching an external resource. If Supported URI schemes depend on installed plugins and not all URIs may be allowed.
For example, fetching from Example: |
delete:{filename} |
string |
Delete a file. The value is ignored. If Example: |
type:{filename} |
string |
Change the content-type of an existing file. The value should follow the Example: |
meta:{attr} |
list(string) |
Set a meta-attribute for the archive. See Metadata for a list of supported Example: |
meta:{attr}:{filename} |
list(string) |
Set a meta-attribute for a specific file within the archive. Example: |
acl:{subject} |
list(enum) |
Change the list of permissions granted to a See Archive Permissions for a list of permission names. Example: |
profile |
string |
Set the desired storage profile for this archive or snapshot. Profile changes usually trigger background data migration and will take some time to have an effect. See Storage Profiles for details. |
owner |
string |
Change the owner of this archive.
This requires |
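Because the commands are applied in order, the form body should be built from an ordered sequence of fields and not from an unordered mapping. The sketch below encodes a few commands from the table above as an application/x-www-form-urlencoded body. The file names, the attribute value and the subject alice are made-up examples.

```python
from urllib.parse import parse_qsl, urlencode

# Commands are (field name, value) pairs and are applied in list order.
commands = [
    ("meta:dc:title", "Example dataset"),   # archive-level meta attribute
    ("type:/data/table.csv", "text/csv"),   # change the content-type of a file
    ("delete:/old/notes.txt", ""),          # the value is ignored for delete
    ("acl:alice", "READ"),                  # grant a permission set
]

# urlencode keeps the order of a sequence of pairs and percent-encodes
# reserved characters such as ':' and '/'.
body = urlencode(commands)

# Decoding the body again yields the same commands in the same order.
decoded = parse_qsl(body, keep_blank_values=True)
```

A list of pairs also allows the same field name to appear more than once, which a dictionary would not.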
Delete Archive
/v3/{vault}/{archive}
HTTP/1.1
Remove an existing archive and all snapshots. This requires delete
permissions on the archive.
Status | Response | Description |
---|---|---|
204 |
- |
Archive removed (no content). |
Files
List files
/v3/{vault}/{archive}?files
HTTP/1.1
/v3/{vault}/{archive}@{snapshot}?files
HTTP/1.1
List files within an archive or snapshot. This endpoint supports the same parameters as Get Archive Info to filter or paginate the list of files.
Status | Response | Description |
---|---|---|
200 |
No description |
Download file
/v3/{vault}/{archive}/{filename}
HTTP/1.1
/v3/{vault}/{archive}@{snapshot}/{filename}
HTTP/1.1
Note
|
This endpoint produces: */*
|
Download a single file from an archive or snapshot.
This endpoint supports ranged requests and conditional headers such as If-(None-)Match, If-(Un)modified-Since, If-Range and Range, as well as HEAD requests. The ETag value is calculated from the file's digest hash, if known.
Highly accessed files in publicly readable archives may be served from a different location (e.g. S3 or CDN). Clients should follow redirects (e.g. 307 Temporary Redirect) according to the HTTP standard.
During explicit transactions, and while a file upload is currently in progress, GET requests will fail with an "IncompleteWrite" error. HEAD requests are allowed, though. The Content-Length header will report the current upload size.
Name | Type | Description |
---|---|---|
inline |
boolean |
By default, files are returned with a Some content-types cannot be inlined for security reasons. This parameter is silently
ignored for these types, and the |
Status | Response | Description |
---|---|---|
200 |
bytes |
File exists, is readable and its content is returned with this response.
The |
206 |
bytes |
Same as |
304 |
- |
File not modified. |
307 |
- |
Same as |
409 |
- |
Archive not available. This may happen for archives with a cold storage profile. |
412 |
- |
Precondition failed. |
416 |
- |
Requested range not satisfiable. |
Get file info
/v3/{vault}/{archive}/{filename}?info
HTTP/1.1
/v3/{vault}/{archive}@{snapshot}/{filename}?info
HTTP/1.1
Get FileInfo for a single file. For multiple files, List files is usually faster.
Name | Type | Description |
---|---|---|
with |
list(enum) |
Return additional information about the file, embedded in the FileInfo document.
|
Status | Response | Description |
---|---|---|
200 |
No description |
Upload file
/v3/{vault}/{archive}/{filename}
HTTP/1.1
Note
|
This endpoint consumes: */*
NOTE: This endpoint produces: application/json
|
Directly upload a new file to an archive, or overwrite an existing file.
If a Content-Type header is missing or equals application/x-autodetect, then the media type is guessed from the filename extension.
The conditional headers If-Match: * or If-None-Match: * can be used to force update-only or create-only behavior.
Upload errors can only be detected properly if either a Content-Length header is set, or Transfer-Encoding: chunked is used. If fewer than the expected number of bytes are transmitted, the file is considered incomplete and the transaction will fail.
During explicit transactions (see Transaction Management), failed uploads will leave the file in an incomplete state. The upload must be repeated or resumed before committing. See Resume file upload for details. Conflicting operations, for example reading the file content or fetching its info, will fail until the file was completely updated or removed. HEAD requests to the file's URL are allowed, though.
Status | Response | Description |
---|---|---|
200 |
- |
File updated. |
201 |
- |
File created. |
412 |
- |
Precondition (e.g. |
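The conditional headers described for this endpoint can be attached when building the upload request. The following sketch prepares a PUT request without sending it. The helper name upload_request, the mode names and the example URL are assumptions; the header names and the application/x-autodetect media type come from the description above.

```python
import urllib.request


def upload_request(url, payload, mode="any",
                   content_type="application/x-autodetect"):
    """Build a PUT request for a file upload.

    mode: "any" (create or overwrite), "create-only" or "update-only".
    """
    headers = {
        "Content-Type": content_type,
        # An explicit length lets truncated uploads be detected.
        "Content-Length": str(len(payload)),
    }
    if mode == "create-only":
        headers["If-None-Match"] = "*"
    elif mode == "update-only":
        headers["If-Match"] = "*"
    elif mode != "any":
        raise ValueError(f"unknown mode: {mode!r}")
    return urllib.request.Request(url, data=payload, method="PUT", headers=headers)


req = upload_request(
    "http://localhost:8080/v3/demo/ab587f42c257/data.csv",  # hypothetical URL
    b"a,b\n1,2\n",
    mode="create-only",
)
```

With create-only mode, an existing file is expected to cause a 412 response instead of being overwritten.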
Resume file upload
/v3/{vault}/{archive}/{filename}
HTTP/1.1
Note
|
This endpoint consumes: application/vnd.cdstar.resume
NOTE: This endpoint produces: application/json
|
Resume a failed or aborted file upload.
After a failed Upload file request during an explicit transaction (see Transaction Management), the client may choose to resume the upload instead of uploading the entire file again or removing it.
To do so, send a PATCH request with Content-Type: application/vnd.cdstar.resume and a Range header with a single byte range, either bytes=startByte- or bytes=startByte-endByte (see RFC-2616). The startByte index must match the current remote file size, as returned by a HEAD request to the Download file API. The endByte index is optional, but recommended as an additional safeguard. It should match the target file size.
A file is considered complete once the PUT or PATCH request completes without errors. Within a single transaction, failing uploads can be resumed repeatedly until all data is transmitted or the transaction runs into a timeout.
Do not use this API to upload files in small chunks. A successful PUT or PATCH request will compute digests, which is an expensive operation. Always try to upload the entire file in one go, if possible.
Status | Response | Description |
---|---|---|
200 |
- |
File updated. |
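The headers for a resume request follow directly from the description above. This is a minimal sketch with a hypothetical helper name. It only formats the headers and leaves the choice of endByte to the caller.

```python
def resume_headers(start_byte, end_byte=None):
    """Headers for a PATCH request that resumes an interrupted upload.

    start_byte must equal the current remote file size (as reported by a
    HEAD request); end_byte is optional.
    """
    if start_byte < 0:
        raise ValueError("start_byte must not be negative")
    if end_byte is not None and end_byte < start_byte:
        raise ValueError("end_byte must not be smaller than start_byte")
    byte_range = f"bytes={start_byte}-"
    if end_byte is not None:
        byte_range += str(end_byte)
    return {
        "Content-Type": "application/vnd.cdstar.resume",
        "Range": byte_range,
    }


# The server already holds the first 1024 bytes of the file.
headers = resume_headers(1024)
```

The request body would then contain only the bytes from start_byte onwards.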
Delete file
/v3/{vault}/{archive}/{filename}
HTTP/1.1
Remove a single file from an archive. This requires change_files
permissions on the archive.
Status | Response | Description |
---|---|---|
204 |
- |
File removed (no content). |
Metadata
Archives and individual files within an archive can be annotated with custom metadata attributes. Both the name and values of an attribute are plain strings, but each attribute can have multiple values. Lists of strings are returned even if only a single value is set.
Attribute names are case-insensitive and limited to letters, digits and the underscore character, and must start with a letter.
Attribute names may be prefixed with a namespace identifier followed by a single colon character (e.g. dc:title for a Dublin Core title attribute). Namespaced attributes are subject to server-side validation and defined in a schema. Custom attributes should either be prefixed with the custom: namespace or have no namespace at all.
The value of an attribute is an ordered list of plain strings. Empty strings are allowed, but a list with no values is equal to an undefined attribute.
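A client can check attribute names against these rules before sending them. The sketch below is an approximation, not the server's validation logic: it assumes that namespace identifiers follow the same character rules as attribute names, and it lower-cases names only to make client-side comparison case-insensitive.

```python
import re

_NAME = r"[A-Za-z][A-Za-z0-9_]*"  # starts with a letter; letters, digits, underscore
ATTR_RE = re.compile(rf"(?:({_NAME}):)?({_NAME})")  # optional "namespace:" prefix


def normalize_attr(name):
    """Validate an attribute name and return it in lower case."""
    if ATTR_RE.fullmatch(name) is None:
        raise ValueError(f"invalid attribute name: {name!r}")
    return name.lower()


def normalize_values(values):
    """Attribute values are always an ordered list of strings."""
    if isinstance(values, str):
        return [values]
    return [str(v) for v in values]


document = {normalize_attr("DC:Title"): normalize_values("Example dataset")}
```

An attribute with an empty value list is equal to an undefined attribute, so such entries can simply be left out of the document.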
Get Archive Metadata
/v3/{vault}/{archive}?meta
HTTP/1.1
/v3/{vault}/{archive}@{snapshot}?meta
HTTP/1.1
Return metadata attributes for an archive or snapshot.
The same information can also be received as part of a Get Archive Info request by using the with=meta switch.
Status | Response | Description |
---|---|---|
200 |
No description |
Set Archive Metadata
/v3/{vault}/{archive}?meta
HTTP/1.1
Note
|
This endpoint consumes: application/json
|
Replace the metadata of an archive with a new MetaAttributes document. To clear all attributes, just send an empty document (e.g. {}
).
Field | Type | Description |
---|---|---|
{schema:attr} |
list(string) |
A list of string values. The list is ordered and duplicates are allowed. |
Status | Response | Description |
---|---|---|
204 |
- |
Metadata updated. |
Get File Metadata
/v3/{vault}/{archive}/{file}?meta
HTTP/1.1
/v3/{vault}/{archive}@{snapshot}/{file}?meta
HTTP/1.1
Return metadata attributes for a single file within an archive or snapshot.
The same information can also be received as part of a Get file info request by using the with=meta switch.
Status | Response | Description |
---|---|---|
200 |
No description |
Set File Metadata
/v3/{vault}/{archive}/{file}?meta
HTTP/1.1
Note
|
This endpoint consumes: application/json
|
Replace the metadata of a file within an archive with a new MetaAttributes document. To clear all attributes, just send an empty document (e.g. {}
).
Field | Type | Description |
---|---|---|
{schema:attr} |
list(string) |
A list of string values. The list is ordered and duplicates are allowed. |
Status | Response | Description |
---|---|---|
204 |
- |
Metadata updated. |
Access Control
The local access control list (ACL) of an archive can be used to grant permissions to individuals or groups. These permissions are checked before any external realm is consulted and stored as part of the archive.
New permissions can be granted individually using the Update Archive endpoint, or in bulk via Set Archive ACL. The permissions read_acl or change_acl are required to read or change the access control list of an archive.
Note that the names for subjects (individuals or groups) can and should be qualified with the name of the authentication realm, especially if more than one realm is installed. A subject named alice would match any user with that name, across all authentication sources. Use qualified names (e.g. userName@realmName or @groupName@realmName) to prevent ambiguities.
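The subject naming scheme can be illustrated with a small parser. This is a sketch based only on the name forms shown in this documentation (plain or qualified users, groups with a leading @, special subjects such as $owner). The function name and the realm name ldap are hypothetical.

```python
def parse_subject(subject):
    """Split a subject name into (kind, name, realm).

    realm is None for unqualified names, which match across all realms.
    """
    if subject.startswith("$"):
        return ("special", subject[1:], None)   # e.g. $owner, $any
    kind = "user"
    if subject.startswith("@"):
        kind, subject = "group", subject[1:]    # groups carry a leading '@'
    name, sep, realm = subject.partition("@")
    return (kind, name, realm if sep else None)


def is_qualified(subject):
    """True if the subject names an explicit realm."""
    return parse_subject(subject)[2] is not None


examples = [parse_subject(s) for s in ("alice", "alice@ldap", "@staff@ldap", "$owner")]
```

A check like is_qualified can be used to warn before an unqualified name is written into an ACL on a server with more than one realm.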
Get Archive ACL
/v3/{vault}/{archive}?acl
HTTP/1.1
Return the local access control list of this archive as an AclInfo document.
The same information can also be received as part of a Get Archive Info request
by using the with=acl
switch.
Name | Type | Description |
---|---|---|
acl |
enum |
Default: |
Status | Response | Description |
---|---|---|
200 |
No description |
Set Archive ACL
/v3/{vault}/{archive}?acl
HTTP/1.1
Note
|
This endpoint consumes: application/json
|
Replace all entries of the local access control list with entries from this AclInfo document.
Field | Type | Description |
---|---|---|
{subject} |
list(string) |
A list of permissions (lowercase) or permission-sets (uppercase) granted to this subject.
|
Status | Response | Description |
---|---|---|
200 |
- |
Archive updated. |
400 |
- |
Invalid permission |
Data Import
Import from ZIP/TAR
/v3/{vault}/
HTTP/1.1
Note
|
This endpoint consumes: application/zip , application/x-tar
|
Create a new Archive from a ZIP or TAR file.
For compressed TAR files, make sure to provide a suitable Content-Encoding header. Supported algorithms include gz, bzip2, xz, and deflate.
Note that importing compressed ZIP or TAR archives requires a significant amount of work on server-side after the upload completed, which may cause some clients to time-out before a response can be sent. Make sure to increase the read time-outs for your client before uploading large archives.
Name | Type | Description |
---|---|---|
prefix |
string |
Import files into this folder. Example: |
include |
list(glob) |
Only import files that match any of these glob patterns. Example: |
exclude |
list(glob) |
Only import files that do not match any of these glob patterns. Example: |
Status | Response | Description |
---|---|---|
201 |
- |
Archive created. |
Update from ZIP/TAR
/v3/{vault}/{archive}
HTTP/1.1
Note
|
This endpoint consumes: application/zip , application/x-tar
|
Import files from a zip or tar file into an existing archive. See Import from ZIP/TAR for details.
Name | Type | Description |
---|---|---|
prefix |
string |
Import files into this folder. Example: |
include |
list(glob) |
Only import files that match any of these glob patterns. Example: |
exclude |
list(glob) |
Only import files that do not match any of these glob patterns. Example: |
Status | Response | Description |
---|---|---|
200 |
- |
Archive updated. |
Snapshots
Archive Snapshots are an efficient way to preserve the current payload (files and metadata) of an archive without actually creating a copy. This can be used to implement versioning or prepare unmodifiable copies for publishing.
The preserved state of a snapshot can be accessed (read-only) just like normal archive state, by appending an @ and the snapshot name to the archive id in the request path. For example, GET /v3/ab587f42c257@v1/data.csv will return a file from archive ab587f42c257 as preserved by snapshot v1. This works for all endpoints documented as supporting snapshots.
Snapshots only preserve the payload of an archive, namely metadata and files. Administrative metadata such as owner or access control lists are not part of a snapshot. Only the profile
can be changed on a snapshot via Update Archive. This means that the storage state and availability of a snapshot can differ from that of the archive. See Storage Profiles for details.
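Request paths for archives and snapshots differ only in the @{snapshot} suffix, so one helper can build both. The sketch below follows the path patterns listed for the file endpoints (/v3/{vault}/{archive}@{snapshot}/{filename}). The function name and the vault name demo are assumptions.

```python
from urllib.parse import quote


def archive_path(vault, archive, snapshot=None, filename=None):
    """Build the request path for an archive, a snapshot or a file in either."""
    path = f"/v3/{quote(vault, safe='')}/{quote(archive, safe='')}"
    if snapshot is not None:
        path += "@" + quote(snapshot, safe="")
    if filename is not None:
        # File names may contain folders, so slashes are kept as separators.
        path += "/" + quote(filename.lstrip("/"))
    return path


current = archive_path("demo", "ab587f42c257", filename="data.csv")
preserved = archive_path("demo", "ab587f42c257", snapshot="v1", filename="data.csv")
```

Switching a client from the live archive to a published snapshot then only requires passing the snapshot name.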
Create Snapshot
/v3/{vault}/{archive}?snapshots
HTTP/1.1
Note
|
This endpoint consumes: application/x-www-form-urlencoded
|
Create a new snapshot.
Name | Type | Description |
---|---|---|
name |
string |
(required) Snapshot name. Must be unique per archive and only contain ASCII letters, digits, dashes or dots ( |
Status | Response | Description |
---|---|---|
201 |
Snapshot created. |
Delete Snapshot
/v3/{vault}/{archive}@{snapshot}
HTTP/1.1
Delete a snapshot. This requires delete permissions on the archive and is irreversible. The name of a deleted snapshot cannot be used to create a new snapshot.
Status | Response | Description |
---|---|---|
204 |
- |
Snapshot removed |
List Snapshots
/v3/{vault}/{archive}?snapshots
HTTP/1.1
Get a list of snapshots that exist for this archive, ordered by creation date, then name.
Transactions
Transactions can be started, committed or rolled back explicitly using these endpoints. To learn more about transactions, see Transaction Management.
Begin Transaction
/v3/_tx/
HTTP/1.1
Note
|
This endpoint consumes: application/x-www-form-urlencoded
|
Start a new transaction. See Transaction Management for details.
Name | Type | Description |
---|---|---|
isolation |
enum |
Select an isolation level for this transaction. Supported modes are Transactions with 'snapshot' isolation work on a consistent snapshot of the entire database from the exact moment the transaction was started and only see their own changes. On a write-write conflict (the same resource modified by two overlapping transactions) only one of the transactions will be able to commit. This protects against 'lost updates' and is suitable for most scenarios. Transactions with 'full' isolation (also called 'serializability isolation') will also fail on write-read conflicts. The transaction can only be committed if none of the affected resources (modified or not) was modified by an overlapping transaction. Default: |
readonly |
boolean |
If true, create a read-only transaction. These transactions cannot be committed (only rolled back). |
timeout |
integer |
Timeout (in seconds) after which an unused transaction is automatically rolled back. User supplied timeouts are automatically capped to a server-defined maximum value. Default: |
Status | Response | Description |
---|---|---|
201 |
Transaction created successfully. |
Get Transaction Info
/v3/_tx/{txid}
HTTP/1.1
Request information about a running transaction.
Status | Response | Description |
---|---|---|
200 |
Transaction Info |
|
404 |
Transaction does not exist, expired or is not visible to the current user context. |
Commit Transaction
/v3/_tx/{txid}
HTTP/1.1
Commit a running transaction. All changes made with this transaction ID are persisted and new transactions will be able to see the changes. The commit may fail, in which case no changes will be persisted at all. Partial commits never happen.
Status | Response | Description |
---|---|---|
204 |
- |
Transaction committed successfully. |
404 |
Transaction does not exist, expired or is not visible to the current user context. |
|
409 |
Transaction could not be committed because of unresolvable conflicts and was rolled back instead. |
|
423 |
Transaction could not be committed because of locked resources. It may still be possible to commit this transaction, so it is kept open. The client should either issue a rollback, or try again later. |
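A client has to react differently to each of these status codes. The following sketch maps them to a next step. The function name and the returned labels are made up for illustration; the status codes and their meaning are taken from the table above.

```python
def commit_outcome(status):
    """Translate the HTTP status of a commit request into a next step."""
    if status == 204:
        return "committed"    # all changes are persisted
    if status == 409:
        return "rolled-back"  # conflict; start over with a new transaction
    if status == 423:
        return "retry"        # still open; retry later or roll back explicitly
    if status == 404:
        return "gone"         # unknown, expired or not visible to this user
    raise ValueError(f"unexpected status for commit: {status}")


next_step = commit_outcome(423)
```

Only the 423 case leaves the transaction open, so it is the only one where repeating the same commit request can succeed.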
Renew Transaction
/v3/_tx/{txid}?renew
HTTP/1.1
Renew a running transaction.
This resets the transaction timeout and ensures that the transaction
is not rolled back automatically for the next TransactionInfo.ttl
seconds.
Status | Response | Description |
---|---|---|
200 |
Transaction renewed successfully. The response contains an updated |
|
404 |
Transaction does not exist, expired or is not visible to the current user context. |
Rollback Transaction
/v3/_tx/{txid}
HTTP/1.1
Close a running transaction by rolling it back. All changes made with this transaction ID are discarded.
API Data Structures
AclInfo
This object maps subjects (users, groups or special subjects) to lists of permissions (lowercase) or permission sets (uppercase). See Archive Permissions for possible values.
Permissions are grouped into permission sets by default. Only permissions that do not fit into a complete set are returned individually. Endpoints returning this structure usually also support a flag to return individual permissions instead of sets.
For most subjects, this listing only contains permissions that were explicitly granted on the archive itself. Authorization realms configured on the server may grant additional permissions when requested. Those are not listed here, as they cannot be known in advance.
Field | Type | Description |
---|---|---|
{subject} |
list(string) |
A list of permissions (lowercase) or permission-sets (uppercase) granted to this subject.
|
{
"$any": [
"READ"
],
"$owner": [
"OWNER"
],
"alice": [
"READ"
],
"@cronGorup": [
"READ",
"read_acl"
]
}
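Because permission sets are written in uppercase and individual permissions in lowercase, a client can separate the two by case alone. The Python sketch below does this for the example document above. The helper name `split_acl` and the shape of its result are assumptions made for illustration.

```python
def split_acl(acl: dict) -> dict:
    """Separate permission sets (uppercase) from single permissions (lowercase)."""
    result = {}
    for subject, grants in acl.items():
        result[subject] = {
            "sets": [g for g in grants if g.isupper()],
            "permissions": [g for g in grants if not g.isupper()],
        }
    return result


# The AclInfo example from above.
acl = {
    "$any": ["READ"],
    "$owner": ["OWNER"],
    "alice": ["READ"],
    "@cronGorup": ["READ", "read_acl"],
}

for subject, grants in split_acl(acl).items():
    print(subject, grants)
```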
ArchiveInfo
Archive properties and content listing as returned by Get Archive Info. Some of the fields are optional or affected by query parameters. See Get Archive Info for a detailed description.
If this document represents an archive snapshot, additional fields are present. State that is not part of the snapshot (e.g. owner or ACL) is complemented from the archive state, if requested.
Field | Type | Description |
---|---|---|
id |
string |
Unique ID of this archive. |
vault |
string |
Name of the containing vault. |
revision |
string |
Archive revision. This is currently an incrementing counter, but the value should be treated as an arbitrary string. |
profile |
string |
The name of the storage profile. If the archive is currently in a |
state |
enum |
The current storage state of this archive or snapshot. The states are:
Archives in |
created |
date |
Time this archive was created. |
modified |
date |
Last time this archive, its meta-data or any of its files were modified. Note that changes to administrative meta-data (owner, ACL) do not update the modification time of an archive. If you need to track changes in administrative meta-data, always compare the actual values. |
file_count |
int |
Total number of files in this archive.
May be |
files |
list(FileInfo) |
List of files in this archive. May be incomplete or missing based on query parameters, permissions and server configuration. See Get Archive Info for details. |
meta |
Meta-Attributes defined on this archive. May be incomplete or missing based on query parameters and permissions. |
|
acl |
Access control list. May be incomplete or missing based on query parameters and permissions. |
|
snapshots |
list(SnapshotInfo) |
List of snapshots created for this archive, if any. May be incomplete or missing based on query parameters. See Get Archive Info for details. |
{
"id": "ab587f42c2570a884",
"vault": "myVault",
"revision": "0",
"profile": "default",
"state": "open",
"created": "2016-12-20T13:59:37.160+0000",
"modified": "2016-12-20T13:59:37.231+0000",
"file_count": 1,
"files": [
{
"name": "/example.txt",
"id": "aaf0cc5ab587",
"type": "text/plain",
"size": 7,
"created": "2016-12-20T13:59:37.217+0000",
"modified": "2016-12-20T13:59:37.218+0000",
"digests": {
"md5": "1a79a4d60de6718e8e5b326e338ae533",
"sha1": "c3499c2729730a7f807efb8676a92dcb6f8a3f8f",
"sha256": "50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c"
},
"meta": {
"dc:title": [
"This is an example file"
],
"dc:date": [
"2016-12-20T13:59:37.218+0000"
]
}
}
],
"acl": {
"$any": [
"READ"
],
"$owner": [
"OWNER"
],
"alice": [
"READ"
],
"@cronGorup": [
"READ",
"read_acl"
]
}
}
Error
In case of an error, CDStar returns a JSON document with additional information.
Field | Type | Description |
---|---|---|
status |
int |
HTTP status code of this response |
error |
string |
Short description. Suitable as a key for translations or error handling, as it does not contain any dynamic parts. |
message |
string |
Long description. Suitable to be presented to the user. |
detail |
object |
Additional information or metadata. (Optional field) |
other |
list(Error) |
If more than one error occurred during a single request, the other errors are listed here. (Optional field) |
{
"status": 404,
"error": "Not found",
"message": "The requested archive does not exist or is not readable.",
"detail": {
"vault": "myVault",
"archive": "ab587f42c2570a884"
}
}
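Since `other` may carry further Error documents, a client that wants to report everything needs to walk the nested list. The following Python sketch flattens an Error document into a plain list and builds a one-line summary. Both helper names are hypothetical.

```python
def flatten_errors(error: dict) -> list:
    """Return the primary error followed by all errors nested in 'other'."""
    errors = [error]
    for nested in error.get("other") or []:
        errors.extend(flatten_errors(nested))
    return errors


def summarize(error: dict) -> str:
    """Build a single log line from an Error document."""
    return "; ".join(
        f"{e['status']} {e['error']}: {e['message']}" for e in flatten_errors(error)
    )


example = {
    "status": 404,
    "error": "Not found",
    "message": "The requested archive does not exist or is not readable.",
    "detail": {"vault": "myVault", "archive": "ab587f42c2570a884"},
}
print(summarize(example))
```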
FileInfo
Properties and (optionally) meta-data about a single file within an archive.
Field | Type | Description |
---|---|---|
id |
string |
A unique and immutable string identifier. Other than the |
name |
string |
File name (unicode), always starting with a slash ( |
type |
string |
User supplied or auto-detected media type. Defaults to |
size |
long |
File size in bytes |
created |
date |
Time the file was created. |
modified |
date |
Last time the file content was modified. |
digests |
object |
An object mapping digest algorithms to their hex value.
The available algorithms (e.g. This field is not available (null or missing) for incomplete files with running or aborted uploads in the same transaction. |
meta |
Meta attributes defined for this file. May be incomplete or missing based on query parameters and permissions. |
{
"name": "/example.txt",
"id": "aaf0cc5ab587",
"type": "text/plain",
"size": 7,
"created": "2016-12-20T13:59:37.217+0000",
"modified": "2016-12-20T13:59:37.218+0000",
"digests": {
"md5": "1a79a4d60de6718e8e5b326e338ae533",
"sha1": "c3499c2729730a7f807efb8676a92dcb6f8a3f8f",
"sha256": "50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c"
},
"meta": {
"dc:title": [
"This is an example file"
],
"dc:date": [
"2016-12-20T13:59:37.218+0000"
]
}
}
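The `digests` object allows a client to check a downloaded or uploaded file against the server-side checksums. The Python sketch below compares locally computed digests with the values from a FileInfo document. It assumes the algorithm keys (`md5`, `sha1`, `sha256`) are names that Python's `hashlib` understands, which holds for the keys shown in the example; unknown algorithms are skipped.

```python
import hashlib


def verify_digests(data: bytes, file_info: dict) -> dict:
    """Compare locally computed digests with those reported in a FileInfo."""
    results = {}
    for algorithm, expected in (file_info.get("digests") or {}).items():
        try:
            actual = hashlib.new(algorithm, data).hexdigest()
        except ValueError:
            continue  # algorithm not supported by hashlib, cannot be checked
        results[algorithm] = actual == expected.lower()
    return results


content = b"some file content"
info = {"digests": {"sha256": hashlib.sha256(content).hexdigest()}}
print(verify_digests(content, info))
```

An empty result means nothing could be verified, for example because `digests` is missing for an incomplete upload.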
FileList
A list of FileInfo objects, usually filtered and paginated.
If count
and total
are not equal, then the result is incomplete
and additional requests are required to get the complete list.
Field | Type | Description |
---|---|---|
count |
int |
Number of results in this listing (size of the |
total |
int |
Total number of files matching the given include/exclude filters or query. |
files |
list(FileInfo) |
List of FileInfo objects. |
{
"count": 1,
"total": 1,
"files": [
{
"name": "/example.txt",
"id": "aaf0cc5ab587",
"type": "text/plain",
"size": 7,
"created": "2016-12-20T13:59:37.217+0000",
"modified": "2016-12-20T13:59:37.218+0000",
"digests": {
"md5": "1a79a4d60de6718e8e5b326e338ae533",
"sha1": "c3499c2729730a7f807efb8676a92dcb6f8a3f8f",
"sha256": "50d858e0985ecc7f60418aaf0cc5ab587f42c2570a884095a9e8ccacd0f6545c"
},
"meta": {
"dc:title": [
"This is an example file"
],
"dc:date": [
"2016-12-20T13:59:37.218+0000"
]
}
}
]
}
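A client can use `count` and `total` to decide whether more requests are needed. The Python sketch below collects all pages of a FileList. How the next page is requested is not described in this section, so the sketch takes a caller-supplied `fetch_page(offset)` function; that function and its offset parameter are assumptions.

```python
from typing import Callable


def collect_files(fetch_page: Callable[[int], dict]) -> list:
    """Fetch FileList pages until all 'total' files were collected."""
    files = []
    while True:
        page = fetch_page(len(files))  # hypothetical: offset of the next page
        files.extend(page["files"])
        # Stop on an empty page as well, to avoid looping forever if the
        # listing shrinks while it is being read.
        if page["count"] == 0 or len(files) >= page["total"]:
            return files


# Stand-in for real requests: serves five files in pages of two.
all_files = [{"name": f"/file{i}.txt"} for i in range(5)]

def fake_fetch(offset: int) -> dict:
    chunk = all_files[offset:offset + 2]
    return {"count": len(chunk), "total": len(all_files), "files": chunk}

print(len(collect_files(fake_fetch)))
```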
MetaAttributes
This object contains one key per non-empty meta-attribute defined on the resource. The keys are fully qualified attribute names (including schema prefix) and values are always lists of strings, even if the attribute only allows a single value or has a different value type.
Field | Type | Description |
---|---|---|
{schema:attr} |
list(string) |
A list of string values. The list is ordered and duplicates are allowed. |
{
"dc:title": [
"This is an example file"
],
"dc:date": [
"2016-12-20T13:59:37.218+0000"
]
}
ScrollResults
A page of results returned from a List all Archives in a Vault query.
Field | Type | Description |
---|---|---|
count |
int |
Number of results in this page. |
limit |
int |
Maximum number of results per page. If |
results |
list(String) |
List of archive IDs |
{
"count": 2,
"limit": 25,
"results": [
"ab587f42c2570a884",
"ac2b39606a3a6e3b1"
]
}
SearchHit
A single element of a SearchResults listing.
Field | Type | Description |
---|---|---|
id |
string |
Archive ID this hit belongs to. |
type |
string |
Resource type of this hit (either |
name |
string |
Full file name (including path) of the matched file. Only present if |
score |
float |
Relevance score. May be 0 for queries or search backends that do not support relevance scoring. |
fields |
object(string, any) |
Contains field query results requested during search or automatically provided by the search backend. Each entry maps a field query to its result value, which is usually a simple type (e.g. number, string or list of strings), but can also take other forms for computed fields or errors. Failed or unsupported individual field queries should map to an Supported field queries and their return type depend on the search backend used. |
{
"id": "ab587f42c2570a884",
"type": "file",
"name": "/folder/example.pdf",
"score": 3.14,
"fields": {
"dcTitle": "Example Document Title",
"highlight(content)": {
"error": "UnsupportedFieldQuery"
}
}
}
SearchResults
A page of results returned from a search query.
Field | Type | Description |
---|---|---|
count |
int |
Number of results in this page. |
total |
int |
Total number of results in this result set (approximation) |
scroll |
string |
A stateless cursor representing the last hit of this result page. It can be used to repeat the search and fetch the next page of a large result set. |
hits |
list(SearchHit) |
List of search hits |
{
"count": 1,
"total": 1,
"scroll": "WyJhYjU4N2Y0MmMyNTcwYTg4NDphYWYwY2M1YWI1ODciXQ==",
"hits": [
{
"id": "ab587f42c2570a884",
"type": "file",
"name": "/folder/example.pdf",
"score": 3.14,
"fields": {
"dcTitle": "Example Document Title",
"highlight(content)": {
"error": "UnsupportedFieldQuery"
}
}
}
]
}
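The `scroll` cursor makes it possible to walk through a large result set page by page. The Python sketch below shows the loop structure only: it takes a caller-supplied `search(scroll)` function that performs the actual request and returns a SearchResults document. That function, and the choice to stop on an empty page or a missing cursor, are assumptions.

```python
from typing import Callable, Iterator, Optional


def iter_hits(search: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield all hits of a search by following the scroll cursor."""
    scroll = None  # first request is sent without a cursor
    while True:
        page = search(scroll)
        yield from page["hits"]
        if page["count"] == 0 or not page.get("scroll"):
            return  # assumed end of the result set
        scroll = page["scroll"]


# Stand-in for real requests: three hits spread over two pages.
pages = {
    None: {"count": 2, "total": 3, "scroll": "a", "hits": [{"id": "1"}, {"id": "2"}]},
    "a": {"count": 1, "total": 3, "scroll": "b", "hits": [{"id": "3"}]},
    "b": {"count": 0, "total": 3, "scroll": None, "hits": []},
}
print([hit["id"] for hit in iter_hits(lambda cursor: pages[cursor])])
```

Because `total` is only an approximation, the loop does not rely on it to detect the last page.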
SnapshotInfo
Information about a single archive snapshot.
Field | Type | Description |
---|---|---|
name |
string |
Snapshot name |
revision |
string |
Archive revision this snapshot refers to. |
creator |
string |
User that created this snapshot. |
created |
string |
Snapshot creation date |
profile |
string |
Snapshot storage profile |
{
"name": "v1",
"revision": 0,
"creator": "user@domain",
"created": "2020-05-26T12:02:45.301+0000",
"profile": "default"
}
TransactionInfo
Information about a running transaction. See Transaction Management for details.
Field | Type | Description |
---|---|---|
id |
string |
Transaction ID |
isolation |
enum |
Isolation level (either |
readonly |
boolean |
Whether or not this transaction is in read-only mode. Read-only transactions cannot be committed (only rolled back) and do not allow modifying operations. |
ttl |
integer |
Number of seconds left from the configured If this number is zero or negative, then the transaction already expired or may expire very soon. |
timeout |
integer |
Number of seconds after which this transaction will expire if not used (see ttl). |
{
"id": "091f8a6e-0fca-4771-a460-d2ee7d6034e3",
"isolation": "snapshot",
"readonly": false,
"ttl": 59,
"timeout": 60
}
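A long-running client can compare `ttl` with `timeout` to decide when a renew request is due. The following Python sketch renews once less than half of the timeout is left. The helper name and the 50% threshold are arbitrary choices for illustration.

```python
def should_renew(tx_info: dict, safety_margin: float = 0.5) -> bool:
    """True if less than `safety_margin` of the timeout is left."""
    return tx_info["ttl"] < tx_info["timeout"] * safety_margin


# The TransactionInfo example from above: 59 of 60 seconds are left.
info = {
    "id": "091f8a6e-0fca-4771-a460-d2ee7d6034e3",
    "isolation": "snapshot",
    "readonly": False,
    "ttl": 59,
    "timeout": 60,
}
print(should_renew(info))
```

A `ttl` of zero or below also triggers a renewal attempt, which may then fail with 404 if the transaction has already expired.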
Realms
Realms manage authentication and authorization in CDStar and are very flexible. There are different interfaces for authorization, authentication, group membership resolution, custom permission types and more. This list contains all available realm types that are either bundled with the core distribution or provided as officially supported plugins. Custom implementations can also be used.
StaticRealm
This realm provides authentication, authorization and groups from a static configuration file.
StaticRealm loads the entire user database (users, groups, roles and permissions) from a static configuration file (hence the name) and is the go-to solution for small instances with only a handful of users. No external database or server is required.
Configuration
The realm is configured directly in the cdstar main configuration. Here is an example showing most options:
cdstar-static-realm.yaml
realms:
default:
class: StaticRealm
domain: static
role:
userRole:
- "vault:demo:read"
- "vault:demo:create"
adminRole:
- "vault:*:*"
- "archive:*:*:*"
group:
customers:
- userRole
admins:
- userRole
- adminRole
user:
alice:
password: "cGxhaW4=:FmtSc7NSX8fsjLTmpLpoqRLP4vqWFg/r5uy3EU6JsEs="
groups:
- customers
permissions:
- "vault:alice:*"
admin:
password: "..."
roles:
- adminRole
Param | Description |
---|---|
class |
Realm implementation class name. Always |
file |
Load additional configuration from an external yaml file (not implemented) |
domain |
Sets a default domain for this realm. (defaults to 'static') |
user.<name>.password |
Enables a user to authenticate against this realm. The password is stored in hashed form. These hashes can be created using the built-in command line tool (see below). |
user.<name>.permissions |
Grants string permissions directly to this user. |
user.<name>.groups |
Adds this user to a list of groups. |
user.<name>.roles |
Adds this user to a list of roles. |
group.<name> |
Defines a new group with a list of roles. |
role.<name> |
Defines a new role with a list of string permissions. |
Unqualified groups and user-names are qualified with the configured default domain of the realm (e.g. alice
is turned into alice@static
).
Fully qualified names (e.g. alice@otherRealm
) are also accepted, even if the domain does not match the current realm.
Warning
|
Permissions, groups and roles configured for a qualified user will affect any session with a matching principal name and domain, even if the session was authenticated by a different realm. |
If no password is defined for a user, then the user will not be able to authenticate against this realm. Permissions, roles and groups still apply.
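To illustrate how users, groups and roles relate to each other, the Python sketch below resolves the string permissions of a user from a dictionary that mirrors the example configuration. It reflects one reading of the table above (groups contribute roles, roles contribute permissions) and is not the actual StaticRealm implementation; the function name is hypothetical.

```python
def effective_permissions(realm: dict, user: str) -> set:
    """Collect direct, role-based and group-based permissions of a user."""
    entry = realm["user"][user]
    roles = set(entry.get("roles", []))
    for group in entry.get("groups", []):
        roles.update(realm["group"].get(group, []))  # groups are lists of roles
    permissions = set(entry.get("permissions", []))
    for role in roles:
        permissions.update(realm["role"].get(role, []))  # roles are lists of permissions
    return permissions


# Mirrors the example configuration (passwords omitted).
realm = {
    "role": {
        "userRole": ["vault:demo:read", "vault:demo:create"],
        "adminRole": ["vault:*:*", "archive:*:*:*"],
    },
    "group": {"customers": ["userRole"], "admins": ["userRole", "adminRole"]},
    "user": {
        "alice": {"groups": ["customers"], "permissions": ["vault:alice:*"]},
        "admin": {"roles": ["adminRole"]},
    },
}
print(sorted(effective_permissions(realm, "alice")))
```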
Password hash
A secure password-hash can be generated with the java -cp cdstar.jar de.gwdg.cdstar.auth.realm.StaticRealm
tool.
LDAP Realm
An LDAPRealm
authenticates password credentials against an LDAP server. The realm first searches for the user according to a configurable search base and filter, then tries to bind to the LDAP server using the user's password. Successfully authenticated principals are cached to speed up repeated login requests for the same user.
Configuration
realm:
ldap:
class: LDAPRealm
name: "ldap"
server: "ldaps://SERVER"
search.user: "cn=USER,ou=users,dc=example,dc=com"
search.password: "SECRET"
search.base: "dc=example,dc=com"
search.filter: "(|(uid={})(mail={}))"
attr.uid: "uid"
attr.domain: "ou"
Name | Description |
---|---|
class |
Plugin class name. Always |
name |
The name of this realm. Defaults to the value of |
server |
URL (either |
search.user |
Login |
search.password |
Password for the search agent. |
search.base |
Base |
search.filter |
Search filter used to map a login request (e.g. user name or e-mail) to a qualified user |
attr.uid |
The LDAP attribute used as the subject identifier. Note that subject identifiers must be unique and should not contain certain special characters. Defaults to |
attr.domain |
Attribute to read the principal domain from. This allows a single LDAPRealm to represent multiple principal domains. If this config value is not set, or if the attribute is not found in the ldap record, then the principal domain defaults to the realm name. (Optional) |
cache.size |
Number of recently authenticated principals to keep in memory to prevent unnecessary LDAP requests. Defaults to |
cache.expire |
Number of seconds after which a principal must be re-authenticated against LDAP. (default: 10 minutes) |
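The `search.filter` value in the example contains `{}` placeholders. Assuming these are replaced with the login name supplied by the client, the Python sketch below shows how such a template could be expanded safely. The escaping follows the LDAP filter rules of RFC 4515; the function names are hypothetical and the real LDAPRealm may behave differently.

```python
def escape_filter_value(value: str) -> str:
    """Escape characters with special meaning in LDAP filters (RFC 4515)."""
    replacements = {
        "\\": "\\5c",
        "*": "\\2a",
        "(": "\\28",
        ")": "\\29",
        "\x00": "\\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


def build_filter(template: str, login: str) -> str:
    """Insert the escaped login into every '{}' placeholder of the template."""
    return template.replace("{}", escape_filter_value(login))


print(build_filter("(|(uid={})(mail={}))", "alice@example.com"))
```

Escaping matters here because an unescaped `*` or `)` in a login name would change the meaning of the search filter.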
Warning: cache.expire
is enforced by the cache implementation, which might allow entries to survive longer than expected on Java 8 if the cache is mostly idle. If prompt expiration is important and the expiration time is very short, make sure to run on Java 9 or newer.
JWT Realm
This plugin adds support for JWT token based authentication and authorization.
Configuration
The JWTRealm
class can be configured as a realm or regular plugin and allows users to authenticate via signed JWTs.
realm:
jwt:
class: JWTRealm
default:
hmac: c3VwZXJzZWNyZXQ= # base64("supersecret")
my_issuer:
iss: https://auth.example.com/my-realm/
jwks: https://auth.example.com/my-realm/jwks.json
domain: my_realm
This plugin supports multiple JWT issuers with different settings at the same time. Tokens are matched against configured issuers based on their iss
claim. Tokens without an iss
claim or with no matching issuer configuration will be matched against the default
issuer, if defined.
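The matching rule described above can be sketched in a few lines of Python. The dictionary mirrors the example configuration, and `select_issuer` is a hypothetical helper, not the plugin's actual code.

```python
from typing import Optional


def select_issuer(issuers: dict, token_iss: Optional[str]) -> Optional[str]:
    """Return the name of the issuer configuration responsible for a token."""
    for name, settings in issuers.items():
        if token_iss is not None and settings.get("iss") == token_iss:
            return name
    # No 'iss' claim or no match: fall back to 'default', if it is defined.
    return "default" if "default" in issuers else None


issuers = {
    "default": {"hmac": "c3VwZXJzZWNyZXQ="},
    "my_issuer": {
        "iss": "https://auth.example.com/my-realm/",
        "jwks": "https://auth.example.com/my-realm/jwks.json",
        "domain": "my_realm",
    },
}
print(select_issuer(issuers, "https://auth.example.com/my-realm/"))
print(select_issuer(issuers, None))
```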
Each issuer MUST define at least one of hmac
, rsa
, ecdsa
or jwks
to be able to verify signed tokens. Unsigned tokens are not supported and will be rejected.
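For local testing it can be handy to create a token that matches an `hmac` secret. The Python sketch below signs and verifies a compact JWT with HMAC-SHA256 (`HS256`) using only the standard library. It assumes the issuer accepts HS256; which HMAC variants and which claims the realm expects depends on its configuration, and the `sub` claim used here is only an example.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url encoding without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret_b64: str) -> str:
    """Create a compact JWT signed with HMAC-SHA256."""
    secret = base64.b64decode(secret_b64)  # config stores the secret base64 encoded
    header = json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":"))
    payload = json.dumps(claims, separators=(",", ":"))
    signing_input = b64url(header.encode()) + "." + b64url(payload.encode())
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def verify_hs256(token: str, secret_b64: str) -> bool:
    """Check the signature of a compact HS256 token (claims are not validated)."""
    signing_input, _, signature = token.rpartition(".")
    secret = base64.b64decode(secret_b64)
    expected = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), signature)


token = sign_hs256({"sub": "alice"}, "c3VwZXJzZWNyZXQ=")
print(token)
```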
Param | Description |
---|---|
class |
Plugin class name. Always |
<issuer>.iss |
Expected value of the |
<issuer>.hmac |
Base64 encoded secret. Required to verify HMAC based signatures. |
<issuer>.rsa |
RSA public key (X.509). Required to verify RSA based signatures. Keys are loaded from (*.pem or *.der) files, or directly from a base64 encoded string. |
<issuer>.ecdsa |
ECDSA public key (X.509). Required to verify ECDSA based signatures. Keys are loaded from (*.pem or *.der) files, or directly from a base64 encoded string. |
<issuer>.jwks |
Path or URL pointing to a JWKS (JSON Web Key Set) file to load signing keys from. |
<issuer>.leeway |
Number of seconds to add/remove to |
<issuer>.domain |
The realm domain of the resulting principal. (default: <issuer>). |
<issuer>.trusted |
(deprecated) If |
<issuer>.permit |
A list of static StringPermissions given to all tokens created by this issuer. |
<issuer>.groups |
A list of static groups all token users are considered to be a member of. |
<issuer>.subject |
SpEL expression to derive a subject name from a token. Must evaluate to a string. (default: getString('sub')) |
<issuer>.verify.<name> |
SpEL expression (see below) to check token validity. All expressions must evaluate to |
<issuer>.groups.<name> |
SpEL expression (see below) to derive group memberships from a token. Each expression must evaluate to a string, a list of strings, or null. Null values or empty lists are ignored and will not add any groups. The expression name is just informal. |
<issuer>.permit.<name> |
SpEL expression (see below) to derive StringPermissions from a token. The expressions must evaluate to a string, a list of strings, or null. Null values or empty lists are ignored and will not add any permissions. The expression name is just informal. |
Dynamic expression rules
Because JWT is a very loose standard and the available claims may differ a lot between token providers, this plugin can verify tokens and extract information dynamically using SpEL expressions. Token claims are available as a claims
map which maps claim names to com.auth0.jwt.interfaces.Claim
instances, or via the hasClaim(name)
, getBool(name, default)
, getLong(name, default)
, getDouble(name, default)
, getString(name, default)
, getStringList(name)
, getClaim(name, type, default)
and getClaimList(name, innerType)
helper methods. These methods will return null or an empty list on any errors (missing claim, wrong type) and automatically convert between single and list claims. If a single value is requested for a list claim, the first value is returned.
realm.jwt:
class: JWTRealm
keycloak:
iss: https://auth.example.com/realms/my_realm/
jwks: https://auth.example.com/realms/my_realm/protocol/openid-connect/certs
domain: my_realm
subject: "getString('preferred_username') ?: getString('sub')"
verify.aud: "getStringList('aud').contains('my_client_id')"
groups.admin: "getBool('admin', false) ? 'admin_group' : null"
permit.vaultUser: "getStringList('usable_vaults').!['vault:#{#this}:create']"
Trusted token claims (deprecated)
If the issuer is configured with trusted: true
, then the following rules and expressions are automatically configured for a realm:
# Add `cdstar:groups` to list of groups.
groups._trusted: "getStringList('cdstar:groups')"
# Allow read access to all vaults in `cdstar:read`
permit._trusted_read: "getStringList('cdstar:read').!['vault:'+#this+':read']"
# Allow create+read access to all vaults in `cdstar:create`
permit._trusted_create: "getStringList('cdstar:create').!['vault:'+#this+':create']"
permit._trusted_create_read: "getStringList('cdstar:create').!['vault:'+#this+':read']"
# Grant all vault and archive permissions in `cdstar:grant`
permit._trusted_grant: "getStringList('cdstar:grant').?[#this.startsWith('vault:') or #this.startsWith('archive:')]"
Plugins
Plugins are optional components that extend various parts of the cdstar runtime or REST API and can be enabled on demand. Some plugins are bundled with the core cdstar distribution, others must be downloaded and unpacked into the path.lib
folder before they can be used. This chapter describes the official plugins that are tested and distributed with the core cdstar runtime and fully supported.
PushEventFilter
The PushEventFilter
sends an HTTP request to a number of configured consumer URLs whenever an archive is modified. This can be used to update external services or keep external databases in sync with the actual data within cdstar.
Failed push requests are retried to compensate for busy or temporarily unavailable consumers. If a consumer goes down for an extended time period, any push requests that failed to be delivered are persisted to disk.
Configuration
cdstar:
plugin:
push:
class: PushEventFilter
fail.log: "${path.var}/push-fail.log"
retry.max: 3
retry.delay: 1000
retry.cooldown: 60000
queue.size: 1000
http.timeout: 60000
url: http://localhost:8081/push
url.alt: http://localhost:8082/push
header.Authorization: Basic Y3VyaW91czpjYXQ=
header.X-Push-Referrer: http://push:push@localhost:8080/v3/
Name | Description |
---|---|
class |
Plugin class name. Always |
fail.log |
(optional, recommended) Path to a file where failed push requests are logged. If |
retry.max |
(default: |
retry.delay |
(default: |
retry.cooldown |
(default: |
http.timeout |
(default: |
queue.size |
(default: |
url |
URL to send push requests to. |
url.* |
Additional URLs. |
header.* |
Additional HTTP headers to send with each request. |
Push Event Consumer API
Events are sent to consumers synchronously and in the order they appear, which means that there is at most one HTTP connection per consumer at any given time. The service behind the configured URL should expect requests like the following:
POST /push HTTP/1.1
Host: localhost:8081
Content-Type: application/json; charset=UTF-8
Content-Length: 167
X-Push-Retry: 0
X-Push-Queue: 12 1000 0
X-Push-Referrer: http://push:push@localhost:8080/v3/
{
"vault" : "test",
"archive" : "b5e83cd9658f7f33",
"revision" : "0",
"parent" : null,
"ts" : 1491914254133,
"tx" : "ded6b2d4-6983-48f6-9b1f-be8225dab136"
}
Name | Description |
---|---|
X-Push-Retry |
(int) Number of previously failed attempts for this event. |
X-Push-Referrer |
(url) May be sent to tell consumers how to contact cdstar. |
X-Push-Queue |
Statistics about the event queue for this consumer. Contains three space-separated numbers:
Example: |
* |
Additional headers can be configured with |
Name | Description |
---|---|
vault |
Name of the vault. |
archive |
ID of the archive that changed. |
revision |
Revision of the changed archive, or |
parent |
Revision of the archive before the change, or |
ts |
Timestamp of the change event (milliseconds since 1970-01-01T00:00:00GMT) |
tx |
ID of the transaction this change was part of. |
A consumer may respond with 200 OK
, 202 Accepted
or 204 No Content
to signal success. The response body should be empty and other headers (including cookies) are ignored.
Redirects with 30x
response codes are followed according to normal HTTP client rules, but discouraged.
Consumers that are busy or unresponsive can answer with 503 Service Unavailable
and request a cool-down time (in seconds) using the Retry-After
header. This causes CDStar to pause the consumer and not send any more requests for the requested cool-down period. If the Retry-After
header is missing, the default retry.cooldown
is used.
Any other response as well as connection problems or timeouts are logged as warnings and the request is sent again after retry.delay
milliseconds. If a request fails more than retry.max
times in a row, it is logged as an error and the consumer is paused for retry.cooldown
milliseconds. This gives the consumer a chance to recover and also reduces logging noise considerably. Note that failing events are not discarded, but simply sent again after the cool-down. Consumers MUST return a success status if they want to drop or ignore an event. Otherwise, they will receive the same event over and over again.
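The delivery rules described above can be summarized as a small decision function. The Python sketch below is a reading of the prose, not the plugin's code. The default values are taken from the example configuration, since the built-in defaults are not stated here, and the function name is hypothetical.

```python
from typing import Optional, Tuple

SUCCESS = (200, 202, 204)


def after_response(
    status: Optional[int],
    retry_after_s: Optional[int],
    failures: int,
    retry_max: int = 3,
    retry_delay: int = 1000,
    retry_cooldown: int = 60000,
) -> Tuple[bool, int]:
    """Return (delivered, wait_ms) for one delivery attempt.

    `status` is None for connection problems or timeouts. `failures` is the
    number of consecutive failed attempts, including the current one.
    """
    if status in SUCCESS:
        return True, 0
    if status == 503:
        # Consumer asked for a pause; Retry-After is given in seconds.
        if retry_after_s is not None:
            return False, retry_after_s * 1000
        return False, retry_cooldown
    if failures > retry_max:
        return False, retry_cooldown  # too many failures in a row: cool down
    return False, retry_delay         # plain retry after the short delay


print(after_response(503, 5, 1))
print(after_response(500, None, 4))
```

In every failure case the event stays in the queue, which matches the rule that failing events are never discarded.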
Slow consumers should queue and persist events locally and answer with 202 Accepted
to prevent timeouts or events piling up too quickly. If a single request takes longer than http.timeout
milliseconds, it is aborted and tried again. If the number of waiting events exceeds queue.size
(per consumer), new events will be dropped and logged to a fail.log
file.
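A consumer that follows this advice only needs to buffer the event and answer quickly. The Python sketch below uses the standard library HTTP server for illustration. The queue size, the 30 second `Retry-After` value and all names are arbitrary assumptions, and a real consumer would persist events instead of keeping them in memory.

```python
import json
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

EVENTS = queue.Queue(maxsize=1000)  # local buffer, processed by a worker elsewhere


def accept_event(body: bytes, events: queue.Queue) -> int:
    """Queue one push event and return the HTTP status to answer with."""
    try:
        event = json.loads(body)
    except ValueError:
        return 204  # unreadable event: acknowledge it, or it is sent again forever
    try:
        events.put_nowait(event)
    except queue.Full:
        return 503  # busy: ask CDStar to pause this consumer
    return 202      # accepted for later processing


class PushHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status = accept_event(self.rfile.read(length), EVENTS)
        self.send_response(status)
        if status == 503:
            self.send_header("Retry-After", "30")  # cool-down in seconds
        self.end_headers()


def serve(port: int = 8081) -> None:
    """Run the consumer until interrupted."""
    HTTPServer(("127.0.0.1", port), PushHandler).serve_forever()
```

Calling `serve(8081)` would match the `url` from the example configuration above.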
The fail.log file
The file configured with fail.log
is used to store events that failed to be delivered. It contains one failed request per line, starting with the service URI, a single space, and the base64 encoded payload of the request. A timestamp is not logged since it can be easily recovered from the event payload itself.
fail.log entry:
http://127.0.0.1:8081/push ewogICJ2YXVsd[...]IxMzYiLAp9Cg==
The PushEventFilter
only appends to this file and there is no automatic clean-up. A warning is logged if this file is not empty at service start-up time, but there is no automatic recovery or re-queuing of events. This feature may be added in the future, though.
If you have consumers that are sensitive to lost events, make sure to check this file regularly. A short python script to re-submit events from a fail.log
is shown here:
import base64
import requests

headers = {
    'Content-Type': 'application/json'
}
with open('/path/to/fail.log') as fp:
    for lineno, line in enumerate(fp):
        # Each line holds the target URL, a space and the base64 encoded payload.
        target, payload = line.rstrip('\n').split(' ', 1)
        payload = base64.b64decode(payload)
        r = requests.post(target, data=payload, headers=headers)
        # Consumers signal success with 200, 202 or 204.
        if r.status_code in (200, 202, 204):
            print("%d SUCCESS" % lineno)
        else:
            print("%d ERROR" % lineno)
            print(r)
RabbitMQSink
This plugin emits change events to a RabbitMQ message broker.
Warning
|
This plugin is experimental. |
Configuration
Param | Type | Description |
---|---|---|
class |
str |
Always |
broker |
URI |
RabbitMQ transport URI to connect to, including authentication parameters and virtual node, if necessary. |
exchange.name |
str |
Name of the exchange to publish to. |
exchange.type |
str |
Type of the exchange (e.g. |
qsize |
int |
Size of the in-memory send-queue (default: 1024). |
Reliability
Events are buffered in an in-memory send-queue and re-queued on any errors. This helps to compensate short event bursts, temporary network failures or broker restarts.
Events that cannot be queued or re-queued are logged and dropped. This may happen during shutdown phase or when the send-queue overflows.
Events are not part of the transaction logic (yet). A forced shutdown or crash will lose all messages in the send-buffer. Also note that the broker itself may drop messages for various reasons, depending on its configuration. The possibility of losing events MUST be considered when using this plugin.
Embedded ActiveMQ Message Broker
This plugin emits change events to an embedded ActiveMQ message broker.
Warning
|
Embedding an ActiveMQ broker is fine for small to medium setups with low traffic and private networks. For production environments it is usually better to run a dedicated message broker with proper configuration and switch to the cdstar-activemq-sink or cdstar-rabbitmq-sink plugin.
|
Configuration
Param | Type | Description |
---|---|---|
transport.<name> |
URI |
Network transports to bind to. See ActiveMQ docs for available protocols and URI parameters. The This plugin bundles all dependencies needed for The auto transport accepts Default: |
topic |
list(str) |
Change events are sent to the given topics. (Default: |
queue |
list(str) |
Same as |
buffer |
int |
Size of the send buffer. (Default: unbound) |
Change Event Sink: ActiveMQ
This plugin emits change events to an ActiveMQ message broker.
Configuration
Param | Type | Description |
---|---|---|
broker |
URI |
ActiveMQ transport URI to connect to, including authentication parameters, if necessary. This plugin bundles all dependencies needed for |
topic |
list(str) |
Change events are sent to the given topics. (Default: |
queue |
list(str) |
Same as |
qsize |
int |
Size of the send buffer. (Default: unbound) |
RedisSink
A dead simple plugin that emits change events to a redis server.
Configuration
Param | Type | Description |
---|---|---|
class |
str |
Always |
url |
URI |
A redis server or cluster URI (default: |
key |
string |
Redis key or pub/sub channel to push events to. (default: |
mode |
string |
Push mode (see below). (default: |
qsize |
int |
Maximum in-memory send-queue size. (default: |
Push modes
- RPUSH: Right-push to a redis list. (default)
- LPUSH: Left-push to a redis list.
- PUBLISH: Publish to a redis pub/sub channel.
Reliability
This sink will buffer events in a bounded in-memory queue and send them out one by one as fast as it can. Any errors (network or redis errors, buffer queue overflow) will cause events to be logged and dropped (WARN level). On shutdown, the sink tries its best to send all remaining events, but will only do so for a couple of seconds. On a crash, all queued events are lost.
Or in other words: This sink is NOT reliable in any way. Network errors or crashes will cause events to be lost. On the plus side, this sink will not slow down cdstar if the redis server fails.
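The trade-off described here (never block the producer, drop events instead) can be illustrated with a bounded queue. The following Python sketch is a conceptual model only. The actual sink is part of the Java code base, and the class and method names below are hypothetical.

```python
import logging
import queue

log = logging.getLogger("sink")


class DroppingBuffer:
    """Bounded send-queue that drops events instead of blocking the producer."""

    def __init__(self, qsize: int):
        self._queue = queue.Queue(maxsize=qsize)
        self.dropped = 0

    def offer(self, event) -> bool:
        """Queue an event; log and drop it if the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            log.warning("send-queue full, dropping event: %r", event)
            return False

    def drain(self, send) -> int:
        """Send all queued events one by one; failed events are dropped."""
        sent = 0
        while True:
            try:
                event = self._queue.get_nowait()
            except queue.Empty:
                return sent
            try:
                send(event)
                sent += 1
            except Exception:
                self.dropped += 1
                log.warning("send failed, dropping event: %r", event)


buffer = DroppingBuffer(qsize=2)
for name in ("a", "b", "c"):
    buffer.offer(name)  # "c" does not fit and is dropped
print(buffer.drain(print), buffer.dropped)
```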
Search Proxy
This plugin installs a SearchProvider
that forwards search requests to an external search gateway, using a simple HTTP protocol as described below.
To simplify gateway development and improve security, client credentials are NOT forwarded to the gateway. CDSTAR will authenticate and resolve client credentials before the search is forwarded, and only provide principal name and group memberships to the gateway. This enables user-specific searches without exposing client credentials to an external service.
Configuration
plugin:
search:
class: ProxySearchPlugin
target: "https://gateway.example.com/search"
maxconn: 16
header:
X-Custom-Header: value
Name | Description |
---|---|
class |
Plugin class name. Always |
name |
The name of this provider. Defaults to the value of |
target |
URL to send search requests to. The target URL may contain authentication info. |
maxconn |
Maximum number of concurrent search requests (default: 10) |
header.<name> |
Additional HTTP headers to attach to each request. |
Search gateway API
The search gateway should accept POST requests at the configured target URL with Content-Type: application/json
and return results in the same format as the CDSTAR v3 search API. Search queries will be sent as JSON documents with the following fields:
Name | type | Description |
---|---|---|
q |
string |
User provided search query. |
fields |
array(string) |
An array of additional fields that should be returned with each hit. (optional) |
order |
array(string) |
User provided order criteria as a list of field names to order by, each optionally prefixed with |
limit |
int |
User provided limit for results per page. (optional) |
scroll |
string |
User provided scroll handle. (optional) |
vault |
string |
Name of the vault this search is performed on. |
principal |
object |
Security context for this search request. If missing or null, assume an unauthenticated user. |
principal.name |
string |
Name (including domain) of the user performing the search. (optional) |
principal.groups |
array(string) |
List of groups the searching user belongs to. (optional) |
principal.privileged |
boolean |
If true, assume the user can see all results. (default: false) |
The q
, fields
, order
, limit
and scroll
fields correspond to the (cleaned up) user provided search parameters as defined by the CDSTAR search API. vault
and principal
are added by CDSTAR. The search target should limit search results to entities visible to the specified principal
. If no principal is present (null, missing or empty), the search should only return publicly visible results. If principal.privileged
is true, the search should not filter by visibility and return all matching results.
POST https://gateway.example.com/search
Content-Type: application/json
{
"q": "search query",
"order": ["-score"],
"limit": 100,
"fields": ["meta.dc:title"],
"vault": "myVault",
"principal": {
"name": "alice@realm",
"groups": ["users@realm"],
"privileged": false
}
}
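On the gateway side, the `principal` object drives the visibility filter. The Python sketch below shows one possible check. It assumes that each indexed document carries a `public` flag and a `readers` list of user and group names; both fields are hypothetical and depend entirely on how the gateway indexes its data.

```python
from typing import Optional


def visible_to(doc: dict, principal: Optional[dict]) -> bool:
    """Decide whether a search result may be shown to the given principal."""
    if not principal:
        return doc.get("public", False)  # null, missing or empty: public results only
    if principal.get("privileged", False):
        return True                      # privileged users see all results
    if doc.get("public", False):
        return True
    names = set(principal.get("groups") or [])
    if principal.get("name"):
        names.add(principal["name"])
    return bool(names & set(doc.get("readers", [])))


docs = [
    {"id": "1", "public": True},
    {"id": "2", "readers": ["alice@realm"]},
    {"id": "3", "readers": ["admins@realm"]},
]
alice = {"name": "alice@realm", "groups": ["users@realm"], "privileged": False}
print([d["id"] for d in docs if visible_to(d, alice)])
```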
Security considerations
Since the search gateway is not supposed to authenticate the searching user and instead trusts the fields sent by CDSTAR, an attacker with direct access to the gateway could perform searches on behalf of another user. Make sure that the gateway is only reachable from the CDSTAR instance or is protected by HTTPS and some authentication mechanism (e.g. BASIC auth or secret headers).
Landing Page (UI)
The cdstar-ui plugin provides a very minimal browser-based UI (user interface) mounted at the /ui root path. This UI is targeted at humans and may require a modern JavaScript-enabled browser to be fully usable. The URL scheme is not defined or stable, with one exception: /ui/<vault>/<archive> will always show (or redirect to) a human-readable landing page for an archive. For non-public archives, the user may be asked to log in first.
Configuration
No configuration is necessary, but this plugin honors the global api.context setting (default: /). Setting it may be required if the service path cannot be detected automatically and assets are not loaded correctly.
plugin.ui.class: cdstar-ui
TusPlugin
The TusPlugin installs a tus.io compatible REST endpoint to upload temporary files, and a way for other APIs to reference these files via server-side data streams. This helps clients upload large files over unreliable network connections, or parallelize uploads of multiple files for the same archive.
Tip
|
TUS will NOT improve upload speed or throughput over stable network connections. The fastest and most efficient way to upload large files to cdstar is via Upload file. The best way to upload many small files to cdstar is via Update Archive. Only use TUS if uploads need to be resumable or you want to import the same file multiple times. |
Configuration
There is currently no configuration for this plugin. Uploads will be placed into ${path.var}/tus/.
plugin.tus.class: TusPlugin
Name | Description |
---|---|
class | Plugin class name. Always |
expire | Maximum number of milliseconds a TUS upload is kept on disk after the last byte was written. If the value has a suffix (S,M,H or D) it is interpreted as seconds, minutes, hours or days instead of milliseconds. (default: |
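The suffix rule for the expire value can be illustrated with a small parser. This is a sketch based only on the description above, not the plugin's actual parsing code; it assumes upper-case suffixes and integer values.

```python
# Multipliers for the suffixes named above (assumption: upper-case only).
UNITS_MS = {"S": 1_000, "M": 60_000, "H": 3_600_000, "D": 86_400_000}

def parse_expire(value: str) -> int:
    """Return the expire duration in milliseconds."""
    value = value.strip()
    if value and value[-1] in UNITS_MS:
        # Suffix present: scale the numeric part by the unit.
        return int(value[:-1]) * UNITS_MS[value[-1]]
    # No suffix: the value is already in milliseconds.
    return int(value)
```

For example, parse_expire("2H") yields 7200000 and parse_expire("1500") yields 1500.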
Usage
The tus.io compatible REST endpoint is reachable under /tus at the root level of the service (not /v3/tus but just /tus). After creating a TUS handle and uploading data following the TUS protocol, the temporary file can be referenced as tus:<tusId>, where <tusId> is the last part of the TUS handle. For example, if your TUS handle was /tus/24e533e, then the internal reference to this resource would be tus:24e533e.
Currently, only Create Archive and Update Archive support server-side imports via the fetch:<target> functionality. For example, to import a completed TUS upload into an archive, you would send fetch:/path/to/target.file=tus:24e533e as a POST form parameter. Note that the digests must still be computed, so a fetch may take just as long as uploading the file directly. TUS usually does not improve overall throughput, but may improve the reliability of large-file uploads over unreliable network connections. Use it wisely.
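As a minimal sketch, the mapping from a TUS handle to the form parameter described above could look like this. The helper names are made up for illustration, and the example only builds the form data; it does not send a request.

```python
from urllib.parse import urlencode

def tus_reference(handle: str) -> str:
    """Turn a TUS handle such as /tus/24e533e into tus:24e533e."""
    tus_id = handle.rstrip("/").rsplit("/", 1)[-1]
    return f"tus:{tus_id}"

def fetch_form(target_path: str, handle: str) -> dict:
    """Build the form field that imports a completed upload to target_path."""
    return {f"fetch:{target_path}": tus_reference(handle)}

form = fetch_form("/path/to/target.file", "/tus/24e533e")
# URL-encoded body as it would appear in a form POST.
body = urlencode(form)
```

The resulting form dictionary can be passed to any HTTP client that supports form-encoded POST requests.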
Incomplete TUS handles that do not see any new data will expire after 2 hours. Once complete, the TUS handle can be referenced for another 24 hours before it expires. Handles that are no longer needed can (and should) be removed earlier with a single DELETE request to the TUS handle.
Advanced topics
NioPool Storage
NioPool is the default StoragePool implementation for CDStar and provides transactional and robust persistence to a local or network-attached file system. It is usually bundled with the default distribution of CDStar and does not require any additional plugins.
Note
|
StoragePool is a low level interface and abstraction layer for the underlying physical storage. High level concepts (namely vaults, archives and files) map roughly to low level entities (pools, objects and resources) but should not be confused or mixed. The exact relations between high and low level concepts are described in a separate document (TODO).
|
This document describes the on-disk folder structure and index file format used by NioPool. The storage format is designed to be IO efficient and human-accessible at the same time: index files are human-readable and self-describing JSON files. In theory, all data and meta-data can be analyzed and recovered without prior knowledge or specialized software.
Folder structure
Storage objects are distributed into a directory tree with configurable depth, based on the first few character pairs of the object ID. This reduces the maximum number of inodes per directory and helps keep file system metadata cache-friendly, even for large pools with millions of objects. For a depth of d, the lookup path is computed as follows: {poolName}/{id[0:2]}/…/{id[(d-1)*2:d*2]}/{id}/. For example, given a default depth value of d=2, an object with ID 0123456789abcdef would be stored in myPool/01/23/0123456789abcdef/.
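The path rule above translates directly into a few lines of code. The function name is chosen for illustration only and is not part of NioPool.

```python
def object_path(pool_name: str, object_id: str, depth: int = 2) -> str:
    """Compute the object directory for a pool object, relative to the data root."""
    # One directory level per character pair: id[0:2], id[2:4], ...
    pairs = [object_id[i * 2:(i + 1) * 2] for i in range(depth)]
    return "/".join([pool_name, *pairs, object_id]) + "/"

print(object_path("myPool", "0123456789abcdef"))
# myPool/01/23/0123456789abcdef/
```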
Tip
|
NioPool follows symlinks, even across device borders. This makes it easy to split large repositories and distribute load across multiple file systems or storage devices.
|
All files related to a specific pool object are stored in the same folder. Each object folder contains at least a HEAD symlink pointing to the latest {revision}.json index file. This file describes the state and content of the object in human-readable form (JSON). There will be an extra index file for each revision of the object. Binary resources are stored in separate {sha256}.bin files. If object packing is enabled, some index or resource files may be bundled into packs and must be unpacked before they can be used (see below).
cdstar-home/data/vaultName/dc/64/dc64abb808e0c227/
  ./HEAD -> ./e371ce6a077f88755c1155b507b757d5.json
  ./e371ce6a077f88755c1155b507b757d5.json

cdstar-home/data/vaultName/dc/64/dc64abb808e0c227/
  ./HEAD -> ./008f113ff1579f8aed9399bf7960118f.json
  ./008f113ff1579f8aed9399bf7960118f.json
  ./e371ce6a077f88755c1155b507b757d5.json
  ./30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58.bin

cdstar-home/data/vaultName/dc/64/dc64abb808e0c227/
  ./HEAD -> ./2266dc2ff8067e68104607e750abb9d3.json
  ./2266dc2ff8067e68104607e750abb9d3.json
  ./15041131681337.pack.zip
Object index file format
Each time an object is modified, a new {revision}.json index file is created and the HEAD symlink is updated. These files contain a UTF-8 encoded JSON document describing the current state (contained resources, attributes and meta-data) of the storage object in a human-readable form.
Warning
|
Fields with null or empty values may be skipped to save space, and additional fields may be added in future versions of this implementation. Keep that in mind if you plan to parse these files with custom tools.
|
{
  "v" : 3,
  "id" : "dc64abb808e0c227",
  "rev" : "008f113ff1579f8aed9399bf7960118f",
  "parent" : "e371ce6a077f88755c1155b507b757d5",
  "type" : "application/x-cdstar;v=3",
  "ctime" : 1507979048000,
  "mtime" : 1507979048885,
  "x-cdstar:owner" : "test@static",
  "x-cdstar:mtime" : "2017-08-29T11:28:06.0722Z",
  "x-cdstar:acl:$owner" : "OWNER",
  "x-cdstar:rev" : "1",
  "resources" : [ {
    "id" : "8c5a29d5707b6927e8484e2cd5170749",
    "name" : "data/target.txt",
    "type" : "application/octet-stream",
    "size" : 1048576,
    "ctime" : 1507979048885,
    "mtime" : 1507979048885,
    "sha1" : "O3H0P/MPSxW1zYXdnpXrx+hOtaM=",
    "sha256" : "MOFJVevxNSJm3C/4Bn5oEEYH51CrudOzZYK4r5Cfy1g=",
    "md5" : "ttgbNgpWctgMJ0MPORU+LA=="
  } ]
}
Name | Type | Description |
---|---|---|
v | int | Format version. Defaults to |
id | String | Pool object ID. Should be the same as the containing directory name. |
rev | String | Revision string. Should match the file name. |
parent | String | Revision string of the parent revision. This field can be used to traverse the revision history of an object. May be |
type | String | Application defined mime-type. May be |
ctime | long | Date and time of object creation (Unix epoch, millisecond resolution). |
mtime | long | Date and time of last modification (Unix epoch, millisecond resolution). |
x-{key} | String | Custom application defined key/value pairs. |
resources | Array | Unordered list of resource records (see below). May be empty, |
Name | Type | Description |
---|---|---|
id | String | Unique resource identifier. This string is unique per object, not globally. |
name | String | Application defined resource name. This should be unique per object, but uniqueness is not enforced. May be |
type | String | Application defined content-type. May be |
enc | String | Application defined content-encoding. May be |
size | Long | Size of resource binary data in bytes. |
ctime | Long | Date and time of resource creation (Unix epoch, millisecond resolution). |
mtime | Long | Date and time of last modification (Unix epoch, millisecond resolution). |
src | String | External location identifier for the resource binary content. May be |
md5 | Base64 | MD5 hash of the resource content as a base64 string. May be |
sha1 | Base64 | SHA-1 hash of the resource content as a base64 string. May be |
sha256 | Base64 | SHA-256 hash of the resource content as a base64 string. |
x-{key} | String | Custom application defined key/value pairs. |
Dates are stored as Unix epoch timestamps with millisecond resolution (signed long integer). While not directly human-readable, these are easily recognized and are a very common exchange format for points in time. Most programming languages provide built-in tools to translate an epoch timestamp into a human-readable form.
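As an illustration, a millisecond epoch timestamp from an index file can be converted with the Python standard library. The helper name is made up for this example; integer arithmetic is used to avoid floating point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_epoch_millis(millis: int) -> datetime:
    """Convert a millisecond-resolution Unix timestamp to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)

# The mtime value from the example index file shown earlier.
print(from_epoch_millis(1507979048885).isoformat())
# 2017-10-14T11:04:08.885000+00:00
```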
Resource default location
By default, the uncompressed binary content of a non-empty resource is stored in the object directory as a {sha256}.bin file named after the lower-case hex encoded sha256 digest of its content. These files always end in .bin regardless of their actual content-type. If this file is missing, the resource may either have been packed (see "Object Packing") or externalized (see "External resources") and additional steps are required to recover the binary content of the resource.
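Because the index file stores digests as base64 strings while file names use lower-case hex, locating a resource file by hand requires a conversion. A minimal sketch, with a helper name chosen for this example:

```python
import base64

def resource_file_name(sha256_b64: str) -> str:
    """Map the base64 sha256 value of a resource record to its default file name."""
    # bytes.hex() produces lower-case hex digits.
    return base64.b64decode(sha256_b64).hex() + ".bin"

# The sha256 value from the example index file shown earlier.
print(resource_file_name("MOFJVevxNSJm3C/4Bn5oEEYH51CrudOzZYK4r5Cfy1g="))
# 30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58.bin
```

The result matches the .bin file in the example directory listing in the folder structure section.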
External resources
If the src field of a resource record is set, the corresponding {sha256}.bin resource file is subject to garbage-collection and may be removed at any time. In this case, the value of the src field should contain enough information to recover the resource file manually or with the help of an application-specific process. The src field MUST start with a prefix defined in this document, or with x- followed by an application defined location hint (e.g. a URI).
Object Packing (not implemented)
Resource files in an object directory may be bundled into one or more *.pack.zip files to save inodes and disk space. Compression can also help reduce IO pressure on the storage device in exchange for higher CPU usage during read access. This trade-off may be beneficial, in particular for rarely accessed objects or resources with highly compressible content.
Resources stored in a pack have a src value of pack:<pack-file-name> and follow default naming rules ({sha256}.bin) within the pack file.
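A custom recovery tool could classify the src field of a resource record roughly as follows. This is a sketch derived from the rules above, not part of NioPool: the function name and return values are hypothetical, and pack: is the only prefix this section defines.

```python
def classify_src(src):
    """Return (kind, detail) describing where the resource content is expected."""
    if not src:
        # No src field: content lives in the default {sha256}.bin file.
        return ("local", None)
    if src.startswith("pack:"):
        # Content is bundled in the named pack file.
        return ("pack", src[len("pack:"):])
    if src.startswith("x-"):
        # Application defined location hint follows the x- prefix.
        return ("external", src[len("x-"):])
    raise ValueError(f"unknown src prefix: {src!r}")
```

For example, classify_src("pack:15041131681337.pack.zip") returns ("pack", "15041131681337.pack.zip").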
Note
|
The zip format allows fast lookup and random access to individual files. Other common packaging formats (e.g. tar) require linear scans in order to find a specific file. The drawbacks of the zip format (e.g. low resolution timestamps or file name limitations) are negligible as this information is also present in the object index file. |
Temporary data
NioPool may create temporary .tmp files or directories within an object directory. These may contain data required for recovery, so do not delete these files after an unclean shutdown or while the service is running. Temporary files that remain after an ordinary shutdown can be removed.
Locking, concurrency control and transactional storage
Any actor that creates or removes files other than *.tmp in an object directory, or intends to change the target of the HEAD symlink, MUST acquire a HEAD_NEXT file lock before doing so. The HEAD_NEXT file SHOULD be a symlink pointing to a (possibly not yet created) index file. To change the HEAD link, make sure that the HEAD_NEXT target exists and is synchronized to disk, then move-and-replace HEAD_NEXT to HEAD. Any error during this sequence should result in a dangling HEAD_NEXT symlink that protects the object from further manipulation until manual or automatic recovery succeeds. In a disaster situation, either HEAD or HEAD_NEXT (or both) exist and the object can be rolled back or committed manually.
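The commit sequence described above might look like this in a custom tool. This is a simplified sketch, not NioPool's implementation: it assumes that exclusively creating the HEAD_NEXT symlink serves as the lock, that the file system supports symlinks and atomic rename, and it omits recovery handling. The function name is hypothetical.

```python
import json
import os

def commit_revision(obj_dir: str, rev: str, index: dict) -> None:
    """Write a new index file and switch HEAD to it."""
    head = os.path.join(obj_dir, "HEAD")
    head_next = os.path.join(obj_dir, "HEAD_NEXT")
    target = f"{rev}.json"

    # Acquire the lock: fails with FileExistsError if HEAD_NEXT already exists.
    os.symlink(target, head_next)

    # Create the index file and make sure it is synchronized to disk.
    with open(os.path.join(obj_dir, target), "w", encoding="utf-8") as fh:
        json.dump(index, fh)
        fh.flush()
        os.fsync(fh.fileno())

    # Move-and-replace HEAD_NEXT to HEAD (renames the symlink itself).
    os.replace(head_next, head)
```

If writing the index file fails, HEAD_NEXT is deliberately left in place, so later calls fail until the object has been recovered.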
Tip
|
Some file systems do not implement an atomic move-and-replace operation. In this case, HEAD must be removed before HEAD_NEXT can be renamed. Clients may try to access HEAD in the short time span when it does not exist. Robust implementations should simply retry a couple of times.
|
Configuration
StoragePool configuration is stored by CDSTAR in a vault.yaml file within the pool base directory and can be bootstrapped during vault creation with predefined parameters. NioPool supports the following configuration parameters:
Configuration parameters
Name | Type | Description |
---|---|---|
path | String | Path to the vault base directory (required, default: |
cacheSize | int | Number of manifests to keep in an in-memory cache for faster load times. |
autotrim | bool | If enabled, schedule a garbage collection run after each successful commit for each modified object. |
digests | str | Comma separated list of digests to calculate. |
Storage Profiles
CDSTAR supports and integrates third-party long-term storage systems (LTS, e.g. tape libraries) via storage profiles. From the user's perspective, a storage profile defines where and how data should be stored. By assigning a storage profile to a CDSTAR archive, the user can control data migration to and from LTS in a coherent, safe and predictable way. The actual data migration happens in the background and is fully managed by CDSTAR.
Profile mode: HOT vs. COLD
Storage profiles can be either "hot" or "cold", which changes the way CDSTAR handles its local data.
Hot profiles cause CDSTAR to copy the archive content to external storage, but keep all data available in CDSTAR as well. While the profile is in effect, only administrative metadata (owner, ACLs, storage profile, …) can be modified. The actual content (files and metadata) is write-protected to prevent stale LTS copies.
Cold profiles, on the other hand, allow CDSTAR to reclaim disk space by deleting archive files from disk after a copy was stored externally. Metadata is still kept available, but file content can no longer be accessed through CDSTAR. The profile needs to be changed to default or a hot profile to make file content available again.
Hot profiles are meant to increase long term availability or data integrity guarantees by storing important data in a second location. Cold profiles are mostly used to store large amounts of rarely accessed data in a more cost-effective way (e.g. on tape), while keeping meta-data search- and discoverable.
Profile configuration
Profiles can be configured globally, and enabled or disabled per vault. Currently, a profile only has a name, a mode (hot or cold) and an associated LTS target, which is configured separately as a plugin. This allows multiple profiles to reference the same LTS target, but with different configuration.
profile:
  bagit-hot:
    lts.name: bagit
  bagit-cold:
    lts.name: bagit
    lts.mode: cold
LTS target configuration
Data migration from or to third-party LTS systems is highly dependent on the system in use. Multiple implementations are available and can be loaded via the CDSTAR plugin infrastructure. CDSTAR bundles a general purpose implementation that exports to BagIt directories and allows an external process to perform the actual LTS migration asynchronously.
plugin:
  bagit:
    class: BagitTarget
    path: /path/to/store/bagit/
LTS handlers are referenced by name, so special care must be taken when removing or renaming LTS handlers. Do not remove or rename an LTS target as long as there are archives still referencing it.
How to not lose data
Moving data out of the CDSTAR system, especially with cold profiles, bears some risks that should be well understood before enabling the LTS feature. Please read this chapter carefully.
After a successful migration to an LTS target, CDSTAR stores the LTS name and a unique location identifier (generated by the LTS) into non-public archive properties. These are used to recover missing files in case of a future profile change. Cold profiles allow CDSTAR to remove local copies of archived files after successfully copying these files to LTS. If the LTS goes away, for whatever reason, then CDSTAR has no way to recover missing files and the archive is stuck in cold state. File content will be unavailable and data migration after profile changes will fail.
- Do not remove or rename an LTS target as long as there are archives still referencing it.
- When updating LTS Plugins or changing configuration, ensure that existing location identifiers remain valid.
- Monitor CDSTAR logs for failed migrations.
BagIt LTS Target
This LTS target exports archives into BagIt folders, and is designed to work with external worker processes for the actual migration from/to LTS storage (e.g. tape).
The exporter will create a BagIt package in a temporary folder, then rename it to [name].bagit with a unique name. A worker process may check for these folders and copy or move data to LTS.
The importer will create a file named [name].want and start the import as soon as the [name].bagit folder can be found. A worker process should check for these [name].want files and recover the missing [name].bagit folder from LTS. Once complete, the importer will delete the [name].want file and the recovered [name].bagit folder can be cleaned up by the worker.
If the external copy is no longer needed, a [name].delete file is created. A worker process should watch for these files, remove the external copy (if any), remove the [name].bagit directory (if present), and then also remove the [name].delete file.
External workers are allowed to create additional files for their own state handling, as long as they do not interfere with the names defined here.
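One pass of such an external worker could be sketched as follows. This is an illustration of the marker-file protocol above, not a bundled tool: the restore and remove_external callbacks stand in for site-specific LTS access, and the [name].restoring staging folder is an assumption made here so that the importer never sees a partially restored [name].bagit folder.

```python
import shutil
from pathlib import Path

def process_markers(base, restore, remove_external):
    """Handle pending [name].delete and [name].want markers once.

    restore(name, target_dir): hypothetical callback that recreates the bag in target_dir
    remove_external(name):     hypothetical callback that deletes the LTS copy
    """
    base = Path(base)

    for marker in sorted(base.glob("*.delete")):
        name = marker.name[: -len(".delete")]
        remove_external(name)                                       # external copy, if any
        shutil.rmtree(base / f"{name}.bagit", ignore_errors=True)   # local bag, if present
        marker.unlink()                                             # finally the marker itself

    for marker in sorted(base.glob("*.want")):
        name = marker.name[: -len(".want")]
        bag = base / f"{name}.bagit"
        if bag.exists():
            continue  # already recovered, the importer will pick it up
        staging = base / f"{name}.restoring"  # worker-private name (assumption)
        restore(name, staging)
        staging.rename(bag)
        # The [name].want file is left in place; the importer deletes it.
```

A real worker would run this periodically and add error handling and logging around the callbacks.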
Archive Snapshots
Archive snapshots are an efficient way to preserve the current payload of an archive without actually creating a full copy. They can be used to implement versioning, tag important milestones or create immutable and citeable releases for publishing.
From a user's perspective, snapshots are virtual read-only archives that represent the payload of their source archive at a specific point in time. The payload of a snapshot will not change if the source archive is modified. Other aspects however, most notably owner and access control information, are transparently inherited from the source archive and will change if the source archive changes. One exception is the storage profile, which can be changed on a per-snapshot basis independent of the source archive. See Storage Profiles for details.
Once created, most read-only operations that work on an archive are also available for snapshots. In the REST API, snapshots are referenced by the source archive name, followed by an @ character and the snapshot name. For example, GET /v3/somevault/ab587f42c257@v1/data.csv would fetch a file from the v1 snapshot instead of the current archive state. Details are explained in the REST API documentation.
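For client code, the snapshot addressing scheme above amounts to a simple path template. The helper below is illustrative only and does not escape special characters in names.

```python
def snapshot_file_path(vault: str, archive: str, snapshot: str, file_name: str) -> str:
    """Build the REST path of a file inside a snapshot."""
    return f"/v3/{vault}/{archive}@{snapshot}/{file_name}"

print(snapshot_file_path("somevault", "ab587f42c257", "v1", "data.csv"))
# /v3/somevault/ab587f42c257@v1/data.csv
```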
Sparse Copies and Deduplication
At the storage level, snapshots live in separate storage objects, but are created in a way that allows them to share common data files with their source archive or other snapshots, if supported by the storage back-end. This ensures that snapshots only take up a minimum amount of additional storage space and are usually far more efficient than copying an entire archive. NioPool implements this at the file-system level by hard-linking files with the same content, and only creating a copy if content changes (copy-on-write semantics).
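The hard-linking idea can be demonstrated with plain file system calls. This sketch only illustrates the principle and is not NioPool's code: it assumes both directories are on the same file system and that resource files use the .bin naming described for NioPool.

```python
import os

def link_resources(src_dir: str, dst_dir: str) -> None:
    """Share all resource files of src_dir with dst_dir via hard links."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if name.endswith(".bin"):
            # Both names now refer to the same data on disk; no bytes are copied.
            os.link(os.path.join(src_dir, name), os.path.join(dst_dir, name))
```

Hard-linked files must never be modified in place, because the change would show up under every name. Writing changed content to a new file instead leaves the linked copy untouched.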