Introduction

DuraCloud Preserve is a project designed to make using AWS S3 as simple as possible for users who only want to upload files or integrate S3 storage with other applications, without having to worry about esoteric configuration or infrastructure management. It also supports digital preservation use cases by managing the S3 configuration and features that support long-term access to and preservation of files.

The goal is to make it easy for users to choose any off-the-shelf S3 client and interact with S3 while gaining more advanced features by default. These features are described in more detail throughout the user and technical documentation, but in brief: versioning, inventory, replication, logging, and so on are enabled as buckets are created, without the user having to do anything in AWS.

Checksum verification is performed periodically to ensure that file integrity is maintained between the primary and replicated (backup) files. This builds upon the already impressive levels of durability that S3 provides by adding a recurring guarantee that files are what they are intended to be.

Additional features include generating manifest (file inventory) and storage reports, and user access control via preconstructed groups that are scoped to stacks. When deployed, every resource is created “within” a stack. A stack is simply a resource naming prefix and tag applied to all resources managed by the deployed components, associating them exclusively with that stack. This makes it possible to have multiple stacks within a single account and for different users to belong to one or more stacks.

Lyrasis provides a hosting service for DuraCloud Preserve that handles AWS account creation and installation and includes access to a web-based UI for S3 using SFTPGo. S3 can then be accessed through the web UI or with direct AWS access credentials for broader integrations or for use with tools like the AWS CLI.

AWS resources used:

Context

DuraCloud Preserve is a continuation of the DuraCloud project in a form that is intended to be more sustainable for the long term. It does this by focusing on the core mission of DuraCloud but with a significantly smaller technical footprint, made possible by leveraging AWS S3 features directly in contrast to the more abstracted approach that DuraCloud took in being open to support multiple backend storage providers.

But the goals remain the same: in the digital era, ensuring that critically important documents remain safe and available is a continual challenge. Physical computing hardware that is used to create and store documents can fail or become obsolete very quickly, providing a need for tools to ensure that these documents remain available. DuraCloud Preserve aims to address these concerns:

  • How do I upload files to the storage service in a simple and reliable way?
  • How do I ensure that the storage service that I am using receives a copy of my local files?
  • How do I ensure that files remain intact over time?
  • How do I retrieve my content once it is stored?
  • How do I recover a file if it has been overwritten or corrupted?
  • How do I make my content publicly accessible at a stable URL?
  • How am I protected against the storage service becoming obsolete or going away?

Answers to these questions are provided throughout the rest of this documentation.

Features

This is a brief overview of the functionality that is explained more thoroughly in the user guide and technical documentation:

Access controls

Users are either standard or power users, depending on which stack-created IAM group they are assigned to.

  • Standard users can list and upload files but cannot download or delete them.
  • Power users can list, upload, download, and delete files.

Only AWS account administrators can access replicated buckets and objects.

Audit trail

Request logs are generated for each user-created bucket. This is raw AWS provided data that can be processed using tools like DuckDB.

Checksum reports

Checksum reports are generated on a configurable schedule, comparing checksums across source and replication buckets to detect corruption. Files found to be corrupt can be restored from the verified copy. See the checksum verification documentation for more details.

Choice of region

Files can be stored in any AWS region supported by the infrastructure.

CLI available

A command-line interface (dcp) is available for advanced users. It provides access to all core functions and additional maintenance commands for tasks such as checksumming local files, reconciling bucket configuration, and transferring data between buckets.

Hosting and support

If creating an AWS account and deploying resources to it is not possible then Lyrasis provides a hosting and technical support option to handle the infrastructure for you.

Inventory

A file manifest is generated for each user-created bucket. The raw AWS inventory data is available in Parquet format, and a consolidated, user-friendly CSV file that includes the S3 URL for each file is also made available.

Lifecycle transitions

Files are uploaded to the standard storage tier and transition to a selected storage class after a configurable interval, which can be specified for each stack deployment. Old versions of files and aborted multipart uploads are automatically deleted after a configurable period.

Manifest reports

A consolidated, human-readable CSV file is generated per bucket, listing all files with metadata including S3 URL, size, storage class, and last modified date.

Public access via CDN (Content Delivery Network)

A CloudFront distribution and bucket are created that can be used to make files publicly available. Simply upload files to the bucket and share the public URL using a specified domain.

Other buckets can be created as publicly accessible by naming them with a -public suffix. Files uploaded to such buckets will be available using a standard, unauthenticated S3 URL.

Files in public buckets are stored in the Intelligent-Tiering storage class and are not transitioned to Glacier; however, replication still occurs and the backup copies are stored in Glacier.

Reconciliation reports

The reconciliation report is used to detect drift in bucket configuration, providing reassurance that buckets are configured correctly and working as expected.

Replication

Files for all buckets are replicated to Glacier Deep Archive. These files are included in the checksum verification process to determine file integrity. We have dedicated documentation for how this works.

Storage reports

An HTML storage report is generated showing usage statistics across all buckets in the stack, including total file counts and storage consumed by bucket and top-level prefix. It also includes the year-to-date total of data transfer out from S3 to the internet (requires Cost Explorer to be enabled in AWS, and an active Stack cost allocation tag).

Versioning

Bucket versioning is enabled. This supports file restore for up to a configurable number of days post update which can be specified for each stack deployment.

Web UI integration with SFTPGo

The application and deployment tooling support SFTPGo integration, which provides a web-based interface for S3. Users can be created pre-configured with appropriate access (per the access controls assigned to them), and the SFTPGo user account is kept in sync as buckets are created, or via the dcp CLI.


General integrations

Web applications that support use of Amazon S3 for storage

Any application or framework that can be configured to use Amazon S3 for storage can work with DuraCloud Preserve. By simply using a bucket created as part of a DuraCloud Preserve stack, files will be stored with the additional benefits outlined in this documentation, including versioning, replication, and checksum verification.

Some specific examples:

Lyrasis service integrations

ArchivesSpace

ArchivesSpace itself does not manage digital content and provides no way to upload files. The public URLs provided by the DuraCloud Preserve CloudFront-enabled bucket can be used to host files that are referenced in Digital Objects using the File URI field, making them openly accessible on the internet.

CollectionSpace

Refer to the roadmap for any upcoming work.

DSpace

The Replication Task Suite is a plugin for DSpace that adds preservation capabilities accessible from the DSpace user interface. It creates self-contained archival information packages that back up DSpace items and are periodically transferred to external storage, including Amazon S3. Using a DuraCloud Preserve created bucket for this works the same way as using S3 for the DSpace Storage Layer (assetstore). If both are configured this way, files gain a dual layer of protection, as both the assetstore and the archival packages benefit from versioning, replication, checksum verification, and so on.

Other integrations

Archive-It

Create an inventory and a backup of WARC files retrieved from the Internet Archive - Archive-It service.

Checksum Verification

DuraCloud Preserve stores and replicates files using Amazon S3. Checksum verification is the process by which the system confirms that stored files have not been silently corrupted over time. Even in highly durable storage systems, subtle errors (known as “bit rot”) can alter file content without any obvious warning. By regularly comparing checksums across independent copies of each object, the system can detect and remediate corruption before it affects both copies.

How It Works

1. Upload Integrity

AWS S3 provides integrity guarantees at the point of upload. Using built-in integrity checking mechanisms, S3 validates received data and rejects any upload where the computed checksum does not match. A successful upload response from S3 confirms that the stored object matches exactly what was transmitted.

The system’s integrity guarantee begins at this point of successful upload.

The checksum and version of any stored object can be retrieved using the AWS CLI:

aws s3api head-object --bucket ${bucket} --key ${key} --checksum-mode ENABLED

Example response:

{
    "AcceptRanges": "bytes",
    "LastModified": "2026-01-24T00:22:19+00:00",
    "ContentLength": 15310515,
    "ChecksumCRC64NVME": "V+va1ramtYo=",
    "ChecksumType": "FULL_OBJECT",
    "ETag": "\"822f9ffde463633f9a56df6d90b1dbb6\"",
    "VersionId": "HnU.prnfFqU2oJKqjIibty9_cet6zTDH",
    "ContentType": "application/pdf",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "StorageClass": "GLACIER_IR",
    "ReplicationStatus": "COMPLETED"
}

2. Replication

After a successful upload, AWS S3 replication creates a copy of the object in a second independent bucket, typically within 15 minutes. The same upload integrity guarantees apply to replication, ensuring the replicated object is an exact copy of the source.

The checksum and version ID of the replica will match the source object exactly:

{
    "AcceptRanges": "bytes",
    "LastModified": "2026-01-24T00:22:19+00:00",
    "ContentLength": 15310515,
    "ChecksumCRC64NVME": "V+va1ramtYo=",
    "ChecksumType": "FULL_OBJECT",
    "ETag": "\"822f9ffde463633f9a56df6d90b1dbb6\"",
    "VersionId": "HnU.prnfFqU2oJKqjIibty9_cet6zTDH",
    "ContentType": "application/pdf",
    "ServerSideEncryption": "AES256",
    "Metadata": {},
    "StorageClass": "GLACIER",
    "ReplicationStatus": "REPLICA"
}

Note that ChecksumCRC64NVME and VersionId are identical across both objects.

3. Durability

AWS S3 is designed for 99.999999999% (eleven nines) durability. Given S3’s upload integrity guarantees and its documented durability, uploaded and replicated objects can be considered correct and consistent at the point of replication with a very high degree of confidence.

Further reading: Durability in Amazon S3

4. Ongoing Verification

S3 Batch Operations are used to generate checksum reports across all objects in both the source and replication buckets. These reports are compared on a regular schedule.

Result                               | Meaning
Version ID and checksum match        | Verification successful: objects are identical
Version ID or checksum do not match  | One object may be corrupted: investigation required
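
The decision rule above can be sketched as a small shell function. This is an illustration only, not how DuraCloud Preserve implements its comparison: compare_copies and its four arguments are hypothetical, and the values are assumed to have been read from the two checksum reports or from head-object output.

```shell
# Hypothetical helper: classify a source/replica pair using the rule above.
# Arguments: source version ID, source checksum,
#            replica version ID, replica checksum.
compare_copies() {
    if [ "$1" = "$3" ] && [ "$2" = "$4" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Values taken from the example responses earlier on this page.
compare_copies "HnU.prnfFqU2oJKqjIibty9_cet6zTDH" "V+va1ramtYo=" \
               "HnU.prnfFqU2oJKqjIibty9_cet6zTDH" "V+va1ramtYo="
```

A "mismatch" result only says that the two copies differ; it does not say which copy is the corrupt one.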

If a Mismatch Is Detected

If verification finds that checksums do not match, the following steps identify and repair the corruption.

Step 1 — Check prior reports. A previously generated checksum report may already contain the expected checksum values, making it straightforward to determine which copy — source or replica — is corrupt.

Step 2 — Request object metadata. If no prior report is available, retrieve the stored checksum value and version ID directly from each object’s metadata and compare them:

aws s3api head-object --bucket ${bucket} --key ${key} --checksum-mode ENABLED

Step 3 — Download and verify locally. For a more thorough inspection, download the objects and compute checksums locally using the same algorithm S3 uses (CRC-64/NVME by default):

# Retrieve the stored checksum
aws s3api head-object --bucket ${bucket} --key ${key} --checksum-mode ENABLED

# Download the file
aws s3 cp s3://${bucket}/${key} .

# Compute the checksum locally using the DuraCloud Preserve CLI
dcp checksum --file ${key}
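
S3 returns ChecksumCRC64NVME as the base64 encoding of the 8-byte checksum. If a local tool prints the value in hexadecimal instead, one of the two values has to be converted before they can be compared. The output format of dcp checksum is not described here, so treat that as an assumption. A minimal sketch using the standard base64, od, and tr utilities, where b64_to_hex is a hypothetical helper name:

```shell
# Hypothetical helper: convert a base64 checksum, as returned by
# head-object, to lowercase hexadecimal.
b64_to_hex() {
    printf '%s' "$1" | base64 -d | od -An -tx1 | tr -d ' \n'
    echo
}

b64_to_hex "V+va1ramtYo="   # checksum from the example response
```

For the example checksum this prints 57ebdad6b6a6b58a, which is 16 hexadecimal digits (8 bytes).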

Step 4 — Restore. Once the valid copy is confirmed, re-upload it to the source bucket to repair the corrupted object.

Important

Hosted clients: Lyrasis will handle checksum verification and file restoration on your behalf if errors are found.

Learn more about Lyrasis Hosting

What Successful Verification Confirms

Successful verification confirms that the source and replica objects are identical to each other. Given S3’s upload integrity guarantees and its documented durability, this means objects are also identical to what was originally uploaded to a very high degree of confidence.

This strategy is considered sufficient for the vast majority of standard use cases. In the unlikely event that corruption is not automatically addressed by the S3 infrastructure, it is highly improbable that both independent copies would be corrupted in exactly the same way — which would be required to produce a false verification result.

For the strongest possible guarantee, independent verification using locally managed checksums is required. See Stricter Compliance Requirements below.

Checksum Reports

Checksum reports are stored in S3 for the duration of the stack’s retention policy and can be downloaded at any time.

For organizations requiring independent verification or stricter compliance, reports should be downloaded and stored locally or in a system separate from S3.

Stricter Compliance Requirements

For organizations with higher assurance requirements — such as regulated industries or formal digital preservation programs — the approach described above may not be sufficient on its own, as it is ultimately dependent on the claims of a single third-party provider (Amazon AWS). An independent audit mechanism, separate from the primary storage provider, is required for the strictest compliance standards.

Best practice for stricter compliance:

  1. Generate checksums locally before uploading. Use a tool such as QuickHash to compute a checksum for each file before it is uploaded to S3.
  2. Maintain a local checksum inventory. Keep a record of each filename and its corresponding checksum in a safe location. This inventory can be stored in S3, but must also exist independently.
  3. Verify on retrieval. When downloading a file, recompute its checksum locally and compare it against the inventory record.
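
The three practices above can be sketched with standard command-line tools instead of a GUI tool. This is one possible approach, not a prescribed one: SHA-256 and the sha256sum utility (GNU coreutils) are assumptions, and make_inventory and verify_inventory are hypothetical helper names.

```shell
# Steps 1 and 2: record a SHA-256 checksum for every file under a directory.
# $1 = directory to scan, $2 = inventory file to write (keep it outside $1).
make_inventory() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 ) > "$2"
}

# Step 3: recompute checksums after retrieval and compare with the inventory.
# $1 = directory holding the downloaded files,
# $2 = absolute path to the inventory file.
# Prints "verified" when every file matches, "failed" otherwise.
verify_inventory() {
    if ( cd "$1" && sha256sum --check --quiet "$2" ) >/dev/null 2>&1; then
        echo "verified"
    else
        echo "failed"
    fi
}

# Example usage (paths are placeholders):
#   make_inventory ./to-upload "$PWD/inventory.sha256"
#   verify_inventory ./downloaded "$PWD/inventory.sha256"
```

Any algorithm works as long as the same one is used when generating and when verifying. The inventory file itself should be kept somewhere independent of S3, as described above.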

It is also important to note that DuraCloud Preserve is entirely dependent on the Amazon AWS S3 service, its regional infrastructure, and its policies. Organizations with strict independence or sovereignty requirements should factor this into their preservation planning.

Frameworks and standards for reference:

Getting started

Whoever is responsible for deployment will provide access credentials to users. If you intend to connect directly to S3 using a GUI or CLI tool, you should receive an access key and secret, which serve as a username and password for interacting with S3. Treat them as sensitively as you would any username and password.

If you intend to use the web client, you should receive a username (your email address), a password, and the URL to log in. It’s completely fine to use both approaches if you’d like access to both.

You should also receive a stack name.

This will typically be in the form duracloud-$ID where $ID is an identifier assigned by those handling the deployment. It may be based on or similar to a sitecode used by your institution for its domain (e.g. INSTITUTION.edu).

It is important to know this because your user will only be able to interact with a subset of buckets in an AWS account that are prefixed with that stack name. You will also see references to stack name throughout the documentation.

Important

Before proceeding confirm you have received:

  • Access key (username) and secret (password) for direct s3 access if requested
  • Stack prefix (duracloud-$ID)
  • Web client username, password and url if requested

Lyrasis Hosting client permissions

Hosting clients start by identifying one user who will have power user permissions. This user will be able to upload, download, and delete files. The initial power user will need to give the Hosting team the names of other users who should have accounts and indicate whether those users should be power users or standard users who can only upload files. The Hosting team recommends limiting the number of power users per institution to 1 or 2 individuals because of the power to delete.

S3 Client Options

In order to keep things simple for the end user, less complicated to maintain on the technical side, but also provide some flexibility over how content can be uploaded to S3, there is no prescribed user interface. Any S3-compatible client can be used to interact with the tool.

We believe this is the right choice because there are many popular, well-supported, and tested options already available. However, we provide streamlined documentation for the open source program Cyberduck as a downloadable GUI option, the AWS CLI for command line usage, and the web-based client SFTPGo for the simplest access point.

Here’s a list of clients that have been used or tested by Lyrasis staff:

But there are many others and you are free to use any S3 compatible client that you prefer.

After connecting to your S3 account via your preferred method, you will see the folders already created for your account using your duracloud-$ID, including:

  • -managed
  • -public (default bucket for files that can be accessed publicly through CloudFront)
  • -request (used for making create bucket or checksum inventory requests)

AWS CLI Documentation

Step 1: Install AWS CLI

Installing or updating to the latest version of the AWS CLI

After following the instructions for your operating system, check your installation:

aws --version

Step 2: Configure Your AWS Credentials

Configuration and credential file settings in the AWS CLI

Verify your configuration:

aws sts get-caller-identity

If you have multiple AWS accounts or environments, set up a named profile and configure with your key, secret, and region (us-west-2):

aws configure --profile dcp

Setting Region for Lyrasis Hosting

If you are a Lyrasis-hosted client, the AWS region is us-west-2. You can set this in a few ways:

1. Add --region directly to the command

This is the most explicit method and overrides all other settings (profiles, config files, etc.):

aws s3 ls --region us-west-2

With a profile:

aws s3 sync ./data s3://{stackname}-bucket --profile dcp --region us-west-2

2. Set the region temporarily in your shell

This applies only to the current terminal session:

export AWS_REGION=us-west-2

Then commands can be run without specifying the region.

3. Set the region inside the profile (in the AWS CLI config file, ~/.aws/config by default)

[profile dcp]
region = us-west-2
output = json

Cyberduck Documentation

Cyberduck documentation for setting up new connections:
https://docs.cyberduck.io/cyberduck/connection/

Step-by-step Instructions

  1. File → Open Connection
  2. Change dropdown menu to Amazon S3
    • If you are a Lyrasis Hosting Services client, update Server to:
      s3.us-west-2.amazonaws.com
    • (Lyrasis Hosting currently supports us-west-2 and us-east-2)
  3. Type in provided Access Key ID and Secret Access Key
  4. Click Connect

Cyberduck Setting Up Connection

Tip

  • Click Go → Enclosing Folder to navigate up the file path tree one level at a time, or click in the filepath dropdown to navigate up multiple levels after your connection is set up.
  • Logs and other items you download will go to your Downloads folder by default. You can change this under Edit → Preferences → Transfers (General tab)

SFTPGo Documentation

Navigate to: DuraCloud Preserve

Use this web-based interface to log in, upload, and download content.

Individual users will be provided credentials by their system administrator (such as the Lyrasis Hosting team). The first time you log in, you will be asked to change your password. You can do this from the small person icon in the upper-right corner of the screen.

First login change password message

Change password screen

Upon login you will see three folders already created for you:

  • managed
  • public
  • request

You may:

  • Create new buckets by uploading a request file (see Creating Buckets)
  • Upload content to buckets (creating subfolder structures as needed)
  • Download content from buckets
  • Download reports and other hosted content from the managed bucket

Provided buckets displayed in SFTPGo web interface showing three bucket folders labeled managed, public and request with upload and download options available

Tip

Before proceeding, confirm that you are able to successfully connect to S3.

Managed Resources

When you view your S3 account using a GUI client or the AWS CLI for the first time, you will notice a number of pre-existing buckets that have been created.

Pre-Existing Buckets

  • duracloud-$ID-request: Used to make requests to create new buckets. See: Creating Buckets
  • duracloud-$ID-managed: Used to deposit generated files such as audit history, exports, inventory, and reports. This bucket is read-only.
  • duracloud-$ID-private: Default private bucket.
  • duracloud-$ID-public: Default public bucket. Files uploaded here will have a publicly accessible URL.

Managed Bucket Structure

Over time, the duracloud-$ID-managed bucket will contain the following prefixes (folders):

  • audit: AWS generated Audit logs
  • batch: AWS generated files related to S3 batch operations
  • cloudtrail: AWS generated files for events related to S3
  • feedback: Application generated files for troubleshooting issues
  • manifests: AWS generated inventory files
  • metadata: Application generated files related to various stats (checksum, usage etc.)
  • reports: Application generated files intended for user review and download

More information about the data available in the -managed bucket is available on the Reports page.

Tip

  • If the AWS account is used for other purposes, additional buckets may exist. This may also occur if there are multiple stacks per account.
  • However, the access credentials provided for this service will only work with the eligible stack resources associated with the user credentials.

Creating Buckets

Important

These instructions apply to all users, whether using Cyberduck, SFTPGo, the AWS CLI, or another S3-compatible client. The process is the same for everyone: upload a text file containing your bucket names to the duracloud-$ID-request bucket under the buckets folder. Instructions for each client are provided in the Steps section below.

Create a Bucket

To create a bucket, you must create a text file (.txt) containing the names of up to five buckets you want to create.

Naming Rules

  • Bucket names are automatically prefixed with the stack name — do not include the stack name in the file.
  • Each bucket name must be entered on its own line.
  • Bucket names may contain only alphanumeric characters and -.
  • Bucket names must not begin or end with -.
  • Bucket names must be no more than 63 characters total, including:
    • The stack name prefix (duracloud-$ID)
    • 5 reserved characters for the -repl suffix

Tip

Practically, this means your names should be no more than: 63 - 5 - (length of your duracloud-$ID)

Public Bucket Naming

To create a publicly accessible bucket, the name must end with -public.

  • This subtracts an additional 7 characters from the maximum length.
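
The length limits above can be worked through for a concrete stack name. The sketch below follows the documented formula exactly as written; max_bucket_name_length is a hypothetical helper and duracloud-example is a made-up stack name.

```shell
# Hypothetical helper: longest name allowed in the request file, using the
# documented formula 63 - 5 - (length of the stack name).
max_bucket_name_length() {
    stack="$1"
    echo $(( 63 - 5 - ${#stack} ))
}

stack_name="duracloud-example"   # made-up stack name, 17 characters
max_len=$(max_bucket_name_length "$stack_name")

echo "Maximum name length: $max_len"                    # 63 - 5 - 17 = 41
echo "Maximum length before -public: $(( max_len - 7 ))"   # 41 - 7 = 34
```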

Reserved Prefixes and Suffixes

The following cannot be used in bucket names:

  • duracloud- — already included as the foremost prefix
  • -logs — used for access logging buckets
  • -managed — used for system-managed buckets (reports, logs, and other system data appear here)
  • -repl — used for replication target buckets (Amazon Glacier replication)
  • -request — used for bucket request files
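
Names can be checked against these rules before they are added to a request file. The following is a rough sketch rather than the validation the service itself performs: check_bucket_name is a hypothetical helper, it treats the reserved values strictly as a leading prefix or trailing suffix (an assumption), and it does not check length.

```shell
# Hypothetical helper: check one requested bucket name against the naming
# rules and the reserved prefixes and suffixes listed above.
# Prints "ok" or the reason the name is rejected.
check_bucket_name() {
    case "$1" in
        "")
            echo "invalid: empty name" ;;
        *[!A-Za-z0-9-]*)
            echo "invalid: only alphanumeric characters and - are allowed" ;;
        -*|*-)
            echo "invalid: must not begin or end with -" ;;
        duracloud-*)
            echo "invalid: do not include the duracloud- prefix" ;;
        *-logs|*-managed|*-repl|*-request)
            echo "invalid: reserved suffix" ;;
        *)
            echo "ok" ;;
    esac
}

check_bucket_name "images-public"
check_bucket_name "photos-repl"
```

The first call prints ok and the second prints invalid: reserved suffix.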

Steps

  1. Open a text editor (such as Notepad or Notepad++) and create a file containing your bucket names, one per line. Save it as a .txt file.

  2. Upload the file to the duracloud-$ID-request bucket, inside the buckets folder.

    • If the buckets folder does not exist then create it first.
    • Buckets can only be created from files uploaded to the buckets folder in the request bucket.

Cyberduck

  1. Connect to your S3 account (see Connecting to S3).
  2. Navigate to the duracloud-$ID-request bucket.
  3. If a buckets folder does not exist, create one: Action → New Folder.
  4. Open the buckets folder and drag your .txt file into the Cyberduck window, or click Upload to browse for it.
  5. Cyberduck will show a transfer log confirming the upload.

Tip

When re-using the same file with updated bucket names (Step 8 below), Cyberduck may ask you to confirm overwriting the existing file. Confirm to proceed.

SFTPGo

  1. Log in to the SFTPGo web interface (see Connecting to S3).
  2. Navigate to your home folder. You will see managed and public folders — do not upload to these. Instead, navigate back to the root or look for a request folder corresponding to duracloud-$ID-request.
  3. If a buckets folder does not exist inside the request area, click New Folder to create it.
  4. Open the buckets folder, then click Upload Files or drag your .txt file into the upload area.
  5. Click Save to complete the upload.

AWS CLI

aws s3 cp mybuckets.txt s3://duracloud-$ID-request/buckets/mybuckets.txt

  3. The file will be processed in the background and an attempt will be made to create each bucket.
    • Processing normally takes 0–2 minutes.
  4. A report file will be uploaded to the feedback folder inside the -managed bucket, providing details about the outcome.
  5. Review the report when it becomes available.
  6. Refresh your client view or reconnect to S3.
    • Successfully created buckets will now be visible.
    • Each new bucket will have an associated replication bucket with a -repl suffix.
    • Replication buckets are list-only (files cannot be downloaded).
  7. The newly created buckets are now usable, and files can be uploaded.
  8. To create more buckets:
    • Re-use and re-upload the same file with new bucket names, or
    • Create and upload an entirely new file. Both approaches work.
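
The request file from Step 1 can also be prepared from the command line. In this sketch the bucket names images and documents-public are placeholders, and the count check simply mirrors the five-bucket limit described above; it is not part of the service.

```shell
# Write the requested names, one per line, without the stack name prefix.
# "images" and "documents-public" are placeholder bucket names.
printf '%s\n' "images" "documents-public" > mybuckets.txt

# A request file may list at most five buckets; count the non-empty lines.
count=$(grep -c . mybuckets.txt)
if [ "$count" -le 5 ]; then
    echo "ok: $count bucket name(s)"
else
    echo "too many: $count bucket names (limit is 5)"
fi
```

The resulting mybuckets.txt can then be uploaded to the buckets folder of the request bucket with any of the clients described above.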

Troubleshooting

  • If you do not see any new buckets created, check the logs in the feedback folder of the duracloud-$ID-managed bucket for error messages.
  • If you attempt to create multiple buckets at one time and one bucket has an error (for example, the name is too long or you attempted to create more than five buckets), none of the buckets will be created. You must correct the issue and start again for all buckets.

Uploading Files

You will be able to upload files to the buckets you’ve created (see Creating Buckets). After your content has been uploaded, it will be mirrored in Glacier Deep Archive, in a bucket with the same name as yours plus the -repl suffix.

You will not be able to do anything with the content in the -repl bucket. You will be able to see filenames, as a reassurance that your content has been mirrored, but if you attempt to download or get information about the files, you will likely encounter:

  • Access denied
  • Failure to read attributes of [filename]. Forbidden. Request Error

or other errors.

Files in these -repl buckets will only be accessed in the event of checksum failure in your active file structures, so those files can be replaced by this Glacier Deep Archive copy.

Tip

This tool is intended primarily for the long-term storage and preservation of digital assets. Frequent or repeated access to private files within the system may lead to increased operational costs and could potentially compromise data integrity. Users are advised to limit such access and use this tool in accordance with its preservation-focused purpose.

CLI option

Refer to the AWS CLI S3 documentation:
https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html

Upload files (entire folder):

aws s3 sync ./local-folder s3://{stackname}-bucket

Upload a single file:

aws s3 cp myfile.txt s3://{stackname}-bucket

Cyberduck option

Cyberduck documentation on File Transfers

  • Uploading folders or individual files is as simple as clicking and dragging from a folder in File Explorer / Finder into the Cyberduck client. Alternatively, click the Upload button in the Cyberduck client to browse for files or folders.
  • Cyberduck will provide a pop-up log indicating whether the upload was successful. Another pop-up will appear if there are any errors or issues (for example, if you are not authorized to upload to the bucket).

SFTPGo option

  • Uploading folders or individual files is as simple as clicking and dragging from a folder in File Explorer / Finder into the web application. Alternatively, click the “drop files here to upload” area to browse for files or folders.
  • You cannot upload an empty folder, but you can create folder structures within your -private and -public folders before uploading content.
  • Uploading very large files may take a long time and can time out. If you have files larger than 1–2 GB, you may need to use Cyberduck or another S3-compatible tool.

Creating folders in SFTPGo

Use the New Folder button to create your folder structure(s) before uploading content.

  • The web application will show a list of all files queued for upload so you can confirm filenames and paths.
  • After uploading content, do not forget to click the Save button in the bottom right corner, or your content will not be uploaded.
  • After completion, you will see your preserved file structure. The default display shows 10 results at a time, but this can be increased up to 500.

Changing displayed results

Screen display showing preserved folder structure and the option to change the number of displayed results.

Reminder: You will not see the replicated file structure in the SFTPGo web application, but your files are still being replicated in Glacier.

Tip

We have occasionally seen a generic “Error uploading files” message in SFTPGo. Closing the error and attempting the upload again has so far worked successfully (sometimes requiring closing the error twice).
The cause is not yet certain; it may be related to attempting uploads after a session has expired. This is an area for further investigation and feedback.

Reports

Data generated by AWS and the application about your content is saved in the -managed bucket associated with your S3 account.

After you begin creating buckets and uploading content, you will see folders in the -managed bucket, including:

  • audit: AWS-generated audit logs
  • batch: AWS-generated files related to S3 batch operations
  • cloudtrail: AWS-generated files for events related to S3
  • feedback: Application-generated files for troubleshooting issues
  • manifests: AWS-generated inventory files
  • metadata: Application-generated files related to various stats (checksum, usage etc.)
  • reports: Application-generated files intended for user review and download

audit

  • A folder for each bucket you created
  • Mostly machine-readable data
  • Provides details about activities performed on your data

batch

  • Mostly machine-readable data
  • Provides outputs from S3 batch operations

cloudtrail

  • Mostly machine-readable data
  • Provides outputs from S3 events

feedback

  • Provides files for recording issues that arise

manifests

  • Mostly machine-readable data
  • Provides outputs from S3 inventory

metadata

  • Provides raw stats related to checksum and inventory processes

reports

This is the primary folder for content intended for review.

Checksum

Checksum reports are organized by date and stored under reports/ in the managed bucket.

There are two types of checksum reports:

  1. Checksum verification report (_checksum-report.csv)
  2. Checksum inventory report (_checksum-inventory.csv)

Checksum verification report

This report summarises the verification totals: matches, mismatches, missing replicas, and failures.

Checksum inventory report

This report uses existing inventory reports to generate a CSV of checksum metadata.

  • reports/latest/checksums/<bucket>_checksum-inventory.csv — most recent report
  • reports/YYYY-MM-DD/checksums/<bucket>_checksum-inventory.csv — date-stamped archive

Each CSV is a per-object checksum inventory. Each row includes the object key, its CRC64NVMe checksum (when present), and a status:

  • ok — no errors were encountered retrieving metadata for this object
  • not_found — object was not found
  • missing_checksum — object exists but has no checksum recorded
  • error — other failure

Note: checksum inventory does not provide checksum verification.
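
As a rough illustration, the statuses in a downloaded checksum inventory can be tallied with a few lines of Python. The column name status and the sample rows below are assumptions made for this sketch; check the header row of your own CSV before relying on them.

```python
import csv
import io
from collections import Counter


def tally_statuses(csv_text, status_column="status"):
    """Count how many rows carry each status value.

    `status_column` is an assumed header name; adjust it to match the real file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[status_column] for row in reader)


# Hypothetical sample data, not real report output.
sample = """key,checksum,status
a.tif,abc123==,ok
b.tif,,missing_checksum
c.tif,,not_found
d.tif,def456==,ok
"""

counts = tally_statuses(sample)
for status in ("ok", "not_found", "missing_checksum", "error"):
    print(f"{status}: {counts.get(status, 0)}")
```

Any count other than ok is worth a closer look, bearing in mind that this inventory records checksums and does not verify them.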

Manifest

Inventory manifest reports provide a listing of all files in each bucket. They are stored under reports/ in the managed bucket:

  • reports/latest/manifests/<bucket>.csv — most recent report
  • reports/YYYY-MM-DD/manifests/<bucket>.csv — date-stamped archive

Each CSV contains one row per object with metadata including filename, size, last modified date, and storage class.

Storage

Storage reports are interactive HTML files generated weekly. They are stored under reports/ in the managed bucket:

  • reports/latest/storage/<stack>.html — most recent report
  • reports/YYYY-MM-DD/storage/<stack>.html — date-stamped archive

Open the HTML file in a browser to view charts and tables covering:

  • Aggregated totals — storage usage across all buckets in the stack
  • Per bucket totals — storage usage broken down by individual bucket
  • Per bucket / per prefix totals — storage usage by folder within each bucket

These reports are the most human-readable summaries available.


You may download data from any of these folders for local review and storage.

Accessing Reports

Cyberduck

  1. Connect to your S3 account (see Connecting to S3).
  2. Navigate to the duracloud-$ID-managed bucket and open the reports/latest/ folder.
  3. Open the relevant subfolder:
    • checksums/ — checksum report CSVs per bucket
    • manifests/ — inventory manifest CSVs per bucket
    • storage/ — interactive HTML storage report for your stack
  4. Right-click (or control-click on macOS) the file and select Download, Download As, or Download To to save it locally.
  5. To view the storage report, open the downloaded .html file in your browser.

Tip

  • Downloaded files are saved to your default Downloads folder. You can change this in Edit → Preferences → Transfers under the General tab.
  • Right-click to rename files when downloading to avoid overwriting reports from previous dates.

SFTPGo

  1. Log in to the SFTPGo web interface (see Connecting to S3).
  2. Navigate to the managed folder, then open reports/latest/.
  3. Open the relevant subfolder:
    • checksums/ — checksum report CSVs per bucket
    • manifests/ — inventory manifest CSVs per bucket
    • storage/ — interactive HTML storage report for your stack
  4. To download a single file, click directly on its filename.
  5. To download multiple files, check the boxes next to them and use the Actions menu → Download. Selected items will be zipped automatically.
  6. To view the storage report, download the .html file and open it in your browser.

AWS CLI

Download the latest storage report:

aws s3 cp s3://duracloud-$ID-managed/reports/latest/storage/$ID.html .

Download the latest checksum inventory for a bucket:

aws s3 cp s3://duracloud-$ID-managed/reports/latest/checksums/${BUCKET}_checksum-inventory.csv .

Download the latest manifest report for a bucket:

aws s3 cp s3://duracloud-$ID-managed/reports/latest/manifests/$BUCKET.csv .

Sync all reports (the latest copies and every date-stamped archive) locally:

aws s3 sync s3://duracloud-$ID-managed/reports/ ./reports/

Downloading Content

Remember that you cannot download content from your replicated buckets (buckets ending in -repl). If you need content from a replicated bucket, for example after accidental deletion or corruption, ask your hosting provider for assistance.

AWS CLI Option

Refer to https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html

Download files from bucket

aws s3 sync s3://{stackname}-bucket/myfolder ./local-folder

Download a single file

aws s3 cp s3://{stackname}-bucket/myfile.txt .

Cyberduck

Refer to https://docs.cyberduck.io/cyberduck/download/ for details. If you right-click or control-click an item or a selected group of items, you will have the following download options:

  • Download — goes to your general preferences folder or the system Downloads folder if not changed
  • Download As — change the type of an individual item
  • Download To — change where the item(s) are saved

SFTPGo

To download content from SFTPGo, navigate to the relevant folder and select the folder(s) or item(s) you want. Once content is selected, a dropdown menu lets you download, move, or copy it; the download option automatically zips all selected items. To download a single item, click directly on its filename. Clicking a name does not work for a folder of content; select the folder and use the dropdown menu instead.

SFTPGo screenshot

Tip

You may be able to view some file types directly in SFTPGo, such as .jpg, .txt, and .pdf files, by clicking on the little eye icon to the right of the filename; the application will use your browser settings.

Making Content Public

There are two ways to make content public.

Each stack includes a pre-created -public bucket that is served through a CloudFront distribution with a friendly domain. This is the recommended way to make content publicly accessible. Your administrator will provide the public domain URL.

Cyberduck

Navigate to the duracloud-$ID-public bucket and upload your files there (see Uploading Files). Files uploaded to this bucket will be publicly accessible via the CloudFront domain.

SFTPGo

Navigate to the public folder and upload your content there (see Uploading Files). Files placed here will be publicly accessible via the CloudFront domain.

AWS CLI

aws s3 cp myfile.jpg s3://duracloud-$ID-public/myfolder/myfile.jpg

You can also make content publicly available by designating a bucket as -public (see How to Create Buckets).

You can construct what a public link will look like based on this pattern:

https://{BUCKET_NAME}.s3.{REGION}.amazonaws.com/{PREFIX}/{FILE}

If you have spaces in any of your folder or filenames, replace those with a + sign when forming a URL. The region information is also optional.

So, for example, an image found in the lyrasis account’s bucket public → test-01 → catpics folder structure would look like:

https://duracloud-lyrasis-public.s3.us-west-2.amazonaws.com/test-01/catpics/callie_and_friend.jpg

OR, without the region information:

https://duracloud-lyrasis-public.s3.amazonaws.com/test-01/catpics/callie_and_friend.jpg

Note this feature is currently available but may be restricted in the future as it goes against AWS guidelines.
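
The URL pattern above can be sketched as a small Python helper. This is only an illustration of the rules as described here (spaces become +, region is optional); the function name is hypothetical, and keys containing other special characters may need further encoding that this sketch does not attempt.

```python
def public_url(bucket, key, region=None):
    """Build a public S3 URL following the pattern described above.

    Spaces in the key are replaced with '+', and the region segment is
    included only when one is given.
    """
    if region:
        host = f"{bucket}.s3.{region}.amazonaws.com"
    else:
        host = f"{bucket}.s3.amazonaws.com"
    return f"https://{host}/{key.replace(' ', '+')}"


# The first call reproduces the documented example; the second key is made up.
print(public_url("duracloud-lyrasis-public", "test-01/catpics/callie_and_friend.jpg", "us-west-2"))
print(public_url("duracloud-lyrasis-public", "test-01/cat pics/callie.jpg"))
```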

Cyberduck sharing options

Cyberduck has some additional ways to share folders and individual objects.

  • Navigate to the item you wish to share.
  • Right-click on Windows / control+click on a Mac or two-finger click on a touchpad and select “Copy URL” — you can also use the Action (cog) menu and select “Open URL”.
    • If you right-click and select “Copy URL,” you will have options for how you wish to copy the URL, including HTTPS or HTTP, an expiration on the link (for individual objects only), or the AWS command link.
    • You can now share the item however you wish.
    • The HTTPS and HTTP links may be formed slightly differently (with AWS information before the bucket name), but they should still provide public access to objects in your account.

Query audit and inventory data

S3 audit logs and inventory can be synced locally for ad-hoc querying with DuckDB.

Pre-reqs

Sync the files

Download audit and / or inventory data to a local ./data folder. For example:

mkdir -p data/audit/
mkdir -p data/inventory/

aws s3 sync s3://${stack_name}-managed/audit/ data/audit/
aws s3 sync s3://${stack_name}-managed/manifests/ data/inventory/

# also download the query setup sql files
curl -O https://artifacts.preserve.duracloud.org/query/audit.sql
curl -O https://artifacts.preserve.duracloud.org/query/inventory.sql

Query audit data with DuckDB

The log files are in the S3 server access log format: one request per line, space-delimited, with a bracketed timestamp and quoted request_uri, referer, and user_agent. DuckDB’s CSV reader can’t handle the mixed quoting, so audit.sql reads each line as a single string and pulls fields out with a regex, exposing them as the audit view.
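
To make the line-as-string-plus-regex idea concrete, here is a minimal Python sketch of the same technique. It is not the contents of audit.sql: the pattern, the subset of fields, and the sample line are all assumptions chosen for illustration.

```python
import re

# Leading fields of an S3 server access log line. This pattern is an
# illustrative assumption and is not taken from audit.sql.
LOG_LINE = re.compile(
    r'^(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<event_time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+) "(?P<request_uri>[^"]*)" '
    r'(?P<http_status>\S+) (?P<error_code>\S+) (?P<bytes_sent>\S+)'
)


def parse_line(line):
    """Return a dict of the leading fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


# Invented sample line for demonstration only.
sample = (
    'ownerid digipres-dev1-private [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
    'arn:aws:sts::123456789012:assumed-role/example-role/alice REQ123 '
    'REST.GET.OBJECT photos/cat.jpg "GET /photos/cat.jpg HTTP/1.1" '
    '200 - 2048 2048 12 10 "-" "aws-cli/2.0" -'
)

fields = parse_line(sample)
print(fields["event_time"], fields["operation"], fields["key"], fields["http_status"])
```

The bracketed timestamp and the quoted request URI are matched as single units, which is the part a plain space-delimited reader struggles with.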

Launch the DuckDB CLI with the view preloaded:

duckdb -init audit.sql

Then query away. For example, every request ordered by time:

SELECT event_time, bucket, remote_ip, operation, key, http_status, bytes_sent
FROM audit
ORDER BY event_time;

Standard object operations by users

The requester field is an IAM ARN. Most traffic is programmatic (for example SDK sessions named aws-go-sdk-…, service roles doing replication or batch work etc.) but when a user assumes a role via a named profile, the session name at the end of the ARN is usually the IAM username. To see just the standard object-level operations (GET, PUT, DELETE) performed by assumed-role sessions, with the obvious programmatic sessions filtered out:

SELECT
  event_time,
  regexp_extract(requester, 'assumed-role/[^/]+/(.+)$', 1) AS who,
  bucket,
  operation,
  key,
  http_status
FROM audit
WHERE operation IN ('REST.PUT.OBJECT', 'REST.GET.OBJECT', 'REST.DELETE.OBJECT')
  AND requester LIKE '%:assumed-role/%'
  AND requester NOT LIKE '%aws-go-sdk-%'
  AND requester NOT LIKE '%assume-role-from-profile-%'
ORDER BY event_time;

Service roles (e.g. replication, batch jobs) may still appear in the results. Inspect the who column and add further NOT LIKE clauses for any session names that aren’t people of interest.
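
If you export query results and want to post-process them outside DuckDB, the same session-name extraction and filtering can be sketched in Python. The ARNs below are invented examples, and the exclusion list simply mirrors the NOT LIKE clauses in the query above; extend it to suit your own account.

```python
import re

SESSION = re.compile(r"assumed-role/[^/]+/(.+)$")
# Substrings that mark obviously programmatic sessions (same as the query above).
PROGRAMMATIC_MARKERS = ("aws-go-sdk-", "assume-role-from-profile-")


def session_name(requester):
    """Return the session name from an assumed-role ARN, or '' if there is none."""
    match = SESSION.search(requester)
    return match.group(1) if match else ""


def looks_like_person(requester):
    """True for assumed-role sessions that carry no programmatic marker."""
    name = session_name(requester)
    return bool(name) and not any(marker in requester for marker in PROGRAMMATIC_MARKERS)


# Invented ARNs for demonstration only.
examples = [
    "arn:aws:sts::123456789012:assumed-role/example-role/alice",
    "arn:aws:sts::123456789012:assumed-role/example-role/aws-go-sdk-1700000000",
    "arn:aws:iam::123456789012:user/bob",
]
for arn in examples:
    print(session_name(arn) or "-", looks_like_person(arn))
```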

Query inventory data with DuckDB

S3 inventory reports ship as Parquet, which DuckDB reads natively. inventory.sql globs every parquet file under data/inventory/ and exposes them as the inventory view. Because each daily snapshot re-reports objects that haven’t changed, the view uses SELECT DISTINCT to collapse identical rows so basic queries see one row per unique observed state.

Launch the DuckDB CLI with the view preloaded:

duckdb -init inventory.sql

List every object across all buckets:

SELECT bucket, key, size, last_modified_date, storage_class
FROM inventory
ORDER BY bucket, key;

Object count and total bytes per bucket:

SELECT bucket, COUNT(*) AS objects, SUM(size) AS total_bytes
FROM inventory
GROUP BY bucket
ORDER BY bucket;

To work with both views in the same session, pass both scripts:

duckdb -init audit.sql -cmd ".read inventory.sql"

External documentation

Documentation provided by third-party clients recommended for providing access to your account’s managed S3 buckets:

Overview

Test environment

Production environment

Setup

This documentation is focused on the technical aspects of the core functionality and how to test locally using the provided cli and remotely after the functions have been deployed.

This documentation does not address user functionality or deployment concerns. For those, see:

Pre-reqs

Requirements:

You must have access to an AWS account. Caution: costs may be incurred.

Setup

There are Makefile tasks to wrap cargo (et al.) commands for convenience:

These args are used frequently:

  • f=function: function name, e.g. bucket-request
  • p=profile: AWS profile name, e.g. default
  • s=stack: resource prefix used for identification/partitioning within an AWS account

But note in some contexts a letter may have a different meaning, for example f=file (check the docs or output of make for details).

To get started run this task to create the base infrastructure:

# choose your own value for s=$stack and p=$profile
make setup s=digipres-dev1 p=default

This task uses Terraform so it must be installed for it to work.

Of most significance for testing, the above example will create:

  • digipres-dev1-s3-batch-role (i.e. ${stack}-s3-batch-role)
  • digipres-dev1-s3-replication-role (i.e. ${stack}-s3-replication-role)
  • digipres-dev1-request (i.e. ${stack}-request)
  • digipres-dev1-managed (i.e. ${stack}-managed)
  • digipres-dev1-public (i.e. ${stack}-public)
  • digipres-dev1-public-repl (i.e. ${stack}-public-repl)

The managed bucket will also be assigned a policy that permits it to be a target for S3 inventory from buckets using the same stack name (prefix).

The public bucket is “special” as it works differently from regular user created public buckets owing to a CloudFront distribution that is created to provide access to the files, rather than using raw S3 urls.

Testing remotely with Lambda

The base infrastructure is sufficient for testing using the provided cli. However, no AWS Lambda functions will be deployed by the setup task. If you want to test a full stack deployment including the Lambda functions then there is a deploy task for that:

make deploy s=digipres-dev1 p=default

This will build the Lambda packages and upload them to an “artifacts” bucket that Lambda can access. Doing this will enable you to try out the remote testing instructions for each function vs. only testing via the cli. Generally speaking the cli covers most of what happens when run through Lambda with these primary differences:

  • Local cli testing uses your local AWS credentials
  • Deployed Lambdas use permissions provided by IAM roles
  • The entrypoints are different: see the cli vs. functions folders

Testing public access via CloudFront

Retrieve the CloudFront domain name from the Terraform outputs:

terraform output cloudfront_domain_name

This will output something like: d2vy8bpfecxis5.cloudfront.net.

Upload a test file to the public bucket:

make upload b=digipres-dev1-public d=example f=files/buckets.txt p=default

Then access the file in the browser; it should work:

For production, the other Terraform outputs can be used to set up a custom domain using ACM. See the deployment documentation for more details.

Functions

The core service functionality is encapsulated by Lambda functions that run on a schedule or in response to S3 events:

| Function | Trigger | Description |
| --- | --- | --- |
| bucket-request | S3 event | Creates S3 buckets with prefab configuration from an uploaded text file |
| checksum-report | EventBridge event (batch job status) | Compares checksum results across source and replication buckets to detect corruption |
| checksum-request | S3 event | Generates a per-object checksum inventory CSV from an existing inventory manifest |
| compute-checksums | Scheduled | Triggers S3 batch checksum jobs across all bucket pairs to verify data integrity |
| inventory-report | S3 event | Processes S3 inventory data into a human-readable CSV manifest and generates storage stats |
| storage-report | Scheduled | Generates an HTML storage usage report across all buckets in the stack |
| sync-users | S3 event | Syncs IAM users to SFTPGo so they can access their stack buckets over SFTP |

All functions can also be run locally via the dcp CLI, which additionally provides commands for tasks not covered by Lambda. See CLI for details.

bucket-request

  • Lambda trigger: S3 event (fires when a user uploads a file to the request bucket)
  • Dependencies: None

Overview

This Lambda function creates S3 buckets with prefab configuration based on a list of bucket names provided in a plain text file.

Example buckets.txt

manuscripts
newspapers
rare-books

The workflow is:

  1. A text file containing bucket names is uploaded to the S3 bucket named ${stack}-request
  2. The Lambda function is triggered by the upload event
  3. The file is downloaded and processed — either locally (for development/testing) or inside Lambda (for remote execution)
  4. Buckets are created according to the prefab configuration if they don’t already exist

CLI Testing

Use make run-bucket-request to process a file locally without uploading to S3:

make run-bucket-request f=files/buckets-list.txt s=digipres-dev1 p=default
  • f= — path to a local file containing bucket names
  • s= — the stack name (used as a prefix for created buckets)
  • p= — the AWS profile to use

You can also create a single bucket by name without a file, using the cargo CLI directly.

Important

Export your AWS profile (for example, by setting the AWS_PROFILE environment variable) before using the cargo CLI.

cargo run -p dcp -- bucket-request --stack=digipres-dev1 --name=rare-books

This is useful for one-off bucket creation or quick iteration without maintaining a file.

Remote Testing

Use make upload to upload a file to S3 and trigger the Lambda function as it would run in production:

make upload b=digipres-dev1-request d=buckets f=files/buckets.txt p=default
  • b= — the name of the S3 request bucket (typically ${stack}-request)
  • d= — the S3 directory (path) to upload into (must be buckets)
  • f= — path to the local file containing bucket names
  • p= — the AWS profile to use

Output

Given the example file files/buckets.txt, two buckets should be created (assuming they do not already exist):

  • digipres-dev1-private — private S3 bucket
  • digipres-dev1-private-repl — private S3 bucket used as the replication destination for the above

You can verify the buckets were created using:

make bucket a=list p=default
# Filter results by stack name using grep
make bucket a=list p=default | grep digipres-dev1

QA testing

Aside from the happy path, here are variations to try:

  • File too large
  • File invalid (rename some other file, e.g. a jpg, to buckets.txt)
  • Bucket names are too long or have invalid characters
  • Too many bucket names (5 max, additional names are discarded)
  • Bucket names are duplicates, or the buckets already exist
  • In each case, errors should be uploaded to a file in the managed bucket feedback path
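
When preparing QA inputs it can help to predict what a given request file should produce. The following Python sketch is not the Lambda's implementation; it assumes a simplified version of the general S3 bucket naming rules, assumes the 5-name limit is applied before de-duplication, and uses hypothetical function and variable names throughout.

```python
import re

MAX_NAMES = 5  # the limit mentioned in the QA list above
# Simplified form of the general S3 naming rules: lowercase letters, digits
# and hyphens, starting and ending with a letter or digit.
VALID = re.compile(r"^[a-z0-9][a-z0-9-]*[a-z0-9]$")


def plan_buckets(text, stack):
    """Return (planned, rejected) for the names listed one per line in `text`."""
    names = [line.strip() for line in text.splitlines() if line.strip()]
    planned, rejected = [], []
    for name in dict.fromkeys(names[:MAX_NAMES]):  # drop duplicates, keep order
        full = f"{stack}-{name}"
        # The replication bucket adds a "-repl" suffix, so leave room for it
        # within the 63-character S3 limit.
        if VALID.match(full) and len(full) + len("-repl") <= 63:
            planned.append((full, f"{full}-repl"))
        else:
            rejected.append(name)
    return planned, rejected


planned, rejected = plan_buckets("manuscripts\nnewspapers\nRare Books\n", "digipres-dev1")
print(planned)
print(rejected)
```

Comparing this prediction with the buckets actually created, and with the contents of the feedback path, gives a quick check for each variation.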

inventory-report

Type: Lambda function
Trigger: S3 event (manifest.json is created)
Dependencies: None

Overview

This function processes Parquet-formatted S3 inventory data into a single human-readable CSV manifest per bucket. It also generates storage usage statistics used by the storage report:

  • Total number of files and total storage used
  • The same, broken down by top-level prefix (folder)
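
The two statistics above amount to a simple aggregation over (key, size) pairs. The Python sketch below shows one way to compute them; it is an illustration only, and the choice to group root-level objects under "/" is an assumption rather than the function's documented behaviour.

```python
from collections import defaultdict


def storage_stats(rows):
    """Return overall and per-prefix (count, bytes) totals for (key, size) rows."""
    total = [0, 0]
    by_prefix = defaultdict(lambda: [0, 0])
    for key, size in rows:
        # Objects at the bucket root have no folder; group them under "/".
        prefix = key.split("/", 1)[0] if "/" in key else "/"
        for acc in (total, by_prefix[prefix]):
            acc[0] += 1
            acc[1] += size
    return tuple(total), {p: tuple(v) for p, v in by_prefix.items()}


# Hypothetical inventory rows.
rows = [
    ("maps/a.tif", 100),
    ("maps/b.tif", 50),
    ("letters/1/x.pdf", 25),
    ("readme.txt", 5),
]
total, by_prefix = storage_stats(rows)
print(total)
print(by_prefix)
```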

Note

At least one bucket must exist with files uploaded before this function can run. It has no inventory to process otherwise.

CLI testing

Run locally against the most recently available S3 inventory for a bucket:

make run-inventory-report b=digipres-dev1-private p=default
  • b= — Bucket name to process the inventory report for (required)
  • p= — AWS profile to use (required)

Remote testing

Staging a remote test requires crafting a specific event payload and uploading matching Parquet files, which adds significant overhead. In practice it is simpler to let the infrastructure run on its normal daily schedule and inspect the logs if the report does not appear.

If the CLI works but the Lambda does not, the most likely cause is an IAM permissions issue.

To stage a full remote test:

  1. Craft an event payload that references a manifest.json.
  2. Upload Parquet inventory files to the location referenced in the manifest.json.
  3. Upload the manifest.json to the path specified in the event payload — this must be within the event notification path (/manifests).
  4. Ensure the Parquet files contain the correct stack-prefixed bucket name.

Output

When run successfully there should be four generated files:

  • metadata/latest/manifests/stats/$bucket.csv
  • metadata/YYYY-MM-DD/manifests/stats/$bucket.csv
  • reports/latest/manifests/$bucket.csv
  • reports/YYYY-MM-DD/manifests/$bucket.csv

To access the latest report you can do:

aws s3 cp \
    s3://digipres-dev1-managed/reports/latest/manifests/digipres-dev1-private.csv \
    . \
    --profile default

QA testing

Confirm:

  • All expected files are available.
  • The report contains expected items.
  • The stats are accurate.

compute-checksums

Type: Lambda function
Trigger: Scheduled EventBridge event
Dependencies: None

Overview

This Lambda function triggers S3 batch checksum jobs to verify data integrity across your buckets. It processes standard/public + replication bucket pairs together, ensuring both the source and replicated data are checksummed.

Invocation methods

Scheduled execution (production)

The Lambda is automatically triggered by a scheduled EventBridge event at regular intervals.

CLI testing

Compute checksums for a single bucket and its replication pair:

make run-compute-checksums b=digipres-dev1-private p=default

Parameters:

  • b= — Standard or public stack bucket to checksum (required)
  • p= — AWS profile (required)

Constraints:

  • Only supports a single bucket at a time
  • Automatically paired with replication bucket
  • Cannot directly specify a replication bucket

Remote trigger

Compute checksums for all stack buckets in a given stack:

make trigger f=compute-checksums s=digipres-dev1 p=default

Parameters:

  • f= — Function name (compute-checksums)
  • s= — Stack name (required)
  • p= — AWS profile (required)

Behavior: Triggers jobs for ALL stack buckets in the specified stack.

Output

Function response

{
    "StatusCode": 200,
    "ExecutedVersion": "$LATEST"
}

Receipt files

For each bucket pair processed, a job receipt is uploaded to:

  • metadata/latest/checksums/receipts/{source_job_id}.json
  • metadata/latest/checksums/receipts/{repl_job_id}.json
  • metadata/latest/checksums/receipts/{source_bucket_name}.json
  • metadata/{date}/checksums/receipts/{source_bucket_name}.json

Purpose: The receipt is uploaded multiple times for different discovery paths:

  • Job IDs — used by the Lambda checksum report process for internal tracking
  • Bucket names — used by the CLI checksum report and for easier manual access
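
For QA scripts that need to check all four locations, the receipt keys can be generated from the job IDs and the source bucket name. This Python sketch assumes the {date} segment uses the same YYYY-MM-DD form as the other dated paths in this documentation; the job IDs shown are placeholders.

```python
from datetime import date


def receipt_keys(source_job_id, repl_job_id, source_bucket, on):
    """List the four receipt keys written for one bucket pair."""
    base = "checksums/receipts"
    return [
        f"metadata/latest/{base}/{source_job_id}.json",
        f"metadata/latest/{base}/{repl_job_id}.json",
        f"metadata/latest/{base}/{source_bucket}.json",
        # Assumed date format: ISO 8601 (YYYY-MM-DD).
        f"metadata/{on.isoformat()}/{base}/{source_bucket}.json",
    ]


# Placeholder job IDs for demonstration only.
for key in receipt_keys("job-src", "job-repl", "digipres-dev1-private", date(2025, 1, 31)):
    print(key)
```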

QA testing

Confirm:

  • Jobs are created without errors
  • Jobs are completed successfully
  • All receipt files are generated and available at the expected paths

checksum-request

Trigger: S3 event (.txt file uploaded under the request bucket’s checksums/ prefix)
Dependencies: inventory-report — the manifest CSV must already exist before running this

Overview

checksum-request turns an inventory manifest CSV into a checksum inventory. For every object listed in the manifest, it issues a HEAD request, records the CRC64NVMe checksum (when present), and assigns a per-object status of ok, not_found, missing_checksum, or error. The result is uploaded as a CSV to the managed bucket under reports/*/checksums/.

The trigger file’s name (minus the extension) identifies which bucket’s inventory to process. For example, uploading checksums/digipres-dev1-private.txt processes the inventory for digipres-dev1-private.

Workflow:

  1. A .txt file named <bucket>.txt is uploaded to s3://${stack}-request/checksums/
  2. The Lambda function is triggered by the upload event
  3. The bucket name is parsed from the trigger filename
  4. The function checks for a matching inventory manifest at s3://${stack}-managed/reports/latest/manifests/<bucket>.csv
  5. If found, the inventory is processed and the checksum CSV is uploaded to the managed bucket
  6. The trigger file is deleted on success — re-upload to re-trigger
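
Steps 3 and 4 of the workflow can be sketched in Python as follows. This is an illustration of the naming convention only, not the function's actual parsing code, and both helper names are hypothetical.

```python
from pathlib import PurePosixPath


def bucket_from_trigger(key):
    """Return the bucket name encoded in a trigger key, or None if it is not usable."""
    path = PurePosixPath(key)
    if path.parent != PurePosixPath("checksums") or path.suffix != ".txt":
        return None
    return path.stem or None


def manifest_key(bucket):
    """Key of the inventory manifest that must exist in the managed bucket."""
    return f"reports/latest/manifests/{bucket}.csv"


bucket = bucket_from_trigger("checksums/digipres-dev1-private.txt")
print(bucket)
print(manifest_key(bucket))
print(bucket_from_trigger("checksums/digipres-dev1-private"))  # no extension
```

A key with no extension, or one outside the checksums/ prefix, yields no bucket name here, which corresponds to the failure cases listed under QA testing.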

CLI testing

Run locally against an existing manifest:

make run-checksum-request b=digipres-dev1-private p=default

| Flag | Description |
| --- | --- |
| --bucket | Bucket name to process the checksum inventory for (required) |

Important

If no manifest exists for the bucket, the CLI will fail with Inventory report not found. Run inventory-report first.

Remote testing

Upload a trigger file to the request bucket’s checksums/ prefix:

make upload b=digipres-dev1-request d=checksums f=files/digipres-dev1-private.txt p=default

| Flag | Description |
| --- | --- |
| b= | The S3 request bucket (typically ${stack}-request) |
| d= | The S3 path to upload into; must be checksums |
| f= | Path to a local trigger file; its basename (without extension) must be the bucket name |
| p= | AWS profile |

Note

The trigger file’s contents are not read — only its name matters.

Output

A successful run writes two files to the managed bucket:

  • reports/latest/checksums/<bucket>_checksum-inventory.csv
  • reports/YYYY-MM-DD/checksums/<bucket>_checksum-inventory.csv

To download the latest report:

aws s3 cp \
    s3://digipres-dev1-managed/reports/latest/checksums/digipres-dev1-private_checksum-inventory.csv \
    . \
    --profile default

QA testing

In addition to the happy path, test these edge cases:

| Scenario | Expected behaviour |
| --- | --- |
| Trigger file uploaded with no matching inventory manifest | Fails with Inventory report not found |
| Trigger filename does not parse to a valid bucket (e.g. no extension) | Fails before doing any work |
| Trigger file uploaded outside the checksums/ prefix | Lambda is not invoked |

checksum-report

Trigger: CloudTrail EventBridge event (batch job status: complete or failed)
Dependencies: compute-checksums

Overview

checksum-report processes the output of the S3 batch compute checksum jobs into a single checksum report CSV per bucket, and generates checksum verification stats (e.g. total mismatches).

In production, this function is triggered asynchronously by EventBridge when a batch job reaches complete or failed status. Each bucket pair (source + replication) runs as independent jobs. Report generation requires both jobs to be complete — if the first job finishes before the second, the function exits early and waits for the second event before continuing.

Usage

CLI (local testing)

Important

compute-checksums must have already run and completed for the target bucket pair (source + replication) before running this command.

make run-checksum-report b=digipres-dev1-private p=default

| Flag | Description |
| --- | --- |
| b= | A standard or public stack bucket to generate a checksum report for |
| p= | AWS profile |

Remote testing

Remote testing starts the same way as compute-checksums:

make trigger f=compute-checksums s=digipres-dev1 p=default

When a compute checksum job completes, it automatically triggers checksum report generation — once per bucket job.

Note

Replication buckets with objects in glacier storage tier can take days to complete. For testing, use buckets that contain only recently created objects that haven’t yet transitioned to glacier storage.

Tracking job status

make job-status-by-receipt b=digipres-dev1-private p=default

A status of "Active" means the job is still running.

Expected output

On success, the CLI prints a verification summary and uploads a report CSV to the managed bucket:

Checksum report complete:
        Total objects:      6
        Matches:            6
        Mismatches:         0
        Missing replica:    0
        Missing source:     0
        Failed source:      0
        Failed replication: 0

| Field | Description |
| --- | --- |
| Total objects | Number of source objects evaluated |
| Matches | Objects where source and replica checksums are identical |
| Mismatches | Objects where checksums differ; indicates a data integrity issue |
| Missing replica | Objects present in source but absent from replication bucket |
| Missing source | Objects present in replication but absent from source bucket |
| Failed source | Objects where checksum computation failed on the source |
| Failed replication | Objects where checksum computation failed on the replica |
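
To make the field definitions concrete, the Python sketch below derives the same counters from two key-to-checksum mappings. It is a simplified model rather than the function's real logic: a checksum of None stands in for a failed computation, and the order in which the categories are tested is an assumption.

```python
def summarise(source, replica):
    """Compare two {key: checksum} mappings; None marks a failed computation."""
    summary = {
        "total_objects": len(source),
        "matches": 0,
        "mismatches": 0,
        "missing_replica": 0,
        "missing_source": sum(1 for key in replica if key not in source),
        "failed_source": 0,
        "failed_replication": 0,
    }
    for key, checksum in source.items():
        if key not in replica:
            summary["missing_replica"] += 1
        elif checksum is None:
            summary["failed_source"] += 1
        elif replica[key] is None:
            summary["failed_replication"] += 1
        elif checksum == replica[key]:
            summary["matches"] += 1
        else:
            summary["mismatches"] += 1
    return summary


# Hypothetical checksums for demonstration only.
source = {"a": "x1", "b": "x2", "c": "x3", "d": None}
replica = {"a": "x1", "b": "zz", "d": "x4", "e": "x5"}
result = summarise(source, replica)
for field, value in result.items():
    print(f"{field}: {value}")
```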

A report CSV is also uploaded to the stack’s managed bucket for long-term record keeping.

To verify the checksum report was written to S3:

aws s3 ls s3://digipres-dev1-managed/reports/$(date +%F)/checksums/

QA testing

Confirm:

  • Files are uploaded
  • Appropriate logging for first bucket event (exit only)
  • Appropriate logging for second bucket event (continuation)

storage-report

Type: Lambda function
Trigger: Scheduled EventBridge event (weekly)
Dependencies: inventory-report

Overview

This Lambda function generates a consolidated storage report for a stack, displaying storage usage across all standard and public buckets. The report is output as a single interactive HTML file using Chart.js for visualizations.

Report sections

  • Aggregated totals — Storage usage across all buckets in the stack
  • Per bucket totals — Storage usage broken down by individual bucket
  • Per bucket / per prefix totals — Storage usage by prefix within each bucket

Prerequisites

The storage report requires S3 inventory data to be available. Before running this function:

  1. S3 inventory must be enabled for the buckets
  2. At least one inventory report must have been generated and uploaded
  3. The inventory-report function must have completed successfully

CLI testing

Generate a storage report for a specific stack:

make run-storage-report s=digipres-dev1 p=default

Parameters:

  • s= — Stack name (required)
  • p= — AWS profile (required)

Remote trigger

make trigger f=storage-report s=digipres-dev1 p=default

Parameters:

  • f= — Function name (storage-report)
  • s= — Stack name (required)
  • p= — AWS profile (required)

Scheduled execution

Automatically triggered weekly by EventBridge.

Output

When successful, four files are generated:

Statistics (JSON format)

  • metadata/latest/storage/stats/{stack}.json — Latest version
  • metadata/YYYY-MM-DD/storage/stats/{stack}.json — Date-stamped archive

Contains raw storage metrics for programmatic access.

Report (HTML format)

  • reports/latest/storage/{stack}.html — Latest version
  • reports/YYYY-MM-DD/storage/{stack}.html — Date-stamped archive

Interactive HTML report with Chart.js visualizations for viewing in a browser.

sync-users

  • Lambda trigger: S3 event (fires when a TRIGGER file is uploaded to the managed bucket under sync-users/)
  • Dependencies: None

Overview

This Lambda function synchronizes IAM users with an SFTPGo server so that each user can access their stack buckets over SFTP using their AWS access keys.

Unlike the other functions, sync-users operates across stacks. A user can belong to one or more stacks (via IAM group membership), and this function discovers those relationships to grant the user access to the appropriate set of buckets.

Important

sync-users only updates existing SFTPGo users — it does not create them. SFTPGo users are provisioned separately via the users terraform module.

The workflow is:

  1. An empty TRIGGER file is uploaded to s3://${stack}-managed/sync-users/TRIGGER
  2. The Lambda function is triggered by the upload event
  3. Eligible IAM users are discovered (those with an Email tag and one or more stack group memberships)
  4. For each user, their access/secret keys are retrieved from SSM and the matching SFTPGo account is updated with access to the buckets for each stack they belong to
  5. The TRIGGER file is deleted on success
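
The eligibility rule in step 3 can be expressed as a simple filter. The Python sketch below uses a hypothetical in-memory representation of IAM users (a dict with tags and stacks entries); the real function discovers this information through IAM, and its data structures are not documented here.

```python
def eligible_users(users):
    """Keep users that have an Email tag and at least one stack membership."""
    return {
        name: sorted(info["stacks"])
        for name, info in users.items()
        if info.get("tags", {}).get("Email") and info.get("stacks")
    }


# Hypothetical users for demonstration only.
users = {
    "alice": {"tags": {"Email": "alice@example.org"}, "stacks": ["digipres-dev2", "digipres-dev1"]},
    "bob": {"tags": {}, "stacks": ["digipres-dev1"]},  # no Email tag: skipped
    "carol": {"tags": {"Email": "carol@example.org"}, "stacks": []},  # no stacks: skipped
}
print(eligible_users(users))
```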

The SFTPGo connection details (SFTPGO_HOST, SFTPGO_USERNAME, SFTPGO_PASSWORD) are provided via Lambda environment variables set at deploy time.

CLI testing

The CLI can sync a single user or all users. SFTPGo credentials are read from the environment.

SFTPGO_HOST=https://sftpgo.example.org \
SFTPGO_USERNAME=admin \
SFTPGO_PASSWORD=secret \
make run-sync-users p=default

To sync a specific user only:

SFTPGO_HOST=... SFTPGO_USERNAME=... SFTPGO_PASSWORD=... \
cargo run -p dcp -- sync-users --username=alice

Unlike other CLI commands, sync-users does not take a stack argument — it works across all eligible users in the account.

Remote testing

Upload the TRIGGER file to the managed bucket to invoke the Lambda:

make upload b=digipres-dev1-managed d=sync-users f=TRIGGER p=default

  • b= — the managed bucket name (${stack}-managed)
  • d= — the S3 directory (must be sync-users)
  • f= — path to an empty local file named TRIGGER
  • p= — the AWS profile to use

Create an empty TRIGGER file first if you don’t have one:

touch TRIGGER

Output

sync-users does not produce files in S3. Successful execution can be verified in the following ways:

  • The TRIGGER file is removed from s3://${stack}-managed/sync-users/ after a successful run
  • CloudWatch logs show per-user processing output (email, identified buckets)
  • The SFTPGo admin UI shows the expected users with the expected bucket virtual folders configured

QA testing

Confirm:

  • A user with no Email tag is skipped (not synced)
  • A user with no stack group memberships is skipped (no buckets)
  • A user with no matching SFTPGo account is skipped (sync-users does not create SFTPGo users)
  • A user belonging to multiple stacks has access to buckets from each stack
  • The TRIGGER file is deleted after a successful run
  • A user’s SFTPGo account reflects changes when their IAM group memberships change

CLI Reference

The dcp command-line tool provides access to core operations for managing buckets, generating reports, and maintaining data integrity. This reference documents all available commands and their usage.

Commands

Command                Description
bucket-reconciliation  Check bucket configuration and report drift
bucket-request         Process bucket creation requests
checksum               Compute a checksum for a local file
checksum-request       Build checksum inventory from S3 inventory data
checksum-report        Generate checksum report and statistics
compute-checksums      Run S3 Batch Operations to compute checksums
inventory-report       Generate inventory report and statistics
reset                  Reset stack (empty buckets, requires confirmation)
storage-report         Generate storage report
sync-users             Sync IAM users to SFTPGo
transfer               Transfer files from source to stack destination bucket

Usage

dcp <COMMAND> [OPTIONS]

Global options

  • -h, --help — Print help message

Commands

Bucket operations

bucket-reconciliation

Check bucket configuration and report drift.

dcp bucket-reconciliation [OPTIONS]

Detects inconsistencies between the local bucket configuration and the remote state. Useful for identifying configuration drift or missing objects.


bucket-request

Process bucket creation requests.

dcp bucket-request [OPTIONS]

Handle requests to create new buckets within the stack infrastructure.


reset

Reset stack (empty buckets, requires confirmation).

dcp reset [OPTIONS]

Caution

This is a destructive operation. Removes all content from stack buckets. Requires confirmation before proceeding.


transfer

Transfer files from source to stack destination bucket.

dcp transfer [OPTIONS]

Copy data from a source bucket to a destination bucket within the stack. Useful for migrations and data reorganization.


Checksum operations

checksum

Compute a checksum for a local file.

dcp checksum [OPTIONS] <FILE>

Compute a checksum for a local file to verify data integrity.


compute-checksums

Run S3 Batch Operations to compute checksums.

dcp compute-checksums [OPTIONS]

Trigger S3 batch checksum jobs for buckets. For detailed usage, see compute-checksums documentation.


checksum-request

Build checksum inventory from S3 inventory data.

dcp checksum-request [OPTIONS]

Process S3 inventory data to create a checksum inventory for analysis and verification.


checksum-report

Generate checksum report and statistics.

dcp checksum-report [OPTIONS]

Create a report of checksum results and statistics across buckets. For detailed usage, see checksum-report documentation.


Reporting operations

inventory-report

Generate inventory report and statistics.

dcp inventory-report [OPTIONS]

Create an inventory report from S3 inventory data showing bucket contents and statistics. For detailed usage, see inventory-report documentation.


storage-report

Generate storage report.

dcp storage-report [OPTIONS]

Generate a comprehensive storage report with visualizations showing storage usage across all buckets in the stack. For detailed usage, see storage-report documentation.


User management

sync-users

Sync IAM users to SFTPGo.

dcp sync-users [OPTIONS]

Synchronize IAM users with SFTPGo for SFTP access management. For detailed usage, see sync-users documentation.


Help

help

Print help message or help for a specific subcommand.

dcp help [COMMAND]

Display general help or help for a specific command.

Common workflows

Local testing with CLI

Most development and testing uses the CLI. See development documentation for local testing patterns.

Testing with deployed Lambda

For testing with deployed Lambda functions, see the documentation for the specific operation.

Makefile helpers

The project provides Makefile tasks that wrap CLI commands with common parameters:

# Example: Run compute-checksums via Makefile
make run-compute-checksums b=digipres-dev1-private p=default

# Example: Trigger Lambda function
make trigger f=storage-report s=digipres-dev1 p=default

# Example: Run CLI command directly
dcp compute-checksums --bucket digipres-dev1-private

For all available Makefile tasks, run make help.

Cleanup

# empties buckets only, resources are not destroyed
make reset s=digipres-dev1 p=default

# teardown: empties buckets and deletes everything
make teardown s=digipres-dev1 p=default

Development

Most new features follow the same progression: CLI command → perform module → Lambda → Terraform. The CLI is the fastest path to a working end-to-end run against real AWS, and the Lambda is a thin entrypoint that delegates to the same perform module once the functionality is proven.

1. Add a CLI command

The CLI lives in cli/src/commands/. Each command is its own module exposing an Args struct and a run function.

  • Create cli/src/commands/<new_command>.rs with pub struct Args (clap) and pub async fn run(args: Args) -> Result<(), Box<dyn std::error::Error>>.
  • Register the module in cli/src/commands/mod.rs.
  • Add a Commands::<NewCommand>(commands::<new_command>::Args) variant and dispatch arm in cli/src/main.rs.
  • Build SDK clients directly from awsutils::config::load_defaults() + Clients::new(&sdk_config), or use app::config::load(stack) if the command is stack-scoped.
  • Wire clap args/env vars (e.g. #[arg(long, env = "SFTPGO_HOST")]).

Keep the CLI thin — parse args, build config, delegate to a perform function.
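
The "thin CLI" shape can be sketched without any dependencies. This is only an illustration under stated assumptions: it is synchronous and uses a hand-rolled parser in place of clap, and `parse_args`, the `username` field, and the body of `perform` are hypothetical stand-ins, not the project's code.

```rust
use std::error::Error;

/// Stand-in for the clap-derived `Args` struct (clap omitted to stay std-only).
#[derive(Debug, PartialEq)]
pub struct Args {
    pub username: Option<String>,
}

/// Stand-in for the arguments handed to the shared perform module.
#[derive(Debug, PartialEq)]
pub struct PerformArgs {
    pub username: Option<String>,
}

/// Hypothetical parser accepting only `--username=<name>`; a real command would use clap.
pub fn parse_args<I: IntoIterator<Item = String>>(raw: I) -> Result<Args, Box<dyn Error>> {
    let mut username = None;
    for arg in raw {
        match arg.strip_prefix("--username=") {
            Some(value) if !value.is_empty() => username = Some(value.to_string()),
            _ => return Err(format!("unrecognized argument: {arg}").into()),
        }
    }
    Ok(Args { username })
}

/// Placeholder perform function: describes the work instead of calling AWS.
pub fn perform(args: &PerformArgs) -> Result<String, Box<dyn Error>> {
    Ok(match &args.username {
        Some(name) => format!("sync user {name}"),
        None => "sync all users".to_string(),
    })
}

/// The command itself stays thin: map CLI args to PerformArgs and delegate.
pub fn run(args: Args) -> Result<String, Box<dyn Error>> {
    perform(&PerformArgs { username: args.username })
}
```

Because `run` contains no logic of its own, the same `perform` function can later be called from a Lambda handler without duplicating behavior.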

2. Implement the perform module

Shared functionality lives in shared/app/src/perform/. This is where the real work happens, and it is reused by both the CLI and the Lambda.

  • Create shared/app/src/perform/<feature>.rs.
  • Export a PerformArgs struct (public fields) and pub async fn perform(...) -> Result<..., <Feature>Error>.
  • Add the module to shared/app/src/perform/mod.rs.
  • Add a <Feature>Error variant in shared/app/src/errors.rs.
  • If the work is stack-scoped, accept &Config. For account-wide work (e.g. cross-stack user sync), accept &Clients instead.

Write unit tests alongside the module with test_support::TestClientBuilder for mocked SDK responses. Integration tests that hit real AWS go in shared/app/tests/<feature>.rs (gated with #[ignore] and run via make test-integration).
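
A minimal sketch of the perform-module shape follows, assuming a made-up "object count" feature. Everything here is hypothetical: the `ObjectLister` trait and `FakeLister` stand in for the SDK client and the project's test_support::TestClientBuilder (whose API is not shown in this document), and the function is synchronous for brevity where the real one would be async.

```rust
use std::fmt;

/// Hypothetical error type for the example feature.
#[derive(Debug, PartialEq)]
pub enum ObjectCountError {
    EmptyBucketName,
    Listing(String),
}

impl fmt::Display for ObjectCountError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::EmptyBucketName => write!(f, "bucket name must not be empty"),
            Self::Listing(msg) => write!(f, "listing failed: {msg}"),
        }
    }
}

impl std::error::Error for ObjectCountError {}

/// Stand-in for an SDK client so the logic can be exercised without AWS.
pub trait ObjectLister {
    fn list_keys(&self, bucket: &str) -> Result<Vec<String>, String>;
}

/// Arguments with public fields, built by either the CLI or the Lambda.
pub struct PerformArgs {
    pub bucket: String,
}

/// The shared entry point: validate input, call the client, map errors.
pub fn perform(client: &dyn ObjectLister, args: &PerformArgs) -> Result<usize, ObjectCountError> {
    if args.bucket.is_empty() {
        return Err(ObjectCountError::EmptyBucketName);
    }
    let keys = client
        .list_keys(&args.bucket)
        .map_err(ObjectCountError::Listing)?;
    Ok(keys.len())
}

/// In-memory fake that returns a fixed set of keys for any bucket.
pub struct FakeLister(pub Vec<String>);

impl ObjectLister for FakeLister {
    fn list_keys(&self, _bucket: &str) -> Result<Vec<String>, String> {
        Ok(self.0.clone())
    }
}
```

Keeping the client behind a narrow interface is what makes the unit tests cheap; the integration tests then cover the real AWS calls.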

3. Add a Lambda function

Once the CLI and perform module work, wrap them in a Lambda entrypoint.

cd functions
cargo lambda new <feature-name>

Add the new crate to members in the workspace Cargo.toml.

Each Lambda crate has two files:

  • src/main.rs — reads env vars (at minimum STACK), loads config, starts the runtime.
  • src/event_handler.rs — validates the inbound event (bucket, prefix, filename), short-circuits on config.debug_handler(), builds PerformArgs, calls perform.

Provide a sample payload at events/sample.json and test the handler with test_support::TestClientBuilder + debug_handler=true.
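
The handler flow described above (validate, short-circuit in debug mode, then perform) might look roughly like this. The types and names are hypothetical simplifications: a real handler would receive an S3 event from the Lambda runtime and read the debug flag from config.debug_handler().

```rust
/// Minimal stand-in for the fields of an S3 event record that the handler inspects.
pub struct S3Record {
    pub bucket: String,
    pub key: String,
}

/// What the handler expects to see, e.g. derived from the stack name and shared constants.
pub struct Expected {
    pub bucket: String,
    pub prefix: String,
    pub filename: String,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The event did not match the expected bucket or key; nothing is done.
    Rejected(&'static str),
    /// Validation passed, but the debug flag stopped execution before perform.
    DebugShortCircuit,
    /// Validation passed; a real handler would build PerformArgs and call perform here.
    Perform,
}

pub fn handle(record: &S3Record, expected: &Expected, debug_handler: bool) -> Outcome {
    if record.bucket != expected.bucket {
        return Outcome::Rejected("unexpected bucket");
    }
    // The key must be exactly "<prefix>/<filename>".
    let wanted_key = format!("{}/{}", expected.prefix, expected.filename);
    if record.key != wanted_key {
        return Outcome::Rejected("unexpected key");
    }
    if debug_handler {
        return Outcome::DebugShortCircuit;
    }
    Outcome::Perform
}
```

Returning an explicit outcome makes the debug path easy to assert on in handler tests without touching AWS.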

From the project root:

# Build all or specified pkg (using -p)
cargo lambda build [-p $pkg]

# Run local
cargo lambda watch -p $pkg

# Invoke local with a sample payload
cargo lambda invoke -p $pkg --data-example s3-event

# Invoke local using a json file as payload
cargo lambda invoke -p $pkg --data-file functions/$pkg/events/sample.json

4. Wire up Terraform

The Lambda needs infrastructure: an IAM policy scoping its permissions, a trigger (S3 event or EventBridge schedule), and an entry in the dev main.tf so the artifact gets uploaded and the function gets deployed.

4a. Shared constants → terraform locals

If the Lambda needs any prefixes, filenames, or other fixed values that terraform also needs to reference, add them to shared/constants/src/lib.rs and regenerate the terraform locals:

make locals

This keeps Rust and Terraform aligned — never hand-edit terraform/modules/stack/_locals.tf.

4b. Function-specific IAM policy

Create terraform/modules/stack/<feature>.tf following the pattern in bucket_request.tf or storage_report.tf:

locals {
  deploy_<feature> = contains(keys(local.functions), "<feature>") ? { "<feature>" = {} } : {}
}

data "aws_iam_policy_document" "<feature>" {
  for_each = local.deploy_<feature>

  statement { ... }
}

resource "aws_iam_role_policy" "<feature>" {
  for_each = local.deploy_<feature>

  role   = aws_iam_role.lambda[each.key].name
  policy = data.aws_iam_policy_document.<feature>[each.key].json
}

The base Lambda role, log group, and error alarm are created automatically from the functions map in functions.tf and alarms.tf — you do not need to add those.

4c. Trigger

Pick one based on how the function should fire:

S3 event trigger — add an aws_lambda_permission resource scoped to the source bucket ARN in your <feature>.tf, then add an entry to the appropriate bucket in notifications.tf:

for k, _ in local.deploy_<feature> : {
  id            = "<feature>-trigger"
  lambda_arn    = aws_lambda_function.main[k].arn
  events        = ["s3:ObjectCreated:*"]
  filter_prefix = "${local.<feature>_prefix}/"
  filter_suffix = local.<feature>_file
}

Add aws_lambda_permission.<feature> to the depends_on list.
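
To reason about which uploads would fire the function, it can help to model the prefix/suffix filter locally. S3 evaluates the filter itself; the function below is only an illustrative model with a hypothetical name, using the sync-users prefix and TRIGGER filename from this documentation as sample values.

```rust
/// Models S3 notification key filtering: a key matches when it starts with
/// the configured prefix and ends with the configured suffix.
pub fn matches_filter(key: &str, filter_prefix: &str, filter_suffix: &str) -> bool {
    key.starts_with(filter_prefix) && key.ends_with(filter_suffix)
}
```

Note that a suffix filter also matches nested keys such as sync-users/nested/TRIGGER, which is one reason the event handler still validates the bucket, prefix, and filename itself.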

Scheduled trigger — add local.deploy_<feature> into local.scheduled_functions in scheduler.tf. The schedule itself is configured via the schedule and tz fields on the functions map entry (defaults in variables.tf).

4d. Register in the dev main.tf

Add the function to local.functions in the project-root main.tf so it gets built, uploaded to the artifacts bucket, and deployed:

<feature> = {
  bucket = local.functions_bucket
  file   = "target/lambda/<feature>/bootstrap.zip"
  env    = { SOME_VAR = local.some_value } # optional
}

4e. Apply

make deploy s=<stack> p=<profile>

Testing the new function

  • CLI (local, against real AWS): cargo run -p dcp -- <subcommand> [args]
  • Lambda (local watch + invoke): cargo lambda watch -p <feature> in one shell, then cargo lambda invoke -p <feature> --data-file functions/<feature>/events/sample.json (or --data-example s3-event for a built-in fixture).
  • Lambda (invoked remotely with sample payload): make trigger f=<feature> s=<stack> p=<profile>
  • Unit tests: cargo test -p <crate>
  • Integration tests: make test-integration s=<stack> p=<profile>

Each feature should also get a technical doc at docs/src/technical/<feature>.md following the format of the others in that directory.

Roadmap

TODO

Lyrasis hosting and support

DuraCloud Preserve is open source and freely available for anyone to deploy into their own AWS account. However, Lyrasis provides a hosted option for individuals or institutions wanting a managed service.

Benefits

  • Lyrasis manages an AWS account for you, which can be transferred to your ownership at any time with 30 days' notice of cancelation of your hosting contract.
  • Setup, configuration, and monitoring are fully handled by Lyrasis.
  • You receive S3 access credentials to interact with DuraCloud Preserve using any S3 client.
    • Credentials can provide “full”, “limited”, or “restricted” access per user (refer to user docs for details).
  • Technical support is provided by experienced hosting staff.
  • We provide access to a web application (SFTPGo) for file uploads.

For pricing information and other details, contact Lyrasis.