Introduction
DuraCloud Preserve is a project designed to make using AWS S3 as simple as possible for users who only want to care about uploading files, or integrating S3 storage with other applications, and not have to worry about esoteric configuration or infrastructure management. It also supports digital preservation use cases by managing the configuration and features available to S3 to support long term access to and preservation of files.
The goal is to make it easy for users to choose any off-the-shelf S3 client and interact with S3 while gaining more advanced features by default. Advanced features are described in more detail throughout the user and technical documentation, but in brief: versioning, inventory, replication, logging and similar features are enabled as buckets are created, without a user having to do anything in AWS.
Checksum verification is performed periodically to ensure that file integrity is maintained between the primary and replicated (backup) files. This builds upon the already impressive levels of durability that S3 provides by adding a recurring check that files are what they are intended to be.
Additional features include generating manifest (file inventory) and storage reports and user access control via preconstructed groups that are scoped to stacks. When deployed every resource is created “within” a stack. A stack is simply a resource naming prefix and tag applied to all resources managed by the deployed components to exclusively associate them. This makes it possible to have multiple stacks within a single account and makes it so different users can belong to one or more stacks.
Lyrasis provides a hosting service for DuraCloud Preserve that handles the AWS account creation and installation and comes with access to a web-based UI for S3, using SFTPGo. S3 can then be accessed through the web UI, or through direct AWS access credentials for broader integrations or for use with tools like the AWS CLI.
Context
DuraCloud Preserve is a continuation of the DuraCloud project in a form that is intended to be more sustainable for the long term. It does this by focusing on the core mission of DuraCloud but with a significantly smaller technical footprint, made possible by leveraging AWS S3 features directly in contrast to the more abstracted approach that DuraCloud took in being open to support multiple backend storage providers.
But the goals remain the same: in the digital era, ensuring that critically important documents remain safe and available is a continual challenge. Physical computing hardware that is used to create and store documents can fail or become obsolete very quickly, providing a need for tools to ensure that these documents remain available. DuraCloud Preserve aims to address these concerns:
- How do I upload files to the storage service in a simple and reliable way?
- How do I ensure that the storage service that I am using receives a copy of my local files?
- How do I ensure that files remain intact over time?
- How do I retrieve my content once it is stored?
- How do I recover a file if it has been overwritten or corrupted?
- How do I make my content publicly accessible at a stable URL?
- How am I protected against the storage service becoming obsolete or going away?
Answers to these questions are provided throughout the rest of this documentation.
Features
This is a brief overview of the functionality that is explained more thoroughly in the user guide and technical documentation:
Access controls
Users can be standard or power users, depending on which stack-created IAM group they are assigned to.
- Standard users can list and upload files but cannot download or delete them.
- Power users can list, upload, download, and delete files.
Only AWS account administrators can access replicated buckets and objects.
Audit trail
Request logs are generated for each user-created bucket. This is raw AWS provided data that can be processed using tools like DuckDB.
Checksum reports
Checksum reports are generated on a configurable schedule, comparing checksums across source and replication buckets to detect corruption. Files found to be corrupt can be restored from the verified copy. See the checksum verification documentation for more details.
Choice of region
Files can be stored in any AWS region supported by the infrastructure.
CLI available
A command-line interface (dcp) is available for advanced users. It provides access to all core functions and additional maintenance commands for tasks such as checksumming local files, reconciling bucket configuration, and transferring data between buckets.
Hosting and support
If creating an AWS account and deploying resources to it is not possible then Lyrasis provides a hosting and technical support option to handle the infrastructure for you.
Inventory
A file manifest is generated for each user-created bucket. The raw AWS inventory data is available in Parquet format, and a consolidated, user-friendly CSV file that includes the S3 URL for each file is also made available.
Lifecycle transitions
Files are uploaded to the standard storage tier and transition to a selected storage class after a configurable interval, which can be specified for each stack deployment. Old versions of files and aborted multipart uploads are automatically deleted after a configurable period.
Manifest reports
A consolidated, human-readable CSV file is generated per bucket, listing all files with metadata including S3 URL, size, storage class, and last modified date.
Public access via CDN (Content Delivery Network)
A CloudFront distribution and bucket are created that can be used to make files publicly available. Simply upload files to the bucket and share the public URL using a specified domain.
Other buckets can be created as publicly accessible by naming them with a -public suffix. Files uploaded to such buckets will be available using a standard, unauthenticated S3 URL.
Files in public buckets are stored in the intelligent storage tier and are not transitioned to Glacier; however, replication still occurs and the backup copies are stored in Glacier.
Reconciliation reports
The reconciliation report is used to detect drift in bucket configuration, providing reassurance that buckets are configured correctly and working as expected.
Replication
Files for all buckets are replicated to Glacier Deep Archive. These files are included in the checksum verification process to determine file integrity. We have dedicated documentation for how this works.
Storage reports
An HTML storage report is generated showing usage statistics across all buckets in the stack, including total file counts and storage consumed by bucket and top-level prefix. It also includes the year-to-date total of data transfer out from S3 to the internet (requires Cost Explorer to be enabled in AWS, and an active Stack cost allocation tag).
Versioning
Bucket versioning is enabled. This supports restoring a file for a configurable number of days after it is updated; the number of days can be specified for each stack deployment.
Web UI integration with SFTPGo
There is support within the application and deployment tooling for SFTPGo integration, which provides a web based interface for S3. Users can be created that are pre-configured with appropriate access (per the access controls that have been assigned to them) and the SFTPGo user account is kept in sync as buckets are created, or via the dcp cli.
General integrations
Web applications that support use of Amazon S3 for storage
Any application or framework that can be configured to use Amazon S3 for storage can work with DuraCloud Preserve. By simply using a bucket created as part of a DuraCloud Preserve stack, files will be stored with the additional benefits outlined in this documentation, including versioning, replication and checksum verification.
Some specific examples:
- Any Rails web application using ActiveStorage
- Archivematica Storage Service
- CollectionSpace file storage
- DSpace Storage Layer
Lyrasis service integrations
ArchivesSpace
ArchivesSpace itself does not manage digital content and provides no way to upload files. The public URLs provided by the DuraCloud Preserve CloudFront-enabled bucket can be used to host files that are referenced in Digital Objects using the File URI field, making them openly accessible on the internet.
CollectionSpace
Refer to the roadmap for any upcoming work.
DSpace
The Replication Task Suite is a plugin for DSpace that adds preservation capabilities that can be accessed using the DSpace user interface. It creates archival information packages that back up DSpace items in a self-contained way, and these packages are periodically transferred to external storage, including Amazon S3. Using a DuraCloud Preserve created bucket for this purpose works equivalently to using S3 for the DSpace Storage Layer (assetstore). If both are configured this way, files gain a dual layer of protection, as both the assetstore and the archival packages benefit from versioning, replication, checksum verification and so on.
Other integrations
Archive-It
Create an inventory and a backup of WARC files retrieved from the Internet Archive - Archive-It service.
Checksum Verification
DuraCloud Preserve stores and replicates files using Amazon S3. Checksum verification is the process by which the system confirms that stored files have not been silently corrupted over time. Even in highly durable storage systems, subtle errors (known as “bit rot”) can alter file content without any obvious warning. By regularly comparing checksums across independent copies of each object, the system can detect and remediate corruption before it affects both copies.
How It Works
1. Upload Integrity
AWS S3 provides integrity guarantees at the point of upload. Using built-in integrity checking mechanisms, S3 validates received data and rejects any upload where the computed checksum does not match. A successful upload response from S3 confirms that the stored object matches exactly what was transmitted.
The system’s integrity guarantee begins at this point of successful upload.
The checksum and version of any stored object can be retrieved using the AWS CLI:
aws s3api head-object --bucket ${bucket} --key ${key} --checksum-mode ENABLED
Example response:
{
"AcceptRanges": "bytes",
"LastModified": "2026-01-24T00:22:19+00:00",
"ContentLength": 15310515,
"ChecksumCRC64NVME": "V+va1ramtYo=",
"ChecksumType": "FULL_OBJECT",
"ETag": "\"822f9ffde463633f9a56df6d90b1dbb6\"",
"VersionId": "HnU.prnfFqU2oJKqjIibty9_cet6zTDH",
"ContentType": "application/pdf",
"ServerSideEncryption": "AES256",
"Metadata": {},
"StorageClass": "GLACIER_IR",
"ReplicationStatus": "COMPLETED"
}
2. Replication
After a successful upload, AWS S3 replication creates a copy of the object in a second independent bucket, typically within 15 minutes. The same upload integrity guarantees apply to replication, ensuring the replicated object is an exact copy of the source.
The checksum and version ID of the replica will match the source object exactly:
{
"AcceptRanges": "bytes",
"LastModified": "2026-01-24T00:22:19+00:00",
"ContentLength": 15310515,
"ChecksumCRC64NVME": "V+va1ramtYo=",
"ChecksumType": "FULL_OBJECT",
"ETag": "\"822f9ffde463633f9a56df6d90b1dbb6\"",
"VersionId": "HnU.prnfFqU2oJKqjIibty9_cet6zTDH",
"ContentType": "application/pdf",
"ServerSideEncryption": "AES256",
"Metadata": {},
"StorageClass": "GLACIER",
"ReplicationStatus": "REPLICA"
}
Note that ChecksumCRC64NVME and VersionId are identical across both objects.
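The comparison described above can be sketched in Python. This is only an illustration: the `differing_fields` helper is hypothetical and is not part of DuraCloud Preserve or the AWS CLI. It assumes the two `head-object` responses have already been saved or captured as JSON, and it compares only the two fields named above.

```python
import json
from typing import Any, Dict, List

# The two fields the documentation says must be identical across both copies.
COMPARED_FIELDS = ("ChecksumCRC64NVME", "VersionId")


def differing_fields(source_head: Dict[str, Any], replica_head: Dict[str, Any]) -> List[str]:
    """Return the compared fields that are missing or differ between two responses."""
    differences = []
    for field in COMPARED_FIELDS:
        source_value = source_head.get(field)
        replica_value = replica_head.get(field)
        # A field missing from either response is treated as a difference.
        if source_value is None or source_value != replica_value:
            differences.append(field)
    return differences


if __name__ == "__main__":
    # Abbreviated copies of the example responses shown above.
    source = json.loads(
        '{"ChecksumCRC64NVME": "V+va1ramtYo=", '
        '"VersionId": "HnU.prnfFqU2oJKqjIibty9_cet6zTDH", '
        '"StorageClass": "GLACIER_IR"}'
    )
    replica = json.loads(
        '{"ChecksumCRC64NVME": "V+va1ramtYo=", '
        '"VersionId": "HnU.prnfFqU2oJKqjIibty9_cet6zTDH", '
        '"StorageClass": "GLACIER"}'
    )
    print(differing_fields(source, replica))
```

Fields such as `StorageClass` and `ReplicationStatus` are expected to differ between the source and the replica, which is why the sketch ignores them.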
Further reading:
- Meeting compliance requirements with S3 Replication Time Control
- Replicating objects within and across Regions
3. Durability
AWS S3 is designed for 99.999999999% (eleven nines) durability. Given S3’s upload integrity guarantees and its documented durability, uploaded and replicated objects can be considered correct and consistent at the point of replication with a very high degree of confidence.
Further reading: Durability in Amazon S3
4. Ongoing Verification
S3 Batch Operations are used to generate checksum reports across all objects in both the source and replication buckets. These reports are compared on a regular schedule.
| Result | Meaning |
|---|---|
| Version ID and checksum match | Verification successful — objects are identical |
| Version ID or checksum do not match | One object may be corrupted — investigation required |
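The classification in the table above could be sketched as follows. This is a hedged illustration rather than the project's implementation: `ObjectRecord` and `compare_reports` are hypothetical names, and the sketch assumes each report has already been parsed into a dictionary keyed by object key (the actual S3 Batch Operations report format is not described here).

```python
from typing import Dict, List, NamedTuple


class ObjectRecord(NamedTuple):
    """Hypothetical per-object entry parsed from a checksum report."""

    version_id: str
    checksum: str  # for example, the base64 CRC64NVME value reported by S3


def compare_reports(
    source: Dict[str, ObjectRecord],
    replica: Dict[str, ObjectRecord],
) -> Dict[str, List[str]]:
    """Classify every source key as a match, a mismatch, or a missing replica."""
    result: Dict[str, List[str]] = {"match": [], "mismatch": [], "missing_replica": []}
    for key, source_record in sorted(source.items()):
        replica_record = replica.get(key)
        if replica_record is None:
            result["missing_replica"].append(key)
        elif source_record == replica_record:
            # Both the version ID and the checksum are identical.
            result["match"].append(key)
        else:
            # Either the version ID or the checksum differs: investigate.
            result["mismatch"].append(key)
    return result


if __name__ == "__main__":
    source = {
        "a.pdf": ObjectRecord("v1", "V+va1ramtYo="),
        "b.pdf": ObjectRecord("v2", "AAAAAAAAAAA="),
    }
    replica = {"a.pdf": ObjectRecord("v1", "V+va1ramtYo=")}
    print(compare_reports(source, replica))
```

A mismatch only says that the two copies disagree; deciding which copy is the valid one still requires the investigation steps described in this documentation.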
If a Mismatch Is Detected
If verification finds that checksums do not match, the following steps identify and repair the corruption.
Step 1 — Check prior reports. A previously generated checksum report may already contain the expected checksum values, making it straightforward to determine which copy — source or replica — is corrupt.
Step 2 — Request object metadata. If no prior report is available, retrieve the stored checksum value and version ID directly from each object’s metadata and compare them:
aws s3api head-object --bucket ${bucket} --key ${key} --checksum-mode ENABLED
Step 3 — Download and verify locally. For a more thorough inspection, download the objects and compute checksums locally using the same algorithm S3 uses (CRC-64/NVME by default):
# Retrieve the stored checksum
aws s3api head-object --bucket ${bucket} --key ${key} --checksum-mode ENABLED
# Download the file
aws s3 cp s3://${bucket}/${key} .
# Compute the checksum locally using the DuraCloud Preserve CLI
dcp checksum --file ${key}
Step 4 — Restore. Once the valid copy is confirmed, re-upload it to the source bucket to repair the corrupted object.
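When comparing values by hand, it helps to know that S3 returns checksum values base64-encoded (as in the `ChecksumCRC64NVME` field shown earlier), while some local tools print checksums in hexadecimal. The following sketch converts between the two forms so that they can be compared. The function names are hypothetical, and whether the conversion is needed depends on the output format of the tool used to compute the local checksum, which is not described here.

```python
import base64


def s3_checksum_to_hex(b64_value: str) -> str:
    """Convert a base64 checksum, as returned by S3, to a lowercase hex string."""
    return base64.b64decode(b64_value, validate=True).hex()


def hex_to_s3_checksum(hex_value: str) -> str:
    """Convert a hex checksum to the base64 form that S3 reports."""
    return base64.b64encode(bytes.fromhex(hex_value)).decode("ascii")


if __name__ == "__main__":
    stored = "V+va1ramtYo="  # value from the example response in this documentation
    as_hex = s3_checksum_to_hex(stored)
    print(as_hex)
    # Converting back should reproduce the stored value.
    print(hex_to_s3_checksum(as_hex) == stored)
```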
Important
Hosted clients: Lyrasis will handle checksum verification and file restoration on your behalf if errors are found.
Learn more about Lyrasis Hosting
Further reading:
- Compute checksums
- Examples: S3 Batch Operations completion reports
- Efficiently verify Amazon S3 data at scale with compute checksum operation
What Successful Verification Confirms
Successful verification confirms that the source and replica objects are identical to each other. Given S3’s upload integrity guarantees and its documented durability, this means objects are also identical to what was originally uploaded to a very high degree of confidence.
This strategy is considered sufficient for the vast majority of standard use cases. In the unlikely event that corruption is not automatically addressed by the S3 infrastructure, it is highly improbable that both independent copies would be corrupted in exactly the same way — which would be required to produce a false verification result.
For the strongest possible guarantee, independent verification using locally managed checksums is required. See Stricter Compliance Requirements below.
Checksum Reports
Checksum reports are stored in S3 for the duration of the stack’s retention policy and can be downloaded at any time.
For organizations requiring independent verification or stricter compliance, reports should be downloaded and stored locally or in a system separate from S3.
Stricter Compliance Requirements
For organizations with higher assurance requirements — such as regulated industries or formal digital preservation programs — the approach described above may not be sufficient on its own, as it is ultimately dependent on the claims of a single third-party provider (Amazon AWS). An independent audit mechanism, separate from the primary storage provider, is required for the strictest compliance standards.
Best practice for stricter compliance:
- Generate checksums locally before uploading. Use a tool such as QuickHash to compute a checksum for each file before it is uploaded to S3.
- Maintain a local checksum inventory. Keep a record of each filename and its corresponding checksum in a safe location. This inventory can be stored in S3, but must also exist independently.
- Verify on retrieval. When downloading a file, recompute its checksum locally and compare it against the inventory record.
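As an alternative to a desktop tool, the three practices above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: SHA-256 is chosen here only as an example algorithm, the function names are hypothetical, and the two-column CSV layout (`filename`, `sha256`) is an assumed inventory format, not one prescribed by DuraCloud Preserve.

```python
import csv
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root: Path) -> Dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def write_inventory(inventory: Dict[str, str], destination: Path) -> None:
    """Write the inventory as a CSV file (assumed layout: filename, sha256)."""
    with destination.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "sha256"])
        writer.writerows(sorted(inventory.items()))


def verify_file(path: Path, expected: str) -> bool:
    """Recompute the checksum of a retrieved file and compare it to the record."""
    return sha256_of(path) == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "to-upload"
        root.mkdir()
        (root / "report.txt").write_bytes(b"hello")
        inventory = build_inventory(root)
        write_inventory(inventory, Path(tmp) / "inventory.csv")
        print(verify_file(root / "report.txt", inventory["report.txt"]))
```

Keeping the inventory file outside S3 as well is what makes the later comparison independent of the storage provider.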
It is also important to note that DuraCloud Preserve is entirely dependent on the Amazon AWS S3 service, its regional infrastructure, and its policies. Organizations with strict independence or sovereignty requirements should factor this into their preservation planning.
Frameworks and standards for reference:
- NDSA Levels of Digital Preservation — A tiered framework for assessing digital preservation practices
- Digital Preservation Coalition — Audit and Certification — Overview of audit standards and certification options
Getting started
Whoever is responsible for deployment will provide access credentials to users. If you intend to connect directly to S3 using a GUI or CLI tool, you should receive an access key and secret, which serve as a username and password for interacting with S3. It is important to treat these as sensitively as you would any username and password.
If you are intending to use the web client then you should receive a username (your email address), password and the url to login. It’s completely fine to use both approaches if you’d like access to both.
You should also receive a stack name. This will typically be in the form duracloud-$ID where $ID is an identifier assigned by those handling the deployment. It may be based on or similar to a sitecode used by your institution for its domain (e.g. INSTITUTION.edu).
It is important to know this because your user will only be able to interact with a subset of buckets in an AWS account that are prefixed with that stack name. You will also see references to stack name throughout the documentation.
Important
Before proceeding confirm you have received:
- Access key (username) and secret (password) for direct S3 access if requested
- Stack prefix (duracloud-$ID)
- Web client username, password and URL if requested
Lyrasis Hosting client permissions
Hosting clients will start with identifying one user who will have power user permissions. This user will be able to upload, download, and delete. The initial power user will need to provide the Hosting team the names of other users for whom they wish to have accounts and indicate whether those users should be power users or standard users who can only upload files. The Hosting team recommends limiting the number of power users per institution to 1 or 2 individuals because of the power to delete.
S3 Client Options
In order to keep things simple for the end user, less complicated to maintain on the technical side, but also provide some flexibility over how content can be uploaded to S3, there is no prescribed user interface. Any S3-compatible client can be used to interact with the tool.
We believe this is the right choice because there are many popular, well-supported, and tested options already available. However, we provide streamlined documentation for the use of the open source program Cyberduck as a downloadable GUI option, the AWS CLI for command line usage, and the web-based browser SFTPGo for the simplest access point.
Here’s a list of clients that have been used or tested by Lyrasis staff:
- AWS CLI
- Cyberduck
- S3Browser (Windows only - we are not providing additional documentation about this option)
- SFTPGo
But there are many others and you are free to use any S3 compatible client that you prefer.
After connecting to your S3 account via your preferred method, you will see the folders already created for your account using your duracloud-$ID, including:
- -managed
- -public (default bucket for files that can be accessed publicly through CloudFront)
- -request (used for making create bucket or checksum inventory requests)
AWS CLI Documentation
Step 1: Install AWS CLI
Installing or updating to the latest version of the AWS CLI
After following the instructions for your operating system, check your installation:
aws --version
Step 2: Configure Your AWS Credentials
Configuration and credential file settings in the AWS CLI
Verify your configuration:
aws sts get-caller-identity
If you have multiple AWS accounts or environments, set up a named
profile and configure with your key, secret, and region (us-west-2):
aws configure --profile dcp
Setting Region for Lyrasis Hosting
If you are a Lyrasis-hosted client, the AWS region is us-west-2. You can set this in a few ways:
1. Add --region directly to the command
This is the most explicit method and overrides all other settings (profiles, config files, etc.):
aws s3 ls --region us-west-2
With a profile:
aws s3 sync ./data s3://{stackname}-bucket --profile dcp --region us-west-2
2. Set the region temporarily in your shell
This applies only to the current terminal session:
export AWS_REGION=us-west-2
Then commands can be run without specifying the region.
3. Set the region inside the profile
Add the region to the profile's section in your AWS CLI config file (typically ~/.aws/config):
[profile dcp]
region = us-west-2
output = json
Cyberduck Documentation
Cyberduck documentation for setting up new connections:
https://docs.cyberduck.io/cyberduck/connection/
Step-by-step Instructions
- File → Open Connection
- Change dropdown menu to Amazon S3
- If you are a Lyrasis Hosting Services client, update Server to: s3.us-west-2.amazonaws.com
  - (Lyrasis Hosting currently supports us-west-2 and us-east-2)
- Type in provided Access Key ID and Secret Access Key
- Click Connect

Tip
- Click Go → Enclosing Folder to navigate up the file path tree one level at a time, or click in the filepath dropdown to navigate up multiple levels after your connection is set up.
- Logs and other items you download will go to your Downloads folder by default. You can change this under Edit → Preferences → Transfers (General tab)
SFTPGo Documentation
Navigate to: DuraCloud Preserve
Use this web-based interface to log in, upload, and download content.
Individual users will be provided credentials by their system administrator (such as the Lyrasis Hosting team). The first time you log in, you will be asked to change your password. You can do this from the small person icon in the upper-right corner of the screen.


Upon login you will see three folders already created for you:
- managed
- public
- request
You may:
- Create new buckets by uploading a request file (see Creating Buckets)
- Upload content to buckets (creating subfolder structures as needed)
- Download content from buckets
- Download reports and other hosted content from the managed bucket

Tip
Before proceeding, confirm that you are able to successfully connect to S3.
Managed Resources
When you view your S3 account using a GUI client or the AWS CLI for the first time, you will notice a number of pre-existing buckets that have been created.
Pre-Existing Buckets
- duracloud-$ID-request: Used to make requests to create new buckets. See: Creating Buckets
- duracloud-$ID-managed: Used to deposit generated files such as audit history, exports, inventory, and reports. This bucket is read-only.
- duracloud-$ID-private: Default private bucket.
- duracloud-$ID-public: Default public bucket. Files uploaded here will have a publicly accessible URL.
Managed Bucket Structure
Over time, the duracloud-$ID-managed bucket will contain the following prefixes (folders):
- audit: AWS generated audit logs
- batch: AWS generated files related to S3 batch operations
- cloudtrail: AWS generated files for events related to S3
- feedback: Application generated files for troubleshooting issues
- manifests: AWS generated inventory files
- metadata: Application generated files related to various stats (checksum, usage etc.)
- reports: Application generated files intended for user review and download
More information about the data available in the -managed bucket is available on the Reports page.
Tip
- If the AWS account is used for other purposes, additional buckets may exist. This may also occur if there are multiple stacks per account.
- However, the access credentials provided for this service will only work with the eligible stack resources associated with the user credentials.
Creating Buckets
Important
These instructions apply to all users, whether using Cyberduck, SFTPGo, the AWS CLI, or another S3-compatible client. The process is the same for everyone: upload a text file containing your bucket names to the duracloud-$ID-request bucket under the buckets folder. Instructions for each client are provided in the Steps section below.
Create a Bucket
To create a bucket, you must create a text file (.txt) containing the names of up to five buckets you want to create.
Naming Rules
- Bucket names are automatically prefixed with the stack name — do not include the stack name in the file.
- Each bucket name must be entered on its own line.
- Bucket names may contain only alphanumeric characters and -.
- Bucket names must not begin or end with -.
- Bucket names must be no more than 63 characters total, including:
  - The stack name prefix (duracloud-$ID)
  - 5 reserved characters for the -repl suffix
Tip
Practically, this means your names should be no more than: 63 - 5 - (length of your duracloud-$ID)
Public Bucket Naming
To create a publicly accessible bucket, the name must end with -public.
- This subtracts an additional 7 characters from the maximum length.
Reserved Prefixes and Suffixes
The following cannot be used in bucket names:
- duracloud-: already included as the foremost prefix
- -logs: used for access logging buckets
- -managed: used for system-managed buckets (reports, logs, and other system data appear here)
- -repl: used for replication target buckets (Amazon Glacier replication)
- -request: used for bucket request files
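The naming rules above can be checked before a request file is uploaded. The sketch below is a hedged illustration, not the validation the service itself performs: all function and constant names are hypothetical, the length limit follows the formula given in the tip above (63 - 5 - length of the stack name), and reserved values are checked only at the start or end of a name, which is one possible reading of the rule. Amazon S3 itself requires lowercase bucket names; how mixed-case names are handled is not stated in this documentation, so lowercase names are the safer choice.

```python
import re
from typing import List, Sequence

MAX_BUCKET_NAME_LENGTH = 63
REPL_SUFFIX = "-repl"  # 5 characters reserved for the replication bucket
MAX_NAMES_PER_REQUEST = 5
RESERVED_PREFIXES = ("duracloud-",)
RESERVED_SUFFIXES = ("-logs", "-managed", "-repl", "-request")
ALLOWED_CHARACTERS = re.compile(r"[A-Za-z0-9-]+")


def max_name_length(stack_name: str) -> int:
    """Longest name a user may request, per the documented formula."""
    return MAX_BUCKET_NAME_LENGTH - len(REPL_SUFFIX) - len(stack_name)


def validate_name(name: str, stack_name: str) -> List[str]:
    """Return a list of problems with one requested name (empty if none found)."""
    problems = []
    if not ALLOWED_CHARACTERS.fullmatch(name):
        problems.append("may contain only alphanumeric characters and '-'")
    if name.startswith("-") or name.endswith("-"):
        problems.append("must not begin or end with '-'")
    # A '-public' suffix is part of the name, so it counts toward the limit.
    if len(name) > max_name_length(stack_name):
        problems.append(f"must be at most {max_name_length(stack_name)} characters")
    if name.startswith(RESERVED_PREFIXES):
        problems.append("must not start with a reserved prefix")
    if name.endswith(RESERVED_SUFFIXES):
        problems.append("must not end with a reserved suffix")
    return problems


def validate_request(names: Sequence[str], stack_name: str) -> List[str]:
    """Check a whole request file: at most five names, each one valid."""
    problems = []
    if len(names) > MAX_NAMES_PER_REQUEST:
        problems.append(f"a request may contain at most {MAX_NAMES_PER_REQUEST} names")
    for name in names:
        problems.extend(f"{name}: {problem}" for problem in validate_name(name, stack_name))
    return problems


if __name__ == "__main__":
    # "duracloud-abc" is a made-up stack name used only for illustration.
    print(validate_request(["photos", "maps-public", "old-repl"], "duracloud-abc"))
```

Because one invalid name prevents every bucket in the file from being created, checking the whole list first avoids a failed request.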
Steps
- Open a text editor (such as Notepad or Notepad++) and create a file containing your bucket names, one per line. Save it as a .txt file.
- Upload the file to the duracloud-$ID-request bucket, inside the buckets folder.
  - If the buckets folder does not exist then create it first.
  - Buckets can only be created from files uploaded to the buckets folder in the request bucket.
Cyberduck
- Connect to your S3 account (see Connecting to S3).
- Navigate to the duracloud-$ID-request bucket.
- If a buckets folder does not exist, create one: Action → New Folder.
- Open the buckets folder and drag your .txt file into the Cyberduck window, or click Upload to browse for it.
- Cyberduck will show a transfer log confirming the upload.
Tip
When re-using the same file with updated bucket names (Step 8 below), Cyberduck may ask you to confirm overwriting the existing file. Confirm to proceed.
SFTPGo
- Log in to the SFTPGo web interface (see Connecting to S3).
- Navigate to your home folder. You will see managed and public folders; do not upload to these. Instead, navigate back to the root or look for a request folder corresponding to duracloud-$ID-request.
- If a buckets folder does not exist inside the request area, click New Folder to create it.
- Open the buckets folder, then click Upload Files or drag your .txt file into the upload area.
- Click Save to complete the upload.
AWS CLI
aws s3 cp mybuckets.txt s3://duracloud-$ID-request/buckets/mybuckets.txt
- The file will be processed in the background and an attempt will be made to create each bucket.
- Processing normally takes 0–2 minutes.
- A report file will be uploaded to the feedback folder inside the -managed bucket, providing details about the outcome.
- Review the log when it becomes available.
- Refresh your client view or reconnect to S3.
- Successfully created buckets will now be visible.
- Each new bucket will have an associated replication bucket with a -repl suffix.
- Replication buckets are list-only (files cannot be downloaded).
- The newly created buckets are now usable, and files can be uploaded.
- To create more buckets:
- Re-use and re-upload the same file with new bucket names, or
- Create and upload an entirely new file. Both approaches work.
Troubleshooting
- If you do not see any new buckets created, check the logs in the $ID-managed bucket feedback folder for error messages.
- If you attempt to create multiple buckets at one time and one bucket has an error (for example, the name is too long or you attempted to create more than five buckets), none of the buckets will be created. You must correct the issue and start again for all buckets.
Uploading Files
You will be able to upload files to the buckets you’ve created (see Creating Buckets). After your content has been uploaded, it will be mirrored in Glacier Deep Archive in the bucket that duplicates your bucket names with the -repl suffix.
You will not be able to do anything with the content in the -repl bucket. You will be able to see filenames, as a reassurance that your content has been mirrored, but if you attempt to download or get information about the files, you will likely encounter:
- Access denied
- Failure to read attributes of [filename]. Forbidden. Request Error
or other errors.
Files in these -repl buckets will only be accessed in the event of checksum failure in your active file structures, so those files can be replaced by this Glacier Deep Archive copy.
Tip
This tool is intended primarily for the long-term storage and preservation of digital assets. Frequent or repeated access to private files within the system may lead to increased operational costs and could potentially compromise data integrity. Users are advised to limit such access and use this tool in accordance with its preservation-focused purpose.
CLI option
Refer to the AWS CLI S3 documentation:
https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html
Upload files (entire folder):
aws s3 sync ./local-folder s3://{stackname}-bucket
Upload a single file:
aws s3 cp myfile.txt s3://{stackname}-bucket
Cyberduck option
Cyberduck documentation on File Transfers
- Uploading folders or individual files is as simple as clicking and dragging from a folder in File Explorer / Finder into the Cyberduck client. Alternatively, click the Upload button in the Cyberduck client to browse for files or folders.
- Cyberduck will provide a pop-up log indicating whether the upload was successful. Another pop-up will appear if there are any errors or issues (for example, if you are not authorized to upload to the bucket).
SFTPGo option
- Uploading folders or individual files is as simple as clicking and dragging from a folder in File Explorer / Finder into the web application. Alternatively, click the “drop files here to upload” area to browse for files or folders.
- You cannot upload an empty folder, but you can create folder structures within your -private and -public folders before uploading content.
- Uploading very large files may take a long time and can time out. If you have files larger than 1–2 GB, you may need to use Cyberduck or another S3-compatible tool.

Use the New Folder button to create your folder structure(s) before uploading content.
- The web application will show a list of all files queued for upload so you can confirm filenames and paths.
- After uploading content, do not forget to click the Save button in the bottom right corner, or your content will not be uploaded.
- After completion, you will see your preserved file structure. The default display shows 10 results at a time, but this can be increased up to 500.

Screen display showing preserved folder structure and the option to change the number of displayed results.
Reminder: You will not see the replicated file structure in the SFTPGo web application, but your files are still being replicated in Glacier.
Tip
We have occasionally seen a generic “Error uploading files” message in SFTPGo. Closing the error and attempting the upload again has so far worked successfully (sometimes requiring closing the error twice).
The cause is not yet certain; it may be related to attempting uploads after a session has expired. This is an area for further investigation and feedback.
Reports
Data generated by your S3 client about your content is saved in the -managed bucket associated with your S3 account.
After you begin creating buckets and uploading content, you will see folders in the -managed bucket, including:
- audit: AWS generated audit logs
- batch: AWS generated files related to S3 batch operations
- cloudtrail: AWS generated files for events related to S3
- feedback: Application generated files for troubleshooting issues
- manifests: AWS generated inventory files
- metadata: Application generated files related to various stats (checksum, usage etc.)
- reports: Application generated files intended for user review and download
audit
- A folder for each bucket you created
- Mostly machine-readable data
- Provides details about activities performed on your data
batch
- Mostly machine-readable data
- Provides outputs from S3 batch operations
cloudtrail
- Mostly machine-readable data
- Provides outputs from S3 events
feedback
- Provides files for recording issues that arise
manifests
- Mostly machine-readable data
- Provides outputs from S3 inventory
metadata
- Provides raw stats related to checksum and inventory processes
reports
This is the primary folder for content intended for review.
Checksum
Checksum reports are organized by date and stored under reports/ in the managed bucket.
There are two types of checksum reports:
- Checksum verification report (_checksum-report.csv)
- Checksum inventory report (_checksum-inventory.csv)
Checksum verification report
The checksum verification report compares source and replica checksums and summarises the totals: matches, mismatches, missing replicas, and failures.
Checksum inventory report
This report uses existing inventory reports to generate a CSV of checksum metadata.
- reports/latest/checksums/<bucket>_checksum-inventory.csv: most recent report
- reports/YYYY-MM-DD/checksums/<bucket>_checksum-inventory.csv: date-stamped archive
Each CSV is a per-object checksum inventory. Each row includes the object key, its CRC64NVMe checksum (when present), and a status:
- ok: no errors were encountered retrieving metadata for this object
- not_found: object was not found
- missing_checksum: object exists but has no checksum recorded
- error: other failure
Note: checksum inventory does not provide checksum verification.
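For a quick look at a downloaded checksum inventory, the statuses can be tallied with standard shell tools. The sketch below is illustrative only: it assumes the CSV has a header row and that the status is the last comma-separated field on each line, neither of which is confirmed here, and the function name status_summary is hypothetical.

```shell
# Count rows per status in a checksum inventory CSV.
# Assumptions (not confirmed by the documentation): the file has a header
# row, and the status is the last comma-separated field on each line.
status_summary() {
  awk -F',' 'NR > 1 && NF > 0 {
      s = $NF
      gsub(/["\r]/, "", s)      # drop quotes and any trailing carriage return
      count[s]++
    }
    END { for (s in count) print s, count[s] }' "$1" | LC_ALL=C sort
}

# Example (hypothetical local file name):
# status_summary digipres-dev1-private_checksum-inventory.csv
```

Any status other than ok is worth a closer look, but remember that this tally says nothing about whether source and replica checksums agree.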
Manifest
Inventory manifest reports provide a listing of all files in each bucket. They are stored under reports/ in the managed bucket:
- reports/latest/manifests/<bucket>.csv: most recent report
- reports/YYYY-MM-DD/manifests/<bucket>.csv: date-stamped archive
Each CSV contains one row per object with metadata including filename, size, last modified date, and storage class.
Storage
Storage reports are interactive HTML files generated weekly. They are stored under reports/ in the managed bucket:
- reports/latest/storage/<stack>.html: most recent report
- reports/YYYY-MM-DD/storage/<stack>.html: date-stamped archive
Open the HTML file in a browser to view charts and tables covering:
- Aggregated totals — storage usage across all buckets in the stack
- Per bucket totals — storage usage broken down by individual bucket
- Per bucket / per prefix totals — storage usage by folder within each bucket
These reports are the most human-readable summaries available.
You may download data from any of these folders for local review and storage.
Accessing Reports
Cyberduck
- Connect to your S3 account (see Connecting to S3).
- Navigate to the duracloud-$ID-managed bucket and open the reports/latest/ folder.
- Open the relevant subfolder:
  - checksums/: checksum report CSVs per bucket
  - manifests/: inventory manifest CSVs per bucket
  - storage/: interactive HTML storage report for your stack
- Right-click (or control-click on macOS) the file and select Download, Download As, or Download To to save it locally.
- To view the storage report, open the downloaded .html file in your browser.
Tip
- Downloaded files are saved to your default Downloads folder. You can change this in Edit → Preferences → Transfers under the General tab.
- Right-click to rename files when downloading to avoid overwriting reports from previous dates.
SFTPGo
- Log in to the SFTPGo web interface (see Connecting to S3).
- Navigate to the managed folder, then open reports/latest/.
- Open the relevant subfolder:
  - checksums/: checksum report CSVs per bucket
  - manifests/: inventory manifest CSVs per bucket
  - storage/: interactive HTML storage report for your stack
- To download a single file, click directly on its filename.
- To download multiple files, check the boxes next to them and use the Actions menu → Download. Selected items will be zipped automatically.
- To view the storage report, download the .html file and open it in your browser.
AWS CLI
Download the latest storage report:
aws s3 cp s3://duracloud-$ID-managed/reports/latest/storage/$ID.html .
Download the latest checksum inventory for a bucket:
aws s3 cp s3://duracloud-$ID-managed/reports/latest/checksums/${BUCKET}_checksum-inventory.csv .
Download the latest manifest report for a bucket:
aws s3 cp s3://duracloud-$ID-managed/reports/latest/manifests/$BUCKET.csv .
Sync all reports (the latest set and every dated archive) locally:
aws s3 sync s3://duracloud-$ID-managed/reports/ ./reports/
Downloading Content
Remember that you will not be able to download content from your replicated buckets (buckets ending in -repl). If you need to get content from the replicated buckets, such as because of accidental deletion or corruption, you will need to ask your hosting provider for assistance.
AWS CLI Option
Refer to https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html
Download files from bucket
aws s3 sync s3://{stackname}-bucket/myfolder ./local-folder
Download a single file
aws s3 cp s3://{stackname}-bucket/myfile.txt .
Cyberduck
Refer to https://docs.cyberduck.io/cyberduck/download/ for details. If you right-click or control-click an item or a selected group of items, you will have these download options:
- Download — goes to your general preferences folder or the system Downloads folder if not changed
- Download As — change the type of an individual item
- Download To — change where the item(s) are saved
SFTPGo
In order to download content from SFTPGo, navigate to the folder structure you wish to download from and select the folder(s) or item(s) you wish to download. The application will let you download, move, or copy content from a dropdown menu after you’ve selected content. This option will automatically zip up all selected items. If you just want to download a single item, click directly on its filename. This option will not work if you want to download a folder of content.

Tip
You may be able to view some file types directly in SFTPGo, such as .jpg, .txt, and .pdf files, by clicking on the little eye icon to the right of the filename; the application will use your browser settings.
Making Content Public
There are two ways to make content public.
Pre-created public bucket (recommended)
Each stack includes a pre-created -public bucket that is served through a CloudFront distribution with a friendly domain. This is the recommended way to make content publicly accessible. Your administrator will provide the public domain URL.
Cyberduck
Navigate to the duracloud-$ID-public bucket and upload your files there (see Uploading Files). Files uploaded to this bucket will be publicly accessible via the CloudFront domain.
SFTPGo
Navigate to the public folder and upload your content there (see Uploading Files). Files placed here will be publicly accessible via the CloudFront domain.
AWS CLI
aws s3 cp myfile.jpg s3://duracloud-$ID-public/myfolder/myfile.jpg
Public buckets (not recommended)
You can also make content publicly available by designating a bucket as -public (see How to Create Buckets).
You can construct what a public link will look like based on this pattern:
https://{BUCKET_NAME}.s3.{REGION}.amazonaws.com/{PREFIX}/{FILE}
If you have spaces in any of your folder or filenames, replace those with a + sign when forming a URL. The region information is also optional.
So, for example, an image found in the lyrasis account’s bucket public → test-01 → catpics folder structure would look like:
https://duracloud-lyrasis-public.s3.us-west-2.amazonaws.com/test-01/catpics/callie_and_friend.jpg
OR, without the region information:
https://duracloud-lyrasis-public.s3.amazonaws.com/test-01/catpics/callie_and_friend.jpg
Note this feature is currently available but may be restricted in the future as it goes against AWS guidelines.
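The URL pattern above can be assembled mechanically. The following is a minimal sketch, assuming only the pattern and the space-to-plus rule described in this section; public_url is a hypothetical helper name, and other special characters in keys (which may need percent-encoding) are not handled.

```shell
# Build a public S3 URL following the pattern in this section.
# Usage: public_url BUCKET REGION KEY   (pass "" as REGION to omit it)
public_url() {
  bucket=$1
  region=$2
  key=$(printf '%s' "$3" | tr ' ' '+')   # spaces become "+", as described above
  if [ -n "$region" ]; then
    printf 'https://%s.s3.%s.amazonaws.com/%s\n' "$bucket" "$region" "$key"
  else
    printf 'https://%s.s3.amazonaws.com/%s\n' "$bucket" "$key"
  fi
}

public_url duracloud-lyrasis-public us-west-2 "test-01/catpics/callie_and_friend.jpg"
```

Calling it with an empty region yields the shorter form of the URL shown above.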
Cyberduck sharing options
Cyberduck has some additional ways to share folders and individual objects.
- Navigate to the item you wish to share.
- Right-click on Windows / control+click on a Mac or two-finger click on a touchpad and select “Copy URL” — you can also use the Action (cog) menu and select “Open URL”.
- If you right-click and select “Copy URL,” you will have options for how you wish to copy the URL, including HTTPS or HTTP, an expiration on the link (for individual objects only), or the AWS command link.
- You can now share the item however you wish.
- The HTTPS and HTTP links may be formed slightly differently (with AWS information before the bucket name), but they should still provide public access to objects in your account.
Query audit and inventory data
S3 audit logs and inventory can be synced locally for ad-hoc querying with DuckDB.
Pre-reqs
Sync the files
Download audit and / or inventory data to a local ./data folder. For example:
mkdir -p data/audit/
mkdir -p data/inventory/
aws s3 sync s3://${stack_name}-managed/audit/ data/audit/
aws s3 sync s3://${stack_name}-managed/manifests/ data/inventory/
# also download the query setup sql files
curl -O https://artifacts.preserve.duracloud.org/query/audit.sql
curl -O https://artifacts.preserve.duracloud.org/query/inventory.sql
Query audit data with DuckDB
The log files are in the S3 server access log format: one request per
line, space-delimited, with a bracketed timestamp and quoted request_uri,
referer, and user_agent. DuckDB’s CSV reader can’t handle the mixed
quoting, so audit.sql reads each line as a single string and pulls fields
out with a regex, exposing them as the audit view.
Launch the DuckDB CLI with the view preloaded:
duckdb -init audit.sql
Then query away. For example, every request ordered by time:
SELECT event_time, bucket, remote_ip, operation, key, http_status, bytes_sent
FROM audit
ORDER BY event_time;
Standard object operations by users
The requester field is an IAM ARN. Most traffic is programmatic (for
example SDK sessions named aws-go-sdk-…, service roles doing replication
or batch work etc.) but when a user assumes a role via a named profile, the
session name at the end of the ARN is usually the IAM username. To see
just the standard object-level operations (GET, PUT, DELETE) performed
by assumed-role sessions, with the obvious programmatic sessions filtered out:
SELECT
event_time,
regexp_extract(requester, 'assumed-role/[^/]+/(.+)$', 1) AS who,
bucket,
operation,
key,
http_status
FROM audit
WHERE operation IN ('REST.PUT.OBJECT', 'REST.GET.OBJECT', 'REST.DELETE.OBJECT')
AND requester LIKE '%:assumed-role/%'
AND requester NOT LIKE '%aws-go-sdk-%'
AND requester NOT LIKE '%assume-role-from-profile-%'
ORDER BY event_time;
Service roles (e.g. replication, batch jobs) may still appear in the
results. Inspect the who column and add further NOT LIKE clauses
for any session names that aren’t people of interest.
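When inspecting individual requester values outside DuckDB, the same session-name extraction can be approximated in the shell. This is a small sketch: session_name is a hypothetical helper, and the sed expression simply mirrors the regexp_extract pattern used in the query above.

```shell
# Print the session name at the end of an assumed-role ARN.
# Prints an empty result for requesters that are not assumed-role sessions.
session_name() {
  printf '%s\n' "$1" | sed -n 's|.*assumed-role/[^/][^/]*/\(..*\)$|\1|p'
}

# The account id and role name below are made-up example values.
session_name "arn:aws:sts::123456789012:assumed-role/example-role/alice"
```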
Query inventory data with DuckDB
S3 inventory reports ship as Parquet, which DuckDB reads natively.
inventory.sql globs every parquet file under data/inventory/ and
exposes them as the inventory view. Because each daily snapshot
re-reports objects that haven’t changed, the view uses SELECT DISTINCT
to collapse identical rows so basic queries see one row per unique
observed state.
Launch the DuckDB CLI with the view preloaded:
duckdb -init inventory.sql
List every object across all buckets:
SELECT bucket, key, size, last_modified_date, storage_class
FROM inventory
ORDER BY bucket, key;
Object count and total bytes per bucket:
SELECT bucket, COUNT(*) AS objects, SUM(size) AS total_bytes
FROM inventory
GROUP BY bucket
ORDER BY bucket;
To work with both views in the same session, pass both scripts:
duckdb -init audit.sql -cmd ".read inventory.sql"
External documentation
Documentation provided by third-party clients recommended for providing access to your account’s managed S3 buckets:
- Amazon command line documentation
- Cyberduck documentation, including for downloading and installing their client on Mac or Windows devices
- SFTPGo documentation (mostly intended for institutions setting up their own infrastructure)
Overview
Test environment
Production environment
Setup
This documentation is focused on the technical aspects of the core functionality and how to test locally using the provided cli and remotely after the functions have been deployed.
This documentation does not address user functionality or deployment concerns, for those see:
Pre-reqs
Requirements:
You must have access to an AWS account. Caution: costs may be incurred.
Setup
There are Makefile tasks to wrap cargo (et al.) commands for convenience:
These args are used frequently:
- f=function: function name, e.g. bucket-request
- p=profile: AWS profile name, e.g. default
- s=stack: resource prefix used for identification/partitioning within an AWS account
But note in some contexts a letter may have a different meaning, for example
f=file (check the docs or output of make for details).
To get started run this task to create the base infrastructure:
# choose your own value for s=$stack and p=$profile
make setup s=digipres-dev1 p=default
This task uses Terraform so it must be installed for it to work.
Of most significance for testing, the above example will create:
- digipres-dev1-s3-batch-role (i.e. ${stack}-s3-batch-role)
- digipres-dev1-s3-replication-role (i.e. ${stack}-s3-replication-role)
- digipres-dev1-request (i.e. ${stack}-request)
- digipres-dev1-managed (i.e. ${stack}-managed)
- digipres-dev1-public (i.e. ${stack}-public)
- digipres-dev1-public-repl (i.e. ${stack}-public-repl)
The managed bucket will also be assigned a policy that permits it to be
a target for S3 inventory from buckets using the same stack name (prefix).
The public bucket is “special” as it works differently from regular
user created public buckets owing to a CloudFront distribution that is
created to provide access to the files, rather than using raw S3 urls.
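Because every resource name is derived from the stack prefix, the expected names can be generated for any stack and used as a checklist after running the setup task. A minimal sketch, where stack_resources is a hypothetical helper and the suffix list is taken from the example above:

```shell
# Print the resource names the setup task is described as creating.
stack_resources() {
  stack=$1
  for suffix in s3-batch-role s3-replication-role request managed public public-repl; do
    printf '%s-%s\n' "$stack" "$suffix"
  done
}

stack_resources digipres-dev1

# The bucket names could then be checked one at a time, for example:
#   aws s3api head-bucket --bucket digipres-dev1-managed --profile default
```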
Testing remotely with Lambda
The base infrastructure is sufficient for testing using the provided
cli. However, no AWS Lambda functions will be deployed by the setup
task. If you want to test a full stack deployment including the Lambda
functions then there is a deploy task for that:
make deploy s=digipres-dev1 p=default
This will build the Lambda packages and upload them to an “artifacts” bucket that Lambda can access. Doing this will enable you to try out the remote testing instructions for each function vs. only testing via the cli. Generally speaking the cli covers most of what happens when run through Lambda with these primary differences:
- Local cli testing uses your local AWS credentials
- Deployed Lambdas use permissions provided by IAM roles
- The entrypoints are different: see the cli vs. functions folders
Testing public access via CloudFront
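Get the CloudFront domain name from the Terraform outputs:
terraform output cloudfront_domain_name
This will output something like: d2vy8bpfecxis5.cloudfront.net.
Upload a test file to the public bucket:
make upload b=digipres-dev1-public d=example f=files/buckets.txt p=default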
Then access the file in the browser, it should work:
For production the other Terraform outputs can be used for setting up a custom domain using ACM, see the deployment documentation for more details.
Functions
The core service functionality is encapsulated by Lambda functions that run on a schedule or in response to S3 events:
| Function | Trigger | Description |
|---|---|---|
| bucket-request | S3 event | Creates S3 buckets with prefab configuration from an uploaded text file |
| checksum-report | EventBridge event (batch job status) | Compares checksum results across source and replication buckets to detect corruption |
| compute-checksums | Scheduled | Triggers S3 batch checksum jobs across all bucket pairs to verify data integrity |
| inventory-report | S3 event | Processes S3 inventory data into a human-readable CSV manifest and generates storage stats |
| storage-report | Scheduled | Generates an HTML storage usage report across all buckets in the stack |
| sync-users | S3 event | Syncs IAM users to SFTPGo so they can access their stack buckets over SFTP |
All functions can also be run locally via the dcp CLI, which additionally provides commands for tasks not covered by Lambda. See CLI for details.
bucket-request
- Lambda trigger: S3 event (fires when a user uploads a file to the request bucket)
- Dependencies: None
Overview
This Lambda function creates S3 buckets with prefab configuration based on a list of bucket names provided in a plain text file.
Example buckets.txt
manuscripts
newspapers
rare-books
The workflow is:
- A text file containing bucket names is uploaded to the S3 bucket named ${stack}-request
- The Lambda function is triggered by the upload event
- The file is downloaded and processed — either locally (for development/testing) or inside Lambda (for remote execution)
- Buckets are created according to the prefab configuration if they don’t already exist
CLI Testing
Use make run-bucket-request to process a file locally without uploading to S3:
make run-bucket-request f=files/buckets-list.txt s=digipres-dev1 p=default
- f=: path to a local file containing bucket names
- s=: the stack name (used as a prefix for created buckets)
- p=: the AWS profile to use
You can also create a single bucket by name without a file, using the cargo CLI directly.
Important
Export your AWS profile before using the cargo CLI.
cargo run -p dcp -- bucket-request --stack=digipres-dev1 --name=rare-books
This is useful for one-off bucket creation or quick iteration without maintaining a file.
Remote Testing
Use make upload to upload a file to S3 and trigger the Lambda function as it would run in production:
make upload b=digipres-dev1-request d=buckets f=files/buckets.txt p=default
- b=: the name of the S3 request bucket (typically ${stack}-request)
- d=: the S3 directory (path) to upload into (must be buckets)
- f=: path to the local file containing bucket names
- p=: the AWS profile to use
Output
Given the example file files/buckets.txt, two buckets should be created (assuming they do not already exist):
- digipres-dev1-private: private S3 bucket
- digipres-dev1-private-repl: private S3 bucket used as the replication destination for the above
You can verify the buckets were created using:
make bucket a=list p=default
# Filter results by stack name using grep
make bucket a=list p=default | grep digipres-dev1
QA testing
Aside from the happy path, here are variations to try:
- File too large
- File invalid (rename some other file, e.g. a jpg, to buckets.txt)
- Bucket names are too long or have invalid characters
- Too many bucket names (5 max, additional names are discarded)
- Bucket names are duplicated, or the buckets already exist
- Errors should be uploaded to a file in the managed bucket feedback path
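Several of these variations can be spotted locally before uploading a request file. The sketch below is a rough pre-flight check, not the function's actual validation: check_bucket_names is a hypothetical helper, the character and length rules are a conservative reading of the general S3 bucket naming rules (lowercase letters, digits, and hyphens; 63 characters at most), and the length is checked against the longest derived name, assumed here to be ${stack}-${name}-repl.

```shell
# Rough local pre-flight check for a bucket request file.
# Usage: check_bucket_names STACK FILE
# Returns non-zero if any of the first five names looks invalid.
check_bucket_names() {
  stack=$1
  file=$2
  n=0
  rc=0
  while IFS= read -r name || [ -n "$name" ]; do
    [ -z "$name" ] && continue
    n=$((n + 1))
    if [ "$n" -gt 5 ]; then
      echo "ignored (over limit of 5): $name"
      continue
    fi
    full="$stack-$name-repl"              # assumed longest derived bucket name
    if [ "${#full}" -gt 63 ]; then
      echo "too long: $name"
      rc=1
      continue
    fi
    case $name in
      *[!a-z0-9-]*|-*|*-) echo "invalid characters: $name"; rc=1 ;;
      *) echo "ok: $name" ;;
    esac
  done < "$file"
  return $rc
}

# Example: check_bucket_names digipres-dev1 files/buckets.txt
```

A name that passes this check may still be rejected by the deployed function, so treat the output as a hint rather than a guarantee.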
inventory-report
Type: Lambda function
Trigger: S3 event (manifest.json is created)
Dependencies: None
Overview
This function processes Parquet-formatted S3 inventory data into a single human-readable CSV manifest per bucket. It also generates storage usage statistics used by the storage report:
- Total number of files and total storage used
- The same, broken down by top-level prefix (folder)
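To make the per-prefix statistic concrete, the following sketch computes file counts and total bytes per top-level prefix from a CSV listing. It only illustrates the idea and is not the function's implementation: prefix_totals is a hypothetical helper, the column positions are passed in because the manifest's column order is not specified here, and keys containing commas are not handled.

```shell
# Usage: prefix_totals FILE KEY_COLUMN SIZE_COLUMN
# Prints "prefix file_count total_bytes" for each top-level prefix.
# Objects without a "/" in their key are grouped under "(root)".
prefix_totals() {
  awk -F',' -v k="$2" -v s="$3" 'NR > 1 && NF > 0 {
      n = split($k, parts, "/")
      prefix = (n > 1) ? parts[1] : "(root)"
      files[prefix]++
      bytes[prefix] += $s
    }
    END { for (p in files) print p, files[p], bytes[p] }' "$1" | LC_ALL=C sort
}

# Example, assuming the key is column 1 and the size is column 2:
# prefix_totals digipres-dev1-private.csv 1 2
```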
Note
At least one bucket must exist with files uploaded before this function can run. It has no inventory to process otherwise.
CLI testing
Run locally against the most recently available S3 inventory for a bucket:
make run-inventory-report b=digipres-dev1-private p=default
- b=: Bucket name to process the inventory report for (required)
- p=: AWS profile to use (required)
Remote testing
Staging a remote test requires crafting a specific event payload and uploading matching Parquet files, which adds significant overhead. In practice it is simpler to let the infrastructure run on its normal daily schedule and inspect the logs if the report does not appear.
If the CLI works but the Lambda does not, the most likely cause is an IAM permissions issue.
To stage a full remote test:
- Craft an event payload that references a manifest.json.
- Upload Parquet inventory files to the location referenced in the manifest.json.
- Upload the manifest.json to the path specified in the event payload. This must be within the event notification path (/manifests).
- Ensure the Parquet files contain the correct stack-prefixed bucket name.
Output
When run successfully there should be four generated files:
- metadata/latest/manifests/stats/$bucket.csv
- metadata/YYYY-MM-DD/manifests/stats/$bucket.csv
- reports/latest/manifests/$bucket.csv
- reports/YYYY-MM-DD/manifests/$bucket.csv
To access the latest report you can do:
aws s3 cp \
s3://digipres-dev1-managed/reports/latest/manifests/digipres-dev1-private.csv \
. \
--profile default
QA testing
Confirm:
- All expected files are available.
- The report contains expected items.
- The stats are accurate.
compute-checksums
Type: Lambda function
Trigger: Scheduled EventBridge event
Dependencies: None
Overview
This Lambda function triggers S3 batch checksum jobs to verify data integrity across your buckets. It processes standard/public + replication bucket pairs together, ensuring both the source and replicated data are checksummed.
Invocation methods
Scheduled execution (production)
The Lambda is automatically triggered by a scheduled EventBridge event at regular intervals.
CLI testing
Compute checksums for a single bucket and its replication pair:
make run-compute-checksums b=digipres-dev1-private p=default
Parameters:
- b=: Standard or public stack bucket to checksum (required)
- p=: AWS profile (required)
Constraints:
- Only supports single bucket at a time
- Automatically paired with replication bucket
- Cannot directly specify a replication bucket
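The pairing constraint can be pictured as a simple naming rule. This sketch assumes, based on the bucket names used throughout this documentation, that a replication bucket is the source bucket name with -repl appended; replication_pair is a hypothetical helper and not part of the CLI.

```shell
# Print "SOURCE REPLICA" for a standard or public bucket.
# Refuses names that already look like replication buckets.
replication_pair() {
  case $1 in
    *-repl)
      echo "error: $1 is a replication bucket; pass its source bucket instead" >&2
      return 1
      ;;
    *)
      printf '%s %s\n' "$1" "$1-repl"
      ;;
  esac
}

replication_pair digipres-dev1-private
```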
Remote trigger
Compute checksums for all stack buckets in a given stack:
make trigger f=compute-checksums s=digipres-dev1 p=default
Parameters:
- f=: Function name (compute-checksums)
- s=: Stack name (required)
- p=: AWS profile (required)
Behavior: Triggers jobs for ALL stack buckets in the specified stack.
Output
Function response
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
Receipt files
For each bucket pair processed, a job receipt is uploaded to:
- metadata/latest/checksums/receipts/{source_job_id}.json
- metadata/latest/checksums/receipts/{repl_job_id}.json
- metadata/latest/checksums/receipts/{source_bucket_name}.json
- metadata/{date}/checksums/receipts/{source_bucket_name}.json
Purpose: The receipt is uploaded multiple times for different discovery paths:
- Job IDs — used by the Lambda checksum report process for internal tracking
- Bucket names — used by the CLI checksum report and for easier manual access
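When checking that receipts landed where expected, it can help to generate the four keys for a given run. A minimal sketch follows; receipt_keys is a hypothetical helper, the job ids in the example are placeholders, and the date is assumed to use the same YYYY-MM-DD form as the other dated paths.

```shell
# Usage: receipt_keys SOURCE_JOB_ID REPL_JOB_ID SOURCE_BUCKET DATE
receipt_keys() {
  printf 'metadata/latest/checksums/receipts/%s.json\n' "$1" "$2" "$3"
  printf 'metadata/%s/checksums/receipts/%s.json\n' "$4" "$3"
}

receipt_keys source-job-id repl-job-id digipres-dev1-private 2025-01-31

# Each key could then be checked in the managed bucket, for example:
#   aws s3api head-object --bucket digipres-dev1-managed --key "$key" --profile default
```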
QA testing
Confirm:
- Jobs are created without errors
- Jobs are completed successfully
- All receipt files are generated and available at the expected paths
checksum-request
Trigger: S3 event (.txt file uploaded under the request bucket’s checksums/ prefix)
Dependencies: inventory-report (the manifest CSV must already exist before running this)
Overview
checksum-request turns an inventory manifest CSV into a checksum inventory. For every object listed in the manifest, it issues a HEAD request, records the CRC64NVMe checksum (when present), and assigns a per-object status of ok, not_found, missing_checksum, or error. The result is uploaded as a CSV to the managed bucket under reports/*/checksums/.
The trigger file’s name (minus the extension) identifies which bucket’s inventory to process. For example, uploading checksums/digipres-dev1-private.txt processes the inventory for digipres-dev1-private.
Workflow:
- A .txt file named <bucket>.txt is uploaded to s3://${stack}-request/checksums/
- The Lambda function is triggered by the upload event
- The bucket name is parsed from the trigger filename
- The function checks for a matching inventory manifest at s3://${stack}-managed/reports/latest/manifests/<bucket>.csv
- If found, the inventory is processed and the checksum CSV is uploaded to the managed bucket
- The trigger file is deleted on success; re-upload to re-trigger
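The filename rule in the workflow above can be sketched in a few lines of shell. This is an illustration of the described behaviour rather than the function's code: trigger_bucket is a hypothetical helper, and rejecting names without a .txt extension is an assumption drawn from the trigger description.

```shell
# Derive the bucket name from a trigger file path: the basename without
# its .txt extension. Returns 1 (printing nothing) for any other name.
trigger_bucket() {
  base=$(basename "$1")
  case $base in
    ?*.txt) printf '%s\n' "${base%.txt}" ;;
    *) return 1 ;;
  esac
}

trigger_bucket checksums/digipres-dev1-private.txt
```

Since only the name matters, an empty file created with touch and named after the bucket is enough to act as a trigger.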
CLI testing
Run locally against an existing manifest:
make run-checksum-request p=digipres-dev1-private
| Flag | Description |
|---|---|
| --bucket | Bucket name to process the checksum inventory for (required) |
Important
If no manifest exists for the bucket, the CLI will fail with Inventory report not found. Run inventory-report first.
Remote testing
Upload a trigger file to the request bucket’s checksums/ prefix:
make upload b=digipres-dev1-request d=checksums f=files/digipres-dev1-private.txt p=default
| Flag | Description |
|---|---|
| b= | The S3 request bucket (typically ${stack}-request) |
| d= | The S3 path to upload into (must be checksums) |
| f= | Path to a local trigger file; its basename (without extension) must be the bucket name |
| p= | AWS profile |
Note
The trigger file’s contents are not read — only its name matters.
Output
A successful run writes two files to the managed bucket:
reports/latest/checksums/<bucket>_checksum-inventory.csvreports/YYYY-MM-DD/checksums/<bucket>_checksum-inventory.csv
To download the latest report:
aws s3 cp \
s3://digipres-dev1-managed/reports/latest/checksums/digipres-dev1-private_checksum-inventory.csv \
. \
--profile default
QA testing
In addition to the happy path, test these edge cases:
| Scenario | Expected behaviour |
|---|---|
| Trigger file uploaded with no matching inventory manifest | Fails with Inventory report not found |
| Trigger filename does not parse to a valid bucket (e.g. no extension) | Fails before doing any work |
| Trigger file uploaded outside the checksums/ prefix | Lambda is not invoked |
checksum-report
Trigger: CloudTrail EventBridge event (batch job status: complete or failed)
Dependencies: compute-checksums
Overview
checksum-report processes the output of S3 Batch Operations compute checksum jobs into a single checksum report CSV per bucket, and generates checksum verification stats (e.g. total mismatches).
In production, this function is triggered asynchronously by EventBridge when a batch job reaches complete or failed status. Each bucket pair (source + replication) runs as independent jobs. Report generation requires both jobs to be complete — if the first job finishes before the second, the function exits early and waits for the second event before continuing.
Usage
CLI (local testing)
Important
compute-checksums must have already run and completed for the target bucket pair (source + replication) before running this command.
make run-checksum-report b=digipres-dev1-private p=default
| Flag | Description |
|---|---|
| b= | A standard or public stack bucket to generate a checksum report for |
| p= | AWS profile |
Remote testing
Remote testing starts the same way as compute-checksums:
make trigger f=compute-checksums s=digipres-dev1 p=default
When a compute checksum job completes, it automatically triggers checksum report generation — once per bucket job.
Note
Replication buckets with objects in glacier storage tier can take days to complete. For testing, use buckets that contain only recently created objects that haven’t yet transitioned to glacier storage.
Tracking job status
make job-status-by-receipt b=digipres-dev1-private p=default
A status of "Active" means the job is still running.
Expected output
On success, the CLI prints a verification summary and uploads a report CSV to the managed bucket:
Checksum report complete:
Total objects: 6
Matches: 6
Mismatches: 0
Missing replica: 0
Missing source: 0
Failed source: 0
Failed replication: 0
| Field | Description |
|---|---|
| Total objects | Number of source objects evaluated |
| Matches | Objects where source and replica checksums are identical |
| Mismatches | Objects where checksums differ (indicates a data integrity issue) |
| Missing replica | Objects present in source but absent from replication bucket |
| Missing source | Objects present in replication but absent from source bucket |
| Failed source | Objects where checksum computation failed on the source |
| Failed replication | Objects where checksum computation failed on the replica |
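The first five fields can be illustrated with a small comparison over two listings. The sketch below is not the report's implementation: compare_checksums is a hypothetical helper, and it assumes each input file holds one tab-separated "key, checksum" pair per line, a format chosen only for this example. The two failure counts are omitted because they depend on batch job output that a plain listing does not carry.

```shell
# Usage: compare_checksums SOURCE_LISTING REPLICA_LISTING
compare_checksums() {
  awk -F'\t' '
    FILENAME == ARGV[1] { src[$1] = $2; next }   # first file: source bucket
    { repl[$1] = $2 }                            # second file: replication bucket
    END {
      for (k in src) {
        total++
        if (!(k in repl)) missing_replica++
        else if ((src[k] "") == (repl[k] "")) matches++   # force string comparison
        else mismatches++
      }
      for (k in repl) if (!(k in src)) missing_source++
      printf "Total objects: %d\nMatches: %d\nMismatches: %d\n", total, matches, mismatches
      printf "Missing replica: %d\nMissing source: %d\n", missing_replica, missing_source
    }' "$1" "$2"
}

# Example: compare_checksums source.tsv replica.tsv
```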
A report CSV is also uploaded to the stack’s managed bucket for long-term record keeping.
To verify the checksum report was written to S3:
aws s3 ls s3://digipres-dev1-managed/reports/$(date +%F)/checksums/
QA testing
Confirm:
- Files are uploaded
- Appropriate logging for first bucket event (exit only)
- Appropriate logging for second bucket event (continuation)
storage-report
Type: Lambda function
Trigger: Scheduled EventBridge event (weekly)
Dependencies: inventory-report
Overview
This Lambda function generates a consolidated storage report for a stack, displaying storage usage across all standard and public buckets. The report is output as a single interactive HTML file using Chart.js for visualizations.
Report sections
- Aggregated totals — Storage usage across all buckets in the stack
- Per bucket totals — Storage usage broken down by individual bucket
- Per bucket / per prefix totals — Storage usage by prefix within each bucket
Prerequisites
The storage report requires S3 inventory data to be available. Before running this function:
- S3 inventory must be enabled for the buckets
- At least one inventory report must have been generated and uploaded
- The inventory-report function must have completed successfully
CLI testing
Generate a storage report for a specific stack:
make run-storage-report s=digipres-dev1 p=default
Parameters:
- s=: Stack name (required)
- p=: AWS profile (required)
Remote trigger
make trigger f=storage-report s=digipres-dev1 p=default
Parameters:
- f=: Function name (storage-report)
- s=: Stack name (required)
- p=: AWS profile (required)
Scheduled execution
Automatically triggered weekly by EventBridge.
Output
When successful, four files are generated:
Statistics (JSON format)
- metadata/latest/storage/stats/{stack}.json: Latest version
- metadata/YYYY-MM-DD/storage/stats/{stack}.json: Date-stamped archive
Contains raw storage metrics for programmatic access.
Report (HTML format)
- reports/latest/storage/{stack}.html: Latest version
- reports/YYYY-MM-DD/storage/{stack}.html: Date-stamped archive
Interactive HTML report with Chart.js visualizations for viewing in a browser.
sync-users
- Lambda trigger: S3 event (fires when a TRIGGER file is uploaded to the managed bucket under sync-users/)
- Dependencies: None
Overview
This Lambda function synchronizes IAM users with an SFTPGo server so that each user can access their stack buckets over SFTP using their AWS access keys.
Unlike the other functions, sync-users operates across stacks. A user can belong to one or more stacks (via IAM group membership), and this function discovers those relationships to grant the user access to the appropriate set of buckets.
Important
sync-users only updates existing SFTPGo users; it does not create them. SFTPGo users are provisioned separately via the users terraform module.
The workflow is:
- An empty TRIGGER file is uploaded to s3://${stack}-managed/sync-users/TRIGGER
- The Lambda function is triggered by the upload event
- Eligible IAM users are discovered (those with an Email tag and one or more stack group memberships)
- For each user, their access/secret keys are retrieved from SSM and the matching SFTPGo account is updated with access to the buckets for each stack they belong to
- The TRIGGER file is deleted on success
The SFTPGo connection details (SFTPGO_HOST, SFTPGO_USERNAME, SFTPGO_PASSWORD) are provided via Lambda environment variables set at deploy time.
CLI testing
The CLI can sync a single user or all users. SFTPGo credentials are read from the environment.
SFTPGO_HOST=https://sftpgo.example.org \
SFTPGO_USERNAME=admin \
SFTPGO_PASSWORD=secret \
make run-sync-users p=default
To sync a specific user only:
SFTPGO_HOST=... SFTPGO_USERNAME=... SFTPGO_PASSWORD=... \
cargo run -p dcp -- sync-users --username=alice
Unlike other CLI commands, sync-users does not take a stack argument — it works across all eligible users in the account.
Remote testing
Upload the TRIGGER file to the managed bucket to invoke the Lambda:
make upload b=digipres-dev1-managed d=sync-users f=TRIGGER p=default
- b=: the managed bucket name (${stack}-managed)
- d=: the S3 directory (must be sync-users)
- f=: path to an empty local file named TRIGGER
- p=: the AWS profile to use
Create an empty TRIGGER file first if you don’t have one:
touch TRIGGER
Output
sync-users does not produce files in S3. Successful execution can be verified in the following ways:
- The TRIGGER file is removed from s3://${stack}-managed/sync-users/ after a successful run
- CloudWatch logs show per-user processing output (email, identified buckets)
- The SFTPGo admin UI shows the expected users with the expected bucket virtual folders configured
QA testing
Confirm:
- A user with no Email tag is skipped (not synced)
- A user with no stack group memberships is skipped (no buckets)
- A user with no matching SFTPGo account is skipped (sync-users does not create SFTPGo users)
- A user belonging to multiple stacks has access to buckets from each stack
- The TRIGGER file is deleted after a successful run
- A user’s SFTPGo account reflects changes when their IAM group memberships change
CLI Reference
The dcp command-line tool provides access to core operations for managing buckets, generating reports, and maintaining data integrity. This reference documents all available commands and their usage.
Commands
| Command | Description |
|---|---|
| bucket-reconciliation | Check bucket configuration and report drift |
| bucket-request | Process bucket creation requests |
| checksum | Compute a checksum for a local file |
| checksum-request | Build checksum inventory from S3 inventory data |
| checksum-report | Generate checksum report and statistics |
| compute-checksums | Run S3 batch operations compute checksums |
| inventory-report | Generate inventory report and statistics |
| reset | Reset stack (empty buckets, requires confirmation) |
| storage-report | Generate storage report |
| sync-users | Sync IAM users to SFTPGo |
| transfer | Transfer files from source to stack destination bucket |
Usage
dcp <COMMAND> [OPTIONS]
Global options
- -h, --help: Print help message
Commands
Bucket operations
bucket-reconciliation
Check bucket configuration and report drift.
dcp bucket-reconciliation [OPTIONS]
Detects inconsistencies between local bucket configuration and remote state, useful for identifying configuration drift or missing objects.
bucket-request
Process bucket creation requests.
dcp bucket-request [OPTIONS]
Handle requests to create new buckets within the stack infrastructure.
reset
Reset stack (empty buckets, requires confirmation).
dcp reset [OPTIONS]
Caution
This is a destructive operation. Removes all content from stack buckets. Requires confirmation before proceeding.
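How the confirmation is collected is not described here. As a minimal sketch of one common approach, the guard below asks the operator to type the stack name before anything is emptied. The function name `confirm_reset` and the "type the stack name" convention are assumptions for illustration, not the documented behavior of dcp reset.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical guard (assumption: dcp may confirm differently).
/// Prompts on `output`, reads one line from `input`, and returns Ok(true)
/// only when the typed line matches `stack` exactly (ignoring surrounding whitespace).
pub fn confirm_reset<R: BufRead, W: Write>(
    stack: &str,
    mut input: R,
    mut output: W,
) -> io::Result<bool> {
    write!(output, "Type the stack name ({}) to empty its buckets: ", stack)?;
    output.flush()?;
    let mut line = String::new();
    input.read_line(&mut line)?;
    Ok(line.trim() == stack)
}
```

Taking the reader and writer as parameters keeps the guard testable without a terminal; a caller would pass a locked stdin and stdout.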
transfer
Transfer files from source to stack destination bucket.
dcp transfer [OPTIONS]
Copy data from a source bucket to a destination bucket within the stack. Useful for migrations and data reorganization.
Checksum operations
checksum
Checksum a file.
dcp checksum [OPTIONS] <FILE>
Compute checksum for a local file to verify data integrity.
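The algorithm used by dcp checksum is not stated in this reference. Purely as an illustration of checksumming a local file in chunks, the sketch below computes a CRC32 (one of the algorithms S3 supports) with the standard library only. The function names are hypothetical, and the real command may use a different algorithm and output encoding.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Feeds `bytes` into a running CRC32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32_update(mut crc: u32, bytes: &[u8]) -> u32 {
    for &b in bytes {
        crc ^= u32::from(b);
        for _ in 0..8 {
            crc = if crc & 1 == 1 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    crc
}

/// CRC32 of an in-memory buffer.
pub fn crc32(bytes: &[u8]) -> u32 {
    !crc32_update(0xFFFF_FFFF, bytes)
}

/// CRC32 of a file, read in fixed-size chunks so large files are never fully in memory.
pub fn crc32_file(path: &Path) -> io::Result<u32> {
    let mut file = File::open(path)?;
    let mut buf = [0u8; 8192];
    let mut crc = 0xFFFF_FFFFu32;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        crc = crc32_update(crc, &buf[..n]);
    }
    Ok(!crc)
}
```

The bitwise loop favors clarity over speed; a table-driven or hardware-accelerated implementation would be the usual choice for large files.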
compute-checksums
Run S3 batch operations compute checksums.
dcp compute-checksums [OPTIONS]
Trigger S3 batch checksum jobs for buckets. For detailed usage, see compute-checksums documentation.
checksum-request
Build checksum inventory from S3 inventory data.
dcp checksum-request [OPTIONS]
Process S3 inventory data to create a checksum inventory for analysis and verification.
checksum-report
Generate checksum report and statistics.
dcp checksum-report [OPTIONS]
Create a report of checksum results and statistics across buckets. For detailed usage, see checksum-report documentation.
Reporting operations
inventory-report
Generate inventory report and statistics.
dcp inventory-report [OPTIONS]
Create an inventory report from S3 inventory data showing bucket contents and statistics. For detailed usage, see inventory-report documentation.
storage-report
Generate storage report.
dcp storage-report [OPTIONS]
Generate a comprehensive storage report with visualizations showing storage usage across all buckets in the stack. For detailed usage, see storage-report documentation.
User management
sync-users
Sync IAM users to SFTPGo.
dcp sync-users [OPTIONS]
Synchronize IAM users with SFTPGo for SFTP access management. For detailed usage, see sync-users documentation.
Help
help
Print help message or help for a specific subcommand.
dcp help [COMMAND]
Display general help or help for a specific command.
Common workflows
Local testing with CLI
Most development and testing uses the CLI. See development documentation for local testing patterns.
Testing with deployed Lambda
For testing with deployed Lambda functions, see the documentation for the specific operation.
Makefile helpers
The project provides Makefile tasks that wrap CLI commands with common parameters:
# Example: Run compute-checksums via Makefile
make run-compute-checksums b=digipres-dev1-private p=default
# Example: Trigger Lambda function
make trigger f=storage-report s=digipres-dev1 p=default
# Example: Run CLI command directly
dcp compute-checksums --bucket digipres-dev1-private
For all available Makefile tasks, run make help.
Cleanup
# empties buckets only, resources are not destroyed
make reset s=digipres-dev1 p=default
# teardown: empties buckets and deletes everything
make teardown s=digipres-dev1 p=default
Development
Most new features follow the same progression: CLI command → perform module → Lambda → Terraform. The CLI is the fastest path to a working end-to-end run against real AWS, and the Lambda is a thin entrypoint that delegates to the same perform module once the functionality is proven.
1. Add a CLI command
The CLI lives in cli/src/commands/. Each command is its own module exposing an Args struct and a run function.
- Create cli/src/commands/<new_command>.rs with pub struct Args (clap) and pub async fn run(args: Args) -> Result<(), Box<dyn std::error::Error>>.
- Register the module in cli/src/commands/mod.rs.
- Add a Commands::<NewCommand>(commands::<new_command>::Args) variant and dispatch arm in cli/src/main.rs.
- Build SDK clients directly from awsutils::config::load_defaults() + Clients::new(&sdk_config), or use app::config::load(stack) if the command is stack-scoped.
- Wire clap args/env vars (e.g. #[arg(long, env = "SFTPGO_HOST")]).
Keep the CLI thin — parse args, build config, delegate to a perform function.
2. Implement the perform module
Shared functionality lives in shared/app/src/perform/. This is where the real work happens, and it is reused by both the CLI and the Lambda.
- Create shared/app/src/perform/<feature>.rs.
- Export a PerformArgs struct (public fields) and pub async fn perform(...) -> Result<..., <Feature>Error>.
- Add the module to shared/app/src/perform/mod.rs.
- Add a <Feature>Error variant in shared/app/src/errors.rs.
- If the work is stack-scoped, accept &Config. For account-wide work (e.g. cross-stack user sync), accept &Clients instead.
Write unit tests alongside the module with test_support::TestClientBuilder for mocked SDK responses. Integration tests that hit real AWS go in shared/app/tests/<feature>.rs (gated with #[ignore] and run via make test-integration).
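The layering from steps 1 and 2 can be sketched without any dependencies. Everything below is a simplified assumption: `FeatureError`, the `known_users` parameter, and the synchronous signatures are illustrative only. The real perform functions are async, take &Config or &Clients, and the CLI parses its arguments with clap.

```rust
use std::fmt;

/// Hypothetical arguments for a feature's perform function (public fields).
pub struct PerformArgs {
    pub username: Option<String>,
}

/// Hypothetical feature error; the project keeps its own in shared/app/src/errors.rs.
#[derive(Debug, PartialEq)]
pub enum FeatureError {
    UnknownUser(String),
}

impl fmt::Display for FeatureError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FeatureError::UnknownUser(name) => write!(f, "unknown user: {}", name),
        }
    }
}

impl std::error::Error for FeatureError {}

/// Shared logic, callable from both the CLI and the Lambda.
/// Synchronous here so the sketch needs no async runtime.
/// Returns the number of users processed.
pub fn perform(known_users: &[&str], args: &PerformArgs) -> Result<usize, FeatureError> {
    match &args.username {
        Some(name) if known_users.contains(&name.as_str()) => Ok(1),
        Some(name) => Err(FeatureError::UnknownUser(name.clone())),
        None => Ok(known_users.len()),
    }
}

/// Thin CLI-style entrypoint: build PerformArgs, delegate, report the result.
pub fn run(known_users: &[&str], username: Option<String>) -> Result<(), Box<dyn std::error::Error>> {
    let processed = perform(known_users, &PerformArgs { username })?;
    println!("processed {} user(s)", processed);
    Ok(())
}
```

Because `run` contains no logic of its own, a Lambda handler can build the same `PerformArgs` from an event and call `perform` directly.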
3. Add a Lambda function
Once the CLI and perform module work, wrap them in a Lambda entrypoint.
cd functions
cargo lambda new <feature-name>
Add the new crate to members in the workspace Cargo.toml.
Each Lambda crate has two files:
- src/main.rs: reads env vars (at minimum STACK), loads config, starts the runtime.
- src/event_handler.rs: validates the inbound event (bucket, prefix, filename), short-circuits on config.debug_handler(), builds PerformArgs, calls perform.
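As a rough illustration of the event validation step, the sketch below checks bucket, prefix, and filename using the sync-users trigger (s3://${stack}-managed/sync-users/TRIGGER) as the example. `EventError` and `validate_event` are hypothetical names; the real handlers may structure this differently and read the prefix and filename from shared constants.

```rust
/// Reasons an inbound S3 event can be rejected before any work is done.
#[derive(Debug, PartialEq)]
pub enum EventError {
    WrongBucket,
    WrongPrefix,
    WrongFilename,
}

/// Hypothetical helper: accepts only the TRIGGER object under sync-users/
/// in the stack's managed bucket.
pub fn validate_event(stack: &str, bucket: &str, key: &str) -> Result<(), EventError> {
    let expected_bucket = format!("{}-managed", stack);
    if bucket != expected_bucket.as_str() {
        return Err(EventError::WrongBucket);
    }
    // Split "sync-users/TRIGGER" into prefix and filename; a key without '/' has no prefix.
    let (prefix, filename) = key.rsplit_once('/').unwrap_or(("", key));
    if prefix != "sync-users" {
        return Err(EventError::WrongPrefix);
    }
    if filename != "TRIGGER" {
        return Err(EventError::WrongFilename);
    }
    Ok(())
}
```

Validating in the handler as well as in the S3 notification filter means a misconfigured trigger fails fast instead of running the perform step against the wrong object.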
Provide a sample payload at events/sample.json and test the handler with test_support::TestClientBuilder + debug_handler=true.
From the project root:
# Build all or specified pkg (using -p)
cargo lambda build [-p $pkg]
# Run local
cargo lambda watch -p $pkg
# Invoke local with a sample payload
cargo lambda invoke -p $pkg --data-example s3-event
# Invoke local using a json file as payload
cargo lambda invoke -p $pkg --data-file functions/$pkg/events/sample.json
4. Wire up Terraform
The Lambda needs infrastructure: an IAM policy scoping its permissions, a trigger (S3 event or EventBridge schedule), and an entry in the dev main.tf so the artifact gets uploaded and the function gets deployed.
4a. Shared constants → terraform locals
If the Lambda needs any prefixes, filenames, or other fixed values that terraform also needs to reference, add them to shared/constants/src/lib.rs and regenerate the terraform locals:
make locals
This keeps Rust and Terraform aligned — never hand-edit terraform/modules/stack/_locals.tf.
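The idea behind generating locals from constants can be sketched as follows. This is not the actual generator behind make locals: the constant names, `render_locals`, and the output layout are assumptions chosen to match the local.<feature>_prefix and local.<feature>_file references used in the trigger configuration.

```rust
/// Hypothetical shared constants; the real ones live in shared/constants/src/lib.rs.
pub const SYNC_USERS_PREFIX: &str = "sync-users";
pub const SYNC_USERS_FILE: &str = "TRIGGER";

/// Renders (name, value) pairs as a Terraform `locals` block.
/// Values are emitted as plain quoted strings; no escaping is attempted in this sketch.
pub fn render_locals(constants: &[(&str, &str)]) -> String {
    let mut out = String::from("locals {\n");
    for (name, value) in constants {
        out.push_str(&format!("  {} = \"{}\"\n", name, value));
    }
    out.push_str("}\n");
    out
}
```

Calling `render_locals(&[("sync_users_prefix", SYNC_USERS_PREFIX), ("sync_users_file", SYNC_USERS_FILE)])` yields a block with one line per constant, so a renamed prefix in Rust shows up in Terraform on the next regeneration.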
4b. Function-specific IAM policy
Create terraform/modules/stack/<feature>.tf following the pattern in bucket_request.tf or storage_report.tf:
locals {
  deploy_<feature> = contains(keys(local.functions), "<feature>") ? { "<feature>" = {} } : {}
}

data "aws_iam_policy_document" "<feature>" {
  for_each = local.deploy_<feature>
  statement { ... }
}

resource "aws_iam_role_policy" "<feature>" {
  for_each = local.deploy_<feature>
  role     = aws_iam_role.lambda[each.key].name
  policy   = data.aws_iam_policy_document.<feature>[each.key].json
}
The base Lambda role, log group, and error alarm are created automatically from the functions map in functions.tf and alarms.tf — you do not need to add those.
4c. Trigger
Pick one based on how the function should fire:
S3 event trigger: add an aws_lambda_permission resource scoped to the source bucket ARN in your <feature>.tf, then add an entry to the appropriate bucket in notifications.tf:
for k, _ in local.deploy_<feature> : {
  id            = "<feature>-trigger"
  lambda_arn    = aws_lambda_function.main[k].arn
  events        = ["s3:ObjectCreated:*"]
  filter_prefix = "${local.<feature>_prefix}/"
  filter_suffix = local.<feature>_file
}
Add aws_lambda_permission.<feature> to the depends_on list.
Scheduled trigger — add local.deploy_<feature> into local.scheduled_functions in scheduler.tf. The schedule itself is configured via the schedule and tz fields on the functions map entry (defaults in variables.tf).
4d. Register in the dev main.tf
Add the function to local.functions in the project-root main.tf so it gets built, uploaded to the artifacts bucket, and deployed:
<feature> = {
  bucket = local.functions_bucket
  file   = "target/lambda/<feature>/bootstrap.zip"
  env    = { SOME_VAR = local.some_value } # optional
}
4e. Apply
make deploy s=<stack> p=<profile>
Testing the new function
- CLI (local, against real AWS): cargo run -p dcp -- <subcommand> [args]
- Lambda (local watch + invoke): cargo lambda watch -p <feature> in one shell, then cargo lambda invoke -p <feature> --data-file functions/<feature>/events/sample.json (or --data-example s3-event for a built-in fixture).
- Lambda (invoked remotely with sample payload): make trigger f=<feature> s=<stack> p=<profile>
- Unit tests: cargo test -p <crate>
- Integration tests: make test-integration s=<stack> p=<profile>
Each feature should also get a technical doc at docs/src/technical/<feature>.md following the format of the others in that directory.
Roadmap
TODO
Lyrasis hosting and support
DuraCloud Preserve is open source and freely available for anyone to deploy into their own AWS account. However, Lyrasis provides a hosted option for individuals or institutions wanting a managed service.
Benefits
- Lyrasis manages an AWS account for you, which can be transferred to your ownership at any time with 30 days' notice of cancelation of your hosting contract.
- Setup, configuration, and monitoring are fully handled by Lyrasis.
- You receive S3 access credentials to interact with DuraCloud Preserve using any S3 client.
- Credentials can provide “full”, “limited”, or “restricted” access per user (refer to user docs for details).
- Technical support is provided by experienced hosting staff.
- We provide access to a web application (SFTPGo) for file uploads.
For pricing information and other details ….