Participating as a Data Preparer

Slingshot v3 aims to engage individuals/teams of Data Preparers (DPs) who own the process of downloading datasets, processing them, and sharing the generated CAR files with eligible SPs for deal making. If you participated in past versions of Slingshot and want to learn more about this iteration, start with the Program Details.

In order to participate as a DP:

  1. Register (top right) on this website
  2. Finish the sign-up flow by updating the required Account Details

Selecting datasets to prepare

Once you're signed up, you should be able to browse available datasets on the website. Pick up to 3 datasets that you are interested in preparing for Slingshot SPs to store. The metadata shared for each dataset includes the region in which it is currently hosted, its size, how many files it contains, and how you can obtain it (e.g., an AWS S3 bucket). Use this information to choose datasets that you can obtain and process most efficiently.

Note that once you claim a particular dataset, no other DP can claim it. If you release your claimed dataset, you will not be able to claim it again unless an admin overrides this. You can request an admin to override this by creating an issue.

Each dataset is associated with a slug name, e.g., coco-common-objects-in-context-fastai-datasets. You can check the slug name of your claimed datasets under My Claimed Datasets. You'll see how this is used below during dataset preparation.

Preparation process

As you go through the preparation process for each dataset, please remember to update its Preparation Progress on the website. The Slingshot team will be keeping track of progress across datasets and may remove you from a claimed dataset if there is no update on it for > 4 weeks. The preparation states tracked on the website are:

  • Not started (default)
  • Downloading dataset
  • CAR generation in progress
  • CAR generation complete

Preparation tool

The tool used for dataset preparation is singularity. It splits the dataset (or files within it) properly, constructs the IPLD structure for the CAR files, and generates the corresponding manifest file, which needs to be uploaded later.

Daemon or Standalone

There are two ways to use the tool:

  • The Daemon version provides a better management experience: you can manage all of your dataset preparation requests, including pausing, resuming, and retrying them. This is the recommended option.
  • The Standalone version is easier to set up and is more suitable for preparing smaller datasets (see the standalone sketch after the Quick Start).

Quick Start

# Install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
source ~/.bashrc
nvm install 16
# Install Singularity
npm i -g @techgreedy/singularity
# Use daemon version
singularity init
singularity daemon
# [Option A] Start preparation for an already-downloaded dataset
singularity prep create <dataset_name> <parent_of_dataset_folder> <out_dir>
# [Option B] Start preparation while downloading at the same time
singularity prep create -t <tmp_dir> <dataset_name> s3://<bucket_name>/<optional_prefix>/ <out_dir>
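
The Quick Start above drives the daemon version. For the standalone route mentioned earlier, a minimal sketch follows; note that the singularity-prepare command name and argument order here are assumptions based on the same @techgreedy/singularity package, so confirm them with singularity-prepare --help before relying on this.

# [Standalone] Prepare a dataset without running the daemon
# NOTE: command name, argument order, and whether the path should be the dataset
# folder or its parent are assumptions; confirm with: singularity-prepare --help
singularity-prepare -j 4 <dataset_name> <dataset_path> <out_dir>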

Hardware requirements

Each job will take 1-2 CPU cores and 50-100 MB/s of disk I/O; RAM usage is negligible. To control parallelism:

  • For the standalone version, use the -j flag
  • For the daemon version, change deal_preparation_worker.num_workers and s3.per_job_concurrency in the config file (~/.singularity/default.toml); see the example below
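
For example, you can locate the relevant settings with a quick grep; the key names come from the bullet above, while the note about restarting the daemon is an assumption:

# Find the parallelism settings in the daemon config (key names from the docs above)
grep -nE 'num_workers|per_job_concurrency' ~/.singularity/default.toml
# Edit the values, then restart the daemon so the new settings take effect (assumption)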

Preparation Requirements

To ensure the prepared data is consumable by future applications such as compute over data, data preparers need to follow the practices below to keep the data model consistent:

  1. Always target 32GiB sector size, which is the default
  2. Avoid slicing the dataset manually to multiple subfolders
  3. Cover all files of the dataset before claiming completion. We may check your manifest file against our scanned records, and we may also check the CIDs or perform a retrieval to validate integrity
  4. All failed CAR generations need to be retried until they are complete
  5. If you plan to download the dataset first, make sure you check the completeness of the download by running aws s3 sync --no-sign-request --delete s3://<bucket> <destination>. The downloaded dataset needs to be put in a folder with the same name as the S3 bucket, and the preparation request needs to use the parent folder. This allows us to recognize the S3 bucket name for each file during validation. For example, if you downloaded s3://bucket_name to /mnt/downloads/bucket_name, you should prepare the data using singularity prep create <dataset_name> /mnt/downloads <out_dir> (see the combined example below)
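
Putting item 5 together, a minimal end-to-end sketch of the download-first flow is shown below; bucket_name and the local paths are just the placeholders from the example above:

# Download the bucket into a folder named after the bucket
aws s3 sync --no-sign-request s3://bucket_name /mnt/downloads/bucket_name
# Re-run sync with --delete to verify the download is complete and consistent
aws s3 sync --no-sign-request --delete s3://bucket_name /mnt/downloads/bucket_name
# Prepare from the parent folder so the bucket name is preserved for validation
singularity prep create <dataset_name> /mnt/downloads <out_dir>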

Common issues, questions and bug reports

Check the singularity repo for known common issues.

If you encounter any other issues using the tools or need help troubleshooting, feel free to ask in the #large-clients-tooling Filecoin Slack channel or report a bug.

Uploading CAR file metadata

After a dataset has been prepared, the manifest files need to be uploaded to Web3.Storage.

  • For datasets prepared with the daemon version, use upload-manifest-daemon.sh to upload the manifest
    export WEB3_STORAGE_TOKEN=eyJ...
    ./upload-manifest-daemon.sh slug-name slug-name
    
  • For datasets prepared with the standalone version, use upload-manifest-standalone.sh to upload the manifest
    export WEB3_STORAGE_TOKEN=eyJ...
    ./upload-manifest-standalone.sh out-dir slug-name
    

Uploading CAR <> HTTP sources

How to upload

  1. Log in to your account and navigate to the "My Claimed Datasets" page (/account/<your_github_username>/datasets/claimed)
  2. Toggle open the dataset to which you want to upload your manifest file(s)
  3. Click on the "CAR HTTP Manifests" tab
  4. Here you will see a "Select a file" button. Clicking on it will allow you to upload your file.
  5. Once uploaded, please allow up to 15 minutes for the file to be processed and then refresh the page. You will see the status change from processing: pending to processing: complete or processing: error.

Requirements

  • The file must be of type JSON
  • JSON itself must be valid
  • JSON must be an associative array ({} not [])
  • Associative array must not be empty (must have at least 1 source)
  • Each piece_cid key must be a valid CID
  • Each url value must be an array of strings
  • Each url string must be a valid URL

Sample model

{
  "baga6ea4se...": [
    "https://..."
  ],
  "zaga6ea4se...": [
    "https://...part1",
    "https://...part2",
    "https://...part3",
    "https://...part4"
  ]
}
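
Before uploading, you can optionally sanity-check the file shape locally with jq; this is only an illustrative check against the requirements above (it does not validate piece CIDs, and sources.json is a placeholder filename):

# Exits non-zero unless the file is a non-empty JSON object whose values are arrays
# of strings starting with "http" (a rough URL check; keys are NOT validated as CIDs)
jq -e 'type == "object" and length > 0 and
       (to_entries | all(.value | type == "array" and all(type == "string" and startswith("http"))))' sources.json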

Getting deals on-chain

In order for your Slingshot data preparation to be eligible for rewards, corresponding deals must be made by the Slingshot Deal Engine with participating SPs. SPs who are able to obtain your CAR files can initiate dealmaking directly with the Slingshot Deal Engine. Your prepared CAR files can be obtained by SPs in two ways:

  • you can distribute your CAR files to SPs
  • SPs can retrieve pieces from other SPs (after initial replicas have been stored)

Distributing CAR files

For at least the first few copies, DPs are recommended to:

  • host CAR files somewhere where SPs can download them
  • send SPs CAR files directly over-the-wire
  • send SPs CAR files directly offline, e.g., via shipping drives

DPs choosing to host CAR files for SPs to download them have several options. One simple path here is to host a basic HTTP server:

sudo apt install nginx

Modify /etc/nginx/sites-available/default and add the lines below:

server {
  ...
  location / {
    root /home/user/car_dir;
  }
  ...
}

This will allow storage providers to download files at http://<your_site_ip>/<piece_cid>.car.
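
After editing the config, reload nginx and confirm that a CAR file is reachable; the piece CID below is a placeholder for any file in /home/user/car_dir:

sudo nginx -t                               # check the edited config for syntax errors
sudo systemctl reload nginx                 # apply the change
curl -I http://localhost/<piece_cid>.car    # expect an HTTP 200 response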

To improve download speed for your storage providers, we recommend signing up with Cloudflare, which protects your service and improves throughput and latency. We also recommend that your storage providers use multithreaded download software such as aria2 or axel (see the example below).
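
As a sketch of the download side, fetching one hosted CAR file with aria2 over multiple connections could look like this; the host and piece CID are placeholders:

# Download a CAR file over 16 parallel connections
aria2c -x 16 -s 16 "http://<your_site_ip>/<piece_cid>.car"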

Finding SPs to work with

You are free to partner with specific SPs to onboard your prepared data on-chain. This may help prioritize your CAR files and ensure you can build the maximum number of replicas in the allotted time. Options for identifying SPs to work with include:

  • advertise available CAR files to SPs in #slingshot or #fil-deal-market
  • use a data marketplace like https://www.bigd.exchange/ where you, as a DP, can receive further incentives from bidding SPs

Sharing your contact information with Storage Providers

You have the option to make a Slack handle and/or an email address public so that a prospective SP can reach out to you about obtaining your CAR files. Keep in mind that a Slack handle is not unique; it is recommended that you provide your Slack Member ID.

How to publicize your contact information

  1. Log in to your account and navigate to the "My Account" page (/account/<your_github_username>)
  2. Near the bottom of the form, fill out the "Share contact information" section
  3. Slack handle and email are both optional. You can include either one or both.
  4. Make sure to check the checkbox labelled "I understand that the public slack handle and public email entries will become visible on the table on the Home Page of this website". If you do not check this checkbox, your contact information will not be publicized. If at any point you wish to remove your contact information from the site, you can simply uncheck this checkbox.
  5. Your contact information will now be visible in the dataset list table on the Home Page, in the same row(s) as any dataset(s) you've claimed
  6. Note that your contact information will not be visible while a dataset's preparation status is still Not Started

How to find your Slack Member ID

  1. Open your Slack Profile
  2. Click the "3 dots" to open the secondary menu
  3. Click "Copy member ID"

Participating as a Storage Provider (SP)

Storage Providers interested in serving deals for prepared Slingshot data should sign up to participate here.

Details on SP participation requirements can be found in the Program Details.

Storage Requirements

  • Participating SPs commit to serving fast retrievals for this data throughout the duration of the deal, for 0 FIL. SPs with retrieval success rates below 95% may be temporarily suspended from participating in the program
  • Suspended SPs will be re-activated through successfully serving ongoing retrieval checks

How it works

The process for participating in deal making from the Slingshot Deal Engine is the same as that for the Evergreen program. The engine is the same, but it hosts Slingshot as a separate tenant, so you still need to register and be approved to receive deals from Slingshot.

  1. Ensure your minerID gets on the list of eligible SPs by going through the application process.
  2. Use the authenticator tool to validate that your requests are coming from the right SP ID. You will need access to your current SP worker key (the same one you use for ProveCommits). This is required in order to use the API.
    • Download the authenticator using curl -OL https://raw.githubusercontent.com/filecoin-project/evergreen-dealer/master/misc/fil-spid.bash
    • Run chmod 755 fil-spid.bash
    • Run curl -sLH "Authorization: $( ./fil-spid.bash f0XXXX )" https://api.evergreen.filecoin.io/pending_proposals
  3. Use the deal engine to examine the list of CIDs you can store
    • Get a list of all pieces eligible for storage using curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/eligible_pieces/anywhere
  4. Get the piece(s) in order to be able to store them. You can do this in two ways:
    • Coordinating directly with the DP that created the CAR files. DPs can host the files for you to obtain or find an alternative way of transferring them to you. You can coordinate with DPs directly or in the #slingshot channel in Filecoin Slack.
    • Retrieve it from SPs currently hosting the data. For each piece CID, there may be one or multiple SPs that currently have it in a deal; make sure to search through the table/API results before attempting a retrieval (ideally from an SP geographically close to you).
  5. Once you are ready to request deal proposals for the pieces you would like to store (see the example after this list),
    • For each deal, curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/request_piece/bagaChosenPieceCid
    • Note that from the moment you invoke this method, your SP system will receive a deal proposal within ~5 minutes, with a deal start time about 3 days (~72 hours) in the future.
    • These will be verified deals with DataCap
    • Deals will be made for maximum practical duration (~532 days)
  6. You can view the set of outstanding deals against your SP at any time by invoking curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/pending_proposals
    • Note that in order to prevent abuse you can have at most 10TiB (320 x 32GiB sectors) outstanding against your SP at any time.
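
If you have several pieces to request, the per-piece endpoint from step 5 can be called in a simple loop; the piece CIDs below are placeholders:

# Request deal proposals for a list of chosen piece CIDs, one call per piece
for cid in bagaChosenPieceCid1 bagaChosenPieceCid2; do
  curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" "https://api.evergreen.filecoin.io/request_piece/${cid}"
done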