Slingshot v3 aims to engage individuals/teams of Data Preparers (DPs) that own the process of downloading datasets, processing them, and sharing generated CAR files with eligible SPs for deal making. If you participated in past versions of Slingshot and want to learn more about this iteration, start with the Program Details.
In order to participate as a DP:
Once you're signed up, you should be able to browse available datasets in the website. Pick up to 3 datasets that you are interested in preparing for Slingshot SPs to store. Dataset metadata shared should include the region in which the dataset is currently hosted, its size, how many files it has, and how you can obtain it (i.e., AWS bucket). You should use these to chose datasets that you can obtain and process most efficiently.
Note that once you claim a particular dataset, no other DP can claim it. If you release your claimed dataset, you will not be able to claim it again unless an admin overrides this. You can request an admin to override this by creating an issue.
Each dataset is associated with a slug-name
, i.e. coco-common-objects-in-context-fastai-datasets
. You can check the slug name of your claimed datasets under My Claimed Datasets. You'll see how this is used below during dataset preparation.
As you go through the preparation process for each dataset, please remember to update its Preparation Progress in the website. The Slingshot team will be keeping track of progress across datasets and may remove you from a claimed dataset if there is no update on it for > 4 weeks. The preparation states tracked in the website are:
The tool used for dataset preparation is singularity. It will help properly split the dataset (or files within it), construct the IPLD structure for the CAR file and generate the corresponding manifest file which needs to be uploaded later.
There are two ways to use the tool
# Install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
source ~/.bashrc
nvm install 16
# Install Singularity
npm i -g @techgreedy/singularity
# Use daemon version
singularity init
singularity daemon
# [Option A] Start preparation for already downloaded dataset
singularity prep create <dataset_name> <parent_of_dataset_folder> <out_dir>
# [Option B] Start preparation while download at the same time
singularity prep create -t <tmp_dir> <dataset_name> s3://<bucket_name>/<optional_prefix>/ <out_dir>
Each job will take 1-2 CPU cores, 50-100MB/s disk I/O. RAM usage is negligible. To control parallelism:
-j
flagdeal_preparation_worker.num_workers
and s3.per_job_concurrency
in the config file (~/.singularity/default.toml)To ensure the data prepared is consumable for future applications such as compute over data, the data preparer needs to follow below practices to ensure the data model consistency
aws s3 sync --no-sign-request --delete s3://<bucket> <destination>
. The downloaded dataset needs to be put in a folder with the same name of the S3 bucket and the preparation request need to use the parent folder. This will allow us to recognize the S3 bucket name for each file during validation. For example, if you downloaded s3://bucket_name
to /mnt/downloads/bucket_name
, you should prepare the data using singularity prep create <dataset_name> /mnt/downloads <out_dir>
Check singularity repo for known common issues.
If you encounter any other issue using the tools or need help to troubleshoot, feel free to ask them in the #large-clients-tooling Filecoin slack channel or report a bug
After dataset has been prepared, the manifest files need to be uploaded to Web3.Storage.
export WEB3_STORAGE_TOKEN=eyJ...
./upload-manifest-daemon.sh slug-name slug-name
export WEB3_STORAGE_TOKEN=eyJ...
./upload-manifest-standalone.sh out-dir slug-name
How to upload
/account/<your_github_username>/datasets/claimed
)processing: pending
to processing: complete
or processing: error
.Requirements
piece_cid
key must be a valid CIDurl
value must be an array of stringsurl
strings must be a valid URLSample model
{
"baga6ea4se...": [
"https://..."
],
"zaga6ea4se...": [
"https://...part1",
"https://...part2",
"https://...part3",
"https://...part4"
]
}
In order for your Slingshot data preparation to be eligible for rewards, corresponding deals must be made by the Slingshot Deal Engine with participating SPs. SPs who are able to obtain your CAR files can initiate dealmaking directly with the Slingshot Deal Engine. Your prepared CAR files can be obtained by SPs in two ways:
For at least the first few copies, DPs are recommended to:
DPs choosing to host CAR files for SPs to download them have several options. One simple path here is to host a basic HTTP server:
sudo apt install nginx
Modify /etc/nginx/sites-available/default and add below lines
server {
...
location / {
root /home/user/car_dir;
}
...
}
This will allow storage providers to download files with URL
http://<your_site-ip>/<piece_cid>.car
To improve download speed for your storage providers, we recommend you to sign up with CloudFlare which protects your service as well as improves the throughput and latency. We also recommend your storage provider to use a multithreaded download software such as aria2 or axel.
You are free to partner with specific SPs to onboard your prepared data on-chain. This may help in prioritization of your CAR files to ensure you can build the maximum replicas in the alloted time. Options for identifying SPs to work with include:
You have the option to make a Slack handle and/or an email address public so that a prospective SP can reach out to you about obtaining your CAR files. Keep in mind, a Slack handle is not unique. It is recommended that you provide your Slack Member ID.
How to publicize your contact information
/account/<your_github_username>
)Not Started
How to find your Slack Member ID
Storage Providers interested in serving deals for prepared Slingshot data should sign to participate here.
Details on SP participation requirements can be found in the Program Details.
The process for participating in deal making from the Slingshot Deal Engine is the same as that for the Evergreen program. The engine is the same, but is hosting Slingshot as a separate tenant, so you still need to register and be approved for deals from Slingshot.
curl -OL https://raw.githubusercontent.com/filecoin-project/evergreen-dealer/master/misc/fil-spid.bash
chmod 755 fil-spid.bash
curl -sLH "Authorization: $( ./fil-spid.bash f0XXXX )" https://api.evergreen.filecoin.io/pending_proposals
curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/eligible_pieces/anywhere
curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/request_piece/bagaChosenPieceCid
curl -sLH "Authorization: $( ./fil-spid.bash f0xxxx )" https://api.evergreen.filecoin.io/pending_proposals