Scratch Data Management
Understanding Lustre I/O
When a client (a compute node from your job) needs to create or access a file, the client queries the metadata server (MDS) and the metadata target (MDT) for the layout and location of the file's stripes. Once the file is opened and the client obtains the striping information, the MDS is no longer involved in the file I/O process. The client interacts directly with the object storage servers (OSSes) and OSTs to perform I/O operations such as locking, disk allocation, storage, and retrieval.
If multiple clients try to read and write the same part of a file at the same time, the Lustre distributed lock manager enforces coherency, so that all clients see consistent results.
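As an illustration, you can see which OSTs actually hold the objects of a given file, i.e. the targets your compute node will contact directly for I/O, with `lfs getstripe`. The file name below is purely illustrative and the exact fields and object IDs will differ on your system and Lustre version:

$ lfs getstripe $SCRATCH/results/output.dat
/scratch/users/<login>/results/output.dat
lmm_stripe_count:  2
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_stripe_offset: 5
        obdidx           objid          objid          group
             5        30578084      0x1d295a4              0
             7        30578319      0x1d2968f              0

Here the two stripes of the file live on OST 5 and OST 7; all reads and writes to this file go straight to those two OSTs once the layout has been obtained from the MDS.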
Discover MDTs and OSTs
ULHPC's Lustre file systems look and act like a single logical storage system, but a large file on Lustre can be divided into multiple chunks (stripes) and stored across several OSTs. This technique is called file striping. The stripes are distributed among the OSTs in a round-robin fashion to ensure load balancing. It is thus important to know the number of OSTs available on the system you are using.
As mentioned in the Lustre implementation section, the ULHPC Lustre infrastructure is composed of 2 MDS servers (2 MDTs), 2 OSS servers and 16 OSTs.
You can list the MDTs and OSTs with the `lfs df` command:
$ cds # OR: cd $SCRATCH
$ lfs df -h
UUID                       bytes    Used  Available  Use%  Mounted on
lscratch-MDT0000_UUID       3.2T   15.4G       3.1T    1%  /mnt/lscratch[MDT:0]
lscratch-MDT0001_UUID       3.2T    3.8G       3.2T    1%  /mnt/lscratch[MDT:1]
lscratch-OST0000_UUID      57.4T   16.7T      40.2T   30%  /mnt/lscratch[OST:0]
lscratch-OST0001_UUID      57.4T   18.8T      38.0T   34%  /mnt/lscratch[OST:1]
lscratch-OST0002_UUID      57.4T   17.6T      39.3T   31%  /mnt/lscratch[OST:2]
lscratch-OST0003_UUID      57.4T   16.6T      40.3T   30%  /mnt/lscratch[OST:3]
lscratch-OST0004_UUID      57.4T   16.5T      40.3T   30%  /mnt/lscratch[OST:4]
lscratch-OST0005_UUID      57.4T   16.5T      40.3T   30%  /mnt/lscratch[OST:5]
lscratch-OST0006_UUID      57.4T   16.3T      40.6T   29%  /mnt/lscratch[OST:6]
lscratch-OST0007_UUID      57.4T   17.0T      39.9T   30%  /mnt/lscratch[OST:7]
lscratch-OST0008_UUID      57.4T   16.8T      40.0T   30%  /mnt/lscratch[OST:8]
lscratch-OST0009_UUID      57.4T   13.2T      43.6T   24%  /mnt/lscratch[OST:9]
lscratch-OST000a_UUID      57.4T   13.2T      43.7T   24%  /mnt/lscratch[OST:10]
lscratch-OST000b_UUID      57.4T   13.3T      43.6T   24%  /mnt/lscratch[OST:11]
lscratch-OST000c_UUID      57.4T   14.0T      42.8T   25%  /mnt/lscratch[OST:12]
lscratch-OST000d_UUID      57.4T   13.9T      43.0T   25%  /mnt/lscratch[OST:13]
lscratch-OST000e_UUID      57.4T   14.4T      42.5T   26%  /mnt/lscratch[OST:14]
lscratch-OST000f_UUID      57.4T   12.9T      43.9T   23%  /mnt/lscratch[OST:15]
filesystem_summary:       919.0T  247.8T     662.0T   28%  /mnt/lscratch
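If you need the number of available OSTs in a script, for example to compute a sensible stripe count, one simple way is to count the OST lines in the `lfs df` output. This is just a grep sketch; it returns 16 for the listing above:

$ lfs df $SCRATCH | grep -c "\[OST:"
16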
File striping
File striping increases the throughput of operations by taking advantage of several OSSes and OSTs, allowing one or more clients to read/write different parts of the same file in parallel. On the other hand, striping small files can decrease performance.
File striping also allows file sizes larger than a single OST; large files MUST be striped over several OSTs in order to avoid filling a single OST and harming the performance for all users. There is a default stripe configuration for ULHPC Lustre filesystems (see below). However, users can set the following stripe parameters on their own directories or files to get optimal I/O performance. You can tune file striping using three properties:
| Property | Effect | Default | Accepted values | Advised values |
|---|---|---|---|---|
| `stripe_size` | Size of the file stripes in bytes | 1048576 (1m) | > 0 | > 0 |
| `stripe_count` | Number of OSTs to stripe across | 1 | -1 (use all the OSTs), 1-16 | -1 |
| `stripe_offset` | Index of the OST where the first stripe of a file will be written | -1 (automatic) | -1, 0-15 | -1 |
Note: Regarding `stripe_offset` (the index of the OST where the first stripe is to be placed): the default is -1, which lets Lustre choose the starting OST automatically, and using a non-default value is NOT recommended.
Note
Setting stripe size and stripe count correctly for your needs may significantly affect the I/O performance.
- Use the `lfs getstripe` command for getting the stripe parameters.
- Use `lfs setstripe` for setting the stripe parameters to get optimal I/O performance. The correct stripe setting depends on your needs and file access patterns.
- Newly created files and directories will inherit these parameters from their parent directory. However, the parameters cannot be changed on an existing file.
$ lfs getstripe dir|filename
$ lfs setstripe -S <stripe_size> -c <stripe_count> -i <stripe_offset> dir|filename
usage: lfs setstripe -d <directory> (to delete default striping from an existing directory)
usage: lfs setstripe [--stripe-count|-c <stripe_count>]
[--stripe-index|-i <start_ost_idx>]
[--stripe-size|-S <stripe_size>] <directory|filename>
Example:
$ lfs getstripe $SCRATCH
/scratch/users/<login>/
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
[...]
$ lfs setstripe -c -1 $SCRATCH
$ lfs getstripe $SCRATCH
/scratch/users/<login>/
stripe_count: -1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
In this example, we view the current stripe settings of the `$SCRATCH` directory, change the stripe count to use all OSTs, and verify the result. All files written to this directory will be striped over the maximum number of OSTs (16).
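Since newly created files inherit the striping of their parent directory (as noted above), you can check that the new layout is effectively applied to a file created afterwards. The file name below is purely illustrative, and the exact output may vary with the Lustre version:

$ touch $SCRATCH/newfile
$ lfs getstripe -c $SCRATCH/newfile
16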
Use `lfs check osts` to see the number and status of active OSTs for each filesystem on the cluster. Learn more by reading the man page:
$ lfs check osts
$ man lfs
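If `lfs check osts` is not permitted for regular users on your system (it may require administrator privileges), `lfs osts` gives a similar overview of the OSTs and their status. The output below is only indicative:

$ lfs osts $SCRATCH
0: lscratch-OST0000_UUID ACTIVE
1: lscratch-OST0001_UUID ACTIVE
[...]
15: lscratch-OST000f_UUID ACTIVE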
File striping Examples
- Set the striping parameters for a directory containing only small files (< 20MB)
$ cd $SCRATCH
$ mkdir test_small_files
$ lfs getstripe test_small_files
test_small_files
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1 pool:
$ lfs setstripe --stripe-size 1M --stripe-count 1 test_small_files
$ lfs getstripe test_small_files
test_small_files
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
- Set the striping parameters for a directory containing only large files between 100MB and 1GB
$ mkdir test_large_files
$ lfs setstripe --stripe-size 2M --stripe-count 2 test_large_files
$ lfs getstripe test_large_files
test_large_files
stripe_count: 2 stripe_size: 2097152 stripe_offset: -1
- Set the striping parameters for a directory containing files larger than 1GB
$ mkdir test_larger_files
$ lfs setstripe --stripe-size 4M --stripe-count 6 test_larger_files
$ lfs getstripe test_larger_files
test_larger_files
stripe_count: 6 stripe_size: 4194304 stripe_offset: -1
Big Data files management on Lustre
Using a large stripe size can improve performance when accessing very large files
Large stripe size allows each client to have exclusive access to its own part of a file. However, it can be counterproductive in some cases if it does not match your I/O pattern. The choice of stripe size has no effect on a single-stripe file.
Note that these are simple examples; the optimal settings differ depending on the application (concurrent threads accessing the same file, size of each write operation, etc.).
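Because the stripe parameters of an existing file cannot be changed in place (see above), restriping a file means rewriting it. A portable sketch is to create a new, empty file with the desired layout and copy the data into it; on recent Lustre releases, `lfs migrate` can perform the rewrite in place. File names and stripe values below are illustrative only:

$ # portable approach: create an empty file with the new layout, then copy
$ lfs setstripe --stripe-size 4M --stripe-count 6 big_input.dat.restriped
$ cp big_input.dat big_input.dat.restriped && mv big_input.dat.restriped big_input.dat
$ # on recent Lustre releases, the same can be done in place
$ lfs migrate --stripe-size 4M --stripe-count 6 big_input.dat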
Lustre Best practices
Parallel I/O on the same file
Increase the `stripe_count` for parallel I/O to the same file.
When multiple processes are writing blocks of data to the same file in parallel, the I/O performance for large files will improve when the `stripe_count` is set to a larger value. The stripe count sets the number of OSTs to which the file will be written. By default, the stripe count is set to 1. While this default setting provides efficient access to metadata (for example, to support the `ls -l` command), large files should use stripe counts greater than 1. This will increase the aggregate I/O bandwidth by using multiple OSTs in parallel instead of just one. A rule of thumb is to use a stripe count approximately equal to the number of gigabytes in the file.
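As a concrete sketch of this rule of thumb, a directory meant to receive a shared output file of roughly 20 GB could be striped over all 16 OSTs, since the stripe count is capped by the number of available OSTs. The directory name and the 4 MB stripe size are illustrative choices, not prescriptions:

$ mkdir $SCRATCH/shared_output
$ lfs setstripe --stripe-count 16 --stripe-size 4M $SCRATCH/shared_output
$ lfs getstripe -c $SCRATCH/shared_output
16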
Another good practice is to make the stripe count be an integral factor of the number of processes performing the write in parallel, so that you achieve load balance among the OSTs. For example, set the stripe count to 16 instead of 15 when you have 64 processes performing the writes. For more details, you can read the following external resources:
- Reference Documentation: Managing File Layout (Striping) and Free Space
- Lustre Wiki
- Lustre Best Practices - NASA HECC
- I/O and Lustre Usage - NICS