Errors 'too many open files' while load is running #355

Closed
opened 2023-05-16 13:35:20 +00:00 by anikeev-yadro · 3 comments

## Steps to Reproduce (for bugs)

1. Start load on all nodes with the following profile:

```
test_title: "write 8KB 16 nodes 12h 8000 ops 1200th open model"
wipe_data: false
load_params:
    verify: false
    load_time: 1800
    load_type: S3
    object_size: 8
    scenario: S3_CAR
    nodes_selection_strategy: ALL
    endpoint_selection_strategy: FIRST
    k6_process_allocation_strategy: PER_ENDPOINT
    write_rate: 8000
    preallocated_writers: 1200
    preset:
      buckets_count: 100
      s3_location: load-1-4
      objects_count: 0
```

2. While the load is running, the following errors appear in the logs on only 2 of the nodes:

```
HOST: 10.78.69.106
COMMAND:
 sudo journalctl --no-pager --since '2023-05-16 11:22:30' --until '2023-05-16 12:05:18' --grep '\Wpanic\W|\Woom\W|\Wtoo many open files\W' --case-sensitive=0
RC:
 0
STDOUT:
 -- Journal begins at Fri 2023-05-12 18:25:45 UTC, ends at Tue 2023-05-16 12:05:33 UTC. --
 May 16 11:36:08 az frostfs-node[8524]: 2023-05-16T11:36:08.963Z        debug        shard/put.go:61        can't put object to the write-cache, trying blobstor        {"shard_id": "THxCC9PUcLiAAMQSYC2g4U", "err": "open /srv/frostfs/meta0/write_cache4/E/Qkti9sCyVq9rdqmtqKa25GM3EupQZcb6NBQssbzRBvd.HbWXTPb21cL3EdNodWAGH7Lu9JxithUwwbwbQmukM7ZE#0: too many open files"}
 May 16 11:36:08 az frostfs-node[8524]: 2023-05-16T11:36:08.977Z        warn        engine/engine.go:162        could not put object to shard        {"shard_id": "THxCC9PUcLiAAMQSYC2g4U", "error count": 1, "error": "could not put object to BLOB storage: open /srv/frostfs/data4/E/Q/k/t/i9sCyVq9rdqmtqKa25GM3EupQZcb6NBQssbzRBvd.HbWXTPb21cL3EdNodWAGH7Lu9JxithUwwbwbQmukM7ZE#0: too many open files"}
 May 16 11:36:08 az frostfs-node[8524]: 2023-05-16T11:36:08.977Z        debug        shard/put.go:61        can't put object to the write-cache, trying blobstor        {"shard_id": "yDLXgKUMQ87uLrfAjYvPj", "err": "open /srv/frostfs/meta1/write_cache3/E/Qkti9sCyVq9rdqmtqKa25GM3EupQZcb6NBQssbzRBvd.HbWXTPb21cL3EdNodWAGH7Lu9JxithUwwbwbQmukM7ZE#0: too many open files"}
 May 16 11:36:08 az frostfs-node[8524]: 2023-05-16T11:36:08.990Z        warn        engine/engine.go:162        could not put object to shard        {"shard_id": "yDLXgKUMQ87uLrfAjYvPj", "error count": 1, "error": "could not put object to BLOB storage: open /srv/frostfs/data3/E/Q/k/t/i9sCyVq9rdqmtqKa25GM3EupQZcb6NBQssbzRBvd.HbWXTPb21cL3EdNodWAGH7Lu9JxithUwwbwbQmukM7ZE#0: too many open files"}
 May 16 11:36:08 az frostfs-node[8524]: 2023-05-16T11:36:08.990Z        debug        shard/put.go:61        can't put object to the write-cache, trying blobstor        {"shard_id": "DwMRfQNb28rP8KQGHCfPCY", "err": "open 
```
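
To catch the condition as it develops, the descriptor count of the node process can be sampled directly from procfs. This is a diagnostic sketch; `pgrep -x frostfs-node` assumes the binary name shown in the `ps` output below:

```
# Print the node's open-descriptor count every 5 seconds;
# /proc/<pid>/fd holds one entry per open file, socket, or pipe.
watch -n 5 'ls /proc/$(pgrep -x frostfs-node)/fd | wc -l'
```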

3. At the same time, the node has a very large number of open files:

```
service@az:~$ ps ax|grep node
   1337 ?        Ssl  280:11 /usr/bin/prometheus-node-exporter --collector.disable-defaults --collector.cpu --collector.cpufreq --collector.diskstats --collector.dmi --collector.edac --collector.entropy --collector.filefd --collector.filesystem --collector.hwmon --collector.loadavg --collector.mdadm --collector.meminfo --collector.netdev --collector.netstat --collector.nvme --collector.os --collector.pressure --collector.stat --collector.timex --collector.uname --collector.vmstat --collector.systemd --collector.systemd.unit-include=(neo-go|neogo|frostfs).* --collector.textfile --collector.netclass --web.listen-address=localhost:9100
   8518 ?        Ssl  1289:27 /usr/bin/neo-go node --mainnet --config-path=/etc/neo-go
   8524 ?        Ssl  209618:09 /usr/bin/frostfs-node --config /etc/frostfs/storage/config.yml
3104156 pts/3    S+     0:00 grep node
service@az:~$ sudo lsof -p 8524 | wc -l
66755
```
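
The total alone does not show what kind of descriptor is accumulating. Grouping the `lsof` output by its TYPE column (field 5) separates regular files (`REG`) from sockets (`IPv4`, `sock`), which is exactly the distinction that matters here; a sketch against the same PID:

```
# Count open descriptors per type; NR > 1 skips the lsof header line.
sudo lsof -nP -p 8524 | awk 'NR > 1 {print $5}' | sort | uniq -c | sort -rn
```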

4. A large number of goroutines is also observed.
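
One way to quantify that, assuming the node's pprof endpoint is enabled in its configuration (the address below is illustrative, not taken from this setup):

```
# The first line of the text-format goroutine profile reports the total,
# e.g. "goroutine profile: total 12345".
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -n 1
```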

System limits

```
service@pokoy:~$ cat /proc/8493/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             514158               514158               processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       514158               514158               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
```
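
With a soft and hard limit of 65536 and `lsof` reporting roughly 66k entries, the process is at or near its ceiling. Raising the limit only postpones exhaustion if descriptors are leaking, but as a stopgap a systemd drop-in can lift it. A sketch; the unit name `frostfs-storage` is an assumption about this deployment:

```
# Raise the open-file limit for the storage node service via a drop-in.
sudo mkdir -p /etc/systemd/system/frostfs-storage.service.d
printf '[Service]\nLimitNOFILE=1048576\n' \
    | sudo tee /etc/systemd/system/frostfs-storage.service.d/limits.conf
sudo systemctl daemon-reload
sudo systemctl restart frostfs-storage
```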

## Version

```
FrostFS Storage node
Version: v0.0.1-414-gc60029d3
GoVersion: go1.18.4
```

## Context

Running performance tests.

## Your Environment

Hardware: 16 nodes

anikeev-yadro added the bug and triage labels 2023-05-16 13:35:20 +00:00
fyrchik added the P0 label 2023-05-16 14:16:13 +00:00
ale64bit was assigned by fyrchik 2023-05-17 07:15:47 +00:00
Collaborator

@anikeev-yadro can you share the exact k6 command you used?
Collaborator

Looks like the majority of them are network connections, so something is leaking those. I'm looking into it.

Fixed via #360, #358.
Reference: TrueCloudLab/frostfs-node#355