Discussion: blocklayout GETDEVICEINFO fix
Christoph Hellwig
2014-09-26 14:02:49 UTC
The rpc_pipefs pattern used for the GETDEVICEINFO implementation in the
blocklayout driver is fundamentally not thread safe. Workloads like
dbench manage to trigger concurrent GETDEVICEINFO calls though, so we
need to add a critical section around it.

I also wonder if we should avoid even sending multiple GETDEVICEINFO
calls at a higher level in the NFS client, as sending multiple requests
just to cancel out each other's effect before adding the result to the
cache doesn't seem very helpful.
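
For context, the upcall path in bl_resolve_deviceid() looks roughly
like this (a heavily condensed sketch, not the literal code); every
piece of state it touches lives in the per-netns nfs_net and is shared
by all callers:

	struct bl_dev_msg *reply = &nn->bl_mount_reply;	/* shared reply buffer */

	add_wait_queue(&nn->bl_wq, &wq);
	rpc_queue_upcall(nn->bl_device_pipe, msg);	/* shared pipe */
	schedule();					/* any downcall wakes every waiter */

	if (reply->status == BL_DEVICE_REQUEST_PROC)	/* may be someone else's reply */
		dev = MKDEV(reply->major, reply->minor);
	kfree(msg->data);				/* other upcall may still reference it */

Two concurrent callers can therefore wake each other up, read the wrong
reply out of bl_mount_reply, and free message data that the other
upcall still references, which is the use after free the patch below
closes by serializing the whole sequence.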

Christoph Hellwig
2014-09-26 14:02:50 UTC
The rpc_pipefs code isn't thread safe, leading to occasional use after
frees when running xfstests generic/241 (dbench).

Signed-off-by: Christoph Hellwig <hch-***@public.gmane.org>
---
fs/nfs/blocklayout/rpc_pipefs.c | 14 +++++++++-----
fs/nfs/netns.h | 1 +
2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/blocklayout/rpc_pipefs.c b/fs/nfs/blocklayout/rpc_pipefs.c
index 8d04bda..da58ff7 100644
--- a/fs/nfs/blocklayout/rpc_pipefs.c
+++ b/fs/nfs/blocklayout/rpc_pipefs.c
@@ -65,17 +65,18 @@ bl_resolve_deviceid(struct nfs_server *server, struct pnfs_block_volume *b,
 
 	dprintk("%s CREATING PIPEFS MESSAGE\n", __func__);
 
+	mutex_lock(&nn->bl_mutex);
 	bl_pipe_msg.bl_wq = &nn->bl_wq;
 
 	b->simple.len += 4;	/* single volume */
 	if (b->simple.len > PAGE_SIZE)
-		return -EIO;
+		goto out_unlock;
 
 	memset(msg, 0, sizeof(*msg));
 	msg->len = sizeof(*bl_msg) + b->simple.len;
 	msg->data = kzalloc(msg->len, gfp_mask);
 	if (!msg->data)
-		goto out;
+		goto out_free_data;
 
 	bl_msg = msg->data;
 	bl_msg->type = BL_DEVICE_MOUNT,
@@ -87,7 +88,7 @@ bl_resolve_deviceid(struct nfs_server *server, struct pnfs_block_volume *b,
 	rc = rpc_queue_upcall(nn->bl_device_pipe, msg);
 	if (rc < 0) {
 		remove_wait_queue(&nn->bl_wq, &wq);
-		goto out;
+		goto out_free_data;
 	}
 
 	set_current_state(TASK_UNINTERRUPTIBLE);
@@ -98,12 +99,14 @@ bl_resolve_deviceid(struct nfs_server *server, struct pnfs_block_volume *b,
 	if (reply->status != BL_DEVICE_REQUEST_PROC) {
 		printk(KERN_WARNING "%s failed to decode device: %d\n",
 			__func__, reply->status);
-		goto out;
+		goto out_free_data;
 	}
 
 	dev = MKDEV(reply->major, reply->minor);
-out:
+out_free_data:
 	kfree(msg->data);
+out_unlock:
+	mutex_unlock(&nn->bl_mutex);
 	return dev;
 }
 
@@ -233,6 +236,7 @@ static int nfs4blocklayout_net_init(struct net *net)
 	struct nfs_net *nn = net_generic(net, nfs_net_id);
 	struct dentry *dentry;
 
+	mutex_init(&nn->bl_mutex);
 	init_waitqueue_head(&nn->bl_wq);
 	nn->bl_device_pipe = rpc_mkpipe_data(&bl_upcall_ops, 0);
 	if (IS_ERR(nn->bl_device_pipe))
diff --git a/fs/nfs/netns.h b/fs/nfs/netns.h
index ef221fb..f0e06e4 100644
--- a/fs/nfs/netns.h
+++ b/fs/nfs/netns.h
@@ -19,6 +19,7 @@ struct nfs_net {
 	struct rpc_pipe *bl_device_pipe;
 	struct bl_dev_msg bl_mount_reply;
 	wait_queue_head_t bl_wq;
+	struct mutex bl_mutex;
 	struct list_head nfs_client_list;
 	struct list_head nfs_volume_list;
 #if IS_ENABLED(CONFIG_NFS_V4)
--
1.9.1

Trond Myklebust
2014-09-26 14:29:34 UTC
Post by Christoph Hellwig
The rpc_pipefs code isn't thread safe, leading to occasional use after
frees when running xfstests generic/241 (dbench).
---
fs/nfs/blocklayout/rpc_pipefs.c | 14 +++++++++-----
fs/nfs/netns.h | 1 +
2 files changed, 10 insertions(+), 5 deletions(-)
It worries me that we're putting a mutex directly in the writeback
path. For small arrays, it might be acceptable, but what if you have a
block device with 1000s of disks on the back end?

Is there no better way to fix this issue?
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust-7I+n7zu2hftEKMMhf/***@public.gmane.org
Christoph Hellwig
2014-09-26 15:48:43 UTC
Post by Trond Myklebust
It worries me that we're putting a mutex directly in the writeback
path. For small arrays, it might be acceptable, but what if you have a
block device with 1000s of disks on the back end?
Is there no better way to fix this issue?
Not without getting rid of the rpc_pipefs interface. That is on my
very long term TODO list, but it will require new userspace support.

Note that I'm actually worried about GETDEVICEINFO from the writeback
path in general.  A lot happens when we don't have a device in the
cache, including opening a block device for the block layout driver,
which is a complex operation full of GFP_KERNEL allocations, or an even
more complex SCSI device scan for the object layout.  It's been on my
nearer-term TODO list to look for reproducers of the deadlocks that
seem very possible in this area, and then to look into a fix; I can't
really think of anything less drastic than refusing block or object
layout I/O from memory reclaim if we don't have the device cached yet.
The situation for file layouts seems less severe; I'll need help from
people more familiar with them to think through the situation there.
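
Just to illustrate that last idea (a hypothetical sketch only, with a
made-up bl_deviceid_cached() helper; this is not part of any posted
patch):

	/*
	 * Before triggering a GETDEVICEINFO-driven device open: if the
	 * deviceid isn't cached yet and we're running from memory
	 * reclaim, give up and let the I/O fall back to the MDS rather
	 * than risk deadlocking on GFP_KERNEL allocations and the
	 * rpc_pipefs upcall.
	 */
	if (!bl_deviceid_cached(server, devid) &&	/* made-up lookup helper */
	    (current->flags & PF_MEMALLOC))
		return false;				/* caller falls back to MDS I/O */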

Trond Myklebust
2014-09-26 16:21:06 UTC
Post by Christoph Hellwig
Post by Trond Myklebust
It worries me that we're putting a mutex directly in the writeback
path. For small arrays, it might be acceptable, but what if you have a
block device with 1000s of disks on the back end?
Is there no better way to fix this issue?
Not without getting rid of the rpc_pipefs interface. That is on my
very long term TODO list, but it will require new userspace support.
Why is that? rpc_pipefs was designed to be message based, so it should
work quite well in a multi-threaded environment. We certainly don't
use mutexes around the gssd up/downcall, and the only reason for the
mutex in idmapd is to deal with the keyring upcall.
Post by Christoph Hellwig
Note that I'm actually worried about GETDEVICEINFO from the writeback
path in general.  A lot happens when we don't have a device in the
cache, including opening a block device for the block layout driver,
which is a complex operation full of GFP_KERNEL allocations, or an even
more complex SCSI device scan for the object layout.  It's been on my
nearer-term TODO list to look for reproducers of the deadlocks that
seem very possible in this area, and then to look into a fix; I can't
really think of anything less drastic than refusing block or object
layout I/O from memory reclaim if we don't have the device cached yet.
The situation for file layouts seems less severe; I'll need help from
people more familiar with them to think through the situation there.
Agreed,
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust-7I+n7zu2hftEKMMhf/***@public.gmane.org
Christoph Hellwig
2014-09-26 16:41:58 UTC
Post by Trond Myklebust
Post by Christoph Hellwig
Not without getting rid of the rpc_pipefs interface. That is on my
very long term TODO list, but it will require new userspace support.
Why is that? rpc_pipefs was designed to be message based, so it should
work quite well in a multi-threaded environment. We certainly don't
use mutexes around the gssd up/downcall, and the only reason for the
mutex in idmapd is to deal with the keyring upcall.
OK, the GSS code dynamically allocates a new message for each upcall
and thus doesn't have the issue.  I'll take a look at whether I can do
the same in the blocklayout driver.
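
A rough sketch of that gssd-style shape (hypothetical only: the
bl_resolve_ctx structure and completion-based wakeup are made up, and
the downcall handler would still need a way to match a reply to the
right context):

	struct bl_resolve_ctx {			/* allocated per upcall, not per netns */
		struct rpc_pipe_msg	msg;
		struct bl_dev_msg	reply;	/* replaces the shared nn->bl_mount_reply */
		struct completion	done;
	};

	ctx = kzalloc(sizeof(*ctx), gfp_mask);
	...
	init_completion(&ctx->done);
	rc = rpc_queue_upcall(nn->bl_device_pipe, &ctx->msg);
	...
	wait_for_completion(&ctx->done);	/* only this caller's reply wakes us */
	dev = MKDEV(ctx->reply.major, ctx->reply.minor);

With per-call state there is nothing left for concurrent GETDEVICEINFO
upcalls to clobber, so the bl_mutex from this patch would no longer be
needed.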