Discussion:
[PATCH v1 00/16] NFS/RDMA patches for 3.19
Chuck Lever
2014-10-16 19:38:12 UTC
Permalink
Hi-

Two groups of patches in this series. The first group is fixes
and clean-ups for xprtrdma. The second group adds client support
for NFSv4.1 on RDMA. Looking for review and testing.

Also available in the "nfs-rdma-for-3.19" topic branch at

git://linux-nfs.org/projects/cel/cel-2.6.git

---

Chuck Lever (16):
xprtrdma: Return an errno from rpcrdma_register_external()
xprtrdma: Cap req_cqinit
SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments
xprtrdma: Re-write rpcrdma_flush_cqs()
xprtrdma: unmap all FMRs during transport disconnect
xprtrdma: spin CQ completion vectors
SUNRPC: serialize iostats updates
xprtrdma: Display async errors
xprtrdma: Enable pad optimization
NFS: Include transport protocol name in UCS client string
NFS: Clean up nfs4_init_callback()
SUNRPC: Add rpc_xprt_is_bidirectional()
NFS: Add sidecar RPC client support
NFS: Set BIND_CONN_TO_SESSION arguments in the proc layer
NFS: Bind side-car connection to session
NFS: Disable SESSION4_BACK_CHAN when a backchannel sidecar is to be used


fs/nfs/client.c | 1
fs/nfs/nfs4client.c | 86 +++++++++++++++++----
fs/nfs/nfs4proc.c | 71 ++++++++++++++---
fs/nfs/nfs4xdr.c | 16 ++--
include/linux/nfs_fs_sb.h | 2
include/linux/nfs_xdr.h | 6 +
include/linux/sunrpc/clnt.h | 1
include/linux/sunrpc/metrics.h | 3 +
include/linux/sunrpc/sched.h | 2
include/linux/sunrpc/xprt.h | 4 +
net/sunrpc/clnt.c | 28 ++++++-
net/sunrpc/sched.c | 6 +
net/sunrpc/stats.c | 21 ++++-
net/sunrpc/xprtrdma/transport.c | 6 +
net/sunrpc/xprtrdma/verbs.c | 159 +++++++++++++++++++++++++++++++++++----
net/sunrpc/xprtsock.c | 6 +
16 files changed, 347 insertions(+), 71 deletions(-)

--
Chuck Lever
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Chuck Lever
2014-10-16 19:38:21 UTC
Permalink
The RPC/RDMA send_request method and the chunk registration code
expect an errno from the registration function. This allows
the upper layers to distinguish between a recoverable failure
(for example, temporary memory exhaustion) and a hard failure
(for example, a bug in the registration logic).

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 61c4129..6ea2942 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1918,10 +1918,10 @@ rpcrdma_register_external(struct rpcrdma_mr_seg *seg,
break;

default:
- return -1;
+ return -EIO;
}
if (rc)
- return -1;
+ return rc;

return nsegs;
}
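
The distinction this patch draws can be sketched outside the kernel. The helper and policy names below are hypothetical, for illustration only:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Illustrative sketch, not kernel code: with a real errno the caller
 * can separate transient failures (worth retrying) from hard ones,
 * which a bare -1 cannot express.  The policy here is hypothetical. */
static const char *rc_policy(int rc)
{
	if (rc >= 0)
		return "done";		/* e.g. number of segments registered */
	if (rc == -ENOMEM || rc == -EAGAIN)
		return "retry";		/* temporary memory exhaustion */
	return "fail";			/* e.g. -EIO: bug in registration logic */
}
```

With the old return value of -1, every failure would have landed in the same bucket.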

Chuck Lever
2014-10-16 19:38:29 UTC
Permalink
Recent work made FRMR registration and invalidation completions
unsignaled. This greatly reduces the adapter interrupt rate.

Every so often, however, a posted send Work Request is allowed to
signal. Otherwise, the provider's Work Queue will wrap and the
workload will hang.

The number of Work Requests that are allowed to remain unsignaled is
determined by the value of rep_cqinit. Currently, this is set to the
size of the send Work Queue divided by two, minus 1.

For FRMR, the send Work Queue is the maximum number of concurrent
RPCs (currently 32) times the maximum number of Work Requests an
RPC might use (currently 7, though some adapters may need more).

For mlx4, this is 224 entries. This leaves completion signaling
disabled for 111 send Work Requests.

Some providers hold back dispatching Work Requests until a CQE is
generated. If completions are disabled, then no CQEs are generated
for quite some time, and that can stall the Work Queue.

I've seen this occur running xfstests generic/113 over NFSv4, where
eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
because the Work Queue has overflowed. The connection is dropped
and re-established.

Cap the rep_cqinit setting so completions are not left turned off
for too long.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6ea2942..5c0c7a5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -733,6 +733,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,

/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
+ if (ep->rep_cqinit > 20)
+ ep->rep_cqinit = 20;
if (ep->rep_cqinit <= 2)
ep->rep_cqinit = 0;
INIT_CQCOUNT(ep);
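
The arithmetic in the description (a 224-entry send queue leaving 111 sends unsignaled, now capped at 20) can be checked with a small user-space sketch. This is not the kernel function, just its math:

```c
#include <assert.h>

/* Sketch (not kernel code) of the capped signal trigger: "cqinit" is
 * the number of sends allowed to remain unsignaled before a
 * completion is requested.  The cap value 20 mirrors the patch. */
static int compute_cqinit(int max_send_wr, int cap)
{
	int cqinit = max_send_wr / 2 - 1;

	if (cqinit > cap)	/* cap so CQEs are generated regularly */
		cqinit = cap;
	if (cqinit <= 2)	/* tiny queues: signal every send */
		cqinit = 0;
	return cqinit;
}
```

For the mlx4 example above, `compute_cqinit(224, 20)` yields 20 instead of 111.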

Anna Schumaker
2014-10-20 13:27:26 UTC
Permalink
Hey Chuck,
Post by Chuck Lever
Recent work made FRMR registration and invalidation completions
unsignaled. This greatly reduces the adapter interrupt rate.
Every so often, however, a posted send Work Request is allowed to
signal. Otherwise, the provider's Work Queue will wrap and the
workload will hang.
The number of Work Requests that are allowed to remain unsignaled is
determined by the value of rep_cqinit. Currently, this is set to the
size of the send Work Queue divided by two, minus 1.
For FRMR, the send Work Queue is the maximum number of concurrent
RPCs (currently 32) times the maximum number of Work Requests an
RPC might use (currently 7, though some adapters may need more).
For mlx4, this is 224 entries. This leaves completion signaling
disabled for 111 send Work Requests.
Some providers hold back dispatching Work Requests until a CQE is
generated. If completions are disabled, then no CQEs are generated
for quite some time, and that can stall the Work Queue.
I've seen this occur running xfstests generic/113 over NFSv4, where
eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
because the Work Queue has overflowed. The connection is dropped
and re-established.
Cap the rep_cqinit setting so completions are not left turned off
for too long.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
---
net/sunrpc/xprtrdma/verbs.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6ea2942..5c0c7a5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -733,6 +733,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
+ if (ep->rep_cqinit > 20)
+ ep->rep_cqinit = 20;
if (ep->rep_cqinit <= 2)
Can you change the ep->rep_cqinit <= 2 check into an else-if?

Thanks!
Anna
Post by Chuck Lever
ep->rep_cqinit = 0;
INIT_CQCOUNT(ep);
Chuck Lever
2014-10-16 19:38:38 UTC
Permalink
I noticed that on RDMA, NFSv4 operations were using "hardway"
allocations far more often than not. A "hardway" allocation uses GFP_NOFS
during each RPC to allocate the XDR buffer, instead of using a
pre-allocated pre-registered buffer for each RPC.

The pre-allocated buffers are 2200 bytes in length. The requested
XDR buffer sizes looked like this:

GETATTR: 3220 bytes
LOOKUP: 3612 bytes
WRITE: 3256 bytes
OPEN: 6344 bytes

But an NFSv4 GETATTR RPC request should be small. It's the reply
part of GETATTR that can grow large.

call_allocate() passes a single value as the XDR buffer size: the
sum of call and reply buffers. However, the xprtrdma transport
allocates its XDR request and reply buffers separately.

xprtrdma needs to know the maximum call size, as guidance for how
large the outgoing request is going to be and how the NFS payload
will be marshalled into chunks.

But RDMA XDR reply buffers are pre-posted, fixed-size buffers, not
allocated by xprt_rdma_allocate().

Because of the sum passed through ->buf_alloc(), xprtrdma's
->buf_alloc() always allocates more XDR buffer than it will ever
use. For NFSv4, it is unnecessarily triggering the slow "hardway"
path for almost every RPC.

Pass the call and reply buffer size values separately to the
transport's ->buf_alloc method. The RDMA transport ->buf_alloc can
now ignore the reply size, and allocate just what it will use for
the call buffer. The socket transport ->buf_alloc can simply add
them together, as call_allocate() did before.

With this patch, an NFSv4 GETATTR request now allocates a 476-byte
RDMA XDR buffer. I didn't see a single NFSv4 request that did not
fit into the transport's pre-allocated XDR buffer.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
include/linux/sunrpc/sched.h | 2 +-
include/linux/sunrpc/xprt.h | 3 ++-
net/sunrpc/clnt.c | 4 ++--
net/sunrpc/sched.c | 6 ++++--
net/sunrpc/xprtrdma/transport.c | 2 +-
net/sunrpc/xprtsock.c | 3 ++-
6 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 1a89599..68fa71d 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -232,7 +232,7 @@ struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *,
void *);
void rpc_wake_up_status(struct rpc_wait_queue *, int);
void rpc_delay(struct rpc_task *, unsigned long);
-void * rpc_malloc(struct rpc_task *, size_t);
+void *rpc_malloc(struct rpc_task *, size_t, size_t);
void rpc_free(void *);
int rpciod_up(void);
void rpciod_down(void);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index fcbfe87..632685c 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -124,7 +124,8 @@ struct rpc_xprt_ops {
void (*rpcbind)(struct rpc_task *task);
void (*set_port)(struct rpc_xprt *xprt, unsigned short port);
void (*connect)(struct rpc_xprt *xprt, struct rpc_task *task);
- void * (*buf_alloc)(struct rpc_task *task, size_t size);
+ void * (*buf_alloc)(struct rpc_task *task,
+ size_t call, size_t reply);
void (*buf_free)(void *buffer);
int (*send_request)(struct rpc_task *task);
void (*set_retrans_timeout)(struct rpc_task *task);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 488ddee..5e817d6 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1599,8 +1599,8 @@ call_allocate(struct rpc_task *task)
req->rq_rcvsize = RPC_REPHDRSIZE + slack + proc->p_replen;
req->rq_rcvsize <<= 2;

- req->rq_buffer = xprt->ops->buf_alloc(task,
- req->rq_callsize + req->rq_rcvsize);
+ req->rq_buffer = xprt->ops->buf_alloc(task, req->rq_callsize,
+ req->rq_rcvsize);
if (req->rq_buffer != NULL)
return;

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79..fc4f939 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -829,7 +829,8 @@ static void rpc_async_schedule(struct work_struct *work)
/**
* rpc_malloc - allocate an RPC buffer
* @task: RPC task that will use this buffer
- * @size: requested byte size
+ * @call: maximum size of on-the-wire RPC call, in bytes
+ * @reply: maximum size of on-the-wire RPC reply, in bytes
*
* To prevent rpciod from hanging, this allocator never sleeps,
* returning NULL and suppressing warning if the request cannot be serviced
@@ -843,8 +844,9 @@ static void rpc_async_schedule(struct work_struct *work)
* In order to avoid memory starvation triggering more writebacks of
* NFS requests, we avoid using GFP_KERNEL.
*/
-void *rpc_malloc(struct rpc_task *task, size_t size)
+void *rpc_malloc(struct rpc_task *task, size_t call, size_t reply)
{
+ size_t size = call + reply;
struct rpc_buffer *buf;
gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac49..6e9d0a7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -459,7 +459,7 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
* the receive buffer portion when using reply chunks.
*/
static void *
-xprt_rdma_allocate(struct rpc_task *task, size_t size)
+xprt_rdma_allocate(struct rpc_task *task, size_t size, size_t replen)
{
struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;
struct rpcrdma_req *req, *nreq;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89e..b4aca48 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2423,8 +2423,9 @@ static void xs_tcp_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
* we allocate pages instead doing a kmalloc like rpc_malloc is because we want
* to use the server side send routines.
*/
-static void *bc_malloc(struct rpc_task *task, size_t size)
+static void *bc_malloc(struct rpc_task *task, size_t call, size_t reply)
{
+ size_t size = call + reply;
struct page *page;
struct rpc_buffer *buf;
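
Using the GETATTR numbers from the description (476 bytes of call buffer, 3220 bytes total), the per-transport policy splits like this. The helper and enum are hypothetical sketches of the policy, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the sizing policy described above: sockets need one
 * contiguous buffer covering call + reply, while RDMA pre-posts its
 * reply buffers and only needs the call portion allocated. */
enum xprt_kind { XPRT_SOCK, XPRT_RDMA };

static size_t buf_alloc_size(enum xprt_kind kind, size_t call, size_t reply)
{
	return kind == XPRT_RDMA ? call : call + reply;
}
```

A 476-byte request easily fits the transport's 2200-byte pre-registered buffer, so the "hardway" path is avoided.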


Anna Schumaker
2014-10-20 14:04:45 UTC
Permalink
Post by Chuck Lever
I noticed that on RDMA, NFSv4 operations were using "hardway"
allocations far more often than not. A "hardway" allocation uses GFP_NOFS
during each RPC to allocate the XDR buffer, instead of using a
pre-allocated pre-registered buffer for each RPC.
The pre-allocated buffers are 2200 bytes in length. The requested
GETATTR: 3220 bytes
LOOKUP: 3612 bytes
WRITE: 3256 bytes
OPEN: 6344 bytes
But an NFSv4 GETATTR RPC request should be small. It's the reply
part of GETATTR that can grow large.
call_allocate() passes a single value as the XDR buffer size: the
sum of call and reply buffers. However, the xprtrdma transport
allocates its XDR request and reply buffers separately.
xprtrdma needs to know the maximum call size, as guidance for how
large the outgoing request is going to be and how the NFS payload
will be marshalled into chunks.
But RDMA XDR reply buffers are pre-posted, fixed-size buffers, not
allocated by xprt_rdma_allocate().
Because of the sum passed through ->buf_alloc(), xprtrdma's
->buf_alloc() always allocates more XDR buffer than it will ever
use. For NFSv4, it is unnecessarily triggering the slow "hardway"
path for almost every RPC.
Pass the call and reply buffer size values separately to the
transport's ->buf_alloc method. The RDMA transport ->buf_alloc can
now ignore the reply size, and allocate just what it will use for
the call buffer. The socket transport ->buf_alloc can simply add
them together, as call_allocate() did before.
With this patch, an NFSv4 GETATTR request now allocates a 476 byte
RDMA XDR buffer. I didn't see a single NFSv4 request that did not
fit into the transport's pre-allocated XDR buffer.
---
include/linux/sunrpc/sched.h | 2 +-
include/linux/sunrpc/xprt.h | 3 ++-
net/sunrpc/clnt.c | 4 ++--
net/sunrpc/sched.c | 6 ++++--
net/sunrpc/xprtrdma/transport.c | 2 +-
net/sunrpc/xprtsock.c | 3 ++-
6 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 1a89599..68fa71d 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -232,7 +232,7 @@ struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *,
void *);
void rpc_wake_up_status(struct rpc_wait_queue *, int);
void rpc_delay(struct rpc_task *, unsigned long);
-void * rpc_malloc(struct rpc_task *, size_t);
+void *rpc_malloc(struct rpc_task *, size_t, size_t);
void rpc_free(void *);
int rpciod_up(void);
void rpciod_down(void);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index fcbfe87..632685c 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -124,7 +124,8 @@ struct rpc_xprt_ops {
void (*rpcbind)(struct rpc_task *task);
void (*set_port)(struct rpc_xprt *xprt, unsigned short port);
void (*connect)(struct rpc_xprt *xprt, struct rpc_task *task);
- void * (*buf_alloc)(struct rpc_task *task, size_t size);
+ void * (*buf_alloc)(struct rpc_task *task,
+ size_t call, size_t reply);
void (*buf_free)(void *buffer);
int (*send_request)(struct rpc_task *task);
void (*set_retrans_timeout)(struct rpc_task *task);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 488ddee..5e817d6 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1599,8 +1599,8 @@ call_allocate(struct rpc_task *task)
req->rq_rcvsize = RPC_REPHDRSIZE + slack + proc->p_replen;
req->rq_rcvsize <<= 2;
- req->rq_buffer = xprt->ops->buf_alloc(task,
- req->rq_callsize + req->rq_rcvsize);
+ req->rq_buffer = xprt->ops->buf_alloc(task, req->rq_callsize,
+ req->rq_rcvsize);
if (req->rq_buffer != NULL)
return;
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79..fc4f939 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -829,7 +829,8 @@ static void rpc_async_schedule(struct work_struct *work)
/**
* rpc_malloc - allocate an RPC buffer
*
* To prevent rpciod from hanging, this allocator never sleeps,
* returning NULL and suppressing warning if the request cannot be serviced
@@ -843,8 +844,9 @@ static void rpc_async_schedule(struct work_struct *work)
* In order to avoid memory starvation triggering more writebacks of
* NFS requests, we avoid using GFP_KERNEL.
*/
-void *rpc_malloc(struct rpc_task *task, size_t size)
+void *rpc_malloc(struct rpc_task *task, size_t call, size_t reply)
{
+ size_t size = call + reply;
struct rpc_buffer *buf;
gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac49..6e9d0a7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -459,7 +459,7 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
* the receive buffer portion when using reply chunks.
*/
static void *
-xprt_rdma_allocate(struct rpc_task *task, size_t size)
+xprt_rdma_allocate(struct rpc_task *task, size_t size, size_t replen)
The comment right before this function mentions that send and receive buffers are allocated in the same call. Can you update this?

Anna
Post by Chuck Lever
{
struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;
struct rpcrdma_req *req, *nreq;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89e..b4aca48 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2423,8 +2423,9 @@ static void xs_tcp_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
* we allocate pages instead doing a kmalloc like rpc_malloc is because we want
* to use the server side send routines.
*/
-static void *bc_malloc(struct rpc_task *task, size_t size)
+static void *bc_malloc(struct rpc_task *task, size_t call, size_t reply)
{
+ size_t size = call + reply;
struct page *page;
struct rpc_buffer *buf;
Chuck Lever
2014-10-20 18:21:57 UTC
Permalink
Post by Chuck Lever
static void *
-xprt_rdma_allocate(struct rpc_task *task, size_t size)
+xprt_rdma_allocate(struct rpc_task *task, size_t size, size_t replen)

The rpc_rqst actually has separate send and receive sizes
already in it, and the passed-in tk_task has a pointer to
rpc_rqst. So I don't need to alter the buf_alloc() method
synopsis at all.

Though I've tested with the two server implementations that
I have on hand, I've discovered this is a less than generic
approach to the problem. I can't just slice off the receive
part of this buffer, as it is actually used sometimes
(though apparently not with the Solaris or Linux servers).

For those two reasons, I'm going to replace this patch with
something else in the next round. JFYI so you know what to
look for next time.

Post by Anna Schumaker
The comment right before this function mentions that send and
receive buffers are allocated in the same call. Can you update
this?

If the new patch touches xprt_rdma_allocate(), I can
replace Tom's implementer's notes in this function with
some operational documenting comments.
--
Chuck Lever



Chuck Lever
2014-10-16 19:38:46 UTC
Permalink
Currently rpcrdma_flush_cqs() avoids code duplication by simply
invoking rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall.
This has two minor issues:

1. It re-arms the CQ, which can happen even if a CQ upcall is
running at the same time

2. The upcall functions drain only a limited number of CQEs,
thanks to the poll budget added by commit 8301a2c047cc
("xprtrdma: Limit work done by completion handler").

Rewrite rpcrdma_flush_cqs() to be sure all CQEs are drained after a
transport is disconnected.

Fixes: a7bc211ac926 ("xprtrdma: On disconnect, don't ignore ... ")
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5c0c7a5..6fadb90 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -106,6 +106,17 @@ rpcrdma_run_tasklet(unsigned long data)
static DECLARE_TASKLET(rpcrdma_tasklet_g, rpcrdma_run_tasklet, 0UL);

static void
+rpcrdma_schedule_tasklet(struct list_head *sched_list)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
+ list_splice_tail(sched_list, &rpcrdma_tasklets_g);
+ spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
+ tasklet_schedule(&rpcrdma_tasklet_g);
+}
+
+static void
rpcrdma_qp_async_error_upcall(struct ib_event *event, void *context)
{
struct rpcrdma_ep *ep = context;
@@ -243,7 +254,6 @@ rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
struct list_head sched_list;
struct ib_wc *wcs;
int budget, count, rc;
- unsigned long flags;

INIT_LIST_HEAD(&sched_list);
budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
@@ -261,10 +271,7 @@ rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
rc = 0;

out_schedule:
- spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
- list_splice_tail(&sched_list, &rpcrdma_tasklets_g);
- spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
- tasklet_schedule(&rpcrdma_tasklet_g);
+ rpcrdma_schedule_tasklet(&sched_list);
return rc;
}

@@ -309,8 +316,17 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
static void
rpcrdma_flush_cqs(struct rpcrdma_ep *ep)
{
- rpcrdma_recvcq_upcall(ep->rep_attr.recv_cq, ep);
- rpcrdma_sendcq_upcall(ep->rep_attr.send_cq, ep);
+ struct list_head sched_list;
+ struct ib_wc wc;
+
+ INIT_LIST_HEAD(&sched_list);
+ while (ib_poll_cq(ep->rep_attr.recv_cq, 1, &wc) > 0)
+ rpcrdma_recvcq_process_wc(&wc, &sched_list);
+ if (!list_empty(&sched_list))
+ rpcrdma_schedule_tasklet(&sched_list);
+
+ while (ib_poll_cq(ep->rep_attr.send_cq, 1, &wc) > 0)
+ rpcrdma_sendcq_process_wc(&wc);
}

#ifdef RPC_DEBUG
@@ -980,7 +996,6 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
{
int rc;

- rpcrdma_flush_cqs(ep);
rc = rdma_disconnect(ia->ri_id);
if (!rc) {
/* returns without wait if not connected */
@@ -992,6 +1007,7 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
dprintk("RPC: %s: rdma_disconnect %i\n", __func__, rc);
ep->rep_connected = rc;
}
+ rpcrdma_flush_cqs(ep);
}

static int
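
The budget problem the patch description calls out (issue 2 above) can be modeled with a toy queue. This is not the verbs API, just an illustration of why a budgeted poll is the wrong tool for a post-disconnect flush:

```c
#include <assert.h>

/* Toy model: a budgeted poll may stop early and leave entries
 * queued, while a drain loop runs until the queue is empty. */
struct toy_cq { int head, tail; };

static int toy_poll(struct toy_cq *cq)		/* 1 = got an entry */
{
	if (cq->head == cq->tail)
		return 0;
	cq->head++;
	return 1;
}

static int poll_budgeted(struct toy_cq *cq, int budget)
{
	int n = 0;

	while (budget-- && toy_poll(cq))
		n++;
	return n;		/* may leave entries behind */
}

static int drain(struct toy_cq *cq)
{
	int n = 0;

	while (toy_poll(cq))
		n++;		/* loops until the queue is empty */
	return n;
}
```

The rewritten rpcrdma_flush_cqs() corresponds to the drain loop: it calls ib_poll_cq() until it returns no more entries, rather than stopping at a poll budget.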

Chuck Lever
2014-10-16 19:38:55 UTC
Permalink
When using RPCRDMA_MTHCAFMR memory registration, after a few
transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to
return EINVAL because the provider has exhausted its map pool.

Make sure that all FMRs are unmapped during transport disconnect,
and that ->send_request remarshals them during an RPC retransmit.
This resets the transport's MRs to ensure that none are leaked
during a disconnect.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/transport.c | 2 +-
net/sunrpc/xprtrdma/verbs.c | 40 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 6e9d0a7..61be0a0 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -601,7 +601,7 @@ xprt_rdma_send_request(struct rpc_task *task)

if (req->rl_niovs == 0)
rc = rpcrdma_marshal_req(rqst);
- else if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
+ else if (r_xprt->rx_ia.ri_memreg_strategy != RPCRDMA_ALLPHYSICAL)
rc = rpcrdma_marshal_chunks(rqst, 0);
if (rc < 0)
goto failed_marshal;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6fadb90..9105524 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -62,6 +62,7 @@
#endif

static void rpcrdma_reset_frmrs(struct rpcrdma_ia *);
+static void rpcrdma_reset_fmrs(struct rpcrdma_ia *);

/*
* internal functions
@@ -884,8 +885,17 @@ retry:
rpcrdma_ep_disconnect(ep, ia);
rpcrdma_flush_cqs(ep);

- if (ia->ri_memreg_strategy == RPCRDMA_FRMR)
+ switch (ia->ri_memreg_strategy) {
+ case RPCRDMA_FRMR:
rpcrdma_reset_frmrs(ia);
+ break;
+ case RPCRDMA_MTHCAFMR:
+ rpcrdma_reset_fmrs(ia);
+ break;
+ default:
+ rc = -EIO;
+ goto out;
+ }

xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
id = rpcrdma_create_id(xprt, ia,
@@ -1305,6 +1315,34 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
kfree(buf->rb_pool);
}

+/* After a disconnect, unmap all FMRs.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_fmr_external().
+ */
+static void
+rpcrdma_reset_fmrs(struct rpcrdma_ia *ia)
+{
+ struct rpcrdma_xprt *r_xprt =
+ container_of(ia, struct rpcrdma_xprt, rx_ia);
+ struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+ struct list_head *pos;
+ struct rpcrdma_mw *r;
+ LIST_HEAD(l);
+ int rc;
+
+ list_for_each(pos, &buf->rb_all) {
+ r = list_entry(pos, struct rpcrdma_mw, mw_all);
+
+ INIT_LIST_HEAD(&l);
+ list_add(&r->r.fmr->list, &l);
+ rc = ib_unmap_fmr(&l);
+ if (rc)
+ dprintk("RPC: %s: ib_unmap_fmr failed %i\n",
+ __func__, rc);
+ }
+}
+
/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
* an unusable state. Find FRMRs in this state and dereg / reg
* each. FRMRs that are VALID and attached to an rpcrdma_req are

--
Chuck Lever
2014-10-16 19:39:03 UTC
A pair of CQs is created for each xprtrdma transport. One transport
instance is created per NFS mount point.

Both Shirley Ma and Steve Wise have observed that the adapter
interrupt workload sticks with a single MSI-X and CPU core unless
manual steps are taken to move it to other CPUs. This tends to limit
performance once the interrupt workload consumes an entire core.

Sagi Grimberg suggested that one way to get better dispersal of
interrupts is to use the completion vector argument of the
ib_create_cq() API to assign new CQs to different adapter ingress
queues. Currently, xprtrdma sets this argument to 0 unconditionally,
which leaves all xprtrdma CQs consuming the same small pool of
resources.

Each CQ will still be nailed to one completion vector. This won't help
a "single mount point" workload, but when multiple mount points are in
play, the RDMA provider will see to it that adapter interrupts are
better spread over available resources.

We also take a little trouble to stay off vector 0, which is used
by many other kernel RDMA consumers such as IPoIB.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 45 ++++++++++++++++++++++++++++++++++++++++---
1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 9105524..dc4c8e3 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -49,6 +49,8 @@

#include <linux/interrupt.h>
#include <linux/slab.h>
+#include <linux/random.h>
+
#include <asm/bitops.h>

#include "xprt_rdma.h"
@@ -666,6 +668,42 @@ rpcrdma_ia_close(struct rpcrdma_ia *ia)
}

/*
+ * Select a provider completion vector to assign a CQ to.
+ *
+ * This is an attempt to spread CQs across available CPUs. The counter
+ * is shared between all adapters on a system. Multi-adapter systems
+ * are rare, and this is still better for them than leaving all CQs on
+ * one completion vector.
+ *
+ * We could put the send and receive CQs for the same transport on
+ * different vectors. However, this risks assigning them to cores on
+ * different sockets in larger systems, which could have disastrous
+ * performance effects due to NUMA.
+ */
+static int
+rpcrdma_cq_comp_vec(struct rpcrdma_ia *ia)
+{
+ int num_comp_vectors = ia->ri_id->device->num_comp_vectors;
+ int vector = 0;
+
+ if (num_comp_vectors > 1) {
+ static DEFINE_SPINLOCK(rpcrdma_cv_lock);
+ static unsigned int rpcrdma_cv_counter;
+
+ spin_lock(&rpcrdma_cv_lock);
+ vector = rpcrdma_cv_counter++ % num_comp_vectors;
+ /* Skip 0, as it is commonly used by other RDMA consumers */
+ if (vector == 0)
+ vector = rpcrdma_cv_counter++ % num_comp_vectors;
+ spin_unlock(&rpcrdma_cv_lock);
+ }
+
+ dprintk("RPC: %s: adapter has %d vectors, using vector %d\n",
+ __func__, num_comp_vectors, vector);
+ return vector;
+}
+
+/*
* Create unconnected endpoint.
*/
int
@@ -674,7 +712,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
{
struct ib_device_attr devattr;
struct ib_cq *sendcq, *recvcq;
- int rc, err;
+ int rc, err, comp_vec;

rc = ib_query_device(ia->ri_id->device, &devattr);
if (rc) {
@@ -759,9 +797,10 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);

+ comp_vec = rpcrdma_cq_comp_vec(ia);
sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
rpcrdma_cq_async_error_upcall, ep,
- ep->rep_attr.cap.max_send_wr + 1, 0);
+ ep->rep_attr.cap.max_send_wr + 1, comp_vec);
if (IS_ERR(sendcq)) {
rc = PTR_ERR(sendcq);
dprintk("RPC: %s: failed to create send CQ: %i\n",
@@ -778,7 +817,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,

recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
rpcrdma_cq_async_error_upcall, ep,
- ep->rep_attr.cap.max_recv_wr + 1, 0);
+ ep->rep_attr.cap.max_recv_wr + 1, comp_vec);
if (IS_ERR(recvcq)) {
rc = PTR_ERR(recvcq);
dprintk("RPC: %s: failed to create recv CQ: %i\n",

--
Chuck Lever
2014-10-16 19:39:11 UTC
Occasionally mountstats reports a negative retransmission rate.
Ensure that two RPCs completing concurrently don't confuse the sums
in the transport's op_metrics array.

Since pNFS filelayout can invoke rpc_count_iostats() on another
transport from xprt_release(), we can't rely on simply holding the
transport_lock in xprt_release(). There's nothing for it but hard
serialization. One spin lock per RPC operation should make this as
painless as it can be.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
include/linux/sunrpc/metrics.h | 3 +++
net/sunrpc/stats.c | 21 ++++++++++++++++-----
2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
index 1565bbe..eecb5a7 100644
--- a/include/linux/sunrpc/metrics.h
+++ b/include/linux/sunrpc/metrics.h
@@ -27,10 +27,13 @@

#include <linux/seq_file.h>
#include <linux/ktime.h>
+#include <linux/spinlock.h>

#define RPC_IOSTATS_VERS "1.0"

struct rpc_iostats {
+ spinlock_t om_lock;
+
/*
* These counters give an idea about how many request
* transmissions are required, on average, to complete that
diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
index 5453049..9711a15 100644
--- a/net/sunrpc/stats.c
+++ b/net/sunrpc/stats.c
@@ -116,7 +116,15 @@ EXPORT_SYMBOL_GPL(svc_seq_show);
*/
struct rpc_iostats *rpc_alloc_iostats(struct rpc_clnt *clnt)
{
- return kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
+ struct rpc_iostats *stats;
+ int i;
+
+ stats = kcalloc(clnt->cl_maxproc, sizeof(*stats), GFP_KERNEL);
+ if (stats) {
+ for (i = 0; i < clnt->cl_maxproc; i++)
+ spin_lock_init(&stats[i].om_lock);
+ }
+ return stats;
}
EXPORT_SYMBOL_GPL(rpc_alloc_iostats);

@@ -135,20 +143,21 @@ EXPORT_SYMBOL_GPL(rpc_free_iostats);
* rpc_count_iostats - tally up per-task stats
* @task: completed rpc_task
* @stats: array of stat structures
- *
- * Relies on the caller for serialization.
*/
void rpc_count_iostats(const struct rpc_task *task, struct rpc_iostats *stats)
{
struct rpc_rqst *req = task->tk_rqstp;
struct rpc_iostats *op_metrics;
- ktime_t delta;
+ ktime_t delta, now;

if (!stats || !req)
return;

+ now = ktime_get();
op_metrics = &stats[task->tk_msg.rpc_proc->p_statidx];

+ spin_lock(&op_metrics->om_lock);
+
op_metrics->om_ops++;
op_metrics->om_ntrans += req->rq_ntrans;
op_metrics->om_timeouts += task->tk_timeouts;
@@ -161,8 +170,10 @@ void rpc_count_iostats(const struct rpc_task *task, struct rpc_iostats *stats)

op_metrics->om_rtt = ktime_add(op_metrics->om_rtt, req->rq_rtt);

- delta = ktime_sub(ktime_get(), task->tk_start);
+ delta = ktime_sub(now, task->tk_start);
op_metrics->om_execute = ktime_add(op_metrics->om_execute, delta);
+
+ spin_unlock(&op_metrics->om_lock);
}
EXPORT_SYMBOL_GPL(rpc_count_iostats);


--
Chuck Lever
2014-10-16 19:39:19 UTC
An async error upcall is a hard error, and should be reported in
the system log.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 36 ++++++++++++++++++++++++++++++++----
1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index dc4c8e3..73079d5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -108,6 +108,32 @@ rpcrdma_run_tasklet(unsigned long data)

static DECLARE_TASKLET(rpcrdma_tasklet_g, rpcrdma_run_tasklet, 0UL);

+static const char * const async_event[] = {
+ "CQ error",
+ "QP fatal error",
+ "QP request error",
+ "QP access error",
+ "communication established",
+ "send queue drained",
+ "path migration successful",
+ "path mig error",
+ "device fatal error",
+ "port active",
+ "port error",
+ "LID change",
+ "P_key change",
+ "SM change",
+ "SRQ error",
+ "SRQ limit reached",
+ "last WQE reached",
+ "client reregister",
+ "GID change",
+};
+
+#define ASYNC_MSG(status) \
+ ((status) < ARRAY_SIZE(async_event) ? \
+ async_event[(status)] : "unknown async error")
+
static void
rpcrdma_schedule_tasklet(struct list_head *sched_list)
{
@@ -124,8 +150,9 @@ rpcrdma_qp_async_error_upcall(struct ib_event *event, void *context)
{
struct rpcrdma_ep *ep = context;

- dprintk("RPC: %s: QP error %X on device %s ep %p\n",
- __func__, event->event, event->device->name, context);
+ pr_err("RPC: %s: %s on device %s ep %p\n",
+ __func__, ASYNC_MSG(event->event),
+ event->device->name, context);
if (ep->rep_connected == 1) {
ep->rep_connected = -EIO;
ep->rep_func(ep);
@@ -138,8 +165,9 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
{
struct rpcrdma_ep *ep = context;

- dprintk("RPC: %s: CQ error %X on device %s ep %p\n",
- __func__, event->event, event->device->name, context);
+ pr_err("RPC: %s: %s on device %s ep %p\n",
+ __func__, ASYNC_MSG(event->event),
+ event->device->name, context);
if (ep->rep_connected == 1) {
ep->rep_connected = -EIO;
ep->rep_func(ep);

--
Chuck Lever
2014-10-16 19:39:27 UTC
The Linux NFS/RDMA server was rejecting NFSv3 WRITE requests when pad
optimization was enabled. That bug is now fixed.

We can enable pad optimization, which helps performance and is now
supported by both Linux and Solaris servers.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/transport.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 61be0a0..09c6443 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -73,7 +73,7 @@ static unsigned int xprt_rdma_max_inline_read = RPCRDMA_DEF_INLINE;
static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
static unsigned int xprt_rdma_inline_write_padding;
static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
- int xprt_rdma_pad_optimize = 0;
+ int xprt_rdma_pad_optimize = 1;

#ifdef RPC_DEBUG


--
Chuck Lever
2014-10-16 19:39:35 UTC
The nfs_match_client() function ensures that separate nfs_client
structures are used when mounting the same server with different
transport protocols. For example, if a Linux client mounts the same
server via NFSv3 with some UDP mounts and some TCP mounts, it will
use only two transports and two nfs_client structures: one shared
with all UDP mounts, and one shared with all TCP mounts.

When a uniform client string is in use (NFSv4.1, or NFSv4.0 with the
"migration" mount option), nfs_match_client() will ensure an
nfs_client structure and separate transport is created for mounts
with unique "proto=" settings (one for TCP and one for RDMA,
currently).

But EXCHANGE_ID sends exactly the same nfs_client_id4 string on both
transports. The server then believes that the client is trunking
over disparate transports, when it clearly is not. The open and lock
state that will appear on each transport is disjoint.

Now that NFSv4.1 over RDMA is supported, a user can mount the same
server with NFSv4.1 over TCP and RDMA concurrently. The client will
send an EXCHANGE_ID with the same client ID on both transports, and
it will also send a CREATE_SESSION on both.

To ensure the Linux client represents itself correctly to servers,
add the transport protocol name to the uniform client string. Each
transport instance should get its own client ID (and session) to
prevent trunking.

This doesn't appear to be a problem for NFSv4.0 without migration.
It also wasn't a problem for NFSv4.1 when TCP was the only available
transport.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4proc.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 6ca0c8e..a1243e7 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -4929,16 +4929,23 @@ nfs4_init_uniform_client_string(const struct nfs_client *clp,
char *buf, size_t len)
{
const char *nodename = clp->cl_rpcclient->cl_nodename;
+ unsigned int result;

+ rcu_read_lock();
if (nfs4_client_id_uniquifier[0] != '\0')
- return scnprintf(buf, len, "Linux NFSv%u.%u %s/%s",
+ result = scnprintf(buf, len, "Linux NFSv%u.%u %s/%s %s",
clp->rpc_ops->version,
clp->cl_minorversion,
nfs4_client_id_uniquifier,
- nodename);
- return scnprintf(buf, len, "Linux NFSv%u.%u %s",
+ nodename, rpc_peeraddr2str(clp->cl_rpcclient,
+ RPC_DISPLAY_PROTO));
+ else
+ result = scnprintf(buf, len, "Linux NFSv%u.%u %s %s",
clp->rpc_ops->version, clp->cl_minorversion,
- nodename);
+ nodename, rpc_peeraddr2str(clp->cl_rpcclient,
+ RPC_DISPLAY_PROTO));
+ rcu_read_unlock();
+ return result;
}

/*

--
Chuck Lever
2014-10-16 19:39:44 UTC
nfs4_init_callback() is never invoked for NFS versions other than 4.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4client.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index ffdb28d..5f4b818 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -241,28 +241,25 @@ void nfs4_free_client(struct nfs_client *clp)
*/
static int nfs4_init_callback(struct nfs_client *clp)
{
+ struct rpc_xprt *xprt;
int error;

- if (clp->rpc_ops->version == 4) {
- struct rpc_xprt *xprt;
+ xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);

- xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
-
- if (nfs4_has_session(clp)) {
- error = xprt_setup_backchannel(xprt,
- NFS41_BC_MIN_CALLBACKS);
- if (error < 0)
- return error;
- }
-
- error = nfs_callback_up(clp->cl_mvops->minor_version, xprt);
- if (error < 0) {
- dprintk("%s: failed to start callback. Error = %d\n",
- __func__, error);
+ if (nfs4_has_session(clp)) {
+ error = xprt_setup_backchannel(xprt, NFS41_BC_MIN_CALLBACKS);
+ if (error < 0)
return error;
- }
- __set_bit(NFS_CS_CALLBACK, &clp->cl_res_state);
}
+
+ error = nfs_callback_up(clp->cl_mvops->minor_version, xprt);
+ if (error < 0) {
+ dprintk("%s: failed to start callback. Error = %d\n",
+ __func__, error);
+ return error;
+ }
+ __set_bit(NFS_CS_CALLBACK, &clp->cl_res_state);
+
return 0;
}


--
Chuck Lever
2014-10-16 19:39:52 UTC
Allow upper layers to determine if a particular rpc_clnt is prepared
to provide a backchannel RPC service.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
include/linux/sunrpc/clnt.h | 1 +
include/linux/sunrpc/xprt.h | 1 +
net/sunrpc/clnt.c | 24 ++++++++++++++++++++++++
net/sunrpc/xprtsock.c | 3 +++
4 files changed, 29 insertions(+)

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 70736b9..644c751 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -171,6 +171,7 @@ int rpc_protocol(struct rpc_clnt *);
struct net * rpc_net_ns(struct rpc_clnt *);
size_t rpc_max_payload(struct rpc_clnt *);
unsigned long rpc_get_timeout(struct rpc_clnt *clnt);
+bool rpc_xprt_is_bidirectional(struct rpc_clnt *);
void rpc_force_rebind(struct rpc_clnt *);
size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 632685c..4dea441 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -218,6 +218,7 @@ struct rpc_xprt {
* items */
struct list_head bc_pa_list; /* List of preallocated
* backchannel rpc_rqst's */
+ bool bc_supported;
#endif /* CONFIG_SUNRPC_BACKCHANNEL */
struct list_head recv;

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 5e817d6..1793341 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1347,6 +1347,30 @@ unsigned long rpc_get_timeout(struct rpc_clnt *clnt)
EXPORT_SYMBOL_GPL(rpc_get_timeout);

/**
+ * rpc_xprt_is_bidirectional
+ * @clnt: RPC clnt to query
+ *
+ * Returns true if underlying transport supports backchannel service.
+ */
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+bool rpc_xprt_is_bidirectional(struct rpc_clnt *clnt)
+{
+ bool ret;
+
+ rcu_read_lock();
+ ret = rcu_dereference(clnt->cl_xprt)->bc_supported;
+ rcu_read_unlock();
+ return ret;
+}
+#else
+bool rpc_xprt_is_bidirectional(struct rpc_clnt *clnt)
+{
+ return false;
+}
+#endif
+EXPORT_SYMBOL_GPL(rpc_xprt_is_bidirectional);
+
+/**
* rpc_force_rebind - force transport to check that remote port is unchanged
* @clnt: client to rebind
*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index b4aca48..e2e15a9 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2864,6 +2864,9 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)

xprt->ops = &xs_tcp_ops;
xprt->timeout = &xs_tcp_default_timeout;
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+ xprt->bc_supported = true;
+#endif

switch (addr->sa_family) {
case AF_INET:

--
Chuck Lever
2014-10-16 19:40:00 UTC
So far, TCP is the only transport that supports bi-directional RPC.

When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.

This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/client.c | 1 +
fs/nfs/nfs4client.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/nfs_fs_sb.h | 2 ++
3 files changed, 57 insertions(+)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6a4f366..19f49bf 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -78,6 +78,7 @@ const struct rpc_program nfs_program = {
.stats = &nfs_rpcstat,
.pipe_dir_name = NFS_PIPE_DIRNAME,
};
+EXPORT_SYMBOL_GPL(nfs_program);

struct rpc_stat nfs_rpcstat = {
.program = &nfs_program
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 5f4b818..b1cc35e 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -213,6 +213,8 @@ static void nfs4_destroy_callback(struct nfs_client *clp)
{
if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
+ if (clp->cl_bc_rpcclient)
+ rpc_shutdown_client(clp->cl_bc_rpcclient);
}

static void nfs4_shutdown_client(struct nfs_client *clp)
@@ -291,6 +293,53 @@ int nfs40_init_client(struct nfs_client *clp)

#if defined(CONFIG_NFS_V4_1)

+/*
+ * Create a separate rpc_clnt using TCP that can provide a
+ * backchannel service.
+ */
+static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
+{
+ struct sockaddr_storage address;
+ struct sockaddr *sap = (struct sockaddr *)&address;
+ struct rpc_create_args args = {
+ .net = clp->cl_net,
+ .protocol = XPRT_TRANSPORT_TCP,
+ .address = sap,
+ .addrsize = clp->cl_addrlen,
+ .servername = clp->cl_hostname,
+ .program = &nfs_program,
+ .version = clp->rpc_ops->version,
+ .flags = (RPC_CLNT_CREATE_DISCRTRY |
+ RPC_CLNT_CREATE_NOPING),
+ };
+ struct rpc_clnt *clnt;
+ struct rpc_cred *cred;
+
+ if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
+ return 0;
+
+ if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
+ args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
+ memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
+ rpc_set_port(sap, NFS_PORT);
+ cred = nfs4_get_clid_cred(clp);
+ if (cred) {
+ args.authflavor = cred->cr_auth->au_flavor;
+ put_rpccred(cred);
+ } else
+ args.authflavor = RPC_AUTH_UNIX;
+
+ clnt = rpc_create(&args);
+ if (IS_ERR(clnt)) {
+ dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
+ __func__, PTR_ERR(clnt));
+ return PTR_ERR(clnt);
+ }
+
+ clp->cl_bc_rpcclient = clnt;
+ return 0;
+}
+
/**
* nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
* @clp - nfs_client to initialize
@@ -300,6 +349,11 @@ int nfs40_init_client(struct nfs_client *clp)
int nfs41_init_client(struct nfs_client *clp)
{
struct nfs4_session *session = NULL;
+ int ret;
+
+ ret = nfs41_create_sidecar_rpc_client(clp);
+ if (ret)
+ return ret;

/*
* Create the session and mark it expired.
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 922be2e..159d703 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -87,6 +87,8 @@ struct nfs_client {

/* The sequence id to use for the next CREATE_SESSION */
u32 cl_seqid;
+ /* The optional sidecar backchannel transport */
+ struct rpc_clnt *cl_bc_rpcclient;
/* The flags used for obtaining the clientid during EXCHANGE_ID */
u32 cl_exchange_flags;
struct nfs4_session *cl_session; /* shared session */

--
Anna Schumaker
2014-10-20 17:33:22 UTC
Post by Chuck Lever
So far, TCP is the only transport that supports bi-directional RPC.
When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.
This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.
I thought NFS v4.0 also uses a separate connection for the backchannel? Can any of that code be reused here, rather than creating new sidecar structures?

Anna
--
Chuck Lever
2014-10-20 18:09:42 UTC
Post by Chuck Lever
So far, TCP is the only transport that supports bi-directional RPC.

When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.

This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.

I thought NFS v4.0 also uses a separate connection for the backchannel?

For NFSv4.0, the server opens a connection to the client. For
NFSv4.1, the client opens the side car connection to the
server, and then performs BIND_CONN_TO_SESSION, an operation
not in NFSv4.0. The processes are not terribly similar.
Can any of that code be reused here, rather than creating new sidecar
structures?

Since it's the client opening a connection in this case,
I don't see any obvious common code paths that I haven't
already re-used. I'm open to suggestions.
Anna
=20
Post by Chuck Lever
=20
---
fs/nfs/client.c | 1 +
fs/nfs/nfs4client.c | 54 +++++++++++++++++++++++++++++++++++=
++++++++++
Post by Chuck Lever
include/linux/nfs_fs_sb.h | 2 ++
3 files changed, 57 insertions(+)
=20
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6a4f366..19f49bf 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -78,6 +78,7 @@ const struct rpc_program nfs_program =3D {
.stats =3D &nfs_rpcstat,
.pipe_dir_name =3D NFS_PIPE_DIRNAME,
};
+EXPORT_SYMBOL_GPL(nfs_program);
=20
struct rpc_stat nfs_rpcstat =3D {
.program =3D &nfs_program
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 5f4b818..b1cc35e 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -213,6 +213,8 @@ static void nfs4_destroy_callback(struct nfs_cli=
ent *clp)
Post by Chuck Lever
{
if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
+ if (clp->cl_bc_rpcclient)
+ rpc_shutdown_client(clp->cl_bc_rpcclient);
}
=20
static void nfs4_shutdown_client(struct nfs_client *clp)
@@ -291,6 +293,53 @@ int nfs40_init_client(struct nfs_client *clp)
 
 #if defined(CONFIG_NFS_V4_1)
 
+/*
+ * Create a separate rpc_clnt using TCP that can provide a
+ * backchannel service.
+ */
+static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
+{
+	struct sockaddr_storage address;
+	struct sockaddr *sap = (struct sockaddr *)&address;
+	struct rpc_create_args args = {
+		.net		= clp->cl_net,
+		.protocol	= XPRT_TRANSPORT_TCP,
+		.address	= sap,
+		.addrsize	= clp->cl_addrlen,
+		.servername	= clp->cl_hostname,
+		.program	= &nfs_program,
+		.version	= clp->rpc_ops->version,
+		.flags		= (RPC_CLNT_CREATE_DISCRTRY |
+				   RPC_CLNT_CREATE_NOPING),
+	};
+	struct rpc_clnt *clnt;
+	struct rpc_cred *cred;
+
+	if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
+		return 0;
+
+	if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
+		args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
+	memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
+	rpc_set_port(sap, NFS_PORT);
+	cred = nfs4_get_clid_cred(clp);
+	if (cred) {
+		args.authflavor = cred->cr_auth->au_flavor;
+		put_rpccred(cred);
+	} else
+		args.authflavor = RPC_AUTH_UNIX;
+
+	clnt = rpc_create(&args);
+	if (IS_ERR(clnt)) {
+		dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
+			__func__, PTR_ERR(clnt));
+		return PTR_ERR(clnt);
+	}
+
+	clp->cl_bc_rpcclient = clnt;
+	return 0;
+}
+
 /**
  * nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
@@ -300,6 +349,11 @@ int nfs40_init_client(struct nfs_client *clp)
 int nfs41_init_client(struct nfs_client *clp)
 {
 	struct nfs4_session *session = NULL;
+	int ret;
+
+	ret = nfs41_create_sidecar_rpc_client(clp);
+	if (ret)
+		return ret;
 
 	/*
 	 * Create the session and mark it expired.
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 922be2e..159d703 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -87,6 +87,8 @@ struct nfs_client {
 
 	/* The sequence id to use for the next CREATE_SESSION */
 	u32 cl_seqid;
+	/* The optional sidecar backchannel transport */
+	struct rpc_clnt *cl_bc_rpcclient;
 	/* The flags used for obtaining the clientid during EXCHANGE_ID */
 	u32 cl_exchange_flags;
 	struct nfs4_session *cl_session;	/* shared session */
 
--
Chuck Lever
Trond Myklebust
2014-10-20 19:40:11 UTC
Permalink
So far, TCP is the only transport that supports bi-directional RPC.
When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.
This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.
I thought NFS v4.0 also uses a separate connection for the backchannel?
For NFSv4.0, the server opens a connection to the client. For
NFSv4.1, the client opens the side car connection to the
server, and then performs BIND_CONN_TO_SESSION, an operation
not in NFSv4.0. The processes are not terribly similar.
Can any of that code be reused here, rather than creating new sidecar structures?
Since it's the client opening a connection in this case,
I don't see any obvious common code paths that I haven't
already re-used. I'm open to suggestions.
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData
Chuck Lever
2014-10-20 20:11:25 UTC
Permalink
Hi Trond-
Post by Trond Myklebust
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
There's no benefit to it. With a side car, the server requires
few or no changes. There are no CB operations that benefit
from using RDMA. It's very quick to implement, re-using most of
the client backchannel implementation that already exists.

I've discussed this with an author of RFC 5667 [cc'd], and also
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.

Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK'd it.

But I don't see an explicit recommendation in section 5.1. Which
text are you referring to?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Trond Myklebust
2014-10-20 22:31:17 UTC
Permalink
Post by Chuck Lever
Hi Trond-
Post by Trond Myklebust
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
There=E2=80=99s no benefit to it. With a side car, the server require=
s
Post by Chuck Lever
few or no changes. There are no CB operations that benefit
from using RDMA. It=E2=80=99s very quick to implement, re-using most =
of
Post by Chuck Lever
the client backchannel implementation that already exists.
I=E2=80=99ve discussed this with an author of RFC 5667 [cc=E2=80=99d]=
, and also
Post by Chuck Lever
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.
Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK=E2=80=99=
d it.
Post by Chuck Lever
But I don=E2=80=99t see an explicit recommendation in section 5.1. Wh=
ich
Post by Chuck Lever
text are you referring to?
The very first paragraph argues that because callback messages don't
carry bulk data, there is no problem with using RPC/RDMA and, in
particular, with using RDMA_MSG provided that the buffer sizes are
negotiated correctly.

So the questions are:

1) Where is the discussion of the merits for and against adding
bi-directional support to the xprtrdma layer in Linux? What is the
showstopper preventing implementation of a design based around
RFC5667?

2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)

All I've found so far on googling these questions is a 5 1/2 year old
email exchange between Tom Tucker and Ricardo where the conclusion
appears to be that we can, in time, implement both designs. However
there is no explanation of why we would want to do so.
http://comments.gmane.org/gmane.linux.nfs/22927

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData
Chuck Lever
2014-10-21 01:06:19 UTC
Permalink
Post by Trond Myklebust
Post by Chuck Lever
Hi Trond-
Post by Trond Myklebust
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
There's no benefit to it. With a side car, the server requires
few or no changes. There are no CB operations that benefit
from using RDMA. It's very quick to implement, re-using most of
the client backchannel implementation that already exists.
I've discussed this with an author of RFC 5667 [cc'd], and also
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.
Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK'd it.
But I don't see an explicit recommendation in section 5.1. Which
text are you referring to?
The very first paragraph argues that because callback messages don't
carry bulk data, there is no problem with using RPC/RDMA and, in
particular, with using RDMA_MSG provided that the buffer sizes are
negotiated correctly.
The opening paragraph is advice that applies to all forms
of NFSv4 callback, including NFSv4.0, which uses a separate
transport initiated from the NFS server. Specific advice about
NFSv4.1 bi-directional RPC is left to the next two paragraphs,
but they suggest there be dragons. I rather think this is a
warning not to "go there."
Post by Trond Myklebust
1) Where is the discussion of the merits for and against adding
bi-directional support to the xprtrdma layer in Linux? What is the
showstopper preventing implementation of a design based around
RFC5667?
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.

Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.

Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").

Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
Post by Trond Myklebust
All I've found so far on googling these questions is a 5 1/2 year old
email exchange between Tom Tucker and Ricardo where the conclusion
appears to be that we can, in time, implement both designs.
You and I spoke about this on Feb 13, 2014 during pub night.
At the time you stated that a side-car was the only spec-
compliant way to approach this. I said I would go forward
with the idea in Linux, and you did not object.
Post by Trond Myklebust
However
there is no explanation of why we would want to do so.
http://comments.gmane.org/gmane.linux.nfs/22927
I've implemented exactly what Ricardo proposed in this
Post by Trond Myklebust
Post by Chuck Lever
The thinking is that NFSRDMA could initially use a TCP callback channel.
We'll implement BIND_CONN_TO_SESSION so that the backchannel does not
need to be tied to the forechannel connection. This should address the
case where you have NFSRDMA for the forechannel and TCP for the
backchannel. BIND_CONN_TO_SESSION is also required to reestablish
dropped connections effectively (to avoid losing the reply cache).
Given what they're hoping to achieve, I'm fine with
doing a simple implementation of sessions first, then progressively
refining it.
What's the next step?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Trond Myklebust
2014-10-21 07:45:23 UTC
Permalink
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
Hi Trond-
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
There's no benefit to it. With a side car, the server requires
few or no changes. There are no CB operations that benefit
from using RDMA. It's very quick to implement, re-using most of
the client backchannel implementation that already exists.
I've discussed this with an author of RFC 5667 [cc'd], and also
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.
Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK'd it.
Post by Chuck Lever
But I don't see an explicit recommendation in section 5.1. Which
text are you referring to?
The very first paragraph argues that because callback messages don't
carry bulk data, there is no problem with using RPC/RDMA and, in
particular, with using RDMA_MSG provided that the buffer sizes are
negotiated correctly.
The opening paragraph is advice that applies to all forms
of NFSv4 callback, including NFSv4.0, which uses a separate
transport initiated from the NFS server. Specific advice about
NFSv4.1 bi-directional RPC is left to the next two paragraphs,
but they suggest there be dragons. I rather think this is a
warning not to "go there."
All I see is a warning not to rely on XIDs to distinguish between
backchannel and forward channel RPC requests. How is that particular
to RDMA?
Post by Chuck Lever
Post by Trond Myklebust
1) Where is the discussion of the merits for and against adding
bi-directional support to the xprtrdma layer in Linux? What is the
showstopper preventing implementation of a design based around
RFC5667?
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel? I've never been presented with a complete discussion
of this alternative either from you or from anybody else.
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Post by Chuck Lever
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself. If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData
Chuck Lever
2014-10-21 17:11:26 UTC
Permalink
Post by Trond Myklebust
Post by Chuck Lever
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.

The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.

Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.

The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Post by Trond Myklebust
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.

Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.

That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.

And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).

The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?

There's also been no discussion of issues with adding a
very restricted amount of transport trunking. Can you
elaborate on the problems this could introduce?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Trond Myklebust
2014-10-22 08:39:40 UTC
Permalink
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.
The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.
Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?
Post by Chuck Lever
The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Post by Chuck Lever
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".
Post by Chuck Lever
Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.
That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.
That's an assertion that assumes:
- we actually want to implement more transports aside from RDMA.
- implementing bi-directional transports in the RPC layer is non-simple

Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.
Post by Chuck Lever
And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).
The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?
There's also been no discussion of issues with adding a
very restricted amount of transport trunking. Can you
elaborate on the problems this could introduce?
I haven't looked into those problems and I'm not the one asking anyone
to implement trunking.
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData
Chuck Lever
2014-10-22 17:20:03 UTC
Permalink
Post by Trond Myklebust
Post by Chuck Lever
Post by Trond Myklebust
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.
The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.
Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?
The RPC header and RPC/RDMA header are separate entities, but
together can span multiple cache lines if the server has returned a
chunk list containing multiple entries.

For example, RDMA_NOMSG would send the RPC/RDMA header
via RDMA SEND with a chunk list that represents the RPC and NFS
payload. That list could make the header larger than 32 bytes.

I expect that any callback that involves more than 1024 bytes of
RPC payload will need to use RDMA_NOMSG. A long device
info list might fit that category?
Post by Trond Myklebust
Post by Chuck Lever
The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".
The server side, then, is non-invasive. The client side makes minor
changes to state management.
Post by Trond Myklebust
Post by Chuck Lever
Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.
That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.
That's an assertion that assumes:
- we actually want to implement more transports aside from RDMA
So you no longer consider RPC/SCTP a possibility?
Post by Trond Myklebust
- implementing bi-directional transports in the RPC layer is non-simple

I don't care to generalize about that. In the RPC/RDMA case, there
are some complications that make it non-simple, but not impossible.
So we have an example of a non-simple case, IMO.
Post by Trond Myklebust
Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.
Post by Chuck Lever
And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).
The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?
Based on your objections, it appears that "no upper layer
impact" is a hard design requirement. I will take this as a
NACK for the side-car approach.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Trond Myklebust
2014-10-22 20:53:21 UTC
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
Post by Trond Myklebust
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.
The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.
Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?
The RPC header and RPC/RDMA header are separate entities, but
together can span multiple cache lines if the server has returned a
chunk list containing multiple entries.
For example, RDMA_NOMSG would send the RPC/RDMA header
via RDMA SEND with a chunk list that represents the RPC and NFS
payload. That list could make the header larger than 32 bytes.
I expect that any callback that involves more than 1024 bytes of
RPC payload will need to use RDMA_NOMSG. A long device
info list might fit that category?
Right, but are there any callbacks that would do that? AFAICS, most of
them are CB_SEQUENCE+(PUT_FH+CB_do_some_recall_operation_on_this_file
| some single CB_operation)

The point is that we can set finite limits on the size of callbacks in
the CREATE_SESSION. As long as those limits are reasonable (and 1K
does seem more than reasonable for existing use cases) then why
shouldn't we be able to expect the server to use RDMA_MSG?
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Post by Chuck Lever
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Post by Chuck Lever
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
Post by Trond Myklebust
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
Post by Chuck Lever
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Post by Trond Myklebust
Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".
Post by Chuck Lever
The server side, then, is non-invasive. The client side makes minor
changes to state management.
Post by Trond Myklebust
Post by Chuck Lever
Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.
That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.
- we actually want to implement more transports aside from RDMA
So you no longer consider RPC/SCTP a possibility?
I'd still like to consider it, but the whole point would be to _avoid_
doing trunking in the NFS layer. SCTP does trunking/multi-pathing at
the transport level, meaning that we don't have to deal with tracking
connections, state, replaying messages, etc.
Doing bi-directional RPC with SCTP is not an issue, since the
transport is fully symmetric.
Post by Chuck Lever
Post by Trond Myklebust
- implementing bi-directional transports in the RPC layer is non-simple
Post by Chuck Lever
I don't care to generalize about that. In the RPC/RDMA case, there
are some complications that make it non-simple, but not impossible.
So we have an example of a non-simple case, IMO.
Post by Trond Myklebust
Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.
Post by Chuck Lever
And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).
The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?
Based on your objections, it appears that "no upper layer
impact" is a hard design requirement. I will take this as a
NACK for the side-car approach.
There is not a hard NACK yet, but I am asking for stronger
justification. I do _not_ want to find myself in a situation 2 or 3
years down the road where I have to argue against someone telling me
that we additionally have to implement callbacks over IB/RDMA because
the TCP sidecar is an incomplete solution. We should do either one or
the other, but not both...

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust-7I+n7zu2hftEKMMhf/***@public.gmane.org
Chuck Lever
2014-10-22 22:38:42 UTC
Post by Trond Myklebust
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.
The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.
Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?
The RPC header and RPC/RDMA header are separate entities, but
together can span multiple cache lines if the server has returned a
chunk list containing multiple entries.
For example, RDMA_NOMSG would send the RPC/RDMA header
via RDMA SEND with a chunk list that represents the RPC and NFS
payload. That list could make the header larger than 32 bytes.
I expect that any callback that involves more than 1024 bytes of
RPC payload will need to use RDMA_NOMSG. A long device
info list might fit that category?
Right, but are there any callbacks that would do that? AFAICS, most of
them are CB_SEQUENCE+(PUT_FH+CB_do_some_recall_operation_on_this_file
| some single CB_operation)
That is a question only a pNFS layout developer can answer.

Allowing larger CB operations might be important. I thought
I heard Matt list a couple of examples that might move bulk
data via CB, but probably none that have implementations
currently.

I'm not familiar with block or flex-file, so I can't make
any kind of guess about those.
The point is that we can set finite limits on the size of callbacks in
the CREATE_SESSION. As long as those limits are reasonable (and 1K
does seem more than reasonable for existing use cases) then why
shouldn't we be able to expect the server to use RDMA_MSG?
The spec allows both RDMA_MSG and RDMA_NOMSG for CB
RPC. That provides some implementation flexibility, but
it also means either end can use either MSG type. An
interoperable CB service would have to support all
scenarios.

RDMA_NOMSG can support small or large payloads. RDMA_MSG
can support only a payload that fits in the receiver's
pre-posted buffer (and that includes the RPC and NFS
headers) because NFS CB RPCs are not allowed to use
read or write chunks.

Since there are no implementations, currently, I was
hoping everyone might agree to stick with using only
RDMA_NOMSG for NFSv4.1 CB RPCs on RPC/RDMA. That would
mean there was a high limit on CB RPC payload size, and
RDMA READ would be used to move CB RPC calls and replies
in all cases.

NFS uses NOMSG so infrequently that it would be a good
way to limit churn in the transport's hot paths.
Particularly NFS READ and WRITE are required to use
RDMA_MSG with their payload encoded in chunks. If the
incoming message is MSG, then clearly it can't be a
reverse RPC, and there's no need to go looking for it
(as long as we all agree on the "CBs use only NOMSG"
convention).

The receiver can tell just by looking at the RPC/RDMA
header that extra processing won't be needed in the
common case.
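If the "CBs use only NOMSG" convention were adopted, the hot-path test could be a single comparison on the message type already present in the RPC/RDMA header. A minimal sketch, assuming the rdma_proc message-type values from RFC 5666; `rpcrdma_is_reverse_candidate()` is a hypothetical helper name, not existing kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Message types carried in the RPC/RDMA header (rdma_proc, RFC 5666) */
enum rdma_proc {
	RDMA_MSG   = 0,	/* RPC message follows the RPC/RDMA header inline */
	RDMA_NOMSG = 1,	/* RPC message is transferred entirely via chunks */
	RDMA_ERROR = 4,	/* transport-level error report */
};

/*
 * Sketch of the proposed dispatch rule: forward-channel NFS READ and
 * WRITE replies always arrive as RDMA_MSG, so only RDMA_NOMSG
 * messages would ever need the extra peek into the RPC header to
 * check for a backchannel call.
 */
int rpcrdma_is_reverse_candidate(uint32_t proc)
{
	return proc == RDMA_NOMSG;
}
```

Under this convention, the common-case (RDMA_MSG) receive path is unchanged.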

Clearly we need a prototype or two to understand these
issues. And probably some WG discussion is warranted.
Post by Chuck Lever
Post by Chuck Lever
The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Would this restrictive form of trunking present a problem?
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".
The server side, then, is non-invasive. The client side makes minor
changes to state management.
Post by Chuck Lever
Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.
That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.
- we actually want to implement more transports aside from RDMA
So you no longer consider RPC/SCTP a possibility?
I'd still like to consider it, but the whole point would be to _avoid_
doing trunking in the NFS layer. SCTP does trunking/multi-pathing at
the transport level, meaning that we don't have to deal with tracking
connections, state, replaying messages, etc.
Doing bi-directional RPC with SCTP is not an issue, since the
transport is fully symmetric.
Post by Chuck Lever
- implementing bi-directional transports in the RPC layer is non-simple
Post by Chuck Lever
I don't care to generalize about that. In the RPC/RDMA case, there
are some complications that make it non-simple, but not impossible.
So we have an example of a non-simple case, IMO.
Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.
Post by Chuck Lever
And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).
The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?
Based on your objections, it appears that "no upper layer
impact" is a hard design requirement. I will take this as a
NACK for the side-car approach.
There is not a hard NACK yet, but I am asking for stronger
justification. I do _not_ want to find myself in a situation 2 or 3
years down the road where I have to argue against someone telling me
that we additionally have to implement callbacks over IB/RDMA because
the TCP sidecar is an incomplete solution. We should do either one or
the other, but not both…
It is impossible to predict the future. However, I'm not
sure there's a problem building both eventually, especially
because there is no spec guidance I'm aware of about which
bi-directional RPC mechanisms MUST be supported for NFS/RDMA.

I might be wrong, but we have enough mechanism in
CREATE_SESSION for a client with bi-directional RPC/RDMA
support to detect a server with no bi-directional RPC/RDMA,
and use side-car only in that case.
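That detection could be a simple check of the csr_flags returned by CREATE_SESSION. A hedged sketch, using the flag values defined in RFC 5661 (section 18.36); `nfs4_wants_sidecar()` is an illustrative name, not proposed kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* csr_flags bits returned by CREATE_SESSION (RFC 5661, section 18.36) */
#define SESSION4_PERSIST	0x001
#define SESSION4_BACK_CHAN	0x002
#define SESSION4_RDMA		0x004

/*
 * Sketch: a client that supports bi-directional RPC/RDMA requests
 * SESSION4_BACK_CHAN at session creation; if the server clears that
 * flag in its reply, the client falls back to binding a TCP side-car
 * connection for the backchannel.
 */
int nfs4_wants_sidecar(uint32_t csr_flags)
{
	return (csr_flags & SESSION4_BACK_CHAN) == 0;
}
```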

If we need more discussion, I can drop the side-car patches
for 3.19 and do some more research.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



J. Bruce Fields
2014-10-23 13:32:29 UTC
Post by Chuck Lever
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Something I'm confused about: is bidirectional RPC/RDMA optional or
mandatory for servers to implement?

Something somewhere has to be mandatory if we want to guarantee a
working backchannel between any two implementations.

--b.
Chuck Lever
2014-10-23 13:55:28 UTC
Post by J. Bruce Fields
Post by Chuck Lever
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Something I'm confused about: is bidirectional RPC/RDMA optional or
mandatory for servers to implement?
IMO bi-directional RPC/RDMA is not required anywhere in RFCs 5666
(the RPC/RDMA spec) or 5667 (the NFS on RPC/RDMA spec).
Post by J. Bruce Fields
Something somewhere has to be mandatory if we want to guarantee a
working backchannel between any two implementations.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



Chuck Lever
2014-10-16 19:40:08 UTC
When NFS_CS_SIDECAR_BACKCHANNEL is set, the client needs to bind
both the forward and the back channel during recovery. Two separate
BC2S operations are needed, with different arguments.

Prepare nfs4_proc_bind_conn_to_session() by creating a complete arg
struct for the BIND_CONN_TO_SESSION XDR encoder.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4proc.c | 6 +++++-
fs/nfs/nfs4xdr.c | 16 +++++++++-------
include/linux/nfs_xdr.h | 6 ++++++
3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index a1243e7..9a8ffb7 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -6578,11 +6578,15 @@ nfs41_same_server_scope(struct nfs41_server_scope *a,
int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred)
{
int status;
+ struct nfs41_bind_conn_to_session_args args = {
+ .client = clp,
+ .dir = NFS4_CDFC4_BACK_OR_BOTH,
+ };
struct nfs41_bind_conn_to_session_res res;
struct rpc_message msg = {
.rpc_proc =
&nfs4_procedures[NFSPROC4_CLNT_BIND_CONN_TO_SESSION],
- .rpc_argp = clp,
+ .rpc_argp = &args,
.rpc_resp = &res,
.rpc_cred = cred,
};
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index e13b59d..6dfdd14 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1733,17 +1733,19 @@ static void encode_secinfo(struct xdr_stream *xdr, const struct qstr *name, stru
#if defined(CONFIG_NFS_V4_1)
/* NFSv4.1 operations */
static void encode_bind_conn_to_session(struct xdr_stream *xdr,
- struct nfs4_session *session,
- struct compound_hdr *hdr)
+ struct nfs41_bind_conn_to_session_args *args,
+ struct compound_hdr *hdr)
{
+ struct nfs4_session *session = args->client->cl_session;
__be32 *p;

encode_op_hdr(xdr, OP_BIND_CONN_TO_SESSION,
decode_bind_conn_to_session_maxsz, hdr);
encode_opaque_fixed(xdr, session->sess_id.data, NFS4_MAX_SESSIONID_LEN);
+
p = xdr_reserve_space(xdr, 8);
- *p++ = cpu_to_be32(NFS4_CDFC4_BACK_OR_BOTH);
- *p = 0; /* use_conn_in_rdma_mode = False */
+ *p++ = cpu_to_be32(args->dir);
+ *p = args->use_conn_in_rdma_mode ? xdr_one : xdr_zero;
}

static void encode_op_map(struct xdr_stream *xdr, struct nfs4_op_map *op_map)
@@ -2766,14 +2768,14 @@ static void nfs4_xdr_enc_fsid_present(struct rpc_rqst *req,
*/
static void nfs4_xdr_enc_bind_conn_to_session(struct rpc_rqst *req,
struct xdr_stream *xdr,
- struct nfs_client *clp)
+ struct nfs41_bind_conn_to_session_args *args)
{
struct compound_hdr hdr = {
- .minorversion = clp->cl_mvops->minor_version,
+ .minorversion = args->client->cl_mvops->minor_version,
};

encode_compound_hdr(xdr, req, &hdr);
- encode_bind_conn_to_session(xdr, clp->cl_session, &hdr);
+ encode_bind_conn_to_session(xdr, args, &hdr);
encode_nops(&hdr);
}

diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 0040629..af17763 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1140,6 +1140,12 @@ struct nfs41_state_protection {
struct nfs4_op_map allow;
};

+struct nfs41_bind_conn_to_session_args {
+ struct nfs_client *client;
+ u32 dir;
+ bool use_conn_in_rdma_mode;
+};
+
#define NFS4_EXCHANGE_ID_LEN (48)
struct nfs41_exchange_id_args {
struct nfs_client *client;

Chuck Lever
2014-10-16 19:40:16 UTC
When recovering from a network partition, a client must identify
both the forward and backchannel it wants bound to a session.

Usually these use the same transport, which can be re-bound to the
session with a single BIND_CONN_TO_SESSION operation.

But with a sidecar backchannel, the fore and back channels use
separate transports, each of which must be bound to the session
with its own BC2S operation.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4client.c | 5 ++++-
fs/nfs/nfs4proc.c | 48 ++++++++++++++++++++++++++++++++++++++----------
2 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index b1cc35e..97cc170 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -246,7 +246,10 @@ static int nfs4_init_callback(struct nfs_client *clp)
struct rpc_xprt *xprt;
int error;

- xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
+ if (clp->cl_bc_rpcclient)
+ xprt = rcu_dereference_raw(clp->cl_bc_rpcclient->cl_xprt);
+ else
+ xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);

if (nfs4_has_session(clp)) {
error = xprt_setup_backchannel(xprt, NFS41_BC_MIN_CALLBACKS);
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 9a8ffb7..2eaf7ec 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -6569,18 +6569,16 @@ nfs41_same_server_scope(struct nfs41_server_scope *a,
return false;
}

-/*
- * nfs4_proc_bind_conn_to_session()
- *
- * The 4.1 client currently uses the same TCP connection for the
- * fore and backchannel.
- */
-int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred)
+static int _nfs4_proc_bind_conn_to_session(struct nfs_client *clp,
+ struct rpc_cred *cred,
+ struct rpc_clnt *clnt,
+ u32 dir_from_client,
+ u32 dir_from_server)
{
int status;
struct nfs41_bind_conn_to_session_args args = {
.client = clp,
- .dir = NFS4_CDFC4_BACK_OR_BOTH,
+ .dir = dir_from_client,
};
struct nfs41_bind_conn_to_session_res res;
struct rpc_message msg = {
@@ -6599,7 +6597,7 @@ int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred
goto out;
}

- status = rpc_call_sync(clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT);
+ status = rpc_call_sync(clnt, &msg, RPC_TASK_TIMEOUT);
trace_nfs4_bind_conn_to_session(clp, status);
if (status == 0) {
if (memcmp(res.session->sess_id.data,
@@ -6608,7 +6606,7 @@ int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred
status = -EIO;
goto out_session;
}
- if (res.dir != NFS4_CDFS4_BOTH) {
+ if (res.dir != dir_from_server) {
dprintk("NFS: %s: Unexpected direction from server\n",
__func__);
status = -EIO;
@@ -6628,6 +6626,36 @@ out:
return status;
}

+/**
+ * nfs4_proc_bind_conn_to_session - (re)bind fore/back channels to session
+ * @clp: per-server state
+ * @cred: credential for managing state
+ *
+ * Returns zero on success, or a negative errno or negative NFS4ERR.
+ */
+int nfs4_proc_bind_conn_to_session(struct nfs_client *clp,
+ struct rpc_cred *cred)
+{
+ int ret;
+
+ if (!clp->cl_bc_rpcclient)
+ return _nfs4_proc_bind_conn_to_session(clp, cred,
+ clp->cl_rpcclient,
+ NFS4_CDFC4_BACK_OR_BOTH,
+ NFS4_CDFS4_BOTH);
+
+ ret = _nfs4_proc_bind_conn_to_session(clp, cred,
+ clp->cl_bc_rpcclient,
+ NFS4_CDFC4_BACK,
+ NFS4_CDFS4_BACK);
+ if (ret)
+ return ret;
+ return _nfs4_proc_bind_conn_to_session(clp, cred,
+ clp->cl_rpcclient,
+ NFS4_CDFC4_FORE,
+ NFS4_CDFS4_FORE);
+}
+
/*
* Minimum set of SP4_MACH_CRED operations from RFC 5661 in the enforce map
* and operations we'd like to see to enable certain features in the allow map

Chuck Lever
2014-10-16 19:40:24 UTC
A CREATE_SESSION operation with SESSION4_BACK_CHAN cleared is sent
to force the server to report SEQ4_STATUS_CB_PATH_DOWN. The client
recovers by setting up a sidecar backchannel connection.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4proc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 2eaf7ec..e0bcc30 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7187,6 +7187,7 @@ static int _nfs4_proc_create_session(struct nfs_client *clp,
struct nfs41_create_session_args args = {
.client = clp,
.cb_program = NFS4_CALLBACK,
+ .flags = SESSION4_PERSIST,
};
struct nfs41_create_session_res res = {
.client = clp,
@@ -7200,7 +7201,8 @@ static int _nfs4_proc_create_session(struct nfs_client *clp,
int status;

nfs4_init_channel_attrs(&args);
- args.flags = (SESSION4_PERSIST | SESSION4_BACK_CHAN);
+ if (!clp->cl_bc_rpcclient)
+ args.flags |= SESSION4_BACK_CHAN;

status = rpc_call_sync(session->clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT);
trace_nfs4_create_session(clp, status);
