Discussion:
[PATCH v1 00/16] NFS/RDMA patches for 3.19
Chuck Lever
2014-10-16 19:38:12 UTC
Permalink
Hi-

Two groups of patches in this series. The first group is fixes
and clean-ups for xprtrdma. The second group adds client support
for NFSv4.1 on RDMA. Looking for review and testing.

Also available in the "nfs-rdma-for-3.19" topic branch at

git://linux-nfs.org/projects/cel/cel-2.6.git

---

Chuck Lever (16):
xprtrdma: Return an errno from rpcrdma_register_external()
xprtrdma: Cap req_cqinit
SUNRPC: Pass callsize and recvsize to buf_alloc as separate arguments
xprtrdma: Re-write rpcrdma_flush_cqs()
xprtrdma: unmap all FMRs during transport disconnect
xprtrdma: spin CQ completion vectors
SUNRPC: serialize iostats updates
xprtrdma: Display async errors
xprtrdma: Enable pad optimization
NFS: Include transport protocol name in UCS client string
NFS: Clean up nfs4_init_callback()
SUNRPC: Add rpc_xprt_is_bidirectional()
NFS: Add sidecar RPC client support
NFS: Set BIND_CONN_TO_SESSION arguments in the proc layer
NFS: Bind side-car connection to session
NFS: Disable SESSION4_BACK_CHAN when a backchannel sidecar is to be used


fs/nfs/client.c | 1
fs/nfs/nfs4client.c | 86 +++++++++++++++++----
fs/nfs/nfs4proc.c | 71 ++++++++++++++---
fs/nfs/nfs4xdr.c | 16 ++--
include/linux/nfs_fs_sb.h | 2
include/linux/nfs_xdr.h | 6 +
include/linux/sunrpc/clnt.h | 1
include/linux/sunrpc/metrics.h | 3 +
include/linux/sunrpc/sched.h | 2
include/linux/sunrpc/xprt.h | 4 +
net/sunrpc/clnt.c | 28 ++++++-
net/sunrpc/sched.c | 6 +
net/sunrpc/stats.c | 21 ++++-
net/sunrpc/xprtrdma/transport.c | 6 +
net/sunrpc/xprtrdma/verbs.c | 159 +++++++++++++++++++++++++++++++++++----
net/sunrpc/xprtsock.c | 6 +
16 files changed, 347 insertions(+), 71 deletions(-)

--
Chuck Lever
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Chuck Lever
2014-10-16 19:38:21 UTC
Permalink
The RPC/RDMA send_request method and the chunk registration code
expect an errno from the registration function. This allows
the upper layers to distinguish between a recoverable failure
(for example, temporary memory exhaustion) and a hard failure
(for example, a bug in the registration logic).

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 61c4129..6ea2942 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1918,10 +1918,10 @@ rpcrdma_register_external(struct rpcrdma_mr_seg *seg,
break;

default:
- return -1;
+ return -EIO;
}
if (rc)
- return -1;
+ return rc;

return nsegs;
}
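
The distinction this patch draws can be sketched outside the kernel. The helper and policy names below are hypothetical, for illustration only:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Illustrative sketch, not kernel code: with a real errno the caller
 * can separate transient failures (worth retrying) from hard ones,
 * which a bare -1 cannot express.  The policy here is hypothetical. */
static const char *rc_policy(int rc)
{
	if (rc >= 0)
		return "done";		/* e.g. number of segments registered */
	if (rc == -ENOMEM || rc == -EAGAIN)
		return "retry";		/* temporary memory exhaustion */
	return "fail";			/* e.g. -EIO: bug in registration logic */
}
```

With the old return value of -1, every failure would have landed in the same bucket.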

Chuck Lever
2014-10-16 19:38:29 UTC
Permalink
Recent work made FRMR registration and invalidation completions
unsignaled. This greatly reduces the adapter interrupt rate.

Every so often, however, a posted send Work Request is allowed to
signal. Otherwise, the provider's Work Queue will wrap and the
workload will hang.

The number of Work Requests that are allowed to remain unsignaled is
determined by the value of rep_cqinit. Currently, this is set to the
size of the send Work Queue divided by two, minus 1.

For FRMR, the send Work Queue is the maximum number of concurrent
RPCs (currently 32) times the maximum number of Work Requests an
RPC might use (currently 7, though some adapters may need more).

For mlx4, this is 224 entries. This leaves completion signaling
disabled for 111 send Work Requests.

Some providers hold back dispatching Work Requests until a CQE is
generated. If completions are disabled, then no CQEs are generated
for quite some time, and that can stall the Work Queue.

I've seen this occur running xfstests generic/113 over NFSv4, where
eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
because the Work Queue has overflowed. The connection is dropped
and re-established.

Cap the rep_cqinit setting so completions are not left turned off
for too long.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6ea2942..5c0c7a5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -733,6 +733,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,

/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
+ if (ep->rep_cqinit > 20)
+ ep->rep_cqinit = 20;
if (ep->rep_cqinit <= 2)
ep->rep_cqinit = 0;
INIT_CQCOUNT(ep);
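
The arithmetic in the description (a 224-entry send queue leaving 111 sends unsignaled, now capped at 20) can be checked with a small user-space sketch. This is not the kernel function, just its math:

```c
#include <assert.h>

/* Sketch (not kernel code) of the capped signal trigger: "cqinit" is
 * the number of sends allowed to remain unsignaled before a
 * completion is requested.  The cap value 20 mirrors the patch. */
static int compute_cqinit(int max_send_wr, int cap)
{
	int cqinit = max_send_wr / 2 - 1;

	if (cqinit > cap)	/* cap so CQEs are generated regularly */
		cqinit = cap;
	if (cqinit <= 2)	/* tiny queues: signal every send */
		cqinit = 0;
	return cqinit;
}
```

For the mlx4 example above, `compute_cqinit(224, 20)` yields 20 instead of 111.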

Anna Schumaker
2014-10-20 13:27:26 UTC
Permalink
Hey Chuck,
Post by Chuck Lever
Recent work made FRMR registration and invalidation completions
unsignaled. This greatly reduces the adapter interrupt rate.
Every so often, however, a posted send Work Request is allowed to
signal. Otherwise, the provider's Work Queue will wrap and the
workload will hang.
The number of Work Requests that are allowed to remain unsignaled is
determined by the value of rep_cqinit. Currently, this is set to the
size of the send Work Queue divided by two, minus 1.
For FRMR, the send Work Queue is the maximum number of concurrent
RPCs (currently 32) times the maximum number of Work Requests an
RPC might use (currently 7, though some adapters may need more).
For mlx4, this is 224 entries. This leaves completion signaling
disabled for 111 send Work Requests.
Some providers hold back dispatching Work Requests until a CQE is
generated. If completions are disabled, then no CQEs are generated
for quite some time, and that can stall the Work Queue.
I've seen this occur running xfstests generic/113 over NFSv4, where
eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
because the Work Queue has overflowed. The connection is dropped
and re-established.
Cap the rep_cqinit setting so completions are not left turned off
for too long.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
---
net/sunrpc/xprtrdma/verbs.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6ea2942..5c0c7a5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -733,6 +733,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
+ if (ep->rep_cqinit > 20)
+ ep->rep_cqinit = 20;
if (ep->rep_cqinit <= 2)
Can you change the ep->rep_cqinit <= 2 check into an else-if?

Thanks!
Anna
Post by Chuck Lever
ep->rep_cqinit = 0;
INIT_CQCOUNT(ep);
Chuck Lever
2014-10-16 19:38:38 UTC
Permalink
I noticed that on RDMA, NFSv4 operations were using "hardway"
allocations far more often than not. A "hardway" allocation uses GFP_NOFS
during each RPC to allocate the XDR buffer, instead of using a
pre-allocated pre-registered buffer for each RPC.

The pre-allocated buffers are 2200 bytes in length. The requested
XDR buffer sizes looked like this:

GETATTR: 3220 bytes
LOOKUP: 3612 bytes
WRITE: 3256 bytes
OPEN: 6344 bytes

But an NFSv4 GETATTR RPC request should be small. It's the reply
part of GETATTR that can grow large.

call_allocate() passes a single value as the XDR buffer size: the
sum of call and reply buffers. However, the xprtrdma transport
allocates its XDR request and reply buffers separately.

xprtrdma needs to know the maximum call size, as guidance for how
large the outgoing request is going to be and how the NFS payload
will be marshalled into chunks.

But RDMA XDR reply buffers are pre-posted, fixed-size buffers, not
allocated by xprt_rdma_allocate().

Because of the sum passed through ->buf_alloc(), xprtrdma's
->buf_alloc() always allocates more XDR buffer than it will ever
use. For NFSv4, it is unnecessarily triggering the slow "hardway"
path for almost every RPC.

Pass the call and reply buffer size values separately to the
transport's ->buf_alloc method. The RDMA transport ->buf_alloc can
now ignore the reply size, and allocate just what it will use for
the call buffer. The socket transport ->buf_alloc can simply add
them together, as call_allocate() did before.

With this patch, an NFSv4 GETATTR request now allocates a 476-byte
RDMA XDR buffer. I didn't see a single NFSv4 request that did not
fit into the transport's pre-allocated XDR buffer.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
include/linux/sunrpc/sched.h | 2 +-
include/linux/sunrpc/xprt.h | 3 ++-
net/sunrpc/clnt.c | 4 ++--
net/sunrpc/sched.c | 6 ++++--
net/sunrpc/xprtrdma/transport.c | 2 +-
net/sunrpc/xprtsock.c | 3 ++-
6 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 1a89599..68fa71d 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -232,7 +232,7 @@ struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *,
void *);
void rpc_wake_up_status(struct rpc_wait_queue *, int);
void rpc_delay(struct rpc_task *, unsigned long);
-void * rpc_malloc(struct rpc_task *, size_t);
+void *rpc_malloc(struct rpc_task *, size_t, size_t);
void rpc_free(void *);
int rpciod_up(void);
void rpciod_down(void);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index fcbfe87..632685c 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -124,7 +124,8 @@ struct rpc_xprt_ops {
void (*rpcbind)(struct rpc_task *task);
void (*set_port)(struct rpc_xprt *xprt, unsigned short port);
void (*connect)(struct rpc_xprt *xprt, struct rpc_task *task);
- void * (*buf_alloc)(struct rpc_task *task, size_t size);
+ void * (*buf_alloc)(struct rpc_task *task,
+ size_t call, size_t reply);
void (*buf_free)(void *buffer);
int (*send_request)(struct rpc_task *task);
void (*set_retrans_timeout)(struct rpc_task *task);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 488ddee..5e817d6 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1599,8 +1599,8 @@ call_allocate(struct rpc_task *task)
req->rq_rcvsize = RPC_REPHDRSIZE + slack + proc->p_replen;
req->rq_rcvsize <<= 2;

- req->rq_buffer = xprt->ops->buf_alloc(task,
- req->rq_callsize + req->rq_rcvsize);
+ req->rq_buffer = xprt->ops->buf_alloc(task, req->rq_callsize,
+ req->rq_rcvsize);
if (req->rq_buffer != NULL)
return;

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79..fc4f939 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -829,7 +829,8 @@ static void rpc_async_schedule(struct work_struct *work)
/**
* rpc_malloc - allocate an RPC buffer
* @task: RPC task that will use this buffer
- * @size: requested byte size
+ * @call: maximum size of on-the-wire RPC call, in bytes
+ * @reply: maximum size of on-the-wire RPC reply, in bytes
*
* To prevent rpciod from hanging, this allocator never sleeps,
* returning NULL and suppressing warning if the request cannot be serviced
@@ -843,8 +844,9 @@ static void rpc_async_schedule(struct work_struct *work)
* In order to avoid memory starvation triggering more writebacks of
* NFS requests, we avoid using GFP_KERNEL.
*/
-void *rpc_malloc(struct rpc_task *task, size_t size)
+void *rpc_malloc(struct rpc_task *task, size_t call, size_t reply)
{
+ size_t size = call + reply;
struct rpc_buffer *buf;
gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac49..6e9d0a7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -459,7 +459,7 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
* the receive buffer portion when using reply chunks.
*/
static void *
-xprt_rdma_allocate(struct rpc_task *task, size_t size)
+xprt_rdma_allocate(struct rpc_task *task, size_t size, size_t replen)
{
struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;
struct rpcrdma_req *req, *nreq;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89e..b4aca48 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2423,8 +2423,9 @@ static void xs_tcp_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
* we allocate pages instead doing a kmalloc like rpc_malloc is because we want
* to use the server side send routines.
*/
-static void *bc_malloc(struct rpc_task *task, size_t size)
+static void *bc_malloc(struct rpc_task *task, size_t call, size_t reply)
{
+ size_t size = call + reply;
struct page *page;
struct rpc_buffer *buf;
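
Using the GETATTR numbers from the description (476 bytes of call buffer, 3220 bytes total), the per-transport policy splits like this. The helper and enum are hypothetical sketches of the policy, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the sizing policy described above: sockets need one
 * contiguous buffer covering call + reply, while RDMA pre-posts its
 * reply buffers and only needs the call portion allocated. */
enum xprt_kind { XPRT_SOCK, XPRT_RDMA };

static size_t buf_alloc_size(enum xprt_kind kind, size_t call, size_t reply)
{
	return kind == XPRT_RDMA ? call : call + reply;
}
```

A 476-byte request easily fits the transport's 2200-byte pre-registered buffer, so the "hardway" path is avoided.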


Anna Schumaker
2014-10-20 14:04:45 UTC
Permalink
Post by Chuck Lever
I noticed that on RDMA, NFSv4 operations were using "hardway"
allocations far more often than not. A "hardway" allocation uses GFP_NOFS
during each RPC to allocate the XDR buffer, instead of using a
pre-allocated pre-registered buffer for each RPC.
The pre-allocated buffers are 2200 bytes in length. The requested
GETATTR: 3220 bytes
LOOKUP: 3612 bytes
WRITE: 3256 bytes
OPEN: 6344 bytes
But an NFSv4 GETATTR RPC request should be small. It's the reply
part of GETATTR that can grow large.
call_allocate() passes a single value as the XDR buffer size: the
sum of call and reply buffers. However, the xprtrdma transport
allocates its XDR request and reply buffers separately.
xprtrdma needs to know the maximum call size, as guidance for how
large the outgoing request is going to be and how the NFS payload
will be marshalled into chunks.
But RDMA XDR reply buffers are pre-posted, fixed-size buffers, not
allocated by xprt_rdma_allocate().
Because of the sum passed through ->buf_alloc(), xprtrdma's
->buf_alloc() always allocates more XDR buffer than it will ever
use. For NFSv4, it is unnecessarily triggering the slow "hardway"
path for almost every RPC.
Pass the call and reply buffer size values separately to the
transport's ->buf_alloc method. The RDMA transport ->buf_alloc can
now ignore the reply size, and allocate just what it will use for
the call buffer. The socket transport ->buf_alloc can simply add
them together, as call_allocate() did before.
With this patch, an NFSv4 GETATTR request now allocates a 476 byte
RDMA XDR buffer. I didn't see a single NFSv4 request that did not
fit into the transport's pre-allocated XDR buffer.
---
include/linux/sunrpc/sched.h | 2 +-
include/linux/sunrpc/xprt.h | 3 ++-
net/sunrpc/clnt.c | 4 ++--
net/sunrpc/sched.c | 6 ++++--
net/sunrpc/xprtrdma/transport.c | 2 +-
net/sunrpc/xprtsock.c | 3 ++-
6 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 1a89599..68fa71d 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -232,7 +232,7 @@ struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *,
void *);
void rpc_wake_up_status(struct rpc_wait_queue *, int);
void rpc_delay(struct rpc_task *, unsigned long);
-void * rpc_malloc(struct rpc_task *, size_t);
+void *rpc_malloc(struct rpc_task *, size_t, size_t);
void rpc_free(void *);
int rpciod_up(void);
void rpciod_down(void);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index fcbfe87..632685c 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -124,7 +124,8 @@ struct rpc_xprt_ops {
void (*rpcbind)(struct rpc_task *task);
void (*set_port)(struct rpc_xprt *xprt, unsigned short port);
void (*connect)(struct rpc_xprt *xprt, struct rpc_task *task);
- void * (*buf_alloc)(struct rpc_task *task, size_t size);
+ void * (*buf_alloc)(struct rpc_task *task,
+ size_t call, size_t reply);
void (*buf_free)(void *buffer);
int (*send_request)(struct rpc_task *task);
void (*set_retrans_timeout)(struct rpc_task *task);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 488ddee..5e817d6 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1599,8 +1599,8 @@ call_allocate(struct rpc_task *task)
req->rq_rcvsize = RPC_REPHDRSIZE + slack + proc->p_replen;
req->rq_rcvsize <<= 2;
- req->rq_buffer = xprt->ops->buf_alloc(task,
- req->rq_callsize + req->rq_rcvsize);
+ req->rq_buffer = xprt->ops->buf_alloc(task, req->rq_callsize,
+ req->rq_rcvsize);
if (req->rq_buffer != NULL)
return;
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79..fc4f939 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -829,7 +829,8 @@ static void rpc_async_schedule(struct work_struct *work)
/**
* rpc_malloc - allocate an RPC buffer
*
* To prevent rpciod from hanging, this allocator never sleeps,
* returning NULL and suppressing warning if the request cannot be serviced
@@ -843,8 +844,9 @@ static void rpc_async_schedule(struct work_struct *work)
* In order to avoid memory starvation triggering more writebacks of
* NFS requests, we avoid using GFP_KERNEL.
*/
-void *rpc_malloc(struct rpc_task *task, size_t size)
+void *rpc_malloc(struct rpc_task *task, size_t call, size_t reply)
{
+ size_t size = call + reply;
struct rpc_buffer *buf;
gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac49..6e9d0a7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -459,7 +459,7 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
* the receive buffer portion when using reply chunks.
*/
static void *
-xprt_rdma_allocate(struct rpc_task *task, size_t size)
+xprt_rdma_allocate(struct rpc_task *task, size_t size, size_t replen)
The comment right before this function mentions that send and receive buffers are allocated in the same call. Can you update this?

Anna
Post by Chuck Lever
{
struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;
struct rpcrdma_req *req, *nreq;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89e..b4aca48 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2423,8 +2423,9 @@ static void xs_tcp_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
* we allocate pages instead doing a kmalloc like rpc_malloc is because we want
* to use the server side send routines.
*/
-static void *bc_malloc(struct rpc_task *task, size_t size)
+static void *bc_malloc(struct rpc_task *task, size_t call, size_t reply)
{
+ size_t size = call + reply;
struct page *page;
struct rpc_buffer *buf;
Chuck Lever
2014-10-20 18:21:57 UTC
Permalink
Post by Chuck Lever
static void *
-xprt_rdma_allocate(struct rpc_task *task, size_t size)
+xprt_rdma_allocate(struct rpc_task *task, size_t size, size_t replen)

The rpc_rqst actually has separate send and receive sizes
already in it, and the passed-in tk_task has a pointer to
rpc_rqst. So I don't need to alter the buf_alloc() method
synopsis at all.

Though I've tested with the two server implementations that
I have on hand, I've discovered this is a less than generic
approach to the problem. I can't just slice off the receive
part of this buffer, as it is actually used sometimes
(though apparently not with the Solaris or Linux servers).

For those two reasons, I'm going to replace this patch with
something else in the next round. JFYI so you know what to
look for next time.

Post by Anna Schumaker
The comment right before this function mentions that send and
receive buffers are allocated in the same call. Can you update
this?

If the new patch touches xprt_rdma_allocate(), I can
replace Tom's implementer's notes in this function with
some operational documenting comments.
--
Chuck Lever



Chuck Lever
2014-10-16 19:38:46 UTC
Permalink
Currently rpcrdma_flush_cqs() avoids code duplication by simply
invoking rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall.
This has two minor issues:

1. It re-arms the CQ, which can happen even if a CQ upcall is
running at the same time

2. The upcall functions drain only a limited number of CQEs,
thanks to the poll budget added by commit 8301a2c047cc
("xprtrdma: Limit work done by completion handler").

Rewrite rpcrdma_flush_cqs() to be sure all CQEs are drained after a
transport is disconnected.

Fixes: a7bc211ac926 ("xprtrdma: On disconnect, don't ignore ... ")
Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5c0c7a5..6fadb90 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -106,6 +106,17 @@ rpcrdma_run_tasklet(unsigned long data)
static DECLARE_TASKLET(rpcrdma_tasklet_g, rpcrdma_run_tasklet, 0UL);

static void
+rpcrdma_schedule_tasklet(struct list_head *sched_list)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
+ list_splice_tail(sched_list, &rpcrdma_tasklets_g);
+ spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
+ tasklet_schedule(&rpcrdma_tasklet_g);
+}
+
+static void
rpcrdma_qp_async_error_upcall(struct ib_event *event, void *context)
{
struct rpcrdma_ep *ep = context;
@@ -243,7 +254,6 @@ rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
struct list_head sched_list;
struct ib_wc *wcs;
int budget, count, rc;
- unsigned long flags;

INIT_LIST_HEAD(&sched_list);
budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
@@ -261,10 +271,7 @@ rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
rc = 0;

out_schedule:
- spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
- list_splice_tail(&sched_list, &rpcrdma_tasklets_g);
- spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
- tasklet_schedule(&rpcrdma_tasklet_g);
+ rpcrdma_schedule_tasklet(&sched_list);
return rc;
}

@@ -309,8 +316,17 @@ rpcrdma_recvcq_upcall(struct ib_cq *cq, void *cq_context)
static void
rpcrdma_flush_cqs(struct rpcrdma_ep *ep)
{
- rpcrdma_recvcq_upcall(ep->rep_attr.recv_cq, ep);
- rpcrdma_sendcq_upcall(ep->rep_attr.send_cq, ep);
+ struct list_head sched_list;
+ struct ib_wc wc;
+
+ INIT_LIST_HEAD(&sched_list);
+ while (ib_poll_cq(ep->rep_attr.recv_cq, 1, &wc) > 0)
+ rpcrdma_recvcq_process_wc(&wc, &sched_list);
+ if (!list_empty(&sched_list))
+ rpcrdma_schedule_tasklet(&sched_list);
+
+ while (ib_poll_cq(ep->rep_attr.send_cq, 1, &wc) > 0)
+ rpcrdma_sendcq_process_wc(&wc);
}

#ifdef RPC_DEBUG
@@ -980,7 +996,6 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
{
int rc;

- rpcrdma_flush_cqs(ep);
rc = rdma_disconnect(ia->ri_id);
if (!rc) {
/* returns without wait if not connected */
@@ -992,6 +1007,7 @@ rpcrdma_ep_disconnect(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia)
dprintk("RPC: %s: rdma_disconnect %i\n", __func__, rc);
ep->rep_connected = rc;
}
+ rpcrdma_flush_cqs(ep);
}

static int
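
The budget problem the patch description calls out (issue 2 above) can be modeled with a toy queue. This is not the verbs API, just an illustration of why a budgeted poll is the wrong tool for a post-disconnect flush:

```c
#include <assert.h>

/* Toy model: a budgeted poll may stop early and leave entries
 * queued, while a drain loop runs until the queue is empty. */
struct toy_cq { int head, tail; };

static int toy_poll(struct toy_cq *cq)		/* 1 = got an entry */
{
	if (cq->head == cq->tail)
		return 0;
	cq->head++;
	return 1;
}

static int poll_budgeted(struct toy_cq *cq, int budget)
{
	int n = 0;

	while (budget-- && toy_poll(cq))
		n++;
	return n;		/* may leave entries behind */
}

static int drain(struct toy_cq *cq)
{
	int n = 0;

	while (toy_poll(cq))
		n++;		/* loops until the queue is empty */
	return n;
}
```

The rewritten rpcrdma_flush_cqs() corresponds to the drain loop: it calls ib_poll_cq() until it returns no more entries, rather than stopping at a poll budget.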

Chuck Lever
2014-10-16 19:38:55 UTC
Permalink
When using RPCRDMA_MTHCAFMR memory registration, after a few
transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to
return EINVAL because the provider has exhausted its map pool.

Make sure that all FMRs are unmapped during transport disconnect,
and that ->send_request remarshals them during an RPC retransmit.
This resets the transport's MRs to ensure that none are leaked
during a disconnect.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/transport.c | 2 +-
net/sunrpc/xprtrdma/verbs.c | 40 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 6e9d0a7..61be0a0 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -601,7 +601,7 @@ xprt_rdma_send_request(struct rpc_task *task)

if (req->rl_niovs == 0)
rc = rpcrdma_marshal_req(rqst);
- else if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
+ else if (r_xprt->rx_ia.ri_memreg_strategy != RPCRDMA_ALLPHYSICAL)
rc = rpcrdma_marshal_chunks(rqst, 0);
if (rc < 0)
goto failed_marshal;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6fadb90..9105524 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -62,6 +62,7 @@
#endif

static void rpcrdma_reset_frmrs(struct rpcrdma_ia *);
+static void rpcrdma_reset_fmrs(struct rpcrdma_ia *);

/*
* internal functions
@@ -884,8 +885,17 @@ retry:
rpcrdma_ep_disconnect(ep, ia);
rpcrdma_flush_cqs(ep);

- if (ia->ri_memreg_strategy == RPCRDMA_FRMR)
+ switch (ia->ri_memreg_strategy) {
+ case RPCRDMA_FRMR:
rpcrdma_reset_frmrs(ia);
+ break;
+ case RPCRDMA_MTHCAFMR:
+ rpcrdma_reset_fmrs(ia);
+ break;
+ default:
+ rc = -EIO;
+ goto out;
+ }

xprt = container_of(ia, struct rpcrdma_xprt, rx_ia);
id = rpcrdma_create_id(xprt, ia,
@@ -1305,6 +1315,34 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
kfree(buf->rb_pool);
}

+/* After a disconnect, unmap all FMRs.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_fmr_external().
+ */
+static void
+rpcrdma_reset_fmrs(struct rpcrdma_ia *ia)
+{
+ struct rpcrdma_xprt *r_xprt =
+ container_of(ia, struct rpcrdma_xprt, rx_ia);
+ struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+ struct list_head *pos;
+ struct rpcrdma_mw *r;
+ LIST_HEAD(l);
+ int rc;
+
+ list_for_each(pos, &buf->rb_all) {
+ r = list_entry(pos, struct rpcrdma_mw, mw_all);
+
+ INIT_LIST_HEAD(&l);
+ list_add(&r->r.fmr->list, &l);
+ rc = ib_unmap_fmr(&l);
+ if (rc)
+ dprintk("RPC: %s: ib_unmap_fmr failed %i\n",
+ __func__, rc);
+ }
+}
+
/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
* an unusable state. Find FRMRs in this state and dereg / reg
* each. FRMRs that are VALID and attached to an rpcrdma_req are

--
Chuck Lever
2014-10-16 19:39:03 UTC
A pair of CQs is created for each xprtrdma transport. One transport
instance is created per NFS mount point.

Both Shirley Ma and Steve Wise have observed that the adapter
interrupt workload sticks with a single MSI-X and CPU core unless
manual steps are taken to move it to other CPUs. This tends to limit
performance once the interrupt workload consumes an entire core.

Sagi Grimberg suggested that one way to get better dispersal of
interrupts is to use the completion vector argument of the
ib_create_cq() API to assign new CQs to different adapter ingress
queues. Currently, xprtrdma sets this argument to 0 unconditionally,
which leaves all xprtrdma CQs consuming the same small pool of
resources.

Each CQ will still be nailed to one completion vector. This won't help
a "single mount point" workload, but when multiple mount points are in
play, the RDMA provider will see to it that adapter interrupts are
better spread over available resources.

We also take a little trouble to stay off vector 0, which is used
by many other kernel RDMA consumers such as IPoIB.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 45 ++++++++++++++++++++++++++++++++++++++++---
1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 9105524..dc4c8e3 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -49,6 +49,8 @@

#include <linux/interrupt.h>
#include <linux/slab.h>
+#include <linux/random.h>
+
#include <asm/bitops.h>

#include "xprt_rdma.h"
@@ -666,6 +668,42 @@ rpcrdma_ia_close(struct rpcrdma_ia *ia)
}

/*
+ * Select a provider completion vector to assign a CQ to.
+ *
+ * This is an attempt to spread CQs across available CPUs. The counter
+ * is shared between all adapters on a system. Multi-adapter systems
+ * are rare, and this is still better for them than leaving all CQs on
+ * one completion vector.
+ *
+ * We could put the send and receive CQs for the same transport on
+ * different vectors. However, this risks assigning them to cores on
+ * different sockets in larger systems, which could have disastrous
+ * performance effects due to NUMA.
+ */
+static int
+rpcrdma_cq_comp_vec(struct rpcrdma_ia *ia)
+{
+ int num_comp_vectors = ia->ri_id->device->num_comp_vectors;
+ int vector = 0;
+
+ if (num_comp_vectors > 1) {
+ static DEFINE_SPINLOCK(rpcrdma_cv_lock);
+ static unsigned int rpcrdma_cv_counter;
+
+ spin_lock(&rpcrdma_cv_lock);
+ vector = rpcrdma_cv_counter++ % num_comp_vectors;
+ /* Skip 0, as it is commonly used by other RDMA consumers */
+ if (vector == 0)
+ vector = rpcrdma_cv_counter++ % num_comp_vectors;
+ spin_unlock(&rpcrdma_cv_lock);
+ }
+
+ dprintk("RPC: %s: adapter has %d vectors, using vector %d\n",
+ __func__, num_comp_vectors, vector);
+ return vector;
+}
+
+/*
* Create unconnected endpoint.
*/
int
@@ -674,7 +712,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
{
struct ib_device_attr devattr;
struct ib_cq *sendcq, *recvcq;
- int rc, err;
+ int rc, err, comp_vec;

rc = ib_query_device(ia->ri_id->device, &devattr);
if (rc) {
@@ -759,9 +797,10 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);

+ comp_vec = rpcrdma_cq_comp_vec(ia);
sendcq = ib_create_cq(ia->ri_id->device, rpcrdma_sendcq_upcall,
rpcrdma_cq_async_error_upcall, ep,
- ep->rep_attr.cap.max_send_wr + 1, 0);
+ ep->rep_attr.cap.max_send_wr + 1, comp_vec);
if (IS_ERR(sendcq)) {
rc = PTR_ERR(sendcq);
dprintk("RPC: %s: failed to create send CQ: %i\n",
@@ -778,7 +817,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,

recvcq = ib_create_cq(ia->ri_id->device, rpcrdma_recvcq_upcall,
rpcrdma_cq_async_error_upcall, ep,
- ep->rep_attr.cap.max_recv_wr + 1, 0);
+ ep->rep_attr.cap.max_recv_wr + 1, comp_vec);
if (IS_ERR(recvcq)) {
rc = PTR_ERR(recvcq);
dprintk("RPC: %s: failed to create recv CQ: %i\n",

--
Chuck Lever
2014-10-16 19:39:11 UTC
Occasionally mountstats reports a negative retransmission rate.
Ensure that two RPCs completing concurrently don't confuse the sums
in the transport's op_metrics array.

Since pNFS filelayout can invoke rpc_count_iostats() on another
transport from xprt_release(), we can't rely on simply holding the
transport_lock in xprt_release(). There's nothing for it but hard
serialization. One spin lock per RPC operation should make this as
painless as it can be.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
include/linux/sunrpc/metrics.h | 3 +++
net/sunrpc/stats.c | 21 ++++++++++++++++-----
2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h
index 1565bbe..eecb5a7 100644
--- a/include/linux/sunrpc/metrics.h
+++ b/include/linux/sunrpc/metrics.h
@@ -27,10 +27,13 @@

#include <linux/seq_file.h>
#include <linux/ktime.h>
+#include <linux/spinlock.h>

#define RPC_IOSTATS_VERS "1.0"

struct rpc_iostats {
+ spinlock_t om_lock;
+
/*
* These counters give an idea about how many request
* transmissions are required, on average, to complete that
diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c
index 5453049..9711a15 100644
--- a/net/sunrpc/stats.c
+++ b/net/sunrpc/stats.c
@@ -116,7 +116,15 @@ EXPORT_SYMBOL_GPL(svc_seq_show);
*/
struct rpc_iostats *rpc_alloc_iostats(struct rpc_clnt *clnt)
{
- return kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
+ struct rpc_iostats *stats;
+ int i;
+
+ stats = kcalloc(clnt->cl_maxproc, sizeof(*stats), GFP_KERNEL);
+ if (stats) {
+ for (i = 0; i < clnt->cl_maxproc; i++)
+ spin_lock_init(&stats[i].om_lock);
+ }
+ return stats;
}
EXPORT_SYMBOL_GPL(rpc_alloc_iostats);

@@ -135,20 +143,21 @@ EXPORT_SYMBOL_GPL(rpc_free_iostats);
* rpc_count_iostats - tally up per-task stats
* @task: completed rpc_task
* @stats: array of stat structures
- *
- * Relies on the caller for serialization.
*/
void rpc_count_iostats(const struct rpc_task *task, struct rpc_iostats *stats)
{
struct rpc_rqst *req = task->tk_rqstp;
struct rpc_iostats *op_metrics;
- ktime_t delta;
+ ktime_t delta, now;

if (!stats || !req)
return;

+ now = ktime_get();
op_metrics = &stats[task->tk_msg.rpc_proc->p_statidx];

+ spin_lock(&op_metrics->om_lock);
+
op_metrics->om_ops++;
op_metrics->om_ntrans += req->rq_ntrans;
op_metrics->om_timeouts += task->tk_timeouts;
@@ -161,8 +170,10 @@ void rpc_count_iostats(const struct rpc_task *task, struct rpc_iostats *stats)

op_metrics->om_rtt = ktime_add(op_metrics->om_rtt, req->rq_rtt);

- delta = ktime_sub(ktime_get(), task->tk_start);
+ delta = ktime_sub(now, task->tk_start);
op_metrics->om_execute = ktime_add(op_metrics->om_execute, delta);
+
+ spin_unlock(&op_metrics->om_lock);
}
EXPORT_SYMBOL_GPL(rpc_count_iostats);


--
Chuck Lever
2014-10-16 19:39:19 UTC
An async error upcall is a hard error, and should be reported in
the system log.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/verbs.c | 36 ++++++++++++++++++++++++++++++++----
1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index dc4c8e3..73079d5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -108,6 +108,32 @@ rpcrdma_run_tasklet(unsigned long data)

static DECLARE_TASKLET(rpcrdma_tasklet_g, rpcrdma_run_tasklet, 0UL);

+static const char * const async_event[] = {
+ "CQ error",
+ "QP fatal error",
+ "QP request error",
+ "QP access error",
+ "communication established",
+ "send queue drained",
+ "path migration successful",
+ "path mig error",
+ "device fatal error",
+ "port active",
+ "port error",
+ "LID change",
+ "P_key change",
+ "SM change",
+ "SRQ error",
+ "SRQ limit reached",
+ "last WQE reached",
+ "client reregister",
+ "GID change",
+};
+
+#define ASYNC_MSG(status) \
+ ((status) < ARRAY_SIZE(async_event) ? \
+ async_event[(status)] : "unknown async error")
+
static void
rpcrdma_schedule_tasklet(struct list_head *sched_list)
{
@@ -124,8 +150,9 @@ rpcrdma_qp_async_error_upcall(struct ib_event *event, void *context)
{
struct rpcrdma_ep *ep = context;

- dprintk("RPC: %s: QP error %X on device %s ep %p\n",
- __func__, event->event, event->device->name, context);
+ pr_err("RPC: %s: %s on device %s ep %p\n",
+ __func__, ASYNC_MSG(event->event),
+ event->device->name, context);
if (ep->rep_connected == 1) {
ep->rep_connected = -EIO;
ep->rep_func(ep);
@@ -138,8 +165,9 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
{
struct rpcrdma_ep *ep = context;

- dprintk("RPC: %s: CQ error %X on device %s ep %p\n",
- __func__, event->event, event->device->name, context);
+ pr_err("RPC: %s: %s on device %s ep %p\n",
+ __func__, ASYNC_MSG(event->event),
+ event->device->name, context);
if (ep->rep_connected == 1) {
ep->rep_connected = -EIO;
ep->rep_func(ep);

--
Chuck Lever
2014-10-16 19:39:27 UTC
The Linux NFS/RDMA server was rejecting NFSv3 WRITE requests when pad
optimization was enabled. That bug is now fixed.

We can enable pad optimization, which helps performance and is now
supported by both Linux and Solaris servers.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
net/sunrpc/xprtrdma/transport.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 61be0a0..09c6443 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -73,7 +73,7 @@ static unsigned int xprt_rdma_max_inline_read = RPCRDMA_DEF_INLINE;
static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
static unsigned int xprt_rdma_inline_write_padding;
static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
- int xprt_rdma_pad_optimize = 0;
+ int xprt_rdma_pad_optimize = 1;

#ifdef RPC_DEBUG


--
Chuck Lever
2014-10-16 19:39:35 UTC
The nfs_match_client() function ensures that separate nfs_client
structures are used when mounting the same server with different
transport protocols. For example, if a Linux client mounts the same
server via NFSv3 with some UDP mounts and some TCP mounts, it will
use only two transports and two nfs_client structures: one shared
with all UDP mounts, and one shared with all TCP mounts.

When a uniform client string is in use (NFSv4.1, or NFSv4.0 with the
"migration" mount option), nfs_match_client() will ensure an
nfs_client structure and separate transport is created for mounts
with unique "proto=" settings (one for TCP and one for RDMA,
currently).

But EXCHANGE_ID sends exactly the same nfs_client_id4 string on both
transports. The server then believes that the client is trunking
over disparate transports, when it clearly is not. The open and lock
state that will appear on each transport is disjoint.

Now that NFSv4.1 over RDMA is supported, a user can mount the same
server with NFSv4.1 over TCP and RDMA concurrently. The client will
send an EXCHANGE_ID with the same client ID on both transports, and
it will also send a CREATE_SESSION on both.

To ensure the Linux client represents itself correctly to servers,
add the transport protocol name to the uniform client string. Each
transport instance should get its own client ID (and session) to
prevent trunking.

This doesn't appear to be a problem for NFSv4.0 without migration.
It also wasn't a problem for NFSv4.1 when TCP was the only available
transport.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4proc.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 6ca0c8e..a1243e7 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -4929,16 +4929,23 @@ nfs4_init_uniform_client_string(const struct nfs_client *clp,
char *buf, size_t len)
{
const char *nodename = clp->cl_rpcclient->cl_nodename;
+ unsigned int result;

+ rcu_read_lock();
if (nfs4_client_id_uniquifier[0] != '\0')
- return scnprintf(buf, len, "Linux NFSv%u.%u %s/%s",
+ result = scnprintf(buf, len, "Linux NFSv%u.%u %s/%s %s",
clp->rpc_ops->version,
clp->cl_minorversion,
nfs4_client_id_uniquifier,
- nodename);
- return scnprintf(buf, len, "Linux NFSv%u.%u %s",
+ nodename, rpc_peeraddr2str(clp->cl_rpcclient,
+ RPC_DISPLAY_PROTO));
+ else
+ result = scnprintf(buf, len, "Linux NFSv%u.%u %s %s",
clp->rpc_ops->version, clp->cl_minorversion,
- nodename);
+ nodename, rpc_peeraddr2str(clp->cl_rpcclient,
+ RPC_DISPLAY_PROTO));
+ rcu_read_unlock();
+ return result;
}

/*

--
Chuck Lever
2014-10-16 19:39:44 UTC
nfs4_init_callback() is never invoked for NFS versions other than 4.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4client.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index ffdb28d..5f4b818 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -241,28 +241,25 @@ void nfs4_free_client(struct nfs_client *clp)
*/
static int nfs4_init_callback(struct nfs_client *clp)
{
+ struct rpc_xprt *xprt;
int error;

- if (clp->rpc_ops->version == 4) {
- struct rpc_xprt *xprt;
+ xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);

- xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
-
- if (nfs4_has_session(clp)) {
- error = xprt_setup_backchannel(xprt,
- NFS41_BC_MIN_CALLBACKS);
- if (error < 0)
- return error;
- }
-
- error = nfs_callback_up(clp->cl_mvops->minor_version, xprt);
- if (error < 0) {
- dprintk("%s: failed to start callback. Error = %d\n",
- __func__, error);
+ if (nfs4_has_session(clp)) {
+ error = xprt_setup_backchannel(xprt, NFS41_BC_MIN_CALLBACKS);
+ if (error < 0)
return error;
- }
- __set_bit(NFS_CS_CALLBACK, &clp->cl_res_state);
}
+
+ error = nfs_callback_up(clp->cl_mvops->minor_version, xprt);
+ if (error < 0) {
+ dprintk("%s: failed to start callback. Error = %d\n",
+ __func__, error);
+ return error;
+ }
+ __set_bit(NFS_CS_CALLBACK, &clp->cl_res_state);
+
return 0;
}


--
Chuck Lever
2014-10-16 19:39:52 UTC
Allow upper layers to determine if a particular rpc_clnt is prepared
to provide a backchannel RPC service.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
include/linux/sunrpc/clnt.h | 1 +
include/linux/sunrpc/xprt.h | 1 +
net/sunrpc/clnt.c | 24 ++++++++++++++++++++++++
net/sunrpc/xprtsock.c | 3 +++
4 files changed, 29 insertions(+)

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 70736b9..644c751 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -171,6 +171,7 @@ int rpc_protocol(struct rpc_clnt *);
struct net * rpc_net_ns(struct rpc_clnt *);
size_t rpc_max_payload(struct rpc_clnt *);
unsigned long rpc_get_timeout(struct rpc_clnt *clnt);
+bool rpc_xprt_is_bidirectional(struct rpc_clnt *);
void rpc_force_rebind(struct rpc_clnt *);
size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 632685c..4dea441 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -218,6 +218,7 @@ struct rpc_xprt {
* items */
struct list_head bc_pa_list; /* List of preallocated
* backchannel rpc_rqst's */
+ bool bc_supported;
#endif /* CONFIG_SUNRPC_BACKCHANNEL */
struct list_head recv;

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 5e817d6..1793341 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1347,6 +1347,30 @@ unsigned long rpc_get_timeout(struct rpc_clnt *clnt)
EXPORT_SYMBOL_GPL(rpc_get_timeout);

/**
+ * rpc_xprt_is_bidirectional
+ * @clnt: RPC clnt to query
+ *
+ * Returns true if underlying transport supports backchannel service.
+ */
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+bool rpc_xprt_is_bidirectional(struct rpc_clnt *clnt)
+{
+ bool ret;
+
+ rcu_read_lock();
+ ret = rcu_dereference(clnt->cl_xprt)->bc_supported;
+ rcu_read_unlock();
+ return ret;
+}
+#else
+bool rpc_xprt_is_bidirectional(struct rpc_clnt *clnt)
+{
+ return false;
+}
+#endif
+EXPORT_SYMBOL_GPL(rpc_xprt_is_bidirectional);
+
+/**
* rpc_force_rebind - force transport to check that remote port is unchanged
* @clnt: client to rebind
*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index b4aca48..e2e15a9 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2864,6 +2864,9 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)

xprt->ops = &xs_tcp_ops;
xprt->timeout = &xs_tcp_default_timeout;
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+ xprt->bc_supported = true;
+#endif

switch (addr->sa_family) {
case AF_INET:

--
Chuck Lever
2014-10-16 19:40:00 UTC
So far, TCP is the only transport that supports bi-directional RPC.

When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.

This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/client.c | 1 +
fs/nfs/nfs4client.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/nfs_fs_sb.h | 2 ++
3 files changed, 57 insertions(+)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6a4f366..19f49bf 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -78,6 +78,7 @@ const struct rpc_program nfs_program = {
.stats = &nfs_rpcstat,
.pipe_dir_name = NFS_PIPE_DIRNAME,
};
+EXPORT_SYMBOL_GPL(nfs_program);

struct rpc_stat nfs_rpcstat = {
.program = &nfs_program
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 5f4b818..b1cc35e 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -213,6 +213,8 @@ static void nfs4_destroy_callback(struct nfs_client *clp)
{
if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
+ if (clp->cl_bc_rpcclient)
+ rpc_shutdown_client(clp->cl_bc_rpcclient);
}

static void nfs4_shutdown_client(struct nfs_client *clp)
@@ -291,6 +293,53 @@ int nfs40_init_client(struct nfs_client *clp)

#if defined(CONFIG_NFS_V4_1)

+/*
+ * Create a separate rpc_clnt using TCP that can provide a
+ * backchannel service.
+ */
+static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
+{
+ struct sockaddr_storage address;
+ struct sockaddr *sap = (struct sockaddr *)&address;
+ struct rpc_create_args args = {
+ .net = clp->cl_net,
+ .protocol = XPRT_TRANSPORT_TCP,
+ .address = sap,
+ .addrsize = clp->cl_addrlen,
+ .servername = clp->cl_hostname,
+ .program = &nfs_program,
+ .version = clp->rpc_ops->version,
+ .flags = (RPC_CLNT_CREATE_DISCRTRY |
+ RPC_CLNT_CREATE_NOPING),
+ };
+ struct rpc_clnt *clnt;
+ struct rpc_cred *cred;
+
+ if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
+ return 0;
+
+ if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
+ args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
+ memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
+ rpc_set_port(sap, NFS_PORT);
+ cred = nfs4_get_clid_cred(clp);
+ if (cred) {
+ args.authflavor = cred->cr_auth->au_flavor;
+ put_rpccred(cred);
+ } else
+ args.authflavor = RPC_AUTH_UNIX;
+
+ clnt = rpc_create(&args);
+ if (IS_ERR(clnt)) {
+ dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
+ __func__, PTR_ERR(clnt));
+ return PTR_ERR(clnt);
+ }
+
+ clp->cl_bc_rpcclient = clnt;
+ return 0;
+}
+
/**
* nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
* @clp - nfs_client to initialize
@@ -300,6 +349,11 @@ int nfs40_init_client(struct nfs_client *clp)
int nfs41_init_client(struct nfs_client *clp)
{
struct nfs4_session *session = NULL;
+ int ret;
+
+ ret = nfs41_create_sidecar_rpc_client(clp);
+ if (ret)
+ return ret;

/*
* Create the session and mark it expired.
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 922be2e..159d703 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -87,6 +87,8 @@ struct nfs_client {

/* The sequence id to use for the next CREATE_SESSION */
u32 cl_seqid;
+ /* The optional sidecar backchannel transport */
+ struct rpc_clnt *cl_bc_rpcclient;
/* The flags used for obtaining the clientid during EXCHANGE_ID */
u32 cl_exchange_flags;
struct nfs4_session *cl_session; /* shared session */

--
Anna Schumaker
2014-10-20 17:33:22 UTC
Post by Chuck Lever
So far, TCP is the only transport that supports bi-directional RPC.
When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.
This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.
I thought NFS v4.0 also uses a separate connection for the backchannel? Can any of that code be reused here, rather than creating new sidecar structures?

Anna
--
Chuck Lever
2014-10-20 18:09:42 UTC
Post by Chuck Lever
So far, TCP is the only transport that supports bi-directional RPC.

When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.

This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.

I thought NFS v4.0 also uses a separate connection for the backchannel?

For NFSv4.0, the server opens a connection to the client. For
NFSv4.1, the client opens the side car connection to the
server, and then performs BIND_CONN_TO_SESSION, an operation
not in NFSv4.0. The processes are not terribly similar.
Can any of that code be reused here, rather than creating new sidecar
structures?

Since it's the client opening a connection in this case,
I don't see any obvious common code paths that I haven't
already re-used. I'm open to suggestions.
Anna
=20
Post by Chuck Lever
=20
---
fs/nfs/client.c | 1 +
fs/nfs/nfs4client.c | 54 +++++++++++++++++++++++++++++++++++=
++++++++++
Post by Chuck Lever
include/linux/nfs_fs_sb.h | 2 ++
3 files changed, 57 insertions(+)
=20
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 6a4f366..19f49bf 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -78,6 +78,7 @@ const struct rpc_program nfs_program =3D {
.stats =3D &nfs_rpcstat,
.pipe_dir_name =3D NFS_PIPE_DIRNAME,
};
+EXPORT_SYMBOL_GPL(nfs_program);
=20
struct rpc_stat nfs_rpcstat =3D {
.program =3D &nfs_program
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 5f4b818..b1cc35e 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -213,6 +213,8 @@ static void nfs4_destroy_callback(struct nfs_cli=
ent *clp)
Post by Chuck Lever
{
if (__test_and_clear_bit(NFS_CS_CALLBACK, &clp->cl_res_state))
nfs_callback_down(clp->cl_mvops->minor_version, clp->cl_net);
+ if (clp->cl_bc_rpcclient)
+ rpc_shutdown_client(clp->cl_bc_rpcclient);
}
=20
static void nfs4_shutdown_client(struct nfs_client *clp)
@@ -291,6 +293,53 @@ int nfs40_init_client(struct nfs_client *clp)
 
 #if defined(CONFIG_NFS_V4_1)
 
+/*
+ * Create a separate rpc_clnt using TCP that can provide a
+ * backchannel service.
+ */
+static int nfs41_create_sidecar_rpc_client(struct nfs_client *clp)
+{
+	struct sockaddr_storage address;
+	struct sockaddr *sap = (struct sockaddr *)&address;
+	struct rpc_create_args args = {
+		.net		= clp->cl_net,
+		.protocol	= XPRT_TRANSPORT_TCP,
+		.address	= sap,
+		.addrsize	= clp->cl_addrlen,
+		.servername	= clp->cl_hostname,
+		.program	= &nfs_program,
+		.version	= clp->rpc_ops->version,
+		.flags		= (RPC_CLNT_CREATE_DISCRTRY |
+				   RPC_CLNT_CREATE_NOPING),
+	};
+	struct rpc_clnt *clnt;
+	struct rpc_cred *cred;
+
+	if (rpc_xprt_is_bidirectional(clp->cl_rpcclient))
+		return 0;
+
+	if (test_bit(NFS_CS_NORESVPORT, &clp->cl_flags))
+		args.flags |= RPC_CLNT_CREATE_NONPRIVPORT;
+	memcpy(sap, &clp->cl_addr, clp->cl_addrlen);
+	rpc_set_port(sap, NFS_PORT);
+	cred = nfs4_get_clid_cred(clp);
+	if (cred) {
+		args.authflavor = cred->cr_auth->au_flavor;
+		put_rpccred(cred);
+	} else
+		args.authflavor = RPC_AUTH_UNIX;
+
+	clnt = rpc_create(&args);
+	if (IS_ERR(clnt)) {
+		dprintk("%s: cannot create side-car RPC client. Error = %ld\n",
+			__func__, PTR_ERR(clnt));
+		return PTR_ERR(clnt);
+	}
+
+	clp->cl_bc_rpcclient = clnt;
+	return 0;
+}
+
 /**
  * nfs41_init_client - nfs_client initialization tasks for NFSv4.1+
@@ -300,6 +349,11 @@ int nfs40_init_client(struct nfs_client *clp)
 int nfs41_init_client(struct nfs_client *clp)
 {
 	struct nfs4_session *session = NULL;
+	int ret;
+
+	ret = nfs41_create_sidecar_rpc_client(clp);
+	if (ret)
+		return ret;
 
 	/*
 	 * Create the session and mark it expired.
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 922be2e..159d703 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -87,6 +87,8 @@ struct nfs_client {
 
 	/* The sequence id to use for the next CREATE_SESSION */
 	u32 cl_seqid;
+	/* The optional sidecar backchannel transport */
+	struct rpc_clnt *cl_bc_rpcclient;
 	/* The flags used for obtaining the clientid during EXCHANGE_ID */
 	u32 cl_exchange_flags;
 	struct nfs4_session *cl_session;	/* shared session */
 
--
Chuck Lever
Trond Myklebust
2014-10-20 19:40:11 UTC
Permalink
So far, TCP is the only transport that supports bi-directional RPC.
When mounting with NFSv4.1 using a transport that does not support
bi-directional RPC, establish a TCP sidecar connection to handle
backchannel traffic for a session. The sidecar transport does not
use its forward channel except for sending BIND_CONN_TO_SESSION
operations.
This commit adds logic to create and destroy the sidecar transport.
Subsequent commits add logic to use the transport.
I thought NFS v4.0 also uses a separate connection for the backchannel?
For NFSv4.0, the server opens a connection to the client. For
NFSv4.1, the client opens the side car connection to the
server, and then performs BIND_CONN_TO_SESSION, an operation
not in NFSv4.0. The processes are not terribly similar.
Can any of that code be reused here, rather than creating new sidecar structures?
Since it's the client opening a connection in this case,
I don't see any obvious common code paths that I haven't
already re-used. I'm open to suggestions.
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData
Chuck Lever
2014-10-20 20:11:25 UTC
Permalink
Hi Trond-
Post by Trond Myklebust
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
There's no benefit to it. With a side car, the server requires
few or no changes. There are no CB operations that benefit
from using RDMA. It's very quick to implement, re-using most of
the client backchannel implementation that already exists.

I've discussed this with an author of RFC 5667 [cc'd], and also
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.

Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK'd it.

But I don't see an explicit recommendation in section 5.1. Which
text are you referring to?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Trond Myklebust
2014-10-20 22:31:17 UTC
Permalink
Post by Chuck Lever
Hi Trond-
Post by Trond Myklebust
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
There=E2=80=99s no benefit to it. With a side car, the server require=
s
Post by Chuck Lever
few or no changes. There are no CB operations that benefit
from using RDMA. It=E2=80=99s very quick to implement, re-using most =
of
Post by Chuck Lever
the client backchannel implementation that already exists.
I=E2=80=99ve discussed this with an author of RFC 5667 [cc=E2=80=99d]=
, and also
Post by Chuck Lever
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.
Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK=E2=80=99=
d it.
Post by Chuck Lever
But I don=E2=80=99t see an explicit recommendation in section 5.1. Wh=
ich
Post by Chuck Lever
text are you referring to?
The very first paragraph argues that because callback messages don't
carry bulk data, there is no problem with using RPC/RDMA and, in
particular, with using RDMA_MSG provided that the buffer sizes are
negotiated correctly.

So the questions are:

1) Where is the discussion of the merits for and against adding
bi-directional support to the xprtrdma layer in Linux? What is the
showstopper preventing implementation of a design based around
RFC5667?

2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)

All I've found so far on googling these questions is a 5 1/2 year old
email exchange between Tom Tucker and Ricardo where the conclusion
appears to be that we can, in time, implement both designs. However
there is no explanation of why we would want to do so.
http://comments.gmane.org/gmane.linux.nfs/22927

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData
Chuck Lever
2014-10-21 01:06:19 UTC
Permalink
Post by Trond Myklebust
Post by Chuck Lever
Hi Trond-
Post by Trond Myklebust
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
There's no benefit to it. With a side car, the server requires
few or no changes. There are no CB operations that benefit
from using RDMA. It's very quick to implement, re-using most of
the client backchannel implementation that already exists.
I've discussed this with an author of RFC 5667 [cc'd], and also
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.
Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK'd it.
But I don't see an explicit recommendation in section 5.1. Which
text are you referring to?
The very first paragraph argues that because callback messages don't
carry bulk data, there is no problem with using RPC/RDMA and, in
particular, with using RDMA_MSG provided that the buffer sizes are
negotiated correctly.
The opening paragraph is advice that applies to all forms
of NFSv4 callback, including NFSv4.0, which uses a separate
transport initiated from the NFS server. Specific advice about
NFSv4.1 bi-directional RPC is left to the next two paragraphs,
but they suggest there be dragons. I rather think this is a
warning not to "go there."
Post by Trond Myklebust
1) Where is the discussion of the merits for and against adding
bi-directional support to the xprtrdma layer in Linux? What is the
showstopper preventing implementation of a design based around
RFC5667?
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.

Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.

Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").

Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
Post by Trond Myklebust
All I've found so far on googling these questions is a 5 1/2 year old
email exchange between Tom Tucker and Ricardo where the conclusion
appears to be that we can, in time, implement both designs.
You and I spoke about this on Feb 13, 2014 during pub night.
At the time you stated that a side-car was the only spec-
compliant way to approach this. I said I would go forward
with the idea in Linux, and you did not object.
Post by Trond Myklebust
However
there is no explanation of why we would want to do so.
http://comments.gmane.org/gmane.linux.nfs/22927
I've implemented exactly what Ricardo proposed in this
Post by Trond Myklebust
Post by Chuck Lever
The thinking is that NFSRDMA could initially use a TCP callback channel.
We'll implement BIND_CONN_TO_SESSION so that the backchannel does not
need to be tied to the forechannel connection. This should address the
case where you have NFSRDMA for the forechannel and TCP for the
backchannel. BIND_CONN_TO_SESSION is also required to reestablish
dropped connections effectively (to avoid losing the reply cache).
Given what they're hoping to achieve, I'm fine with
doing a simple implementation of sessions first, then progressively
refining it.
What's the next step?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Trond Myklebust
2014-10-21 07:45:23 UTC
Permalink
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
Hi Trond-
Why aren't we doing the callbacks via RDMA as per the recommendation
in RFC5667 section 5.1?
There's no benefit to it. With a side car, the server requires
few or no changes. There are no CB operations that benefit
from using RDMA. It's very quick to implement, re-using most of
the client backchannel implementation that already exists.
I've discussed this with an author of RFC 5667 [cc'd], and also
with the implementors of an existing NFSv4.1 server that supports
RDMA. They both agree that a side car is an acceptable, or even a
preferable, way to approach backchannel support.
Also, when I discussed this with you months ago, you also felt
that a side car was better than adding backchannel support to the
xprtrdma transport. I took this approach only because you OK'd it.
Post by Chuck Lever
But I don't see an explicit recommendation in section 5.1. Which
text are you referring to?
The very first paragraph argues that because callback messages don't
carry bulk data, there is no problem with using RPC/RDMA and, in
particular, with using RDMA_MSG provided that the buffer sizes are
negotiated correctly.
The opening paragraph is advice that applies to all forms
of NFSv4 callback, including NFSv4.0, which uses a separate
transport initiated from the NFS server. Specific advice about
NFSv4.1 bi-directional RPC is left to the next two paragraphs,
but they suggest there be dragons. I rather think this is a
warning not to "go there."
All I see is a warning not to rely on XIDs to distinguish between
backchannel and forward channel RPC requests. How is that particular
to RDMA?
Post by Chuck Lever
Post by Trond Myklebust
1) Where is the discussion of the merits for and against adding
bi-directional support to the xprtrdma layer in Linux? What is the
showstopper preventing implementation of a design based around
RFC5667?
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel? I've never been presented with a complete discussion
of this alternative either from you or from anybody else.
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Post by Chuck Lever
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself. If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData
Chuck Lever
2014-10-21 17:11:26 UTC
Permalink
Post by Trond Myklebust
Post by Chuck Lever
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.

The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.

Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.

The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Post by Trond Myklebust
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.

Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.

That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.

And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).

The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?

There's also been no discussion of issues with adding a
very restricted amount of transport trunking. Can you
elaborate on the problems this could introduce?
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Trond Myklebust
2014-10-22 08:39:40 UTC
Permalink
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.
The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.
Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?
Post by Chuck Lever
The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Post by Chuck Lever
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".
Post by Chuck Lever
Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.
That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.
That's an assertion that assumes:
- we actually want to implement more transports aside from RDMA.
- implementing bi-directional transports in the RPC layer is non-simple

Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.
Post by Chuck Lever
And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).
The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?
There's also been no discussion of issues with adding a
very restricted amount of transport trunking. Can you
elaborate on the problems this could introduce?
I haven't looked into those problems and I'm not the one asking anyone
to implement trunking.
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData
Chuck Lever
2014-10-22 17:20:03 UTC
Permalink
Post by Trond Myklebust
Post by Chuck Lever
Post by Trond Myklebust
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.
The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.
Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?
The RPC header and RPC/RDMA header are separate entities, but
together can span multiple cache lines if the server has returned a
chunk list containing multiple entries.

For example, RDMA_NOMSG would send the RPC/RDMA header
via RDMA SEND with a chunk list that represents the RPC and NFS
payload. That list could make the header larger than 32 bytes.

I expect that any callback that involves more than 1024 bytes of
RPC payload will need to use RDMA_NOMSG. A long device
info list might fit that category?
Post by Trond Myklebust
Post by Chuck Lever
The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
least 2 extra requirements:
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".
The server side, then, is non-invasive. The client side makes minor
changes to state management.
Post by Trond Myklebust
Post by Chuck Lever
Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.
That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.
That's an assertion that assumes:
- we actually want to implement more transports aside from RDMA
So you no longer consider RPC/SCTP a possibility?
Post by Trond Myklebust
- implementing bi-directional transports in the RPC layer is non-simple

I don't care to generalize about that. In the RPC/RDMA case, there
are some complications that make it non-simple, but not impossible.
So we have an example of a non-simple case, IMO.
Post by Trond Myklebust
Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.
Post by Chuck Lever
And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).
The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?
Based on your objections, it appears that "no upper layer
impact" is a hard design requirement. I will take this as a
NACK for the side-car approach.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Trond Myklebust
2014-10-22 20:53:21 UTC
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
Post by Trond Myklebust
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.
The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.
Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?
The RPC header and RPC/RDMA header are separate entities, but
together can span multiple cache lines if the server has returned a
chunk list containing multiple entries.
For example, RDMA_NOMSG would send the RPC/RDMA header
via RDMA SEND with a chunk list that represents the RPC and NFS
payload. That list could make the header larger than 32 bytes.
I expect that any callback that involves more than 1024 bytes of
RPC payload will need to use RDMA_NOMSG. A long device
info list might fit that category?
Right, but are there any callbacks that would do that? AFAICS, most of
them are CB_SEQUENCE+(PUT_FH+CB_do_some_recall_operation_on_this_file
| some single CB_operation)

The point is that we can set finite limits on the size of callbacks in
the CREATE_SESSION. As long as those limits are reasonable (and 1K
does seem more than reasonable for existing use cases) then why
shouldn't we be able to expect the server to use RDMA_MSG?
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.
Post by Chuck Lever
Post by Trond Myklebust
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Post by Chuck Lever
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Would this restrictive form of trunking present a problem?
Post by Trond Myklebust
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Post by Chuck Lever
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
Post by Trond Myklebust
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
Post by Chuck Lever
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Post by Trond Myklebust
Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".
Post by Chuck Lever
The server side, then, is non-invasive. The client side makes minor
changes to state management.
Post by Trond Myklebust
Post by Chuck Lever
Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.
That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.
- we actually want to implement more transports aside from RDMA
So you no longer consider RPC/SCTP a possibility?
I'd still like to consider it, but the whole point would be to _avoid_
doing trunking in the NFS layer. SCTP does trunking/multi-pathing at
the transport level, meaning that we don't have to deal with tracking
connections, state, replaying messages, etc.
Doing bi-directional RPC with SCTP is not an issue, since the
transport is fully symmetric.
Post by Chuck Lever
Post by Trond Myklebust
- implementing bi-directional transports in the RPC layer is non-simple
Post by Chuck Lever
I don't care to generalize about that. In the RPC/RDMA case, there
are some complications that make it non-simple, but not impossible.
So we have an example of a non-simple case, IMO.
Post by Trond Myklebust
Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.
Post by Chuck Lever
And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).
The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?
Based on your objections, it appears that "no upper layer
impact" is a hard design requirement. I will take this as a
NACK for the side-car approach.
There is not a hard NACK yet, but I am asking for stronger
justification. I do _not_ want to find myself in a situation 2 or 3
years down the road where I have to argue against someone telling me
that we additionally have to implement callbacks over IB/RDMA because
the TCP sidecar is an incomplete solution. We should do either one or
the other, but not both...

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust-7I+n7zu2hftEKMMhf/***@public.gmane.org
Chuck Lever
2014-10-22 22:38:42 UTC
Post by Trond Myklebust
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
There is no show-stopper (see Section 5.1, after all). It's
simply a matter of development effort: a side-car is much
less work than implementing full RDMA backchannel support for
both a client and server, especially since TCP backchannel
already works and can be used immediately.
Also, no problem with eventually implementing RDMA backchannel
if the complexity, and any performance overhead it introduces in
the forward channel, can be justified. The client can use the
CREATE_SESSION flags to detect what a server supports.
What complexity and performance overhead does it introduce in the
forward channel?
The benefit of RDMA is that there are opportunities to
reduce host CPU interaction with incoming data.
Bi-direction requires that the transport look at the RPC
header to determine the direction of the message. That
could have an impact on the forward channel, but it's
never been measured, to my knowledge.
The reason this is more of an issue for RPC/RDMA is that
a copy of the XID appears in the RPC/RDMA header to avoid
the need to look at the RPC header. That's typically what
implementations use to steer RPC reply processing.
Often the RPC/RDMA header and RPC header land in
disparate buffers. The RPC/RDMA reply handler looks
strictly at the RPC/RDMA header, and runs in a tasklet
usually on a different CPU. Adding bi-direction would mean
the transport would have to peek into the upper layer
headers, possibly resulting in cache line bouncing.
Under what circumstances would you expect to receive a valid NFSv4.1
callback with an RDMA header that spans multiple cache lines?
The RPC header and RPC/RDMA header are separate entities, but
together can span multiple cache lines if the server has returned a
chunk list containing multiple entries.
For example, RDMA_NOMSG would send the RPC/RDMA header
via RDMA SEND with a chunk list that represents the RPC and NFS
payload. That list could make the header larger than 32 bytes.
I expect that any callback that involves more than 1024 bytes of
RPC payload will need to use RDMA_NOMSG. A long device
info list might fit that category?
Right, but are there any callbacks that would do that? AFAICS, most of
them are CB_SEQUENCE+(PUT_FH+CB_do_some_recall_operation_on_this_file
| some single CB_operation)
That is a question only a pNFS layout developer can answer.

Allowing larger CB operations might be important. I thought
I heard Matt list a couple of examples that might move bulk
data via CB, but probably none that have implementations
currently.

I'm not familiar with block or flex-file, so I can't make
any kind of guess about those.
The point is that we can set finite limits on the size of callbacks in
the CREATE_SESSION. As long as those limits are reasonable (and 1K
does seem more than reasonable for existing use cases) then why
shouldn't we be able to expect the server to use RDMA_MSG?
The spec allows both RDMA_MSG and RDMA_NOMSG for CB
RPC. That provides some implementation flexibility, but
it also means either end can use either MSG type. An
interoperable CB service would have to support all
scenarios.

RDMA_NOMSG can support small or large payloads. RDMA_MSG
can support only a payload that fits in the receiver's
pre-posted buffer (and that includes the RPC and NFS
headers) because NFS CB RPCs are not allowed to use
read or write chunks.

Since there are no implementations, currently, I was
hoping everyone might agree to stick with using only
RDMA_NOMSG for NFSv4.1 CB RPCs on RPC/RDMA. That would
mean there was a high limit on CB RPC payload size, and
RDMA READ would be used to move CB RPC calls and replies
in all cases.

NFS uses NOMSG so infrequently that it would be a good
way to limit churn in the transport's hot paths.
Particularly NFS READ and WRITE are required to use
RDMA_MSG with their payload encoded in chunks. If the
incoming message is MSG, then clearly it can't be a
reverse RPC, and there's no need to go looking for it
(as long as we all agree on the "CBs use only NOMSG"
convention).

The receiver can tell just by looking at the RPC/RDMA
header that extra processing won't be needed in the
common case.
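If the "CBs use only NOMSG" convention were adopted, the hot-path test could be a single comparison on the message type already present in the RPC/RDMA header. A minimal sketch, assuming the rdma_proc message-type values from RFC 5666; `rpcrdma_is_reverse_candidate()` is a hypothetical helper name, not existing kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Message types carried in the RPC/RDMA header (rdma_proc, RFC 5666) */
enum rdma_proc {
	RDMA_MSG   = 0,	/* RPC message follows the RPC/RDMA header inline */
	RDMA_NOMSG = 1,	/* RPC message is transferred entirely via chunks */
	RDMA_ERROR = 4,	/* transport-level error report */
};

/*
 * Sketch of the proposed dispatch rule: forward-channel NFS READ and
 * WRITE replies always arrive as RDMA_MSG, so only RDMA_NOMSG
 * messages would ever need the extra peek into the RPC header to
 * check for a backchannel call.
 */
int rpcrdma_is_reverse_candidate(uint32_t proc)
{
	return proc == RDMA_NOMSG;
}
```

Under this convention, the common-case (RDMA_MSG) receive path is unchanged.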

Clearly we need a prototype or two to understand these
issues. And probably some WG discussion is warranted.
Post by Chuck Lever
Post by Chuck Lever
The complexity would be the addition of over a hundred
new lines of code on the client, and possibly a similar
amount of new code on the server. Small, perhaps, but
not insignificant.
Until there are RDMA users, I care a lot less about code changes to
xprtrdma than to NFS.
Post by Chuck Lever
Post by Trond Myklebust
Post by Chuck Lever
2) Why do we instead have to solve the whole backchannel problem in
the NFSv4.1 layer, and where is the discussion of the merits for and
against that particular solution? As far as I can tell, it imposes at
a) NFSv4.1 client+server must have support either for session
trunking or for clientid trunking
Very minimal trunking support. The only operation allowed on
the TCP side-car's forward channel is BIND_CONN_TO_SESSION.
Bruce told me that associating multiple transports to a
clientid/session should not be an issue for his server (his
words were "if that doesn't work, it's a bug").
Would this restrictive form of trunking present a problem?
b) NFSv4.1 client must be able to set up a TCP connection to the
server (that can be session/clientid trunked with the existing RDMA
channel)
Also very minimal changes. The changes are already done,
posted in v1 of this patch series.
I'm not asking for details on the size of the changesets, but for a
justification of the design itself.
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Please define your use of the word "invasive" in the above context. To
me "invasive" means "will affect code that is in use by others".
The server side, then, is non-invasive. The client side makes minor
changes to state management.
Post by Chuck Lever
Most servers would require almost no change. Linux needs
only a bug fix or two. Effectively zero-impact for
servers that already support NFSv4.0 on RDMA to get
NFSv4.1 and pNFS on RDMA, with working callbacks.
That's really all there is to it. It's almost entirely a
practical consideration: we have the infrastructure and
can make it work in just a few lines of code.
Post by Trond Myklebust
If it is possible to confine all
the changes to the RPC/RDMA layer, then why consider patches that
change the NFSv4.1 layer at all?
The fast new transport bring-up benefit is probably the
biggest win. A TCP side-car makes bringing up any new
transport implementation simpler.
- we actually want to implement more transports aside from RDMA
So you no longer consider RPC/SCTP a possibility?
I'd still like to consider it, but the whole point would be to _avoid_
doing trunking in the NFS layer. SCTP does trunking/multi-pathing at
the transport level, meaning that we don't have to deal with tracking
connections, state, replaying messages, etc.
Doing bi-directional RPC with SCTP is not an issue, since the
transport is fully symmetric.
Post by Chuck Lever
- implementing bi-directional transports in the RPC layer is non-simple
Post by Chuck Lever
I don't care to generalize about that. In the RPC/RDMA case, there
are some complications that make it non-simple, but not impossible.
So we have an example of a non-simple case, IMO.
Right now, the benefit is only to RDMA users. Nobody else is asking
for such a change.
Post by Chuck Lever
And, RPC/RDMA offers zero performance benefit for
backchannel traffic, especially since CB traffic would
never move via RDMA READ/WRITE (as per RFC 5667 section
5.1).
The primary benefit to doing an RPC/RDMA-only solution
is that there is no upper layer impact. Is that a design
requirement?
Based on your objections, it appears that "no upper layer
impact" is a hard design requirement. I will take this as a
NACK for the side-car approach.
There is not a hard NACK yet, but I am asking for stronger
justification. I do _not_ want to find myself in a situation 2 or 3
years down the road where I have to argue against someone telling me
that we additionally have to implement callbacks over IB/RDMA because
the TCP sidecar is an incomplete solution. We should do either one or
the other, but not both…
It is impossible to predict the future. However, I'm not
sure there's a problem building both eventually, especially
because there is no spec guidance I'm aware of about which
bi-directional RPC mechanisms MUST be supported for NFS/RDMA.

I might be wrong, but we have enough mechanism in
CREATE_SESSION for a client with bi-directional RPC/RDMA
support to detect a server with no bi-directional RPC/RDMA,
and use side-car only in that case.
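That detection could be a simple check of the csr_flags returned by CREATE_SESSION. A hedged sketch, using the flag values defined in RFC 5661 (section 18.36); `nfs4_wants_sidecar()` is an illustrative name, not proposed kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* csr_flags bits returned by CREATE_SESSION (RFC 5661, section 18.36) */
#define SESSION4_PERSIST	0x001
#define SESSION4_BACK_CHAN	0x002
#define SESSION4_RDMA		0x004

/*
 * Sketch: a client that supports bi-directional RPC/RDMA requests
 * SESSION4_BACK_CHAN at session creation; if the server clears that
 * flag in its reply, the client falls back to binding a TCP side-car
 * connection for the backchannel.
 */
int nfs4_wants_sidecar(uint32_t csr_flags)
{
	return (csr_flags & SESSION4_BACK_CHAN) == 0;
}
```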

If we need more discussion, I can drop the side-car patches
for 3.19 and do some more research.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



J. Bruce Fields
2014-10-23 13:32:29 UTC
Post by Chuck Lever
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Something I'm confused about: is bidirectional RPC/RDMA optional or
mandatory for servers to implement?

Something somewhere has to be mandatory if we want to guarantee a
working backchannel between any two implementations.

--b.
Chuck Lever
2014-10-23 13:55:28 UTC
Post by J. Bruce Fields
Post by Chuck Lever
The size of the changeset _is_ the justification. It's
a much less invasive change to add a TCP side-car than
it is to implement RDMA backchannel on both server and
client.
Something I'm confused about: is bidirectional RPC/RDMA optional or
mandatory for servers to implement?
IMO bi-directional RPC/RDMA is not required anywhere in RFCs 5666
(the RPC/RDMA spec) or 5667 (the NFS on RPC/RDMA spec).
Post by J. Bruce Fields
Something somewhere has to be mandatory if we want to guarantee a
working backchannel between any two implementations.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



Chuck Lever
2014-10-16 19:40:08 UTC
When NFS_CS_SIDECAR_BACKCHANNEL is set, the client needs to bind
both the forward and the back channel during recovery. Two separate
BC2S operations are needed, with different arguments.

Prepare nfs4_proc_bind_conn_to_session() by creating a complete arg
struct for the BIND_CONN_TO_SESSION XDR encoder.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4proc.c | 6 +++++-
fs/nfs/nfs4xdr.c | 16 +++++++++-------
include/linux/nfs_xdr.h | 6 ++++++
3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index a1243e7..9a8ffb7 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -6578,11 +6578,15 @@ nfs41_same_server_scope(struct nfs41_server_scope *a,
int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred)
{
int status;
+ struct nfs41_bind_conn_to_session_args args = {
+ .client = clp,
+ .dir = NFS4_CDFC4_BACK_OR_BOTH,
+ };
struct nfs41_bind_conn_to_session_res res;
struct rpc_message msg = {
.rpc_proc =
&nfs4_procedures[NFSPROC4_CLNT_BIND_CONN_TO_SESSION],
- .rpc_argp = clp,
+ .rpc_argp = &args,
.rpc_resp = &res,
.rpc_cred = cred,
};
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index e13b59d..6dfdd14 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1733,17 +1733,19 @@ static void encode_secinfo(struct xdr_stream *xdr, const struct qstr *name, stru
#if defined(CONFIG_NFS_V4_1)
/* NFSv4.1 operations */
static void encode_bind_conn_to_session(struct xdr_stream *xdr,
- struct nfs4_session *session,
- struct compound_hdr *hdr)
+ struct nfs41_bind_conn_to_session_args *args,
+ struct compound_hdr *hdr)
{
+ struct nfs4_session *session = args->client->cl_session;
__be32 *p;

encode_op_hdr(xdr, OP_BIND_CONN_TO_SESSION,
decode_bind_conn_to_session_maxsz, hdr);
encode_opaque_fixed(xdr, session->sess_id.data, NFS4_MAX_SESSIONID_LEN);
+
p = xdr_reserve_space(xdr, 8);
- *p++ = cpu_to_be32(NFS4_CDFC4_BACK_OR_BOTH);
- *p = 0; /* use_conn_in_rdma_mode = False */
+ *p++ = cpu_to_be32(args->dir);
+ *p = args->use_conn_in_rdma_mode ? xdr_one : xdr_zero;
}

static void encode_op_map(struct xdr_stream *xdr, struct nfs4_op_map *op_map)
@@ -2766,14 +2768,14 @@ static void nfs4_xdr_enc_fsid_present(struct rpc_rqst *req,
*/
static void nfs4_xdr_enc_bind_conn_to_session(struct rpc_rqst *req,
struct xdr_stream *xdr,
- struct nfs_client *clp)
+ struct nfs41_bind_conn_to_session_args *args)
{
struct compound_hdr hdr = {
- .minorversion = clp->cl_mvops->minor_version,
+ .minorversion = args->client->cl_mvops->minor_version,
};

encode_compound_hdr(xdr, req, &hdr);
- encode_bind_conn_to_session(xdr, clp->cl_session, &hdr);
+ encode_bind_conn_to_session(xdr, args, &hdr);
encode_nops(&hdr);
}

diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 0040629..af17763 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1140,6 +1140,12 @@ struct nfs41_state_protection {
struct nfs4_op_map allow;
};

+struct nfs41_bind_conn_to_session_args {
+ struct nfs_client *client;
+ u32 dir;
+ bool use_conn_in_rdma_mode;
+};
+
#define NFS4_EXCHANGE_ID_LEN (48)
struct nfs41_exchange_id_args {
struct nfs_client *client;

Chuck Lever
2014-10-16 19:40:16 UTC
When recovering from a network partition, a client must identify
both the forward and backchannel it wants bound to a session.

Usually these use the same transport, which can be re-bound to the
session with a single BIND_CONN_TO_SESSION operation.

But with a sidecar backchannel, the fore and back channels use
separate transports, each of which must be bound to the session
with its own BC2S operation.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4client.c | 5 ++++-
fs/nfs/nfs4proc.c | 48 ++++++++++++++++++++++++++++++++++++++----------
2 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index b1cc35e..97cc170 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -246,7 +246,10 @@ static int nfs4_init_callback(struct nfs_client *clp)
struct rpc_xprt *xprt;
int error;

- xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
+ if (clp->cl_bc_rpcclient)
+ xprt = rcu_dereference_raw(clp->cl_bc_rpcclient->cl_xprt);
+ else
+ xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);

if (nfs4_has_session(clp)) {
error = xprt_setup_backchannel(xprt, NFS41_BC_MIN_CALLBACKS);
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 9a8ffb7..2eaf7ec 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -6569,18 +6569,16 @@ nfs41_same_server_scope(struct nfs41_server_scope *a,
return false;
}

-/*
- * nfs4_proc_bind_conn_to_session()
- *
- * The 4.1 client currently uses the same TCP connection for the
- * fore and backchannel.
- */
-int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred)
+static int _nfs4_proc_bind_conn_to_session(struct nfs_client *clp,
+ struct rpc_cred *cred,
+ struct rpc_clnt *clnt,
+ u32 dir_from_client,
+ u32 dir_from_server)
{
int status;
struct nfs41_bind_conn_to_session_args args = {
.client = clp,
- .dir = NFS4_CDFC4_BACK_OR_BOTH,
+ .dir = dir_from_client,
};
struct nfs41_bind_conn_to_session_res res;
struct rpc_message msg = {
@@ -6599,7 +6597,7 @@ int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred
goto out;
}

- status = rpc_call_sync(clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT);
+ status = rpc_call_sync(clnt, &msg, RPC_TASK_TIMEOUT);
trace_nfs4_bind_conn_to_session(clp, status);
if (status == 0) {
if (memcmp(res.session->sess_id.data,
@@ -6608,7 +6606,7 @@ int nfs4_proc_bind_conn_to_session(struct nfs_client *clp, struct rpc_cred *cred
status = -EIO;
goto out_session;
}
- if (res.dir != NFS4_CDFS4_BOTH) {
+ if (res.dir != dir_from_server) {
dprintk("NFS: %s: Unexpected direction from server\n",
__func__);
status = -EIO;
@@ -6628,6 +6626,36 @@ out:
return status;
}

+/**
+ * nfs4_proc_bind_conn_to_session - (re)bind fore/back channels to session
+ * @clp: per-server state
+ * @cred: credential for managing state
+ *
+ * Returns zero on success, or a negative errno or negative NFS4ERR.
+ */
+int nfs4_proc_bind_conn_to_session(struct nfs_client *clp,
+ struct rpc_cred *cred)
+{
+ int ret;
+
+ if (!clp->cl_bc_rpcclient)
+ return _nfs4_proc_bind_conn_to_session(clp, cred,
+ clp->cl_rpcclient,
+ NFS4_CDFC4_BACK_OR_BOTH,
+ NFS4_CDFS4_BOTH);
+
+ ret = _nfs4_proc_bind_conn_to_session(clp, cred,
+ clp->cl_bc_rpcclient,
+ NFS4_CDFC4_BACK,
+ NFS4_CDFS4_BACK);
+ if (ret)
+ return ret;
+ return _nfs4_proc_bind_conn_to_session(clp, cred,
+ clp->cl_rpcclient,
+ NFS4_CDFC4_FORE,
+ NFS4_CDFS4_FORE);
+}
+
/*
* Minimum set of SP4_MACH_CRED operations from RFC 5661 in the enforce map
* and operations we'd like to see to enable certain features in the allow map

Chuck Lever
2014-10-16 19:40:24 UTC
A CREATE_SESSION operation with SESSION4_BACK_CHAN cleared is sent
to force the server to report SEQ4_STATUS_CB_PATH_DOWN. The client
recovers by setting up a sidecar backchannel connection.

Signed-off-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+***@public.gmane.org>
---
fs/nfs/nfs4proc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 2eaf7ec..e0bcc30 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7187,6 +7187,7 @@ static int _nfs4_proc_create_session(struct nfs_client *clp,
struct nfs41_create_session_args args = {
.client = clp,
.cb_program = NFS4_CALLBACK,
+ .flags = SESSION4_PERSIST,
};
struct nfs41_create_session_res res = {
.client = clp,
@@ -7200,7 +7201,8 @@ static int _nfs4_proc_create_session(struct nfs_client *clp,
int status;

nfs4_init_channel_attrs(&args);
- args.flags = (SESSION4_PERSIST | SESSION4_BACK_CHAN);
+ if (!clp->cl_bc_rpcclient)
+ args.flags |= SESSION4_BACK_CHAN;

status = rpc_call_sync(session->clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT);
trace_nfs4_create_session(clp, status);
