First of all, thanks for the questions. To answer question 2: you are right, the context manager should already take care of it, but I personally like to add an extra sync to be sure. You can omit it in your own scripts.
Question 1 was really interesting, and I had to go back to the docs to understand it better. Your observation is correct: the cudaMemcpyAsync vanishes and there is only one kernel launch, ..._ldg8_relu_f2f_stages_64x3_nn. It turns out that the new kernel belongs to the relu family and handles the EPILOGUE properly, because the bias now falls under the epilogue contract: a bias that is broadcast across rows. Previously the bias was 2D and did not satisfy that contract, which is why there was an extra cudaMemcpyAsync.
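To make the bias contract concrete, here is a minimal plain-Python sketch of the math only. It does not call PyTorch or cuBLASLt, and the function names and toy values are made up for illustration. It shows that a 1D bias broadcast across rows and a full 2D bias holding the same values give the same relu(A @ B + bias), so the difference you saw in the trace is about which form the fused epilogue accepts, not about the result.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (M x K) and b (K x N)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linear_relu_1d_bias(a, b, bias):
    """relu(a @ b + bias) with a 1D bias of length N, broadcast across every row."""
    return [
        [max(v + bias[j], 0.0) for j, v in enumerate(row)]
        for row in matmul(a, b)
    ]


def linear_relu_2d_bias(a, b, bias2d):
    """Same math, but the bias is a full M x N matrix added element by element."""
    return [
        [max(v + bias2d[i][j], 0.0) for j, v in enumerate(row)]
        for i, row in enumerate(matmul(a, b))
    ]


# Toy inputs (hypothetical values, chosen so the arithmetic is exact).
a = [[1.0, -2.0], [3.0, 4.0]]
b = [[0.5, 1.0], [-1.0, 2.0]]
bias = [0.25, -10.0]                 # 1D: one value per output column
bias2d = [list(bias) for _ in a]     # 2D: the same row repeated for every output row

out_1d = linear_relu_1d_bias(a, b, bias)
out_2d = linear_relu_2d_bias(a, b, bias2d)
print(out_1d == out_2d)
```

Both calls return the same matrix. As far as I understand it, only the 1D form matches what the fused bias epilogue expects, which is why the 2D version needed the extra copy.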
I hope this makes sense.