Subject: SFTP download speeds

From: Daniel Stenberg
Date: Tue, 14 Dec 2010 13:49:01 +0100 (CET)

Hi friends,

I've been working on introducing the asynch approach for SFTP downloads.
Downloads are a bit different in nature than uploads as for example we will
normally get a largish buffer in each call and we will read until EOF.

I first experimented with sending very large FXP_READ requests, and I learned
that OpenSSH only sends back 64K of data and never more. The SFTP spec is a bit
vaguely written but seems to say that implementations are only obliged to
support 32K.

So, to read SFTP really fast we create a queue of outgoing READ packets,
send them out one by one and return data as soon as we get any. This way we
get the pipelining effect we want and thus circumvent the waiting. The fact
that we don't know the size beforehand, combined with this sort of pre-reading,
makes us send a lot of READs beyond the end of the file. That's definitely
room for improvement.
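
The over-read described above can be illustrated with a toy model. This is only
a sketch, not the libssh2 code: the function name and its parameters are made up
for illustration, replies are assumed to arrive in order, and each READ is
assumed to return either a full chunk, a final short chunk, or EOF.

```python
from collections import deque

def simulate_pipelined_read(file_size, chunk, max_outstanding):
    """Toy model: keep up to max_outstanding READs in flight until EOF is seen.

    Returns (bytes_received, reads_sent, reads_answered_with_eof).
    """
    in_flight = deque()   # offsets of READs sent but not yet answered
    next_offset = 0
    received = 0
    reads_sent = 0
    eof_replies = 0
    eof_seen = False

    while True:
        # Fill the pipe: the file size is unknown, so keep asking further on.
        while not eof_seen and len(in_flight) < max_outstanding:
            in_flight.append(next_offset)
            next_offset += chunk
            reads_sent += 1
        if not in_flight:
            break
        # Handle the oldest outstanding reply.
        offset = in_flight.popleft()
        if offset >= file_size:
            eof_seen = True          # stop queueing, but drain what was sent
            eof_replies += 1
        else:
            received += min(chunk, file_size - offset)
    return received, reads_sent, eof_replies

print(simulate_pipelined_read(100000, 30000, 8))   # (100000, 12, 8)
```

In this model a 100000 byte file needs four data READs plus one that hits EOF,
yet twelve are sent because the pipe is kept full until the first EOF comes
back.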


    (As usual all numbers are rough and possibly wrong due to my mistakes. I
    ran all libssh2 tests using debug builds (without any kind of compiler
    optimizations enabled) and some of them used a fair amount of printf output
    during their operations.)

  - I made the pre-reading use 'buffer_size * 4' as maximum outstanding reads.

  - I used the sftp_nonblock.c code and bumped its own buffer to 1MB - yes that
    makes libssh2 pre-read 4MB! Setting down the max to 'buffer_size * 4' did
    cut the transfer speed by 20%! 4MB makes ~135 outstanding READ packets when
    30000 is requested in each.

  - I modified the code to not write() the received data anywhere

  - For my test with the 1.2.7 code, I modified the buffer to 100K just to make
    sure it was as big as possible for that code.
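
  The "~135 outstanding READ packets" figure in the list above is simple
  division. A small sketch of that arithmetic follows; the helper name is made
  up, and whether "1MB" means 10^6 or 2^20 bytes is an assumption either way,
  which is why the result lands between 134 and 140 rather than on one exact
  number.

```python
import math

def outstanding_reads(read_ahead_bytes, request_size):
    """Number of READ requests needed to cover read_ahead_bytes."""
    return math.ceil(read_ahead_bytes / request_size)

# 'buffer_size * 4' with a 1MB buffer, 30000 bytes asked for in each READ
print(outstanding_reads(4 * 1024 * 1024, 30000))   # 140 (1MB taken as 2**20)
print(outstanding_reads(4 * 1000 * 1000, 30000))   # 134 (1MB taken as 10**6)
```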


  To simulate a far away server, I used a nice new trick I've learned to add
  RTT time:

   $ tc qdisc add dev lo root handle 1:0 netem delay 100msec

  Restore it back to normal again with:

   $ tc qdisc del dev lo root

  The added 100 millisecond delay here is once for each way, so this makes a
  200ms RTT when I ping localhost.

  A test with the original 1.2.7 code first:

   Got 10240000 bytes in 64238 ms = 159407.2 bytes/sec

  Yes, it really does perform that terribly. OpenSSH's sftp tool does the
  upload at 7.5MB/sec over the same connection.

  My first test with my new code, using the 4MB/30000 sizes:

   Got 102400000 bytes in 20585 ms = 4974496.0 bytes/sec

  Correct. Check the number of zeroes. Ten times the data in a third of the
  time: 31 times faster in total...
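
  The two results above can be re-derived from the logged byte counts and
  times. This is only a check of the arithmetic; the helper name is made up.

```python
def bytes_per_sec(num_bytes, elapsed_ms):
    """Throughput in bytes/sec from a byte count and a time in milliseconds."""
    return num_bytes * 1000.0 / elapsed_ms

old = bytes_per_sec(10240000, 64238)     # the 1.2.7 run
new = bytes_per_sec(102400000, 20585)    # new code, 4MB/30000 sizes
print(f"{old:.1f} {new:.1f} {new / old:.1f}x")   # 159407.2 4974496.0 31.2x
```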

  So I started to experiment with sizes. My thinking is that with a 200 ms
  latency, we might want more than 200 requests in the pipe to be really
  efficient. And what do you know? If I cut down the outgoing data requests to
  ask for just 2000 bytes per "piece" I'm able to bump it up another 40%:

   Got 102400000 bytes in 14695 ms = 6968356.6 bytes/sec

  At almost 7MB/sec we're now very close to OpenSSH and roughly 43 times faster
  than 1.2.7...
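
  One textbook way to reason about how many requests belong "in the pipe" is
  the bandwidth-delay product: to sustain a given rate, roughly rate x RTT
  bytes have to be outstanding. The sketch below applies that to the 200 ms
  RTT and a 7MB/sec target (taken as 7000000 bytes/sec). It is an idealized
  lower bound that ignores server processing time and SSH channel windows, the
  function name is made up, and it does not by itself explain why the smaller
  pieces measured faster here.

```python
def requests_in_flight(target_bytes_per_sec, rtt_ms, request_size):
    """Minimum outstanding requests to sustain the target rate at this RTT."""
    in_flight_bytes = target_bytes_per_sec * rtt_ms // 1000
    return -(-in_flight_bytes // request_size)   # ceiling division on ints

for size in (30000, 2000):
    print(size, requests_in_flight(7000000, 200, size))
# 30000 47
# 2000 700
```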


  When I removed the added latency again and ran the test against localhost my
  test app seemed to get a quite stable 25MB/sec while OpenSSH ran like the
  wind at 44MB/sec. I've tried changing the packet sizes between 2000 and 30000
  as I suspected that localhost might perform better with larger sizes there,
  but I didn't see any significant difference. I believe this difference is
  more due to something in our regular transport/channel handling, as we are
  noticeably slower than OpenSSH already with plain SCP, and as long as that is
  the case we can't make SFTP compare either.


  When we use this approach we have a significant over-read for small files. If
  we for example were to write an application that moves over a directory with
  100 files, each being 20 bytes, we would perform terribly slowly and waste a
  lot of bandwidth.


  I think that we should consider having the SFTP code do an SSH_FXP_STAT
  query first to figure out the size of the remote file so that _no_
  "over-read" will be done and thus there will be no punishment for small
  files. Of course this will then not work exactly like today in cases when for
  example the file is being written to while the download begins.
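
  If the size is known up front, the READs can be planned so that none go past
  the end of the file. A minimal sketch of that idea follows, assuming the size
  has already been obtained (for example from an SSH_FXP_STAT reply) and does
  not change during the transfer. The function name is made up and this is not
  how libssh2 implements it.

```python
def plan_reads(file_size, request_size):
    """List of (offset, length) READs that cover exactly file_size bytes."""
    return [(off, min(request_size, file_size - off))
            for off in range(0, file_size, request_size)]

print(plan_reads(100000, 30000))
# [(0, 30000), (30000, 30000), (60000, 30000), (90000, 10000)]

# The directory with 100 files of 20 bytes each: one READ per file.
print(sum(len(plan_reads(20, 30000)) for _ in range(100)))   # 100
```

  A file that grows after the size was fetched would be cut short by such a
  plan, which is the behaviour change mentioned above.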

  I think we should consider an API that limits or disables this read-ahead
  concept for small memory situations or just situations where it doesn't
  behave in a way that is favourable to the application.


  I'll be committing my changes soonish. I have come to think of a few quirks I
  want to look over first - not really related to my changes but I think my
  changes expose these problems more.

  I would really appreciate it if everyone would consider taking the new
  code for a little spin to see in which ways it breaks and what mistakes I
  haven't yet found myself. My tests seem to run rather solidly, but I have a
  rather limited test environment and quite likely too poor an imagination to
  cause the real disasters!

Received on 2010-12-14