It looks like there is a data dependency on the preceding load. It
might be worth looking into prefetching the data, either manually or
maybe with -fprefetch-loop-arrays?
I agree with Matt on needing more info, but I also agree with Will that a pre-fetch could speed things up.
The beginning of the block is only a few instructions up, and the address for the VLDR is computed by almost all of the instructions in the block, in a chain. I'm assuming (without evidence) that it's the VLDR itself that is taking all that time to release S15 for the VSUB.
Furthermore, the VLDR was hit 100x less often than the VSUB, which hints that it isn't stalled for long waiting on anything, so the instructions before it that calculate the offset are pretty much streamlined. That's another hint that it's the VLDR itself that is taking that long.
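For what it's worth, here is roughly what manual prefetching could look like in C. This is only a sketch: I haven't seen the real source, so the kernel, the names (`sum_of_differences`, `table`, `idx`) and the `PREFETCH_DISTANCE` value are all made up for illustration. The only real API used is `__builtin_prefetch`, which GCC and Clang both provide. The idea is to request the cache line a few iterations before the load that feeds the subtraction needs it.

```c
#include <stddef.h>

/* Hypothetical value: how many iterations ahead to prefetch.
   The right distance depends on the core and memory latency,
   so treat this as a starting point to be tuned by measurement. */
#define PREFETCH_DISTANCE 8

/* Hypothetical kernel (not the original code): each iteration loads
   table[idx[i]] through a computed address and feeds it straight into
   a subtract, similar to a VLDR followed by a dependent VSUB. */
float sum_of_differences(const float *x, const float *table,
                         const int *idx, size_t n)
{
    float acc = 0.0f;

    for (size_t i = 0; i < n; ++i) {
        /* Ask for the line needed PREFETCH_DISTANCE iterations from now.
           Second argument 0 = prefetch for read, third argument 1 = low
           temporal locality. The bounds check keeps idx[] in range. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&table[idx[i + PREFETCH_DISTANCE]], 0, 1);

        acc += x[i] - table[idx[i]];
    }
    return acc;
}
```

Whether this helps at all depends on the load actually missing the cache, which is exactly the part we don't have evidence for yet, so it would need to be profiled before and after.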