Apple recently updated the docs on dispatch_once to point out
that the storage for the dispatch_once_t must be static or global,
and not memory that was ever used for anything else before, as the
implementation doesn't use a memory barrier. So we drop
dispatch_once, create the semaphore when it is first needed, and
use an atomic swap to deal with any threading races.
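
A minimal sketch of the create-when-needed pattern, assuming a C11
compare-and-swap as the "atomic swap". The fake_sem type and the
fake_sem_create, fake_sem_destroy, get_sem and g_sem names are
hypothetical stand-ins for illustration, not the code in this change;
the real code would create the actual semaphore instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real semaphore type. */
typedef struct { int count; } fake_sem;

static fake_sem *fake_sem_create(int count) {
    fake_sem *s = malloc(sizeof *s);
    assert(s != NULL);
    s->count = count;
    return s;
}

static void fake_sem_destroy(fake_sem *s) { free(s); }

/* Slot for the lazily created semaphore; NULL means "not created yet". */
static _Atomic(fake_sem *) g_sem = NULL;

static fake_sem *get_sem(void) {
    /* Fast path: some thread has already published a semaphore. */
    fake_sem *sem = atomic_load(&g_sem);
    if (sem != NULL) {
        return sem;
    }

    /* Slow path: create a candidate and try to publish it. */
    fake_sem *fresh = fake_sem_create(0);
    fake_sem *expected = NULL;
    if (atomic_compare_exchange_strong(&g_sem, &expected, fresh)) {
        return fresh; /* We won the race; our semaphore is the shared one. */
    }

    /* Another thread published first: discard ours and use the winner,
       which the failed compare-and-swap stored into `expected`. */
    fake_sem_destroy(fresh);
    return expected;
}
```

Every caller goes through get_sem(), so racing threads may each create
a candidate, but only one is ever published and the losers free theirs.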