This paper studies online shortest path routing over multi-hop networks. Link costs or delays are time-varying and modeled by independent and identically distributed random processes, whose parameters are initially unknown. The parameters, and hence the optimal path, can only be estimated by routing packets through the network and observing the realized delays. Our aim is to find a routing policy that minimizes the regret (the cumulative difference of expected delay) between the path chosen by the policy and the unknown optimal path. We formulate the problem as a combinatorial bandit optimization problem and consider several scenarios that differ in where routing decisions are made and in the information available when making the decisions. For each scenario, we derive a tight asymptotic lower bound on the regret that has to be satisfied by any online routing policy. These bounds help us to understand the performance improvements we can expect when (i) taking routing decisions at each hop rather than at the source only, and (ii) observing per-link delays rather than end-to-end path delays. In particular, we show that (i) is of no use while (ii) can have a spectacular impact. Three algorithms, with a trade-off between computational complexity and performance, are proposed. The regret upper bounds of these algorithms improve over those of the existing algorithms, and they significantly outperform state-of-the-art algorithms in numerical experiments.
In this paper, we develop optimal policies for deciding when a wireless node with radio frequency (RF) energy harvesting (EH) capabilities should try and harvest ambient RF energy. While the idea of RF-EH is appealing, it is not always beneficial to attempt to harvest energy; in environments where the ambient energy is low, nodes could consume more energy being awake with their harvesting circuits turned on than what they can extract from the ambient radio signals; it is then better to enter a sleep mode until the ambient RF energy increases. Towards this end, we consider a scenario with intermittent energy arrivals and a wireless node that wakes up for a period of time (herein called the time-slot) and harvests energy. If enough energy is harvested during the time-slot, then the harvesting is successful and excess energy is stored; however, if there does not exist enough energy the harvesting is unsuccessful and energy is lost. We assume that the ambient energy level is constant during the time-slot, and changes at slot boundaries. The energy level dynamics are described by a two-state Gilbert-Elliott Markov chain model, where the state of the Markov chain can only be observed during the harvesting action, and not when in sleep mode. Two scenarios are studied under this model. In the first scenario, we assume that we have knowledge of the transition probabilities of the Markov chain and formulate the problem as a Partially Observable Markov Decision Process (POMDP), where we find a threshold-based optimal policy. In the second scenario, we assume that we don't have any knowledge about these parameters and formulate the problem as a Bayesian adaptive POMDP; to reduce the complexity of the computations we also propose a heuristic posterior sampling algorithm. The performance of our approaches is demonstrated via numerical examples.