In this section we briefly discuss the concept behind
the variable structure stochastic
learning automaton (VSLA) [21].
Figure 9 shows the schematic of the stochastic
learning automaton.
Abstractly, a learning automaton can be considered to be
an object that can choose from a finite number of actions.
For every action that it chooses, the random environment
in which it operates evaluates that action. A corresponding
feedback is sent to the automaton based on which the
next action is chosen. As this process progresses the automaton
learns to choose the optimal action asymptotically.
The rule used by the automaton to select its successive actions
based on the environment's response defines the stochastic
learning algorithm. An important property of the learning
automaton is its ability to improve its performance with time.
This can be used to optimize functionals which may not be known
completely.
A VLSA is described by a 5-tuple
, where
is the
set of r actions from which the automaton can choose any action
at time n denoted by .
is the set of binary response to the VSLA from the
environment for a chosen action; is a reward and corresponds to
a penalty
is
the set of probabilities
T is the stochastic learning algorithm according to which
the elements of the set p are updated at each time n, i.e.,
,
where the ith element of the set p(n) is
, , and
is the set of
penalty probabilities conditioned on the chosen action, where
, .
The stationary random environment is represented by
. The values of ci are unknown
and it is assumed that C has a unique minimum element. For
a stationary random environment the penalty probabilities
are constant. The working of the learning automaton can be described
as follows. Initially at n=1 one of the actions is chosen by
the automaton at random with a
given probability. This action is then applied to the system and
the response from the environment is observed. If the response
is favorable () then the probability of choosing that
action for the next period of time is updated according to the
updating rule T. Then, another is action is chosen and the response
is observed. This process is repeated until a stopping criterion
is reached. When the learning process stops the algorithm has
learnt some characteristics of the random environment.
In this project, we consider the operator T to be linear.
When T is the linear reward-inaction algorithm it can be described
by [21]
where is the response when action i is chosen
at time n and a is the reward parameter. The idea behind these
update equations is that when a chosen action results in a reward the probability
of choosing that action in the future is increased by a small amount and
the other action probabilities are decreased.
If a penalty is received then the probabilities are not updated.
We now establish the theoretical basis of LRI.
Let the average penalty received by the automaton be
where E[.] is the expectation operator.
Definition 1241
[21]
A learning automaton is absolutely expedient if ,, and for all possible
ci, .
This means that absolutely expedient learning
results in a decreasing average penalty function.
Definition 1245
[21]
A learning automaton is -optimal if
for any arbitrary
.
That is, using -optimal learning we can obtain a solution
whose cost is arbitrarily close to the optimal cost.
It is shown in [21] that the unit vectors in are
absorbing states of the Markov process for LRI learning.
The process gets absorbed
w.p. 1 under certain conditions. From this we conclude that the
learning process using LRI terminates w.p. 1. Further, LRI is
absolutely expedient and -optimal.