| action\state | stimulus 1 | stimulus 2 |
|---|---|---|
| go | reward (juice) | aversive (saline) |
| no-go | no reward | no reward |
The minicolumns required for this task are the following:
Three randomized presentations of each stimulus are given during training. Then follows a set of four test stimuli. Subsequently, the meaning of the stimuli is reversed on four more test stimuli.
During the training presentations, randomized go/no-go actions are simulated. During testing, go/no-go actions are controlled by the neuronal simulation. The randomized go/no-go actions may also occur when no visual stimuli are presented. During initial construction of the task simulation, go/no-go actions always occur 10 ms after a new visual stimulus appears.
Figure 1: The environment simulation for the Thorpe task.
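The reward contingencies in the table at the start of this section and the training schedule can be summarized in a short sketch. This is a minimal Python illustration, not the Catacomb environment simulation. The names `outcome` and `training_order` are hypothetical, and the sketch assumes that "randomized" means a shuffled presentation order and that reversal simply swaps which stimulus is rewarded for a go action.

```python
import random


def outcome(stimulus, action, reversed_meaning=False):
    """Return the outcome for a stimulus (1 or 2) and an action ("go" or "no-go")."""
    if action == "no-go":
        return "no reward"
    # Assumption: reversal swaps the stimulus for which go is rewarded.
    rewarded_stimulus = 2 if reversed_meaning else 1
    if stimulus == rewarded_stimulus:
        return "reward (juice)"
    return "aversive (saline)"


def training_order(rng, presentations_per_stimulus=3):
    """Return a shuffled list with each stimulus presented the same number of times."""
    order = [1, 2] * presentations_per_stimulus
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    rng = random.Random(0)
    for stimulus in training_order(rng):
        action = rng.choice(["go", "no-go"])  # randomized action during training
        print(stimulus, action, outcome(stimulus, action))
```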
The spike trains produced for stimulus 1, stimulus 2 and reward states, as well as for go/no-go actions appear correct. When fed into the state-action pair processing circuitry, state-action spikes are produced with one issue that must be resolved. In previous experiments, it was always assumed that the same state would not occur consecutively with the same action. Here, that does occur during training. In those instances, a state-action pair is not correctly produced. Only a repetition of the state spike is produced then.
After removing the spatial navigation environment simulation, some items of neuronal circuitry that control minicolumn input and output during the testing phase of the experiment require adjustments:
Figure 2: Spike trains produced during training in the Thorpe task simulation. Spikes with index 0 and 1 represent perception of stimulus 1 and 2 respectively. Spikes with index 2 represent reward that is received. The reward is delayed so that preceding state-action spikes achieve encoding in prefrontal minicolumns before reward is encoded. The reward is given in response to the combination of stimulus 1 and a go-action (spikes with index 3). Spikes with index 4 represent no-go action.
Figure 3: Action-state spike pairs generated in specialized neuronal circuitry for spike trains representing states, perceived stimuli and reward received, and representing actions, go/no-go. Not every spike train causes the generation of a new state-action pair. And there is no reaction to some spike trains, such as the reward spikes. These are issues that must be resolved.
There are four probable causes for the state-action pair discrepancies: (1) Only go action is forwarded to the spike pair generating circuitry. (2) The duration of each condition is too short for the time required for encoding in minicolumns, the time to which spike pair generation is tuned. (3) The reward spike train has a different spike frequency, which may affect its perceived salience at the input to the state-action spike pair generating circuitry. (4) The protocol of changes in the state and action spike trains is somewhat different from the protocol used in the spatial navigation task. There are breaks between spike trains corresponding to stimulus presentations and on some occasions both state and action change simultaneously.
Cause (1) is dealt with easily by forwarding both go action and no-go action, picking the correct connection router of output from vrat4dircontroller as the source for action input. Figure 1 includes this update and shows the correct output of go/no-go spike trains.
To prevent an effect as in (2), the stimulus protocol is changed so that every condition has a minimum of three theta cycles to be established as a spike to the prefrontal minicolumns. Also, the event sequencer that controls clearing of the STM buffers in the minicolumns is set to provide STM clear signals between each set of data during training and testing. The state-action pair generating circuitry may also need to be modified to deal with the condition where reward is perceived while a stimulus is still available. In previous experiments, only one state (places and goal) was available at a time. The sequence we wish to produce is: stimulus STATE - go/no-go ACTION - reward STATE. There is no need for another action spike following the reward state spike.
The second part of the state-action pair spikes, 375 ms delayed, for some reason includes a repetition of the state spike. Also, another action spike appears before that, on its own at t=500 ms. The initial appearance of no-go prior to the first stimulus causes a spike within the "single-spike-per-event" block of the action part of the state-action pair generating circuitry around t=120 ms. Still, the state part first produces a spike around t=248 ms, which is just after the first spikes appear in that part.
The common "new-event-detector" that responds to both state and action streams is triggered by the action event around t=120 ms. Yet, at that time there is no state input, so that no state spike is produced. A delay of 375 ms (3 cycles) causes an action spike to be produced around t=500 ms. That explains the lone action spike. Solution: Do not produce no-go action spikes before stimulus spikes begin.
The repetition of the state spike after t=500 ms, together with the action spike, is a coincidence of two parts of the process. There is a new action event at t=500 ms, since there is a switch to go-action. That is detected and causes a new state spike. At the same time, the 375 ms delayed new event detection from the previous state event reaches the action stream and produces a go-action spike. Solution: Do not switch the action at the same time as the action spike is produced. To produce the correct action spike, that action must be present, so it can only be switched on earlier. Of course, since it will still cause a new event, the only way to avoid a second state spike is for the go-action to start at the same time as the stimulus presentation.
The two solutions are tested for the first state and action spikes in minicolumns-thorpe-task.20031220b.ccm. They produce the desired STATE-ACTION spike pairs. But, it is not satisfying to have to start go-action at the same time as stimulus presentation, since in reality there must be a delay before a go/no-go action decision is made and acted upon.
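The interaction between the common new-event detector and the 375 ms delay can be illustrated with a small event-level sketch. It is an abstraction under stated assumptions, not the spiking circuitry itself: every change in either stream triggers the detector once, a STATE spike is emitted immediately if a state is present, and an ACTION spike is emitted three cycles later for whatever action is present at that time. The function names and the onset times (120, 125 and 500 ms) are rounded, illustrative values rather than the recorded spike times, and the 125 ms theta period is inferred from "375 ms (3 cycles)".

```python
THETA_MS = 125                  # inferred: 375 ms delay / 3 cycles
PAIR_DELAY_MS = 3 * THETA_MS    # delay between STATE and ACTION spikes


def value_at(changes, t):
    """Return the value of a piecewise-constant stream at time t (None if not yet started)."""
    current = None
    for onset, value in sorted(changes):
        if onset <= t:
            current = value
    return current


def spike_pairs(state_changes, action_changes):
    """Emit (time, kind, value) spikes for every detected change in either stream."""
    event_times = sorted({t for t, _ in state_changes} | {t for t, _ in action_changes})
    spikes = []
    for t in event_times:
        state = value_at(state_changes, t)
        if state is not None:
            spikes.append((t, "STATE", state))
        # The delayed detection reads whatever action is present three cycles later.
        action = value_at(action_changes, t + PAIR_DELAY_MS)
        if action is not None:
            spikes.append((t + PAIR_DELAY_MS, "ACTION", action))
    return sorted(spikes)


if __name__ == "__main__":
    # Problematic protocol: no-go before the stimulus, switch to go at 500 ms.
    print(spike_pairs([(125, "stim1")], [(120, "no-go"), (500, "go")]))
    # Both solutions applied: go-action starts together with the stimulus.
    print(spike_pairs([(125, "stim1")], [(125, "go")]))
```

Under these assumptions the problematic protocol yields a lone no-go ACTION spike at 495 ms and a repeated STATE spike at 500 ms, while the corrected protocol yields a single STATE-ACTION pair.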
During retrieval, reward is sought correctly if the reverse spread meets activity in the minicolumn representing stimulus 1, even if there are preceding minicolumns in the chain of associations. This means that a data protocol may include initializing steps:
stimulus 1 STATE & no-go ACTION --- stimulus 1 STATE & go ACTION --- reward STATE
+-----------------------------+ +--------------------------+ +----------+
375 ms 375 ms 375 ms
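The timing of this protocol can be written out explicitly. The sketch below is a minimal illustration that assumes each step lasts exactly three theta cycles of 125 ms (375 ms, as in the diagram) and that the steps follow each other without gaps. The names `STEPS` and `timeline` are hypothetical.

```python
THETA_MS = 125        # inferred from 375 ms = 3 cycles
CYCLES_PER_STEP = 3

# The three steps of the data protocol: (state, action).
STEPS = [
    ("stimulus 1 STATE", "no-go ACTION"),
    ("stimulus 1 STATE", "go ACTION"),
    ("reward STATE", None),
]


def timeline(steps, step_ms=THETA_MS * CYCLES_PER_STEP):
    """Return (start_ms, end_ms, state, action) for consecutive steps of equal length."""
    return [
        (i * step_ms, (i + 1) * step_ms, state, action)
        for i, (state, action) in enumerate(steps)
    ]


if __name__ == "__main__":
    for start, end, state, action in timeline(STEPS):
        print(f"{start:5d}-{end:5d} ms  {state}  {action or ''}")
```

With these assumptions the reward step occupies the interval from 750 ms to 1125 ms after the start of the sequence.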
The two remaining issues are:
Figure 4: This circuitry produces a state spike for a state-action spike pair that represents only those portions of the state vector that have changed most recently. When a new state is detected, that new state spike triggers interneuron activity that clears previous state spikes from a buffer. That buffer is refilled with the new state spike. The buffered spikes are used as the state-spike when a new state-action spike pair is generated. No-go action can be considered present before stimulus 1 is presented, since the subject was not active. This can be simplified by starting the no-go action at the same time as stimulus 1 is presented. A plausible explanation is that priming for stimulation causes a theta (re)start so that buffers in both state and action streams are cleared and then filled at the same time. To improve the speed of the simulation, the protocol can be simplified by focusing on the two necessary associations (circled in red). The protocol before the dashed vertical red line can be omitted, taken as a given, by presenting stimulus 1 and go action at the same time. There are two issues to bear in mind when using this faster protocol: (1) Immediate action (in the same millisecond) does not look as realistic. (2) The [stim.1,no-go] and [no-go,stim.1] associations will not be learned. Will criticism of the experimental task arise over these two issues?
Figure 5: This circuitry produces a state spike for a state-action spike pair that represents only those portions of the state vector that have changed most recently. When a new state is detected, that new state spike triggers interneuron activity that clears previous state spikes from a buffer. That buffer is refilled with the new state spike. The buffered spikes are used as the state-spike when a new state-action spike pair is generated. The interneuron population is presented with dashed lines, since it may be implemented implicitly. The simulation requires less computation if inhibitory connections lead directly from the detector to the buffer.
The frequency of reward spikes is increased so that they also generate a new state.
As shown with minicolumns-thorpe-task.20031223.ccm, better spike pairs are now produced in groups for each training presentation. Unfortunately, a new arrangement for the application of reward if stimulus 1 and action are present together required a spike delay from action to the conditional spiking circuit that gates reward. This creates the correct reward application after 750 ms, but also leads to some false positives when the next stimulus state appears, but the previous action is still in the spike buffer. Instead of a spike buffer, another gating circuit is needed. Thus, reward spikes are produced only if (1) stimulus 1 is present, (2) go-action is taken, and (3) the time is between 750 ms and 1125 ms after the onset of the stimulus presentation.
If go-action represents licking a tube that supplies juice as a reward then it makes sense that every go-action spike should also result in a reward spike. Those spikes must be gated by the presence of stimulus 1 (and later by stimulus 2 when a reversal takes place). Those two aspects are already present in the circuitry. Now, the go-action spikes should no longer be buffered, but instead they should be gated in a manner so that the gate opens 750 ms after the onset of a stimulus presentation and closes when the spikes to clear STM buffers are given between trials. Circuitry that achieves the desired protocol for reward spikes is added in minicolumns-thorpe-task.20031229.ccm and shown in Figure 6.
Figure 6: Updated circuitry to produce the desired reward spikes in the interval between 750 ms and 1125 ms after the onset of a stimulus presentation for which go-action is rewarded. A vector switch (1) selects the first stimulus (2) or following reversal the second stimulus (3) as the stimulus for which go-action is rewarded. A single spike selects go/no-go action that is maintained as a continuous spike train in the "vrat4dircontroller" circuitry (4). That single spike is also transformed to a spike with index 1 in a connection router (5) and subsequently buffered (6) for 750 ms. When the delayed spike arrives at a vector switch (7), a vector output representing the perception of the stimulus for which go-action is rewarded may propagate to a "conditional-spike-gate" neuronal circuit (8). There the value of the stimulus perception vector controls transmission of a train of go-action spikes through synapses that elicit a train of reward spikes in the gating neuron. Together, the spike trains indicating the perception of the first stimulus (9), the second stimulus (10) and reward received form the state input to the state-action spike pair generating circuitry. The reward spike train ends as a spike with index 0 (received through a connection that is not drawn in this figure) clears short-term memory buffers throughout the system, as well as the "vrat4dircontroller" buffers and resets the vector switch (7) so that a constant value 0 suppresses transmission through the "conditional-spike-gate" neuronal circuit (8).
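The three conditions for producing a reward spike can be expressed as a simple predicate. This is a hedged sketch of the gating logic only, not of the neuronal circuit: the function name `reward_spike_allowed` and its parameters are hypothetical, the window boundaries are treated as inclusive by assumption, and the `rewarded_stimulus` argument stands in for the vector switch that selects stimulus 1 or, after reversal, stimulus 2.

```python
GATE_OPEN_MS = 750     # gate opens this long after stimulus onset
GATE_CLOSE_MS = 1125   # gate closes when STM buffers are cleared (assumed inclusive)


def reward_spike_allowed(stimulus, action, ms_since_onset, rewarded_stimulus=1):
    """Return True if a go-action spike may elicit a reward spike at this time."""
    return (
        stimulus == rewarded_stimulus
        and action == "go"
        and GATE_OPEN_MS <= ms_since_onset <= GATE_CLOSE_MS
    )


if __name__ == "__main__":
    print(reward_spike_allowed(1, "go", 800))     # inside the window
    print(reward_spike_allowed(1, "go", 100))     # new stimulus onset, gate still closed
    print(reward_spike_allowed(2, "go", 800))     # wrong stimulus before reversal
    print(reward_spike_allowed(2, "go", 800, rewarded_stimulus=2))  # after reversal
```

Because the window is measured from the onset of the current stimulus, an action left over from the previous presentation cannot trigger reward at the start of the next one, which is the false positive that the spike buffer produced.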
Figure 7: The resulting state-action spike trains. The indices of the spike trains represent the following: (0) go, (1) no-go, (2) stimulus 1, (3) stimulus 2, and (4) reward.
Figure 8: Spike pairs generated by the state-action spike pair generating circuitry. The indices of the spikes represent the following: (0) go ACTION, (1) no-go ACTION, (4) stimulus 1 STATE, (5) stimulus 2 STATE, and (6) reward STATE. Vertical blue lines were added to the spike plot to indicate the six different training sets.
If it is more desirable that reward immediately appears when the correct stimulus and go-action are combined, then the delay that allows the previous association between spike pairs to be encoded with LTP may be implemented in a more complicated version of the state-action spike pair generating circuitry.
| to r population | ||||||
|---|---|---|---|---|---|---|
| go | no-go | stim.1 | stim.2 | reward | ||
| from s population | go | (5-6) | (1) | |||
| no-go | ||||||
| stim.1 | (1) | (4) | ||||
| stim.2 | (3) | (2) | ||||
| reward | (1-2) | |||||
| to x population | ||||||
|---|---|---|---|---|---|---|
| go | no-go | stim.1 | stim.2 | reward | ||
| from y population | go | (1) | (3) | |||
| no-go | (4) | (2) | ||||
| stim.1 | ||||||
| stim.2 | (1-2) | |||||
| reward | (1) | |||||
The remaining associations that must be learned so that forward and backward spread of activation can be used to retrieve the stimulus upon which to act in order to receive reward are two synapses in the connection matrices Wif and Wib. In Wif, the connection between stimulus 1 and reward in the go minicolumn must be strengthened, i.e. Wif{4 to 6}. In Wib, the connection between reward and stimulus 1 in the go minicolumn must be strengthened, i.e. Wib{6 to 4}. Inspection with a Catacomb ObservationRecorder shows that the required synapse in Wif is strengthened. But that is not shown for Wib! There are two possibilities: Either the association was not learned, or the known bugs in the ObservationRecorder do not allow its inspection. (In current versions of Catacomb, the ObservationRecorder displays only a subset of the columns for connections between pre- and postsynaptic neuronal populations that result in more than one column.)
Further inspection may show if Wib was trained correctly after all. During the retrieval phase, activation in the reward minicolumn should propagate to the x population of the go minicolumn. And if Wib was correctly trained, that should result in x population activity at the neuron that enables backward propagation to the stimulus 1 minicolumn. Note that no output from prefrontal minicolumns currently appears during the retrieval phase, but that may be due to other problems in the output circuitry.
Buffer clearing may be improved, as demonstrated for the transition to the performance part of the task. There, I added a number of clear signals in minicolumns-thorpe-task.20040105b.ccm to ensure that the last buffered action is cleared from the a-STM-buffer population around t=7500 ms. Similar modifications may improve clearing of buffers between training sets.
The inspection, taking a reward retrieval spike as the onset, shows the following: At around t=8015 ms, the go neuron of the a population spikes for retrieval. That causes spiking in y{48-55}, i.e. all neurons of the y population in the reward minicolumn. Backwards spread through Wb results in a spike at x{6}, a neuron in the x population of the go minicolumn at around t=8022 ms. If associations were correctly encoded in Wib then y{4} should spike to allow further backpropagation to the stimulus 1 minicolumn. That spike does not appear! Consequently, either training of Wib was not successful or spiking in the y population of the go minicolumn was inhibited by activity in the a population (if go was erroneously receiving a "current state" signal during the retrieval phase). No spike appears at the a population of the go minicolumn around that time, so Wib was not successfully trained.
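The dependence of backward retrieval on the Wib{6 to 4} synapse can be made concrete with a highly simplified sketch. It treats a strengthened synapse as membership in a set, whereas the simulation uses graded weights and spiking neurons. The index constants follow the spike indices used for the spike pairs (4 for stimulus 1, 6 for reward); the function name `backward_targets` and the dictionary layout are hypothetical.

```python
STIM_1, STIM_2, REWARD = 4, 5, 6   # indices as used for the state spikes


def backward_targets(w_ib, minicolumn, x_index):
    """Return the y indices reached from x_index through strengthened Wib synapses."""
    synapses = w_ib.get(minicolumn, set())
    return sorted(post for pre, post in synapses if pre == x_index)


# Wib per minicolumn as a set of (presynaptic x index, postsynaptic y index) pairs.
untrained = {"go": set()}
trained = {"go": {(REWARD, STIM_1)}}

if __name__ == "__main__":
    # A spike at x{6} of the go minicolumn during retrieval:
    print(backward_targets(untrained, "go", REWARD))  # no y spike, spread stops
    print(backward_targets(trained, "go", REWARD))    # y{4} spikes, spread continues
```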
Solving the problem of learning in Wib:
Inspection of spiking in the y-specific population, the population of neurons used to train Wib, shows that only two spikes occur during training. Both occur at y{4}, one at t=1100 ms and one at t=1207 ms. During the training set responsible for those spikes, the only spike at x{6} is at t=1209 ms. Thus, there are two problems: (1) x spikes after y, while the presynaptic spike should precede the postsynaptic spike to elicit LTP, and (2) only one pair of x{6} and y{4} spikes is available for learning.
Since y-specific is driven by r2 output, I now inspect the filtered r2 output and the r-STM-buffer activity. It is notable that the entire buffer spikes each time a buffer-clear signal is received between training sets. That clearing spike appears in all minicolumn implementations to date that include buffer clear signals. It is an artifact of the implementation. The intended way to clear an STM buffer is to revert to a state without theta rhythm. The current attempt to implement this as an actual change in the rhythmic modulation sends a clear signal as a short high-frequency spike train into the spike relay that distributes rhythmic spiking throughout the minicolumn neuronal circuitry. The high-frequency spike train causes successive hyperpolarizations at STM buffers. There may be cause to improve this implementation of the ability to clear STM buffers.
It appears that the buffer is cleared too soon after reward is received. The first reward spike in the a population occurs at t=1102 ms and the first clear signal is sent to the minicolumns at t=1250 ms. It was originally thought that the first reward spike would immediately participate in learning and that three cycles would be available before buffers are cleared. That protocol did not take into account several aspects of the mechanism that introduce delays:
It also seems that the a-STM-buffer is not cleared by the clear signal, at least the a neuron of the reward minicolumn continues to spike.
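The shortfall in encoding time follows from the two reported spike times. The short calculation below is a sketch that assumes a theta period of 125 ms, inferred from the 375 ms delay being described as three cycles. The variable names are chosen for illustration.

```python
THETA_MS = 125                  # inferred: 375 ms / 3 cycles
INTENDED_CYCLES = 3             # cycles originally expected before buffer clearing

first_reward_spike_ms = 1102    # first reward spike in the a population
first_clear_signal_ms = 1250    # first clear signal sent to the minicolumns

available_ms = first_clear_signal_ms - first_reward_spike_ms
available_cycles = available_ms / THETA_MS

if __name__ == "__main__":
    print(f"available: {available_ms} ms = {available_cycles:.2f} cycles")
    print(f"intended:  {INTENDED_CYCLES * THETA_MS} ms = {INTENDED_CYCLES} cycles")
```

Only 148 ms, a little more than one cycle, separate the first reward spike from the first clear signal, well short of the three cycles that were assumed.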
So, is the problem (1) simply that more time is needed after reward is received so that LTP is established in Wib, or (2) that the r2 output is more broken? The second possibility is tested by investigating the following questions:
In any case, the first usable x{6} spike appears at t=1209 ms, so that more time is needed to train the x{6} to y{4} association in Wf, despite correcting the previously broken r2 output. The data protocol is now adjusted to deal with problematic aspects of minicolumn activity. The periods in which theta rhythm is removed by clear-buffer signals are extended between the training sets to ensure that all buffers are cleared, including the a-STM-buffer. The timing of clear-buffer signals and following training sets is also shifted to provide additional time for encoding after training sets that include reward activity. This is accomplished in minicolumns-thorpe-task.20040107.ccm.
Although more time is now provided for learning when reward is received, only two x{6} spikes appear, since the a population diffuse contribution ends as the a-STM-buffer receives new ACTION input three cycles after the reward STATE input. The first x{6} spike does not precede the corresponding y-specific{4} spike (because the phase of r2{6} was still shifting in r-STM-buffer), so that no LTP is established, but the second x{6} spike at t=1318 ms does precede the corresponding y-specific{4} spike at t=1322 ms. Thus there is one update of the connection strength. The update of Wib is observed in the ObservationRecorder for synapses from the x population to the y-specific population.
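The requirement that the presynaptic x spike precede the postsynaptic y-specific spike can be checked against the spike times reported in this section. The sketch below is a minimal illustration: the function name `ltp_pairs` is hypothetical, and the 20 ms pairing window is an assumed value, since the window used by the model is not stated here. Only the second x{6} spike of the adjusted protocol is included because the time of the first is not reported.

```python
def ltp_pairs(pre_times, post_times, window_ms=20):
    """Return (pre, post) pairs in which the pre spike leads the post spike within the window."""
    return [
        (pre, post)
        for pre in pre_times
        for post in post_times
        if 0 < post - pre <= window_ms   # pre must strictly precede post
    ]


# Original protocol: x{6} spikes after both y{4} spikes.
before_adjustment = ltp_pairs(pre_times=[1209], post_times=[1100, 1207])

# Adjusted protocol: the second x{6} spike precedes the y-specific{4} spike.
after_adjustment = ltp_pairs(pre_times=[1318], post_times=[1322])

if __name__ == "__main__":
    print(before_adjustment)
    print(after_adjustment)
```

Under the assumed window, the original protocol yields no eligible pair and the adjusted protocol yields exactly one, consistent with the single update of the connection strength.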
The single update produced a mild strengthening of the connection between x{6} and y-specific{4}. More such updates are possible by (a) increasing the number of cycles further and (b) increasing the delay from r2 to y-specific. The first solution requires that the delay between the STATE and ACTION spikes in a spike pair is increased by at least one more rhythmic cycle, and that the appearance of reward is similarly delayed to allow the association from stimulus 1 STATE to go ACTION to be encoded. This is non-problematic in the context of the Thorpe task, since Thorpe provided stimuli and reward over greater durations than these. But to simplify the simulation, I will first attempt the modification of delay from r2 to y-specific, increasing it by 4 ms to 15 ms. This is done in minicolumns-thorpe-task.20040107b.ccm, and encoding in Wib is successful. During retrieval, the presentation of the desire for reward causes the retrieval of go-action when stimulus 1 appears. Note that encoding is also successful with the original delay of 11 ms, since the rewarded training set appears twice during training.
An important design principle for integrate-and-fire models with persistent firing STM buffers was found here: The buffers need several cycles to shift items into the phase at which their reactivation is maintained in STM. If other mechanisms in a neuronal simulation depend on the specific phase of spikes in a buffer, they may need to wait several cycles to perform their function. The time taken to complete the simulation successfully increases proportionally. Two examples are:
Figure 9: Thorpe task simulation performance without reversal with minicolumns-thorpe-task.20040108.ccm. Black rectangles are trains of many spikes. Spike indices indicate the following: (0) go action, (1) no-go action, (2) stimulus 1 present, (3) stimulus 2 present, (4) reward. Training takes place between t=0 ms and t=8000 ms. After t=8000 ms, the trained neuronal simulation of the prefrontal minicolumns drives go and no-go actions in response to stimuli perceived. Each time stimulus 1 appears, the prefrontal minicolumns drive go action after a brief delay. Before t=13120 ms, that go action is rewarded. After that time, the environment simulation reverses the reward protocol. Since reversal is not dealt with at this stage of the model construction, the neuronal simulation continues to drive go action in response to stimulus 1.
One assumption we can make is that the ability to learn rapid reversal is hippocampus dependent. This assumption is based on the knowledge that the ability to encode episodic context dependent memory after a single presentation is hippocampus dependent. Such episodic memory enables continuous encoding that keeps track of the stimulus state and go action combination most recently associated with reward. The episodic memory must then be used to achieve the reversal of the behavior learned with cortical minicolumns.
As suggested in recent hippocampal modeling work, the temporal context can be stored during task performance through mechanisms in dentate gyrus and hippocampus. For each trial (temporal context), an association between stimulus state, action and a possible reward is encoded. The reward representation there may be strongly connected to the reward minicolumn. Also, in dentate gyrus the most recent associations may be more strongly connected to the current temporal context. When the desire for reward appears, these connections may result in retrieval of the most recently rewarded state-action context encoded in the hippocampus.
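The role proposed for episodic memory, keeping track of the state-action combination most recently associated with reward, can be sketched at a purely symbolic level. This is an illustration of the idea only and makes no claim about the hippocampal mechanism: the `Episode` record and the function `most_recent_rewarded` are hypothetical, and a trial counter stands in for temporal context.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    trial: int       # stands in for the temporal context of the trial
    stimulus: int    # 1 or 2
    action: str      # "go" or "no-go"
    rewarded: bool


def most_recent_rewarded(episodes):
    """Return (stimulus, action) of the latest rewarded episode, or None if there is none."""
    rewarded = [e for e in episodes if e.rewarded]
    if not rewarded:
        return None
    latest = max(rewarded, key=lambda e: e.trial)
    return latest.stimulus, latest.action


if __name__ == "__main__":
    episodes = [
        Episode(1, 1, "go", True),    # before reversal: stimulus 1 and go rewarded
        Episode(2, 1, "go", False),   # after reversal: stimulus 1 and go no longer rewarded
        Episode(3, 2, "go", True),    # after reversal: stimulus 2 and go rewarded
    ]
    print(most_recent_rewarded(episodes))
```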
Learning to rapidly reverse behavior may proceed as follows:
One way to achieve the integration of the reversal behavior in the learned associations is shown in Figure 10. Note that it presumes that the possibility of reward for go action following either stimulus is encoded.
Figure 10: A neuronal circuit and protocol for reversal learning in the Thorpe task. Initially, an association is learned from one stimulus to go action and reward. When reversal first occurs, the same has to be learned for the other stimulus. Simultaneously, the hippocampus retains episodic memory of previously rewarded state-action pairs. When a reversal occurs, that is also encoded in episodic memory and a corresponding minicolumn is activated. The activation of that minicolumn becomes associated with go action and reward for the other stimulus. Thus, a secondary route appears and replaces the direct route to reward. This new route depends on hippocampal activity, so that rapid reversal behavior is encoded. Since this approach depends on the weakening of one set of associations as another set is encoded, the simulation requires an implementation of long-term depression.