1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
.. default-role:: math
.. header:: **N.P. Rougier & Y. Boniface** | Dynamic Self-Organising Map

===============================================================================
Dynamic Self-Organising Map                                                    
===============================================================================

-------------------------------------------------------------------------------
A computational model of cortical plasticity                                   
-------------------------------------------------------------------------------

**Nicolas P. Rougier** ¹ and **Yann Boniface** ²

| **¹** LORIA/INRIA Nancy - Nicolas.Rougier@loria.fr
| **²** LORIA/Université Nancy 2 - Yann.Boniface@loria.fr

.. contents:: Table of Contents
   :depth: 2

.. sectnum::
   :suffix: .
   :depth:  3

.. abstract::

   We present  in this paper a  variation of the  self-organising map algorithm
   where the original time-dependent (learning rate and neighbourhood) learning
   function is  replaced by a time-invariant  one. This allows  for on-line and
   continuous learning  on both static  and dynamic data distributions.  One of
   the property  of the newly  proposed algorithm is  that it does not  fit the
   magnification  law   and  the  achieved  vector  density   is  not  directly
   proportional  to the density  of the  distribution as  found in  most vector
   quantisation algorithms.   From a biological  point of view,  this algorithm
   sheds  light on cortical  plasticity seen  as a  dynamic and  tight coupling
   between the environment and the model.

.. keywords::

   SOM, self organisation, cortical plasticity, dynamic.



Introduction                                                                   
===============================================================================

Vector  quantisation (VQ)  refers to  the  modelling of  a probability  density
function  into  a discrete  set  of  prototype  vectors (sometimes  called  the
codebook) such  that any  point drawn from  the associated distribution  can be
associated to a  prototype vector. Most VQ algorithms try  to match the density
through the density of their codebook: high density regions of the distribution
tend to have more associated prototypes than low density region. This generally
allows to minimise  the loss of information (or distortion)  as measured by the
mean quadratic error. For a complete picture, it is to be noted that there also
exists some  cases where  only a partition  of the  space occupied by  the data
(regardless of their density) is necessary.  In this case, one wants to achieve
a regular  quantification *a priori*  of the probability density  function. For
example, in some classification problems, one wants to achieve a discrimination
of data in term  of classes and thus needs only to  draw frontiers between data
regardless of their respective density.

Vector quantisation can be achieved using several methods such as variations of
the   k-means   method   [MacQueen:1967]_,  Linde-Buzo-Gray   (LBG)   algorithm
[Linde+Al:1980]_ or neural network models such as the self-organising map (SOM)
[Kohonen:1982]_, neural  gas (NG)  [Martinetz+Al:1993]_ and growing  neural gas
(GNG) [Fritzke:1995]_. Among all these  methods, the SOM algorithm is certainly
the most famous in the field  of computational neuroscience since it can give a
biologically and plausible  account on the organisation of  receptive fields in
sensory  areas  where adjacent  neurons  shares  similar representations.   The
stability  and the  quality  of  such self-organisation  depends  heavily on  a
decreasing learning rate as well  as a decreasing neighbourhood function.  This
is quite  congruent with the idea  of a critical  period in the early  years of
development where most sensory or  motor properties are acquired and stabilised
[Hubel+Wiesel:1965]_,  [Hubel+Wiesel:1970]_, [Daw:1994]_.  However,  this fails
to explain cortical  plasticity since we know that the  cortex has the capacity
to  re-organise itself  in face  of lesions  or  deficits [BachyRita+Al:1969]_,
[BachyRita:1972]_,  [Ramachadran+Al:1992]_.  The  question is  then to  know to
what extent  it is possible to  have both stable and  dynamic representations ?
We propose to answer this question  by considering a tight coupling between the
environment  and  cortical  representations.   If the  environment  is  stable,
cortical representations  should remain stable and if  the environment suddenly
changes,  cortical  representations   must  dynamically  adapt  themselves  and
stabilise  again onto  the new  environment.

Quite obviously, this cannot be achieved using SOM-like algorithms that depends
on a time decreasing learning rate and/or neighbourhood function (SOM, NG, GNG)
and,  despite the  huge  amount of  literature [Oja+Al:2003]_  [Kaski+Al:1998]_
around self-organising  maps and Kohonen-typed  networks (more than  7000 works
listed in  [Pöllä+Al:2009]_), there is  is surprisingly and  comparatively very
little  work dealing  with online  learning  (also referred  as incremental  or
lifelong learning). Furthermore,  most of these works are  based on incremental
models, that  is, networks  that create and/or  delete nodes as  necessary. For
example,   the  modified   GNG  model   [Fritzke:1997]_  is   able   to  follow
non-stationary  distributions by  creating  nodes  like in  a  regular GNG  and
deleting them when  they have a too small  *utility* parameter.  Similarly, the
evolving self-organising  map (ESOM, [Deng+Al:2000]_,  [Deng+Al:2003]_ is based
on an incremental  network quite similar to GNG  that creates dynamically based
on the measure of  the distance of the winner to the data  (but the new node is
created  at   exact  data   point  instead  of   the  mid-point  as   in  GNG).
Self-organising  incremental neural  network (SOINN)  [Furao+Al:2006]_  and its
enhanced  version (ESOINN) [Furao+Al:2007]_  are also  based on  an incremental
structure  where the  first version  is using  a two  layers network  while the
enhanced version proposed a single  layer network. One noticeable result is the
model  proposed by  [Keith-Magee:2001]_ which  does not  rely on  a incremental
structure but  is based  on the  Butterworth decay scheme  that does  not decay
parameters  to  zero.   The  model  works  in  two  phases,  an  initial  phase
(approximately ten epochs) is used  to establish a rough global topology thanks
to a very  large neighbourhood and the second phase  uses a small neighbourhood
phase to train the network. Unfortunately, the size of the neighbourhood in the
second phase has to be adapted to the expected density of the data.

Without  judging performances  of these  models, we  do not  think they  give a
satisfactory answer to our initial question and we propose instead to answer by
considering a  tight coupling between  the environment and  representations. If
the  environment is  stable, representations  should remain  stable and  if the
environment suddenly changes, representations must dynamically adapt themselves
and stabilise  again onto the new  environment.  We thus  modified the original
SOM algorithm in order to  make its learning rule and neighbourhood independent
of time. This results in a tight coupling between the environment and the model
that  ensure  both stability  and  plasticity.  In  next section,  we  formally
describe the dynamic self-organising map  in the context of vector quantisation
and both neural gas and self-organising  map are formally described in order to
underline  differences   between  the   three  algorithms.  The   next  section
re-introduces  the  model  from a  more  behavioural  point  of view  and  main
experimental results are  introduced using either low or  high dimensional data
and offers  side-to-side comparison  with other algorithms.  Results concerning
dynamic   distributions  are   also   introduced  in   the   case  of   dynamic
self-organising  map   in  order  to   illustrate  the  coupling   between  the
distribution and the  model. Finally, we discuss the relevancy  of such a model
in the context of computational neurosciences and embodied cognition.



Definitions                                                                    
===============================================================================
Let us  consider a  probability density function  `f(x)` on a  compact manifold
`\Omega \in \mathbb{R}^d`. A vector quantisation (VQ) is a function `\Phi` from
`\Omega`  to   a  finite   subset  of  `n`   code  words   `\{\mathbf{w}_i  \in
\mathbb{R}^d\}_{1 \leq i \leq n}` that  form the codebook. A cluster is defined
as  `C_i \deq  \{x  \in \Omega  | \Phi(x)  =  \mathbf{w}_i \}`,  which forms  a
partition of  `\Omega` and  the distortion of  the VQ  is measured by  the mean
quadratic error

.. math::

   \xi = \sum_{i=1}^{n} \int_{C_i} \lVert x - \mathbf{w}_i \rVert^2 f(x) dx.

If the  function `f` is unknown  and a finite  set `\{x_i\}` of `p`  non biased
observations is available, the distortion error may be empirically estimated by

.. math::
  :label: error

  \hat{\xi} = \frac{1}{p}\sum_{i=1}^{n} \sum_{x_j \in C_i} \lVert
  x_j-\mathbf{w}_i \rVert^2.

Neural  maps define  a special  type of  vector quantifiers  whose  most common
approaches   are   the   Self-Organising    Map   (SOM),   Elastic   Net   (EN)
[Durbin+Willshaw:1987]_, Neural Gas (NG) and  Growing Neural Gas (GNG).  In the
following,   we   will   use    definitions   and   notations   introduced   by
[Villman+Clausen:2006]_ where a neural map  is defined as the projection from a
manifold  `\Omega  \subset  \mathbb{R}^d`  onto  a  set  `\mathcal{N}`  of  `n`
*neurons* which is formally written as `\Phi : \Omega \rightarrow \mathcal{N}`.
Each neuron `i` is associated with a code word `\mathbf{w}_i \in \mathbb{R}^d`,
all of which established the set `\{\mathbf{w}_i\}_{i \in \mathcal{N}}` that is
referred  as the  codebook. The  mapping from  `\Omega` to  `\mathcal{N}`  is a
closest-neighbour  winner-take-all rule  such that  any vector  `\mathbf{v} \in
\Omega` is mapped  to a neuron `i` with  the code `\mathbf{w}_\mathbf{v}` being
closest to the actual presented stimulus vector `\mathbf{v}`,

.. math::
   :label: psi

   \Phi : \mathbf{v} \mapsto \argmin_{i \in \mathcal{N}} (\lVert \mathbf{v} -
   \mathbf{w}_i \rVert).

The neuron `\mathbf{w}_\mathbf{v}` is called  the *winning element* and the set
`C_i =  \{x \in \Omega  | \Phi(x) =  \mathbf{w}_i \}` is called  the *receptive
field* of the neuron `i`. The  geometry corresponds to a Voronoï diagram of the
space with `\mathbf{w}_i` as the center.


Self-Organising Maps (SOM)                                                     
-------------------------------------------------------------------------------
SOM is a neural map equipped with a structure (usually a hypercube or hexagonal
lattice) and each element `i`  is assigned a fixed position `\mathbf{p}_{i}` in
`\mathbb{R}^q`  where `q`  is  the dimension  of  the lattice  (usually `1`  or
`2`). The learning process is an  iterative process between time `t=0` and time
`t=t_f \in \mathbb{N}^+` where vectors `\mathbf{v} \in \Omega` are sequentially
presented to the map with respect  to the probability density function `f`. For
each presented vector `\mathbf{v}` at time `t`, a winner `s \in \mathcal{N}` is
determined according to equation  :eq:`psi`.  All codes `\mathbf{w}_{i}` from
the codebook are shifted towards `\mathbf{v}` according to

.. math::
   :label: som-learning

   \Delta\mathbf{w}_{i} = \varepsilon(t)~h_\sigma(t,i,s)~(\mathbf{v} -
   \mathbf{w}_i)

with `h_\sigma(t,i,j)` being a neighbourhood function of the form

.. math::
   :label: som-neighborhood

   h_\sigma(t,i,j) = e^{- \frac{\lVert \mathbf{p}_i - \mathbf{p}_j
   \rVert^2}{2\sigma(t)^2}}.

where `\varepsilon(t) \in \mathbb{R}` is the learning rate and `\sigma(t) \in
\mathbb{R}` is the width of the neighbourhood defined as

.. math::

  \sigma(t) = \sigma_i\left(\frac{\sigma_f}{\sigma_i}\right)^{t/t_f}, \text{
  with } \varepsilon(t) =
  \varepsilon_i\left(\frac{\varepsilon_f}{\varepsilon_i}\right)^{t/t_f},

while  `\sigma_i`  and  `\sigma_f`  are  respectively  the  initial  and  final
neighbourhood  width and `\varepsilon_i`  and `\varepsilon_f`  are respectively
the initial  and final learning rate.  We usually have  `\sigma_f \ll \sigma_i`
and `\varepsilon_f \ll \varepsilon_i`.


Neural Gas (NG)                                                                
-------------------------------------------------------------------------------
In the  case of NG, the learning  process is an iterative  process between time
`t=0` and time  `t=t_f \in \mathbb{N}^+` where vectors  `\mathbf{v} \in \Omega`
are sequentially presented  to the map with respect  to the probability density
function `f`. For  each presented vector `\mathbf{v}` at  time `t`, neurons are
ordered  according  to  their  respective  distance  to  `\mathbf{v}`  (closest
distances map to lower ranks)  and assigned a rank `k_i(\mathbf{v})`. All codes
`\mathbf{w}_{i}` from  the codebook are shifted  towards `\mathbf{v}` according
to

.. math::
   :label: ng-learning

   \Delta\mathbf{w}_{i} = \varepsilon(t)~h_\lambda(t,i,\mathbf{v})~(\mathbf{v} -
   \mathbf{w}_i)

with `h_\lambda(t,i,\mathbf{v})` being a neighbourhood function of the form:

.. math::
   :label: ng-neighborhood

   h_{\lambda}(t,i,\mathbf{v}) = e^{-\frac{k_i(\mathbf{v})}{\lambda(t)}}

where `\varepsilon(t) \in \mathbb{R}` is the learning rate and `\lambda(t) \in
\mathbb{R}` is the width of the neighbourhood defined as

.. math::

  \lambda(t) = \lambda_i\left(\frac{\lambda_f}{\lambda_i}\right)^{t/t_f},
  \text{ with }
  \varepsilon(t) =
  \varepsilon_i\left(\frac{\varepsilon_f}{\varepsilon_i}\right)^{t/t_f},

while  `\lambda_i`  and `\lambda_f`  are  respectively  the  initial and  final
neighbourhood  and  `\varepsilon_i` and  `\varepsilon_f`  are respectively  the
initial and final learning rate.  We usually have `\lambda_f \ll \lambda_i` and
`\varepsilon_f \ll \varepsilon_i`.


Dynamic Self-Organising Map (DSOM)                                             
-------------------------------------------------------------------------------
DSOM  is a  neural map  equipped  with a  structure (a  hypercube or  hexagonal
lattice) and each  neuron `i` is assigned a  fixed position `\mathbf{p}_{i}` in
`\mathbb{R}^q`  where `q`  is  the dimension  of  the lattice  (usually `1`  or
`2`). The  learning process is  an iterative process where  vectors `\mathbf{v}
\in  \Omega`  are  sequentially  presented  to  the map  with  respect  to  the
probability density  function `f`.  For  each presented vector  `\mathbf{v}`, a
winner `s \in  \mathcal{N}` is determined according to  equation :eq:`psi`. All
codes  `\mathbf{w}_{i}`  from the  codebook  `\mathbf{W}`  are shifted  towards
`\mathbf{v}` according to

.. math::
   :label: dsom-learning

   \Delta\mathbf{w}_{i} = \varepsilon \lVert \mathbf{v} -
   \mathbf{w}_i\rVert_\Omega~h_\eta(i,s,\mathbf{v})~(\mathbf{v} - \mathbf{w}_i)

withj `\varepsilon` being a constant learning rate and `h_\eta(i,s,\mathbf{v})`
being a neighbourhood function of the form

.. math::
   :label: dsom-neighborhood

   h_\eta(i,s,\mathbf{v}) =
      e^{-\frac{1}{\eta^2} \frac{\lVert \mathbf{p}_i - \mathbf{p}_s
          \rVert^2}{{\lVert \mathbf{v} - \mathbf{w}_s \rVert}_{\Omega}^{2}}}

where `\eta`  is the *elasticity*  or *plasticity* parameter. If  `\mathbf{v} =
\mathbf{w}_s`, then `h_\eta(i,s,\mathbf{v}) = 0`.


Model                                                                          
===============================================================================
As  we explained  in  the introduction,  the  DSOM algorithm  is essentially  a
variation  of the SOM  algorithm where  the time  dependency has  been removed.
Regular  learning   function  :eq:`som-learning`  and   neighbourhood  function
:eq:`som-neighborhood`   have   been   respectively   replaced   by   equations
:eq:`dsom-learning` and :eq:`dsom-neighborhood` which reflect two main ideas:

- If a neuron is close enough to the data, there is no need for others to
  learn anything: the winner can represent the data.
- If there is no neuron close enough to the data, any neuron learns
  the data according to its own distance to the data.

This  draws several consequences  on the  notion of  neighbourhood that  is now
dynamic and  leads to a  qualitatively different self-organisation that  can be
controlled using a free elasticity parameter.


Dynamic neighbourhood                                                          
-------------------------------------------------------------------------------
Learning rate is  modulated using the closeness of the winner  to the data. The
figure  :fig:`learning-rate`  represents this  learning  rate  modulation as  a
function of a data `\mathbf{v}`, a  neuron `i` (with code `\mathbf{w}_i`) and a
winner `s` (with code `\mathbf{w}_s`). If the winner `s` is very close or equal
to  `\mathbf{v}` (bottom  line  on the  figure),  learning rate  of any  neuron
different from the  winner `s` is zero and only the  winner actually learns the
new data. When the winner `s` is  very far from the data (top line), any neuron
benefits from a large learning rate and learns the new data (modulated by their
own distance  to the data but this  extra modulation is not  represented on the
figure).

.. figure::   images/learning-rate.png
   :target:   images/learning-rate.png
   :width:    75%
   :label:    learning-rate

   At each presented data `\mathbf{v}`, the learning rate of each neuron `i` is
   modulated according  to both the distance `\lVert  \mathbf{w}_s - \mathbf{v}
   \rVert`  (which represents  the  distance  between the  winner  `s` and  the
   presented  data  `\mathbf{v}`)  and  the  distance  `\lVert  \mathbf{p}_i  -
   \mathbf{p}_s  \rVert` (which represent  the distance  between code  words of
   neuron `i`  and neuron `s`).  If  the winner `s`  is very close or  equal to
   `\mathbf{v}`  (bottom line  on  the  figure), learning  rate  of any  neuron
   different from  the winner `s` is  zero and only the  winner actually learns
   the new data. When the winner `s`  is very far from the data (top line), any
   neuron  benefits  from  a  large  learning  rate and  learns  the  new  data
   (modulated by  their own distance to  the data but this  extra modulation is
   not represented on the figure).



This notion  of closeness of the  winner to the  data is thus critical  for the
algorithm and  modifies considerably both  the notion of neighbourhood  and the
final codebook.  Most  VQ tries to capture data density  through the density of
their codebook as introduced in [Villman+Clausen:2006]_ where authors considers
the generalised error

.. math::

   E_\gamma = \int_\Omega \lVert \mathbf{w}_s - \mathbf{v} \rVert^\gamma
  P(\mathbf{v}) d\mathbf{v}

and  introduces the  relation  `P(\mathbf{w}) \propto  \rho(\mathbf{w})^\alpha`
with `\rho(\mathbf{w})` being the weight  vector density and `\alpha` being the
*magnification  exponent*  or  *magnification   factor*.  If  we  consider  the
intrinsic  (or Hausdorff)  dimension  `d`  of the  data,  the relation  between
magnification and `d` is given by `\alpha = \frac{d}{d+\gamma}` and an ideal VQ
achieves a  magnification factor of  1. However, DSOM algorithm  clearly states
that if a neuron is already close  enough to a presented data, there is no need
for the neighbours  to learn anything and this results in  a codebook that does
not follow the magnification law as illustrated on figure :fig:`density` for
three very simple two-dimensional non homogeneous distributions.

.. figure:: images/density.png
   :target: images/density.png
   :label:  density
   
   Three DSOM have been trained  on a disc distribution using different density
   areas.   **Left.**   The   density    is   uniform   all   over   the   disc
   (0.25).  **Center**. Outer  ring has  higher  density (.4)  than inner  disc
   (.1).  **Right**.  Outer  ring  has  lower  density  (.1)  than  inner  disc
   (.4).  Despite  these  different   density  distributions,  the  three  DSOM
   self-organise onto the support of the distribution (the whole disc) and does
   not try to match density.

Said differently,  what is actually  mapped by the  DSOM is the  *structure* or
*support* of  the distribution (`\Omega` using notations  introduced in section
[definitions]_) rather than the density.


Elasticity                                                                     
-------------------------------------------------------------------------------
The DSOM algorithm is not parameter free since we need to control when a neuron
may be considered to be *close enough* to a data such that it prevents learning
for its neighbours. This is the role of the elasticity parameter that modulates
the   strength  of   the  coupling   between   neurons  as   shown  on   figure
:fig:`elasticity` for a simple two-dimensional normal distribution.

.. figure:: images/elasticity.png
   :target: images/elasticity.png
   :label:  elasticity

   Three DSOM with respective elasticity equal  to 1, 2 and 3 have been trained
   for 20 000 iterations on a normal distribution using a regular grid covering
   the  `[0,1]^2` segment  as  initialisation. Low  elasticity  leads to  loose
   coupling between neurons while higher elasticity results in a tight coupling
   between neurons.

This  notion  of elasticity  shares  some  common  concepts with  the  Adaptive
Resonance Theory (ART)  as it has been introduced  in [Grossberg:1987]_. In the
ART model, the  vigilance parameter has a critical  influence on learning since
it  controls the  actual partition  of the  input space:  high  vigilance level
produces  high  number of  very  precise  memories  while low  vigilance  level
produces  fewer  and  more generic  memories.   This  is  very similar  to  the
elasticity parameter:  if elasticity is  high, neurons tend to  pack themselves
very  tightly  together (code  vectors  are  relatively  close) while  a  lower
elasticity allows for looser coupling  between neurons. However, in the case of
ART, the vigilance parameter also  governs the number of final prototypes since
they can be  created on demand. In  the case of DSOM, the  number of prototypes
(i.e. neurons) is fixed and they are  supposed to span the whole input space to
ensure convergence. Consequently, there  exists a relation between the diameter
of  the support  (defined as  the maximum  distance between  any two  points in
`\Omega`), the number of neurons and the elasticity parameter. In the one hand,
if elasticity  is too high,  neurons cannot span  the whole space and  the DSOM
algorithm  does not  converge, in  the other  hand, if  elasticity is  too low,
coupling between  neurons is weak  and may prevent self-organisation  to occur:
code-vectors  are evenly  spread on  the support  but they  do not  respect the
neighbourhood   relationship  anymore.  There   certainly  exists   an  optimal
elasticity for a  given distribution but we did not  yet investigate fully this
relationship and we do not have  formal results. As a preliminary work, we have
studied the relationship  between elasticity and the initial  conditions in the
one dimensional case  using a very simple experimental  setup where the dataset
is made  of only  two samples (one  at 0 and  the other  at 1) as  explained on
figure  :fig:`convergence`. This figure  clearly shows  a discontinuity  in the
error when  elasticity is varying from 1.0  to 4.0 but at  different places for
different  initial conditions.  The reason  comes  from the  dependency of  the
learning to the  distance between the winner node and  the presented data. When
this difference is  large, a large correction of weights  occur on all networks
nodes  and this is  only attenuated  by their  distance to  the winner  and the
network elasticity.

.. figure:: images/convergence.png
   :target: images/convergence.png
   :label:  convergence

   Several  one-dimensional DSOM  with two  nodes  have been  trained for  2500
   epochs  using  a dataset  of  two  samples (0  and  1)  that were  presented
   alternatively. Each  point of each curve  represents the error  of a network
   with  given elasticity  and initial  conditions. Point  A represents  a case
   where elasticity is too high and  makes the network to oscillate while point
   B represents a case where elasticity  was low enough to allow the network to
   properly converge (towards x=0 and y=1).

In  the   presented  experimental  setup,   data  (0  and  1)   were  presented
alternatively and lead  to a convergence when elasticity was  low enough and to
an oscillatory  behaviour (not visible on  the figure) when  elasticity was too
high. This oscillatory behaviour can  be understood most simply when looking at
scheme A  on the  figure. Each  correction made to  the network  in one  way is
immediately  counter-balanced in  the other  way when  next data  is presented.
This  preliminary  study  lead us  to  think  that  the  choice of  an  optimal
elasticity not  only depends on  the size  of the network  and the size  of the
support but also  on the initial conditions. If we were  to generalise from the
simple study above,  the initial configuration of the  network should cover the
entire support as much as possible to reduce elasticity dependency.


Convergence                                                                    
-------------------------------------------------------------------------------
It  is well known  that the  convergence of  the Kohonen  algorithm has  not be
proved  in the  general case  [Cottrel+Al:1998]_ even  though  some conditional
convergence  properties  have  been  established in  the  one-dimensional  case
[Cottrell+Al:1987]_. Furthermore, in the case  of continuous input, it has been
shown that there does not  exist an associated energy function [Erwin+Al:1992]_
and in the  case of a finite  set of training patterns, the  energy function is
highly discontinuous [Heskes:1999]_. In the  case of the dynamic SOM, the proof
of convergence is straightforward since we  can exhibit at least one case where
the DSOM does not converge, when the number of nodes is less then the number of
data as illustrated on figure :fig:`wrong`.

.. figure:: images/wrong.png
   :target: images/wrong.png
   :label:  wrong

   Due to its  dynamic nature, the dynamic SOM cannot  converge when the number
   of nodes (4 here)  is less than the number of data (5  here). NG and SOM can
   converge on an approximated solution  thanks to both their decaying learning
   rate and neighborhood and this  explains why three nodes are exactly aligned
   with  their corresponding  data while  the  last node  found a  mid-distance
   position. In  the case of  DSOM and because  of the constant  learning rate,
   every node is moving at each presented data and thus cannot converge at all.

Most generally, in case where the number of nodes is less than the total number
of   presented  data,  we   can  predict   that  the   dynamic  SOM   will  not
converge. Moreover, a similar problem occurs  if the number of nodes is exactly
equal to the number of data  and if nodes are initially distributed uniquely on
each data.  In such an  initial setup, the  learning parameter is zero  for any
presented data and this prevents the network to learn anything at all. We could
say that it does converge in such a case (network is frozen) but if the initial
configuration does  not correspond to a  proper unfolded one,  the answer would
not  be really  satisfactory.  A proof  of  convergence would  then require  to
identify configurations  (initial conditions, size,  elasticity, learning rate)
where the network  may have chances to converge but we  think this is currently
out of the scope of this paper.



Experimental results                                                           
===============================================================================
We report  in this section some  experimental results we  obtained on different
types of distribution that aim at  illustrating DSOM principles. We do not have
yet  formal results  about convergence  and/or quality  of the  codebook.  As a
consequence, these results do not  pretend to prove anything and are introduced
mainly to illustrate qualitative behaviour of the algorithm.

Unless stated otherwise, the learning procedure in following examples is:

- A distribution is chosen (normal, uniform, etc.)
- A discrete sample set of samples is drawn from the distribution
- Model learns for `n` iterations
- At each iteration, a sample is picked randomly and uniformly in the
  discrete sample set
- Distortion is measured on whole sample set every 100 iterations using
  equation :eq:`error`. 

The  distortion  error  is  plotted   above  each  graphics  to  show  rate  of
convergence.


Non stationary distribution                                                    
-------------------------------------------------------------------------------
In order  to study dynamic  aspect of the  DSOM algorithm, three  networks (NG,
SOM, DSOM)  have been trained for  20 000 iterations on  a dynamic distribution
that vary  along time: a  uniform distribution (1) on  [0.0,0.5]×[0.0,0.5] from
iterations 0  to 5000, a  uniform distribution (2) on  [0.5,1.0]×[0.5,1.0] from
iterations  5000 to 10000,  a uniform  distribution (3)  on [0.0,0.5]×[0.5,1.0]
from  iterations  10000  to 15000  and  a  final  uniform distribution  (4)  on
[0.5,1.0]×[0.0,0.5] from iterations 15000  to 20000. NG shows some difficulties
in tracking  various changes and  the final state  reflects the history  of the
distribution: there are many code  words within the first distribution and very
few in the  final one. In the case  of SOM, the algorithm can  almost cope with
the  dynamic nature  of the  distributions  as long  as its  learning rate  and
neighbourhood function are large enough to  move the codebook into the new data
region. This  is the  case for distributions  (1) to  (3) but the  final change
makes the SOM network unable to  map the final distribution as expected because
of the time  dependency of the algorithm.  In the case of DSOM,  the network is
able to  accurately track each  successive distribution with a  short transient
error correlated to  the distribution change. We think  this behaviour reflects
cortical  plasticity  seen  as a  tight  coupling  between  the model  and  the
environment.

.. figure:: images/dynamic.png
   :target: images/dynamic.png
   :label:  dynamic

   Three networks (NG, SOM, DSOM) have  been trained for 20 000 iterations on a
   dynamic distribution  that vary  along time: a  uniform distribution  (1) on
   [0.0,0.5]×[0.0,0.5] from iterations 0 to 5000, a uniform distribution (2) on
   [0.5,1.0]×[0.5,1.0] from  iterations 5000  to 10000, a  uniform distribution
   (3)  on [0.0,0.5]×[0.5,1.0]  from  iterations  10000 to  15000  and a  final
   uniform  distribution (4)  on [0.5,1.0]×[0.0,0.5]  from iterations  15000 to
   20000.


High-dimensional distributions                                                 
-------------------------------------------------------------------------------
Until now, we have  considered only trivial two-dimensional distributions whose
intrinsic  dimension matched  the topography  of the  network. We  now consider
higher dimensional  distribution with  unknown intrinsic dimension.   Using the
standard Lena  grey-level image as a  source input, samples of  8×8 pixels have
been  draw   uniformly  from   the  image  and   presented  to   the  different
networks. 1000 such samples have been  drawn and all three networks have learnt
during  10 000  iterations. As  illustrated on  figure :fig:`lena`,  the strong
influence of neighbourhood  in the case of SOM leads to  a final codebook where
vectors tend  to be very homogeneous and  composed of a mean  value with little
variations around  this mean  value. In  the case of  NG, things  are different
because of the absence of  topographic constraints: NG converges rapidly toward
a stable  solution made  of qualitatively different  filters, part of  them are
quite  homogeneous  like in  SOM  but some  others  clearly  possess a  greater
internal  variety. In  the case  of DSOM,  we can  also check  on the  figure a
greater variety of filters that are self-organised.

.. figure:: images/lena.png
   :target: images/lena.png
   :label:  lena

   Three networks  (NG, SOM, DSOM) have  been trained for 20  000 iterations on
   1000 samples  of size  8×8 pixels  that have been  drawn uniformly  from the
   standard lena grey image.

The  meaning of  such a  greater variety  of  filters in  the case  of DSOM  is
difficult  to appreciate.  In  the one  hand,  if we  were  to reconstruct  the
original  image  using  those  filters,  we would  certainly  obtain  a  larger
distortion error. In the other hand,  if those filters were supposed to extract
useful information from  the image, they would certainly  give a better account
of the structure of the image.


Conclusion                                                                     
===============================================================================
One of the major problem of most neural map algorithms is the necessity to have
a finite set  of observations to perform adaptive learning  starting from a set
of  initial parameters (learning  rate, neighbourhood  or temperature)  at time
`t_i` down  to a set  of final  parameters at time  `t_f`. In the  framework of
signal processing  or data analysis, this may  be acceptable as long  as we can
generate a finite set of samples in order to learn it off-line. However, from a
more behavioural point of view, this is not always possible to have access to a
finite set and we must face on-line learning. As explained in the introduction,
if  we consider  the  existence of  a critical  period  in the  early years  of
development,  the problem  may be  solved  using decreasing  learning rate  and
neighbourhood over an extended period of  time. But if this may explain to some
extents  the development  of early  sensory filters,  this fails  at explaining
cortical   plasticity    at   a   more   broad   level.     As   explained   in
[Buonomano+Al:1998]_,  we know  today that  *"cortical representations  are not
fixed  entities, but  rather,  are  dynamic and  are  continuously modified  by
experience"*. How can we achieve both stability and reactivity ?

We proposed  to answer this question  by introducing a variant  of the original
SOM learning algorithm  where time depency has been  removed. With no available
formal  proof  of  convergence  and   based  on  several  experiments  in  both
two-dimensional, high-dimensional  cases and dynamic  cases, we think  this new
algorithm allows for on-line and  continuous learning ensuring a tight coupling
to the environment.  However, the resulting codebook does  not fit data density
as expected  in most  VQ algorithms. This  could be  a serious drawback  in the
framework  of signal  processing or  data compression  but may  be  a desirable
property  from a  behavioural point  fo  view. For  example let  us consider  a
picture of a (very) snowy landscape with a small tree in the middle. If we want
to mimic  visual exploration of the  scene using eye saccades,  we can randomly
pick small  patches within the  image and present  them to the model.  Not very
surprisingly, the  vast majority  of these patches  would be  essentially white
(possibly with some variations) because the whole image is mainly white. From a
pure VQ point of view, the codebook would reflect this density by having a vast
majority of its representations into the  white domain and if the tree is small
enough, we could even have only white representation within the codebook. While
this would serve data  compression, how much is it relevant in  general ? We do
not have  the answer  in the  general case but  we think  this must  be decided
explicitely depending on task. DSOM allows such explicit decision since it maps
the structure of the data rather than  their density. This means that in a more
general framework, we could expect an external structure to attach some kind of
motivation for  each data that would  modulate its learning. If  some region of
the  perceptive space  is judged  behaviourally relevant,  model  could develop
precise representations in this region but if learning is driven solely by data
density  (like  in  most  VQ),  such modulation  would  certainly  be  strongly
attenuated or not possible at all.


Appendix A                                                                     
===============================================================================
.. nosectnum::

Here  are  some  results  linked  to various  distributions  illustrating  both
differences between NG, SOM and DSOM as well as DSOM specific properties.

.. figure:: images/uniform.png
   :target: images/uniform.png
 
   Three 8×8 networks  (NG, SOM, DSOM) have been trained  for 20 000 iterations
   on a uniform  square distribution using 10 000  samples.  Initialisation has
   been done by placing initial code vectors randomly over the [0,1]² area.


.. figure:: images/ring.png
   :target: images/ring.png

   Three 8×8 networks  (NG, SOM, DSOM) have been trained  for 20 000 iterations
   on a ring distribution using 10 000 samples. Initialisation has been done by
   placing initial code vectors randomly over the [0,1]² area.


.. figure:: images/double-ring.png
   :target: images/double-ring.png

   Three 8×8 networks  (NG, SOM, DSOM) have been trained  for 20 000 iterations
   on a uniform double  ring-distribution using 10 000 samples.  Initialisation
   has been done by placing initial code vectors randomly over the [0,1]² area.


.. figure:: images/gaussian-filters.png
   :target: images/gaussian-filters.png


   Three 8×8 networks  (NG, SOM, DSOM) have been trained  for 20 000 iterations
   on a set  of noisy rotated elongated Gaussians whose  angles have been drawn
   from a uniform distribution in  [-π/2,+π/2] .  An input is represented
   as a two-dimensional 16×16 vector  of real values (∈ [0,1]) and additive
   noise has been added using uniform random variables in [-0.1,0.1].


Appendix B                                                                     
===============================================================================
.. nosectnum::

.. figure:: movies/sphere.avi, movies/sphere.ogg
   :controls:
   :figwidth: 45%
   :figclass: right

   A 32×32 DSOM has been trained for 10000 iterations on a set of 10 000 points
   uniformly distributed over the surface of a sphere of radius 0.5 centered at
   (0.5,0.5,0.5).  Initialisation has been done by placing initial code vectors
   at the center of the sphere and elasticity has been set to 1.0.

.. figure:: movies/cube.avi, movies/cube.ogg
   :controls:
   :figwidth: 45%
   :figclass: clear-left

   A 32×32 DSOM has been trained for 10000 iterations on a set of 10 000 points
   uniformly distributed over  the surface of a cube of  radius 0.5 centered at
   (0.5,0.5,0.5).  Initialisation has been done by placing initial code vectors
   at the center of the sphere and elasticity has been set to 1.0.


.. figure:: movies/sphere-spheres.avi, movies/sphere-spheres.ogg
   :controls:
   :figwidth: 45%
   :figclass: right

   Self-reorganization from sphere to spheres surface

.. figure:: movies/sphere-cube.avi, movies/sphere-cube.ogg
   :controls:
   :figwidth: 45%
   :figclass: clear-left

   Self-reorganization from sphere to cubic surface



References                                                                     
===============================================================================
.. nosectnum::

.. [BachyRita+Al:1969] P. B. y Rita, C. Collins, F. Saunders, B. White, and
   L. Scadden. Vision substitution by tactile image projection. In *Nature*,
   221:963-964, 1969.

.. [BachyRita:1972] P. BachyRita. *Brain Mechanisms in Sensory Substitution*.
   Academic Press New York, 1972.

.. [Buonomano+Al:1998] D. Buonomano, M. Merzenich, Cortical plasticity: From
   synapses to maps, *Annual Review of Neuroscience* 21 (1998) 149--186.

.. [Cottrel+Al:1998] M. Cottrell, J. Fort, G. Pagès, Theoretical aspects of the
   som algorithm, *Neurocomputing* 21 (1998) 119--138.

.. [Cottrell+Al:1987] M. Cottrell, J. Fort, Etude d'un algorithme
   d'auto-organisation, *Annales Institut Henri Poincaré* 23~(1) (1987) 1--20.

.. [Daw:1994] N. Daw. Mechanisms of plasticity in the visual  cortex. In
   *Investigative Ophthalmology*, 35:4168-4179, 1994.

.. [Deng+Al:2000] D. Deng, N. Kasabov, Esom: An algorithm to evolve
   self-organizing maps from on-line data streams, in: *Proc. of IJCNN'2000*,
   Vol. VI, Como, Italy, 2000, pp. 3--8.

.. [Deng+Al:2003] D. Deng, N. Kasabov, On-line pattern analysis by evolving
   self-organizing maps, *Neurocomputing* 51 (2003) 87--103.

.. [Durbin+Willshaw:1987] R. Durbin, D. Willshaw, An analogue approach to the
   travelling salesman problem. In *Nature* 326 (1987) 689-691.

.. [Erwin+Al:1992] E. Erwin, K. Obermayer, K. Schulten, Self-organizing maps:
   Ordering, convergence properties and energy functions, *Biological
   Cybernetics* 67 (1992) 47--55.

.. [Fritzke:1995] B. Fritzke. A growing neural gas network learns topologies.
   In G. Tesauro, D. Touretzky, and T. Leen, editors, *Advances in Neural
   Information Processing Systems 7*, pages 625-632. MIT Press, Cambridge MA,
   1995.

.. [Fritzke:1997] B. Fritzke, A self-organizing network that can follow
   non-stationary distributions, in: *ICANN*, 1997, pp. 613--618.

.. [Furao+Al:2006] S. Furao, O. Hasegawa, An incremental network for on-line
   unsupervised classification and topology learning, *Neural Networks* 19 (1)
   (2006) 90--106.

.. [Furao+Al:2007] S. Furao, T. Ogura, O. Hasegawa, An enhanced self-organizing
   incremental neural network for online unsupervised learning, *Neural
   Networks* 20 (8) (2007) 893--903.

.. [Grossberg:1987] S. Grossberg, Competitive learning: From interactive
   activation to adaptive resonance. In *Cognitive Science* 11(1) (1987)
   23-63.

.. [Heskes:1999] T. Heskes, Energy functions for self-organizing maps, in:
   E. Oja, S. Kaski(Eds.), *Kohonen Maps*, Elsevier, Amsterdam, 1999,
   pp. 303--315.

.. [Hubel+Wiesel:1965] D. Hubel and T. Wiesel. Receptive fields and functional
   architecture in two non-striate visual areas (18 and 19) of the
   cat. In *Journal of Neurophysiology*, 28:229-289, 1965.

.. [Hubel+Wiesel:1970] D. Hubel and T. Wiesel. The period of susceptibility to
   the physiological effects of unilateral eye closure in kittens. In *Journal
   of Physiology*, 206:419-436, 1970.

.. [Kaski+Al:1998] S. Kaski, J. Kangas, T. Kohonen, Bibliography of
   self-organizing map (som) papers: 1981-1997, *Neural Computing Surveys* 1
   (1998) 102--320.

.. [Keith-Magee:2001] R. Keith-Magee, Learning and development in kohonen-style
   self-organising maps, Ph.D. thesis, Curtin University of Technology (2001).

.. [Kohonen:1982] T. Kohonen. Self-organized formation of topologically correct
   feature maps. In *Biological Cybernetics*, 43:59-69, 1982.

.. [Linde+Al:1980] A. B. Linde, A. Buzo and R. Gray. An algorithm for vector
   quantization design. In *IEEE Trans. on Communications*, COM-28:84-95, 1980.

.. [MacQueen:1967] J. B. Macqueen. Some methods of classification and analysis
   of multivariate observations. In *Proceedings of the Fifth Berkeley
   Symposium on Mathematical Statistics and Probability*, pages 281-297, 1967.

.. [Martinetz+Al:1993] T. M. Martinetz, S. G. Berkovich, and  K. J. Schulten.
   Neural-gas network for vector quantization and its application to
   time-series prediction. In *IEEE Trans. on Neural Networks*, 4(4):558-569,
   1993.

.. [Oja+Al:2003] M. Oja, S. Kaski, T. Kohonen, Bibliography of self-organizing
   map (som) papers: 1998-2001 addendum, *Neural Computing Surveys* 3 (2003)
   1--156.

.. [Pöllä+Al:2009] M. Pöllä, T. Honkela, T. Kohonen, Bibliography of
   self-organizing map (som) papers: 2002-2005 addendum, Tech. rep.,
   Information and Computer Science, Helsinki University of Technology (2009).

.. [Ramachadran+Al:1992] V. Ramachandran, D. Rogers-Ramachandran, and
   M. Stewart. Perceptual correlates of massive cortical reorganization. In
   *Science*, 258:1159-1160, 1992.

.. [Villman+Clausen:2006] T. Villman, J. Claussen, Magnification control in
   self-organizing maps and neural gas. In *Neural Computation* 18 (2006)
   446-449.



About this document                                                            
===============================================================================
.. nosectnum::

This document has  been generated using a modified  version of the `rst2html.py
<rst2html.py>`_ python script for  converting a restructured text document into
an  html   one.   The   rst  source  of   this  document  is   avalaible  `here
<article.rst.html>`_.