Desain, P., & Honing, H. (1991). Quantization of musical time: a connectionist approach. In P.M. Todd and D. G. Loy (eds.), Music and Connectionism. Cambridge: MIT Press.


Boulez describes two kinds of time in music (Boulez 1971): "striated" time and "smooth" time. The first is "filled with counting", the second is not. They distinguish between discrete time intervals (a metric time scale) and continuously variable time intervals (occurring in tempo changes and expressive timing). Musical time is the product of these two time scales (Clarke 1987). Both kinds are present in music notation, although the notation of continuous time is less developed than that of metric time (often just a word like rubato or accelerando is written in the score).

A musician can deviate from a metrical score in many different ways: random errors in timing due to limits of the motor system, hierarchical errors due to the structure of mental processes, systematic violations of the norm (e.g. shortening triplets), expressive timing to emphasize a hierarchical structure, and timing differences occurring in ensemble playing. Research into these processes is done by comparing the performed time intervals (musical time) with the score (metrical time). If the latter is not known, it has to be extracted. Human subjects can extract, memorize and reproduce the metrical structure from a musical performance, even when there are large deviations from the discrete time scale. This is surprising, given that the durations in performed music can deviate by up to 50% from the original score (Povel 1977). It even seems that the perception of time intervals on a discrete scale is an obligatory, automatic process. This was concluded from research in which experts listened to non-metric time divisions. These subjects could not reproduce them properly, and their reproductions showed a systematic error in the direction of the metrical intervals. This so-called categorical perception can also be found in speech perception and vision.

In our research we evaluated traditional methods of simple quantization, as used in commercially available software packages for automatic transcription. We also studied more advanced (but low-level) methods like tempo tracking. Dannenberg and Mont-Reynaud (1987) report a 30% error rate for a 'real-time foot tapper' that uses this method. We therefore started looking at methods that use knowledge of rhythm in dealing with quantization. Longuet-Higgins (1987) describes a hybrid method (tempo tracking plus the use of knowledge about metre), and Mont-Reynaud and Goldstein (1985) describe a grammar-based approach for analyzing rhythms which could possibly feed information back into the quantizer. But these approaches seem to share the same problems as all traditional AI programs: domain dependence and brittleness.
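
To make the weakness of simple quantization concrete, the following is a minimal sketch of rounding onsets to a fixed grid. It is an illustration only, not the method of any particular software package; the function name `quantize_to_grid`, the grid size and the example onsets are all assumptions made for this example.

```python
def quantize_to_grid(onsets, grid):
    """Round each onset time (in seconds) to the nearest grid step.

    Hypothetical helper for illustration: returns the index of the
    nearest grid line for every onset, with no knowledge of tempo
    or rhythm.
    """
    return [round(t / grid) for t in onsets]


# Five evenly spaced notes, played 15% slower than the grid assumes
# (assumed example data: grid step 0.25 s, played step 0.2875 s).
played = [0.0, 0.2875, 0.575, 0.8625, 1.15]
steps = quantize_to_grid(played, grid=0.25)
print(steps)

# Inter-onset intervals in grid steps: equal notes should give equal values.
intervals = [b - a for a, b in zip(steps, steps[1:])]
print(intervals)
```

Because the grid does not follow the performer's tempo, the accumulated drift eventually pushes a note across a grid boundary: here the last of four equal intervals is quantized as two grid steps instead of one.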

Connectionism gave us a new kind of model that has some characteristics that traditional AI models were lacking. It consists of a large number of simple elements, each with its own activation level, connected with each other in a complex network. These cells excite or inhibit each other via the connections. After the network is given a starting state it can converge to an equilibrium. Connectionist models are characterized by their robustness and flexibility (Rumelhart et al. 1986). In music perception research these models were used by Bharucha (1987). He modelled the perception of tone scales and the build-up of expectation of certain chords in a progression.

These models had not yet been used for quantization. For the quantization problem, they can be used to build a model that is at rest with metrical time intervals and converges from non-metrical performed data to such a metrical equilibrium. Based on this idea, several models were designed and run in simulation. They showed the desired behavior. We used cells and excitation functions at a high level of abstraction. They could certainly be implemented in one of the formalisms proposed for cells of neural networks, but describing such a "biological" model is currently outside the scope of our research.

We started with two kinds of cells: the basic-cell, with an activation value equal to the played (or heard) duration, and the interaction-cell, which is connected bidirectionally to two basic-cells. The interaction-cell steers the basic-cells that it is connected to toward integer multiples of each other, but only if they are already near such a state. The function used to excite or inhibit two basic-cells is based on the quotient of their activation values. This network can only quantize very simple metrical structures. Therefore sum-cells are postulated that adjust themselves to the sum of the activation levels of the consecutive basic-cells they are connected to. In this way they represent the longer time intervals generated by a sequence of notes. These cells are also interconnected by interaction-cells, so that they too tend to stabilize on integer ratios to each other and, while doing so, steer the basic-cells toward a metrical score. The network can be rather sparse, allowing only interaction of consecutive or hierarchical time intervals. A typical net of around 10 basic-cells, with a total of around 100 cells, will stabilize in about 40 iterations, in which errors ranging from 0 to 30% are reduced to 0.1%.
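
The behavior of basic-cells and interaction-cells can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the model from the paper: the shape of `interaction`, the `peak` and `rate` parameters, and the rule that each pair keeps its summed duration are all choices made for this example, and sum-cells are left out entirely.

```python
import math


def interaction(ratio, peak=2.0):
    """Pull on a duration ratio (>= 1) toward the nearest integer.

    Assumed form: the pull is the signed distance to the nearest
    integer, weighted so that it vanishes halfway between two
    integers (the cells are only steered when already near a
    metrical state).
    """
    offset = round(ratio) - ratio              # signed distance to nearest integer
    frac = ratio - math.floor(ratio)           # position between two integers
    weight = abs(2.0 * (frac - 0.5)) ** peak   # 1 near an integer, 0 at the midpoint
    return offset * weight


def relax_pair(a, b, rate=0.5):
    """One interaction-cell update for two positive durations.

    The quotient of the two activations is nudged toward an integer
    while their sum is kept constant (an assumption of this sketch).
    """
    big, small = max(a, b), min(a, b)
    ratio = big / small
    new_ratio = ratio + rate * interaction(ratio)
    total = a + b
    new_small = total / (1.0 + new_ratio)
    new_big = total - new_small
    return (new_big, new_small) if a >= b else (new_small, new_big)


def quantize(durations, iterations=100, rate=0.5):
    """Repeatedly relax every pair of neighbouring basic-cells."""
    cells = list(durations)
    for _ in range(iterations):
        for i in range(len(cells) - 1):
            cells[i], cells[i + 1] = relax_pair(cells[i], cells[i + 1], rate)
    return cells


# Two performed durations that are close to a 2:1 relation.
long_note, short_note = quantize([2.05, 0.95])
print(long_note / short_note)
```

With only neighbouring pairs interacting, this sketch handles very simple rhythms at best, which is the limitation that motivates the sum-cells described above.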

Full paper (pdf).

Code (lisp).