A reply to S. W. Smoliar's "Modelling Musical Perception: A Critical View"

Peter Desain & Henkjan Honing

[In press as: Desain, P., & Honing, H. (in press). A reply to S. W. Smoliar's "Modelling Musical Perception: A Critical View." In N. Griffith & P. Todd (eds.). Cambridge, Mass.: MIT Press.]


In "Modelling Musical Perception: A Critical View," Stephen Smoliar (1994) presents a review of some of the work in "Music, Mind and Machine" by Peter Desain and Henkjan Honing (1992) [1]. This publication, a collection of research papers exploring an interdisciplinary approach to the study of music cognition, is the result of a close collaboration, though Smoliar's critique mentions mostly Desain. For full reviews see Smoliar, (1995) and Dannenberg (1995). Since we were not contacted before the publication of his critical review, this reply is an attempt to correct some of the errors and misrepresentations made in it.

The critical view concentrates first on the "quantization problem." Smoliar, in common with many others, uses the term to mean only the extraction of a metrical score from a performance. In general, however, quantization is the process that separates the discrete (score) and continuous (expressive) components of musical time in a performed musical fragment. Smoliar took the programs of a connectionist model (page 61) and of a Lisp re-implementation of a symbolic musical parser (Longuet-Higgins, 1976) (page 253) and re-evaluated them on new data (i.e., unpublished "real music"). In this enterprise, a number of serious mistakes were made, including the following:


The data consisted of a "professional English horn player" playing the English horn solo from Wagner's Tristan und Isolde on a MIDI keyboard, at a relatively slow tempo and freely using tempo rubato. Suspicious of this curious experimental setup, we obtained the original data (a MIDI file). In addition to the key-presses that were measured and used as input for the models, the file contained a large number of so-called 'aftertouch' messages. These represent key pressure and are usually mapped to a loudness parameter of the synthesizer. This means that there is no guarantee that the perceptual onsets, as heard by the performer, occurred at the same time as the measured note onsets. Nonetheless it is this pattern of onsets that the quantizer has to interpret (see Table I). Smoliar recognizes some of these problems when he states that the performance cannot even be transcribed by human listeners, yet he proceeds to criticize the model's performance on this pathological input.
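
To make the preprocessing issue concrete, the sketch below shows how such a file can be reduced to the onset pattern the models receive. This is our minimal illustration, assuming Python with the third-party mido library and a hypothetical file name; it is not the setup used in either study:

    # A minimal sketch: extract note onsets from a MIDI file, discarding
    # the 'aftertouch' (key-pressure) messages that also populate it.
    # Assumes the mido library; 'tristan_solo.mid' is a hypothetical name.
    import mido

    onsets = []   # (time in seconds, pitch) of each note-on event
    clock = 0.0   # msg.time is the delta time since the previous message
    for msg in mido.MidiFile('tristan_solo.mid'):
        clock += msg.time
        if msg.type == 'note_on' and msg.velocity > 0:
            onsets.append((clock, msg.note))
        # 'polytouch' and 'aftertouch' messages are silently skipped:
        # they modulate loudness, not the timing pattern.

    # The inter-onset intervals are what a quantizer has to interpret.
    iois = [b[0] - a[0] for a, b in zip(onsets, onsets[1:])]

Note that this recovers only the mechanical key-press onsets; the perceptual onsets the performer was listening to remain unknown.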


Furthermore, the data contains grace notes -- short notes that fall outside the metrical framework and are essentially 'unquantizable.' The fragment also contained two errors, one of which was identified as a "brief erroneous blip" and left in the data, while the other was simply removed. Some common data-collection precautions for expressive timing research thus seem to have been ignored: use expert performers (playing their own instruments) and record repeated performances, so as to distinguish between motor noise, errors, and the actual musical intention of the performer.

The quantizer network was never designed to quantize a long musical fragment in a single operation. While there is evidence that some of the context following a note can influence its rhythmic interpretation, a human listener does not need to wait until the end of a fragment before perceiving its rhythm, nor would that be realistic in terms of memory capacity [2]. Having initially designed the network for short fragments, we worked to extend it into a process model in which the input data shifts through the network while being quantized (see page 75). Thus Smoliar applies the original model in a way that was never intended: to a whole performance at once.
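
A schematic sketch of such a process model is given below. It is our schematic of the general idea, not the implementation described on page 75, and the stand-in quantizer here is a naive snap-to-grid, merely a placeholder for the network's relaxation: a long sequence of inter-onset intervals shifts through a fixed-size window, and the oldest interval is committed once enough right context has been heard.

    # Schematic windowed quantization (our sketch, not the code on page 75).
    def snap(window, base=0.25):
        """Stand-in for the network's relaxation: naive rounding to a grid."""
        return [max(base, base * round(ioi / base)) for ioi in window]

    def process(intervals, size=8, quantize=snap):
        window, output = [], []
        for ioi in intervals:
            window.append(ioi)
            window = quantize(window)         # re-interpret the current context
            if len(window) > size:
                output.append(window.pop(0))  # commit the oldest interval
        return output + window                # flush the remainder at the end

    # e.g. process([0.26, 0.24, 0.51, 0.12, 0.13, 0.49])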


The quantizer network slowly relaxes into a state in which many integer ratios between time durations are discovered, but there are of course also cells that represent non-integer ratios. These continue to 'pull' slightly on the time intervals, which is the cause of the inaccuracy still inherent in the output pattern (especially in long patterns with many active cells). Mutual inhibition, and a strategy for letting good cells 'win', were not part of the original design [3]. However, an extension of this kind can easily cancel out the influence of cells that stay too far from perfect integer ratios, and thus boost the accuracy of the model. The problem that the resulting numbers still have to be categorized and named in a symbolic way (having decided what a good base for notation would be) falls completely outside the intended realm of the model. This issue only arises because the quantization model is equated with a technical application, a transcription system.
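
To make the mechanism concrete, here is a much-reduced relaxation step in the spirit of the model. It is our simplification of the network on page 61 (adjacent pairs of intervals only, no sum cells, a fixed interaction strength), in which each pair of neighboring intervals is pulled towards an integer duration ratio, with a pull that weakens for ratios far from an integer:

    # A much-simplified sketch of relaxation towards integer duration
    # ratios (our reduction of the model on page 61, not its actual code).
    # All intervals are assumed to be positive.
    def relax(iois, rate=0.1, iterations=200):
        iois = list(iois)
        for _ in range(iterations):
            for i in range(len(iois) - 1):
                a, b = iois[i], iois[i + 1]
                r = max(a, b) / min(a, b)               # ratio >= 1
                n = round(r)                            # nearest integer ratio
                r += (n - r) * rate / (1 + abs(n - r))  # weak pull when far off
                s = a + b                               # pair total is preserved
                small = s / (1 + r)
                a, b = (s - small, small) if a >= b else (small, s - small)
                iois[i], iois[i + 1] = a, b
        return iois

    # relax([0.26, 0.24, 0.51]) drifts towards 1:1:2 proportions.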


Regarding Longuet-Higgins' musical parser, the essence of this model is that it constructs metrical trees on top of the performed notes. Smoliar's claim that it "only reasons about notes" is therefore misleading. First, it does not reason: it searches for an appropriate metrical tree; and second, most of its work is above the level of notes. The metrical trees are constructed up to a metrical level that is given to the model as a parameter, be it a measure, the tactus, or the like. And, ironically, it is Smoliar himself who prevents the model from analyzing higher metrical levels, by starting it off with this parameter set to the duration of the first note: a half-bar. The observation that "[the connectionist model] deals with the same limited note-to-note relations that are addressed in the Longuet-Higgins algorithm" is also inaccurate. We showed, for example, that the connectionist quantizer is context sensitive, just as Clarke (1987) showed for human listeners, for whom different metrical contexts can yield different quantizations of the same material.
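
To illustrate what working above the level of notes means, here is a toy rendering of a metrical tree (ours, purely for illustration; it is not Longuet-Higgins' parser): a unit at the metrical level given as a parameter is recursively subdivided in twos or threes down to note-level durations.

    # A toy metrical tree (our illustration, not Longuet-Higgins' parser):
    # the top-level unit, given as a parameter, is recursively subdivided.
    def metrical_tree(duration, subdivisions):
        """subdivisions, e.g. (2, 3): two beats, each a triplet."""
        if not subdivisions:
            return duration                  # a leaf: a note-level duration
        k = subdivisions[0]
        return [metrical_tree(duration / k, subdivisions[1:]) for _ in range(k)]

    # metrical_tree(2.0, (2, 3)) -> [[1/3, 1/3, 1/3], [1/3, 1/3, 1/3]]

Starting such a construction with the top-level unit set to the duration of the first note leaves no higher levels to build, which is precisely the effect of Smoliar's parameter setting.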


Contrary to what Smoliar states, we found that Longuet-Higgins' musical parser works remarkably well. This can in part be attributed to some implicit interactions between the tempo-tracking decisions that are made at different metrical levels. These, for example, give rise to an area of patterns captured as triplets that is neatly shifted from its regular position towards the unequal way in which performers often play such rhythmic figures. Thus, while the musical parser is not intended as a model of expression, in its behavior it turns out to expect certain regularities in performance timing, undoubtedly because the algorithm was tested and refined extensively by Longuet-Higgins.

We made this "hidden knowledge" explicit through the use of the kind of simple low-dimensional behavioral analysis that Smoliar criticizes. In fact, the graphical representations of both the parameter space and the rhythm space proved immensely valuable tools for the analysis of the models: the parameter space shows the areas in which a model interprets rhythmic performances correctly (given that for a certain set it is known what 'correct' is), and the rhythm space shows the actual clustering behavior of the model in a space of all possible rhythms. It is unfortunate, but surely obvious, that to be able to print a graphical representation of these spaces in a book, one has to resort to a low-dimensional example. Interestingly, one of these low-dimensional representations turned out to be interpretable in terms of an expectancy of events still to happen in the future. It was elaborated into a cognitive model of rhythm perception (page 101), and subsequently into a beat induction model (Desain & Honing, 1994). It might well be that Longuet-Higgins' insight that quantization is best done in the context of metrical parsing is a very valuable one; this is why we have postponed further work on quantization until the research on beat and meter induction has matured.
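
For concreteness: in the low-dimensional example used in the book, a rhythm of three inter-onset intervals with a fixed total duration is a point in a two-dimensional space, so the clustering behavior of a quantizer can literally be drawn. The mapping below is our sketch of that idea, not the book's plotting code:

    # Sketch of a three-interval rhythm space (our illustration).  A rhythm
    # (t1, t2, t3), normalized so that t1 + t2 + t3 = 1, is a 2-D point,
    # since the third coordinate is then determined.
    def to_rhythm_space(t1, t2, t3):
        total = t1 + t2 + t3
        return (t1 / total, t2 / total)

    # Two performances of the same notated rhythm land near one point:
    # to_rhythm_space(0.26, 0.24, 0.50) -> (0.26, 0.24)
    # to_rhythm_space(0.24, 0.27, 0.49) -> (0.24, 0.27)
    # A quantization model maps whole regions of this plane onto single
    # points such as (0.25, 0.25); plotting those regions and their
    # boundaries gives the rhythm-space figures.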


Regarding our comments on another approach, the AI method used in the Stanford music transcription project, Smoliar raises doubts about our competence. Contrary to what he suggests, we are only too familiar with the "nuts and bolts" of the practice of AI in which powerful ideas are implemented as a rule-based system. These systems often quickly reach their limits: they become 'brittle' and difficult to extend (or even to understand) for their designers. Many such projects were abandoned and, worse, the valuable work that went into formalizing the knowledge involved was lost (see, e.g., Winograd and Flores, 1986) -- which is exactly what happened to the Stanford music transcription project.

Concerning our exploratory study of the use of autocorrelation for the analysis of timing data, Smoliar makes one mistake of interpretation after another. Since a uniform time series of tempo data captures the performance, not the rhythmic regularity of the score, it is a perfectly sound research question to ask to what extent autocorrelation methods can reveal periodic regularity in the expressive timing of a performance. Furthermore, because in a piece with an isochronous notated rhythm the series of inter-onset times is directly related (inversely proportional) to the tempo, no intervening data points have to be inferred, and a difficult problem (the tempo at a point in time where there is no event) is avoided. This is why (among other reasons) the Bach C-major Prelude is used so often in the literature on expressive timing, and why the choice of this piece is not "awkward" at all. Smoliar's demonstration that different interpretations 'unfortunately' give rise to different autocorrelation profiles, far from being a problem, is exactly what the method is intended to show: it distinguishes between different performance interpretations (though see page 122 for some problems and the limits of applicability of this method).
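
As a minimal sketch of the method (ours, not the analysis on page 122): for an isochronous score the inter-onset series itself can be autocorrelated directly, and a recurring expressive gesture, say a lengthening at every fourth note, shows up as a peak at that lag.

    # Minimal autocorrelation of an inter-onset series (our sketch).
    import numpy as np

    def autocorrelation(iois):
        x = np.asarray(iois, dtype=float)
        x = x - x.mean()                        # remove the mean tempo
        full = np.correlate(x, x, mode='full')  # all lags
        ac = full[full.size // 2:]              # keep non-negative lags
        return ac / ac[0]                       # normalize: lag 0 -> 1

    # A performance stretching every fourth note of an isochronous piece:
    # iois = [0.25, 0.25, 0.25, 0.32] * 16
    # autocorrelation(iois)[4] is close to 1, revealing a period-4
    # regularity in the expressive timing, not in the notated rhythm.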


To end on a more general methodological note, Smoliar's critique demonstrates the advantages of publishing computational models and data so that they are open to immediate test. Had he published his own data in turn, his criticism would also have been open to falsification. We agree wholeheartedly with the principle that AI programs should be tested on multiple examples (page 73), and we agree that much remains to be done in the case of quantization models.

Interestingly, even an algorithm that always produces a 'correct' output is not good enough: that does not validate the algorithm as a model of the cognitive process itself. If we want to make statements about the architecture of human cognition, we have to relate the architecture of the program to that of the human subject. This is still one of the major challenges of the computational modeling of music cognition.


References


Clarke, E. F. (1987). Categorical Rhythm Perception: An Ecological Perspective. In A. Gabrielsson (ed.), Action and Perception in Rhythm and Music (No. 55, pp. 19-33). Stockholm: Royal Swedish Academy of Music.

Dannenberg, R. B. (1995). Book review: Peter Desain and Henkjan Honing, "Music, Mind and Machine: Studies in Computer Music, Music Cognition and Artificial Intelligence." Music Perception, 12(3), 365-367.

Desain, P., & Honing, H. (1992). Music, Mind and Machine: Studies in Computer Music, Music Cognition and Artificial Intelligence. Amsterdam: Thesis Publishers.

Desain, P., & Honing, H. (1994). Advanced issues in beat induction modeling: syncopation, tempo and timing. In Proceedings of the 1994 International Computer Music Conference (pp. 92-94). San Francisco: International Computer Music Association.

Longuet-Higgins, H. C. (1976). The Perception of Melodies. Nature, 263, 646-653. Reprinted in Longuet-Higgins, H. C. (1987). Mental Processes. Cambridge, Mass.: MIT Press.

Smoliar, S. W. (1994). Modelling Musical Perception: A Critical View. Connection Science, 6(2-3), 209-222. See also this volume.

Smoliar, S. W. (1995). Book review: Peter Desain and Henkjan Honing, "Music, Mind and Machine: Studies in Computer Music, Music Cognition and Artificial Intelligence." Artificial Intelligence, 79, 361-371.

Winograd, T., & Flores, F. (1986). Understanding Computers and Cognition: A New Foundation for Design. Norwood, NJ: Ablex.

Footnotes


[1] All page numbers mentioned in this reply refer to this publication.

[2] The model takes about n^2 cells for processing an input string of length n.

[3] A recent version of the code is maintained online.