[published as: Desain, P and Honing, H. (1997) How to evaluate generative models of expression in music performance. in Isssues in Ai and Music Evaluation and Assessment. International Joint Conference on Artificial Intelligence., 5-7. Nagoya: Japan.]

 

 

How to Evaluate Generative Models of Expression in Music Performance

Peter Desain (1) & Henkjan Honing (1, 2)

 

(1) NICI
University of Nijmegen
P.O. Box 9104
NL-6500 HE Nijmegen
The Netherlands
desain@nici.kun.nl

(2) Music Department
University of Amsterdam
Spuistraat 134
NL-1012 VB Amsterdam
The Netherlands
honing@nici.kun.nl

 

 

    Abstract

     

    An outline is given of structural expression component theory (SECT). It generalizes existing generative models of music performance expression and extends them to deliver representations of musical timing and dynamics that can be composed into overall expression profiles. These compound profiles can then be fitted to actual performances.

    Based on this theory a technique is investigated to separate the expressive profile of a musical performance into its structural components, each explained by one generative model. This method (DISSECT) allows us to obtain appropriate parameter settings for the individual models and effectively separates the expressive signal into its components, each explained by one kind of musical structure. Thus the explanatory power of each model can be evaluated, and the amounts of variation that can be attributed to different musical structural descriptions can be estimated for different musical styles and for various interpretations.

     

    1 Introduction


    In this position paper we outline a technique that makes it possible to evaluate existing generative models of expressive timing and dynamics, e.g., [Clynes, 1987; Sundberg et al., 1989; Todd, 1989], by fitting their output directly to empirical data, i.e. actual musical performances. Although these generative models explain musical expression based on one kind of musical structure (e.g., meter, phrase, surface), an overall theory is lacking that describes how these different components are combined in performance. It is hard to find ways in which timing information of different kinds can be combined, but a representation that keeps tempo changes and time shifts separate is a promising candidate. We will present such an approach, named structural expression component theory (SECT). The theory makes use of generalizations of existing generative models and extends them to deliver representations of musical timing that can be composed into the overall expression profiles.
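To make the combination rule concrete, the following minimal sketch keeps the two kinds of timing information separate: multiplicative tempo (duration) factors per inter-onset interval and additive time shifts per note. The function name and the factor/shift encoding are our illustrative assumptions, not part of SECT as published.

```python
def performed_onsets(score_onsets, duration_factors, time_shifts):
    """Map score onsets (in beats) to performance times (in seconds).

    duration_factors: one multiplicative factor per inter-onset interval,
    the product of all multiplicative (tempo-like) components.
    time_shifts: one additive displacement per note, the sum of all
    additive (shift-like) components.
    """
    times = [0.0]
    for i in range(1, len(score_onsets)):
        ioi = score_onsets[i] - score_onsets[i - 1]
        times.append(times[-1] + ioi * duration_factors[i - 1])
    return [t + s for t, s in zip(times, time_shifts)]
```

With all factors 1.0 and all shifts 0.0 the performance is deadpan, i.e. identical to the score; each component model then contributes its own deviations multiplicatively or additively without interfering with the others' representation.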

    The study will result, at the same time, in a technique to separate the expressive profile of a musical performance into its structural components, each explained by one generative model. This method (implemented as a computer program called DISSECT in the POCO environment; [Desain and Honing, 1992]), in addition to obtaining the appropriate parameter settings for the individual models, also effectively separates the expressive signal into its components, each explained by one generative model. This enables us to determine the explanatory power of each model, and to estimate the amounts of variation that can be attributed to different musical structural descriptions in different musical styles and for various interpretations.

    Remarkably, the method of fitting generative models directly to the empirical data has never been used. The models were tested only fragmentarily, if at all, by an analysis-by-synthesis paradigm, by perceptual experiments, or simply by visual comparison of the output of the model with a real performance. A possible reason for the reluctance to use this simple method is that each model explains the contribution of only one kind of musical structure to the expressive profile. They generate expressive timing and dynamics from a score that is annotated with only one kind of structural description (be it metrical, phrase or surface structure). So they are all only partial models. As such, the process of optimizing the parameters to fit the data is confounded by the many components that are contributed by the other types of musical structure. For example, a large tempo deviation, like a final ritard, establishes such a large trend in the data that fitting a regular repetitive profile linked to metrical structure becomes impossible. This means that promising models cannot be tested directly and that the role of musical structure, and the important hypothesis that musical expression is directly based on it, cannot be evaluated further.

     


    2 Separating the components of expressive timing


    We propose a method of separating the expressive signal into its structural elements based on a generalization of the existing generative models. The three main generative theories that will be used are Clynes’ composers’ pulse based on metric structure, Todd’s parabolas linked to phrase structure, and Sundberg’s rule-system for local features.

    Clynes [1983; 1987] proposes a composer-specific and meter-specific composer’s pulse (not to be confused with his more controversial sentic forms): a recursive uneven subdivision of the time intervals representing the structural levels (e.g., in the Beethoven 6/8 pulse the subsequent half-bars span 49 and 51% of the bar duration, and each half-bar is divided again into 35, 29 and 36%). This composer’s pulse is assumed to communicate the individual composer’s personality. A similar procedure is given for dynamics. This generative theory stems from intuition, but the artificially generated performances indeed capture important regularities in performance, as is shown in evaluation studies where subjects compared real and artificial performances [Repp, 1990]. Other studies confirm that meter is indeed communicated somehow to the listener by means of expressive timing [Sloboda, 1983].
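A Clynes-style pulse is easy to state as code. The sketch below uses the proportions from the Beethoven 6/8 example quoted above; the function name and the flat two-level structure are our illustrative simplifications of the recursive scheme.

```python
def pulse_onsets(bar_duration, half_bar_props=(0.49, 0.51),
                 subdiv_props=(0.35, 0.29, 0.36)):
    """Onset times of the six eighth notes in one 6/8 bar under an
    uneven recursive subdivision: the bar is split into two half-bars
    (49/51%), each of which is split again (35/29/36%)."""
    onsets, t = [], 0.0
    for half in half_bar_props:
        half_dur = bar_duration * half
        for sub in subdiv_props:
            onsets.append(t)       # onset of this eighth note
            t += half_dur * sub    # advance by its uneven duration
    return onsets
```

Since the proportions at each level sum to 100%, the onsets always fit exactly into the notated bar; the expressive deviation lies entirely in the uneven spacing within it, repeated bar after bar.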

    Todd [1985; 1989] postulates parabola-shaped tempo curves linked to the phrase structure that hierarchically add up to yield the final tempo profile. A more general picture of this line of research can be found in Shaffer, Clarke & Todd [1985]. The nonuniformity of the phrase structure (phrases at the same level do not necessarily have the same length) is treated by stretching the parabolas. Todd proposes that the dynamic (loudness) contours have a similar structure, but, in contrast to Clynes, even the same parameters are used for dynamics, leaving one final tempo-to-loudness mapping [Todd, 1992]. The parameters for his model were derived by fitting them to empirical data by eye.
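The hierarchical summation can be sketched as follows. The parabola form, the depth parameter and the boundary list are our illustrative assumptions; Todd's actual parameterization differs in detail, but the additive structure across levels is the point.

```python
def rubato_profile(t, phrases):
    """Sum parabolic deviations from all phrases covering score time t.

    phrases: (start, end, depth) triples, one per phrase at any
    hierarchical level. Each parabola is stretched to its phrase's span:
    zero deviation at the phrase centre, maximal slowing (depth) at the
    boundaries. Returns a duration-stretch factor (1.0 = nominal tempo).
    """
    total = 0.0
    for start, end, depth in phrases:
        if start <= t <= end:
            x = 2.0 * (t - start) / (end - start) - 1.0  # -1..1 across phrase
            total += depth * x * x                        # maximal at edges
    return 1.0 + total
</antml```

Because the contributions add, a note at a boundary shared by several nested phrases (e.g., the end of a section) accumulates the slowing of every level at once, which is exactly the final-ritard-like behavior the model aims for.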

    Sundberg et al. [1983; 1989] propose a rule-based system to generate expression from a score based on the surface structure. Each rule in this system looks for a specific local pattern (e.g., a large pitch-leap) and modifies the timing (in this case, inserts a small pause). Rules and parameters were derived in an analysis-by-synthesis paradigm with one expert changing the rules and listening to the results. Later confirmation of the working of the rules was sought in evaluation studies in which listening panels had to rate artificial performances. Van Oosten [1993] has undertaken a re-implementation and a critical evaluation of this system.
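The shape of such a rule can be sketched as below. The leap threshold and micropause length are illustrative placeholders, not Sundberg's published quantities, and a real rule system applies many such rules with tunable weights.

```python
def apply_leap_rule(notes, leap_threshold=7, micropause=0.02):
    """One Sundberg-style surface rule: scan for a local pattern (a pitch
    leap of at least leap_threshold semitones) and modify the timing
    (lengthen the note before the leap by a micropause).

    notes: list of dicts with 'pitch' (MIDI note number) and
    'duration' (seconds). Returns the modified durations.
    """
    durations = [n['duration'] for n in notes]
    for i in range(len(notes) - 1):
        if abs(notes[i + 1]['pitch'] - notes[i]['pitch']) >= leap_threshold:
            durations[i] += micropause  # insert pause before the leap
    return durations
```

Each rule of this kind is purely local and score-driven, which is what makes the system a surface-structure model in the terms of this paper.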

    Note that in the literature no generative models, nor systematic studies, are found on the relationship between rhythmic or figural structure and expressive timing, even though this source seems to account for a large proportion of the variance in the expressive signal [Drake & Palmer, 1993]. Only Johnson [1991] presents a model that directly links rhythmic patterns (series of score durations) to expressive profiles, but this rather technical study did not have a follow-up.

     


    3 Optimization Method


    A good solution to this problem can be obtained by fitting the performance data at once to the combined outputs of all models, i.e. optimizing the parameter settings for all models at the same time. The prerequisites for this approach are: i) appropriate rules for combining the contributions of the individual models can be formalized; ii) the models form a more or less complete set that can in principle explain most of the variance in expressive timing (no structure is overlooked); iii) the way in which each model calculates its expressive component is more or less valid; iv) proper optimization methods exist for fitting the outcome of the combined models to the data; and v) an error measure can be defined that expresses the difference between generated and real performance.
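Prerequisites i), iv) and v) can be made concrete in a small sketch. Here each component model is a function from its parameters to a per-note tempo-factor profile, the combination rule is multiplicative, the error measure is summed squared error, and the optimizer is a crude exhaustive search over candidate parameter settings; all of these are our illustrative assumptions, and any general-purpose optimizer could replace the search.

```python
def combined_profile(params, models, n):
    """Combine the per-note tempo factors contributed by each model
    by multiplication (the assumed combination rule)."""
    profile = [1.0] * n
    for model, p in zip(models, params):
        for i, factor in enumerate(model(p, n)):
            profile[i] *= factor
    return profile

def squared_error(generated, measured):
    """Error measure between a generated and a measured profile."""
    return sum((g - m) ** 2 for g, m in zip(generated, measured))

def fit(models, measured, candidate_params):
    """Jointly optimize all models' parameters at once by trying each
    candidate joint setting and keeping the best-fitting one."""
    best, best_err = None, float('inf')
    for params in candidate_params:
        err = squared_error(
            combined_profile(params, models, len(measured)), measured)
        if err < best_err:
            best, best_err = params, err
    return best, best_err
```

The essential point is that the error is computed on the combined output, so each model's parameters are estimated in the presence of the trends contributed by the other structural components, rather than being confounded by them.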

     


    4 Discussion and Final remarks


    Research in expression can make a major step forward when the proposed method succeeds. Only then can an empirical basis for the individual generative models emerge. The results will make it possible to estimate their validity and relative importance based on the amounts of variance that is explained by the respective models. Furthermore, the results will effectively establish a more general model that subsumes the known ones and formalize their coherence.

    Given a successful fit of performance data, the expressive signal can be effectively decomposed into its separate components and can be manipulated in musically plausible ways (e.g., an exaggeration of only the rubato linked to the phrase structure). This type of transformation was already available in the calculus for expression [Desain & Honing, 1991], but because a separation of expression into its components was not available, only one structural description could be used at a time and transformations to interacting types of expression could not be handled. Apart from applications in the music studio, this new method will allow the construction of stimuli in which the expression of a musically sensible fragment of music is manipulated. Most psychological research in music performance uses global measurement methods (e.g., average performance data and correlation measures of timing profiles of whole pieces). Recently, though, there has been a tendency to take a more structural approach (cf. Repp’s latest work); the separation technique will allow the experimental design to focus on one structural aspect, while still using relatively rich musical fragments.

    When the method is proven to work well, the decomposition of expression yields a good and economical encoding. For example, instead of individual tempo measurements per note, one now only has to store a set of parameters for a systematic tempo profile that will be repeated per measure, plus parameters for the phrase-level rubato, and so on. This economy of encoding, when successful, can be interpreted as evidence for the mental representation of that specific structural aspect. The alternative mental representations can be investigated directly by redoing the decomposition analysis using different structural descriptions, like incompatible phrase structures, different levels of a metric hierarchy, different spans of figural coding, etc. The descriptions that yield the best decomposition of the expressive signal, explaining most of the variance, can be argued to be the best candidates for the mental representations active during the performance. Thus, light can be shed on questions that remain unsolvable in musicology itself, like how many levels of the metrical hierarchy up to hyper-meter are actually activated and exhibit themselves in the performance of skilled pianists. This extends the technique from attributing expression to different known structural sources, towards the inference of structure itself directly from the expressive signal. It will be computationally tractable only when a limited set of possibilities for structural descriptions can be pre-selected on the basis of some criterion, but even then it is a promising direction.
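The economy-of-encoding idea can be illustrated in a few lines: a compact encoding (one profile per measure plus a repeat count, and an optional additive phrase-level component) expands back into a per-note profile on demand. The encoding and names are our illustrative assumptions.

```python
def expand(measure_profile, n_measures, phrase_offsets=None):
    """Reconstruct a per-note tempo-factor profile from a compact
    encoding: a within-measure profile repeated n_measures times,
    optionally summed with an additive phrase-level component
    (one offset per note)."""
    profile = measure_profile * n_measures
    if phrase_offsets:
        profile = [p + o for p, o in zip(profile, phrase_offsets)]
    return profile
```

Storing the within-measure profile once instead of per note is exactly the saving that, when the reconstruction error stays small, counts as evidence for that structural description.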

     


    References


    [Clynes, 1983] Manfred Clynes. Expressive Microstructure in Music, Linked to Living Qualities. In: Studies of Music Performance, edited by J. Sundberg. Stockholm: Royal Swedish Academy of Music, No. 39, 1983.

    [Clynes, 1987] Manfred Clynes. What can a musician learn about music performance from newly discovered microstructure principles (PM and PAS)? In A. Gabrielson (ed.) Action and Perception in Rhythm and Music, Royal Swedish Academy of Music, No. 55, 1987.

    [Desain and Honing, 1991] Peter Desain and Henkjan Honing. Towards a calculus for expressive timing in music. Computers in Music Research, 3, 43-120, 1991.

    [Desain and Honing, 1992] Peter Desain and Henkjan Honing. Music, Mind and Machine: Studies in Computer Music, Music Cognition and Artificial Intelligence. Amsterdam: Thesis Publishers, 1992.

    [Drake and Palmer, 1993] Carolyn Drake and Caroline Palmer. Accent Structures in Music Performance. Music Perception. 10 (3) 343-378, 1993.

    [Johnson, 1991] Margaret Johnson. Toward an Expert System for Expressive Musical Performance. IEEE Computer, 24(7), 30-34, 1991.

    [Oosten, 1993] Peter van Oosten. A Critical Study of Sundberg’s Rules for Expression in the Performance of Melodies. Contemporary Music Review, 9, 267-274, 1993.

    [Repp, 1990] Bruno Repp. Patterns of expressive timing in performances of a Beethoven minuet by nineteen famous pianists. Journal of the Acoustical Society of America, 88, 622-641, 1990.

    [Shaffer et al., 1985] Henry Shaffer, Eric Clarke, and Neil Todd. Metre and rhythm in piano playing. Cognition, 20, 1985.

    [Sloboda, 1983] John Sloboda. The communication of musical metre in piano performance. Quarterly Journal of Experimental Psychology, 35, 1983.

    [Sundberg et al., 1983] Johan Sundberg, Anders Askenfelt and Lars Frydén. Musical Performance: A Synthesis-by-Rule Approach. Computer Music Journal, 7(1), 1983.

    [Sundberg et al., 1989] Johan Sundberg, Anders Friberg and Lars Frydén. Rules for Automated Performance of Ensemble Music. Contemporary Music Review, 3, 1989.

    [Todd, 1985] Neil P. M. Todd. A model of expressive timing in tonal music. Music Perception, 3, 1985.

    [Todd, 1989] Neil P. M. Todd. A Computational Model of Rubato. In "Music, Mind and Structure", edited by E. Clarke and S. Emmerson. Contemporary Music Review 3(1), 1989.

    [Todd, 1992] Neil P. M. Todd. The dynamics of dynamics: a model of musical expression. Journal of the Acoustical Society of America. 91(6), 3540-3550,1992.