Proceedings of the 24th International Conference on Digital Audio Effects (DAFx20in21), Vienna, Austria, September 8-10, 2021

GRAPH-BASED AUDIO LOOPING AND GRANULATION

Gerard Roma, Pierre Alexandre Tremblay and Owen Green

CeReNeM
University of Huddersfield
Huddersfield, UK

[email protected]

ABSTRACT

In this paper we describe similarity graphs computed from time-frequency analysis as a guide for audio playback, with the aim of extending the content of fixed recordings in creative applications. We explain the creation of the graph from the distance between spectral frames, as well as several features computed from the graph, such as methods for onset detection, beat detection, and cluster analysis. Several playback algorithms can be devised based on conditional pruning of the graph using these methods. We describe examples for looping, granulation, and automatic montage.

1. INTRODUCTION

Short sound recordings, around the length of words and sentences, are used in the creative stages of many forms of audio and music production. Extending these snippets in time has many applications, such as creating different musical gestures, or simulating realistic sound textures and sound effects in cinema or video games. The general idea of extending sounds in time is almost as old as recording technology and can be traced back to tape loops and early approaches to granulation [1].

In the general case, the possibilities are not limited to a fixed objective such as the synthesis of a known sound, but very often emerge from the qualities of the sample at hand. Thus, automatic audio analysis can be used to leverage the inner structure, textures or objects in a given recording in an interactive setting. In popular platforms such as digital audio workstations and audio editing suites, it has become common to conflate the reading of a sound file with a time-frequency analysis that enables playback capabilities such as time scale modification.

One particular possibility of time-frequency analysis is computing similarity between frames. This has been extensively exploited by concatenative synthesis, either guided by some target sequence [2] or by interactive exploration of a corpus of grains or short sounds [3]. On the other hand, content-based structural analysis has been used most often in the context of music information retrieval (MIR). In this paper, we explore using content-based structural analysis of audio to facilitate different playback algorithms that can be used to extend the content of a recorded sound in time. In particular, we propose using networks of similarity between different points in the spectrogram. We describe three algorithms: one for automatic looping, one for granular synthesis and one for automatic montage, and provide implementations for the Max and SuperCollider environments.

In the next section we review existing work related to the proposed approach. In Section 3, we detail the analysis framework. Section 4 describes the playback algorithms and their implementation. Some examples of using the proposed algorithms are presented in Section 5. We then conclude and outline future directions.

Copyright: © 2021 Gerard Roma et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

2. RELATED WORK

The idea of concatenating time-frequency frames based on similarity has been extensively used under the framework of corpus-based concatenative synthesis (CBCS) [4]. In CBCS the framework usually considers the audio material as an (ideally large) database of units, with a focus on specifying the resulting sound. Here our focus is in the opposite direction: we aim to obtain different ways of extending short snippets interactively, with an interest in existing gestures and textures, using common paradigms such as looping and granulation. Several works have studied these tasks in similar ways, based on time-frequency analysis and similarity graphs.

The idea of automating the creation of music loops from audio was proposed in [5], where an algorithm that found repetitive sections was presented. This problem is similar to other MIR tasks, such as finding the chorus in a pop song. In practice, however, loops are often devised to create new rhythmic structures even if the original audio does not contain a repetition. A user interface for automatic and semi-automatic loop editing was proposed more recently, using a very basic analysis algorithm [6]. We propose a more detailed algorithm that can be used interactively, addressing both the issue of seamless concatenation points and the possibility of leveraging existing repetitive content.

The proposed algorithms for granulation and montage are related to existing work on texture synthesis, particularly algorithms based on CBCS. Several algorithms were evaluated in [7]. Among these, the Montage Synthesis (MS) algorithm is perhaps the closest to our approach, although its focus is concatenating larger segments for realistic texture synthesis and audio inpainting [8]. Another algorithm for inpainting was presented in [9], based on pruning the similarity graph. Our algorithms are similarly based on graph pruning but, instead of inpainting missing audio fragments with realistic reproductions, our focus is on interactive control of real-time playback. With respect to prior work, a particularly novel aspect of our approach is the use of spectral clustering of the graph to identify regions of similar sounds. This allows random navigation of the similarity graph to provide some variation, while retaining some stability in the qualities of the resulting texture.

Our approach is also similar in spirit to the algorithm in [10], in that it allows extending the stationary part of sounds, although here we use a time-frequency concatenative approach instead of
convolution with noise.

3. ANALYSIS

The proposed framework is based on time-frequency analysis using the short-time Fourier transform (STFT). We assume a preliminary analysis step resulting in a static data structure, which is used during real-time playback. From the STFT frames we extract a lower dimensional perceptual representation that is used to construct a self-similarity matrix (SSM). From this matrix we can extract some basic functions, such as an onset detection function and the beat spectrum. The SSM is then thresholded into a recurrence plot, which is interpreted as the adjacency matrix of the similarity graph.

3.1. Feature extraction

The STFT of the signal is given by

X(m, k) = \sum_{n=0}^{N-1} x(n + mH) w(n) e^{-j 2\pi k n / N},    (1)

where n is the sample index, m is the spectral frame index, H is the hop size, k is the frequency bin index, and w is a window function, such as the Hann window. In order to compute the similarity between spectral frames we need a low-dimensional representation that relates to human perception. While many descriptors have been used in the concatenative synthesis literature to encode different perceptual features, we are interested in a general representation that can be used to assess the similarity between audio spectra regardless of their content. We use the Mel filterbank, which is one of the most widely used representations for audio:

M(m, f) = \sum_{k=0}^{N/2} M_{fb}(k, f) |X(m, k)|,    (2)

where M_{fb} is a matrix of F triangular filters scaled along the Mel frequency scale. The number of filters can be tuned to the required resolution (along with the size of the FFT window) in the analysis stage, depending on the sound. Figure 1 shows the Mel spectrogram of a drum loop which is used throughout this section.

Figure 1: Mel spectrogram of a drum loop.

3.2. Self-similarity matrix

In order to compute the similarity between two frames, M_i and M_j, in the Mel spectrogram, we use the Jensen-Shannon (JS) distance:

D_{JS}(M_i, M_j) = \left( \frac{1}{2} D_{KL}(\hat{M}_i \| \hat{M}_k) + \frac{1}{2} D_{KL}(\hat{M}_j \| \hat{M}_k) \right)^{1/2},    (3)

where

\hat{M}_k = \frac{1}{2} (\hat{M}_i + \hat{M}_j),    (4)

\hat{M}_x = \frac{M_x}{\sum M_x},    (5)

and D_{KL} is the Kullback-Leibler (KL) divergence:

D_{KL}(P \| Q) = \sum_x P(x) \log(P(x) / Q(x)).    (6)

The KL divergence is frequently used for audio features. In particular, it was found to perform well in early concatenative synthesis experiments [11]. The JS distance provides a symmetric version with all the properties of a metric, and is conveniently bounded between 0 and 1 [12]. In initial experiments, we found this distance resulted in similar visual patterns as the cosine distance used in [13]. Like the cosine distance, it is normalised with respect to the magnitude of each frame (here in order to represent a probability distribution). At the same time, following subjective assessment, we found it would give better perceptual results when used with a threshold to allow concatenation of frames from different locations with low distance.

The similarity between two spectral frames is then computed as:

SSM(i, j) = 1 - D_{JS}(M_i, M_j).    (7)

Eq. 7 defines a self-similarity matrix that shows some patterns in the sound [13]. An example is shown in Figure 2.

Figure 2: Self-similarity matrix.

3.3. Useful measures from the SSM

The main diagonal of the SSM is simply the similarity of each frame to itself.
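To make the analysis concrete, the pipeline of Eqs. 1-7 can be condensed into a short sketch. The following is a minimal NumPy illustration, not the authors' implementation: the function names, window settings and the simplified filterbank construction are our own assumptions.

```python
import numpy as np

def stft_mag(x, n_fft=1024, hop=512):
    """Magnitude STFT (Eq. 1) with a Hann window."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[m * hop:m * hop + n_fft] * w for m in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))          # (frames, n_fft // 2 + 1)

def mel_filterbank(n_mels, n_bins, sr=44100):
    """F triangular filters scaled along the Mel frequency scale (Eq. 2)."""
    hz2mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel2hz(np.linspace(0.0, hz2mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_bins - 1) * edges / (sr / 2.0)).astype(int)
    fb = np.zeros((n_mels, n_bins))
    for f in range(n_mels):
        lo, mid, hi = bins[f], bins[f + 1], bins[f + 2]
        if mid > lo:
            fb[f, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        if hi > mid:
            fb[f, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)
    return fb

def js_ssm(M, eps=1e-12):
    """SSM(i, j) = 1 - D_JS(M_i, M_j) (Eqs. 3-7), with base-2 logs so
    that the distance is bounded between 0 and 1."""
    P = M / (M.sum(axis=1, keepdims=True) + eps) + eps  # Eq. 5: frames as distributions
    ssm = np.zeros((len(P), len(P)))
    for i in range(len(P)):
        K = 0.5 * (P[i] + P)                             # Eq. 4: midpoint distributions
        kl_i = np.sum(P[i] * np.log2(P[i] / K), axis=1)  # Eq. 6: D_KL(P_i || K)
        kl_j = np.sum(P * np.log2(P / K), axis=1)        # Eq. 6: D_KL(P_j || K)
        d_js = np.sqrt(np.clip(0.5 * kl_i + 0.5 * kl_j, 0.0, None))  # Eq. 3
        ssm[i] = 1.0 - d_js                              # Eq. 7
    return ssm

# Demo on one second of a 220 Hz sine: every frame has essentially the
# same magnitude spectrum, so the whole matrix should be close to 1.
sr = 44100
x = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
mag = stft_mag(x)
mel = mag @ mel_filterbank(40, mag.shape[1]).T           # Mel spectrogram (frames, 40)
ssm = js_ssm(mel)
```

Thresholding this matrix then yields the recurrence plot of Section 3.4, e.g. `rp = ssm >= 1.0 - 0.2` for a distance threshold of 0.2.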
The superdiagonal represents the similarity of each frame to the next. Thus, it can be used to obtain an onset detection function ODF(i) = 1 - SSM(i, i + 1). The advantage of using the same distance measure throughout is that it is consistent with the rest of the framework and the threshold used for the similarity, which we use as concatenation cost. As such, onsets will normally signal a disruption in the connections corresponding to successive frames. However, since the distance is normalised, it can be noisy at low amplitude values. This is common in other onset detection functions [14]. We post-process the ODF by removing a median-filtered version and clipping it below zero. We also remove peaks that are closer than 100ms to a previous peak. Figure 3 shows an example of the post-processed ODF.

Figure 3: Onset detection function from the SSM diagonal.

Another useful measure that can be obtained from the SSM is the beat spectrum, defined as the sum of diagonals [13] (Figure 4):

BS(l) = \frac{1}{I - l} \sum_{i=0}^{I-l-1} SSM(i, i + l),    (8)

where I is the total number of frames. The sample at BS(l) is the average similarity between frames separated by lag l. Thus, peaks represent periodicities in the original audio, which can be used to automatically find loop points that capture existing rhythms. In practice, the prominence of peaks depends on whether the audio is repetitive or not. For rhythmic audio, the peaks are very clear and are often found at durations that are multiples of the same basic beat. For short recordings we pick the earliest of the k highest peaks of the first half of the beat spectrum (typically with a value of k = 3), in order to obtain a small quantisation beat.

Figure 4: Beat spectrum.

3.4. Recurrence plot

Finally, the SSM is thresholded to obtain the recurrence plot (RP) (Figure 5):

RP(i, j) = \begin{cases} 1, & \text{if } SSM(i, j) \geq \epsilon \\ 0, & \text{otherwise} \end{cases}    (9)

where \epsilon is a threshold parameter. The dots in RP define combinations of spectral frames with high similarity. The recurrence plot can also be seen as the adjacency matrix of a similarity graph, which we also use to define potential transitions for playback. A circular graph layout such as in Figure 6 provides a more intuitive representation of the graph as a transition network (here a high threshold was used to generate fewer links for illustration purposes). Nodes in the graph represent spectral frames. The circular layout corresponds to the original order (thus, here, implying a general loop). The rest of the links can be thought of as 'wormholes', or shortcuts, which provide potential alternative playback paths. The graph could also be defined through a nearest neighbours algorithm. However, here the parameter \epsilon can be seen as a constant bound for the transition cost and is also a useful tradeoff: a higher value will create a sparser RP with fewer transitions. A low threshold will allow more transitions but with higher potential for perceived discontinuity.

Figure 5: Recurrence plot.

Figure 6: Circular view of the graph as a transition network.

3.5. Spectral clustering

The graph obtained in the previous section can also be used to cluster the frames via spectral clustering [15]. In order to preserve the information about similarities, we use a weighted version of the adjacency matrix RP:

\hat{RP}(i, j) = RP(i, j) \odot SSM(i, j),    (10)

where \odot is the element-wise product. The weights in \hat{RP} are the similarities of each frame with the others above the given threshold. The degree of frame i is the sum of the weights,

d_i = \sum_{j=0}^{I-1} \hat{RP}(i, j).    (11)

The symmetric normalised Laplacian is then

L_{sym} = D^{-1/2} L D^{-1/2},    (12)

where D is the diagonal matrix formed with the degrees of the frames d_0 ... d_{I-1} along the diagonal, and L = D - \hat{RP} is the graph Laplacian. We compute the first l eigenvectors and eigenvalues of L_{sym} (where l is the maximum number of clusters). The number of clusters c can be specified manually as a parameter, or automatically estimated by the eigengap method [15], by looking at the largest gap between consecutive eigenvalues up to l. The clusters are then obtained by running the k-means algorithm on the first c eigenvectors (here used as features for each of the spectral frames and row-normalised) [16].

4. GRAPH-BASED AUDIO PLAYBACK

As described in the previous sections, in this paper we propose using the similarity graph as a general mechanism for audio playback. The general principle is to move a playback head over spectral frames, selected by traversing the graph, which are then synthesised via the inverse STFT. Depending on the sound and the chosen threshold, there may still be many links. Several algorithms are possible based on different ways of pruning the graph. We provide three examples focusing on looping, granulation and automatic montage. It is worth noting that pruning can be simply implemented by removing rows and columns in \hat{RP}. Given its simplicity, this process could eventually be presented as a user interface.

The three algorithms are implemented as externals for the Max and SuperCollider languages, using the Fluid Corpus Manipulation Toolbox [17, 18]. The implementation can be obtained from GitHub.¹ All three algorithms include an initial analysis step in which the Mel spectrogram and similarity graph are computed. The graph is then pruned using different strategies for each object. After the analysis, the playback guided by the graph can be controlled in real time. We now describe each of the algorithms in more detail.

¹ https://github.com/flucoma/graph_loop_grain

4.1. Looping

Looping is used in many musical genres, often based on repetition and rhythm. The choice of a looping region may be influenced by existing periodicities in the audio, although it is also possible that there are none, or that new rhythms are created by the loop itself. The choice of the looping region can thus be seen as a dialogue between the user and the audio content. We propose to represent this dialogue as a search process: the user makes an initial query of loop start and end points, and the system proposes an alternative set of start and end points based on the audio content.

4.1.1. Analysis

In the analysis phase, \hat{RP} is computed using a threshold parameter.² A quantised mode is provided as an option that will use the earliest detected peak in the beat spectrum to remove links that are not multiples of the detected beat. Since this may result in a very small number of links, all multiples of the beat that start in a frame labelled as an onset are added to \hat{RP} in the quantised mode. Onsets are detected as peaks in the ODF defined in Section 3.3. We then fit a KD-Tree index [19] to the 2D points in the upper triangle of the symmetric matrix \hat{RP} (Figure 5).

² Note that in all the implemented objects, the threshold parameter is defined over distance instead of similarity.

4.1.2. Playback

During playback the play head generally follows the original order of frames, and jumps at the loop positions. The user can specify a query point in real time. Every query point (ls0, le0) is represented in the same 2D space as the indexed transitions, so the system returns the nearest neighbour (ls1, le1). This can be useful for simplifying the selection of loop points for beginners, or for more sophisticated loop-based playback by modulating the query point.

4.2. Granulation

Granular synthesis is often used to create textures and tones, often using some random variation of parameters. Here, we are constrained to the STFT framework with respect to some of the grain parameters, but some variation is obtained by random navigation of the graph.

4.2.1. Analysis

For the granulation algorithm, the analysis phase includes spectral clustering of the graph as described in Section 3.5. The graph is then pruned to remove links between different clusters. The number of clusters thus controls the definition of the sound: if the number of clusters is small, the resulting texture will have higher variation. The number of clusters can be specified as a parameter, or an automatic value can be used via the eigengap method. In practice this method can end up returning too few clusters if the graph is fully connected, so we use double the value returned by the eigengap method as the choice for the automatic mode. In our implementation the maximum number of clusters is empirically set to 50, and clusters smaller than 10 frames are merged into the cluster with the nearest centroid.

In addition to clustering, we remove all links to onset frames (again using the ODF derived from the SSM) to promote continuity and avoid stuttering effects commonly found in granulation. Since the analysis is based on overlapping windows, and also because audio around onsets is often loud and spans a broad frequency range, we remove the links to the 2 frames before and after a frame labelled as an onset.

4.2.2. Playback

The main parameters for controlling the granulation are the starting point, which determines the selected cluster, and the threshold. Both are selected in real time, and thus \hat{RP} is computed dynamically from the SSM. The navigation is generally driven by a sorted list of nearest neighbours with respect to the current frame and threshold. This is controlled by two parameters: 'randomness' and 'forgetfulness'. Randomness (0 to 1) defines the number of nearest neighbours as a fraction of the allowed transitions from the current frame. A value of 0 will always select the nearest neighbour. This will create a totally deterministic behaviour, which in general ends in a cycle of frames in the network. Choosing the nearest neighbour is always the smoothest transition. Thus, higher values of randomness will increase variation at the cost of more artefacts in the transitions. Forgetfulness specifies a duration during which already visited transitions are removed from the available transitions. In general, this parameter will encourage exploration of the cluster. When randomness is 0, forgetfulness determines the length of the deterministic loop. In order to reduce the artefacts from the concatenation of random frames, we added the option of synthesising the phase from the magnitude, using the real-time phase gradient heap integration (RTPGHI) algorithm [20].

4.3. Automatic montage

We describe a final algorithm based on a user-specified minimum duration parameter, similar to the MS algorithm in [8].

4.3.1. Analysis

For this algorithm no pruning is used in the analysis phase, which provides more freedom for experimenting with the content of the sound. The algorithm focuses on preserving the shapes of existing sound objects in the recording, so the playback head is allowed to cross onsets and cluster boundaries.

4.3.2. Playback

In real time, the playback head follows the original sequence of frames for a given duration, specified by a minimum duration parameter. After this it jumps randomly to a similar point in the graph. The threshold parameter controls the number of candidate points. As with the granulation algorithm, a 'forgetfulness' parameter controls the predictability of the walk by blacklisting visited transitions for a given duration, but here the effect is not so immediately noticeable. A minimum distance parameter is used to remove transitions to nearby frames, which may create freezing artefacts.

5. EXAMPLES

The different algorithms based on the similarity graph offer a variety of ways to extend sound recordings in time. We now describe some examples created using the implemented objects. The audio files for these and further examples can be found in the companion website for this paper.³

³ https://www.flucoma.org/DAFX-2021/

In the case of looping, we noted that by implementing the operation in the STFT domain using overlap-add synthesis, and on the basis of similarity between the overlapped frames, there are generally no issues with wave discontinuity for musical audio. In addition, by finding areas with similar spectrum at different points in time, the algorithm tends to find natural-sounding phrases. An example is shown in Figure 7. Here, a rough guess was made for a loop at 10% and 20% of the duration of an excerpt of a syncopated drums performance, marked in yellow. This object generates the candidate loop points in the analysis phase and then it can be queried at any time during playback. In this case, the object replied with the points marked in green: a similarity is found in the onsets of two bass drum sounds, which determines a short bass drum and snare phrase. It is worth noting that this was found without using the quantisation option described in Sec. 4.1.1.

Figure 7: Looping example.

With respect to the granulation object, we noted it shares the characteristic sound and artefacts of audio granulation techniques. These were considerably reduced by using high overlap factors (e.g. 4 or higher), and, when using tonal source material, using the RTPGHI phase synthesis. For noisy sources, using phase synthesis will still tend to produce periodicities, so for realistic results it is better to turn it off. With respect to existing granulation techniques, this algorithm offers several particularities. One is the possibility of creating deterministic loops, which can be used to design new sounds. Another one is the ability to 'wander' by using the similarity network. This allows obtaining slowly varying textures. Finally, unlike traditional granulators, this algorithm can create stable textures while making use of different parts of the sound. The stability of the sound is controlled by the number of clusters in the analysis stage. A small number will produce clusters with different pitches and timbres, and wandering behaviour, while a large number will produce small clusters and stable textures and tones. An example is shown in Figure 8. A recording of a music box melody was used as source material with a large (around 100ms) grain duration. The clustering matched parts of the sound with similar spectra (the selected cluster is highlighted in white), and the algorithm produced a stable tone by concatenating similar frames through the random walk and synthesising the phase (Figure 9).

Figure 8: Granulation source material (recording of music box).

Figure 9: Granulation example.

Trying the montage object, we found it could be useful both for articulating gestures found in recordings for music and sound design, and for synthesising more realistic textures, particularly resulting from aggregation of random events. Figures 10 and 11 show an example where a short recording of applause is extended in time, again using 100ms windows.

Figure 10: Applause recording fragment.

Figure 11: Texture generated using the montage object with applause.

In all objects, low values (around 0.2) of the threshold (defined over distance) provided a good compromise for a sufficient number of seamless transitions. This also depends on the choice of the number of Mel bands. A high threshold value will result in more noticeable transition artefacts, while a low value can result in freezing effects created by a repetition of short sequences. In the montage algorithm, this can be prevented by the minimum distance parameter, while in the granulation algorithm, low values can still be used to produce artificial sounds.

6. CONCLUSIONS AND FUTURE WORK

Alternative playback mechanisms for short audio excerpts are generally useful for music and sound design. In this paper, we have
For the first case, synthesising the phase from the real-time concatenated sequence [8] Seán O’Leary and Axel Röbel, “A montage approach to of magnitude frames has proven useful for creating continuous sound texture synthesis,” IEEE/ACM Transactions on Au- pitched sounds based on existing material. Noisy textures can also dio, Speech, and Language Processing, vol. 24, no. 6, pp. be synthesised with both the granulation and the montage algo- 1094–1105, 2016. rithm. When using longer sequences of the original sound (in the [9] Nathanael Perraudin, Nicki Holighaus, Piotr Majdak, and Pe- looping and montage algorithms), the graph helps creating seam- ter Balazs, “Inpainting of long audio segments with similar- less transitions, while leveraging existing patterns and gestures in ity graphs,” IEEE/ACM Transactions on Audio, Speech, and the source audio. Language Processing, vol. 26, no. 6, pp. 1083–1094, 2018. One limitation of the proposed framework is the cost of the JS distance. In this paper, our focus is on extending short snippets, [10] Vesa Välimäki, Jussi Rämö, and Fabián Esqueda, “Creating but for longer recordings computing the full SSM using the JS endless sounds,” in Proc. 21st Int. Conf. Digital Audio Effects distance is unfeasible. This could be improved by computing an (DAFx-18), Aveiro, Portugal, 2018, pp. 32–39. initial guess of the graph with a fast nearest neighbours algorithm as in [9]. In general, the effect of the distance measure, along [11] Esther Klabbers and Raymond Veldhuis, “On the reduction with other aspects such as the number of clusters in the granulation of concatenation artefacts in diphone synthesis,” in Fifth algorithm, could benefit from formal evaluation through listening International Conference on Spoken Language Processing, tests. 1998. 
Many other algorithms could be devised on the basis of differ- [12] Jianhua Lin, “Divergence measures based on the shannon ent strategies for pruning the similarity graph, as well as biasing entropy,” IEEE Transactions on Information theory, vol. 37, random or deterministic navigation. In future work, we plan to in- no. 1, pp. 145–151, 1991. vestigate more open user interfaces that allow musicians and sound designers to develop their own playback sequences and algorithms [13] Jonathan Foote and Shingo Uchihashi, “The beat spectrum: in musical creative coding environments. A new approach to rhythm analysis,” in Proceedings of the IEEE International Conference on Multimedia and Expo, 7. ACKNOWLEDGMENTS 2001., 2001. [14] Juan Pablo Bello, Laurent Daudet, Samer Abdallah, Chris This project has received funding from the European Research Duxbury, Mike Davies, and Mark B Sandler, “A tutorial Council (ERC) under the European Union’s Horizon 2020 research on onset detection in music signals,” IEEE Transactions on and innovation programme (grant agreement no. 725899). speech and audio processing, vol. 13, no. 5, pp. 1035–1047, 2005. 8. REFERENCES [15] Ulrike Von Luxburg, “A tutorial on spectral clustering,” [1] Curtis Roads, Microsound, MIT press, 2004. Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007. [2] Aymeric Zils and François Pachet, “Musical Mosaicing,” in [16] Andrew Ng, Michael Jordan, and Yair Weiss, “On spectral Proceedings of the 2001 Conference on Digital Audio Effects clustering: Analysis and an algorithm,” Advances in neural (DaFx), 2001. information processing systems, vol. 14, pp. 849–856, 2001. 
[3] Diemo Schwarz, Grégory Beller, Bruno Verbrugghe, and [17] Pierre Alexandre Tremblay, Owen Green, Gerard Roma, Sam Britton, “Real-Time Corpus-Based Concatenative Syn- and Alex Harker, “From collections to corpora: Exploring thesis with CataRT,” in Proceedings of the International sounds through fluid decomposition,” in Proceedings of the Conference on Digital Audio Effects (DAFx), 2006. International Conference on Computer Music (ICMC), 2019. [4] Diemo Schwarz, “Corpus-based concatenative synthesis,” IEEE signal processing magazine, vol. 24, no. 2, pp. 92–104, [18] Pierre Alexandre Tremblay, Gerard Roma, and Owen Green, 2007. “Digging it: Programmatic data mining as musicking,” in Proceedings of the International Conference on Computer [5] Bee Suan Ong and Sebastian Streich, “Music loop extraction Music (ICMC), 2021. from digital audio signals,” in 2008 IEEE International Con- ference on Multimedia and Expo. IEEE, 2008, pp. 681–684. [19] Jon Louis Bentley, “Multidimensional binary search trees [6] Zhengshan Shi and Gautham J Mysore, “Loopmaker: Auto- used for associative searching,” Communications of the matic creation of music loops from pre-recorded music,” in ACM, vol. 18, no. 9, pp. 509–517, 1975. Proceedings of the 2018 CHI Conference on Human Factors [20] Zdeněk Průša and Peter L. Søndergaard, “Real-Time Spec- in Computing Systems, 2018, pp. 1–6. trogram Inversion Using Phase Gradient Heap Integration,” [7] Diemo Schwarz, Axel Roebel, Chunghsin Yeh, and Amaury in Proceedings of the 2016 International Conference on Dig- LaBurthe, “Concatenative sound texture synthesis methods ital Audio Effects (DAFx-16), 2016, pp. 17–21. DAFx.7 259