Automatic sound generation through object detection and physical modelling
Andrea Corcuera
Copenhagen, Denmark
2017
The sounds generated by the objects that surround us are intrinsic to our lives. We associate specific sounds with objects, their characteristics, and the actions that generate them, and we expect to hear the corresponding sound when we see a given item. Similarly, one expects to see the corresponding object when hearing its sound. For example, on the street, when we hear the characteristic sound of a motor, we know that a car is approaching. There is, therefore, a particular relationship between objects and the sounds they produce. In films, many of these sound effects are added in post-production, a practice called "Foley". In this project, these sound effects are generated automatically based mainly on one characteristic of the objects involved: their material. A system based on an object detector, an impact detector and a sound modeler is presented.
Implementation
In this project, a system that assists the sound designer in the process of adding audio effects to videos was implemented. An overview of the algorithm is given below.
The tool can be divided into three modules: video analysis, data processing and sound synthesis. The input of the system is a silent video file recorded with a static camera, and the output is an audio file that matches the action performed in the video. The video is analyzed with the help of a neural network, which detects the objects present in the movie, i.e., it labels the items in every frame, and this data is sent to the data processing module. After receiving this information, the velocity of a moving object can be computed and, if there is more than one object in the scene, impacts are detected. This information is sent to the sound synthesis module, which generates the specific audio associated with the performed action and material.
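The three-stage structure can be summarized as a simple pipeline. The sketch below is illustrative only: the function names, return shapes and example values are placeholders, not the actual implementation.

```python
# Minimal, self-contained sketch of the three-module pipeline.
# All names and return shapes are illustrative placeholders.
from typing import List, Dict

def detect_objects(video_path: str) -> List[Dict]:
    """Video analysis: would run the object detector on every frame."""
    # Placeholder: one detection = frame index, material label, bounding box.
    return [{"frame": 10, "label": "glass", "bbox": (120, 80, 200, 160)},
            {"frame": 10, "label": "metal", "bbox": (190, 90, 260, 170)}]

def detect_impacts(detections: List[Dict]) -> List[Dict]:
    """Data processing: would track boxes, estimate velocities, find hits."""
    # Placeholder: report an impact whenever two detections share a frame.
    frames: Dict[int, List[Dict]] = {}
    for d in detections:
        frames.setdefault(d["frame"], []).append(d)
    return [{"frame": f, "materials": [d["label"] for d in ds]}
            for f, ds in frames.items() if len(ds) > 1]

def synthesize(impacts: List[Dict]) -> None:
    """Sound synthesis: would render one modal-synthesis hit per impact."""
    for hit in impacts:
        print(f"frame {hit['frame']}: impact between {hit['materials']}")

synthesize(detect_impacts(detect_objects("input.mp4")))
```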
Video analysis
Dataset preparation. The images used to train the network were obtained from the ImageNet database. ImageNet is a large image database comprising more than 14 million images, intended for object classification and detection. The ImageNet [24] labels are taken from a lexical database called WordNet [32], which organizes words (nouns, verbs, adjectives and adverbs) into sets of synonyms called synsets (synonym sets).
The downloaded synsets contain thousands of pictures. However, only a little more than a hundred labeled bounding boxes are available in the dataset. Therefore, the images that did not have corresponding labeled data were discarded. The remaining images were grouped into different classes and labeled as metal, wood, glass or motor. Finally, the bounding box files were converted from the PASCAL VOC (Visual Object Classes) [33] format (an .xml file with the metadata) to the Darknet format (a text file containing only the class and the bounding box coordinates). In addition to the images downloaded from ImageNet, more images were taken, hand-labeled and added to the dataset. Once all the images and their corresponding annotations were available, the train and test files were generated.
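The VOC-to-Darknet conversion is a standard step; a minimal sketch is shown below. The class list and file paths are illustrative, but the output follows the Darknet label format (one line per box: class index plus box center and size, normalized by the image dimensions).

```python
# Convert a PASCAL VOC annotation (.xml) to a Darknet label file (.txt).
# Output lines: <class_id> <x_center> <y_center> <width> <height>, all in [0, 1].
import xml.etree.ElementTree as ET

CLASSES = ["metal", "wood", "glass", "motor"]   # illustrative class order

def voc_to_darknet(xml_path: str, txt_path: str) -> None:
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        cls = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        x_c = (xmin + xmax) / 2.0 / img_w          # normalized box center
        y_c = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w                  # normalized box size
        h = (ymax - ymin) / img_h
        lines.append(f"{cls} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```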
Data augmentation. To increase the size of the dataset and make the model more robust, all the pictures were processed before creating the train and test files: Gaussian noise was added, and the brightness was reduced by factors of 2 and 3 and increased by a factor of 1.5. In this way the number of images in every class was increased. In total, 23293 images were used for training and 5950 for testing.
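As a rough illustration of these augmentations (the noise standard deviation and the exact brightness operation are assumptions, since only the scaling factors are stated), one could write:

```python
# Illustrative data augmentation: additive Gaussian noise plus brightness scaling.
# The noise level is an assumption; the brightness factors follow the text.
import numpy as np
import cv2

def augment(image_path: str) -> dict:
    img = cv2.imread(image_path).astype(np.float32)
    noisy = img + np.random.normal(0.0, 10.0, img.shape)   # additive Gaussian noise
    variants = {
        "noise":   noisy,
        "dark_x2": img / 2.0,    # brightness reduced by a factor of 2
        "dark_x3": img / 3.0,    # brightness reduced by a factor of 3
        "bright":  img * 1.5,    # brightness increased by a factor of 1.5
    }
    return {name: np.clip(v, 0, 255).astype(np.uint8) for name, v in variants.items()}
```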
YOLO object detection system. YOLO (You Only Look Once) [34] is an open-source, state-of-the-art object detector that is able to process video in real time. It is implemented in Darknet [35], a neural network framework developed by Joseph Redmon. This detector is used to obtain the bounding boxes and the classes of the objects in the video.
The network was trained on the customized dataset of images, with a learning rate of 10^-3, a momentum of 0.9 and the pretrained weights provided by the author of the model [35]. The training was done on a computer running Ubuntu 16.04 with a Titan X graphics card, and it took 3 days to complete 80000 epochs. The output classifies the detected objects into 4 classes: wood, glass, metal and engine.
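A training run of this kind is typically launched through Darknet's command-line interface. The sketch below, which writes the data file and calls the `detector train` subcommand from Python, is only illustrative: the file names, paths and pretrained-weights file are assumptions, and the learning rate and momentum are set inside the .cfg file rather than here.

```python
# Hypothetical way to launch the Darknet training run from Python.
# File names, paths and the pretrained weights file are assumptions.
import subprocess
import textwrap

# Darknet ".data" file: number of classes, dataset splits, class names, backup dir.
with open("data/materials.data", "w") as f:
    f.write(textwrap.dedent("""\
        classes = 4
        train   = data/train.txt
        valid   = data/test.txt
        names   = data/materials.names
        backup  = backup/
    """))

# Learning rate (0.001) and momentum (0.9) are configured inside the .cfg file.
subprocess.run(
    ["./darknet", "detector", "train",
     "data/materials.data", "cfg/yolo-materials.cfg", "darknet19_448.conv.23"],
    check=True,
)
```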
Impact detection. If two objects are detected in the video, the file is processed a second time in order to determine whether an impact has occurred between them. The first step detects collisions between the two items. This is done by verifying that the two bounding boxes overlap or adjoin. However, the bounding boxes are axis-aligned, which is problematic if the objects are not facing the desired direction or if their shape differs from a rectangle. Therefore, verifying that the bounding boxes overlap is not enough to guarantee that the two objects have touched.
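An overlap-or-adjacency test for two axis-aligned boxes can be written as follows (a minimal sketch; the (xmin, ymin, xmax, ymax) box representation and the pixel tolerance are assumptions):

```python
# Check whether two axis-aligned bounding boxes overlap or are adjoining.
# Boxes are (xmin, ymin, xmax, ymax); `tol` lets boxes that merely touch
# within a few pixels count as a potential contact.

def boxes_touch(a, b, tol: float = 2.0) -> bool:
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # The boxes are separated if one lies entirely to the left/right/above/below
    # the other by more than the tolerance.
    separated = (ax1 + tol < bx0 or bx1 + tol < ax0 or
                 ay1 + tol < by0 or by1 + tol < ay0)
    return not separated

# Example: two boxes sharing an edge count as touching.
print(boxes_touch((0, 0, 10, 10), (10, 0, 20, 10)))   # True
```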
To ensure that one object has hit the other one, the velocities are additionally checked.
Once the velocity vector is obtained, peaks in the array are detected. When the velocity reaches its maximum and then drops or changes its sign (i.e., the object bounces in the opposite direction), there is a possibility that the object has hit something. If both conditions are fulfilled, the system determines that an impact has occurred. The frame in which the impact was detected, as well as its location, is sent to the sound synthesis module. The location of the impact corresponds to the point where the two bounding boxes overlap or touch.
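Combining the two conditions could look roughly like the sketch below; the velocity estimate from box-centre differences and the exact peak criterion are assumptions about details the text leaves open.

```python
# Detect candidate impact frames from a per-frame velocity estimate.
# The velocity array would come from box-centre displacements between frames;
# `touching` marks frames in which the two bounding boxes overlap or adjoin.
import numpy as np

def impact_frames(velocity: np.ndarray, touching: np.ndarray) -> list:
    """velocity: 1-D array of signed speeds; touching: bool array per frame."""
    hits = []
    for t in range(1, len(velocity) - 1):
        peak = abs(velocity[t]) >= abs(velocity[t - 1]) and \
               abs(velocity[t]) > abs(velocity[t + 1])           # speed peaks then drops
        flips = np.sign(velocity[t]) != np.sign(velocity[t + 1])  # bounce: sign change
        if touching[t] and (peak or flips):
            hits.append(t)
    return hits

vel = np.array([0.0, 2.0, 5.0, 8.0, -3.0, -1.0, 0.0])   # object falls, then bounces
touch = np.array([False, False, False, True, True, False, False])
print(impact_frames(vel, touch))   # -> [3]
```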
Sound synthesis
The last step of the system makes use of the well-known modal synthesis technique to model the sounds generated by rigid bodies.
The resulting sound depends on many factors, such as the shape and dimensions of the object, the impact velocity or the location of the collision. The literature shows that the perception of the material in impact sounds is mainly based on the frequency-dependent damping of the spectral components (equivalently, the sound decay) and the spectral content of the sound [37]. The parameters associated with the sound of each material (wood, glass and metal) can be extracted experimentally by analyzing real recordings and fitting the model parameters to the recorded sound. This was done for glass, metal and wooden sounds.
Wooden sounds, characterized by a low pitch and rapid decay, have larger decay rates than metal and glass sounds, which are characterized by long decay times. Several signals were tested to generate the excitation of the system: impulses, bursts of noise and the contact model proposed in [2], 1 − cos(2πt/τ) for 0 ≤ t ≤ τ, where τ is the total duration of the contact. However, the sounds generated by these models were not plausible enough, so it was decided to use as excitation the residual obtained from the recordings of the different struck objects by inverse filtering of the main modes.
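As a compact illustration of the modal synthesis step, each mode can be rendered as an exponentially decaying sinusoid and the mode bank driven by an excitation, e.g. the raised-cosine contact pulse from [2]. The mode frequencies, decay rates and amplitudes below are invented placeholders, not the parameters fitted in this project.

```python
# Modal synthesis sketch: a bank of exponentially decaying sinusoids driven
# by an excitation signal. Mode parameters are placeholders.
import numpy as np

SR = 44100

def raised_cosine_contact(tau: float) -> np.ndarray:
    """Contact-force model from [2]: 1 - cos(2*pi*t/tau) for 0 <= t <= tau."""
    t = np.arange(int(tau * SR)) / SR
    return 1.0 - np.cos(2.0 * np.pi * t / tau)

def modal_impulse_response(freqs, decays, amps, duration: float) -> np.ndarray:
    """Sum of damped sinusoids: sum_k a_k * exp(-d_k * t) * sin(2*pi*f_k*t)."""
    t = np.arange(int(duration * SR)) / SR
    out = np.zeros_like(t)
    for f, d, a in zip(freqs, decays, amps):
        out += a * np.exp(-d * t) * np.sin(2.0 * np.pi * f * t)
    return out

# Placeholder "glass-like" modes: bright partials with slow decay.
modes = modal_impulse_response(freqs=[1250, 2780, 4310],
                               decays=[6.0, 9.0, 12.0],
                               amps=[1.0, 0.6, 0.3],
                               duration=1.5)
excitation = raised_cosine_contact(tau=0.002)       # 2 ms contact
sound = np.convolve(excitation, modes)
sound /= np.max(np.abs(sound))                      # normalize before writing to file
```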
Evaluation and conclusion
To assess the quality of the generated sounds and how well they match real objects, a perceptual evaluation was performed.
A total of 4 videos were shown. Three of them were recorded with the rear camera of a Samsung Galaxy S6 in a typical kitchen scenario, and the fourth is a movie clip. In particular, the videos were:
•A knife hitting a glass jar
•A knife hitting a metallic pot
•A glass jar hitting a wooden table
•A clip of a movie in which a man rides a motorcycle
The participants filled in a questionnaire that contained the videos to be studied and links to the audio files. In the first part of the test, only audio stimuli were given to the subjects. They listened to examples of the sounds and were asked to rate their quality and choose the apparent material of the sound that was played. In the second part of the test, the subjects watched the videos with sounds generated by the different models.
A total of 15 participants aged between 24 and 60 volunteered for the evaluation. The subjects rated the synthesized wooden sounds lower in quality than the recorded ones. A t-test found a significant difference (p = 0.03) between the recorded and the synthesized wooden sounds at the 5% significance level (α = 0.05), but not for glass or metal.
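A comparison of this kind corresponds to a standard two-sample t-test on the quality ratings; the sketch below uses invented ratings purely to show the computation, not the data collected in the study.

```python
# Two-sample t-test on quality ratings (illustrative numbers only; the
# actual ratings collected in the study are not reproduced here).
from scipy import stats

recorded_wood = [4, 5, 4, 5, 4, 3, 5, 4, 4, 5, 4, 5, 4, 4, 5]
synth_wood    = [3, 4, 3, 3, 4, 2, 4, 3, 3, 4, 3, 3, 4, 3, 3]

t_stat, p_value = stats.ttest_ind(recorded_wood, synth_wood)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")   # significant if p < 0.05
```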
The second part of the test suggests that people still prefer recorded sounds to synthesized ones. Almost all of the subjects (12 out of 15) preferred the recorded sound for the video of the table. For the engine, the number of volunteers who preferred the recording was 10; for the video of the pot it was 9, and for the glass 8. In the question regarding the match between video and audio, the best results were obtained for the glass sounds.
The worst results were found in the videos with the metallic sound. This may be due to the shape of the object: the sound was identified as metallic in the first part of the test, but it did not fit very well with the object displayed in the video. The generated sound had bright modes with long decays, whereas the recorded sound was noisier. The subjects commented that the timbre of the synthesized sound did not match the video because of the location of the strike. One can therefore consider that the synthesized sound could have worked well if the knife had hit the body of the pot, where a clear metallic sound is expected. However, the pot was struck on its edge, which produces a noisier sound. These results suggest that the synthesized sounds are good enough to be added to real videos, but that the parameter estimation must be improved, since the subjects stated that in some cases neither the synchronization of the hit nor its suitability to the location of the hit was perfect.
In the future, the video analysis must be improved to obtain a more accurate model of the objects or materials involved in the actions. In addition, other networks could be more advantageous, such as one that also performs object segmentation (partitioning the video image into regions). With segmentation we could build a better model of the image and estimate the object's shape, which would lead to better impact detection and a better timbre of the sound. With better object detection, other actions could be identified, such as scraping or rubbing, and other sound synthesis methods, like digital waveguide synthesis, could be used.
References
[1] Wikipedia, "Vitaphone — Wikipedia, the free encyclopedia," 2004. [Online]. Available: https://en.wikipedia.org/wiki/Vitaphone
[2] K. van den Doel, P. G. Kry, and D. K. Pai, "FoleyAutomatic: physically-based sound effects for interactive simulation and animation," in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 2001, pp. 537–544.
[3] D. B. Lloyd, N. Raghuvanshi, and N. K. Govindaraju, "Sound synthesis for impact sounds in video games," in Symposium on Interactive 3D Graphics and Games. ACM, 2011.
[4] C. Zheng and D. L. James, "Rigid-body fracture sound with precomputed soundbanks," ACM Transactions on Graphics (TOG), vol. 29, no. 4, p. 69, 2010.
[5] R. A. Garcia, "Automatic generation of sound synthesis techniques," Ph.D. dissertation, Citeseer, 2001.
[6] J. O. Smith, Spectral Audio Signal Processing. http://ccrma.stanford.edu/~jos/sasp/, online book, 2011 edition.
[7] F. Dunn, W. Hartmann, D. Campbell, and N. Fletcher, Springer Handbook of Acoustics. Springer, 2015.
[8] G. De Poli, "A tutorial on digital sound synthesis techniques," Computer Music Journal, vol. 7, no. 4, pp. 8–26, 1983.
[9] Z. Ren, H. Yeh, and M. C. Lin, "Example-guided physically based modal sound synthesis," ACM Transactions on Graphics (TOG), vol. 32, no. 1, p. 1, 2013.
[10] S. Serafin, "The sound of friction: real-time models, playability and musical applications," Ph.D. dissertation, Stanford University, 2004.
[11] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1. IEEE, 2001, pp. I–I.
[12] M. Brown, D. G. Lowe et al., "Recognising panoramas," in ICCV, vol. 3, 2003, p. 1218.
[13] R. Brunelli, Template Matching Techniques in Computer Vision: Theory and Practice. John Wiley & Sons, 2009.
[14] A. M. McIvor, "Background subtraction techniques," Proc. of Image and Vision Computing, vol. 4, pp. 3099–3104, 2000.
[15] S. S. Beauchemin and J. L. Barron, "The computation of optical flow," ACM Computing Surveys (CSUR), vol. 27, no. 3, pp. 433–466, 1995.
[16] "Convolutional neural networks for visual recognition course," https://cs231n.github.io/neural-networks-1, [Online; accessed May 2017].
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[18] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2. IEEE, 1999, pp. 1150–1157.
[19] R. Parisi, E. Di Claudio, G. Lucarelli, and G. Orlandi, "Car plate recognition by neural networks and image processing," in Proceedings of the 1998 IEEE International Symposium on Circuits and Systems (ISCAS '98), vol. 3. IEEE, 1998, pp. 195–198.
[20] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[21] Y. LeCun, F. J. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 2. IEEE, 2004, pp. II–104.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[27] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[28] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman, "Visually indicated sounds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2405–2413.
[29] Y. Aytar, C. Vondrick, and A. Torralba, "SoundNet: Learning sound representations from unlabeled video," in Advances in Neural Information Processing Systems, 2016, pp. 892–900.
[30] A. Davis, M. Rubinstein, N. Wadhwa, G. J. Mysore, F. Durand, and W. T. Freeman, "The visual microphone: passive recovery of sound from video," 2014.
[31] M. Cardle, S. Brooks, Z. Bar-Joseph, and P. Robinson, "Sound-by-numbers: motion-driven sound synthesis," in Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 2003, pp. 349–356.
[32] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, "Introduction to WordNet: An on-line lexical database," International Journal of Lexicography, vol. 3, no. 4, pp. 235–244, 1990.