(Site under construction)

The Virtual Theremin

Copyright Edward Squires 2001

What is it?

The Theremin is a musical instrument that uses the positions of the performer's hands to create sound. The Virtual Theremin aims to mimic the Theremin using just a computer and a cheap web camera. Using image processing techniques, it tracks the hands of a performer around a picture of a Theremin superimposed on the video stream. The positions of the hands are then converted to sound that mimics the response of the Theremin.

Where can I get it?



New Version 7/03/2002. It is now compatible with any camera (hopefully).

If all goes well, you should now have a running copy of vt.exe. The Virtual Theremin is in beta status.

The files are installed in a folder called VT on your main hard drive (C:\VT).

Minimum requirements

How do I use it?

Initially you will be presented with a black window. This is very boring.

To get things happening, select Capture Device... from the File menu, which should bring up a modal dialog box. Make sure that your web camera is working and select it from the Capture Device drop-down box. Set a video size that lets you achieve a frame rate greater than 20 frames per second, so that the video looks smooth and the lag is reduced.

The play, pause, step and stop buttons at the bottom of this dialog box control the video stream. Press play to watch the video that your camera is capturing. You should see two white squares on the video screen and a picture of the Theremin with its distinctive loop and vertical rod.

Resize the vt window to your liking. Maximise it to fill the screen.

Select Virtual Theremin... from the Settings menu. This displays the Virtual Instrument dialog box. Press the Reset button.

If your camera cannot mirror video, you can select the Flip Video Vertically check box to rotate the video. This makes it much easier to play. Playing the Virtual Theremin without this is for the hardcore.

Now for the difficult bit. Using two people makes it easier, so one can perform and one can calibrate. You should read the section below on limitations first, to avoid headaches.

Make sure the performer is wearing a long-sleeved shirt and that the only large skin-coloured objects in the video scene are the performer's head and hands. Make sure the performer is about three metres from the camera. Results are most reliable if you have a well-lit scene with a plain-coloured background.

Make sure that the performer's hands completely fill the two white boxes and press the calibrate button. The performer should now slowly rotate their hands, clenching and opening their fists if necessary, and rotate their head. The algorithm is very fragile for a short time after it has calibrated and can easily lock onto a similarly coloured object (or lose lock). If this occurs, press the Reset button and try calibrating again. You can view which pixels are being detected by selecting the Show skin coloured pixels check box. This shows detected skin-coloured pixels as gray and white. White pixels are detected as belonging to the skin of the performer. Gray pixels are detection errors. Ideally the performer's head and hands should be completely white and there should be few gray pixels.

The algorithm will mark the performer's head with a black square and the performer's hands with white squares. Once it has learnt enough information about your skin colour, the boxes are placed roughly in the centre of the performer's head and hands. The performer can now freely move about the video scene (providing they remain the same distance from the camera).

To play the Theremin, the left hand controls the volume and the right hand controls the pitch. The closer the left hand is to the volume loop, the softer the volume. The closer the right hand is to the vertical rod, the higher the pitch.
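The mapping from hand position to sound can be sketched roughly as below. This is only an illustration, not the code in vt.exe: the pixel positions of the rod and loop, the frequency range and the exponential pitch sweep are all assumptions made up for the example.

```python
import math

# Hypothetical layout and ranges; none of these values come from vt.exe.
ROD_X = 600.0          # x position of the vertical pitch rod, in pixels
LOOP_X = 40.0          # x position of the volume loop, in pixels
LOOP_Y = 300.0         # y position of the volume loop, in pixels
MAX_DISTANCE = 400.0   # distance at which a hand stops having any effect
LOW_HZ = 110.0         # pitch when the right hand is far from the rod
HIGH_HZ = 1760.0       # pitch when the right hand reaches the rod


def closeness(distance):
    """Return 1.0 at the antenna, falling linearly to 0.0 at MAX_DISTANCE."""
    clamped = min(max(distance, 0.0), MAX_DISTANCE)
    return 1.0 - clamped / MAX_DISTANCE


def pitch_hz(right_hand_x):
    """Closer to the rod gives a higher pitch (assumed exponential sweep)."""
    c = closeness(abs(ROD_X - right_hand_x))
    return LOW_HZ * (HIGH_HZ / LOW_HZ) ** c


def volume(left_hand_x, left_hand_y):
    """Closer to the loop gives a softer volume, from 1.0 down to 0.0."""
    d = math.hypot(left_hand_x - LOOP_X, left_hand_y - LOOP_Y)
    return 1.0 - closeness(d)
```

An exponential sweep is used here because equal hand movements then give roughly equal musical intervals. A linear sweep would work just as well for a first experiment.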


Limitations

The tracking algorithms are not completely robust and certain limitations are placed on the scene and performer.

How does it work? 

(assuming some knowledge of image processing)

The Virtual Theremin requires the performer's hands to be tracked and identified. There are two modes of operation: calibration mode and running mode. During calibration mode a statistical model is built based on the colours of pixels. To do this it samples two regions in the video over several video frames.
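As a rough sketch of what calibration could look like, the example below fits a per-channel mean and standard deviation to the pixels sampled from square regions over several frames. The choice of model, the RGB colour space and the function names are assumptions for illustration only. The actual statistical model used by the Virtual Theremin is not described here.

```python
import math


def sample_region(frame, left, top, size):
    """Collect the (r, g, b) pixels inside a square region of one frame."""
    return [frame[y][x]
            for y in range(top, top + size)
            for x in range(left, left + size)]


def build_colour_model(frames, regions):
    """Fit a per-channel mean and standard deviation to the sampled pixels.

    frames  -- list of frames, each a list of rows of (r, g, b) tuples
    regions -- list of (left, top, size) squares, e.g. the two white boxes
    """
    samples = []
    for frame in frames:
        for left, top, size in regions:
            samples.extend(sample_region(frame, left, top, size))

    n = len(samples)
    means = [sum(p[c] for p in samples) / n for c in range(3)]
    stds = [math.sqrt(sum((p[c] - means[c]) ** 2 for p in samples) / n)
            for c in range(3)]
    return means, stds
```

Sampling over several frames rather than one gives the model some tolerance to camera noise and small changes in lighting.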

Once the Virtual Theremin has gained enough information about the colour of the objects it is tracking, it switches to running mode. In this mode it scans the entire video frame. Each pixel is tested against the statistical model built in calibration mode. A pixel in the frame is marked as either belonging to a 'skin coloured' object or not, to produce a binary image. Connectivity analysis is done on this binary image to identify the three largest blobs in the frame. The centre of each blob is found and marked by a square box. The biggest region is marked by a black square and the two smaller regions are marked by white squares. Once the regions have been found, the statistical model is updated using a priori knowledge of the size of a typical performer's hands and head.
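The thresholding and connectivity analysis steps can be sketched as follows. This is an illustration under assumptions, not the code in virtualinstrument.ax: the colour model is taken to be a per-channel mean and standard deviation, blobs are found with a simple 4-connected flood fill, and the names is_skin, find_blobs and track are made up for the example. The model update step is left out.

```python
from collections import deque


def is_skin(pixel, means, stds, k=2.5, min_std=5.0):
    """True if every channel lies within k standard deviations of the mean."""
    return all(abs(pixel[c] - means[c]) <= k * max(stds[c], min_std)
               for c in range(3))


def find_blobs(mask):
    """Label 4-connected regions of True cells; return (size, cx, cy) tuples."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            size = sum_x = sum_y = 0
            while queue:
                x, y = queue.popleft()
                size += 1
                sum_x += x
                sum_y += y
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # The centre of a blob is the mean position of its pixels.
            blobs.append((size, sum_x / size, sum_y / size))
    return blobs


def track(frame, means, stds):
    """Return (head, hands): the largest blob and the next two largest."""
    mask = [[is_skin(p, means, stds) for p in row] for row in frame]
    largest = sorted(find_blobs(mask), reverse=True)[:3]
    return (largest[0] if largest else None), largest[1:]
```

Taking the biggest blob as the head and the next two as the hands only works if nothing else large and skin coloured is in the scene.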

The software consists of the vt.exe program and a DirectShow (part of DirectX) filter called virtualinstrument.ax. The filter detects the hands and head of the performer in the video. Converting the positions of the performer's hands to sound is done by vt.exe. This way any program can make use of the hand and head detection algorithms of virtualinstrument.ax.

You can change the picture and sound effect if you like. sample.bmp is a 24-bit RGB image and can be any size. sample.wav is a mono WAV file.
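If you want to check replacement files before dropping them in, something like the following sketch may help. The helper names are made up for the example, and the BMP check assumes the common Windows header layout. It only looks at the two properties mentioned above, so passing it does not guarantee that vt.exe will accept the files.

```python
import struct
import wave


def is_mono_wav(path):
    """True if the WAV file at path has exactly one channel."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1


def is_24_bit_bmp(path):
    """True if the file starts with a BMP header declaring 24 bits per pixel."""
    with open(path, "rb") as f:
        header = f.read(30)
    if len(header) < 30 or header[:2] != b"BM":
        return False
    # Assumes a BITMAPINFOHEADER-style header, where bits per pixel is a
    # little-endian 16-bit field at byte offset 28.
    (bits_per_pixel,) = struct.unpack_from("<H", header, 28)
    return bits_per_pixel == 24
```

For example, is_mono_wav("sample.wav") and is_24_bit_bmp("sample.bmp") should both return True for files of the kind described above.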


To be completed.


I am interested to know your thoughts on the Virtual Theremin.

Email: esquire @ ihug. com. au

