This article explains a little about how the timer system works in QEMU, and in particular the changes associated with the new timer system I’ve contributed to QEMU.

What to timers do?

Timers (more precisely QEMUTimers) provide a means of calling a giving routine (a callback) after a time interval has elapsed, passing an opaque pointer to the routine.

The measurement of time can be against one of three clocks:

  • The realtime clock, which runs even when the VM is stopped, with a resolution of 1000 Hz;
  • The virtual clock, which only runs when the VM is running, at a high resolution; and
  • The host clock, which (like the realtime clock) runs even when the VM is stopped, but is sensitive to time changes to the system clock (e.g. NTP).

Timers are single shot, i.e. when they have elapsed, they call their callback and will do nothing further unless they are re-armed. The callback can rearm the timer to provide a repeating timer.

What’s changed in the implementation?

Prior to the timer API change, timers only existed in the main QEMU thread, and were only run from QEMU’s main loop. Expiry of timers was through a system-dependent system of secondary timers called alarm timers. These would call QemuNotify which caused a write to a notifier FD, and caused any poll() in progress to terminate. Throughout QEMU, poll() (or equivalent) would use infinite timeouts (rather than the timeout associated with any timer), and rely on the alarm timers (under POSIX running through signals) to terminate these system calls.

This approach caused a number of problems:

  • Timers were only handled within the main loop. QEMU’s AioContext has an internal loop that runs during block operations and may run for a long time. Timers were not processed during this loop which made writing block device timers that would be guaranteed to run while the block layer was busy hard to do;
  • Timers were fundamentally single threaded, and the existing system was incompatible with plans for additional threading and AioContexts;
  • The system dependent code that implemented alarm timers was not pretty; and
  • The API to call them was messy.

To fix this I contributed a 31 commit patch that:

  • Refactored nearly all of the timer code;
  • Switched to rely on timeouts from ppoll() rather than signals and alarm timers (thus allowing alarm timers to be deleted);
  • Separated the clock sources (QEMUClock) from the lists of active timers (QEMUTimerList);
  • Introduced timer list groups (QEMUTimerListGroup) to allow independent threads or other users of timers to have their own set of timers (each attached to any clock); and
  • Tidied up the API.

You can find the patch set here.

How does the new implementation work?

The diagram below shows the relationship between the new objects (click to enlarge).

QEMU Timer Diagram
QEMU Timer Diagram

There is exactly one QEMUClock for each clock type, representing the clock source with that particular type. Currently there are three clock sources, as outlined above. In the previous implementation, each clock would have a list of timers running against it. However, now the list of running timers is kept in a QEMUTimerList. The clock needs to keep to track of all the QEMUTimerLists attached to that clock, and for that purposes maintains a list of them (purple line on the diagram).

Each user of timers maintains a QEMUTimerListGroup. This is a struct currently consisting solely of an array of QEMUTimerList pointers, one for each clock type. Hence each QEMUTimerListGroup is really a collection of three (currently) QEMUTimerLists (the red line in the diagram). There are two current QEMUTimerListGroup users. A global static, main_loop_tlg, represents the timer lists run from the main loop (i.e. all current timer users). I also added a QEMUTimerListGroup to each AioContext (there currently being just one), so the block layer can use timers that will run without relying on the aio system returning to the main loop; this also permits future threading. Any other subsystem could have its own QEMUTimerListGroup; it merely needs to call timerlistgroup_run_timers at an appropriate time to run any pending timers.

Each QEMUTimerList maintains a pointer to the clock to which it is connected (orange line). It is (as set out above) on a list of QEMUTimerLists for that QEMUClock (the list element). It contains a pointer to the first active timer. The active timers are not maintained through the normal QEMU linked list object, but are instead a simple singly linked list of timers, arranged in order of expiry, with the timer expiring soonest at the start of the list, linked to by the QEMUTimerList object (blue line).

Each QEMUTimer object contains a link back to its timer list (green line) for manipulation of the lists.

Under the hood, the call to g_poll within the AioContext’s poll loop has been replaced with ppoll (the nanosecond equivalent of poll). Unfortunately, glib only provides millisecond resolution, meaning that the main loop’s timing will only be at millisecond accuracy whilst this continues to use glib’s poll routines. Plans are afoot to remedy this.

What’s changed in the API?

The API had become somewhat messy.

Firstly, I’ve replaced the previous timer API (with function names like qemu_mod_timer) with a new API of the form timer_<action> where <action> is the action required, for instance timer_new, timer_mod, and so forth. The timer_new function and friends take an enum value being the clock type, rather than a pointer to a clock object. For instance this:

timer = qemu_new_timer_ns (vm_clock, cb, opaque);

becomes this:

timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, cb, opaque);

Timers not using the main loop QEMUTimerListGroup can be created using timer_new_tl. There’s also timer_init available for use without malloc which takes a timer list. AioContext provides helper functions aio_timer_new and aio_timer_init.

Secondly, I’ve similarly rationalised the clock API. For instance this:

now = qemu_get_clock_ns (vm_clock);

becomes this:

now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);

Thirdly, I’ve added lots of documentation (see include/qemu/timer.h)

If you have out-of-tree code that needs converting to the new timer API, simply run

scripts/switch-timer-api [filename] ...

to convert them.