Namespace definitions

Beat

beat

Beat event markers with optional metrical position.

time	duration	value	confidence
[sec]	[sec]	[number] or [null]	–

Each observation corresponds to a single beat event.

The value field can be a number (positive or negative, integer or floating point), indicating the metrical position within the bar of the observed beat.

If no metrical position is provided for the annotation, the value field will be null.

Example

time	duration	value	confidence
0.500	0.000	1	null
1.000	0.000	2	null
1.500	0.000	3	null
2.000	0.000	4	null
2.500	0.000	1	null

Note

duration is typically zero for beat events, but this is not enforced.

confidence is an unconstrained field for beat annotations, and may contain arbitrary data.

beat_position

Beat events with time signature information.

time	duration	value	confidence
[sec]	[sec]	position, measure, num_beats, beat_units	–

Each observation corresponds to a single beat event.

The value field is a structure containing the following fields:

position : the position of the beat within the measure. Can be any number greater than or equal to 1.

measure : the index of the measure containing this beat. Can be any non-negative integer.

num_beats : the number of beats per measure : can be any strictly positive integer.

beat_units : the note value for beats in this measure. Must be one of: 1, 2, 4, 8, 16, 32, 64, 128, 256.

All fields are required for each observation.
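The field constraints listed above can be sketched as a small checker. This is an illustrative helper only, not part of any library; the function name check_beat_position and its return format are assumptions made for this example.

```python
from numbers import Real

# Note values permitted for beat_units, as listed above.
ALLOWED_BEAT_UNITS = {1, 2, 4, 8, 16, 32, 64, 128, 256}


def _is_int(x):
    """True for integers, excluding booleans."""
    return isinstance(x, int) and not isinstance(x, bool)


def check_beat_position(value):
    """Return a list of problems in a beat_position value (empty if none).

    Hypothetical helper: it mirrors the constraints described in the text.
    """
    problems = [
        f"missing field: {key}"
        for key in ("position", "measure", "num_beats", "beat_units")
        if key not in value
    ]
    if problems:
        return problems

    pos = value["position"]
    if not isinstance(pos, Real) or isinstance(pos, bool) or pos < 1:
        problems.append("position must be a number >= 1")
    if not _is_int(value["measure"]) or value["measure"] < 0:
        problems.append("measure must be a non-negative integer")
    if not _is_int(value["num_beats"]) or value["num_beats"] < 1:
        problems.append("num_beats must be a strictly positive integer")
    if value["beat_units"] not in ALLOWED_BEAT_UNITS:
        problems.append("beat_units must be a power of two from 1 to 256")
    return problems
```

An empty list means the value satisfies every constraint stated above; otherwise each entry names one violated constraint.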

Example

time	duration	value	confidence
0.500	0.000	position: 1, measure: 0, num_beats: 4, beat_units: 4	null
1.000	0.000	position: 2, measure: 0, num_beats: 4, beat_units: 4	null
1.500	0.000	position: 3, measure: 0, num_beats: 4, beat_units: 4	null
2.000	0.000	position: 4, measure: 0, num_beats: 4, beat_units: 4	null
2.500	0.000	position: 1, measure: 1, num_beats: 4, beat_units: 4	null

Note

duration is typically zero for beat events, but this is not enforced.

confidence is an unconstrained field for beat annotations, and may contain arbitrary data.

position should lie in the range [1, num_beats], but the upper bound is not enforced at the schema level.

Chord

chord

Chord annotations described by an extended version of the grammar defined by Harte, et al. [1]

time	duration	value	confidence
[sec]	[sec]	string	–

This namespace is similar to chord_harte, with the following modifications:

Sharps and flats may not be mixed in a note symbol. For instance, A#b# is legal in chord_harte but not in chord. A### is legal in both.

The following quality values have been added:

sus2, 1, 5

aug7

11, maj11, min11

13, maj13, min13
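The difference in note symbols between the two namespaces can be illustrated with two regular expressions. This is a sketch written for this document; the pattern and function names are hypothetical, and only the root note symbol is checked, not the full chord grammar.

```python
import re

# chord_harte: any mixture of sharps and flats may follow the note letter.
HARTE_NOTE = re.compile(r"[A-G][#b]*")

# chord: either sharps only or flats only may follow the note letter.
CHORD_NOTE = re.compile(r"[A-G](?:#*|b*)")


def is_harte_note(symbol):
    """True if symbol is a legal note symbol under the chord_harte rule."""
    return HARTE_NOTE.fullmatch(symbol) is not None


def is_chord_note(symbol):
    """True if symbol is a legal note symbol under the stricter chord rule."""
    return CHORD_NOTE.fullmatch(symbol) is not None
```

With these definitions, A#b# passes is_harte_note but fails is_chord_note, while A### passes both.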

Example

time	duration	value	confidence
0.000	1.000	N	null
0.000	1.000	Bb:5	null
0.000	1.000	E:(*5)	null
0.000	1.000	E#:min9/9	null
0.000	1.000	G##:maj6	null
0.000	1.000	D:13/6	null
0.000	1.000	A:sus2	null

Note

confidence is an unconstrained field, and may contain arbitrary data.

chord_harte

Chord annotations described according to the grammar defined by Harte, et al. [1]

time	duration	value	confidence
[sec]	[sec]	string	–

Each observed value is a text representation of a chord annotation.

N specifies a no chord observation

Notes are annotated in the usual way: A-G followed by optional sharps (#) and flats (b)

Chord qualities are denoted by abbreviated strings:

maj, min, dim, aug

maj7, min7, 7, dim7, hdim7, minmaj7

maj6, min6

9, maj9, min9

sus4

Inversions are specified by a slash (/) followed by the interval number, e.g., G/3.

Extensions are denoted in parentheses, e.g., G:(b11,13). Suppressed notes are indicated with an asterisk, e.g., G:(*3).

A complete description of the chord grammar is provided in [1], table 1.
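As a rough illustration of how the pieces of a chord label fit together, the sketch below splits a label into root, quality, extensions, and bass interval. It is a simplified approximation, not the full grammar from [1]: it does not check that the quality is one of the listed abbreviations, and the names CHORD_PATTERN and split_chord are assumptions made for this example.

```python
import re

# Simplified shape: root, optional ":quality", optional "(extensions)",
# optional "/bass". The colon before the parentheses is treated as optional.
CHORD_PATTERN = re.compile(
    r"(?P<root>[A-G][#b]*)"
    r"(?::(?P<quality>[a-z0-9]+))?"
    r"(?::?\((?P<extensions>[^()]*)\))?"
    r"(?:/(?P<bass>[#b]*\d+))?"
)


def split_chord(label):
    """Split a chord label into its parts (hypothetical helper)."""
    if label == "N":  # the no-chord symbol has no further structure
        return {"root": "N", "quality": None, "extensions": [], "bass": None}
    match = CHORD_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized chord label: {label!r}")
    parts = match.groupdict()
    raw = parts["extensions"] or ""
    # Extensions are comma-separated; suppressed notes keep their leading "*".
    parts["extensions"] = [e.strip() for e in raw.split(",") if e.strip()]
    return parts
```

For example, split_chord("E#:min9/9") separates the root E#, the quality min9, and the bass interval 9, while split_chord("E:(*5)") yields no quality and the single extension *5.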

Example

time	duration	value	confidence
0.000	1.000	N	null
0.000	1.000	Bb	null
0.000	1.000	E:(*5)	null
0.000	1.000	E#:min9/9	null
0.000	1.000	G#b:maj6	null
0.000	1.000	D/6	null
0.000	1.000	A:sus4	null

Note

confidence is an unconstrained field, and may contain arbitrary data.

chord_roman

Chord annotations in roman numeral format, as described by [2].

time	duration	value	confidence
[sec]	[sec]	tonic, chord	–

The value field is a structure containing the following fields:

tonic : (string) the tonic note of the chord, e.g., A or Gb.

chord : (string) the scale degree of the chord in roman numerals (1–7), along with inversions, extensions, and qualities.

Scale degrees are encoded with optional leading sharps and flats, e.g., V, bV or #VII. Upper-case numerals indicate major, lower-case numerals indicate minor.

Qualities are encoded as one of the following symbols:

o : diminished (triad)

+ : augmented (triad)

s : suspension

d : dominant (seventh)

h : half-diminished (seventh)

x : fully-diminished (seventh)

Inversions are encoded by arabic numerals, e.g., V6 for a first-inversion triad, V64 for second inversion.

Applied chords are encoded by a / followed by a roman numeral encoding of the scale degree, e.g., V7/IV.

Example

time	duration	value	confidence
0.000	0.500	tonic: C chord: I6	–
0.500	0.500	tonic: C chord: bIV	–
1.000	0.500	tonic: C chord: Vh7	–

Note

The grammar defined in [2] has been constrained to support only the quality symbols listed above.

confidence is an unconstrained field, and may contain arbitrary data.

Key

key_mode

Key and optional mode (major/minor or Greek modes)

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is a string matching one of the three following patterns:

N : no key

Ab, A, A#, Bb, ... G# : tonic note, upper case

tonic:MODE where tonic is as described above, and MODE is one of: major, minor, ionian, dorian, phrygian, lydian, mixolydian, aeolian, locrian.
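The three patterns above can be approximated with a single regular expression. This is a sketch under the assumption that a tonic is one upper-case letter A to G with at most one sharp or flat; the names MODES, KEY_MODE, and is_key_mode are hypothetical.

```python
import re

MODES = (
    "major", "minor", "ionian", "dorian", "phrygian",
    "lydian", "mixolydian", "aeolian", "locrian",
)

# "N", or a tonic, or tonic:MODE.
# Assumption: at most one accidental follows the note letter.
KEY_MODE = re.compile(r"N|[A-G][#b]?(?::(?:%s))?" % "|".join(MODES))


def is_key_mode(value):
    """True if value matches one of the three key_mode patterns."""
    return isinstance(value, str) and KEY_MODE.fullmatch(value) is not None
```

Under this sketch, C:minor, N, and Eb are accepted, while a lower-case tonic such as c:minor or an unlisted mode such as C:blues is rejected.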

Example

time	duration	value	confidence
0.000	30.0	C:minor	null
30.0	5.00	N	null
35.0	15.0	C#:dorian	null
50.0	10.0	Eb	null
60.0	10.0	A:lydian	null

Note

confidence is an unconstrained field, and may contain arbitrary data.

Lyrics

lyrics

Time-aligned lyrical annotations.

time	duration	value	confidence
[sec]	[sec]	string	–

The required value field can contain arbitrary text data, e.g., lyrics.

Example

time	duration	value	confidence
0.500	4.000	“Row row row your boat”	null
4.500	2.000	“gently down the stream”	null
7.000	1.000	“merrily”	null
8.000	1.000	“merrily”	null
9.000	1.000	“merrily”	null
10.00	1.000	“merrily”	null

Note

confidence is an unconstrained field, and may contain arbitrary data.

lyrics_bow

Time-aligned bag-of-words or bag-of-ngrams.

time	duration	value	confidence
[sec]	[sec]	array	–

The required value field is an array, where each element is an array of [term, count]. The term here may be either a string (for simple bag-of-words) or an array of strings (for bag-of-ngrams).
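One way to build such an array from a word sequence is sketched below using the standard library. The function name bag_of_ngrams is an assumption made for this example; it returns nested lists so the result can be serialized directly.

```python
from collections import Counter


def bag_of_ngrams(words, n=1):
    """Return a list of [term, count] pairs for the n-grams in words.

    For n == 1 each term is a string; for n > 1 each term is a list of
    strings. Hypothetical helper for illustration.
    """
    words = list(words)
    if n == 1:
        terms = words
    else:
        # Tuples are hashable, so they can be counted; convert back later.
        terms = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(terms)
    return [
        [list(term) if isinstance(term, tuple) else term, count]
        for term, count in counts.items()
    ]
```

For the words of "row row row your boat", the unigram call counts row three times, and the call with n=2 counts the bigram [row, row] twice.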

Example

time	duration	value	confidence
0.000	30.00	['row', 3], [['row', 'row'], 2], ['your', 1], ['boat', 1]	null

Note

confidence is an unconstrained field, and may contain arbitrary data.

Mood

mood_thayer

Time-varying emotion measurements as ordered pairs of (valence, arousal)

time	duration	value	confidence
[sec]	[sec]	(valence, arousal)	–

The value field is an ordered pair of numbers measuring the valence and arousal positions in the Thayer mood model [3].

Example

time	duration	value	confidence
0.500	0.250	(-0.5, 1.0)	null
0.750	0.250	(-0.3, 0.6)	null
1.000	0.750	(-0.1, 0.1)	null
1.750	0.500	(0.3, -0.5)	null

Note

confidence is an unconstrained field, and may contain arbitrary data.

Onset

onset

Note onset event markers.

time	duration	value	confidence
[sec]	[sec]	–	–

This namespace can be used to encode timing of arbitrary instantaneous events. Most commonly, this is applied to note onsets.

Example

time	duration	value	confidence
0.500	0.000	null	null
1.000	0.000	null	null
1.500	0.000	null	null
2.000	0.000	null	null

Note

duration is typically zero for instantaneous events, but this is not enforced.

value and confidence fields are unconstrained, and may contain arbitrary data.

Pattern

pattern_jku

Each note of the pattern contains (pattern_id, midi_pitch, occurrence_id, morph_pitch, staff), following the format described in [4].

time	duration	value	confidence
[sec]	[sec]	pattern_id, midi_pitch, occurrence_id, morph_pitch, staff	–

Each value field contains a dictionary with the following keys:

pattern_id: The integer that identifies the current pattern, starting from 1.

midi_pitch: The float representing the MIDI pitch.

occurrence_id: The integer that identifies the current occurrence, starting from 1.

morph_pitch: The float representing the morphological pitch.

staff: The integer representing the staff where the current note of the pattern is found, starting from 0.

Example

time	duration	value	confidence
62.86	0.09	pattern_id: 1, midi_pitch: 50, occurrence_id: 1, morph_pitch: 54, staff: 1	null
62.86	0.36	pattern_id: 1, midi_pitch: 77, occurrence_id: 1, morph_pitch: 70, staff: 0	null
36.34	0.09	pattern_id: 1, midi_pitch: 71, occurrence_id: 2, morph_pitch: 66, staff: 0	null
36.43	0.36	pattern_id: 1, midi_pitch: 69, occurrence_id: 2, morph_pitch: 65, staff: 0	null

Pitch

pitch_contour

Pitch contours in the format (index, frequency, voicing).

time	duration	value	confidence
[sec]	[–]	index, frequency, voiced	–

Each value field is a structure containing a contour index (an integer indicating which contour the observation belongs to), a frequency value in Hz, and a boolean indicating whether the value is voiced. The confidence field is unconstrained.

Example

time	duration	value	confidence
0.0000	0.0000	index: 0, frequency: 442.1, voiced: True	null
0.0058	0.0000	index: 0, frequency: 457.8, voiced: False	null
2.5490	0.0000	index: 1, frequency: 89.4, voiced: True	null
2.5548	0.0000	index: 1, frequency: 90.0, voiced: True	null

note_hz

Note events with (non-negative) frequencies measured in Hz.

time	duration	value	confidence
[sec]	[sec]	number	–

Each value field gives the frequency of the note in Hz.

Example

time	duration	value	confidence
12.34	0.287	189.9	null
2.896	3.000	74.0	null
10.12	0.500	440.0	null

note_midi

Note events with pitches measured in (fractional) MIDI note numbers.

time	duration	value	confidence
[sec]	[sec]	number	–

Each value field gives the pitch of the note in MIDI note numbers.
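For readers moving between note_hz and note_midi, the usual conversion is sketched below. It assumes twelve-tone equal temperament with A4 = 440 Hz = MIDI note 69, which is the common convention but is not stated by the namespace itself; the function names are chosen for this example.

```python
import math


def midi_to_hz(midi_note):
    """Convert a (possibly fractional) MIDI note number to Hz.

    Assumes equal temperament with MIDI note 69 tuned to 440 Hz.
    """
    return 440.0 * 2.0 ** ((midi_note - 69.0) / 12.0)


def hz_to_midi(frequency):
    """Convert a strictly positive frequency in Hz to a MIDI note number."""
    if frequency <= 0:
        raise ValueError("frequency must be strictly positive")
    return 69.0 + 12.0 * math.log2(frequency / 440.0)
```

Each step of 12 MIDI note numbers corresponds to one octave, that is, a doubling or halving of the frequency.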

Example

time	duration	value	confidence
12.34	0.287	52.0	null
2.896	3.000	20.7	null
10.12	0.500	42.0	null

pitch_class

Pitch measurements in (tonic, pitch class) format.

time	duration	value	confidence
[sec]	[sec]	tonic, pitch class	–

Each value field is a structure containing a tonic (a note string, e.g., "A#" or "D") and a pitch class (the pitch field), given as an integer scale degree. The confidence field is unconstrained.

Example

time	duration	value	confidence
0.000	30.0	tonic: C, pitch: 0	null
0.000	30.0	tonic: C, pitch: 4	null
0.000	30.0	tonic: C, pitch: 7	null
30.00	35.0	tonic: G, pitch: 0	null

pitch_hz

Warning

Deprecated, use pitch_contour.

Pitch measurements in Hertz (Hz). Pitch (a subjective sensation) is represented as fundamental frequency (a physical quantity), a.k.a. “f0”.

time	duration	value	confidence
[sec]	–	number	–

The time field represents the instant at which the f0 was estimated. By convention, this is usually the center time of the analysis frame. Note that this differs from pitch_midi and pitch_class, where time represents the onset time. As a consequence, the duration field is undefined and should be ignored.

The value field is a number representing the f0 in Hz. By convention, values less than or equal to zero represent silence (no pitch). Some algorithms (e.g., melody extraction algorithms that adhere to the MIREX convention) use negative f0 values to report the algorithm’s pitch estimate for frames where it detects no active pitch (e.g., no melody). This allows pitch activation detection (a.k.a. “voicing detection”) and pitch frequency estimation to be evaluated independently.

The confidence field is unconstrained.
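The sign convention can be unpacked into separate voicing and frequency sequences, as in the sketch below. The function name split_f0 is hypothetical, and treating the magnitude of a negative value as the frequency estimate follows the MIREX-style convention described in the text.

```python
def split_f0(values):
    """Split signed f0 values into (voiced, frequencies).

    voiced[i] is True only for strictly positive values; frequencies[i] is
    the magnitude, so a negative value keeps its pitch estimate while being
    marked unvoiced. A value of 0 is unvoiced with frequency 0.
    """
    voiced = [v > 0 for v in values]
    frequencies = [abs(v) for v in values]
    return voiced, frequencies
```

Keeping the two sequences apart lets voicing detection and frequency estimation be compared against a reference independently.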

Example

time	duration	value	confidence
0.000	0.000	300.00	null
0.010	0.000	305.00	null
0.020	0.000	310.00	null
0.030	0.000	0.00	null
0.040	0.000	-280.00	null
0.050	0.000	-290.00	null

pitch_midi

Warning

Deprecated, use note_midi or pitch_contour.

Pitch measurements in (fractional) MIDI note number notation.

time	duration	value	confidence
[sec]	[sec]	number	–

The value field is a number representing the pitch in MIDI notation. Numbers can be negative (for notes below C-1) or fractional.

Example

time	duration	value	confidence
0.000	30.000	24	null
0.000	30.000	43.02	null
15.00	45.000	26	null

Segment

segment_open

Structural segmentation with an open vocabulary of segment labels.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field contains string descriptors for each segment, e.g., “verse” or “bridge”.

Example

time	duration	value	confidence
0.000	20.000	intro	null
20.00	10.000	verse	null
30.00	20.000	refrain	null
50.00	20.000	verse (alternate)	null

segment_salami_function

Segment annotations with functional labels from the SALAMI guidelines.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field must be one of the allowable strings in the SALAMI function vocabulary.

Example

time	duration	value	confidence
0.000	20.000	applause	null
20.00	10.000	count-in	null
30.00	20.000	introduction	null
50.00	20.000	verse	null

segment_salami_upper

Segment annotations with SALAMI’s upper-case (large) label format.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field must be a string of the following format:

“silence” or “Silence”

One or more upper-case letters, followed by zero or more apostrophes

Example

time	duration	value	confidence
0.000	20.000	silence	null
20.00	10.000	A	null
30.00	20.000	B	null
50.00	20.000	A'	null

segment_salami_lower

Segment annotations with SALAMI’s lower-case (small) label format.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field must be a string of the following format:

“silence” or “Silence”

One or more lower-case letters, followed by zero or more apostrophes
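The label formats of segment_salami_upper and segment_salami_lower can be approximated with one regular expression each. This is a sketch; the pattern and function names are hypothetical, and it assumes the apostrophe is the plain ASCII character (') rather than a typographic quote.

```python
import re

# "silence" / "Silence", or letters of one case followed by apostrophes.
# Assumption: apostrophes are ASCII (').
SALAMI_UPPER = re.compile(r"[Ss]ilence|[A-Z]+'*")
SALAMI_LOWER = re.compile(r"[Ss]ilence|[a-z]+'*")


def is_salami_upper(label):
    """True if label fits the upper-case (large) label format."""
    return SALAMI_UPPER.fullmatch(label) is not None


def is_salami_lower(label):
    """True if label fits the lower-case (small) label format."""
    return SALAMI_LOWER.fullmatch(label) is not None
```

Under these patterns, A and A' are valid upper-case labels, a and a' are valid lower-case labels, and silence is accepted by both.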

Example

time	duration	value	confidence
0.000	20.000	silence	null
20.00	10.000	a	null
30.00	20.000	b	null
50.00	20.000	a'	null

segment_tut

Segment annotations using the TUT vocabulary.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is a string describing the function of the segment.

Example

time	duration	value	confidence
0.000	20.000	Intro	null
20.00	10.000	Verse	null
30.00	20.000	bridge	null
50.00	20.000	RefrainA	null

multi_segment

Multi-level structural segmentations.

time	duration	value	confidence
[sec]	[sec]	label : string, level : int >= 0	–

In a multi-level segmentation, the track is partitioned many times — possibly recursively — which results in a collection of segmentations of varying degrees of specificity. In the multi_segment namespace, all of the resulting segments are collected together, and the level field is used to encode the segment’s corresponding partition.

Level values must be non-negative, and ordered by increasing specificity. For example, level==0 may correspond to a single segment spanning the entire track, and each subsequent level value corresponds to a more refined segmentation.
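To recover the individual segmentations from a flat collection of multi_segment observations, the segments can be grouped by their level field. The sketch below uses plain tuples of (time, duration, value) rather than any particular library object; the function name and the sample data (a hypothetical 60-second track) are assumptions for illustration.

```python
from collections import defaultdict


def segments_by_level(observations):
    """Group (time, duration, value) observations by value["level"].

    Returns a dict mapping each level to its (time, duration, label)
    segments, sorted by start time, with levels in increasing order.
    """
    levels = defaultdict(list)
    for time, duration, value in observations:
        levels[value["level"]].append((time, duration, value["label"]))
    return {level: sorted(segs) for level, segs in sorted(levels.items())}


# Hypothetical 60-second track with three levels of increasing specificity.
observations = [
    (0.0, 60.0, {"label": "A", "level": 0}),
    (0.0, 30.0, {"label": "B", "level": 1}),
    (30.0, 30.0, {"label": "C", "level": 1}),
    (0.0, 15.0, {"label": "a", "level": 2}),
    (15.0, 15.0, {"label": "b", "level": 2}),
    (30.0, 15.0, {"label": "a", "level": 2}),
    (45.0, 15.0, {"label": "c", "level": 2}),
]
levels = segments_by_level(observations)
```

In this sample each level partitions the whole track, so the durations within every level add up to 60 seconds.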

Example

time	duration	value	confidence
0.000	60.000	label : A level : 0	null
0.000	30.000	label : B level : 1	null
30.00	30.000	label : C level : 1	null
0.000	15.000	label : a level : 2	null
15.00	15.000	label : b level : 2	null
30.00	15.000	label : a level : 2	null
45.00	15.000	label : c level : 2	null

Tag

tag_cal10k

Tags from the CAL10K vocabulary.

time	duration	value	confidence
[sec]	[sec]	string	–

The value is constrained to a set of 1053 terms, spanning mood, instrumentation, style, and genre.

Example

time	duration	value	confidence
0.000	30.000	“bop influences”	null
0.000	30.000	“bright beats”	null
0.000	30.000	“hip hop roots”	null

tag_cal500

Tags from the CAL500 vocabulary.

time	duration	value	confidence
[sec]	[sec]	string	–

The value is constrained to a set of 174 terms, spanning mood, instrumentation, and genre.

Example

time	duration	value	confidence
0.000	30.000	“Genre-Best-Rock”	null
0.000	30.000	“Vocals-Monotone”	null
0.000	30.000	“Usage-At_work”	null

tag_gtzan

Genre classes from the GTZAN dataset.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to one of ten strings:

blues

classical

country

disco

hip-hop

jazz

metal

pop

reggae

rock

By convention, only one tag is applied per track in this namespace. This is not enforced by the schema.

Example

time	duration	value	confidence
0.000	30.000	“reggae”	null

tag_msd_tagtraum_cd1

Genre classes from the msd tagtraum cd1 dataset.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to one of 13 strings:

reggae

pop/rock

rnb

jazz

vocal

new age

latin

rap

country

international

blues

electronic

folk

By convention, one or two tags per track are possible in this namespace. The sum of the confidence values should equal 1.0. This is not enforced by the schema.

Example

time	duration	value	confidence
0.000	0.000	“reggae”	1.0

tag_msd_tagtraum_cd2

Genre classes from the msd tagtraum cd2 dataset.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to one of 15 strings:

reggae

latin

metal

rnb

jazz

punk

pop

new age

country

rap

rock

world

blues

electronic

folk

By convention, one or two tags per track are possible in this namespace. The sum of the confidence values should equal 1.0. This is not enforced by the schema.

Example

time	duration	value	confidence
0.000	0.000	“reggae”	0.6666667
0.000	0.000	“rock”	0.3333333

tag_medleydb_instruments

MedleyDB instrument source annotations.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to the set of instruments defined in MedleyDB instrument taxonomy.

Example

time	duration	value	confidence
0.000	20.000	“darbuka”	null
0.000	20.000	“flute section”	null
0.000	20.000	“oud”	null

tag_open

Open vocabulary tags. This namespace is appropriate for unconstrained tag data, such as tags from Last.FM or Magnatagatune.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is unconstrained, and can contain any string value.

Example

time	duration	value	confidence
0.000	20.000	“rocking”	null
0.000	20.000	“rockin’”	null
0.000	20.000	“”	null
0.000	20.000	“favez^^^”	null

tag_audioset

Tags from the full AudioSet (v1) ontology.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to the vocabulary of the AudioSet ontology.

Example

time	duration	value	confidence
0.000	20.000	“Air brake”	null
5.000	25.000	“Yodeling”	null
9.000	35.000	“Steam whistle”	null

tag_audioset_genre

Tags from the musical genre subset of the AudioSet (v1) ontology.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to the 66 musical genres of the AudioSet-genre ontology.

Example

time	duration	value	confidence
0.000	20.000	“Oldschool jungle”	null

tag_audioset_instrument

Tags from the musical instrument subset of the AudioSet (v1) ontology.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to the 91 musical instruments of the AudioSet-instruments ontology.

Example

time	duration	value	confidence
0.000	20.000	“Ukulele”	null
5.000	25.000	“Piano”	null
9.000	35.000	“Tuning fork”	null

tag_fma_genre

Tags from the Free Music Archive (FMA) 16-class genre taxonomy.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to the 16 genres of the FMA data-set.

Example

time	duration	value	confidence
0.000	20.000	“Blues”	null
5.000	25.000	“Instrumental”	null
9.000	35.000	“Soul-RnB”	null

tag_fma_subgenre

Tags from the Free Music Archive (FMA) 163-class genre and sub-genre taxonomy.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to the 163 genres and sub-genres of the FMA data-set.

Example

time	duration	value	confidence
0.000	20.000	“Pop”	null
5.000	25.000	“Power-Pop”	null
9.000	35.000	“Nerdcore”	null

tag_urbansound

Sound classes from the UrbanSound dataset.

time	duration	value	confidence
[sec]	[sec]	string	–

The value field is constrained to one of ten strings:

air_conditioner

car_horn

children_playing

dog_bark

drilling

engine_idling

gun_shot

jackhammer

siren

street_music

Example

time	duration	value	confidence
0.000	30.000	“street_music”	null

Tempo

tempo

Tempo measurements in beats per minute (BPM).

time	duration	value	confidence
[sec]	[sec]	number	number

The value field is a non-negative floating-point number indicating the tempo measurement. The confidence field is a number in the range [0, 1], following the format used by MIREX [5].

Example

time	duration	value	confidence
0.00	60.00	180.0	0.8
0.00	60.00	90.0	0.2

Note

MIREX requires that tempo measurements come in pairs, and that the confidence values sum to 1. This is not enforced at the schema level.
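Because the schema does not enforce the MIREX pairing convention, a separate check may be useful. The sketch below operates on plain (value, confidence) pairs; the function name check_mirex_tempo and the tolerance parameter are assumptions made for this example.

```python
import math


def check_mirex_tempo(observations, tol=1e-6):
    """True if observations form a MIREX-style tempo pair.

    observations is a list of (value, confidence) tuples. The check requires
    exactly two entries, non-negative tempo values, confidences within
    [0, 1], and confidences that sum to 1 within the tolerance tol.
    """
    if len(observations) != 2:
        return False
    for value, confidence in observations:
        if value < 0 or not 0 <= confidence <= 1:
            return False
    total = sum(confidence for _, confidence in observations)
    return math.isclose(total, 1.0, abs_tol=tol)
```

A pair such as (180.0, 0.8) and (90.0, 0.2) passes this check, while a single tempo estimate or a pair whose confidences sum to 1.2 does not.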

Miscellaneous

Vector

Numerical vector data. This is useful for generic regression problems where the output is a vector of numbers.

time	duration	value	confidence
[sec]	[sec]	[array of numbers]	–

Each observation value must be an array of at least one number. Different observations may have different length arrays, so it is up to the user to verify that arrays have the desired length.
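Since the array length is left to the user to verify, a check along the following lines may help. It is a sketch operating on plain Python lists; the function name common_length is hypothetical.

```python
def common_length(vectors):
    """Return the shared length of all vectors, or raise ValueError.

    vectors is a sequence of number arrays, one per observation. An error is
    raised if there are no vectors, if any vector is empty, or if the
    lengths differ.
    """
    lengths = {len(v) for v in vectors}
    if len(lengths) != 1:
        raise ValueError(f"expected one common length, found {sorted(lengths)}")
    (length,) = lengths
    if length < 1:
        raise ValueError("each vector must contain at least one number")
    return length
```

Calling this once on all observation values confirms that every array has the same, non-zero length before the values are stacked or compared.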

Blob

Arbitrary data blobs.

time	duration	value	confidence
[sec]	[sec]	–	–

This namespace can be used to encode arbitrary data. The value and confidence fields have no schema constraints, and may contain any structured (but serializable) data. This can be useful for storing complex output data that does not fit any particular task schema, such as regression targets or geolocation data.

It is strongly advised that the AnnotationMetadata for blobs be as explicit as possible.

Scaper

Structured representation for soundscapes synthesized by the Scaper package.

time	duration	value	confidence
[sec]	[sec]	label, source_file, source_time, event_time, event_duration, snr, time_stretch, pitch_shift, role	–

Each value field contains a dictionary with the following keys:

label: a string indicating the label of the sound source

source_file: a full path to the original sound source (on disk)

source_time: a non-negative number indicating the time offset within source_file of the sound

event_time: the start time of the event in the synthesized soundscape

event_duration: a strictly positive number indicating the duration of the event

snr: the signal-to-noise ratio (in LUFS) of the sound compared to the background

time_stretch: (optional) a strictly positive number indicating the amount of time-stretch applied to the source

pitch_shift: (optional) the amount of pitch-shift applied to the source

role: one of background or foreground
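The constraints on these keys can be sketched as a simple checker. This is an illustration only and does not use the Scaper package; the function name check_scaper_value is hypothetical, and it assumes that every key not marked optional above is required.

```python
from numbers import Real

# Assumption: keys not marked "(optional)" in the list above are required.
REQUIRED_KEYS = (
    "label", "source_file", "source_time", "event_time",
    "event_duration", "snr", "role",
)


def _is_number(x):
    """True for real numbers, excluding booleans."""
    return isinstance(x, Real) and not isinstance(x, bool)


def check_scaper_value(value):
    """Return a list of problems in a scaper value dict (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in value]
    if problems:
        return problems
    if not _is_number(value["source_time"]) or value["source_time"] < 0:
        problems.append("source_time must be a non-negative number")
    if not _is_number(value["event_duration"]) or value["event_duration"] <= 0:
        problems.append("event_duration must be a strictly positive number")
    stretch = value.get("time_stretch")  # optional key
    if stretch is not None and (not _is_number(stretch) or stretch <= 0):
        problems.append("time_stretch must be a strictly positive number")
    if value["role"] not in ("background", "foreground"):
        problems.append("role must be 'background' or 'foreground'")
    return problems
```

An empty result means the dictionary is consistent with the key descriptions above; it says nothing about whether source_file actually exists on disk.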