Namespace definitions
Beat
beat
Beat event markers with optional metrical position.
time
duration
value
confidence
[sec]
[sec]
[number] or [null]
–
Each observation corresponds to a single beat event.
The value field can be a number (positive or negative, integer or floating point),
indicating the metrical position within the bar of the observed beat.
If no metrical position is provided for the annotation, the value field will be
null.
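The rule above can be sketched in Python using plain dictionaries instead of any particular library. The helper name `is_valid_beat_value` is hypothetical and only illustrates the "number or null" constraint.

```python
import numbers


def is_valid_beat_value(value):
    """Return True if value is None (no metrical position) or a real number."""
    if value is None:
        return True
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, numbers.Real) and not isinstance(value, bool)


# Two illustrative observations: one with a metrical position, one without
observations = [
    {"time": 0.5, "duration": 0.0, "value": 1, "confidence": None},
    {"time": 1.0, "duration": 0.0, "value": None, "confidence": None},
]
all_valid = all(is_valid_beat_value(obs["value"]) for obs in observations)
```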
Example
time
duration
value
confidence
0.500
0.000
1
null
1.000
0.000
2
null
1.500
0.000
3
null
2.000
0.000
4
null
2.500
0.000
1
null
Note
duration is typically zero for beat events, but this is not enforced.
confidence is an unconstrained field for beat annotations, and may contain
arbitrary data.
beat_position
Beat events with time signature information.
time
duration
value
confidence
[sec]
[sec]
position
measure
num_beats
beat_units
–
Each observation corresponds to a single beat event.
The value field is a structure containing the following fields:
position: the position of the beat within the measure. Can be any number greater than or equal to 1.
measure: the index of the measure containing this beat. Can be any non-negative integer.
num_beats: the number of beats per measure. Can be any strictly positive integer.
beat_units: the note value for beats in this measure. Must be one of: 1, 2, 4, 8, 16, 32, 64, 128, 256.
All fields are required for each observation.
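As a rough illustration, the field constraints listed above could be checked as follows. This is a sketch over plain dictionaries, not the actual schema validation; `check_beat_position` and its helpers are hypothetical names.

```python
VALID_BEAT_UNITS = {1, 2, 4, 8, 16, 32, 64, 128, 256}
REQUIRED_FIELDS = ("position", "measure", "num_beats", "beat_units")


def _is_number(x):
    # bool is a subclass of int in Python, so reject it explicitly
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def _is_int(x):
    return isinstance(x, int) and not isinstance(x, bool)


def check_beat_position(value):
    """Return a list of problems with a beat_position value (empty if none)."""
    missing = [f for f in REQUIRED_FIELDS if f not in value]
    if missing:
        return ["missing field: " + f for f in missing]
    problems = []
    if not (_is_number(value["position"]) and value["position"] >= 1):
        problems.append("position must be a number >= 1")
    if not (_is_int(value["measure"]) and value["measure"] >= 0):
        problems.append("measure must be a non-negative integer")
    if not (_is_int(value["num_beats"]) and value["num_beats"] > 0):
        problems.append("num_beats must be a strictly positive integer")
    if value["beat_units"] not in VALID_BEAT_UNITS:
        problems.append("beat_units must be a power of two from 1 to 256")
    return problems
```

An empty list means the value satisfies every constraint stated above. The upper bound on position is deliberately not checked, mirroring the schema.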
Example
time
duration
value
confidence
0.500
0.000
position: 1
measure: 0
num_beats: 4
beat_units: 4
null
1.000
0.000
position: 2
measure: 0
num_beats: 4
beat_units: 4
null
1.500
0.000
position: 3
measure: 0
num_beats: 4
beat_units: 4
null
2.000
0.000
position: 4
measure: 0
num_beats: 4
beat_units: 4
null
2.500
0.000
position: 1
measure: 1
num_beats: 4
beat_units: 4
null
Note
duration is typically zero for beat events, but this is not enforced.
confidence is an unconstrained field for beat annotations, and may contain
arbitrary data.
position should lie in the range [1, beat_units], but the upper bound is not
enforced at the schema level.
Chord
chord
Chord annotations described by an extended version of the grammar defined by Harte, et al. [1]
time
duration
value
confidence
[sec]
[sec]
string
–
This namespace is similar to chord_harte, with the following modifications:
Sharps and flats may not be mixed in a note symbol. For instance, A#b# is legal in chord_harte but not in chord. A### is legal in both.
The following quality values have been added:
sus2, 1, 5
aug7
11, maj11, min11
13, maj13, min13
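The restriction on mixing sharps and flats can be illustrated with two regular expressions. This is a sketch of the rule as stated above, covering only the root note symbol; the pattern names and `root_ok` are hypothetical.

```python
import re

# chord: a letter A-G followed by only sharps or only flats (never mixed)
CHORD_ROOT = re.compile(r"^[A-G](#*|b*)$")
# chord_harte: any mix of sharps and flats may follow the letter
HARTE_ROOT = re.compile(r"^[A-G][#b]*$")


def root_ok(root, strict=True):
    """Check a root note symbol under chord (strict) or chord_harte rules."""
    pattern = CHORD_ROOT if strict else HARTE_ROOT
    return pattern.match(root) is not None
```

With these patterns, `A###` passes under both rules, while `A#b#` passes only with `strict=False`.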
Example
time
duration
value
confidence
0.000
1.000
N
null
0.000
1.000
Bb:5
null
0.000
1.000
E:(*5)
null
0.000
1.000
E#:min9/9
null
0.000
1.000
G##:maj6
null
0.000
1.000
D:13/6
null
0.000
1.000
A:sus2
null
Note
confidence is an unconstrained field, and may contain arbitrary data.
chord_harte
Chord annotations described according to the grammar defined by Harte, et al. [1]
time
duration
value
confidence
[sec]
[sec]
string
–
Each observed value is a text representation of a chord annotation.
N specifies a no-chord observation.
Notes are annotated in the usual way: A-G followed by optional sharps (#) and flats (b).
Chord qualities are denoted by abbreviated strings:
maj, min, dim, aug
maj7, min7, 7, dim7, hdim7, minmaj7
maj6, min6
9, maj9, min9
sus4
Inversions are specified by a slash (/) followed by the interval number, e.g., G/3.
Extensions are denoted in parentheses, e.g., G(b11,13).
Suppressed notes are indicated with an asterisk, e.g., G(*3).
A complete description of the chord grammar is provided in [1], table 1.
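To make the parts of a label concrete, the following sketch splits a chord string into root, quality, extensions, and bass. It is a simplified approximation written for illustration and does not implement the full grammar of [1]; the pattern and the function `split_chord` are hypothetical.

```python
import re

# Simplified pattern: root, optional ":quality", optional "(extensions)",
# optional "/bass". This is an approximation, not the full grammar of [1].
LABEL = re.compile(
    r"^(?P<root>[A-G][#b]*)"
    r"(?::(?P<quality>[a-z0-9]+)?)?"
    r"(?:\((?P<ext>[^)]*)\))?"
    r"(?:/(?P<bass>[#b]*\d+))?$"
)


def split_chord(label):
    """Split a chord label into parts; return None for the no-chord symbol N."""
    if label == "N":
        return None
    match = LABEL.match(label)
    if match is None:
        raise ValueError("unrecognized chord label: %r" % label)
    ext = match.group("ext")
    return {
        "root": match.group("root"),
        "quality": match.group("quality"),
        "extensions": ext.split(",") if ext else [],
        "bass": match.group("bass"),
    }
```

For example, `split_chord("E#:min9/9")` yields the root `E#`, the quality `min9`, no extensions, and the bass interval `9`.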
Example
time
duration
value
confidence
0.000
1.000
N
null
0.000
1.000
Bb
null
0.000
1.000
E:(*5)
null
0.000
1.000
E#:min9/9
null
0.000
1.000
G#b:maj6
null
0.000
1.000
D/6
null
0.000
1.000
A:sus4
null
Note
confidence is an unconstrained field, and may contain arbitrary data.
chord_roman
Chord annotations in roman numeral format, as described by [2].
time
duration
value
confidence
[sec]
[sec]
tonic
chord
–
The value field is a structure containing the following fields:
tonic: (string) the tonic note of the chord, e.g., A or Gb.
chord: (string) the scale degree of the chord in roman numerals (1–7), along with inversions, extensions, and qualities.
Scale degrees are encoded with optional leading sharps and flats, e.g., V, bV or #VII. Upper-case numerals indicate major, lower-case numerals indicate minor.
Qualities are encoded as one of the following symbols:
o: diminished (triad)
+: augmented (triad)
s: suspension
d: dominant (seventh)
h: half-diminished (seventh)
x: fully-diminished (seventh)
Inversions are encoded by arabic numerals, e.g., V6 for a first-inversion triad, V64 for second inversion.
Applied chords are encoded by a / followed by a roman numeral encoding of the scale degree, e.g., V7/IV.
[2] http://theory.esm.rochester.edu/rock_corpus/harmonic_analyses.html
Example
time
duration
value
confidence
0.000
0.500
tonic: C
chord: I6
–
0.500
0.500
tonic: C
chord: bIV
–
1.000
0.500
tonic: C
chord: Vh7
–
Note
The grammar defined in [2] has been constrained to support only the quality symbols listed above.
confidence is an unconstrained field, and may contain arbitrary data.
Key
key_mode
Key and optional mode (major/minor or Greek modes)
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is a string matching one of the three following patterns:
N: no key
Ab, A, A#, Bb, ... G#: tonic note, upper case
tonic:MODE, where tonic is as described above, and MODE is one of: major, minor, ionian, dorian, phrygian, lydian, mixolydian, aeolian, locrian.
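The three patterns can be approximated with a single regular expression. This sketch assumes a tonic is a letter A-G with at most one sharp or flat, as the list above suggests; the actual schema pattern may differ, and `is_key_mode` is a hypothetical helper.

```python
import re

MODES = ("major", "minor", "ionian", "dorian", "phrygian",
         "lydian", "mixolydian", "aeolian", "locrian")

# N alone, or a tonic (A-G with at most one # or b) with an optional :MODE suffix
KEY_MODE = re.compile(r"^(?:N|[A-G][#b]?(?::(?:%s))?)$" % "|".join(MODES))


def is_key_mode(value):
    """Return True if value matches one of the three key_mode patterns."""
    return isinstance(value, str) and KEY_MODE.match(value) is not None
```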
Example
time
duration
value
confidence
0.000
30.0
C:minor
null
30.0
5.00
N
null
35.0
15.0
C#:dorian
null
50.0
10.0
Eb
null
60.0
10.0
A:lydian
null
Note
confidence is an unconstrained field, and may contain arbitrary data.
Lyrics
lyrics
Time-aligned lyrical annotations.
time
duration
value
confidence
[sec]
[sec]
string
–
The required value field can contain arbitrary text data, e.g., lyrics.
Example
time
duration
value
confidence
0.500
4.000
“Row row row your boat”
null
4.500
2.000
“gently down the stream”
null
7.000
1.000
“merrily”
null
8.000
1.000
“merrily”
null
9.000
1.000
“merrily”
null
10.00
1.000
“merrily”
null
Note
confidence is an unconstrained field, and may contain arbitrary data.
lyrics_bow
Time-aligned bag-of-words or bag-of-ngrams.
time
duration
value
confidence
[sec]
[sec]
array
–
The required value field is an array, where each element is an array of [term, count].
The term here may be either a string (for simple bag-of-words) or an array of strings (for bag-of-ngrams).
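One possible way to build such a value from tokenized text is sketched below with the standard library. The function `bag_of_ngrams` is a hypothetical helper, and whitespace tokenization is an assumption made for the example.

```python
from collections import Counter


def bag_of_ngrams(tokens, n=1):
    """Return a list of [term, count] pairs for all n-grams in tokens."""
    if n == 1:
        terms = list(tokens)
    else:
        # tuples are hashable, so they can be counted; convert to lists below
        terms = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(terms)
    return [[list(t) if isinstance(t, tuple) else t, c] for t, c in counts.items()]


tokens = "row row row your boat".split()
unigrams = bag_of_ngrams(tokens)       # terms are strings
bigrams = bag_of_ngrams(tokens, n=2)   # terms are arrays of strings
```

The two results could be concatenated into a single value array, since each element carries its own term type.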
Example
time
duration
value
confidence
0.000
30.00
[‘row’, 3]
[ [‘row’, ‘row’], 2]
[‘your’, 1]
[‘boat’, 1]
null
Note
confidence is an unconstrained field, and may contain arbitrary data.
Mood
mood_thayer
Time-varying emotion measurements as ordered pairs of (valence, arousal)
time
duration
value
confidence
[sec]
[sec]
(valence, arousal)
–
The value field is an ordered pair of numbers measuring the valence and
arousal positions in the Thayer mood model [3].
[3] Thayer, Robert E. The Biopsychology of Mood and Arousal. Oxford University Press, 1989.
Example
time
duration
value
confidence
0.500
0.250
(-0.5, 1.0)
null
0.750
0.250
(-0.3, 0.6)
null
1.000
0.750
(-0.1, 0.1)
null
1.750
0.500
(0.3, -0.5)
null
Note
confidence is an unconstrained field, and may contain arbitrary data.
Onset
onset
Note onset event markers.
time
duration
value
confidence
[sec]
[sec]
–
–
This namespace can be used to encode timing of arbitrary instantaneous events. Most commonly, this is applied to note onsets.
Example
time
duration
value
confidence
0.500
0.000
null
null
1.000
0.000
null
null
1.500
0.000
null
null
2.000
0.000
null
null
Note
duration is typically zero for instantaneous events, but this is not enforced.
value and confidence fields are unconstrained, and may contain arbitrary data.
Pattern
pattern_jku
Each note of the pattern contains (pattern_id, midi_pitch, occurrence_id, morph_pitch,
staff), following the format described in [4].
time
duration
value
confidence
[sec]
[sec]
pattern_id
midi_pitch
occurrence_id
morph_pitch
staff
–
[4] Collins T., Discovery of Repeated Themes & Sections, Music Information Retrieval Evaluation eXchange (MIREX), 2013 (accessed July 7th, 2015).
Each value field contains a dictionary with the following keys:
pattern_id: the integer that identifies the current pattern, starting from 1.
midi_pitch: the float representing the MIDI pitch.
occurrence_id: the integer that identifies the current occurrence, starting from 1.
morph_pitch: the float representing the morphological pitch.
staff: the integer representing the staff where the current note of the pattern is found, starting from 0.
Example
time
duration
value
confidence
62.86
0.09
pattern_id: 1
midi_pitch: 50
occurrence_id: 1
morph_pitch: 54
staff: 1
null
62.86
0.36
pattern_id: 1
midi_pitch: 77
occurrence_id: 1
morph_pitch: 70
staff: 0
null
36.34
0.09
pattern_id: 1
midi_pitch: 71
occurrence_id: 2
morph_pitch: 66
staff: 0
null
36.43
0.36
pattern_id: 1
midi_pitch: 69
occurrence_id: 2
morph_pitch: 65
staff: 0
null
Pitch
pitch_contour
Pitch contours in the format (index, frequency, voicing).
time
duration
value
confidence
[sec]
[–]
index
frequency
voiced
–
Each value field is a structure containing a contour index (an integer indicating which contour the observation belongs to), a frequency value in Hz, and a boolean indicating whether the value is voiced. The confidence field is unconstrained.
Example
time
duration
value
confidence
0.0000
0.0000
index: 0
frequency: 442.1
voiced: True
null
0.0058
0.0000
index: 0
frequency: 457.8
voiced: False
null
2.5490
0.0000
index: 1
frequency: 89.4
voiced: True
null
2.5548
0.0000
index: 1
frequency: 90.0
voiced: True
null
note_hz
Note events with (non-negative) frequencies measured in Hz.
time
duration
value
confidence
[sec]
[sec]
number
–
Each value field gives the frequency of the note in Hz.
Example
time
duration
value
confidence
12.34
0.287
189.9
null
2.896
3.000
74.0
null
10.12
0.500
440.0
null
note_midi
Note events with pitches measured in (fractional) MIDI note numbers.
time
duration
value
confidence
[sec]
[sec]
number
–
Each value field gives the pitch of the note in MIDI note numbers.
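For readers moving between note_hz and note_midi, the standard equal-tempered conversion is sketched below. It assumes the common reference of A4 = 440 Hz = MIDI note 69, which this document does not itself specify; the function names are hypothetical.

```python
import math


def hz_to_midi(freq_hz, a4_hz=440.0):
    """Convert a positive frequency in Hz to a fractional MIDI note number."""
    return 69.0 + 12.0 * math.log2(freq_hz / a4_hz)


def midi_to_hz(midi, a4_hz=440.0):
    """Convert a (fractional) MIDI note number to a frequency in Hz."""
    return a4_hz * 2.0 ** ((midi - 69.0) / 12.0)
```

Note that `hz_to_midi` is undefined for a frequency of zero, so such values need separate handling.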
Example
time
duration
value
confidence
12.34
0.287
52.0
null
2.896
3.000
20.7
null
10.12
0.500
42.0
null
pitch_class
Pitch measurements in (tonic, pitch class) format.
time
duration
value
confidence
[sec]
[sec]
tonic
pitch class
–
Each value field is a structure containing a tonic (note string, e.g., "A#" or "D")
and a pitch class (pitch), given as an integer scale degree. The confidence field is unconstrained.
Example
time
duration
value
confidence
0.000
30.0
tonic: C
pitch: 0
null
0.000
30.0
tonic: C
pitch: 4
null
0.000
30.0
tonic: C
pitch: 7
null
30.00
35.0
tonic: G
pitch: 0
null
pitch_hz
Warning
Deprecated, use pitch_contour.
Pitch measurements in Hertz (Hz). Pitch (a subjective sensation) is represented as fundamental frequency (a physical quantity), a.k.a. “f0”.
time
duration
value
confidence
[sec]
–
number
–
The time field represents the instant at which the pitch f0 was
estimated. By convention, this (usually) represents the center time of the
analysis frame. Note that this is different from pitch_midi and pitch_class,
where time represents the onset time. As a consequence, the duration
field is undefined and should be ignored. The value field is a number
representing the f0 in Hz. By convention, values that are equal to or less than
zero are used to represent silence (no pitch). Some algorithms (e.g. melody
extraction algorithms that adhere to the MIREX convention) use negative f0
values to represent the algorithm’s pitch estimate for frames where it thinks
there is no active pitch (e.g. no melody), to allow the independent evaluation
of pitch activation detection (a.k.a. “voicing detection”) and pitch frequency
estimation. The confidence field is unconstrained.
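The sign convention described above can be decoded as in the following sketch, which separates the frequency estimate from the voicing decision. The function `decode_f0` is a hypothetical helper, not part of any library.

```python
def decode_f0(value):
    """Split a signed f0 value into (frequency_hz, voiced)."""
    if value > 0:
        return value, True
    # zero: silence with no estimate; negative: unvoiced frame with an estimate
    return abs(value), False


decoded = [decode_f0(v) for v in (300.0, 0.0, -280.0)]
```

Keeping the magnitude of negative values lets voicing detection and frequency estimation be evaluated independently.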
Example
time
duration
value
confidence
0.000
0.000
300.00
null
0.010
0.000
305.00
null
0.020
0.000
310.00
null
0.030
0.000
0.00
null
0.040
0.000
-280.00
null
0.050
0.000
-290.00
null
pitch_midi
Warning
Deprecated, use note_midi or pitch_contour.
Pitch measurements in (fractional) MIDI note number notation.
time
duration
value
confidence
[sec]
[sec]
number
–
The value field is a number representing the pitch in MIDI notation.
Numbers can be negative (for notes below C-1) or fractional.
Example
time
duration
value
confidence
0.000
30.000
24
null
0.000
30.000
43.02
null
15.00
45.000
26
null
Segment
segment_open
Structural segmentation with an open vocabulary of segment labels.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field contains string descriptors for each segment, e.g., “verse” or
“bridge”.
Example
time
duration
value
confidence
0.000
20.000
intro
null
20.00
30.000
verse
null
30.00
50.000
refrain
null
50.00
70.000
verse (alternate)
null
segment_salami_function
Segment annotations with functional labels from the SALAMI guidelines.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field must be one of the allowable strings in the SALAMI function
vocabulary.
Example
time
duration
value
confidence
0.000
20.000
applause
null
20.00
30.000
count-in
null
30.00
50.000
introduction
null
50.00
70.000
verse
null
segment_salami_upper
Segment annotations with SALAMI’s upper-case (large) label format.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field must be a string of the following format:
“silence” or “Silence”
One or more upper-case letters, followed by zero or more apostrophes
Example
time
duration
value
confidence
0.000
20.000
silence
null
20.00
30.000
A
null
30.00
50.000
B
null
50.00
70.000
A’
null
segment_salami_lower
Segment annotations with SALAMI’s lower-case (small) label format.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field must be a string of the following format:
“silence” or “Silence”
One or more lower-case letters, followed by zero or more apostrophes
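The upper-case and lower-case label formats can be expressed as regular expressions. This sketch assumes the apostrophe is the plain ASCII character; the pattern names are hypothetical.

```python
import re

# "silence"/"Silence", or letters of one case followed by zero or more apostrophes
SALAMI_UPPER = re.compile(r"^(?:silence|Silence|[A-Z]+'*)$")
SALAMI_LOWER = re.compile(r"^(?:silence|Silence|[a-z]+'*)$")


def is_upper_label(value):
    return SALAMI_UPPER.match(value) is not None


def is_lower_label(value):
    return SALAMI_LOWER.match(value) is not None
```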
Example
time
duration
value
confidence
0.000
20.000
silence
null
20.00
30.000
a
null
30.00
50.000
b
null
50.00
70.000
a’
null
segment_tut
Segment annotations using the TUT vocabulary.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is a string describing the function of the segment.
Example
time
duration
value
confidence
0.000
20.000
Intro
null
20.00
30.000
Verse
null
30.00
50.000
bridge
null
50.00
70.000
RefrainA
null
multi_segment
Multi-level structural segmentations.
time
duration
value
confidence
[sec]
[sec]
label : string
level : int >= 0
–
In a multi-level segmentation, the track is partitioned many times —
possibly recursively — which results in a collection of segmentations of varying degrees
of specificity. In the multi_segment namespace, all of the resulting segments are
collected together, and the level field is used to encode the segment’s corresponding
partition.
Level values must be non-negative, and ordered by increasing specificity. For example,
level==0 may correspond to a single segment spanning the entire track, and each
subsequent level value corresponds to a more refined segmentation.
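To recover the individual segmentations from the pooled observations, one can group by the level field, as in this sketch. It works on plain dictionaries with illustrative data; `segments_by_level` is a hypothetical helper.

```python
from collections import defaultdict


def segments_by_level(observations):
    """Group (time, duration, label) tuples by level, coarsest level first."""
    levels = defaultdict(list)
    for obs in observations:
        value = obs["value"]
        levels[value["level"]].append((obs["time"], obs["duration"], value["label"]))
    # sort levels by increasing specificity, and segments by start time
    return [sorted(levels[k]) for k in sorted(levels)]


observations = [
    {"time": 30.0, "duration": 30.0, "value": {"label": "C", "level": 1}},
    {"time": 0.0, "duration": 60.0, "value": {"label": "A", "level": 0}},
    {"time": 0.0, "duration": 30.0, "value": {"label": "B", "level": 1}},
]
hierarchy = segments_by_level(observations)
```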
Example
time
duration
value
confidence
0.000
60.000
label : A
level : 0
null
0.000
30.000
label : B
level : 1
null
30.00
60.000
label : C
level : 1
null
0.000
15.000
label : a
level : 2
null
15.00
30.000
label : b
level : 2
null
30.00
45.000
label : a
level : 2
null
45.00
60.000
label : c
level : 2
null
Tag
tag_cal10k
Tags from the CAL10K vocabulary.
time
duration
value
confidence
[sec]
[sec]
string
–
The value is constrained to a set of 1053 terms, spanning mood, instrumentation,
style, and genre.
Example
time
duration
value
confidence
0.000
30.000
“bop influences”
null
0.000
30.000
“bright beats”
null
0.000
30.000
“hip hop roots”
null
tag_cal500
Tags from the CAL500 vocabulary.
time
duration
value
confidence
[sec]
[sec]
string
–
The value is constrained to a set of 174 terms, spanning mood, instrumentation, and
genre.
Example
time
duration
value
confidence
0.000
30.000
“Genre-Best-Rock”
null
0.000
30.000
“Vocals-Monotone”
null
0.000
30.000
“Usage-At_work”
null
tag_gtzan
Genre classes from the GTZAN dataset.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to one of ten strings:
blues
classical
country
disco
hip-hop
jazz
metal
pop
reggae
rock
By convention, only one tag is applied per track in this namespace. This is not enforced by the schema.
Example
time
duration
value
confidence
0.000
30.000
“reggae”
null
tag_msd_tagtraum_cd1
Genre classes from the msd tagtraum cd1 dataset.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to one of 13 strings:
reggae
pop/rock
rnb
jazz
vocal
new age
latin
rap
country
international
blues
electronic
folk
By convention, one or two tags per track are possible in this namespace.
The sum of the confidence values should equal 1.0.
This is not enforced by the schema.
Example
time
duration
value
confidence
0.000
0.000
“reggae”
1.0
tag_msd_tagtraum_cd2
Genre classes from the msd tagtraum cd2 dataset.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to one of 15 strings:
reggae
latin
metal
rnb
jazz
punk
pop
new age
country
rap
rock
world
blues
electronic
folk
By convention, one or two tags per track are possible in this namespace.
The sum of the confidence values should equal 1.0.
This is not enforced by the schema.
Example
time
duration
value
confidence
0.000
0.000
“reggae”
0.6666667
0.000
0.000
“rock”
0.3333333
tag_medleydb_instruments
MedleyDB instrument source annotations.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to the set of instruments defined
in MedleyDB instrument taxonomy.
Example
time
duration
value
confidence
0.000
20.000
“darbuka”
null
0.000
20.000
“flute section”
null
0.000
20.000
“oud”
null
tag_open
Open vocabulary tags. This namespace is appropriate for unconstrained tag data, such as tags from Last.FM or Magnatagatune.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is unconstrained, and can contain any string value.
Example
time
duration
value
confidence
0.000
20.000
“rocking”
null
0.000
20.000
“rockin’”
null
0.000
20.000
“”
null
0.000
20.000
“favez^^^”
null
tag_audioset
Tags from the full AudioSet (v1) ontology.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to the vocabulary of the AudioSet ontology.
Example
time
duration
value
confidence
0.000
20.000
“Air brake”
null
5.000
25.000
“Yodeling”
null
9.000
35.000
“Steam whistle”
null
tag_audioset_genre
Tags from the musical genre subset of the AudioSet (v1) ontology.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to the 66 musical genres of the AudioSet-genre ontology.
Example
time
duration
value
confidence
0.000
20.000
“Oldschool jungle”
null
tag_audioset_instrument
Tags from the musical instrument subset of the AudioSet (v1) ontology.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to the 91 musical instruments
of the AudioSet-instruments ontology.
Example
time
duration
value
confidence
0.000
20.000
“Ukulele”
null
5.000
25.000
“Piano”
null
9.000
35.000
“Tuning fork”
null
tag_fma_genre
Tags from the Free Music Archive (FMA) 16-class genre taxonomy.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to the 16 genres
of the FMA data-set.
Example
time
duration
value
confidence
0.000
20.000
“Blues”
null
5.000
25.000
“Instrumental”
null
9.000
35.000
“Soul-RnB”
null
tag_fma_subgenre
Tags from the Free Music Archive (FMA) 163-class genre and sub-genre taxonomy.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to the 163 genres and sub-genres of the FMA data-set.
Example
time
duration
value
confidence
0.000
20.000
“Pop”
null
5.000
25.000
“Power-Pop”
null
9.000
35.000
“Nerdcore”
null
tag_urbansound
Sound classes from the UrbanSound dataset.
time
duration
value
confidence
[sec]
[sec]
string
–
The value field is constrained to one of ten strings:
air_conditioner
car_horn
children_playing
dog_bark
drilling
engine_idling
gun_shot
jackhammer
siren
street_music
Example
time
duration
value
confidence
0.000
30.000
“street_music”
null
Tempo
tempo
Tempo measurements in beats per minute (BPM).
time
duration
value
confidence
[sec]
[sec]
number
number
The value field is a non-negative number (floating point), indicating the tempo measurement.
The confidence field is a number in the range [0, 1], following the format used by MIREX [5].
[5] http://www.music-ir.org/mirex/wiki/2014:Audio_Tempo_Estimation
Example
time
duration
value
confidence
0.00
60.00
180.0
0.8
0.00
60.00
90.0
0.2
Note
MIREX requires that tempo measurements come in pairs, and that the confidence values sum to 1. This is not enforced at the schema level.
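Because the pairing convention is not enforced by the schema, a user-side check may be useful. The following is a minimal sketch over plain dictionaries; `check_mirex_tempo` and its tolerance parameter are hypothetical.

```python
import math


def check_mirex_tempo(observations, tol=1e-6):
    """Return True if there are exactly two tempi whose confidences sum to 1."""
    if len(observations) != 2:
        return False
    confidences = [obs["confidence"] for obs in observations]
    if any(c is None or not 0.0 <= c <= 1.0 for c in confidences):
        return False
    if any(obs["value"] < 0 for obs in observations):
        return False
    return math.isclose(sum(confidences), 1.0, abs_tol=tol)


pair = [
    {"time": 0.0, "duration": 60.0, "value": 180.0, "confidence": 0.8},
    {"time": 0.0, "duration": 60.0, "value": 90.0, "confidence": 0.2},
]
pair_ok = check_mirex_tempo(pair)
```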
Miscellaneous
Vector
Numerical vector data. This is useful for generic regression problems where the output is a vector of numbers.
time
duration
value
confidence
[sec]
[sec]
[array of numbers]
–
Each observation value must be an array of at least one number. Different observations may have different length arrays, so it is up to the user to verify that arrays have the desired length.
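Since array length is left to the user, a check along these lines could be run before stacking the vectors. This is a sketch with a hypothetical helper name, operating on plain dictionaries.

```python
def common_vector_length(observations):
    """Return the shared length of all value arrays, or raise ValueError."""
    lengths = {len(obs["value"]) for obs in observations}
    if len(lengths) != 1:
        raise ValueError("expected one common length, found: %s" % sorted(lengths))
    return lengths.pop()


observations = [
    {"time": 0.0, "duration": 1.0, "value": [0.1, 0.2, 0.3], "confidence": None},
    {"time": 1.0, "duration": 1.0, "value": [0.4, 0.5, 0.6], "confidence": None},
]
dimension = common_vector_length(observations)
```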
Blob
Arbitrary data blobs.
time
duration
value
confidence
[sec]
[sec]
–
–
This namespace can be used to encode arbitrary data. The value and confidence fields have no schema constraints, and may contain any structured (but serializable) data. This can be useful for storing complex output data that does not fit any particular task schema, such as regression targets or geolocation data.
It is strongly advised that the AnnotationMetadata for blobs be as explicit as possible.
Scaper
Structured representation for soundscapes synthesized by the Scaper package.
time
duration
value
confidence
[sec]
[sec]
label
source_file
source_time
event_time
event_duration
snr
time_stretch
pitch_shift
role
–
Each value field contains a dictionary with the following keys:
label: a string indicating the label of the sound source
source_file: a full path to the original sound source (on disk)
source_time: a non-negative number indicating the time offset within source_file of the sound
event_time: the start time of the event in the synthesized soundscape
event_duration: a strictly positive number indicating the duration of the event
snr: the signal-to-noise ratio (in LUFS) of the sound compared to the background
time_stretch: (optional) a strictly positive number indicating the amount of time-stretch applied to the source
pitch_shift: (optional) the amount of pitch-shift applied to the source
role: one of background or foreground
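The constraints on these keys can be checked as in the following sketch. It treats an optional key as one that may be absent or null, which is an assumption; `check_scaper_value` is a hypothetical helper and not part of the Scaper package.

```python
REQUIRED_KEYS = ("label", "source_file", "source_time", "event_time",
                 "event_duration", "snr", "role")


def check_scaper_value(value):
    """Return a list of problems with a scaper value (empty if none)."""
    missing = [k for k in REQUIRED_KEYS if k not in value]
    if missing:
        return ["missing key: " + k for k in missing]
    problems = []
    if value["source_time"] < 0:
        problems.append("source_time must be non-negative")
    if value["event_duration"] <= 0:
        problems.append("event_duration must be strictly positive")
    # time_stretch is optional: only check it when present and not null
    stretch = value.get("time_stretch")
    if stretch is not None and stretch <= 0:
        problems.append("time_stretch must be strictly positive")
    if value["role"] not in ("background", "foreground"):
        problems.append("role must be background or foreground")
    return problems


# Illustrative event; the path and label are made up for the example
event = {
    "label": "siren", "source_file": "/path/to/siren.wav", "source_time": 0.0,
    "event_time": 2.5, "event_duration": 3.0, "snr": 6.0, "role": "foreground",
}
event_problems = check_scaper_value(event)
```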