Namespace definitions¶
Beat¶
beat¶
Beat event markers with optional metrical position.
time duration value confidence [sec] [sec] [number] or [null] –
Each observation corresponds to a single beat event.
The value field can be a number (positive or negative, integer or floating point), indicating the metrical position of the observed beat within the bar.
If no metrical position is provided for the annotation, the value field will be null.
Example
time   duration  value  confidence
0.500  0.000     1      null
1.000  0.000     2      null
1.500  0.000     3      null
2.000  0.000     4      null
2.500  0.000     1      null
Note
duration is typically zero for beat events, but this is not enforced.
confidence is an unconstrained field for beat annotations, and may contain arbitrary data.
beat_position¶
Beat events with time signature information.
time duration value confidence [sec] [sec]
- position
- measure
- num_beats
- beat_units
–
Each observation corresponds to a single beat event.
The value field is a structure containing the following fields:
- position: the position of the beat within the measure. Can be any number greater than or equal to 1.
- measure: the index of the measure containing this beat. Can be any non-negative integer.
- num_beats: the number of beats per measure. Can be any strictly positive integer.
- beat_units: the note value for beats in this measure. Must be one of: 1, 2, 4, 8, 16, 32, 64, 128, 256.
All fields are required for each observation.
Example
time   duration  value                                                 confidence
0.500  0.000     position: 1, measure: 0, num_beats: 4, beat_units: 4  null
1.000  0.000     position: 2, measure: 0, num_beats: 4, beat_units: 4  null
1.500  0.000     position: 3, measure: 0, num_beats: 4, beat_units: 4  null
2.000  0.000     position: 4, measure: 0, num_beats: 4, beat_units: 4  null
2.500  0.000     position: 1, measure: 1, num_beats: 4, beat_units: 4  null
Note
duration is typically zero for beat events, but this is not enforced.
confidence is an unconstrained field for beat annotations, and may contain arbitrary data.
position should lie in the range [1, beat_units], but the upper bound is not enforced at the schema level.
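The field constraints listed for beat_position can be sketched as a small stand-alone check. This is only an illustration of the rules as stated above: the helper name check_beat_position and its error messages are hypothetical, and it does not reproduce how any library actually validates the schema.

```python
from numbers import Real

# Allowed note values for beat_units, as listed in the field description.
BEAT_UNITS = (1, 2, 4, 8, 16, 32, 64, 128, 256)
REQUIRED = ("position", "measure", "num_beats", "beat_units")


def _is_int(x):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, int) and not isinstance(x, bool)


def check_beat_position(value):
    """Return a list of problems in a beat_position value (empty if none).

    Hypothetical helper for illustration only.
    """
    missing = [key for key in REQUIRED if key not in value]
    if missing:
        return ["missing fields: " + ", ".join(missing)]

    problems = []
    position = value["position"]
    if isinstance(position, bool) or not isinstance(position, Real) or position < 1:
        problems.append("position must be a number >= 1")
    if not (_is_int(value["measure"]) and value["measure"] >= 0):
        problems.append("measure must be a non-negative integer")
    if not (_is_int(value["num_beats"]) and value["num_beats"] > 0):
        problems.append("num_beats must be a strictly positive integer")
    if not (_is_int(value["beat_units"]) and value["beat_units"] in BEAT_UNITS):
        problems.append("beat_units must be a power of two from 1 to 256")
    return problems
```

The upper bound on position is deliberately not checked here, mirroring the note that it is not enforced at the schema level.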
Chord¶
chord¶
Chord annotations described by an extended version of the grammar defined by Harte, et al. [1]
time duration value confidence [sec] [sec] string –
[1] Harte, Christopher, Mark B. Sandler, Samer A. Abdallah, and Emilia Gómez. “Symbolic Representation of Musical Chords: A Proposed Syntax for Text Annotations.” In ISMIR, vol. 5, pp. 66-71. 2005.
This namespace is similar to chord_harte, with the following modifications:
- Sharps and flats may not be mixed in a note symbol. For instance, A#b# is legal in chord_harte but not in chord. A### is legal in both.
- The following quality values have been added:
- sus2, 1, 5
- aug7
- 11, maj11, min11
- 13, maj13, min13
Example
time   duration  value      confidence
0.000  1.000     N          null
0.000  1.000     Bb:5       null
0.000  1.000     E:(*5)     null
0.000  1.000     E#:min9/9  null
0.000  1.000     G##:maj6   null
0.000  1.000     D:13/6     null
0.000  1.000     A:sus2     null
Note
confidence is an unconstrained field, and may contain arbitrary data.
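The difference in note symbols between chord and chord_harte can be illustrated with two regular expressions. These patterns are assumptions written for this sketch and cover only the root note, not qualities, extensions, or inversions; root_is_legal is a hypothetical helper.

```python
import re

# Assumed root-note patterns only; the rest of the chord grammar is not modeled.
HARTE_ROOT = re.compile(r"[A-G][#b]*")      # chord_harte: sharps and flats may mix
CHORD_ROOT = re.compile(r"[A-G](?:#*|b*)")  # chord: only sharps or only flats


def root_is_legal(note, namespace):
    """Check a note symbol against the assumed root pattern of a namespace."""
    pattern = CHORD_ROOT if namespace == "chord" else HARTE_ROOT
    return pattern.fullmatch(note) is not None
```

With these patterns, A#b# is accepted for chord_harte but rejected for chord, while A### is accepted for both, matching the examples given above.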
chord_harte¶
Chord annotations described according to the grammar defined by Harte, et al. [1]
time duration value confidence [sec] [sec] string –
Each observed value is a text representation of a chord annotation.
- N specifies a no-chord observation.
- Notes are annotated in the usual way: A-G followed by optional sharps (#) and flats (b).
- Chord qualities are denoted by abbreviated strings:
  - maj, min, dim, aug
  - maj7, min7, 7, dim7, hdim7, minmaj7
  - maj6, min6
  - 9, maj9, min9
  - sus4
- Inversions are specified by a slash (/) followed by the interval number, e.g., G/3.
- Extensions are denoted in parentheses, e.g., G(b11,13). Suppressed notes are indicated with an asterisk, e.g., G(*3).
A complete description of the chord grammar is provided in [1], table 1.
Example
time   duration  value      confidence
0.000  1.000     N          null
0.000  1.000     Bb         null
0.000  1.000     E:(*5)     null
0.000  1.000     E#:min9/9  null
0.000  1.000     G#b:maj6   null
0.000  1.000     D/6        null
0.000  1.000     A:sus4     null
Note
confidence
is an unconstrained field, and may contain arbitrary data.
chord_roman¶
Chord annotations in roman numeral format, as described by [2].
time duration value confidence [sec] [sec]
- tonic
- chord
–
The value field is a structure containing the following fields:
- tonic: (string) the tonic note of the chord, e.g., A or Gb.
- chord: (string) the scale degree of the chord in roman numerals (1–7), along with inversions, extensions, and qualities.
Scale degrees are encoded with optional leading sharps and flats, e.g., V, bV or #VII. Upper-case numerals indicate major, lower-case numerals indicate minor.
Qualities are encoded as one of the following symbols:
- o: diminished (triad)
- +: augmented (triad)
- s: suspension
- d: dominant (seventh)
- h: half-diminished (seventh)
- x: fully-diminished (seventh)
Inversions are encoded by arabic numerals, e.g., V6 for a first-inversion triad, V64 for second inversion.
Applied chords are encoded by a / followed by a roman numeral encoding of the scale degree, e.g., V7/IV.
[2] http://theory.esm.rochester.edu/rock_corpus/harmonic_analyses.html
Example
time   duration  value                 confidence
0.000  0.500     tonic: C, chord: I6   –
0.500  0.500     tonic: C, chord: bIV  –
1.000  0.500     tonic: C, chord: Vh7  –
Note
The grammar defined in [2] has been constrained to support only the quality symbols listed above.
confidence
is an unconstrained field, and may contain arbitrary data.
Key¶
key_mode¶
Key and optional mode (major/minor or Greek modes)
time duration value confidence [sec] [sec] string –
The value field is a string matching one of the three following patterns:
- N: no key
- Ab, A, A#, Bb, ... G#: tonic note, upper case
- tonic:MODE, where tonic is as described above, and MODE is one of: major, minor, ionian, dorian, phrygian, lydian, mixolydian, aeolian, locrian.
Example
time   duration  value      confidence
0.000  30.0      C:minor    null
30.0   5.00      N          null
35.0   15.0      C#:dorian  null
50.0   10.0      Eb         null
60.0   10.0      A:lydian   null
Note
confidence
is an unconstrained field, and may contain arbitrary data.
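The three key_mode patterns can be approximated with a single regular expression. This is a sketch: the assumption that a tonic is one letter A-G with at most one sharp or flat is an interpretation of the list above, and KEY_MODE and is_key_mode are hypothetical names.

```python
import re

MODES = ("major", "minor", "ionian", "dorian", "phrygian",
         "lydian", "mixolydian", "aeolian", "locrian")

# Assumption: a tonic is an upper-case letter A-G with at most one # or b.
KEY_MODE = re.compile(r"N|[A-G][#b]?(?::(?:%s))?" % "|".join(MODES))


def is_key_mode(value):
    """Return True if value looks like N, a tonic, or tonic:MODE."""
    return isinstance(value, str) and KEY_MODE.fullmatch(value) is not None
```

All five values in the example table pass this check, while lower-case tonics or unlisted modes do not.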
Lyrics¶
lyrics¶
Time-aligned lyrical annotations.
time duration value confidence [sec] [sec] string –
The required value
field can contain arbitrary text data, e.g., lyrics.
Example
time   duration  value                     confidence
0.500  4.000     “Row row row your boat”   null
4.500  2.000     “gently down the stream”  null
7.000  1.000     “merrily”                 null
8.000  1.000     “merrily”                 null
9.000  1.000     “merrily”                 null
10.00  1.000     “merrily”                 null
Note
confidence
is an unconstrained field, and may contain arbitrary data.
lyrics_bow¶
Time-aligned bag-of-words or bag-of-ngrams.
time duration value confidence [sec] [sec] array –
The required value field is an array, where each element is an array of [term, count].
The term here may be either a string (for simple bag-of-words) or an array of strings (for bag-of-ngrams).
Example
time   duration  value                                                      confidence
0.000  30.00     [‘row’, 3], [[‘row’, ‘row’], 2], [‘your’, 1], [‘boat’, 1]  null
Note
confidence
is an unconstrained field, and may contain arbitrary data.
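As an illustration of the [term, count] layout, the sketch below builds such an array from a line of text. The whitespace tokenization, lower-casing, and the helper name lyrics_to_bow are assumptions made for this example.

```python
from collections import Counter


def lyrics_to_bow(text, n=1):
    """Build a lyrics_bow-style value: a list of [term, count] pairs.

    For n == 1 each term is a string; for n > 1 each term is a list of
    n consecutive words. Lower-casing and whitespace splitting are assumptions.
    """
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    return [[gram[0] if n == 1 else list(gram), count]
            for gram, count in counts.items()]
```

Calling lyrics_to_bow("Row row row your boat") yields unigram counts, and passing n=2 yields bigram terms such as ['row', 'row'] with a count of 2; the two results can be concatenated to mix both kinds of term in one value.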
Mood¶
mood_thayer¶
Time-varying emotion measurements as ordered pairs of (valence, arousal)
time duration value confidence [sec] [sec] (valence, arousal) –
The value field is an ordered pair of numbers measuring the valence and arousal positions in the Thayer mood model [3].
[3] Thayer, Robert E. The biopsychology of mood and arousal. Oxford University Press, 1989.
Example
time   duration  value        confidence
0.500  0.250     (-0.5, 1.0)  null
0.750  0.250     (-0.3, 0.6)  null
1.000  0.750     (-0.1, 0.1)  null
1.750  0.500     (0.3, -0.5)  null
Note
confidence
is an unconstrained field, and may contain arbitrary data.
Onset¶
onset¶
Note onset event markers.
time duration value confidence [sec] [sec] – –
This namespace can be used to encode timing of arbitrary instantaneous events. Most commonly, this is applied to note onsets.
Example
time   duration  value  confidence
0.500  0.000     null   null
1.000  0.000     null   null
1.500  0.000     null   null
2.000  0.000     null   null
Note
duration is typically zero for instantaneous events, but this is not enforced.
value and confidence fields are unconstrained, and may contain arbitrary data.
Pattern¶
pattern_jku¶
Each note of the pattern contains (pattern_id, midi_pitch, occurrence_id, morph_pitch, staff), following the format described in [4].
time duration value confidence [sec] [sec]
- pattern_id
- midi_pitch
- occurrence_id
- morph_pitch
- staff
–
[4] Collins T., Discovery of Repeated Themes & Sections, Music Information Retrieval Evaluation eXchange (MIReX), 2013 (Accessed on July 7th 2015). Available here.
Each value field contains a dictionary with the following keys:
- pattern_id: The integer that identifies the current pattern, starting from 1.
- midi_pitch: The float representing the midi pitch.
- occurrence_id: The integer that identifies the current occurrence, starting from 1.
- morph_pitch: The float representing the morphological pitch.
- staff: The integer representing the staff where the current note of the pattern is found, starting from 0.
Example
time   duration  value                                                                      confidence
62.86  0.09      pattern_id: 1, midi_pitch: 50, occurrence_id: 1, morph_pitch: 54, staff: 1  null
62.86  0.36      pattern_id: 1, midi_pitch: 77, occurrence_id: 1, morph_pitch: 70, staff: 0  null
36.34  0.09      pattern_id: 1, midi_pitch: 71, occurrence_id: 2, morph_pitch: 66, staff: 0  null
36.43  0.36      pattern_id: 1, midi_pitch: 69, occurrence_id: 2, morph_pitch: 65, staff: 0  null
Pitch¶
pitch_contour¶
Pitch contours in the format (index, frequency, voicing).
time duration value confidence [sec] [–]
- index
- frequency
- voiced
–
Each value field is a structure containing a contour index (an integer indicating which contour the observation belongs to), a frequency value in Hz, and a boolean, voiced, indicating whether the value is voiced. The confidence field is unconstrained.
Example
time    duration  value                                      confidence
0.0000  0.0000    index: 0, frequency: 442.1, voiced: True   null
0.0058  0.0000    index: 0, frequency: 457.8, voiced: False  null
2.5490  0.0000    index: 1, frequency: 89.4, voiced: True    null
2.5548  0.0000    index: 1, frequency: 90.0, voiced: True    null
note_hz¶
Note events with (non-negative) frequencies measured in Hz.
time duration value confidence [sec] [sec]
- number
–
Each value
field gives the frequency of the note in Hz.
Example
time   duration  value  confidence
12.34  0.287     189.9  null
2.896  3.000     74.0   null
10.12  0.500     440.0  null
note_midi¶
Note events with pitches measured in (fractional) MIDI note numbers.
time duration value confidence [sec] [sec]
- number
–
Each value
field gives the pitch of the note in MIDI note numbers.
Example
time   duration  value  confidence
12.34  0.287     52.0   null
2.896  3.000     20.7   null
10.12  0.500     42.0   null
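To move between note_hz and note_midi values, the standard equal-tempered mapping can be used. The sketch below assumes A4 = 440 Hz at MIDI note 69; that reference is a common convention, not something the namespaces prescribe, and the function names are hypothetical.

```python
import math


def midi_to_hz(midi, a4_hz=440.0):
    """Convert a (fractional) MIDI note number to Hz, assuming A4 = MIDI 69."""
    return a4_hz * 2.0 ** ((midi - 69.0) / 12.0)


def hz_to_midi(hz, a4_hz=440.0):
    """Convert a strictly positive frequency in Hz to a fractional MIDI number."""
    return 69.0 + 12.0 * math.log2(hz / a4_hz)
```

hz_to_midi is undefined for a frequency of zero, so zero-valued note_hz observations would need separate handling.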
pitch_class¶
Pitch measurements in (tonic, pitch class) format.
time duration value confidence [sec] [sec]
- tonic
- pitch class
–
Each value field is a structure containing a tonic (note string, e.g., "A#" or "D") and a pitch class pitch as an integer scale degree. The confidence field is unconstrained.
Example
time   duration  value                confidence
0.000  30.0      tonic: C, pitch: 0   null
0.000  30.0      tonic: C, pitch: 4   null
0.000  30.0      tonic: C, pitch: 7   null
30.00  35.0      tonic: G, pitch: 0   null
pitch_hz¶
Warning
Deprecated, use pitch_contour.
Pitch measurements in Hertz (Hz). Pitch (a subjective sensation) is represented as fundamental frequency (a physical quantity), a.k.a. “f0”.
time duration value confidence [sec] –
- number
–
The time field represents the instantaneous time in which the pitch f0 was estimated. By convention, this (usually) represents the center time of the analysis frame. Note that this is different from pitch_midi and pitch_class, where time represents the onset time. As a consequence, the duration field is undefined and should be ignored. The value field is a number representing the f0 in Hz. By convention, values that are equal to or less than zero are used to represent silence (no pitch). Some algorithms (e.g. melody extraction algorithms that adhere to the MIREX convention) use negative f0 values to represent the algorithm’s pitch estimate for frames where it thinks there is no active pitch (e.g. no melody), to allow the independent evaluation of pitch activation detection (a.k.a. “voicing detection”) and pitch frequency estimation. The confidence field is unconstrained.
Example
time   duration  value    confidence
0.000  0.000     300.00   null
0.010  0.000     305.00   null
0.020  0.000     310.00   null
0.030  0.000     0.00     null
0.040  0.000     -280.00  null
0.050  0.000     -290.00  null
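The sign convention for pitch_hz values can be unpacked into a frequency and a voicing flag, which is also the shape used by pitch_contour. A minimal sketch, with decode_pitch_hz as a hypothetical helper name:

```python
def decode_pitch_hz(value):
    """Split a pitch_hz value into (frequency_hz, voiced).

    Follows the sign convention described above: values <= 0 are treated as
    unvoiced, and a negative value carries the pitch estimate as its magnitude.
    """
    voiced = value > 0
    frequency = abs(value)
    return frequency, voiced
```

Applied to the example table, the first three rows decode as voiced, the fourth as silence at 0 Hz, and the last two as unvoiced frames with estimates of 280 Hz and 290 Hz.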
pitch_midi¶
Warning
Deprecated, use note_midi or pitch_contour.
Pitch measurements in (fractional) MIDI note number notation.
time duration value confidence [sec] [sec] number –
The value field is a number representing the pitch in MIDI notation.
Numbers can be negative (for notes below C-1) or fractional.
Example
time   duration  value  confidence
0.000  30.000    24     null
0.000  30.000    43.02  null
15.00  45.000    26     null
Segment¶
segment_open¶
Structural segmentation with an open vocabulary of segment labels.
time duration value confidence [sec] [sec] string –
The value
field contains string descriptors for each segment, e.g., “verse” or
“bridge”.
Example
time   duration  value              confidence
0.000  20.000    intro              null
20.00  30.000    verse              null
30.00  50.000    refrain            null
50.00  70.000    verse (alternate)  null
segment_salami_function¶
Segment annotations with functional labels from the SALAMI guidelines.
time duration value confidence [sec] [sec] string –
The value
field must be one of the allowable strings in the SALAMI function
vocabulary.
Example
time   duration  value         confidence
0.000  20.000    applause      null
20.00  30.000    count-in      null
30.00  50.000    introduction  null
50.00  70.000    verse         null
segment_salami_upper¶
Segment annotations with SALAMI’s upper-case (large) label format.
time duration value confidence [sec] [sec] string –
The value
field must be a string of the following format:
- “silence” or “Silence”
- One or more upper-case letters, followed by zero or more apostrophes
Example
time   duration  value    confidence
0.000  20.000    silence  null
20.00  30.000    A        null
30.00  50.000    B        null
50.00  70.000    A’       null
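The label format above, together with its lower-case counterpart in segment_salami_lower, can be expressed as regular expressions. This is a sketch under the assumption that apostrophes are plain ASCII ' characters; the pattern and function names are hypothetical.

```python
import re

# Assumption: apostrophes are plain ASCII ' characters.
SALAMI_UPPER = re.compile(r"[Ss]ilence|[A-Z]+'*")
SALAMI_LOWER = re.compile(r"[Ss]ilence|[a-z]+'*")


def is_salami_label(value, case="upper"):
    """Check a label against the upper- or lower-case SALAMI format."""
    pattern = SALAMI_UPPER if case == "upper" else SALAMI_LOWER
    return pattern.fullmatch(value) is not None
```

Labels containing typographic apostrophes would need to be normalized before this check.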
segment_salami_lower¶
Segment annotations with SALAMI’s lower-case (small) label format.
time duration value confidence [sec] [sec] string –
The value
field must be a string of the following format:
- “silence” or “Silence”
- One or more lower-case letters, followed by zero or more apostrophes
Example
time   duration  value    confidence
0.000  20.000    silence  null
20.00  30.000    a        null
30.00  50.000    b        null
50.00  70.000    a’       null
segment_tut¶
Segment annotations using the TUT vocabulary.
time duration value confidence [sec] [sec] string –
The value
field is a string describing the function of the segment.
Example
time   duration  value     confidence
0.000  20.000    Intro     null
20.00  30.000    Verse     null
30.00  50.000    bridge    null
50.00  70.000    RefrainA  null
multi_segment¶
Multi-level structural segmentations.
time duration value confidence [sec] [sec]
- label : string
- level : int >= 0
–
In a multi-level segmentation, the track is partitioned many times (possibly recursively), which results in a collection of segmentations of varying degrees of specificity. In the multi_segment namespace, all of the resulting segments are collected together, and the level field is used to encode the segment’s corresponding partition.
Level values must be non-negative, and ordered by increasing specificity. For example, level==0 may correspond to a single segment spanning the entire track, and each subsequent level value corresponds to a more refined segmentation.
Example
time   duration  value                confidence
0.000  60.000    label: A, level: 0   null
0.000  30.000    label: B, level: 1   null
30.00  60.000    label: C, level: 1   null
0.000  15.000    label: a, level: 2   null
15.00  30.000    label: b, level: 2   null
30.00  45.000    label: a, level: 2   null
45.00  60.000    label: c, level: 2   null
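Because all levels are stored together, a consumer usually has to separate them again. The sketch below groups observations by their level field; the (time, duration, value) tuple layout and the name segments_by_level are assumptions made for this illustration.

```python
from collections import defaultdict


def segments_by_level(observations):
    """Group (time, duration, value) observations by value['level'].

    Returns a dict mapping each level to a list of (start, end, label)
    tuples sorted by start time. The tuple layout is an assumption.
    """
    levels = defaultdict(list)
    for time, duration, value in observations:
        levels[value["level"]].append((time, time + duration, value["label"]))
    return {level: sorted(segs) for level, segs in sorted(levels.items())}
```

The returned dict is keyed in increasing level order, so iterating over it walks from the coarsest segmentation to the most refined.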
Tag¶
tag_cal10k¶
Tags from the CAL10K vocabulary.
time duration value confidence [sec] [sec] string –
The value
is constrained to a set of 1053 terms, spanning mood, instrumentation,
style, and genre.
Example
time   duration  value             confidence
0.000  30.000    “bop influences”  null
0.000  30.000    “bright beats”    null
0.000  30.000    “hip hop roots”   null
tag_cal500¶
Tags from the CAL500 vocabulary.
time duration value confidence [sec] [sec] string –
The value
is constrained to a set of 174 terms, spanning mood, instrumentation, and
genre.
Example
time   duration  value              confidence
0.000  30.000    “Genre-Best-Rock”  null
0.000  30.000    “Vocals-Monotone”  null
0.000  30.000    “Usage-At_work”    null
tag_gtzan¶
Genre classes from the GTZAN dataset.
time duration value confidence [sec] [sec] string –
The value
field is constrained to one of ten strings:
blues
classical
country
disco
hip-hop
jazz
metal
pop
reggae
rock
By convention, only one tag is applied per track in this namespace. This is not enforced by the schema.
Example
time duration value confidence 0.000 30.000 “reggae” null
tag_msd_tagtraum_cd1¶
Genre classes from the msd tagtraum cd1 dataset.
time duration value confidence [sec] [sec] string –
The value
field is constrained to one of 13 strings:
reggae
pop/rock
rnb
jazz
vocal
new age
latin
rap
country
international
blues
electronic
folk
By convention, one or two tags per track are possible in this namespace.
The sum of the confidence values should equal 1.0.
This is not enforced by the schema.
Example
time duration value confidence 0.000 0.000 “reggae” 1.0
tag_msd_tagtraum_cd2¶
Genre classes from the msd tagtraum cd2 dataset.
time duration value confidence [sec] [sec] string –
The value
field is constrained to one of 15 strings:
reggae
latin
metal
rnb
jazz
punk
pop
new age
country
rap
rock
world
blues
electronic
folk
By convention, one or two tags per track are possible in this namespace.
The sum of the confidence values should equal 1.0.
This is not enforced by the schema.
Example
time   duration  value     confidence
0.000  0.000     “reggae”  0.6666667
0.000  0.000     “rock”    0.3333333
tag_medleydb_instruments¶
MedleyDB instrument source annotations.
time duration value confidence [sec] [sec] string –
The value
field is constrained to the set of instruments defined
in MedleyDB instrument taxonomy.
Example
time duration value confidence 0.000 20.000 “darbuka” null 0.000 20.000 “flute section” null 0.000 20.000 “oud” null
tag_open¶
Open vocabulary tags. This namespace is appropriate for unconstrained tag data, such as tags from Last.FM or Magnatagatune.
time duration value confidence [sec] [sec] string –
The value
field is unconstrained, and can contain any string value.
Example
time duration value confidence 0.000 20.000 “rocking” null 0.000 20.000 “rockin’” null 0.000 20.000 “” null 0.000 20.000 “favez^^^” null
tag_audioset¶
Tags from the full AudioSet (v1) ontology.
time duration value confidence [sec] [sec] string –
The value
field is constrained to the vocabulary of the AudioSet ontology.
Example
time duration value confidence 0.000 20.000 “Air brake” null 5.000 25.000 “Yodeling” null 9.000 35.000 “Steam whistle” null
tag_audioset_genre¶
Tags from the musical genre subset of the AudioSet (v1) ontology.
time duration value confidence [sec] [sec] string –
The value
field is constrained to the 66 musical genres of the AudioSet-genre ontology.
Example
time duration value confidence 0.000 20.000 “Oldschool jungle” null
tag_audioset_instrument¶
Tags from the musical instrument subset of the AudioSet (v1) ontology.
time duration value confidence [sec] [sec] string –
The value
field is constrained to the 91 musical instruments
of the AudioSet-instruments ontology.
Example
time duration value confidence 0.000 20.000 “Ukulele” null 5.000 25.000 “Piano” null 9.000 35.000 “Tuning fork” null
tag_fma_genre¶
Tags from the Free Music Archive (FMA) 16-class genre taxonomy.
time duration value confidence [sec] [sec] string –
The value
field is constrained to the 16 genres
of the FMA data-set.
Example
time duration value confidence 0.000 20.000 “Blues” null 5.000 25.000 “Instrumental” null 9.000 35.000 “Soul-RnB” null
tag_fma_subgenre¶
Tags from the Free Music Archive (FMA) 163-class genre and sub-genre taxonomy.
time duration value confidence [sec] [sec] string –
The value
field is constrained to the 163 genres and sub-genres of the FMA data-set.
Example
time duration value confidence 0.000 20.000 “Pop” null 5.000 25.000 “Power-Pop” null 9.000 35.000 “Nerdcore” null
tag_urbansound¶
Sound classes from the UrbanSound dataset.
time duration value confidence [sec] [sec] string –
The value
field is constrained to one of ten strings:
air_conditioner
car_horn
children_playing
dog_bark
drilling
engine_idling
gun_shot
jackhammer
siren
street_music
Example
time duration value confidence 0.000 30.000 “street_music” null
Tempo¶
tempo¶
Tempo measurements in beats per minute (BPM).
time duration value confidence [sec] [sec] number number
The value field is a non-negative number (floating point), indicating the tempo measurement.
The confidence field is a number in the range [0, 1], following the format used by MIREX [5].
[5] http://www.music-ir.org/mirex/wiki/2014:Audio_Tempo_Estimation
Example
time  duration  value  confidence
0.00  60.00     180.0  0.8
0.00  60.00     90.0   0.2
Note
MIREX requires that tempo measurements come in pairs, and that the confidence values sum to 1. This is not enforced at the schema level.
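Since the pairing and sum-to-one conventions are not enforced by the schema, a caller may want to check them explicitly. A minimal sketch, assuming observations are given as (value, confidence) tuples; check_tempo_pair is a hypothetical helper and the tolerance is an arbitrary choice.

```python
import math


def check_tempo_pair(observations):
    """Check the MIREX convention for a list of (value, confidence) tuples.

    Requires exactly two observations, non-negative tempo values,
    confidences in [0, 1], and confidences that sum to 1.
    """
    if len(observations) != 2:
        return False
    values_ok = all(bpm >= 0 for bpm, _ in observations)
    conf_ok = all(0.0 <= conf <= 1.0 for _, conf in observations)
    total = sum(conf for _, conf in observations)
    return values_ok and conf_ok and math.isclose(total, 1.0, abs_tol=1e-6)
```

The example table above, with tempi of 180.0 and 90.0 BPM at confidences 0.8 and 0.2, satisfies this check.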
Miscellaneous¶
Vector¶
Numerical vector data. This is useful for generic regression problems where the output is a vector of numbers.
time duration value confidence [sec] [sec] [array of numbers] –
Each observation value must be an array of at least one number. Different observations may have different length arrays, so it is up to the user to verify that arrays have the desired length.
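One way to perform that length verification is sketched below. The helper common_vector_length is hypothetical and operates on a plain list of value arrays rather than on any particular annotation object.

```python
def common_vector_length(values):
    """Return the shared length of all vectors, or raise ValueError.

    Raises if the list is empty, if any vector is empty, or if the
    vectors do not all have the same length.
    """
    lengths = {len(vector) for vector in values}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError(
            "expected non-empty vectors of equal length, got lengths %s"
            % sorted(lengths)
        )
    return lengths.pop()
```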
Blob¶
Arbitrary data blobs.
time duration value confidence [sec] [sec] – –
This namespace can be used to encode arbitrary data. The value and confidence fields have no schema constraints, and may contain any structured (but serializable) data. This can be useful for storing complex output data that does not fit any particular task schema, such as regression targets or geolocation data.
It is strongly advised that the AnnotationMetadata for blobs be as explicit as possible.
Scaper¶
Structured representation for soundscapes synthesized by the Scaper package.
time duration value confidence [sec] [sec]
- label
- source_file
- source_time
- event_time
- event_duration
- snr
- time_stretch
- pitch_shift
- role
–
Each value field contains a dictionary with the following keys:
- label: a string indicating the label of the sound source
- source_file: a full path to the original sound source (on disk)
- source_time: a non-negative number indicating the time offset within source_file of the sound
- event_time: the start time of the event in the synthesized soundscape
- event_duration: a strictly positive number indicating the duration of the event
- snr: the signal-to-noise ratio (in LUFS) of the sound compared to the background
- time_stretch: (optional) a strictly positive number indicating the amount of time-stretch applied to the source
- pitch_shift: (optional) the amount of pitch-shift applied to the source
- role: one of background or foreground