Namespace definitions

Beat

beat

Beat event markers with optional metrical position.

time duration value confidence
[sec] [sec] [number] or [null]

Each observation corresponds to a single beat event.

The value field can be a number (positive or negative, integer or floating point), indicating the metrical position within the bar of the observed beat.

If no metrical position is provided for the annotation, the value field will be null.
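
As a rough illustration of the value field, the sketch below assigns metrical positions to a list of beat times. It assumes a fixed number of beats per bar and that the first beat falls on position 1. The helper name label_beats and the dictionary layout are hypothetical and are not part of the namespace definition.

```python
def label_beats(times, beats_per_bar=None):
    """Return beat observations as dicts with time, duration, value, confidence.

    If beats_per_bar is None, no metrical position is known and value is None
    (null in JSON). Otherwise positions cycle 1, 2, ..., beats_per_bar.
    """
    observations = []
    for i, t in enumerate(times):
        value = None if beats_per_bar is None else (i % beats_per_bar) + 1
        observations.append(
            {"time": t, "duration": 0.0, "value": value, "confidence": None}
        )
    return observations


beats = label_beats([0.5, 1.0, 1.5, 2.0, 2.5], beats_per_bar=4)
positions = [obs["value"] for obs in beats]  # [1, 2, 3, 4, 1]
```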

Example

time duration value confidence
0.500 0.000 1 null
1.000 0.000 2 null
1.500 0.000 3 null
2.000 0.000 4 null
2.500 0.000 1 null

Note

duration is typically zero for beat events, but this is not enforced.

confidence is an unconstrained field for beat annotations, and may contain arbitrary data.

beat_position

Beat events with time signature information.

time duration value confidence
[sec] [sec]
  • position
  • measure
  • num_beats
  • beat_units

Each observation corresponds to a single beat event.

The value field is a structure containing the following fields:

  • position : the position of the beat within the measure. Can be any number greater than or equal to 1.
  • measure : the index of the measure containing this beat. Can be any non-negative integer.
  • num_beats : the number of beats per measure. Can be any strictly positive integer.
  • beat_units : the note value for beats in this measure. Must be one of: 1, 2, 4, 8, 16, 32, 64, 128, 256.

All fields are required for each observation.
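
The field constraints above could be checked along the following lines. This is only a sketch of the rules as stated here, not the schema itself; the names check_beat_position, is_int, ALLOWED_BEAT_UNITS and REQUIRED_FIELDS are hypothetical. No upper bound on position is checked.

```python
from numbers import Real

ALLOWED_BEAT_UNITS = {1, 2, 4, 8, 16, 32, 64, 128, 256}
REQUIRED_FIELDS = ("position", "measure", "num_beats", "beat_units")


def is_int(x):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(x, int) and not isinstance(x, bool)


def check_beat_position(value):
    """Return a list of problems found in a beat_position value (empty if none)."""
    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in value]
    if problems:
        return problems
    pos = value["position"]
    if isinstance(pos, bool) or not isinstance(pos, Real) or pos < 1:
        problems.append("position must be a number >= 1")
    if not is_int(value["measure"]) or value["measure"] < 0:
        problems.append("measure must be a non-negative integer")
    if not is_int(value["num_beats"]) or value["num_beats"] < 1:
        problems.append("num_beats must be a strictly positive integer")
    if value["beat_units"] not in ALLOWED_BEAT_UNITS:
        problems.append("beat_units must be a power of two from 1 to 256")
    return problems
```

For a well-formed value such as position 1, measure 0, num_beats 4, beat_units 4, the returned list is empty.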

Example

time duration value confidence
0.500 0.000
  • position: 1
  • measure: 0
  • num_beats: 4
  • beat_units: 4
null
1.000 0.000
  • position: 2
  • measure: 0
  • num_beats: 4
  • beat_units: 4
null
1.500 0.000
  • position: 3
  • measure: 0
  • num_beats: 4
  • beat_units: 4
null
2.000 0.000
  • position: 4
  • measure: 0
  • num_beats: 4
  • beat_units: 4
null
2.500 0.000
  • position: 1
  • measure: 1
  • num_beats: 4
  • beat_units: 4
null

Note

duration is typically zero for beat events, but this is not enforced.

confidence is an unconstrained field for beat annotations, and may contain arbitrary data.

position should lie in the range [1, num_beats], but the upper bound is not enforced at the schema level.

Chord

chord

Chord annotations described by an extended version of the grammar defined by Harte, et al. [1]

time duration value confidence
[sec] [sec] string
[1] Harte, Christopher, Mark B. Sandler, Samer A. Abdallah, and Emilia Gómez. “Symbolic Representation of Musical Chords: A Proposed Syntax for Text Annotations.” In ISMIR, vol. 5, pp. 66-71. 2005.

This namespace is similar to chord_harte, with the following modifications:

  • Sharps and flats may not be mixed in a note symbol. For instance, A#b# is legal in chord_harte but not in chord. A### is legal in both.
  • The following quality values have been added:
    • sus2, 1, 5
    • aug7
    • 11, maj11, min11
    • 13, maj13, min13
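
The rule about mixing sharps and flats can be sketched with two regular expressions. These patterns are one reading of the rule as stated above, not the patterns used by the schema. The names HARTE_NOTE, CHORD_NOTE and legal_in are hypothetical.

```python
import re

# chord_harte: a note letter followed by any mixture of sharps and flats
HARTE_NOTE = re.compile(r"[A-G][#b]*")
# chord: a note letter followed by sharps only or flats only (or neither)
CHORD_NOTE = re.compile(r"[A-G](?:#*|b*)")


def legal_in(note):
    """Return (legal in chord_harte, legal in chord) for a note symbol."""
    return (
        HARTE_NOTE.fullmatch(note) is not None,
        CHORD_NOTE.fullmatch(note) is not None,
    )
```

With these patterns, legal_in("A#b#") gives (True, False) and legal_in("A###") gives (True, True), matching the two cases described above.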

Example

time duration value confidence
0.000 1.000 N null
0.000 1.000 Bb:5 null
0.000 1.000 E:(*5) null
0.000 1.000 E#:min9/9 null
0.000 1.000 G##:maj6 null
0.000 1.000 D:13/6 null
0.000 1.000 A:sus2 null

Note

confidence is an unconstrained field, and may contain arbitrary data.

chord_harte

Chord annotations described according to the grammar defined by Harte, et al. [1]

time duration value confidence
[sec] [sec] string

Each observed value is a text representation of a chord annotation.

  • N specifies a no chord observation
  • Notes are annotated in the usual way: A-G followed by optional sharps (#) and flats (b)
  • Chord qualities are denoted by abbreviated strings:
    • maj, min, dim, aug
    • maj7, min7, 7, dim7, hdim7, minmaj7
    • maj6, min6
    • 9, maj9, min9
    • sus4
  • Inversions are specified by a slash (/) followed by the interval number, e.g., G/3.
  • Extensions are denoted in parentheses, e.g., G:(b11,13). Suppressed notes are indicated with an asterisk, e.g., G:(*3)

A complete description of the chord grammar is provided in [1], table 1.
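
To make the structure of a chord label concrete, here is a deliberately simplified splitter. It assumes that the quality and the parenthesized list follow a colon after the root (as in E:(*5)), and it does not check that the quality or intervals are valid. The names CHORD_LABEL and split_chord are hypothetical, and the pattern is not the full grammar of [1].

```python
import re

CHORD_LABEL = re.compile(
    r"(?P<root>[A-G][#b]*)"                # root note with optional accidentals
    r"(?::(?P<quality>[a-z0-9]+)?"         # ':' then an optional quality string
    r"(?:\((?P<extensions>[^)]*)\))?)?"    # optional '(...)' list after the ':'
    r"(?:/(?P<bass>[#b]*\d+))?"            # optional '/' and bass interval
)


def split_chord(label):
    """Split a chord label into its parts; return None if it does not fit."""
    if label == "N":
        return {"root": None, "quality": None, "extensions": [], "bass": None}
    m = CHORD_LABEL.fullmatch(label)
    if m is None:
        return None
    parts = m.groupdict()
    ext = parts["extensions"]
    if ":" in label and parts["quality"] is None and ext is None:
        return None  # a ':' must be followed by a quality or a '(...)' list
    parts["extensions"] = ext.split(",") if ext else []
    return parts
```

For example, split_chord("E#:min9/9") separates the root E#, the quality min9 and the bass interval 9.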

Example

time duration value confidence
0.000 1.000 N null
0.000 1.000 Bb null
0.000 1.000 E:(*5) null
0.000 1.000 E#:min9/9 null
0.000 1.000 G#b:maj6 null
0.000 1.000 D/6 null
0.000 1.000 A:sus4 null

Note

confidence is an unconstrained field, and may contain arbitrary data.

chord_roman

Chord annotations in roman numeral format, as described by [2].

time duration value confidence
[sec] [sec]
  • tonic
  • chord

The value field is a structure containing the following fields:

  • tonic : (string) the tonic note of the chord, e.g., A or Gb.

  • chord : (string) the scale degree of the chord in roman numerals (1–7), along with inversions, extensions, and qualities.

    • Scale degrees are encoded with optional leading sharps and flats, e.g., V, bV or #VII. Upper-case numerals indicate major, lower-case numerals indicate minor.

    • Qualities are encoded as one of the following symbols:

      • o : diminished (triad)
      • + : augmented (triad)
      • s : suspension
      • d : dominant (seventh)
      • h : half-diminished (seventh)
      • x : fully-diminished (seventh)
    • Inversions are encoded by arabic numerals, e.g., V6 for a first-inversion triad, V64 for second inversion.

    • Applied chords are encoded by a / followed by a roman numeral encoding of the scale degree, e.g., V7/IV.

[2] http://theory.esm.rochester.edu/rock_corpus/harmonic_analyses.html

Example

time duration value confidence
0.000 0.500
  • tonic: C
  • chord: I6
0.500 0.500
  • tonic: C
  • chord: bIV
1.000 0.500
  • tonic: C
  • chord: Vh7

Note

The grammar defined in [2] has been constrained to support only the quality symbols listed above.

confidence is an unconstrained field, and may contain arbitrary data.

Key

key_mode

Key and optional mode (major/minor or Greek modes).

time duration value confidence
[sec] [sec] string

The value field is a string matching one of the three following patterns:

  • N : no key
  • Ab, A, A#, Bb, ... G# : tonic note, upper case
  • tonic:MODE where tonic is as described above, and MODE is one of: major, minor, ionian, dorian, phrygian, lydian, mixolydian, aeolian, locrian.
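
The three patterns can be expressed as a single regular expression. The sketch below assumes a tonic is a letter A to G with at most one accidental, which is how the list above reads. The names MODES, KEY_MODE and is_key_mode are hypothetical.

```python
import re

MODES = ("major", "minor", "ionian", "dorian", "phrygian",
         "lydian", "mixolydian", "aeolian", "locrian")

# N, or a tonic (A-G with at most one accidental), optionally followed by :MODE
KEY_MODE = re.compile(r"N|[A-G][#b]?(?::(?:%s))?" % "|".join(MODES))


def is_key_mode(value):
    """Return True if value matches one of the three key_mode patterns."""
    return KEY_MODE.fullmatch(value) is not None
```

Under this pattern, strings such as C:minor, Eb and N are accepted, while c:minor (lower-case tonic) and C:blues (unknown mode) are rejected.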

Example

time duration value confidence
0.000 30.0 C:minor null
30.0 5.00 N null
35.0 15.0 C#:dorian null
50.0 10.0 Eb null
60.0 10.0 A:lydian null

Note

confidence is an unconstrained field, and may contain arbitrary data.

Lyrics

lyrics

Time-aligned lyrical annotations.

time duration value confidence
[sec] [sec] string

The required value field can contain arbitrary text data, e.g., lyrics.

Example

time duration value confidence
0.500 4.000 “Row row row your boat” null
4.500 2.000 “gently down the stream” null
7.000 1.000 “merrily” null
8.000 1.000 “merrily” null
9.000 1.000 “merrily” null
10.00 1.000 “merrily” null

Note

confidence is an unconstrained field, and may contain arbitrary data.

lyrics_bow

Time-aligned bag-of-words or bag-of-ngrams.

time duration value confidence
[sec] [sec] array

The required value field is an array, where each element is an array of [term, count]. The term here may be either a string (for simple bag-of-words) or an array of strings (for bag-of-ngrams).
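
One way to build such an array from a line of text is sketched below. Lower-casing and whitespace tokenization are assumptions made for the example. The helper name bag_of_ngrams is hypothetical.

```python
from collections import Counter


def bag_of_ngrams(text, n=1):
    """Return a list of [term, count] pairs for the n-grams in text.

    For n == 1 each term is a string; for n > 1 each term is a list of strings.
    """
    words = text.lower().split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return [[g[0] if n == 1 else list(g), c] for g, c in grams.items()]


line = "Row row row your boat"
# Unigrams and bigrams can be mixed in a single value array
value = bag_of_ngrams(line, 1) + bag_of_ngrams(line, 2)
```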

Example

time duration value confidence
0.000 30.00
  • [‘row’, 3]
  • [ [‘row’, ‘row’], 2]
  • [‘your’, 1]
  • [‘boat’, 1]
null

Note

confidence is an unconstrained field, and may contain arbitrary data.

Mood

mood_thayer

Time-varying emotion measurements as ordered pairs of (valence, arousal)

time duration value confidence
[sec] [sec] (valence, arousal)

The value field is an ordered pair of numbers measuring the valence and arousal positions in the Thayer mood model [3].

[3] Thayer, Robert E. The biopsychology of mood and arousal. Oxford University Press, 1989.

Example

time duration value confidence
0.500 0.250 (-0.5, 1.0) null
0.750 0.250 (-0.3, 0.6) null
1.000 0.750 (-0.1, 0.1) null
1.750 0.500 (0.3, -0.5) null

Note

confidence is an unconstrained field, and may contain arbitrary data.

Onset

onset

Note onset event markers.

time duration value confidence
[sec] [sec]

This namespace can be used to encode timing of arbitrary instantaneous events. Most commonly, this is applied to note onsets.

Example

time duration value confidence
0.500 0.000 null null
1.000 0.000 null null
1.500 0.000 null null
2.000 0.000 null null

Note

duration is typically zero for instantaneous events, but this is not enforced.

value and confidence fields are unconstrained, and may contain arbitrary data.

Pattern

pattern_jku

Each note of the pattern contains (pattern_id, midi_pitch, occurrence_id, morph_pitch, staff), following the format described in [4].

time duration value confidence
[sec] [sec]
  • pattern_id
  • midi_pitch
  • occurrence_id
  • morph_pitch
  • staff
[4] Collins T., Discovery of Repeated Themes & Sections, Music Information Retrieval Evaluation eXchange (MIREX), 2013 (Accessed on July 7th 2015).

Each value field contains a dictionary with the following keys:

  • pattern_id: The integer that identifies the current pattern,
    starting from 1.
  • midi_pitch: The float representing the midi pitch.
  • occurrence_id: The integer that identifies the current occurrence,
    starting from 1.
  • morph_pitch: The float representing the morphological pitch.
  • staff: The integer representing the staff where the current note of the
    pattern is found, starting from 0.

Example

time duration value confidence
62.86 0.09
  • pattern_id: 1
  • midi_pitch: 50
  • occurrence_id: 1
  • morph_pitch: 54
  • staff: 1
null
62.86 0.36
  • pattern_id: 1
  • midi_pitch: 77
  • occurrence_id: 1
  • morph_pitch: 70
  • staff: 0
null
36.34 0.09
  • pattern_id: 1
  • midi_pitch: 71
  • occurrence_id: 2
  • morph_pitch: 66
  • staff: 0
null
36.43 0.36
  • pattern_id: 1
  • midi_pitch: 69
  • occurrence_id: 2
  • morph_pitch: 65
  • staff: 0
null

Pitch

pitch_contour

Pitch contours in the format (index, frequency, voicing).

time duration value confidence
[sec] [–]
  • index
  • frequency
  • voiced

Each value field is a structure containing a contour index (an integer indicating which contour the observation belongs to), a frequency value in Hz, and a boolean indicating whether the value is voiced. The confidence field is unconstrained.

Example

time duration value confidence
0.0000 0.0000
  • index: 0
  • frequency: 442.1
  • voiced: True
null
0.0058 0.0000
  • index: 0
  • frequency: 457.8
  • voiced: False
null
2.5490 0.0000
  • index: 1
  • frequency: 89.4
  • voiced: True
null
2.5548 0.0000
  • index: 1
  • frequency: 90.0
  • voiced: True
null

note_hz

Note events with (non-negative) frequencies measured in Hz.

time duration value confidence
[sec] [sec]
  • number

Each value field gives the frequency of the note in Hz.

Example

time duration value confidence
12.34 0.287 189.9 null
2.896 3.000 74.0 null
10.12 0.500 440.0 null

note_midi

Note events with pitches measured in (fractional) MIDI note numbers.

time duration value confidence
[sec] [sec]
  • number

Each value field gives the pitch of the note in MIDI note numbers.
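
For readers moving between note_hz and note_midi, the usual equal-temperament conversion is sketched below. The reference tuning (A4 = 440 Hz = MIDI note 69) is an assumption of the example and is not stated by either namespace. The function names are hypothetical.

```python
import math


def hz_to_midi(frequency, a4=440.0):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(frequency / a4)


def midi_to_hz(note, a4=440.0):
    """Convert a (fractional) MIDI note number to a frequency in Hz."""
    return a4 * 2.0 ** ((note - 69.0) / 12.0)
```

Under this convention, 440 Hz maps to note 69 and note 57 (one octave lower) maps to 220 Hz.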

Example

time duration value confidence
12.34 0.287 52.0 null
2.896 3.000 20.7 null
10.12 0.500 42.0 null

pitch_class

Pitch measurements in (tonic, pitch class) format.

time duration value confidence
[sec] [sec]
  • tonic
  • pitch

Each value field is a structure containing a tonic (note string, e.g., "A#" or "D") and a pitch class (pitch), given as an integer scale degree. The confidence field is unconstrained.

Example

time duration value confidence
0.000 30.0
  • tonic: C
  • pitch: 0
null
0.000 30.0
  • tonic: C
  • pitch: 4
null
0.000 30.0
  • tonic: C
  • pitch: 7
null
30.00 35.0
  • tonic: G
  • pitch: 0
null

pitch_hz

Warning

Deprecated, use pitch_contour.

Pitch measurements in Hertz (Hz). Pitch (a subjective sensation) is represented as fundamental frequency (a physical quantity), a.k.a. “f0”.

time duration value confidence
[sec]
  • number

The time field represents the instant at which the pitch (f0) was estimated. By convention, this usually corresponds to the center of the analysis frame. Note that this differs from pitch_midi and pitch_class, where time represents the onset time. As a consequence, the duration field is undefined and should be ignored.

The value field is a number representing the f0 in Hz. By convention, values less than or equal to zero represent silence (no pitch). Some algorithms (e.g., melody extraction algorithms that adhere to the MIREX convention) use negative f0 values to report the algorithm’s pitch estimate for frames where it detects no active pitch (e.g., no melody). This allows pitch activation detection (a.k.a. “voicing detection”) and pitch frequency estimation to be evaluated independently.

The confidence field is unconstrained.
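
The sign convention for the value field can be decoded as in the sketch below. The function name interpret_f0 and the (frequency, voiced) return shape are hypothetical choices for the example.

```python
def interpret_f0(value):
    """Split a pitch_hz value into (frequency in Hz or None, voiced flag).

    Positive values are voiced pitch estimates. Zero means silence. Negative
    values carry an estimate (the magnitude) for a frame judged unvoiced.
    """
    if value > 0:
        return value, True
    if value < 0:
        return -value, False
    return None, False


frames = [300.0, 305.0, 310.0, 0.0, -280.0, -290.0]
decoded = [interpret_f0(v) for v in frames]
```

Keeping the magnitude of negative values lets frequency accuracy be scored separately from the voicing decision.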

Example

time duration value confidence
0.000 0.000 300.00 null
0.010 0.000 305.00 null
0.020 0.000 310.00 null
0.030 0.000 0.00 null
0.040 0.000 -280.00 null
0.050 0.000 -290.00 null

pitch_midi

Warning

Deprecated, use note_midi or pitch_contour.

Pitch measurements in (fractional) MIDI note number notation.

time duration value confidence
[sec] [sec] number

The value field is a number representing the pitch in MIDI notation. Numbers can be negative (for notes below C-1) or fractional.

Example

time duration value confidence
0.000 30.000 24 null
0.000 30.000 43.02 null
15.00 45.000 26 null

Segment

segment_open

Structural segmentation with an open vocabulary of segment labels.

time duration value confidence
[sec] [sec] string

The value field contains string descriptors for each segment, e.g., “verse” or “bridge”.

Example

time duration value confidence
0.000 20.000 intro null
20.00 30.000 verse null
30.00 50.000 refrain null
50.00 70.000 verse (alternate) null

segment_salami_function

Segment annotations with functional labels from the SALAMI guidelines.

time duration value confidence
[sec] [sec] string

The value field must be one of the allowable strings in the SALAMI function vocabulary.

Example

time duration value confidence
0.000 20.000 applause null
20.00 30.000 count-in null
30.00 50.000 introduction null
50.00 70.000 verse null

segment_salami_upper

Segment annotations with SALAMI’s upper-case (large) label format.

time duration value confidence
[sec] [sec] string

The value field must be a string of the following format:

  • “silence” or “Silence”
  • One or more upper-case letters, followed by zero or more apostrophes

Example

time duration value confidence
0.000 20.000 silence null
20.00 30.000 A null
30.00 50.000 B null
50.00 70.000 A’ null

segment_salami_lower

Segment annotations with SALAMI’s lower-case (small) label format.

time duration value confidence
[sec] [sec] string

The value field must be a string of the following format:

  • “silence” or “Silence”
  • One or more lower-case letters, followed by zero or more apostrophes
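
The upper-case and lower-case SALAMI label formats differ only in letter case, so both can be sketched with a pair of regular expressions. The example assumes a plain ASCII apostrophe. The names SALAMI_UPPER, SALAMI_LOWER and is_salami_label are hypothetical.

```python
import re

# "silence"/"Silence", or letters of one case followed by ASCII apostrophes
SALAMI_UPPER = re.compile(r"[sS]ilence|[A-Z]+'*")
SALAMI_LOWER = re.compile(r"[sS]ilence|[a-z]+'*")


def is_salami_label(label, case="upper"):
    """Return True if label fits the upper-case or lower-case label format."""
    pattern = SALAMI_UPPER if case == "upper" else SALAMI_LOWER
    return pattern.fullmatch(label) is not None
```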

Example

time duration value confidence
0.000 20.000 silence null
20.00 30.000 a null
30.00 50.000 b null
50.00 70.000 a’ null

segment_tut

Segment annotations using the TUT vocabulary.

time duration value confidence
[sec] [sec] string

The value field is a string describing the function of the segment.

Example

time duration value confidence
0.000 20.000 Intro null
20.00 30.000 Verse null
30.00 50.000 bridge null
50.00 70.000 RefrainA null

multi_segment

Multi-level structural segmentations.

time duration value confidence
[sec] [sec]
  • label : string
  • level : int >= 0

In a multi-level segmentation, the track is partitioned many times — possibly recursively — which results in a collection of segmentations of varying degrees of specificity. In the multi_segment namespace, all of the resulting segments are collected together, and the level field is used to encode the segment’s corresponding partition.

Level values must be non-negative, and ordered by increasing specificity. For example, level==0 may correspond to a single segment spanning the entire track, and each subsequent level value corresponds to a more refined segmentation.
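
To recover the individual segmentations from a flat list of observations, the segments can be grouped by level. The sketch below assumes observations are given as (time, duration, value) tuples. The helper name segments_by_level is hypothetical.

```python
from collections import defaultdict


def segments_by_level(observations):
    """Group (time, duration, value) observations by value['level'].

    Returns a dict mapping level -> list of (start, end, label), sorted by start.
    """
    levels = defaultdict(list)
    for time, duration, value in observations:
        levels[value["level"]].append((time, time + duration, value["label"]))
    return {lvl: sorted(segs) for lvl, segs in sorted(levels.items())}


obs = [
    (0.0, 60.0, {"label": "A", "level": 0}),
    (0.0, 30.0, {"label": "B", "level": 1}),
    (30.0, 30.0, {"label": "C", "level": 1}),
]
hierarchy = segments_by_level(obs)
```

Here level 0 holds one segment spanning the whole track, and level 1 holds its two halves.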

Example

time duration value confidence
0.000 60.000
  • label : A
  • level : 0
null
0.000 30.000
  • label : B
  • level : 1
null
30.00 30.000
  • label : C
  • level : 1
null
0.000 15.000
  • label : a
  • level : 2
null
15.00 15.000
  • label : b
  • level : 2
null
30.00 15.000
  • label : a
  • level : 2
null
45.00 15.000
  • label : c
  • level : 2
null

Tag

tag_cal10k

Tags from the CAL10K vocabulary.

time duration value confidence
[sec] [sec] string

The value is constrained to a set of 1053 terms, spanning mood, instrumentation, style, and genre.

Example

time duration value confidence
0.000 30.000 “bop influences” null
0.000 30.000 “bright beats” null
0.000 30.000 “hip hop roots” null

tag_cal500

Tags from the CAL500 vocabulary.

time duration value confidence
[sec] [sec] string

The value is constrained to a set of 174 terms, spanning mood, instrumentation, and genre.

Example

time duration value confidence
0.000 30.000 “Genre-Best-Rock” null
0.000 30.000 “Vocals-Monotone” null
0.000 30.000 “Usage-At_work” null

tag_gtzan

Genre classes from the GTZAN dataset.

time duration value confidence
[sec] [sec] string

The value field is constrained to one of ten strings:

  • blues
  • classical
  • country
  • disco
  • hip-hop
  • jazz
  • metal
  • pop
  • reggae
  • rock

By convention, only one tag is applied per track in this namespace. This is not enforced by the schema.

Example

time duration value confidence
0.000 30.000 “reggae” null

tag_msd_tagtraum_cd1

Genre classes from the msd tagtraum cd1 dataset.

time duration value confidence
[sec] [sec] string

The value field is constrained to one of 13 strings:

  • reggae
  • pop/rock
  • rnb
  • jazz
  • vocal
  • new age
  • latin
  • rap
  • country
  • international
  • blues
  • electronic
  • folk

By convention, one or two tags per track are possible in this namespace. The sum of the confidence values should equal 1.0. This is not enforced by the schema.

Example

time duration value confidence
0.000 0.000 “reggae” 1.0

tag_msd_tagtraum_cd2

Genre classes from the msd tagtraum cd2 dataset.

time duration value confidence
[sec] [sec] string

The value field is constrained to one of 15 strings:

  • reggae
  • latin
  • metal
  • rnb
  • jazz
  • punk
  • pop
  • new age
  • country
  • rap
  • rock
  • world
  • blues
  • electronic
  • folk

By convention, one or two tags per track are possible in this namespace. The sum of the confidence values should equal 1.0. This is not enforced by the schema.
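
Because these conventions are not enforced by the schema, a caller may want to check them directly. The sketch below takes the (value, confidence) pairs for one track. The function name check_genre_tags and the tolerance are hypothetical choices.

```python
import math


def check_genre_tags(observations, tol=1e-6):
    """Check the one-or-two-tags and confidences-sum-to-1 conventions.

    observations is a list of (value, confidence) pairs for a single track.
    """
    if not 1 <= len(observations) <= 2:
        return False
    total = sum(conf for _, conf in observations)
    return math.isclose(total, 1.0, abs_tol=tol)
```

A small absolute tolerance is used because confidences such as 0.6666667 and 0.3333333 are rounded values.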

Example

time duration value confidence
0.000 0.000 “reggae” 0.6666667
0.000 0.000 “rock” 0.3333333

tag_medleydb_instruments

MedleyDB instrument source annotations.

time duration value confidence
[sec] [sec] string

The value field is constrained to the set of instruments defined in the MedleyDB instrument taxonomy.

Example

time duration value confidence
0.000 20.000 “darbuka” null
0.000 20.000 “flute section” null
0.000 20.000 “oud” null

tag_open

Open vocabulary tags. This namespace is appropriate for unconstrained tag data, such as tags from Last.FM or Magnatagatune.

time duration value confidence
[sec] [sec] string

The value field is unconstrained, and can contain any string value.

Example

time duration value confidence
0.000 20.000 “rocking” null
0.000 20.000 “rockin’” null
0.000 20.000 “” null
0.000 20.000 “favez^^^” null

tag_audioset

Tags from the full AudioSet (v1) ontology.

time duration value confidence
[sec] [sec] string

The value field is constrained to the vocabulary of the AudioSet ontology.

Example

time duration value confidence
0.000 20.000 “Air brake” null
5.000 25.000 “Yodeling” null
9.000 35.000 “Steam whistle” null

tag_audioset_genre

Tags from the musical genre subset of the AudioSet (v1) ontology.

time duration value confidence
[sec] [sec] string

The value field is constrained to the 66 musical genres of the AudioSet-genre ontology.

Example

time duration value confidence
0.000 20.000 “Oldschool jungle” null

tag_audioset_instrument

Tags from the musical instrument subset of the AudioSet (v1) ontology.

time duration value confidence
[sec] [sec] string

The value field is constrained to the 91 musical instruments of the AudioSet-instruments ontology.

Example

time duration value confidence
0.000 20.000 “Ukulele” null
5.000 25.000 “Piano” null
9.000 35.000 “Tuning fork” null

tag_fma_genre

Tags from the Free Music Archive (FMA) 16-class genre taxonomy.

time duration value confidence
[sec] [sec] string

The value field is constrained to the 16 genres of the FMA data-set.

Example

time duration value confidence
0.000 20.000 “Blues” null
5.000 25.000 “Instrumental” null
9.000 35.000 “Soul-RnB” null

tag_fma_subgenre

Tags from the Free Music Archive (FMA) 163-class genre and sub-genre taxonomy.

time duration value confidence
[sec] [sec] string

The value field is constrained to the 163 genres and sub-genres of the FMA data-set.

Example

time duration value confidence
0.000 20.000 “Pop” null
5.000 25.000 “Power-Pop” null
9.000 35.000 “Nerdcore” null

tag_urbansound

Sound classes from the UrbanSound dataset.

time duration value confidence
[sec] [sec] string

The value field is constrained to one of ten strings:

  • air_conditioner
  • car_horn
  • children_playing
  • dog_bark
  • drilling
  • engine_idling
  • gun_shot
  • jackhammer
  • siren
  • street_music

Example

time duration value confidence
0.000 30.000 “street_music” null

Tempo

tempo

Tempo measurements in beats per minute (BPM).

time duration value confidence
[sec] [sec] number number

The value field is a non-negative number (floating point) indicating the tempo measurement. The confidence field is a number in the range [0, 1], following the format used by MIREX [5].

[5] http://www.music-ir.org/mirex/wiki/2014:Audio_Tempo_Estimation

Example

time duration value confidence
0.00 60.00 180.0 0.8
0.00 60.00 90.0 0.2

Note

MIREX requires that tempo measurements come in pairs, and that the confidence values sum to 1. This is not enforced at the schema level.

Miscellaneous

Vector

Numerical vector data. This is useful for generic regression problems where the output is a vector of numbers.

time duration value confidence
[sec] [sec] [array of numbers]

Each observation value must be an array of at least one number. Different observations may have different length arrays, so it is up to the user to verify that arrays have the desired length.
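
A length check of the kind left to the user might look like the sketch below. The function name check_vector_lengths and its expected parameter are hypothetical.

```python
def check_vector_lengths(values, expected=None):
    """Return True if every value is a non-empty list of a common length.

    If expected is None, the length of the first value is used.
    """
    if not values:
        return True
    if expected is None:
        expected = len(values[0])
    return expected >= 1 and all(len(v) == expected for v in values)
```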

Blob

Arbitrary data blobs.

time duration value confidence
[sec] [sec]

This namespace can be used to encode arbitrary data. The value and confidence fields have no schema constraints, and may contain any structured (but serializable) data. This can be useful for storing complex output data that does not fit any particular task schema, such as regression targets or geolocation data.

It is strongly advised that the AnnotationMetadata for blobs be as explicit as possible.

Scaper

Structured representation for soundscapes synthesized by the Scaper package.

time duration value confidence
[sec] [sec]
  • label
  • source_file
  • source_time
  • event_time
  • event_duration
  • snr
  • time_stretch
  • pitch_shift
  • role

Each value field contains a dictionary with the following keys:

  • label: a string indicating the label of the sound source
  • source_file: a full path to the original sound source (on disk)
  • source_time: a non-negative number indicating the time offset within source_file of the sound
  • event_time: the start time of the event in the synthesized soundscape
  • event_duration: a strictly positive number indicating the duration of the event
  • snr: the signal-to-noise ratio (in LUFS) of the sound compared to the background
  • time_stretch: (optional) a strictly positive number indicating the amount of time-stretch applied to the source
  • pitch_shift: (optional) the amount of pitch-shift applied to the source
  • role: one of background or foreground
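
The constraints listed above can be checked with a small function such as the following. It is a sketch of the rules as described here, not the schema or the Scaper package's own validation. The names REQUIRED_KEYS and check_scaper_value are hypothetical.

```python
REQUIRED_KEYS = {"label", "source_file", "source_time", "event_time",
                 "event_duration", "snr", "role"}


def check_scaper_value(value):
    """Return a list of problems found in a scaper-style value dictionary."""
    missing = sorted(REQUIRED_KEYS - set(value))
    problems = [f"missing key: {k}" for k in missing]
    if value.get("source_time", 0) < 0:
        problems.append("source_time must be non-negative")
    if value.get("event_duration", 1) <= 0:
        problems.append("event_duration must be strictly positive")
    stretch = value.get("time_stretch")  # optional key
    if stretch is not None and stretch <= 0:
        problems.append("time_stretch must be strictly positive")
    if "role" in value and value["role"] not in ("background", "foreground"):
        problems.append("role must be background or foreground")
    return problems
```

An empty list means the value passed every check in the sketch.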