Weblog entry #91 for Steve
There exist at least two systems for looking up song details by a "disk-id":
The problem is they suck.
Their details are limited to artist + song. There is no flexible notion of genre.
They do not support:
- Lyrics
- Cover images.
- Albums which contain multiple disks
I wish to replace these systems with my own, which will correct these faults.
There are three problems:
- How do you identify a given audio CD-ROM?
- How do you store the data effectively?
- How do you allow clients to retrieve it?
I think the first one can be solved by:
request = sha1( audio track 1 ) . sha1( audio track 2 ) . ... . sha1( audio track N )
(Where the requests are delimited by some means. Perhaps the input will be XML?)
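A minimal sketch of the request string described above, in Python. The function names, the "." separator and the stand-in byte strings are all assumptions for illustration; real input would be the raw audio ripped from each track.

```python
import hashlib
from typing import Iterable

def track_digest(pcm_bytes: bytes) -> str:
    """Return the hex SHA-1 of one track's raw audio bytes."""
    return hashlib.sha1(pcm_bytes).hexdigest()

def disc_request(tracks: Iterable[bytes], sep: str = ".") -> str:
    """Join the per-track digests, in track order, into one request string."""
    return sep.join(track_digest(t) for t in tracks)

# Stand-in data (hypothetical): not real audio, just bytes to hash.
fake_tracks = [b"track one samples", b"track two samples"]
request = disc_request(fake_tracks)
print(request)
```

Each SHA-1 hex digest is 40 characters, so a disc with N tracks gives N digests plus N-1 separators.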
The second is just a matter of database design:
- Table: Songs
- Table: Collections (has N songs)
- Etc
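One possible shape for those tables, sketched with Python's built-in sqlite3 module. The column names, the link table and the find_collections helper are assumptions, not a settled design; the point is only that a lookup by a set of hashes reduces to a single query.

```python
import sqlite3

SCHEMA = """
CREATE TABLE songs (
    sha1   TEXT PRIMARY KEY,   -- hex digest of the track's audio
    title  TEXT NOT NULL,
    artist TEXT,
    genre  TEXT,
    lyrics TEXT
);
CREATE TABLE collections (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- Link table: one collection has N songs, in a fixed order.
CREATE TABLE collection_songs (
    collection_id INTEGER NOT NULL REFERENCES collections(id),
    position      INTEGER NOT NULL,
    song_sha1     TEXT NOT NULL REFERENCES songs(sha1),
    PRIMARY KEY (collection_id, position)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO songs (sha1, title, artist, genre) VALUES (?, ?, ?, ?)",
    ("xxxxx", "Test", "The Doors", "Foo"),
)
conn.execute("INSERT INTO collections (id, name) VALUES (1, 'The best of the doors')")
conn.execute("INSERT INTO collection_songs VALUES (1, 1, 'xxxxx')")

def find_collections(conn, hashes):
    """Return names of collections that contain every one of the given hashes."""
    marks = ",".join("?" for _ in hashes)
    rows = conn.execute(
        f"""SELECT c.name FROM collections c
            JOIN collection_songs cs ON cs.collection_id = c.id
            WHERE cs.song_sha1 IN ({marks})
            GROUP BY c.id
            HAVING COUNT(DISTINCT cs.song_sha1) = ?""",
        (*hashes, len(set(hashes))),
    ).fetchall()
    return [name for (name,) in rows]

print(find_collections(conn, ["xxxxx"]))
```

Keeping artist and genre on the songs table is what makes that information per-track rather than per-disk.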
Querying against the database with a given collection of hashes should be trivial. The hard part is giving back useful output. Perhaps:
<collection>
  <name>The best of the doors</name>
  <artist><name>The Doors</name></artist>
  <artist><name>The foo orchestra</name></artist>
  <track>
    <sha1>xxxxx</sha1>
    <md5>xxx</md5>
    <title>Test</title>
    <genre>Foo</genre>
    <length>22:22</length>
    <lyrics>
Once upon a time
Far far away
there was a cd-rom
It was cold
It was shiny
It was silver
    </lyrics>
  </track>
</collection>
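For what it's worth, a response along those lines can be produced with Python's standard xml.etree.ElementTree. This is only a sketch: it renders each field as a child element so the output is well-formed XML, and the element names and the collection_to_xml helper are assumptions rather than an established schema.

```python
import xml.etree.ElementTree as ET

def collection_to_xml(name, artists, tracks):
    """Serialise one collection to an XML string (element names are illustrative)."""
    root = ET.Element("collection")
    ET.SubElement(root, "name").text = name
    for artist in artists:
        node = ET.SubElement(root, "artist")
        ET.SubElement(node, "name").text = artist
    for track in tracks:
        node = ET.SubElement(root, "track")
        # Only emit the fields this track actually has.
        for field in ("sha1", "title", "genre", "length", "lyrics"):
            if field in track:
                ET.SubElement(node, field).text = track[field]
    return ET.tostring(root, encoding="unicode")

xml_text = collection_to_xml(
    "The best of the doors",
    ["The Doors", "The foo orchestra"],
    [{"sha1": "xxxxx", "title": "Test", "genre": "Foo", "length": "22:22"}],
)
print(xml_text)
```

Optional fields such as lyrics are simply left out when absent, so clients only need to handle the elements they find.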
Thoughts welcome. Pointers to already recognised schemas especially so.
Comments on this Entry
In the words of Obi-Wan, "there is another". Musicbrainz is the system used by Sound Juicer for CD lookups - it falls back on FreeDB if it doesn't have an entry.
Well spotted, thanks.
I think the biggest difference in my (proposed) scheme is that "genre", "artist", and other basic information becomes per-track rather than per-disk.
The only potential issue is the number of expected collisions in the track hashing, plus the fact that the hashes only work on the .wav files, not on anything else like mp3/ogg.
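As a rough check on the collision worry, here is a birthday-bound estimate in Python. It assumes SHA-1 output behaves like uniformly random 160-bit values and only covers accidental collisions between different audio; the figure of one billion tracks is an arbitrary illustration.

```python
def expected_collisions(num_tracks: int, hash_bits: int = 160) -> float:
    """Birthday-bound estimate of colliding pairs among num_tracks random hashes."""
    pairs = num_tracks * (num_tracks - 1) / 2
    return pairs / 2 ** hash_bits

estimate = expected_collisions(10 ** 9)
print(f"{estimate:.3e}")
```

Under those assumptions the expected number of colliding pairs is far below one, so accidental collisions are unlikely to be the practical problem.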
I think they're great. A community of users submits track listings, so that when you insert a CD, you know what it is.
> There are three problems:
> 1. How do you identify a given audio CD-ROM
> 2. How do you store the data effectively.
> 3. How do you allow clients to retrieve it.
They were not intended to support that - but there's no reason they couldn't.
Each of the problems you list has already been solved by the sites you list, so there's no reason to reinvent them.
What you want is to extend the existing databases to include more information.
Why not do that?
But there are reasons why that scheme was chosen.
One reason is that audio discs aren't data discs, so an sha1sum of every track wouldn't necessarily be the same each time. It'd also be really slow, much slower than the current track length algorithm.
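For comparison, here is a sketch of a track-length-based disc ID in the style of the classic CDDB scheme, written from memory, so treat the details as an assumption rather than a reference implementation. It needs only the table of contents (track start offsets and the lead-out), not the audio itself, which is why it is fast. The function names and example offsets are made up.

```python
FRAMES_PER_SECOND = 75  # CD audio addressing uses 75 frames per second

def digit_sum(n: int) -> int:
    """Sum of the decimal digits of n."""
    return sum(int(d) for d in str(n))

def cddb_style_disc_id(track_offsets, leadout_offset) -> str:
    """Offsets are in frames and include the 150-frame (2 second) lead-in."""
    starts = [off // FRAMES_PER_SECOND for off in track_offsets]
    checksum = sum(digit_sum(s) for s in starts) % 0xFF
    total_seconds = leadout_offset // FRAMES_PER_SECOND - starts[0]
    disc_id = (checksum << 24) | (total_seconds << 8) | len(track_offsets)
    return f"{disc_id:08x}"

# Hypothetical three-track disc: tracks start at 2 s, 202 s and 402 s; lead-out at 602 s.
example_id = cddb_style_disc_id([150, 15150, 30150], 45150)
print(example_id)  # prints 0c025803
```

Because the ID is built from a handful of small integers, two different discs with similar track layouts can end up with the same value, which is the trade-off against hashing the audio.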
I still stand by my point though. Why re-invent it from scratch rather than talking to the freedb people? (This seems quite a contrast to your normal viewpoint)
Other problems:
* Who is going to submit all the extra data?
* Cover art copyright
* Song lyrics copyright.
I'm not averse to having somebody else do the coding + hosting etc, so I most probably would contact the freedb people - but only after I have a proof-of-concept to share.
Otherwise I doubt many people would be interested.
I am a little concerned that repeated reads of the audio data might not be identical, judging by the sight of cdparanoia doing "error correction".
As for copyright: yes, a valid concern.
One mitigating factor is that, with this scheme, using the SHA1 hash does ensure the submitter of a disk actually has a legitimate local copy. Still, I can see there will most likely be challenges there if the system were to be adopted.
(As for people inputting the data, probably a subset of the same people that do now. The ones that care about per-track information, and decent handling of compilations / multi-disk albums.)
Musicbrainz allocates a unique identifier for everything entered into its database, which contains the algorithmically generated IDs - e.g. "Ace of Spades"
That looks like a good GUID.
I'm curious how they decide that, and whether they can tell the difference between the "Studio" version of a song and one of the "Live" versions.
Knowing roughly how musicbrainz works I'd guess they couldn't tell, but if I were honest in my music snobbery I'd want the two tagged differently.
There might be an issue with working with the server for some people, even though the data is "free":