Thursday, March 03, 2011

Fixing KOBO's malformed database

This is likely to be one of many posts on the subject.  I have an interest since the malformed database often rears its ugly head when using Calibre.  Since I wrote the Kobo driver for Calibre and I do not believe Calibre causes the issue I have an interest in knowing more about the cause and potential fixes.

Dumping the database

For my purposes I use sqliteman to dump the database to a file.  I have not tested the script simply using the sqlite3 command but I imagine it would work.

the script

The script below lacks some (a lot) of features and has hard coded file names, etc but it does the job.  I used python because I might want to actually use some sqlite connections at some point.

The script works by attempting to interpret each row in the dump file as Unicode.  If it fails the associated line and the related sql command is dropped.

#!/usr/bin/env python

import re
import string
import binascii

file = open("kobo-corrupt-nick.sql", "rb")
err_file = open("error.log", "w")
out_file = open("out.log", "w")

new_stmt = False
prev_row = ""
err_line = ""
corrupt = False
prev_corrupt = False
firstrow = True

while 1:
    line = file.readline()

    if not line:
        break
    try:
        err_line = line

        if re.match("[I][N][S][E][R][T] *", line):
            new_stmt = True
            row = line
        elif re.match("[D][R][O][P] *", line):
            new_stmt = True
            row = line
        elif re.match("[C][R][E][A][T][E] *", line):
            new_stmt = True
            row = line
        elif re.match("[P][R][A][G][M][A] *", line):
            new_stmt = True
            row = line
        elif re.match("[B][E][G][I][N] *", line):
            new_stmt = True
            row = line
        elif re.match("[C][O][M][M][I][T] *", line):
            new_stmt = True
            row = line
        else:
            new_stmt = False
            row = row + line

        if new_stmt == True:
            if prev_corrupt == False:
                if prev_row != "":
                    out_file.write(prev_row)
           else:
                if prev_row != "":                     err_file.write(prev_row)         prev_row = row         unicode(line, 'utf-8')         prev_corrupt = corrupt     except UnicodeDecodeError:         prev_corrupt = True         prev_row = row         print line         continue     except:         raise if new_stmt == True:     if prev_corrupt == False:         if prev_row != "":             out_file.write(prev_row)     else:         if prev_row != "":             err_file.write(prev_row) file.close() err_file.close() out_file.close()


Importing the file

Importing the file is straight forward, simply run:

sqlite3 KoboReader.sqlite < out.log

Validating the data
This is the kicker.  A lack of understanding of the data relationships may prevent being able to sort out the data.  I have started on some scripts that may help.  For instance:


select 
    c.title, 
    vs.volumeid, c.numshortcovers  - count(vs.volumeid) diff
from  
    volume_shortcovers vs inner join content c on
        c.contentid = vs.volumeid 
where 
    c.bookid is Null
group by 
    c.title,vs.volumeid, c.numshortcovers 
having diff <> 0

This script looks for issues where the number of rows in volume_shortcovers for each book is not the number specified by numshortcovers in the book record in the content table.

Does it mean anything?  Who knows as my Kobo that has no issues also as differences.

Another sql statement attempts to rebuild those missing rows:


insert into volume_shortcovers (volumeid, shortcoverid, VolumeIndex)
select 
    c.bookid, c.contentid, c.volumeindex
from content c left outer  join volume_shortcovers vs on 
    c.bookid = vs.volumeid and 
    c.contentid = vs.shortcoverid and 
    c.volumeindex = vs.volumeindex 
where 
    c.bookid is not Null and
    vs.volumeid is Null
order by c.bookid, c.volumeindex 
Is that safe?  No idea but I am working on it...  Running it on a database that had issues reduced the number of rows returned by the first sql query.

As I learn more I will post it here.