This page is obsolete.
The GitHub project got scrapped and offline_dictionary.com replaces it. Check this post instead.
But the technical information below still stands.
There you are, ready to learn lots of nice things.
Get the offline dictionary
|dictionary.com app settings|
|Root Explorer's built-in SQLite Viewer|
|Whole databases folder retrieved|
Get the offline dictionary: hacker version
Extract the data from the SQLite database
|DB Browser for SQLite|
|DB Browser for SQLite|
|Visual Studio's Diagnostic Tools show high CPU usage on all cores|
Build the XDXF from the extracted data
- Because the definition itself from dictionary.com is made outta (crappy) HTML, so it's already a visual representation of the definition;
- Because it would be too hard to parse this HTML and convert it to a semantic XDXF fragment, stripping out all of the visual information;
- Because my personal goal here is to be able to convert this XDXF using the Russian's tool so I can enjoy it on my PocketBook, and most likely this little tool will not support the 'logical' format.
|The output XDXF looks something like this|
And finally here is the XDXF:
|Yeah... it's a pretty big mofo|
Download the 7zipped version there:
Damn, this guy is too big, and it crashes the Russian's tool that is supposed to convert XDXF to ABBYY... crap.
Guess that will be the next episode then. Gotta do this shit by myself.
In the current version of the offline database '08-08' there are 149135 word entries.
We need to get their IDs and then to go and grab their definition, plus get the 'similar' words that have the same meaning which are in another table.
Doing this synchronously would, I guess, take a couple of days.
Done asynchronously though, it takes a good hour.
Right now I'm using Task to create the parallel tasks, with one task responsible for building the definition of one word. Which means that I am creating 149135 tasks :)
"OMFG WTF are u doing!?" you are thinking.
Fear not, the Task class works with a goddamn good task scheduler. Yes, I will create 149135 Task objects, but only 8 or 10 will actually run concurrently. All of the other tasks will be marked as WaitingForActivation.
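To make the "many tasks, few running" idea concrete, here is a minimal sketch. It is not the code from the project: `TaskFanOutSketch`, `BuildDefinition` and `Run` are made-up names, and the spin-wait is only a stand-in for the real per-word work. It queues one task per word on the default scheduler and records the highest number of work items that were running at the same moment.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class TaskFanOutSketch
{
    private static int _running; // work items executing right now
    private static int _peak;    // highest value _running ever reached

    // Hypothetical stand-in for "build the definition of one word".
    private static int BuildDefinition(int wordId)
    {
        int now = Interlocked.Increment(ref _running);

        // Raise _peak to 'now' unless another thread already recorded more.
        int seen;
        while (now > (seen = Volatile.Read(ref _peak)))
            Interlocked.CompareExchange(ref _peak, now, seen);

        Thread.SpinWait(10_000); // pretend to do some CPU work
        Interlocked.Decrement(ref _running);
        return wordId;
    }

    // Creates one Task per word and waits for all of them.
    public static (int completed, int peak) Run(int wordCount)
    {
        _running = 0;
        _peak = 0;

        Task<int>[] tasks = Enumerable.Range(0, wordCount)
            .Select(id => Task.Run(() => BuildDefinition(id)))
            .ToArray();

        Task.WaitAll(tasks);
        int completed = tasks.Count(t => t.Status == TaskStatus.RanToCompletion);
        return (completed, _peak);
    }
}
```

Calling `TaskFanOutSketch.Run(149135)` creates that many Task objects, but for CPU-bound work like this the reported peak should typically stay in the neighbourhood of the number of cores, because the thread pool only hands out a limited number of worker threads.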
It's all good right there. A Task object (I guess) only contains a reference to a delegate. Which is like a pointer (I still guess) which is like an Int64 on my 64-bit CPU (I'm still guessing). A real Task carries more state than a single reference, so treat the figure below as a floor.
So it's prolly like:
149135 * 64 bits = 9544640 bits
=> 1193080 bytes
=> 1165 KB
=> 1.14 MB
Plus, I clean the tasks list every second to remove done tasks (it's easier to debug that way: I only have the remaining stuck tasks).
And BTW I tried using the new Parallel static class. This is shit. My CPU was barely working at all, even after setting a MaxThingy in its configuration to MAXINT. It's just not brutal enough, and it was at least 4 times slower.
Maybe I just don't know how to get the best outta it but anyway I reverted and used Task instead.
Still it's slow. So I tried a couple things to speed up the process. Only the first one really worked.
First I moved the SQLite database file to my SSD drive.
This worked well: before, I could see that my CPU was not working at 100%, so I guess the bottleneck was the drive's I/O.
Then I tried to move the SQLite database file to a RAM drive. Why the fuck not uh?
I used ImDisk Virtual Disk Driver and copied/pasted the file there. No speed increase, but this will stop fucking my poor SSD. So I still recommend that to save the life-span of your SSD a little bit.
Finally I moved the data to a SQL Server Express instance. I used the trial version of ESF Database Migration Toolkit to make the migration. But no speed increase either. So there was no point.
Storing the whole thing
Let me explain.
For instance, when we read the definitions for the word 'fame' we get stuff. We also know that 'famed', 'overfamed', etc. also have the same definition as 'fame'.
But, when later I read the definition of 'famed', we get an extra new definition that only relates to the 'famed' adjective. In essence, 'famed' will have its own definition plus the ones from 'fame'.
You can check it out directly at dictionary.com. Go on, type 'fame', then open another tab and type 'famed'. Now compare both. The word 'famed' outputs the 'famed' definition + the 'fame' definitions.
With these considerations, I have to store the whole thing in memory and, little by little, update each word's definitions with the definitions of its 'parent' word.
There must be another way, another coding design, but so far I don't see one.
Updating the whole thing
Because I store definitions by words, and I add words from different threads, I need ConcurrentDictionary. And because sometimes I update the definitions from different threads too, I also need to protect the definition collection, so I'm using a lock around the List.
So I have tried SynchronizedCollection and ConcurrentBag instead of the List. Now, I lack knowledge and experience in threaded coding in general, but I had issues with SynchronizedCollection. These mofos were sometimes throwing a 'collection was modified' exception (or something). Which probably means that only each individual operation like Get/Add/etc. is locked, not a whole foreach. So I had other threads messing with my collection during a foreach.
But with ConcurrentBag I never had a single exception. I guess that's because ConcurrentBag keeps per-thread storage and enumerates a snapshot of its contents, instead of only locking each atomic operation.
Anyway ConcurrentBag was overkill, so a simple lock around my List is most likely faster.
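The ConcurrentDictionary plus locked List combination can be sketched roughly like this. It is only an illustration under assumptions: `DefinitionStore`, `Add`, `Inherit` and `Snapshot` are hypothetical names, and definitions are plain strings here. The dictionary handles concurrent inserts of words, and each word's List is only touched while holding a lock on that List.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical store: one list of definitions per word.
public sealed class DefinitionStore
{
    private readonly ConcurrentDictionary<string, List<string>> _byWord =
        new ConcurrentDictionary<string, List<string>>();

    // Adds one definition to a word, creating the word's list if needed.
    public void Add(string word, string definition)
    {
        List<string> list = _byWord.GetOrAdd(word, _ => new List<string>());
        lock (list) { list.Add(definition); }
    }

    // Copies the parent's definitions onto a derived word ('fame' -> 'famed').
    public void Inherit(string word, string parent)
    {
        // Take the parent's snapshot first so two locks are never held at once.
        string[] inherited = Snapshot(parent);
        List<string> list = _byWord.GetOrAdd(word, _ => new List<string>());
        lock (list) { list.AddRange(inherited); }
    }

    // Returns a copy that is safe to enumerate while other threads keep writing.
    public string[] Snapshot(string word)
    {
        if (!_byWord.TryGetValue(word, out List<string> list))
            return Array.Empty<string>();
        lock (list) { return list.ToArray(); }
    }
}
```

Readers iterate over the copy returned by `Snapshot` rather than over the live List, which is what avoids a modified-during-foreach exception without needing a concurrent collection.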
Writing the XDXF
The weird thing is I had exactly the same issue at work. You know when I do... uh... 'tactical' programming and shit. When I do operator style CQB coding.
What's interesting is that I found something on the interwebz. People talking about the freaking DataReader being lazy. Like, unless you try to evaluate the thing linked to the reader, nothing happens.
Makes sense. Moreover, I was using IEnumerable to try and optimize the readings from the SQLite database. So it could be... that somewhere in my foreach loops, somehow an iterator is getting lost along the way, which means that one item will never be evaluated. Which means that the reader will never try to read. Because no one needs the data.
That was a very interesting theory. Unfortunately, I ended up testing with ToList() everywhere in the code, making sure that everything would be evaluated. And the bug was still there. Still waiting on the last 5 tasks, chilling.
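For anyone who hasn't run into the lazy-evaluation behaviour described in that theory, here is a small sketch of it. There is no database involved: `LazyReadSketch` and `ReadRows` are made-up names, and the iterator merely pretends to be a method wrapping a DataReader. The point is only that an iterator method does no work until something enumerates it, and that ToList() forces the whole thing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyReadSketch
{
    // Counts how many rows were actually "read".
    public static int RowsRead;

    // Hypothetical stand-in for a method that yields rows from a DataReader.
    public static IEnumerable<string> ReadRows(int count)
    {
        for (int i = 0; i < count; i++)
        {
            RowsRead++;              // only runs when a consumer asks for a row
            yield return "row " + i;
        }
    }

    // Shows the counter before and after forcing evaluation.
    public static (int before, int after) Demo()
    {
        RowsRead = 0;
        IEnumerable<string> rows = ReadRows(3); // nothing has been read yet
        int before = RowsRead;
        List<string> all = rows.ToList();       // enumerates, so all rows get read
        return (before, RowsRead);
    }
}
```

`LazyReadSketch.Demo()` returns `(0, 3)`: zero rows read after the call to `ReadRows`, three once `ToList()` has walked the sequence.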
|This is the usage at the 149133rd word. It's been like that forever.|
And, the surprising thing is that it doesn't crash. Nope. If I put a break-point after the reader, and wait for the break, and check the value returned by this guy, it's valid. There is actual valid data in there. Nothing fancy, nothing huge, just the usual definition in there.
So even though the extraction from the database takes around 30 to 40 mins, it can last up to 3 hours just because these last 5 freaking tasks are chilling.
Which is still better than doing this shit synchronously...
Not sure how though. I started to remove the completed tasks from my huge tasks list. I do this from an 'update' task that sits in a while(true) and shows the progression. Every second I RemoveAll() the completed tasks.
Also I dropped the ConcurrentBag and used a simple lock.
Those are the two actions I did, and now it completes fine.
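The pruning loop can be sketched as follows. Again this is an assumption-laden illustration and not the project's code: `TaskPruningSketch` and `WaitAndPrune` are invented names, the loop stops once the list is empty instead of running in a while(true), and it assumes nothing else touches the list while it runs.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class TaskPruningSketch
{
    // Repeatedly drops finished tasks from the list until none are left.
    // Returns how many tasks were removed in total.
    public static int WaitAndPrune(List<Task> tasks, TimeSpan pollInterval)
    {
        int removed = 0;
        while (tasks.Count > 0)
        {
            // RemoveAll returns the number of elements it removed.
            removed += tasks.RemoveAll(t => t.IsCompleted);
            Console.WriteLine($"{tasks.Count} tasks remaining");

            if (tasks.Count > 0)
                Thread.Sleep(pollInterval); // the post polls once per second
        }
        return removed;
    }
}
```

With `pollInterval` set to `TimeSpan.FromSeconds(1)`, whatever is still in the list after a while is exactly the set of stuck tasks, which makes them easy to inspect in the debugger.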
|The license for "DictionaryDotComToXdxf" is the WTFPL: Do What the Fuck You Want to Public License.|