dictionary.com as a XDXF

EDIT:
This page is obsolete.
The GitHub project got scrapped and offline_dictionary.com replaces it. Check this post instead.
But the technical information below still stands.



There you are, ready to learn lots of nice things.
The project is hosted on my GitHub right there : 
It's written in C# so deal with it. I was a Java fanboï too in the past. But then I realized it was unfortunately shit. Plus, now C# is open source so your only argument in favor of Java is gone in pieces. I know because it was the one I used too ;)

Also, I like to troll fanboïs a lot. So calm your tits; code is code regardless of its language.
But still... C# > Java... deal with it (╯°□°)╯︵ (~ .o.)~
Muahaha!

Below you will find the big lines of the GitHub project and how it's structured.





Get the offline dictionary


First, you need to know that the dictionary.com's mobile app (here) has an 'offline' option... at least on the Android version... and at least on the 'premium' paid version...

Whatever, now you know how I got the 'source' to create the XDXF.

Notes:
I am using BlueStacks to emulate an Android device, on which I installed the app. Whatever device you will be using, you are gonna need to have a root access as the offline dictionary is in protected folders. So it might be easier for you to grab one of these pre-rooted BlueStacks versions.

Now once you get that app, go to settings ...

dictionary.com app
... and check the offline options.

dictionary.com app settings

Now the app is gonna download the whole dictionary on you device.

You will find it in the following location :
/data/data/com.dictionary.paid/databases

See the com.dictionary.paid folder :

Root Explorer
And the content of the databases folder :

Root Explorer
We are interested in the android-08-08-primary.sqlite file.
You can even open it directly, and see a few definitions from the content_blocks table.

Root Explorer's built-in SQLite Viewer
You need to get that file. You manage.

Me, I copied this file into my BlueStack's shared folder which is on the device:
/data/sdcard/windows/BstSharedFolder

And the corresponding folder on my host Windows machine is:
%programdata%\BlueStacks\UserData\SharedFolder

Whole databases folder retrieved

So now you got the whole dictionary.com content on your computer.






Get the offline dictionary: hacker version


I wanted to download again the database to make sure that I had everything in order and no damaged file. But instead of doing all of the above I was like, I'm gonna find the URL and download that shit directly.

Not sure why, but Fiddler never showed anything. So, I went deeper and popped Wireshark open.
I found what I needed directly:

POST http://app.dictionary.com/mobiledownload/Primary-Android-v45
HEADERS Content-Type: application/x-www-form-urlencoded
PAYLOAD accesstoken=5koNfp9U

That's it. Not sure what this token is. Since I'm using the 'premium' paid version of the app  (don't ask...) , maybe this grants a better offline dictionary.




Extract the data from the SQLite database


This one was tricky but not so much.

Everything I'm gonna say here is unconfirmed and is just me doing reverse-engineering the database's structure.

Notes:
I used DB Browser for SQLite to debug queries, to try and understand the database layout, etc.

I'm lazy and I don't feel like explaining the database layout. So I'm just gonna throw a couple screenshots instead.

DB Browser for SQLite
DB Browser for SQLite

It's just a matter of using the C# System.Data.SQLite.Core nuget package and then loop on each 'word' to retrieve its definitions.

Well, I say that as if that's super easy... it is somehow, but it gets harder to do so fast.
In order to extract all of this stuff, I had to thread the readings from the database.
I built a giant ConcurrentDictionary in which I put stuff.

It still takes around 40 minutes to extract the whole thing, using my 16Gb RAM and 8 Cores.

Visual Studio's Diagnistic Tools show high CPU usage on all cores





Build the XDXF from the extracted data


Almost a much trickier part than extracting the data from the SQLIite, as the definitions from the dictionary.com database contains HTML and even sometimes JavaScript code...

In addition we have encoded HTML entities and broken IMG tags which aren't self closed... WTF dictionary.com! You made my job much hard! Much pain occurred to me ;(

This is pretty bad since most XML parser will complain that it can't parse such a mess.
So, we have to pull this awesome HTML parser out from nuget: HtmlAgilityPack.

This parser guy will read shit HTML and can close unclosed tags, doesn't give a fuck if a <BR> or <IMG> tag is unclosed (because it supports old ass HTML specs), etc.

So I'm parsing with this guy, then I'm converting back the HTML entities to Unicode except for the characters that need to be escaped in XML.

Which are the following :
  • &
  • <
  • >

<rage>
To the other people on the interwebz that say "meeh there is also the single and double quotes you n00b" will be thrown into blazing fucking fire:
Single and double quotes are to be escaped ONLY in XML attributes NOT in tags!
So STFU n00b and read them specs!
</rage>

Once the definition has been cleaned up a little bit as described right above my rage rant, we can put it inside a <def> tag which defines a definition in XDXF.

If you know a little bit about XDXF (and you most likely don't), you need to know there are two ways to represent a definition:
  • Visual
  • Logical
In 'visual' format, the <def> will contain stuff, and that stuff is the definition content. It most likely will contain the visual representation of the definition.

In 'logical' format, the <def> will be structured with XDXF tags representing different parts of the definition. Like an <ex> to indicate a usage example of the word (instead of italics for instance), or <etm> to indicate the etymology of the stuff, etc.

In this project, I will be using the 'visual' format for the following reasons:
  1. Because the definition itself from dictionary.com is made outta (crappy) HTML, so it's already a visual representation of the definition;
  2. Because it would be too hard to parse this HTML and convert it to a semantical XDXF fragment stripping out all of the visual information;
  3. Because my personal goal here is to be able to convert this XDXF using the Russian's tool so I can enjoy it on my PocketBook, and most likely this little tool will not support the 'logical' format.
The output XDXF looks like something like that

And finally here is the XDXF:
Yeah... it's a pretty big mofo

Download the 7zipped version there:
dictionary.com_5.2.2-08-08.7z

Damn, this guy is too big, and it crashes the the Russian's tool that is supposed to convert XDXF to ABBYY ... crap.

Guess that will be the next episode then. Gotta do this shit by myself.





Performance considerations


The hardest part of this project.
In the current version of the offline database '08-08' there are 149135 word entries.
We need to get their IDs and then to go and grab their definition, plus get the 'similar' words that have the same meaning which are in another table.


Tasks


Doing this in a synchronous way and I guess a couple days would be required.
In an async way though, a good hour is required.

Right now I'm using Task to create the parallel tasks, with one task responsible to build the definition of one word. Which means, that I am creating 149135 tasks :)
...
..
.
"OMFG WTF are u doing!?" you are thinking.
Fear not, the Task class works with a goddamn good task scheduler. Yes I will create 149135 Task objects, but only 8 or 10 will actually run concurrently. All of the other tasks will be marqued as WaitingForActivation.

It's all good right there. A Task object (I guess) only contains a reference to a delegate. Which is like a pointer (I still guess) which is like a Int64 on my 64 bits CPU (I'm still guessing).

So it's prolly like:
149135 * 64 bits = 9544640 bits
=> 1193080 bytes
=> 1165 kb
=> 1.13 mb

It's nothing.
Plus, I clean the tasks list every second to remove done tasks (it's easier to debug that way I have only the remaining stuck tasks)

And BTW I tried using the new Parallel static class. This is shit. my CPU was not working at all. Even after setting a MaxThingy in its configuration to MAXINT. It's just not brutal enough, and was going 4 times slower at least.

Maybe I just don't know how to get the best outta it but anyway I reverted and used Task instead.


Threaded SQLite


To thread the reads from the SQLite we need to open one connection per thread. Too bad we will be suffering lotsa overhead but that's the only way.

Still it's slow. So I tried a couple things to speed the process. However none really worked.

First I moved the SQLite database file to my SSD drive.
This worked well, as before I could see that my CPU was not working 100%. I guess the bottleneck was the I/O in the drive.

Then I tried to move the SQLite database file to a RAM drive. Why the fuck not uh?
I used ImDisk Virtual Driver and copied/pasted the file there. No speed increase but, this will stop fucking my poor SSD. So I still recommend that to save the life-span of your SSD a little bit.

Finally I moved the data to a SQL Express Server. I used the trial version of ESF Database Migration Toolkit to make the migration. But no speed increase either. So there's was no point.


Storing the whole thing


Then we put put all of this in memory... now this is going to take up some space all right. I can't write the stuff directly into the XML because when I get a definition for a word, this may not contain all of its definitions...

Let me explain.

For instance, when we read the definitions for the word 'fame' we get stuff. We also know that 'famed', 'overfamed', etc. also have the same definition as 'fame'.
But, when later I read the definition of 'famed', we get an extra new definition that only relates to the 'famed' adjective. In essence, 'famed' will have its own definition plus the ones from 'fame'.

You can check it out only directly at dictionary.com. Go on, type 'fame' and open another tab and type 'famed'. Now compare both. The word 'famed' outputs 'famed' definition + 'fame' definitions.

So yeah...
With these considerations, I have to store the whole thing in memory and little by little update words definitions with their 'parent''s word definitions.

There must be another way, another coding design, but so far I don't see one.


Updating the whole thing


Since multiple threads are messing with the same definitions we need to lock that shit so it's thread safe. For this job I'm using a ConcurrentDictionary with List inside.

Because I store definitions by words, and I add words from different threads, I need ConcurrentDictionary. And because sometimes I update the definitions from different threads too, I also need to protect the definition collection, so I'm using a lock around the List.

So I have tried the SynchronizedCollection vs ConcurrentBag instead of the List. Now I lack knowl-edge and experience in threaded coding in general but I had issues with SynchronizedCollection. These mofos were throwing CollectionChanged exception (or something) sometimes. Which probably means that each atomic operation like Get/Add/etc. is not locked. So I had other threads messing with my collection during a foreach.

But with the ConcurrentBag I never had a single one exception. I guess that's because ConcurrentBag has a locking mechanism per thread. Not only per atomic operation.

Anyway ConcurrentBag was overkill so a simple lock around my List is most likely faster.


Writing the XDXF


That sounds easy but not quite. At first I was like creating a DOM in memory, effectively duplicating the whole structure twice in memory. Actually more than twice since the XML nodes also contains tag strings and shit. I quickly experienced an OutOfMemoryException.

So the only way is to write the XDXF piece by piece while reading the content of our 'whole thing'. That's it.




One bug


There is one. The very first tasks I talked about: they seem to never complete.
Like the last 5 of them take like an hour each.

But it's not a Task deadlock that is freezing the execution, it's the freaking SQLiteDataReader that is like... waiting.

The weird thing is I had exactly the same issue at work. You know when I do... uh... 'tactical' programming and shit. When I do operator style CQB coding.

Whatever, we had this issue with sometimes an SQL query locking itself.

We are using Entity Framework 6 and out of desperation of finding nothing online nor by ourselves, I grabbed the EF source code and deep debugged our shit. Putting breakpoints all around inside EF. This shit was insane. When you go in there man, it looks like nothing. There are Invoke and delegates everywhere, oh man. My ass still hurts thinking about that.

But after 10 hours of deep ass debugging, I ended up on the freaking DataReader doing nothing. Waiting. Using no CPU. No memory. The mofo was just chilling.

What's interesting is that I found something on the interwebz. People taling about the freaking DataReader that is Lazy. Like, unless you try to evaluate the thing linked to the reader, nothing is happening.

Makes sense. Moreover, I was using IEnumerable to try and optimize the readings from the SQLite database. So it could be... that somewhere in my foreach loops, somehow an iterator is getting lost along the way, which means that one item will never be evaluated. Which mean that the reader will never try to read. Because no one needs the data.

That was a very interesting theory. Unfortunately, I ended up testing with ToList() everywhere in the code, making sure that everything would be evaluated. And the bug was still there. Still waiting chilling for the last 5 tasks.

This is the usage at the 149133th word. It's been like that forever.

And, the surprising thing is that it doesn't crash. Nope. If I put a break-point after the reader, and wait for the break, and check the value returned by this guy, it's valid. There is actual valid data in there. Nothing fancy, nothing huge, just the usual definition in there.

Very weird.
So even though the extraction from the database is around 30/40 mins, it can last up to 3 hours just because these last 5 freaking tasks are chilling.

Which is still better than doing this shit synchronously...


Solved


Not sure how though. I started to remove the completed tasks from my huge tasks list. I do this from an 'update' task that is in a while(true) and shows the progression. Even second I RemoveAll() the completed tasks.
Also I dropped the ConcurrentBag and used a simple lock.

Those are the two actions I did, and now it completes fine.




Licence

 The license for "DictionaryDotComToXdxf" is the WTFPLDo What the Fuck You Want to Public License.





No comments:

Post a Comment