char-rnn experiments

These are some notes on trying out https://github.com/karpathy/char-rnn with different file formats.

First: Lego Digital Designer files. These are XML files, basically containing a bunch of coordinates and references to brick models.

We scraped all available .lxf files with Scrapy, a Python scraping/spidering framework (scrapy.org), from the eurobrick.com forums, where people have been creating a lot of sets and linking to the resulting files.

Requirements: Python 2

Install scrapy:

pip install scrapy

Start a new scrapy project with:

scrapy startproject lego

Create a new Python script for the spider in lego/lego/spiders.
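The spider script itself isn't reproduced here; a minimal sketch (Python 2, Scrapy) could look like the following, where the start URL and the link-selection logic are assumptions rather than the original code, and the spider simply saves every .lxf attachment it comes across:

import scrapy


class LegoSpider(scrapy.Spider):
    # The name has to match what we pass to "scrapy crawl" below.
    name = "legospider"
    # Assumed entry point; the real spider crawled the forum threads.
    start_urls = ["http://www.eurobricks.com/forum/"]

    def parse(self, response):
        # Assumed selection logic: look at every link and download the .lxf ones.
        for href in response.css("a::attr(href)").extract():
            url = response.urljoin(href)
            if url.lower().endswith(".lxf"):
                yield scrapy.Request(url, callback=self.save_lxf)

    def save_lxf(self, response):
        # Write the attachment into the project directory under its own name.
        filename = response.url.split("/")[-1]
        with open(filename, "wb") as f:
            f.write(response.body)

A real run would also need rules for following the forum's thread and pagination links, which depend on the site's structure.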

Start scraping with:

scrapy crawl legospider

When it's finished, you should have about 4875 .lxf files in the Scrapy project directory (run ls | grep ".lxf" | wc -l to count them). Messy, right? Move these files out of there:

mkdir lxf
mv lego/*.lxf lxf/

lxf files are zip archives:

file 4553_train_wash.lxf
4553_train_wash.lxf: Zip archive data, at least v2.0 to extract

So we'll need to extract them and automatically rename the extracted files, because the archives all contain identically named files like these:

IMAGE100.LXFML
IMAGE100.PNG

7z supports automatically renaming files that would otherwise be overwritten:

find . -name '*.lxf' -exec 7z e -aot {} \;

This will take a while, grab a coffee or something.

We move the extracted files into two separate directories, png (just in case we need the images) and lxfml. To keep things tidy, we move the .lxf files, too:

mv *.LXFML lxfml/
mv *.PNG png/
mv *.lxf lxf/

Afterwards, we concatenate the .LXFML files into one huge file, input.txt:

cat lxfml/*.LXFML >> input.txt

Move this file to its own directory:

simon@t430 ~/char-rnn/data $ mv lego/lxf/input.txt legoprepped/

Start training:

th train.lua -data_dir data/legoprepped -rnn_size 512 -num_layers 2 -dropout 0.5

It's immediately apparent this won't do, or would take ages, so we move to another box with an OpenCL-capable GPU and train on a much, much smaller subset of the files (2 MB).
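The exact invocation on the GPU box wasn't captured; with char-rnn's OpenCL support it would look roughly like this, where the data directory for the 2 MB subset is just a placeholder and the remaining flags are carried over from the run above:

th train.lua -data_dir data/legosmall -opencl 1 -rnn_size 512 -num_layers 2 -dropout 0.5

Samples are then drawn by pointing char-rnn's sample.lua at one of the checkpoints train.lua writes to cv/ (the checkpoint file name here is also a placeholder):

th sample.lua cv/lm_lstm_epoch1.00_1.2345.t7 -opencl 1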

Sampling with first checkpoint:

Checkpoint 7:

Unfortunately, this doesn't get much better (at the moment?).

PEGIDA corpus

We use the (already cleaned-up) corpus of 288k comments from the PEGIDA Facebook page.

We put them all into one file, input.txt, using Python and pandas:
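The script isn't included here; a minimal sketch with pandas could look like this, where the CSV file name and the comment column name are assumptions:

import io
import pandas as pd

comments = pd.read_csv("pegida_comments.csv", encoding="utf-8")

# Concatenate every comment into the single input.txt that char-rnn expects,
# one comment per line; char-rnn only cares about the raw characters.
with io.open("input.txt", "w", encoding="utf-8") as f:
    for comment in comments["comment_text"].dropna():
        f.write(comment)
        f.write(u"\n")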

We then start training:
$ th train.lua -data_dir data/pegida -opencl 1

As expected, the first checkpoint is gibberish:
