1
1

I checked http://meta.osqa.net/questions/4080/how-can-i-import-data-from-another-question-and-answer-script to no avail.

I imported users with the previously mentioned script but questions are taking too slow.

AskAction(user=user).save(data=qdata)

I can add 1 question in 1.1 second with the script.

Things I tried:

  • Disabling keys on mySql (it worsens the situation)
  • multithreading (it appears so python doesn't concurrently multithreaded)
  • multiprocessing - 4x speed increase, best works with 10 processed, for a total of 100 questions a round.

Currently my best speed is ~0.25 seconds per question adding.

How can I add 300K+ questions, fast, I am thinking directly to the database without python/django/osqa? Current speed of 4 questions per second limit my testing and explorations.

I have messed around in the python codes but I am new to python, so I could not figure out which queries to execute.

asked 04 Sep '11, 08:50

kerem's gravatar image

kerem
91137
accept rate: 0%

edited 04 Sep '11, 08:54

rickross's gravatar image

rickross ♦♦
12.5k2814972

@kerem, all meta.osqa.net postings are required to have a clear, specific question in their title field. I have edited this one for you, but please make sure you observe this rule in future postings. Thanks.

(04 Sep '11, 08:55) rickross ♦♦

Well after digging in mysql logs, I have found that using the following steps, you can import your users, tags and questions by generating an appropriate SQL script.

Users

Since there are a manageable amount of users for my site, I have used the previously mentioned AskAction(user=user).save(data=qdata) python script.

Tags

Before adding questions I chose to add all of the tags belonging to all questions first. Doing tags is the easiest part. After getting your tags into a list or better a set, just use the following python script and do not worry about use counts of the tags yet.

for tag in allTags:    
  fTags.write("INSERT INTO forum\\\\_tag(name, created\\\\_by\\\\_id, used\\\\_count) VALUES ('%s', 1, 1);\\\\n" % tag)

Note: created\\_by\\_id = 1 is for my user. You can get the used ID from auth_user table with the query: Within mysql console:

select id from auth_user where username = 'admin';

Questions

For the question part I will give you the order of inserts to tables and give my sample SQL script but it may not work for your input.

  1. First you have to add your question to the table forum_node. We will later update some fileds in this table.
  2. Second table you want to use is forum_noderevision. You have to have the latest revision info in this table.
  3. After inserting into forum_noderevision, we update active_revision_id of forum_node table.
  4. Then we insert into the forum_action table to notify the system that a question has been asked.
  5. Besides my questions' own tags, I added their subject as a tag and modified the subject a little. Here I insert into forum\\_tag table my quesion's subject.
  6. All the imported tags are correlated with the forum\\_node with forum\\_node\\_tags table.

For each question I wrote the following to an .sql file:

START TRANSACTION;

INSERT INTO forum_node (title, tagnames, author_id, body, node_type, added_at, score, state_string, extra_count, marked) VALUES( '%s', '%s', %s, '%s', 'question', '%s', 0, '', 0, 0);

SET @forumNodeId = LAST_INSERT_ID();

INSERT INTO forum_noderevision(title, tagnames, author_id, body, node_id, summary, revision, revised_at) VALUES('%s', '%s', %s, '%s', @forumNodeId, '', 1, now());

SET @activeRevId = LAST_INSERT_ID();

UPDATE forum_node SET active_revision_id = @activeRevId WHERE id = @forumNodeId;

INSERT INTO forum_action (user_id, ip, node_id, action_type, action_date, canceled, canceled_ip) VALUES(%s, '', @forumNodeId, 'ask', now(), 0, '');

INSERT INTO forum_tag(name, created_by_id, used_count) VALUES ('%s', 1, 1);

-- iterate all tags of this question:
INSERT INTO forum_node_tags(node_id, tag_id) SELECT @forumNodeId, id FROM forum_tag WHERE name = '%s' LIMIT 1;

COMMIT;

After running the question script, you have to update forum\\_tag.used\\_count it will take a looong time because the way mySQL handles COUNT(*) and the way that query is.

UPDATE forum_tag SET used_count = (SELECT COUNT(*) FROM forum_node_tags WHERE tag_id = forum_tag.id);

If you notice anything odd please leave a comment.

link

answered 06 Sep '11, 13:48

kerem's gravatar image

kerem
91137
accept rate: 0%

edited 06 Sep '11, 14:10

1

Forgot to update last activity pointers.

UPDATE forum_node SET last_activity_by_id = author_id, last_activity_at = added_at;

(12 Sep '11, 16:01) kerem

Nice work with the SQL tracking :) I wrote python functions for adding questions, answers, and comments based on this in this thread: http://meta.osqa.net/questions/10600/how-do-i-bulk-upload-users-and-questionsanswers-data-into-osqa

Question: http://pastebin.com/eeq4hPFR

Answer: http://pastebin.com/UgY9Wvd2

Comment: http://pastebin.com/LUV08HXi

(17 Feb, 07:41) dahlo

If you are trying to manage 300K+ questions in a serious endeavor, then you almost certainly want to talk with DZone about using our enterprise-scale product, "Qato" instead of our entry-level OSQA product. Send a note to sales@dzone.com describing what you want to accomplish, and one of our team members will get back in touch with you right away.

link

answered 04 Sep '11, 08:57

rickross's gravatar image

rickross ♦♦
12.5k2814972
accept rate: 46%

edited 04 Sep '11, 08:58

1

I am just evaluating, I would like to use the open source OSQA database for this evaluation and not qato but thanks for your recommendation.

(04 Sep '11, 08:59) kerem

@karem, best of luck in your exploration. I'm sure you will understand that, as the creators of both OSQA and Qato, we at DZone have a good sense of what our products relative strengths and weaknesses are. For many reasons, Qato would be the right choice to operate at this scale.

(04 Sep '11, 09:10) rickross ♦♦
Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Tags:

×47
×16

Asked: 04 Sep '11, 08:50

Seen: 578 times

Last updated: 17 Feb, 07:42

powered by OSQA