Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
I'm BaaaaAAAAaccck!
I'm BaaaaAAAAaccck!
#1
Was arguing with my cow orker today, and called him a drunkard.  CoH has been on my mind with the new news...  and well, one thing led to another..


Glad to see you've kept the lights on!
Reply
RE: I'm BaaaaAAAAaccck!
#2
Welcome back, WG! And as you may have noticed, all your old posts and threads are yours again, and your "joined" date has been reset to when you registered on the original boards.
-- Bob

I have been Roland, Beowulf, Achilles, Gilgamesh, Clark Kent, Mary Sue, DJ Croft, Skysaber.  I have been 
called a hundred names and will be called a thousand more before the sun grows dim and cold....
Reply
RE: I'm BaaaaAAAAaccck!
#3
Hey WG! We’ve been busy boys and girls since you’ve been gone. Smile
Reply
RE: I'm BaaaaAAAAaccck!
#4
I'm impressed, that's a fair piece of work, thank you Bob!
Reply
RE: I'm BaaaaAAAAaccck!
#5
From what I understand, it was a tedious, but simple process. I think it was Labster that wrote up the script that converted the raw forum-rip data into something acceptable to MyBB database. Bob only had to do a relatively few himself afterwards. And then from there, it's just a matter of associating your new account with the placeholder names in the database.
Reply
RE: I'm BaaaaAAAAaccck!
#6
Glad to see you return. Just a few days ago, I happened to look at one of your old posts on "The Game...", and was a touch dismayed to notice you were marked as not a registered member.
-----
"The Lady of the Lake, her arm clad in the purest shimmering samite, held aloft Excalibur from the bosom of the water, signifying by divine providence that this was some killer weed."
Reply
RE: I'm BaaaaAAAAaccck!
#7
(04-20-2019, 01:07 AM)Black Aeronaut Wrote: From what I understand, it was a tedious, but simple process.  I think it was Labster that wrote up the script that converted the raw forum-rip data into something acceptable to  MyBB database.  Bob only had to do a relatively few himself afterwards.  And then from there, it's just a matter of associating your new account with the placeholder names in the database.

The beginning was a bit more complex and I actually did a lot more work than that. 

Labster did write the code that ripped the Crapatalk forums, which turned them into a SQLite database -- a table of threads, a table of posts and a table of users. And other than some advice when I needed it, that's where his part of the process ended. 

Ultimately I had to move that data into MySQL here on my site while preserving all the relationships between threads, forums and posts. However, because of a feature lacking in SQLite, that meant I had to export it all and then imported it back into Microsoft SQL Server first. 

Then I wrote MSSQL scripts that went through each forum, read each thread, and wrote out a MySQL script that would recreate that thread.  I ran all that in large batches, one forum at a time, debugging things when the scripts crashed, and actually changing a few things about how the MyBB database was set up so that some of our larger messages would be recreated properly. Overall, that took me about a month of programming, testing, and moving content, although I was able to speed up the process quite a bit toward the end. As I moved them, I locked the individual forums on the old board and sent people here. When everything was moved and the old boards locked down complete, I had to run Labster's tool one final time to get everything that had been added to the old forums between his scrape and me completing the transfer, and transfer that in as well.

I didn't do much with the user data until recently, when I figured out how to carry over the original registration dates.  But it was easy enough to associate old posts with new accounts, which I was doing right from the start -- three one-line scripts that I just plug in the username and the new id, and I can give anyone their old posts back.  I also have a variant for when the user changed their name between the old board and the new board.  They sit in a script file I keep in two or three places just for backups and anytime someone from the old board registers, it takes me all of ten seconds to give them their post history back.

My next db project is fixing a date/time shift that crept in when I moved the dates in UNIX era format -- it seems to have ignored the offset from UTC for my time zone and set everything a few hours off.

And that was probably far more than you cared to know about the process.
-- Bob

I have been Roland, Beowulf, Achilles, Gilgamesh, Clark Kent, Mary Sue, DJ Croft, Skysaber.  I have been 
called a hundred names and will be called a thousand more before the sun grows dim and cold....
Reply
RE: I'm BaaaaAAAAaccck!
#8
Really, it was a nice change of pace from the intentional obfuscation I've been gargling at work and around CoH. Folks don't seem to understand that I don't care what's in their codeplug, but I do need it to ship them a programmed and tested new radio...
Reply
RE: I'm BaaaaAAAAaccck!
#9
Yeah, my bit is mostly as advertised, and my ripping script is open source.  There were lots of fiddly little issues, like Tapatalk intentionally obfuscating their API and pretending they were serving plain HTML files, but eventually I was able to hook into the standard PhpBB API.  I think the domain or the path may have changed in the middle of the rip, that was fun.

Oh yeah, I had to parse the HTML inside an HTML comment, because they chose to remove post titles from the DOM by wrapping them in comment tags.  \o/ All sorts of data was buried under a mountain of divs and spans.  I could tell that this shop was freaking crazy with their dev process because I would fail code review for pretty much everything they changed from the stock code.  Other than that, it was just keeping track of users, and doing a simple randomizing process to get all the threads without anyone noticing.  I mean, it wouldn't look like organic traffic, but based on their web dev team I was pretty confident their sysadmin team wouldn't notice anything in the logs if the numbers weren't sequential.

And then came the part where I was rewriting all of this HTML data back into BBCode -- specifically MyBB, because that's what Bob chose.  This script was much crappier than the download one, but text munging always looks messy.  All of this was to change HTML tags into the bbcode equivalent, but this is less clean than it looks too.  Basically all of the embeds needed to be rewritten -- and some of them were actually things where the forum user had just pasted a link, and Tapatalk automagically turned it into an embed to help with product placement. So I had to un-Crapatalk those.  URLs needed to be upgraded to HTTPS while I was there.  And fix the font sizes given as "15px%".  

It didn't do a depth-first parse, so you basically had to run the script several times until it stopped finding changes, then run it once more.  Yes, super classy.  But it's one of those things where you keep developing until your project is done, and there's no need to actually fix bugs once you have your data set.  Anyway, all this work was done in isolation so Bob could claim to have done most of the work, but we all know that programmers do more work than DBAs, right?

Bob Wrote:My next db project is fixing a date/time shift that crept in when I moved the dates in UNIX era format -- it seems to have ignored the offset from UTC for my time zone and set everything a few hours off.
Haha no, that's totally my location that caused it, I just ignored time zones on input data.  I'm not sure exactly what happened, but it was either 3 or 7 hours off, which would either be my offset from Bob, or my offset from GMT.  Isn't that something as simple as: UPDATE posts SET timestamp = DATE_ADD(timestamp, INTERVAL 3 HOUR) WHERE timestamp < @migration_date; Actually here are the mysql date function docs.
"Kitto daijoubu da yo." - Sakura Kinomoto
Reply
RE: I'm BaaaaAAAAaccck!
#10
I don't know about anyone else, but I for one find the process interesting to hear about. Crapatalk also gets even more lols and harder eye-rolls now, and a new round of thanks to you two for figuring out how to migrate us right off of there because it's obviously a fiasco all around.
--
‎noli esse culus
Reply
RE: I'm BaaaaAAAAaccck!
#11
Quote: Isn't that something as simple as: UPDATE posts SET timestamp = DATE_ADD(timestamp, INTERVAL 3 HOUR) WHERE timestamp < @migration_date;
Yeah, I expect it would be. It's just I haven't gotten around to it in the year-and-a-half, and I keep forgetting about it.

So, now that I'm thinking about it, I'll put it on my master "to do" list for the forums so I'm not just depending on my memory.

And sorry about not mentioning all the work you put into the scraper, particularly the markup translation. I just wanted to make it clear that it hadn't been a case of "scrape-and-into-the-new-board-spit-spot".
-- Bob

I have been Roland, Beowulf, Achilles, Gilgamesh, Clark Kent, Mary Sue, DJ Croft, Skysaber.  I have been 
called a hundred names and will be called a thousand more before the sun grows dim and cold....
Reply
RE: I'm BaaaaAAAAaccck!
#12
Don't be sorry, it was one of those jokes along the lines of:
[Image: KMM8q.jpg]
"Kitto daijoubu da yo." - Sakura Kinomoto
Reply


Forum Jump:


Users browsing this thread: 1 Guest(s)