OUs index php topic powered by smf. Automatic detection of forum engine

Let's start right away with the main script code:

#! / usr / bin / perl

# which-forum.pl script
# (c) 2010 Alexandr A Alexeev, http: // site /

use strict;

# commented out lines - for strictness
# if the task is to collect engine statistics, leave it as is
# if you make a list of forums - uncomment

my $ data;
$ data. \u003d $ _ while (<> ) ;

# check how much Powered by phpBB was without link in the footer
print "phpbb \\ n "
if ($ data \u003d ~ /] + href \u003d "[^"] * http: \\ / \\ / (?: www \\.)? phpbb \\ .com \\ /? "[^\u003e] *\u003e phpBB / i or
# $ data \u003d ~ /viewforum\\.php\\?†^""")*f\u003d\\d+/i or
$ data \u003d ~ / phpBB \\ -SEO / i or
$ data \u003d ~ /) ;
print "ipb \\ n "
if ($ data \u003d ~ /] + href \u003d "[^"] * http: \\ / \\ / (?: www \\.)? invision (?: board | power) \\. com \\ /? [^ "] *" [^\u003e] *\u003e [^<]*IP\.Board/i or
$ data \u003d ~ /] + href \u003d "[^"] * http: \\ / \\ / (?: www \\.)? invisionboard \\ .com \\ /? "[^\u003e] *\u003e Invision Power Board / i or
$ data \u003d ~ /

/ i or
$ data \u003d ~ /index\\.php\\?†^""†*showforum\u003d\\d+/i) ;
print "vbulletin \\ n "
if ($ data \u003d ~ / Powered by:? [^<]+vBulletin[^<]+(?:Version)?/i or
$ data \u003d ~ /) ;
print "smf \\ n "
if ($ data \u003d ~ /] + href \u003d "[^"] * http: \\ / \\ / (?: www \\.)? simplemachines \\ .org \\ /? "[^\u003e] *\u003e Powered by SMF / i or
$ data \u003d ~ /index\\.php\\?†^"""*board\u003d\\d+\\.0/i) ;
print "punbb \\ n "
if ($ data \u003d ~ /] + href \u003d "[^"] * http: \\ / \\ / (? :( ?: www \\.)? punbb \\ .org | punbb \\ .informer \\ .com) \\ /? "[^\u003e] *\u003e PunBB / i); #or
# $ data \u003d ~ /viewforum\\.php\\?†^""†*id\u003d\\d+/i);
print "fluxbb \\ n "
# if ($ data \u003d ~ /viewtopic\\.php\\?id\u003d\\d+/i or
if ($ data \u003d ~ /] + href \u003d "http: \\ / \\ / (?: www \\.) fluxbb \\ .org \\ /?" [^\u003e] *\u003e FluxBB / i) ;
print "exbb \\ n "
if ($ data \u003d ~ /] + href \u003d "[^"] * http: \\ / \\ / (?: www \\.)? exbb \\ .org \\ /? "[^\u003e] *\u003e ExBB / i); # or
# $ data \u003d ~ /forums\\.php\\?†^"""*forum\u003d\\d+/i);
print "yabb \\ n "
if ($ data \u003d ~ /] + href \u003d "[^"] * http: \\ / \\ / (?: www \\.)? yabbforum \\ .com \\ /? "[^\u003e] *\u003e YaBB / i or
$ data \u003d ~ /YaBB\\.pl\\?†^"""**num\u003d\\d+/i);
print "dleforum \\ n "
if ($ data \u003d ~ / \\ (Powered By DLE Forum \\)<\/title>/ i or
$ data \u003d ~ /] + href \u003d "[^"] + (?: http: \\ / \\ / (?: www \\.)? dle \\ -files \\ .ru | act \u003d copyright) [^ "] *"\u003e DLE Forum<\/a>/ i) ;
print "ikonboard \\ n "
if ($ data \u003d ~ /] + href \u003d "[^"] * http: \\ / \\ / (?: www \\.)? ikonboard \\ .com \\ /? [^ "] *" [^\u003e] *\u003e Ikonboard / i or
$ data \u003d ~ /\\ n "
if ($ data \u003d ~ /\\ n "
# if ($ data \u003d ~ /forums\\.php\\?fid\u003d\\d+/i or
# $ data \u003d ~ /topic\\.php\\?fid\u003d\\d+/i or
if ($ data \u003d ~ /] + href \u003d "http: \\ / \\ / (?: www \\.)? flashbb \\ .net \\ /?" [^\u003e] *\u003e FlashBB / i) ;
print "stokesit \\ n "
# if ($ data \u003d ~ /forum\\.php\\?f\u003d\\d+/i or
if ($ data \u003d ~ /] + href \u003d "http: \\ / \\ / (?: www \\.)? stokesit \\ .com \\ .au \\ /?" [^\u003e] *\u003e [^ \\ /] * Stokes IT / i) ;
print "podium \\ n "
# if ($ data \u003d ~ /topic\\.php\\?t\u003d\\d+/i or
if ($ data \u003d ~ /] + href \u003d [""]? http: \\ / \\ / (?: www \\.)? sopebox \\ .com \\ /? [""]? [^\u003e] *\u003e Podium / i) ;
print "usebb \\ n "
# if ($ data \u003d ~ /forum\\.php\\?id\u003d\\d+/i or
if ($ data \u003d ~ /] + href \u003d "http: \\ / \\ / (?: www \\.)? usebb \\ .net \\ /?" [^\u003e] *\u003e UseBB / i) ;
print "wrforum \\ n "
# if ($ data \u003d ~ /index\\.php\\?fid\u003d\\d+/i or
if ($ data \u003d ~ /] + href \u003d "http: \\ / \\ / (?: www \\.)? wr \\ -script \\ .ru \\ /?" [^\u003e] *\u003e WR \\ -Forum / i) ;
print "yetanotherforumnet \\ n "
if ($ data \u003d ~ / Yet Another Forum \\ .net / i or
$ data \u003d ~ /default\\.aspx\\?g\u003dposts&t\u003d\\d+/i) ;

You will find this and other scripts mentioned in the post in this archive.

Script which-forum.pl examines the code of the html page for the presence of forum engine signatures in it. We used a similar technique when defining WordPress and Joomla, but there are a couple of differences. First, the script itself does not load the page code, but reads it from stdin or a file passed as an argument. This allows you to load a page once, for example, using wget, and then run it through several analyzers, if we have more than one. Secondly, in this script, the presence of a signature is 100% a sign of the engine. Last time, the presence of the signature only gave weight to the corresponding engine and the engine with the highest weight "won". I decided that in this case, this approach would only unnecessarily complicate the code.

To test how the script works, I did some research. I compiled a list of several thousand forums and ran each of them through my own script, thereby determining the percentage of program triggers and the popularity of various engines.

I used my google parser to get a list of forums. The search engine sent requests like

site: forum. *. ru
site: talk. *. ru
site: board. *. ru
site: smf. *. ru
site: phpbb. *. ru
....

etc. The complete code of the query generator can be found in the file gen-forumsearch-urls.pl... In addition to zone.ru ,.su .ua .kz and.by were also used. It was difficult to do this research last time because WordPress and Joomla sites do not have such signatures in the URL. Catalogs like cmsmagazine.ru/catalogue/ do not provide a sufficient sample size. What are 600 Drupal sites?

I must admit, the results of the experiment upset me. Of the 12,590 sites studied, only 7083 were successfully identified with the engine, that is, only in 56% of cases. Maybe I left out some engine? Was Bitrix really on half of the forums? Or should I spend more time searching for signatures? In general, more research is required here.

Among 56% of successfully identified engines, the most popular, as expected, were IPB (31%), phpBB (26.6%) and vBulletin (26.5%)

They are closely followed by SMF (5.8%) and DLEForum (5.3%). My favorite punBB was only in 6th place (1.64%). I would not recommend strongly trusting these figures (they say, every third forum in RuNet works on IPB), but certain conclusions can, of course, be drawn.

For example, if you intend to create a website on a forum engine and plan to modify the forum, say, pay users $ 0.01 for each message with automatic withdrawal of funds once a week, then you should choose one of the three most popular engines. The more popular a forum is, the more likely it is to find a programmer who is well versed in it.

If significant changes in the engine are not expected, then it may make sense to choose not the most popular engine, for example SMF or punBB. This will reduce the number of hacker attacks on your forum and the amount of automatically sent spam on it.

Scripts for finding / defining forums can also find many practical uses. The first thing that came to my mind personally was to sort the identified forums by TCI and place on the first hundred posts with links to one of their sites. However, a hundred forum dofolllow links did not affect the TCI in any way (2 updates passed), so it is better not to waste time here, unless you are interested in transitions.

It is clear that the named use of scripts is far from the only one. I think you can easily figure out how else you can use them.

Organized by Botmaster Labs, not planned. There is no time, the video is needed for the competition, like a newfangled trend, although it's easier to explain everything with good screenshots (my IMHO), and I don't really want to shoot anything. So there are very few profitable ones left, stupid spam no longer rules at all, here you need to think and no one will burn the topic, if only the obsolete ones try to shove and powder a little in a beautiful wrapper. :) But this is not about us. In general, these 3 "not", I think, basically, and became barriers to participation in the competition for the majority of potential participants. It's like repairing a car out of three: cheap, high-quality, fast - the service can only fulfill 2 conditions at the same time. sit and choose what is closer to you. :) The competition is the same: I have time, I can make videos, but I don’t have a topic, or I can make a video, I have a topic, but I don’t have time at all, or I have free time and I have a little temka, but the video is scary. But it's good if 2 conditions are met simultaneously. Well, okay, let's drop the lyrics. I will continue to myself. I didn’t plan, which means that I took part in the competition, I even chose which article I would vote for. Say what you like, but Doz knows very well the software and knows how to use it very sensibly. But today I learned that intrigue appeared in the competition. It turns out that I cannot vote, but only beginners who purchased the software in 2011 and the competition is designed for them can do it. I was surprised a little, but the owner is a master. The competition is an advertising campaign and Alexander knows better how to conduct it. In general, I decided then to post an article, it is somewhat easier to write when it is clear for whom, for the entire collective farm, in fact, it is impossible to do this.
The long introduction is over, now to the point.
What does a beginner need when he has acquired such a super-harvester, which is the Xrumer + Hrefer complex? That's right, learn to work on it and discard the illusion that you can make money by starting to spam with sheets. If you think so, donate your money to charity right away. You need to learn how to use the tools of the complex, preferably by sharpening it for yourself. The time "take more - throw more" is gone. Quantity gives way to quality. So we will collect the base for ourselves, do not learn how to do this - you will lag behind the train. Of course, Hrefer will help us with this. If you plan to promote your resources on Google, then we also need to look for donor sites through Google. I think this is clear and logical. But Google, as the owner of the copper mountain, does not give its wealth to everyone. You need an approach to it. I would like to say right away that do not hope that you will be able to collect something based on the signs that you find in public. Because they are available in the public, they are worthless. I will not develop the topic further. I'd better tell you how to assemble it correctly so that you see the result, you will finish the rest yourself, the main thing is to understand the principle. It is necessary to collect on the correct basis on the basis of specific, necessary engines, and not on the basis of forums in general. This is the main mistake newbies make - not focus on the specific, but try to cover everything in its entirety. And yet, if you want to parse a more or less normal base, refuse to use operators in queries. No "inurl:", "site:", "title", etc. Google will ban searchers like you instantly. Therefore, we carefully study the engines with which Hrumer is currently working:

In version Hrumer 7.07, the program is trained in several new engines:

forumi.biz, forumb.biz, 1forum.biz, 7forum.biz, etc.

phpBB-fr.com, Solaris phpBB theme

And the process of learning new things goes on continuously.
In general, we need to prepare the correct queries for parsing by Hrefer. Let's take a forum djok for example. SMF Forums... And let's start disassembling it into parts for parsing. Our beloved Google will help us with this. Entering a Google request SMF Forums - there is a lot of garbage in the search results, rewind to some 13th page and select any link. I came across this one: http://www.volcanohost.com/forum/index.php?topic\u003d11.0. We open it and examine it. We need to find something characteristic on the page that can be applied to the search for other pages on this engine. In the footer, notice the following inscription Powered by SMF 1.1.14, we quote it and enter it into Google, it shows us that for this request, it knows about 59 million options. We quickly look through the links, add a couple more options to this keyword, for example, "Powered by SMF 1.1.14" poplar or "Powered by SMF 1.1.14" viagra... We are convinced that the request is gorgeous, in the results there are only forums and almost no garbage for you.

In addition, we are not interested in quantity, but in quality, as I said above. Go ahead. From the same forum, we take another phrase from the footer: , we also quote it and feed it to Google. In response, he reveals that he knows more than 13 million results. We again skim through the search results, add additional words and check the results with them. We make sure that the query is great and also almost no junk. In general, there are already 2 iron requests. I suggest leaving the first forum alone for now and continuing to collect requests for other forums. Fortunately, Google is open on request 2006-2008, Simple Machines LLC... We take from the SERP, for example, these forums: http://www.snowlinks.ru/forum/index.php?topic\u003d1062.0 and http://litputnik.ru/forum/index.php?action\u003dprintpage;topic\u003d380.0 in their footers, we take the following requests: "Powered by SMF 1.1.7" and "Powered by SMF 1.1.10" (I always advise to drive requests in quotation marks for Hrefer, because we need quality first of all). I think it is clear what we are doing, in the end we will have a certain base of queries to search for forums on the SMF engine (it is selected for an example, with the rest of the engines the same).
It will look something like this:

"Powered by SMF 1.1.2"

"Powered by SMF 1.1.3"

"Powered by SMF 1.1 RC2"

"Powered by SMF 1.1.4"

"Powered by SMF 1.1.8"

"Powered by SMF 1.1.7"

"2006-2008, Simple Machines LLC"

And that's not all. While collecting the versions of the engines, on some SMF forums in the footer we find the nadvis "2001-2006, Lewis Media". We check this request, it also fully satisfies us. We find a similar query: "2001-2005, Lewis Media". Running through the footers, we find the following query: "SMFone design by A.M.A, ported to SMF 1.1". We check - excellent. Etc. Half an hour of work and you have a wonderful database of queries for the engine, and Google will be banned for these queries much less often than if you use operators in them. And at the same time, your database will be much cleaner than if you use queries like "index.php? Topic \u003d", because here Google will give not only the forums we need, but also a lot of left resources where we succeeded leave a link to the forum topic. You may argue, they say, what's wrong with that? Others left a link, so we can. But! Links can leave not only Hrumer, but other programs as well. moreover, they can be specially sharpened to leave comments in a certain resource, the so-called highly specialized software, plus such links could be left by hand. Again, I repeat, it is not the amount of trash that is important to us, but the quality, the base with the right requests, we will collect anyway. The advantage of this method is that you practically will not need to configure in Hrefer sieve -filter , you can simply turn it off, because Google will practically not give you garbage.

I believe that it is very important to learn how to use Hrefer at the initial stage correctly, because having learned this, you can always find a use for Hrumer, no matter how the situation changes. The protections are becoming more complicated, and if on some types of engines the protection has been strengthened and Hrumer cannot cope with it at the moment, then there is no point in spending resources on collecting these links, and then on working on them with Hrumer, it is better to focus on what gives the result ... And at the same time, if the Botmaster Labs team taught Hrumer something new, it is possible to quickly dissect a new patient and prepare a base for Hrumer while the patient is still lukewarm. Time is money, the resource may no longer be relevant when you buy the base. collected by someone. In addition, the correct collection of bases for yourself significantly expands the "white" use of Hrumer. And this is exactly where everything is moving, whether we like it or not, and the process of whitening or graying is going on. Black sheets are a thing of the past.
All the rest, already technical aspects of working with Hrefer, can be viewed in the help and it makes no sense to dwell on them, all goals, points, seconds are set empirically for each car individually.
As a bonus, I'll post here a template for parsing the Chinese search engine Baidu, the other day I was asked about it, so I did it in between times, sorry for the pun. :)

Hostname \u003d http: //www.baidu.com
Query \u003d s? Wd \u003d
LinksMask \u003d
TotalPages \u003d 100
NextPage \u003d
NextPage2 \u003d
CaptchaURL \u003d
CaptchaImage \u003d
CaptchaField \u003d

I tried to parse them with a test, there was no ban, Hrefer collected resources lively, all requests for parsing were similar to Google's, but Chinese resources were a sea, and with a high PR, and besides, there were many places where the European did not step. It is better to parse with Chinese requests. This will help Google-translate, type a list of keywords in Russian and translate it into Chinese. Truth in " words"Hrefer words cannot be added in Chinese, you need to recode.
Instead of Chinese:

伟哥 - viagra

吉他 - guitar

其他 - rest

保险公司 - insurance

Put these codes to replace them in the word file:

% E4% BC% 9F% E5% 93% A5

% E5% 90% 89% E4% BB% 96

% E5% 85% B6% E4% BB% 96

% E4% BF% 9D% E9% 99% A9% E5% 85% AC% E5% 8F% B8

If you are promoting a website for insurance, then by placing a link in your profile on a thematic (!) Even Chinese forum found by request " forum SMF "保险公司 will be very nice.
In conclusion, I would like to say that I have never understood people who complained that the Khrefers are bad or not parsing, I always wanted to say this, you just do not know how to cook them. Better than a hrefer, no parser knows how to collect results, just the requests must be correct. Hrefer is a car: good, solid, made in German, but a person drives it and it all depends on how smart it is driven, you cannot force the car to go right and left at the same time.
A separate topic is the cleaning of bases, I once did 3 years ago for the previous competition. With more, everything is still relevant there, but now you can refuse to check for 200 OK, I really didn't really like this process, the errors were very large, a lot of unnecessary things were filtered out. Now this can be done almost automatically in the process of Hrumer's work, although this process is not a complete analogue of checking for "200 OK". In general, to the point: not so long ago in Hrumer, a wonderful opportunity appeared - to rob information from resources at the time of the project run. It looks like this. You drive in a template that will be processed in the process, and the information collected from the template will be entered into the xgrabbed.txt file in the Logs folder. You can use this function for anything, the flight of imagination is huge. I use this function once a week to remove links from the working database. It's no secret that forums die off every day in order to clean the base from such resources and the "Autograbbing" tool will help us in this case.
After all, you must admit, often typing, for example, http://www.laptopace.com/index.php, we see that this domain is already, for example, a gaddyad, but there is no forum. So, in order to throw this slag out of the base, we will loot. :) Open the source code of the page and see this entry there:

laptopace.com

For grabbing, transform it into

[...]

Now all the "dead" from the goudaddi will be known to us by name.
Here is a small selection for the "Autograbbing" tool, if you want to clean the database from different "expired" domains:

[...]

[...]
[...]
[...]

[...]

This domain may be for sale. [...] Buy this Domain