Macnyt RSS proxy

Introduction

Is it possible to create an RSS proxy for the forum on http://www.macnyt.dk?

Macnyt doesn't offer an RSS feed for its forum, only for news. It would be nice to have an RSS feed for the new submissions to the forum on this URL: http://www.macnyt.dk/forum/?page=new_submissions

Status

A beta has been released at http://www.kika.dk/macnyt/macnytforum.php. --Kim Bach 07:56, 11 June 2006 (CEST)

Well, the sysop at macnyt.dk didn't really like what I did, and you can't really blame him.

If you inspect the code you can see that it shouldn't put much strain on the server, since it simply does an HTTP GET of the HTML; the rest of the processing is done in PHP. Yes, I know that it could be implemented more efficiently, but heck, it's my first PHP project created from scratch. The sysop also claims that scraping is illegal. I don't think so: it's really a kind of deep linking, and there is precedent for the legality of that.

My code will remain up (for now), but I urge you to host it locally and to use it sparingly. I must admit that I like it myself; it has increased the usability of macnyt.dk quite a bit for me.

I hope for an official forum RSS feed, but I doubt that it will follow the standards, since the current news feed doesn't work with Firefox. --Kim Bach 22:55, 11 June 2006 (CEST)

The current news feed has been fixed, but there is still no official forum RSS feed. I did fix a bug: after a reinstall of the Macnyt server the code broke because of an extra w in the URLs (four w's instead of three). This worked before the update, but I guess that the * alias has been removed. --Kim Bach 11:12, 4 September 2006 (CEST)

Changelog

  • Version 0.1.1: No new features, just some clean-up. --Kim Bach 05:49, 18 June 2006 (CEST)

Analysis

Reverse engineering of the Macnyt New Submissions page shows that it is quite simple to create a scraper and to implement it in PHP. Basically, the scraper looks for these strings:

<td class="forum_text">
<td class="forum_headline"><B><a href="
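As an illustration of how the headline marker above can be used, here is a minimal sketch that pulls the link and the title out of one line with a single regular expression. The sample line, its URL path and the function name extract_headline are assumptions made for this example; the real markup on macnyt.dk may differ.

```php
<?php
// Hypothetical sample line; the real markup on macnyt.dk may differ.
$sample = '<td class="forum_headline"><B><a href="/forum/?page=thread&id=1">Hello</a></B></td>';

// Return the absolute link and the anchor text that follow the headline
// marker, or null when the line is not a headline row.
function extract_headline($line)
{
    $pattern = '~<td class="forum_headline"><B><a href="([^"]*)"[^>]*>([^<]*)<~i';
    if (preg_match($pattern, $line, $m)) {
        return array(
            'link'  => 'http://www.macnyt.dk' . $m[1], // links on the page are site-relative
            'title' => $m[2],
        );
    }
    return null;
}

$item = extract_headline($sample);
```

The first capture group stops at the closing double quote of the href attribute, and the second stops at the next `<`, which is the same idea the script below implements with character-counting loops.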

Usage

The script is hosted at http://www.kika.dk/macnyt/macnytforum.php. Add that URL manually to your feeds. Please use it sparingly, so that we don't anger the sysop.
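If you host the script yourself, one way to keep the load on macnyt.dk low is to cache the fetched HTML for a while. The following is only a sketch of that idea and is not part of macnytforum.php; the function name cached_fetch, its parameters and the 15-minute lifetime are assumptions.

```php
<?php
// Return the cached content if it is fresh enough; otherwise call $fetch
// and store its result. $cache_file, $max_age (seconds) and $fetch are
// hypothetical names, not part of macnytforum.php.
function cached_fetch($cache_file, $max_age, $fetch)
{
    clearstatcache(); // make sure is_file()/filemtime() see the current state
    if (is_file($cache_file) && (time() - filemtime($cache_file)) < $max_age) {
        return file_get_contents($cache_file);
    }
    $content = call_user_func($fetch);
    if ($content !== false) {
        // Only overwrite the cache when the fetch succeeded.
        file_put_contents($cache_file, $content, LOCK_EX);
    }
    return $content;
}

// Usage sketch: contact macnyt.dk at most once every 15 minutes.
// $html = cached_fetch(sys_get_temp_dir() . '/macnyt_forum.html', 900, function () {
//     return @file_get_contents('http://www.macnyt.dk/forum/?page=new_submissions');
// });
```

Passing the fetch step in as a callback keeps the caching logic independent of how the page is actually downloaded.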

Code

Below is the code (macnytforum.php):

<?php
# Macnyt Danmark Forum RSS feed converter.
#
# Last update:
# 2006-09-04 KB   Version 0.1.2 fixed bug in URL, had 4 w's instead of 3! And this broke
#                 when a new server went online 
#
# Copyright Kim Bach, kim(dot)bach(at)gmail.com
#
# Project homepage: http://www.kimbach.org/wiki/index.php/Macnyt RSS proxy
# Version 0.1.2, 04 September 2006
#
# Revision history:
# Date       Init Description
# 2006-06-09 KB   Created
# 2006-06-11 KB   Version 0.1.0 first beta
# 2006-06-18 KB   Version 0.1.1 clean up
# 2006-09-04 KB   Version 0.1.2 fixed bug in URL, had 4 w's instead of 3! And this broke
#                 when a new server went online 
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or 
# (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
# http://www.gnu.org/copyleft/gpl.html

$page_title = "macnyt";
$meta_descr = "n/a";
$meta_keywd = "n/a";
$macnyt_forum_url = "http://www.macnyt.dk/forum/?page=new_submissions";
$post_link_prefix = "<td class=\"forum_headline\"><B><a href=\""; 
$forum_link_prefix = "http://www.macnyt.dk";
$post_description_prefix = "<td class=\"forum_text\">";

if ($handle = @fopen($macnyt_forum_url, "r")) {
    $content = "";
    while (!feof($handle)) {
        $part = fread($handle, 1024);
		$content .= $part;
    }
    fclose($handle);
	
    $lines = preg_split("/\r?\n|\r/", $content); // split the content into lines
    $is_title = false;
    $is_descr = false;
    $is_keywd = false;
    $is_headline = false; // true while an <item> is open and waiting for its description
    header("Content-Type: text/xml");
    echo("<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>");
    echo("<!DOCTYPE rss [<!ENTITY % HTMLlat1 PUBLIC \"-//W3C//ENTITIES Latin 1 for XHTML//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent\">]>");
    echo("<rss version=\"2.0\" xml:base=\"http://www.kika.dk/macnyt\">");
    echo("<channel>");
    $has_header = false;

    foreach ($lines as $val) {
        // preg_match() with the i modifier replaces eregi(), which was removed in PHP 7.
        if (preg_match('~<title>(.*)</title>~i', $val, $title)) {
            $page_title = $title[1];
            $is_title = true;
        }
        if (preg_match('~<meta name="description" content="(.*)"(\s?/)?>~i', $val, $descr)) {
            $meta_descr = $descr[1];
            $is_descr = true;
        }
        if (preg_match('~<meta name="keywords" content="(.*)"(\s?/)?>~i', $val, $keywd)) {
            $meta_keywd = $keywd[1];
            $is_keywd = true;
        }
        if ($is_title && $is_descr && $is_keywd && !$has_header) {
			echo("<title>" .$page_title. "</title>");
			echo("<link>".$macnyt_forum_url."</link>");
			echo("<description>" .$meta_descr. "</description>");
			echo("<language>da</language>");
			$has_header = true;
		}
		if (!$is_headline && stripos($val, $post_link_prefix) !== false) {
			// Extract the link: it starts after the third double quote on the
			// line, i.e. the one that opens the href attribute.
			// $pingcount counts the double quotes seen so far.
			$pingcount = 0;
			for ($i = 0; $i < strlen($val); $i++) {
				if (substr($val, $i, 1) == '"') {
					// double quote found, increase count
					$pingcount++;
					if ($pingcount == 3) {
						$forum_link = substr($val, $i + 1);
						$pingcount = 0;
						
						// Find the closing double quote of the href attribute
						for ($j = 0; $j < strlen($forum_link); $j++) {
							
							if (substr($forum_link, $j, 1) == '"') {
								// closing quote found; prepend the site prefix
								$forum_link=$forum_link_prefix.substr($val, $i + 1, $j);
								break;
							}
						}
						break;
					}
				}
			}
			
			// Extract the title: it starts after the third '>' on the line,
			// i.e. the one that closes the opening <a> tag.
			$gtcount = 0;
			for ($i = 0; $i < strlen($val); $i++) {
				if (substr($val, $i, 1) == '>') {
					// '>' found, increase count
					$gtcount++;
					if ($gtcount == 3) {
						$forum_title = substr($val, $i + 1);
						$gtcount = 0;
						
						// Find the next '<', which ends the title text
						for ($j = 0; $j < strlen($forum_title); $j++) {
							if (substr($forum_title, $j, 1) == '<') {
								// '<' found; keep only the text before it
								$forum_title = substr($val, $i + 1, $j);
								break;
							}
						}
						break;
					}
				}
			}
			$is_headline = true;
			echo("<item>");
			echo("<title>".$forum_title."</title>");
			echo("<link><![CDATA[".$forum_link."]]></link>");
		}
		if ($is_headline && stripos($val, $post_description_prefix) !== false) {
			echo("<description>"."<![CDATA[".$val."]]>"."</description>");
			//echo("<category domain=\"http://macwiki.kimbach.org/portal/?q=taxonomy/term/5\">Samarbejdspartnere</category>");
			//echo("<pubDate>Fri, 21 Apr 2006 03:10:09 +0200</pubDate>");
			echo("</item>");
			$is_headline = false;
		}
    }
	echo("</channel>");
	echo("</rss>");
}
?>
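
The script writes the scraped title straight into the <title> element, so a title containing & or < would make the feed invalid XML. As a possible improvement, here is a minimal sketch of a helper that escapes the title and guards the CDATA sections. The function name rss_item and the sample values are assumptions and are not part of macnytforum.php.

```php
<?php
// Build one RSS <item>. The title is XML-escaped; link and description are
// wrapped in CDATA sections, as in the script above.
function rss_item($title, $link, $description)
{
    // "]]>" would end a CDATA section early, so split it across two sections.
    $cdata = function ($s) {
        return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $s) . ']]>';
    };
    return '<item>'
        . '<title>' . htmlspecialchars($title, ENT_COMPAT, 'ISO-8859-1') . '</title>'
        . '<link>' . $cdata($link) . '</link>'
        . '<description>' . $cdata($description) . '</description>'
        . '</item>';
}

// Hypothetical sample values for illustration only.
$xml = rss_item('Tips & tricks', 'http://www.macnyt.dk/forum/', '<td class="forum_text">Hej</td>');
```

The encoding argument matches the iso-8859-1 declaration that the script emits in its XML prolog.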