Super Rune Random code and web stuff

Feed to mail

A while back I got antisocial, moved all my news reading back to feeds and had them sent to my email. I used IFTTT for this. It is a pretty good service for getting feeds into your mailbox. You can set up a daily or weekly digest of a feed, or you can have each feed entry sent to your mail when it is posted. What didn’t work as well were the mails I received, e.g. I cannot change the layout of these mails. IFTTT also shortens all links in the mails. This means they know which links I have clicked, and I didn’t need that. Some links even failed to redirect, and I was left at IFTTT with a message that the link didn’t work.

The solution

Then I thought I could just write my own feed parser and send the entries to my own email. This could be done in C#, and I could run it on my Mac mini through the Mono project, but my Mac mini isn’t always on and I would miss out on some news. Oh, the horror.

Then I turned to my QNAP NAS. It runs 24/7, so it could run the program, but it is not powerful enough to run Mono, so I had to turn to Python 2.7, CXX or plain shell scripting. I chose Python for the task, as it has a lot of built-in functionality like I’m used to from .NET. The data is stored in a SQLite database file. SQLite feels familiar as soon as the connection is made: it works roughly like other RDBMSs such as MySQL or MSSQL, and I know I would get into a fight with those guys just for saying that.

The code

OK, you probably don’t care about my choices in the first place and just want to see some Python code.

The feed checker

For each feed I have a feed checker which runs stupidly through the feed and scans every entry. It supports RSS 2.0 and Atom 1.0.

Init

This is the constructor of my feedChecker class.

def __init__(self, feed):
    self.feedId = feed['Id']
    feedType = feed['feedType']
    self.feedTitle = feed['title']

    # self.logpath is expected to be defined elsewhere on the class.
    logging.basicConfig(level=logging.DEBUG, filename=self.logpath + '/log.txt')

    try:
        self.entries = []

        xml = self.__ReadFeed__(feed['link'])

        if xml is not None:
            if feedType == 1:    # RSS 2.0
                self.__ParseRSS2__(xml)
            elif feedType == 2:  # Atom 1.0
                self.__ParseAtom1__(xml)

            # Store every parsed entry in the buffer database.
            bh = BufferHandler()
            for row in self.entries:
                bh.AddEntry(row)
    except Exception:
        # Log the full traceback under the feed title and carry on.
        logging.exception(self.feedTitle)

As you can see I have a flag on the feed that tells me whether it is RSS or Atom. This could be detected automatically in the future so I don’t need to know this when adding a new feed.
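
The automatic detection could be as simple as looking at the root element: an RSS 2.0 document starts with <rss> and an Atom 1.0 document with <feed>. Here is a minimal sketch of that idea, written for Python 3 rather than the 2.7 used in the rest of this post. The function name detect_feed_type and the two constants are my own assumptions for illustration; they just mirror the 1 and 2 values of feedType.

```python
import xml.etree.ElementTree as ET

RSS2 = 1   # same numeric flags as feedType in the constructor
ATOM1 = 2


def detect_feed_type(xml_text):
    """Return RSS2, ATOM1 or None based on the document's root element."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as '{uri}name'; keep only the name.
    name = root.tag.split('}', 1)[-1].lower()
    if name == 'rss':
        return RSS2
    if name == 'feed':
        return ATOM1
    return None


rss_type = detect_feed_type('<rss version="2.0"><channel/></rss>')
atom_type = detect_feed_type('<feed xmlns="http://www.w3.org/2005/Atom"/>')
```

Anything that is neither comes back as None, so an unknown feed could be logged and skipped instead of being parsed with the wrong parser.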

Read the feed

It is pretty easy to open a link in Python and read its content. It is also pretty easy to parse the XML and do some XPath querying on it.

I had to do some cleaning up of the XML before parsing it. Not all feeds are strictly valid, so they must be preprocessed. This is where the clean-up comes in.

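You might know this or not, but Python has no enforced private or protected members; instead, visibility is signalled with leading underscores. By convention one leading underscore means the method is protected, and two leading underscores mean it is private. Two leading underscores also make Python mangle the name to _ClassName__method, unless the name ends with two underscores as well, as my method names here do. Other than that it is just for visualisation.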
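
To see what the underscores actually do, here is a small sketch with a made-up class called Demo (not part of my project). It should behave the same in Python 2.7 and Python 3.

```python
class Demo(object):
    def _internal(self):
        # One leading underscore: a hint to other developers, nothing is enforced.
        return 'internal'

    def __private(self):
        # Two leading underscores: stored on the class as _Demo__private.
        return 'private'

    def __dunder__(self):
        # Leading and trailing double underscores: the name is not mangled.
        return 'dunder'

    def call_private(self):
        # Inside the class the short name works as usual.
        return self.__private()


d = Demo()
via_method = d.call_private()
via_mangled_name = d._Demo__private()
```

So even a "private" method can still be reached from outside through its mangled name; the underscores are a convention, not a lock.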

Afterwards I try to parse the XML, and I remove all namespaces to make querying elements easier later on. I’m not that concerned with namespaces here. RSS and Atom are standards, and any extension to them is not supported by me, or so it would seem.

def __ReadFeed__(self, link):
    f = urllib.urlopen(link)
    xml = f.read()
    f.close()
    xml = self.__CleanUpXml(xml)
    try:
        it = ET.iterparse(StringIO(xml))
        for _, el in it:
            if '}' in el.tag:
                el.tag = el.tag.split('}', 1)[1]  # strip all namespaces

        tree = it.root
        return tree
    except Exception:
        # Parsing failed: dump the raw XML next to the log for inspection.
        # xml is a byte string, so it is written unchanged in binary mode.
        f = open(self.logpath + '/' + self.feedTitle + '.xml', 'wb')
        f.write(xml)
        f.close()
        return None


def __CleanUpXml(self, xml):
    # '16' is hex for the control character 0x16, which is not allowed in XML.
    charsToRemove = ['16']
    for c in charsToRemove:
        xml = xml.replace(c.decode('hex'), '')

    return xml
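
For comparison, here is a rough sketch of the same two steps (clean-up, then parsing with the namespaces stripped) in Python 3, where the downloaded feed is a bytes object. The function names strip_namespaces and clean_up_xml and the tiny sample feed are my own assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def clean_up_xml(xml_bytes):
    """Remove the 0x16 control character, which XML 1.0 does not allow."""
    return xml_bytes.replace(b'\x16', b'')


def strip_namespaces(xml_bytes):
    """Parse the XML and drop the '{uri}' prefix from every tag."""
    it = ET.iterparse(io.BytesIO(xml_bytes))
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
    # The root is available once the whole document has been read.
    return it.root


raw = (b'<rss xmlns:content="http://purl.org/rss/1.0/modules/content/">'
       b'<channel><item><title>Hi\x16</title>'
       b'<content:encoded><![CDATA[<p>Long text</p>]]></content:encoded>'
       b'</item></channel></rss>')
root = strip_namespaces(clean_up_xml(raw))
item = root.find('channel/item')
```

After this the extension element can be found with a plain item.find('encoded'), without spelling out the namespace URI.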

Parse an entry

RSS is the most interesting here, because feed suppliers have extended the definition and I need to take this into account. Normally an RSS feed has only one field for a short description, but some of the feed generators out there have extended it with a new field called <content:encoded>, which holds a much longer text just like the content field in Atom.

When all namespaces are removed from the XML, <content:encoded> is changed to just <encoded>. It is typically wrapped in a CDATA section, whereas the description field is escaped HTML with entities.

To keep everything in UTF-8 the encode('utf-8') calls come into play. There is a whole chapter in the Python documentation dedicated to how to handle Unicode in Python, so I will not explain it here.

def __ParseRSS2__(self, xml):
    nodes = xml.findall('channel/item')

    for node in nodes:
        n = {
                'title': ''.join(node.find('title').itertext()).encode('utf-8'),
                'link': node.find('link').text,
                'id': str(uuid.uuid4()),
                'feedId': self.feedId,
                'updated': datetime.now(),
                'description': ''  # default when the item has no text at all
            }

        # Prefer the long <content:encoded> text over the short description.
        if node.find('encoded') is not None:
            n['description'] = node.find('encoded').text.encode('utf-8')
        elif node.find('description') is not None:
            n['description'] = ''.join(node.find('description').itertext()).encode('utf-8')

        self.entries.append(n)
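
The parser above assumes that every item has a title and a link. A slightly more forgiving variant could look like the sketch below, written for Python 3 (so without the encode calls). The helpers text_of and item_to_entry are hypothetical names of mine, not part of the project.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime


def text_of(node, tag):
    """Return all text inside node/tag, or '' when the element is missing."""
    el = node.find(tag)
    return ''.join(el.itertext()) if el is not None else ''


def item_to_entry(node, feed_id):
    """Build an entry dict, preferring <encoded> over <description>."""
    return {
        'title': text_of(node, 'title'),
        'link': text_of(node, 'link'),
        'id': str(uuid.uuid4()),
        'feedId': feed_id,
        'updated': datetime.now(),
        'description': text_of(node, 'encoded') or text_of(node, 'description'),
    }


xml = ET.fromstring(
    '<rss><channel>'
    '<item><title>A</title><link>http://example.com/a</link>'
    '<description>short</description><encoded>long</encoded></item>'
    '<item><title>B</title><description>only short</description></item>'
    '</channel></rss>')
entries = [item_to_entry(n, 7) for n in xml.findall('channel/item')]
```

The second item has no link and no <encoded> element, and it still ends up as a complete entry with an empty link and the short description.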

BufferHandler

When all the parsing is done we have to save the entries for later, with some simple handling of duplicate titles. The BufferHandler class does all this.

Add entry

Adding an entry to the database is easy when we know SQL.

def AddEntry(self, entry):
    try:
        self.__Open__()
        cur = self.con.cursor()
        cur.execute(
            'INSERT INTO entry(id, title, link, description, feedid, updated) '
            'VALUES(?, ?, ?, ?, ?, ?)',
            (entry['id'],
             entry['title'].decode('utf-8'),
             entry['link'],
             entry['description'].decode('utf-8'),
             entry['feedId'],
             entry['updated']))
        self.con.commit()
    except lite.Error as e:
        # Duplicate titles violate the unique index and end up here.
        # logging.exception('AddEntry')
        pass
    finally:
        if self.con:
            self.con.close()
As you can see there is support for parametrised SQL, which is awesome. I use positional ? placeholders here, but Python's sqlite3 module also accepts name-based placeholders such as :title, much like C#. To prevent duplicate title entries I have added a unique index on the title column of the entry table. That is also why the logging call in the exception handler is commented out. A block cannot be empty in Python, so here we use the keyword pass to tell it to pass along, nothing to see here.
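
For the record, Python's sqlite3 module can fill name-based placeholders from a dictionary. Here is a small sketch of that together with the unique index on the title, using an in-memory database. The table is a cut-down version of my entry table, and the index name idx_entry_title and the function add_entry are assumptions made up for this example.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE entry(id TEXT PRIMARY KEY, title TEXT, link TEXT)')
# The unique index makes SQLite reject a second row with the same title.
con.execute('CREATE UNIQUE INDEX idx_entry_title ON entry(title)')


def add_entry(con, entry):
    """Insert an entry; return False when it violates a constraint."""
    try:
        # Named placeholders (:id, :title, :link) are filled from the dict keys.
        con.execute(
            'INSERT INTO entry(id, title, link) VALUES(:id, :title, :link)',
            entry)
        con.commit()
        return True
    except sqlite3.IntegrityError:
        return False


first = add_entry(con, {'id': '1', 'title': 'Hello', 'link': 'http://example.com/1'})
second = add_entry(con, {'id': '2', 'title': 'Hello', 'link': 'http://example.com/2'})
count = con.execute('SELECT COUNT(*) FROM entry').fetchone()[0]
```

Catching only sqlite3.IntegrityError keeps the duplicate handling silent while other database errors still surface.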

Mailer

For sending emails and ensuring they are encoded in UTF-8 I have borrowed the mail method from this site, which is pretty good.

Before I send the mail I prepare it through SendInstant.

def SendInstant(self, entries):

    template = codecs.open('Instant.html', 'r', 'utf-8').read()
    bh = BufferHandler()  # used below to mark entries as sent

    for entry in entries:
        # Fill the template markers with the values of this entry.
        body = template.replace('{title}', entry['title'])
        body = body.replace('{description}', entry['description'])
        body = body.replace('{link}', entry['link'])
        body = body.replace('{feedtitle}', entry['feedTitle'])
        subject = entry['title']
        success = self.Send(entry['feedTitle'] + ' <example@example.dk>', subject, body)

        if success == 1:
            bh.MarkSentEntry(entry['id'])

The method reads the template, finds the markers and replaces them with real content. To make it easy to see which feed an entry comes from, I add the feed title to the sender name, which shows up in the mail header. If the mail is sent successfully I mark the entry as sent.
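
The Send method itself is not shown here. As a rough idea of what building such a mail can look like with only the standard library, here is a sketch in Python 3 that fills the template and wraps it in a UTF-8 HTML message. It is not the borrowed method; the function name build_mail, the template string and the addresses are assumptions for illustration, and nothing is actually sent.

```python
from email.header import Header
from email.mime.text import MIMEText
from email.utils import formataddr


def build_mail(template, entry, sender_address, recipient):
    """Fill the template for one entry and wrap it in a UTF-8 HTML message."""
    markers = {
        '{title}': entry['title'],
        '{description}': entry['description'],
        '{link}': entry['link'],
        '{feedtitle}': entry['feedTitle'],
    }
    body = template
    for marker, value in markers.items():
        body = body.replace(marker, value)

    msg = MIMEText(body, 'html', 'utf-8')
    msg['Subject'] = Header(entry['title'], 'utf-8')
    # The feed title becomes the display name of the sender.
    msg['From'] = formataddr((entry['feedTitle'], sender_address))
    msg['To'] = recipient
    return msg


template = '<h1>{title}</h1>{description}<a href="{link}">{feedtitle}</a>'
entry = {
    'title': 'Bl\u00e5b\u00e6r',
    'description': '<p>Text</p>',
    'link': 'http://example.com/1',
    'feedTitle': 'My Feed',
}
msg = build_mail(template, entry, 'example@example.dk', 'me@example.dk')
html = msg.get_payload(decode=True).decode('utf-8')
```

The resulting message object could then be handed to smtplib for the actual delivery.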

Conclusion

This was a fun little project, or rather still is. There are still things I could implement:

  • Digest feeds daily or weekly, and specify when the digest should be sent: time of day and day of week
  • Template for digests with an index
  • Clean up previously sent entries to keep the database neat and tidy, with per-feed settings for this
  • Add tags to the feeds for easier filtering in the mail client
  • Web interface for adding a new feed instead of the SQLite command line or an application
  • Split the feed checking into threads for greater performance
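
The last item on the list could be sketched with a small thread pool. This is only an idea of how it might look in Python 3, not something I have implemented; check_feed is a placeholder standing in for the real feed checker, and check_all and max_workers are names I made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def check_feed(feed):
    """Placeholder for the real work; here it only returns the feed title."""
    return feed['title']


def check_all(feeds, max_workers=4):
    """Run check_feed for every feed in a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as the input.
        return list(pool.map(check_feed, feeds))


feeds = [{'Id': 1, 'title': 'A'}, {'Id': 2, 'title': 'B'}]
results = check_all(feeds)
```

Since most of the time is spent waiting for the network, threads should help here. Each thread would need its own SQLite connection, because by default a sqlite3 connection may only be used in the thread that created it.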

Python

In Python you get stuff done quickly. It took me a whole day to complete this project, and the syntax and the insistence on indenting your code correctly got in my way a few times until I got the hang of it.

But the most time-consuming part was figuring out how to use Unicode correctly in Python; that held me up for a few hours.

All in all, once I got around the obstacles, Python is kinda great, but I wouldn’t replace C# with it anytime soon.

Update

The entire codebase is now live on GitHub. Go nuts or go home.