RSS stands for Really Simple Syndication. Put simply, it's a format based on XML that is used to distribute large amounts of information.
Over the past few years, RSS has fallen out of favor among many developers, who favor proprietary APIs instead. A few people even say that RSS is obsolete and should NOT be used for anything.
But RSS can still be used to obtain a vast amount of interesting, and useful, information about a great many subjects, like information from the NATIONAL HURRICANE CENTER and CENTRAL PACIFIC HURRICANE CENTER. It can be used to download and filter current news stories ( called a news aggregator ). It can be used to download earthquake data, weather forecasts, news about volcanoes, and more.
And... it can still be used to download directory listings of podcasts, which is the use most people know it for.
Yes, you can find the exact same info using your web browser, or one of the many RSS feed readers that can be downloaded for FREE.
So, why do it in Python?
The Python code that I'm showing in this post is intended to be a starting point for automated applications, like an alarm that trips when a news story appears on the web about a particular person, or a job that automatically imports earthquake info into a database. Using RSS for this sort of thing is much simpler, and easier to maintain, than say, a web-page scraper.
Plus, it's just an interesting subject.
More info about RSS, its history, and its standards, can be found at https://en.wikipedia.org/wiki/RSS
RSS Feeds - A service that supplies data in an RSS format is called an RSS Feed. Doing an internet search for the words 'RSS Feeds' returns hundreds of them: news feeds, podcasts, business data, weather, and more. The RSS feeds that I am showcasing in this post are just a short sample, but they all work in much the same way. So once you understand how one works, you have mastered them all.
The date/time values used in many RSS feeds are in ISO8601 format ( yyyy-mm-ddThh:mm:ss.xxxxZ ).
The letter ‘Z’ at the end indicates that this is UTC time; meaning that this is the time in Greenwich, England, and not your local timezone.
So for example, if you live in Chicago and Daylight Saving Time is in effect (CDT timezone), you would subtract 5 hours from the time shown to get your local time; when it is not in effect (CST), you would subtract 6 hours.
The TIMEBIE website has a conversion tool that lets you easily convert from UTC time into your local timezone, and back again. It will also let you print a conversion table, which many people may find handy. The TIMEBIE website is http://www.timebie.com
For more info about the ISO8601 time format, see https://en.wikipedia.org/wiki/ISO_8601
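You can also do the conversion in code, using nothing but Python's standard datetime module. A minimal sketch ( the timestamp here is a made-up example; fromisoformat() needs Python 3.7 or later ):

from datetime import datetime

# A made-up ISO8601 timestamp, like the ones found in many RSS feeds.
stamp = "2020-09-23T18:45:00.000Z"

# fromisoformat() does not accept the trailing 'Z', so swap it
# for an explicit UTC offset first.
utcTime = datetime.fromisoformat( stamp.replace("Z", "+00:00") )

# astimezone() with no arguments converts to this computer's local timezone.
localTime = utcTime.astimezone()

print( "UTC:   ", utcTime )
print( "Local: ", localTime )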
There are a number of different ways to work with RSS data in Python; most of them require you to download and install a module that is not part of the standard library. Examples are BeautifulSoup and feedparser.
I prefer not to go this route, because it requires an extra step, and the project is not easily transferable to another computer ( the modules might not be installed there ).
Instead…. I have written a function called RSSGet() from scratch that will download the data, parse it, and return it in a usable form. The sample code I am showing later in this post will be making use of this function.
Please feel free to cut/paste this into your projects if you wish.
#!/usr/bin/python3

import urllib.request

# Keys generally used in RSS, XML and ATOM.
rssKeys = """
author.category.channel.feed.copyright.rights.subtitle.description.
summary.content.generator.guid.id.image.logo.item.entry.lastBuildDate.
updated.link.managingEditor.author.contributor.pubDate.published.title.ttl.
url.image.icon.updated.enclosure.
"""

# These keys are used by the National Hurricane Center.
rssKeys = rssKeys + """
nhc:center.nhc:type.nhc:name.nhc:wallet.nhc:atcf.nhc:datetime.
nhc:movement.nhc:pressure.nhc:wind.nhc:headline.
"""

# Transform rssKeys from a string into a usable list.
rssKeys = rssKeys.replace("\n", "").replace(chr(32), "").strip().split(".")


def RSSGet(url):
    """Retrieve and parse an RSS, XML or ATOM feed."""
    html = ""
    stack = []
    core = {}
    prefix = ""
    try:
        with urllib.request.urlopen(url) as response:
            html = response.read().decode().strip()
    except:
        pass
    html = str(html).replace("<" + "br/>", "\n")
    html = str(html).replace("<" + "br />", "\n")
    html = str(html).replace("<" + "![CDATA[", "")
    html = str(html).replace("]]>", "")
    # Mark the opening and closing tag of every known key with chr(200),
    # so the text can be split into key/value pieces below.
    for key in rssKeys:
        html = html.replace("<" + key + ">", chr(200) + key + chr(200))
        html = html.replace("<" + "/" + key + ">", chr(200) + "/" + key + chr(200))
    # <item ...> and <enclosure ...> carry attributes, so they are handled
    # separately; their attributes end up at the start of the value string.
    # ( enclosure is where podcast feeds keep the download link. )
    html = html.replace("<" + "item ", chr(200) + "item" + chr(200))
    html = html.replace("<" + "enclosure ", chr(200) + "enclosure" + chr(200))
    html = html.split(chr(200))
    for x in range(0, len(html) - 1):
        key = html[x]
        value = html[x + 1].strip()
        if key == "image":
            prefix = key + "."
        if key == "/image":
            prefix = ""
        if key in rssKeys and value != "" and key != "":
            core[prefix + key] = value
        if key == "/item" or key == "/entry" or key == "item" or key == "entry":
            if core != {}:
                stack.append(core)
                core = {}
                prefix = ""
    return stack
Here is an example of how you might use RSSGet() in your project:
items = RSSGet( 'http://www.cbn.com/cbnnews/us/feed/' )
for item in items:
    print( 'title:   ', item.get('title') )
    print( 'pubDate: ', item.get('pubDate') )
    print( 'link:    ', item.get('link') )
    print( "" )
    print( item.get('description') )
    print( '==============================================================' )
    print( "" )
As you can see in the code example above, an RSS feed contains one or more items, which you can think of as records. Each item may contain the information about one news story, blog post, earthquake, etc. It’s one item per event / post.
Each item usually contains the fields title, pubDate, link, and description. There may be many other fields, depending on the kind of data the RSS feed carries, but it’s a good bet that these 4 fields will be in EVERY item in an RSS feed.
This is the key to understanding how RSS works. Once you have that, the rest is pretty easy.
The data for each item is stored in a Python data type called a dictionary.
You can retrieve the value of any of these fields in the item using the Python dictionary.get() method.
If you try to retrieve a value that’s not in the dictionary ( has no key by that name ), like
something = str( item.get('BadKeyName') )
then the string 'None' is returned ( .get() returns None, and str() turns it into the string 'None' ). This will NOT crash ( abend ) the program.
I know that for some people, what I just said can be confusing.
So.. if you're looking for clarification about this, see
https://www.w3schools.com/python/ref_dictionary_get.asp
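Here is a quick illustration, using a made-up item dictionary:

# A made-up item, with only two of the usual fields present.
item = {'title': 'Sample story', 'link': 'http://example.com/story'}

print( item.get('title') )            # Prints: Sample story
print( item.get('pubDate') )          # Key is missing, prints: None
print( item.get('pubDate', 'n/a') )   # Missing, but a default is given, prints: n/a

# By contrast, item['pubDate'] ( without .get() ) would raise a
# KeyError and stop the program.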
In today’s crazy world, keeping up with the news can be like drinking from a fire-hose. The sheer number of news stories can be overwhelming. And are you really all that interested in all the fluff pieces and talking heads? Are you really interested in the green paint shortage in Japan, or in what the cool new baby names for 2020 are?
Wouldn't it be nice to filter the news feeds for stories you're interested in?
Doing an internet search for the words “News RSS feeds” will come up with a very long list of services. Some are paid subscriptions, and some are FREE services. For the code example below, I have selected 3 of the more reputable services, but you could easily add as many as you would like.
Take a few minutes to consider the following code example. It will scan the 3 RSS news feeds, and print a list of stories that contain one or more of the search terms. A link is also included if you are interested in reading more about the story.
Note that this uses the RSSGet() function that is shown above.
def RSSFilterNews(url, searchTerms):
    for story in RSSGet(url):
        bolFlag = False
        text = story.get( 'description', '' ).split("&" + "lt;div")
        for search in searchTerms:
            if search.lower() in text[0].lower():
                bolFlag = True
        if bolFlag:
            print("***** From: ", url)
            print("-" * 80)
            print( 'title: ', story.get('title') )
            print( "link:  ", story.get('link') )
            print( story.get('pubDate') )
            print( "" )
            print( text[0] )
            print( ("=" * 80) + "\n" )

# Words to search for.
searchTerms = ['Seattle', 'baking', 'plywood', 'alligator', 'python']

# CNN Top News Stories.
RSSFilterNews( "http://rss.cnn.com/rss/cnn_topstories.rss", searchTerms )

# New York Times home page.
RSSFilterNews( "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml", searchTerms )

# CBN News Feed.
RSSFilterNews( "http://www.cbn.com/cbnnews/us/feed/", searchTerms )
The RSS feeds for blogs work in exactly the same way as those for news stories ( see the section above ). You can think of a blog feed as just another news feed, from a different source.
You could use the same function shown above to filter a blog feed. But it’s more likely that you would want to see a list of the most recent 10 postings ( example below ).
def fetchBlogs(url, MaxPosts = 10):
    """Fetch the most recent posts for a given blog url."""
    items = RSSGet( url )
    blogName = items[0].get('title')   # items[0] holds the channel-level info, i.e. the blog's name.
    c = len( items )
    if c > MaxPosts:
        c = MaxPosts
    for x in range( 1, c ):
        item = items[x]
        print( 'blogName: ' + blogName )
        print( 'title:    ' + item.get('title') )
        print( 'pubDate:  ' + item.get('pubDate') )
        print( 'link:     ' + item.get('link') )
        print( "=" * 80 )

fetchBlogs( "https://www.howtogeek.com/feed/" )
The U.S. Geological Survey (USGS) has a great deal of FREE data about earthquakes. Their website is https://www.usgs.gov
They also have many RSS feeds: https://earthquake.usgs.gov/earthquakes/feed/
I am using the one containing data for the past 7 days: https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.atom
Note that these RSS feeds are only updated every few hours, so this would not be useful for a real-time quake alarm system. For that, you would need to subscribe to one of the many available push-notification services.
# Earthquake Lists.
url = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.atom"
records = RSSGet(url)
#print( records )

# Display all quakes over the past 7 days, of a mag 3.0 or greater.
# Quake titles look like "M 4.5 - 10 km NE of ...", so a simple string
# comparison against "M 3.0" works; the startswith() check skips the
# feed-level record, whose title does not begin with "M ".
for record in records:
    title = str( record.get('title') )
    if title.startswith("M ") and title > "M 3.0":
        print( record.get('updated') + ": " + title )
print("=" * 80)

# Display all quakes over the past 7 days, in Oklahoma.
for record in records:
    if "Oklahoma" in str( record.get('title') ):
        print( record.get('updated') + ": " + record.get('title') )
print("=" * 80)
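If you wanted to act on new quakes as the feed updates ( within the every-few-hours limit noted above ), one simple approach is to re-poll on a timer and remember which entries you have already seen. A minimal sketch; the polling interval and loop count are just illustrative assumptions:

import time

url = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.atom"
seen = set()

for _ in range(3):          # Poll a few times; a real app might loop forever.
    for record in RSSGet(url):
        quakeId = record.get('id')   # ATOM entries carry a unique <id> field.
        if quakeId and quakeId not in seen:
            seen.add(quakeId)
            print( "New record:", record.get('title') )
    time.sleep(3600)        # Wait an hour between polls.

On the first pass everything is new ( including the feed-level record ), so everything prints; after that, only quakes that were not seen before will print.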
The Volcano Discoveries website also has an RSS feed with detailed info about recent earthquakes.
I will have more info about the Volcano Discoveries website in the section about volcanoes ( below ).
# Earthquake Reports
records = RSSGet("https://www.volcanodiscovery.com/earthquake-news.rss")
for record in records:
    print( record.get('title') )
    print( record.get('pubDate') )
    print( "" )
    print( record.get('description') )
    print( ("=" * 80) + "\n" )
The NATIONAL HURRICANE CENTER and CENTRAL PACIFIC HURRICANE CENTER have a number of RSS feeds concerning hurricanes.
For a list of their RSS feeds, see https://www.nhc.noaa.gov/aboutrss.shtml
I find that the https://www.nhc.noaa.gov/index-at.xml feed is quite good.
# Hurricane Reports
for record in RSSGet("https://www.nhc.noaa.gov/index-at.xml"):
    headline = record.get( 'nhc:headline' )
    if headline is not None:
        print("*** NEWS FLASH from the National Hurricane Center ***")
        print( headline )
        print( "" )
    print( "title:   ", record.get('title') )
    print( "pubDate: ", record.get('pubDate') )
    if record.get( 'nhc:center' ) is not None:
        print( "" )
        print( "StormCenter: ", record.get( 'nhc:center' ) )
        print( "Heading:     ", record.get( 'nhc:movement' ) )
        print( "Pressure:    ", record.get( 'nhc:pressure' ) )
        print( "Wind Speed:  ", record.get( 'nhc:wind' ) )
    print( "" )
    print( record.get('description') )
    print( "=" * 80 )
    print("")
The Volcano Discoveries website ( https://www.volcanodiscovery.com ) has a very good RSS feed concerning volcanoes.
They also have a feed with detailed info about earthquakes ( see EARTHQUAKES section above).
Info about their RSS feeds is at https://www.volcanodiscovery.com/rss-feeds.html
# Volcano Reports
records = RSSGet("https://www.volcanodiscovery.com/volcanonews.rss")
for record in records:
    print( record.get('title') )
    print( record.get('pubDate') )
    print( "" )
    print( record.get('description') )
    print( ("=" * 80) + "\n" )
There are a number of FREE weather related RSS feeds, but I find that the ones from the BBC are the most useful.
Yes, that IS the BBC, the British TV and radio service in England.
Most other FREE weather-related RSS feeds are only focused on their country of origin, with little, if any, info for other countries. The BBC weather feeds simply have the simplest and most up-to-date weather info for any location in the world.
To use this RSS feed:
1. Go to
https://www.bbc.com/weather
and enter the name of the city nearest to you in the space at the top of the web page. In my case, that's Seattle.
2. Note the 7-digit code for your city on the URL line of your browser. In my case, it's 5809844.
3. In the example code below, replace my 7-digit code ( 5809844 ) with the 7-digit code for your city.
Examples:
The code for Seattle is 5809844
The code for New York is 5128581
The code for Chicago (Midway Airport) is 4887472
The code for Paris is 2988507
# Weather - Current conditions.
url = "https://weather-broker-cdn.api.bbci.co.uk/en/observation/rss/5809844"
for record in RSSGet(url):
    print( record.get("description") )
    print("")

# Weather forecast.
url = "https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/5809844"
for record in RSSGet(url):
    print( record.get("title") )
    print( record.get("description") )
    print("")
And YES, RSS can still be used to download directory listings for podcasts.
The following code example will download the directory listings of 3 popular podcasts, and save the info into a text file named 'podcasts.txt'. The info will include a URL link that can be used to download, or play, the podcast if you choose.
You could add more podcasts to the list by finding the RSS feed of your favorite podcast ( usually a link on the podcast home page ), and adding an additional line of code to this example.
Use your favorite text editor to view the file, and any web browser, or the VLC media player, to play the podcasts.
In short, this is a very primitive, but workable, pod-catcher. It can also be an interesting toy to play with.
def fetchPodcast(url, MaxEpisodes = 10):
    """Fetch the most recent episodes for a given podcast."""
    items = RSSGet(url)
    pcName = items[0].get('title')   # Get the name of the podcast.
    c = len(items)
    if c > MaxEpisodes:
        c = MaxEpisodes
    with open('podcasts.txt', 'a') as output:
        for x in range(1, c):
            item = items[x]
            guid = str( item.get('guid') )
            if ">" in guid:
                guid = guid.split(">")[1]
            link = str( item.get('enclosure') )
            if 'url="' in link:
                link = link.replace('url="', "")
            if ".mp3" in link:
                link = link.split('.mp3')[0] + '.mp3'
            description = str( item.get('description') ).split("<")[0]
            output.write( 'Podcast: ' + pcName + "\n" )
            output.write( 'title:   ' + str( item.get('title') ) + "\n" )
            output.write( 'pubDate: ' + str( item.get('pubDate') ) + "\n" )
            output.write( 'unique ID number: ' + guid + "\n" )
            output.write( link + "\n" )
            output.write( '\n' + description + "\n" )
            output.write( ("=" * 80) + "\n" )

# If the file already exists, erase it.
with open('podcasts.txt', 'w') as output:
    output.write("")

# The Hidden Brain podcast
fetchPodcast( "https://feeds.npr.org/510308/podcast.xml" )

# Fetch the most recent 20 episodes of 'Cut & Paste' podcast.
url = "https://kwmu-rss.streamguys1.com/cut_and_paste/cut-and-paste.xml"
fetchPodcast( url, 20 )

# Fetch the most recent 15 episodes of 'Talk Python to me' podcast.
fetchPodcast( "https://talkpython.fm/episodes/rss", 15 )
There are also RSS feeds concerning the current flow of rivers, water levels in lakes, air quality, space weather, the northern lights, and the list goes on. You only need to find them.
And that concludes what I have to say about Python and RSS. I hope that someone out there finds this useful.
Everyone have a good day, and be kind to each other.
Joe Roten. www.gsw7.net/joe
Last updated: 2020-09-23