Feedcake
Attention
This script is maintained by a single person who is also a Python newbie.
If you don’t care about having article images, you should definitely use
PyFeeds instead!
Also, it only works for a very limited subset of news sites.
The Problem
Most news platforms don’t give you the full article via RSS/Atom.
That alone wouldn’t be a big problem, but some of them do crazy 1984-ish stuff on
their websites, or they have built paywalls for visitors using privacy add-ons.
Goal of this script
Getting a full-featured news feed (full articles with images) from various
news pages
Benefits for the user
- read full articles directly in your feed reader
- exclude articles by keyword in title
- no tracking
- no ads
Possible downsides for the user
- articles don’t get updated once they are scraped
- articles arrive with some delay
- interactive/special elements in articles may not work
What it does
- fetch the news feed from the original website
- scrape the contents of new entries and save them into a directory structure
- exclude articles if a string in the ‘exclude’ list is included in the title
- save a full-featured RSS file
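The exclude step above can be sketched roughly as follows. This is only an illustration, not the script's actual code: the function names `is_excluded` and `filter_entries`, the dictionary shape of an entry, and the case-insensitive comparison are all assumptions made for the example.

```python
def is_excluded(title, exclude):
    """Return True if any string from the exclude list occurs in the title.

    Assumption: matching is case-insensitive. The real script may differ.
    """
    lowered = title.lower()
    return any(term.lower() in lowered for term in exclude)


def filter_entries(entries, exclude):
    """Keep only the entries whose title contains none of the excluded strings.

    Assumption: each entry is a dict with a "title" key.
    """
    return [entry for entry in entries if not is_excluded(entry["title"], exclude)]


if __name__ == "__main__":
    entries = [
        {"title": "Football: cup final tonight"},
        {"title": "New tram line opens"},
    ]
    for entry in filter_entries(entries, ["football"]):
        print(entry["title"])
```

With an empty exclude list, every entry is kept.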
... and what it doesn’t
- manage when it scrapes (but install instructions for crontab are included)
- serve the feeds and assets via HTTPS (use your favorite web server for that)
- deal with article comments
- archive feeds (content and assets are kept, but without metadata)
- use some sort of database (the file structure is everything)
- clean up old assets
- automatically update the basedir if it has changed
(you have to clear the assets directory)
Ugly stuff?
- the HTML files (feed content) get stored alongside the assets, even though
they don’t need to be exposed via HTTPS.
- almost no exception handling yet.
How to use
- git clone this project and enter directory
- install python3, pip and virtualenv
- Create virtualenv:
virtualenv -p python3 ~/.virtualenvs/feedcake
- Activate your new virtualenv:
source ~/.virtualenvs/feedcake/bin/activate
- switch into the project’s directory:
cd feedcake
- Install requirements:
pip3 install -r requirements.txt
- copy the config-example:
cp config-example.json config.json
- edit config.json
- copy the cron-example:
cp cron-example.sh cron.sh
- edit cron.sh
- make cron.sh executable:
chmod +x cron.sh
- add a cronjob for cron.sh:
crontab -e
*/5 * * * * /absolute/path/to/cron.sh >> /path/to/logfile 2>&1
- set up your web server:
  - let your web server point to the public/feeds directory.
    You should protect the HTTP path with basic authentication.
  - let the assets_url you specified in the config earlier point to the
    public/assets directory.
- After running the script for the first time, your desired feed is available at
base_url/destination (e.g. https://yourdomain.tld/some-url/newspaper.xml)
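To make the base_url/destination pattern concrete, here is a small sketch of how the two values combine into the final feed address. The helper name `feed_url` is hypothetical and is not part of the script; only the example URL comes from the text above.

```python
def feed_url(base_url, destination):
    """Join base_url and destination with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + destination.lstrip("/")


if __name__ == "__main__":
    print(feed_url("https://yourdomain.tld/some-url/", "newspaper.xml"))
```

Stripping the slashes on both sides means it does not matter whether base_url was written with a trailing slash.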
TODOs
- Handle exceptions
- Decide what should happen with old news articles and assets which are not
listed in the current feed anymore.
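For the "Handle exceptions" item, one possible direction is to isolate each feed, so that a single failing site does not stop the whole run. The sketch below is an assumption about how this could look, not existing code: `process_all` and the `process_feed` callback are hypothetical names, and catching every `Exception` is a deliberately coarse starting point.

```python
import logging

logger = logging.getLogger("feedcake")


def process_all(feeds, process_feed):
    """Run process_feed on every feed and log failures instead of aborting.

    Returns the list of feeds that raised an exception.
    """
    failed = []
    for feed in feeds:
        try:
            process_feed(feed)
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception("Processing failed for feed %r", feed)
            failed.append(feed)
    return failed
```

Since the cron line above already redirects stderr into a logfile, messages logged this way would end up there without extra setup, assuming logging is left at its default stderr output.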