- # Feedcake
-
- ### Attention
- This script is maintained by a single person who is also a Python newbie.
- If you don't care about having article images, you should definitely use
- [PyFeeds](https://github.com/PyFeeds/PyFeeds) instead!
- Also, it only works for a very limited subset of news sites.
-
- ### The Problem
- Most news platforms don't give you the full article via RSS/Atom.
- That alone wouldn't be a big problem, but some of them do crazy 1984-ish stuff
- on their websites or have put up paywalls for visitors using privacy add-ons.
-
- ### Goal of this script
- Get a full-featured news feed (full articles with images) from various
- news sites.
-
- ### Benefits for the user
- * read full articles directly in your feed reader
- * exclude articles by keyword in title
- * no tracking
- * no ads
-
- ### Possible downsides for the user
- * articles don't get updated once they are scraped
- * articles arrive with some delay
- * interactive/special elements in articles may not work
-
- ### What it does
- * fetches the news feed from the original website
- * scrapes the contents of new entries and saves them into a directory structure
- * excludes articles whose title contains a string from the 'exclude' list
- * saves a full-featured RSS file
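
The keyword exclusion step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names `is_excluded` and `filter_entries`, the dictionary-shaped entries, and the case-sensitive substring match are all assumptions.

```python
def is_excluded(title, exclude):
    """Return True if any non-empty string from the exclude list occurs in the title.

    Assumption: a plain, case-sensitive substring match.
    """
    return any(keyword and keyword in title for keyword in exclude)


def filter_entries(entries, exclude):
    """Keep only the entries whose title contains none of the excluded strings."""
    return [entry for entry in entries if not is_excluded(entry["title"], exclude)]


# Hypothetical sample data, not taken from a real feed.
entries = [
    {"title": "Sports: local team wins", "link": "https://example.org/1"},
    {"title": "City council passes budget", "link": "https://example.org/2"},
]
kept = filter_entries(entries, exclude=["Sports"])
print([entry["title"] for entry in kept])
```

Empty strings in the list are skipped here, because an empty string is a substring of every title and would otherwise exclude all articles.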
-
- ### ... and what it doesn't
- * Managing when it scrapes (but install instructions for crontab are included)
- * Serving the feeds and assets via HTTPS (use your favorite web server for that)
- * Dealing with article comments
- * Archiving feeds (content and assets are kept, but without metadata)
- * Using some sort of database (the file structure is everything)
- * Cleaning up old assets
- * Automatically updating the basedir if it has changed
- (you have to clear the assets directory yourself)
-
- ### Ugly stuff?
- * the HTML files (feed content) get stored alongside the assets, even though
- they don't need to be exposed via HTTPS.
- * almost no exception handling yet.
-
- ### How to use
- * git clone this project and enter its directory
- * install python3, pip and virtualenv
- * Create virtualenv: `virtualenv -p python3 ~/.virtualenvs/feedcake`
- * Activate your new virtualenv: `source ~/.virtualenvs/feedcake/bin/activate`
- * switch into the project's directory: `cd feedcake`
- * Install requirements: `pip3 install -r requirements.txt`
- * copy the config example: `cp config-example.json config.json`
- * edit `config.json`
- * copy the cron example: `cp cron-example.sh cron.sh`
- * edit `cron.sh`
- * make `cron.sh` executable: `chmod +x cron.sh`
- * add a cronjob for `cron.sh`: `crontab -e`
-   * `*/5 * * * * /absolute/path/to/cron.sh >> /path/to/logfile 2>&1`
- * set up your web server:
-   * let your web server point to the `public/feeds` directory.
-     You should protect the HTTP path with basic authentication.
-   * let the `assets_url` you specified in the config earlier point to the
-     `public/assets` directory.
- * After running the script the first time, your desired feed is available at
- `base_url/destination` (e.g. `https://yourdomain.tld/some-url/newspaper.xml`)
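
As a small illustration of how the final feed address comes together, assuming `base_url` and `destination` are plain strings from `config.json`, the hypothetical helper below (not part of the script) joins them with exactly one slash:

```python
def feed_url(base_url, destination):
    """Join base_url and destination with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + destination.lstrip("/")


# Example values taken from the placeholder URL above.
print(feed_url("https://yourdomain.tld/some-url/", "newspaper.xml"))
# prints https://yourdomain.tld/some-url/newspaper.xml
```

Plain string joining is used on purpose: `urllib.parse.urljoin` would drop the last path segment of `base_url` if it lacked a trailing slash.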
-
- ### TODOs
- * Handle exceptions
- * Decide what should happen with old news articles and assets that are no
- longer listed in the current feed.