- # Feedcake
- ## "Give me a piece of cake and I'll want the whole cake."
-
- ### The Problem
- Most news platforms don't provide the full article via RSS/Atom.
- That alone wouldn't be a big problem, but some of them do crazy 1984-ish stuff on
- their websites, or they put up paywalls for users with privacy add-ons.
-
- ### Goal of this script
- Get a full-featured news feed (full articles with images) from various
- news pages.
-
- ### Benefits for the user
- * They don't need to visit the website to read the articles
- * No ads
- * No tracking
-
- ### Possible downsides for the user
- * articles don't get updated once they are scraped
- * articles arrive with some delay
- * interactive/special elements in articles may not work
-
- ### What it does
- * fetches the news feed from the original website
- * scrapes the contents of new entries and saves them into a directory structure
- * excludes articles if a string in the 'exclude' list is included in the title
- * saves a full-featured RSS file
-
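The exclude step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `is_excluded` and `filter_entries`, the dictionary shape of an entry, and the case-sensitive substring match are all assumptions made for this example.

```python
def is_excluded(title, exclude):
    """Return True if any string from the exclude list occurs in the title.

    Assumption: plain, case-sensitive substring matching.
    """
    return any(term in title for term in exclude)


def filter_entries(entries, exclude):
    """Keep only entries whose title contains none of the excluded strings."""
    return [entry for entry in entries if not is_excluded(entry["title"], exclude)]


# Hypothetical feed entries, reduced to the fields needed here.
entries = [
    {"title": "Sport: local team wins", "link": "https://example.org/a"},
    {"title": "New library opens downtown", "link": "https://example.org/b"},
]

kept = filter_entries(entries, exclude=["Sport"])
print([entry["title"] for entry in kept])
```

With this kind of matching, a short exclude string can also hit unrelated titles that merely contain it, so more specific strings are the safer choice.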
- ### ... and what it doesn't
- * Managing when it scrapes (but install instructions for crontab are included)
- * Serving the feeds and assets via HTTPS (use your favorite web server for that)
- * Dealing with article comments
- * Archiving feeds (content and assets are kept, but without metadata)
- * Using some sort of database (the file structure is everything)
- * Cleaning up old assets
- * Automatically updating the basedir if it changed.
-
- ### Ugly stuff?
- * The HTML files (feed content) are stored alongside the assets, even though they don't
-   need to be exposed via HTTPS.
- * Almost no exception handling yet.
-
- ### How to use
- * Clone this project with `git clone` and enter its directory
- * Install python3, pip and virtualenv
- * Create a virtualenv: `virtualenv -p python3 ~/.virtualenvs/feedcake`
- * Activate your new virtualenv: `source ~/.virtualenvs/feedcake/bin/activate`
- * Switch into the project's directory: `cd feedcake`
- * Install requirements: `pip3 install -r requirements.txt`
- * Copy the config example: `cp config-example.json config.json`
- * Edit `config.json`
- * Copy the cron example: `cp cron-example.sh cron.sh`
- * Edit `cron.sh`
- * Make `cron.sh` executable: `chmod +x cron.sh`
- * Add a cronjob for `cron.sh`: `crontab -e`
-     * `*/5 * * * * /absolute/path/to/cron.sh >> /path/to/logfile 2>&1`
- * Set up your web server:
-     * Let your web server point to the `feeds` directory.
-       You should protect the HTTP path with basic authentication.
-     * Let the `assets_url` specified in the config point to the `assets` directory.
- * After running the script for the first time, your desired feed is available at
-   `base_url/destination` (e.g. `https://yourdomain.tld/some-url/newspaper.xml`)
-
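To double-check which URL to enter in a feed reader, the `base_url/destination` rule can be sketched as below. The helper `feed_url` is hypothetical and only illustrates the joining rule. How `base_url` and `destination` are nested inside `config.json` is not shown here, so the values are passed in directly.

```python
def feed_url(base_url, destination):
    """Join base_url and destination with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + destination.lstrip("/")


# Example values taken from the placeholder URL in the instructions.
print(feed_url("https://yourdomain.tld/some-url/", "newspaper.xml"))
```

Stripping the slashes on both sides avoids a doubled or missing `/` regardless of whether `base_url` was written with a trailing slash.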
- ### TODOs
- * Handle exceptions
- * Decide what should happen with old news articles and assets that are no longer
-   listed in the current feed.