# Feedcake

### Attention

This script is maintained by a single person who is also a Python newbie.
If you don't care about having article images, you should definitely use
[PyFeeds](https://github.com/PyFeeds/PyFeeds) instead!
Also, it only works for a very limited subset of news sites.

### The Problem

Most news platforms don't give you the full article via RSS/Atom.
That alone wouldn't be a big problem, but some of them do crazy 1984-ish stuff
on their websites, or they have put up paywalls for visitors who use privacy
add-ons.

### Goal of this script

Getting a full-featured news feed (full articles with images) from various
news pages.

### Benefits for the user

* read full articles directly in your feed reader
* exclude articles by keyword in title
* no tracking
* no ads

### Possible downsides for the user

* articles don't get updated once they are scraped
* articles arrive with some delay
* interactive/special elements in articles may not work

### What it does

* fetch the news feed from the original website
* scrape the contents of new entries and save them into a directory structure
* exclude articles if a string in the 'exclude' list is included in the title
* save a full-featured RSS file

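
The title-based exclusion described above can be pictured roughly as follows. This is only a minimal sketch, not the script's actual code: the function names `is_excluded` and `filter_entries`, the dictionary shape of an entry, and the case-sensitive substring match are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def is_excluded(title: str, exclude: Iterable[str]) -> bool:
    """Return True if any non-empty string from `exclude` occurs in the title.

    Assumption: a plain, case-sensitive substring match.
    """
    return any(term and term in title for term in exclude)


def filter_entries(entries: List[Dict[str, str]], exclude: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only the entries whose title matches no exclude string."""
    exclude = list(exclude)
    return [entry for entry in entries if not is_excluded(entry["title"], exclude)]


# Hypothetical feed entries, reduced to their titles.
entries = [
    {"title": "Sports: local team wins"},
    {"title": "Parliament passes new law"},
]
kept = filter_entries(entries, ["Sports"])
print([entry["title"] for entry in kept])
```

With the exclude list `["Sports"]`, only the second entry would end up in the generated feed.
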
### ... and what it doesn't

* Managing when it scrapes (but install instructions for crontab are included)
* serving the feeds and assets via HTTPS (use your favorite web server for that)
* Dealing with article comments
* Archiving feeds (content and assets are kept, but without metadata)
* Using some sort of database (the file structure is everything)
* Cleaning up old assets
* Automatically updating the basedir if it has changed
  (you have to clear the assets directory)

### Ugly stuff?

* the HTML files (feed content) get stored alongside the assets, even though
  they don't need to be exposed via HTTPS.
* almost no exception handling yet.

### How to use

* git clone this project and enter the directory
* install python3, pip and virtualenv
* Create virtualenv: `virtualenv -p python3 ~/.virtualenvs/feedcake`
* Activate your new virtualenv: `source ~/.virtualenvs/feedcake/bin/activate`
* switch into the project's directory: `cd feedcake`
* Install requirements: `pip3 install -r requirements.txt`
* copy the config-example: `cp config-example.json config.json`
* edit `config.json`
* copy the cron-example: `cp cron-example.sh cron.sh`
* edit `cron.sh`
* make `cron.sh` executable: `chmod +x cron.sh`
* add a cronjob for `cron.sh`: `crontab -e`
  * `*/5 * * * * /absolute/path/to/cron.sh >> /path/to/logfile 2>&1`
* set up your web server:
  * let your web server somehow point to the `public/feeds` directory.
    You should protect the HTTP path with basic authentication.
  * let the `assets_url` you specified in the config earlier point to the
    `public/assets` directory.
* After running the script for the first time, your desired feed is available at
  `base_url/destination` (e.g. `https://yourdomain.tld/some-url/newspaper.xml`)

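
To double-check which URL to enter in your feed reader, the `base_url/destination` rule can be written out as a small helper. This is a sketch under assumptions: `feed_url` is a hypothetical function that is not part of the script, and it assumes `base_url` and `destination` are plain strings taken from `config.json`.

```python
def feed_url(base_url: str, destination: str) -> str:
    """Join base_url and destination with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + destination.lstrip("/")


# Values taken from the example in the last step above.
print(feed_url("https://yourdomain.tld/some-url", "newspaper.xml"))
# prints: https://yourdomain.tld/some-url/newspaper.xml
```

Stripping the slashes on both sides keeps the result the same whether or not `base_url` was written with a trailing slash.
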
### TODOs

* Handle exceptions
* Decide what should happen with old news articles and assets which are not
  listed in the current feed anymore.