* Debug mode improvements
- Improve debug warning message
- Restore error reporting in debug mode
- Fix 'notice' messages for unset fields
* Add parsing utility functions
html.php
- extractFromDelimiters
- stripWithDelimiters
- stripRecursiveHTMLSection
- markdownToHtml (partial)
bridges
- remove now-duplicate functions
- call functions from html.php instead
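The commit lists the helper names but not their signatures. As a rough illustration of what a delimiter-based extractor can look like, here is a minimal sketch; the name extractBetween, its parameters, and its false-on-failure return value are assumptions made for this example, not the code in html.php.

```php
<?php
// Hypothetical sketch: name, signature and return convention are assumptions,
// not the implementation shipped in html.php.
function extractBetween(string $text, string $start, string $end)
{
    // Locate the opening delimiter.
    $from = strpos($text, $start);
    if ($from === false) {
        return false;
    }

    // Skip past the opening delimiter, then look for the closing one.
    $from += strlen($start);
    $to = strpos($text, $end, $from);
    if ($to === false) {
        return false;
    }

    // Return only the text between the two delimiters.
    return substr($text, $from, $to - $from);
}
```

A bridge could then call extractBetween($html, '<b>', '</b>') instead of carrying its own copy of the same strpos/substr logic.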
* [Anidex] New bridge
Anime torrent tracker
* [Anime-Ultime] Restore thumbnail
* [CNET] Recreate bridge
Full rewrite as the previous one was broken
* [Dilbert] Minor URI fix
Use new self::URI property
* [EstCeQuonMetEnProd] Fix content extraction
Bridge was broken
* [Facebook] Fix "SpSonsSoriSsés" label
... which was taking space in item title
* [Futura-Sciences] Use HTTPS, More cleanup
Use HTTPS as FS now offers HTTPS
Clean additional useless HTML elements
* [GBATemp] Multiple fixes
- Fix categories: missing "break" statements
- Restore thumbnail as enclosure
- Fix date extraction
- Fix user blog post extraction
- Use getSimpleHTMLDOMCached
* [JapanExpo] Fix bridge, HTTPS, thumbnails
- Fix getSimpleHTMLDOMCached call
- Upgrade to HTTPS as JE now offers HTTPS
- Restore thumbnails as enclosures
* [LeMondeInformatique] Fix bridge, HTTPS
- Upgrade to HTTPS as LMI now offers HTTPS
- Restore thumbnails using small images
- Fix content extraction
- Fix text encoding issue
* [Nextgov] Fix content extraction
- Restore thumbnail and use small image
- Field extraction fixes
* [NextInpact] Add categories and filtering by type
- Offer all RSS feeds
- Allow filtering by article type
- Implement extraction for brief articles
- Remove article limit, as many brief articles are published all at once
* [NyaaTorrents] New bridge
Anime torrent tracker
* [Releases3DS] Cache content, restore thumbnail
- Use getSimpleHTMLDOMCached
- Restore thumbnail as enclosure
* [TheHackerNews] Fix bridge
- Fix content extraction including article body
- Restore thumbnail as enclosure
* [WeLiveSecurity] HTTPS, Fix content extraction
- Upgrade to HTTPS as WLS now offers HTTPS
- Fix content extraction including article body
* [WordPress] Reduce timeout, more content selectors
- Reduce timeout to use default one (1h)
- Add new content selector (articleBody)
- Find thumbnail and set as enclosure
- Fix <script> cleanup
* [YGGTorrent] Increase limit, use cache
- Increase item limit as uploads are very frequent
- Use getSimpleHTMLDOMCached
* [ZDNet] Rewrite with FeedExpander
- Upgrade to HTTPS as ZD now offers HTTPS
- Use FeedExpander for secondary fields
- Fix content extraction for article body
* [Main] Handle MIME type for enclosures
Many feed readers will ignore enclosures (e.g. thumbnails) with no MIME type. This commit adds automatic MIME type detection based on file extension (which may be inaccurate but is the only way without fetching the content).
One can force enclosure type using #.ext anchor (hacky, needs improving)
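The extension-based detection described above could be sketched as follows. The function name guessEnclosureMimeType, the small extension table, and the 'application/octet-stream' fallback are assumptions made for illustration; the commit does not show the actual implementation.

```php
<?php
// Hypothetical helper: guesses an enclosure's MIME type from its URL only.
// The table below is an illustrative subset, not the project's real list.
function guessEnclosureMimeType(string $url): string
{
    $types = [
        'jpg'  => 'image/jpeg',
        'jpeg' => 'image/jpeg',
        'png'  => 'image/png',
        'gif'  => 'image/gif',
        'mp3'  => 'audio/mpeg',
        'mp4'  => 'video/mp4',
    ];

    // A "#.ext" anchor, when present, overrides the extension in the path.
    $fragment = parse_url($url, PHP_URL_FRAGMENT);
    if (is_string($fragment) && strpos($fragment, '.') === 0) {
        $ext = substr($fragment, 1);
    } else {
        $path = parse_url($url, PHP_URL_PATH);
        $ext = is_string($path) ? pathinfo($path, PATHINFO_EXTENSION) : '';
    }

    // Unknown or missing extensions fall back to a generic type (assumption).
    return $types[strtolower($ext)] ?? 'application/octet-stream';
}
```

With this sketch, a URL such as https://example.com/image?id=5#.png is reported as image/png even though its path has no extension, which is the case the #.ext anchor is meant to cover.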
* [FeedExpander] Improve field extraction
- Add support for passing enclosures
- Improve author and uri extraction
- Fix 'notice' PHP error messages
* [Pull] Coding style fixes for #802
* [Pull] Implementing changes for #802
- Fix coding style issues with str append
- Remove useless CACHE_TIMEOUT
- Use count() instead of $limit
- Use defaultLinkTo() + handle strings
- Use http_build_query()
- Fix missing </em>
- Remove error_reporting(0)
- warning CSS (@LogMANOriginal)
- Fix typo in FeedExpander comment
* [Main] More documentation for markdownToHtml
See #802 for more details
The previous context is now labeled 'User', while the new context is
labeled 'Group'. The existing code was not changed, instead new group*
functions were implemented to handle groups.
The general principle of capturing groups is the same as done for users
with adjustments to account for different HTML structures.
Captcha responses are currently not supported for groups! There doesn't
seem to be a way to trigger them consistently, which makes it hard to
handle them properly.
Features of the group context:
- The feed title is based on the group name
- The group URI used for capturing is returned for the feed URI
- Author names and timestamps are reproduced from the source
- Post titles are reproduced from the source if they exist, otherwise
the title is built manually from the author name and the content
- Original contents are included with the feed
- All images are attached as enclosures as well
Closes #
Allows users to paste Facebook links as the user name. The link must contain
the correct host (www.facebook.com) and a valid path (/user-name/...).
The first part of the path is used for the user name. Errors are returned
in case something went wrong.
References #706
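A minimal sketch of the link handling described above, assuming a hypothetical helper userNameFromInput that throws an exception on invalid links (the commit only says that errors are returned, not how):

```php
<?php
// Hypothetical helper: the bridge's real function name and error handling
// are not given in the commit message.
function userNameFromInput(string $input): string
{
    // Inputs without a host are treated as plain user names.
    $host = parse_url($input, PHP_URL_HOST);
    if (!is_string($host)) {
        return $input;
    }

    // The link must point at the expected host.
    if ($host !== 'www.facebook.com') {
        throw new InvalidArgumentException('Unexpected host: ' . $host);
    }

    // The first part of the path is used as the user name.
    $path = parse_url($input, PHP_URL_PATH);
    $segments = is_string($path) ? explode('/', trim($path, '/')) : [];
    if (empty($segments) || $segments[0] === '') {
        throw new InvalidArgumentException('No user name found in link');
    }

    return $segments[0];
}
```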
Reviews are provided the same way as summary posts and are therefore
returned as a separate feed item for each review. This commit adds a new
option '&skip_reviews=on' to skip reviews entirely.
References #706
Requesting a username with a leading slash would cause error 500
because the requested URI would contain two slashes in a row.
For example username "/test" would result in:
https://facebook.com//test
References #628
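The commit message does not say how the double slash is avoided. One possible approach, shown here purely as an assumption with a hypothetical buildProfileUri helper, is to strip leading slashes before building the URI:

```php
<?php
// Hypothetical helper: one way to avoid "https://facebook.com//test".
// Not necessarily how the bridge fixes it.
function buildProfileUri(string $username): string
{
    // Remove any leading slashes so the separator appears exactly once.
    return 'https://facebook.com/' . ltrim($username, '/');
}
```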
All formats except HTML return &amp; instead of & in URLs, causing
all links with parameters (...&id=...) to break.
Facebook does not return valid HTML URIs but instead provides them
with all special characters encoded (like using htmlspecialchars).
This seems to be related to the page being built almost entirely of
script blocks.
This commit adds htmlspecialchars_decode() to URI and content to
reverse the encoding.
References #550
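The effect of htmlspecialchars_decode() on such a URI can be seen in a short standalone example. The URL is made up for illustration; the encoded form is produced with htmlspecialchars() to mimic what the page is described as returning.

```php
<?php
// Made-up URL with two query parameters.
$raw = 'https://example.com/photo.php?fbid=1&id=2';

// Simulate the encoded form found in the page: "&" becomes an HTML entity.
$encoded = htmlspecialchars($raw);

// Reverse the encoding so the link works again in non-HTML formats.
$decoded = htmlspecialchars_decode($encoded);
```

Here $encoded differs from $raw only in the ampersand, and $decoded is identical to $raw again.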
- Do not add spaces after opening or before closing parenthesis
// Wrong
if( !is_null($var) ) {
...
}
// Right
if(!is_null($var)) {
...
}
- Add space after closing parenthesis
// Wrong
if(true){
...
}
// Right
if(true) {
...
}
- Put the body on a new line
- Close the body on a new line
// Wrong
if(true) { ... }
// Right
if(true) {
...
}
Note: Spaces after keywords are not detected:
// Wrong (not detected)
// -> space after 'if' and missing space after 'else'
if (true) {
...
} else{
...
}
// Right
if(true) {
...
} else {
...
}
This replaces the 'novideo' parameter with 'media_type' in order
to filter for specific content types. Currently supported:
- 'all': Returns all posts (default)
- 'video': Returns only posts including videos
- 'novideo': Returns only posts that don't include videos
References #553
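The three 'media_type' values could be applied along these lines. The post structure (an array with a boolean 'has_video' key), the function name filterByMediaType, and the treatment of unknown values as 'all' are assumptions made for this sketch.

```php
<?php
// Hypothetical sketch: each post is assumed to be an array with a boolean
// 'has_video' key; the bridge's real data structure is not shown.
function filterByMediaType(array $posts, string $mediaType = 'all'): array
{
    switch ($mediaType) {
        case 'video':
            // Keep only posts that include a video.
            $keep = function (array $post): bool {
                return $post['has_video'];
            };
            break;
        case 'novideo':
            // Keep only posts without a video.
            $keep = function (array $post): bool {
                return !$post['has_video'];
            };
            break;
        case 'all':
        default:
            // Unknown values are treated like 'all' (assumption).
            return $posts;
    }

    // Re-index so the result is a plain list.
    return array_values(array_filter($posts, $keep));
}
```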
This adds a new option 'novideo' that can be set to 'on' or 'off'
in order to skip posts that include Facebook videos (does not work
for linked videos like YouTube). This option is 'off' by default.
References #533
If no accepted languages are specified Facebook will guess your
language. This guess can go horribly wrong if your server does not
provide origin information.
This adds a context header with language information when retrieving
page contents. The accepted languages are read from the list of
accepted languages specified by the web browser of the requester.
References #530
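Forwarding the requester's language preferences might look like the following sketch. The helper name buildLanguageContext and the choice to pass $_SERVER in as a parameter are assumptions; stream_context_create() and the 'http' context options are standard PHP.

```php
<?php
// Hypothetical helper: builds a stream context that forwards the
// requester's Accept-Language header, if the browser sent one.
function buildLanguageContext(array $server)
{
    $options = ['http' => ['method' => 'GET']];

    $languages = $server['HTTP_ACCEPT_LANGUAGE'] ?? '';
    // Drop line breaks defensively so the value cannot add extra headers.
    $languages = str_replace(["\r", "\n"], '', $languages);

    if ($languages !== '') {
        $options['http']['header'] = 'Accept-Language: ' . $languages . "\r\n";
    }

    return stream_context_create($options);
}
```

The resulting context can be passed as the third argument of file_get_contents(), for example file_get_contents($uri, false, buildLanguageContext($_SERVER)).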
Previously summary posts were ignored which resulted in the last
two posts not showing up in the feed (the latest two are shown in
the summary post).
Now summary posts are treated like regular posts, returning them
as part of the regular feed.
References #502, #505
- returnError, returnServerError, returnClientError, debugMessage are
moved to lib/error.php
- getContents, getSimpleHTMLDOM, getSimpleHTMLDOMCached are moved to
lib/contents.php
Signed-off-by: Pierre Mazière <pierre.maziere@gmx.com>
Inputs are not stored in BridgeAbstract::$parameters anymore to separate
static data from dynamic data.
The getInput method allows for more readable code.
Also fix an "undefined index 'global'" notice
Probability of breaking bridges: high!
Signed-off-by: Pierre Mazière <pierre.maziere@gmx.com>
If a bridge needs to modify some of the data that were initialized
there, ::__construct() should be used instead.
Signed-off-by: Pierre Mazière <pierre.maziere@gmx.com>
This does not solve the captcha issue but allows the viewer to manually
solve the captcha by displaying a form and using the response from the
viewer. Maybe a first step to automated captcha solving?
This process relies on a PHP session for storing captcha details so
that the user cannot submit anything other than the response to the
captcha. The response is also filtered before being forwarded. Once the
captcha is solved we get a page ready to be parsed, as usual.
Combined with some kind of OCR, this could automatically solve the
captcha, but currently it only automates the process of retrieving the
challenge and submitting the response.
Fix the home page so that it complies with W3C standards.
Fix the file-listing regex so that it ignores backup files.
Add an HTML cleaner, enabled by default.
Currently emoticons are retrieved in textual form, e.g. <i><u>smile
emoticon</u></i>, which is not really visual, so let's convert them back
to ASCII emoticons, e.g. ':)'. This works using a hardcoded table mapping
emoticon names to their visual representation. The regular expression
matches both words because, e.g. in French, Facebook will display
<i><u>émoticône smile</u></i>, so we need to test both. Unknown emoticon
descriptions will be left as is.
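The conversion described above can be sketched as follows. The function name replaceEmoticons, the three-entry table, and the exact regular expression are assumptions for illustration; the bridge's real table and pattern are not reproduced in this message.

```php
<?php
// Hypothetical sketch: table contents and pattern are illustrative only.
function replaceEmoticons(string $html): string
{
    $table = ['smile' => ':)', 'frown' => ':(', 'wink' => ';)'];

    $result = preg_replace_callback(
        // Two words inside <i><u>...</u></i>, in either order.
        '#<i><u>([^\s<]+)\s+([^\s<]+)</u></i>#u',
        function (array $m) use ($table): string {
            // Test both words: the emoticon name may come first or second.
            foreach ([$m[1], $m[2]] as $word) {
                if (isset($table[$word])) {
                    return $table[$word];
                }
            }
            // Unknown description: leave the original markup as is.
            return $m[0];
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; keep the input then.
    return $result ?? $html;
}
```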