Package gozerbot :: Package contrib :: Module feedparser

Module feedparser

Universal feed parser

Handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds

Visit http://feedparser.org/ for the latest version
Visit http://feedparser.org/docs/ for the latest documentation

Required: Python 2.1 or later
Recommended: Python 2.3 or later
Recommended: CJKCodecs and iconv_codec <http://cjkpython.i18n.org/>


Version: 4.1

Author: Mark Pilgrim <http://diveintomark.org/>

License: Copyright (c) 2002-2006, Mark Pilgrim, All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Classes
  ThingsNobodyCaresAboutButMe
  CharacterEncodingOverride
  CharacterEncodingUnknown
  NonXMLContentType
  UndeclaredNamespace
  FeedParserDict
  _FeedParserMixin
  _StrictFeedParser
  _BaseHTMLProcessor
  _LooseFeedParser
  _RelativeURIResolver
  _HTMLSanitizer
  _FeedURLHandler
Functions

dict(aList)

zopeCompatibilityHack()

_ebcdic_to_ascii(s)

_urljoin(base, uri)

_resolveRelativeURIs(htmlSource, baseURI, encoding)

_sanitizeHTML(htmlSource, encoding)

_open_resource(url_file_stream_or_string, etag, modified, agent, referrer, handlers)
URL, filename, or string --> stream

registerDateHandler(func)
Register a date handler function (takes string, returns 9-tuple date in GMT)

_parse_date_iso8601(dateString)
Parse a variety of ISO-8601-compatible formats like 20040105

_parse_date_onblog(dateString)
Parse a string according to the OnBlog 8-bit date format

_parse_date_nate(dateString)
Parse a string according to the Nate 8-bit date format

_parse_date_mssql(dateString)
Parse a string according to the MS SQL date format

_parse_date_greek(dateString)
Parse a string according to a Greek 8-bit date format.

_parse_date_hungarian(dateString)
Parse a string according to a Hungarian 8-bit date format.

_parse_date_w3dtf(dateString)

_parse_date_rfc822(dateString)
Parse an RFC822, RFC1123, RFC2822, or asctime-style date

_parse_date(dateString)
Parses a variety of date formats into a 9-tuple in GMT

_getCharacterEncoding(http_headers, xml_data)
Get the character encoding of the XML document

_toUTF8(data, encoding)
Changes an XML data stream on the fly to specify a new encoding

_stripDoctype(data)
Strips DOCTYPE from XML document, returns (rss_version, stripped_data)

parse(url_file_stream_or_string, etag=None, modified=None, agent=None, referrer=None, handlers=[])
Parse a feed from a URL, file, stream, or string
Variables
  __contributors__ = ['Jason Diamond <http://injektilo.org/>', '...
  _debug = 0
  USER_AGENT = 'UniversalFeedParser/4.1 +http://feedparser.org/'
  ACCEPT_HEADER = 'application/atom+xml,application/rdf+xml,appl...
  PREFERRED_XML_PARSERS = ['drv_libxml2']
  TIDY_MARKUP = 0
  PREFERRED_TIDY_INTERFACES = ['uTidy', 'mxTidy']
  _XML_AVAILABLE = 1
  chardet = None
  SUPPORTED_VERSIONS = {'': 'unknown', 'atom': 'Atom (unknown ve...
  _ebcdic_to_ascii_map = None
  _urifixer = re.compile(r'^([A-Za-z][A-Za-z0-9\+-\.]*://)(/*)(....
  _date_handlers = []
  _iso8601_tmpl = ['YYYY-?MM-?DD', 'YYYY-MM', 'YYYY-?OOO', 'YY-?...
  _iso8601_re = ['(?P<year>\\d{4})-?(?P<month>[01]\\d)-?(?P<day>...
  _iso8601_matches = [re.compile(regex).match for regex in _iso8...
  _korean_year = u'년'
  _korean_month = u'월'
  _korean_day = u'일'
  _korean_am = u'오전'
  _korean_pm = u'오후'
  _korean_onblog_date_re = re.compile(r'(\d{4})\ub144\s+(\d{2})\...
  _korean_nate_date_re = re.compile(r'(\d{4})-(\d{2})-(\d{2})\s+...
  _mssql_date_re = re.compile(r'(\d{4})-(\d{2})-(\d{2})\s+(\d{2}...
  _greek_months = {u'Απρ': u'Apr', u'Αυγ': u'Aug', u'Αύγ': u'Aug...
  _greek_wdays = {u'Δευ': u'Mon', u'Κυρ': u'Sun', u'Παρ': u'Fri'...
  _greek_date_format_re = re.compile(r'([^,]+),\s+(\d{2})\s+([^\...
  _hungarian_months = {u'augusztus': u'08', u'december': u'12', ...
  _hungarian_date_format_re = re.compile(r'(\d{4})-([^-]+)-(\d{,...
  _additional_timezones = {'AT': -400, 'CT': -600, 'ET': -500, '...
Function Details

_open_resource(url_file_stream_or_string, etag, modified, agent, referrer, handlers)

URL, filename, or string --> stream

This function lets you define parsers that take any input source (URL, pathname to local or network file, or actual data as a string) and deal with it in a uniform manner. Returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). Just .close() the object when you're done with it.

If the etag argument is supplied, it will be used as the value of an If-None-Match request header.

If the modified argument is supplied, it must be a tuple of 9 integers as returned by gmtime() in the standard Python time module. This MUST be in GMT (Greenwich Mean Time). The formatted date/time will be used as the value of an If-Modified-Since request header.

If the agent argument is supplied, it will be used as the value of a User-Agent request header.

If the referrer argument is supplied, it will be used as the value of a Referer[sic] request header.

If handlers is supplied, it is a list of handlers used to build a urllib2 opener.
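
The header rules above can be sketched as follows. This is only an illustration, not the module's own code: the helper name conditional_headers is hypothetical, and it uses the standard library to format the 9-tuple rather than whatever formatting _open_resource performs internally.

```python
import calendar
import time
from email.utils import formatdate


def conditional_headers(etag=None, modified=None, agent=None, referrer=None):
    """Hypothetical helper: build the request headers described above."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if modified:
        # 'modified' is a 9-tuple in GMT, as returned by time.gmtime().
        # timegm() turns it into a timestamp without applying a local offset.
        headers["If-Modified-Since"] = formatdate(
            calendar.timegm(modified), usegmt=True
        )
    if agent:
        headers["User-Agent"] = agent
    if referrer:
        # The HTTP header name really is spelled "Referer".
        headers["Referer"] = referrer
    return headers


print(conditional_headers(etag='"abc"', modified=time.gmtime(0)))
```

Passing a local-time tuple instead of a GMT one would silently shift the If-Modified-Since value, which is why the documentation insists on GMT.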

_getCharacterEncoding(http_headers, xml_data)

Get the character encoding of the XML document

http_headers is a dictionary
xml_data is a raw string (not Unicode)

This is so much trickier than it sounds, it's not even funny. According to RFC 3023 ('XML Media Types'), if the HTTP Content-Type is application/xml, application/*+xml, application/xml-external-parsed-entity, or application/xml-dtd, the encoding given in the charset parameter of the HTTP Content-Type takes precedence over the encoding given in the XML prefix within the document, and defaults to 'utf-8' if neither is specified. But, if the HTTP Content-Type is text/xml, text/*+xml, or text/xml-external-parsed-entity, the encoding given in the XML prefix within the document is ALWAYS IGNORED and only the encoding given in the charset parameter of the HTTP Content-Type header should be respected, and it defaults to 'us-ascii' if not specified.

Furthermore, discussion on the atom-syntax mailing list with the author of RFC 3023 leads me to the conclusion that any document served with a Content-Type of text/* and no charset parameter must be treated as us-ascii. (We now do this.) And also that it must always be flagged as non-well-formed. (We now do this too.)

If Content-Type is unspecified (input was local file or non-HTTP source) or unrecognized (server just got it totally wrong), then go by the encoding given in the XML prefix of the document and default to 'iso-8859-1' as per the HTTP specification (RFC 2616).
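
As a rough sketch, the precedence rules described so far might be written like this. The function name choose_encoding and its three arguments are hypothetical, and honoring an explicit charset on other text/* types is an assumption of the sketch; the text above only covers the case where the charset is missing.

```python
APPLICATION_XML_TYPES = (
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
)
TEXT_XML_TYPES = ("text/xml", "text/xml-external-parsed-entity")


def choose_encoding(content_type, http_charset, xml_encoding):
    """Hypothetical helper: pick an encoding name from the three inputs."""
    ctype = (content_type or "").strip().lower()
    if ctype in APPLICATION_XML_TYPES or (
        ctype.startswith("application/") and ctype.endswith("+xml")
    ):
        # HTTP charset wins, then the XML prefix, then utf-8.
        return http_charset or xml_encoding or "utf-8"
    if ctype in TEXT_XML_TYPES or (
        ctype.startswith("text/") and ctype.endswith("+xml")
    ):
        # The XML prefix is ignored entirely for text/xml and friends.
        return http_charset or "us-ascii"
    if ctype.startswith("text/"):
        # Assumption: an explicit charset is honored; without one, us-ascii.
        return http_charset or "us-ascii"
    # Unspecified or unrecognized Content-Type: trust the document.
    return xml_encoding or "iso-8859-1"


print(choose_encoding("text/xml", None, "utf-8"))
```

The sketch leaves out the non-well-formed flag mentioned above; a caller would need to track that separately.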

Then, assuming we didn't find a character encoding in the HTTP headers (and the HTTP Content-type allowed us to look in the body), we need to sniff the first few bytes of the XML data and try to determine whether the encoding is ASCII-compatible. Section F of the XML specification shows the way here: http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

If the sniffed encoding is not ASCII-compatible, we need to make it ASCII compatible so that we can sniff further into the XML declaration to find the encoding attribute, which will tell us the true encoding.
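
A minimal sketch of this two-step sniffing, assuming the byte patterns listed in the XML specification's appendix on autodetection. The names sniff_xml_encoding and declared_encoding are hypothetical, cp037 stands in for EBCDIC, and the module itself may handle more cases than shown here.

```python
import re

# (leading bytes, codec name); longer signatures first so that the
# UTF-32 BOM is not mistaken for a UTF-16 BOM.
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32be"),
    (b"\xff\xfe\x00\x00", "utf-32le"),
    (b"\x00\x00\x00\x3c", "utf-32be"),
    (b"\x3c\x00\x00\x00", "utf-32le"),
    (b"\x4c\x6f\xa7\x94", "cp037"),      # "<?xm" in EBCDIC
    (b"\x00\x3c\x00\x3f", "utf-16be"),   # "<?" without a BOM
    (b"\x3c\x00\x3f\x00", "utf-16le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
]

_DECL_RE = re.compile(
    r"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data):
    """Guess a non-ASCII-compatible encoding from the first bytes, or None."""
    for prefix, name in _SIGNATURES:
        if data.startswith(prefix):
            return name
    return None


def declared_encoding(data):
    """Return the encoding attribute of the XML declaration, or None."""
    sniffed = sniff_xml_encoding(data)
    # Decode only the head of the document so the declaration becomes
    # readable text, then drop a byte order mark if one was present.
    head = data[:200].decode(sniffed or "ascii", errors="replace")
    match = _DECL_RE.match(head.lstrip("\ufeff"))
    return match.group(1).lower() if match else None


sample = '<?xml version="1.0" encoding="ISO-8859-1"?><rss/>'.encode("utf-16le")
print(sniff_xml_encoding(sample), declared_encoding(sample))
```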

Of course, none of this guarantees that we will be able to parse the feed in the declared character encoding (assuming it was declared correctly, which many are not). CJKCodecs and iconv_codec help a lot; you should definitely install them if you can. http://cjkpython.i18n.org/

_toUTF8(data, encoding)

Changes an XML data stream on the fly to specify a new encoding

data is a raw sequence of bytes (not Unicode) that is presumed to be in %encoding already
encoding is a string recognized by encodings.aliases
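
One way to picture this re-encoding step is the sketch below. It is not the module's implementation: the name to_utf8 is hypothetical, and always writing version 1.0 in the new declaration is a simplifying assumption.

```python
import re

_XML_DECL_RE = re.compile(r"^<\?xml[^>]*?\?>")
_NEW_DECL = "<?xml version='1.0' encoding='utf-8'?>"


def to_utf8(data, encoding):
    """Hypothetical helper: re-encode XML bytes and fix the declaration."""
    # Raises UnicodeDecodeError or LookupError if 'encoding' is wrong.
    text = data.decode(encoding).lstrip("\ufeff")
    if _XML_DECL_RE.match(text):
        # Replace the existing declaration so it matches the new bytes.
        text = _XML_DECL_RE.sub(_NEW_DECL, text, count=1)
    else:
        text = _NEW_DECL + "\n" + text
    return text.encode("utf-8")


raw = '<?xml version="1.0" encoding="iso-8859-1"?><a>\xe9</a>'.encode("iso-8859-1")
print(to_utf8(raw, "iso-8859-1"))
```

Rewriting the declaration matters because an XML parser would otherwise trust the stale encoding attribute and misread the UTF-8 bytes.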

_stripDoctype(data)

Strips DOCTYPE from XML document, returns (rss_version, stripped_data)

rss_version may be 'rss091n' or None
stripped_data is the same XML document, minus the DOCTYPE
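
The return contract can be illustrated with a small sketch. The name strip_doctype is hypothetical, and treating any DOCTYPE that mentions "netscape" as the signal for 'rss091n' is an assumption; the documentation above does not say how the version is detected. The simple pattern also does not cope with a DOCTYPE that carries an internal subset.

```python
import re

_DOCTYPE_RE = re.compile(r"<!DOCTYPE([^>]*?)>", re.MULTILINE)


def strip_doctype(data):
    """Hypothetical helper: return (rss_version, data without DOCTYPE)."""
    match = _DOCTYPE_RE.search(data)
    doctype = match.group(0) if match else ""
    # Assumption: a Netscape DTD reference identifies RSS 0.91 (Netscape).
    version = "rss091n" if "netscape" in doctype.lower() else None
    return version, _DOCTYPE_RE.sub("", data, count=1)


feed = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" '
    '"http://my.netscape.com/publish/formats/rss-0.91.dtd">\n'
    '<rss version="0.91"/>'
)
print(strip_doctype(feed))
```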


Variables Details

__contributors__

Value:
['Jason Diamond <http://injektilo.org/>',
 'John Beimler <http://john.beimler.org/>',
 'Fazal Majid <http://www.majid.info/mylos/weblog/>',
 'Aaron Swartz <http://aaronsw.com/>',
 'Kevin Marks <http://epeus.blogspot.com/>']

ACCEPT_HEADER

Value:
'application/atom+xml,application/rdf+xml,application/rss+xml,applicat\
ion/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1'

SUPPORTED_VERSIONS

Value:
{'': 'unknown',
 'atom': 'Atom (unknown version)',
 'atom01': 'Atom 0.1',
 'atom02': 'Atom 0.2',
 'atom03': 'Atom 0.3',
 'atom10': 'Atom 1.0',
 'cdf': 'CDF',
 'hotrss': 'Hot RSS',
...

_urifixer

Value:
re.compile(r'^([A-Za-z][A-Za-z0-9\+-\.]*://)(/*)(.*?)')

_iso8601_tmpl

Value:
['YYYY-?MM-?DD',
 'YYYY-MM',
 'YYYY-?OOO',
 'YY-?MM-?DD',
 'YY-?OOO',
 'YYYY',
 '-YY-?MM',
 '-OOO',
...

_iso8601_re

Value:
['(?P<year>\\d{4})-?(?P<month>[01]\\d)-?(?P<day>[0123]\\d)(T?(?P<hour>\
\\d{2}):(?P<minute>\\d{2})(:(?P<second>\\d{2}))?(?P<tz>[+-](?P<tzhour>\
\\d{2})(:(?P<tzmin>\\d{2}))?|Z)?)?',
 '(?P<year>\\d{4})-(?P<month>[01]\\d)(T?(?P<hour>\\d{2}):(?P<minute>\\\
d{2})(:(?P<second>\\d{2}))?(?P<tz>[+-](?P<tzhour>\\d{2})(:(?P<tzmin>\\\
d{2}))?|Z)?)?',
 '(?P<year>\\d{4})-?(?P<ordinal>[0123]\\d\\d)(T?(?P<hour>\\d{2}):(?P<m\
inute>\\d{2})(:(?P<second>\\d{2}))?(?P<tz>[+-](?P<tzhour>\\d{2})(:(?P<\
...

_iso8601_matches

Value:
[re.compile(regex).match for regex in _iso8601_re]
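
To show how a list of bound match methods like this can be used, here is a sketch built from only the first two patterns of _iso8601_re, which are the ones shown in full above. The function first_match is hypothetical and says nothing about how _parse_date_iso8601 post-processes the result.

```python
import re

# Optional time-of-day and timezone suffix shared by both patterns.
_TIME = (
    r"(T?(?P<hour>\d{2}):(?P<minute>\d{2})(:(?P<second>\d{2}))?"
    r"(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?"
)
iso8601_re = [
    r"(?P<year>\d{4})-?(?P<month>[01]\d)-?(?P<day>[0123]\d)" + _TIME,
    r"(?P<year>\d{4})-(?P<month>[01]\d)" + _TIME,
]
iso8601_matches = [re.compile(regex).match for regex in iso8601_re]


def first_match(date_string):
    """Hypothetical helper: named groups of the first pattern that matches."""
    for match_func in iso8601_matches:
        match = match_func(date_string)
        if match:
            return match.groupdict()
    return None


print(first_match("20040105"))
print(first_match("2004-01"))
```

Storing the bound .match methods rather than the compiled patterns keeps the loop to a single call per candidate; the order of the list decides which format wins when several could apply.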

_korean_onblog_date_re

Value:
re.compile(r'(\d{4})\ub144\s+(\d{2})\uc6d4\s+(\d{2})\uc77c\s+(\d{2}):(\
\d{2}):(\d{2})')

_korean_nate_date_re

Value:
re.compile(r'(\d{4})-(\d{2})-(\d{2})\s+(\uc624[\uc804\ud6c4])\s+(\d{,2\
}):(\d{,2}):(\d{,2})')

_mssql_date_re

Value:
re.compile(r'(\d{4})-(\d{2})-(\d{2})\s+(\d{2}):(\d{2}):(\d{2})(\.\d+)?\
')
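
A sketch of turning a string in this format into a 9-tuple, using the pattern shown above. The name parse_date_mssql is hypothetical, and treating the timestamp as already being in GMT is an assumption of the sketch; the format carries no timezone, and the module's own _parse_date_mssql may apply a different offset.

```python
import calendar
import re
import time

mssql_date_re = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})\s+(\d{2}):(\d{2}):(\d{2})(\.\d+)?"
)


def parse_date_mssql(date_string):
    """Hypothetical helper: return a 9-tuple, or None if no match."""
    match = mssql_date_re.match(date_string)
    if not match:
        return None
    fields = [int(group) for group in match.groups()[:6]]
    # Assumption: the input is GMT. Round-tripping through timegm() and
    # gmtime() fills in the weekday and day-of-year fields.
    return time.gmtime(calendar.timegm(tuple(fields) + (0, 0, 0)))


print(parse_date_mssql("2004-07-08 23:56:58.0"))
```

The fractional seconds captured by the last group are discarded, since a 9-tuple has no field for them.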

_greek_months

Value:
{u'Απρ': u'Apr',
 u'Αυγ': u'Aug',
 u'Αύγ': u'Aug',
 u'Δεκ': u'Dec',
 u'Ιαν': u'Jan',
 u'Ιολ': u'Jul',
 u'Ιον': u'Jun',
 u'Ιούλ': u'Jul',
...

_greek_wdays

Value:
{u'Δευ': u'Mon',
 u'Κυρ': u'Sun',
 u'Παρ': u'Fri',
 u'Πεμ': u'Thu',
 u'Σαβ': u'Sat',
 u'Τετ': u'Wed',
 u'Τρι': u'Tue'}

_greek_date_format_re

Value:
re.compile(r'([^,]+),\s+(\d{2})\s+([^\s]+)\s+(\d{4})\s+(\d{2}):(\d{2})\
:(\d{2})\s+([^\s]+)')

_hungarian_months

Value:
{u'augusztus': u'08',
 u'december': u'12',
 u'februári': u'02',
 u'január': u'01',
 u'július': u'07',
 u'június': u'06',
 u'március': u'03',
 u'máujus': u'05',
...

_hungarian_date_format_re

Value:
re.compile(r'(\d{4})-([^-]+)-(\d{,2})T(\d{,2}):(\d{2})(([\+-])(\d{,2}:\
\d{2}))')

_additional_timezones

Value:
{'AT': -400, 'CT': -600, 'ET': -500, 'MT': -700, 'PT': -800}