Heritrix / HER-1837

BASE HREF of enclosing HTML not used by SWFExtractor

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Heritrix 3.0.0, Heritrix 1.14.4
    • Fix Version/s: Heritrix 3.3.0
    • Component/s: None
    • Labels: None

      Description

      In a comment at https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+Home?focusedCommentId=10060469#comment-10060469 it is reported that:

      When a web page has a tag like <base href='http://www.lndangan.gov.cn/lnsdaj/'/>, URLs extracted from a .swf file embedded in that page get the wrong path: they are resolved relative to the URL of the web page rather than against the value of the <base href> tag.
      The bug can be reproduced with the test URL http://www.lndangan.gov.cn/lnsdaj/wszt/list.html. With this test URL, the crawler reports a 404 error for the extracted URL http://www.lndangan.gov.cn/lnsdaj/wszt/xml/albuminfo2.asp, whereas the correct URL is http://www.lndangan.gov.cn/lnsdaj/xml/albuminfo2.asp.
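
The difference between the two URLs can be illustrated with a minimal sketch using the standard java.net.URI class. This is not Heritrix code and does not touch SWFExtractor. The class name BaseHrefDemo and the helper resolve are hypothetical, and the relative link "xml/albuminfo2.asp" is an assumption inferred from the two URLs in the report, not a value read from the .swf file.

```java
import java.net.URI;

public class BaseHrefDemo {

    /** Resolves a possibly relative link against the given base URI (RFC 3986 style). */
    static String resolve(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        // URL of the enclosing HTML page, taken from the report.
        String pageUrl = "http://www.lndangan.gov.cn/lnsdaj/wszt/list.html";
        // Value of the <base href> tag, taken from the report.
        String baseHref = "http://www.lndangan.gov.cn/lnsdaj/";
        // Assumed relative link found in the .swf (inferred, not confirmed).
        String link = "xml/albuminfo2.asp";

        // Resolving against the page URL yields the path that returned 404:
        // http://www.lndangan.gov.cn/lnsdaj/wszt/xml/albuminfo2.asp
        System.out.println(resolve(pageUrl, link));

        // Resolving against the BASE HREF yields the path the report calls correct:
        // http://www.lndangan.gov.cn/lnsdaj/xml/albuminfo2.asp
        System.out.println(resolve(baseHref, link));
    }
}
```

Under these assumptions, a fix would presumably need the extractor to resolve links from the .swf against the enclosing page's base URI when one is declared, and against the page URL otherwise.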

              People

              • Assignee: Unassigned
              • Reporter: Gordon Mohr (gojomo)
              • Votes: 0
              • Watchers: 0
