QB64 Phoenix Edition
Wikipedia (Wikimedia Comnons) scraping for fun and... - Printable Version

+- QB64 Phoenix Edition (https://qb64phoenix.com/forum)
+-- Forum: QB64 Rising (https://qb64phoenix.com/forum/forumdisplay.php?fid=1)
+--- Forum: Code and Stuff (https://qb64phoenix.com/forum/forumdisplay.php?fid=3)
+---- Forum: Works in Progress (https://qb64phoenix.com/forum/forumdisplay.php?fid=9)
+---- Thread: Wikipedia (Wikimedia Comnons) scraping for fun and... (/showthread.php?tid=3720)



Wikipedia (Wikimedia Comnons) scraping for fun and... - TDarcos - 05-31-2025

I have interests in some files stored on the image repository for Wikipedia (and related projects), Wikimedia commons. So, saving a category listing HTML file, I wrote a program to extract the URLs of the files listed on the page. The first program, scan_for_wikimedia_links.bas, (copy attached to this message) collects all links on a page:

Code: (Select All)
'
' This program scans a saved Wikimedia gallery listing (HTML file)
' for image links

' Paul Robinson 2025-05-31

$Console:Only
Option _Explicit

Dim As String TheRecord ' Line being read
Dim As String T
Dim As String FName
Dim As String HEXSTR

Dim As Long L '        Number of lines read
Dim As Long Fi '      Number of files found
Dim As Long InBody '  Has <body been seen
Dim As Long E '        End of name
Dim As Long S '        First item search
Dim As Long S2 '      Second item
Dim As Long I

Const Quote = Chr$(34) ' " character
Const ItemFlag = "<div class=" + Quote + "thumb" + Quote ' Look for: <div class="thumb"
Const NameFlag = "<a href=" + Quote + "https://"
Const FALSE = 0
Const TRUE = Not FALSE

ChDir "L:\Assets\Road Signs"
Open "Category_SVG road signs in the United States - Wikimedia Commons.htm" For Input As #1
Open "roadsigns.lst" For Output As #2

InBody = FALSE 'Not found yet

While Not EOF(1)
    Line Input #1, TheRecord
    L = L + 1
    If Not InBody Then
        I = InStr(LCase$(TheRecord), "<body")
        If I Then ' Not found yet
            TheRecord = Mid$(TheRecord, I + 10, Len(TheRecord))
            Print "<body found at position"; I; "on line "; L
            InBody = TRUE
        Else
            _Continue
        End If
    End If
    S = InStr(TheRecord, ItemFlag)
    While S
        S2 = InStr(S, LCase$(TheRecord), NameFlag)
        If S2 Then '
            S2 = S2 + Len(NameFlag) ' Start of URL
            E = InStr(S2, TheRecord, Quote)
            Print #2, Mid$(TheRecord, S2, E - S2)
            Fi = Fi + 1
        End If
        S = InStr(E, TheRecord, ItemFlag)

    Wend
Wend

Print
Print L; "Lines"; Fi; "files."
Close
End
  

The second program, scan_for_wikimedia_svg_links.bas,  (copy attached to this message) is a slight improvement, in that it only collects links to .svg image files:
 
Code: (Select All)
'
' This program scans a saved Wikimedia categoy listing (HTML file)
' for .SVG image links  (value in FileType)

' Paul Robinson 2025-05-31

$Console:Only
Option _Explicit

Dim As String TheRecord ' Line being read
Dim As String Target
Dim As String Check


Dim As Long L '        Number of lines read
Dim As Long Fi '      Number of files found
Dim As Long InBody '  Has <body been seen
Dim As Long E '        End of name
Dim As Long S '        First item search
Dim As Long S2 '      Second item
Dim As Long I

Const Quote = Chr$(34) ' " character
Const ItemFlag = "<div class=" + Quote + "thumb" + Quote ' Look for: <div class="thumb"
Const NameFlag = "<a href=" + Quote + "https://"
Const FileType = ".svg" '  Type of image file to retrieve
Const FALSE = 0
Const TRUE = Not FALSE

ChDir "L:\Assets\Road Signs"
Open "Category_SVG road signs in the United States - Wikimedia Commons.htm" For Input As #1
Open "roadsigns.lst" For Output As #2

InBody = FALSE 'Not found yet

While Not EOF(1)
    Line Input #1, TheRecord
    L = L + 1
    If Not InBody Then
        I = InStr(LCase$(TheRecord), "<body")
        If I Then ' Not found yet
            TheRecord = Mid$(TheRecord, I + 10, Len(TheRecord))
            Print "<body found at position"; I; "on line "; L
            InBody = TRUE
        Else
            _Continue
        End If
    End If
    S = InStr(TheRecord, ItemFlag)
    While S
        S2 = InStr(S, LCase$(TheRecord), NameFlag)
        If S2 Then '
            S2 = S2 + Len(NameFlag) ' Start of URL
            E = InStr(S2, TheRecord, Quote)

            Target = Mid$(TheRecord, S2, E - S2)
            Check = Right$(LCase$(Target), Len(FileType))
            If Check = FileType Then
                Fi = Fi + 1
                Print #2, Target
            End If
        End If
        S = InStr(E, TheRecord, ItemFlag)

    Wend
Wend

Print
Print L; "Lines"; Fi; "files."
Close
End


Don't know if these programs are useful to others, but they are an effort to show Basic (and Quickbasic/QB64) is still a relevant language for writing programs to get actual work done.