Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Wikipedia (Wikimedia Comnons) scraping for fun and...
#1
I have interests in some files stored on the image repository for Wikipedia (and related projects), Wikimedia commons. So, saving a category listing HTML file, I wrote a program to extract the URLs of the files listed on the page. The first program, scan_for_wikimedia_links.bas, (copy attached to this message) collects all links on a page:

Code: (Select All)
'
' This program scans a saved Wikimedia gallery listing (HTML file)
' for image links

' Paul Robinson 2025-05-31

$Console:Only
Option _Explicit

Dim As String TheRecord ' Line being read
Dim As String T
Dim As String FName
Dim As String HEXSTR

Dim As Long L '        Number of lines read
Dim As Long Fi '      Number of files found
Dim As Long InBody '  Has <body been seen
Dim As Long E '        End of name
Dim As Long S '        First item search
Dim As Long S2 '      Second item
Dim As Long I

Const Quote = Chr$(34) ' " character
Const ItemFlag = "<div class=" + Quote + "thumb" + Quote ' Look for: <div class="thumb"
Const NameFlag = "<a href=" + Quote + "https://"
Const FALSE = 0
Const TRUE = Not FALSE

ChDir "L:\Assets\Road Signs"
Open "Category_SVG road signs in the United States - Wikimedia Commons.htm" For Input As #1
Open "roadsigns.lst" For Output As #2

InBody = FALSE 'Not found yet

While Not EOF(1)
    Line Input #1, TheRecord
    L = L + 1
    If Not InBody Then
        I = InStr(LCase$(TheRecord), "<body")
        If I Then ' Not found yet
            TheRecord = Mid$(TheRecord, I + 10, Len(TheRecord))
            Print "<body found at position"; I; "on line "; L
            InBody = TRUE
        Else
            _Continue
        End If
    End If
    S = InStr(TheRecord, ItemFlag)
    While S
        S2 = InStr(S, LCase$(TheRecord), NameFlag)
        If S2 Then '
            S2 = S2 + Len(NameFlag) ' Start of URL
            E = InStr(S2, TheRecord, Quote)
            Print #2, Mid$(TheRecord, S2, E - S2)
            Fi = Fi + 1
        End If
        S = InStr(E, TheRecord, ItemFlag)

    Wend
Wend

Print
Print L; "Lines"; Fi; "files."
Close
End
  

The second program, scan_for_wikimedia_svg_links.bas,  (copy attached to this message) is a slight improvement, in that it only collects links to .svg image files:
 
Code: (Select All)
'
' This program scans a saved Wikimedia categoy listing (HTML file)
' for .SVG image links  (value in FileType)

' Paul Robinson 2025-05-31

$Console:Only
Option _Explicit

Dim As String TheRecord ' Line being read
Dim As String Target
Dim As String Check


Dim As Long L '        Number of lines read
Dim As Long Fi '      Number of files found
Dim As Long InBody '  Has <body been seen
Dim As Long E '        End of name
Dim As Long S '        First item search
Dim As Long S2 '      Second item
Dim As Long I

Const Quote = Chr$(34) ' " character
Const ItemFlag = "<div class=" + Quote + "thumb" + Quote ' Look for: <div class="thumb"
Const NameFlag = "<a href=" + Quote + "https://"
Const FileType = ".svg" '  Type of image file to retrieve
Const FALSE = 0
Const TRUE = Not FALSE

ChDir "L:\Assets\Road Signs"
Open "Category_SVG road signs in the United States - Wikimedia Commons.htm" For Input As #1
Open "roadsigns.lst" For Output As #2

InBody = FALSE 'Not found yet

While Not EOF(1)
    Line Input #1, TheRecord
    L = L + 1
    If Not InBody Then
        I = InStr(LCase$(TheRecord), "<body")
        If I Then ' Not found yet
            TheRecord = Mid$(TheRecord, I + 10, Len(TheRecord))
            Print "<body found at position"; I; "on line "; L
            InBody = TRUE
        Else
            _Continue
        End If
    End If
    S = InStr(TheRecord, ItemFlag)
    While S
        S2 = InStr(S, LCase$(TheRecord), NameFlag)
        If S2 Then '
            S2 = S2 + Len(NameFlag) ' Start of URL
            E = InStr(S2, TheRecord, Quote)

            Target = Mid$(TheRecord, S2, E - S2)
            Check = Right$(LCase$(Target), Len(FileType))
            If Check = FileType Then
                Fi = Fi + 1
                Print #2, Target
            End If
        End If
        S = InStr(E, TheRecord, ItemFlag)

    Wend
Wend

Print
Print L; "Lines"; Fi; "files."
Close
End


Don't know if these programs are useful to others, but they are an effort to show Basic (and Quickbasic/QB64) is still a relevant language for writing programs to get actual work done.


Attached Files
.bas   scan_for_wikimedia_svg_links.bas (Size: 1.94 KB / Downloads: 5)
.bas   scan_for_wikimedia_links.bas (Size: 1.72 KB / Downloads: 7)
While 1
   Fix Bugs
   report all bugs fixed
   receive bug report
end while
Reply




Users browsing this thread: 1 Guest(s)