I have an XML file of a WordPress blog that consists of quotes:
Code:
<item>
  <title>Brothers Karamazov</title>
  <content:encoded><![CDATA["I think that if the Devil doesn't exist and, consequently, man has created him, he has created him in his own image and likeness."]]></content:encoded>
  <category domain="post_tag" nicename="dostoyevsky"><![CDATA[Dostoyevsky]]></category>
  <category domain="post_tag" nicename="humanity"><![CDATA[humanity]]></category>
  <category domain="category" nicename="quotes"><![CDATA[quotes]]></category>
  <category domain="post_tag" nicename="the-devil"><![CDATA[the Devil]]></category>
</item>
The things I'm trying to extract are title, author, content and tags. Here's my code so far:
Code:
require "rubygems"
require "nokogiri"

doc = Nokogiri::XML(File.open("/Users/charliekim/Downloads/quotesfromtheunderground.wordpress.2013-04-14.xml"))

doc.css("item").each do |item|
  title = item.at_css("title").text
  tag = item.at_xpath("category").text
  content = item.at_xpath("content:encoded").text

  # Each post will later be pushed to an array, but I'm not worried about that yet, so for now...
  puts "#{title} #{tag}"
end
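I'm struggling to get all the tags from each item. I'm getting returns of something like
Code:
Brothers Karamazov Dostoyevsky
I'm not worried about how it's formatted, as it's only a test to see that it's picking things up correctly. Does anyone know how I can go about this?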
I'd also like to treat any capitalized tag as the author, so if you know how to do that it would help too, although I haven't tried it yet.
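To make the author idea concrete, here is a rough sketch of what I have in mind. It assumes "capitalized" means the tag's first character is an uppercase letter (so "the Devil" stays a regular tag), and that the tag names are already available as an array of strings. The split_author helper is a hypothetical name made up for illustration.

```ruby
# Hypothetical helper: splits tag names into [author, remaining_tags].
# Assumption: a tag counts as capitalized when its first character is uppercase.
def split_author(tags)
  capitalized, rest = tags.partition { |tag| tag.match?(/\A[[:upper:]]/) }
  # Only the first capitalized tag is treated as the author;
  # any further capitalized tags are appended back to the regular tags.
  author = capitalized.shift
  [author, rest + capitalized]
end

author, tags = split_author(["Dostoyevsky", "humanity", "the Devil"])
puts "Author: #{author}"
puts "Tags: #{tags.join(', ')}"
```

If no tag is capitalized, author comes back as nil, which would need handling before printing or storing it.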
<hr>
EDIT: I changed the code to this:
Code:
doc.css("item").each do |item|
  title = item.at_css("title").text
  content = item.at_xpath("content:encoded").text
  tag = item.at_xpath("category").each do |category|
    category
  end
  puts "#{title}: #{tag}"
end
which returns:
Code:
Brothers Karamazov: [#<Nokogiri::XML::Attr:0x80878518 name="domain" value="post_tag">, #<Nokogiri::XML::Attr:0x80878504 name="nicename" value="dostoyevsky">]
and which seems a bit more manageable. It screws up my plans for taking the author from a capitalized tag, but, well, it's not that big of a deal. How could I pull just the second "value" (the one for nicename)?